Data Fabric: The Origination Story
Throughout my career, enterprise data management paradigms have come and gone. Let’s briefly trawl some data management history to understand how data fabric emerged.
Popular data management paradigms include the data warehouse, the data lake, the data lakehouse (and, latterly, the data icehouse), data streaming, the data mesh, the data fabric and, of course, the data swamp.
Okay, "data swamp" is somewhat satirical, but it is an unfortunate consequence of a plethora of data paradigms colliding with the pragmatic demands of the real world.
Indeed, I contributed to the swampish madness, bringing one short-lived 'Data Timehouse' paradigm to market. My then company, with an excitable new management team and a strategic consultant, wanted to impress a high-profile research analyst with a new technology category. Thus the 'Data Timehouse' was born. I had to backfill quickly by writing explainers and blogs to a) gain some internet presence and b) explain to our customers and field teams what the "new" category was all about. My fellow minions and I didn't do too badly in the limited time available, but the analyst didn't adopt the term and management reprioritized in line with the next shiny thing. The Data Timehouse was put to one side.
But I digress. Let's get back to the data management categories that have really stood the test of time, leading to the data fabric.
From Data Warehouse to Data Streaming to Data Fabric
Data warehouses have a substantial history in decision support and business intelligence applications dating back to the late 1980s. Massively parallel proprietary processing architectures made it possible to handle larger volumes of structured data - at high cost. However, unstructured and semi-structured data proved troublesome, and data warehouses, built primarily around three-phase ETL (Extract, Transform, Load) processes, found datasets featuring high variety, velocity, and volume to be beyond reach.
Then, about 15 years ago, architects envisioned the data lake: a perceived cost-effective and scalable single system to house raw data in a variety of formats for many different analytic products and repositories. The sizeable Apache Hadoop ecosystem was its focal point - beloved by advocates, but seen by others as widening the distance between the lake and practical use, which helped popularize the term data swamp.
Great for storing data, data lakes did not support transactions; they relied on upstream processes to enforce data quality, and their lack of consistency and isolation made it hard to mix appends and reads. Data lakes were, for many, unwieldy for batch analytics and impotent for streaming. However, I proudly have my 'Data is the New Bacon' T-shirt, courtesy of a certain Hadoop services provider, tucked away in my T-shirt drawer.
With data warehouse versus data lake dividing opinions like pineapple on pizza, practitioners and business teams used the cards they were dealt and quietly got on with their jobs. They used databases, file storage, and other data repositories alongside BI platforms and popular modelling and analysis packages, including the ever-present Excel and Python, the language of data science. Data science only became a relatively popular term after 2010, capturing the intersection of statistics, data engineering and the then-evolving machine learning technologies.
Step forward the open-source Apache Spark project and its unicorn corporate sugar-parent, Databricks, which coined the phrase data lakehouse: its Delta Lake table format incorporates standard warehouse functionality (transactions, efficient upserts, isolation, time-travel query) into standard data lake technology. More open and inclusive, the lakehouse facilitated co-existence of lake and warehouse, with low-cost cloud storage, object stores, and open formats enabling moderate interoperability and somewhat efficient batch (analytics) processing. Recently, the increasingly popular Apache Iceberg table format, similar to Delta Lake, has grown from zero to hero, helping introduce data icehouse nomenclatures and ecosystems.
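To make the lakehouse idea concrete, here is a minimal sketch, assuming Apache Spark with the open-source delta-spark package installed; the table path and sample rows are purely illustrative. It shows the warehouse-style features mentioned above - an ACID write, a MERGE upsert and a time-travel read - running on plain file storage.

```python
# Minimal lakehouse sketch using the open-source Delta Lake APIs (illustrative data and paths).
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a table with ACID guarantees onto plain file/object storage
spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["id", "name"]
).write.format("delta").mode("overwrite").save("/tmp/customers")

# Upsert (MERGE) new and changed rows into the same table
target = DeltaTable.forPath(spark, "/tmp/customers")
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time-travel query: read the table as it was at an earlier version
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/customers").show()
```

The point is not the specific API, but that ACID semantics, upserts and versioned reads now sit directly on top of low-cost lake storage.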
The data lakehouse ecosystem, with Databricks at its fore, has grown from strength to strength, but so too has a corresponding data streaming and analytics ecosystem led by Confluent, based on Apache Kafka and Apache Flink. This ecosystem has harnessed real-time data stores and analytics platforms like Apache Pinot, Apache Druid and RocksDB-based Rockset, the latter recently acquired by OpenAI for its fast analytics and vector search capabilities. Lakehouse and streaming analytics intersect, compete and overlap, with Confluent-supported Flink providing a "unified stream-processing and batch-processing framework," while Spark Structured Streaming connects Spark to Kafka.
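As a rough illustration of that overlap, the sketch below, assuming a local Kafka broker and the spark-sql-kafka connector on the classpath (the 'payments' topic is hypothetical), reads a Kafka topic with Spark Structured Streaming using the same DataFrame API Spark uses for batch work.

```python
# Minimal Spark Structured Streaming sketch reading from Kafka (broker and topic are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Read a Kafka topic as an unbounded streaming DataFrame
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "payments")  # hypothetical topic name
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

# The same DataFrame API used for batch now runs incrementally over the stream
query = (
    events.writeStream.format("console")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/payments")
    .start()
)
query.awaitTermination()
```

Flink approaches the same convergence from the streaming side, treating batch as a special case of streaming.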
From warehouse to data streaming, a critical enabler of business value is analytics. Data without analytics is like the proverbial tree falling in the forest with no one there to hear it. Data needs to be queried, analyzed, modelled, and accessed where it's needed. Yet in most data paradigms, analytics has been subsidiary to data management.
Data Mesh and Data Fabric
Thus enter the vendor-neutral terms data mesh and data fabric, which marry abstract data management to practical business value, with analytics as the uniting enabler, through Business Intelligence and, increasingly, Decision Intelligence.
Data fabric and data mesh are similar but different, co-existing when required. A data mesh is a philosophical concept, akin to the Agile or Lean methodologies in software development, while a data fabric is a technology pattern. According to a well-known analyst firm, "a data fabric is an emerging data management and data integration design concept. Its goal is to support data access across the business through flexible, reusable, augmented and sometimes automated data integration."
Technology, culture and business needs rarely make choices straightforward. A data mesh allows for autonomy, but most data products will want, and use, a shared platform, as per the data fabric. Indeed the latter, when successful, enables many data assets to be discovered and used beyond managed data products, offering an exciting backbone for knowledge discovery.
Here's the problem. A data fabric brings data together physically at the point of processing or consumption and transforms it into a common shape. However, it neither truly unifies it - data can still be duplicated, with poor quality attributes - nor creates and encapsulates graph data assets as knowledge, nor fully infuses context into analytics and models.
To create higher quality data products, a data fabric should incorporate entity resolution and impactful master data quality capabilities, and facilitate knowledge creation and analysis.
Entity Resolution, Knowledge Graphs and the Financial Services Data Fabric
Entity resolution entails working out whether multiple records reference the same real-world thing, such as a person, organization, address, phone number, bank account, or device. It takes multiple disparate data points - ideally from both external and internal sources - and resolves them into one distinct, unique entity.
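As a flavour of what this looks like in practice, below is a minimal, illustrative Python sketch using only the standard library; the records, threshold and matching rules are hypothetical, and production entity resolution relies on far richer features, reference data and models. It matches records on shared email or fuzzy name similarity, then transitively merges matches into resolved entities.

```python
# Toy entity resolution: match records pairwise, then merge matches into entities (illustrative only).
from difflib import SequenceMatcher
from itertools import combinations

records = [  # hypothetical records from two source systems
    {"id": "crm-1", "name": "Jonathan A. Smith", "email": "jon.smith@example.com"},
    {"id": "kyc-7", "name": "Jon Smith",         "email": "jon.smith@example.com"},
    {"id": "crm-2", "name": "J. Smithe",         "email": "j.smithe@example.org"},
]

def normalise(name: str) -> str:
    # Lowercase and strip punctuation so cosmetic differences do not block a match
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

def is_match(a: dict, b: dict) -> bool:
    # Same email, or sufficiently similar normalised names, counts as the same entity
    if a["email"] == b["email"]:
        return True
    return SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio() > 0.85

# Union-find to transitively merge matching records into resolved entities
parent = {r["id"]: r["id"] for r in records}
def find(x: str) -> str:
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in combinations(records, 2):
    if is_match(a, b):
        parent[find(a["id"])] = find(b["id"])

entities: dict[str, list[str]] = {}
for r in records:
    entities.setdefault(find(r["id"]), []).append(r["id"])
print(entities)  # e.g. crm-1 and kyc-7 resolve to one entity; crm-2 stays separate
```

At enterprise scale the pairwise comparison gives way to blocking, probabilistic and graph-based techniques, but the principle of resolving many records into one entity is the same.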
Every decision your organization makes relies on accurate and complete data. And while we have access to more data than ever before, connecting today’s infinite data points and turning them into actionable, valuable insights presents a considerable challenge.
When decisions involve people, places and things, entity accuracy matters above all else.
Data fabrics do not come with entity resolution - and therefore assured entity accuracy - out of the box. They unify, but they don't de-duplicate and optimize. In financial services, and in particular in large Tier 1 organizations where technology stacks have evolved into integrated data fabrics, the need to reconcile and manage entity quality across federated business units is often a major gap.
Through entity resolution, data quality management can facilitate both on-demand decisioning (KYC decisions or trade AML monitoring, for example) and batch analytics (AML investigations, for example). Most processes - risk and AML being two - require both batch and dynamic approaches.
Then the ability to unify datasets as a knowledge graph makes it possible to contain, harness and discover knowledge across your fabric. It provides a flexible, visual representation of your data, organized around understandable datasets and the cross-references between them, integrating vast amounts of disparate data for visualization and analysis. For example, graphs centered on knowledge and networks can scan and assess relationships to facilitate discovery and to inform risk decisions or financial crime investigations.
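As a toy illustration, the sketch below, assuming the networkx package and entirely made-up entities, accounts and devices, builds a tiny graph of resolved entities and their shared assets and then asks whether two parties under investigation are indirectly connected.

```python
# Tiny knowledge-graph sketch: resolved entities linked by shared accounts and devices (made-up data).
import networkx as nx

g = nx.Graph()
g.add_edge("Entity:Jon Smith", "Account:A1", relation="owns")
g.add_edge("Entity:Acme Ltd",  "Account:A1", relation="signatory")
g.add_edge("Entity:Acme Ltd",  "Device:D1",  relation="uses")
g.add_edge("Entity:Jane Doe",  "Device:D1",  relation="uses")

# Discovery: is there an indirect link between two parties of interest?
path = nx.shortest_path(g, "Entity:Jon Smith", "Entity:Jane Doe")
print(" -> ".join(path))
# Entity:Jon Smith -> Account:A1 -> Entity:Acme Ltd -> Device:D1 -> Entity:Jane Doe
```

Production knowledge graphs add typed ontologies, provenance and graph analytics on top, but the discovery pattern - following cross-references between resolved entities - is the same.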
In this way, a traditional data fabric becomes a contextual data fabric: a true, high-quality, knowledge-illuminating fabric that can drive multiple use cases - fraud, AML, KYC, customer intelligence, and credit and counterparty risk. In addition, it helps infuse higher quality data management by resolving entities to appropriate levels of accuracy in golden-source data stores. Such a unified approach makes a data fabric thrive.
In conclusion, when a data fabric is enhanced with a contextual fabric, one which de-duplicates data for utmost accuracy and drives knowledge discovery to infuse enterprise models and analytics, financial services use cases are better able to derive business value.
Acknowledgements
My colleague Martin Maisey kindly commented on and advised on much of the source material for this article.