How a Data Fabric Delivers Better Financial Services Outcomes

Data Fabric: The Origination Story

Throughout my career, enterprise data management paradigms have come and gone. Let’s briefly trawl some data management history to understand how data fabric emerged.

Popular data management paradigms include:

  • Data Warehouse
  • Data Lake
  • ETL, i.e. Extract, Transform, Load
  • ELT, i.e. Extract, Load, Transform
  • Data Lakehouse
  • Data Streaming
  • Data Mesh
  • Data Fabric
  • Data Swamp

Okay, “data swamp” is somewhat satirical, but it is an unfortunate consequence of the intersection of the plethora of data paradigms and the pragmatic demands of the real world.

Indeed, I contributed to the swampish madness, bringing one short-lived 'Data Timehouse' paradigm to market. My then company, with an excitable new management team and a strategic consultant, wanted to impress a high-profile research analyst with a new technology category. Thus the 'Data Timehouse' was born. I had to backfill quickly by writing explainers and blogs to a) establish some internet presence and b) explain to our customers and field teams what the "new" category was all about. My fellow minions and I didn't do too badly in the limited time available, but the analyst didn't adopt the term, management reprioritized in line with the next shiny thing, and the Data Timehouse was put to one side.

But I digress. Let's get back to the data management categories that really stood the test of time, leading to the data fabric.

From Data Warehouse to Data Streaming to Data Fabric

Data warehouses have a substantial history in decision support and business intelligence applications dating back to the late 1980s. Massively parallel, proprietary processing architectures facilitated the handling of ever-larger structured data volumes - at high cost. However, unstructured and semi-structured data proved troublesome, and data warehouses, primarily centered on three-phase ETL (Extract, Transform, Load) processes, found data-sets featuring high variety, velocity, and volume to be beyond reach.

Then, about 15 years ago, architects envisioned the data lake: a supposedly cost-effective and scalable single system to house raw data in a variety of formats and serve many different analytic products. The sizeable Apache Hadoop ecosystem was its focal point; beloved by advocates, but seen by others as widening the gap between the lake and practical use, it helped popularize the term data swamp.

The Hadoop Ecosystem

Great for storing data, data lakes did not support transactions; they relied on upstream processes to enforce data quality, and their lack of consistency and isolation made it hard to mix appends and reads. Data lakes were, for many, unwieldy for batch analytics and impotent for streaming. However, I proudly have my 'Data is the New Bacon' T-shirt, courtesy of a certain Hadoop services provider, tucked away in my T-shirt drawer.

With data warehouse versus data lake dividing opinions like pineapple on pizza, practitioners and business teams played the cards they were dealt and quietly got on with their jobs. They used databases, file storage, and other data repositories alongside BI platforms and popular modelling and analysis packages, including the ever-present Excel and Python, the language of data science. Data science itself only became a popular term after 2010, capturing the intersection of statistics, data engineering and the then-evolving machine learning technologies.

Step forward the open-source Apache Spark project and its unicorn corporate sugar-parent, Databricks, which coined the phrase data lakehouse: its Delta Lake table format incorporates standard warehouse functionality (transactions, efficient upserts, isolation, time-travel query) into standard data lake technology. More open and inclusive, the lakehouse facilitated the co-existence of lake and warehouse, using low-cost cloud storage, object stores, and open formats to enable moderate interoperability and somewhat efficient batch (analytics) processing. More recently, the increasingly popular Apache Iceberg table format, similar in spirit to Delta Lake, has grown from zero to hero, helping introduce data icehouse nomenclatures and ecosystems.
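
To make that concrete, here is a minimal, illustrative PySpark sketch of those warehouse-style behaviours - an upsert (MERGE) and a time-travel query - applied to files on ordinary storage. It assumes the pyspark and delta-spark packages are installed; the path, columns and values are invented for illustration.

```python
# Minimal lakehouse sketch: ACID upserts and time travel over files.
# Assumes pyspark and delta-spark are installed; path/columns are illustrative.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/accounts_delta"  # illustrative location; could be object storage

# Initial load: a small table written in the Delta format (a transaction log over Parquet).
spark.createDataFrame(
    [(1, "Alice", 100.0), (2, "Bob", 250.0)], ["id", "name", "balance"]
).write.format("delta").mode("overwrite").save(path)

# Upsert (MERGE): the warehouse-style operation plain data lakes lacked.
updates = spark.createDataFrame(
    [(2, "Bob", 300.0), (3, "Cara", 50.0)], ["id", "name", "balance"]
)
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: query the table as it looked before the merge.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```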

The data lakehouse ecosystem, with Databricks at its fore, has gone from strength to strength, but so too has a corresponding data streaming and analytics ecosystem led by Confluent and based on Apache Kafka and Apache Flink. That ecosystem has harnessed real-time data stores and analytics platforms like Apache Pinot, Apache Druid and the RocksDB-based Rockset, the latter recently acquired by OpenAI for its fast analytics and vector search capabilities. Lakehouse and streaming analytics intersect, compete and overlap, with Confluent-supported Flink providing a “unified stream-processing and batch-processing framework,” while Spark Structured Streaming connects Spark to Kafka.
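
As a sketch of that overlap, the snippet below uses Spark Structured Streaming to consume a Kafka topic and append the events to a Delta table, where the same data can later be queried in batch. The broker address, topic name and paths are assumptions for illustration, and it presumes the spark-sql-kafka connector and delta-spark packages are available.

```python
# Illustrative stream-to-lakehouse pipeline: Kafka topic -> Delta table.
# Broker, topic and paths are assumed values; connector packages must be on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("stream-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read a stream of payment events from Kafka and decode key/value as strings.
payments = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "payments")                       # assumed topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp"))
)

# Continuously append the decoded events to a Delta table on the lake.
query = (
    payments.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/payments_chk")  # illustrative path
    .outputMode("append")
    .start("/tmp/payments_delta")
)
query.awaitTermination()
```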

From the warehouse to data streaming, the critical enabler of business value is analytics, the oil that fuels it. Data without analytics is like the proverbial tree falling in the forest with no one there to hear it. Data needs to be queried, analyzed, modelled, and accessed where it's needed. Yet in most data paradigms, analytics has been subsidiary to data management.

Data Mesh and Data Fabric

Thus enter the vendor-neutral terms data mesh and data fabric, which marry abstract data management to practical business value, with analytics as the uniting enabler, through Business Intelligence and, increasingly, Decision Intelligence.

Data fabric and data mesh are similar but different, and they can co-exist when required. A data mesh is a philosophical concept akin to the Agile or Lean software development methodologies, whereas a data fabric is a technology pattern. According to a well-known analyst firm, “a data fabric is an emerging data management and data integration design concept. Its goal is to support data access across the business through flexible, reusable, augmented and sometimes automated data integration.”

Data Fabric versus Data Mesh

Technology, culture and business needs rarely make choices straightforward. A data mesh allows for autonomy, but most data products will want, and use, a shared platform as per the data fabric. Indeed the latter, when successful, enables many data assets to be discovered and used beyond managed data products, offering an exciting backbone for knowledge discovery.

Here's the problem. A data fabric brings data together physically at the point of processing or consumption and transforms it into a common shape. However, it neither truly unifies that data - it can still be duplicated and carry poor-quality attributes - nor creates and encapsulates graph data assets as knowledge, nor fully infuses context into analytics and models.

To create higher quality data products, a data fabric should incorporate entity resolution and impactful master data quality capabilities, and facilitate knowledge creation and analysis.

Entity Resolution, Knowledge Graphs and the Financial Services Data Fabric

Entity resolution entails working out whether multiple records reference the same real-world thing, such as a person, organization, address, phone number, bank account, or device. It takes multiple disparate data points - ideally from both external and internal sources - and resolves them into one distinct, unique entity.
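
As a toy illustration of the concept, the Python sketch below groups a handful of invented customer records into entities using a shared phone number or a simple name-similarity score. Real entity resolution engines use far richer features (addresses, devices, accounts) and trained matching models; the records, rules and threshold here are assumptions purely for illustration.

```python
# Toy entity resolution: cluster records that likely refer to the same person.
# Records, matching rules and threshold are illustrative only.
from difflib import SequenceMatcher

records = [
    {"id": "crm-1",  "name": "Jonathan Smith", "phone": "+44 7700 900123"},
    {"id": "kyc-7",  "name": "Jon Smith",      "phone": "+44 7700 900123"},
    {"id": "txn-42", "name": "J. Smyth",       "phone": "+44 7700 900999"},
]

def normalize(name):
    # Lower-case and strip punctuation so 'J. Smyth' and 'j smyth' compare equally.
    return "".join(c for c in name.lower() if c.isalnum() or c == " ").strip()

def is_match(a, b, threshold=0.8):
    # Match on a shared identifier, or on sufficiently similar names.
    if a["phone"] == b["phone"]:
        return True
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

# Greedy clustering: each record joins the first resolved entity it matches.
entities = []  # list of clusters; each cluster is a list of source records
for rec in records:
    for cluster in entities:
        if any(is_match(rec, member) for member in cluster):
            cluster.append(rec)
            break
    else:
        entities.append([rec])

for i, cluster in enumerate(entities, 1):
    print(f"entity {i}: {[r['id'] for r in cluster]}")
```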

Resolving Entities

Every decision your organization makes relies on accurate and complete data. And while we have access to more data than ever before, connecting today’s infinite data points and turning them into actionable, valuable insights presents a considerable challenge. 

When decisions involve people, places and things, entity accuracy matters above all else.

Data fabrics do not come with entity resolution - and therefore assured entity accuracy - out of the box. They unify, but they don't de-duplicate and optimize. In financial services, and in particular in large tier-1 organizations where technology stacks have evolved into integrated data fabrics, the need to reconcile and manage entity quality across federated business units is often a major gap.

Through entity resolution, data quality management can facilitate both on-demand decisioning (KYC decisions or trade AML monitoring, for example) and batch analytics (AML investigations, for example). Most processes - risk and AML being two - require both batch and dynamic approaches.

The ability to unify data-sets as a knowledge graph then makes it possible to contain, harness and discover knowledge across your fabric. A knowledge graph provides a flexible, visual representation of your data, organized around understandable data sets and the cross-references between them, integrating vast amounts of disparate data for visualization and analysis. For example, graphs centered on knowledge and networks can scan and assess relationships to facilitate discovery and to inform risk decisions or financial crime investigations.
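
As a rough sketch of that pattern, the snippet below uses the networkx library to build a tiny graph over already-resolved entities and ask a typical financial-crime question: how is an applicant connected to a sanctioned party? The entities, relationships and attributes are invented for illustration.

```python
# Toy knowledge graph over resolved entities; all nodes and edges are illustrative.
import networkx as nx

g = nx.Graph()

# Nodes are resolved entities (people, companies, accounts), not raw records.
g.add_node("person:jon_smith", kind="person")
g.add_node("company:acme_ltd", kind="company", sanctioned=False)
g.add_node("company:offshore_co", kind="company", sanctioned=True)
g.add_node("account:acct_001", kind="account")

# Edges capture the cross-references between data sets.
g.add_edge("person:jon_smith", "company:acme_ltd", rel="director_of")
g.add_edge("company:acme_ltd", "account:acct_001", rel="holds")
g.add_edge("account:acct_001", "company:offshore_co", rel="paid")

# Contextual question: the shortest chain linking an applicant to any sanctioned entity.
sanctioned = [n for n, attrs in g.nodes(data=True) if attrs.get("sanctioned")]
for target in sanctioned:
    if nx.has_path(g, "person:jon_smith", target):
        print(" -> ".join(nx.shortest_path(g, "person:jon_smith", target)))
```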

How Accurate Entities Drive Knowledge

In this way, a traditional data fabric becomes a contextual data fabric - a true, high-quality, knowledge-illuminating fabric that can drive multiple use cases: fraud, AML, KYC, customer intelligence, and credit and counterparty risk. In addition, it helps infuse higher-quality data management by resolving entities to appropriate levels of accuracy in golden-source data stores. Such a unified approach makes a data fabric thrive because:

  • Analysts can analyze accurate data-sets
  • Data scientists can prototype, build and deploy models incorporating contextual knowledge
  • Data engineers can process, transform and distribute data with confidence
  • SMEs and managers can incorporate context into their decision-making, and infuse automated processes with micro decision-making
  • Data technology owners can facilitate metadata management alongside entity-quality data management

In conclusion, when a data fabric is enhanced with a contextual fabric, one which de-duplicates data for utmost accuracy and drives knowledge discovery to infuse enterprise models and analytics, financial services use cases are better able to derive business value.

Acknowledgements

My colleague Martin Maisey kindly commented on and advised on much of the source material for this article. 
