Data Quality in Machine Learning.

07 September 2020 2 comments

Yousaf Hafeez

We regularly see and hear phrases like “data is the lifeblood of an organisation” or “the world’s most valuable resource is no longer oil, but data”. There is no denying that data is an incredibly valuable resource. But a theme that is overlooked in many articles, or only mentioned in passing, is the importance of data quality.

Technology by itself is not a panacea. You can have any technology you like, and you can have as much data as you like, but if you don’t have high-quality data you are taking an immense risk.

This short paper starts by looking at different types of data, quantitative and qualitative, and then looks at the challenges of using this data in Machine Learning applications.

Quantitative vs Qualitative Data

Quantitative data and the results stemming from it are applauded by many as being “scientific” and more “valuable” than non-quantitative data. However, quantitative data is not without faults and limitations. Firstly, quantitative data often produces a binary result, for example a “yes” or “no” answer. This may then be used to make decisions without understanding the true meaning of that answer. This approach can result in decisions that do not lead to the best outcome, and even in opportunities being missed. Secondly, there have been many papers written expounding the benefits of quantitative data, and it is reasonable to assume many more similar papers will be written in the future. Sometimes we fall into the trap of believing if something is said enough times it must be true or at least have an element of truth. Thirdly, it is often assumed that a strong correlation is synonymous with absolute certainty. We sometimes say we have found a correlation with 95% certainty and we focus on the 95%. We forget that, at this threshold, roughly one in twenty tests of completely unrelated quantities would still appear to show a correlation.
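The point about 95% certainty can be illustrated with a small simulation. The sketch below is not from the original analysis; it assumes pairs of 30 independent random values and uses 0.361 as the approximate critical value of the correlation coefficient at the 5% level for that sample size (taken from standard statistical tables). The function names and parameters are hypothetical. It counts how often two unrelated series would nonetheless be declared “correlated”.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def false_positive_rate(trials=2000, n=30, critical_r=0.361, seed=0):
    """Share of unrelated random pairs whose |r| exceeds the 5% threshold.

    critical_r=0.361 is an assumed table value for n=30 (two-sided, 5%).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The two series are generated independently, so no real
        # relationship exists between them.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > critical_r:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"Share of unrelated pairs flagged as correlated: {rate:.3f}")
```

The printed share should land in the region of 5%, which is the risk that is easy to forget when the focus is on the 95%.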

Qualitative data also has its faults: the bias of the researcher can affect it, its results can be difficult or impossible to replicate, and the cost and time needed to generate it can be considerable.

Whether you are using quantitative or qualitative data, the quality of the data is key. No matter what technology you use to slice and dice the data, rubbish data generates rubbish results.
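As a rough illustration of what checking for “rubbish data” can look like in practice, the sketch below counts three common problems before any model is involved: missing values, exact duplicate records and values outside a plausible range. Everything in it is an assumption made for the example, including the function name, the field names, the sample records and the age range of 0 to 120. Real checks would depend on the data set in question.

```python
def quality_report(records, required_fields, valid_ranges):
    """Count missing values, exact duplicate rows and out-of-range values."""
    missing = {field: 0 for field in required_fields}
    out_of_range = {field: 0 for field in valid_ranges}
    seen = set()
    duplicates = 0
    for record in records:
        # Two records are treated as duplicates only if every field matches.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        for field in required_fields:
            if record.get(field) in (None, ""):
                missing[field] += 1
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            # Missing values are counted above, not as range errors.
            if value not in (None, "") and not (low <= value <= high):
                out_of_range[field] += 1
    return {
        "rows": len(records),
        "duplicates": duplicates,
        "missing": missing,
        "out_of_range": out_of_range,
    }


# Hypothetical sample data for illustration only.
records = [
    {"customer_id": 1, "age": 34, "country": "UK"},
    {"customer_id": 2, "age": None, "country": "UK"},
    {"customer_id": 3, "age": 212, "country": ""},
    {"customer_id": 1, "age": 34, "country": "UK"},
]
report = quality_report(
    records,
    required_fields=["customer_id", "age", "country"],
    valid_ranges={"age": (0, 120)},
)
print(report)
```

Even on four records the report flags one duplicate, two missing values and one implausible age, which gives a sense of how quickly problems accumulate in larger data sets.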

There are many articles extolling the ability of Machine Learning or AI to improve decision making using qualitative or unstructured data alongside quantitative data.

It would make the lives of so many people easier if data came in nice, easy-to-use structured packages. An article by Forbes estimated that less than 20% of data is structured. The fact that so much data is unstructured makes life a little more challenging. Machine Learning models such as BERT from Google provide an excellent tool for extracting meaning from many unstructured data sets.

Data in Machine Learning

Machine Learning is dependent on data: quantitative, qualitative, structured and unstructured. More importantly, Machine Learning is dependent on good quality data. The importance of data is illustrated by looking at a high-level Machine Learning process.

Step 1. Data collection

Step 2. Data annotation

Step 3. Ingest data into the model

Step 4. Train the model

Step 5. Evaluate results

Step 6. Additional classification of data / fine-tuning

Step 7. Seek additional data to enhance the model

Step 2 in this process provides an example of the importance of data. Annotation of data is very expensive and time-consuming, but also critical to the success of the Machine Learning application. One challenge that is overlooked at this stage is the variation in understanding of text by those carrying out the classification. For example, if one person’s background allows them to use elaborated code and another uses restricted code, it is likely to result in different interpretations of the same text. You can try to overcome some of these challenges by having guidelines, having data reviewed several times and then reaching a consensus. But this adds to the cost and time needed to build a production Machine Learning application.

Another key challenge is selection bias throughout this process, or even reverse engineering data to generate desired results. The issue of bias is not new, and many approaches have been implemented to reduce selection bias, including taking care in selecting the learning model and taking care in selecting the training data. The success of these attempts to reduce or eliminate bias is questionable in many instances.

In conclusion, Machine Learning offers huge potential. But before even considering which Machine Learning technology to use, attention must be paid to the quality of the data and how you can source it. It is also worth looking at the return on investment (ROI), as good quality data is not always cheap.
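One way to put a number on the variation between annotators, and to reach a consensus label, is sketched below. This is an illustration rather than a method described in this paper: it uses Cohen’s kappa, a standard measure of agreement between two annotators that corrects for agreement expected by chance, together with a simple majority vote. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


def majority_label(votes):
    """Return the most common label, or None when the top labels are tied."""
    if not votes:
        raise ValueError("At least one vote is required")
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return None  # No consensus; the item needs another review.
    return top


# Hypothetical labels from two annotators on the same ten texts.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
print(majority_label(["pos", "neg", "pos"]))
print(majority_label(["pos", "neg"]))
```

In this made-up example the annotators agree on eight of the ten texts, yet the kappa value is noticeably lower than 0.8 because part of that agreement would be expected by chance. Items where the vote is tied are the ones that would go back for a further review, which is where the extra cost and time come from.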

External

This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.

Comments: (2)

Ketharaman Swaminathan Founder and CEO at GTM360 Marketing Solutions

08 September 2020

Re. "Sometimes we fall into the trap of believing if something is said enough times it must be true or at least have an element of truth.", sadly it's not a trap but a law:(

Called "Clear's Law of Recurrence", it states "The number of people who believe an idea is directly proportional to the number of times it has been repeated - even if the idea is false."

Going by this law, the more people take this post at face value, the lower the chances that it will be caught out.

Tejasvi Addagada Enterprise Data Head at Fortune 500 financial service provider

19 September 2020

The first Data Quality challenge is most often the acquisition of the right data for Machine Learning enterprise use cases.

Even though the business objective is clear, data scientists may not be able to find the right data to use as inputs to the ML service/algorithm to achieve the desired outcomes.

As any data scientist will tell you, developing the model is less complex than understanding and approaching the problem/use-case the right way. Identifying appropriate data can be a significant challenge. You must have the “right data.”

More broadly speaking, Coverage can be categorized under the Completeness Dimension of Data Quality, and is called the Record Population concept within the Conformed Dimensions standard. This should be one of the first checks to be performed before proceeding to other Data Quality checks.

Yousaf Hafeez

Member since

15 Jan 2019

Location

London
