Ensuring high quality for structured data (with ML observability) and unstructured data (with Data Operations)
This is a joint piece written in partnership between Superb AI and Arize AI
As the practice of machine learning (ML) becomes more similar to that of other software engineering disciplines, it requires processes and tooling to ensure seamless workflows and reliable outputs. In particular, data quality has been a consistent focus, as poor data quality management leads to technical, architectural, and organizational bottlenecks.
Since ML deals with both code and data, modern MLOps solutions need to take care of both. That means incorporating tasks such as version control of the code used for data transformations and model development, automated testing of ingested data and model code, deployment of the model to a stable and scalable production environment, and monitoring of model performance and prediction outputs. As the Great Expectations team points out, data testing and data documentation fit neatly into different stages of the MLOps pipeline: at the data ingestion stage, during the model development stage, and after the model deployment stage.
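As a concrete illustration of ingestion-stage data testing, here is a minimal sketch using plain pandas. The column names, ranges, and the `validate_ingested_batch` helper are all hypothetical, chosen only to show the pattern of completeness and validity checks; a real pipeline would typically use a dedicated framework such as Great Expectations.

```python
import pandas as pd

def validate_ingested_batch(df: pd.DataFrame) -> list:
    """Return a list of data-test failures for a hypothetical loan-features batch."""
    failures = []
    # Completeness: required columns must be present and non-null
    for col in ("age", "income", "loan_amount"):
        if col not in df.columns:
            failures.append(f"missing column: {col}")
            continue
        if df[col].isna().any():
            failures.append(f"nulls in column: {col}")
    # Validity: values must fall in an expected range
    if "age" in df.columns and not df["age"].between(0, 120).all():
        failures.append("age out of range [0, 120]")
    return failures

batch = pd.DataFrame({
    "age": [34, 151],                    # 151 violates the range check
    "income": [52_000, None],            # null violates the completeness check
    "loan_amount": [10_000, 5_000],
})
print(validate_ingested_batch(batch))
```

Running such checks before data reaches model training turns silent data corruption into an explicit, actionable failure.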
It’s critical to recognize that data quality is a journey, not a destination. That is, data quality requires continued investment over time. For ML teams that are still in the nascent stage of dealing with data quality issues, the concept of a data quality flywheel can help kickstart their data quality journey by building trust in the data and improving the data quality in a self-reinforcing cycle.
In this blog post, we dive deep into the key dimensions of data quality and then explore the fundamental capabilities of robust data quality solutions both for structured and unstructured data. These ideas derive mainly from our experience collaborating with data and ML teams via our work at Arize AI and Superb AI.
Dimensions of data quality are the categories along which data quality can be assessed. Each dimension can then be instantiated as specific, measurable metrics. An excellent article from the Metaplane team drills into ten data quality dimensions, broken down into intrinsic and extrinsic ones.
The intrinsic dimensions are independent of use cases, easier to implement, and closer to the causes.
The extrinsic dimensions depend on the use case, require cross-functional effort to implement, and are closer to the symptoms.
Now that we have the list of intrinsic and extrinsic data quality dimensions, how can we build solutions that provide tangible metrics for each of these dimensions? Let’s review them both in the structured and unstructured context.
While machine learning models rely on high-quality data, maintaining data quality in production is challenging. Upstream data changes and the accelerating pace of data proliferation as organizations scale can drastically degrade a model's overall performance. Data quality checks in an ML observability platform identify hard failures within structured data pipelines, across both training and production, that can hurt a model's end performance.
In short, data quality monitors enable teams to quickly catch when features, predictions, or actuals don't conform as expected. Teams can monitor for data quality to verify that feature data is not missing, catch when data deviates from a specified range or surpasses an accepted threshold, and detect extreme model inputs or outputs.
More specifically, data quality monitoring can be broken down by data type into two main pipelines: one for categorical data and one for numerical data.
Categorical data monitors help ML teams easily identify:
Numerical data monitors help ML teams easily identify:
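A minimal sketch of both monitor types follows, using pandas. The feature names, the expected value set, and the range bounds are all illustrative assumptions; the point is the shape of the checks: unseen categories, cardinality, and missingness for categorical features, and out-of-range and missing rates for numerical ones.

```python
import pandas as pd

def categorical_monitor(series: pd.Series, expected_values: set) -> dict:
    """Flag unseen categories and track cardinality for a categorical feature."""
    seen = set(series.dropna().unique())
    return {
        "unseen_values": sorted(seen - expected_values),  # new categories in production
        "cardinality": len(seen),
        "pct_missing": float(series.isna().mean()),
    }

def numerical_monitor(series: pd.Series, lo: float, hi: float) -> dict:
    """Flag out-of-range and missing values for a numerical feature."""
    return {
        "pct_out_of_range": float((~series.dropna().between(lo, hi)).mean()),
        "pct_missing": float(series.isna().mean()),
    }

prod = pd.DataFrame({
    "channel": ["web", "app", "kiosk", None],      # "kiosk" was never seen in training
    "latency_ms": [12.0, 30.0, 900.0, 25.0],       # 900 ms is outside the accepted range
})
print(categorical_monitor(prod["channel"], {"web", "app"}))
print(numerical_monitor(prod["latency_ms"], 0, 500))
```

Each dictionary can then be compared against alerting thresholds, so a monitor fires only when a rate crosses an agreed tolerance rather than on every single bad row.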
To keep an eye on model performance, ML teams can leverage an ML observability platform’s inference store, which retains a history of data from training sets and from historical production traffic. From that history, teams can set intelligent baselines and thresholds that balance alert frequency against sensitivity, putting power back in ML teams’ hands to ensure high-performing models in production.
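The baseline-and-threshold idea can be sketched in a few lines of NumPy. This is a simplified stand-in for what an inference store enables, with an assumed mean ± 3σ baseline and a 1% violation tolerance; real platforms use richer baselines and drift statistics.

```python
import numpy as np

def baseline_from_training(train_values: np.ndarray) -> dict:
    """Derive a simple baseline (mean +/- 3 std) from a training set."""
    mu, sigma = float(train_values.mean()), float(train_values.std())
    return {"lo": mu - 3 * sigma, "hi": mu + 3 * sigma}

def should_alert(prod_values: np.ndarray, baseline: dict,
                 max_violation_rate: float = 0.01) -> bool:
    """Alert only when the share of out-of-baseline values exceeds a tolerance,
    balancing alert frequency against sensitivity."""
    out = (prod_values < baseline["lo"]) | (prod_values > baseline["hi"])
    return bool(out.mean() > max_violation_rate)

rng = np.random.default_rng(0)
train = rng.normal(100, 10, size=10_000)          # historical training feature values
baseline = baseline_from_training(train)
healthy = rng.normal(100, 10, size=1_000)         # in-distribution production traffic
shifted = rng.normal(160, 10, size=1_000)         # shifted production traffic
print(should_alert(healthy, baseline))
print(should_alert(shifted, baseline))
```

Because the tolerance is a rate rather than a per-row rule, occasional legitimate outliers do not page the team, while a genuine distribution shift does.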
In contrast to structured data, unstructured data such as text, speech, images, and videos do not come with a data model that enables a computer to use them directly. The rise of deep learning has helped us interpret the knowledge encoded in unstructured data using computer vision, language processing, and speech recognition methods, so unstructured data is used increasingly in decision-making processes. Yet even as decisions come to depend on unstructured data, quality assessment methods for it are still lacking.
All the dimensions presented above are relevant to both structured and unstructured data. The three we consider most relevant when dealing with unstructured data are Accuracy, Relevance, and Usability. When processing unstructured data, the key elements are the input data, the real world, the data consumers, the task, and the knowledge extracted. Based on these elements, the quality of a dataset D can be determined by comparing it to three classes of ideal datasets: the data representing the real world, D1 (Accuracy); the data optimized for the task, D2 (Relevance); and the data expected by the current data consumers, D3 (Usability).
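A toy sketch can make the three comparisons concrete. Everything here is an illustrative assumption: the sample schema, the gold labels standing in for D1, the task classes standing in for D2, and the required fields standing in for D3.

```python
def dataset_quality(samples, gold_labels, task_classes, required_fields):
    """Toy scores for comparing a dataset D against three ideals:
    D1 (Accuracy), D2 (Relevance), and D3 (Usability)."""
    n = len(samples)
    # Accuracy: agreement with a gold-labeled reference (stand-in for D1)
    accuracy = sum(s["label"] == g for s, g in zip(samples, gold_labels)) / n
    # Relevance: share of samples whose label the task actually uses (stand-in for D2)
    relevance = sum(s["label"] in task_classes for s in samples) / n
    # Usability: share of samples carrying every field consumers expect (stand-in for D3)
    usability = sum(all(f in s for f in required_fields) for s in samples) / n
    return {"accuracy": accuracy, "relevance": relevance, "usability": usability}

samples = [
    {"label": "cat", "image_uri": "a.jpg", "bbox": [0, 0, 10, 10]},
    {"label": "dog", "image_uri": "b.jpg"},                          # missing bbox
    {"label": "truck", "image_uri": "c.jpg", "bbox": [1, 1, 5, 5]},  # off-task class
]
gold = ["cat", "cat", "truck"]
print(dataset_quality(samples, gold, {"cat", "dog"}, {"label", "image_uri", "bbox"}))
```

In practice each score would be computed at scale inside a labeling and data-management workflow, but the decomposition into three reference datasets stays the same.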
Modern solutions to verify the quality of unstructured data should have these broad capabilities:
All of these can be baked into a Data Operations platform, which helps bring rigor, reuse, and automation to the development of ML pipelines and applications. DataOps can transform the data quality management process from ad-hoc to automated, from slow to agile, and from painful to painless.
In this post, we discussed the key dimensions of data quality and explored the broad capabilities of ML observability and Data Operations platforms as viable data quality solutions. Investing in data quality in your organization can pay dividends for years into the future with your ML initiatives.
Curious about how to ensure high quality to unlock the potential of your company’s structured and unstructured data? Request a trial of Arize AI’s ML Observability platform, or schedule a demo to examine Superb AI’s DataOps offerings today!