With the rise of the data-centric AI movement (which computer vision is part of), the spotlight has been shifting from algorithm design to dataset development. For many modern neural network architectures, data is the largest contributor to model performance. Adding layers to the network, adding skip connections, or tuning certain hyperparameters has a limited effect on model performance. Many practitioners spend countless hours creating and curating labeled data to train state-of-the-art architectures, at the expense of algorithm development. Additionally, dataset creation is one of the most costly and demanding components of the entire computation pipeline. Therefore, good practices for data quality are critical to ensuring successful outcomes.
In short, the growing importance of analytics and ML applications demands modern data quality solutions:
1. More and more functional teams rely on data. Additionally, your customers need to trust data products and services to facilitate the adoption of analytics and ML.
2. Data quality issues can impact critical services and products. Therefore, companies need to systematically, holistically, and proactively tackle data quality.
3. Potential sources of error have increased. The volume, variety, velocity, and veracity of data continue to increase alongside the number and types of data sources and providers.
4. Data architectures have become more complex. The rise of ML and real-time analytics has led to complex data platforms with more risks and more challenging data quality problems.
Labeled datasets are among the most desired assets computer vision practitioners seek. Even though computer vision scientists and engineers are continuously researching novel methods to reduce the dependency of models on labeled data (e.g., active learning, self-supervised learning, adversarial learning), supervised learning techniques remain popular when it comes to computer vision models in production. Many potential sources of error can impact the quality of labeled data, including a lack of proper data management, instruction ambiguity, data misinterpretation due to a low signal-to-noise ratio in the source data, and the cognitive difficulty of certain labeling operations. If not detected early, these errors can have devastating effects, from higher costs to underwhelming model performance. Thus, there is a need to instrument frameworks with proper mechanisms to monitor data quality as labeling efforts progress.
As a concept, data is of high quality if it fits the intended purpose of use. In the context of ML, data is of high quality if it correctly represents the real-world construct that the data describes, meaning that it is representative of the underlying population and scenarios. While good quality differs from case to case, there are common dimensions of data quality that can be measured.
The Collibra team put together a nice list with six dimensions of data quality: completeness, accuracy, consistency, validity, uniqueness, and integrity. Since Superb AI focuses on computer vision data, let’s examine how these dimensions fit into the context of visual data.
1. Completeness: Is your dataset sufficient to produce meaningful insights? Are there no “gaps” in your dataset? Does your dataset cover all the edge cases? Visual data can’t be considered complete until all the object classes of interest have been labeled, which is a requirement to kickstart the modeling process. These vital labels help your model learn and make predictions.
2. Accuracy: Does your dataset accurately describe the problem you want to solve? Do the feature entities exist? Accuracy is critical for highly regulated industries like government, healthcare, and finance. Measuring the accuracy of visual data requires verification by domain experts. For example, you need a surgeon who can look at medical images and diagnose whether the tumor is cancerous or not.
3. Consistency: Are the values consistent across your dataset? If there are redundant values, do they have similar values? For visual data, you must ensure that the labels are consistent across the image or video artifacts. Label consistency is difficult to assess and requires a rigorous quality control process. Label consistency is often associated with label accuracy, and any visual dataset scoring high on both will be of high quality.
4. Validity: Does your dataset conform to business rules? Do the attributes align with the specific domain or requirement? In visual data, a labeled bounding box is valid if it is drawn in a rectangular shape. Any invalid labels will affect the completeness of visual data. To ensure completeness, you can define rules and checks to ignore or resolve the invalid labels.
5. Uniqueness: Is your dataset free of duplicates and overlaps? Identifying overlaps in labeled objects within visual data can help maintain uniqueness. Note that uniqueness needs to be measured against all instances within a dataset or across datasets. Uniqueness also depends on the code (for example, how random sampling is implemented) and the situation (for example, a long-tailed sampling method).
6. Integrity: Are the attribute relationships in your dataset maintained correctly? As the number of data sources increases and the data is used for more diverse use cases, it is crucial to keep track of the data journey and transformation. Integrity in visual data ensures that all the labeled object classes can be traced and connected in a single source of truth.
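To make the validity and uniqueness dimensions more concrete, here is a minimal sketch of rule-based checks on bounding box labels. The `Box` schema, the function names, and the 0.9 overlap threshold are assumptions made for illustration, not part of any particular labeling tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical schema)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def is_valid(box: Box, img_w: int, img_h: int) -> bool:
    """Validity rule: the box has positive area and lies inside the image."""
    return (
        0 <= box.x_min < box.x_max <= img_w
        and 0 <= box.y_min < box.y_max <= img_h
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_duplicates(boxes: list[Box], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Uniqueness rule: flag same-class box pairs that overlap above the threshold."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            same_class = boxes[i].label == boxes[j].label
            if same_class and iou(boxes[i], boxes[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

In practice, you would run `is_valid` first and pass only the valid boxes to `find_duplicates`, since the overlap computation assumes well-formed coordinates.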
According to a 2021 survey conducted by Datafold, data quality and reliability are top KPIs for data teams, followed by improving data accessibility, collaboration, and documentation. Data quality can’t be owned by any single team and needs to be addressed on a company level (in the same way security is) and requires close collaboration between teams. Unfortunately, most teams currently don’t have adequate processes and tools to address data quality issues.
Considering that data teams identify data quality as their primary KPI while lacking the tools and processes to manage it, it is not surprising that they are burdened by manual work: many routine tasks, such as testing changes to ETL code or tracing data dependencies, can take days without proper automation. They need to write ad-hoc data quality checks or ask others before using the data for their work. Only a few teams use automated tests and data catalogs as a source of truth for data quality.
Sarah Krosnik has mapped out the data quality tooling landscape and placed each tool under one of four categories based on the approach it takes to data quality:
1. Auto-profiling data tools (Bigeye, Datafold, Monte Carlo, Lightup, Metaplane) are hosted tools that automatically profile data through either ML or statistical methods and alert upon changes based on historical behavior. Choose them if your team has a high budget, many data sources you don’t control, and fewer technical resources or time to create and maintain custom tests.
2. Pipeline testing tools (Great Expectations, Soda, dbt tests) are open-source tools with a paid cloud option. They integrate directly into data pipelines and let you configure very granular unit tests that stop downstream tasks from running if the data doesn’t meet specific acceptance criteria. Choose them if your team wants to start with a free solution and have high control and deep granularity in testing while easily integrating with and influencing the result of existing pipelines.
3. Infrastructure monitoring tools (Datadog, New Relic) are hosted tools to monitor cloud infrastructure, alerting on thresholds for resource utilization and availability. Choose them if your team is responsible for data platform infrastructure and wants to leverage common engineering tools to alert when infrastructure isn’t scaling well.
4. “All-In-One” tools (Databand, Unravel) attempt to unify infrastructure monitoring and data auto-profiling, and are made specifically for analytics professionals and common data tools (Airflow, Spark). Choose them if you’d like to start with a single solution for auto-profiling and infrastructure monitoring to get the ball rolling and you already use tools these companies integrate with.
From my research, few data quality tools deal with unstructured visual data. All of the tools mentioned above only deal with structured tabular data. Therefore, there’s an emerging opportunity to design such a tool given the untapped potential of visual data, which has a larger footprint than structured data and is powering more novel computer vision applications.
Should we care about the quality of our visual datasets? If the goal is to build algorithms that can understand the visual world, having high-quality datasets will be crucial. We outline below three recommendations for designing a data quality tool for computer vision.
Torralba and Efros (2011) assessed the quality of various computer vision datasets based on cross-dataset generalization (training on one dataset and testing on another). Their comparative analysis illustrates different types of bias in these datasets: selection bias (datasets often prefer particular kinds of images), capture bias (photographers tend to capture objects in similar ways), label bias (semantic categories are often poorly defined, and different labelers may assign different labels to the same type of object), and negative set bias (if what the dataset considers “the rest of the world” is unbalanced, the resulting models can be overconfident and not very discriminative).
To minimize the effects of bias during dataset construction, a data quality tool for computer vision should be able to:
1. Verify that the data is obtained from multiple sources to decrease selection bias.
2. Perform various data transformations to reduce capture bias.
3. Design rigorous labeling guidelines with vetted personnel and built-in quality control to negate label bias.
4. Add negatives from other datasets or use algorithms to actively mine hard negatives from a huge unlabeled set to remedy negative set bias.
In an exploratory study on deep learning, He et al. (2019) considered four aspects of data quality:
1. Dataset equilibrium refers to how evenly samples are balanced among classes and how far the sample distribution deviates. For instance, we delete all the data of one specific object class in the training set to see the effect on the model when identifying the deleted object class and the undeleted object classes.
2. Dataset size is measured by the number of samples. Large-scale datasets typically have better sample diversity than smaller datasets. For instance, we modify the dataset size by randomly deleting a specific percentage of data in the training set.
3. Quality of label refers to whether the labels of the dataset are complete and accurate. For instance, we randomly change the label to the wrong one and try a different ratio of changed labels to see the effect on model robustness.
4. Dataset contamination refers to the degree of malicious data artificially added to index datasets. For instance, we use different methods such as contrast modification and noise injection to add some contamination to the images to see the effect on model robustness.
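The four perturbations described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the procedure from the cited study: the function names are hypothetical, labels are assumed to be integers in `[0, num_classes)`, images are assumed to be float arrays scaled to `[0, 1]`, and Gaussian noise stands in for whichever contamination method you choose.

```python
import numpy as np


def drop_class(labels: np.ndarray, cls: int) -> np.ndarray:
    """Equilibrium: indices of samples kept after deleting one class."""
    return np.flatnonzero(labels != cls)


def subsample(n: int, drop_fraction: float, rng: np.random.Generator) -> np.ndarray:
    """Size: indices kept after randomly deleting a fraction of n samples."""
    n_keep = n - int(round(n * drop_fraction))
    return np.sort(rng.choice(n, size=n_keep, replace=False))


def flip_labels(labels: np.ndarray, ratio: float, num_classes: int,
                rng: np.random.Generator) -> np.ndarray:
    """Label quality: change a fraction of labels to a different, wrong class."""
    noisy = labels.copy()
    n_flip = int(round(len(labels) * ratio))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes never maps a
    # label back to itself, so every selected label becomes wrong.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy


def add_gaussian_noise(images: np.ndarray, sigma: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Contamination: add pixel noise and clip back to the [0, 1] range."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Passing an explicit `np.random.default_rng(seed)` to each function keeps the perturbed datasets reproducible, which matters when you compare model robustness across different ratios.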
To solve the issues associated with the aspects mentioned above, a data quality tool for computer vision should be capable of:
1. Rebalancing samples among classes so that no small group of classes is overrepresented in the training set.
2. Suggesting a min/max threshold on the optimal number of samples required to train the model for the specific task.
3. Identifying label errors and providing sufficient quality control to fix them.
4. Adding noise to samples in the training set to help reduce generalization error and improve model accuracy on the test set.
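As a rough illustration of the rebalancing capability, the sketch below measures class imbalance and then applies random oversampling, which is only one of several possible rebalancing strategies. The function names and the default seed are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def imbalance_ratio(labels):
    """Largest class count divided by the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def oversample_indices(labels, seed=0):
    """Return sample indices where every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    rng.shuffle(balanced)
    return balanced
```

Oversampling duplicates minority-class images, so it is usually paired with augmentation. Undersampling the majority classes or reweighting the loss are alternatives when duplication is undesirable.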
Alsallakh et al. (2022) presented visualization techniques that help analyze the fundamental properties of computer vision datasets. These techniques include pixel-level component analysis (principal component analysis, independent component analysis), spatial analysis (spatial distribution of bounding boxes or segmentation masks for different object classes), average image analysis (averaging a collection of images), metadata analysis (aspect ratios and resolution, image sharpness, geographic distribution), and analysis using trained models (feature saliency in a given input, input optimization, concept-based interpretation).
To improve understanding of computer vision datasets, a data quality tool for computer vision should offer the visual analysis techniques mentioned above:
1. Pixel-component analysis is helpful in understanding which image features are behind significant variations in the dataset and (accordingly) predicting their potential importance for the model.
2. Spatial analysis is helpful in uncovering potential shortcomings of a dataset and assessing whether popular data augmentation methods are suited to mitigate any skewness in the spatial distribution.
3. Average image analysis is helpful for comparing subsets of images of the same nature and semantics, selecting subsets that show representative manifestations or interesting outliers, and revealing visual cues in the dataset that the models can use as “shortcuts” instead of learning robust semantic features.
4. Metadata analysis is helpful in assessing the diversity of a dataset and exploring different manifestations of targeted classes and features to guide the curation and labeling of representative datasets. Furthermore, exploring datasets based on their temporal information is useful to assess their appropriateness for the target task.
5. Analysis using trained models is helpful in revealing (1) how different learning paradigms and architectures are impacted by inherent issues in the training data, (2) ambiguities and shortcomings of labels, and (3) properties of different groups of classes via embedding visualizations.
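Two of these analyses, average image analysis and spatial analysis, are simple enough to sketch with NumPy. The function names, the normalized `(x_min, y_min, x_max, y_max)` box format, and the grid size below are assumptions for illustration and are not taken from the cited paper.

```python
import numpy as np


def average_image(images: np.ndarray) -> np.ndarray:
    """Mean over a stack of same-sized images, shape (N, H, W) or (N, H, W, C)."""
    return images.astype(np.float64).mean(axis=0)


def box_heatmap(boxes, grid=(32, 32)) -> np.ndarray:
    """Count how many boxes cover each cell of a rows x cols grid.

    Boxes are (x_min, y_min, x_max, y_max) tuples normalized to [0, 1],
    so images of different resolutions can share one heatmap.
    """
    rows, cols = grid
    heat = np.zeros(grid, dtype=np.int64)
    for x_min, y_min, x_max, y_max in boxes:
        c0, c1 = int(np.floor(x_min * cols)), int(np.ceil(x_max * cols))
        r0, r1 = int(np.floor(y_min * rows)), int(np.ceil(y_max * rows))
        # Clamp to the grid so slightly out-of-range boxes do not wrap around.
        heat[max(r0, 0):min(r1, rows), max(c0, 0):min(c1, cols)] += 1
    return heat
```

Plotting one heatmap per object class (for example with an image plotting function of your choice) makes positional skew visible, such as a class that only ever appears near the center of the frame.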
The understanding of the quality of data used to train a model, the clarity of the labeling process, and the knowledge of the strengths and weaknesses of the ground-truth data used to evaluate the models will lead to increased traceability, verification, and transparency in computer vision systems. In this article, we have given a tour of the data quality tooling landscape and proposed ideas to design a robust data quality tool for computer vision applications.
At Superb AI, we are building a CV DataOps platform to help computer vision teams automate data preparation at scale and make building and iterating on datasets quick, systematic, and repeatable. Our custom auto-label and upcoming AI features, such as mislabel detection and an embedding store, use the collected data to set data quality rules, which should adapt to the data we collect.