Computer vision projects commonly fail for three reasons: (1) they never make it to production, (2) your coveted computer vision scientists and engineers spend too much of their time on menial tasks, and (3) governance risk increases. To tackle these challenges, it is essential to invest in foundational activities for your computer vision teams, such as getting production-quality labeled training data via data operations (DataOps) tools and processes.
In part 1 of this series, we provided an overview of DataOps for computer vision by (1) introducing the concept of DataOps for data analytics, (2) making the case for using DataOps in computer vision, and (3) laying out the six DataOps principles for any enterprise computer vision system. In part 2, we (1) examined the three data-related challenges that any computer vision team has to deal with, (2) proposed specific functions of an ideal DataOps platform to address those challenges, and (3) argued that enterprises need to structure a proper DataOps team and build incentives that recognize its essential work on data.
Part 3 covers (1) the key personas that make up an ideal DataOps team and (2) the organizational structure needed to get the most out of that valuable data work. But first, let’s look at how to leverage DataOps when scaling your computer vision projects.
DataOps represents a culture change that focuses on improving collaboration and accelerating dataset delivery by adopting lean and iterative practices, where appropriate, to scale data pipeline operations from acquisition to delivery. As seen below, a typical DataOps pipeline for computer vision consists of data acquisition, annotation, debugging, augmentation, transformation, and curation stages. Hence, to implement DataOps, enterprises need to primarily focus on:
In brief, DataOps helps provide the agility, efficiency, and continuous assessment of data that is governed throughout its life cycle. The goal is to create data pipelines with orchestration tools that can be provisioned automatically within production environments, while ensuring governance and security across development and production environments.
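To make the pipeline idea concrete, here is a minimal sketch in Python that chains the six stages named above and records how many samples enter and leave each stage, as a rough stand-in for continuous assessment and governance. Every name in it (`Pipeline`, `KNOWN_LABELS`, the stage functions, the sample fields) is a hypothetical placeholder rather than any particular tool's API, and each stage body is a toy stand-in for real work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]
Stage = Callable[[List[Sample]], List[Sample]]

# Hypothetical label source standing in for a real annotation workflow.
KNOWN_LABELS = {"img_001": "cat", "img_003": "dog"}


@dataclass
class Pipeline:
    """Runs named stages in order and logs sample counts per stage."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def add(self, name: str, fn: Stage) -> "Pipeline":
        self.stages.append((name, fn))
        return self

    def run(self, samples: List[Sample]) -> List[Sample]:
        for name, fn in self.stages:
            before = len(samples)
            samples = fn(samples)
            self.audit_log.append(f"{name}: {before} -> {len(samples)} samples")
        return samples


def acquire(samples: List[Sample]) -> List[Sample]:
    # Toy acquisition: copy the incoming records.
    return [dict(s) for s in samples]


def annotate(samples: List[Sample]) -> List[Sample]:
    # Attach a label if one is known; otherwise leave it as None.
    return [dict(s, label=KNOWN_LABELS.get(s["image_id"])) for s in samples]


def debug(samples: List[Sample]) -> List[Sample]:
    # Drop samples whose label is missing.
    return [s for s in samples if s["label"] is not None]


def augment(samples: List[Sample]) -> List[Sample]:
    # Emit the original plus a record marking a horizontally flipped copy.
    out: List[Sample] = []
    for s in samples:
        out.append(dict(s, flipped=False))
        out.append(dict(s, flipped=True))
    return out


def transform(samples: List[Sample]) -> List[Sample]:
    # Record a target size; a real stage would resize the pixels.
    return [dict(s, size=(224, 224)) for s in samples]


def curate(samples: List[Sample]) -> List[Sample]:
    # Remove duplicates, keyed on image id and flip flag.
    seen = set()
    out: List[Sample] = []
    for s in samples:
        key = (s["image_id"], s["flipped"])
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out


pipeline = (
    Pipeline()
    .add("acquisition", acquire)
    .add("annotation", annotate)
    .add("debugging", debug)
    .add("augmentation", augment)
    .add("transformation", transform)
    .add("curation", curate)
)

raw = [
    {"image_id": "img_001"},
    {"image_id": "img_002"},  # no known label, dropped in debugging
    {"image_id": "img_003"},
    {"image_id": "img_003"},  # duplicate, removed in curation
]

dataset = pipeline.run(raw)
for entry in pipeline.audit_log:
    print(entry)
```

In a real setup each stage would be a job managed by an orchestration tool, and the audit log would feed whatever governance reporting your organization uses; the point here is only that the stages are explicit, ordered, and individually measurable.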
Computer vision practitioners from an academic background are often hired to do technical work with algorithms, meaning a strong background in mathematics and computer science is needed. However, these skills don’t translate well into assembling a good dataset (thanks to Taivo Pungas for coming up with this list), which requires:
All activities can be brought under a single function of DataOps, whose mission is to assemble good datasets. Ideally, for every prediction task, this team (1) provides a corresponding dataset that comprehensively and accurately reflects what the product intends to achieve and (2) maintains its accessibility and freshness for downstream purposes.
When building a DataOps team, ML and data leads need to figure out the team’s composition and high-level structure. They are right to do so: a strong DataOps team is no longer a luxury but essential to the survival of any ML-first company today.
Before building a DataOps team, you must realize where you are in your “ML data management journey” because this will directly affect the structure of your team. If you’re not familiar with ML data management as a category, Astasia Myers provides a solid primer on the tools that help improve ML models by improving datasets. Essentially, these tools extract data quality best practices from the data analytics world and apply them to ML. They help the DataOps team curate good training datasets, detect mislabels, and identify challenging edge cases.
To evaluate ML data management maturity, we need to define the fundamentals mentioned here (this framing is heavily influenced by Maslow’s hierarchy of needs). If we are trying to assemble a good training dataset, what are the fundamental things that make that goal achievable?
The better your organization meets this “data hierarchy of needs,” the more mature your ML data management capabilities are.
We believe that an ideal DataOps team should be composed of the core functions outlined below:
There is no ideal structure for a DataOps team. Considering that your organization’s data needs will likely evolve rapidly, your DataOps team structure should also adapt accordingly. For this reason, we don’t prescribe a given structure, but rather present the most common models and how they can be suited to different types of businesses.
In a centralized model, the DataOps team has access to all the data and serves the whole organization across various projects. All data engineers, curators, and labeling managers within this team are managed directly by the head of DataOps. With this structure, the DataOps team works in tandem with the ML stakeholders in each ML unit in a consultant-type relationship.
This flexible model adapts to the continuously evolving ML needs of a growing business. If you’re at the beginning of your ML data management journey, this is the structure we recommend. The DataOps team’s initial projects will seek to bring visibility to the business, ensuring all ML teams in your organization have the high-quality training data they need.
In a decentralized model, each ML team needs its “own” DataOps people. Data labeling managers and data curators focus on the problems faced by their specific ML unit and interact little with the data people on the company’s other ML teams. With this structure, data curators report directly to the head of their respective ML unit.
A strong DataOps team is a key pillar you need to build if your company is to develop and deploy computer vision in the real world. How much predictive power your models extract from data ultimately depends on the strength of this team and how symbiotic it is with the rest of your organization. There is no made-to-order advice for the composition and structure of your DataOps team. That’s why you need to understand your organization’s ML data management maturity level so that you can build a DataOps team suited to your ML objectives and aligned with your business goals.
At Superb AI, we write about all the processes involved in building computer vision products: from the modern machine learning stack, to data operations team composition, to data labeling and curation. Our blog covers both the technical and the less technical aspects of bridging the gap between computer vision in academia and industry.
We are building a training data management product for the modern ML stack in the data-centric AI development world. We have designed our product to be enterprise-ready, automated, and intuitive.
Want to check it out? Reach out to us, and we will show you a demo.