In January, I attended TWIMLcon, a leading MLOps and enterprise ML virtual conference. It focuses on MLOps and how enterprises can overcome the barriers to building ML models and getting them into production. There was a wide range of both technical and case-study sessions curated for ML/AI practitioners. In this long-form blog recap, I will dissect content from the talks that I found most useful from attending the conference.
The post consists of 14 talks that are divided into 3 sections: (1) Case Study, (2) Technology, and (3) Perspectives.
Over 320 million Spotify users in 92 different markets worldwide rely on Spotify’s great recommendations and personalized features. Those users created over 4 billion playlists from a catalog of over 60 million tracks and nearly 2 million podcasts. With the massive inflows of data and complexity of the different pipelines and teams using the data, it’s easy to fall into the trap of tech debt and low productivity. The ML Platform at Spotify was built to address that problem and make all of Spotify’s ML practitioners productive and happy. Aman Khan and Josh Baer gave a concrete talk that describes the history of the ML Platform at Spotify.
Currently, Spotify has over 50 machine learning teams using the ML platform. In 2020, those teams trained over 30,000 models using the platform team’s tools. These models ramped up to 300 thousand prediction requests per second using the internal serving framework. As measured by the number of model iterations per week, machine learning productivity increased by 700% in 2020 alone due to the platform enhancement. Here are a few applications of Machine Learning at Spotify: app personalization, recommendations, user onboarding, messaging quality, and expansion to new markets.
Spotify has used machine learning since the company’s founding. Early efforts were custom-built code that ran on a Hadoop cluster. For the first five to ten years, Spotify’s machine learning capabilities were all about growing the data capabilities: processing data at scale and handling unique datasets relevant to Spotify’s growing machine learning ambition. As more frameworks came into play and ML grew in popularity, they saw more combinations of tools being used in projects. This led to higher tech debt for ML systems and more cognitive load when building new systems.
In 2017, Spotify created an internal infrastructure/platform team. Its core mission is to increase the speed of iteration for ML development by providing tooling that supports common development patterns. Here are some of the goals: (1) to build a platform for active ML use cases, (2) to reduce the cost to maintain ML applications, (3) to democratize ML, and (4) to support state-of-the-art ML.
The ML workflow at Spotify can be broken down into three iterative steps, from data exploration to model serving at a high level. The ML Platform then built 5 product components that serve various stages of the ML lifecycle:
The piece that ties these components together is the ML Engagement Team that focuses on ML education and outreach initiatives within Spotify.
Aman then dived deeper into Jukebox, Spotify’s internal feature store. The goal of creating Jukebox is to reduce users’ time to discover, share, and implement features for model training and serving. The platform team does this by simplifying feature management from experimentation all the way to online serving.
Let’s look at an example of building an Artist Preference Model that predicts how likely a user is to be a fan of a given artist. If done in batch, this would require an impractical number of predictions: roughly 300M users x 10M artists, or about 3 x 10^15 pairs. If the prediction is instead made online at request time, it becomes more scalable for other Spotify teams to build personalization into their applications (using this model’s outputs as the inputs to their models).
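To get a feel for why the batch approach does not scale, here is a back-of-the-envelope calculation. The user and artist counts are the rough figures quoted above; the throughput number is a purely hypothetical assumption (it reuses the 300 thousand requests per second mentioned earlier only as a scale reference, not as a claim about Spotify’s batch capacity).

```python
# Rough figures quoted in the example above.
num_users = 300_000_000
num_artists = 10_000_000

# One prediction per (user, artist) pair if everything is precomputed in batch.
batch_pairs = num_users * num_artists
print(f"{batch_pairs:.1e} pairs")  # 3.0e+15 pairs

# Hypothetical sustained throughput, used only to give a sense of scale.
assumed_predictions_per_second = 300_000
seconds = batch_pairs / assumed_predictions_per_second
years = seconds / (365 * 24 * 3600)
print(f"about {years:.0f} years of compute at that rate")  # about 317 years
```

Scoring only the (user, artist) pairs that are actually requested sidesteps this blow-up, which is the motivation for serving the model online.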
The diagram above displays a typical Jukebox Workflow:
Towards the end of the talk, Josh shared the three lessons from his experience leading the platform team:
The biggest tradeoff was to build a platform solely for the ML engineering use cases. Such focus allowed the platform team to reduce the maintenance burden for ML teams and free them up for building new ML projects. Of course, this means less of a focus on non-ML engineering work (such as earlier prototyping and exploration phases). In 2021, the platform team’s biggest goal is to expand the breadth of the platform’s offering and support all the different parts of the production lifecycle. Besides that, here are other main themes that they will be working on:
Toyota Research Institute (TRI), Toyota’s research arm, was established in January 2016 with a billion-dollar budget. It has grown to more than 300 employees across three facilities in Cambridge, Ann Arbor, and Silicon Valley. The main focus areas have been automated driving, robotics, advanced material design and discovery, and machine-assisted cognition. Sudeep Pillai shared an overview of the MLOps environment developed at TRI and discussed some of the key ways MLOps techniques must be adapted to meet the needs of high-stakes environments like robotics and autonomous vehicles.
TRI has a unique approach to Automated Driving called “One System, Two Modes”:
Machine learning is eating automated driving (AD) software across the industry. AD software was traditionally hand-engineered. It was modular and rule-based to make things composable and inspectable. However, it was fairly rigid and, as a result, generalized poorly. As we move towards the software 2.0 world, the technologies for the core components of an AD stack (Perception, Prediction, Planning) are starting to adopt ML. The idea is to engineer AD software that is data-driven and generalizable to diverse situations.
The challenging task is to build a unified MLOps infrastructure to support all ML models in safety-critical AD applications (2D/3D object detection, traffic light detection, semantic segmentation, bird-eye-view scene flow, monocular depth estimation).
TRI’s real goal is to build a unified platform for accelerated fleet learning. This was inspired by the “Toyota Production System” for vehicles. Given an assembly line that Toyota has, the goal is to optimize parts of this line to produce more vehicles at the end of any given day. In the context of MLOps, this platform needs to be domain-specific, model-agnostic, and applicable across heterogeneous fleets. The question becomes:
How to marry MLOps best practices with safety guarantees to deploy models in an AD setting?
While adapting MLOps to AD applications, the TRI team came across a few key challenges (related to the three phases of design, development, and operations): (1) more complex model dependencies and cascades during the design phase, (2) more nuanced model acceptance criteria during the development phase, and (3) more operational burden for safety-critical applications during the operations phase.
For example, let’s say the goal is to improve cars’ performance at intersections. A design phase might look like the diagram above:
Modular architecture also poses the challenge of complex dependencies, where models are trained on predictions from other models. For example, if we want to re-train a model far upstream in the workflow, we need to consider all the downstream models that could be affected.
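The dependency problem can be made concrete with a small graph traversal. This is only an illustrative sketch: the model names and the `FEEDS_INTO` graph are hypothetical and do not describe TRI’s actual stack.

```python
from collections import deque

# Hypothetical model cascade: each key feeds its predictions to the models listed.
FEEDS_INTO = {
    "object_detection": ["tracking", "scene_flow"],
    "tracking": ["prediction"],
    "scene_flow": ["prediction"],
    "prediction": ["planning"],
    "planning": [],
}

def downstream_models(graph, retrained):
    """Return every model that directly or indirectly consumes `retrained`, in BFS order."""
    seen = []
    queue = deque(graph.get(retrained, []))
    while queue:
        model = queue.popleft()
        if model in seen:
            continue  # already reached through another path
        seen.append(model)
        queue.extend(graph.get(model, []))
    return seen

affected = downstream_models(FEEDS_INTO, "object_detection")
print(affected)  # ['tracking', 'scene_flow', 'prediction', 'planning']
```

Every model in `affected` would need to be re-validated (and possibly re-trained) after the upstream change, which is why re-training far upstream is expensive.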
To build MLOps for safety-critical applications, TRI designs a diverse suite of benchmarks and tests for evaluation during real-world and simulated scenarios (as seen below).
This unified and scalable infrastructure encourages fast-iterating, scenario-driven, and safety-first deployment. A big problem in automated driving is the heavy tail, where it’s challenging to find the edge cases. The vehicle needs to drive many miles to encounter the heavy-tail cases. Iterating fast is the only way to accelerate that process. TRI’s near-term goal is to scale such foundational capability for learning from the Toyota fleet to all forms of Toyota vehicles.
Note: Check out different TRI research lines if you are an autonomous vehicle geek.
Model quality is a common production problem because subtle failures are often caused by infrastructure and frequently go undetected.
Additionally, model quality is not just an operational but also a trust problem.
Using a hand analysis of approximately 100 incidents tracked over 10 years, the Google Engineering Infrastructure team has looked carefully at cases where bad models reached or almost reached the serving system. In his talk, Todd Underwood identified common causes and manifestations of these failures and provided some ideas for measuring the potential damage caused by them. Most importantly, he proposed a set of simple (and some more sophisticated) techniques for detecting the problems before they cause damage.
The example Google system in his talk is a part of a complex pipeline where thousands of models are trained on hundreds of features. Data pre-processing, training, and serving are asynchronously connected. Newly arriving (logged) data is joined to slower-changing data. Different models are trained on different collections of features. New models are trained continuously (every hour). An “experiment” system is plumbed through from the data warehouse to model serving.
In particular, it is one of the largest and oldest ML systems at Google:
Todd made an interesting point: Google treats outages as information. Failures are a gift, as they are a natural experiment of what, at least once, broke and usually why. Understanding failures can help them avoid, mitigate, or resolve those failures. In fact, Google maintains a searchable database of failures and their write-ups called “Requiem.”
Google’s initial hypothesis is that many ML failures probably have nothing to do with ML. Their impression is that many undetected ML failures aren’t modeling failures. Instead, boring and commonplace failures are more difficult to notice and more difficult to take seriously. Modelers are looking at modeling failures, while infrastructure engineers are not paying attention to model quality. This gap might also be a root cause. Evidence could confirm this and make it more actionable or contradict this and point towards suitable ML-centric quality.
Then, the Google Infrastructure team looked at the outages and developed a taxonomy of model failures. It is worth noting that this taxonomy is observational, not quantitative. Failures are categorized based on the experience and intuition of the authors. Furthermore, this taxonomy is deeply connected to Google’s underlying systems architecture. But in general, understanding even one failure can help prevent other quality outages.
To prevent these failures in the first place, a good practice is to automatically disallow configuration changes that exclude too much data from training.
To protect the system (generally speaking), we can monitor data flowing through various stages in the training pipeline and ensure that the data is within acceptable thresholds and does not change too suddenly. Both the distribution of features produced by the feature generation sub-system and the distribution of features ingested by a specific model should be covered. We also want to update the thresholds automatically and periodically to keep up with the changing world.
Finally, we should always automatically pause the pipeline and raise alerts on data quality degradation.
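The three practices above (bounding how much data a change may exclude, watching for sudden shifts, and pausing on degradation) can be sketched as a small guard. This is a minimal illustration under assumptions, not Google’s implementation: the statistic names, the `Threshold` class, and all numeric limits are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    low: float
    high: float
    max_relative_change: float  # allowed jump versus the previous run

def check_stat(name, value, previous, threshold):
    """Return human-readable violations for one monitored statistic."""
    problems = []
    if not threshold.low <= value <= threshold.high:
        problems.append(f"{name}: {value} outside [{threshold.low}, {threshold.high}]")
    # Skip the change check when there is no usable previous value (None or zero).
    if previous and abs(value - previous) / abs(previous) > threshold.max_relative_change:
        problems.append(f"{name}: changed too suddenly ({previous} -> {value})")
    return problems

def find_violations(stats, previous_stats, thresholds):
    violations = []
    for name, value in stats.items():
        violations += check_stat(name, value, previous_stats.get(name), thresholds[name])
    return violations

# Hypothetical statistics collected from two consecutive pipeline runs.
thresholds = {
    "fraction_rows_kept": Threshold(low=0.8, high=1.0, max_relative_change=0.1),
    "mean_feature_value": Threshold(low=0.0, high=5.0, max_relative_change=0.25),
}
previous_run = {"fraction_rows_kept": 0.95, "mean_feature_value": 2.0}
current_run = {"fraction_rows_kept": 0.55, "mean_feature_value": 2.1}

violations = find_violations(current_run, previous_run, thresholds)
if violations:
    # In a real pipeline this is where training would be paused and an alert raised.
    print("PAUSE PIPELINE:", violations)
```

Here a configuration change that silently drops almost half of the training rows trips both the absolute bound and the sudden-change check, while the small drift in the feature mean passes.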
Todd concluded his talk with these takeaways:
The story of build vs. buy is older than the software industry itself. It is a question of whether you want something now or later, simple or complex, etc. Most importantly, you need to decide what kind of “technical debt” you are getting into. With infrastructure, sometimes you have to buy.
If you want to integrate a Deep Learning component inside your product, then your options are more limited due to variance in use cases (no one-size-fits-all solution), time constraints, the amount of research needed to integrate with your own use cases, few known “best practices,” and nascent infrastructure.
This gets even worse if Deep Learning is your core product. Off-the-shelf tools for scaling are not relevant yet, while infrastructure needs to be tailored to your use case. It would be best to emphasize an easy transition from research to production.
In their session, Ariel Biller from ClearML and Dotan Asselmann from Theator presented a specific case study of build-and-buy. For context, Theator is a Surgical Intelligence Platform. It provides revolutionary personalized analytics on lengthy surgical operation videos. The deployed client-side inference pipeline consists of multiple deep learning models, so no off-the-shelf solutions truly fit the need for full Continuous Integration / Deployment / Training. Building such critical infrastructure within a startup’s constraints would be impossible if not for existing MLOps solutions. In the current landscape, ClearML offered Theator unique precursors to realize the needed designs with unprecedented integration ease.
For their data stack, Theator relies on ClearML for data versioning, querying, processing, and storage. The only part that they built in-house is the GPU loading component.
For their research stack, Theator relies on ClearML for experiment management, training orchestration, and the model training pipeline. Of course, they wrote the research code themselves. They also developed an in-house interface called the pipeline master controller that handles various tasks, from configuring the model environment to spinning up cloud machines.
For their continuous training stack, Theator builds almost everything from the ground up: a production training pipeline, a video inference pipeline, an auto-tagging capability, a model re-training trigger alert (for continuous training), and a suite of regression tests (for continuous deployment). The only part that they outsourced to ClearML is the data intake step (for continuous integration).
The session ended with a couple of lessons:
Note: Ariel argued that the MLOps stack should be bottom-up designed — meaning easy integration with research code, logging mechanisms, one-click orchestration, workflow versioning, etc. ClearML was built specifically for this paradigm, so check them out!
Prosus is a global consumer Internet group and one of the largest tech investors — serving 1.5B+ people in 80+ countries. They invested in 4 segments: classifieds (like OLX Group), payments and fintech (like PayU), food (like iFood), and education (like Brainly). Every interaction in these platforms invokes a swarm of ML models:
Thus, there is a dire need for tools and technologies established by these organizations to support and automate various ML workflow aspects. Paul van der Boor explored the evolution of ML Platform capabilities at several Prosus companies to enable applying machine learning at scale, including iFood’s ML Platform leveraging existing Amazon SageMaker capabilities, OLX’s development of data infrastructure for ML and model serving infrastructure based on KFServing, and Swiggy’s home-built Data Science Platform making a billion predictions a day.
iFood is the biggest food delivery service in Brazil. There are millions of real-time, synchronized decisions made on the platform every day: What does the user like to eat? Does the restaurant have capacity? Which restaurant should we recommend? Which rider should deliver the order? Should we offer incentives? How long will it take to deliver? Should we offer to pay later? Can we group orders?
The ML platform that the iFood team built to handle these decisions is called Bruce. Bruce helps train models easily on Amazon SageMaker, creates SageMaker endpoints to serve trained models, and integrates easily with a CI/CD environment. Bruce’s design enables a central point of collaboration where data scientists do not need to interface with the infrastructure directly.
Currently, the platform team at iFood is working on expanding Bruce’s capabilities with components such as:
OLX Group is a global online marketplace headquartered in Amsterdam, where people buy and sell services and goods such as electronics, fashion items, and furniture. Its ML platform is very mature.
Bangalore-based Swiggy offers an on-demand food delivery platform that brings food from neighborhood restaurants directly to users’ doors. Facing the same challenges as iFood, the Swiggy team built a home-grown ML deployment and orchestration platform for data scientists to easily integrate, deploy, and experiment with multiple models and abstract away integrations with feature teams to simple, one-time API contracts. The platform has been battle-tested at scale (3 years in production), is part of the larger platform strategy at Swiggy, is easily extensible, and focuses on the ‘last mile’ of the ML workflow.
The diagram above displays the platform workflow. It is designed for scale:
Paul concluded the talk with a few general considerations for building ML platforms at scale:
As companies adopt AI/ML, they run into operational challenges with cost and ROI questions: How do you capture the costs of feature computation, model training, and model predictions? How do you forecast the costs? If a model needs an expensive GPU, what’s the ROI on a model or a set of features? Srivathsan Canchi and Ian Sebanja from Intuit gave a brief overview of Intuit’s ML platform, with a specific focus on operational cost transparency across feature engineering, model training and hosting.
At Intuit, the ML platform is responsible for different parts of the model development lifecycle. It serves 400+ models and 8,000+ features, alongside 25+ trainings and 15 billion feature updates per day. Because the majority of ML workloads are in the cloud, the cloud infrastructure is a substantial (and hidden) cost for Intuit. While their ML use cases have increased 15 times year-over-year, their costs have not increased at the same pace. Ultimately, the cost is a tradeoff between model throughput/latency requirements and financial resources.
Srivathsan described a technical way to address this tradeoff while minimizing the overhead for data scientists. On day 1, every data scientist is granted a unique ID per model, and all resources are tracked with this ID. They can specify the projects to work on, and resources will be isolated for them. Over time, as their work pattern emerges, the platform will provide smart defaults for the most common use cases to automate the workflow as much as possible.
Ian also discussed different non-technical ways the Intuit platform team has utilized: providing information at the right points to encourage transparency and visibility, educating data scientists on cost optimization instances, collaborating with data scientists on solutions to evaluate model performance, and determining the ROI of costs saved.
Note: Overall, these practices to enable cost transparency help Intuit articulate the impact of ML projects and encourage data scientists to make efficient use of internal resources.
Adapting a digital product A/B testing system to support complex ML-powered use cases requires advanced techniques, highly cross-functional product, engineering, and ML teamwork, and a unique design approach. Justin Norman explored lessons learned and best practices for building robust experimentation workflows into production machine learning deployments at Yelp.
Any sophisticated experiment management tool must enable the ML engineers to:
There are many proprietary tools in the market, such as Neptune, Comet, Weights and Biases, SageMaker, etc. There are also many open-source frameworks like Sacred, MLflow, Polyaxon, guild.ai, etc. Yelp invested in the open-source libraries that align best with their needs and built thin wrappers around them to allow easier integration with their legacy code. For model experimentation, they opted for MLflow, which automates much of model tracking and monitoring. For model serving, they used MLeap for model serialization and deployment.
Justin then talked about the difference between testing and experimentation. The key difference is that experimentation is used to make a discovery or to test a hypothesis, while testing is used to check something before it is taken into widespread use. The impact of the experiment results determines the type of experiment required. At Yelp, there are two types:
An experimentation platform answers the key question: Is the best-trained model indeed the best model, or does a different model perform better on new real-world data? More specifically, the ML engineers need a framework to:
Every experiment needs a hypothesis: If we [build this], then [this metric] will move because of [this behavior change]. At Yelp, the product managers choose the decision metrics, and the data scientists consult and sign off on those metrics. The data scientists are also responsible for kicking off the conversation with data engineers to articulate the data needed for metric computation and understand which events will be logged. Justin brought up the concept of minimum detectable effect, the relative difference at which you actually start to care: is your experiment risky enough that you need to detect a given percentage loss or gain in your primary metric to make a decision?
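The minimum detectable effect directly drives how much traffic an experiment needs. The sketch below uses the standard normal-approximation formula for comparing two proportions; it is a textbook calculation, not Yelp’s method, and the baseline rate, significance level, and power are assumed values chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a relative change in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical primary metric with a 10% baseline conversion rate.
for mde in (0.20, 0.10, 0.05):
    n = sample_size_per_variant(baseline_rate=0.10, relative_mde=mde)
    print(f"relative MDE {mde:.0%}: about {n} users per variant")
```

Because the required sample grows roughly with the inverse square of the effect size, halving the minimum detectable effect roughly quadruples the traffic needed, so the choice of MDE is really a decision about how much risk the experiment has to rule out.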
Today, nearly all data experimentation at Yelp — from products to AI and machine learning — occurs on the custom-built Bunsen platform, with over 700 experiments in total being run at any one time. Bunsen supports the deployment of experiments to large but segmented parts of Yelp’s customer population, and it enables the company’s data scientists to roll back these experiments if need be. Bunsen’s goal is to dramatically improve and unify experiments and metrics infrastructure by dynamically allocating cohorts for experiments.
Bunsen consists of a frontend cheekily dubbed Beaker, which product managers, data scientists, and engineers use to interact with the toolset. A “scorecard” tool facilitates the analysis of experimental run results, while the Bunsen Experiment Analysis Tool — BEAT — packages up all of the underlying statistical models. There’s also a logging system used to track user behavior and serve as a source of features for AI/ML models.
Bunsen is a distributed platform meant to be utilized by various roles. Product managers, engineers outside the machine learning and AI space, machine learning practitioners, data scientists, and analysts at Yelp either consume information that comes from Bunsen or work directly with Beaker to gather that information.
Justin ended the talk with a small discussion of multi-armed bandits, an algorithmic technique used to dynamically allocate more traffic to variants that are performing well while allocating less traffic to under-performing variants. Essentially, you get a higher value from the experiment faster. There are different multi-armed bandit algorithms, including epsilon-greedy, upper confidence bound, and Thompson sampling. At the moment, Yelp’s platform team is experimenting with contextual bandits, which use context from incoming user data to make better decisions on what model to use for inference in near real-time.
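As a rough illustration of the traffic-shifting idea, here is a minimal epsilon-greedy simulation, the simplest of the algorithms named above. It is a sketch, not Yelp’s implementation: the variant success rates, the number of rounds, and the exploration rate are all made-up values.

```python
import random

def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate traffic allocation; returns how many times each variant was served."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: pick a random variant (also used until every variant has data).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the variant with the best observed success rate so far.
            arm = max(range(len(true_rates)), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]  # simulated user outcome
    return pulls

# Hypothetical success rates for three variants; the last one is the best.
pulls = epsilon_greedy(true_rates=[0.1, 0.3, 0.6])
print(pulls)
```

Most of the simulated traffic ends up on the strongest variant, while a fixed fraction keeps exploring the others. A contextual bandit would additionally condition the choice of variant on features of the incoming request.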
Operational ML is redefining software today.
Let’s take a look at what differentiates Analytic ML from Operational ML:
Operational ML, however, is still very hard to do. Here are the signs that a team is struggling:
Building operational ML applications is very complex. Data and features are at the core of that complexity.
We have been building software applications for years and have refined our development process with the DevOps pipeline, allowing us to build and deploy application code iteratively. Many tools have emerged to make that easier. We have also seen a similar set of tools to manage ML models. All the MLOps platforms are great at model training, model experimentation, and model serving. However, tooling for managing features is almost non-existent. We have tools for exploratory data analysis and feature engineering but nothing to manage the features’ end-to-end lifecycle.
Mike Del Balso gave an excellent talk on how the feature store solves this problem. It is basically the interface between data infrastructure and the model infrastructure. It allows data scientists and data engineers to build the catalog of productionized data pipelines. The feature store can connect to existing data that lives on data warehouses or other databases. It can also connect to raw data in those systems and derive features from such data. Data scientists interact with a feature store by productionizing the features and making them reusable for the rest of the organization.
Generally speaking, a feature store helps you:
Here are common problems and solutions that concern a feature store:
Note: To get started with feature stores, you should check out Feast and Tecton. Feast is a self-managed open-source software that supports batch and streaming data sources and ingests feature data from external pipelines. Tecton is a fully-managed cloud service that supports batch, streaming, and real-time data sources and automates feature transformations.
To make ML work for organizations, there are three components that we need to nail down: the technology, the operational concerns, and the organizational concerns. Today, we are right where we need to be in technology. However, the operational and organizational pieces are not yet figured out.
The typical ML lifecycle consists of three pieces: (1) Data Preparation (data collection, data storage, data augmentation, data labeling, data validation, feature selection), (2) Model Development (hyper-parameter tuning, model selection, model training, model testing, model validation), and (3) Model Deployment (model inference, model monitoring, model maintenance). In her talk, Jennifer Prendki dug into how the deficit of attention from experts to one of the most critical areas of the ML lifecycle — that of data preparation — is the likely cause for a still highly dysfunctional ML lifecycle. According to her, while general wisdom acknowledges that high-quality training data is necessary to build better models, the lack of a formal definition of a good dataset — or rather, the right dataset for the task — is the main bottleneck impeding the universal adoption of AI.
In ML, data preparation means building a high-quality training dataset. This entails asking questions such as: What data? Where to find that data? How much data? Where to validate that data? What defines quality? Where to store that data? How to organize that data?
Data labeling is the tip of the iceberg of data preparation. There are so many different pieces, including label validation, data storage, data augmentation, data selection, third-party data, synthetic data, data privacy, data aggregation, data fusion, data explainability, data scraping, feature engineering, feature selection, and ETL.
When we talk about data acquisition, we tend to think about data collection. There are both physical and operational considerations with this — ranging from where to collect the data to how much to collect initially. If we want to get data collection right, we need a feedback loop aligned with the business context.
Synthetic data generation is a second approach with pop-culture use cases such as DeepFakes. Many other use cases range from facial recognition and life sciences to eCommerce and autonomous driving. However, generating this synthetic data requires a lot of training data and compute. Furthermore, synthetic data is not realistic for unique use cases, and there is a lack of fundamental research on its impact on model training.
A third approach is known as data ‘scavenging’, where we scrape the web and/or use open-source datasets (Google Dataset Search, Kaggle, academic benchmarks).
Finally, we can acquire data by purchasing it (legally collected and raw/organized training data).
After acquiring the data, we would like to enhance it. The most critical part of data enhancement is, of course, data labeling. This is an industry of its own because we have to answer so many questions:
Indeed, getting the data labeling step right is extremely complicated because it is error-prone, slow, expensive, and often impractical. Efficient labeling operations would require a vetted process, qualified personnel, high-performance tools, its own lifecycle, a versioning system, and a validation process.
Even after getting the labels, we are not done yet. We need to validate the labels. This can be accomplished manually by annotators, an internal team, or a third party. Here are potential challenges:
Data augmentation is a scientific process where we can manipulate the data via flipping, rotation, translation, color changes, etc. However, scaling data augmentation to bigger datasets, negating memorization, and handling biases/corner cases are fundamental issues.
The next phase of the Data Preparation lifecycle is data transformation. This phase includes three steps:
Data triaging starts from the belief that, because the dataset is so large, we cannot be picky about the types of data that we are going to use. Thus, we need to catalog and structure the data methodically:
Finally, there is the concept of data selection. In any dataset, there will be high-value data (useful), redundant/irrelevant data (useless), and mislabeled/correlated/low-quality data (harmful).
Jennifer argued that
A data prep ops market is under-served, under-estimated, and misunderstood by the broader ML community.
Here are two advanced workflows that have shown promising potential for data prep ops:
In the future, data preparation needs to be elevated to a first-class citizen of the MLOps trilogy. Furthermore, data preparation is so complicated that it requires its own set of operations. Broadly speaking, the ML community needs a paradigm shift where (1) we do not consider data as a static object and (2) we understand that more data doesn’t always mean better performance.
Note: This is probably the most useful talk at the conference for me personally. I look forward to seeing how Jennifer’s company, Alectio, elevates the data prep ops market.
Diving further into the firehose of feature stores, Monte Zweben presented a whole new approach to MLOps that allows you to successfully scale your models without increasing latency by merging the database with machine learning. He started the talk by defining a feature store as a centralized repository of continuously updated raw data and transformed data for machine learning. The whole idea is to get data scientists to leverage each other’s work and take the mundane work out of their everyday feature engineering tasks.
In particular, the feature store takes data coming from the enterprise at different cadences (whether real-time events from websites and mobile apps or data sources from databases and data warehouses) and puts that raw data through transformations. These transformations may be batch or event-driven. Then the data lands into the feature store. Data scientists can search that feature store, form training sets from historical feature values, serve features in real-time to models, and look at the history of features (for governance and lineage purposes).
There are many requirements for a feature store. Monte listed a few below:
He then discussed the unique design of his company’s open-source platform, Splice Machine, which allows for the deployment of machine learning models as intelligent tables inside their unique hybrid (HTAP) database. Splice Machine’s simplicity comes from using one engine to store both online and offline features. This underlying data engine is a relational database that performs transactional workloads and analytical workloads in an ACID-compliant way.
What are the benefits of having a single store (vs. having separate online and offline stores)?
The underlying SQL platform can introspect and interrogate any SQL statement that comes to it. It does so thanks to a cost-based optimizer that uses statistics. Short, transactional statements are then executed in an Apache HBase key-value store. On the other hand, if you perform many table scans, joins, or aggregations, the cost-based optimizer will send that query to Apache Spark for execution.
A huge requirement of any feature store is solving the point-in-time consistency problem. This means keeping track of feature values as they change over time and keeping them consistent together to form training examples. How does Splice Machine accomplish that?
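To make the point-in-time problem concrete, here is a minimal, generic sketch in Python. It is not Splice Machine's mechanism; the `history` structure, the feature names, and the helper functions are all hypothetical. The idea is simply that each training example should only see the feature values that were known at or before the example's timestamp.

```python
from bisect import bisect_right

# Hypothetical feature history: feature name -> list of (timestamp, value),
# sorted by timestamp. A real feature store would keep this in tables.
history = {
    "avg_basket": [(1, 10.0), (5, 12.0), (9, 15.0)],
    "visits_7d": [(2, 3), (6, 4)],
}


def value_as_of(feature, ts):
    """Return the latest value recorded at or before ts, or None if none exists."""
    rows = history[feature]
    idx = bisect_right([t for t, _ in rows], ts)
    return rows[idx - 1][1] if idx else None


def training_row(label_ts, features):
    """Assemble one training example using only values known at label_ts."""
    return {name: value_as_of(name, label_ts) for name in features}


row_at_7 = training_row(7, ["avg_basket", "visits_7d"])
row_at_1 = training_row(1, ["avg_basket", "visits_7d"])
print(row_at_7)
print(row_at_1)
```

Note that the value written at timestamp 9 never leaks into the example labeled at timestamp 7, which is the consistency property a feature store has to guarantee.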
There are additional benefits when you couple a feature store with deployment. Many current ML systems use the same traditional deployment method: an endpoint (potentially containerized). Splice Machine also offers database deployment in a single click, taking advantage of the transactional database and its features to make model serving both transparent and efficient. Every new record automatically triggers a prediction, which is made and populated at sub-millisecond speed. There is no extra endpoint programming. Furthermore, this method enables easy model governance and traceability via SQL statements.
Transparency is probably the biggest benefit of all. The database deployment method records every prediction and which model produced it. We can then go back to other components of the system and look at which features were used. Because the features in the feature store have a history, we can trace back to the raw data from which the feature values were created.
Overall, the aspiration behind having this single feature store is to scale data science faster with less headcount.
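The general pattern of "a new record triggers a prediction inside the database" can be imitated with SQLite from Python's standard library. This is only an illustration of the idea, not Splice Machine's implementation: the `customers` table, the `predict` rule, and the `toy_rule_v1` model name are made up for the example.

```python
import sqlite3


def predict(visits, spend):
    """Toy stand-in for a model: flags customers with enough activity."""
    return 1 if 2 * visits + spend // 100 >= 10 else 0


conn = sqlite3.connect(":memory:")
# Some SQLite builds restrict application functions inside triggers;
# this pragma allows them (it is ignored by versions that lack it).
conn.execute("PRAGMA trusted_schema = ON")
conn.create_function("predict", 2, predict)

conn.execute(
    """CREATE TABLE customers (
           id INTEGER PRIMARY KEY,
           visits INTEGER,
           spend INTEGER,
           prediction INTEGER,
           model_id TEXT
       )"""
)

# Every inserted row is scored by the trigger; no serving endpoint is involved.
conn.execute(
    """CREATE TRIGGER score_on_insert AFTER INSERT ON customers
       BEGIN
           UPDATE customers
           SET prediction = predict(NEW.visits, NEW.spend),
               model_id = 'toy_rule_v1'
           WHERE id = NEW.id;
       END"""
)

conn.executemany(
    "INSERT INTO customers (id, visits, spend) VALUES (?, ?, ?)",
    [(1, 5, 200), (2, 1, 50)],
)

# Governance via plain SQL: each prediction sits next to the model that made it.
rows = conn.execute(
    "SELECT id, prediction, model_id FROM customers ORDER BY id"
).fetchall()
print(rows)
```

Because the prediction and the model identifier live in the same row as the inputs, tracing a prediction back to its features is an ordinary query.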
Note: Question for the readers — How do you compare Splice Machine and Tecton?
When many businesses start their journey into ML and AI, it’s common to place a lot of energy and focus on the code and the data science algorithms themselves. The reality is that the actual data science work and the machine learning models are only part of the enterprise machine learning puzzle. Successful ML adoption depends on many intertwined pieces, so collaboration is critical. Priyank Patel gave a presentation on how to effectively tackle the whole puzzle using the Cloudera Data Platform.
The Cloudera Data Platform is designed under the core principle that production ML requires an integrated data lifecycle — including data collection, data curation, data reporting, model serving, and model prediction. Any user of this platform can:
This full lifecycle enables collaboration between teams of data engineers, data scientists, and business users. More specifically, the platform team at Cloudera tailors unique offerings for each of these three personas:
For Data Scientists: Cloudera Data Science Workbench offers a collaborative development environment, flexible resource allocations, notebook interface, and personalized repetitive use cases with Applied Research documents created by Cloudera’s Fast Forward Labs.
For Data Engineers: Cloudera Data Engineering offers an integrated, purpose-built experience for data engineers with these capabilities:
For Business Users: Cloudera Data Visualization enables the creation, publication, and sharing of interactive data visualizations to accelerate production ML workflows from raw data to business impact. With this product:
Chip defined these two levels of real-time machine learning:
Latency matters a lot in online predictions. A 2009 study from Google shows that increasing latency from 100 to 400 ms reduces searches by 0.2% to 0.6%. Another 2019 study from Booking.com shows that a 30% increase in latency cost about a 0.5% decrease in conversion rate. The crux is that no matter how great your models are, users will click on something else if the models take just milliseconds too long.
In the last decade, the ML community has gone down the rabbit hole of building bigger and better models. However, they are also slower, as inference latency increases with model size. One obvious way to cope with longer inference time is to serve batch predictions. In particular, we (1) generate the predictions in batches offline, (2) then store them somewhere (SQL tables, for instance), and (3) finally pull out pre-computed predictions given users’ requests.
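The three batch-prediction steps above can be sketched in a few lines of Python. Everything here is hypothetical: the `model` rule, the user features, and a plain dictionary standing in for the SQL table.

```python
def model(features):
    """Toy stand-in for a slow model: recommends the most-played genre."""
    return "rock" if features["rock_plays"] > features["jazz_plays"] else "jazz"


# Features for the users known at batch time (hypothetical data).
known_users = {
    "u1": {"rock_plays": 12, "jazz_plays": 3},
    "u2": {"rock_plays": 1, "jazz_plays": 8},
}

# (1) Generate predictions offline and (2) store them;
# the dict plays the role of the SQL table.
prediction_table = {uid: model(feats) for uid, feats in known_users.items()}


def serve(user_id):
    """(3) At request time, only look up the pre-computed prediction."""
    return prediction_table.get(user_id)


print(serve("u1"))
print(serve("brand_new_user"))
```

Serving is a cheap lookup regardless of how slow the model is, but a user who did not exist at batch time gets nothing back until the next batch run.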
However, there are two main problems with batch predictions:
Online predictions can address these problems because (1) the input space is infinite and (2) dynamic features are the inputs. In practice, online predictions require two components: fast inference (models that can make predictions on the order of milliseconds) and a real-time pipeline (one that can process data and serve models in real-time).
A model that serves online predictions would need two separate pipelines for streaming data and static data. This is a common source of errors in production when two different teams maintain these two pipelines.
Traditional software systems rely on REST APIs, which are request-driven. Different micro-services within the systems communicate with each other via requests. Because every service does its own thing, it’s difficult to map data transformations through the entire system. Furthermore, debugging it would be a nightmare if the system goes down.
An alternative approach to the above is the event-driven pub-sub way, where all services publish and subscribe to a single stream to collect the necessary information. Because all of the data flows through this stream, we can easily monitor data transformations.
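As a rough illustration of the pub-sub idea, here is a tiny in-memory stream in Python. It is a sketch only: the `EventStream` class, the topic names, and the pricing rule are invented for the example, and a production system would use a real streaming platform rather than a Python list.

```python
from collections import defaultdict


class EventStream:
    """Minimal in-memory stand-in for a shared event stream."""

    def __init__(self):
        self.log = []  # every event ever published, in order
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self.log.append((topic, payload))
        for handler in list(self.subscribers[topic]):
            handler(payload)


stream = EventStream()


def feature_service(event):
    # Transforms the raw event into features and publishes them back.
    stream.publish("features_ready", {"distance_km": event["distance_km"], "surge": 2})


def pricing_service(features):
    # Consumes features and publishes a prediction (toy pricing rule).
    price = 5 + features["surge"] * features["distance_km"]
    stream.publish("price_predicted", {"price": price})


stream.subscribe("ride_requested", feature_service)
stream.subscribe("features_ready", pricing_service)
stream.publish("ride_requested", {"distance_km": 4})

# Every transformation is visible in one place: the stream's log.
topics = [topic for topic, _ in stream.log]
print(topics)
print(stream.log[-1])
```

The services never call each other directly; inspecting `stream.log` is enough to follow a request through every transformation.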
There are several barriers to stream processing:
There is a small distinction between online learning and online training. Online training means learning from each incoming data point, which can suffer from catastrophic forgetting and can get very expensive. On the other hand, online learning means learning in micro-batches and evaluating the predictions after a certain period of time (whether offline or online). This is often designed in tandem with offline learning.
The biggest use case for online learning right now is recommendation systems, because user feedback provides natural labels. However, not all recommendation systems need online learning, especially for slow-to-change preferences such as static objects. For quick-to-change preferences such as media artifacts, online learning would indeed be helpful.
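Here is a small Python sketch of the micro-batch flavor described above, under heavy simplifying assumptions: a one-parameter linear model, a synthetic stream where `y = 2x`, and hypothetical helper names. Setting `size=1` in `micro_batches` would turn it into the per-data-point variant.

```python
def micro_batches(stream, size):
    """Group an incoming stream of (x, y) points into small batches."""
    batch = []
    for point in stream:
        batch.append(point)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def update(w, batch, lr=0.01):
    """One gradient step on the mean squared error of the batch."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


# Synthetic stream (an assumption for the demo): x cycles 1..4 and y = 2x.
stream = [((i % 4) + 1, 2.0 * ((i % 4) + 1)) for i in range(200)]

w = 0.0  # model parameter, updated as data arrives
for batch in micro_batches(stream, size=4):
    w = update(w, batch)

print(round(w, 3))
```

The model is never retrained from scratch; it is nudged by each micro-batch, and its predictions can be evaluated between updates.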
There are also other use cases for online learning, such as dealing with rare events, tackling the cold-start problem, or making predictions on edge devices.
Finally, here are the barriers to online learning:
Note: Be sure to read Chip’s blog post for the complete overview of real-time machine learning!
Food for thought: how should we design an MLOps tool for real-time purposes?
In the past 12 months, there have been myriad developments in the machine learning field. Not only have we seen shifts in tooling, security, and governance needs for organizations, but we’ve also witnessed massive changes in the field due to the economic impacts of COVID-19. Every year, Algorithmia surveys business leaders and practitioners across the field for an annual report about the state of machine learning in the enterprise. Diego Oppenheimer, the founder and CEO of Algorithmia, shared the top 10 trends driving the industry in 2021 and his tips for organizations that want to succeed with AI/ML in the coming year.
Overall, here are the 4 themes presented:
Here are the 3 trends regarding priority shifts:
Here are the 4 trends regarding the challenges:
Here are the 2 trends regarding technical debt:
The last trend in MLOps preferences is improving outcomes with MLOps solutions.
Diego concluded the talk with these notes:
The landscape of ML tooling has become richer and richer over the last few years. New tools are coming out every few weeks that solve that “one nagging problem” in the ML workflow. The result is a jungle of opinionated tooling in the ecosystem that can easily become overwhelming for machine learning engineers and leaders. Here are the common challenges organizations face when scaling their ML operations: limited data access, limited ML infrastructure, disconnected ML workflow, tech mismatch, and limited visibility. The truth is that ML engineers often spend 50–70% of their time stitching phases together and maintaining technical debt, which includes glue code, pipeline jungles, and dead experimental code paths.
Unless you work for a top tech giant, the chances are that you have spent months building your current ML setup as follows:
Mohamed Elgendy shared a template of an end-to-end ML workflow stack that you need to consider when building your own ML workflow. He first identified the important characteristics of an ML workflow to look for:
He then outlined the ML stack template as seen above (the yellow components should be prioritized):
Note: Join the Kolena community if you are interested in this type of content.
That’s the end of this long recap. Follow the TWIML podcast for their events in 2021! Let me know if any particular talk stands out to you. My future articles will continue covering lessons learned from future conferences/summits in 2021 🎆
This post was written by James Le, Developer Relations at Superb AI. The Original Post Link: