
What's new in MLOps?

December 12, 2022

MLOps involves a lot of moving parts across business, data, code, and model engineering at each phase of the machine learning lifecycle, introducing questions that are unique to ML.

Over the past few years, the adoption of cloud engineering, a better understanding of big data, and the massive popularity of open-source libraries for predictive modeling have made it easy for every company to be enticed by the possibility of generating user insights or personalizing their “software 1.0” by hiring a team of data scientists.

However, to most companies' surprise, data scientists alone are far from adequately equipped to guide the end-to-end deployment, or to monitor and debug these models once they are in production.

It truly takes an army: some of it in-house (data engineers, ML engineers) and some outsourced to the AI tooling/SaaS companies. And while every company's data team is substantially different, there are some core challenges that are common to everyone.

  1. For a machine learning model to be considered successful, it must generate stakeholder buy-in. It therefore becomes incredibly important to tie models to business KPIs rather than to accuracy metrics (F1, recall, precision, ROC AUC). However, with business KPIs changing every so often across different stages, it becomes incredibly hard to measure model performance consistently.

  2. For any business to build powerful and reliable ML models, investing effort into creating and maintaining a data catalog is crucial, both to track metadata and, while debugging, to retrieve which data source a model was trained on. Building a data catalog may not seem like a hard task, but the real challenge is building relevancy into data discovery, and this is often where companies give up. If you instead opt for a commercial solution, most out-of-the-box data cataloging products do not adapt well to different organizations' data needs and cost several kidneys and more. Requesting a feature can put you on nothing short of an eight-to-ten-month waitlist, optimistically speaking, and that is if the requested feature even aligns with their product plan. The final option, building an in-house solution, requires upfront investment and a team with an excellent understanding of user-friendly database design practices, making it a time- and resource-consuming process. To make it even harder, there is little documentation around best practices for creating, managing, and scaling an in-house data cataloging tool, or around evaluation/compliance metrics that keep the catalog from ending up incomplete, especially with new live data being streamed into the system, which can make the effort futile at best.

  3. Your machine-learning model is only as good as your data. For any data science project to be successful, data quality and, more importantly, labeled data quantity are the biggest defining factors. However, best practices for how to standardize and normalize new incoming data are still case-by-case considerations. Most training environments need to come pre-loaded with a few checks and balances based on the different stages of model deployment. For example, for a model being tested for production, has a random seed been set so that the data is split the same way every time the code is run? (A reproducibility check along these lines is sketched after this list.)

  4. While there are many advantages to using commercial feature stores, they can also introduce inflexibility and limit the customization of models, and sometimes you simply don't need them (more on this in next month's post). This inspires many teams to go with open-source solutions and develop their own on top of, say, Feast or DVC. While batch features may be easier to maintain, real-time features are sometimes inescapable for several reasons. Real-time features introduce a lot of complexity into the system, especially around back-filling real-time data from streaming sources with data-sharing restrictions. This requires not only technical but also process controls, which are often not talked about. Recently, there has been more discussion around Data Contracts; however, they are not yet a commonly accepted practice across organizations.

  5. There is a lack of well-defined, commonly accepted best practices around model version control or project structure at different stages, from exploration to deployment. Cookiecutter is one of the efforts toward developing a unified project structure for cross-team collaboration.

    Github Template for ML/DS Projects

    Undefined or poorly defined prerequisites for when to push a model to production can create unnecessary bugs and introduce delays during monitoring and debugging.

  6. Code reviews: how much time should be spent on code review at the different stages, especially given that offline model behavior may not accurately represent behavior on live data, and how frequently should reviews happen? Different companies currently have different systems for this. While some prefer one-off deployment, others have more granular deployment stages, e.g. test, dev, staging, shadow, and A/B for business-critical pipelines, each with different review stages and guidelines. However, even the end-to-end tools have no built-in support for this. As of now, what makes good-quality production code is very much institutional knowledge.

  7. While it is clear to everyone that test-driven development is critical for catching minor errors early in the deployment stage, how much time and effort should be invested in it, given that large samples of data can only be gathered once the model is deployed in production?

  8. Should we use static validation sets to test models in production, which can introduce bias, or dynamic validation sets that more closely resemble live data and address localized shifts in the data? (A rolling-window validation sketch follows this list.)

  9. Should we use model registries, or change only config files instead of the model itself, making it easier to debug? For the former, if model validation passes all checks, the model is usually promoted to a model registry (a registry promotion sketch follows this list).

  10. Having clearly defined rule-based tests to make sure model outputs are not incorrect, while factoring in when it is okay to give an incorrect output (e.g. shopping recommendations) versus when it is better to give no output at all (e.g. cancer prescriptions). A minimal sketch of such checks follows this list.

  11. Best practices around code quality and a need for consistent deployment environments. Most data scientists prefer working in Jupyter notebooks; however, the way code is usually written in notebooks (copy-paste instead of reusable functions) can introduce unnecessary bugs and technical debt, affecting both the model and the integration code once the notebook owner leaves the team.

  12. While experiment tracking tools and dashboards have added considerable observability to model runs, contextual changes still remain largely undocumented.

  13. While sandbox tools for stress-testing can be quite useful in some scenarios, in others, e.g. recommender systems, they may not generate any useful information whatsoever.

  14. Deciding which alerts are critical and require a quick migration to a failsafe model (e.g. hate speech, racial or gender bias) and which ones are merely information to be factored into the next model configuration phase still requires close human monitoring.
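
A few quick sketches for some of the points above. For point 3, this is a minimal reproducibility check, not tied to any particular tool: fix the random seed so the train/validation split is identical on every run. The dataset, column names, and threshold are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so every run produces the same split


def split_data(df: pd.DataFrame, label_col: str = "label"):
    """Split features/labels with a fixed random_state for reproducibility."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)


def check_split_is_deterministic(df: pd.DataFrame) -> bool:
    """Sanity check: two independent splits must yield identical row indices."""
    X_tr1, _, _, _ = split_data(df)
    X_tr2, _, _, _ = split_data(df)
    return X_tr1.index.equals(X_tr2.index)


if __name__ == "__main__":
    # hypothetical toy dataset for illustration
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"f1": rng.normal(size=100),
                       "f2": rng.normal(size=100),
                       "label": rng.integers(0, 2, size=100)})
    assert check_split_is_deterministic(df), "Split is not reproducible"
```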
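
For point 8, one middle ground between a frozen static validation set and fully live evaluation is a rolling, time-ordered validation scheme, so that each validation fold looks more like the data the model will actually see. A minimal sketch assuming the rows are sorted by event time; the model and metric here are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit


def rolling_validation(X: np.ndarray, y: np.ndarray, n_splits: int = 5):
    """Evaluate on successive, later slices of the data instead of one frozen holdout."""
    scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], preds))
    # a downward trend in later folds hints at drift a static holdout would hide
    return scores


if __name__ == "__main__":
    # hypothetical data, assumed to be sorted by event time
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
    print(rolling_validation(X, y))
```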
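
For point 9, if you do go the registry route, the usual flow is to run the validation checks and, only if they pass, register the model version and record why it was promoted. Here is a minimal sketch using MLflow's model registry as one example; the model name, metric name, threshold, and the assumption that the model artifact was logged under "model" are all hypothetical, and API details vary across MLflow versions.

```python
import mlflow
from mlflow.tracking import MlflowClient

MODEL_NAME = "churn-classifier"  # hypothetical registered-model name
MIN_VAL_AUC = 0.80               # hypothetical promotion threshold


def validate(run_id: str) -> bool:
    """Placeholder validation gate: read the logged metric from the run."""
    run = MlflowClient().get_run(run_id)
    return run.data.metrics.get("val_auc", 0.0) >= MIN_VAL_AUC


def register_if_valid(run_id: str):
    if not validate(run_id):
        raise ValueError("Validation failed; model not registered")
    # register the artifact logged under 'model' in this run as a new version
    version = mlflow.register_model(f"runs:/{run_id}/model", MODEL_NAME)
    # record why it was promoted, instead of leaving it as tribal knowledge
    MlflowClient().set_model_version_tag(
        MODEL_NAME, version.version, "validated_against", "val_auc>=0.80"
    )
    return version
```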
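
And for point 10, rule-based output checks can be as simple as asserting domain invariants on predictions before they are served, with the failure behavior depending on the stakes: fall back to a safe default for low-stakes outputs, return nothing and escalate for high-stakes ones. A generic sketch; all rules, thresholds, and names here are hypothetical.

```python
from typing import Callable, List, Optional

# each rule returns True if the prediction is acceptable
Rule = Callable[[float], bool]

LOW_STAKES_RULES: List[Rule] = [
    lambda p: 0.0 <= p <= 1.0,      # probabilities must be valid
]

HIGH_STAKES_RULES: List[Rule] = LOW_STAKES_RULES + [
    lambda p: abs(p - 0.5) >= 0.2,  # refuse to act on borderline scores
]


def guarded_output(pred: float, rules: List[Rule],
                   fallback: Optional[float] = None) -> Optional[float]:
    """Return the prediction only if every rule passes; otherwise the fallback.

    For shopping recommendations the fallback might be a popularity baseline;
    for a medical decision the fallback is None, i.e. no answer, and a human
    takes over.
    """
    if all(rule(pred) for rule in rules):
        return pred
    return fallback


# usage sketch
print(guarded_output(0.93, LOW_STAKES_RULES, fallback=0.5))    # 0.93
print(guarded_output(0.55, HIGH_STAKES_RULES, fallback=None))  # None -> escalate
```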

I am working on a longer blog post on the same topic for MLOps.Community. To receive it first when it goes out, subscribe to my Substack. It's free, and I send out one letter a month, exclusively on MLOps.


Key Papers in MLOps (as of Dec 2022)

Some more MLOps papers 📜 you may find interesting if you want to read a little further into the best practices and challenges of machine learning models deployed in production.

  1. This one is not a paper but a blog post from Microsoft along the same lines as Google's 2014 technical-debt paper by Sculley et al.: 💬 Technical debt in Machine Learning: Pay off this “high interest rate credit card” sooner rather than later.

  2. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system and for reducing the technical debt of ML systems. In this paper, The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction, the authors present 28 specific tests and monitoring needs.

  3. In this paper, Large-scale machine learning systems in real-world industrial settings, the authors identify a total of 23 challenges and 8 solutions related to the development and maintenance of large-scale ML-based software systems in industrial settings.

  4. Remember distill.pub's Research Debt article that caused a massive debate in the ML research community about reproducibility? In this paper, Do machine learning platforms provide out-of-the-box reproducibility?, the authors propose a framework for it.

  5. 😍 by none other than Zachary Lipton et al., a no-brainer if you have read his excellent work on interpretability. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift proposes combining dimensionality reduction with two-sample testing to detect dataset shift.

  6. One of my favorite papers is Large Scale Distributed Neural Network Training through Online Distillation by Rohan Anil et al., which talks about the test-time cost of ensemble modelling and a far more cost-effective alternative, online distillation.

  7. Privacy and security are such an important part of software development, yet not often talked about in ML. In this paper, Adversarial Machine Learning: Industry Perspectives, the authors interviewed 28 organizations to propose amendments to the Security Development Lifecycle for industrial-grade software in the ML era.

  8. Last but not least is Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure, which proposes a rigorous framework for dataset development transparency that supports decision-making and accountability.

Key Papers in MLOps (as of Oct 2022)

  1. The top one would be Machine Learning: The High-Interest Credit Card of Technical Debt by D. Sculley et al. We invited him as a guest on our MLOps Community podcast (Spotify/iTunes), Episode #32, which is definitely worth listening to!

  2. If there's one I would definitely read, it would be Machine Learning Operations (MLOps): Overview, Definition, and Architecture by Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. It highlights the necessary principles, components, and associated architecture and workflows in MLOps: arxiv.org/abs/2205.02302

  3. A recent one is Operationalizing Machine Learning: An Interview Study by Shreya Shankar et al., which interviews 18 MLOps practitioners and discusses common practices across the different stages of an ML project, from experimentation to deployment to monitoring.

  4. While written as a guide for academia, How to avoid machine learning pitfalls: a guide for academic researchers by Michael A. Lones offers best practices that are generally applicable to all data scientists and ML engineers.

  5. This one by Cote et al. describes the researchers' approach to designing a study that will hopefully guide how to build quality assurance tools for ML software systems (the study itself is yet to come out), but it does bring attention to an open challenge.

  6. Next is a paper about how to address the engineering challenges of distributed training when you don't have the infrastructure to match the big corporations with near-infinite compute and a million hyperparameters: Training Transformers Together.

  7. Of course, the list wouldn't be complete without a discussion of Jupyter notebooks. But what is the difference between code written in notebooks and code written as scripts, and what are the pros and cons of each? 📜 A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts

  8. But what about production infrastructure? How do you handle data stalls in the pre-processing pipeline? Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

  9. Last, how can all the progress in machine learning guide the future of chip design? This paper by Jeff Dean provides an interesting outlook on hardware for software folks: The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design


Citation

  @article{abi2022,
    title   = "Problems & Challenges in MLOps",
    author  = "Aryan, Abi",
    journal = "abiaryan.com",
    year    = "2022",
    month   = "Dec",
    url     = "https://abiaryan.com/posts/mlops-open-problems/"
  }