Introduction to ML Ops: How to start integrating ML solutions in your strategy
Machine learning (ML) has been a buzzword for the past decade, with more and more companies interested in, or already investing in, technologies that implement it. The reason is simple: ML models provide a plethora of functionalities, including recommendation, classification, prediction, and content generation, among many others. Libraries, frameworks, and code examples already exist to facilitate the creation of machine learning models for different use cases.
These tools work very well for creating PoC (Proof of Concept) prototypes. However, moving a prototype into a production environment (where the tested, completed model runs on live data) is no simple feat. In fact, according to the team from Canonical (commercial support for Ubuntu) in [1], once you have built the model, you’re only 20% done in terms of the whole ML model lifecycle. The other 80%, the model deployment, is so complex that, as Gartner reports in [2], only 53% of ML projects make it successfully into production.
What makes this deployment so hard?
In the world of software engineering, a set of modern practices for efficiently writing, integrating, and deploying software, referred to as DevOps, already exists. ML models, however, have an extra layer of complexity on top of the code: the data. A model is created by applying an algorithm to a set of training data, and that data affects the behaviour of the resulting artifact in production.
Additionally, this behaviour is also affected by the input data received at prediction time, which can’t be known in advance. This means that although the code can be controlled by the aforementioned practices, data is unpredictable because it constantly changes. A controlled connection between code and data needs to be created, and this is where Machine Learning Operations (or ML Ops) was born.
Machine Learning Operations (ML Ops)
Machine Learning Operations (or ML Ops) is a set of practices that standardize and streamline the deployment of machine learning (ML) models into a production or live environment. As mentioned before, it combines the machine learning knowledge needed to create the models, the DevOps practices that control the software lifecycle, and the data engineering needed to handle constantly changing data. It therefore draws from all these disciplines, as shown below:
Consequently, professionals from these different areas need to collaborate to achieve a complete ML lifecycle. MLOps extends the traditional DevOps paradigm in the following ways:
- Continuous integration (CI) applies to testing and validating data, schemas, and models, not only code and components.
- Continuous deployment (CD) no longer refers to a single software package or service, but to a system (an ML training pipeline) that should automatically deploy another service (the model prediction service).
- Continuous training (CT) is unique to ML systems and is concerned with automatically retraining and serving models.
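To make the CI point concrete, the sketch below shows the kind of data and schema check that could run alongside ordinary unit tests. It is only an illustration: the schema, the field names (`age`, `income`, `country`) and the bounds are hypothetical and do not come from any particular tool.

```python
from typing import Any, Optional

# Hypothetical schema: field name -> (expected type, (min, max) bounds or None).
EXPECTED_SCHEMA = {
    "age": (int, (0, 120)),
    "income": (float, (0.0, None)),
    "country": (str, None),
}


def validate_record(record: dict, schema: Optional[dict] = None) -> list:
    """Return a list of human-readable violations (empty if the record is valid)."""
    schema = EXPECTED_SCHEMA if schema is None else schema
    errors = []
    for name, (expected_type, bounds) in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value: Any = record[name]
        if not isinstance(value, expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if bounds is not None:
            low, high = bounds
            if low is not None and value < low:
                errors.append(f"{name}: {value} is below minimum {low}")
            if high is not None and value > high:
                errors.append(f"{name}: {value} is above maximum {high}")
    return errors
```

A CI job could call `validate_record` on a sample of incoming data and fail the build when the returned list is not empty, in the same way a failing unit test would.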
Combining these disciplines and their adaptations makes it possible to integrate the code and data aspects through a defined set of steps, or machine learning pipelines.
As shown in the previous diagram, the modifications to the CI/CD pipeline and the data engineering transformations allow us to connect the data and code planes in a structured and automated way that facilitates ML deployment.
As mentioned in the continuous deployment point above, there should be two pipelines: one that trains the model and another that serves the predictions. Both pipelines require similar data transformations, which might vary because, for instance, the serving pipeline requires fewer data features. Nonetheless, the pipelines should be consistent, as this allows code and data to be reused efficiently.
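One common way to keep the two pipelines consistent is to route both through a single transformation function. The following is a minimal sketch of that idea; the feature names (`income`, `age`, `label`) and the transformations themselves are assumptions chosen only for illustration.

```python
import math


def transform(raw: dict) -> list:
    """Feature transformation shared by the training and serving pipelines."""
    return [
        math.log1p(raw["income"]),  # compress the long tail of incomes
        raw["age"] / 100.0,         # rescale age to roughly [0, 1]
    ]


def build_training_rows(records: list) -> list:
    """Training pipeline: transformed features paired with the known label."""
    return [(transform(r), r["label"]) for r in records]


def build_serving_row(request: dict) -> list:
    """Serving pipeline: the same features, but no label is available."""
    return transform(request)
```

Because both entry points call `transform`, a change to the feature logic reaches training and serving at the same time, which reduces the risk of the two drifting apart.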
ML pipelines
Bearing the previous information in mind, the following diagram shows one of the most established formalizations of the sequence of steps, or pipeline, needed to deploy an ML model. It was proposed by ThoughtWorks in [3], who call it Continuous Delivery for Machine Learning (CD4ML):
Following the sequence:
- Data engineers identify and prepare the data for training by cleaning, formatting, labelling, organizing, etc. (DataOps)
- Data scientists build the model by experimenting with different approaches to find the best candidate. This is done by writing code that trains a given model with the data prepared in the previous step. The model is selected using error analysis, error measurement, and model performance metrics.
- Developers or DevOps engineers deploy the model into a production environment. This is done by packaging the model and sending it to the desired environment, in the cloud or on edge devices. The model can be embedded in an application, served as an API (known as model serving), deployed as a Docker container, or saved in a standardized framework format (e.g., TensorFlow SavedModel, PMML, PFA, or ONNX), among other strategies [1].
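The model selection and packaging steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: the two candidate models, the mean-absolute-error metric, and the JSON artifact are stand-ins chosen to keep the example self-contained. A real project would more likely use an ML framework and one of the formats listed above.

```python
import json
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    return {"kind": "mean", "intercept": mean(ys), "slope": 0.0}


def fit_linear(xs, ys):
    """One-feature least-squares candidate."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return {"kind": "linear", "intercept": my - slope * mx, "slope": slope}


def predict(model, x):
    return model["intercept"] + model["slope"] * x


def mean_absolute_error(model, xs, ys):
    return mean(abs(predict(model, x) - y) for x, y in zip(xs, ys))


def select_best(train, valid):
    """Train every candidate, keep the one with the lowest validation error."""
    candidates = [fit(*train) for fit in (fit_mean, fit_linear)]
    return min(candidates, key=lambda m: mean_absolute_error(m, *valid))


def package(model) -> str:
    """Serialize the model parameters into a portable artifact (JSON here)."""
    return json.dumps(model)


def load(artifact: str):
    """Restore a model from its artifact in the serving environment."""
    return json.loads(artifact)
```

The serving side only needs `load` and `predict`, so the artifact can be shipped to another environment without the training code.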
Viewed as a continuous cycle, as is done in the DevOps approach, the process would look like this according to [4]:
This cycle adds a scoping phase, a preliminary check of whether the problem can be addressed using a machine learning model and whether relevant real-world data for the use case is available. The data engineering, modelling, and deployment phases are conducted as explained before, but they include a feedback loop, since models need to learn from user inputs and predictions.
Moreover, the cycle adds a monitoring component composed of two parts: (1) monitoring the infrastructure where the model is deployed for load, usage, and overall health; and (2) monitoring the model's performance (accuracy, loss, bias, and data drift), which tells us whether the model is performing as expected in a real-world scenario.
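As an illustration of the model-monitoring part, the sketch below flags data drift for a single numeric feature by measuring how far the live mean has moved from the training mean, in units of the training standard deviation. The metric and the threshold of 2.0 are assumptions made for this example; production systems often use more robust statistical tests.

```python
from statistics import mean, stdev


def drift_score(reference, live):
    """Shift of the live mean, measured in reference standard deviations."""
    ref_std = stdev(reference)
    shift = abs(mean(live) - mean(reference))
    if ref_std == 0:
        # A constant reference feature: any change at all counts as drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / ref_std


def has_drifted(reference, live, threshold=2.0):
    """Return True when the drift score exceeds the (hypothetical) threshold."""
    return drift_score(reference, live) > threshold
```

A monitoring job could run `has_drifted` on each feature at regular intervals and raise an alert, or trigger retraining, when it returns `True`.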
MLOps levels
As seen previously, the ML code is just the first of many steps needed before a model can make an effective contribution in live, real-world systems:
As can be seen in the diagram, a vast number of complex activities surrounding the ML code need to be carried out before deployment. Ideally, all these activities should be completely automated, but this depends on the use case. For that reason, [6] proposes different MLOps levels that help assess the level of automation needed for your models:
- Level 0: the process of building and deploying the ML model is entirely manual. This is sufficient for models that are rarely changed or retrained.
- Level 1: the model is trained continuously by automating the ML pipeline. This is a good fit for models that need to be retrained because of new data, but it is not sufficient to rapidly test other ML ideas or new pipeline components.
- Level 2: a robust, automated CI/CD system is in place. This is needed when you want to give data scientists a rapid way to explore feature engineering, model architecture, and hyperparameters.
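To give a feel for what Level 1 automation involves, the following sketch shows a simple retraining trigger. The two conditions (amount of new data and drop in accuracy) and their default thresholds are hypothetical; they are not taken from [6] and would need tuning for each use case.

```python
def should_retrain(
    new_rows: int,
    baseline_accuracy: float,
    live_accuracy: float,
    min_new_rows: int = 10_000,        # hypothetical threshold
    max_accuracy_drop: float = 0.05,   # hypothetical threshold
) -> bool:
    """Decide whether the automated pipeline should retrain the model."""
    enough_new_data = new_rows >= min_new_rows
    degraded = (baseline_accuracy - live_accuracy) > max_accuracy_drop
    return enough_new_data or degraded
```

At Level 0 a person would make this decision by hand. At Level 1 a scheduler could evaluate it periodically and start the training pipeline whenever it returns `True`.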
MLOps benefits and costs
According to Forbes, the ML Ops industry will be worth around four (4) billion by 2025 [5], which makes sense given the number of benefits it offers:
- Automatic updating of multiple pipelines
- ML model scalability and management
- ML model health and governance
- Helps handle the unpredictability and quality of data
- Facilitates collaboration using CI/CD
Although these are desirable characteristics, as mentioned before, they require additional effort in certain areas, which can be summarized as:
- Development: more frequent parameter, feature, and model manipulation; experimentation is non-linear, unlike the traditional DevOps workflow
- Testing: data and model validation
- Monitoring: ML Ops systems require continuous monitoring and auditing for accuracy. For this, different types of monitoring are needed: memory usage monitoring when doing predictions, model performance monitoring and infrastructure monitoring.
As can be seen, ML Ops (Machine Learning Operations) is becoming a prerequisite for companies deploying ML in production, much like DevOps for traditional software applications.
However, ML Ops tools cannot simply be applied as a plug-and-play approach because there are numerous additional considerations. In particular, data and its complexities, as well as model construction and training, are new concerns that introduce new types of computational requirements. Nonetheless, the benefits of setting these automated pipelines into place, as shown before, will be worth the effort.
Additionally, there’s great interest in this particular field, so it won’t take long for the implementation of these processes to become smoother and more reliable.
Additional references
[7] Breuel, C., 2021. ML Ops: Machine Learning as an Engineering Discipline. [online] Medium. Available at: <https://towardsdatascience.com/ml-ops-machine-learning-as-an-engineering-discipline-b86ca4874a3f> [Accessed 5 November 2021].
[8] Sciforce, 2021. MLOps: Comprehensive Beginner’s Guide. [online] Medium. Available at: <https://medium.com/sciforce/mlops-comprehensive-beginners-guide-c235c77f407f> [Accessed 5 November 2021].
[9] Google Inc., 2021. MLOps: Continuous delivery and automation pipelines in machine learning. [online] Google Cloud. Available at: <https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning> [Accessed 5 November 2021].