MLFlow as a core driver of your ML CI/CD workflow
Manoj Mahalingam (~manoj57)
At Avalara, we have thousands of models live in production, and many are updated at least once a week. As we scaled the number of models in our production systems and increased the frequency with which we deployed them, it became apparent that reproducing the steps involved in creating a model was becoming an increasingly manual task. And at times, it was voodoo!
Tracking which training data was used to build a model, the prep and pre-processing needed for the data, the hyperparams used, library versions etc. meant going through Jupyter notebooks or even the shell history of the person building the model! Add to that the fact that the models were being rebuilt every week and pushed to production, and that there were thousands of them, and things soon slowed to a crawl: we just couldn’t reliably push to production anymore.
We decided that a model versioning system was the need of the hour. We see a model repository as being similar to other artifact repositories like Maven and Ivy. It should help us add and track models built with different libraries (scikit-learn, MLlib, fastText etc.) along with all the associated metadata like the hyperparams and metrics. Essentially, everything that went into training the model (the notebook itself or library version, training data, hyperparams etc.) and all the output (including the models themselves along with all the relevant metrics, confusion reports, processed training data etc.) should be versioned and available for consumption and introspection.
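To make the idea concrete, here is a minimal sketch of the kind of record such a model repository might keep for each model version. It is illustrative only: the `ModelVersion` class, its field names, the `fingerprint` helper, and all the values in the usage lines are assumptions made for this example, not the schema we actually used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def fingerprint(data: bytes) -> str:
    """Return a short, stable content hash for a blob of training data."""
    return hashlib.sha256(data).hexdigest()[:12]


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of everything that went into and came out of one training run."""

    # Inputs to training
    name: str
    library: str
    library_version: str
    training_data_hash: str
    hyperparams: dict = field(default_factory=dict)
    # Outputs of training
    metrics: dict = field(default_factory=dict)
    artifacts: tuple = ()

    def to_json(self) -> str:
        """Serialize the record so it can be stored and inspected later."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = ModelVersion(
    name="example-classifier",
    library="scikit-learn",
    library_version="0.20.0",
    training_data_hash=fingerprint(b"id,label\n1,a\n2,b\n"),
    hyperparams={"C": 1.0},
    metrics={"f1": 0.9},
    artifacts=("model.pkl", "confusion_report.txt"),
)
print(record.to_json())
```

Hashing the training data rather than storing only its path means that two runs on different data can never be mistaken for the same version, even if the file name is reused.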
After an internal attempt at building this ourselves, we changed our approach and integrated MLFlow into our internal ML Platform. MLFlow is an open source platform for the end-to-end machine learning lifecycle. Its tracking component fit well as the model repository within our platform, and its Tracking API and UI made tracking models and experiments straightforward. With its AWS SageMaker support, we were also able to shorten our model building time and reduce costs by moving away from dedicated training instances.
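As a rough sketch of what recording a run with the Tracking API can look like, the snippet below logs hyperparams, metrics, and artifacts for one training run. The calls `mlflow.set_experiment`, `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric`, and `mlflow.log_artifact` are part of MLFlow's Python API; the `flatten_params` and `log_training_run` helpers and the dotted-key convention are assumptions made for this example, not necessarily how our platform does it. The tracking server is assumed to be configured through the `MLFLOW_TRACKING_URI` environment variable.

```python
def flatten_params(params: dict, prefix: str = "") -> dict:
    """Flatten nested hyperparams into dotted keys, since MLFlow params are flat key/value pairs."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_training_run(experiment: str, params: dict, metrics: dict, artifact_paths=()):
    """Record one training run in MLFlow and return its run ID (hypothetical helper)."""
    # Imported lazily so flatten_params stays usable without MLFlow installed.
    import mlflow

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        for key, value in flatten_params(params).items():
            mlflow.log_param(key, value)
        for key, value in metrics.items():
            mlflow.log_metric(key, value)
        for path in artifact_paths:
            # Uploads a local file (model, report, processed data) to the run.
            mlflow.log_artifact(path)
        return run.info.run_id
```

A training script would call `log_training_run("example-experiment", {"svm": {"C": 1.0}}, {"f1": 0.9}, ["model.pkl"])` at the end of training; the experiment name and values here are placeholders.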
We also integrated MLFlow into our CI/CD tool of choice, GoCD, with an open source plugin (https://github.com/indix/mlflow-gocd). With this plugin we are able to tag and "promote" experiments from MLFlow to our CI/CD system, which triggers the builds, test data verification, and finally deployment of the models to production. MLFlow thus became a core component of our existing CI/CD workflow and our ML Platform, without our having to replace components that already existed.
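The promotion step can be pictured as a small selection rule: among the runs of an experiment, find the newest one that has been tagged for promotion and has not already been deployed. The sketch below illustrates that idea only. It is not the mlflow-gocd plugin's actual implementation; the tag name `promote`, the shape of the run dictionaries, and the `latest_promoted` function are all assumptions made for this example.

```python
from typing import Iterable, Optional

PROMOTE_TAG = "promote"  # hypothetical tag name, not taken from the plugin


def latest_promoted(runs: Iterable[dict], already_deployed: Optional[str] = None) -> Optional[dict]:
    """Return the newest run tagged for promotion, or None if there is nothing new to deploy."""
    candidates = [r for r in runs if r.get("tags", {}).get(PROMOTE_TAG) == "true"]
    if not candidates:
        return None
    newest = max(candidates, key=lambda r: r["end_time"])
    if newest["run_id"] == already_deployed:
        # The newest promoted run is already live, so no pipeline should be triggered.
        return None
    return newest


# Illustrative runs only; in practice these would come from the tracking server.
runs = [
    {"run_id": "a1", "end_time": 100, "tags": {}},
    {"run_id": "b2", "end_time": 200, "tags": {"promote": "true"}},
    {"run_id": "c3", "end_time": 300, "tags": {"promote": "true"}},
]
print(latest_promoted(runs)["run_id"])
```

A CI/CD poller built along these lines would trigger the build, verification, and deployment stages only when the function returns a run, so untagged experiments never reach production.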
Blog - https://stacktoheap.com/blog/2018/11/19/mlflow-model-repository-ci-cd/
Open source plugin to GoCD to work with MLFlow - https://github.com/indix/mlflow-gocd
Manoj Mahalingam is a Principal Engineer at Avalara. Previously, he worked at Indix (acquired by Avalara) and ThoughtWorks.
Manoj is the author of the book Learning Continuous Integration with TeamCity.