Challenges doing ML Ops with Metaflow
Tanay PrabhuDesai (~tanay) |
A recent project I worked on involved a significant amount of ML work, with software engineers like myself collaborating with data scientists. Much of the data scientists' work needs serious computing power, and no matter how powerful their laptops are, they are rarely enough. Metaflow lets them harness the power of the cloud in a simple, accessible way. This talk is mainly intended for software engineers and DevOps engineers who work alongside data scientists; in other words, for those who want to use Metaflow for MLOps and want to know about its challenges. It covers not just what makes Metaflow worth using, but also its pain points.
- A very short intro to Metaflow (5 minutes)
- Versioning the experiment (5 minutes)
- Support for other language code execution (2 minutes)
- Pulling from private repositories (5 minutes)
- Using GPUs (5 minutes)
- Deploying to production (3 minutes)
P.S. The points mentioned above might be slightly modified, and the time allocated to each may be moved around.
We will start with a short intro to Metaflow. We won't go into the depths of each concept, but will skim over them. Sample code will show how a Metaflow experiment runs, to set a baseline. A simple ML algorithm will be used for demonstration purposes, as ML algorithms themselves are out of scope for this talk.
We will then dive into the versioning of experiments, which was one of our main reasons for using Metaflow. We will go through what a version of an experiment is, what details an experiment contains, and how to browse experiments from a Jupyter notebook, which is very helpful for data scientists.
Another important issue we had was that some parts of the experiment were written in Java instead of Python. By default, Metaflow runs only Python containers, so here we will see how to use a custom container instead. We will continue by showing how the Docker image used for the container can be pulled from a private Docker registry. In addition, we will look into installing Python packages from a private repository, which is especially important for companies with proprietary reusable modules.
One of the biggest challenges was using GPU instances with Metaflow. They are very hard to configure so that your Python code actually runs on the GPU, and this should not be a concern of the data scientists at all; that is why a proper setup for GPU instances matters. This setup was a major failure for us, and this section of the talk will cover it largely from a failure point of view.
Last but not least, if the trained models are never used, it is all of no use. Here we will discuss how we loaded the models into production services. Throughout the presentation there will be plenty of video examples of the samples running, supporting the code examples shown. All demonstrations in the presentation will have their associated code in a repository for the audience's future reference.
- Rough knowledge of what MLOps is
- Basic experience with AWS and the different services offered by AWS, esp. Batch
- No knowledge of Metaflow is assumed and the talk includes an intro
- Since Metaflow is a Python library, intermediate Python knowledge is assumed
Tanay is a passionate software engineer who is constantly learning new things. Although Python is the primary weapon in his arsenal, he is mostly technology agnostic. He spends much of his free time on software-related hobbies, anything from building pet-project software to tinkering with DevOps tools like AWS. He also has a strong interest in well-built, robust software that strictly follows good practices, and loves to speak at meetups and similar events.