Reducing technical debt for ML platforms

Prakshi Yadav (~prakshi)


Description:

Deploying machine learning models at scale is a time-consuming process that involves many stages of simulation and stress testing. Continuous testing is needed to ensure that ML models perform as anticipated in production, particularly by monitoring for data and model drift. What if data scientists want to put their latest model enhancements to the test in a simulated near-production environment?

This calls for a workflow that builds such an environment as soon as a new model prototype is pushed. The workflow must also be designed so that it does not demand much manual intervention or involvement from the SRE team.

At Episource, we have developed a CI/CD pipeline to help data scientists host their models as APIs on demand in a production-like environment. AWS ECS is the service that facilitates the deployment of our containers. Our CI/CD pipeline rolls out the test environment to host the Model API as soon as the engineer pushes code to GitHub. The data scientist can run as many simulations as they want before agreeing on the efficacy of the latest work. This also makes it straightforward to promote the new ML model to production at the click of a button. This talk will go over how we developed a scalable simulation pipeline for our data scientists while adhering to the mantra: ship faster, ship consistent code, and ship fearlessly. Enabling production-like test environments necessitates stateless resource provisioning; if that provisioning is not automated, the test environments may drift from production in subtle but significant ways.

The following are some of the things that a participant can expect to learn during this talk:

  • Design parameters for ML deployment pipelines
  • Automation using GitHub Actions
  • Terraform usage for CI/CD jobs
  • Scalability: How do we ensure that our experiments are not competing for resources?
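
As a rough illustration of the last point, one common way to keep experiments from competing for resources is to give every pushed branch its own isolated environment. The Python sketch below derives a deterministic environment name from a repository and branch, so that re-running the pipeline for the same branch targets the same environment. The function name, the naming scheme, and the 32-character limit are assumptions made for illustration and are not taken from the Episource pipeline.

```python
import hashlib
import re

# Hypothetical length limit, chosen only for this illustration.
MAX_NAME_LENGTH = 32


def environment_name(repo: str, branch: str) -> str:
    """Return a deterministic, per-branch environment name (illustrative)."""
    # Reduce the repo/branch pair to lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", f"{repo}-{branch}".lower()).strip("-")
    # A short hash keeps names unique even when the slug is truncated.
    digest = hashlib.sha1(f"{repo}/{branch}".encode("utf-8")).hexdigest()[:8]
    prefix = slug[: MAX_NAME_LENGTH - len(digest) - 1].rstrip("-")
    return f"{prefix}-{digest}" if prefix else digest


if __name__ == "__main__":
    # The same inputs always yield the same name; different branches differ.
    print(environment_name("model-api", "feature/new-tokenizer"))
    print(environment_name("model-api", "feature/larger-batch"))
```

A name like this could then be passed to whatever tool provisions the environment, so that two prototypes never share a service or load balancer.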

Prerequisites:

A basic understanding of the requirements for building an ML deployment pipeline and of GitHub Actions usage. Knowledge of using Terraform to write Infrastructure as Code is good to have to get the most out of this talk.

Video URL:

https://youtu.be/Jj9TdwZxGkQ

Content URLs:

https://medium.com/@YadavPrakshi/all-about-elastic-load-balancing-8594a65996d7

https://medium.com/@YadavPrakshi/automate-zero-downtime-deployment-with-amazon-ecs-and-lambda-c4e49953273d

https://medium.com/@YadavPrakshi/automation-using-github-events-23d602617348

https://medium.com/@YadavPrakshi/launch-amazon-ecs-cluster-in-a-private-subnet-with-extra-care-eeeba94d6592

Speaker Info:

Prakshi's technical background involves designing application architectures at Episource that bridge business requirements and technical requirements, especially architectures that handle Big Data processing gracefully. This includes designing architectures to optimize common quality attributes such as flexibility, scalability, security, and manageability. Specialization: AWS Cloud, Big Data tools, Serverless Computing, DevOps, MLOps

Speaker Links:

Spark + AI Summit by Databricks: https://databricks.com/speaker/prakshi-yadav

FOSSASIA SUMMIT 2021: https://eventyay.com/e/fa96ae2c/session/6844

EuroPython 2021: https://ep2021.europython.eu/profiles/-309571/

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: