Serving ML Models at Scale Using TorchServe

Nishant Bhansali (~nishantb06)




Imagine you are ready with state-of-the-art ML models that are going to serve millions of requests a day! Keeping the business requirements in mind, how do you ensure your ML models are deployed efficiently?

The challenges we face when taking these models to production are:

  1. Managing compute and latency requirements: models need to be faster while using less compute (fewer GPUs) at the same time.
  2. Efficient batching: managing concurrent requests so that they can be batched and fed to the model together.
  3. Scaling: the number of incoming requests can be very high at some points in time and very low at others. Ensuring horizontal scaling, i.e. that more copies of your model are available for inferencing, is important.
  4. Managing versions and different models: models are often improved or changed after fine-tuning, so new versions of the same model should be easy to deploy.
  5. Improved networking and other optimisations: plain HTTP/REST might not always be the best way to serve ML models, as the payload (quite often an image, audio or text) can be quite large.
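The batching challenge above can be sketched with a toy micro-batcher. Note that `MicroBatcher`, `max_batch_size` and `max_delay_s` here are illustrative names, not TorchServe's actual API; TorchServe itself configures this behaviour declaratively (e.g. via batch size and batch delay settings when registering a model):

```python
import time
from typing import Any, Callable, List


class MicroBatcher:
    """Toy batcher: collects requests until the batch is full or a
    timeout expires, then runs the model once on the whole batch."""

    def __init__(self, model: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 4, max_delay_s: float = 0.05):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.pending: List[Any] = []
        self.deadline = 0.0

    def submit(self, request: Any) -> List[Any]:
        """Queue one request; returns the batch's results once flushed."""
        if not self.pending:
            # First request of a new batch starts the timeout clock.
            self.deadline = time.monotonic() + self.max_delay_s
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size or time.monotonic() >= self.deadline:
            return self.flush()
        return []

    def flush(self) -> List[Any]:
        batch, self.pending = self.pending, []
        return self.model(batch)  # one forward pass for the whole batch


# Usage: a stand-in "model" that doubles its inputs, batch size 3.
batcher = MicroBatcher(lambda xs: [2 * x for x in xs], max_batch_size=3)
print(batcher.submit(1))  # [] -- still buffering
print(batcher.submit(2))  # [] -- still buffering
print(batcher.submit(3))  # [2, 4, 6] -- full batch flushed in one call
```

One forward pass over a batch of N inputs is typically far cheaper on a GPU than N separate passes, which is why the server, not the client, should own this buffering.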

TorchServe is an open-source tool that solves these challenges by providing users with a simple Python interface and an easy-to-use API for managing the different configurations of your model server.

Talk outline

  1. Creating a suitable environment to serve your models using TorchServe: installing the necessary packages and creating your own TorchServe Docker environment
  2. Understanding TorchServe handler files: how to use an existing TorchServe handler and how to create a custom handler file for your use case, including how to trace your models for faster inferencing!
  3. Deploying your models and understanding the different configurations involved: how to create a Model Archive file and the configurations that go with it
  4. Understanding the TorchServe APIs: for inferencing, swapping and scaling your models
  5. Further optimisations: configuring gRPC instead of the default REST API (which can be up to 2x faster!)

The talk will include code demos and the deployment of a sample model locally, followed by questions.
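To give a flavour of point 2 above, a custom handler follows TorchServe's preprocess → inference → postprocess contract. In TorchServe you would subclass `ts.torch_handler.base_handler.BaseHandler`; the standalone class below only mirrors that structure (with a trivial stand-in "model") so it runs without TorchServe installed:

```python
class ToyHandler:
    """Mirrors the shape of a TorchServe custom handler.
    A real handler subclasses ts.torch_handler.base_handler.BaseHandler
    and loads a traced or eager PyTorch model in initialize()."""

    def initialize(self, context=None):
        # In TorchServe, `context` carries the model directory, GPU id, etc.
        # Here we stand in a trivial "model": string length.
        self.model = lambda texts: [len(t) for t in texts]
        self.initialized = True

    def preprocess(self, data):
        # TorchServe passes a list of request dicts; decode each body.
        texts = []
        for row in data:
            body = row.get("body") or row.get("data")
            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")
            texts.append(body)
        return texts

    def inference(self, inputs):
        # One call on the whole (possibly batched) list of inputs.
        return self.model(inputs)

    def postprocess(self, outputs):
        # Must return exactly one response per request in the batch.
        return [{"length": o} for o in outputs]

    def handle(self, data, context=None):
        return self.postprocess(self.inference(self.preprocess(data)))


# Usage: simulate a batch of two incoming requests.
handler = ToyHandler()
handler.initialize()
print(handler.handle([{"body": b"hello"}, {"data": "torchserve"}]))
# [{'length': 5}, {'length': 10}]
```

The one-response-per-request rule in `postprocess` is what lets TorchServe batch concurrent requests transparently and still route each answer back to the right caller.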


A basic understanding of Python and PyTorch will be helpful, though it is not a requirement.

Speaker Info:

Nishant has been working as a Machine Learning Engineer at Sharechat for the past 2 years. Sharechat is a social media platform with a 300M user base, where Nishant has been involved in building content moderation systems: training and deploying deep learning models to filter out NSFW content from videos, images, audio and text, working mainly with Python, Golang and Kubernetes (GCP). The most challenging problem he has worked on is building a system for detecting explicit content in livestreams, where multiple modalities, scale, latency and different languages are just a few of the problems that need to be tackled.

Speaker Links:

Personal Website, LinkedIn, Github

Talk on Llama and Alpaca models at Cellstrat, a community for GenAI practitioners

Section: Artificial Intelligence and Machine Learning
Type: Talk
Target Audience: Intermediate
Last Updated: