Unified Backend for Productionizing and Orchestrating Foundational Models with Python
Devansh Ghatak (~devansh0)
Description:
Applications built on foundational models are central to contemporary AI workflows. This talk is designed as an introductory guide for machine learning practitioners, emphasizing the concepts involved in deploying and optimizing foundational models, illustrated with specific tools and examples. While focusing on conceptual understanding, I will highlight Python-first open-source tools and share practical insights gained from developing SimpliManage, a Python-based framework for unified model compilation and orchestration. We aim to release SimpliManage as an open-source project before PyCon, making it available to the broader Python community.
In this talk, we will explore two broad areas:
- Generalised Hardware-Aware Model Compilation:
  - Techniques for compiling and optimizing foundational models such as LLMs, SDXL, and Whisper using Python (see the compilation sketch after this list).
  - Best practices for hardware-aware optimization across various GPUs and compilation techniques.
- Multi-Cloud Deployment Strategy:
  - Developing a Python-based framework to deploy models efficiently across multi-cloud environments.
  - Strategies to optimize deployment around load patterns, SLAs, and cost economics.
  - Dynamic load balancing and rapid scale-up techniques to reduce cold start times.
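To make the compilation topic concrete, here is a minimal sketch using torch.compile and Hugging Face Transformers. It is illustrative only and is not SimpliManage's API; the model name, dtype, and compile mode are assumptions to be tuned for the target GPU.

```python
# Minimal sketch (not SimpliManage's API): load a causal LM in half precision
# on a GPU and compile its forward pass with torch.compile. The model name and
# settings below are placeholders, to be adjusted per hardware target.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute the foundational model you deploy
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda").eval()

# Compile the forward pass; "reduce-overhead" trades longer warm-up for lower
# steady-state latency, which suits small-batch online inference.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Compiling foundational models with Python:", return_tensors="pt").to("cuda")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```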
Outline:
- Introduction to Foundational Models and Challenges:
  - Overview of LLMs, SDXL, Whisper, and other foundational models.
  - Common challenges in compiling and optimizing these models for production.
- Unified Hardware-Aware Model Compiler:
  - Insights into using Python for model compilation and optimization.
  - Techniques for hardware-aware optimization, leveraging different GPUs and compilation approaches.
  - Practical examples and best practices.
- Python-Based Multi-Cloud Deployment Framework:
  - Designing a framework to handle multi-cloud environments efficiently.
  - Strategies to optimize deployment based on load patterns, SLAs, and cost economics (see the sketch after this outline).
  - Dynamic load balancing and rapid scale-up to reduce cold start times.
- SimpliManage:
  - Overview of the features and capabilities of SimpliManage.
  - Real-world case studies demonstrating SimpliManage in action.
  - Plans for open-sourcing the library.
- Conclusion and Q&A:
  - Summary of key takeaways.
  - Interactive session to address audience questions.
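As a taste of the deployment-strategy discussion, below is a small, self-contained sketch of choosing the cheapest GPU configuration that still meets a latency SLA at a given request rate. The catalog names and numbers are hypothetical placeholders, not SimpliManage output; real inputs would be measured throughput and latency profiles per model and GPU.

```python
# Hypothetical illustration of SLA- and cost-aware capacity planning across
# cloud GPU options. All catalog values are invented for the example.
import math
from dataclasses import dataclass

@dataclass
class GpuOption:
    name: str               # hypothetical label, e.g. cloud + GPU type
    cost_per_hour: float    # USD per replica-hour
    max_rps: float          # sustainable requests/sec per replica
    p99_latency_ms: float   # p99 latency when run at max_rps

def cheapest_plan(options, target_rps, sla_p99_ms):
    """Pick the lowest-cost option and replica count that meets the latency SLA."""
    best = None
    for opt in options:
        if opt.p99_latency_ms > sla_p99_ms:
            continue  # this hardware cannot meet the SLA even when not overloaded
        replicas = math.ceil(target_rps / opt.max_rps)
        hourly_cost = replicas * opt.cost_per_hour
        if best is None or hourly_cost < best[2]:
            best = (opt, replicas, hourly_cost)
    return best

catalog = [  # invented numbers, for illustration only
    GpuOption("cloud-a-gpu-small", cost_per_hour=1.2, max_rps=40, p99_latency_ms=450),
    GpuOption("cloud-b-gpu-large", cost_per_hour=3.5, max_rps=160, p99_latency_ms=220),
]
print(cheapest_plan(catalog, target_rps=500, sla_p99_ms=300))
```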
Prerequisites:
Basic understanding of PyTorch and Huggingface.
Speaker Info:
Devansh Ghatak:
- Co-founder and CTO of Simplismart with close to 10 years of extensive experience in Python.
- Developed one of the fastest inference engines for foundational models at Simplismart.
- Created SimpliManage to automate model compilation, deployment, and management.
- Enabled scaling up to 100k RPS for some of the largest AI-first companies.
- Utilized an extensive tech stack including PyTorch, Huggingface, Kubernetes, and Terraform.
Previous Experience:
- ML Engineer at Google:
  - Part of the Search team; worked extensively on LLMs to help Google Search users retrieve accurate, factual information from the web.
- ML Engineer at Avaamo:
  - Helped Avaamo become one of the first adopters of language models through Huggingface (then known as pytorch-pretrained-bert).
  - Developed a custom RAG pipeline implementation in early 2019 for the answering product Avaamo Answers.
  - Created a novel auto-train tool for training support chatbots from human conversations.