Jupyter to Jupiter: Scaling Multi-Tenant ML Pipelines

Vishal Gupta (~vishal07)


Description:

A brief talk summarising the journey of an ML feature from a Jupyter Notebook to production. At Freshworks, given the diverse pool of customers using our products, each feature maintains dedicated models per account, churning out millions of predictions every hour. This talk covers the tools and measures we've used to scale our ML products. I'll also touch upon Apache Airflow, a workflow management platform, and how we've used it to automate and parallelise various stages of our ML pipeline.

Outline

  • Introduction [3 minutes]
    • About myself 
    • Challenges of a multi-tenant ML pipeline
    • Role of a Data Scientist vs an ML Engineer
  • Incentives to scale your pipeline [3 minutes]
    • Reducing turnaround time (real-time vs batch-wise)
    • Increasing availability & adhering to SLA 
    • Enabling diverse customers to use your ML features
    • Automating workflows
  • Brief Intro to Airflow [4 minutes]
    • Why not cron?
    • DAGs, Tasks and Operators
    • Executors: LocalExecutor, CeleryExecutor, KubernetesExecutor
    • Controls: Task Pools, Queues and Scheduling rules, Parallelism, etc.
    • Reasons to avoid Airflow
  • Scaling different parts of an ML pipeline [15 minutes]
    • Data Ingestion and preprocessing
      • Data pipelines at Freshworks
      • Aggregating different types of data from different sources (be it streams, databases or S3)
      • Evaluating and choosing the right datastore to optimise retrieval and query loads
      • Cleaning data before insertion to optimise storage
      • Optimising preprocessing layers to adapt to the rate of incoming data
    • Model Training, Evaluation and Deployment
      • Offline ML platform & workflows at Freshworks
      • Periodically retraining models to adapt to recent data
      • Including customer-specific rules and features
      • Hyper-parameter tuning
      • Leveraging Spark clusters to train faster
      • Evaluating models and monitoring metrics over time
      • Maintaining model versioning to revert to older versions as a fallback
    • Prediction, Back-filling and Interpretability
      • Online ML platform & workflows at Freshworks
      • Building ML systems capable of scaling to handle more customers
      • Avoiding a single point of failure with distributed execution
      • Establishing back-filling pipelines if historic predictions are of importance
      • Capturing and handling errors without disrupting the entire workflow
      • Setting up alerts to identify engineering and data science anomalies 
      • Providing interpretable insights to justify predictions to stakeholders
    • Misc. engineering practices
      • Planning before execution: be it building a new module or picking a tool
      • Functional testing: ensuring offline and online pipelines are on par
      • Application security: building data pipelines with regulations in mind
      • Documentation: adding docstrings, setup & deployment instructions and an elaborate README
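To make the "DAGs, Tasks and Operators" portion of the Airflow intro concrete, here is a minimal, hypothetical DAG wiring three pipeline stages together. The DAG id, schedule and task names are illustrative only (not Freshworks' actual pipeline); it assumes Airflow 2.x with the classic `PythonOperator`.

```python
# Hypothetical sketch of an Airflow DAG, for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess():
    ...  # clean and aggregate incoming data


def train():
    ...  # retrain per-account models on recent data


def predict():
    ...  # score fresh records with the latest models


with DAG(
    dag_id="ml_pipeline",            # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",     # hourly batches of predictions
    catchup=False,
) as dag:
    t_pre = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_pred = PythonOperator(task_id="predict", python_callable=predict)

    # Declare dependencies explicitly — this is what cron cannot express:
    t_pre >> t_train >> t_pred
```

The `>>` operator encodes the task ordering that answers "Why not cron?": cron can only fire jobs on a clock, while a DAG lets the scheduler retry, backfill and parallelise tasks while respecting dependencies.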
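The prediction section above mentions scaling to many customers while capturing errors without disrupting the entire workflow. A minimal, hypothetical sketch of that idea (plain `concurrent.futures`, not Freshworks' actual code — the function and account names are invented):

```python
# Illustrative sketch: run per-account prediction jobs in parallel so that
# one tenant's failure does not disrupt the others.
from concurrent.futures import ThreadPoolExecutor


def predict_for_account(account_id):
    # Hypothetical per-tenant job; a real pipeline would load the account's
    # dedicated model here and score its recent data.
    if account_id == "acct-broken":
        raise ValueError("corrupt feature data")
    return {"account": account_id, "predictions": 42}


def run_all(account_ids):
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(predict_for_account, a): a for a in account_ids}
        for fut, account in futures.items():
            try:
                results[account] = fut.result()
            except Exception as exc:  # capture the error, keep the run alive
                failures[account] = str(exc)
    return results, failures


results, failures = run_all(["acct-1", "acct-broken", "acct-2"])
print(sorted(results), sorted(failures))  # → ['acct-1', 'acct-2'] ['acct-broken']
```

The `failures` dict is what feeds the alerting mentioned above: the broken tenant is reported, while the other accounts' predictions still land on time.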
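The training section above lists model versioning as a fallback mechanism. A hedged sketch of the idea — the in-memory registry and helper names are invented for illustration, not Freshworks' actual system:

```python
# Hypothetical model registry: keep every version per account and walk back
# to an older one when the newest artifact is unusable.
models = {}  # (account_id, version) -> model artifact


def register(account_id, version, model):
    models[(account_id, version)] = model


def load_latest(account_id, max_version):
    # Walk back from the newest version until a usable artifact is found,
    # so a bad deployment can silently fall back to an older model.
    for version in range(max_version, 0, -1):
        model = models.get((account_id, version))
        if model is not None:
            return version, model
    raise LookupError(f"no usable model for account {account_id}")


register("acct-1", 1, "model-v1")
register("acct-1", 2, None)  # v2 failed validation, stored as unusable
version, model = load_latest("acct-1", max_version=2)
print(version, model)  # → 1 model-v1
```

In production the registry would live in object storage or a model store rather than a dict, but the fallback walk is the same: never serve predictions from a version that failed evaluation.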

Prerequisites:

Basic knowledge of a Machine Learning pipeline (i.e. ETL, EDA, training, evaluation, prediction, etc.)

Speaker Info:

As an ML Engineer at Freshworks, I work on building and scaling AI features for our CRM. In the past, I've worked on Deal Insights, a feature that predicts the success of active deals with interpretable insights about their progress, and Deal Sentiment, a feature that predicts deal closure sentiment based on emails exchanged between agents and their customers during the sales funnel. When I'm not hacking together pet projects in Python, I'm either streaming Seinfeld, making pesto or falling asleep to audiobooks.

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: