How Helpshift built a machine learning platform using Python at large scale
Shyam Shinde (~shyam91)
The purpose of this talk is to describe how Helpshift has leveraged the Python ecosystem to build a machine learning platform without using any third-party ML framework, and how you can build one too.
In particular, you will learn how to build the following components of a machine learning platform using Python:
- How we use the Celery framework to distribute model-building tasks to Celery workers
- How large models can be served to prediction nodes in real time
- How to monitor model-building tasks on Celery workers
- Python data science stack at Helpshift: NumPy, SciPy, scikit-learn, etc.
- Python libraries/frameworks used: Celery, S3/Azure Storage, Bottle, etc.
Helpshift provides a customer service platform to 2000+ companies across business domains like gaming, e-commerce, IoT, banking, entertainment, travel, hospitality, productivity apps and many more. Helpshift offers a suite of ML features that includes automatic ticket classification, FAQ suggestions for user queries, and more. As each company using our platform operates in a different business domain, we build separate ML models for each customer and for each feature.
To handle thousands of models and CRUD operations on them in production, we needed a highly scalable and reliable machine learning platform for building and serving models. A possible solution was to use Spark or TensorFlow for model building, but these frameworks do not provide facilities to store thousands of models and serve them for prediction in production. We decided to use the Celery framework to distribute model-building tasks to Celery workers, and to use the core Python data science libraries to build the models.
Model building using celery worker
Each Celery worker in the ML platform is registered to one or more model-building queues, and each type of task is associated with one Celery queue. At runtime, the backend server submits a model-building task to a pre-defined Celery queue. One of the available Celery workers picks up the pending task, builds the model, and pushes it to blob storage (S3/Azure) with a new model version.
Model management in s3/azure
We have written a Python wrapper around the S3/Azure client libraries to provide all the required CRUD operations on models in S3. These are simple operations like get_model, put_model, and update_model with a given version.
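A minimal version of such a wrapper might look like the following. The class name, key layout, and method names are hypothetical; the client is assumed to be boto3-compatible:

```python
import pickle

class ModelStore:
    """Hypothetical thin CRUD wrapper over a blob-storage client
    (boto3-style); a sketch, not Helpshift's actual wrapper."""

    def __init__(self, client, bucket):
        self.client = client  # e.g. boto3.client("s3")
        self.bucket = bucket

    def _key(self, model_id, version):
        # Versioned key layout is an assumption for this sketch.
        return "models/{}/v{}.pkl".format(model_id, version)

    def put_model(self, model_id, model, version):
        self.client.put_object(Bucket=self.bucket,
                               Key=self._key(model_id, version),
                               Body=pickle.dumps(model))

    def get_model(self, model_id, version):
        obj = self.client.get_object(Bucket=self.bucket,
                                     Key=self._key(model_id, version))
        return pickle.loads(obj["Body"].read())
```

Keeping every version under its own key makes updates atomic from the readers' point of view: a prediction node either sees the old version or the new one, never a partial write.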
Serving models to prediction Nodes
Model sizes range from 5 to 25 MB. To serve predictions within 30 ms, we have to either load all models in memory or store them on the local disk of each prediction node. We decided to store all models on local disk, as loading them all in memory was not a scalable approach. The challenge here is that whenever a particular model is updated, it has to be copied to each prediction node. A Python service on each prediction node takes care of syncing updated models from S3 to local disk.
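One pass of such a sync service could be sketched like this. The function name, arguments, and file naming are hypothetical:

```python
import os

def sync_once(latest_versions, local_dir, download):
    """One pass of a hypothetical model-sync service.

    latest_versions: {model_id: version} as listed in blob storage
    download(model_id, version): returns the model blob as bytes
    Models already present at their latest version are skipped.
    """
    synced = []
    for model_id, version in latest_versions.items():
        path = os.path.join(local_dir, "{}_v{}.pkl".format(model_id, version))
        if os.path.exists(path):
            continue  # local disk already has this version
        with open(path, "wb") as f:
            f.write(download(model_id, version))
        synced.append(model_id)
    return synced
```

A real service would run this in a loop (or react to update notifications) and also garbage-collect stale versions from disk.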
The prediction service is a Gunicorn server which fetches models from local disk and serves predictions for incoming requests.
Monitoring model building task running on celery worker
As there are always some jobs in a Celery queue waiting for a Celery worker, we built an active monitoring service which tracks the status of each submitted task. The monitoring service decodes metrics from Celery workers to detect task failures and measure the time each task spends in the waiting/running state. For any task that crosses the threshold time for the wait or run state, an alert is sent.
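The threshold check at the heart of such a service could be sketched as follows. The threshold values and the task dict shape are assumptions; the state names are Celery's standard task states:

```python
import time

# Threshold values are illustrative assumptions.
WAIT_THRESHOLD = 15 * 60  # max seconds a task may sit in the queue
RUN_THRESHOLD = 60 * 60   # max seconds a task may run

def check_task(task, now=None):
    """Hypothetical check mirroring the monitoring service: flag tasks
    that failed, waited too long, or ran too long.

    task: {"state": ..., "submitted_at": ..., "started_at": ...}
    Returns an alert string, or None if the task is healthy.
    """
    now = now if now is not None else time.time()
    if task["state"] == "FAILURE":
        return "task failed"
    if task["state"] == "PENDING" and now - task["submitted_at"] > WAIT_THRESHOLD:
        return "waiting too long"
    if task["state"] == "STARTED" and now - task["started_at"] > RUN_THRESHOLD:
        return "running too long"
    return None
```

Running this check periodically over all tracked tasks and forwarding non-None results to an alerting channel gives the active monitoring described above.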
Prerequisites:
- Basic knowledge of the Python ecosystem
- Interest in building a scalable machine learning platform
Will share the slides link soon.
Key Takeaways from the talk:
- Why we decided to build our own machine learning platform from scratch
- How to build a machine learning platform using Python
- Lessons learned while building machine learning services
- How to extend this platform with a distributed computation engine like Spark or a deep learning framework like TensorFlow
Hello, I am Shyam Shinde, actively developing the machine learning platform at Helpshift.
I have diverse experience in developing backend systems and in designing and developing systems to handle big data.
I have developed production systems using Java, Clojure, and Python. Currently, I am interested in deploying machine learning services at scale. As side projects, I learn machine learning concepts and try to implement them.
Apart from that, I like trekking, reading books and watching movies.