Continuous Integration for Data Scientists
Jaidev Deshpande (~jaidev)
This talk is about debugging, scaling and ultimately deploying a prototype of a machine learning application to a production system.
Broadly, the development and maintenance stages of a machine learning product can be broken down as follows:
- Data Ingestion: data collection, cleaning and transforming (ETL)
- Feature Engineering
- Model Selection
- Training and Prediction
- Incorporating feedback from production into the training process
In a live project, these steps are not necessarily separated in time: it may not be possible to finish one before starting the next. Engineers normally need to lather, rinse and repeat these steps to get a workable product out. The point of this talk is that some of these stages are very repetitive and can be automated.
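As a rough illustration of what automating these stages can look like, the sketch below chains them behind a single entry point so that the whole loop becomes one repeatable call. Everything in it is an assumption made for illustration: the inline CSV data, the column names, the function names, and the toy nearest-centroid classifier, which stands in for a real model only to keep the example free of third-party dependencies.

```python
import csv
import io
import statistics

# Hypothetical raw data; in a real project this would come from an ETL job.
RAW_CSV = """height_cm,weight_kg,label
150,50,0
155,54,0
160,58,0
175,80,1
180,85,1
185,90,1
"""


def ingest(text):
    """Data ingestion: parse CSV text into ([height, weight], label) pairs."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        features = [float(rec["height_cm"]), float(rec["weight_kg"])]
        rows.append((features, int(rec["label"])))
    return rows


def engineer_features(rows):
    """Feature engineering: replace the raw columns with one BMI-like ratio."""
    return [([w / (h / 100) ** 2], y) for (h, w), y in rows]


def train(rows):
    """Training: store the mean feature vector (centroid) of each class."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {
        y: [statistics.fmean(col) for col in zip(*xs)]
        for y, xs in by_label.items()
    }


def predict(model, x):
    """Prediction: pick the class whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def run_pipeline(text):
    """Run every stage in order and report accuracy on the training rows."""
    rows = engineer_features(ingest(text))
    model = train(rows)
    correct = sum(predict(model, x) == y for x, y in rows)
    return model, correct / len(rows)


model, accuracy = run_pipeline(RAW_CSV)
print(f"training accuracy: {accuracy:.2f}")
```

Because each stage is a plain function, any one of them can be rerun or swapped out without touching the others. The accuracy here is measured on the training rows only, which is enough for a smoke test but not for model selection.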
In general, data-driven projects have a lot to gain from the kind of work that devops and build engineers normally do. Specifically, I will take some examples and demonstrate in detail how CI tools can be invaluable in iterating through the development cycle of a machine learning project. Continuous integration, in this case, can be thought of as simply something that runs a predefined script, either on a regular schedule or when triggered by an external event. This is useful (and popular) for building and testing, but the flexibility of a CI system can be leveraged to accomplish arbitrary tasks.
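One way such a predefined script could serve a machine learning project is as a quality gate: the CI job retrains or re-evaluates the model, then compares the resulting metrics against a stored baseline. The sketch below is a minimal example of that idea, not a description of any particular CI tool's API. The names `quality_gate` and `exit_code`, the metric names, and the tolerance value are all hypothetical.

```python
import sys


def quality_gate(metrics, baseline, tolerance=0.01):
    """Compare this run's metrics with the baseline.

    Returns a list of human-readable failures; an empty list means every
    baseline metric is present and within `tolerance` of its expected value.
    """
    failures = []
    for name, expected in baseline.items():
        actual = metrics.get(name)
        if actual is None:
            failures.append(f"{name}: metric missing from this run")
        elif actual < expected - tolerance:
            failures.append(
                f"{name}: {actual:.3f} is below baseline {expected:.3f}"
            )
    return failures


def exit_code(metrics, baseline):
    """Print any failures to stderr and return a process exit status."""
    failures = quality_gate(metrics, baseline)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0


# Hypothetical numbers standing in for a real evaluation run.
baseline = {"accuracy": 0.90, "recall": 0.85}
this_run = {"accuracy": 0.91, "recall": 0.80}
status = exit_code(this_run, baseline)
print(f"exit status: {status}")
```

In an actual job the script would end with `sys.exit(status)`. CI servers such as Jenkins treat a non-zero exit status from a shell build step as a failed build, so a drop in model quality surfaces the same way a failing unit test does.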
In part, this talk is also about culture and habits. I will be speaking in detail about what data scientists can learn from build and devops engineers, and how adopting even the most common CI practices can be extremely rewarding for a data scientist. I will try to justify the claim that data scientists need to spend as much time with Jenkins as they do with IPython notebooks.
Attendees are expected to have the following:
- Knowledge of basic machine learning nomenclature
- Some knowledge of writing modular Python packages
- Basic knowledge of pandas and sklearn
A working draft of the slides, which I'm updating almost every day, can be found here: https://github.com/jaidevd/jaidevd.github.io/blob/source/blog/posts/continuous-integration-for-data-scientists.ipynb
I'm a data scientist at Cube26 Software Pvt Ltd. I have previously worked and consulted on a number of data science projects and products. I build data-driven products and the tooling around them for a living. My research interests are in signal processing and computational harmonic analysis. I'm obsessed with applications of machine learning in personal productivity and recommendation systems.