From OOP Principles To Scalable Machine Learning System
徐愷 (~kkshyu)
Nowadays, data scientists use many libraries, such as sklearn, pandas, and tensorflow. However, the data analysis process, from data retrieval, data cleaning, and feature engineering to modeling and visualization, is similar no matter which library we use. As a result, we spend a lot of time copying and pasting code while changing only a small part of the process. For example, we might only swap label encoding for one-hot encoding, or PCA for a polynomial transform.
Some advanced developers use sklearn.pipeline to connect estimators. Unfortunately, we found that sklearn.pipeline does not work well with other libraries: one cannot use pandas.DataFrame or keras.models properly in the pipeline. Besides the compatibility problems, one can neither scale the pipeline across different physical machines nor add a fault-tolerance mechanism. We therefore dedicated ourselves to building a scalable machine learning system that is robust, integrates different libraries easily, and supports distributed computing.
In this system, we take RapidMiner as a reference for designing a highly scalable architecture, use Celery as our distributed task queue, and adopt the S.O.L.I.D. principles of OOP to build a solid and highly serviceable system.
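To make the S.O.L.I.D. idea concrete, here is a minimal sketch in plain Python. It is not the system presented in the talk: the names `Step`, `LabelEncode`, `OneHotEncode`, and `Pipeline` are hypothetical and exist only for illustration. Each step has a single responsibility, and the pipeline depends only on the abstract `Step` interface, so one encoding can be swapped for another without touching the pipeline code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence

Row = Dict[str, object]


class Step(ABC):
    """Hypothetical interface: every pipeline stage maps rows to rows."""

    @abstractmethod
    def run(self, rows: List[Row]) -> List[Row]:
        ...


class LabelEncode(Step):
    """Replace a categorical column with integer codes (sorted category order)."""

    def __init__(self, column: str) -> None:
        self.column = column

    def run(self, rows: List[Row]) -> List[Row]:
        categories = sorted({str(r[self.column]) for r in rows})
        codes = {c: i for i, c in enumerate(categories)}
        return [{**r, self.column: codes[str(r[self.column])]} for r in rows]


class OneHotEncode(Step):
    """Replace a categorical column with one 0/1 column per category."""

    def __init__(self, column: str) -> None:
        self.column = column

    def run(self, rows: List[Row]) -> List[Row]:
        categories = sorted({str(r[self.column]) for r in rows})
        encoded = []
        for r in rows:
            new_row = {k: v for k, v in r.items() if k != self.column}
            for c in categories:
                new_row[f"{self.column}={c}"] = int(str(r[self.column]) == c)
            encoded.append(new_row)
        return encoded


class Pipeline:
    """Runs steps in order; it knows only the Step interface."""

    def __init__(self, steps: Sequence[Step]) -> None:
        self.steps = list(steps)

    def run(self, rows: List[Row]) -> List[Row]:
        for step in self.steps:
            rows = step.run(rows)
        return rows


rows = [{"color": "red", "size": 1}, {"color": "blue", "size": 2}]

# Swapping the encoding means swapping one object, not rewriting the pipeline.
label_result = Pipeline([LabelEncode("color")]).run(rows)
onehot_result = Pipeline([OneHotEncode("color")]).run(rows)
```

In a distributed setting, each `Step.run` call could in principle be dispatched as a task on a queue such as Celery; that wiring is left out here because it depends on the broker and deployment.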
We highly recommend reading Google’s Rules of Machine Learning: Best Practices for ML Engineering. It will be helpful when building a machine learning system.
The outline will be:
- [ 5mins] Best practices for machine learning systems
- [ 5mins] The logic of RapidMiner and how it works
- [ 5mins] How Celery works and how to use it
- [15mins] System architecture design and OOP principles - S.O.L.I.D. for the system
- [10mins] Challenges faced and their resolutions + Q&A
This is my previous talk on this topic in Taiwan (sorry, the video of the English version has not been released yet).
In this talk, I would like to share my experience of designing system architecture and let more people know about the importance of machine learning engineering. As a reminder, this talk is suitable for those with a basic understanding of machine learning.
- Slide: https://docs.google.com/presentation/d/1iNxVTesNf7YbYgRvIcfRS88SuJffhYg1gsb60BEum1I/edit?usp=sharing
- Article: https://towardsdatascience.com/when-ai-meets-3000-year-old-chinese-palmistry-a767b7f3defb?gi=845bd51f6450