Software Engineering Best Practices Applied to Machine Learning Research - AllenNLP and Data Version Control
Sai Prasanna (~sai80)
With the deep learning explosion, more code gets written for research in machine learning. A problem in writing code for research is that it's easy to sacrifice good practices in software engineering for apparent gains in speed of experimentation.
But this is often counterproductive: it leads to hard-to-read code and makes results hard or impossible to reproduce. Bad code also makes it hard for other researchers to build upon existing work, and so to stand on the shoulders of giants.
We use AllenNLP, a library for deep learning research in NLP, to illustrate how best practices make good science easy. Contrary to expectations, adopting best practices like dependency injection and writing DRY, modular code can make experimentation faster. It also makes results easier to reproduce and the code easier to extend.
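To make the idea of dependency injection concrete, here is a minimal sketch in plain Python. It does not use AllenNLP's actual API; the names `Tokenizer`, `DatasetReader`, `TOKENIZERS`, and `build_reader` are hypothetical and chosen only for illustration. The point is that the reader receives its tokenizer from outside, so an experiment can swap components through configuration instead of code changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Tokenizer(Protocol):
    """Interface that every tokenizer in this sketch must satisfy."""

    def tokenize(self, text: str) -> List[str]: ...


class WhitespaceTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return text.split()


class CharacterTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return list(text.replace(" ", ""))


@dataclass
class DatasetReader:
    # The tokenizer is injected by the caller, not constructed inside the reader.
    tokenizer: Tokenizer

    def read(self, lines: List[str]) -> List[List[str]]:
        return [self.tokenizer.tokenize(line) for line in lines]


# Hypothetical registry mapping configuration names to implementations.
TOKENIZERS: Dict[str, Callable[[], Tokenizer]] = {
    "whitespace": WhitespaceTokenizer,
    "character": CharacterTokenizer,
}


def build_reader(config: Dict[str, str]) -> DatasetReader:
    """Build a reader from a configuration dictionary (illustrative only)."""
    return DatasetReader(tokenizer=TOKENIZERS[config["tokenizer"]]())


reader = build_reader({"tokenizer": "whitespace"})
tokens = reader.read(["good science made easy"])
```

Switching the experiment to character-level tokens only requires changing the configuration value to `"character"`; `DatasetReader` itself stays untouched, which is what keeps such code modular and easy to extend.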
Version control is another best practice adopted across software engineering. Data Version Control (DVC) helps apply version control to machine learning projects. DVC allows one to manage large datasets, models, and metrics linked to the code in Git. It helps build a reproducible pipeline that traces how the dataset gets converted to features, how the features are used to train models, and how the models finally produce a benchmark. One can then change any part of the pipeline, say by fixing a bug in preprocessing or changing a training hyperparameter, and rerun the entire pipeline, with only the affected parts being recomputed.
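The idea of recomputing only the affected stages can be sketched in a few lines of Python. This is a toy illustration, not how DVC is implemented: the `Pipeline` class and its methods are hypothetical, and the sketch fingerprints only in-memory input values, whereas a real tool would also need to track files and the code of each stage.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


class Pipeline:
    """Toy pipeline that reruns a stage only when its inputs have changed."""

    def __init__(self) -> None:
        self.stages: List[Tuple[str, Callable[..., Any], List[str]]] = []
        # Maps stage name to (fingerprint of inputs, cached output).
        self.cache: Dict[str, Tuple[str, Any]] = {}

    def add_stage(self, name: str, func: Callable[..., Any], deps: List[str]) -> None:
        self.stages.append((name, func, deps))

    def run(self, params: Dict[str, Any]) -> List[str]:
        """Run stages in order and return the names of the recomputed stages."""
        outputs: Dict[str, Any] = dict(params)
        executed: List[str] = []
        for name, func, deps in self.stages:
            inputs = [outputs[dep] for dep in deps]
            fingerprint = hashlib.sha256(
                json.dumps(inputs, sort_keys=True).encode()
            ).hexdigest()
            cached = self.cache.get(name)
            if cached is not None and cached[0] == fingerprint:
                outputs[name] = cached[1]  # inputs unchanged: reuse cached output
            else:
                outputs[name] = func(*inputs)
                self.cache[name] = (fingerprint, outputs[name])
                executed.append(name)
        return executed


pipeline = Pipeline()
pipeline.add_stage("features", lambda raw: [len(x) for x in raw], ["raw"])
pipeline.add_stage("model", lambda feats, lr: sum(feats) * lr, ["features", "lr"])
pipeline.add_stage("benchmark", lambda model: round(model, 3), ["model"])

first = pipeline.run({"raw": ["a", "bb"], "lr": 0.1})   # all three stages run
second = pipeline.run({"raw": ["a", "bb"], "lr": 0.1})  # nothing changed, nothing runs
third = pipeline.run({"raw": ["a", "bb"], "lr": 0.5})   # only "model" and "benchmark" run
```

Changing the hypothetical hyperparameter `lr` leaves the `features` stage cached because its inputs are unchanged, while the stages downstream of the change are recomputed.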
- Familiarity with Version Control Systems like git.
- A general idea of a typical machine learning workflow: dataset processing, training, benchmarking.
- Knowledge of applying deep learning for a task, preferably but not strictly NLP.
I am a machine learning engineer currently working at Zoho. I have experience in research and engineering for NLP. My work involves replicating current research in deep learning for NLP, experimenting with new ideas, and deploying models for real-world use cases.