Untangling data science experiments
Data scientists conduct a large number of experiments (it's a science, after all). For each experiment, they have to go through a meticulous process of:
- Documenting the context
- Managing the experiment parameters
- Documenting the hypothesis
- Documenting the results of the experiment
- Saving the results of the experiment in csv files, plots, etc.
In this talk, I will showcase how we manage all this at Oneirix Labs using our experiment.py library.
experiment.py is a module that automates all the bookkeeping mentioned above. It allows you to:
- Describe your experiment using a BDD syntax
- Set up and inject experiment parameters into functions
- Record results of experiments (dataframes, plots, dicts)
- Store results hierarchically on disk
- Create nested variants of an experiment (think hierarchy of sub-experiments)
- Iterate over variants based on a list of values of an experiment parameter
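To make the ideas of parameter injection and variant iteration concrete, here is a minimal sketch in plain Python. It is not the experiment.py API: the helper names `inject` and `variants`, the `train` function, and the parameter names are all hypothetical, chosen only for illustration.

```python
import inspect


def inject(params, func):
    """Call func with only the entries of params that match its argument names.

    Hypothetical helper, not part of experiment.py.
    """
    wanted = inspect.signature(func).parameters
    return func(**{name: params[name] for name in wanted if name in params})


def variants(base_params, name, values):
    """Yield one parameter dict per value, overriding a single parameter.

    Hypothetical helper, not part of experiment.py.
    """
    for value in values:
        # Build a new dict each time so the base parameters stay untouched.
        yield {**base_params, name: value}


def train(learning_rate, epochs):
    # Stand-in for a real experiment step; it only echoes its inputs.
    return {"learning_rate": learning_rate, "epochs": epochs}


base = {"learning_rate": 0.1, "epochs": 10, "seed": 42}

# Run one variant of the experiment per value of the "epochs" parameter.
results = [inject(p, train) for p in variants(base, "epochs", [10, 20])]
```

Note that `train` never sees `seed`, because `inject` passes only the parameters the function asks for.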
In addition, it makes tweaking and re-running experiments faster by caching the results of expensive operations. For this, it enables:
- Caching of outputs of functions, to save time in re-runs
- Coarse- and fine-grained control to ignore or bust the caches
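The caching idea can be illustrated with a small disk-backed decorator. This is only a sketch under stated assumptions, not how experiment.py implements it: the `cached` decorator, the `bust` keyword, the `clear_cache` helper, and the temporary cache directory are all hypothetical.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Assumption: cache entries live as pickle files in a temporary directory.
CACHE_DIR = Path(tempfile.mkdtemp())


def cached(func):
    """Cache func's return value on disk, keyed by its name and arguments."""
    @functools.wraps(func)
    def wrapper(*args, bust=False, **kwargs):
        key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        path = CACHE_DIR / (hashlib.sha256(key).hexdigest() + ".pkl")
        if path.exists() and not bust:
            # Cache hit: skip the expensive call.
            return pickle.loads(path.read_bytes())
        result = func(*args, **kwargs)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper


def clear_cache():
    """Coarse-grained control: drop every cached result."""
    for entry in CACHE_DIR.glob("*.pkl"):
        entry.unlink()


calls = []  # Records each real execution, to show when the cache is used.


@cached
def expensive(n):
    calls.append(n)
    return sum(i * i for i in range(n))


first = expensive(1000)              # computed and written to disk
second = expensive(1000)             # served from the cache
third = expensive(1000, bust=True)   # fine-grained bust: recomputed
```

Here `bust=True` plays the role of fine-grained control for a single call, while `clear_cache()` is the coarse-grained option.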
- Setting the context, explaining the problem: 7 minutes
- Explanation of the solution: 5 minutes
- Demo of the library: 15 minutes
Some exposure to data science will help you relate to the problem.
A software industry veteran walks into a data science bar ...
Aditya has worked in the software industry for around 18 years. He has worked on a wide range of technologies, including mainframes, embedded systems, kernel development, cloud applications, cloud infrastructure and QA.
He has co-founded three startups and played a variety of roles over the years. Nowadays, he works in the field of engineering science, the confluence of mathematics and software, at Oneirix Labs.
Aditya has been an active member and organiser of many open source communities, including the Pune Linux Users group, the Pune Ruby community, the Deccan Ruby Conference, and the Distributed Systems community and conference.