Reproducible Scalable Workflows with Nix, Papermill and Renku
Rohit Goswami (~HaoZeke)
The prevalence of Jupyter notebook interfaces in the data-science and analysis community can no longer be denied. In particular, new researchers and practitioners fresh out of school are accustomed to using Jupyter notebooks for their initial analysis. As might be expected, these workflows are difficult to reproduce and to store. Caching efficiency and dependency re-use are almost always sub-optimal with virtual environments compared to native installations, and the same issues (along with additional security concerns) plague Docker setups as well. A set of Jupyter tools, such as Jupytext, has evolved to close this gap. However, the fundamental problem of reproducing workflows on high-performance computing clusters, that is, of programmatically composing compilation rules which use the underlying hardware efficiently with minimal user intervention, is still not solved. In this talk, I will discuss packaging Python applications and workflows in an end-to-end composable manner using the Nix ecosystem, which leverages a functional programming paradigm, and then show how this allows for user-friendly low-compute analysis while remaining scalable on large clusters. To that end, the tools introduced will be:
- The Nix programming language (emphasis on developer environments for Python with mkShell)
- Jupyter Python kernels (the Xeus kernel for Python debugging) and Jupytext
- Papermill for parameterizing notebooks
- Renku for tracing provenance
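As a rough illustration of what parameterizing a notebook with Papermill can look like, the sketch below expands a parameter grid into one output notebook per combination. The notebook name `analysis.ipynb`, the `runs` directory, the parameter names, and the helpers `plan_runs` and `execute_runs` are hypothetical and chosen only for illustration; the one Papermill call used is `papermill.execute_notebook`, which injects the given values after the cell tagged `parameters` in the input notebook.

```python
from itertools import product
from pathlib import Path


def plan_runs(template, out_dir, grid):
    """Expand a parameter grid into (output_path, parameters) pairs.

    Hypothetical helper: one output notebook is planned per combination.
    """
    keys = sorted(grid)  # fixed key order gives deterministic file names
    runs = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        tag = "_".join(f"{k}-{v}" for k, v in params.items())
        out_path = Path(out_dir) / f"{Path(template).stem}_{tag}.ipynb"
        runs.append((out_path, params))
    return runs


def execute_runs(template, runs):
    """Execute every planned run with Papermill.

    Papermill is imported lazily so that planning works without it;
    this assumes the environment (for example a Nix shell) provides it.
    """
    import papermill as pm

    for out_path, params in runs:
        out_path.parent.mkdir(parents=True, exist_ok=True)
        pm.execute_notebook(str(template), str(out_path), parameters=params)


if __name__ == "__main__":
    # Assumed example grid; replace with the parameters of a real notebook.
    planned = plan_runs(
        "analysis.ipynb", "runs", {"temperature": [300, 350], "seed": [0, 1]}
    )
    for out_path, params in planned:
        print(out_path, params)
```

Separating planning from execution keeps the list of runs inspectable before any compute is spent, and each (output path, parameters) pair could just as well be handed to a cluster scheduler as a separate job.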
The goal is to have the audience familiarized with the best practices for reproducibility and analysis. The focus will be on scientific HPC applications, though any managed cluster can and will benefit from the practices described.
- Introduction: 1 Min
- Python Packaging: 2 Min
- Nix Introduction and Philosophy: 7 Min
- Project Setup and Workflow: 3 Min
- Reproducibility, definitions and tools: 7 Min
- Cluster Management and Data provenance: 5 Min
- Conclusions and Future Directions: 2 Min
- Q&A: 5 Min
The Python packaging ecosystem and its shortcomings will be covered briefly, but prior experience would be desirable.
- Experience working with large-data workflows
- Typically this would involve running, say, a standard GPU computation on Colab
- Experience with HPC architecture and managers like PBS Torque/SLURM
- Also tooling like LMod
A more in-depth introductory workshop on Nix itself given by me (and Amrita Goswami) at CarpentryCon2020 is here:
I'm presently a doctoral researcher at the University of Iceland in the School of Natural Science and Engineering. I work on large-data problems at the intersection of quantum chemistry and machine learning in the Faculty of Physical Sciences and have over ten years of FOSS development experience. I am an OSI (Open Source Initiative) advocate member, and am also a member of and contributor to other scientific and FOSS programming communities (e.g. the Carpentries). I have an eclectic set of interests, mostly centered around HPC algorithmic efficiency and reproducible science. In the past, I have been associated with IIT Kanpur, specifically the Chemical Engineering department, the HPC division, and the department of Chemistry. I am also the co-developer and author of the reproducible FOSS project d-SEAMS. I have a history of open source pedagogy as well, having been a CS106A Code in Place section leader, and also having co-taught a course on computational chemistry at the middle-school level.