Managing your data science project environments with Conda (+pip)
David R. Pugh (~davidrpugh)
This workshop is a Software Carpentry-style introduction to Conda (+pip) for (data) scientists. Conda is an open source package and environment management system that runs on Windows, macOS, and Linux. Although Conda was created for Python packages, it can package and distribute software for any language, which makes it well suited to managing environments for data science and machine learning projects. Pip is the de facto standard package-management system for installing and managing software packages written in Python: if a package is written in Python, it is almost certainly available on the Python Package Index (PyPI) via pip. Conda and pip work well as a team, and this workshop will cover when and how to use pip to install packages into Conda environments.
This workshop motivates the use of Conda (+pip) as a development tool for building and sharing project specific software environments that facilitate reproducible (data) science workflows. Particular attention is given to using Conda to create reproducible environments with NVIDIA GPU dependencies (including environments for Horovod, TensorFlow, PyTorch, and NVIDIA RAPIDS).
What follows is a rough outline of a typical 3-hour/half-day workshop. Please note that the content can be adjusted to fit a shorter time slot as necessary.
- Getting Started with Conda (20 minutes) Why should I use a package and environment management system as part of my research workflow? What is Conda? Why use Conda?
- Working with Environments (1 hour) What is a Conda environment? How do I create an environment? How do I activate (deactivate) an environment? Where should I install my environments? How do I find out which environments exist on my machine? How do I find out what packages have been installed in an environment? How do I delete an environment that I no longer need?
- Sharing Environments (30 minutes) Why should I share my Conda environment with others? How do I share my Conda environment with others?
- Using Packages and Channels (30 minutes) What are Conda channels? What are Conda packages? Why should I be explicit about which channels my research project uses?
- Managing GPU dependencies with Conda (30 minutes) How do I see which CUDA dependencies are available via Conda? What are the best practices for managing CUDA dependencies via Conda?
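For the "Sharing Environments" episode, the key artifact is an `environment.yml` file that collaborators can use to recreate the environment with `conda env create --file environment.yml` (an existing environment can be exported with `conda env export`). A hypothetical example, including a pip section for PyPI-only packages; all names and version pins here are illustrative:

```yaml
# environment.yml -- a hypothetical project environment specification
name: machine-learning-env

channels:
  - conda-forge
  - defaults

dependencies:
  - python=3.8
  - pandas=1.0
  - scikit-learn=0.23
  - pip=20.0
  # packages installed from PyPI via pip rather than from Conda channels
  - pip:
    - yellowbrick==1.1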
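For the GPU episode, available CUDA dependencies can be discovered with commands such as `conda search cudatoolkit` or `conda search cudnn`, and then pinned in the environment specification. A hypothetical spec for a PyTorch GPU environment (channel ordering and version pins are illustrative assumptions, not prescriptions from the lesson materials):

```yaml
# environment.yml -- hypothetical GPU-enabled PyTorch environment
name: pytorch-gpu-env

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - python=3.8
  - pytorch=1.6
  - torchvision=0.7
  # pinning cudatoolkit lets Conda install a CUDA runtime that matches
  # the framework build, instead of relying on a system-wide install
  - cudatoolkit=10.2
```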
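The "Working with Environments" episode centers on a handful of core Conda commands. A minimal sketch of the workflow (the environment name, Python version, and packages below are placeholders, not part of the lesson materials):

```shell
# Create a new environment with a specific Python version and some packages
conda create --name machine-learning-env python=3.8 pandas scikit-learn

# Activate the environment (and deactivate it when done)
conda activate machine-learning-env
conda deactivate

# Alternatively, install the environment inside the project directory
# rather than in the default location
conda create --prefix ./env python=3.8

# List all environments that exist on this machine
conda env list

# List the packages installed in a particular environment
conda list --name machine-learning-env

# Delete an environment that is no longer needed
conda remove --name machine-learning-env --all
```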
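The "Using Packages and Channels" episode can likewise be sketched with a few commands; the package name below is only an example:

```shell
# Install a package from a specific channel rather than the defaults
conda install --channel conda-forge opencv

# Inspect which channels are currently configured, and in what priority
conda config --show channels

# Enable strict channel priority so the same channels always win,
# which helps keep project environments reproducible
conda config --set channel_priority strict
```

Being explicit about channels, either on the command line or in the project's `environment.yml`, is what makes the resulting environment reproducible on another machine.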
As this is a Software Carpentry-style tutorial, each episode contains a number of hands-on exercises to be completed by participants. The instructor will also live code the solutions to the exercises.
Basic familiarity with Python programming and Bash shell concepts (i.e., basic commands, environment variables, etc.) is assumed. Familiarity with installing the NVIDIA CUDA Toolkit would be beneficial for the NVIDIA GPU-focused episodes.
The bulk of the content for the workshop will be taken from open source lesson materials for a Software Carpentry-style course entitled Introduction to Conda for (Data) Scientists that I am developing. In addition to the lesson materials, I have also written a number of blog posts that cover most of the material in the lessons (but without as much detail).
Dr. David R. Pugh is a staff scientist with the King Abdullah University of Science and Technology (KAUST) Research Computing Core Labs, where he provides data science training and consulting services to KAUST students, faculty, and research scientists. David is a certified Software and Data Carpentry instructor with extensive teaching experience, having taught Software and Data Carpentry workshops in Japan, Saudi Arabia, and the UK.
At KAUST, David is the lead instructor of the popular Introduction to Data Science Workshop Series, where he teaches programming in Python, Bash Shell, and SQL, as well as best practices for reproducible research using Git, Conda, and Docker.
In addition to his work at KAUST, David has given a number of invited talks and training sessions, including a week-long Introduction to Data Science using Python for researchers at the Asia Pacific Energy Research Center (APERC) in Tokyo, a hands-on tutorial at HPC Saudi 2019 on Deep Learning with PyTorch, and two days of hands-on instruction in Scikit-Learn for participants of the KAUST Women in Machine Learning (WiML) Bootcamp.