Serverless Data Science: Scaling algorithms made easy
Nabarun Pal (~palnabarun)
All problems in computer science can be solved by another level of indirection. -- David Wheeler
Data scientists often struggle to use cloud technologies because the abstractions those technologies offer differ from the tools they already use. As the quote from David Wheeler above suggests, this problem can be solved by a level of indirection. In this session, I will talk about the problems data scientists face in parallelizing their sequential workloads.
I will highlight serverless technology as a solution to these problems. GCP Cloud Functions and AWS Lambda are popular serverless options for running parallel workloads because of their infrastructural simplicity. The problem returns, however, once we explore their APIs. To solve it, we built an abstraction that behaves exactly like Python's process pool. This addresses problems such as development overhead, the limits of a single server's compute resources, and infrastructural complexity. The talk compares parallelizing sequential workloads using threads, processes, and a serverless architecture. The idea is that attendees will leave knowing when and where to use serverless technologies, so that they can make an informed decision.
This talk is not intended to be a tutorial on any technology. Broadly, I will share our experience at rorodata: the problems our data science team faced, how we came to use serverless to solve them, and the tooling we built around it to make serverless easier to use.
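To make the idea of a pool-style abstraction concrete, here is a minimal sketch using only the Python standard library. It does not show the LambdaPool API, which is not described in this text. `simulate_score`, `SequentialPool`, and `run_all` are hypothetical names introduced for illustration. The point is only that the calling code stays the same when the backend behind the pool interface changes.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_score(seed):
    """Stand-in for one independent unit of work (hypothetical workload)."""
    return sum((seed * k) % 7 for k in range(1000))


class SequentialPool:
    """Hypothetical baseline backend: same interface, no parallelism."""

    def __init__(self, max_workers=None):
        self.max_workers = max_workers  # accepted for interface parity, unused

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions

    def map(self, fn, iterable):
        # Plain built-in map: items are processed one after another.
        return map(fn, iterable)


def run_all(pool_factory, fn, items, max_workers=4):
    """Run fn over items using whichever pool backend is supplied."""
    with pool_factory(max_workers=max_workers) as pool:
        # Both backends return results in input order.
        return list(pool.map(fn, items))


if __name__ == "__main__":
    seeds = range(8)
    sequential = run_all(SequentialPool, simulate_score, seeds)
    threaded = run_all(ThreadPoolExecutor, simulate_score, seeds)
    print(sequential == threaded)
```

Passing `concurrent.futures.ProcessPoolExecutor` as `pool_factory` should work the same way for functions defined at module level. A serverless-backed pool that exposes the same `map` method could, in principle, be swapped in without touching `run_all`, which is the kind of indirection the talk describes.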
Basic knowledge of what threads and processes are, along with their advantages and limitations.
PS: This is not a hard requirement, since I will explain these concepts in the talk.
- Current State
- The Abstraction
- Performance Metrics
- Limitations and Future Improvements
LambdaPool will be open-sourced by mid-July 2019 and the source code will be published to GitHub. A link to it will be added to the slides. We are working on refining the user and developer documentation.
PS: Depending on the time left at the end of the talk, I will decide whether to have an on-stage Q&A or take questions in the hallway. I prefer the latter because, in my previous talks, hallway conversations have been more productive for the attendees.
- Source code is open-sourced and published here
- Updated the slides and agenda
- Spoke about the same topic at AWS Community Day Bengaluru 2019
Nabarun works as a Platform Engineer at Rorodata, a fast-paced SaaS startup. He is a graduate of the Indian Institute of Technology Roorkee. He has been solving problems faced by data scientists by building powerful yet simple abstractions in Python. In 2017 he co-authored a functions-as-a-service framework called Firefly, on which he spoke at PyCon India 2017 and FOSSASIA Summit 2018. He has also spoken at PyData Delhi 2017, and he presented the topic of this talk at AWS Community Day Bengaluru 2019. Nabarun was a John Hunter Matplotlib Summer Fellow under NumFOCUS in 2018. He has been venturing into container orchestration and serverless computing, an interest that led to robust tooling around data science products and culminated in Algoshelf, rorodata's AI-driven planning product. Recently, he has started contributing to the Kubernetes community in various code and non-code roles.