Concurrent data processing in Python
Anand S (~anand)
What mechanisms do we have to process large-scale data concurrently in Python?
This talk will review and compare popular and emerging mechanisms:
- Increase server utilisation by leveraging multiple cores (forking,
- Increase responsiveness via concurrent processing (
- Distributed data structures (HDFS, distributed DataFrames, etc.)
- Offloading to an external engine (but which one? Relational or non-relational? How to set it up?)
- Pre-compute and cache (where? at what level of aggregation? how to deal with combinatorial explosions?)
... and show examples of real-life code that use these techniques.
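As a rough illustration of the first mechanism (using multiple cores), here is a minimal sketch, not code from the talk. It assumes the data has already been split into chunks of CSV text, each with its own header row and only numeric columns. The function names `sum_columns`, `merge_totals` and `parallel_column_sums` are hypothetical, chosen for this example.

```python
import csv
import io
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_columns(csv_text):
    """Sum every column in one chunk of CSV text (assumes a header row
    and purely numeric values)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals = {}
    for row in reader:
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def merge_totals(partials):
    """Combine per-chunk totals into one dict of column sums."""
    merged = {}
    for partial in partials:
        for name, value in partial.items():
            merged[name] = merged.get(name, 0.0) + value
    return merged


def parallel_column_sums(chunks, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Sum the columns of each chunk in a separate worker, then merge.

    executor_cls is an illustrative parameter: ProcessPoolExecutor spreads
    CPU-bound work across cores, while ThreadPoolExecutor is a drop-in
    alternative when the work is mostly I/O-bound.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return merge_totals(pool.map(sum_columns, chunks))
```

Calling `parallel_column_sums(["a,b\n1,2\n", "a,b\n3,4\n"])` hands each chunk to a worker process and merges the partial sums. On platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), make the call under an `if __name__ == "__main__":` guard.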
- You have a strong understanding of Python (e.g. you can define a generator without looking at the docs)
- You can process basic data (e.g. you can sum all columns of a CSV file)
- You have a problem processing large volumes of data
Anand is the Chief Data Scientist at Gramener.com. He has advised and designed IT systems for organizations such as Citigroup, Honda, IBM and Tesco.
Anand and his team explore insights from data and communicate these as visual stories. Anand also builds the Gramener Visualisation Server -- Gramener's flagship product.
Anand has an MBA from IIM Bangalore and a B.Tech from IIT Madras. He has worked at IBM, Lehman Brothers, The Boston Consulting Group and Infosys Consulting. He blogs at s-anand.net.