Concurrent data processing in Python

Anand S (~anand) | 24 Apr, 2015

Description:

What mechanisms do we have to process large-scale data concurrently in Python?

This talk will review and compare popular and emerging mechanisms:

Increase server utilisation by leveraging multiple cores (forking, subprocess, multiprocessing, threading, etc)
Increase responsiveness via concurrent processing (asyncio, trollius, etc)
Distributed data structures (HDFS, distributed DataFrames, etc)
Offloading to an external engine (but which one? Relational or non-relational? How to set it up?)
Pre-compute and cache (where? at what level of aggregation? how to deal with combinatorial explosions?)

... and show examples of real-life code that uses these techniques.
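
As a rough illustration of the first item (leveraging multiple cores), here is a minimal sketch using the standard library's concurrent.futures. It is not code from the talk: the helper names column_sums and parallel_column_sums, the chunk layout (lists of equally long numeric rows) and the worker count are assumptions made for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def column_sums(rows):
    """Sum each column of one chunk of equally long numeric rows."""
    return [sum(column) for column in zip(*rows)]


def parallel_column_sums(chunks, executor_cls=ProcessPoolExecutor, workers=2):
    """Sum columns per chunk in separate workers, then combine the partials.

    Assumes every chunk is non-empty and all rows have the same width.
    """
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(column_sums, chunks))
    # Each partial is itself a row of column totals,
    # so the same helper merges them into the final result.
    return column_sums(partials)
```

With the default ProcessPoolExecutor, call parallel_column_sums from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing executor_cls=ThreadPoolExecutor runs the same code on threads, which in CPython tends to suit I/O-bound work better than CPU-bound work.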

Prerequisites:

You've a strong understanding of Python (e.g. you can define a generator without looking at the docs)
You can process basic data (e.g. you can sum all columns of a CSV file)
You have a problem processing large volumes of data
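
As a rough self-check for the first two prerequisites, the sketch below defines a generator and uses it to sum all columns of a CSV file. The function names and the in-memory sample data are made up for illustration, and every column is assumed to be numeric.

```python
import csv
import io


def numeric_rows(lines):
    """Generator: lazily yield each CSV data row as a dict of floats."""
    for row in csv.DictReader(lines):
        yield {name: float(value) for name, value in row.items()}


def sum_columns(rows):
    """Consume an iterable of row dicts and return per-column totals."""
    totals = {}
    for row in rows:
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals


# Hypothetical sample standing in for an open CSV file.
sample = io.StringIO("a,b\n1,2\n3,4\n")
print(sum_columns(numeric_rows(sample)))  # {'a': 4.0, 'b': 6.0}
```

If this reads comfortably, the talk's examples should be easy to follow.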

Speaker Info:

Anand is the Chief Data Scientist at Gramener.com. He has advised and designed IT systems for organizations such as Citigroup, Honda, IBM and Tesco.

Anand and his team explore insights from data and communicate these as visual stories. Anand also builds the Gramener Visualisation Server, Gramener's flagship product.

Anand has an MBA from IIM Bangalore and a B.Tech from IIT Madras. He has worked at IBM, Lehman Brothers, The Boston Consulting Group and Infosys Consulting. He blogs at s-anand.net.

Speaker Links:

Other talks:

Data, Politics and Anomalies, TEDx, NMIMS Bangalore 2015
Faster Data Processing in Python, PyCon 2014
Data Visualisation with PowerPoint in Python, PyCon 2013
Pandas tutorial, Fifth Elephant 2013

Section:	Concurrency
Type:	Talks
Target Audience:	Intermediate
Last Updated:	03 Sep, 2015
