Big Data Carpentry with Python

Saket Bhushan (~saket) | 31 Aug, 2017


Description:

Automation is not a replacement for manual labour but an aid to it. At Sosio, we have cut our internal data processing time by 50% by automating monotonous, simple tasks. While automating mundane tasks speeds up processes and saves time and energy, automation is not always an easy answer. It is often complex, requires human intervention, and, even when set up successfully, needs constant monitoring and review.

There are myriad data pipelining frameworks and libraries available for every use case imaginable. The complexity of handling such diverse tooling, combined with the uniqueness of each problem statement, leads to duplicated effort and reinvention of the wheel.

The session will primarily give the audience an understanding of pipeline frameworks, workflow automation, and the Python toolsets that help build them. We will go through some common design patterns, tradeoffs, and available libraries and frameworks for designing such systems. We will focus on the reusability, consistency, availability, idempotency, and scalability of these systems.
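Idempotency is the easiest of these properties to show in a few lines. The following is a minimal sketch, not necessarily the pattern used in the workshop: a step that skips its work when its output already exists, and writes through a temporary file so an interrupted run never leaves a partial result. The function name `run_once` and the file-based output are assumptions made for illustration.

```python
import os
import tempfile


def run_once(output_path, compute):
    """Run compute() and store its text result unless the output already exists.

    Returns True if the work was done, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output is already there, so a re-run is a no-op

    directory = os.path.dirname(os.path.abspath(output_path))
    # Write to a temporary file in the same directory first, so a crash
    # midway never leaves a half-written file at output_path.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(compute())
        os.replace(tmp_path, output_path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Calling `run_once` twice with the same `output_path` runs `compute` only the first time, which is what makes it safe to retry a failed pipeline from the top.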

We will take up basic data pipelining concepts as well as practical use cases for data pipelines with Python. We will cover some of the popular task and data workflow tools, such as Celery, Luigi, and Airflow, and touch on some overarching concepts in building a data pipeline.
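One such overarching concept is that a pipeline is a directed acyclic graph (DAG) of tasks, each of which runs only after the tasks it depends on. The sketch below shows the idea with the standard library alone (`graphlib`, Python 3.9+). It is not how Luigi or Airflow are implemented, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run every task after all of its upstream tasks.

    tasks:        maps a task name to a callable taking a dict of upstream results
    dependencies: maps a task name to the set of task names it depends on
    Returns the execution order and the result of each task.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# A three-step extract -> transform -> load chain (illustrative data only).
example_tasks = {
    "extract": lambda up: [1, 2, 3],
    "transform": lambda up: [x * 2 for x in up["extract"]],
    "load": lambda up: sum(up["transform"]),
}
example_dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
}
```

`run_pipeline(example_tasks, example_dependencies)` runs the three steps in dependency order. Workflow tools add scheduling, retries, and persistence on top of this ordering.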

The principles can be applied to archival, warehousing and analytics, and low-latency hot storage data.

We will solve a few example problems during the workshop to make these points concrete. Much of what is presented is based on our experience of trying different libraries and learning the hard way what did not work and what made things easy for us.

By the end of the session, one should be comfortable with:

Assessing if a pipeline framework is right for your dataset.
Comparing pipeline tools and writing tasks.
Parallelising and scaling tasks.
Approaching data pipelining with a Python toolset.

Specifically, we will be talking about:

Understanding queues and the producer and consumer constructs
Writing and deploying tasks using Celery
Scaling Celery workers and monitoring with Flower
First steps with Dask
Data pipelines and DAGs
First steps with Luigi and Airflow
Custom and advanced tasks with Luigi and Airflow
Pipelines and Spark Streaming - listening to a Twitter stream
Pipelines and Django Channels - pub/sub and data flow
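
The first topic in the list, a queue with a producer and a consumer, can be sketched with the standard library alone. This is an in-process illustration using threads, not Celery itself, which distributes the same pattern across worker processes through a message broker. All names here (`producer`, `consumer`, `run`, `SENTINEL`) are hypothetical, and error handling is left out for brevity.

```python
import queue
import threading

SENTINEL = object()  # marker telling the consumer that no more items will arrive


def producer(q, items):
    """Put every item on the queue, then signal the end of the stream."""
    for item in items:
        q.put(item)  # blocks while the queue is full, which gives backpressure
    q.put(SENTINEL)


def consumer(q, handle, results):
    """Take items off the queue and process them until the sentinel is seen."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(handle(item))


def run(items, handle):
    """Wire one producer and one consumer together through a bounded queue."""
    q = queue.Queue(maxsize=10)
    results = []
    worker = threading.Thread(target=consumer, args=(q, handle, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results
```

With a single consumer, results come back in the order the items were produced. For example, `run(range(5), lambda x: x * x)` returns the squares of 0 to 4 in order.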

Slides

Prerequisites:

Intermediate understanding of Python
Basic understanding of Bash commands
Basics of deployment and working with remote servers
Interest in Data and Systems

Speaker Info:

Saket is the founder of Sosio. Sosio caters to the large-scale data needs of enterprises and non-profits. He has been semi-active at tech conferences, attending and delivering talks across the globe. In his personal capacity, he has introduced Python to more than 500 individuals and conducted training sessions at corporate houses like Oracle. In his previous life, he spent a good chunk of his time optimising computational mechanics algorithms.

Speaker Links:

Twitter

Section:	Others
Type:	Workshops
Target Audience:	Intermediate
Last Updated:	03 Oct, 2017
