Big Data Carpentry with Python

Saket Bhushan (~saket)



Description:

Automation is not a replacement for manual labour but an aid to it. At Sosio, we have cut our internal data processing time to 50% of what it was by automating simple, monotonous tasks. Automating mundane tasks speeds up processes and saves time and energy, but automation is not always an easy answer. It is often complex, requires human intervention, and, even when set up successfully, needs constant monitoring and review.

There are myriad data pipelining frameworks and libraries available for every use case imaginable. The complexity of handling such diverse tooling, combined with the uniqueness of each problem statement, leads to duplicated effort and reinvention of the wheel.

The session will primarily help the audience understand pipeline frameworks, workflow automation, and the relevant Pythonic toolsets for both. We will go through some common design patterns, tradeoffs, and available libraries and frameworks for designing such systems. We will focus on the reusability, consistency, availability, idempotency, and scalability of these systems.
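To make idempotency concrete, here is a minimal sketch of a task that can be re-run safely. It is not taken from the workshop material; `run_once` and `word_counts` are hypothetical names used only for illustration, and the approach (skip the work when the output file already exists) is one common convention among several.

```python
import json
import tempfile
from pathlib import Path


def run_once(output_path: Path, compute) -> bool:
    """Run `compute` only if its output file does not exist yet.

    Returns True if the work was done, False if it was skipped.
    """
    if output_path.exists():
        return False  # a previous run already produced this output
    # Write to a temporary name first, then rename into place, so a crash
    # mid-write does not leave a partial file at the final path.
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    tmp_path.write_text(json.dumps(compute()))
    tmp_path.replace(output_path)
    return True


def word_counts() -> dict:
    # Hypothetical stand-in for a real processing step.
    counts = {}
    for word in "to be or not to be".split():
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "counts.json"
        print(run_once(target, word_counts))  # first run does the work
        print(run_once(target, word_counts))  # second run is skipped
```

Because a repeated call leaves the output unchanged, a scheduler can retry the task after a failure without producing duplicate or conflicting results.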

We will take up basic data pipelining concepts as well as practical use cases for data pipelines with Python. We will cover some of the popular task and data workflow tools, such as Celery, Luigi, and Airflow, and touch on some overarching concepts in building a data pipeline.
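One overarching concept shared by tools like Luigi and Airflow is that a pipeline is a set of tasks with dependencies, run in dependency order. The sketch below illustrates that idea with the standard library only (`graphlib`, Python 3.9+); it does not use the Luigi or Airflow APIs, and `run_pipeline` and the three task names are assumptions made for the example.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks):
    """Run each task after all of its upstream tasks.

    `dependencies` maps a task name to a tuple of upstream task names.
    `tasks` maps a task name to a callable that receives upstream results
    in the same order as listed in `dependencies`.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return order, results


# Hypothetical three-step pipeline: extract -> clean -> load.
dependencies = {
    "extract": (),
    "clean": ("extract",),
    "load": ("clean",),
}
tasks = {
    "extract": lambda: [3, None, 1, 2],
    "clean": lambda rows: sorted(r for r in rows if r is not None),
    "load": lambda rows: sum(rows),
}

if __name__ == "__main__":
    order, results = run_pipeline(dependencies, tasks)
    print(order)
    print(results["load"])
```

Real workflow tools add scheduling, retries, and persistence on top of this ordering step, which is where most of the tradeoffs between them arise.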

The principles can be applied to archival, warehousing and analytics, and low-latency hot storage data.

We will solve a few example problems during the workshop to make these points concrete. Much of what is presented is based on our experience of trying different libraries and learning the hard way what did not work and what made things easier for us.

By the end of the session, one should be comfortable with

  • Assessing if a pipeline framework is right for your dataset.
  • Comparing pipeline tools and writing tasks.
  • Parallelising and scaling tasks.
  • Approaching data pipelining with a Python toolset.

Specifically we will be talking about

  • Understanding a queue, constructs of producer and consumer
  • Writing and Deploying tasks using Celery
  • Scaling Celery workers and monitoring with Flower
  • First Steps with Dask
  • Data pipelines and DAGs
  • First steps with Luigi and Airflow
  • Custom and Advanced Tasks with Luigi and Airflow
  • Pipelines and Spark Streaming - listening to a Twitter stream
  • Pipelines and Django Channels - pub/sub and data flow
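
As a rough illustration of the first topic in the list, the queue with a producer and a consumer, here is a minimal sketch using only the standard library. It is not Celery code; `producer`, `consumer`, `run`, and the sentinel convention are assumptions chosen for the example, and a broker-backed task queue applies the same pattern across processes and machines.

```python
import queue
import threading

SENTINEL = None  # marker the producer sends to signal "no more work"


def producer(q, items):
    """Put every item on the queue, then signal completion."""
    for item in items:
        q.put(item)  # blocks while the queue is full (back-pressure)
    q.put(SENTINEL)


def consumer(q, results):
    """Take items off the queue and process them until the sentinel arrives."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # stand-in for real processing


def run(items):
    q = queue.Queue(maxsize=2)  # small bound to demonstrate back-pressure
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

With a single consumer the results come back in submission order; with several consumers, ordering is no longer guaranteed, which is one of the design tradeoffs when scaling workers.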

Slides

Prerequisites:

  • Intermediate understanding of Python
  • Basic understanding of Bash commands
  • Basics of deployment and working with remote servers
  • Interest in Data and Systems

Speaker Info:

Saket is the founder of Sosio. Sosio caters to the large-scale data needs of enterprises and non-profits. He has been semi-active at tech conferences, attending and delivering talks across the globe. In his personal capacity, he has introduced Python to more than 500 individuals and conducted training sessions at corporate houses like Oracle. In his previous life, he spent a good chunk of his time optimising computational mechanics algorithms.

Speaker Links:

Linkedin

Twitter

Section: Others
Type: Workshops
Target Audience: Intermediate
Last Updated:

Hi, it would be nice if you could add your slides before 10 Sept. It will help our team review your proposal. Thanks.

Rajat Saini (~rajataaron)

Aren't you trying to cover too many frameworks in one workshop? It would be more useful if you could focus on a few middleware tools. Otherwise you may touch upon each but not be able to cover any of them in detail.

Anand B Pillai (~pythonhacker)

Anand,

Not trying to cover too many frameworks at once. Spark and Django do have wide adoption. Just trying to show how they fit into this context with a sample use case.

Rajat,

Will get the initial slides uploaded by EOD tomorrow.

Saket Bhushan (~saket)
