Processing Billions of Records Per Day with Python

Shaik Asifullah (~shaik2)




Have you ever been amazed at how efficiently tech giants process their data? Do you want to build an analytics system capable of processing billions of records a day? If you are wondering how to build a scalable, low-latency system for running arbitrary SQL queries in Python, this talk is for you!

This system is distinguished by two properties:

  1. it is schema-independent, and
  2. it processes queries with minimal latency.

I will describe how to architect this system using the Lambda Architecture (a design pattern commonly used in big data) and Apache Kafka, and how to process and format the raw schema-independent data. I will also introduce different online analytical processing (OLAP) systems and their respective tradeoffs. The end product will be an analytics engine capable of running arbitrary queries on billions of records.
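To make the Lambda Architecture idea concrete, here is a minimal in-memory sketch, not the implementation presented in the talk. It assumes a toy event-count query; the names `build_batch_view`, `SpeedView`, and `query_count` are hypothetical, and a plain Python list stands in for a Kafka topic. The batch layer recomputes a view from all historical records, the speed layer updates a small view incrementally as new records arrive, and the serving layer merges the two at query time.

```python
from collections import Counter


def build_batch_view(master_records):
    """Batch layer: recompute event counts from the full historical dataset."""
    return Counter(record["event"] for record in master_records)


class SpeedView:
    """Speed layer: incrementally counts records that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, record):
        self.counts[record["event"]] += 1


def query_count(event, batch_view, speed_view):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch_view[event] + speed_view.counts[event]


# Records are schema-independent: only "event" is assumed; other fields vary freely.
historical = [
    {"event": "app_open", "user": "u1"},
    {"event": "purchase", "user": "u1", "amount": 20},
    {"event": "app_open", "user": "u2", "device": "android"},
]

# Stand-in for messages consumed from a stream (an assumption; no Kafka client is used here).
recent_stream = [
    {"event": "app_open", "user": "u3"},
    {"event": "purchase", "user": "u2", "amount": 5},
]

batch_view = build_batch_view(historical)
speed_view = SpeedView()
for record in recent_stream:
    speed_view.ingest(record)

app_opens = query_count("app_open", batch_view, speed_view)
purchases = query_count("purchase", batch_view, speed_view)
print(app_opens, purchases)
```

In a real deployment the batch view would typically live in an OLAP store and be rebuilt periodically, at which point the speed view for the covered time range can be discarded.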

Finally, I will discuss some exciting extensions of this pipeline, including applying machine learning algorithms and adding a monitoring system. The talk ends with benchmarks of queries run on billions of records, followed by a Q&A session.

This talk is intended for:

  • Those revamping their data warehousing systems to support arbitrary queries with minimal latency
  • Those who want to build their own analytics layer from scratch
  • Analytics enthusiasts


Prerequisites:

  • General Python knowledge
  • Basic SQL queries
  • Great enthusiasm
  • A little familiarity with databases

Content URLs:

Speaker Info:

Shaik Asifullah is a Senior Data Engineer at MoEngage and an open source developer. He previously worked at WalmartLabs and graduated from BITS Pilani, Goa. He became interested in Big Data technologies after learning about columnar databases. He has also worked with faculty at the University of Zurich and ETH Zurich on building a sentiment analyser, and used the model to predict the results of the 2016 US presidential election. His recent open source contribution involves building a distributed Python environment for building, simulating, and analysing models of biochemical networks, including gene regulatory networks and metabolic networks. He is also a great admirer of Freudian psychoanalysis and André Breton's Surrealism.

Speaker Links:

Id: 842
Section: Data science
Type: Talks
Target Audience: Intermediate
Last Updated: