Scaling up Data Pipelines using Apache Parquet and Dask
Lalit Musmade (~lalit05) |
Python is widely used for transforming data by data pipelines in a wide range of functionality like web development, scientific computing, data science, and machine learning. Python is often used to move data in and out of databases, however, it is not the best tool when data is large and doesn't fit in-memory. Python with Apache Parquet as a file format and cloud storage technology can be used for better performance of data pipelines.
APACHE PARQUET DATA FORMAT
Apache Parquet is the widely used columnar data format for storage of the data in the big data ecosystem. Parquet is built to support storing data in a CPU and move large data in and out in an efficient way and provides capabilities to push-down queries to the I/O layer. It is a top-level Apache project since 2015. The tabular nature of Parquet is a good fit to read into Pandas DataFrames with the two libraries fastparquet and PyArrow. Fastparquet is a Python-based implementation that uses the Numba Python-to-LLVM compiler. PyArrow is part of the Apache Arrow project and uses the C++ implementation of Apache Parquet.
APACHE PARQUET WITH PANDAS & DASK
While Pandas is mostly used to work with data that fits into memory, Dask allows us to scale working with data to multi-core machines and distributed clusters. Data can be split up into partitions and stored in cloud object storage systems like Amazon S3, Google Cloud Storage or Azure Datalake. This talk will show how to use predicate push down, data partitioning and general data layout to speed up queries on large data.
OUTLINE OF THE TALK
- Introduction [2 minutes]
- Why use Parquet? [2 minutes]
- Advantage of Parquet and Columnar Storage [2 minutes]
- Python and Parquet: PyArrow and fastparquet [2 minutes]
- Reading and Writing Parquet: [15 minutes]
- Column Projection
- Predicate Push Down
- Dictionary encoding and filtering
- Data Partitioning
- Data pipelines optimized. [2 minutes]
- Q/A [5 minutes]
Basic knowledge about Python and Database.
Familiarity with Python Scientific Computing Libraries: Pandas and Dask
Useful links for the related content, however, they are not prerequisite:
Lalit is Senior Data Scientist at Ecolab. His work at Ecolab involves building data-driven products that are scalable for multiple Ecolab customers. His previous work experience includes working with bioinformatics and online advertising data to design services and products using Python. He holds an M.tech from IIT Madras where he worked on designing self-tuning controllers using feedback data from systems.