Mastering a data pipeline with Python: 6 years of lessons learned, from mistakes to success.
Robson Júnior (~bsao)
Building data pipelines is a well-established task, and a vast number of tools automate the work and help developers create data pipelines with a few clicks in the cloud. Those tools can solve simple or well-defined standard problems. This presentation demystifies the rest, drawing on years of experience and painful mistakes using Python as the core for building reliable data pipelines and managing insanely large amounts of valuable data. Let’s cover how each piece fits into this puzzle: data acquisition, ingestion, transformation, storage, workflow management, and serving. We’ll also walk through best practices and possible issues, and cover PySpark vs. Dask and Pandas, Airflow, and Apache Arrow as a new approach.
Robson is a developer deeply involved with software communities, especially the Python community. He has been organizing conferences and meetups since 2011, speaking at conferences about Python and cloud technologies since 2012, and speaking about data-related technologies since 2016. As an independent consultant, he also conducts on-demand architecture consultancy and training sessions on data-related technologies.