Building a data pipeline with python

animenon


Description:

A talk about building a data pipeline in python in two ways-

  1. Using pandas (for usual datasets)
  2. Using pyspark (for Big data)

Data Pipeline Illustration

Data pipeline would involve data ingestion, transformation (cleansing, manipulation) and storage. It would also involve deriving insights from this data which I don't intend to cover in this talk.

Pandas Data Pipeline

Intro

Why Pandas? How and When to Use?

Small Demo -

Tools used - pandas module, numpy module

Steps Involved

  • Create Pandas DataFrame
  • Data Preparation and Transformations (Missing value imputation, Removing outliers and other cleansing tasks)
  • Exploratory data analysis

PySpark Data Pipeline

Why Apache Spark? How and When its used?

Small Demo -

Tools used - Apache Spark, Apache Hive

Steps Involved

  • Create a Spark DataFrame
  • Data Preparation and Transformations (Missing value imputation, Removing outliers and other cleansing tasks)
  • Exploratory data analysis

Prerequisites:

  • Python
    • Data Management basics

Speaker Info:

Anirudh has over 3 years of python development experience and has presented talks on Data Engineering and Big Data at many platforms. He has also presented a talk about Natural Language Processing at a Bangalore Python User Group meet in 2015.

Anirudh works as an Asst Manager, Big Data Analytics at Genpact. He designs and builds data pipelines using python and big data tools/technologies.

Speaker Links:

GitHub

StackOverflow

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: