Building a data pipeline with Python

animenon | 31 May, 2019

Description:

A talk about building a data pipeline in Python in two ways:

Using pandas (for datasets that fit in memory on a single machine)
Using PySpark (for big data)

Data Pipeline Illustration

A data pipeline involves data ingestion, transformation (cleansing and manipulation), and storage. It can also involve deriving insights from the data, which I don't intend to cover in this talk.
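The three stages named above can be sketched in a few lines of pandas. This is only an illustration, not material from the talk: the `ingest`, `transform`, and `store` helpers, the `order_id` and `amount` columns, and the in-memory CSV are all assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical raw input; a real pipeline would read from a file, database, or API.
RAW_CSV = """order_id,amount
1,10.5
2,
3,7.25
"""


def ingest(source):
    """Ingestion: load raw records into a DataFrame."""
    return pd.read_csv(source)


def transform(frame):
    """Transformation: drop rows whose amount is missing."""
    return frame.dropna(subset=["amount"]).reset_index(drop=True)


def store(frame, sink):
    """Storage: write the cleaned records out as CSV."""
    frame.to_csv(sink, index=False)


# An in-memory buffer stands in for the storage layer (assumption).
sink = io.StringIO()
store(transform(ingest(io.StringIO(RAW_CSV))), sink)
stored_text = sink.getvalue()
print(stored_text)
```

Keeping each stage as its own function makes it possible to swap the source or sink (for example, a database table instead of a CSV buffer) without touching the transformation logic.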

Pandas Data Pipeline

Intro

Why pandas? How and when to use it?

Small Demo -

Tools used - pandas module, numpy module

Steps Involved

Create Pandas DataFrame
Data Preparation and Transformations (missing value imputation, removing outliers, and other cleansing tasks)
Exploratory data analysis
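A minimal sketch of these three steps, assuming a small made-up dataset. The column names, the median imputation, and the 1.5 x IQR outlier rule are illustrative choices, not necessarily what the demo uses.

```python
import numpy as np
import pandas as pd

# Step 1: create a pandas DataFrame (hypothetical sample data).
raw = pd.DataFrame(
    {
        "city": ["Bangalore", "Delhi", "Mumbai", "Chennai", "Pune", "Kolkata"],
        "temperature_c": [28.0, 31.0, np.nan, 29.0, 250.0, 30.0],
    }
)

# Step 2a: missing value imputation, here with the column median,
# which is less sensitive to extreme values than the mean.
median_temp = raw["temperature_c"].median()
clean = raw.assign(temperature_c=raw["temperature_c"].fillna(median_temp))

# Step 2b: remove outliers with the 1.5 * IQR rule (one common heuristic).
q1 = clean["temperature_c"].quantile(0.25)
q3 = clean["temperature_c"].quantile(0.75)
iqr = q3 - q1
in_range = clean["temperature_c"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

# Step 3: exploratory data analysis, starting with summary statistics.
summary = clean["temperature_c"].describe()
print(summary)
```

With this sample the missing Mumbai reading is filled with the median and the implausible 250 reading is dropped, leaving five rows for analysis.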

PySpark Data Pipeline

Why Apache Spark? How and when is it used?

Small Demo -

Tools used - Apache Spark, Apache Hive

Steps Involved

Create a Spark DataFrame
Data Preparation and Transformations (missing value imputation, removing outliers, and other cleansing tasks)
Exploratory data analysis

Prerequisites:

- Python
- Data Management basics

Content URLs:

Slides: https://docs.google.com/presentation/d/1iMssWK2V8SPSaXC9dMVtbGbETH3lUADwZ1XH7bulm5A

Python for Natural Language Processing

Apache Spark for Beginners

Speaker Info:

Anirudh has over 3 years of Python development experience and has presented talks on Data Engineering and Big Data on many platforms. He also presented a talk on Natural Language Processing at a Bangalore Python User Group meet in 2015.

Anirudh works as an Asst Manager, Big Data Analytics at Genpact. He designs and builds data pipelines using Python and big data tools and technologies.

Speaker Links:

GitHub

StackOverflow

Section:	Data Science, Machine Learning and AI
Type:	Talks
Target Audience:	Intermediate
Last Updated:	21 Aug, 2019
