Data processing with MapReduce: The Python Way

Vishal Kanaujia (~vishalkanaujia)




Modern businesses, such as e-commerce companies, analyze huge data sets to derive business trends and patterns. Hadoop is the most popular choice for storing large data in its HDFS file system, and it introduced the MapReduce programming model for building distributed, highly scalable data-crunching applications.

Python provides an array of tools to handle the different stages of data analysis. Data analysis involves the following stages:

  • Data Preprocessing (Pandas)
  • Data Analysis (scikit-learn, nltk)
  • Data Pipeline (MRjob, Hadoop Streaming)
  • Data Visualization (matplotlib)

In this talk, we will discuss the 'Data Pipeline' stage and the Python modules it offers for developing MapReduce applications that work seamlessly on a Hadoop cluster. We will also demonstrate practical MapReduce programming examples with the following Python modules:

  • MRjob
  • Hadoop Streaming

The audience will learn the basics of MapReduce principles and how to develop scalable MapReduce programs to process data sets on Hadoop.
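To make the MRjob side concrete, here is a minimal word-count sketch. The `MRJob` base class with its `mapper`/`reducer` hooks is mrjob's actual API; the helper generators and the import guard are our own additions so the core logic can be read (and tried) even without mrjob installed:

```python
# Word count with mrjob -- a minimal sketch.
# The map and reduce logic is factored into plain generator functions
# (our own naming) so it is readable and testable on its own.

def count_words(line):
    """Mapper logic: yield (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def sum_counts(word, counts):
    """Reducer logic: yield the total count for one word."""
    yield word, sum(counts)

try:
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # mrjob calls this once per input line; the key is ignored.
            yield from count_words(line)

        def reducer(self, word, counts):
            # mrjob groups values by key before calling the reducer.
            yield from sum_counts(word, counts)

except ImportError:
    MRJob = None  # mrjob not installed; the plain functions above still work

if __name__ == '__main__':
    if MRJob is not None:
        MRWordCount.run()  # runs locally, or on Hadoop with -r hadoop
```

The same script runs unchanged on a laptop (`python wordcount.py input.txt`) or on a cluster (`python wordcount.py -r hadoop input.txt`), which is the portability the talk highlights.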
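Hadoop Streaming, by contrast, runs any executable that reads stdin and writes stdout, with tab-separated key/value lines between the map and reduce phases. A self-contained sketch (function names and the map/reduce mode switch are our own; Hadoop guarantees the reducer input is sorted by key):

```python
# Word count for Hadoop Streaming: one script, two stages.
# Mapper emits "word<TAB>1"; reducer receives lines sorted by key
# and sums counts per word using itertools.groupby.
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word\t1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_pairs(lines):
    """Reducer: input lines are sorted by key; sum the counts per word."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Pick the stage from the command line, e.g.:
    #   python wordcount.py map < input.txt | sort | python wordcount.py reduce
    stage = reduce_pairs if sys.argv[1:] == ["reduce"] else map_lines
    for out in stage(sys.stdin):
        print(out)
```

On a cluster the same script would be passed to the streaming jar as both `-mapper` and `-reducer` (with the appropriate mode argument); Hadoop itself replaces the `sort` step in the shell pipeline above.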

Speaker Info:

Vishal Kanaujia is a Python developer and technology enthusiast. He has delivered talks and written articles on Python, and was a speaker at international conferences including PyCon APAC (Singapore) and PyCon New Zealand. He has also delivered talks at PyCon and SciPy India. His interests include Python internals, filesystems, search technologies, and application performance optimization.

He shares his thoughts, opinions, and knowledge through his blog, technical papers, and articles in international magazines including 'Linux for You' and 'Agile Record'.

Section: Data Visualization and Analytics
Type: Talks
Target Audience: Beginner
Last Updated:

Hi Vishal,

Thank you for the proposal. Can you provide some information on the following :

  1. Pandas, scikit-learn, MRJob, Matplotlib: each of these is comprehensive in itself; how do you plan to cover them in the stipulated time?
  2. You mentioned demonstrating a few practical applications; can you share what kind of examples you are planning to cover?
  3. What features will you be covering when you talk about MRJob? Do you also plan to cover topics like deployment and dependency resolution on the data nodes when using Python with MRJob?
  4. What data sets are you planning to use?
  5. Can you also share your previous talk videos / slides?

Looking forward to more discussion.

konark modi (~konark)
