Data Processing with MapReduce: The Python Way
Vishal Kanaujia (~vishalkanaujia)
Modern businesses such as e-commerce companies analyze huge data sets to derive business trends and patterns. Hadoop is one of the most popular choices for storing large data sets, using its HDFS file system. It also provides the MapReduce programming model, which is used to build distributed, highly scalable data-crunching applications.
Python provides an array of tools to handle the different stages of data analysis. Data analysis involves the following stages:
- Data Preprocessing (Pandas)
- Data Analysis (scikit-learn, nltk)
- Data Pipeline (MRjob, Hadoop Streaming)
- Data Visualization (matplotlib)
In this talk, we will discuss the 'Data Pipeline' stage and the Python modules used to develop MapReduce applications that work seamlessly on a Hadoop cluster. We will also demonstrate practical MapReduce programming examples with the following Python modules:
- Hadoop Streaming
The audience will learn the basics of MapReduce and how to develop scalable MapReduce programs that process data sets on Hadoop.
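As a rough illustration of the map and reduce steps, here is a minimal word-count sketch in the style of a Hadoop Streaming job, which exchanges tab-separated key/value lines between stages. The function names (`mapper`, `reducer`, `run_local`) and the sample input are assumptions made for this example, not material from the talk. The whole pipeline is simulated in one process here; on a real cluster the mapper and reducer would typically be separate scripts reading standard input, with Hadoop performing the sort between them.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_pairs):
    """Reduce step: sum the counts per word. Input must be sorted by key."""
    parsed = (pair.rstrip("\n").split("\t", 1) for pair in sorted_pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, used only to exercise the sketch.
    sample = ["Hadoop stores data", "Python processes data"]
    for record in run_local(sample):
        print(record)
```

The reducer relies on its input being sorted by key, so that all counts for a word arrive consecutively and can be summed in a single pass.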
Vishal Kanaujia is a Python developer and technology enthusiast. He has delivered talks and written articles on Python. He has spoken at the international conferences PyCon APAC Singapore and New Zealand, and has also delivered talks at PyCon and SciPy India. His interests include Python internals, filesystems, search technologies, and application performance optimization.
He shares his thoughts, opinions, and knowledge through his blog (http://freethreads.wordpress.com), technical papers, and articles in international magazines including ‘Linux for You’ and ‘Agile Record’.