Explore Big Data using Simple Python Code and Cloud Environment
The main goal of the talk is to build confidence that anyone with basic Python knowledge can start exploring public data sets by leveraging cloud infrastructure. During the talk the author will demo processing a public data set (Wikipedia access logs, about 300 GB) and extracting insights from it (for example, the top 20 pages accessed on English Wikipedia over one month) with fewer than 30 lines of simple Python code, using Amazon EMR (Elastic MapReduce), a service that takes care of provisioning the computing power as well as installing and configuring a Hadoop MapReduce cluster. In the author's experience, one of the main bottlenecks for beginners learning Big Data is the complexity of configuring a Hadoop cluster. Instead of focusing on the mundane work of configuring infrastructure, one can leverage the existing, economically feasible cloud infrastructure and spend that time exploring the data.
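The core of such a "top 20 pages" job can be sketched in a few lines of standard-library Python. This is not the author's demo code, just a minimal illustration of the counting logic: the four-field line format assumed below (project, page title, view count, bytes transferred) follows the published Wikipedia pagecount dumps, and the function name and sample lines are made up for illustration.

```python
from collections import Counter

def top_pages(lines, n=20, project="en"):
    """Aggregate Wikipedia pagecount log lines; return the n most-viewed pages.

    Each line of the public pagecount dumps is assumed to look like:
        project page_title view_count bytes_transferred
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        proj, page, views, _bytes = parts
        if proj != project:
            continue  # keep only the requested project (e.g. English Wikipedia)
        try:
            counts[page] += int(views)
        except ValueError:
            continue  # non-numeric view count
    return counts.most_common(n)

# Illustrative sample lines (not real dump data):
sample = [
    "en Main_Page 120 456789",
    "en Python_(programming_language) 300 99999",
    "de Hauptseite 500 12345",
    "en Main_Page 80 22222",
]
print(top_pages(sample, n=2))
# → [('Python_(programming_language)', 300), ('Main_Page', 200)]
```

On EMR the same per-line parsing and per-key summing would be split across a Hadoop streaming mapper and reducer, which is what keeps the full job under a few dozen lines.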
The author will publish a blog post with detailed steps on how to set up the environment and start learning Big Data using Python and the cloud. Audience members can try the same in their free time if they are willing to spend 3-4 dollars on cloud costs.
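As a sketch of the kind of setup such a walkthrough would cover, a Hadoop cluster can be launched from the AWS CLI in a single command. The cluster name, release label, key pair, instance type, and count below are illustrative placeholders, not the talk's actual configuration; costs depend on the choices made here.

```shell
# Launch a small EMR cluster with Hadoop preinstalled
# (all names and sizes are illustrative; adjust to your own account/region).
aws emr create-cluster \
    --name "wikipedia-logs-demo" \
    --release-label emr-4.2.0 \
    --applications Name=Hadoop \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair

# Terminate the cluster when finished, or charges keep accruing.
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXX
```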
Presentation and supporting documentation will be posted here : https://github.com/ravvas/Pycon2015
Harikrishna Ravva - The author currently works as a Performance Engineering Lead at Accenture. He is still a beginner in Python, having built a few automation tools in Python and put them to use in his projects.
GitHub : https://github.com/ravvas