Crawling your own Dataset for Research using Python
Siddhartha Anand (~siddhartha2)
In the world of big data, there are countless reasons to analyse it: a tweet that is being retweeted, a story that gets shared on Facebook, a political campaign predicting voter behaviour, or a recommender suggesting a recipe you would love to try. Data has pervaded our lives; we create it, knowingly and unknowingly, every second. Yet researchers often feel helpless when they need to gather data that is specific to their own requirements. They may have to rely on publicly hosted datasets that are old and out of date, or make do with half-baked ones.
Gathering large-scale data is still a distant dream for most researchers in Computer Science, since not all of them are lucky enough to get data from internet giants like Google or Facebook. This talk will cover the different ways of collecting data from different kinds of data sources, and the tools that make doing so easy. It will be an overview of the different ways to collect data for your own research.
- Where is the data?
- Data-intensive applications (weather prediction, financial markets, textual data for NLP, chatbots, sports analysis, recommender systems, targeted advertising, face recognition, answer recommendations, etc.)
- Already existing datasets (SNAP from Stanford, Kaggle), and why avenues for collecting fresh data are limited
- What to do then? Small-scale and large-scale crawling; static and dynamic websites
- How to? Tools (Selenium, PhantomJS, Scrapy, BeautifulSoup, the Twitter API, the Facebook API, Zomato, DBLP)
- Examples for each
- Obeying robots.txt
Prerequisites:
- Basics of the Python language
- An idea of web crawling/scraping
- An interest in data
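As a small taste of the "obeying robots.txt" and small-scale crawling points in the outline above, here is a minimal sketch using only the Python standard library. The robots.txt rules, URLs, and HTML snippet are made up for illustration; on a real crawl you would download robots.txt with `RobotFileParser.set_url()`/`read()` and fetch pages with `urllib.request` or the `requests` library.

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

# --- Obeying robots.txt -------------------------------------------------
# Parse a hypothetical rule set offline; a real crawler would instead do
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("my-research-bot", "https://example.com/articles/1"))  # True
print(rp.can_fetch("my-research-bot", "https://example.com/private/x"))   # False

# --- Small-scale scraping of a static page ------------------------------
# A tiny parser that collects every link (href) found on a page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<a href="/articles/1">One</a> <a href="/articles/2">Two</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/articles/1', '/articles/2']
```

Tools like Scrapy and BeautifulSoup take over once the crawl grows beyond a handful of pages, and Selenium or PhantomJS become necessary when pages are rendered with JavaScript.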
I am a software developer who has worked in this field for the last 4 years, and I have been in love with Python since the day I met the language. I have spent weekdays and weeknights using Python to solve the many problems I have faced. When I am not working, you will usually find me tinkering with newer technologies, mostly related to Python. I am currently developing a full-fledged REST API service driven by a web crawler that runs periodically to build a data-intensive application.