Scraping to the rescue

Satwik Kansal (~satwik)



Description:

The talk will focus on getting started with web scraping in Python. Scraping in Python can be done in several ways; the aim of this talk is to give attendees the nitty-gritty details so that, by the end, they can judge for themselves which approach to take and which libraries/tools to use for the problems they intend to solve. The talk will cover useful scraping libraries/tools and the tricks and techniques needed to scrape even hard-to-scrape sites effectively. Hard-to-scrape sites are those that build the DOM with JavaScript, require authentication, present captchas, depend on cookies, etc.
We'll illustrate the possible approaches, with their pros and cons, for the following tasks:

1. Obtaining the HTML.
2. Parsing HTML and extracting useful information.
3. Tackling countermeasures such as hidden form fields, per-IP-address query limits, user-agent blocking, and dynamic pages that use JavaScript.
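
As a rough illustration of tasks 2 and 3, here is a minimal sketch that parses a page and pulls out its links and hidden form fields using only the standard library's `html.parser`. It is not taken from the talk: the sample markup, the `csrf_token` field name, and the `extract` helper are assumptions made for the example, and the HTML is an inline string rather than a fetched page so that the snippet runs offline.

```python
from html.parser import HTMLParser


class LinkAndHiddenFieldParser(HTMLParser):
    """Collect link targets and hidden form fields from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []          # href values of <a> tags, in document order
        self.hidden_fields = {}  # name -> value for <input type="hidden">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            # Hidden fields (for example CSRF tokens) usually have to be
            # sent back unchanged when the form is submitted.
            self.hidden_fields[attrs["name"]] = attrs.get("value") or ""


def extract(html):
    """Return (links, hidden_fields) found in the given HTML string."""
    parser = LinkAndHiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.hidden_fields


# Hypothetical page content, standing in for HTML obtained in task 1.
SAMPLE_HTML = """
<html><body>
  <a href="/talks/1">First talk</a>
  <a href="/talks/2">Second talk</a>
  <form method="post" action="/login">
    <input type="hidden" name="csrf_token" value="abc123">
    <input type="text" name="username">
  </form>
</body></html>
"""

if __name__ == "__main__":
    links, hidden = extract(SAMPLE_HTML)
    print(links)
    print(hidden)
```

In practice the hidden fields collected this way would be merged into the data sent with the form's POST request; third-party parsers offer more convenient selectors, but the idea is the same.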

We'll also go through some real-world code examples to give attendees a sense of what it takes to successfully extract the data they need. At the end we'll cover some scraping ethics to be aware of, so that nobody ends up in trouble.
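
One common reading of scraping ethics is to respect a site's robots.txt and to space out requests. The sketch below shows that idea with the standard library's `urllib.robotparser`; it is an assumption about what such a practice could look like, not the talk's own material. The robots.txt content, the `example-bot` user agent, the example.com URLs, and the `Throttle` class are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it
# from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(rules, user_agent, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = rules.crawl_delay(user_agent)
    return delay if delay is not None else default


class Throttle:
    """Ensure at least delay_seconds pass between consecutive requests."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.delay_seconds - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    throttle = Throttle(polite_delay(rules, "example-bot"))
    for url in ["https://example.com/talks", "https://example.com/private/x"]:
        if rules.can_fetch("example-bot", url):
            throttle.wait()  # pause before each permitted request
            print("would fetch", url)
        else:
            print("skipping (disallowed)", url)
```

Calling `throttle.wait()` before every request keeps the request rate bounded even when the scraping loop itself is fast.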

Prerequisites:

  1. Basic HTML and CSS knowledge.
  2. Knowledge of the HTTP methods GET and POST.
  3. Familiarity with the Python language.

Content URLs:

Slides
https://docs.google.com/presentation/d/1vH8iglKUqzzydG0NK_lW0TtghFxu6U29KHrOGlHNmEk/pub?start=false&loop=false&delayms=5000

Speaker Info:

Satwik Kansal:
B.Tech student at Delhi Technological University. Python enthusiast and web developer, interested in Web scraping and Data Analysis.

Pradhvan Bisht:
A CS sophomore. Pythonista and a web developer interested in FOSS.

Speaker Links:

Satwik Kansal:
http://satwikkansal.xyz
https://www.github.com/satwikkansal

Pradhvan Bisht:
https://in.linkedin.com/in/pradhvan-bisht-8285a2116
https://github.com/Pradhvan

Section: Web Development
Type: Talks
Target Audience: Beginner
Last Updated:

Do you happen to have a list of useful websites that could be scraped? I am suggesting a "ToDo" list, along the lines of "Now that we have shared our knowledge, how about you (the audience) try working on these sites".

sankarshan mukhopadhyay (~sankarshan)

I recently wrote a blog post, "Getting started with scraping in python". The site I used for the tutorial is this "Pycon India 2016" site itself. Please have a look at it: http://satwik.ghost.io/2016/06/19/any/ (this is a temporary link for now). As for useful sites to scrape, there's no limit. A rule of thumb may be: "Whenever you want to use data that a website displays, but the website doesn't have an API, you should scrape." For example, you can scrape bookmyshow.com for a list of events in your area, Flipkart for building sites like buyhatke, cricbuzz for cricket statistics and scores, and so on. However, I'll make a list of all the practical applications I can think of and upload a link to it soon. Thanks for your suggestions, by the way :).

Satwik Kansal (~satwik)

There's a fine line between scraping a site and DoSing a site. Please explain to the audience the importance of not accidentally bringing the site down.

sankarshan mukhopadhyay (~sankarshan)

Thanks, I've added a slide for 'Ethics of Scraping' :)

Satwik Kansal (~satwik)

Seems interesting and informative. Looking forward to it!

Abhishek Chauhan (~abhishek27)
