Scraping to the rescue

Satwik Kansal (~satwik) | 06 Apr, 2016

Description:

The talk focuses on getting started with web scraping in Python. Scraping in Python can be done in many ways; the aim of this talk is to give attendees the nitty-gritty details so that, by the end, they can judge for themselves which approach to take and which libraries and tools to use for the problems they intend to solve. The talk will cover useful scraping libraries and tools, along with neat tricks and techniques for effectively scraping even hard-to-scrape sites. Hard-to-scrape sites are those that load the DOM with JavaScript, need authentication, require captchas, rely on cookies, etc.

We'll illustrate different possible approaches, with their pros and cons, for getting the following tasks done:

1. Obtaining the HTML.
2. Parsing HTML and extracting useful information.
3. Tackling countermeasures such as hidden form fields, per-IP-address query limits, user-agent blocking, and dynamic pages using JavaScript.
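
The tasks above can be sketched with nothing but the standard library. This is a minimal illustration, not the approach the talk will necessarily present: the helper names `build_request` and `extract_hidden_fields`, the user-agent string, and the sample form are all assumptions made up for this example. It shows how to prepare a request with a custom User-Agent header (one way to deal with user-agent blocking) and how to collect hidden form fields from a page so they can be sent back with a POST.

```python
from html.parser import HTMLParser
from urllib.request import Request


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Attribute values can be None when written without "=...".
        input_type = (attributes.get("type") or "").lower()
        name = attributes.get("name")
        if input_type == "hidden" and name:
            self.hidden_fields[name] = attributes.get("value") or ""


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Return a Request that carries a browser-like User-Agent header.

    The user-agent string is a made-up placeholder for illustration.
    Pass the result to urllib.request.urlopen() to actually fetch the page.
    """
    return Request(url, headers={"User-Agent": user_agent})


def extract_hidden_fields(html):
    """Parse an HTML string and return its hidden form fields as a dict."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_fields


# Hypothetical login form, used only to demonstrate the parser.
SAMPLE_FORM = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

if __name__ == "__main__":
    print(extract_hidden_fields(SAMPLE_FORM))
```

In practice, third-party libraries such as requests and BeautifulSoup make the same steps shorter; the standard-library version is used here only to keep the sketch self-contained.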

We'll also go through some real-world code examples to give attendees a sense of what it takes to successfully extract the data they need. At the end, we'll mention some scraping ethics to be aware of, so that no one ends up getting anyone into trouble.
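
As one possible example of such an ethic (an assumption; the proposal does not say which ethics will be covered), a scraper can check a site's robots.txt rules before fetching a URL. The sketch below uses the standard library's `urllib.robotparser`; the helper name `is_allowed`, the sample rules, and the user-agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public/page", "/private/data"):
        url = "https://example.com" + path
        print(url, is_allowed(SAMPLE_ROBOTS, "example-scraper", url))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would load the real robots.txt instead of a string.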

Prerequisites:

Basic HTML and CSS knowledge.
Knowledge of the HTTP methods GET and POST.
Familiarity with the Python language.

Content URLs:

Slides
https://docs.google.com/presentation/d/1vH8iglKUqzzydG0NK_lW0TtghFxu6U29KHrOGlHNmEk/pub?start=false&loop=false&delayms=5000

Speaker Info:

Satwik Kansal: B.Tech student at Delhi Technological University. Python enthusiast and web developer, interested in web scraping and data analysis.

Pradhvan Bisht: A CS sophomore. Pythonista and web developer interested in FOSS.

Speaker Links:

Satwik Kansal:
http://satwikkansal.xyz
https://www.github.com/satwikkansal

Pradhvan Bisht:
https://in.linkedin.com/in/pradhvan-bisht-8285a2116
https://github.com/Pradhvan

Section:	Web Development
Type:	Talks
Target Audience:	Beginner
Last Updated:	13 Jul, 2016
