Scrape Anything - Even the most difficult sites
by Shashank Shekhar (speaking)
Objective
This workshop will start from the basics of web scraping and proceed to the techniques required to scrape even hard-to-scrape sites, and to run concurrent scrapers to scale up the process.
Hard-to-scrape sites are those that load the DOM with JavaScript, require authentication, etc.
Description
Python and its available modules make web scraping an easy task.
But some sites require you to be creative when writing your scraping solutions, e.g. appannie.com and jelly.com.
In this workshop we will discuss such common roadblocks and how to overcome them.
We will also discuss how to introduce concurrency to your scrapers.
Requirements
A laptop with:
- Linux
- Internet Connectivity (Provided at the Venue?)
- Python
- BeautifulSoup4
- Requests
- Selenium
- Chrome/Firefox
- Redis Server
- redis (Python wrapper)
- Celery-with-Redis
Speaker bio
I am Shashank Shekhar.
I work as a Developer in Noida. I spend most of my time making REST APIs and web scrapers in Python.
Sounds exciting. Can you please share some links regarding the content of the workshop/session? Links to any of your previous talks, videos, or slides would also work.
Hi, I am sorry for being so late with the replies.
@Devanshu this is my first participation at any conference, so I am looking for some guidance in case I am going wrong somewhere. It would be very helpful if you could tell me which topics to add or remove.
@Anand I can provide a link to a scraper for Zomato that I am writing for the purpose of PyCon: https://gist.github.com/shshank/0407cec8887e6ea0a856
Below are the details of the scraper I would like to talk about.
Making a basic scraper:
- Parsing pages using BeautifulSoup.
- Fetching pages using Requests. Example with Zomato.com (see the first sketch below).
- Selenium with PhantomJS for sites which load the DOM using JavaScript, and for form-filling tasks, including authentication. Example with Appannie.com (see the second sketch below).
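Below is a minimal sketch of the Requests + BeautifulSoup part. The URL, the User-Agent header, and the selectors are only assumptions for illustration; Zomato's actual markup should be inspected in the browser first.

# Minimal sketch: fetch a page with Requests, parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = "https://www.zomato.com/ncr/some-restaurant"  # hypothetical restaurant page
headers = {"User-Agent": "Mozilla/5.0"}  # many sites block the default Requests UA

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors -- replace with the real ones after inspecting the page
name = soup.select_one("h1")                           # restaurant name
links = [a["href"] for a in soup.select("a[href]")]    # all outgoing links

print(name.get_text(strip=True) if name else "name not found")
print(len(links), "links found on the page")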
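And a minimal sketch of the Selenium part, assuming Selenium 2.x/3.x where webdriver.PhantomJS() is still available (it was removed in Selenium 4 in favour of headless Chrome/Firefox). The login URL and the form field names are assumptions for illustration, not Appannie's real ones.

# Minimal sketch: drive a JavaScript-heavy site and fill a login form.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.appannie.com/account/login/")  # hypothetical login URL

# Field names are assumed -- inspect the real form before using them
driver.find_element_by_name("username").send_keys("you@example.com")
driver.find_element_by_name("password").send_keys("secret")
driver.find_element_by_name("password").submit()

# After login, the JavaScript-rendered DOM is available as page_source
# and can be handed to BeautifulSoup for parsing.
html = driver.page_source
driver.quit()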
Making a crawler for Zomato:
- Making a parser for the restaurant page (done in the first part).
- Finding link patterns of restaurant pages.
- Finally, making a crawler using the code above, and adding Celery + Redis for task queues and for keeping track of already visited pages (see the sketch after this list).
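Here is a minimal sketch of how the Celery + Redis pieces could fit together. The broker URL, the Redis set name, and the "/restaurant/" link pattern are assumptions for illustration, and the parse step is only a placeholder for the parser built in the first part.

# Minimal sketch: a Celery task queue backed by Redis, with a Redis set
# used to keep track of already-visited pages.
from urllib.parse import urljoin

import redis
import requests
from bs4 import BeautifulSoup
from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")
seen = redis.StrictRedis(host="localhost", port=6379, db=1)

@app.task
def crawl(url):
    # SADD returns 0 if the URL was already in the set -- skip duplicates
    if not seen.sadd("visited_urls", url):
        return

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # ... parse the restaurant page here (the parser from the first part) ...

    # Queue every restaurant-looking link on the page as a new task.
    # The "/restaurant/" check is a placeholder for the real link pattern.
    for a in soup.select("a[href]"):
        href = a["href"]
        if "/restaurant/" in href:
            crawl.delay(urljoin(url, href))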
Any feedback will be helpful.
Can you give more info on why Celery and Redis are required? Also, the internet at the venue can be flaky at times and is not completely reliable.
Could you please share links to any of the scrapers that you've written?