Scrape Anything - Even the most difficult sites
by Shashank Shekhar (speaking)
Objective
This workshop will start from the basics of web scraping and proceed to the techniques required to scrape even hard-to-scrape sites, and to run concurrent scrapers to scale up the process.
Hard-to-scrape sites are those that load the DOM with JavaScript, require authentication, etc.
Description
Python and its available modules make web scraping an easy task.
But some sites require you to be creative when writing your scraping solutions, e.g. appannie.com and jelly.com.
In this workshop we will discuss such common roadblocks and how to overcome them.
We will also discuss how to introduce concurrency to your scrapers.
Requirements
A laptop with:
- Linux
- Internet Connectivity (Provided at the Venue?)
- Python
- BeautifulSoup4
- Requests
- Selenium
- Chrome/Firefox
- Redis Server
- redis (Python wrapper)
- Celery-with-Redis
Speaker bio
I am Shashank Shekhar.
I work as a Developer in Noida. I spend most of my time making REST APIs and web scrapers in Python.
Sounds exciting. Can you please share some links regarding the content of the workshop/session? Links to any of your previous talks, videos, or slides would also work.
Hi, I am sorry for being so late with the replies.
@Devanshu this is my first participation at any conference, so I am looking for some guidance in case I am going wrong somewhere. It would be very helpful if you could tell me which topics to add or remove.
@Anand I can provide a link to a scraper for Zomato that I am writing for the purpose of PyCon: https://gist.github.com/shshank/0407cec8887e6ea0a856
Below are the details of the scraper I would like to talk about.
Making a basic scraper:
- Parsing pages using BeautifulSoup.
- Fetching pages using Requests. Example with Zomato.com (see the first sketch below).
- Selenium with PhantomJS for sites which load the DOM using JavaScript, and for form-filling tasks, including authentication. Example with Appannie.com (see the second sketch below).
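Below is a minimal sketch of the Requests + BeautifulSoup part. The URL, the User-Agent header, and the selectors are only assumptions for illustration; Zomato's actual markup should be inspected in the browser first.

# Minimal sketch: fetch a page with Requests, parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = "https://www.zomato.com/ncr/some-restaurant"  # hypothetical restaurant page
headers = {"User-Agent": "Mozilla/5.0"}  # many sites block the default Requests UA

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors -- replace with the real ones after inspecting the page
name = soup.select_one("h1")                           # restaurant name
links = [a["href"] for a in soup.select("a[href]")]    # all outgoing links

print(name.get_text(strip=True) if name else "name not found")
print(len(links), "links found on the page")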
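And a minimal sketch of the Selenium part, assuming Selenium 2.x/3.x where webdriver.PhantomJS() is still available (it was removed in Selenium 4 in favour of headless Chrome/Firefox). The login URL and the form field names are assumptions for illustration, not Appannie's real ones.

# Minimal sketch: drive a JavaScript-heavy site and fill a login form.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.appannie.com/account/login/")  # hypothetical login URL

# Field names are assumed -- inspect the real form before using them
driver.find_element_by_name("username").send_keys("you@example.com")
driver.find_element_by_name("password").send_keys("secret")
driver.find_element_by_name("password").submit()

# After login, the JavaScript-rendered DOM is available as page_source
# and can be handed to BeautifulSoup for parsing.
html = driver.page_source
driver.quit()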
Making a crawler for Zomato:
- Making a parser for the restaurant page (done in the first part).
- Finding link patterns of restaurant pages.
- Finally, making a crawler using the code above, and adding Celery + Redis for task queues and for keeping track of already visited pages (see the sketch after this list).
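Here is a minimal sketch of how the Celery + Redis pieces could fit together. The broker URL, the Redis set name, and the "/restaurant/" link pattern are assumptions for illustration, and the parse step is only a placeholder for the parser built in the first part.

# Minimal sketch: a Celery task queue backed by Redis, with a Redis set
# used to keep track of already-visited pages.
from urllib.parse import urljoin

import redis
import requests
from bs4 import BeautifulSoup
from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")
seen = redis.StrictRedis(host="localhost", port=6379, db=1)

@app.task
def crawl(url):
    # SADD returns 0 if the URL was already in the set -- skip duplicates
    if not seen.sadd("visited_urls", url):
        return

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # ... parse the restaurant page here (the parser from the first part) ...

    # Queue every restaurant-looking link on the page as a new task.
    # The "/restaurant/" check is a placeholder for the real link pattern.
    for a in soup.select("a[href]"):
        href = a["href"]
        if "/restaurant/" in href:
            crawl.delay(urljoin(url, href))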
Any feedback will be helpful.
Can you give more info on why Celery and Redis are required? Also, the internet at the venue can be flaky at times and is not completely reliable.
Could you please share links to any of the scrapers that you've written?