An Intermediate's Guide to (theoretically unlimited) Web Scraping with Python using Requests & lxml & Tor
Laneone (~Laneone)
Ever tried to webscrape?
Ever faced a "No robots allowed! No web scraping allowed!" message from a favorite site?
This talk is meant for you.
Usually, when you're done building a fancy web scraper and start running your homebrewed tool on your favorite site, there's a good chance you'll hit a block on your IP address that stops your computer from accessing more resources and downloading the site's contents. Your tool may be fast, it might be scalable, it might be the best-written scraper out there, but with just one IP address under your belt, it's easy for the giants to block that address and keep you from that precious data, especially if you've built a thread-safe, multi-node web scraper.
Enter The Onion Router.
The Tor project lets you use the internet via a proxy and visit the same website from a different endpoint IP address, but that's just for one instance of Tor.
What if you ran, say, 200 at once?
200 IP addresses > 1 IP address.
With 200 endpoints and the latest update to the requests library, you can now run your multi-threaded, resource-hungry web scraper and it can(not) be stopped! Whatever your rate of data collection, you can 200x it!
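Running many Tor instances side by side could be sketched roughly as follows. This is a minimal illustration, not the setup used in the talk: the helper names (tor_command, launch_instances), the starting port 9050, and the data directory layout are assumptions. It relies on the tor binary accepting --SocksPort and --DataDirectory on the command line, and on each instance having its own data directory. Separate instances build their own circuits, so they will usually, but not necessarily, exit from different IP addresses.

```python
import subprocess
import tempfile
from pathlib import Path

BASE_SOCKS_PORT = 9050  # assumption: start numbering at Tor's default SOCKS port


def tor_command(index, base_port=BASE_SOCKS_PORT, data_root=None):
    """Return (socks_port, argv) for the index-th Tor instance."""
    root = Path(data_root) if data_root else Path(tempfile.gettempdir()) / "tor-instances"
    port = base_port + index
    data_dir = root / f"tor{index}"  # each instance needs its own data directory
    argv = ["tor", "--SocksPort", str(port), "--DataDirectory", str(data_dir)]
    return port, argv


def launch_instances(count, **kwargs):
    """Start `count` Tor processes and return {socks_port: Popen}."""
    processes = {}
    for index in range(count):
        port, argv = tor_command(index, **kwargs)
        Path(argv[-1]).mkdir(parents=True, exist_ok=True)
        processes[port] = subprocess.Popen(argv)
    return processes


if __name__ == "__main__":
    # Only print the commands here; call launch_instances() to actually start Tor.
    for i in range(3):
        print(tor_command(i))
```

Each returned port is then a separate SOCKS5 endpoint that a scraper can be pointed at.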
The stack is simple: you open a SOCKS5 proxy per Tor endpoint, each on its own port number, route a request through it, and you're good for that one request. The same goes for multiple requests. You can build a task scheduler that pairs each URL to scrape with the port number of a Tor endpoint, and run the entire application on a cloud service provider to ensure you face no bandwidth issues.
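The stack described above might look something like this in code. It is a sketch under stated assumptions, not the speaker's implementation: the function names are made up for illustration, the round-robin pairing stands in for whatever task scheduler you prefer, and the Tor instances are assumed to be listening on 127.0.0.1. Requests needs its SOCKS extra installed (pip install requests[socks]); the socks5h scheme asks the proxy to resolve host names, so DNS lookups also go through Tor.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def proxies_for_port(port, host="127.0.0.1"):
    """Build a requests-style proxies mapping for one Tor SOCKS port."""
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


def assign_ports(urls, ports):
    """Pair each URL with a SOCKS port, cycling through the ports in order."""
    return list(zip(urls, cycle(ports)))


def fetch(url, port, timeout=30):
    """Download one URL through the Tor instance listening on `port`."""
    # Imported lazily so the helpers above work without requests installed.
    import requests

    response = requests.get(url, proxies=proxies_for_port(port), timeout=timeout)
    response.raise_for_status()
    return response.text


def scrape(urls, ports, workers=8):
    """Fetch all URLs concurrently, spreading them across the given ports."""
    jobs = assign_ports(urls, ports)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: fetch(*job), jobs))
```

For example, scrape(list_of_urls, range(9050, 9250)) would spread the downloads over 200 local endpoints, assuming that many Tor instances are actually running on those ports.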
The demo accompanying the talk will attempt to rapidly scrape users from the well-known social network Ask.fm, which is known to block you from retrieving data if you attempt to download more than 4 users in under a second. With the hack in place, you'll be retrieving at close to maximum efficiency on a DigitalOcean droplet, and the same approach can be applied to virtually any website and any cloud provider.
Never pay for webscraping again!
Thanks and see you at PyCon! -Lokesh Poovaragan
Basic concepts of web scraping, Regex, Task scheduler, ports and proxies!
Hi, I'm Loki! (Lokesh Poovaragan)
I'm a full-stack developer from Dayananda Sagar, Bangalore, and I love to code in Python! In my free time I love to scrape the web, gather good amounts of public data, and package it as JSON for data (sentiment) analysis. I also build prototypes that combine technologies in interesting ways to solve unique problem statements. I love exploring new and interesting areas of work and I love to play with code!