Large scale web crawling using Python
Anand B Pillai (~pythonhacker)
Web crawling is hard. Large scale web crawling, which involves crawling millions of web pages a month across 500 to 1000 websites, is even harder.
Python comes with a number of libraries that allow you to do such crawling-at-scale, but a lot of real-world issues have to be tackled to get the crawling infrastructure right. Some of these are:
- Crawl rates - You need to strike the right balance here: don't crawl too aggressively, but at the same time don't crawl so slowly that the crawl finishes too late.
- Right Data - You need to make sure you crawl the right parts of the websites to get the data you want.
- Don't get blocked! - Crawling from the same set of IP addresses will get you blocked on most modern websites. You need some kind of rotating web proxy infrastructure to make sure that crawls can continue without getting kicked out.
- Capturing Errors - You need to capture crawling errors so you can detect most issues and surface them, even while doing distributed crawling.
Having nearly a decade of experience writing custom web crawlers, the speakers have developed a set of custom tools to make crawling easy and painless. One of these is a tool that creates a set of rotating web proxy caching nodes, which use Squid and are fronted by an HTTP load balancer. The other is a distributed crawler that uses Django as the middleware to distribute crawling across multiple crawler nodes while managing crawls in one place.
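The core idea of rotating a crawler's requests across a pool of proxy nodes can be sketched as follows. This is a minimal illustration, not the speakers' tool: the proxy addresses are made up, and a real deployment would point at the HTTP load balancer described above rather than rotate client-side.

```python
import itertools
import urllib.request

class ProxyRotator:
    """Round-robin over a pool of HTTP proxy nodes (e.g. Squid instances)."""

    def __init__(self, nodes):
        # itertools.cycle repeats the node list endlessly, in order.
        self._cycle = itertools.cycle(nodes)

    def next_proxy(self):
        return next(self._cycle)

    def fetch(self, url):
        # Route each request through the next proxy in the pool, so
        # successive requests leave from different IP addresses.
        proxy = self.next_proxy()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        return opener.open(url, timeout=30)

# Hypothetical proxy node addresses, for illustration only.
rotator = ProxyRotator([
    "http://10.0.0.11:3128",
    "http://10.0.0.12:3128",
    "http://10.0.0.13:3128",
])
```

A production setup would add health checks, per-node rate limits, and retry-on-failure, which is roughly what pushing the rotation into a dedicated load balancer buys you.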
In this talk, the authors discuss one such tool they have created and successfully used in multiple businesses and software companies over the last 3 years. The tool allows one to quickly and cheaply create an infrastructure of custom web proxy nodes and supports multiple VPS backends. Using this tool one can run an industrial-strength web crawling infrastructure with a set of up to 50 rotating proxy nodes at a monthly cost of just under $300.
The authors will talk about their experience creating and using the tool over the years, how it works with any web crawler, the open source nature of the code, which allows it to support different infrastructure backends, and the Squid configuration for the nodes, which hides the IP addresses behind the crawler.
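To give a flavour of the kind of Squid configuration involved, the fragment below shows standard squid.conf directives for stripping the headers that would otherwise reveal the client behind the proxy. This is an illustrative sketch, not the speakers' actual configuration.

```
# Listen on the default Squid port.
http_port 3128

# Do not add the client's address to X-Forwarded-For.
forwarded_for delete

# Suppress the Via header that identifies the proxy hop.
via off

# Drop any identifying headers the client itself sends.
request_header_access X-Forwarded-For deny all
request_header_access Via deny all
```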
- Some knowledge of web crawling and/or web scraping.
- Any knowledge of Scrapy, and some experience using it, is very handy.
- Knowledge of HTTP proxy servers is a huge plus.
Anand B Pillai is a technology professional with 20 years of experience in software development, design and architecture. He has worked at a number of companies over the years, in fields ranging from security and search engines to large scale web portals and big data. He is the founder of the Bangalore Python User's Group and the author of Software Architecture with Python (PacktPub, April 2017). Anand has extensive experience in web crawling, having written the original Python web crawler HarvestMan in 2005 and developed a number of custom crawlers for startups solving various problems. Anand is an independent software professional.
Noufal Ibrahim is the CEO and founder of Hamon Technologies in Calicut, Kerala. He was key to starting the very first PyCon India conference in 2009 and has been closely involved with the conference throughout the years. Noufal was the keynote speaker of PyCon India 2017. He has made a name not just through his Python community activities, but also through the creative introductory Python talks he has conducted at various universities and institutions in Kerala. He is also a professional trainer in Python and git.
Both Noufal and Anand are Fellows of the Python Software Foundation (PSF).