- Software Development Tools
- Technical level
Learn to use Scrapy for crawling the web.
Extracting structured information from a web page is a relatively simple task in Python, given the many tools at our disposal, such as BeautifulSoup, PyQuery and lxml. However, crawling and scraping data from multiple websites makes the job difficult, because every site structures its information differently.
Crawling up to 10 portals is manageable; beyond that it becomes a menace. What we need then is a framework that keeps the crawling and parsing logic separate and also helps manage the parsers. This is where Scrapy comes to our assistance. It is the most Pythonic way of scraping the web.
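To make the idea of keeping crawling and parsing separate concrete, here is a minimal sketch using only the Python 3 standard library. It is not Scrapy's API, only an illustration of the separation a framework like Scrapy provides. The domains, the page markup, and all names (`PARSERS`, `parser_for`, `crawl`, `first_text`) are hypothetical, and the `fetch` argument stands in for a real HTTP download.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _FirstText(HTMLParser):
    """Grabs the text of the first matching tag (optionally by CSS class)."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if self.text is not None or tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.text = data.strip()
            self.capturing = False


def first_text(html, tag, css_class=None):
    finder = _FirstText(tag, css_class)
    finder.feed(html)
    return finder.text


# Parsing logic: one small function per portal, registered by domain.
PARSERS = {}


def parser_for(domain):
    def register(func):
        PARSERS[domain] = func
        return func
    return register


@parser_for("shop-a.example")
def parse_shop_a(html):
    # Hypothetical portal A keeps the product name in <h1>.
    return {"name": first_text(html, "h1")}


@parser_for("shop-b.example")
def parse_shop_b(html):
    # Hypothetical portal B keeps it in <span class="product-name">.
    return {"name": first_text(html, "span", "product-name")}


# Crawling logic: knows nothing about page structure, only how to dispatch.
def crawl(urls, fetch):
    for url in urls:
        parse = PARSERS.get(urlparse(url).netloc)
        if parse is None:
            continue  # no parser registered for this portal
        item = parse(fetch(url))
        item["url"] = url
        yield item


# Canned pages stand in for real network responses.
PAGES = {
    "http://shop-a.example/p/1": "<html><body><h1>Blue Kettle</h1></body></html>",
    "http://shop-b.example/item?id=7":
        '<div><span class="product-name">Red Toaster</span></div>',
    "http://unknown.example/": "<h1>Ignored</h1>",
}

items = list(crawl(PAGES, PAGES.get))
for item in items:
    print(item)
```

Supporting a new portal here means adding one decorated parser function; the crawl loop stays untouched. Scrapy offers the same kind of separation, along with scheduling, downloading and item handling that this sketch leaves out.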
A laptop running any flavour of Linux. It would help if Python 2.7 and Scrapy are already installed in a virtualenv.
Leads the web scraping project at Reviews42, which crawls hundreds of e-commerce portals to catalogue the products being sold online.