Creating a crawler service to work efficiently at scale using Gevent and Flask
Rishi Raj Chopra (~rishi11)
With so much information available on the internet nowadays, crawling efficiently at scale has become unavoidable. During a crawl, most of the time is typically spent waiting for responses from the remote servers being crawled. This idle time is especially costly in Python, where the GIL (Global Interpreter Lock) prevents threads from running Python bytecode in parallel.
Using a library like Gevent, which lets you write asynchronous code in a synchronous style, the time the processor would otherwise spend waiting for responses can be used to issue more crawl requests, letting the crawler run many requests concurrently. A simple web service written in Flask can achieve this, and guess what? It's not that hard. Multiple processes of the Flask application can be deployed behind uWSGI and Nginx, making it more efficient still.
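As a taste of the idea, here is a minimal sketch of cooperative multitasking with Gevent. `gevent.sleep` stands in for waiting on a remote server (a patched socket read yields to the event loop in the same way), so ten 0.1-second waits overlap instead of adding up to a full second. The function name is illustrative, not from the talk.

```python
# Minimal sketch: gevent greenlets waiting cooperatively.
# gevent.sleep() yields to the event loop, just as a monkey-patched
# network read would while waiting on a remote server.
import time
import gevent

def wait_like_a_crawl(i):
    gevent.sleep(0.1)  # stands in for waiting on a crawl response
    return i

start = time.time()
jobs = [gevent.spawn(wait_like_a_crawl, i) for i in range(10)]
gevent.joinall(jobs)
elapsed = time.time() - start

print([job.value for job in jobs])
print(elapsed < 0.5)  # the ten waits overlap rather than summing to 1s
```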
What this talk covers:
- Concurrency in Python
- Cooperative Multitasking in Python as opposed to Preemptive Multitasking
- Crawler service
- Using the Python requests module to make requests
- Using Flask to create the web service
- Using Gevent to make concurrent requests
- Event Loop
- Monkey Patching in Python
- Deploying the Flask application using Gevent
- Using Apache Bench for testing and comparing the results
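Putting these pieces together, a hedged sketch of the kind of service the talk builds: a Flask endpoint that fans out crawl requests with Gevent. The route name, request shape, and timeout are illustrative assumptions, not taken from the talk; `monkey.patch_all()` must run before the networking modules are imported so that requests' sockets become cooperative.

```python
# Sketch of a concurrent crawler service (endpoint and payload shape
# are illustrative). monkey.patch_all() must come first so the
# blocking sockets used by requests become gevent-cooperative.
from gevent import monkey
monkey.patch_all()

import gevent
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

def fetch(url):
    # Each fetch runs in its own greenlet; while one waits on the
    # network, gevent switches to another and issues the next request.
    try:
        return {"url": url, "status": requests.get(url, timeout=5).status_code}
    except requests.RequestException as exc:
        return {"url": url, "error": str(exc)}

@app.route("/crawl", methods=["POST"])
def crawl():
    urls = request.get_json().get("urls", [])
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    return jsonify([job.value for job in jobs])

if __name__ == "__main__":
    app.run()
```

In production this app would sit behind uWSGI and Nginx rather than Flask's built-in server, with multiple worker processes each running its own gevent event loop.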
Prerequisites:
- Basics of the Python GIL
- Basic Crawling in Python (using urllib or requests)
- Making web servers using Flask
- Knowledge of deploying Flask applications
- Knowledge of using Apache Bench for testing web applications
The speaker is a 2016 graduate of Delhi Technological University (formerly Delhi College of Engineering) with a B.Tech in Information Technology and currently works as a Software Development Engineer at Zomato. He has a year of industry experience in Python and web services.