Making the most out of web scraping : Optimization using multithreading
Anuj Menta (~anujmenta)
In an age where data is the new currency, companies are trying to make the most of the World Wide Web. By some estimates, 60-70% of website traffic is not human at all. In this world of web scrapers and spiders, many of us are still running a sub-optimal script to scrape a website. What if I told you that your script could be optimized with a simple yet important upgrade in functionality? Concurrency.
Computer time and real time are two different things: one second for us can feel like a thousand for a computer. A scraper that waits for each response before sending the next request leaves most of that time unused. Tapping into it could save a lot of computation time. In this talk I will go over the basic principles of structuring your code to facilitate multithreading and make the most of the moment.
The talk will briefly cover the following:
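As a rough illustration of the idea, here is a minimal sketch that compares fetching pages one at a time with fetching them from a thread pool. It is not the code from the talk: the URL list is hypothetical, and `fetch` is a stand-in that uses `time.sleep` to simulate network latency so the example runs without touching a real website. In a real scraper, `fetch` would make an actual GET request (for example with requests).

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page list; a real scraper would hold actual URLs here.
URLS = [f"https://example.com/page/{i}" for i in range(8)]


def fetch(url):
    """Stand-in for an HTTP GET (an assumption, not a real request).

    time.sleep simulates network latency, during which the thread is idle.
    """
    time.sleep(0.1)
    return f"<html>{url}</html>"


def scrape_sequential(urls):
    # One request at a time: total time is roughly len(urls) * latency.
    return [fetch(url) for url in urls]


def scrape_threaded(urls, max_workers=8):
    # Requests overlap while each thread waits on (simulated) I/O.
    # executor.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


def timed(func, urls):
    # Run func(urls) and return its result with the elapsed wall-clock time.
    start = time.perf_counter()
    pages = func(urls)
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(scrape_sequential, URLS)
    _, thr_time = timed(scrape_threaded, URLS)
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Threads help here because the work is I/O-bound: each thread spends nearly all of its time waiting, so the waits can overlap. The actual speed-up on a real site depends on network conditions, the number of workers, and any rate limits the site enforces.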
- What is web scraping?
- What is multithreading?
- How to speed up your web scraper by X times?
Prerequisites:
- Familiarity with Python
- Familiarity with requests/BeautifulSoup (or any other library that lets you make a GET request)
I am an IIT Kharagpur graduate (2017) who has spent over 4 years coding in Python. I have worked with all styles of Python, from website development using Django and Flask to scientific computing using NumPy and scikit-learn to web scraping using Selenium. It's been a wonderful journey all along, and I'm now looking forward to bringing as many people on board as I can to experience what I've experienced.
I am also the founder of Papercop, an examination preparation portal for the students of IIT Kharagpur, which has over 70k hits. I am a very passionate speedcuber (I can solve the Rubik's Cube in about 10 seconds) and have won plenty of medals in speedcubing competitions across the country. I now work as an analyst with an MNC.
GitHub : https://github.com/anujmenta
LinkedIn : https://in.linkedin.com/in/anuj-menta-314b5969
World Cube Association Profile : https://www.worldcubeassociation.org/persons/2013MENT01