Making the most out of web scraping : Optimization using multithreading

Anuj Menta (~anujmenta) | 29 Aug, 2017

Description:

In an age where data is the new currency, companies are trying to make the most of the World Wide Web. By some estimates, 60-70% of website traffic is not human at all. In this world of web scrapers and spiders, many of us still run a sub-optimal script to scrape a website. What if I told you that your script could be optimized with a simple yet important upgrade: concurrency?

Computer time and real time are two different things. One second for us can feel like a thousand to a computer. A scraper spends most of its run time waiting on network responses, and tapping into that idle time could save us a lot of computation time. In this talk I will go over the basic principles of structuring your code for multithreading so that you can make the most of every moment.

The talk will briefly cover the following:

What is web scraping?
What is multithreading?
How to speed up your web scraper by X times?
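
The proposal does not include the talk's code, so the following is only a minimal sketch of the idea, assuming a thread pool from Python's standard `concurrent.futures` module (the talk itself may use a different approach). The function names, the example URLs, and the `fake_fetch` stand-in that simulates network latency are all hypothetical; `fetch` uses the standard library's `urlopen`, and a `requests.get` call could be swapped in.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as bytes (blocking network I/O)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_sequential(urls, fetch_page=fetch):
    """Baseline: fetch the pages one after another."""
    return [fetch_page(url) for url in urls]


def scrape_threaded(urls, fetch_page=fetch, max_workers=8):
    """Fetch pages concurrently; results keep the same order as urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))


def fake_fetch(url):
    """Hypothetical stand-in for a real request: sleeps to mimic latency."""
    time.sleep(0.1)
    return "<html>" + url + "</html>"


if __name__ == "__main__":
    # Hypothetical URLs; fake_fetch never touches the network.
    urls = ["https://example.com/page/%d" % i for i in range(8)]
    for scrape in (scrape_sequential, scrape_threaded):
        start = time.perf_counter()
        pages = scrape(urls, fetch_page=fake_fetch)
        elapsed = time.perf_counter() - start
        print("%s: %d pages in %.2fs" % (scrape.__name__, len(pages), elapsed))
```

Because each worker spends nearly all of its time waiting rather than computing, the threaded version should finish in roughly the time of the slowest single request, while the sequential version takes the sum of all of them. The actual speedup against a real website depends on the network, the server, and the value chosen for `max_workers`.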

Prerequisites:

Familiarity with Python
Familiarity with requests/BeautifulSoup (or any other library that lets you make a GET request)

Speaker Info:

I am an IIT Kharagpur graduate (2017) who has spent over 4 years coding in Python. I have worked with many styles of Python, from website development using Django and Flask, to scientific computing using NumPy and scikit-learn, to web scraping using Selenium. It has been a wonderful journey, and I am now looking forward to bringing as many people on board as I can to experience what I have experienced.

I am also the founder of Papercop, an examination preparation portal for the students of IIT Kharagpur, which has over 70k hits. I am a passionate speedcuber (I can solve the Rubik's Cube in about 10 seconds) and have won many medals in speedcubing competitions across the country. I now work as an analyst with an MNC.

Speaker Links:

GitHub : https://github.com/anujmenta

LinkedIn : https://in.linkedin.com/in/anuj-menta-314b5969

World Cube Association Profile : https://www.worldcubeassociation.org/persons/2013MENT01

Section:	Concurrency
Type:	Talks
Target Audience:	Beginner
Last Updated:	29 Aug, 2017
