Workshop: Master Advanced Python Web Scraping Techniques
Fabien Vauchelles (~fabien)
Description:
Join me for an incredible tutorial to unlock the full potential of Web Scraping in Python! From novice to virtuoso, you’ll learn advanced techniques for collecting crucial datasets to train AI models.
🔍 Highlights 🔍
Protection Disclosed 🔒
- Overcome fingerprint challenges and anti-bot measures.
- Reverse-engineer protections to understand which signals they track.
Proxy and Browser Farms Adventure 🌐
- Discover Scrapoxy, the free and open-source proxy waterfall tailored for Web Scraping
- Become an expert in browser farms with Playwright
This 2-hour tutorial will immerse you in the secret world of anti-bot protection.
Don't miss the unique opportunity to master these essential skills!
Imagine Isabella, a visionary AI engineer with a head full of dreams. She wants to revolutionise the tourism industry. But there is a catch - she's missing the crucial ingredient: data.
During this tutorial, we’ll join Isabella on her data quest. I created the website https://trekky-reviews.com/ specifically for this talk. We’ll tackle its protection measures step by step, in real time, using techniques such as proxies, headless browsers and deobfuscation.
And the best part? Every attendee will leave with the skills to apply these latest legal techniques to collect data.
Here is the outline of the workshop:
The tutorial duration can vary from 90 to 120 minutes. I'm flexible and can adjust the number of protections to bypass based on your requirements.
Introduction (3 mins, slides) To kick off the tutorial, I engage the participants by asking about their experiences with bypassing website protection. This sets the stage for introducing myself and expressing my passion for web scraping and reverse-engineering anti-bot measures.
Narrative (3 mins, slides) I share a compelling narrative with the audience: Meet Isabella, a visionary AI engineer with a head full of dreams. To build her product, she needs to collect vital data and bypass protections.
Legal (3 mins, slides) Let's take a proactive approach. Here's a straightforward decision pathway: If the data is public and non-personal, you don't need to agree to any terms and conditions (T&C) to access it, and you're not causing harm (such as a DDoS), then you're good to go!
Website Target Structure (3 mins, demo) I created a dedicated website for this tutorial: https://trekky-reviews.com/. This site comes in several iterations, each fortified with progressively more challenging protections. Throughout the presentation, we'll help Isabella manoeuvre through these defences.
Framework Installation and 1st challenge (15 mins, exercises) I will guide participants through the installation of the Scrapy framework and kickstart the first project.
Basic Challenge-Solving (15 mins, exercises) Participants will solve 2 challenges:
- Bypass User-Agent filtering
- Add consistent HTTP headers
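As an illustration of what "consistent" headers could mean in the two challenges above, here is a minimal Python sketch. The helper name `build_headers`, the platform tokens and the exact header values are assumptions chosen for the example, not the workshop's solution; the point is only that the browser version and platform are derived from one place so the headers cannot contradict each other.

```python
# Hypothetical helper: every value below is illustrative, not captured from a real browser.
PLATFORM_TOKENS = {
    "Windows": "Windows NT 10.0; Win64; x64",
    "macOS": "Macintosh; Intel Mac OS X 10_15_7",
    "Linux": "X11; Linux x86_64",
}


def build_headers(chrome_major: int = 120, platform: str = "Windows") -> dict[str, str]:
    """Build a header set whose version and platform agree across all fields."""
    token = PLATFORM_TOKENS[platform]
    return {
        # The User-Agent announces a Chrome version and an operating system...
        "User-Agent": (
            f"Mozilla/5.0 ({token}) AppleWebKit/537.36 "
            f"(KHTML, like Gecko) Chrome/{chrome_major}.0.0.0 Safari/537.36"
        ),
        # ...and the client hints repeat the same version and platform.
        "Sec-CH-UA": (
            f'"Chromium";v="{chrome_major}", "Google Chrome";v="{chrome_major}", '
            '"Not?A_Brand";v="24"'
        ),
        "Sec-CH-UA-Mobile": "?0",
        "Sec-CH-UA-Platform": f'"{platform}"',
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


if __name__ == "__main__":
    for name, value in build_headers().items():
        print(f"{name}: {value}")
```

In a Scrapy project, a dictionary like this could be assigned to the `DEFAULT_REQUEST_HEADERS` setting, with the User-Agent string also set through `USER_AGENT`.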
Proxies Overview (5 mins, slides) I explain the different types of proxy: Datacenter, ISP, Residential, and Mobile, outlining their respective advantages and drawbacks.
Proxies Challenges (20 mins, exercises) We'll set up Scrapoxy and configure the first connector. Participants will tackle 2 challenges:
- Bypass Rate Limit with Datacenter proxies
- Avoid detection with ISP proxies
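To give a feel for the proxy challenges above, the sketch below shows how a scraper might route its traffic through a single local proxy endpoint that rotates outgoing IPs behind the scenes. The host, port and credentials are placeholders assumed for the example (check your own Scrapoxy project for the real values), and the helper names are hypothetical. Nothing is sent over the network here.

```python
import urllib.request
from urllib.parse import quote


def make_proxy_url(username: str, password: str,
                   host: str = "localhost", port: int = 8888) -> str:
    """Build an authenticated proxy URL; host and port are assumed placeholders."""
    # Percent-encode the credentials so characters like '@' or ':' stay unambiguous.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    url = make_proxy_url("project-user", "project-password")  # placeholder credentials
    opener = make_opener(url)
    print("Requests made with this opener would go through", url)
```

With Scrapy, the same proxy URL can be attached to an individual request through `request.meta["proxy"]`.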
Headless Browser Challenge (20 mins, exercises) Participants will install Playwright and tackle a series of challenges, including:
- Executing JavaScript with a headless browser
- Tuning headless browser parameters (like timezone)
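The headless browser challenges above might look roughly like the following Python sketch using Playwright's synchronous API. The helper names and the default timezone and locale are assumptions made for the example; in practice these values would presumably be chosen to match the location of the proxy in use. Playwright is imported lazily so the option-building part can be read and run on its own.

```python
def context_options(timezone_id: str = "Europe/Paris", locale: str = "fr-FR") -> dict:
    """Browser context parameters to tune; the defaults are illustrative only."""
    return {"timezone_id": timezone_id, "locale": locale}


def fetch_rendered_html(url: str, **overrides) -> str:
    """Load a page in headless Chromium, let its JavaScript run, and return the HTML."""
    # Imported here so the rest of the module works without Playwright installed.
    from playwright.sync_api import sync_playwright

    options = {**context_options(), **overrides}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # The context carries the tuned parameters (timezone, locale, ...).
            context = browser.new_context(**options)
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


if __name__ == "__main__":
    print(context_options())
```

Calling `fetch_rendered_html("https://trekky-reviews.com/")` requires Playwright and its Chromium build to be installed beforehand.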
Code Deobfuscation (10 mins, slides) I'll introduce techniques for deobfuscating both strings and code-flow.
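As a toy illustration of string deobfuscation, the Python sketch below inlines a "string array" back into the code that references it. This is a simplification written for this outline: the workshop itself works on the JavaScript syntax tree with Babel.js, whereas this example uses regular expressions on a hand-written snippet, and the function name and sample input are hypothetical.

```python
import json
import re

# Hand-written sample: strings are hidden in an array and referenced by index.
OBFUSCATED = 'var _0x1f = ["log", "hello"]; console[_0x1f[0]](_0x1f[1]);'


def inline_string_array(source: str) -> str:
    """Replace `name[index]` lookups with the string literal they point to."""
    # Find a declaration of the form: var NAME = ["...", "..."];
    decl = re.search(r'var\s+(\w+)\s*=\s*(\[[^\]]*\]);\s*', source)
    if decl is None:
        return source
    name, literal = decl.group(1), decl.group(2)
    # Only works when the array is JSON-compatible (double-quoted strings).
    strings = json.loads(literal)
    # Drop the declaration, then substitute every indexed lookup.
    body = source[:decl.start()] + source[decl.end():]
    lookup = re.compile(re.escape(name) + r'\[(\d+)\]')
    return lookup.sub(lambda m: json.dumps(strings[int(m.group(1))]), body)


if __name__ == "__main__":
    print(inline_string_array(OBFUSCATED))
```

A regex approach like this breaks quickly on real obfuscated code (nested brackets, encoded strings, rotated arrays), which is one reason to work on the syntax tree instead.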
Deobfuscation Challenge (20 mins, exercises) After installing Babel.js, participants will start reverse engineering a protection through deobfuscation. They will replicate the anti-bot behaviour, including payload encryption.
Conclusion (3 mins, slides) As a wrap-up, I will present upcoming challenges and potential solutions, leaving us with food for thought on the future of protections.
I have already spoken at Devoxx, PyCon and other conferences. Here is my latest recorded talk: https://www.youtube.com/watch?v=Kcq36_lMbvY
I hope this submission meets your expectations for the conference!
Prerequisites:
Before the tutorial, please ensure you have installed the following software: Python (version 3), Node.js (version 20) and Docker.
Basic knowledge of Python and JavaScript is recommended, but don't worry if you're new to them - I'll be here to help you every step of the way.
Speaker Info:
Fabien Vauchelles is an Anti-Ban Expert. With over a decade of experience in Web Scraping, Fabien's passion for code and technology helps him to bypass protections. He is the creator of Scrapoxy, a mature free and open-source proxy waterfall tailored for the Web Scraping industry.
He has had the opportunity to share his insights at many events, including Devoxx conferences, Voxxed Days, API Days, PyCon, PyData and others.
Speaker Links:
Here are some previously recorded talks:
- Anatomy of Anti-Bot Protection, Extract Summit 2023, https://www.youtube.com/watch?v=0KTIloOlDK0
- Data Science University, Devoxx France 2016 (in French), https://www.youtube.com/watch?v=eD8R39Pua9I
- Machine Learning for Developer, Voxxed Days 2016 (in French), https://www.youtube.com/watch?v=AbfCUNtNpRA