Decoding Web Crawling at Scale with Python

Shloka Shah (~shloka)




Web Crawling has emerged as a powerful technique for gathering and analyzing data. In today's data-driven world, information is critical, and the web is an abundant source of valuable data.

In this talk, we will start by demystifying the concept of web crawling and understanding how it differs from web scraping. We will explore traditional web crawling methods, including manual extraction and browser automation, and discuss their limitations and drawbacks. Python provides a variety of tools and libraries for web crawling. We will look at two of them, Scrapy and Selenium, and discuss the pros and cons of each.

Next, we will understand Advanced Crawling Techniques which are required for

  1. Handling Dynamic Content rendered by JavaScript
  2. Handling Paginated Pages
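
As an illustration of the second item, here is a minimal sketch of following pagination links using only the standard library. It assumes the site marks its "next page" link with rel="next", which not every site does. The names NextLinkParser, crawl_paginated, and FAKE_SITE are hypothetical, and the in-memory dictionary stands in for real HTTP requests so that the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects the href of the first <a rel="next"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("rel") == "next" and self.next_href is None:
            self.next_href = attr_map.get("href")


def crawl_paginated(start_url, fetch, max_pages=100):
    """Follow rel="next" links from start_url; fetch(url) must return HTML text."""
    url, seen, order = start_url, set(), []
    # Stop on a missing next link, a loop back to a seen page, or the page cap.
    while url and url not in seen and len(order) < max_pages:
        seen.add(url)
        order.append(url)
        parser = NextLinkParser()
        parser.feed(fetch(url))
        # Resolve relative hrefs against the page they were found on.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return order


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/list?page=1": '<a rel="next" href="/list?page=2">Next</a>',
    "https://example.com/list?page=2": '<a rel="next" href="/list?page=3">Next</a>',
    "https://example.com/list?page=3": "<p>Last page</p>",
}

pages = crawl_paginated("https://example.com/list?page=1", FAKE_SITE.__getitem__)
```

In a real crawler, fetch would be an HTTP client call, and the max_pages cap guards against sites whose pagination never terminates.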

Further, we will dive into Scaling Web Crawlers and how to handle issues such as

  1. Error Handling: Handling errors when dealing with distributed crawling
  2. Rate at which we Crawl: Finding the right balance between crawling fast and crawling the complete data set
  3. Handling Visited URLs: How to maintain visited URLs in a distributed crawler
  4. Handling Spam/Non-Useful URLs: Managing URLs not relevant to the crawler use case
  5. Crawling without getting banned: Making sure that the website doesn't block the crawler
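
Several of these concerns (visited URLs, irrelevant URLs, and crawl rate) can be sketched with a small URL frontier. This is only one possible design under stated assumptions, not the approach presented in the talk: the names normalize and Frontier are hypothetical, relevance is reduced to a host allow-list, and the in-process set would need to be replaced by a shared store for a truly distributed crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different forms dedupe to one key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """Tracks visited URLs and a per-host earliest-next-fetch time."""

    def __init__(self, allowed_hosts, delay_seconds=1.0):
        self.allowed_hosts = set(allowed_hosts)
        self.delay_seconds = delay_seconds
        self.visited = set()    # a shared store would replace this across workers
        self.next_allowed = {}  # host -> timestamp before which we should not fetch

    def should_fetch(self, url):
        """True only for unseen URLs on hosts relevant to this crawl."""
        key = normalize(url)
        host = urlsplit(key).netloc
        if host not in self.allowed_hosts or key in self.visited:
            return False
        self.visited.add(key)
        return True

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; also reserves the next slot."""
        host = urlsplit(normalize(url)).netloc
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.delay_seconds
        return ready_at - now


frontier = Frontier({"example.com"}, delay_seconds=2.0)
first = frontier.should_fetch("https://Example.com/a/")       # new URL on an allowed host
duplicate = frontier.should_fetch("https://example.com/a#top")  # same page after normalization
off_topic = frontier.should_fetch("https://spam.test/x")        # host not in the allow-list
```

Spacing requests per host, rather than globally, lets the crawler stay fast overall while remaining polite to each individual site.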

Next, we will touch upon efficient storage and organization of crawled data for further scraping. Finally, we will address the ethical considerations associated with web crawling and discuss best practices and guidelines for crawling responsibly.
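
One widely used convention for responsible crawling is to honor a site's robots.txt file. The sketch below uses the standard library's urllib.robotparser. The robots.txt content and the "example-crawler" user agent are made up for illustration, and the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check each URL against the rules before fetching it.
allowed = parser.can_fetch("example-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("example-crawler", "https://example.com/private/data")

# The requested delay between requests, if the site declares one.
delay = parser.crawl_delay("example-crawler")
```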

Key Takeaways:

  1. Clear understanding of web crawling and its distinctions from web scraping
  2. Knowledge of popular Python libraries for web crawling (Scrapy, Selenium)
  3. Understanding of advanced crawling techniques
  4. Insight into Scaling Web Crawlers
  5. Awareness of ethical considerations and best practices for responsible web crawling

Brief Outline:

  1. Introduction to web crawling & distinctions from web scraping [2 minutes]
  2. Traditional Methods of Crawling [2 minutes]
  3. Python Libraries (Scrapy & Selenium), examples & their pros & cons [4 minutes]
  4. Advanced Crawling Techniques & Scaling Crawlers [16 minutes]
  5. Efficient Organization of URLs for further scraping [2 minutes]
  6. Ethical Considerations & Best Practices [2 minutes]
  7. Q&A [2 minutes]


Familiarity with Python programming basics and a basic understanding of Web Crawling will be helpful but not necessary.

Speaker Info:

Shloka works at HackerRank as a Software Development Engineer II, where she pursues her passion for problem-solving. She is part of the HackerRank Labs team, where she focuses on building new products and finding their product-market fit. Her main areas of interest are software development, backend development, and machine learning. Over the past 2.5 years, she has built scalable web crawlers and scrapers using Python and scalable applications using Ruby on Rails, and has gained hands-on experience developing and productionizing machine learning models to solve complex problems. She takes great pleasure in developing her own solutions using a data-driven methodology.

Speaker Links:

Shloka shares her experiences on her personal blog, earning her the HackerNoon Contributor of the Year award. In addition to her writing achievements, she mentors aspiring software developers on various topics related to Software Development. Shloka has spoken at events such as the Ruby on Rails Global Summit by Geekle, the Pune FOSS Conference, and Mumbai FOSS meetup. She has served as a judge and mentor in multiple Hackathons. Furthermore, she actively contributes to the community as a mentor with Rails Girls Bangalore and volunteers with FOSS United Bangalore and FOSS United Mumbai. Shloka also contributes her expertise as a member of the CFP review team at FOSS United. She is also a volunteer in the Content Team for PyCon India 2023.

Section: Web & App development
Type: Talks
Target Audience: Intermediate
Last Updated: