ScrapyQuickStart: Web Scraping Code Generation
Susmit Vengurlekar (~susmitpy)
Description:
Keeping code modular and adhering to good, team-defined standard practices is hard. So is manually setting up the various files in the required structure. The tool discussed in this talk aims to solve these problems when writing web scraping code with Scrapy.
The tool (https://github.com/susmitpy/ScrapyExample) generates code in a standard fashion from a Jinja template for the item to be scraped. It does this in a modular way, keeping the code separate. The generated code also makes the scraped data available after execution without writing to any intermediate file, something that took me a lot of time to figure out.
The takeaways for the audience are listed in the outline of the talk below.
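To illustrate keeping scraped data in memory instead of writing it to a file, here is a minimal sketch of one possible approach: a Scrapy item pipeline that appends every item to a shared list. The class name `CollectItemsPipeline` and the sample item are hypothetical, and the repository may use a different mechanism (for example Scrapy signals). Scrapy item pipelines are plain Python classes, so the sketch runs without a crawl.

```python
class CollectItemsPipeline:
    """Hypothetical item pipeline that keeps every scraped item in memory."""

    # Class-level list, readable by the calling code once the crawl finishes.
    items = []

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; returning the item
        # lets any later pipelines keep processing it.
        self.items.append(dict(item))
        return item


# Stand-in for what Scrapy does during a crawl (no network needed here).
pipeline = CollectItemsPipeline()
pipeline.process_item({"title": "Example", "price": "10.00"}, spider=None)

print(CollectItemsPipeline.items)
```

In a real project such a pipeline would be enabled through Scrapy's `ITEM_PIPELINES` setting, and the list would be read after the crawl completes.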
Outline
- Intro to Web Scraping
- Beautiful Soup vs Selenium vs Scrapy - What to use when (Takeaway #1)
- Code walk-through of basic generated Scrapy code, along with its execution (Takeaway #2)
- Generating the demoed code using Jinja Templates (Takeaway #3 - Inspiration for automating your own repetitive tasks)
- Using Jinja Templates vs Copy-Paste-Modify
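As a rough idea of what template-based generation looks like, the sketch below renders a Scrapy item class from a Jinja template. The template text, the `render_item` helper, and the `BookItem` example are assumptions made for illustration; they are not the templates shipped in the repository.

```python
from jinja2 import Template

# Hypothetical template for a Scrapy item class (illustration only).
ITEM_TEMPLATE = Template(
    "import scrapy\n"
    "\n\n"
    "class {{ class_name }}(scrapy.Item):\n"
    "{% for field in fields %}"
    "    {{ field }} = scrapy.Field()\n"
    "{% endfor %}"
)


def render_item(class_name, fields):
    """Return Python source for an item class with the given fields."""
    return ITEM_TEMPLATE.render(class_name=class_name, fields=fields)


generated = render_item("BookItem", ["title", "price"])
print(generated)
```

Compared with copy-paste-modify, only the class name and the field list change between items, so the structure of the generated file stays consistent.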
Prerequisites:
Basics of Python. Since 25 minutes is long enough to explain the Scrapy code clearly, the audience need not have prior experience in web scraping.
Content URLs:
- https://github.com/susmitpy/ScrapyExample
Speaker Info:
I am a Data Scientist and Full Stack Software Developer. I have developed many software products using different technologies, backed by cost-optimised but performant cloud-based system architectures, and have worked with MySQL, Postgres, Firebase Firestore, MongoDB, and Cassandra, as well as graph databases such as Neo4j. Through this I have learned to use the right tool for the right job in the right manner.
https://www.linkedin.com/in/susmit-vengurlekar
Speaker Links:
- https://github.com/susmitpy/BuildingARecommendationEngineUsingNeo4jAzureBootcampTalk (talk given at Global Azure Bootcamp 2024 - Mumbai) - https://youtu.be/V05Pz1tVovs?feature=shared
- https://pypi.org/project/cache-df/
- https://pypi.org/project/unitgen/
- https://susmitpy.medium.com/