Comparing Scraping Libraries in Python
Tapasweni Pathak (~tapasweni-pathak)
What This Talk Will Cover
- LXML vs. BeautifulSoup (with numerous pages)
- Scrapy: Why is it so easy to use? How fast can we go?
- What if the page's HTML is broken?
- Which libraries use XPath and CSS selectors to identify elements, and why is that a good thing?
- How many function calls, on average, does each of the following make?
  - LXML with XPath
  - LXML with CSS
  - Beautiful Soup
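To make the three approaches in the list above concrete, here is a minimal sketch that pulls the same links out of a deliberately broken page (unclosed `<li>` tags, no closing `</ul>`, `</body>` or `</html>`). The HTML snippet and the helper names are hypothetical and are not taken from the talk. The sketch assumes `lxml`, `cssselect` and `beautifulsoup4` are installed and records `None` for any approach whose library is missing.

```python
# Hypothetical sample page: <li> tags are never closed and the document
# ends without </ul>, </body> or </html>.
BROKEN_HTML = (
    '<html><body><ul>'
    '<li><a href="/a">First</a>'
    '<li><a href="/b">Second</a>'
)


def links_lxml_xpath(html):
    """Collect href values with lxml and an XPath expression."""
    from lxml import html as lxml_html
    return [str(href) for href in lxml_html.fromstring(html).xpath("//a/@href")]


def links_lxml_css(html):
    """Collect href values with lxml and a CSS selector (needs cssselect)."""
    from lxml import html as lxml_html
    return [a.get("href") for a in lxml_html.fromstring(html).cssselect("a")]


def links_bs4(html):
    """Collect href values with Beautiful Soup on the stdlib parser."""
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a")]


results = {}
for name, extract in [
    ("lxml + XPath", links_lxml_xpath),
    ("lxml + CSS", links_lxml_css),
    ("Beautiful Soup", links_bs4),
]:
    try:
        results[name] = extract(BROKEN_HTML)
    except ImportError:
        results[name] = None  # library not installed in this environment

for name, links in results.items():
    print(f"{name}: {links}")
```

On a page this mildly broken, every approach that runs should report the same two links. Pages with more serious damage are where the parsers tend to diverge.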
We will compare them across a series of sites, evaluating how quickly they parse pages and how accurately they find data.
We used cProfile and pstats to gather the timing and function-call data.
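The measurements described above can be sketched roughly as follows. This is not the talk's actual benchmark code: `profile_call`, `extract_links`, `LinkCollector` and the sample page are hypothetical names introduced for illustration, and the standard library's `html.parser` stands in for the libraries under test so that the example runs without extra installs.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Stand-in parser: records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """The function being measured (hypothetical)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def profile_call(func, *args):
    """Run func under cProfile; return (result, total calls, total seconds)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    # Send any report output to a buffer so nothing is printed implicitly.
    stats = pstats.Stats(profiler, stream=io.StringIO())
    return result, stats.total_calls, stats.total_tt


# Synthetic page with 200 links, used in place of a real site.
SAMPLE = "<ul>" + "".join(
    f'<li><a href="/p/{i}">Page {i}</a></li>' for i in range(200)
) + "</ul>"

links, calls, seconds = profile_call(extract_links, SAMPLE)
print(f"{len(links)} links, {calls} function calls, {seconds:.6f} s")
```

The same `profile_call` wrapper could be pointed at an lxml or Beautiful Soup extraction function, and averaging `calls` over many pages would give a per-library function-call figure.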
Case studies are yet to be added.
Tapasweni Pathak works as a Software Developer at SAP Labs. She is a GSoC mentor for Systers, serves on the organizing committee of GHC India for the Hackathon track, and was an OWASP Summer Code Sprint 2015 student. She contributes to the Linux kernel and works on many side projects. She is a FOSS enthusiast who reads and writes on Quora. She loves C, Python, operating systems and compilers. In the past she has worked as an Outreachy Linux Kernel Intern, an Engineering Intern at Qualcomm Inc. and a Research Intern at I.I.T. Delhi.
I'm pursuing a master's degree in Computer Science. At present I am working with Systers as a GSoC intern. I have worked with OpenStack Zaqar as an Outreachy intern, and I am currently an active contributor to OpenStack. Programming in Python is something I enjoy a lot! Python lets me turn my ideas into real projects easily. I work on many projects, participate in hackathons, and contribute to other open source projects.
I believe technology and computer science are not just amazing but also involved in every part of our lives. They play a role in everything from how we order and cook food to how we connect with thousands of people around the world. They are a vital part of almost everything we do. They have the power to change the world, change culture, change how people think, and make lives better.