Art of Web Scraping using Python (Low level)

Authors Siddhant Sanyam
Talk Type tutorial
Level Beginner
Topic Network programming
Tags screen scraping, parsing html, offline interface
Summary

Web scraping means fetching structured data from websites that do not provide an API for it. With this technique you can write programs that visit different websites and collect the data that matters to you. Examples include building a desktop interface for a website, automatically posting data after some analysis, and fetching data in a structured form for offline use. Web scraping is also very useful for everyday, personal tasks.
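
As an illustration of turning a page into structured data, here is a minimal sketch that uses only the standard library's HTML parser (the Python 3 module html.parser). The class name TableExtractor, the helper extract_rows and the sample HTML are assumptions made up for this example and are not taken from the talk material; a real script would download the page first.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
  <tr><th>Roll No</th><th>Grade</th></tr>
  <tr><td>101</td><td>A</td></tr>
  <tr><td>102</td><td>B+</td></tr>
</table>
"""
```

Calling extract_rows(SAMPLE_HTML) yields one list per table row, which can then be written to a file or processed further offline.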

Outline

What will be covered

  • urllib
  • htmlparse
  • xmlparse
  • Regex based parsing
  • BeautifulSoup
  • urllib2
  • Cookie handling
  • MozillaCookieJar
  • Talk on using AI techniques
  • Limitations of web scraping: where to use it and where not to
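
To give a flavour of the cookie-handling items above, here is a minimal sketch in Python 3, where urllib2 lives on as urllib.request and the cookie classes are in http.cookiejar. The helpers build_browser, make_cookie and roundtrip_demo, the User-Agent string and the cookie values are all assumptions invented for this illustration. The demo deliberately avoids the network: it stores one cookie in a MozillaCookieJar, saves it to disk and reads it back.

```python
import os
import tempfile
import urllib.request
from http.cookiejar import Cookie, MozillaCookieJar


def build_browser(cookie_file, user_agent="Mozilla/5.0 (example)"):
    """Return (opener, jar): the opener sends a browser-like header and
    stores any cookies it receives in the jar."""
    jar = MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets it)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def roundtrip_demo():
    """Save a cookie to disk and load it again, without network access."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookies.txt")
        opener, jar = build_browser(path)
        jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
        # Session cookies are skipped unless ignore_discard is set.
        jar.save(ignore_discard=True, ignore_expires=True)

        restored = MozillaCookieJar(path)
        restored.load(ignore_discard=True, ignore_expires=True)
        return {cookie.name: cookie.value for cookie in restored}
```

In a real script, opener.open(url) would be used for every request so that the login cookie set by one response is sent with the next, and the saved file would let a later run try the previous session first.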

Examples Covered

  • Simplest: monitoring a website for changes
  • Bare-bones: fetching results from a university website
  • onemangadl: a script to fetch manga from websites
  • Posting data: pasteonline, a script for pasting logs and text files to free paste sites such as pastebin.com
  • Acting like a real browser by sending additional headers: downloading YouTube videos
  • Persistence: maintaining state with cookies, shown by sending SMS through free services
  • Storing state data to disk: improving the previous script so that it tries to reuse the previous session
  • Handling errors and logging
  • Making an API: an API for the same SMS-sending script
  • Making your software configurable and extensible: adding classes and inheritance
  • Automatic XSS detector: an example of crawling, and of how web scraping can be put to good use
  • Dealing with CAPTCHAs
  • Not repeating yourself: writing reusable code
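
The simplest example in the list, monitoring a website for changes, can be sketched roughly as follows. This is only one possible approach, assuming that comparing a SHA-256 fingerprint of the page body is good enough; the function names fetch, fingerprint and has_changed are made up for this illustration and are not the talk's actual code.

```python
import hashlib
import urllib.request


def fetch(url, timeout=10):
    """Download a page body as bytes (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fingerprint(body):
    """Return a short, comparable summary of the page content."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_fingerprint, body):
    """True if the page body differs from the one seen last time."""
    return fingerprint(body) != previous_fingerprint
```

A monitoring script would call fetch periodically, compare the result against the stored fingerprint and notify the user on a mismatch. Pages that embed timestamps or rotating content differ on every request, so in practice it is often better to fingerprint only the extracted part of the page that matters.
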

Notes

  • The time needed is about 2.5 hours, plus or minus 15 minutes. The exact time spent on each example depends heavily on the audience and cannot be predicted until at least 25th July, when I will do a rehearsal with friends to check.

  • I believe this tutorial is best covered progressively through examples, i.e. explaining each tool as and when it is needed.

Setup for the Tutorial

I usually give demonstration talks and tutorials using a GNU Screen setup. It would be great if we could have the following setup:

  • Every attendee is connected to an internal LAN (wired or Wi-Fi), and so am I
  • Attendees have an SSH client with which they can SSH into my computer
  • I have a screen session running
  • We all share a common screen session that only I can write to and the others can read
  • I'll need an Internet connection to demonstrate all the examples
  • Attendees can view and download the source files from the HTTP server running on my PC

I feel this setup is better than a projector because attendees can see the code and my cursor on their own screens, and can copy and paste if required.

Profile of the authors

Blog: http://yatantrika.co.cc/
Git Repo: http://github.com/siddhant3s/

  • Undergrad from NIT Trichy. Working with Python for more than three years.
  • Use Python for all personal programming needs.
  • Currently an intern at Nexedi, working on the UNG project (a free alternative to Google Docs and other SaaS services)
  • Creator and designer of DFCTF 2011, India's second CTF-style ethical hacking competition (the backend was written in Python).
  • Speaker at workshops on web security (the last one at Vortex 11).
  • Good working knowledge of robotics, embedded systems and AI-related paradigms.