Liberating tabular data from the clutches of PDFs

jayant (~heaven00)


Votes: 12

Description:

Budget documents are moral documents that represent the priorities and values of a state and its governing bodies. Unfortunately, these documents are published as unstructured PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data.

In this session we will delve into how to build a data pipeline that uses computer vision techniques to parse these documents into clean, machine-readable formats, using libraries and tools such as OpenCV, numpy, pandas, PyPDF2, tabula and Poppler's pdftotext.
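To make the idea of "machine-readable" output concrete, here is a minimal sketch of one possible step in such a pipeline: turning words with page coordinates into rows and cells. It is not taken from the talk's repository. The input shape (text plus a bounding box), the function names `group_into_rows` and `split_into_cells`, the tolerance values and the sample words are all assumptions made up for illustration.

```python
from typing import List, Tuple

# (text, x_min, y_min, x_max, y_max) in page coordinates.
# This tuple layout is a hypothetical input shape, not a library format.
Word = Tuple[str, float, float, float, float]


def group_into_rows(words: List[Word], y_tolerance: float = 3.0) -> List[List[Word]]:
    """Cluster words whose top edge lies within y_tolerance of the row's first word."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Restore left-to-right reading order inside each row.
    return [sorted(row, key=lambda w: w[1]) for row in rows]


def split_into_cells(row: List[Word], gap: float = 15.0) -> List[str]:
    """Merge neighbouring words into one cell unless the horizontal gap exceeds `gap`."""
    cells = [row[0][0]]
    prev_x_max = row[0][3]
    for text, x_min, _, x_max, _ in row[1:]:
        if x_min - prev_x_max > gap:
            cells.append(text)          # wide gap: start a new cell
        else:
            cells[-1] += " " + text     # narrow gap: same cell
        prev_x_max = x_max
    return cells


# Made-up sample words, only to exercise the functions above.
words = [
    ("Revenue", 50, 100, 95, 110), ("Receipts", 100, 100, 150, 110),
    ("1200.50", 300, 101, 345, 111), ("1350.75", 400, 100, 445, 110),
    ("Capital", 50, 120, 90, 130), ("Outlay", 95, 120, 130, 130),
    ("800.00", 300, 120, 340, 130), ("910.25", 400, 121, 440, 131),
]
table = [split_into_cells(row) for row in group_into_rows(words)]
```

The resulting list of lists could be passed to `pandas.DataFrame(table)` for further cleaning. Fixed tolerances like these are fragile on real documents, which is one reason to combine textual and geometrical features.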

Outline

  • Setting the scene
  • Issues with Indian Budget Documents
  • Extracting tables with boundaries
    • Detecting Table Boundaries using OpenCV
    • Leveraging Open Source Tools like “Tabula”
  • What about tables without boundaries?
  • Extracting information from tables without boundaries
    • Geometrical features using OpenCV library
    • Textual features using Poppler’s “pdf to text”
  • Building a pipeline to detect table components
    • Headers
    • Number Cells
    • Text Based Cells / Groupings
  • Detecting Table layout
    • Detecting rows
    • Detecting columns
    • Where each component lies
  • Extracting tables split across Pages
  • Building a base for machine learning models while doing so
  • Open Research using Jupyter Notebooks
  • How can you contribute?
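As an illustration of the "Detecting columns" item for tables without boundaries, the following is a rough sketch of a projection-profile approach: sum the ink in each pixel column and treat wide blank runs as column separators. This is a generic technique, not necessarily the method presented in the talk. The function names `find_gaps` and `column_separators`, the `min_gap` threshold and the synthetic page are assumptions. In practice the binary array might come from thresholding a page image with OpenCV.

```python
import numpy as np


def find_gaps(ink_profile: np.ndarray, min_gap: int) -> list:
    """Return (start, end) index pairs for runs of zeros at least min_gap long."""
    gaps = []
    start = None
    for i, value in enumerate(ink_profile):
        if value == 0 and start is None:
            start = i                      # a blank run begins
        elif value != 0 and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))    # blank run was wide enough
            start = None
    if start is not None and len(ink_profile) - start >= min_gap:
        gaps.append((start, len(ink_profile)))
    return gaps


def column_separators(binary: np.ndarray, min_gap: int = 3) -> list:
    """binary: 2-D array with 1 where there is ink and 0 for background."""
    profile = binary.sum(axis=0)           # ink count per pixel column
    width = binary.shape[1]
    # Skip gaps touching the page edge, since those are margins, not separators.
    return [(s + e) // 2 for s, e in find_gaps(profile, min_gap) if s > 0 and e < width]


# Tiny synthetic "page": three blocks of ink separated by blank columns.
page = np.zeros((6, 30), dtype=np.uint8)
page[1:5, 2:8] = 1      # first text column
page[1:5, 14:19] = 1    # second text column
page[1:5, 24:28] = 1    # third text column
separators = column_separators(page)
```

Summing along `axis=1` instead gives a row profile, so the same gap search can suggest row boundaries. Real budget tables with merged headers or wrapped text would need more than a single global profile.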

Prerequisites:

  • Python 2.7
  • pandas
  • numpy
  • Basic Image Manipulation using OpenCV

Content URLs:

Repo: https://github.com/heaven00/pycon_delhi_2017
Slides: https://heaven00.github.io/pycon_delhi_2017

Speaker Info:

Jayant works with Open Budgets India to help make India's budgets open, usable and easy to comprehend. On weekends he works with Datakind as a core team member to help make social organisations data driven.

Jayant is also a machine learning enthusiast and enjoys good food and games.

Speaker Links:

  • https://github.com/cbgaindia/parsers
  • https://github.com/cbgaindia/scrapers
  • https://github.com/heaven00

Section: Data Analysis and Visualization
Type: Talks
Target Audience: Intermediate
Last Updated:

It would be very interesting to see if and how information can be extracted with a PDF parser, similar to an HTML parser. This would be in contrast to detecting it visually with OpenCV.

Abhas Bhattacharya (~bendtherules)

Hi, it would be nice if you could add your slides before 10 Sept. It will help our team review your proposal. Thanks.

Rajat Saini (~rajataaron)

@Abhas Bhattacharya What we are working on is a combination of both visual/geometrical features and textual information.

@Rajat Saini I think I should be able to do a rough draft this weekend. I will upload it by Sunday (10th Sep) night or Monday (11th Sep) afternoon (12:00pm).

jayant (~heaven00)

@rajataaron I have added the link to the repo. Will be updating the slides link sometime during the afternoon. I had travel plans over the weekend, so the presentation isn't quite finished yet.

jayant (~heaven00)

Hi, can you complete the slides? It is an interesting talk, and I was hoping to see more content.

Anand B Pillai (~pythonhacker)

I was waiting for confirmation on whether I am speaking before investing more time in completing the content.

So far I haven't received any communication, so I concluded that it wouldn't be my turn this time :)

I will redo some of the content later on with updated research, though.

jayant (~heaven00)

You are scheduled at 11:30 - 12:15 on the 1st day of the conference (Nov 4).

Ashish Kulkarni (~ashish69)
