Liberating tabular data from the clutches of PDFs

jayant (~heaven00) | 31 Aug, 2017

Description:

Budget documents are moral documents that represent the priorities and values of states and their governing bodies. Unfortunately, these documents are published as unstructured PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data.

In this session we will delve into how to build a data pipeline that uses computer vision techniques to parse these documents into clean, machine-readable formats, using libraries and tools such as OpenCV, numpy, pandas, PyPDF2, tabula and Poppler's pdftotext.

Outline

Setting the scene
Issues with Indian Budget Documents
Extracting tables with boundaries
- Detecting Table Boundaries using OpenCV
- Leveraging Open Source Tools like “Tabula”
What about tables without boundaries?
Extracting information from tables without boundaries
- Geometrical features using OpenCV library
- Textual features using Poppler’s “pdftotext”
Building a pipeline to detect table components
- Headers
- Number Cells
- Text Based Cells / Groupings
Detecting Table layout
- Detecting rows
- Detecting columns
- Where each component lies
Extracting tables split across Pages
Building a base for machine learning models while doing so
Open Research using Jupyter Notebooks
How you can contribute?

Prerequisites:

Python 2.7
pandas
numpy
Basic Image Manipulation using OpenCV

Content URLs:

Repo: https://github.com/heaven00/pycon_delhi_2017
Slides: https://heaven00.github.io/pycon_delhi_2017

Speaker Info:

Jayant works with Open Budgets India to help make India's budgets open, usable and easy to comprehend. On weekends he works with DataKind as a core team member, helping social organisations become data driven.

Jayant is also a machine learning enthusiast and enjoys good food and games.

Speaker Links:

- https://github.com/cbgaindia/parsers
- https://github.com/cbgaindia/scrapers
- https://github.com/heaven00

Section:	Data Analysis and Visualization
Type:	Talks
Target Audience:	Intermediate
Last Updated:	17 Oct, 2017
