Liberating tabular data from the clutches of PDFs
jayant (~heaven00)
Budget documents are moral documents that represent the priorities and values of a state and its governing bodies. Unfortunately, these documents are published as unstructured PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data.
In this session we will delve into how to build a data pipeline that uses computer vision techniques to parse these documents into clean, machine-readable formats, with the help of libraries and tools like OpenCV, numpy, pandas, PyPDF2, tabula and Poppler's pdftotext.
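As a rough illustration of what "clean, machine-readable" output can mean here, the sketch below takes a few lines of layout-preserving text (the kind of output a tool such as pdftotext can produce with its -layout option) and turns them into typed rows and CSV. It is only a minimal sketch and not the pipeline presented in the talk: the sample text, the two-space splitting rule and the function names split_cells, to_value, parse_table and to_csv are all assumptions made for this example, and it is written for Python 3.

```python
import csv
import io
import re

# Hypothetical sample: lines as a layout-preserving text extractor might emit them.
SAMPLE = """\
Head of Account          2015-16      2016-17
Salaries                1,23,456       1,30,000
Office Expenses            4,500          5,250
"""

# A number cell: optional minus sign, digits with commas, optional decimal part.
NUMBER = re.compile(r"^-?\d[\d,]*(\.\d+)?$")


def split_cells(line):
    """Split one text line into cells on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def to_value(cell):
    """Convert a number cell to int or float; return text cells unchanged."""
    if NUMBER.match(cell):
        cleaned = cell.replace(",", "")
        return float(cleaned) if "." in cleaned else int(cleaned)
    return cell


def parse_table(text):
    """Parse every non-empty line into a list of typed cells."""
    return [
        [to_value(cell) for cell in split_cells(line)]
        for line in text.splitlines()
        if line.strip()
    ]


def to_csv(rows):
    """Serialise the parsed rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = parse_table(SAMPLE)
print(to_csv(rows))
```

Splitting on wide gaps only works while columns stay visually separated and cells contain no double spaces, which is one reason a pipeline would also need geometrical features from the rendered page.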
- Setting the scene
- Issues with Indian Budget Documents
- Extracting Tables with boundaries
- Detecting Table Boundaries using OpenCV
- Leveraging Open Source Tools like “Tabula”
- What about tables without boundaries?
- Extracting information from tables without boundaries
- Geometrical features using OpenCV library
- Textual features using Poppler's pdftotext
- Building a pipeline to detect table components
- Number Cells
- Text Based Cells / Groupings
- Detecting Table layout
- Detecting rows
- Detecting columns
- Where each component lies
- Extracting tables split across Pages
- Building a base for machine learning models while doing so.
- Open Research using Jupyter Notebooks
- How can you contribute?
- Python 2.7
- Basic Image Manipulation using OpenCV
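
To make the row and column detection steps in the outline above more concrete, here is a minimal sketch that groups positioned words into rows and columns for a table without drawn boundaries. It assumes each word already comes with a bounding box (from an image-based or text-based extractor); the Word tuple, the tolerance values and the sample coordinates are hypothetical and chosen only for illustration, so this should not be read as the method used in the talk. It is written for Python 3 and uses only the standard library.

```python
from collections import namedtuple

# A word and its bounding box; origin is assumed to be the top-left of the page.
Word = namedtuple("Word", "text x0 y0 x1 y1")

# Hypothetical words from a borderless three-column table.
WORDS = [
    Word("Head", 10, 10, 40, 20),
    Word("2015-16", 150, 10, 190, 20),
    Word("2016-17", 250, 10, 290, 20),
    Word("Salaries", 10, 30, 55, 40),
    Word("1,23,456", 145, 30, 190, 40),
    Word("1,30,000", 245, 30, 290, 40),
    Word("Office", 10, 50, 45, 60),
    Word("Expenses", 49, 50, 95, 60),
    Word("4,500", 160, 50, 190, 60),
    Word("5,250", 260, 50, 290, 60),
]


def group_rows(words, y_tol=3.0):
    """Group words whose vertical centres lie within y_tol of a row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w.y0 + w.y1) / 2.0):
        centre = (word.y0 + word.y1) / 2.0
        if rows and abs(centre - rows[-1]["centre"]) <= y_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    # Within each row, order the words left to right.
    return [sorted(row["words"], key=lambda w: w.x0) for row in rows]


def column_spans(words, gap=5.0):
    """Merge horizontal word extents closer than gap; each merged span is a column."""
    spans = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if spans and x0 <= spans[-1][1] + gap:
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            spans.append([x0, x1])
    return spans


def to_grid(words):
    """Place every word in the cell given by its row and the column containing its centre."""
    spans = column_spans(words)
    grid = []
    for row in group_rows(words):
        cells = [""] * len(spans)
        for word in row:
            middle = (word.x0 + word.x1) / 2.0
            for index, (x0, x1) in enumerate(spans):
                if x0 <= middle <= x1:
                    cells[index] = (cells[index] + " " + word.text).strip()
                    break
        grid.append(cells)
    return grid


for cells in to_grid(WORDS):
    print(cells)
```

Merging horizontal extents treats any wide vertical strip of whitespace as a column separator. That is a simple heuristic: it breaks down when a long cell spans several columns, which is where per-cell classification (number cells versus text groupings) can help.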
Jayant works with Open Budgets India to help make India's budgets open, usable and easy to comprehend. On weekends he works with Datakind as a core team member, helping social organisations become data driven.
Jayant is also a machine learning enthusiast and enjoys good food and games.