Extracting tabular data from PDFs with Camelot & Excalibur

Vinayak Mehta (~vinayak-mehta)


Description:

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs, and getting tables out for analysis is a pain. Simply copying and pasting from a PDF into a text file or spreadsheet program rarely preserves the rows and columns.

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs using the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur solve these problems better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSVs or Excel files.

Prerequisites:

Basic familiarity with the Python programming language will help the audience understand the talk better. The talk will briefly touch on the history of PDF and the PDF table extraction use case, so knowing about either isn’t a prerequisite. The talk can be particularly helpful for data analysts, scientists and journalists, since they work with a lot of open data (much of which is shared as PDFs) and have a recurring need to extract tables from PDFs for analysis and record-keeping.

After watching this talk, the audience will have a high-level understanding of how the Portable Document Format works. They will also learn how to easily extract tabular data from any type of PDF (the table structures can be bizarre!) using Camelot (the Python library) or Excalibur (the web interface), access extracted tables as pandas DataFrames and save them into CSVs or Excel files.
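As a rough illustration of that workflow, the sketch below reads tables from a PDF with Camelot, takes each one as a pandas DataFrame, and writes it to a CSV file. `camelot.read_pdf` and the `df` attribute come from the Camelot library; the helper names `csv_name` and `extract_tables_to_csv`, the file naming scheme, and the default `pages` and `flavor` values are assumptions made for this example, not part of either package.

```python
from pathlib import Path


def csv_name(pdf_path, index):
    """Hypothetical naming scheme: '<pdf stem>-table-<n>.csv', counting from 1."""
    return f"{Path(pdf_path).stem}-table-{index + 1}.csv"


def extract_tables_to_csv(pdf_path, out_dir, pages="1", flavor="lattice"):
    """Extract tables from pdf_path and write one CSV per table into out_dir.

    Returns the list of CSV paths that were written.
    """
    # Imported lazily so this module still loads where camelot-py is absent.
    import camelot

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # read_pdf returns a list-like collection of detected tables.
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)

    written = []
    for index, table in enumerate(tables):
        frame = table.df  # the extracted table as a pandas DataFrame
        target = out_dir / csv_name(pdf_path, index)
        # Camelot's DataFrames use positional column labels, so skip the header.
        frame.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

Calling `extract_tables_to_csv("foo.pdf", "out")` would, under these assumptions, leave files such as `out/foo-table-1.csv` behind, one per detected table on the selected pages.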

Content URLs:

I've given this talk at:

  • PyCon AU 2019: https://www.youtube.com/watch?v=99A9Fz6uHAA
  • PyCon US 2019: https://www.youtube.com/watch?v=Irf6kdl0lAA

You can check out the slides for this talk here: https://speakerdeck.com/vinayakmehta/extracting-tabular-data-from-pdfs-using-camelot-and-excalibur-pycon-au

Speaker Info:

I'm the author of both the Python library, Camelot (https://github.com/socialcopsdev/camelot) and the web interface, Excalibur (https://github.com/camelot-dev/excalibur). I have also written multiple blog posts on this topic and published two of them on Hacker Noon.

Here's an outline for my talk:

  • Introductions (3 mins)
    • Greetings
    • Introduce myself
    • Set expectations for the talk
  • History of the Portable Document Format (2 mins)
    • The Camelot Project
    • PostScript, the page description language
    • Universal need for sharing documents
  • “I want to extract tables from this PDF. What do I do?” (5 mins)
    • How/where I stumbled across the problem
    • Problems with tabular data being released in PDFs
    • Why another library/tool?
      • Problems with existing libraries/tools
  • Camelot: PDF Table Extraction for Humans (7 mins)
    • Why the name? (1 min)
      • Monty Python and the Holy Grail reference
      • Fun fact about a Monty Python reference used in the Python programming language
    • How to install and run? (1 min)
      • pip install camelot-py
      • A simple API inspired by requests and pandas
    • How to use? (5 mins)
      • Visual debugging
      • Add table areas and columns
      • Flag superscripts and subscripts
      • Shift and copy text in spanning cells of the extracted table
  • Excalibur: The Web Interface to Camelot (7 mins)
    • Why the name? (1 min)
      • Monty Python and the Holy Grail reference
      • Fun fact about a Monty Python reference used in the Python programming language
    • How to install and run? (1 min)
      • pip install excalibur-py
      • Built with configurability and scale in mind, Airflow-esque
    • How to use? (5 mins)
      • Upload and select page numbers
      • Table auto-detection
      • Draw table areas/columns
      • Input Camelot advanced settings
      • Save settings and select pre-saved settings
      • Download extracted tables in any format (CSV, Excel, JSON, HTML)
  • How to get involved (1 min)
    • Contributions are welcome!
    • Planned enhancements
      • OCR to extract text and tables from images
      • Removing OpenCV as a dependency
    • Links to documentation and issue tracker
    • Parting note and thank you
  • Questions (5 mins)
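
The Camelot "How to use?" items in the outline (table areas, columns, flagging superscripts and subscripts, shifting and copying text in spanning cells) map onto keyword arguments of `camelot.read_pdf`. The sketch below is one hedged way to collect them; `camelot_options` is a hypothetical helper written for this example, and the coordinate strings in the usage note are placeholders rather than values from a real document. As far as I understand the library, `columns` is meant for the `stream` flavor while `copy_text` and `shift_text` are meant for `lattice`, so the helper rejects mismatched combinations.

```python
def camelot_options(flavor, table_areas=None, columns=None,
                    flag_size=False, copy_text=None, shift_text=None):
    """Build keyword arguments for camelot.read_pdf (hypothetical helper).

    table_areas: list of "x1,y1,x2,y2" strings in PDF coordinate space.
    columns:     list of comma-separated x-coordinate strings (stream only).
    flag_size:   flag text by font size, useful for super/subscripts.
    copy_text:   directions in which to copy text in spanning cells (lattice only).
    shift_text:  directions in which to shift text in spanning cells (lattice only).
    """
    if flavor not in ("lattice", "stream"):
        raise ValueError("flavor must be 'lattice' or 'stream'")
    if flavor == "lattice" and columns:
        raise ValueError("columns applies to flavor='stream' only")
    if flavor == "stream" and (copy_text or shift_text):
        raise ValueError("copy_text and shift_text apply to flavor='lattice' only")

    options = {"flavor": flavor, "flag_size": flag_size}
    # Only pass through the settings the caller actually supplied.
    if table_areas:
        options["table_areas"] = list(table_areas)
    if columns:
        options["columns"] = list(columns)
    if copy_text:
        options["copy_text"] = list(copy_text)
    if shift_text:
        options["shift_text"] = list(shift_text)
    return options
```

With camelot-py installed, the result can be unpacked into the call, for example `camelot.read_pdf("foo.pdf", pages="1", **camelot_options("lattice", table_areas=["0,500,600,0"], copy_text=["v"]))`.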

Speaker Links:

[1] My website: https://www.vinayakmehta.com

[2] Camelot: https://github.com/socialcopsdev/camelot

[3] Excalibur: https://github.com/camelot-dev/excalibur

Section: Developer tools and automation
Type: Talks
Target Audience: Beginner
Last Updated: