Camelot and Excalibur: PDF Table Extraction for Humans

Vinayak Mehta (~vinayak-mehta)



Description:

Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs, and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program usually doesn’t preserve the rows and columns.

Camelot and Excalibur automatically detect and extract tables from PDFs and give you access to the extracted tables in pandas DataFrames. You can also download them as CSVs or Excel files.
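As a rough sketch of what that looks like from Python, the snippet below wraps Camelot's `read_pdf` call and writes each extracted table to a CSV file. The helper names (`csv_name`, `extract_tables`, `save_tables`) and the output naming scheme are made up for illustration and are not part of Camelot. It assumes Camelot and its dependencies are already installed.

```python
from pathlib import Path


def csv_name(pdf_path, index):
    """Build an output name such as 'report-table-1.csv' (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}-table-{index}.csv"


def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return the tables Camelot finds on the given pages as pandas DataFrames.

    flavor="lattice" targets tables drawn with ruling lines;
    flavor="stream" targets tables whose cells are separated by whitespace.
    """
    # Imported lazily so this module can be loaded without Camelot installed.
    import camelot

    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)
    # Each extracted table exposes its contents as a DataFrame via `.df`.
    return [tables[i].df for i in range(len(tables))]


def save_tables(pdf_path, out_dir=".", **kwargs):
    """Extract every table and write each one to its own CSV file."""
    frames = extract_tables(pdf_path, **kwargs)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, df in enumerate(frames, start=1):
        target = out_dir / csv_name(pdf_path, i)
        # Extracted frames carry positional column labels, so skip the header.
        df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

Calling `save_tables("report.pdf", out_dir="tables", pages="1-3")` on a PDF of your own would write one CSV per detected table. Camelot's table objects also offer `to_csv` and `to_excel` directly if you don't need the DataFrame in between.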

I'm the author and maintainer of these projects and can help participants set up their development environments and start working on issues.

Prerequisites:

Some experience with Python would help.

PS: If you've written Python interfaces to C libraries or have worked with C, then do show up for a bit! Your help would be highly appreciated :)

Content URLs:

  • The Camelot Contributor's Guide is helpful reading before the sprint.

  • It is recommended that participants install both packages before the sprint.

  • I'll be opening issues tagged Good First Issue on the GitHub repos; these will be easy, beginner-friendly issues.

  • You can find the repos in this GitHub organization: https://github.com/camelot-dev

Section: Developer tools and automation
Type: DevSprint
Target Audience: Advanced