Wrangling Unconventional File Formats with Python: Playing with PDFs

greatdevaks



Description:

Data wrangling involves detecting, correcting, removing, or otherwise handling inaccurate and corrupted data. The most common file formats for storing data are CSV, JSON, and XML. Often, however, data is not available in one of these formats and instead arrives in an unconventional format such as PDF or PPT. Parsing PDFs can seem daunting because the format is unpredictable and hard to parse. This workshop will help you understand how to wrangle PDFs in an easy and fun way.

The workshop will follow this flow:

  • Self Introduction
  • Brief Introduction to Data Wrangling
  • Why prefer CSV, JSON, or XML?
  • Why avoid using PDFs?
  • Basics of RegEx based Pattern Matching
  • Parsing PDFs Programmatically using "slate" and "pdfminer": Getting hands-on
  • Inefficient Parsing? Consider Data Cleaning
  • Exploring PDF Wrangling with "pdftables"
  • Where to go from here?
  • Question and Answer Session
  • The End :)
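
As a small taste of the RegEx and data cleaning steps in the flow above, here is a minimal sketch using only the Python standard library. The sample text, the column layout, and the helper names (RAW_TEXT, ROW_PATTERN, extract_rows, rows_to_csv) are assumptions made up for illustration; they are not taken from "slate", "pdfminer", or "pdftables", and real extractor output will usually need a different pattern.

```python
import csv
import io
import re

# Hypothetical text as it might come out of a PDF text extractor:
# columns are separated by runs of spaces, and page headers are mixed in.
RAW_TEXT = """\
Annual Report 2017        Page 1
Country      Year   Value
India        2016   12.5
Finland      2016    8.25
Page 2
Canada       2016   10.0
"""

# Assumed row shape: a name, at least two spaces, a four-digit year,
# and a number with an optional decimal part.
ROW_PATTERN = re.compile(
    r"^(?P<country>[A-Za-z ]+?)\s{2,}(?P<year>\d{4})\s+(?P<value>\d+(?:\.\d+)?)\s*$"
)


def extract_rows(text):
    """Keep only the lines that look like data rows and convert their types."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            rows.append(
                (match["country"].strip(), int(match["year"]), float(match["value"]))
            )
    return rows


def rows_to_csv(rows):
    """Serialize the cleaned rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["country", "year", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(RAW_TEXT)))
```

Header and page-number lines fail the pattern and are dropped, which is the kind of cleaning that is usually needed after text has been pulled out of a PDF.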

Key Takeaways:

  • Gain confidence in Data Wrangling using Python.
  • Get familiar with the daunting task of parsing PDFs.
  • Get hands-on with popular PDF Wrangling libraries in Python: "slate", "pdfminer", and "pdftables".
  • Understand the concept and importance of Data Cleaning.

Prerequisites:

  • Basic knowledge of Python programming.
  • Familiarity with wrangling CSV, JSON, or XML files is helpful but not required.

Speaker Info:

Highlights:

  • Former Software Developer Intern at IBM & an ALL STACK DEVELOPER capable of designing and developing solutions for Mobile, Web, Embedded Systems, and Desktop. Areas of interest are Computational Neuroscience, Deep Learning, and Cloud Computing.
  • Represented India at International Hackathons like Hack Junction’16, Finland and Hack the North’16, Canada. Invited to more than a dozen prestigious International Hackathons (PennApps’17, HackNY’17, Hack Princeton’17, and many more) and Conferences.
  • Recently talked about "Understanding and Implementing Recurrent Neural Networks using Python" at GeoPython 2018, Basel, Switzerland.
  • Will be speaking about Artificial Neural Networks at EuroPython 2018, Edinburgh, Scotland.
  • A Microsoft Certified Professional, Microsoft Technology Associate, IBM Certified Web Developer, and Hewlett Packard Certified Developer.
  • Has 8+ International Publications. (The latest work was published at ACM CHI 2018, and the project was exhibited in Montreal, Canada.)
  • Received 6 Honours and Awards (International and National level).

My compact Biography: My name is Anmol Krishan Sachdeva. I am currently pursuing an MSc in Advanced Computing at the University of Bristol, United Kingdom. My specialization is in AI, ML, Applied Data Science, Computer Vision, and Computational Neuroscience. I am also doing research on Neural Networks and Computational Neuroscience. This conference is the right place to share this knowledge, and I look forward to speaking at it.

Section: Core python and Standard library
Type: Workshops
Target Audience: Beginner