Automatic Data Validation and Cleaning with PySemantic

Jaidev Deshpande (~jaidev)



Description:

Data is dirty. Any dataset that isn't properly curated and stored can suffer from problems such as mixed data types, improper encoding or escaping, an uneven number of fields, and so on. None of these problems is unsolvable. In fact, most of us are pretty good at cleaning data. Normally, when we know little or nothing about a given dataset, we proceed in a very predictable manner. We first try to read the data naively and see if the parser raises errors. If it does, we fix our function calls. Once those are fixed, we run some sanity checks on the data, and end up filtering the dataset, sometimes quite heavily.
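The routine described above can be sketched with a plain pandas parser. This is only an illustration: the sample data, the column names and the `read_reactively` helper are made up for this example, and the `on_bad_lines` option assumes pandas 1.3 or newer.

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many,
# and the last row has a score outside the expected range.
RAW = "id,score\n1,0.9\n2,0.4,extra\n3,-1.0\n"


def read_reactively(text):
    """Hypothetical helper that mimics the manual, reactive workflow."""
    try:
        # Step 1: read naively and see whether the parser complains.
        df = pd.read_csv(io.StringIO(text))
    except pd.errors.ParserError:
        # Step 2: fix the function call, here by skipping malformed rows.
        df = pd.read_csv(io.StringIO(text), on_bad_lines="skip")
    # Step 3: run a sanity check and filter out rows that fail it.
    return df[df["score"].between(0, 1)].reset_index(drop=True)


cleaned = read_reactively(RAW)
print(cleaned)
```

Only the row with `id` 1 survives: one row is dropped by the parser fix and another by the sanity check.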

The problem with this process is that it is iterative, and worse, it is reactive. Everybody in the team has to do it if they are to use the dataset. Sure, one can simply clean it up and dump it in a new file with just a few lines of code. But we shouldn't have to write and run such a script every time we encounter a new dataset. We would be much more comfortable if data were cleaned as it is read. It is much more efficient if data cleaning is a part of data ingestion. Secondly, and more importantly, cleaning data via ad-hoc Python scripts is non-trivial. Readable as Python scripts might be, it's not always easy for everyone in the team to change the cleaning process. Moreover, there are no Python libraries that offer an abstraction at the level of cleaning and validating data.

Therefore, if one has to go through the process of data validation and cleaning in a customizable, modular way, one has to make sure that:

  • the specifications for all datasets are in one place, not in different scripts.
  • datasets are grouped under a suitable name that pertains to a particular project.
  • strict validation and cleaning rules are applied to all aspects of a dataset.
  • the process of validation and cleaning is identically reproducible by everyone who works on the data.
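As a rough illustration of these four requirements, here is a minimal sketch built on plain pandas. It is not PySemantic's API: the `SPECS` registry, the `load_dataset` function, the project and dataset names, and the sample data are all hypothetical and exist only for this example.

```python
import io

import pandas as pd

# Hypothetical central registry: project name -> dataset name -> specification.
# Every rule for a dataset lives here instead of in scattered scripts.
SPECS = {
    "demo_project": {
        "people": {
            # Expected columns and their final data types.
            "dtypes": {"name": str, "age": int},
            # Column-level rules: each returns a boolean mask of valid rows.
            "rules": {"age": lambda s: s.between(0, 120)},
        }
    }
}


def load_dataset(project, name, source):
    """Read `source` and clean it according to the registered spec."""
    spec = SPECS[project][name]
    columns = list(spec["dtypes"])
    # Raises ValueError if an expected column is missing from the file.
    df = pd.read_csv(source, usecols=columns)
    # Coerce numeric columns; unparseable values become NaN.
    for col, dtype in spec["dtypes"].items():
        if dtype in (int, float):
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # Drop rows with missing or unparseable values.
    df = df.dropna(subset=columns)
    # Apply the validation rules.
    for col, rule in spec["rules"].items():
        df = df[rule(df[col])]
    return df.astype(spec["dtypes"]).reset_index(drop=True)


RAW = "name,age\nAlice,34\nBob,unknown\nCarol,-5\nDave,51\n"
people = load_dataset("demo_project", "people", io.StringIO(RAW))
print(people)
```

Because the cleaning happens inside the loader, everyone who calls `load_dataset("demo_project", "people", ...)` gets the same two valid rows (Alice and Dave), and changing a rule means editing the registry rather than a script.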

PySemantic is a Python module that automates all of this, and more. The purpose of this talk is to introduce this module and talk about the best practices of cleaning and validating data.

Prerequisites:

Knowledge prerequisites:

  • Basic Python data structures
  • Pandas parsers
  • NumPy ndarrays and their data types
  • Basic tabular data analysis

Software prerequisites: see https://github.com/motherbox/pysemantic#dependencies

Content URLs:

Here's a video that explains PySemantic in some detail (Note that it was meant for an audience of non-programmers):