Automatic Data Validation and Cleaning with PySemantic
Jaidev Deshpande (~jaidev)
Data is dirty. Any dataset that isn't properly curated and stored can suffer from problems such as mixed data types, improper encoding or escaping, an uneven number of fields, and so on. None of these problems are unsolvable. In fact, most of us are pretty good at cleaning data. Normally, when we know little or nothing about a given dataset, we proceed in a very predictable manner. We first try to read the data naively and see if the parser raises errors. If it does, we fix our function calls. Once the data can be read, we run some sanity checks on it, and end up filtering the dataset, sometimes quite heavily.
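The read, fix, and filter cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the sample data, the column names, and the helper functions `read_naively`, `read_with_fixes`, and `sanity_filter` are all hypothetical, and the standard-library `csv` module is used to keep the sketch self-contained. The same flow applies when reading with Pandas parsers.

```python
import csv
import io

# Hypothetical raw data with typical problems: a non-numeric age,
# a row with an extra field, and an implausible negative value.
RAW = """name,age,city
alice,31,Pune
bob,unknown,Mumbai
carol,27,Delhi,extra
dave,-5,Pune
"""


def read_naively(text):
    """Step 1: parse with no safeguards; fails on the first dirty value."""
    body = list(csv.reader(io.StringIO(text)))[1:]
    return [{"name": r[0], "age": int(r[1]), "city": r[2]} for r in body]


def read_with_fixes(text):
    """Step 2: adjust the reading code after seeing the error."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    records = []
    for r in body:
        if len(r) != len(header):
            continue  # skip rows with an uneven number of fields
        try:
            age = int(r[1])
        except ValueError:
            age = None  # treat unparseable ages as missing
        records.append({"name": r[0], "age": age, "city": r[2]})
    return records


def sanity_filter(records):
    """Step 3: keep only rows that pass basic sanity checks."""
    return [rec for rec in records if rec["age"] is not None and rec["age"] > 0]


try:
    read_naively(RAW)
except ValueError as exc:
    print("naive read failed:", exc)

clean = sanity_filter(read_with_fixes(RAW))
print(clean)
```

In this made-up sample only one of the four rows survives, which shows how heavily the sanity checks can end up filtering a dataset.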
The problem with this process is that it is iterative and, worse, reactive. Everybody on the team has to do it if they are to use the dataset. Sure, one can simply clean the data up and dump it in a new file with just a few lines of code. But we shouldn't have to write and run such a script every time we encounter a new dataset. We would be much more comfortable if data were cleaned as it is read. It is much more efficient if data cleaning is a part of data ingestion. Secondly, and more importantly, cleaning data via ad hoc Python scripts is nontrivial. Readable as Python scripts might be, it's not always easy for everyone on the team to change the cleaning process. Moreover, there are no Python libraries that offer an abstraction at the level of cleaning and validating data.
Therefore, to validate and clean data in a customizable, modular way, one has to make sure that:
- the specifications for all datasets are in one place, not scattered across different scripts.
- datasets are grouped under a suitable name that pertains to a particular project.
- strict validation and cleaning rules are applied to all aspects of a dataset.
- the process of validation and cleaning is identically reproducible by everyone who works on the data.
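A minimal sketch of these four requirements, using only the standard library, might look like the following. It is not PySemantic's API: the `SPECS` registry, the `load_dataset` function, the project and dataset names, and the sample data are all hypothetical and exist only to show cleaning happening as part of ingestion.

```python
import csv
import io

# Hypothetical central registry: project name -> dataset name -> specification.
# Keeping it in one shared place means everyone applies the same rules.
SPECS = {
    "survey_project": {
        "respondents": {
            "columns": {"name": str, "age": int, "city": str},
            "rules": {
                "age": lambda v: 0 < v < 120,
                "city": lambda v: v in {"Pune", "Mumbai", "Delhi"},
            },
        },
    },
}


def load_dataset(project, dataset, text):
    """Parse CSV text and keep only rows that satisfy the registered spec."""
    spec = SPECS[project][dataset]
    columns, rules = spec["columns"], spec["rules"]
    clean = []
    for row in csv.DictReader(io.StringIO(text)):
        # DictReader stores extra fields under a None key and fills
        # missing fields with None values; reject both kinds of row.
        if None in row or None in row.values():
            continue
        try:
            record = {col: cast(row[col]) for col, cast in columns.items()}
        except (KeyError, ValueError):
            continue  # missing column, or a value that cannot be converted
        if all(check(record[col]) for col, check in rules.items()):
            clean.append(record)
    return clean


# Hypothetical sample input with several kinds of dirty rows.
SAMPLE = """name,age,city
alice,31,Pune
bob,unknown,Mumbai
carol,27,Delhi,extra
dave,-5,Pune
erin,44,Paris
"""

print(load_dataset("survey_project", "respondents", SAMPLE))
```

Because the rules live in the registry rather than in individual scripts, changing a rule in one place changes it for every caller, and two people loading the same dataset get the same cleaned result.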
PySemantic is a Python module that automates all of this, and more. The purpose of this talk is to introduce this module and talk about the best practices of cleaning and validating data.
Prerequisites:
- Basic Python data structures
- Pandas parsers
- NumPy ndarrays and their data types
- Basic tabular data analysis
Software Prerequisites - See https://github.com/motherbox/pysemantic#dependencies
Here's a video that explains PySemantic in some detail (Note that it was meant for an audience of non-programmers):
Slides will be available shortly.
I'm a data scientist at DataCulture Analytics (http://dataculture.io), where I build large-scale machine learning applications. Previously, I worked at Enthought, Inc., where I was one of the developers of the Canopy data analysis platform. I've been a research assistant in the fields of machine learning and signal processing at the Tata Institute of Fundamental Research and the University of Pune. I love developing GUI apps and signal processing tools in my free time.
- GitHub: http://github.com/jaidevd
- Twitter: http://twitter.com/jaidevd
- AboutMe: http://about.me/jaidevd