Structured data pre-processing, every data scientist's nightmare - decoded
Dilsher Mann (~dilsher)
In almost every talk given by a Data Scientist, one concern recurs: the quality of the input data, or how to get more usable data out of what is procured from third-party data vendors. Both fall under an umbrella term, data pre-processing. It eats into valuable time and also affects the performance and suitability of the analytics engines run over the input data. The responsibility for delivering data that meets the recommended quality standards is partly shared by the Data Engineer, whose day job is to build pipelines that reliably supply data to a data store (typically a data warehouse or a data lake).
Having worked on multiple data engineering projects, I had long felt the need to somehow automate this act of data cleansing and pre-processing. After going through the multitude of options available for data pre-processing, I finally found an elegant solution: Cerberus, an open-source project by Nicola Iarocci.
Cerberus is a lightweight and extensible data validation library for Python. It lets users define validation rules as simple Python data structures and then validate input data against those rules. The rules cover almost every validation check commonly applied to raw data sets in data analytics projects.
So, in this talk, I would like to share how this simple open-source tool has made my life easier as a Data Engineer, and what benefits fellow data enthusiasts can reap from it in their own Python projects.
A general understanding of Python data structures and a keen interest in the data domain would be perfect.
Dilsher Singh Mann
Dilsher is a Computer Science graduate from NIT Jalandhar and has been working in the data engineering domain for three years. He likes to unpack the day-to-day challenges data engineers face and find simple solutions to them.