OLMONK: Data validation package

ankur09011



Description:

The data analysis cycle involves many stages, from data gathering to data visualization, and data filtering is often one of the rate-limiting steps. The need for filtering arises from problems in the way data is entered and stored. Mistakes as simple as storing a numeric column as a character data type, or passing null values to required parameters, can result in crashes or undefined behavior. OLMONK helps avoid such mistakes and enables smooth data analysis and visualization.

The package can be used as a validation layer in front of any kind of data analysis software. While building software, the developer can import OLMONK and define constraints that match the required input format. The package then logs appropriate warnings and errors, along with the location of the offending data (i.e. row and column), to a file and notifies the user. OLMONK can rectify most standard inconsistencies itself; cases that require human intervention are raised as errors.
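
The idea of a validation layer that reports the row and column of each problem can be sketched in plain Python. This is only an illustration using the standard library, not OLMONK's actual API: the `CONSTRAINTS` table, the `validate_rows` function, and the demo column names are all hypothetical.

```python
import csv
import io

# Hypothetical constraint table: column name -> (converter, required flag).
CONSTRAINTS = {
    "sample_id": (str, True),
    "intensity": (float, True),
    "replicate": (int, False),
}


def validate_rows(handle, constraints):
    """Return a list of (level, row_number, column, message) tuples."""
    issues = []
    reader = csv.DictReader(handle)
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column, (converter, required) in constraints.items():
            value = (row.get(column) or "").strip()
            if not value:
                # An empty required cell is an error; an empty optional cell is a warning.
                level = "ERROR" if required else "WARNING"
                issues.append((level, row_number, column, "value is empty"))
                continue
            try:
                converter(value)
            except ValueError:
                message = f"cannot parse {value!r} as {converter.__name__}"
                issues.append(("ERROR", row_number, column, message))
    return issues


# Small in-memory demo file standing in for a real .csv input.
demo = io.StringIO(
    "sample_id,intensity,replicate\n"
    "s1,10.5,1\n"
    "s2,high,2\n"
    ",3.2,\n"
)
for level, row_number, column, message in validate_rows(demo, CONSTRAINTS):
    print(f"{level} row {row_number}, column {column}: {message}")
```

In this sketch the issues are printed; a real validation layer would more likely write them to a log file, as described above.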

It lets the user confirm that data is correct before feeding it to an analysis tool, and different kinds of data can be validated with tailor-made checks. It can currently process .csv, .txt, .xlsx, and .bed files.

OLMONK enables the user to:

  1. Validate data with a single config file
  2. Add external validation functions
  3. Generate reports in different formats
  4. Use inbuilt data validation checks such as a check for subset, superset or duplicates
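
To make the list above concrete, here is a minimal sketch of duplicate, subset, and superset checks, together with a small registry so that a user-supplied function can be referred to by name from a config. None of these names come from OLMONK: `find_duplicates`, `is_subset`, `is_superset`, `register_check`, `run_checks`, and the dictionary-shaped config are assumptions made for illustration.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


def is_subset(values, allowed):
    """True when every value is one of the allowed values."""
    return set(values) <= set(allowed)


def is_superset(values, required):
    """True when every required value appears at least once."""
    return set(values) >= set(required)


# Registry: check name -> function returning True when a column passes.
CHECKS = {
    "no_duplicates": lambda values: not find_duplicates(values),
}


def register_check(name, func):
    """Add a user-supplied check so a config can refer to it by name."""
    CHECKS[name] = func


def run_checks(columns, config):
    """Run the checks named in config; return failures as (column, check) pairs."""
    failures = []
    for column, check_names in config.items():
        for name in check_names:
            if not CHECKS[name](columns[column]):
                failures.append((column, name))
    return failures


# An "external" check written by the user rather than shipped with the tool.
def all_positive(values):
    return all(float(value) > 0 for value in values)


register_check("all_positive", all_positive)
register_check("known_groups", lambda values: is_subset(values, {"control", "treated"}))

# Hypothetical data and config: each column is mapped to the checks it must pass.
columns = {
    "sample_id": ["s1", "s2", "s2"],
    "intensity": ["10.5", "3.2", "7.0"],
    "group": ["control", "treated", "unknown"],
}
config = {
    "sample_id": ["no_duplicates"],
    "intensity": ["all_positive"],
    "group": ["known_groups"],
}
print(run_checks(columns, config))
```

Keeping the checks behind a name-to-function registry is one way a single config could drive both inbuilt and external validations.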

The code follows PEP 8 and PEP 257 guidelines throughout and has over 90% test coverage.

The talk will focus on how to use the package to validate demo files and will discuss the scope of the package in various industries.

Prerequisites:

Python

Content URLs:

https://docs.google.com/presentation/d/1FKpD4AQnwhei1O9OMPHF1sSoH9y4SYF2_n9AO79qjao/edit?usp=sharing

Speaker Info:

Ankur is a developer with over 4 years of experience across domains including Embedded Systems, Robotics, and Data Science. He currently works as a developer at Elucidata Corporation, where he builds models that help scientists process biological data. He holds a Bachelor's degree from the Indian Institute of Information Technology, Jabalpur.

Speaker Links:

LinkedIn Profile: https://linkedin.com/ankur-agrawal-9a280752/

Section: Data Analysis and Visualization
Type: Talks
Target Audience: Intermediate