Data Validation for Machine Learning Applications

Ved (~ved77) | 27 May, 2024

Description:

Abstract: In the rapidly evolving landscape of data science and machine learning, high-quality data is essential for building reliable models and making informed decisions. This talk explores the challenges that data scientists and machine learning practitioners face because of poor data quality, and shows how Python can be used to automate data quality checks across environments. By integrating Python-based data quality monitoring, organizations can improve their data health, leading to more accurate models and more insightful analytics.

Outline:

Introduction
- Brief overview of the importance of data quality in data science and machine learning.
- Introduction to Python's versatility and its application in data quality monitoring.
What is Data Validation?
- Definition of data validation as data quality checks.
- Importance of data quality based on project requirements.
- Checking assumptions about data to ensure accuracy, reliability, and relevance.
Why Data Validation is Crucial
- "Garbage in, garbage out" principle.
- Insights from Andrew Ng on the impact of data quality on model accuracy.
- Personal story highlighting the importance of data validation in real-world scenarios.
- Common sources of error, including buggy code, incorrect usage, hard-coded values, wrong source data, ecosystem changes, and wrong schema.
Challenges in Data Processing
- Common challenges faced by data scientists, including data inconsistency, volume management, and integration issues across different platforms.
- Impact of these challenges on data quality and subsequent decision-making processes.
Impact of Poor Data Quality on Machine Learning Models
- How poor data quality affects model training, leading to inaccurate predictions and potentially costly decisions.
- Examples of real-world consequences of poor data quality on machine learning model performance.
Key Takeaways
- Summary of the importance of data validation in machine learning.
- Best practices for ensuring data quality throughout the ML application lifecycle.
- Encouragement for attendees to implement automated data validation in their workflows.
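
As a rough illustration of the "checking assumptions about data" idea in the outline, the sketch below validates a few records against a small schema using only the Python standard library. The column names, value ranges, and the `validate_rows` helper are hypothetical and chosen for illustration; they are not taken from the talk or from any particular validation library.

```python
# Hypothetical schema: column name -> (expected type, value check).
# The columns and ranges are illustrative assumptions, not from the talk.
SCHEMA = {
    "user_id": (int, lambda v: v > 0),
    "age": (int, lambda v: 0 <= v <= 120),
    "country": (str, lambda v: len(v) == 2),
}


def validate_rows(rows, schema=SCHEMA):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: every expected column must be present.
        for col in sorted(set(schema) - set(row)):
            problems.append(f"row {i}: missing column '{col}'")
        for col, (expected_type, check) in schema.items():
            if col not in row:
                continue
            value = row[col]
            # Type check first, then the value-level assumption.
            if not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: '{col}' has type {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
            elif not check(value):
                problems.append(f"row {i}: '{col}' value {value!r} is out of range")
    return problems


rows = [
    {"user_id": 1, "age": 34, "country": "IN"},
    {"user_id": 2, "age": -5, "country": "IN"},    # violates the age range
    {"user_id": 3, "age": "41", "country": "US"},  # wrong type for age
    {"user_id": 4, "country": "GB"},               # age column missing
]

for problem in validate_rows(rows):
    print(problem)
```

In an automated workflow, a check like this could run before training or scoring and fail the job whenever the returned list is non-empty, so that schema changes or bad source data surface early rather than as degraded model performance.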

Takeaways:
- A clear understanding of the importance of data validation in machine learning.
- Practical knowledge of tools and techniques for implementing data validation.
- Strategies to integrate data validation into existing machine learning workflows.
- Insights from real-world examples that attendees can apply as best practices in their own projects.

Prerequisites:

This talk is aimed at data scientists, machine learning engineers, and developers who are involved in building and maintaining machine learning models. Attendees should have a basic understanding of machine learning concepts and some experience with data processing.

Speaker Info:

Dhruv Nigam

Dhruv is a machine learning engineer who loves to build and deploy models at scale using Python. At Dream11, he leverages uplift modeling, reinforcement learning, and supervised learning to create action systems that enhance the user experience for over 100 million users. Before Dream11, Dhruv was a Director and founding Data Scientist at Protium, where he was key in scaling data science infrastructure from scratch to serve over 500k customers. He established core data engineering pipelines, data models, and deployment frameworks (GitLab CI/CD, FastAPI, EC2, MLflow) for machine learning models. He has spoken at various prestigious venues, including a sponsor talk at CODS-COMAD 2024. He has a bachelor's and a master's degree in Electrical Engineering from IIT Bombay.

Ved Prakash

Ved is a skilled ML engineer with 9+ years of experience in conceptualizing and deploying large-scale machine learning and deep learning solutions. At Dream11, he has been a key player in reengineering the core contest generation engine. He is currently engaged in building state-of-the-art deep learning models tailored for tabular data domains. Before joining Dream11, Ved led the search and personalization initiatives at Paytm, where he built and deployed cutting-edge real-time machine learning solutions for 350 million users.

Speaker Links:

Dhruv

LinkedIn - www.linkedin.com/in/dhruv-nigam-52531176.

GitHub - https://github.com/dhruvnigam93.

Twitter - https://twitter.com/druubeey.

Talk on credit risk modeling organized by Databuzz and DPhi - https://www.youtube.com/live/4acAw17khkY?si=vD-83gcY99CehXis.

Ved

https://github.com/ved93.

https://www.linkedin.com/in/vedthedataguy/.

Talk on real-time ML challenges and solutions - https://www.youtube.com/watch?v=DD5f-Gz1890.

Section:	Python in Platform Engineering and Developer Operations
Type:	Talk
Target Audience:	Intermediate
Last Updated:	31 May, 2024
