Demystifying MDM & Entity Resolution using Dedupe
Master data is at the heart of an efficient and effective modern business. Master data management (MDM) is an organization's effort to create a single master reference source for all critical business data, leading to fewer errors and less redundancy in business processes.
The real challenge is that real-world data is messy, which makes it difficult to base decisions on it. Many records are duplicates or refer to the same entity, which leads to ambiguity and wasted resources.
Entity resolution (ER) is the task of disambiguating records that correspond to real-world entities across and within datasets. The problems associated with entity resolution are equally big: as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.
Dedupe is a modern Python library for entity resolution that uses machine learning algorithms to perform deduplication and record linkage.
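To make the idea of deduplication concrete, here is a minimal sketch that uses only Python's standard library. It is not the Dedupe API: the sample records, the helper names (`similarity`, `record_score`, `find_duplicates`), the equal field weights, and the 0.75 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; names and fields are illustrative only.
records = {
    1: {"name": "Acme Corporation", "city": "New Delhi"},
    2: {"name": "ACME Corp.", "city": "New Delhi"},
    3: {"name": "Globex Ltd", "city": "Mumbai"},
}

FIELDS = ("name", "city")


def similarity(a, b):
    """Return a 0..1 string similarity, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(r1, r2):
    """Average the field-level similarities (equal weights are an assumption)."""
    return sum(similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)


def find_duplicates(recs, threshold=0.75):
    """Compare every pair of records and keep pairs scoring >= threshold."""
    return [
        (i, j)
        for i, j in combinations(sorted(recs), 2)
        if record_score(recs[i], recs[j]) >= threshold
    ]


duplicate_pairs = find_duplicates(records)
print(duplicate_pairs)
```

With these sample records, the sketch flags records 1 and 2 as a candidate duplicate pair. Hand-picked weights and thresholds like these tend to be brittle, and comparing every pair grows quadratically with the number of records; this is the gap a machine learning approach to ER aims to close by learning how to weigh fields from labeled examples.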
Basic outline of the talk
What is MDM, and what are the current challenges in organizations? [7-10 minutes]
Why Entity Resolution for MDM? [5 minutes]
What is the Python Dedupe library, and how does it work and help with ER? [5-7 minutes]
ML approach to solving entity resolution [3-5 minutes]
Q&A [2-3 minutes]
- Basic knowledge of Python
- Basics of machine learning classifiers such as LR, KNN, DT, etc.
Vinay works as a Data Scientist. He loves building data-driven applications and enjoys working with messy data, cleaning it so that machine learning models can be applied in new-age applications. In his leisure time he blogs on Kanoki.org and writes articles for Data Science Central. He is an electrical engineer by training, has earned a certificate in Data Mining from the Indian Statistical Institute, and is currently pursuing his master's in Statistics.
He has previously delivered talks at PyCon, New Delhi, and at other international conferences.
Besides data, he is a passionate cyclist who rides an average of 100 km a week.