Scaling Entity Resolution with Deep Learning
Flávio Juvenal (~flavio)
The problem of finding duplicate records in one or more datasets is called Entity Resolution (ER). Python already offers great tools and libraries for Entity Resolution, like Dedupe.
But a challenge most ER tools don’t solve well is indexing / blocking records. Even with a perfect similarity function that classifies whether a pair of records is a duplicate or not, the number of pairs of records grows quadratically with the number of records. Therefore, one must find a way to reduce the total number of pairs to evaluate by using indexing / blocking techniques. Python tools for ER use either simple blocking procedures without good recall, or complex procedures that don't scale.
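To make the quadratic growth and the recall problem concrete, here is a minimal sketch in plain Python. The records, the field names, and the helper functions `total_pairs` and `block_by_attribute` are made up for illustration and do not come from any ER library.

```python
from collections import defaultdict
from itertools import combinations


def total_pairs(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def block_by_attribute(records, key):
    """Group record ids by the exact value of one field; pair ids only within a group."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[record[key]].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: 1 and 2 are duplicates, 3 and 4 are duplicates.
records = {
    1: {"name": "Globex Inc", "city": "Springfield"},
    2: {"name": "Globex Incorporated", "city": "Springfield"},
    3: {"name": "Acme Corp", "city": "Shelbyville"},
    4: {"name": "ACME Corporation", "city": "Shelbyvile"},  # typo in the blocking key
}

print(total_pairs(len(records)))            # all pairs without blocking
print(total_pairs(1_000_000))               # all pairs for one million records
print(block_by_attribute(records, "city"))  # candidate pairs after blocking
```

Blocking on `city` cuts the candidate pairs sharply, but the typo in record 4 puts it in its own block, so the duplicate pair (3, 4) is never compared. That is the recall problem of simple blocking.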
A perfect indexing / blocking function must receive records (not pairs of records) and return labels, with repeated labels only for records that are duplicates. That’s essentially a clustering function. While that’s not feasible, it is possible to train a Deep Learning model to vectorize records so that vectors from duplicate records are similar, i.e., they’re close together in some Euclidean space.
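The idea of "duplicates are close together" can be illustrated without any Deep Learning. The sketch below uses a hand-written character trigram vectorizer as a stand-in for a learned encoder. It is not the model discussed in this talk, and the function names and sample strings are assumptions made for the example.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Toy stand-in for a learned encoder: L2-normalized character trigram counts."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # text shorter than three characters
    return {gram: c / norm for gram, c in counts.items()}


def euclidean_distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


a = trigram_vector("Vinta Software")
b = trigram_vector("Vinta Software Ltd")  # likely duplicate of a
c = trigram_vector("Acme Corp")           # unrelated record

print(euclidean_distance(a, b))  # small: many shared trigrams
print(euclidean_distance(a, c))  # large: no shared trigrams
```

A fixed vectorizer like this only captures surface overlap. A trained model can instead learn which differences matter for a given dataset.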
In this talk, we introduce such a blocking model with the open-source library Entity Embed, a PyTorch library for transforming records like companies, products, etc. into vectors to support scalable Entity Resolution using Approximate Nearest Neighbors (ANN). Vectorizing records enables scalable ANN search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
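Once records are vectors, finding candidate pairs becomes a nearest-neighbor query. The sketch below uses an exact brute-force search as a stand-in for an ANN index, which would return approximately the same neighbors much faster on large datasets. It does not use Entity Embed's API. The function `candidate_pairs`, its parameters, and the 2-D vectors are hypothetical.

```python
import math


def candidate_pairs(vectors, k=1, max_distance=1.0):
    """Pair each record with its k nearest neighbors that lie within max_distance.

    Brute force: compares every record against every other one. An ANN index
    would replace the inner search to avoid this quadratic cost.
    """
    ids = sorted(vectors)
    pairs = set()
    for i in ids:
        neighbors = sorted(
            (math.dist(vectors[i], vectors[j]), j) for j in ids if j != i
        )
        for distance, j in neighbors[:k]:
            if distance <= max_distance:
                pairs.add(tuple(sorted((i, j))))
    return pairs


# Hypothetical embeddings: a/b and c/d are near-duplicates, e is isolated.
vectors = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (5.0, 5.0),
    "d": (5.0, 5.1),
    "e": (10.0, 0.0),
}

print(candidate_pairs(vectors, k=1, max_distance=1.0))
```

The distance threshold keeps isolated records such as `e` from being paired with a far-away neighbor. The surviving pairs would then go to a pairwise similarity function for the final duplicate decision.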
Here's a tentative outline for the talk:
Introduction [2 minutes]
- Real-world data is a mess
- Duplicate data issues
- Difficulty extracting knowledge
Fuzzy similarity [3 minutes]
- Strings (Levenshtein, Jaro-Winkler, etc.)
Indexing / Blocking [3 minutes]
- Why it is necessary
- Full indexing (slow!)
- Blocking by common attribute (straightforward, not optimal)
- Imagining a perfect blocking function
Entity Embed: Architecture [5 minutes]
- Token embedding
- Field embedding
- Tuple signature
- Contrastive training
- Embedding space
Entity Embed: Usage [10 minutes]
- Preparing the data
- Defining the fields
- Building the model
- Training the model
- Finding candidate pairs
References and Questions [2 minutes]
Prerequisites: basic Deep Learning.
Flávio is a software engineer from Brazil and partner at Vinta Software. At Vinta, Flávio builds high-quality web products, from UX to code, using mostly React and Django. Flávio loves coffee and is always looking for good coffee beans with exotic flavor profiles.
Flávio has contributed to open-source projects related to Entity Resolution, and has worked in this area on enterprise projects. He is one of the maintainers of the Entity Embed library: https://github.com/vintasoftware/entity-embed/
Flávio is experienced in presenting talks at Python conferences. From 2016 to 2020, he gave talks at PyCon US, DjangoCon US, DjangoCon Europe, PyBay, and PyGotham. Also, he presented talks related to Entity Resolution at two of those conferences. Here's the link for the latest one: https://youtu.be/eMI8lwQl3Dc