Scaling Entity Resolution with Deep Learning

Flávio Juvenal (~flavio)


Description:

The problem of finding duplicate records in one or more datasets is called Entity Resolution (ER). Python already offers great tools and libraries for Entity Resolution, like Dedupe.

But a challenge most ER tools don’t solve well is indexing / blocking records. Even if one has a perfect similarity function to classify whether a pair of records is a duplicate, the number of pairs of records grows quadratically with the number of records. Therefore, even with a great similarity function, one must reduce the total number of pairs to evaluate by using indexing / blocking techniques. Python tools for ER use either simple blocking procedures with poor recall, or complex procedures that don't scale.
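
To make the quadratic growth concrete, here is a minimal sketch that counts the unordered pairs a full comparison would have to evaluate. The function name `count_pairs` and the dataset sizes are illustrative, not taken from any library.

```python
def count_pairs(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative dataset sizes only.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {count_pairs(n):>22,} pairs")
```

Multiplying the number of records by 100 multiplies the number of pairs by roughly 10,000, which is why comparing every pair stops being practical long before the similarity function becomes the bottleneck.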

A perfect indexing / blocking function must receive records (not pairs of records) and return labels, with repeated labels only for records that are duplicates. That’s essentially a clustering function. While that’s not feasible, it is possible to train a Deep Learning model to vectorize records so that vectors from duplicate records are similar, i.e., they’re close together in some Euclidean space.

In this talk, we introduce such a blocking model with the open-source library Entity Embed, a PyTorch library for transforming records like companies, products, etc. into vectors to support scalable Entity Resolution using Approximate Nearest Neighbors (ANN). Vectorizing records enables scalable ANN search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
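
As a rough illustration of the idea, the sketch below finds candidate pairs by nearest-neighbor search over record vectors. It is not Entity Embed's API: the vectors are hand-written toy values standing in for a trained model's output, the names `candidate_pairs`, `k`, and `max_distance` are hypothetical, and the search is exact brute force where a real pipeline would use an ANN index.

```python
import math


def candidate_pairs(vectors, k=1, max_distance=0.1):
    """Pair each record with its k nearest neighbors that lie within max_distance.

    Brute-force exact search, used here only for illustration.
    """
    ids = sorted(vectors)
    pairs = set()
    for i in ids:
        # Sort all other records by Euclidean distance to record i.
        neighbors = sorted(
            (math.dist(vectors[i], vectors[j]), j) for j in ids if j != i
        )
        for distance, j in neighbors[:k]:
            if distance <= max_distance:
                pairs.add(tuple(sorted((i, j))))
    return pairs


# Toy 2-D vectors (assumed values, not produced by any model).
toy_vectors = {
    "acme_inc": (0.90, 0.10),
    "acme_incorporated": (0.88, 0.12),
    "globex": (0.10, 0.95),
    "globex_corp": (0.12, 0.90),
    "initech": (0.50, 0.50),
}

print(sorted(candidate_pairs(toy_vectors)))
```

Only records whose vectors are close become candidate pairs, so the similarity function runs on a small fraction of all possible pairs.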

Here's a tentative outline for the talk:

Introduction [2 minutes]

  • Real-world data is a mess
  • Duplicate data issues
  • Difficulty to extract knowledge

Fuzzy similarity [3 minutes]

  • Strings (Levenshtein, Jaro-Winkler, etc.)
  • Numbers
  • Addresses
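
As an example of the string similarity measures listed above, here is a minimal pure-Python sketch of Levenshtein distance, plus a normalized similarity score. The helper name `similarity` and its normalization by the longer string's length are one common convention, assumed here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("Vinta Software", "Vinta Sofware"))
```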

Indexing / Blocking [3 minutes]

  • Why it is necessary
  • Full indexing (slow!)
  • Blocking by common attribute (straightforward, not optimal)
  • Imagining a perfect blocking function
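
Blocking by a common attribute can be sketched as follows: group records by a key and only compare records inside the same group. The records, field names, and the helper `block_pairs` are hypothetical examples.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Return candidate pairs of record ids that share the same blocking key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[key(record)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: 1/2 and 3/4 are duplicates.
records = {
    1: {"name": "Acme Inc", "city": "Recife"},
    2: {"name": "ACME Incorporated", "city": "Recife"},
    3: {"name": "Globex", "city": "Sao Paulo"},
    4: {"name": "Globex Corp", "city": "São Paulo"},
}

pairs = block_pairs(records, key=lambda record: record["city"].lower())
print(sorted(pairs))
```

The pair (1, 2) is found, but (3, 4) is missed because the blocking key differs by a single accent. This is the recall problem with simple blocking: any noise in the chosen attribute hides true duplicates.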

Entity Embed: Architecture [5 minutes]

  • Token embedding
  • Field embedding
  • Tuple signature
  • Contrastive training
  • Embedding space
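
For readers unfamiliar with contrastive training, the sketch below shows a generic pairwise contrastive loss: duplicates are pulled together, non-duplicates are pushed apart until they are at least `margin` away. This is a textbook formulation assumed for illustration; it is not necessarily the loss Entity Embed uses, and the function name and `margin` value are hypothetical.

```python
import math


def contrastive_loss(u, v, is_duplicate, margin=1.0):
    """Generic pairwise contrastive loss on two embedding vectors.

    Duplicates are penalized by their squared distance; non-duplicates are
    penalized only when they are closer than `margin`.
    """
    distance = math.dist(u, v)
    if is_duplicate:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Duplicates far apart: large loss, so training pulls them together.
print(contrastive_loss((0.0, 0.0), (3.0, 4.0), is_duplicate=True))
# Non-duplicates closer than the margin: positive loss pushes them apart.
print(contrastive_loss((0.0, 0.0), (0.5, 0.0), is_duplicate=False))
# Non-duplicates already beyond the margin: no loss.
print(contrastive_loss((0.0, 0.0), (3.0, 4.0), is_duplicate=False))
```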

Entity Embed: Usage [10 minutes]

  • Preparing the data
  • Defining the fields
  • Building the model
  • Training the model
  • Finding candidate pairs

References and Questions [2 minutes]

Prerequisites:

Basic Deep Learning.
Intermediate Python.

Content URLs:

Documentation for Entity Embed: https://entity-embed.readthedocs.io/en/latest/
Entity Embed GitHub: https://github.com/vintasoftware/entity-embed

Speaker Info:

Flávio is a software engineer from Brazil and partner at Vinta Software. At Vinta, Flávio builds high-quality web products, from UX to code, using mostly React and Django. Flávio loves coffee and is always looking for good coffee beans with exotic flavor profiles.

Flávio has contributed to open-source projects related to Entity Resolution and has worked in this area on enterprise projects. He is one of the maintainers of the Entity Embed library: https://github.com/vintasoftware/entity-embed/

Speaker Links:

Flávio is experienced in presenting talks at Python conferences. From 2016 to 2020, he gave talks at PyCon US, DjangoCon US, DjangoCon Europe, PyBay, and PyGotham. Also, he presented talks related to Entity Resolution at two of those conferences. Here's the link for the latest one: https://youtu.be/eMI8lwQl3Dc

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: