Fuzzy Matching - Smart Way of Finding Similar Names Using FuzzyWuzzy
Cheuk Ting Ho (~Cheukting)
Ever been in the tricky situation of knowing that some names are the same, yet matching the strings directly leads you nowhere? All you need is FuzzyWuzzy, a simple but powerful open-source Python library, and some wit. This talk will demonstrate how to fuzzy match company names efficiently.
Matching strings must be one of the first natural language processing problems people encountered once we started using computers to handle data. Unlike numerical values, which can be compared with exact logic, it is very hard for a computer to say how alike two strings are. One may compare them character by character and get an idea of how many characters in the pair of strings are the same. Unfortunately, in most applications we need the computer to perceive strings the way we do, and therefore we have to use fuzzy matching. Fuzzy matching on names is never straightforward, though: how “different” two names are really depends on the case. For example, with restaurant names, matches on words like “cafe”, “bar” and “restaurant” are considered less valuable than matches on other, less common words. Also, do we consider company names that match only partly (like “Happy Unicorn company” and “Happy Unicorn co.”) to be the same?
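To make the limits of character-by-character comparison concrete, here is a minimal sketch in plain Python. The function name `positional_match_fraction` is hypothetical and is not part of FuzzyWuzzy; it only illustrates the naive approach described above.

```python
def positional_match_fraction(a: str, b: str) -> float:
    """Fraction of positions at which both strings hold the same character.

    Hypothetical helper for illustration only: it compares a[0] with b[0],
    a[1] with b[1], and so on, and divides by the longer length.
    """
    if not a and not b:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


# Identical names score 1.0, as expected.
identical = positional_match_fraction("Happy Unicorn", "Happy Unicorn")

# One stray leading space shifts every character by one position,
# so almost nothing lines up even though a human sees the same name.
shifted = positional_match_fraction("Happy Unicorn", " Happy Unicorn")

print(identical, shifted)
```

A single inserted character is enough to make this score collapse, which is why an edit-based measure is a better starting point for names.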
In the first half of the talk, Levenshtein distance, a measure of the similarity between two strings, will be explained. Different functions in FuzzyWuzzy, such as “partial_ratio” and “token_sort_ratio”, will also be explored and compared. It is very important to understand our tools and choose the right one for the task. In the second half, we will tackle the example problem: matching company names. We will show that, besides using FuzzyWuzzy, we also have to handle problems such as finding common words and avoiding matches on them, and speeding up the matching process by grouping the names. After combining all the tricks and techniques demonstrated, we will evaluate how efficient this method is and what advantages it offers.
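The ideas named above can be sketched with the standard library alone. This is an illustration under stated assumptions, not FuzzyWuzzy's implementation and not necessarily the method used in the talk: `levenshtein`, `similarity`, `token_sort` and `group_by_first_letter` are hypothetical helpers, the 0 to 100 scaling is one possible convention, and grouping by first letter is just one simple way to reduce the number of comparisons.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Scale the distance to a 0-100 score (an assumed convention)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def token_sort(name: str, common_words=frozenset()) -> str:
    """Lowercase, drop the given common words, and sort the remaining
    tokens so that word order no longer affects the comparison."""
    tokens = [t for t in name.lower().split() if t not in common_words]
    return " ".join(sorted(tokens))


def group_by_first_letter(names):
    """Bucket names by their first character so that only names in the
    same bucket need to be compared with each other."""
    groups = defaultdict(list)
    for name in names:
        groups[name.strip().lower()[:1]].append(name)
    return dict(groups)


common = frozenset({"company", "co."})  # assumed list of common words
a = token_sort("Happy Unicorn company", common)
b = token_sort("Unicorn Happy co.", common)
print(levenshtein("kitten", "sitting"), similarity(a, b))
```

With FuzzyWuzzy installed, the corresponding library functions live in the `fuzz` module (for example `fuzz.ratio`, `fuzz.partial_ratio` and `fuzz.token_sort_ratio`); their scores are computed differently in detail, so they should not be expected to match this sketch exactly.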
This talk is for people at all levels of Python experience who would like to learn a trick or two and be able to solve similar problems in the future. The theory of how the library works will be explained, and it is easy to pick up even for beginners.
None; it's a beginner-friendly talk.
The library shown in the talk is an open-source library available on GitHub: https://github.com/seatgeek/fuzzywuzzy
Source code available on GitHub: https://github.com/Cheukting/fuzzy-match-company-name
Slides (not finalized): http://slides.com/cheukting_ho/fuzzy-matching
After spending 5 years doing research in theoretical physics at Hong Kong University of Science and Technology, Cheuk transferred her analytical and logical skills from natural science to build a career in data science. Cheuk is now a Data Scientist at Hotelbeds Groups, one of the world's biggest wholesalers in the travel business.
Cheuk constantly contributes to the community by giving AI and deep learning workshops and volunteering at Datakind for charities. She also participates in open-source projects such as Pandas, Gensim and Dateutil. Cheuk has been a guest speaker at the University of Oxford and Queen Mary University of London, and at various conferences including PyData Amsterdam, PyCon Israel and PyLondinium. Believing in gender equality, Cheuk is currently a co-organizer of the AI Club for Gender Minorities, supporting diversity and inclusion in tech.
Public speaker in technology; an example of my talks on YouTube: https://www.youtube.com/watch?v=bQ2Qu63SYHw
Slide Share: https://www.slideshare.net/CheukTingHo/presentations
Speaker Bio: https://www.papercall.io/speakers/26654/