Data in the real world is messy. Dealing with messy data sets is painful and burns through time that could be spent analyzing the data itself. By borrowing an approach from the field of Natural Language Processing, we can clean and match large data sets far more efficiently.
This article focuses on ‘fuzzy’ matching and how it can help automate a significant, recurring challenge in a large number of data science workflows.
The Business Problem
Identity Resolution is about matching identities across disparate data sources. Let’s use an example: say you are a pharmaceutical company. You attended an event and gathered a list of doctors in an Excel spreadsheet. Now you want to match that list against your internal database to identify each doctor’s corresponding ID, address, and other information, and to determine whether or not you can follow up with that doctor. There are several ways to do this matching, and we will discuss one approach below.
How will we know if we’ve achieved our goal? Our metrics of success are a match rate of at least 30% with an algorithm that runs in under three hours. Why does the three-hour limit matter? Selecting the right algorithm is critical because we want to ensure computational efficiency and be able to scale the approach to much larger data sets.
The Methodology for Matching
The process we will follow is:
- Deduplicate, preprocess, or clean the text data
- Transform word data into numeric data (TFIDF)
- Calculate distances between each row (Cosine similarity)
- Choose the best match based on the smallest distance (i.e. the highest similarity score)
Note: all of these methods fall under the area of Natural Language Processing (NLP), a specialization of machine learning and text mining used to solve problems related to natural language.
Step 1: Preprocess Text
Preprocessing means doing the following (a short code sketch follows below):
- Deduplicate the database as required
- Lowercase
- Remove symbols such as $ @ # % ^ & *, etc.
- Address any leading or trailing white space
- Normalize abbreviations, e.g. convert ‘St’ -> ‘Street’, etc.
Note: In certain cases, we may also have to deal with Unicode characters, emoji, and other symbols.
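As a rough sketch of what these preprocessing steps might look like in code (assuming pandas and Python’s re module; the column names, sample rows, and the ‘St’ -> ‘Street’ rule are illustrative only):

```python
import re
import pandas as pd

def clean_text(value: str) -> str:
    value = value.lower()                          # lowercase
    value = re.sub(r"[^a-z0-9\s]", " ", value)     # remove symbols such as $ @ # % ^ & *
    value = re.sub(r"\bst\b", "street", value)     # normalize abbreviations, e.g. St -> Street
    return re.sub(r"\s+", " ", value).strip()      # collapse and trim whitespace

# Toy data: two rows that become identical once cleaned
df = pd.DataFrame({
    "doctor":  ["Dr. Jeffrey Smith ", "dr jeffrey smith"],
    "address": ["123 Main St.",       "123 Main Street"],
})

df = df.apply(lambda col: col.map(clean_text))     # clean every cell
df = df.drop_duplicates()                          # deduplicate as required
print(df)
```

After cleaning, the two sample rows collapse into the same record, which is why deduplication is usually easiest after normalization.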
Step 2: TFIDF Transformation aka “Bag of Words Model”
In information retrieval, tf–idf or TFIDF (short for Term Frequency–Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling.
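As a minimal, illustrative sketch of TFIDF in action (assuming scikit-learn; the toy documents below are made up for demonstration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the doctor treats the patient",
    "the patient visits the doctor",
    "clinical trial data analysis",
]

vectorizer = TfidfVectorizer()                 # tokenizes into whole words by default
tfidf = vectorizer.fit_transform(docs)         # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(tfidf.toarray().round(2))                # TFIDF weight of each word in each document
```

Words that appear in many documents (e.g. “the”) receive lower weights than words that are distinctive to a single document.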
In the previous example, the ngrams (or chunks) are built from whole words.
We can apply the same technique to sets of characters.
For example, we could split each word into 3 character ngrams:
“Jeffrey” -> “ Je”, “Jef”, “eff”, “ffr”, “fre”, “rey”, “ey “
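A small sketch of the same idea with character 3-grams (assuming scikit-learn; the names are fabricated). The analyzer="char_wb" option pads each word with a space, which reproduces the chunks shown above for “Jeffrey” (lowercased by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["Jeffrey Smith", "Jeff Smith", "Jeffery Smyth"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)        # one row of 3-gram weights per name

print(vectorizer.get_feature_names_out())      # e.g. ' je', 'jef', 'eff', 'ffr', ...
```

Character n-grams make the representation robust to small spelling differences: “Jeffrey” and “Jeffery” still share several of their 3-grams, whereas an exact string comparison would treat them as entirely different.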
Step 3: Calculate Distance
After splitting all of the words and building our corpus of character chunks, we need a way to measure the similarity between each pair of records. Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between the two vectors and determines whether they are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
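A minimal sketch of this step (assuming scikit-learn; the event list and database names below are fabricated stand-ins for the real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

event_names = ["jeffrey smith ny", "maria garcia ca"]              # captured at the event
db_names    = ["jeff smith ny", "marie garcia ca", "john doe tx"]  # internal database

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectorizer.fit(event_names + db_names)         # shared vocabulary of character 3-grams

scores = cosine_similarity(
    vectorizer.transform(event_names),         # one row per event record
    vectorizer.transform(db_names),            # one column per database record
)
print(scores.round(2))                         # values in [0, 1]; 1.0 = identical text
```

For databases with hundreds of thousands of records, the full score matrix gets large, so in practice the comparison is often done in blocks or with sparse matrix multiplication.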
Step 4: Choose the best match
Why cosine similarity? Why not Euclidean distance?
Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter, which is often the case when working with text data represented by word counts. We could assume that when a word (e.g. “science”) occurs more frequently in document 1 than it does in document 2, document 1 is more related to the topic of science. However, it could also be that document 1 is simply longer; cosine similarity corrects for this by comparing the direction of the vectors rather than their magnitude.
Text data is the most typical use case for this metric.
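A tiny illustration of the point about magnitude (assuming NumPy; the vectors are toy word counts):

```python
import numpy as np
from numpy.linalg import norm

doc1 = np.array([2.0, 1.0])    # word counts in a short document
doc2 = np.array([20.0, 10.0])  # the same proportions in a much longer document

cosine = doc1 @ doc2 / (norm(doc1) * norm(doc2))
euclidean = norm(doc1 - doc2)

print(round(cosine, 4))     # 1.0  -> identical direction, i.e. the "same topic"
print(round(euclidean, 2))  # ~20  -> penalizes the difference in length
```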
Going back to our doctor-matching example, here is the output of the matching algorithm with some fake data. The yellow highlighted area represents the data we captured from the medical event in the field.
Let’s say 3,000 doctors attended the event, so we will see 3,000 rows in the spreadsheet below. In our internal database, we have 500,000 doctors to try to match against. For simplicity, let’s say we only captured first name, last name, and state at the event. Immediately to the right of those three columns is the highest score returned after calculating scores against all 500,000 doctors. The subsequent columns on the right show the best-matched doctor per our analysis.
After sorting the scores, we can use manual validation to estimate the accuracy of our matching algorithm and to set a threshold for confirmed matches.
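A minimal sketch of this final step, reusing the scores, event_names, and db_names from the earlier sketch (the 0.8 cutoff is an illustrative value to be tuned against manually validated samples, not a recommendation):

```python
best_idx   = scores.argmax(axis=1)             # index of the highest-scoring database record
best_score = scores.max(axis=1)                # the corresponding cosine score
THRESHOLD  = 0.8                               # confirmed-match cutoff, set via manual validation

for name, idx, score in zip(event_names, best_idx, best_score):
    status = "confirmed" if score >= THRESHOLD else "needs review"
    print(f"{name} -> {db_names[idx]} (score={score:.2f}, {status})")
```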
Depending on our client’s database/cloud framework, we can deploy the solution described above in their environment and expect similar results. Compared with other approaches, this algorithm tends to perform both effectively and efficiently.
In summary, TFIDF combined with cosine similarity can be a highly effective and performant way of cleaning, deduplicating, and matching data when dealing with larger record counts.