How do we Match with Edit Distance and Cosine Similarity

EeLianChong
5 min readJul 11, 2021

--

***

And so, someone came and told me “So I have applied Levenshtein distance. It is promising. But it still does not come out as expected for closely similar addresses.”😞; and another told me, “I had applied Cosine Similarity at work. It will not work. Don’t waste time.”😠

So, I asked them, “How did you apply it? Did you copy some sample algorithm from the web? Did you attempt to read up what the algorithm does? What is the tolerance level you are looking at in terms of match rate (back to the 1010* confusion matrix)?”🤔🤔🤔 *1010 is an internal joke among some friends.

Or what if you were to use a few of these formulas and information in sync for maximal effect? Like managing a football team, how do you position and leverage on your 3 + 1 strikers in the team? and X number of midfielders? Do you field them altogether? Save some for second halves?😏 And it’s Italy vs England for Euro 2021! Woohoo!👻

Anyway, using the same Malaysian Address example again for illustration. Purely because I think

[1] business use cases in standardizing addresses, although important to some organizations at the moment, will likely see a significant decline over the next 10–20 years, because
[2] commercially, there are already ready off-the-shelves solutions to these address verification, standardization and cleansing requirement which you can make some fun quick comparison; and
[3] front- end control wise, there are also adoption of controls in place such as autocompletion of addresses based on selection on google map to minimize these address issue, as seen with MySejahtera for postcode selection, for example.

But what will likely come in the next couple of years will be the increasing use eKYC solutions for customer onboarding, with address verification being part of the integrated solution. In fact, these solutions are already in the market. 👻

***

Single versus Split String Street Address

Assuming we are looking at the expected street address string of “JALAN RAMAL 2/3”, and we are attempting to find which of the inputted address by our customers/ customer services agents matches this the most, for us to tag them to the same household.

Let’s look at applying Levenshtein similarity ratio (modified from the original edit distance formula to compute dissimilarity) on the following strings of address streets versus the truncated version of the strings.

For Single String application, we can see a list of 5 candidates that most closely matches the expected formatted address, and 2 we (Malaysians) know with eyeball level certainty are not the same.

On the other hand, Split String application tend to solve this better. Here, the string is truncated at component level and different level of dissimilarity tolerance for each of the components has been applied, where the precedence is that [1] numerical components must have a 100% match whereas [2] the other subcomponents will be dependent on the highest combinational match (and [3] must be above a select threshold of let’s say 60%).

***

Full String Street Address

Expanding to full address string will require the extraction of each of the address components of postcodes, unit numbers, street names, buildings, taman (“garden” or address2 line in typical Malaysian address) etc to name a few. But how to do this? that is beyond the scope of this article.

For illustration, let’s take the following two strings which are both the same, except one has string repetitions in it due to front end input error.

123A JALAN MERU MERU KLANG 42200 SELANGOR
123A JALAN MERU MERU KLANG 42200 SELANGOR KLANG 42200 SELANGOR

Applying the normalized frequency count for cosine similarity, we are getting a 100% match whereas Levenshtein being an edit distance for dissimilarity, returns 34% dissimilarity or 66% similarity.

Expanding this to other samples, what can we infer from the use of these similarity and dissimilarity indexes? How about other information like postcodes geographical proximity (if not numeric proximity), unit number match and street name match ratio? Below is a crude example of how this could possibly work [You will need to click to expand for better view, download or ping me for the image file].

***

But is this the best way to implement this use case? Nope, not to me, at least.

It can however, be used as a starter kit for those who would like to understand how each of these algorithms can be applied, and perhaps, in sync with one another. While it might not be necessary or even easy for us to reach the level of PhD researchers in the comprehension of each algorithm or theory [there are simply too many and will take us more than a lifetime to learn], but at the very least, we should strive to understand how it works. 💪 Blind application will simply lead to more frustrations and questions when deciphering the results; and as always, you may ping me for more information on the actual work. 👻

***

References:
HighFive picture:
https://www.nicepng.com/ourpic/u2w7r5u2y3y3i1w7_a-look-at-the-high-five-illustration/
addresses obtained are further edited to simulate some common scenarios
https://www.bestrandoms.com/random-address-in-my

--

--

No responses yet