Edit Distance with Python

EeLianChong
4 min readJul 2, 2021

--

Back in the day when I was a junior analyst (I think am still one now since I can only retire in my 60s 😛), before data science or advance analytics got popular, and before the tools and role distinction happens in most organization, we have one cute old little tool that we use, excel VBA.

So back in those days when data is not as humongous, we use this little friendly tool called Excel VBA to build and run simple haversine distance, edit distance (like Hamming and Levenshtein), jaccard and cosine similarity to name a few.

And the beauty of our roles back then is the freedom to explore, and as edit distance like Hamming distance existed since the 1950s and before, hence I have never got rejected for using similarity or even dissimilarity index haha until this year. But I won’t go deeper into the details, though it did made me question myself if I was so terrible on presentations, and on how I can better articulate my experience on using these formulas and tools? Then, I realized it probably really isn’t just me. 🤭 Having said that, continuous improvement of who we are is a necessity. 💩👻

***

Introduction

Anyway, today’s article is on a mini Edit Distance example in Python.

So, what’s an Edit Distance? Edit Distance, in the context of computational linguistic is about the quantification of how dissimilar two strings are.

So, here’s a quick example of what Levenshtein distance is in action, computational wise.

Let’s say we have two strings of words.
JALAN ABENG
JLN ABANG

*Jalan means street in Bahasa Malaysia btw. So these examples are supposed to mean Abeng Street and Abang Street.

We are able to identify
The longest string length: 11 for String 1 JALAN ABENG
The shortest string length: 9 for String 2 JLN ABANG

The differences? 3 [highlighted in red]
This means it takes 3 moves to go from one string to the other.

This gives us 0.2727 assuming we are using the formula of differences/longest_string_length*

What use case can we do with this? Plenty!!! It’s like my mini penknife as an analyst back then. But it’s also how you get creative with using this. For example, if you are applying this in address matching context, you will likely want to split “Jalan Desa 2/3” into 3 chunks of Jalan + Desa + 2/3 for optimal results when applying this, and even applying a different level of dissimilarity tolerance for each of the components.

*Note that the actual mathematics has been simplified in the context of business audience. But the details can be found online.

***

Python [Distance Library]

Now, this is how we do it in Python using the distance library. It’s a breeze 😎 versus coding it from scratch in VBA back then. The Distance library provides helpers for other distance computations as well.

***

VBA [Levenshtein Similarity]

We can also be a wee bit more creative like shifting this from dissimilarity to similarity index. Here’s an output (click to expand the image below) using my old implementation in Excel VBA. You can grab the code from me if interested. Thought Excel VBA might be of little interest to the audience these days, hence did not include it.

Likewise, you may find the implementation done in oracle function and or even bigquery from the other popular authors on Medium.

***

References:

You can read about edit distance here:
https://en.wikipedia.org/wiki/Edit_distance
https://en.wikipedia.org/wiki/Levenshtein_distance

Python Distance library:
https://pypi.org/project/Distance/

Image of Richard Hamming:
https://nps.edu/-/iconic-researcher-teacher-richard-hamming-maintains-lasting-legacy-on-campus

And this is Richard Hamming. I first learn about Hamming Distance and the series of edit distance back in university in one small sub-sub-sub section. Using his image as tribute to his work (although we are using Levenshtein as example).😛

--

--

No responses yet