ANLP(2) — Minimum Edit Distance

vamshi krishna
4 min readNov 30, 2021

An Algorithmic approach to NLP

Edit Distance

A considerable part of NLP involves measuring how similar two strings are. In applications such as spelling correction, suppose a user enters cafeine, the machine is expected to suggest the word most similar to it. In this case, it is caffeine which is only one letter more than the typed word.

Edit Distance is a quantity that signifies how similar two strings are. Edit distance is the number of editing operations like insertion, deletion, and substitution required to transform one string into another. This brings us to the question, what is the Minimum Edit distance between two strings? It is essential to know this because edit distance in itself does not directly represent string similarity. You can unnecessarily edit a string multiple times even when you can reach the required string in a few edits. For example, the minimum edit distance between the words “Sunday” and “Saturday” is 3, as it needs one substitution and two inserts as shown below:

However, the alignment of strings is also an important aspect in finding minimum edit distance. Consider two words “snowy” and “sunny.” If we were to find the edit distance by comparing every letter from the end, the edit distance between them would be 5(three deletions, one substitution, and one insert). However, if we aligned the words from their starting letter, the edit distance would become three, as shown below:

The Algorithm: Dynamic Programming

If we were to perform this algorithm naively, it would take exponential time since the space of all possible edits is huge. However, because several different edit paths will lead to the same state (string), rather than recalculating all of them, we could just memorize the shortest path to a state each time we saw it. We can do this by using Dynamic Programming. Bellman (1957) originated the term “dynamic programming” to describe a set of algorithms that use a table-driven method to solve issues by combining answers to sub-problems.

Let’s start by defining the shortest edit distance between two strings. We’ll define D[i, j] as the edit distance between X[1..i] and Y[1.. j], i.e. the first I characters of X and the first j characters of Y, given two strings, source string X of length n and target string Y of length m. D[n,m] is the edit distance between X and Y.

We’ll utilize dynamic programming to calculate D[n,m] from the bottom up, merging subproblem solutions. Going from i characters to 0 involves i deletes in the base case, with a source substring of length i but an empty target string. Going from 0 to j characters requires j inserts with a target substring of length j and an empty source. We compute larger D[i, j] based on previously computed smaller values after computing D[i, j] for tiny i j.

Each of these operations can also be assigned a certain cost or weight. The Levenshtein distance between two sequences is the simplest weighting factor in which each of the three actions costs one (Levenshtein, 1966) — we assume that substituting a letter for itself, such as t for t, has no cost. Between purpose and execution, the Levenshtein distance is 5. In addition, Levenshtein provided an alternative version of his measure in which each insertion or deletion costs one and replacements are not permitted. (Since any replacement can be represented by one insertion and one deletion, this is equal to allowing substitution but assigning a cost of two to each substitution.) The Levenshtein distance between intention and execution is 8 in this version. If we were to calculate the Levenshtein distance, the algorithm would become:

The pseudocode for the algorithm is as follows:

The below Dynamic Programming table shows the calculation of Levenshtein distance between “kitten” and “sitting” using the DP approach:

The time complexity of this algorithm is O(mn) where m and n are the input sizes of strings. This is because of the 2D array of (m x n) dimensions.

We see how Dynamic Programming helps us in calculating the minimum edit distance, which is a fundamental algorithm in NLP. Minimum edit distance is directly helpful in algorithms such as spelling correction, but the alignment provided by the algorithm(can be achieved through backtracing) can also help in speech processing, machine translation, and calculating word error rate. Another similar Dynamic programming algorithm is the The Viterbi Algorithm ), which is a probabilistic extension of minimum edit distance.

--

--

vamshi krishna

I'm a research student at IIIT Hyderabad currently studying Computer Science and Computational Linguistics.