Edit Distance Problem

finding where the pattern string P occurs as a substring of the text string T.

• Substitution – Replace a single character from pattern P with a different character in text T, such as changing “shot” to “spot.”
• Insertion – Insert a single character into pattern P to help it match text T, such as changing “ago” to “agog.”
• Deletion – Delete a single character from pattern P to help it match text T, such as changing “hour” to “our.”

Edit Distance by Recursion. We can define a recursive algorithm using the observation that the last character in the string must either be matched, substituted, inserted, or deleted.

Let i and j be the last character of the relevant prefix of P and T, let D[i, j] be the minimum number of differences between P1, P2, … , Pi and the segment of T ending at j. D[i, j] is the minimum of the three possible ways to extend smaller strings:

• If (Pi = Tj ), then D[i−1, j −1], else D[i−1, j −1]+1. This means we either match or substitute the ith and jth characters, depending upon whether the tail characters are the same.
• D[i − 1, j] + 1. This means that there is an extra character in the pattern to account for, so we pay the cost of an deletion of p[i] from pattern.
• D[i, j − 1] + 1. This means that there is an extra character in the text to remove, so we pay the cost of a insertion of t[j] into pattern.

#define MATCH 0 /* enumerated type symbol for match */
#define INSERT 1 /* enumerated type symbol for insert */
#define DELETE 2 /* enumerated type symbol for delete */
 
int string_compare(char *s, char *t, int i, int j)
{
    int k; /* counter */
    int opt[3]; /* cost of the three options */
    int lowest_cost; /* lowest cost */
 
    if (i == 0) return(j * indel(’ ’));
    if (j == 0) return(i * indel(’ ’));
 
    opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
    opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
    opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);
 
    lowest_cost = opt[MATCH];
 
    for (k=INSERT; k<=DELETE; k++)
      if (opt[k] < lowest_cost) lowest_cost = opt[k];
 
    return( lowest_cost );
}

Dynamic approach

There can only be |P| · |T| possible unique recursive calls, since there are only that many distinct (i, j) pairs to serve as the argument
parameters of recursive calls. By storing the values for each of these (i, j) pairs in a table, we just look them up as needed and avoid recomputing them.

Our dynamic programming implementation has three differences from the recursive version. First, it gets its intermediate values using table lookup instead of
recursive calls. Second, it updates the parent field of each cell, which will enable us to reconstruct the edit sequence later. Third, it is implemented using a more
general goal cell() function instead of just returning m[|P|][|T|].cost. This will enable us to apply this routine to a wider class of problems.

typedef struct {
  int cost; /* cost of reaching this cell */
  int parent; /* parent cell */
} cell;
 
cell m[MAXLEN+1][MAXLEN+1]; /* dynamic programming table */
 
int string_compare(char * s, char * t) {
    int i,
    j,
    k;/* counters */
    int opt[3];/* cost of the three options */
    for (i = 0; i < MAXLEN; i++) {
        row_init(i);
        column_init(i);
    }
    for (i = 1; i < strlen(s); i++) {
        for (j = 1; j < strlen(t); j++) {
            opt[MATCH] = m[i - 1][j - 1].cost + match(s[i], t[j]);
            opt[INSERT] = m[i][j - 1].cost + indel(t[j]);
            opt[DELETE] = m[i - 1][j].cost + indel(s[i]);
 
            m[i][j].cost = opt[MATCH];
            m[i][j].parent = MATCH;
 
            for (k = INSERT; k <= DELETE; k++)
            if (opt[k] < m[i][j].cost) {
                m[i][j].cost = opt[k];
                m[i][j].parent = k;
            }
        }
    }
    goal_cell(s, t, & i, & j);
    return (m[i][j].cost);
}
 
reconstruct_path(char * s, char * t, int i, int j) {
    if (m[i][j].parent == -1) return;
    if (m[i][j].parent == MATCH) {
        reconstruct_path(s, t, i - 1, j - 1);
        match_out(s, t, i, j);
        return;
    }
    if (m[i][j].parent == INSERT) {
        reconstruct_path(s, t, i, j - 1);
        insert_out(t, j);  // insert t[j] into pattern
        return;
    }
    if (m[i][j].parent == DELETE) {
        reconstruct_path(s, t, i - 1, j);
        delete_out(s, i);  // delete p[i]
        return;
    }
}
row_init(int i)
{
  m[0][i].cost = i;
  if (i>0)
    m[0][i].parent = INSERT;
  else
    m[0][i].parent = -1;
}
 
column_init(int i)
{
  m[i][0].cost = i;
  if (i>0)
    m[i][0].parent = DELETE;
  else
    m[i][0].parent = -1;
}
 
int match(char c, char d)
{
  if (c == d) return(0);
  else return(1);
}
 
int indel(char c)
{
  return(1);
}

cells (i, 0) and (0, i) correspond to matching length-i strings against the empty string. This requires exactly i insertions/deletions, so the definition of these functions is clear:

The function goal cell returns the indices of the cell marking the endpoint of the solution. For edit distance, this is defined by the length of the two input strings. However, other applications we will
soon encounter do not have fixed goal locations.

goal_cell(char *s, char *t, int *i, int *j)
{
  *i = strlen(s) - 1;
  *j = strlen(t) - 1;
}
insert_out(char *t, int j)
{
  printf("I");
}
 
delete_out(char *s, int i)
{
  printf("D");
}
 
match_out(char *s, char *t, int i, int j)
{
  if (s[i]==t[j]) printf("M");
  else printf("S");
}

substring best matching

Suppose that we want to find where a short pattern P best occurs within a long text T. for example, searching for “Skiena” in all its misspellings (Skienna, Skena, Skina, … ) within a long file

We want an edit distance search where the cost of starting the match is independent of the position in the text, so that a match in the middle is not prejudiced against. Now the goal state is not necessarily at the end of both strings, but the cheapest place to match the entire pattern somewhere in the text. Modifying these two functions gives us the correct solution:

row_init(int i)
{
  m[0][i].cost = 0; /* note change */
  m[0][i].parent = -1; /* note change */
}
 
goal_cell(char *s, char *t, int *i, int *j)
{
  int k; /* counter */
 
  *i = strlen(s) - 1;
  *j = 0;
 
  for (k=1; k<strlen(t); k++)
    if (m[*i][k].cost < m[*i][*j].cost) 
      *j = k;
}

variation of editing distance by a graph.

http://www.ardendertat.com/2011/10/17/programming-interview-questions-8-transform-word/#comment-1330

LCS( longest common subsequence)

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License