Monday, November 19, 2007

KMP - String pattern algo

Naive approaches for string pattern matching take in the complexity order of O(N*M), such that N is the size of the string to search, and M is the size of the string of the pattern to look for.

The KMP approach is based upon a function that details previous suffix/prefix operations.
This is doable in O(M) time:

F[M]; // all zero's.
j = 0;
for (int i = 2; i <= M;) {
if (P[i-1] == P[j]) {
F[i++] = ++j;
} else if (j) {
j = F[j]; // Shift by j - i
} else {
++i;
}
}

The KMP Failure Function is probably the hardest part, building a suffix datastructure that is also a prefix. The technique is to use dp with respect to the previously calculated fail function for determining the best result. In code, the recursive function would look like this:

int failFunction(int n)
{
if (n <= 1) return 0;
int res = n - 1;
do {
res = failFunction(res);
if (str[res] == str[n-1])
return (res + 1);
} while (res)
return (0);
}

A memoized version of it goes like so:

int memoFF(int n)
{
if (n <= 1) return 0;
if (memo[n] != -1) return memo[n];
int res = n-1;
do {
res = memoFF(res);
if (str[res] == str[n-1])
return (memo[n] = res + 1);
} while (res);
return (memo[n] = 0);
}

The actual algorithm for computing the failure function is the hard part. The rest is easy:

int KMP(string S, string P)
{
int N = S.size(), M = P.size();
for (int i = 0, j = 0; i < N;) {
if (S[i] == P[j]) {
++i, ++j;
if (j == M) return 1;
} else if (j){
j = F[j];
} else {
++i;
}
}
return 0;
}

No comments: