Wednesday, November 21, 2007

Printing Neatly

Word wrap in most text editors take the greedy approach of formatting a long non-newline text that spans into a multi-line paragraph. However, we will look at something that tries to minimize the "badness" or shall we say the number of unused whitespace. Whitespace between words does not count as unused whitespace, but rather space that is used to fill the rest of the line is "unused."

Firstly, we need to consult the badness function:

c[i,j] = (SPACES_PER_LINE - (j - i + Length(word k)(i,j)))^2
Length(word k)(i,j) = SUM from k = i to j the length of the word k.

Also, the memoized version of this can be used to identify a one-state function, that depends on the number of words. Note that also, if c[i,j] is not INFINITY (the operand inside the squared is >= 0), then the total cost is just c[i,j]. Else, choose for every k, such that k = i to j-1, find the optimal solution to fit up to k words on previous lines plus the cost c[k+1,j] on this line. The solution is the minimum. So the recurrence relation is:

f(n) = MIN { c[1,n], MIN { for(k,1,n) f(k) + c[k+1,n] } }

Therefore, in code the memoized version would look like so:

#define SPACES_PER_LINE 80
#define INFINITY 1e9
uint32_t memo[N + 1];

uint32_t c(uint32_t i, uint32_t j)
{
uint32_t spaces = j - i;
uint32_t wordLength = 0;
for (int k = i; k <= j; ++k) {
wordLength += str[k].size();
}
int32_t res = SPACES_PER_LINE - (spaces + wordLength);
if (res >= 0) {
return ((uint32_t)res * res);
}
return (INFINITY);
}
uint32_t f(uint32_t n)
{
if (memo[n] != -1) {
return (memo[n]);
}
if (c(1, n) != INFINITY) {
return (memo[n] = c(1, n));
}
uint32_t min = INFINITY;
for (int k = 1; k < n; ++k) {
min = MIN(min, f(k) + c(k+1, n));
}
return (memo[n] = min);
}

This solution could also have been built up using dp.

Monday, November 19, 2007

KMP - String pattern algo

Naive approaches for string pattern matching take in the complexity order of O(N*M), such that N is the size of the string to search, and M is the size of the string of the pattern to look for.

The KMP approach is based upon a function that details previous suffix/prefix operations.
This is doable in O(M) time:

F[M]; // all zero's.
j = 0;
for (int i = 2; i <= M;) {
if (P[i-1] == P[j]) {
F[i++] = ++j;
} else if (j) {
j = F[j]; // Shift by j - i
} else {
++i;
}
}

The KMP Failure Function is probably the hardest part, building a suffix datastructure that is also a prefix. The technique is to use dp with respect to the previously calculated fail function for determining the best result. In code, the recursive function would look like this:

int failFunction(int n)
{
if (n <= 1) return 0;
int res = n - 1;
do {
res = failFunction(res);
if (str[res] == str[n-1])
return (res + 1);
} while (res)
return (0);
}

A memoized version of it goes like so:

int memoFF(int n)
{
if (n <= 1) return 0;
if (memo[n] != -1) return memo[n];
int res = n-1;
do {
res = memoFF(res);
if (str[res] == str[n-1])
return (memo[n] = res + 1);
} while (res);
return (memo[n] = 0);
}

The actual algorithm for computing the failure function is the hard part. The rest is easy:

int KMP(string S, string P)
{
int N = S.size(), M = P.size();
for (int i = 0, j = 0; i < N;) {
if (S[i] == P[j]) {
++i, ++j;
if (j == M) return 1;
} else if (j){
j = F[j];
} else {
++i;
}
}
return 0;
}