Exact Match String Searching: The Knuth-Morris-Pratt Algorithm

Searching for the occurence of one string within the body of a larger string is a process done millions of times, and is one of the central operations in text processing. It is performed as part of  querying a search engine and whenever you press Ctrl + F to search for a document or page for a keyword, an exact match string search takes place. Despite its ubiquity there are only a handful of algorithms to choose from, and even less which are easily understood. To compound matters further, the problem is such that the classical brute force algorithm while possible to improve on, as we will show shortly, it certainly is not easy.

The Brute Force Approach

The following algorithm will search text for the occurence of pattern in time O(MN) where M is the length of the pattern and N is the length of the text. It is simple to understand, and in many situations runs in time closer to O(N + M). The difference in running times stems from the amount of times the search finds a partial match.

bool exactMatch(string::iterator pattern_begin, string::iterator text_begin, string::iterator pattern_end, string::iterator text_end) {
    auto i = pattern_begin, j = text_begin;
    while ((i != pattern_end && j != text_end) && (*i++ == *j++)); 
    return (i == pattern_end);
}

int bruteForceStringSearch(string pattern, string text) {
    for (int i = 0; i < text.length(); i++) {
        if (exactMatch(pattern.begin(), text.begin()+i, pattern.end(), text.end()))
            return i;
    }
    return -1;
}

The reason for this inneficiency is because no matter how long the partial match was, the algorithm rewinds all the way back to the start position+1 to start over matching the pattern from the next character. Knowing this, then its safe to assume that If we could somehow avoid having to do that rewind phase, we could greatly improve our running time. This is the line of thought which drove several separate researchers to arrive at generally the same solution, despite different formulations of the original problem. That solution is what has come to be known as the Knuth-Morris-Prratt (KMP) string searching algorithm.

The Knuth-Morris-Pratt Algorithm

The KMP algorithm allows for string matching without backtracking by incorporating a pre-processing step into the algorithm. This processing constructs a DFA called the "failure function". The failure function is an array which based on the current position in the pattern and the character being examing will tell us which state to go to next to continue searching.

class KMP {
    private:
        string pattern;
        vector<int> next;
    public:
        KMP(const string& pat) {
            pattern = pat;
            next = vector<int>(pattern.length(), 0);
            next[0] = -1;
            for (int i = 0, j = -1; i < pattern.length(); i++, j++, next[i] = j)
                while ((j >= 0) && (pattern[i] != pattern[j]))
                    j = next[j];
        }
        int match(string text) {
            int i, j;
            for (i = 0, j = 0; i < text.length() && j < pattern.length(); i++, j++) 
                while ((j >= 0) && (text[i] != pattern[j])) 
                    j = next[j];
                
            return (j == pattern.length()) ? i-pattern.length():-1;
        }
};


Leave A Comment