📝📓 record pattern search algorithms and introduce two solutions

Introduction

Actually, pattern searching is an important problem in computer sciemce. When we do search for a string in nptepad/word file or browser or database, pattern searching algorithms are used to show the search results. The essence is the problem of finding occurrences of pattern string within another string or body of text, there are many different algorithms for efficient searching.

1. brute force

time complexity: O(N * M) N is the len of text string, M is the len of pattern string such as : target = "ABABDABACDABABCABAB", pattern = "ABABCABAB"

pseudocode in C:

void pattern_search(char *t, char *p){
	int len_t = strlen(t);
	int len_p = strlen(p);
	bool found;
	for(int i = 0; i < len_t - len_p; i ++){
		found = true;
		for(int j = i; j < i + len_p; j ++){
			if(t[j] != p[j - i]){
				found = false;
				break;
			}
		}
		if(found)
			return i;
	}
	return -1;
}

2. Knuth-Morris-Pratt Algorithm

time complexity: is O(N + M)

the naive pattern searching/brute force algorithms doesn't work well in cases where we see many matching characters followed any mismatching character.

The KMP matching algorithm uses degenerating property(pattern having same sub-patterns more than once in the pattern)of the pattern and improves the worst case complexity to O(N).The basic idea behind KMP's algorithms is : whenever we detect a mismatch(after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the repeated characters we have known. Here, we need an auxiliary array named next to store the amount of longest common prefix and suffix.

pseudocode in C:

void prefix(char *p, int *next){
	int len_p = strlen(p);
	for(int i = 1; i < len_p; i ++){
		if(p[i] == p[next[i - 1]])
			next[i] = next[i - 1] + 1;
		else{
			if(p[i] == p[0])
				next[i] = 1;
			else
				next[i] = 0;
		}
	}
}

int pattern_search(char *t, char *p){
	int len_t = strlen(t);
	int len_p = strlen(p);
	int *next = (int *)malloc(sizeof(int));
	prefix(p, next);
	
	int i = 0;
	int j = 0;
	while(i < len_t){
		if(t[i] == p[j]){
			i ++;
			j ++;
		}
		if(j == len_p)
			return i - j;
		if(i < len_t && t[i] != p[j]){
			if(j != 0)
				j = next[j - 1];
			else
				i += 1;
		}
	}
	return -1;
}