String Matching

Detecting the occurrence of a particular substring (the pattern) in another string (the text)

A Straightforward Solution

The Knuth-Morris-Pratt Algorithm

The Boyer-Moore Algorithm

Straightforward Solution

Algorithm: Simple string matching

Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.

Output: The return value is the index in T where a copy of P begins, or -1 if no match for P is found.

int simpleScan(char[] P, char[] T, int m) {
    int match; // Value to return.
    int i, j, k;
    match = -1;
    j = 1; k = 1; i = j;
    while (endText(T, j) == false) {
        if (k > m) {
            match = i; // Match found.
            break;
        }
        if (t_j == p_k) {
            j++; k++;
        } else {
            // Back up over matched characters.
            int backup = k - 1;
            j = j - backup;
            k = k - backup;
            // Slide pattern forward, start over.
            j++; i = j;
        }
    }
    return match;
}
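The pseudocode above uses 1-based indexes and an unspecified endText helper. As a rough illustration, the same idea might look as follows in plain Java with 0-based arrays. The class name, the parameter names, and the use of the array length in place of endText are assumptions made for this sketch, not part of the original slides. The pattern is assumed to be nonempty, as in the algorithm's input specification.

```java
// A 0-indexed sketch of the simple scan; the class name is illustrative only.
final class SimpleScanSketch {

    /** Returns the first index in text where pattern begins, or -1 if there is none. */
    static int simpleScan(char[] pattern, char[] text) {
        int m = pattern.length; // assumed to be at least 1
        int n = text.length;
        int i = 0; // start of the current candidate match in text
        int j = 0; // current position in text
        int k = 0; // current position in pattern
        while (j < n) {
            if (text[j] == pattern[k]) {
                j++;
                k++;
                if (k == m) {
                    return i; // every pattern character matched
                }
            } else {
                // Back up over the k matched characters, then slide forward by one.
                j = j - k + 1;
                k = 0;
                i = j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] text = "ABABABCB".toCharArray();
        char[] pattern = "ABABCB".toCharArray();
        System.out.println(simpleScan(pattern, text));
    }
}
```

In this sketch the completion test sits right after the advance, so a match that ends on the last text character is still reported before the loop condition is checked again.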

Analysis

Worst-case complexity is in Θ(mn), where n is the length of T.

The scan needs to back up in the text after a mismatch.

Works quite well on average for natural language.
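To make the worst case concrete, the sketch below counts character comparisons for a text made only of 'a' characters and a pattern of 'a' characters ending in 'b'. Almost every starting position then matches m-1 characters before failing, which is the behavior behind the Θ(mn) bound. The class and method names and the chosen input are assumptions for illustration only.

```java
import java.util.Arrays;

// Counts comparisons made by a 0-indexed simple scan; names are illustrative only.
final class WorstCaseCount {

    /** Runs the simple scan and returns how many character comparisons it made. */
    static long countComparisons(char[] pattern, char[] text) {
        int m = pattern.length; // assumed to be at least 1
        int n = text.length;
        long comparisons = 0;
        int j = 0; // current position in text
        int k = 0; // current position in pattern
        while (j < n) {
            comparisons++;
            if (text[j] == pattern[k]) {
                j++;
                k++;
                if (k == m) {
                    break; // match found, stop counting
                }
            } else {
                j = j - k + 1; // back up, then slide forward by one
                k = 0;
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int n = 1000;
        int m = 10;
        char[] text = new char[n];
        Arrays.fill(text, 'a');
        char[] pattern = new char[m];
        Arrays.fill(pattern, 'a');
        pattern[m - 1] = 'b'; // forces a mismatch on the last pattern character
        System.out.println(countComparisons(pattern, text)
                + " comparisons, m*n = " + (long) m * n);
    }
}
```

On this input each of the n-m+1 full-length starting positions costs m comparisons, so the count is at least (n-m+1)m and at most mn.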

The Knuth-Morris-Pratt Algorithm

Pattern Matching with Finite Automata

e.g. P = “AABC”

The Knuth-Morris-Pratt Flowchart

Character labels are inside the nodes

Each node has two arrows out to other nodes: a success link and a fail link.

The next character is read only after a success link.

A special node, node 0, called “get next char”, reads in the next text character.

e.g. P = “ABABCB”
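One way to read the flowchart description is as a scanning loop that keeps a current node and a current text character. The sketch below is a hedged illustration of that reading, not the deck's own scan algorithm. The class and method names are assumptions, the text is 0-indexed, and the fail links for "ABABCB" are written out by hand rather than computed.

```java
// Walks a KMP-style flowchart; all names here are illustrative assumptions.
final class FlowchartScanSketch {

    /**
     * Walks the flowchart for pattern p over text t.
     * fail[k] is the fail link of node k for k = 1..m; index 0 is unused.
     * Returns the 0-based index in t where a copy of p begins, or -1.
     */
    static int scan(char[] p, char[] t, int[] fail) {
        int m = p.length;
        int j = 0; // index of the current text character
        int k = 1; // current node; node k is labeled with p[k - 1]
        while (j < t.length) {
            if (k == 0) {
                j++;   // node 0, "get next char": read the next text character
                k = 1; // then continue at the first pattern node
            } else if (t[j] == p[k - 1]) {
                j++;   // success link: read the next character
                k++;
                if (k > m) {
                    return j - m; // passed the last node: match found
                }
            } else {
                k = fail[k]; // fail link: keep the same text character
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hand-derived fail links for "ABABCB"; fail[1] = 0 leads to node 0.
        int[] fail = {0, 0, 1, 1, 2, 3, 1};
        char[] pattern = "ABABCB".toCharArray();
        char[] text = "ABABABCB".toCharArray();
        System.out.println(scan(pattern, text, fail));
    }
}
```

Note that the text index j never decreases in this loop, in contrast to the backing up done by the simple scan.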

Construction of the KMP Flowchart

Definition: Fail links

We define fail[k] as the largest r (with r < k) such that $p_1 \cdots p_{r-1}$ matches $p_{k-r+1} \cdots p_{k-1}$. That is, the (r-1)-character prefix of P is identical to the (r-1)-character substring ending at index k-1. Thus the fail links are determined by repetition within P itself.
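The definition can be checked directly, if inefficiently, by trying every candidate r from largest to smallest. The sketch below does exactly that and is meant only to illustrate the definition, not the efficient construction algorithm. The names are assumptions, and fail[1] = 0 is taken as a convention so that node 1 falls back to node 0.

```java
import java.util.Arrays;

// Brute-force fail links straight from the definition; names are illustrative only.
final class FailLinksByDefinition {

    /** Returns fail[1..m] (index 0 unused), computed directly from the definition. */
    static int[] failLinks(char[] p) {
        int m = p.length;
        int[] fail = new int[m + 1];
        if (m >= 1) {
            fail[1] = 0; // assumed convention: node 1 falls back to node 0
        }
        for (int k = 2; k <= m; k++) {
            for (int r = k - 1; r >= 1; r--) { // try the largest r first
                if (prefixMatches(p, r, k)) {
                    fail[k] = r;
                    break; // r = 1 always matches, so the loop always ends here
                }
            }
        }
        return fail;
    }

    /** True if p_1..p_{r-1} equals p_{k-r+1}..p_{k-1}, using 1-based subscripts. */
    private static boolean prefixMatches(char[] p, int r, int k) {
        for (int t = 1; t <= r - 1; t++) {
            // 1-based p_t is p[t - 1]; 1-based p_{k-r+t} is p[k - r + t - 1].
            if (p[t - 1] != p[k - r + t - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failLinks("ABABCB".toCharArray())));
    }
}
```

For P = "ABABCB" this gives fail[1..6] = 0, 1, 1, 2, 3, 1. For example, fail[5] = 3 because the two-character prefix "AB" reappears as the two characters ending at index 4.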

Algorithm: KMP flowchart construction

Input: P, a string of characters; m, the length of P.

Output: fail, the array of failure links, defined for indexes 1, ..., m. The array is passed in and the algorithm fills it.