Text Searching - PowerPoint PPT Presentation

About This Presentation
Title:

Text Searching

Slides: 28
Provided by: MScha

Transcript and Presenter's Notes


1
CHAPTER 9
  • Text Searching

2
Algorithm 9.1.1 Simple Text Search
This algorithm searches for an occurrence of a
pattern p in a text t. It returns the smallest
index i such that t[i..i+m-1] = p, or -1 if no
such index exists.
Input Parameters: p, t
Output Parameters: None
simple_text_search(p, t)
    m = p.length
    n = t.length
    i = 0
    while (i + m ≤ n)
        j = 0
        while (t[i + j] == p[j])
            j = j + 1
            if (j ≥ m)
                return i
        i = i + 1
    return -1
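The loop above can be sketched in Python as follows. This is a minimal illustration, not the book's code: strings stand in for the arrays p and t, and the inner loop carries an explicit bound check (j < m) so that it never reads past the end of the pattern.

```python
def simple_text_search(p: str, t: str) -> int:
    """Return the smallest i with t[i:i+m] == p, or -1 if there is none."""
    m, n = len(p), len(t)
    i = 0
    while i + m <= n:
        j = 0
        # Compare left to right and stop at the first mismatch.
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:
            return i
        i += 1
    return -1
```

In the worst case every window is compared almost completely, so the running time is O(mn).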
3
Algorithm 9.2.5 Rabin-Karp Search
This algorithm searches for an occurrence of a
pattern p in a text t. It returns the smallest
index i such that t[i..i+m-1] = p, or -1 if no
such index exists.
Input Parameters: p, t
Output Parameters: None
rabin_karp_search(p, t)
    m = p.length
    n = t.length
    q = prime number larger than m
    r = 2^(m-1) mod q
    // computation of initial remainders
    f[0] = 0
    pfinger = 0
    for j = 0 to m - 1
        f[0] = 2 * f[0] + t[j] mod q
        pfinger = 2 * pfinger + p[j] mod q
    ...
4
Algorithm 9.2.5 continued
    ...
    i = 0
    while (i + m ≤ n)
        if (f[i] == pfinger)
            if (t[i..i+m-1] == p) // this comparison takes time O(m)
                return i
        f[i + 1] = 2 * (f[i] - r * t[i]) + t[i + m] mod q
        i = i + 1
    return -1
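A possible Python rendering of the fingerprint search is sketched below. Several details are assumptions made for illustration: characters are turned into numbers with ord, the prime q defaults to 1_000_003 (the slides only ask for a prime larger than m), a single rolling value f replaces the array f[i], and the rolling update is skipped for the last window so that t[n] is never read.

```python
def rabin_karp_search(p: str, t: str, q: int = 1_000_003) -> int:
    """Return the smallest i with t[i:i+m] == p, or -1 if there is none."""
    m, n = len(p), len(t)
    if m == 0:
        return 0
    if m > n:
        return -1
    r = pow(2, m - 1, q)  # weight of the leading character, 2^(m-1) mod q
    f = 0                 # fingerprint of the current window of t
    pfinger = 0           # fingerprint of the pattern
    for j in range(m):
        f = (2 * f + ord(t[j])) % q
        pfinger = (2 * pfinger + ord(p[j])) % q
    i = 0
    while i + m <= n:
        # Equal fingerprints only suggest a match; verify it in O(m).
        if f == pfinger and t[i:i + m] == p:
            return i
        if i + m < n:
            # Drop t[i], shift by one position, append t[i + m].
            f = (2 * (f - r * ord(t[i])) + ord(t[i + m])) % q
        i += 1
    return -1
```

Python's % operator always returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction.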
5
Algorithm 9.2.8 Monte Carlo Rabin-Karp Search
This algorithm searches for occurrences of a
pattern p in a text t. It prints out a list of
indexes such that with high probability
t[i..i+m-1] = p for every index i on the list.
6
Input Parameters: p, t
Output Parameters: None
mc_rabin_karp_search(p, t)
    m = p.length
    n = t.length
    q = randomly chosen prime number less than m n^2
    r = 2^(m-1) mod q
    // computation of initial remainders
    f[0] = 0
    pfinger = 0
    for j = 0 to m - 1
        f[0] = 2 * f[0] + t[j] mod q
        pfinger = 2 * pfinger + p[j] mod q
    i = 0
    while (i + m ≤ n)
        if (f[i] == pfinger)
            println("Match at position " + i)
        f[i + 1] = 2 * (f[i] - r * t[i]) + t[i + m] mod q
        i = i + 1
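The Monte Carlo variant can be sketched in Python as shown here. The helper _is_prime, the use of ord for character values, and returning the positions as a list instead of printing them are all assumptions of this sketch. The prime is drawn by rejection sampling below m·n², with the bound raised to 3 for tiny inputs so that at least the prime 2 is available.

```python
import random


def _is_prime(x: int) -> bool:
    """Trial division; adequate for the small bounds used in this sketch."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True


def mc_rabin_karp_search(p: str, t: str) -> list:
    """Return positions whose fingerprint equals the pattern's fingerprint."""
    m, n = len(p), len(t)
    hits = []
    if m == 0 or m > n:
        return hits
    bound = max(m * n * n, 3)
    q = random.randrange(2, bound)
    while not _is_prime(q):
        q = random.randrange(2, bound)
    r = pow(2, m - 1, q)
    f = pfinger = 0
    for j in range(m):
        f = (2 * f + ord(t[j])) % q
        pfinger = (2 * pfinger + ord(p[j])) % q
    for i in range(n - m + 1):
        if f == pfinger:
            hits.append(i)  # reported without verifying t[i:i+m] == p
        if i + m < n:
            f = (2 * (f - r * ord(t[i])) + ord(t[i + m])) % q
    return hits
```

Every true occurrence is always reported, because equal strings have equal fingerprints. The only possible error is a false positive, which happens when q divides the difference of two distinct fingerprints.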
7
Algorithm 9.3.5 Knuth-Morris-Pratt Search
This algorithm searches for an occurrence of a
pattern p in a text t. It returns the smallest
index i such that t[i..i+m-1] = p, or -1 if no
such index exists.
8
Input Parameters: p, t
Output Parameters: None
knuth_morris_pratt_search(p, t)
    m = p.length
    n = t.length
    knuth_morris_pratt_shift(p, shift) // compute array shift of shifts
    i = 0
    j = 0
    while (i + m ≤ n)
        while (t[i + j] == p[j])
            j = j + 1
            if (j ≥ m)
                return i
        i = i + shift[j - 1]
        j = max(j - shift[j - 1], 0)
    return -1
9
Algorithm 9.3.8 Knuth-Morris-Pratt Shift Table
This algorithm computes the shift table for a
pattern p to be used in the Knuth-Morris-Pratt
search algorithm. The value of shift[k] is the
smallest s > 0 such that p[0..k-s] = p[s..k].

10
Input Parameter: p
Output Parameter: shift
knuth_morris_pratt_shift(p, shift)
    m = p.length
    shift[-1] = 1 // if p[0] ≠ t[i] we shift by one position
    shift[0] = 1 // p[0..-1] and p[1..0] are both the empty string
    i = 1
    j = 0
    while (i + j < m)
        if (p[i + j] == p[j])
            shift[i + j] = i
            j = j + 1
        else
            if (j == 0)
                shift[i] = i + 1
            i = i + shift[j - 1]
            j = max(j - shift[j - 1], 0)
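The shift table and the search loop that uses it can be sketched together in Python. Because the pseudocode indexes shift from -1, this sketch assumes a list sh stored with an offset of one, so that sh[k + 1] plays the role of shift[k]. The function names kmp_shift and kmp_search are shortened for illustration.

```python
def kmp_shift(p: str) -> list:
    """Return sh where sh[k + 1] is shift[k] for k = -1 .. m-1."""
    m = len(p)
    sh = [0] * (m + 1)
    sh[0] = 1                  # shift[-1]
    if m > 0:
        sh[1] = 1              # shift[0]
    i, j = 1, 0
    while i + j < m:
        if p[i + j] == p[j]:
            sh[i + j + 1] = i  # p shifted by i still agrees up to i + j
            j += 1
        else:
            if j == 0:
                sh[i + 1] = i + 1
            s = sh[j]          # shift[j - 1]
            i += s
            j = max(j - s, 0)
    return sh


def kmp_search(p: str, t: str) -> int:
    """Return the smallest i with t[i:i+m] == p, or -1 if there is none."""
    m, n = len(p), len(t)
    if m == 0:
        return 0
    sh = kmp_shift(p)
    i = j = 0
    while i + m <= n:
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:
            return i
        s = sh[j]              # shift[j - 1]; j characters matched so far
        i += s
        j = max(j - s, 0)      # characters known to match after the shift
    return -1
```

For the pattern abab the table is shift[-1..3] = 1, 1, 2, 2, 2: after matching aba and failing on the fourth character, the pattern moves by two positions and the comparison resumes at its second character.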

11
Algorithm 9.4.1 Boyer-Moore Simple Text Search
This algorithm searches for an occurrence of a
pattern p in a text t. It returns the smallest
index i such that t[i..i+m-1] = p, or -1 if no
such index exists.
Input Parameters: p, t
Output Parameters: None
boyer_moore_simple_text_search(p, t)
    m = p.length
    n = t.length
    i = 0
    while (i + m ≤ n)
        j = m - 1 // begin at the right end
        while (t[i + j] == p[j])
            j = j - 1
            if (j < 0)
                return i
        i = i + 1
    return -1
12
Algorithm 9.4.10 Boyer-Moore-Horspool Search
This algorithm searches for an occurrence of a
pattern p in a text t over alphabet Σ. It returns
the smallest index i such that t[i..i+m-1] = p,
or -1 if no such index exists.
13
Input Parameters: p, t
Output Parameters: None
boyer_moore_horspool_search(p, t)
    m = p.length
    n = t.length
    // compute the shift table
    for k = 0 to |Σ| - 1
        shift[k] = m
    for k = 0 to m - 2
        shift[p[k]] = m - 1 - k
    // search
    i = 0
    while (i + m ≤ n)
        j = m - 1
        while (t[i + j] == p[j])
            j = j - 1
            if (j < 0)
                return i
        i = i + shift[t[i + m - 1]] // shift by last letter
    return -1
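A short Python sketch of the Horspool search follows. As an assumption made for convenience, the shift table is a dictionary holding only the letters of p[0..m-2]; any other letter falls back to the default shift m, which mirrors the initialization loop over Σ.

```python
def boyer_moore_horspool_search(p: str, t: str) -> int:
    """Return the smallest i with t[i:i+m] == p, or -1 if there is none."""
    m, n = len(p), len(t)
    if m == 0:
        return 0
    # Later occurrences overwrite earlier ones, so each letter keeps the
    # distance from its rightmost occurrence in p[0..m-2] to the pattern end.
    shift = {}
    for k in range(m - 1):
        shift[p[k]] = m - 1 - k
    i = 0
    while i + m <= n:
        j = m - 1                  # begin at the right end
        while j >= 0 and t[i + j] == p[j]:
            j -= 1
        if j < 0:
            return i
        # Shift by the text letter aligned with the last pattern position.
        i += shift.get(t[i + m - 1], m)
    return -1
```

When the text letter under the last pattern position does not occur in p[0..m-2], the window jumps by the full pattern length, which is where the speed-up over the simple right-to-left search comes from.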
14
Algorithm 9.5.7 Edit-Distance
The algorithm returns the edit distance between
two words s and t.
Input Parameters: s, t
Output Parameters: None
edit_distance(s, t)
    m = s.length
    n = t.length
    for i = -1 to m - 1
        dist[i, -1] = i + 1 // initialization of column -1
    for j = 0 to n - 1
        dist[-1, j] = j + 1 // initialization of row -1
    for i = 0 to m - 1
        for j = 0 to n - 1
            if (s[i] == t[j])
                dist[i, j] = min(dist[i - 1, j - 1], dist[i - 1, j] + 1, dist[i, j - 1] + 1)
            else
                dist[i, j] = 1 + min(dist[i - 1, j - 1], dist[i - 1, j], dist[i, j - 1])
    return dist[m - 1, n - 1]
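The dynamic program can be written in Python along these lines. The only liberty taken is an index offset: since Python lists cannot start at -1, dist[i + 1][j + 1] in this sketch stands for dist[i, j], and row 0 and column 0 hold the "-1" boundary values.

```python
def edit_distance(s: str, t: str) -> int:
    """Edit distance with insertions, deletions and substitutions of cost 1."""
    m, n = len(s), len(t)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i            # column -1: delete the first i letters of s
    for j in range(n + 1):
        dist[0][j] = j            # row -1: insert the first j letters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dist[i][j] = min(dist[i - 1][j - 1],
                                 dist[i - 1][j] + 1,
                                 dist[i][j - 1] + 1)
            else:
                dist[i][j] = 1 + min(dist[i - 1][j - 1],
                                     dist[i - 1][j],
                                     dist[i][j - 1])
    return dist[m][n]
```

For example, edit_distance("kitten", "sitting") evaluates to 3: two substitutions and one insertion.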
15
Algorithm 9.5.10 Best Approximate Match
The algorithm returns the smallest edit distance
between a pattern p and a subword of a text t.
Input Parameters: p, t
Output Parameters: None
best_approximate_match(p, t)
    m = p.length
    n = t.length
    for i = -1 to m - 1
        adist[i, -1] = i + 1 // initialization of column -1
    for j = 0 to n - 1
        adist[-1, j] = 0 // initialization of row -1
    for i = 0 to m - 1
        for j = 0 to n - 1
            if (p[i] == t[j])
                adist[i, j] = min(adist[i - 1, j - 1], adist[i - 1, j] + 1, adist[i, j - 1] + 1)
            else
                adist[i, j] = 1 + min(adist[i - 1, j - 1], adist[i - 1, j], adist[i, j - 1])
    // adist[m - 1, j] covers only subwords ending at t[j],
    // so the best match is the minimum over the last row
    return min(adist[m - 1, -1], ..., adist[m - 1, n - 1])
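The approximate-match table differs from the edit-distance table only in its first row, as the following Python sketch illustrates. Two choices here are assumptions of the sketch: indices are offset by one (adist[i + 1][j + 1] stands for adist[i, j]), and the result is taken as the minimum over the whole last row, because an entry adist[m - 1, j] only accounts for subwords of t that end at position j.

```python
def best_approximate_match(p: str, t: str) -> int:
    """Smallest edit distance between p and any subword of t."""
    m, n = len(p), len(t)
    adist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        adist[i][0] = i           # column -1: p[0..i-1] against the empty word
    # Row -1 stays 0: a match may start at any position of t at no cost.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[i - 1] == t[j - 1]:
                adist[i][j] = min(adist[i - 1][j - 1],
                                  adist[i - 1][j] + 1,
                                  adist[i][j - 1] + 1)
            else:
                adist[i][j] = 1 + min(adist[i - 1][j - 1],
                                      adist[i - 1][j],
                                      adist[i][j - 1])
    # The best match may end anywhere in t, including before t[0].
    return min(adist[m])
```

With p = "a" and t = "ab" the last row is 1, 0, 1, so the result is 0; reading only the final entry would give 1.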
16
Algorithm 9.5.15 Don't-Care-Search
This algorithm searches for an occurrence of a
pattern p with don't-care symbols in a text t
over alphabet Σ. It returns the smallest index i
such that t[i + j] = p[j] or p[j] = ? for all j
with 0 ≤ j < |p|, or -1 if no such index exists.
17
Input Parameters: p, t
Output Parameters: None
don't_care_search(p, t)
    m = p.length
    k = 0
    start = 0
    for i = 0 to m
        c[i] = 0
    // compute the subpatterns of p, and store them in sub
    for i = 0 to m
        if (p[i] == ?)
            if (start != i)
                // found the end of a don't-care free subpattern
                sub[k].pattern = p[start..i - 1]
                sub[k].start = start
                k = k + 1
            start = i + 1
    ...
18
    ...
    if (start != i)
        // end of the last don't-care free subpattern
        sub[k].pattern = p[start..i - 1]
        sub[k].start = start
        k = k + 1
    P = {sub[0].pattern, . . . , sub[k - 1].pattern}
    aho_corasick(P, t)
    for each match of sub[j].pattern in t at position i
        c[i - sub[j].start] = c[i - sub[j].start] + 1
        if (c[i - sub[j].start] == k)
            return i - sub[j].start
    return -1
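The counting idea can be illustrated with the Python sketch below. It is not the slides' method in one important respect: instead of Aho-Corasick, each subpattern is located with repeated str.find calls, which is simpler but slower. The sketch also assumes that a candidate start must leave room for the whole pattern inside t, and it scans the counters at the end so that the smallest matching index is returned.

```python
def dont_care_search(p: str, t: str, wildcard: str = "?") -> int:
    """Smallest i where p matches t[i:i+m], treating wildcard as any letter."""
    m, n = len(p), len(t)
    # Split p into maximal wildcard-free subpatterns with their offsets.
    subs = []
    start = 0
    for i in range(m + 1):
        if i == m or p[i] == wildcard:
            if start != i:
                subs.append((p[start:i], start))
            start = i + 1
    if not subs:                      # p consists of wildcards only
        return 0 if m <= n else -1
    c = [0] * (n + 1)                 # c[s] counts subpatterns aligned with start s
    for pattern, offset in subs:
        pos = t.find(pattern)
        while pos != -1:
            s = pos - offset          # pattern start implied by this occurrence
            if 0 <= s <= n - m:
                c[s] += 1
            pos = t.find(pattern, pos + 1)
    for s in range(n - m + 1):
        if c[s] == len(subs):         # every subpattern lines up at start s
            return s
    return -1
```

A start position is a match exactly when all k subpatterns occur at their expected offsets from it, which is why the counter is compared with the number of subpatterns.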
19
Algorithm 9.6.5 Epsilon
This algorithm takes as input a pattern tree t.
Each node contains a field value that is either
·, |, *, or a letter from Σ. For each node, the
algorithm computes a field eps that is true if
and only if the pattern corresponding to the
subtree rooted in that node matches the empty
word.
Input Parameter: t
Output Parameters: None
epsilon(t)
    if (t.value == "·")
        t.eps = epsilon(t.left) && epsilon(t.right)
    else if (t.value == "|")
        t.eps = epsilon(t.left) || epsilon(t.right)
    else if (t.value == "*")
        t.eps = true
        epsilon(t.left) // assume only child is a left child
    else
        // leaf with letter in Σ
        t.eps = false
    return t.eps // the recursive calls above use this value
20
Algorithm 9.6.7 Initialize Candidates
This algorithm takes as input a pattern tree t.
Each node contains a field value that is either
·, |, *, or a letter from Σ and a Boolean field
eps. Each leaf also contains a Boolean field cand
(initially false) that is set to true if the leaf
belongs to the initial set of candidates.
21
Input Parameter: t
Output Parameters: None
start(t)
    if (t.value == "·")
        start(t.left)
        if (t.left.eps)
            start(t.right)
    else if (t.value == "|")
        start(t.left)
        start(t.right)
    else if (t.value == "*")
        start(t.left)
    else
        // leaf with letter in Σ
        t.cand = true
22
Algorithm 9.6.10 Match Letter
This algorithm takes as input a pattern tree t
and a letter a. It computes for each node of the
tree a Boolean field matched that is true if the
letter a successfully concludes a matching of the
pattern corresponding to that node. Furthermore,
the cand fields in the leaves are reset to false.
23
Input Parameters: t, a
Output Parameters: None
match_letter(t, a)
    if (t.value == "·")
        match_letter(t.left, a)
        t.matched = match_letter(t.right, a)
    else if (t.value == "|")
        t.matched = match_letter(t.left, a) || match_letter(t.right, a)
    else if (t.value == "*")
        t.matched = match_letter(t.left, a)
    else
        // leaf with letter in Σ
        t.matched = t.cand && (a == t.value)
        t.cand = false
    return t.matched
24
Algorithm 9.6.10 New Candidates
This algorithm takes as input a pattern tree t
that is the result of a run of match_letter, and
a Boolean value mark. It computes the new set of
candidates by setting the Boolean field cand of
the leaves.
25
Input Parameters: t, mark
Output Parameters: None
next(t, mark)
    if (t.value == "·")
        next(t.left, mark)
        if (t.left.matched)
            next(t.right, true) // candidates following a match
        else if (t.left.eps && mark)
            next(t.right, true)
        else
            next(t.right, false)
    else if (t.value == "|")
        next(t.left, mark)
        next(t.right, mark)
    else if (t.value == "*")
        if (t.matched)
            next(t.left, true) // candidates following a match
        else
            next(t.left, mark)
    else
        // leaf with letter in Σ
        t.cand = mark
26
Algorithm 9.6.15 Match
This algorithm takes as input a word w and a
pattern tree t and returns true if a prefix of w
matches the pattern described by t.
Input Parameters: w, t
Output Parameters: None
match(w, t)
    n = w.length
    epsilon(t)
    start(t)
    i = 0
    while (i < n)
        match_letter(t, w[i])
        if (t.matched)
            return true
        next(t, false)
        i = i + 1
    return false
27
Algorithm 9.6.16 Find
This algorithm takes as input a text s and a
pattern tree t and returns true if there is a
match for the pattern described by t in s.
Input Parameters: s, t
Output Parameters: None
find(s, t)
    n = s.length
    epsilon(t)
    start(t)
    i = 0
    while (i < n)
        match_letter(t, s[i])
        if (t.matched)
            return true
        next(t, true)
        i = i + 1
    return false
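The pattern-tree procedures (epsilon, start, match_letter, next, match and find) fit together as in the Python sketch below. The Node class, the helper op, the name next_candidates (chosen to avoid shadowing Python's built-in next) and the use of "." for the concatenation operator are assumptions of this sketch. Inner nodes are recognized by having a left child. Unlike a short-circuiting && or ||, the sketch always visits both children so that every node's fields are updated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: str                       # ".", "|", "*", or a letter (leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    eps: bool = False
    cand: bool = False
    matched: bool = False


def op(t: Node) -> Optional[str]:
    """Operator of an inner node, or None for a leaf."""
    return t.value if t.left is not None else None


def epsilon(t: Node) -> bool:
    if op(t) in (".", "|"):
        le, re = epsilon(t.left), epsilon(t.right)
        t.eps = (le and re) if t.value == "." else (le or re)
    elif op(t) == "*":
        epsilon(t.left)
        t.eps = True
    else:
        t.eps = False
    return t.eps


def start(t: Node) -> None:
    if op(t) == ".":
        start(t.left)
        if t.left.eps:
            start(t.right)
    elif op(t) == "|":
        start(t.left)
        start(t.right)
    elif op(t) == "*":
        start(t.left)
    else:
        t.cand = True


def match_letter(t: Node, a: str) -> bool:
    if op(t) == ".":
        match_letter(t.left, a)
        t.matched = match_letter(t.right, a)
    elif op(t) == "|":
        lm, rm = match_letter(t.left, a), match_letter(t.right, a)
        t.matched = lm or rm
    elif op(t) == "*":
        t.matched = match_letter(t.left, a)
    else:
        t.matched = t.cand and a == t.value
        t.cand = False
    return t.matched


def next_candidates(t: Node, mark: bool) -> None:
    if op(t) == ".":
        next_candidates(t.left, mark)
        next_candidates(t.right, t.left.matched or (t.left.eps and mark))
    elif op(t) == "|":
        next_candidates(t.left, mark)
        next_candidates(t.right, mark)
    elif op(t) == "*":
        next_candidates(t.left, t.matched or mark)
    else:
        t.cand = mark


def _run(w: str, t: Node, mark: bool) -> bool:
    epsilon(t)
    start(t)
    for a in w:
        match_letter(t, a)
        if t.matched:
            return True
        next_candidates(t, mark)
    return False


def match(w: str, t: Node) -> bool:
    return _run(w, t, False)         # only a prefix of w may match


def find(s: str, t: Node) -> bool:
    return _run(s, t, True)          # a match may begin at any position
```

As a usage example, the tree Node(".", Node("|", Node("a"), Node("b")), Node("c")) describes (a|b)·c: match accepts "bcx" but rejects "xbc", while find accepts "xbc". One caveat of this direct transcription: a concatenation node takes its matched flag from the right child alone, so a pattern such as a·b* is not reported as matched by the word a, even though b* can match the empty word. The cand flags also persist between runs, so it is safest to build a fresh tree for each call.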