Exact Set Matching

Provided by: nathanjoh
Transcript and Presenter's Notes

Title: Exact Set Matching


1
Exact Set Matching
  • Lecture 8 September 27, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Exact set matching
  • Exact set matching problem
  • set of patterns P = {P1, P2, ..., Pz}
  • total length n
  • text T
  • length m (as before)
  • Use Boyer-Moore once per pattern
  • O( |P1| + m + ... + |Pz| + m ) = O( n + zm )
  • Not bad, but we can do better!

3
Exact set matching
  • Aho-Corasick
  • O( n + m + k ) time, where k is the number of occurrences
  • Not the only way to achieve this bound, but the
    others preprocess T
  • Generalize Knuth-Morris-Pratt
  • Keyword tree / Trie

4
Keyword Tree
  • Rooted directed tree
  • exactly one character on each edge
  • any two edges from the same node have distinct
    labels
  • every pattern corresponds to a path from the root
    to some node
  • every leaf corresponds to some pattern

5
Keyword Tree
  • P = {potato, pottery, poetry, science, school}
  • [Figure: keyword tree for P; the nodes where patterns end are
    numbered 1 to 5]

6
Keyword Tree
  • Construction in O(n) time
  • Iteratively add the patterns
  • for each pattern, start at the root and follow edges
    until either the pattern is exhausted or the path ends
    at a node v; in the latter case, create a new path out
    of v for the remaining characters
  • Constant work per pattern character
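
The construction above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names `Node`, `build_keyword_tree` and `count_nodes` are assumptions, and "constant work per character" relies on dictionary lookups taking expected constant time.

```python
class Node:
    """One keyword-tree node; `children` maps an edge character to a child."""
    def __init__(self):
        self.children = {}
        self.pattern = None  # 1-based index i if pattern Pi ends at this node


def build_keyword_tree(patterns):
    """Insert the patterns one at a time, doing constant work per character."""
    root = Node()
    for i, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Follow the existing edge if there is one, otherwise branch off.
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern = i  # number the node where the pattern ends
    return root


def count_nodes(node):
    """Count the nodes in the subtree rooted at `node` (illustration only)."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = build_keyword_tree(["potato", "pottery", "poetry", "science", "school"])
```

Under these assumptions, adding a pattern such as poet, which is a prefix of an existing pattern, creates no new nodes and only numbers an existing internal node.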

7
Keyword Tree
  • P = {potato, pottery, poetry, science, school}
  • Add
  • poet
  • [Figure: keyword tree for P after adding poet; the nodes where
    patterns end are numbered 1 to 6]

8
Keyword Tree
  • P = {potato, pottery, poetry, science, school, poet}
  • Add
  • potent
  • [Figure: keyword tree for P after adding potent; the nodes where
    patterns end are numbered 1 to 7]

9
Naïve algorithm
  • At each position l of T,
  • Follow edges from the root, matching T from position l
    onward, until a leaf is reached or no edge matches.
  • Any numbered node on the path from the root
    represents an occurrence of a pattern from P
    starting at l
  • Running time O( |Pmax| m ) = O( n m ), where Pmax is the
    longest pattern.
  • Solves dictionary problem effectively!
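
A minimal sketch of the naïve search, assuming a nested-dictionary keyword tree. The names `build_trie` and `naive_search`, and the use of the key `None` to store a pattern number, are illustrative choices rather than anything from the lecture.

```python
def build_trie(patterns):
    """Nested-dict keyword tree; the key None holds the pattern number."""
    root = {}
    for i, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = i
    return root


def naive_search(text, patterns):
    """Return (l, i) pairs: pattern Pi occurs in text starting at l (1-based)."""
    root = build_trie(patterns)
    hits = []
    for start in range(len(text)):
        node, pos = root, start
        # Walk down from the root for as long as the text keeps matching.
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if None in node:  # a numbered node: a pattern ends here
                hits.append((start + 1, node[None]))
    return hits
```

Each start position restarts at the root, which is where the O( |Pmax| m ) bound comes from.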

10
Speedup of search
  • Generalize Knuth-Morris-Pratt
  • Shift pattern in jumps
  • Avoid recomputing matches
  • Need analogue to sp_i and failure functions
  • For now, assume that no pattern in P is a
    substring (proper or not) of another pattern in
    P

11
Label of node v
  • Definition
  • Each node v is labeled with the string L(v) of
    characters from the root to v
  • Definition
  • lp(v) is the length of the longest proper
    suffix of L(v) that is a prefix of some pattern
    in P.
  • Lemma
  • If α is the lp(v)-length suffix of L(v), then
    there is a unique node u s.t. L(u) = α

12
Failure link
  • Definition: Let n_v be the node labeled with α
    (the longest proper suffix of L(v) that is a prefix
    of some pattern of P). If lp(v) = 0, then n_v is the root.
  • Definition: The edge (v, n_v) is called the
    failure link of v.
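
The definitions of lp(v) and n_v can be checked by brute force directly on node labels. The sketch below is an illustration only: the function names `lp` and `failure_label` are assumptions, a node is identified by its label L(v), and the empty string stands for the root. It is far slower than the linear-time construction and is meant only to make the definitions concrete.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of label that is a prefix of a pattern."""
    for length in range(len(label) - 1, 0, -1):
        suffix = label[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return length
    return 0


def failure_label(label, patterns):
    """L(n_v) for the node v with L(v) = label; "" denotes the root."""
    k = lp(label, patterns)
    return label[len(label) - k:]


P = ["potato", "tattoo", "theater", "other"]
print(failure_label("potat", P))  # the node "potat" fails to the node "tat"
print(failure_label("othe", P))   # the node "othe" fails to the node "the"
```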

13
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

14
Aho-Corasick
  • c = 1, w = root
  • repeat
  •   while there is an edge (w, w') labeled T(c)
  •     if w' holds pattern i, report that Pi occurs in T
        ending at c
  •     w = w', c = c + 1
  •   w = n_w (if w is the root, set c = c + 1 instead)
  • until c > m
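
The search loop can be sketched in Python under the no-substring assumption made earlier. The list-based representation (`edges`, `fail`, `number`) and the names `build` and `search` are assumptions for illustration; `build` computes failure links breadth-first so that the block is self-contained.

```python
from collections import deque


def build(patterns):
    """Keyword tree as parallel lists, with failure links; node 0 is the root."""
    edges, fail, number = [{}], [0], [None]
    for i, pattern in enumerate(patterns, start=1):
        w = 0
        for ch in pattern:
            if ch not in edges[w]:
                edges.append({})
                fail.append(0)
                number.append(None)
                edges[w][ch] = len(edges) - 1
            w = edges[w][ch]
        number[w] = i
    queue = deque(edges[0].values())  # depth-1 nodes fail to the root
    while queue:
        parent = queue.popleft()
        for ch, v in edges[parent].items():
            w = fail[parent]
            while ch not in edges[w] and w != 0:
                w = fail[w]
            fail[v] = edges[w].get(ch, 0)
            queue.append(v)
    return edges, fail, number


def search(text, patterns):
    """Return (c, i) pairs: Pi occurs in text ending at position c (1-based)."""
    edges, fail, number = build(patterns)
    hits, w, c, m = [], 0, 0, len(text)  # c is a 0-based index into text
    while c < m:
        while c < m and text[c] in edges[w]:
            w = edges[w][text[c]]
            c += 1
            if number[w] is not None:
                hits.append((c, number[w]))  # c now equals the 1-based end
        if w == 0:
            c += 1       # no edge out of the root matches: skip the character
        else:
            w = fail[w]  # mismatch below the root: follow the failure link
    return hits
```

For example, `search("potatother", ["potato", "tattoo", "theater", "other"])` finds potato ending at position 6 and, after following the failure link from the potato node to the node labeled o, other ending at position 10.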

15
Generalization of K-M-P
  • Failure links jump back in patterns, just as
    failure function did for K-M-P
  • Note that we generalize sp_i rather than sp'_i,
    and F(i) rather than F'(i)
  • Shift doesn't miss any patterns from P
  • O(m) time to search, like K-M-P
  • Each character matched at most once
  • On mismatch, we jump backwards towards root so
    we can't mismatch too many times.

16
Failure link algorithm
  • Failure links can be computed in O(n) time
  • Iterative: given n_v for all nodes v with labels
    of length at most l, compute n_w for all nodes w
    with labels of length l + 1.
  • Failure links for root and nodes with length 1
    labels point to the root.

17
Failure link algorithm
  • v' is the parent of v, x is the character on edge
    (v', v), w = n_v'
  • while no edge out of w has label x and w ≠ root:
    w = n_w
  • if an edge (w, w') has label x
  •   n_v = w'
  • else
  •   n_v = root
  • Running time proof, and correctness, generalize
    K-M-P.
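
A minimal Python sketch of this computation, processing nodes breadth-first so that all labels of length l are handled before those of length l + 1. The names `Node`, `build_with_failure_links` and `find` are illustrative assumptions, and the root's failure link is set to the root itself for convenience.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # edge character -> child node
        self.fail = None    # n_v, filled in below


def build_with_failure_links(patterns):
    root = Node()
    for pattern in patterns:
        v = root
        for ch in pattern:
            if ch not in v.children:
                v.children[ch] = Node()
            v = v.children[ch]
    # The root and the nodes with labels of length 1 point to the root.
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        parent = queue.popleft()               # v' on the slide
        for x, v in parent.children.items():
            w = parent.fail                    # w = n_v'
            while x not in w.children and w is not root:
                w = w.fail                     # w = n_w
            v.fail = w.children.get(x, root)   # n_v = w' if (w, w') has label x
            queue.append(v)
    return root


def find(root, label):
    """Return the node u with L(u) = label (helper for inspection only)."""
    node = root
    for ch in label:
        node = node.children[ch]
    return node


root = build_with_failure_links(["potato", "tattoo", "theater", "other"])
```

With this tree, `find(root, "potat").fail is find(root, "tat")` holds, matching the definition of n_v.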

18
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

19
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

20
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

21
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

22
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

23
Failure links
  • P = {potato, tattoo, theater, other}
  • [Figure: keyword tree for P with failure links; the nodes where
    patterns end are numbered 1 to 4]

24
Relaxing the substring assumption
  • Recall that we assumed that no pattern in P is a
    substring of another.
  • Exact duplicates can be stored as extra indices
    in each node.
  • What about proper substrings?

25
Substring patterns
  • P = {potato, tattoo, at}
  • [Figure: keyword tree for P; the nodes where patterns end are
    numbered 1 to 3]

26
Substring patterns
  • Theorem: If there exists a directed path of
    failure links from v to w, with w numbered for
    pattern i, then Pi occurs in T whenever v is
    reached.
  • Theorem: If node v is reached, then Pi occurs in
    T ending at c if and only if v is numbered i or there
    is a directed path of failure links from v to a node
    w numbered i.

27
Substring patterns
  • To achieve the O( n + m + k ) time bound, we must do
    constant work per occurrence of some pattern from
    P
  • Following failure links might be too much work!
  • Precompute output links to shortcut failure link
    paths
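
One way to precompute output links is sketched below. It is an illustration under stated assumptions, not the lecture's implementation: the names `build` and `search_all` and the list-based tree are hypothetical, the search is written per text character rather than as the repeat/while loop, and duplicate patterns are not handled (a later duplicate would overwrite the node's number). The output link of v points to the nearest numbered node on the failure-link path out of v, so reporting costs one step per occurrence.

```python
from collections import deque


def build(patterns):
    """Return (edges, fail, number, out); node 0 is the root."""
    edges, fail, number, out = [{}], [0], [None], [None]
    for i, pattern in enumerate(patterns, start=1):
        w = 0
        for ch in pattern:
            if ch not in edges[w]:
                edges.append({})
                fail.append(0)
                number.append(None)
                out.append(None)
                edges[w][ch] = len(edges) - 1
            w = edges[w][ch]
        number[w] = i
    queue = deque(edges[0].values())  # depth-1 nodes: fail to root, no output
    while queue:
        parent = queue.popleft()
        for ch, v in edges[parent].items():
            w = fail[parent]
            while ch not in edges[w] and w != 0:
                w = fail[w]
            f = fail[v] = edges[w].get(ch, 0)
            # f is shallower than v, so out[f] is already final (BFS order).
            out[v] = f if number[f] is not None else out[f]
            queue.append(v)
    return edges, fail, number, out


def search_all(text, patterns):
    """Return (c, i) pairs: Pi occurs in text ending at position c (1-based)."""
    edges, fail, number, out = build(patterns)
    hits, w = [], 0
    for c, ch in enumerate(text, start=1):
        while ch not in edges[w] and w != 0:
            w = fail[w]
        w = edges[w].get(ch, 0)
        # Report w itself if numbered, then follow output links only.
        v = w if number[w] is not None else out[w]
        while v is not None:
            hits.append((c, number[v]))
            v = out[v]
    return hits
```

For P = {potato, tattoo, at} and the text potattoo, reaching the node labeled potat reports at through its output link, even though that node is not itself numbered.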

28
Substring patterns
  • P = {potato, tattoo, at}
  • [Figure: keyword tree for P; the nodes where patterns end are
    numbered 1 to 3]

29
Strong vs Weak Shift
  • I made a big fuss about the difference between
    sp_i and sp'_i with respect to K-M-P
  • Can we use the strong shift rule here?
  • What about the real-time extension to K-M-P?
  • Note that this looks very much like an automaton
  • Except for multiple mismatches per text character

30
Next time
  • Implementation issues
  • Applications