Vakhitov Alexander: Approximate Text Indexing

1
Vakhitov Alexander: Approximate Text Indexing
  • Using simple mathematical arguments, the matching
    probabilities in the suffix tree are bounded, and by
    a clever division of the search pattern,
    sublinear time is achieved.
  • Based on "A Hybrid Method for Approximate
    String Matching" by G. Navarro and R. Baeza-Yates

2
The Task
  • The task is to find substrings of a long
    text T that approximately match our pattern P.
  • For example, take the text T = 'adbc' and the pattern P = 'abc'
    (s is the starting position of the matching substring):
      s = 1: 'adbc' (insertion of 'd')
      s = 2: 'dbc' (replacement of 'a' with 'd')
      s = 3: 'bc' (deletion of 'a')

3
Errors
  • There are 3 kinds of transformations that introduce
    errors into the initial string: insertion, replacement
    and deletion. If we transform S into S' with such changes,
    then we can transform S' back into S with the
    same number of changes.
  • The minimal number of deletions, insertions and
    replacements needed to transform string A into
    string B is called the edit distance between A and B
    (ed(A,B)).
  • Example: ed('abc','adbc') = ed('dbc','abc') = ed('bc','abc') = 1;
    ed('survey','surgery') = 2:
    survey -> surgey (replace 'v' with 'g') -> surgery (insert 'r')
4
The resulting algorithm
  • The algorithm solves the approximate string
    matching problem in $O(n^{\lambda} \log n)$ time ($n$ is the
    size of the text T, $\lambda \in (0,1)$), if
  • the error level $\alpha < 1 - e/\sqrt{\sigma}$, where
  • $\sigma$ is the size of the alphabet,
  • $e = 2.718\ldots$,
  • $\alpha = k/m$,
  • $k$ is the number of errors,
  • $m$ is the size of the pattern P.

5
Plan of the report
  • Some useful ideas and basic algorithms
  • The main algorithm
  • Analysis of the complexity of the algorithm in
    different cases

6
Dividing the pattern
  • Lemma: Let A and B be strings with $ed(A,B) \le k$, and
    let A be divided into j substrings $A_i$. Then at least
    one of the $A_i$ appears in B with at most $\lfloor k/j \rfloor$
    errors.
  • We need k changes to transform A into B. Each
    change affects one of the $A_i$, so the k changes are
    distributed among j substrings, and the average
    number of changes per substring is k/j. Hence at least one
    substring receives at most $\lfloor k/j \rfloor$ changes.
  • Example: ed('he likes','they like') = 3 = k; A1 = 'he ',
    A2 = 'likes', so j = 2; ed('he ','they ') = 2 and
    ed('likes','like') = 1 = $\lfloor k/j \rfloor$.

7
Computing edit distance
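  • Given strings x and y, what is ed(x,y)?
  • $x = x_1 x_2 \ldots x_m$, $y = y_1 y_2 \ldots y_n$, $x_p, y_q \in \Sigma$
  • $C_{i,j} = ed(x_1 \ldots x_i, y_1 \ldots y_j)$; $C_{0..|x|,0..|y|}$ is a matrix
    filled with the values $C_{i,j}$.
  • Computing $C_{i,j}$:
  • $C_{0,j} = j$, $C_{i,0} = i$
  • $C_{i,j} = C_{i-1,j-1}$ if $x_i = y_j$,
  • else $C_{i,j} = 1 + \min(C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1})$
  • Example: x = 'survey', y = 'surgery'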

[Figure: part of the matrix C for x = 'survey' and y = 'surgery'. Green cells mark $x_i \ne y_j$, red cells mark $x_i = y_j$, and an arrow shows the element used to compute $C_{i,j}$.]
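The recurrence from this slide can be sketched in Python as follows. This is a minimal illustration rather than code from the presentation; the function name `edit_distance` is an assumption, and the full matrix is kept in memory for clarity.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance between x and y using the matrix C from the slide."""
    m, n = len(x), len(y)
    # C[i][j] = ed(x[:i], y[:j])
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i  # delete the first i characters of x
    for j in range(n + 1):
        C[0][j] = j  # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j - 1], C[i - 1][j], C[i][j - 1])
    return C[m][n]


print(edit_distance("survey", "surgery"))
```

The call at the end evaluates the slide's own example pair.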
8
Edit distance in the case of text T and pattern P
  • We need to find a substring of the text T that
    matches P with the minimal number of errors. Let x be
    the pattern and y be the text.
  • A matching text substring can begin at any
    text position, so we initialize $C_{0,j}$
    with 0. The rest of the recurrence is unchanged.
  • The algorithm needs to store only the last column and
    can process the text incrementally from the
    beginning. It moves through the matrix column by column,
    from left to right, filling each column top-down with $C_{i,j}$.
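A sketch of this column-wise search, assuming the only change to the plain edit distance recurrence is that row 0 is set to 0 in every column. The function name `approx_search` and the 1-based end positions are illustrative choices, not part of the presentation.

```python
def approx_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, errors) for each text position where some
    substring ending there matches the pattern with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix: C[i][0] = i
    hits = []
    for j, ch in enumerate(text, start=1):
        diag = col[0]  # C[0][j-1]
        col[0] = 0     # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]  # C[i][j-1], needed as the next diagonal value
            if pattern[i - 1] == ch:
                col[i] = diag
            else:
                col[i] = 1 + min(diag, old, col[i - 1])
            diag = old
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


print(approx_search("abc", "adbc", 1))
```

Only one column of m + 1 integers is stored, so the memory use is O(m) regardless of the text length.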


9
Examples of the matrices
10
Construction of the NFA
  • A Nondeterministic Finite Automaton (NFA)
    searches for the text substrings that approximately
    match the pattern with at most k errors. It
    consists of k+1 rows (one per number of errors, 0..k)
    and m+1 columns (one per matched pattern prefix, 0..m).
  • Transitions:
  • The pattern and text characters are the same
    (horizontal)
  • Insert a character into the pattern (vertical)
  • Replace the pattern character with the text one
    (solid diagonal)
  • Delete the pattern character (dashed diagonal)
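The transitions listed above can be simulated directly with a set of active states. The sketch below is one possible simulation, not the presentation's implementation: a state is a pair (i, e) meaning "i pattern characters consumed with e errors", the name `nfa_search` is an assumption, and the initial state is re-activated at every step to model the self-loop that lets a match start anywhere in the text.

```python
def nfa_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions where the NFA reaches a final state."""
    m = len(pattern)

    def closure(states):
        # Dashed diagonals: delete a pattern character without reading text.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    ends = []
    for pos, ch in enumerate(text, start=1):
        nxt = {(0, 0)}  # initial self-loop
        for i, e in active:
            if i < m and pattern[i] == ch:
                nxt.add((i + 1, e))          # horizontal: characters match
            if e < k:
                nxt.add((i, e + 1))          # vertical: insertion
                if i < m:
                    nxt.add((i + 1, e + 1))  # solid diagonal: replacement
        active = closure(nxt)
        if any(i == m for i, _ in active):
            ends.append(pos)
    return ends


print(nfa_search("abc", "adbc", 1))
```

A set-based simulation is slow compared with bit-parallel techniques, but it keeps the correspondence with the four transition types visible.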

11
Nondeterministic Finite Automaton
This automaton is for approximate string matching
for the pattern 'survey' with 2 errors
12
Depth-first search
  • DEF: the k-neighborhood is the set of strings that
    match P with at most k errors:
  • $U_k(P) = \{x \in \Sigma^* : ed(x,P) \le k\}$
  • Searching for these strings in the text (without
    errors) solves the problem, but
    $|U_k(P)| = O(m^k \sigma^k)$ is quite large.
  • We can determine which strings from $U_k(P)$ appear
    in the text by traversing the suffix tree of the text.
    Here we can use the set $U_k^t(P)$, the set
    of neighborhood elements that are not prefixes
    of others.
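To get a feeling for how quickly the k-neighborhood grows, it can be enumerated by brute force for tiny inputs. The sketch below applies single edits k times; the function name `neighborhood` and the explicit `alphabet` argument are assumptions made for illustration, and the approach is only practical for very small m, k and alphabets.

```python
def neighborhood(pattern: str, k: int, alphabet: str) -> set[str]:
    """All strings over `alphabet` within edit distance k of `pattern`."""
    result = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # replacement
        frontier = nxt - result  # expand only strings not seen before
        result |= nxt
    return result


for k in range(3):
    print(k, len(neighborhood("abc", k, "abcd")))
```

The printed sizes grow rapidly with k, which is the reason the slides switch to traversing the suffix tree instead of enumerating the neighborhood.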

13
Algorithm for searching on the suffix tree
  • Starts from the root
  • Considers the string x incrementally
  • Determines when $ed(P,x) \le k$
  • Determines when $ed(P,xy) > k$ for any y

14
Algorithm for searching on the suffix tree
  • Each new character of x corresponds to a new
    column in the matrix (adding $s \in \Sigma$ to x $\Leftrightarrow$
    updating the column in O(m) time).
  • A match is detected when the last element of the
    column is $\le k$.
  • x cannot be extended to match P when all the
    values of the last column are $> k$.
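The traversal described on these slides can be sketched on a naive suffix trie. The assumptions are significant: the trie is built in quadratic space (a real index would use a compact suffix tree), every node stores the start positions of the suffixes passing through it, and the names `build_suffix_trie` and `trie_search` are hypothetical. The result is the set of 0-based start positions at which some text substring matches the pattern with at most k errors.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive suffix trie; each node lists the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for s in range(len(text)):
        node = root
        for ch in text[s:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(s)
    return root


def trie_search(root: dict, pattern: str, k: int) -> list[int]:
    m = len(pattern)
    starts = set()

    def dfs(node, col):
        for ch, child in node["children"].items():
            # New matrix column for the string x extended by ch.
            new = [col[0] + 1]
            for i in range(1, m + 1):
                if pattern[i - 1] == ch:
                    new.append(col[i - 1])
                else:
                    new.append(1 + min(col[i - 1], col[i], new[i - 1]))
            if new[m] <= k:
                starts.update(child["starts"])  # match: report the subtree
            elif min(new) <= k:
                dfs(child, new)                 # x may still be extended
            # otherwise every value exceeds k and the branch is pruned

    dfs(root, list(range(m + 1)))
    return sorted(starts)


print(trie_search(build_suffix_trie("adbc"), "abc", 1))
```

For the text 'adbc' and pattern 'abc' with one error, the reported starts correspond to the three substrings 'adbc', 'dbc' and 'bc' from the task slide.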

15
Algorithm for searching on the suffix tree
(illustration)
16
Partitioning the pattern
  • The cost of the suffix tree search is exponential
    in m and k, so it is better to perform j searches
    for patterns of length m/j with k/j errors. That is
    why we divide the pattern.
  • So we divide the pattern into j pieces and
    search for them using the above algorithm. Then, for
    each match found ending at text position i, we
    check the text area $i-m-k \,..\, i+m+k$.
  • But the larger j is, the more text positions need to
    be verified. The optimal j is determined in the
    analysis that follows.
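The partition-and-verify scheme can be sketched as below. To keep the example short and self-contained, each piece is searched with the online column algorithm instead of the suffix tree, which is a substitution made only for illustration; the names `search_ends`, `split_pattern` and `partitioned_search` are assumptions. Pieces are searched with floor(k/j) errors, and each piece match ending at position i triggers a verification of the text area i-m-k .. i+m+k.

```python
def search_ends(pattern: str, text: str, k: int) -> list[int]:
    """1-based end positions of substrings matching pattern with <= k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            old = col[i]
            col[i] = diag if pattern[i - 1] == ch else 1 + min(diag, old, col[i - 1])
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def split_pattern(pattern: str, j: int) -> list[str]:
    """Divide the pattern into j pieces of nearly equal length."""
    m = len(pattern)
    bounds = [t * m // j for t in range(j + 1)]
    return [pattern[bounds[t]:bounds[t + 1]] for t in range(j)]


def partitioned_search(pattern: str, text: str, k: int, j: int) -> list[int]:
    m = len(pattern)
    result = set()
    for piece in split_pattern(pattern, j):
        for end in search_ends(piece, text, k // j):
            lo = max(0, end - m - k)
            hi = min(len(text), end + m + k)
            # Verify the candidate area with the full pattern and k errors.
            for e in search_ends(pattern, text[lo:hi], k):
                result.add(lo + e)
    return sorted(result)


text = "she says they like it"
print(partitioned_search("he likes", text, 3, 2))
```

Because of the lemma on dividing the pattern, the partitioned search reports the same end positions as a direct search with k errors; only the amount of work differs.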

17
Searching pieces of the pattern
  • Let's use the NFA with the depth-first search (DFS)
    technique (the suffixes from the suffix tree are
    the input of the automaton).
  • First, we transform our NFA:
  • The initial self-loop isn't needed (it previously allowed us
    to start matching from every position of
    the text).
  • We remove the lower-left triangle of the automaton,
    because we avoid initial insertions into the
    pattern.
  • We can start matching only with the first k+1 pattern
    characters.

18
The changes to NFA
19
Using suffix array instead of suffix tree
  • The suffix array can replace the suffix tree in
    our algorithm. It has lower space requirements,
    but the time complexity is multiplied by
    log n.
  • The suffix array replaces nodes with intervals, and
    traversing to a node becomes moving to its interval.
    If a node has children, then the
    node's interval contains the children's intervals.
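The node-to-interval correspondence can be illustrated with a small sketch. The suffix array is built naively by sorting suffixes, which is only reasonable for short texts, and the names `build_suffix_array` and `interval` are assumptions. The binary searches inside `interval` are where the extra log n factor comes from.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda s: text[s:])


def interval(text: str, sa: list[int], prefix: str) -> tuple[int, int]:
    """Half-open interval [lo, hi) of suffixes in sa that start with prefix."""
    p = len(prefix)

    def first_index(strict: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            head = text[sa[mid]:sa[mid] + p]
            if head < prefix or (strict and head == prefix):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True)


text = "banana"
sa = build_suffix_array(text)
print(interval(text, sa, "a"), interval(text, sa, "an"))
```

The interval for 'an' lies inside the interval for 'a', mirroring the fact that the suffix tree node for 'an' is a descendant of the node for 'a'.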

20
Analysis of the algorithm: the average number of
nodes at level l
  • For a small l, all the text suffixes (except the
    last l) are longer than l, so nearly n suffixes
    reach level l.
  • The maximum number of nodes at level l is $\sigma^l$,
    where $\sigma = |\Sigma|$.
  • We use the model of n balls randomly thrown into
    $\sigma^l$ urns. The average number of filled urns is
  • $\sigma^l (1 - (1 - 1/\sigma^l)^n) = \sigma^l (1 - e^{-\Theta(n/\sigma^l)}) = \Theta(\min(n, \sigma^l))$
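The balls-into-urns estimate is easy to evaluate numerically. The sketch below compares the exact expectation with min(n, sigma^l) for a few levels; the function name `expected_nodes` and the sample values n = 1000 and sigma = 4 are arbitrary choices for illustration.

```python
def expected_nodes(n: int, sigma: int, level: int) -> float:
    """Expected number of non-empty urns: n balls thrown into sigma**level urns."""
    urns = sigma ** level
    return urns * (1.0 - (1.0 - 1.0 / urns) ** n)


n, sigma = 1000, 4
for level in (1, 2, 5, 10):
    print(level, round(expected_nodes(n, sigma, level), 2), min(n, sigma ** level))
```

For shallow levels the expectation is close to sigma^l (every urn is hit), and for deep levels it approaches n (almost every suffix lands in its own urn).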

21
Probability of processing a given node at depth l
in the suffix tree.
  • If $l \le m'$, at least $l-k$ text characters must match
    the pattern ($m'$ is the pattern size), and
  • if $l \ge m'$, at least $m'-k$ pattern characters must
    match the text. We sum the probabilities over the
    different pattern prefixes.

The largest term of the first sum is the first one,
and by using Stirling's approximation we have
22
Probability of processing a given node at depth l
in the suffix tree.
  • ...which is

$\gamma(\alpha)^l \, O(1/l)$, where $\alpha = k/l$.
The whole first summation is bounded by $l-k$ times
this largest term, so we get $(l-k)\,\gamma(\alpha)^l\,O(1/l) = O(\gamma(\alpha)^l)$.
The first summation decreases exponentially if
$\gamma(\alpha) < 1$. This holds if
$\sigma > e^2/(1-\alpha)^2$
(because $e^{-1} < \alpha^{\alpha/(1-\alpha)}$ if $\alpha \in [0,1]$).
23
Probability of processing a given node at depth l
in the suffix tree.
  • ...or, equivalently, $\alpha < 1 - e/\sqrt{\sigma}$.

The second summation can also be bounded by
$O(\gamma(\alpha)^l)$. So the upper bound for the probability
of processing a given node at depth l in the
suffix tree is $O(\gamma(\alpha)^l)$.
In practice, e should be replaced by c = 1.09 (a value
determined experimentally), because we have only
found an upper bound on the probability.
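The condition $\sigma > e^2/(1-\alpha)^2$ rearranges to $\alpha < 1 - e/\sqrt{\sigma}$, and the same form with c = 1.09 in place of e gives the practical limit mentioned above. The sketch below evaluates both limits for a few alphabet sizes; the function name `max_error_level` and the chosen alphabet sizes are illustrative assumptions.

```python
import math


def max_error_level(sigma: int, c: float = math.e) -> float:
    """Upper limit on alpha from the condition alpha < 1 - c / sqrt(sigma)."""
    return 1.0 - c / math.sqrt(sigma)


for sigma in (4, 26, 256):
    theoretical = max_error_level(sigma)
    practical = max_error_level(sigma, c=1.09)
    print(sigma, round(theoretical, 3), round(practical, 3))
```

A negative limit means the theoretical bound gives no guarantee for that alphabet size, while the experimentally tuned constant still leaves room for a positive error level.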
24
Analysis of the single pattern search in the
suffix tree
  • Using the formulas bounding the probability of
    matching, let's assume that in levels $l \le L(k)$

all the nodes are visited, while nodes at level
$l > L(k)$ are visited with probability $O(\gamma(k/l)^l)$.
Remember that the average number of nodes
at level l (for small l) is $\Theta(\min(n, \sigma^l))$.
25
Three cases of analysis
26
The cases of analysis
  • (a) $L(k) \ge \log_\sigma n$, i.e. $n \le \sigma^{L(k)}$: small n; online
    search is preferable and no index is needed (since the
    total work is n)
  • (b) $m+k < \log_\sigma n$, i.e. $n > \sigma^{m+k}$: large n; the total
    cost is independent of n ($\alpha = k/l$)
  • (c) $L(k) \le \log_\sigma n \le m+k$: intermediate
    n; time sublinear in n.

27
Analysis of pattern partitioning
  • We need to perform j searches and then verify all
    the possible matches. We again distinguish three
    cases, following the previous slide:
  • (a) $L(k/j) \ge \log_\sigma n$, i.e. $n \le \sigma^{L(k/j)}$:
    complexity O(n)
  • (b) $m+k < j \log_\sigma n$, i.e. $n > \sigma^{(m+k)/j}$: if the error level
    $\alpha < 1 - e/\sqrt{\sigma}$,
    the complexity is $O(n^{1-\log\ldots})$,
    sublinear in n (using
    $j = (m+k)/\log_\sigma n$)
  • (c) with the same error level as in (b) and
    the same j, we also get sublinear complexity.
28
Other types of algorithms
  • Limited depth-first search: this technique determines
    viable prefixes (the prefixes of possible
    pattern matches) and searches for them in the
    suffix tree (it is expensive and cannot be
    implemented on the suffix array).
  • Filtering: discards large parts of the text
    by checking a necessary condition (simpler than
    the matching condition). Most existing filters
    are based on finding substrings of the pattern
    without errors, and they do not work at high
    error levels.

29
Summary and Conclusions
  • The splitting technique balances between
    traversing too many nodes of the suffix tree and
    verifying too many text positions.
  • The resulting index has sublinear retrieval time
    ($O(n^{\lambda})$, $0 < \lambda < 1$) if the error level is moderate.
  • In the future, more exact methods may appear
    for determining the right number of pieces into
    which the pattern is divided, and there are (and
    may appear in the future) better algorithms for
    verification after a piece of the pattern is matched.