Suffix Trees - PowerPoint PPT Presentation

1
Suffix Trees
  • Lecture 12, October 11, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Suffix Trees
  • Finish proof of Ukkonen's O(m) construction
  • Generalized suffix trees
  • First applications of suffix trees

3
Example
  • Suffix tree of S = xabxa$

[Figure: suffix tree with six leaves, numbered 1 to 6]

4
Example
  • Implicit suffix tree of S = xabxa

[Figure: implicit suffix tree with three leaves, numbered 1 to 3]

5
Ukkonen's O(m) construction: high-level
  • Construct T_1
  • For i = 1 to m - 1 (Phase i+1: make T_{i+1} from T_i)
      For j = 1 to i + 1 (Extension j)
        Find the path from the root labeled S[j..i]
        Extend the path with S(i+1)

6
Suffix extension rules
  • Extension of S[j..i] = β with S(i+1)
  • Rule 1: β ends at a leaf ⇒ add S(i+1) to the leaf edge label
  • Rule 2: no path at the end of β starts with S(i+1) ⇒ add a new
    leaf edge at the end of β with label S(i+1); create a new node
    at the end of β if necessary
  • Rule 3: a path at the end of β starts with S(i+1) ⇒ do nothing
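
The three extension rules can be traced with a small sketch. To stay short it assumes an uncompressed suffix trie (one character per edge) instead of a compressed suffix tree, and it rescans from the root in every extension, so it is far from linear time; only the rule classification follows the slides. The name `extension_rules` is made up for this illustration.

```python
def extension_rules(s):
    """Return, for each phase 2..m, the rule (1, 2 or 3) used by every extension."""
    root = {}            # each trie node is a dict: character -> child node
    root[s[0]] = {}      # T_1 holds the single character S(1)
    phases = []
    for i in range(1, len(s)):       # 0-based i: this phase appends s[i]
        c = s[i]
        rules = []
        for j in range(i + 1):       # this extension handles beta = s[j:i]
            node = root
            for ch in s[j:i]:        # walk from the root to the end of beta
                node = node[ch]
            if not node:             # Rule 1: beta ends at a leaf
                node[c] = {}
                rules.append(1)
            elif c not in node:      # Rule 2: nothing after beta starts with c
                node[c] = {}
                rules.append(2)
            else:                    # Rule 3: beta followed by c already exists
                rules.append(3)
        phases.append(rules)
    return phases


print(extension_rules("axabxb")[-1])   # [1, 1, 1, 1, 2, 3]
```

For S = axabxb the final phase applies Rule 1 four times, then Rule 2 once, then Rule 3 once, which is the order shown in the example slides that follow.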

7
Example
  • Implicit suffix tree of S = axabx

[Figure: implicit suffix tree with four leaves, numbered 1 to 4]

8
Example
  • Implicit suffix tree of S = axabxb: Rule 1

[Figure: tree after the Rule 1 extensions, with leaves 1 to 4]

9
Example
  • Implicit suffix tree of S = axabxb: Rule 2

[Figure: tree after the Rule 2 extension, with leaves 1 to 5]

10
Example
  • Implicit suffix tree of S = axabxb: Rule 3

[Figure: tree after the Rule 3 extension, with leaves 1 to 5]

11
Suffix links
  • Given:
  • String xα, where x is a character and α is a string
  • Internal node v with node-label xα
  • If there is a node s(v) with node-label α, then
    v → s(v) is called a suffix link.
  • If α is empty, s(v) is the root.

12
Suffix links
  • Corollary: If implicit suffix tree T_i has an
    internal node v with path-label xα, then T_i must have a
    node s(v) with path-label α.

13
Skip/Count Edge Traversal
14
Summary
  • Given node v with label xα, the suffix link points to
    node s(v) with label α.
  • Suffix links + skip/count edge traversal:
  • find β = S[j..i] given β' = S[j-1..i]
  • accomplish all extensions of a phase in O(m) time

15
Problem!
  • Total length of the edge labels could be Θ(m^2)
  • Construct suffix tree on S = abcdefghijklmnopqrstuvwxyz
  • Total length of edge labels is 26 × 27 / 2 = 351
  • Need output size to be O(m) too!
  • Use indices instead of substrings on edges
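
The arithmetic behind this slide can be checked directly. The sketch below assumes a string of m distinct characters, as in the alphabet example; `label_costs` is a name chosen here for illustration.

```python
import string


def label_costs(m):
    """Edge-label storage for the suffix tree of a string of m distinct characters."""
    # Every suffix starts with a different character, so the tree is a root
    # with m leaf edges, and the edge for suffix j spells out all m - j + 1
    # characters of that suffix.
    explicit = sum(m - j + 1 for j in range(1, m + 1))
    # With index labels, each edge stores only a (start, end) pair.
    indexed = 2 * m
    return explicit, indexed


print(label_costs(len(string.ascii_lowercase)))   # (351, 52)
```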

16
Example
  • Suffix tree of S = xabxa$

[Figure: suffix tree with six leaves, numbered 1 to 6, edges labeled with substrings]

17
Example
  • Suffix tree of S = xabxa$
  • Note that we must now keep S in memory!

[Figure: the same tree with edges labeled by index pairs (1,2), (2,2), (3,6), and (6,6); leaves 1 to 6]

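
Index-pair labels only make sense next to the string itself, which is why S must stay in memory. A minimal sketch of decoding them, assuming S = xabxa$ (with a terminator, so that S has six positions) and 1-based inclusive pairs as on the slide; `edge_label` is an illustrative helper name.

```python
S = "xabxa$"


def edge_label(pair, text=S):
    """Decode a 1-based, inclusive (start, end) pair into the substring it denotes."""
    start, end = pair
    return text[start - 1:end]


for pair in [(1, 2), (2, 2), (3, 6), (6, 6)]:
    print(pair, edge_label(pair))
```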
18
Suffix extension rules
  • Extension of S[j..i] = β with S(i+1)
  • Rule 1: β ends at a leaf ⇒ change label (j,i) to (j,i+1)
  • Rule 2: no path at the end of β starts with S(i+1) ⇒ add a new
    leaf edge at the end of β with label (i+1,i+1); create a new
    node at the end of β if necessary
  • Rule 3: a path at the end of β starts with S(i+1) ⇒ do nothing

19
Rule 3 ends the phase
  • Extension of S[j..i] = β with S(i+1)
  • If Rule 3 applies, then β is already followed by
    S(i+1) in T_i
  • Some other suffix S[j'..i], j' < j, of S[1..i]
    starts with β S(i+1),
  • So we know S[j'+1..i] is in T_i
  • Therefore S[j+1..i] is already followed by S(i+1)
    in T_i
  • End the phase once Rule 3 applies
  • The remaining "do nothings" are done for free!

20
Applying Rule 1 is easy
  • Once S[j..i] ends at a leaf (Rule 1), every
    extension of suffix j ends at the same leaf
  • S[j..i+1], S[j..i+2], ...
  • Instead of changing label (j,i) to (j,i+1) for
    each such leaf edge, we make its label (j,e)
  • e is a global whose value is i+1 in phase i+1
  • One change to e updates all leaf edges
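
The shared end index can be pictured as a single mutable object that every leaf edge points to. This is only a sketch of the idea; the class names `GlobalEnd` and `LeafEdge` are hypothetical, and indices are 1-based and inclusive as on the slides.

```python
class GlobalEnd:
    """The global index e: one object shared by all leaf edges."""

    def __init__(self, value=0):
        self.value = value


class LeafEdge:
    """A leaf edge labeled (start, e)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end            # reference to the shared GlobalEnd, not a copy

    def label(self, text):
        return text[self.start - 1:self.end.value]


text = "xabxa"
e = GlobalEnd(3)                  # after phase 3 the tree holds xab
leaves = [LeafEdge(1, e), LeafEdge(2, e), LeafEdge(3, e)]
before = [leaf.label(text) for leaf in leaves]

e.value = 4                       # phase 4: a single assignment extends every leaf edge
after = [leaf.label(text) for leaf in leaves]
print(before, after)
```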

21
Where are the leaf nodes?
  • Notice that once we create a leaf j with Rule 2,
    suffix j always ends at a leaf.
  • Therefore, all extensions until Rule 3 applies
    result in leaf-edges.
  • So, leaf extensions (Rule 1) for all Rule 1 and 2
    extensions of previous iterations can be done
    implicitly, in constant time.

22
Where are the leaf nodes?
[Figure: extension indices 1 to 22 across Phase 1, Phase 2, and Phase 3, with regions labeled "Rules 1 & 2" and "Rule 3"]
23
Where are the leaf nodes?
[Figure: extension indices 1 to 22 across Phase 1, Phase 2, and Phase 3, with regions labeled "Rules 1 & 2" and "Rule 3"]
24
Where are the leaf nodes?
[Figure: extension indices 1 to 22 across Phase 1, Phase 2, and Phase 3, with the region labeled "Rules 1 & 2"]
25
Single Phase Algorithm
  • Phase i+1: make T_{i+1} from T_i (add S(i+1))
  • Increment e to i+1
  • Compute explicit extensions from p until Rule 3
    applies
  • Set p to the position where Rule 3 applied.
  • The position of suffix S[p..i] in the tree at the end of
    phase i+1 gives the position of suffix S[p..i+1] required
    at the start of phase i+2.
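
A compact sketch of the whole construction follows. It is written in the widely used "active point" style, where a `remainder` counter tracks the suffixes still waiting for an explicit extension; that bookkeeping is an assumption of this sketch and differs in form from the pointer p on the slide. Leaf edges whose `end` is `None` play the role of the global e. Indices are 0-based and half-open, and all names are illustrative.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # incoming edge label is text[start:end]
        self.end = end          # None on leaf edges: "use the global end e"
        self.children = {}
        self.link = None        # suffix link (internal nodes only)


def build_suffix_tree(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0      # active point
    remainder = 0                        # suffixes awaiting explicit extension
    for i, c in enumerate(text):         # one phase per character
        remainder += 1
        pending = None                   # internal node still missing its suffix link
        while remainder > 0:
            if length == 0:
                edge = i
            first = text[edge]
            child = node.children.get(first)
            if child is None:            # Rule 2 at a node: add a leaf
                node.children[first] = Node(i, None)
                if pending is not None:
                    pending.link = node
                    pending = None
            else:
                end = i + 1 if child.end is None else child.end
                edge_len = end - child.start
                if length >= edge_len:   # skip/count: jump over the whole edge
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if text[child.start + length] == c:   # Rule 3: end the phase
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                # Rule 2 inside an edge: split it and add a leaf
                split = Node(child.start, child.start + length)
                node.children[first] = split
                split.children[c] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if pending is not None:
                    pending.link = split
                pending = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link if node.link is not None else root
    return root


def suffixes_in_tree(root, text):
    """Spell out every root-to-leaf path (leaf edges end at len(text))."""
    out = []

    def walk(n, prefix):
        end = len(text) if n.end is None else n.end
        label = prefix + text[n.start:end]
        if not n.children:
            out.append(label)
        for child in n.children.values():
            walk(child, label)

    walk(root, "")
    return sorted(out)


print(suffixes_in_tree(build_suffix_tree("xabxa$"), "xabxa$"))
```

With a unique terminator at the end of the text, each suffix ends at its own leaf, so the printed list should contain exactly the suffixes of the input.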

26
Running time analysis
  • m phases; the extensions of phase i overlap those of
    phase i+1 in at most one position.
  • At most 2m explicit extensions
  • Explicit extension takes constant time plus
    number of node-skips after following suffix link.
  • Similar node-depth argument as before
  • node-depth doesn't change between phases
  • node-depth decreases by at most 2 × (number of explicit extensions)
  • max node-depth at most m

27
Post-processing
  • Convert implicit suffix tree to suffix tree in
    O(m) time
  • Add terminator symbol to S
  • Replace e with |S|
  • O(m) time, visit each leaf edge.

28
Generalized Suffix Trees
  • It is unusual to have a single long string to
    analyze.
  • Usually have a set of strings
  • Even the genome is a set of chromosome sequences!
  • Generalized suffix trees store all suffixes of a
    set of strings
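  • Leaves are indexed by string number and position
  • Build the suffix tree of S = S1 S2 S3 ... @ Sk (the strings
    concatenated, with unique separator characters such as @ between them)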

29
Generalized Suffix Trees
  • Synthetic suffixes can be misleading
  • Remove them by changing edge label
  • Can run Ukkonen's algorithm on each string in
    turn
  • Just find prefix of S2 that is already in tree
  • Resume algorithm with next character extension
  • Edge labels might be from different strings
  • Use one never-match character, instead of many
    unique characters
  • Encode absolute position or relative position and
    string index?
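
The choice between absolute positions and (string index, relative position) pairs is cheap to bridge in either direction. A minimal sketch, assuming the strings are concatenated with a single separator character; the NUL character used as the separator and the names `concatenate` and `locate` are assumptions made for illustration.

```python
from bisect import bisect_right


def concatenate(strings, sep="\x00"):
    """Join the strings, each followed by sep; also return each string's start offset."""
    starts, parts, pos = [], [], 0
    for s in strings:
        starts.append(pos)
        parts.append(s + sep)
        pos += len(s) + 1
    return "".join(parts), starts


def locate(starts, absolute):
    """Map a 0-based absolute position to (string index, relative position)."""
    k = bisect_right(starts, absolute) - 1
    return k, absolute - starts[k]


text, starts = concatenate(["xabxa", "babxba"])
print(starts, locate(starts, 7))
```

Storing absolute positions keeps leaves small, at the price of one binary search per lookup; storing the pair avoids the search but doubles the per-leaf data.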

30
Applications
  • Longest common substring
  • Lightweight longest common substring
  • Maximal repeats
  • Maximal unique matches (MUMs)

31
Longest common substring
  • Longest substring contained in S1 and S2.
  • Construct generalized suffix tree of S1, S2
  • Each leaf represents a suffix of S1, S2, or both.
  • Mark each node of the suffix tree with:
  • 1 if all leaves in its sub-tree are suffixes of
    S1
  • 2 if all leaves in its sub-tree are suffixes of
    S2
  • 0 if a mixture of suffixes of S1 and S2
  • Longest common substring is represented by the
    node with mark 0 of maximum string-depth.
  • Why must LCS end at a node?

32
Longest common substring
  • Leaves are easy to mark
  • Mark internal nodes, bottom up:
  • if all its children are from S1, mark with 1
  • if all its children are from S2, mark with 2
  • otherwise, mark with 0.
  • Post-order traversal to mark all nodes, and find
    deepest 0 mark in tree.
  • O(|S1| + |S2|) time
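
The marking scheme can be sketched without a full suffix tree implementation. The version below swaps in a naive uncompressed suffix trie (quadratic time and space) for the generalized suffix tree, so it illustrates the bottom-up marking but not the linear bound. It assumes the terminators `$` and `#` do not occur in the inputs; all names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.mark = None      # 1, 2, or 0 (mixture), as on the slides


def longest_common_substring(s1, s2):
    root = TrieNode()
    # Insert every suffix of both strings; the distinct terminators make
    # each suffix end at its own leaf, which is marked with its owner.
    for owner, text in ((1, s1 + "$"), (2, s2 + "#")):
        for j in range(len(text)):
            node = root
            for ch in text[j:]:
                node = node.children.setdefault(ch, TrieNode())
            node.mark = owner

    best = ""

    def visit(node, label):
        # Post-order: mark the children first, then combine their marks.
        nonlocal best
        if node.children:
            marks = {visit(child, label + ch) for ch, child in node.children.items()}
            node.mark = marks.pop() if len(marks) == 1 else 0
            if node.mark == 0 and len(label) > len(best):
                best = label
        return node.mark

    visit(root, "")
    return best


print(longest_common_substring("xabxa", "babxba"))   # abx
```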

33
Lightweight LCS
  • Build suffix tree for S1
  • Initialize all nodes of tree with label 0
  • Match the initial characters of S2 from the root
    of the tree
  • Given mismatch at character S2(i+1)
  • find node v at or above end of S2[1..i]
  • Label(v) = i

34
Lightweight LCS
  • Given a match of S2[j..i] from the root and a
    mismatch at character S2(i+1)
  • find node v at or above end of S2[j..i]
  • Label(v) = i - j + 1
  • Jump to suffix link s(v) and use skip/count edge
    traversal to find S2[j+1..i] in tree
  • Re-start character comparisons in tree with
    S2(i+1)
  • Treat end of S2 like a mismatch.
  • Do tree traversal to find largest label

35
Lightweight LCS
  • O(|S1| + |S2|) time
  • Very similar to Keyword-Tree automaton
  • Build suffix tree for smaller of S1 and S2
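
The suffix-link walk can be sketched on a simpler structure. The version below assumes an uncompressed suffix trie of S1 with suffix links, built online, so every position is a node and no skip/count step is needed; it uses quadratic space, unlike the suffix tree on the slides. Instead of labeling nodes and traversing the tree afterwards, it tracks the deepest match directly. All names are illustrative.

```python
class Node:
    def __init__(self, depth):
        self.kids = {}
        self.link = None        # node for the same string minus its first character
        self.depth = depth      # string-depth of the node


def build_suffix_trie(s):
    root = Node(0)
    top = root                  # node spelling the whole prefix read so far
    for c in s:
        r, prev = top, None
        # Walk the suffix-link chain, adding c wherever it is missing.
        while r is not None and c not in r.kids:
            new = Node(r.depth + 1)
            r.kids[c] = new
            if prev is not None:
                prev.link = new
            prev = new
            r = r.link
        if prev is not None:
            prev.link = root if r is None else r.kids[c]
        top = top.kids[c]
    return root


def lcs_lightweight(s1, s2):
    root = build_suffix_trie(s1)
    node, best_len, best_end = root, 0, 0
    for i, c in enumerate(s2):
        # On a mismatch, drop the first character of the current match.
        while node is not root and c not in node.kids:
            node = node.link
        if c in node.kids:
            node = node.kids[c]
        if node.depth > best_len:
            best_len, best_end = node.depth, i + 1
    return s2[best_end - best_len:best_end]


print(lcs_lightweight("xabxa", "babxba"))   # abx
```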

36
Maximal Repeats
  • Given a string S, find all substrings with at
    least two copies in S, that cannot be extended to
    the left or right
  • S = AAGATATGATAGGATC
  • Maximal repeats include GATA, AG, and GAT

37
Maximal Repeats
  • Can be determined in linear time
  • If α is a maximal repeat, then there is an
    internal node with label α.
  • Why?
  • Therefore, there are at most m = |S| maximal
    repeats.
  • Why?

38
Maximal Repeats
  • So, which internal nodes represent maximal
    repeats?
  • Define the left-character of a suffix S[j..m] to
    be character S(j-1).
  • An internal node is called left-diverse if at
    least two leaves in its sub-tree have different
    left-characters

39
Maximal Repeats
  • Theorem
  • The node-label α of internal node v is a maximal
    repeat if and only if v is left-diverse.
  • Left-diverse nodes can be marked using a
    post-order traversal of tree, as in longest
    common substring
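
The left-diverse test can be sketched on a naive structure. The version below assumes an uncompressed suffix trie with a `$` terminator (quadratic, for illustration only): branching trie nodes stand in for the internal nodes of the suffix tree, and an empty string stands in for the missing left-character of the first suffix. All names are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.left = None      # left-character, set on leaves only


def maximal_repeats(s, terminator="$"):
    text = s + terminator
    root = Node()
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            node = node.children.setdefault(ch, Node())
        node.left = text[j - 1] if j > 0 else ""   # "" means no left-character

    found = []

    def visit(node, label):
        # Post-order: return the set of left-characters of the leaves below node.
        if not node.children:
            return {node.left}
        lefts = set()
        for ch, child in node.children.items():
            lefts |= visit(child, label + ch)
        # Branching, non-root, and left-diverse: the label is a maximal repeat.
        if label and len(node.children) > 1 and len(lefts) > 1:
            found.append(label)
        return lefts

    visit(root, "")
    return sorted(found)


print(maximal_repeats("AAGATATGATAGGATC"))
```

On the example string from the earlier slide this reports GATA, AG, and GAT, and also A, G, and AT, whose nodes are left-diverse as well.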

40
Maximal unique match
  • Useful for anchoring larger alignments
  • MUMmer: Delcher, ..., Salzberg, '99, '02, '04
  • MUMs occur exactly once in each of S1 and S2 and
    cannot be extended in either direction and still
    be a match
  • Maximal repeats in generalized suffix tree of S1
    and S2 that have exactly two leaves, one from S1
    and one from S2