Title: Suffix Trees
1. Suffix Trees
- String: any sequence of characters.
- Substring of string S: string composed of characters i through j, i <= j, of S.
- S = cater => ate is a substring.
- car is not a substring.
- The empty string is a substring of S.
2. Subsequence
- Subsequence of string S: string composed of characters i1 < i2 < ... < ik of S.
- S = cater => ate is a subsequence.
- car is a subsequence.
- The empty string is a subsequence.
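The difference between the two definitions can be checked directly. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are chosen here for illustration and do not come from the slides.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p is a run of consecutive characters of s (or p is empty)."""
    return p in s


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p occur in s in the same order,
    not necessarily next to each other."""
    remaining = iter(s)
    # `ch in remaining` advances the iterator past the first match,
    # so later characters of p can only match later characters of s.
    return all(ch in remaining for ch in p)


for p in ["ate", "car", ""]:
    print(repr(p), is_substring(p, "cater"), is_subsequence(p, "cater"))
```

Every substring is also a subsequence, but not the other way around, which is what the `car` example illustrates.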
3. String/Pattern Matching
- You are given a source string S.
- Answer queries of the form: is the string p_i a substring of S?
- Knuth-Morris-Pratt (KMP) string matching.
- O(|S| + |p_i|) time per query.
- O(n|S| + sum_i |p_i|) time for n queries.
- Suffix tree solution.
- O(|S| + sum_i |p_i|) time for n queries.
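For comparison with the suffix tree approach, here is a minimal sketch of KMP matching, which preprocesses the pattern and then scans the source once. The names `failure_table` and `kmp_contains` are illustrative, not taken from the slides.

```python
def failure_table(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Return True if p is a substring of s, in O(|s| + |p|) time."""
    if not p:
        return True  # the empty string is a substring of every string
    fail = failure_table(p)
    k = 0  # number of pattern characters currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]  # fall back without re-reading s
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False
```

Each query pays for a full scan of the source string, which is where the n|S| term for n queries comes from.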
4. String/Pattern Matching
- KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S.
- An application of string matching.
- Genome project.
- Databank of strings (gene sequences).
- Character set is {A, C, G, T}.
- Determine if a new sequence is a substring of a
databank sequence.
5. Definition Of Suffix Tree
- Compressed trie with edge information.
- Keys are the nonempty suffixes of a given string S.
- Nonempty suffixes of S = sleeper are
- sleeper
- leeper
- eeper
- eper
- per, er, and r.
6. String Matching And Suffixes
- p_i is a substring of S iff p_i is a prefix of some suffix of S.
- Nonempty suffixes of S = sleeper are
- sleeper
- leeper
- eeper
- eper
- per, er, and r.
- Which of these are substrings of S?
- leep, eepe, pe, leap, peel
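The prefix-of-a-suffix characterization can be tried out on the five queries with a brute-force check. This is only a sketch of the definition, not of the suffix tree itself; the helper names are made up for this example.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_of_some_suffix(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for query in ["leep", "eepe", "pe", "leap", "peel"]:
    print(query, is_prefix_of_some_suffix(query, "sleeper"))
```

A suffix tree organizes these same suffixes so that the prefix test takes time proportional to the query length instead of scanning every suffix.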
7. Last Character Of S Repeats
- When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
- S = creeper
- creeper, reeper, eeper, eper, per, er, r
- Here the suffix r is a proper prefix of the suffix reeper.
- When the last character of S appears more than once in S, use an end-of-string character (written # here) to overcome this problem.
- S = creeper#
- creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
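The effect of the end-of-string character can be seen with a small brute-force check. This sketch assumes `#` as the end-of-string character and that `#` does not occur elsewhere in S; the function name is illustrative.

```python
def suffixes_that_prefix_others(s: str) -> list[tuple[str, str]]:
    """Return pairs (short, long) where the suffix `short` is a
    proper prefix of the suffix `long`."""
    suffixes = [s[i:] for i in range(len(s))]
    return [
        (a, b)
        for a in suffixes
        for b in suffixes
        if a != b and b.startswith(a)
    ]


print(suffixes_that_prefix_others("creeper"))   # the suffix r is a prefix of reeper
print(suffixes_that_prefix_others("creeper#"))  # no such pair once # is appended
```

With the terminator in place, every suffix ends at its own leaf of the trie rather than at an interior point of another key.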
8-10. Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S indexed 1 through 10.]
11. Suffix Tree Construction
- See Web write up for algorithm.
- Time complexity
- |S| = n, alphabet size r.
- O(nr) using array nodes.
- This is O(n) for r a constant (or r <= c).
- O(n) expected time using a hash table.
- O(n) time algorithm for large r in reference
cited in Web write up.
12. O(|p_i|) Time Substring Matching
[Figure: searching the suffix tree for the patterns babb, abbba, and baba.]
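The search itself can be sketched on a simplified structure. The code below builds an uncompressed suffix trie by inserting every suffix, which takes O(n^2) time and space; it is not the linear-time construction from the write up, and it does not compress edges. It assumes `#` as the end-of-string character and uses dictionaries (hash tables) for child pointers. The names `build_suffix_trie` and `contains` are illustrative.

```python
END = "#"  # end-of-string character, assumed not to occur in the source


def build_suffix_trie(s: str) -> dict:
    """Naive uncompressed suffix trie of s + END as nested dicts.
    O(n^2) time and space; for illustration only."""
    text = s + END
    root: dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(root: dict, p: str) -> bool:
    """Follow p from the root, one lookup per character,
    so the search takes O(|p|) expected time."""
    node = root
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abbbabbbb")
for pattern in ["babb", "abbba", "baba"]:
    print(pattern, contains(trie, pattern))
```

Once the trie is built, the cost of a query depends only on the pattern length, not on the length of the source string.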
13. Find All Occurrences Of p_i
- Search suffix tree for p_i.
- Suppose the search for p_i is successful.
- When search terminates at an element node, p_i appears exactly once in the source string S.
14. Search Terminates At Element Node
[Figure: search for p_i = abbbb terminates at an element node.]
15. Search Terminates At Branch Node
- When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.
16. Search Terminates At Branch Node
[Figure: search for p_i = ab terminates at a branch node.]
17. Find All Occurrences Of p_i
- To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
- Link all element nodes into a chain in inorder.
- Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
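The augmentation can be sketched on a naive uncompressed suffix trie. In this sketch the chain of element nodes is stored as a Python list of suffix start positions (0-based), and each node keeps the list indices `lo` and `hi` of its leftmost and rightmost element node instead of pointers. The class and function names, the use of `#` as terminator, and the O(n^2) construction are all assumptions made for this example.

```python
END = "#"  # end-of-string character, assumed not to occur in the source


class Node:
    def __init__(self) -> None:
        self.children: dict[str, "Node"] = {}
        self.start = -1  # element nodes: 0-based start of their suffix
        self.lo = -1     # chain index of leftmost element node below
        self.hi = -1     # chain index of rightmost element node below


def build(s: str) -> tuple[Node, list[int]]:
    """Naive O(n^2) suffix trie plus the element-node chain."""
    text = s + END
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i  # every suffix ends at its own leaf thanks to END

    chain: list[int] = []

    def visit(node: Node) -> None:
        # Recursive for brevity; fine for short strings only.
        if not node.children:
            node.lo = node.hi = len(chain)
            chain.append(node.start)
            return
        kids = [node.children[c] for c in sorted(node.children)]
        for kid in kids:
            visit(kid)
        node.lo, node.hi = kids[0].lo, kids[-1].hi

    visit(root)
    return root, chain


def occurrences(root: Node, chain: list[int], p: str) -> list[int]:
    """Start positions of p: O(|p|) to walk down, then one chain slice."""
    node = root
    for ch in p:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return chain[node.lo:node.hi + 1]


root, chain = build("abbbabbbb")
print(sorted(occurrences(root, chain, "ab")))
```

After the walk down the trie, reporting the matches touches only the slice of the chain between `lo` and `hi`, so no part of the subtree has to be traversed at query time.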
18. Augmented Suffix Tree
[Figure: augmented suffix tree, with a search for p_i = b.]
19. Longest Repeating Substring
- Find longest substring of S that occurs more than m times in S, m > 1.
- Label branch nodes with number of element nodes in subtree.
- Find branch node with label > m and max char field.
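The labeling idea can be sketched on a naive uncompressed suffix trie, where the label of a node is the number of suffixes that pass through it (equal to the number of element nodes below it) and the depth of a node plays the role of the char field. The function name and the `min_count` parameter are hypothetical; to ask for more than m occurrences, pass `min_count = m + 1`. The construction is O(n^2) and assumes `#` does not occur in S.

```python
END = "#"  # end-of-string character, assumed not to occur in the source


def longest_repeated(s: str, min_count: int = 2) -> str:
    """Longest substring of s occurring at least min_count times
    (occurrences may overlap). Returns "" if there is none."""
    text = s + END
    root = {"count": 0, "kids": {}}
    best = ""
    for i in range(len(text)):
        node = root
        for depth, ch in enumerate(text[i:], start=1):
            node = node["kids"].setdefault(ch, {"count": 0, "kids": {}})
            node["count"] += 1  # one more suffix passes through this node
            if ch != END and node["count"] >= min_count and depth > len(best):
                best = text[i:i + depth]
    return best


print(longest_repeated("abbbabbbb", min_count=2))
```

A compressed suffix tree would store the same counts on branch nodes only, which keeps the structure linear in size.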
20. Longest Repeating Substring
[Figure: examples for m = 2 and m = 5.]
21. Longest Common Substring
- Given two strings S and T.
- Find the longest common substring.
- S = carport, T = airports
- Longest common substring = rport
- Longest common subsequence = arport
- Longest common subsequence may be found in O(|S| * |T|) time using dynamic programming.
- Longest common substring may be found in O(|S| + |T|) time using a suffix tree.
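The dynamic programming bound for the longest common subsequence can be illustrated with the standard table-filling approach. This is a minimal sketch; the function name `longest_common_subsequence` is chosen for this example.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """Standard O(|s| * |t|) dynamic program."""
    m, n = len(s), len(t)
    # length[i][j] = length of a longest common subsequence
    # of s[:i] and t[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Walk back through the table to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("carport", "airports"))
```

The table has (|S| + 1) * (|T| + 1) entries and each is filled in constant time, which gives the quadratic bound.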
22. Longest Common Substring
- Let # be a new symbol.
- Construct the suffix tree for the string U = S#T.
- U = carport#airports
- No repeating substring includes #.
- Find longest repeating substring that is both to left and right of #.
- Find branch node that has max char and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
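The idea of marking which side of the separator each suffix starts on can be sketched with a naive uncompressed suffix trie of U. Each node carries a bit mask: bit 1 is set if some suffix beginning in S passes through it, bit 2 if some suffix beginning at or after the separator does. This construction is O(|U|^2), not the linear-time suffix tree, and it assumes `#` occurs in neither S nor T. The function name is illustrative.

```python
SEP = "#"  # separator symbol, assumed to occur in neither input string


def longest_common_substring(s: str, t: str) -> str:
    """Deepest trie node reached by a suffix starting in s and by a
    suffix starting in t. Returns "" if nothing is shared."""
    u = s + SEP + t
    root = {"mask": 0, "kids": {}}
    best = ""
    for i in range(len(u)):
        bit = 1 if i < len(s) else 2  # which side this suffix starts on
        node = root
        for depth, ch in enumerate(u[i:], start=1):
            node = node["kids"].setdefault(ch, {"mask": 0, "kids": {}})
            node["mask"] |= bit
            # mask == 3 means suffixes from both sides share this prefix;
            # such a prefix can never contain SEP, which occurs only once.
            if node["mask"] == 3 and depth > len(best):
                best = u[i:i + depth]
    return best


print(longest_common_substring("carport", "airports"))
```

In a compressed suffix tree the same marks can be computed for all branch nodes in one traversal, which is what brings the total cost down to linear in |S| + |T|.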