Title: Suffix Arrays and Suffix Trees
1. Suffix Arrays and Suffix Trees
2. Outline
- Motivation
- What are suffix arrays and trees?
- Examples
- Some construction algorithms
3. Motivation
- Many biological problems require approximate matching.
- No efficient indices for approximate matching are known.
- Filter algorithms for approximate matching use exact matching.
- To be efficient, fast exact matching algorithms have to be employed.
- => Indices for exact string matching
4. What are suffix arrays and trees?
- Text indexing data structures
- not word based
- allow search for patterns or computation of statistics
- Important Properties
- Size
- Speed of exact matching
- Space required for construction
- Time required for construction
5. The Suffix Array
Definition: Given a string D, the suffix array SA for this string is the sorted list of pointers to all suffixes of D (Manber, Myers 1990).
6. Example
D = A B A A B B A B B A C
 0  A B A A B B A B B A C
 1  B A A B B A B B A C
 2  A A B B A B B A C
 3  A B B A B B A C
 4  B B A B B A C
 5  B A B B A C
 6  A B B A C
 7  B B A C
 8  B A C
 9  A C
10  C
SORT LEXICOGRAPHICALLY!
7. Example
D = A B A A B B A B B A C
 2  A A B B A B B A C
 0  A B A A B B A B B A C
 3  A B B A B B A C
 6  A B B A C
 9  A C
 1  B A A B B A B B A C
 5  B A B B A C
 8  B A C
 4  B B A B B A C
 7  B B A C
10  C
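The definition and the sorted list above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name naive_suffix_array is an assumption made for illustration, and sorting full suffix copies is far slower than the construction algorithms discussed later in the slides.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Each suffix is identified by its start position ("pointer" into the text).
    return sorted(range(len(text)), key=lambda i: text[i:])


D = "ABAABBABBAC"
SA = naive_suffix_array(D)
for start in SA:
    # Prints each pointer next to the suffix it refers to.
    print(start, D[start:])
```

For the example string this yields the pointer order 2 0 3 6 9 1 5 8 4 7 10 shown on the slide.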
8. Exact matching using a Suffix Array
D  = A B A A B B A B B A C
SA = 2 0 3 6 9 1 5 8 4 7 10
Basic idea: 2 binary searches in SA
- Search for leftmost position
- Search for rightmost position
9. Binary search for BB
D     = A B A A B B A B B A C
SA    = 2 0 3 6 9 1 5 8 4 7 10
index = 0 1 2 3 4 5 6 7 8 9 10
10. BB > BA
Continue binary search in the right (larger) half of SA
11. BB = BB
More occurrences of BB left of this one possible!
12. BB > BA
=> leftmost position of BB is pointed to by SA[8]
13. BB > BA
Search further to the right
14. BB = BB
More occurrences of BB right of this one possible!
15. BB = BB
More occurrences of BB right of this one possible!
16. BB < C
=> rightmost position of BB is pointed to by SA[9]
17. Results of search for BB
D     = A B A A B B A B B A C
SA    = 2 0 3 6 9 1 5 8 4 7 10
index = 0 1 2 3 4 5 6 7 8 9 10
- leftmost position of BB is pointed to by SA[8]
- rightmost position of BB is pointed to by SA[9]
=> All occurrences of the pattern BB are pointed to by SA[8..9] (text positions 4 and 7)
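The two binary searches can be sketched as follows. The function name sa_range and its return convention (an inclusive index pair, or None if the pattern does not occur) are assumptions made for this illustration, not part of the slides.

```python
def sa_range(text, sa, pattern):
    """Return (left, right) so that sa[left..right] points to all occurrences of pattern."""
    p = len(pattern)

    # First binary search: leftmost suffix whose first p characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1          # continue in the right (larger) half
        else:
            hi = mid              # more occurrences to the left are possible
    left = lo

    # Second binary search: first suffix whose first p characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1          # more occurrences to the right are possible
        else:
            hi = mid
    right = lo - 1

    return (left, right) if left <= right else None


D = "ABAABBABBAC"
SA = [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
print(sa_range(D, SA, "BB"))
```

Each step compares at most p characters, so the sketch performs O(log N) steps with O(p log N) character comparisons in total.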
18. Important Properties
- for SA (N = length of the text, p = length of the pattern)
- Size: 1 pointer per letter (4 bytes if N < 4 GB)
- Speed of exact matching
- O(log N) binary search steps
- number of compared characters is O(p log N)
- can be reduced to O(p + log N)
19. Some known Construction methods
- Manber-Myers
- variant of the labeling technique of Karp, Miller and Rosenberg
- Sorting of suffixes is performed as follows:
Sort in rounds; round i sorts by substrings of length 2^i, and each round is possible in O(N).
Construction in O(N log N) time
20. Example
D = A B A A B B A B B A C
 0  A B A A B B A B B A C
 1  B A A B B A B B A C
 2  A A B B A B B A C
 3  A B B A B B A C
 4  B B A B B A C
 5  B A B B A C
 6  A B B A C
 7  B B A C
 8  B A C
 9  A C
10  C
First round (2^0 = 1):  0 2 3 6 9 1 4 5 7 8 10
Second round (2^1 = 2): 2 0 3 6 9 1 5 8 4 7 10
BN = bucket number
Round i: For each suffix x we have to identify the bucket number of suffix x + 2^(i-1).
The bucket number can be found in the field BN: BN[x + 2^(i-1)]
21. Round 1: 2-pass bucket sort using the first character. Create 2 arrays, BN and SA.
Round i (the sorted prefix length doubles from 2^(i-1) to 2^i): When comparing suffixes x and y from the same bucket:
1. For positions 0..2^(i-1)-1, suffixes x and y are equal.
2. For positions 2^(i-1)..2^i-1, suffixes x and y have already been compared!
Result is given by comparing suffix x + 2^(i-1) with suffix y + 2^(i-1); use BN to access suffix x + 2^(i-1) and suffix y + 2^(i-1).
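The doubling idea can be sketched in Python as below. This is a simplified variant, not the Manber-Myers implementation itself: it replaces the O(N) bucket sort of each round with Python's comparison sort, so a round costs O(N log N) here. The name suffix_array_doubling is an assumption; the list rank plays the role of the BN field.

```python
def suffix_array_doubling(text):
    """Build a suffix array by sorting on prefixes of length 1, 2, 4, 8, ..."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]      # bucket number by first character
    sa = list(range(n))
    h = 1                              # suffixes are currently ordered by h characters
    while True:
        # Order of the first 2h characters = (bucket of x, bucket of x + h).
        # A suffix shorter than h characters gets -1, which sorts first.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            # Same key means same bucket; a different key opens a new bucket.
            same = key(sa[j]) == key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (0 if same else 1)
        rank = new_rank
        if rank[sa[-1]] == n - 1:      # every suffix has its own bucket: done
            return sa
        h *= 2


print(suffix_array_doubling("ABAABBABBAC"))
```

The comparison of suffix x + h with suffix y + h never touches the text again; it only looks up bucket numbers from the previous round, which is the point made on the slide.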
22. Baeza-Yates-Gonnet-Snider (External)
- Idea
- text is cut in pieces of size M
- runs in N/M rounds, in each round
- - compute SA for the current text piece
- - merge SA with the suffix array for the previous pieces
Run time
All sort steps: O((N^3 log M)/M)
All merge steps: O((N^3 log M)/M)
Total run time: O((N^3 log M)/M)
No. of block I/Os (B = block size): O((N^3 log M)/(MB))
23. Baeza-Yates-Gonnet-Snider (External)
Runtime analysis of one round
- compute SA for the current piece of size M
- - O(M log M) sort comparisons of suffixes
- - Problem: in the worst case one comparison scans complete suffixes, O(N); the expected case is much smaller (lcp)
- - Worst-case runtime: O(N M log M) character comparisons
- merge SA with the already existing SAx
- - length of SAx is O(N) => number of merge steps is O(N)
- - one merge step = 1 suffix comparison = O(N) in the worst case => O(N^2) runtime
=> total runtime for one round: O(N^2 + N M log M) = O(N^2 log M)
N/M rounds => total runtime O(N^3 log M / M)
24. Example: BGS Construction
M = 4
D = A B A A B B A B
Piece 1: A B A A    SA1 = 2 0 3 1
Piece 2: B B A B    SA2 = 6 7 5 4
Merge SA1, SA2:
SA = 2 6 0 3 7 1 5 4
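The piecewise scheme can be imitated in memory with the sketch below. It only illustrates the sort-then-merge rounds on the example above; the real algorithm is designed for external memory and organizes the merge differently. The name build_in_pieces is an assumption made for this illustration.

```python
def build_in_pieces(text, m):
    """Build a suffix array piece by piece: sort one piece, merge it into the result."""
    sa = []
    for begin in range(0, len(text), m):
        end = min(begin + m, len(text))
        # Suffix array of the current piece; suffixes are compared on the whole text.
        piece = sorted(range(begin, end), key=lambda i: text[i:])

        # Merge with the suffix array of the previous pieces.
        merged = []
        a = b = 0
        while a < len(sa) and b < len(piece):
            # One merge step = one suffix comparison.
            if text[sa[a]:] <= text[piece[b]:]:
                merged.append(sa[a])
                a += 1
            else:
                merged.append(piece[b])
                b += 1
        merged.extend(sa[a:])
        merged.extend(piece[b:])
        sa = merged
    return sa


print(build_in_pieces("ABAABBAB", 4))
```

With M = 4 the first round produces 2 0 3 1, and merging the second piece 6 7 5 4 gives the complete suffix array.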
25. The Suffix Tree
Definition: Given a string D, the suffix tree ST for this string is the compacted trie built on all suffixes of D (Weiner, 1973).
26. The Suffix Tree
- Structural Properties
- Each arc of the tree denotes a substring
- Each inner node has outdegree > 1
- The arcs leaving a node start with different characters
- Each leaf l denotes the suffix composed of all arc labels on the path root -> l
- N leaves and < N internal nodes
- a special character is used as end marker
27-29. An Example
[Figure: suffix tree for D. Each arc is labeled with a substring of D, and each leaf is labeled with the start position of its suffix.]
D = A B A A B A B A A B A A B A B A
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
30-35. Simple Construction
for all suffixes s: insert(s)
D = A B A A B A B A A B A A B A B A
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[Figure sequence: the tree after inserting suffixes 0, 1, 2 and 3.]
- insert suffix 0: one arc ABAABABAABAABABA from the root to leaf 0
- insert suffix 1: a second arc BAABABAABAABABA from the root to leaf 1
- insert suffix 2: the arc to leaf 0 is split after the common prefix A; below the new node, arc BAABABAABAABABA leads to leaf 0 and arc ABABAABAABABA leads to leaf 2
- insert suffix 3: below A, the arc to leaf 0 is split after BA; below the new node, arc ABABAABAABABA leads to leaf 0 and arc BAABAABABA leads to leaf 3
36. Problem: O(n^2) space (N + (N-1) + (N-2) + ... + 1)
[Figure: suffix tree for D = A B C D E with every arc label written out in full: ABCDE, BCDE, CDE, DE, E.]
D = A B C D E
    0 1 2 3 4 5
37-42. Solution: Arc Pointers
[Figure sequence: in the suffix tree for D, arc labels are replaced step by step by pointer pairs into D. A pair (i,j) denotes the substring D[i..j]. The pairs shown are (0,0), (1,2), (3,5), (6,7) and (8,16).]
D = A B A A B A B A A B A A B A B A
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
43. O(n) arcs => O(n) pointer pairs
[Figure: the same tree with the pointer pairs (0,0), (1,2), (3,5), (6,7) and (8,16) in place of the arc labels.]
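The simple construction combined with arc pointers can be sketched as follows. This is a naive quadratic-time illustration, not one of the linear-time algorithms; the names Node, build_suffix_tree, arcs and path_labels, and the end marker "$", are assumptions made for this sketch.

```python
class Node:
    """A tree node together with the arc that leads to it from its parent."""

    def __init__(self, start, end, leaf=None):
        self.start, self.end = start, end  # arc label is text[start..end], inclusive
        self.leaf = leaf                   # suffix start position for leaves, else None
        self.kids = {}                     # first character of an arc -> child node


def build_suffix_tree(text, marker="$"):
    """Insert every suffix of text + marker; assumes marker does not occur in text."""
    text += marker
    last = len(text) - 1
    root = Node(0, -1)                     # the root has no incoming arc
    for i in range(len(text)):
        node, pos = root, i
        while True:
            child = node.kids.get(text[pos])
            if child is None:              # no arc starts with this character: new leaf
                node.kids[text[pos]] = Node(pos, last, leaf=i)
                break
            k = child.start                # walk along the arc while characters agree
            while k <= child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k > child.end:              # arc consumed completely: descend
                node = child
                continue
            mid = Node(child.start, k - 1) # mismatch inside the arc: split it at k
            node.kids[text[mid.start]] = mid
            child.start = k
            mid.kids[text[k]] = child
            mid.kids[text[pos]] = Node(pos, last, leaf=i)
            break
    return text, root


def arcs(node):
    """Yield the (start, end) pointer pair of every arc below node."""
    for child in node.kids.values():
        yield (child.start, child.end)
        yield from arcs(child)


def path_labels(text, node, prefix=""):
    """Map each leaf to the concatenated arc labels on the path root -> leaf."""
    out = {}
    for child in node.kids.values():
        label = prefix + text[child.start:child.end + 1]
        if child.leaf is not None:
            out[child.leaf] = label
        else:
            out.update(path_labels(text, child, label))
    return out


text, root = build_suffix_tree("ABAABABAABAABABA")
print(len(list(arcs(root))), "arcs for", len(text), "characters")
```

Every arc stores two integers regardless of how long its label is, so the space is proportional to the number of arcs rather than to the total label length.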
44-49. Searching
P = A B A A B A B
[Figure sequence: P is matched character by character along the arcs of the suffix tree for D, starting at the root. The leaves below the point where P ends give the occurrences of P in D, here positions 0 and 8.]
D = A B A A B A B A A B A A B A B A
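A minimal sketch of the search is given below. To stay short it uses a small hand-written suffix tree for the string ABAB with end marker "$", stored as nested dictionaries (arc label -> subtree, or -> suffix start position for a leaf). This representation and the names TREE, leaves and search are assumptions made for illustration only.

```python
# Suffix tree of "ABAB$": suffixes ABAB$ (0), BAB$ (1), AB$ (2), B$ (3), $ (4).
TREE = {
    "AB": {"AB$": 0, "$": 2},
    "B": {"AB$": 1, "$": 3},
    "$": 4,
}


def leaves(tree):
    """Collect the suffix start positions stored in all leaves below tree."""
    if isinstance(tree, int):
        return [tree]
    found = []
    for subtree in tree.values():
        found.extend(leaves(subtree))
    return found


def search(tree, pattern):
    """Return the sorted start positions of all occurrences of pattern."""
    node = tree
    while pattern:
        if isinstance(node, int):          # ran past a leaf: no occurrence
            return []
        for label, subtree in node.items():
            if label[0] == pattern[0]:     # at most one arc starts with this character
                break
        else:
            return []                      # no arc to follow
        n = min(len(label), len(pattern))
        if label[:n] != pattern[:n]:       # mismatch inside the arc
            return []
        pattern = pattern[n:]
        node = subtree
    return sorted(leaves(node))            # all leaves below are occurrences


print(search(TREE, "AB"))
```

Matching the pattern takes time proportional to its length (times the cost of choosing an arc), and reporting the hits takes time proportional to the number of leaves below the end point.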
50. Some Structural Properties
Longest common prefix of two suffixes in D = depth of the lowest common node of the suffixes
[Figure: two suffixes whose lowest common node has depth 2, so lcp = 2.]
51. Some Structural Properties
- Longest repeat in D = maximum depth of any inner node
- Most common string of length m: for each node, save the number of leaves below it; examine all nodes with depth >= m
- ... and many more. Several applications in biology (see, e.g., the book by Gusfield).
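Both properties can be tried out with the sketch below. For brevity it uses an uncompacted suffix trie (one node per character, hence quadratic space) instead of a suffix tree; the counting argument is the same. The names build_counts, longest_repeat and most_common are assumptions, and most_common assumes 1 <= m <= len(text).

```python
def build_counts(text):
    """Build a suffix trie in which every node stores the number of suffixes below it."""
    root = {"count": 0, "kids": {}}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["kids"].setdefault(ch, {"count": 0, "kids": {}})
            node["count"] += 1             # one more suffix (leaf) below this node
    return root


def longest_repeat(text):
    """Return a longest substring that occurs at least twice in text."""
    best = ""
    stack = [(build_counts(text), "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["kids"].items():
            if child["count"] >= 2:        # at least two leaves below: a repeat
                candidate = prefix + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


def most_common(text, m):
    """Return (substring, count) for a most frequent substring of length m."""
    level = [(build_counts(text), "")]
    for _ in range(m):                     # descend to the nodes at depth m
        level = [(child, prefix + ch)
                 for node, prefix in level
                 for ch, child in node["kids"].items()]
    node, substring = max(level, key=lambda item: item[0]["count"])
    return substring, node["count"]


print(longest_repeat("BANANA"), most_common("BANANA", 1))
```

In a compacted suffix tree the same answers are read off the inner nodes: the deepest inner node spells a longest repeat, and the leaf counts give the frequencies.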
52. Summary: Suffix Trees
- Search time: O(p log |Σ| + occ), where Σ is the alphabet and occ is the number of occurrences
- Space: O(N)
- (between 1.25 N and 5 N pointers)
- Implementations, e.g., by Kurtz (Bielefeld)
- Construction: O(N log |Σ|)
- O(N) for integer alphabets (Farach, 1997)
- Note: Implementation details are extremely important for practical use (constants/space).
53. Suffix Tree Applications
- Work on the following organisms
- Arabidopsis thaliana (100 Mbp)
- Michigan State / Minnesota University
- Yeast (13 Mbp)
- MPI for Biochemistry, Munich
- Borrelia burgdorferi (1 Mbp)
- Brookhaven Nat. Lab. / Stony Brook Univ.