Suffix Arrays and Suffix Trees

1
Suffix Arrays and Suffix Trees
  • Stefan Burkhardt

2
  • Motivation
  • What are suffix arrays and trees?
  • Examples
  • Some construction algorithms

3
Motivation
  • Many biological problems require approximate
    matching.
  • No efficient indices for approximate matching
    known.
  • Filter algorithms for approximate matching use
    exact matching.
  • To be efficient, fast exact matching algorithms
    have to be employed.
  • => Indices for exact string matching

4
  • What are suffix arrays and trees?
  • Text indexing data structures
  • not word based
  • allow search for patterns or computation of statistics
  • Important Properties
  • Size
  • Speed of exact matching
  • Space required for construction
  • Time required for construction

5
The Suffix Array
Definition: Given a string D, the suffix array SA for this string is the
sorted list of pointers to all suffixes of D. (Manber, Myers 1990)
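The definition can be tried out directly with a minimal sketch: sort the start positions of all suffixes by the suffix they point to. The function name suffix_array_naive is an illustrative assumption, and this is not one of the construction algorithms discussed later; it costs O(N^2 log N) character comparisons in the worst case.

```python
def suffix_array_naive(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order."""
    # The "pointers" are plain integer positions into text.
    # Comparing whole suffixes is simple but slow for long, repetitive texts.
    return sorted(range(len(text)), key=lambda i: text[i:])


D = "ABAABBABBAC"
print(suffix_array_naive(D))
```

For realistic text sizes one of the dedicated construction methods is the better choice; the sketch only makes the definition concrete.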
6
Example
D = A B A A B B A B B A C
 0  A B A A B B A B B A C
 1  B A A B B A B B A C
 2  A A B B A B B A C
 3  A B B A B B A C
 4  B B A B B A C
 5  B A B B A C
 6  A B B A C
 7  B B A C
 8  B A C
 9  A C
10  C
SORT LEXICOGRAPHICALLY!
7
Example
D = A B A A B B A B B A C
 2  A A B B A B B A C
 0  A B A A B B A B B A C
 3  A B B A B B A C
 6  A B B A C
 9  A C
 1  B A A B B A B B A C
 5  B A B B A C
 8  B A C
 4  B B A B B A C
 7  B B A C
10  C
8
Exact matching using a Suffix Array
D  = A B A A B B A B B A C
SA = 2 0 3 6 9 1 5 8 4 7 10
Basic Idea: 2 binary searches in SA
  • Search for leftmost position
  • Search for rightmost position
9
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
10
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB > BA
Continue binary search in the right (larger) half of SA
11
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB = BB
More occurrences of BB left of this one possible!
12
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB > BA
Leftmost position of BB is pointed to by SA[8]
13
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB > BA
Search further to the right
14
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB = BB
More occurrences of BB right of this one possible!
15
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB = BB
More occurrences of BB right of this one possible!
16
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
BB < C
Rightmost position of BB is pointed to by SA[9]
17
Results of search for BB
D:  A B A A B B A B B A C
SA: 2 0 3 6 9 1 5 8 4 7 10
i:  0 1 2 3 4 5 6 7 8 9 10
Leftmost position of BB is pointed to by SA[8]
Rightmost position of BB is pointed to by SA[9]
=> All occurrences of the pattern BB are pointed to by SA[8..9]
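The two binary searches walked through on the preceding slides might be sketched as follows. The function name sa_range and the convention of returning an inclusive index pair are assumptions made for illustration; each step compares at most p characters, which gives the plain O(p log N) bound.

```python
def sa_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return (left, right) such that sa[left..right] (inclusive) point to all
    suffixes starting with pattern; left > right means there is no occurrence."""
    p = len(pattern)

    # First search: leftmost suffix whose first p characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1          # pattern is larger: continue in the right half
        else:
            hi = mid              # equal or smaller: more matches may lie left
    left = lo

    # Second search: first suffix whose first p characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1          # still a match (or smaller): look further right
        else:
            hi = mid
    return left, lo - 1


D = "ABAABBABBAC"
SA = [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
left, right = sa_range(D, SA, "BB")
print(left, right, sorted(SA[left:right + 1]))
```

The second search starts at the left boundary found by the first one, so it never revisits the part of SA that is already known to be too small.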
18
  • Important Properties
  • for |SA| = N and p = length of pattern
  • Size: 1 pointer per letter (4 bytes if N < 4 GB)
  • Speed of exact matching
  • O(log N) binary search steps
  • # of compared chars is O(p log N)
  • can be reduced to O(p + log N)

19
  • Some known Construction methods
  • Manber-Myers
  • variant of the labeling technique of Karp, Miller
    and Rosenberg
  • Sorting of suffixes is performed as follows

Sort in rounds: round i sorts the substrings of length 2^i;
each round is possible in O(N)
Construction in O(N log N) time
20
Example
D = A B A A B B A B B A C
First Round  (2^0 = 1): 0 2 3 6 9 1 4 5 7 8 10
Second Round (2^1 = 2): 2 0 3 6 9 1 5 8 4 7 10
 0  A B A A B B A B B A C
 1  B A A B B A B B A C
 2  A A B B A B B A C
 3  A B B A B B A C
 4  B B A B B A C
 5  B A B B A C
 6  A B B A C
 7  B B A C
 8  B A C
 9  A C
10  C
BN = bucket number
Round i:
For each suffix x we have to identify the bucket number of
x[2^(i-1) .. 2^i - 1].
The bucket number can be found in the field BN: BN[x + 2^(i-1)]
21
Round 1: 2-pass bucket sort using the first character.
Create 2 arrays, BN and SA.
Round i: When comparing suffixes x and y from the same bucket:
1. For positions 0 .. 2^(i-1) - 1, suffixes x and y are equal.
2. For positions 2^(i-1) .. 2^i - 1, suffixes x and y have already been
   compared! The result is given by comparing suffix x + 2^(i-1) with
   suffix y + 2^(i-1); use BN to access suffix x + 2^(i-1) and y + 2^(i-1).
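The rounds can be illustrated with a short prefix-doubling sketch. It is a simplification under stated assumptions, not the Manber-Myers implementation: the name suffix_array_doubling is illustrative, and each round uses Python's comparison sort instead of the O(N) bucket pass, so the total cost is O(N log^2 N) rather than O(N log N). The list bn plays the role of the BN array.

```python
def suffix_array_doubling(text: str) -> list[int]:
    """Prefix doubling: after the round for length 2h, bn[x] is the bucket
    number of suffix x with respect to its first 2h characters."""
    n = len(text)
    if n == 0:
        return []
    # Round 1: bucket the suffixes by their first character.
    bn = [ord(c) for c in text]
    sa = sorted(range(n), key=lambda x: bn[x])
    h = 1
    while True:
        # Suffixes in the same bucket agree on their first h characters, so
        # the next h characters are decided by the bucket of suffix x + h.
        # A suffix that ends before position x + h gets -1 and sorts first.
        def key(x: int) -> tuple[int, int]:
            return (bn[x], bn[x + h] if x + h < n else -1)

        sa.sort(key=key)
        new_bn = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            # Open a new bucket whenever the 2h-character key changes.
            new_bn[cur] = new_bn[prev] + (key(prev) != key(cur))
        bn = new_bn
        if bn[sa[-1]] == n - 1:  # every suffix is alone in its bucket
            return sa
        h *= 2


print(suffix_array_doubling("ABAABBABBAC"))
```

The loop stops as soon as all buckets have size one, which happens after at most about log2(N) doublings.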
22
Baeza-Yates-Gonnet-Snider (External)
  • Idea
  • text is cut in pieces of size M
  • runs in N/M rounds, in each round
  • - compute SA for the current text piece
  • - merge SA with the suffix array for the
    previous pieces

Run Time
All sort steps: O((N^3 log M)/M)
All merge steps: O((N^3 log M)/M)
Total run time: O((N^3 log M)/M)
No. of block I/Os (B = block size): O((N^3 log M)/(MB))
23
Baeza-Yates-Gonnet-Snider (External)
Runtime analysis of one round
- compute SA for the current piece of size M:
  O(M log M) sort comparisons of suffixes
  Problem: worst case comparison is complete suffixes (O(N))
  But expected case is much smaller (lcp)
  Worst case runtime: O(N M log M) character comparisons
- merge SA with the already existing SAx:
  length of SAx: O(N) => number of merge steps: O(N)
  one merge step = 1 suffix comparison: O(N) worst case => O(N^2) runtime
=> total runtime for one round: O(N^2 + N M log M) = O(N^2 log M)
N/M rounds => total runtime: O(N^3 log M / M)
24
Example: BGS Construction
M = 4
D = A B A A B B A B
Piece 1: A B A A    SA1 = 2 0 3 1
Piece 2: B B A B    SA2 = 6 7 5 4
Merge SA1, SA2
SA = 2 6 0 3 7 1 5 4
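The round structure (sort one piece, merge with what exists so far) can be sketched in memory as follows. This is only an illustration under loud assumptions: the function name suffix_array_piecewise is hypothetical, everything stays in RAM, and whole suffixes are compared directly, whereas the external algorithm is organized around block I/O.

```python
def suffix_array_piecewise(text: str, m: int) -> list[int]:
    """Build the suffix array in rounds of m start positions each (m >= 1)."""
    sa: list[int] = []
    for start in range(0, len(text), m):
        stop = min(start + m, len(text))
        # Sort the suffixes of the whole text that start in the current piece.
        piece = sorted(range(start, stop), key=lambda i: text[i:])
        # Merge with the suffix array of the previous pieces; every merge
        # step is one suffix comparison, O(N) characters in the worst case.
        merged: list[int] = []
        a = b = 0
        while a < len(sa) and b < len(piece):
            if text[sa[a]:] < text[piece[b]:]:
                merged.append(sa[a])
                a += 1
            else:
                merged.append(piece[b])
                b += 1
        sa = merged + sa[a:] + piece[b:]
    return sa


print(suffix_array_piecewise("ABAABBAB", 4))
```

Each round touches the whole suffix array built so far, which is where the O(N) merge steps per round in the runtime analysis come from.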
25
The Suffix Tree
Definition: Given a string D, the suffix tree ST for this string is the
compacted trie built on all suffixes of D. (Weiner, 1973)
26
  • The Suffix Tree
  • Structural Properties
  • Each arc of the tree denotes a substring
  • Each inner node has outdegree > 1
  • The arcs leaving a node start with different characters
  • Each leaf l denotes the suffix composed
    of all arc labels on the path root -> l
  • N leaves and < N internal nodes
  • a special character is used as end marker

27
An Example
[Figure: suffix tree of D; arcs are labeled with substrings of D, leaves with the suffix start positions 0..16]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
30
Simple Construction
for all suffixes s: insert(s)
[Figure: tree after inserting suffix 0; a single arc ABAABABAABAABABA leads to leaf 0]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
31
Simple Construction
for all suffixes s: insert(s)
[Figure: tree after inserting suffix 1; arc BAABABAABAABABA leads to leaf 1, arc ABAABABAABAABABA leads to leaf 0]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
33
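Simple Construction
for all suffixes s: insert(s)
[Figure: tree after inserting suffix 2; arc A leads to an inner node with arcs BAABABAABAABABA (leaf 0) and ABABAABAABABA (leaf 2); arc BAABABAABAABABA leads to leaf 1]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16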
35
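Simple Construction
for all suffixes s: insert(s)
[Figure: tree after inserting suffix 3; arc A leads to an inner node with arc BA (to a second inner node with arcs ABABAABAABABA to leaf 0 and BAABAABABA to leaf 3) and arc ABABAABAABABA (leaf 2); arc BAABABAABAABABA leads to leaf 1]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16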
36
Problem: O(n^2) Space (N + (N-1) + (N-2) + ... + 1)
[Figure: suffix tree of D = A B C D E with every arc label written out in full (ABCDE, BCDE, CDE, DE, E); leaves 0..5]
D = A B C D E
0 1 2 3 4 5
37
Solution: Arc Pointers
[Figure: suffix tree of D with all arc labels still written out as substrings]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
38
Solution: Arc Pointers
[Figure: suffix tree of D; arc label replaced so far by a pointer pair into D: (0,0)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
39
Solution: Arc Pointers
[Figure: suffix tree of D; arc labels replaced so far by pointer pairs into D: (0,0), (1,2)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
40
Solution: Arc Pointers
[Figure: suffix tree of D; arc labels replaced so far by pointer pairs into D: (0,0), (1,2), (3,5)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
41
Solution: Arc Pointers
[Figure: suffix tree of D; arc labels replaced so far by pointer pairs into D: (0,0), (1,2), (3,5), (6,7)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
42
Solution: Arc Pointers
[Figure: suffix tree of D; arc labels replaced so far by pointer pairs into D: (0,0), (1,2), (3,5), (6,7), (8,16)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
43
O(n) Arcs => O(n) pointer pairs
[Figure: suffix tree of D; the arcs on the path to leaf 0 carry the pointer pairs (0,0), (1,2), (3,5), (6,7), (8,16)]
D = A B A A B A B A A B A A B A B A
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
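The simple construction combined with arc pointers might look like the sketch below. Several things are assumptions for illustration: the names Node, build_suffix_tree and path_to_leaf, the use of "$" as the end marker (the slides do not name the character), and inclusive (start, end) pairs as on the slides. Inserting every suffix this way takes O(N^2) time in the worst case, but each arc stores only two integers.

```python
class Node:
    def __init__(self, start: int = -1, end: int = -1, leaf: int = -1):
        self.start = start  # arc into this node is text[start..end], inclusive
        self.end = end
        self.leaf = leaf    # suffix start position for a leaf, -1 otherwise
        self.children: dict[str, "Node"] = {}  # first arc character -> child


def build_suffix_tree(text: str) -> Node:
    """Insert all suffixes one by one; text must end with a unique end marker."""
    root, n = Node(), len(text)
    for s in range(n):
        node, i = root, s  # i is the next unmatched position of suffix s
        while True:
            child = node.children.get(text[i])
            if child is None:  # no arc starts with text[i]: add a new leaf
                node.children[text[i]] = Node(i, n - 1, leaf=s)
                break
            k = child.start    # walk along the arc while characters agree
            while k <= child.end and text[k] == text[i]:
                k += 1
                i += 1
            if k > child.end:  # whole arc matched: continue below it
                node = child
                continue
            # Mismatch inside the arc: split it into [start..k-1] and [k..end].
            mid = Node(child.start, k - 1)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[i]] = Node(i, n - 1, leaf=s)
            break
    return root


def path_to_leaf(root: Node, text: str, leaf: int) -> list[tuple[int, int]]:
    """Pointer pairs of the arcs on the path from the root to the given leaf."""
    pairs, node, depth = [], root, 0
    while node.leaf != leaf:
        node = node.children[text[leaf + depth]]
        pairs.append((node.start, node.end))
        depth += node.end - node.start + 1
    return pairs


T = "ABAABABAABAABABA$"
print(path_to_leaf(build_suffix_tree(T), T, 0))
```

Splitting an arc only moves integer boundaries, so the space per arc stays constant no matter how long the label is.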
44
Searching
[Figure: suffix tree of D; the pattern P is matched character by character along the arcs, starting at the root]
P = A B A A B A B
49
Searching
[Figure: suffix tree of D; the leaves below the end point of the match give the occurrences of P in D]
D = A B A A B A B A A B A A B A B A
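The search can be sketched as: follow the pattern from the root, then report all leaves below the point where it ends. To keep the example short it uses an uncompacted suffix trie built from nested dicts, which is an assumption of this sketch and needs O(N^2) space, unlike the compacted tree with arc pointers. The names build_suffix_trie and find_occurrences are illustrative.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompacted suffix trie as nested dicts; the key None marks the end
    of a suffix and stores its start position (it acts as the end marker)."""
    root: dict = {}
    for s in range(len(text)):
        node = root
        for c in text[s:]:
            node = node.setdefault(c, {})
        node[None] = s
    return root


def find_occurrences(root: dict, pattern: str) -> list[int]:
    """Match pattern from the root, then collect every leaf below that point."""
    node = root
    for c in pattern:
        if c not in node:
            return []          # pattern does not occur in the text
        node = node[c]
    result, stack = [], [node]
    while stack:               # depth-first walk over the subtree
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                result.append(child)
            else:
                stack.append(child)
    return sorted(result)


D = "ABAABABAABAABABA"
print(find_occurrences(build_suffix_trie(D), "ABAABAB"))
```

Matching costs time proportional to the pattern length; the subtree walk adds the cost of reporting the occurrences.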
50
Some Structural Properties
Longest common prefix of two suffixes in D = depth of the lowest common
node of the suffixes
[Figure: excerpt of the suffix tree with two leaves (one of them leaf 14) whose lowest common node has depth 2; lcp = 2]

51
Some Structural Properties
  • Longest repeat in D = maximum depth of any inner node
  • Most common string of length m: for each node, save the number of
    leaves below it; examine all nodes with depth >= m
  • many more...
Several applications in Biology (see, for example, the book by Gusfield)
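The longest-repeat property can be checked with a small sketch. As an assumption made to keep it short, it uses an uncompacted suffix trie of nested dicts instead of a real suffix tree; the branching nodes of the trie are exactly the inner nodes of the compacted tree, so the deepest branching node spells the longest repeat. The names build_suffix_trie and longest_repeat are illustrative.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompacted suffix trie as nested dicts; the key None marks the end
    of a suffix and stores its start position (it acts as the end marker)."""
    root: dict = {}
    for s in range(len(text)):
        node = root
        for c in text[s:]:
            node = node.setdefault(c, {})
        node[None] = s
    return root


def longest_repeat(text: str) -> str:
    """Longest substring occurring at least twice (occurrences may overlap)."""
    best = ""
    stack = [(build_suffix_trie(text), "")]  # (node, string spelled so far)
    while stack:
        node, prefix = stack.pop()
        # Two or more outgoing arcs (counting the end marker) mean that the
        # string spelled so far is followed by different characters, so it
        # occurs at least twice: this is an inner node of the suffix tree.
        if len(node) >= 2 and len(prefix) > len(best):
            best = prefix
        for c, child in node.items():
            if c is not None:
                stack.append((child, prefix + c))
    return best


print(longest_repeat("ABAABABAABAABABA"))
```

Storing a leaf count per node during the same walk would give the "most common string of length m" query as well.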
52
Summary Suffix Trees
  • Search time: O(p log |Σ| + occ)
  • Space: O(N)
  • (between 1.25 n and 5 n pointers)
  • Implementations: for example by Kurtz (Bielefeld)
  • Construction: O(N log |Σ|)
  • O(N) for integers (Farach, 97)
  • Note: implementation details are extremely
    important for practical use
  • (constants/space)

53
Suffix Tree Applications
  • Work on the following organisms
  • Arabidopsis thaliana (100 Mbp)
  • Michigan State / Minnesota University
  • Yeast (13 Mbp)
  • MPI for Biochemistry, Munich
  • Borrelia burgdorferi (1 Mbp)
  • Brookhaven Nat. Lab. / Stony Brook Univ.