Suffix Arrays

Slides: 58

Provided by: csta3

Transcript and Presenter's Notes

Title: Suffix Arrays

1
Suffix Arrays

A New Method for Online String Searches
U.Manber and G.Myers

2
Introduction - String matching

Let A = a_0 a_1 ... a_{N-1} be a large text of length N
Let W = w_0 w_1 ... w_{P-1} be a word of length P
Is W a substring of A?

3
Introduction - Suffix Trees

Build time
O(N)
Search time
O(P)
Structure space
O(N)
Big constant
Depends on the alphabet Σ

4
Suffix Arrays

An array of all the suffixes of A
Sorted by lexicographical order
A = aababa

Sorted suffixes: a, aababa, aba, ababa, ba, baba
5
Suffix Arrays

A_i = a_i a_{i+1} ... a_{N-1}
The suffix of A that starts at position i.
Position array (Pos)
Pos[k] is the start position of the kth smallest
suffix
A_Pos[k] is the suffix pointed to by Pos[k]
A_Pos[k] is the kth smallest suffix

A = aababa
Text position i:  0 1 2 3 4 5
Pos[k]:           5 0 3 1 4 2
Array index k:    0 1 2 3 4 5
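
The Pos array on this slide can be reproduced with a deliberately naive sketch. The function name `build_pos` is hypothetical, and sorting whole suffix strings is not the construction the talk describes later; it is only a quick way to check the example.

```python
def build_pos(text):
    """Return Pos: the start positions of the suffixes of text in lexicographic order.

    Naive approach: each comparison may scan O(N) symbols, so this is far
    slower than the O(N log N) sort presented later in the talk.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = build_pos("aababa")
print(pos)                             # [5, 0, 3, 1, 4, 2]
print(["aababa"[i:] for i in pos])     # ['a', 'aababa', 'aba', 'ababa', 'ba', 'baba']
```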
6
Searching

Is W a substring of A?
W is a substring of A <=> some suffix A_i starts
with W
i is W's location
All the instances of W must match consecutive
suffixes in the array
Find the array interval that contains those
suffixes

7
Searching - Definitions

For a string u
u^p = u_0 u_1 ... u_{p-1} (the first p symbols of u)
For strings u, v
u <_p v <=> u^p < v^p
Same for <=_p, =_p, >=_p, >_p
For any p, Pos is ordered according to <=_p

8
Searching - Definitions

W = w_0 w_1 ... w_{P-1}
L_W = min {k : W <=_P A_Pos[k] or k = N}
First suffix that is >=_P W
R_W = max {k : A_Pos[k] <=_P W or k = -1}
Last suffix that is <=_P W

9
Search Algorithm

For all k in [L_W, R_W]: W =_P A_Pos[k]
To find W's instances, find L_W and R_W
Number of W's occurrences is (R_W - L_W + 1)
Matches are A_Pos[L_W], ..., A_Pos[R_W]
Suffix array is sorted, so use binary search

10
Binary Search

Search interval (L, R)
Midpoint M
Compare W to A_Pos[M]
Decide where to search next
W <=_P A_Pos[M]: search in left half (R = M)
W >_P A_Pos[M]: search in right half (L = M)
O(P log N)

[Figure: binary search for W = abc over the sorted entries aab, abc, bcd, cbb, with markers L, M, R]
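
The plain O(P log N) binary search can be sketched as follows. This is a minimal illustration, assuming a naively built Pos array; the names `build_pos` and `find_interval` are hypothetical. It returns (L_W, R_W) as defined on the earlier slides, so an empty result shows up as L_W > R_W.

```python
def build_pos(text):
    # Naive suffix array, used here only to have something to search in.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text, pos, w):
    """Return (L_W, R_W); W occurs R_W - L_W + 1 times (0 if L_W > R_W)."""
    n, p = len(pos), len(w)

    # L_W: smallest k with W <=_P A_Pos[k], or N if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= text[pos[mid]:pos[mid] + p]:   # compare first P symbols only
            hi = mid
        else:
            lo = mid + 1
    left = lo

    # R_W: largest k with A_Pos[k] <=_P W, or -1 if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    return left, lo - 1


text = "aababa"
pos = build_pos(text)
left, right = find_interval(text, pos, "aba")
print(left, right)                       # 2 3
print(sorted(pos[left:right + 1]))       # [1, 3]  (start positions of "aba")
```

Each of the O(log N) steps compares up to P symbols, which is where the O(P log N) bound comes from.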
11
Search Algorithm

Observation
We can use information from one comparison to
speed up the next comparisons
Use additional information
lcp: longest common prefix

12
Search Algorithm - lcp

lcp(v, w): the length of the longest common
prefix of v and w
Obtained by comparing v and w and stopping at the
first unequal symbol
Use precomputed lcp information to reduce the
number of comparisons to O(P + log N)

13
Search Algorithm

Consider all possible midpoints
M in [1, N-2]
Every midpoint corresponds to a triplet (L_M, M, R_M)
Suppose we precomputed two arrays
Llcp[M] = lcp(A_Pos[L_M], A_Pos[M])
Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M])

14
Search Algorithm

Maintain two more variables
l = lcp(A_Pos[L], W)
r = lcp(W, A_Pos[R])
W = abcd

[Figure: sorted entries abaa, abb, abc, abcd, ac, aca, acb, acd, ad]
15
Search Algorithm

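Assume l >= r
Compare l with Llcp[M]
If l < Llcp[M]
W >_{l+1} A_Pos[L_M]
A_Pos[L_M] =_{l+1} A_Pos[M]
W >_{l+1} A_Pos[M]
So W lies in the right half (L = M) and l is unchanged

[Figure: sorted entries aba, abaa, abab, ababa, abac, abcd, ac, acd, ad]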
16
Search Algorithm

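If l > Llcp[M]
A_Pos[L_M] <_l A_Pos[M]
W =_l A_Pos[L_M]
W <_l A_Pos[M]
So W lies in the left half (R = M) and r becomes Llcp[M]

W = abcd
[Figure: sorted entries aba, abcd, abd, ac, aca, ad, ada, adb, adc]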
17
Search Algorithm

If l = Llcp[M]
W can be in either half
Start comparing W and A_Pos[M] from the (l+1)th
symbol
First unequal symbol determines whether to go
right or left
r or l will be updated to l+j
j+1 comparisons

W = abcd
[Figure: sorted entries ab, aba, abaa, abc, abcc, abcd, ada, adb, adc]
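
The three cases can be put together in a sketch that finds L_W using l, r, Llcp and Rlcp. It is an illustration under stated assumptions, not the presenters' code: the function names are hypothetical, Llcp and Rlcp are filled in directly from their definition by naive symbol comparison, and the symmetric case l < r is handled with Rlcp in the same way as the l >= r case shown on the slides.

```python
def match_len(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:]."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def precompute_lcps(text, pos):
    """Fill Llcp[M], Rlcp[M] for every midpoint M the binary search can visit."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n

    def visit(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        llcp[mid] = match_len(text, pos[lo], text, pos[mid])
        rlcp[mid] = match_len(text, pos[mid], text, pos[hi])
        visit(lo, mid)
        visit(mid, hi)

    if n:
        visit(0, n - 1)
    return llcp, rlcp


def find_left(text, pos, llcp, rlcp, w):
    """Return L_W = min {k : W <=_P A_Pos[k]}, or N if there is no such k."""
    n, p = len(pos), len(w)
    if n == 0 or w <= text[pos[0]:pos[0] + p]:
        return 0
    if w > text[pos[-1]:pos[-1] + p]:
        return n
    lo, hi = 0, n - 1                      # invariant: A_Pos[lo] <_P W <=_P A_Pos[hi]
    l = match_len(text, pos[lo], w, 0)
    r = match_len(text, pos[hi], w, 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            if llcp[mid] >= l:             # cases l = Llcp[M] and l < Llcp[M]
                m = l + match_len(text, pos[mid] + l, w, l)
            else:                          # case l > Llcp[M]
                m = llcp[mid]
        else:
            if rlcp[mid] >= r:
                m = r + match_len(text, pos[mid] + r, w, r)
            else:
                m = rlcp[mid]
        # m = lcp(A_Pos[mid], W); the symbol at offset m decides the direction.
        if m == p or (pos[mid] + m < len(text) and w[m] <= text[pos[mid] + m]):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi


text = "aababa"
pos = sorted(range(len(text)), key=lambda i: text[i:])
llcp, rlcp = precompute_lcps(text, pos)
print(find_left(text, pos, llcp, rlcp, "aba"))   # 2
print(find_left(text, pos, llcp, rlcp, "c"))     # 6
```

Folding the case l < Llcp[M] into the comparison branch is a simplification: the first comparison then fails immediately, so it costs one symbol comparison and leaves l unchanged, as the slide describes.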
18
Search Algorithm - Complexity

In each Iteration
Let h = max(l, r)
We compare the symbols at positions h through h+j
(counting from 0)
j+1 symbol comparisons
Next time we will start from position h+j
j symbols out of the j+1 will not be compared
again

19
Search Algorithm - Complexity

Every symbol in W will be successfully matched at
most once
O(P) successful comparisons
At most one symbol will be unsuccessfully matched
in each iteration
O(log N) unsuccessful comparisons
Total O(P + log N) comparisons

20
Build Suffix Array

So far
An O(P + log N) search algorithm
Given a sorted suffix array
Given lcp information (Llcp, Rlcp)
Next
Sort the suffix array in O(N log N)
Compute the lcps while sorting the array

21
Sort Algorithm

First stage
Sort the suffixes into buckets, according to
first symbol
Inductive stage
Assume array is bucket sorted according to first
H symbols
Every H-bucket holds suffixes with the same H
first symbols
Buckets are ordered according to the <=_H relation
Sort according to the first 2H symbols

22
Sort Algorithm Intuition

Let A_i, A_j be two suffixes in the same H-bucket
A_i =_H A_j
Next H symbols of A_i and A_j are the first H
symbols of A_{i+H} and A_{j+H}
In order to determine the 2H order of A_i and A_j,
look at the H order of A_{i+H} and A_{j+H}

A = aababaa, H = 2
[Figure: suffixes in H order: a, aa, aababaa, abaa, ababaa, babaa, baa, with markers A_i, A_j, A_{i+H}, A_{j+H}]
23
Sort Algorithm Main Idea

Let A_i be a suffix in the first H-bucket
A_i starts with the smallest H-symbol string
A_{i-H} should be the first in its 2H-bucket

A = aababa, H = 1
[Figure: suffixes in H order: a, ababa, aba, aababa, baba, ba]
24
Sort Algorithm

In stage H
Go over all the suffixes in the <=_H order
For each A_i, move A_{i-H} to the next available place
in its H-bucket
The suffixes are now sorted according to the <=_2H
order
Go on to stage 2H to produce 4H order
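
The doubling idea (H order of A_i and A_{i+H} gives the 2H order of A_i) can be sketched in a simplified form. Note the swap: instead of moving A_{i-H} inside its bucket in O(N) per stage as the slides describe, this sketch re-sorts by the pair (bucket of A_i, bucket of A_{i+H}) with a comparison sort, which costs O(N log N) per stage. The function name is hypothetical.

```python
def suffix_array_doubling(text):
    """Sort suffixes by their first 1, 2, 4, ... symbols until all buckets are singletons."""
    n = len(text)
    pos = list(range(n))
    rank = [ord(c) for c in text]      # H = 1: the bucket of A_i is its first symbol
    h = 1
    while True:
        def key(i):
            # (H-bucket of A_i, H-bucket of A_{i+H}); -1 when A_{i+H} is empty,
            # so a suffix that ends early sorts before longer ones.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        # Renumber the 2H-buckets: equal keys share a bucket number.
        new_rank = [0] * n
        for k in range(1, n):
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if n == 0 or rank[pos[-1]] == n - 1:   # every suffix is alone in its bucket
            return pos
        h *= 2


print(suffix_array_doubling("aababa"))     # [5, 0, 3, 1, 4, 2]
print(suffix_array_doubling("assassin"))   # [0, 3, 6, 7, 2, 5, 1, 4]
```

The number of stages is still O(log N), because after the stage with H >= N/2 every pair of distinct suffixes has been separated.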

25
Sort Algorithm - Example
A = assassin
Position: 0 1 2 3 4 5 6 7
Pos:      A3 A0 A6 A7 A1 A5 A4 A2
Suffixes in this order: assin, assassin, in, n, ssassin, sin, ssin, sassin
H = 2
26
Sort Algorithm - Example
A0 A3 A6 A7 A2 A5 A4 A1
A0 A3 A6 A7 A2 A5 A1 A4
27
Sort Algorithm - Complexity

First Stage
Bucket sort according to first symbol
O(NlogN)
Inductive Stages
O(logN) stages
O(N) per stage
Total O(NlogN)
Space
Can be implemented using two N-sized integer
arrays

28
Finding Longest Common Prefixes

The search algorithm uses lcp information
Llcp[M] = lcp(A_Pos[L_M], A_Pos[M])
Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M])
We want to compute this information while we are
sorting the array

29
Finding Longest Common Prefixes

Show how to compute lcps for suffixes in
adjacent H-buckets during the sort algorithm
Use that to compute the lcps of all the suffixes
that are consecutive in the sorted suffix array
Show how to compute lcps for all the necessary
suffixes

30
Finding LCP for adjacent buckets

After the first sort stage, the lcp of suffixes in
adjacent buckets is 0
Assume after stage H we know the lcps between
suffixes in adjacent H-buckets
Suppose A_p and A_q are in the same H-bucket but
not in the same 2H-bucket
H <= lcp(A_p, A_q) < 2H
lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
lcp(A_{p+H}, A_{q+H}) < H

31
Finding LCP for adjacent buckets

Let i, j be the positions of A_{p+H}, A_{q+H} in the suffix
array
Assume i < j
Array is ordered according to the <=_H order
lcp(A_Pos[i], A_Pos[j]) = min over k in [i+1, j] of
lcp(A_Pos[k-1], A_Pos[k])

[Figure: suffix array entries a, ababa, aba, aababa, baba, ba]
32
LCP Data Structures Hgt

We need a data structure that will allow us
get the lcps of consecutive suffixes
get their minimum
Hgt: an array of size N-1
Hgt[i] = lcp(A_Pos[i-1], A_Pos[i])
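
The two requirements (lcps of consecutive suffixes, and their minimum) can be checked on a fully sorted array with a small sketch. The helper names are hypothetical, Hgt is computed naively here rather than during the sort, and index 0 is an unused placeholder so that the indices match the slide.

```python
def lcp(u, v):
    """Length of the longest common prefix of u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def build_hgt(text, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for 1 <= i < N; hgt[0] is a placeholder."""
    return [0] + [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, len(pos))]


def lcp_via_hgt(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum of Hgt[k] over k in [i+1, j]."""
    return min(hgt[i + 1:j + 1])


text = "aababa"
pos = sorted(range(len(text)), key=lambda i: text[i:])
hgt = build_hgt(text, pos)
print(hgt[1:])                    # [1, 1, 3, 0, 2]
print(lcp_via_hgt(hgt, 0, 3))     # 1, i.e. lcp("a", "ababa")
```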

33
LCP Data Structures Hgt

Hgt will be computed inductively throughout the
sort
Initialized to N+1
Hgt[i] is updated in stage 2H <=> A_Pos[i] started a
new 2H-bucket
To update Hgt[i]
Let a, b be the array positions of A_{Pos[i-1]+H}
and A_{Pos[i]+H}
Assume a < b
Hgt[i] = H + min over k in [a+1, b] of Hgt[k]

34
Finding LCP - Example
H = 1: assin, assassin, in, n, ssassin, sin, ssin, sassin
H = 2: assassin, assin, in, n, sassin, sin, ssin, ssassin
H = 4: assassin, assin, in, n, sassin, sin, ssassin, ssin
lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1
lcp(sin, ssin) = 1 + lcp(in, sin) = 1 + min(lcp(in, n), lcp(n, sassin), lcp(sassin, sin)) = 1 + 0 = 1
35
LCP Data Structures - Interval Tree

We need the following operations for Hgt
Set(i, h): sets Hgt[i] to h
Min_height(i, j): determines min(Hgt[k]) for k in [i, j]
We need to find a way to find the lcps for all
the necessary suffixes not just the ones in
consecutive positions

36
LCP Data Structures - Interval Tree

A full and balanced binary tree
N-1 leaves, correspond to Hgt
O(logN) height, N-2 interior vertices
Keep a Hgt value for each interior vertex as
well
Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])

37
LCP Data Structures - Interval Tree

Operations implementation
Set(i, h)
Set Hgt[i] to h and update the Hgt values on the
path from i to the root
Min_height(i, j)
Finds the minimal Hgt value by scanning O(log N)
vertices in the tree
Operations complexity O(log N)
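
A minimal sketch of the two operations, using a common array-based tree layout (leaves stored after the interior vertices). The class name, the 0-based leaf indices and the layout are assumptions made for illustration; the slides do not prescribe a particular representation.

```python
class IntervalTree:
    """Binary tree over `size` leaves; every interior vertex holds the min of its children."""

    def __init__(self, size, initial):
        self.size = size
        # tree[size + i] is leaf i; tree[v] for 1 <= v < size are interior vertices.
        self.tree = [initial] * (2 * size)

    def set(self, i, h):
        """Set leaf i to h and refresh the values on the path from i to the root."""
        v = i + self.size
        self.tree[v] = h
        v //= 2
        while v >= 1:
            self.tree[v] = min(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def min_height(self, i, j):
        """Return the minimum leaf value over indices i..j (inclusive)."""
        lo, hi = i + self.size, j + self.size + 1
        best = float("inf")
        while lo < hi:                 # visits O(log size) vertices
            if lo & 1:
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


tree = IntervalTree(5, initial=7)      # e.g. N = 6, so Hgt has 5 entries, initialised to N+1
for i, h in enumerate([1, 1, 3, 0, 2]):
    tree.set(i, h)
print(tree.min_height(0, 2))           # 1
print(tree.min_height(2, 4))           # 0
```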

38
Finding LCP Interval Tree
39
Finding LCP - Complexity

In stage 2H we update Hgt[i] for all the leaves
that started new buckets
Each update is one set operation and one
Min_height - O(logN)
Throughout the algorithm every leaf is updated
exactly once - O(N) updates
Updates complexity O(NlogN)
In each stage we scan the array to see which
suffixes opened new buckets
Scans complexity O(NlogN)
Total LCP complexity O(NlogN)

40
Finding LCP - Llcp and Rlcp

We want Llcp and Rlcp to be available
directly from the interval tree at the end of the
sort
Use an interval tree that represents a binary
search
Each interior node corresponds to (L_M, R_M) for
some M
For each interior node (L_M, R_M)
Left(L_M, R_M) = (L_M, M)
Right(L_M, R_M) = (M, R_M)
N-2 interior nodes
Leaves correspond to (i-1, i)
Leaf(i-1, i) = Hgt[i]

41
Finding LCP - Llcp and Rlcp

According to interval tree structure
Hgt(L, R) = min over k in [L+1, R] of Hgt[k]
Hgt(L, R) = lcp(A_Pos[L], A_Pos[R])
Llcp[M] = Hgt(L_M, M)
Rlcp[M] = Hgt(M, R_M)
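
The relation between the binary-search-shaped tree and Llcp/Rlcp can be sketched by a recursion over the search intervals. This is an illustrative sketch: `build_tables` is a hypothetical name, Pos and Hgt are computed naively, and the recursion stands in for reading the values off an interval tree maintained during the sort.

```python
def lcp(u, v):
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def build_tables(text):
    """Return (pos, hgt, llcp, rlcp) with Llcp[M] = Hgt(L_M, M) and Rlcp[M] = Hgt(M, R_M)."""
    n = len(text)
    pos = sorted(range(n), key=lambda i: text[i:])
    hgt = [0] + [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, n)]
    llcp, rlcp = [0] * n, [0] * n

    def visit(lo, hi):
        """Return Hgt(lo, hi) = min(hgt[lo+1 .. hi]), filling in the midpoints on the way."""
        if hi - lo == 1:
            return hgt[hi]                 # leaf (i-1, i) holds Hgt[i]
        mid = (lo + hi) // 2
        llcp[mid] = visit(lo, mid)         # left child  (L_M, M)
        rlcp[mid] = visit(mid, hi)         # right child (M, R_M)
        return min(llcp[mid], rlcp[mid])

    if n >= 2:
        visit(0, n - 1)
    return pos, hgt, llcp, rlcp


pos, hgt, llcp, rlcp = build_tables("aababa")
print(llcp)   # [0, 1, 1, 3, 0, 0]
print(rlcp)   # [0, 1, 0, 0, 2, 0]
```

Entries at indices 0 and N-1 stay 0 because those positions are never midpoints.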
42
Worst Case Complexity

Suffix Array
Build time
O(NlogN)
Search time
O(P + log N)
Structure space
O(N)
2N to 3N integers
Independent of Σ

Suffix Tree
Build time
O(N)
Search time
O(P)
Structure space
O(N)
Big constant
Depends on the alphabet Σ

43
Expected Time Improvements

Improve the expected case time of
Search Algorithm
Sort Algorithm
LCP computation
Use the following assumptions
All N-symbol strings are equally likely
Under this assumption
Expected length of longest repeated substring of
A is O(log_|Σ| N)

44
Expected Case Improvements - Main Idea

Let T = floor(log_|Σ| N)
Let Int_T(u) be the integer encoding in base |Σ| of the
T-symbol prefix of u
Example
T = 3
Σ = {a, b}
u = abaa
Int_T(u) = 010 (base 2) = 2
There are |Σ|^T <= N possible T-symbol prefixes
Int_T(u) is a number in [0, N-1]
Map each suffix A_p to Int_T(A_p)
Can be done in O(N) time

45
Expected Case Improvements - Search Algorithm

Use an additional array Buck
Think of the sorted array as buckets, based on
the IntT encoding
Buck[k] = min {i : Int_T(A_Pos[i]) = k}
The first position that contains a suffix that's
mapped to k
Compute Buck
at the end of the sort algorithm
O(N) additional time

46
Expected Case Improvements - Search Algorithm

Given a word W
We need to find L_W and R_W
Let k = Int_T(W)
L_W and R_W must be in k's bucket
(between Buck[k] and Buck[k+1])
We only need to search one bucket
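
The bucket-restricted search can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: the function names, an extra sentinel entry Buck[|Σ|^T] = N, empty buckets pointing at the next non-empty one, padding of short suffixes with digit 0, and the requirement that W has at least T symbols, all drawn from the alphabet.

```python
def encode(s, digit, sigma, t):
    """Int_T of the T-symbol prefix of s; short strings are padded with digit 0."""
    code = 0
    for k in range(t):
        code = code * sigma + (digit[s[k]] if k < len(s) else 0)
    return code


def build_buck(text, pos, alphabet, t):
    sigma, n = len(alphabet), len(pos)
    digit = {c: d for d, c in enumerate(alphabet)}
    buck = [n] * (sigma ** t + 1)                 # last entry is the sentinel N
    for i in range(n - 1, -1, -1):                # right to left: the smallest i wins
        buck[encode(text[pos[i]:pos[i] + t], digit, sigma, t)] = i
    for k in range(sigma ** t - 1, -1, -1):       # empty bucket: start of the next one
        if buck[k] == n:
            buck[k] = buck[k + 1]
    return buck


def search_bucket(text, pos, buck, alphabet, t, w):
    """Return (L_W, R_W), binary searching only inside the bucket of Int_T(W)."""
    digit = {c: d for d, c in enumerate(alphabet)}
    k = encode(w, digit, len(alphabet), t)
    p = len(w)
    lo, hi = buck[k], buck[k + 1]
    while lo < hi:                                # L_W within the bucket
        mid = (lo + hi) // 2
        if w <= text[pos[mid]:pos[mid] + p]:
            hi = mid
        else:
            lo = mid + 1
    left = lo
    lo, hi = buck[k], buck[k + 1]
    while lo < hi:                                # R_W within the bucket
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    return left, lo - 1


text, alphabet, t = "aababa", "ab", 2             # T = floor(log_2 6) = 2
pos = sorted(range(len(text)), key=lambda i: text[i:])
buck = build_buck(text, pos, alphabet, t)
print(buck)                                             # [0, 2, 4, 6, 6]
print(search_bucket(text, pos, buck, alphabet, t, "aba"))   # (2, 3)
```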

47
Expected Case Improvements - Search Algorithm

Number of buckets: |Σ|^T <= N
Average number of elements in a bucket O(1)
In the binary search for W
Expected size of bucket to search O(1)
Expected number of search steps O(1)
Expected case time O(P)

48
Expected Case Improvements - Sort Algorithm

First stage of sort
Sort according to first symbol
Replace first stage with sort according to Int_T
Equivalent to sort according to first T symbols
Can be done in O(N) time
We changed the base case of the sort from H = 1 to
H = T

49
Expected Case Improvements - Sort Algorithm

Observation
Let C be the length of the longest repeated
substring of A
Sort is in fact complete once we have reached
(C+1)-buckets
Suppose some (C+1)-bucket contains more than one
suffix
Then we have two suffixes with lcp > C
Their common prefix is a repeated substring longer than C
- contradiction

50
Expected Case Improvements - Sort Algorithm

Expected case
C = O(log_|Σ| N) = O(T)
Number of stages O(1)
Expected case time O(N)

51
Expected Case Improvements - LCP Computation

Replace interval tree with sort history
Binary tree
Models the refinement of buckets during the sort
A vertex for each H-bucket
Each vertex holds the stage number at which its
bucket was split

52
Expected Case Improvements - LCP Computation

Leaves correspond to suffixes and are arranged in
an N element array
Each vertex has at least two children
O(N) nodes
Can be built with O(N) additional time during the
sort

53
Expected Case Improvements - LCP Computation

Given the sort history we can compute lcp(A_p, A_q)
Find the nca (nearest common ancestor) of A_p and
A_q
Let H be the nca's stage number
lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
Recursively compute lcp(A_{p+H}, A_{q+H})
Stop when the nca is the root

54
Expected Case Improvements - LCP Computation

Each step is O(1)
At each step the stage number of the nca is at
least halved
Suppose we stop the recursion when H < T
Expected length of longest repeated substring is
O(T)
Expected case lcp is O(T) = O(log_|Σ| N)

55
Expected Case Improvements - LCP Computation

O(1) recursive steps in the expected case
Expected case time for one lcp O(1)
Expected case time for computing Llcp, Rlcp
O(N)

56
Expected Case Improvements - LCP Computation

We need a way to find lcps that are known to be
less than T
Build a |Σ|^T x |Σ|^T array
Lookup[Int_T(x), Int_T(y)] = lcp(x, y) for all
T-symbol strings x, y
At most N entries (|Σ|^T <= sqrt(N))
Compute incrementally in O(N)
Final recursion steps are replaced by O(1) lookup
