Title: Suffix Arrays
1Suffix Arrays
- A New Method for Online String Searches
- U.Manber and G.Myers
2Introduction - String matching
- Let A a0a1...aN-1 be a large text of length N
- Let W w0w1...wp-1 be a word of length P
- Is W a substring of A?
3Introduction - Suffix Trees
- Build time
- O(N)
- Search time
- O(P)
- Structure space
- O(N)
- Big constant
- Dependent of S
4Suffix Arrays
- An array of all the suffixes of A
- Sorted by lexicographical order
- A aababa
baba ba ababa aba aababa a
5Suffix Arrays
- Ai aiai1...aN-1
- The suffix of A that starts at position i.
- Position array (Pos)
- Posk is the start position of kth smallest
suffix - APosk is the suffix pointed from Posk
- APosk is the kth smallest suffix
-
- Pos
A aababa
0 1 2 3 4 5
5 0 3 1 4 2
0 1 2 3 4 5
6Searching
- Is W a substring of A?
- W is a substring of A Some suffix Ai starts
with W - i is Ws location
- All the instances of W must match consecutive
suffixes in the array - Find the array interval that contains those
suffixes
7Searching - Definitions
- For a string u
- up u0u1...up-1
- For strings u,v
- u p v up vp
- Same for ?, , gt
- For any p, Pos is ordered according to p
8Searching - Definitions
- W w0w1wP-1
- LW min (k W p APosk or k N)
- First suffix p from W
- RW max (k APosk p W or k 1)
- Last suffix p from W
9Search Algorithm
- k LW, RW W p APosk
- To find Ws instances - find LW, RW
- Number of Ws occurrences is (RW-LW1)
- Matches are APosLW,, APosRW
- Suffix array is sorted - use binary search
10Binary Search
- Search interval L,R
- Midpoint M
- Compare W to APosM
- Decide where to search next
- W p APosM - search in left half (R M)
- W gtp APosM - search in right half (L M)
- O(PlogN)
cbb bcd abc aab
W abc
L
M
R
11Search Algorithm
- Observation
- We can use information from one comparison to
speedup the next comparisons - Use additional information
- lcp longest common prefix
12Search Algorithm - lcp
- lcp(v,w) the length of the longest common
prefix of v and w - Obtained by comparing v and w and stopping at the
first unequal symbol - Use precomputed lcp information to reduce the
number of comparisons to O(P logN)
13Search Algorithm
- Consider all possible midpoints
- M 1N-2
- Every midpoint corresponds to a triplet LM,M,RM
- Suppose we precomputed two arrays
- LlcpM lcp (APosLM, APosM)
- RlcpM lcp (APosM, APosRM)
14Search Algorithm
- Maintain two more variables
- l lcp(APosL, W)
- r lcp(W, APosR)
- W abcd
ad acd acb aca ac abcd abc abb abaa
15Search Algorithm
- Assume lr
- Compare l with Llcp
- If l lt LlcpM
- W gtl1 APosLM
- APosLM l1 APosM
- W gtl1 APosM
ad acd ac abcd abac ababa abab abaa aba
16Search Algorithm
- If l gt LlcpM
- APosLM ltl APosM
- W l APosLM
- W ltl APosM
W abcd
adc adb ada ad aca ac abd abcd aba
17Search Algorithm
- If l LlcpM
- W can be in either half
- Start comparing A and APosM from the (l1)
symbol - First unequal symbol determines whether to go
right or left - r/l will be updated to lj
- j1 comparisons
W abcd
adc adb ada abcd abcc abc abaa aba ab
18Search Algorithm - Complexity
- In each Iteration
- Let hmax(l,r)
- We start comparing from the hth symbol to the
hj1 - j1 symbol comparisons
- Next time we will start from the hj symbol
- j symbols out of the j1 will not be compared
again
19Search Algorithm - Complexity
- Every symbol in W will be successfully matched at
most once - O(P) successful comparisons
- At most one symbol will be unsuccessfully matched
in each iteration - O(logN) unsuccessful comaprsions
- Total O(P logN) comparisons
20Build Suffix Array
- So far
- A O(P logN) search algorithm
- Given a sorted suffix array
- Given lcp information (Llcp, Rlcp)
- Next
- Sort the suffix array in O(NlogN)
- Compute the lcps while sorting the array
21Sort Algorithm
- First stage
- Sort the suffixes into buckets, according to
first symbol - Inductive stage
- Assume array is bucket sorted according to first
H symbols - Every H-bucket holds suffixes with the same H
first symbols - Buckets are ordered according to the H relation
- Sort according to 2H first symbols
22Sort Algorithm Intuition
- Let Ai, Aj be two suffixes in the same H-bucket
- Ai H Aj
- Next H symbols of Ai and Aj are the first H
symbols of AiH and AjH - In order to determine the 2H order of Ai and Aj,
look at the H order of AiH and AjH
A aababaa
baa babaa ababaa abaa aababaa aa a
H 2
Ai
Aj
AjH
AiH
23Sort Algorithm Main Idea
- Let Ai be a suffix in the first H-bucket
- Ai starts with the smallest H-symbol string
- Ai-H should be the first in its 2H-bucket
A aababa
H 1
ba baba aababa aba ababa a
24Sort Algorithm
- In stage H
- Go over all the suffixes in the H order
- For each Ai move Ai-H to the next available place
in its H-bucket - The suffixes are now sorted according to the 2H
order - Go on to stage 2H to produce 4H order
25Sort Algorithm - Example
A assassin
0 1 2 3 4 5 6 7
A3 A0 A6 A7 A1 A5 A4 A2
in
n
sin
ssassin
ssin
sassin
assassin
assin
H 2
26Sort Algorithm - Example
A0 A3 A6 A7 A2 A5 A4 A1
A0 A3 A6 A7 A2 A5 A1 A4
27Sort Algorithm - Complexity
- First Stage
- Bucket sort according to first symbol
- O(NlogN)
- Inductive Stages
- O(logN) stages
- O(N) per stage
- Total O(NlogN)
- Space
- Can be implemented using two N-sized integer
arrays
28Finding Longest Common Prefixes
- The search algorithm uses lcp information
- LlcpM lcp (APosLM, APosM)
- RlcpM lcp (APosM, APosRM)
- We want to compute this information while we are
sorting the array
29Finding Longest Common Prefixes
- Show how to compute lcps for suffixes in
adjacent H-buckets during the sort algorithm - Use that to compute the lcps of all the suffixes
that are consecutive in the sorted suffix array - Show how to compute lcps for all the necessary
suffixes
30Finding LCP for adjacent buckets
- After the first sort stage, lcps of suffixes in
adjacent buckets is 0 - Assume after stage H we know the lcps between
suffixes in adjacent H-buckets - Suppose Ap and Aq are in the same H-bucket but
not in the same 2H bucket - H lcp(Ap, Aq) lt 2H
- lcp(Ap, Aq) H lcp(ApH, AqH)
- lcp(ApH, AqH) lt H
31Finding LCP for adjacent buckets
- Let i,j be ApH, AqHs positions in the suffix
array - Assume iltj
- Array is ordered according to the ltH order
- lcp(APosi, APosj) min(lcp(APosk-1,
APosk))
ba baba aababa aba ababa a
32LCP Data Structures Hgt
- We need a data structure that will allow us
- get the lcps of consecutive suffixes
- get their minimum
- Hgt an N-1 sized array
- Hgti lcp(APosi-1, APosi)
33LCP Data Structures Hgt
- Hgt will be computed inductively throughout the
sort - Initialized to N1
- Hgti is updated in stage 2H APosi started a
new 2H-bucket - To update Hgti
- Let a,b be the array positions of APosi-1H
and APosi H - Assume ab
- Hgti H min(Hgtk)
34Finding LCP - Example
sassin ssin sin ssassin n in assassin assin
H 1
ssassin ssin sin sassin n in assin assassin
H 2
1
1
lcp (sin, ssin) 1 lcp(in, sin) 1
min(lcp(in,n), lcp(n,sassin), lcp(sassin,sin) 1
0 1
lcp(sassin,sin) 1 lcp(assin, in) 1
ssin ssassin sin sassin n in assin assassin
H 4
35LCP Data Structures - Interval Tree
- We need the following operations for Hgt
- Set(i, h) sets Hgti to h
- Min_height(i,j) determines min(Hgtk)
- We need to find a way to find the lcps for all
the necessary suffixes not just the ones in
consecutive positions
36LCP Data Structures - Interval Tree
- A full and balanced binary tree
- N-1 leaves, correspond to Hgt
- O(logN) height, N-2 interior vertices
- Keep a Hgt value for each interior vertex as
well - Hgtv min(Hgtleft(v), Hgtright(v))
37LCP Data Structures - Interval Tree
- Operations implementation
- Set(i,h)
- Set Hgti to h and update the Hgt values on the
path from i to the root - Min-height(i,j)
- Finds the minimal Hgt value by scanning O(logN)
vertices in the tree - Operations complexity O(logN)
38Finding LCP Interval Tree
39Finding LCP - Complexity
- In stage 2H we update Hgti for all the leaves
that started new buckets - Each update is one set operation and one
Min_height - O(logN) - Throughout the algorithm every leaf is updated
exactly once - O(N) updates - Updates complexity O(NlogN)
- In each stage we scan the array to see which
suffixes opened new buckets - Scans complexity O(NlogN)
- Total LCP complexity O(NlogN)
40Finding LCP - Llcp and Rlcp
- We want Llcp and Rlcp to be available
directly from the interval tree at the end of the
sort - Use an interval tree that represents a binary
search - Each interior node corresponds to (LM, RM) for
some M - For each interior node (LM, RM)
- Left(LM, RM) (LM,M)
- Right(LM, RM) (M, RM)
- N-2 interior nodes
- Leaves correspond to (i-1,i)
- Leaf(i-1,i) Hgti
41Finding LCP - Llcp and Rlcp
- According to interval tree structure
- Hgt(L,R) min(Hgtk)
- Hgt(L,R) lcp (APosL, APosR)
- LlcpM Hgt(LM,M)
- RlcpM Hgt(M,RM)
k L1,R
42Worst Case Complexity
- Suffix Array
- Build time
- O(NlogN)
- Search time
- O(PlogN)
- Structure space
- O(N)
- 2N - 3N integers
- Independent of S
- Suffix Tree
- Build time
- O(N)
- Search time
- O(P)
- Structure space
- O(N)
- Big constant
- Dependent of S
43Expected Time Improvements
- Improve the expected case time of
- Search Algorithm
- Sort Algorithm
- LCP computation
- Use the following assumptions
- All N-symbol strings are equally likely
- Under this assumption
- Expected length of longest repeated substring of
A is O(logSN)
44Expected Case Improvements - Main Idea
- Let T
- Let IntT(u) integer encoding in base S of the
T-symbol prefix of u - Example
- T 3
- S a,b
- u abaa
- IntT(u) 010 2
- There are ST N possible T-symbol prefixes
- IntT(u) is a number in 0,N-1
- Map each suffix Ap to IntT(Ap)
- Can be done in O(N) time
45Expected Case Improvements - Search Algorithm
- Use an additional array Buck
- Think of the sorted array as buckets, based on
the IntT encoding - Buckk min i IntT (APosi) k
- The first position that contains a suffix thats
mapped to k - Compute Buck
- at the end of the sort algorithm
- O(N) additional time
46Expected Case Improvements - Search Algorithm
- Given a word W
- We need to find Lw and Rw
- Let k IntT(W)
- Lw and Rw must be in ks bucket
- (Buckk, Buckk1)
- We only need to search one bucket
47Expected Case Improvements - Search Algorithm
- Number of buckets ST N
- Average number of elements in a bucket O(1)
- In the binary search for W
- Expected size of bucket to search O(1)
- Expected number of search steps O(1)
- Expected case time O(P)
48Expected Case Improvements - Sort Algorithm
- First stage of sort
- Sort according to first symbol
- Replace first stage with sort according to IntT
- Equivalent to sort according to first T symbols
- Can be done in O(N) time
- We changed the base case of the sort from H1 to
HT
49Expected Case Improvements - Sort Algorithm
- Observation
- Let C be the length of the longest repeated
substring of A - Sort is in fact complete once we have reached
(C1)-buckets - Suppose some (C1)-bucket contains more than one
suffix - Then we have two suffixes with lcp gt C
- This prefix is a repeated substring longer than C
- contradiction
50Expected Case Improvements - Sort Algorithm
- Expected case
- C O(logSN) O(T)
- Number of stages O(1)
- Expected case time O(N)
51Expected Case Improvements - LCP Computation
- Replace interval tree with sort history
- Binary tree
- Models the refinement of buckets during the sort
- A vertex for each H-bucket
- Each vertex holds the stage number at which its
bucket was split
52Expected Case Improvements - LCP Computation
- Leaves correspond to suffixes and are arranged in
an N element array - Each vertex has at least two children
- O(N) nodes
- Can be built with O(N) additional time during the
sort
53Expected Case Improvements - LCP Computation
- Given the sort history we can compute lcp(Ap, Aq)
- Find the nca (nearest common ancestor) of Ap and
Aq - Let H be the ncas stage number
- lcp(Ap, Aq) H lcp(ApH, AqH)
- Recursively compute lcp(ApH, AqH)
- Stop when the nca is the root
54Expected Case Improvements - LCP Computation
- Each step is O(1)
- At each step the stage number of the nca is at
least halved - Suppose we stop the recursion when H lt T
- Expected length of longest repeated substring is
O(T) - Expected case lcp is O(T) O(logSN)
55Expected Case Improvements - LCP Computation
- O(1) recursive steps in the expected case
- Expected case time for one lcp O(1)
- Expected case time for computing Llcp, Rlcp
O(N)
56Expected Case Improvements - LCP Computation
- We need a way to find lcps that are known to be
less than T - Build a ST x ST array
- LookupIntT(x), IntT(y) lcp(x,y) for all
T-symbol strings x,y - Max N entries (ST vN)
- Compute incrementally in O(N)
- Final recursion steps are replaced by O(1) lookup
57Expected Time Complexity
- Search time
- O(P)
- Sort LCP computation time
- O(N)