Title: Succinct Data Structure
1Succinct Data Structure
2006/07/05 ?3?????????????_at_???
- ??? ??
- ???????????????????????
- hillbig_at_is.s.u-tokyo.ac.jp
2??
- ???????
- ???????????
- ????????
- ??????????????
- ??????????????????
?? ?????????????????????????
3?????
- ??
- Succinct Data Structure (SDS)
- ????????SDS
- ???????SDS
- ???????SDS
- SDS??????????
- Suffix Arrays, Burrows-Wheeler transform
- FM-index, Compressed Suffix Arrays
- ????????
5??(1/2)
- ????? word RAM
- ????n?? log n ???????????
- ? ?? n232 64 ???64bit???????
- ???????
- Information-theoretic lower bound L of a set D: L = log(number of elements of D)
- ?????????????????D??????????????
- Example: for ordered trees T with n nodes, L = 2n - Θ(log n)
6??(2/2)???????? Manzini 01
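- H_0: 0th-order empirical entropy, H_0(T) = Σ_c (n_c / n) log(n / n_c)
- n_c: number of occurrences of c in T
- H_k: k-th order empirical entropy, H_k(T) = (1/n) Σ_{s ∈ Σ^k} |T_s| H_0(T_s)
- T_s: for s ∈ Σ^k, the string of characters that immediately precede the occurrences of s in T
- H_k ≤ H_{k-1} ≤ ... ≤ H_1 ≤ H_0 ?1?????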
- ????????????????????
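7 Example
- T = aacbbcbc
- |T| = 8, n_a = 2, n_b = 3, n_c = 3
- Case k = 0
- H_0(T) = (2/8) log(8/2) + (3/8) log(8/3) + (3/8) log(8/3) ≈ 1.56 bits (log base 2; 0.47 with log base 10)
- Case k = 2
- S^2 = {ac, cb, bb, bc, c$, $}
- T_ac = a, T_cb = ab, T_bb = c, T_bc = bc, T_c$ = b, T_$ = c
- H_2(T) = (1/8)·0 + (2/8)·1 + (1/8)·0 + (2/8)·1 + (1/8)·0 + (1/8)·0 = 0.5 bits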
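The example can be checked with a few lines of code. The sketch below is only an illustration: the function names h0 and hk are made up here, logarithms are base 2, and contexts that run past the end of T are padded with '$' (an assumption of this sketch).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Zeroth-order empirical entropy of s, in bits per symbol (log base 2).
double h0(const std::string& s) {
    if (s.empty()) return 0.0;
    std::map<char, int> freq;
    for (char c : s) ++freq[c];
    const double n = static_cast<double>(s.size());
    double h = 0.0;
    for (const auto& kv : freq) {
        h += (kv.second / n) * std::log2(n / kv.second);
    }
    return h;
}

// k-th order empirical entropy: each symbol is grouped by the k symbols
// that follow it (its context s), giving the strings T_s, and
// H_k = (1/n) * sum over s of |T_s| * H_0(T_s).
// Contexts that run past the end of t are padded with '$' (assumption).
double hk(const std::string& t, std::size_t k) {
    assert(!t.empty());
    const std::string padded = t + std::string(k, '$');
    std::map<std::string, std::string> groups;  // context s -> T_s
    for (std::size_t i = 0; i < t.size(); ++i) {
        groups[padded.substr(i + 1, k)].push_back(t[i]);
    }
    double total = 0.0;
    for (const auto& kv : groups) {
        total += static_cast<double>(kv.second.size()) * h0(kv.second);
    }
    return total / static_cast<double>(t.size());
}
```

For T = aacbbcbc, h0 gives about 1.56 and hk with k = 2 gives 0.5; with k = 0 every symbol falls into the single empty context, so hk reduces to h0.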
- ?????H5?? ??0.23 DNA0.24 XML0.10
9Succinct Data Structure (SDS)??
- ???????D ??????????
- ??????? ??????????????
- ????????? ??????????
- Information-theoretic lower bound: L = log(number of elements of D)
- Space used by the data structure: (1 + o(1))L bits
- ??????
- ?????O(1)????o(L)?????????????????
- ???????(??) ???????????????????
10????(??)????SDS
- Bit vector B[0..n-1], where each B[i] is 1 or 0
- For a set D ⊆ {0, ..., n-1}: B[i] = 1 if i ∈ D, otherwise B[i] = 0
- Supported operations:
- lookup(B, i): returns B[i]
- rank1(B, i): number of 1s in B[0..i]
- select1(B, j): position of the (j+1)-th 1 in B
Example: B[0..20] = 000100101010010010010
lookup(B, 0) = 0, lookup(B, 6) = 1
rank1(B, 10) = 4, rank1(B, 15) = 5
select1(B, 0) = 3, select1(B, 4) = 13
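The three operations can be pinned down with linear-time reference versions. This is a sketch for checking the conventions only (rank counts B[0..i] inclusive, select counts from j = 0); the bit vector is held as a string of '0'/'1' characters for readability, which is a choice made here and not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// lookup(B, i): the bit B[i].
int lookup(const std::string& B, std::size_t i) {
    assert(i < B.size());
    return B[i] - '0';
}

// rank1(B, i): number of 1s in B[0..i], both ends inclusive.
int rank1(const std::string& B, std::size_t i) {
    assert(i < B.size());
    int count = 0;
    for (std::size_t p = 0; p <= i; ++p) count += (B[p] == '1');
    return count;
}

// select1(B, j): position of the (j+1)-th 1 in B, or -1 if there is none.
int select1(const std::string& B, int j) {
    for (std::size_t p = 0; p < B.size(); ++p) {
        if (B[p] == '1' && j-- == 0) return static_cast<int>(p);
    }
    return -1;
}
```

With B = 000100101010010010010 these functions return the values listed in the example; a succinct structure answers the same queries in O(1) with o(n) extra bits.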
11????(??)????SDS???
- ??????
- ??????????????
- ?????????????????
- ?????????????(c.f. word RAM ???)
- ????????
- n + o(n) bits Jacobson 89, M96
- H_0(B) + o(n) bits Grossi 02
- ??????????
- ??????????????Gonzalez 05 Kim 05
12????(??)????SDS???rank??
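- Divide B into blocks of length log^2 n (SB: Super-Block)
- Divide each SB into blocks of length (log n)/2 (TB: Tiny-Block)
- For each SB, store the rank at its start: O(n / log n) bits
- For each TB, store the rank relative to the start of its SB (relative rank): O(n log log n / log n) bits
- The rank inside a TB is answered by a lookup table or popCount (????)
- rank(B, i) = SB[i / log^2 n] + TB[i / ((log n)/2)] + (rank inside the TB)
[Figure: B divided into SBs of length log^2 n, each SB divided into TBs of length (log n)/2]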
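A minimal sketch of the two-level directory follows. The block sizes are fixed constants (TB = one 32-bit word, SB = 8 words) instead of log^2 n and (log n)/2, and the class name RankDirectory is hypothetical; both are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Two-level rank directory: absolute counts per super-block (SB),
// relative counts per tiny-block (TB), popcount inside the TB.
class RankDirectory {
public:
    static constexpr std::size_t kTinyBits = 32;     // one TB = one word
    static constexpr std::size_t kTinyPerSuper = 8;  // one SB = 256 bits

    explicit RankDirectory(const std::string& bits) : n_(bits.size()) {
        words_.assign((n_ + kTinyBits - 1) / kTinyBits, 0);
        for (std::size_t i = 0; i < n_; ++i) {
            if (bits[i] == '1') words_[i / kTinyBits] |= (1u << (i % kTinyBits));
        }
        std::uint32_t total = 0, inSuper = 0;
        for (std::size_t w = 0; w < words_.size(); ++w) {
            if (w % kTinyPerSuper == 0) {  // a new super-block starts here
                super_.push_back(total);
                inSuper = 0;
            }
            tiny_.push_back(static_cast<std::uint16_t>(inSuper));
            const std::uint32_t c = popCount(words_[w]);
            total += c;
            inSuper += c;
        }
    }

    // Number of 1s in B[0..i], inclusive.
    std::uint32_t rank1(std::size_t i) const {
        assert(i < n_);
        const std::size_t w = i / kTinyBits;
        const std::size_t used = i % kTinyBits + 1;  // bits of word w to count
        const std::uint32_t mask =
            used == kTinyBits ? 0xFFFFFFFFu : ((1u << used) - 1u);
        return super_[w / kTinyPerSuper] + tiny_[w] + popCount(words_[w] & mask);
    }

private:
    // Bit-parallel population count of a 32-bit word.
    static std::uint32_t popCount(std::uint32_t r) {
        r = ((r & 0xAAAAAAAAu) >> 1) + (r & 0x55555555u);
        r = ((r & 0xCCCCCCCCu) >> 2) + (r & 0x33333333u);
        r = ((r >> 4) + r) & 0x0F0F0F0Fu;
        r = (r >> 8) + r;
        return ((r >> 16) + r) & 0x3Fu;
    }

    std::size_t n_;
    std::vector<std::uint32_t> words_;  // the bits, 32 per word
    std::vector<std::uint32_t> super_;  // rank before each SB (absolute)
    std::vector<std::uint16_t> tiny_;   // rank before each TB, relative to its SB
};
```

A query touches one SB entry, one TB entry and one word, so it runs in constant time regardless of n; for the 21-bit example vector, rank1(10) is 4 and rank1(15) is 5.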
13 popcount(x): the number of 1 bits in x
    unsigned int popCount(unsigned int r) {
        r = ((r & 0xAAAAAAAA) >> 1) + (r & 0x55555555);  // sums of 2 bits
        r = ((r & 0xCCCCCCCC) >> 2) + (r & 0x33333333);  // sums of 4 bits
        r = ((r >> 4) + r) & 0x0F0F0F0F;                 // sums of 8 bits
        r = (r >> 8) + r;
        return ((r >> 16) + r) & 0x3F;
    }
- 0xAAAAAAAA = 1010101010...10 (binary)
- 0x55555555 = 0101010101...01 (binary)
- 0xCCCCCCCC = 1100110011...00 (binary)
- 0x33333333 = 0011001100...11 (binary)
- 0x0F0F0F0F = 0000111100...11 (binary)
14????(??)????SDS???select??
- With rank: O(log n) time
- select1(B, j) = the k with rank1(B, k-1) ≤ j < rank1(B, k), found by binary search
- o(n)???????????????
- 1?1??????n??????
- Algorithm I Kim 2005
- B???log1/2n????????
- 1???????????????
- log2n???1??????????
- log1/2n???1????????????(1?1??????2log1/2?????????
?) - 1??????????????????
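The O(log n) variant that reduces select to rank can be sketched as a binary search. Here rank is precomputed into a plain prefix array for simplicity (an assumption of this sketch; a succinct rank directory would take its place), and the names buildRank and select1 are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// prefix[i] holds rank1(B, i), the number of 1s in B[0..i] (inclusive).
std::vector<int> buildRank(const std::string& B) {
    std::vector<int> prefix(B.size());
    int count = 0;
    for (std::size_t i = 0; i < B.size(); ++i) {
        count += (B[i] == '1');
        prefix[i] = count;
    }
    return prefix;
}

// select1(j): the smallest k with rank1(B, k) >= j + 1, found by binary
// search over the non-decreasing rank values.
// Returns -1 when B has fewer than j + 1 ones.
int select1(const std::vector<int>& prefix, int j) {
    assert(j >= 0);
    if (prefix.empty() || prefix.back() < j + 1) return -1;
    std::size_t lo = 0, hi = prefix.size() - 1;  // the answer lies in [lo, hi]
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (prefix[mid] >= j + 1) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return static_cast<int>(lo);
}
```

On B = 000100101010010010010 this returns 3 for j = 0 and 13 for j = 4, the same answers as the direct definition of select.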
15?????SDSJacobson 85 Munro 01 Geary 05
Benoit 05
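- An ordered tree with n nodes can be represented in 2n bits
- A pointer-based representation takes O(n log n) bits (?????????????????? 96n bits)
- Balanced Parenthesis (BP) Munro 01 Geary 05
- Traverse the tree by DFS, writing "(" when a node is entered and ")" when it is left
- Depth First Unary Degree Sequence (DFUDS)
- Write one "(" first; then visit the nodes in DFS order and write a node with k children as k "(" followed by one ")"
BP:    (()(()()))  0010010111
DFUDS: ((())(()))  0001100111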
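Both encodings are easy to generate from an explicit tree. The sketch below assumes the tree is given as child lists (children[v] holds the children of node v from left to right); that input format and the function names are choices made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// children[v] lists the children of node v in left-to-right order.
using Tree = std::vector<std::vector<int>>;

// BP: write '(' when the DFS enters a node and ')' when it leaves.
void appendBP(const Tree& children, int v, std::string& out) {
    out.push_back('(');
    for (int c : children[v]) appendBP(children, c, out);
    out.push_back(')');
}

// DFUDS body: for each node in preorder, its degree k written as
// k '(' followed by a single ')'.
void appendDFUDS(const Tree& children, int v, std::string& out) {
    out.append(children[v].size(), '(');
    out.push_back(')');
    for (int c : children[v]) appendDFUDS(children, c, out);
}

std::string encodeBP(const Tree& children, int root) {
    assert(!children.empty());
    std::string out;
    appendBP(children, root, out);
    return out;
}

std::string encodeDFUDS(const Tree& children, int root) {
    assert(!children.empty());
    std::string out = "(";  // the extra leading parenthesis
    appendDFUDS(children, root, out);
    return out;
}
```

For the 5-node tree whose root has a leaf child and a child with two leaves, the outputs are (()(()())) and ((())(())), both 2n = 10 symbols long.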
16BP?????????
- ?????child,childrank???????????
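- parent(x): parent of x
- firstchild(x): first child of x
- sibling(x): next sibling of x
- depth(x): depth of x (???????)
- desc(x): number of descendants of x
- rank(x): preorder rank of x
- select(i): the node whose preorder rank is i
- LA(x, d): ancestor of x at depth d (level-ancestor)
- lca(x, y): lowest common ancestor of x and y
- degree(x): number of children of x
- child(x, i): i-th child of x
- childrank(x): rank of x among the children of its parent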
???????????
DFUDS?lca, depth, LA??????
17BP???
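- A tree with n nodes is represented by a parenthesis sequence B[0..2n-1]
- Traverse the tree by DFS, writing "(" when a node is entered and ")" when it is left
- (??????????????B??rank(? select( ?BP???????????????
- Divide B into blocks of size M = (log n)/2: B_0, ..., B_{(2n-1)/M}
- m(x) (also written µ(x)): position of the parenthesis matching x
- b(x): number of the block containing x
Example blocks: (()(((  ))()()  )())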
18BP???????
- BP???????????(???????)
- findopen(x), findclose(x): position of the parenthesis matching x
- enclose(x): the innermost pair of parentheses that encloses x
- Tree operations expressed with these:
- parent(x) = enclose(x)
- sibling(x) = findclose(x) + 1
- first-child(x) = x + 1
parent
(()(()()))
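The reductions above can be tried out with linear-scan versions of findclose and enclose. This is a reference sketch only: the succinct structure answers the same queries in O(1), a node is identified by the position of its "(", and the -1 return value for "no such node" is a convention added here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Position of the ')' matching the '(' at x, or -1 if bp is unbalanced.
int findclose(const std::string& bp, int x) {
    assert(x >= 0 && static_cast<std::size_t>(x) < bp.size() && bp[x] == '(');
    int depth = 0;
    for (std::size_t p = static_cast<std::size_t>(x); p < bp.size(); ++p) {
        depth += (bp[p] == '(') ? 1 : -1;
        if (depth == 0) return static_cast<int>(p);
    }
    return -1;
}

// Position of the '(' of the innermost pair enclosing the '(' at x,
// or -1 if x is the root.
int enclose(const std::string& bp, int x) {
    assert(x >= 0 && static_cast<std::size_t>(x) < bp.size() && bp[x] == '(');
    int depth = 0;
    for (int p = x - 1; p >= 0; --p) {
        depth += (bp[p] == ')') ? 1 : -1;
        if (depth < 0) return p;  // an unmatched '(' to the left of x
    }
    return -1;
}

int parent(const std::string& bp, int x) { return enclose(bp, x); }

// x + 1 is the first child unless x is a leaf (then bp[x + 1] is ')').
int firstChild(const std::string& bp, int x) {
    assert(static_cast<std::size_t>(x) + 1 < bp.size());
    return bp[x + 1] == '(' ? x + 1 : -1;
}

// findclose(x) + 1 is the next sibling if a '(' starts there.
int nextSibling(const std::string& bp, int x) {
    const int next = findclose(bp, x) + 1;
    const bool inside = next > 0 && static_cast<std::size_t>(next) < bp.size();
    return (inside && bp[next] == '(') ? next : -1;
}
```

On (()(()())) the node at position 3 has parent 0, first child 4 and no next sibling, while the leaf at position 1 has next sibling 3.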
19findclose???(1/4)Munro 01 Geary 05
- ???x ????????
- x is near ⇔ b(x) = b(m(x)): the matching parenthesis is in the same block
- x is far ⇔ b(x) ≠ b(m(x))
- findclose(x) (???????)
- If x is near: table lookup, with a table of size O(n^{1/2} (log n)^2) = o(n)
- x?far????????
- ??????far???????nlogn?????
Example in which every parenthesis is far: ((((((  ((((((  ))))))  ))))))
20 findclose???(2/4)
- ???far?????????????
- x is a pioneer: x is far and, for the far parenthesis x' immediately preceding x, b(m(x)) ≠ b(m(x'))
- If m(x) is a pioneer, x is also a pioneer
- Bound: with T blocks, there are at most 4T - 6 pioneers
- ?? pioneer?????????????????????? ??????????
[Figure: (()(((  ))()()  )()) with one "(" labelled pioneer and another "(" labelled far]
21 findclose???(3/4)
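- The sequence that consists only of the pioneers is called BP2
- The size of BP2 is O(n / log n)
- BP2 is itself a BP sequence (recursion)
- A bit vector P[0..2n-1] maps between BP and BP2
- P[i] = 1 if B[i] is a pioneer, otherwise P[i] = 0
- rank and select on P convert between a pioneer in BP and its position in BP2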
22 findclose???(4/4)
- findclose(x) (???????)
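- If x is near: table lookup
- If x is a pioneer: answer the query on BP2
- If x is far:
  (1) find the pioneer x' that precedes x: x' = select(P, rank(P, x) - 1)
  (2) compute y = µ(x') through BP2 and take its block b(y) (b(y) = b(µ(x)))
  (3) from the depth difference between x and x', locate µ(x) inside block b(µ(x)) by table lookup
- enclose is handled in the same way as findclose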
23BP??
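BP   (()(((  ))()()  )())
P    100010  010000  0001
BP2  (())
[Figure: in BP, one "(" is labelled pioneer and another "(" is labelled far]
findclose(4): B[4] is a pioneer, so the query moves to BP2 at position rank(P, 4) - 1 = 1; findclose there returns 2, and select(P, 2) = 7
findclose(3): B[3] is far, so the preceding pioneer B[0] is looked up first; its match gives the block in which the answer (position 12) is then searched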
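The bit vector P and the sequence BP2 of the example can be recomputed with a short program. The sketch assumes one common formulation of the definition (an opening far parenthesis is a pioneer when its match lies in a different block than the match of the previous opening far parenthesis, and the match of a pioneer is a pioneer too); the function names and the use of strings for P are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// match[i] = position of the parenthesis matching position i (stack-based).
std::vector<int> matchTable(const std::string& bp) {
    std::vector<int> match(bp.size(), -1), open;
    for (std::size_t i = 0; i < bp.size(); ++i) {
        if (bp[i] == '(') {
            open.push_back(static_cast<int>(i));
        } else {
            assert(!open.empty());  // the input must be balanced
            match[i] = open.back();
            match[open.back()] = static_cast<int>(i);
            open.pop_back();
        }
    }
    assert(open.empty());
    return match;
}

// Returns {P, BP2}: P marks the pioneers of bp for the given block size,
// BP2 is the subsequence of bp at the marked positions.
std::pair<std::string, std::string> pioneers(const std::string& bp,
                                             std::size_t blockSize) {
    assert(blockSize > 0);
    const std::vector<int> match = matchTable(bp);
    std::string P(bp.size(), '0');
    long prevFarBlock = -1;  // block of the match of the previous far '('
    for (std::size_t x = 0; x < bp.size(); ++x) {
        if (bp[x] != '(') continue;
        const std::size_t m = static_cast<std::size_t>(match[x]);
        const long target = static_cast<long>(m / blockSize);
        if (target == static_cast<long>(x / blockSize)) continue;  // near
        if (target != prevFarBlock) {  // far, and the first into this block
            P[x] = '1';
            P[m] = '1';
        }
        prevFarBlock = target;
    }
    std::string bp2;
    for (std::size_t i = 0; i < bp.size(); ++i) {
        if (P[i] == '1') bp2.push_back(bp[i]);
    }
    return {P, bp2};
}
```

With the 16-symbol example and block size 6, the pioneers are positions 0, 4, 7 and 15, which gives P = 100010 010000 0001 and BP2 = (()).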
24???????SDS
- For T[0..n-1] with T[i] ∈ Σ, support rank_c(T, i) and select_c(T, i) for c ∈ Σ
- ??????????????
- |Σ| = 2 ???????????????
- |Σ| > 2 ???
- ????????SDS????SB,TB?????????????????????????? O(|Σ| (n / log n + n log log n / log n))
- |Σ| > log n ??????????????? n ????? (??????? |Σ| = 256, 65536; log n < 32)
25???????SDS
- ????????
- |Σ| < o(n / log log n) ??? Generalized Wavelet Tree ?????? Ferragina 2004
- nH_0(T) bits; rank and select in constant time
- ????? Wavelet Tree ??? Grossi 2003
- nH_0(T) bits; rank and select take log |Σ| time
- ?????????????Huffman?
- ?????????
- rank??????select??????????rank,select??????
26Wavelet??(1/2)
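- Σ = {a, b, c}; codes a = 0, b = 10, c = 11 (Huffman tree)
- T = abbccbaacbab
- Root (1st bit of each code): abbccbaacbab -> 011111001101
- Right child (2nd bit of each code, only b and c are kept): bbccbcbb -> 00110100
[Figure: Huffman tree with leaf a on branch 0 of the root, and leaves b (branch 0) and c (branch 1) below branch 1]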
27Wavelet Tree??(2/2)
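- Σ = {a, b, c}; codes a = 0, b = 10, c = 11 (Huffman tree)
- T = abbccbaacbab
- rank_b(T, 8) = 3 is computed top-down along the code of b = 10 (ranks here count the first i symbols):
- Root: abbccbaacbab -> 011111001101, rank1(8) = 5
- Right child: bbccbcbb -> 00110100, rank0(5) = 3
[Figure: the same Huffman tree as on the previous slide, with the two rank steps marked]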
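The rank computation on the Huffman-shaped wavelet tree can be sketched as follows. The class name WaveletTree, the representation of node bitmaps as strings keyed by code prefix, and the linear-scan bit counting are all simplifications assumed for this example; a real structure would put a rank directory on every bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

class WaveletTree {
public:
    // codes: a prefix-free binary code per symbol (for example a Huffman code).
    WaveletTree(const std::string& text, const std::map<char, std::string>& codes)
        : codes_(codes) {
        for (char ch : text) {
            const std::string& code = codes_.at(ch);
            for (std::size_t d = 0; d < code.size(); ++d) {
                // The node reached by the first d code bits stores bit d.
                nodes_[code.substr(0, d)].push_back(code[d]);
            }
        }
    }

    // Number of occurrences of ch among the first i symbols of the text.
    std::size_t rank(char ch, std::size_t i) const {
        const std::string& code = codes_.at(ch);
        std::string prefix;
        for (char bit : code) {
            const auto it = nodes_.find(prefix);
            if (it == nodes_.end()) return 0;  // symbol never occurs
            i = countBit(it->second, bit, i);  // map i into the child node
            prefix.push_back(bit);
        }
        return i;
    }

    // Bitmap stored at the node identified by a code prefix ("" is the root).
    const std::string& bitmap(const std::string& prefix) const {
        return nodes_.at(prefix);
    }

private:
    // Occurrences of bit among the first len entries of bits.
    static std::size_t countBit(const std::string& bits, char bit, std::size_t len) {
        assert(len <= bits.size());
        std::size_t c = 0;
        for (std::size_t p = 0; p < len; ++p) c += (bits[p] == bit);
        return c;
    }

    std::map<char, std::string> codes_;
    std::map<std::string, std::string> nodes_;
};
```

Built from T = abbccbaacbab with the codes a = 0, b = 10, c = 11, the root bitmap is 011111001101, the bitmap below branch 1 is 00110100, and rank('b', 8) goes through the two steps 5 and 3.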
29(??)?????????/??
- ????
- Text T to be searched (length n, alphabet Σ)
- Pattern P (length m)
- occ(P): number of occurrences of P in T
- loc(P): positions of the occurrences of P in T
- ??
- ??????????????????????????
- ???? (??????????????)
- ?????????????????????????????
- ?????????????????
- ??????
- ???????????????????????????
30????
- ??????????P?????????
- ?????????????????????
- ?????????????????????????????????????(???????????
?) - ??????????????????
- n-gram????????????????
- ??n?????????????
- ?????????????????O(N)????????????????????
31??????
- ?????????????????????
- ????????????????????????O(1)??????????????
- ??(??)?????????????????
- ????????????
- ?????????????
- ?????????????????????????
- Suffix Arrays (BW??)?SDS???????
- ???? nHkbit?occ(P)?O(m) ??Ferragina 2005
32Suffix Arrays (SA) Manber 1989
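- Text T = t_1 t_2 t_3 ... t_N
- Suffix of T: S_k = t_k t_{k+1} t_{k+2} ... t_N
(1) List all suffixes of T:
    S1 abraca, S2 braca, S3 raca, S4 aca, S5 ca, S6 a, S7 (empty)
(2) Sort the suffixes in lexicographic order:
    S7 (empty), S6 a, S1 abraca, S4 aca, S2 braca, S5 ca, S3 raca
(3) Store the suffix numbers in that order:
    SA = 7 6 1 4 2 5 3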
33SA??????
- Text T = abracadabra, pattern P = bra
- Binary search over the sorted suffixes
- Time: occ(P) in O(m log n), loc(P) in O(m log n + occ(P))
- Space: n log n bits (5n bytes)
- With the Hgt array, occ(P) takes O(m + log n)
SA  Suffix
11  (empty)
10  a
 7  abra
 0  abracadabra
 3  acadabra
 5  adabra
 8  bra
 1  bracadabra
 4  cadabra
 6  dabra
 9  ra
 2  racadabra
Comparisons during the search: bra > adabra, bra = bra, bra < cadabra
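The search can be sketched with the standard library. The construction below simply sorts the suffixes (quadratic in the worst case, fine for a demonstration), and the text carries an end marker '$' so that the empty suffix of the table becomes row 0; the marker and both function names are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Suffix array by plain sorting of the suffix start positions.
std::vector<int> buildSuffixArray(const std::string& t) {
    std::vector<int> sa(t.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// Half-open range [first, last) of suffix-array rows whose suffix starts
// with p; occ(P) = last - first, and loc(P) lists sa[first..last-1].
std::pair<int, int> findRange(const std::string& t, const std::vector<int>& sa,
                              const std::string& p) {
    assert(sa.size() == t.size());
    // Each comparison looks at no more than |p| characters: O(m log n) overall.
    const auto lo = std::lower_bound(
        sa.begin(), sa.end(), p, [&t](int suffix, const std::string& pat) {
            return t.compare(suffix, pat.size(), pat) < 0;
        });
    const auto hi = std::upper_bound(
        lo, sa.end(), p, [&t](const std::string& pat, int suffix) {
            return t.compare(suffix, pat.size(), pat) > 0;
        });
    return {static_cast<int>(lo - sa.begin()), static_cast<int>(hi - sa.begin())};
}
```

For T = abracadabra$ and P = bra the range is [6, 8), so occ(P) = 2 and loc(P) = {8, 1}, the two rows "bra" and "bracadabra" of the table.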
34 Compressed Suffix Arrays (CSA) Grossi, Vitter 00; Sadakane 03; Grossi, Gupta, Vitter 03
- SA?????????????????????????????
- ??SA??????? nlogn bit?SA?0??N-1??????????????
- SA???????????????????
- Ψ[i] = SA^-1[SA[i] + 1]
- SA?????????SAkiSAik??????
- ?????SA?????SAk??SA????
- SA[i] = SA[Ψ[i]] - 1 = SA[Ψ^2[i]] - 2 = ... = SA[Ψ^n[i]] - n
- If SA[Ψ^n[i]] = p is stored, then SA[i] = p - n
35????
- ??SA??????????
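- Property: if T[SA[i]] = T[SA[i+1]] then Ψ[i] < Ψ[i+1]
- Reason: when T[SA[i]] = T[SA[i+1]], the order of the two suffixes is decided from the second character on, that is by the suffixes starting at SA[i]+1 and SA[i+1]+1; hence SA^-1[SA[i]+1] < SA^-1[SA[i+1]+1], i.e. Ψ[i] < Ψ[i+1]
- d[i] = Ψ[i+1] - Ψ[i] ?d????????????? nH_0 bits (??? nH_k bits ???) Sadakane 2003
- d ? wavelet tree ?????????? nH_k bits Grossi 2003
Example: abra < abracadabra. The first characters are equal, so the order is decided from the second character on (suffixes SA[i]+1 and SA[i+1]+1): bra < bracadabra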
36Backward Search Sadakane 2002 Makinen 2004
- ???????SA?????SA?lookup?????????????????????
- Search P = CAGTA backward (P[m], P[m-1], ...)
[Figure: suffix-array rows grouped by first character A, C, G, T; the range of suffixes that have P[i..m] as a prefix is narrowed step by step for A, TA, GTA, AGTA, CAGTA]
37 Burrows-Wheeler Transform 1994 (BWT)
- ???????????(????)
- Definition: BWT[i] = T[SA[i] - 1]
- If SA[i] = 0, then BWT[i] = T[n]
- Example: the BWT of abracadabra$ is ard$rcaaaabb
- BWT????????????????
- ???????????????????
- c.f. Compression boosting Ferragina 2005
Example: suffixes that begin with "hese" are all preceded by "t": t|hese are possible ..., t|hese were not of ..., t|hese ...
38BWT?
- When Farmer Oak smiled, the corners of his mouth
spread till they were within an unimportant
distance of his ears, his eyes were reduced to
chinks, and diverging wrinkles appeared round
them, extending upon his countenance like the
rays in a rudimentary sketch of the rising sun.
His Christian name was Gabriel, and on working
days he was a young man of sound judgment, easy
motions, proper dress, and general good
character. On Sundays he was a man of misty
views, rather given to postponing, and hampered
by his best clothes and umbrella upon the whole,
one who felt himself to occupy morally that vast
..??
BW??
BWT?
Ioooooioororooorooooooooorooorromrrooomooroooooooo
rmoorooororioooroormmmmmuuiiiiiIiuuuuuuuiiiUiiiiii
oooooooooooorooooiiiioooioiiiiiiiiiiioiiiiiieuiiii
iiiiiiiiiiiiiouuuuouuUUuuuuuuooouuiooriiiriirriiii
riiiiiiaiiiiioooooooooooooiiiouioiiiioiiuiiuiiiiii
iiiiiiiiiiiiiiiiioiiiiioiuiiiiiiiiiiiiioiiiiiiiiii
iiioiiiiiiuiiiioiiiiiiiiiiiioiiiiiiiiiioiiiioiiiii
iioiiiaiiiiiiiiiiiiiiiiioiiiiiioiiiiiiiiiiiiiiiuii
iiiiiiiiiiiiiiiioiiiiiiiioiiiiiiiiiiiiiiiiiiiiiiii
iiiiiiiiiiiiuuuiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiioi
iiiuioiuiiiiiiioiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiii
iiiioaoiiiiioioiiiiiiiioooiiiiiooioiiioiiiiiouiiii
iiiiiiiiooiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiii
iiiiiiiiiiioiooiiiiiiiiiiioiiiiiuiiiiiiiiiiiiiiiii
iiiiiiiiiiiiiiiiiiioiiiiiiiiiiiiioiiiuiiiiiiiiiioi
iiiiiiiiiiiuoiiioiiioiiiiiiiiiiiiiiiiiiiiiiuiiiiuu
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuiuiiiiiuuiiiii
iiiiiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiioi
iiiiiioiiiiiiiiiiiiiiiiiiiiioiiiiiiiiioiiiiuiiiioi
iiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiioiiii
iiuiiiiiiiiiiiiiiiooiiiiiiiiiiiiiiiiiiiioooiiiiiii
ioiiiiouiiiiiiiiiiiiiii..??
39BWT??????
- F: the first characters of the sorted suffixes
- Property
- The k-th occurrence of a character c in BWT corresponds to the k-th occurrence of c in F
- ??
- BWT,F??????????????????????????????????????????
???
BWT  F and the rest of the suffix
a1   $
r    a1 $
d    a2 bra$
$    a3 bracadabra$
r    a4 cadabra$
c    a5 dabra$
a2   b ra$
a3   b racadabra$
a4   c adabra$
a5   d abra$
b    r a$
b    r acadabra$
40BWT??????(?)
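- SA^-1: the inverse of SA; if SA[k] = i then SA^-1[i] = k
- cum[c]: number of characters in T smaller than c
- ???????????????????
- SA^-1[SA[i] - 1] = rank_c(BWT, i - 1) + cum[c], where c = BWT[i]
- SA^-1[SA[i] + 1] = select_x(BWT, i - cum[x]), where x is the character with cum[x] ≤ i < cum[x+1]
- BWT???????(lf-mapping)?????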
41
i   SA  SA^-1  SA^-1[SA[i]+1]  BWT  Suffix
0   11  3      3               a    $
1   10  7      0               r    a$
2   7   11     6               d    abra$
3   0   4      7               $    abracadabra$
4   3   8      8               r    acadabra$
5   5   5      9               c    adabra$
6   8   9      10              a    bra$
7   1   2      11              a    bracadabra$
8   4   6      5               a    cadabra$
9   6   10     2               a    dabra$
10  9   1      1               b    ra$
11  2   0      4               b    racadabra$
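The derived columns of the table follow mechanically from T and SA. The sketch below recomputes them; the struct and function names are hypothetical, the text is assumed to end with the marker '$', and positions wrap around at the ends of T.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct SaColumns {
    std::vector<int> isa;  // SA^-1
    std::vector<int> psi;  // SA^-1[SA[i] + 1]
    std::string bwt;       // T[SA[i] - 1]
};

// Derives SA^-1, psi and the BWT from the text and its suffix array.
SaColumns deriveColumns(const std::string& t, const std::vector<int>& sa) {
    assert(sa.size() == t.size());
    const int n = static_cast<int>(t.size());
    SaColumns c;
    c.isa.assign(n, 0);
    c.psi.assign(n, 0);
    c.bwt.assign(n, ' ');
    for (int i = 0; i < n; ++i) c.isa[sa[i]] = i;  // invert the permutation
    for (int i = 0; i < n; ++i) {
        c.psi[i] = c.isa[(sa[i] + 1) % n];     // row of the suffix one shorter
        c.bwt[i] = t[(sa[i] + n - 1) % n];     // symbol preceding the suffix
    }
    return c;
}
```

Applied to T = abracadabra$ and SA = 11 10 7 0 3 5 8 1 4 6 9 2, it reproduces the three derived columns, including the BWT ard$rcaaaabb.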
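42 Inverse BW transform (LF-mapping)
    #include <cstdio>
    #include <cstring>

    void revBWT(const char* bwt, int n) {
        int count[0x100];
        memset(count, 0, sizeof(int) * 0x100);
        for (int i = 0; i < n; i++) count[(unsigned char)bwt[i]]++;
        // count[c] becomes the number of characters <= c
        for (int i = 1; i < 0x100; i++) count[i] += count[i - 1];
        int* LFmapping = new int[n];
        // row f of the sorted first column -> position of the same character in bwt
        for (int i = n - 1; i >= 0; i--)
            LFmapping[--count[(unsigned char)bwt[i]]] = i;
        int next = 0;  // position of the end marker, assumed to be '$'
        while (next < n && bwt[next] != '$') next++;
        for (int i = 0; i < n && next < n; i++) {
            next = LFmapping[next];
            putchar(bwt[next]);
        }
        delete[] LFmapping;
    }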
43FM-index Ferragina 2000
- BWT????????????????
- BWT[i] = T[SA[i] - 1] ?????
- BWT??rank, select???SA???
- SA^-1[SA[i] - 1] = rank(BWT, c) + cum[c]
- SA^-1[SA[i] + 1] = select(BWT, c)
- CSA????????????????????????????
- ?????????LZ-index Karkkainen 96???
44FM-index???
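Input: pattern P[0..m-1]; BWT[0..n-1], the BWT of the text T; C[0..|Σ|-1], where C[c] is the number of characters in the BWT that are smaller than c (the end marker $ is not counted, which the +1 in the sp update compensates for)
Output: sp, ep: the range of suffix-array rows whose suffixes have P as a prefix; if ep < sp, P does not occur in T
- i = m-1
- sp = 0; ep = n-1
- while (sp ≤ ep) and (i ≥ 0) do
- c = P[i]
- sp = C[c] + rank(BWT, c, sp-1) + 1
- ep = C[c] + rank(BWT, c, ep)
- i--
- end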
45 Example: P = abr, T = abracadabra$, BWT = ard$rcaaaabb
i   SA  BWT  Head of suffix
0   11  a1   $
1   10  r1   a1
2   7   d    a2
3   0   $    a3
4   3   r2   a4
5   5   c    a5
6   8   a2   b1
7   1   a3   b2
8   4   a4   c
9   6   a5   d
10  9   b1   r1
11  2   b2   r2
Initially: sp = 0, ep = 11
i = 2, c = r: sp = 9 + 0 + 1 = 10, ep = 9 + 2 = 11 (suffixes starting with r)
i = 1, c = b: sp = 5 + 0 + 1 = 6, ep = 5 + 2 = 7 (suffixes starting with br)
i = 0, c = a: sp = 0 + 1 + 1 = 2, ep = 0 + 3 = 3 (suffixes starting with abr)
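A runnable sketch of the backward search is given below. It is not taken from the slides: rank is a linear scan instead of a succinct structure, and C[c] counts every BWT symbol smaller than c including the end marker, so the "+1" of the sp update is folded into C and ep gets an explicit "-1". The function names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Number of occurrences of c in bwt[0..i]; i = -1 gives 0.
// Linear scan here; an FM-index answers this with rank on the BWT.
int rankChar(const std::string& bwt, char c, int i) {
    assert(i < static_cast<int>(bwt.size()));
    int count = 0;
    for (int p = 0; p <= i; ++p) count += (bwt[p] == c);
    return count;
}

// Returns the inclusive row range (sp, ep) of the suffixes that start
// with pattern; the pattern does not occur when ep < sp.
std::pair<int, int> backwardSearch(const std::string& bwt,
                                   const std::string& pattern) {
    // C[c] = number of symbols in the BWT (end marker included) smaller than c.
    std::array<int, 256> C{};
    for (unsigned char ch : bwt) {
        for (int c = ch + 1; c < 256; ++c) ++C[c];
    }
    int sp = 0;
    int ep = static_cast<int>(bwt.size()) - 1;
    // Process the pattern from its last character to its first.
    for (int i = static_cast<int>(pattern.size()) - 1; i >= 0 && sp <= ep; --i) {
        const char c = pattern[i];
        const int base = C[static_cast<unsigned char>(c)];
        sp = base + rankChar(bwt, c, sp - 1);
        ep = base + rankChar(bwt, c, ep) - 1;
    }
    return {sp, ep};
}
```

For BWT = ard$rcaaaabb and P = abr the ranges are (10, 11) after r, (6, 7) after br and (2, 3) after abr, i.e. the rows of abra$ and abracadabra$.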
47???
- Succinct Data Structures (SDS)
- ??????(???????)???(????)??????
- ???????????????????
- ??????
- Suffix Arrays?Burrows-Wheeler ?????????????rank?select?????
- SDS???????????nH_k bits???
48?????
- Succinct Data Structures
- ????????????
- ?????????????????
- ?????????????
- nHkbits O(1)??
- ?????????????? (Gap Measure Gupta 06)
- ??????????
- ?????????????(??????????)
- ????????????
- ???????Huynh 2005, ????, ????
- ????????
- logn???????????????SDS Makinen 06