Title: hillbig@is.s.u-tokyo.ac.jp
1???????????
- ?????????????????????????1? (???)
- ??? ??
- hillbig_at_is.s.u-tokyo.ac.jp
2?????
- ??
- ????
- ??????
- Suffix Arrays / Suffix Trees
- Static Dictionary
- ?????????
- CSA
- CST
- IIWT
4??
- ?????????????????!
- Web??? 80???? > 1PB
- MEDLINE(??????) > 50GB
- ???????? > 800G base
- ????/??????/?????????
- ????????????????????!
- ??????/??????????
- ????????????????????????
- G_at_ogle??????????????????.
??????????????? ??????????
5?????????
- ????
- ????????T[1..N]????????????????????????o(N)????
- ? ??????Suffix Arrays
- ???????
- ???????????(e.g. O(log2N)) ????????1/5??1/20????????
- ???PC (Mem 4GB)?20GB????????????????(Mem 16GB)?80GB??????????
7??????
- ??????????????????
- ???????
- ??????????????
- "book" <102, 187, 298, ...>, "boot" <609, 1029>
- ????????????????????????????????
8?????????
- ????????????????????????
- "book" <102, 187-102, 298-187, ...> = <102, 85, 111, ...>
- ???????
- s???? Elias 1975
- golomb?? Golomb 1969
- rice ?? Rice 1979
- Interpolative?? Moffat 2000
- Recursive Integer Code Moffat 2005
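The gap transformation above can be sketched in C as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and the Elias gamma code is picked here only as one example of a code from [Elias 1975] to show why smaller gaps cost fewer bits.

```c
#include <assert.h>
#include <stddef.h>

/* Replace sorted positions by gaps in place: pos[0], pos[1]-pos[0], ...
   Runs from the back so that pos[i-1] is still the original value. */
void gaps_encode(unsigned *pos, size_t n) {
    for (size_t i = n; i-- > 1; ) {
        assert(pos[i] > pos[i - 1]);   /* positions must be strictly increasing */
        pos[i] -= pos[i - 1];
    }
}

/* Inverse transform: a running sum restores the original positions. */
void gaps_decode(unsigned *gap, size_t n) {
    for (size_t i = 1; i < n; i++)
        gap[i] += gap[i - 1];
}

/* Length in bits of the Elias gamma code of x >= 1: 2*floor(log2 x) + 1. */
unsigned gamma_code_bits(unsigned x) {
    assert(x >= 1);
    unsigned log2x = 0;
    while (x >>= 1) log2x++;
    return 2 * log2x + 1;
}
```

For the "book" list, `gaps_encode` turns {102, 187, 298} into {102, 85, 111}; under the gamma code the last entry then takes 13 bits instead of the 17 bits needed for 298.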
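9 Suffix Arrays [Manber 1989]
- For a text T[1..N] of length N, S_i = T[i..N] is called a suffix of T.
- The array of suffix numbers i, arranged in the lexicographic order of the S_i, is called the Suffix Array.
- A pattern P[1..M] can be searched in O(M log N) time.
Suffixes of abraca: S1 abraca, S2 braca, S3 raca, S4 aca, S5 ca, S6 a, S7 (empty)
Sorted: S7 (empty), S6 a, S1 abraca, S4 aca, S2 braca, S5 ca, S3 raca
Suffix Arrays: 7 6 1 4 2 5 3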
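10 Suffix Arrays??????
T = abracadabra, P = bra
Binary search: with pattern length M and text length N, a search takes O(M log N); with the Hgt array it takes O(M + log N).
i   SA  suffix
0   11  (empty)
1   10  a
2   7   abra
3   0   abracadabra
4   3   acadabra
5   5   adabra
6   8   bra
7   1   bracadabra
8   4   cadabra
9   6   dabra
10  9   ra
11  2   racadabra
bra > adabra
bra = bra
bra < cadabra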
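The construction and the binary search can be sketched in C as below. This is a naive illustration under stated assumptions: positions are 0-based, the empty suffix is included (so abracadabra has 12 entries), sorting uses the standard `qsort` rather than any of the specialised construction algorithms, and `sa_build` / `sa_search` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char *sa_text;   /* text seen by the qsort comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(sa_text + *(const int *)a, sa_text + *(const int *)b);
}

/* Fill sa[0..n] with the start positions of all suffixes of t, including the
   empty suffix at position n, in lexicographic order. Naive comparison sort. */
void sa_build(const char *t, int *sa) {
    int n = (int)strlen(t);
    for (int i = 0; i <= n; i++) sa[i] = i;
    sa_text = t;
    qsort(sa, (size_t)n + 1, sizeof sa[0], cmp_suffix);
}

/* Binary search for pattern p. Returns an index into sa whose suffix starts
   with p, or -1. Each step compares at most M characters: O(M log N). */
int sa_search(const char *t, const int *sa, int n_entries, const char *p) {
    size_t m = strlen(p);
    int lo = 0, hi = n_entries - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(p, t + sa[mid], m);
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}
```

With T = abracadabra, `sa_build` produces 11 10 7 0 3 5 8 1 4 6 9 2, and searching for "bra" compares against adabra, cadabra and bra before stopping at index 6.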
11Suffix Arrays???,???
- ???????????????????
- Ko 03 Karkkainen 03 Kim 03
- ?????????????O(logN)??? Mamada 2003, Mori 2005, Maniscalco 2005
- ????????????????????????? Huynh 05
- ????????????????
- ??????????????
12Suffix Trees Weiner 1973
- Suffix???????Trie????????????????????????????
- ??Suffix Arrays?????
- ?????1??????????Link (Suffix Link)????
[Figure: suffix tree of abracadabra; the leaves are labeled with the suffix array entries 11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
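13 Static Dictionary [Munro 1996] [Pagh 2001]
- For a bit array B[1..N], an auxiliary structure of o(N) bits supports the following operations in constant time.
- rank1(B,i): the number of 1s in B[1..i]
- select1(B,i): the position of the i-th 1 in B
- rank0(B,i) and select0(B,i) are defined in the same way.
- ???????????????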
14 Static Dictionary ??
- B = 0,0,1,1,0,0,1,1,0,1,0
- rank1(B,1) = 0
- rank1(B,2) = 0
- rank1(B,3) = 1
- rank1(B,8) = 4
- select1(B,1) = 3
- select1(B,3) = 7
- rank0(B,2) = 2
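The values above can be checked with a direct linear-scan version of rank and select. This sketch only pins down the definitions (1-based, inclusive positions); it is not the constant-time structure, and the function names are hypothetical.

```c
#include <assert.h>

/* Number of entries equal to `bit` in B[1..i] (1-based, inclusive). */
int rank_bit(const int *B, int n, int bit, int i) {
    assert(0 <= i && i <= n);
    int count = 0;
    for (int p = 1; p <= i; p++)
        if (B[p - 1] == bit) count++;
    return count;
}

/* 1-based position of the i-th entry equal to `bit`, or 0 if there is none. */
int select_bit(const int *B, int n, int bit, int i) {
    int seen = 0;
    for (int p = 1; p <= n; p++)
        if (B[p - 1] == bit && ++seen == i) return p;
    return 0;
}
```

Here `rank_bit(B, 11, 1, 8)` plays the role of rank1(B,8) and `select_bit(B, 11, 1, 3)` the role of select1(B,3).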
15Static Dictionary???
- ??????????????????????????????????????????
- ????????Gonzalez 2005
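- rank(i): store the rank at every 64th position (looked up with i/64) and count the 1s in the remaining i mod 64 bits with popCount (6 adds, 6 shifts, 12 ANDs)
- select is computed by binary search over rank
- rank0(B,i) = i - rank1(B,i)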
16 Counting the 1 bits in 64 bits (C)
- unsigned long long popCount(unsigned long long x) {
    /* add neighbouring 1-bit fields into 2-bit sums, then 4, 8, 16, 32 and 64 bits */
    x = ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1) + (x & 0x5555555555555555ULL);
    x = ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2) + (x & 0x3333333333333333ULL);
    x = ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4) + (x & 0x0F0F0F0F0F0F0F0FULL);
    x = ((x & 0xFF00FF00FF00FF00ULL) >> 8) + (x & 0x00FF00FF00FF00FFULL);
    x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) + (x & 0x0000FFFF0000FFFFULL);
    x = ((x & 0xFFFFFFFF00000000ULL) >> 32) + (x & 0x00000000FFFFFFFFULL);
    return x;
  }
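A sketch of how the per-64-bit rank table, popCount and binary-search select could fit together is given below. The layout is an assumption made for illustration (bit p of the array, 0-based, is bit p mod 64 of word p/64; the `RankDict` type and `rd_*` names are hypothetical), and `rd_rank1(d, i)` counts the 1s in the first i bits, which matches rank1(B,i) for 1-based B.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same field-summing popcount as on the slide. */
static unsigned pop64(uint64_t x) {
    x = ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1) + (x & 0x5555555555555555ULL);
    x = ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2) + (x & 0x3333333333333333ULL);
    x = ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4) + (x & 0x0F0F0F0F0F0F0F0FULL);
    x = ((x & 0xFF00FF00FF00FF00ULL) >> 8) + (x & 0x00FF00FF00FF00FFULL);
    x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) + (x & 0x0000FFFF0000FFFFULL);
    x = ((x & 0xFFFFFFFF00000000ULL) >> 32) + (x & 0x00000000FFFFFFFFULL);
    return (unsigned)x;
}

typedef struct {
    const uint64_t *words;  /* the bit array, 64 bits per word */
    uint64_t *before;       /* before[w] = number of 1s in words[0..w-1] */
    size_t nbits;
} RankDict;

/* Build the rank table; returns 0 on success, -1 if allocation fails. */
int rd_init(RankDict *d, const uint64_t *words, size_t nbits) {
    size_t nwords = (nbits + 63) / 64;
    d->before = malloc((nwords + 1) * sizeof *d->before);
    if (!d->before) return -1;
    d->words = words;
    d->nbits = nbits;
    d->before[0] = 0;
    for (size_t w = 0; w < nwords; w++)
        d->before[w + 1] = d->before[w] + pop64(words[w]);
    return 0;
}

/* Number of 1s among the first i bits: table lookup plus one popcount. */
uint64_t rd_rank1(const RankDict *d, size_t i) {
    assert(i <= d->nbits);
    size_t w = i / 64, r = i % 64;
    uint64_t partial = r ? pop64(d->words[w] & ((1ULL << r) - 1)) : 0;
    return d->before[w] + partial;
}

uint64_t rd_rank0(const RankDict *d, size_t i) { return i - rd_rank1(d, i); }

/* 1-based position of the k-th 1, found by binary search on rank1;
   returns 0 if the array holds fewer than k ones. */
size_t rd_select1(const RankDict *d, uint64_t k) {
    if (k == 0 || rd_rank1(d, d->nbits) < k) return 0;
    size_t lo = 1, hi = d->nbits;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rd_rank1(d, mid) >= k) hi = mid; else lo = mid + 1;
    }
    return lo;
}
```

The table costs one 64-bit counter per 64-bit word in this sketch; a real o(N) structure would use narrower counters in two levels.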
18??????
- ?????????????
- Compressed Suffix Arrays (CSA)
- Vertical Code
- Compressed Suffix Trees (CST)
- Parenthesis Tree (???)
- Inverted file Indexing with Wavelet Tree (IIWT)
- Wavelet Tree
19 Compressed Suffix Arrays (CSA) [Grossi 2001, 2003] [Sadakane 2003]
- CSA?O(N H0) bit?????????
- H0??????0???????
- ????H0 = 2.54 bit
- ???SA??N log N bit?? (4N Byte)
- 1GB?????SA??????????4GB???????CSA?200MB to 500MB
- ??????????????
- ?????????????????
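20 CSA???
- Storing SA as it is takes 4N (8N) bytes.
- SA is a permutation of 1 to N, so it cannot be compressed directly.
- Instead of SA, store Ψ[i] = SA^-1[SA[i] + 1], where Ψ[0] = SA^-1[0] and SA^-1[i] = j when SA[j] = i.
- SA itself is stored only at sampled positions; any other SA[i] is recovered through Ψ: SA[i] = SA[Ψ[i]] - 1 = SA[Ψ^2[i]] - 2 = ... = SA[Ψ^n[i]] - n. If SA[Ψ^n[i]] = p, then SA[i] = p - n.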
21????
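- Ψ[i] = SA^-1[SA[i] + 1]
- Ψ[i] gives the position of the suffix obtained by removing the first character of the suffix at position i.
- Property: if T[SA[i]] = T[SA[j]] and i < j, then Ψ[i] < Ψ[j].
- SA holds the suffixes in lexicographic order, so when suffixes S_i and S_j share their first character, their order is decided from the second character onward.
- That is, S_{i+1} < S_{j+1}, and therefore SA^-1[SA[i] + 1] < SA^-1[SA[j] + 1] (Ψ[i] < Ψ[j]).
- Among suffixes that start with the same character, Ψ[i] is increasing.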
22
i   SA  Ψ   Suffix
0   11  3   (empty)
1   10  0   a
2   7   6   abra
3   0   7   abracadabra
4   3   8   acadabra
5   5   9   adabra
6   8   10  bra
7   1   11  bracadabra
8   4   5   cadabra
9   6   2   dabra
10  9   1   ra
11  2   4   racadabra
??Suffix??????????????? ???????????????????????
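23
i    0  1  2  3  4  5  6  7  8  9 10 11
SA  11 10  7  0  3  5  8  1  4  6  9  2
Ψ    3  0  6  7  8  9 10 11  5  2  1  4
Only the entries with SA[i] mod 4 = 0 are stored.
SA[7] = SA[Ψ[7]] - 1 = SA[11] - 1
      = SA[Ψ[11]] - 2 = SA[4] - 2
      = SA[Ψ[4]] - 3 = SA[8] - 3 = 4 - 3 = 1
This uses SA[i] = SA[Ψ[i]] - 1.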
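The walk along Ψ can be sketched in C as follows. The assumptions are mine: positions are 0-based, the SA has n entries including the empty suffix, unsampled entries are marked with -1 (a real CSA would use a bit vector with rank instead), arithmetic wraps modulo n to cover Ψ[0], and the names and the fixed `PSI_MAX` capacity are hypothetical.

```c
#include <assert.h>

enum { PSI_MAX = 64 };   /* hypothetical capacity limit for this sketch */

/* psi[i] = SA^-1[(SA[i] + 1) mod n], where n counts all SA entries. */
void psi_build(const int *sa, int n, int *psi) {
    int inv[PSI_MAX];
    assert(n <= PSI_MAX);
    for (int i = 0; i < n; i++) inv[sa[i]] = i;
    for (int i = 0; i < n; i++) psi[i] = inv[(sa[i] + 1) % n];
}

/* Keep SA[i] only where SA[i] is a multiple of step; mark the rest with -1. */
void sa_sample(const int *sa, int n, int step, int *sampled) {
    for (int i = 0; i < n; i++)
        sampled[i] = (sa[i] % step == 0) ? sa[i] : -1;
}

/* Recover SA[i]: follow psi until a sampled entry p is reached after `hops`
   steps, then SA[i] = p - hops (mod n). Terminates because the entry with
   SA value 0 is always sampled. */
int sa_lookup(const int *sampled, const int *psi, int n, int i) {
    int hops = 0;
    while (sampled[i] < 0) {
        i = psi[i];
        hops++;
    }
    return ((sampled[i] - hops) % n + n) % n;
}
```

For abracadabra with step 4, looking up position 7 visits positions 11, 4 and 8 and returns 4 - 3 = 1, as in the derivation above.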
24????????????
- ????????????????????????????????????
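- d[i] = Ψ[i+1] - Ψ[i]
- Encoding the d[i] appropriately stores Ψ in O(N H0) bits [Sadakane 2003].
- Ψ is sampled at regular intervals and the d[i] in between are summed to recover a value: Ψ[i] = Ψ[k] + d[k] + d[k+1] + ... + d[i-1]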
25 Vertical Code [Okanohara 2005] (to appear)
- ?????????????????????????????????
- Σ_{i=1..N} d_i ????????????
- ???????????Recursive Integer Code [Moffat 2005]
- Vertical Code???Byte????????
- (c.f. Range Coder????????Byte?????????????????????
????????)
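26 Vertical Code???
- Split d[1..N] into blocks of M values: d[1..M], d[M+1..2M], d[2M+1..3M], ...
- For each block, find the number of bits it needs, MaxBit_i: MaxBit_1 = floor(log2(Max(d_1, ..., d_M))) + 1, MaxBit_2 = floor(log2(Max(d_{M+1}, ..., d_{2M}))) + 1
- In the i-th block, the k-th bits of all values are collected into one bit array V_i^k: V_1^1 = d_1 & 1, d_2 & 1, ...; V_1^2 = (d_1 >> 1) & 1, (d_2 >> 1) & 1, ...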
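27 Vertical Code???(?)
- Store MaxBit_i, V_i^1 ... V_i^{MaxBit_i}, and Ψ[iM] (with M = 8 everything can be handled in byte units)
- Ψ[t] = Σ_{i=1..t} d_i is computed as follows:
- j = t / M
- k = t mod M
- x = Ψ[jM]
- for (b = 1; b <= MaxBit_{j+1}; b++) x += popCount(V_{j+1}^b & ((1 << k) - 1)) << (b - 1)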
28 Vertical Code??
- Ψ[1..8] = 3, 6, 12, 15, 18, 21, 28, 31; d[1..8] = 3, 3, 6, 3, 3, 3, 7, 3
d            3 3 6 3 3 3 7 3
bit 1 (V_1^1)  1 1 0 1 1 1 1 1
bit 2 (V_1^2)  1 1 1 1 1 1 1 1
bit 3 (V_1^3)  0 0 1 0 0 0 1 0
3 = 011_2
29 Vertical Code??
- Ψ[1..8] = 3, 6, 12, 15, 18, 21, 28, 31; d[1..8] = 3, 3, 6, 3, 3, 3, 7, 3
d      3 3 6 3 3 3 7 3
V_1^1  1 1 0 1 1 1 1 1
V_1^2  1 1 1 1 1 1 1 1
V_1^3  0 0 1 0 0 0 1 0
Computing Ψ[6]:
Ψ[6] = d_1 + d_2 + ... + d_6
     = popCount(V_1^1 & ((1 << 6) - 1)) + (popCount(V_1^2 & ((1 << 6) - 1)) << 1) + (popCount(V_1^3 & ((1 << 6) - 1)) << 2)
     = 5 + 12 + 4 = 21
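The block layout and the prefix-sum computation can be sketched in C as below, for a single block with M = 8. The struct layout, the names and the choice to put d_1 in the least significant bit of each plane are assumptions made for this illustration, not details confirmed by the slides.

```c
#include <assert.h>
#include <stdint.h>

enum { VC_M = 8, VC_MAXBITS = 32 };   /* block size M = 8, one byte per bit plane */

typedef struct {
    uint32_t base;               /* prefix sum before this block */
    int maxbit;                  /* bits needed by the largest d in the block */
    uint8_t plane[VC_MAXBITS];   /* plane[b] holds bit b of d_1..d_8, d_1 in the LSB */
} VcBlock;

static int pop8(uint8_t v) {
    int c = 0;
    while (v) { c += v & 1; v >>= 1; }
    return c;
}

/* Transpose the eight values of one block into bit planes. */
void vc_encode(const uint32_t d[VC_M], uint32_t base, VcBlock *blk) {
    uint32_t max = 0;
    for (int k = 0; k < VC_M; k++)
        if (d[k] > max) max = d[k];
    blk->base = base;
    blk->maxbit = 0;
    while (max) { blk->maxbit++; max >>= 1; }   /* floor(log2 max) + 1 */
    for (int b = 0; b < blk->maxbit; b++) {
        blk->plane[b] = 0;
        for (int k = 0; k < VC_M; k++)
            blk->plane[b] |= (uint8_t)(((d[k] >> b) & 1u) << k);
    }
}

/* base + d_1 + ... + d_k for 0 <= k <= 8: one popcount per bit plane. */
uint32_t vc_prefix_sum(const VcBlock *blk, int k) {
    assert(0 <= k && k <= VC_M);
    uint8_t mask = (uint8_t)((1u << k) - 1);
    uint32_t x = blk->base;
    for (int b = 0; b < blk->maxbit; b++)
        x += (uint32_t)pop8(blk->plane[b] & mask) << b;
    return x;
}
```

Encoding d = 3, 3, 6, 3, 3, 3, 7, 3 with base 0 gives three planes, and `vc_prefix_sum(&blk, 6)` evaluates 5 + (6 << 1) + (1 << 2) = 21.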
30 Compressed Suffix Trees (CST) [Sadakane 2005]
- T[1..N]?Suffix Trees?CSA + 6N bit???
- ???????2Byte / ?? [Okanohara 2005]
- ????????100N bit ??
- Suffix Tree?????????????
- ??????????????
- Suffix Tree??T1N????????,???????N1???2N-1???
- ??????????????????12 NByte??
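31 Parenthesis Tree [Jacobson 1989] [Raman 2001] [Geary 2004]
- ??????????????????(????)??????????
- findclose(i): the position of the closing parenthesis that matches the opening parenthesis at i
- findopen(i): the position of the opening parenthesis that matches the closing parenthesis at i
- enclose(i): the position of the opening parenthesis of the innermost pair that encloses the pair at i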
(()(()()))
32??????????????
- parent(i) = enclose(i)
- child(i) = i + 1
- sibling(i) = findclose(i) + 1
(()(()()))
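The three parenthesis operations and the tree navigation built on them can be sketched with plain linear scans. This only fixes the semantics; it is not the constant-time structure. Positions are 0-based here, -1 means "no such node", and the navigation function names are hypothetical.

```c
#include <assert.h>

/* Position of the ')' matching the '(' at i, or -1 if unbalanced. */
int findclose(const char *s, int i) {
    assert(s[i] == '(');
    int depth = 0;
    for (int p = i; s[p] != '\0'; p++) {
        depth += (s[p] == '(') ? 1 : -1;
        if (depth == 0) return p;
    }
    return -1;
}

/* Position of the '(' matching the ')' at i, or -1 if unbalanced. */
int findopen(const char *s, int i) {
    assert(s[i] == ')');
    int depth = 0;
    for (int p = i; p >= 0; p--) {
        depth += (s[p] == ')') ? 1 : -1;
        if (depth == 0) return p;
    }
    return -1;
}

/* '(' of the innermost pair enclosing the pair opened at i, or -1 at the root. */
int enclose(const char *s, int i) {
    assert(s[i] == '(');
    int depth = 0;
    for (int p = i - 1; p >= 0; p--) {
        depth += (s[p] == ')') ? 1 : -1;
        if (depth < 0) return p;   /* an unmatched '(' to the left encloses i */
    }
    return -1;
}

/* Tree navigation; a node is identified by the position of its '('. */
int parent(const char *s, int i) { return enclose(s, i); }

int first_child(const char *s, int i) {
    return s[i + 1] == '(' ? i + 1 : -1;
}

int next_sibling(const char *s, int i) {
    int c = findclose(s, i);
    return (c >= 0 && s[c + 1] == '(') ? c + 1 : -1;
}
```

The checks on `s[i + 1]` and `s[c + 1]` are additions of this sketch: the formulas i + 1 and findclose(i) + 1 only name a node when that position holds an opening parenthesis.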
33??????
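- The parenthesis sequence is split into blocks of size M.
- Parentheses are classified into three kinds:
- near: the matching parenthesis is in the same block
- far: any parenthesis that is not near
- pioneer: a far parenthesis whose match lies in a different block from the match of the preceding far parenthesis
[Figure: the sequence (()((( ))()() )()) split into blocks of 6, with one opening parenthesis labeled pioneer and another labeled far]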
34??????(?)
- ???????pioneer?????????pioneer????
- ????????????????????????B???pioneer?????????4B-6???????
- pioneer????????????????????????
[Figure: the sequence (()((( ))()()( ))()) split into blocks, and the sequence (()()) formed by its pioneers]
35???????
- findclose(i)
- i???????????
- near ? ????????????????
- pioneer ? pioneer?????????
- far ? ???pioneer???????????
?????????????? ???pioneer???????????
???????????????? - ?????????
36 Inverted file Indexing with Wavelet Tree (IIWT) [Okanohara 2005] (to appear)
- ????????????
- ??????????S,????M????O(M log S)????? (???????????0????????H0?????M H0 bit)
- ??????O(H0)???????
- ?????i???????
- ??????????????????
- N = 10GB, M = 1G, S = 1024??1GB?????
37 Wavelet Tree [Grossi 2003]
- ??CSA, FM-index????????????
- For a text T[1..N] over an alphabet S, rank and select for any character a in S are answered in O(log S) time.
- Static Dictionary is the case S = 2.
- ???????O(N H0)
38 Wavelet Tree??
- S = {a, b, c}: a = 0_2, b = 10_2, c = 11_2
- T = abbccbaacbab
[Figure: wavelet tree of T. Root: abbccbaacbab = 011111001101 (0: a, 1: b or c). Right child: bbccbcbb = 00110100 (0: b, 1: c).]
39 Wavelet Tree??
- S = {a, b, c}: a = 0_2, b = 10_2, c = 11_2
- T = abbccbaacbab
- rank_b(8) = 3
- Root: abbccbaacbab = 011111001101, rank1(8) = 5
- Right child: bbccbcbb = 00110100, rank0(5) = 3
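The two-level example can be sketched in C as follows. It is hard-wired to the three-symbol alphabet and the codes a = 0, b = 10, c = 11 used above, stores one int per bit, and uses a linear-scan rank in place of a Static Dictionary; the `Wavelet3` type, the `WT_MAX` capacity and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { WT_MAX = 64 };   /* hypothetical capacity limit for this sketch */

typedef struct {
    int n;                /* text length */
    int root[WT_MAX];     /* first code bit: 0 for a, 1 for b or c */
    int n_right;          /* number of b/c symbols */
    int right[WT_MAX];    /* second code bit over the b/c subsequence: 0 = b, 1 = c */
} Wavelet3;

/* Number of entries equal to `bit` among the first i entries. */
static int rank_bits(const int *bits, int bit, int i) {
    int count = 0;
    for (int p = 0; p < i; p++)
        if (bits[p] == bit) count++;
    return count;
}

void wt_build(Wavelet3 *wt, const char *t) {
    wt->n = (int)strlen(t);
    assert(wt->n <= WT_MAX);
    wt->n_right = 0;
    for (int p = 0; p < wt->n; p++) {
        assert(t[p] == 'a' || t[p] == 'b' || t[p] == 'c');
        wt->root[p] = (t[p] != 'a');
        if (t[p] != 'a')
            wt->right[wt->n_right++] = (t[p] == 'c');
    }
}

/* Occurrences of symbol c in T[1..i]: one bit-rank per tree level. */
int wt_rank(const Wavelet3 *wt, char c, int i) {
    assert(0 <= i && i <= wt->n);
    if (c == 'a') return rank_bits(wt->root, 0, i);
    int j = rank_bits(wt->root, 1, i);   /* length of the prefix in the right child */
    return rank_bits(wt->right, c == 'c', j);
}
```

For T = abbccbaacbab, `wt_rank(&wt, 'b', 8)` first computes rank1(8) = 5 at the root and then rank0(5) = 3 in the right child.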
40IIWT
- ???????????1?????????0???bit??T????
- ?????????????????T1M????Wavelet Tree???
- The number of occurrences of a word p between positions a and b is rank_p(b) - rank_p(a).
- The i-th occurrence of a word p is found with select_p(T,i).
this book is interesting.
1000010000100100000000000
41IIWT???/??
- ??
- ?????????????????O(logS) (CSA???????????O(N))
- ??????????????????????????????????(??????????????????)
- ??
- ????????????
42??????
- ????????????????????????????
- ???PC(???4GB)?20GB??????(??????16GB)?80GB
- 1TB, 1PB??????????????????????????????
- ??800GB?????426GB?TREK Data?????
- ?????????????????????
- Example Based Learning???
- ????????????????????????