hillbig@is.s.u-tokyo.ac.jp - PowerPoint PPT Presentation

Slides: 43
Provided by: oka69
Transcript and Presenter's Notes

Title: hillbig@is.s.u-tokyo.ac.jp


1
???????????
  • ?????????????????????????1? (???)
  • ??? ??
  • hillbig@is.s.u-tokyo.ac.jp

2
?????
  • ??
  • ????
  • ??????
  • Suffix Arrays / Suffix Trees
  • Static Dictionary
  • ?????????
  • CSA
  • CST
  • IIWT

3
?????
  • ??
  • ????
  • ??????
  • Suffix Arrays / Suffix Trees
  • Static Dictionary
  • ?????????
  • CSA
  • CST
  • IIWT

4
??
  • ?????????????????!
  • Web??? 80???? >1PB
  • MEDLINE(??????) >50GB
  • ???????? >800G base
  • ????/??????/?????????
  • ????????????????????!
  • ??????/??????????
  • ????????????????????????
  • G_at_ogle??????????????????.

??????????????? ??????????
5
?????????
  • ????
  • ????????T[1..N]????????????????????????o(N)????
  • ? ??????Suffix Arrays
  • ???????
  • ???????????(e.g. O(log2N)) ????????1/5??1/20??????
    ??
  • ???PC (Mem 4GB)?20GB????????????????(Mem
    16GB)?80GB??????????

6
?????
  • ??
  • ????
  • ??????
  • Suffix Arrays / Suffix Trees
  • Static Dictionary
  • ?????????
  • CSA
  • CST
  • IIWT

7
??????
  • ??????????????????
  • ???????
  • ??????????????
  • "book" <102, 187, 298, ...>  "boot" <609, 1029>
  • ????????????????????????????????

8
?????????
  • ????????????????????????
  • "book" <102, 187-102, 298-187, ...>
    = <102, 85, 111, ...>
  • ???????
  • s???? Elias 1975
  • golomb?? Golomb 1969
  • rice ?? Rice 1979
  • Interpolative?? Moffat 2000
  • Recursive Integer Code Moffat 2005
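
The gap transformation on this slide can be sketched in C as follows. This is only an illustration: the function names `gaps_encode`, `gaps_decode`, and `gamma_bits` are hypothetical, and the Elias gamma code length is used as one example of a variable-length integer code, which may not be the exact code the slide has in mind.

```c
#include <assert.h>
#include <stddef.h>

/* Replace sorted positions by gaps: <102,187,298> becomes <102,85,111>. */
void gaps_encode(const unsigned *pos, size_t n, unsigned *gap) {
    unsigned prev = 0;
    for (size_t i = 0; i < n; i++) {
        assert(pos[i] >= prev);          /* positions must be sorted */
        gap[i] = pos[i] - prev;
        prev = pos[i];
    }
}

/* Inverse: a running sum of the gaps restores the positions. */
void gaps_decode(const unsigned *gap, size_t n, unsigned *pos) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += gap[i];
        pos[i] = sum;
    }
}

/* Length in bits of the Elias gamma code of x >= 1: 2*floor(log2 x) + 1. */
unsigned gamma_bits(unsigned x) {
    assert(x >= 1);
    unsigned len = 0;
    while (x > 1) { x >>= 1; len++; }
    return 2 * len + 1;
}
```

For the "book" list, the gaps 85 and 111 each take 13 gamma bits, whereas the raw position 298 would take 17, which is where the saving comes from.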

9
Suffix Arrays Manber 1989
  • ??N????T[1..N] ????????S_i = T[i..N]
    ?T?suffix(???)????
  • S_i??????????????Suffix?????Suffix Arrays????
  • ??????P[1..M]?O(M log N)??????

Suffixes of T = abraca:
S1 abraca   S2 braca   S3 raca   S4 aca   S5 ca   S6 a   S7 (empty)
Sorted lexicographically:
S7 (empty)   S6 a   S1 abraca   S4 aca   S2 braca   S5 ca   S3 raca
Suffix Arrays: 7 6 1 4 2 5 3
???
10
Suffix Arrays??????
T = abracadabra, P = bra
???? ?????M ?????N??O(M log N) Hgt???????????O(M + log N)
SA : suffix
11 : (empty)
10 : a
 7 : abra
 0 : abracadabra
 3 : acadabra
 5 : adabra         bra > adabra
 8 : bra            bra = bra
 1 : bracadabra
 4 : cadabra        bra < cadabra
 6 : dabra
 9 : ra
 2 : racadabra
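
The binary search on this slide can be sketched in C as below. This is a minimal illustration, not the presenter's implementation: the names `sa_build` and `sa_find` are hypothetical, the array is built with a naive `qsort` over suffixes rather than any of the fast construction algorithms, and positions are 0-based as on the slide. The comparator reads a file-scope pointer, so the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char *sa_text;   /* text being indexed, read by the comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(sa_text + *(const int *)a, sa_text + *(const int *)b);
}

/* Fill sa[0..n] with the start positions of all suffixes of t, including
   the empty suffix at position n, in lexicographic order (naive sort). */
void sa_build(const char *t, int *sa) {
    int n = (int)strlen(t);
    for (int i = 0; i <= n; i++) sa[i] = i;
    sa_text = t;
    qsort(sa, (size_t)n + 1, sizeof sa[0], cmp_suffix);
}

/* Binary search over the sorted suffixes. Each comparison looks at up to
   M = strlen(p) characters, so the search costs O(M log N).
   Returns one text position where p occurs, or -1 if there is none. */
int sa_find(const char *t, const int *sa, const char *p) {
    int n = (int)strlen(t);
    size_t m = strlen(p);
    int lo = 0, hi = n;
    assert(m > 0);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(p, t + sa[mid], m);
        if (c == 0) return sa[mid];
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}
```

With T = abracadabra and P = bra, the search probes adabra (P is larger), then cadabra (P is smaller), then bra, and reports position 8.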
11
Suffix Arrays???,???
  • ???????????????????
  • Ko 03 Karkkainen 03 Kim 03
  • ?????????????O(logN)???Mamada 2003 Mori 2005 Maniscalco 2005
  • ????????????????????????? Huynh 05
  • ????????????????
  • ??????????????

12
Suffix Trees Weiner 1973
  • Suffix???????Trie????????????????????????????
  • ??Suffix Arrays?????
  • ?????1??????????Link (Suffix Link)????

???? abracadabra

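[Figure: suffix tree of abracadabra. Edges are labeled with substrings (a, bra, ra, c, d); each leaf carries the start position (0-11) of its suffix, and the leaves correspond to the entries of the suffix array.]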
13
Static Dictionary Munro 1996 Pagh 2001
  • bit?? B[1..N]????????o(N)??????????????????????????
    ????????
  • rank1(B,i)
  • B[1..i]??1?????
  • select1(B,i)
  • B1i??i???1??????
  • rank0(B,i) select0(B,i)??????
  • ???????????????

14
Static Dictionary ??
  • B = 0,0,1,1,0,0,1,1,0,1,0
  • rank1(B,1) = 0
  • rank1(B,2) = 0
  • rank1(B,3) = 1
  • rank1(B,8) = 4
  • select1(B,1) = 3
  • select1(B,3) = 7
  • rank0(B,2) = 2
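
The values above can be checked against a direct, linear-time reading of the definitions. The sketch below is only a reference implementation for that purpose (the names `rank_naive` and `select_naive` are hypothetical); it stores one bit per byte and uses none of the o(N) auxiliary structures the slides are about.

```c
#include <assert.h>
#include <stddef.h>

/* Number of entries equal to `bit` in B[1..i] (1-based, inclusive). */
size_t rank_naive(const unsigned char *B, size_t i, unsigned char bit) {
    size_t count = 0;
    for (size_t k = 0; k < i; k++)
        if (B[k] == bit) count++;
    return count;
}

/* 1-based position of the i-th occurrence of `bit` in B[1..n], or 0 if
   there are fewer than i occurrences. */
size_t select_naive(const unsigned char *B, size_t n, size_t i,
                    unsigned char bit) {
    size_t count = 0;
    assert(i >= 1);
    for (size_t k = 0; k < n; k++)
        if (B[k] == bit && ++count == i) return k + 1;
    return 0;
}
```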

15
Static Dictionary???
  • ??????????????????????????????????????????
  • ????????Gonzalez 2005
  • rank(i) ?i/64??????????????i mod
    64?1???popCount????(add 6? shift, 12?)
  • select?rank?????????????
  • rank0(B,i) = i - rank1(B,i)
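
A minimal C sketch of the practical scheme described on this slide: one precomputed count per 64-bit word, plus a popcount over the low i mod 64 bits. The type and function names (`rank_dict`, `rd_init`, `rd_rank1`, `rd_rank0`, `rd_select1`) are hypothetical. Two assumptions are made here: select is done by binary search over rank, which is one common way to derive it and may differ from what the slide intends, and `popcount64` is a simple loop rather than the branch-free version from the deck. Bits past `nbits` in the last word are assumed to be zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const uint64_t *words;  /* position p (1-based) is bit (p-1)%64 of words[(p-1)/64] */
    size_t nbits;
    size_t *block;          /* block[j] = number of 1s in words[0..j-1] */
} rank_dict;

static unsigned popcount64(uint64_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; c++; }   /* clear the lowest set bit */
    return c;
}

/* Precompute the per-word prefix counts. Returns 0 on success, -1 on
   allocation failure. The caller releases d->block with free(). */
int rd_init(rank_dict *d, const uint64_t *words, size_t nbits) {
    size_t nwords = (nbits + 63) / 64;
    d->words = words;
    d->nbits = nbits;
    d->block = malloc((nwords + 1) * sizeof *d->block);
    if (!d->block) return -1;
    d->block[0] = 0;
    for (size_t j = 0; j < nwords; j++)
        d->block[j + 1] = d->block[j] + popcount64(words[j]);
    return 0;
}

/* rank1(B,i): number of 1s among the first i bits. */
size_t rd_rank1(const rank_dict *d, size_t i) {
    assert(i <= d->nbits);
    size_t r = d->block[i / 64];
    if (i % 64)
        r += popcount64(d->words[i / 64] & ((UINT64_C(1) << (i % 64)) - 1));
    return r;
}

/* rank0(B,i) = i - rank1(B,i). */
size_t rd_rank0(const rank_dict *d, size_t i) { return i - rd_rank1(d, i); }

/* select1(B,i): smallest position p with rank1(B,p) == i, or 0 if none. */
size_t rd_select1(const rank_dict *d, size_t i) {
    if (i == 0 || rd_rank1(d, d->nbits) < i) return 0;
    size_t lo = 1, hi = d->nbits;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rd_rank1(d, mid) >= i) hi = mid; else lo = mid + 1;
    }
    return lo;
}
```

The 11-bit example B = 0,0,1,1,0,0,1,1,0,1,0 packs into the single word 0x2CC when position 1 is stored in the lowest bit.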

16
64bit??1?????? (C)
    unsigned long long popCount(unsigned long long x) {
        /* add neighbouring fields of width 1, 2, 4, 8, 16, then 32 bits */
        x = ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1)  + (x & 0x5555555555555555ULL);
        x = ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2)  + (x & 0x3333333333333333ULL);
        x = ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4)  + (x & 0x0F0F0F0F0F0F0F0FULL);
        x = ((x & 0xFF00FF00FF00FF00ULL) >> 8)  + (x & 0x00FF00FF00FF00FFULL);
        x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) + (x & 0x0000FFFF0000FFFFULL);
        x = ((x & 0xFFFFFFFF00000000ULL) >> 32) + (x & 0x00000000FFFFFFFFULL);
        return x;
    }

17
?????
  • ??
  • ????
  • ??????
  • Suffix Arrays / Suffix Trees
  • Static Dictionary
  • ?????????
  • CSA
  • CST
  • IIWT

18
??????
  • ?????????????
  • Compressed Suffix Arrays (CSA)
  • Vertical Code
  • Compressed Suffix Trees (CST)
  • Parenthesis Tree (???)
  • Inverted file Indexing with Wavelet Tree (IIWT)
  • Wavelet Tree

19
Compressed Suffix Arrays (CSA) Grossi 2001, 2003 Sadakane 2003
  • CSA?O(NH0)bit?????????
  • H0??????0???????
  • ????H02.54bit
  • ???SA??NlogN bit?? (4N Byte)
  • 1GB?????SA??????????4GB???????CSA?200MB500MB
  • ??????????????
  • ?????????????????

20
CSA???
  • SA??????????4N(8N)byte
  • SA?1??N???Permutation?????????
  • SA????????????Ψ[i] = SA^-1[SA[i] + 1]
    ???Ψ[0] = SA^-1[0], SA[j] = i ?? SA^-1[i] = j
  • SA???????????SA[i]????????????SA????????????????
    SA[i] = SA[Ψ[i]] - 1 = SA[Ψ^2[i]] - 2 = ...
    = SA[Ψ^n[i]] - n ??? SA[Ψ^n[i]] = p ?? SA[i] = p - n

21
????
  • Ψ[i] = SA^-1[SA[i] + 1]
  • ???i?????Suffix??????????Suffix????
  • ?? T[SA[i]] = T[SA[j]], i < j ???Ψ[i] < Ψ[j]
  • SA??Suffix???????????????????Suffix
    S_i?S_j?1?????????2??????????????????
  • ????S_(i+1) < S_(j+1) ????SA^-1[SA[i]+1] < SA^-1[SA[j]+1]
    (Ψ[i] < Ψ[j])
  • ???????Suffix?????i?????

22
 i   SA    Ψ   Suffix
 0   11    3   (empty)
 1   10    0   a
 2    7    6   abra
 3    0    7   abracadabra
 4    3    8   acadabra
 5    5    9   adabra
 6    8   10   bra
 7    1   11   bracadabra
 8    4    5   cadabra
 9    6    2   dabra
10    9    1   ra
11    2    4   racadabra
??Suffix??????????????? ???????????????????????
23
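i    0  1  2  3  4  5  6  7  8  9 10 11
SA  11 10  7  0  3  5  8  1  4  6  9  2
Ψ    3  0  6  7  8  9 10 11  5  2  1  4
SA[i] mod 4 = 0???????
SA[7] = SA[Ψ[7]] - 1 = SA[11] - 1
      = SA[Ψ[11]] - 2 = SA[4] - 2
      = SA[Ψ[4]] - 3 = SA[8] - 3 = 4 - 3
      = 1
SA[i] = SA[Ψ[i]] - 1???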
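
The lookup traced on this slide can be sketched in C as follows. The function names `psi_build` and `sa_lookup` are hypothetical. The sketch assumes the conventions of the abracadabra table: text positions run from 0 to n-1, the last one is the empty suffix, and Ψ wraps from that suffix back to position 0 (which is why Ψ[0] = 3 in the table). Because of that wrap-around the result is reduced modulo n. Ψ is kept as a plain array here; the compressed encoding is a separate matter.

```c
#include <assert.h>

/* Compute psi[i] = SA^-1[(SA[i] + 1) mod n]. `inv` is caller-provided
   scratch space of n ints that receives the inverse suffix array. */
void psi_build(const int *sa, int n, int *inv, int *psi) {
    for (int i = 0; i < n; i++) {
        assert(sa[i] >= 0 && sa[i] < n);
        inv[sa[i]] = i;
    }
    for (int i = 0; i < n; i++)
        psi[i] = inv[(sa[i] + 1) % n];
}

/* Recover SA[i] when only some entries of SA are stored.
   sampled[i] holds SA[i] for stored rows and -1 for all others.
   Each psi step moves to the suffix starting one position later, so
   after `hops` steps the stored value is SA[i] + hops (mod n). */
int sa_lookup(const int *sampled, const int *psi, int n, int i) {
    int hops = 0;
    while (sampled[i] < 0) {
        i = psi[i];
        hops++;
    }
    return ((sampled[i] - hops) % n + n) % n;
}
```

With the entries satisfying SA[i] mod 4 = 0 stored, looking up row 7 visits rows 11, 4, and 8, finds the stored value 4 after three hops, and returns 4 - 3 = 1.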
24
????????????
  • ????????????????????????????????????
  • d[i] = Ψ[i+1] - Ψ[i]
  • d[i]???????????O(N H0)???????????????Sadakane 2003.
  • ????????????????????????????????d[i]??????????
    Ψ[i] = Ψ[k] + d[k] + d[k+1] + ... + d[i-1]

25
Vertical Code Okanohara 2005 (to appear)
  • ?????????????????????????????????
  • Σ_{i=1..N} d[i]????????????
  • ???????????Recursive Integer Code Moffat 2005
  • Vertical Code???Byte????????
  • (c.f. Range Coder????????Byte?????????????????????
    ????????)

26
Vertical Code???
  1. d[1..N]?M???Block????d[1..M], d[M+1..2M], d[2M+1..3M], ...
  2. ?????Block?????????bit????MaxBit_i????
     MaxBit_1 = log2(Max(d[1],...,d[M])) + 1
     MaxBit_2 = log2(Max(d[M+1],...,d[2M])) + 1
  3. i???Block????k bit??????bit??V_i^k???
     V_1^1 = d[1] & 1, d[2] & 1, ...
     V_1^2 = (d[1] >> 1) & 1, (d[2] >> 1) & 1, ...

27
Vertical Code???(?)
  • MaxBit, V_i^1..V_i^MaxBit_i, Ψ[iM]?????(M = 8???????
    ??Byte?????????????)
  • Ψ[t] = Σ_{i=1..t} d[i] ??????,
  • j = t / M
  • k = t mod M
  • x = Ψ[jM]
  • for (i = 1; i <= MaxBit_j; i++) x +=
    popCount(V_j^i & ((1 << k) - 1)) << (i - 1)

28
Vertical Code??
  • Ψ[1..8] = 3,6,12,15,18,21,28,31
    d[1..8] = 3,3,6,3,3,3,7,3

d               3 3 6 3 3 3 7 3
V_1^1 (bit 1)   1 1 0 1 1 1 1 1
V_1^2 (bit 2)   1 1 1 1 1 1 1 1
V_1^3 (bit 3)   0 0 1 0 0 0 1 0
3 = 011 (binary)
29
Vertical Code??
  • Ψ[1..8] = 3,6,12,15,18,21,28,31
    d[1..8] = 3,3,6,3,3,3,7,3

d       3 3 6 3 3 3 7 3
V_1^1   1 1 0 1 1 1 1 1
V_1^2   1 1 1 1 1 1 1 1
V_1^3   0 0 1 0 0 0 1 0
Ψ[6]????
Ψ[6] = d[1] + d[2] + ... + d[6]
     = popCount(V_1^1 & ((1<<6)-1))
     + (popCount(V_1^2 & ((1<<6)-1)) << 1)
     + (popCount(V_1^3 & ((1<<6)-1)) << 2)
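
The calculation above can be sketched in C for a single block of M = 8 values, so that each bit plane fits in one byte. The names `vc_block`, `vc_encode`, and `vc_prefix_sum` are hypothetical, as is the layout choice of storing d[1] in the lowest bit of each plane (this matches the mask (1<<k)-1 used on the slide). It illustrates the idea only and is not the author's implementation; a full structure would also store Ψ at block boundaries and add it to the result.

```c
#include <assert.h>
#include <stdint.h>

enum { VC_M = 8, VC_MAXBITS = 32 };

typedef struct {
    unsigned maxbit;             /* bits needed by the largest value in the block */
    uint8_t plane[VC_MAXBITS];   /* plane[b] collects bit b of d[1..8] */
} vc_block;

static unsigned popcount8(uint8_t x) {
    unsigned c = 0;
    while (x) { x &= (uint8_t)(x - 1); c++; }   /* clear the lowest set bit */
    return c;
}

/* Split eight values into bit planes ("vertical" layout). */
void vc_encode(const uint32_t d[VC_M], vc_block *blk) {
    uint32_t all = 0;
    for (int i = 0; i < VC_M; i++) all |= d[i];
    blk->maxbit = 0;
    while (all) { blk->maxbit++; all >>= 1; }
    for (unsigned b = 0; b < blk->maxbit; b++) {
        blk->plane[b] = 0;
        for (int i = 0; i < VC_M; i++)
            blk->plane[b] |= (uint8_t)(((d[i] >> b) & 1u) << i);
    }
}

/* d[1] + ... + d[k] for 0 <= k <= 8: one masked popcount per bit plane,
   weighted by the value of that bit. */
uint32_t vc_prefix_sum(const vc_block *blk, unsigned k) {
    assert(k <= VC_M);
    uint8_t mask = (uint8_t)((1u << k) - 1);
    uint32_t x = 0;
    for (unsigned b = 0; b < blk->maxbit; b++)
        x += (uint32_t)popcount8((uint8_t)(blk->plane[b] & mask)) << b;
    return x;
}
```

For d = 3,3,6,3,3,3,7,3 the three planes are 0xFB, 0xFF, and 0x44, and the prefix sum for k = 6 is 5 + (6 << 1) + (1 << 2) = 21, which agrees with Ψ[6].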
30
Compressed Suffix Trees (CST) Sadakane 2005
  • T[1..N]?Suffix Trees?CSA + 6N bit???
  • ???????2Byte / ?? Okanohara 2005
  • ????????100N bit ??
  • Suffix Tree?????????????
  • ??????????????
  • Suffix Tree??T1N????????,???????N1???2N-1???
  • ??????????????????12 NByte??

31
???Jacobson 1989 Raman 2001 Geary 2004
  • ??????????????????(????)??????????
  • findclose(i) i?????????????????????
  • findopen(i)i?????????????????????
  • enclose(i)i???????????????????????????

(()(()()))
32
??????????????
  • parent(i) = enclose(i)
  • child(i) = i + 1
  • sibling(i) = findclose(i) + 1

(()(()()))
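
A naive C sketch of these navigation operations on a parenthesis string. It scans linearly, so it shows only what the operations return, not the block and pioneer structure that makes them fast. The function names are hypothetical, positions are 0-based (the slides may count from 1), and a node is identified by the position of its opening parenthesis. Unlike the bare formulas above, `bp_child` and `bp_sibling` return -1 when the node has no child or no next sibling.

```c
#include <assert.h>

/* Position of the ')' matching the '(' at position i, or -1 if unmatched. */
int bp_findclose(const char *p, int i) {
    int depth = 0;
    assert(p[i] == '(');
    for (int k = i; p[k] != '\0'; k++) {
        depth += (p[k] == '(') ? 1 : -1;
        if (depth == 0) return k;
    }
    return -1;
}

/* Position of the '(' of the closest pair enclosing the pair that opens
   at i, or -1 if i is the root. */
int bp_enclose(const char *p, int i) {
    int depth = 0;
    assert(p[i] == '(');
    for (int k = i - 1; k >= 0; k--) {
        depth += (p[k] == ')') ? 1 : -1;
        if (depth < 0) return k;
    }
    return -1;
}

int bp_parent(const char *p, int i) { return bp_enclose(p, i); }

/* First child: the next position, provided it opens a pair. */
int bp_child(const char *p, int i) {
    assert(p[i] == '(');
    return p[i + 1] == '(' ? i + 1 : -1;
}

/* Next sibling: the position after the matching ')', provided it opens a pair. */
int bp_sibling(const char *p, int i) {
    int c = bp_findclose(p, i);
    return (c >= 0 && p[c + 1] == '(') ? c + 1 : -1;
}
```

On (()(()())) the root opens at position 0; its first child is at 1, that child's sibling is at 3, and the node at 3 has the children 4 and 6.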
33
??????
  • ????M???????????
  • ???3??????
  • near ??????????????????
  • far near????
  • pioneer far?????????far?????????????????????

(()(((
))()()
)())
(
pioneer
(
far
34
??????(?)
  • ???????pioneer?????????pioneer????
  • ????????????????????????B???pioneer?????????4B-6??
    ?????
  • pioneer????????????????????????

(()(((
))()()(
))())
(()())
35
???????
  • findclose(i)
  • i???????????
  • near ? ????????????????
  • pioneer ? pioneer?????????
  • far ? ???pioneer???????????
    ?????????????? ???pioneer???????????
    ????????????????
  • ?????????

36
Inverted file Indexing with Wavelet Tree (IIWT)
Okanohara 2005 (to appear)
  • ????????????
  • ??????????S,????M????O(MlogS)?????
    (???????????0????????H0?????MH0bit)
  • ??????O(H0)???????
  • ?????i???????
  • ??????????????????
  • N = 10GB, M = 1G, S = 1024??1GB?????

37
Wavelet Tree Grossi 2003
  • ??CSA, FM-index????????????
  • ?????????S?????????T[1..N]?????????????a?S?rank,
    select?O(log S)???????????????
  • Static Dictionary?S = 2??
  • ???????O(NH0)

38
Wavelet Tree??
  • S = {a,b,c}: a = 0, b = 10, c = 11 (binary)
  • T = abbccbaacbab

[Figure: wavelet tree]
Root (0: a, 1: b or c)     abbccbaacbab -> 011111001101
  0 child: leaf a
  1 child (0: b, 1: c)     bbccbcbb -> 00110100
39
Wavelet Tree??
  • S = {a,b,c}: a = 0, b = 10, c = 11 (binary)
  • T = abbccbaacbab

rank_b(8) = 3
Root (0: a, 1: b or c)     abbccbaacbab -> 011111001101     rank1(8) = 5
  0 child: leaf a
  1 child (0: b, 1: c)     bbccbcbb -> 00110100             rank0(5) = 3
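
The rank_b(8) walk-through can be reproduced with the small C sketch below. It hard-codes the two bit vectors of this slide's tree for S = {a,b,c} and T = abbccbaacbab instead of building a general wavelet tree, and it stores the bits as '0'/'1' characters with a linear-time rank for readability. The names `bits_rank` and `wt_rank` are hypothetical. A real structure would put a static dictionary on each node's bit vector.

```c
#include <assert.h>
#include <stddef.h>

/* Occurrences of the bit character b ('0' or '1') among the first i bits. */
static size_t bits_rank(const char *bits, size_t i, char b) {
    size_t c = 0;
    for (size_t k = 0; k < i; k++)
        if (bits[k] == b) c++;
    return c;
}

/* Node bit vectors for codes a = 0, b = 10, c = 11. */
static const char WT_ROOT[]  = "011111001101";  /* T: 0 for a, 1 for b or c */
static const char WT_CHILD[] = "00110100";      /* bbccbcbb: 0 for b, 1 for c */

/* Number of occurrences of symbol s in T[1..i]. */
size_t wt_rank(char s, size_t i) {
    assert(i <= sizeof WT_ROOT - 1);
    assert(s == 'a' || s == 'b' || s == 'c');
    if (s == 'a')
        return bits_rank(WT_ROOT, i, '0');
    /* j = how many of T[1..i] were sent to the right child */
    size_t j = bits_rank(WT_ROOT, i, '1');
    return bits_rank(WT_CHILD, j, s == 'b' ? '0' : '1');
}
```

For `wt_rank('b', 8)` the root gives rank1(8) = 5, and the child gives rank0(5) = 3, matching the slide.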
40
IIWT
  • ???????????1?????????0???bit??T????
  • ?????????????????T1M????Wavelet Tree???
  • ??????ab?????p????????????????rank_p(b) -
    rank_p(a)
  • ???p??i????????select_p(T,i)??????

this book is interesting.
1000010000100100000000000
41
IIWT???/??
  • ??
  • ?????????????????O(log S) (CSA???????????O(N))
  • ??????????????????????????????????(??????????????
    ????)
  • ??
  • ????????????

42
??????
  • ????????????????????????????
  • ???PC(???4GB)?20GB??????(??????16GB)?80GB
  • 1TB, 1PB??????????????????????????????
  • ??800GB?????426GB?TREK Data?????
  • ?????????????????????
  • Example Based Learning???
  • ????????????????????????