Title: Pr
1 XML processing in DHT networks
Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, Chong Sun
INRIA-Saclay, UC Santa Cruz
2 Outline
- Topic
- KadoP System
- Overview of DHT
- Query evaluation
- Optimization techniques
- DPP: Distributed Posting Partitioning
- Structural Bloom Filters
- Conclusion
3 Topic
- Querying a large volume of content in a P2P network for a community of users
- Focus on indexing
- Content: XML
- P2P network: structured, built around a DHT
- XML indexing in DHT networks
4 Example: the EDOS distribution system
- A system for managing a Linux distribution (Mandriva)
- System releases
- About 10 000 software packages, with metadata (XML)
- Community of open-source developers: thousands
- Functionalities
- Publish/update releases
- Query the metadata
- Retrieve packages
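5 The KadoP system
6 DHT: a P2P indexing infrastructure
[Figure: ring of peers labelled "Pastry", with IDs up to ID15. Node labels: ID0 = x mod 2^4, ID1 = (x + 2^0) mod 2^4, ID2 = (x + 2^1) mod 2^4, ID4 = (x + 2^2) mod 2^4, ID8 = (x + 2^3) mod 2^4. Legend: pointer in the finger table; look-up(K) from client ID0; look-up(K) from client ID1.]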
- Use a ring
- Each peer takes an ID in the space modulo 2^N
- Each peer stores (K, Object) pairs, for K satisfying ID(peer) <= K < ID(next peer)
- Which API?
- locate(K) -> Peer IP
- get(K) -> Object
- put(K, Object)
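The ring and its three calls can be sketched in a few lines. This is a minimal in-memory illustration, not KadoP's or Pastry's implementation: the class name ToyRing, the SHA-1 key hashing, and the peer IDs are assumptions made for the example, and locate returns a peer ID rather than an IP address. It has no networking and no finger tables.

```python
from bisect import bisect_right
from hashlib import sha1

N_BITS = 4  # ID space is modulo 2**N_BITS, i.e. 16 IDs as in the slide figure


class ToyRing:
    """In-memory stand-in for a DHT ring (hypothetical, for illustration)."""

    def __init__(self, peer_ids):
        self.peer_ids = sorted(peer_ids)
        self.store = {pid: {} for pid in self.peer_ids}

    @staticmethod
    def key_id(key):
        # Map a key into the ID space; the hash function is an assumption.
        return int(sha1(key.encode()).hexdigest(), 16) % (2 ** N_BITS)

    def locate(self, key):
        # Responsible peer: largest peer ID <= key ID.
        # Index -1 wraps around to the last peer on the ring.
        i = bisect_right(self.peer_ids, self.key_id(key)) - 1
        return self.peer_ids[i]

    def put(self, key, obj):
        self.store[self.locate(key)][key] = obj

    def get(self, key):
        return self.store[self.locate(key)].get(key)


ring = ToyRing([0, 1, 2, 4, 8, 15])
ring.put("author", ["posting-1"])
```

Every get for a key reaches the same peer as the put for that key, which is what lets an index entry be found without any central directory.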
7 Advantages and disadvantages
- Advantages
- Availability and reliability
- No central bottleneck; replication
- Scalability
- Scalable solution for keyword queries
- Disadvantages
- Difficult to maintain the structure
- Not suited to a transient population of peers
8 XML query processing in KadoP
- Query evaluation
- Step 1.
- Given an XQuery Q, decompose Q into tree pattern queries
- Evaluate each tree pattern query using the DHT index to identify a set of candidate peers P that can provide answers
- Step 2.
- Ship Q to these peers P and evaluate it there
9 Indexing XML documents
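[Figure: Doc.xml drawn as a tree with nodes A, B, C, D, E, F, G and the value John; each node is annotated with the numeric identifiers used in its posting.]
- X ancestor of Y iff start(X) < start(Y) < end(X)
- X parent of Y iff X ancestor of Y and level(X) = level(Y) - 1
- Posting: (peer, doc, start, end, level)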
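The two tests above can be tried on a small document. The sketch below assumes one common numbering scheme (a counter incremented when an element opens and again when it closes); the exact labels of the slide's figure are not recoverable, so the numbers here may differ from it. The names index_document, is_ancestor and is_parent are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

Posting = namedtuple("Posting", "peer doc start end level")


def index_document(xml_text, peer, doc):
    """Return {label: [Posting, ...]} for every element of one document."""
    postings = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter               # assigned when the element opens
        for child in elem:
            visit(child, level + 1)
        counter += 1
        end = counter                 # assigned when the element closes
        postings.setdefault(elem.tag, []).append(
            Posting(peer, doc, start, end, level))

    visit(ET.fromstring(xml_text), 1)
    return postings


def is_ancestor(x, y):
    # Only postings of the same document can be structurally related.
    return (x.peer, x.doc) == (y.peer, y.doc) and x.start < y.start < x.end


def is_parent(x, y):
    return is_ancestor(x, y) and x.level == y.level - 1


p = index_document("<A><B><D/></B><C/></A>", "p1", "d1")
a, b, c, d = p["A"][0], p["B"][0], p["C"][0], p["D"][0]
```

With this encoding, structural relationships are decided from the postings alone, without opening the documents.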
10 XML indexing in DHT
- Publish the postings via the DHT
- put(k, postings), where k is a label or a keyword
- Remark: all the postings for author accumulate at the same peer
[Figure: peers p1 and p2 call put(author, (p1, d2, start, end, lev)) and put(author, (p2, d2, start, end, lev)) through the DHT; the posting list for author is stored at peer p(author).]
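A toy version of this publication step, assuming a plain dictionary in place of the DHT and made-up position values: the point it illustrates is only that postings published by different peers under the same key end up in a single list.

```python
from collections import defaultdict

# Hypothetical stand-in for the DHT: one entry per key k.
index = defaultdict(list)


def put(k, postings):
    """Append postings (peer, doc, start, end, level) to the list for key k."""
    index[k].extend(postings)
    index[k].sort()  # keep the list ordered by (peer, doc, start, ...)


# Two peers publish postings for the same label; the values are illustrative.
put("author", [("p2", "d2", 3, 4, 2)])
put("author", [("p1", "d2", 5, 8, 2)])
put("title", [("p1", "d2", 2, 3, 2)])
```

For a frequent label the single list grows with the whole collection, which is the source of the long postings discussed in the following slides.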
11 Some technical issues
- Goal: manage millions of documents with thousands of peers
- First experiments were a disaster
- First work
- Replace the DHT's file-system index storage with storage in a database (Berkeley DB)
- Extend the API of the DHT with Append, in addition to Read/Write
- Extend the API of the DHT with a streaming exchange of postings
- With these changes, KadoP scaled but was still slow, e.g. because of long postings
12 Optimization
13 Main issue: long postings
- Transferring long postings hurts performance
- Bad response time
- Parallelization: Distributed Posting Partitioning (DPP)
- Communication load
- Bloom filter: Structural Bloom Filter
[Figure: peer p(Name) holds the long posting list for Name.]
14 DPP structure
[Figure: the long posting list for Name at p(Name), ordered by (p, d), with the entries (p1,d1), (p2,d2), (p3,d3) and (p4,d4) marked.]
- DPP structure
- Split and distribute postings according to conditions
- Each condition is an interval, e.g. C1 = [(p1,d1), (p2,d2)]
- Any two conditions are over disjoint intervals
- Some kind of B-tree for postings
[Figure: the conditions C1, C2, C3, C4 kept at p(Name).]
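The splitting step can be sketched as follows. This is an illustrative reading of the slide, not KadoP's code: the function name partition_postings and the block_size parameter are assumptions, and the policy of keeping all postings of one document in the same block is a choice made here so that the conditions stay disjoint.

```python
def partition_postings(postings, block_size):
    """Split a posting list sorted by (peer, doc) into blocks.

    Returns [(condition, block)], where a condition is the closed interval
    (first (peer, doc), last (peer, doc)) covered by the block.
    """
    blocks, current = [], []
    for posting in postings:
        key = posting[:2]
        # Start a new block only at a document boundary.
        if len(current) >= block_size and current[-1][:2] != key:
            blocks.append(current)
            current = []
        current.append(posting)
    if current:
        blocks.append(current)
    return [((b[0][:2], b[-1][:2]), b) for b in blocks]


name_postings = [
    ("p1", "d1", 1, 2, 1), ("p1", "d1", 3, 4, 1), ("p1", "d2", 1, 2, 1),
    ("p2", "d1", 1, 2, 1), ("p3", "d1", 1, 2, 1),
]
parts = partition_postings(name_postings, block_size=2)
```

Each block could then be stored on a different peer, while p(Name) keeps only the small list of conditions.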
15 Query processing (no DPP)
[Figure: index query (index-Q) over the terms article, abstract, author, database and Ullman; the peers holding the posting lists for these terms send them to the QP-peer.]
- Pipelined transfer of postings to the query processing (QP) peer
- Holistic twig-join algorithm to compute the result in parallel at the QP peer
16 Query processing with DPP
[Figure: at p(client), the conditions C1 to C5 for the terms abstract and XML, sorted according to (p, d); each condition refers to a peer holding the corresponding postings.]
- Fetch the conditions C1-C5 from p(abstract) and p(XML)
- Prune intervals
- Transfer the postings and compute the join for each sub-interval in parallel
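The pruning step can be illustrated with a merge over two sorted lists of conditions. The sketch assumes that each condition is a closed interval of (peer, doc) pairs and that the conditions of one term are sorted and pairwise disjoint; the function name and the sample intervals are hypothetical.

```python
def overlapping_conditions(conds_a, conds_b):
    """Return the pairs (ca, cb) of conditions whose intervals overlap.

    Pairs that do not overlap share no document, so they cannot contribute
    to the join and their postings need not be transferred.
    """
    pairs, i, j = [], 0, 0
    while i < len(conds_a) and j < len(conds_b):
        (lo_a, hi_a), (lo_b, hi_b) = conds_a[i], conds_b[j]
        if lo_a <= hi_b and lo_b <= hi_a:
            pairs.append((conds_a[i], conds_b[j]))
        # Advance the interval that ends first.
        if hi_a <= hi_b:
            i += 1
        else:
            j += 1
    return pairs


conds_abstract = [(("p1", "d1"), ("p1", "d9")), (("p2", "d1"), ("p3", "d5"))]
conds_xml = [(("p1", "d5"), ("p2", "d3")),
             (("p4", "d1"), ("p4", "d2")),
             (("p5", "d1"), ("p5", "d2"))]
pairs = overlapping_conditions(conds_abstract, conds_xml)
```

Each surviving pair is an independent sub-join, which is where the parallelism comes from.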
17 Experiments
- Platform
- Grid5000: platform for research in P2P systems
- Distributed geographically across 6 sites in France
- KadoP tested on more than 100 machines
- 1000 logical peers
- Conclusions in brief
- Good performance
- KadoP scales very nicely
- Issue: does not support high churn of peers (index copying)
18 Query response time
Q: article//author//Ullman
19 Optimization
- (b) Structural Bloom Filters
- Ancestor Bloom Filter
- Also in the paper: Descendant Bloom Filter (DBF)
20 Using an Ancestor Bloom Filter
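- Query: a//b
- Compute the Bloom filter of the a-postings, ABF(a), and send it to p(b)
- At p(b), compute the b-postings that have an a-ancestor (and more, because of false positives)
- Send them to p(a), which can compute the answer
[Figure: p(a) holds the posting list L(a) and p(b) holds L(b); ABF(a) and the filtered list F(b, ABF(a)) are exchanged between p(a) and p(b) through the DHT.]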
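21 Technique: dyadic intervals
Dyadic intervals over the positions 1 to 8:
- length 2^3: [1,8]
- length 2^2: [1,4] [5,8]
- length 2^1: [1,2] [3,4] [5,6] [7,8]
- length 2^0: [1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8]
[Figure: the elements ap and bp placed on the position axis by their start and end values.]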
- Dyadic covers
- D(ap) = {[1,4], [5,6], [7,7]}
- ap is an ancestor of bp if
- there is an interval I in D(ap) such that start(bp) is in I
- Here 3 is in [1,4], so the answer is yes!
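The cover of the example can be recomputed with a short greedy procedure. This is a sketch under the assumption that positions are 1-based and that the domain size is a power of two (8 here); the function names are hypothetical.

```python
def dyadic_cover(start, end):
    """Decompose [start, end] into maximal aligned dyadic intervals."""
    cover, pos = [], start
    while pos <= end:
        size = 1
        # Double the block while it stays aligned and inside [start, end].
        while (pos - 1) % (2 * size) == 0 and pos + 2 * size - 1 <= end:
            size *= 2
        cover.append((pos, pos + size - 1))
        pos += size
    return cover


def dyadic_intervals_containing(point, domain_size):
    """All dyadic intervals of [1, domain_size] that contain point."""
    result, size = [], 1
    while size <= domain_size:
        low = (point - 1) // size * size + 1
        result.append((low, low + size - 1))
        size *= 2
    return result


def start_is_covered(cover, start_b, domain_size=8):
    """True if some dyadic interval containing start_b belongs to the cover."""
    return any(i in cover
               for i in dyadic_intervals_containing(start_b, domain_size))


cover_ap = dyadic_cover(1, 7)   # ap spans positions 1 to 7 in the example
```

A point lies in only log2(domain size) + 1 dyadic intervals, so the containment test becomes a handful of exact-match lookups, which is what makes it compatible with a Bloom filter.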
22 Ancestor Bloom Filter (simplified)
- Publication: for each document d, each ap in d, and each interval I in D(ap)
- Insert a trace in the Bloom filter
- Say T[h(d, I)] = 1 for some hash function h
- Test for bp in d:
- for each dyadic interval I such that start(bp) is in I,
- test if T[h(d, I)] = 1
- If one test is positive, conclude that bp in d is a solution
- False positives because of hash collisions
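The publication and test steps above can be put together in a small sketch. It follows the simplified description on this slide only: a single hash function, a table with one byte per bit for readability, and a fixed domain of 8 positions. The table size M, the use of SHA-256 and all function names are assumptions of the example.

```python
from hashlib import sha256

M = 1 << 16   # number of slots in the table T (hypothetical size)
DOMAIN = 8    # positions range over [1, DOMAIN], as in the dyadic example


def h(doc, interval):
    data = f"{doc}|{interval[0]}|{interval[1]}".encode()
    return int.from_bytes(sha256(data).digest()[:8], "big") % M


def dyadic_cover(start, end):
    cover, pos = [], start
    while pos <= end:
        size = 1
        while (pos - 1) % (2 * size) == 0 and pos + 2 * size - 1 <= end:
            size *= 2
        cover.append((pos, pos + size - 1))
        pos += size
    return cover


def containing(point):
    sizes = [2 ** k for k in range(DOMAIN.bit_length())]
    return [((point - 1) // s * s + 1, (point - 1) // s * s + s) for s in sizes]


def build_abf(a_postings):
    """a_postings: iterable of (doc, start, end). Returns the table T."""
    table = bytearray(M)
    for doc, start, end in a_postings:
        for interval in dyadic_cover(start, end):
            table[h(doc, interval)] = 1
    return table


def filter_b(b_postings, table):
    """Keep the b-postings that may have an a-ancestor (no false negatives)."""
    return [(doc, start, end) for doc, start, end in b_postings
            if any(table[h(doc, i)] for i in containing(start))]


abf = build_abf([("d1", 1, 7)])
kept = filter_b([("d1", 3, 3)], abf)
```

A b-posting whose start falls inside an a-element is always kept; postings kept by accident are removed later, when the exact join is computed at p(a).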
23 Query evaluation strategies
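[Figure: two strategies over the peers p(a), p(b), p(c) and p(d).
Ancestor Bloom Reducer: ABF(a) is sent to p(b), which computes b = F(b, ABF(a)); ABF(b) is sent to p(c) and p(d), which compute c = F(c, ABF(b)) and d = F(d, ABF(b)).
Descendant Bloom Reducer: DBF(c) and DBF(d) are sent to p(b), which computes b = F(b, DBF(c) and DBF(d)); DBF(b) is sent to p(a).]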
24 Performance
25 Conclusion
26 Related work
- Very active area
- DHT-based platforms for XML data management
- Locating data sources (Galanis et al., VLDB'03)
- XPath lookup queries in P2P networks (Bonifati et al., WIDM'04)
- Other DHT-based systems for data management
- PIER query processor (Huebsch et al., CIDR'05)
- Indexing in P2P networks (Aberer et al., VLDB'05)
- Dyadic intervals
- Maintenance of dynamic intervals (Gilbert et al., VLDB'02)
27 Contribution
- Two optimization techniques for index processing
- Distributed Posting Partitioning
- Structural Bloom Filters
- A full system for P2P XML indexing
- As opposed to a simulation
- Lots of engineering details that are important for performance
- Extensively tested for performance
- Tested with a real application, EDOS
28 Ongoing and future work
- New indexing techniques
- Trading-off precision for performance
- Publish summaries of documents
- Index/transfer postings at a coarse level of detail
- Index views (query caching)
- Query optimizer for KadoP
- This is standard distributed query processing
- Use standard optimization techniques, e.g. the OptiMax ActiveXML optimizer (demo at ICDE'08)
- Develop what is specific to KadoP: the cost model
30 Indexing time