1
XML processing in DHT networks
Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, Chong Sun
INRIA-Saclay, UC Santa Cruz
1
2
Outline
  • Topic
  • KadoP System
  • Overview of DHT
  • Query evaluation
  • Optimization techniques
    • DPP: Distributed Posting Partitioning
    • Structural Bloom Filters
  • Conclusion

2
3
Topic
  • Querying large volumes of content in a P2P network
    for a community of users
  • Focus on indexing
  • Content: XML
  • P2P network: structured, built around a DHT
  • XML indexing in DHT networks

3
4
Example: the EDOS distribution system
  • A system for managing a Linux distribution
    (Mandriva)
  • System releases
    • About 10,000 software packages, with metadata (XML)
  • Community of open-source developers: thousands
  • Functionalities
    • Publish/update releases
    • Query the metadata
    • Retrieve packages

4
5
The KadoP system
5
6
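DHT: a P2P indexing infrastructure
[Figure: ring of 2^4 = 16 peer IDs (0 to 15), Pastry. Finger-table pointers
for a peer at position x: ID0 = x mod 2^4, ID1 = (x + 2^0) mod 2^4,
ID2 = (x + 2^1) mod 2^4, ID4 = (x + 2^2) mod 2^4, ID8 = (x + 2^3) mod 2^4.
The figure traces Look-up(K) from client ID0 and from client ID1.]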
  • Use a ring
    • Each peer takes an ID in the space modulo 2^N
    • Each peer stores (K, Object) pairs, for K
      satisfying ID(peer) ≤ K < ID(next peer)
  • Which API?
    • locate(K) → peer IP
    • get(K) → Object
    • put(K, Object)
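
The ring placement rule and the three API calls can be sketched in a few lines of Python. This is only a toy, single-process illustration under stated assumptions: the class name `ToyRing`, the SHA-1 key hashing and the peer IDs 0, 1, 2, 4, 8 are all hypothetical, there is no finger table or network, and `locate` returns a peer ID where the real API would return a peer IP.

```python
import hashlib
from bisect import bisect_right

N_BITS = 4  # ID space is 0 .. 2**N_BITS - 1, like the 16-ID ring on the slide


def key_id(key: str) -> int:
    """Hash a key into the ID space (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** N_BITS)


class ToyRing:
    """All peers live in one process; no finger tables and no network."""

    def __init__(self, peer_ids):
        self.peer_ids = sorted(peer_ids)
        self.store = {pid: {} for pid in self.peer_ids}

    def locate_id(self, kid: int) -> int:
        """Peer p with ID(p) <= kid < ID(next peer), wrapping around the ring."""
        i = bisect_right(self.peer_ids, kid) - 1
        return self.peer_ids[i]  # i == -1 wraps around to the last peer

    def locate(self, key: str) -> int:
        return self.locate_id(key_id(key))

    def put(self, key: str, obj) -> None:
        self.store[self.locate(key)][key] = obj

    def get(self, key: str):
        return self.store[self.locate(key)].get(key)


ring = ToyRing([0, 1, 2, 4, 8])      # hypothetical peer IDs
ring.put("author", ["posting-1"])
print(ring.locate_id(5), ring.locate_id(15), ring.get("author"))
```

A real DHT resolves `locate` in a logarithmic number of hops through the finger table; the sorted list with `bisect_right` only reproduces the final placement rule.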

6
7
Advantages and disadvantages
  • Advantages
    • Availability and reliability
    • No centralization (no bottleneck), and replication
    • Scalability
    • Scalable solution for keyword queries
  • Disadvantages
    • Difficult to maintain the structure
    • Not suited for a transient population of peers

7
8
XML query processing in KadoP
  • Query evaluation
  • Step 1.
    • Given an XQuery Q, decompose Q into tree-pattern
      queries
    • Evaluate each tree-pattern query using the DHT
      index to identify a set of candidate peers P that
      can provide answers
  • Step 2.
    • Ship Q to the peers in P and evaluate it there

8
9
Indexing XML documents
[Figure: tree of Doc.xml with element nodes A, B, C, D, E, F, G and the
text "John", each node annotated with its numeric position labels.]
X ancestor of Y ⇔ start(X) < start(Y) ≤ end(X)
X parent of Y ⇔ X ancestor of Y and level(X) = level(Y) - 1
Posting: (peer, doc, start, end, level)
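
The two structural tests translate directly into code. The sketch below assumes the ancestor condition reads start(X) < start(Y) ≤ end(X) and that both postings must come from the same peer and document; the `Posting` type and the sample labels are hypothetical and are not the ones in the slide's figure.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    peer: str
    doc: str
    start: int
    end: int
    level: int


def is_ancestor(x: Posting, y: Posting) -> bool:
    """x is an ancestor of y: same document, and y starts inside x's interval."""
    same_doc = (x.peer, x.doc) == (y.peer, y.doc)
    return same_doc and x.start < y.start <= x.end


def is_parent(x: Posting, y: Posting) -> bool:
    """x is the parent of y: an ancestor exactly one level above."""
    return is_ancestor(x, y) and x.level == y.level - 1


# Hypothetical labels for three nested elements of one document.
a = Posting("p1", "d1", 1, 8, 1)
b = Posting("p1", "d1", 2, 4, 2)
g = Posting("p1", "d1", 3, 3, 3)

print(is_ancestor(a, g), is_parent(a, b), is_parent(a, g))
```

Both tests use only the two postings, so a peer can join posting lists without fetching the documents themselves.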
9
10
XML indexing in DHT
  • Publish the postings via a DHT
    • put(k, postings), where k is a label or a keyword
  • Remark: all the postings for author accumulate at
    the same peer

[Figure: peers p1 and p2 publish through the DHT with
put(author, (p1, d2, start, end, lev)) and put(author, (p2, d2, start, end, lev));
both postings land in the posting list for author, stored at peer p(author).]
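
As an illustration of publishing, the sketch below derives one posting per element from a small XML document and appends it under the element's label. Several things here are assumptions rather than KadoP's actual scheme: a plain dictionary stands in for the DHT, the counter is incremented at each opening and closing tag to produce start and end, only element labels are indexed (no keywords), and the helper names `postings_of` and `publish` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def postings_of(peer, doc, xml_text):
    """Return (label, (peer, doc, start, end, level)) for every element."""
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1            # position of the opening tag
        start = counter
        below = []
        for child in elem:
            below.extend(visit(child, level + 1))
        counter += 1            # position of the closing tag
        end = counter
        return [(elem.tag, (peer, doc, start, end, level))] + below

    return visit(ET.fromstring(xml_text), 1)


index = defaultdict(list)  # stands in for the DHT: label -> posting list


def publish(peer, doc, xml_text):
    for label, posting in postings_of(peer, doc, xml_text):
        index[label].append(posting)  # an append-style put(k, posting)


publish("p1", "d1", "<article><author>Ullman</author></article>")
publish("p2", "d2",
        "<article><author>Abiteboul</author><author>Preda</author></article>")
print(index["author"])
```

Every posting for `author` ends up in the same list whichever peer published it, which is the accumulation effect noted in the remark above.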
10
11
Some technical issues
  • Goal: manage millions of documents with thousands
    of peers
  • First experiments were a disaster
  • First work
    • Replace the DHT's index storage in a file system with
      storage in a database (Berkeley DB)
    • Extend the DHT API with Append, not
      only Read/Write
    • Extend the DHT API with a streaming
      exchange of postings
  • With this, KadoP scaled but was slow, due to e.g.
    long posting lists
11
12
Optimization
12
13
Main issue: long posting lists
  • Transferring long posting lists hurts performance
  • Bad response time
    • Parallelization: Distributed Posting Partitioning
      (DPP)
  • Communication load
    • Bloom filter: Structural Bloom Filter

[Figure: peer p(Name) holding the long posting list for Name.]
13
14
DPP structure
[Figure: the long posting list for Name, ordered by (p, d), is split at
(p1,d1), (p2,d2), (p3,d3), (p4,d4) into blocks; the conditions C1, C2, C3, C4
are recorded at p(Name).]
  • DPP structure
    • Split and distribute postings according to
      conditions
    • Each condition is an interval, e.g. C1 = [(p1,d1), (p2,d2)]
    • Any two conditions are over disjoint intervals
    • Some kind of B-tree for postings
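
One way to picture the split is a function that cuts a sorted posting list into blocks and records, for each block, the closed (peer, doc) interval it covers. This is a minimal sketch and not the KadoP algorithm: the function name `partition`, the fixed `block_size` and the rule of never cutting inside one (peer, doc) group are all assumptions made for the illustration.

```python
def partition(postings, block_size):
    """Split a posting list into blocks, each guarded by a condition.

    Returns a list of (condition, block), where condition is the closed
    interval ((p_lo, d_lo), (p_hi, d_hi)) of (peer, doc) keys in the block.
    """
    postings = sorted(postings)
    blocks, i = [], 0
    while i < len(postings):
        j = min(i + block_size, len(postings))
        # Never cut inside one (peer, doc) group, so conditions stay disjoint.
        while j < len(postings) and postings[j][:2] == postings[j - 1][:2]:
            j += 1
        block = postings[i:j]
        blocks.append(((block[0][:2], block[-1][:2]), block))
        i = j
    return blocks


# Hypothetical postings (peer, doc, start, end, level) for one label.
name_postings = [
    ("p1", "d1", 1, 4, 1), ("p1", "d1", 2, 3, 2), ("p1", "d2", 1, 2, 1),
    ("p2", "d1", 1, 2, 1), ("p3", "d1", 1, 2, 1),
]
for condition, block in partition(name_postings, 2):
    print(condition, len(block))
```

In a deployment each block would be stored at a different peer and only the list of conditions would stay at p(Name), which is what makes the structure resemble one level of a B-tree.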
14
15
Query processing (no DPP)
[Figure: tree-pattern query (index-Q) over article, abstract, author,
database and Ullman; the posting list of each term streams to the QP-peer.]
Pipeline transfers of postings to the query-processing (QP) peer.
Holistic twig-join algorithm to compute the result in parallel at the QP peer.
15
16
Query processing with DPP
[Figure: at p(client), the conditions C1 to C5 for the terms abstract and XML,
sorted according to (p, d); each condition points to the peer holding its block.]
Fetch from p(abstract) and p(XML) the conditions C1-C5.
Prune intervals.
Transfer and compute in parallel the join for each sub-interval.
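
The pruning step can be illustrated with a merge over the two sorted condition lists: only pairs of blocks whose (peer, doc) intervals overlap can contribute to the join, so every other block is never transferred. The sketch assumes closed, sorted, pairwise-disjoint intervals; the function name and the sample conditions are hypothetical.

```python
def overlapping_pairs(conds_a, conds_b):
    """Return index pairs (i, j) of conditions whose intervals overlap.

    Each list holds closed intervals (lo, hi) over (peer, doc) keys,
    sorted and pairwise disjoint.
    """
    pairs, i, j = [], 0, 0
    while i < len(conds_a) and j < len(conds_b):
        (lo_a, hi_a), (lo_b, hi_b) = conds_a[i], conds_b[j]
        if lo_a <= hi_b and lo_b <= hi_a:
            pairs.append((i, j))
        # Advance the interval that ends first; the other may still match.
        if hi_a < hi_b:
            i += 1
        else:
            j += 1
    return pairs


# Hypothetical conditions for two query terms.
abstract_conds = [(("p1", "d1"), ("p1", "d9")), (("p2", "d1"), ("p3", "d5"))]
xml_conds = [(("p1", "d5"), ("p2", "d3")), (("p4", "d1"), ("p4", "d2"))]
print(overlapping_pairs(abstract_conds, xml_conds))  # [(0, 0), (1, 0)]
```

Each surviving pair is an independent sub-join, so the pairs can be evaluated in parallel; here the second block of the second term overlaps nothing and is pruned.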
16
17
Experiments
  • Platform
    • Grid5000: a platform for research in P2P
      systems
    • Distributed geographically across 6 sites in
      France
  • KadoP tested on more than 100 machines
    • 1000 logical peers
  • Conclusions in brief
    • Good performance
    • KadoP scales very nicely
    • Issue: does not support high churn of peers
      (index copying)

17
18
Query response time
Q: article//author//Ullman
18
19
Optimization
  • (b) Structural Bloom Filters
    • Ancestor Bloom Filter (ABF)
    • Also in the paper: Descendant Bloom Filter (DBF)

19
20
Using an Ancestor Bloom Filter
Query: a//b
Compute the Bloom Filter of the a-postings and send it to p(b).
At p(b), compute the b-postings that have an a-ancestor (and
more, because of false positives).
Send them to p(a), which can compute the answer.
[Figure: p(a) holds L(a) and p(b) holds L(b); p(b) sends F(b, ABF(a)),
the b-postings filtered by ABF(a), through the DHT.]
20
21
Technique: dyadic intervals
Dyadic intervals over positions 1 to 8:
2^3: [1,8]
2^2: [1,4] [5,8]
2^1: [1,2] [3,4] [5,6] [7,8]
2^0: [1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8]
[Figure: ap spans positions 1 to 7 (start to end); bp starts at position 3.]
  • Dyadic covers
    • D(ap) = {[1,4], [5,6], [7,7]}
  • ap is an ancestor of bp if
    • ∃ω ∈ D(ap) (start(bp) ∈ ω)
  • Here 3 ∈ [1,4], so the answer is yes!
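
The dyadic cover of the example can be computed by descending the binary hierarchy of intervals. The sketch below is a generic recursive construction, not code from KadoP; the function names are hypothetical, and the position space 1 to 8 is taken from the slide's example.

```python
def dyadic_cover(start, end, lo=1, hi=8):
    """Minimal set of dyadic intervals of [lo, hi] whose union is [start, end].

    Assumes hi - lo + 1 is a power of two and lo <= start <= end <= hi.
    """
    if start > end:
        return []
    if (start, end) == (lo, hi):
        return [(lo, hi)]
    mid = (lo + hi) // 2
    left = dyadic_cover(start, min(end, mid), lo, mid)
    right = dyadic_cover(max(start, mid + 1), end, mid + 1, hi)
    return left + right


def dyadic_containing(point, lo=1, hi=8):
    """All dyadic intervals of [lo, hi] that contain `point`, largest first."""
    out = [(lo, hi)]
    while lo < hi:
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if point <= mid else (mid + 1, hi)
        out.append((lo, hi))
    return out


cover = dyadic_cover(1, 7)        # [(1, 4), (5, 6), (7, 7)]
around = dyadic_containing(3)     # [(1, 8), (1, 4), (3, 4), (3, 3)]
print(set(cover) & set(around))   # {(1, 4)}
```

A point has one containing interval per level, so the ancestor test needs only a logarithmic number of set-membership checks, which is what makes it suitable for a Bloom filter.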

21
22
Ancestor Bloom Filter (simplified)
  • Publication: ∀d, ∀ap in d, ∀ω ∈ D(ap)
    • Insert a trace in the Bloom Filter
    • Say T[h(d, ω)] = 1 for some hash function h
  • Test: for bp in d,
    • for each dyadic interval ω s.t. start(bp) ∈ ω,
    • test if T[h(d, ω)] = 1
  • If one test is positive, conclude bp in d is a
    solution
  • False positives because of hash collisions
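
Publication and test can be put together in a small runnable sketch. It is an illustration under assumptions, not the paper's implementation: the filter size `M`, the use of `K = 3` hash functions derived from SHA-256 (the slide describes a single h), the position space 1 to 8 and all names are hypothetical.

```python
import hashlib

M = 1 << 12   # number of bits in the filter (arbitrary)
K = 3         # number of hash functions (the slide uses a single h)
SPAN = 8      # positions run from 1 to SPAN, a power of two


def dyadic_cover(start, end, lo=1, hi=SPAN):
    if start > end:
        return []
    if (start, end) == (lo, hi):
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (dyadic_cover(start, min(end, mid), lo, mid)
            + dyadic_cover(max(start, mid + 1), end, mid + 1, hi))


def dyadic_containing(point, lo=1, hi=SPAN):
    out = [(lo, hi)]
    while lo < hi:
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if point <= mid else (mid + 1, hi)
        out.append((lo, hi))
    return out


def bit_positions(doc, interval):
    """K bit positions for the pair (doc, interval), derived from SHA-256."""
    for seed in range(K):
        data = f"{seed}|{doc}|{interval[0]}|{interval[1]}".encode("utf-8")
        yield int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % M


class AncestorBloomFilter:
    def __init__(self):
        self.bits = bytearray(M // 8)

    def _set(self, pos):
        self.bits[pos // 8] |= 1 << (pos % 8)

    def _get(self, pos):
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))

    def add(self, doc, start, end):
        """Publication: insert a trace for every interval in D(ap)."""
        for interval in dyadic_cover(start, end):
            for pos in bit_positions(doc, interval):
                self._set(pos)

    def may_have_ancestor(self, doc, start):
        """Test: did some dyadic interval containing start(bp) leave a trace?"""
        return any(
            all(self._get(pos) for pos in bit_positions(doc, interval))
            for interval in dyadic_containing(start)
        )


abf = AncestorBloomFilter()
abf.add("d1", 1, 7)                      # ap = [1, 7] in document d1
print(abf.may_have_ancestor("d1", 3))    # True: 3 lies in [1, 4]
```

The filter never misses a true ancestor relationship; a positive answer for a posting with no such ancestor is possible only through hash collisions, and a larger `M` makes that less likely.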

22
23
Query evaluation strategies
[Figure: two evaluation plans over the peers p(a), p(b), p(c) and p(d).
Ancestor Bloom Reducer: the filters ABF(a) and ABF(b) flow down, giving
F(b, ABF(a)), F(c, ABF(b)) and F(d, ABF(b)).
Descendant Bloom Reducer: the filters DBF(c), DBF(d) and DBF(b) flow up,
with b filtered by DBF(c) and DBF(d).]
24
Performance
25
Conclusion
25
26
Related work
  • Very active area
  • DHT-based platforms for XML data management
    • Locating data sources (Galanis et al., VLDB03)
    • XPath lookup queries in P2P networks (Bonifati et
      al., WIDM04)
  • Other DHT-based systems for data management
    • PIER query processor (Huebsch et al., CIDR05)
    • Indexing in P2P networks (Aberer et al., VLDB05)
  • Dyadic intervals
    • Maintenance of dynamic intervals (Gilbert et al.,
      VLDB02)

26
27
Contribution
  • Two optimization techniques for index processing
    • Distributed Posting Partitioning
    • Structural Bloom Filters
  • A full system for P2P XML indexing
    • As opposed to a simulation
    • Lots of engineering details that are important
      for performance
    • Extensively tested for performance
    • Tested with a real application, EDOS

27
28
On-going and future work
  • New indexing techniques
    • Trading off precision for performance
    • Publish summarizations of documents
    • Index/transfer postings at a coarse level of
      detail
    • Index views (query caching)
  • Query optimizer for KadoP
    • This is standard distributed query processing
    • Use standard optimization techniques, e.g., the
      OptiMax ActiveXML optimizer (demo at ICDE08)
    • Develop what is specific to KadoP: the cost model

28
29
  • Thank you

30
Indexing time
30