Complex Queries in DHT-based Peer-to-Peer Networks

Title: The PIER Relational Query Processing System Author: Ryan Huebsch Last modified by: Ryan Huebsch Created Date: 1/31/2002 10:12:22 PM Document presentation format – PowerPoint PPT presentation

1
Complex Queries in DHT-based Peer-to-Peer Networks
  • Matthew Harren, Joe Hellerstein,
  • Ryan Huebsch, Boon Thau Loo,
  • Scott Shenker, Ion Stoica
  • p2p@db.cs.berkeley.edu
  • UC Berkeley, CS Division

IPTPS 3/8/02
2
Outline
  • Contrast P2P & DB systems
  • Motivation
  • Architecture
  • DHT Requirements
  • Query Processor
  • Current Status
  • Future Research

3
Uniting DHTs and Query Processing
4
P2P & DB Systems
                                      P2P   DB
Flexibility                            ?     ?
Decentralized                          ?     ?
Strong Semantics                       ?     ?
Powerful query facilities              ?     ?
Fault Tolerance                        ?     ?
Lightweight                            ?     ?
Transactions & Concurrency Control     ?     ?
5
P2P + DB = ?
  • P2P Database? No!
      • ACID transactional guarantees do not scale, nor
        does the everyday user want ACID semantics
      • Much too heavyweight a solution for the
        everyday user
  • Query Processing on P2P!
      • Both P2P and DBs do data location and movement
      • Can be naturally unified (lessons in both
        directions)
      • P2P brings scalability & flexibility; DB brings
        the relational model & query facilities

6
P2P Query Processing: (Simple) Example
SELECT song, size, server
FROM album, song
WHERE album.ID = song.albumID
  AND album.name = 'Rubber Soul'
  • Filesharing
  • Keyword searching is ONE canned SQL query
  • Imagine what else you could do!

7
P2P Query Processing: (Simple) Example
SELECT song, size, server
FROM album-ngrams AN, song
WHERE AN.ID = song.albumID
  AND AN.ngram IN <list of search ngrams>
GROUP BY AN.ID
HAVING COUNT(AN.ngram) >= <# of ngrams in search>
  • Filesharing
  • Keyword searching is ONE canned SQL query
  • Imagine what else you could do!
  • Fuzzy Searching, Resource Discovery, Enhanced DNS
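
The n-gram keyword query above can be tried out on a single machine. The following is a minimal sketch using Python's built-in sqlite3 module, not the distributed system described in the talk. Several things are assumptions made for illustration: the table is named album_ngrams (SQL identifiers cannot contain a hyphen), bi-grams are padded with "_", the sample rows are invented, and the match threshold is a fraction of the search n-grams (min_fraction) rather than a fixed count. The n-gram filter is also moved into a subquery so that the per-album count is not multiplied by the number of songs.

```python
import sqlite3

def bigrams(text):
    """Split text into padded bi-grams (the '_' padding is an assumption)."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE album_ngrams (ID INTEGER, ngram TEXT);
    CREATE TABLE song (song TEXT, size INTEGER, server TEXT, albumID INTEGER);
""")

# Invented sample data: two albums and three songs.
albums = {1: "Rubber Soul", 2: "Revolver"}
for album_id, name in albums.items():
    for gram in set(bigrams(name)):
        conn.execute("INSERT INTO album_ngrams VALUES (?, ?)", (album_id, gram))
conn.executemany(
    "INSERT INTO song VALUES (?, ?, ?, ?)",
    [("Drive My Car", 2400, "host-a", 1),
     ("Norwegian Wood", 2100, "host-b", 1),
     ("Taxman", 2600, "host-c", 2)],
)

def search(conn, term, min_fraction=0.7):
    """Return songs from albums that share enough bi-grams with the term."""
    grams = sorted(set(bigrams(term)))
    needed = max(1, int(len(grams) * min_fraction))
    placeholders = ",".join("?" * len(grams))
    sql = f"""
        SELECT song.song, song.size, song.server
        FROM song
        WHERE song.albumID IN (
            SELECT AN.ID FROM album_ngrams AN
            WHERE AN.ngram IN ({placeholders})
            GROUP BY AN.ID
            HAVING COUNT(AN.ngram) >= ?
        )
        ORDER BY song.song
    """
    return conn.execute(sql, (*grams, needed)).fetchall()

# A misspelled album name still finds the songs on "Rubber Soul".
matches = search(conn, "Ruber Soul")
```

Lowering min_fraction makes the search more tolerant of misspellings at the cost of more false matches.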

8
What this project IS and IS NOT about
  • IS NOT ABOUT: Absolute Performance
      • In most situations a centralized solution could
        be faster
  • IS ABOUT: Decentralized Features
      • No administrator, anonymity, shared resources,
        tolerates failures, resistant to censorship
  • IS NOT ABOUT: Replacing RDBMS
      • Centralized solutions still have their place for
        many applications (commercial records, etc.)
  • IS ABOUT: Research synergies
      • Unifying/morphing design principles and
        techniques from DB and NW communities

9
General Architecture
  • Note: the data is stored separately from the
    query engine, which is not standard DB practice!
  • Based on Distributed Hash Tables (DHT) to get
    many good networking properties
  • A query processor is built on top

10
DHT API
  • Basic API
  • publish(RID, object)
  • lookup(RID)
  • multicast(object)
  • NOTE: Applications can only fetch by name, which is
    a very limited query language!
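
To make the three calls concrete, here is a minimal in-memory sketch of the basic API. It is only an illustration: the class name ToyDHT, the per-node inbox, and the hashed-ring ownership rule are assumptions made for this example (the talk's implementation runs on a CAN simulator, which partitions a coordinate space instead), and no real networking is involved.

```python
import hashlib
from bisect import bisect_left

class ToyDHT:
    """In-memory stand-in for a DHT (hypothetical; not the talk's CAN)."""

    def __init__(self, node_names):
        # Each node gets a point on a hashed ring and owns the keys
        # that hash to the arc ending at that point.
        self.ring = sorted((self._hash(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}   # node -> {RID: object}
        self.inbox = {n: [] for n in node_names}   # node -> multicast messages

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(str(key).encode()).hexdigest(), 16)

    def _owner(self, rid):
        points = [h for h, _ in self.ring]
        # First node point at or after the key's hash, wrapping around.
        i = bisect_left(points, self._hash(rid)) % len(self.ring)
        return self.ring[i][1]

    def publish(self, rid, obj):
        """Store obj under rid at whichever node owns rid."""
        self.store[self._owner(rid)][rid] = obj

    def lookup(self, rid):
        """Fetch by name: return the object stored under rid, or None."""
        return self.store[self._owner(rid)].get(rid)

    def multicast(self, obj):
        """Deliver obj to every node."""
        for messages in self.inbox.values():
            messages.append(obj)

dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.publish("song:42", {"song": "Drive My Car"})
found = dht.lookup("song:42")
```

The only way to get data back is lookup(RID) with the exact identifier, which is the limitation the note on this slide points out.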

11
DHT API Enhancements I
  • Basic API
  • publish(namespace, RID, object)
  • lookup(namespace, RID)
  • multicast(namespace, object)
  • Namespaces: subsets of the ID space for logical
    and physical data partitioning

12
DHT API Enhancements II
  • Additions
      • lscan(namespace): retrieve the data stored
        locally from a particular namespace
      • newData(namespace): receive a callback when new
        data is inserted into the local store for the
        namespace
  • This violates the abstraction of location
    independence
      • Why necessary? Parallel scanning of base relations
      • Why acceptable? Access is limited to reading;
        applications cannot control the location of data
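
A rough sketch of how the two additions might look from the point of view of a single node follows. The class LocalDHTNode, the method store_local (standing in for the routing layer delivering a published object), and the callback signature are all assumptions for illustration; the slides give only the call names lscan and newData.

```python
from collections import defaultdict

class LocalDHTNode:
    """One node's local store, partitioned by namespace (routing omitted)."""

    def __init__(self):
        self._store = defaultdict(dict)       # namespace -> {RID: object}
        self._callbacks = defaultdict(list)   # namespace -> [callback]

    def store_local(self, namespace, rid, obj):
        """Hypothetical hook: called when a published object lands on this node."""
        self._store[namespace][rid] = obj
        for callback in self._callbacks[namespace]:
            callback(rid, obj)

    def lscan(self, namespace):
        """Read-only scan of the locally stored data in one namespace."""
        return list(self._store.get(namespace, {}).items())

    def new_data(self, namespace, callback):
        """Register a callback for future local inserts into the namespace."""
        self._callbacks[namespace].append(callback)

node = LocalDHTNode()
arrivals = []
node.new_data("R", lambda rid, obj: arrivals.append(rid))
node.store_local("R", 1, {"name": "Rubber Soul"})
node.store_local("S", 7, {"song": "Taxman"})
local_r = node.lscan("R")
```

Both calls only read what is already on the node, which matches the slide's argument that applications still cannot control where data is placed.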

13
Query Processor (QP) Architecture
  • QP is just another application as far as the DHT
    is concerned: DHT objects = QP tuples
  • User applications can use QP to query data using
    a subset of SQL
      • Select
      • Project
      • Joins
      • Group By / Aggregate
  • Data can be metadata (for a file-sharing type
    application) or entire records; the mechanisms
    are the same

14
Indexes. The lifeblood of a database engine.
  • The DHT's mapping of RID to object is equivalent
    to an index
  • Additional indexes are created by adding another
    key/value pair, with the key being the value of
    the indexed field(s) and the value being a pointer
    to the object (the RID or primary key)

[Figure: Primary Index (PKey → Data, stored in the DHT) and Secondary Index (Key → Ptr, i.e. the PKey, stored in the DHT under an index namespace)]
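
The secondary-index scheme described on this slide can be sketched as follows. This is an illustration under assumptions: NamespacedStore is a single-process stand-in for the DHT, and the index namespace naming convention ("table.field.idx") and the helper names are invented for the example.

```python
class NamespacedStore:
    """Single-process stand-in for publish/lookup with namespaces."""

    def __init__(self):
        self.data = {}  # (namespace, RID) -> list of objects

    def publish(self, namespace, rid, obj):
        self.data.setdefault((namespace, rid), []).append(obj)

    def lookup(self, namespace, rid):
        return self.data.get((namespace, rid), [])

def publish_with_index(dht, table, pkey, record, indexed_field):
    """Store the record under its primary key, plus one index entry."""
    dht.publish(table, pkey, record)                       # primary index
    index_ns = f"{table}.{indexed_field}.idx"              # hypothetical naming
    dht.publish(index_ns, record[indexed_field], pkey)     # key -> pointer

def lookup_by_field(dht, table, indexed_field, value):
    """Follow index pointers back to the full records."""
    pointers = dht.lookup(f"{table}.{indexed_field}.idx", value)
    return [rec for pk in pointers for rec in dht.lookup(table, pk)]

dht = NamespacedStore()
publish_with_index(dht, "song", 10, {"song": "Drive My Car", "albumID": 1}, "albumID")
publish_with_index(dht, "song", 11, {"song": "Norwegian Wood", "albumID": 1}, "albumID")
publish_with_index(dht, "song", 12, {"song": "Taxman", "albumID": 2}, "albumID")
album_one = lookup_by_field(dht, "song", "albumID", 1)
```

A lookup through the secondary index costs two rounds of DHT lookups: one for the pointers and one per pointer for the data.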
15
Relational Algorithms
  • Selection/Projection
  • Join Algorithms
      • Symmetric Hash
          • Use lscan on tables R & S. Republish tuples in a
            temporary namespace using the join attributes as
            the RID. Nodes in the temporary namespace perform
            mini-joins locally as tuples arrive and forward
            results to the requestor.
      • Fetch Matches
          • If there is an index on the join attribute(s) for
            one table (say R), use lscan for the other table
            (say S) and then issue a lookup probing for
            matches in R.
      • Semi-Join-like algorithms
      • Bloom-Join-like algorithms
  • Group-By (Aggregation)
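
As an illustration of the Fetch Matches idea, the sketch below joins locally scanned S tuples against an index on R. It assumes the local S tuples have already been obtained (for example via lscan) and that lookup_r is some function that probes R's index by join key; both names, and the sample data, are hypothetical.

```python
def fetch_matches_join(local_s_tuples, lookup_r, join_attr):
    """For each local S tuple, probe R's index and emit matching pairs."""
    for s_tuple in local_s_tuples:
        for r_tuple in lookup_r(s_tuple[join_attr]):
            yield (r_tuple, s_tuple)

# Stand-in for an index on R's join attribute (album.ID in the earlier example).
r_index = {1: [{"ID": 1, "name": "Rubber Soul"}]}

# Stand-in for the S tuples stored on this node.
local_songs = [
    {"song": "Drive My Car", "albumID": 1},
    {"song": "Unknown Track", "albumID": 9},
]

joined = list(fetch_matches_join(local_songs, lambda key: r_index.get(key, []), "albumID"))
```

Only S is scanned; R is touched solely through index lookups, so the cost is roughly one lookup per S tuple.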

16
Interesting note
  • The state of the join is stored in the DHT store
  • Rehashed data is automatically re-routed to the
    proper node if the coordinate space is adjusted
  • When a node splits (to accept a new node into the
    network) the data is also split; this includes
    previously delivered rehashed tuples
  • Allows graceful re-organization of the network
    without interfering with ongoing operations

17
Where we are
  • A real, working implementation of our Query
    Processor (currently named PIER) on top of a CAN
    simulator
  • Initial work studying and analyzing algorithms;
    nothing really ground-breaking YET!
  • Analyzing the design space and which problems
    seem most interesting to pursue

18
Where to go from here?
  • Common Issues
      • Caching: both at DHT and QP levels
      • Using replication for speed and fault tolerance
        (both in data and computation)
      • Security
  • Database Issues
      • Pre-computation of (intermediate) results
      • Continuous queries/alerters
      • Query optimization (Is this like network
        routing?)
      • More algorithms: distributed DBMSs have more tricks
      • Performance metrics for P2P QP systems
  • What are the new apps the system enables?

19
Additional Slides
20
Symmetric Hash Join
"I want Hawaiian images that appeared in movies
produced since 1970"
Create a query request: SELECT name, URL FROM
images, movies WHERE image.ID = movie.ID AND
When each node receives the multicast it uses
lscan to read all data stored at the node. Each
object or tuple is analyzed:
  1. The tuple is checked against predicates that
    apply to it (e.g. produced > 1970)
  2. Unnecessary fields can be projected out
  3. Re-insert the resulting tuple into the network
    using the join key value as the new RID, and use
    a new temporary namespace (both tables use the
    same namespace)
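
The three steps, together with the local mini-joins at the receiving nodes, can be sketched in a single process as below. This is an illustration only: JoinNode, rehash, the modulo placement of keys onto nodes, and the sample tuples are all assumptions, and real tuples would travel through the DHT's publish call rather than a direct method call.

```python
from collections import defaultdict

class JoinNode:
    """A node in the temporary namespace: one hash table per input table."""

    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}
        self.results = []  # would be forwarded to the requestor

    def on_tuple(self, source, key, tup):
        """Insert the arriving tuple, then probe the other table's entries."""
        other = "S" if source == "R" else "R"
        self.tables[source][key].append(tup)
        for match in self.tables[other].get(key, []):
            pair = (tup, match) if source == "R" else (match, tup)
            self.results.append(pair)

def rehash(source, local_tuples, predicate, fields, join_attr, nodes):
    """Steps 1-3: filter, project, and republish keyed on the join value."""
    for tup in local_tuples:
        if not predicate(tup):                              # step 1
            continue
        key = tup[join_attr]
        projected = {f: tup[f] for f in fields}             # step 2
        target = nodes[hash(key) % len(nodes)]              # step 3 (toy routing)
        target.on_tuple(source, key, projected)

nodes = [JoinNode() for _ in range(3)]
images = [{"ID": 1, "name": "beach", "URL": "http://example.org/beach.jpg"},
          {"ID": 2, "name": "volcano", "URL": "http://example.org/volcano.jpg"}]
movies = [{"ID": 1, "produced": 1980}, {"ID": 2, "produced": 1960}]

rehash("R", images, lambda t: True, ["name", "URL"], "ID", nodes)
rehash("S", movies, lambda t: t["produced"] > 1970, ["produced"], "ID", nodes)
results = [pair for node in nodes for pair in node.results]
```

Because each arriving tuple is both stored and used to probe, results are produced correctly whichever table's tuples reach a node first.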
21
N-grams
  • Technique from information retrieval to do
    inexact matching
  • I want "tyranny", but I can't spell: "tyrrany"
  • First, n-grams are created (bi-grams in this case)
  • Doc1: tyranny → create 8 bi-grams
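
A small sketch of bi-gram generation follows. It assumes the word is padded with one boundary character on each side, which is one common convention and is consistent with the count of 8 bi-grams for "tyranny"; the padding character "_" and the function name are assumptions for this example.

```python
def ngrams(word, n=2, pad="_"):
    """Return the padded n-grams of a word (bi-grams by default)."""
    padded = pad * (n - 1) + word.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

correct = ngrams("tyranny")     # ['_t', 'ty', 'yr', 'ra', 'an', 'nn', 'ny', 'y_']
misspelt = ngrams("tyrrany")    # ['_t', 'ty', 'yr', 'rr', 'ra', 'an', 'ny', 'y_']
shared = set(correct) & set(misspelt)
```

The misspelling shares 7 of its 8 bi-grams with the correct word, which is what lets an n-gram index find "tyranny" from the query "tyrrany".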