The PIER Relational Query Processing System
Transcript and Presenter's Notes
1
The PIER Relational Query Processing System
  • Boon Thau Loo & Ryan Huebsch
  • {boonloo, huebsch}@eecs.berkeley.edu
  • Also developed with Matthew Harren
  • Faculty: Joe Hellerstein, Scott Shenker, Ion
    Stoica
  • Group Email: p2p@db.cs.berkeley.edu

UC Berkeley DB Seminar 2/8/02
2
What is PIER? Just another nonsense acronym?
  • PIER = Peer-to-Peer Infrastructure for
    Information Exchange and Retrieval
  • PIER = a relational query engine that runs on top
    of an existing P2P (semi-structured) network and
    its data
  • PIER = a better Napster, Gnutella, Freenet, or
    your other favorite file-sharing software
  • PIER = an answer to the world's information
    problems? ... well, probably not

3
Why? What's wrong with existing P2P systems?
  • Existing P2P systems were designed mainly by
    networking people
  • Excel in:
  • Fault tolerance
  • Flexibility
  • Decentralization (ad hoc, multiple administrators,
    limited controls)
  • Lack:
  • Real query facilities (they only support keyword
    searching)
  • Guarantees/promises (remember the networking
    motto: best effort)

4
Why? What's wrong with existing database systems?
  • Existing database systems:
  • Excel in:
  • Precise, correct answers
  • Powerful query facilities (SQL)
  • Guarantees and concurrency control (ACID
    transactions)
  • Optimization: "don't tell us how to do it, we
    tell you"
  • Lack:
  • Flexibility (servers come and go, loose
    structure)
  • Decentralized administration
  • Fault tolerance (they can tolerate only a few
    problems, if any; the DB answer to a problem is
    "it didn't work, try again later")

5
What about Napster, Gnutella, and the rest?
  • Napster is centralized
  • Subject to failure
  • Subject to censorship (in this case, by the music
    industry)
  • Requires administration
  • Gnutella does not scale
  • Queries are only asked in the nearby network
    (limited hop count)
  • Research projects: CAN, Chord (CFS, INS, TWINE),
    Pastry, Tapestry (OceanStore), etc.
  • These lack the query facilities

6
Why do you need complex queries? N-grams
  • A technique from information retrieval for doing
    inexact matching
  • I want "tyranny", but I can't spell it ("tyrrany")
  • First, n-grams are created (here, bi-grams; a Java
    sketch follows the table below)
  • Doc1 "tyranny" → creates 8 bi-grams: _t, ty,
    yr, ra, an, nn, ny, y_
  • Each bi-gram contains a pointer to the doc ID
  • So a DB might look like:

  ra → Doc2   rn → Doc3   ty → Doc1   vu → Doc2   xr → Doc4
  _t → Doc1   _t → Doc5   _t → Doc6   an → Doc1   an → Doc3
  an → Doc5   ba → Doc2   nn → Doc1   ny → Doc1   ra → Doc1
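
A minimal Java sketch of the bi-gram step (Java being
the project's implementation language, per the "Where
we are" slide). The underscore padding is an assumption
made so that "tyranny" yields the slide's count of 8
bi-grams:

  import java.util.ArrayList;
  import java.util.List;

  public class Bigrams {
      // Pad the word with '_' so the first and last characters each
      // get their own bi-gram; "tyranny" then yields 8 bi-grams.
      static List<String> bigrams(String word) {
          String padded = "_" + word + "_";
          List<String> grams = new ArrayList<>();
          for (int i = 0; i < padded.length() - 1; i++) {
              grams.add(padded.substring(i, i + 2));
          }
          return grams;
      }

      public static void main(String[] args) {
          // Prints [_t, ty, yr, ra, an, nn, ny, y_]
          System.out.println(bigrams("tyranny"));
      }
  }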
7
N-grams continued
  • Convert the search string to n-grams
  • Intersect (which is a join) the list of n-grams
    from the search word with the index of
    n-grams/docIDs
  • Aggregate the results by docID, counting the number
    of n-gram matches for each docID (more n-gram
    matches = closer to the request)
  • SELECT i.docID, count(i.ngram) AS matches
    FROM userlist u, indexlist i
    WHERE u.ngram = i.ngram
    GROUP BY i.docID
    ORDER BY matches ASC

8
What PIER IS and IS NOT about
  • IS NOT ABOUT: Performance
  • In most situations a centralized solution could
    be faster
  • IS ABOUT: Decentralized Features
  • No administrator, anonymity, shared resources,
    tolerates failures, resistant to censorship
  • IS NOT (Just) ABOUT: Hype
  • Buzzwords, catch phrases, and empty promises
  • IS ABOUT:
  • Uniting the networking community and the database
    community to get the best of both

9
General PIER Architecture
  • Based on Distributed Hash Tables (DHTs) to get
    many good networking properties
  • A query processor is built on top
  • Note: the data is stored separately from the
    query engine, which is not standard DB practice!

10
DHTs Overview
  • A wide-area distributed hash table that supports
    finding data efficiently
  • Data has an ID that is used to map the item to a
    node (or bucket) that stores it
  • Lookups/inserts → a routing problem
  • A number of implementations:
  • CAN: UC Berkeley/ICSI
  • Chord: MIT/UC Berkeley
  • Tapestry: UC Berkeley, based on Plaxton's
    algorithm (UTexas)
  • And many others, all with similar APIs and
    functionality

11
DHTs CAN in particular
  • Content Addressable Network (CAN) is one of many
    DHT algorithms
  • An overlay network that uses underlying TCP
  • Self-forming: a node only needs to find one
    existing node to join; no central administration
    is needed
  • Fault tolerant: adapts to many failures
  • Load balancing
  • Scalable: each node needs only a little state
    information about its neighbors, O(d) state, and
    primitives cost O(n^(1/d))
  • d = # of dimensions, n = # of nodes

12
CAN Overview
  • Establishes a d-dimensional Cartesian coordinate
    space
  • Nodes randomly choose a point in the space at
    which to join
  • The node currently responsible for that space
    splits it, and the space is divided equally
  • Nodes store information about their spatial
    neighbors in the coordinate space

13
CAN Overview II
  • Data is represented as a key/value pair
  • The key, or resource ID (RID), is hashed into the
    coordinate space using a hash function h(x), such
    as SHA-1 (a sketch follows below)
  • Data is stored at the node responsible for that
    space

Item 1: h(RID1) = <3,2>
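
A minimal Java sketch of this mapping; carving the
SHA-1 digest into one integer per dimension of a 2-d
unit square is an illustrative assumption:

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;

  public class CanHash {
      // Hash a resource ID to a point in the unit square [0,1) x [0,1).
      static double[] toPoint(String rid) throws Exception {
          byte[] digest = MessageDigest.getInstance("SHA-1")
                  .digest(rid.getBytes(StandardCharsets.UTF_8));
          ByteBuffer buf = ByteBuffer.wrap(digest);
          // First 4 digest bytes give x, the next 4 give y (as unsigned ints).
          double x = (buf.getInt() & 0xffffffffL) / (double) (1L << 32);
          double y = (buf.getInt() & 0xffffffffL) / (double) (1L << 32);
          return new double[] { x, y };
      }

      public static void main(String[] args) throws Exception {
          double[] p = toPoint("RID1");
          System.out.printf("h(RID1) = <%.3f, %.3f>%n", p[0], p[1]);
      }
  }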
14
CAN Overview III
Note that the space wraps (it is really a torus), so
node 6 has nodes 2, 3, 4, and 5 as neighbors; but
to keep this simple example working we ignore that
detail.
  • Data is located by hopping from neighbor to
    neighbor until the right node is found (see the
    routing sketch below)
  • Node 6 wants I1
  • Determine the coordinates: h(RID1) = <2,3>
  • Determine which neighbor is closest:
  • N4 or N2
  • Hop to that neighbor and repeat

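A self-contained Java sketch of this greedy
hop-by-hop routing; the Node and Zone types and the
unit-torus distance are simplified illustrations, not
PIER's actual classes:

  import java.util.List;

  public class CanRouting {
      record Point(double x, double y) {}
      record Zone(double x0, double y0, double x1, double y1) {
          boolean contains(Point p) {
              return p.x() >= x0 && p.x() < x1 && p.y() >= y0 && p.y() < y1;
          }
          Point center() { return new Point((x0 + x1) / 2, (y0 + y1) / 2); }
      }
      static class Node {
          Zone zone; List<Node> neighbors;
          Node(Zone z, List<Node> n) { zone = z; neighbors = n; }
      }

      // Distance on the unit torus: coordinates wrap at 1.0.
      static double torusDist(Point a, Point b) {
          double dx = Math.abs(a.x() - b.x()); dx = Math.min(dx, 1 - dx);
          double dy = Math.abs(a.y() - b.y()); dy = Math.min(dy, 1 - dy);
          return Math.hypot(dx, dy);
      }

      // Hop greedily to whichever neighbor's zone center is closest
      // to the target until reaching the node whose zone contains it.
      static Node route(Node start, Point target) {
          Node cur = start;
          while (!cur.zone.contains(target)) {
              Node next = cur;
              double best = torusDist(cur.zone.center(), target);
              for (Node n : cur.neighbors) {
                  double d = torusDist(n.zone.center(), target);
                  if (d < best) { best = d; next = n; }
              }
              if (next == cur) break; // dead end in this simplified sketch
              cur = next;
          }
          return cur;
      }
  }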
15
CAN Overview IV
  • The cost of operations is based on the path
    length from node to node: O(n^(1/d))
  • Basic API:
  • Publish(namespace, RID, object)
  • Lookup(namespace, RID)
  • Multicast(group, object)
  • Applications can only do EXACT lookups, and that
    just isn't enough for real users! (The interface
    sketch below spells this API out.)
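
The basic API written out as a Java interface; the
exact parameter types are assumptions, since the slide
gives only the operation names and arguments:

  public interface Dht {
      // Insert an object into the DHT under its resource ID.
      void publish(String namespace, byte[] rid, Object object);
      // Exact-match retrieval of the object stored under a resource ID.
      Object lookup(String namespace, byte[] rid);
      // Deliver an object to every node in a multicast group.
      void multicast(String group, Object object);
  }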

16
PIER Overview
  • PIER is just another application as far as CAN is
    concerned: CAN objects = PIER tuples
  • User applications can use PIER to query data
    using a subset of SQL (joins and GROUP BY)
  • Assume:
  • PIER is running on all CAN nodes
  • Metadata is stored in CAN, not the actual large
    files (although functionally they could be)
  • An object's RID is its primary key (in most
    cases)
  • The namespace is equivalent to the relation's name

17
CAN Extensions
Not needed for correctness; polling could be used
instead
  • For PIER to operate, it must have access
    to the local data store
  • LocalGet(namespace): retrieve the data stored
    locally for a particular namespace
  • LocalNotify(namespace): receive a callback when
    new data is inserted into the local store for the
    namespace
  • This violates the abstraction of location
    independence
  • However, access is limited to reading the data;
    applications cannot control the location of data
    through these extensions (see the sketch below)
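
These extensions can be sketched in the same style,
extending the hypothetical Dht interface above; the
signatures are assumptions based on the bullet
descriptions:

  import java.util.function.Consumer;

  public interface PierDht extends Dht {
      // Read every object stored at THIS node for a namespace.
      Iterable<Object> localGet(String namespace);
      // Register a callback fired when new data for the namespace
      // arrives in the local store (the alternative to polling).
      void localNotify(String namespace, Consumer<Object> callback);
  }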

18
Indexes. The lifeblood of a database engine.
  • CAN's mapping of RID → object is equivalent to an
    index
  • Additional indexes are created by adding another
    key/value pair, with the key being the value of
    the indexed field(s) and the value being a pointer
    to the object (the RID or primary key)
  • Original tuple: (35, abc.mp3, classical, 1837, ...)
  • Index on the 3rd/4th attributes: (classical.1837, 35)
  • Index lookups require 2 CAN lookups: one for the
    index entry, one for the object (see the sketch
    below)
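
A hedged sketch of publishing the slide's secondary
index entry through the Dht interface sketched
earlier; the namespace name and tuple layout are
illustrative:

  class SecondaryIndex {
      // Original tuple: (35, "abc.mp3", "classical", 1837, ...)
      static void indexTuple(Dht dht, Object[] tuple) {
          // Key = value of the indexed fields, value = primary key (RID).
          String indexKey = tuple[2] + "." + tuple[3];   // "classical.1837"
          Object primaryKey = tuple[0];                  // 35
          dht.publish("songs.byGenreYear", indexKey.getBytes(), primaryKey);
          // A lookup now costs 2 CAN lookups: index entry -> RID -> object.
      }
  }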

19
Rehash-Join
  • The tuple is checked against the predicates that
    apply to it (e.g., produced > 1970)
  • Unnecessary fields can be projected out
  • Re-insert the resulting tuple into the network
    using the join key value as the new RID, and use
    a new temporary namespace (both tables use the
    same namespace); a sketch of this step follows
    below

Example: "I want Hawaiian images that appeared in
movies produced since 1970." Create a query request:
SELECT name, URL FROM images, movies
WHERE image.ID = movie.ID AND ...
The request is multicast; when each node receives it,
it uses localGet to read all the data stored at the
node, and each object or tuple is analyzed.
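
A sketch of the per-node rehash step just described,
reusing the PierDht sketch above; the Tuple and Query
helpers are hypothetical stand-ins for PIER's real
operators:

  class RehashJoin {
      interface Tuple {}                    // stand-in for PIER's tuple type
      interface Query {                     // hypothetical query helpers
          boolean localPredicatesHold(Tuple t);
          Tuple project(Tuple t);
          byte[] joinKeyValue(Tuple t);
      }

      static void rehash(PierDht dht, String namespace,
                         String tempNamespace, Query q) {
          for (Object o : dht.localGet(namespace)) {
              Tuple t = (Tuple) o;
              if (!q.localPredicatesHold(t)) continue;  // e.g. produced > 1970
              Tuple projected = q.project(t);           // drop unneeded fields
              // The join key value becomes the new RID.
              dht.publish(tempNamespace, q.joinKeyValue(t), projected);
          }
          // Tuples from both tables with equal join keys now land on the
          // same node, where a local (pipelined hash) join produces results.
      }
  }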
20
Get-Join
  • The tuple is checked against the predicates that
    apply to it (e.g., produced > 1970)
  • Issue a Lookup request for the tuple(s) that may
    be able to join with it: the RID is the join key
    value and the namespace is that of the other table
  • ONE TABLE MUST BE INDEXED ON THE JOIN KEY!
    (A sketch follows below.)

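The matching sketch for Get-join, reusing the
hypothetical helpers from the Rehash-join sketch; the
emit callback is a stand-in for result delivery:

  class GetJoin {
      interface Tuple {}
      interface Query {
          boolean localPredicatesHold(Tuple t);
          byte[] joinKeyValue(Tuple t);
          Tuple join(Tuple left, Tuple right);
      }

      static void getJoin(PierDht dht, String namespace, String otherNamespace,
                          Query q, java.util.function.Consumer<Tuple> emit) {
          for (Object o : dht.localGet(namespace)) {
              Tuple t = (Tuple) o;
              if (!q.localPredicatesHold(t)) continue;
              // The other table must be keyed on the join key, so an
              // exact Lookup can fetch the matching tuple.
              Object match = dht.lookup(otherNamespace, q.joinKeyValue(t));
              if (match != null) emit.accept(q.join(t, (Tuple) match));
          }
      }
  }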
21
Hybrid-Join
  • Starts the same as the others: the request is
    multicast to the nodes
  • Instead of rehashing the entire tuple, only
    rehash the primary key and join key of the tuples
    that may satisfy the request (just like a
    semi-join or index-on-the-fly algorithm)
  • At the nodes holding the rehashed data, if a join
    is possible (matching join keys), Lookup the
    entire tuple and then join

22
Quick Comparison
  • # of data movements (latency):
  • Rehash: query request, rehash, result = 3
  • Get: query request, lookup, answer, result = 4
  • Hybrid: query request, small rehash, lookup,
    answer, result = 5
  • Amount of data sent:
  • Rehash: sends projected tuples matching some
    predicates, but can't pre-check the join predicate
  • Get: only retrieves tuples matching the join
    predicate, but the entire tuple must be fetched
    and may not match some other predicates
  • Hybrid: sends small tuples matching some
    predicates; the entire tuple is fetched only if a
    join is very likely

23
Applying Predicates
  • Example SQL query:

    SELECT emp.name, emp.salary
    FROM emp, dept
    WHERE emp.deptid = dept.id AND
          emp.age < 50 AND
          dept.name <> 'CS' AND
          emp.salary > dept.maxsalary

24
Quick Comparison
[Results chart comparing the join strategies:
160k msgs / 750 KB; 250k msgs / 4.6 MB;
300k msgs / 600 KB]
25
Rehash-Group By
  • Same as Rehash-Join: rehash the tuples matching
    the predicates
  • At the rehash sites, perform the aggregate
  • After a predefined time, send the result (see the
    sketch below)
  • Obviously this needs work:
  • Partial results
  • Statistics work (confidence levels, etc.)
  • Hierarchical aggregation (like in sensor nets)
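
A sketch of the aggregation step at one rehash site
for a COUNT query; the arrival callback and timer
hooks are assumptions standing in for PIER's actual
scheduling:

  import java.util.HashMap;
  import java.util.Map;

  class RehashAggregator {
      private final Map<Object, Integer> counts = new HashMap<>();

      // Tuples arrive keyed by their GROUP BY value.
      void onArrival(Object groupKey) {
          counts.merge(groupKey, 1, Integer::sum);  // running COUNT per group
      }

      // After the predefined time, ship one result tuple per group.
      void onTimer(Dht dht, String resultNamespace) {
          counts.forEach((k, v) ->
              dht.publish(resultNamespace, k.toString().getBytes(), v));
          counts.clear();
      }
  }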

26
Current Optimizations
  • Result tuple batching: why send results one by
    one?
  • A pipelined hash join naturally produces more
    results in one pass as more tuples are processed
  • The network may be optimized for packets of a
    certain size: batch 2, 3, 4, ..., x tuples per
    message
  • Selection and projection:
  • Before rehashing, throw out irrelevant tuples and
    project to reduce the size of each tuple
  • Rehash optimization:
  • Instead of rehashing to all sites, randomly
    select a smaller group to be used

27
Interesting note
  • The state of the join is stored in the DHT store
  • The query request is automatically delivered to
    new nodes when they join
  • Rehashed data is automatically re-routed to the
    proper node if the coordinate space is adjusted
  • When a node splits (to accept a new node into the
    network), its data is also split; this includes
    previously delivered rehashed tuples
  • This allows graceful re-organization of the
    network without interfering with ongoing operations

28
Where we are
  • 25,000 lines of Java code, hours of debugging,
    speed/memory enhancements, etc.
  • A working, real implementation of PIER on top of a
    CAN simulator
  • Initial work studying and analyzing the
    algorithms; nothing really ground-breaking YET!
  • Analyzing the design space and which problems
    seem most interesting to pursue

29
DHT API? Is it good?
  • The API isn't good for:
  • Range queries
  • Limited multicast: currently, all queries must
    be asked at all nodes; this is the same scaling
    problem as in Gnutella and Freenet
  • Batch Publish/Lookup operations
  • How do we properly extend the API if needed?
  • The end-to-end argument is always in play: which
    layer (CAN, PIER, or the user application) should
    do what?

30
Where to go from here? 1st: Understand where we are
  • Initial steps we need to take:
  • Testing so far has been limited to small
    micro-benchmarks; we need to take a careful look
    at:
  • Scaling
  • Effects of dynamic node re-organization
    (including failures, which result in data AND
    computation loss)
  • Heterogeneous network connections and various
    topologies
  • Different data distributions
  • Optimal levels of parallelism (too many nodes can
    be just as bad, if not worse, than too few)
  • More realistic simulations (including traces)
  • Develop or adopt performance metrics to ensure
    efficiency and measure alternatives

31
Where to go from here? Then: see where we can go
  • The hardcore research areas:
  • Replication for speed and fault tolerance (both
    of data and of computation)
  • Pre-computation of (intermediate) results
  • Caching, at both the DHT and PIER levels
  • Query optimization
  • More algorithms; distributed DBMSs have more
    tricks
  • Security
  • What new apps does the system enable?

32
Relationship to other Berkeley Projects. Are we
alone?
  • We have data coming from many sources, at
    different rates, always changing: Telegraph?
  • We expect that data is constantly published in
    this type of network; better aggregation
    algorithms: Sensor Nets?
  • Many similar or overlapping queries are likely to
    be asked from different nodes within a short time
    window: SteMs, XML filters?
  • Any ideas?