P2P Querying - PowerPoint PPT Presentation

About This Presentation
Title:

P2P Querying

Description:

no guarantee on ... 'Sharing computer cycles' -- SETI Institute created a distributed ... All Music Guide. When querying, need to decide which buckets (ie. ... – PowerPoint PPT presentation

Number of Views:53
Avg rating:3.0/5.0
Slides: 30
Provided by: musicM
Category:
Tags: p2p | allmusic | querying

less

Transcript and Presenter's Notes

Title: P2P Querying


1
P2P Querying
  • Wes Hatch
  • MUMT-614
  • Mar 27 2003

2
What is P2P?
  • Nodes of equal roles exchanging information and
    services directly
  • distributed databases
  • good for file sharing, but
  • no complex query optimization
  • no guarantee on quality of results
  • Can also be used to foster collaboration among
    geographically distributed coworkers
  • Sharing computer cycles --gt SETI Institute
    created a distributed virtual supercomputer.

3
P2P systems
  • Just beyond their infancy
  • Many P2P projects are in their research phases
  • only a handful are in common use
  • Concerns regarding dependability

4
Gnutella
  • An open, online P2P network
  • does not require a central server for indexing
    data files
  • not proprietary

5
Gnutella
  • Is both software and a communication protocol
  • Computer functions as both a client and server
  • Servent runs on each member node

6
Gnutella basics
  • member node keeps track of three or four other
    member nodes (mean 3.4)
  • web of interlinked member nodes

7
Quick review of architecture
  • Routing and TTL (time to live)
  • DONT forward every packet to every connected
    host
  • would swamp the network with duplicate packets
    (which it already is)
  • when a message is sent, it is stamped with a TTL
  • each host which receives the TTL then decrements
    it
  • Can change the value of the TTL in settings panel
  • Set the TTL to 255, if you want

8
Basic Query
  • String matching
  • Query sent to all computers that returned Pong
    packets.
  • Each of these computers checks if it has any
    match,
  • If no, it sends the Query packet on to all the
    computers to which it is connected.
  • Process continues until TTL expires
  • Could be awhile as most servents allow you to
    adjust the TTL
  • GUIDs in each computer ensure that the same
    message does not get passed to the same computer
    again and again, creating a loop.

9
Downloading
  • The query by now has been distributed to a huge
    number of computers
  • QueryHit
  • contains IP address and GUID of the computer
    which has the desired content
  • Push
  • For computers behind a firewall

10
Common Messages
11
Everyones searching for the same thing
  • small percentage of files make up the bulk of
    queries
  • query processing dominates the workload of a node

12
Everyones got the same thing
13
One saving grace
  • correlation between file and query distributions
    explains why Gnutella has not collapsed under the
    weight of its naive, flooding-based probe scheme
    (Ledlie)
  • Query popularity follows a power-law distribution

14
Current limitations in Gnutella
  • Querying is
  • Slow
  • Inefficient
  • Bandwidth hungry
  • require symmetrical communication channel ( i.e.
    same upstream and downstream channel bandwidths)
  • Insecure
  • transient community
  • member nodes do not have permanent IP addresses.
    A new IP addressing scheme is needed

15
More limitations
  • Problem work well when the rate at which nodes
    enter and exit the system is small.but
  • Doesnt address the tradeoffs between the
    type/reliability of information exchange, and how
    long each node is active
  • 80 of the nodes exist for less than one hour!

16
Spectrum of control
  • Gnutella is loose
  • Inefficient
  • Finding a needle in a haystack
  • Simple
  • DHT
  • Lookup queries only
  • Efficient
  • measured in absolute resources consumed

17
DHT
  • like having a file cabinet distributed over
    numerous servers (F.Kaashoek)
  • if one server goes down, not all of the data is
    compromised.
  • no central server that contains a list of where
    all the data is
  • Scalable
  • Lookups can be resolved in log n hops
  • Flexible
  • Fault-tolerant

18
Why DHT?
  • SQL-like searching
  • each server has a partial list (small routing
    table) of where data is stored in the system
  • The trick is to create a lookup algorithm that
    allows data location to be easily found

19
N-grams
  • Technique from information retrieval to do
    inexact matching
  • Convert search strings into n-tuples
  • Eg. Beethoven--gt Bee eth tho

20
DHT techniques
  • Query optimization like network routing?
  • Hiearchical organization of data
  • Geographic layout
  • Proximity routing
  • Proximity neighbour selection

21
One example
  • directed BFS
  • sends query to a subset of neighbours
  • maintains statistics on neighbours
  • ping latency, history of results
  • chooses subset intelligently (via heuristics) to
    maximize quality of results
  • Eg neighbours w/shortest message queue (long
    message queue implies that node is saturated or
    dead)

22
Semantic Overlay Networks
  • Nodes w/semantically similar content are
    clustered together
  • Depends on classifiers to determine what similar
    content is!

23
Semantic Overlay Networks
  • Granularity
  • Too little will not generate enough locality too
    much will increase maintenance costs
  • When to join a given SON
  • How much of a given content should a user have?
  • Tradeoff between search speed and quality of
    results

24
Semantic Overlay Networks
  • Queries are processed by identifying which SON
    (s) are best equipped to answer it
  • Nodes outside a given SON are not bothered by the
    query
  • Saves time and resources

25
Semantic Overlay Networks
  • Classification hierarchy
  • Buckets a bucket is associated with each concept
    in hierarchy
  • All Music Guide
  • When querying, need to decide which buckets (ie.
    Nodes) to consider
  • If nodes have diverse files, wont be enough
    clustering to justify the use of SONs

26
Semantic Overlay Networks
  • Need to classify queries so the request may be
    directed accordingly
  • Manual
  • Automatic
  • Text matching
  • Bayesian networks

27
System schematic
28
Semantic Overlay Networks
  • Use only 10-20 of the message overhead of
    current P2P systems
  • Further improved query performance
  • Marginal effect to maximum achievable recall level

29
How will this affect queries?
  • SON
  • Faster results as there is less time spent
    processing queries
  • Resources better spent
  • Hierarchical distribution of data
Write a Comment
User Comments (0)
About PowerShow.com