Title: Peer-to-Peer Information Search
1Peer-to-Peer Information Search
- Sebastian Michel, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Josiane Xavier Parreira, Max-Planck Institute for Informatics, Saarbrücken, Germany
2Outline of Part 1
- Introduction to P2P Systems
- Distributed Hash Tables, Range Queries
- Peer-to-Peer IR (Query Routing, Result Merging)
- Overlapping Sources / Multi-key Statistics
- Top-k Query Processing
- Probabilistic Pruning
- Distributed Top-k
3P2P Systems
Peer: "one that is of equal standing with another" (source: Merriam-Webster Online Dictionary)
- Known from Napster and others
- Sharing of mostly illegal content (mp3, movies)
- P2P = Pirate-to-Pirate?
- New kind of network organization: no client/server anymore
- Basic ideas
- Each peer connects to a few other peers
- All peers together form powerful networks
- Potential Benefits
- No single point of failure
- Load is spread across multiple peers
- (Resilient to failures and dynamics)
4Napster
- Developed in 1998.
- First P2P file-sharing system
- Central server (index)
- Client software sends information about users' contents to the server.
- Users send queries to the server.
- The server responds with the IPs of users that store matching files.
- → Peer-to-peer file sharing!
[Figure: peers publish file statistics to the central server; file downloads happen directly between peers]
5Gnutella
- Protocol for distributed file sharing
- Started in 2000
- In 2005: 1.81 million computers connected
- Unstructured Network
- Truly decentralized
- Uses message flooding during query execution.
- Later versions with super nodes and query routing
http://www.slyck.com/news.php?story=814
6Gnutella Style
[Figure: a query ("Paris Hilton?") is flooded through the network; the TTL starts at 3 and is decremented at every hop (3, 2, 1, 0)]
7Gnutella Style
- Pros
- no complex statistical bookkeeping
- Cons
- lot of network traffic
- some peers might not be reachable (TTL)
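The flooding behaviour and the TTL limitation listed above can be illustrated with a minimal sketch. The overlay graph, the peer names, and the function `flood` are assumptions made for this example; real Gnutella messages carry more state (message ids, hop counts).

```python
from collections import deque

def flood(graph, start, ttl):
    """Return the set of peers reached by a query flooded from `start`.

    graph: dict mapping each peer to the list of its neighbors (hypothetical overlay).
    ttl:   number of hops the query may travel.
    """
    reached = {start}
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for neighbor in graph[peer]:
            if neighbor not in reached:  # peers drop queries they have already seen
                reached.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return reached

# Hypothetical overlay: E is three hops away from A.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B", "E"], "E": ["D"]}
reach_ttl2 = flood(overlay, "A", 2)
reach_ttl3 = flood(overlay, "A", 3)
```

With a TTL of 2 the query never reaches peer E, which is the "not reachable" case from the list of cons.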
8Bit Torrent
- Idea: load sharing through file splitting
- A lot of (legal) software distributors offer software through BitTorrent
- Download information in a small .torrent file
- One tracker node per file (specified in the .torrent file)
[Figure: a file split into segments 1 to 5 that are held by different peers; the client requests a random peer list from the tracker node and then requests segments from those peers]
Incentives: tit-for-tat. Each peer remembers collaborative peers → different priorities.
9Literature
- Book: Peer-to-Peer: Harnessing the Power of Disruptive Technologies, by Andy Oram. O'Reilly Media, Inc.
10Overlay Networks
- On top of existing networks
- Different ways to build an overlay network:
- structured
- unstructured
- hybrid
11Self Properties (Promises)
- Self-Organizing
- evolves and grows without being guided or managed
- Self-Optimizing
- Self-Configuring
- Self-Healing
- Self-Restoration
- Self-Diagnostics
- Self-Protecting
13Distributed Hash Tables
- Hash table: given a key, return the bucket id, based on a hash function (like SHA-1)
- Now distributed: for a given key, return the id of the peer currently responsible for the key
- Challenge: purely distributed protocols that cope with node failures, departures, and arrivals
- No central manager
14Chord
- Uses an m-bit identifier space ordered in a mod-$2^m$ circle, the Chord ring
- Maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1
- Uses consistent hashing
- An object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to or follows id on the Chord ring
- Key $k$ (e.g., hash(file name)) is assigned to the node with key $p$ (e.g., hash(IP address)) such that $k \le p$ and there is no node $p'$ with $k \le p'$ and $p' < p$ (wrapping around the ring)
Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
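The successor rule can be sketched in a few lines. This is an illustration, not the Chord protocol itself: it assumes a global view of all peer identifiers (which no real peer has), truncates SHA-1 to `M` bits by taking it modulo 2^M, and uses a small ring of peer ids chosen for the example. The names `chord_id`, `successor`, and `RING` are hypothetical.

```python
import hashlib
from bisect import bisect_left

M = 6  # identifier bits, so the ring has 2**M = 64 positions (small, for illustration)

def chord_id(name, m=M):
    """Map a name (file name, IP address, ...) to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, key_id):
    """First node whose id is equal to or follows key_id on the ring."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key_id)
    return ids[i % len(ids)]  # wrap around past the largest id

# Example ring of peer identifiers (assumed for this sketch).
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
responsible_for_40 = successor(RING, 40)
```

A key with identifier 40 lands on peer 42, and a key with identifier 60 wraps around to peer 1.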
15Chord
Peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance → finger tables
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and the finger tables of p8, p42, and p51]
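A finger table can be computed as follows, assuming the usual Chord definition in which the i-th finger of peer n is succ((n + 2^i) mod 2^m). As a sketch it uses a global list of peer ids, which real peers do not have; `finger_table`, `successor`, and `RING` are names introduced for this example.

```python
from bisect import bisect_left

def successor(node_ids, key_id):
    """First node whose id is equal to or follows key_id on the ring."""
    ids = sorted(node_ids)
    return ids[bisect_left(ids, key_id) % len(ids)]

def finger_table(n, node_ids, m):
    """Fingers of peer n: successors of n + 1, n + 2, n + 4, ..., n + 2**(m-1)."""
    return [successor(node_ids, (n + 2 ** i) % (2 ** m)) for i in range(m)]

RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # assumed example ring, m = 6
fingers_p8 = finger_table(8, RING, 6)
fingers_p51 = finger_table(51, RING, 6)
```

Because the finger targets double in distance, each routing hop can at least halve the remaining distance to the key, which gives the logarithmic lookup cost.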
16Node Joins in Chord
[Figure: new peer p42 joins between p38 and p48. lookup(42) finds its successor and p42 sets its succ pointer; keys k39 and k40 move to p42 while k43 stays at p48; p38 updates its succ pointer.]
init_finger_tables()
successor = node.find_successor()
predecessor = successor.predecessor
predecessor.successor = new
17And others ...
- P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
- CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. 161-172
- Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
- Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz: Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004.
18Range queries
- Range queries
- A range query [v1, v2] searches for those peers which store data with key value k ∈ [v1, v2]
- DHTs only support exact-match queries efficiently
- The naïve approach to process range queries in DHTs is to query each value of a range individually
- This is highly expensive!
19DHTs and Range Queries
An order-preserving hash function usually leads to skewed distributions.
- There are two main solutions to cope with load imbalances, i.e., to perform load balancing:
- transferring load, or
- replicating data
20DHT and Range Queries (2)
- Existing approaches to deal with range queries:
- Locality-preserving hashing: OP-Chord, Triantafillou et al. (2003); Skip Graphs, Aspnes et al. (2004)
- Hashing ranges of values instead of each value individually: CAN-based, Andrzejak et al. (2002), Sahin et al. (2004)
- Another problem in that context: access load imbalances
- One possible solution: transferring hot data to deal with those load imbalances
- However, data transfer does not solve access load imbalances in skewed access (query) distributions
21HotRod replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
A peer is hot (or overloaded) when its load exceeds the upper limit of its resource capacity.
An arc of peers is hot when at least one of its peers is hot.
Idea: replicate ranges of values.
22Efficient Load Balancing
24Building a P2P Search Engine (Peer-to-Peer Information Retrieval)
- Distributed Google
- A P2P approach is best suited:
- large number of peers
- exploit mostly idle resources
- intellectual input of user community
- scalable and self organizing
25Information Retrieval Basics
Document
Terms
26Information Retrieval Basics (2)
Top-k query processing: find the k documents with the highest total score.
Query execution: usually using some kind of threshold algorithm, e.g., Fagin's algorithm TA or a variant without random accesses:
- sequential scans over the index lists (round-robin)
- (random accesses to fetch missing scores)
- aggregate scores
- stop when the threshold is reached
Index: B tree on terms; index lists with (DocId, tf*idf) entries sorted by score
27Going distributed Index Organization
[Figure: collections and index entries spread over Peer 1, Peer 2, and Peer 3]
- Peer index: every peer has its own collection (full documents)
- Distributed index: index of peer descriptions
28(Full) Document Index
- Straightforward extension of a centralized document index
- Each peer is responsible for storing the index list for a subset of terms.
Query routing: DHT lookups
Query execution: distributed top-k [TPUT 04, KLEE 05]
29Peer Index
- Each peer has its own local index (e.g., created by web crawls)
- Peers publish compact per-term descriptions about their index
Query routing: 1. DHT lookups 2. Retrieve metadata 3. Find the most promising peers
Query execution: send the complete query and merge the incoming results
30P2P Search with Minerva
- Based on a scalable, churn-resilient DHT with O(log n) key lookup
- Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
- Maintain a semantic/social/statistical overlay network (SON)
- Exploit community behavior (bookmarks, links, tags, clicks, etc.)
31Two major Problems
- Result merging: the task of merging the obtained results into a final ranking
- Query routing: the task of finding high-quality peers (aka database/collection/peer selection)
- Overview articles:
- J. Callan (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers (pp. 127-150).
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
32Query Routing
- Given a query Q = (term1, term2, ..., termN), select the most promising peers
- Based on per-term, per-peer statistics:
- document frequency
- vocabulary size
- Normalization issues, using e.g.:
- collection frequency
- avg vocabulary size
- Most popular: CORI, GlOSS, Decision-Theoretic Framework (DTF)
33CORI
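Apply document ranking to resource ranking
[Figure: inference network connecting the query q via the terms t1, t2, t3, ..., tk to the resources (peers) p1, p2, ..., pj-1, pj]
Notation: C = number of peers; df = document frequency; cf = collection frequency; cw = number of distinct words per peer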
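A sketch of CORI-style peer ranking, using the notation above. It assumes the commonly cited CORI belief formula from Callan et al. with default belief b = 0.4 and the constants 50 and 150; averaging the per-term beliefs over the query is a simplification, and the statistics layout (`stats`, with keys `"cw"` and `"df"`) is hypothetical.

```python
import math

def cori_rank(query, stats, b=0.4):
    """Rank peers for a query using CORI-style beliefs.

    stats: dict peer -> {"cw": number of distinct words, "df": {term: document frequency}}
    Returns a list of (peer, score) pairs, best peer first.
    """
    num_peers = len(stats)                                     # C
    avg_cw = sum(s["cw"] for s in stats.values()) / num_peers
    scores = {}
    for peer, s in stats.items():
        total = 0.0
        for term in query:
            # cf: number of peers that contain the term
            cf = sum(1 for other in stats.values() if other["df"].get(term, 0) > 0)
            if cf == 0:
                continue  # term unknown everywhere: contributes nothing
            df = s["df"].get(term, 0)
            t = df / (df + 50 + 150 * s["cw"] / avg_cw)
            i = math.log((num_peers + 0.5) / cf) / math.log(num_peers + 1.0)
            total += b + (1 - b) * t * i
        scores[peer] = total / len(query)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-peer statistics.
stats = {
    "p1": {"cw": 1000, "df": {"chord": 200}},
    "p2": {"cw": 1000, "df": {"chord": 5}},
    "p3": {"cw": 1000, "df": {}},
}
ranking = cori_rank(["chord"], stats)
```

A peer that does not contain a query term still receives the default belief b, so scores are comparable across peers but never zero.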
34Literature
- J. Callan (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers (pp. 127-150).
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
- CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
- GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
- Decision-Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)
35Result Merging
- Problem: incomparable scores
- Different corpus statistics
- the df component used in tf*idf scoring functions is not globally known
- user with a lot of high-quality documents for term a → high df
- non-expert user with some bad documents for term a → low df
- Different scoring functions
- completely different functions
- different parameters in the same function
36Result Merging Approaches
- Score normalization by
- using global statistics
- computation of global statistics is difficult (not obvious)
- solution: using gossip
- score re-computation with the query initiator's local statistics
- requires re-ranking and knowledge about document contents
- score re-computation using query routing scores
- routing score is available anyway
37Global DF Estimation
gdf (global doc. freq.) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
- hash sketches [Flajolet/Martin 1985]
- duplicate-insensitive cardinality estimator for multisets
- hash each multiset element x onto an m-bit bitvector and remember the least significant 1 bit
- rough intuition: the least significant bit is set by half of the documents, the second bit by ¼ of the documents, ...
- Theory says: the most significant bit set is an estimator of log(n), n = number of documents
- Higher accuracy: average multiple iid sketches
38Global DF Estimation
Hash sketches of different peers are collected at a directory peer; distributivity is free:
$\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$
- gdf estimation algorithm
- each peer p posts a hash sketch for each (discriminative) term t to the directory
- the directory peer for term t forms the union of the incoming hash sketches
- when a peer needs to know gdf(t), it simply asks the directory peer for t
- sliding-window techniques for dynamic adjustment
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
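The union property is what makes hash sketches attractive for gdf estimation, and it can be checked with a small sketch. This is a single-bitmap illustration in the spirit of Flajolet/Martin, not the estimator of the cited paper: the 32-bit truncation of SHA-1, the correction constant `PHI`, and all function names are assumptions, and a single bitmap gives only a rough estimate.

```python
import hashlib

BITS = 32
PHI = 0.77351  # correction constant from Flajolet/Martin's analysis

def _hash(x):
    """32-bit hash of an element (SHA-1 truncated, an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(str(x).encode("utf-8")).digest()[:4], "big")

def rho(v):
    """Index of the least significant 1 bit of v."""
    return BITS - 1 if v == 0 else (v & -v).bit_length() - 1

def sketch(items):
    """Bitmap with bit rho(h(x)) set for every element x; duplicates have no effect."""
    bitmap = 0
    for x in items:
        bitmap |= 1 << rho(_hash(x))
    return bitmap

def estimate(bitmap):
    """Rough distinct-count estimate from the position of the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

# Two peers with overlapping document sets for the same term.
docs_p1 = range(0, 500)
docs_p2 = range(300, 800)
directory_sketch = sketch(docs_p1) | sketch(docs_p2)  # union = bitwise OR
```

The directory peer never sees the document ids themselves; it only ORs the incoming bitmaps, and documents held by several peers are counted once.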
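40Autonomous Peers → Overlapping Sources
[Figure: a querying peer facing peers with overlapping content (A, B, C, D, E); recall plotted against the number of peers contacted (1 to 4), with cumulative coverage A; A,B; A,B,C; A,..,D]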
41How?
- Enrich published statistics with overlap estimators.
- Interested in NOVELTY and QUALITY
- Iterative greedy selection process
- select the first peer based on quality
- select the next peer by quality × novelty
- Suitable synopses for overlap estimation
- Bloom filter [Bloom 1970]
- hash sketches [Flajolet/Martin 1985]
- min-wise independent permutations [Broder 1997]
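The greedy selection process can be sketched as follows. For clarity the sketch computes novelty exactly from sets of document ids; in the setting described here novelty would be estimated from published synopses instead. The function name, the quality values, and the example collections are hypothetical.

```python
def select_peers(collections, quality, max_peers):
    """Greedy peer selection: first by quality, then by quality * novelty.

    collections: dict peer -> set of document ids (stand-in for a synopsis)
    quality:     dict peer -> quality score
    """
    remaining = dict(collections)
    covered = set()   # documents already contributed by the selected peers
    chosen = []
    while remaining and len(chosen) < max_peers:
        def benefit(peer):
            if not chosen:
                return quality[peer]                    # first peer: quality only
            novelty = len(remaining[peer] - covered)    # documents not yet covered
            return quality[peer] * novelty
        best = max(sorted(remaining), key=benefit)      # sorted() makes ties deterministic
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen

collections = {"p1": {1, 2, 3, 4}, "p2": {1, 2, 3, 4, 5}, "p3": {6, 7}}
quality = {"p1": 0.9, "p2": 0.8, "p3": 0.5}
order = select_peers(collections, quality, 3)
```

Peer p2 has higher quality than p3, but after p1 has been selected it offers only one new document, so p3 is chosen second.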
42Min-Wise Independent Permutations Broder 97
Set of ids: 17 21 3 12 24 8
Compute N random permutations, e.g.:
h1(x) = 7x + 3 mod 51: 20 48 24 36 18 8 → min 8
h2(x) = 5x + 6 mod 51: 40 9 21 15 24 46 → min 9
...
hN(x) = 3x + 9 mod 51: 9 21 18 45 30 33 → min 9
MIPs are an unbiased estimator of overlap:
$P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| / |A \cup B|$
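The example can be replayed in code. The sketch reuses the three linear functions and the id set from the slide; note that such small linear functions only stand in for truly random permutations, so estimates from three of them are crude. `mips_vector` and `resemblance` are names introduced here.

```python
def mips_vector(ids, hash_params, modulus):
    """For each function h(x) = (a*x + b) mod modulus, keep the minimum over the set."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]

def resemblance(vector_a, vector_b):
    """Fraction of positions where the minima agree: estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for x, y in zip(vector_a, vector_b) if x == y)
    return matches / len(vector_a)

PARAMS = [(7, 3), (5, 6), (3, 9)]   # h1, h2, hN from the example
MODULUS = 51
ids = [17, 21, 3, 12, 24, 8]
mips = mips_vector(ids, PARAMS, MODULUS)
```

Two peers only need to exchange their vectors of minima, not their id sets, to estimate how much their collections overlap.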
43Bloom Filter [Bloom 1970]
- bit array of size m
- k hash functions h_i: docId_space → {1,..,m}
- insert n docs by hashing the ids and setting the corresponding bits
- a document is in the Bloom filter if the corresponding bits are set
- probability of false positives (pfp)
- tradeoff: accuracy vs. efficiency
Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4), 2005.
[Figure: an element X hashed by the k hash functions onto positions of the bit array that are set to 1]
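A minimal Bloom filter sketch, together with the standard approximation of the false-positive probability, pfp ≈ (1 - e^(-kn/m))^k. Deriving the k hash functions from salted SHA-1 digests is an assumption of this example, as are the class and function names.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits
        self.k = k                  # number of hash functions
        self.bits = [False] * m

    def _positions(self, doc_id):
        # k hash functions simulated by salting SHA-1 with the function index
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{doc_id}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos] = True

    def __contains__(self, doc_id):
        # May return True for an id that was never inserted (false positive),
        # but never False for an inserted id.
        return all(self.bits[pos] for pos in self._positions(doc_id))

def false_positive_rate(m, k, n):
    """Approximate pfp after inserting n ids into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

bloom = BloomFilter(m=1000, k=7)
for n in range(100):
    bloom.add(f"doc{n}")
pfp = false_positive_rate(m=1000, k=7, n=100)
```

Increasing m lowers the false-positive probability at the cost of a larger synopsis, which is the accuracy vs. efficiency tradeoff named above.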
44Multi-Key Statistics
- Solves an interesting problem:
- a peer with lots of documents on American football and lots of documents about pop music may not have a single document about American music
- this cannot be predicted using per-term statistics
Obvious Recall that
45Multi-Key Statistics in P2P
- Motivation
- estimated_quality(a and b) is derived from quality(a) and quality(b), but df_a and df_b do not determine df_(a and b)
- Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
- Query-driven: analyze query logs at directory peers
- Data-driven verification
- PAnnaKournikova ......
- PAndyRodick
- PBerlinMarathon
- No additional messages, shorter lists, highly accurate
The whole process can be easily integrated into peer-level P2P IR; additional statistics are often not needed.
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
46Single-term vs. multi-term P2P document indexing
- Make use of highly discriminative keys
- Limit the influence of overly long index lists
- Consider term pairs (triplets, ...) for shorter lists → efficient query processing
[Figure: single-term indexing (terms 1..M with their posting lists spread over peers 1..N; small vocabulary, long posting lists) vs. multi-term indexing with multi-term keys (keys 11..Nj with their posting lists spread over peers 1..N; large vocabulary, short posting lists)]
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
47Literature
- Overlap awareness
- Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
- Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129
- Sketches
- Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
- Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
- Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4), 2005.
48Literature
- Multi-key statistics
- Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
- Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
- Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
50For the IR people ....
- Why top-k?
- Cannot take a look at all matching documents
- E.g., Google provides millions of documents about Britney Spears
Requires ranking (scoring)
In text retrieval, for instance: tf*idf-based scores (and, of course, PageRank if you wish)
Remember part one: local query execution at each peer (peer-index model) AND truly distributed top-k processing in the full document index.
51For the DB guys ...
- Table with schema (id, attribute, value)
SELECT id, aggr(value) FROM table GROUP BY id ORDER BY aggr(value) DESC LIMIT k
52For the networking guys ...
Network Monitoring
Find clients that cause high network traffic.
53Computational Model
- m lists with (itemId, score) pairs, sorted by score in descending order
- One list per attribute (e.g., term)
- Aggregation function aggr()
- Monotonicity is important: for all items a, b, if $s_i(a) \le s_i(b)$ for all lists $i$, then $aggr(a) \le aggr(b)$, with $s_i(x)$ denoting the score of item x in list i
- Goal: return the top-k items w.r.t. their aggregated (overall) scores
54How to process this?
- Most popular: the family of threshold algorithms
- Fagin, 1999
- Nepal/Ramakrishna, 1999
- Güntzer/Balke/Kießling, 2001
- Basic ideas
- keep an upper and a lower score bound for each document
- lowerbound (or worstscore): sum of the scores we have seen so far, assuming 0 for unseen dimensions
- upperbound (or bestscore): lowerbound + highest possible value for unseen dimensions
- know what we've got already; know what to expect
- stop if no further step can improve the current (i.e., final) ranking
55Fagins NRA
- NRA(q, L):
- top-k := ∅; candidates := ∅; min-k := 0
- scan all lists Li (i = 1..m) in parallel:
- consider item d at position pos_i in Li
- E(d) := E(d) ∪ {i}
- high_i := s_i(q_i, d)
- worstscore(d) := aggr{ s_ν(q_ν, d) | ν ∈ E(d) }
- bestscore(d) := aggr{ aggr{ s_ν(q_ν, d) | ν ∈ E(d) }, aggr{ high_ν | ν ∉ E(d) } }
- if worstscore(d) > min-k then
- remove argmin_d' { worstscore(d') | d' ∈ top-k } from top-k
- add d to top-k
- min-k := min{ worstscore(d') | d' ∈ top-k }
- else if bestscore(d) > min-k then
- candidates := candidates ∪ {d}
- threshold := max{ bestscore(d') | d' ∈ candidates }
- if threshold ≤ min-k then exit
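A compact, runnable variant of the NRA idea, assuming sum as the (monotone) aggregation function and non-negative scores. It differs from the pseudocode in bookkeeping: bounds are recomputed for all seen items after every round of sorted accesses rather than maintained incrementally, and the reported scores are worstscores (lower bounds). The example lists are hypothetical.

```python
from typing import Dict, List, Tuple

def nra(lists: List[List[Tuple[str, float]]], k: int):
    """No-random-access top-k over score-sorted lists, aggregating with sum.

    Returns (top-k as (item, worstscore) pairs, scan depth at which the scan stopped).
    """
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}   # item -> {list index: score seen there}
    high = [0.0] * m                          # last score read from each list
    worst: Dict[str, float] = {}
    topk: List[str] = []
    depth = 0
    max_depth = max((len(lst) for lst in lists), default=0)
    while depth < max_depth:
        for i, lst in enumerate(lists):       # one round of sorted accesses
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0                 # exhausted list: nothing unseen remains
        depth += 1
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        topk = ranked[:k]
        if len(topk) == k:
            min_k = worst[topk[-1]]
            # best possible score of any candidate or of a completely unseen item
            threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
            if threshold <= min_k:
                break
    return [(d, worst[d]) for d in topk], depth

# Hypothetical index lists for two query terms, sorted by descending score.
list1 = [("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)]
list2 = [("d2", 0.9), ("d1", 0.5), ("d4", 0.2), ("d3", 0.1)]
result, stop_depth = nra([list1, list2], k=1)
```

In this example the scan stops after two of the four positions, because no other item can still exceed the worstscore of d2.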
56Top-k Search
Data items: d1, ..., dn
Example: s(t1,d1) = 0.7, ..., s(tm,d1) = 0.2
Query: q = (t1, t2, t3), k = 1
Index lists (sorted by score):
t1: d78 0.9 | d23 0.8 | d10 0.8 | d1 0.7 | d88 0.2
t2: d64 0.8 | d23 0.6 | d10 0.6 | d10 0.2 | d78 0.1
t3: d10 0.7 | d78 0.5 | d64 0.4 | d99 0.2 | d34 0.1
Scan depths 1, 2, 3: STOP! after scan depth 3
58Evolution of a Candidates Score
Observation: pruning is often overly conservative (deep scans, high memory for the priority queue)
[Figure: bestscore(d) and worstscore(d) of a candidate d plotted against scan depth, together with min-k; annotation: drop d from the candidate queue]
- Approximate top-k
- What is the probability that d qualifies for the top-k?
59Safe Thresholding vs. Probabilistic Guarantees
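- NRA is based on the invariant $worstscore(d) = \sum_{i \in E(d)} s_i(d) \le s(d) \le \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i = bestscore(d)$
- Relaxed into a probabilistic threshold test: drop d if $P[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \text{min-}k] \le \epsilon$, where $S_i$ is the unknown score of d in list i
- Or equivalently, with $\delta(d) = \text{min-}k - worstscore(d)$: $P[\sum_{i \notin E(d)} S_i > \delta(d)] \le \epsilon$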
60Expected Result Quality
- Missing relevant items
- The probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
- For each candidate: p_miss ≤ ε
- P[recall = r/k] = P[precision = r/k]
- E[precision] = E[recall]
62Going distributed
- Key observations
- Network traffic is crucial
- Number of round trips is crucial
- Straightforward application of TA/NRA?
- expensive: huge number of round trips
- even with batching: unpredictable performance
63Where is the data?
[Figure: peers P0, P1, P2, P3]
- Consider
- network consumption
- per-peer load
- latency (query response time): network, I/O, processing
64Three Phase Uniform Threshold Algorithm
Cao and Wang, PODC 2004
First distributed top-k algorithm with a fixed number of phases!
- Exactly 3 phases:
- Phase 1: fetch the k best entries (d, s_j) from each of P1 ... Pm and aggregate ($\sum_{j=1..m} s_j(d)$) at the query initiator
- Phase 2: ask each of P1 ... Pm for all entries with s_j ≥ min-k / m and aggregate the results at the query initiator; min-k is the score of the item currently at rank k
- Phase 3: fetch the missing scores for all candidates by random lookups at P1 ... Pm
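The three phases can be simulated in a single process. In this sketch each peer is just a dictionary from item to local score, "messages" are plain function-local reads, scores are assumed non-negative, and sum is the aggregation function; `tput` and the example data are hypothetical. The candidate pruning before phase 3 uses the fact that every score not returned in phase 2 is below min-k / m.

```python
from collections import defaultdict

def tput(peers, k):
    """Simulate the three TPUT phases; peers is a list of dicts item -> local score."""
    m = len(peers)
    seen = defaultdict(dict)                       # item -> {peer index: score}

    # Phase 1: each peer reports its k best entries.
    for j, scores in enumerate(peers):
        for item, s in sorted(scores.items(), key=lambda kv: -kv[1])[:k]:
            seen[item][j] = s
    partial = sorted((sum(v.values()) for v in seen.values()), reverse=True)
    min_k = partial[k - 1] if len(partial) >= k else 0.0
    threshold = min_k / m

    # Phase 2: each peer reports all entries with score >= min-k / m.
    for j, scores in enumerate(peers):
        for item, s in scores.items():
            if s >= threshold:
                seen[item][j] = s
    partial = sorted((sum(v.values()) for v in seen.values()), reverse=True)
    min_k = partial[k - 1] if len(partial) >= k else 0.0
    # Unreported scores are below the threshold, which bounds each candidate from above.
    candidates = [d for d, v in seen.items()
                  if sum(v.values()) + threshold * (m - len(v)) >= min_k]

    # Phase 3: random lookups for the missing scores of the remaining candidates.
    exact = {d: sum(scores.get(d, 0.0) for scores in peers) for d in candidates}
    return sorted(exact.items(), key=lambda kv: -kv[1])[:k]

peers = [
    {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1},
    {"b": 0.8, "c": 0.7, "a": 0.1, "d": 0.05},
    {"c": 0.6, "d": 0.5, "a": 0.2, "b": 0.1},
]
top2 = tput(peers, k=2)
```

The result can be compared against a brute-force aggregation over all lists, which is how the usage above is checked.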
65Coordinator
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each with an index list sorted by score]
66Analysis of TPUT
- Theorem: TPUT is an exact algorithm, i.e., it identifies the true top-k items.
- Proof (sketch): TPUT cannot miss a true top-k item.
- Assume it misses one, i.e., the item is below min-k / m in all lists.
- → overall score < m · (min-k / m) = min-k
- → not a true top-k item!
[Figure: lists 1, 2, and 3 in the state after phase 2, with the min-k score marked]
67Analysis of TPUT
- If min-k / m is small, TPUT retrieves a lot of data in Phase 2
- → high network traffic
- Random accesses
- → high per-peer load
- KLEE [VLDB 05]
- Different philosophy: approximate answers
- Efficiency
- reduces (docId, score)-pair transfers
- no random accesses at each peer
- Two pillars
- the HistogramBlooms structure
- the Candidate List Filter structure
68Additional Data Structures
Goal: increase the min-k / m threshold
Per index list: equi-width histogram + Bloom filter for each cell + average score per cell + upper/lower score
Usage: during Phase 1, fetch top-k from each list + top-c cells
69KLEE
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each with an index list sorted by score]
70KLEE Candidate Set Reduction
[Figure: coordinator peer P0 with candidate set, current top-k, and the min-k / m threshold; cohort peers Pi and Pj with score-sorted index lists, their top k, and Bloom filter bit vectors (e.g., 010010000100010001)]
71KLEE Candidate Retrieval
[Figure: coordinator peer P0 with current top-k and candidate set; cohort peers Pi and Pj with score-sorted index lists, their top k, and Bloom filter bit vectors (e.g., 010010000100010001)]
72Literature
- Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
- Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
- Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
- Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
- Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
- Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
- Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
- Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
- Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648
73Part II Social Search
75Motivation
- People connected through a network
- People create links to other people
- Links can express friendship, recommendations, etc.
- Different graph structures appear
- Sharing interests
- Enables users to find others who share common interests
- Similar users can provide relevant content
- Users and content are spread over different sites
- Distributed nature and continuously increasing size call for peer-to-peer approaches
76Outline of the Second Part
- Link Analysis: the Web as a graph
- PageRank
- Distributed approaches
- BlockRank
- Local PageRank and ServerRank
- Adaptive OPIC
- JXP
- Identifying common interests: Semantic Overlay Networks
- Crespo and Garcia-Molina
- pSearch
- p2pDating
- Social Networks: a new paradigm
- What people share
- Social graphs
- Links, tags, users analysis
77Links are everywhere
78Links are everywhere
Example of a Flickr friends network
79Links are everywhere
80Links Analysis
- The set of nodes/pages (e.g., web pages, people,
products, etc) and the links connecting them
define a graph
81Link Analysis
- At the end we have something like this
- Lots of useful information can be obtained from the analysis of such graphs
82Adjacency Matrix
- Matrix representation of graphs
- Given a graph G, its adjacency matrix A is n×n, with
- $a_{ij} = 1$ if there is a link from node i to node j
- $a_{ij} = 0$ otherwise
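As a small illustration of this definition, the following sketch builds the adjacency matrix of a toy graph as a list of rows. The node names and links are made up for the example.

```python
def adjacency_matrix(nodes, links):
    """Return A with A[i][j] = 1 if there is a link from nodes[i] to nodes[j], else 0."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in links:
        matrix[index[source]][index[target]] = 1
    return matrix

nodes = ["a", "b", "c"]
links = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
A = adjacency_matrix(nodes, links)
```

Row i lists the outlinks of node i, so the row sum is the outdegree and the column sum is the indegree.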
83PageRank Exploring the Wisdom of Crowds
- Measures the relative importance of pages in the graph
- The importance of a page depends on the importance of the pages that point to it
- Random surfer model: once on a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1 - α)
- PR: probability of being at a certain page after a large enough number of jumps
S. Brin, L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
84PageRank Formal Definition
- N → total number of pages
- PR(p) → PageRank of page p
- out(p) → outdegree of p
- ε → random jump probability
- $PR(p) = \epsilon / N + (1 - \epsilon) \sum_{q \to p} PR(q) / out(q)$
- Can be computed using the power iteration method
- In practice, more efficient versions can be used
- Google is believed to use it on the Web graph, combined with other metrics, to rank its search results
85PageRank Matrix Notation
- $A = (1 - \epsilon) P + \epsilon E$ → matrix containing the transition probabilities
- where $P_{ij} = 1/out(i)$ if there is a link from i to j, 0 otherwise; E is the random jumps matrix
- Probability distribution vector at time k: $x^{(k)} = A^T x^{(k-1)}$
- $x^{(0)}$ is the starting vector
- PageRank → stationary distribution of the Markov chain described by A, i.e., the principal (left) eigenvector of A
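A minimal power-iteration sketch of PageRank on a small graph. It assumes a uniform random-jump distribution with jump probability `eps`, treats pages without outlinks as linking to every page (one common convention, not stated in the slides), and runs a fixed number of iterations instead of testing for convergence. The graph is hypothetical.

```python
def pagerank(graph, eps=0.15, iterations=100):
    """graph: dict page -> list of pages it links to. Returns dict page -> PageRank."""
    pages = sorted(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # uniform starting vector
    for _ in range(iterations):
        nxt = {p: eps / n for p in pages}         # random-jump share
        for p in pages:
            out = graph[p]
            targets = out if out else pages       # dangling page: spread uniformly
            share = (1 - eps) * pr[p] / len(targets)
            for q in targets:
                nxt[q] += share
        pr = nxt
    return pr

# Hypothetical graph: both a and b point to c, and c points back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Page c collects score from both a and b and ends up with the highest rank, while b, which nobody links to, receives only the random-jump share.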
86Going Distributed
- PageRank in principle needs the whole graph in one place
- Shortcomings
- Not scalable for huge graphs, like the Web
- Slow updates: computing PageRank on such a huge graph can take weeks
- Not suitable for different network architectures (e.g., P2P)
- Distributed approaches, where the graph is partitioned, are clearly needed
- Some distributed approaches (more details on the next slides)
- Local PageRank and ServerRank (Wang et al.)
- BlockRank (Kamvar et al.)
- JXP (Parreira et al.)
87The Block Structure
- Most links are among web pages inside the same host
- The block structure can be exploited for speeding up and/or distributing the PR computation
[Figure: adjacency matrix with blocks for pages from Host A and pages from Host B]
88BlockRank
- PageRank in three steps
- 1. Computes local PageRanks of pages for each host, by considering only intra-host links
- 2. Computes the importance of each host, using the local PR values and the inter-host links
- 3. Combines the previous values to create the starting vector for the standard PR algorithm
- Speeds up computation
- Step 1 can be parallelized
- Still needs the whole matrix for step 3
S. Kamvar, T. Haveliwala, C. Manning, G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
89Going Distributed
- Local PR and ServerRank
- Similar to BlockRank
- Local PR: PR computed inside each server using intra-server links
- ServerRank: PR computed on the server graph using inter-server links
- The server graph does not need to be materialized; computation is done by exchanging messages among servers
- Local PR and ServerRank are combined to approximate the true PR of a page
- Values can be further refined by using Local PR info in the ServerRank computation and vice versa
- Server partition can be a limitation
Y. Wang, D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.
90Partition at peer level
- In P2P networks, server partition is not suitable
91Partition at peer level
- Every peer crawls Web fragments at its discretion
- Peers have only local (incomplete) information
- Pages might link to or be linked from pages at other peers
- Overlaps between peers' graphs may occur
- Peers are a priori unaware of other peers' contents
92Adaptive OPIC
- OPIC: Online Page Importance Computation
- Computes the importance of a page on-line, with few resources
- Algorithm
- Pages initially receive some cash
- Pages are randomly visited
- When a page is visited, its cash is distributed between the pages it points to
- The page importance for a given page is computed using the history of cash of that page
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In WWW, 2003.
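The cash-distribution idea can be sketched as a small simulation. To keep the run deterministic, this sketch visits pages round-robin rather than randomly, starts with equal cash on every page, and assumes every page has at least one outlink; none of these choices come from the cited paper. The function name and the graph are hypothetical.

```python
def opic(graph, rounds):
    """Simulate OPIC on graph (dict page -> list of linked pages).

    Returns dict page -> importance, i.e. the page's share of all cash ever collected.
    """
    pages = sorted(graph)
    cash = {p: 1.0 / len(pages) for p in pages}   # initial cash
    history = {p: 0.0 for p in pages}             # cash collected at visit time
    for _ in range(rounds):
        for p in pages:                           # round-robin visiting order
            amount = cash[p]
            history[p] += amount                  # record the cash before giving it away
            cash[p] = 0.0
            share = amount / len(graph[p])
            for q in graph[p]:
                cash[q] += share                  # distribute among linked pages
    total = sum(history.values())
    return {p: history[p] / total for p in pages}

# Hypothetical three-page cycle: every page is equally important.
importance = opic({"a": ["b"], "b": ["c"], "c": ["a"]}, rounds=200)
```

No link matrix is stored: each visit only needs the outlinks of the page being visited, which is why the method works with few resources.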
93Adaptive OPIC
- Example
- Small Web of 3 pages
- Alice has all the cash to start (importance is independent of the initial state)
[Figure: link graph over the pages Alice, George, and Bob]
Cash-game history:
- Alice received 600 (200 + 400): 40%
- Bob received 600 (200 + 100 + 300): 40%
- George received 300 (200 + 100): 20%
94Adaptive OPIC
- No particular graph partition
- No need to store the link matrix
- Adapts to changes in the web graph by considering only the recent part of the cash history for each page
- Time window [now-T, now]
- High number of messages exchanged
- Does not handle the case where the same page is stored at more than one place
95The JXP Algorithm
- Decentralized algorithm for computing global authority scores of pages in a P2P network
- Runs locally at every peer
- No coordinator, asynchronous
- Combines local PageRank computations + meetings between peers
- JXP scores converge to the true global PageRank scores
Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel, and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.
96The JXP Algorithm
- World node
- Special node attached to the local graph at every peer
- Compact representation of all other pages in the network
- Special features
- All links from local pages to external pages point to the world node
- Links from external pages that point to local pages (discovered during meetings) are represented at the world node
- Score and outdegree of these external pages are stored; the world node's outgoing links are weighted to reflect the score mass given by the original link
- Self-loop link to represent transitions among external pages
[Figure: local graph extended with the world node W]
97The JXP Algorithm
- Initialization step
- The local graph is extended by adding the world node
- PageRank is computed on the extended graph → JXP scores
- Main algorithm (for every Pi in the network)
- Select Pj to meet
- Update the world node
- Add edges for pages in Pj that point to pages in Pi
- If an edge already exists at the world node, the score of the source page is updated by taking the highest of both scores
- Compute PageRank → JXP scores
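The world-node update during a meeting can be sketched as follows. This covers only the bookkeeping step described above (adding edges from external pages and keeping the highest score), not the PageRank recomputation that follows it. The data layout, the function name, and the decision to skip pages that are also stored locally are assumptions of this sketch.

```python
def update_world_node(world_links, local_pages, met_peer_pages):
    """Merge information from a met peer into the world node of the local peer.

    world_links:    dict external page -> {"score", "outdegree", "targets"}
    local_pages:    set of pages stored at the local peer
    met_peer_pages: dict page -> {"score": JXP score at the met peer, "links": outlinks}
    """
    for page, info in met_peer_pages.items():
        if page in local_pages:
            continue                               # overlap: page is not external here
        targets = set(info["links"]) & local_pages
        if not targets:
            continue                               # page does not point into the local graph
        entry = world_links.get(page)
        if entry is None:
            world_links[page] = {"score": info["score"],
                                 "outdegree": len(info["links"]),
                                 "targets": targets}
        else:
            # Edge already known: keep the highest score seen so far.
            entry["score"] = max(entry["score"], info["score"])
            entry["outdegree"] = len(info["links"])
            entry["targets"] |= targets
    return world_links

local_pages = {"a", "b"}
world = {}
update_world_node(world, local_pages, {
    "x": {"score": 0.2, "links": ["a", "z"]},
    "y": {"score": 0.1, "links": ["z"]},
})
update_world_node(world, local_pages, {"x": {"score": 0.3, "links": ["a", "b"]}})
```

After the two meetings the world node knows one external page, x, with the higher of its two reported scores and links to both local pages.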
98The JXP Algorithm
Theorem: In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores.
99Locating parts of the Graph
- Finding peers that share common interests
- Many applications can benefit from it
- Distributed PR
- In principle, peers need to send content only to the peers that contain their successors
- Random messages guarantee that those peers will eventually be reached, but part of the messages will be wasted
100WASTED MEETING!!!! We want to avoid it!!!
101Locating parts of the Graph
- Query answering
- Ideal: Forward a query only to peers that are more likely to provide good answers to it
- Query flooding is very expensive
- Hash-based queries are not suitable for approximate queries
102Locating parts of the Graph
- Locating relevant peers
- Increase performance
- Reduce traffic load
- Idea: Group peers according to the semantics of their content and place them into different overlay networks
103Outline of the Second Part
- Link Analysis: The Web as a Graph
- PageRank
- Distributed Approaches
- BlockRank
- Local PageRank and ServerRank
- Adaptive OPIC
- JXP
- Identifying common interests: Semantic Overlay Networks
- Crespo and Garcia-Molina
- pSearch
- p2pDating
- Social Networks: A new paradigm
- What people share
- Social graphs
- Links, tags, users analysis
104Semantic Overlay Networks
- Partition the P2P network into several thematic networks
- Peers with similar or beneficial/complementary content are clustered together
- Queries for a given content will be forwarded only to peers with such content
- Flooding in smaller networks with a smaller TTL (or more results with the same TTL)
105Overlay Networks: Random vs. Semantic
- Random
- Peers connect to a small set of random peers
- Queries are flooded through the network
- Peers with unrelated content receive the query
- Low performance: High number of messages
- Low recall if only few peers are contacted
- Semantic
- Peers connect to peers with related content → Cluster of peers
- Peers identify the query's topic and forward it only to the set of peers on that topic
- Messages to peers with unrelated content are avoided
- Better performance: Smaller number of messages
- High recall by asking only few peers
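The difference in message cost can be illustrated with a TTL-limited flood. This is a minimal sketch, assuming an overlay given as an adjacency dictionary and a simplified flood in which every peer forwards to all of its neighbors, including the one the query came from. The function name, the toy overlay and the SON membership set are hypothetical.

```python
from collections import deque


def flood(neighbors, start, ttl):
    """Return (peers reached, messages sent) for a TTL-limited flood."""
    reached = {start}
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted, the query is not forwarded further
        for other in neighbors[peer]:
            messages += 1
            if other not in reached:
                reached.add(other)
                frontier.append((other, remaining - 1))
    return reached, messages


# Random overlay: every peer may receive the query.
overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
# Semantic overlay: only links among peers that share the query's topic.
son_members = {"a", "b", "d"}
son_overlay = {p: [q for q in overlay[p] if q in son_members] for p in son_members}

print(flood(overlay, "a", ttl=2))
print(flood(son_overlay, "a", ttl=2))
```

In this toy setting the semantic overlay reaches all peers on the topic with fewer messages, because the unrelated peer "c" is never contacted.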
106When creating SONs
- Two main things to consider
- Node partitioning
- Clustering criteria
- Node partitioning: When does a peer belong to SON A?
- When it contains a doc of type A
- When it contains more than x docs of type A
- Fewer peers per SON → more results sooner
- Fewer SONs per peer → fewer connections
- Clustering criteria: Clustering must provide
- Load-balance
- Each category has similar number of nodes
- Each node belongs to a small number of categories
- Easy and accurate way to classify a document
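The two node-partitioning rules can be sketched with a single threshold. This is a minimal sketch, assuming each peer's documents have already been classified into one topic each; the function names and the example peers are hypothetical. Setting x to 0 gives the rule "contains a doc of type A", a larger x gives "contains more than x docs of type A".

```python
from collections import Counter


def sons_for_peer(doc_topics, x=0):
    """Topics (SONs) a peer joins: those with more than x local documents."""
    counts = Counter(doc_topics)
    return {topic for topic, n in counts.items() if n > x}


def build_sons(peers, x=0):
    """Map each SON to the set of peers that belong to it."""
    sons = {}
    for peer, doc_topics in peers.items():
        for topic in sons_for_peer(doc_topics, x):
            sons.setdefault(topic, set()).add(peer)
    return sons


peers = {"p1": ["rock", "rock", "jazz"], "p2": ["jazz"]}
print(build_sons(peers, x=0))
print(build_sons(peers, x=1))
```

Raising the threshold shrinks both the number of peers per SON and the number of SONs per peer, which is the trade-off named in the list above.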
107Crespo and Garcia-Molina
- Uses a classification hierarchy to form the overlay networks
- Documents and queries are classified into one or more concepts
- Queries are forwarded to peers in the super/sub concepts
A. Crespo and H. Garcia-Molina. Semantic Overlay
Networks for P2P Systems. Technical report,
Stanford University, January 2003.
108Crespo and Garcia-Molina
- Reported results show a significant improvement in the number of messages
- Music file sharing scenario: To get half the documents that match a query
- SONs: 461 msgs
- Gnutella: 1731 msgs
- SON links are logical: Two peers that are connected in a SON can actually be many hops away from each other
- The requirement that the hierarchy and the classification algorithm are shared among all nodes might be a problem
109pSearch
- Semantic Overlay on top of Content-Addressable Networks (CANs)
- Latent Semantic Indexing (LSI) is used to generate a semantic vector for each document
- Semantic vectors are used as keys to store document indices in the CAN
- Indices close in semantics are stored close in the overlay
- Two types of operations
- Publish document indices
- Process queries
Chunqiang Tang, Zhichen Xu, and Sandhya
Dwarkadas. Peer-to-peer Information Retrieval
Using Self-Organizing Semantic Overlay Networks.
In SIGCOMM, 2003.
110pSearch: Key Idea
[Figure: a document placed as a point in the semantic space]
111pSearch: Key Idea
[Figure: a document and a query placed as points in the semantic space]
112Background: Content-Addressable Network
- Partition Cartesian space into zones
- Each zone is assigned to a computer
- Neighboring zones are routing neighbors
- An object key is a point in the space
- Object lookup is done through routing
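Lookup by routing can be sketched on a toy version of such a space. This is a minimal sketch, assuming a 2-D unit square split into a fixed grid of equal zones and a greedy rule that hops to a neighboring zone one step closer to the key's zone. A real CAN has zones of different sizes that change as nodes join and leave; GRID and both function names are hypothetical.

```python
GRID = 4  # assumed: the unit square is split into GRID x GRID equal zones


def owner_zone(key):
    """Zone (column, row) responsible for a key, i.e. a point in [0, 1)^2."""
    return tuple(min(int(c * GRID), GRID - 1) for c in key)


def route(start_zone, key):
    """Greedy routing: move to a neighboring zone until the owner is reached."""
    target = owner_zone(key)
    current = list(start_zone)
    path = [tuple(current)]
    while tuple(current) != target:
        for axis in range(len(current)):
            if current[axis] != target[axis]:
                current[axis] += 1 if target[axis] > current[axis] else -1
                break  # one hop per step, to a direct neighbor
        path.append(tuple(current))
    return path


print(route((0, 0), (0.6, 0.9)))
```

Each hop only needs knowledge of direct neighbors, which is why neighboring zones are routing neighbors.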
113Background: Vector Space Model
- Term vectors represent documents and queries
- Elements correspond to the importance of a term in the document or query
- Statistical computation of vector elements
- Term frequency × inverse document frequency (TF-IDF)
- Ranking of retrieved documents
- Similarity between document vector and query vector
114Background: Vector Space Model
A: "books on computer networks"
B: "network routing in P2P networks"
Q: "P2P network"
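The ranking of A and B for query Q can be worked through with plain term-frequency vectors and cosine similarity. This is a minimal sketch, assuming raw term counts without the inverse document frequency factor, a two-word stop list and a naive plural stripping that is only adequate for this example; all names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"on", "in"}  # assumed stop list for this example only


def term_vector(text):
    """Bag-of-words term-frequency vector with very naive normalization."""
    terms = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s"):
            word = word[:-1]  # "networks" -> "network", "books" -> "book"
        terms.append(word)
    return Counter(terms)


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


a = term_vector("books on computer networks")
b = term_vector("network routing in P2P networks")
q = term_vector("P2P network")
print("sim(A, Q) =", round(cosine(a, q), 3))
print("sim(B, Q) =", round(cosine(b, q), 3))
```

B shares both query terms and mentions "network" twice, so it ranks above A, which shares only one term with the query.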
115Background: Latent Semantic Indexing
- The dimension of the document vectors has to match the dimension of the CAN
- Latent Semantic Indexing uses Singular Value Decomposition (SVD)
- High-dimensional term vector to low-dimensional semantic vector
- Elements correspond to the importance of an abstract concept in the document/query
- Also helps to overcome the synonym problem (e.g., a user looks for "car" and does not find a document about "automobile")
116Background: Latent Semantic Indexing
[Figure: term-by-document matrix (terms × documents) and the semantic vectors Va, Vb]
- SVD: singular value decomposition
- Reduce dimensionality
- Suppress noise
- Discover word semantics
- Car <-> Automobile
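For reference, the dimensionality reduction can be written with the standard rank-k truncated SVD. The symbols are notation introduced here, not taken from the slides: A is the term-by-document matrix, k is the target dimension (in pSearch, the dimension of the CAN), and d and q are the term vectors of a document and a query. The projection shown is the usual "folding-in" form and should be read as a sketch of the standard LSI formulation.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{T} d, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Here U_k holds the k leading left singular vectors, Sigma_k the k largest singular values, and the k-dimensional vectors \hat{d} and \hat{q} play the role of the semantic vectors Va, Vb and Vq.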
117pSearch: Basic Algorithm Steps
- Receive a new document A: generate a semantic vector Va, store the key in the index
- Receive a new query Q: generate a semantic vector Vq, route the query in the overlay
- The query is flooded to nodes within a radius r
- r is determined by a similarity threshold or the number of wanted documents
- All receiving nodes do a local search and report references to the best matching documents
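The publish and query steps can be sketched on a toy overlay. This is a minimal sketch, assuming a 2-D semantic space on the unit square, a fixed grid of equal zones in place of a real CAN, and Euclidean distance in place of the similarity measure; GRID, ToyOverlay and the helper functions are hypothetical names.

```python
import math

GRID = 4  # assumed: the unit square is split into GRID x GRID equal zones


def zone_of(point):
    """Zone (column, row) that owns a semantic vector in [0, 1)^2."""
    return tuple(min(int(c * GRID), GRID - 1) for c in point)


def distance_to_zone(point, zone):
    """Euclidean distance from a point to the closest point of a zone."""
    total = 0.0
    for c, z in zip(point, zone):
        low, high = z / GRID, (z + 1) / GRID
        gap = max(low - c, 0.0, c - high)
        total += gap * gap
    return math.sqrt(total)


class ToyOverlay:
    def __init__(self):
        self.index = {}  # zone -> list of (doc_id, semantic vector)

    def publish(self, doc_id, vector):
        """Store the document index at the zone that owns its semantic vector."""
        self.index.setdefault(zone_of(vector), []).append((doc_id, vector))

    def query(self, vector, radius, top_k=3):
        """Search the zones within `radius` of the query and merge their hits."""
        hits = []
        for zone, entries in self.index.items():
            if distance_to_zone(vector, zone) <= radius:
                # Local search at the node responsible for this zone.
                for doc_id, doc_vec in entries:
                    hits.append((math.dist(vector, doc_vec), doc_id))
        return [doc_id for _, doc_id in sorted(hits)[:top_k]]


overlay = ToyOverlay()
overlay.publish("a", (0.10, 0.10))
overlay.publish("b", (0.15, 0.20))
overlay.publish("c", (0.90, 0.90))
print(overlay.query((0.12, 0.12), radius=0.2))
```

Document "c" lies in a zone far from the query point, so its node is never contacted; only the nodes near Vq take part in the search.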
118pSearch Illustration
119p2pDating
- Start with a randomly connected network
- Peers meet other peers they do not know ("blind dates")
- If a peer "likes" another, it will remember it as a "friend"
- A remembers B → abstract link A → B
- Directed links → preserve peers' autonomy
- SONs dynamically evolve from the meeting process
J. X. Parreira et al.: p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing & Management 43, 643-664
120p2pDating
- Finding new friends
- Random meetings (Blind dates)
- Meet friends of friends
[Figure: peer A, its friend B, and B's friends]
If A and B are friends, it is very likely that B's friends are friends of A as well.
121Defining Good Friends
- Criteria for defining a good friend → combination of different measures
- History: Credits for good behavior in the past
- Response time, query result precision, etc.
- Collection similarity
- Collection overlap
- Different ways of estimating the overlap between two collections
- Number of links between peers
- Etc.
- Peers might have more than one list of friends
- E.g., according to different criteria
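One meeting round can be sketched as follows. This is a minimal sketch, assuming the measures are normalized to [0, 1] and combined as a weighted sum, that a peer keeps a bounded friend list, and that the next partner is either a blind date or a friend of a friend. All function names, the weights and the probability p_random are hypothetical.

```python
import random


def friend_score(measures, weights):
    """Weighted combination of quality measures (each assumed in [0, 1])."""
    return sum(w * measures.get(name, 0.0) for name, w in weights.items())


def remember(friends, candidate, score, max_friends):
    """Add the candidate and keep only the best-scored `max_friends` peers."""
    updated = dict(friends)
    updated[candidate] = score
    best = sorted(updated.items(), key=lambda item: item[1], reverse=True)
    return dict(best[:max_friends])


def pick_candidate(me, friends_of, all_peers, rng, p_random=0.5):
    """Choose the next peer to meet: a blind date or a friend of a friend."""
    known = set(friends_of.get(me, ())) | {me}
    fof = [p for f in friends_of.get(me, ())
           for p in friends_of.get(f, ()) if p not in known]
    if fof and rng.random() >= p_random:
        return rng.choice(fof)
    strangers = [p for p in all_peers if p not in known]
    return rng.choice(strangers) if strangers else None


weights = {"overlap": 0.5, "history": 0.5}
score = friend_score({"overlap": 0.5, "history": 1.0}, weights)
print(remember({"B": 0.9, "C": 0.2}, "D", score, max_friends=2))
print(pick_candidate("A", {"A": ["B"], "B": ["C", "A"]}, ["A", "B", "C", "D"],
                     random.Random(0), p_random=0.0))
```

Keeping one such friend list per criterion would give the multiple lists mentioned in the last bullet.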
122Going Social
- Before
- Only few content producers (e.g., companies, universities)
- Analysis was done using the content itself plus a few implicit recommendations (links)
- Very little information about the content consumers (mainly through query logs)
- Nowadays
- New technologies to facilitate content sharing
- Content consumers are now also content producers and content describers (e.g., explicit recommendations, tags, etc.)
- More and more "crowd wisdom" that can be harvested
125Social Networks
- A social structure made of nodes (which are generally individuals or organizations) that are tied by one or more specific types of relations, such as
- values
- visions
- ideas
- friends
- conflict
- web links
- Etc
- Social networks have been studied for over a
century
126Social Network Services
- Enable the creation of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others
- Online communities offer an easy way for users to publish and share their content.
127Social Networking Growth
- Several social networking sites have experienced
dramatic growth during the past year.
Worldwide Growth of Selected Social Networking Sites, June 2007 vs. June 2006, Users Age 15+. Source: comScore
128What people share
129Social Networks
- Besides sharing content, a user can
- describe documents using tags
- maintain a list of friends
- make comments on other users' content, exchange opinions, discover users with a similar profile
- In contrast to the Web graph, in social graphs users are part of the model
130Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip Bohannon: Challenges in Searching Online Communities. IEEE Data Eng. Bull. 30(2): 23-31 (2007)
131Social Graphs
- Other models also possible
- Directed vs. Undirected edges
- Etc.
Standard IR techniques for Web retrieval need to be adapted to work on social networks → A lot of current research is dedicated to this area
132Social Networks
- The Wisdom of Crowds: Beyond PR
- Spectral analysis of various graphs
- E.g., SocialPageRank, FolkRank.
- Tag semantic analysis
- Discovering semantics from tag co-occurrence
- E.g., SocialSimRank
- Distributed View
- Exploiting social relations to enhance search
- E.g., PeerSpective
133Link Analysis in Social Networks
- SocialPageRank
- High-quality web pages are usually popularly annotated, and popular web pages, up-to-date web users and hot social annotations can be mutually enhanced.
- Let M_UT, M_TD, M_DU be the matrices corresponding to the relations Users-Tags, Tags-Docs, Docs-Users
- Compute iteratively
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
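The mutual reinforcement between documents, users and tags can be sketched as an iteration over the three matrices. This is a minimal sketch under loud assumptions: scores are passed docs → users → tags → docs and then back again in every round, each intermediate vector is normalized to sum 1, and the matrices are plain nested lists with m_du as docs × users, m_ut as users × tags and m_td as tags × docs. The update order, the normalization and all names are assumptions made for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    total = sum(vector)
    return [v / total for v in vector] if total else vector


def social_pagerank(m_du, m_ut, m_td, rounds=50):
    """Iteratively reinforce document scores through users and tags."""
    n_docs = len(m_du)
    docs = [1.0 / n_docs] * n_docs
    for _ in range(rounds):
        users = normalize(mat_vec(transpose(m_du), docs))   # docs -> users
        tags = normalize(mat_vec(transpose(m_ut), users))   # users -> tags
        docs = normalize(mat_vec(transpose(m_td), tags))    # tags -> docs
        tags = normalize(mat_vec(m_td, docs))               # docs -> tags
        users = normalize(mat_vec(m_ut, tags))              # tags -> users
        docs = normalize(mat_vec(m_du, users))              # users -> docs
    return docs


m_du = [[1, 1], [0, 1]]  # doc 0 annotated by both users, doc 1 by user 1 only
m_ut = [[1, 0], [1, 1]]  # user 0 uses tag 0, user 1 uses both tags
m_td = [[1, 0], [1, 1]]  # tag 0 on doc 0, tag 1 on both docs
print(social_pagerank(m_du, m_ut, m_td))
```

In this toy data the document annotated by more users ends up with the higher score, which matches the intuition stated in the first bullet.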
134Link Analysis in Social Networks
- FolkRank
- Define graph G as the union of the graphs Users-Tags, Tags-Docs, Docs-Users
- Assume each user has a personal preference vector
- Compute iteratively
- FolkRank vector of docs is
Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme: Information Retrieval in Folksonomies: Search and Ranking. ESWC 2006: 411-426
135Tag Similarity
- SocialSimRank
- Idea: Similar annotations (tags) are usually assigned to similar web pages by users with common interests.
- $\mathrm{sim}(t_1, t_2) = \mathrm{aggr}\{\, \mathrm{sim}(d_1, d_2) \mid (t_1, d_1), (t_2, d_2) \in \mathrm{Tagging} \,\}$
- $\mathrm{sim}(d_1, d_2) = \mathrm{aggr}\{\, \mathrm{sim}(t_1, t_2) \mid (t_1, d_1), (t_2, d_2) \in \mathrm{Tagging} \,\}$
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
136Exploring friendship connections
- PeerSpective: users can query their friends' viewed pages
- HTTP proxies on users' computers index all browsed content
- When a Google search is performed, the query is also sent to the other proxies in parallel
Alan Mislove, Krishna P. Gummadi, and Peter
Druschel. Exploiting Social Networks for Internet
Search. HotNets, 2006.
137Social Networks
- New paradigm of publishing and searching content
- Rich data
- Different link structures
- Users' input for free!
- Relatively recent topic: Lots of research opportunities
- Works mentioned are by no means complete, still a lot to do
Since we are talking about Web 2.0: http://p2pinformationsearch.blogspot.com/