Peer-to-Peer Information Search

Slides: 138

Provided by: lsirpeo

1
Peer-to-Peer Information Search

Sebastian Michel
Ecole Polytechnique Fédérale Lausanne
Lausanne - Switzerland

Josiane Xavier Parreira
Max-Planck Institute for Informatics
Saarbrücken - Germany
2
Outline of Part 1

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

3
P2P Systems
Peer: one that is of equal standing with another (source: Merriam-Webster Online Dictionary)

Known from Napster and others
Sharing of mostly illegal content (mp3, movies)
P2P = Pirate-to-Pirate?
New kind of network organization: no client/server anymore
Basic Ideas
Each peer connects to a few other peers
All peers together form powerful networks
Potential Benefits
No single point of failure
Load is spread across multiple peers
(Resilient to failures and dynamics)

4
Napster

Developed in 1998.
First P2P file-sharing system

Central server (index)
Client software sends information about users' contents to the server (publish file statistics).
Users send queries to the server.
Server responds with the IPs of users that store matching files.
→ Peer-to-peer file sharing: the file download itself takes place between peers.
5
Gnutella

Protocol for distributed file sharing
Started in 2000
In 2005: 1.81 million computers connected
Unstructured Network
Truly decentralized
Uses message flooding during query execution.
Later version with super nodes and query routing

http://www.slyck.com/news.php?story=814
6
Gnutella Style
[Figure: a query ("Paris Hilton?") is flooded through an unstructured network; each hop decrements the TTL from 3 down to 0, where forwarding stops]
7
Gnutella Style

Pros
no complex statistical bookkeeping
Cons
lot of network traffic
some peers might not be reachable (TTL)

8
Bit Torrent

Idea: load sharing through file splitting
A lot of (legal) software distributors offer software through BitTorrent
Download information in a small .torrent file
One tracker node per file (specified in the .torrent file)

[Figure: a file split into segments 1-5 that are spread over several peers; the client requests a random peer list from the tracker node and then requests segments from those peers]
Incentives: tit-for-tat. Each peer remembers collaborative peers → different priorities
9
Literature

Book: Peer-to-Peer: Harnessing the Power of Disruptive Technologies, by Andy Oram. O'Reilly Media, Inc.

10
Overlay Networks

On top of existing networks

Different ways to build an overlay network:
structured
unstructured
hybrid

11
Self Properties (Promises)

Self-Organizing
evolves, grows..... without being guided/managed
Self-Optimizing
Self-Configuring
Self-Healing
Self-Restoration
Self-Diagnostics
Self-Protecting

12
Outline

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

13
Distributed Hash Tables

Hash table: given a key, return the bucket id.
Based on a hash function (like SHA-1)
Now distributed: for a given key, return the id of the peer currently responsible for the key.
Challenge: purely distributed protocols that cope with node failures, departures, arrivals.
No central manager.

14
Chord

uses an m-bit identifier space ordered in a modulo-2^m circle, the Chord ring
maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1
uses consistent hashing
an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring
Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p

Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
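The placement rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Chord protocol itself: it assumes a global, sorted view of all node identifiers (which a real peer does not have) and a small identifier space of m = 6 bits; the names `chord_id` and `successor` are chosen here for illustration.

```python
import hashlib
from bisect import bisect_left

M = 6  # identifier bits; assumed small for readability (SHA-1 yields 160 bits)


def chord_id(name: str, m: int = M) -> int:
    """Map a name (file name, IP address, ...) to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(ring: list, ident: int) -> int:
    """Return the first node id that is equal to or follows ident on the ring."""
    idx = bisect_left(ring, ident)
    return ring[idx % len(ring)]  # wrap around past the largest id


# Node ids taken from the Chord ring figure in this deck.
RING = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
```

With this ring, `successor(RING, 40)` yields node 42, and a key that hashes past 56 wraps around to node 1.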
15
Chord
peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance → finger tables
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and the finger tables of p8, p42, and p51]
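A finger table and a lookup that uses it can be sketched as follows. The sketch assumes the usual Chord definition (finger i of node n is succ((n + 2^i) mod 2^m)) and, for simplicity, computes every table from a global sorted list of node ids; the function names are illustrative and not taken from the slides.

```python
from bisect import bisect_left


def successor(ring, ident):
    """First node id equal to or following ident on the ring."""
    return ring[bisect_left(ring, ident) % len(ring)]


def finger_table(ring, n, m):
    """Finger i points to succ((n + 2^i) mod 2^m), for i = 0..m-1."""
    size = 2 ** m
    return [successor(ring, (n + 2 ** i) % size) for i in range(m)]


def ring_distance(a, x, size):
    """Clockwise distance from a to x; a full circle if x == a."""
    return (x - a) % size or size


def lookup(ring, start, key, m):
    """Route a lookup for key from node start; returns (owner, visited nodes)."""
    size = 2 ** m
    node, path = start, [start]
    while True:
        succ = successor(ring, (node + 1) % size)
        if ring_distance(node, key, size) <= ring_distance(node, succ, size):
            return succ, path  # key lies in (node, succ]
        nxt = succ
        # Jump to the farthest finger that still precedes the key.
        for finger in reversed(finger_table(ring, node, m)):
            if 0 < (finger - node) % size < ring_distance(node, key, size):
                nxt = finger
                break
        node = nxt
        path.append(node)


RING = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
```

For the ring in the figure with m = 6, the finger table of p8 is [14, 14, 14, 21, 32, 42], and a lookup for key 54 started at p8 visits p8, p42, p51 before returning p56.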
16
Node Joins in Chord
[Figure: node p42 joins the ring between p38 and p48: lookup(42) finds its successor, p42 sets its succ pointer, keys k39 and k40 are moved to p42 while k43 stays at p48, and p38 updates its succ pointer]
init_finger_tables()
successor = node.find_successor()
predecessor = successor.predecessor
predecessor.successor = new
17
And others ...

P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. 161-172
Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz: Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004.

18
Range queries

A range query [v1, v2] searches for those peers which store data with key value k ∈ [v1, v2]

DHTs efficiently support only exact-match queries
The naïve approach to processing range queries in DHTs is to query each value of the range individually, which is highly expensive

19
DHTs and Range Queries
Order-preserving hash function
usually leads to skewed distributions

There are two main solutions to cope with load imbalances, i.e. to perform load balancing:
transferring load, or
replicating data

20
DHT and Range Queries (2)

Existing approaches to deal with range queries:
Locality-preserving hashing
OP-Chord: Triantafillou et al. (2003); Skip Graphs: Aspnes et al. (2004)
Hashing ranges of values instead of each value individually
CAN-based: Andrzejak et al. (2002), Sahin et al. (2004)
Another problem in that context: access load imbalances
One possible solution: transferring hot data to deal with those load imbalances
However, data transfer does not solve access load imbalances in skewed access (query) distributions

21
HotRod: replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
A peer is hot (or overloaded) when its load exceeds the upper limit of its resource capacity.
An arc of peers is hot when at least one of its peers is hot → replicate ranges of values
22
Efficient Load Balancing
23
Outline

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

24
Building a P2P Search Engine (Peer-to-Peer Information Retrieval)

Distributed Google
P2P approach best suited:
large number of peers
exploit mostly idle resources
intellectual input of user community
scalable and self organizing

25
Information Retrieval Basics
Document
Terms
26
Information Retrieval Basics (2)
Top-k query processing: find the k documents with the highest total score
Query execution: usually using some kind of threshold algorithm
- sequential scans over the index lists (round-robin)
- (random accesses to fetch missing scores)
- aggregate scores
- stop when the threshold is reached
B-tree on terms
index lists with (DocId, tf*idf) entries sorted by score
e.g. Fagin's algorithm TA or a variant without random accesses
27
Going distributed: Index Organization

document index
[Figure: index organization across Peer 1, Peer 2, and Peer 3]

peer index
every peer has its own collection (full documents)
distributed index: index of peer descriptions

28
(Full) Document Index

Straightforward from the centralized document index
Each peer is responsible for storing the index list for a subset of terms.

Query routing: DHT lookups
Query execution: distributed top-k (TPUT '04, KLEE '05)
29
Peer Index

Each peer has its own local index (e.g., created
by web crawls)
Peers publish compact per-term descriptions about
their index

Query routing: 1. DHT lookups 2. Retrieve metadata 3. Find most promising peers
Query execution: send the complete query and merge the incoming results
30
P2P Search with Minerva
based on scalable, churn-resilient DHT with O(log
n) key lookup

Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
Maintain semantic/social/statistical overlay
network (SON)
Exploit community behavior (bookmarks, links,
tags, clicks, etc.)
31
Two major Problems

Task of merging the obtained results into a final ranking: Result Merging
Task of finding high-quality peers: Query Routing
aka database/collection/peer selection
Overview articles:
J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)

32
Query Routing

Given a query Q = (term1, term2, ..., termN), select the most promising peers
Based on
per-term per-peer statistics
document frequency
vocabulary size
normalization issues like
collection frequency
avg vocabulary size
Most popular:
CORI, GlOSS, Decision-Theoretic Framework (DTF)

33
CORI
Apply document ranking to resource ranking
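[Figure: inference network linking the query q through the terms t1, t2, t3, ..., tk to the resources (peers) p1, p2, ..., pj-1, pj]
C: number of peers; df: document frequency; cf: collection frequency; cw: number of distinct words per peer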
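The slide's CORI formula is not preserved in this transcript. The sketch below uses the formulation commonly given in the CORI literature, with the default constants 50, 150, and b = 0.4; these constants, the averaging of per-term beliefs, and the data layout of `peers` are assumptions made for illustration.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_peers, b=0.4):
    """Belief that a peer is relevant to one term (commonly cited CORI form)."""
    if df == 0 or cf == 0:
        return b  # term not present: only the default belief remains
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_peers + 0.5) / cf) / math.log(num_peers + 1.0)
    return b + (1 - b) * t * i


def rank_peers(query, peers):
    """peers: {name: {"cw": int, "df": {term: int}}}; returns names, best first."""
    num_peers = len(peers)
    avg_cw = sum(p["cw"] for p in peers.values()) / num_peers
    scores = {}
    for name, stats in peers.items():
        beliefs = []
        for term in query:
            # cf: number of peers whose collection contains the term
            cf = sum(1 for p in peers.values() if p["df"].get(term, 0) > 0)
            df = stats["df"].get(term, 0)
            beliefs.append(cori_belief(df, cf, stats["cw"], avg_cw, num_peers))
        scores[name] = sum(beliefs) / len(beliefs)
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical per-peer statistics.
PEERS = {
    "p1": {"cw": 1000, "df": {"chord": 50, "dht": 40}},
    "p2": {"cw": 1000, "df": {"chord": 5}},
    "p3": {"cw": 1000, "df": {"music": 100}},
}
```

For the query ("chord", "dht"), `rank_peers` places p1 first, then p2, then p3, since p3 holds neither term.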
34
Literature

J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
Decision-Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)

35
Result Merging

Problem: incomparable scores
Different corpus statistics
df component used in tf*idf scoring functions is not globally known
user with lots of high-quality documents for term a → high df
non-expert user with some bad documents for term a → low df

Different scoring functions
completely different functions
different parameters in the same function

36
Result Merging Approaches

Score normalization by
using global statistics
computation of global statistics is difficult (not obvious)
solution: using gossip
score re-computation with the query initiator's local statistics
requires re-ranking and knowledge about document contents
score re-computation using query routing scores
routing score available anyway

37
Global DF Estimation
gdf (global doc. freq.) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible

hash sketches [Flajolet/Martin 1985]
duplicate-insensitive cardinality estimator for multisets
hash each multiset element x onto an m-bit bitvector and remember the least significant 1 bit
rough intuition: the least-significant bit is set by half of the documents, the second bit by ¼ of the documents, ...
Theory says: most significant bit is an estimator of log(n), n = number of documents
Higher accuracy: average multiple i.i.d. sketches
38
Global DF Estimation
Hash sketches of different peers are collected at the directory peer; distributivity is free!
∪_i {ρ(h(x)) | x ∈ S_i} = {ρ(h(x)) | x ∈ ∪_i S_i}

gdf estimation algorithm
each peer p posts hash sketch for each
(discriminative) term t to directory
directory peer for term t forms union of
incoming hash sketches
when a peer needs to know gdf(t), simply ask
directory peer for t
sliding-window techniques for dynamic adjustment

Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
39
Outline

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

40
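Autonomous Peers → Overlapping Sources
[Figure: a querying peer faces peers with overlapping collections (A to E); recall is plotted against the number of peers contacted (1 to 4), with cumulative coverage A; A,B; A,B,C; A,..,D]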
41
How?

Enrich published statistics with overlap estimators.
Interested in NOVELTY and QUALITY
Iterative greedy selection process:
select the first peer based on quality
select the next peer by quality × novelty
Suitable synopses for overlap estimation:
Bloom filter [Bloom 1970]
hash sketches [Flajolet/Martin 1985]
min-wise independent permutations [Broder 1997]
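The greedy selection loop can be sketched as follows. For readability the sketch works on explicit document-id sets, whereas a real system would estimate novelty from the synopses listed above; the quality values, peer names, and the helper `select_peers` are hypothetical.

```python
def select_peers(peer_docs, quality, num_peers):
    """Greedy, overlap-aware peer selection.

    peer_docs: {peer: set of doc ids}, quality: {peer: float}.
    The first peer is chosen by quality alone, each further peer by
    quality * novelty, where novelty counts documents not yet covered.
    """
    covered, chosen = set(), []
    remaining = set(peer_docs)
    while remaining and len(chosen) < num_peers:
        def benefit(peer):
            if not chosen:
                return quality[peer]
            return quality[peer] * len(peer_docs[peer] - covered)

        best = max(sorted(remaining), key=benefit)  # sorted: deterministic ties
        chosen.append(best)
        covered |= peer_docs[best]
        remaining.remove(best)
    return chosen


# Hypothetical example: p2 is good but fully overlaps with p1.
DOCS = {"p1": {1, 2, 3, 4, 5, 6}, "p2": {1, 2, 3, 4, 5}, "p3": {7, 8, 9}}
QUALITY = {"p1": 0.9, "p2": 0.8, "p3": 0.5}
```

Here `select_peers(DOCS, QUALITY, 2)` picks p1 and then p3: p2 has the higher quality but contributes no new documents once p1 has been selected.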

42
Min-Wise Independent Permutations [Broder 97]
set of ids: 17 21 3 12 24 8
h1(x) = 7x + 3 mod 51: 20 48 24 36 18 8 (min = 8)
h2(x) = 5x + 6 mod 51: 40 9 21 15 24 46 (min = 9)
...
hN(x) = 3x + 9 mod 51: 9 21 18 45 30 33 (min = 9)
compute N random permutations and keep the minimum under each
MIPs are an unbiased estimator of overlap:
P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}] = |A ∩ B| / |A ∪ B|
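The example above can be reproduced with a short script. It uses the three linear hash functions shown on the slide (modulus 51) and estimates resemblance as the fraction of positions where two MIPs vectors agree; the second id set and the function names are made up for illustration.

```python
def mips(ids, hash_params, modulus=51):
    """Min-wise vector: the minimum of (a*x + b) mod modulus under each hash."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]


def resemblance(v1, v2):
    """Fraction of positions with equal minima: estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for x, y in zip(v1, v2) if x == y)
    return matches / len(v1)


PARAMS = [(7, 3), (5, 6), (3, 9)]   # h1, h2, hN from the slide
SET_A = [17, 21, 3, 12, 24, 8]      # ids from the slide
SET_B = [8, 17, 40]                 # hypothetical second set

VEC_A = mips(SET_A, PARAMS)
VEC_B = mips(SET_B, PARAMS)
```

`VEC_A` comes out as [8, 9, 9], matching the minima of the three rows on the slide. With only three permutations the estimate is coarse; in practice many more permutations would be used.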
43
Bloom Filter [Bloom 1970]

bit array of size m
k hash functions h_i: docId_space → {1,..,m}
insert n docs by hashing the ids and setting the corresponding bits
document is (probably) in the Bloom filter if the corresponding bits are set
probability of false positives (pfp)
tradeoff: accuracy vs. efficiency
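A minimal Bloom filter sketch follows. The k hash functions are derived from salted SHA-1 digests, which is an implementation choice made here for illustration; the false-positive formula is the standard approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of size m with k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, doc_id):
        # One salted SHA-1 digest per hash function.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{doc_id}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos] = True

    def __contains__(self, doc_id):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(doc_id))


def false_positive_probability(m, k, n):
    """Approximate pfp after inserting n items: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1000, k=3)
for doc in range(100):
    bf.add(doc)
```

With m = 1000, k = 3, and n = 100 the approximation gives a false-positive probability of roughly 1.7%; a larger m lowers it at the cost of a bigger synopsis, which is the accuracy vs. efficiency tradeoff named above.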

Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
[Figure: element X hashed onto a bit array, setting the corresponding bits to 1]
44
Multi-Key Statistics

solves an interesting problem:
a peer with lots of docs on american football and lots of documents about pop music does not have a single document about american music
this cannot be predicted using per-term statistics

Obvious Recall that
45
Multi-Key Statistics in P2P

Motivation:
estimated_quality(a and b) = quality(a) + quality(b), but df_a + df_b != df_(a and b)
Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
Query driven: analyze query logs at directory peers.
Data-driven verification
P_AnnaKournikova ......
P_AndyRodick
P_BerlinMarathon
No additional messages, shorter lists, highly accurate

The whole process can be easily integrated into peer-level P2P IR
additional statistics often not needed
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
46
Single-term vs. multi-term P2P document indexing
Single-term indexing: PEER 1 ... PEER N hold term 1 ... term M (small vocabulary), each with a long posting list (posting list 1 ... posting list M)
Multi-term indexing: PEER 1 holds key 11, key 12, ..., key 1i and PEER N holds key N1, key N2, ..., key Nj (large vocabulary), each with a short posting list
make use of highly discriminative keys
limit the influence of overly long index lists
consider term pairs (triplets, ...) for shorter lists → efficient query processing
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
47
Literature

Overlap Awareness
Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129
Sketches
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.

48
Literature

Multi-key statistics
Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181

49
Outline

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

50
For the IR people ....

Why top-k?
Cannot take a look at all matching documents
E.g., Google provides millions of documents about Britney Spears

Requires ranking (scoring)
In text retrieval, for instance
of course PageRank, if you wish
Remember Part one: local query execution at each peer (peer-index model) AND truly distributed top-k processing in the full document index.
51
For the DB guys ...

Table with schema (id, attribute, value)

SELECT id, aggr(value) FROM table GROUP BY id ORDER BY aggr(value) DESC LIMIT k
52
For the networking guys ...
Network Monitoring
Find clients that cause high network traffic.
53
Computational Model

m lists with (itemId, score)-pairs sorted by score descending.
One list per attribute (e.g. term)
Aggregation function aggr()
Monotonicity is important:
for all items a, b: if s_i(a) ≤ s_i(b) for all lists i, then aggr(a) ≤ aggr(b),
with s_i(x) denoting the score of item x in list i
Goal: return the top-k items w.r.t. their aggregated (overall) scores

54
How to process this?

Most popular: family of threshold algorithms
Fagin, 1999
Nepal/Ramakrishna, 1999
Güntzer/Balke/Kießling, 2001
Basic ideas:
keep an upper and a lower score bound for each document
lowerbound (or worstscore) = sum of scores we have seen so far, assuming 0 for unseen dimensions
upperbound (or bestscore) = lowerbound + highest possible value for unseen dimensions
know what we've got already, know what to expect
stop if no further step can improve the current (i.e. final) ranking

55
Fagin's NRA

NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0
  scan all lists L_i (i = 1..m) in parallel:
    consider item d at position pos_i in L_i
    E(d) := E(d) ∪ {i}
    high_i := s_i(q_i, d)
    worstscore(d) := aggr{s_ν(q_ν, d) | ν ∈ E(d)}
    bestscore(d) := aggr{aggr{s_ν(q_ν, d) | ν ∈ E(d)}, aggr{high_ν | ν ∉ E(d)}}
    if worstscore(d) > min-k then
      remove argmin_d' {worstscore(d') | d' ∈ top-k} from top-k
      add d to top-k
      min-k := min{worstscore(d') | d' ∈ top-k}
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    threshold := max{bestscore(d') | d' ∈ candidates}
    if threshold ≤ min-k then exit

56
Top-k Search
Data items: d1, ..., dn
Score vector of d1: s(t1,d1) = 0.7, ..., s(tm,d1) = 0.2
Query: q = (t1, t2, t3)
k = 1
Index lists (sorted by score):
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1
Scan depth 1, scan depth 2, scan depth 3: STOP!
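A simplified NRA can be sketched in Python as follows. The sketch assumes summation as the aggregation function, scans all lists one depth at a time, and recomputes the bounds of every seen document after each round, which is simpler but less efficient than maintaining a candidate queue. The lists are modeled on the example above; the garbled fourth entry of t2 is left out.

```python
def nra(lists, k):
    """No-random-access top-k over score-sorted lists, aggr = sum.

    lists: one [(doc, score), ...] list per term, sorted by descending score.
    Returns (top-k as [(doc, worstscore)], scan depth at termination).
    """
    m = len(lists)
    seen = {}                              # doc -> {list index: score}
    high = [lst[0][1] for lst in lists]    # last score seen in each list
    top, worst, depth = [], {}, 0
    max_depth = max(len(lst) for lst in lists)
    while depth < max_depth:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0              # list exhausted
        depth += 1
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Stop when neither a candidate nor an unseen doc can still win.
            if sum(high) <= min_k and all(best[d] <= min_k for d in rest):
                break
    return [(d, worst[d]) for d in top], depth


LISTS = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]
```

For k = 1 the scan stops at depth 3 with d10 (score about 2.1) as the winner: at that point no other document has a bestscore above 2.1.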

57
Outline

Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

58
Evolution of a Candidate's Score
Observation: pruning is often overly conservative (deep scans, high memory for priority queue)
[Figure: bestscore(d) and worstscore(d) of a candidate d plotted against scan depth, together with min-k; marks the point at which d is dropped from the candidate queue]

Approximate top-k
What is the probability that d qualifies for the top-k?
59
Safe Thresholding vs. Probabilistic Guarantees

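NRA is based on the invariant (for aggr = sum):
worstscore(d) = Σ_{i ∈ E(d)} s_i(d) ≤ s(d) ≤ worstscore(d) + Σ_{i ∉ E(d)} high_i = bestscore(d)
Relaxed into a probabilistic threshold test: drop d from the candidate queue if
P[s(d) > min-k] < ε
Or equivalently, with δ(d) = min-k − worstscore(d):
P[Σ_{i ∉ E(d)} S_i > δ(d)] < ε, where S_i is the unknown score of d in list i

[Figure: bestscore(d), worstscore(d), min-k, and δ(d) plotted against scan depth]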
60
Expected Result Quality

Missing relevant items
Probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
For each candidate: p_miss ≤ ε
P[recall = r/k] = P[precision = r/k]
E[precision] = E[recall]

61
Outline

Introduction to P2P Systems
Distributed Hashtables Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k

62
Going distributed

Key observations:
Network traffic is crucial
Number of round trips is crucial
Straightforward application of TA/NRA?
expensive: huge number of round trips
even with batching: unpredictable performance

63
Where is the data?
[Figure: query initiator P0 and peers P1, P2, P3 holding the index lists]

Consider
network consumption
per-peer load
latency (query response time): network, I/O, processing

64
Three Phase Uniform Threshold Algorithm (TPUT)
Cao and Wang, PODC 2004
First distributed top-k algorithm with a fixed number of phases!

Exactly 3 phases:
1. fetch the k best entries (d, s_j) from each of P1 ... Pm and aggregate (Σ_{j=1..m} s_j(d)) at the query initiator
2. ask each of P1 ... Pm for all entries with s_j ≥ min-k / m and aggregate the results at the query initiator; min-k is the score of the item currently at rank k
3. fetch missing scores for all candidates by random lookups at P1 ... Pm
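The three phases can be simulated in a single process. In this sketch each peer is a plain dictionary, network calls are replaced by dictionary accesses, scores are assumed non-negative, and the candidate pruning before phase 3 uses the fact that any score not reported in phase 2 is below the threshold; the peer data and the function names are hypothetical.

```python
def kth_largest(values, k):
    """k-th largest value, or 0.0 if fewer than k values exist."""
    ranked = sorted(values, reverse=True)
    return ranked[k - 1] if len(ranked) >= k else 0.0


def tput(peers, k):
    """peers: one {item: score} dict per peer; returns the exact top-k by sum."""
    m = len(peers)

    # Phase 1: every peer reports its k best entries.
    seen = [dict(sorted(p.items(), key=lambda e: (-e[1], e[0]))[:k])
            for p in peers]

    def partial_sums():
        sums = {}
        for scores in seen:
            for item, s in scores.items():
                sums[item] = sums.get(item, 0.0) + s
        return sums

    threshold = kth_largest(partial_sums().values(), k) / m  # min-k / m

    # Phase 2: every peer reports all entries with score >= min-k / m.
    for j, p in enumerate(peers):
        seen[j].update({item: s for item, s in p.items() if s >= threshold})
    sums = partial_sums()
    min_k = kth_largest(sums.values(), k)
    # Unreported scores are below the threshold, which bounds each total.
    candidates = [item for item in sums
                  if sum(sc.get(item, threshold) for sc in seen) >= min_k]

    # Phase 3: random lookups for the missing scores of all candidates.
    exact = {item: sum(p.get(item, 0.0) for p in peers) for item in candidates}
    return sorted(exact.items(), key=lambda e: (-e[1], e[0]))[:k]


PEERS = [
    {"a": 10, "b": 8, "c": 5, "d": 1},
    {"b": 9, "c": 7, "a": 2, "e": 6},
    {"c": 8, "e": 7, "a": 3, "b": 1},
]
```

For k = 2 phase 1 yields min-k = 15, so phase 2 uses the threshold 15 / 3 = 5; items a, b, c, and e remain candidates, and phase 3 confirms c (20) and b (18) as the top-2.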

65
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each cohort holding an index list sorted by score]
66
Analysis of TPUT

Theorem: TPUT is an exact algorithm, i.e. it identifies the true top-k items

Proof (sketch): TPUT cannot miss a true top-k item.
Assume it misses one, i.e. the item is below min-k / m in all lists.
→ overall score < m · (min-k / m) = min-k
→ not a true top-k item!

[Figure: state after phase 2 for list 1, list 2, and list 3, with the min-k score marked]

67
Analysis of TPUT

if min-k / m is small, TPUT retrieves a lot of data in Phase 2
→ high network traffic
random accesses
→ high per-peer load

KLEE [VLDB 05]
Different philosophy: approximate answers
Efficiency:
Reduces (docId, score)-pair transfers
no random accesses at each peer
Two pillars:
The HistogramBlooms structure
The Candidate List Filter structure

68
Additional Data Structures
increase the min-k / m threshold
Equi-width histogram + Bloom filter for each cell + average score per cell + upper/lower score
Usage: during Phase 1, fetch top-k from each list + top-c cells
69
KLEE
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each cohort holding an index list sorted by score]
70
KLEE Candidate Set Reduction
[Figure: coordinator peer P0 holds the current top-k, the candidate set, and the min-k / m threshold; cohort peers Pi and Pj hold their top k entries and index lists; bit vectors such as 010010000100010001 are exchanged between coordinator and cohorts]
71
KLEE Candidate Retrieval
[Figure: coordinator peer P0 holds the current top-k and the candidate set; cohort peers Pi and Pj hold their top k entries and index lists; bit vectors such as 010010000100010001 are exchanged between coordinator and cohorts]
72
Literature

Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648

73
Part II Social Search
75
Motivation

People connected through a network
People create links to other people
Links can express friendship, recommendations,
etc
Different graph structures appear
Sharing interests
Enables users to find others who share common
interests
Similar users can provide relevant content
Users and content spread at different sites
Distributed nature and continuously increasing
size call for peer-to-peer approaches

76
Outline of the Second Part

Link Analysis: The Web as a Graph
PageRank
Distributed Approaches
BlockRank
Local PageRank + ServerRank
Adaptive OPIC
JXP
Identifying common interests: Semantic Overlay Networks
Crespo and Garcia-Molina
pSearch
p2pDating
Social Networks: A new paradigm
What people share
Social graphs
Links, tags, users analysis

77
Links are everywhere

connecting Web pages

78
Links are everywhere

connecting people

Example of a Flickr friends network
79
Links are everywhere

connecting products

80
Link Analysis

The set of nodes/pages (e.g., web pages, people,
products, etc) and the links connecting them
define a graph

81
Link Analysis

In the end we have something like this
Lots of useful information can be obtained from the analysis of such graphs

82
Adjacency Matrix

Matrix representation of graphs
Given a graph G, its adjacency matrix A is n x n and
a_ij = 1 if there is a link from node i to node j
a_ij = 0 otherwise

83
PageRank: Exploring the Wisdom of Crowds

Measures the relative importance of pages on the graph
Importance of a page depends on the importance of the pages that point to it
Random Surfer Model: once on a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1 − α)
PR: probability of being at a certain page after a large enough number of jumps

S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
84
PageRank: Formal Definition

N → total number of pages
PR(p) → PageRank of page p
out(p) → outdegree of p
ε → random jump probability

Can be computed using the power iteration method
In practice, more efficient versions can be used
Google is believed to use it on the Web graph, combined with other metrics, to rank its search results
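The power iteration method can be sketched with the symbols defined above. Since the slide's formula is not preserved in the transcript, the sketch assumes the standard formulation PR(p) = ε/N + (1 − ε) · Σ_{q → p} PR(q)/out(q); spreading the score of pages without outlinks uniformly over all pages is a further assumption, and the example graph is hypothetical.

```python
def pagerank(links, eps=0.15, iterations=100):
    """Power iteration for PageRank.

    links: {page: [pages it links to]}; every link target must be a key.
    eps: random jump probability.
    """
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform starting vector
    for _ in range(iterations):
        nxt = {p: eps / n for p in pages}     # random jump part
        for p in pages:
            out = links[p]
            if out:
                share = (1 - eps) * pr[p] / len(out)
                for q in out:
                    nxt[q] += share
            else:
                # Dangling page: spread its score uniformly (assumption).
                for q in pages:
                    nxt[q] += (1 - eps) * pr[p] / n
        pr = nxt
    return pr


# Hypothetical three-page graph: a <-> b, and c links to a.
GRAPH = {"a": ["b"], "b": ["a"], "c": ["a"]}
SCORES = pagerank(GRAPH)
```

In this graph page c has no inlinks, so its score settles at ε/N = 0.05, while a ranks slightly above b because it additionally receives the score of c.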

85
PageRank: Matrix Notation

A → matrix containing the transition probabilities: A = (1 − ε) P + ε E
where P_ij = 1/out(i) if there is a link from i to j, 0 otherwise; E is the random jumps matrix
Probability distribution vector at time k: x(k) = x(k−1) · A
x(0) is the starting vector
PageRank → stationary distribution of the Markov chain described by A, i.e., the principal (left) eigenvector of A

86
Going Distributed

PageRank in principle needs the whole graph at
one place
Shortcomings:
Not scalable for huge graphs, like the Web
Slow updates: computing PageRank on such a huge graph can take weeks
Not suitable for different network architectures (e.g. P2P)
Distributed approaches, where the graph is partitioned, are clearly needed
Some distributed approaches (more details on the next slides):
Local PageRank + ServerRank (Wang et al.)
BlockRank (Kamvar et al.)
JXP (Parreira et al.)

87
The Block Structure

Most links are among web pages inside the same host
Block structure can be exploited for speeding up and/or distributing the PR computation
[Figure: adjacency matrix with dense blocks for the pages from Host A and the pages from Host B]
88
BlockRank

PageRank in three steps:
1. Computes local PageRanks of pages for each host, by considering only intra-host links
2. Computes the importance of each host, using the local PR values and the inter-host links
3. Combines the previous values to create the starting vector for the standard PR algorithm
Speeds up computation
Step 1 can be parallelized
Still needs the whole matrix for step 3

S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
89
Going Distributed

Local PR + ServerRank
Similar to BlockRank
Local PR: PR computed inside each server using intra-server links
ServerRank: PR computed on the server graph using inter-server links
Server graph does not need to be materialized. Computation is done by exchanging messages among servers
Local PR and ServerRank are combined to approximate the true PR of a page
Values can be further refined by using Local PR info in the ServerRank computation and vice versa.
Server partition can be a limitation

Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.
90
Partition at peer level

In P2P networks, server partition is not suitable

91
Partition at peer level

Every peer crawls Web fragments at its discretion
Peers have only local (incomplete) information
Pages might link to or be linked by pages at other peers
Overlaps between peers' graphs may occur
Peers are a priori unaware of other peers' contents

92
Adaptive OPIC

OPIC: Online Page Importance Computation
Computes the importance of a page on-line, with
few resources
Algorithm
Pages initially receive some cash
Pages are randomly visited
When a page is visited, its cash is distributed
between the pages it points to
The page importance for a given page is computed
using the history of cash of that page

Serge Abiteboul, Mihai Preda, and Gregory Cobena.
Adaptive on-line page importance computation. In
WWW, 2003.
93
Adaptive OPIC

Example:
Small Web of 3 pages
Alice has all the cash to start (importance is independent of the initial state)

[Figure: link graph of the three pages Alice, George, and Bob]
Cash-game history:
Alice received 600 (200+400) = 40%
Bob received 600 (200+100+300) = 40%
George received 300 (200+100) = 20%
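The cash game can be simulated in a few lines. The link graph is not preserved in this transcript, so the sketch assumes a hypothetical graph (Alice links to Bob and George, Bob links to Alice, George links to Bob) whose stationary importance matches the 40% / 40% / 20% split of the example; pages are visited round-robin instead of randomly so that the run is reproducible.

```python
def opic(links, start_cash, visits):
    """Simulate the OPIC cash game.

    links: {page: [pages it points to]} (every page needs at least one link).
    start_cash: {page: initial cash}.
    Returns importance estimates: each page's share of the total cash history.
    """
    cash = {p: start_cash.get(p, 0.0) for p in links}
    history = {p: 0.0 for p in links}
    order = sorted(links)                    # round-robin visiting order
    for step in range(visits):
        page = order[step % len(order)]
        amount = cash[page]
        history[page] += amount              # record the cash seen at the visit
        cash[page] = 0.0
        for target in links[page]:           # distribute cash to the children
            cash[target] += amount / len(links[page])
    total = sum(history.values())
    return {p: history[p] / total for p in links}


# Hypothetical link graph for the three pages of the example.
LINKS = {"Alice": ["Bob", "George"], "Bob": ["Alice"], "George": ["Bob"]}
IMPORTANCE = opic(LINKS, {"Alice": 1.0}, visits=3000)
```

After 3000 visits the estimates are close to 0.4 for Alice, 0.4 for Bob, and 0.2 for George, even though Alice started with all the cash.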
94
Adaptive OPIC

No particular graph partition
No need to store the link matrix
Adapts to the changes on the web graph by
considering only the recent part of the cash
history for each page
Time window [now − T, now]
High number of messages exchanged
Does not handle case where same page is stored at
more than one place

95
The JXP Algorithm

Decentralized algorithm for computing global
authority scores of pages in a P2P Network
Runs locally at every peer
No coordinator, asynchronous
Combines local PageRank computations + meetings between peers
JXP scores converge to the true global PageRank
scores

Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.
96
The JXP Algorithm

World Node
Special node attached to the local graph at every
peer
Compact representation of all other pages in the
network
Special features
All links from local pages to external pages
point to World Node
Links from external pages that point to local
pages (discovered during meetings) are
represented at the World Node
Score and outdegree of these external pages are stored; World Node outgoing links are weighted to reflect the score mass given by the original link
Self-loop link to represent transitions among external pages

[Figure: local graph extended with the World Node W]
97
The JXP Algorithm

Initialization step:
Local graph is extended by adding the world node
PageRank is computed in the extended graph → JXP scores
Main algorithm (for every Pi in the network):
Select Pj to meet
Update world node
Add edges for pages in Pj that point to pages in Pi
If an edge already exists at the world node, the score of the source page is updated by taking the highest of both scores
Compute PageRank → JXP scores

98
The JXP Algorithm
Theorem: In a fair series of JXP meetings, the
JXP scores of all nodes converge to the true
global PR scores
99
Locating parts of the Graph

Finding peers that share common interests
Many applications can benefit from it
Distributed PR
In principle, peers need to send content only to
the peers that contain their successors
Random messages guarantee that those peers will eventually be reached, but part of the messages will be wasted

100
WASTED MEETING!!!! We want to avoid it!!!
101
Locating parts of the Graph

Query answering
Ideal: forward the query only to peers that are more likely to provide good answers to it
Query flooding is very expensive
Hash-based queries are not suitable for
approximate queries

102
Locating parts of the Graph

Locating relevant peers
Increase performance
Reduce traffic load
Idea: group peers according to the semantics of
their content and place them into different
overlay networks

103
Outline of the Second Part

Link Analysis: The Web as a Graph
PageRank
Distributed Approaches
BlockRank
Local PageRank ServerRank
Adaptive OPIC
JXP
Identifying common interests: Semantic Overlay
Networks
Crespo and Garcia-Molina
pSearch
p2pDating
Social Networks: A new paradigm
What people share
Social graphs
Links, tags, users analysis

104
Semantic Overlay Networks

Partition the P2P network into several thematic
networks
Peers with similar or beneficial/complementary
content are clustered together
Queries for a given kind of content are forwarded
only to peers with such content
Flooding in smaller networks with a smaller TTL (or
more results with the same TTL)

105
Overlay Networks: Random vs. Semantic

Random
Peers connect to a small set of random peers
Queries are flooded through the network
Peers with unrelated content receive query
Low performance: high number of messages
Low recall if only a few peers are contacted

Semantic
Peers connect to peers with related content →
cluster of peers
Peers identify the query's topic and forward it only
to the set of peers on that topic
Messages to peers with unrelated content are
avoided
Better performance: smaller number of messages
High recall by asking only few peers
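
The message-count difference between the two overlays can be illustrated with a small flooding simulation. This is a sketch under assumptions: the six-peer topology, the "jazz" topic, and the function name `flood` are made up for illustration and do not come from the slides.

```python
from collections import deque


def flood(neighbors, start, ttl):
    """Flood a query from `start`; return (peers reached, messages sent)."""
    reached = {start}
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue
        for nxt in neighbors[peer]:
            messages += 1                  # every forwarded copy costs a message
            if nxt not in reached:
                reached.add(nxt)
                frontier.append((nxt, remaining - 1))
    return reached, messages


# Hypothetical random overlay; only p1, p4 and p6 hold jazz content.
random_overlay = {
    "p1": ["p2", "p3"], "p2": ["p1", "p4"], "p3": ["p1", "p5"],
    "p4": ["p2", "p6"], "p5": ["p3", "p6"], "p6": ["p4", "p5"],
}
# Semantic overlay for the same topic: the jazz peers link to each other.
jazz_son = {"p1": ["p4", "p6"], "p4": ["p1", "p6"], "p6": ["p1", "p4"]}

reached_random, msgs_random = flood(random_overlay, "p1", ttl=3)
reached_son, msgs_son = flood(jazz_son, "p1", ttl=1)
```

In this toy setup the random overlay needs a TTL of 3 and 10 messages before both other jazz peers are reached, while the semantic overlay reaches them with a TTL of 1 and 2 messages.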

106
When creating SONs

Two main things to consider
Node partitioning
Clustering criteria
Node partitioning: when does a peer belong to
SON A?
When it contains a doc of type A
When it contains more than x docs of type A
Fewer peers per SON → more results sooner
Fewer SONs per peer → fewer connections
Clustering criteria: clustering must provide
Load-balance
Each category has similar number of nodes
Each node belongs to a small number of categories
Easy and accurate way to classify a document
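
The threshold rule for node partitioning can be sketched as below. The function name `assign_sons`, the peer names, and the music categories are illustrative assumptions; setting `x=0` gives the "contains a doc of type A" rule.

```python
from collections import Counter, defaultdict


def assign_sons(peer_docs, x=0):
    """A peer joins SON c if it holds more than x documents of category c."""
    sons = defaultdict(set)
    for peer, categories in peer_docs.items():
        for category, count in Counter(categories).items():
            if count > x:
                sons[category].add(peer)
    return dict(sons)


# Hypothetical peers, each listed with the category of every document it holds.
peer_docs = {
    "p1": ["rock", "rock", "jazz"],
    "p2": ["jazz", "jazz", "jazz"],
    "p3": ["rock", "pop", "pop"],
}
loose = assign_sons(peer_docs, x=0)    # one matching doc is enough
strict = assign_sons(peer_docs, x=1)   # needs at least two matching docs
```

With the stricter threshold every peer ends up in exactly one SON, which matches the trade-off on the slide: fewer peers per SON and fewer SONs per peer.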

107
Crespo and Garcia-Molina

Uses a classification hierarchy to form the
overlay networks
Documents and queries are classified into one or
more concepts
Queries are forwarded to peers in the super/sub
concepts
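
Forwarding along the hierarchy can be sketched as follows, assuming the hierarchy is stored as a child-to-parent map. The concept names and the helpers `ancestors` and `target_sons` are hypothetical; the report may select target SONs differently.

```python
def ancestors(concept, parent):
    """Super-concepts of `concept`, from nearest to root."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def target_sons(concept, parent):
    """SONs to query: the concept itself, its super-concepts, its sub-concepts."""
    subs = [c for c in parent if concept in ancestors(c, parent)]
    return [concept] + ancestors(concept, parent) + sorted(subs)


# Hypothetical classification hierarchy (child -> parent).
parent = {"rock": "music", "jazz": "music", "bebop": "jazz", "swing": "jazz"}
targets = target_sons("jazz", parent)
```

A query classified as `jazz` is sent to the SONs for `jazz`, `music`, `bebop`, and `swing`, but not to the sibling SON `rock`.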

A. Crespo and H. Garcia-Molina. Semantic Overlay
Networks for P2P Systems. Technical report,
Stanford University, January 2003.
108
Crespo and Garcia-Molina

Reported results show a significant improvement
in the number of messages
Music file sharing scenario: to get half the
documents that match a query
SONs: 461 msgs
Gnutella: 1731 msgs
SON links are logical: two peers that are
connected in a SON can actually be many hops
away from each other
The requirement that the hierarchy and the
classification algorithm are shared among all
nodes might be a problem

109
pSearch

Semantic Overlay on top of Content Addressable
Networks (CANs)
Latent Semantic Indexing (LSI) is used to
generate a semantic vector for each document
Semantic vectors are used as keys to store document
indices in the CAN
Indices close in semantics are stored close in
the overlay
Two types of operations
Publish document indices
Process queries

Chunqiang Tang, Zhichen Xu, and Sandhya
Dwarkadas. Peer-to-peer Information Retrieval
Using Self-Organizing Semantic Overlay Networks.
In SIGCOMM, 2003.
110
pSearch: Key Idea
[Figure labels: semantic space, doc]
111
pSearch: Key Idea
[Figure labels: semantic space, doc, query]
112
Background: Content-Addressable Network

Partition Cartesian space into zones
Each zone is assigned to a computer
Neighboring zones are routing neighbors
An object key is a point in the space
Object lookup is done through routing
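
Key-to-zone mapping and lookup by routing can be sketched on a simplified CAN. The uniform grid is an assumption made to keep the example short: a real CAN splits zones dynamically as nodes join and wraps the space into a torus. The names `zone_of` and `route` are hypothetical.

```python
def zone_of(point, grid):
    """Map a point in [0,1)^d to its zone on a grid-by-grid partition."""
    return tuple(min(int(c * grid), grid - 1) for c in point)


def route(start_zone, key_point, grid):
    """Greedy routing: hop to a neighboring zone that is closer to the key."""
    target = zone_of(key_point, grid)
    current = list(start_zone)
    path = [tuple(current)]
    while tuple(current) != target:
        for dim in range(len(current)):
            if current[dim] != target[dim]:
                current[dim] += 1 if target[dim] > current[dim] else -1
                break                      # one hop changes one coordinate
        path.append(tuple(current))
    return path


# Look up the key (0.6, 0.3) starting from the node that owns zone (0, 0).
path = route((0, 0), (0.6, 0.3), grid=4)
```

On a 4 by 4 grid the key falls into zone (2, 1), and the lookup takes three hops through neighboring zones.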

113
Background: Vector Space Model

Term vectors represent documents and queries
Elements correspond to importance of term in
document or query
Statistical computation of vector elements
Term frequency × inverse document frequency (TF-IDF)
Ranking of retrieved documents
Similarity between document vector and query
vector

114
Background: Vector Space Model
A: "books on computer networks"
B: "network routing in P2P networks"
Q: "P2P network"
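
The ranking of A and B for query Q can be worked through with a few lines of code. This is a simplified sketch: it uses raw term frequencies instead of TF-IDF (with only two documents the IDF of "network" would be zero), and the plural stripping in `term_vector` is a crude stand-in for real stemming.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector with a crude plural normalization (assumption)."""
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t
              for t in text.lower().split()]
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


a = term_vector("books on computer networks")
b = term_vector("network routing in P2P networks")
q = term_vector("P2P network")

sim_a = cosine(a, q)   # shares only "network" with the query
sim_b = cosine(b, q)   # shares "p2p" and "network" (twice)
```

Document B scores about 0.80 against about 0.35 for A, so B is ranked first for the query.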
115
Background: Latent Semantic Indexing

The dimension of the document vectors has to match
the dimension of the CAN
Latent Semantic Indexing uses Singular Value
Decomposition (SVD)
Maps a high-dimensional term vector to a
low-dimensional semantic vector
Elements correspond to the importance of an abstract
concept in the document/query
Also helps to overcome the synonym problem (e.g., a
user looks for "car" and does not find a document
about "automobile")

116
Background: Latent Semantic Indexing
[Figure: term-document matrix with document vectors Va and Vb]

SVD: singular value decomposition
Reduce dimensionality
Suppress noise
Discover word semantics
Car <-> Automobile

117
pSearch: Basic Algorithm Steps

Receive a new document A: generate a semantic
vector Va, store the key in the index
Receive a new query Q: generate a semantic vector
Vq, route the query in the overlay
The query is flooded to nodes within a radius r
r is determined by a similarity threshold or the
number of wanted documents
All receiving nodes do a local search and report
references to best matching documents
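
The publish and query steps can be sketched as follows. Several simplifications are assumed here: overlay routing is replaced by a scan for the closest node, Euclidean distance stands in for the similarity measure, semantic vectors are given directly instead of being produced by LSI, and all names (`Node`, `publish`, `search`) are hypothetical.

```python
import math


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


class Node:
    def __init__(self, center):
        self.center = center   # representative point of the node's zone
        self.index = {}        # doc id -> semantic vector


def closest_node(nodes, vector):
    """Stand-in for routing: pick the node whose zone is nearest the vector."""
    return min(nodes, key=lambda n: distance(n.center, vector))


def publish(nodes, doc_id, vector):
    closest_node(nodes, vector).index[doc_id] = vector


def search(nodes, query_vector, radius, k):
    """Flood to nodes within `radius` of the query's node, then merge results."""
    home = closest_node(nodes, query_vector)
    hits = []
    for node in nodes:
        if distance(node.center, home.center) <= radius:
            # local search on each receiving node
            hits.extend((distance(vec, query_vector), doc)
                        for doc, vec in node.index.items())
    return [doc for _, doc in sorted(hits)[:k]]


nodes = [Node(c) for c in [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]]
publish(nodes, "docA", (0.2, 0.3))
publish(nodes, "docB", (0.7, 0.2))
publish(nodes, "docC", (0.8, 0.9))
results = search(nodes, (0.3, 0.2), radius=0.6, k=2)
```

The node holding `docC` lies outside the search radius, so it never receives the query; `docA` and `docB` are returned in order of closeness to the query vector.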

118
pSearch Illustration
119
p2pDating

Start with a randomly connected network
Peers meet other peers they do not know (blind
dates)
If a peer likes another, it will remember it as
a friend.
A remembers B → abstract link A → B
Directed links → preserve peers' autonomy
SONs dynamically evolve from the meeting process

J. X. Parreira et al.: p2pDating: Real Life
Inspired Semantic Overlay Networks for Web
Search. Information Processing & Management 43,
643-664
120
p2pDating

Finding new friends
Random meetings (Blind dates)
Meet friends of friends

[Figure: peer A, its friend B, and B's friends]
If A and B are friends, it is very likely that B's
friends are friends of A as well.
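
The two ways of finding new friends can be sketched as below. This is an illustration under assumptions: collection overlap (Jaccard) is used as the only "liking" measure, the threshold and the mixing probability `p_random` are arbitrary, and the function names are hypothetical.

```python
import random


def jaccard(a, b):
    """Collection overlap as one possible 'liking' measure (assumption)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def pick_candidate(peer, friends, all_peers, rng, p_random=0.5):
    """Choose whom to meet next: a blind date or a friend of a friend."""
    fof = {f2 for f in friends[peer] for f2 in friends[f]} - friends[peer] - {peer}
    if fof and rng.random() >= p_random:
        return rng.choice(sorted(fof))
    return rng.choice([p for p in all_peers if p != peer])


def meet(peer, other, holdings, friends, threshold=0.3):
    """Directed link: `peer` remembers `other` if it likes it."""
    if jaccard(holdings[peer], holdings[other]) >= threshold:
        friends[peer].add(other)


# Hypothetical peers with the ids of the documents they hold.
holdings = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {9}, "D": {3, 4, 5}}
friends = {p: set() for p in holdings}
meet("A", "B", holdings, friends)   # overlap 3/5: A remembers B
meet("A", "C", holdings, friends)   # no overlap: no link
meet("B", "D", holdings, friends)   # overlap 3/4: B remembers D
next_date = pick_candidate("A", friends, list(holdings),
                           random.Random(0), p_random=0.0)
```

With `p_random=0.0` peer A always prefers friends of friends, so its next meeting is with D, the friend of its friend B. The links stay directed: B does not automatically remember A.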
121
Defining Good Friends

Criteria for defining a good friend → combination
of different measures
History: credits for good behavior in the past
Response time, query result precision, etc.
Collection similarity
Collection Overlap
Different ways of estimating the overlap between
two collections
Number of links between peers
Etc
Peers might have more than one list of friends
E.g., according to different criteria

122
Going Social

Before
Only few content producers (e.g., companies,
universities)
Analysis was done using the content itself plus a
few implicit recommendations (links)
Very little information about the content
consumers (mainly through query logs)
Nowadays
New technologies to facilitate content sharing
Content consumers are now also content producers
and content describers (e.g., explicit
recommendations, tags, etc)
More and more crowd wisdom that can be harvested

123
Outline of the Second Part

Link Analysis: The Web as a Graph
PageRank
Distributed Approaches
BlockRank
Local PageRank ServerRank
Adaptive OPIC
JXP
Identifying common interests: Semantic Overlay
Networks
Crespo and Garcia-Molina
pSearch
p2pDating
Social Networks: A new paradigm
What people share
Social graphs
Links, tags, users analysis

125
Social Networks

A social structure made of nodes (which are
generally individuals or organizations) that are
tied by one or more specific types of relations,
such as
values
visions
ideas
friends
conflict
web links
Etc
Social networks have been studied for over a
century

126
Social Network Services

Enable the creation of online social networks for
communities of people who share interests and
activities, or who are interested in exploring
the interests and activities of others
Online communities offer an easy way
for users to publish and share their content.

127
Social Networking Growth

Several social networking sites have experienced
dramatic growth during the past year.

Worldwide Growth of Selected Social Networking
Sites. June 2007 vs. June 2006, Users Age 15+,
Source: comScore
128
What people share
129
Social Networks

Besides sharing content, a user can
describe documents using tags
maintain a list of friends
make comments on other users' content, exchange
opinions, discover users with similar profiles
In contrast to the Web graph, in social graphs users
are part of the model

130
Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip
Bohannon: Challenges in Searching Online
Communities. IEEE Data Eng. Bull. 30(2): 23-31
(2007)
131
Social Graphs

Other models also possible
Directed vs. Undirected edges
Etc.

Standard IR techniques for Web retrieval need to
be adapted to work on social networks; a lot of
current research is dedicated to this area
132
Social Networks

The Wisdom of Crowds: Beyond PR
Spectral analysis of various graphs
E.g., SocialPageRank, FolkRank.
Tag semantic analysis
Discovering semantics from tag co-occurrence
E.g., SocialSimRank
Distributed View
Exploiting social relations to enhance search
E.g., PeerSpective

133
Link Analysis in Social Networks

SocialPageRank
High-quality web pages are usually popularly
annotated, and popular web pages, up-to-date web
users, and hot social annotations can be mutually
enhanced.
Let M_UT, M_TD, M_DU be the matrices corresponding
to the relations Users-Tags, Tags-Docs, Docs-Users
Compute iteratively

S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu:
Optimizing Web Search Using Social Annotation.
WWW 2007
134
Link Analysis in Social Networks

FolkRank
Define graph G as the union of the graphs Users-Tags,
Tags-Docs, Docs-Users
Assume each user has a personal preference vector
Compute iteratively
FolkRank vector of docs is

Andreas Hotho, Robert Jäschke, Christoph Schmitz,
Gerd Stumme: Information Retrieval in
Folksonomies: Search and Ranking. ESWC 2006:
411-426
135
Tag Similarity

SocialSimRank
Idea: Similar annotations (tags) are usually
assigned to similar web pages by users with
common interests.
sim(t1, t2) = aggr { sim(d1, d2) | (t1, d1), (t2, d2) ∈ Tagging }
sim(d1, d2) = aggr { sim(t1, t2) | (t1, d1), (t2, d2) ∈ Tagging }
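
The mutual recursion between tag similarity and document similarity can be sketched as an iteration. This is not the exact SocialSimRank definition from the paper: the aggregation function is assumed to be a damped average, the damping value and iteration count are arbitrary, and the function name and tagging data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def social_simrank(tagging, iterations=5, damping=0.8):
    """Iterate tag and doc similarities over a set of (tag, doc) pairs."""
    docs_of, tags_of = defaultdict(set), defaultdict(set)
    for tag, doc in tagging:
        docs_of[tag].add(doc)
        tags_of[doc].add(tag)
    tags, docs = sorted(docs_of), sorted(tags_of)
    # Start from the identity: every item is only similar to itself.
    sim_t = {(a, b): float(a == b) for a, b in product(tags, tags)}
    sim_d = {(a, b): float(a == b) for a, b in product(docs, docs)}

    def aggregate(x, y, members, sim):
        # Assumed aggregation: damped average over all member pairs.
        pairs = list(product(members[x], members[y]))
        return damping * sum(sim[p] for p in pairs) / len(pairs)

    for _ in range(iterations):
        new_t = {(a, b): 1.0 if a == b else aggregate(a, b, docs_of, sim_d)
                 for a, b in sim_t}
        new_d = {(a, b): 1.0 if a == b else aggregate(a, b, tags_of, sim_t)
                 for a, b in sim_d}
        sim_t, sim_d = new_t, new_d
    return sim_t, sim_d


tagging = {("car", "d1"), ("automobile", "d1"), ("car", "d2"), ("recipe", "d3")}
sim_t, sim_d = social_simrank(tagging)
```

Because "car" and "automobile" annotate the same page, their similarity becomes positive, while "recipe" shares no page with either and stays at zero.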

S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu:
Optimizing Web Search Using Social Annotation.
WWW 2007
136
Exploring friendship connections

PeerSpective: users can query their friends'
viewed pages
HTTP proxies on users' computers index all browsed
content
When a Google search is performed, the query is
also sent to the other proxies in parallel

Alan Mislove, Krishna P. Gummadi, and Peter
Druschel. Exploiting Social Networks for Internet
Search. HotNets, 2006.
137
Social Networks