Finding Related Pages in the World Wide Web

Provided by: xiangyanga

Transcript and Presenter's Notes

1
Finding Related Pages in the World Wide Web

Authors: Jeffrey Dean, Monika R. Henzinger

Presented by: Amal Banerjee, Xiang-Yang Alexander Liu
2
Outline

Introduction
Companion Algorithm
Cocitation Algorithm
Performance Comparison with Netscape
Conclusion

3
Introduction

Another kind of user input: a URL address
Example:
Input: www.nytimes.com
Output: www.usatoday.com, www.washingtonpost.com, ...
Related web pages: pages on the same topic

4
Introduction (contd)

Input: (1) User: the URL of the user's interest
(2) Connectivity Server: the linkage information about this URL

Output: a set of related web pages
Method: linkage analysis
Objective: (1) high precision (2) high speed

Solution: (1) Companion Algorithm
(2) Cocitation Algorithm

5
Companion Algorithm

Step 1: Build the vicinity graph based on the user input and linkage information
Step 2: Near-duplicate elimination
Step 3: Compute hub and authority scores
Step 4: Sort and output

6
Companion Algorithm (cont.): Step 1, building the vicinity graph (example)

[Figure: example vicinity graph showing the input URL u, its parents p1, p2, p3, and its children c1, c2, c3]
7
Companion Algorithm (cont.): Step 1, building the vicinity graph (example)

[Figure: the example graph extended with additional nodes. Node labels: u; p1, p2, p3; B11, B12, B21, B22, B31, B32; b11, b12, b21, b22, b31, b32; c1, c2, c3]
8
Companion Algorithm (cont.): Step 1, building the vicinity graph

Number of parents of u included: up to 2000
Number of children included per parent: up to 8
Purpose: reduce the likelihood of the computation being dominated by a single parent

9
Companion Algorithm (cont.): Step 1, building the vicinity graph (link order)

Problem: if a parent p of u has more than 8 children, which ones are selected?

Observation: links to pages on a similar topic tend to cluster together on a page

Solution: take the 4 links immediately above and the 4 links immediately below the link from p to u.
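
The 4-above/4-below selection can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `select_children`, the `window` parameter, and the assumption that a parent's links are available as an ordered list are all illustrative.

```python
def select_children(children, u, window=4):
    """Pick the links nearest to the link to u on a parent page.

    children: URLs the parent links to, in the order they appear on the page
              (assumed representation).
    Returns up to `window` links above and `window` links below the link to u.
    """
    i = children.index(u)                      # position of the link to u
    above = children[max(0, i - window):i]     # nearest links before u
    below = children[i + 1:i + 1 + window]     # nearest links after u
    return above + below
```

When the link to u sits near the top or bottom of the page, fewer than 8 links are returned; the slides do not say whether the window should be shifted in that case.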

10
Companion Algorithm (cont.): Step 1, building the vicinity graph

Stoplist: URLs that (1) are unrelated to most queries
and (2) have very high in-degree
21 URLs, chosen by experimentation
Most of them are popular search engines and portals

11
Companion Algorithm (cont.): Step 1, building the vicinity graph (pseudocode)

Build-Vicinity-Graph(URL u, Connectivity Server)
  S ← {u}
  stoplist ← Original-Stoplist   // the 21 URLs
  if u is in stoplist: stoplist ← empty set
  S ← S ∪ up to P parents of u from the Connectivity Server, skipping any parent in the stoplist
  for every p   // p is a parent of u
    if number of children of p < Pc: S ← S ∪ all children of p
    else: S ← S ∪ Pc/2 children above and Pc/2 children below the link to u
  S ← S ∪ up to C children of u from the Connectivity Server
  for every c   // c is a child of u
    S ← S ∪ up to Cp parents of c from the Connectivity Server
  return S
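
The pseudocode above can be turned into a runnable Python sketch. Everything about the server interface is an assumption: `ToyConnectivityServer` is a hypothetical in-memory stand-in with `parents` and `children` methods. P = 2000 and Pc = 8 come from the earlier slide; the defaults for C and Cp are placeholders, since the slides do not give them.

```python
class ToyConnectivityServer:
    """Hypothetical in-memory stand-in for the Connectivity Server."""

    def __init__(self, links):
        # links: dict mapping a page to the ordered list of pages it links to
        self.links = links

    def children(self, url):
        return list(self.links.get(url, []))

    def parents(self, url):
        return [p for p, outs in self.links.items() if url in outs]


def build_vicinity_graph(u, server, stoplist=frozenset(),
                         P=2000, Pc=8, C=8, Cp=2000):
    """Return the node set S of the vicinity graph of u."""
    if u in stoplist:              # a stoplisted input disables the stoplist
        stoplist = frozenset()
    S = {u}
    # Up to P parents of u, skipping stoplisted pages.
    parents = [p for p in server.parents(u) if p not in stoplist][:P]
    S.update(parents)
    for p in parents:
        kids = server.children(p)
        if len(kids) < Pc:
            S.update(kids)         # few children: take them all
        else:
            i = kids.index(u)      # take Pc/2 links on each side of u
            half = Pc // 2
            S.update(kids[max(0, i - half):i])
            S.update(kids[i + 1:i + 1 + half])
    # Up to C children of u, and up to Cp parents of each such child.
    children = server.children(u)[:C]
    S.update(children)
    for c in children:
        S.update(server.parents(c)[:Cp])
    return S
```

As in the pseudocode, the sketch returns only the node set; the edges among those nodes would still have to be fetched before the later steps.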

12
Companion Algorithm (cont.): Step 2, near-duplicate elimination

Many pages are duplicated across hosts.
Examples: mirror sites, different aliases for the same page
Near-duplicate elimination(S)
  for every two nodes a and b in S
    if (a and b each have more than 10 links) and (a and b have at least 95% of their links in common)
      c ← new node with links(a) ∪ links(b)
      S ← (S − {a, b}) ∪ {c}

13
Companion Algorithm (cont.): Step 3, compute hub and authority scores

Uses the edge-weighting scheme of Bharat and Henzinger
Compute hub and authority scores(S)
  Initialize all elements of the hub vector H to 1.0
  Initialize all elements of the authority vector A to 1.0
  While the vectors H and A have not converged
    For all nodes n in the vicinity graph N
      A[n] ← Σ over edges (n', n) in N of H[n'] × authority_weight(n', n)
    For all n in N
      H[n] ← Σ over edges (n, n') in N of A[n'] × hub_weight(n, n')
    Normalize the H and A vectors
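
The iteration can be sketched in Python as below. This is an illustration under stated assumptions, not the presenters' implementation: edge weights are passed in as optional dictionaries and default to 1.0 (the Bharat and Henzinger weights are not computed here), the vectors are normalized to unit Euclidean length, and the `tol` and `max_iter` stopping parameters are invented for the sketch.

```python
import math


def _normalize(vec):
    """Scale a score dict in place to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm > 0:
        for n in vec:
            vec[n] /= norm


def hub_authority_scores(nodes, edges, authority_weight=None, hub_weight=None,
                         tol=1e-9, max_iter=1000):
    """Weighted hub/authority iteration on a directed graph.

    nodes: iterable of node names; edges: list of (source, target) pairs.
    authority_weight / hub_weight: optional dicts keyed by edge; a missing
    edge gets weight 1.0 (assumption; the real scheme is not modelled).
    """
    nodes = list(nodes)
    authority_weight = authority_weight or {}
    hub_weight = hub_weight or {}
    H = {n: 1.0 for n in nodes}
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Authority of n: weighted sum of hub scores of pages linking to n.
        new_A = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_A[dst] += H[src] * authority_weight.get((src, dst), 1.0)
        # Hub of n: weighted sum of authority scores of pages n links to.
        new_H = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_H[src] += new_A[dst] * hub_weight.get((src, dst), 1.0)
        _normalize(new_A)
        _normalize(new_H)
        delta = max((abs(new_A[n] - A[n]) + abs(new_H[n] - H[n])
                     for n in nodes), default=0.0)
        A, H = new_A, new_H
        if delta < tol:             # vectors have stopped changing
            break
    return H, A
```

Step 4 then amounts to sorting the nodes by their authority score and reporting the highest-scoring ones other than u.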

14
Cocitation Algorithm

Observation: related pages are often linked to by the same web pages.

Two nodes are co-cited if they have at least one common parent.
[Figure: co-citation example with nodes p1, p2, p3, u and S]
15
Cocitation Algorithm

Degree of co-citation: the number of common parents of two nodes
Idea: look for sibling nodes of u with a high degree of co-citation

16
Cocitation Algorithm (cont.)

Cocitation(URL u, Connectivity Server)
  ParentSet ← empty; SiblingSet ← empty
  ParentSet ← ParentSet ∪ up to P parents of u
  for every node p in ParentSet
    SiblingSet ← SiblingSet ∪ up to C children of p
  for every node s in SiblingSet
    calculate the degree of co-citation of (s, u)
  Sort the nodes in SiblingSet by degree of co-citation
  Output the nodes in sorted order
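
A compact Python sketch of the Cocitation procedure is given below. The two lookup callables `parents_of` and `children_of` are hypothetical stand-ins for Connectivity Server queries, and the defaults for `P`, `C`, and `top` are assumptions. The degree of co-citation is counted only over the sampled parents.

```python
from collections import Counter


def cocitation(u, parents_of, children_of, P=2000, C=8, top=10):
    """Rank the siblings of u by degree of co-citation.

    parents_of(url) / children_of(url): callables returning lists of URLs
    (hypothetical stand-ins for Connectivity Server lookups).
    Returns (sibling, degree) pairs, highest degree first.
    """
    degree = Counter()
    for p in parents_of(u)[:P]:                 # up to P parents of u
        for s in set(children_of(p)[:C]):       # up to C children of p
            if s != u:
                degree[s] += 1                  # p is a common parent of s and u
    return degree.most_common(top)


# Toy link data for illustration only.
toy_links = {
    "p1": ["u", "s1", "s2"],
    "p2": ["s1", "u", "s2"],
    "p3": ["u", "s1", "s3"],
}
ranked = cocitation(
    "u",
    parents_of=lambda url: [p for p, outs in toy_links.items() if url in outs],
    children_of=lambda p: toy_links.get(p, []),
)
```

In the toy data, s1 shares three parents with u, s2 shares two, and s3 shares one, so they are ranked in that order.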

17
Algorithm Implementation

Connectivity Server: 180 million URLs (nodes)
AlphaServer with 8 GB RAM, to prevent page faults
Connection to the Connectivity Server: server code, mmap

18
Experimental Setup

18 people, at least 2 URLs each
59 URLs: get the top 10 answers for each and rate them
Each page is rated as:
0: page not valuable/useful
1: page valuable/useful
-: page inaccessible

19
Algorithm Performance Metrics

Intersection Group: URLs for which all algorithms return at least one answer (37 URLs)
Non-Netscape Group: URLs for which Netscape did not return any answers (19 URLs)

20
Algorithm Performance Metrics (contd)
21
Algorithm Performance Metrics (contd)
22
Algorithm Performance Metrics (contd)

Overlap between answers returned by algorithms

23
Algorithm Performance Metrics (cont.): Sign Test Example

Sample data set:
97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4,
95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9
Null hypothesis: median = 98.5
Alternative hypothesis: median < 98.5
X = 2, the number of values larger than 98.5
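
The sign test on this sample can be checked with a short Python sketch. The slide does not say how ties are handled; the sketch assumes the conventional treatment of discarding values equal to the hypothesized median, and the function name `sign_test_lower` is illustrative.

```python
from math import comb


def sign_test_lower(data, median0):
    """One-sided sign test of H0: median = median0 against H1: median < median0.

    Returns (X, n, p_value): X counts values above median0, n counts the
    values that are not tied with median0 (ties dropped by assumption).
    """
    above = sum(1 for x in data if x > median0)
    below = sum(1 for x in data if x < median0)
    n = above + below
    # Under H0, X ~ Binomial(n, 1/2); a small X supports H1.
    p_value = sum(comb(n, k) for k in range(above + 1)) / 2 ** n
    return above, n, p_value


sample = [97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4,
          95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9]
X, n, p_value = sign_test_lower(sample, 98.5)
```

Here X = 2 and, with the single tied value 98.5 dropped, n = 14. The p-value is the probability of seeing 2 or fewer values above the median in 14 fair coin flips, which is 106/16384, well below 0.05.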

24
Algorithm Performance Metrics (contd)

Statistical significance of results

25
Algorithm Performance Metrics (cont.): Timing Characteristics

Average running times:
Companion: 109 ms for 50 URLs
Cocitation: 195 ms for 58 URLs

26
Related Works

Order of links: Chakrabarti et al., "Enhanced Hypertext Categorization Using Hyperlinks"
Cocitation and other forms of connectivity:
Spertus: if A points to B and C, then B and C are related
Pitkow and Pirolli: cocitation

27
Conclusion and Future Works

Future work: extend these two algorithms to handle more than one input URL.
Conclusion: the two algorithms significantly outperform Netscape at finding related web pages.

28
Questions?