Title: Finding Related Pages in the World Wide Web
1Finding Related Pages in the World Wide Web
- Authors: Jeffrey Dean, Monika R. Henzinger
- Presented by: Amal Banerjee, Xiang-Yang Alexander Liu
2Outline
- Introduction
- Companion Algorithm
- Cocitation Algorithm
- Performance Comparison with Netscape
- Conclusion
3Introduction
- Another kind of user input: a URL address
- Example
- Input: www.nytimes.com
- Output: www.usatoday.com, www.washingtonpost.com
- Related web pages: pages on the same topic
4Introduction (contd)
- Input: (1) User: the URL of the user's interest
- (2) Connectivity Server: the linkage information about this URL
- Output: a set of related web pages
- Method: linkage analysis
- Objective: (1) high precision (2) high speed
- Solution: (1) Companion Algorithm
- (2) Cocitation Algorithm
5Companion Algorithm
- Step 1: Build the vicinity graph based on user input and linkage information
- Step 2: Near-duplicate elimination
- Step 3: Compute hub and authority scores
- Step 4: Sort and output
6Companion Algorithm (contd): Step 1, Building the vicinity graph (example)
[Figure: vicinity graph with input node u, parents p1, p2, p3, and children c1, c2, c3]
7Companion Algorithm (contd): Step 1, Building the vicinity graph (example)
[Figure: vicinity graph with nodes u; p1, p2, p3; c1, c2, c3; B11, B12, B21, B22, B31, B32; b11, b12, b21, b22, b31, b32]
8Companion Algorithm (cont.): Step 1, Building the vicinity graph
- Number of parents of u: up to 2000
- Number of children of every parent: up to 8
- Many parents with few children each reduces the likelihood of the computation being dominated by a single parent
9Companion Algorithm (cont.): Step 1, Building the vicinity graph (link order)
- Problem: if a parent p of u has more than 8 children, which ones should be selected?
- Observation: links to pages on a similar topic tend to cluster together on a page
- Solution: take the 4 links immediately above and the 4 links immediately below the link from p to u
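The link-order selection above can be sketched in a few lines. This is a minimal illustration, assuming a parent's outgoing links are available as a list in page order; the function name `children_near_link` and the sample URLs are hypothetical.

```python
def children_near_link(links, u, window=4):
    """Return up to `window` links above and `window` links below the link to u.

    `links` is the ordered list of a parent's outgoing links (page order);
    u itself is not part of the result.
    """
    i = links.index(u)  # position of the link from p to u
    above = links[max(0, i - window):i]
    below = links[i + 1:i + 1 + window]
    return above + below


# A hypothetical parent page with 20 outgoing links, where u is "page10".
parent_links = [f"page{k}" for k in range(20)]
print(children_near_link(parent_links, "page10"))
# ['page6', 'page7', 'page8', 'page9', 'page11', 'page12', 'page13', 'page14']
```

When the link to u sits near the top or bottom of the page, the slices simply return fewer than 8 children.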
10Companion Algorithm (cont.): Step 1, Building the vicinity graph
- Stoplist: URLs that (1) are unrelated to most queries
- and (2) have very high in-degree
- 21 URLs, chosen by experimentation
- Most of them are popular search engines and portals
11Companion Algorithm (cont.): Step 1, Building the vicinity graph (pseudocode)
- Build-Vicinity-Graph(URL u, Connectivity Server)
-   S ← {u}; stoplist ← Original-Stoplist, which includes 21 URLs
-   if u is in stoplist: stoplist ← empty set
-   S ← S ∪ {up to P parents of u from the Connectivity Server}; a parent of u must not be in the stoplist
-   for every p // p is a parent of u
-     if number of children of p < Pc: S ← S ∪ {all children of p}
-     else: S ← S ∪ {Pc/2 children above and Pc/2 children below the link to u}
-   S ← S ∪ {up to C children of u from the Connectivity Server}
-   for every c // c is a child of u
-     S ← S ∪ {up to Cp parents of c from the Connectivity Server}
-   return S
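The pseudocode above might translate into Python roughly as follows. This is a sketch under stated assumptions: the Connectivity Server is replaced by two hypothetical in-memory dictionaries (`parents` and `children`, each mapping a URL to an ordered list of URLs), and the defaults for `C` and `Cp` are placeholders, since the slides give values only for P and Pc.

```python
def build_vicinity_nodes(u, parents, children, stoplist, P=2000, Pc=8, C=8, Cp=8):
    """Collect the node set S of the vicinity graph around u.

    parents[x] and children[x] are ordered URL lists (hypothetical stand-ins
    for the Connectivity Server). C and Cp defaults are placeholders.
    """
    if u in stoplist:  # a stoplisted input URL disables the stoplist
        stoplist = set()
    S = {u}
    chosen_parents = [p for p in parents.get(u, []) if p not in stoplist][:P]
    S.update(chosen_parents)
    for p in chosen_parents:
        kids = children.get(p, [])
        if len(kids) < Pc:
            S.update(kids)
        else:
            i = kids.index(u)  # position of the link from p to u
            half = Pc // 2
            S.update(kids[max(0, i - half):i])      # links above the link to u
            S.update(kids[i + 1:i + 1 + half])      # links below the link to u
    chosen_children = children.get(u, [])[:C]
    S.update(chosen_children)
    for c in chosen_children:
        S.update(parents.get(c, [])[:Cp])
    return S


# Toy link data (invented for illustration only).
parents = {"u": ["p1", "p2", "portal"], "c1": ["u", "q1"]}
children = {
    "p1": ["a", "u", "b"],
    "p2": [f"s{k}" for k in range(6)] + ["u"] + [f"t{k}" for k in range(6)],
    "portal": ["u"],
    "u": ["c1"],
}
S = build_vicinity_nodes("u", parents, children, stoplist={"portal"})
print(sorted(S))
```

In this toy data, `portal` is skipped because it is on the stoplist, and `p2` contributes only the four links on each side of its link to `u`.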
12Companion Algorithm (cont.): Step 2, Near-duplicate elimination
- Many pages are duplicated across hosts.
- Example: mirror sites, different aliases for the same page
- Near-duplicate elimination(S)
-   for every two nodes a and b in S
-     if (a and b each have more than 10 links) and (a and b have at least 95% of their links in common)
-       c ← new node with links = a's links ∪ b's links
-       S ← (S − {a, b}) ∪ {c}
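A possible Python reading of the near-duplicate step is sketched below. It assumes each node's links are held in a set, and it interprets "95% of their links in common" as: the shared links cover at least 95% of each node's own links. That interpretation, the function name, and the merged-node naming scheme are assumptions made for illustration.

```python
def merge_near_duplicates(links, min_links=10, threshold=0.95):
    """Merge nodes whose link sets overlap almost completely.

    links maps node -> set of linked URLs. Two nodes are merged when both
    have more than `min_links` links and the shared links make up at least
    `threshold` of each node's links (an assumed reading of the slide).
    """
    links = {n: set(ls) for n, ls in links.items()}
    merged = True
    while merged:  # restart the pair scan after every merge
        merged = False
        nodes = list(links)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                la, lb = links[a], links[b]
                if len(la) <= min_links or len(lb) <= min_links:
                    continue
                common = len(la & lb)
                if common >= threshold * len(la) and common >= threshold * len(lb):
                    # Replace a and b by one node holding the union of links.
                    del links[a], links[b]
                    links[f"{a}|{b}"] = la | lb
                    merged = True
                    break
            if merged:
                break
    return links


base = {f"x{k}" for k in range(20)}
graph = {
    "site": base,
    "mirror": base | {"extra"},            # a mirror with one additional link
    "other": {f"y{k}" for k in range(20)},
}
result = merge_near_duplicates(graph)
print(sorted(result))  # ['other', 'site|mirror']
```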
13Companion Algorithm (cont.): Step 3, Compute hub and authority scores
- Use the weighting scheme of Bharat and Henzinger
- Compute hub and authority scores(S)
-   Initialize all elements of the hub vector H to 1.0
-   Initialize all elements of the authority vector A to 1.0
-   While the vectors H and A have not converged
-     For all nodes n in the vicinity graph N
-       $A[n] = \sum_{(n', n) \in \mathrm{edges}(N)} H[n'] \cdot \mathrm{authority\_weight}(n', n)$
-     For all n in N
-       $H[n] = \sum_{(n, n') \in \mathrm{edges}(N)} A[n'] \cdot \mathrm{hub\_weight}(n, n')$
-     Normalize the H and A vectors
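The iteration in Step 3 can be illustrated with a small Python sketch. It is not the presenters' implementation: edge weights default to 1.0 (plain hubs-and-authorities), and the optional `weights` dictionary, which maps an edge to a hypothetical `(authority_weight, hub_weight)` pair, only marks where a weighting scheme would plug in. Normalization to unit Euclidean length and the convergence tolerance are likewise assumptions.

```python
import math


def hubs_and_authorities(edges, weights=None, max_iter=100, tol=1e-9):
    """Iteratively compute hub (H) and authority (A) scores.

    edges: list of (source, target) pairs.
    weights: optional dict edge -> (authority_weight, hub_weight); if omitted,
    every weight is 1.0 (an assumption, not the weighting from the slides).
    """
    nodes = sorted({n for edge in edges for n in edge})
    H = {n: 1.0 for n in nodes}
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_A = {n: 0.0 for n in nodes}
        for s, t in edges:  # authority of t grows with the hub score of s
            aw = weights[(s, t)][0] if weights else 1.0
            new_A[t] += H[s] * aw
        new_H = {n: 0.0 for n in nodes}
        for s, t in edges:  # hub score of s grows with the authority of t
            hw = weights[(s, t)][1] if weights else 1.0
            new_H[s] += new_A[t] * hw
        for vec in (new_A, new_H):  # normalize to unit Euclidean length
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
        delta = max(abs(new_A[n] - A[n]) + abs(new_H[n] - H[n]) for n in nodes)
        A, H = new_A, new_H
        if delta < tol:
            break
    return H, A


edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
H, A = hubs_and_authorities(edges)
print(max(A, key=A.get), max(H, key=H.get))  # a1 h1
```

In the toy graph, `a1` is pointed to by all three hubs and ends up with the top authority score, while `h1`, which links to both authorities, ends up as the top hub.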
14Cocitation Algorithm
- Observation: related pages are often linked together by other web pages.
- Two nodes are co-cited if they have at least one common parent.
[Figure: co-citation example with nodes p1, p2, p3, u, and S]
15Cocitation Algorithm
- Degree of co-citation: the number of common parents of two nodes
- Idea: look for sibling nodes of u with a high degree of co-citation
16Cocitation Algorithm (cont.)
- Cocitation(URL u, Connectivity Server)
-   ParentSet ← empty; SiblingSet ← empty
-   ParentSet ← ParentSet ∪ {up to P parents of u}
-   for every node p in ParentSet do
-     SiblingSet ← SiblingSet ∪ {up to C children of p}
-   for every node s in SiblingSet: calculate the degree of co-citation of (s, u)
-   Sort the nodes in SiblingSet according to degree of co-citation
-   Output
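A compact Python sketch of the Cocitation procedure follows. It assumes the same hypothetical in-memory `parents` and `children` dictionaries in place of the Connectivity Server, takes simply the first C children of each parent, and counts co-citation only over the sampled parents; the default values of `P`, `C`, and `top` are placeholders.

```python
from collections import Counter


def cocitation(u, parents, children, P=2000, C=8, top=10):
    """Rank siblings of u by their degree of co-citation with u.

    The degree is counted over the sampled parents only. parents/children
    are hypothetical dicts mapping a URL to an ordered list of URLs.
    """
    degree = Counter()
    for p in parents.get(u, [])[:P]:
        # set() ensures a parent contributes at most 1 per sibling
        for s in set(children.get(p, [])[:C]):
            if s != u:
                degree[s] += 1
    return degree.most_common(top)


# Toy link data (invented for illustration only).
parents = {"u": ["p1", "p2", "p3"]}
children = {
    "p1": ["u", "a", "b"],
    "p2": ["a", "u"],
    "p3": ["u", "a", "b", "c"],
}
ranking = cocitation("u", parents, children)
print(ranking)  # [('a', 3), ('b', 2), ('c', 1)]
```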
17Algorithm Implementation
- Connectivity Server: 180 million URLs (nodes)
- AlphaServer with 8 GB RAM, to prevent page faults
- Connection to the Connectivity Server: server code, mmap
18Experimental Setup
- 18 people, at least 2 URLs each
- 59 URLs: get the top 10 answers for each and rate them
- Each page is rated as
- 0: page not valuable/useful
- 1: page valuable/useful
- "-": page inaccessible
19Algorithm Performance Metrics
- Intersection group: the URLs for which all algorithms return at least one answer (37 URLs)
- Non-Netscape group: the URLs for which Netscape did not return any answers (19 URLs)
20Algorithm Performance Metrics (contd)
21Algorithm Performance Metrics (contd)
22Algorithm Performance Metrics (contd)
- Overlap between answers returned by algorithms
23Algorithm Performance Metrics (contd): Sign Test Example
- Sample data set
- 97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4, 95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9
- Null hypothesis: median = 98.5
- Alternative hypothesis: median < 98.5
- X = 2, the number of values larger than 98.5
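To show how the sign test turns X into a significance level, here is a short Python sketch applied to the sample data above. It assumes the common convention of dropping values equal to the hypothesized median (the slide does not say how ties are handled) and computes the one-sided binomial tail probability; the function name is hypothetical.

```python
from math import comb


def sign_test_lower(data, median0):
    """One-sided sign test of H0: median = median0 against H1: median < median0.

    Values equal to median0 are dropped (an assumed tie-handling rule).
    Returns (count above median0, effective sample size, p-value).
    """
    above = sum(1 for x in data if x > median0)
    below = sum(1 for x in data if x < median0)
    n = above + below
    # Under H0 the count above is Binomial(n, 1/2); p-value = P(X <= above).
    p_value = sum(comb(n, k) for k in range(above + 1)) / 2 ** n
    return above, n, p_value


data = [97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4,
        95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9]
x, n, p = sign_test_lower(data, 98.5)
print(x, n, round(p, 4))  # 2 14 0.0065
```

With one tie removed, 2 of the remaining 14 values lie above 98.5, so the p-value is (1 + 14 + 91) / 2^14, roughly 0.0065, and the null hypothesis would be rejected at the usual 5% level.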
24Algorithm Performance Metrics (contd)
- Statistical significance of results
25Algorithm Performance Metrics (contd): Timing Characteristics
- Average running times
- Companion: 109 ms for 50 URLs
- Cocitation: 195 ms for 58 URLs
26Related Works
- Order of links: Chakrabarti et al., "Enhanced Hypertext Categorization Using Hyperlinks"
- Cocitation and other forms of connectivity
- Spertus: if A points to B and C, then B and C are related
- Pitkow and Pirolli: cocitation
27Conclusion and Future Works
- Future work: extend these two algorithms to handle more than one input URL.
- Conclusion: the two algorithms significantly outperform Netscape at finding related web pages.
28Questions?
- This presentation is available at http://www.cs.utexas.edu/alex/datamining-0205.ppt