Finding Replicated Web Collections

About This Presentation

Slides: 29

Provided by: Jungh1

Learn more at: http://oak.cs.ucla.edu

Transcript and Presenter's Notes

1
Finding Replicated Web Collections

Junghoo Cho
Narayanan Shivakumar
Hector Garcia-Molina

2
Replication is common!
3
Statistics (Preview)
More than 48% of pages have copies!
4
Reasons for replication

Actual replication
Simple copying or Mirroring
Apparent replication
Aliases (multiple site names)
Symbolic links
Multiple mount points

5
Challenges

Subgraph isomorphism: NP-complete
Hundreds of millions of pages
Slight differences between copies

6
Outline

Definitions
Web graph, collection
Identical collection
Similar collection
Algorithm
Applications
Results

7
Web graph

Node: web page
Edge: link between pages
Node label: page content (excluding links)
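
A minimal sketch of the web graph as a data structure, assuming page ids are strings. The class name `WebGraph` and its methods are hypothetical names chosen for illustration; they do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class WebGraph:
    # page id -> page content with links stripped (the node label)
    content: dict = field(default_factory=dict)
    # page id -> set of page ids it links to (the outgoing edges)
    links: dict = field(default_factory=dict)

    def add_page(self, pid, text):
        self.content[pid] = text
        self.links.setdefault(pid, set())

    def add_link(self, src, dst):
        # Register unseen endpoints with empty content so every edge
        # always connects two known nodes.
        for pid in (src, dst):
            if pid not in self.content:
                self.add_page(pid, "")
        self.links[src].add(dst)


g = WebGraph()
g.add_page("a/index.html", "welcome")
g.add_page("a/doc.html", "manual")
g.add_link("a/index.html", "a/doc.html")
```

Keeping content and links in separate maps mirrors the split between node labels and edges on the slide.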

8
Identical web collection

Collection: induced subgraph
Identical collection: one-to-one (equi-size)
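
One way to read this definition, sketched under assumptions: two collections are identical if a one-to-one page mapping preserves both content and links. The helper names `induced_edges` and `is_identical` are hypothetical. The sketch only verifies a mapping that is already given; finding such a mapping is the hard part.

```python
def induced_edges(links, pages):
    """Edges of the subgraph induced by `pages`: both endpoints must be inside."""
    pages = set(pages)
    return {(s, d) for s in pages for d in links.get(s, ()) if d in pages}


def is_identical(content, links, mapping):
    """Check that `mapping` (page in A -> page in B) is a one-to-one,
    content-preserving and link-preserving correspondence."""
    a_pages, b_pages = set(mapping), set(mapping.values())
    if len(a_pages) != len(b_pages):  # two pages of A map to the same page of B
        return False
    if any(content[p] != content[q] for p, q in mapping.items()):
        return False
    mapped = {(mapping[s], mapping[d]) for s, d in induced_edges(links, a_pages)}
    return mapped == induced_edges(links, b_pages)


content = {"a1": "home", "a2": "doc", "b1": "home", "b2": "doc"}
links = {"a1": {"a2"}, "b1": {"b2"}}
same = is_identical(content, links, {"a1": "b1", "a2": "b2"})
broken = is_identical(content, {"a1": {"a2"}}, {"a1": "b1", "a2": "b2"})
```

Here `same` is true because the single link a1 -> a2 maps onto b1 -> b2, while `broken` is false because the copy lacks that link.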

9
Collection similarity

Coincides with intuitively similar collections
Computable similarity measure

10
Collection similarity

Page content

?
11
Page content similarity

Fingerprint-based approach (chunking)
Shingles [Broder et al., 1997]
Sentence [Brin et al., 1995]
Word [Shivakumar et al., 1995]
Many interesting issues
Threshold value
Iceberg query
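
The chunking idea can be sketched as follows, assuming word-level shingles of length `k` and a Jaccard-style overlap of fingerprint sets. The function names, the default `k=4`, and the `threshold` value are illustrative assumptions, not values from the slides.

```python
import hashlib


def shingles(text, k=4):
    """All runs of k consecutive words; short texts yield one chunk."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=4):
    """Hash each chunk to a short fixed-size fingerprint."""
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]
            for s in shingles(text, k)}


def resemblance(a, b, k=4):
    """Shared fingerprints divided by all fingerprints of either page."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def similar(a, b, threshold=0.8, k=4):
    # The threshold is a tuning knob (see "Threshold value" above).
    return resemblance(a, b, k) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
score = resemblance(page_a, page_b)
```

Sentence-level or word-level chunking would change only the `shingles` function; the comparison step stays the same.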

12
Collection similarity

Link structure

?
13
Collection similarity

Size

14
Collection similarity
?

Size vs. Cardinality

15
Growth strategy
16
Essential property
Ls: # of pages linked from
Ld: # of pages linked to
17
Essential property
Ls: # of pages linked from
Ld: # of pages linked to
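
A hedged sketch of the check for one pair of page clusters, assuming Ra and Rb in the property stand for the sizes of two clusters of similar pages, Ls counts distinct pages of the first cluster that link into the second, and Ld counts distinct pages of the second cluster that are linked to. The function names are hypothetical.

```python
def link_counts(cluster_a, cluster_b, links):
    """Return (Ls, Ld) for links going from cluster_a into cluster_b."""
    a, b = set(cluster_a), set(cluster_b)
    sources = {s for s in a if links.get(s, set()) & b}
    targets = {d for s in a for d in links.get(s, ()) if d in b}
    return len(sources), len(targets)


def has_essential_property(cluster_a, cluster_b, links):
    """True when cluster sizes and both link counts all agree."""
    ls, ld = link_counts(cluster_a, cluster_b, links)
    return len(set(cluster_a)) == ls == ld == len(set(cluster_b))


index_pages = {"x/index", "y/index"}   # two copies of an index page
doc_pages = {"x/doc", "y/doc"}         # two copies of a document page
full = {"x/index": {"x/doc"}, "y/index": {"y/doc"}}
partial = {"x/index": {"x/doc"}}       # the second copy lost its link
```

With `full`, every copy of the index links to its own copy of the document, so the property holds; with `partial` it fails because Ls drops to 1.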
18
Algorithm

Based on the property we identified
Input: set of pages collected from web
Output: set of similar collections
Complexity: O(n log n)

19
Algorithm

Step 1: Similar page identification (iceberg query)
25 million pages
Fingerprint computation: 44 hours
Replicated page computation: 10 hours
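
A simplified sketch of Step 1, assuming one fingerprint per whole page (whitespace-normalized) rather than the chunk fingerprints described earlier. The iceberg step keeps only fingerprints shared by at least `min_copies` pages. The function name and `min_copies` are hypothetical.

```python
from collections import defaultdict
import hashlib


def replicated_clusters(pages, min_copies=2):
    """pages: page id -> text. Returns clusters of page ids with equal content."""
    groups = defaultdict(list)
    for pid, text in pages.items():
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[key].append(pid)
    # Iceberg step: discard the many groups below the threshold and keep
    # only the few fingerprints that occur often enough.
    clusters = [sorted(pids) for pids in groups.values() if len(pids) >= min_copies]
    return sorted(clusters)


pages = {1: "home page", 2: "home   page", 3: "something else", 4: "home page"}
clusters = replicated_clusters(pages)
```

Each returned cluster can be given an id (an Rid) so that later steps work on (Rid, Pid) rows.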

20
Algorithm

Step 2: link structure check

Tables: Link, R1, R2 (copy of R1)
Pid   Pid
1     2
1     3
2     6
2     10
Group by (R1.Rid, R2.Rid)
Ra = R1.Rid, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = R2.Rid
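
The join and group-by can be sketched in plain Python. This is one reading of the slide, assuming R maps each page id (Pid) to a cluster id (Rid), Link holds (source Pid, destination Pid) rows, and the two counts are taken over distinct source and destination pages. The function name `link_summary` is hypothetical.

```python
from collections import defaultdict


def link_summary(cluster_of, link_table):
    """cluster_of: Pid -> Rid. link_table: iterable of (src Pid, dst Pid).
    Returns one (Ra, Ls, Ld, Rb) tuple per linked pair of clusters."""
    sources = defaultdict(set)
    targets = defaultdict(set)
    for src, dst in link_table:
        # Join: keep only links whose two endpoints both belong to a cluster.
        if src in cluster_of and dst in cluster_of:
            key = (cluster_of[src], cluster_of[dst])
            sources[key].add(src)
            targets[key].add(dst)
    # Group by (R1.Rid, R2.Rid) and count distinct pages on each side.
    return [(ra, len(sources[(ra, rb)]), len(targets[(ra, rb)]), rb)
            for ra, rb in sorted(sources)]


cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
link_table = [(1, 3), (2, 4), (1, 99)]   # page 99 is in no cluster
summary = link_summary(cluster_of, link_table)
```

In this toy input both pages of cluster A link to a distinct page of cluster B, so the single summary row has Ls = Ld = 2.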
21
Algorithm

Step 3
S = {}
For every (Ra, Ls, Ld, Rb) in step 2
  If (Ra = Ls = Ld = Rb)
    S = S U {<Ra, Rb>}
Union-Find(S)
Steps 2-3: 10 hours
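
Step 3 can be sketched with a small union-find, assuming the Ra and Rb in the test stand for cluster sizes, supplied here through a hypothetical `cluster_size` map. Cluster pairs that pass the test are merged, and each resulting group is one candidate collection.

```python
def grow_collections(summaries, cluster_size):
    """summaries: (Ra, Ls, Ld, Rb) tuples. cluster_size: Rid -> number of pages.
    Returns groups of cluster ids, each group forming one collection."""
    parent = {rid: rid for rid in cluster_size}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ra, ls, ld, rb in summaries:
        # Merge only when sizes and both link counts agree.
        if cluster_size[ra] == ls == ld == cluster_size[rb]:
            parent[find(ra)] = find(rb)

    groups = {}
    for rid in parent:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())


sizes = {"A": 2, "B": 2, "C": 2}
rows = [("A", 2, 2, "B"), ("B", 2, 1, "C")]
collections = grow_collections(rows, sizes)
```

Cluster C stays separate in this example because only one of its two pages is linked to from B.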

22
Experiment

25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
=> Total number of pages: 35,000
15,000 random pages
Result: 180 collections
149 good collections
31 problem collections

23
Results
24
Applications

Web crawling & archiving
Save network bandwidth
Save disk storage

25
Application (web crawling)

Before experiment: 48%
With our technique: 13%

[Figure labels: replication info, crawled pages, initial crawl, offline copy detection, second crawl]
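
One way a second crawl could use replication information, sketched under assumptions: copy detection on the first crawl yields a map from each known replica URL to a representative URL, and the second crawl skips URLs whose representative is already scheduled. The function name `prune_frontier`, the `canonical_of` map, and the example URLs are all hypothetical.

```python
def prune_frontier(urls, canonical_of):
    """Keep one URL per known replica group; pass unknown URLs through."""
    seen, kept = set(), []
    for url in urls:
        representative = canonical_of.get(url, url)
        if representative not in seen:
            seen.add(representative)
            kept.append(url)
    return kept


frontier = ["a.example/x", "mirror.example/x", "a.example/y"]
canonical_of = {"mirror.example/x": "a.example/x"}
to_fetch = prune_frontier(frontier, canonical_of)
```

Only the first URL of each replica group is fetched, which is where the bandwidth and storage savings would come from.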
26
Applications (web search)
27
Related work

Collection similarity
AltaVista [Bharat et al., 1999]
Page similarity
COPS [Brin et al., 1995]: sentence
SCAM [Shivakumar et al., 1995]: word
AltaVista [Broder et al., 1997]: shingle
