Finding Replicated Web Collections

Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu

1
Finding Replicated Web Collections
  • Junghoo Cho
  • Narayanan Shivakumar
  • Hector Garcia-Molina

2
Replication is common!
3
Statistics (Preview)
More than 48% of pages have copies!
4
Reasons for replication
  • Actual replication
      • Simple copying or mirroring
  • Apparent replication
      • Aliases (multiple site names)
      • Symbolic links
      • Multiple mount points

5
Challenges
  • Subgraph isomorphism: NP-complete
  • Hundreds of millions of pages
  • Slight differences between copies

6
Outline
  • Definitions
      • Web graph, collection
      • Identical collection
      • Similar collection
  • Algorithm
  • Applications
  • Results

7
Web graph
  • Node: web page
  • Edge: link between pages
  • Node label: page content (excluding links)

8
Identical web collection
  • Collection: induced subgraph
  • Identical collection: one-to-one (equi-size)
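
The two definitions above can be sketched as follows. This is only an illustration, assuming pages are identified by strings and that a candidate page-to-page mapping is already given; the names `induced_subgraph`, `is_identical`, `content`, and `links` are hypothetical and do not come from the presentation.

```python
def induced_subgraph(pages, links):
    """Collection = the chosen pages plus every link between two of them."""
    pages = set(pages)
    return {(s, d) for (s, d) in links if s in pages and d in pages}


def is_identical(coll_a, coll_b, mapping, content, links):
    """Check that `mapping` is a one-to-one map from coll_a onto coll_b
    that preserves page content (node labels) and link structure (edges)."""
    coll_a, coll_b = set(coll_a), set(coll_b)
    if len(coll_a) != len(coll_b):                       # equi-size
        return False
    if set(mapping) != coll_a or set(mapping.values()) != coll_b:
        return False                                     # not one-to-one onto coll_b
    if any(content[p] != content[mapping[p]] for p in coll_a):
        return False                                     # node labels differ
    mapped = {(mapping[s], mapping[d])
              for (s, d) in induced_subgraph(coll_a, links)}
    return mapped == induced_subgraph(coll_b, links)
```

The check only verifies a mapping that is handed to it; finding such a mapping in general is the hard part.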

9
Collection similarity
  • Coincides with intuitively similar collections
  • Computable similarity measure

10
Collection similarity
  • Page content

11
Page content similarity
  • Fingerprint-based approach (chunking)
      • Shingles [Broder et al., 1997]
      • Sentence [Brin et al., 1995]
      • Word [Shivakumar et al., 1995]
  • Many interesting issues
      • Threshold value
      • Iceberg query
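
A minimal sketch of the chunking idea, using word-level shingles as the chunk unit. The chunk size `k=4`, the SHA-1 based 64-bit fingerprint, and the Jaccard overlap used as the similarity score are assumptions made for illustration; the slide does not specify them.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-word chunks (shingles) of a page's text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=4):
    """Hash each chunk to a compact 64-bit fingerprint."""
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }


def similarity(text_a, text_b, k=4):
    """Jaccard overlap of the two fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(text_a, k), fingerprints(text_b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

Two pages would then be treated as copies when `similarity` exceeds a chosen threshold, which is the "threshold value" issue listed above.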

12
Collection similarity
  • Link structure

13
Collection similarity
  • Size

14
Collection similarity
  • Size vs. Cardinality

15
Growth strategy
16
Essential property
Ls: # of pages linked from
Ld: # of pages linked to
[Figure label: Rb]
17
Essential property
Ls: # of pages linked from
Ld: # of pages linked to
18
Algorithm
  • Based on the property we identified
  • Input: set of pages collected from web
  • Output: set of similar collections
  • Complexity: O(n log n)

19
Algorithm
  • Step 1: Similar page identification (iceberg query)
      • 25 million pages
      • Fingerprint computation: 44 hours
      • Replicated page computation: 10 hours
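
Step 1 can be sketched as an iceberg query: group by page pair, count the shared chunk fingerprints, and keep only the pairs above a threshold. The in-memory version below is a toy illustration, not the disk-based procedure needed for 25 million pages; `similar_page_pairs`, its input layout, and the `threshold` parameter are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_page_pairs(page_fingerprints, threshold):
    """page_fingerprints: dict page id -> set of chunk fingerprints.
    Returns {(page_a, page_b): shared_count} for pairs sharing
    at least `threshold` fingerprints."""
    # Inverted index: fingerprint -> pages containing that chunk.
    index = defaultdict(list)
    for page, fps in page_fingerprints.items():
        for fp in fps:
            index[fp].append(page)

    # GROUP BY (page_a, page_b), COUNT(*): shared chunks per pair.
    shared = Counter()
    for pages in index.values():
        for a, b in combinations(sorted(pages), 2):
            shared[(a, b)] += 1

    # HAVING COUNT(*) >= threshold: keep only the tip of the iceberg.
    return {pair: n for pair, n in shared.items() if n >= threshold}
```

Most page pairs share few or no chunks, so only a small fraction of the grouped counts survives the threshold, which is what makes the query an iceberg query.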

20
Algorithm
  • Step 2: link structure check

Tables: Link, R1, R2 (copy of R1)
Pid   Pid
1     2
1     3
2     6
2     10
Group by (R1.Rid, R2.Rid)
Ra = R1, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = R2
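
The group-by above can be sketched in plain Python. This assumes that Step 1 assigns every page a cluster id (Rid), that Ls counts the distinct pages of Ra that link into Rb, and that Ld counts the distinct pages of Rb that are linked from Ra. The function name `link_structure` and the input layout are hypothetical.

```python
from collections import defaultdict


def link_structure(cluster_of, links):
    """cluster_of: dict page id -> cluster id (Rid) from Step 1.
    links: iterable of (source page id, destination page id).
    Returns {(Ra, Rb): (|Ra|, Ls, Ld, |Rb|)} for cluster pairs joined by links."""
    sizes = defaultdict(int)
    for rid in cluster_of.values():
        sizes[rid] += 1

    sources = defaultdict(set)   # (Ra, Rb) -> pages of Ra that link into Rb
    dests = defaultdict(set)     # (Ra, Rb) -> pages of Rb linked from Ra
    for src, dst in links:
        if src not in cluster_of or dst not in cluster_of:
            continue             # link leaves the clustered page set
        key = (cluster_of[src], cluster_of[dst])
        sources[key].add(src)
        dests[key].add(dst)

    return {
        key: (sizes[key[0]], len(sources[key]), len(dests[key]), sizes[key[1]])
        for key in sources
    }
```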
21
Algorithm
  • Step 3
      • S = {}
      • For every (Ra, Ls, Ld, Rb) in step 2
          • If (|Ra| = Ls = Ld = |Rb|)
              • S = S ∪ <Ra, Rb>
      • Union-Find(S)
  • Steps 2-3: 10 hours
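
A minimal sketch of Step 3, assuming the test on this slide reads |Ra| = Ls = Ld = |Rb| and that each input row carries the two cluster sizes alongside Ls and Ld. The function name `merge_clusters` and the row layout are hypothetical.

```python
def merge_clusters(rows):
    """rows: iterable of (Ra, size_a, Ls, Ld, size_b, Rb).
    Returns the merged collections as a list of sets of cluster ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # S = S ∪ <Ra, Rb> for every pair that satisfies the essential property.
    for ra, size_a, ls, ld, size_b, rb in rows:
        if size_a == ls == ld == size_b:
            union(ra, rb)

    # Union-Find(S): clusters connected through S form one collection.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Clusters that never pass the test are left out of the result in this sketch.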

22
Experiment
  • 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
      • => Total number of pages: 35,000
  • 15,000 random pages
  • Result: 180 collections
      • 149 good collections
      • 31 problem collections

23
Results
24
Applications
  • Web crawling & archiving
  • Save network bandwidth
  • Save disk storage

25
Application (web crawling)
  • Before experiment: 48%
  • With our technique: 13%

[Figure labels: initial crawl, crawled pages, offline copy detection, replication info, second crawl]
26
Applications (web search)
27
Related work
  • Collection similarity
      • Altavista [Bharat et al., 1999]
  • Page similarity
      • COPS [Brin et al., 1995]: sentence
      • SCAM [Shivakumar et al., 1995]: word
      • Altavista [Broder et al., 1997]: shingle

28
Summary
  • Computable similarity measure
  • Efficient replication-detection algorithm
  • Application to real-world problems