Efficient Similarity Joins for Near Duplicate Detection - PowerPoint PPT Presentation

1
Efficient Similarity Joins for Near Duplicate
Detection
  • Chuan Xiao
  • The University of New South Wales, Australia
  • Joint Work: Wei Wang (UNSW), Xuemin Lin (UNSW),
    Jeffrey Xu Yu (CUHK)

2
Outline
  • Introduction
  • Algorithms
  • Experiments
  • Conclusion

3
Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to
the former world No. 1 - the man who owns the
record of 14 Grand Slams he wants.
By JAY COHEN, AP Sports Writer Mar 11, 4:23 am
EDT
03/11/2008 11:28 AM
4
Applications
SPAM TEMPLATE: Sir/Madam, We happily announce to
you the draw of the EURO MILLIONS SPANISH LOTTERY
INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on
the 27TH MARCH 2008 in SPAIN. Your company or
your personal e-mail address attached to ticket
number 653-908-321-675 with serial main number
<NUMBER> drew lucky star winning numbers
<NUMBER> which consequently won in the 2ND
category, you have therefore been approved for a
lump sum pay out of 960.000.00 Euros. (NINE
HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS
!!! Sincerely yours, <NAME> <AFFILIATION>
  • For Web search engines
  • Perform focused crawling
  • Increase the quality and diversity of query
    results
  • Identify spams.
  • For Web mining
  • Perform document clustering
  • Find replicate Web collections
  • Detect plagiarism

Q. What are the advantages of RAID5 over
RAID4? A. 1. Several write requests could be
processed in parallel, since the bottleneck of a
unique check disk has been eliminated. 2. Read
requests have a higher level of parallelism.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a dedicated check disk the check
disk never participates in read.
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed
in parallel, since the bottleneck of a single
check disk has been eliminated. 2. Read requests
have a higher level of parallelism on RAID5.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a check disk the check disk never
participates in read.
5
Similarity Join
  • near duplicates: pairs of objects with high
    similarity
  • similarity -> quantitative way -> similarity
    function
  • Given a collection of records, the similarity
    join problem is to find all pairs of records,
    <x, y>, such that sim(x, y) ≥ t
  • Tokenize
  • Each record is a set of tokens from a finite
    universe.
  • Suppose each record is a single text document
  • x = "yes as soon as possible"
  • y = "as soon as possible please"
  • x = {A, B, C, D, E}
  • y = {B, C, D, E, F}

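The tokenization step and the problem statement above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper is hypothetical, and it assumes that repeated words are kept distinct by numbering their occurrences, which is one way to obtain the five-token sets shown for x and y. The brute-force `naive_join` simply checks every pair against the threshold.

```python
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Split on whitespace and tag each word with its occurrence number,
    so the two occurrences of "as" become two different tokens (assumption)."""
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens

def jaccard(x, y):
    """Jaccard similarity of two token sets."""
    return len(x & y) / len(x | y)

def naive_join(records, t):
    """Return all pairs of record ids whose Jaccard similarity is at least t."""
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

docs = {
    "x": tokenize("yes as soon as possible"),
    "y": tokenize("as soon as possible please"),
}
print(len(docs["x"]), len(docs["y"]))   # both records have five tokens
print(naive_join(docs, 0.6))            # x and y share 4 of 6 distinct tokens
```

With these two records the pair is reported for t = 0.6 but not for t = 0.7, since the Jaccard similarity is 4/6.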
6
Similarity Function
  • Common similarity functions
  • Jaccard
  • Cosine
  • Overlap
  • Jaccard can be equivalently converted to Overlap

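The formulas on this slide did not survive in the transcript, so the sketch below uses the standard set-based definitions of the three functions as an assumption. It also shows the conversion from a Jaccard threshold to an overlap threshold, J(x, y) ≥ t ⇔ O(x, y) ≥ t / (1 + t) · (|x| + |y|). The function names are illustrative, and the threshold is passed as a `Fraction` so that the ceiling is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction

def overlap(x, y):
    """Overlap similarity: number of tokens shared by the two sets."""
    return len(x & y)

def jaccard(x, y):
    """Jaccard similarity: shared tokens divided by distinct tokens."""
    return Fraction(len(x & y), len(x | y))

def cosine(x, y):
    """Cosine similarity for sets viewed as binary vectors."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def min_overlap_for_jaccard(t, size_x, size_y):
    """Smallest overlap that two sets of the given sizes need for J >= t."""
    t = Fraction(t)
    return math.ceil(t / (1 + t) * (size_x + size_y))

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(overlap(x, y), jaccard(x, y), cosine(x, y))
print(min_overlap_for_jaccard(Fraction("0.8"), 18, 18))
```

Passing the threshold as `Fraction("0.8")` rather than the float `0.8` is a deliberate choice here: the float is slightly larger than 4/5, which can push the ceiling up by one.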
7
Naïve and Index-Based Algorithms
  • Naïve Algorithm
  • Compare every pair of objects -> O(n^2) time
    complexity
  • Index-based Algorithm [MIR, SIGMOD04]

(Figure: Record Set -> Index Construction (inverted lists) -> Candidate Generation -> Verification -> Result Pairs)
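A minimal sketch of the index-based pipeline, assuming overlap similarity and records given as sets of tokens. The function name `index_join` and the four records are made up for illustration and are not the data from the slides. Each record probes the inverted lists of its tokens to count shared tokens with earlier records, and pairs reaching the threshold are reported.

```python
from collections import defaultdict

def index_join(records, t):
    """Return pairs of record ids whose overlap is at least t."""
    inverted = defaultdict(list)      # token -> ids of records indexed so far
    results = []
    for rid, tokens in records.items():
        shared = defaultdict(int)     # candidate id -> number of shared tokens
        # Candidate generation: probe the inverted list of every token.
        for tok in tokens:
            for other in inverted[tok]:
                shared[other] += 1
        # Verification: the counts are exact overlaps, so compare with t.
        for other, count in shared.items():
            if count >= t:
                results.append((other, rid))
        # Index construction: add the current record to the inverted lists.
        for tok in tokens:
            inverted[tok].append(rid)
    return sorted(results)

# Hypothetical records, chosen only to exercise the code.
records = {
    "w": {"A", "B", "C", "D"},
    "x": {"A", "B", "C", "E"},
    "y": {"B", "C", "D", "F"},
    "z": {"A", "F", "G"},
}
print(index_join(records, 3))
```

Every record that shares even one token with another is touched during probing, which is why very frequent tokens make the candidate set large.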
8
Index-Based Algorithm
  • Index-based Algorithm
  • Example
  • Suppose sim(x, y) = O(x, y) ≥ t = 3
  • Result: <w, x>, <w, y>

Stop words: too many candidate pairs!
9
Prefix Filter [ICDE06, WWW07]
  • Sort the tokens by a global ordering
  • increasing order of document frequency
  • Index the first few tokens (prefix) for each
    record
  • Example
  • suppose sim(x, y) = O(x, y) ≥ t = 4
  • Must share at least one token in prefix to be a
    candidate pair

(Figure: sorted records x and y with their prefixes marked; ubound O(x, y) = 3 < 4!)
10
Prefix Filter [ICDE06, WWW07]
  • O(x, y) ≥ t ⇒ prefix length = |x| - t + 1
  • J(x, y) ≥ t ⇒ O(x, y) ≥ t·|x| ⇒ prefix length =
    ⌊(1 - t)·|x|⌋ + 1
  • Example: suppose sim(x, y) = J(x, y) ≥ t = 0.8
  • w = {C, D, E, F}
  • x = {B, C, D, E, F}
  • y = {A, B, C, D, F}
  • z = {G, A, B, E, F}

Candidate Pairs: <w,x>, <x,y>, <y,z>
Results: <w,x>
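The example can be replayed with a small sketch of prefix filtering for Jaccard. It assumes the token lists are already sorted by the global ordering (G, A, B, C, D, E, F, as the records appear on the slide) and that the prefix length is ⌊(1 - t)·|x|⌋ + 1. The function names are illustrative, and the threshold is a `Fraction` to keep the floor exact.

```python
from collections import defaultdict
from fractions import Fraction
from math import floor

def prefix_length(size, t):
    """Number of leading tokens to index for a Jaccard threshold t."""
    return floor((1 - t) * size) + 1

def prefix_filter_candidates(records, t):
    """Pairs of records that share at least one token in their prefixes."""
    index = defaultdict(list)         # prefix token -> ids of earlier records
    candidates = set()
    for rid, tokens in records.items():
        prefix = tokens[:prefix_length(len(tokens), t)]
        for tok in prefix:
            for other in index[tok]:
                candidates.add((other, rid))
        for tok in prefix:
            index[tok].append(rid)
    return sorted(candidates)

def verify(records, pairs, t):
    """Keep only the candidate pairs whose Jaccard similarity is at least t."""
    kept = []
    for a, b in pairs:
        x, y = set(records[a]), set(records[b])
        if Fraction(len(x & y), len(x | y)) >= t:
            kept.append((a, b))
    return kept

t = Fraction("0.8")
records = {"w": list("CDEF"), "x": list("BCDEF"),
           "y": list("ABCDF"), "z": list("GABEF")}
candidates = prefix_filter_candidates(records, t)
print(candidates)
print(verify(records, candidates, t))
```

Only the prefixes are indexed and probed, so the pairs <w,y>, <w,z>, and <x,z> are never generated, although each of them shares tokens outside the prefixes.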
11
Prefix Positional Information
  • We use prefix filter (All-Pairs [WWW07]) as the
    basic framework
  • Intuition
  • tokens sorted -> rank, or position of tokens
    within a record
  • estimate tighter upper bounds of overlap between
    x and y with positional information
  • Contributions
  • index construction
  • index not only tokens, but their positions in the
    record
  • ⇒ ppjoin algorithm
  • candidate generation
  • probe tokens in suffix, compare the positions in
    the record
  • ⇒ ppjoin+ algorithm

12
Positional Filter within Prefix (ppjoin)
  • Index both tokens and their positions
  • ubound O(x, y) = 1 + min(|x| - p_x, |y| - p_y)

(Figure: records x and y with token positions 1 and 2 marked)
ubound O(x, y) = 1 + min(4, 3) = 4
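The bound can be checked with a one-line function. The record sizes are not visible in the transcript, so the call below assumes two five-token records whose first shared token sits at position 1 in x and position 2 in y (1-based), which reproduces min(4, 3). The function name is illustrative.

```python
def positional_ubound(size_x, pos_x, size_y, pos_y):
    """Upper bound on the overlap once the first shared token is found:
    the shared token itself plus the tokens that can still match after it."""
    return 1 + min(size_x - pos_x, size_y - pos_y)

# Assumed sizes and positions: |x| = |y| = 5, p_x = 1, p_y = 2.
bound = positional_ubound(5, 1, 5, 2)
print(bound)
```

For a Jaccard threshold of 0.8, two five-token records need an overlap of at least 5, so under these assumed sizes a bound of 4 is enough to discard the pair without looking at the remaining tokens.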
13
Positional Filter within Suffix (ppjoin+)
  • probe tokens in suffix, and compare their
    positions
  • suppose sim(x, y) = J(x, y) ≥ t = 0.8
  • |x| = |y| = 18, O(x, y) ≥ 16
  • ubound O(x, y) = 3 + 4 + 1 + 7 = 15
  • 15 < 16

(Figure: x and y split into prefix and suffix; a probing token Q is located in the other record by binary search)
14
Positional Filter within Suffix (ppjoin+)
  • Divide and Conquer
  • ubound at depth 1
  • ubound at depth 2
  • ubound at depth 3
  • probe suffix recursively, until either the
    candidate pair is pruned or max-depth is reached

(Figure: recursive partitioning of the suffixes of x and y, with prefix and suffix marked)
15
Effect of Filters
  • sim(x, y) = J(x, y) ≥ t = 0.8
  • u = {C, D, E, F}
  • v = {B, C, D, E, F}
  • w = {A, B, C, D, F}
  • x = {G, A, B, E, F}
  • y = {A, B, D, E, F}
  • z = {G, A, C, D, E, F}
  • after prefix filter
  • <u,v>, <v,w>, <v,y>, <w,x>, <w,y>, <w,z>, <x,y>,
    <x,z>, <y,z>
  • after ppjoin
  • <u,v>, <w,y>, <w,z>, <x,z>, <y,z>
  • after ppjoin+ (max-depth = 1)
  • <u,v>, <x,z>, <y,z>
  • real result
  • <u,v>
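The "after ppjoin" row can be reproduced with a compact sketch of prefix filtering plus the positional filter. This is a simplified reading of the slides rather than the authors' implementation: it assumes records arrive in order of increasing size with tokens sorted by the global ordering, uses 1-based positions, and omits the suffix filter of ppjoin+. All names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from math import ceil, floor

def ppjoin_candidates(records, t):
    """Candidate pairs that survive the prefix and positional filters."""
    index = defaultdict(list)              # token -> [(record id, position)]
    candidates = []
    for xid, x in records.items():
        matches = {}                       # yid -> prefix matches, None = pruned
        prefix_len = floor((1 - t) * len(x)) + 1
        for i, tok in enumerate(x[:prefix_len], start=1):
            for yid, j in index[tok]:
                seen = matches.get(yid, 0)
                if seen is None:
                    continue               # pair was already pruned
                y = records[yid]
                # Overlap required for J(x, y) >= t.
                alpha = ceil(t / (1 + t) * (len(x) + len(y)))
                # Matches so far, this token, and what can still match.
                ubound = seen + 1 + min(len(x) - i, len(y) - j)
                matches[yid] = seen + 1 if ubound >= alpha else None
            index[tok].append((xid, i))
        candidates += [(yid, xid) for yid, n in matches.items() if n]
    return sorted(candidates)

def verify(records, pairs, t):
    """Keep the pairs whose exact Jaccard similarity is at least t."""
    kept = []
    for a, b in pairs:
        x, y = set(records[a]), set(records[b])
        if Fraction(len(x & y), len(x | y)) >= t:
            kept.append((a, b))
    return kept

t = Fraction("0.8")
records = {"u": list("CDEF"), "v": list("BCDEF"), "w": list("ABCDF"),
           "x": list("GABEF"), "y": list("ABDEF"), "z": list("GACDEF")}
candidates = ppjoin_candidates(records, t)
print(candidates)
print(verify(records, candidates, t))
```

Under these assumptions the positional filter removes <v,w>, <v,y>, <w,x>, and <x,y> from the nine prefix-filter candidates, and verification leaves only <u,v>.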

16
Experiment Settings
  • Algorithms Compared
  • All-Pairs [WWW07]
  • PPJoin
  • PPJoin+
  • Measure
  • Jaccard, Cosine
  • Candidate Size, Running Time
  • Near Duplicate Web Page Detection
  • compare with shingling [SEQS97]

17
Experiment Settings
  • Environment
  • Pentium D 3.00GHz CPU, 2GB RAM
  • Debian 4.1, GCC 4.1.2 with -O3
  • Dataset

18
Experiment Results: DBLP, Jaccard
  • Candidate Pairs
  • Running Time

19
Exp. Results: Near Duplicate Web Page Detection
  • extract q-gram and shingle sets, and perform
    similarity join
  • rs: result from TREC-32shingle; rq: result from
    TREC-4gram
  • Precision = |tp| / |rs|
  • Recall = |tp| / |rq|
  • Results

(Figure: diagram of the result sets rs and rq, with tp marked)
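The two measures can be computed directly from the result sets. The sketch assumes that tp is the set of pairs returned by both joins (the intersection of rs and rq), which is how the labels on the slide read; the function name and the pairs below are made up for illustration and are not experimental data.

```python
def precision_recall(rs, rq):
    """Precision and recall of the shingle result rs measured against the
    q-gram result rq, assuming tp is their intersection."""
    tp = rs & rq
    precision = len(tp) / len(rs) if rs else 0.0
    recall = len(tp) / len(rq) if rq else 0.0
    return precision, recall

# Hypothetical result pairs (document id, document id).
rs = {(1, 2), (3, 4), (5, 6), (7, 8)}
rq = {(1, 2), (3, 4), (9, 10)}
precision, recall = precision_recall(rs, rq)
print(precision, recall)
```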
20
Conclusion
  • Contributions
  • New algorithms for set-similarity joins
  • positional filtering within prefix -> ppjoin
  • positional filtering within suffix -> ppjoin+
  • Features
  • exact
  • outperform existing algorithms
  • integrated with near duplicate Web page detection
    methods
  • Future Work
  • other similarity functions
  • edit-distance
  • top-k similarity search queries

21
Related Work
  • Approximate
  • LSH: A. Gionis, P. Indyk, and R. Motwani.
    Similarity search in high dimensions via hashing.
    In VLDB, 1999.
  • Shingling: A. Z. Broder. On the resemblance and
    containment of documents. In SEQS, 1997.
  • Exact
  • Index-based
  • S. Sarawagi and A. Kirpal. Efficient set joins on
    similarity predicates. In SIGMOD, 2004.
  • Prefix-based
  • S. Chaudhuri, V. Ganti, and R. Kaushik. A
    primitive operator for similarity joins in data
    cleaning. In ICDE, 2006.
  • All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant.
    Scaling up all pairs similarity search. In WWW,
    2007.
  • Pigeon-hole principle based
  • PartEnum: A. Arasu, V. Ganti, and R. Kaushik.
    Efficient exact set-similarity joins. In VLDB,
    2006.

22
  • Thank you!
  • Questions?

23
References
  • [SEQS97] A. Z. Broder. On the resemblance and
    containment of documents. In SEQS, 1997.
  • [MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern
    Information Retrieval. Addison Wesley, 1st
    edition, May 1999.
  • [VLDB99] LSH: A. Gionis, P. Indyk, and R.
    Motwani. Similarity search in high dimensions via
    hashing. In VLDB, 1999.
  • [SIGMOD04] S. Sarawagi and A. Kirpal. Efficient
    set joins on similarity predicates. In SIGMOD,
    2004.
  • [ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik.
    A primitive operator for similarity joins in data
    cleaning. In ICDE, 2006.
  • [VLDB06] PartEnum: A. Arasu, V. Ganti, and R.
    Kaushik. Efficient exact set-similarity joins. In
    VLDB, 2006.
  • [WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R.
    Srikant. Scaling up all pairs similarity search.
    In WWW, 2007.

24
Backup Slides
  • Memory Issues
  • We need twice as much memory as All-Pairs to
    build the index.
  • Space / Time
  • Some techniques to deal with memory
  • Do not build index for widowed tokens (appear
    only once)
  • Records are sorted by increasing size;
    dynamically remove shorter records from inverted
    lists
  • Integrated with RDBMS
  • Prefix filter in RDBMS [ICDE06]
  • Need to implement positional filters in both
    prefix and suffix
  • Q: What if the probing tokens are not found in y?
  • Convert overlap to Hamming distance
  • Estimate the upper bound of Hamming distance
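The conversion between overlap and Hamming distance rests on the identity H(x, y) = |x| + |y| - 2·O(x, y) for sets, so an overlap requirement O(x, y) ≥ α is the same as H(x, y) ≤ |x| + |y| - 2α. The sketch below only checks that identity; the function names are illustrative, and applying the bound to suffixes specifically is left out.

```python
def hamming(x, y):
    """Hamming distance between two sets: size of the symmetric difference."""
    return len(x ^ y)

def max_hamming(size_x, size_y, alpha):
    """Largest Hamming distance compatible with an overlap of at least alpha."""
    return size_x + size_y - 2 * alpha

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(hamming(x, y))                # equals |x| + |y| - 2 * overlap
print(max_hamming(18, 18, 16))      # two 18-token records needing overlap 16
```

Any lower bound on the Hamming distance that exceeds this limit is therefore enough to prune the pair.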