1
Efficient Similarity Joins for Near Duplicate
Detection

Chuan Xiao
The University of New South Wales, Australia
Joint Work: Wei Wang (UNSW), Xuemin Lin (UNSW),
Jeffrey Xu Yu (CUHK)

2
Outline

Introduction
Algorithms
Experiments
Conclusion

3
Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to
the former world No. 1 - the man who owns the
record of 14 Grand Slams he wants.
By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am
EDT
03/11/2008 11:28 AM
4
Applications
SPAM TEMPLATE: Sir/Madam, We happily announce to
you the draw of the EURO MILLIONS SPANISH LOTTERY
INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on
the 27TH MARCH 2008 in SPAIN. Your company or
your personal e-mail address attached to ticket
number 653-908-321-675 with serial main number
<NUMBER> drew lucky star winning numbers
<NUMBER> which consequently won in the 2ND
category, you have therefore been approved for a
lump sum pay out of 960.000.00 Euros. (NINE
HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS
!!! Sincerely yours, <NAME> <AFFILIATION>

For Web search engines
Perform focused crawling
Increase the quality and diversity of query
results
Identify spam
For Web mining
Perform document clustering
Find replicate Web collections
Detect plagiarism

Q. What are the advantages of RAID5 over
RAID4? A. 1. Several write requests could be
processed in parallel, since the bottleneck of a
unique check disk has been eliminated. 2. Read
requests have a higher level of parallelism.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a dedicated check disk the check
disk never participates in read.
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed
in parallel, since the bottleneck of a single
check disk has been eliminated. 2. Read requests
have a higher level of parallelism on RAID5.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a check disk the check disk never
participates in read.
5
Similarity Join

near duplicates: pairs of objects with high
similarity
similarity -> quantitative way -> similarity
function
Given a collection of records, the similarity
join problem is to find all pairs of records,
<x,y>, such that sim(x,y) ≥ t
Tokenize
Each record is a set of tokens from a finite
universe.
Suppose each record is a single text document
x = "yes as soon as possible"
y = "as soon as possible please"
x = {A, B, C, D, E}
y = {B, C, D, E, F}
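
The tokenization step can be sketched in Python as follows. This is a minimal illustration, assuming whitespace tokenization; numbering repeated words so that each record remains a set is an assumption made here, and the function name `tokenize` is hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Turn a text record into a set of (word, occurrence) tokens.

    Numbering repeated words is an assumption made for this sketch so that
    duplicates, such as the two "as" in the example, stay distinct.
    """
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens

x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
shared = x & y  # tokens that the two records have in common
```

Under these assumptions both records contain five tokens and share four of them, which mirrors the letter encoding of x and y on the slide.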

6
Similarity Function

Common similarity functions
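Jaccard: J(x,y) = |x ∩ y| / |x ∪ y|
Cosine: C(x,y) = |x ∩ y| / √(|x|·|y|)
Overlap: O(x,y) = |x ∩ y|
Jaccard can be equivalently converted to Overlap:
J(x,y) ≥ t ⇔ O(x,y) ≥ t/(1+t) · (|x| + |y|)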
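
A short Python sketch of the three similarity functions on token sets, together with the Jaccard-to-Overlap conversion. It uses the standard set-based definitions and assumes non-empty records; the function names are hypothetical, and the threshold is passed as an exact `Fraction` to avoid floating-point rounding in the ceiling.

```python
import math
from fractions import Fraction

def overlap(x, y):
    return len(x & y)

def jaccard(x, y):
    return Fraction(len(x & y), len(x | y))

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def min_overlap_for_jaccard(t, size_x, size_y):
    """Smallest integer overlap that can satisfy J(x, y) >= t.

    From J = O / (|x| + |y| - O) >= t it follows that
    O >= t / (1 + t) * (|x| + |y|).
    """
    return math.ceil(t / (1 + t) * (size_x + size_y))

x = set("ABCDE")
y = set("BCDEF")
needed = min_overlap_for_jaccard(Fraction(4, 5), len(x), len(y))
```

For the two example records the overlap is 4 and the Jaccard similarity is 2/3, while a Jaccard threshold of 0.8 would need an overlap of at least 5, so this pair would not qualify.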

7
Naïve and Index-Based Algorithms

Naïve Algorithm
Compare every pair of objects -> O(n^2) time
complexity
Index-based Algorithm [MIR, SIGMOD04]

[Figure: Record Set -> Index Construction
(inverted lists) -> Candidate Generation ->
Verification -> Result Pairs]
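
The two approaches can be contrasted with a small Python sketch. It assumes the Overlap similarity with an integer threshold and uses made-up records; the function names are hypothetical, and the index-based version is a simplified illustration of the pipeline rather than the cited implementation.

```python
from collections import defaultdict
from itertools import combinations

def naive_join(records, t):
    """Compare every pair of records: O(n^2) overlap computations."""
    return {(a, b) for a, b in combinations(sorted(records), 2)
            if len(records[a] & records[b]) >= t}

def index_join(records, t):
    """Inverted-list join: only pairs sharing at least one token are touched."""
    inverted = defaultdict(list)   # token -> ids of records indexed so far
    counts = defaultdict(int)      # (earlier id, later id) -> shared tokens
    for rid in sorted(records):
        for token in records[rid]:
            for other in inverted[token]:
                counts[(other, rid)] += 1   # candidate generation
            inverted[token].append(rid)     # index construction
    # Verification: the accumulated count is the exact overlap of the pair.
    return {pair for pair, c in counts.items() if c >= t}

# Made-up records for illustration only.
records = {"w": set("ABCD"), "x": set("ABCE"), "y": set("BCDF"), "z": set("AEFG")}
result = index_join(records, 3)
```

Both functions return the same pairs; the index-based version simply never looks at pairs that share no token.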
8
Index-Based Algorithm

Index-based Algorithm
Example
Suppose sim(x,y) = O(x,y) ≥ t = 3
Result: <w,x>, <w,y>

stop words -> too many candidate pairs!
9
Prefix Filter [ICDE06, WWW07]

Sort the tokens by a global ordering
increasing order of document frequency
Index the first few tokens (prefix) for each
record
Example
suppose sim(x,y) = O(x,y) ≥ t = 4
[Figure: sorted records x and y with their
prefixes marked]
Must share at least one token in prefix to be a
candidate pair
If the prefixes share no token:
ubound O(x,y) = 3 < 4!
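
A minimal Python sketch of the global ordering and the prefix for an Overlap threshold. The records are made up for illustration, the function names are hypothetical, and breaking document-frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def global_order(records):
    """Rank tokens by increasing document frequency (ties broken by token)."""
    df = Counter(tok for rec in records.values() for tok in rec)
    ordered = sorted(df, key=lambda tok: (df[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}

def overlap_prefix(record, rank, t):
    """First |x| - t + 1 tokens of the record under the global ordering."""
    ordered = sorted(record, key=rank.__getitem__)
    return ordered[:max(len(record) - t + 1, 0)]

# Made-up records for illustration only.
records = {"x": set("BCDEF"), "y": set("ACDEF"), "z": set("BCDEG")}
rank = global_order(records)
prefixes = {rid: overlap_prefix(rec, rank, 4) for rid, rec in records.items()}
```

With t = 4, the prefixes of y and z share no token, so that pair is never generated as a candidate; its actual overlap is 3, below the threshold, so nothing is lost.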
10
Prefix Filter [ICDE06, WWW07]

O(x,y) ≥ t ⇒ prefix length = |x| − t + 1
J(x,y) ≥ t ⇒ O(x,y) ≥ t·|x| ⇒ prefix length =
⌊(1 − t)·|x|⌋ + 1
Example: suppose sim(x,y) = J(x,y) ≥ t = 0.8
w = {C, D, E, F}
x = {B, C, D, E, F}
y = {A, B, C, D, F}
z = {G, A, B, E, F}

Candidate Pairs: <w,x>, <x,y>, <y,z>
Results: <w,x>
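
The example on this slide can be replayed with a small Python sketch. It assumes the global ordering G, A, B, C, D, E, F (inferred from the order in which the slide lists each record's tokens) and computes the prefix length as |x| − ⌈t·|x|⌉ + 1, which equals ⌊(1 − t)·|x|⌋ + 1. The names `ORDER`, `prefix`, and `prefix_filter_join` are hypothetical.

```python
import math
from collections import defaultdict
from fractions import Fraction

ORDER = "GABCDEF"   # assumed global ordering, inferred from the slide
T = Fraction(4, 5)  # Jaccard threshold t = 0.8, kept exact to avoid float rounding

records = {"w": "CDEF", "x": "BCDEF", "y": "ABCDF", "z": "GABEF"}

def prefix(tokens, t=T):
    ordered = sorted(tokens, key=ORDER.index)
    return ordered[:len(ordered) - math.ceil(t * len(ordered)) + 1]

def prefix_filter_join(records, t=T):
    index = defaultdict(list)   # token -> ids of records whose prefix contains it
    candidates, results = set(), set()
    for rid in sorted(records, key=lambda r: (len(records[r]), r)):
        for tok in prefix(records[rid], t):
            candidates.update((other, rid) for other in index[tok])
            index[tok].append(rid)
    for a, b in candidates:     # verification with the exact Jaccard value
        sa, sb = set(records[a]), set(records[b])
        if Fraction(len(sa & sb), len(sa | sb)) >= t:
            results.add((a, b))
    return candidates, results

candidates, results = prefix_filter_join(records)
```

Under these assumptions the sketch produces the same three candidate pairs and the single result pair listed on the slide.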
11
Prefix Positional Information

We use prefix filter (All-Pairs [WWW07]) as basic
framework
Intuition
tokens sorted -> rank, or position of tokens
within a record
estimate tighter upper bounds of overlap between
x and y with positional information
Contributions
index construction
index not only tokens, but their positions in the
record
⇒ ppjoin algorithm
candidate generation
probe tokens in suffix, compare the positions in
the record
⇒ ppjoin+ algorithm

12
Positional Filter within Prefix (ppjoin)

Index both tokens and their positions
[Figure: records x and y with token positions
(1, 2, ...) marked]
ubound O(x,y) = 1 + min(|x| − px, |y| − py)
ubound O(x,y) = 1 + min(4, 3) = 4
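
The positional upper bound can be written as a one-line Python function. In this sketch the record sizes (5 and 5) and positions (1 and 2) are chosen as an assumption so that the call reproduces the min(4, 3) case; the `overlap_so_far` parameter, for shared tokens already counted earlier in the prefixes, is an addition of this sketch, and all names are hypothetical.

```python
import math
from fractions import Fraction

def positional_upper_bound(size_x, pos_x, size_y, pos_y, overlap_so_far=0):
    """Upper bound on O(x, y) when a shared token sits at 1-based positions
    pos_x in x and pos_y in y, both records sorted by the global ordering."""
    return overlap_so_far + 1 + min(size_x - pos_x, size_y - pos_y)

t = Fraction(4, 5)                              # Jaccard threshold 0.8
required = math.ceil(t / (1 + t) * (5 + 5))     # minimum overlap for sizes 5 and 5
bound = positional_upper_bound(5, 1, 5, 2)
can_prune = bound < required
```

Here the bound is 4 while a Jaccard threshold of 0.8 needs an overlap of 5, so such a pair can be discarded without verification.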
13
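Positional Filter within Suffix (ppjoin+)

probe tokens in suffix, and compare their
positions
suppose sim(x,y) = J(x,y) ≥ t = 0.8
|x| = |y| = 18, O(x,y) ≥ 16
[Figure: x and y split into prefix and suffix; a
probing token (Q) in the suffix is located in the
other record by binary search]
ubound O(x,y) = 3 + 4 + 1 + 7 = 15 < 16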
14
Positional Filter within Suffix (ppjoin+)

Divide and Conquer
probe suffix recursively, until either candidate
pair is pruned, or reach max-depth

[Figure: x and y split into prefix and suffix,
with the suffixes partitioned recursively and
upper bounds ubound_dep1, ubound_dep2,
ubound_dep3 computed at depths 1 to 3]
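
The recursive probing can be illustrated with a simplified Python sketch. It bounds the overlap of two sorted suffixes by splitting both around a probing token, assuming tokens are represented by their integer ranks in the global ordering. It is an overlap-based illustration of the divide-and-conquer idea, not the exact ppjoin+ procedure; the function name and the `max_depth` parameter are hypothetical.

```python
from bisect import bisect_left

def suffix_overlap_ubound(xs, ys, depth=1, max_depth=2):
    """Upper bound on |xs ∩ ys| for two sorted lists of distinct token ranks."""
    if not xs or not ys:
        return 0
    if depth > max_depth:
        return min(len(xs), len(ys))      # no more splitting: trivial bound
    mid = len(ys) // 2
    w = ys[mid]                           # probing token taken from ys
    pos = bisect_left(xs, w)              # binary search for w in xs
    found = pos < len(xs) and xs[pos] == w
    xl, xr = xs[:pos], xs[pos + (1 if found else 0):]
    yl, yr = ys[:mid], ys[mid + 1:]
    # Tokens left of w can only match tokens left of w, and likewise on the right.
    return (suffix_overlap_ubound(xl, yl, depth + 1, max_depth)
            + (1 if found else 0)
            + suffix_overlap_ubound(xr, yr, depth + 1, max_depth))

# Made-up suffixes with no common token.
xs = [1, 2, 3, 10, 11, 12]
ys = [4, 5, 6, 7, 8, 9]
bounds = [suffix_overlap_ubound(xs, ys, max_depth=d) for d in (0, 1, 2)]
```

On this made-up input the bound tightens as the depth grows, which is why a pair that survives at a shallow depth may still be pruned deeper in the recursion.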
15
Effect of Filters

sim(x,y) = J(x,y) ≥ t = 0.8
u = {C, D, E, F}
v = {B, C, D, E, F}
w = {A, B, C, D, F}
x = {G, A, B, E, F}
y = {A, B, D, E, F}
z = {G, A, C, D, E, F}
after prefix filter
<u,v>, <v,w>, <v,y>, <w,x>, <w,y>, <w,z>, <x,y>,
<x,z>, <y,z>
after ppjoin
<u,v>, <w,y>, <w,z>, <x,z>, <y,z>
after ppjoin+ (max-depth = 1)
<u,v>, <x,z>, <y,z>
real result
<u,v>
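
The first two candidate lists on this slide can be replayed with a Python sketch of prefix filtering with an optional positional filter. It assumes the global ordering G, A, B, C, D, E, F (inferred from the order in which the slide lists each record's tokens); the function name `candidates` and the way pruned pairs are tracked are assumptions of this sketch, and the suffix filter of ppjoin+ is not included.

```python
import math
from collections import defaultdict
from fractions import Fraction

ORDER = "GABCDEF"   # assumed global ordering, inferred from the slide
T = Fraction(4, 5)  # Jaccard threshold 0.8 as an exact fraction

records = {"u": "CDEF", "v": "BCDEF", "w": "ABCDF",
           "x": "GABEF", "y": "ABDEF", "z": "GACDEF"}

def candidates(records, t=T, positional=False):
    index = defaultdict(list)   # token -> [(record id, 1-based position)]
    kept = set()
    for rid in sorted(records, key=lambda r: (len(records[r]), r)):
        x = sorted(records[rid], key=ORDER.index)
        seen, pruned = {}, set()   # overlap counted so far, discarded ids
        plen = len(x) - math.ceil(t * len(x)) + 1
        for i, tok in enumerate(x[:plen], start=1):
            for other, j in index[tok]:
                if other in pruned:
                    continue
                size_y = len(records[other])
                need = math.ceil(t / (1 + t) * (len(x) + size_y))
                bound = seen.get(other, 0) + 1 + min(len(x) - i, size_y - j)
                if positional and bound < need:
                    pruned.add(other)          # positional filter rejects the pair
                    seen.pop(other, None)
                else:
                    seen[other] = seen.get(other, 0) + 1
            index[tok].append((rid, i))
        kept.update((other, rid) for other in seen)
    return sorted(kept)

after_prefix = candidates(records)
after_positional = candidates(records, positional=True)
```

Under these assumptions `after_prefix` contains the nine pairs listed after the prefix filter and `after_positional` the five pairs listed after ppjoin.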

16
Experiment Settings

Algorithms Compared
All-Pairs [WWW07]
PPJoin
PPJoin+
Measure
Jaccard, Cosine
Candidate Size, Running Time
Near Duplicate Web Page Detection
compare with shingling [SEQS97]

17
Experiment Settings

Environment
Pentium D 3.00GHz CPU, 2GB RAM
Debian 4.1, GCC 4.1.2 with -O3
Dataset

18
Experiment Results: DBLP, Jaccard

Candidate Pairs

Running Time

19
Exp. Results: Near Duplicate Web Page Detection

extract q-gram and shingle sets, and perform
similarity join
rs: result from TREC-32shingle, rq: result from
TREC-4gram
Precision = |tp| / |rs|
Recall = |tp| / |rq|
Results

[Figure: result sets rs and rq with their common
part tp]
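
The two measures can be computed on sets of result pairs as in the Python sketch below. It assumes that tp is the set of pairs common to rs and rq, which is a reading of the slide rather than something it states; the function name and the sample pairs are made up.

```python
def precision_recall(rs, rq):
    """rs: pairs found via shingles; rq: pairs found via q-grams.

    tp is taken to be the intersection of the two result sets
    (an assumption of this sketch).
    """
    tp = rs & rq
    precision = len(tp) / len(rs) if rs else 0.0
    recall = len(tp) / len(rq) if rq else 0.0
    return precision, recall

# Made-up result sets for illustration only.
rs = {(1, 2), (3, 4), (5, 6), (7, 8)}
rq = {(1, 2), (3, 4), (9, 10)}
precision, recall = precision_recall(rs, rq)
```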
20
Conclusion

Contributions
New algorithms for set-similarity joins
positional filtering within prefix -> ppjoin
positional filtering within suffix -> ppjoin+
Features
exact
outperform existing algorithms
integrated with near duplicate Web page detection
methods
Future Work
other similarity functions
edit-distance
top-k similarity search queries

21
Related Work

Approximate
LSH: A. Gionis, P. Indyk, and R. Motwani.
Similarity search in high dimensions via hashing.
In VLDB, 1999.
Shingling: A. Z. Broder. On the resemblance and
containment of documents. In SEQS, 1997.
Exact
Index-based
S. Sarawagi and A. Kirpal. Efficient set joins on
similarity predicates. In SIGMOD, 2004.
Prefix-based
S. Chaudhuri, V. Ganti, and R. Kaushik. A
primitive operator for similarity joins in data
cleaning. In ICDE, 2006.
All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant.
Scaling up all pairs similarity search. In WWW,
2007.
Pigeon-hole principle based
PartEnum: A. Arasu, V. Ganti, and R. Kaushik.
Efficient exact set-similarity joins. In VLDB,
2006.

Thank you!
Questions?

23
References

[SEQS97] A. Z. Broder. On the resemblance and
containment of documents. In SEQS, 1997.
[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern
Information Retrieval. Addison Wesley, 1st
edition, May 1999.
[VLDB99] (LSH) A. Gionis, P. Indyk, and R.
Motwani. Similarity search in high dimensions via
hashing. In VLDB, 1999.
[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient
set joins on similarity predicates. In SIGMOD,
2004.
[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik.
A primitive operator for similarity joins in data
cleaning. In ICDE, 2006.
[VLDB06] (PartEnum) A. Arasu, V. Ganti, and R.
Kaushik. Efficient exact set-similarity joins. In
VLDB, 2006.
[WWW07] (All-Pairs) R. J. Bayardo, Y. Ma, and R.
Srikant. Scaling up all pairs similarity search.
In WWW, 2007.

24
Backup Slides

Memory Issues
We need twice as much memory as All-Pairs when
building the index.
Space / Time
Some techniques to deal with memory
Do not build index for widowed tokens (those that
appear only once)
Sort the records by increasing size
dynamically remove shorter records from inverted
lists
Integrated with RDBMS
Prefix filter in RDBMS [ICDE06]
Need to implement positional filters in both
prefix and suffix
Q: What if the probing tokens are not found in y?
Convert overlap to hamming distance
Estimate the upper bound of hamming distance
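
The conversion between overlap and Hamming distance on sets can be sketched in Python as follows. It relies on the identity H(x,y) = |x| + |y| − 2·O(x,y), where H is the size of the symmetric difference; the function names are hypothetical.

```python
def hamming_distance(x, y):
    """Size of the symmetric difference; equals |x| + |y| - 2 * O(x, y)."""
    return len(x ^ y)

def max_hamming(size_x, size_y, min_overlap):
    """Largest Hamming distance still compatible with O(x, y) >= min_overlap."""
    return size_x + size_y - 2 * min_overlap

x = set("ABCDE")
y = set("BCDEF")
h = hamming_distance(x, y)
h_max = max_hamming(18, 18, 16)   # sizes and overlap from the |x| = |y| = 18 example
```

An overlap requirement therefore translates into a cap on the Hamming distance: with |x| = |y| = 18 and a required overlap of 16, the two records may differ in at most 4 tokens.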