Title: Efficient Similarity Joins for Near Duplicate Detection
1Efficient Similarity Joins for Near Duplicate
Detection
- Chuan Xiao
- The University of New South Wales, Australia
- Joint work with Wei Wang (UNSW), Xuemin Lin (UNSW), and Jeffrey Xu Yu (CUHK)
2Outline
- Introduction
- Algorithms
- Experiments
- Conclusion
3Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to
the former world No. 1 - the man who owns the
record of 14 Grand Slams he wants.
By JAY COHEN, AP Sports Writer Mar 11, 4:23 am
EDT
03/11/2008 11:28 AM
4Applications
SPAM TEMPLATE Sir/Madam, We happily announce to
you the draw of the EURO MILLIONS SPANISH LOTTERY
INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on
the 27TH MARCH 2008 in SPAIN. Your company or
your personal e-mail address attached to ticket
number 653-908-321-675 with serial main number
<NUMBER> drew lucky star winning numbers
<NUMBER> which consequently won in the 2ND
category, you have therefore been approved for a
lump sum pay out of 960.000.00 Euros. (NINE
HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS
!!! Sincerely yours, <NAME> <AFFILIATION>
- For Web search engines
- Perform focused crawling
- Increase the quality and diversity of query results
- Identify spam
- For Web mining
- Perform document clustering
- Find replicate Web collections
- Detect plagiarism
Q. What are the advantages of RAID5 over
RAID4? A. 1. Several write requests could be
processed in parallel, since the bottleneck of a
unique check disk has been eliminated. 2. Read
requests have a higher level of parallelism.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a dedicated check disk the check
disk never participates in read.
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed
in parallel, since the bottleneck of a single
check disk has been eliminated. 2. Read requests
have a higher level of parallelism on RAID5.
Since the data is distributed over all disks,
read requests involve all disks, whereas in
systems with a check disk the check disk never
participates in read.
5Similarity Join
- Near duplicates: pairs of objects with high similarity
- Similarity -> quantitative way -> similarity function
- Given a collection of records, the similarity join problem is to find all pairs of records, <x, y>, such that sim(x, y) ≥ t
- Tokenize
- Each record is a set of tokens from a finite universe.
- Suppose each record is a single text document
- x = "yes as soon as possible"
- y = "as soon as possible please"
- x = {A, B, C, D, E}
- y = {B, C, D, E, F}
6Similarity Function
- Common similarity functions
- Jaccard
- Cosine
- Overlap
- Jaccard can be equivalently converted to Overlap
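A minimal sketch of these functions on token sets, using the x and y from the Similarity Join slide. The function names are illustrative, and the set-based cosine form O(x, y) / sqrt(|x|·|y|) is an assumption, since the slide does not spell out a definition. The last function shows the Jaccard-to-Overlap conversion: J(x, y) ≥ t holds exactly when O(x, y) ≥ t / (1 + t) · (|x| + |y|).

```python
import math
from fractions import Fraction


def overlap(x, y):
    """O(x, y): number of tokens shared by the two sets."""
    return len(x & y)


def jaccard(x, y):
    """J(x, y) = |x ∩ y| / |x ∪ y| (assumes at least one set is non-empty)."""
    return Fraction(len(x & y), len(x | y))


def cosine(x, y):
    """Set-based cosine (assumed form): |x ∩ y| / sqrt(|x| * |y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))


def jaccard_to_overlap(t, size_x, size_y):
    """Smallest integer alpha with: J(x, y) >= t  <=>  O(x, y) >= alpha.

    t should be a Fraction so the ceiling is not disturbed by float rounding.
    """
    return math.ceil(t * (size_x + size_y) / (1 + t))


x = set("ABCDE")
y = set("BCDEF")
t = Fraction(4, 5)  # threshold 0.8

print(overlap(x, y), jaccard(x, y), cosine(x, y))
print(jaccard_to_overlap(t, 18, 18))
```

Passing the threshold as a `Fraction` is a design choice of this sketch: with a float 0.8 the product can land a hair above an integer and the ceiling would then be off by one.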
7Naïve and Index-Based Algorithms
- Naïve Algorithm
- Compare every pair of objects -> O(n^2) time complexity
- Index-based Algorithm [MIR, SIGMOD04]
- inverted lists
Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs
8Index-Based Algorithm
- Index-based Algorithm
- Example
- Suppose sim(x, y) = O(x, y) ≥ t = 3
- Result: <w, x>, <w, y>
Stop words: too many candidate pairs!
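The index-based approach can be sketched as follows, assuming an overlap threshold and in-memory inverted lists. The records in the slide's example did not survive, so the four records below are hypothetical, and `index_join` is an illustrative name rather than the method from [SIGMOD04].

```python
from collections import defaultdict


def index_join(records, t):
    """Return pairs (earlier_id, later_id) whose token sets share >= t tokens."""
    inverted = defaultdict(list)  # token -> ids of records already indexed
    results = []
    for rid, tokens in records.items():
        counts = defaultdict(int)  # candidate id -> number of shared tokens
        # Candidate generation: probe the inverted list of every token.
        for tok in tokens:
            for other in inverted[tok]:
                counts[other] += 1
        # Verification: records are sets, so the count is the exact overlap.
        results.extend((other, rid) for other, c in counts.items() if c >= t)
        # Index construction: add the current record to the inverted lists.
        for tok in tokens:
            inverted[tok].append(rid)
    return sorted(results)


# Hypothetical records, not taken from the slides.
records = {
    "w": {"A", "B", "C", "D"},
    "x": {"A", "B", "C", "E"},
    "y": {"B", "C", "D", "F"},
    "z": {"A", "G", "H", "I"},
}
pairs = index_join(records, t=3)
print(pairs)
```

Every record that shares even one token with the probing record becomes a candidate, which is why a frequent token (a stop word) with a long inverted list inflates the candidate set.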
9Prefix Filter [ICDE06, WWW07]
- Sort the tokens by a global ordering
- increasing order of document frequency
- Index the first few tokens (prefix) for each record
- Example
- suppose sim(x, y) = O(x, y) ≥ t = 4
- [Figure: records x and y with tokens sorted and prefixes marked; ubound O(x, y) = 3 < 4!]
- Must share at least one token in prefix to be a candidate pair
10Prefix Filter [ICDE06, WWW07]
- O(x, y) ≥ t ⇒ prefix length = |x| - t + 1
- J(x, y) ≥ t ⇒ O(x, y) ≥ t·|x| ⇒ prefix length = ⌊(1 - t)·|x|⌋ + 1
- Example: suppose sim(x, y) = J(x, y) ≥ t = 0.8
- w = {C, D, E, F}
- x = {B, C, D, E, F}
- y = {A, B, C, D, F}
- z = {G, A, B, E, F}
Candidate Pairs: <w, x>, <x, y>, <y, z>
Results: <w, x>
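The example above can be reproduced with a short sketch. It assumes ties in document frequency are broken alphabetically, and it compares prefixes pairwise for brevity, where an actual implementation would probe an inverted index built over prefix tokens. All function names are illustrative. The prefix length is computed as |x| - ⌈t·|x|⌉ + 1, an equivalent integer form of the Jaccard prefix length.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations


def jaccard(x, y):
    return Fraction(len(x & y), len(x | y))


def prefix_filter_join(records, t):
    """Return (candidate pairs, verified pairs) for a Jaccard threshold t."""
    # Global ordering: increasing document frequency, ties broken
    # alphabetically (the tie-break is an assumption of this sketch).
    df = Counter(tok for rec in records.values() for tok in rec)
    prefixes = {}
    for rid, rec in records.items():
        tokens = sorted(rec, key=lambda tok: (df[tok], tok))
        length = len(tokens) - math.ceil(t * len(tokens)) + 1
        prefixes[rid] = set(tokens[:length])
    # A pair is a candidate only if the two prefixes share a token.
    candidates = [(a, b) for a, b in combinations(records, 2)
                  if prefixes[a] & prefixes[b]]
    # Verification computes the real similarity of each candidate.
    results = [(a, b) for a, b in candidates
               if jaccard(records[a], records[b]) >= t]
    return candidates, results


records = {
    "w": set("CDEF"),
    "x": set("BCDEF"),
    "y": set("ABCDF"),
    "z": set("GABEF"),
}
candidates, results = prefix_filter_join(records, Fraction(4, 5))
print(candidates)
print(results)
```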
11Prefix + Positional Information
- We use prefix filter (All-Pairs [WWW07]) as the basic framework
- Intuition
- tokens sorted -> rank, or position, of tokens within a record
- estimate tighter upper bounds of overlap between x and y with positional information
- Contributions
- index construction
- index not only tokens, but also their positions in the record
- ⇒ ppjoin algorithm
- candidate generation
- probe tokens in suffix, compare the positions in the record
- ⇒ ppjoin+ algorithm
12Positional Filter within Prefix (ppjoin)
- Index both tokens and their positions
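- [Figure: records x and y with token positions marked]
- ubound O(x, y) = 1 + min(|x| - px, |y| - py), where px and py are the positions of the shared token in x and y
- ubound O(x, y) = 1 + min(4, 3) = 4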
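The bound on this slide can be written as a one-line helper. This is a sketch: positions are taken to be 1-based, and the record sizes in the usage line are hypothetical values chosen only so that the two differences come out as 4 and 3, since the slide's records are not recoverable.

```python
def positional_ubound(size_x, pos_x, size_y, pos_y):
    """Upper bound on O(x, y) given one shared token at 1-based positions
    pos_x in x and pos_y in y: the shared token itself, plus at most as many
    further matches as the shorter of the two remaining tails."""
    return 1 + min(size_x - pos_x, size_y - pos_y)


# Hypothetical sizes and positions giving 1 + min(4, 3).
bound = positional_ubound(size_x=5, pos_x=1, size_y=5, pos_y=2)
print(bound)
```

A candidate pair can be dropped as soon as this bound falls below the overlap required by the similarity threshold.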
13Positional Filter within Suffix (ppjoin+)
- probe tokens in suffix, and compare their positions
- suppose sim(x, y) = J(x, y) ≥ t = 0.8
- |x| = |y| = 18 ⇒ O(x, y) ≥ 16
- ubound O(x, y) = 3 + 4 + 1 + 7 = 15 < 16
- [Figure: records x and y split into prefix and suffix; probe token Q located in the other record by binary search]
14Positional Filter within Suffix (ppjoin+)
- Divide and Conquer
- ubound (depth 1)
- ubound (depth 2)
- ubound (depth 3)
- probe suffix recursively, until either the candidate pair is pruned or max-depth is reached
[Figure: prefix and suffix of x and y, with the suffixes partitioned recursively at depths 1, 2, and 3]
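One way to sketch the recursive probing is through Hamming distance, H(x, y) = |x| + |y| - 2·O(x, y), so a lower bound on H gives an upper bound on O. The code below is a simplified illustration, not the ppjoin+ routine itself: it omits early termination and the restricted binary-search range, assumes tokens are integers already sorted by the global ordering, and uses hypothetical names and data.

```python
from bisect import bisect_left


def hamming_lower_bound(x, y, depth=1, max_depth=3):
    """Lower bound on |x Δ y| for sorted, duplicate-free token lists."""
    if depth > max_depth or not x or not y:
        # Without looking at tokens, the size difference is a valid bound.
        return abs(len(x) - len(y))
    mid = len(y) // 2
    pivot = y[mid]                    # probe token taken from y
    pos = bisect_left(x, pivot)       # binary search for the pivot in x
    found = pos < len(x) and x[pos] == pivot
    x_left = x[:pos]
    x_right = x[pos + 1:] if found else x[pos:]
    diff = 0 if found else 1          # pivot is in y but missing from x
    return (hamming_lower_bound(x_left, y[:mid], depth + 1, max_depth)
            + hamming_lower_bound(x_right, y[mid + 1:], depth + 1, max_depth)
            + diff)


def overlap_upper_bound(x, y, max_depth=3):
    """Turn the Hamming lower bound into an upper bound on O(x, y)."""
    return (len(x) + len(y) - hamming_lower_bound(x, y, 1, max_depth)) // 2


# Hypothetical sorted token ids.
x = [1, 2, 3, 5, 8, 9]
y = [2, 3, 4, 5, 6, 9]
for d in range(0, 4):
    print(d, hamming_lower_bound(x, y, max_depth=d), overlap_upper_bound(x, y, d))
```

Each extra level of recursion can only tighten the bound, at the cost of more binary searches, which is the trade-off that max-depth controls.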
15Effect of Filters
- sim(x, y) = J(x, y) ≥ t = 0.8
- u = {C, D, E, F}
- v = {B, C, D, E, F}
- w = {A, B, C, D, F}
- x = {G, A, B, E, F}
- y = {A, B, D, E, F}
- z = {G, A, C, D, E, F}
- after prefix filter
- <u, v>, <v, w>, <v, y>, <w, x>, <w, y>, <w, z>, <x, y>, <x, z>, <y, z>
- after ppjoin
- <u, v>, <w, y>, <w, z>, <x, z>, <y, z>
- after ppjoin+ (max-depth = 1)
- <u, v>, <x, z>, <y, z>
- real result
- <u, v>
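The first two stages of this example can be checked with a small sketch. It is a simplification under stated assumptions: ties in document frequency are broken alphabetically, pairs are enumerated directly instead of through an inverted index, and the positional bound is evaluated only at the first shared prefix token, whereas ppjoin accumulates matches over all prefix tokens. The suffix-filtering stage is not reproduced here.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations

T = Fraction(4, 5)  # Jaccard threshold 0.8
records = {"u": "CDEF", "v": "BCDEF", "w": "ABCDF",
           "x": "GABEF", "y": "ABDEF", "z": "GACDEF"}

# Global ordering: increasing document frequency, ties alphabetical (assumed).
df = Counter(tok for rec in records.values() for tok in rec)
ordered = {rid: sorted(rec, key=lambda tok: (df[tok], tok))
           for rid, rec in records.items()}


def prefix_len(n):
    return n - math.ceil(T * n) + 1


def min_overlap(n, m):
    """Overlap required for J >= T between records of sizes n and m."""
    return math.ceil(T * (n + m) / (1 + T))


def first_prefix_match(a, b):
    """1-based positions (in a, in b) of the first shared prefix token."""
    b_prefix = b[:prefix_len(len(b))]
    for i, tok in enumerate(a[:prefix_len(len(a))], start=1):
        if tok in b_prefix:
            return i, b_prefix.index(tok) + 1
    return None


after_prefix, after_positional = [], []
for ra, rb in combinations(ordered, 2):
    a, b = ordered[ra], ordered[rb]
    match = first_prefix_match(a, b)
    if match is None:
        continue  # pruned by the prefix filter
    after_prefix.append((ra, rb))
    pa, pb = match
    if 1 + min(len(a) - pa, len(b) - pb) >= min_overlap(len(a), len(b)):
        after_positional.append((ra, rb))

real = [(ra, rb) for ra, rb in after_positional
        if Fraction(len(set(records[ra]) & set(records[rb])),
                    len(set(records[ra]) | set(records[rb]))) >= T]
print(after_prefix)
print(after_positional)
print(real)
```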
16Experiment Settings
- Algorithms Compared
- All-Pairs [WWW07]
- PPJoin
- PPJoin+
- Measure
- Jaccard, Cosine
- Candidate Size, Running Time
- Near Duplicate Web Page Detection
- compare with shingling [SEQS97]
17Experiment Settings
- Environment
- Pentium D 3.00GHz CPU, 2GB RAM
- Debian 4.1, GCC 4.1.2 with O3
- Dataset
18Experiment Results DBLP, Jaccard
19Exp. Results Near Duplicate Web Page Detection
- extract q-gram and shingle sets, and perform similarity join
- rs: result from TREC-32shingle; rq: result from TREC-4gram
- Precision = |tp| / |rs|
- Recall = |tp| / |rq|
- Results
rs
rq
tp
20Conclusion
- Contributions
- New algorithms for set-similarity joins
- positional filtering within prefix -> ppjoin
- positional filtering within suffix -> ppjoin+
- Features
- exact
- outperform existing algorithms
- integrated with near duplicate Web page detection methods
- Future Work
- other similarity functions
- edit-distance
- top-k similarity search queries
21Related Work
- Approximate
- LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
- Shingling: A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
- Exact
- Index-based
- S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
- Prefix-based
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
- All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
- Pigeon-hole principle based
- PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
22 23References
- [SEQS97] A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
- [MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.
- [VLDB99] (LSH) A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
- [SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
- [ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
- [VLDB06] (PartEnum) A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
- [WWW07] (All-Pairs) R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
24Backup Slides
- Memory Issues
- We need twice as much memory as All-Pairs when building the index.
- Space / Time
- Some techniques to deal with memory
- Do not build index for widowed tokens (tokens that appear only once)
- Sort the records by increasing size; dynamically remove shorter records from inverted lists
- Integrated with RDBMS
- Prefix filter in RDBMS [ICDE06]
- Need to implement positional filters in both prefix and suffix
- Q: What if the probing tokens are not found in y?
- Convert overlap to Hamming distance
- Estimate the upper bound of Hamming distance