Title: Similarity join problem with Pass-Join-K using Hadoop
1Similarity join problem with Pass-Join-K using
Hadoop
2Outline
- Background
- The introduction of Pass-Join-K
- Combining Pass-Join-K with Hadoop
3Background
- Similarity join: find all similar pairs from two sets.
- Data cleaning
- Query relaxation
- Spellchecking
Examples: "PO BOX 23, Main St." vs. "P.O. Box 23, Main St"; "information" vs. "imformation"
4Background
- How to define similarity?
- Jaccard distance
- Cosine distance
- Edit distance
2016-6-19
http//datamining.xmu.edu.cn
4/32
5Background
- Edit distance
- The minimum number of edit operations
(insertion, deletion, and substitution) to
transform one string to another.
- Bod -> Body: insertion
- Baby -> Body: substitution
6Background
- How does edit distance compare with the other two?
- Accuracy: consider abcdefg and gfedcba.
- Verification time: O(mn) for edit distance versus O(m + n) for the other two.
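The accuracy point can be illustrated with a minimal sketch. It assumes that Jaccard and cosine are computed over the characters of each string; the slides do not say how tokens are formed, and the function names are hypothetical. Under that assumption both measures report distance 0 for abcdefg and gfedcba, while the edit distance between them is 6.

```python
from collections import Counter
from math import sqrt


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over character sets (tokenization is an assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cosine_distance(a, b):
    """1 - cosine similarity of the character-count vectors (also an assumption)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / norm if norm else 0.0


# Both set-based measures ignore character order.
print(jaccard_distance("abcdefg", "gfedcba"))
print(cosine_distance("abcdefg", "gfedcba"))
```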
7Background
- Find similar pairs
- We have two string sets: one is {vldb, sigmod, ...}, the other is {pvldb, icde, ...}.
- Find some candidate pairs, and then verify these pairs.
<vldb, pvldb>, <vldb, icde>, <vldb, ...>, <sigmod, pvldb>, <sigmod, icde>, ...
<vldb, pvldb>: Yes
<vldb, icde>: No
8Background
- So we have to
- Find candidate pairs. There are O(N^2) of them if we do not prune any.
- Verify these pairs. Each verification takes O(mn).
9Introduction of Pass-Join-K
- Some obvious pruning techniques
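- Length-based: with threshold 2, <ab, abcee> is pruned because the lengths differ by 3.
- Shift-based: <abcd, cdef> (the shared substring cd is shifted by two positions)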
a b c d
c d e f
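The length-based rule can be sketched as follows. Each insertion or deletion changes the length by one, so the length difference is a lower bound on the edit distance. The function name is hypothetical.

```python
def passes_length_filter(r, s, tau):
    """Keep the pair only if the length difference does not exceed tau."""
    return abs(len(r) - len(s)) <= tau


# The pair from the slide: lengths 2 and 5 differ by 3 > 2, so it is pruned.
print(passes_length_filter("ab", "abcee", 2))
```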
10Introduction of Pass-Join-K
- Partition-based pruning technique
- Suppose the threshold tau = 2 and K = 2, and we have the pair <abcdefghijk, abdefghk>.
abc def ghi jk
ab def gh k
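A minimal sketch of the idea, under the usual pigeonhole argument: one edit can damage at most one segment, so if r is split into tau + K segments and the edit distance is at most tau, at least K segments of r must appear unchanged in s. The function names are hypothetical, and segment positions are ignored here for simplicity.

```python
def matching_segments(segments, s):
    """Count the segments of r that occur somewhere in s as substrings."""
    return sum(1 for seg in segments if seg in s)


def survives_partition_filter(segments, s, k):
    """A pair stays a candidate only if at least k segments of r occur in s."""
    return matching_segments(segments, s) >= k


segments_of_r = ["abc", "def", "ghi", "jk"]   # r = abcdefghijk, tau + K = 4
s = "abdefghk"
# Only "def" occurs in s, so with K = 2 the pair is pruned.
print(matching_segments(segments_of_r, s), survives_partition_filter(segments_of_r, s, 2))
```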
11Introduction of Pass-Join-K
- Partition Scheme
- We have seen that the longer the substrings are, the harder they are to match.
- So we break the string into tau + k parts, each of length floor(length/(tau + k)) or floor(length/(tau + k)) + 1.
abc def ghi jk
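The partition scheme can be sketched as below. The function name is hypothetical, and placing the longer segments first is an assumption taken from the example abc def ghi jk; the slides do not state the order.

```python
def partition(string, tau, k):
    """Split string into tau + k segments of length floor(n/(tau+k)) or one more."""
    parts = tau + k
    base, extra = divmod(len(string), parts)
    segments, start = [], 0
    for i in range(parts):
        # The first `extra` segments get one additional character (assumed order).
        size = base + 1 if i < extra else base
        segments.append(string[start:start + size])
        start += size
    return segments


# 11 characters into 4 parts: three of length 3 and one of length 2.
print(partition("abcdefghijk", 3, 1))
```

Strings shorter than tau + k would produce empty segments and need separate handling.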
13Introduction of Pass-Join-K
- Partition Scheme
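- r = abcdefghijk, s = abdefghk
[Figure: inverted index L11 for strings of length 11. Segment lists 1 to 4 hold abc, def, ghi, and jk, each pointing to r. Probe substring: def.]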
14Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1.
abc def ghi jk
a b d e f g h k
16Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1.
abc def ghi jk
a b d e f gh k
17Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1.
abc def ghi jk
abd efg hk
19Introduction of Pass-Join-K
- Substring Selection
- So what we do is reduce the number of substrings. For more pruning techniques, please read our paper on Pass-Join-K.
20Introduction of Pass-Join-K
- Verification
- DP (dynamic programming)
- D(m, n) = min(D(m, n-1) + 1, D(m-1, n) + 1, D(m-1, n-1) + flag), where flag = 1 when s_m != r_n and 0 otherwise; s and r are both strings.
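The recurrence can be sketched as a standard two-row dynamic program. This is the textbook O(mn) computation, not necessarily the optimized verification used in Pass-Join-K; the function name is hypothetical.

```python
def edit_distance(s, r):
    """Minimum number of insertions, deletions, and substitutions turning s into r."""
    m, n = len(s), len(r)
    prev = list(range(n + 1))            # D(0, j) = j
    for i in range(1, m + 1):
        curr = [i] + [0] * n             # D(i, 0) = i
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1,      # insertion
                          prev[j] + 1,          # deletion
                          prev[j - 1] + flag)   # match or substitution
        prev = curr
    return prev[n]


print(edit_distance("Bod", "Body"))
print(edit_distance("abcdefghijk", "abdefghk"))
```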
21Introduction of Pass-Join-K
- Verification
- Here we suppose tau = 3 and k = 1.
abc def ghi jk
def e f g h k
Tau_left = 3
Tau_right = 3 - 3 = 0
22Combining Pass-Join-K with Hadoop
- Inverted index tree in Hadoop
- (abc, 1, 11, r), (def, 2, 11, r), (ghi, 3, 11, r), (jk, 4, 11, r)
[Figure: inverted index L11 with segment lists 1 to 4 holding abc, def, ghi, and jk, each pointing to r.]
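The index records above can be produced with a short sketch. It is plain Python rather than Hadoop code; the function names, the dictionary layout, and the longer-segments-first partition are assumptions made for illustration.

```python
from collections import defaultdict


def partition(string, parts):
    """Split string into `parts` segments; longer segments first (assumed order)."""
    base, extra = divmod(len(string), parts)
    segments, start = [], 0
    for i in range(parts):
        size = base + 1 if i < extra else base
        segments.append(string[start:start + size])
        start += size
    return segments


def segment_records(string_id, string, tau, k):
    """One record per segment: (segment, segment number, string length, id)."""
    return [(seg, no, len(string), string_id)
            for no, seg in enumerate(partition(string, tau + k), start=1)]


def build_index(strings, tau, k):
    """Group ids by (length, segment number, segment), mimicking the lists L_length."""
    index = defaultdict(list)
    for string_id, string in strings.items():
        for seg, no, length, sid in segment_records(string_id, string, tau, k):
            index[(length, no, seg)].append(sid)
    return index


print(segment_records("r", "abcdefghijk", 3, 1))
```

Looking up `(11, 2, "def")` in the index built from r then returns the list containing r.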
23Combining Pass-Join-K with Hadoop
- Substrings in Hadoop
- Suppose tau = 3, k = 1, and s = abdefghk, so length(s) = 8. We have to generate records such as (a, 1, 5, s), (a, 2, 6, s), (a, 3, 7, s), (ab, 1, 8, s), ..., (ab, 1, 11, s), ...
24Combining Pass-Join-K with Hadoop
- Substrings in Hadoop
- Suppose tau = 3, k = 1, and s = abdefghk, so length(s) = 8. We have to generate more than 2 * tau * (tau + k) * m records, where m is the average number of substrings per segment, such as (a, 1, 5, s), (a, 1, 6, s), (a, 1, 7, s), (ab, 1, 8, s), ..., (ab, 1, 11, s), ...
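The map and reduce steps can be simulated in plain Python without Hadoop. This is a hypothetical sketch, not the authors' job: the function names, the longer-segments-first layout, and the simple position window of plus or minus tau are assumptions, and that window is looser than a tuned substring selection. Because the layout is assumed, the records it emits need not coincide with the sample records listed above.

```python
from collections import defaultdict


def layout(length, parts):
    """(start, size) of each segment for a string of this length (assumed order)."""
    base, extra = divmod(length, parts)
    spans, start = [], 0
    for i in range(parts):
        size = base + 1 if i < extra else base
        spans.append((start, size))
        start += size
    return spans


def map_r(rid, r, tau, k):
    """Index side: one record per segment, keyed by (segment, number, length)."""
    if len(r) < tau + k:          # too short to partition; not handled in this sketch
        return
    for no, (start, size) in enumerate(layout(len(r), tau + k), start=1):
        yield (r[start:start + size], no, len(r)), ("R", rid)


def map_s(sid, s, tau, k):
    """Probe side: for each possible length of r, emit substrings near each segment."""
    for length in range(max(len(s) - tau, tau + k), len(s) + tau + 1):
        for no, (start, size) in enumerate(layout(length, tau + k), start=1):
            lo = max(0, start - tau)
            hi = min(len(s) - size, start + tau)
            for pos in range(lo, hi + 1):
                yield (s[pos:pos + size], no, length), ("S", sid)


def candidate_pairs(R, S, tau, k):
    groups = defaultdict(list)                 # stands in for the shuffle phase
    for rid, r in R.items():
        for key, value in map_r(rid, r, tau, k):
            groups[key].append(value)
    for sid, s in S.items():
        for key, value in map_s(sid, s, tau, k):
            groups[key].append(value)
    matched = defaultdict(set)                 # (rid, sid) -> matched segment numbers
    for key, values in groups.items():         # reduce: join records sharing a key
        rids = [v[1] for v in values if v[0] == "R"]
        sids = [v[1] for v in values if v[0] == "S"]
        for rid in rids:
            for sid in sids:
                matched[(rid, sid)].add(key[1])
    # Pass-Join-K condition: at least k distinct segments of r must match.
    return {pair for pair, segs in matched.items() if len(segs) >= k}


R = {"r1": "vldb", "r2": "sigmod"}
S = {"s1": "pvldb", "s2": "icde"}
print(candidate_pairs(R, S, tau=1, k=1))
```

The surviving pairs are only candidates; each one still has to be verified with the edit distance computation.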
26Combining Pass-Join-K with Hadoop
- How can we improve the performance?
- We know that as k increases, the number of pairs we need to verify decreases.
- As k increases, more than (tau + k + 1)/(tau + k) times as many records must be transferred in the Mapper phase.
27Combining Pass-Join-K with Hadoop
- There are two ways to improve our algorithm.
- Find a dataset in which the number of candidate pairs is large enough, or make tau large enough.
- Decrease the amount of data generated in the Mapper phase.
29Combining Pass-Join-K with Hadoop
- Decrease the data flow
- The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag).
- Each record's length is length(substring) + 4 * sizeof(int), and the substring can sometimes be very long.
- Hash(substring) -> integer; the record length then becomes 5 * sizeof(int).
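The size reduction can be checked with a small sketch. It assumes 4-byte integers, an integer id, and CRC32 as the hash; the slides do not name a hash function, and the record-building helpers are hypothetical. Hash collisions can only add false candidates, which verification later removes.

```python
import struct
import zlib


def raw_record(substring, segment_number, length_inf, string_id, flag):
    """Record that carries the substring itself: len(substring) bytes plus 4 ints."""
    return substring.encode("utf-8") + struct.pack(
        "<4i", segment_number, length_inf, string_id, flag)


def hashed_record(substring, segment_number, length_inf, string_id, flag):
    """Record that carries a 32-bit hash of the substring: 5 ints in total."""
    digest = zlib.crc32(substring.encode("utf-8"))   # assumed hash, unsigned 32-bit
    return struct.pack("<I4i", digest, segment_number, length_inf, string_id, flag)


# A 10-character substring: 10 + 16 bytes raw versus a fixed 20 bytes hashed.
print(len(raw_record("abcdefghij", 1, 11, 7, 0)),
      len(hashed_record("abcdefghij", 1, 11, 7, 0)))
```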
30Combining Pass-Join-K with Hadoop
- Decrease the data flow
- A substring generates similar records such as (a, 1, 5, s), (a, 1, 6, s), (a, 1, 7, s).
- Each substring generates tau + k similar records, so we combine them into one, for example (a, 1, 5, 7, s). This reduces (tau + k) * 4 * sizeof(int) to 5 * sizeof(int).
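Combining the records can be sketched as merging runs of consecutive lengths into a (low, high) range. The function name and the tuple layout (substring, segment number, length, id) are assumptions based on the sample records.

```python
from itertools import groupby


def merge_length_ranges(records):
    """Collapse (sub, seg_no, length, id) records with consecutive lengths
    into (sub, seg_no, low, high, id) records."""
    def group_key(rec):
        return (rec[0], rec[1], rec[3])

    merged = []
    ordered = sorted(records, key=lambda rec: (group_key(rec), rec[2]))
    for (sub, seg_no, sid), group in groupby(ordered, key=group_key):
        lengths = sorted({rec[2] for rec in group})
        low = prev = lengths[0]
        for length in lengths[1:]:
            if length != prev + 1:          # gap: close the current range
                merged.append((sub, seg_no, low, prev, sid))
                low = length
            prev = length
        merged.append((sub, seg_no, low, prev, sid))
    return merged


print(merge_length_ranges([("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s")]))
```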
31Combining Pass-Join-K with Hadoop
- Decrease the data flow
- So with the two steps above, we have reduced (length(substring) + 4 * sizeof(int)) * (tau + k) to 5 * sizeof(int).
32Thanks for patience
- Email: yhycai_at_gmail.com