Title: Interactive Deduplication using Active Learning*
1. Interactive Deduplication using Active Learning
Sunita Sarawagi, Anuradha Bhamidipaty
{sunita,anu}@it.iitb.ac.in
* Funded by the Ministry of Information Technology, India.
2. The de-duplication problem
- Given a list of semi-structured records, find all records that refer to the same entity
- Example applications:
  - Data warehousing: merging name/address lists (entity: a person or a household)
  - Automatic citation databases (Citeseer): merging references (entity: a paper)
- De-duplication is not clustering!
  - There is a precise, external notion of correctness
3. Challenges
- Errors and inconsistencies in the data
- Duplicates may be spread far apart and may not be group-able using obvious keys
- The problem is domain-specific:
  - Existing manual approaches require retuning for every new domain
4. Motivating example: citations
- Our prior: two citations are duplicates when author, title, booktitle, and year match
- Author match could be hard:
  - L. Breiman, L. Friedman, and P. Stone, (1984).
  - Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.
- Conference match could be harder:
  - In VLDB-94
  - In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
5. Motivating example: citations (contd.)
- Fields may not be segmented
- Word overlap can be misleading
- Non-duplicates with lots of word overlap:
  - H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
  - H. Balakrishnan, S. Seshan, E. Amir, and R. H. Katz. "Improving TCP/IP Performance over Wireless Networks." Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.
- Duplicates with little overlap, even in the title:
  - Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
  - P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.
6. The learning approach
[Figure: example labeled pairs (Record 1 / Record 2 labeled D for duplicate, Record 1 / Record 3 labeled N for non-duplicate, Record 4 / Record 5 labeled D) are mapped through similarity functions f1, f2, ..., fn into feature vectors used to train a classifier.] A minimal sketch of this pair-to-feature-vector mapping follows.
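This sketch assumes three illustrative similarity functions and hypothetical field names, not the actual f1, ..., fn used in ALIAS:

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def ngrams(s: str, n: int = 3) -> set:
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of character n-grams."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / max(1, len(ga | gb))

def to_feature_vector(r1: dict, r2: dict) -> list:
    """Map a record pair to similarity scores (stand-ins for f1..fn)."""
    return [
        edit_similarity(r1["author"], r2["author"]),  # author edit similarity
        ngram_overlap(r1["title"], r2["title"]),      # title n-gram overlap
        abs(int(r1["year"]) - int(r2["year"])),       # year difference
    ]

r1 = {"author": "L. Breiman", "title": "Classification and Regression Trees", "year": "1984"}
r2 = {"author": "Leo Breiman", "title": "Classification & Regression Trees", "year": "1984"}
print(to_feature_vector(r1, r2))
```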
7. Example of a learnt function
[Figure: a learnt decision tree that combines the similarity functions. Internal nodes test predicates such as YearDifference > 1, All-Ngrams ≤ 0.48, AuthorTitleNgrams ≤ 0.4, TitleIsNull < 1, PageMatch ≤ 0.5, and AuthorEditDist ≤ 0.8; each leaf predicts Duplicate or Non-Duplicate.]
The classifier automates the non-trivial task of combining simple similarity functions (a toy version of this training step follows).
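A toy version of the training step using scikit-learn; the feature names and labeled pairs are invented for illustration, and the resulting tree is not the one in the figure:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented similarity vectors [AuthorEditDist, TitleNgrams, YearDifference]
# for four labeled pairs; D = duplicate, N = non-duplicate.
X = [[0.9, 0.8, 0], [0.2, 0.1, 3], [0.85, 0.3, 0], [0.1, 0.7, 5]]
y = ["D", "N", "D", "N"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree,
                  feature_names=["AuthorEditDist", "TitleNgrams", "YearDifference"]))
```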
8. Experiences with the learning approach
- Too much manual search in preparing the training data
- Hard to spot challenging and covering sets of duplicates in large lists
- Even harder to find the close non-duplicates that capture the nuances
- Our solution: examine instances that are highly similar on one attribute but dissimilar on another (see the sketch below)
- Active learning is a generalization of this!
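One way to read that heuristic as code; the thresholds and pair data are invented, not the rule ALIAS uses:

```python
# Surface record pairs whose similarity features disagree strongly:
# very similar on one attribute, very dissimilar on another.
def is_confusing(features, hi=0.8, lo=0.3):
    return max(features) >= hi and min(features) <= lo

pairs = {("r1", "r2"): [0.9, 0.2, 0.5],   # conflicting evidence: worth labeling
         ("r3", "r4"): [0.9, 0.85, 0.8]}  # consistent evidence: easy case
print([p for p, f in pairs.items() if is_confusing(f)])  # [('r1', 'r2')]
```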
9. The active learning approach
[Figure: as before, example pairs are mapped through similarity functions f1, f2, ..., fn, but now the classifier itself selects which pairs (e.g., Record 1 / Record 2 labeled D, Record 3 / Record 4 labeled N) get labeled next.]
10. Working of ALIAS
- Apply the similarity functions to record pairs
- Loop until user satisfaction:
  - Train the classifier
  - Use active learning to select n instances
  - Collect user feedback
  - Augment with pairs inferred using transitivity
  - Add to the training set
- Output the classifier (a runnable toy version of this loop follows)
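A runnable toy version of the loop, under simplifying assumptions: uncertainty comes from a single tree's predicted probability rather than a committee, transitivity inference is omitted, and a hidden `truth` array stands in for the user:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_pool = rng.random((500, 3))                    # similarity vectors for candidate pairs
truth = (X_pool.mean(axis=1) > 0.6).astype(int)  # simulated user answers (1 = duplicate)

labeled = [int(truth.argmax()), int(truth.argmin())]  # seed: one D pair, one N pair
for _ in range(10):                              # "loop until user satisfaction"
    clf = DecisionTreeClassifier(max_depth=3).fit(X_pool[labeled], truth[labeled])
    prob = clf.predict_proba(X_pool)[:, 1]
    uncertainty = 1 - np.abs(prob - 0.5) * 2     # highest near the decision boundary
    uncertainty[labeled] = -1.0                  # never re-select labeled pairs
    labeled.append(int(uncertainty.argmax()))    # select; feedback = truth value
print("labeled", len(labeled), "pairs; accuracy",
      float((clf.predict(X_pool) == truth).mean()))
```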
11. The ALIAS deduplication system
- Interactive discovery of the deduplication function using active learning
- Manual effort reduced to:
  - Providing simple similarity functions
  - Labeling selected pairs
- Efficient indexing mechanism
- Novel cluster-based evaluation engine
- Cost-based optimizer
12. Architecture of ALIAS
[Figure: architecture of ALIAS. A Mapper applies the similarity functions F to the initial training records to produce mapped labeled instances Lp, and (via similarity indices) to candidate pairs to produce a pool Dp of mapped unlabeled instances. The Active Learner trains a classifier on the training data T, selects a set S of instances for the user to label, and infers further pairs using transitivity. The resulting deduplication function, together with a predicate for the uncertain region, drives the evaluation engine, which outputs the groups of duplicates in A.]
13. Example: active learning
Assume points from two classes (red and green) on the real line, perfectly separable by a single-point separator.
[Figure: labeled and unlabeled points on the line, with x marking the most uncertain unlabeled point.]
The point x with the greatest prediction uncertainty yields the largest expected reduction in the uncertain region (a sketch of why follows).
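In one dimension this is just binary search, which is one way to see the claim: labeling the midpoint of the unknown region halves it, so n labels shrink it by a factor of 2^n. A small sketch with an invented separator position:

```python
def locate_separator(label_of, lo=0.0, hi=1.0, queries=10):
    """label_of(x) returns 'green' left of the separator and 'red' right of it."""
    for _ in range(queries):
        x = (lo + hi) / 2              # the most uncertain point: the midpoint
        if label_of(x) == "green":
            lo = x                     # separator must lie to the right of x
        else:
            hi = x                     # separator must lie to the left of x
    return lo, hi

sep = 0.371                            # hidden ground-truth separator (invented)
print(locate_separator(lambda x: "green" if x < sep else "red"))
```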
14. Active learning
- Implicit measure:
  - Train the classifier
  - For each unlabeled instance, measure its prediction uncertainty
  - Choose a representative instance with high uncertainty
- Explicit measure (sketched below):
  - For each unlabeled instance: add it to the training data, train the classifier, and measure the resulting classifier error, quantified as the size of the confusion region or the summed prediction uncertainty over all instances
  - Choose the instance that yields the lowest error
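A sketch of the explicit measure under the same toy setup as before; a full version would weight the two possible user answers by their predicted probabilities, which this simplification skips:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pool_uncertainty(clf, X):
    """Summed prediction uncertainty over the pool, a proxy for classifier error."""
    p = clf.predict_proba(X)[:, 1]
    return float(np.sum(1 - np.abs(p - 0.5) * 2))

def explicit_pick(X_lab, y_lab, X_pool, candidates):
    """Tentatively label each candidate both ways, retrain, keep the lowest error."""
    best, best_err = None, float("inf")
    for i in candidates:
        err = sum(pool_uncertainty(
                      DecisionTreeClassifier(max_depth=3).fit(
                          np.vstack([X_lab, X_pool[i]]), list(y_lab) + [y]),
                      X_pool)
                  for y in (0, 1))     # both possible user answers
        if err < best_err:
            best, best_err = i, err
    return best

rng = np.random.default_rng(1)
X_pool = rng.random((50, 3))
X_lab = np.array([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]])
print(explicit_pick(X_lab, [0, 1], X_pool, range(10)))
```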
16. Measuring prediction certainty
- Classifier-specific methods (sketched below):
  - Support vector machines: distance from the separator
  - Naïve Bayes classifier: posterior probability of the winning class
  - Decision tree classifier: weighted sum of distance from the decision boundaries, error of the leaf, depth of the leaf, etc.
- Committee-based approach:
  - Disagreement amongst the members of a committee
  - The most successfully used method
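Sketches of the two classifier-specific scores, on invented data:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X.mean(axis=1) > 0.5).astype(int)

svm = LinearSVC(max_iter=10_000).fit(X, y)
svm_uncertainty = -np.abs(svm.decision_function(X))  # near zero = close to separator

nb = GaussianNB().fit(X, y)
nb_uncertainty = -nb.predict_proba(X).max(axis=1)    # low winning posterior = uncertain

print(X[int(svm_uncertainty.argmax())])  # most uncertain instance per measure
print(X[int(nb_uncertainty.argmax())])
```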
17. Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk of the k classifiers
  - Compute the uncertainty U(x) as the entropy of these y's
- Sampling for representativeness:
  - Using U(x) as the weight, do weighted sampling to select an instance for labeling (see the sketch below)
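A direct rendering of this selection step, given the committee's votes on each instance:

```python
import math, random
from collections import Counter

def vote_entropy(votes):
    """U(x): entropy of the label distribution y1, ..., yk over the committee."""
    k = len(votes)
    return -sum(c / k * math.log2(c / k) for c in Counter(votes).values())

def select_for_labeling(predictions):
    """predictions: {instance: [y1, ..., yk]}; weighted sampling by U(x)."""
    instances = list(predictions)
    weights = [vote_entropy(predictions[x]) for x in instances]
    return random.choices(instances, weights=weights, k=1)[0]

preds = {"pair_a": ["D", "D", "D", "D"],   # unanimous: U = 0, never sampled
         "pair_b": ["D", "N", "D", "N"]}   # maximal disagreement: U = 1
print(select_for_labeling(preds))          # always pair_b here
```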
18. Forming a classifier committee
- Data partitioning: resample the training data
- Attribute partitioning
- Random parameter perturbation:
  - For probabilistic classifiers, sample from the posterior distribution on the parameters given the training data
  - Example: a binomial parameter p has a Beta posterior whose mean is close to the observed fraction (sketch below)
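A sketch of the binomial example, assuming a uniform prior so the posterior is Beta(s+1, f+1) with mean (s+1)/(s+f+2), close to the observed fraction:

```python
import numpy as np

rng = np.random.default_rng(0)
s, f = 7, 3            # e.g. 7 of 10 relevant training pairs were duplicates
k = 5                  # committee size

# Each committee member gets its own parameter draw from the posterior,
# so the members agree on clear cases and disagree near the boundary.
committee_ps = rng.beta(s + 1, f + 1, size=k)
print(committee_ps)    # five perturbed versions of the parameter p ~ 0.7
```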
19. Randomly perturbing trees
- Selecting the split attribute:
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute whose entropy is within a close range of the lowest
- Selecting a split point:
  - Normally: the midpoint of the lowest-entropy range
  - Perturbed: a random point anywhere within the lowest-entropy range (sketch below)
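Both perturbations in code; the attribute entropies, tolerance, and range are invented for illustration:

```python
import random

def perturbed_attribute(attr_entropy, tol=0.05):
    """Pick a random attribute whose entropy is within `tol` of the lowest."""
    best = min(attr_entropy.values())
    return random.choice([a for a, e in attr_entropy.items() if e <= best + tol])

def perturbed_split(lo, hi):
    """Pick a random split point anywhere in the chosen attribute's range."""
    return random.uniform(lo, hi)

print(perturbed_attribute({"AuthorSim": 0.30, "TitleSim": 0.32, "YearDiff": 0.90}))
print(perturbed_split(0.0, 1.0))
```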
20. Experimental analysis
- 250 references from Citeseer → 32,000 pairs, of which only 150 are duplicates (see the count below)
- Citeseer's script was used to segment each reference into author, title, year, pages, and the rest
- 20 text and integer similarity functions
- Initial labeled set: just two pairs
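The pair count is just n*(n-1)/2, which makes the class skew concrete: 250 references yield 31,125 candidate pairs (the slide's ~32,000), of which 150 duplicates are about 0.5%:

```python
from itertools import combinations

pairs = list(combinations(range(250), 2))  # every candidate record pair
print(len(pairs))                          # 31125
print(f"{150 / len(pairs):.2%}")           # ~0.48% duplicates: heavy class skew
```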
21. Methods of creating the committee
- Data partitioning: bad when data is limited
- Attribute partitioning: bad when data is plentiful
- Parameter perturbation: best overall
22. Importance of randomization
[Figure: accuracy curves for a naïve Bayes committee (left) and a decision tree committee (right).]
- It is important to randomize the selection for generative classifiers like naïve Bayes
23. Choosing the right classifier
- SVMs: good initially, but not effective at choosing instances
- Decision trees: best overall
24. Benefits of active learning
- Active learning is much better than random selection
- With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection
- Committee-based selection is close to optimal
25. Analyzing the selected instances
- Fraction of duplicates among the selected instances: 44%, starting from only 0.5% in the pool
- Replacing the non-duplicates in the active set with the same number of random non-duplicates gives only 40% accuracy
26. Related work
- Performance aspects, given a fixed function:
  - Hernandez and Stolfo (DMKD journal, 1998)
  - Monge and Elkan (KDD 1996)
- Designing domain-specific similarity functions:
  - Library catalogs: Toney 1992, Hilton 1996
  - Census Bureau data: Winkler 1995
- Learning approach:
  - Census Bureau: Winkler 1995, 1999
- Semi-supervised approach:
  - TAILOR: Elfeky, Verykios, and Elmagarmid (ICDE 2002)
  - Relevance feedback
- Other applications of active learning:
  - Argamon-Engelson and Dagan (JAIR 1999)
27. Conclusion and future work
- Interactive discovery of the deduplication function using active learning
- Manual effort reduced to:
  - Providing simple similarity functions
  - Labeling selected pairs (two orders of magnitude fewer than random selection)
- Analyzed the tradeoffs among various active learning methods
- Ongoing work:
  - Efficient evaluation on large data sets
  - Multi-table de-duplication