Interactive De-Duplicate using Active Learning
Sunita Sarawagi, Anuradha Bhamidipaty
Funded by the Ministry of Information Technology, India.
The de-duplication problem
  • Given a list of semi-structured records, find all records that refer to
    the same entity
  • Example applications
    • Data warehousing: merging name/address lists
      • Entity: person, household
    • Automatic citation databases (Citeseer)
      • Entity: paper
  • De-duplication is not clustering!
    • Precise external notion of correctness

  • Errors and inconsistencies in data
  • Duplicates may be spread far apart
    • May not be group-able using obvious keys
  • Domain-specific
    • Existing manual approaches require retuning with every new domain

Motivating example: citations
  • Our prior: duplicate when author, title, booktitle and year match
  • Author match could be hard
    • L. Breiman, L. Friedman, and P. Stone, (1984).
    • Leo Breiman, Jerome H. Friedman, Richard A.
      Olshen, and Charles J. Stone.
  • Conference match could be harder
    • In VLDB-94
    • In Proc. of the 20th Int'l Conference on Very
      Large Databases, Santiago, Chile, September 1994.

  • Fields may not be segmented
  • Word overlap could be misleading
    • Non-duplicates with lots of word overlap
      • H. Balakrishnan, S. Seshan, and R. H. Katz.,
        Improving Reliable Transport and Handoff
        Performance in Cellular Wireless Networks, ACM
        Wireless Networks, 1(4), December 1995.
      • H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz,
        "Improving TCP/IP Performance over Wireless
        Networks," Proc. 1st ACM Conf. on Mobile
        Computing and Networking, November 1995.
    • Duplicates with little overlap even in title
      • Johnson Laird, Philip N. (1983). Mental models.
        Cambridge, Mass. Harvard University Press.
      • P. N. Johnson-Laird. Mental Models: Towards a
        Cognitive Science of Language, Inference, and
        Consciousness. Cambridge University Press, 1983

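The learning approach
[Figure: example labeled pairs (Record 1 D Record 2; Record 1 N Record 3; Record 4 D Record 5, where D marks a duplicate pair and N a non-duplicate pair) are mapped through similarity functions f1, f2, ..., fn.]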
Example of a learnt function
  • Bibtex entries
[Figure: decision tree over similarity functions. Node tests: YearDifference > 1, TitleIsNull < 1, and threshold tests on All-Ngrams (0.48), AuthorTitleNgrams (0.4), PageMatch (0.5) and AuthorEditDist (0.8). Leaf label shown: Non Duplicate.]
Classifier automates the non-trivial task of combining simple similarity functions
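
To make "simple similarity functions" concrete, here is a minimal sketch of how a record pair could be mapped to a feature vector for the classifier. The function names, the record fields (author, title, year) and the choice of character trigrams with Jaccard overlap are illustrative assumptions, not the functions ALIAS actually ships with.

```python
def ngrams(text, n=3):
    """Set of lowercase character n-grams, with whitespace normalized."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, a value in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def year_difference(rec1, rec2):
    """Absolute difference of the (hypothetical) integer 'year' fields."""
    return abs(rec1["year"] - rec2["year"])


def similarity_vector(rec1, rec2):
    """One feature per similarity function; this is what a classifier sees."""
    return [
        ngram_overlap(rec1["author"], rec2["author"]),
        ngram_overlap(rec1["title"], rec2["title"]),
        year_difference(rec1, rec2),
    ]


# The two Johnson-Laird citations from the motivating example.
r1 = {"author": "Johnson Laird, Philip N.", "title": "Mental models",
      "year": 1983}
r2 = {"author": "P. N. Johnson-Laird",
      "title": "Mental Models Towards a Cognitive Science of Language, "
               "Inference, and Consciousness",
      "year": 1983}
features = similarity_vector(r1, r2)
```

No single feature settles the question for this pair, which is why the combination is left to a learnt classifier rather than a hand-tuned rule.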
Experiences with the learning approach
  • Too much manual search in preparing training data
    • Hard to spot challenging and covering sets of
      duplicates in large lists
    • Even harder to find close non-duplicates that
      will capture the nuances
  • Our solution: examine instances that are highly
    similar on one attribute but dissimilar on others
  • Active learning is a generalization of this!

The active learning approach
[Figure: example labeled pairs (Record 1 D Record 2; Record 3 N Record 4, where D marks a duplicate pair and N a non-duplicate pair) are mapped through similarity functions f1, f2, ..., fn.]
Working of ALIAS
  • Apply similarity functions on record pairs.
  • Loop until user satisfaction:
    • Train classifier.
    • Use active learning to select n instances.
    • Collect user feedback.
    • Augment with pairs inferred using transitivity.
    • Add to training set.
  • Output classifier.
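
The loop above can be sketched in a few lines under strong simplifying assumptions: each record pair is reduced to a single similarity score, the "classifier" is just a threshold on that score, "user satisfaction" is replaced by a fixed number of rounds, and the transitivity step is left out. The names train_threshold, alias_loop and ask_user are hypothetical.

```python
def train_threshold(labeled):
    """Toy classifier: a threshold midway between the highest-scoring
    non-duplicate and the lowest-scoring duplicate seen so far."""
    dup = [s for s, is_dup in labeled if is_dup]
    non = [s for s, is_dup in labeled if not is_dup]
    return (min(dup) + max(non)) / 2.0


def alias_loop(scores, ask_user, seed_labels, n=1, rounds=6):
    """scores: pair id -> similarity score.
    ask_user: pair id -> True (duplicate) or False (non-duplicate).
    seed_labels: initial labels, at least one pair of each class."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        # Train the classifier on everything labeled so far.
        threshold = train_threshold(
            [(scores[p], y) for p, y in labels.items()])
        # Select the n unlabeled pairs the classifier is least sure about,
        # here the ones whose score is closest to the threshold.
        pool = [p for p in scores if p not in labels]
        if not pool:
            break
        pool.sort(key=lambda p: abs(scores[p] - threshold))
        for p in pool[:n]:
            labels[p] = ask_user(p)  # collect feedback, add to training set
    # Output the classifier trained on the final training set.
    threshold = train_threshold([(scores[p], y) for p, y in labels.items()])
    return threshold, labels


# Simulated run: 101 pairs with scores 0..100; the simulated user calls a
# pair a duplicate when its score is at least 63.
scores = {p: float(p) for p in range(101)}
threshold, labels = alias_loop(scores, lambda p: scores[p] >= 63,
                               {0: False, 100: True})
```

In this simulated run the threshold lands between 62 and 63 after six user answers, because each query roughly halves the interval in which the boundary can still lie.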

The ALIAS deduplication system
  • Interactive discovery of deduplication function
    using active learning
  • Manual effort reduced to
    • Providing simple similarity functions
    • Labeling selected pairs
  • Efficient indexing mechanism
  • Novel cluster-based evaluation engine
  • Cost-based optimizer

Architecture of ALIAS
[Figure: architecture diagram. Components shown: initial training records, similarity functions (F), mapped labeled instances, training data T, infer pairs using transitivity, train classifier, pool of mapped unlabeled instances, active learner (select instances), deduplication function, predicate for uncertain region, similarity indices, evaluation engine, groups of duplicates in A.]
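
The "infer pairs using transitivity" step can be illustrated as follows, assuming the user's labels are consistent: if A and B are duplicates and B and C are duplicates, then A and C are duplicates; if A and B are duplicates and B and C are not, then A and C are not. This is a sketch using a union-find structure; the function name and data layout are assumptions, not ALIAS internals.

```python
from itertools import combinations


def infer_by_transitivity(dup_pairs, nondup_pairs):
    """Return (duplicate pairs, non-duplicate pairs) implied by the labels,
    each as a set of frozensets of two record ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge records that were labeled as duplicates of each other.
    for a, b in dup_pairs:
        parent[find(a)] = find(b)
    for a, b in nondup_pairs:
        find(a)
        find(b)

    # Collect the resulting groups of mutually duplicate records.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)

    # Every pair inside a group is a duplicate pair.
    dups = {frozenset(p) for g in groups.values()
            for p in combinations(sorted(g), 2)}

    # A non-duplicate label between two records separates their whole groups.
    nondups = set()
    for a, b in nondup_pairs:
        for x in groups[find(a)]:
            for y in groups[find(b)]:
                nondups.add(frozenset((x, y)))
    return dups, nondups


dups, nondups = infer_by_transitivity(
    dup_pairs=[("r1", "r2"), ("r2", "r3")],
    nondup_pairs=[("r3", "r4")],
)
```

Three user labels yield six labeled pairs here, which is how transitivity stretches each answer the user gives.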
Example: active learning
Assume: points from two classes (red and green) on a real line, perfectly
separable by a single point separator.
[Figure: labeled and unlabeled points on the line, showing sure reds, sure greens, and the region of uncertainty between them.]
The point x that has the greatest uncertainty of prediction yields the
largest expected reduction in the uncertainty region.
  • Implicit measure
    • Train classifier
    • For each unlabeled instance
      • Measure prediction uncertainty
    • Choose a representative instance with high uncertainty
  • Explicit measure
    • For each unlabeled instance
      • Add to training data
      • Train classifier
      • Measure classifier error, quantified as
        • Size of confusion region, or
        • Sum of prediction uncertainty on all instances
    • Choose the instance that yields the lowest error
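
For the red/green example on the real line, the implicit measure can be sketched as below. The sketch assumes all reds lie to the left of all greens and treats the point nearest the middle of the uncertainty region as the most uncertain one; the function names are hypothetical.

```python
def uncertainty_region(labeled):
    """labeled: list of (x, color) with color 'red' or 'green'.
    Returns the open interval in which the separator can still lie."""
    lo = max(x for x, color in labeled if color == "red")
    hi = min(x for x, color in labeled if color == "green")
    return lo, hi


def most_uncertain(unlabeled, labeled):
    """Pick the unlabeled point whose predicted class is least certain,
    i.e. the one closest to the midpoint of the uncertainty region."""
    lo, hi = uncertainty_region(labeled)
    mid = (lo + hi) / 2.0
    inside = [x for x in unlabeled if lo < x < hi]
    if not inside:
        return None  # every remaining point is a sure red or a sure green
    return min(inside, key=lambda x: abs(x - mid))


labeled = [(0.0, "red"), (10.0, "green")]
unlabeled = [1.0, 4.5, 7.0, 9.0]
query = most_uncertain(unlabeled, labeled)

# Whatever the answer for the queried point, the region shrinks to about
# half its size; querying a point near the edge (such as 1.0) may leave
# almost all of it.
if_red = uncertainty_region(labeled + [(query, "red")])
if_green = uncertainty_region(labeled + [(query, "green")])
```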

Measuring prediction certainty
  • Classifier-specific methods
    • Support vector machines
      • Distance from separator
    • Naïve Bayes classifier
      • Posterior probability of winning class
    • Decision tree classifier
      • Weighted sum of distance from different
        boundaries, error of the leaf, depth of the
        leaf, etc.
  • Committee-based approach
    • Disagreements amongst members of a committee
    • Most successfully used method
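
Two of the classifier-specific measures can be written as one-line scores, shown here as a rough sketch. The mapping from separator distance to a score in (0, 1] is an arbitrary illustrative choice, and both function names are hypothetical; only the ordering of instances by the score matters for selection.

```python
def uncertainty_from_posterior(p_winning):
    """Naive Bayes style: the lower the posterior probability of the
    winning class, the more uncertain the prediction."""
    return 1.0 - p_winning


def uncertainty_from_distance(distance):
    """SVM style: instances close to the separator are uncertain.
    1 / (1 + |d|) is an assumed monotone mapping, not a standard formula."""
    return 1.0 / (1.0 + abs(distance))


# Rank three hypothetical instances by their distance from the separator.
distances = {"pair_a": 2.5, "pair_b": -0.1, "pair_c": 0.8}
ranked = sorted(distances,
                key=lambda p: uncertainty_from_distance(distances[p]),
                reverse=True)
```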

Committee-based algorithm
  • Train k classifiers C1, C2, ..., Ck on training data
  • For each unlabeled instance x
    • Find predictions y1, ..., yk from the k classifiers
    • Compute uncertainty U(x) as the entropy of the above y-s
  • Sampling for representativeness
    • With weight U(x), do weighted sampling to
      select an instance for labeling.
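
A minimal sketch of the committee-based selection step, assuming each committee member is a callable that returns a class label. U(x) is computed as the entropy of the vote distribution and then used as a sampling weight; the names vote_entropy and select_for_labeling are hypothetical.

```python
import math
import random
from collections import Counter


def vote_entropy(predictions):
    """Entropy (in bits) of the committee's predicted labels for one x."""
    k = len(predictions)
    counts = Counter(predictions)
    return -sum((c / k) * math.log2(c / k) for c in counts.values())


def select_for_labeling(unlabeled, committee, rng=random):
    """Weighted sampling with weight U(x), so that uncertain instances are
    favored without always picking the single most extreme one."""
    weights = [vote_entropy([clf(x) for clf in committee])
               for x in unlabeled]
    if sum(weights) == 0:
        return rng.choice(unlabeled)  # committee agrees everywhere
    return rng.choices(unlabeled, weights=weights, k=1)[0]


# Toy committee: three threshold classifiers on a single similarity score.
committee = [lambda x, t=t: x > t for t in (0.4, 0.5, 0.6)]
unlabeled = [0.1, 0.45, 0.9]
chosen = select_for_labeling(unlabeled, committee, random.Random(0))
```

In this toy run the committee only disagrees on 0.45, so it is the only instance with non-zero weight and is always the one selected.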

Forming a classifier committee
  • Data partitioning
    • Resampling training data
  • Attribute partitioning
  • Random parameter perturbation
    • Probabilistic classifiers
      • Sample from posterior distribution on parameters
        given training data.
      • Example: binomial parameter p has a beta
        distribution with mean p
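
As an illustration of parameter perturbation for a probabilistic classifier, the sketch below draws k values of a binomial parameter from its beta posterior, one per committee member. The uniform Beta(1, 1) prior and the function name are assumptions made for the example.

```python
import random


def sample_binomial_params(successes, failures, k, rng=random):
    """Draw k values of p from the Beta(successes + 1, failures + 1)
    posterior that results from a uniform prior (an assumed prior)."""
    return [rng.betavariate(successes + 1, failures + 1) for _ in range(k)]


# 30 of 100 training instances had the feature: each committee member gets
# its own slightly different estimate of p, scattered around 0.3.
rng = random.Random(0)
committee_params = sample_binomial_params(30, 70, k=5, rng=rng)
```

With little training data the sampled values spread widely, so the committee members disagree more; as data accumulates the posterior narrows and the members converge.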

Randomly perturbing trees
  • Selecting split attribute
    • Normally: attribute with lowest entropy
    • Perturbed: random attribute within close range
      of lowest
  • Selecting a split point
    • Normally: midpoint of range with lowest entropy
    • Perturbed: a random point anywhere in the range
      with lowest entropy
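
The two perturbations can be sketched as small replacements for the usual deterministic choices in a tree learner. The tolerance parameter that defines "close range of lowest" is a hypothetical knob, and the entropy values are made up for the example.

```python
import random


def choose_split_attribute(entropy_by_attr, tolerance=0.05, rng=random):
    """Instead of always taking the lowest-entropy attribute, pick at
    random among attributes within `tolerance` of the lowest entropy."""
    best = min(entropy_by_attr.values())
    close = [a for a, e in entropy_by_attr.items() if e <= best + tolerance]
    return rng.choice(close)


def choose_split_point(low, high, rng=random):
    """Instead of the midpoint of the lowest-entropy range [low, high],
    pick a uniformly random point inside that range."""
    return rng.uniform(low, high)


rng = random.Random(1)
entropies = {"AuthorEditDist": 0.30, "PageMatch": 0.32,
             "YearDifference": 0.80}
attribute = choose_split_attribute(entropies, rng=rng)
split = choose_split_point(0.4, 0.6, rng=rng)
```

Calling these with k different random seeds gives k slightly different trees from the same training data, which is what the committee needs.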

Experimental analysis
  • 250 references from Citeseer → 32000 pairs, of
    which only 150 are duplicates
  • Citeseer's script used to segment into author,
    title, year, page and rest.
  • 20 text and integer similarity functions
  • Initial labeled set: just two pairs
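
A quick arithmetic check of these figures, treating the slide's numbers as rounded: n records give n(n-1)/2 candidate pairs, and 150 duplicates among roughly 32000 pairs is about 0.5% of the pool.

```python
from math import comb


def candidate_pairs(n_records):
    """Number of unordered record pairs among n records."""
    return comb(n_records, 2)


def duplicate_fraction(n_duplicates, n_pairs):
    """Share of duplicate pairs in the pool of all pairs."""
    return n_duplicates / n_pairs


pairs = candidate_pairs(250)              # 31125, close to the quoted 32000
fraction = duplicate_fraction(150, 32000)  # 0.0046875, i.e. about 0.5%
```

The quadratic growth in pairs and the tiny share of duplicates are what make random sampling of training pairs so unproductive.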

Methods of creating committee
  • Data partition: bad when data is limited
  • Attribute partition: bad when data is sufficient
  • Parameter perturbation: best overall

Importance of randomization
[Figure: two panels, one for Naïve Bayes and one for Decision tree.]
  • Important to randomize selection for generative
    classifiers like naïve Bayes

Choosing the right classifier
  • SVMs: good initially but not effective in choosing instances
  • Decision trees: best overall

Benefits of active learning
  • Active learning much better than random
    • With only 100 active instances:
      97% accuracy; random only 30%
  • Committee-based selection close to optimal

Analyzing selected instances
  • Fraction of duplicates in selected instances: 44%,
    starting with only 0.5%
  • Non-duplicates in the active set, if replaced with the
    same number of random non-dups, give only 40%

Related work
  • Performance aspects given fixed function
    • Hernandez and Stolfo (DMKD journal, 1998)
    • Monge and Elkan (KDD-1996)
  • Designing domain-specific similarity functions
    • Library catalogs: Toney 1992, Hilton 1996
    • Census Bureau data: Winkler 1995
  • Learning approach
    • Census Bureau: Winkler 1995, 1999
  • Semi-supervised approach
  • Tailor: Elfeky, Verykios, Elmagarmid in ICDE
  • Relevance feedback
  • Other applications of active learning
    • Argamon-Engelson and I. Dagan, JAIR 1999

Conclusion and future work
  • Interactive discovery of deduplication function
    using active learning
  • Manual effort reduced to
    • Providing simple similarity functions
    • Labeling selected pairs: two orders of magnitude
      fewer than random
  • Analyzed tradeoffs in various active learning
  • Ongoing work
    • Efficient evaluation on large data sets
    • Multi-table de-duplication