Title: Warren Shen, Xin Li, AnHai Doan
1Constraint-Based Entity Matching
- Warren Shen, Xin Li, AnHai Doan
- Database and AI Groups
- University of Illinois, Urbana
2Entity Matching
- Decide if mentions refer to the same real-world entity
- Key problem in numerous applications
- Information integration
- Natural language understanding
- Semantic Web
Chris Li, Jane Smith. Numerical Analysis. SIAM 2001
Chen Li, Doug Chan. Ensemble Learning
C. Li, D. Chan. Ensemble Learning. ICML 2003
3State of the Art
- Numerous solutions in the AI, Database, and Web communities
- Cohen, Ravikumar, Fienberg 2003
- Li, Morie, Roth 2004
- Bhattacharya & Getoor 2004
- McCallum, Nigam, Ungar 2000
- Pasula et al. 2003
- Wellner et al. 2004
- Most solutions largely exploit only syntactic similarity
- Jeff Smith vs. J. Smith
- (217) 235-1234 vs. 235-1234
4Semantic Constraints
- Incompatible
- Subsumption
- Layout
C. Li. User Interfaces. SIGCHI 2000
C. Li, J. Smith. Numerical Analysis. SIAM 2001
Numerical Analysis, SIAM 2001 with J. Smith.
Chris Li's Homepage
Chris Li, Jane Smith. Numerical Analysis. SIAM 2001
DBLP
Chen Li, Doug Chan. Ensemble Learning. ICML 2003
C. Li. Data Mining. KDD 2000
Chen Li's Homepage
5Numerous Semantic Constraint Types
Type | Example
Aggregate | No researcher has chaired more than 3 conferences in a year
Subsumption | If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood | If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible | No researcher exists who has published in both HCI and numerical analysis
Layout | If two mentions in the same document share similar names, they are likely to match
Uniqueness | Mentions in the PC listing of a conference refer to different researchers
Ordering | If two citations match, then their authors will be matched in order
Individual | The researcher named Mayssam Saria has fewer than five mentions in DBLP (e.g., being a new graduate student with fewer than five papers)
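Several of these constraint types can be read as simple predicates over a candidate assignment of mentions to entities. The following is a minimal sketch of two of them (Uniqueness and Aggregate), assuming hypothetical data structures: `assignment` maps mention ids to entity ids, and `chaired` is a list of (entity, year, conference) records. Neither the function names nor the data layout come from the slides.

```python
from collections import Counter

def violates_uniqueness(assignment, pc_mentions):
    """True if two mentions from one PC listing are assigned the same entity.

    assignment: dict mention_id -> entity_id (hypothetical layout)
    pc_mentions: mention ids that appear in a single PC listing
    """
    counts = Counter(assignment[m] for m in pc_mentions)
    return any(n > 1 for n in counts.values())

def violates_aggregate(chaired, limit=3):
    """True if some entity chaired more than `limit` conferences in one year.

    chaired: list of (entity_id, year, conference) tuples (hypothetical layout)
    """
    counts = Counter((entity, year) for entity, year, _conf in chaired)
    return any(n > limit for n in counts.values())

assignment = {"m1": "e1", "m2": "e2", "m3": "e1"}
print(violates_uniqueness(assignment, ["m1", "m2"]))  # distinct entities
print(violates_uniqueness(assignment, ["m1", "m3"]))  # both mapped to e1
```

Writing each constraint as a predicate keeps the types in the table interchangeable from the point of view of whatever procedure later checks them.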
6Our Contributions
- Develop a solution to exploit semantic constraints
- Models constraints in a uniform probabilistic manner
- Clusters mentions using a generative model
- Uses relaxation labeling to handle constraints
- Adds a pairwise layer to further improve accuracy
- Experimental results on two real-world domains
- Researchers, IMDB
- Improved accuracy over state of the art by 3-12% F-1
7Probabilistic Modeling of Constraints
- Modeled as the effect on the probability that a mention refers to a real-world entity
- If two mentions in the same document share similar names, they are likely to match
- Constraint probabilities have a natural interpretation
- Can be learned or manually specified by a domain expert
P(m2 = e1 | m1 = e1) = 0.8
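One way such a constraint probability might be learned is by counting, in labeled documents, how often the constraint's conclusion holds when its premise applies. The sketch below does this for the layout constraint. The name-similarity test, the function names, and the labeled-data layout are all assumptions made for illustration and are not taken from the slides.

```python
from itertools import combinations

def similar_names(a, b):
    """Crude similarity (an assumption): same last token and same first initial."""
    ta, tb = a.replace(".", "").split(), b.replace(".", "").split()
    return ta[-1].lower() == tb[-1].lower() and ta[0][0].lower() == tb[0][0].lower()

def estimate_layout_probability(labeled_docs):
    """Fraction of similar-name, same-document mention pairs that truly match.

    labeled_docs: list of documents, each a list of (mention_text, entity_id).
    Returns None if the constraint never applies.
    """
    applicable = matched = 0
    for doc in labeled_docs:
        for (name1, ent1), (name2, ent2) in combinations(doc, 2):
            if similar_names(name1, name2):
                applicable += 1
                matched += ent1 == ent2
    return matched / applicable if applicable else None

docs = [
    [("Chen Li", "e1"), ("C. Li", "e1")],
    [("Chris Li", "e2"), ("C. Li", "e2"), ("Chen Li", "e1")],
]
print(estimate_layout_probability(docs))  # 2 of the 4 applicable pairs match: 0.5
```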
8The Entity Matching Problem
- Solution
- Model document generation
- Cluster mentions using this model
9Modeling Document Generation
- Generate mentions for each document
- Select entities
- Generate and sprinkle mentions
- Check constraints for each mention
- Decide whether to enforce constraint c
- If enforced, check if mention violates c
- If yes, discard documents and repeat process
- (Extension of model in Li, Morie, Roth 2004)
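The generation steps above can be sketched as a rejection-sampling loop: draw a document, check each mention against each constraint that happens to be enforced, and start over on a violation. This is a toy illustration only. The entity table, the 1-2 mentions per entity, and the hard "same surname, different entity" violation test are assumptions and do not reproduce the actual model.

```python
import random

def surname(name):
    """Last whitespace-separated token, lowercased (a crude assumption)."""
    return name.split()[-1].lower()

def violates_layout(doc, i):
    """Hypothetical violation test: mention i shares a surname with another
    mention in the same document that refers to a different entity."""
    name_i, ent_i = doc[i]
    return any(surname(name_j) == surname(name_i) and ent_j != ent_i
               for j, (name_j, ent_j) in enumerate(doc) if j != i)

def generate_document(entities, constraints, rng, n_entities=2, max_tries=1000):
    """entities: dict entity_id -> list of name variants.
    constraints: list of (enforcement_probability, violates(doc, i))."""
    for _ in range(max_tries):
        chosen = rng.sample(sorted(entities), n_entities)       # select entities
        doc = [(rng.choice(entities[e]), e)                     # generate mentions
               for e in chosen for _ in range(rng.randint(1, 2))]
        rng.shuffle(doc)                                        # sprinkle them
        # For each mention, decide whether to enforce each constraint;
        # if enforced and violated, discard the document and try again.
        if not any(rng.random() < prob and violates(doc, i)
                   for i in range(len(doc))
                   for prob, violates in constraints):
            return doc
    raise RuntimeError("no constraint-satisfying document found")

ENTITIES = {
    "e1": ["Chen Li", "C. Li"],
    "e2": ["Chris Li", "C. Li"],
    "e3": ["Jane Smith", "J. Smith"],
}
doc = generate_document(ENTITIES, [(1.0, violates_layout)], random.Random(0))
print(doc)
```

With an enforcement probability of 1.0 the constraint behaves as a hard one; a lower value lets some violating documents through, which is how a soft constraint could be represented in this sketch.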
10Clustering with the Generative Model
- Find mention assignments F and model parameters θ to maximize P(D, F | θ)
- Difficult to compute exactly, so use a variant of EM
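One common way to make this tractable is a hard-assignment variant of EM that alternates between the two unknowns. The slide does not say which variant is used, so the updates below are an illustrative assumption. Here θ denotes the model parameters, and the iteration index t is introduced only for this sketch.

```latex
\begin{aligned}
F^{(t+1)} &= \arg\max_{F} \; P\bigl(D, F \mid \theta^{(t)}\bigr) \\
\theta^{(t+1)} &= \arg\max_{\theta} \; P\bigl(D, F^{(t+1)} \mid \theta\bigr)
\end{aligned}
```

Neither step can decrease P(D, F | θ), so the iteration climbs toward a local maximum, with no guarantee of reaching the global one.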
11Incorporating Constraints
- Extend the step that assigns mentions
- Basic mention assignment
- Extension: use constraints to improve mention assignments
12Enforcing Constraints on Clusters
- Apply constraints at each iteration
- Use relaxation labeling to apply constraints to mention assignments
13Relaxation Labeling
- Start with an initial labeling of mentions with entities
- Iteratively improve mention labels, given constraints
- Can be extended to probabilistic constraints
- Scalable
Chen Li: e1 | C. Li: e2 | Y. Lee: e3
Chris Lee: e2 | Jane Smith: e4
C. Lee: e2 | Smith, J: e4
Constraints: c1 = layout constraint, p(c1) = 0.8
14Relaxation Labeling
- Start with an initial labeling of mentions with entities
- Iteratively improve mention labels, given constraints
- Can be extended to probabilistic constraints
- Scalable
Chen Li: e1 | C. Li: e2 → e1 | Y. Lee: e3
Chris Lee: e2 | Jane Smith: e4
C. Lee: e2 | Smith, J: e4
Constraints: c1 = layout constraint, p(c1) = 0.8
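The relabeling of C. Li from e2 to e1 can be reproduced with a small relaxation-labeling sketch. It uses the classical multiplicative update, in which each label's probability is scaled by (1 + support) and renormalized, with support coming from same-document neighbors that have similar names, weighted by p(c1) = 0.8. The grouping of mentions into documents d1 and d2, the initial probabilities, and the name-similarity test are assumptions made for illustration; the slides do not specify the exact update rule.

```python
def name_key(name):
    """(surname, first initial) under a crude assumption about name formats."""
    if "," in name:                       # "Smith, J"
        last, first = [p.strip() for p in name.split(",", 1)]
    else:                                 # "Jane Smith" or "J. Smith"
        parts = name.split()
        first, last = parts[0], parts[-1]
    return last.lower(), first[0].lower()

def similar(a, b):
    return name_key(a) == name_key(b)

def relaxation_labeling(mentions, init, constraint_prob=0.8, iterations=10):
    """mentions: dict mention_id -> (doc_id, name)
    init: dict mention_id -> dict entity_id -> initial probability"""
    probs = {m: dict(dist) for m, dist in init.items()}
    for _ in range(iterations):
        new_probs = {}
        for m, (doc, name) in mentions.items():
            # Layout constraint: same-document mentions with similar names.
            neighbors = [n for n, (d, other) in mentions.items()
                         if n != m and d == doc and similar(name, other)]
            updated = {}
            for entity, p in probs[m].items():
                support = sum(probs[n].get(entity, 0.0) for n in neighbors)
                if neighbors:
                    support /= len(neighbors)
                updated[entity] = p * (1.0 + constraint_prob * support)
            total = sum(updated.values())
            new_probs[m] = {e: v / total for e, v in updated.items()}
        probs = new_probs                 # synchronous update of all mentions
    return {m: max(dist, key=dist.get) for m, dist in probs.items()}

mentions = {
    "m1": ("d1", "Chen Li"), "m2": ("d1", "C. Li"), "m3": ("d1", "Y. Lee"),
    "m4": ("d2", "Chris Lee"), "m5": ("d2", "Jane Smith"),
}
init = {
    "m1": {"e1": 1.0},
    "m2": {"e1": 0.4, "e2": 0.6},         # initially labeled e2
    "m3": {"e3": 1.0},
    "m4": {"e2": 1.0},
    "m5": {"e4": 1.0},
}
labels = relaxation_labeling(mentions, init)
print(labels["m2"])  # e1
```

Each update touches only a mention's same-document neighbors, which is one reason this style of local propagation tends to scale well.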
15Handling Probabilistic Constraints
- Relaxation labeling can combine multiple probabilistic constraints
16Pairwise Layer
- So far, we have applied constraints to clusters
- It may be unclear how to enforce constraints on clusters
- Add a pairwise layer
- Convert clusters into predicted matching pairs
- Remove only pairs that negative pairwise hard constraints apply to
Constraint: C. Li ≠ Li, C.
Remove C. Li or Li, C.?
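The pairwise layer sidesteps the question of which mention to evict from a cluster: it expands each cluster into pairs and drops only the pairs a negative hard constraint forbids. A minimal sketch follows, assuming clusters are sets of mention strings and constraints are given as unordered pairs. The function names are hypothetical.

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Expand each cluster into all unordered pairs of its mentions."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def apply_negative_constraints(pairs, cannot_match):
    """Remove only the pairs that a negative hard constraint applies to."""
    forbidden = {tuple(sorted(p)) for p in cannot_match}
    return {p for p in pairs if p not in forbidden}

clusters = [{"C. Li", "Li, C.", "Chen Li"}]
pairs = clusters_to_pairs(clusters)
kept = apply_negative_constraints(pairs, [("Li, C.", "C. Li")])
print(sorted(kept))
```

In this example both C. Li and Li, C. remain paired with Chen Li, and only the forbidden pair between the two is removed.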
17Empirical Evaluation
- Two real-world domains
- Researchers, IMDB
- For each domain
- Collected documents
- Researchers: homepages from DBLP and the web
- IMDB: text and structured records from IMDB
- Marked up mentions and their attributes
- 4,991 researcher mentions
- 3,889 movie titles from IMDB
- Manually identified all correct matching pairs
- Evaluation Metrics
- Precision = true positives / predicted pairs
- Recall = true positives / correct pairs
- F1 = (2 P R) / (P + R)
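These pairwise metrics can be computed directly from the sets of predicted and correct matching pairs. The sketch below treats pairs as unordered. The function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def pairwise_scores(predicted, correct):
    """Return (precision, recall, F1) over unordered matching pairs."""
    predicted = {frozenset(p) for p in predicted}
    correct = {frozenset(p) for p in correct}
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

predicted = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
correct = [("b", "a"), ("d", "e"), ("f", "g")]
p, r, f1 = pairwise_scores(predicted, correct)
print(p, r, f1)  # precision 2/4, recall 2/3, F1 4/7
```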
18Using Constraints Improves Accuracy
- Relaxation labeler improves F-1 by 3-12%
- Relaxation labeling is very fast
19Using Constraints Individually
- Each constraint makes a contribution
20Related Work
- Much work in entity matching
- Cohen, Ravikumar, Fienberg 2003
- Li, Morie, Roth 2004
- Bhattacharya & Getoor 2004
- McCallum, Nigam, Ungar 2000
- Pasula et al. 2003
- Wellner et al. 2004
- Recent work has looked at exploiting semantic constraints
- Personal Information Management (Dong et al. 2004)
- Profiler-based entity matching (Doan et al. 2003)
- Semantic constraints successfully exploited in other applications
- Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)
21Summary and Future Work
- Exploit semantic constraints in entity matching
- Models constraints in a uniform probabilistic manner
- Uses a generative model and relaxation labeling to handle constraints in a scalable way
- Experimental results on two real-world domains show effectiveness
- Future work: learning constraints effectively from current or external data