Title: Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme
1Relational Classification Using Automatically
Extracted Relations by Record Linkage
- Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme
- Information Systems and Machine Learning Lab (ISMLL)
- University of Hildesheim
- Germany
2Outline
- Motivation
- Relation Extraction and Multi-Relational Classification Framework
  - Relation Extraction
  - Multi-Relational Classification
- Evaluation
- Conclusion
3Motivation
Publication  Title                                      Author      Conference  Category
1            Classification of scientific publications  John Smith  ICDM        Data Mining
2            Classification of Hypertext                John Smith  KDD         Data Mining
3            Hierarchical Clustering                    Dan Miller  ICDM        Data Mining
4Motivation
- Traditional classifiers take only local attributes like keywords, title and abstract into account
  - Assumption: instances are independent
- But: this assumption does not hold
  - Instances can be related to other documents by authorship, citations, same conference etc.
  - These relations should be exploited and combined in order to improve classification accuracy.
- But: manual extraction of relations by experts is expensive
- Therefore: automatic extraction of relations from noisy attributes.
5Relation Extraction and Relational Classification
Framework
- Relation Extraction Component
  - Extraction of relations from objects with noisy attributes
- Multi-Relational Classification Component
  - Use extracted relations instead of, or in addition to, local attributes for classification
6Relation Extraction
- Pairwise feature extraction
  - From noisy attributes with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein)
- Probabilistic pairwise decision model
  - Use the extracted similarities as features for a probabilistic classifier and build a model on the training data
  - Apply the model to unknown pairs
- Collective decision model
  - If the target relation is an equivalence relation, use constrained clustering (e.g. hierarchical agglomerative clustering, HAC) with the pairwise decision model as a learned similarity measure to turn the pairwise decisions into a binary relation
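The pairwise feature extraction step could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the record layout (a dict of attribute strings) and the choice of normalised Levenshtein plus token-level Jaccard are assumptions, and TFIDF/cosine are left out for brevity. The resulting feature vector is what a probabilistic pairwise classifier would be trained on.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic program, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard overlap of the lower-cased whitespace tokens of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_features(rec_a, rec_b, attributes):
    """One feature vector per record pair: two similarities per attribute."""
    feats = []
    for attr in attributes:
        va, vb = rec_a.get(attr, ""), rec_b.get(attr, "")
        feats.append(levenshtein_similarity(va, vb))
        feats.append(jaccard(va, vb))
    return feats
```

Each pair of records yields one vector; labelled pairs (same object or not) would then serve as training data for the pairwise decision model.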
7Relation Extraction
[Figure: Collective Decision Model, with stages Initialisation, Must Links and Cannot Links]
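A constrained agglomerative clustering with must-links and cannot-links could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' procedure: `constrained_hac`, the `sim` callback (standing in for the learned pairwise probability), the average-linkage rule and the stopping `threshold` are all hypothetical choices.

```python
from itertools import combinations


def constrained_hac(n, sim, must_links=(), cannot_links=(), threshold=0.5):
    """Cluster objects 0..n-1 using average linkage under pair constraints.

    sim(i, j) is assumed to return a similarity in [0, 1], e.g. the
    probability from a pairwise decision model that i and j are equivalent.
    """
    # Initialisation: every object starts in its own cluster.
    clusters = [{i} for i in range(n)]
    cannot = {frozenset(p) for p in cannot_links}

    def find(i):
        return next(c for c in clusters if i in c)

    def merge(a, b):
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    def violates(a, b):
        # A merge is forbidden if it would join a cannot-link pair.
        return any(frozenset((i, j)) in cannot for i in a for j in b)

    def avg_sim(a, b):
        return sum(sim(i, j) for i in a for j in b) / (len(a) * len(b))

    # Must-links: merge the affected clusters before any similarity is used.
    for i, j in must_links:
        a, b = find(i), find(j)
        if a is not b:
            merge(a, b)

    # Greedily merge the most similar admissible pair of clusters.
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            if violates(a, b):
                continue
            s = avg_sim(a, b)
            if best is None or s > best[0]:
                best = (s, a, b)
        if best is None or best[0] < threshold:
            break
        merge(best[1], best[2])
    return clusters
```

Because the output is a partition, the induced "same cluster" relation is reflexive, symmetric and transitive, which is why clustering fits when the target relation is an equivalence relation.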
8Multi-Relational Classification
- Relational classification problem
  - Make use of additional information of related objects (i.e. their classes or attributes)
  - Propositionalize the relational data, e.g. with a set-valued attribute built from the neighborhood of each object [formula missing in the source]
9Multi-Relational Classification
- Algorithm
  1. For each relation R_1, ..., R_m:
     (a) Build an undirected weighted graph from the relation
     (b) Perform relational classification simultaneously for all instances in the test set
     (c) Output a probability distribution
  2. Apply ensemble classification to the resulting probability distributions of these relations
  3. Output the final classification
10Multi-Relational Classification
- Simple Relational Methods
  - Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]
    - The update rule involves a normalization factor, a weight and an iteration index [formula missing in the source]
  - EPRN2HOP
    - Additionally takes the neighbors of the direct neighbors into account if the direct neighborhood is small
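Since the update formula itself is missing, the following is only a hedged sketch of a weighted relational-neighbor update of the kind the slide describes: each unlabeled node's class distribution is the weight-averaged distribution of its neighbors, normalised to sum to one and recomputed over several iterations. The function name, the graph layout, the uniform starting distribution and the fixed iteration count are assumptions, not details taken from the source.

```python
def relational_neighbor(graph, known, classes, iterations=10):
    """Iterative weighted-vote classification on an undirected graph.

    graph:   {node: {neighbor: weight}}, with edges listed in both directions
    known:   {node: class} for labelled (training) nodes
    classes: list of all class labels
    Returns {node: {class: probability}}.
    """
    probs = {}
    for node in graph:
        if node in known:
            # Labelled nodes keep a fixed one-hot distribution.
            probs[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            # Assumption: unlabelled nodes start from a uniform distribution.
            probs[node] = {c: 1.0 / len(classes) for c in classes}

    for _ in range(iterations):
        new_probs = dict(probs)
        for node, neighbors in graph.items():
            if node in known:
                continue
            scores = {c: sum(w * probs[n][c] for n, w in neighbors.items())
                      for c in classes}
            z = sum(scores.values())  # normalization factor
            if z > 0:
                new_probs[node] = {c: s / z for c, s in scores.items()}
        probs = new_probs  # synchronous update: all test nodes at once
    return probs
```

A two-hop variant would extend `neighbors` with the neighbors of neighbors whenever a node has few direct neighbors; that extension is not shown here.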
11Multi-Relational Classification
- Aggregation-based Relational Learning Methods
  - Use aggregation functions in order to propositionalize the set-valued attribute
  - Use aggregated values as attributes for traditional machine learning methods
  - We used logistic regression as the classifier
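To make the aggregation step concrete, here is a small sketch that turns the set of neighbor classes into a fixed-length feature vector. Which aggregation functions the authors actually used is not stated on the slide; `count` and `proportion` are assumed examples, and the function name is hypothetical.

```python
from collections import Counter


def aggregate_neighbor_classes(neighbor_classes, classes, mode="count"):
    """Propositionalize a set-valued attribute into one number per class.

    neighbor_classes: class labels of the related (neighboring) objects
    classes:          fixed, ordered list of all class labels
    """
    counts = Counter(neighbor_classes)
    if mode == "count":
        return [float(counts[c]) for c in classes]
    if mode == "proportion":
        total = sum(counts[c] for c in classes)
        return [counts[c] / total if total else 0.0 for c in classes]
    raise ValueError("unknown aggregation mode: %r" % mode)
```

The resulting vectors have the same length for every object, so they can be passed to a standard learner such as logistic regression.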
12Ensemble Classification
- Methods which combine different models
  - Increase classification accuracy
- Usage
  - Combine results achieved by relational classification for different relations
  - Combine results of relational and local models
- Voting
- Stacking
  - Use a meta-classifier to learn a model on the results of different models
  - Build new instances
  - Apply cross validation
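The two combination schemes could be sketched as below, assuming each base model outputs a class probability distribution as a dict. Averaging the distributions is only one possible form of voting, and the stacking helper just builds the new instance for a meta-classifier; all names are hypothetical.

```python
def average_vote(distributions):
    """Voting by averaging the class distributions of several models."""
    classes = sorted(set().union(*distributions))
    n = len(distributions)
    return {c: sum(d.get(c, 0.0) for d in distributions) / n for c in classes}


def predict(distribution):
    """Return the class with the highest combined probability."""
    return max(distribution, key=distribution.get)


def stacking_instance(distributions, classes):
    """Concatenate the base models' outputs into one meta-level instance."""
    return [d.get(c, 0.0) for d in distributions for c in classes]
```

For stacking, the meta-level instances should come from base-model predictions on data the base models were not trained on, which is where the cross validation mentioned above comes in.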
13Evaluation
- Data
  - CompuScience data set
    - 147 571 scientific papers
    - 77 topics (categories)
    - Relations: authors, reviewers, journals
  - Cora deduplication data set
    - 1 295 citations
    - 112 unique publications
    - Relation: SamePaper
  - Cora data set
    - 3 298 papers
    - 12 categories
    - Relations: conferences, authors, citations
14Evaluation Relation Extraction
F1 measure for finding the SamePaper relation on Cora
Evaluation set  Single linkage  Complete linkage  Average linkage
X_tst           0.90            0.74              0.92
X               0.92            0.71              0.93
Pairwise feature extraction with TFIDF, Levenshtein, Jaccard and cosine similarity on all attributes
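One common way to score an extracted equivalence relation is F1 over object pairs; whether the numbers above were computed exactly this way is an assumption. The sketch below treats every pair placed in the same cluster as a predicted SamePaper link and compares it with the true links. Function names and the dict-based cluster assignment are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(assignment):
    """All unordered pairs of objects that share a cluster id."""
    return {frozenset((a, b))
            for a, b in combinations(assignment, 2)
            if assignment[a] == assignment[b]}


def pairwise_f1(predicted, truth):
    """F1 of predicted same-cluster pairs against true same-cluster pairs.

    predicted, truth: {object: cluster_id} over the same set of objects.
    """
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    tp = len(pred_pairs & true_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)
```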
15Evaluation Multi-Relational Classification
3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations
- The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier
16Evaluation
- Multi-Relational Classification using automatically extracted relations
  - 50/50 splits, 10 runs
17Conclusion and Future Work
- Summary
  - Presented a framework for relation extraction and multi-relational classification
  - Automatic relation extraction with record linkage
  - Relational classification using each extracted relation for classification and fusing the results with ensemble methods
- Future Work
  - Evaluate our framework on different data sets and relations
  - Evaluate the relational classifiers' quality depending on the quality of the extracted relations
18Thank you
- Questions?
- www.ismll.uni-hildesheim.de
- Christine Preisach
- preisach_at_ismll.uni-hildesheim.de
- Steffen Rendle
- srendle_at_ismll.uni-hildesheim.de
- Lars Schmidt-Thieme
- schmidt-thieme_at_ismll.uni-hildesheim.de