Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework


1
Comparative Study of Name Disambiguation Problem
using a Scalable Blocking-based Framework
  • Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit
    Mitra
  • JCDL '05

2
Abstract
  • They consider the problem of ambiguous author
    names in bibliographic citations.
  • A scalable two-step framework:
  • Reduce the number of candidates via blocking
    (four methods)
  • Measure the distance between two names via
    coauthor information (seven measures)

3
Introduction
  • Citation records are important resources for
    academic communities.
  • Keeping citations correct and up-to-date has
    proved to be a challenging task at a large scale.
  • We focus on the problem of ambiguous author
    names.
  • It is difficult to get the complete list of the
    publications of some authors.
  • e.g., John Doe published 100 articles, but the
    digital library keeps two separate purported
    author names, John Doe and J. D. Doe, each
    containing 50 citations.

4
(No Transcript)
5
Problem
  • Problem definition
  • The baseline approach

6
Solution
  • Rather than comparing each pair of author names
    to find similar names, they advocate a scalable
    two-step name disambiguation framework.
  • Partition all author-name strings into blocks
  • Visit each block and compare all possible pairs
    of names within the block
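The two-step framework above can be sketched in Python. This is a toy illustration; `block_key` and `distance` are placeholders for the concrete blocking methods and distance measures discussed on the following slides, and all names here are hypothetical:

```python
from collections import defaultdict

def disambiguate(names, block_key, distance, k=5):
    # Step 1: partition all author-name strings into blocks.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    # Step 2: within each block, rank the other names by distance
    # and keep the top-k closest candidates for each name.
    candidates = {}
    for block in blocks.values():
        for name in block:
            others = [m for m in block if m != name]
            candidates[name] = sorted(others, key=lambda m: distance(name, m))[:k]
    return candidates

# Toy key (last token) and toy distance, for illustration only:
out = disambiguate(
    ["Jeffrey Ullman", "J. Ullman", "Dongwon Lee"],
    block_key=lambda n: n.split()[-1].lower(),
    distance=lambda a, b: abs(len(a) - len(b)),
)
```

Names in different blocks are never compared, which is what makes the framework scale: the quadratic pairwise comparison is confined to each block.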

7
Solution Overview
8
Blocking (1/3)
  • The goal of step 1 is to put similar records into
    the same group by some criteria.
  • They examine four representative blocking methods
  • heuristics, token-based, n-gram, sampling

9
Blocking (2/3)
  • Spelling-based heuristics
  • Group author names based on name spellings
  • Heuristics: iFfL, iFiL, fL, and their combination
  • e.g., iFfL (initial of first name + full last
    name) puts Jeffrey Ullman and J. Ullman in the
    same block
  • Token-based
  • Author names sharing at least one common token
    are grouped into the same block
  • e.g., Jeffrey D. Ullman and Ullman, Jason
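A minimal sketch of key-based blocking with the iFfL heuristic. The helper names are hypothetical, and it assumes names are normalized to "First [Middle] Last" order:

```python
from collections import defaultdict

def iffl_key(name):
    # iFfL: initial of the first name + full last name.
    parts = name.replace(",", "").split()
    return (parts[0][0].lower(), parts[-1].lower())

def block_by_key(names, key_fn):
    # Group author-name strings whose blocking keys match.
    blocks = defaultdict(list)
    for name in names:
        blocks[key_fn(name)].append(name)
    return blocks

blocks = block_by_key(["Jeffrey Ullman", "J. Ullman", "Dongwon Lee"], iffl_key)
# "Jeffrey Ullman" and "J. Ullman" share the key ('j', 'ullman')
```

Token-based blocking would use the same `block_by_key` skeleton with a key function that emits name tokens instead of a spelling heuristic.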

10
Blocking (3/3)
  • N-gram
  • Author names sharing at least one N-gram (N = 4)
    are grouped into the same block.
  • Of the four methods, it puts the largest number
    of author names into the same block.
  • e.g., David R. Johnson and F. Barr-David (both
    contain the 4-gram davi)
  • Sampling
  • Sampling-based join approximation
  • Each token from all author names has a TF-IDF
    weight.
  • Each author name has its token weight vector.
  • All pairs of names whose similarity is at least a
    given threshold are put into the same block.
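N-gram blocking can be sketched as an inverted index from 4-grams to names. This is a toy version with an assumed normalization (lowercase, spaces removed); the paper's exact preprocessing is not specified:

```python
from collections import defaultdict

def ngram_blocks(names, n=4):
    # Map each character n-gram to the author names containing it;
    # names sharing any n-gram end up in a common block.
    index = defaultdict(list)
    for name in names:
        key = name.lower().replace(" ", "")
        for i in range(len(key) - n + 1):
            index[key[i:i + n]].append(name)
    return index

idx = ngram_blocks(["David R. Johnson", "F. Barr-David"])
# both names appear under the 4-gram "davi"
```

The example shows why this method yields the largest blocks: any shared 4-character substring, even across unrelated names, is enough to group two names together.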

11
Measuring Distances
  • The goal of step 2 is, for each block, to
    identify top-k author names that are the closest.
  • Supervised method
  • Naïve Bayes Model, Support Vector Machine
  • Unsupervised method
  • String-based Distance, Vector-based Cosine
    Distance

12
Supervised Methods (1)
  • Naïve Bayes Model
  • Training
  • The collection of coauthors of x is randomly
    split, and only half is used for training.
  • They estimate each coauthor's conditional
    probability P(A|x)
  • Testing
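A toy version of the coauthor-based Naïve Bayes estimate P(A|x). Laplace smoothing is added here as an assumption (the slide does not state how unseen coauthors are handled), and the example data is hypothetical:

```python
import math
from collections import Counter

def train_coauthor_model(coauthor_lists, smoothing=1.0):
    # Estimate P(A|x) from the training half of x's citations,
    # with Laplace smoothing over the observed coauthor vocabulary.
    counts = Counter(c for authors in coauthor_lists for c in authors)
    total = sum(counts.values())
    vocab = len(counts)
    def prob(coauthor):
        return (counts[coauthor] + smoothing) / (total + smoothing * vocab)
    return prob

def log_score(prob, test_coauthors):
    # Log-likelihood of a test citation's coauthors,
    # assuming conditional independence (the "naive" assumption).
    return sum(math.log(prob(c)) for c in test_coauthors)

# Hypothetical training half for one author x:
p = train_coauthor_model([["A. Aho", "J. Hopcroft"], ["A. Aho"]])
```

Testing then scores the held-out citations of each candidate name under each author's model and ranks candidates by likelihood.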

13
Supervised Methods (2)
  • Support Vector Machine
  • All coauthor information of an author in a block
    is transformed into vector-space representation.
  • Author names in a block are randomly split: 50%
    is used for training, and the other 50% is used
    for testing.
  • SVM creates a maximum-margin hyperplane that
    splits the YES and NO training examples.
  • In testing, the SVM classifies vectors by mapping
    them via the kernel trick to a high-dimensional
    space.
  • Radial Basis Function kernel

14
Unsupervised Methods (1)
  • String-based Distance
  • The distance between two author names is
    measured by the distance between their coauthor
    lists.
  • Two token-based string distances
  • Two edit-distance-based string distances
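One common token-based choice, shown here purely as an illustration (not necessarily one of the paper's exact pair of metrics), is the Jaccard distance over coauthor sets:

```python
def jaccard_distance(coauthors_a, coauthors_b):
    # 1 - |A ∩ B| / |A ∪ B| over the two coauthor sets.
    a, b = set(coauthors_a), set(coauthors_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Edit-distance-based metrics (e.g., Jaro, Jaro-Winkler) instead compare the coauthor lists as strings, character by character, which is why they are much slower at scale.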

15
Unsupervised Methods (2)
  • Vector-based Cosine Distance
  • They model the coauthor lists as vectors in the
    vector space and compute the distances between
    the vectors.
  • They use the simple cosine distance.
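A self-contained sketch of the cosine distance over coauthor frequency vectors (term weighting omitted for brevity):

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    # 1 - cosine similarity of the two coauthor frequency vectors.
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Identical coauthor lists give distance 0; lists with no coauthor in common give distance 1.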

16
Experiment
17
Data Sets
  • They gathered real citation data from four
    different domains.
  • DBLP, e-Print, BioMed, EconPapers
  • Different disciplines appear to have slightly
    different citation policies, and citation
    conventions also vary
  • e.g., the number of coauthors per article, or
    using the initial of the first name instead of
    the full name

18
Artificial name variants
  • Given the large number of citations, it is
    neither possible nor practical to build a real
    solution set.
  • They pick the top-100 author names from Y
    according to their number of citations, and
    artificially generate 100 corresponding new name
    variants.
  • e.g., for Grzegorz Rozenberg, with 344 citations
    and 114 coauthors in DBLP, they create a new name
    like G. Rozenberg or Grzegorz Rozenbergg.
  • Splitting the original 344 citations into halves,
    each name carries half of the citations (172).
  • They test whether the algorithm is able to find
    the corresponding artificial name variant in Y.

19
Artificial name variants
  • Error types, e.g., for Ji-Woo K. Li
  • Abbreviation: J. K. Li
  • Name alternation: Li, Ji-Woo K.
  • Typo: Ji-Woo K. Lee or Jee-Woo K. Li
  • Contraction: Jiwoo K. Li
  • Omission: Ji-Woo Li
  • Combinations of the above
  • They quantify the effect of each error type on
    the accuracy of name disambiguation.
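The deterministic error types above can be sketched as simple string transformations. This is a toy illustration, not the authors' actual generator; "typo" and "combination" are omitted since they involve random edits:

```python
def make_variants(name):
    # Generate one variant per deterministic error type,
    # assuming "First [Middle] Last" name order.
    parts = name.split()
    first, rest = parts[0], parts[1:]
    return {
        "abbreviation": " ".join([first[0] + "."] + rest),         # J. K. Li
        "alternation": parts[-1] + ", " + " ".join(parts[:-1]),    # Li, Ji-Woo K.
        "contraction": " ".join([first.replace("-", "")] + rest),  # JiWoo K. Li
        "omission": " ".join([first] + parts[2:]) if len(parts) > 2 else name,
    }

variants = make_variants("Ji-Woo K. Li")
```

Each generated variant then receives half of the original author's citations, and the evaluation asks whether the framework links it back to the original name.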

20
Artificial name variants
  • Two distributions of error types over the 100
    variants:
  • (1) mixed error types: abbreviation (30 names),
    alternation (30), typos (12 each in the first and
    last name), contraction (2), omission (4), and
    combinations (10)
  • (2) abbreviation of the first name (85) and
    typos (15)

21
Evaluation metrics
  • Scalability
  • Size of blocks generated in step 1
  • Time it took to process steps 1 and 2
  • Accuracy
  • They measure top-k accuracy: whether the correct
    name variant appears among the k closest names.
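Consistent with the description above, top-k accuracy can be computed as the fraction of queried names whose true variant appears among the k closest candidates (a sketch with hypothetical data):

```python
def top_k_accuracy(results, k=5):
    # results: list of (ranked candidate names, true variant) pairs,
    # one per queried author name.
    hits = sum(1 for ranked, truth in results if truth in ranked[:k])
    return hits / len(results)

# Toy example: the first query's true variant is ranked 2nd (a hit
# for k=2); the second query's true variant is missing (a miss).
acc = top_k_accuracy([(["a", "b", "c"], "b"), (["c", "d", "e"], "x")], k=2)
```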

22
Scalability
  • The average number of author names in each block
  • Processing time for steps 1 and 2

23
Accuracy
  • Four blocking methods combined with seven
    distance metrics, for all four data sets, with
    k = 5.
  • The EconPapers data set is omitted.

24
Conclusion
  • They compared various configurations (four
    blocking methods in step 1, seven distance
    metrics over coauthor information in step 2)
    against four data sets.
  • A combination of token-based or N-gram blocking
    (step 1) and SVM as a supervised method or the
    cosine metric as an unsupervised method (step 2)
    gave the best scalability/accuracy trade-off.
  • The accuracy of the simple name-spelling-based
    heuristics was shown to be quite sensitive to
    the error types.
  • Edit-distance-based metrics such as Jaro or
    Jaro-Winkler proved inadequate for the
    large-scale name disambiguation problem due to
    their slow processing time.