Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework


1
Comparative Study of Name Disambiguation Problem
using a Scalable Blocking-based Framework
  • Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit
    Mitra
  • JCDL '05

2
Abstract
  • They consider the problem of ambiguous author
    names in bibliographic citations.
  • A scalable two-step framework:
  • Reduce the number of candidates via blocking
    (four methods)
  • Measure the distance between two names via
    coauthor information (seven measures)

3
Introduction
  • Citation records are important resources for
    academic communities.
  • Keeping citations correct and up-to-date has
    proved to be a challenging task at a large scale.
  • We focus on the problem of ambiguous author
    names.
  • It is difficult to get the complete list of the
    publications of some authors.
  • e.g., John Doe published 100 articles, but the
    digital library keeps two separate purported
    author names, John Doe and J. D. Doe, each
    containing 50 citations.

4
(No Transcript)
5
Problem
  • Problem definition
  • The baseline approach

6
Solution
  • Rather than comparing each pair of author names
    to find similar names, they advocate a scalable
    two-step name disambiguation framework.
  • Partition all author-name strings into blocks
  • Visit each block and compare all possible pairs
    of names within the block
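The two-step framework above can be sketched in Python. This is a toy illustration; `block_key` and `distance` are placeholders for the concrete blocking methods and distance measures discussed on the following slides, and all names here are hypothetical:

```python
from collections import defaultdict

def disambiguate(names, block_key, distance, k=5):
    # Step 1: partition all author-name strings into blocks.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    # Step 2: within each block, rank the other names by distance
    # and keep the top-k closest candidates for each name.
    candidates = {}
    for block in blocks.values():
        for name in block:
            others = [m for m in block if m != name]
            candidates[name] = sorted(others, key=lambda m: distance(name, m))[:k]
    return candidates

# Toy key (last token) and toy distance, for illustration only:
out = disambiguate(
    ["Jeffrey Ullman", "J. Ullman", "Dongwon Lee"],
    block_key=lambda n: n.split()[-1].lower(),
    distance=lambda a, b: abs(len(a) - len(b)),
)
```

Names in different blocks are never compared, which is what makes the framework scale: the quadratic pairwise comparison is confined to each block.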

7
Solution Overview
8
Blocking (1/3)
  • The goal of step 1 is to put similar records into
    the same group by some criteria.
  • They examine four representative blocking methods
  • heuristics, token-based, n-gram, sampling

9
Blocking (2/3)
  • Spelling-based heuristics
  • Group author names based on name spellings
  • Heuristics: iFfL, iFiL, fL, and their combination
  • e.g., iFfL (initial of first name + full last
    name) puts Jeffrey Ullman and J. Ullman in the
    same block
  • Token-based
  • Author names sharing at least one common token
    are grouped into the same block
  • e.g., Jeffrey D. Ullman and Ullman, Jason
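A minimal sketch of key-based blocking with the iFfL heuristic. The helper names are hypothetical, and it assumes names are normalized to "First [Middle] Last" order:

```python
from collections import defaultdict

def iffl_key(name):
    # iFfL: initial of the first name + full last name.
    parts = name.replace(",", "").split()
    return (parts[0][0].lower(), parts[-1].lower())

def block_by_key(names, key_fn):
    # Group author-name strings whose blocking keys match.
    blocks = defaultdict(list)
    for name in names:
        blocks[key_fn(name)].append(name)
    return blocks

blocks = block_by_key(["Jeffrey Ullman", "J. Ullman", "Dongwon Lee"], iffl_key)
# "Jeffrey Ullman" and "J. Ullman" share the key ('j', 'ullman')
```

Token-based blocking would use the same `block_by_key` skeleton with a key function that emits name tokens instead of a spelling heuristic.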

10
Blocking (3/3)
  • N-gram
  • Author names sharing at least one N-gram (N = 4)
    are grouped into the same block.
  • Of the four methods, it puts the largest number
    of author names into the same block.
  • e.g., David R. Johnson and F. Barr-David (both
    contain the 4-gram davi)
  • Sampling
  • Sampling-based join approximation
  • Each token from all author names has a TF-IDF
    weight.
  • Each author name has its token weight vector.
  • All pairs of names whose similarity is at least a
    given threshold are put into the same block.
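N-gram blocking can be sketched as an inverted index from 4-grams to names. This is a toy version with an assumed normalization (lowercase, spaces removed); the paper's exact preprocessing is not specified:

```python
from collections import defaultdict

def ngram_blocks(names, n=4):
    # Map each character n-gram to the author names containing it;
    # names sharing any n-gram end up in a common block.
    index = defaultdict(list)
    for name in names:
        key = name.lower().replace(" ", "")
        for i in range(len(key) - n + 1):
            index[key[i:i + n]].append(name)
    return index

idx = ngram_blocks(["David R. Johnson", "F. Barr-David"])
# both names appear under the 4-gram "davi"
```

The example shows why this method yields the largest blocks: any shared 4-character substring, even across unrelated names, is enough to group two names together.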

11
Measuring Distances
  • The goal of step 2 is, for each block, to
    identify top-k author names that are the closest.
  • Supervised method
  • Naïve Bayes Model, Support Vector Machine
  • Unsupervised method
  • String-based Distance, Vector-based Cosine
    Distance

12
Supervised Methods (1)
  • Naïve Bayes Model
  • Training
  • The collection of coauthors of x is randomly
    split, and only half is used for training.
  • They estimate each coauthor's conditional
    probability P(A|x)
  • Testing
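A toy version of the coauthor-based Naïve Bayes estimate P(A|x). Laplace smoothing is added here as an assumption (the slide does not state how unseen coauthors are handled), and the example data is hypothetical:

```python
import math
from collections import Counter

def train_coauthor_model(coauthor_lists, smoothing=1.0):
    # Estimate P(A|x) from the training half of x's citations,
    # with Laplace smoothing over the observed coauthor vocabulary.
    counts = Counter(c for authors in coauthor_lists for c in authors)
    total = sum(counts.values())
    vocab = len(counts)
    def prob(coauthor):
        return (counts[coauthor] + smoothing) / (total + smoothing * vocab)
    return prob

def log_score(prob, test_coauthors):
    # Log-likelihood of a test citation's coauthors,
    # assuming conditional independence (the "naive" assumption).
    return sum(math.log(prob(c)) for c in test_coauthors)

# Hypothetical training half for one author x:
p = train_coauthor_model([["A. Aho", "J. Hopcroft"], ["A. Aho"]])
```

Testing then scores the held-out citations of each candidate name under each author's model and ranks candidates by likelihood.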

13
Supervised Methods (2)
  • Support Vector Machine
  • All coauthor information of an author in a block
    is transformed into vector-space representation.
  • Author names in a block are randomly split: 50%
    is used for training, and the other 50% is used
    for testing.
  • SVM creates a maximum-margin hyperplane that
    splits the YES and NO training examples.
  • In testing, the SVM classifies vectors by mapping
    them via the kernel trick to a high-dimensional
    space.
  • Radial Basis Function kernel

14
Unsupervised Methods (1)
  • String-based Distance
  • The distance between two author names is
    measured by the distance between their coauthor
    lists.
  • Two token-based string distances
  • Two edit-distance-based string distances
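One common token-based choice, shown here purely as an illustration (not necessarily one of the paper's exact pair of metrics), is the Jaccard distance over coauthor sets:

```python
def jaccard_distance(coauthors_a, coauthors_b):
    # 1 - |A ∩ B| / |A ∪ B| over the two coauthor sets.
    a, b = set(coauthors_a), set(coauthors_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Edit-distance-based metrics (e.g., Jaro, Jaro-Winkler) instead compare the coauthor lists as strings, character by character, which is why they are much slower at scale.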

15
Unsupervised Methods (2)
  • Vector-based Cosine Distance
  • They model the coauthor lists as vectors in the
    vector space and compute the distances between
    the vectors.
  • They use the simple cosine distance.
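A self-contained sketch of the cosine distance over coauthor frequency vectors (term weighting omitted for brevity):

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    # 1 - cosine similarity of the two coauthor frequency vectors.
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Identical coauthor lists give distance 0; lists with no coauthor in common give distance 1.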

16
Experiment
17
Data Sets
  • They gathered real citation data from four
    different domains.
  • DBLP, e-Print, BioMed, EconPapers
  • Different disciplines appear to have slightly
    different citation policies, and citation
    conventions also vary
  • e.g., the number of coauthors per article, or
    using the initial of the first name instead of
    the full name

18
Artificial name variants
  • Given the large number of citations, it is
    neither possible nor practical to build a real
    solution set.
  • They pick the top-100 author names from Y
    according to their number of citations, and
    artificially generate 100 corresponding new name
    variants.
  • e.g., for Grzegorz Rozenberg, with 344 citations
    and 114 coauthors in DBLP, they create a new name
    like G. Rozenberg or Grzegorz Rozenbergg.
  • Splitting the original 344 citations into halves,
    each name carries half of the citations (172).
  • They test whether the algorithm is able to find
    the corresponding artificial name variant in Y.

19
Artificial name variants
  • Error types, e.g., for Ji-Woo K. Li
  • Abbreviation: J. K. Li
  • Name alternation: Li, Ji-Woo K.
  • Typo: Ji-Woo K. Lee or Jee-Woo K. Li
  • Contraction: Jiwoo K. Li
  • Omission: Ji-Woo Li
  • Combinations of the above
  • They quantify the effect of each error type on
    the accuracy of name disambiguation.
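The deterministic error types above can be sketched as simple string transformations. This is a toy illustration, not the authors' actual generator; "typo" and "combination" are omitted since they involve random edits:

```python
def make_variants(name):
    # Generate one variant per deterministic error type,
    # assuming "First [Middle] Last" name order.
    parts = name.split()
    first, rest = parts[0], parts[1:]
    return {
        "abbreviation": " ".join([first[0] + "."] + rest),         # J. K. Li
        "alternation": parts[-1] + ", " + " ".join(parts[:-1]),    # Li, Ji-Woo K.
        "contraction": " ".join([first.replace("-", "")] + rest),  # JiWoo K. Li
        "omission": " ".join([first] + parts[2:]) if len(parts) > 2 else name,
    }

variants = make_variants("Ji-Woo K. Li")
```

Each generated variant then receives half of the original author's citations, and the evaluation asks whether the framework links it back to the original name.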

20
Artificial name variants
  • Two distributions of error types over the 100
    variants:
  • (1) mixed error types: abbreviation (30 names),
    alternation (30), typos (12 each in the first and
    last name), contraction (2), omission (4), and
    combinations (10)
  • (2) abbreviation of the first name (85) and
    typos (15)

21
Evaluation metrics
  • Scalability
  • Size of blocks generated in step 1
  • Time it took to process steps 1 and 2
  • Accuracy
  • They measure top-k accuracy: whether the correct
    name variant appears among the k closest names.
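Consistent with the description above, top-k accuracy can be computed as the fraction of queried names whose true variant appears among the k closest candidates (a sketch with hypothetical data):

```python
def top_k_accuracy(results, k=5):
    # results: list of (ranked candidate names, true variant) pairs,
    # one per queried author name.
    hits = sum(1 for ranked, truth in results if truth in ranked[:k])
    return hits / len(results)

# Toy example: the first query's true variant is ranked 2nd (a hit
# for k=2); the second query's true variant is missing (a miss).
acc = top_k_accuracy([(["a", "b", "c"], "b"), (["c", "d", "e"], "x")], k=2)
```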

22
Scalability
  • The average number of author names in each block
  • Processing time for steps 1 and 2

23
Accuracy
  • Four blocking methods combined with seven
    distance metrics, for all four data sets, with
    k = 5.
  • The EconPapers data set is omitted.

24
Conclusion
  • They compared various configurations (four
    blocking methods in step 1, seven distance
    metrics over coauthor information in step 2)
    against four data sets.
  • A combination of token-based or N-gram blocking
    (step 1) and SVM as a supervised method or the
    cosine metric as an unsupervised method (step 2)
    gave the best scalability/accuracy trade-off.
  • The accuracy of the simple name-spelling-based
    heuristics was shown to be quite sensitive to
    the error types.
  • Edit-distance-based metrics such as Jaro or
    Jaro-Winkler proved inadequate for the
    large-scale name disambiguation problem due to
    their slow processing time.