Author Name Disambiguation in Medline - PowerPoint PPT Presentation

1
Author Name Disambiguation in Medline
  • Vetle I. Torvik and
  • Neil R. Smalheiser
  • August 31, 2006

2
(No Transcript)
3
(No Transcript)
4
A Statistically Based Model
  • Hypothesis: an individual tends to publish papers
    with similar attributes, sufficiently so that
    these attributes suffice to disambiguate the
    authors.
  • The case of Dr. Tom Jobe?
  • Large, automatically generated training sets of
    pairs of articles matching (Last Name, First
    Initial) written by the same person vs. by
    different individuals.

5
Name Attributes
  • Suffix if present (III, Jr.)
  • Middle initial match
  • Original model got very good performance without
    using first names at all!
  • First name (if available in Medline, or if it can
    be scraped from online papers)
  • Name spelling variants
  • Name frequency

6
Article Attributes
  • Journal name
  • Number of co-author names in common
  • Affiliation words in common (affiliations may not
    be given for all co-authors)
  • Name correlations with affiliations (e.g. Ito and
    Japan are correlated)
  • Language of the article
  • Title words in common
  • Email addresses, if given (each must be assigned
    to the right author)
  • MeSH headings in common

7
A Monotone Model
  • Each pair of papers creates a vector of 10
    dimensions, each of which has a matching score.
  • Assume monotonicity: the more attributes two
    papers have in common, the more likely they were
    written by the same person
  • Allows for nonlinear and interactive effects
    across dimensions
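
As a rough illustration of the match vector and the monotonicity assumption on this slide, here is a minimal sketch. The `Article` record, the four dimensions chosen, and the helper names are hypothetical simplifications; the model described here uses 10 dimensions, and their exact scoring is not given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical, simplified record; real Medline records carry more fields.
    journal: str
    coauthors: set = field(default_factory=set)
    title_words: set = field(default_factory=set)
    mesh: set = field(default_factory=set)


def match_vector(a, b):
    """Score each dimension so that a higher value means more in common."""
    return (
        int(a.journal == b.journal),        # same journal name
        len(a.coauthors & b.coauthors),     # co-author names in common
        len(a.title_words & b.title_words), # title words in common
        len(a.mesh & b.mesh),               # MeSH headings in common
    )


def dominates(x, y):
    """True if x matches at least as well as y on every dimension."""
    return all(xi >= yi for xi, yi in zip(x, y))


a = Article("J Neurosci", {"smith j", "lee k"}, {"synapse", "plasticity"}, {"Neurons"})
b = Article("J Neurosci", {"smith j"}, {"synapse"}, {"Neurons", "Rats"})
c = Article("Lancet", set(), {"synapse"}, set())

v_ab = match_vector(a, b)
v_ac = match_vector(a, c)
```

Under the monotonicity assumption, whenever `dominates(v_ab, v_ac)` holds, the pair (a, b) should be judged at least as likely to share an author as the pair (a, c). No linear weighting of the dimensions is required for that ordering, which is what leaves room for nonlinear and interactive effects.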

8
Estimate Pairwise Frequencies
  • For a given pair of articles, compute the match
    vector, then look up its frequencies in the
    positive vs. negative training sets; their ratio
    is the R value
  • For a given name, estimate the a priori
    probability P that any two papers will be written
    by the same person
  • This is a whole story in itself.
  • 1 / (1 + (1-P)/(PR)) = probability of a match
  • The Author-ity Site at
    (http://arrowsmith.psych.uic.edu)
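
The pairwise calculation on this slide can be sketched as follows. The training counts and the prior below are made-up numbers for illustration only, and the function names are hypothetical. The sketch also assumes every match vector was seen in both training sets, so it does no smoothing.

```python
from collections import Counter


def likelihood_ratio(vector, pos_counts, neg_counts):
    """R = frequency of the vector among same-author pairs divided by
    its frequency among different-author pairs."""
    pos_freq = pos_counts[vector] / sum(pos_counts.values())
    neg_freq = neg_counts[vector] / sum(neg_counts.values())
    return pos_freq / neg_freq  # assumes the vector occurs in the negative set


def match_probability(prior, r):
    """Combine the name-specific prior P with the ratio R:
    1 / (1 + (1 - P) / (P * R))."""
    return 1.0 / (1.0 + (1.0 - prior) / (prior * r))


# Made-up counts of two-dimensional match vectors in each training set.
pos = Counter({(1, 1): 80, (0, 1): 15, (0, 0): 5})
neg = Counter({(1, 1): 2, (0, 1): 18, (0, 0): 80})

r = likelihood_ratio((1, 1), pos, neg)  # 0.80 / 0.02, about 40
p = match_probability(0.2, r)           # about 0.909
```

The formula is Bayes' rule in odds form: the prior odds P/(1-P) are multiplied by R to give the posterior odds, which are then converted back to a probability. With R = 1 the match vector carries no evidence and the result equals the prior.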

9
Beyond Pairwise Comparisons
  • A and B share titles, journals
  • B and C share co-authors, affiliations
  • But A and C share nothing!
  • Yet p(AC) must be >= p(AB) + p(BC) - 1
  • Triangle inequality: using probabilities, detect
    and correct anomalies due to missing data or
    higher-order correlations
  • Catch un-characteristic papers by an author
  • Another long story to optimize the methods!
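
A minimal sketch of the triplet check follows, using the A, B, C example from this slide with made-up probabilities. Detection follows the inequality directly. The correction shown, which raises the offending probability to its lower bound, is only an assumed placeholder; the slides do not describe the actual correction method.

```python
from itertools import permutations


def find_violations(p, items, tol=1e-9):
    """Return (a, b, c, bound) for triplets where p(a,c) < p(a,b) + p(b,c) - 1."""
    out = []
    for a, b, c in permutations(items, 3):
        if a < c:  # visit each outer pair once per middle item b
            bound = p[frozenset((a, b))] + p[frozenset((b, c))] - 1.0
            if p[frozenset((a, c))] < bound - tol:
                out.append((a, b, c, bound))
    return out


def raise_to_bound(p, violations):
    """Assumed, simplistic correction: lift p(a,c) up to the lower bound."""
    fixed = dict(p)
    for a, b, c, bound in violations:
        key = frozenset((a, c))
        fixed[key] = max(fixed[key], bound)
    return fixed


# Made-up pairwise probabilities: A-B and B-C look alike, A-C share nothing.
p = {frozenset("AB"): 0.95, frozenset("BC"): 0.90, frozenset("AC"): 0.10}

violations = find_violations(p, "ABC")
fixed = raise_to_bound(p, violations)
```

Here the bound for the pair (A, C) is 0.95 + 0.90 - 1 = 0.85, so the observed 0.10 is flagged. A fuller treatment might also lower p(AB) or p(BC) when the evidence for them is weak, which is one reason the correction step needs separate optimization.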

10
Clustering all papers in Medline according to
author-individuals
  • First we compute all pairwise probabilities for
    each (last name, first initial), modified with
    the triplet correction
  • Then we form clusters
  • at a probability threshold of 0.95 (high
    precision)
  • and at a probability threshold of 0.5 (high
    recall)
  • i.e. a paper joins a cluster if the chance is
    greater than 0.5 that it belongs to some cluster;
    otherwise it stays as a singleton
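
To make the effect of the two thresholds concrete, here is a minimal sketch that links any two papers whose pairwise probability exceeds a threshold and reads off the connected groups. This simple linkage via union-find is a stand-in chosen for brevity, not the clustering procedure used for Medline, and the paper identifiers and probabilities are made up.

```python
def cluster(papers, prob, threshold):
    """Group papers connected by pairwise probabilities above the threshold."""
    parent = {x: x for x in papers}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, p in prob.items():
        if p > threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for x in papers:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


papers = ["p1", "p2", "p3", "p4"]
prob = {
    frozenset(("p1", "p2")): 0.99,
    frozenset(("p1", "p3")): 0.60,
    frozenset(("p2", "p3")): 0.70,
    frozenset(("p3", "p4")): 0.20,
}

high_precision = cluster(papers, prob, 0.95)
high_recall = cluster(papers, prob, 0.5)
```

At 0.95 only p1 and p2 are merged and the other papers stay as singletons; at 0.5 p3 joins them while p4 remains a singleton. The stricter threshold makes fewer wrong merges, and the looser one splits fewer true authors.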

11
First-Pass Disambiguation is Complete!
  • Except for several hundred names having more than
    3000 papers each, which reach the memory limit
  • We will assess whether the model is reliable for
    the biggest names
  • For now, proceed for papers giving first names.
  • Monitoring for over-clustering and
    under-clustering
  • Summarizing global statistics

12
Immediate Next Steps
  • Evaluate the clustering performance
  • Old vs. new papers
  • Importance of missing data
  • Very frequent names
  • Singletons, least confident assignments
  • Update the web interface

13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Upcoming Grant Renewal, Aim 1: Special Cases
  • name reversal, hyphenated names, spelling errors,
    Gerald vs. Jerry, Rick vs. A. Rick
  • Use co-author assignment to help disambiguate
    another co-author
  • Compute confidence level of assignment for each
    paper, identify least confident assignments

17
Upcoming Grant Renewal, Aim 2: Update the Model
  • Original model covers 1966-present, but new
    papers have different information, MeSH, emails,
    online information
  • Modify training sets with recent papers.
  • Journal name partial match
  • Abstract words match?
  • Affiliations matched to each author in PMC,
    online papers
  • References Cited information taken from PMC

18
Upcoming Grant Renewal, Aim 3: Web Interface
  • Update the pairwise interface (given name, a
    particular paper, list all others in order of
    match probability)
  • Show clusters: given a name, show all clusters
    of author-individuals, link to Community of
    Science, searchable by attributes, can summarize
    and explore further (Anne O'Tate tool)
  • Author profile/collaboration finder tools
  • Data made available for bibliometrics and
    collaboration network research

19
Upcoming Grant Renewal, Aim 4: Curation
  • Curator to identify errors and least-confident
    assignments, both manually and by machine
    methods (e.g. wobble in clustering)
  • change the database and alter the model as needed
  • Wiki: authors will monitor postings to the wiki
    and change the database as verified and warranted
    (e.g. maiden name to married name)