Social Network Analysis on Name Disambiguation

1
Social Network Analysis on Name Disambiguation
  • On, Byung-Won
  • U. British Columbia
  • Nov. 12, 2008

2
Outline
  • Motivation
  • Problem Definition
  • Solution
  • Context Information
  • Similarity Function
  • Our Framework
  • Experimental Analysis
  • Summary

3
Name Disambiguation in Digital Libraries (DLs)
Jeffrey D. Ullman @ Stanford Univ.
The same author names mistakenly appear under multiple name variants.
Name Disambiguation Problem
Detect/consolidate all name variants!
4
Problem Definition
Given two lists of names $X$ and $Y$, for each name $x \in X$, find a set of names $y_1, y_2, \ldots, y_m \in Y$ such that $y_i$ ($1 \leq i \leq m$) is a variant of $x$.
[Figure: example name lists X and Y. Entries shown include A. Elbert, Paul R. McJones, Frank Manola, F. Manola, Karl Swartz, and Karl L. Swartz.]
5
Solution
  • Treat additional information associated with x
    (resp. y) as a string.
  • What is additional info?
  • Compute all pair-wise string similarities.
  • How can similarities be measured?
  • If $\mathrm{similarity}(x, y) \geq \theta$ for a threshold $\theta$, y is a
    name variant of x.

6
Context Information
  • Hypothesis
  • If two authors are identical, they will share
    more coauthors and more common title/venue
    tokens in their citations.
  • Information associated with an author in a DL
  • Author field
  • Shawn R. Jeffrey, Michael J. Franklin, Alon Y.
    Halevy
  • Title field
  • Pay-as-you-go user feedback for dataspace systems
  • Venue field
  • SIGMOD 2008
  • Ex.: Alon Y. Levy vs. Alon Halevy
  • Alon Y. Levy: a set of title tokens {data,
    management, integration}
  • Alon Halevy: a set of title tokens {data,
    integration, lineage}

7
Similarity Function
  • Why?
  • Most useful for matching problems with little
    prior knowledge or unstructured data (Cohen et
    al. 2003)
  • Character-based similarity metrics
  • Edit-distance, Affine Gap, Smith-Waterman, Jaro,
    etc.
  • Token-based similarity metrics
  • Jaccard, TF/IDF cosine similarity, Monge-Elkan,
    etc.

8
Similarity Function
  • Every similarity function tends to work well on
    particular data sets
  • Each function has pros and cons in measuring the
    similarity between two strings
  • Variations of token order
  • Jaccard(Jeffrey D Ullman, Ullman Jeffrey) = 0.67
  • Jaro(Jeffrey D Ullman, Ullman Jeffrey) = 0
  • Spelling errors
  • Jaccard(Jeffrey D Ullman, Jeffrey Ullmann) = 0.25
  • Jaro(Jeffrey D Ullman, Jeffrey Ullmann) = 0.94
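
The two Jaccard scores above can be reproduced with a few lines of Python. This is a minimal sketch that assumes names are split into tokens on whitespace; the function name `jaccard` is illustrative. Jaro is left out because its exact value depends on the implementation used.

```python
def jaccard(s: str, t: str) -> float:
    """Token-based Jaccard similarity: |S ∩ T| / |S ∪ T| over whitespace tokens."""
    a, b = set(s.split()), set(t.split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return len(a & b) / len(a | b)


# Token order changes: 2 shared tokens out of 3 distinct tokens.
reordered = jaccard("Jeffrey D Ullman", "Ullman Jeffrey")
# Spelling error: "Ullman" and "Ullmann" are different tokens, so only 1 of 4 matches.
misspelled = jaccard("Jeffrey D Ullman", "Jeffrey Ullmann")
print(round(reordered, 2), round(misspelled, 2))  # 0.67 0.25
```

The example shows why a token-based metric tolerates reordering but is penalized heavily by a single misspelled token.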

9
Similarity Function
  • Given two strings S and T as the input
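  • Jaccard: $J(S,T) = |S \cap T| / |S \cup T|$
  • Cosine similarity
  • S (resp. T) is represented as vector $V_S$ (resp.
    $V_T$).
  • $\cos(\theta) = \frac{V_S \cdot V_T}{\|V_S\| \, \|V_T\|}$,
    where $\theta$ is the angle between $V_S$ and $V_T$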
  • Edit-distance (e.g., Levenshtein distance)
  • The cost of the cheapest sequence of edit
    operations that converts S to T.
  • The operations can be character insertion,
    deletion, or substitution.
  • Each operation must be assigned a cost.
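
The edit-distance definition above can be sketched with the standard dynamic-programming recurrence. The cost parameters `ins`, `dele`, and `sub` are illustrative names, and the unit costs are an assumption (they give the Levenshtein distance); the slides do not say which costs were used.

```python
def edit_distance(s: str, t: str, ins: float = 1.0,
                  dele: float = 1.0, sub: float = 1.0) -> float:
    """Minimum total cost of insertions, deletions, and substitutions turning s into t."""
    # prev[j] is the cost of turning the processed prefix of s into t[:j].
    prev = [j * ins for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * dele]  # turning s[:i] into the empty string
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + dele,                            # delete cs
                curr[j - 1] + ins,                         # insert ct
                prev[j - 1] + (0.0 if cs == ct else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Ullman", "Ullmann"))  # 1.0 (one insertion)
print(edit_distance("kitten", "sitting"))  # 3.0
```

Only two rows of the table are kept at a time, so memory grows with the length of T rather than with the product of both lengths.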

10
Our Framework
  • Similarity Function (sim)
  • Jaccard, Cosine similarity, or Edit-distance
  • Input of each similarity function
  • Given two authors x and y
  • S: a set of coauthor names (title tokens, or
    venue tokens) collected from x's citations
  • T: a set of coauthor names (title tokens, or
    venue tokens) collected from y's citations
  • If $\mathrm{sim}(S,T) \geq \theta$, y is a name variant of x.
  • Ex.: $\mathrm{sim}(S,T) = 0.6 > \theta$ ($\theta = 0.5$), so consider x
    and y to be identical.
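
As a rough illustration of this framework, the sketch below takes a context set per name and returns, for each x, the names y whose similarity reaches the threshold. The function name `find_variants`, the dictionary layout, and the sample data are assumptions made for the example, not part of the presentation; any similarity function with the same signature could be passed in place of Jaccard.

```python
from typing import Callable, Dict, List, Set


def jaccard(s: Set[str], t: Set[str]) -> float:
    """|S ∩ T| / |S ∪ T|; defined as 0.0 here when both sets are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def find_variants(
    contexts_x: Dict[str, Set[str]],
    contexts_y: Dict[str, Set[str]],
    sim: Callable[[Set[str], Set[str]], float] = jaccard,
    theta: float = 0.5,
) -> Dict[str, List[str]]:
    """Map each name x to the names y whose context similarity reaches theta."""
    return {
        x: [y for y, t in contexts_y.items() if sim(s, t) >= theta]
        for x, s in contexts_x.items()
    }


# Hypothetical context sets (e.g., coauthor names); not taken from the slides.
X = {"x1": {"a", "b", "c"}}
Y = {"y1": {"a", "b", "c", "d"}, "y2": {"e", "f"}}
print(find_variants(X, Y))  # {'x1': ['y1']}
```

Every pair (x, y) is compared, so the cost grows with |X| times |Y|; that matches the pairwise formulation in the slides but may need blocking or indexing on large name lists.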

11
james smith's citations
  • james smith, gene golub, xml query, vldb 06
  • james smith, gene golub, xml preprocess, cikm 07
  • james smith, xml security, vldb 08
smith, j.'s citations
  • smith, j., golub, g., xml query, very large database 06
  • smith, j., golub, g., xml preprocessing, cikm 07
  • smith, j., xml security, very large database 08
Context information (e.g., title tokens)
  • S (james smith) = {xml, query, preprocess, security}
  • T (smith, j.) = {xml, query, preprocessing, security}
Similarity function (e.g., Jaccard)
  • $\mathrm{sim}(S,T) = 3/5 = 0.6$
  • $\mathrm{sim}(S,T) \geq \theta$ ($\theta = 0.5$), so smith, j. is a name variant of james smith
[Figure: duplicate name graph with the nodes james smith and smith, j.]
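
The worked example on this slide can be checked in code. The sketch below assumes each citation is stored as an (authors, title, venue) tuple and that titles are tokenized on whitespace; that record layout and the helper name `title_tokens` are illustrative choices.

```python
from typing import List, Set, Tuple

Citation = Tuple[List[str], str, str]  # (authors, title, venue); layout is an assumption


def title_tokens(citations: List[Citation]) -> Set[str]:
    """Collect the distinct title tokens over all of an author's citations."""
    return {tok for _, title, _ in citations for tok in title.split()}


def jaccard(s: Set[str], t: Set[str]) -> float:
    union = s | t
    return len(s & t) / len(union) if union else 0.0


james_smith = [
    (["james smith", "gene golub"], "xml query", "vldb 06"),
    (["james smith", "gene golub"], "xml preprocess", "cikm 07"),
    (["james smith"], "xml security", "vldb 08"),
]
smith_j = [
    (["smith, j.", "golub, g."], "xml query", "very large database 06"),
    (["smith, j.", "golub, g."], "xml preprocessing", "cikm 07"),
    (["smith, j."], "xml security", "very large database 08"),
]

S = title_tokens(james_smith)
T = title_tokens(smith_j)
score = jaccard(S, T)
print(sorted(S & T), round(score, 2))  # ['query', 'security', 'xml'] 0.6
print(score >= 0.5)                    # True: treated as a name variant
```

Note that "preprocess" and "preprocessing" count as different tokens here, which is why the score is 3/5 rather than 1; stemming the tokens first would change the result.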
12
Objective
  • Represent the name disambiguation problem as a
    graph
  • A duplicate name graph is formed from the
    pairwise similarities between nodes (names).
  • If two nodes are connected in the graph, they are
    name variants.
  • By observing topological features of the graph,
    investigate the effectiveness of similarity
    functions and context information
  • Jaccard, Cosine similarity, Edit-distance
  • A set of coauthors, title tokens, venue tokens
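
One way to build such a duplicate name graph and inspect it is sketched below in plain Python. The choice of features (degree distribution and connected components) is an assumption, since the transcript does not list which topological features were measured, and the context of the third name is made up for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Set


def jaccard(s: Set[str], t: Set[str]) -> float:
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def build_graph(contexts: Dict[str, Set[str]],
                sim: Callable[[Set[str], Set[str]], float],
                theta: float) -> Dict[str, Set[str]]:
    """Connect two names whenever their context similarity reaches theta."""
    graph: Dict[str, Set[str]] = {name: set() for name in contexts}
    for a, b in combinations(contexts, 2):
        if sim(contexts[a], contexts[b]) >= theta:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_distribution(graph: Dict[str, Set[str]]) -> Counter:
    """Count how many nodes have each degree."""
    return Counter(len(neighbors) for neighbors in graph.values())


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group nodes reachable from one another (depth-first search)."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


contexts = {
    "james smith": {"xml", "query", "preprocess", "security"},
    "smith, j.": {"xml", "query", "preprocessing", "security"},
    "a. other": {"matrix", "computations"},  # hypothetical unrelated author
}
graph = build_graph(contexts, jaccard, theta=0.5)
print(dict(degree_distribution(graph)))  # {1: 2, 0: 1}
print(len(connected_components(graph)))  # 2
```

On a real data set, the degree counts could be plotted on log-log axes to see whether they follow a power law.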

13
Topological Features
14
Experimental Analysis
  • 128 real author names and variants
  • Manually collected from ACM Portal
  • Manually verified that two author names (e.g.,
    Chong Kwan Un vs. C. K. Un, Chong K Un) refer to
    the same author in ACM
  • From the 128 author names:
  • E.g., two name variants: Chun Wu Leng vs. Chun-Wu
    Leng
  • Consider Chun Wu Leng as the representative name
  • Consider Chun-Wu Leng as a variant name
  • # of representative names: 43
  • Each representative name has 2.98 name variants
    on average
  • Max. # of variants: 5 (A. Y. Halevy, Alon
    Halevy, Alon Levy, Alon Y. Halevy, Alon Y. Levy)

16
  • Each representative name has at most 2 variants
  • If a similarity function (e.g., Cosine
    similarity) identifies variants effectively,
    the graph consists of many small trees (a
    forest) and shows the topological features of a
    random graph.
  • But the duplicate name graph is a scale-free
    network
  • Power-law degree distribution
  • So the Cosine similarity function does not find
    identical author names effectively.
  • Due to false positives (co-authors) in the graph

19
Summary
  • Analyze/visualize the name disambiguation problem
    using social network analysis methods
  • Jaccard, Cosine similarity, and Edit-distance do
    not work effectively
  • The duplicate name graphs show the scale-free
    topological feature
  • The best results come from Jaccard or Cosine
    similarity using context info of coauthors or
    title tokens

20
TF/IDF Cosine Similarity
  • A string is represented as a set of the string
    tokens
  • Each token t is assigned the weight
    $w(t) = \log(\mathrm{TF}(t) + 1) \cdot \log(\mathrm{IDF}(t))$
  • TF(t): # of times that t appears in the string
  • IDF(t): total # of strings / # of strings
    containing t
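
The weighting above can be turned into a small TF/IDF cosine sketch. It assumes whitespace tokenization and natural logarithms, and it skips the extra vector normalization that some formulations apply before the dot product; the function names and sample strings are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(strings: List[str]) -> List[Dict[str, float]]:
    """Weight each token t by log(TF(t) + 1) * log(IDF(t))."""
    docs = [Counter(s.split()) for s in strings]
    n = len(docs)
    # Document frequency: number of strings that contain each token.
    df = Counter(tok for doc in docs for tok in doc)
    return [
        {t: math.log(tf + 1) * math.log(n / df[t]) for t, tf in doc.items()}
        for doc in docs
    ]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Dot product of the weight vectors divided by the product of their norms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical strings, chosen only to exercise the functions.
vecs = tfidf_vectors(["xml query", "xml query processing", "graph mining"])
print(cosine(vecs[0], vecs[1]))  # between 0 and 1: two shared tokens
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared tokens
```

With this weighting, a token that occurs in every string has IDF(t) = 1 and therefore weight 0, so it contributes nothing to the similarity.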

21
Similarity Functions