Title: Social Network Analysis on Name Disambiguation
1Social Network Analysis on Name Disambiguation
- On, Byung-Won
- U. British Columbia
- Nov. 12, 2008
2Outline
- Motivation
- Problem Definition
- Solution
- Context Information
- Similarity Function
- Our Framework
- Experimental Analysis
- Summary
3Name Disambiguation @ DLs
Jeffrey D. Ullman @ Stanford Univ.
The same author mistakenly appears under multiple name variants.
Name Disambiguation Problem
Detect/consolidate all name variants!!
4Problem Definition
- Given two lists of names, $X$ and $Y$:
- For each name $x \in X$, find a set of names $y_1, y_2, \ldots, y_m \in Y$ such that $y_i$ $(1 \le i \le m)$ is a variant of $x$.
[Figure: example name lists X and Y; entries include A. Elbert, Paul R. McJones, Frank Manola, F. Manola, ..., Karl Swartz, Karl L. Swartz]
5Solution
- Treat additional information associated with x (resp. y) as a string.
- What is additional info?
- Compute all pair-wise string similarities.
- How can similarities be measured?
- If similarity(x, y) ≥ θ, y is the name variant of x.
6Context Information
- Hypothesis
- If two author names refer to the same person, they will share more coauthors and more common title/venue tokens in their citations.
- Information associated with an author @ DL
- Author field
- Shawn R. Jeffrey, Michael J. Franklin, Alon Y. Halevy
- Title field
- Pay-as-you-go user feedback for dataspace systems
- Venue field
- SIGMOD 2008
- Ex.: Alon Y. Levy vs. Alon Halevy
- Alon Y. Levy: a set of title tokens {data, management, integration}
- Alon Halevy: a set of title tokens {data, integration, lineage}
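As a minimal sketch of how this context information could be gathered, the snippet below builds the coauthor, title-token, and venue-token sets per author name. The record layout (`authors`, `title`, `venue` keys), the lowercase whitespace tokenization, and the function name `collect_context` are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

# Hypothetical citation records; the field names are assumed, not from the slides.
citations = [
    {
        "authors": ["Shawn R. Jeffrey", "Michael J. Franklin", "Alon Y. Halevy"],
        "title": "Pay-as-you-go user feedback for dataspace systems",
        "venue": "SIGMOD 2008",
    },
]


def collect_context(citations):
    """Map each author name to its coauthor, title-token, and venue-token sets."""
    context = defaultdict(
        lambda: {"coauthors": set(), "title": set(), "venue": set()}
    )
    for c in citations:
        # Assumed tokenization: lowercase, split on whitespace.
        title_tokens = set(c["title"].lower().split())
        venue_tokens = set(c["venue"].lower().split())
        for author in c["authors"]:
            entry = context[author]
            entry["coauthors"].update(a for a in c["authors"] if a != author)
            entry["title"].update(title_tokens)
            entry["venue"].update(venue_tokens)
    return context


context = collect_context(citations)
print(sorted(context["Alon Y. Halevy"]["coauthors"]))
```

Each of the three sets can then be fed, one at a time, to a similarity function.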
7Similarity Function
- Why?
- Most useful for matching problems with little prior knowledge or unstructured data (Cohen et al. 2003)
- Character-based similarity metrics
- Edit-distance, Affine Gap, Smith-Waterman, Jaro, etc.
- Token-based similarity metrics
- Jaccard, TF/IDF cosine similarity, Monge-Elkan, etc.
8Similarity Function
- Every similarity function tends to work well on particular data sets
- Each function has pros and cons in measuring the similarity between two strings
- Variations of token order
- Jaccard(Jeffrey D Ullman, Ullman Jeffrey) = 0.67
- Jaro(Jeffrey D Ullman, Ullman Jeffrey) = 0
- Spelling errors
- Jaccard(Jeffrey D Ullman, Jeffrey Ullmann) = 0.25
- Jaro(Jeffrey D Ullman, Jeffrey Ullmann) = 0.94
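The two Jaccard values on this slide can be reproduced with a small token-based sketch. It assumes names are split on whitespace and compared case-sensitively; the helper name `jaccard` is illustrative only. Jaro is not reimplemented here.

```python
def jaccard(s, t):
    """Token-based Jaccard similarity between two strings."""
    a, b = set(s.split()), set(t.split())
    if not a and not b:
        return 1.0  # Convention chosen here: two empty strings are identical.
    return len(a & b) / len(a | b)


# Token order does not matter: 2 shared tokens out of 3 distinct tokens.
print(round(jaccard("Jeffrey D Ullman", "Ullman Jeffrey"), 2))   # 0.67
# A one-letter misspelling makes the whole token differ: 1 shared out of 4.
print(round(jaccard("Jeffrey D Ullman", "Jeffrey Ullmann"), 2))  # 0.25
```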
9Similarity Function
- Given two strings S and T as the input
- $\mathrm{Jaccard}(S, T) = |S \cap T| / |S \cup T|$
- Cosine similarity
- S (resp. T) is represented as vector $V_S$ (resp. $V_T$).
- $\cos(\theta) = \dfrac{V_S \cdot V_T}{\lVert V_S \rVert \, \lVert V_T \rVert}$
- Edit-distance (e.g., Levenshtein distance)
- The cost of the best sequence of edit operations that convert S to T.
- The operations can be character insertion, deletion, or substitution.
- Each operation must be assigned a cost.
10Our Framework
- Similarity Function (sim)
- Jaccard, Cosine similarity, or Edit-distance
- Input of each similarity function
- Given two authors x and y
- S: a set of coauthor names (title tokens, or venue tokens) collected from x's citations
- T: a set of coauthor names (title tokens, or venue tokens) collected from y's citations
- If sim(S, T) ≥ θ, y is the name variant of x.
- Ex.: sim(S, T) = 0.6 > θ (0.5): consider x and y to be identical.
11james smith's citations
james smith, gene golub, xml query, vldb 06
james smith, gene golub, xml preprocess, cikm 07
james smith, xml security, vldb 08
smith, j.'s citations
smith, j. golub, g., xml query, very large database 06
smith, j. golub, g., xml preprocessing, cikm 07
smith, j. xml security, very large database 08
Context information (e.g., title tokens)
S (james smith) = {xml, query, preprocess, security}
T (smith, j.) = {xml, query, preprocessing, security}
Similarity function (e.g., Jaccard)
sim(S, T) = 3/5 = 0.6
sim(S, T) ≥ θ (0.5): smith, j. is the variant name of james smith
[Figure: duplicate name graph with nodes james smith and smith, j.]
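The worked example on this slide can be checked with a short sketch. It assumes the decision rule is sim(S, T) ≥ θ with θ = 0.5 and represents the duplicate name graph as a plain list of edges; `jaccard_sets` and `is_variant` are illustrative names.

```python
def jaccard_sets(s, t):
    """Jaccard similarity between two token sets."""
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)


def is_variant(s, t, theta=0.5):
    """Assumed decision rule: y is a variant of x when sim(S, T) >= theta."""
    return jaccard_sets(s, t) >= theta


# Title tokens collected from each name's citations (from the slide).
S = {"xml", "query", "preprocess", "security"}      # james smith
T = {"xml", "query", "preprocessing", "security"}   # smith, j.

sim = jaccard_sets(S, T)
print(sim)  # 3 shared tokens / 5 distinct tokens = 0.6

# One edge is added to the duplicate name graph per detected variant pair.
edges = []
if is_variant(S, T):
    edges.append(("james smith", "smith, j."))
print(edges)
```

Note that "preprocess" and "preprocessing" count as different tokens here; stemming would raise the similarity to 1.0.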
12Objective
- Represent the name disambiguation problem as a graph
- A duplicate name graph is formed semantically by the pair-wise similarities of nodes.
- If two nodes are connected in the graph, they are name variants.
- Observing topological features in the graph, investigate the effectiveness of similarity functions and context information
- Jaccard, Cosine similarity, Edit-distance
- A set of coauthors, title tokens, venue tokens
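A minimal sketch of building a duplicate name graph and reading off two simple topological features (degree distribution and connected components) is given below. The adjacency-set representation, the ≥ θ rule, the choice of Jaccard, and the toy token sets for "Chun Wu Leng" are assumptions for illustration, not the experimental setup of the talk.

```python
from collections import Counter
from itertools import combinations


def jaccard_sets(s, t):
    return len(s & t) / len(s | t) if (s or t) else 1.0


def build_graph(contexts, sim, theta):
    """Connect two names when the similarity of their contexts is >= theta."""
    graph = {name: set() for name in contexts}
    for x, y in combinations(contexts, 2):
        if sim(contexts[x], contexts[y]) >= theta:
            graph[x].add(y)
            graph[y].add(x)
    return graph


def degree_distribution(graph):
    """Count how many nodes have each degree."""
    return Counter(len(neighbors) for neighbors in graph.values())


def connected_components(graph):
    """Return the node sets of all connected components (iterative DFS)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


contexts = {
    "Alon Y. Levy": {"data", "management", "integration"},
    "Alon Halevy": {"data", "integration", "lineage"},
    "Chun Wu Leng": {"parallel", "join"},  # hypothetical tokens
}
graph = build_graph(contexts, jaccard_sets, theta=0.5)
print(degree_distribution(graph))
print(len(connected_components(graph)))
```

On real data, plotting the degree distribution on log-log axes is one way to inspect whether it follows a power law.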
13Topological Features
14Experimental Analysis
- 128 real author names and variants
- Manually collected from ACM Portal
- Manually verified that two authors (e.g., Chong Kwan Un vs. C. K. Un, Chong K Un) are the same author name in ACM
- From 128 author names,
- E.g., two name variants: Chun Wu Leng vs. Chun-Wu Leng
- Consider Chun Wu Leng as the representative name
- Consider Chun-Wu Leng as a variant name
- # of representative names: 43
- Each representative name has 2.98 name variants on average
- Max. # of variants: 5 (A. Y. Halevy, Alon Halevy, Alon Levy, Alon Y. Halevy, Alon Y. Levy)
16
- Each representative name has at most 2 variants
- If a similarity function (e.g., Cosine similarity) identifies variants effectively, the graph should contain many forests and show the topological features of random graphs.
- But the duplicate name graph is a scale-free network
- Power-law distribution
- Cosine similarity function does not find identical author names effectively.
- Due to false positives (co-authors) in the graph
19Summary
- Analyze/visualize the name disambiguation problem using social network analysis methods
- Jaccard, Cosine similarity, and Edit-distance do not work effectively
- Showing the scale-free topological feature
- Best is Jaccard or Cosine similarity using context info of coauthors or title tokens
20TF/IDF Cosine Similarity
- A string is represented as a set of the string tokens
- Each token t is assigned $w(t) = \log(\mathrm{TF}(t) + 1) \cdot \log(\mathrm{IDF}(t))$
- TF(t): # of times that t appears in the string
- IDF(t): total # of strings / # of strings containing t
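The weighting above can be sketched as follows. The sketch assumes natural logarithms, strings already split into token lists, and that the string being weighted is itself part of the corpus (so every token occurs in at least one string); `tfidf_weights` is an illustrative name.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus):
    """w(t) = log(TF(t) + 1) * log(IDF(t)) for each token of one string.

    Assumes `tokens` is one of the strings in `corpus`, so the number of
    strings containing each token is at least 1.
    """
    tf = Counter(tokens)
    n = len(corpus)
    weights = {}
    for t, count in tf.items():
        containing = sum(1 for doc in corpus if t in doc)
        weights[t] = math.log(count + 1) * math.log(n / containing)
    return weights


# Title tokens from the james smith example, one token list per citation.
corpus = [["xml", "query"], ["xml", "preprocess"], ["xml", "security"]]
weights = tfidf_weights(corpus[0], corpus)
print(weights)
```

A token that appears in every string, such as "xml" here, gets IDF = 1 and therefore weight 0, so it does not contribute to the cosine similarity.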
21Similarity Functions