Title: Dynamic Author Similarities
1Dynamic Author Similarities
- Sheila Alston
- Old Dominion University
- Advisor: Dr. Johan Bollen
- May 6, 2005
2Agenda
- Introduction
- Methodology
- Implementing Dynamically Generated Graphs
- Results
- Conclusions and Future Work
- Demonstration
3Overview
- Growth of the Internet since the early 1990s
- Information accessible to researchers
- Challenges in the digital libraries community that encourage research
4A Research Tool
- Visual representation of on-line publications
- Visually show a network of author similarities
- Identify trends
5General Approach
- Design a GUI
- Search a collection
- Dynamically generate graphs
- Link authors according to their publication similarities
- Navigate the resulting network of related authors
6Data Collection
- D-Lib Magazine collection of documents retrieved from the web
- Publications ranged from 1995 to 2003
- Total number of publications: 907
- Number of unique publications: 753
7Data Pruning
- Each document was separated into terms
- Spurious characters such as . , ? were removed
- Stop words were removed, such as be, the, a, each, is, so
- Approximately 390 stop words
- No stemming of the terms
- 35,667 terms after stop words were removed
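The pruning steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `extract_terms`, the abbreviated stop-word set, and the choice to lowercase terms and keep only alphanumeric runs are all assumptions.

```python
import re

# Hypothetical, abbreviated stop-word list; the slides mention roughly 390 entries.
STOP_WORDS = {"be", "the", "a", "each", "is", "so"}

def extract_terms(text, stop_words=STOP_WORDS):
    """Split text into terms, dropping punctuation and stop words (no stemming)."""
    # Assumption: terms are lowercased and consist of letters and digits only.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

terms = extract_terms("The archive is a service, so each user benefits?")
print(terms)
```

No stemming is applied, so "benefits" and "benefit" would remain distinct terms, matching the slide.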
8Term Frequency
- Each term was assigned a unique ID
- Term IDs started at 20,000
- Starting term IDs at 20,000 helped distinguish terms from documents in the TF/IDF matrix
- Term and term ID were used as input to compute the Term Frequency
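One way to picture the ID assignment is the sketch below. The starting value of 20,000 comes from the slides; the function name `assign_term_ids` and the first-seen ordering are assumptions made for illustration.

```python
from itertools import count

TERM_ID_START = 20_000  # from the slides; keeps term IDs apart from document IDs

def assign_term_ids(terms, start=TERM_ID_START):
    """Give each distinct term a unique ID, in first-seen order (an assumption)."""
    next_id = count(start)
    ids = {}
    for term in terms:
        if term not in ids:
            ids[term] = next(next_id)
    return ids

term_ids = assign_term_ids(["archive", "service", "archive", "user"])
print(term_ids)
```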
9What is Term Frequency?
- The frequency of terms in a publication
- Computed as the frequency of the term in the publication divided by the maximum term frequency in that publication
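The normalization described above can be sketched as follows, assuming a publication has already been reduced to a list of terms. The function name `term_frequencies` is hypothetical.

```python
from collections import Counter

def term_frequencies(terms):
    """Return each term's count divided by the largest count in the publication."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}

tf = term_frequencies(["archive", "archive", "service", "user"])
print(tf)
```

With this normalization the most frequent term in a publication always has a term frequency of 1.0.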
10Inverse Document Frequency
- Each publication may have one or many authors
- Each publication was assigned a unique ID
- There are 753 distinct publications
- List of unique publications served as input to
the IDF
11Inverse Document Frequency
- An Experimental End-User Service across E-Print Archives
- Herbert Van de Sompel
- Los Alamos National Laboratory - Research Library, New Mexico, US, and Automation Department of the Central Library of the University of Ghent, Belgium
- herbert.vandesompel_at_rug.ac.be
- Thomas Krichel
- University of Surrey, UK
- T.Krichel_at_surrey.ac.uk
- Michael L. Nelson
- NASA Langley Research Center, Hampton VA, USA
- m.l.nelson_at_larc.nasa.gov
12What is Inverse Document Frequency?
- The idf of term i is given by $idf_i = \log_{10}(N / n_i)$
- N is the number of publications
- n_i is the number of publications that contain term i
13What is Inverse Document Frequency?
- Provides high values for terms that are rare
- Provides low values for terms that are common
- N = 1000 publications
- n_i = 1000 publications that contain term i
- log(1000/1000) = 0
- log(1000/10) = 2 (when only 10 publications contain term i)
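The two worked values above can be reproduced with a short sketch. The base-10 logarithm is inferred from the slide's own numbers (log(1000/10) = 2); the function name `inverse_document_frequency` is hypothetical.

```python
import math

def inverse_document_frequency(n_publications, n_containing):
    """idf = log10(N / n_i); base 10 is inferred from the slide's worked values."""
    if n_containing <= 0:
        raise ValueError("a term must occur in at least one publication")
    return math.log10(n_publications / n_containing)

common = inverse_document_frequency(1000, 1000)  # term found in every publication
rare = inverse_document_frequency(1000, 10)      # term found in 10 publications
print(common, rare)
```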
14TF/IDF Weights
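The weighting formula for this slide is not preserved in the text. A common formulation multiplies the normalized term frequency by the idf, and the sketch below assumes that form; the function name `tfidf_weights` and the input layout (a dictionary from publication ID to term list) are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: dict of publication ID -> list of terms.

    Returns dict of publication ID -> {term: weight}, where
    weight = (count / max count in publication) * log10(N / n_i).
    This tf * idf form is an assumption, not taken from the slides.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per publication
    weights = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        max_count = max(counts.values(), default=1)
        weights[doc_id] = {
            term: (n / max_count) * math.log10(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return weights

docs = {1: ["archive", "archive", "service"], 2: ["service", "metadata"]}
w = tfidf_weights(docs)
print(w)
```

In this toy collection "service" appears in every publication, so its weight is zero, while "archive" and "metadata" each appear in only one publication and receive positive weights.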
15Vector Space Model
- An information retrieval method
- Allows partial matching
- Weights are assigned to terms in queries and in documents
- Weights are used to compute the degree of similarity
- The weights are sorted
16Cosine Similarity
17Cosine Similarity, cont.
18Cosine Similarity, cont.
19Cosine Similarity, cont.
- Numerator
    -- Sum of weight products over the terms two authors share
    DROP TABLE IF EXISTS simnum;
    CREATE TABLE simnum
    ( authorx int(7),
      authory int(7),
      sim_num double(22,4),
      UNIQUE ndx_author (authorx, authory) );
    INSERT INTO simnum (authorx, authory, sim_num)
    SELECT t1.authorid AS authorx
         , t2.authorid AS authory
         , SUM(t1.weight * t2.weight)
    FROM all1 t1
       , all1 t2
    WHERE t1.authorid <> t2.authorid
      AND t1.termid = t2.termid
    GROUP BY t1.authorid, t2.authorid;
20Cosine Similarity, cont.
- Denominator
    -- Euclidean length of each author's weight vector
    DROP TABLE IF EXISTS simdenom;
    CREATE TABLE simdenom
    ( sim_denom double(15,6) default 0,
      authorx int(7),
      UNIQUE ndx_author (authorx) );
    INSERT INTO simdenom (sim_denom, authorx)
    SELECT SQRT(SUM(POW(t1.weight, 2))) AS sim_denom
         , t1.authorid AS authorx
    FROM all1 t1
    GROUP BY t1.authorid;
21Cosine Similarity, cont.
    -- Cosine similarity: numerator divided by the product of both vector lengths
    DROP TABLE IF EXISTS cossim;
    CREATE TABLE cossim AS
    SELECT t1.authorx AS authorx
         , t1.authory AS authory
         , t1.sim_num
         , t2.sim_denom AS simdenom1
         , t3.sim_denom AS simdenom2
         , t1.sim_num / (t2.sim_denom * t3.sim_denom) AS cos_sim
    FROM simnum t1
       , simdenom t2
       , simdenom t3
    WHERE t1.authorx = t2.authorx
      AND t1.authory = t3.authorx;
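For readers who prefer to follow the computation outside the database, the three SQL steps (numerator, denominator, final division) can be mirrored in a short Python sketch. It assumes the `all1` table is available as (author ID, term ID, weight) rows with at most one row per author and term; the function name `cosine_similarities` and the sample rows are hypothetical.

```python
import math
from collections import defaultdict

def cosine_similarities(rows):
    """rows: iterable of (author_id, term_id, weight), mirroring the all1 table."""
    vectors = defaultdict(dict)
    for author_id, term_id, weight in rows:
        vectors[author_id][term_id] = weight

    # Denominator part (simdenom): Euclidean length of each author's vector.
    norms = {a: math.sqrt(sum(w * w for w in v.values())) for a, v in vectors.items()}

    sims = {}
    for x, vx in vectors.items():
        for y, vy in vectors.items():
            if x == y:
                continue
            shared = vx.keys() & vy.keys()
            if not shared or norms[x] == 0 or norms[y] == 0:
                continue  # no shared term means no row, as in the SQL self-join
            dot = sum(vx[t] * vy[t] for t in shared)  # numerator (simnum)
            sims[(x, y)] = dot / (norms[x] * norms[y])
    return sims

rows = [(1, 20000, 1.0), (1, 20001, 1.0), (2, 20000, 1.0), (3, 20002, 2.0)]
sims = cosine_similarities(rows)
print(sims)
```

As in the SQL, each pair appears in both directions, and authors with no term in common produce no entry at all.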
22Evaluating Results
Highly Related Authors with Branch of 3, Depth of 2
23Evaluating Results, Cont.
Authors Minimally Related Showing a Branch of 3 and Depth of 2
24Evaluating Results, Cont.
Moderately Related Authors with Branch of 2, Depth of 4
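The slides do not define branch and depth precisely. One plausible reading, assumed here, is that branch is the number of most similar authors expanded from each node and depth is the number of levels followed from the starting author. Under that assumption, selecting the edges of a graph could look like the sketch below; the function name `build_edges` and the sample similarities are hypothetical.

```python
def build_edges(sims, root, branch, depth):
    """Breadth-first expansion over a {(author_x, author_y): similarity} mapping.

    Assumption: keep the `branch` most similar unseen authors per node,
    and repeat for `depth` levels starting from `root`.
    """
    edges = []
    seen = {root}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for author in frontier:
            candidates = sorted(
                ((score, other) for (a, other), score in sims.items()
                 if a == author and other not in seen),
                reverse=True,
            )[:branch]
            for score, other in candidates:
                edges.append((author, other, score))
                seen.add(other)
                next_frontier.append(other)
        frontier = next_frontier
    return edges

sims = {("A", "B"): 0.9, ("A", "C"): 0.5, ("A", "D"): 0.1,
        ("B", "A"): 0.9, ("B", "E"): 0.7, ("C", "F"): 0.4}
edges = build_edges(sims, "A", branch=2, depth=2)
print(edges)
```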
25File Layout for Generating Graphs
graph G {
  JohanBollen -- Luce [label="1.0"];
  Luce [URL="http://tango/project1.jsp?author=Luce_Rick&branch=3&depth=1"];
  JohanBollen -- Vemulapalli [label="0.8"];
  Vemulapalli [URL="http://tango/project1.jsp?author=Vemulapalli_Soma_Sekhara&branch=3&depth=1"];
  JohanBollen -- Xu [label="0.8"];
  Xu [URL="http://tango/project1.jsp?author=Xu_Weining&branch=3&depth=1"];
  JohanBollen [shape=polygon, sides=5];
}
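A file with this layout could be produced by a small generator such as the sketch below. It is illustrative only: the function name `to_dot` is hypothetical, and the query parameter names `author`, `branch`, and `depth` are inferred from the URLs in the layout rather than confirmed by the slides.

```python
from urllib.parse import urlencode

def to_dot(center, neighbours, base_url, branch, depth):
    """Build Graphviz DOT text for one author and a list of related authors.

    neighbours: list of (node_name, author_key, similarity) tuples.
    """
    lines = ["graph G {"]
    for node, author_key, sim in neighbours:
        # Assumed parameter names, inferred from the URLs in the file layout.
        query = urlencode({"author": author_key, "branch": branch, "depth": depth})
        lines.append(f'  {center} -- {node} [label="{sim:.1f}"];')
        lines.append(f'  {node} [URL="{base_url}?{query}"];')
    lines.append(f"  {center} [shape=polygon, sides=5];")
    lines.append("}")
    return "\n".join(lines)

dot = to_dot("JohanBollen", [("Luce", "Luce_Rick", 1.0)],
             "http://tango/project1.jsp", branch=3, depth=1)
print(dot)
```

Each linked node carries a URL that requests a new graph centered on that author, which is what gives the tool its drill-down behavior.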
26Conclusion
- Useful tool for researchers
- Identifies high and low similarities
- Gives a graphical representation of which authors are in a collection
- Allows author drill-down
27Future Work
- D-Lib Magazine collection is separated into years
- Build graphs for author similarities by year
- Compare trends of authors for different years
28Demonstration
http://tango.cs.odu.edu:1990/Thesis/project.html