1
Dynamic Author Similarities

Sheila Alston
Old Dominion University
Advisor: Dr. Johan Bollen
May 6, 2005

2
Agenda

Introduction
Methodology
Implementing Dynamically Generated Graphs
Results
Conclusions and Future Work
Demonstration

3
Overview

Rise of the internet since the early 1990s
Information accessible to researchers
Challenges in the digital libraries community that encourage research

4
A Research Tool

Visual representation of on-line publications
Visibly show a network of author similarities
Identify Trends

5
General Approach

Design a GUI
Search a collection
Dynamically generate graphs
Link authors according to their publication similarities
Navigate the resulting network of related authors

6
Data Collection

D-Lib Magazine collection of documents retrieved from the web
Publications range from 1995 to 2003
Total number of publications: 907
Number of unique publications: 753

7
Data Pruning

Each document separated into terms
Spurious characters such as . , ? were removed
Stop words such as be, the, a, each, is, so were removed
Approximately 390 stop words in total
No stemming of the terms
35,667 terms remained after stop words were removed
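
The pruning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the stop list holds only the six words named on the slide (the real list had about 390), and lowercasing and the regular expression are assumptions.

```python
import re

# Hypothetical stop list: only the examples given on the slide.
STOP_WORDS = {"be", "the", "a", "each", "is", "so"}

def prune(text, stop_words=STOP_WORDS):
    """Split text into lowercase terms, dropping punctuation and stop words."""
    # Replace every character that is not a word character or whitespace
    # with a space (assumption: this approximates "spurious characters").
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # No stemming: each term is kept exactly as it appears.
    return [term for term in cleaned.split() if term not in stop_words]

terms = prune("Each library is a digital library, so the libraries grow.")
print(terms)
```

Because no stemming is applied, "library" and "libraries" remain separate terms.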

8
Term Frequency

Each term was assigned a unique ID
Term IDs started at 20,000
Starting at 20,000 helped distinguish term IDs from document IDs in the TF/IDF matrix
Each term and its term ID were used as input to compute the term frequency
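
A small sketch of the ID scheme, assuming that publication IDs stay below 20,000 (plausible for 753 publications, but not stated on the slide). The function names are hypothetical.

```python
TERM_ID_OFFSET = 20000  # first term ID, as stated on the slide

def assign_term_ids(terms, start=TERM_ID_OFFSET):
    """Map each distinct term to a unique ID, in order of first appearance."""
    ids = {}
    for term in terms:
        if term not in ids:
            ids[term] = start + len(ids)
    return ids

def is_term_id(identifier, start=TERM_ID_OFFSET):
    """Assumption: every ID at or above `start` is a term, anything lower a document."""
    return identifier >= start

term_ids = assign_term_ids(["library", "digital", "library", "archive"])
print(term_ids)
```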

9
What is Term Frequency?

How often a term occurs in a publication
Normalized: the frequency of the term in the publication divided by the maximum term frequency in that publication
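
As a worked illustration of this normalization, the sketch below divides each term's count by the count of the most frequent term in the same publication. The function name and sample terms are illustrative only.

```python
from collections import Counter

def term_frequencies(terms):
    """Return {term: count / max count} for one publication's term list."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies(["library", "digital", "library", "archive"])
print(tf)
```

The most frequent term in a publication always receives a term frequency of 1.0.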

10
Inverse Document Frequency

Each publication may have one or many authors
Each publication was assigned a unique ID
There are 753 distinct publications
The list of unique publications served as input to the IDF computation

11
Inverse Document Frequency

An Experimental End-User Service across E-Print Archives
Herbert Van de Sompel
Los Alamos National Laboratory - Research Library, New Mexico, US, and Automation Department of the Central Library of the University of Ghent, Belgium
[1] herbert.vandesompel_at_rug.ac.be
Thomas Krichel
University of Surrey, UK
[2] T.Krichel_at_surrey.ac.uk
Michael L. Nelson
NASA Langley Research Center, Hampton VA, USA
[3] m.l.nelson_at_larc.nasa.gov

12
What is Inverse Document Frequency?

The idf of term i is given by $\mathrm{idf}_i = \log_{10}(N / n_i)$
$N$ is the number of publications
$n_i$ is the number of publications that contain term i

13
What is Inverse Document Frequency?

Provides high values for terms that are rare
Provides low values for terms that are common
Example with $N = 1000$ publications:
$n_i = 1000$ publications contain term i: $\log_{10}(1000/1000) = 0$
$n_i = 10$ publications contain term i: $\log_{10}(1000/10) = 2$
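
The same calculation can be sketched in Python. The base-10 logarithm matches the worked values on this slide; the function name and the toy corpus are assumptions made for illustration.

```python
import math

def inverse_document_frequency(publications):
    """publications: list of term collections, one per publication.

    Returns {term: log10(N / n_i)}, where N is the number of publications
    and n_i the number of publications containing the term.
    """
    n = len(publications)
    doc_counts = {}
    for terms in publications:
        for term in set(terms):  # count each term once per publication
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log10(n / count) for term, count in doc_counts.items()}

# Toy corpus: "common" appears in all 10 publications, "rare" in only one.
corpus = [["common"] for _ in range(9)] + [["common", "rare"]]
idf = inverse_document_frequency(corpus)
print(idf)
```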

14
TF/IDF Weights
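
A TF/IDF weight is conventionally the product of the term frequency and the inverse document frequency. The sketch below assumes that product form, with the normalized term frequency and base-10 idf described earlier in the talk; the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(publications):
    """publications: {pub_id: list of terms}. Returns {pub_id: {term: weight}}."""
    n = len(publications)
    # n_i: number of publications that contain each term.
    doc_counts = Counter()
    for terms in publications.values():
        doc_counts.update(set(terms))
    weights = {}
    for pub_id, terms in publications.items():
        counts = Counter(terms)
        max_count = max(counts.values(), default=1)
        weights[pub_id] = {
            # weight = (count / max count) * log10(N / n_i)
            term: (count / max_count) * math.log10(n / doc_counts[term])
            for term, count in counts.items()
        }
    return weights

weights = tfidf_weights({
    1: ["archive", "archive", "metadata"],
    2: ["archive", "harvest"],
})
print(weights)
```

A term that occurs in every publication ("archive" here) gets weight 0, however often it appears.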
15
Vector Space Model

An information retrieval method
Allows partial matching
Weights are assigned to terms in queries and in documents
Weights are used to compute the degree of similarity
The weights are sorted
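
To illustrate partial matching, the sketch below scores documents against a query by cosine similarity and sorts them by score. The vectors are made-up weights, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return (score, doc_id) pairs, highest score first."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in documents.items()]
    return sorted(scored, reverse=True)

documents = {
    "d1": {"archive": 1.0, "metadata": 1.0},
    "d2": {"archive": 1.0},   # matches only one of the two query terms
    "d3": {"graph": 1.0},     # matches none
}
ranking = rank({"archive": 1.0, "metadata": 1.0}, documents)
print(ranking)
```

Document d2 shares only one query term yet still receives a nonzero score, which is what partial matching means in this model.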

16
Cosine Similarity
17
Cosine Similarity, cont.
18
Cosine Similarity, cont.
19
Cosine Similarity, cont.
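
For reference, the standard cosine similarity between two authors x and y can be written as follows, where $w_{x,t}$ is the TF/IDF weight of term t for author x (this notation is assumed here, not taken from the slides). The numerator sums over terms, with a weight of 0 for any term an author does not use.

```latex
\[
\cos(x, y) =
  \frac{\sum_{t} w_{x,t}\, w_{y,t}}
       {\sqrt{\sum_{t} w_{x,t}^{2}}\;\sqrt{\sum_{t} w_{y,t}^{2}}}
\]
```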

Numerator
-- Dot product of the weights of the terms that two distinct authors share
DROP TABLE IF EXISTS simnum;
CREATE TABLE simnum
( authorx int(7),
  authory int(7),
  sim_num double(22,4),
  UNIQUE ndx_author (authorx, authory));
INSERT INTO simnum (authorx, authory, sim_num)
SELECT t1.authorid AS authorx
     , t2.authorid AS authory
     , SUM(t1.weight * t2.weight)
FROM all1 t1
   , all1 t2
WHERE t1.authorid <> t2.authorid
  AND t1.termid = t2.termid
GROUP BY t1.authorid, t2.authorid;

20
Cosine Similarity, cont.

Denominator
-- Length (Euclidean norm) of each author's weight vector
DROP TABLE IF EXISTS simdenom;
CREATE TABLE simdenom
( sim_denom double(15,6) default 0,
  authorx int(7),
  UNIQUE ndx_author (authorx));
INSERT INTO simdenom (sim_denom, authorx)
SELECT SQRT(SUM(POW(t1.weight, 2))) AS sim_denom
     , t1.authorid AS authorx
FROM all1 t1
GROUP BY t1.authorid;

21
Cosine Similarity, cont.

-- Cosine similarity: numerator divided by the product of the two vector lengths
DROP TABLE IF EXISTS cossim;
CREATE TABLE cossim
SELECT t1.authorx AS authorx
     , t1.authory AS authory
     , t1.sim_num
     , t2.sim_denom AS simdenom1
     , t3.sim_denom AS simdenom2
     , t1.sim_num / (t2.sim_denom * t3.sim_denom) AS cos_sim
FROM simnum t1
   , simdenom t2
   , simdenom t3
WHERE t1.authorx = t2.authorx
  AND t1.authory = t3.authorx;
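
The three SQL steps can be mirrored in memory for a quick check on small data. This is an illustrative sketch, not part of the project: it assumes the input rows carry the same three columns the queries read from all1 (authorid, termid, weight), and the function name is hypothetical.

```python
import math
from collections import defaultdict

def author_cosine(rows):
    """rows: iterable of (authorid, termid, weight). Returns {(x, y): cos_sim}."""
    by_term = defaultdict(list)       # termid -> [(authorid, weight), ...]
    squared = defaultdict(float)      # authorid -> sum of squared weights
    for author, term, weight in rows:
        by_term[term].append((author, weight))
        squared[author] += weight * weight

    # simnum: sum of weight products over shared terms, distinct authors only.
    numerator = defaultdict(float)
    for entries in by_term.values():
        for ax, wx in entries:
            for ay, wy in entries:
                if ax != ay:
                    numerator[(ax, ay)] += wx * wy

    # simdenom: length of each author's weight vector.
    length = {author: math.sqrt(total) for author, total in squared.items()}

    # cossim: skip pairs whose vector length is zero to avoid dividing by zero.
    return {
        (x, y): num / (length[x] * length[y])
        for (x, y), num in numerator.items()
        if length[x] and length[y]
    }

rows = [(1, 20000, 1.0), (1, 20001, 1.0), (2, 20000, 1.0), (3, 20002, 1.0)]
sims = author_cosine(rows)
print(sims)
```

As in the SQL, authors with no shared terms (author 3 here) produce no row at all, and each related pair appears in both orders.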

22
Evaluating Results

Highly related authors, with a branch of 3 and a depth of 2

23
Evaluating Results, cont.

Authors minimally related, with a branch of 3 and a depth of 2

24
Evaluating Results, cont.

Moderately related authors, with a branch of 2 and a depth of 4
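
The captions do not define branch and depth. One plausible reading, assumed here, is that branch is the number of most similar authors linked from each node and depth is the number of hops expanded from the starting author. Under that assumption, a network could be built as follows; all names and similarity values are made up.

```python
def expand(sims, start, branch, depth):
    """sims: {(authorx, authory): cos_sim}. Returns [(source, target, sim), ...]."""
    edges = []
    drawn = set()        # undirected pairs already emitted
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for author in frontier:
            # The `branch` most similar neighbours of this author.
            neighbours = sorted(
                ((s, y) for (x, y), s in sims.items() if x == author),
                reverse=True,
            )[:branch]
            for s, y in neighbours:
                pair = frozenset((author, y))
                if pair not in drawn:
                    drawn.add(pair)
                    edges.append((author, y, s))
                if y not in seen:
                    seen.add(y)
                    next_frontier.append(y)
        frontier = next_frontier
    return edges

pairs = {("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): 0.1, ("B", "E"): 0.7}
sims = {}
for (x, y), s in pairs.items():
    sims[(x, y)] = s     # store both directions, as the cossim table does
    sims[(y, x)] = s

edges = expand(sims, "A", branch=2, depth=2)
print(edges)
```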
25
File Layout for Generating Graphs

graph G {
JohanBollen -- Luce [label="1.0"]
Luce [URL="http://tango/project1.jsp?author=Luce_Rick&branch=3&depth=1"]
JohanBollen -- Vemulapalli [label="0.8"]
Vemulapalli [URL="http://tango/project1.jsp?author=Vemulapalli_Soma_Sekhara&branch=3&depth=1"]
JohanBollen -- Xu [label="0.8"]
Xu [URL="http://tango/project1.jsp?author=Xu_Weining&branch=3&depth=1"]
JohanBollen [shape=polygon, sides=5]
}
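
A file in this layout could be generated along the following lines. This is a hedged sketch: the function and its parameters are hypothetical, and the query parameter names (author, branch, depth) are an assumption read off the URLs on the slide.

```python
from urllib.parse import urlencode

def to_dot(center, edges, full_names, base_url, branch, depth):
    """Render edges [(source, target, sim), ...] as Graphviz DOT text.

    full_names maps a node name to the author value used in its link.
    """
    lines = ["graph G {"]
    for source, target, sim in edges:
        # Edge labelled with the similarity score.
        lines.append(f'{source} -- {target} [label="{sim:.1f}"]')
        # Clickable node that reopens the page centred on that author.
        query = urlencode(
            {"author": full_names[target], "branch": branch, "depth": depth}
        )
        lines.append(f'{target} [URL="{base_url}?{query}"]')
    # Highlight the author the graph is centred on.
    lines.append(f"{center} [shape=polygon, sides=5]")
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(
    "JohanBollen",
    [("JohanBollen", "Luce", 1.0)],
    {"Luce": "Luce_Rick"},
    "http://tango/project1.jsp",
    branch=3,
    depth=1,
)
print(dot)
```

The resulting text can be passed to a Graphviz layout program to render the graph.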

26
Conclusion