Dynamic Author Similarities

1
Dynamic Author Similarities
  • Sheila Alston
  • Old Dominion University
  • Advisor: Dr. Johan Bollen
  • May 6, 2005

2
Agenda
  • Introduction
  • Methodology
  • Implementing Dynamically Generated Graphs
  • Results
  • Conclusions and Future Work
  • Demonstration

3
Overview
  • Growth of the internet since its inception in the early 1990s
  • Information accessible to researchers
  • Challenges in the digital libraries community
    that encourage research

4
A Research Tool
  • Visual representation of on-line publications
  • Visually show a network of author similarities
  • Identify trends

5
General Approach
  • Design a GUI
  • Search a collection
  • Dynamically generate graphs
  • Link authors according to their publication
    similarities
  • Navigate the resulting network of related authors

6
Data Collection
  • D-Lib Magazine collection of documents retrieved
    from the web
  • Publications range from 1995 to 2003
  • Total number of publications: 907
  • Number of unique publications: 753

7
Data Pruning
  • Each document separated into terms
  • Spurious characters (punctuation such as . , ?) were removed
  • Stop words such as be, the, a, each, is, so were removed
  • Approximately 390 stop words
  • No stemming of the terms
  • 35,667 terms remained after stop words were removed
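
The pruning steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word set holds just the examples named on the slide (the full list of roughly 390 words is not given), and lowercasing plus the regular expression used to split terms are assumptions, not the author's actual rules.

```python
import re

# Hypothetical sample: only the stop words named on the slide.
STOP_WORDS = {"be", "the", "a", "each", "is", "so"}

def extract_terms(text):
    """Split text into lowercase terms, dropping punctuation and stop words."""
    # Runs of letters, digits, and apostrophes are terms; anything else separates them.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # No stemming: "archive" and "archives" stay distinct, as on the slide.
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_terms("The archive is open, so each library searches the archives."))
```

Because no stemming is applied, singular and plural forms count as different terms, which is one reason the vocabulary stays as large as 35,667 terms.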

8
Term Frequency
  • Each term was assigned a unique ID
  • Term IDs start at 20,000
  • Starting at 20,000 helps distinguish term IDs from
    document IDs in the TF/IDF matrix
  • Term and term ID were used as input to compute
    the Term Frequency
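
A minimal sketch of the ID scheme, assuming IDs are handed out in order of first appearance and that document IDs always stay below 20,000 (plausible with 753 publications, but not stated on the slide). The function names are hypothetical.

```python
FIRST_TERM_ID = 20000  # offset taken from the slide

def assign_term_ids(terms, first_id=FIRST_TERM_ID):
    """Map each distinct term to a unique ID, numbering upward from first_id."""
    term_ids = {}
    for term in terms:
        if term not in term_ids:
            term_ids[term] = first_id + len(term_ids)
    return term_ids

def is_term_id(identifier, first_id=FIRST_TERM_ID):
    """True if the ID falls in the term range rather than the document range."""
    return identifier >= first_id

ids = assign_term_ids(["archive", "open", "archive", "library"])
print(ids)
print(is_term_id(ids["open"]), is_term_id(753))
```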

9
What is Term Frequency?
  • How often a term occurs in a publication
  • Normalized: the term's frequency in the publication divided by
    the highest frequency of any term in that publication
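
As an illustration of this definition, the sketch below divides each term's count by the count of the most frequent term in the same publication. The function name and the sample terms are made up for the example.

```python
from collections import Counter

def term_frequencies(terms):
    """Return each term's count divided by the count of the most frequent term."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies(["archive", "open", "archive", "library", "archive", "open"])
for term, value in sorted(tf.items()):
    print(term, round(value, 3))
```

With this normalization the most frequent term in every publication gets a value of 1, so long and short publications become comparable.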

10
Inverse Document Frequency
  • Each publication may have one or many authors
  • Each publication was assigned a unique ID
  • There are 753 distinct publications
  • List of unique publications served as input to
    the IDF

11
Inverse Document Frequency
  • An Experimental End-User Service across E-Print
    Archives
  • Herbert Van de Sompel
  • Los Alamos National Laboratory - Research
    Library, New Mexico, US, and
    Automation Department of the Central Library
    of the University of Ghent, Belgium
  • [1] herbert.vandesompel_at_rug.ac.be
  • Thomas Krichel
  • University of Surrey, UK
  • [2] T.Krichel_at_surrey.ac.uk
  • Michael L. Nelson
  • NASA Langley Research Center, Hampton VA, USA
  • [3] m.l.nelson_at_larc.nasa.gov

12
What is Inverse Document Frequency?
  • The idf of term i is given by $\mathrm{idf}_i = \log_{10}(N / n_i)$
  • $N$ is the number of publications
  • $n_i$ is the number of publications that contain
    term i

13
What is Inverse Document Frequency?
  • Provides high values for terms that are rare
  • Provides low values for terms that are common
  • Example: N = 1000 publications
  • Term i appears in all 1000 publications: log10(1000/1000) = 0
  • Term i appears in only 10 publications: log10(1000/10) = 2
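
The two worked values can be checked with a short sketch. It assumes a base-10 logarithm, which is what the slide's numbers imply; the function name is hypothetical.

```python
import math

def inverse_document_frequency(n_publications, n_containing):
    """idf = log10(N / n_i), where n_i is how many publications contain the term."""
    if n_containing <= 0:
        raise ValueError("a term must occur in at least one publication")
    return math.log10(n_publications / n_containing)

# The two cases from the slide: a term in every publication, and a rare term.
print(inverse_document_frequency(1000, 1000))
print(inverse_document_frequency(1000, 10))
```

A term that appears in every publication carries no discriminating information and scores zero, while a term found in only 1% of the collection scores 2.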

14
TF/IDF Weights
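
The formula shown on this slide did not survive in the transcript. Assuming the usual product of the two quantities defined on the preceding slides, the weight of term $i$ in publication $j$ would be:

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i
        = \frac{\mathrm{freq}_{i,j}}{\max_{l} \mathrm{freq}_{l,j}}
          \times \log_{10}\frac{N}{n_i}
```

Here $\mathrm{freq}_{i,j}$ is the raw count of term $i$ in publication $j$, and $N$ and $n_i$ are as defined for the inverse document frequency. The symbol names are assumptions chosen to match those slides.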
15
Vector Space Model
  • An information retrieval method
  • Allows partial matching
  • Weights are assigned to terms in queries and in
    documents
  • Weights are used to compute the degree of
    similarity
  • The weights are sorted

16
Cosine Similarity
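
The formula slides for this section are missing from the transcript. Assuming the standard cosine measure, which is consistent with the numerator (`simnum`) and denominator (`simdenom`) tables built in the SQL that follows, the similarity of authors $x$ and $y$ is:

```latex
\cos(x, y)
  = \frac{\sum_{i} w_{i,x}\, w_{i,y}}
         {\sqrt{\sum_{i} w_{i,x}^{2}}\;\sqrt{\sum_{i} w_{i,y}^{2}}}
```

where $w_{i,x}$ is the TF/IDF weight of term $i$ for author $x$. The value is 1 when two authors have proportional weight vectors and 0 when they share no terms.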
17
Cosine Similarity, cont.
18
Cosine Similarity, cont.
19
Cosine Similarity, cont.
  • Numerator
    -- Dot product of the term weights for every pair of distinct authors
    DROP TABLE IF EXISTS simnum;
    CREATE TABLE simnum (
      authorx int(7),
      authory int(7),
      sim_num double(22,4),
      UNIQUE ndx_author (authorx, authory)
    );
    INSERT INTO simnum (authorx, authory, sim_num)
    SELECT t1.authorid AS authorx,
           t2.authorid AS authory,
           SUM(t1.weight * t2.weight)
    FROM all1 t1, all1 t2
    WHERE t1.authorid <> t2.authorid   -- skip self-pairs
      AND t1.termid = t2.termid        -- only terms both authors share
    GROUP BY t1.authorid, t2.authorid;

20
Cosine Similarity, cont.
  • Denominator
    -- Euclidean length of each author's weight vector
    DROP TABLE IF EXISTS simdenom;
    CREATE TABLE simdenom (
      sim_denom double(15,6) default 0,
      authorx int(7),
      UNIQUE ndx_author (authorx)
    );
    INSERT INTO simdenom (sim_denom, authorx)
    SELECT SQRT(SUM(POW(t1.weight, 2))) AS sim_denom,
           t1.authorid AS authorx
    FROM all1 t1
    GROUP BY t1.authorid;

21
Cosine Similarity, cont.
    -- Combine numerator and both vector lengths into the cosine similarity
    DROP TABLE IF EXISTS cossim;
    CREATE TABLE cossim
    SELECT t1.authorx AS authorx,
           t1.authory AS authory,
           t1.sim_num,
           t2.sim_denom AS simdenom1,
           t3.sim_denom AS simdenom2,
           t1.sim_num / (t2.sim_denom * t3.sim_denom) AS cos_sim
    FROM simnum t1, simdenom t2, simdenom t3
    WHERE t1.authorx = t2.authorx
      AND t1.authory = t3.authorx;
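
For readers who want to check the three SQL steps on a small example, the sketch below mirrors them in plain Python. It assumes the `all1` table holds one `(authorid, termid, weight)` row per author and term; the function name and the sample rows are invented for illustration and are not taken from the project's data.

```python
import math
from collections import defaultdict

def cosine_similarities(rows):
    """rows: iterable of (author_id, term_id, weight) tuples, shaped like all1."""
    vectors = defaultdict(dict)
    for author_id, term_id, weight in rows:
        vectors[author_id][term_id] = weight

    # Denominator step: Euclidean length of each author's weight vector.
    norms = {a: math.sqrt(sum(w * w for w in vec.values()))
             for a, vec in vectors.items()}

    sims = {}
    for x, vec_x in vectors.items():
        for y, vec_y in vectors.items():
            shared = vec_x.keys() & vec_y.keys()
            # Like the SQL join: no self-pairs, no row when no term is shared.
            if x == y or not shared or norms[x] == 0 or norms[y] == 0:
                continue
            numerator = sum(vec_x[t] * vec_y[t] for t in shared)
            sims[(x, y)] = numerator / (norms[x] * norms[y])
    return sims

rows = [
    (1, 20000, 1.0), (1, 20001, 2.0),   # author 1
    (2, 20000, 2.0), (2, 20001, 4.0),   # author 2: same direction as author 1
    (3, 20002, 5.0),                    # author 3: no terms in common
]
for pair, value in sorted(cosine_similarities(rows).items()):
    print(pair, round(value, 4))
```

Authors 1 and 2 have proportional weight vectors and so come out as maximally similar, while author 3 shares no terms with either and produces no pair at all, just as the join in the numerator query would skip it.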

22
Evaluating Results
[Figure: highly related authors with a branch of 3 and a depth of 2]
23
Evaluating Results, Cont.
[Figure: minimally related authors with a branch of 3 and a depth of 2]
24
Evaluating Results, Cont.
[Figure: moderately related authors with a branch of 2 and a depth of 4]
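
The slides do not define "branch" and "depth". One plausible reading, assumed here, is that branch is the number of most similar neighbours drawn for each author and depth is the number of hops followed from the starting author. The sketch below collects graph edges under that reading; the function name, the tie-breaking order, and the sample similarities are all hypothetical.

```python
def expand_network(sims, root, branch, depth):
    """Collect (author, neighbour, similarity) edges around root.

    sims maps (author_x, author_y) -> similarity, stored in both directions.
    """
    edges = []
    seen_pairs = set()
    visited = {root}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for author in frontier:
            # Keep only this author's `branch` most similar neighbours.
            neighbours = sorted(
                ((score, other) for (a, other), score in sims.items() if a == author),
                reverse=True,
            )[:branch]
            for score, other in neighbours:
                pair = frozenset((author, other))
                if pair in seen_pairs:
                    continue  # undirected graph: draw each edge once
                seen_pairs.add(pair)
                edges.append((author, other, score))
                if other not in visited:
                    visited.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return edges

pairs = {("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): 0.1,
         ("B", "E"): 0.7, ("C", "E"): 0.6}
sims = {**pairs, **{(y, x): s for (x, y), s in pairs.items()}}
for edge in expand_network(sims, "A", branch=2, depth=2):
    print(edge)
```

A small branch with a large depth yields long chains of loosely connected authors, while a large branch with a small depth yields a dense neighbourhood around the starting author.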
25
File Layout for Generating Graphs
    graph G {
      JohanBollen -- Luce [label="1.0"];
      Luce [URL="http://tango/project1.jsp?author=Luce_Rick&branch=3&depth=1"];
      JohanBollen -- Vemulapalli [label="0.8"];
      Vemulapalli [URL="http://tango/project1.jsp?author=Vemulapalli_Soma_Sekhara&branch=3&depth=1"];
      JohanBollen -- Xu [label="0.8"];
      Xu [URL="http://tango/project1.jsp?author=Xu_Weining&branch=3&depth=1"];
      JohanBollen [shape=polygon, sides=5];
    }
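
A file in the Graphviz DOT layout this slide outlines could be generated along the following lines. This is a sketch, not the project's actual generator: the function name is hypothetical, and the query parameter names `author`, `branch`, and `depth` are inferred from the URLs on the slide.

```python
from urllib.parse import urlencode

def to_dot(root, neighbours, base_url, branch, depth):
    """Build DOT text; neighbours is a list of (node, author_name, similarity)."""
    lines = ["graph G {"]
    for node, author, score in neighbours:
        query = urlencode({"author": author, "branch": branch, "depth": depth})
        # Edge labelled with the similarity, then a clickable link on the node.
        lines.append(f'  {root} -- {node} [label="{score:.1f}"];')
        lines.append(f'  {node} [URL="{base_url}?{query}"];')
    # Draw the central author as a pentagon so it stands out.
    lines.append(f"  {root} [shape=polygon, sides=5];")
    lines.append("}")
    return "\n".join(lines)

print(to_dot(
    "JohanBollen",
    [("Luce", "Luce_Rick", 1.0), ("Xu", "Xu_Weining", 0.8)],
    "http://tango/project1.jsp",
    branch=3,
    depth=1,
))
```

Attaching a URL to each neighbour is what makes the rendered graph navigable: clicking a node requests a new graph centred on that author.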

26
Conclusion
  • Useful tool for researchers
  • Identifies high and low similarities
  • Gives a graphical representation of which authors
    are in a collection
  • Allows drill-down on individual authors

27
Future Work
  • D-Lib Magazine collection is separated into years
  • Build graphs for author similarities by year
  • Compare trends of authors for different years

28
Demonstration
http://tango.cs.odu.edu:1990/Thesis/project.html