Title: Term Cooccurrence Analysis as an Interface to Digital Libraries
1Term Co-occurrence Analysis as an Interface to
Digital Libraries
- Jan W. Buzydlowski
- Howard D. White
- Xia Lin
- College of Information Science and Technology
- Drexel University, Philadelphia, Pennsylvania, USA
2Digital Library Research
- First Wave
- How to store it
- Next Wave
- How to retrieve it (IR)
- Text Mining
- Visual Information Retrieval Interface (VIRI)
- Term Co-occurrence Analysis (TCA)
- Co-occurrence vs. lexical associations
- Maps vs. lists
3Term Definition
- Unit of Analysis
- Words
- Documents
- Authors
- Journals
- Section of Focus
- Abstract/Text
- Title
- Bibliography
- Keywords
4Example
- Words in Title
- Term
- Co-occurrence
- Analysis
- Interface
- Digital
- Library
- Authors in Bibliography
- Salton-G
- Chen-C
- White-HD
- Ding-Y
- Cleveland-W
- McCain-K
- Lin-X
- Schvaneveldt-R
- Kamada-T
- Fruchterman-T
5Term Co-occurrence Methodology
- User determines which terms are of interest
- Via a seed term
- From a pre-defined list
- The system returns the pair-wise co-occurrence
counts of the terms over the collection of records
6Example
- Unit: Author; Section: Bibliography
- User-supplied list: Plato, Aristotle, Smith, Brown
- For a given data set (N = 4 unique terms)
- Article 1: Plato, Aristotle, Smith
- Article 2: Plato, Smith
- Article 3: Plato, Aristotle, Smith, Brown
- The following co-citations (C(4,2) = 6) are found
- COMBINATION | COUNT | ARTICLES
- Plato and Smith | 3 | 1, 2, 3
- Plato and Aristotle | 2 | 1, 3
- Plato and Brown | 1 | 3
- Aristotle and Smith | 2 | 1, 3
- Aristotle and Brown | 1 | 3
- Smith and Brown | 1 | 3
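The counts listed above can be reproduced with a short program. The following is a minimal sketch, not the system's actual code: the class `PairCounts` and its method `count` are hypothetical names introduced for illustration, and each article is assumed to be available as a set of cited authors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the system described in these slides.
final class PairCounts {

    // Maps each pair "A and B" (in the order of `terms`) to the number of
    // records that contain both A and B. Pairs that never co-occur get 0.
    static Map<String, Integer> count(List<String> terms, List<Set<String>> records) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                String a = terms.get(i);
                String b = terms.get(j);
                int n = 0;
                for (Set<String> record : records) {
                    if (record.contains(a) && record.contains(b)) {
                        n++;
                    }
                }
                counts.put(a + " and " + b, n);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The user-supplied list and the three articles from the example.
        List<String> terms = List.of("Plato", "Aristotle", "Smith", "Brown");
        List<Set<String>> articles = List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown"));
        count(terms, articles).forEach((pair, n) -> System.out.println(pair + ": " + n));
    }
}
```

The nested loop visits each unordered pair exactly once, so four terms yield the six combinations shown in the example.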
7Term Co-occurrence Significance
- The frequent co-occurrence of term pairs within a
set of documents indicates a strong association
between those terms, whereas an infrequent count
indicates the opposite
- The association you would expect is borne out by
the frequency
- The frequency you compute suggests a level of
association
- Pain and Management vs. Pain and Obtainment
- Plato and Aristotle vs. Plato and Cher
- Science and Nature vs. Science and National Tattler
- A and B vs. C and D
8Term Co-occurrence Uses
- Allows a user to get a foothold with just one
term
- One seed term returns many other related terms
- Allows a user to get an overview with
user-supplied/system-supplied terms
- Co-occurrence counts with visualization
9Seeding
- User types in
- One term, e.g., Plato
- Boolean expression, e.g., Plato AND Brown
- System supplies top n terms, in ranked order of
frequency of co-occurrence with the initial term
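The single-term seeding step can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: `SeedExpansion` and `topRelated` are hypothetical names, records are assumed to be sets of terms, ties are broken alphabetically (the slides do not say how ties are handled), and Boolean seed expressions are not covered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the system described in these slides.
final class SeedExpansion {

    // Returns up to n terms that share at least one record with `seed`,
    // ordered by descending co-occurrence count (alphabetical on ties).
    static List<String> topRelated(String seed, int n, List<Set<String>> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> record : records) {
            if (!record.contains(seed)) {
                continue; // only records containing the seed contribute
            }
            for (String term : record) {
                if (!term.equals(seed)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        int limit = Math.max(0, Math.min(n, ranked.size()));
        return new ArrayList<>(ranked.subList(0, limit));
    }

    public static void main(String[] args) {
        // Reuses the three-article example; the seed is "Plato".
        List<Set<String>> articles = List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown"));
        System.out.println(topRelated("Plato", 2, articles)); // prints [Smith, Aristotle]
    }
}
```

A linear scan is enough for a toy collection. A production system would more plausibly answer the same question from an index.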
10Example
ARISTOTLE PLUTARCH CICERO HOMER BIBLE EURIPIDES
ARISTOPHANES XENOPHON AUGUSTINE HERODOTUS KANT-I
AESCHYLUS SOPHOCLES THUCYDIDES OVID HESIOD
DIOGENES-LAERTI HEIDEGGER-M DERRIDA-J PINDAR
NIETZSCHE-F HEGEL-GWF VERGIL AQUINAS-T
11Need for Visualization
- Given a list of user- / system-supplied terms
- Find the frequency of co-occurrence of each
pair-wise combination of terms
- Plato AND Aristotle: 1,920
- Plato AND Plutarch: 380
-
- Too many numbers to take in at once
- C(25, 2) = (25 × 24) / 2 = 300 pairs
- Three major visualization techniques
- Multidimensional Scaling (MDS)
- Self-Organizing (Kohonen) Maps (SOMs)
- PathFinder Networks (PFNETs)
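The pair count quoted above, and the symmetric matrix of counts that layout techniques such as these typically start from, can be sketched as below. `PairMatrix`, `pairCount`, `build`, and the `counter` callback are hypothetical names, not the system's API. The diagonal is left at 0 as a simplifying assumption, and the sample data is the small three-article example rather than the 25-term list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntBiFunction;

// Hypothetical helper, not part of the system described in these slides.
final class PairMatrix {

    // Number of unordered pairs among n terms: C(n, 2) = n(n - 1) / 2.
    static long pairCount(int n) {
        return (long) n * (n - 1) / 2;
    }

    // Builds a symmetric n x n matrix of co-occurrence counts.
    // `counter` stands in for whatever returns the count for two terms.
    static int[][] build(List<String> terms, ToIntBiFunction<String, String> counter) {
        int n = terms.size();
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int c = counter.applyAsInt(terms.get(i), terms.get(j));
                m[i][j] = c; // upper triangle
                m[j][i] = c; // mirrored, since co-occurrence is symmetric
            }
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(pairCount(25)); // prints 300

        List<String> terms = List.of("Plato", "Aristotle", "Smith", "Brown");
        List<Set<String>> articles = List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown"));
        int[][] m = build(terms, (a, b) -> (int) articles.stream()
            .filter(r -> r.contains(a) && r.contains(b))
            .count());
        for (int[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Only the upper triangle is computed, so the counting callback runs C(n, 2) times regardless of how the matrix is later consumed.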
12White's MDS map of 15 co-cited classificationists,
ca. 1990
- Authors shown: P Arabie, JH Ward, JC Gower, M Wish, RN Shepard, RR Sokal, JB Kruskal, SC Johnson, PHA Sneath, JD Carroll, PE Green, JA Hartigan, HA Skinner, VE McGee, RK Blashfield
14White's PFNet of co-cited authors in Biblical and
literary hermeneutics, 1988-1997
15Our System
- Three-tiered
- User interface
- Server
- Database
- Real-time and interactive
- Significant data sources
- ISI AHCI
- MedLine
- Live interface for retrieval
17User Interface - Seed
18User Interface - SOM
19Interface - PFNET
20Interface - Visual Information Retrieval
Interface (VIRI)
21User Interface IV
22Database Interface
- API
- String findRel( String, int )
- int findOcc( String )
- Implemented on
- BRS
- API via a wrapper
- Oracle
- API via JDBC
- Noah
- Specialized co-occurrence database
- API via JNI
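One way to picture the two-method API is as a Java interface with interchangeable backends. The sketch below is hypothetical throughout: the slide gives only the signatures, so the parameter meanings (a term plus a result limit for `findRel`, a single term for `findOcc`), the comma-delimited return format, and the names `CoOccurrenceSource`, `InMemorySource`, and `ApiDemo` are all assumptions. The in-memory backend merely stands in for BRS, Oracle, or Noah.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo; none of these types come from the actual system.
final class ApiDemo {
    public static void main(String[] args) {
        CoOccurrenceSource db = new InMemorySource(List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown")));
        System.out.println(db.findOcc("Plato"));    // prints 3
        System.out.println(db.findRel("Plato", 2)); // prints Smith,Aristotle
    }
}

// Signatures follow the slide; the semantics in the comments are assumed.
interface CoOccurrenceSource {
    // Assumed: up to n terms most often co-occurring with `term`, comma-separated.
    String findRel(String term, int n);

    // Assumed: number of records in which `term` occurs.
    int findOcc(String term);
}

// Toy backend over in-memory records, standing in for a real database.
final class InMemorySource implements CoOccurrenceSource {
    private final List<Set<String>> records;

    InMemorySource(List<Set<String>> records) {
        this.records = records;
    }

    @Override
    public int findOcc(String term) {
        int n = 0;
        for (Set<String> record : records) {
            if (record.contains(term)) {
                n++;
            }
        }
        return n;
    }

    @Override
    public String findRel(String term, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> record : records) {
            if (!record.contains(term)) {
                continue;
            }
            for (String other : record) {
                if (!other.equals(term)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        int limit = Math.max(0, Math.min(n, ranked.size()));
        return String.join(",", ranked.subList(0, limit));
    }
}
```

Keeping the interface this narrow is what would let the same user interface and server run against a wrapper, JDBC, or JNI backend without changes.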
23Future Plans
- User Study
- Preference
- Type of map, etc.
- Cognitive map
- How well does the map match experts' mental
models
- Larger datasets
- Additional data sources