Title: Small World Networks: Applications in Document Clustering and Healthcare
1Small World Networks: Applications in Document Clustering and Healthcare
- Brant Chee
- Bruce Schatz
- University of Illinois
- http://www.beespace.uiuc.edu
2Small World Graph
Clauset et al., 2004
3Small World Graph
- Characteristic Path Length L
- The typical separation of nodes in a graph.
- $L_{rand} \approx \ln(N) / \ln(z)$, where N is the number of nodes and z the average degree
- Clustering Coefficient C
- Average fraction of pairs of neighbors of a node
which are also neighbors of each other.
- Roughly: how close each node's neighborhood is to being a clique!
- $C_{rand} \approx z / N$
- Small World Graph
- $C \gg C_{rand}$
- $L \approx L_{rand}$
- $N \gg z \gg \ln(N)$
Newman, 2000
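The two statistics defined above can be computed directly from an adjacency list. The following is a minimal sketch, not the authors' code: the graph representation (a dict mapping each node to a set of neighbors), the function names, and the toy ring lattice are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations
from math import log


def characteristic_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbor pairs that are linked."""
    fractions = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            fractions.append(0.0)  # assumed convention: no neighbor pairs -> 0
            continue
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        fractions.append(linked / (k * (k - 1) / 2))
    return sum(fractions) / len(fractions)


def random_graph_baselines(adj):
    """Return (L_rand, C_rand) = (ln N / ln z, z / N) for the same N and z."""
    n = len(adj)
    z = sum(len(nbrs) for nbrs in adj.values()) / n
    return log(n) / log(z), z / n


def ring_lattice(n, k):
    """Toy graph: n nodes on a ring, each linked to its k nearest on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


graph = ring_lattice(12, 2)
L = characteristic_path_length(graph)
C = clustering_coefficient(graph)
L_rand, C_rand = random_graph_baselines(graph)
print(f"L={L:.3f} L_rand={L_rand:.3f} C={C:.3f} C_rand={C_rand:.3f}")
```

Comparing C with C_rand and L with L_rand for the same N and z is the check the small-world conditions describe; the toy lattice is far too small to satisfy N >> z >> ln(N) and only exercises the functions.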
4SW MI Graph
Sole et al., 2003
5Purpose: So What?
- Facilitate Exploratory Process
- Search result clustering
- Information discovery
- Develop Middle Ground Algorithms
- Interactive responses AND
- Useful clusters
- Language as a Small World Network
- Make use of underlying structure of language
6System Overview
7Graph Construction
- A node is a term in the index
- Terms bounded by frequency cutoff.
- Terms occurring in < 5 documents or > 25 documents are
removed.
- Edges between nodes are determined by Mutual
Information
- P(x,y) is calculated in a window of the size of
the abstract
$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
Church and Hanks, 1989
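The construction steps on this slide can be sketched as follows. This is an illustrative sketch under several assumptions that the slide does not state: whitespace tokenization, probabilities estimated as the fraction of abstracts containing a term (or both terms), the log2 ratio P(x,y) / (P(x) P(y)) from Church and Hanks (1989) as the edge weight, and an edge kept when the weight is at or above the threshold. The function name and parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def build_mi_graph(abstracts, min_df=5, max_df=25, threshold=0.0):
    """Return (nodes, edges) where edges maps (term_x, term_y) -> MI weight.

    Assumptions (not from the slides): whitespace tokens, document-level
    co-occurrence, and log2(P(x,y) / (P(x) P(y))) as the MI score.
    """
    n_docs = len(abstracts)
    term_sets = [set(text.lower().split()) for text in abstracts]

    # Document frequency of every term, then the frequency cutoff.
    df = Counter(term for terms in term_sets for term in terms)
    nodes = {term for term, count in df.items() if min_df <= count <= max_df}

    # Co-occurrence counts, using the whole abstract as the window.
    pair_df = Counter()
    for terms in term_sets:
        kept = sorted(terms & nodes)
        pair_df.update(combinations(kept, 2))

    edges = {}
    for (x, y), count in pair_df.items():
        p_xy = count / n_docs
        p_x, p_y = df[x] / n_docs, df[y] / n_docs
        mi = log2(p_xy / (p_x * p_y))
        if mi >= threshold:
            edges[(x, y)] = mi
    return nodes, edges


# Toy abstracts, invented for illustration only.
docs = [
    "gene expression brain",
    "gene expression bee",
    "brain plasticity",
    "bee plasticity",
]
nodes, edges = build_mi_graph(docs, min_df=2, max_df=3, threshold=0.5)
print(sorted(nodes))
print(edges)
```

Raising the threshold removes weak edges and, with them, nodes that end up isolated, which is the trade-off explored when choosing a threshold.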
8What threshold?
Threshold  N     z     L     L_rand  ΔL     C     C_rand
0.001      6612  9.85  4.31  3.85    0.46   0.57  0.002
0.002      4087  5.09  4.84  5.11    0.27   0.68  0.001
0.005      1342  2.66  5.51  7.36    1.85   0.75  0.002
0.01       517   1.69  1.21  11.91   10.7   0.81  0.003
0.02       161   1.22  0.34  25.55   25.2   0.91  0.008
0.05       25    1.12  0.06  28.40   28.3   1.0   0.045
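The random-graph columns in the table follow from N and z alone, using L_rand ≈ ln(N)/ln(z) and C_rand ≈ z/N from the earlier slide. A short sketch that recomputes them is given below; the row values are copied from the table, while the variable and function names are hypothetical.

```python
from math import log

# (threshold, N, z, L) copied from the table above.
ROWS = [
    (0.001, 6612, 9.85, 4.31),
    (0.002, 4087, 5.09, 4.84),
    (0.005, 1342, 2.66, 5.51),
    (0.01, 517, 1.69, 1.21),
    (0.02, 161, 1.22, 0.34),
    (0.05, 25, 1.12, 0.06),
]


def random_baselines(n, z):
    """Random-graph estimates for the same N and z: (ln N / ln z, z / N)."""
    return log(n) / log(z), z / n


for threshold, n, z, path_len in ROWS:
    l_rand, c_rand = random_baselines(n, z)
    gap = abs(path_len - l_rand)
    print(f"{threshold:<6} L_rand={l_rand:6.2f}  |L - L_rand|={gap:6.2f}  C_rand={c_rand:.3f}")
```

Reading the output against the table shows where L stays close to L_rand while C remains far above C_rand, which is the regime the small-world conditions ask for.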
9Where to cut?
10Clustering Algorithm
- Clauset, Newman and Moore, 2004
- Generalization for nodes based upon Newman's
algorithm.
- Based upon modularity: the fraction of edges
within communities minus the fraction expected to
fall within them at random in the same network.
0 if there is little community structure, about .3
or above if there is significant structure.
- If just looking at the fraction of edges within
communities, then the maximum will always be
when all nodes are in one cluster.
δ(ci,cj) = 1 if ci and cj are the same
community, 0 otherwise
m = number of edges in the graph (2m is the sum of all node degrees)
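For reference, the modularity defined in Clauset, Newman and Moore (2004) is Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j), where A_ij is 1 if nodes i and j share an edge and k_i is the degree of node i. The sketch below evaluates that sum directly for a given partition. It is not the fast greedy algorithm itself, and the function name, input format, and toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    linked = set(edges) | {(v, u) for u, v in edges}

    degree = {node: 0 for node in community}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    q = 0.0
    for i in community:
        for j in community:
            if community[i] != community[j]:
                continue  # delta(c_i, c_j) = 0
            a_ij = 1 if (i, j) in linked else 0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(toy_edges, two_groups))
print(modularity(toy_edges, one_group))
```

Placing every node in one community gives Q = 0, because the expected-at-random term cancels the within-community edges exactly; this is why modularity avoids the trivial maximum described above.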
11Experiments
- 3 clustering algorithms
- Complete Link (CLUTO)
- K-means (CLUTO)
- Small World
12Test Collections
Collection    | Search Terms | Number of Abstracts | Number of Terms
C1 (General)  | plasticity OR acetylcholine | 81,746 | 267,981
C2 (Specific) | microarray OR muscarinic OR plasticity OR ((cholinergic OR noradrenergic) AND receptor) | 74,533 | 285,623
13Experimental Setup
- Parameters left at package defaults
- Clustered with n = 50, 100, 150 and 200.
- Clusters with fewer than 4 elements or more than
50 elements were eliminated, and the clustering
which resulted in fewer than 40 clusters was
chosen to be evaluated.
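The size filter described above can be sketched in a few lines. The slide does not say how ties were resolved when more than one value of n left fewer than 40 clusters, so this sketch only lists the qualifying candidates rather than picking one. The function names and the example data are hypothetical.

```python
def filter_clusters(clusters, min_size=4, max_size=50):
    """Drop clusters with fewer than min_size or more than max_size elements."""
    return [c for c in clusters if min_size <= len(c) <= max_size]


def candidate_clusterings(clusterings, max_clusters=40):
    """Map each n to its filtered clusters, keeping only runs that leave
    fewer than max_clusters clusters after filtering."""
    candidates = {}
    for n, clusters in clusterings.items():
        kept = filter_clusters(clusters)
        if len(kept) < max_clusters:
            candidates[n] = kept
    return candidates


# Invented example: cluster contents are just integer ids.
example = {
    50: [list(range(size)) for size in (3, 4, 50, 51)],
    100: [list(range(10)) for _ in range(45)],
}
for n, kept in candidate_clusterings(example).items():
    print(n, [len(c) for c in kept])
```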
14Quantitative Results
16Conclusions
- Developed Balanced Clustering System
- Fast running time
- Good clustering results
- Modified Small World Algorithm
- Clustered text based on language model
- Produced many similar sized clusters
17Social Networks as Small World Networks
- Social Network
- Network demonstrating who interacts with whom
- Threaded messages in a Newsgroup
- Create a network based on various characteristics
- Homophily
- Similar people tend to interact more than those
who are dissimilar
- Race, Age, Gender, Social Class
18Social Networks Inform Healthcare
- You do what your peers do
- Framingham Study
- 20 years of data
- Manually constructed networks
- Smoking Cessation
- Obesity
- Happiness
- Can we construct Social Networks automatically?
19Social Network Construction and Evaluation
- We have lots of text available
- 30K message groups from Yahoo! Health
- Utilize threaded messaging to establish network
- Our cognitive model is evident in what we write
- Differentiate Schizophrenic from
non-Schizophrenic
- LIWC
- Poets who commit suicide vs those that do not
- Differentiate depressed vs non-depressed college
students
- Sentiment: positive or negative polarity
- Score evaluation metric
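One way to turn threaded messages into a network, as mentioned earlier on this slide, is to link each author to the author of the message they reply to. The slides do not specify the edge rule, so the sketch below is an assumption throughout: the message fields (`id`, `author`, `parent`), the undirected edges, and the reply-count weights are all hypothetical.

```python
from collections import Counter


def build_reply_network(messages):
    """Return a Counter mapping (author_a, author_b) -> number of replies.

    messages: list of dicts with hypothetical keys 'id', 'author', and
    'parent' (id of the message being replied to, or None for a new thread).
    """
    author_of = {msg["id"]: msg["author"] for msg in messages}
    weights = Counter()
    for msg in messages:
        parent = msg["parent"]
        if parent is None or parent not in author_of:
            continue  # thread starter, or parent missing from the data
        a, b = msg["author"], author_of[parent]
        if a != b:  # ignore self-replies
            weights[tuple(sorted((a, b)))] += 1
    return weights


# Invented thread for illustration.
thread = [
    {"id": 1, "author": "alice", "parent": None},
    {"id": 2, "author": "bob", "parent": 1},
    {"id": 3, "author": "alice", "parent": 2},
    {"id": 4, "author": "carol", "parent": 1},
]
print(build_reply_network(thread))
```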
20Example Message
- Hi All, I need your input. I'm having about
27,000 extra pre-ventricular beats in a 24 hour
period, per a Holter monitor test.
My electrophysiologist and cardiologist agree
that I should go on sotalol/Betapace.
They are putting me in the hospital on February
26 to titrate me up on it. I've refused the drug
in the past because it is such a dangerous
drug. Is there anyone out there who could give
me an idea of how you've done on this drug? I'd
sure appreciate hearing about your experiences.
Thanks so much.
21Sentiment
22Results
- Sentiment over all messages
- Proxy for mental model: how happy they are
- Difference in average sentiment between two
people
- Higher between random people in a network
- Lower for pairs that are closely connected
- Test methodology
- Compare means of differences between highly
connected nodes vs random pairs of nodes
- T-Test for statistical significance
- P-value < .0001 for 10 randomly selected groups
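The comparison of means described above can be sketched with the standard library. The slide says only "T-Test", so the Welch (unequal-variance) form used here is an assumption, and the sentiment differences are made-up numbers for illustration, not results from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, dof


# Hypothetical absolute differences in average sentiment per pair.
connected_pairs = [0.05, 0.08, 0.03, 0.10, 0.06, 0.04]
random_pairs = [0.20, 0.35, 0.15, 0.28, 0.31, 0.22]

t_stat, dof = welch_t(connected_pairs, random_pairs)
print(f"t = {t_stat:.2f}, dof = {dof:.1f}")
```

A strongly negative t here would mean closely connected pairs differ less in sentiment than random pairs; turning t into a p-value needs the t distribution, which a statistics package would supply.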
23Acknowledgements
- Nyla Ismail for evaluating results
- Todd Littell for the MI code
24Questions?
- Live demonstration available at
- http://www.beespace.uiuc.edu
25References
- Church, K. W. and Hanks, P., (1989). Word
association norms, mutual information, and
lexicography. In Proc. of the 27th Annual
Conference of the Association for Computational
Linguistics, (Vancouver, B.C.), ACM Press, 76-83.
- Clauset, A., Newman, M. E. J., and Moore, C.,
(2004). Finding community structure in very
large networks. Phys. Rev. E, 70 (6), 066111.
- Kuhlthau, C. C., (1989). Information search
process: A summary of research and implications
for school library media programs. SLMQ, 18 (1).
- Newman, M. E. J., (2000). Models of the small
world. J. Stat. Phys., 101, 819-841.
- Solé, R., Ferrer-Cancho, R., Montoya, J. M., and
Valverde, S., (2003). Selection, tinkering, and
emergence in complex networks. Complexity, 8 (1),
20-33.