Title: Thesaurus Creation/Term Clustering
1Thesaurus Creation/Term Clustering
- Two major applications
- Query expansion
- fleshing out sparse queries with related words
- improves recall
- (at possible expense of reduced precision)
- Termset dimensionality reduction
- Similar outcome with smaller model
2Term Dimensionality Reduction
Query: Water Spaniel Diseases
Original vector VQ, over terms: Spaniel, Spaniels, disease, diseases, collie, illness, ant
Reduced vector VQ, over term clusters: DOG = 1, ILL = 1, INSECT = 0
Other phrasings that fall in the same clusters:
- Collie illnesses
- Poodle sickness
Problem: reduced flexibility in partial weighting of the synonym set. Synonyms get as much weight as the original term.
- Equivalent to query expansion
- when the weight for every synonym is 1
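The reduction above can be sketched in a few lines. This is only an illustration: the `TERM_TO_CLUSTER` map and the `reduce_vector` helper are hypothetical names, and the cluster assignments are assumed from the DOG / ILL / INSECT example rather than taken from any real thesaurus.

```python
# Hypothetical term-to-cluster map, assumed from the DOG / ILL / INSECT example.
TERM_TO_CLUSTER = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLUSTERS = ["DOG", "ILL", "INSECT"]


def reduce_vector(text):
    """Map a text to a binary vector over term clusters (not over terms)."""
    present = {
        TERM_TO_CLUSTER[token]
        for token in text.lower().split()
        if token in TERM_TO_CLUSTER
    }
    return [1 if cluster in present else 0 for cluster in CLUSTERS]


print(reduce_vector("Water Spaniel Diseases"))  # [1, 1, 0]
print(reduce_vector("Collie illnesses"))        # [1, 1, 0]
```

Because every member of a cluster sets the same component to 1, the sketch also shows the stated problem: a synonym counts exactly as much as the original term.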
3Query Expansion
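Query: Water Spaniel Diseases
Original query vector: Water = 1, Spaniel = 1, diseases = 1
Expanded query vector:
- Water = 1, Spaniel = 1, diseases = 1
- Spaniels = weight 1, disease = weight 2 (related by stem)
- illness = weight 3, collie = weight 4 (related by syn)
- ant = 0
Weight i is set according to semantic dist(wi, t)
Related document set:
- D1: Water Spaniels (stem)
- D2: Water Spaniel illnesses (syn)
- D3: Collie diseases (syn)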
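A minimal sketch of weighted expansion, assuming related terms receive a weight between 0 and 1 while original query terms keep weight 1. The `RELATED` table and its numeric weights are invented for illustration; the slides do not give concrete values.

```python
# Hypothetical related-term table; the weights are illustrative assumptions.
RELATED = {
    "spaniel": {"spaniels": 0.9, "collie": 0.3},
    "diseases": {"disease": 0.9, "illness": 0.6, "illnesses": 0.6},
}


def expand_query(query_terms, related):
    """Return {term: weight}; original terms keep weight 1.0."""
    expanded = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for rel_term, weight in related.get(term, {}).items():
            # Never let an expansion weight override a stronger existing one.
            expanded[rel_term] = max(expanded.get(rel_term, 0.0), weight)
    return expanded


def score(expanded, doc_terms):
    """Sum the weights of expanded-query terms present in the document."""
    return sum(w for term, w in expanded.items() if term in doc_terms)


query = expand_query(["water", "spaniel", "diseases"], RELATED)
docs = {
    "D1": {"water", "spaniels"},
    "D2": {"water", "spaniel", "illnesses"},
    "D3": {"collie", "diseases"},
}
for name, terms in docs.items():
    print(name, round(score(query, terms), 2))
```

Setting every weight in `RELATED` to 1 makes the scoring behave like the cluster-reduced vector, which is the equivalence the previous slide points out.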
4Query Expansion
Query: Water Spaniel diseases
document1: water spaniels ..
5Simplest Term Clustering
- Stemming is a clustering method
Original term set: computing, computers, computation, compute, flies, flown, flew, houses
Clustered term set (after stemming): comput, fly, house
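Stemming as clustering amounts to grouping terms by their stem. The sketch below uses a hypothetical lookup table, `STEM`, that simply reproduces the example; it is an assumption made for illustration, since a purely suffix-stripping stemmer would generally not conflate irregular forms such as flew and fly.

```python
from collections import defaultdict

# Hypothetical stem lookup reproducing the slide's example (not a real stemmer).
STEM = {
    "computing": "comput", "computers": "comput",
    "computation": "comput", "compute": "comput",
    "flies": "fly", "flown": "fly", "flew": "fly",
    "houses": "house",
}


def cluster_by_stem(terms):
    """Group terms that share a stem; unknown terms form their own cluster."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[STEM.get(term, term)].append(term)
    return dict(clusters)


terms = ["computing", "flies", "computers", "houses",
         "computation", "flown", "compute", "flew"]
for stem, members in cluster_by_stem(terms).items():
    print(stem, members)
```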
6Another simple clustering method
Pre-existing thesauri (e.g. Roget's)
- Term equivalence classes
- illness: disease, sickness (same part of speech (pos)); unwell, sick, ill (different pos), ...
- PhD: Ph.D, PhD, Phd, Ph.D., ...
- Loosely related topic sets
- DOG: Spaniel, Collie, Schnauzer, bulldog, Poodle, ...
7Term Clustering
Non-hierarchical methods: single pass (Salton, 71)
- Given clustering threshold/target size and similarity function sim(i, j)
- Pick random document Dj
- Assign a document di with sim(Dj, di) < threshold to cluster Cj and recalc centroid
- else create a new cluster Ck with centroid dk
- Exclude di from document list
- Repeat until document list empty
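The single-pass procedure can be sketched as follows. Several choices here are assumptions, not part of the slide: documents are plain lists of numbers, sim is cosine similarity, documents are taken in list order rather than at random (to keep the run repeatable), and each document is compared against every existing centroid. The slide writes the assignment test as sim < threshold, which reads naturally if sim is treated as a distance; this sketch treats sim as a similarity and therefore assigns when it is at or above the threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_pass(docs, threshold):
    """Cluster docs in one pass; returns a list of clusters."""
    clusters = []  # each cluster: {"members": [doc indices], "centroid": vector}
    for index, doc in enumerate(docs):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(cluster["centroid"], doc)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No centroid is similar enough: the document seeds a new cluster.
            clusters.append({"members": [index], "centroid": list(doc)})
        else:
            best["members"].append(index)
            best["centroid"] = centroid([docs[k] for k in best["members"]])
    return clusters


docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
for cluster in single_pass(docs, threshold=0.8):
    print(cluster["members"])
```

The result depends on the order in which documents are visited, which is a known property of single-pass methods and the reason the slide picks documents at random.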
[Figure: documents plotted as points, with D1, D2 and D3 labeled]
8Types of Clustering Behavior/Criterion
- Sim(ti, tj)
- Document level: co-occurrence in same document
- Verb-object
- Syntagmatic similarity (appears together in region): sim(drink, wine), sim(eat, meat), sim(drink, water)
- Paradigmatic similarity (appears as objects of the same verb): sim(wine, water), sim(wine, drink), sim(wine, meat)
- Based on objects of drink, or of all verbs?
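One way to make the two criteria concrete, assuming the input is a list of (verb, object) pairs: syntagmatic similarity is taken as the raw co-occurrence count of a verb with an object, and paradigmatic similarity as the cosine between two objects' verb profiles. Both measures, the function names, and the toy pairs are illustrative assumptions, not definitions from the slides.

```python
import math
from collections import Counter, defaultdict


def syntagmatic_counts(pairs):
    """Count how often each (verb, object) pair appears together."""
    return Counter(pairs)


def verb_profiles(pairs):
    """For each object, count the verbs it appears as the object of."""
    profiles = defaultdict(Counter)
    for verb, obj in pairs:
        profiles[obj][verb] += 1
    return profiles


def paradigmatic(profiles, a, b):
    """Cosine similarity between the verb profiles of objects a and b."""
    pa, pb = profiles.get(a, Counter()), profiles.get(b, Counter())
    dot = sum(count * pb[verb] for verb, count in pa.items())
    norm_a = math.sqrt(sum(c * c for c in pa.values()))
    norm_b = math.sqrt(sum(c * c for c in pb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data assumed for illustration.
pairs = [("drink", "wine"), ("drink", "water"), ("eat", "meat")]
profiles = verb_profiles(pairs)
print(syntagmatic_counts(pairs)[("drink", "wine")])  # 1
print(paradigmatic(profiles, "wine", "water"))       # 1.0
print(paradigmatic(profiles, "wine", "meat"))        # 0.0
```

Restricting `pairs` to a single verb such as drink, versus passing pairs for all verbs, is one way to explore the question of which verbs the paradigmatic measure should be based on.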
9N-gram
- Syntagmatic similarity (occur together): sim(Hong, Kong), sim(soap, opera), sim(soap, suds)
- Paradigmatic similarity (occur in same context): sim(opera, suds), sim(tall, short), sim(long, short), sim(Hong, Kong)?
- Contexts of soap: soap followed by opera, suds, residue; soap preceded by Ivory, Dial, Lye
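A small sketch of the n-gram view, assuming bigrams as the unit: two words are syntagmatically related if they occur as an adjacent pair, and paradigmatically related if they share left or right neighbors. Jaccard overlap of neighbor sets is an assumed choice of measure, and the bigram list is built from the soap example.

```python
from collections import defaultdict


def contexts(bigrams):
    """Map each word to its set of (side, neighbor) contexts."""
    ctx = defaultdict(set)
    for left, right in bigrams:
        ctx[left].add(("R", right))   # 'right' appears to the right of 'left'
        ctx[right].add(("L", left))   # 'left' appears to the left of 'right'
    return ctx


def paradigmatic(ctx, a, b):
    """Jaccard overlap of the contexts of a and b."""
    ca, cb = ctx.get(a, set()), ctx.get(b, set())
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


bigrams = [("soap", "opera"), ("soap", "suds"), ("soap", "residue"),
           ("Ivory", "soap"), ("Dial", "soap"), ("Lye", "soap")]
ctx = contexts(bigrams)

print(("soap", "opera") in bigrams)        # True: syntagmatic, they occur together
print(paradigmatic(ctx, "opera", "suds"))  # 1.0: both follow "soap"
print(paradigmatic(ctx, "Ivory", "Dial"))  # 1.0: both precede "soap"
print(paradigmatic(ctx, "soap", "opera"))  # 0.0: adjacent, but no shared context
```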