Thesaurus Creation/ Term Clustering - PowerPoint PPT Presentation

Slides: 10
Provided by: andre9
Learn more at: https://www.cs.jhu.edu

1
Thesaurus Creation/Term Clustering
  • Two major applications
    • Query expansion
      • fleshing out sparse queries with related words
      • improves recall
        (at possible expense of reduced precision)
    • Termset dimensionality reduction
      • Similar outcome with smaller model

2
Term Dimensionality Reduction
Query: Water Spaniel Diseases

Original vector VQ (one dimension per term):
  Spaniel  Spaniels  disease  diseases  collie  illness  ant

Reduced vector VQ (one dimension per term cluster):
  DOG  ILL  INSECT
   1    1     0

Other texts that map to the same reduced vector:
  • Collie illnesses
  • Poodle sickness

Problem: reduced flexibility in partial weighting of the synonym set;
synonyms get as much weight as the original term.
  • Equivalent to query expansion
    when λi = 1 for all synonyms
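
The reduction on this slide can be sketched as follows. This is a minimal illustration, assuming a hand-built term-to-cluster table; the names `TERM_TO_CLUSTER`, `CLUSTERS` and `reduce_vector` are hypothetical and not from the slides.

```python
from collections import Counter

# Hypothetical term-to-cluster thesaurus, built from the terms on the slide.
TERM_TO_CLUSTER = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLUSTERS = ["DOG", "ILL", "INSECT"]


def reduce_vector(text):
    """Map a text to a binary vector over clusters instead of over terms."""
    hits = Counter(
        TERM_TO_CLUSTER[tok]
        for tok in text.lower().split()
        if tok in TERM_TO_CLUSTER  # terms outside the thesaurus are dropped
    )
    return [1 if hits[c] else 0 for c in CLUSTERS]


query = reduce_vector("Water Spaniel Diseases")
for doc in ["Collie illnesses", "Poodle sickness"]:
    # Each document is compared with the query in the reduced space.
    print(doc, reduce_vector(doc), reduce_vector(doc) == query)
```

Because every member of a cluster collapses onto one dimension, a synonym counts exactly as much as the original query term, which is the loss of flexibility noted above.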

3
Query Expansion
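Query: Water Spaniel Diseases

Original query, expanded by stemming (stem) and synonyms (syn):

  Term:    Water  Spaniel  diseases  Spaniels  disease  illness  collie  ant
  Weight:    1       1        1         λ1        λ2       λ3      λ4     0

Each weight λi depends on the semantic distance dist(wi, t) between the
added word wi and the query term t.

Related document set:
  • D1: Water Spaniels (stem)
  • D2: Water Spaniel illnesses (syn)
  • D3: Collie diseases (syn)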
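
A minimal sketch of weighted query expansion, scored against the three documents on this slide. The expansion table, the numeric weights and the function names are all made up for illustration; the slides do not give weight values. "illnesses" is listed directly because this sketch does no stemming.

```python
# Hypothetical expansion table: query term -> [(related word, weight)].
# The weights are invented; originals always get weight 1.
EXPANSIONS = {
    "spaniel": [("spaniels", 0.9), ("collie", 0.4)],
    "diseases": [("disease", 0.9), ("illness", 0.6), ("illnesses", 0.6)],
}


def expand(query):
    """Return {term: weight} for the original terms plus their expansions."""
    weights = {}
    for term in query.lower().split():
        weights[term] = 1.0
        for related, w in EXPANSIONS.get(term, []):
            # Never let an expansion lower a weight that is already higher.
            weights[related] = max(weights.get(related, 0.0), w)
    return weights


def score(weights, doc):
    """Sum the weights of the distinct document terms found in the query."""
    return sum(weights.get(tok, 0.0) for tok in set(doc.lower().split()))


expanded = expand("Water Spaniel Diseases")
docs = {"D1": "Water Spaniels", "D2": "Water Spaniel illnesses",
        "D3": "Collie diseases"}
for name, text in docs.items():
    print(name, round(score(expanded, text), 2))
```

Setting every expansion weight to 1 makes this behave like the reduced-vector approach, where synonyms and original terms are indistinguishable.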

4
Query Expansion
Query: Water Spaniel diseases
document1: water spaniels ..

5
Simplest Term Clustering
Stemming is a clustering method.

Original term set (before stemming):
  computing, flies, computers, houses, computation, flown, compute, flew, house

Clustered term set (after stemming):
  comput, fly
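
To make the idea concrete, here is a toy sketch that clusters the slide's terms by a shared stem. The suffix list and the irregular-form table are assumptions for illustration only; a real system would use an established stemmer or lemmatizer.

```python
from collections import defaultdict

# Toy rules for illustration only; both tables are assumptions,
# not taken from the slides.
IRREGULAR = {"flew": "fly", "flown": "fly", "flies": "fly"}
SUFFIXES = ["ation", "ing", "ers", "es", "e", "s"]  # tried in this order


def stem(term):
    """Return a crude stem: irregular lookup first, then strip one suffix."""
    term = term.lower()
    if term in IRREGULAR:
        return IRREGULAR[term]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def cluster_by_stem(terms):
    """Group terms that share a stem; each stem names one cluster."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[stem(term)].append(term)
    return dict(clusters)


terms = ["computing", "flies", "computers", "houses", "computation",
         "flown", "compute", "flew", "house"]
for label, members in cluster_by_stem(terms).items():
    print(label, members)
```

Note that mapping "flew" and "flown" to "fly" needs a lookup table rather than suffix stripping, so this part is closer to lemmatization than to stemming in the strict sense.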

6
Another simple clustering method
Pre-existing thesauri (e.g. Roget's)
  • Term equivalence classes
    • illness: disease, sickness (same part of speech (pos));
      unwell, sick, ill (different pos), ...
    • PhD: Ph.D, PhD, Phd, Ph.D., ...
  • Loosely related topic sets
    • DOG: Spaniel, Collie, Schnauzer, bulldog,
      Poodle, ...

7
Term Clustering
Non-hierarchical methods: single pass (Salton, 71)
  • Given a clustering threshold/target size and a
    similarity function sim(i, j)
  • Pick a random document Dj as the centroid of the first cluster Cj
  • Assign a document di with sim(Dj, di) < threshold to
    cluster Cj and recalculate the centroid
    (the "<" reads sim as a distance; with a similarity score
    the test is sim(Dj, di) ≥ threshold)
  • else create a new cluster Ck with di as its centroid
  • Exclude di from the document list
  • Repeat until the document list is empty
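
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are dense vectors, `sim` is cosine similarity (so a document joins a cluster when the similarity is at least the threshold), each document goes to its most similar centroid, and documents are processed in the order given, so the caller would shuffle them to get the random pick. None of these choices is specified on the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(docs, threshold, sim=cosine):
    """Cluster vectors in one pass; returns a list of [centroid, members]."""
    clusters = []  # each entry: [centroid vector, list of member indices]
    for i, doc in enumerate(docs):
        # Find the most similar existing centroid, if any.
        best, best_sim = None, -1.0
        for cluster in clusters:
            s = sim(cluster[0], doc)
            if s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim >= threshold:
            best[1].append(i)
            n = len(best[1])
            # Recalculate the centroid as the running mean of the members.
            best[0] = [(c * (n - 1) + x) / n for c, x in zip(best[0], doc)]
        else:
            clusters.append([list(doc), [i]])  # doc seeds a new cluster
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for centroid, members in single_pass(docs, threshold=0.8):
    print(members, [round(c, 2) for c in centroid])
```

Because each document is seen only once, the result depends on the processing order and on the threshold, which is the usual trade-off of single-pass methods.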

[Figure: scatter plot of documents (dots) with points labelled D1, D2 and D3]

8
Types of Clustering Behavior/Criterion
  • Sim(ti, tj)
  • Document level: co-occurrence in same document
  • Verb-object
    • Syntagmatic similarity (appears together in region):
      sim(drink, wine), sim(eat, meat), sim(drink, water)
    • Paradigmatic similarity (appears as objects of the same verb):
      sim(wine, water), sim(wine, drink), sim(wine, meat)
      based on object of drink or of all verbs
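
The two verb-object criteria can be contrasted with a small sketch. The verb-object pairs below are invented to mirror the slide's examples, and using Jaccard overlap for the paradigmatic score is an assumption; the slides do not name a specific measure.

```python
from collections import defaultdict

# Made-up verb-object observations, chosen to mirror the slide's examples.
PAIRS = [("drink", "wine"), ("drink", "water"), ("eat", "meat"),
         ("buy", "wine"), ("buy", "meat"), ("pour", "wine"), ("pour", "water")]


def syntagmatic(verb, obj, pairs=PAIRS):
    """Fraction of observed pairs in which verb and obj appear together."""
    return sum(1 for p in pairs if p == (verb, obj)) / len(pairs)


def paradigmatic(obj_a, obj_b, pairs=PAIRS):
    """Jaccard overlap of the sets of verbs that take each word as object."""
    verbs = defaultdict(set)
    for verb, obj in pairs:
        verbs[obj].add(verb)
    union = verbs[obj_a] | verbs[obj_b]
    return len(verbs[obj_a] & verbs[obj_b]) / len(union) if union else 0.0


print(syntagmatic("drink", "wine"))
print(paradigmatic("wine", "water"), paradigmatic("wine", "meat"))
```

Restricting `pairs` to a single verb such as "drink" gives the "object of drink" variant, while passing all pairs gives the "all verbs" variant.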

9
N-gram
  • Syntagmatic similarity (occur together):
    sim(Hong, Kong), sim(soap, opera), sim(soap, suds)
  • Paradigmatic similarity (occur in same context):
    sim(opera, suds), sim(tall, short), sim(long, short), sim(Hong, Kong)

Contexts of "soap":
  • following words: soap opera, soap suds, soap residue
  • preceding words: Ivory soap, Dial soap, Lye soap
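
A minimal n-gram sketch of the same distinction, assuming bigram counts as the syntagmatic score and cosine similarity over neighbour counts as the paradigmatic score. The token stream is made up from the slide's "soap" phrases, and treating "." as an ordinary token is a simplification.

```python
import math
from collections import Counter, defaultdict

# Made-up token stream built from the slide's "soap" phrases.
TOKENS = "ivory soap suds . dial soap opera . lye soap residue . soap opera".split()

bigrams = Counter(zip(TOKENS, TOKENS[1:]))

# Context vector of a word: counts of (side, neighbour) pairs around it.
contexts = defaultdict(Counter)
for left, right in zip(TOKENS, TOKENS[1:]):
    contexts[left][("R", right)] += 1
    contexts[right][("L", left)] += 1


def syntagmatic(a, b):
    """How often a is immediately followed by b."""
    return bigrams[(a, b)]


def paradigmatic(a, b):
    """Cosine similarity between the neighbour-count vectors of a and b."""
    va, vb = contexts[a], contexts[b]
    dot = sum(va[k] * vb[k] for k in va)
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(
        sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0


print(syntagmatic("soap", "opera"), paradigmatic("soap", "opera"))
print(syntagmatic("opera", "suds"), paradigmatic("opera", "suds"))
```

In this toy data "soap" and "opera" occur together but share no contexts, while "opera" and "suds" never occur together but both follow "soap".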