Title: CS 430 / INFO 430 Information Retrieval
1. CS 430 / INFO 430 Information Retrieval
Lecture 26: Classification
2. Course Administration
3. Automatic Thesaurus Construction: Outline
- Select a subject domain.
- Choose a corpus of documents that cover the domain.
- Create the vocabulary by extracting terms, normalization, precoordination of phrases, etc.
- Devise a measure of similarity between terms and thesaurus classes.
- Cluster terms into thesaurus classes, using a cluster method that generates compact clusters (e.g., complete linkage).
4. Normalization of vocabulary
Normalization rules map variant forms into base expressions. Typical normalization rules for manual thesaurus construction are:
- Nouns only, or nouns and noun phrases.
- Singular nouns only.
- Spelling (e.g., U.S.).
- Capitalization, punctuation (e.g., hyphens), initials (e.g., IBM), abbreviations (e.g., Mr.).
Usually, many possible decisions can be made, but they should be followed consistently.
Which of these can be carried out automatically with reasonable accuracy?
5. Phrase construction
In a precoordinated thesaurus, term classes may contain phrases.

Informal definitions:
- pair-frequency(i, j) is the frequency with which a given pair of words occurs in context (e.g., in succession within a sentence).
- A phrase is a pair of words i and j that occur in context with a higher frequency than would be expected from their overall frequencies.

    cohesion(i, j) = observed pair-frequency(i, j) / expected pair-frequency if i and j were independent
6. Phrase construction: simple case
Example: a corpus of n terms.
- p_ij is the number of occurrences of a given pair of terms in succession.
- f_i is the number of occurrences of term i in the corpus.
- There are n - 1 successive pairs.
If the terms are independent, the probability that a given pair begins with term i and ends with term j is (f_i/n)·(f_j/n), so the expected pair-frequency is (n - 1)·(f_i·f_j)/n², giving:

    cohesion(i, j) = p_ij / [(n - 1)·(f_i·f_j)/n²]
7. Phrase construction
Salton and McGill algorithm:
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain frequency threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a phrase.
There is promising research on phrase identification using methods of computational linguistics.
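As a concrete illustration, here is a minimal Python sketch of the four steps, under assumptions the slides leave open: "in context" is taken to mean adjacent tokens, and both threshold values are illustrative placeholders, not values from the slides.

```python
from collections import Counter

def find_phrases(tokens, freq_threshold=2, cohesion_threshold=1.0):
    """Sketch of the Salton and McGill phrase-construction steps.
    Assumes "in context" means adjacent tokens; thresholds are illustrative."""
    n = len(tokens)
    term_freq = Counter(tokens)                    # f_i for every term
    pair_freq = Counter(zip(tokens, tokens[1:]))   # 1. pair-frequency p_ij
    phrases = []
    for (i, j), p_ij in pair_freq.items():
        if p_ij < freq_threshold:                  # 2. reject infrequent pairs
            continue
        # 3. cohesion = observed / expected pair-frequency (slide 6 formula)
        expected = (n - 1) * (term_freq[i] / n) * (term_freq[j] / n)
        cohesion = p_ij / expected
        if cohesion > cohesion_threshold:          # 4. accept the pair as a phrase
            phrases.append((i, j, cohesion))
    return phrases

print(find_phrases("new york is bigger than new york state".split()))
# [('new', 'york', 4.57...)] -- "new york" occurs far more often than chance predicts
```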
8. Similarities
The vocabulary consists of a set of elements, each of which can be a single term or a phrase. The next step is to calculate a measure of similarity between elements.

One measure of similarity is the number of documents that have terms j and k in common:

    S(t_j, t_k) = Σ_{i=1}^{n} t_ij · t_ik

where t_ij = 1 if document i contains term j, and 0 otherwise, and n is the number of documents.
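A minimal sketch of this measure in Python, using the incidence array shown on the next slide (the 0/1 document-by-term matrix; the assignment of rows D3 and D4 is reconstructed from the co-occurrence counts and is only determined up to a swap):

```python
# Incidence array from the next slide: rows are documents D1-D4,
# columns are the terms alpha .. golf.
incidence = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]

def S(j, k):
    """Number of documents containing both term j and term k."""
    return sum(row[j] * row[k] for row in incidence)

print(S(0, 3))  # alpha and delta co-occur in 3 documents (D1, D2, D4)
```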
9. Similarities: Incidence array

        alpha  bravo  charlie  delta  echo  foxtrot  golf
D1        1      1       1       1      1      1       1
D2        1                      1                     1
D3               1       1              1      1
D4        1                      1             1       1
10. Term similarity matrix

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -      1       1       3      1      2       3
bravo              -       2       1      2      2       1
charlie                    -       1      2      2       1
delta                              -      1      2       3
echo                                      -      2       1
foxtrot                                          -       2
golf                                                     -

Using the count of documents that have two terms in common.
11. Similarity measures
Improved similarity measures can be generated by:
- using the term frequency matrix instead of the incidence matrix
- weighting terms by frequency

    cosine measure:  S(t_j, t_k) = Σ_{i=1}^{n} t_ij·t_ik / (|t_j|·|t_k|)

    dice measure:    S(t_j, t_k) = Σ_{i=1}^{n} t_ij·t_ik / (Σ_{i=1}^{n} t_ij + Σ_{i=1}^{n} t_ik)

where |t_j| is the length of the column vector t_j.
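A minimal sketch of both measures over the same incidence array (repeated so the snippet is self-contained). Note that the slide's dice measure has no factor of 2 in the numerator; the code follows the slide, and dice(alpha, delta) reproduces the 0.5 entry in the matrix on the next slide.

```python
import math

incidence = [
    [1, 1, 1, 1, 1, 1, 1],  # D1   columns: alpha .. golf
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]

def dice(j, k):
    """Dice measure as defined on this slide (no factor of 2 in the numerator)."""
    common = sum(row[j] * row[k] for row in incidence)
    return common / (sum(row[j] for row in incidence) + sum(row[k] for row in incidence))

def cosine(j, k):
    common = sum(row[j] * row[k] for row in incidence)
    norms = [math.sqrt(sum(row[t] ** 2 for row in incidence)) for t in (j, k)]
    return common / (norms[0] * norms[1])

print(dice(0, 3))    # alpha vs delta -> 3 / (3 + 3) = 0.5
print(cosine(0, 3))  # 3 / (sqrt(3) * sqrt(3)) = 1.0 (identical incidence vectors)
```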
12. Term similarity matrix

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2     0.2     0.5    0.2    0.33    0.5
bravo             -       0.5     0.2    0.5    0.4     0.2
charlie                   -       0.2    0.5    0.4     0.2
delta                             -      0.2    0.33    0.5
echo                                     -      0.4     0.2
foxtrot                                         -       0.33
golf                                                    -

Using the incidence matrix and dice weighting.
13. Cluster Analysis
Cluster analysis: methods that divide a set of n objects into m non-overlapping subsets.
For information discovery, cluster analysis is applied to:
- terms, for thesaurus construction
- documents, to divide them into categories (sometimes called automatic classification, but classification usually requires a pre-determined set of categories).
14. Cluster Analysis: Metrics
- Documents clustered on the basis of a similarity measure calculated from the terms that they contain.
- Documents clustered on the basis of co-occurring citations.
- Terms clustered on the basis of the documents in which they co-occur.
15. Non-hierarchical and Hierarchical Methods
Non-hierarchical methods: elements are divided into m non-overlapping sets, where m is predetermined.
Hierarchical methods: m is varied progressively to create a hierarchy of solutions.
Agglomerative methods: m is initially equal to n, the total number of elements; every element starts as a cluster with one element. The hierarchy is produced by incrementally combining clusters.
16. Simple Hierarchical Methods: Single Link Concept
[Figure: two scatters of points; the clusters are joined through their closest pair of elements]
Similarity between clusters is the similarity between their most similar elements.
17. Single Link
Single link: a simple agglomerative method.
- Initially, each element is its own cluster with one element.
- At each step, calculate the similarity between each pair of clusters as the similarity of the most similar pair of elements that are not yet in the same cluster. Merge the two clusters that are most similar.
- May lead to long, straggling clusters (chaining).
- Very simple computation (see the sketch below).
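A naive sketch of single-link agglomeration in Python; `sim` stands for any pairwise similarity function, such as the dice values from slide 12. This illustrates the method rather than an efficient implementation (real implementations avoid recomputing all pair similarities at every step).

```python
def single_link(items, sim):
    """Single-link agglomeration: repeatedly merge the two clusters whose
    MOST similar pair of elements has the highest similarity."""
    clusters = [{x} for x in items]
    merges = []                      # merge history, i.e., the dendrogram
    while len(clusters) > 1:
        pairs = [(p, q) for p in range(len(clusters))
                        for q in range(p + 1, len(clusters))]
        p, q = max(pairs, key=lambda pq: max(
            sim(a, b) for a in clusters[pq[0]] for b in clusters[pq[1]]))
        merges.append(clusters[p] | clusters[q])
        clusters[p] |= clusters.pop(q)
    return merges
```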
18. Example -- single link
[Dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot after agglomerative step 1]

19. Example -- single link
[The same dendrogram after agglomerative step 2]

20. Example -- single link
[The same dendrogram after agglomerative step 3]

21. Example -- single link
[The complete dendrogram after all six agglomerative steps]
This style of diagram is called a dendrogram.
22. Simple Hierarchical Methods: Complete Linkage Concept
[Figure: two scatters of points; the clusters are joined through their farthest pair of elements]
Similarity between clusters is the similarity between their least similar elements.
23. Complete linkage
Complete linkage: a simple agglomerative method.
- Initially, each element is its own cluster with one element.
- At each step, calculate the similarity between each pair of clusters as the similarity between the least similar pair of elements in the two clusters. Merge the two clusters that are most similar.
- Generates small, tightly bound clusters (see the sketch below).
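Complete linkage differs from the single-link sketch above by a single word: the similarity between two clusters uses min instead of max over element pairs. The sketch below, run on the dice values from the matrix that follows, reproduces the merge order of the worked example (up to ties):

```python
def complete_link(items, sim):
    """Complete linkage: cluster similarity is that of the LEAST similar pair."""
    clusters = [{x} for x in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(p, q) for p in range(len(clusters))
                        for q in range(p + 1, len(clusters))]
        p, q = max(pairs, key=lambda pq: min(
            sim(a, b) for a in clusters[pq[0]] for b in clusters[pq[1]]))
        merges.append(clusters[p] | clusters[q])
        clusters[p] |= clusters.pop(q)
    return merges

dice_sim = {  # upper triangle of the dice matrix (slides 12 and 24)
    ('a','b'): .2, ('a','c'): .2, ('a','d'): .5, ('a','e'): .2, ('a','f'): .33, ('a','g'): .5,
    ('b','c'): .5, ('b','d'): .2, ('b','e'): .5, ('b','f'): .4, ('b','g'): .2,
    ('c','d'): .2, ('c','e'): .5, ('c','f'): .4, ('c','g'): .2,
    ('d','e'): .2, ('d','f'): .33, ('d','g'): .5,
    ('e','f'): .4, ('e','g'): .2, ('f','g'): .33,
}
sim = lambda a, b: dice_sim.get((a, b), dice_sim.get((b, a)))

print(complete_link(list('abcdefg'), sim))
# {a,d}, {a,d,g}, {b,c}, {b,c,e}, {b,c,e,f}, {a..g} -- matches steps 1-6
```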
24. Term similarity matrix (repeated from slide 12)

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2     0.2     0.5    0.2    0.33    0.5
bravo             -       0.5     0.2    0.5    0.4     0.2
charlie                   -       0.2    0.5    0.4     0.2
delta                             -      0.2    0.33    0.5
echo                                     -      0.4     0.2
foxtrot                                         -       0.33
golf                                                    -

Using the incidence matrix and dice weighting.
25. Example: complete linkage

Cluster elements: a, b, c, d, e, f, g

Least similar pair / similarity, for each pair of clusters:

     a    b      c      d      e      f       g
a    -   ab/.2  ac/.2  ad/.5  ae/.2  af/.33  ag/.5
b         -     bc/.5  bd/.2  be/.5  bf/.4   bg/.2
c                -     cd/.2  ce/.5  cf/.4   cg/.2
d                       -     de/.2  df/.33  dg/.5
e                              -     ef/.4   eg/.2
f                                     -      fg/.33
g                                             -

Step 1. Merge clusters a and d.
26. Example: complete linkage

Cluster elements: a,d  b  c  e  f  g

Least similar pair / similarity:

      a,d   b      c      e      f       g
a,d    -   ab/.2  ac/.2  ae/.2  df/.33  ag/.5
b           -     bc/.5  be/.5  bf/.4   bg/.2
c                  -     ce/.5  cf/.4   cg/.2
e                         -     ef/.4   eg/.2
f                                -      fg/.33
g                                        -

Step 2. Merge clusters a,d and g.
27. Example: complete linkage

Cluster elements: a,d,g  b  c  e  f

Least similar pair / similarity:

        a,d,g   b      c      e      f
a,d,g     -    ab/.2  ac/.2  ae/.2  af/.33
b               -     bc/.5  be/.5  bf/.4
c                      -     ce/.5  cf/.4
e                             -     ef/.4
f                                    -

Step 3. Merge clusters b and c.
28. Example: complete linkage

Cluster elements: a,d,g  b,c  e  f

Least similar pair / similarity:

        a,d,g  b,c    e      f
a,d,g     -   ab/.2  ae/.2  af/.33
b,c            -     be/.5  bf/.4
e                     -     ef/.4
f                            -

Step 4. Merge clusters b,c and e.
29. Example -- complete linkage
[Dendrogram of the six agglomerative steps: alpha, delta, golf form one branch; bravo, charlie, echo, foxtrot the other]
30. Non-Hierarchical Methods: K-means
1. Define a similarity measure between any two points in the space (e.g., square of distance).
2. Choose k points as initial group centroids.
3. Assign each object to the group that has the closest centroid.
4. When all objects have been assigned, recalculate the positions of the k centroids.
5. Repeat Steps 3 and 4 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
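A minimal k-means sketch for two-dimensional points, using squared Euclidean distance as the measure of Step 1; the initial centroids of Step 2 are chosen at random, so results vary between runs (as the next slide notes).

```python
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means for 2-D points; comments refer to the steps above."""
    centroids = random.sample(points, k)                       # step 2
    groups = []
    for _ in range(max_iters):
        groups = [[] for _ in range(k)]
        for p in points:                                       # step 3: nearest centroid
            c = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                            + (p[1] - centroids[j][1]) ** 2)
            groups[c].append(p)
        new_centroids = [                                      # step 4: recompute centers
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g
            else centroids[j]
            for j, g in enumerate(groups)]
        if new_centroids == centroids:                         # step 5: converged
            break
        centroids = new_centroids
    return centroids, groups

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(points, 2)[0])  # two centroids near (0.33, 0.33) and (9.33, 9.33)
```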
31. K-means
- The iteration converges under a very general set of conditions.
- Results depend on the choice of the k initial centroids.
- Methods can be used to generate a sequence of solutions for k increasing from 1 to n. Note that, in general, the results will not be hierarchical.
32. Problems with cluster analysis in information retrieval
- Selection of attributes on which items are clustered
- Choice of similarity measure and algorithm
- Computational resources
- Assessing validity and stability of clusters
- Updating clusters as data changes
- Method for using the clusters in information retrieval
33. Example: Concept Spaces for Scientific Terms
Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms.

Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), and Hsinchun Chen (University of Arizona), "Federating Diverse Collections of Scientific Literature", IEEE Computer, May 1996.
34. Concept Spaces: Methodology
Concept space: a similarity matrix based on co-occurrence of terms.
Approach: use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept. Arrange the concepts in a hierarchical classification.
35. Concept Spaces: INSPEC Data
Data set 1: all terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links. 24.5 hours of CPU time on a 16-node Silicon Graphics supercomputer.

Example entry:
computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
    RT teaching
36. Concept Space: Compendex Data
Data set 2:
(a) 4,000,000 abstracts from the Compendex database, covering all of engineering, as the collection, partitioned along classification-code lines into some 600 community repositories. Four days of CPU time on a 64-processor Convex Exemplar.
(b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000, and the concept space for each set was generated separately. The sets were selected by the existing classification scheme.
37. Objectives
- Semantic retrieval (using concept spaces for term suggestion)
- Semantic interoperability (vocabulary switching across subject domains)
- Semantic indexing (concept identification of document content)
- Information representation (information units for uniform manipulation)
38. Use of Concept Space: Term Suggestion