Title: Prof. Ray Larson
Lecture 8: Clustering
Principles of Information Retrieval
- Prof. Ray Larson
- University of California, Berkeley
- School of Information
- Tuesday and Thursday 10:30 am - 12:00 pm
- Spring 2007
- http://courses.ischool.berkeley.edu/i240/s07
Some slides in this lecture were originally
created by Prof. Marti Hearst
- Need to make groups
- Today
- Systems
- SMART (not recommended)
- ftp://ftp.cs.cornell.edu/pub/smart
- MG (We have a special version if interested)
- http://www.mds.rmit.edu.au/mg/welcome.html
- Cheshire II and 3
- II: ftp://cheshire.berkeley.edu/pub/cheshire and http://cheshire.berkeley.edu
- 3: http://cheshire3.sourceforge.org
- Zprise (Older search system from NIST)
- http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
- IRF (new Java-based IR framework from NIST)
- http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
- Lemur
- http://www-2.cs.cmu.edu/lemur
- Lucene (Java-based text search engine)
- http://jakarta.apache.org/lucene/docs/index.html
- Others? (See http://searchtools.com)
- Proposed Schedule
- February 15: Database and previous queries
- February 27: Report on system acquisition and setup
- March 8: New queries for testing
- April 19: Results due
- April 24 or 26: Results and system rankings
- May 8: Group reports and discussion
Review: IR Models
- Set Theoretic Models
- Boolean
- Fuzzy
- Extended Boolean
- Vector Models (Algebraic)
- Probabilistic Models (probabilistic)
Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
Documents in Vector Space
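Vector Space with Term Weights and Cosine Matching
$D_i = (d_{i1}, w_{d_{i1}};\; d_{i2}, w_{d_{i2}};\; \ldots;\; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\; q_{i2}, w_{q_{i2}};\; \ldots;\; q_{it}, w_{q_{it}})$
Figure (axes: Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)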
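The cosine match for the three vectors on this slide can be checked with a short sketch. The function name `cosine` is illustrative and not part of SMART or Cheshire; the coordinate order is assumed to be (Term A, Term B), which does not affect the result.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Vectors from the slide.
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine(Q, D1)  # about 0.73
sim_d2 = cosine(Q, D2)  # about 0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1 is the longer vector.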
Term Weights in SMART
- In SMART weights are decomposed into three components: frequency, collection weighting, and normalization
SMART Freq Components
- Binary
- maxnorm
- augmented
- log
Collection Weighting in SMART
- Inverse
- squared
- probabilistic
- frequency
Term Normalization in SMART
- sum
- cosine
- fourth
- max
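The formulas for these components did not survive in this transcript. The sketch below uses the definitions commonly given for SMART-style weighting, which is an assumption rather than a copy of the slides; the function and scheme names are illustrative, and only cosine normalization is shown.

```python
import math

def freq_component(tf, max_tf, scheme):
    """Within-document frequency component (assumed SMART-style definitions)."""
    if tf <= 0:
        return 0.0
    if scheme == "binary":
        return 1.0
    if scheme == "maxnorm":
        return tf / max_tf
    if scheme == "augmented":
        return 0.5 + 0.5 * tf / max_tf
    if scheme == "log":
        return 1.0 + math.log(tf)
    raise ValueError(f"unknown scheme: {scheme}")

def collection_component(n_docs, doc_freq, scheme):
    """Collection-level component; doc_freq is the number of docs containing the term."""
    if scheme == "inverse":
        return math.log(n_docs / doc_freq)
    if scheme == "squared":
        return math.log(n_docs / doc_freq) ** 2
    if scheme == "probabilistic":
        return math.log((n_docs - doc_freq) / doc_freq)
    if scheme == "frequency":
        return 1.0 / doc_freq
    raise ValueError(f"unknown scheme: {scheme}")

def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else list(weights)
```

A full term weight is then the product of a frequency component and a collection component, followed by normalization over the document vector.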
Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
- it is more for visualization than having any real basis
- most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms
- Clustering
- Automatic Classification
- Cluster-enhanced search
- Introduction to Automatic Classification and Clustering
- Classification of Classification Methods
- Classification Clusters and Information Retrieval in Cheshire II
- DARPA Unfamiliar Metadata Project
- The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
- In document classification the items are grouped together because they are likely to be wanted together
- For example, items about the same topic.
Automatic Indexing and Classification
- Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
- More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
- Automatic classification attempts to automatically group similar documents using either
- A fully automatic clustering method.
- An established classification scheme and set of documents already indexed by that scheme.
Background and Origins
- Early suggestion by Fairthorne
- "The Mathematics of Classification"
- Early experiments by Maron (1961) and Borko and Bernick (1963)
- Work in Numerical Taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s).
- Early IR clustering work more concerned with efficiency issues than semantic issues.
Document Space has High Dimensionality
- What happens beyond three dimensions?
- Similarity still has to do with how many tokens are shared in common.
- More terms -> harder to understand which subsets of words are shared among similar documents.
- One approach to handling high dimensionality
Vector Space Visualization
Cluster Hypothesis
- The basic notion behind the use of classification and clustering methods
- "Closely associated documents tend to be relevant to the same requests."
- C.J. van Rijsbergen
Classification of Classification Methods
- Class Structure
- Intellectually Formulated
- Manual assignment (e.g. Library classification)
- Automatic assignment (e.g. Cheshire Classification Mapping)
- Automatically derived from collection of items
- Hierarchic Clustering Methods (e.g. Single Link)
- Agglomerative Clustering Methods (e.g. Dattola)
- Hybrid Methods (e.g. Query Clustering)
Classification of Classification Methods
- Relationship between properties and classes
- monothetic
- polythetic
- Relation between objects and classes
- exclusive
- overlapping
- Relation between classes and classes
- ordered
- unordered
Adapted from Sparck Jones
Properties and Classes
- Monothetic
- Class defined by a set of properties that are both necessary and sufficient for membership in the class
- Polythetic
- Class defined by a set of properties such that to be a member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.
Monothetic vs. Polythetic
Exclusive Vs. Overlapping
- Item can either belong exclusively to a single class
- Items can belong to many classes, sometimes with a membership weight
Ordered Vs. Unordered
- Ordered classes have some sort of structure imposed on them
- Hierarchies are typical of ordered classes
- Unordered classes have no imposed precedence or structure and each class is considered on the same level
- Typical in agglomerative methods
Text Clustering
- Clustering is
- "The art of finding groups in data."
- -- Kaufman and Rousseeuw
Figure (axes: Term 1, Term 2)
Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others
Coefficients of Association
- Simple matching
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
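The coefficients listed above can be sketched for the binary case, where each document is treated as a set of terms. These are the standard set-based forms; the function names and the two example term sets are made up for illustration, and non-empty sets are assumed.

```python
import math

def simple_matching(x, y):
    """Number of shared terms (coordination level match)."""
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

# Hypothetical documents represented as sets of index terms.
doc_a = {"cluster", "document", "retrieval", "vector"}
doc_b = {"cluster", "document", "hierarchy"}

scores = {
    "simple": simple_matching(doc_a, doc_b),  # 2 shared terms
    "dice": dice(doc_a, doc_b),               # 4/7
    "jaccard": jaccard(doc_a, doc_b),         # 2/5
    "cosine": cosine(doc_a, doc_b),           # 2/sqrt(12)
    "overlap": overlap(doc_a, doc_b),         # 2/3
}
```

All five share the same numerator idea (the size of the intersection) and differ only in how they normalize for document length.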
Pair-wise Document Similarity
How to compute document similarity?
Pair-wise Document Similarity (no normalization for simplicity)
Pair-wise Document Similarity (cosine normalization)
Document/Document Matrix
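The worked example from these slides is not in the transcript, so the following is a minimal sketch of building a document/document matrix with cosine normalization. The three term-weight vectors are hypothetical, and `doc_doc_matrix` is an illustrative name.

```python
import math

def cosine_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec] if length else list(vec)

def doc_doc_matrix(docs):
    """Pair-wise similarity: inner products of cosine-normalized vectors."""
    unit = [cosine_normalize(d) for d in docs]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Hypothetical term-weight vectors, one row per document.
docs = [
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 3],
]
matrix = doc_doc_matrix(docs)
```

Without normalization the second document would look twice as similar to itself as the first does; after normalization the first two documents score about 1.0 with each other and 0 with the third. The matrix is symmetric, so only the upper triangle needs to be stored.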
Clustering Methods
- Hierarchical
- Agglomerative
- Hybrid
- Automatic Class Assignment
Hierarchic Agglomerative Clustering
- Basic method
- 1) Calculate all of the interdocument similarity coefficients
- 2) Assign each document to its own cluster
- 3) Fuse the most similar pair of current clusters
- 4) Update the similarity matrix by deleting the rows for the fused clusters and calculating entries for the row and column representing the new cluster (centroid)
- 5) Return to step 3 if there is more than one cluster left
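The five steps above can be sketched as follows. This is a simplified illustration, not the implementation used in any particular system: it assumes cosine similarity between cluster centroids and, for brevity, recomputes the centroid similarities on each pass instead of updating a stored matrix. The four document vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hac(docs):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples."""
    # Step 2: each document starts in its own cluster (a list of doc indices).
    clusters = [[i] for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:  # Step 5: stop when one cluster is left.
        # Steps 1 and 4: similarities between centroids of the current clusters.
        cents = [centroid([docs[d] for d in c]) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(cents[i], cents[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        # Step 3: fuse the most similar pair of clusters.
        merges.append((clusters[i], clusters[j], sim))
        fused = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [fused]
    return merges

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]  # hypothetical weights
merges = hac(docs)
```

The merge history is the hierarchy: documents 0 and 1 fuse first, then 2 and 3, and the two resulting clusters fuse last at a much lower similarity.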
Hierarchical Methods
Single Link Dissimilarity Matrix
Hierarchical methods: Polythetic, Usually Exclusive, Ordered. Clusters are order-independent.
Threshold = .1
Single Link Dissimilarity Matrix
Threshold = .2
Threshold = .3
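The dissimilarity matrix from these slides is not in the transcript. The sketch below uses a hypothetical 5-item matrix to show how single-link clusters grow as the threshold rises: at a given threshold, two items share a cluster if a chain of links with dissimilarity at or below the threshold connects them. Treating the threshold as inclusive is an assumption.

```python
def single_link_clusters(dissim, threshold):
    """Group items connected by links with dissimilarity <= threshold."""
    n = len(dissim)
    parent = list(range(n))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical symmetric dissimilarity matrix (not the one from the slides).
dissim = [
    [0.0, 0.1, 0.5, 0.6, 0.7],
    [0.1, 0.0, 0.2, 0.6, 0.7],
    [0.5, 0.2, 0.0, 0.3, 0.7],
    [0.6, 0.6, 0.3, 0.0, 0.4],
    [0.7, 0.7, 0.7, 0.4, 0.0],
]

at_01 = single_link_clusters(dissim, 0.1)  # [[0, 1], [2], [3], [4]]
at_02 = single_link_clusters(dissim, 0.2)  # [[0, 1, 2], [3], [4]]
at_03 = single_link_clusters(dissim, 0.3)  # [[0, 1, 2, 3], [4]]
```

Item 2 joins the first cluster at 0.2 through its link to item 1, even though its dissimilarity to item 0 is 0.5. This chaining is characteristic of single link.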
K-means / Rocchio Clustering
Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered clusters are
Rocchio's method
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
K-Means Clustering
- 1 Create a pair-wise similarity measure
- 2 Find K centers using agglomerative clustering
- take a small sample
- group bottom up until K groups found
- 3 Assign each document to nearest center, forming new clusters
- 4 Repeat 3 as necessary
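Steps 3 and 4 can be sketched as below. To keep the example short, the K seed centers are passed in directly rather than found by agglomerative clustering of a sample (step 2); cosine is assumed as the similarity measure, and the documents, seeds, and `max_iters` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(docs, seeds, max_iters=10):
    """Assign docs to the most similar center, recompute centers, repeat."""
    centers = [list(s) for s in seeds]  # copy so the caller's seeds are untouched
    assignment = None
    for _ in range(max_iters):
        # Step 3: each document goes to its most similar center.
        new_assignment = [
            max(range(len(centers)), key=lambda k: cosine(d, centers[k]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # Step 4: stop once assignments no longer change.
        assignment = new_assignment
        for k in range(len(centers)):
            members = [d for d, a in zip(docs, assignment) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = centroid(members)
    return assignment, centers

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]  # hypothetical weights
assignment, centers = kmeans(docs, seeds=[docs[0], docs[2]])
```

Here the assignments settle after one update: documents 0 and 1 share one cluster and documents 2 and 3 the other. A different choice of seeds can produce a different result.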
Scatter/Gather
- Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
- Cluster sets of documents into general themes, like a table of contents
- Display the contents of the clusters by showing topical terms and typical titles
- User chooses subsets of the clusters and re-clusters the documents within
- Resulting new groups have different themes
S/G Example: query on "star"
- Encyclopedia text
- 14 sports
- 8 symbols 47 film, tv
- 68 film, tv (p) 7 music
- 97 astrophysics
- 67 astronomy(p) 12 stellar phenomena
- 10 flora/fauna 49 galaxies, stars
- 29 constellations
- 7 miscellaneous
- Clustering and re-clustering is entirely
Another use of clustering
- Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
- Project these onto a 2D graphical
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
Concept Landscapes
- (e.g., Lin, Chen, Wise et al.)
- Too many concepts, or too coarse
- Single concept per document
- No titles
- Browsing without search
- Advantages
- See some main themes
- Disadvantage
- Many ways documents could group together are hidden
- Thinking point: what is the relationship to classification systems and facets?
Automatic Class Assignment
Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme.
Automatic Categorization in Cheshire II
- Cheshire supports a method we call "classification clustering" that relies on having a set of records that have been indexed using some controlled vocabulary.
- Classification clustering has the following steps
Cheshire II - Cluster Generation
- Define basis for clustering records.
- Select field (i.e., the controlled vocabulary terms) to form the basis of the cluster.
- Evidence Fields to use as contents of the pseudo-documents. (E.g. the titles or other topical parts)
- During indexing cluster keys are generated with basis and evidence from each record.
- Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
- Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
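The cluster generation steps above can be illustrated with a small sketch. This is not Cheshire II code or its API: the records, the field names `lc_class` and `title`, and the function `build_pseudo_documents` are all hypothetical, chosen only to show the sort-and-merge idea.

```python
from collections import defaultdict

# Hypothetical records: a class number as the basis field, title as evidence.
records = [
    {"lc_class": "Z699", "title": "Information retrieval systems"},
    {"lc_class": "QA76", "title": "Introduction to algorithms"},
    {"lc_class": "Z699", "title": "Automatic text processing"},
]

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Merge evidence text from all records that share a basis value."""
    # Generate one cluster key (basis, evidence) per record.
    keys = []
    for rec in records:
        evidence = " ".join(rec[f] for f in evidence_fields if rec.get(f))
        keys.append((rec[basis_field], evidence))
    # Sort the keys, then merge on the basis element.
    keys.sort()
    merged = defaultdict(list)
    for basis, evidence in keys:
        merged[basis].append(evidence)
    # One pseudo-document per unique basis value.
    return {basis: " ".join(parts) for basis, parts in merged.items()}

pseudo_docs = build_pseudo_documents(records, "lc_class", ["title"])
```

Each resulting pseudo-document would then be indexed like an ordinary document, so a query can be matched against classes before it is matched against individual records.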
Cheshire II - Two-Stage Retrieval
- Using the LC Classification System
- Pseudo-Document created for each LC class containing terms derived from content-rich portions of documents in that class (e.g., subject headings, titles, etc.)
- Permits searching by any term in the class
- Ranked Probabilistic retrieval techniques attempt to present the Best Matches to a query first.
- User selects classes to feed back for the second stage search of documents.
- Can be used with any classified/Indexed
Cheshire II Demo
- Examples from the
- SciMentor (BioSearch) project
- Journal of Biological Chemistry and MEDLINE data
- CHESTER (EconLit)
- Journal of Economic Literature subjects
- Unfamiliar Metadata & TIDES Projects
- Basis for clusters is a normalized Library of Congress Class Number
- Evidence is provided by terms from record titles (and subject headings for the all-languages set)
- Five different training sets (Russian, German, French, Spanish, and All Languages)
- Testing cross-language retrieval and classification
- 4W Project Search