1. Lecture 8: Clustering
Principles of Information Retrieval
- Prof. Ray Larson
- University of California, Berkeley
- School of Information
- Tuesday and Thursday 10:30 am - 12:00 pm
- Spring 2007
- http://courses.ischool.berkeley.edu/i240/s07
Some slides in this lecture were originally
created by Prof. Marti Hearst
2. Mini-TREC
- Need to make groups
  - Today
- Systems
  - SMART (not recommended)
    - ftp://ftp.cs.cornell.edu/pub/smart
  - MG (We have a special version if interested)
    - http://www.mds.rmit.edu.au/mg/welcome.html
  - Cheshire II and 3
    - II: ftp://cheshire.berkeley.edu/pub/cheshire and http://cheshire.berkeley.edu
    - 3: http://cheshire3.sourceforge.org
  - Zprise (Older search system from NIST)
    - http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
  - IRF (new Java-based IR framework from NIST)
    - http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
  - Lemur
    - http://www-2.cs.cmu.edu/lemur
  - Lucene (Java-based text search engine)
    - http://jakarta.apache.org/lucene/docs/index.html
  - Others?? (See http://searchtools.com)
3. Mini-TREC
- Proposed Schedule
  - February 15: Database and previous queries
  - February 27: Report on system acquisition and setup
  - March 8: New queries for testing
  - April 19: Results due
  - April 24 or 26: Results and system rankings
  - May 8: Group reports and discussion
4. Review: IR Models
- Set Theoretic Models
  - Boolean
  - Fuzzy
  - Extended Boolean
- Vector Models (Algebraic)
- Probabilistic Models
5. Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
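For reference, the standard set-based forms of these measures, for a query Q and document D treated as sets of terms (the original slide showed them as an image; these are the textbook definitions as given by van Rijsbergen):

```latex
\text{Simple matching:}\;\; |Q \cap D| \qquad
\text{Dice:}\;\; \frac{2\,|Q \cap D|}{|Q| + |D|} \qquad
\text{Jaccard:}\;\; \frac{|Q \cap D|}{|Q \cup D|}

\text{Cosine:}\;\; \frac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}} \qquad
\text{Overlap:}\;\; \frac{|Q \cap D|}{\min(|Q|,\,|D|)}
```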
6. Documents in Vector Space
(Figure: documents D1-D11 plotted in a three-term space with axes t1, t2, and t3.)
7. Vector Space with Term Weights and Cosine Matching
- $D_i = (d_{i1}, w_{di1};\; d_{i2}, w_{di2};\; \ldots;\; d_{it}, w_{dit})$
- $Q = (q_{i1}, w_{qi1};\; q_{i2}, w_{qi2};\; \ldots;\; q_{it}, w_{qit})$
(Figure: Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7) plotted with Term A on the horizontal axis and Term B on the vertical axis, both running 0 to 1.0.)
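A quick check of the figure's numbers; the vectors come from the slide, the code is just an illustration:

```python
import math

def cosine(v, w):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

q, d1, d2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(q, d1), 2))  # 0.73
print(round(cosine(q, d2), 2))  # 0.98 -> D2 points in nearly the same direction as Q
```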
8. Term Weights in SMART
- In SMART, weights are decomposed into three factors: a term frequency component, a collection weighting component, and a normalization component (detailed on the next three slides).
9. SMART Freq Components
- Binary
- Max-norm
- Augmented
- Log
10. Collection Weighting in SMART
- Inverse
- Squared
- Probabilistic
- Frequency
11. Term Normalization in SMART
- Sum
- Cosine
- Fourth
- Max
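The formulas behind these component names were images in the original slides. As a reconstruction, the commonly published SMART forms for the components I can state with confidence are below ($tf_k$ = raw frequency of term $k$ in the document, $N$ = number of documents, $n_k$ = number of documents containing term $k$, $w_k$ = combined weight before normalization); the "squared", "frequency", and "fourth" variants are omitted rather than guessed at:

```latex
% Frequency components
\text{binary:}\; 1 \text{ if } tf_k > 0 \qquad
\text{max-norm:}\; \frac{tf_k}{\max_j tf_j} \qquad
\text{augmented:}\; 0.5 + 0.5\,\frac{tf_k}{\max_j tf_j} \qquad
\text{log:}\; \ln(tf_k) + 1

% Collection weighting
\text{inverse (idf):}\; \log \frac{N}{n_k} \qquad
\text{probabilistic:}\; \log \frac{N - n_k}{n_k}

% Normalization
\text{sum:}\; \frac{w_k}{\sum_j w_j} \qquad
\text{cosine:}\; \frac{w_k}{\sqrt{\sum_j w_j^2}} \qquad
\text{max:}\; \frac{w_k}{\max_j w_j}
```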
12. Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms
13. Today
- Clustering
- Automatic Classification
- Cluster-enhanced search
14. Overview
- Introduction to Automatic Classification and Clustering
- Classification of Classification Methods
- Classification Clusters and Information Retrieval in Cheshire II
- DARPA Unfamiliar Metadata Project
15. Classification
- The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
- In document classification, items are grouped together because they are likely to be wanted together
  - For example, items about the same topic.
16. Automatic Indexing and Classification
- Automatic indexing is typically the simple derivation of keywords from a document, with access provided to all of those words.
- More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document.
- Automatic classification attempts to automatically group similar documents, using either:
  - a fully automatic clustering method, or
  - an established classification scheme and a set of documents already indexed by that scheme.
17. Background and Origins
- Early suggestion by Fairthorne
  - "The Mathematics of Classification"
- Early experiments by Maron (1961) and Borko and Bernick (1963)
- Work in numerical taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s)
- Early IR clustering work was more concerned with efficiency issues than semantic issues.
18. Document Space has High Dimensionality
- What happens beyond three dimensions?
- Similarity still has to do with how many tokens are shared in common.
- More terms -> harder to understand which subsets of words are shared among similar documents.
- One approach to handling high dimensionality: clustering.
19. Vector Space Visualization (image not transcribed)
20. Cluster Hypothesis
- The basic notion behind the use of classification and clustering methods:
- "Closely associated documents tend to be relevant to the same requests."
  - C. J. van Rijsbergen
21. Classification of Classification Methods
- Class Structure
  - Intellectually formulated
    - Manual assignment (e.g., library classification)
    - Automatic assignment (e.g., Cheshire classification mapping)
  - Automatically derived from a collection of items
    - Hierarchic clustering methods (e.g., single link)
    - Agglomerative clustering methods (e.g., Dattola)
    - Hybrid methods (e.g., query clustering)
22. Classification of Classification Methods
- Relationship between properties and classes
  - Monothetic
  - Polythetic
- Relationship between objects and classes
  - Exclusive
  - Overlapping
- Relationship between classes and classes
  - Ordered
  - Unordered
- Adapted from Sparck Jones
23. Properties and Classes
- Monothetic
  - Class defined by a set of properties that are both necessary and sufficient for membership in the class.
- Polythetic
  - Class defined by a set of properties such that each member possesses some (usually large) number of those properties, each property is possessed by a large number of the members, and no individual need possess all of the properties.
24. Monothetic vs. Polythetic (illustration not transcribed)
25. Exclusive vs. Overlapping
- An item can belong exclusively to a single class
- Items can belong to many classes, sometimes with a membership weight
26. Ordered vs. Unordered
- Ordered classes have some sort of structure imposed on them
  - Hierarchies are typical of ordered classes
- Unordered classes have no imposed precedence or structure; each class is considered on the same level
  - Typical in agglomerative methods
27. Text Clustering
- Clustering is
  - "The art of finding groups in data."
  - -- Kaufman and Rousseeuw
(Figure: documents plotted in a two-term space; axes Term 1 and Term 2.)
28. Text Clustering
- Clustering is
  - "The art of finding groups in data."
  - -- Kaufman and Rousseeuw
(Figure: the same two-term space; axes Term 1 and Term 2.)
29. Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others
30. Coefficients of Association
- Simple
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
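A small sketch implementing the set-based forms of these coefficients in plain Python (the original slides showed the formulas as images; documents here are simply sets of terms):

```python
def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def cosine_set(a: set, b: set) -> float:
    return len(a & b) / (len(a) ** 0.5 * len(b) ** 0.5)

def overlap(a: set, b: set) -> float:
    return len(a & b) / min(len(a), len(b))

# Hypothetical example: two documents as term sets
d1 = {"star", "galaxy", "orbit", "mass"}
d2 = {"star", "film", "actor", "orbit"}
print(jaccard(d1, d2))  # 2 shared terms out of 6 distinct -> 0.333...
```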
31. Pair-wise Document Similarity
- How do we compute document similarity?
32. Pair-wise Document Similarity (no normalization, for simplicity)
33. Pair-wise Document Similarity: Cosine Normalization
34. Document/Document Matrix
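The formulas on slides 32-33 and the matrix on slide 34 were images in the original. A minimal sketch under the usual definitions: the unnormalized similarity is the inner product over shared term weights, cosine normalization divides by the two vector lengths, and the document/document matrix collects every pairwise value (the toy weights are hypothetical):

```python
import math

def inner_product(d1: dict, d2: dict) -> float:
    """Unnormalized similarity: sum of products of shared term weights."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

def cosine(d1: dict, d2: dict) -> float:
    """Inner product divided by the product of the vector lengths."""
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return inner_product(d1, d2) / (n1 * n2)

docs = [
    {"star": 2.0, "galaxy": 1.0},
    {"star": 1.0, "film": 3.0},
    {"galaxy": 2.0, "mass": 1.0},
]
# Document/document similarity matrix
matrix = [[cosine(a, b) for b in docs] for a in docs]
```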
35. Clustering Methods
- Hierarchical
- Agglomerative
- Hybrid
- Automatic Class Assignment
36. Hierarchic Agglomerative Clustering
- Basic method (a code sketch follows the list):
  1) Calculate all of the inter-document similarity coefficients
  2) Assign each document to its own cluster
  3) Fuse the most similar pair of current clusters
  4) Update the similarity matrix by deleting the rows and columns for the fused clusters and calculating the entries for the row and column representing the new cluster (centroid)
  5) Return to step 3 if there is more than one cluster left
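A minimal sketch of the five steps in plain Python, reusing the dict-based `cosine` from the sketch after slide 34; the centroid update is a simple average of member weights, which is one common choice among several:

```python
def hac(docs, similarity, target_clusters=1):
    """Hierarchic agglomerative clustering per steps 1-5 above.
    docs: list of {term: weight} dicts; similarity: a pairwise measure."""
    clusters = [[i] for i in range(len(docs))]   # step 2: one doc per cluster
    centroids = [dict(d) for d in docs]
    while len(clusters) > target_clusters:       # step 5: loop until done
        best, pair = -1.0, None
        for i in range(len(centroids)):          # steps 1/4: (re)compare
            for j in range(i + 1, len(centroids)):
                s = similarity(centroids[i], centroids[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair                              # step 3: fuse the closest pair
        merged = clusters[i] + clusters[j]
        centroid = {}                            # step 4: centroid of new cluster
        for m in merged:
            for t, w in docs[m].items():
                centroid[t] = centroid.get(t, 0.0) + w / len(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        centroids = [c for k, c in enumerate(centroids) if k not in (i, j)] + [centroid]
    return clusters
```

For example, `hac(docs, cosine, target_clusters=2)` on the three toy documents above groups the two astronomy-flavored documents together.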
37-39. Hierarchic Agglomerative Clustering
(Figures across three slides: items A B C D E F G H I being fused step by step; images not transcribed.)
40. Hierarchical Methods
(Figure: single-link dissimilarity matrix; not transcribed.)
- Hierarchical methods: polythetic, usually exclusive, ordered. Clusters are order-independent.
41. Threshold .1
(Figure: single-link dissimilarity matrix; not transcribed.)
42. Threshold .2
43. Threshold .3
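Slides 41-43 show single-link clusters forming as the dissimilarity threshold rises. A minimal sketch of that idea (the matrix below is hypothetical, not the one from the slides): link every pair at or below the threshold, then take connected components as the clusters:

```python
def single_link(dissim, threshold):
    """Single-link clusters at a threshold: connected components of the
    graph linking pairs whose dissimilarity is <= threshold."""
    n = len(dissim)
    parent = list(range(n))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)   # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Hypothetical 4-document dissimilarity matrix
D = [[0.0, 0.1, 0.4, 0.5],
     [0.1, 0.0, 0.3, 0.5],
     [0.4, 0.3, 0.0, 0.2],
     [0.5, 0.5, 0.2, 0.0]]
print(single_link(D, 0.1))  # [[0, 1], [2], [3]]
print(single_link(D, 0.3))  # one big cluster: [[0, 1, 2, 3]]
```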
44. K-means and Rocchio Clustering
- Agglomerative methods: polythetic, exclusive or overlapping, unordered. Clusters are order-dependent.
- Rocchio's method:
  1. Select initial centers (i.e., seed the space)
  2. Assign docs to the highest-matching centers and compute centroids
  3. Reassign all documents to centroid(s)
45. K-Means Clustering
- 1) Create a pair-wise similarity measure
- 2) Find K centers using agglomerative clustering
  - Take a small sample
  - Group bottom-up until K groups found
- 3) Assign each document to the nearest center, forming new clusters
- 4) Repeat step 3 as necessary (a code sketch follows)
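A compact sketch of the loop in steps 1-4 (dense vectors and random seeding for brevity; slide 45 suggests seeding the K centers by agglomerative clustering over a small sample instead):

```python
import math
import random

def kmeans(docs, k, iters=10):
    """K-means: assign each document to its nearest center, recompute
    centroids as cluster means, repeat. docs: list of float vectors."""
    centers = random.sample(docs, k)             # naive seeding (see note above)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:                           # step 3: nearest center
            nearest = min(range(k), key=lambda c: math.dist(d, centers[c]))
            clusters[nearest].append(d)
        for c in range(k):                       # recompute centroids
            if clusters[c]:
                centers[c] = [sum(x) / len(clusters[c]) for x in zip(*clusters[c])]
    return clusters, centers

# Hypothetical example: two obvious groups in two dimensions
docs = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
clusters, centers = kmeans(docs, k=2)
```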
46. Scatter/Gather
- Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
- Cluster sets of documents into general themes, like a table of contents
- Display the contents of the clusters by showing topical terms and typical titles
- User chooses subsets of the clusters and re-clusters the documents within
- Resulting new groups have different themes
47. S/G Example: query on "star"
- Encyclopedia text (cluster sizes and themes):
  - 14 sports
  - 8 symbols
  - 47 film, tv
  - 68 film, tv (p)
  - 7 music
  - 97 astrophysics
  - 67 astronomy (p)
  - 12 stellar phenomena
  - 10 flora/fauna
  - 49 galaxies, stars
  - 29 constellations
  - 7 miscellaneous
- Clustering and re-clustering is entirely automated
48-51. (No transcript)
52. Another Use of Clustering
- Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
- Project these onto a 2D graphical representation.
53. Clustering Multi-Dimensional Document Space
(Image from Wise et al. 95)
54. Clustering Multi-Dimensional Document Space
(Image from Wise et al. 95)
55. Concept Landscapes
- (e.g., Lin, Chen, Wise et al.)
- Too many concepts, or too coarse
- Single concept per document
- No titles
- Browsing without search
56. Clustering
- Advantages
  - See some main themes
- Disadvantage
  - Many ways documents could group together are hidden
- Thinking point: what is the relationship to classification systems and facets?
57. Automatic Class Assignment
- Automatic class assignment: polythetic, exclusive or overlapping, usually ordered. Clusters are order-independent and usually based on an intellectually derived scheme.
58. Automatic Categorization in Cheshire II
- Cheshire supports a method we call "classification clustering" that relies on having a set of records that have been indexed using some controlled vocabulary.
- Classification clustering has the following steps:
59. Cheshire II: Cluster Generation
- Define the basis for clustering records:
  - Select the field (i.e., the controlled vocabulary terms) to form the basis of the cluster.
  - Evidence: fields to use as the contents of the pseudo-documents (e.g., the titles or other topical parts).
- During indexing, cluster keys are generated with basis and evidence from each record.
- Cluster keys are sorted and merged on the basis, and a pseudo-document is created for each unique basis element, containing all evidence fields.
- Pseudo-documents (class clusters) are indexed on the combined evidence fields. (A code sketch of this grouping follows.)
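A minimal sketch of the cluster-key grouping described above; the record fields are hypothetical and this is not Cheshire's actual implementation:

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by their basis value (e.g., a classification number)
    and pool the evidence fields into one pseudo-document per class."""
    pseudo_docs = defaultdict(list)
    for rec in records:
        basis = rec.get(basis_field)
        if basis is None:
            continue                       # record has no basis value
        for field in evidence_fields:
            pseudo_docs[basis].extend(rec.get(field, []))
    return dict(pseudo_docs)

# Hypothetical records with an LC class number as the basis
records = [
    {"lc_class": "QB43", "title": ["stellar", "astronomy"], "subjects": ["stars"]},
    {"lc_class": "QB43", "title": ["galaxies"], "subjects": ["astrophysics"]},
    {"lc_class": "PN1993", "title": ["film", "stars"], "subjects": ["cinema"]},
]
pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subjects"])
# {"QB43": ["stellar", "astronomy", "stars", "galaxies", "astrophysics"], ...}
```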
60. Cheshire II: Two-Stage Retrieval
- Using the LC Classification System:
  - A pseudo-document is created for each LC class, containing terms derived from content-rich portions of the documents in that class (e.g., subject headings, titles, etc.)
  - Permits searching by any term in the class
  - Ranked probabilistic retrieval techniques attempt to present the best matches to a query first
  - The user selects classes to feed back for the second-stage search of documents
- Can be used with any classified/indexed collection. (A code sketch of the two stages follows.)
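A minimal sketch of the two-stage flow under stated assumptions: `rank` is a stand-in for the probabilistic ranking the slide mentions (here just term overlap), the pseudo-documents come from the grouping sketch above, and the user's interactive class selection is approximated by taking the top-ranked classes. This illustrates the flow, not Cheshire's API:

```python
def rank(query_terms, docs):
    """Placeholder ranking: score by term overlap, best first.
    A real system would use a probabilistic model here."""
    scored = [(len(set(query_terms) & set(terms)), doc_id)
              for doc_id, terms in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def two_stage_search(query_terms, pseudo_docs, docs_by_class, classes_to_keep=2):
    # Stage 1: rank the class-level pseudo-documents against the query
    top_classes = rank(query_terms, pseudo_docs)[:classes_to_keep]
    # Stage 2: rank only the documents belonging to the selected classes
    candidates = {}
    for cls in top_classes:
        candidates.update(docs_by_class.get(cls, {}))
    return rank(query_terms, candidates)
```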
61. Cheshire II Demo
- Examples from:
  - SciMentor (BioSearch) project
    - Journal of Biological Chemistry and MEDLINE data
  - CHESTER (EconLit)
    - Journal of Economic Literature subjects
  - Unfamiliar Metadata and TIDES projects
    - Basis for clusters is a normalized Library of Congress class number
    - Evidence is provided by terms from record titles (and subject headings for the all-languages set)
    - Five different training sets (Russian, German, French, Spanish, and all languages)
    - Testing cross-language retrieval and classification
  - 4W Project Search