Prof. Ray Larson - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Lecture 8 Clustering
Principles of Information Retrieval
  • Prof. Ray Larson
  • University of California, Berkeley
  • School of Information
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Spring 2007
  • http://courses.ischool.berkeley.edu/i240/s07

Some slides in this lecture were originally
created by Prof. Marti Hearst
2
Mini-TREC
  • Need to make groups
  • Today
  • Systems
  • SMART (not recommended)
  • ftp://ftp.cs.cornell.edu/pub/smart
  • MG (We have a special version if interested)
  • http://www.mds.rmit.edu.au/mg/welcome.html
  • Cheshire II and 3
  • II: ftp://cheshire.berkeley.edu/pub/cheshire and
    http://cheshire.berkeley.edu
  • 3: http://cheshire3.sourceforge.org
  • Zprise (older search system from NIST)
  • http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
  • IRF (new Java-based IR framework from NIST)
  • http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
  • Lemur
  • http://www-2.cs.cmu.edu/~lemur
  • Lucene (Java-based text search engine)
  • http://jakarta.apache.org/lucene/docs/index.html
  • Others?? (See http://searchtools.com)

3
Mini-TREC
  • Proposed Schedule
  • February 15: Database and previous queries
  • February 27: Report on system acquisition and
    setup
  • March 8: New queries for testing
  • April 19: Results due
  • April 24 or 26: Results and system rankings
  • May 8: Group reports and discussion

4
Review IR Models
  • Set Theoretic Models
  • Boolean
  • Fuzzy
  • Extended Boolean
  • Vector Models (Algebraic)
  • Probabilistic Models

5
Similarity Measures
Simple matching (coordination level match)
Dice's Coefficient
Jaccard's Coefficient
Cosine Coefficient
Overlap Coefficient
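For binary (set-of-terms) document representations, the five coefficients can be sketched as follows; the code and the two example documents are illustrative, not from the slides:

```python
import math

def simple_matching(x, y):
    # Coordination-level match: number of shared terms.
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

d1 = {"star", "galaxy", "nebula", "light"}
d2 = {"star", "galaxy", "film"}
```

All four normalized coefficients divide the same numerator (the shared-term count) by different normalizing factors, which is why they tend to rank document pairs similarly.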
6
Documents in Vector Space
[Figure: documents D1-D11 plotted in a three-dimensional
term space with axes t1, t2, t3]
7
Vector Space with Term Weights and Cosine Matching
Di = (di1, wdi1; di2, wdi2; ...; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; ...; qit, wqit)

[Figure: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
plotted against axes Term A (horizontal) and Term B (vertical)]
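Cosine matching on the three vectors from this slide can be worked through directly (a minimal sketch; only the vectors themselves come from the slide):

```python
import math

def cosine(a, b):
    # Dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
```

D2 scores higher than D1 because it points in nearly the same direction as Q; cosine matching compares direction, not vector length.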
8
Term Weights in SMART
  • In SMART, weights are decomposed into three
    factors: a term-frequency component, a collection
    weighting component, and a normalization component

9
SMART Freq Components
binary, maxnorm, augmented, log
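The slide's formulas are not reproduced in this transcript, so the following sketch uses the standard SMART definitions of these four frequency components (the exact definitions here are assumed, not taken from the slide):

```python
import math

def binary_tf(tf):
    # binary: 1 if the term occurs at all.
    return 1 if tf > 0 else 0

def maxnorm_tf(tf, max_tf):
    # maxnorm: frequency divided by the document's maximum frequency.
    return tf / max_tf

def augmented_tf(tf, max_tf):
    # augmented: scaled into [0.5, 1.0] so rare terms still count.
    return 0.5 + 0.5 * tf / max_tf

def log_tf(tf):
    # log: dampens the effect of high raw frequencies.
    return 1 + math.log(tf) if tf > 0 else 0
```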
10
Collection Weighting in SMART
inverse, squared, probabilistic, frequency
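As with the frequency components, the formulas here follow the usual SMART-style definitions and are assumed rather than copied from the slide (N = collection size, n = number of documents containing the term; reading "frequency" as no collection weighting is also an assumption):

```python
import math

def inverse(N, n):
    # Classic inverse document frequency: log(N / n).
    return math.log(N / n)

def inverse_squared(N, n):
    # Squared variant, emphasizing rare terms more strongly.
    return math.log(N / n) ** 2

def probabilistic(N, n):
    # Probabilistic idf: log of the odds a random document lacks the term.
    return math.log((N - n) / n)

def frequency(N, n):
    # Taken here as "no collection weighting" (weight of 1) -- an assumption.
    return 1
```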
11
Term Normalization in SMART
sum, cosine, fourth, max
12
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • the space is used more for visualization than as
    a theoretically grounded model
  • most similarity measures work about the same
    regardless of model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

13
Today
  • Clustering
  • Automatic Classification
  • Cluster-enhanced search

14
Overview
  • Introduction to Automatic Classification and
    Clustering
  • Classification of Classification Methods
  • Classification Clusters and Information Retrieval
    in Cheshire II
  • DARPA Unfamiliar Metadata Project

15
Classification
  • The grouping together of items (including
    documents or their representations) which are
    then treated as a unit. The groupings may be
    predefined or generated algorithmically. The
    process itself may be manual or automated.
  • In document classification the items are grouped
    together because they are likely to be wanted
    together
  • For example, items about the same topic.

16
Automatic Indexing and Classification
  • Automatic indexing is typically the simple
    deriving of keywords from a document and
    providing access to all of those words.
  • More complex Automatic Indexing Systems attempt
    to select controlled vocabulary terms based on
    terms in the document.
  • Automatic classification attempts to
    automatically group similar documents using
    either
  • A fully automatic clustering method.
  • An established classification scheme and set of
    documents already indexed by that scheme.

17
Background and Origins
  • Early suggestion by Fairthorne
  • The Mathematics of Classification
  • Early experiments by Maron (1961) and Borko and
    Bernick(1963)
  • Work in Numerical Taxonomy and its application to
    Information retrieval Jardine, Sibson, van
    Rijsbergen, Salton (1970s).
  • Early IR clustering work more concerned with
    efficiency issues than semantic issues.

18
Document Space has High Dimensionality
  • What happens beyond three dimensions?
  • Similarity still has to do with how many tokens
    are shared in common.
  • More terms -> harder to understand which subsets
    of words are shared among similar documents.
  • One approach to handling high dimensionality
    Clustering

19
Vector Space Visualization
20
Cluster Hypothesis
  • The basic notion behind the use of classification
    and clustering methods
  • Closely associated documents tend to be relevant
    to the same requests.
  • C.J. van Rijsbergen

21
Classification of Classification Methods
  • Class Structure
  • Intellectually Formulated
  • Manual assignment (e.g. Library classification)
  • Automatic assignment (e.g. Cheshire
    Classification Mapping)
  • Automatically derived from collection of items
  • Hierarchic Clustering Methods (e.g. Single Link)
  • Agglomerative Clustering Methods (e.g. Dattola)
  • Hybrid Methods (e.g. Query Clustering)

22
Classification of Classification Methods
  • Relationship between properties and classes
  • monothetic
  • polythetic
  • Relation between objects and classes
  • exclusive
  • overlapping
  • Relation between classes and classes
  • ordered
  • unordered

Adapted from Sparck Jones
23
Properties and Classes
  • Monothetic
  • Class defined by a set of properties that are
    both necessary and sufficient for membership in
    the class
  • Polythetic
  • Class defined by a set of properties such that
    each member of the class possesses a large number
    of those properties, each property is possessed
    by a large number of the members, and no
    individual possesses all of the properties.

24
Monothetic vs. Polythetic
25
Exclusive Vs. Overlapping
  • Items can belong exclusively to a single class
  • Items can belong to many classes, sometimes with
    a membership weight

26
Ordered Vs. Unordered
  • Ordered classes have some sort of structure
    imposed on them
  • Hierarchies are typical of ordered classes
  • Unordered classes have no imposed precedence or
    structure and each class is considered on the
    same level
  • Typical in agglomerative methods

27
Text Clustering
  • Clustering is
  • The art of finding groups in data.
  • -- Kaufman and Rousseeuw

[Figure: points scattered in a two-dimensional term
space (Term 1 vs. Term 2)]
28
Text Clustering
  • Clustering is
  • The art of finding groups in data.
  • -- Kaufman and Rousseeuw

[Figure: points in a two-dimensional term space
(Term 1 vs. Term 2)]
29
Text Clustering
  • Finds overall similarities among groups of
    documents
  • Finds overall similarities among groups of tokens
  • Picks out some themes, ignores others

30
Coefficients of Association
  • Simple
  • Dice's coefficient
  • Jaccard's coefficient
  • Cosine coefficient
  • Overlap coefficient

31
Pair-wise Document Similarity
How to compute document similarity?
32
Pair-wise Document Similarity (no normalization,
for simplicity)
33
Pair-wise Document Similarity
(cosine normalization)
34
Document/Document Matrix
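The document/document matrix can be built by applying cosine-normalized similarity to every pair of documents; the weight vectors below are illustrative, not from the slides:

```python
import math

def cosine(a, b):
    # Inner product of term weights, cosine-normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

docs = [(1.0, 0.0, 2.0), (0.0, 1.0, 1.0), (2.0, 0.0, 4.0)]
matrix = [[cosine(a, b) for b in docs] for a in docs]
```

The matrix is symmetric with ones on the diagonal; note that the first and third documents score 1.0 because their weight vectors are parallel, even though the third is "longer".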
35
Clustering Methods
  • Hierarchical
  • Agglomerative
  • Hybrid
  • Automatic Class Assignment

36
Hierarchic Agglomerative Clustering
  • Basic method
  • 1) Calculate all of the interdocument similarity
    coefficients
  • 2) Assign each document to its own cluster
  • 3) Fuse the most similar pair of current clusters
  • 4) Update the similarity matrix by deleting the
    rows for the fused clusters and calculating
    entries for the row and column representing the
    new cluster (centroid)
  • 5) Return to step 3 if there is more than one
    cluster left
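The five steps above can be sketched directly (an O(n^3) illustration using cosine similarity and mean centroids; the similarity matrix is recomputed each pass rather than updated in place as step 4 describes, and the vectors are made up):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hac(vectors):
    # Step 2: every document starts as its own cluster (members, centroid).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    merges = []
    while len(clusters) > 1:  # Step 5: repeat until one cluster remains.
        # Steps 1/4: find the most similar pair of current cluster centroids.
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        # Step 3: fuse them; the new centroid is the mean of member vectors.
        members = clusters[i][0] + clusters[j][0]
        centroid = [sum(vectors[m][d] for m in members) / len(members)
                    for d in range(len(vectors[0]))]
        merges.append(tuple(sorted(members)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, centroid))
    return merges

vecs = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0), (0.1, 0.9)]
merges = hac(vecs)
```

With these four vectors the two near-parallel pairs fuse first, then the two resulting clusters merge into one, giving the hierarchy a dendrogram would show.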

37
Hierarchic Agglomerative Clustering
[Figure: agglomerative merging of documents A-I (dendrogram)]
38
Hierarchic Agglomerative Clustering
[Figure: agglomerative merging of documents A-I, continued]
39
Hierarchic Agglomerative Clustering
[Figure: completed dendrogram over documents A-I]
40
Hierarchical Methods
Single Link Dissimilarity Matrix
Hierarchical methods: polythetic, usually
exclusive, ordered. Clusters are order-independent.
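At a given threshold, single-link clusters are the connected components of the graph that links every pair whose dissimilarity is at or below the threshold. A sketch with an illustrative matrix (the slide's matrix values are not reproduced in this transcript):

```python
def single_link_clusters(dissim, threshold):
    # Union-find over documents; link every pair within the threshold.
    n = len(dissim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

dissim = [
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
```

Raising the threshold, as the next three slides do, can only merge existing clusters, never split them, which is why the thresholded views nest into a hierarchy.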
41
Threshold .1
Single Link Dissimilarity Matrix
42
Threshold .2
43
Threshold .3
44
K-means / Rocchio Clustering
Agglomerative methods: polythetic, exclusive or
overlapping, unordered. Clusters are
order-dependent.
Rocchio's method:
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and
   compute centroids
3. Reassign all documents to centroid(s)
45
K-Means Clustering
  • 1) Create a pair-wise similarity measure
  • 2) Find K centers using agglomerative clustering
  • take a small sample
  • group bottom up until K groups found
  • 3) Assign each document to nearest center,
    forming new clusters
  • 4) Repeat 3 as necessary
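Steps 3 and 4 can be sketched as follows; for brevity the centers here are seeded directly rather than by the agglomerative sampling of step 2, and the points are illustrative:

```python
import math

def kmeans(points, centers, max_iters=100):
    # Alternate assignment and centroid update until assignments stabilize.
    assign = None
    for _ in range(max_iters):
        new_assign = [min(range(len(centers)),
                          key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        if new_assign == assign:
            break  # step 4: no document moved, so stop
        assign = new_assign
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # recompute centroid as the mean of its members
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return assign, centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
labels, centers = kmeans(pts, [[0.0, 0.0], [5.0, 5.0]])
```

Because the result depends on the initial centers, the method is order-dependent, as the previous slide notes for agglomerative methods generally.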

46
Scatter/Gather
  • Cutting, Pedersen, Tukey & Karger 92, 93;
    Hearst & Pedersen 95
  • Cluster sets of documents into general themes,
    like a table of contents
  • Display the contents of the clusters by showing
    topical terms and typical titles
  • User chooses subsets of the clusters and
    re-clusters the documents within
  • Resulting new groups have different themes

47
S/G Example: query on "star"
  • Encyclopedia text
  • 14 sports
  • 8 symbols
  • 47 film, tv
  • 68 film, tv (p)
  • 7 music
  • 97 astrophysics
  • 67 astronomy (p)
  • 12 stellar phenomena
  • 10 flora/fauna
  • 49 galaxies, stars
  • 29 constellations
  • 7 miscellaneous
  • Clustering and re-clustering is entirely
    automated

48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
Another use of clustering
  • Use clustering to map the entire huge
    multidimensional document space into a huge
    number of small clusters.
  • Project these onto a 2D graphical
    representation

53
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
54
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
55
Concept Landscapes
  • Pharmacology
  • (e.g., Lin, Chen, Wise et al.)
  • Too many concepts, or too coarse
  • Single concept per document
  • No titles
  • Browsing without search

56
Clustering
  • Advantages
  • See some main themes
  • Disadvantage
  • Many ways documents could group together are
    hidden
  • Thinking point what is the relationship to
    classification systems and facets?

57
Automatic Class Assignment
Automatic class assignment: polythetic, exclusive
or overlapping, usually ordered. Clusters are
order-independent, usually based on an
intellectually derived scheme.
58
Automatic Categorization in Cheshire II
  • Cheshire supports a method we call
    classification clustering that relies on having
    a set of records that have been indexed using
    some controlled vocabulary.
  • Classification clustering has the following steps

59
Cheshire II - Cluster Generation
  • Define basis for clustering records.
  • Select field (i.e., the controlled vocabulary
    terms) to form the basis of the cluster.
  • Evidence: fields to use as contents of the
    pseudo-documents (e.g., the titles or other
    topical parts).
  • During indexing cluster keys are generated with
    basis and evidence from each record.
  • Cluster keys are sorted and merged on basis and
    pseudo-documents created for each unique basis
    element containing all evidence fields.
  • Pseudo-Documents (Class clusters) are indexed on
    combined evidence fields.
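The cluster-key generation and merging steps above might be sketched like this; the record structure and field names (class_num, title, subjects) are hypothetical illustrations, not Cheshire's actual record format:

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    # Each record contributes (basis, evidence) cluster keys; evidence is
    # merged into one pseudo-document per unique basis element.
    pseudo_docs = defaultdict(list)
    for rec in records:
        basis = rec[basis_field]
        for field in evidence_fields:
            pseudo_docs[basis].extend(rec.get(field, []))
    return dict(pseudo_docs)

records = [
    {"class_num": "QB", "title": ["stars"], "subjects": ["astronomy"]},
    {"class_num": "QB", "title": ["galaxies"]},
    {"class_num": "QA", "title": ["clustering"]},
]
clusters = build_pseudo_documents(records, "class_num", ["title", "subjects"])
```

Each resulting pseudo-document (one per unique basis value) would then be indexed on its combined evidence terms, as the final step describes.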

60
Cheshire II - Two-Stage Retrieval
  • Using the LC Classification System
  • Pseudo-Document created for each LC class
    containing terms derived from content-rich
    portions of documents in that class (e.g.,
    subject headings, titles, etc.)
  • Permits searching by any term in the class
  • Ranked Probabilistic retrieval techniques attempt
    to present the Best Matches to a query first.
  • User selects classes to feed back for the second
    stage search of documents.
  • Can be used with any classified/indexed
    collection.

61
Cheshire II Demo
  • Examples from the
  • SciMentor (BioSearch) project
  • Journal of Biological Chemistry and MEDLINE data
  • CHESTER (EconLit)
  • Journal of Economic Literature subjects
  • Unfamiliar Metadata TIDES Projects
  • Basis for clusters is a normalized Library of
    Congress Class Number
  • Evidence is provided by terms from record titles
    (and subject headings for the all-languages set)
  • Five different training sets (Russian, German,
    French, Spanish, and All Languages)
  • Testing cross-language retrieval and
    classification
  • 4W Project Search