Lecture 8: Clustering

Learn more at: https://courses.ischool.berkeley.edu


1
Lecture 8: Clustering
Automatic Classification

University of California, Berkeley
School of Information
IS 245: Organization of Information in Collections

Some slides in this lecture were originally
created by Prof. Marti Hearst
2
Overview

Introduction to Automatic Classification and
Clustering
Classification of Classification Methods
Classification Clusters and Information Retrieval
in Cheshire II
The 4W project revisited

3
Classification

The grouping together of items (including
documents or their representations) which are
then treated as a unit. The groupings may be
predefined or generated algorithmically. The
process itself may be manual or automated.
In document classification the items are grouped
together because they are likely to be wanted
together, for example, items about the same topic.

4
Automatic Indexing and Classification

Automatic indexing is typically the simple
deriving of keywords from a document and
providing access to all of those words.
More complex Automatic Indexing Systems attempt
to select controlled vocabulary terms based on
terms in the document.
Automatic classification attempts to
automatically group similar documents using
either:
A fully automatic clustering method, or
An established classification scheme and a set of
documents already indexed by that scheme.

5
Background and Origins

Early suggestion by Fairthorne:
"The Mathematics of Classification"
Early experiments by Maron (1961) and Borko and
Bernick (1963)
Work in Numerical Taxonomy and its application to
Information Retrieval: Jardine, Sibson, van
Rijsbergen, Salton (1970s).
Early IR clustering work more concerned with
efficiency issues than semantic issues.

6
Cluster Hypothesis

The basic notion behind the use of classification
and clustering methods:
"Closely associated documents tend to be relevant
to the same requests."
(C.J. van Rijsbergen)

7
Classification of Classification Methods

Class Structure
Intellectually Formulated
Manual assignment (e.g. Library classification)
Automatic assignment (e.g. Cheshire
Classification Mapping)
Automatically derived from collection of items
Hierarchic Clustering Methods (e.g. Single Link)
Agglomerative Clustering Methods (e.g. Dattola)
Hybrid Methods (e.g. Query Clustering)

8
Classification of Classification Methods

Relationship between properties and classes
monothetic
polythetic
Relation between objects and classes
exclusive
overlapping
Relation between classes and classes
ordered
unordered

Adapted from Sparck Jones
9
Properties and Classes

Monothetic
Class defined by a set of properties that are
both necessary and sufficient for membership in
the class
Polythetic
Class defined by a set of properties such that to
be a member of the class some individual must
have some number (usually large) of those
properties, and that a large number of
individuals in the class possess some of those
properties, and no individual possesses all of
the properties.
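
The two definitions above can be illustrated with a minimal Python sketch. The function names, the property sets and the `min_shared` threshold are illustrative assumptions, not part of the lecture. Only the individual-level polythetic condition is checked here; the class-level conditions (many members share each property, no member has them all) are not.

```python
def is_monothetic_member(item_props: set, class_props: set) -> bool:
    """Monothetic: the item must possess every defining property."""
    return class_props <= item_props


def is_polythetic_member(item_props: set, class_props: set, min_shared: int) -> bool:
    """Polythetic: the item must share at least `min_shared` of the properties."""
    return len(item_props & class_props) >= min_shared


# Hypothetical class defined by four properties.
class_props = {"a", "b", "c", "d"}
item = {"a", "b", "c", "x"}  # has three of the four properties

monothetic = is_monothetic_member(item, class_props)     # item lacks "d"
polythetic = is_polythetic_member(item, class_props, 3)  # item shares 3 of 4
```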

10
Monothetic vs. Polythetic
(Figure: polythetic and monothetic class examples. Adapted from van Rijsbergen, 1979)
11
Exclusive Vs. Overlapping

Exclusive: an item can belong to only a single
class
Overlapping: items can belong to many classes,
sometimes with a membership weight

12
Ordered Vs. Unordered

Ordered classes have some sort of structure
imposed on them
Hierarchies are typical of ordered classes
Unordered classes have no imposed precedence or
structure and each class is considered on the
same level
Typical in agglomerative methods

13
Clustering Methods

Hierarchical
Agglomerative
Hybrid
Automatic Class Assignment

14
Coefficients of Association

Simple
Dice's coefficient
Jaccard's coefficient
Cosine coefficient
Overlap coefficient
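
The slide lists the coefficients without their formulas. As a rough sketch, assuming documents are represented as sets of index terms (binary weights), the usual set-based forms can be written as below. The function names are illustrative, and "simple" is taken here to mean the simple matching count |X ∩ Y|.

```python
import math


def simple_match(x: set, y: set) -> int:
    """Simple matching: number of shared terms, |X & Y|."""
    return len(x & y)


def dice(x: set, y: set) -> float:
    """Dice's coefficient: 2|X & Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(x: set, y: set) -> float:
    """Jaccard's coefficient: |X & Y| / |X | Y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def cosine(x: set, y: set) -> float:
    """Cosine coefficient: |X & Y| / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def overlap(x: set, y: set) -> float:
    """Overlap coefficient: |X & Y| / min(|X|, |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# Two hypothetical documents sharing two of their three terms.
doc_x = {"cluster", "document", "retrieval"}
doc_y = {"document", "retrieval", "index"}
```

All but the simple matching count are normalized to the range 0 to 1, which makes them easier to threshold when building a (dis)similarity matrix.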

15
Hierarchical Methods
Single Link Dissimilarity Matrix
Hierarchical methods: polythetic, usually
exclusive, ordered. Clusters are order-independent.
16
Threshold .1
Single Link Dissimilarity Matrix
17
Threshold .2
18
Threshold .3
19
Clustering
Agglomerative methods: polythetic, exclusive or
overlapping, unordered. Clusters are
order-dependent.
Rocchio's method (similar to current K-means
methods):
1. Select initial centers (i.e. seed the space).
2. Assign docs to highest matching centers and
compute centroids.
3. Reassign all documents to centroid(s).
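
The three steps above can be sketched as follows. This is a K-means-style illustration under stated assumptions, not Rocchio's original algorithm: documents are dense term-weight vectors, matching uses cosine similarity, seeds are given as document indices, and each document is assigned to exactly one center.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign(docs, centers):
    """Index of the highest-matching center for each document."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs]


def seeded_clustering(docs, seed_indices, passes=1):
    centers = [docs[i] for i in seed_indices]      # step 1: seed the space
    labels = assign(docs, centers)                 # step 2: first assignment
    for _ in range(passes):
        for c in range(len(centers)):              # step 2: compute centroids
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
        labels = assign(docs, centers)             # step 3: reassign all docs
    return labels


# Hypothetical two-term document vectors.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = seeded_clustering(docs, seed_indices=[0, 2])
```

Because the result depends on which documents are chosen as seeds, clusters produced this way are order-dependent, as the slide notes.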
20
Automatic Class Assignment
Automatic Class Assignment: polythetic, exclusive
or overlapping, usually ordered. Clusters are
order-independent, usually based on an
intellectually derived scheme.
21
Automatic Categorization in Cheshire II

The Cheshire II system is intended to provide a
bridge between the purely bibliographic realm of
previous generations of online catalogs and the
rapidly expanding realm of full-text and
multimedia information resources. It is currently
used in the UC Berkeley Digital Library Project
and for a number of other sites and projects.

22
Overview of Cheshire II

It supports SGML as the primary database type.
It is a client/server application.
Uses the Z39.50 Information Retrieval Protocol.
Supports Boolean searching of all servers.
Supports probabilistic ranked retrieval in the
Cheshire search engine.
Supports "nearest neighbor" searches, relevance
feedback and Two-Stage Search.
GUI interface on X window displays (Tcl/Tk).
HTTP/CGI interface for the Web (Tcl scripting).

23
Cheshire II - Cluster Generation

Define basis for clustering records.
Select field to form the basis of the cluster.
Select evidence fields to use as contents of the
pseudo-documents.
During indexing cluster keys are generated with
basis and evidence from each record.
Cluster keys are sorted and merged on basis and
pseudo-documents created for each unique basis
element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on
combined evidence fields.
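
The cluster-generation steps above can be sketched roughly as follows. This is not Cheshire II code: the record layout and the field names (`lc_class`, `title`, `subject`) are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Merge evidence text from all records that share a basis value."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if not basis:
            continue  # a record without a basis value yields no cluster key
        for field_name in evidence_fields:
            value = record.get(field_name)
            if value:
                clusters[basis].append(value)
    # Sort on basis; one pseudo-document per unique basis element.
    return {basis: " ".join(evidence)
            for basis, evidence in sorted(clusters.items())}


# Hypothetical bibliographic records.
records = [
    {"lc_class": "QA76", "title": "Programming in C",
     "subject": "Computer programming"},
    {"lc_class": "Z699", "title": "Online catalogs",
     "subject": "Information retrieval"},
    {"lc_class": "QA76", "title": "Software design",
     "subject": "Software engineering"},
]

pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subject"])
```

Each resulting pseudo-document can then be indexed like an ordinary document, so a class becomes searchable by any term that appears in its evidence.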

24
Cheshire II - Two-Stage Retrieval

Using the LC Classification System
Pseudo-Document created for each LC class
containing terms derived from content-rich
portions of documents in that class (subject
headings, titles, etc.)
Permits searching by any term in the class
Ranked Probabilistic retrieval techniques attempt
to present the Best Matches to a query first.
User selects classes to feed back for the second
stage search of documents.
Can be used with any classified/Indexed
collection.
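
A minimal sketch of the two-stage flow described above. Simple term overlap stands in for the probabilistic ranking (a deliberate simplification), and the class identifiers, field names and sample data are illustrative assumptions.

```python
def rank_classes(query, pseudo_docs):
    """Stage 1: rank class pseudo-documents by number of shared query terms."""
    query_terms = set(query.lower().split())
    scored = []
    for class_id, text in pseudo_docs.items():
        score = len(query_terms & set(text.lower().split()))
        if score > 0:
            scored.append((score, class_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best matches first
    return [class_id for _, class_id in scored]


def second_stage(selected_classes, records, class_field="lc_class"):
    """Stage 2: retrieve the documents in the classes the user selected."""
    wanted = set(selected_classes)
    return [rec for rec in records if rec.get(class_field) in wanted]


# Hypothetical class pseudo-documents and document records.
pseudo_docs = {
    "QA76": "computer programming languages software",
    "Z699": "information retrieval systems online catalogs",
}
records = [
    {"title": "Online catalogs", "lc_class": "Z699"},
    {"title": "Programming in C", "lc_class": "QA76"},
]

ranked = rank_classes("information retrieval", pseudo_docs)  # stage 1
hits = second_stage(ranked[:1], records)                     # stage 2
```

Here `ranked[:1]` stands in for the user's selection of classes to feed back.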

25
Probabilistic Retrieval: Logistic Regression

Estimates for relevance based on log-linear model
with various statistical measures of document
content as independent variables.

Log odds of relevance is a linear function of
attributes
Term contributions summed
Probability of relevance is obtained by inverting the log odds (the logistic transform)
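
The slide's formulas are not in the transcript. As a hedged sketch of the relationship it describes: the log odds are an intercept plus a weighted sum of attribute values, and the probability is recovered with the logistic function. The coefficient values used here are placeholders, not the fitted Cheshire II coefficients.

```python
import math


def probability_of_relevance(coefficients, attributes):
    """Convert a linear log-odds score into a probability.

    coefficients: [c0, c1, ..., cn], where c0 is the intercept.
    attributes:   [x1, ..., xn], statistical measures for a query/document pair.
    """
    intercept, *weights = coefficients
    if len(weights) != len(attributes):
        raise ValueError("need exactly one coefficient per attribute")
    # log O(R) = c0 + sum(ci * xi)
    log_odds = intercept + sum(w * x for w, x in zip(weights, attributes))
    # P(R) = 1 / (1 + exp(-log O(R)))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder values: log odds of 0 correspond to even odds.
p_even = probability_of_relevance([0.0, 1.0, 1.0], [0.0, 0.0])
# Log odds of ln(3) correspond to odds of 3:1.
p_three_to_one = probability_of_relevance([math.log(3), 2.0], [0.0])
```

Documents can be ranked by the log odds directly, since the logistic function is monotonic.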
26
Probabilistic Retrieval: Logistic Regression
In Cheshire II, probability of relevance is based
on logistic regression from a sample set of TREC
documents to determine the values of the
coefficients. At retrieval, the probability
estimate is obtained from 6 attributes, or clues,
about term usage in the documents and the query.
27
Probabilistic Retrieval: Logistic Regression
Attributes:
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document (M), logged
28
Cheshire II Demo

Examples from the
SciMentor (BioSearch) project
Journal of Biological Chemistry and MEDLINE data
CHESTER (EconLit)
Journal of Economic Literature subjects
Unfamiliar Metadata & TIDES Projects
Basis for clusters is a normalized Library of
Congress Class Number
Evidence is provided by terms from record titles
(and subject headings for the All Languages set)
Five different training sets (Russian, German,
French, Spanish, and All Languages)
Testing cross-language retrieval and
classification
4W Project Search

29
References

Christian Plaunt & Barbara Norgard, An
Association Based Method for Automatic Indexing
with a Controlled Vocabulary. JASIS, 49(10),
1998.
Preprint available on class web site
Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul
O'Leary & Ralph Moon, Cheshire II: Designing a
Next-Generation Online Catalog. JASIS, 47(7),
555-567, 1996.

30
Developing a Metadata Infrastructure for
Information Access: What, Where, When and Who?

Prof. Ray R. Larson
University of California, Berkeley
School of Information

31
Overview

Metadata as Infrastructure
What, Where, When and Who?
What are Entry Vocabulary Indexes?
Notion of an EVI
How are EVIs Built
Time Period Directories
Mining Metadata for new metadata

32
Metadata as Infrastructure

The difference between memorization and
understanding lies in knowing the context and
relationships of whatever is of interest. When
setting out to learn about a new topic, a
well-tested practice is to follow the traditional
"5Ws and the H": Who?, What?, When?, Where?,
Why?, and How?

33
Metadata as Infrastructure

The reference collections of paper-based
libraries provide a structured environment for
resources, with encyclopedias and subject
catalogs, gazetteers, chronologies, and
biographical dictionaries, offering direct
support for at least What, Where, When, and Who.
The digital environment does not yet provide an
effective, and easily exploited, infrastructure
comparable to the traditional reference library.

34
What?

Searching texts by topic, e.g. Dewey, LCSH, any
subject index, or category scheme applied to
documents.
Two kinds of mapping in every search:
Documents are assigned to topic categories, e.g.
Dewey
Queries have to map to topic categories, e.g.
Dewey's Relativ Index from ordinary words/phrases
to Decimal Classification numbers.
Also mapping between topic systems, e.g. US
Patent classification and International Patent
Classification.

35
"What" searches involve mapping to controlled
vocabularies
Thesaurus/ Ontology
Texts
36
Start with a collection of documents.
37
Classify and index with controlled vocabulary Or
use a pre-indexed collection.
38
Problem: Controlled vocabularies can be difficult
for people to use.
pass mtr veh spark ign eng
39
Solution: Entry Level Vocabulary Indexes.
Index
EVI
40
What and Entry Vocabulary Indexes

EVIs are a means of mapping from users'
vocabulary to the controlled vocabulary of a
collection of documents

41
Building and Searching EVIs
42
Technical Details
43
Association Measure
44
Association Measure

Maximum Likelihood ratio
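
The formula for this slide is not in the transcript. One common form of a likelihood ratio association measure is the log-likelihood ratio statistic (G²) over a 2x2 contingency table of term and class co-occurrence. The sketch below assumes that form; the function name and argument layout are illustrative.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """G^2 statistic for a 2x2 contingency table.

    a: records containing the term and assigned to the class
    b: records containing the term, not assigned to the class
    c: records without the term, assigned to the class
    d: records with neither
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Term and class are independent: no association.
independent = log_likelihood_ratio(10, 10, 10, 10)
# Term and class always co-occur: strong association.
associated = log_likelihood_ratio(20, 0, 0, 20)
```

Larger values indicate a stronger association between a word and a controlled vocabulary term, so candidate terms can be ranked by this score.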

45
Alternatively

Because the evidence terms in EVIs can be
considered a document, you can also use IR
techniques and use the top-ranked classes for
classification or query expansion

46
(No Transcript)
47
EVI example
Index term: pass mtr veh spark ign eng
EVI 1
User Query: Automobile
Index term: automobiles OR internal
combustible engines
EVI 2
48
But why stop there?
Index
EVI
49
Which EVI do I use?
Index
EVI
Index
EVI
Index
EVI
Index
50
EVI to EVIs
Index
EVI
EVI2
Index
EVI
Index
EVI
Index
51
Why not treat language the same way?
52
It is also difficult to move between different
media forms
Thesaurus/ Ontology
Texts
EVI
Numeric datasets
53
Searching across data types

Different media can be linked indirectly via
metadata, but often (e.g. for socio-economic
numeric data series) you also need to specify
WHERE to get correct results

54
But texts associated with numeric data can be
mapped as well
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Numeric datasets
55
EVI to Numeric Data example
56
But there are also geographic dependencies
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Maps/ Geo Data
Numeric datasets
57
WHERE: Place names are problematic

Variant forms: St. Petersburg, Санкт Петербург,
Saint-Pétersbourg, . . .
Multiple names: Cluj, in Romania / Roumania /
Rumania, is also called Klausenburg and
Kolozsvar.
Name changes: Bombay → Mumbai.
Homographs: Vienna, VA, and Vienna, Austria;
50 Springfields.
Anachronisms: No Germany before 1870
Vague, e.g. Midwest, Silicon Valley
Unstable boundaries: 19th century Poland;
Balkans; USSR
Use a gazetteer!
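
A toy illustration of how a gazetteer helps with the problems listed above: variant and former names resolve to one entry, and a homograph returns several candidates. The entry structure and identifiers are hypothetical, and the coordinates are approximate values included only for illustration.

```python
from collections import defaultdict

# Hypothetical gazetteer entries; coordinates are approximate.
GAZETTEER = [
    {"id": "g1", "preferred": "Cluj-Napoca",
     "variants": ["Cluj", "Klausenburg", "Kolozsvar"],
     "lat": 46.8, "lon": 23.6},
    {"id": "g2", "preferred": "Mumbai", "variants": ["Bombay"],
     "lat": 19.1, "lon": 72.9},
    {"id": "g3", "preferred": "Vienna", "variants": ["Wien"],
     "lat": 48.2, "lon": 16.4},
    {"id": "g4", "preferred": "Vienna", "variants": ["Vienna, VA"],
     "lat": 38.9, "lon": -77.3},
]


def build_name_index(entries):
    """Map every known name (case-insensitive) to its gazetteer entries."""
    index = defaultdict(list)
    for entry in entries:
        for name in [entry["preferred"], *entry["variants"]]:
            index[name.casefold()].append(entry)
    return index


def lookup(index, name):
    """Return all entries matching a place name; may be zero, one or many."""
    return index.get(name.casefold(), [])


index = build_name_index(GAZETTEER)
cluj = lookup(index, "Klausenburg")  # variant name
mumbai = lookup(index, "bombay")     # former name, different case
viennas = lookup(index, "Vienna")    # homograph: two candidates
```

A homograph such as Vienna still needs disambiguation, for example by showing the candidates on a map.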

58
WHERE: Geo-temporal search interface. Place
names found in documents. Gazetteer provided lat./long.
Places displayed on map.
Timebar?
59
Zoom on map. Click on place for a list of
records. Click on record to display text.
60
Catalogs and gazetteers should talk to each other!
Catalog search
Gazetteer search
Geographic sort / display of catalog search
result.
61
So geographic search becomes part of the
infrastructure
Thesaurus/ Ontology
Texts
EVI
Gazetteers
captions
Maps/ Geo Data
Numeric datasets
62
WHEN: Search by time is also weakly supported

Calendars are the standard for time
But people use the names of events to refer to
time periods
Named time periods resemble place names in being:
Unstable: European War, Great War, First World
War
Multiple: Second World War, Great Patriotic War
Ambiguous: Civil war in different centuries in
England, USA, Spain, etc.
Places have temporal aspects and periods have
geographical aspects: when the Stone Age was
varies by region

63
Similarity between place names and period names

Suggests a similar solution: a gazetteer-like
Time Period Directory.
Gazetteer:
Place name, Type, Spatial markers (Lat and long),
When
Time Period Directory:
Period name, Type, Time markers (Calendar),
Where
Note the symmetry in the connections between
Where and When.

64
Solution - Time Period Directories

Initial development involved mining the Library
of Congress Subject Authority file for named time
periods

65
LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat. 45053442 Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
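
A hedged sketch of how a named time period might be mined from a record like the one above, assuming the record is well-formed XML with elements such as `<Fld151>` and a `<y>` chronological subdivision. The function name, the output keys and the date pattern are illustrative assumptions, not the project's actual mining code.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the authority record, for illustration.
RECORD = """<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
</USMARC>"""

YEAR_RANGE = re.compile(r"(\d{3,4})-(\d{2,4})")


def extract_period(xml_text):
    """Pull place, period name and year range from a subject heading."""
    root = ET.fromstring(xml_text)
    heading = root.find("Fld151")
    if heading is None:
        return None
    period_name = heading.findtext("y")
    if not period_name:
        return None  # heading has no chronological subdivision
    begin = end = None
    match = YEAR_RANGE.search(period_name)
    if match:
        begin_text, end_text = match.groups()
        # Expand an abbreviated end year such as "1550-51" to 1551.
        if len(end_text) < len(begin_text):
            prefix_length = len(begin_text) - len(end_text)
            end_text = begin_text[:prefix_length] + end_text
        begin, end = int(begin_text), int(end_text)
    return {
        "record_id": root.findtext("Fld001"),
        "place": heading.findtext("a"),
        "period_name": period_name,
        "begin": begin,
        "end": end,
    }


period = extract_period(RECORD)
```

The place subfield gives the "where" for the period, which is what lets a time period entry link to a gazetteer.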
66
timePeriodEntry: Time Period Directory instance. Contains the components described below.
- periodID: Unique identifier.
- periodName: Period name; can be repeated for alternative names. Information about language, script, transliteration scheme. Source information and notes (where the period name was mentioned).
- descriptiveNotes: Description of the time period.
- dates: Calendar and date format. Begin and end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing). Notes, sources.
- periodClassification: Period type, e.g. Period of Conflict, Art movement. Can plug in different classification schemes. Can be repeated for several classifications.
- location: Places associated with the time period. Contains both the place name and an entry to a gazetteer providing more specific place information, such as latitude / longitude coordinates. Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names). Recently added coordinates for direct use.
- relatedPeriod: Related time periods. periodID of related periods. Information about relationship type (part-of, successor, etc.). Can plug in different relationship type schemes.
- entryMetadata: Notes about creator / creation of instance. Entry date. Modification date.
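
The components above could be modeled roughly as the following data structure. This is a simplified sketch, not the project's schema: the Python names, the use of plain integer years for dates, and the `overlaps` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeriodName:
    name: str
    language: Optional[str] = None
    source: Optional[str] = None  # where the period name was mentioned


@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None  # entry in an external gazetteer
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str  # e.g. "part-of", "successor"


@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[PeriodName]  # repeatable for alternative names
    begin_year: Optional[int] = None
    end_year: Optional[int] = None
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)

    def overlaps(self, start_year: int, end_year: int) -> bool:
        """True if the period intersects the given year range (inclusive)."""
        if self.begin_year is None or self.end_year is None:
            return False
        return self.begin_year <= end_year and start_year <= self.end_year


# Hypothetical entry built from the authority record shown earlier.
entry = TimePeriodEntry(
    period_id="tp-0001",  # made-up identifier
    period_names=[PeriodName("Siege, 1550-1551",
                             source="LC Subject Authority file")],
    begin_year=1550,
    end_year=1551,
    classifications=["Period of Conflict"],
    locations=[Location("Magdeburg (Germany)")],
)
```

An `overlaps` query of this kind is what a timeline interface would need in order to list the periods that fall within a selected date range.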
67
(No Transcript)
68
Time periods by named location
69
Catalog Search Result
70
Web Interface - Access by map
71
Zoomable interface gives access to geographically
focused info
72
Web Interface - Access by timeline
Link initiates search of the Library of Congress
catalog for all records relating to this time
period.
73
WHEN and WHAT

These named time periods are derived from Library
of Congress catalog subject headings and so can
be used for catalog searching which finds books
on topics important for that time period

74
Time period directories link via the place (or
time)
75
WHEN, WHERE and WHO

Catalog records found from a time period search
commonly include names of persons important at
that time. Their names can be forwarded to, e.g.,
biographies in the Wikipedia encyclopedia.

76
Place and time are broadly important across
numerous tools and genres including, e.g.,
language atlases, library catalogs, biographical
dictionaries, bibliographies, archival finding
aids, museum records, etc.
Biographical dictionaries are heavy on place and time:
Emanuel Goldberg, Born Moscow 1881. PhD under
Wilhelm Ostwald, Univ. of Leipzig, 1906.
Director, Zeiss Ikon, Dresden, 1926-33. Moved to
Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT),
WHERE, WHEN, and WHO else.
77
A new form of biographical dictionary would link
to all
78
A Metadata Infrastructure
79
Acknowledgements

Electronic Cultural Atlas Initiative project
This work was partially supported by the
Institute of Museum and Library Services through
a National Leadership Grant for Libraries, award
number LG-02-04-0041-04, Oct 2004 - Sept 2006,
entitled "Supporting the Learner: What, Where,
When and Who". See http://ecai.org/imls2004
Michael Buckland, Fred Gey, Vivien Petras, Matt
Meiske, Kim Carl, Anya Kartavenko, Minakshi
Mukherjee
Contact: ray_at_sims.berkeley.edu