Title: Lecture 8: Clustering
1Lecture 8 Clustering
Automatic Classification
- University of California, Berkeley
- School of Information
- IS 245: Organization of Information in Collections
Some slides in this lecture were originally
created by Prof. Marti Hearst
2Overview
- Introduction to Automatic Classification and
Clustering
- Classification of Classification Methods
- Classification Clusters and Information Retrieval
in Cheshire II
- The 4W project revisited
3Classification
- The grouping together of items (including
documents or their representations) which are
then treated as a unit. The groupings may be
predefined or generated algorithmically. The
process itself may be manual or automated.
- In document classification the items are grouped
together because they are likely to be wanted
together
- For example, items about the same topic.
4Automatic Indexing and Classification
- Automatic indexing is typically the simple
deriving of keywords from a document and
providing access to all of those words.
- More complex automatic indexing systems attempt
to select controlled vocabulary terms based on
terms in the document.
- Automatic classification attempts to
automatically group similar documents using
either
- A fully automatic clustering method.
- An established classification scheme and set of
documents already indexed by that scheme.
5Background and Origins
- Early suggestion by Fairthorne
- The Mathematics of Classification
- Early experiments by Maron (1961) and Borko and
Bernick (1963)
- Work in Numerical Taxonomy and its application to
information retrieval: Jardine, Sibson, van
Rijsbergen, Salton (1970s).
- Early IR clustering work was more concerned with
efficiency issues than semantic issues.
6Cluster Hypothesis
- The basic notion behind the use of classification
and clustering methods
- "Closely associated documents tend to be relevant
to the same requests."
- C.J. van Rijsbergen
7Classification of Classification Methods
- Class Structure
- Intellectually Formulated
- Manual assignment (e.g. Library classification)
- Automatic assignment (e.g. Cheshire
Classification Mapping)
- Automatically derived from collection of items
- Hierarchic Clustering Methods (e.g. Single Link)
- Agglomerative Clustering Methods (e.g. Dattola)
- Hybrid Methods (e.g. Query Clustering)
8Classification of Classification Methods
- Relationship between properties and classes
- monothetic
- polythetic
- Relation between objects and classes
- exclusive
- overlapping
- Relation between classes and classes
- ordered
- unordered
Adapted from Sparck Jones
9Properties and Classes
- Monothetic
- Class defined by a set of properties that are
both necessary and sufficient for membership in
the class
- Polythetic
- Class defined by a set of properties such that
each member of the class must have some number
(usually large) of those properties, a large
number of individuals in the class possess some
of those properties, and no individual possesses
all of the properties.
10Monothetic vs. Polythetic
[Figure: examples of polythetic and monothetic classes. Adapted from van Rijsbergen, 1979]
11Exclusive Vs. Overlapping
- Exclusive: an item belongs to exactly one class
- Overlapping: items can belong to many classes,
sometimes with a membership weight
12Ordered Vs. Unordered
- Ordered classes have some sort of structure
imposed on them
- Hierarchies are typical of ordered classes
- Unordered classes have no imposed precedence or
structure and each class is considered on the
same level
- Typical in agglomerative methods
13Clustering Methods
- Hierarchical
- Agglomerative
- Hybrid
- Automatic Class Assignment
14Coefficients of Association
- Simple
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
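As a rough illustration, the coefficients listed above can be computed for two documents represented as sets of index terms. This is a minimal sketch that assumes binary (set-based) representations and non-empty sets. The function names and the two sample documents are invented for the example and do not come from the lecture.

```python
import math

def simple_matching(x: set, y: set) -> int:
    """Simple coefficient: the number of shared terms."""
    return len(x & y)

def dice(x: set, y: set) -> float:
    """Dice's coefficient: 2|X n Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x: set, y: set) -> float:
    """Jaccard's coefficient: |X n Y| / |X u Y|."""
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    """Cosine coefficient: |X n Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x: set, y: set) -> float:
    """Overlap coefficient: |X n Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))

# Hypothetical documents, represented by their index terms.
doc_a = {"cluster", "retrieval", "index", "term"}
doc_b = {"cluster", "retrieval", "query"}

for name, fn in [("simple", simple_matching), ("dice", dice),
                 ("jaccard", jaccard), ("cosine", cosine),
                 ("overlap", overlap)]:
    print(f"{name:8s} {fn(doc_a, doc_b):.3f}")
```

All five share the same numerator (the number of terms in common) and differ only in how they normalize for document size.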
15Hierarchical Methods
[Figure: single link clustering from a dissimilarity matrix]
Hierarchical methods: polythetic, usually
exclusive, ordered. Clusters are order-independent.
16Threshold .1
Single Link Dissimilarity Matrix
17Threshold .2
18Threshold .3
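The threshold slides above can be read as follows: at a given threshold, single link puts two items in the same cluster whenever a chain of pairwise dissimilarities at or below the threshold connects them. The sketch below illustrates this with a small union-find. The dissimilarity matrix `D` is hypothetical (the matrix from the slides did not survive transcription), and the function name is an assumption of this example.

```python
def single_link_clusters(dissim, threshold):
    """Return clusters (lists of item indexes) at the given threshold.

    Two items end up together if they are connected by a chain of
    pairs whose dissimilarity is <= threshold.
    """
    n = len(dissim)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical symmetric dissimilarity matrix for five items.
D = [
    [0.0, 0.1, 0.4, 0.5, 0.6],
    [0.1, 0.0, 0.3, 0.5, 0.6],
    [0.4, 0.3, 0.0, 0.2, 0.6],
    [0.5, 0.5, 0.2, 0.0, 0.6],
    [0.6, 0.6, 0.6, 0.6, 0.0],
]

for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(D, t))
```

Raising the threshold only ever merges clusters, which is what produces the hierarchy. The result does not depend on the order in which items are processed.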
19Clustering
Agglomerative methods: polythetic, exclusive or
overlapping, unordered. Clusters are
order-dependent.
[Figure: documents (Doc) in the clustering space]
Rocchio's method (similar to current K-means
methods):
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and
compute centroids
3. Reassign all documents to centroid(s)
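The three steps can be sketched as below. This is an illustration under stated assumptions, not Rocchio's original procedure: documents are sparse term-weight dictionaries, matching uses cosine similarity, the seeds are given as document indexes, and each document is assigned to exactly one centroid (the slide also allows overlapping assignment). All names and the sample data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average the term weights of a group of vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def assign(docs, centers):
    """For each document, the index of its best-matching center."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs]

def one_pass_cluster(docs, seed_ids):
    centers = [docs[i] for i in seed_ids]      # 1. seed the space
    labels = assign(docs, centers)             # 2. assign to best center...
    centers = [centroid([d for d, l in zip(docs, labels) if l == c])
               for c in range(len(centers))]   #    ...and compute centroids
    return assign(docs, centers)               # 3. reassign to centroids

# Hypothetical documents with term frequencies.
docs = [
    {"cluster": 2, "link": 1},
    {"cluster": 1, "link": 2},
    {"query": 2, "rank": 1},
    {"query": 1, "rank": 2},
]
print(one_pass_cluster(docs, seed_ids=[0, 2]))
```

Because the outcome depends on which documents are chosen as seeds, this family of methods is order-dependent, as the slide notes.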
20Automatic Class Assignment
Automatic Class Assignment: polythetic, exclusive
or overlapping, usually ordered. Clusters are
order-independent, usually based on an
intellectually derived scheme.
21Automatic Categorization in Cheshire II
- The Cheshire II system is intended to provide a
bridge between the purely bibliographic realm of
previous generations of online catalogs and the
rapidly expanding realm of full-text and
multimedia information resources. It is currently
used in the UC Berkeley Digital Library Project
and for a number of other sites and projects.
22Overview of Cheshire II
- It supports SGML as the primary database type.
- It is a client/server application.
- Uses the Z39.50 Information Retrieval Protocol.
- Supports Boolean searching of all servers.
- Supports probabilistic ranked retrieval in the
Cheshire search engine.
- Supports "nearest neighbor" searches, relevance
feedback and Two-Stage Search.
- GUI interface on X window displays (Tcl/Tk).
- HTTP/CGI interface for the Web (Tcl scripting).
23Cheshire II - Cluster Generation
- Define basis for clustering records.
- Select field to form the basis of the cluster.
- Evidence fields to use as contents of the
pseudo-documents.
- During indexing, cluster keys are generated with
basis and evidence from each record.
- Cluster keys are sorted and merged on basis, and
pseudo-documents are created for each unique basis
element containing all evidence fields.
- Pseudo-documents (class clusters) are indexed on
combined evidence fields.
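The cluster generation steps above can be sketched roughly as follows. This is not Cheshire II code. The record layout, the field names (`lc_class`, `title`, `subject`) and the whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Generate cluster keys, then sort and merge them on the basis."""
    keys = []
    for rec in records:
        basis = rec.get(basis_field)
        if basis is None:
            continue  # a record without a basis value forms no key
        text = " ".join(rec.get(f, "") for f in evidence_fields)
        keys.append((basis, text.lower().split()))

    keys.sort(key=lambda k: k[0])            # sort on basis
    pseudo = defaultdict(list)
    for basis, evidence in keys:             # merge on basis
        pseudo[basis].extend(evidence)
    return dict(pseudo)

def index_pseudo_documents(pseudo):
    """Inverted index: evidence term -> set of class clusters."""
    index = defaultdict(set)
    for basis, terms in pseudo.items():
        for term in terms:
            index[term].add(basis)
    return index

# Hypothetical bibliographic records.
records = [
    {"lc_class": "QA76", "title": "Database Systems", "subject": "Databases"},
    {"lc_class": "Z699", "title": "Information Retrieval",
     "subject": "Information storage"},
    {"lc_class": "QA76", "title": "Operating Systems", "subject": "Computers"},
]

pseudo = build_pseudo_documents(records, "lc_class", ["title", "subject"])
index = index_pseudo_documents(pseudo)
print(pseudo["QA76"])
print(sorted(index["systems"]))
```

Each pseudo-document gathers the evidence of every record sharing a basis value, so a class can be found through any term used in any of its members.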
24Cheshire II - Two-Stage Retrieval
- Using the LC Classification System
- Pseudo-document created for each LC class
containing terms derived from content-rich
portions of documents in that class (subject
headings, titles, etc.)
- Permits searching by any term in the class
- Ranked probabilistic retrieval techniques attempt
to present the best matches to a query first.
- User selects classes to feed back for the second
stage search of documents.
- Can be used with any classified/indexed
collection.
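A toy version of the two stages might look like the sketch below. Simple term overlap stands in for the probabilistic ranking the slides describe. The data, field names and function names are hypothetical, and the user's class selection is passed in as a plain set.

```python
def rank_classes(query, pseudo_docs):
    """Stage 1: rank classes by query terms found in their pseudo-document."""
    q = set(query.lower().split())
    scored = [(len(q & set(terms)), cls) for cls, terms in pseudo_docs.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [cls for score, cls in ranked if score > 0]

def search_in_classes(query, documents, selected_classes):
    """Stage 2: rank only the documents filed under the selected classes."""
    q = set(query.lower().split())
    hits = []
    for doc in documents:
        if doc["lc_class"] in selected_classes:
            score = len(q & set(doc["title"].lower().split()))
            hits.append((score, doc["title"]))
    return [title for score, title in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Hypothetical class pseudo-documents and document collection.
pseudo_docs = {
    "TL": ["automobiles", "motor", "vehicles", "engines"],
    "QA76": ["computers", "databases", "systems"],
}
documents = [
    {"lc_class": "TL", "title": "Motor vehicles and engines"},
    {"lc_class": "TL", "title": "History of automobiles"},
    {"lc_class": "QA76", "title": "Database systems"},
]

classes = rank_classes("automobiles engines", pseudo_docs)
print(classes)
print(search_in_classes("automobiles engines", documents, set(classes[:1])))
```

The first stage lets a query match a class through vocabulary contributed by any document in it. The second stage then narrows the search to documents in the classes the user fed back.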
25Probabilistic Retrieval Logistic Regression
- Estimates for relevance based on a log-linear
model with various statistical measures of
document content as independent variables.
Log odds of relevance is a linear function of the
attributes
Term contributions are summed
Probability of relevance is obtained from the log
odds by the inverse logit (logistic) transform
26Probabilistic Retrieval Logistic Regression
In Cheshire II, probability of relevance is based
on logistic regression from a sample set of TREC
documents to determine the values of the
coefficients. At retrieval the probability
estimate is obtained from a formula (not preserved
in this transcript) over 6 attributes, or "clues",
about term usage in the documents and the query.
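The formulas on these slides did not survive transcription. A standard logistic regression form consistent with the description (log odds linear in the six attributes, probability recovered from the log odds) would be the following. The symbols $c_i$ for the fitted coefficients and $X_i$ for the attributes are notation assumed here, not taken from the slides.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i
\qquad
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```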
27Probabilistic Retrieval Logistic Regression
Attributes:
- Average Absolute Query Frequency
- Query Length
- Average Absolute Document Frequency
- Document Length
- Average Inverse Document Frequency (based on each
term's Inverse Document Frequency)
- Number of Terms in common between query and
document (M), logged
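A small sketch of how six attribute values and a set of fitted coefficients combine into a probability estimate. The coefficient values and the attribute values below are placeholders invented for illustration. They are not the values fitted for Cheshire II, and the function name is an assumption.

```python
import math

def relevance_probability(attributes, coefficients, intercept):
    """Logistic estimate: sum the term contributions, then map log odds to a probability."""
    if len(attributes) != len(coefficients):
        raise ValueError("one coefficient is needed per attribute")
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder values only (NOT the fitted Cheshire II coefficients).
coefficients = [0.5, -0.2, 0.3, -0.1, 0.4, 0.6]
intercept = -3.0
attributes = [1.2, 2.0, 0.8, 10.0, 2.5, 1.1]   # X1..X6 for one query/document pair

print(round(relevance_probability(attributes, coefficients, intercept), 4))
```

Documents are then ranked by this estimate, so only the relative order of the probabilities matters for the result list.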
28Cheshire II Demo
- Examples from the
- SciMentor(BioSearch) project
- Journal of Biological Chemistry and MEDLINE data
- CHESTER (EconLit)
- Journal of Economic Literature subjects
- Unfamiliar Metadata and TIDES Projects
- Basis for clusters is a normalized Library of
Congress Class Number
- Evidence is provided by terms from record titles
(and subject headings for the "all languages" set)
- Five different training sets (Russian, German,
French, Spanish, and All Languages)
- Testing cross-language retrieval and
classification
- 4W Project Search
29References
- Christian Plaunt and Barbara Norgard, An
Association Based Method for Automatic Indexing
with a Controlled Vocabulary. JASIS, 49(10),
1998.
- Preprint available on class web site
- Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul
O'Leary, and Ralph Moon, Cheshire II: Designing a
Next-Generation Online Catalog. JASIS, 47(7),
555-567, 1996.
30Developing a Metadata Infrastructure for
Information Access: What, Where, When and Who?
- Prof. Ray R. Larson
- University of California, Berkeley
- School of Information
31Overview
- Metadata as Infrastructure
- What, Where, When and Who?
- What are Entry Vocabulary Indexes?
- Notion of an EVI
- How are EVIs Built
- Time Period Directories
- Mining Metadata for new metadata
32Metadata as Infrastructure
- The difference between memorization and
understanding lies in knowing the context and
relationships of whatever is of interest. When
setting out to learn about a new topic, a
well-tested practice is to follow the traditional
"5Ws and the H": Who?, What?, When?, Where?,
Why?, and How?
33Metadata as Infrastructure
- The reference collections of paper-based
libraries provide a structured environment for
resources, with encyclopedias and subject
catalogs, gazetteers, chronologies, and
biographical dictionaries, offering direct
support for at least What, Where, When, and Who.
- The digital environment does not yet provide an
effective, and easily exploited, infrastructure
comparable to the traditional reference library.
34What?
- Searching texts by topic, e.g. Dewey, LCSH, any
subject index, or category scheme applied to
documents.
- Two kinds of mapping in every search
- Documents are assigned to topic categories, e.g.
Dewey
- Queries have to map to topic categories, e.g.
Dewey's Relativ Index from ordinary words/phrases
to Decimal Classification numbers.
- Also mapping between topic systems, e.g. US
Patent classification and International Patent
Classification.
35What searches involve mapping to controlled
vocabularies
Thesaurus/ Ontology
Texts
36Start with a collection of documents.
37Classify and index with controlled vocabulary Or
use a pre-indexed collection.
38Problem: Controlled Vocabularies can be difficult
for people to use.
pass mtr veh spark ign eng
39Solution: Entry Level Vocabulary Indexes.
Index
EVI
40What and Entry Vocabulary Indexes
- EVIs are a means of mapping from users'
vocabulary to the controlled vocabulary of a
collection of documents
41Building and Searching EVIs
42Technical Details
43Association Measure
44Association Measure
45Alternatively
- Because the evidence terms in EVIs can be
considered a document, you can also use IR
techniques and use the top-ranked classes for
classification or query expansion
47EVI example
User query: Automobile
EVI 1 → index term: pass mtr veh spark ign eng
EVI 2 → index term: automobiles OR internal
combustion engines
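The mapping in this example can be sketched with a toy EVI built from already-indexed records. Raw co-occurrence counts stand in here for the association measure, whose formula is not preserved in this transcript. The training records and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def build_evi(training_records):
    """Count how often each ordinary word co-occurs with each index term."""
    evi = defaultdict(Counter)
    for text, index_terms in training_records:
        for word in set(text.lower().split()):
            for term in index_terms:
                evi[word][term] += 1
    return evi

def suggest(evi, query, top_n=3):
    """Map a user's query words to the most strongly associated index terms."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(evi.get(word, {}))
    return [term for term, _ in scores.most_common(top_n)]

# Hypothetical training data: (title text, assigned controlled-vocabulary terms).
training = [
    ("automobile engine repair", ["pass mtr veh spark ign eng"]),
    ("automobile design", ["pass mtr veh spark ign eng"]),
    ("bicycle repair", ["bicycles"]),
]

evi = build_evi(training)
print(suggest(evi, "Automobile"))
```

The user never has to know the abbreviated index term: the EVI learns it from records that were already indexed with it.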
48But why stop there?
Index
EVI
49Which EVI do I use?
Index
EVI
Index
EVI
Index
EVI
Index
50EVI to EVIs
Index
EVI
EVI2
Index
EVI
Index
EVI
Index
51Why not treat language the same way?
52It is also difficult to move between different
media forms
Thesaurus/ Ontology
Texts
EVI
Numeric datasets
53Searching across data types
- Different media can be linked indirectly via
metadata, but often (e.g. for socio-economic
numeric data series) you also need to specify
WHERE to get correct results
54But texts associated with numeric data can be
mapped as well
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Numeric datasets
55EVI to Numeric Data example
56But there are also geographic dependencies
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Maps/ Geo Data
Numeric datasets
57WHERE Place names are problematic
- Variant forms: St. Petersburg, Санкт Петербург,
Saint-Pétersbourg, . . .
- Multiple names: Cluj, in Romania / Roumania /
Rumania, is also called Klausenburg and
Kolozsvar.
- Name changes: Bombay → Mumbai.
- Homographs: Vienna, VA, and Vienna, Austria
- 50 Springfields.
- Anachronisms: No Germany before 1870
- Vague, e.g. Midwest, Silicon Valley
- Unstable boundaries: 19th century Poland,
Balkans, USSR
- Use a gazetteer!
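A toy gazetteer lookup illustrates how variant forms, name changes and homographs can be handled. The entry structure and function names are assumptions of this sketch, the coordinates are approximate, and a real gazetteer would also carry place types and dates of validity.

```python
from collections import defaultdict

# Hypothetical gazetteer entries; latitude/longitude are approximate.
PLACES = [
    {"preferred": "Saint Petersburg",
     "variants": ["St. Petersburg", "Saint-Pétersbourg"],
     "lat": 59.94, "lon": 30.31},
    {"preferred": "Cluj",
     "variants": ["Klausenburg", "Kolozsvar"],
     "lat": 46.77, "lon": 23.59},
    {"preferred": "Mumbai", "variants": ["Bombay"],
     "lat": 19.08, "lon": 72.88},
    {"preferred": "Vienna, Austria", "variants": ["Vienna"],
     "lat": 48.21, "lon": 16.37},
    {"preferred": "Vienna, VA", "variants": ["Vienna"],
     "lat": 38.90, "lon": -77.27},
]

def build_name_index(places):
    """Map every known name form (case-insensitive) to its place entries."""
    index = defaultdict(list)
    for place in places:
        for name in [place["preferred"], *place["variants"]]:
            index[name.casefold()].append(place)
    return index

def lookup(index, name):
    """Return (preferred name, lat, lon) for every place matching the name."""
    return [(p["preferred"], p["lat"], p["lon"])
            for p in index.get(name.casefold(), [])]

NAME_INDEX = build_name_index(PLACES)
print(lookup(NAME_INDEX, "Bombay"))        # a former name resolves to Mumbai
print(lookup(NAME_INDEX, "Klausenburg"))   # an alternative name resolves to Cluj
print(lookup(NAME_INDEX, "Vienna"))        # a homograph returns both candidates
```

A homograph returns more than one candidate, which is why a search interface still needs a map or some context to let the user disambiguate.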
58WHERE. Geo-temporal search interface. Place
names found in documents. Gazetteer provided lat.
long. Places displayed on map.
Timebar?
59Zoom on map. Click on place for a list of
records. Click on record to display text.
60Catalogs and gazetteers should talk to each other!
Catalog search
Gazetteer search
Geographic sort / display of catalog search
result.
61So geographic search becomes part of the
infrastructure
Thesaurus/ Ontology
Texts
EVI
Gazetteers
captions
Maps/ Geo Data
Numeric datasets
62WHEN Search by time is also weakly supported
- Calendars are the standard for time
- But people use the names of events to refer to
time periods
- Named time periods resemble place names in being
- Unstable: European War, Great War, First World
War
- Multiple: Second World War, Great Patriotic War
- Ambiguous: Civil war in different centuries in
England, USA, Spain, etc.
- Places have temporal aspects; periods have
geographical aspects: when the Stone Age was
varies by region
63Similarity between place names and period names
- Suggests a similar solution: a gazetteer-like
Time Period Directory.
- Gazetteer
- Place name, Type, Spatial markers (lat/long),
linked to When
- Time Period Directory
- Period name, Type, Time markers (calendar),
linked to Where
- Note the symmetry in the connections between
Where and When.
64Solution - Time Period Directories
- Initial development involved mining the Library
of Congress Subject Authority file for named time
periods
65LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat. 45053442 Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
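As a rough sketch of what mining such a record for a named time period could involve, the code below pulls the place, the period label and the begin/end years out of the 151 heading. It assumes the element names shown in the sample record (`Fld001`, `Fld151`, subfields `a` and `y`) and a "label, year-year" pattern in subfield `y`. The function name and the output keys are hypothetical, and the project's actual extraction code is not shown in the slides.

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the sample authority record above.
RECORD = """<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
</USMARC>"""

def extract_period(record_xml):
    """Return place, label and year range from the chronological subdivision."""
    root = ET.fromstring(record_xml)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")
    if not chrono:
        return None  # heading has no chronological subdivision
    match = re.search(r"(\d{3,4})(?:-(\d{2,4}))?", chrono)
    if match is None:
        return None
    begin_text, end_text = match.group(1), match.group(2)
    if end_text is None:
        end_text = begin_text                      # single year
    elif len(end_text) < len(begin_text):
        # Expand abbreviated end years such as "1550-51".
        end_text = begin_text[: len(begin_text) - len(end_text)] + end_text
    return {
        "record_id": (root.findtext("Fld001") or "").strip(),
        "place": heading.findtext("a"),
        "label": chrono.split(",")[0].strip(),
        "begin": int(begin_text),
        "end": int(end_text),
    }

print(extract_period(RECORD))
```

The place name from subfield `a` is what would later be matched against a gazetteer to give the period its geographic anchor.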
66timePeriodEntry: Time Period Directory Instance
Contains the components described below.
- periodID: Unique identifier
- periodName: Period name; can be repeated for alternative names. Information about language, script, transliteration scheme. Source information and notes (where the period name was mentioned).
- descriptiveNotes: Description of time period
- dates: Calendar and date format. Begin and end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing). Notes, sources.
- periodClassification: Period type, e.g. Period of Conflict, Art movement. Can plug in different classification schemes. Can be repeated for several classifications.
- location: Places associated with the time period. Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates. Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names). Recently added coordinates for direct use.
- relatedPeriod: Related time periods. periodID of related periods. Information about relationship type (part-of, successor etc.). Can plug in different relationship type schemes.
- entryMetadata: Notes about creator / creation of instance. Entry date. Modification date.
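One way to picture the entry structure above is as a set of nested records. The sketch below uses Python dataclasses. The field names follow the components listed on the slide, but the types, the defaults, the identifier `tp-0001` and the sample values are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PeriodDates:
    begin: str
    end: str
    calendar: Optional[str] = None      # calendar / date format, if known
    notes: str = ""

@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None  # entry in an external gazetteer
    latitude: Optional[float] = None
    longitude: Optional[float] = None

@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str                   # e.g. "part-of", "successor"

@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[str]             # repeatable for alternative names
    dates: PeriodDates
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)
    entry_metadata: dict = field(default_factory=dict)

# Hypothetical instance based on the sample authority record.
entry = TimePeriodEntry(
    period_id="tp-0001",
    period_names=["Siege of Magdeburg"],
    dates=PeriodDates(begin="1550", end="1551"),
    classifications=["Period of Conflict"],
    locations=[Location(place_name="Magdeburg (Germany)")],
)
print(entry.period_names[0], entry.dates.begin, entry.dates.end)
```

Keeping the gazetteer reference and the coordinates optional mirrors the slide's point that different location indicators can be plugged in.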
68Time periods by named location
69Catalog Search Result
70Web Interface - Access by map
71Zoomable interface gives access to geographically
focused info
72Web Interface - Access by timeline
Link initiates search of the Library of Congress
catalog for all records relating to this time
period.
73WHEN and WHAT
- These named time periods are derived from Library
of Congress catalog subject headings and so can
be used for catalog searching which finds books
on topics important for that time period
74Time period directories link via the place (or
time)
75WHEN, WHERE and WHO
- Catalog records found from a time period search
commonly include names of persons important at
that time. Their names can be forwarded to, e.g.,
biographies in the Wikipedia encyclopedia.
76Place and time are broadly important across
numerous tools and genres including, e.g.,
language atlases, library catalogs, biographical
dictionaries, bibliographies, archival finding
aids, museum records, etc.
Biographical dictionaries are heavy on place and
time:
Emanuel Goldberg, born Moscow 1881. PhD under
Wilhelm Ostwald, Univ. of Leipzig, 1906.
Director, Zeiss Ikon, Dresden, 1926-33. Moved to
Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity
(WHAT), WHERE, WHEN, and WHO else.
77A new form of biographical dictionary would link
to all
78A Metadata Infrastructure
79Acknowledgements
- Electronic Cultural Atlas Initiative project
- This work was partially supported by the
Institute of Museum and Library Services through
a National Leadership Grant for Libraries, award
number LG-02-04-0041-04, Oct 2004 - Sept 2006,
entitled "Supporting the Learner: What, Where,
When and Who". See http://ecai.org/imls2004
- Michael Buckland, Fred Gey, Vivien Petras, Matt
Meiske, Kim Carl, Anya Kartavenko, Minakshi
Mukherjee
- Contact: ray_at_sims.berkeley.edu