Title: CS 430: Information Discovery
1CS 430 Information Discovery
Lecture 23 Thesauruses and Cluster Analysis 1
2Course Administration
Loss of files: Because of hardware failure, some files associated with Assignment 2 were lost. The original grades have been retrieved. If you had your grade changed because of a regrade, please send a message to cs430, copying the email message with the grade.
Assignment 3: Grades will be sent out shortly.
Assignment 4: Will be posted within a couple of days.
3Lexicon and Thesaurus
Lexicon: contains information about words, their morphological variants, and their grammatical usage.
Thesaurus: relates words by meaning.
    ship, vessel, sail; craft, navy, marine, fleet, flotilla
    book, writing, work, volume, tome, tract, codex
    search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
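As a rough illustration of "relates words by meaning", the sketch below stores the three Roget groups from this slide and looks up the other words in a word's group. The names `GROUPS`, `build_index`, and `related` are invented for this example, and treating each slide line as one flat group is a simplifying assumption.

```python
# Synonym groups copied from the Roget's examples on the slide.
# Treating each line as a single flat group is an assumption of this sketch.
GROUPS = [
    ["ship", "vessel", "sail", "craft", "navy", "marine", "fleet", "flotilla"],
    ["book", "writing", "work", "volume", "tome", "tract", "codex"],
    ["search", "discovery", "detection", "find", "revelation"],
]


def build_index(groups):
    """Map every word to the set of other words that share a group with it."""
    index = {}
    for group in groups:
        for word in group:
            others = {w for w in group if w != word}
            index.setdefault(word, set()).update(others)
    return index


INDEX = build_index(GROUPS)


def related(word):
    """Return the words related to `word`, sorted; empty list if unknown."""
    return sorted(INDEX.get(word.lower(), set()))


print(related("tome"))
```

A lexicon, by contrast, would attach morphological and grammatical information to each word rather than links between words.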
4Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual: Used to guide a human indexer in assigning standard terms and associations.
    computer-aided instruction
        see also education
        UF teaching machines
        BT educational computing
        TT computer applications
        RT education
        RT teaching
From INSPEC Thesaurus
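The INSPEC entry can be read as a small record. In conventional thesaurus notation UF means "used for" (non-preferred terms), BT "broader term", TT "top term", and RT "related term". The sketch below is one possible in-memory form, not INSPEC's actual data format; the class `ThesaurusEntry` and the function `preferred_term` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One descriptor with its relationships (field names are illustrative)."""
    descriptor: str
    see_also: list = field(default_factory=list)
    used_for: list = field(default_factory=list)   # UF: non-preferred terms
    broader: list = field(default_factory=list)    # BT: broader terms
    top: list = field(default_factory=list)        # TT: top terms
    related: list = field(default_factory=list)    # RT: related terms


# Values taken from the INSPEC example on the slide.
ENTRY = ThesaurusEntry(
    descriptor="computer-aided instruction",
    see_also=["education"],
    used_for=["teaching machines"],
    broader=["educational computing"],
    top=["computer applications"],
    related=["education", "teaching"],
)


def preferred_term(term, entries):
    """Return the descriptor an indexer should assign for `term`, or None."""
    for entry in entries:
        if term == entry.descriptor or term in entry.used_for:
            return entry.descriptor
    return None


print(preferred_term("teaching machines", [ENTRY]))
```

An indexer who thinks of "teaching machines" is steered to the standard term, which is the point of the UF relationship.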
5Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic: Divide terms into thesaurus classes. Replace similar terms by a thesaurus class.
    408  dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
    409  blast-cooled, heat-flow, heat-transfer
    410  anneal, strain
From Salton and McGill
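Replacing terms by their thesaurus class can be sketched as a dictionary lookup over the tokens of a document. The class membership below is one reading of the two-column example above (an assumption, since the layout was flattened), and the `class:NNN` label format and the function name `index_terms` are invented for this sketch.

```python
# Class membership as read from the Salton and McGill example (assumed layout).
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert the table so each term points at its class number.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}


def index_terms(tokens):
    """Replace each token that belongs to a class by a class label."""
    return [
        f"class:{TERM_TO_CLASS[t]}" if t in TERM_TO_CLASS else t
        for t in tokens
    ]


print(index_terms(["heat-flow", "across", "junction"]))
```

Documents that mention "heat-flow" and documents that mention "heat-transfer" then share the index entry for class 409.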
6Desirable Properties for Information Retrieval
- Thesaurus is specific to a subject area.
- Contains only terms of interest for identification within that subject area.
- Ambiguous terms are coded only for the senses important for that field.
- Target is that each thesaurus class should include terms of moderate frequency. Ideally the classes should have similar frequency.
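One way to check the "similar frequency" target is to add up the collection frequencies of the terms in each class and compare the totals. The sketch below does this with made-up counts; the class labels, the numbers in `TERM_FREQ`, and the function names are all hypothetical.

```python
def class_frequencies(term_freq, classes):
    """Sum the frequencies of the terms in each class."""
    return {
        class_id: sum(term_freq.get(term, 0) for term in terms)
        for class_id, terms in classes.items()
    }


def imbalance(freqs):
    """Ratio of the most frequent class to the least frequent (1.0 is ideal)."""
    return max(freqs.values()) / min(freqs.values())


# Hypothetical counts, for illustration only.
TERM_FREQ = {"anneal": 40, "strain": 60,
             "heat-flow": 30, "heat-transfer": 55, "blast-cooled": 5}
CLASSES = {"A": ["anneal", "strain"],
           "B": ["heat-flow", "heat-transfer", "blast-cooled"]}

FREQS = class_frequencies(TERM_FREQ, CLASSES)
print(FREQS, round(imbalance(FREQS), 2))
```

A ratio close to 1 suggests the classes are balanced; a large ratio suggests that a class should be split or merged.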
7Art and Architecture Thesaurus
- Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.
- Almost 120,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures.
- Used by archives, museums, and libraries to describe items in their collections.
- Used to search for materials.
- Used by computer programs, for information retrieval, and natural language processing.
- A project of the J. Paul Getty Trust
8Art and Architecture Thesaurus
Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.
Concept: a cluster of terms, one of which is established as the preferred term, or descriptor.
Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.
9Art and Architecture Thesaurus Sample Record
Record ID: 198841
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.
Hierarchy: Containers TQ
    ...<containers by function or context>
    ...........<culinary containers>
    ...................<containers for serving and consuming food>
10Art and Architecture Thesaurus Sample Record
(continued)
Terms: rhyta, rhyton (alternate, singular), protomai, protome, rhea, rheon, rheons
Related concepts: stirrup cups, sturzbechers, drinking vessels, ceremonial vessels
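Because a concept is a cluster of terms with one preferred descriptor, a search system can expand a query on any variant to the whole cluster. The sketch below shows the idea for the rhyta record; the dictionary layout and the name `expand_query` are assumptions made for this example, not the AAT's own interface.

```python
# Terms from the sample record; the dictionary layout is illustrative.
RHYTA = {
    "descriptor": "rhyta",
    "terms": ["rhyta", "rhyton", "protomai", "protome",
              "rhea", "rheon", "rheons"],
}


def expand_query(term, concepts):
    """Return every term of the concept containing `term`, else just `term`."""
    for concept in concepts:
        if term in concept["terms"]:
            return list(concept["terms"])
    return [term]


def descriptor_for(term, concepts):
    """Return the preferred term for `term`, or None if it is not known."""
    for concept in concepts:
        if term in concept["terms"]:
            return concept["descriptor"]
    return None


print(expand_query("rhyton", [RHYTA]))
print(descriptor_for("rheon", [RHYTA]))
```

Indexing uses `descriptor_for` (many terms to one descriptor), while searching uses `expand_query` (one term to many).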
11MeSH -- Medical Subject Headings
Controlled vocabulary for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE.
- About 19,000 primary subject headings
- Thesaurus of 110,000 chemical terms
- Total vocabulary over 300,000 terms
National Library of Medicine provides MeSH subject headings for each of the 400,000 articles that it indexes every year.
"MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts."
12MeSH -- Medical Subject Headings
MeSH hierarchy:
- General terms, e.g., anatomy, organisms, diseases, biological sciences.
- Anatomy is divided into sixteen topics, e.g., body regions and musculoskeletal system.
- Body regions is divided into sections, e.g., abdomen, axilla, back, etc.
13Example of MeSH hierarchy
Biological Sciences                                       G
    Biological Sciences                                   G01
    Health Occupations                                    G02
    Environment and Public Health                         G03
    Biological Phenomena, Cell Phenomena, and Immunity    G04
    Genetics                                              G05
    Biochemical Phenomena, Metabolism, and Nutrition      G06
    Physiological Processes                               G07
    Reproductive and Urinary Physiology                   G08
    Circulatory and Respiratory Physiology                G09
    Digestive, Oral, and Skin Physiology                  G10
    Musculoskeletal, Neural, and Ocular Physiology        G11
    Chemical and Pharmacologic Phenomena                  G12
14Example of MeSH hierarchy (continued)
Physiological Processes                       G07
    Adaptation, Physiological                 G07.062
    Aging                                     G07.168
    Body Constitution                         G07.265
    Body Temperature                          G07.315
        Body Temperature Regulation           G07.315.232
        Skin Temperature                      G07.315.753
    Chronobiology                             G07.450
    Electrophysiology                         G07.453
    Fluid Shifts                              G07.503
    Growth and Embryonic Development          G07.553
    Homeostasis                               G07.621
    Tensile Strength                          G07.900
    Tropism                                   G07.950
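The dotted tree numbers encode the hierarchy directly: dropping the last segment gives the parent, and every descendant shares the heading's number as a prefix. The sketch below works on a few entries from this slide; the names `TREE`, `parent`, and `descendants` are invented for the example.

```python
# A few tree numbers and headings copied from the slide.
TREE = {
    "G07": "Physiological Processes",
    "G07.315": "Body Temperature",
    "G07.315.232": "Body Temperature Regulation",
    "G07.315.753": "Skin Temperature",
    "G07.450": "Chronobiology",
}


def parent(tree_number):
    """Drop the last dotted segment; a top-level number has no parent here."""
    if "." not in tree_number:
        return None
    return tree_number.rsplit(".", 1)[0]


def descendants(tree_number, tree):
    """All tree numbers strictly below `tree_number`, sorted."""
    prefix = tree_number + "."
    return sorted(n for n in tree if n.startswith(prefix))


print(parent("G07.315.232"))
print([TREE[n] for n in descendants("G07.315", TREE)])
```

A search on "Body Temperature" can therefore be widened to its narrower headings with a simple prefix match.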
15Example of MeSH hierarchy (continued)
MeSH Heading: Body Temperature
Tree Number: E01.370.600.120
Tree Number: G07.315
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831
16Observations about Manually Maintained Thesaurus
- Permit very rich structure of relationships
- Most effective when user of search system is skilled in the discipline and trained in the use of the thesaurus (e.g., medical librarian)
- Needs continual updating as a field develops new terminology
- Expensive to create and maintain
17Gazetteers
The Alexandria Digital Library (ADL): geolibrary at University of California at Santa Barbara where a primary attribute of objects is location on Earth (e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.
Gazetteer: list of geographic names, with geographic locations and other descriptive information.
Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)
18Alexandria Thesaurus Example
canals: A feature type category for places such as the Erie Canal.
Used for: The category canals is used instead of any of the following.
    canal bends, canalized streams, ditch mouths, ditches, drainage canals, drainage ditches ... more ...
Broader Terms: Canals is a sub-type of hydrographic structures.
19Alexandria Thesaurus Example (continued)
canals (continued)
Related Terms: The following is a list of other categories related to canals (non-hierarchical relationships).
    channels, locks, transportation features, tunnels
Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. Definition of canals.
20Use of a Gazetteer
- Answers the "Where is" question, for example, "Where is Santa Barbara?"
- Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.
- Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
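The "draw a box and find the schools" use can be sketched as a filter over gazetteer entries by feature type and point footprint. Everything below is illustrative: the `Place` class, the function names, and the three sample entries with their coordinates are invented, and the box test assumes the box does not cross the 180 degree meridian.

```python
from dataclasses import dataclass


@dataclass
class Place:
    """A gazetteer entry with a point footprint in decimal degrees."""
    name: str
    feature_type: str
    lat: float
    lon: float


def in_box(place, south, west, north, east):
    """True if the point lies inside the box (no antimeridian handling)."""
    return south <= place.lat <= north and west <= place.lon <= east


def find_features(places, feature_type, south, west, north, east):
    """Names of places of the given type whose footprint falls in the box."""
    return [
        p.name for p in places
        if p.feature_type == feature_type
        and in_box(p, south, west, north, east)
    ]


# Hypothetical entries, for illustration only.
PLACES = [
    Place("Example School", "school", 34.42, -119.70),
    Place("Example Hospital", "hospital", 34.43, -119.72),
    Place("Distant School", "school", 36.15, -95.99),
]

print(find_features(PLACES, "school", 34.0, -120.0, 35.0, -119.0))
```

A real gazetteer would hold richer footprints than points and would use a spatial index rather than a linear scan.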
21Alexandria Gazetteer Example from a search on
"Tulsa"
Feature name                             State  County  Type     Latitude  Longitude
Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
Tulsa County                             OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
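The latitude and longitude values in this example appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Assuming that reading, the sketch below converts them to signed decimal degrees; the function name `dms_to_decimal` is invented for the example.

```python
def dms_to_decimal(value):
    """Convert a packed string such as '360914N' to signed decimal degrees.

    Assumes the last two digits are seconds, the two before them minutes,
    and the remaining leading digits degrees.
    """
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere in {value!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"expected 6 or 7 digits in {value!r}")
    degrees = int(digits[:-4])
    minutes = int(digits[-4:-2])
    seconds = int(digits[-2:])
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by the usual convention.
    return -decimal if hemisphere in ("S", "W") else decimal


# The populated place "Tulsa" from the table.
print(dms_to_decimal("360914N"), dms_to_decimal("0955933W"))
```

Decimal degrees are convenient for the footprint comparisons that a gazetteer search needs.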
22Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema for gazetteer information.
Feature types: A type scheme to categorize individual features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes change through time.
"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).
23Challenges for the Alexandria Gazetteer
(continued)
Quality aspects:
    (a) Indicate the accuracy of latitude and longitude data.
    (b) Ensure that the reported coordinates agree with the other elements of the description.
Spatial extents:
    (a) Points do not represent the extent of the geographic locations and are therefore only minimally useful.
    (b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
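The bounding-box problem is easy to reproduce with a toy shape. The sketch below uses a made-up triangular region (not real coordinates) and a standard ray-casting point-in-polygon test to show a point that the bounding box accepts but the polygon rejects. All names and numbers are illustrative.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(x, y, box):
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy triangular "region"; half of its bounding box lies outside it.
REGION = [(0, 0), (4, 0), (0, 4)]
BOX = bounding_box(REGION)

print(in_box(3, 3, BOX), in_polygon(3, 3, REGION))
```

The point (3, 3) is a false match for a box-only search, which is the same effect as the California box taking in Nevada.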
24Examples of Gazetteers
Alexandria Digital Library: Linda L. Hill, James Frew, and Qi Zheng, Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine, 5(1), January 1999. http://www.dlib.org/dlib/january99/hill/01hill.html
Getty Thesaurus of Geographic Names: http://www.getty.edu/research/tools/vocabulary/tgn/