Title: Lecture 07: Controlled Vocabularies
1Lecture 07 Controlled Vocabularies
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson & Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2003
Some slides in this lecture were developed by
Prof. Marti Hearst
2Lecture Contents
- Phone Project
- Review
- Metadata Systems
- Dublin Core
- Controlled Vocabularies
- Name Authority Files
- Other Types of Controlled Vocabularies
- Faceted vs. Hierarchic Organization of
Vocabularies
- Discussion Questions
4Assignments
- Assignment 2 Due
- Assignment 3: Photo Capture and Annotation
- Assigned Sept 18
- Due Sept 23
5Phone Project Consent Forms
- Collection of Data for the Phone Project
- Informed Consent and Release Form
- Informed Consent to Release Academic Information
- You must sign these forms to receive a phone and
participate in the Phone Project
- Signing these consent forms is not a condition of
your participation in this course, nor will it be
used as a basis for grading your performance
therein
6Collection of Data for the Phone Project
- Call logging
- All phone calls made from the phones provided to
you will be logged. The phone conversations
themselves are not going to be recorded, but a
record will be made of which numbers were called,
when, and for how long.
- Approximate location logging
- Your approximate location may be logged whenever
the phone is used either for phone calls or to
take, upload, annotate, or retrieve photos.
- Data correlation
- The information from call logging and approximate
location logging may be correlated with various
other sources of information (e.g., raw location
data may be correlated with map data to try to
determine in which buildings the phone was used).
- Sublicensing of data collected
- Garage Cinema Research may sublicense portions of
the collected data to other parties. This may
include images of you or provided by you, as well
as metadata about you or provided by you.
- Privacy protections
- Garage Cinema Research will not release your
name, email address, or the complete phone
numbers of the parties you called, except for
their area codes and except for calls made
between two Phone Project phones.
7Informed Consent and Release Form
- License to content
- License to the content contributed by you to the
system, including but not limited to images,
annotations, and annotation frameworks, as well
as any data that will be collected in accordance
with the privacy protecting measures.
- Identifying information and pseudonyms
- Use of your name and email address by the system,
understanding that they are not going to be
released to third parties. Your name will be
replaced with a pseudonym if the data is released
to third parties.
- Personal data collection
- Applications built in the system will benefit
from the use of personal information; however,
you are not required to provide the system with
any personal information about yourself or other
people beyond the data that is being collected
automatically.
- Right of inspection/correction/deletion of photos
- You have the right to inspect photos of you or
information about you submitted by you and/or
other users of the system and to have them
corrected or removed.
8Consent to Release Academic Information
- Agreement to post work on IS202 web site
- You agree to have your Phone Project course work
posted, including your name, on the IS202 web
site, which is accessible to the general public.
- Understanding of course enrollment and authorship
disclosure
- You understand that this will publicly reveal
that you are a student at the University of
California at Berkeley, that you are taking this
course, and that you are an author of this work.
- Indefinite time period of posting
- You understand that your name may be posted on this
web site indefinitely, starting in September
2003.
- Optional email address posting
- The posting of student email addresses on the
IS202 web site Phone Project group pages, while
kindly requested, is not required.
9Lecture Contents
- Phone Project
- Review
- Metadata Systems
- Dublin Core
- Controlled Vocabularies
- Name Authority Files
- Other Types of Controlled Vocabularies
- Faceted vs. Hierarchic Organization of
Vocabularies
- Discussion Questions
10Metadata
- Structures and languages for the description of
information resources and their elements
(components or features)
- Metadata is information on the organization of
the data, the various data domains, and the
relationship between them (Baeza-Yates p. 142)
11Metadata
- Often two main types of metadata are
distinguished
- Descriptive metadata
- Describes the information/data object and its
properties
- May use a variety of descriptive formats and
rules
- Topical metadata
- Describes the topic or aboutness of an
information/data object
- May include a variety of vocabularies for
describing subjects, topics, categories, etc.
12Metadata Systems and Standards
- Naming and ID systems: URLs, ISBNs
- Bibliographic description: MARC, Dublin Core,
TEI, etc.
- Music: SMDL
- Images and objects: CIMI, VRA core categories
- Numeric data: DDI, SDSM
- Geospatial data: FGDC
- Collections: EAD
13Dublin Core
- Simple metadata for describing internet resources
- For Document-Like Objects
- 15 Elements (in base DC)
14Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
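As a rough illustration of how the fifteen elements might be used in practice, the sketch below maps the element labels from the slide to the short lowercase element names commonly used in Dublin Core records, and builds one record. The `make_record` helper and the sample values are hypothetical, introduced only for this example; they are not part of any Dublin Core tooling.

```python
# Slide labels mapped to the short element names of the 15-element set.
DC_ELEMENTS = {
    "Title": "title",
    "Creator": "creator",
    "Subject": "subject",
    "Description": "description",
    "Publisher": "publisher",
    "Other Contributors": "contributor",
    "Date": "date",
    "Resource Type": "type",
    "Format": "format",
    "Resource Identifier": "identifier",
    "Source": "source",
    "Language": "language",
    "Relation": "relation",
    "Coverage": "coverage",
    "Rights Management": "rights",
}


def make_record(**fields):
    """Hypothetical helper: accept only keys that are Dublin Core elements."""
    allowed = set(DC_ELEMENTS.values())
    unknown = sorted(set(fields) - allowed)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    return dict(fields)


# Sample values are illustrative; every element is optional and repeatable,
# so repeated values are held in a list.
record = make_record(
    title="Lecture 07: Controlled Vocabularies",
    creator=["Ray Larson", "Marc Davis"],
    date="2003",
    type="Text",
    language="en",
)
print(record["title"])
```

Restricting the keys to a fixed element set is itself a small act of vocabulary control: the element names are standardized even when the values are free text.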
15Lecture Contents
- Phone Project
- Review
- Metadata Systems
- Dublin Core
- Controlled Vocabularies
- Name Authority Files
- Other Types of Controlled Vocabularies
- Faceted vs. Hierarchic Organization of
Vocabularies
- Discussion Questions
16Controlled Vocabularies
- Vocabulary control is the attempt to provide a
standardized and consistent set of terms (such as
subject headings, names, classifications, etc.)
with the intent of aiding the searcher in finding
information
- That is, it is an attempt to provide a consistent
set of descriptions for use in (or as) metadata
17Controlled Vocabularies
- Names and name authorities
- Gazetteers (geographic names)
- Code lists (e.g., LC language codes)
- Subject heading lists
- Classification schemes
- Thesauri
18Control of Names
- Cutter's (1876) objectives of bibliographic
description
- To enable a person to find a document of which
- The author, or
- The title, or
- The subject is known
- To show what a library has
- By a given author
- On a given subject (and related subjects)
- In a given kind (or form) of literature.
- First serves access
- Second serves collocation
19Problems with Names
- How many names should be associated with a
document?
- Which of these should be the main entry?
- What form should each of the names take?
- What references should be made from other
possible forms of names that haven't been used?
20The Problem
- Proliferation of the forms of names
- Different names for the same person
- Different people with the same names
- Examples
- from Books in Print (semi-controlled but not
consistent)
- ERIC author index (not controlled)
21Goethe
etc
22John Muir
23Pauline Cochrane nee Atherton
24Pauline Cochrane nee Atherton
25Rules for Description
- AACR II and other sets of descriptive cataloging
rules provide guidelines for
- Determining the number of name entries
- Choosing a main entry
- Deciding on the form of name to be used
- Deciding when to make references
26Authority Control
- Authority control is concerned with creation and
maintenance of a set of terms that have been
chosen as the standard representatives (also known
as established) based on some set of rules
- If you have rules, why do you need to keep track
of all of the headings? Can't you just infer the
headings from the rules?
27Conditions of Authorship?
- Single person or single corporate entity
- Unknown or anonymous authors
- Fictitiously ascribed works
- Shared responsibility
- Collections or editorially assembled works
- Works of mixed responsibility (e.g.,
translations)
- Related works
28Added Entries
- Personal names
- Collaborators
- Editors, compilers, writers
- Translators (in some cases)
- Illustrators (in some cases)
- Other persons associated with the work (such as
the honoree in a festschrift)
- Corporate names
- Any prominently named corporate body that has
involvement in the work beyond publication,
distribution, etc.
29Choice of Name
- AACR II says that the predominant form of the
name used in a particular author's writings
should be chosen as the form of name
- References should be made from the other forms of
the name
30Form of the Name
- When names appear in multiple forms, one form
needs to be chosen
- Criteria for choice are
- Fullness (e.g., full names vs. initials only)
- Language of the name
- Spelling (choose predominant form)
- Entry element
- John Smith or Smith, John?
- Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
31Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c
UIP:a TD:19910821174242 KRC:a NMU:a
CRC:c UPN:a SBU:a SBC:a DID:n
DF:05-14-80 RFE:a CSC: SRU:b SRT:n
SRN:n TSS: TGA:? ROM:? MOD: VST:d
08-21-91 Other Versions: earlier
040 $a DLC $c DLC $d DLC $d OCoLC
053 $a PR6005.R517
100 10 $a Creasey, John
400 10 $a Cooke, M. E.
400 10 $a Cooke, Margaret, $d 1908-1973
400 10 $a Cooper, Henry St. John, $d 1908-1973
400 00 $a Credo, $d 1908-1973
400 10 $a Fecamps, Elise
400 10 $a Gill, Patrick, $d 1908-1973
400 10 $a Hope, Brian, $d 1908-1973
400 10 $a Hughes, Colin, $d 1908-1973
400 10 $a Marsden, James
400 10 $a Matheson, Rodney
400 10 $a Ranger, Ken
400 20 $a St. John, Henry, $d 1908-1973
400 10 $a Wilde, Jimmy
500 10 $w nnnc $a Ashe, Gordon, $d 1908-1973
Different names for the same person
32Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n
UIP:a TD:19910817053048 KRC:a NMU:a
CRC:c UPN:a SBU:a SBC:a DID:n
DF:06-03-91 RFE:a CSC:c SRU:b SRT:n
SRN:n TSS: TGA:? ROM:? MOD: VST:d
08-19-91
040 $a OCoLC $c OCoLC
100 10 $a Marric, J. J., $d 1908-1973
500 10 $w nnnc $a Creasey, John
663 $a Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under $b Creasey, John
670 $a OCLC 13441825: His Gideon's day, 1955 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 $a LC data base, 6/10/91 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 $a Pseuds. and nicknames dict., c1987 $b (Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
33Name Authority Files
ID:NAFL8166762 ST:p EL:n STH:a MS:c
UIP:a TD:19910604053124 KRC:a NMU:a
CRC:c UPN:a SBU:a SBC:a DID:n
DF:08-20-81 RFE:a CSC: SRU:b SRT:n
SRN:n TSS: TGA:? ROM:? MOD: VST:d
06-06-91 Other Versions: earlier
040 $a DLC $c DLC $d DLC $d OCoLC
100 10 $a Butler, William Vivian, $d 1927-
400 10 $a Butler, W. V. $q (William Vivian), $d 1927-
400 10 $a Marric, J. J., $d 1927-
670 $a His The durable desperadoes, 1973.
670 $a His The young detective's handbook, c1981 $b t.p. (W.V. Butler)
670 $a His Gideon's way, 1986 $b CIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
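The role of the see-from (400) references in these records can be sketched as a lookup table from variant names to established headings. This is a simplified illustration, not how any real authority system is implemented: the record layout, `build_index`, and `resolve` are hypothetical names, dates are dropped from the headings, and only a few of the Creasey variants are included.

```python
from collections import defaultdict

# Simplified records: "heading" stands for the 100 field,
# "see_from" for the 400 fields, "see_also" for the 500 fields.
RECORDS = [
    {
        "heading": "Creasey, John",
        "see_from": ["Cooke, M. E.", "Cooke, Margaret", "Gill, Patrick"],
        "see_also": ["Ashe, Gordon"],
    },
    {
        "heading": "Marric, J. J.",
        "see_from": [],
        "see_also": ["Creasey, John"],
    },
    {
        "heading": "Butler, William Vivian",
        "see_from": ["Butler, W. V.", "Marric, J. J."],
        "see_also": [],
    },
]


def build_index(records):
    """Map every established or variant name to the headings it can lead to."""
    index = defaultdict(set)
    for rec in records:
        index[rec["heading"]].add(rec["heading"])
        for variant in rec["see_from"]:
            index[variant].add(rec["heading"])
    return index


def resolve(name, index):
    """Return the established headings a searcher's name form points to."""
    return sorted(index.get(name, ()))


index = build_index(RECORDS)
print(resolve("Cooke, M. E.", index))   # one person, many names
print(resolve("Marric, J. J.", index))  # one name, more than one person
```

The second lookup returns two headings, which is the "different people writing with the same name" case: the rules alone cannot tell the searcher which J. J. Marric is meant, so the file has to record both.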
34The Haunting of Lauran Paine
1. Paine, Lauran. ALSO KNOWN AS
Batchelor, Reg; Beck, Harry; Bedford, Kenneth;
Bosworth, Frank; Bovee, Ruth; Cassidy, Claude;
Custer, Clint; Dana, Amber; Dana, Richard;
Davis, Audrey; Drexler, J. F.; Duchesne, Antoinette;
Fisher, Margot; Fleck, Betty; Frost, Joni;
Gordon, Angela; Gorman, Beth; Hayden, Jay;
Houston, Will; Howard, Troy; Ingersol, Jared;
Kelly, Ray; Ketchum, Jack; Liggett, Hunter;
Lucas, J. K.; Lyon, Buck; Morgan, Arlene;
Morgan, Valerie; O'Connor, Clint; St. George, Arthur;
Sharp, Helen; Thorn, Barbara; Archer, Dennis;
Clark, Badger; Carrel, Mark; Thompson, Russ;
Andrews, A. A.; Benton, Will; Bradford, Will;
Bradley, Concho; Brennan, Will; Carter, Nevada;
Allen, Clay; Almonte, Rosa; Armour, John;
Cassady, Claude; Glendenning, Donn; Kelley, Ray;
Kilgore, John; Martin, Tom; Slaughter, Jim;
Standish, Buck
35Some Interesting Ones
36Structure of an IR System
37Uses of Controlled Vocabularies
- Library subject headings, classification, and
authority files
- Commercial journal indexing services and
databases
- Yahoo, and other web classification schemes
- Online and manual systems within organizations
- SunSolve
- MacArthur
38Types of Indexing Languages
- Uncontrolled keyword indexing
- Indexing languages
- Controlled, but not structured
- Thesauri
- Controlled and structured
- Classification systems
- Controlled, structured, and coded
- Faceted thesauri and classification systems
39Indexing Languages
- An index is a systematic guide designed to
indicate topics or features of documents in order
to facilitate retrieval of documents or parts of
documents
- An indexing language is the set of terms used in
an index to represent topics or features of
documents, and the rules for combining or using
those terms
40Indexing Languages
- Library of Congress Subject Headings
- Yellow pages topics
- Wilson indexes (Readers' Guide)
41Thesauri
- A thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among
- Synonymous
- Equivalent
- Broader
- Narrower, and
- Other related terms
- National and international standards for thesauri
(More next time)
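The link types listed above can be sketched as a small data structure. The example below is a minimal illustration under stated assumptions: the terms are borrowed from the athletics examples elsewhere in this lecture, but the relationships between them are invented for the example, and the abbreviations UF (used for), BT (broader term), NT (narrower term), and RT (related term) follow common thesaurus practice rather than anything on the slide.

```python
# Hypothetical mini-thesaurus; the links are invented for illustration.
THESAURUS = {
    "Athletics": {
        "UF": ["Sports"],
        "BT": [],
        "NT": ["Athletic Equipment", "Athletic Fields"],
        "RT": ["Athletes"],
    },
    "Athletic Equipment": {
        "UF": ["Sports Gear"],
        "BT": ["Athletics"],
        "NT": [],
        "RT": [],
    },
    "Athletic Fields": {"UF": [], "BT": ["Athletics"], "NT": [], "RT": []},
    "Athletes": {"UF": [], "BT": [], "NT": [], "RT": ["Athletics"]},
}


def preferred(term):
    """Map a searcher's term to its descriptor, following UF links."""
    if term in THESAURUS:
        return term
    for descriptor, links in THESAURUS.items():
        if term in links["UF"]:
            return descriptor
    return None


def expand(term):
    """Return the descriptor for a term plus all of its narrower terms."""
    root = preferred(term)
    if root is None:
        return []
    result, stack = [], [root]
    while stack:
        current = stack.pop()
        if current not in result:
            result.append(current)
            stack.extend(THESAURUS[current]["NT"])
    return result


print(preferred("Sports"))
print(sorted(expand("Sports")))
```

Mapping the entry term to a descriptor is the synonym-control step; following NT links is one way a search could be broadened to cover everything filed under a more general concept.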
42Classification Systems
- A classification system is an indexing language
often based on a broad ordering of topical areas
- Thesauri and classification systems both use this
broad ordering and maintain a structure of
broader, narrower, and related topics
- Classification schemes commonly use a coded
notation for representing a topic and its place
in relation to other terms
43Classification Systems (Cont.)
- Examples
- The Library of Congress Classification System
- The Dewey Decimal Classification System
- The ACM Computing Reviews Categories
- The American Mathematical Society Classification System
44Using Controlled Vocabulary
- Start with the text of the document
- Attempt to control or regularize
- The concepts expressed within
- mutually exclusive
- exhaustive
- The language used to express those concepts
- limit the normal linguistic variations
- regulate word order and structure of phrases
- reduce the number of synonyms or near-synonyms
- Also, provide cross-references between concepts
and their expression
(These slides follow Bates 88)
Slide author Marti Hearst
45Classification Schemes
- Classify possible concepts.
- Goals
- Completely distinct conceptual categories
(mutually exclusive)
- Complete coverage of conceptual categories
(exhaustive)
Slide author Marti Hearst
46Assigning Headings vs. Descriptors
- Descriptors
- Mix and match
- Subject headings
- Assign one (or a few) complex heading(s) to the
document
How would we describe recipes using each
technique?
Slide author Marti Hearst
47Subject Heading vs. Descriptors
- Wilsonline
- Athletes
- Athletes -- Health and hygiene
- Athletes -- Nutrition
- Athletes -- Physical Exams
- ...
- Athletics
- Athletics -- Administration
- Athletics -- Equipment -- Catalogs
- ...
- Sports -- Accidents and Injuries
- Sports -- Accidents and Injuries -- Prevention
- ERIC
- Athletes
- Athletic Coaches
- Athletic Equipment
- Athletic Fields
- Athletics
- ...
- Sports Psychology
- Sportsmanship
Slide author Marti Hearst
48Subject Headings vs. Descriptors
- Subject headings
- Describe the contents of an entire document
- Designed to be looked up in an alphabetical index
- Look up document under its heading
- Few (1-5) headings per document
- AKA Precoordination
- Descriptors
- Describe one concept within a document
- Designed to be used in Boolean searching
- Combine to describe the desired document
- Many (5-25) descriptors per document
- AKA Postcoordination
Slide author Marti Hearst
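The contrast between precoordination and postcoordination can be sketched in a few lines. This is a toy illustration: the document ids and their index terms are made up, and `lookup_heading` and `boolean_and` are hypothetical helper names. Precoordinated headings are matched as whole strings, while descriptors are combined with a Boolean AND at search time.

```python
# Precoordinated: a few complex headings per document, looked up whole.
HEADINGS = {
    "doc1": ["Athletes -- Nutrition"],
    "doc2": ["Sports -- Accidents and Injuries -- Prevention"],
}

# Postcoordinated: many single-concept descriptors per document,
# combined by the searcher. Assignments here are invented.
DESCRIPTORS = {
    "doc1": {"Athletes", "Nutrition"},
    "doc2": {"Athletics", "Injuries", "Prevention"},
    "doc3": {"Athletes", "Injuries"},
}


def lookup_heading(heading):
    """Find documents filed under exactly this heading."""
    return sorted(doc for doc, hs in HEADINGS.items() if heading in hs)


def boolean_and(*terms):
    """Find documents indexed with every one of the given descriptors."""
    wanted = set(terms)
    return sorted(doc for doc, ds in DESCRIPTORS.items() if wanted <= ds)


print(lookup_heading("Athletes -- Nutrition"))
print(boolean_and("Athletes", "Injuries"))
print(boolean_and("Injuries"))
```

With headings, the indexer decides in advance which concept combinations exist; with descriptors, any combination can be requested, at the cost of the searcher having to build it.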
49Lecture Contents
- Phone Project
- Review
- Metadata Systems
- Dublin Core
- Controlled Vocabularies
- Name Authority Files
- Other Types of Controlled Vocabularies
- Faceted vs. Hierarchic Organization of
Vocabularies
- Discussion Questions
50Hierarchical Classification
- Each category is successively broken down into
smaller and smaller subdivisions
- No item occurs in more than one subdivision
- Each level divided out by a character of
division (also known as a feature)
- Example
- Distinguish Literature based on
- Language
- Genre
- Time Period
Slide author Marti Hearst
51Hierarchical Classification
Slide author Marti Hearst
52Labeled Categories for Hierarchical Classification
- LITERATURE
- 100 English Literature
- 110 English Prose
- English Prose 16th Century
- English Prose 17th Century
- English Prose 18th Century
- ...
- 111 English Poetry
- 121 English Poetry 16th Century
- 122 English Poetry 17th Century
- ...
- 112 English Drama
- 130 English Drama 16th Century
- ...
- 200 French Literature
Slide author Marti Hearst
53Faceted Classification
- Create a separate, free-standing list for each
characteristic or division (feature)
- Combine features to create a classification
Slide author Marti Hearst
54Faceted Classification Along With Labeled Categories
- A Language
- a English
- b French
- c Spanish
- B Genre
- a Prose
- b Poetry
- c Drama
- C Period
- a 16th Century
- b 17th Century
- c 18th Century
- d 19th Century
- Aa English Literature
- AaBa English Prose
- AaBaCa English Prose 16th Century
- AbBbCd French Poetry 19th Century
- BcCd Drama 19th Century
Slide author Marti Hearst
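The facet lists above are small enough to sketch directly. The example below is an illustrative sketch rather than a standard notation scheme: `FACETS`, `decode`, and `all_full_notations` are names invented here. It decodes a notation built from facet/value letter pairs and counts how many fully specified classes the three short lists can express.

```python
from itertools import product

# The three free-standing facet lists from the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {
        "a": "16th Century",
        "b": "17th Century",
        "c": "18th Century",
        "d": "19th Century",
    }),
}


def decode(notation):
    """Turn a notation such as 'AbBbCd' into its combined label."""
    if len(notation) % 2:
        raise ValueError("notation must be facet/value letter pairs")
    parts = []
    for i in range(0, len(notation), 2):
        facet, value = notation[i], notation[i + 1]
        parts.append(FACETS[facet][1][value])
    return " ".join(parts)


def all_full_notations():
    """Every notation that picks one value from each of the three facets."""
    value_lists = [
        [facet + value for value in values]
        for facet, (_, values) in FACETS.items()
    ]
    return ["".join(combo) for combo in product(*value_lists)]


print(decode("AaBaCa"))
print(decode("AbBbCd"))
print(decode("BcCd"))
print(len(all_full_notations()))
```

Ten list entries yield 36 fully specified classes, plus partial ones such as a genre and period with no language. An enumerated hierarchy would need a labeled category for each of these, which is the main trade-off between the two approaches.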
55Questions
- How (and when) to use both types of
classification structures?
- How to look through them?
- How to use them in searching?
Slide author Marti Hearst
56Lecture Contents
- Phone Project
- Review
- Metadata Systems
- Dublin Core
- Controlled Vocabularies
- Name Authority Files
- Other Types of Controlled Vocabularies
- Faceted vs. Hierarchic Organization of
Vocabularies
- Discussion Questions
57Sarah Ellinger on Svenonius
- Many of the studies Svenonius cites seem to
grapple with the same issue: how, or from whose
perspective, do we measure the success of a
database search? Should a successful search
return all information, or distinguish by
relevance? Can we always accept the searcher's
view of relevant material? If an issue is under
debate, should our search technologies provide
the user with information from all sides, or only
the side with which the searcher agrees?
58Sarah Ellinger on Svenonius
- In regards to discipline-specific search
vocabularies, Svenonius asks, "Would it not make
more sense to custom tailor a vocabulary-control
tool to the vocabulary being tailored?" In a
world where academic terms are prone to change,
how do we avoid replicating obsoletisms like
"Vietnamese conflict" in disciplinary
vocabularies? Would such a vocabulary preserve
outdated associations in the minds of searchers
or lend credence to some theories over others?
How can a controlled vocabulary reflect academic
debate?
59Matt Meiske on Bates
- Bates's article was written over 17 years ago.
Since then, online catalogues have changed (e.g.,
web-based Melvyl), but not to the extent that
Bates proposes. Why not?
60Matt Meiske on Bates
- In her proposal, Bates states that a good online
catalogue design will provide some means of
orientation, so that the user can get a feel
for the system. Seventeen years later, the world
is a far more computer-centric place. Are we
becoming naturally oriented to systems of this
sort? Is the issue of orientation / docking
still relevant?
61Paul Laskowski on Borgman
- Borgman wrote in 1996, but certain passages
already seem outdated to me (e.g., "a customer
trying to operate a mouse as a foot pedal," p. 499).
The spread of GUI interfaces, in particular, may
solve some of the interface problems Borgman
identifies. Is there still a problem with "user
education," or is it now time to focus on how
catalogues react to user queries? Is the real
problem for users one of "technical skills," or
should users be trained specifically to formulate
queries "strategically"? Is this a skill that
can be taught?
62Paul Laskowski on Borgman
- I opened up JSTOR (http://www.jstor.com) to try
to compare Borgman's ideas to practice. JSTOR
allows me to query the following fields: author,
title, abstract, and full-text. Borgman does not
seem to foresee the ability to search the full
text. Does this ability make a subject query
obsolete? In what scenario might I prefer to
query a subject field? JSTOR allows me to
constrain my search along multiple fields, using
the operators AND, OR, and NEAR (10 words or 25
words). Sure enough, no information seems to be
given on the order of operations. In what cases
might this foil my search attempt?
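To make the order-of-operations question concrete, here is a small sketch using Python sets as stand-ins for result lists. The terms and document ids are invented, and nothing here reflects how JSTOR actually evaluates queries; the point is only that the same three-term query returns different results depending on which operator is applied first.

```python
# Hypothetical postings: the documents that contain each term.
POSTINGS = {
    "thesaurus": {1, 2},
    "catalog": {2, 3},
    "indexing": {3, 4},
}

a = POSTINGS["thesaurus"]
b = POSTINGS["catalog"]
c = POSTINGS["indexing"]

# Query as typed: thesaurus AND catalog OR indexing

# Reading 1: AND binds first, i.e. (thesaurus AND catalog) OR indexing
and_first = (a & b) | c

# Reading 2: OR binds first, i.e. thesaurus AND (catalog OR indexing)
or_first = a & (b | c)

print(sorted(and_first))
print(sorted(or_first))
```

Under the first reading the searcher gets every document about indexing, whether or not it mentions thesauri; under the second, only documents that mention thesauri. Without documentation of the precedence rule, the searcher cannot tell which question was answered.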
63Next Time
- Thesaurus Design and Construction
- Readings/Discussion
- Chapter F: Flow of Work in the Construction of
Indexing Languages and Thesauri (Soergel) - Simon
- The House of Quality (Hauser and Clausing) - Sean
- Designing the Organizational Framework (Sano) -
Lisa
- Phone Project
- Phones!
- Phone demo
- Assignment 3: Photo Capture and Annotation
64Discussion Questions Leaders
- Soergel
- Hauser and Clausing
- Sano