Title: Lecture 21: Facetted Classification
1Lecture 21 Facetted Classification
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson & Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2004
2Agenda
- Facetted Classification
- Traditional vs. Facetted Classification
- Designing Facetted Classifications
- Thesaurus Design
- Assignment 6
- Discussion Questions
- Action Items for Next Time
4Controlled Vocabularies
- Vocabulary control is the attempt to provide a
standardized and consistent set of terms (such as
subject headings, names, classifications, etc.)
with the intent of aiding the searcher in finding
information
- That is, it is an attempt to provide a consistent
set of descriptions for use in (or as) metadata
5Hierarchical Classification
- Each category is successively broken down into
smaller and smaller subdivisions
- No item occurs in more than one subdivision
- Each level divided out by a character of
division (also known as a feature)
- Example
- Distinguish Literature based on
- Language
- Genre
- Time Period
Slide author Marti Hearst
6Hierarchical Classification
Slide author Marti Hearst
7Labeled Categories for Hierarchical Classification
- LITERATURE
- 100 English Literature
- 110 English Prose
- 111 English Prose 16th Century
- 112 English Prose 17th Century
- 113 English Prose 18th Century
- ...
- 120 English Poetry
- 121 English Poetry 16th Century
- 122 English Poetry 17th Century
- ...
- 130 English Drama
- 131 English Drama 16th Century
- ...
- 200 French Literature
Slide author Marti Hearst
8Faceted Categories
- Mutually exclusive
- Non-overlapping, distinct categories
- Relational
- Relations between facets, subfacets, and foci
(elements) are not restricted to hierarchical
generalization-specialization relations
- Composable
- Combined using grammars of order and relation to
form compound descriptions
9Faceted Classification Along With Labeled Categories
- A Language
- a English
- b French
- c Spanish
- B Genre
- a Prose
- b Poetry
- c Drama
- C Period
- a 16th Century
- b 17th Century
- c 18th Century
- d 19th Century
- Aa English Literature
- AaBa English Prose
- AaBaCa English Prose 16th Century
- AbBbCd French Poetry 19th Century
- BcCd Drama 19th Century
Slide author Marti Hearst
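The notation on this slide can be read mechanically: each capital letter names a facet and the lowercase letter after it names a focus. The following is a minimal sketch of that reading, assuming the facet tables shown above; the function names `decode` and `label` are illustrative and not part of any published scheme.

```python
# Facet tables copied from the slide. Function names are illustrative only.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Turn a notation such as 'AaBaCa' into a list of (facet, focus) pairs."""
    if len(notation) % 2 != 0:
        raise ValueError(f"notation must be facet/focus pairs: {notation!r}")
    pairs = []
    for i in range(0, len(notation), 2):
        facet_code, focus_code = notation[i], notation[i + 1]
        facet_name, foci = FACETS[facet_code]
        pairs.append((facet_name, foci[focus_code]))
    return pairs


def label(notation):
    """Build the compound label by joining the foci in notation order."""
    pairs = decode(notation)
    words = [focus for _, focus in pairs]
    # On the slide, a lone Language focus reads as "<Language> Literature".
    if [facet for facet, _ in pairs] == ["Language"]:
        words.append("Literature")
    return " ".join(words)
```

For example, `label("AaBaCa")` gives "English Prose 16th Century" and `label("AbBbCd")` gives "French Poetry 19th Century". A facet may be omitted, so a Genre plus Period notation is also valid.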
10Ranganathan
- PMEST Facets
- P(ersonality)
- WHO: Types of things
- M(atter)
- WHAT: Constituent materials
- E(nergy)
- HOW: Action or activity terms
- S(pace)
- WHERE: Where things occur
- T(ime)
- WHEN: When things occur
11Classical Facet Analysis
- Entity
- Kind
- Part
- Property
- Material
- Process
- Operation
- Patient
- Product
- By-Product
- Agent
- Space
- Time
12Classical Facet Analysis
- What is being done?
- Entity
- Kind
- Product
- By-Product
- What are its parts?
- Part
- What are its properties?
- Property
- Material
- How is this achieved?
- Process
- By what means?
- Operation
- By whom?
- Agent
- Patient
- Where?
- Space
- When?
- Time
13Classical Facet Analysis
- Nouns
- Entity
- Kind
- Part
- Patient
- Product
- By-Product
- Agent
- Adjectives
- Property
- Material
- Intransitive Verb
- Process
- Transitive Verb
- Operation
- Adverb
- Space
- Time
14Semantic and Syntactic Relationships
- Semantic relationships
- Is-A (thing/kind, genus/species)
- Mammals
- Primates
- Humans
- Has-Parts
- Human
- Head
- Eyes
- Syntactic relationships
- Compounds
- Wheat + harvesting = wheat harvesting
- Object + operation = operation on object
15Faceted Classification
- Clearly distinguishes between semantic
relationships and syntactic relationships
- Semantic relationships
- Within a facet
- Containment relations
- Syntactic relationships
- Across facets
- Combinatoric relations
- Have a syntax for syntactic combination of
semantic terms
16Power of Facet Combinations
- The syntactic relations of faceted
classifications enable a small controlled
vocabulary to produce
- Many, many structured descriptions
- Complex, but formally structured descriptions
using nested compound descriptions
- Descriptions for things we do not have words for
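A rough way to see this combinatoric power is to count the descriptions that the Language, Genre, and Period facets from the earlier slide can produce. The sketch below assumes the simplest grammar (at most one focus per facet, facets in a fixed order); it does not model nested compound descriptions, and `compound_descriptions` is an illustrative name.

```python
from itertools import product

# Facets and foci taken from the Language / Genre / Period example.
facets = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}


def compound_descriptions(facets):
    """Yield every description that uses at most one focus per facet."""
    # None stands for "this facet is left out of the description".
    options = [[None] + foci for foci in facets.values()]
    for combo in product(*options):
        chosen = [focus for focus in combo if focus is not None]
        if chosen:  # skip the empty description
            yield " ".join(chosen)


descriptions = list(compound_descriptions(facets))
vocabulary_size = sum(len(foci) for foci in facets.values())
```

Here a vocabulary of 10 foci yields 79 distinct descriptions, since (3 + 1) x (3 + 1) x (4 + 1) - 1 = 79. Adding one more focus to any facet, or one more facet, multiplies the total rather than adding to it.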
17Example Objects
Red Plastic Glass
Blue Paper Straw
18Project Team Facetted Classifications
- 007
- Personality
- Straw
- Glass
- Operation
- Drinking
- Slurping
- Sipping
- Material
- Plastic
- Paper
- Color
- Blue
- Red
- ARTery
- Color
- Size
- Material
- Weight
- Shape
- Radius/Circumference
- Density
- Volume/Capacity
- Function/Use
- Hardness/Softness
- Yin/Yang
19Project Team Facetted Classifications
- Culture Feed
- Color
- Red
- Blue
- Material
- Plastic
- Paper
- Use
- Drink from
- Drink with
- Dimensions
- Circumference
- Height
- Diameter
- Picture Portal
- Color
- Red
- Blue
- Material
- Paper
- Plastic
- Use
- Containment
- Transport
- Shape
- Torus
- Planar
- Holes
- 0
- 1
20Project Team Facetted Classifications
- F.U.N.
- Shape
- Color
- Material
- Rigidity
- Function
- Container
- Conduit
- Locale
- Weight
- Size
- MNM
- Functionality
- What it does
- What you can do with it
- Physical Properties
- Color
- Shape
- Material
21Project Team Facetted Classifications
- pillBox
- Function
- Container
- Conduit
- Form
- Shape
- Cylinder
- Composition
- Paper
- Plastic
- Color
- Blue
- Red
- Size
- Tall and skinny
- Short and fat
- Team iTour
- Color
- Red
- Blue
- State
- Solid
- Non-porous
- Flexible
- Material
- Plastic
- Paper
- Geometry
- Cylindrical
- Hollow
- Function
- Container
- Drinking
- Sucking
- Blowing
22Example Objects
Gray Metal Glass
Two Yellow Plastic Straws
23Example Objects
- Function
- Form
- Shape
- Material
- Color
- Number
- Function Drinking
- Form
- Shape Cylinder
- Material Plastic
- Color Red
- Number 1
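One way to make the filled-in facets above usable for retrieval is to store each object as a mapping from facet to focus and to filter on any combination of facets. This is a sketch under assumptions: the first record copies the values on this slide, the other two records are guesses at how the gray metal glass and the yellow straws might be described, and the Form subfacets are flattened into one level for brevity.

```python
# Facet names follow the slide. Only the first record's values come from it;
# the other two records are assumed for illustration.
objects = {
    "red plastic glass": {
        "Function": "Drinking", "Shape": "Cylinder",
        "Material": "Plastic", "Color": "Red", "Number": 1,
    },
    "gray metal glass": {
        "Function": "Drinking", "Shape": "Cylinder",
        "Material": "Metal", "Color": "Gray", "Number": 1,
    },
    "two yellow plastic straws": {
        "Function": "Drinking", "Shape": "Cylinder",
        "Material": "Plastic", "Color": "Yellow", "Number": 2,
    },
}


def find(objects, **wanted):
    """Return the names of objects that match every requested facet value."""
    return sorted(
        name for name, facets in objects.items()
        if all(facets.get(facet) == focus for facet, focus in wanted.items())
    )
```

For instance, `find(objects, Material="Plastic")` returns both plastic items, while `find(objects, Material="Plastic", Number=1)` narrows the result to the red plastic glass. Note that Function and Shape do not distinguish any of these objects, which is a hint that the scheme needs more foci or another facet.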
25Faceted Classification Design
- Collect examples that need to be classified
- Identify candidates for facets and subfacets
- Test classification scheme on examples for facet
orthogonality
- Order foci within facets
- Explicate grammar for ordering and combining
facets and subfacets
- Test classification scheme on examples for
combinatoric power
- Extend foci for comprehensiveness where
applicable
- Create new facets and subfacets where needed
- Test classification scheme on new examples,
especially boundary cases
- Iterate and refine throughout
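Part of the orthogonality test can be approximated mechanically: if the same focus shows up under two facets, those facets probably overlap. The sketch below is only a lexical proxy for that check and cannot replace testing on real examples; the sample scheme is hypothetical, loosely combining facets that appeared in the project team slides.

```python
from collections import defaultdict


def overlapping_foci(scheme):
    """Map each focus listed under more than one facet to those facets."""
    seen = defaultdict(set)
    for facet, foci in scheme.items():
        for focus in foci:
            seen[focus.lower()].add(facet)  # compare case-insensitively
    return {
        focus: sorted(facets)
        for focus, facets in seen.items()
        if len(facets) > 1
    }


# Hypothetical draft scheme, used only to exercise the check.
draft_scheme = {
    "Function": ["Container", "Drinking"],
    "Operation": ["Drinking", "Sipping"],
    "Material": ["Plastic", "Paper"],
}
```

With this draft, `overlapping_foci(draft_scheme)` reports that "drinking" appears under both Function and Operation, which suggests merging the two facets or redefining one of them.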
26Facet Guidelines
- Terms on the same level in the ontology should be
of the same level and type
- Facets, subfacets, and foci should have a
discernible order
- Use of capitalization and singular/plural forms
should be uniform
- Sports
- Team Sports
- Baseball
- Football
- Basketball
- Solo Sports
- Marathon Running
27Ordering Foci (Array)
- Simple to complex
- (Locomotions: walk, run, jump, skip, hurdle,
cartwheel)
- Common/popular to uncommon/unpopular
- (Vegetarian Pizza Toppings: mushroom, onion,
olive, artichoke, pineapple, pine nuts)
- Spatial, geographical, or geometric
- (Southwestern States: California, Nevada,
Arizona, New Mexico)
- Chronological, historical, or evolutionary
- (Dinosaur Eras: Triassic, Jurassic, Cretaceous)
- Canonical (pre-established order)
- (Playground Counting: Eenie, Meenie, Mynee, Mo)
- Alphabetical
- (Boys' Names: Al, Bob, Chuck, David, Ed, Frank,
George, Harry)
- Size
- (T-Shirts: Small, Medium, Large, XL, XXL)
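Most of these orderings cannot be derived from the terms themselves, so in practice the array order has to be stored with the facet. A minimal sketch, assuming the T-shirt size array above; `order_foci` is an illustrative helper, and it raises `KeyError` for a focus that is missing from the stored order.

```python
# The array order is stored explicitly, as given on the slide.
SIZE_ORDER = ["Small", "Medium", "Large", "XL", "XXL"]


def order_foci(foci, canonical):
    """Sort foci by their position in a stored canonical order."""
    rank = {focus: position for position, focus in enumerate(canonical)}
    return sorted(foci, key=rank.__getitem__)


in_stock = ["XL", "Small", "Large"]
by_size = order_foci(in_stock, SIZE_ORDER)   # Small, Large, XL
by_alphabet = sorted(in_stock)               # Large, Small, XL
```

The alphabetical result puts Large before Small, which is why alphabetical order is usually the fallback when no more meaningful order exists.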
29Why Develop a Thesaurus?
- To provide a conceptual structure or space for
a body of information
- To make it possible to adequately describe the
topical content of information resources at an
appropriate level of generality or specificity
- To provide enhanced search capabilities and to
improve the effectiveness of searching (i.e., to
retrieve most of the relevant material without
too much irrelevant material)
30Why Develop a Thesaurus?
- To provide vocabulary (or terminological) control
- When there are several possible terms designating
a single concept, the thesaurus should lead the
indexer or searcher to the appropriate concept,
regardless of the terms they start with
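Vocabulary control of this kind is often implemented as a lookup from entry terms to a preferred term. The sketch below assumes a tiny, invented entry vocabulary; none of the terms come from the lecture, and `preferred_term` is an illustrative name.

```python
# Hypothetical entry vocabulary: every term here is invented for illustration.
USE = {
    "automobile": "car",
    "motorcar": "car",
    "auto": "car",
}
PREFERRED = {"car"}


def preferred_term(term):
    """Lead an indexer's or searcher's term to the preferred term, if any."""
    key = term.strip().lower()
    if key in PREFERRED:
        return key
    return USE.get(key)  # None signals a gap in the thesaurus
```

Both `preferred_term("Automobile")` and `preferred_term(" car ")` lead to "car", while an unknown term such as "bicycle" returns `None` and marks a candidate for addition.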
31Preliminary Considerations
- What is used now?
- Continue using an existing thesaurus?
- Ad hoc modification of existing thesaurus?
- Develop a new well-structured thesaurus?
- What is the scope and complexity of the subject
field?
- What kind of retrieval objects or data will be
dealt with?
- How exhaustive and specific is the desired
description of objects?
32Preliminary Considerations
- The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus
- It is better to plan for a larger and more
comprehensive system than a smaller system that
will rapidly become inadequate as the database
grows
- Development of a good thesaurus requires a major
intellectual effort as well as clerical
operations like data entry and production of
sorted lists
33Development of a Thesaurus
- Term selection
- Merging and development of concept classes
- Definition of broad subject fields and subfields
- Development of classificatory structure
- Review, testing, application, revision
34Flow of Work in Thesaurus Construction
35 1. Term Selection
- Select sources for the collection of terms
- Prearranged Sources
- Open-ended Sources
- Assign codes to each source
- Selection of terms
- For part of prearranged and for all open-ended
sources
- Enter terms into database with all information
36 1.1 Kinds of Sources
- Prearranged Sources
- Existing descriptor lists, classification schemes,
thesauri
- This includes universal schemes like DDC or LCSH
- Nomenclatures of single disciplines
- Treatises on the terminology of a field
- Encyclopedias, lexica, dictionaries and
glossaries
- Tables of contents of textbooks and handbooks
- Indexes of journals or abstracting journals
- Indexes of other publications in the field
37 1.1 Kinds of Sources
- Open-ended sources
- Lists of search requests or interest profiles
- Description of projects/activities to be served
by the information retrieval system
- Discussion with specialists in the field
- Sample of documents in the field
- Ask users why and how these documents relate to
the field
- Have documents indexed by experts in the field
- Lists of titles of documents in the field
- Abstracts and reviews of documents
- Your own knowledge
38Selection of Sources
- Prearranged sources require less effort in
gathering the material, and may already indicate
some relationships between terms and concepts and
relationships among terms
- Open-ended sources can reflect current
terminology and may provide more complete
coverage
- Choose a set of sources that are current, as
complete as possible, and considered authoritative
39Selection of Sources
- Each selected source is assigned an ID for
tracking its use in the development of the
thesaurus
- Useful when making decisions about which terms to
prefer
- Useful for backtracking when questions arise
(where did this come from?)
40Selection of Terms
- Terms can be transferred directly from
prearranged sources to the recording medium
(cards or database)
- You have to decide which terms and references to
include, or whether to take the whole source
41Selection of Terms
- In open-ended sources you read through the source
and pick out terms (i.e., words and phrases) that
might be useful in retrieval or as references to
other terms
- Alternatively, use keyword and phrase extraction
software to create lists of terms and select from
those
- Transfer selected terms to the recording medium
(cards or database)
(cards or database)
42 2. Merging and Development of Concept Classes
- Sort Term DB into alphabetical order
- First Round
- Merge information for identical terms, possibly
pulling info from additional sources
- Second Round
- Merge synonyms or terms in the same concept class
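The two merging rounds can be sketched with a small term database in which each record keeps the ID of the source it came from. Everything concrete below is an assumption for illustration: the sample terms, the source IDs `S1` to `S3`, and the editor-supplied `concept_classes` mapping that decides which terms belong together.

```python
from collections import defaultdict

# Hypothetical (term, source ID) records collected during term selection.
records = [
    ("Thesauri", "S1"),
    ("thesauri", "S2"),
    ("Controlled vocabularies", "S1"),
    ("Descriptor lists", "S3"),
]


def merge_identical(records):
    """First round: merge identical terms and list them alphabetically."""
    merged = defaultdict(set)
    for term, source_id in records:
        merged[term.strip().lower()].add(source_id)
    return dict(sorted(merged.items()))


def merge_synonyms(merged, concept_classes):
    """Second round: group terms that an editor assigned to one concept class."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in merged.items():
        concept = concept_classes.get(term, term)  # unassigned terms stand alone
        classes[concept]["terms"].add(term)
        classes[concept]["sources"] |= sources
    return dict(classes)


first_round = merge_identical(records)
second_round = merge_synonyms(
    first_round, {"descriptor lists": "controlled vocabularies"}
)
```

The first round is purely clerical. The second round depends on the `concept_classes` decisions, which is where the intellectual effort of thesaurus construction goes. Keeping the source IDs through both rounds preserves the ability to trace where a term came from.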
43 3. Definition of Broad Subject Fields and Subfields
- Define broad subject fields and sort terms into
these broad fields
- Define subfields within each broad field and sort
terms into these subfields
- Work out the detailed structure
- Select preferred terms
- Merge information for terms in the same concept
class
- Repeat these steps
- For each subfield within a broad field
- And for each broad field
- Until all terms have been consolidated and
preferred terms selected
44 4. Development of Classificatory Structure
- Produce preliminary version of classified index
and update the working database
- Improve classificatory structure
- Reality check
- Produce and distribute a version of the
classified index
- Distribute to users/experts
45 5. Final Stages
- Review
- Testing
- Application
- Revision
46Review
- Discuss classified index with users/experts
- Select descriptors and checklist descriptors
- Assign notational symbols
- Produce main thesaurus and indexes
47Review (cont.)
- Check cross references and insert where needed
- Produce test version
- Test by indexing
- Modify as needed
- Produce production version
48Testing a Thesaurus
- Assign descriptors to a sample set of NEW
documents (use enough to get an idea of any gaps
in the thesaurus)
- Test retrieval using sample questions and see
how effectively the thesaurus maps them to the
appropriate descriptors
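A simple way to quantify both tests is to run a sample of terms (taken from new documents or from sample questions) against the entry vocabulary and record which ones find no descriptor. This is a sketch only: the `coverage` helper, the sample terms, and the entry vocabulary are all hypothetical, and a real test would also judge whether the mapped descriptor is the right one.

```python
def coverage(sample_terms, entry_vocabulary):
    """Return (fraction of terms that map to a descriptor, unmapped terms)."""
    normalized = [term.strip().lower() for term in sample_terms]
    mapped = sum(1 for term in normalized if term in entry_vocabulary)
    gaps = sorted({term for term in normalized if term not in entry_vocabulary})
    fraction = mapped / len(normalized) if normalized else 0.0
    return fraction, gaps


# Hypothetical entry vocabulary (entry term -> descriptor) and sample terms.
entry_vocabulary = {"car": "car", "automobile": "car"}
fraction, gaps = coverage(["Car", "automobile", "bicycle", "tram"], entry_vocabulary)
```

In this example half of the sample terms map to a descriptor, and "bicycle" and "tram" are reported as gaps to review before the production version.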
49Art and Architecture Thesaurus
- http://orange.sims.berkeley.edu/cgi-bin/flamenco/aa/Flamenco
51Phone Project Assignments
- Photo Metadata Design (Assignment 6)
- Having your application and the overall project
goals in mind, you will design a suitable
metadata framework to use for annotating photos
such that all photos would be accessible not only
for the needs of your particular application, but
also for the reusability of your photos and
metadata by other applications.
53Discussion Questions
- Paul Poling on Broughton
- What are the major inadequacies of 19th century
classification systems which faceted
classification overcomes?
- Some answers
- They don't "display very much in the way of
internal logic, or fundamental structural
principles"
- Ineffectual at addressing the specific problems
of vocabulary
- They do not consider the precise relations
between concepts
- Multilingual switching is difficult, particularly
in group/set names
- They "fail to make adequate distinction between
permanent hierarchical relationships, and
relationships of syntactic association in
complexes. As a result, structures are not
logical (since the analysis is not rigorous),
positioning of compound subjects is not
predictable (since no operating rules for
combination are normally present), and retrieval
is unreliable"
54Discussion Questions
- Paul Poling on Broughton
- The author makes the somewhat startling claim
that, "the fundamental thirteen categories have
been found to be sufficient for the analysis of
vocabulary in almost all areas of knowledge."
Are there any exceptions to this that come to
mind?
55Discussion Questions
- Paul Poling on Broughton
- Broughton later notes that some aspects of
digital materials cannot be represented by the 13
categories used for the BC2 system. For use with
our cameraphones, what are some categories that
would need to be included? More importantly,
what is the minimum set of additional categories
needed?
56Discussion Questions
- Paul Poling on Broughton
- Broughton states that, "There is no obvious way
in which the core vocabulary can be dealt with by
machines...the initial allocation of vocabulary
to categories must be carried out
intellectually." The author goes on to suggest
that all but the initial category assignments can
be done by a computer. How feasible is the BC2
system for the web, considering this requirement,
when one considers the fairly rapidly expanding
categories in so many fields of human
knowledge?
57Discussion Questions
- Steve Chan on Broughton
- The category system used in Bliss/BC2 is based on
a general-to-specific ordering and on 13
functional categories. How do you think that
Lakoff's ideas of base level categories, and
the importance of metaphor/embodiment relate to
the categories chosen in Bliss/BC2?
58Discussion Questions
- Steve Chan on Broughton
- Many of the relationships in the categories fall
into types such as "is a kind of" or "is a part
of". These are very similar to the predicates in
WordNet. As a thought experiment, what would it
take to interface WordNet into something like
BC2, so that documents could be parsed for
content and then automatically categorized? Would
you want to let such a system generate the
categories?
59Discussion Questions
- Scott Fisher on Faceted Classification
- What are some different ways of ordering the
facets within a classification notation? When
might one ordering be more appropriate than
another? Why might the result be especially
important for non-electronic documents?
60Discussion Questions
- Scott Fisher on Faceted Classification
- Why is it important that characteristics of
division be mutually exclusive? Explain what
might happen if they are not.
61Discussion Questions
- Morgan Ames on Vickery
- Though facets are a powerful tool for organizing
information, they can be very time-consuming to
define. Vickery describes the creation of
facets, starting with the analysis of terms used
by a user group, then the sorting of the terms
into facets, the development of facets (depending
on how often they're used), the arrangement of
the facets, and finally, the establishment of a
notation for the facets. Could one automate some
or all of the process of defining facets for a
particular area - say, an online community? If
so, which parts could be automated, and how? If
not, why not - what are the limitations of
automation?
62Discussion Questions
- Morgan Ames on Vickery
- How do the properties of facets compare with the
properties of relational databases?
63Discussion Questions
- Lilia Manguy on Thesaurus Construction
- The reading mentions thesauri being constructed
for institutions. What are some examples of
institutions with specialized thesauri? Why were
they deemed necessary?
64Discussion Questions
- Lilia Manguy on Thesaurus Construction
- In our field, what are some scenarios in which a
thesaurus would need to be constructed? How would
you determine who would be your expert
consultants? Who would you choose?
65Discussion Questions
- Lilia Manguy on Thesaurus Construction
- Using the process outlined in the reading for
constructing a thesaurus, how would you qualify
whether your thesaurus is good or bad?
66Discussion Questions
- Christine Jones on Card Sorting
- Considering the "vocabulary problem" laid forth
in "The Vocabulary Problem in Human-System
Communication," by Furnas et al., do you think
the card sorting technique is an effective
approach for categorizing information for the
SunWeb Intranet, i.e., do you think menus and the
search function contain vocabulary users will
understand? Would you recommend any other tools
for the user to increase their understanding of
the SunWeb information space?
67Discussion Questions
- Christine Jones on Card Sorting
- Usability studies including card sorting, icon
intuitiveness testing, card distribution to
icons, and thinking aloud walkthrough were
performed and the results were based in part on
subjective interpretation. For example, instead
of depending on formal statistics, eyeballing the
data was used and when deciding whether to keep
icons, the user interface designers made the
final decisions. Do you think this level of
subjective interpretation was justified for a
project of this nature? What (if any) changes
would you make to this approach if the project
was a redesign or design of Sun's external
Website?
68Discussion Questions
- Carrie Burgener on Flamenco
- How do the search and browse functions used by
Flamenco compare to Bates's berrypicking model?
69Discussion Questions
- Carrie Burgener on Flamenco
- The examples in the article were collections of
images that had existing metadata associated. It
has been presented in IS203 that people take
pictures and generally do not organize them. How
can the UI design of Flamenco be applied to photo
annotation?
70Discussion Questions
- Carrie Burgener References for Flamenco
- PhotoCompas tool using Flamenco interface
- http://shark.stanford.edu:4230/cgi-bin/flamenco/mor_full/Flamenco?username=default
- Presentation by Professor Hearst
- http://bailando.sims.berkeley.edu/talks/dli02.ppt
- Different article
- http://www.sims.berkeley.edu/hearst/papers/cacm02.pdf
72Homework (!)
- Assignment 6
- Due Thursday, November 18
- Read
- Textbook: Organization of Information, Chapters
3-5 (Taylor)
- Chitra: 3
- Shufei: 4
- Jaime: 5