Title: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it
2 Overview
- The Dublin Core defines a number of metadata elements, but what about the values for those elements?
  - Should they be unrestricted text values or come from pre-defined vocabularies?
  - "It depends".
- We will discuss how to determine the appropriate approach for an organization's situation.
- We will also cover how pre-defined vocabularies should be sourced, structured, and maintained.
3 Vocabulary development and maintenance
- Vocabulary development and maintenance is the LEAST of three problems.
- The Vocabulary Problem: How are we going to build and maintain the lists of pre-defined values that can go into some of the metadata elements?
- The Tagging Problem: How are we going to populate metadata elements with complete and consistent values?
  - What can we expect to get from automatic classifiers? What kind of error detection and error correction procedures do we need?
- The ROI Problem: How are we going to use content, metadata, and vocabularies in applications to obtain business benefits?
  - More sales? Lower support costs? Greater productivity?
  - How much content? How big an operating budget?
- We need to know the answer to the ROI Problem before solving the Vocabulary Problem.
4 Definitions
Term | Definition
Metadata Element | A field for storing information about one piece of content. Examples: Title, Creator, Subject, Date, etc.
Metadata Value | The contents of one Metadata Element. Values may be text strings, or selections from a predefined vocabulary.
Metadata Schema | A defined set of metadata elements. The Dublin Core is one schema.
Free Text Value | An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).
Vocabulary | A list of predefined values for a metadata element.
Controlled Vocabulary | A vocabulary with a defined and enforced procedure for its update.
5 Controlled vocabularies
Hierarchical classification of things into a tree structure
6 Types of vocabularies
Vocabulary Type | Complexity | Description | Relation Type
Term List | 1 | Simple list of terms with no internal structure or relations. | None
Synonym Rings | 2 | List of sets of terms to regard as equivalent. Widely supported in search software. | Equivalence
Authority Files | 3 | List of names for known entities: people, organizations, books, etc. | Reference
Classification Schemes | 4 | Hierarchical arrangement of concepts. | Loose Hierarchy
Thesauri | 5 | Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical, relations. | Is-a Hierarchy plus Loose Relations
Ontologies | 6 | Arrangement of concepts and relations based on a model of underlying reality, e.g. organs, symptoms, diseases, and treatments in medicine. | Model-based Typed Relations
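As an illustration of the "Synonym Rings" row, the following minimal Python sketch shows how a search layer might expand a query term to every term in its ring. The rings and the function name `expand_query` are hypothetical and are not taken from any particular search product.

```python
# Hypothetical synonym rings: each set lists terms to treat as equivalent.
SYNONYM_RINGS = [
    {"car", "automobile", "auto"},
    {"film", "movie", "motion picture"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return the term plus every term that shares a ring with it."""
    normalized = term.strip().lower()
    expanded = {normalized}
    for ring in rings:
        if normalized in ring:
            expanded |= ring
    return expanded
```

A term that appears in no ring is returned unchanged, so the expansion never loses the user's original query.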
7 Vocabulary Control
- The degree of control over a vocabulary is (mostly) independent of its type.
  - Uncontrolled: Anybody can add anything at any time and no effort is made to keep things consistent. Multiple lists and variations will abound.
  - Managed: Software makes sure there is a list that is consistent (no duplicates, no orphan nodes) at any one time. Almost anybody can add anything, subject to consistency rules. (e.g. File System Hierarchy)
  - Controlled: A documented process is followed for the update of the vocabulary. Few people have authority to change the list. Software may help, but emphasis is on human processes and custodianship. (e.g. Employee list)
- Term lists and synonym lists can be controlled, managed, or uncontrolled.
- Ontologies are managed.
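A "managed" vocabulary, in the sense used above, can be sketched as a small Python class whose software rules reject duplicates and orphan nodes. The class name `ManagedVocabulary` and its methods are illustrative assumptions, not an existing library.

```python
class ManagedVocabulary:
    """Hierarchical term list that enforces consistency rules on every add."""

    def __init__(self, root):
        # Maps each term to its parent; the root has no parent.
        self.parent_of = {root: None}

    def add(self, term, parent):
        """Anybody may add a term, subject to the consistency rules."""
        if term in self.parent_of:
            raise ValueError(f"duplicate term: {term!r}")
        if parent not in self.parent_of:
            raise ValueError(f"orphan node: parent {parent!r} is not in the vocabulary")
        self.parent_of[term] = parent

    def path(self, term):
        """Return the terms from the root down to `term`."""
        chain = []
        while term is not None:
            chain.append(term)
            term = self.parent_of[term]
        return chain[::-1]
```

A controlled vocabulary would add a human approval step in front of `add`; the software checks alone only make the list managed.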
8 Type of controls
- Controlled vocabularies are frequently mentioned.
  - That does not mean they are always necessary.
  - Control comes at a cost, but can provide significant data quality benefits by reducing variations.
- Is this a well-controlled vocabulary?
  - No! It is an uncontrolled, but well-managed, term list.
- Is this part of an appropriate solution to the ROI problem?
  - Yes! There is no budget to do ongoing control and QA.
Source: http://del.icio.us/tag/
9 Likelihood of controlled values
Scale: (Virtually) Mandatory | Highly Likely | Maybe | Highly Unlikely | (Virtually) Impossible
Element | Candidate vocabulary or encoding
Language | RFC 3066
Format | IMT
Coverage | ISO 3166
Type | DCMI Type?
Subject | Custom
Creator | LDAP?
Publisher | Custom
Contributor | LDAP?
Identifier | Custom
Date | W3C DTF
Rights |
Title |
Relation |
Source |
Description |
10 Mandatory
- DC recommends specific best practices:
  - Language: RFC 3066 (which works with ISO 639)
  - Format: Internet Media Types (aka MIME)
- These vocabularies are widely used throughout the Internet. If you want to do something else, it should be justified.
- Describing physical objects?
  - Use Extent and Medium refinements instead of Format.
- Regional (vs. national) dialects?
  - a) Why?
  - b) Consider a custom element in addition to standard Language.
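A tagging tool could check these two elements with a simple syntax test before accepting a value. The sketch below is a minimal Python example under stated assumptions: the language pattern follows the RFC 3066 grammar (a primary subtag of 1 to 8 letters, then subtags of 1 to 8 letters or digits), while the media type pattern is a simplified "type/subtype" shape. Neither check confirms that a value is actually registered, and the function names are hypothetical.

```python
import re

# RFC 3066 grammar: a 1-8 letter primary subtag, then zero or more
# "-"-separated subtags of 1-8 letters or digits.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Assumption: a simplified token pattern for "type/subtype";
# it does not consult the IANA media type registry.
MEDIA_TYPE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*"
)


def is_language_tag(value):
    """True if the value has the shape of an RFC 3066 language tag."""
    return LANGUAGE_TAG.fullmatch(value) is not None


def is_media_type(value):
    """True if the value has the shape of an Internet Media Type."""
    return MEDIA_TYPE.fullmatch(value) is not None
```

For example, `is_language_tag("en-US")` and `is_media_type("text/html")` pass the syntax check, while `"en_US"` and `"html"` do not.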
11 Likely
- DC recommends specific best practices:
  - Coverage: ISO 3166
    - ISO 3166 should be used unless you have good reasons to use something else.
    - Consider the Getty Thesaurus of Geographic Names if you need cities, rivers, etc. (http://www.getty.edu/research/conducting_research/vocabularies/tgn/)
    - DC provides Encodings for both.
  - Type: DCMI Types (http://dublincore.org/documents/dcmi-type-vocabulary/)
    - The DCMI Type list is not necessarily a best practice.
    - No widely accepted type list exists, so a custom list is likely.
12 Maybe
- Creator, Contributor: could come from an authority file
  - LC NAF in library contexts
  - LDAP directory in corporate contexts
    - Recommended where possible
    - Many exceptions where the author is outside LDAP
- Publisher: could come from an authority file
  - Org chart in corporate contexts, e.g. an internal records management system.
- Identifier: should be a URI
  - The organization may manage these, but it's typically a text field, not a controlled list.
13 Subject and extensions
- Best practice: Use pre-defined subject schemes, not user-selected keywords.
- DC Encodings (DDC, LCC, LCSH, MeSH, UDC) are most useful in library contexts.
  - Not useful for most corporate needs
- Recommended: Factor Subject into separate facets.
  - People, Places, Organizations, Events, Objects, Products & Services, Industry sectors, Content types, Audiences, Business Functions, Competencies, ...
- Store the different facets in different fields.
  - Use DC elements where appropriate (coverage, type, audience, ...)
  - Extend with custom elements for other fields (industry, products, ...)
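The recommendation to factor Subject into facets and store each facet in its own field can be sketched in Python as follows. The vocabularies, the `x:` prefix for custom elements, and the function `factor_subject` are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical facet vocabularies; "x:" marks custom (non-DC) elements.
FACET_VOCABULARIES = {
    "dc:coverage": {"Italy", "Spain"},
    "dc:type": {"Report", "Press release"},
    "x:industry": {"Banking", "Tourism"},
    "x:products": {"Search engine"},
}


def factor_subject(keywords, vocabularies=FACET_VOCABULARIES):
    """Sort flat subject keywords into one field per facet.

    Keywords found in no vocabulary are collected under "unmatched".
    """
    record = {field: [] for field in vocabularies}
    record["unmatched"] = []
    for keyword in keywords:
        for field, terms in vocabularies.items():
            if keyword in terms:
                record[field].append(keyword)
                break
        else:
            # No vocabulary claimed the keyword.
            record["unmatched"].append(keyword)
    return record
```

The "unmatched" list is a design choice for this sketch: it gives a vocabulary custodian a place to review candidate terms instead of silently dropping them.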
14 Thesauri
- A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower, and other related terms.
15 Standards
- National and international standards for thesauri:
  - ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format, and Management of Monolingual Thesauri
  - ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
  - ISO 2788: Documentation - Guidelines for the establishment and development of monolingual thesauri
  - ISO 5964: Documentation - Guidelines for the establishment and development of multilingual thesauri
16 Thesaurus Examples
- Examples
  - The ERIC Thesaurus of Descriptors
  - The Medical Subject Headings (MeSH) of the National Library of Medicine
  - The Art and Architecture Thesaurus
17 ERIC Thesaurus Entry
18 ERIC Thesaurus Online
http://www.eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=Thesaurus&_nfls=false
19 MeSH
20 MeSH Online
http://www.nlm.nih.gov/mesh/meshhome.html
21 Dewey
- Dewey Decimal Classification System (DDC), first published in 1876 by Melvil Dewey
- Most widely used classification system in the world (used in 135 countries)
- In the United States, used primarily by public and school libraries
- Maintained by an editorial office housed at the Library of Congress (the scheme itself is published by OCLC)
22 Dewey
- DDC is divided into ten main classes, each class into ten divisions, and each division into ten sections.
- The first digit in each three-digit number represents the main class.
  - 500: natural sciences and mathematics
- The second digit in each three-digit number indicates the division.
  - 500 is used for general works on the sciences
  - 510 for mathematics
  - 520 for astronomy
  - 530 for physics
23 Dewey
- The third digit in each three-digit number indicates the section.
  - 530 is used for general works on physics
  - 531 for classical mechanics
  - 532 for fluid mechanics
  - 533 for gas mechanics
- A decimal point follows the third digit in a class number, after which division by ten continues to the specific degree of classification needed.
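The digit-by-digit structure described on the Dewey slides can be made concrete with a short Python sketch that splits a class number into main class, division, and section. The names `DeweyNumber` and `parse_dewey` are hypothetical, and the sketch handles only the plain "three digits, optional decimals" form.

```python
import re
from typing import NamedTuple


class DeweyNumber(NamedTuple):
    main_class: str  # first digit, e.g. "500"
    division: str    # first two digits, e.g. "530"
    section: str     # all three digits, e.g. "532"
    decimals: str    # digits after the decimal point, "" if none


def parse_dewey(number):
    """Split a DDC class number such as '532.05' into its levels."""
    match = re.fullmatch(r"([0-9]{3})(?:\.([0-9]+))?", number.strip())
    if match is None:
        raise ValueError(f"not a DDC class number: {number!r}")
    whole, decimals = match.group(1), match.group(2) or ""
    return DeweyNumber(whole[0] + "00", whole[:2] + "0", whole, decimals)
```

For instance, `parse_dewey("532")` places fluid mechanics in section 532, division 530 (physics), main class 500 (natural sciences and mathematics).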
24 Library of Congress Subjects
- Essentially an artificial indexing language
- Based on literary warrant
- Entry vocabulary provided in the form of a reference structure
- Moving slowly towards a real thesaurus structure (not there yet)
- Not faceted: subdivisions are pre-selected, based on individual heading or pattern heading
25 LCSH
- Digital libraries
  - see from: Electronic libraries
  - see from: Virtual libraries
  - see broader term: Libraries
  - see also: Information storage and retrieval systems
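The reference structure of this heading can be sketched as a small Python data structure, with an index that sends entry terms ("see from") to the preferred heading. The class `Heading` and the helper functions are hypothetical; only the terms themselves come from the entry above.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    preferred: str
    used_for: list = field(default_factory=list)  # "see from" entry terms
    broader: list = field(default_factory=list)   # "see broader term"
    related: list = field(default_factory=list)   # "see also"


digital_libraries = Heading(
    preferred="Digital libraries",
    used_for=["Electronic libraries", "Virtual libraries"],
    broader=["Libraries"],
    related=["Information storage and retrieval systems"],
)


def build_entry_index(headings):
    """Map every preferred and entry term (lowercased) to its preferred heading."""
    index = {}
    for heading in headings:
        for term in [heading.preferred, *heading.used_for]:
            index[term.lower()] = heading.preferred
    return index


def resolve(term, index):
    """Return the preferred heading for a term, or None if it is unknown."""
    return index.get(term.strip().lower())
```

With `index = build_entry_index([digital_libraries])`, a lookup of "Virtual libraries" resolves to "Digital libraries", which is the behaviour an indexer or search box would rely on.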
26 Library of Congress Classification
- 21 basic classes, each based on a single alphabetic character (K = law, N = art, etc.)
- Subdivided using two or three alphabetic characters (KF = American law, ND = painting, etc.)
- Further subdivision by specific numeric assignment
- Author numbers and dates arrange works by a particular author together and in chronological order
27 LCC
- 153 $a QL638.E55 $h Zoology $h Chordates. Vertebrates $h Fishes $h Systematic divisions $h Osteichthyes (Bony fishes). By family, A-Z $h Families $j Engraulidae (Anchovies)
  - $a: Classification number, single number or beginning number of span (R)
  - $h: Caption hierarchy
  - $j: Caption (lowest level, relating to the specific number in $a)
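To show how the subfields of such a record yield the caption hierarchy, here is a minimal Python sketch. It assumes the field is written as text with the conventional `$` subfield markers; the function names are hypothetical and no MARC library is used.

```python
import re

# The 153 field from the slide, written with "$" subfield markers (assumption).
FIELD_153 = (
    "$a QL638.E55 $h Zoology $h Chordates. Vertebrates $h Fishes "
    "$h Systematic divisions $h Osteichthyes (Bony fishes). By family, A-Z "
    "$h Families $j Engraulidae (Anchovies)"
)


def parse_subfields(field_text):
    """Split '$x value' runs into (code, value) pairs, keeping their order."""
    return [
        (code, value.strip())
        for code, value in re.findall(r"\$(\w)([^$]*)", field_text)
    ]


def caption_path(subfields):
    """Captions from the broadest level ($h) down to the lowest caption ($j)."""
    return [value for code, value in subfields if code in ("h", "j")]
```

Calling `caption_path(parse_subfields(FIELD_153))` returns the chain of captions from Zoology down to the anchovy family, while the `a` subfield carries the class number itself.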
28 DMOZ: A worst-case example of a unified subject
- DMOZ has over 600k categories.
- Most are a combination of common facets: Geography, Organization, Person, Document Type, ...
  - (e.g.) Top: Regional: Europe: Spain: Travel and Tourism: Travel Guides
- www.dmoz.org
29 History of Faceted Navigation
- Relatively new (taxonomies: Aristotle)
- S. R. Ranganathan, 1960s
  - Issue of compound subjects
  - The Universe consists of PMEST
    - Personality, Matter, Energy, Space, Time
- Classification Research Group, 1950s, 1970s
  - Based on Ranganathan, simplified, less doctrinaire
  - Principles
    - Division: a facet must represent only one characteristic
    - Mutual exclusivity
- Classification theory to web implementation
  - An idea waiting for a technology
  - Multiple filters / dimensions
30 What are Facets?
- Facets are not categories
  - Entities or concepts belong to a category
  - Entities have facets
- Facets are metadata: properties or attributes
  - Entities or concepts fit into one category
  - All entities have all facets, defined by a set of values
- Facets are orthogonal: mutually exclusive dimensions
  - An event is not a person is not a document is not a place.
  - A winery is not a region is not a price is not a color.
- Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
- Combined using grammars of order and relation to form compound descriptions
31 Faceted Classification
- Clearly distinguishes between semantic relationships and syntactic relationships
- Semantic relationships
  - Within a facet
  - Containment relations
- Syntactic relationships
  - Across facets
  - Combinatoric relations
- Have a syntax for syntactic combination of semantic terms
32 Semantic and Syntactic Relationships
- Semantic relationships
  - Is-A (thing/kind, genus/species)
    - Mammals
      - Primates
        - Humans
  - Has-Parts
    - Human
      - Head
        - Eyes
- Syntactic relationships
  - Compounds
    - Wheat + harvesting = wheat harvesting
    - Object + operation = operation on object
33 What is Faceted Navigation?
- Not a Yahoo-style browse
  - Computer Stores under Computers and Internet
  - One value per facet per entity
- Faceted navigation is not hierarchical
  - Tree: travel up and down, not across
  - Facets are filters, multidimensional
- Facets are applied at search-results time: post-coordination, not pre-coordination (Advanced Search)
- Faceted navigation is an active interface: a dynamic combination of search and browse
34 When to Use Faceted Navigation: Advantages
- Systematic advantages
  - Need fewer elements
    - 4 facets of 10 nodes = 10,000-node taxonomy
  - Ability to handle compound subjects
- Content management advantages
  - Easier to categorize: not as conceptual
  - Fewer and simpler elements, so auto-classification can be used more effectively
  - Flexible: can add new facets, or new elements within a facet
35 When to Use Faceted Navigation: Advantages (Implementation)
- More intuitive: easy to guess what is behind each door
  - Simplicity of internal organization
  - 20 questions: we know and use
- Dynamic selection of categories
  - Allow multiple perspectives
- Trick users into using Advanced Search
  - wine where color = red, price = x-y, etc.
  - Click on color red, click on price x-y, etc.
- Flexible: can be combined with other navigation elements
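The "click on color red, click on price x-y" interaction amounts to applying one filter per facet and recounting what remains. The Python sketch below illustrates this under stated assumptions: the wine records, field names, and the functions `select` and `facet_counts` are invented for the example and do not describe any real site's implementation.

```python
from collections import Counter

# Hypothetical catalogue: one value per facet per wine.
WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": 18},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": 32},
    {"name": "Wine C", "color": "white", "region": "California", "price": 22},
    {"name": "Wine D", "color": "red", "region": "California", "price": 45},
]


def select(items, color=None, region=None, price_range=None):
    """Keep the items matching every facet value the user has clicked so far."""
    low, high = price_range if price_range else (float("-inf"), float("inf"))
    return [
        item for item in items
        if (color is None or item["color"] == color)
        and (region is None or item["region"] == region)
        and low <= item["price"] <= high
    ]


def facet_counts(items, facet):
    """Counts to display next to each remaining value of a facet."""
    return Counter(item[facet] for item in items)
```

Each click narrows the result set and the counts for the other facets are recomputed from what is left, which is what makes the interface feel like browsing even though it is an advanced search underneath.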
36 When to Use Faceted Navigation: Disadvantages
- Systematic disadvantages
  - Lack of standards for faceted classifications
  - Every project is unique: customization
- Implementation disadvantages
  - Loss of browse context
    - Difficult to grasp scope and relationships
  - No immediate support for popular subjects
- Essential limit of faceted navigation
  - Limited domain applicability: type and size
  - Entities, not concepts, documents, web sites
37 Developing Facet Structure: Selection of Facets (Theory)
- Issue: a complete model of a domain
- Ranganathan: PMEST
  - Personality: person, animal, event
  - Matter: what x is made of
  - Energy: how x changes
  - Space: where x is
  - Time: when x happens
- Three planes: Idea, Verbal, Notational
38 Facets: an example
- A Language
  - a English
  - b French
  - c Spanish
- B Genre
  - a Prose
  - b Poetry
  - c Drama
- C Period
  - a 16th Century
  - b 17th Century
  - c 18th Century
  - d 19th Century
- Aa = English Literature
- AaBa = English Prose
- AaBaCa = English Prose, 16th Century
- AbBbCd = French Poetry, 19th Century
- BcCd = Drama, 19th Century
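The notation in this example (an uppercase letter for the facet, a lowercase letter for the focus) can be decoded mechanically. The Python sketch below uses the three facets listed above; the function name `decode` and the error handling are assumptions made for illustration.

```python
import re

# Facets and foci as listed on the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Translate a compound notation such as 'AaBaCa' into facet/focus names."""
    pairs = re.findall(r"([A-Z])([a-z])", notation)
    # Reject anything that is not a clean run of facet/focus letter pairs.
    if "".join(facet + focus for facet, focus in pairs) != notation:
        raise ValueError(f"malformed notation: {notation!r}")
    return {FACETS[facet][0]: FACETS[facet][1][focus] for facet, focus in pairs}
```

For example, `decode("AaBaCa")` yields English for Language, Prose for Genre, and 16th Century for Period; a facet may also be left out entirely, as in `decode("Aa")`.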
39 Developing Facet Structure: Selection of Facets (Practice): Wine.com
- Top Rated Wines
  - 90+ under $20
- Top Sellers
  - Cabernet Sauvignon
  - Pinot Noir
- Hot Features
  - Wine outlet
  - Sideways collection
- Region
  - Australia, California
- Type
  - Red Wine, White, Bubbly
- Winery
  - Alphabetical listing
- Price
  - $25 and below
  - $25-$50
40 Faceted Approach
- Power
  - 4 independent categories of 10 nodes = 10,000 nodes (10^4)
- Faster construction
  - Use existing taxonomies in specific fields
- Reduced maintenance cost
- More opportunity for data reuse
- Can be easier to navigate with an appropriate UI
60 nodes, 24,000 combinations
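The "power" claim is simple arithmetic: the nodes to maintain add up across facets, while the compound subjects they can express multiply. A minimal Python sketch, assuming one value is chosen from every facet and using the hypothetical function name `facet_power`:

```python
import math


def facet_power(facet_sizes):
    """Return (nodes to maintain, distinct combinations) for the given facets.

    Assumes exactly one value is picked from each facet.
    """
    return sum(facet_sizes), math.prod(facet_sizes)
```

With four facets of ten nodes each, `facet_power([10, 10, 10, 10])` gives 40 nodes to maintain and 10,000 possible combinations.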
41 Organization
- Either expose them directly in the user interface (post-coordination), or
- Combine them in a minimal hierarchy (pre-coordination), or
- Hide them from the user!
- Post-coordination takes software support, which may be fancy or basic.
- How many facets?
  - log10(number of documents) as a guide
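The rule of thumb for the number of facets can be written as a one-line calculation. The Python sketch below rounds the base-10 logarithm to the nearest whole number and never suggests fewer than one facet; both the rounding and the function name `suggested_facet_count` are assumptions, since the slide only gives log10 as a guide.

```python
import math


def suggested_facet_count(document_count):
    """Rule of thumb: about log10(number of documents) facets, at least one."""
    if document_count < 1:
        raise ValueError("document_count must be at least 1")
    return max(1, round(math.log10(document_count)))
```

Under this guide a collection of 10,000 documents would call for about four facets, and one of a million documents for about six.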
42
Element | Data Type | Length | Req. / Repeat | Source | Purpose | DC mapping
Asset Metadata
Unique ID | Integer | Fixed | 1 | System supplied | Basic accountability | dc:identifier
Recipe Title | String | Variable | 1 | Licensed Content | Text search and results display | dc:title
Recipe summary | String | Variable | 1 | Licensed Content | Content | dc:description
Main Ingredients | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve and aggregate recipes, generate shopping list | X
Subject Metadata
Meal Types | List | Variable | | Meal Types vocab | Browse or group recipes and filter search results | X
Cuisines | List | Variable | | Cuisines | Browse or group recipes and filter search results | X
Courses | List | Variable | | Courses vocab | Browse or group recipes and filter search results | X
Cooking Method | Flag | Fixed | | Cooking vocab | Browse or group recipes and filter search results | X
Link Metadata
Recipe Image | Pointer | Variable | ? | Product Group | Merchandize products | dcterms:hasPart
Use Metadata
Rating | String | Variable | 1 | Licensed Content | Filter, rank, evaluate recipes |
Release Date | Date | Fixed | 1 | Product Group | Publish and feature new recipes | dc:date
dc:type=recipe, dc:format=text/html, dc:language=en
43 Project/exercise
- Produce a faceted classification of your documents (at least 3 facets, min. 5 foci each).
- Encode the faceted classification as an extension of dc:subject.
- Attribute facets to your docs.
- Check extensibility by adding 10 new docs.