DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it - PowerPoint PPT Presentation

1
DL: Lesson 5
Classification Schemas
Luca Dini
dini@celi.it
2
Overview
  • The Dublin Core defines a number of metadata
    elements, but what about the values for those
    elements?
  • Should they be unrestricted text values or come
    from pre-defined vocabularies?
  • The short answer: "it depends".
  • We will discuss how to determine the appropriate
    approach for an organization's situation.
  • We will also cover how pre-defined vocabularies
    should be sourced, structured, and maintained.

3
Vocabulary development and maintenance
  • Vocabulary development and maintenance is the
    LEAST of three problems
  • The Vocabulary Problem: How are we going to build
    and maintain the lists of pre-defined values that
    can go into some of the metadata elements?
  • The Tagging Problem: How are we going to populate
    metadata elements with complete and consistent
    values?
  • What can we expect to get from automatic
    classifiers? What kind of error detection and
    error correction procedures do we need?
  • The ROI Problem: How are we going to use content,
    metadata, and vocabularies in applications to
    obtain business benefits?
  • More sales? Lower support costs? Greater
    productivity?
  • How much content? How big an operating budget?
  • We need to know the answer to the ROI Problem
    before solving the Vocabulary Problem.

4
Definitions
Term | Definition
Metadata Element | A field for storing information about one piece of content. Examples: Title, Creator, Subject, Date, etc.
Metadata Value | The contents of one Metadata Element. Values may be text strings or selections from a predefined vocabulary.
Metadata Schema | A defined set of metadata elements. The Dublin Core is one schema.
Free Text Value | An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).
Vocabulary | A list of predefined values for a metadata element.
Controlled Vocabulary | A vocabulary with a defined and enforced procedure for its update.
5
Controlled vocabularies
Hierarchical classification of things into a tree
structure
6
Types of vocabularies
Vocabulary Type | Complexity | Description | Relation Type
Term List | 1 | Simple list of terms with no internal structure or relations. | None
Synonym Rings | 2 | List of sets of terms to regard as equivalent. Widely supported in search software. | Equivalence
Authority Files | 3 | List of names for known entities: people, organizations, books, etc. | Reference
Classification Schemes | 4 | Hierarchical arrangement of concepts. | Loose Hierarchy
Thesauri | 5 | Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical relations. | Is-a Hierarchy plus Loose Relations
Ontologies | 6 | Arrangement of concepts and relations based on a model of underlying reality, e.g. organs, symptoms, diseases, and treatments in medicine. | Model-based Typed Relations
7
Vocabulary Control
  • The degree of control over a vocabulary is
    (mostly) independent of its type.
  • Uncontrolled: Anybody can add anything at any
    time and no effort is made to keep things
    consistent. Multiple lists and variations will
    abound.
  • Managed: Software makes sure there is a list
    that is consistent (no duplicates, no orphan
    nodes) at any one time. Almost anybody can add
    anything, subject to consistency rules (e.g.
    File System Hierarchy).
  • Controlled: A documented process is followed for
    the update of the vocabulary. Few people have
    authority to change the list. Software may help,
    but the emphasis is on human processes and
    custodianship (e.g. Employee list).
  • Term lists and synonym lists can be controlled,
    managed, or uncontrolled.
  • Ontologies are managed.
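
The consistency rules that a managed vocabulary enforces (no duplicates, no orphan nodes) can be sketched in a few lines. This is only an illustration, not the behavior of any particular tool: the `(term, parent)` pair layout, the sample terms, and the function name `check_vocabulary` are all assumptions made for the example.

```python
from collections import Counter

def check_vocabulary(entries):
    """Return (duplicates, orphans) for a list of (term, parent) pairs.

    A parent of None marks a top-level term. The pair layout and the
    two rules checked here are assumptions made for this sketch.
    """
    counts = Counter(term for term, _ in entries)
    duplicates = sorted(term for term, n in counts.items() if n > 1)
    known = set(counts)
    # An orphan names a parent that is not itself a term in the list.
    orphans = sorted(
        term for term, parent in entries
        if parent is not None and parent not in known
    )
    return duplicates, orphans

# Hypothetical sample data.
entries = [
    ("Wine", None),
    ("Red Wine", "Wine"),
    ("White Wine", "Wine"),
    ("Red Wine", "Wine"),   # duplicate entry
    ("Stout", "Beer"),      # orphan: "Beer" is not in the list
]
duplicates, orphans = check_vocabulary(entries)
print(duplicates)  # ['Red Wine']
print(orphans)     # ['Stout']
```

A controlled vocabulary would add a human approval step in front of such checks; the software part stays the same.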

8
Type of controls
  • Controlled vocabularies are frequently mentioned
  • That does not mean they are always necessary
  • Control comes at a cost, but can provide
    significant data quality benefits by reducing
    variations.
  • Is this a well-controlled vocabulary?
  • No! It is an uncontrolled, but well-managed, term
    list
  • Is this part of an appropriate solution to the
    ROI problem?
  • Yes! There is no budget to do ongoing control and
    QA

Source: http://del.icio.us/tag/
9
Likelihood of controlled values
Likelihood scale: (Virtually) Mandatory | Highly Likely | Maybe | Highly Unlikely | (Virtually) Impossible
Element | Candidate vocabulary
Language | RFC 3066
Format | IMT
Coverage | ISO 3166
Type | DCMI Type?
Subject | Custom
Creator | LDAP?
Publisher | Custom
Contributor | LDAP?
Identifier | Custom
Date | W3C DTF
Rights |
Title |
Relation |
Source |
Description |
10
Mandatory
  • DC recommends specific best practices
  • Language: RFC 3066 (which works with ISO 639)
  • Format: Internet Media Types (aka MIME)
  • These vocabularies are widely used throughout the
    Internet. If you want to do something else, it
    should be justified.
  • Describing physical objects?
  • Use Extent and Medium refinements instead of
    Format.
  • Regional (vs. National) dialects?
  • a) Why?
  • b) Consider a custom element in addition to
    standard Language
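
Values for these two elements can at least be checked for shape before they are stored. The sketch below follows the tag syntax given in RFC 3066 (a primary subtag of 1 to 8 letters, then hyphen-separated subtags of 1 to 8 letters or digits); the media-type pattern is a simplified assumption, and neither check consults the ISO 639 or IANA registries. The function names are hypothetical.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, then optional
# subtags of 1-8 letters or digits, separated by hyphens.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Simplified "type/subtype" shape of an Internet Media Type. This is an
# assumption for the sketch; it does not consult the IANA registry.
MEDIA_TYPE = re.compile(r"[a-z]+/[A-Za-z0-9][A-Za-z0-9.+-]*")

def is_language_tag(value):
    """True if the value has the syntactic shape of an RFC 3066 tag."""
    return LANGUAGE_TAG.fullmatch(value) is not None

def is_media_type(value):
    """True if the value looks like type/subtype."""
    return MEDIA_TYPE.fullmatch(value) is not None

print(is_language_tag("en"))                # True
print(is_language_tag("en-US"))             # True
print(is_language_tag("English language"))  # False
print(is_media_type("text/html"))           # True
print(is_media_type("HTML page"))           # False
```

A syntactic check like this catches free-text values entered by mistake; confirming that a code is actually registered would need the registry data as well.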

11
Likely
  • DC recommends specific best practices
  • Coverage: ISO 3166
  • ISO 3166 should be used unless you have good
    reasons to use something else
  • Consider the Getty Thesaurus of Geographic Names
    if you need cities, rivers, etc.
    (http://www.getty.edu/research/conducting_research/vocabularies/tgn/)
  • DC provides Encodings for both
  • Type: DCMITypes
    (http://dublincore.org/documents/dcmi-type-vocabulary/)
  • DCMIType list is not necessarily a best practice
  • No widely accepted type list exists, so a custom
    list is likely

12
May be
  • Creator, Contributor could come from an
    authority file
  • LC NAF in library contexts
  • LDAP Directory in corporate contexts
  • Recommended where possible
  • Many exceptions where author is outside LDAP
  • Publisher could come from an authority file
  • Org chart in corporate contexts e.g. internal
    records management system.
  • Identifier should be a URI
  • The organization may manage these, but it's
    typically a text field, not a controlled list.

13
Subject and extensions
  • Best practice: Use pre-defined subject schemes,
    not user-selected keywords.
  • DC Encodings (DDC, LCC, LCSH, MeSH, UDC) are most
    useful in library contexts.
  • Not useful for most corporate needs
  • Recommended: Factor Subject into separate
    facets.
  • People, Places, Organizations, Events, Objects,
    Products and Services, Industry sectors, Content
    types, Audiences, Business Functions,
    Competencies, ...
  • Store the different facets in different fields
  • Use DC elements where appropriate (coverage,
    type, audience, ...)
  • Extend with custom elements for other fields
    (industry, products, ...)

14
Thesauri
  • A Thesaurus is a collection of selected
    vocabulary (preferred terms or descriptors) with
    links among synonymous, equivalent, broader,
    narrower and other related terms

15
Standards
  • National and International Standards for Thesauri
  • ANSI/NISO Z39.19-1994: American National
    Standard Guidelines for the Construction, Format,
    and Management of Monolingual Thesauri
  • ANSI/NISO Draft Standard Z39.4-199x: American
    National Standard Guidelines for Indexes in
    Information Retrieval
  • ISO 2788: Documentation - Guidelines for the
    establishment and development of monolingual
    thesauri
  • ISO 5964: Documentation - Guidelines for the
    establishment and development of multilingual
    thesauri

16
Thesaurus Examples
  • Examples
  • The ERIC Thesaurus of Descriptors
  • The Medical Subject Headings (MeSH) of the
    National Library of Medicine
  • The Art and Architecture Thesaurus

17
ERIC Thesaurus Entry
18
ERIC Thesaurus Online
http://www.eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=Thesaurus&_nfls=false
19
MeSH
20
MeSH Online
http://www.nlm.nih.gov/mesh/meshhome.html
21
Dewey
  • Dewey Decimal Classification System (DDC) first
    published in 1876 by Melvil Dewey
  • Most widely used classification system in the
    world (used in 135 countries)
  • In this country used primarily by public and
    school libraries
  • Maintained by an editorial team based at the
    Library of Congress (published by OCLC)

22
Dewey
  • DDC is divided into ten main classes, then ten
    divisions, each division into ten sections
  • The first digit in each three-digit number
    represents the main class.
  • 500: natural sciences and mathematics.
  • The second digit in each three-digit number
    indicates the division.
  • 500 is used for general works on the sciences
  • 510 for mathematics
  • 520 for astronomy
  • 530 for physics

23
Dewey
  • The third digit in each three-digit number
    indicates the section.
  • 530 is used for general works on physics
  • 531 for classical mechanics
  • 532 for fluid mechanics
  • 533 for gas mechanics
  • A decimal point follows the third digit in a
    class number, after which division by ten
    continues to the specific degree of
    classification needed.
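
The digit-by-digit reading described on the last two slides can be written out as a small function. This is a sketch under the assumption that the class number always has exactly three digits before the decimal point; the function name `dewey_levels` is hypothetical, and the decimal extension in the second call is an arbitrary illustration.

```python
def dewey_levels(class_number):
    """Split a DDC class number into main class, division, and section."""
    whole = class_number.split(".")[0]
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(
            f"expected three digits before the point: {class_number!r}"
        )
    return {
        "main_class": whole[0] + "00",  # first digit, e.g. 500
        "division": whole[:2] + "0",    # first two digits, e.g. 530
        "section": whole,               # all three digits, e.g. 532
    }

print(dewey_levels("532"))
# {'main_class': '500', 'division': '530', 'section': '532'}
print(dewey_levels("532.05")["division"])  # 530
```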

24
Library of Congress Subjects
  • Essentially an artificial indexing language
  • Based on literary warrant
  • Entry vocabulary provided in the form of
    reference structure
  • Moving slowly towards a real thesaurus structure
    (not there yet)
  • Not faceted: subdivisions pre-selected, based on
    individual heading or pattern heading

25
LCSH
  • Digital libraries
  • see from Electronic libraries
  • see from Virtual libraries
  • see broader term Libraries
  • see also Information storage and retrieval
    systems
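
The entry above shows how an entry vocabulary works: non-preferred terms lead to the preferred heading. A minimal sketch of that lookup follows. The heading and its references are taken from the slide; the dictionary layout, the key names, and the function names are assumptions made for illustration.

```python
# One record, transcribed from the "Digital libraries" entry above.
# The dictionary layout and key names are assumptions for this sketch.
HEADINGS = {
    "Digital libraries": {
        "used_for": ["Electronic libraries", "Virtual libraries"],
        "broader": ["Libraries"],
        "see_also": ["Information storage and retrieval systems"],
    },
}

def build_entry_index(headings):
    """Map every entry term (preferred or not) to its preferred heading."""
    index = {}
    for preferred, record in headings.items():
        index[preferred.lower()] = preferred
        for variant in record["used_for"]:
            index[variant.lower()] = preferred
    return index

ENTRY_INDEX = build_entry_index(HEADINGS)

def resolve(term):
    """Return the preferred heading for a term, or None if unknown."""
    return ENTRY_INDEX.get(term.strip().lower())

print(resolve("Virtual libraries"))     # Digital libraries
print(resolve("electronic libraries"))  # Digital libraries
print(resolve("Cyberlibraries"))        # None
```

Indexers and searchers then both land on the same heading regardless of which variant they typed.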

26
Library of Congress Classification
  • 21 basic classes, based on a single alphabetic
    character (K = law, N = art, etc.)
  • Subdivided into two or three alpha characters
    (KF = American law, ND = painting, etc.)
  • Further subdivision by specific numeric
    assignment
  • Author numbers and dates arrange works by a
    particular author together and in chronological
    order
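
The layered structure of an LCC number (letters, then a numeric assignment, then an alphanumeric extension) can be pulled apart with a regular expression. The sketch below handles only the simple shape of a number such as QL638.E55; real LCC numbers have more variants, and the pattern, field names, and function name are assumptions for illustration.

```python
import re

# Assumed shape: 1-3 capital letters, a number (optionally decimal),
# and an optional extension such as ".E55" after a point.
CALL_NUMBER = re.compile(r"([A-Z]{1,3})(\d+(?:\.\d+)?)(?:\.([A-Z]\d+))?")

def parse_lcc(call_number):
    """Split a simple LCC number into its layers."""
    match = CALL_NUMBER.fullmatch(call_number)
    if match is None:
        raise ValueError(f"unrecognized LCC number: {call_number!r}")
    letters, number, cutter = match.groups()
    return {
        "basic_class": letters[0],  # single letter, e.g. Q
        "subclass": letters,        # two or three letters, e.g. QL
        "number": number,           # numeric assignment, e.g. 638
        "cutter": cutter,           # extension after the point, or None
    }

print(parse_lcc("QL638.E55"))
# {'basic_class': 'Q', 'subclass': 'QL', 'number': '638', 'cutter': 'E55'}
```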

27
LCC
  • 153 $a QL638.E55 $h Zoology $h Chordates.
    Vertebrates $h Fishes $h Systematic divisions
    $h Osteichthyes (Bony fishes). By family, A-Z
    $h Families $j Engraulidae (Anchovies)
  • $a: Classification number--single number or
    beginning number of span (R)
  • $h: Caption hierarchy
  • $j: Caption (lowest level, relating to the
    specific number in $a)

28
DMOZ: A worst-case example of a unified subject
  • DMOZ has over 600k categories
  • Most are a combination of common facets:
    Geography, Organization, Person, Document Type,
    ...
  • (e.g.) Top > Regional > Europe > Spain > Travel
    and Tourism > Travel Guides
  • www.dmoz.org

29
History of Faceted Navigation
  • Relatively new, whereas taxonomies go back to
    Aristotle
  • S. R. Ranganathan: 1960s
  • Issue of Compound Subjects
  • The Universe consists of PMEST
  • Personality, Matter, Energy, Space, Time
  • Classification Research Group: 1950s, 1970s
  • Based on Ranganathan, simplified, less
    doctrinaire
  • Principles
  • Division: a facet must represent only one
    characteristic
  • Mutual Exclusivity
  • From Classification Theory to Web Implementation
  • An idea waiting for a technology
  • Multiple Filters / dimensions

30
What are Facets?
  • Facets are not categories
  • Entities or concepts belong to a category
  • Entities have facets
  • Facets are metadata - properties or attributes
  • Entities or concepts fit into one category
  • All entities have all facets defined by set of
    values
  • Facets are orthogonal: mutually exclusive
    dimensions
  • An event is not a person is not a document is not
    a place.
  • A winery is not a region is not a price is not a
    color.
  • Relations between facets, subfacets, and foci
    (elements) are not restricted to hierarchical
    generalization-specialization relations
  • Combined using grammars of order and relation to
    form compound descriptions

31
Faceted Classification
  • Clearly distinguishes between semantic
    relationships and syntactic relationships
  • Semantic relationships
  • Within a facet
  • Containment relations
  • Syntactic relationships
  • Across facets
  • Combinatoric relations
  • Have a syntax for syntactic combination of
    semantic terms

32
Semantic and Syntactic Relationships
  • Semantic relationships
      • Is-A (thing/kind, genus/species)
          • Mammals
              • Primates
                  • Humans
      • Has-Parts
          • Human
              • Head
                  • Eyes
  • Syntactic relationships
      • Compounds
          • Wheat + harvesting = wheat harvesting
          • Object + operation = operation on object

33
What is Faceted Navigation?
  • Not a Yahoo-style Browse
  • Computer Stores under Computers and Internet
  • One value per facet per entity
  • Faceted Navigation is not hierarchical
  • Tree: travel up and down, not across
  • Facets are filters, multidimensional
  • Facets are applied at search-results time:
    post-coordination, not pre-coordination (Advanced
    Search)
  • Faceted Navigation is an active interface: a
    dynamic combination of search and browse

34
When to Use Faceted Navigation: Advantages
  • Systematic Advantages
  • Need fewer Elements
  • 4 facets of 10 nodes = 10,000-node taxonomy
  • Ability to Handle Compound Subjects
  • Content Management Advantages
  • Easier to categorize: not as conceptual
  • Fewer elements are simpler, so auto-classification
    works better
  • Flexible: can add new facets, elements in facet

35
When to Use Faceted Navigation: Advantages
(Implementation)
  • More intuitive: easy to guess what is behind
    each door
  • Simplicity of internal organization
  • 20 questions: we know and use
  • Dynamic selection of categories
  • Allow multiple perspectives
  • Trick Users into using Advanced Search
  • wine where color = red, price = x-y, etc.
  • Click on color red, click on price x-y, etc.
  • Flexible: can be combined with other navigation
    elements
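
Each click in the wine example above adds one more filter to the current result set, which is what post-coordination amounts to in practice. The sketch below shows that idea only; the catalogue entries, field names, and function names are hypothetical and do not describe any real site's implementation.

```python
from collections import Counter

# Hypothetical catalogue; names, regions, and prices are made up.
WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": 18},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": 42},
    {"name": "Wine C", "color": "white", "region": "California", "price": 22},
]

def matches(item, facet, wanted):
    """A wanted value is an exact value or an inclusive (low, high) range."""
    if isinstance(wanted, tuple):
        low, high = wanted
        return low <= item[facet] <= high
    return item[facet] == wanted

def select(items, **facets):
    """Keep the items that satisfy every selected facet value."""
    return [
        item for item in items
        if all(matches(item, facet, wanted) for facet, wanted in facets.items())
    ]

def facet_counts(items, facet):
    """Count the remaining items per value, as shown next to facet links."""
    return dict(Counter(item[facet] for item in items))

reds = select(WINES, color="red")
print([w["name"] for w in reds])        # ['Wine A', 'Wine B']
cheap_reds = select(WINES, color="red", price=(0, 25))
print([w["name"] for w in cheap_reds])  # ['Wine A']
print(facet_counts(reds, "region"))     # {'California': 1, 'Australia': 1}
```

The user never writes the compound query; it is assembled from the facet values clicked so far, and the counts tell them what each further click would leave.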

36
When to Use Faceted Navigation: Disadvantages
  • Systematic Disadvantages
  • Lack of Standards for Faceted Classifications
  • Every project is a unique customization
  • Implementation Disadvantages
  • Loss of Browse Context
  • Difficult to grasp scope and relationships
  • No immediate support for popular subjects
  • Essential Limit of Faceted Navigation
  • Limited Domain Applicability: type and size
  • Entities, not concepts, documents, web sites

37
Developing Facet Structure: Selection of Facets
(Theory)
  • Issue: Complete Model of a domain
  • Ranganathan: PMEST
  • Personality: person, animal, event
  • Matter: what x is made of
  • Energy: how x changes
  • Space: where x is
  • Time: when x happens
  • Three Planes: Idea, Verbal, Notational

38
Facets: an example
  • A Language
      • a English
      • b French
      • c Spanish
  • B Genre
      • a Prose
      • b Poetry
      • c Drama
  • C Period
      • a 16th Century
      • b 17th Century
      • c 18th Century
      • d 19th Century
  • Aa: English Literature
  • AaBa: English Prose
  • AaBaCa: English Prose, 16th Century
  • AbBbCd: French Poetry, 19th Century
  • BcCd: Drama, 19th Century
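
The notation on this slide is regular enough to decode mechanically: each capital letter names a facet and the lowercase letter after it names a focus within that facet. A minimal sketch follows, using the three facets listed above; the dictionary layout and the function name `decode` are assumptions for illustration.

```python
import re

# Facets and foci as listed on the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(notation):
    """Turn a compound notation such as 'AaBaCa' into facet/focus pairs."""
    pairs = re.findall(r"([A-Z])([a-z])", notation)
    # Reject input that is not a clean sequence of letter pairs.
    if "".join(facet + focus for facet, focus in pairs) != notation:
        raise ValueError(f"malformed notation: {notation!r}")
    return {FACETS[facet][0]: FACETS[facet][1][focus] for facet, focus in pairs}

print(decode("Aa"))
# {'Language': 'English'}
print(decode("AaBaCa"))
# {'Language': 'English', 'Genre': 'Prose', 'Period': '16th Century'}
print(decode("AbBbCd"))
# {'Language': 'French', 'Genre': 'Poetry', 'Period': '19th Century'}
```

Because a facet can simply be left out, the same small tables describe both broad and narrow subjects without enumerating every combination in advance.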

39
Developing Facet Structure: Selection of Facets
(Practice): Wine.com
  • Top Rated Wines
      • 90+ under $20
  • Top Sellers
      • Cabernet Sauvignon
      • Pinot Noir
  • Hot Features
      • Wine outlet
      • Sideways collection
  • Region
      • Australia, California
  • Type
      • Red Wine, White, Bubbly
  • Winery
      • Alphabetical listing
  • Price
      • $25 and below
      • $25-$50

40
Faceted Approach
  • Power
  • 4 independent categories of 10 nodes each =
    10,000 combinations (10^4)
  • Faster construction
  • Use existing taxonomies in specific fields
  • Reduced maintenance cost
  • More opportunity for data reuse
  • Can be easier to navigate with appropriate UI

60 nodes yield 24,000 combinations
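
The arithmetic behind the "power" claim is a sum versus a product: the nodes to maintain grow with the sum of the facet sizes, while the subjects that can be expressed grow with their product. A short sketch, with a hypothetical function name:

```python
from math import prod

def facet_economy(facet_sizes):
    """Return (nodes to maintain, combinations that can be expressed)."""
    return sum(facet_sizes), prod(facet_sizes)

# The slide's case: 4 independent facets of 10 nodes each.
nodes, combinations = facet_economy([10, 10, 10, 10])
print(nodes, combinations)  # 40 10000
```

An enumerative taxonomy covering the same ground would need one node per combination, which is where the 10,000-node figure comes from.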
41
Organization
  • Either expose them directly in the user interface
    (post-coordinating) or
  • Combine them in a minimal hierarchy
    (pre-coordination) or
  • Hide them from the user!
  • Post-coordination takes software support, which
    may be fancy or basic.
  • How many facets?
  • log10(number of documents) as a guide
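
As a quick illustration of the log10 rule of thumb, the sketch below turns a collection size into a suggested facet count. Rounding to the nearest whole number and the floor of one facet are assumptions added here; the slide only offers the logarithm as a rough guide.

```python
import math

def suggested_facet_count(document_count):
    """Rule-of-thumb facet count: log10 of the collection size."""
    if document_count < 1:
        raise ValueError("document_count must be at least 1")
    # Rounding and the minimum of 1 are assumptions for this sketch.
    return max(1, round(math.log10(document_count)))

for n in (100, 10_000, 1_000_000):
    print(n, suggested_facet_count(n))
# 100 2
# 10000 4
# 1000000 6
```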

42
Element | Data Type | Length | Req. / Repeat | Source | Purpose | Dublin Core
Asset Metadata
Unique ID | Integer | Fixed | 1 | System supplied | Basic accountability | dc:identifier
Recipe Title | String | Variable | 1 | Licensed Content | Text search, results display | dc:title
Recipe summary | String | Variable | 1 | Licensed Content | Content | dc:description
Main Ingredients | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve and aggregate recipes, generate shopping list | X
Subject Metadata
Meal Types | List | Variable | | Meal Types vocab | Browse or group recipes, filter search results | X
Cuisines | List | Variable | | Cuisines | Browse or group recipes, filter search results | X
Courses | List | Variable | | Courses vocab | Browse or group recipes, filter search results | X
Cooking Method | Flag | Fixed | | Cooking vocab | Browse or group recipes, filter search results | X
Link Metadata
Recipe Image | Pointer | Variable | ? | Product Group | Merchandize products | dcterms:hasPart
Use Metadata
Rating | String | Variable | 1 | Licensed Content | Filter, rank, evaluate recipes |
Release Date | Date | Fixed | 1 | Product Group | Publish, feature new recipes | dc:date
Fixed values: dc:type=recipe, dc:format=text/html, dc:language=en
43
Project/exercise
  • Produce a faceted classification of your documents
    (at least 3 facets, min 5 foci each)
  • Encode the facet classification as an extension
    of dc:subject
  • Attribute facets to your docs.
  • Check extensibility by adding 10 new docs