Prof. Ray Larson - PowerPoint PPT Presentation

About This Presentation
Title:

Prof. Ray Larson

Description:

Lecture 22: Thesauri and Metadata SIMS 202: Information Organization and Retrieval Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am ... – PowerPoint PPT presentation

Slides: 77
Provided by: ValuedGate891
Transcript and Presenter's Notes

Title: Prof. Ray Larson


1
Lecture 22: Thesauri and Metadata
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2004
  • http://www.sims.berkeley.edu/academics/courses/is202/f04/

2
Lecture Overview
  • Review (and expansion)
  • Facetted Classification
  • Thesaurus Design and Development
  • Metadata And Markup
  • XML As A Metadata Lingua Franca
  • Dublin Core Revisited
  • METS
  • Other Metadata schemas and protocols in XML
  • Discussion

4
Indexing Languages
  • An index is a systematic guide designed to
    indicate topics or features of documents in order
    to facilitate retrieval of documents or parts of
    documents
  • An indexing language is the set of terms used in
    an index to represent topics or features of
    documents, and the rules for combining or using
    those terms

5
Controlled Vocabularies
  • Vocabulary control is the attempt to provide a
    standardized and consistent set of terms (such as
    subject headings, names, classifications, etc.)
    with the intent of aiding the searcher in finding
    information
  • That is, it is an attempt to provide a consistent
    set of descriptions for use in (or as) metadata

6
Hierarchical Classification
Slide author: Marti Hearst
7
Labeled Categories for Hierarchical Classification
  • LITERATURE
      • 100 English Literature
          • 110 English Prose
              • English Prose 16th Century
              • English Prose 17th Century
              • English Prose 18th Century
              • ...
          • 111 English Poetry
              • 121 English Poetry 16th Century
              • 122 English Poetry 17th Century
              • ...
          • 112 English Drama
              • 130 English Drama 16th Century
      • 200 French Literature

Slide author: Marti Hearst
8
Facetted Categories
  • Mutually exclusive
      • Non-overlapping, distinct categories
  • Relational
      • Relations between facets, subfacets, and foci
        (elements) are not restricted to hierarchical
        generalization-specialization relations
  • Composable
      • Combined using grammars of order and relation to
        form compound descriptions

9
Facetted Classification Along With Labeled Categories
  • A Language
      • a English
      • b French
      • c Spanish
  • B Genre
      • a Prose
      • b Poetry
      • c Drama
  • C Period
      • a 16th Century
      • b 17th Century
      • c 18th Century
      • d 19th Century
  • Compound descriptions
      • Aa English Literature
      • AaBa English Prose
      • AaBaCa English Prose 16th Century
      • AbBbCd French Poetry 19th Century
      • BcCd Drama 19th Century

Slide author: Marti Hearst
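The notation on this slide can be read mechanically: each pair of characters names a facet (capital letter) and one of its foci (lowercase letter). What follows is a minimal Python sketch of that reading, assuming codes are always written in facet order A, B, C. The function names `decode` and `encode` are illustrative and not part of the lecture. The slide labels the bare code Aa as "English Literature"; this sketch only joins focus labels, so it yields "English" for that code.

```python
# Facet tables copied from the slide; everything else is illustrative.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(code: str) -> str:
    """Turn a compound code such as 'AbBbCd' into a readable label."""
    if len(code) % 2 != 0:
        raise ValueError(f"code must consist of facet/focus pairs: {code!r}")
    labels = []
    for i in range(0, len(code), 2):
        facet, focus = code[i], code[i + 1]
        _name, foci = FACETS[facet]
        labels.append(foci[focus])
    return " ".join(labels)


def encode(**foci: str) -> str:
    """Build a code from facet=focus pairs, always in facet order A, B, C."""
    parts = []
    for facet in sorted(FACETS):
        name, table = FACETS[facet]
        wanted = foci.get(name.lower())
        if wanted is None:
            continue  # this facet is not part of the description
        for key, label in table.items():
            if label == wanted:
                parts.append(facet + key)
                break
        else:
            raise KeyError(f"{wanted!r} is not a focus of facet {name}")
    return "".join(parts)


print(decode("AbBbCd"))
print(encode(genre="Drama", period="19th Century"))
```

Leaving a facet out of `encode` simply omits its pair, which is how a description such as "Drama 19th Century" can exist without a language.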
10
Ranganathan
  • PMEST Facets
      • P(ersonality)
          • WHO: Types of things
      • M(atter)
          • WHAT: Constituent materials
      • E(nergy)
          • HOW: Action or activity terms
      • S(pace)
          • WHERE: Where things occur
      • T(ime)
          • WHEN: When things occur

11
Classical Facet Analysis
  • What is being done?
      • Entity
      • Kind
      • Product
      • By-Product
  • What are its parts?
      • Part
  • What are its properties?
      • Property
      • Material
  • How is this achieved?
      • Process
  • By what means?
      • Operation
  • By whom?
      • Agent
      • Patient
  • Where?
      • Space
  • When?
      • Time

12
Semantic and Syntactic Relationships
  • Semantic relationships
      • Is-A (thing/kind, genus/species)
          • Mammals
              • Primates
                  • Humans
      • Has-Parts
          • Human
              • Head
                  • Eyes
  • Syntactic relationships
      • Compounds
          • Wheat + harvesting = wheat harvesting
          • Object + operation = operation on object

13
Facetted Classification
  • Clearly distinguishes between semantic
    relationships and syntactic relationships
      • Semantic relationships
          • Within a facet
          • Containment relations
      • Syntactic relationships
          • Across facets
          • Combinatoric relations
  • Have a syntax for syntactic combination of
    semantic terms

14
Power of Facet Combinations
  • The syntactic relations of facetted
    classifications enable a small controlled
    vocabulary to produce:
      • Many, many structured descriptions
      • Complex, but formally structured descriptions
        using nested compound descriptions
      • Descriptions for things we do not have words for
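To make the "small vocabulary, many descriptions" point concrete, the sketch below counts the descriptions that the Language, Genre, and Period facets from the earlier slide can produce. It assumes, for simplicity, that a description uses at most one focus per facet and at least one facet overall; nested compound descriptions are not modelled, so the real number would be larger.

```python
from itertools import product

# Foci taken from the earlier Language / Genre / Period slide.
FOCI = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}


def compound_descriptions(foci):
    """Yield every description that uses at most one focus per facet."""
    # None stands for "this facet is not used in the description".
    options = [[None] + terms for terms in foci.values()]
    for combo in product(*options):
        chosen = [term for term in combo if term is not None]
        if chosen:  # skip the empty description
            yield " ".join(chosen)


all_descriptions = list(compound_descriptions(FOCI))
vocabulary_size = sum(len(terms) for terms in FOCI.values())
print(vocabulary_size, "terms ->", len(all_descriptions), "descriptions")
```

With three facets the count is (3 + 1) × (3 + 1) × (4 + 1) − 1, and each added facet multiplies it again.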

15
Lecture Overview
  • Review (and expansion)
  • Facetted Classification
  • Thesaurus Design and Development
  • Metadata And Markup
  • XML As A Metadata Lingua Franca
  • Dublin Core Revisited
  • METS
  • Other Metadata schemas and protocols in XML
  • Discussion

16
Types of Indexing Languages
  • Uncontrolled keyword indexing
  • Indexing languages
      • Controlled, but not structured
  • Thesauri
      • Controlled and structured
  • Classification systems
      • Controlled, structured, and coded
  • Facetted classification systems

17
Thesauri
  • A Thesaurus is a collection of selected
    vocabulary (preferred terms or descriptors) with
    links among synonymous, equivalent, broader,
    narrower and other related terms
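One possible in-memory shape for such a collection is sketched below. It uses the conventional thesaurus relationship labels UF (used for), BT (broader term), NT (narrower term), and RT (related term); the class and method names are assumptions made for illustration, not an interface from the lecture or from any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    use_for: set = field(default_factory=set)   # UF: non-preferred synonyms
    broader: set = field(default_factory=set)   # BT
    narrower: set = field(default_factory=set)  # NT
    related: set = field(default_factory=set)   # RT


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, label):
        """Return the record for a descriptor, creating it if needed."""
        return self.terms.setdefault(label, Term(label))

    def add_synonym(self, preferred, synonym):
        self.term(preferred).use_for.add(synonym)

    def add_broader(self, narrow, broad):
        # BT and NT are reciprocal, so both records are updated together.
        self.term(narrow).broader.add(broad)
        self.term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        # RT is symmetric.
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def all_narrower(self, label):
        """Return every descriptor below `label` in the hierarchy."""
        found = set()
        stack = [label]
        while stack:
            for child in self.term(stack.pop()).narrower:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


thesaurus = Thesaurus()
thesaurus.add_broader("Primates", "Mammals")
thesaurus.add_broader("Humans", "Primates")
print(sorted(thesaurus.all_narrower("Mammals")))
```

Keeping the reciprocal links in one method avoids the common inconsistency where a term lists a broader term that does not list it back.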

18
Thesaurus Standards
  • National and International Standards for Thesauri
      • ANSI/NISO Z39.19-1994: American National
        Standard Guidelines for the Construction, Format
        and Management of Monolingual Thesauri
      • ANSI/NISO Draft Standard Z39.4-199x: American
        National Standard Guidelines for Indexes in
        Information Retrieval
      • ISO 2788: Documentation. Guidelines for the
        establishment and development of monolingual
        thesauri
      • ISO 5964: Documentation. Guidelines for the
        establishment and development of multilingual
        thesauri

19
Thesaurus Examples
  • Examples
      • The ERIC Thesaurus of Descriptors
      • The Medical Subject Headings (MESH) of the
        National Library of Medicine
      • The Art and Architecture Thesaurus

20
ERIC Thesaurus Entry
21
ERIC Thesaurus Alphabetic
22
ERIC Thesaurus KWIC Index
23
ERIC Thesaurus Hierarchies
24
ERIC Thesaurus Groups
25
ERIC Thesaurus Online
http://www.ericfacility.net/extra/pub/thessearch.cfm
26
MESH Entry
27
MESH Alphabetic
28
MESH Tree Structures
29
MESH KWOC Index
30
MESH - Online
http://www.nlm.nih.gov/mesh/meshhome.html
31
AAT Facets
32
AAT Hierarchies (print)
33
AAT Hierarchies (online)
http://www.getty.edu/research/tools/vocabulary/aat/
34
AAT Entry (online)
35
Why Develop a Thesaurus?
  • To provide a conceptual structure or space for
    a body of information
  • To make it possible to adequately describe the
    topical content of information resources at an
    appropriate level of generality or specificity
  • To provide enhanced search capabilities and to
    improve the effectiveness of searching (i.e., to
    retrieve most of the relevant material without
    too much irrelevant material)

36
Why Develop a Thesaurus?
  • To provide vocabulary (or terminological) control
      • When there are several possible terms designating
        a single concept, the thesaurus should lead the
        indexer or searcher to the appropriate concept,
        regardless of the terms they start with
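Vocabulary control of this kind is often implemented as an entry vocabulary: a table of USE references from non-preferred terms to the preferred descriptor. The sketch below assumes a tiny, invented vocabulary; the terms, the `USE` table, and the function names are hypothetical examples rather than content from any real thesaurus.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred descriptor.
USE = {
    "cars": "automobiles",
    "motor cars": "automobiles",
    "autos": "automobiles",
}

# Hypothetical set of preferred descriptors.
DESCRIPTORS = {"automobiles", "bicycles"}


def normalize(term: str) -> str:
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def to_descriptor(term: str):
    """Map whatever the user typed to a preferred descriptor, or None."""
    key = normalize(term)
    if key in DESCRIPTORS:
        return key
    return USE.get(key)


for query in ["Motor  Cars", "automobiles", "trains"]:
    print(query, "->", to_descriptor(query))
```

A `None` result is itself useful: logging those misses is one way to find terms the entry vocabulary still lacks.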

37
Preliminary Considerations
  • What is used now?
      • Continue using an existing thesaurus?
      • Ad hoc modification of existing thesaurus?
      • Develop a new well-structured thesaurus?
  • What is the scope and complexity of the subject
    field?
  • What kind of retrieval objects or data will be
    dealt with?
  • How exhaustive and specific is the desired
    description of objects?

38
Preliminary Considerations
  • The scope and complexity of the field will
    provide some indication of the scope and
    complexity of the thesaurus
  • It is better to plan for a larger and more
    comprehensive system than for a smaller system
    that will rapidly become inadequate as the
    database grows
  • Development of a good thesaurus requires a major
    intellectual effort as well as clerical
    operations like data entry and production of
    sorted lists

39
Development of a Thesaurus
  • Term selection
  • Merging and development of concept classes
  • Definition of broad subject fields and subfields
  • Development of classificatory structure
  • Review, testing, application, revision

40
Flow of Work in Thesaurus Construction
41
1. Term Selection
  • Select sources for the collection of terms
      • Prearranged Sources
      • Open-ended Sources
  • Assign codes to each source
  • Selection of terms
      • For part of pre-arranged and for all open-ended
        sources
  • Enter terms into database with all information

42
1.1 Kinds of Sources
  • Prearranged Sources
      • Existing descriptor lists, classification schemes,
        thesauri
          • This includes universal schemes like DDC or LCSH
      • Nomenclatures of single disciplines
      • Treatises on the terminology of a field
      • Encyclopedias, lexica, dictionaries and
        glossaries
      • Tables of contents of textbooks and handbooks
      • Indexes of journals or abstracting journals
      • Indexes of other publications in the field

43
1.1 Kinds of Sources
  • Open-ended sources
      • Lists of search requests or interest profiles
      • Description of projects/activities to be served
        by the information retrieval system
      • Discussion with specialists in the field
      • Sample of documents in the field
          • Ask users why and how these documents relate to
            the field
          • Have documents indexed by experts in the field
      • Lists of titles of documents in the field
      • Abstracts and reviews of documents
      • Your own knowledge

44
Selection of Sources
  • Prearranged sources require less effort in
    gathering the material, and may already indicate
    some relationships between terms and concepts and
    relationships among terms
  • Open-ended sources can reflect current
    terminology and may provide more complete
    coverage
  • Choose a set of sources that are current, as
    complete as possible, and considered authoritative

45
Selection of Terms
  • In open-ended sources you read through the source
    and pick out terms (i.e. words and phrases) that
    might be useful in retrieval or as references to
    other terms
  • Alternatively, use keyword and phrase extraction
    software to create lists of terms and select from
    those
  • Transfer selected terms to the recording medium
    (cards or database)

46
Work Form: Still relevant?
From Soergel, p. 399
47
2. Merging and Development of Concept Classes
  • Sort Term DB into alphabetical order
  • First Round
      • Merge information for identical terms, possibly
        pulling info from additional sources
  • Second Round
      • Merge synonyms or terms in the same concept class
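The two merging rounds can be sketched in a few lines of Python. The records, source codes, and synonym decisions below are invented for illustration; in practice the second round depends on editorial judgment, which this sketch represents as a hand-written `SAME_CONCEPT` table.

```python
from collections import defaultdict

# Hypothetical raw records from term selection: (term, source code).
RAW = [
    ("Automobiles", "LCSH"),
    ("cars", "REQ"),
    ("automobiles", "ERIC"),
    ("Bicycles", "LCSH"),
    ("Cars", "DOC"),
]

# Hypothetical editorial decisions: term -> label of its concept class.
SAME_CONCEPT = {"cars": "automobiles"}


def first_round(records):
    """Sort alphabetically, then merge records for identical terms."""
    merged = defaultdict(set)
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged[term.lower()].add(source)  # keep every source code
    return dict(merged)


def second_round(terms, same_concept):
    """Group terms into concept classes using the synonym decisions."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in terms.items():
        label = same_concept.get(term, term)
        classes[label]["terms"].add(term)
        classes[label]["sources"] |= sources
    return dict(classes)


terms = first_round(RAW)
concepts = second_round(terms, SAME_CONCEPT)
for label in sorted(concepts):
    print(label, sorted(concepts[label]["terms"]), sorted(concepts[label]["sources"]))
```

Keeping the source codes through both rounds preserves the evidence for each term, which helps later when preferred terms are chosen.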

48
3. Definition of Broad Subject Fields and Subfields
  • Define broad subject fields and sort terms into
    these broad fields
  • Define subfields within each broad field and sort
    terms into these subfields
  • Work out the detailed structure
  • Select preferred terms
  • Merge information for terms in the same concept
    class
  • Repeat these steps
      • For each subfield within a broad field
      • And for each broad field
      • Until all terms have been consolidated and
        preferred terms selected

49
4. Development of Classificatory Structure
  • Produce preliminary version of classified index
    and update the working database
  • Improve classificatory structure
      • Reality check
  • Produce and distribute a version of the
    classified index
      • Distribute to users/experts

50
5. Final Stages
  • Review
  • Testing
  • Application
  • Revision

51
Review
  • Discuss classified index with users/experts
  • Select descriptors and checklist descriptors
  • Assign notational symbols
  • Produce main thesaurus and indexes

52
Testing a Thesaurus
  • Assign descriptors to a sample set of NEW
    documents (use enough to get an idea of any gaps
    in the thesaurus)
  • Test retrieval using sample questions and seeing
    how effectively the thesaurus maps to the
    appropriate descriptor

53
Lecture Overview
  • Review (and expansion)
  • Facetted Classification
  • Thesaurus Design and Development
  • Metadata And Markup
  • XML As A Metadata Lingua Franca
  • Dublin Core Revisited
  • METS
  • Other Metadata schemas and protocols in XML
  • Discussion

54
XML as a common syntax
  • XML (and SGML) provide a way of expressing the
    structure of documents that can be verified and
    validated by document processing systems
  • Documents can be metadata structures
      • Such as the description of a particular
        photograph in our Phone project
  • XML thus provides a way of representing metadata
    descriptions as well as the content that they
    describe

55
XML as a common syntax
  • All XML documents follow some simple rules that
    make them interchangeable and usable across
    different systems
      • All data and markup is in Unicode
      • All elements are marked by begin and end tags
      • All markup is case-sensitive
  • XML DTDs and/or Schemas define the valid
    structure (and sometimes content) of the documents
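The tag-matching and case-sensitivity rules are easy to observe with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which checks only that a document is well-formed; it does not validate against a DTD or Schema. The `photo` and `title` element names are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<photo><title>Sather Tower</title></photo>"
# The start tag is <Title> but the end tag is </title>: case matters.
bad_case = "<photo><Title>Sather Tower</title></photo>"
# <title> is never closed before </photo>.
bad_unclosed = "<photo><title>Sather Tower</photo>"

for sample in (good, bad_case, bad_unclosed):
    print(is_well_formed(sample), sample)
```

Validity in the DTD or Schema sense is a separate, stricter check that needs a validating parser.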

56
Dublin Core
  • Review
      • Simple metadata for describing internet resources
      • For Document-Like Objects
      • 15 Elements

57
Dublin Core Elements
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Other Contributors
  • Date
  • Resource Type
  • Format
  • Resource Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights Management

58
DC XML DTD Implementation
  • There have been various versions
  • This one is the one recommended (required) by the
    Open Archives Initiative Protocol for Metadata
    Harvesting (OAI-PMH)
  • Uses XML namespaces
  • Available at
    http://dublincore.org/documents/2001/09/20/dcmes-xml/

59
DC Element and Attribute Definitions
<!-- The elements from DCMES 1.1 -->

<!-- The name given to the resource. -->
<!ELEMENT dc:title (#PCDATA)>
<!ATTLIST dc:title xml:lang CDATA #IMPLIED>

<!-- An entity primarily responsible for making the content of the resource. -->
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator xml:lang CDATA #IMPLIED>

<!-- The topic of the content of the resource. -->
<!ELEMENT dc:subject (#PCDATA)>
<!ATTLIST dc:subject xml:lang CDATA #IMPLIED>

<!-- An account of the content of the resource. -->
<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description xml:lang CDATA #IMPLIED>

<!-- The entity responsible for making the resource available. -->
<!ELEMENT dc:publisher (#PCDATA)>
<!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>

<!-- An entity responsible for making contributions to the content of the resource. -->
<!ELEMENT dc:contributor (#PCDATA)>
<!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>

<!-- A date associated with an event in the life cycle of the resource. -->
<!ELEMENT dc:date (#PCDATA)>
<!ATTLIST dc:date xml:lang CDATA #IMPLIED>
60
DC Element Definitions (cont.)
<!-- The nature or genre of the content of the resource. -->
<!ELEMENT dc:type (#PCDATA)>
<!ATTLIST dc:type xml:lang CDATA #IMPLIED>

<!-- The physical or digital manifestation of the resource. -->
<!ELEMENT dc:format (#PCDATA)>
<!ATTLIST dc:format xml:lang CDATA #IMPLIED>

<!-- An unambiguous reference to the resource within a given context. -->
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>

<!-- A Reference to a resource from which the present resource is derived. -->
<!ELEMENT dc:source (#PCDATA)>
<!ATTLIST dc:source xml:lang CDATA #IMPLIED>
<!ATTLIST dc:source rdf:resource CDATA #IMPLIED>

<!-- A language of the intellectual content of the resource. -->
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language xml:lang CDATA #IMPLIED>

<!-- A reference to a related resource. -->
<!ELEMENT dc:relation (#PCDATA)>
<!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
<!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>

<!-- The extent or scope of the content of the resource. -->
<!ELEMENT dc:coverage (#PCDATA)>
<!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>

<!-- Information about rights held in and over the resource. -->
<!ELEMENT dc:rights (#PCDATA)>
<!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
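A record using these elements might look like the one embedded in the sketch below, which reads it with Python's standard `xml.etree.ElementTree`. The namespace URI is the Dublin Core element set namespace; the `metadata` wrapper element, the sample values, and the `dc_values` helper are assumptions for illustration. The parser here does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical record; the <metadata> wrapper is an assumption for this sketch.
RECORD = f"""
<metadata xmlns:dc="{DC}">
  <dc:title xml:lang="en">Lecture 22: Thesauri and Metadata</dc:title>
  <dc:creator>Ray Larson</dc:creator>
  <dc:creator>Marc Davis</dc:creator>
  <dc:date>2004</dc:date>
  <dc:language>en</dc:language>
</metadata>
""".strip()


def dc_values(xml_text: str, element: str) -> list:
    """Return the text of every dc:<element> in the record, in order."""
    root = ET.fromstring(xml_text)
    # ElementTree names namespaced tags as "{namespace-uri}localname".
    return [node.text for node in root.iter(f"{{{DC}}}{element}")]


print(dc_values(RECORD, "title"))
print(dc_values(RECORD, "creator"))
```

Because every Dublin Core element is optional and repeatable, the helper returns a list, which may be empty or hold several values.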
61
A More Complex SGML DTD
<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id       CDATA                   #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is
     intended to hold a unique record number for each MARC record in the
     local database. That is to say, it is intended ONLY as an aid in
     maintaining the local database of MARC records -->
<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount,
                      BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>
<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
etc.
62
More Complex DTD (cont.)
<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?,
                        Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?,
                        Fld9XX?)>
<!-- Component Parts of Variable Data Fields -->
<!-- Numbers Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017, Fld018?, Fld019, Fld020,
                        Fld022, Fld023, Fld024, Fld025, Fld027, Fld028, Fld029,
                        Fld030, Fld032, Fld033, Fld034, Fld035, Fld036?,
                        Fld037, Fld039, Fld040?, Fld041?, Fld042?, Fld043?, Fld044?,
                        Fld045?, Fld046?, Fld047?, Fld048, Fld050, Fld051, Fld052,
                        Fld055, Fld060, Fld061, Fld066?, Fld069, Fld070,
                        Fld071, Fld072, Fld074, Fld080?,
                        Fld082, Fld084, Fld086, Fld088, Fld090, Fld096)>
<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>
<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211, Fld212, Fld214, Fld222,
                      Fld240?, Fld242, Fld243?, Fld245, Fld246, Fld247)>
<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255, Fld256?, Fld257?, Fld260?,
                        Fld261?, Fld262?, Fld263?, Fld265?)>
<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300, Fld305, Fld306?, Fld310?, Fld315?,
                        Fld321, Fld340, Fld350?, Fld351, Fld355,
                        Fld357, Fld362)>
etc.
63
Complex DTD (cont.)
<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s))>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank)                 #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk)     #IMPLIED>
etc.
<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>
64
Example: METS
  • METS, the Metadata Encoding and Transmission
    Standard, is a new schema intended to provide
    a standard for encoding descriptive,
    administrative, and structural metadata regarding
    objects within a digital library, expressed using
    the XML schema language of the World Wide Web
    Consortium
  • METS can be used to wrap complex sets of data
    (the actual data, with rules for encoding binary
    forms), the metadata describing the parts of that
    data, and the sequence and conditions under which
    the data can or should be presented or displayed
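As a rough illustration of the wrapping idea, the sketch below assembles a skeletal METS document with Python's standard `xml.etree.ElementTree`: a descriptive metadata section holding a Dublin Core title, a file section pointing at the data, and a structural map giving the page order. The element and attribute names follow the METS schema as far as this sketch goes, but the output is not validated against it, administrative metadata (`amdSec`) is left out, and the `build_mets` helper, IDs, and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)
ET.register_namespace("dc", DC)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_mets(title, page_urls):
    """Wrap a title and an ordered list of page files in a METS skeleton."""
    root = ET.Element(q("mets"))

    # Descriptive metadata: a Dublin Core title wrapped inside the document.
    dmd = ET.SubElement(root, q("dmdSec"), ID="dmd1")
    wrap = ET.SubElement(dmd, q("mdWrap"), MDTYPE="DC")
    data = ET.SubElement(wrap, q("xmlData"))
    ET.SubElement(data, f"{{{DC}}}title").text = title

    # File section: where the actual data lives.
    group = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))

    # Structural map: the sequence in which the files should be presented.
    top = ET.SubElement(ET.SubElement(root, q("structMap")), q("div"), DMDID="dmd1")

    for n, url in enumerate(page_urls, start=1):
        file_el = ET.SubElement(group, q("file"), ID=f"file{n}")
        ET.SubElement(file_el, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
        page = ET.SubElement(top, q("div"), ORDER=str(n))
        ET.SubElement(page, q("fptr"), FILEID=f"file{n}")
    return root


doc = build_mets("Sample object",
                 ["http://example.org/p1.jpg", "http://example.org/p2.jpg"])
print(ET.tostring(doc, encoding="unicode"))
```

The structural map refers to files only by ID, so the same files could be presented in a different order by adding another `structMap` without touching the file section.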

65
Other Protocols and Metadata Systems Using XML
  • SOAP (Simple Object Access Protocol)
  • SRW (Search and Retrieval for the Web)
  • OAI-PMH (Open Archives Initiative Protocol for
    Metadata Harvesting)
  • RDF (Resource Description Framework)
  • MPEG-7 (more next time)
  • METS
  • ADL Gazetteer Protocol
  • DAV/DASL (Distributed Authoring and Versioning)
  • SDLIP (Simple Digital Library Interoperability
    Protocol)
  • Also versions of MARC and other formats in XML

66
Lecture Overview
  • Review
  • Types of Controlled Vocabularies
  • Name Authority Control
  • Thesaurus Design and Development
  • Controlled Vocabularies for topical description
  • Thesaurus Design
  • Steps In Thesaurus Development
  • Indexing
  • Discussion (including some from last time)

67
Discussion Questions
  • Morgan Ames on Vickery
  • Though facets are a powerful tool for organizing
    information, they can be very time-consuming to
    define.  Vickery describes the creation of
    facets, starting with the analysis of terms used
    by a user group, then the sorting of the terms
    into facets, the development of facets (depending
    on how often they're used), the arrangement of
    the facets, and finally, the establishment of a
    notation for the facets.  Could one automate some
    or all of the process of defining facets for a
    particular area - say, an online community?  If
    so, which parts could be automated, and how?  If
    not, why not - what are the limitations of
    automation?

68
Discussion Questions
  • Lilia Manguy on Thesaurus Construction
  • The reading mentions thesauri being constructed
    for institutions. What are some examples of
    institutions with specialized thesauri? Why were
    they deemed necessary?

69
Discussion Questions
  • Lilia Manguy on Thesaurus Construction
  • In our field, what are some scenarios in which a
    thesaurus would need to be constructed? How would
    you determine who would be your expert
    consultants? Who would you choose?

70
Discussion Questions
  • Lilia Manguy on Thesaurus Construction
  • Using the process outlined in the reading for
    constructing a thesaurus, how would you qualify
    whether your thesaurus is good or bad?

71
Discussion Questions
  • Sorry: we will come back to these in the section
    on Interfaces for IR
  • Christine Jones on Card Sorting
  • Carrie Burgener on Flamenco

72
Discussion Questions
  • Chitra Madhwacharyula on Org. of Info., Chap 3
  • Associative indexing is the concept in which
    items are linked together and any item can lead
    to access of other related information (e.g.
    hypertext documents). Is it possible to have
    efficient and usable associative indexing without
    the use of computers and if so how?
  • How does Google use the concept of associative
    indexing?

73
Discussion Questions
  • Chitra Madhwacharyula on Org. of Info., Chap 3
  • In the 1930s Vannevar Bush developed the idea of
    memex, "a device in which an individual stores
    all his books, records, and communications, and
    which is mechanized so that it may be consulted
    with exceeding speed and flexibility". It was
    based on the concept of associative indexing. How
    similar/dissimilar is this device to the current
    generation cataloging and/or retrieval systems?

74
Discussion Questions
  • Jaime Parada on Org. of Info., Chap 5
  • The fierce competition between vendors in the
    OPAC and Online Index market may increase the
    development of new innovative technology and
    better systems, but it contributes to the lack of
    standardization in system design. How can the
    Z39.50 protocol help with this issue? Does an
    increase in standardization reduce the innovative
    nature of vendors and the creation of better
    systems?
  • User-centered design may refer to "enhancing
    system performance to deliver better results,
    designing for particular users since one size
    does not fit all". How does user-centered design
    interfere with standardization?

75
Announcements and Next
  • Midterms Returned
  • Extra Credit
  • Next time
      • Multimedia Information Organization and Retrieval
  • Readings/Discussion
      • Computational Media Aesthetics: Finding Meaning
        Beautiful
      • The Holy Grail of Content-Based Media Analysis
      • Editing Out Video Editing

76
(No Transcript)