Title: Metadata for the Web: From Discovery to Description
1Metadata for the Web: From Discovery to Description
- CS 502 20020224
- Carl Lagoze Cornell University
2The fifteen Dublin Core Elements
http://dublincore.org/usage/terms/dc/current-elements/
http://dublincore.org
3A Pidgin for Digital Tourists
- Metadata is language
- Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains.
- Speakers of different languages naturally "pidginize" to communicate
- E.g., tourists using simple phrases to order beer ("zwei Bier bitte" "dva pivo" "biru o san bai"...)
- We are all "tourists" on the global Internet.
4What is the Dublin Core (1)?
- A simple set of properties to support resource discovery on the web (fuzzy search buckets)?
5What is the Dublin Core (2)?
- An extensible ontology for resource description?
(Figure labels: Greater Functionality, Cost)
6What is the Dublin Core (3)?
- A cross-domain switchboard for interoperable metadata?
- Projections to application-specific metadata vocabularies
(Figure label: Switchboard)
7Dublin Core Qualifiers
- From fuzzy buckets to more specific description
- Model of graceful degradation
- Support both simplicity and specificity
- Intra-domain and inter-domain semantics
8Varieties of Qualifiers: Element Refinements
- Make the meaning of an element narrower or more specific.
- Narrowing implies an "is a" relationship
- a "date created" is a "date"
- an "is part of" relation is a "relation"
- If your software does not understand the qualifier, you can safely ignore it.
9Varieties of Qualifiers: Value Encoding Schemes
- Says that the value is
- a term from a controlled vocabulary (e.g., Library of Congress Subject Headings)
- a string formatted in a standard way (e.g., "2001-05-02" means May 2, not February 5)
- Even if a scheme is not known by software, the value should be "appropriate" and usable for resource discovery.
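As an illustration of how software might treat an encoding scheme, here is a minimal Python sketch. The function name `interpret_value` is hypothetical, and the sketch assumes that only full YYYY-MM-DD dates need to be recognized; any unknown scheme simply leaves the value as a plain string that is still usable for discovery.

```python
from datetime import date
from typing import Optional, Union


def interpret_value(value: str, scheme: Optional[str] = None) -> Union[date, str]:
    """Interpret a value using its encoding scheme when the scheme is known.

    Unknown schemes are not an error: the raw string is returned so it can
    still be used for discovery (for example, keyword search).
    """
    if scheme in ("W3CDTF", "ISO8601"):
        try:
            # Assumption: only the full YYYY-MM-DD form is handled here.
            return date.fromisoformat(value)
        except ValueError:
            return value
    return value


parsed = interpret_value("2001-05-02", scheme="W3CDTF")
print(parsed.year, parsed.month, parsed.day)  # year, then month, then day

# A scheme this sketch does not know: the value passes through unchanged.
print(interpret_value("Languages -- Grammar", scheme="LCSH"))
```

The fallback branch is the point of the design: a known scheme adds precision, while an unknown one costs nothing.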
10A Grammar of Dublin Core
- http://www.dlib.org/dlib/october00/baker/10baker.html
- By design not as subtle as mother tongues, but easy to learn and extremely useful in practice
- Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)
- Simple grammars: sentences (statements) follow a simple fixed pattern...
11Example Dublin Core statements
- Resource has Title 'Grammar of Dublin Core'.
- Resource has Creator 'Tom Baker'.
- Resource has Subject 'Metadata'.
- Resource has Relation http://foo.org/file.htm.
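The fixed "Resource has property value" pattern maps naturally onto a small data structure. The following Python sketch is only one possible representation; the names `Statement` and `make_statement` are hypothetical, and the element list is the fifteen-element Dublin Core set.

```python
from typing import NamedTuple

# The fifteen Dublin Core element names.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


class Statement(NamedTuple):
    """One 'Resource has <property> <value>' sentence."""
    prop: str
    value: str

    def sentence(self) -> str:
        # Subject ("Resource") and verb ("has") are implied, so they are fixed.
        return f"Resource has {self.prop} '{self.value}'."


def make_statement(prop: str, value: str) -> Statement:
    """Build a statement, rejecting properties outside the fifteen elements."""
    if prop not in DC_ELEMENTS:
        raise ValueError(f"{prop!r} is not one of the fifteen elements")
    return Statement(prop, value)


statements = [
    make_statement("Title", "Grammar of Dublin Core"),
    make_statement("Creator", "Tom Baker"),
    make_statement("Subject", "Metadata"),
]
for s in statements:
    print(s.sentence())
```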
12(Diagram) The fixed statement pattern: Resource has property X
- Resource: implied subject
- has: implied verb
- property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date...)
- X: property value (an appropriate literal)
13(Diagram) The fixed statement pattern with qualifiers: Resource has property X
- Resource: implied subject
- has: implied verb
- property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date...)
- X: property value (an appropriate literal)
- qualifiers (adjectives): two optional qualifier slots
14(Diagram) Example qualified statements
- Resource has Subject "Languages -- Grammar" (qualifier: LCSH)
- Resource has Date "2000-06-13" (qualifiers: Revised, ISO8601)
15Dumb-Down Principle for Qualifiers
- The fifteen elements should be usable and understandable with or without the qualifiers
- Qualifiers refine meaning (but may be harder to understand)
- Nouns can stand on their own without adjectives
- If your software encounters an unfamiliar qualifier, look it up -- or just ignore it!
- "Has a" relations break the model
- E.g., a creator has a hair color
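The dumb-down principle can be sketched in a few lines of Python. The class `QualifiedStatement` and the function `dumb_down` are hypothetical names chosen for illustration; the sketch assumes a statement carries at most one element refinement and one encoding scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QualifiedStatement:
    element: str                      # one of the fifteen elements, e.g. "Date"
    value: str
    refinement: Optional[str] = None  # optional qualifier on the element
    scheme: Optional[str] = None      # optional qualifier on the value


def dumb_down(stmt: QualifiedStatement) -> QualifiedStatement:
    """Drop both qualifiers, keeping only the element and its value."""
    return QualifiedStatement(stmt.element, stmt.value)


qualified = QualifiedStatement(
    "Date", "2000-06-13", refinement="Revised", scheme="ISO8601"
)
simple = dumb_down(qualified)
print(simple.element, simple.value)
```

If the qualifiers are well chosen, the dumbed-down statement is less specific but still true: the resource does have a date of "2000-06-13".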
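16Test for good qualifiers: cover and ask
-- Does the statement still make sense?
-- Is it still correct?
(Diagram) Statements with the qualifiers to cover shown in parentheses
- Resource has Subject "Languages -- Grammar" (qualifier: LCSH)
- Resource has Date "2000-06-13" (qualifiers: Revised, ISO8601)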
17Incorrect Qualification
- Resource has creator "Cornell University" (qualifier: affiliation)
- Resource has subject "pre-schoolers" (qualifier: audience)
18Open questions in this model
- Are uncontrolled and unconstrained values really useful for discovery?
- Is it possible for an organization (DCMI) to control the evolution of a language?
- How can "simple discovery metadata" be combined with complex descriptions? Is there a notion of graceful degradation?
- Can DC serve as a lingua franca (mapping template) among more complex models?
19Models for Deploying Metadata
- Embedded in the resource
- Low deployment threshold
- Limited flexibility, limited model
- Linked to from resource
- Using XLink
- Is there only one source of metadata?
- Independent resource referencing resource
- Model of accessing the object through its surrogate
- Resource doesn't have metadata; metadata is just one resource annotating another
20Syntax Alternatives: HTML
- Advantages
- Simple mechanism: META tags embedded in content
- Widely deployed tools and knowledge
- Disadvantages
- Limited structural richness (won't support hierarchical, tree-structured data or entity distinctions).
21Dublin Core in HTML
- http://www.dublincore.org/documents/2000/08/15/dcq-html/
- HTML constructs
- <link> to establish pseudo-namespace
- <meta> for metadata statements
- name attribute for DC element (DC.element.ER)
- content attribute for element value
- scheme attribute for encoding scheme or controlled vocabulary
- lang attribute for language of element value
22Dublin Core in HTML example
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1">
<meta name="DC.Title" content="Business Unusual">
<meta name="DC.Title" lang="es" content="negocio inusual">
<meta name="DC.Creator" content="Carl Lagoze">
<meta name="DC.Subject" content="bibliographic control web cataloging">
<meta name="DC.Date.Created" scheme="W3CDTF" content="2000-10-23">
<meta name="DC.Format" content="text/html">
<meta name="DC.Identifier" content="http://lcweb.loc.gov/lagoze_paper.html">
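Tags of this shape are easy to generate programmatically. The Python sketch below is an illustration only: the helper `dc_meta` is a hypothetical name, and the attribute order (name, scheme, lang, content) is an assumption made for readability rather than a requirement of the HTML encoding.

```python
from html import escape
from typing import Optional


def dc_meta(element: str, content: str, refinement: Optional[str] = None,
            scheme: Optional[str] = None, lang: Optional[str] = None) -> str:
    """Render one Dublin Core statement as an HTML <meta> tag."""
    # The name is "DC.<element>", with ".<refinement>" appended when present.
    name = "DC." + element + ("." + refinement if refinement else "")
    attrs = [("name", name)]
    if scheme:
        attrs.append(("scheme", scheme))
    if lang:
        attrs.append(("lang", lang))
    attrs.append(("content", content))
    # Escape attribute values so quotes and ampersands cannot break the tag.
    rendered = " ".join(f'{key}="{escape(val, quote=True)}"' for key, val in attrs)
    return f"<meta {rendered}>"


print(dc_meta("Title", "Business Unusual"))
print(dc_meta("Title", "negocio inusual", lang="es"))
print(dc_meta("Date", "2000-10-23", refinement="Created", scheme="W3CDTF"))
```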
23Unqualified Dublin Core in XML
http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/
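As a rough illustration of unqualified Dublin Core in XML, the Python sketch below serializes element/value pairs into the Dublin Core element namespace. The container element name `metadata` and the function name `to_dc_xml` are assumptions made for this example; consult the guidelines at the URL above for the recommended container and conventions.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(fields):
    """Serialize (element, value) pairs as children of a container element."""
    root = ET.Element("metadata")  # hypothetical container name
    for element, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_dc_xml([
    ("title", "Business Unusual"),
    ("creator", "Carl Lagoze"),
    ("format", "text/html"),
])
print(xml_text)
```

Unlike the HTML encoding, each statement here is an element of its own, which leaves room for nesting and richer structure later.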
24Multi-entity nature of object description
25Attribute/Value approaches to metadata...
The playwright of Hamlet was Shakespeare
(Diagram) Hamlet has a creator: Shakespeare
26...run into problems for richer descriptions...
The playwright of Hamlet was Shakespeare, who was born in Stratford
(Diagram) Hamlet has a creator: Stratford (qualifier: birthplace)
27...because of their failure to model entity distinctions
(Diagram) Two distinct entities, R1 and R2
- R1 has title: Hamlet
- R1 has creator: R2
- R2 has name: Shakespeare
- R2 has birthplace: Stratford
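The difference between the flat and the entity-aware description can be made concrete with a small Python sketch. The triple representation and the helper names `objects` and `creator_birthplaces` are illustrative assumptions; R1 and R2 are the two entities from the diagram.

```python
# Flat attribute/value record: every attribute attaches to the one resource,
# so "birthplace" has nowhere to go except onto Hamlet itself.
flat = {"title": "Hamlet", "creator": "Shakespeare", "birthplace": "Stratford"}

# Entity-aware model: (subject, property, object) triples over named entities.
triples = [
    ("R1", "title", "Hamlet"),
    ("R1", "creator", "R2"),
    ("R2", "name", "Shakespeare"),
    ("R2", "birthplace", "Stratford"),
]


def objects(subject, prop):
    """Return every object linked to `subject` by property `prop`."""
    return [o for s, p, o in triples if s == subject and p == prop]


def creator_birthplaces(resource):
    """Follow resource -> creator -> birthplace across two entities."""
    places = []
    for creator in objects(resource, "creator"):
        places.extend(objects(creator, "birthplace"))
    return places


print(creator_birthplaces("R1"))     # birthplace belongs to the creator, R2
print(objects("R1", "birthplace"))   # the play itself has no birthplace
```

In the flat record nothing distinguishes "Hamlet's birthplace" from "the birthplace of Hamlet's creator"; the triples keep the two apart.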
28Applying a Model-Centric Approach
- Formally define common entities and relationships underlying multiple metadata vocabularies
- Describe them (and their inter-relationships) in a simple logical model
- Provide the framework for extending these common semantics to domain and application-specific metadata vocabularies.
29Events are key to understanding resource complexity?
- Events are implicit in most metadata formats (e.g., date published, translator)
- Modeling implied events as first-class objects provides attachment points for common entities, e.g., agents, contexts (times and places), roles.
- Clarifying attachment points facilitates understanding and querying who was responsible for what, and when.
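To show what "events as first-class objects" might look like in practice, here is a minimal Python sketch. The classes `Event` and `Participation`, the function `who_did_what_when`, and all of the sample data are hypothetical; this is not the ABC/Harmony model itself, only an illustration of events serving as attachment points for agents, roles, and times.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Participation:
    agent: str  # who took part
    role: str   # in what capacity


@dataclass
class Event:
    kind: str                    # what happened, e.g. "translation"
    time: Optional[str] = None   # context: when
    place: Optional[str] = None  # context: where
    participants: List[Participation] = field(default_factory=list)


def who_did_what_when(events: List[Event]) -> List[Tuple[str, str, str, Optional[str]]]:
    """Flatten events into (agent, role, event kind, time) rows."""
    return [(p.agent, p.role, e.kind, e.time)
            for e in events for p in e.participants]


# Hypothetical sample data, for illustration only.
events = [
    Event("publication", time="2000-10-23",
          participants=[Participation("Example Press", "publisher")]),
    Event("translation", time="2001-05-02",
          participants=[Participation("A. Translator", "translator")]),
]
for row in who_did_what_when(events):
    print(row)
```

A flat "translator" field records only the agent; making the translation an event gives the date and place somewhere to attach.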
30ABC/Harmony: Event-aware metadata ontology
- Recognizing inherent lifecycle aspects of description (esp. of digital content)
- Modeling incorporates time (events and situations) as first-class objects
- Supplies clear attachment points for agents, roles, existential properties
- Resource description as a story-telling activity
31Resource-centric Metadata
33Breaking the metadata bottleneck: Human vs. machine generation
- Simple text scraping
- HTML tags as hint
- Other structural methods
- Natural language methods and machine learning
- Contextual methods
- Google (text and image search)
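The first two approaches in this list, simple text scraping and HTML tags as hints, can be sketched with Python's standard `html.parser` module. The class name `HintScraper` and the sample page are hypothetical; the sketch assumes that the `<title>` element and `<meta name=... content=...>` tags are the only hints of interest.

```python
from html.parser import HTMLParser


class HintScraper(HTMLParser):
    """Collect the <title> text and named <meta> tags as candidate metadata."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # A name may repeat (e.g. several subjects), so keep a list.
            self.metadata.setdefault(attrs["name"], []).append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.metadata.setdefault("title", []).append(data.strip())


# Hypothetical sample page, for illustration only.
page = """<html><head><title>Business Unusual</title>
<meta name="DC.Creator" content="Carl Lagoze">
<meta name="keywords" content="metadata, cataloging">
</head><body><p>Text.</p></body></html>"""

scraper = HintScraper()
scraper.feed(page)
print(scraper.metadata)
```

Hints gathered this way are only as reliable as the page author made them, which is why the list goes on to structural, linguistic, and contextual methods.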
34Putting metadata in its place
35Query engine architecture space