Title: Prof. Ray Larson
1Lecture 14 Metadata and Markup
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson & Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2003
- http://www.sims.berkeley.edu/academics/courses/is202/f03/
2Lecture Overview
- Review
- XML and Document Engineering
- Metadata And Markup
- XML As A Metadata Lingua Franca
- METS
- SGML vs. XML DTD Construction
- XML Schemas
- XML For Protocols And Metadata Languages
- Readings/Discussion
5XML as a common syntax
- XML (and SGML) provide a way of expressing the
structure of documents that can be verified and
validated by document processing systems
- Documents can be metadata structures
- Such as the description of a particular
photograph in our Phone project
- XML thus provides a way of representing metadata
descriptions as well as the content that they
describe
6XML as a common syntax
- All XML documents follow some simple rules that
make them interchangeable and usable across
different systems
- All data and markup is in Unicode
- All elements are marked by begin and end tags
- All markup is case-sensitive
- XML DTDs and/or Schemas define the valid
structure (and sometimes content) of the documents
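The begin/end tag and case-sensitivity rules can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the memo fragments are made up for illustration and are not part of the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical memo fragments, used only for illustration.
good = "<memo><to>Staff</to><body>Hello</body></memo>"
bad_case = "<memo><to>Staff</TO></memo>"   # end tag differs in case
unclosed = "<memo><to>Staff</memo>"        # <to> is never closed

for label, doc in [("good", good), ("bad_case", bad_case), ("unclosed", unclosed)]:
    print(label, is_well_formed(doc))
```

Note that this only tests well-formedness; checking a document against a DTD or Schema (validity) is a separate step.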
7Example METS
- METS (the Metadata Encoding and Transmission
Standard) is a new Schema intended to provide
- a standard for encoding descriptive,
administrative, and structural metadata regarding
objects within a digital library, expressed using
the XML schema language of the World Wide Web
Consortium
- METS can be used to wrap complex sets of data
(the actual data, with rules for encoding binary
forms), the metadata describing the parts of that
data, and the sequence and conditions under which
the data can or should be presented or displayed
9SGML/XML Structure
- An SGML document consists of three parts
- The SGML Declaration
- The Document Type Definition (DTD)
- The Document Instance
- An XML document REQUIRES only the document
instance, but for effective processing a DTD is
very important
- XML Schema (later) provides an alternative to
DTDs for XML applications
10Document Type Definitions
- The DTD describes the structural elements and
"shorthand" markup for a particular document type
and defines
- Names of "legal" elements
- How many times elements can appear
- The order of elements in a document
- Whether markup can be omitted (SGML only)
- Contents of elements (i.e., nested structures)
- Attributes associated with elements
- Names of "entities"
- Short-hand conventions for element tags (SGML
only)
11DTD Components
- The major components of a DTD are
- Entity Declarations
- Element Declarations
- Attribute Declarations
12Document Type Definitions
- Entity Declarations are a "macro" definition
facility for both DTD and Document instance parts
- General Internal Entity Definitions
<!ENTITY name "substitute string">
referenced by &name;
- General External Entity Definitions
<!ENTITY name SYSTEM "file path">
referenced by &name;
- Parameter Entity Definitions (used only inside
DTDs)
<!ENTITY % name "substitute string">
or
<!ENTITY % name SYSTEM "file path">
referenced by %name; or %name
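To see the "macro" behaviour of a general internal entity, here is a small sketch in Python. The entity name `org` and its replacement text are assumptions chosen for the example; the standard-library parser expands entities declared in the internal DTD subset.

```python
import xml.etree.ElementTree as ET

# The entity `org` and its replacement string are invented for this sketch.
doc = """<!DOCTYPE memo [
  <!ENTITY org "UC Berkeley SIMS">
]>
<memo>From &org;</memo>"""

root = ET.fromstring(doc)
# The reference &org; has been replaced by the substitute string.
print(root.text)
```

External and parameter entities are not exercised here, since the standard-library parser does not fetch external files by default.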
13 Document Type Definitions
- SGML Element Declarations define the structural
elements of a document and its associated
markup
<!ELEMENT name - - content_model or
declared_content +(include_list) -(exclude_list)>
- Omitted tag minimization indicates whether
start-tags or end-tags can be omitted in the
markup: (O) or (-) are required in SGML but can
NOT be used in XML
14Document Type Definitions
- Content model provides a nested structural
description of the elements that make up this
element, e.g.
- <!ELEMENT memo - - ((to & from), body, close?)>
- <!ELEMENT body - O (p)*>
- <!ELEMENT p - O (#PCDATA | q)*>
- <!ELEMENT q - - (#PCDATA)>...
- ANY (in SGML) may be used to indicate a content
model of any elements in the DTD, in any order
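One way to read a content model is as a regular expression over the names of an element's children. The sketch below hand-translates an SGML model of the form ((to & from), body, close?) into a Python regex: to and from must both appear, in either order, followed by body and an optional close. This is an illustration of the idea, not how an SGML parser is actually implemented, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ((to & from), body, close?) rewritten as a regex over space-separated
# child element names. "&" means "both, in either order".
MEMO_MODEL = re.compile(r"^(to from|from to) body( close)?$")

def memo_children_ok(xml_text: str) -> bool:
    """Check only the direct children of <memo> against the content model."""
    root = ET.fromstring(xml_text)
    names = " ".join(child.tag for child in root)
    return root.tag == "memo" and MEMO_MODEL.match(names) is not None

print(memo_children_ok("<memo><from/><to/><body/></memo>"))
print(memo_children_ok("<memo><to/><body/><close/></memo>"))
```

The test documents use fully tagged XML syntax for convenience, so SGML tag omission is not modeled.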
15Document Type Definitions
- Same content model in XML
- <?xml version="1.0"?>
- <!DOCTYPE memo [
- <!ELEMENT memo ((to | from), body, close?)>
<!ELEMENT body (p)*>
<!ELEMENT p (#PCDATA | q)*>
<!ELEMENT q (#PCDATA)>
- ]>
- Note the XML processing instruction Prolog
- Note that & in previous page is not legal XML
16Document Type Definitions
- Declared content can be #PCDATA, CDATA, RCDATA,
EMPTY
- Inclusion and Exclusion lists can be used to
indicate elements that can occur or are forbidden
to occur in any sub-elements of the content model
(NOT in XML), e.g.
- <!ELEMENT memo - - ((to & from), body, close?)
+(fn)>
- Says that element fn can appear anyplace in the
memo
17Document Type Definitions
- Attribute Declarations define attributes
associated with (potentially) each element of a
document and provide the acceptable values for
those attributes
18Attributes Example
- <!ATTLIST associate_element attribute_name
declared_value default_value>
- <!ATTLIST memo status (PUBLIC | CONFIDENTIAL)
PUBLIC>
- In markup of a document:
<memo status="CONFIDENTIAL">
also, because of the default set,
<memo>
would be the same as
<memo status="PUBLIC">
- There are a variety of special
defaults and data types that can be given in
attribute definitions
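The effect of an attribute default can be observed with a parser that reads the internal DTD subset. A minimal sketch in Python follows; it uses XML syntax, where the default value must be quoted, and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# XML form of the memo ATTLIST; in XML the default value is quoted.
DTD = """<!DOCTYPE memo [
  <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) "PUBLIC">
]>"""

implicit = ET.fromstring(DTD + "<memo/>")
explicit = ET.fromstring(DTD + '<memo status="CONFIDENTIAL"/>')

# The parser supplies the declared default when the attribute is absent.
print(implicit.get("status"), explicit.get("status"))
```

The standard-library parser applies the default but does not validate, so a value outside the enumerated list would not be rejected here.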
19Sample SGML DTD
<!doctype ELIB-TEXTS [
<!-- This is a DTD for bibliographic records extracted from the
     elib/rfc1357 simple bibliographic format. -->
<!ELEMENT ELIB-TEXTS o o (ELIB-BIB)*>
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE, ORGANIZATION,
    (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL |
     AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
     AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT | AUTHOR | PROJECT |
     PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
     ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
    (TEXT-REF | PAGED-REF)*)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT ID - o (#PCDATA)>
<!ELEMENT ABSTRACT - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
etc
]>
20XML Version
<!DOCTYPE ELIB-TEXTS [
<!-- This is a DTD for bibliographic records extracted from the
     elib/rfc1357 simple bibliographic format. -->
<!ELEMENT ELIB-TEXTS (ELIB-BIB)*>
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE, ORGANIZATION,
    (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL |
     AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
     AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT | AUTHOR | PROJECT |
     PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
     ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
    (TEXT-REF | PAGED-REF)*)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT ID (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
etc
]>
21Document Using That DTD
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>6</ID>
<ENTRY>February 13 1995</ENTRY>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<ORGANIZATION>California Department of Water Resources</ORGANIZATION>
<SERIES>120-93</SERIES>
<TYPE>bulletin</TYPE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources</AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
<TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
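Because most ELIB-BIB elements may repeat, a program reading such a record would typically collect each field into a list. The sketch below does this with Python's standard library on an abbreviated copy of the record; the helper name `fields_of` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the record; only a few fields are kept.
record = """<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>6</ID>
<TITLE>Water Conditions in California Report 2</TITLE>
<TYPE>bulletin</TYPE>
<PAGES>17</PAGES>
</ELIB-BIB>"""

def fields_of(xml_text):
    """Map each child element name to the list of its text values."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)

fields = fields_of(record)
print(fields["TITLE"])
```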
22Dublin Core
- Review
- Simple metadata for describing internet resources
- For Document-Like Objects
- 15 Elements
23Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
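As an illustration of how these elements look when serialized, here is a short Python sketch that builds a record using the Dublin Core element set namespace (http://purl.org/dc/elements/1.1/). The wrapper element `metadata`, the helper name `dc_record`, and the sample values are assumptions for the example, not part of any Dublin Core specification.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dc_record(fields):
    """Build a record with one dc:* child per (name, value) pair."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
        child.text = value
    return root

rec = dc_record({
    "title": "Water Conditions in California Report 2",
    "creator": "California Department of Water Resources",
    "type": "bulletin",
})
xml_text = ET.tostring(rec, encoding="unicode")
print(xml_text)
```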
24DC XML DTD Implementation
- There have been various versions
- This one is the one recommended (required) by the
Open Archives Initiative Metadata Harvesting
Protocol (OAI-MHP)
- Uses XML Name Spaces
- Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/
25DC Element and Attribute Definitions
<!-- The elements from DCMES 1.1 -->
<!-- The name given to the resource. -->
<!ELEMENT dc:title (#PCDATA)>
<!ATTLIST dc:title xml:lang CDATA #IMPLIED>
<!-- An entity primarily responsible for making the content of the resource. -->
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator xml:lang CDATA #IMPLIED>
<!-- The topic of the content of the resource. -->
<!ELEMENT dc:subject (#PCDATA)>
<!ATTLIST dc:subject xml:lang CDATA #IMPLIED>
<!-- An account of the content of the resource. -->
<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description xml:lang CDATA #IMPLIED>
<!-- The entity responsible for making the resource available. -->
<!ELEMENT dc:publisher (#PCDATA)>
<!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>
<!-- An entity responsible for making contributions to the content of the resource. -->
<!ELEMENT dc:contributor (#PCDATA)>
<!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>
<!-- A date associated with an event in the life cycle of the resource. -->
<!ELEMENT dc:date (#PCDATA)>
<!ATTLIST dc:date xml:lang CDATA #IMPLIED>
26DC Element Definitions (cont.)
<!-- The nature or genre of the content of the resource. -->
<!ELEMENT dc:type (#PCDATA)>
<!ATTLIST dc:type xml:lang CDATA #IMPLIED>
<!-- The physical or digital manifestation of the resource. -->
<!ELEMENT dc:format (#PCDATA)>
<!ATTLIST dc:format xml:lang CDATA #IMPLIED>
<!-- An unambiguous reference to the resource within a given context. -->
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>
<!-- A Reference to a resource from which the present resource is derived. -->
<!ELEMENT dc:source (#PCDATA)>
<!ATTLIST dc:source xml:lang CDATA #IMPLIED>
<!ATTLIST dc:source rdf:resource CDATA #IMPLIED>
<!-- A language of the intellectual content of the resource. -->
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language xml:lang CDATA #IMPLIED>
<!-- A reference to a related resource. -->
<!ELEMENT dc:relation (#PCDATA)>
<!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
<!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>
<!-- The extent or scope of the content of the resource. -->
<!ELEMENT dc:coverage (#PCDATA)>
<!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>
<!-- Information about rights held in and over the resource. -->
<!ELEMENT dc:rights (#PCDATA)>
<!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
27A More Complex SGML DTD
<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is
     intended to hold a unique record number for each MARC record in
     the local database. That is to say, it is intended ONLY as an aid
     in maintaining the local database of MARC records -->
<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount,
    SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>
<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
etc
28More Complex DTD (cont.)
<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?,
    Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?,
    Fld9XX?)>
<!-- Component Parts of Variable Data Fields -->
<!-- Numbers & Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017, Fld018?, Fld019,
    Fld020, Fld022, Fld023, Fld024, Fld025, Fld027, Fld028, Fld029,
    Fld030, Fld032, Fld033, Fld034, Fld035, Fld036?, Fld037, Fld039,
    Fld040?, Fld041?, Fld042?, Fld043?, Fld044?, Fld045?, Fld046?,
    Fld047?, Fld048, Fld050, Fld051, Fld052, Fld055, Fld060, Fld061,
    Fld066?, Fld069, Fld070, Fld071, Fld072, Fld074, Fld080?, Fld082,
    Fld084, Fld086, Fld088, Fld090, Fld096)>
<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>
<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211, Fld212, Fld214, Fld222, Fld240?,
    Fld242, Fld243?, Fld245, Fld246, Fld247)>
<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255, Fld256?, Fld257?,
    Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>
<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300, Fld305, Fld306?, Fld310?, Fld315?, Fld321,
    Fld340, Fld350?, Fld351, Fld355, Fld357, Fld362)>
etc
29Complex DTD (cont.)
<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)*)>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>
etc
<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>
30Document Markup
- All document markup is derived from the DTD for
the particular document type
- In SGML the DTD should be referenced in the
document using the DOCTYPE declaration
- <!DOCTYPE name SYSTEM "file_path">
or
<!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
or
<!DOCTYPE name [doctype_declaration_subset]>
- The doctype_declaration_subset can be any combination
of element, entity, and attribute declarations
31HTML
- HTML was not originally "real" SGML; the DTD was
invented after the language
- It is often more concerned with the form of the
output on the screen than with the structural
contents of the HTML docs
- Relies on the application (such as Netscape) to
implement interesting actions like hypertext
linking
- XHTML is now a W3C recommendation that applies
XML conventions to HTML, and provides a growing
set of capabilities within an XML framework (our
phones use XHTML)
33What are XML Schemas?
- An XML vocabulary for expressing your data's
structure AND content types, and even the
business rules involved in processing the data
- Written in XML themselves
- Support namespaces for combining multiple schemas
in the same documents
- The slides in this section are based on an XML
tutorial by Roger L. Costello
34Example
<location>
<latitude>32.904237</latitude>
<longitude>73.620290</longitude>
<uncertainty units="meters">2</uncertainty>
</location>
Is this data valid? To be valid, it must meet
these constraints (data business rules):
1. The location must be comprised of a latitude,
followed by a longitude, followed by an
indication of the uncertainty of the
lat/lon measurements.
2. The latitude must be a decimal with a value
between -90 and 90.
3. The longitude must be a decimal with a value
between -180 and 180.
4. For both latitude and longitude the number of
digits to the right of the decimal point must be
exactly six digits.
5. The value of uncertainty must be a
non-negative integer.
6. The uncertainty units must be either meters or
feet.
We can express all these data constraints using
XML Schemas
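To make the six rules concrete, the sketch below checks them by hand in Python. This is deliberately not an XML Schema and not a schema validator; it only shows what a validator would have to enforce. The function name `location_errors` and the error messages are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal

# Rule 4: exactly six digits to the right of the decimal point.
SIX_DECIMALS = re.compile(r"^-?\d+\.\d{6}$")

def location_errors(xml_text):
    """Return a list of rule violations (an empty list means valid)."""
    root = ET.fromstring(xml_text)
    # Rule 1: latitude, then longitude, then uncertainty.
    if root.tag != "location" or [c.tag for c in root] != ["latitude", "longitude", "uncertainty"]:
        return ["location must contain latitude, longitude, uncertainty in that order"]
    lat, lon, unc = list(root)
    errors = []
    # Rules 2-4: decimal format and range.
    for elem, limit in ((lat, 90), (lon, 180)):
        text = (elem.text or "").strip()
        if not SIX_DECIMALS.match(text):
            errors.append(elem.tag + " must be a decimal with exactly six fraction digits")
        elif abs(Decimal(text)) > limit:
            errors.append("%s must be between -%d and %d" % (elem.tag, limit, limit))
    # Rule 5: non-negative integer.
    if not re.fullmatch(r"\d+", (unc.text or "").strip()):
        errors.append("uncertainty must be a non-negative integer")
    # Rule 6: units are meters or feet.
    if unc.get("units") not in ("meters", "feet"):
        errors.append("uncertainty units must be meters or feet")
    return errors

sample = """<location>
<latitude>32.904237</latitude>
<longitude>73.620290</longitude>
<uncertainty units="meters">2</uncertainty>
</location>"""
print(location_errors(sample))
```

With a schema, the same rules live in a declarative document that any conforming validator can apply, instead of in application code like this.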
35Validating your data
36Purpose of XML Schemas
- Specify
- the structure of instance documents
- "this element contains these elements, which
contains these other elements, etc"
- the datatype of each element/attribute
- "this element shall hold an integer with the
range 0 to 12,000" (DTDs don't do too well with
specifying datatypes like this)
37Why Schemas?
Motivation for XML Schemas
- People are dissatisfied with DTDs
- It's a different syntax
- You write your XML (instance) document using one
syntax and the DTD using another syntax --> bad,
inconsistent
- Limited datatype capability
- DTDs support a very limited capability for
specifying datatypes. You can't, for example,
express "I want the <elevation> element to hold
an integer with a range of 0 to 12,000"
- Desire a set of datatypes compatible with those
found in databases
- DTD supports 10 datatypes; XML Schemas supports
44 datatypes
38Highlights of XML Schemas
- XML Schemas are a tremendous advancement over
DTDs
- Enhanced datatypes
- 44 versus 10
- Can create your own datatypes
- Example: "This is a new type based on the string
type and elements of this type must follow this
pattern ddd-dddd, where 'd' represents a digit".
- Written in the same syntax as instance documents
- less syntax to remember
- Object-oriented'ish
- Can extend or restrict a type (derive new type
definitions on the basis of old ones)
- Can express sets, i.e., can define the child
elements to occur in any order
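The "ddd-dddd" pattern mentioned above is the kind of restriction a user-defined string type can carry. As a rough analogue, here is the same constraint written as a Python regular expression; the name `is_ddd_dddd` is hypothetical, and this is a sketch of the constraint, not of XML Schema syntax.

```python
import re

# Three digits, a hyphen, four digits: the "ddd-dddd" pattern.
DDD_DDDD = re.compile(r"\d{3}-\d{4}")

def is_ddd_dddd(value):
    """True if the whole value follows the ddd-dddd pattern."""
    return DDD_DDDD.fullmatch(value) is not None

print(is_ddd_dddd("555-1212"), is_ddd_dddd("5551212"))
```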
39Highlights of XML Schemas
- Can specify element content as being unique (keys
on content) and uniqueness within a region
- Can define multiple elements with the same name
but different content
- Can define elements with nil content
- Can define substitutable elements - e.g., the
"Book" element is substitutable for the
"Publication" element.
40BookStore.dtd
<!ELEMENT BookStore (Book)+>
<!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Publisher (#PCDATA)>
41DTD vocabulary: ELEMENT, ATTLIST, ENTITY, #PCDATA, CDATA, NMTOKEN, ID
New vocabulary: BookStore, Book, Title, Author, Date, ISBN, Publisher
This is the vocabulary that DTDs provide to
define your new vocabulary
42http://www.w3.org/2001/XMLSchema
- schema, element, complexType, sequence, string, integer, boolean
http://www.books.org (targetNamespace)
- BookStore, Book, Title, Author, Date, ISBN, Publisher
This is the vocabulary that XML Schemas provide
to define your new vocabulary
One difference between XML Schemas and DTDs is
that the XML Schema vocabulary is associated with
a name (namespace). Likewise, the new vocabulary
that you define must be associated with a name
(namespace). With DTDs neither set of vocabulary
is associated with a name (namespace); DTDs
pre-dated namespaces.
43BookStore.xsd (xsd = Xml-Schema Definition)
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.books.org"
            xmlns="http://www.books.org"
            elementFormDefault="qualified">
    <xsd:element name="BookStore">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:element name="Book">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
                <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
                <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
                <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
                <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:element name="Title" type="xsd:string"/>
    <xsd:element name="Author" type="xsd:string"/>
    <xsd:element name="Date" type="xsd:string"/>
    <xsd:element name="ISBN" type="xsd:string"/>
    <xsd:element name="Publisher" type="xsd:string"/>
</xsd:schema>
44BookStore.xsd (listing repeated from slide 43), shown next to the equivalent DTD declarations
<!ELEMENT BookStore (Book)+>
<!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Publisher (#PCDATA)>
45BookStore.xsd (listing repeated from slide 43)
All XML Schemas have "schema" as the root element.
46BookStore.xsd (listing repeated from slide 43)
The elements and datatypes that are used to
construct schemas
- schema
- element
- complexType
- sequence
- string
come from the http://.../XMLSchema namespace
47XMLSchema Namespace
http://www.w3.org/2001/XMLSchema
complexType
element
sequence
schema
boolean
string
integer
48BookStore.xsd (listing repeated from slide 43)
targetNamespace="http://www.books.org" says that the
elements defined by this schema
- BookStore
- Book
- Title
- Author
- Date
- ISBN
- Publisher
are to go in this namespace
49Book Namespace (targetNamespace)
http://www.books.org (targetNamespace)
BookStore
Author
Book
Title
ISBN
Publisher
Date
50BookStore.xsd (listing repeated from slide 43)
xmlns="http://www.books.org": The default namespace
is http://www.books.org, which is the targetNamespace!
ref="Book": This is referencing a Book element
declaration. The Book in what namespace? Since
there is no namespace qualifier it is referencing
the Book element in the default namespace, which
is the targetNamespace! Thus, this is a
reference to the Book element declaration in this
schema.
51BookStore.xsd (listing repeated from slide 43)
elementFormDefault="qualified": This is a directive
to any instance documents which conform to this
schema. Any elements used by the instance document
which were declared in this schema must be
namespace qualified.
52Referencing a schema in an XML instance document
<?xml version="1.0"?>
<BookStore xmlns="http://www.books.org"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.books.org BookStore.xsd">
    <Book>
        <Title>My Life and Times</Title>
        <Author>Paul McCartney</Author>
        <Date>July, 1998</Date>
        <ISBN>94303-12021-43892</ISBN>
        <Publisher>McMillin Publishing</Publisher>
    </Book>
    ...
</BookStore>
1. First, using a default namespace declaration,
tell the schema-validator that all of the
elements used in this instance document come from
the http://www.books.org namespace.
2. Second, with schemaLocation tell the
schema-validator that the http://www.books.org
namespace is defined by BookStore.xsd (i.e.,
schemaLocation contains a pair of values).
3. Third, tell the schema-validator that the
schemaLocation attribute we are using is the one
in the XMLSchema-instance namespace.
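The three steps can be observed from the consuming side. The following Python sketch reads a shortened BookStore instance and recovers the default namespace and the namespace/schema pair from schemaLocation. It only inspects the attributes; it does not load BookStore.xsd or validate anything, and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened instance document; only the Title element is kept.
instance = """<BookStore xmlns="http://www.books.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.books.org BookStore.xsd">
  <Book><Title>My Life and Times</Title></Book>
</BookStore>"""

root = ET.fromstring(instance)

# Step 1: the default namespace shows up in the qualified tag name.
namespace = root.tag[1:].split("}")[0]

# Steps 2 and 3: schemaLocation (in the xsi namespace) holds
# whitespace-separated pairs of namespace and schema document.
tokens = root.get("{%s}schemaLocation" % XSI, "").split()
pairs = dict(zip(tokens[0::2], tokens[1::2]))

print(namespace, pairs)
```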
53XMLSchema-instance Namespace
http://www.w3.org/2001/XMLSchema-instance
schemaLocation
type
noNamespaceSchemaLocation
nil
54Referencing a schema in an XML instance document
BookStore.xml
- schemaLocation="http://www.books.org BookStore.xsd"
- uses elements from namespace http://www.books.org
BookStore.xsd
- targetNamespace="http://www.books.org"
- defines elements in namespace http://www.books.org
A schema defines a new vocabulary. Instance
documents use that new vocabulary.
55Note multiple levels of checking
BookStore.xml
BookStore.xsd
XMLSchema.xsd (schema-for-schemas)
Validate that the xml document conforms to the
rules described in BookStore.xsd
Validate that BookStore.xsd is a valid schema
document, i.e., it conforms to the rules
described in the schema-for-schemas
56Default Value for minOccurs and maxOccurs
- The default value for minOccurs is "1"
- The default value for maxOccurs is "1"
<xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
Equivalent!
<xsd:element ref="Title"/>
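A tool that reads schema documents has to apply these defaults itself. The sketch below, in Python, shows one way to do so for a single element declaration; the helper `occurs` is hypothetical, and `None` is used here as an arbitrary stand-in for "unbounded".

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

def occurs(element_decl):
    """Return (minOccurs, maxOccurs), applying the default of 1 to each."""
    lo = int(element_decl.get("minOccurs", "1"))
    hi_raw = element_decl.get("maxOccurs", "1")
    hi = None if hi_raw == "unbounded" else int(hi_raw)
    return lo, hi

explicit = ET.fromstring(
    '<xsd:element xmlns:xsd="%s" ref="Title" minOccurs="1" maxOccurs="1"/>' % XSD)
implicit = ET.fromstring(
    '<xsd:element xmlns:xsd="%s" ref="Title"/>' % XSD)
repeated = ET.fromstring(
    '<xsd:element xmlns:xsd="%s" ref="Book" maxOccurs="unbounded"/>' % XSD)

print(occurs(explicit), occurs(implicit), occurs(repeated))
```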
57Much More to XMLSchema!
- This was an overview of some basics
- There are many other features, such as
- The ability to import other schemas or parts of
schemas
- Ability to specify many data types
- Etc.
- XMLSchema definitions are at W3C
- http://www.w3.org/TR/xmlschema-0/ is a good
place to start
59Other Protocols and Metadata Systems Using XML
- SOAP (Simple Object Access Protocol)
- DAV/DASL (Distributed Authoring and Versioning)
- SDLIP (Simple Digital Library Interoperability
Protocol)
- RDF (Resource Description Framework)
- ADL Gazetteer Protocol
- OAI-MHP (already discussed)
- MPEG-7 (more next time)
- METS
- Also versions of MARC and other formats in XML
60SGML and XML Sources and Resources
- Books
- van Herwijnen, Eric. Practical SGML. (2nd Ed.)
Boston: Kluwer Academic Publishers, 1994.
- Goldfarb, Charles F. The SGML Handbook. Oxford:
Clarendon Press, 1990. (and MANY XML books)
- Web Sites
- The W3C web site (all XML standards documents)
- http://www.w3.org
- Robin Cover's SGML/XML Site
- http://www.oasis-open.org/cover/sgml-xml.html
62Discussion Vam Makam
- Kirk covers examples of DTDs for books and
newspapers. Many individuals and corporations
have been creating numerous DTDs for themselves
and general purposes. What are some innovative
and useful ideas for areas where designing DTDs
might be useful? For ideas that may have already
been thought of, how could they be improved or
extended?
63Discussion Vam Makam
- However recently XML DTDs have emerged, newer
ideas such as XML schemas have presented
themselves as a better option. Given the thought
and work that has gone into designing existing
DTDs, at what point is it worth converting an
existing DTD to an XML schema?
- Now that you have learned how to design a DTD and
have basic knowledge about XML, what are some
existing technologies that become more useful
when combined with XML?
64Discussion Annie Yeh
- Kirk addresses the advantages of using external
DTDs, the reusability of public DTDs, the ability
to focus on content rather than structure, easier
management of multiple documents, and easier data
error checking. What are some of the existing
repositories in which we can store these DTDs?
What are some of the ways with which we can
facilitate this process? What are their pros and
cons? What are some of the more ideal interfaces
with which to facilitate this?
65Discussion Annie Yeh
- What are the differences between DTDs and
Schemas, and what are the pros and cons of each?
66Next Time
- Metadata for Motion Pictures: MPEG-7
- Readings/Discussion
- MPEG-7 (Part 1) (J. M. Martinez, R. Koenen, F.
Pereira)
- MPEG-7 (Part 2) (J. Martinez)