Title: XML and XSL
1 XML and XSL
- A report on the workshop given by Shaoping Moss on October 16, 2004, with additional examples from a real-life project
Presented by ASIST members Caryn Anderson, Prairie Clayton, and Kara Schwartz at Simmons College, November 1, 2004
2 Topics discussed
- SGML, XML, and HTML
- XML and XSL Basics
- XML in Libraries and Academics
- XML in Future Web Development
- Slide content courtesy of Shaoping Moss.
3 Markup Languages
- Address the structure of a document.
- Identify different components of the document.
- Convey information to software that will allow it to:
  - Index the data for searching.
  - Render the data.
  - Transform the data.
- SGML, XML, and HTML are all markup languages.
4 Document, Structure, and Format
- A document is:
  - A record which contains information, originally an inscribed or written record but now considered to include any format in which information might be held (e.g. map, manuscript, tape, video, software). (International Encyclopedia of Information and Library Science)
  - A collection of small elements, which can be headings, subheadings, paragraphs, quotations, etc.
- Structure vs. Format
  - Structure is about the content of the document.
  - Format is about the way a document looks.
5 What is SGML?
- Stands for Standard Generalized Markup Language.
- Initiated by Charles Goldfarb at IBM in the 1960s.
- Adopted as a standard of the International Organization for Standardization (ISO 8879) in 1986.
6 SGML and Its Subdivisions
- SGML is composed of tag-set building rules.
- SGML has given birth to other sets of subdivisions:
  - HTML and XML.
  - CALS for defense.
  - BOEING for commercial airlines.
  - C-H for publishing.
  - OED for the Oxford English Dictionary.
  - TEI guidelines for the Text Encoding Initiative.
  - EAD for Encoded Archival Description.
7 HTML Development
- HTML stands for Hypertext Markup Language.
- HTML was developed by Tim Berners-Lee at CERN, a physics lab near Geneva, Switzerland, around 1990.
- Its simplicity contributed to the rapid growth of the World Wide Web in the 1990s.
- HTML version 4 came out in 1997.
- XHTML 1.0 is the latest HTML standard.
8 HTML Problems
- HTML's forgiving syntax makes pages easy to write but harder for browsers to handle consistently.
- Tags are predefined in HTML.
- Format and content are mixed, and content is hard to reuse.
9 What is XML?
- XML is a new Web standard developed by the World Wide Web Consortium (W3C) in 1998.
- XML stands for eXtensible Markup Language.
- XML was designed to describe data.
- Tags are not predefined in XML; authors define their own.
- XML separates format from content and semantic structure.
- Data encoded in XML can function much like a traditional database.
- XML content can be output in many formats, such as XHTML, text, Word documents, PDF, etc.
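To make two of the points above concrete (tags are invented by the author, and XML data can be queried much like a small database), here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The tag names (booklist, book, title, year), the sample values, and the helper titles_published_after are assumptions made up for this illustration; they are not part of the workshop materials.

```python
import xml.etree.ElementTree as ET

# Tags are not predefined: we invent <booklist>, <book>, <title>, and <year>.
booklist = ET.Element("booklist")
for title, year in [("My First XML", "2004"), ("Old Markup", "1990")]:
    book = ET.SubElement(booklist, "book")
    ET.SubElement(book, "title").text = title
    ET.SubElement(book, "year").text = year


def titles_published_after(root, year):
    """Return the titles of books whose <year> is later than `year`."""
    return [
        book.findtext("title")
        for book in root.findall("book")
        if int(book.findtext("year")) > year
    ]


# Query the data, roughly like "SELECT title WHERE year > 2000".
recent = titles_published_after(booklist, 2000)
print(ET.tostring(booklist, encoding="unicode"))
print(recent)
```

The same element tree could just as easily be written out as plain text or HTML, which is the "many output formats" point in miniature.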
10 The Display of the Document
My First XML
Chapter 1 Introduction to XML
  What is HTML?
  What is XML?
Chapter 2 XML Syntax
  Elements must have a closing tag
  Elements must be properly nested
11 An HTML Document
An HTML document describes the book:
<h1>My First XML</h1>
<h2>Introduction to XML</h2>
<p>What is HTML?</p>
<p>What is XML?</p>
<h2>XML Syntax</h2>
<p>Elements must have a closing tag.</p>
<p>Elements must be properly nested.</p>
12 An XML Document
An XML document describes the book:
<book>
  <title>My First XML</title>
  <chapter>Introduction to XML
    <para>What is HTML?</para>
    <para>What is XML?</para>
  </chapter>
  <chapter>XML Syntax
    <para>Elements must have a closing tag.</para>
    <para>Elements must be properly nested.</para>
  </chapter>
</book>
13 HTML Elements/Tags
An HTML document describes the book:
<h1>My First XML</h1>
<h2>Introduction to XML</h2>
<p>What is HTML?</p>
<p>What is XML?</p>
<h2>XML Syntax</h2>
<p>Elements must have a closing tag.</p>
<p>Elements must be properly nested.</p>
HTML elements/tags are:
- defined by the HTML standard
- always the same
- usable in any order
Original slide content courtesy of Shaoping Moss.
14 XML Elements/Tags
An XML document describes the book:
<book>
  <title>My First XML</title>
  <chapter>Introduction to XML
    <para>What is HTML?</para>
    <para>What is XML?</para>
  </chapter>
  <chapter>XML Syntax
    <para>Elements must have a closing tag.</para>
    <para>Elements must be properly nested.</para>
  </chapter>
</book>
XML elements/tags are:
- defined by users/groups (DTD/Schema)
- different for each DTD/Schema
- hierarchical (tree structure)
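The "hierarchical (tree structure)" point can be seen by walking the book document with a parser. The sketch below uses Python's standard xml.etree.ElementTree; the function name outline and the indentation scheme are choices made for this illustration only.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <title>My First XML</title>
  <chapter>Introduction to XML
    <para>What is HTML?</para>
    <para>What is XML?</para>
  </chapter>
  <chapter>XML Syntax
    <para>Elements must have a closing tag.</para>
    <para>Elements must be properly nested.</para>
  </chapter>
</book>
"""


def outline(element, depth=0):
    """Return one indented line per element, showing the tree structure."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"{element.tag}: {text}".rstrip()]
    for child in element:
        # Each child sits one level deeper than its parent.
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(BOOK_XML))
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document: book contains title and chapter, and chapter contains para.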
15 XML is flexible and extensible
An XML document describes the book for a different user group:
<manuscript>
  <name>My First XML</name>
  <part>Introduction to XML
    <section>What is HTML?</section>
    <section>What is XML?</section>
  </part>
  <part>XML Syntax
    <section>Element Rules</section>
    <para>Elements must have a closing tag.</para>
    <para>Elements must be properly nested.</para>
  </part>
</manuscript>
<manuscript> is used instead of <book>.
The structure is extended to accommodate greater detail: part, section, AND paragraph.
16 Differences between HTML and XML
XML is not a replacement for HTML. XML and HTML were designed with different goals.
- XML was designed to describe data and to focus on what data is.
- HTML was designed to display data and to focus on how data looks.
HTML structure and tags are very loose, while XML structure and tags are strict.
- XML documents must be well-formed.
- XML elements must be properly nested.
- All XML elements must be closed.
- Tag names are case-sensitive, so opening and closing tags must match in case.
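A quick way to see the strictness rules in action is to hand a few snippets to an XML parser and watch which ones it rejects. This is a minimal sketch with Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested and closed": "<book><title>My First XML</title></book>",
    "improperly nested": "<book><title>My First XML</book></title>",
    "element never closed": "<book><title>My First XML</title>",
    "inconsistent case": "<book><Title>My First XML</title></book>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the first snippet parses. A typical HTML browser would try to display all four, which is the "loose versus strict" difference described above.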
17 Differences: HTML vs. XML
HTML
- Content: Held in generic containers (<h1>, <p>, etc.)
- Format: In the default format of the content tag OR as defined by a Cascading Style Sheet (internal or external)
- Selection: All content always included (no option to easily select or suppress content; you must manually change the document)
- Organization: Content only displayed in the order written (to change the order you must manually change the document)
XML
- Content: Held in specific containers that describe what the data is (<book>, <chapter>, etc.)
- Format: XSLT files define the formats of each section (e.g. font, color, size, etc.); multiple XSLTs for the same XML
- Selection and Organization: XSLT selects and determines the order of display of content; multiple XSLTs for the same XML (one to produce just a book title list, one to display full text, one for citations, etc.)
18 Differences: HTML vs. XML
HTML
- Analogy: Address list in a plain Word document
- What you can get: One document of your list of contacts with all the information that you have for each person, in the order you typed it.
XML
- Analogy: Address list in a database or mail-merge data file
- What you can get:
  - Friends and family with full addresses for holiday cards
  - E-mail list of just professional contacts for announcing a new product
  - Special formatting of the whole list for better display on a PDA
  - Etc., all from the SAME XML document
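The address-list analogy can be sketched in a few lines: one XML document, several different outputs. The example below uses Python's standard xml.etree.ElementTree rather than XSL, and everything in it (the contacts and person tags, the group attribute, the names and addresses) is hypothetical sample data invented for this sketch.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """
<contacts>
  <person group="family">
    <name>Ann Example</name>
    <address>1 Sample St, Springfield</address>
    <email>ann@example.org</email>
  </person>
  <person group="professional">
    <name>Bob Placeholder</name>
    <address>2 Demo Ave, Shelbyville</address>
    <email>bob@example.com</email>
  </person>
</contacts>
"""

root = ET.fromstring(CONTACTS_XML)


def holiday_card_labels(root):
    """Friends and family only, with full postal addresses."""
    return [
        f"{p.findtext('name')}, {p.findtext('address')}"
        for p in root.findall("person")
        if p.get("group") in ("friend", "family")
    ]


def professional_emails(root):
    """Just the e-mail addresses of professional contacts."""
    return [
        p.findtext("email")
        for p in root.findall("person[@group='professional']")
    ]


print(holiday_card_labels(root))
print(professional_emails(root))
```

Neither function changes the source document; each one simply selects and arranges what it needs, which is the role the slides assign to separate XSL files.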
19 How to Build an XML file family
- Establish the Document Type Definition (DTD) or Schema.
- Write a well-formed XML document that holds your data in the containers established by your DTD/Schema.
- Validate your XML document to make sure you conformed to your DTD/Schema.
- Build as many different XSL documents as you need to select data from your XML file, organize it the way you want it to appear, and format it so it looks the way you want.
Now you can link your XML file to whatever XSL
you want to get the kind of display you want at
any given time.
20 The XML family unit of files and languages
Web page request: http://www.mysite.org/myfile.xml
1. The web page calls the .xml file.
2. The .xml file calls the .xsl file for display instructions.
3. The .xsl file looks in the .xml file for content.
4. The .xml file returns content to the .xsl file.
5. The web page displays the content to the browser.
Files:
- XML (file type .xml): where the data is held.
- XSL (file type .xsl): instructions for using XML data and displaying it.
- DTD or Schema (file types .dtd, .xml for schemas): the organizational chart for the data; used for validation during creation.
- Web page: uses HTML for formatting.
Languages used in XSL documents during creation:
- XSLT, to select data from the .xml file and format it.
- XPath, to access certain spots in the .xml file.
- XSL-FO, for specifying formatting semantics (?).
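To give a feel for what "accessing certain spots in the .xml file" means, here is a small sketch using the limited XPath subset that Python's standard xml.etree.ElementTree supports. It is not a full XPath engine, and the sample titles and publishers are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

BOOKLIST_XML = """
<booklist>
  <book>
    <booktitle>Example Title One</booktitle>
    <publisher>Example Press</publisher>
    <year>2000</year>
  </book>
  <book>
    <booktitle>Example Title Two</booktitle>
    <publisher>Sample House</publisher>
    <year>1994</year>
  </book>
</booklist>
"""

root = ET.fromstring(BOOKLIST_XML)  # root is the <booklist> element

# "book/booktitle": every <booktitle> that is a child of a <book>.
all_titles = [e.text for e in root.findall("book/booktitle")]

# "book[year='2000']": only <book> elements whose <year> text is exactly 2000.
books_2000 = [b.findtext("booktitle") for b in root.findall("book[year='2000']")]

# ".//publisher": every <publisher> anywhere below the root.
publishers = [e.text for e in root.findall(".//publisher")]

print(all_titles)
print(books_2000)
print(publishers)
```

Each path expression reads like a route through the tree, from the starting element down to the elements you want.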
21 The DTD or Schema
<!ELEMENT booklist (book+)>
<!ELEMENT book (booktitle,author+,country,publisher,price,year)>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT year (#PCDATA)>
+ means there can be as many of this element as you want (at least one).
The DTD establishes the hierarchy of elements/tags.
Original file content courtesy of Shaoping Moss.
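Python's standard library does not validate against a DTD, so the sketch below is only a hand-rolled approximation of what a validator checks: whether each element's children appear in the order the DTD lays out. The CONTENT_MODELS table is a manual transcription, and it assumes that booklist may hold one or more book elements and that book may hold one or more author elements. The function name structure_errors is likewise invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for the DTD's content models: each element name maps
# to a regular expression over the space-terminated names of its children.
CONTENT_MODELS = {
    "booklist": r"(book )+",
    "book": r"booktitle (author )+country publisher price year ",
}


def structure_errors(element):
    """Return messages for elements whose children break the content model."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            errors.append(
                f"<{element.tag}> has unexpected children: {children.strip()}"
            )
    for child in element:
        errors.extend(structure_errors(child))
    return errors


good = ET.fromstring(
    "<booklist><book><booktitle>T</booktitle><author>A</author><author>B</author>"
    "<country>C</country><publisher>P</publisher><price>1</price><year>2000</year>"
    "</book></booklist>"
)
bad = ET.fromstring(
    "<booklist><book><author>A</author><booktitle>T</booktitle></book></booklist>"
)

print(structure_errors(good))
print(structure_errors(bad))
```

A real validating parser does this work for you by reading the DTD directly; the point here is only to show what "conforming to the hierarchy" means in practice.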
22 The XML document
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE booklist SYSTEM "dtdforbooklist.dtd">
<?xml-stylesheet type="text/xsl" href="xslforbooklist.xsl"?>
<booklist>
  <book>
    <booktitle>HTML and XHTML: the Definitive Guide</booktitle>
    <author>Chuck Musciano</author>
    <author>Bill Kennedy</author>
    <country>USA</country>
    <publisher>O'Reilly</publisher>
    <price>19.95</price>
    <year>2000</year>
  </book>
  <book>
    <booktitle>XHTML 1.0 Language Sourcebook</booktitle>
    <author>Ian S. Graham</author>
    <country>USA</country>
    <publisher>John Wiley and Sons</publisher>
    <price>30.00</price>
    <year>2000</year>
  </book>
</booklist>
The DOCTYPE line names the DTD being used.
The xml-stylesheet line names the XSL being used.
23 The XSL document
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <body>
        <h1>My Book Collection</h1>
        <table border="1">
          <tr bgcolor="#9acd32">
            <th>Title</th>
            <th>Author</th>
            <th>Publisher</th>
            <th>Country</th>
            <th>Price</th>
          </tr>
          <xsl:for-each select="booklist/book">
            <xsl:sort select="publisher"/>
            <xsl:if test="year &gt; 1995">
              <tr>
                <td><xsl:value-of select="booktitle"/></td>
                <td><xsl:value-of select="author"/></td>
                <td><xsl:value-of select="publisher"/></td>
                <td><xsl:value-of select="country"/></td>
                <td><xsl:value-of select="price"/></td>
              </tr>
            </xsl:if>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
- xsl:template is XSLT for "use the template below".
- match is XPath for "link to" or "start with", and / means the root of the document (which holds the booklist element in this case).
- The html, body, h1, table, tr, th, and td tags are basic HTML for the template.
- xsl:for-each with the select instruction is XSLT for "select from each of the books in the booklist".
- xsl:sort with the select instruction is XSLT for "sort by publisher".
- xsl:if with the test instruction is XSLT for "only those books where the year is later than 1995".
- xsl:value-of with the select instruction is XSLT for "use the data from this element".
- You must close your XSLT commands.
- You must close the HTML tags of your template.
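For readers who think more easily in a programming language, the logic of the stylesheet can be mimicked step by step in Python. This is only an analogy built on the standard library (xml.etree.ElementTree and html.escape), not an XSLT processor. The third book in the sample data is a hypothetical entry added so that the year test has something to exclude, and render_book_table is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET
from html import escape

BOOKLIST_XML = """
<booklist>
  <book>
    <booktitle>HTML and XHTML: the Definitive Guide</booktitle>
    <author>Chuck Musciano</author>
    <author>Bill Kennedy</author>
    <country>USA</country>
    <publisher>O'Reilly</publisher>
    <price>19.95</price>
    <year>2000</year>
  </book>
  <book>
    <booktitle>XHTML 1.0 Language Sourcebook</booktitle>
    <author>Ian S. Graham</author>
    <country>USA</country>
    <publisher>John Wiley and Sons</publisher>
    <price>30.00</price>
    <year>2000</year>
  </book>
  <book>
    <booktitle>A Hypothetical Older Book</booktitle>
    <author>A. N. Author</author>
    <country>USA</country>
    <publisher>Example Press</publisher>
    <price>10.00</price>
    <year>1990</year>
  </book>
</booklist>
"""

HEADINGS = ["Title", "Author", "Publisher", "Country", "Price"]
FIELDS = ["booktitle", "author", "publisher", "country", "price"]


def render_book_table(xml_text):
    """Build the HTML page the stylesheet describes, using plain Python."""
    root = ET.fromstring(xml_text)
    books = root.findall("book")                       # like xsl:for-each
    books.sort(key=lambda b: b.findtext("publisher"))  # like xsl:sort
    rows = []
    for book in books:
        if float(book.findtext("year")) > 1995:        # like xsl:if
            # findtext returns the first matching child, as xsl:value-of does.
            cells = "".join(
                f"<td>{escape(book.findtext(field), quote=False)}</td>"
                for field in FIELDS
            )
            rows.append(f"<tr>{cells}</tr>")
    header = "".join(f"<th>{h}</th>" for h in HEADINGS)
    body_rows = "".join(rows)
    return (
        "<html><body><h1>My Book Collection</h1>"
        f'<table border="1"><tr>{header}</tr>{body_rows}</table>'
        "</body></html>"
    )


html_out = render_book_table(BOOKLIST_XML)
print(html_out)
```

One difference worth noting: Python sorts strings by character code, while an XSLT processor may apply language-aware ordering, so the two could disagree on unusual publisher names.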
24 The Web Page
25 Done! Not so hard
- Logical
- Flexible
- Extensible
- Interoperable!!
26 XML in Libraries
- Use XML to map MARC to MARC XML, HTML, or MODS formats.
  - MARC XML Conversion Stylesheets
- Use XML to improve searching of archival finding aids and to catalog Web sites.
  - Five College Archives Manuscript Collections.
  - http://asteria.fivecolleges.edu/index.html
- XML-based eScholarship.
  - http://escholarship.cdlib.org/
- Use XML for interlibrary loan.
- XML-based database systems.
27 XML in Academics
- Text Encoding Initiative (TEI)
  - http://www.tei-c.org/
- Initially launched in 1987, TEI is an international and interdisciplinary standard for encoding, storing, and analyzing the textual content and structure of digital texts.
- This standard is designed for use with a broad range of text types, especially in the humanities. It is widely used in libraries, archives, and by publishers and researchers for online research and teaching and for the storage and exchange of large and small text collections.
- Since 1987, TEI projects have mushroomed across many disciplines, including language, literature, history, classics, social science, and computer science.
28 TEI projects
- Women Writers Project.
  - http://www.wwp.brown.edu
- Perseus Digital Library.
  - http://www.perseus.tufts.edu/
- Early American Fiction Collection.
  - http://etext.lib.virginia.edu/eaf/pubindex.html
- American Memory Project: Historical Collections for the National Digital Library.
  - http://lcweb2.loc.gov/ammem/ammemhome.html
- The Newton Papers Project.
  - http://www.newtonproject.ic.ac.uk
29 XML is Going to Be Everywhere
- TEI guidelines for the Text Encoding Initiative
  - http://www.tei-c.org/Guidelines2/index.html
- EAD for Encoded Archival Description
  - http://www.loc.gov/ead/
- The Dublin Core Metadata Initiative (DCMI)
  - http://dublincore.org/
- MARC XML: MARC 21 XML Schema
  - http://www.loc.gov/standards/marcxml/
- MODS XML: Metadata Object Description Schema
  - http://www.loc.gov/standards/mods
30 XML is Going to Be Everywhere
- Resource Description Framework (RDF)
- Information and Content Exchange (ICE)
- Online Information Exchange (ONIX)
- Metadata for Images in XML (MIX)
- XML/EDI (Electronic Data Interchange)
- Bioinformatic Sequence Markup Language (BSML)
- Mathematical Markup Language (MathML)
31 XML in Future Web Development
- XML is a cross-platform, software and hardware independent tool for transmitting information.
- XML will be as important to the future of the Web as HTML has been to the foundation of the Web.
- XML will become the most common tool for all data manipulation and data transmission.
- Every serious Web technology is now expected to define its relationship to XML.
32 XML in Future Web Development
- "Every serious Web technology is now expected to define its relationship to XML."
  - Catherine Ebenezer in Trends in Integrated Library Systems.
33 Shaoping Moss
Information Technology Consultant, Research and Instructional Support, Mount Holyoke College
Email: smoss_at_mtholyoke.edu
Phone: 413.538.3034
Fax: 413.538.3112
We are grateful to Shaoping Moss for being such
an excellent instructor and giving us permission
to use her slides and materials in this
presentation.
34 So this XML stuff is rad and all, but could I see why I'd want to learn it and not just an encoding set like EAD?
35 Well, suppose you've got a batch of metadata on your hands. Not just any metadata, but some weird set of information that can't really be shoehorned into your pal MARC 21. You need some way of organizing the metadata. It would be nice if you could make the metadata look all pretty and whatnot, while you're at it.
36 Here's where XML comes in!
- Get your metadata together, having done all the sexy stuff like data dictionary creation first.
- Define labels for everything.
- Match related terms, including subordinates.
- Define your rules (Y can only appear after X, and if you have X and Y, you must have Z, but Q is optional, etc.).
- You've pretty much just made up a schema right there.
- Wait, what was that about making it pretty?
37 Oh, right, it should be attractive. Well, then you just start playing with XSL.
<LINK REL="STYLESHEET" TYPE="text/css" HREF="./games.css" TITLE="MASTER"/>
Specifically, you tell the XSL to go look at the plain ol' stylesheet you've adapted from a thousand other HTML pages.
38 So then you've got this.
39 Hey, wait. I thought you said this was all cross-platform and cross-browser. How come this isn't parsing in my browser? And how do I search individual records? You mean I have to hand encode every record?
Well, yes. You can write your own parser, export encoded records from a database, or create a search engine if you like. You'll just need more than a semester's worth of practice to do it.
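Of those options, exporting encoded records from a database is probably the gentlest starting point. The sketch below shows the general idea with Python's standard sqlite3 and xml.etree.ElementTree modules. The table name items, its columns, the records and record tags, and the sample rows are all hypothetical; a real project would use its own schema and element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_records(connection):
    """Build a <records> element with one <record> per database row."""
    root = ET.Element("records")
    cursor = connection.execute(
        "SELECT title, creator, year FROM items ORDER BY title"
    )
    for title, creator, year in cursor:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "creator").text = creator
        ET.SubElement(record, "year").text = str(year)
    return root


# An in-memory database stands in for wherever the metadata really lives.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (title TEXT, creator TEXT, year INTEGER)")
connection.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("Sample Game B", "Example Maker", 1999),
        ("Sample Game A", "Demo & Co", 2001),
    ],
)

records = export_records(connection)
xml_text = ET.tostring(records, encoding="unicode")
print(xml_text)
```

Because the library writes the tags, every record comes out closed, properly nested, and with special characters such as & escaped, which removes most of the pain of hand encoding.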