Title: Introduction to XML: A Librarian's Perspective
1Introduction to XML: A Librarian's Perspective
- Delphine Khanna, Rutgers University. Palinet, July/August 1999
2Overview of the workshop
- What is XML? How does it work?
- Why XML? What is it going to change?
- Overview of other XML-related standards.
- XML in libraries (standards projects).
- Practical skills
- Creating an XML document,
- Creating an XSL style sheet,
- Work with MS Internet Explorer 5.0.
3Workshop Web site
- http://scc01.rutgers.edu/ceth/intromat/xml/
- Contents
- This slide presentation,
- XML Samples used in this workshop,
- List of useful Web links and other resources.
4A First Look at XML
5Basics
- Simplified definition: XML is a kind of super-HTML where you can define your own tags.
6Term Clarification
- XML can be called:
- an encoding format,
- a language,
- a standard.
- We will prefer "standard".
7The XML Family: A whole family of standards
- XML,
- XSL,
- XLINK, XPOINTER,
- Namespaces,
- RDF,
- XML Schemas,
- DOM,
- and more
8XML: Who? When?
- XML family developed by W3C.
- Very recent:
- XML 1.0: February 1998.
- Namespaces: January 1999.
- RDF: February 1999.
- XLINK, XPOINTER, XSL, XML Schemas: still working drafts.
9XML: To develop the next generation of Web applications
- People want to do more sophisticated things with the Web.
- HTML is too limited for that.
- Need for a more powerful language: XML.
10XML Hype: 2 myths
- XML will replace everything.
- (HTML, back-end relational databases, etc.)
- XML is completely different from Web technologies
we had before.
11Why is XML better?
12Let's look at a typical HTML document
-
- Lines Written in Early Spring
- William Wordsworth
- I heard a thousand blended notes,
- While in a grove I sate reclined,
- In that sweet mood when pleasant thoughts
- Bring sad thoughts to the mind.
-
- To her fair works did nature link
- The human soul that through me ran
- And much it griev'd my heart to think
- What man has made of man.
13What is the problem?
- To do fancier things with documents, we need to make their logical structure explicit.
- Otherwise, software applications:
- do not know what is what,
- have no handle on the documents.
14Why XML is better: Overall
- HTML:
- Encoding too vague and messy.
- Logical structure is not clearly encoded.
- XML:
- Allows us to create clean, structured documents where the logical structure of the document is totally explicit.
15The same document in XML
-
-
- Lines Written in Early Spring
- William Wordsworth
-
- I heard a thousand blended notes,
- While in a grove I sate reclined,
- In that sweet mood when pleasant thoughts
- Bring sad thoughts to the mind.
-
-
- To her fair works did nature link
- The human soul that through me ran
- And much it griev'd my heart to think
- What man has made of man.
-
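A minimal Python sketch of what explicit structure buys a program. The element names (POEM, TITLE, AUTHOR, FIRSTNAME, LASTNAME, STANZA, LINE) are borrowed from the Tree Representation slide; their exact spelling and case in the workshop sample are assumptions, since the tags were not preserved on this slide.

```python
import xml.etree.ElementTree as ET

# Element names follow the workshop's tree slide; their case is an assumption.
POEM_XML = """<POEM>
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR><FIRSTNAME>William</FIRSTNAME> <LASTNAME>Wordsworth</LASTNAME></AUTHOR>
  <STANZA>
    <LINE>I heard a thousand blended notes,</LINE>
    <LINE>While in a grove I sate reclined,</LINE>
    <LINE>In that sweet mood when pleasant thoughts</LINE>
    <LINE>Bring sad thoughts to the mind.</LINE>
  </STANZA>
  <STANZA>
    <LINE>To her fair works did nature link</LINE>
    <LINE>The human soul that through me ran</LINE>
    <LINE>And much it griev'd my heart to think</LINE>
    <LINE>What man has made of man.</LINE>
  </STANZA>
</POEM>"""

poem = ET.fromstring(POEM_XML)

# Because the structure is explicit, a program can ask for parts by name.
title = poem.findtext("TITLE")
last_name = poem.findtext("AUTHOR/LASTNAME")
first_lines = [stanza.find("LINE").text for stanza in poem.findall("STANZA")]
line_count = len(poem.findall("STANZA/LINE"))
```

With the HTML version, the same questions ("what is the title?", "how many lines?") could only be answered by guessing from display markup.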
16Why XML is better: Reason 1
- HTML: One single fixed tag set
- , , , , etc.
- XML: You can define your own tag set
- , , .
- , , , .
- Possible to describe the logical structure exactly.
17Why XML is better: Reason 2
- HTML: Lack of syntax control. Hello with an unclosed or mismatched P tag is considered OK.
- XML: Documents have to be at least well-formed. Hello enclosed in a matching opening and closing tag is the only acceptable form.
- Code much cleaner.
18Why XML is better: Reason 3
- HTML: Logical structure and display are mixed up
- , , .
- This text is important.
- XML: Clear distinction between logical structure and display
- This text is important.
-
- Code much cleaner.
19By the way, HTML is not that bad
- HTML:
- Really simple: attractive to basic users.
- Works fine for basic Web pages.
- XML:
- Clearly more complex: will scare off basic users.
- Probably overkill for basic Web pages.
20What will XML change? Or, why do we need to make the logical structure explicit?
21Different displays for different output devices
- Regular computer screens,
- Pocket computers, Palm Pilot,
- WebTV,
- Audio (visually-impaired, cars),
- Braille,
- Print.
22Term Clarification
- The Web is based on a client/server architecture.
23Server-side: Databases should speak to each other
- A very successful model:
- relational databases on the server side.
- Next step: data integration.
- Example 1: An online bookstore,
- Example 2: Medical records,
- Example 3: An index that knows which journals are available in the library.
24XML: representing structured data
- If XML can represent structured text, it can also represent structured data.
- XML is also very good at representing mixed data seamlessly.
25XML for Interchange: Example of a converted R-DB record
-
-
- 33456
- Next day delivery
- 15
- New York City
- Pittsburgh
- 07/30/1999
- 07/31/1999
-
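A hedged sketch of how a relational row might be converted to an XML record like the one on this slide. Only the values come from the slide; the element names (`shipment`, `id`, `service`, `weight`, and so on) are invented for illustration, including the guess at what the value 15 represents.

```python
import xml.etree.ElementTree as ET

# Hypothetical column names: the slide's element names were not preserved.
row = {
    "id": "33456",
    "service": "Next day delivery",
    "weight": "15",
    "origin": "New York City",
    "destination": "Pittsburgh",
    "ship_date": "07/30/1999",
    "delivery_date": "07/31/1999",
}

def row_to_xml(table_name, row):
    """Wrap one relational row in an element, one child element per column."""
    record = ET.Element(table_name)
    for column, value in row.items():
        ET.SubElement(record, column).text = value
    return record

shipment = row_to_xml("shipment", row)
xml_text = ET.tostring(shipment, encoding="unicode")
```

The receiving system only needs an XML parser and knowledge of the element names, not access to the originating database.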
26Client-side: The Web, more than an online fax machine
- Web browsers: thin clients
- They just display documents.
- Clients can do more:
- Client workstation has a lot of unused power,
- Less strain on the network and on the server,
- Example: Viewing and sorting of a medical record.
27Client-side: The Web, more than an online fax machine (2)
- Clients can do more:
- Personalized and sophisticated processing possible.
- Processing possibly provided by 3rd-party client applications.
- Example: Bibliography manager.
28XML: The nitty-gritty details
29Term Clarification
- Element,
- Tag (opening tag / closing tag / delimiter),
- Element content,
- Attribute (name / value).
- Example
- John Smith
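The four terms can be seen on a single element. This is a small Python sketch; the `author` tag and `role` attribute are hypothetical, because the slide kept only the content "John Smith".

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the slide preserved only the content "John Smith".
snippet = '<author role="editor">John Smith</author>'
element = ET.fromstring(snippet)

tag_name = element.tag        # element name used in the opening and closing tags
content = element.text        # element content, found between the two tags
attributes = element.attrib   # attribute name/value pairs from the opening tag
```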
30Differences in Syntax between XML and HTML
- XML Declaration
- Every opening tag must have a closing tag.
- Empty tags have a different syntax.
- Tags are case-sensitive: a tag name written in a different case is a different tag.
31Two-Level Syntax Control
- XML documents can be:
- Well-formed,
- Valid.
32Syntax Control: Well-formed documents
- All XML documents must be well-formed.
- XML parsers check the well-formedness.
- Criteria of well-formedness:
- Every opening tag must have a closing tag. (Illegal: Hello with an opening tag but no closing tag.)
- No overlapping elements. (Illegal: Hello inside two elements whose tags overlap.)
- One unique root element.
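The criteria above can be tried out with any XML parser. The sketch below uses Python's standard-library parser as a stand-in for IE5; the tag names `P`, `B`, and `I` are illustrative only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a (non-validating) XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "closed": is_well_formed("<P>Hello</P>"),
    "unclosed": is_well_formed("<P>Hello"),
    "overlapping": is_well_formed("<B><I>Hello</B></I>"),
    "two roots": is_well_formed("<P>Hello</P><P>Again</P>"),
    "case differs": is_well_formed("<P>Hello</p>"),
}
```

Only the first case is accepted; the last one fails because tags are case-sensitive, so `</p>` does not close `<P>`.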
33Tree Representation
- POEM (root)
- Children of POEM: TITLE, AUTHOR, STANZA, STANZA
- Children of AUTHOR: FIRSTNAME, LASTNAME
- Children of the STANZA elements: LINE (six shown in total)
34Create your own XML document
- The cooking recipe document
- 1. Brainstorming on the structure of the document,
- 2. Creation of the document with a template.
35Editing XML Documents
- TextPad, with Internet Explorer 5 as a parser.
- Caution: IE5 comes with limitations and proprietary features.
- Alternative:
- XML editor (e.g., SoftQuad's XMetaL).
36To get started
- Create a file in TextPad and load it in IE5.
- File extension: .xml.
- Save regularly and reload in IE5.
- Begin with:
-
-
37Document Type Definitions (DTD)
38Document Type Definitions (DTD)
- Formal way of defining the tags used in a series of documents.
- A DTD:
- specifies a list of tags,
- defines the relationships between these tags.
- Allows us to create consistency across a collection of documents (e.g., 5000 poems).
39What does a DTD look like?
40Creating a DTD
- Non-trivial task.
- Higher level of expertise needed than for using a DTD:
- In-depth knowledge of XML,
- In-depth knowledge of the type of documents being described.
- Preliminary document analysis.
- A DTD can be dozens of pages long.
41Syntax Control: Valid documents
- Higher level of control than well-formed documents.
- An XML document is valid if it conforms to its DTD.
42XML DTD declaration
- Local file
-
-
- URL
-
- edu/ceth/intromat/xml/samples/poem/poem.dtd
43Validation with IE5
- When loading a document, the IE5 parser does not validate it.
- It is possible to validate a document through a script.
- It is also possible to use a separate validating parser,
- for instance, the Scholarly Technology Group's XML parser (Brown U.).
- Validating vs. non-validating parsers.
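The difference between the two kinds of parser can be illustrated with Python's standard-library parser, which, like IE5 when loading a document, is non-validating. The tiny DTD below is hypothetical and written inline so the sketch is self-contained; it is not the workshop's poem.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: a poem must contain one title followed by stanzas.
DOC = """<?xml version="1.0"?>
<!DOCTYPE poem [
  <!ELEMENT poem (title, stanza+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT stanza (#PCDATA)>
]>
<poem>
  <stanza>I heard a thousand blended notes,</stanza>
</poem>"""

# The document is well-formed but NOT valid: the required title is missing.
# A non-validating parser reads the DTD yet does not enforce it.
root = ET.fromstring(DOC)
has_title = root.find("title") is not None
```

The parse succeeds even though the title is absent. A validating parser given the same input would be expected to report the missing element.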
44Validation Strategy
- For now, the best model:
- When creating documents, use a validating parser,
- for instance, the Scholarly Technology Group's On-line XML Validating Parser (http://www.stg.brown.edu/service/xmlvalid/).
- When users download them, the parser only checks whether they are well-formed.
45Namespaces
- Need to use elements from several DTDs in the same document.
- Scheme to identify the source of each element.
- Special case: Same element name used by 2 DTDs.
46Namespace Example
- xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'
- Cheaper by the Dozen
- 1568491379
- This is a funny book
-
- Note: Adapted from an example in the Namespaces recommendation.
47Namespace Example (2): Default Namespace
- xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'
- Cheaper by the Dozen
- 1568491379
- This is a funny book
-
- Note: Adapted from an example in the Namespaces recommendation.
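A short Python sketch of how a namespace-aware parser sees prefixed elements. The namespace URIs and the `bk` prefix are placeholders chosen for illustration; only the book title and number come from the slides.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs; only the element content comes from the slide.
DOC = """<bk:book xmlns:bk="urn:example:books"
         xmlns:isbn="urn:example:isbn">
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
</bk:book>"""

root = ET.fromstring(DOC)

# Prefixes are only shorthand; the parser reports {namespace-URI}local-name.
ns = {"b": "urn:example:books", "i": "urn:example:isbn"}
title = root.findtext("b:title", namespaces=ns)
number_tag = root.find("i:number", ns).tag
```

Note that the query uses different prefixes (`b`, `i`) from the document (`bk`, `isbn`) and still matches: the URI identifies the source of each element, not the prefix.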
48More good things about XML
49Positive side-effects of XML (1)
- XML fosters the development of community-based standards.
- Concept of 2-level standard very powerful:
- XML: universal,
- DTDs: community-specific.
- Now developing a new standard amounts to writing a DTD.
- Much easier than starting from scratch.
- E.g., Xlit.
50Positive side-effects of XML (2)
- Widespread standards are stronger than those used by a limited community (regardless of their intrinsic value).
- HL7 -- XML.
- Easier to hire programmers.
- More documentation available.
- Actively maintained by a very large base of people.
51Positive side-effects of XML (3)
- A set of standards bundled together is stronger than an isolated one.
- Likely to appeal to more people (the Microsoft Office idea).
- The standards reinforce each other.
52Stylesheet Languages for XML
53Stylesheet Languages for XML
- Specify how to display logical elements.
- XML supports 2 stylesheet languages:
- CSS:
- Quite limited,
- But eases the transition from HTML to XML.
- XSL:
- Very powerful,
- Still a working draft.
54Extensible Stylesheet Language (XSL)
- 2 Parts:
- Transformations
- Transform the XML document (reorder, hide, add elements).
- Formatting Objects (FO)
- Attach formatting properties to XML elements.
55XSL in IE 5.0
- Supports transformations but not the FO.
- Trick: transform XML DTD-specific elements into HTML elements.
- Convenient because everybody knows HTML.
56XSL-to-HTML Stylesheets: Syntax
- Style Sheet Excerpt
-
- XML Document Excerpt
- Mary Brown
- Easy Cooking
-
- John Smith
- 101 Recipes
-
- Sue Meyer
- Italian Cuisine
-
- HTML Output
- Mary Brown Easy Cooking
- John Smith 101 Recipes
- Sue Meyer Italian Cuisine
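The idea of a template that maps DTD-specific elements to HTML elements can be emulated in a few lines of Python. This is not XSL: it is a stand-in showing the same mapping, and the element names (`books`, `book`, `author`, `title`) and the chosen HTML tags are assumptions, since the slide preserved only the authors and titles.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical element names; the slide preserved only the authors and titles.
BOOKS_XML = """<books>
  <book><author>Mary Brown</author><title>Easy Cooking</title></book>
  <book><author>John Smith</author><title>101 Recipes</title></book>
  <book><author>Sue Meyer</author><title>Italian Cuisine</title></book>
</books>"""

def book_to_html(book):
    """Play the role of one template: map a book element to an HTML paragraph."""
    author = escape(book.findtext("author"))
    title = escape(book.findtext("title"))
    return f"<p><b>{author}</b> <i>{title}</i></p>"

html_lines = [book_to_html(b) for b in ET.fromstring(BOOKS_XML).findall("book")]
```

An XSL stylesheet expresses the same rule declaratively, one template per element type, instead of as a function.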
57Beginning of an XSL-to-HTML Stylesheet
58Example of XSL-to-HTML Stylesheet
59Declaring an XSL Stylesheet in an XML document
- Just after the XML declaration (and the DTD declaration if there is one).
- Local file:
- <?xml-stylesheet type="text/xsl" href="poem.xsl"?>
- URL:
- <?xml-stylesheet type="text/xsl" href="http://scc01.rutgers.edu/ceth/intromat/xml/samples/poem/poem.xsl"?>
60Creating your own Stylesheet
- The XSL-to-HTML recipe stylesheet
- XSL stylesheets can be tricky.
- Always use another stylesheet as a model.
- Name the file recipe.xsl.
- Make sure to declare it in the XML document.
- <?xml-stylesheet type="text/xsl" href="recipe.xsl"?>
- Always add one template at a time, and reload in
IE5 to make sure it works.
61Recipe Stylesheet Step 1
62Recipe Stylesheet Step 2
63Recipe Stylesheet Step 3
64Recipe Stylesheet Step 4
65Recipe Stylesheet Step 5
66XML Formatting Objects: Example
-
- ,255)
- font-size16pt
-
-
-
- Note: Adapted from a stylesheet created by Lynn Lobash.
67Some Other XML-related Standards
68Linking Standards
- HTML links:
- Really primitive and limited.
- Linking standards for XML:
- Much more powerful.
- 2 parts:
- XLink (a.k.a. XLL),
- XPointer (a.k.a. XLP).
- Still working drafts.
69XLink
- To define links to one or several documents.
- 2 types of links
- Simple,
- Extended.
70XLink: Simple link
- Example:
- href="poem1.xml" Go to related poem
- Other attributes / alternative values:
- inline: true, false (link to same document vs. outside).
- show: replace, new, embed.
- actuate: user, auto.
- title (a caption).
- Similar to HTML links, but slightly fancier.
71XLink: Extended Link
- One link, several targets.
- For instance, the link "See related poems" would open as a list of links in a pop-up window.
72XLink: Extended Link (2)
- Example:
- title="See related poems"
- title="Blue Mountains" href="poem1.xml"/
- title="Pink Flowers" href="poem2.xml"/
- title="Sea of Green" href="poem3.xml"/
73XPointer
- To define links that target points within documents.
- Special language to explain which spot is targeted.
- In HTML:
- Need to manually insert an anchor tag in the target document.
- Hence need to own the document.
- With XPointer:
- No need to add anything to the target document.
74XPointer (2)
- Example:
- href="poem1.xml#root().child(2)" Go to related poem
- Other possibilities:
- root().child(3).child(4)
- id(poem273)
- root().descendant(2, stanza)
- root().string(1, my heart)
- span(root().child(3), root().child(5))
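To get a feel for what expressions such as root().child(3).child(4) point at, here is a rough Python emulation. It assumes that child(n) means "the n-th child element, counting from 1" and reuses the assumed poem markup from the tree slide; it does not implement the XPointer working draft itself.

```python
import xml.etree.ElementTree as ET

# Assumed poem markup (element names taken from the tree slide).
POEM = ET.fromstring("""<POEM>
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR><FIRSTNAME>William</FIRSTNAME><LASTNAME>Wordsworth</LASTNAME></AUTHOR>
  <STANZA>
    <LINE>I heard a thousand blended notes,</LINE>
    <LINE>While in a grove I sate reclined,</LINE>
    <LINE>In that sweet mood when pleasant thoughts</LINE>
    <LINE>Bring sad thoughts to the mind.</LINE>
  </STANZA>
  <STANZA>
    <LINE>To her fair works did nature link</LINE>
  </STANZA>
</POEM>""")

def child(node, n):
    """Emulate child(n): the n-th child element, counting from 1."""
    return list(node)[n - 1]

second = child(POEM, 2)             # root().child(2)
target = child(child(POEM, 3), 4)   # root().child(3).child(4)
```

The target document is untouched: the location is computed from its existing structure, which is the point made on the previous slide.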
75Unicode
- Default character encoding for XML.
- Great improvement for encoding of non-Western languages:
- more than 65,000 characters,
- Eventually will represent all alphabets and writing systems,
- Also includes special typographic characters (¼).
76SGML, XML, HTML: What is the difference?
- XML: SGML slightly simplified.
- HTML: just an SGML DTD.
- Can be easily converted to an XML DTD.
- Relationship:
- XML and SGML are meta-languages,
- HTML is a language.
77Searching XML Documents
78Models for XML Repositories: Flat file system
- A bunch of XML documents in a folder.
- Native XML search engine:
- an XML-aware Web site search engine.
- XML Query Language (XQL):
- Still in development
- Find the word "milk" only when it appears in attribute DIETINFO2 of element PRODUCT.
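The kind of structure-aware query described here can be approximated in Python while XQL is still in development. The catalog below is hypothetical; only the names PRODUCT and DIETINFO2 and the word "milk" come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; only PRODUCT, DIETINFO2 and "milk" come from the slide.
CATALOG = ET.fromstring("""<CATALOG>
  <PRODUCT NAME="Chocolate bar" DIETINFO2="contains milk and soy"/>
  <PRODUCT NAME="Milk jug" DIETINFO2="keep refrigerated"/>
  <PRODUCT NAME="Oat biscuits" DIETINFO2="may contain traces of milk"/>
  <NOTE>Order more milk.</NOTE>
</CATALOG>""")

def products_mentioning(root, word):
    """Match the word only inside the DIETINFO2 attribute of PRODUCT elements."""
    return [
        product.get("NAME")
        for product in root.iter("PRODUCT")
        if word in product.get("DIETINFO2", "").lower().split()
    ]

hits = products_mentioning(CATALOG, "milk")
```

A plain full-text search would also return the "Milk jug" product and the NOTE element; restricting the match to one attribute of one element type is what an XML-aware search adds.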
79Models for XML Repositories: Regular relational databases
- E.g., Web-based OPACs, Ovid, Amazon.
- Back-end relational DBMS:
- MS Access or Oracle, for instance.
- Web interface:
- uses scripts like CGI or Cold Fusion,
- Easy to change the scripts to output XML instead of HTML,
- Can even produce XML or HTML according to the capabilities of the requesting browser.
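A minimal sketch of a script that outputs either XML or HTML from the same database row. The `render` function, the `client_reads_xml` flag, and the field names are all hypothetical; how a real script would detect the browser's capabilities is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET
from html import escape

def render(record, client_reads_xml):
    """Serialize one database row as XML or as an HTML table row."""
    if client_reads_xml:
        book = ET.Element("book")
        for field, value in record.items():
            ET.SubElement(book, field).text = value
        return ET.tostring(book, encoding="unicode")
    cells = "".join(f"<td>{escape(value)}</td>" for value in record.values())
    return f"<tr>{cells}</tr>"

record = {"author": "Mary Brown", "title": "Easy Cooking"}  # sample row
as_xml = render(record, client_reads_xml=True)
as_html = render(record, client_reads_xml=False)
```

The back-end database and the query stay the same; only the last serialization step changes.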
80Models for XML Repositories: XML-aware relational DBs
- Benefit from both R-Database and XML advantages.
- Mixed record:
- Nested structured text is difficult to map to an R-DB.
- However, many structured texts have a table-like section (the bibliographic information).
- R-Databases are a very mature technology (data integrity, security, load balancing, etc.).
81Models for XML Repositories: XML-aware relational DBs (2)
- Example of Oracle:
- Enhanced full-text capabilities:
- indexing,
- truncations, stemming, thesaurus, etc.,
- XML-like searching,
- can create SQL queries with embedded XML subqueries.
- Automatic mapping:
- R-DB record -- XML document,
- XML document -- R-DB record,
- Virtual flat file system.
82Information Retrieval Standard for XML
- Needed to implement cross-repository search:
- To query across several XML servers seamlessly,
- Whatever the implementation on the server side (flat file system, R-DBMS, etc.).
83Information Retrieval Standard: Z39.50
- Used in the library community.
- To query OPACs, indexes, etc.
- Possible to specify:
- A query language,
- The format of the results,
- A session protocol.
84Information Retrieval Standard: Z39.50 and XML
- Currently beginning to integrate XML:
- Defined as a possible output format,
- Some proposals to use XML as an alternative to BER for the overall Z39.50 syntax.
- Once XQL is stabilized, it could be ported to Z39.50.
- Good candidate to become the IR standard for XML.
- Little known outside the library community.
85XML in Libraries
86Which library projects are already using XML/SGML?
- Mostly academic institutions.
- (as well as Library of Congress and NYPL.)
- Usually in SGML.
- (Very recent ones in XML.)
- Mostly
- large and long-term digitization projects,
- involving the digitization of numerous texts.
- Converted to HTML on-the-fly.
87Text Encoding Initiative (TEI)
- Standard to encode primary sources in the Humanities.
- SGML-based. (It is an SGML DTD.)
- Currently being converted to XML.
- Maintained by the TEI Consortium.
- Widely adopted in the Humanities computing community.
- Has spread to libraries.
88Examples of TEI Projects
- Special collections:
- Library of Congress's American Memory Project,
- Literary texts:
- U. of Virginia's E-text Collection,
- Brown's Women Writers Project,
- Historical editions (MEP DTD):
- Abraham Lincoln Papers,
- Susan B. Anthony Papers.
89Encoded Archival Description (EAD)
- Finding Aids to Special Collections and Archives.
- SGML/XML-based standard. (It is a DTD.)
- Maintained by the Library of Congress.
- Widely adopted.
90Examples of EAD Projects
- Among many others: California Digital Library's Online Archive of California.
- Union database: RLG's Archival Resources Project
- (MARC AMC records and EAD finding aids).
91Materials Used by Libraries
- Reference materials:
- Oxford English Dictionary,
- American National Biography,
- Electronic journals:
- Springer-Verlag's Link.
92XML in Libraries: What will it change? (1)
- EAD finding aids:
- Offer precise and controlled search capabilities,
- Make the creation of union databases possible.
93XML in Libraries: What will it change? (2)
- Full-text databases of primary sources:
- Easy to search, display, etc.
- Next step, union databases.
- With precise and controlled search capabilities.
- Full-text databases of e-journals, monographs:
- Competition with PDF/page images, though.
- Again next step, union databases.
94XML in Libraries: What will it change? (3)
- More sophisticated and customized clients:
- Bibliography manager,
- Concordance program.
- New library standards based on XML:
- TEI, EAD
- MARC (!)
- XML is not just a fad: more than 10 years of SGML-based TEI.
95XML in Libraries: What will it change? (4)
- XML is more likely than any other format to resist obsolescence:
- Platform independent,
- Open standard
- (not proprietary),
- Written in ASCII/Unicode plain text
- (no binary encoding, the simplest text editor can read it),
- Tags are human-readable.
96Should you use XML in your project today?
- Are your data made of a repetition of similar objects? (e.g., 3000 poems)
- Is your project database-based?
- Is your project large?
- Do you plan to:
- deliver to different output devices?
- integrate your project with others? (e.g., union database)
- develop advanced capabilities? (server-side or client-side)