Title: CSE 636 Data Integration
1CSE 636 Data Integration
XML, Semistructured Data, Document Type Definitions
2Semistructured Data
- Another data model, based on trees
- Motivation: flexible representation of data
  - Often, data comes from multiple sources with differences in notation, meaning, etc.
- Motivation: sharing of documents among systems and databases
3Graphs of Semistructured Data
- Nodes = objects
- Labels on arcs (attributes, relationships)
- Atomic values at leaf nodes (nodes with no arcs out)
- Flexibility: no restriction on
  - Labels out of a node
  - Number of successors with a given label
4Example Data Graph
[Figure: example data graph rooted at root. Arc labels: beer, beer, bar, manf, manf, prize, name, name, name, year, award, servedAt, addr. Leaf values: A.B., Bud, Gold, 1995, M'lob, Maple, Joe's.]
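The data-graph model can be sketched in a few lines of Python. This is only an illustration: the `Node` class and its method names are hypothetical, and the arc structure below is an assumption loosely based on the labels in the example graph, not a reconstruction of the figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Node:
    """An object in the data graph; arcs are (label, target) pairs."""
    arcs: List[Tuple[str, Union["Node", str, int]]] = field(default_factory=list)

    def add(self, label, target):
        """Attach an arc with the given label; the target is a Node or an atomic value."""
        self.arcs.append((label, target))
        return self

    def successors(self, label):
        """All targets reachable over arcs carrying the given label."""
        return [target for (arc_label, target) in self.arcs if arc_label == label]


# Assumed sample data: two beers, only one of which has a prize.
bud = Node().add("name", "Bud").add("manf", "A.B.")
bud.add("prize", Node().add("year", 1995).add("award", "Gold"))
mlob = Node().add("name", "M'lob").add("manf", "A.B.")
root = Node().add("beer", bud).add("beer", mlob)

# No restriction on the number of successors with a given label:
beer_names = [beer.successors("name")[0] for beer in root.successors("beer")]
# No restriction on which labels leave a node: only one beer has a prize arc.
prize_counts = [len(beer.successors("prize")) for beer in root.successors("beer")]
```

Nothing in the structure forces the two beer objects to carry the same set of labels, which is the flexibility the model is after.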
5XML
- HTML
  - Uses tags for formatting the presentation (e.g., "italic")
  - Hard for applications to process
- XML = Extensible Markup Language
  - Uses tags for semantics (e.g., "this is an address")
  - Similar to labels in semistructured data
  - Allows you to invent your own tags
  - Easy for applications to process
6HTML → XML
HTML:
<html>
  <body>
    <h1>Bibliography</h1>
    <p><i>Foundations of Databases</i>
       Abiteboul, Hull, Vianu <br/>
       Addison Wesley, 1995</p>
    <p><i>Data on the Web</i>
       Abiteboul, Buneman, Suciu <br/>
       Morgan Kaufmann, 1999</p>
  </body>
</html>
XML:
<?xml version="1.0" standalone="yes" ?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
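To illustrate why the XML version is easier for an application to process, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `list_books` is made up for this example; the document is the bibliography shown above.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""


def list_books(xml_text):
    """Return one dict per <book>, using the tags as field names."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        books.append({
            "title": book.findtext("title"),
            "authors": [author.text for author in book.findall("author")],
            "year": int(book.findtext("year")),
        })
    return books


books = list_books(BIB_XML)
```

Getting the same fields out of the HTML version would mean guessing that italic text is a title and that the text before a line break is an author list.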
7Why XML is of Interest to Us
- XML is just syntax for data
  - Note: we have no syntax for relational data
  - But XML is not relational: it is semistructured
- This is exciting because
  - Can translate any data to XML
  - Can ship XML over the Web (HTTP, SOAP)
  - Can input XML into any application
  - Thus: data sharing and exchange on the Web
8XML Data Sharing and Exchange
[Figure: data sharing and exchange diagram. Labels: Relational DB, XML DB, Warehouse, Web Site, Web Service, Applications, XML Data, Transform, Integrate, Web (HTTP, SOAP).]
9XML Tags Elements
- Tags: book, title, author, ...
- XML tags are case sensitive
- Tags, as in HTML, are normally matched pairs
  - <book> ... </book>
  - Start tag: <book>, End tag: </book>
- Elements: everything between tags
  - Example 1: <title>Foundations of Databases</title>
  - Example 2: <book> <title>Foundations of Databases</title> </book>
- Elements may be nested arbitrarily
- Empty element: <book></book>
  - Abbreviation: <book/>
10XML Attributes
<book price="55" currency="USD">
  <title>Foundations of Databases</title>
  <author>Abiteboul</author>
  ...
  <year>1995</year>
</book>
- Attributes are alternative ways to represent data
11Replacing Attributes with Elements
<book>
  <title>Foundations of Databases</title>
  <author>Abiteboul</author>
  ...
  <year>1995</year>
  <price>55</price>
  <currency>USD</currency>
</book>
12Elements vs. Attributes
- Too many attributes make documents hard to read
- Attributes do not specify document structure
- Attributes are good for simple information
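The two encodings of price and currency carry the same information, and an application can read either one. A small sketch, assuming Python's standard `xml.etree.ElementTree`; `price_of` is a hypothetical helper and the book documents are shortened versions of the examples above.

```python
import xml.etree.ElementTree as ET

# Price and currency stored as attributes of <book>.
with_attrs = ET.fromstring(
    '<book price="55" currency="USD">'
    '<title>Foundations of Databases</title></book>'
)
# The same data stored as child elements.
with_elems = ET.fromstring(
    '<book><title>Foundations of Databases</title>'
    '<price>55</price><currency>USD</currency></book>'
)


def price_of(book):
    """Return (price, currency) whether stored as attributes or as child elements."""
    if "price" in book.attrib:
        return book.get("price"), book.get("currency")
    return book.findtext("price"), book.findtext("currency")
```

Attribute values are flat strings, while a child element could later grow structure of its own, which is one reason attributes suit only simple information.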
13More XML CDATA Section
- Syntax: <![CDATA[ ...any text here... ]]>
- Example:
  <example> <![CDATA[ some text here </notAtag> <>]]> </example>
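A short check of what a CDATA section buys you, sketched with Python's standard `xml.etree.ElementTree`. The helper name `parses` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

cdata_doc = "<example><![CDATA[ some text here </notAtag> <>]]></example>"
plain_doc = "<example> some text here </notAtag> <></example>"

# Inside CDATA, markup characters are taken literally as text.
cdata_text = ET.fromstring(cdata_doc).text


def parses(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

The same characters outside a CDATA section are read as (broken) markup, so `parses(plain_doc)` is false.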
14More XML Entity References
- Syntax: &entityname;
- Example: <element> this is less than &lt; </element>
- Some entities:
  &lt;     <
  &gt;     >
  &amp;    &
  &apos;   '
  &quot;   "
  &#38;    Unicode char
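Parsers expand entity references on input, and serializers produce them on output. A minimal sketch with Python's standard library; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser replaces entity references with the characters they stand for.
parsed = ET.fromstring("<element> this is less than &lt; </element>").text
numeric = ET.fromstring("<e>&#38;</e>").text  # numeric character reference

# Writing: escape() replaces &, < and > in text destined for a document.
escaped = escape("a < b & c")

# ElementTree applies the same escaping when it serializes text content.
elem = ET.Element("element")
elem.text = "1 < 2 & 3 > 2"
serialized = ET.tostring(elem, encoding="unicode")
```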
15More XML Comments
- Syntax: <!-- .... Comment text... -->
- Yes, they are part of the data model!
16XML Semantics a Tree !
<data>
  <person age="25">
    <name>Mary</name>
    <address>
      <street>Maple</street>
      <no>345</no>
      <city>Seattle</city>
    </address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>
[Figure: the corresponding tree. The root data has two person children. The first person has age (25), name (Mary), and address, whose children are street (Maple), no (345), and city (Seattle). The second person has name (John), address (Thai), and phone (23456).]
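The tree reading of a document can be made concrete by walking the parsed structure. The sketch below uses Python's standard `xml.etree.ElementTree`; the `outline` function and its output format (attributes marked with `@`, text shown quoted) are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

DATA_XML = """<data>
  <person age="25">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>"""


def outline(elem, depth=0):
    """Return the tree as indented lines: tag, then attributes, text, and children."""
    lines = ["  " * depth + elem.tag]
    for name, value in elem.attrib.items():
        lines.append("  " * (depth + 1) + "@" + name + " = " + value)
    text = (elem.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DATA_XML))
```

Printing `"\n".join(tree_lines)` gives an indented outline with `data` at the root and the atomic values at the leaves.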
17Well-Formed XML
- Start the document with a declaration, surrounded by <?xml ... ?>
- Normal declaration is:
  - <?xml version="1.0" standalone="yes" ?>
  - Standalone = "no DTD provided"
- Has single root element surrounding nested elements
- Has matching tags
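A quick way to test well-formedness is to hand the text to a parser and see whether it raises an error. The sketch below uses Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper name. This parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as a single, properly nested XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    "nested": is_well_formed("<a><b/></a>"),        # single root, matching tags
    "unclosed": is_well_formed("<a><b></a>"),       # <b> is never closed
    "two_roots": is_well_formed("<a/><b/>"),        # more than one root element
    "case_mismatch": is_well_formed("<A></a>"),     # tags are case sensitive
}
```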
18XML Data
- XML is self-describing
- Schema elements become part of the data
  - Relational schema: person(name, phone)
  - In XML, <person>, <name>, <phone> are part of the data, and are repeated many times
- Consequence: XML is much more flexible
- XML = semistructured data
- Well-Formed XML with nested tags is exactly the same idea as trees of semistructured data
- XML also enables nontree structures, as does the semistructured data model
19XML is Semistructured Data
- Missing attributes
- Could represent in a table with nulls
<person> <name>John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>    ← no phone!

name   phone
John   1234
Joe    -
20XML is Semistructured Data
- Repeated attributes
- Impossible in tables
<person>
  <name>Mary</name>
  <phone>2345</phone>
  <phone>3456</phone>    ← two phones!
</person>

name   phone
Mary   2345  3456   ???
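Missing and repeated attributes are both easy to handle once each person is read as a list of phones rather than a single column. A minimal sketch with Python's standard `xml.etree.ElementTree`; the `<people>` wrapper element and the `phone_rows` name are assumptions added so the three persons form one document.

```python
import xml.etree.ElementTree as ET

PEOPLE_XML = """<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>"""


def phone_rows(xml_text):
    """Return (name, [phones]) per person; the list may be empty or hold several values."""
    rows = []
    for person in ET.fromstring(xml_text).findall("person"):
        phones = [phone.text for phone in person.findall("phone")]
        rows.append((person.findtext("name"), phones))
    return rows


rows = phone_rows(PEOPLE_XML)
```

A flat table would need a null for Joe and has no natural place for Mary's second phone.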
21XML is Semistructured Data
- Attributes with different types in different objects
- Nested collections (no 1NF)
- Heterogeneous collections
  - <db> contains both <book>s and <publisher>s
<person>
  <name>    ← structured name!
    <first>John</first>
    <last>Smith</last>
  </name>
  <phone>1234</phone>
</person>
22Document Type Definition (DTD)
- Part of the original XML specification
- An XML document may have a DTD
- Valid XML: a document is valid if it has a DTD and conforms to it
- Validation is useful in data exchange
23Very Simple DTD
<!DOCTYPE db [
  <!ELEMENT db ((book|publisher)*)>
  <!ELEMENT book (title,author,year?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
24DTD The Content Model
- Content model: <!ELEMENT tag (CONTENT)>
  - Complex = a regular expression over other elements
  - Text-only = #PCDATA
  - Empty = EMPTY
  - Any = ANY
  - Mixed content = (#PCDATA | A | B | C)*
25DTD Regular Expressions
Construct      DTD
sequence       <!ELEMENT name (firstName, lastName)>
optional       <!ELEMENT name (firstName?, lastName)>
zero or more   <!ELEMENT person (name, phone*)>
one or more    <!ELEMENT person (name, phone+)>
alternation    <!ELEMENT person (name, (phone|email))>
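Because a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The following is a rough sketch under stated assumptions, not a DTD validator: it handles element-only content models (no #PCDATA, EMPTY, or ANY), and the function names `content_model_to_regex` and `children_conform` are invented for this example.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate an element-only content model into a compiled Python regex.

    The regex is matched against the child tags written one after another,
    each followed by a single space.
    """
    compact = "".join(model.split())  # drop whitespace inside the model
    # Each element name becomes a group matching "name " (name plus separator).
    pattern = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda m: "(?:%s )" % re.escape(m.group(0)),
        compact,
    )
    # ',' means sequence in a DTD; in a regex, sequence is plain concatenation.
    # The operators | * + ? and the parentheses mean the same in both notations.
    return re.compile(pattern.replace(",", ""))


def children_conform(elem, model):
    """True if the sequence of elem's child tags matches the content model."""
    child_tags = "".join(child.tag + " " for child in elem)
    return content_model_to_regex(model).fullmatch(child_tags) is not None


two_phones = ET.fromstring("<person><name/><phone/><phone/></person>")
no_phone = ET.fromstring("<person><name/></person>")
with_email = ET.fromstring("<person><name/><email/></person>")
```

For instance, `children_conform(two_phones, "(name, phone*)")` holds, while the same element fails against `"(name, phone?)"` because the optional operator allows at most one phone.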
26DTD Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED
                 height CDATA #IMPLIED>

<person age="25" height="6"> <name> ... </name> ... </person>
27DTD Attributes
- <!ATTLIST tag (name type kind)+>
- Types
  - CDATA = string
  - (Mon | Wed | Fri) = enumeration
  - ID = key
  - IDREF = foreign key
  - IDREFS = foreign keys separated by space
  - others = rarely used
- Kind
  - #REQUIRED
  - #IMPLIED = optional
  - value = default value
  - #FIXED value = the only value allowed
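As a small illustration of the kinds, the sketch below checks only the #REQUIRED rule against a hand-written table of declarations. The dictionary layout and the name `missing_required` are assumptions for this example; Python's standard parsers do not perform this check themselves.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for:
#   <!ATTLIST person age CDATA #REQUIRED height CDATA #IMPLIED>
PERSON_ATTLIST = {
    "age": ("CDATA", "#REQUIRED"),
    "height": ("CDATA", "#IMPLIED"),
}


def missing_required(elem, attlist):
    """Names of #REQUIRED attributes that the element does not carry."""
    return [
        name
        for name, (_type, kind) in attlist.items()
        if kind == "#REQUIRED" and name not in elem.attrib
    ]


complete = ET.fromstring('<person age="25" height="6"/>')
no_height = ET.fromstring('<person age="25"/>')
no_age = ET.fromstring('<person height="6"/>')
```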
28XML IDs and References
- Attributes can be pointers from one object to another
  - Compare to HTML's NAME = "foo" and HREF = "#foo"
- Allows the structure of an XML document to be a general graph, rather than just a tree
29XML Creating IDs
- Give an element E an attribute A of type ID
- When using tag <E> in an XML document, give its attribute A a unique value
- Example:
  - <E A = "xyz">
30XML Creating References
- To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF
- Or, let the attribute have type IDREFS, so the F object can refer to any number of other objects
31XML IDs and References
<person id="o555">
  <name>Jane</name>
</person>
<person id="o456">
  <name>Mary</name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456">
  <name>John</name>
</person>
- IDs and references in XML are just syntax
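Since the references are just attribute values, following them is up to the application. A minimal sketch with Python's standard `xml.etree.ElementTree`: the `<people>` wrapper, the `by_id` index, and `name_of` are assumptions made for this example, and the parser itself attaches no special meaning to `id` or `idref`.

```python
import xml.etree.ElementTree as ET

PEOPLE_XML = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(PEOPLE_XML)

# Index every person by its ID value so references can be followed.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(person_id):
    """Follow a reference and return the name of the person it points to."""
    return by_id[person_id].findtext("name")


# An IDREFS-style value is a space-separated list of IDs.
mary = by_id["o456"]
children_names = [name_of(ref) for ref in mary.find("children").get("idref").split()]

# An IDREF-style value is a single ID.
mother_of_john = name_of(by_id["o123"].get("mother"))
```

With the index in place the document behaves like a graph: John points to Mary, and Mary points back to John and Jane.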
32DTD ID and IDREF(S) Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED
                 id ID #REQUIRED
                 manager IDREF #REQUIRED
                 manages IDREFS #REQUIRED>

<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> .... </name>
  ...
</person>
33Use of DTDs
- Set standalone = "no"
- Either:
  - Include the DTD as a preamble of the XML document, or
  - Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found, or
  - Mix the two... (e.g. to override the external definition)
34Example (a)
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> ...
</BARS>
35Example (b)
- Assume the BARS DTD is in file bar.dtd
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> ...
</BARS>
36DTDs as Grammars
<!DOCTYPE db [
  <!ELEMENT db ((book|publisher)*)>
  <!ELEMENT book (title,author,year?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
37DTDs as Grammars
- Same thing as:
  db        ::= (book|publisher)*
  book      ::= (title,author,year?)
  title     ::= string
  author    ::= string
  year      ::= string
  publisher ::= string
- A DTD is an EBNF (Extended BNF) grammar
- An XML tree is precisely a derivation tree
- A valid XML document = a parse tree for that grammar
38DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> ... </section>
    <section> ... </section>
  </section>
</paper>
- XML documents can be nested arbitrarily deep
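The recursive section rule puts no bound on nesting depth. The sketch below builds papers with a chosen number of nested sections and measures the depth of the resulting tree, using Python's standard `xml.etree.ElementTree`. Both helper names are hypothetical, and the generated documents are only intended to follow the paper DTD's shape; nothing here validates them.

```python
import xml.etree.ElementTree as ET


def nested_sections(n):
    """Build a <paper> whose sections are nested n levels deep (n >= 1)."""
    inner = "<section><text/></section>"  # innermost section holds text
    for _ in range(n - 1):
        # Every enclosing section has a title followed by a subsection.
        inner = "<section><title/>" + inner + "</section>"
    return "<paper>" + inner + "</paper>"


def depth(elem):
    """Number of elements on the longest root-to-leaf path."""
    return 1 + max((depth(child) for child in elem), default=0)


shallow = depth(ET.fromstring(nested_sections(1)))
deep = depth(ET.fromstring(nested_sections(5)))
```

Each extra level of sections adds one to the depth, so a paper with n nested sections has depth n + 2 (the paper element, the sections, and the innermost text).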
39DTDs as Schemas
- Not so well suited:
  - impose unwanted constraints on order
    - <!ELEMENT person (name,phone)>
  - references cannot be constrained
    - ID/IDREFS can reference any ID
  - can be too vague
    - <!ELEMENT person ((name|phone|email)*)>
40DTDs as Schemas
- No context-dependent typing
  - Cannot distinguish between used car ads and new car ads
  - Different structure in different contexts
[Figure: tree rooted at dealer with children UsedCars and NewCars, each containing an ad. Remaining node labels: model, year, year.]
41XML APIs
- Document Object Model - DOM
  - Manipulation of XML Data
  - Provides a representation of an XML Document as a tree
  - Reads XML Document into memory
  - http://www.w3.org/DOM
  - Many implementations (Sun JAXP, Apache Xerces, ...)
- Simple API for XML - SAX
  - Event-based framework for parsing XML data
  - http://www.saxproject.org/
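The difference between the two API styles shows up even on a tiny document. The sketch below reads the same prices once through a DOM tree and once through SAX events, using Python's standard `xml.dom.minidom` and `xml.sax` modules rather than the Java implementations named above. The `PriceHandler` class is an illustrative assumption and the document is a shortened version of the BARS example.

```python
import xml.dom.minidom
import xml.sax

BARS_XML = """<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

# DOM: the whole document is read into memory as a tree, then navigated.
dom = xml.dom.minidom.parseString(BARS_XML)
dom_prices = [float(p.firstChild.data) for p in dom.getElementsByTagName("PRICE")]


# SAX: the parser fires events; the handler keeps only the state it needs.
class PriceHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.buffer = []
        self.prices = []

    def startElement(self, name, attrs):
        if name == "PRICE":
            self.in_price = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until the end tag.
        if self.in_price:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "PRICE":
            self.prices.append(float("".join(self.buffer)))
            self.in_price = False


handler = PriceHandler()
xml.sax.parseString(BARS_XML.encode("utf-8"), handler)
sax_prices = handler.prices
```

The DOM version is shorter but holds the entire tree in memory; the SAX version never builds a tree, which matters for documents too large to load at once.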
42References
- Lecture Slides
  - Jeffrey D. Ullman
    - http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html
  - Dan Suciu
    - http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm
    - http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
  - Alon Levy
    - http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt
- BRICS XML Tutorial
  - A. Moeller, M. Schwartzbach
  - http://www.brics.dk/~amoeller/XML/index.html
- W3C's XML homepage
  - http://www.w3.org/XML
- XML School: an XML tutorial
  - http://www.w3schools.com/xml