Title: Normal Forms, XML, and Schemas
1Normal Forms, XML, and Schemas
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- October 13, 2005
Some slide content courtesy of Susan Davidson
Raghu Ramakrishnan
2Announcements
- Homework 3 due Wednesday _at_ noon
- We will post a solution set shortly afterwards
- (No late turn-ins!)
- Midterm Thursday 10/20
- Project groups (and optionally proposals) due
10/20 - We are aiming for a recitation section for the
RSS project around 10/24-10/26
3Two Important Normal Forms
- Boyce-Codd Normal Form (BCNF). For every
relation scheme R and for every X ? A that holds
over R, - either A ? X (it is trivial) ,or
- or X is a superkey for R
- Third Normal Form (3NF). For every relation
scheme R and for every X ? A that holds over R, - either A ? X (it is trivial), or
- X is a superkey for R, or
- A is a member of some key for R
4Normal Forms Compared
- BCNF is preferable, but sometimes in conflict
with the goal of dependency preservation - Its strictly stronger than 3NF
- Lets see algorithms to obtain
- A BCNF lossless join decomposition
(nondeterministic) - A 3NF lossless join, dependency preserving
decomposition
5BCNF Decomposition Algorithm(from Korth et al.
our book gives a recursive version)
result R compute F while there is a
relation schema Ri in result that isnt in
BCNF let A ? B be a nontrivial FD on Ri s.t.
A ? Ri is not in F and A and B are
disjoint result (result Ri) ? (Ri - B),
(A,B)
i.e., A doesnt form a key
6An Example
- Given the schema
- Stuff(sid, name, serno, classroom, cid, fid,
prof) - And FDs
- sid ? name serno ? classroom, cid, fid
- fid ? prof
- Find the Boyce-Codd Normal Form for this schema
- What if instead
- sid ? name classroom, cid ? serno fid ? prof
serno ? cid
73NF Decomposition Algorithm
Let F be a minimal cover i0 for each FD A ? B
in F if none of the schemas Rj, 1? j ? i,
contains AB increment i
Ri (A, B) if no schema Rj, 1 ? j ? i
contains a candidate key for R
increment i Ri any candidate key
for R return (R1, , Ri)
Build dep.-preserving decomp.
Ensurelosslessdecomp.
8An Example
- Given the schema
- Stuff(sid, name, serno, classroom, cid, fid,
prof) - And FDs
- sid ? name serno ? classroom, cid, fid
- fid ? prof
- Find the Third Normal Form for this schema
- What if instead
- sid ? name classroom, cid ? serno fid ? prof
serno ? cid
9Summary of Normalization
- We can always decompose into 3NF and get
- Lossless join
- Dependency preservation
- But with BCNF
- We are only guaranteed lossless joins
- The algorithm is nondeterministic, so there is
not a unique decomposition for a given schema R - BCNF is stronger than 3NF every BCNF schema is
also in 3NF
10Normalization Is Good Or Is It?
- In some cases, we might not mind redundancy, if
the data isnt directly updated - Reports (people like to see breakdowns by
semester, department, course, etc.) - Warehouses (archived copies of data for doing
complex analysis) - Data sharing (sometimes we may export data into
object-oriented or hierarchical formats)
11XML A Semi-Structured Data Model
12Why XML?
- XML is the confluence of several factors
- The Web needed a more declarative format for data
- Documents needed a mechanism for extended tags
- Database people needed a more flexible
interchange format - Lingua franca of data
- Its parsable even if we dont know what it
means! - Original expectation
- The whole web would go to XML instead of HTML
- Todays reality
- Not so But XML is used all over under the
covers
13Why DB People Like XML
- Can get data from all sorts of sources
- Allows us to touch data we dont own!
- This was actually a huge change in the DB
community - Interesting relationships with DB techniques
- Useful to do relational-style operations
- Leverages ideas from object-oriented,
semistructured data - Blends schema and data into one format
- Unlike relational model, where we need schema
first - But too little schema can be a drawback, too!
14XML Anatomy
Processing Instr.
- lt?xml version"1.0" encoding"ISO-8859-1" ?gt
- ltdblpgt
- ltmastersthesis mdate"2002-01-03"
key"ms/Brown92"gt - Â ltauthorgtKurt P. Brownlt/authorgt
- Â lttitlegtPRPL A Database Workload
Specification Languagelt/titlegt - Â ltyeargt1992lt/yeargt
- Â ltschoolgtUniv. of Wisconsin-Madisonlt/schoolgt
- Â lt/mastersthesisgt
- ltarticle mdate"2002-01-03" key"tr/dec/SRC1997-
018"gt - Â lteditorgtPaul R. McJoneslt/editorgt
- Â lttitlegtThe 1995 SQL Reunionlt/titlegt
- Â ltjournalgtDigital System Research Center
Reportlt/journalgt - Â ltvolumegtSRC1997-018lt/volumegt
- Â ltyeargt1997lt/yeargt
- Â lteegtdb/labs/dec/SRC1997-018.htmllt/eegt
- Â lteegthttp//www.mcjones.org/System_R/SQL_Reunio
n_95/lt/eegt - Â lt/articlegt
Open-tag
Element
Attribute
Close-tag
15Well-Formed XML
- A legal XML document fully parsable by an XML
parser - All open-tags have matching close-tags (unlike so
many HTML documents!), or a special - lttag/gt shortcut for empty tags (equivalent to
lttaggtlt/taggt - Attributes (which are unordered, in contrast to
elements) only appear once in an element - Theres a single root element
- XML is case-sensitive
16XML as a Data Model
- XML information set includes 7 types of nodes
- Document (root)
- Element
- Attribute
- Processing instruction
- Text (content)
- Namespace
- Comment
- XML data model includes this, plus typing info,
plus order info and a few other things
17XML Data Model Visualized(and simplified!)
attribute
root
p-i
element
Root
text
dblp
?xml
mastersthesis
article
mdate
mdate
key
key
author
title
year
school
2002
editor
title
year
journal
volume
ee
ee
2002
1992
1997
The
ms/Brown92
tr/dec/
PRPL
Digital
db/labs/dec
Univ.
Paul R.
Kurt P.
SRC
http//www.
18What Does XML Do?
- Serves as a document format (super-HTML)
- Allows custom tags (e.g., used by MS Word,
openoffice) - Supplement it with stylesheets (XSL) to define
formatting - Data exchange format (must agree on terminology)
- Marshalling and unmarshalling data in SOAP and
Web Services
19XML as a Super-HTML(MS Word)
- lth1 class"Section1"gtlta name"_top /gtCIS 550
Database and Information Systemslt/h1gt - lth2 class"Section1"gtFall 2004lt/h2gt
- ltp class"MsoNormal"gt
- ltplacegt311 Townelt/placegt, Tuesday/Thursday
- lttime Hour"13" Minute"30"gt130PM
300PMlt/timegt - lt/pgt
-
20XML Easily Encodes Relations
Student-course-grade
sid serno exp-grade
1 570103 B
23 550103 A
- ltstudent-course-gradegt
- lttuplegtltsidgt1lt/sidgtltsernogt570103lt/sernogtltexp-grad
egtBlt/exp-gradegtlt/tuplegt - lttuplegtltsidgt23lt/sidgtltsernogt550103lt/sernogtltexp-gra
degtAlt/exp-gradegtlt/tuplegt - lt/student-course-gradegt
21But XML is More FlexibleNon-First-Normal-Form
(NF2)
- ltparentsgt
- ltparent nameJean gt
- ltsongtJohnlt/songt
- ltdaughtergtJoanlt/daughtergt
- ltdaughtergtJilllt/daughtergt
- lt/parentgt
- ltparent nameFenggt
- ltdaughtergtFelicitylt/daughtergt
- lt/parentgt
Coincides with semi-structured data, invented
by DB people at Penn and Stanford
22XML and Code
- Web Services (.NET, recent Java web service
toolkits) are using XML to pass parameters and
make function calls - Why?
- Easy to be forwards-compatible
- Easy to read over and validate (?)
- Generally firewall-compatible
- Drawbacks? XML is a verbose and inefficient
encoding! - XML is used to represent
- SOAP the envelope that data is marshalled
into - XML Schema gives some typing info about
structures being passed - WSDL the IDL (interface def language)
- UDDI provides an interface for querying about
web services
23Integrating XML What If We Have Multiple
Sources with the Same Tags?
- Namespaces allow us to specify a context for
different tags - Two parts
- Binding of namespace to URI
- Qualified names
- ltroot xmlnshttp//www.first.com/aspace
xmlnsothernsgt - lttag xmlnsmynshttp//www.fictitious.com/mypath
gt - ltthistaggtis in the default namespace
(aspace)lt/thistaggt - ltmynsthistaggtis in mynslt/mynsthistaggtltotherns
thistaggtis a different tag in othernslt/othernsthi
staggt - lt/taggt
- lt/rootgt
24XML Isnt Enough on Its Own
- Its too unconstrained for many cases!
- How will we know when were getting garbage?
- How will we query?
- How will we understand what we got?
- We also need
- Some idea of the structure
- Our focus next
- Presentation, in some cases XSL(T)
- Well talk about this soon
- Some way of interpreting the tags?
- Well talk about this later in the semester
25Structural ConstraintsDocument Type Definitions
(DTDs)
- The DTD is an EBNF grammar defining XML structure
- XML document specifies an associated DTD, plus
the root element - DTD specifies children of the root (and so on)
- DTD defines special significance for attributes
- IDs special attributes that are analogous to
keys for elements - IDREFs references to IDs
- IDREFS a nasty hack that represents a list of
IDREFs
26An Example DTD
- Example DTD
- lt!ELEMENT dblp((mastersthesis article))gt
- lt!ELEMENT mastersthesis(author,title,year,school,c
ommitteemember)gt - lt!ATTLIST mastersthesis(mdate CDATA REQUIRED ke
y ID REQUIRED - advisor CDATA IMPLIEDgt
- lt!ELEMENT author(PCDATA)gt
-
- Example use of DTD in XML file
- lt?xml version"1.0" encoding"ISO-8859-1" ?gt
- lt!DOCTYPE dblp SYSTEM my.dtd"gt
- ltdblpgt
27Representing Graphs and Links in XML
- lt?xml version"1.0" encoding"ISO-8859-1" ?gt
- lt!DOCTYPE graph SYSTEM special.dtd"gt
- ltgraphgt
- ltauthor idauthor1gt
- ltnamegtJohn Smithlt/namegt
- lt/authorgt
- ltarticlegt
- ltauthor refauthor1 /gt lttitlegtPaper1lt/titlegt
- lt/articlegt
- ltarticlegt
- ltauthor refauthor1 /gt lttitlegtPaper2lt/titlegt
- lt/articlegt
28Graph Data Model
Root
graph
?xml
!DOCTYPE
article
article
author
id
title
title
author
author
name
Paper1
author1
ref
Paper2
ref
John Smith
author1
author1
29Graph Data Model
Root
graph
?xml
!DOCTYPE
article
article
author
id
title
title
author
author
name
Paper1
author1
ref
Paper2
ref
John Smith
30DTDs Arent Expressive Enough
- DTDs capture grammatical structure, but have some
drawbacks - Not themselves in XML inconvenient to build
tools for them - Dont capture database datatypes domains
- IDs arent a good implementation of keys
- Why not?
- No way of defining OO-like inheritance
31XML Schema
- Aims to address the shortcomings of DTDs
- XML syntax
- Can define keys using XPaths
- Type subclassing thats more complex than in a
programming language - Programming languages dont consider order of
member variables! - Subclassing by extension and by restriction
- And, of course, domains and built-in datatypes
32Basics of XML Schema
- Need to use the XML Schema namespace (generally
named xsd) - simpleTypes are a way of restricting domains on
scalars - Can define a simpleType based on integer, with
values within a particular range - complexTypes are a way of defining
element/attribute structures - Basically equivalent to !ELEMENT, but more
powerful - Specify sequence, choice between child elements
- Specify minOccurs and maxOccurs (default 1)
- Must associate an element/attribute with a
simpleType, or an element with a complexType
33Simple Schema Example
- ltxsdschema xmlnsxsd"http//www.w3.org/2001/XMLS
chema"gt - ltxsdelement namemastersthesis"
typeThesisType"/gt - ltxsdcomplexType nameThesisType"gt
- ltxsdattribute namemdate" type"xsddate"/gt
- ltxsdattribute namekey" type"xsdstring"/gt
- ltxsdattribute nameadvisor" type"xsdstring"/gt
- ltxsdsequencegt
- ltxsdelement nameauthor" typexsdstring"/gt
- ltxsdelement nametitle" typexsdstring"/gt
- ltxsdelement nameyear" typexsdinteger"/gt
- ltxsdelement nameschool" typexsdstring/gt
- ltxsdelement namecommitteemember"
typeCommitteeType minOccurs0"/gt - lt/xsdsequencegt
- lt/xsdcomplexTypegt
- lt/xsdschemagt
34Designing an XML Schema/DTD
- Not as formalized as relational data design
- We can still use ER diagrams to break into
entity, relationship sets - ER diagrams have extensions for aggregation
treating smaller diagrams as entities and for
composite attributes - Note that often we already have our data in
relations and need to design the XML schema to
export them! - Generally orient the XML tree around the
central objects - Big decision element vs. attribute
- Element if it has its own properties, or if you
might have more than one of them - Attribute if it is a single property or perhaps
not!