Title: Introduction to Databases: Relational and XML Models and Languages
1Introduction to DatabasesRelational and XML
Models and Languages
- Instructors
- Bertram Ludaescher
- Kai Lin
2Overview
- 0915-1020 Relational Databases (1h05)
- 1020-1030 BREAK (10)
- 1030-1150 Relational Databases (1h20)
- 1150-1315 LUNCH (1h25)
- 1315-1345 Demo Hands-on (30)
- 1345-1510 XML Basics (1h25)
- 1510-1530 BREAK (20)
- 1530-1630 XML Querying (1h)
- 1630-1700 Demo Hands-on (30)
3Scope
- Today Introduction to Databases, in particular
- Relational database model
- Relational Operations and Queries
- Constraints
- XML data model
- Querying and transforming XML
- Some demos simple hands-on exercise
- Tomorrow
- Introduction to Knowledge Representation and
Ontologies - But first déjà vu
4(No Transcript)
5What is a Database System?
- Database (system)
- Database Instance (set of tables of rows)
- Database Management System (DBMS)
- Origins in the commercial world
- to organize, query, and manipulate data more
effectively, efficiently, and independently - Scientific databases
- often special features
- spatial, temporal, spatiotemporal, GIS, units,
uncertainty, raw derived data,
6Why not just use files as databases?
- For some applications yeah why not?
- But in general
- scanning greping large files can be very
inefficient - no language support for selecting desired data,
joining them, etc. - cannot express the kinds of questions/queries
youd like to ask - grep is no substitute for a query language
- redundant and/or inconsistent storage of data
- no transaction management and concurrency control
among multiple users - no security
- no recovery
- no data independence (application? data)
- no data modeling support
-
7Features of a Database System
- A data model (relational, object-oriented, XML)
prescribes how data can be organized - as relations (tables) of tuples (rows)
- as classes of (linked) objects
- as XML trees
- A (database) schema (stored in the data
dictionary) defines the structure of a specific
database instance - Relational schema
- OO schema
- XML Schema (or XML DTD)
8Features of a Database System
- Data is treated uniformly and separately from the
application - Efficient data access
- Queries and views are expressed over the schema
- Integrity constraints (checking and enforcement)
- Transactions combine sets of operations into
logical units (all-or-nothing) - Synchronization of concurrent user transactions
- Recovery (after system crash)
- not to be confused w/ backup
- instead guarantee consistency by roll-back of
partially executed transactions (how? Hint
logging)
9DB features, e.g., Concurrency Control
- Concurrent execution of simultaneous requests
- long before web servers where around...
- transaction management guarantees consistency
despite concurrent/interleaved execution - Transaction ( sequence of read/write operations)
- Atomicity a transaction is executed completely
or not at all - Consistency a transaction creates a new
consistent DB state, i.e., in which all integrity
constraints are maintained - Isolation to the user, a transaction seems to
run in isolation - Durability the effect of a successful
(committed) transaction remains even after
system failure
10Levels of Abstraction Architecture Overview
User
Conceptual Level
View 1
View 2
View n
Export schemas
logical data independence
ER-Model (Entity-Relationship) OO Models
(Classes) part of DB design ? conceptual
design often lost in the process
Logical (conceptual) level
Tables
physical data independence
Physical level
Index structures
DB instances
11Database Design Entity-Relationship (ER) Model
Employee
Department
works-for
- Entities
- Relationships
- Attributes
- ER Model
- initial, high-level DB design (conceptual model)
- easy to map to a relational schema (database
tables) - comes with more constraints (cardinalities,
aggregation) and extensions EER (is-a gt class
hierarchies) - related UML (Unified Modeling Language) class
diagrams
12The Relational Model
Employee
- Relation/Table Name
- employee, dept
- Attributes Column Names
- Emp, Salary, DeptNo, Name, Mgr
- Relational Schema
- employee(Empstring, Salaryinteger,
DeptNointeger), ... - Tuple Row of the table
- (tom, 60000, 1)
- Relation Set of tuples
- (...), (...), ...
Emp Salary DNo tom 60k 1 tim 57k 1 sally 45k 3 car
ol 30k 1 carol 35k 2 .
FK foreign key, pointing to another key
Department
DNo Name Mgr 1 Toys carol 2 Comp. carol 3 Shoes
sam
13Ex Creating a Relational Database in SQL
CREATE TABLE employee ( ssn CHAR(11), name
VARCHAR(30), deptNo INTEGER, PRIMARY KEY
(ssn), FOREIGN KEY (deptNo) REFERENCES
department )
CREATE TABLE department ( deptNo INTEGER, na
me VARCHAR(20), manager CHAR(11), PRIMARY
KEY (deptNo), FOREIGN KEY (manager)
REFERENCES employee(ssn) )
14What is a Query?
- Intuitively
- An executable question in terms of a database
schema - Evaluating a query Q against a database instance
D yields a set of answer objects - Relational tuples or XML elements
- Example
- Who are the employees in the Toys dept.?
- Who is (are) the manager(s) of Tom?
- Show all pairs (Employee, Mgr)
- Technically
- A mapping from an input schema (the given table
schemas) to a result schema (the new columns you
are interested in) defined in some query language
15Why (Declarative) Query Languages?
,,Die Grenzen meiner Sprache bedeuten die Grenzen
meiner Welt. The limits of my language mean the
limits of my world. Ludwig Wittgenstein,
Tractatus Logico-Philosophicus
If you have a hammer, everything looks like a
nail.
- Things we talk and think about in PLs and QLs
- Assembly languages
- registers, memory locations, jumps, ...
- C and the likes
- if-then-else, for, while, memory (de-)allocation,
pointers, ... - Object-oriented languages
- C C plus objects, methods, classes, ...
- Java objects, methods, classes, references,
... - Smalltalk objects, objects, objects, ...
- OQL object-query language
16Why (Declarative) Query Languages?
- Things we talk and think about in PLs and QLs
- Functional languages (Haskell, ML)
- (higher-order) functions, fold(lr), recursion,
patterns, ... - gt Relational languages (SQL, Datalog)
- relations (tables), tuples (rows) conceptual
level ER - relational operations ?, ?, ?, ?, ...,
?,?,?,?,?,..., ?, ?, X - gt Semistructured/XML (Tree) Graph Query
Languages - trees, graphs, nodes, edges, children nodes,
siblings, - XPath, XQuery,
- Also
- Focus on what, and not how!
17Example Querying a Relational Database
input tables
join
SELECT e.Emp, d.Mgr FROM Employee e, Department
d WHERE e.DeptNo d.DeptNo
SQL query (or view def.)
we dont say how to evaluate this expression
answer (or view)
18Example Query SQL vs DATALOG
- List all employees and their managers
- In SQL
- SELECT e.name, d.manager
- FROM Employee e, Department d
- WHERE e.deptNo d.deptNo
- In DATALOG
- q(E, M) - employee(E, S, D), department(D, N,
M).
a join operation
19Important Relational Operations
- select(R, Condition)
- filter rows of a table wrt. a condition
- project(R, Attr)
- remove unwanted columns keep rest
- join(R1, A2, R2, A2, Condition)
- find matches in a related table
- e.g. match R1.foreign key R2.primary key
- cartesian product(R1, R2)
- union (OR), intersection (AND)
- set-difference (NOT IN)
20Relational Operations (in DATALOG)
(query) output (query) input
21Demo
- Relational Queries in DATALOG
22Queries, Views, Integrity Constraints
- can all be seen as special queries
- Query q() - ad-hoc queries
- View v() - exported views
- Integrity Constraints
- ic () - . MgrSal lt EmpSal
- say what shouldnt happen
- if it does alert the user (or refuse an update,
)
23Query Evaluation vs Reasoning
- Query evaluation
- Given a database instance D and a query Q, run
Q(D) - What databases do all the time
- Reasoning (aka Semantic Query Optimization)
- Given a query Q and a constraint C, optimize
QC (e.g., given C, Q might be unsatisfiable) - Given Q1 and Q2 decide whether Q1? Q2
- Given Q1,Q2, C decide whether Q1? Q2 C
- Note we are NOT given a database instance D
here just the schema and the query/IC expressions
24Summary QLs for Relational Databases
25Relational Algebra
26Relational Algebra
27Relational Algebra
28Relational Algebra
29Relational Algebra
30Hands-on Part
31DBDesigner 4
- An open source (GPL) database design tool
- Goto http//www.fabforce.net/dbdesigner4/
- Download and install (5 min)
- Open the example schema Order (File -gt Open -gt
Order), and examine the relations between the
tables forumtopic and forumpost. Find the foreign
key used in the relation postHasTopic (5 min) - Connect to a sample MySQL database
- host geon07.sdsc.edu
- port 3306
- database summer_institute
- username root
- password blank
- (5 min)
- Select all records in the table forumtopic (2
min) - Select all records in the table forumpost, and
sort the result according to their idforumpost
(3min) - Find all forum posts with the topic Cars (5 min)
- Insert a record into the table forumpost (5 min)
32Additional Material
33Non-Relational Data Models
- Relational model is flat atomic data values
- nesting is modeled via pointers (foreign keys)
and Skolem-ids - extension nested relational model (tables
within tables, cf. nested HTML tables) - values can be nested lists ..., tuples (...),
sets ... - ISO standard(s) SQL
- identity is value based
- Object-oriented data model
- complex (structured) objects with object-identity
(oid) - class and type hierarchies (sub-/superclass,
sub-/supertype) - OODB schema may be very close to world model
(no translation into tables) - () queries fit your OO schema
- (-) (new) queries that dont fit nicely
- ODMG standard, OQL (Object Query Language)
34Example Object Query Language (OQL)
SELECT DISTINCT STRUCT( E e.name, C
e.manager.name, M ( SELECT c.name
FROM c IN e.children WHERE FOR ALL d IN
e.manager.children c.age gt d.age ) ) FROM e IN
Employees
- Q what does this OQL query compute?
- Note the use of path expressions like
e.manager.children - gt Semistructured/Graph Databases
35A Graph Database
36Querying Graphs with OO-Path Expressions
?- dblp."Inf. Systems".L."Michael E.
Senko". Answer L"Volume 1, 1975 L"Volume
5, 1980".
?- dblp."Inf. Systems".L.P, substr("Volume",L),
P person.spouselives_in P.lives_in.
37Constructs for Querying Graphs
Example ?- dblp . any . (if(?vldb) if(?sigmod))
38Keys, Keys, and more Keys
- A key is a minimal set of attributes that
- uniquely identify a tuple
- determine every other attribute value in a tuple
- There may be many keys for a relation we
designate one as the primary key - The phrase candidate key is used in place of
key where the key denotes the primary key - A superkey is a superset of a key (i.e., not
necessarily minimal)
39Normalization of Relations
Example of good design
Employee(EName, SSN, BDate, Address, DNumber)
Department(DName, DNumber, DMgrSSN)
Example of bad design (why is it bad?)
Emp(EName, SSN, BDate, Address, DNum, DName,
DMgrSSN)
40Whats Wrong?
Emp(EName, SSN, BDate, Address, DNum, DName,
DMgrSSN)
- The description of the department (DName,
DMgrSSN) is repeated for every employee that
works in that department. - Redundancy!
- The department is described redundantly.
- This leads to update anomalies! ( and wastes
space) - Digression (for experts only)
- redundancy can be used to increase performance
e.g. materialized views)
41Update Anomalies (caused by redundancy)
- Insertion Anomalies
- If you insert an employee
- Need to know which department he/she works
- Need to know the description information for that
department - If you want to insert a department, you cant
until there is at least one employee - Deletion Anomalies if you delete an employee, is
that dept. gone? was this the last employee in
that dept? - Modification Anomalies if you change DName,
for example, it needs to be changed everywhere!
42Null values also cause problems
- Null values might help in special cases, but are
not a general solution to update anomalies - For example, they may
- Waste space
- Make it hard to specify and understand joins
- Make it hard to aggregate (count, sum, etc.)
- Have different meanings
- Attribute does not apply to this tuple
- Attribute value is unkown
- Value is known but absent (not yet recorded)
- Causes problems with queries (cant interpret
query answers)
43Why worry about normalization?
- To
- reduce redundancy update anomalies
- reduce the need for null values
- Last not least its a solved problem
- all algorithms proofs have been worked out
Here Normalization based on FDs (theres more)
44Functional Dependencies
- Statement that if two tuples agree on attributes
A they must agree on attributes B - A ? B
- If the value of the first attribute(s), A, is
known, - then the value of the second attribute(s), B,
is - known (read A determines B)
- We want to know if it is always true in the
application
45Functional Dependencies
- Examples of functional dependencies
- social-security-number ? employee-name
- course-number ? course-title
- Examples that are NOT functional dependencies
- course-number -?? book
- course-number -?? car-color
46What is a functional dependency?
Remember what it means to be a function x
f(x) x g(x) x h(x) 1
2 1 2 1 10 1 3
2 2 2 20 2 5
3 5 3 30 3 5
f is not a function for x1, f(x) is not unique
we are looking for functional relationships (that
must occur in a relation) among attribute values
47What are the FDs?
EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME,
DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS, ENAME,
PNAME, PLOCATION)
48What are the FDs?
EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME,
DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS,
ENAME, PNAME, PLOCATION)
5
2
1
3
6
4
9
10
7
8
49Why do we care?
EMPLOYEE (SSN, NAME, SALARY, JOB_DESC)
- If all FDs are implied by the key
- it means that the DBMS only enforces keys (not
FDs) and the DBMS is going to enforce the keys
anyway - otherwise, we have to perform expensive
operations to maintain consistency (code or check
statements)
50Decomposition
- Main refinement technique decomposition
(replacing ABCD with, say, AB and BCD, or ACD and
ABD) based on the projection operator. - Decomposition should be used judiciously
- Is there reason to decompose a relation? (via
Normal Forms) - What problems (if any) does the decomposition
cause? (lost information or dependencies?)
51How can we decompose using the Project Operator?
EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME,
DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS, ENAME,
PNAME, PLOCATION)
2
1
EMP1(SSN, ENAME, BDATE, ADDRESS, DNUM) X(DNUM,
DNAME, DMGRSSN)
3
4
5
6
EMP2(SSN, ENAME) X(PNUM, PNAME,
PLOCATION) Y(SSN, PNUM, HOURS)
8
9
10
7
52Correct Decompositions
- How do we know if a decomposition is correct?
- That we havent lost anything?
- We have three goals
- lossless-join decomposition
- (dont throw any information away)
- (be able to reconstruct the original relation)
- dependency preservation
- all of the FDs end up in just one relation
- (not split across two or more relations)
- Boyce-Codd Normal Form (BCNF) - no redundancy
53Lossless Decompositions
- What is a lossless decomposition?
- What is a lossy decomposition?
- When R is decomposed into R1 and R2
- Check to see if (R1 R2) R
- if it is a lossy decomposition, then R1 R2
- gives you TOO MANY tuples.
- Note we are doing a natural join
54Example a lossy decomposition
- Now join them using natural join we get extra
tuples!!! - 1 smith p2 billing
55Testing for a Lossless Decomposition
- For decomposition into two relations
- Let R1 and R2 form a decomposition of R.
- R1 and R2 are both sets of attributes from R.
- For the decomposition to be lossless ...
-
- The attributes in common must be a key for 1 of
the relations! - Note You are joining a key and a foreign key
56Example test for a lossless decomposition
- Employee(SSN, name, project, p-title)
- decomposition Employee (SSN, name) Project
(project, p-title, name) - Which attribute is in common?
- Employee.Name and Project.Name
- Is name a key for either of these two tables?
- NO! We have a problem.
57Example test for a lossless decomposition
- Employee(SSN, name, project, p-title)
- decomposition Employee (SSN, name) Project
(project, p-title) - Which attribute is in common?
- None
- Is this decomposition lossless?
- NO! We have a problem.
58Example test for a lossless decomposition
- Employee(SSN, name, project, p-title)
- decomposition Employee (SSN, name, project)
Project (project, p-title) - Which attribute is in common?
- Employee.project and Project.project
- Is project a key for either of these two tables?
- YES! We have a lossless decomposition.
59Some Preliminary Normal Form Definitions
- Prime Attribute - an attribute A is prime if it
is a member of a key, otherwise it is nonprime - Partial Dependency - given an FD X?Y, Y is
partially dependent on X if there is a proper
subset X? of X such that X??Y. -
- Transitive Dependency - A is transitively
dependent upon X if X?Y, Y-/?X, and Y?A and A?XY.
60What are the FDs?
EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME,
DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS,
ENAME, PNAME, PLOCATION)
5
2
1
3
6
4
9
10
7
8
61Normal Forms Based on FDs
- 1NF - all attribute values (domain values) are
atomic(part of the definition of the relational
model) - 2NF - 1NF no key partial dependencies (every
nonprime attribute fully depends on every key of
R) - R(A B C D) B ? C (not allowed)
- 3NF - 2NF no nonprime attribute is transitively
dependent upon any key - R(A B C D) AB ?C, C ? D (not allowed)
- BCNF - 3NF no attribute is transitively
dependent upon any key (all FDs determined by a
key) - R(A B C D) AB ?C, C ? B (not allowed)
62Whats the Goal?
- BCNF, Lossless, and Dependency-Preserving
- (first choice)
- 3NF, Lossless and Dependency-Preserving
- (second choice)
- because sometimes you cant preserve all
dependencies
63Finding all of the FDs
- Armstrongs Axioms
- Reflexivity If X ? Y, then X ? Y
- Trivially, all attributes A ? A,
- e.g., name ? name and gender ? gender
- Augmentation If X ? Y, then XZ ? YZ for any Z
- Transitivity If X ? Y and Y ? Z, then X ? Z
- X, Y, and Z are sets of attributes in R
- Armstrongs Axioms are a sound complete set of
inference rules - Union If X ? Y and X ? Z, then X ? YZ
- Decomposition If X ? YZ, then X ? Y and X ? Z
64Finding all FDs the closure of a set of FDs
- The closure of a set of FDs
- Let F be a set of FDs and F the closure of F.
- F is the set of all FDs implied (or derivable)
from F using Armstrongs Axioms - F is computed by applying the inference rules
until no new FDs can be found
65What is Dependency Preserving
- Suppose F is the original set of FDs and G is the
set of FDs after decomposition - If we compute F and G, and F G then the
decomposition is dependency preserving
66Other Results (textbook)
- Algorithms To
- Compute F
- Compute the Minimal Cover for F
- Find a dependency-preserving decomposition into
3NF - Find a lossless-join decomposition into BCNF
- Find a lossless-join dependency preserving
decomposition into 3NF
67Example Not dependency preserving
- addr(number, street, city, state, zip)
- number, street, city, state ? zip
- zip ? state
- If we decompose, this FD wont occur within one
relation. Thus, since the DBMS only enforces
keys (and not FDs directly), this cant be
enforced.(If we leave it alone, it is in 3NF)
68Lossless join decomposition algorithm
- Set D R (the current set of relations)
- While there is a relation in D thats not in BCNF
- Choose a relation scheme Q that is not in BCNF
- Find a FD X ? Y in Q that violates BCNF
- Replace Q in D by (Q Y) and (X?? Y)
- End While
- Identify dependences that are not preserved (X??
A). - Add XA as a table to the set D
69Tuning the Conceptual Schema
- The choice of conceptual schema should be guided
by the workload, in addition to redundancy
issues - We may settle for a 3NF schema rather than BCNF.
- E.g., the workload may influence our choice to
decompose a relation into 3NF or BCNF. - We may further decompose a BCNF schema!
- We might denormalize (i.e., undo a decomposition
step), or add fields to a relation. We must take
care to avoid the problems caused by the
redundancy!
70Decomposition of one table into several
- A relation is replaced by a collection of
relations that are projections. Most important
case. (Vertical partitioning) - Sometimes, we might want to replace a relation by
a collection of relations that are selections.
(Horizontal partitioning) - Each new relation has the same schema as the
original, but a subset of the rows. - Collectively, new relations contain all rows of
the original. Typically, the new relations are
disjoint. - Why might we do this?
71Tuning the Conceptual Schema
- Vertical decomposition using the project operator
Course c cname instructor room days
Course1 c room days
Course3 c instructor
Course2 c cname
72Tuning the Conceptual Schema
- Horizontal partition/decomposition using the
select operator
Course c cname instructor room days
Undergraduate-Course c cname instructor room day
s
Graduate-Course c cname instructor room days
73Tuning the Conceptual Schema
- Denormalizingalways introduces redundancy!
Course-Offering (offering, quarter, c,
instructor, time, days)
Instructor-Room (instructor, room)
Course (c, cname)
Course-Offering (offering, quarter, c, cname,
instructor, room, time, days)
This introduces redundancy - which must be
managed.
74Denormalization
- Denormalization interferes with dependency
preservation - Course-Offering
- (offering, quarter, c, cname, instructor, room,
time, days) - Note that having the FD instructor ? room (and
nothing else) in a single table allows the FD to
be enforced (by enforcing the candidate key of
instructor). - Instructor-Room (instructor, room)