Title: Database Theory and Methodology
1Database Theory and Methodology
2The Good and the Bad
- So far we have not developed any measure of
goodness to measure the quality of the design,
other than our intuition. - Bad database design
- Redundant Information
- Update anomalies
- Wasting storage with null values
- Generation of invalid data during joins
- Good database design
- Easy to explain its semantics
- No redundancy (or at least reduced redundancy)
- No information loss during joins
- We need formal concepts to define goodness and
badness of relational schemas.
3The Evils of Redundancy
- Redundancy is at the root of several problems
associated with relational schemas - Redundant storage
- Insertion anomalies
- Deletion anomalies
- Update anomalies
4Example Bad Design
- redundancy airline name repeated for the same
flight - inconsistency when airline name for a flight
changes, it must be changed in many places
5Bad Database Design
- insertion anomalies how do we represent that
SK912 is flown by Scandinavian without there
being a date and a plane assigned? - deletion anomalies cancelling AA411 on 10/22/00
makes us lose that it is flown by American. - update anomalies if DL242 is flown by Sabena, we
must change it everywhere.
6Bad Database Design- decomposition
7Bad Database Design- information loss
- information loss we polluted the database with
false facts we cant find the true facts.
8Good Database Design
- no redundancy of FACT (!)
- no inconsistency
- no insertion, deletion or update anomalies
- no information loss
9Goal Devise a Theory for the Following
- Decide whether a particular relation R is in
good form. - In the case that a relation R is not in good
form, decompose it into a set of relations R1,
R2, ..., Rn such that - each relation is in good form
- the decomposition is a lossless-join
decomposition - Our theory is based on
- functional dependencies
- multivalued dependencies
10Functional Dependencies (FDs)
- A functional dependency is a constraint between
two sets of attributes from the database. - Definition A functional dependency, denoted by X
? Y, holds over relation R if, for every
allowable instance r of R - t1 ? r, t2 ? r, t1.X t2.X implies t1.Y t2.Y
- i.e., given two tuples in r, if the X values
agree, then the Y values must also agree. (X and
Y are sets of attributes.) - An FD is a statement about all allowable
relations. - Must be identified based on semantics of
application. - Given some allowable instance r1 of R, we can
check if it violates some FD f, but we cannot
tell if f holds over R! - If X ? Y in R, this does not say whether or not Y
? X.
11Functional Dependencies and Keys
- A key constraint is a special case of an FD.
- Note the following
- If X is a candidate key, X ? Y for any subset of
attributes Y of R. - X ? Y does not require that the set X be minimal
the additional minimality condition must be met
for X to be a key. - If X ? Y holds where Y is the set of all
attributes in R, and there is some subset V of X
s.t. V ? Y, then X is a superkey.
12Example Constraints on Entity Set
- Consider relation obtained from Hourly_Emps
- Hourly_Emps (ssn, name, lot, rating, hrly_wages,
hrs_worked) - Notation We will denote this relation schema by
listing the attributes SNLRWH - This is really the set of attributes
S,N,L,R,W,H. - Sometimes, we will refer to all attributes of a
relation by using the relation name. (e.g.,
Hourly_Emps for SNLRWH) - Some FDs on Hourly_Emps
- ssn is the key S ? SNLRWH
- rating determines hrly_wages R ? W
13Example (Contd.)
- Problems due to R ? W
- Update anomaly Can we change W in
just the 1st tuple of SNLRWH? - Insertion anomaly What if we want to insert an
employee and dont know the hourly wage for his
rating? - Deletion anomaly If we delete all employees with
rating 5, we lose the information about the wage
for rating 5!
Will two smaller tables be better?
14Closure of a set of FDs
- Given some FDs, we can usually infer additional
FDs - ssn ? did, did ? lot implies ssn ? lot
- mgr_ssn ? mgr_phone, dept_no ? mgr_ssn implies
dept_no ? mgr_phone - An FD f is implied by a set of FDs F if f holds
whenever all FDs in F hold. - Formally, the set of all dependencies that
include F as well as all dependencies that can be
inferred from F is called the closure of F it is
denoted by F.
15Inference Rules for FDs
- Armstrongs Axioms (X, Y, Z are sets of
attributes) - Reflexivity If Y ? X then X ? Y.
- Augmentation If X ? Y, then XZ ? YZ for
any Z. - Transitivity If X ? Y and Y ? Z, then X
? Z. - These are sound and complete inference rules for
FDs! - By sound we mean that they generate only FDs in
F when applied to a set F of FDs. - By complete we mean that repeated application of
these rules will generate all FDs in the closure
of F.
16Derived inference rules
- Union if XY and XZ, then XYZ.
- Decomposition if XYZ, then XY and XZ.
- Pseudotransitivity if XY and WYZ, then WXZ.
- These additional rules are not essential their
soundness can be proved using Armstrongs Axioms. - Exercise Prove rules Union, Decomposition and
Pseudotransitivity using A.A.
17Example
- Consider the following schema
- Contracts(cid,sid,jid,did,pid,qty,value)
- C is the key C ? CSJDPQV
- Project purchases each part using single
contract JP ? C - Dept purchases at most one part from a supplier
SD ? P - JP ? C, C ? CSJDPQV imply JP?CSJDPQV
- SD?P implies SDJ ? JP
- SDJ ? JP, JP ? CSJDPQV imply SDJ ?
CSJDPQV - We can infer several additional FDs that are in
the closure by using augmentation or
decomposition. e.g. C ? CSJDPQV implies C ? C,
C? S,C ? J, C? D, C? P, C? Q, C? V. - We have also a number of trivial FDs from
reflexivity rule.
18Reasoning About FDs
- Computing the closure of a set of FDs can be
expensive. (Size of closure is exponential in
attrs!) - Typically, we just want to check if a given FD X
? Y is in the closure of a set of FDs F. An
efficient check - Compute attribute closure of X (denoted X) wrt
F - Set of all attributes A such that X ? A is in F.
- There is a linear time algorithm to compute this.
- Check if Y is in X.
- Does F A ?B, B ? C, CD ? E imply A ? E?
- i.e, is A ?E in the closure F ?
(Equivalently, is E in X)?
19Attribute Closure
- X X
- repeat
- oldX X
- for each functional dependency Y ? Z in F do
- if Y ? X then X X ? Z
- until X oldX
20Example
- Given the following set of FDs
- F Ssn ? Ename,
- Pnumber ? Pname, Plocation,
- Ssn,Pnumber ? Hours
- The attribute closures
- Ssn Ssn, Ename
- Pnumber Pnumber, Pname,Plocation
- Ssn,Pnumber Ssn, Pnumber, Ename, Pname,
Plocation, Hours
21Example
- R (A, B, C, G, H, I)
- F A ? B A ? C CG ? H CG ? I B ? H
- (AG)
- result AG.
- result ABCG (A ? C and A ? B)
- result ABCGH (CG ? H and CG ? AGBC)
- 4. result ABCGHI (CG ? I and CG ? AGBCH)
- Is AG a candidate key?
- Is AG a super key?
- Does AG ? R? Is (AG) ? R
- Is any subset of AG a superkey?
- Does A ? R? Is (A) ? R
- Does G ? R? Is (G) ? R
22Finding a Key for a Relation
- Algorithm Finding a Key K for R, given a set of
FDs - Set K R.
- For each attribute A in K
- Compute (K-A) wrt F
- If (K-A) contains all the attributes in R,
then - set K K A
-
23Example
- Consider the following schema and the set of FDs
- R (Ssn, Pnumber, Ename, Pname, Plocation,
Hours) - F Ssn ? Ename, Pnumber ? Pname, Plocation,
Ssn,Pnumber ? Hours - The Key is Ssn,Pnumber, since
- Ssn,Pnumber Ssn, Pnumber, Ename, Pname,
Plocation, Hours
24Equivalence of two sets of FDs
- Definition A set of FDs F covers another set of
FDs E if every FD in E is also in F. - Definition F and E are equivalent if F E
(i.e. F covers E and E covers F). - We can determine whether F covers E by
calculating X with respect to F for each FD XY
in E, and then checking whether this X includes
the attributes in Y. If this is the case for
every FD in E, then F covers E.
25Minimal Cover
- Sets of functional dependencies may have
redundant dependencies that can be inferred from
the others - Eg A ? C is redundant in A ? B, B ? C,
A ? C - Parts of a functional dependency may be redundant
- E.g. on RHS A ? B, B ? C, A ? CD can
be simplified to A ?
B, B ? C, A ? D - E.g. on LHS A ? B, B ? C, AC ? D can
be simplified to A ?
B, B ? C, A ? D - Intuitively, a canonical cover of F is a
minimal set of functional dependencies
equivalent to F, having no redundant dependencies
or redundant parts of dependencies
26Minimal Sets of FDs
- A set of FDs F is minimal if
- Every dependency in F has a single attribute for
its right-hand side. - We cant replace any dependency X? A in F with a
dependency Y?A where Y ? X and still have a set
of dependencies equivalent with F. - This ensures that there are no redundancies by
having redundant attributes on the left-hand side
of a dependency. - We cant remove any dependency from F and still
have a set of dependencies equivalent with F. - This ensures that there are no redundancies by
having a dependency that can be inferred from the
remaining FDs in F.
27Minimal Cover of a set of FDs
- A minimal cover of a set of FDs E is a minimal
set of dependencies F that is equivalent to E. - There can be several minimal covers for a set of
FDs. - Algorithm Finding Minimal Cover F for a set of
Fds E. - 1. Set F E.
- 2. Replace each Fd X ? A1,..., An in F by the n
Fds X?A1,..., X?An. - 3. For each Fd X?A in F
- For each attribute B?X
- If F - X?A ? (X - B) ?A ? F
- Then replace X?A with (X - B) ?A in F.
- 4. For each remaining Fd X?A in F
- If F - X?A?F, then remove X?A from F.
28Examples
- E A ? B, ABCD ? E, EF ? GH, ACDF ? EG
has the following minimal cover - F A ? B, ACD ? E, EF ? G, EF ? H
- E Ssn ? Ename, Pnumber ? Pname, Plocation,
Ssn,Pnumber ? Hours - F Ssn ? Ename, Pnumber ? Pname, Pnumber ?
Plocation, Ssn,Pnumber ? Hours
29Normal Forms
- Returning to the issue of schema refinement, the
first question to ask is whether any refinement
is needed! - If a relation is in a certain normal form (BCNF,
3NF etc.), it is known that certain kinds of
problems are avoided/minimized. This can be used
to help us decide whether decomposing the
relation will help. - Role of FDs in detecting redundancy
- Consider a relation R with 3 attributes, ABC.
- No FDs hold There is no redundancy here.
- Given A ? B Several tuples could have the
same A value, and if so, theyll all have the
same B value!
30Normalization
- Normalization of data is a process of analyzing
the given relational schemas based on their FDs
and primary keys to achieve the desirable
properties of - minimizing redundancy
- minimizing update, insert and delete anomalies.
- The process of normalization through
decomposition must also confirm two additional
properties - Lossless join property (nonadditive join
property) - Dependency preservation property
31Overview of NFs
NF2 1NF 2NF 3NF BCNF
32Normal Forms
- NF2 non-first normal form.
- 1NF R is in 1NF iff all domain values are
atomic. - 2NF R is in 2. NF iff R is in 1NF and every
non-key attribute is fully dependent on the key. - 3NF R is in 3NF iff R is 2NF and every nonkey
attribute is non-transitively dependent on the
key - BCNF R is in BCNF iff every determinant is a
candidate key - Determinant an attribute on which some other
attribute(s) is fully functionally dependent.
331NF Atomic Values
Attributes must be defined over domains with
atomic values
342NF Full Functional Dependency
- An FD X ? Y is a full functional dependency if
any attribute A ? X , X A does not
functionally determine Y. - An FD X ? Y is a partial functional dependency if
some attribute A ? X , (X A) ? Y. - E.g. ssn, proj_num ? hours is a full FD.
- ssn, proj_num ? ename is a partial FD.
- Definition A relation schema R is in 2NF if R
is in 1NF and every nonprime attribute A is fully
functionally dependent on the primary key.
35Third Normal Form (3NF)
- Relation R with FDs F is in 3NF if, for all X ?
A in F - A ? X (called a trivial FD), or
- X contains a key for R, or
- A is part of some key for R.
- Minimality of a key is crucial in third condition
above! - If R is in 3NF, obviously in 2NF.
- If R is in 3NF, some redundancy is still
possible. It is a compromise, used when BCNF not
achievable (e.g., no good decomposition, or
performance considerations). - Lossless-join, dependency-preserving
decomposition of R into a collection of 3NF
relations always possible.
36What Does 3NF Achieve?
- If 3NF is violated by X ? A, one of the following
holds - X is a subset of some key K
- We store (X, A) pairs redundantly.
- X is not a proper subset of any key.
- There is a chain of FDs K ? X ? A, which means
that we cannot associate an X value with a K
value unless we also associate an A value with an
X value. - But even if relation is in 3NF, these problems
could arise. - e.g., Reserves SBDC, S ? C, C ? S is in
3NF, but for each reservation of sailor S, same
(S, C) pair is stored.
37Boyce-Codd Normal Form (BCNF)
- Reln R with FDs F is in BCNF if, for all X ? A
in F - A X (called a trivial FD), or
- X contains a key for R.
- In other words, R is in BCNF if the only
non-trivial FDs that hold over R are key
constraints. - No dependency in R that can be predicted using
FDs alone. - If we are shown two tuples that agree upon the X
value, we cannot infer the A value in
one tuple from
the A value in the other. - If example relation is in BCNF, then 2 tuples
must be identical
(since X is a key).