Title: Database Design Theory
1Database Design Theory
- Which tables to have in a database
- Normalization
2Database Design Theory
- Given some body of data to be represented in a
database, as modelled in an E-R diagram, what is
the most suitable logical structure for that
data? - How do we decide on the appropriate tables and
the attributes of the tables?
3Requirements
- Accommodate data integrity.
- General integrity constraints
- e.g. referential integrity
- Domain specific integrity constraints
- e.g. no user can borrow more than 4 books.
- Robust in the sense that the design should be
application independent. - We try to achieve this through the elimination of
redundancy.
4The Danger of Redundancy
- Consider the example
- For students, we want to know student ID, name
and address. - For courses, we need to know course ID, title and
lecturer. - For employees, we need to know the employee ID,
name and department. - For each department, we need to know the
department ID, the name and the location. - For each enrollment, we need to know the grade.
5The Danger of Redundancy Continued
- One solution store everything in one big table
- Appl(sid,name,addr,
- cid,title,
- eid, ename,
- deptid, dname, loc,
- grade)
- Clearly, this leads to redundancy.
- For example, we need to store the students
address for every course they have been
registered for.
6The Danger of Redundancy. Conclusion
- If everything in one table, then
- Greater space requirements
- Insertion anomalies
- Cannot store information on student who has not
passed a course yet. - Deletion anomalies
- We may want to delete a course but some student
may be registered only for that course. - Update anomalies
- If a student changes their address, many tuples
need to be updated. - Danger of inconsistency in database.
7Good Database Design
- The basic idea
- A good database is one in which each table
consists of a primary key and a set of mutually
independent attributes. - Strategy for achieving a good database design
- Identify undesirable dependencies in a table and
decompose by projection.
8Functional Dependencies (FDs)
- Attribute (set) Y is functionally dependent on
attribute (set) X if, whenever two tuples have
the same value for X, they also have the same
value for Y. - Notation X ? Y
- X is called the determinant.
- An FD A? B is non-trivial if and only if B ? A
and B?A.
9Functional Dependencies in Our Example.
- If everything is in one table, then these FDs
exist - sid ? name, addr
- cid ? title
- eid ? ename
- deptid ? dname, loc
- sid, cid ? grade
- Note we also have
- sid, cid, eid, deptid ?
- all other attributes
10Keys Again
- A set of attributes X in a relation R is a
superkey if every attribute in R is functionally
dependent on X. - A candidate key is a minimal superkey.
- Alternate keys are candidate keys that have not
been selected as primary keys. - A prime attribute is a member of a candidate key.
11Armstrongs Axioms
- Let X,Y and Z be sets of attributes of a relation
R - Reflexivity (X ?Y) ? (X ? Y)
- Augmentation
- (X ? Y) ? (XZ ? YZ)
- Transitivity
- ((X ?Y) (Y? Z)) ? (X ?Z)
- Axioms are sound and complete
- Can derive all FDs that follow from a given set
of FDs. - Derive only true FDs
12Some Consequences of Armstrongs Axioms
- The following are implied by Armstrongs axioms
- Decomposition
- (X ? YZ) ?(X ? Y)
- Union
- ((X ? Y)(X ? Z)) ? (X ? YZ)
- Pseudo transitivity
- ((X ? Y)(WY ? Z)) ? (WX ? Z)
13Closure of a Set of Functional Dependencies
- If F is a set of functional dependencies, the
closure of F, F, is the set of all functional
dependencies logically implied by those in F. - Useful since it allows us to determine candidate
keys (there must be functional dependency to all
other attributes), but very expensive to compute.
14Closure Under a Set of Functional Dependencies
- Since F is too expensive to compute, we use
closure of X under a set of functional
dependencies, X. - (X ? Y) in F if and only if
- Y ? X.
- Since X is relatively easy to compute, we can
now verify whether X is a superkey.
15Computing X.
- To compute X under a set of FDs F
- INPUT X, F
- OUTPUT X
- S X
- WHILE
- there is a (Z?Y) in F
- with Z ? S and Y ? S
- DO S SY
- ENDWHILE
- X S
16Decomposition
- Recall that having identified undesirable FDs, we
now need to decompose. - Decomposition
- Let U be a relation scheme. A set of R1,..,Rn
of relation schemes is a decomposition of U if - R1 ? ? Rn U
- Every attribute of U occurs in at least one Ri.
17Desirable Properties of a Decomposition
- Decompositions should be
- Lossless
- Dependency preserving
- No redundancy
- Minimal number of tables
- Sometimes, not all properties can be achieved
simultaneously.
18Lossless Decomposition
- Let
- R1,..,Rn a decomposition of U
- u relation instance over U
- Pi ?Ri(u) for i from 1 to n
- Then
- R1,..,Rn is a lossless decomposition if
- u P1 P n
- In other words, the original relation can be
reconstructed.
19Dependency Preserving Decompositions
- In decomposing a table, ensure that any FDs are
easily enforceable. - Example
- Relation U(A,B,C)
- FDs A ? B, A ? C, B ? C
- If we decompose U into R(A,B) and S(B,C), then A
? B, B ? C can be easily enforced when changing R
or S. - Because of transitivity, A ? C is automatically
enforced.
20Non-Dependency Preserving Decomposition
- If we decompose U into R(A,B) and S(A,C), then
enforcing A ? B and A ? C is easy. - However, B ? C becomes an interrelational
constraint and can only be enforced through a
join. - This decomposition is not dependency preserving.
21Normalization
- Normal forms, as defined in relational database
theory, are guidelines for the design of the
tables in the database. - Normalization reduces redundancy.
- Important to remember why we want to avoid
redundancy - Space requirements
- Insertion, deletion and update anomalies.
22The Normal Forms
- First normal form
- Second normal form
- Third normal form
- Boyce-Codd normal form
- Fourth normal form
- Fifth normal form
- The normal forms are ordered in that everything
in 2NF is also in 1NF. - We ignore 5NF, as violations hardly occur in
practice.
23First Normal Form
- A relation is in 1NF iff the value of each
attribute in a tuple is atomic. - A relation which is not in 1NF
- SID CID GRADE
- 123 CS33Q A
- CS35A B
- 234 CS33Q C
- CS34A B
- CS36Q B
24Getting Tables into 1NF
- Normalizing a table which is not in 1NF is easy
Simply repeat the other fields. - Thus
- SID CID GRADE
- 123 CS33Q A
- 123 CS35A B
- 234 CS33Q C
- 234 CS34A B
- 234 CS36Q B
25Second and Third Normal Form
- Second and third normal form concern relationship
between non-key and prime attributes. - Recall that a prime attribute is a member of a
candidate key. - Under 2NF and 3NF, a non-key attribute value must
provide a fact about the key, the whole key and
nothing but the key. - Every non-prime attribute must be fully
functionally dependent on a candidate key.
26Second Normal Form
- 2NF is violated when a non-key attribute depends
on a proper subset of a candidate key. - The following violates 2NF
- Result(cid, sid, name, grade)
- as name is functionally dependent on sid alone.
27Dangers of Violating 2NF
- Note that name is repeated for every course that
a student has a grade for. - Problems
- Danger of inconsistency if a student changes
their name, e.g., by getting married. - If a student has not passed any courses yet, then
the students name cannot be stored.
28Getting Tables into 2NF
- Decompose the table into
- Result(cid, sid, grade)
- Student(sid, name)
- This decomposition leads to longer retrieval
times for queries which involve joins. - Normalization is necessary to avoid anomalies
which arise because of changes to attributes. - If little chance of changes, then sometimes do
not normalize.
29Third Normal Form
- Third normal form is violated when a non-prime
attribute depends on another non-prime attribute. - The following violates 3NF
- Empl(eid, dept, loc)
- loc is a fact about dept.
- Danger same as violation of 2NF.
30Getting Tables into 3NF
- Again, decompose
- Empl(eid, dept)
- Department(dept, loc)
- We can always restore 3NF through a lossless and
dependency preserving decomposition.
31Boyce-Codd Normal Form (BCNF)
- A relation scheme R is in BCNF if every
determinant of a FD over R is a candidate key. - In other words, the determinant of every FD is a
superkey. - Violation of BCNF
- R(A,B,C,D,E,F)
- A ? BC, D ? AEF
- D ABCDEF
- D is a good primary key.
- A ABC
32Another Violation of BCNF
- Assume that we give each registration for a
course a unique registration number - Reg(rid, sid, cid, sname, grade)
- FDs
- rid ? sid, cid
- sid, cid ? rid, grade
- sid ? sname
- rid all attributes
33Getting Tables into BCNF
- Decompose according to the FD whose determinant
is not a superkey. - In our example, sid ? sname
- This gives
- Reg(rid, sid, cid, grade)
- Stud(sid, sname)
- Not always possible to get tables into BCNF while
preserving all functional dependencies.
34Example where BCNF is not possible
- Consider
- R(A,B,C)
- AB ? C, C ? B
- Not in BCNF because C is not a superkey.
- However, every decomposition of R fails to be
dependency preserving as we have to split up the
attributes in AB ? C - Have to settle for 3NF.
35Multivalued Dependencies (MVDs)
- In an FD, X ? Y, knowing the value of X means
that you know the unique value for Y. - In an MVD, X ? ? Y, knowing the value of X means
that you know the set of values from which Y can
come.
36Example of MVD
- Assume we have two streams for some course,
taught by different instructors, and that for
each course, we use two textbooks. - Example
- course instructor text
- CS35A Rao Date
- Harold Korth
- CS34A Rao Jackson
- Mugisa Rich
37Example of MVD Continued
- Putting table in 1NF gives
- Course Instructor Text
- CS35A Rao Date
- CS35A Rao Korth
- CS35A Harold Date
- CS35A Harold Korth
- CS34A Rao Jackson
- CS34A Rao Rich
- CS34A Mugisa Jackson
- CS34A Mugisa Rich
- With primary key
- Course, Instructor, Text
- Since no FD, in BCNF.
38Redundancy because of MVDs
- However, still redundancy in the table because
- if ltc,p,xgt and ltc,p,xgt in table ltc,p,xgt and
ltc,p,xgt in table too. - The table contains two multivalued dependencies
- Course ? ? Instructor
- Course ? ? Text
- Danger of insertion and update anomalies
39Fourth Normal Form
- Under 4NF, a relation should not contain two or
more independent MVDs. - In other words, if there is a MVD, X ? ? Y, then
X should be a superkey.
40Getting Tables into 4NF
- Again, get a table into 4NF through decomposition
so that each MVD is captured in a separate table. - Example
- CP(Course, Instructor)
- CT(Course, Text)
41Normalization Reconsidered
- Normalization helps avoid
- Insertion anomalies
- Update anomalies
- Deletion anomalies
- Normalization increases retrieval time for some
queries.