Title: Physical%20Database%20Design
1Chapter 6
2Physical Design
- The purpose of the physical design process is to
translate the logical description of the data
into technical specifications for storing and
retrieving data - Goal create a design that will provide adequate
performance and insure database integrity,
security, and recoverability - Decisions made in this phase have a major impact
on data accessibility, response times, security,
and user friendliness.
3Required Inputs
- Normalized relations
- Data volume and use estimates
- Attribute definitions
- Descriptions of where and when data are used
- Expectations for response time, data security,
backup, recovery, retention, and integrity - Description of chosen technology
4Determining volume and usage
- Data volume statistics represent the size of the
business - calculated assuming business growth over a period
of several years - Usage is estimated from the timing of events,
transaction volumes, and reporting and query
activity. - Less precise than volume statistics
5Composite usage map (Pine Valley Furniture
Company)
6Physical Design Decisions
- Specify the data type for each attribute from the
logical data model - minimize storage space and maximize integrity
- Specify physical records by grouping attributes
from the logical data model - Specify the file organization technique to use
for physical storage of data records - Specify indexes to optimize data retrieval
- Specify query optimization strategies
7Data Format
- Data type selection goals
- minimize storage
- represent all possible values
- eliminate illegal values
- improve integrity
- support manipulation
- Note these have different relative importance
8Data format decisions (coding)
- e.g., AH(Adams Hall), B(Buchanan) , etc
- implement by creating a look-up table
- there is a trade-off in that you must create and
store a second table and you must access this
table to look up the code value - consider using when a field has a limited number
of possible values, each of which occupies a
relatively large amount of space, and the number
of records is large and/or the number of record
accesses is small
9Example code-look-up table (Pine Valley Furniture
Company)
10Data Format decisions (integrity)
- Data integrity controls
- default value
- Range control
- Null value control
- Referential integrity
- Missing data
- substitute an estimate
- report missing data
- Sensitivity testing
11For example...
- Suppose you were designing the age field in a
student record at your university. What decisions
would you make about - data type
- integrity (range, default, null)
- How might your decision vary by other
characteristics about the student such as degree
sought?
12Attribute groupings
- Physical Record A group of fields stored in
adjacent memory locations and retrieved together
as a unit. - Page The amount of data read or written in one
I/O operation. - Blocking Factor The number of physical records
per page.
13Database Access Model
The goal in structuring physical records is to
minimize performance bottlenecks resulting from
disk accesses (accessing data from disk is slow
compared to main memory)
14Attribute groupingDenormalization
- Process of transforming normalized relations into
denormalized physical record specifications - may partition a relation into more than one
physical record - may combine attributes from different relations
into one physical record
15Denormalization
- Involves a trade-off
- Reduced disk accesses and greater performance
(due, for example, to fewer table joins) - - But -
- Introduction of anomalies (and thus redundancies)
that will necessitate extra data maintenance - increase chance of errors and force reprogramming
when business rules change - may optimize certain tasks at the expense of
others (if activities change, benefits may no
longer exist)
16Denormalization opportunities
- 11 relationship
- MM associative entity with non-key attributes
- reference data
17A possible denormali-zation situation reference
data
18More denormalization options
- Horizontal Partitioning Distributing the rows of
a table into several separate files. - Vertical Partitioning Distributing the columns
of a table into several separate files. - The primary key must be repeated in each file.
- Combination of both
19Partitioning
- Advantages of Partitioning
- Records used together are grouped together
- Each partition can be optimized for performance
- Security and recovery
- Partitions stored on different disks less
contention - Parallel processing capability
- Disadvantages of Partitioning
- Slower retrievals when across partitions
- Complexity for application programmers
- Anomalies and extra storage space requirements
due to duplication of data across partitions
20How about this..
- Consider the following normalized relations
- STORE(Store_Id, Region, Manager_Id, Square_Feet)
- EMPLOYEE(Emp_Id, Store_Id, Name, Address)
- DEPARTMENT(Dept, Store_ID, Manager_Id,
Sales_Goal) - SCHEDULE(Dept, Emp_Id, Date, hours)
- What opportunities might exist for
denormalization?
21Physical Files
- Physical File A file as stored on disk
- Constructs to link two pieces of data
- Sequential storage
- Pointers
- File Organization How the files are arranged on
the disk (more on this later) - Access Method How the data can be retrieved
based on the file organization - Relative - data accessed as an offset from the
most recently referenced point in secondary
memory - Direct - data accessed as a result of a
calculation to generate the beginning address of
a record
22File Organizations
- A technique for physically arranging the records
of a file on secondary storage devices. - Goals in selecting (trade-offs exist, of
course) - Fast data retrieval
- High throughput for input and maintenance
- Efficient use of storage space
- Protection from failures or data loss
- Minimal need for reorganization
- Accommodation for growth
- Security from unauthorized use
23File Organizations
- Sequential
- Indexed
- Indexed Sequential
- Indexed Nonsequential
- Hashed (also called Direct)
- See Table 6-3 for comparison
24Sequential File Organization
- Records of the file are stored in sequence by the
primary key field values
25Comparisons of file organizations (a) Sequential
26Sequential Retrieval
- Consider a file of 10,000 records each occupying
1 page - Queries that require processing all records will
require 10,000 accesses - e.g., Find all items of type 'E'
- Many disk accesses are wasted if few records meet
the condition - However, very effective if most or all records
will be accessed (e.g., payroll)
27Indexed File Organization
- Index concept is like index in a book
- Indexed-sequential file organization The
records are stored sequentially by primary key
values and there is an index built on the primary
key field (and possibly indexes built on other
fields, also)
28(b) Indexed
29Hashed File Organization
- Hashing Algorithm Converts a primary key value
into a record address - Division-remainder method is common hashing
algorithm(more to come on this)
30(c ) Hashed
31Hashing
- A technique for reducing disk accesses for direct
access - Avoids an index
- Number of accesses per record can be close to one
- The hash field is converted to a hash address by
a hash function
32Hashing
33Shortcomings of Hashing
- Different hash fields may convert to the same
hash address - these are called Synonyms
- store the colliding record in an overflow area
- Long synonym chains degrade performance
- There can be only one hash field per record
- The file can no longer be processed sequentially
34Clustering
- In some relational DBMS, related records from
different tables that are often retrieved
together can be stored close together on disk - Because the related records are stored close to
one another on the physical disk, less time is
needed to retrieve the data - E.g., Customer data and Order data may frequently
be retrieved together - Can require substantial maintenance if the
clustered data changes frequently
35Indexing
- An index is a table file that is used to
determine the location of rows in another file
that satisfy some condition
36Querying with an Index
- Read the index into memory
- Search the index to find records meeting the
condition - Access only those records containing required
data - Disk accesses are substantially reduced when the
query involves few records
37Maintaining an Index
- Adding a record requires at least two disk
accesses - Update the file
- Update the index
- Trade-off
- Faster queries
- Slower maintenance (additions, deletions, and
updates of records) - Thus, more static databases benefit more overall
38Rules of Thumbfor Using Indexes
- 1. Indexes are most useful on larger tables
- 2. Index the primary key of each table(may be
automatic, as in Access) - 3. Indexes are useful on search fields (WHERE)
- 4. Indexes are also useful on fields used for
sorting (ORDER BY) and categorizing (GROUP BY) - 5. Most useful to index on a field when there are
many different values for that field
39Rules of Thumbfor Using Indexes
- 6. Find out the limits placed on indexing by your
DBMS (Access allows 32 indexes per table, and no
index may contain more than 10 fields) - 7. Depending on the DBMS, null values may not be
referenced from an index (thus, rows with a null
value in the field that is indexed may not be
found by a search using the index)
40Rules for Adding Derived Columns
- Use when aggregate values are regularly
retrieved. - Use when aggregate values are costly to
calculate. - Permit updating only of source data.
- Create triggers to cascade changes from source
data.
41One Other Rule of Thumbfor Increasing Performance
- Consider contriving a shorter field or selecting
another candidate key to substitute for a long,
multi-field primary key (and all associated
foreign keys)
42Query Optimizer Factors
- Type of Query
- Highly selective.
- All or most of the records of a file.
- Unique fields
- Size of files
- Indexes
- Join Method
- Nested-Loop
- Merge-Scan (Both files must be ordered or indexed
on the join columns.)
43More practice
- Draw a composite usage map for the following
- PERSON(person_ID, name, address, DOB)
- PATIENT(PA_person_ID, Contact)
- PHYSICIAN(PH_person_ID, specialty)
- PERFORMANCE(PA_person_ID, PH_person_ID,
Treatment, Treatment_date, Treatment_time) - CONSUMPTION(PA_person_ID, Item, Date, Quantity)
- ITEM(Item, Description)
- Make recommendations about denormalizing,
partitioning, file organization, and indexing
44RAID
- Redundant Arrays of Inexpensive Disks
- Exploits economies of scale of disk manufacturing
for the computer market - Can give greater security
- Increases fault tolerance of systems
- Not a replacement for regular backup
45RAID
- The operating system sees a set of physical
drives as one logical drive - Data are distributed across physical drives
- All levels, except 0, have data redundancy or
error-correction features - Parity codes or redundant data are used for data
recovery
46Mirroring
- Write
- Identical copies of file are written to each
drive in array - Read
- Alternate pages are read simultaneously from each
drive - Pages put together in memory
- Access time is reduced by approximately the
number of disks in the array - Read error
- Read required page from another drive
- Tradeoffs
- Provides data security
- Reduces access time
- Uses more disk space
47Mirroring
Complete Data Set
Complete Data Set
No parity
48Striping
- Three drive model
- Write
- Half of file to first drive
- Half of file to second drive
- Parity bit to third drive
- Read
- Portions from each drive are put together in
memory - Read error
- Lost bits are reconstructed from third drives
parity data - Tradeoffs
- Provides data security
- Uses less storage space than mirroring
- Not as fast as mirroring
49Striping
One-Half Data Set
One-Half Data Set
Parity Codes
50Database Architectures
- Hierarchical
- Network
- Relational
- Object-oriented
- Multidimensional
51Hierarchical models
- Type of logical database model in which data is
organized in a tree structure. - Records are divided into segments which are
linked using pointers. - Parent and child data are stored together.
52Sample hierarchical model
ROOT/PARENT
Employee
Job
Benefits
Compensation
FIRST CHILD
Assignments
2nd CHILD
Ratings
Salary
Pension
Insurance
Health
53Network model
- Extends the hierarchical model by allowing a
child to have zero, one, or many parents - No real theoretical base
- Never widely
- used
54Comparing the alternatives
OODBS are marketed as easy to use, but the
transition from RDBMS is difficult