Title: Chapter 6: Physical Database Design and Performance
1. Chapter 6: Physical Database Design and Performance
2. Objectives
- Definition of terms
- Describe the physical database design process
- Choose storage formats for attributes
- Select appropriate file organizations
- Describe three types of file organization
- Describe indexes and their appropriate use
- Translate a database model into efficient structures
- Know when and how to use denormalization
3. Physical Database Design
- Purpose: translate the logical description of data into the technical specifications for storing and retrieving data
- Goal: create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability
- Does not include implementing files and databases
4. Physical Design Process
5. Designing Fields
- Field: smallest unit of named application data in the database
- Corresponds to a simple attribute from the logical data model
- Field design decisions
  - Choosing data type
  - Coding, compression, encryption
  - Controlling data integrity
  - Handling missing data
6. Choosing Data Types
- Data type: a detailed coding scheme recognized by system software, such as a DBMS, for representing organizational data
- Objectives for selecting data types
  - Minimize storage space
  - Represent all possible values
  - Improve data integrity
  - Support all data manipulations
7. Choosing Data Types (cont.)
- CHAR: fixed-length character
- VARCHAR2: variable-length character (memo)
- LONG: large variable-length character data
- NUMBER: positive/negative number
- INTEGER: positive/negative whole number
- DATE: actual date
- BLOB: binary large object (good for graphics, sound clips, etc.)
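A minimal sketch of these choices in Oracle-style DDL (the table and columns are assumptions for illustration):

  CREATE TABLE CUSTOMER_T (
    CUSTOMER_ID    INTEGER,        -- whole numbers only
    CUSTOMER_NAME  VARCHAR2(25),   -- names vary in length
    STATE          CHAR(2),        -- always exactly two characters
    DISCOUNT_RATE  NUMBER(5,2),    -- signed number, two decimal places
    SIGNUP_DATE    DATE,           -- actual date
    PHOTO          BLOB            -- binary image data
  );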
8. Figure 6-2: Example code look-up table (Pine Valley Furniture Company)
A code saves space, but costs an additional lookup to obtain the actual value.
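An illustrative sketch (table and column names assumed, not taken from the figure): the one-character code shrinks every PRODUCT row, but recovering the full value costs a join:

  CREATE TABLE FINISH_LOOKUP_T (
    FINISH_CODE  CHAR(1) PRIMARY KEY,  -- e.g. 'B', 'M', 'O'
    FINISH_DESC  VARCHAR2(25)          -- e.g. 'Birch', 'Maple', 'Oak'
  );

  SELECT P.PRODUCT_ID, F.FINISH_DESC   -- the extra lookup
  FROM PRODUCT_T P, FINISH_LOOKUP_T F
  WHERE P.FINISH_CODE = F.FINISH_CODE;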
9. Field Data Integrity
- Default value: assumed value if no explicit value is entered
- Range control: allowable value limitations (constraints or validation rules)
- Null value control: allowing or prohibiting empty fields
- Referential integrity: range control (and null value allowances) for foreign-key to primary-key match-ups
The Sarbanes-Oxley Act (SOX) legislates the importance of financial data integrity.
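These four controls map directly onto standard SQL column constraints; a minimal sketch with assumed table and column names:

  CREATE TABLE ORDER_T (
    ORDER_ID     INTEGER PRIMARY KEY,
    ORDER_DATE   DATE DEFAULT SYSDATE,          -- default value
    ORDER_STATUS CHAR(1) NOT NULL               -- null value control
      CHECK (ORDER_STATUS IN ('O','C','X')),    -- range control
    CUSTOMER_ID  INTEGER NOT NULL
      REFERENCES CUSTOMER_T (CUSTOMER_ID)       -- referential integrity
  );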
10. Handling Missing Data
- Substitute an estimate of the missing value (e.g., using a formula)
- Construct a report listing missing values and resolve the unknown values
- In programs, ignore missing data unless the value is significant (sensitivity testing)
Triggers can be used to perform these operations, as sketched below.
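A minimal Oracle-style trigger sketch of the first option (the table, column, and substituted estimate are all assumptions for illustration):

  CREATE OR REPLACE TRIGGER ORDER_LINE_QTY_TRG
  BEFORE INSERT ON ORDER_LINE_T
  FOR EACH ROW
  WHEN (NEW.ORDERED_QUANTITY IS NULL)
  BEGIN
    -- substitute an estimate for the missing value
    :NEW.ORDERED_QUANTITY := 1;
  END;
  /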
11. Physical Records
- Physical record: a group of fields stored in adjacent memory locations and retrieved together as a unit
- Page: the amount of data read or written in one I/O operation
- Blocking factor: the number of physical records per page
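For example, with a 4,096-byte page and 400-byte physical records (hypothetical sizes), the blocking factor is floor(4096 / 400) = 10 records per page.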
12. Denormalization
- Normalized relations solve data maintenance anomalies and minimize redundancies, but may not yield efficient data processing
- Denormalization: transforming normalized relations into unnormalized physical record specifications
- Benefits
  - Can improve performance (speed) by reducing the number of table lookups, i.e., reducing the number of join queries needed (see the sketch after this list)
- Costs (due to data duplication)
  - Wasted storage space
  - Data integrity/consistency threats
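A hedged sketch with assumed tables: in the normalized design a product description must be joined in, while the denormalized design duplicates it into each order line:

  -- Normalized: answering the question requires a join
  SELECT OL.ORDER_ID, P.PRODUCT_DESCRIPTION
  FROM ORDER_LINE_T OL, PRODUCT_T P
  WHERE OL.PRODUCT_ID = P.PRODUCT_ID;

  -- Denormalized: PRODUCT_DESCRIPTION copied into ORDER_LINE_T,
  -- so no join is needed (at the cost of duplicated data)
  SELECT ORDER_ID, PRODUCT_DESCRIPTION
  FROM ORDER_LINE_T;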
13. Figure 6-3: A possible denormalization situation: two entities with a one-to-one relationship
14. Figure 6-4: A possible denormalization situation: a many-to-many relationship with nonkey attributes
The normalized form requires an extra table access; the combined form makes a null description possible. It is advisable to combine attributes from one entity into the record representing the associative entity, avoiding one join operation.
15. Figure 6-5: A possible denormalization situation: reference data
The normalized form requires an extra table access; the merged form duplicates data. Consider merging the two entities when there are few instances of the entity on the many side for each entity instance on the one side.
16. Denormalization: Partitioning
- While denormalization can combine tables to avoid joins, it can also create more tables by partitioning a relation into multiple tables
- Horizontal partitioning: distributing the rows of a table into several separate files based on common column values
  - e.g., a customer relation could be broken into four regional customer files
  - Useful when different users need access to different categories of rows
  - Horizontal partitioning is very similar to a supertype/subtype relationship
  - Rows from different partitions can be reconstructed with SQL UNION (see the sketch after this list)
- Vertical partitioning: distributing the columns of a table into several separate relations
  - e.g., a PART relation can be broken into accounting-related, engineering-related, and sales-related tables
  - Useful when different users need access to different columns
  - The primary key must be repeated in each file
- Combinations of horizontal and vertical partitions often correspond to user schemas (user views)
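A minimal sketch of the UNION reconstruction, assuming a customer relation split horizontally into regional tables:

  SELECT * FROM CUSTOMER_EAST_T
  UNION
  SELECT * FROM CUSTOMER_WEST_T;  -- UNION ALL suffices if the partitions cannot overlap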
17. Denormalization: Partitioning (cont.)
- Advantages of partitioning
  - Efficiency: records used together are grouped together
  - Local optimization: each partition can be optimized for performance
  - Security and recovery
  - Load balancing: partitions stored on different disks reduce contention
  - Takes advantage of parallel processing capability
- Disadvantages of partitioning
  - Inconsistent access speed: retrievals across partitions are slow
  - Complexity: partitioning is not transparent to programmers
  - Extra space or update time: duplicate data; access from multiple partitions
18. Denormalization: Data Replication
- Purposely storing the same data in multiple locations of the database
- Improves performance by allowing multiple users to access the same data at the same time with minimum contention
- Sacrifices data integrity due to data duplication
- Best for data that is not updated often
19. Designing Physical Files
- Physical file: a named portion of secondary memory allocated for the purpose of storing physical records
- Tablespace: a named set of disk storage elements in which the physical files for database tables can be stored
- Extent: a contiguous section of disk space
- Constructs to link two pieces of data
  - Sequential storage
  - Pointers: a field of data that can be used to locate related fields or records; it contains the address of the associated data
20. File Organizations
- Technique for physically arranging the records of a file on secondary storage
- Factors for selecting a file organization
  - Fast data retrieval and throughput
  - Efficient storage space utilization
  - Protection from failure and data loss
  - Minimal need for reorganization
  - Accommodation of growth
  - Security from unauthorized use
- Types of file organizations
  - Sequential
  - Indexed
  - Hashed
21. Figure 6-7a: Sequential file organization
Records of the file are stored in sequence by the primary key field values (records 1, 2, ..., n).
- If sorted: every insert or delete requires a re-sort
- If not sorted: average time to find a desired record is n/2
22. Indexed File Organizations
- Records are stored either sequentially or nonsequentially, with an index that allows software to locate individual records
- Index: a separate table used to quickly determine the location of rows in a file that satisfy some condition (like a library card catalog)
- Primary keys are automatically indexed
- Oracle has a CREATE INDEX operation, and MS Access allows indexes to be created for most field types
- Indexing approaches
  - B-tree index (Fig. 6-7b)
  - Bitmap index (Fig. 6-8)
  - Join index (Fig. 6-9)
23. Figure 6-7b: B-tree index
The leaves of the tree are all at the same level, so access time is consistent.
- Uses a tree search
- Average time to find a desired record: the depth of the tree
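As an illustrative sizing (not from the figure): a B-tree whose nodes each hold up to 100 pointers can reach 100^3 = 1,000,000 records at depth 3, so any record is found in about three node accesses.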
24. Figure 6-8: Bitmap index organization
- A bitmap saves on space requirements
- Rows: the possible values of the attribute
- Columns: the table rows
- Each bit indicates whether the attribute of a row has that value
- Ideal for attributes that have only a few possible values
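A small hypothetical illustration (data assumed, not from the figure) for a FINISH attribute over six table rows:

                 row:  1  2  3  4  5  6
  Birch bitmap:        1  0  0  1  0  0
  Maple bitmap:        0  1  0  0  0  1
  Oak bitmap:          0  0  1  0  1  0

Only the Oak bitmap has a 1 in position 3, so row 3 has FINISH = 'Oak'; a query such as WHERE FINISH = 'Oak' only has to scan one bit vector.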
25. Figure 6-9: Join indexes speed up join operations
A join index is an index on columns from two or more tables that come from the same domain of values. It precomputes the result of the relational join operator.
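For instance (tables and columns assumed), a join index on the shared CUSTOMER_ID domain would let the DBMS answer a query like the following without matching rows at run time:

  SELECT C.CUSTOMER_NAME, O.ORDER_ID
  FROM CUSTOMER_T C, ORDER_T O
  WHERE C.CUSTOMER_ID = O.CUSTOMER_ID;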
26. Figure 6-7c: Hashed file or index organization
- Hash algorithm: a routine that converts a primary key value into a relative record address
- Usually uses division-remainder (modulo) arithmetic to determine the record position
- Records hashing to the same position are grouped in lists
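For example, with 997 available record positions (a hypothetical file size; a prime number is typical), the record with primary key 12345 is placed at relative address MOD(12345, 997) = 381.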
28. Clustering Files
- In some relational DBMSs, related records from different tables can be stored together in the same disk area
- Useful for improving the performance of join operations
- Primary key records of the main table are stored adjacent to the associated foreign key records of the dependent table
- e.g., Oracle has a CREATE CLUSTER command (see the sketch below)
29. Using Indexes
- Database manipulations require locating a row, or a collection of rows, that satisfies some condition
- Scanning every row in a table is unacceptably slow when tables are large
- The structure of an index: a table by itself with two columns, the key and the address of the record(s) containing that key value
30. Creating Indexes
- Unique key index:
  CREATE UNIQUE INDEX CUSTINDEX ON CUSTOMER (CUSTOMER_ID)
- Composite unique key index:
  CREATE UNIQUE INDEX LINEINDEX ON ORDER_LINE (ORDER_ID, PRODUCT_ID)
- Secondary (nonunique) key index:
  CREATE INDEX DESCINDEX ON PRODUCT (DESCRIPTION)
- Bitmap index:
  CREATE BITMAP INDEX DESCBITINDEX ON PRODUCT (FINISH)
31. Rules for Using Indexes
- Use on larger tables
- Index the primary key of each table
- Index search fields (fields that frequently appear in WHERE clauses)
- Index fields used in SQL ORDER BY and GROUP BY clauses
- Index fields with more than 100 distinct values, but not fields with fewer than 30
32. Rules for Using Indexes (cont.)
- Avoid indexes on fields with long values; perhaps compress the values first
- The DBMS may limit the number of indexes per table and the number of bytes per indexed field(s)
- Null values will not be referenced from an index
- Use indexes heavily for non-volatile databases; limit the use of indexes for volatile databases
- Why? Because modifications (e.g., inserts and deletes) require the index files to be updated as well
33. RAID
- Redundant Array of Inexpensive Disks
- A set of disk drives that appear to the user to be a single large logical disk drive
- Stripes cut across all disk drives, so the workload is balanced
- Allows parallel access to data (improves access speed)
- Pages are arranged in stripes
- Increases the likelihood of a disk drive failure, so fault-tolerant technologies (the RAID variations) were developed to store redundant data
34. Figure 6-10: RAID with four disks and striping
Here, pages 1-4 can be read or written simultaneously.
35. RAID Types (Figure 6-10)
- RAID 0
  - Maximized parallelism
  - No redundancy, no error correction, no fault tolerance
- RAID 1
  - Fully redundant disk mirror
  - Each write operation must be done twice
  - Most common form
- RAID 2
  - No redundancy
  - One record spans across the data disks
  - Error correction in multiple disks reconstructs damaged data
- RAID 3
  - Error correction in one disk
  - A record spans multiple data disks (more than RAID 2)
  - Not good for multi-user environments
- RAID 4
  - Error correction in one disk
  - Multiple records per stripe
  - Parallelism, but slow updates due to error-correction contention
- RAID 5
  - Rotating parity array; error correction takes place in the same disks as data storage
  - Parallelism, with better performance than RAID 4
36. Database Architectures (Figure 6-11)
- Legacy systems
- Current technology
- Data warehouses
37. Optimizing Query Performance
- Parallel query processing
  - Symmetric multiprocessor technology
  - Breaking a query into modules that can be processed in parallel by related processors
  - Example: each processor runs a copy of the query on a horizontal partition
- Overriding automatic query optimization
  - In a DBMS, the query optimizer chooses the best plan for executing a query based on statistics about each table
  - The optimizer's plan for processing a query can be inspected with the EXPLAIN or EXPLAIN PLAN command, showing steps such as which indexes are accessed, whether parallel servers are used, and how tables are joined (see the sketch below)
  - If you know a better way, you can force the DBMS to perform the steps differently
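A minimal Oracle-style sketch (the table and predicate are assumptions for illustration):

  EXPLAIN PLAN FOR
    SELECT * FROM CUSTOMER_T WHERE CUSTOMER_ID = 1001;

  -- Display the optimizer's chosen steps
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);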
38. Optimizing Query Performance (cont.)
- Picking a data block size
  - Too small a size results in many physical I/O operations
  - Too large a size results in extra data being transferred
  - Normally 2K to 32K
- Balancing I/O across disk controllers
  - Disks are attached to controllers; the more controllers, the better the parallel access
  - Collect statistics on disk and controller utilization during table accesses, and balance the workload by moving tables between disk drives and controllers