Title: Relational Database M 358 Block 5
1Relational Database (M 358)Block 5
2Storage Device
Disk Architecture Disk platters - One of more
disk platters rotate continually about the
central axis at constant speed, each is accessed
by a read/write head. A disk platter would
consists of two disk surfaces. A track - Data is
stored along a number of tracks around the disk A
sector - A contiguous sequence of bytes within a
single track of one surface of a disk platter. A
sector of data is the minimum amount of data
transferred between the disk and the memory A
cylinder - At any moment, heads are positioned on
the equivalent track on each of the disk
surfaces. Each cylinder consists of all these
equivalent tracks Example of calculation for
disk capacity 2000 cylinder 10 disk surfaces 8
sectors per track 4K bytes per sector Disk
capacity 4K x 8 x 2000 x 10 640000K bytes
3Storage Device
- Accessing Data Stored on a Disk
- Time to access (read or write) a sector of data
on a disk - Seek time - Time for the access mechanism to move
the read/write heads from current position to the
cylinder where the required sector resides -
increases as the distance the arms must move
increases. - Rotational delay - Time it takes waiting for the
start of the required sector to come round and
position itself at the read/ write head. - Data transfer time - is the time taken to
transfer the contents of a sector on the disk to
the memory - If two rows of data reside within the same disk
sector, a single disk access is sufficient to
transfer data to memory. - If two rows reside in different cylinders, then
two disk accesses would be required.
4Storage Model
Storage model - the software components that form
the interface between the DBMS and the operating
system. Two approaches for internal file
management 1. DBMS use the file organization and
access methods provided by the underlying
operating system for file management. 2. DBMS
itself manages the internal file organization -
adopted by most large scale DBMS. Necessary
software components for the second approach 1.
File manager - view database as a logical file
- a collection of blocks in memory, where each
block corresponds to a particular sector on
disk. - determine which block (disk sector)
contains the desired rows. - instruct buffer
manager to retrieve rows from buffer if they
exist in buffer. - instruct disk manager to
retrieve rows from disk if they only exist in
disk.
5Storage Model
6Storage Model
2. Disk manager - views database as physical
file - a collection of sectors on the
disk. - determine the physical disk address of a
required block. - retrieve rows from disk and
place them in buffer by buffer manager. 3. Buffer
manager - maintain as many blocks as possible in
buffer memory to minimize no. of disk
accesses. - manage the allocation of space in the
memory for storing the blocks , e.g., by least
recently used (LRU) scheme - replace blocks which
is least recently referenced by a user
process. - schema blocks and index blocks in the
buffer are not replaced (due to frequent
access). - support transaction uncommitted data
- are not allowed to write back to disk
(pinned). committed data - are written back to
disk (forced output). rolled back data - are
removed from buffer without writing back to
disk. - support concurrency - lock active
transaction in buffer.
7Storage Structures
Fixed length storage records in a block -
simple structure for management. - block header
includes the block number of the block. - block
footer contains the location of a particular row
within the block. - record prefix - control
information , includes the row identifier. - when
a record is being retrieved, record prefix is
stripped off before passing to DBMS. - when a
record is being stored, record prefix is added to
the data and placed into a block - when a row is
deleted, the record prefix contains a field which
is used to indicate that the storage record is
logically deleted. Variable length storage
records in a block - column values are stored
using minimum number of bytes required (e.g.
trailing blanks of a string is not
stored). - occupy less storage space. -
drawbacks not easy to reuse space formerly
occupied by a deleted record. require
additional field in the record prefix to indicate
the no. of bytes of row data.
8Storage Structures
Row Identifier (RID) - for locating a row of
stored records - contains a block number
and an offset from end of block (within the
block footer). It locates a field containing an
offset from the beginning of the block that
locates the storage record within that
block. - RIDs need not to be changed even if
records are moved up and down within the block,
only the offset values at the block footer have
to change.
9Storage Structures
Pointer Chains - Store two or more tables in a
database file. - At the record prefix, store the
RID of the next storage record, forming a pointer
chain. - The location of the first storage record
in each chain needs to be held in a known
position, typically the RIDs of the first storage
record of each table will be held in contents
block (usually the first block of the database
file).
10Storage Structures
Clustering - store records that are logically
related (or frequently used together) together
within the same or adjacent block - to minimize
the disk accesses. - clustering scheme can vary
at different time, according to the frequency of
different type of data request.
11Storage Structures
Indexing E.g. SELECT name FROM student WHERE
student_id s05 - If whole student table is
loaded into memory for searching, it will involve
many disk accesses. - To minimize the disk
access, build an index file containing a mapping
between student_ids and RIDs. - Search this
index file (suppose small and in memory) to find
out the RID of student_id s05. Then use just
a single disk access to retrieve the data of
student_id s05 according to the searched RID.
- Indexing overhead - when new rows are
inserted, a new entry will also have to be added
to the index.
12Storage Structures
- Indexing facilitates
- Direct access to individual rows of a table (just
like the previous example) - Sequential access to all, or some, rows of a
table where the sequence is that defined by the
ordering of the values in the index file. - Multi-level (B tree) index
- - for more efficient index searching
13Storage Structures
- Hashing
- By computation to determine the location of a
desired row from the column value - To retrieve a row of a given column value, the
DBMS performs the computation and instructs the
file manager to fetch the corresponding record at
the computed position. - Two hashing schemes
- 1. Suppose 13 rows of data per block. Divide the
column value by 13 to obtain the location
information. E.g student_id s38 - 38 / 13 2 remainder 12,
- then retrieve s38 data from block number 2 1
3, and offset from end of block is 12 - data may not uniformly distribute and waste
space - 2. Suppose use 3 blocks to store 30 students.
Divide the column value by 3 to obtain the block
number. Offset value is determine by file
manager. - 38/3 12 remainder 2, block number is 2 1
3 and offset value is assigned by file
manager, say the next available slot. - less space. But space indicated by computed
blocked number might be fully occupied, need
extra algorithm to handle such situation
14Comparison of Indexing and Hashing
15Distributed Data
Database divided into a number of regional
databases
user process b
user process a
Region 2
Region 1
Region 3
Some form of communication network must be
provided to join the local databases and allow
access from both local and distributed users
Region 2 students
Region 1 students
Region 3 students
Region 2 staff
Region 1 staff
Region 3 staff
Local users - users who require access to only
the data stored at a specific location. Distribute
d users - users who require access to data stored
at multiple locations.
16Distributed Data
- Three approaches for distribution of processing
and data - 1. Client multi-server requires explicit
navigation by the user process between multiple
database servers that are independent of each
other. - 2. Distributed database hide the location of the
underlying data and how it is distributed so that
it appears as a single database from users point
of view. - 3. Replication systems makes use of data
replication, distributing the copies to allow
local processing of data within each local
database system.
17Client Multi-server Systems
user process
client
server
DBMS
DBMS
database
schema
database
schema
Region 1 Student Database
Region 2 Student Database
- Need to ensure integrity of data among different
database. E.g. need to keep student_id in Region
1 and Region 2 databases above be unique -
possibly assigning a particular range of
student_id for each region. - User needs to know which database he is dealing
with -gt Connection management - Establish connection CONNECT
- Make current connection for a dormant connection
SET CONNECTION - Close down communication DISCONNECT
18Client Multi-server Systems
- Transaction Management - two phase commit
- E.g. System A performs a transaction including an
INSERT on System B and a DELETE on System C
insert
DELETE GET READY COMMIT
INSERT DELETE COMMIT first phase second
phase done
INSERT GET READY COMMIT
delete
get ready
get ready
request system B to prepare to commit/rollback
transaction
time out
commit the local transaction
yes
yes
commit
commit
When both system B and C all reply yes
final commit of transaction
OK
OK
System C
System A
System B
19Distributed Databases
network
- Local users can interface with DBMS. Distributed
users must interface with DDBMS. - Location independence Users do not have to know
anything about where data is stored nor how it is
processed to meet their requirement.
20Distributed Databases
- Global schema schema describing all the data for
a distributed user (like logical schema) - Distributed schema schema defining where data
is located, i.e. in which local database (like
storage schema). - Location of global and distribution schema
21Distributed Databases
- Data fragmentation
- Horizontal fragmentation a subset of complete
rows of a table is stored in one location. - Vertical fragmentation some columns of a table
are stored in one location and other columns in
another. - A combination of horizontal and vertical
fragmentation. - Processing queries in distributed databases
- Subsidiary query - retrieve some part of the
required data from a given location. - Combine intermediate tables resulting from
subsidiary queries from different locations to
obtain final results. Combining tables involves
movement of data among locations, performed by - UNION, for horizontally fragmented database
- JOIN, for vertically fragmented database
22Replication Databases
- Replication is a way of providing copies of data
to other database servers so that each user
process can perform all processing locally. - minimizing the use of remote connections.
- reducing the dangers of remote site failure or
connection failure. - Advantages
- shorter response time.
- higher availability.
- presence of data snapshots can facilitate for
auditing purpose or trend analysis.
23Replication Databases
network
- Replication server
- to receive or collect updates from its local
database and pass copies on to other sites that
require them - to receive or collect updates from remote sites,
via their replications servers and apply them to
the local database. - Ways for transfer replicates
- Sender pushes replicates to receiver, whenever
there is data change. - Receiver requests, from time to time, replicates
from sender (pull method).
24Replication Databases
- Ways of replication
- Consolidation
- collection of data fragments from a number of
locations, into one location to construct a
global view. - Dissemination
- one database and copies data fragments for
distribution of different locations. - A primary source provide copies of data to
different locations, the data copies are often
the subset of primary copy.
primary
primary B
primary A
copy A copy B
copy A
copy B
25Data Warehouse
- Data warehouse
- extracting data from one or more databases to
build up a large historical database, enabling
users to look for trends or patterns within the
data stored by an organization. - Separate from operational system (e.g. order
processing system) which usually holds data only
necessary for supporting day-to-day operations.
Data values held in operational system is
constantly changing and older data is archived or
even deleted. Data warehouses are simplified
structures which have not necessarily been
normalized in order to ease for making queries. - Characteristics of data warehouse
- Subject oriented - data is organized around
subjects (such as sales) rather than operation
applications (such as order processing). - Non-volatile - data, once placed in the
warehouse, is usually not subject to change. - Integrated - data is consistent. Data comes from
various systems of different technologies. Their
format and meaning should be consistent when
putting into warehouse. - Time variant - historical data is recorded.
26Data Warehouse
- Dimensions When performing data analysis in
data warehouse, the decision maker would make
analysis on different dimensions, say sales by
product, sales by member (product and member are
then the dimensions). Time is always one of the
dimensions in a data warehouse. - Dimensional analysis is to identify what is the
subject (e.g. sales) for the data warehouse, and
what are the dimensions in the data warehouse,
which finally often depicted in star schema. - Facts of the subject - they are the attributes
that can be analyzed across the dimensions, e.g.
cost, selling price (for the subject sales).
Facts must can be totaled over a given time
period.
subject
dimension 1
time
dimension 2
27Data Warehouse Model
Extraction component - copies data from the
existing data sources to allow them to put into
the warehouse database. Integration component -
ensures the consistency of data from different
data sources, which includes format integration
- logically similar data types (e.g. dates of
different formats ) are converted so that they
have the same physical data format. Presence of
a set of integration rules as a kind of map which
specifies how information has to be converted
before it is allowed into the data
warehouse. semantic integration - to make
consistent for the meaning of information. Each
kind of data should be understood by everyone -gt
requires metadata to describe about the data.
28Data Warehouse Model
Data warehouse component - stores and manages
data. - in star schema. - the fact table (the
centre of the star) usually has a large number of
rows and a small number columns. - fact columns
should be summable which facilitates the
production of summary tables for management
decision. - the dimension tables (the points of
the star) are joined to the fact table through
foreign keys. Aggregates - in order to avoid
full scan of fact tables to process users
queries every time, some aggregated fact tables
(e.g. summing data by day or by month) are
prepared in the DBMS so that users can access
those if they happen to satisfy the requirements
of the query. Aggregate navigation component -
receives user query, analyzes which level of
aggregate tables can be applied, and if
applicable, directs the query automatically to
the aggregate tables for execution. - for
improving the query execution time. Presentation
tools - tools presenting data to users. - many
client tools available for presenting forms,
charts etc for users.
29THE END