Title: Database Management Systems CSE530a
1Database Management Systems CSE530a
2Today
- Internal Level Physical Storage and Access
3Indexes
- Ordered indices values in sorted fashion
- Hash indices values distributed across
buckets by using a function - An index record consists of a value and pointers
to one or more records with that value. Can be - Dense every value group indexed
- Sparse only some values are indexed
- Include
- Compound indexes (values from more than one data
column) - Covering index (uses values in the index for the
SELECT clause) - Unique index
- Clustering indexes (stores similar data rows near
each other) - Bitmap indexes (assigns 1 if a value is true, 0
if false)
4Hash Files
- Records written in non-sequential order
- Can use
- Open addressing, Unchained overflow, Chained
overflow, Multiple hashing, Dynamic hashing - Limitations
- Useful for exact match
- Poor for ranges, patterns, additional fields
5Hashing - Introduction
- A hash function h(x), is essentially a mapping of
the entire set of possible values to a smaller
set of m values. These values are used to
determine the location of a given record in the
table. - Hash function calculates address of the page
where record is stored - based on a one or more base fields (hash field)
- If a key field, called hash key
- Hash function creates even spread of records
across file - Folding applies math to different parts of the
has field (empID 0110 could become (01)10. 11
is address of disk page - Division-remainder uses mod 0110 mod 100. 10 is
address of the disk page
6Hashing - Introduction
7Hashing
- There are two main types of hashing
- Static Hashing
- uses a fixed address space, defined when the file
is created. When the address space becomes too
full, the space is said to be saturated. When a
file becomes saturated, reorganize the hash
structure, usually by choosing a new hash
structure and mapping the old file to a new one. - Dynamic Hashing
- allows the file size to change dynamically,
growing and shrinking as the database size
changes.
8Hashing - Collisions
- Collisions are an unavoidable complication of
hash functions. - In general, the hash functions used by a DBMS
will map each record to a storage location which
holds some number of records. In this case, a
collision only occurs when a new record hashes to
a location which has already reached capacity. - Avoid collisions by the choice of the hash
function and key. - A good hash function satisfies the assumption of
simple uniform hashing For a given has function,
each key is equally likely to hash to any of m
slots, independently of any other keys hash
value. - Distribute keys uniformly over the available hash
values. A function which satisfies this
assumption will minimize collisions, dividing the
input values evenly among the available storage
locations.
9Hashing - Static Collision Handling
- Four most common collision handling methods for
statically hashed databases. - Open Addressing
- Unchained Overflow
- Chained Overflow
- Multiple Hashing
10Static Collision Handling - Open Addressing
For collisions, a linear search finds the next
available slot. Once the last slot has been
searched, the system continues searching at the
first slot. When searching, the system first
checks the location indicated by the records
hash value. If no such record is there, a linear
search of the hash locations locates the record.
Wrapping around indicates no record exists with
that key.
11Static Collision Handling
- Unchained Overflow
- A separate overflow area is mapped. When a
collision occurs, the new record is redirected to
the first available space. - When searching, the system first checks the
location indicated by the records hash value.
If no such record is there, a linear search of
the overflow area locates the record. - Can show significant improvement over open
addressing, particularly if collisions are
minimized. Fewer collisions result in a smaller
overflow space and a shorter search time for
records located in the overflow area. - Chained Overflow
- Every location identified by a hash value carries
a synonym pointer. The synonym pointer
identifies the specific page in the overflow area
where it directs records causing a collision.
Incoming records are redirected to the first open
slot in the page. - Searches are the same as with unchained overflow,
except the linear search is over a smaller area,
again resulting in a better performance when
recovering colliding records. - Multiple
- When a collision occurs, a second hashing
function is applied. Multiple hashing can also
be applied to an overflow method, using a second
hash function to place the record into the
overflow area, providing a chance at direct
access to the record even in the event of a
collision. - When searching, if the record is not found at the
first hash-value location, the subsequently
hashed specified location is searched. - It is possible to use more than one additional
hash function, however one of the previous
methods will most likely be implemented alongside
multiple hashing in the event that all hashes of
a given record result in collisions.
12Dynamic Collision Handling - Extendible Hashing
Generates values over a large range, typically
b-bit binary integers, where 32 is the most
common value of b. Buckets (slots) are created as
required. Start with a single bucket and add
records until full. Then the bucket is split,
depending on I bits of the hash value for each
record, where 0 ? i ? b. These i bits are the
offset into a directory which stores i, called
the depth, along with 2i pointers, one for each
bucket. Each bucket holds the current value of i
to determine the address. When a bucket is
emptied, the bucket and its corresponding
directory entry are deleted, reducing the
overhead.
13B Tree Assignment
- Assignment
- Write brief pseudocode and use the pseudocode to
implement a B tree for improving search queries
that always combine the Applicant and Trade Name
fields in the WHERE clause - The nodes of the tree can hold a maximum of 6 key
values - Assuming you are inserting and removing the
following values into your table and build and
rebuild your secondary index using each value
returned (in the order it appears) for the
following queries (representing the impact on
your tree) - INSERT INTO MYTABLE select distinct
applicant,tradename from tblProducts where
ingredient 'diazepam' and routeofadministration
oral - INSERT INTO MYTABLE select applicant,tradename
from tblProducts where ingredient 'lorazepam'
and routeofadministration 'oral' and applicant
not in (select distinct applicant from
tblProducts where ingredient 'diazepam' and
routeofadministration 'oral') - DELETE FROM MYTABLE WHERE APPLICANT IN (Select
distinct applicant from tblProducts where
ingredient 'lorazepam' and routeofadministration
'oral' and applicant in (select distinct
applicant from tblProducts where ingredient
'diazepam' and dosageform 'injectable')) - This grade will be included in the lab grades
- Due December 7
14B-Trees
- Well established as the most common structures
for indexes - Multi-level
- d is the order of the tree it is a measure of
the tree node capacity - Every node except the root contains m entries,
where d/2 lt m lt d - The root node contains 1 lt m lt d entries
- Non-leaf nodes with m index entries contain m1
pointers to children - Pointer Pi points to a subtree with K values such
that Ki-1 lt K lt Ki
15Motivations for a B-Trees
Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
16Motivations for a B-Trees
Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
23
Overflow
Pages
17Motivations for a B-Trees
Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
41
48
23
Overflow
Pages
42
18B-Trees (order 2) one algorithm
- Query find all values with a pointer value of P
- If search value is lt SearchKey value, go left
otherwise, go right
19B-Tree Search Function
- The keys in each node are in ascending order.
- At every given node N the following is true
- The subtree starting at record N.Branch0 has
keys lt N.Key0 - The subtree starting at record N.Branch1 has
keys gt N.Key0 and less than N.Key1 - The subtree starting at record N.Branch2 has
keys gt N.Key1 and less than N.Key2
20B Tree index/sequence sets
21B Trees datapage leaf
Sort order Anizy,Apach,Apensen,Ardwick,Arnham,Ath
ens
22B Tree Example
- Rules for this example
- d is the order of the tree it is a measure of
the capacity of child nodes - Every node except the root contains m entries,
where d/2 lt m lt d - The root node contains 1 lt m lt d entries
- Non-leaf nodes with m index entries contain
between (m1)/2 and m1 pointers to children - Pointer Pi points to a subtree with K values such
that Ki-1 lt K lt Ki - Search uses pointer to the right for greater than
or equal to in non-leaf nodes, greater than in
leaf nodes until equal to is found or not found
23Records in a block
- How to store records in blocks?
- number of records r
- block size B
- record size R
- blocking factor bf number of records in a
block - Bf ?B/R? (spanned, unspanned)
- number of blocks needed b
- b ?r/bf?
24B-trees performance impact
- A 4k page can many records per page
- ((4 b/pointer 4b/field)n, 4b/pointer) order
of 512 - Root 511 records
- Level 1 261,632 records
- Level 2 133,955,584 records
- Total 134,217,727 records
25B-trees performance impact
- 1,000,000 records of 300Bytes each (including
header) - Search key is a 4 byte int a pointer requires 4
bytes - 4KB blocks, no block header, random placement,
avg. retrieval time 5.6 ms - No time for memory reads
- 13.6 records/block 76924 blocks to store data
- 512 indexes/block 1954 blocks to store index
- No index
- (76924/2) 38462 block accesses (avg.)
- Time to find a record 38462 5.6 ms 215.4 s
- Indexed, binary search
- log(1954) 1 11 1 12 block accesses
(maximum)time to find a record 12 5.6 ms
67.2 ms - Indexing increased speed by 3205 times.