The Ubiquitous BTree presentation

About This Presentation

Title:

The Ubiquitous BTree

Slides: 18

Provided by: shubhakris

Transcript and Presenter's Notes

Title: The Ubiquitous BTree

1
Algorithms I (91.503)
The Ubiquitous B-Tree
- By Douglas Comer
Overview

This presentation covers:
The basic B-tree and its operations (balancing, insertion and deletion), including the cost of these operations.
A comparison of several variations of the B-tree.
B-trees in a multi-user environment, including synchronization and security issues.
A general-purpose access method based on B-trees (VSAM), including performance enhancements, the tree-structured file directory and other VSAM facilities.

2
Introduction
What are B-trees?
There are many techniques for organizing a file and its index. The B-tree is, de facto, the standard organization for indexes in a database system.

What is a file operation?
A file is a set of records, each of the form $r_i = (k_i, \alpha_i)$, in which $k_i$ is called the key for the i-th record and $\alpha_i$ the associated information. The file operations include:
insert: add a new record $(k_i, \alpha_i)$, checking that $k_i$ is unique
delete: remove record $(k_i, \alpha_i)$, given $k_i$
find: retrieve $\alpha_i$, given $k_i$
next: retrieve $\alpha_{i+1}$, given that $\alpha_i$ was just retrieved (i.e., process the file sequentially)
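
The four file operations can be illustrated with a minimal sketch that keeps the records in a sorted in-memory list. This is not a B-tree and not taken from the paper; the class name SortedFile and the cursor handling are assumptions made only for illustration.

```python
import bisect


class SortedFile:
    """Toy in-memory 'file' (hypothetical): records (key, info) kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted keys k_i
        self._infos = []   # associated information, parallel to _keys
        self._cursor = -1  # index of the record most recently retrieved

    def _locate(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        return i

    def insert(self, key, info):
        """Add a new record, checking that the key is unique."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            raise KeyError(f"duplicate key: {key!r}")
        self._keys.insert(i, key)
        self._infos.insert(i, info)

    def delete(self, key):
        """Remove the record with the given key."""
        i = self._locate(key)
        del self._keys[i]
        del self._infos[i]

    def find(self, key):
        """Retrieve the information stored under key."""
        self._cursor = self._locate(key)
        return self._infos[self._cursor]

    def next(self):
        """Retrieve the record that follows the one just retrieved."""
        # Simplification: the cursor is not adjusted by insert or delete.
        if self._cursor + 1 >= len(self._keys):
            raise IndexError("end of file")
        self._cursor += 1
        return self._infos[self._cursor]
```

Here insert and delete shift list elements, so they cost time proportional to the file size; the B-tree discussed in this presentation exists to avoid exactly that cost on secondary storage.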
3
Basic B-tree
History
In the late 1960s computer manufacturers and independent research groups competitively developed general-purpose file systems and so-called access methods for their machines. R. Bayer and E. McCreight, then at Boeing Scientific Research Labs, proposed an external index mechanism with relatively low cost for most of the file operations; they called it a B-tree.

Tree Search
In a binary search tree, the branch taken at a node depends on the outcome of a comparison of the query key with the key stored at the node. If the query is less than the stored key, the left branch is taken; if it is greater, the right branch is followed (Fig 1).

Fig 1: A search tree (nodes 50, 10 and 70 are shown). The query key is 20; the search follows the path marked in red.
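
The comparison-driven descent described under Tree Search can be sketched as follows. The node class and the example tree are assumptions for illustration; the example tree is not claimed to be the one drawn in Fig 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BSTNode:
    key: int
    left: Optional["BSTNode"] = None
    right: Optional["BSTNode"] = None


def bst_search(root, query):
    """Return (keys visited, found) for a search that starts at root."""
    path = []
    node = root
    while node is not None:
        path.append(node.key)
        if query == node.key:
            return path, True
        # Smaller queries go left, larger queries go right.
        node = node.left if query < node.key else node.right
    return path, False


# Hypothetical tree: 50 at the root, 10 (with right child 20) and 70 below it.
tree = BSTNode(50, BSTNode(10, right=BSTNode(20)), BSTNode(70))
print(bst_search(tree, 20))  # ([50, 10, 20], True)
```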
4
A B-tree of order d contains at most 2d keys and 2d + 1 pointers in each node (Fig 2). The number of keys may vary from node to node, but each node (except the root) must have at least d keys and d + 1 pointers. As a result, each node is at least half full. In the usual implementation a node forms one record of the index file, has a fixed length capable of accommodating 2d keys and 2d + 1 pointers, and contains additional information telling how many keys currently reside in the node.

Balancing
The beauty of B-trees lies in the methods for inserting and deleting records, which always leave the tree balanced. Fig 3 shows the shape of a B-tree of order d indexing a file of n records.

Fig 2: A node in a B-tree of order d with 2d keys (key 1, key 2, ..., key 2d) and 2d + 1 pointers.
Fig 3: The shape of a B-tree of order d indexing n records. The height h is about $\log_d n$ and all leaves lie at the same depth.
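
The occupancy rules for a node of order d (between d and 2d keys, and one more pointer than keys) can be written down as a small check. This is a sketch under assumptions: the class BTreeNode and the function check_node are hypothetical names, nodes are held in memory rather than as index-file records, and the root is allowed to hold as few as one key.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BTreeNode:
    keys: List[int] = field(default_factory=list)
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf

    @property
    def is_leaf(self):
        return not self.children


def check_node(node, d, is_root=False):
    """Return True if node satisfies the order-d occupancy rules."""
    n = len(node.keys)
    min_keys = 1 if is_root else d  # assumption: only the root may drop below d keys
    if not (min_keys <= n <= 2 * d):
        return False
    if node.keys != sorted(node.keys):
        return False
    # An internal node holding n keys carries n + 1 pointers.
    return node.is_leaf or len(node.children) == n + 1
```

With d = 2, a non-root node holding one key or five keys fails the check, while a node with two keys and three children passes.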
5
The longest path in a B-tree of n keys contains about $\log_d n$ nodes, d being the order of the B-tree. A find operation may visit n nodes in an unbalanced tree indexing a file of n records, but in a B-tree it never visits more than $1 + \log_d n$ nodes. Balancing matters because each visit requires a secondary storage access, so it offers large potential savings. The B-tree balancing scheme also restricts changes in the tree to a single path from a leaf to the root, so it cannot introduce runaway overhead. Balancing does cost extra storage, because nodes may be only half full, but since secondary storage is inexpensive compared to retrieval time, the trade is advantageous.

Insertion
Insertion is a two-step procedure. First, a find proceeds from the root to locate the proper leaf for insertion. Then the insertion is performed, and balance is restored by a procedure that moves from the leaf back toward the root. If the key is to be inserted in a node that is already full, a split occurs.
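
A split of an overfull leaf can be sketched on plain key lists. The sketch assumes the usual rule for a node of order d: of the 2d + 1 keys, the smallest d stay in one node, the largest d go to a new node, and the middle key is promoted to the parent as a separator. The function names are hypothetical, and pointers and parent updates are left out.

```python
import bisect


def split_keys(keys, d):
    """Split 2d + 1 sorted keys into (left, separator, right)."""
    if len(keys) != 2 * d + 1:
        raise ValueError("expected exactly 2d + 1 keys")
    return keys[:d], keys[d], keys[d + 1:]


def insert_into_leaf(leaf_keys, new_key, d):
    """Insert new_key into a sorted leaf.

    Returns (keys, None, []) when the leaf still fits in 2d keys,
    otherwise (left, separator, right) after a split.
    """
    keys = list(leaf_keys)
    bisect.insort(keys, new_key)
    if len(keys) <= 2 * d:
        return keys, None, []
    return split_keys(keys, d)


print(insert_into_leaf([10, 20, 30, 40], 25, d=2))  # ([10, 20], 25, [30, 40])
```

In a full implementation the returned separator would be inserted into the parent by the same routine, which is how a split can propagate upward.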
6
The split places the smallest d of the 2d + 1 keys in one node and the largest d in the other node; the remaining (middle) key is promoted to the parent node, where it serves as a separator. Usually the parent node can accommodate an additional key and the insertion process terminates. If the parent node is also full, it splits in the same way; in the worst case splitting propagates all the way to the root and the tree increases in height by one level. The height of a B-tree increases only by a split at the root.

Deletion
Deletion also uses a find operation to locate the key, which leads to two possibilities: the key resides in a leaf node or in a nonleaf node. A nonleaf deletion requires an adjacent key, which is found by searching for the leftmost leaf in the right subtree of the now-empty slot; that key is swapped into the vacated position so that find continues to work correctly. Next, the leaf must be checked to see that at least d keys remain. If fewer than d keys remain, an underflow is said to occur and redistribution becomes necessary. Redistribution of keys between two neighbors is possible only if there are at least 2d keys to distribute. If not, a concatenation must occur. Concatenation is the inverse of splitting. Here the keys are combined into one of the
7
nodes and the other is discarded. Concatenation may propagate to the next higher level, all the way up to the root, in which case the height of the B-tree decreases. Fig 4 depicts the process.

Cost of Operation
This covers the retrieval, insertion and deletion costs. For retrieval, each node except the root has at least d direct descendants, since there are between d and 2d keys per node; the root has at least 2 descendants. All leaves lie at the same depth h, so below the root there are at least $2\sum_{i=0}^{h-1} d^i = 2(d^h - 1)/(d - 1)$ nodes with at least d keys each. The height of a tree with n total keys is therefore constrained so that $2d(d^h - 1)/(d - 1) < n$, which can be shown to give $2d^h < n + 1$, or $h < \log_d((n + 1)/2)$.

Insertion and Deletion Costs
In a B-tree of order d for a file of n records, insertion and deletion take time proportional to $\log_d n$ in the worst case. Having a large number of keys in a node is advantageous because, as the branching factor d increases, the costs of the find, insert and delete operations decrease.
Fig 4: (a) A deletion causing concatenation and (b) the rebalanced tree.
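
The choice between redistribution and concatenation after an underflow can be sketched on key lists. The sketch assumes two adjacent siblings and the separator key between them in the parent; the function name rebalance is hypothetical, pointers are omitted, and the example values are not claimed to match Fig 4.

```python
def rebalance(left, sep, right, d):
    """Repair an underflow between two adjacent siblings of a B-tree of order d.

    left and right are the siblings' sorted key lists and sep is the
    separator key between them in the parent. Returns
    (new_left, new_sep, new_right); after a concatenation new_sep is None
    and new_right is empty, so the parent loses one key.
    """
    combined = left + [sep] + right
    if len(left) + len(right) >= 2 * d:
        # Redistribution: enough keys for both siblings to hold at least d.
        mid = len(combined) // 2
        return combined[:mid], combined[mid], combined[mid + 1:]
    # Concatenation: everything, including the separator, fits in one node.
    return combined, None, []


print(rebalance([10], 15, [20, 27, 30], d=2))  # ([10, 15], 20, [27, 30])
print(rebalance([10], 15, [20, 27], d=2))      # ([10, 15, 20, 27], None, [])
```

Because concatenation removes a key from the parent, the parent may underflow in turn, which is how the process can continue toward the root.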
8
However, there are practical limits on the size of a node. The cost of each access includes a constant factor that grows with the amount of data transferred and is device dependent. Hence the optimum node size depends on the characteristics of the system and the devices on which the file is allocated. This completes the discussion of the basic B-tree; the next section describes B-tree variants, their advantages and implementation techniques.
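
To make the effect of the order d on cost concrete, the height bound from the cost discussion, $2d^h < n + 1$, can be evaluated for a few values of d. This is only a numerical illustration; the function name max_height and the file size of one million records are assumptions.

```python
def max_height(n, d):
    """Largest integer h with 2 * d**h < n + 1, i.e. h < log_d((n + 1) / 2)."""
    h = 0
    while 2 * d ** (h + 1) < n + 1:
        h += 1
    return h


for d in (2, 10, 50, 100):
    print(d, max_height(1_000_000, d))
```

For one million keys the bound falls from 18 at d = 2 to 2 at d = 100, which is why large nodes pay off until the transfer cost per node starts to dominate.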
9
B-Tree variants
The variants differ from the basic B-tree in their approach. Bayer and McCreight suggested that underflow and overflow, resulting from a deletion and an insertion respectively, can often be handled without concatenation or splitting by redistributing keys among neighboring nodes, thus avoiding the associated overhead. Several other variations concentrate on improving secondary costs, on index creation for a file, on varying the order at each depth, and so on.

B*-Trees
In a B*-tree each node is at least 2/3 full (instead of 1/2 full). Insertion employs a local redistribution scheme to delay splitting until 2 sibling nodes are full. Then the 2 nodes are divided into 3, each 2/3 full. This guarantees that storage utilization is at least 66%, while requiring only moderate adjustment of the maintenance algorithms. Increasing storage utilization has the side effect of speeding up the search, since the height of the resulting tree is smaller.

B+-Trees
In a B+-tree all keys reside in the leaves. The upper levels consist only of an index, a roadmap that enables rapid location of the keys.
10
The leaf nodes are usually linked together left-to-right, as shown in Fig 5. Searching proceeds from the root through the index to a leaf. Since all keys reside in the leaves, it does not matter what values are encountered in the index as long as the path leads to the correct leaf. Deletion is simplified because non-key values may be left in the index part as separators. Insertion and find operations are similar to those for B-trees. The advantage is that at most 1 access is required to satisfy a next operation. Also, during sequential processing of a file no node is accessed more than once, so space for only 1 node need be available in main memory. This makes B+-trees well suited to applications that entail both random and sequential processing.

Prefix B-trees
This technique uses a short prefix (in the best case just the beginning letter) as the separator value in the index between the keys. For keys such as Computer and Electronic, one-letter separators such as C and E are sufficient, which saves space. The technique does not help for keys such as Programmer and Programmers, which differ only in the last letter.
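
One way to pick a short separator is to take the shortest prefix of the larger key that still sorts after the smaller key. This rule is an assumption chosen for illustration; the source does not spell out how separators are selected, and the function name is hypothetical.

```python
def shortest_separator(left, right):
    """Return the shortest prefix s of right with left < s <= right."""
    if not left < right:
        raise ValueError("keys must be in strictly increasing order")
    for i in range(1, len(right) + 1):
        prefix = right[:i]
        if prefix > left:
            return prefix
    # Not reached: the full key right is itself greater than left.
    raise AssertionError("no separator found")


print(shortest_separator("Computer", "Electronic"))     # E
print(shortest_separator("Programmer", "Programmers"))  # Programmers
```

The second call shows the unfavorable case: every proper prefix of Programmers sorts at or before Programmer, so the whole key has to be stored.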

Fig 5: A B+-tree with separate index and key parts. Random searches enter through the index; sequential searches run along the linked keys (leaves).
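
Sequential processing over linked leaves can be sketched as follows. The Leaf class and helper names are assumptions; the index levels are omitted because the scan never touches them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None  # link to the right neighbor


def link_leaves(key_groups):
    """Build a left-to-right chain of leaves and return the first one."""
    leaves = [Leaf(list(group)) for group in key_groups]
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves[0] if leaves else None


def scan(first_leaf):
    """Yield every key in order, visiting each leaf exactly once."""
    leaf = first_leaf
    while leaf is not None:
        yield from leaf.keys
        leaf = leaf.next


print(list(scan(link_leaves([[1, 2], [5, 8], [13]]))))  # [1, 2, 5, 8, 13]
```

Only the current leaf is needed at any moment, which matches the observation that space for a single node suffices during sequential processing.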
11
Virtual B-Trees
The concept of demand paging is used here. By careful allocation, each node of the B-tree can be mapped into one page of the virtual address space, and the tree is then treated as though it were in main memory. Access to a node not in memory causes the system to page in the node from secondary storage. The most active nodes are those close to the root, and they tend to stay in memory. The advantages are that special hardware performs transfers at high speed, the memory protection mechanism isolates other users, and frequently accessed parts of the tree remain in memory.

Compression
Wagner described implementation techniques such as compressed keys and compressed pointers. Pointers can be compressed using a base/displacement form of node address rather than an absolute address value, as shown in Fig 6. To reconstruct an actual pointer value, the base is added to the displacement for that pointer. This is appropriate for virtual B-trees, where pointers take on large address values.
Fig 6: A node with compressed pointers, laid out as base, offset 0, key 1, offset 1, key 2, offset 2, ..., offset 2d-1, key 2d, offset 2d. To obtain the i-th pointer, the base value is added to the i-th offset.
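
Base/displacement pointer compression can be sketched in a few lines. Choosing the smallest pointer in the node as the base is an assumption made here so that all offsets are non-negative; the source does not say how the base is chosen, and the function names are hypothetical.

```python
def compress_pointers(pointers):
    """Return (base, offsets) such that base + offsets[i] == pointers[i]."""
    if not pointers:
        raise ValueError("a node needs at least one pointer")
    base = min(pointers)  # assumption: smallest address in the node is the base
    return base, [p - base for p in pointers]


def pointer(base, offsets, i):
    """Reconstruct the i-th pointer from the base and its offset."""
    return base + offsets[i]


base, offsets = compress_pointers([1_000_512, 1_000_000, 1_004_096])
print(base, offsets)              # 1000000 [512, 0, 4096]
print(pointer(base, offsets, 2))  # 1004096
```

The saving comes from storing each offset in fewer bits than a full address; the price is one addition per pointer lookup.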
12
Both key and pointer compression increase the capacity of each node and hence decrease retrieval costs. The tradeoff for fewer secondary storage accesses is an increase in the CPU time necessary to search a node after it has been read.

Binary B-trees
Bayer proposed binary B-trees, which make B-trees suitable for a one-level store. A binary B-tree is a B-tree of order 1: each node has 1 or 2 keys and 2 or 3 pointers. To avoid wasting space for nodes that are only half full, a linked representation is used (Fig 7). Analysis shows that insertion, deletion and find take on the order of $\log n$ steps, and that searching the rightmost path may require twice as many nodes to be accessed as searching the leftmost. To maintain logarithmic cost, two right links in a row should never point to sibling nodes.
Fig 7: Nodes in a B-tree and the corresponding nodes in a binary B-tree: (a) a node with two keys, 27 and 32; (b) a node with one key, 27. Each right pointer in the binary B-tree representation can point to a sibling or a descendant.
13
2-3 Trees
A 2-3 tree is a B-tree of order 1. Hopcroft developed the notion of the 2-3 tree and explored its usefulness in a one-level store. Each node in a 2-3 tree has 2 or 3 sons. Analyses of 2-3 trees have used the number of comparisons and the number of node accesses as cost criteria.

B-Trees in a Multiuser Environment
The use of B-trees in a database system must permit several user requests to be processed simultaneously, so synchronization becomes a very important issue. Bayer and Schkolnick tackled it by showing that a set of locking protocols can ensure the integrity of B-tree accesses while allowing concurrent activity. A find locks a node once it has been read so that other processes cannot interfere with it. As the search progresses, it releases its lock on the ancestor, allowing others to read it.

Updating in a concurrent environment presents a more complex problem, one that requires more complex protocols. It is handled by reservation. Once an update process has established reservations on a path leading to some leaf, it may convert the reservations to absolute locks, top-down. The update then proceeds, changing only nodes on which it holds absolute locks. Once finished, it releases its locks and the updated path becomes available to other processes.

Deciding how much of the path to reserve was an open issue. One proposed solution provided a parameterized model and showed how reservations can permit enough concurrency to utilize present technology while wasting very little time on restarting reservations. Another solution eliminated the need for all but the simplest protocols: since updates never need to travel back up the tree, only one pair of nodes is ever locked at a given time.
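
The way a find holds a lock on the current node and releases the ancestor as it descends can be sketched with ordinary mutexes. This is a simplification and not the Bayer and Schkolnick protocol itself: it uses exclusive locks from Python's threading module where a real system would use shared read locks, it does not model reservations, and the Node class is hypothetical.

```python
import bisect
import threading
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    keys: List[int]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    lock: threading.Lock = field(default_factory=threading.Lock)


def find(root, key):
    """Search for key, locking the child before releasing its parent."""
    node = root
    node.lock.acquire()
    try:
        while True:
            if key in node.keys:
                return True
            if not node.children:
                return False
            # Index of the subtree whose key range contains the query.
            child = node.children[bisect.bisect_left(node.keys, key)]
            child.lock.acquire()   # take the child first ...
            node.lock.release()    # ... then let go of the ancestor
            node = child
    finally:
        node.lock.release()        # release whichever node is still held


root = Node([50], [Node([10, 20]), Node([70, 90])])
print(find(root, 20), find(root, 60))  # True False
```

At most two nodes are locked at any moment, so other searches can enter the upper levels as soon as this one has moved down.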
14
Security
Bayer and Metzger proposed encipherment schemes to protect the contents of a B-tree. These schemes have relatively high cost unless implemented in hardware. The changes to the B-tree maintenance algorithms are minor, especially if the encipherment can be done on the fly during data transmission.

VSAM: A general purpose access method using B+-Trees
VSAM is IBM's general-purpose B-tree-based access method. It is designed to support sequential searching as well as logarithmic-cost insertion, deletion and find operations. B+-trees offer dynamic allocation and release of storage, guarantee storage utilization of at least 50%, and eliminate the need for periodic reorganization of the entire file.
Fig 8: A VSAM file with actual data stored in the leaves. The index (a B-tree) sits above the sequence set, which points to the control intervals holding the actual data.
15
Performance Enhancements
The file organization must be matched to the storage devices if transactions are to be conducted efficiently. The maximum size of a control interval is limited by the largest unit of data that the hardware can transfer in one operation. In addition, the set of all control intervals associated with one sequence set node must fit on one cylinder of the particular disk storage unit used to store the file. These restrictions improve performance and permit even further enhancements. Replication reduces disk seek time. VSAM attempts to improve performance in several other ways: pointers are compressed using the base/displacement method described above, and keys are compressed in both the forward and backward directions. Finally, VSAM allows the index part to be a virtual B-tree, using the virtual memory hardware to retrieve it.

Tree-Structured File Directory
The novel idea in VSAM is that one data format should be used throughout the system. All VSAM files are entered in a master catalog, so a file can be located given its name. If several processes access the master catalog simultaneously, contention occurs and all but one will have to wait. To avoid lengthy delays, each user can define a local catalog with entries for his VSAM files.
16
Once the user catalog has been located by searching the master catalog, further references to files indexed by that catalog do not entail searching the master catalog. The resulting multilevel, tree-structured catalog scheme has a flavor similar to the MULTICS file system.

Other VSAM Facilities
The VSAM files discussed above are called key-sequenced files. The other form is the entry-sequenced file, which allows sequential processing when no key accompanies a record. It requires no index and is hence less expensive to maintain. The distribution of free space within the file must also be decided. If many insertions are to be carried out, the file should not be loaded with each node 100% full, or the initial insertions will be expensive. On the other hand, loading nodes only 50% full wastes storage. Finally, VSAM supplies facilities for efficient insertion of a large contiguous set of records, protection of data, file backup and error recovery, all of which are necessary in a production environment.
17
Summary
The paper gives a good overview of the basic B-tree, its variants and various implementation techniques. B-trees guarantee 50% storage utilization while allocating and releasing space as the file grows or shrinks. The variants retain the basic B-tree properties in terms of the cost of the operations, and add efficient sequential processing. The implementation techniques provide enhanced performance, generality, and the ability to use B-trees in a multi-user environment. The access method, IBM's VSAM, uses both the B-tree and the B+-tree and focuses on performance enhancements and protection of data using different available techniques.
