CS5226

About This Presentation

Title:

CS5226

Description:

Can often hold top levels in buffer pool: 3 - Index Tuning. 21. A Note ... Observe `toss' of index entry (on right), and `pull down' of index entry (below). 30 ... – PowerPoint PPT presentation

Number of Views:21

Avg rating:3.0/5.0

Slides: 91

Provided by: bco90

Category:

more less

Transcript and Presenter's Notes

Title: CS5226

1
CS5226

Indexing and Index Tuning

2
When Change is the Only Constant

CPU
Memory Speed and Size
Harddisk Speed and Size
Bandwidth

3
Moores Law being proved...
4
Improvement in Performance
CPU (60/yr)
10000
1000
100
DRAM (10/yr)
10
Disk (5/yr)
1
1980
2000
5
Indexing

Single dimensional Indexing
Multi-dimensional Indexing
High-dimensional Indexing
Indexing for advanced applications

6
Single Record and Range Searches

Single record retrievals
Find student name whose matric 921000Y13
Range queries
Find all students with cap gt 3.0
Sequentially scanning the file is costly
If data is in sorted file, do binary search to
find first such student, then scan to find
others.
cost of binary search can still be quite high.

7
Indexes

An index on a file speeds up selections on the
search key fields for the index.
Any subset of the fields of a relation can be the
search key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a
relation).
e.g., consider Student(matric, name, addr,
cap), the key is matric, but the search key can
be matric, name, addr, cap or any combination of
them.

8
Simple Index File (Data File Sorted)
Sequential File
Dense Index
9
Simple Index File (Cont)
Sequential File
Sparse Index
10
Simple Index File (Cont)
Sequential File
Sparse 2nd level
11
Secondary indexes
Sequence field

Sparse index

12
Secondary indexes
Sequence field

Dense index

13
Conventional indexes

Advantages
- Simple
- Index is sequential file
good for scans

Disadvantages - Inserts expensive,
and/or - Lose sequentiality balance
14
Example
Index(sequential)
10
20
30
continuous
40
50
60
free space
70
80
90
15
Tree-Structured Indexing

Tree-structured indexing techniques support both
range searches and equality searches
index file may still be quite large. But we can
apply the idea repeatedly!

Data pages
16
B Tree The Most Widely Used Index

Height-balanced.
Insert/delete at log F N cost (F fanout, N
leaf pages)
Grow and shrink dynamically.
Minimum 50 occupancy (except for root).
Each node contains d lt m lt 2d entries. The
parameter d is called the order of the tree.
next-leaf-pointer to chain up the leaf nodes.
Data entries at leaf are sorted.

17
Example B Tree

Each node can hold 4 entries (order 2)

Root
17
24
30
13
5
2
3
39
24
27
38
7
5
8
19
20
22
29
33
34
14
16
18
Node structure

Non-leaf nodes

index entry
P
K
P
K
P
P
K
m
0
1
2
1
m
2

Leaf nodes

P
K
P
K
P
P
K
Next leaf node
m
0
1
2
1
m
2
19
Searching in B Tree

Search begins at root, and key comparisons direct
it to a leaf (as in ISAM).
Search for 5, 15, all data entries gt 24 ...

Root
30
17
24
13
39
3
5
19
20
22
24
27
38
2
14
16
29
33
34

Based on the search for 15, we know it is not
in the tree!

20
B-Tree Scalability

Typical order 100. Typical fill-factor 67.
average fanout 133
Typical capacities (root at Level 1, and has 133
entries)
Level 5 1334 312,900,700 records
Level 4 1333 2,352,637 records
Can often hold top levels in buffer pool
Level 1 1 page 8 Kbytes
Level 2 133 pages 1 Mbyte
Level 3 17,689 pages 133 MBytes

21
A Note on Order

Order (d) concept replaced by physical space
criterion in practice (at least half-full).
Index pages can typically hold many more entries
than leaf pages.
Variable sized records and search keys mean
different nodes will contain different numbers of
entries.
Even with fixed length fields, multiple records
with the same search key value (duplicates) can
lead to variable-sized data entries

22
Inserting a Data Entry into a B Tree

Find correct leaf L.
Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2)
Redistribute entries evenly, copy up middle key.
Insert index entry pointing to L2 into parent of
L.
This can happen recursively
To split index node, redistribute entries evenly,
but push up middle key. (Contrast with leaf
splits.)
Splits grow tree root split increases height.
Tree growth gets wider or one level taller at
top.

23
Inserting 7 8 into Example B Tree
Root
30
17
24
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
(Note that 5 is copied up and
continues to appear in the leaf.)

Observe how minimum occupancy is guaranteed in
both leaf and index pg splits.

30
5
17
24
13
3
5
2
7
8
24
Insertion (Cont)
(Note that 17 is pushed up and only
appears once in the index. Contrast
this with a leaf split.)

Note difference between copy-up and push-up be
sure you understand the reasons for this.

17
5
24
30
13
30
5
17
24
13
3
5
2
7
8
25
Example B Tree After Inserting 8
Root
17
24
30
13
5
2
3
39
19
20
22
24
27
29
33
34
38
7
5
8
14
16

Notice that root was split, leading to increase
in height.

In this example, we can avoid splitting by
re-distributing entries however,
this is usually not done in practice. Why?

26
Deleting a Data Entry from a B Tree

Start at root, find leaf L where entry belongs.
Remove the entry.
If L is at least half-full, done!
If L has only d-1 entries,
Try to re-distribute, borrowing from sibling
(adjacent node with same parent as L).
If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to
L or sibling) from parent of L.
Merge could propagate to root, decreasing height.

27
Example Tree After (Inserting 8, Then) Deleting 19
Root
17
24
30
13
5
39
2
3
24
27
38
7
5
8
29
33
34
14
16
20
22

Deleting 19 is easy.

28
Example Tree After Deleting 20 ...
Root
17
27
30
13
5
2
3
39
33
34
38
7
5
8
22
24
27
29
14
16

Deleting 20 is done with re-distribution. Notice
how middle key is copied up.

29
... And Then Deleting 24

Must merge.
Observe toss of index entry (on right), and
pull down of index entry (below).

30
39
22
27
38
29
33
34
Root
13
5
30
17
3
39
2
7
22
38
5
8
14
16
27
29
33
34
30
Example of Non-leaf Re-distribution (Delete 24)
22
Root
30
27
13
5
17
20
39
38
33
34
27
29
22
24
17
18
7
5
8
3
2
20
21
14
16

In contrast to previous example, can
re-distribute entry from left child of root to
right child.

22
Root
30
13
5
17
20
31
After Re-distribution

Intuitively, entries are re-distributed by
pushing through the splitting entry in the
parent node.
It suffices to re-distribute index entry with key
20 weve re-distributed 17 as well for
illustration.

Root
17
30
22
13
5
20
39
7
5
8
2
3
38
17
18
33
34
22
27
29
20
21
14
16
32
Index Classification

Primary vs. secondary If search key contains
primary key, then called primary index.
Unique index Search key contains a candidate
key.
Clustered vs. unclustered If order of data
records is the same as, or close to, order of
data entries, then called clustered index.
A file can be clustered on at most one search
key.
Cost of retrieving data records through index
varies greatly based on whether index is
clustered or not!

33
Clustered vs. Unclustered Index

Suppose the data file is unsorted.
To build clustered index, first sort the data
file (with some free space on each page for
future inserts).
Overflow pages may be needed for inserts. (Thus,
order of data recs is close to, but not
identical to, the sort order.)

Index entries
UNCLUSTERED
CLUSTERED
direct search for
data entries
Data entries
Data entries
(Index File)
(Data file)
Data Records
Data Records
34
Index Classification (Cont.)

Dense vs. Sparse If there is at least one data
entry per search key value (in some data
record), then dense.
Every sparse index is clustered!
Sparse indexes are smaller.

Ashby, 25, 3000
22
Basu, 33, 4003
25
Bristow, 30, 2007
30
Ashby
33
Cass, 50, 5004
Cass
Smith
Daniels, 22, 6003
40
Jones, 40, 6003
44
44
Smith, 44, 3000
50
Tracy, 44, 5004
Sparse Index
Dense Index
on
on
Data File
Name
Age
35
Index Classification (Cont.)

Composite Search Keys Search on a combination of
fields.
Equality query Every field value is equal to a
constant value. E.g. wrt ltsal,agegt index
age20 sal 75
Range query Some field value is not a constant.
E.g.
age 20 or age20 sal gt 10
Data entries in index sorted by search key to
support range queries.
Lexicographic order, or
Spatial order.

Examples of composite key indexes using
lexicographic order.
11,80
11
12
12,10
name
age
sal
12,20
12
bob
10
12
13,75
13
cal
80
11
ltage, salgt
ltagegt
joe
12
20
sue
13
75
10,12
10
20
20,12
Data records sorted by name
75,13
75
80,11
80
ltsal, agegt
ltsalgt
Data entries in index sorted by ltsal,agegt
Data entries sorted by ltsalgt
36
Summary

Tree-structured indexes are ideal for
range-searches, also good for equality searches.
B tree is a dynamic structure.
Inserts/deletes leave tree height-balanced
log F N cost.
High fanout (F) means depth rarely more than 3 or
4.
Almost always better than maintaining a sorted
file.

37
Summary (Cont.)

Typically, 67 occupancy on average.
Usually preferable to ISAM, modulo locking
considerations adjusts to growth gracefully.
If data entries are data records, splits can
change rids!
Indexes can be classified as clustered vs.
unclustered, primary vs. secondary, and dense vs.
sparse, simple vs composite

38
New Database Challenges

More Complex applications. Eg. GIS, OLAP, Mobile
Hardware Advances Big and Small
Web and Internet. Eg XML

39
New Index?

A most effective mechanism to prune the search
Order of magnitude of difference between I/O and
CPU cost
Increasing data size
Increasing complexity of data and search

40
Something that Transcends TimeB-tree
41
Success Factors

Robustness
Concurrency
Performance
Scalability

Fundamentals of Building DBMS
42
B-trees forever?

Can the B-tree being a single-dimensional index
be used for emerging applications such as
Spatial databases
High-dimensional databases
Temporal databases
Main memory databases
String databases
Genomic/sequence databases
...

43
CS5226

Multi-Dimensional Indexing

44
What is a Spatial Database?

A Spatial DBMS is a DBMS
It offers spatial data types/data models/ query
language
Support spatial properties/operations
It supports spatial data types in its
implementation
Support spatial indexing, algorithms for spatial
selection and join based on spatial relationships

45
Applications

Geographical Information Systems (GIS) dealing
extensively with spatial data. Eg. Map system,
resource management systems
Computer-aided design and manufacturing
(CAD/CAM) dealing mainly with surface data. Eg.
design systems.
Multimedia databases storing and manipulating
characteristics of MM objects.

46
Spatial Data

Examples of non-spatial data
Names, zip-codes
Examples of Spatial data
Census Data
NASA satellites imagery
Weather and climate Data
Rivers, farms, ecological impact
Medical Imaging

47
Spatial Databases

Spatial Objects
Points spatial location eg. feature vectors
Lines set of points eg. roads, coastal line
Polygons set of points eg. Buildings, lakes
Data Types
Point a spatial data object with no extension
no size or volume
Regiona spatial object with a location and a
boundary that defines the extension

48
Spatial Relationships

Topological relationships
adjacent, within/contain, intersect, disjoint,
etc
Direction relationships
Above, below, north-of, south-of,etc
Metric relationships
distance lt 100 km
And operations to express the relationships

49
Spatial Queries

Range queries Find all cities within 50 km of
Madras?
Nearest neighbor queries Find the 5 cities that
are nearest to Madras?
Find the 10 images most similar to this image?
Spatial join queries Find pairs of cities
within 200 km of each other?

50
More Examples

Window Range Query Find me data points that
satisfy the conditions x1 ltA1 lt y1, x2 ltA2
lty2?
Spatial Query Find me buildings that are
adjacent to the Railway Stations?
Nearest Neighbour Query Find me the nearest
fire station to Clementi Ave. 3?

51
Spatial Representation

Raster model
Vector model

52
Representation of Spatial Objects

Testing on real objects is expensive
Minimum Bounding Box/Rectangle
How to test if 2-d rectangles intersect?

y2
y1
x2
x1
testing
representation
53
Query Operation Spatial Index

Filter Step
Select the objects whose mbr satisfies the
spatial predicate
Traverse the index and apply the spatial test on
the mbrs indexed by the index
Output set of oids (including negatives)
Refinement Step
Spatial test is done on the actual geometries of
objects whose mbr satisfied the filter step
(output)
Costly operation
Executed only on a limited number of objects

54
Why spatial index methods (SAMs)?

B-tree hash tables
Guarantee the number of I/O operations is
respectively logarithmic and constant with
respect to the collections size
Index a collection on a key
Rely on a total order on the key domain, the
order of natural numbers, or the lexicographic
order on strings
There is no such total order for geometric
objects with spatial extent
SAMs were designed to try as much as possible to
preserve spatial object proximity

55
Approaches to the Design of SAMs

Space-Based structures
Partition the embedding Space into rectangular
cells
Independent from the distribution of the objects
Objects are mapped to the cells based on some
geometric criterion
Eg Grid file, Buddy-tree, KDB-tree
Data-Based structures
Organize by partitioning the set of objects based
on spatial proximity such that each group can be
fit into a page, as opposed to the embedding
space
Adapt to the objects distribution in the
embedding space
Eg. R-tree, R tree, R tree
Mapping

56
The R-tree

A leaf entry is a pair (mbr, oid)
A non-leaf node contains an array of node entries
The number of entries is between m (fill-factor)
and M
For each entry (mbr, nodeid) in a non-leaf node
N, mbr is the directory rectangle of a child node
of N, whose page address is nodeid
All leaves are at the same level
An object appears in one, and only one of the
tree leaves

57
R-trees
A
B
Height balanced tree
A
Problem Overlap of covering rectangles.
B
58
Insertion in the R-Tree

Algorithm ChooseSubtree
CS1 Initialize Set N to be the root node
CS2 Leaf check
If N is a leaf,
return N
else
Choose subtree
Choose the entry in N whose rectangle needs
least area enlargement to include the new data.
Resolve ties by choosing the entry with the
rectangle of smallest area
end
CS3 Descend until a leaf is reached
Set N to be the childnode pointed to by the
childpointer of the chosen entry. Repeat from
CS2

59
Splitting Strategies in the R-Tree

Three versions
all are designed to minimize area covered by two
covering rectangles resulting from split
Exponential
find the area with global minimum
CPU cost is too high
Quadratic and Linear
find approximation
Quadratic performs much better than linear

60
Splitting Strategies in the R-Tree

Algorithm QuadraticSplit
Divide a set of M1 index entries into two
groups
QS1 Pick first entry for each group
Invoke PickSeeds to choose two entries, each be
first entry of each group
QS2 Check if done
Repeat
DistributeEntry
until
all entries are distributed or one of the two
groups has M?m1 entries (so that the other
group has m entries)
QS3 Select entry to assign
If entries remain, assign them to the other
group so that it has the minimum number m
required

61
Splitting in the R-Tree

Algorithm PickSeeds
Choose two entries to be the first entries of
the groups
PS1 Calculate inefficiency of grouping entries
together
For each pair of entries E1 and E2, compose a
rectangle R including E1 rectangle and E2
rectangle
Calculate d area(R) - area(E1 rectangle) -
area(E2 rectangle)
PS2 Choose the most wasteful pair
Choose the pair with the largest d
the seeds will tend to be small, if the
rectangles are of very different size (and) or
the overlap between them is high

62
Splitting in the R-Tree

Algorithm DistributeEntry
Assign the remaining entries by the criterion of
minimum area
DE1 Invoke PickNext to choose the next entry to
be assigned
DE2 Add It to the group whose covering rectangle
will have to be enlarged least to accommodate
It. Resolve ties by adding the entry to the
group with the smallest area, then to the one
with the fewer entries, then to either
Algorithm PickNext
chooses the entry with best area-goodness-value
in every situation
DE1 For each entry E not yet in a group,
calculate d1 the area increase required in the
covering rectangle of Group 1 to include E
Rectangle. Calculate d2 analogously for Group 2
DE2 Choose the entry with the maximum difference
between d1 and d2

63
Node Splitting R-trees
64
Node Splitting R-trees
65
R-trees

Range Query
Insert
Node splitting
Optimization
Coverage
Overlap
Delete
Variants R-tree
R-tree, buddy-tree

66
The R-Tree

A variant of R-Tree
Several improvements to the insertion algorithm
Aim at optimizing
Node overlapping
Area covered by a node
Perimeter of a nodes directory rectangle
Given a fixed area, the shape that minimizes the
rectangles perimeter is the square
Two variants that bring the most significant
improvement
Split Algorithm
Forced Reinsertion Strategy

67
The R Tree

The directory rectangles at a given level do not
overlap
For a point query, a single path is followed from
the root to a leaf for a region query, subtrees
whose covering mbr intersecting the query region
is traversed
The I/O complexity is bounded by the depth of the
tree
Dead space problem

68
Index Tuning

Index issues
Indexes may be better or worse than scans
Multi-table joins that run on for hours, because
the wrong indexes are defined
Concurrency control bottlenecks
Indexes that are maintained and never used

69
Information about indexes...

Application codes
VSQLAREA -- look for the one with high of
executions
INDEX_STATS meta information about indexes
HASH_AREA_SIZE
HASH_MULTIBLOCK_IO_COUNT
home work

70
Clustered / Non clustered index

Clustered index (primary index)
A clustered index on attribute X co-locates
records whose X values are near to one another.

Non-clustered index (secondary index)
A non clustered index does not constrain table
organization.
There might be several non-clustered indexes per
table.

Records
Records
71
Dense / Sparse Index

Sparse index
Pointers are associated to pages

Dense index
Pointers are associated to records
Non clustered indexes are dense

P1
Pi
P2
record
record
record
72
Index Implementations in some major DBMS

SQL Server
B-Tree data structure
Clustered indexes are sparse
Indexes maintained as updates/insertions/deletes
are performed
DB2
B-Tree data structure, spatial extender for
R-tree
Clustered indexes are dense
Explicit command for index reorganization

Oracle
B-tree, hash, bitmap, spatial extender for
R-Tree
clustered index
Index organized table (unique/clustered)
Clusters used when creating tables.
TimesTen (Main-memory DBMS)
T-tree

73
Types of Queries

Point QuerySELECT balanceFROM accountsWHERE
number 1023
Multipoint QuerySELECT balanceFROM
accountsWHERE branchnum 100

Range QuerySELECT numberFROM accountsWHERE
balance gt 10000 and balance lt 20000
Prefix Match QuerySELECT FROM employeesWHERE
name J

74
More Types of Queries

Extremal QuerySELECT FROM accountsWHERE
balance max(select balance from accounts)
Ordering QuerySELECT FROM accountsORDER BY
balance

Grouping QuerySELECT branchnum,
avg(balance)FROM accountsGROUP BY branchnum
Join QuerySELECT distinct branch.adresseFROM
accounts, branchWHERE accounts.branchnum
branch.numberand accounts.balance gt 10000

75
Index Tuning -- data

Settings
employees(ssnum, name, lat, long, hundreds1,
hundreds2)
clustered index c on employees(hundreds1) with
fillfactor 100
nonclustered index nc on employees (hundreds2)
index nc3 on employees (ssnum, name, hundreds2)
index nc4 on employees (lat, ssnum, name)
1000000 rows Cold buffer
Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID
controller from Adaptec (80Mb), 4x18Gb drives
(10000RPM), Windows 2000.

76
Index Tuning -- operations

Operations
Update update employees set name XXX where
ssnum ?
Insert insert into employees values
(1003505,'polo94064',97.48,84.03,4700.55,3987.2)
Multipoint query select from employees where
hundreds1 ? select from employees where
hundreds2 ?
Covered query select ssnum, name, lat from
employees
Range Query select from employees where long
between ? and ?
Point Query select from employees where ssnum
?

77
Clustered Index

Multipoint query that returns 100 records out of
1000000.
Cold buffer
Clustered index is twice as fast as non-clustered
index and orders of magnitude faster than a scan.

78
Positive Points of Clustering indexes

If the index is sparse, it has less points --less
I/Os
Good for multipoint queries
eg. Looking up names in telephone dir.
Good for equijoin. Why?
Good for range, prefix match, and ordering queries

79
Index Face Lifts

Index is created with fillfactor 100.
Insertions cause page splits and extra I/O for
each query
Maintenance consists in dropping and recreating
the index
With maintenance performance is constant while
performance degrades significantly if no
maintenance is performed.

80
Index Maintenance

In Oracle, clustered index are approximated by an
index defined on a clustered table
No automatic physical reorganization
Index defined with pctfree 0
Overflow pages cause performance degradation

81
Covering Index - defined

Select name from employee where department
marketing
Good covering index would be on (department,
name)
Index on (name, department) less useful.
Index on department alone moderately useful.

82
Covering Index - impact

Covering index performs better than clustering
index when first attributes of index are in the
where clause and last attributes in the select.
When attributes are not in order then performance
is much worse.

83
Positive/negative points of non-clustering indexes

Eliminate the need to access the underlying table
eg. Index on (A, B, C)
Select B,C From R Where A5.
Good if each query retrieves significantly fewer
records than there are pages in DB
May not be good for multipoint queries

84
Examples

Table T has 50-bytes records and attribute A has
20 different values which are uniformly
distributed. Page size4K. Is a nonclustering
index on A any good?
Now the record size is 2Kbytes.

85
Scan Can Sometimes Win

IBM DB2 v7.1 on Windows 2000
Range Query
If a query retrieves 10 of the records or more,
scanning is often better than using a
non-clustering non-covering index. Crossover gt
10 when records are large or table is fragmented
on disk scan cost increases.

86
Index on Small Tables

Small table 100 records, i.e., a few pages.
Two concurrent processes perform updates (each
process works for 10ms before it commits)
No index the table is scanned for each update.
No concurrent updates.
A clustered index allows to take advantage of row
locking.

87
Bitmap vs. Hash vs. B-Tree

Settings
employees(ssnum, name, lat, long, hundreds1,
hundreds2)
create cluster c_hundreds (hundreds2 number(8))
PCTFREE 0
create cluster c_ssnum(ssnum integer) PCTFREE 0
size 60
create cluster c_hundreds(hundreds2 number(8))
PCTFREE 0 HASHKEYS 1000 size 600
create cluster c_ssnum(ssnum integer) PCTFREE 0
HASHKEYS 1000000 SIZE 60
create bitmap index b on employees (hundreds2)
create bitmap index b2 on employees (ssnum)
1000000 rows Cold buffer
Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID
controller from Adaptec (80Mb), 4x18Gb drives
(10000RPM), Windows 2000.