Title: Auto administration of databases based on clustering
1Auto administration of databases based on
clustering
- Mujiba Zaman
- Jyotsna Surabattula
- Le Gruenwald
- School of Computer Science
- The University of Oklahoma
- Norman, Oklahoma, 73019, USA
- mujiba, jyotsna, ggruenwald_at_ou.edu
2Overview of the Presentation
- Motivation for auto-indexing
- Existing techniques
- Proposed Index Selection Technique
- Algorithm for Proposed Index Selection Technique
- Proposed Re-Indexing Technique
- Experiments
- Results
- Conclusions
- Future Work
3Motivation of Auto-indexing
- Index selection is an important part of physical
database design - For large databases it is difficult for DBAs to
analyze data and find an optimal set of indices - The goal of auto-indexing includes
- Analyze workload automatically to identify a good
set of indexes. - Create those indexes.
- Automatically evaluate current indexes and
re-index if necessary
4Existing Techniques
- The index selection problem (ISP) has been
approached in two different - ways to build index selection tools
- External Tools
- Some basic assumptions are made
- Cost functions are formulated based on the
assumptions - Attempt to minimize the cost function using
combinatorial optimization technique or by some
heuristic method - Optimizer Based Tools
- Utilize the query optimizer to give cost
estimates for various index configurations and
suggest a configuration with the least cost
estimation.
5Advantages and Disadvantages of existing
techniques
- External Tools
- Advantages
- Applications using the DBMS will not suffer
processing delays - Disadvantages
- Disconnected from the optimizer Index
suggested may not be used by the optimizer - Becomes obsolete if optimizer changes
6Advantages and Disadvantages of existing
techniques
- Optimizer Based Tools
- Advantages
- Suggested indexes will be used by the
optimizer - Disadvantages
- Performs expensive operation of optimizer
invocation - Longer processing time for other applications
using the DBMS when indexes are being
suggested.
7Proposed Index Selection Technique
-
- Combines both the approaches.
- Based on the intuition that the attributes
that occur more commonly and frequently in a
group of similar queries are likely to be useful
for indexing. - Use a Data Mining Clustering technique to
group queries which are similar in terms of their
use of attributes. - Extract Indexable attributes in all the
queries in each group as indexes. -
- These indexes can be single-column or
multi-column -
8Proposed Index Selection Technique
- For multi-column indexes, the order of the
columns is determined by assigning weights to
attributes based on where they are used in the
queries. - A clustered index is also chosen by assigning
weights to the attributes depending on where they
occur in the queries. - Extracted indexes are then submitted to the query
optimizer for final selection for the given
workload. - The indexes not selected by the optimizer are
eliminated. The remaining indexes are the final
indexes suggested by our tool. - Re-index if total table scan time using current
indexes is higher than that using new indexes.
9Algorithm for proposed technique
- 1. Input is a workload of queries
- 2. Parse the workload to find out all the
indexable attributes - and build two matrix
- - Query-Attribute matrix (1 presence of an
attribute 0 otherwise) - Attribute-frequency matrix (frequency of the
attribute indicated by the number) - Consider the following queries
- Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T4,
where T1.AT4.K and T1.CT3.H and T1.Blt30 group
by T1.C - Q2. Select Ave(G) from T3 where F like this
and G between 10 and 20 order by G
10Algorithm for proposed technique
- A corresponding query attribute matrix with
several other queries could be as follows - Table 1 Query Attribute Matrix
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 1 0 0 0 0 1 1
Q2 0 0 0 0 0 1 1 0 0
Q3 0 0 0 0 1 1 1 0 0
Q4 1 1 1 1 0 0 0 1 1
Q5 0 0 1 0 0 1 1 0 0
11Algorithm for proposed technique
- A corresponding attribute frequency matrix could
be as follows - Table 2 Attribute Frequency Matrix
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
12Candidate Index Selection
Table size of T1 5000 Table size of T2
2000 Table size of T3 100 Table size of T4
20 T table size/100 Let threshold1 5,
threshold2 30
3. Choose candidate index set Freq gt
threshold1 OR Freq T gt threshold2
T table size/constant
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
FreqT 150 300 350 60 40 9 6 5 1
Table 3. Attribute-frequency matrix
13Non-Clustered Index Selection
- 4 a. Assign weights to the candidate attribute
set as follows - Where Clause 3
- Group by/Order by 2
- Aggregate functions 1
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 3 3 321 0 0 0 0
Q2 0 0 0 0 0 3 321
Q3 0 0 0 0 6 5 4
Q4 2 5 1 3 0 0 0
Q5 0 0 3 0 0 5 2
Total Weight 5 8 10 3 6 13 12
Table 4. Candidate Attribute Set
14Non-Clustered Index Selection
- 4 b. Order the columns in the candidate attribute
set in each table in descending order of their
weights for determining non-clustered indexes.
Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 6 3 3 0 0 0 0
Q2 0 0 0 0 0 3 6
Q3 0 0 0 6 0 5 4
Q4 1 5 2 0 3 0 0
Q5 3 0 0 0 0 5 2
Total Weight 10 8 5 6 3 13 12
Table 5. Ordered Candidate Attribute Set
15Non-Clustered Index Selection
- 4 c. Build a Query-Attribute Matrix with the
ordered columns
Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 1 1 1 0 0 0 0
Q2 0 0 0 0 0 1 1
Q3 0 0 0 1 0 1 1
Q4 1 1 1 0 1 0 0
Q5 1 0 0 0 0 1 1
Total Weight 10 8 5 6 3 13 12
Table 6. Query Attribute Matrix with ordered
candidate index set
16Non-Clustered Index Selection
- 5. Use a data mining clustering technique on the
Query-Attribute matrix to group similar queries.
A possible clustering result is - Q1 Q4 Q2 Q3 Q5 (Table 6)
- 6. Extract the index sets for each table from the
clusters obtained. - These are all the common attributes from all
the queries - clustered together.
- For the cluster Q1, Q4 Indexes are (T1.C,
T1.B, T1.A) in that order - For the cluster Q2 Q3 Q5 Indexes are (T3.F,
T3.G) in that order
17Clustered Index Selection
- 6. Choose single column clustered index as
follows - a. During parsing assign the following weight to
the attributes - Range queries 2 Join 1 Group
by/Order by 1 - b. Rank the attributes according to their weight
-
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 1 2 1 0 0 0 0
Q2 0 0 0 0 0 0 21
Q3 0 0 0 0 1 1 1
Q4 1 3 0 2 0 0 0
Q5 0 0 1 0 0 1 1
Total Weight 2 5 2 2 1 2 5
Rank 1 2 1 2 1 1 2
Table 7. Ranking according to weight for
clustered index
18Clustered Index Selection
- c. Determine the selectivity of columns and rank
them according to the selectivity for each table.
T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Selectivity 1 0.85 0.9 0.88 0.55 0.65 0.82
Rank N/A 1 2 2 1 1 2
Table 8. Ranking according to selectivity for
clustered index
19Clustered Index Selection
- d. Find the sum of rank with selectivity and
rank with weight for each - column.The indexable attribute with the
largest sum is suggested as - clustered index for the table.
- If more than one indexable attributes have
the highest sum then - select the attribute with higher rank in
weight.
Column name Rank with selectivity Rank with weight Sum
T1.A N/A 1 N/A
T1.B 1 2 3
T1.C 2 1 3
T2.D 2 2 4
T2.E 1 1 2
T3.F 1 1 2
T3.G 2 2 4
Table 9. Sum of Ranks with Weight and
Selectivity for Clustered Index
20Algorithm for proposed technique
- 7. Provide both clustered and non-
- clustered indexes to the query optimizer
- 8. Let the optimizer display the estimated
- execution plan to execute the workload
- in the database.
- 9. Select the indexes used by the
- optimizer as the final suggested index.
21Proposed Re-indexing Technique
- An auto-indexing tool should be capable of
re-indexing whenever the current indexes are no
longer good - The DBMS can periodically monitor the cost of
total table scan for a particular size of
workload - When this value exceeds a limit the DBMS triggers
the index selection tool to suggest new set of
indexes. - The limit to trigger the tool can be determined
from the relationship between increase in table
scan cost and performance gain due to
re-indexing. This can be set by the DBA.
22Proposed Re-indexing Technique
- Produce a Chart showing Re-indexing performance
gain vs. Table scan cost gt Guildelines for DBA
by doing the following - Assume the current workload is Wc, current index
set is INc, compute total current table scan cost
TSc. - Obtain a set of different workload samples over a
long history W2, W3,.. Wn. - Compute total table scan cost for each of
workload sample using the current index set
TS2c, TS3c,, TSnc.
23Proposed Re-Indexing Technique (Cont.)
- Run our index selection tool on W2, W3, , Wn to
get the corresponding recommended sets of
indexes IN2r, IN3r,, INnr. - Compute total table scan cost for each workload
using its new recommended set of indexes TS2r,
TS3r,, TSnr. - Compute the percentage of performance
improvement due to reindexing for each workload
Wj (TSc TSjr)/TSc) 100 for j 2,.., n.
24Proposed Re-indexing Technique
25Proposed Re-indexing Technique
- If the DBA has the chart in advance, the DBA can
set up the limit where he/she wants the DBMS to
trigger index selection tool. -
- The existing index set is then compared with the
new index set - Indexes which are part of new but not part of
existing set are created, those which are part of
existing set and not in new set are dropped and
those which intersect remain - The process of dropping and creating indexes in
the system follows similar methodology as
Oracles Automated Index-Rebuild System which can
be done either online or offline.
26Experiments
- Performance Metric
- average query response time
- time taken to execute the workload in minutes
divided by the total number of queries in the
workload - All experiments are conducted on the system Intel
Pentium 4-M, CPU 2.0GHz, 512 MB RAM. - Experiments conducted on TPC-R benchmark with its
22 read-only queries - Experiments conducted on Microsoft SQL Server 2000
27Experiments
- Clustering algorithms used in experiment
- 1. MACQueens k means clustering algorithm
- Used hamming distance as the distance function
-
- 2. KEROUAC (knowledge explicit, rapid, off beat
and user-centered algorithm for clustering) - K-means is a well-established data clustering
algorithm - KEROUAC is a clustering algorithm for practical
advantage it doesnt require the final clusters
number setting - Both algorithms have low computational cost
28Results With K-means
Chart 1
29Results with KEROUAC
Chart 2
30Results with Index Suggestion Time
31Conclusions
- Performance of the tool critically depends on the
choice of ? (for KEROUAC) and k (for k-means) and
the threshold value. - Increasing ? (k) will group queries with higher
similarity to each other in a cluster. - Increasing ? (k) beyond some point has no effect
(all identical queries are already in the same
cluster). - We can therefore achieve desired results by
choosing ? (k) workload size. - Increasing threshold means more attributes
eliminated from consideration - Experiments show that a threshold value about
workloadsize/4 works good. - Our tool chooses these parameters to operate on
the best performance range.
32Conclusions
- We compared our results with Microsoft SQL
Servers Index Selection Tool (An optimizer Based
Tool) and also with Frequent Itemsets Mining (An
External Tool) - Best performance improvement using k Means
clustering compared with Frequent Itemsets Mining
is 71.43 - Best performance improvement using k Means
clustering compared with Microsoft Index
Selection Tool is 16.2 - Best performance improvement using KEROUAC
clustering compared with Frequent Itemsets Mining
is 73.26 - Best performance improvement using KEROUAC
clustering compared with Microsoft Index
Selection Tool is 21.5 - The index suggestion time for Microsoft Index
Selection Tool was 4 times higher than our tool
for a workload size of 240
33Future Work
- Test the dependence of the technique on different
clustering algorithms - Test with different sizes of workload
- Test with update queries in the workload
- Test with index elimination technique
34 35Indexable Attributes
- Indexable Attributes/Columns
- Columns which belong to WHERE, GROUP BY, ORDER BY
clauses - Operators , lt, gt, lt, gt, BETWEEN, IN
- Example
- SELECT
- FROM table1,table2
- WHERE table1.column1 table2.column1
- AND (table2.column2 BETWEEN 0 AND 1000)
- the indexable attributes are
- table1.column1, table2.column1, table2.column2
36Selectivity of a column
- Selectivity
- Selectivity ratio of a column/index
- number of unique values in a column/index of
the table divided by total number of rows in that
table - SELECT COUNT (DISTINCT (column name))
- FROM table name
- If a column/index has high selectivity then it is
more useful to the optimizer and has more chances
to be picked up by the optimizer while executing
a query.
37Multicolumn Indexes
- Multicolumn indexes
- Column involved in multicolumn index should be
joined with AND clause and not with OR clause - Order of the columns in a multicolumn index is
important. Order should be based on selectivity
and also the first ordered column should be the
most used column in queries. - An index (a, b, c) is used by queries involving
a, b, c both a and b or a but not in any other
combinations - Example
- An index (major, minor) is suitable for the
following query - SELECT name
- FROM test
- WHERE major constant
- AND minor constant
38Frequent Itemset Mining Technique
- Input is a workload of queries
- Extract indexable attributes.
- Create a query attribute matrix.
-
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Q1 SELECT FROM T1, T2 WHERE A BETWEEN 1 AND
10 AND C D Q2 SELECT FROM T1, T2 WHERE B
LIKE this AND C5 AND Elt100 Q3
SELECT FROM T1, T2 WHERE A30 AND Bgt3
GROUP BY C HAVING SUM(E)gt2
Table 1
Q4 SELECT FROM T1 WHERE Bgt2 AND E IN
(3,2,5) Q5 SELECT FROM T1, T2 WHERE A30 AND
Bgt3 GROUP BY C HAVING SUM(E)gt2 Q6 SELECT
FROM T1, T2 WHERE Bgt3 GROUP BY C HAVING
SUM(E)gt2
39Frequent Itemset Mining Technique
- A closed itemset is a maximal set of items
(attributes) that are common to a set of
transactions (queries) - The candidate indexes selected for minimal
support greater than or equal to 2/6 in the
example are as follows1 - (AC,3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 2
40SQL-Server Index Selection Tool
Figure 1
41SQL-Server Index Selection Tool
- Candidate index selection
- For a given workload W that consists of n
queries, n workloads Wi..Wn, each consisting of
one query are generated. - For each workload Wi, the set of indexable
columns of the query Ii is the starting candidate
indexes. - Let Ci be the configuration picked by the index
selection tool for Wi - The candidate index set for W is the union of all
Ci
42SQL-Server Index Selection Tool
- Configuration enumeration
- Problem
- Pick k indexes from a set of n candidate indexes.
- Algorithm
- 1 Let Sthe best m index configuration using the
naïve enumeration algorithm. If mk then exit - 2 Pick a new index I such that Cost (S U I) lt
Cost (S U I) for any choice of I ! I - 3 If Cost (S U I )gt Cost (S) then exit
- else SS U I
- 4 If S k then exit
- 5 Go to 2
43SQL-Server Index Selection Tool
- Cost Evaluation
- Reduce the number of invocations of the optimizer
by deriving costs from already evaluated costs. - A cost of a non atomic configuration can be
derived from atomic configuration - A configuration C is atomic for a workload if
for some query in the workload there is a
possible execution of a query in the workload by
the query engine that uses all indexes in C. - Not every atomic configuration needs to be
evaluated for every single query in the workload
44SQL-Server Index Selection Tool
- Multicolumn index generation
- For given K columns K! multicolumn indexes are
possible - Iterative approach
- First iteration single column indexes are
considered - Only the selected single column indexes are input
to the two column indexes in the first iteration - This set of two-column indexes along with single
column ones are input to the third iteration and
so on.
45Clustered Index
- Clustered indexes
- A page allocated to an index is called a data
page. - For tables having clustered index the data rows
of each data page are stored in order and the
data pages are linked together by doubly-linked
list. - For table having no clustered index the data rows
are not stored in any particular order.
Figure 2
46Clustered Index
Accessing data with a clustered index
Figure 3
47Criteria to choose clustered index
- Queries that return large result sets
- Columns with a number of duplicate values that
are searched frequently - Columns other than primary key that are
frequently used in join clauses - Columns searched within a range of values
- Columns used in ORDER BY or GROUP BY queries
48MACQueens k Means Clustering algorithm
- 1) First k data units are chosen as clusters of
one member each. - 2) Remaining m k data units are assigned to
the clusters whose centroid is nearest to the
data unit under consideration. Centroid is
recomputed after every gain in the cluster - 3) Iterate through the data set assigning each
data unit to its nearest cluster taking the
existing cluster centroids as fixed seed points
until a certain criteria is reached
49MACQueens k Means Clustering algorithm
- Hamming distance is the number of positions in
which two binary words differ - For k 3 Q1, Q2, Q3 are initial clusters
- Hamming distance of
- Q1 and Q4 is 5
- Q2 and Q4 is 1
- Q3 and Q4 is 2
- Clusters at step 2 are
- cluster1 Q1
- cluster2 Q2, Q4, Q6
- cluster3 Q3,Q5
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 3
50KEROUAC Clustering algorithm
KEROUAC Knowledge Explicit, Rapid, OFF beat and
User-centered Algorithm for Clustering New
Condorcet Criterion (NCC) NCC(Pz) ?
Sim(Ei,Ej) a x ? Dissim(Ei) NCC represents
the degree of dissimilarity of objects
belonging to the same cluster and the degree of
similarity between clusters. It is desired
to be minimum. a is called the granularity factor
51 KEROUAC Clustering algorithm
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 1 0 0 0
Q2 1 0 1 0 0
Q3 0 1 1 0 0
Q4 1 1 1 0 0
Q5 0 0 1 1 1
Q6 0 0 0 1 1
Table 4
- Calculate NCC for all the neighbors and pick the
neighbor with least NCC - Repeat until a reduction in NCC is not possible
Figure 4
52Proposed index selection algorithm
- Step 1 Input workload
- Step 2 Parse the workload to find indexable
attributes gt Query-attribute matrix - Step 3 Identify candidate indexable attributes
based on frequency of attributes in the workload
and threshold1 and threshold2 - Step 4 Compute the total weight of each
attribute using the clause weight assignment
policy and order the attributes in each table in
decreasing order of weights in order to identify
non-clustered indexes in Step 6 gt Ordered
query-attribute matrix. - Step 5 Perform a data mining clustering
technique on the ordered query-attribute matrix
to group similar queries based on their use of
attributes.
53Proposed Index Selection Algorithm (Cont.)
- Step 6 Extract index sets for each table from
the clustering results in Step5. - Step 7 Select a single attribute clustered index
for each table by computing the sum of range
querys weight and selectivity ranking for each
attribute and selecting the attribute with the
highest sum. - Step 8 Provide the query optimizer with the
virtual set of indexes chosen in - steps 5 and 7 (including both non-clustered and
clustered indexes). - Step 9 Let the optimizer display the estimated
execution plan to execute the workload. - Step 10 Select the indexes used by the optimizer
as the final suggested index.