Title: Unlocking the Mysteries Behind Update Statistics
1Unlocking the Mysteries Behind Update Statistics
2The Dice Problem
- Throw dice, how many will be 1?
3Questions about the Dice
- How many dice are you throwing?
- How many sides does each dice have?
- Are all the dice the same?
The better the information, the more accurate the
estimate.
4What does Update Statistics do?
- Collects information for the optimizer
- Statistics LOW
- Distributions MEDIUM
HIGH - Drop Distributions
- Compile stored procedures
5Statistics Collected
- systables
- systables
- syscolumns
- syscolumns
- sysindexes
- sysindexes
- Number of Rows
- Number of pages to store the data
- Second largest value for a column
- Second smallest value for a column
- of unique values for the lead key
- How highly clustered the values for the lead key
6Update Statistics LowBasic Algorithm
- Walk the leaf pages in each index
- Submit btree cleaner requests when deleted items
are found causing re-balancing of indexes - Collects the following information
- Number of unique items
- Number of leave pages
- How clustered the data is
- Second highest and lowest value
7How to Read Distributions
of rows represented in this bin
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
of unique values
Highest Value in this bin
- To get the range of values look at the highest
value in the previous bin.
The value
of rows for this value
8Example - Approximating a Value
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
- There are 868317 rows containing a value between
-1 and 75 - There are 70 unique values in this range
- The optimizer will deduce 868317 / 70 12,404
records for each value between -1 and 75
9Example - Dealing with Data Skew
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
- Data skew
- For the value 43 how many records will the
optimizer estimate will exist? - Answer 779848 values
- Any value that exceeds 25 of the bin size will
be placed in an overflow bin
10Basic Algorithmfor Medium and High
- Develop scan plan based on available resources
- Scan table
- High All rows
- Medium Sample of rows
- Sort each column
- Build distributions
- Begin transaction
- Delete old columns distributions
- Insert new columns distributions
- Commit transaction
11Scan
- The table is scanned in its entirety for update
stats high, while it is only sampled for update
stats medium (see Sample Size) - The reading of rows is done in dirty read
isolation, regardless of what the user has set
for their transaction level.
12Scan
- This scan of the table may occur several times
depending on the amount of sort memory available
and the number of columns to collect statistics
about. - The approximate number of table scans is defined
by the (size of the data to sort) / (amount of
sort memory)
13Sort
- The rows processed by the scan phase are passed
directly to the sort package. - Each column in the row for which statistics are
being generated is passed to a unique invocation
of a sort.
14Build
- After the sort is completed we read the sorted
column data finding out the number of duplicates
and unique values creating approximately 200
range bins by default. - Any count of a duplicates value that exceeds 25
the size of a bin will be placed in an overflow
bin.
15Insert
- Now we have to delete the old distributions and
insert the new distributions. As long as the user
was not in a transaction this will be done as its
own transaction. This transaction will last for
less than 1 second and will hold NO locks on the
tables, but locks on the system catalogs while
the update occurs.
16Sample Size
- HIGH
- The entire tables is scanned and all rows are
used. - Medium
- Misconception about the number of rows sampled is
based on the number of rows in the table, this is
incorrect. - The number of samples depends on the Confidence
and Resolution. See the following chart.
17Update Statistics Medium Sample Size
18Update Statistics Medium Memory Requirements
19Update Statistics High Memory Requirements
- In memory sort
- Approximate Memory number of rows sum(column
widths 2 sizeof(pointer) )
20Memory Rules
- Estimated Update Stats memory is below 100MB
- Hard coded limit of 4MB
- Attempts to minimize the scans by fitting as many
columns into 4MB - Estimated Update Stats memory is above 100MB
- Memory is requested from MGM
- Attempt to minimize the scans by fitting as many
columns in the MGM memory
21Examples
- Customer Table
- Cust_id integer
- Fname char(50)
- Lname char(50)
- Address1 char(200)
- Address2 char(200)
- State char(2)
- zipcode integer
- Number of Rows 500,000
22ExamplesMemory for Incore Sort
23ExamplesNumber of Table Scans
24Confidence
- A factor in the number of samples used by update
statistics medium
25Resolution
- Percentage of data that is represented in a
distribution bin - Example
- 100,000 rows in the table
- Resolution of 2
- Each bin will represent 2,000 rows
26Improvements in update statistics in 7.31.UD2
- UPDATE STATISTICS CAN NOT ALLOCATE SORT MEMORY
BETWEEN 4 AND 100 MB - The default has been raised from 4MB to 15MB
- User can now configure the amount of memory
- UPDATE STATISTICS USES SLOW SCANNING TECHNOLOGY
WHEN SCANNING A TABLE -- ENABLE LIGHT SCANS - Implemented light scans
- Set oriented reads
27Improvements in update statistics in 7.31.UD2
- THE PLAN UPDATE STATISTICS MAKE WHEN SCANNING IS
NOT VIEWABLE BY THE DBA - Set explain will now print the scan path and
resource usage - UPDATE STATISTICS LOW ON FRAGMENT INDEXES RUNS
SERIALLY AND VERY SLOW - With PDQ turned on each index fragment will be
scanned in parallel - PDQ at 1 means 10 of the index fragments scanned
in parallel, while PDQ at 10 means all the index
fragments will be scanned in parallel
28Improvements in update statistics in 7.31.UD2
- ERROR 126 WHEN EXECUTING UPDATE STATISTICS AND
STORED PROCECURE (ALSO ERRORS 312/100) - Errors when trying to insert the distributions
because set lock mode to wait was not handled
properly inside update statistics - SCANNING AN INDEX WHICH IS FRAGMENT IS SLOW DUE
TO THE INEFFICIENT MERGE IN THE FRAGMENT MANAGER - Binary search used instead of the previous nest
loop merge when ordering index fragments - Most noticeable when the number of fragments in
an index is large
29Tuning with the New Statistics
- Turn on PDQ when running update statistics, but
only for tables - Avoid PDQ when updating statistics for procedures
- When running high or medium increase the memory
update statistics has to work with - Enable parallel sorting (i.e. PSORT_NPROCS)
30Considerations
- Change the RESOLUTION to 1.5
- Increasing the number of bins for the
distributions - Increasing the sample size for update statistics
medium
31Example
- Following Example
- Table size 215,000 rows
- Row size 445 bytes
- Uniprocessor
32Example of the current update statistics
- Table jmiller.t9
- Mode HIGH
- Number of Bins 267 Bin size 1082
- Sort data 101.4 MB
- Sort memory granted 4.0 MB
- Estimated number of table scans 10
- PASS 1 c9
- PASS 2 c5
- PASS 3 c7
- PASS 4 c6
- ..
- PASS 10 c4
- Completed pass 1 in 0 minutes 24 seconds
- Completed pass 2 in 0 minutes 20 seconds
- Completed pass 3 in 0 minutes 17 seconds
- Completed pass 4 in 0 minutes 17 seconds
- Completed pass 5 in 0 minutes 17 seconds
- Completed pass 6 in 0 minutes 15 seconds
- Completed pass 7 in 0 minutes 14 seconds
- Completed pass 8 in 0 minutes 15 seconds
- Completed pass 9 in 0 minutes 16 seconds
- Completed pass 10 in 0 minutes 14 seconds
Total Time 146 seconds
33The new Defaults in 7.31.UD2
Table jmiller.t9 Mode
HIGH Number of Bins 267 Bin size 1082 Sort
data 101.4 MB Sort memory granted
15.0 MB Estimated number of table scans
7 PASS 1 c9,c8,c10,c5,c7 PASS 2 c6,c1 PASS 3
c3 PASS 4 c2 PASS 5 c4
- Completed pass 1 in 0 minutes 34 seconds
- Completed pass 2 in 0 minutes 19 seconds
- Completed pass 3 in 0 minutes 16 seconds
- Completed pass 4 in 0 minutes 14 seconds
- Completed pass 5 in 0 minutes 15 seconds
Total Time 98 seconds
New Memory Default
34Enabling PDQ with update statistics
- Table jmiller.t9
- Mode HIGH
- Number of Bins 267 Bin size 1082
- Sort data 101.4 MB
- PDQ memory granted 106.5 MB
- Estimated number of table scans 1
- PASS 1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
- Index scans disabled
- Light scans enabled
- Completed pass 1 in 0 minutes 29 seconds
PDQ Memory
Features Enabled
Total Time 29 seconds