1
THE STORK CORRELATION: USE AND ABUSE OF STATISTICS
IN CAPACITY PLANNING
  • Denise P. Kalm
  • R&D Sr. Product Specialist
  • BMC Software, Inc.

2
  • Statistical analysis: Mysterious, sometimes
    bizarre manipulations performed upon the
    collected data of an experiment in order to
    obscure the fact that the results have no
    generalizable meaning for humanity. Commonly,
    computers are used, lending an aura of unreality
    to the proceedings.

3
Agenda
  • Why a stork?
  • Tools of the trade
  • Getting the terminology right
  • My favorite statistics
  • Lies, damned lies and telling your manager what
    he wants to know
  • Summary

4
The Stork Correlation
  • In a small Welsh town, there was a .95
    correlation between the arrival of storks and the
    arrival of babies.
  • Why?

5
The Stork Correlation
  • There was also a 1.0 correlation between the
    dates fishermen were home from the sea and the
    likely dates of conception.

6
Statistical Abuse
  • Correlation is the most misused statistic
  • Standard deviation and mean rank 2nd and 3rd
  • Ignorance of statistics leads to career-limiting
    recommendations
  • Statistics can be your best friend, once you
    understand them
  • But it is more art than science.

7
Why Me?
  • Background
  • Training
  • Experience

8
Why You?
  • Determining the significance of changes
  • Not confusing correlation with cause-and-effect
  • Saving time on problem resolution
  • Theory testing

9
Big Caveats
  • Only use statistics with like-minded individuals.
    Managers typically only understand average and
    percentiles.
  • When statistics don't appear to be working for
    you, check out statistics that do not require the
    assumption of normality.

10
Tools of the Trade
  • SAS/SAS Graph
  • SPSS
  • Statistical calculator
  • Excel
  • Brute force with the equations (not recommended,
    but possible)

11
Definitions
  • Sample and population
  • Normality
  • Outlier
  • Mean, median, mode and percentile
  • Standard deviation and variance
  • Misc. terms

12
  • Statistics is a systematic method for getting
    the wrong conclusion with 95% confidence.

13
Sample and Population
  • Population: all the data for the period of time
    studied, i.e., every RMF/SMF data record for an
    hour
  • Sample: a random selection of all data
    points/records available.

14
Normal Distribution
  • A distribution which describes many situations
    where observations are distributed symmetrically
    around the mean. 68% of all values under the
    curve lie within one standard deviation of the
    mean and 95% lie within two standard deviations.
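
The one- and two-standard-deviation shares can be checked numerically. This is a minimal sketch using Python's standard library; the helper name share_within is an illustrative assumption, not part of the original deck.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

within_one = share_within(1)  # roughly 0.68
within_two = share_within(2)  # roughly 0.95

print(f"within 1 SD: {within_one:.4f}")
print(f"within 2 SD: {within_two:.4f}")
```

The figures only hold for normally distributed data, so they are a reason to plot performance data before quoting them.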

15
Central Limit Theorem
  • As sample size increases, the distribution of the
    sample mean approaches a normal distribution, where
    the mean equals the mean of the population and the
    standard deviation equals the standard deviation
    of the population divided by the square root of
    the sample size.
  • More samples, better data.
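
A small simulation can make the theorem concrete. This sketch assumes a made-up, deliberately non-normal population (exponentially distributed "response times") and compares the spread of sample means with the predicted value of population SD divided by the square root of the sample size. All names and values are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, skewed population of response times (not real measurements).
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = mean(population)
pop_sd = stdev(population)

def sample_means(n, trials=1_000):
    """Means of `trials` random samples, each containing n data points."""
    return [mean(random.sample(population, n)) for _ in range(trials)]

for n in (4, 25, 100):
    means = sample_means(n)
    predicted_sd = pop_sd / n ** 0.5
    print(f"n={n:3d}  mean of means={mean(means):.3f}  "
          f"SD of means={stdev(means):.3f}  predicted={predicted_sd:.3f}")
```

The mean of the sample means should stay close to the population mean while their spread shrinks as n grows.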

16
Formulas
  • $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
  • where µ is the mean, σ is the standard
    deviation, e is the base of the natural
    logarithm, sometimes called Euler's e (2.71...),
    and π is the constant Pi (3.14...)

17

18
Outlier
  • Outlier - A point that, because of observation
    noise, does not follow the characteristics of the
    input (or desired response) data.

19
  • There are liars, outliers, and out-and-out
    liars.

20
Mean
  • Arithmetic Mean: numeric average of all the
    data.
  • $X = (x_1 + x_2 + x_3 + \dots + x_n) / n$, where n is the number of data points
  • Assumes normality
  • Affected by outliers
  • Plot data to understand

21
  • Plot to see meaning of mean
  • [Figure: plot with the Median/Mode and the Mean marked]
22
Median and Mode
  • Median: middle value, where half the values lie
    on each side of the median, when they are ordered
    by value.
  • Mode: most frequently observed value.
  • If no repeats, there is no mode value.
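
To see how one outlier pulls the mean away from the median and mode, here is a short sketch with made-up response times. The data values are hypothetical.

```python
from statistics import mean, median, multimode

# Hypothetical response times in seconds; the last value is an outlier.
response_times = [0.2, 0.3, 0.3, 0.4, 0.5, 9.0]

avg = mean(response_times)         # dragged upward by the 9.0 outlier
mid = median(response_times)       # middle of the ordered values
modes = multimode(response_times)  # list of the most frequent value(s)

print(f"mean={avg:.2f}  median={mid:.2f}  mode(s)={modes}")

# Note: with no repeated values, multimode returns every value,
# which in practice means there is no useful mode.
```

Here the mean lands well above every "typical" value, while the median and mode still describe what most transactions saw.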

23
Percentile and Percentage Change
  • Percentile: group data by putting equal number
    of data points into each group. Ex. 95th
    percentile: 95% of values are less than x.
  • Percentage Change
  • (after value - before value) / before value
  • Risk of using percentage change
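
Both calculations are short enough to sketch. The data and the function name percentage_change are hypothetical; the second half illustrates the risk: a tiny absolute change can produce a dramatic percentage when the starting value is small.

```python
from statistics import quantiles

def percentage_change(before, after):
    """(after value - before value) / before value, expressed in percent."""
    return (after - before) / before * 100.0

# Hypothetical response times of 1..100 ms.
response_times = [float(ms) for ms in range(1, 101)]

# quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
p95 = quantiles(response_times, n=100)[94]

rt_change = percentage_change(0.2, 0.1)   # response time 0.2 s -> 0.1 s
cpu_change = percentage_change(2.0, 4.0)  # CPU busy 2% -> 4%

print(f"95th percentile: {p95:.2f} ms")
print(f"response time change: {rt_change:.0f}%")
print(f"CPU demand change: {cpu_change:.0f}%")
```

A 100% increase sounds alarming, yet in this made-up case it is only two additional percentage points of CPU.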

24
Standard Deviation and Variance
  • Standard Deviation: square root of the variance.
    For normal data, about 2/3 of the data points are
    within 1 SD of the mean on either side.
  • Variance: amount of spread of the data around
    the mean
  • $S^2 = \left((x_1 - X)^2 + (x_2 - X)^2 + \dots + (x_n - X)^2\right) / (n - 1)$
  • where X is the mean, x1 ... xn are the data
    points, and n is the number of samples
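
The variance formula translates directly into code. This sketch uses made-up CPU-busy percentages and cross-checks the hand-written version against the standard library; the function name sample_variance is an illustrative assumption.

```python
from math import sqrt
from statistics import variance

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Hypothetical CPU-busy percentages for five intervals.
cpu_busy = [42.0, 45.0, 47.0, 51.0, 55.0]

s2 = sample_variance(cpu_busy)
sd = sqrt(s2)  # standard deviation is the square root of the variance

print(f"variance={s2:.2f}  SD={sd:.2f}")
print(f"library variance={variance(cpu_busy):.2f}")
```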

25
Standard Deviation of a Sample
  • If the SD is large, you need to inspect your
    sampling method. This may indicate suspect data,
    poor interval choices, etc.

26

27
My Favorite Statistics
  • Linear Regression
  • Correlation

28
  • In ancient times, they had no statistics, so
    they had to fall back on lies.
  • - Stephen B. Leacock

29
Linear Regression
  • Linear Regression: describing the relationship
    between two data elements, by fitting a straight
    line to the data.
  • Ex. X = transaction rate
  • Y = CPU utilization
  • Y = bX + C, where X and Y are the variables, b is
    the slope of the line and C is the point where
    the line intercepts the y-axis.
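
As a minimal sketch of fitting the line, the following uses ordinary least squares, which is one common fitting method (the deck does not say which method its tools use). The transaction rates and CPU figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = b*x + c; returns (b, c)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    c = y_bar - b * x_bar  # y-axis intercept
    return b, c

# Hypothetical measurements: transactions per second vs. CPU utilization (%).
tran_rate = [10, 20, 30, 40, 50]
cpu_util = [12.0, 21.0, 33.0, 41.0, 52.0]

b, c = fit_line(tran_rate, cpu_util)
print(f"CPU% = {b:.2f} * rate + {c:.2f}")
```

In practice a statistics package would do this work; the point of the sketch is only to show what the slope and intercept mean.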

30
Linear Regression

31
Good Candidate for Regression

32
Bad Candidate for Regression

33
Gotchas
  • Make sure relating the variables makes sense.
  • Plot data when not sure of the relationship
    (scatter plot)
  • Do not throw out outliers until you are sure of
    why they occurred
  • Do not commit linear progression

34
Correlation
  • Correlation coefficient (r): measures the
    degree of relationship (and direction) between two
    variables. r = 1.00 indicates a perfect positive
    correlation; r = 0.0 means there is no linear
    relationship at all. A negative r means
    that as one variable increases, the other
    decreases. R2 (r squared) is never negative, so
    it shows strength but not direction.
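
A minimal sketch of the correlation coefficient, assuming the usual Pearson definition, shows how the sign carries the direction. The function name pearson_r and the data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: one series rises with x, the other falls.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson_r(x, rising)     # perfect positive relationship
r_down = pearson_r(x, falling)  # perfect negative relationship

print(f"rising: r={r_up:.2f}  falling: r={r_down:.2f}")
```

Squaring either value gives the same number, which is why the squared form alone cannot tell you whether the variables move together or in opposite directions.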

35
  • Correlation is NOT cause and effect.
  • Though there may be a causal relationship between
    two variables, you cannot infer it from a
    correlation analysis.
  • A third factor may really be causing the
    correlation.
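
The third-factor effect is easy to reproduce in a simulation. In this sketch a hidden variable (called season here, purely as an assumption) drives both stork sightings and births. Neither causes the other, yet the two correlate strongly. All numbers are invented.

```python
import random
from math import sqrt

random.seed(7)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hidden third factor (hypothetical): a seasonal effect between 0 and 1.
season = [random.random() for _ in range(200)]

# Both series depend on the season plus independent noise, not on each other.
storks = [10 * s + random.gauss(0, 1.0) for s in season]
births = [5 * s + random.gauss(0, 0.5) for s in season]

r = pearson_r(storks, births)
print(f"storks vs. births: r={r:.2f}")
```

A correlation analysis on storks and births alone would never reveal the seasonal driver; only knowledge of the system does.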

36
  • Don't calculate it by hand; use a tool.
  • Use your brain to interpret the results.

37
  • A statistician is someone who is skilled at
    drawing a precise line from an unwarranted
    assumption to a foregone conclusion.

38
How to Lie With Statistics
39
Statisticulation
  • Statistics are like a bikini. What they reveal
    is suggestive, but what they conceal is vital.
  • - Aaron Levenstein

40
Why Lie?
  • Outliers make your data look bad
  • You are trying to comply with a performance
    clause
  • You are too busy writing the great American novel
    to do your job
  • Your manager wouldn't understand anyway

41
Averaging Averages
  • Why do it?
  • Most performance data is already averaged, so it
    is easier
  • Makes response times look better in most cases
  • Smooths out all variability
  • Mostly eliminates outliers, particularly in
    plotting data
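
A short sketch shows why averaging averages flatters response times. It assumes three hypothetical intervals, one busy and slow and two nearly idle and fast, and compares the plain average of the interval averages with the transaction-weighted average.

```python
# Hypothetical intervals: (transaction count, average response time in seconds).
intervals = [(1000, 2.0), (10, 0.2), (10, 0.2)]

# Averaging the averages treats every interval as equally important.
avg_of_avgs = sum(rt for _, rt in intervals) / len(intervals)

# Weighting by transaction count reflects what users actually experienced.
total_tx = sum(count for count, _ in intervals)
weighted_avg = sum(count * rt for count, rt in intervals) / total_tx

print(f"average of averages: {avg_of_avgs:.2f} s")
print(f"weighted average:    {weighted_avg:.2f} s")
```

Nearly all transactions ran in the slow interval, but the average of averages hides that behind the two quiet ones.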

42
Using Percentage Change
  • Why do it?
  • To exaggerate the benefit of a performance
    change.
  • Ex. RT decreased 50% going from 0.2 to 0.1.
  • To justify a processor upgrade
  • Ex. Doubling application volume will increase
    its CPU demand 100% (even when the CPU demand was
    very small)
  • To impress or terrify

43
Small Sample Size
  • Why do it?
  • SAS jobs run faster
  • Large, randomly obtained data doesn't give the
    right results; a small, selected window does
  • You don't really have any data and have to invent
    some

44
Stupid Graph Tricks
  • Why do it?
  • To make your data look better
  • How to do it
  • Log functions on one axis to diminish the
    impact of a change. Or just use different orders
    of magnitude for x and y axes
  • Select graph type (pie, line, stacked bar) which
    best misleads your audience
  • Eliminate actual metrics so you can draw the
    line to reflect your reality
  • Put time on the wrong axis
  • Eliminate all legends, data tables, etc.

45
Invalid Metrics
  • How to do it
  • Use your own definitions. Ex. Typical CICS tran =
    non-browsing, non-batch work
  • Use multiple decimal places to lend an air of
    precision to the data. Good with small or
    unreliable sample or poor capture ratio.
  • Compare apples to oranges. Ex. Compare
    performance after tuning using a period of low
    demand to compare to a before of high demand
  • Add percentage changes together. Ex. If volume
    changes cause a 10% inc. in DB2, a 15% inc. in
    CICS and a 20% increase in batch, that's 45%.
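
To show why percentage changes cannot simply be added, this sketch applies increases of 10%, 15% and 20% to hypothetical baseline CPU figures for three workloads. The baseline values are assumptions chosen only for illustration.

```python
# Hypothetical baseline CPU seconds per workload.
baseline = {"DB2": 100.0, "CICS": 200.0, "batch": 700.0}

# Percentage increase attributed to each workload.
increase_pct = {"DB2": 10.0, "CICS": 15.0, "batch": 20.0}

# The invalid metric: just add the percentages.
naive_sum = sum(increase_pct.values())

# The real overall change: added CPU divided by total baseline CPU.
added_cpu = sum(baseline[w] * increase_pct[w] / 100.0 for w in baseline)
overall_pct = added_cpu / sum(baseline.values()) * 100.0

print(f"sum of percentages: {naive_sum:.0f}%")
print(f"actual overall increase: {overall_pct:.0f}%")
```

The true overall figure is a weighted average of the individual changes, so it always falls between the smallest and largest of them.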

46
Correlation Abuse
  • How to do it
  • Select two metrics that aren't usually related
    (I/O response time and file size), draw a
    correlation and justify a memory upgrade.
  • Most people don't know performance metrics well
    enough to challenge you.

47
Another Common Lie
  • Linear progression: forecasting the line past
    the data points you have
  • Unless you are sure the relationship between two
    variables is linear, do not attempt this. Even
    mostly linear relationships (such as CPU vs.
    volume) may go non-linear at near-saturation.
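
The danger can be sketched with a simple model. Here the "true" response time is assumed to follow the single-server queueing approximation R = S / (1 - U). That formula is an assumption chosen for illustration and does not come from the deck. A straight line fitted to low-utilization measurements is then extended to 95% utilization.

```python
SERVICE_TIME = 0.1  # hypothetical seconds of service per transaction

def response_time(util):
    """Assumed model: single-server queueing approximation R = S / (1 - U)."""
    return SERVICE_TIME / (1.0 - util)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = b*x + c; returns (b, c)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return b, y_bar - b * x_bar

# "Measurements" taken only where the system is lightly loaded.
observed_util = [0.1, 0.2, 0.3, 0.4, 0.5]
observed_rt = [response_time(u) for u in observed_util]

b, c = fit_line(observed_util, observed_rt)
forecast = b * 0.95 + c       # straight line extended to 95% busy
actual = response_time(0.95)  # what the assumed model says happens

print(f"linear forecast at 95%: {forecast:.2f} s")
print(f"model value at 95%:     {actual:.2f} s")
```

The line fits the observed range reasonably well and still misses badly near saturation, which is the region a capacity forecast usually cares about.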

48
What Can Go Wrong
  • What you think might happen
  • What might really be happening

49
What We Didn't Cover
  • Hypothesis testing: valuable if you want to see
    how likely it is that your theory matches
    reality. Is the change in the data due to
    chance, or did you really make a difference?
  • Chi-square
  • T-test
  • When you don't have enough information about the
    data (population) or about cause-and-effect
    relationships

50
Summary
  • Turn data into information by applying statistics
    and your knowledge.
  • Practice safe performance analysis and protect
    your job.

CYA
51
  • Numbers are like people; torture them enough and
    they'll tell you anything.

52
References
  • Huff (illustrated by Geis): How to Lie with Statistics
  • Dixon and Massey: Introduction to Statistical
    Analysis
  • Gonick and Smith: The Cartoon Guide to Statistics
  • Sziede: Statistics for the Algebraically
    Challenged
  • Munoz: Sampling Issues in the Collection of
    Performance Data, CMG2002

53
  • Questions?
  • Denise P. Kalm
  • Denise_Kalm_at_bmc.com
  • BMC Software, Inc.