NA3873 Lecture 3: Descriptive Statistics: Measures of Location and Variability

Slides: 38
Provided by: tas88

1
NA387(3) Lecture 3: Descriptive Statistics:
Measures of Location and Variability
  • (Devore, Ch. 1.3-1.4)

2
Topics
  • Branches of Statistics (review from last lecture)
  • Conducting a Statistical Analysis
  • Measures of Location
  • Measures of Dispersion
  • Outliers and Box Plots

3
I. Branches of Statistics-Recap
  • 1. Descriptive Statistics
  • Summarize or describe important features in a
    data set without attempting to infer conclusions.
  • Describe data samples using terms such as
    X-bar (sample mean) and s (sample standard deviation).
  • These statistics are used to estimate the
    population mean (μ) and population standard deviation (σ).

4
I. Branches of Statistics-Recap
  • 2. Inferential Statistics
  • Use sample of data to draw conclusions (make
    inferences).
  • Example: Suppose you sample ten bottles from each
    of two different filling machines.
  • Machine A averages 12.10 oz and B averages 12.12
    oz.
  • Based on inferential statistics, you might
    conclude that the two machines are not significantly different.

5
Location and Dispersion
  • Most common descriptive statistics are related to
    either measuring location or dispersion
    (variation).
  • Location: central tendency
  • Dispersion: spread of distribution

6
Example
  • Classic example to demonstrate these concepts:
    Outcomes of Throwing Darts
  • On or Off Location
  • Low or High Dispersion

7
Lecture Exercise: Identify On/Off Target and
High/Low Dispersion for each
[Figure: four panels of x marks, labeled A through D]
A. __________
B. __________
C. __________
D. __________
8
II. Measures of Location
  • Mean
  • Median
  • Trimmed Mean

9
The Mean
The average (mean) of the n numbers $x_1, x_2, \ldots, x_n$ is the sample mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu$
10
Mean (Average, Expected Value)
  • The Mean (also known as the average or the
    Expected Value) is a measure of the center (of
    mass, or of gravity) of a distribution.
  • Typical notation: the mean of a sample of data
    is written X-bar; the mean of a population is
    written with the Greek letter mu (μ), or E(X),
    read "expected value of X".
  • Example: suppose five students take a test and
    their scores are 70, 68, 71, 69 and 98.
  • Mean = (70 + 68 + 71 + 69 + 98)/5 = 75.2
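The test-score example can be reproduced in a few lines. This is a minimal sketch in Python (the lecture itself uses Excel and Minitab); the function name `sample_mean` is chosen here for illustration.

```python
scores = [70, 68, 71, 69, 98]

def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    return sum(values) / len(values)

print(sample_mean(scores))  # 75.2
```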

11
Median
The sample median, $\tilde{x}$, is the middle value in a
set of data that is arranged in ascending order.
For an even number of data points the median is
the average of the middle two.
Population median: $\tilde{\mu}$
12
Median
  • Median (also known as the 50th percentile) is the
    middle observation in a data set.
  • Rank the data set and select the middle value.
  • If odd number of observations, the middle value
    is observation (N + 1)/2.
  • If even number of observations, the middle value
    is interpolated as midway between observation
    numbers N/2 and N/2 + 1.
  • Prior data values: 68, 69, 70, 71, and 98.
  • Median is 70.
  • If another student with a score of 60 was
    included, the new median would be 69.5
    ((69 + 70)/2).
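The odd and even cases can be sketched as follows. This is an illustrative Python version, not part of the original slides; `sample_median` is a name assumed for this example.

```python
def sample_median(values):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: observation number (n + 1)/2 in 1-based counting.
        return ordered[mid]
    # Even n: midway between observations n/2 and n/2 + 1 (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([70, 68, 71, 69, 98]))      # 70
print(sample_median([70, 68, 71, 69, 98, 60]))  # 69.5
```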

13
Mean Vs. Median
  • Which is a better measure of location for the
    following set of test scores?
  • 68, 70, 69, 71, and 98
  • Mean = 75.2, Median = 70.0

14
Trimmed Mean
  • Trimmed Mean is a compromise between mean and
    median.
  • 10% Trimmed Mean
  • First, eliminate smallest 10% of values and
    largest 10% of values.
  • Then, re-compute the mean.
  • Trimmed means are gaining popularity
  • Less sensitive than the mean to outliers, but not
    as robust as the median value.
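A trimmed mean can be sketched in Python as below. Two assumptions are made for this example: the function name `trimmed_mean` is hypothetical, and when the trimming percentage does not correspond to a whole number of observations the count is simply rounded down (textbooks such as Devore interpolate instead). With only five test scores, a 20% trim is used so that one value is dropped from each end.

```python
def trimmed_mean(values, percent):
    """Mean after dropping the smallest and largest `percent` % of values.

    `percent` is an integer below 50; the number dropped from each end
    is rounded down (an assumption of this sketch).
    """
    ordered = sorted(values)
    n = len(ordered)
    k = n * percent // 100      # values dropped from EACH end
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

scores = [70, 68, 71, 69, 98]
print(trimmed_mean(scores, 20))  # 70.0 (drops 68 and 98)
```

For a sample of 20 values, a 10% trim under this rule drops 2 values from each end, 4 in total.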

15
Trimmed Mean (Example from Devore Textbook)
  • Variable: life (hours) of incandescent lamps.
  • Sample size: 20
  • How many values will be trimmed in a 10% trimmed mean?
  • Mean = 965.0, Median = 1009.5, Trimmed
    Mean = 971.4
  • How are these values impacted by sample size, by
    distribution?
  • What might be some useful applications?

16
III. Measures of Dispersion
  • Range
  • Standard Deviation
  • Variance

17
Range
  • The Range is the maximum value in a data set
    minus the minimum value.
  • Example: Test Scores 70, 68, 71, 69 and 98.
    Range = 98 - 68 = 30.
  • Note: the range is often preferred over the
    standard deviation for small data sets (e.g., if
    the number of observations in a sample data set is < 10).

18
Standard Deviation
  • Sample standard deviation (StDev), S, measures the
    dispersion of the individual observations from
    the mean.
  • For a sample data set, standard deviation is also
    referred to as the sample standard deviation or
    the root-mean-square, S_rms.
  • Units for S are the same as for the variable
    being analyzed.
  • E.g., if we measure mpg, then S will be in mpg.
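For reference, the usual definition is s = sqrt( sum of (x_i - x-bar)^2 / (n - 1) ). A minimal Python sketch of that calculation follows; the function names are chosen for this example, and the data are the test scores used elsewhere in the lecture.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

scores = [70, 68, 71, 69, 98]
print(round(sample_std(scores), 2))  # 12.79, in the same units as the scores
```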

19
Why divide by n-1?
  • n - 1 is often referred to as the degrees of
    freedom.
  • Variety of reasons
  • Corrects underestimating bias: the Xi's are closer
    to the sample mean (X-bar) than to the population mean
    (μ).
  • Since we use a statistic (X-bar) in our standard
    deviation calculation, we have placed a
    restriction on one of the Xi's.
  • Suppose you have 4 values. If you are told the
    mean = 4, X1 = 3, X2 = 5, and X3 = 2, then X4 is
    restricted: it can be calculated from the
    mean, X1, X2, and X3 (here X4 = 6).
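The four-value example can be checked directly: once the mean and three of the values are fixed, the last value is no longer free. A small Python sketch (variable names are illustrative):

```python
n = 4
mean = 4
known = [3, 5, 2]          # X1, X2, X3

# The values must sum to n * mean, so X4 is determined by the others.
x4 = n * mean - sum(known)
print(x4)  # 6
```

Only n - 1 = 3 of the values could vary freely, which is the sense in which n - 1 is the degrees of freedom.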

20
Effects of Extreme Values
  • Test scores 70, 68, 71, 69 and 98:
    sample standard deviation is 12.79.
  • Suppose you exclude the score of 98:
    sample standard deviation is reduced to 1.3!
  • Standard deviation may be severely influenced by
    extreme values in sample data set (Note: these
    values may not necessarily be mistakes).
  • We may reduce the effects of any individual
    observation by increasing the sample size.

21
Variance
  • Variance is the square of the standard deviation.
  • Represents the average squared deviation of each
    observation from the sample mean.
  • Prior example, where std deviation = 12.79:
  • Variance = (12.79)^2 ≈ 163.6 (163.7 when computed from the unrounded data)

22
Properties of s2
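Let $x_1, x_2, \ldots, x_n$ be any sample and c be any
nonzero constant.
1. If $y_i = x_i + c$ for each i, then $s_y^2 = s_x^2$.
2. If $y_i = c x_i$ for each i, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,
where $s_x^2$ is the sample variance of the x's and
$s_y^2$ is the sample variance of the y's.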
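The standard shift and scale properties of the sample variance (adding a constant c to every observation leaves s^2 unchanged; multiplying every observation by c multiplies s^2 by c^2) can be checked numerically. The sketch below uses the lecture's test scores and an arbitrary constant c = 3 chosen for illustration.

```python
import math
import statistics

x = [70.0, 68.0, 71.0, 69.0, 98.0]
c = 3.0

shifted = [xi + c for xi in x]   # y_i = x_i + c
scaled = [c * xi for xi in x]    # y_i = c * x_i

s2_x = statistics.variance(x)
s2_shifted = statistics.variance(shifted)
s2_scaled = statistics.variance(scaled)

print(math.isclose(s2_shifted, s2_x))         # shift: variance unchanged
print(math.isclose(s2_scaled, c ** 2 * s2_x)) # scale: variance times c^2
```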
23
Why Use Variance
  • Variance is often used because of its additive
    properties.
  • Suppose you are assembling two independent wood
    blocks, each has a std deviation of 2 mm.

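True for independent parts: $\sigma^2_{AB} = \sigma^2_A + \sigma^2_B$
Not True! $\sigma_{AB} = \sigma_A + \sigma_B$
Basic Algebra!! $a^2 + b^2 \neq (a + b)^2$
Example: $2^2 + 2^2 = 4 + 4 = 8$, but $2 + 2 = 4$ and $4^2 = 16$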
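For the two independent wood blocks, the variances add and the standard deviation of the assembly is the square root of that sum. A short Python sketch of the arithmetic (variable names are illustrative):

```python
import math

sd_a = 2.0  # mm, block A
sd_b = 2.0  # mm, block B

# Variances of independent parts add; standard deviations do not.
var_total = sd_a ** 2 + sd_b ** 2
sd_total = math.sqrt(var_total)

print(var_total)           # 8.0
print(round(sd_total, 2))  # 2.83, not 2 + 2 = 4
```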
24
Three Different Shapes for a Population
Distribution
symmetric
positive skew
negative skew
25
Skewness
  • Some software packages provide skewness
  • Skewness is a measure of relative (a)symmetry.
  • Zero skewness: symmetric
  • Positive skewness: longer right tail
  • Negative skewness: longer left tail
  • Actual calculation outside scope of class

26
Kurtosis
  • Some software packages provide kurtosis
  • Kurtosis (K) is a measure of the peakedness of
    a distribution (relative to normal).
  • K = 3 → normal, bell-shaped distribution
    (mesokurtic) (Note: some software reports normal as 0)
  • K < 3 (or negative relative to 0) → flatter peak,
    fatter shoulders, shorter tails
  • K > 3 (or positive relative to 0) → more peaked
    than normal with longer tails
  • Actual calculation outside scope of class

27
Using Software to Calculate Descriptive Statistics
  • In practice, we rarely calculate statistics by
    hand.
  • In MS Excel, can use these functions:
  • Mean → AVERAGE(array)
  • Median → MEDIAN(array)
  • Std Dev → STDEV(array)
  • Variance → VAR(array)
  • Range → MAX(array) - MIN(array)
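For readers working outside Excel, roughly equivalent calls exist in Python's standard `statistics` module. This is a sketch offered for comparison, not part of the original lecture; like Excel's STDEV and VAR, `stdev` and `variance` divide by n - 1.

```python
import statistics

scores = [70, 68, 71, 69, 98]

mean = statistics.mean(scores)          # like AVERAGE(array)
median = statistics.median(scores)      # like MEDIAN(array)
std_dev = statistics.stdev(scores)      # like STDEV(array)
variance = statistics.variance(scores)  # like VAR(array)
data_range = max(scores) - min(scores)  # like MAX(array) - MIN(array)

print(mean, median, round(std_dev, 2), round(variance, 1), data_range)
```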

28
Minitab Results
  • All advanced statistical software will
    automatically compute descriptive statistics.

Descriptive Statistics: Score
Variable   N    Mean   Median  TrMean  StDev  SE Mean
Score      16   82.78  83.50   83.32   9.17   2.29
Variable   Minimum  Maximum
Score      63.00    95.00
29
V. Box Plots
Q3 = 75th percentile
Median = 50th percentile
Q1 = 25th percentile
fs = Q3 - Q1
Upper limit = Q3 + 1.5 fs
Lower limit = Q1 - 1.5 fs

Labels on the box plot figure:
Extreme outlier(s)
Mild outlier(s)
Upper whisker: highest value within upper limit
Third quartile (Q3) or upper fourth
Median
First quartile (Q1) or lower fourth
Lower whisker: lowest value within lower limit
30
Upper and Lower Fourths
After the n observations in a data set are
ordered from smallest to largest, the lower
(upper) fourth is the median of the smallest
(largest) half of the data, where the median
is included in both halves if n is odd. A
measure of the spread that is resistant to
outliers is the fourth spread: fs = upper
fourth - lower fourth.
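The definition of the fourths can be sketched in Python as follows. The function names are chosen for this example; the rule implemented is the one stated above, with the median included in both halves when n is odd.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def fourths(values):
    """Return (lower fourth, upper fourth)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # n/2 if n is even; (n + 1)/2 if odd (median in both halves)
    lower = median(ordered[:half])
    upper = median(ordered[n - half:])
    return lower, upper

scores = [70, 68, 71, 69, 98]
lower, upper = fourths(scores)
print(lower, upper, upper - lower)  # 69 71 2
```

For the test scores the fourth spread is only 2, even though the range is 30, which illustrates its resistance to the outlying score of 98.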
31
Box Plot differences in notation/calculation
  • Minitab calculates quartiles (Q1, Q3)
  • Some textbooks (including Devore) refer to lower
    and upper fourths
  • Roughly the same, but with some differences
  • Lower fourth: median of the smallest n/2 obs, n
    even, OR median of the smallest (n + 1)/2, n odd
  • Q1: observation at position (n + 1)/4 (if not an
    integer then interpolate)
  • Upper fourth: median of the largest n/2 obs, n
    even, OR median of the largest (n + 1)/2, n odd
  • Q3: observation at position 3(n + 1)/4 (if not an
    integer then interpolate)
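The position-based quartile rule can be sketched as below. This is an illustrative Python version of the (n + 1)/4 and 3(n + 1)/4 positions with linear interpolation; the function names are assumptions of this example, and it assumes at least 3 observations so that both positions fall inside the data.

```python
def value_at_position(ordered, pos):
    """Value at 1-based position `pos`, interpolating between neighbours."""
    lo = int(pos)        # whole part of the position
    frac = pos - lo      # fractional part
    if frac == 0:
        return ordered[lo - 1]
    return ordered[lo - 1] + frac * (ordered[lo] - ordered[lo - 1])

def quartiles(values):
    """Return (Q1, Q3) using positions (n + 1)/4 and 3(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3

scores = [70, 68, 71, 69, 98]
print(quartiles(scores))  # (68.5, 84.5)
```

For these five scores the fourths are 69 and 71, so the two conventions can differ noticeably in small samples.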

32
Box Plot Information
  • Box Plots Show:
  • Location: line for median
  • Note: some software will also include a dot for
    mean.
  • Dispersion: box shows the 25th to 75th percentile
    value range.
  • Departures from symmetry: one box or whisker can
    be larger than the other side, suggesting a lack
    of symmetry.
  • Identification of mild and extreme outliers.

33
Box Plot - MPG Example
34
Box Plots Vs. Histogram
  • Note: the wider box to the left of the median in the box plot
    suggests more spread to the left than to the right.
  • A similar pattern is shown in the histogram.

[Figure: box plot and histogram of the MPG data, each labeled Median 20.1]
35
Multiple Box Plot Example
  • For MPG data, suppose you also collected data for
    tire pressures (grouped as normal or low)
  • Does this stratification variable help explain
    bi-modal distribution?

36
Outliers
Any observation farther than 1.5 fs from the
closest fourth is an outlier. An outlier is
extreme if it is more than 3 fs from the nearest
fourth, and it is mild otherwise.
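The outlier rule can be sketched in Python as follows. The function name and its parameters are hypothetical; the fourths passed in (69 and 71) are the lower and upper fourths of the lecture's test scores under the median-of-each-half definition.

```python
def classify_outliers(values, lower_fourth, upper_fourth):
    """Return (mild, extreme) outliers using the 1.5 fs and 3 fs rules."""
    fs = upper_fourth - lower_fourth
    mild, extreme = [], []
    for x in values:
        # Distance from the nearest fourth; zero for values inside the box.
        if x < lower_fourth:
            distance = lower_fourth - x
        elif x > upper_fourth:
            distance = x - upper_fourth
        else:
            distance = 0
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme

scores = [70, 68, 71, 69, 98]
print(classify_outliers(scores, 69, 71))  # ([], [98])
```

Here fs = 2, so 98 lies 27 units above the upper fourth, far more than 3 fs = 6, and is flagged as an extreme outlier.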
37
Boxplots
[Figure: box plot with labels for the upper fourth, lower fourth, median, extreme outliers, and mild outliers]