Title: NA3873 Lecture 3: Descriptive Statistics: Measures of Location and Variability
1 NA387(3) Lecture 3: Descriptive Statistics
Measures of Location and Variability
2 Topics
- Branches of Statistics (review from last lecture)
- Conducting a Statistical Analysis
- Measures of Location
- Measures of Dispersion
- Outliers and Box Plots
3 I. Branches of Statistics - Recap
- 1. Descriptive Statistics
- Summarize or describe important features in a data set without attempting to infer conclusions.
- Describe data samples using terms such as X-bar (sample mean) and s (sample standard deviation).
- These statistics are used to estimate the population mean (μ) and population standard deviation (σ).
4 I. Branches of Statistics - Recap
- 2. Inferential Statistics
- Use a sample of data to draw conclusions (make inferences).
- Example: Suppose you sample ten bottles from each of two different filling machines.
- Machine A averages 12.10 oz and B averages 12.12 oz.
- Based on inferential statistics, you might conclude that the two machines are not significantly different.
5 Location and Dispersion
- Most common descriptive statistics are related to either measuring location or dispersion (variation).
- Location: central tendency
- Dispersion: spread of distribution
6 Example
- Classic example to demonstrate these concepts: Outcomes of Throwing Darts
- On or Off Location
- Low or High Dispersion
7 Lecture Exercise: Identify On/Off Target and High/Low Dispersion for each
[Figure: four dart targets, A through D, each marked with a pattern of hits (x)]
A. __________
B. __________
C. __________
D. __________
8 II. Measures of Location
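9 The Mean
The average (mean) of the n numbers $x_1, x_2, \ldots, x_n$ is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Population mean: $\mu$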
10 Mean (Average, Expected Value)
- The Mean (also known as the average or the Expected Value) is a measure of the center (of mass, or of gravity) of a distribution.
- Typical notation: X-bar for the mean of a sample of data; the Greek letter mu (μ), sometimes the Latin letter m, or E(X) (read "expected value of X") for the mean of a population or distribution.
- Example: suppose five students take a test and their scores are 70, 68, 71, 69 and 98.
- Mean = (70 + 68 + 71 + 69 + 98)/5 = 75.2
11 Median
The sample median, $\tilde{x}$, is the middle value in a set of data that is arranged in ascending order. For an even number of data points the median is the average of the middle two.
Population median: $\tilde{\mu}$
12 Median
- Median (also known as the 50th percentile) is the middle observation in a data set.
- Rank the data set and select the middle value.
- If odd number of observations, the middle value is observation (N + 1)/2.
- If even number of observations, the middle value is interpolated as midway between observation numbers N/2 and N/2 + 1.
- Prior data values: 68, 69, 70, 71, and 98.
- Median is 70.
- If another student with a score of 60 was included, the new median would be 69.5, i.e. (69 + 70)/2.
13 Mean vs. Median
- Which is a better measure of location for the following set of test scores?
- 68, 70, 69, 71, and 98
- Mean = 75.2, Median = 70.0
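The mean and median quoted for the test scores can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative and not part of the lecture.

```python
from statistics import mean, median

scores = [70, 68, 71, 69, 98]

print(mean(scores))    # 75.2
print(median(scores))  # 70

# Adding a sixth score of 60 gives an even count, so the median is
# the average of the two middle values (69 and 70).
scores_with_60 = scores + [60]
print(median(scores_with_60))  # 69.5

# Without the extreme score of 98, mean and median agree.
typical = [s for s in scores if s != 98]
print(mean(typical), median(typical))  # 69.5 69.5
```

The single score of 98 pulls the mean up by almost six points while leaving the median close to the bulk of the scores.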
14 Trimmed Mean
- Trimmed Mean is a compromise between mean and median.
- 10% Trimmed Mean
- First, eliminate the smallest 10% of values and the largest 10% of values.
- Then, re-compute the mean.
- Trimmed means are gaining popularity.
- Less sensitive than the mean to outliers, but not as robust as the median value.
15 Trimmed Mean (Example from Devore Textbook)
- Variable: life (hours) of incandescent lamps.
- Sample size: 20
- How many values will be trimmed in a 10% TM?
- Mean = 965.0, Median = 1009.5, Trimmed Mean = 971.4
- How are these values impacted by sample size, by distribution?
- What might be some useful applications?
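The trimming procedure can be sketched as below. The lamp-life values are not listed in the slides, so the sketch reuses the five test scores with a 20% trim (one value from each end). The function name `trimmed_mean` and the choice to round the trim count down are assumptions for illustration; some textbooks interpolate when the count is not a whole number.

```python
def trimmed_mean(values, percent):
    """Mean after dropping the smallest and largest `percent` % of values."""
    ordered = sorted(values)
    # Number of observations trimmed from EACH end (rounded down).
    k = len(ordered) * percent // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [70, 68, 71, 69, 98]
print(trimmed_mean(scores, 20))  # drops 68 and 98 -> 70.0
print(trimmed_mean(scores, 0))   # no trimming, ordinary mean -> 75.2
```

With no trimming the result is the ordinary mean; as the trimming percentage approaches 50% it approaches the median, which is the sense in which the trimmed mean is a compromise between the two.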
16 III. Measures of Dispersion
- Range
- Standard Deviation
- Variance
17 Range
- The Range is the maximum value in a data set minus the minimum value.
- Example: Test scores 70, 68, 71, 69 and 98. Range = 98 - 68 = 30.
- Note: the range is often preferred over the standard deviation for small data sets (e.g., if the number of observations in a sample data set is < 10).
18 Standard Deviation
- Sample standard deviation (StDev), S, measures the dispersion of the individual observations from the mean.
- For a sample data set, standard deviation is also referred to as the sample standard deviation or the root-mean-square (Srms).
- Units for S are the same as for the variable being analyzed.
- E.g., if we measure mpg, then S will be in mpg.
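For reference, the usual definition of the sample standard deviation is sketched here. It is assumed to match the lecture's S, given the n - 1 divisor; the symbols $x_i$, $\bar{x}$, and $n$ follow the notation used for the sample mean.

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```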
19 Why divide by n-1?
- n - 1 is often referred to as the degrees of freedom.
- Variety of reasons
- Corrects underestimating bias: the Xi's are closer to the sample mean (X-bar) than to the population mean (μ).
- Since we use a statistic (X-bar) in our standard deviation calculation, we have placed a restriction on one of the Xi's.
- Suppose you have 4 values. If you are told the mean = 4, X1 = 3, X2 = 5, and X3 = 2, then X4 is restricted: it can be calculated from the mean, X1, X2, and X3 (here X4 = 4(4) - 3 - 5 - 2 = 6).
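The underestimating bias can be illustrated exactly, without simulation, by enumerating every possible sample. The sketch below assumes a hypothetical population (the six faces of a die) and samples of size 2 drawn with replacement; neither comes from the lecture. It averages the two candidate variance formulas over all 36 samples, using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population for illustration: faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the SAMPLE mean (X-bar)."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))  # all 36 ordered samples

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2)             # 35/12, the population variance
print(avg_div_n)          # 35/24, too small on average
print(avg_div_n_minus_1)  # 35/12, matches the population variance
```

Dividing by n underestimates the population variance on average, because deviations are measured from X-bar rather than μ; dividing by n - 1 removes that bias in this setting.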
20 Effects of Extreme Values
- Test scores: 70, 68, 71, 69 and 98
- Sample standard deviation is 12.79.
- Suppose you exclude the score of 98
- Sample standard deviation is reduced to 1.3!
- Standard deviation may be severely influenced by extreme values in a sample data set (Note: these values may not necessarily be mistakes).
- We may reduce the effects of any individual observation by increasing the sample size.
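A quick check of the two standard deviations quoted for the test scores, using `statistics.stdev` from Python's standard library (which divides by n - 1). The variable names are illustrative.

```python
from statistics import stdev

scores = [70, 68, 71, 69, 98]
print(round(stdev(scores), 2))      # 12.79

# Excluding the extreme score of 98 shrinks the spread dramatically.
without_98 = [s for s in scores if s != 98]
print(round(stdev(without_98), 2))  # 1.29
```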
21 Variance
- Variance is the square of the standard deviation.
- Represents the average squared deviation of each observation from the sample mean (using n - 1 as the divisor).
- Prior example, where std deviation = 12.79
- Variance = (12.79)^2 ≈ 163.7 (computed from the unrounded standard deviation)
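22 Properties of $s^2$
Let $x_1, x_2, \ldots, x_n$ be any sample and c be any nonzero constant.
1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,
where $s_x^2$ is the sample variance of the x's and $s_y^2$ is the sample variance of the y's.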
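As usually stated, the two properties are that adding a constant c to every observation leaves the sample variance unchanged, and multiplying every observation by c multiplies the sample variance by c squared. A minimal numerical check on the test scores, with c = 3 chosen arbitrarily for illustration:

```python
from statistics import variance

xs = [70, 68, 71, 69, 98]
c = 3  # arbitrary nonzero constant, chosen for illustration

shifted = [x + c for x in xs]  # y_i = x_i + c
scaled = [c * x for x in xs]   # y_i = c * x_i

var_x = variance(xs)
var_shifted = variance(shifted)
var_scaled = variance(scaled)

print(var_x, var_shifted)        # shifting does not change the spread
print(var_scaled, c**2 * var_x)  # scaling multiplies the variance by c**2
```

A practical consequence is that a change of units (for example, inches to centimeters) rescales the standard deviation by the conversion factor, while a shift of origin has no effect on it.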
23 Why Use Variance
- Variance is often used because of its additive properties.
- Suppose you are assembling two independent wood blocks, A and B, each with a std deviation of 2 mm.
- $\sigma^2_{A+B} = \sigma^2_A + \sigma^2_B$
- $\sigma_{A+B} = \sigma_A + \sigma_B$: Not True!
- Basic algebra: $a^2 + b^2 \neq (a + b)^2$. Example: $2^2 + 2^2 = 4 + 4 = 8$, whereas $2 + 2 = 4$ and $4^2 = 16$.
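The wood-block example can be worked through numerically. The sketch assumes the assembled length is the sum of the two block lengths and that the blocks vary independently; the variable names are illustrative.

```python
import math

sd_a = 2.0  # std deviation of block A, in mm
sd_b = 2.0  # std deviation of block B, in mm

# Variances of independent parts add; standard deviations do not.
var_total = sd_a**2 + sd_b**2    # 4 + 4 = 8 mm^2
sd_total = math.sqrt(var_total)  # about 2.83 mm

print(var_total)           # 8.0
print(round(sd_total, 2))  # 2.83, not 2 + 2 = 4
```

Adding the standard deviations directly would overstate the variation of the assembly.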
24 Three Different Shapes for a Population Distribution
- symmetric
- positive skew
- negative skew
25 Skewness
- Some software packages provide skewness
- Skewness is a measure of relative (a)symmetry.
- Zero skewness: symmetric
- Positive skewness: longer right tail
- Negative skewness: longer left tail
- Actual calculation outside scope of class
26 Kurtosis
- Some software packages provide kurtosis
- Kurtosis (K) is a measure of the peakedness of a distribution (relative to normal).
- K = 3 → normal, bell-shaped distribution (mesokurtic) (Note: some software reports normal = 0)
- K < 3 (or negative relative to 0) → flatter peak, fatter shoulders, shorter tails
- K > 3 (or positive relative to 0) → more peaked than normal with longer tails
- Actual calculation outside scope of class
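For readers curious about the two conventions (normal = 3 versus normal = 0), here is a minimal sketch based on simple moment ratios. This is one common textbook definition, not necessarily what any particular package computes; many packages apply small-sample corrections, so their output may differ. The function names are illustrative.

```python
def central_moment(values, k):
    """Average of (x - mean)**k over the data (divides by n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n

def skewness(values):
    """Third central moment scaled by the 1.5 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values, excess=False):
    """Fourth central moment over the squared second moment.

    With excess=True, 3 is subtracted so a normal distribution scores 0.
    """
    k = central_moment(values, 4) / central_moment(values, 2) ** 2
    return k - 3 if excess else k

scores = [70, 68, 71, 69, 98]
print(skewness(scores) > 0)       # True: the 98 creates a long right tail
print(skewness([1, 2, 3, 4, 5]))  # 0.0 for symmetric data
```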
27 Using Software to Calculate Descriptive Statistics
- In practice, we rarely calculate statistics by hand.
- In MS Excel, you can use these functions:
- Mean → average(array)
- Median → median(array)
- Std Dev → stdev(array)
- Variance → var(array)
- Range → max(array) - min(array)
28 Minitab Results
- All advanced statistical software will automatically compute descriptive statistics.

Descriptive Statistics: Score

Variable    N    Mean   Median   TrMean   StDev   SE Mean
Score      16   82.78    83.50    83.32    9.17      2.29

Variable   Minimum   Maximum
Score        63.00     95.00
29 V. Box Plots
- Q3 = 75th percentile; Median = 50th percentile; Q1 = 25th percentile
- fs = Q3 - Q1
- Upper Limit = Q3 + 1.5 fs
- Lower Limit = Q1 - 1.5 fs
[Figure: annotated box plot, labeled from top to bottom]
- Extreme Outlier(s)
- Mild Outlier(s)
- Upper Whisker: highest value within upper limit
- Third quartile (Q3) or upper fourth
- Median
- First quartile (Q1) or lower fourth
- Lower Whisker: lowest value within lower limit
30 Upper and Lower Fourths
After the n observations in a data set are ordered from smallest to largest, the lower (upper) fourth is the median of the smallest (largest) half of the data, where the median is included in both halves if n is odd. A measure of the spread that is resistant to outliers is the fourth spread, fs = upper fourth - lower fourth.
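The definition of the fourths translates directly into a short sketch. The function name `fourths` is an illustrative choice; the data are the test scores used earlier in the lecture.

```python
from statistics import median

def fourths(values):
    """Return (lower fourth, upper fourth) as defined above."""
    ordered = sorted(values)
    n = len(ordered)
    # Size of each half; the median is included in both when n is odd.
    half = (n + 1) // 2
    lower_fourth = median(ordered[:half])
    upper_fourth = median(ordered[n - half:])
    return lower_fourth, upper_fourth

scores = [70, 68, 71, 69, 98]
lower, upper = fourths(scores)
fs = upper - lower
print(lower, upper, fs)  # 69 71 2
```

For these scores the fourth spread is only 2, while the range is 30, which shows how little the fourth spread is affected by the single extreme score.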
31 Box Plot: differences in notation/calculation
- Minitab calculates quartiles (Q1, Q3)
- Some textbooks (including Devore) refer to lower and upper fourths
- Roughly the same, but with some differences
- Lower fourth = median of the smallest n/2 obs, n even, OR median of the smallest (n+1)/2, n odd
- Q1 = observation at position (n+1)/4 (if not an integer then interpolate)
- Upper fourth = median of the largest n/2 obs, n even, OR median of the largest (n+1)/2, n odd
- Q3 = observation at position 3(n+1)/4 (if not an integer then interpolate)
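The position rule for Q1 and Q3 can be sketched as follows, assuming linear interpolation between neighboring observations and at least three observations. The helper names are illustrative, and actual software may handle edge cases differently.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)
    fraction = position - whole
    value = ordered[whole - 1]
    if fraction > 0:
        # Interpolate linearly toward the next observation.
        value += fraction * (ordered[whole] - ordered[whole - 1])
    return value

def quartiles(values):
    """Q1 and Q3 at positions (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3

scores = [70, 68, 71, 69, 98]
print(quartiles(scores))  # (68.5, 84.5)
```

For the same five scores the lower and upper fourths are 69 and 71, so the two conventions can differ noticeably in small samples with an extreme value.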
32 Box Plot Information
- Box Plots Show
- Location: line for median
- Note: some software will also include a dot for the mean.
- Dispersion: box shows the 25th to 75th percentile value range.
- Departures from symmetry: one box or whisker can be larger than the other side, suggesting a lack of symmetry.
- Identification of mild and extreme outliers.
33 Box Plot - MPG Example
34 Box Plots vs. Histogram
- Note: wider box to left of median in box plot suggests more spread to left than right.
- Similar pattern is shown in the histogram.
[Figure: box plot and histogram of the MPG data, each marked Median = 20.1]
35 Multiple Box Plot Example
- For MPG data, suppose you also collected data for tire pressures (grouped as normal or low).
- Does this stratification variable help explain the bi-modal distribution?
36 Outliers
Any observation farther than 1.5 fs from the closest fourth is an outlier. An outlier is extreme if it is more than 3 fs from the nearest fourth, and it is mild otherwise.
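The outlier rule can be sketched as a small classifier. It assumes the fourths are computed as the medians of the lower and upper halves of the data (median included in both halves when n is odd). The function name `classify_outliers` is illustrative, and the sample data are the lecture's test scores plus the extra score of 60 mentioned earlier.

```python
from statistics import median

def classify_outliers(values):
    """Return (mild, extreme) outliers using the 1.5 fs and 3 fs rules."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # median is in both halves when n is odd
    lower_fourth = median(ordered[:half])
    upper_fourth = median(ordered[n - half:])
    fs = upper_fourth - lower_fourth

    mild, extreme = [], []
    for x in ordered:
        # Distance beyond the nearest fourth (0 if x lies inside the box).
        distance = max(lower_fourth - x, x - upper_fourth, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme

scores = [60, 68, 69, 70, 71, 98]
print(classify_outliers(scores))  # ([60], [98])
```

Here the fourths are 68 and 71, so fs = 3: the score of 60 lies 8 below the lower fourth (between 4.5 and 9, hence mild), and 98 lies 27 above the upper fourth (more than 9, hence extreme).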
37 Boxplots
[Figure: box plot labeled with upper fourth, lower fourth, median, extreme outliers, and mild outliers]