Title: Numerically Summarizing Data
Learning Objectives
1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers
Measures of Central Tendency (Mean, Median and Mode)
A parameter is a descriptive measure of a population. In most real-world cases, the population parameter is not known. An example is the average gas price across the whole nation.
A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the average gas price of the nation is not known. However, we can take a random sample of 100 stations, compute the sample average gas price, and use it to estimate the unknown population average.
The population mean, $\mu$, is computed using all the individuals in a population. If the population has N individuals, then $\mu = \frac{\sum x_i}{N}$. The population mean is a parameter.
The sample mean, $\bar{x}$, is computed using sample data of size n: $\bar{x} = \frac{\sum x_i}{n}$. The sample mean is a statistic that is an unbiased estimator of the population mean.
NOTE: In real-world applications, the population mean $\mu$ is usually not known and is estimated by the sample mean $\bar{x}$.
Median
The median of a variable is the value that lies
in the middle of the data when arranged in
ascending order. That is, half the data is below
the median and half the data is above the median.
We use m to represent the median.
Steps in Computing the Median of a Data Set
- Arrange the data in ascending order.
- Determine the number of observations (n).
- Determine the observation in the middle of the data set. Its position is (n+1)/2.
- If (n+1)/2 is an integer, the median is the data value at position (n+1)/2. (NOTE: in this situation, n is an odd number.)
- If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of position (n+1)/2. (NOTE: in this situation, n is even.)
EXAMPLE: Computing the Median of Data
Find the mean and median of the following pulse rates from a sample of 8 individuals (NOTE: n = 8 in this case):
80, 76, 65, 68, 72, 73, 65, 80
Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80
Mean = 579/8 ≈ 72.4
Find the position: (n+1)/2 = (8+1)/2 = 4.5
The position is not an integer, so Median = (72+73)/2 = 72.5
Adding one additional pulse rate of 100, now find the median of the data (NOTE: n = 9 in this case):
80, 76, 65, 68, 72, 73, 65, 80, 100
Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100
Position: (9+1)/2 = 5
The median is 73 (the value in the 5th position).
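The (n+1)/2 position rule can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides; the function name `median_by_position` is hypothetical, and the data are the pulse rates from the example.

```python
def median_by_position(values):
    """Median using the (n+1)/2 position rule."""
    ordered = sorted(values)              # arrange in ascending order
    n = len(ordered)                      # number of observations
    if n == 0:
        raise ValueError("median of empty data is undefined")
    position = (n + 1) / 2                # 1-based middle position
    if position.is_integer():             # n is odd: take that value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]    # value just below the position
    upper = ordered[int(position)]        # value just above the position
    return (lower + upper) / 2            # n is even: average the two


pulse_8 = [80, 76, 65, 68, 72, 73, 65, 80]
pulse_9 = pulse_8 + [100]

print(median_by_position(pulse_8))  # n = 8, position 4.5
print(median_by_position(pulse_9))  # n = 9, position 5
```

Python's standard library offers `statistics.median`, which gives the same results; the hand-written version is shown only to mirror the steps.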
The mode of a variable is the most frequent observation of the variable that occurs in the data set. If two values are tied for the highest frequency, we say the data is bimodal.
Exercise: Find the mode of the following pulse rate data:
80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98
The modes are 65 and 80.
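One way to check a mode by computer is to count how often each value occurs. The sketch below uses Python's standard library on the pulse rate data from the exercise; the variable names are illustrative only.

```python
from collections import Counter
from statistics import multimode

pulse = [80, 76, 65, 68, 72, 73, 65, 80, 100, 80,
         74, 65, 66, 70, 74, 65, 80, 98]

counts = Counter(pulse)            # frequency of each pulse rate
modes = sorted(multimode(pulse))   # every value tied for the top frequency

print(counts.most_common(3))
print(modes)
```

`multimode` returns all values tied for the highest frequency, so it handles bimodal data without extra work.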
Comparing Mean and Median
How does an extreme observation affect the mean and median? (Similar to exam questions.)
Example: The following are the quiz scores of 10 students in class A:
5, 5, 5, 5, 5, 7, 7, 7, 7, 7
Find mean ______, find median ________
The following are the quiz scores of 10 students in class B:
5, 5, 5, 5, 5, 7, 7, 7, 7, 30
Find mean ________, find median _________
Fact: The mean is sensitive to extreme data values. The median is robust to extreme data values.
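To see the effect of one extreme score numerically, the two classes can be compared with a short Python sketch. It uses only the standard `statistics` module and the quiz scores given in the example; it is an illustration, not the slides' own method.

```python
from statistics import mean, median

class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]   # same as A, but one 7 replaced by 30

mean_a, median_a = mean(class_a), median(class_a)
mean_b, median_b = mean(class_b), median(class_b)

print("Class A: mean =", mean_a, "median =", median_a)
print("Class B: mean =", mean_b, "median =", median_b)
```

The single score of 30 pulls the mean of class B upward while the median stays where it was.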
How do unusual cases affect the average, the median, and the shape of the histogram?
Compare the histograms with and without the outlier case of 5000 miles.
[Figure: two histograms of Miles, with and without the 5000-mile case]
Shape is _________
Shape is _________
Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)
Variable    N    Mean  SE Mean  TrMean  StDev  Min  Q1  Median   Q3   Max
Miles     148   151.5     33.6   111.8  409.4  1.0  75     120  150  5000
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N    Mean  SE Mean  TrMean  StDev  Min  Q1  Median   Q3   Max
Miles     147  ______     6.71  111.02  81.37  1.0  75  ______  150   600
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N    Mean  Min  Median  Maximum
Miles     147  118.52    1     120      600
NOTE: The median remains unchanged. Why? The median depends only on the middle one or two data points, while the average uses every data value. So one unusually large value makes the average larger but not the median.
When data sets have unusually large or small
values relative to the entire set of data or when
the distribution of the data is skewed, the
median is the preferred measure of central
tendency over the arithmetic mean because it is
more representative of the typical observation.
Comparison of Mean, Median, and Mode for different shapes of distributions (similar exam question)
[Figure: three distribution shapes] Mean < Median (skewed left), Mean = Median (symmetric), Mean > Median (skewed right)
Exercise
- NOTE: In real-world applications, the distribution of sample data can never be perfectly symmetric. The shape can only be approximately symmetric.
- IF THE MEAN IS CLOSE TO THE MEDIAN (NOT NECESSARILY EXACTLY MEAN = MEDIAN), WE WOULD SAY THE DISTRIBUTION IS APPROXIMATELY SYMMETRIC.
- Exercise: A sample of 50 gas prices is recorded and summarized. The average price is 3.15 and the median price is 3.13. Is the shape of the price distribution more likely to be skewed to the left, approximately symmetric, or skewed to the right?
- ANS
Measures of Dispersion
Four different measures of dispersion: Range, Variance, Standard Deviation, Interquartile Range (IQR)
Measures of dispersion measure how spread out the data values are. The more the data values spread, the larger the variation of the data values.
Example:
Scores of 5 students in class A: 60, 60, 70, 80, 80
Scores of 5 students in class B: 40, 60, 70, 80, 100
Scores of 5 students in class C: 70, 70, 70, 70, 70
Q: Scores in Class ____ have the largest variation. Scores in Class _____ have zero variation.
Visualizing Variability Using Histograms
[Figure: three histograms labeled A, B, and C]
Which one shows the largest variation? Which one shows the smallest variation?
How to measure the variation?
- Range: R = Largest Data Value - Smallest Data Value
- The sample variance is $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
- The sample standard deviation is $s = \sqrt{s^2}$
NOTE: The divisor (n-1) is called the degrees of freedom.
The population variance is symbolically represented by lowercase Greek sigma squared: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
The population standard deviation is $\sigma = \sqrt{\sigma^2}$
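The range, sample variance, and sample standard deviation can be computed directly from their definitions. The sketch below is illustrative: the helper names `data_range` and `sample_variance` are hypothetical, and the data are the scores of classes A, B, and C from the dispersion example.

```python
import math


def data_range(values):
    """Range: largest data value minus smallest data value."""
    return max(values) - min(values)


def sample_variance(values):
    """Sample variance with divisor n - 1 (the degrees of freedom)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n                      # sample mean
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


classes = {
    "A": [60, 60, 70, 80, 80],
    "B": [40, 60, 70, 80, 100],
    "C": [70, 70, 70, 70, 70],
}

for name, scores in classes.items():
    s2 = sample_variance(scores)
    s = math.sqrt(s2)                            # sample standard deviation
    print(name, "range =", data_range(scores), "s^2 =", s2, "s =", round(s, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a cross-check.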
NOTE: As mentioned before, for real-world problems, the population mean, population variance, and population standard deviation are NOT KNOWN. Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This is a major part of inferential statistics, which will be dealt with later. In this chapter, we are learning how to compute and interpret these sample descriptive summaries to understand the sample data.
Notation
- $s^2$ = sample variance, $s$ = sample standard deviation
- NOTE: If the original measurement unit is (ft), the variance $s^2$ has measurement unit (ft)^2, since if x has unit (ft), then $(x - \bar{x})^2$ has the unit (ft)(ft), which is (ft)^2.
- The measurement unit of $s^2$ is (ft)^2.
- The measurement unit of $s$ is (ft).
- $\sigma^2$ = population variance, $\sigma$ = population standard deviation
Some Important Tips
- NOTE: Sample statistics such as the sample mean $\bar{x}$, the sample median, $s$, and $s^2$ will be different for different samples.
- Population parameters such as the population mean $\mu$, the population variance $\sigma^2$, and the population s.d. $\sigma$ are fixed constants for a given population. They do not change for different samples.
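A small simulation can make this contrast concrete. The sketch below builds a hypothetical population of gas prices (the values, the seed, and the sample size of 100 are assumptions made only for illustration) and draws several random samples from it.

```python
import random
import statistics

random.seed(42)  # fixed seed so that reruns draw the same samples

# Hypothetical population of 10,000 gas prices (illustrative values only).
population = [round(random.gauss(3.15, 0.20), 2) for _ in range(10_000)]
mu = statistics.fmean(population)   # parameter: one fixed number

sample_means = []
for _ in range(5):
    sample = random.sample(population, 100)          # a new random sample
    sample_means.append(statistics.fmean(sample))    # statistic: varies

print(f"population mean: {mu:.3f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {x_bar:.3f}")
```

Each sample mean lands near the population mean but differs from sample to sample, while the population mean itself never changes.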
Exercise: Comparing Variation. Quiz Scores of 40 students (similar exam questions)
[Figure: histograms of quiz scores on a 0 to 10 scale for Classes A, B, and C]
Which one has the smallest s.d.? Which has the largest s.d.?
Answer
- Class B has the smallest standard deviation
- Class A has the largest standard deviation
Points to remember about variance and standard deviation and their relationship with the histogram
- The values of $s$ and $s^2$ are always greater than or equal to zero.
- The larger the value of $s^2$ or $s$, the greater the variability of the data set.
- If $s^2$ or $s$ is equal to zero, all measurements must have the same value.
- The standard deviation $s$ is computed in order to have a measure of variability with the same unit as the observations.
- The larger the s.d., the more spread out the data and the flatter the histogram.
- The smaller the s.d., the more clustered the data around the mean and the taller the peak of the histogram.
Exercise (Similar Exam questions)
- 1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary:
- Sample mean = 2.15, Median = 2.12, s = 0.15
- Q: Is the distribution of the gas prices more likely to be
- (a) Symmetric (b) Skewed-to-right (c) Skewed-to-left
- And WHY?
- 2. The following two data sets are prices of milk from 6 stores, one from January and one from a year later.
Store                   A     B     C     D     E     F
Price in January 2004   1.85  1.95  1.85  2.00  1.78  1.97
Price in January 2005   2.05  2.15  2.05  2.20  1.98  2.17
- True or False for each of the following statements:
- (a) The average price remains the same between the two years.
- (b) The price range remains the same between the two years.
- (c) The median remains the same between the two years.
- (d) The standard deviation (s) remains the same between the two years.
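The milk price question can be checked numerically. The sketch below is an illustration using the prices listed in the exercise and Python's standard `statistics` module; the `summary` helper is a hypothetical name.

```python
import math
import statistics

jan_2004 = [1.85, 1.95, 1.85, 2.00, 1.78, 1.97]
jan_2005 = [2.05, 2.15, 2.05, 2.20, 1.98, 2.17]


def summary(prices):
    """Return (mean, median, range, sample s.d.) for a list of prices."""
    return (
        statistics.mean(prices),
        statistics.median(prices),
        max(prices) - min(prices),
        statistics.stdev(prices),
    )


mean04, median04, range04, sd04 = summary(jan_2004)
mean05, median05, range05, sd05 = summary(jan_2005)

# Every store's price changed by the same amount, so check that shift first.
shifts = [round(b - a, 2) for a, b in zip(jan_2004, jan_2005)]
print("price changes:", shifts)
print("mean:  ", round(mean04, 2), "->", round(mean05, 2))
print("median:", round(median04, 2), "->", round(median05, 2))
print("range: ", round(range04, 2), "->", round(range05, 2))
print("s:     ", round(sd04, 4), "->", round(sd05, 4))
```

Adding the same constant to every value moves measures of center by that constant and leaves measures of spread unchanged, which is what the printout shows.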
Descriptive Summary for the 56 distances
Descriptive Statistics: distance
Variable    N   Mean  Median  TrMean  StDev  SE Mean
distance   56  142.0   140.0   128.3  112.2     15.0
Variable  Minimum  Maximum    Q1     Q3
distance      5.0    800.0  92.5  160.0
Notes on the output:
- StDev is s, the sample standard deviation, so s^2 = (112.2)^2.
- TrMean is the trimmed mean: the mean after excluding the lowest 5% and the highest 5% of the data.
- Median is m.
- 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
- 75% of the distances are lower than Q3, the third quartile, or 75th percentile.
- Minimum is the smallest value and Maximum is the largest value.
If we add a new maximum, 6000, to the data, so that we have 57 cases, what is the effect of the 6000 on the following summary statistics: Increase? Decrease? The same?
- (a) the average distance
- (b) the median distance
- (c) the standard deviation
- (d) the range
Answer
- After adding 6000 miles to the data:
- The average distance is increased.
- The median distance for this example is the same (in general, it will be almost the same).
- The standard deviation is increased.
- The range is increased.
Empirical Rule and Applications
What is the meaning of variation and how is it used in solving real-world problems?
For symmetric, mound-shaped (bell-shaped) data, approximately:
- 68% of the data is within 1 s of the mean
- 95% of the data is within 2 s of the mean
- 99.7% (nearly 100%) of the data is within 3 s of the mean
An important application of the Empirical Rule is identifying rare (unusual, extreme) observations. If an observation falls outside the two-s.d. range, it has only about a 5% chance of occurring. Therefore, it is considered a rare (or unusual) case.
NOTE: If you add the percentages on each side of the center line $\mu$, they add up to 50%. A mound-shaped distribution is symmetric about the mean.
Applying the Empirical Rule to Identify Rare Events
A simple and powerful tool for identifying outliers, extremes, or unusual or rare events. We will use this rule very often throughout the entire semester.
(Similar questions in the test)
Consider the 2010 ACT test: the average was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.
Q1: A student received a score of 25. Is this an unusually high score?
Q2: If CMU admits students with a minimum ACT score of one standard deviation below the mean, what is the minimum ACT score for CMU admission?
Q3: A student received an ACT score of 30. Is this an unusually high score?
ANSWER
Q1: 25 = 21 + 4, which is one s.d. above the mean. It is inside two s.d. of the mean, so it is NOT an unusually high score.
Q2: The score at one s.d. below the mean is 21 - 4 = 17.
Q3: The score 30 > 21 + 2(4) = 29, so 30 is outside two s.d. of the mean. Only 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.
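The two-s.d. check used in these answers can be sketched as a small Python helper. The function names `sd_distance` and `is_unusual` are hypothetical, and the cutoff of 2 standard deviations follows the rule stated in the slides.

```python
def sd_distance(value, mean, sd):
    """How many standard deviations the value lies from the mean."""
    return (value - mean) / sd


def is_unusual(value, mean, sd, cutoff=2.0):
    """True when the value falls outside mean +/- cutoff * sd."""
    return abs(sd_distance(value, mean, sd)) > cutoff


act_mean, act_sd = 21, 4

for score in (25, 30):
    print(score, sd_distance(score, act_mean, act_sd),
          is_unusual(score, act_mean, act_sd))

minimum_admission = act_mean - 1 * act_sd   # one s.d. below the mean
print("minimum ACT:", minimum_admission)
```

The check is only meaningful when the distribution is roughly mound-shaped, as the example assumes for ACT scores.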
Exercise: Estimating the average and standard deviation and applying the Empirical Rule when the distribution is mound-shaped
- We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = 20 and max = 80. As you see, weekly spending varies from student to student.
- (a) Give a good estimate of the average and standard deviation of the weekly spending based on the 40 students' data.
- (b) Approximately what percentage of students would spend 35 or more per week?
- ANS (a): Since the distribution is mound-shaped, we can use (20+80)/2 = 50 to estimate the average spending.
- Since this is a sample, we use s ≈ range/4 to estimate the s.d., which gives (80-20)/4 = 15.0.
- ANS (b): We can then use this estimated average spending and s to answer question (b).
- 35 is about one s.d. below the mean. Hence, the percentage spending 35 or more is 34% + 50% = 84%. Approximately 84% of individuals spend 35 or more per week.
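The estimates in this answer can be reproduced in a few lines of Python. As a cross-check, the sketch also compares the Empirical Rule figure with a normal model via `statistics.NormalDist`; treating the spending as exactly normal is an added assumption, not something the exercise states.

```python
from statistics import NormalDist

minimum, maximum = 20, 80

mean_est = (minimum + maximum) / 2      # midpoint as the estimate of the average
sd_est = (maximum - minimum) / 4        # range/4 as the estimate of s

z = (35 - mean_est) / sd_est            # s.d. distance of 35 from the estimated mean

# Empirical Rule: about 34% lies between one s.d. below the mean and the mean,
# and 50% lies above the mean.
empirical_pct = 34 + 50

# Cross-check under an assumed normal model with the estimated mean and s.d.
normal_pct = (1 - NormalDist(mean_est, sd_est).cdf(35)) * 100

print(mean_est, sd_est, z)
print(empirical_pct, round(normal_pct, 1))
```

The two percentages agree closely, which is expected because the Empirical Rule is itself a rounded description of the normal curve.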
Five-Number Summary and Box Plots
The Five-Number Summary: MINIMUM, Q1, Median, Q3, MAXIMUM
IQR (Interquartile Range) = Q3 - Q1
Steps for Drawing a Box Plot
Step 1: Determine the lower and upper fences.
Lower fence = Q1 - 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Step 2: Draw vertical lines at Q1, M, and Q3. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.
Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).
EXAMPLE: Drawing a Boxplot
Min  Q1  M   Q3  Max  IQR
28   38  48  56  73   Q3 - Q1 = 56 - 38 = 18
Draw a boxplot for the serum HDL. Compute the lower and upper fences and draw a boxplot.
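The fence calculation for this example can be sketched in Python. The helper name `fences` is hypothetical; only the five-number summary given in the example is used, since the raw serum HDL values are not listed.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the quartiles."""
    iqr = q3 - q1                       # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


minimum, q1, med, q3, maximum = 28, 38, 48, 56, 73

lower, upper = fences(q1, q3)
print("lower fence:", lower)
print("upper fence:", upper)

# With only the summary available, outliers can be ruled out by checking
# whether the extremes fall inside the fences.
has_low_outlier = minimum < lower
has_high_outlier = maximum > upper
print("outliers below:", has_low_outlier, "outliers above:", has_high_outlier)
```

Because both the minimum and the maximum lie inside the fences, the whiskers of this boxplot extend all the way to 28 and 73.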
Relationship between Distribution Shape and Boxplot (Similar questions in the test)
1. If the median is near the center of the box and the horizontal lines are approximately equal in length, then the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.
[Figure: Symmetric]
[Figure: Skewed Right]
[Figure: Skewed Left]
[Figure: Distance data, 100 distances]
EXAMPLE: Comparing Two Data Sets Using Boxplots
The following boxplots represent the birth rate
for women 15 - 44 years of age in 1990 and 1997
for each state. What conclusion can you make?