Chapter 3: Numerically Summarizing Data

Slides: 70

Provided by: philip52

Learn more at: https://math.vanderbilt.edu


1
Chapter 3 Numerically Summarizing Data
3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Central Tendency and Dispersion from Grouped Data
3.4 Measures of Position
3.5 The Five-Number Summary and Boxplots
September 25, 2008
2
The Mean of a Set
Section 3.1
3
Remark
4
The Median of a Set
In other words, the median is the midpoint of the
observations when they are ordered from smallest
to largest or vice-versa.
5
Example 1
Find the mean and median of the set of
observations 20, -3, 4, 10, 6, -1.
6
Example 2
Find the mean and median of the set of
observations -10, -6, 0, 4, 9.
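As a quick check on Examples 1 and 2, the following sketch computes both quantities with Python's standard statistics module. The helper name mean_and_median is hypothetical and is not part of the slides.

```python
from statistics import mean, median

def mean_and_median(observations):
    """Return (mean, median) of a list of numbers."""
    return mean(observations), median(observations)

example_1 = [20, -3, 4, 10, 6, -1]
example_2 = [-10, -6, 0, 4, 9]

print(mean_and_median(example_1))  # mean 6, median 5 (average of 4 and 6)
print(mean_and_median(example_2))  # mean -0.6, median 0
```

With an even number of observations (Example 1) the median is the average of the two middle values. With an odd number (Example 2) it is the single middle value.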
7
Mean and Dot Plot
Notice that the mean is a fulcrum for the
distribution of point masses on the lever
(x-axis).
8
Add Points (Weights)
The fulcrum has moved 1.2 units to the left.
9
Shape, Mean and Median
10
Outlier

An outlier is an observation (data point) that
falls well above or below the overall set of
data.
The mean can be highly influenced by an outlier.
The median is said to be resistant to outliers,
i.e., its value is not changed significantly by
the addition or removal of an outlier.

11
Example
12
Mode

The mode is the most frequent observation of the
variable.
It is most often used with categorical data.
For numerical data, it can be used when the data
is discrete.

Color Count
Black 20
White 10
Red 35
Blue 15
Green 10
Other 20
The mode of the categorical variable color is red,
the most frequent color (count 35).
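A minimal sketch of finding the mode of the color table with collections.Counter. The variable names are illustrative only. Note that the mode is the category itself, while 35 is its frequency.

```python
from collections import Counter

# Frequencies copied from the color table above.
color_counts = Counter({
    "Black": 20, "White": 10, "Red": 35,
    "Blue": 15, "Green": 10, "Other": 20,
})

# most_common(1) returns a list holding the (category, count) pair
# with the highest count.
mode_color, mode_count = color_counts.most_common(1)[0]
print(mode_color, mode_count)  # Red 35
```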
13
Example
Mia Hamm, who retired at the 2004 Olympics, is
considered to be the most prolific player in
international soccer. Here is a list of the number
of goals scored over her 18-year career.
MHG = {0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20,
13, 13, 2, 7, 8, 13}.
Considering the population as the number of
goals scored by Mia Hamm, find the mean,
median, and mode of this set.
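The sketch below works this example with the statistics module. It assumes the last entry of the list, which is split across a line break in the transcript, is 13, which gives 18 values for the 18-year career.

```python
from statistics import mean, median, multimode

# Assumption: the wrapped final entry is read as 13 (18 values in total).
goals = [0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20, 13, 13, 2, 7, 8, 13]

print(len(goals))        # 18
print(mean(goals))       # about 8.72
print(median(goals))     # 9.5
print(multimode(goals))  # [0, 10, 13]
```

Under that reading, three values (0, 10 and 13) each occur three times, so the set has more than one mode.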
14
Mean, Median and Mode and Distribution Shape
15
Measures of Dispersion
Consider the following sets of observations:
S1 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0} and
S2 = {-5, -4, -3, -2, -1, 1, 2, 3, 4, 5}.
Both sets have the same mean and median
(namely, 0). However, their histograms or dot
plots are quite different.
Notice that the difference between the smallest
and largest numbers is quite different for the
two sets (0 for S1 and 10 for S2).
Section 3.2
16
Range of a Set of Observations
Remark: The range is completely determined by
only two points of the set of observations.
17
Example
Lance Armstrong won the Tour de France seven
consecutive times (1999-2005). Here is data
about his victories.
Year Winning Time (h) Distance (km) Winning Speed (km/h) Winning Margin (min)
1999 91.538 3687 40.28 7.617
2000 92.552 3662 39.46 6.033
2001 86.291 3453 40.02 6.733
2002 82.087 3278 39.93 7.283
2003 83.687 3427 40.94 1.017
2004 83.601 3391 40.56 6.317
2005 86.251 3593 41.65 4.667
The ranges for each category are:
Winning Time range = 92.552 - 82.087 = 10.465 h
Distance range = 3687 - 3278 = 409 km
Winning Speed range = 41.65 - 39.46 = 2.19 km/h
Winning Margin range = 7.617 - 1.017 = 6.600 min
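The ranges can be recomputed directly from the table. In this sketch the dictionary layout and the helper value_range are illustrative choices. The range is taken as the largest value minus the smallest value in each column, so the winning margin range uses the 1999 margin of 7.617 minutes as the maximum.

```python
# Columns copied from the table, in year order 1999 through 2005.
victories = {
    "Winning Time (h)": [91.538, 92.552, 86.291, 82.087, 83.687, 83.601, 86.251],
    "Distance (km)": [3687, 3662, 3453, 3278, 3427, 3391, 3593],
    "Winning Speed (km/h)": [40.28, 39.46, 40.02, 39.93, 40.94, 40.56, 41.65],
    "Winning Margin (min)": [7.617, 6.033, 6.733, 7.283, 1.017, 6.317, 4.667],
}

def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

ranges = {name: round(value_range(col), 3) for name, col in victories.items()}
for name, r in ranges.items():
    print(f"{name}: {r}")
```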
18
The Spread of Quantitative Data
Consider the frequency distributions of two
different data sets.
Notice how the tails of each distribution change
from being close together to being far apart.
Section 2.4
19
The Deviation from the Mean
20
Variance and Standard Deviation
Definition: The average of the squares of all
deviations in a sample is called the variance of
the sample. The standard deviation of a sample
is defined as the square root of the variance.
Question: Why n - 1 instead of n in these
formulas?
21
Remark
There is an unfortunate ambiguity in how the
words variance and standard deviation are used.
These quantities are computed in different ways,
depending on whether the set under consideration
is a population or a sample of a population. It
turns out that if we use the formulas for
variance and standard deviation where we divide
by n instead of n - 1, then the standard deviation
of the sample will consistently underestimate the
standard deviation of the population. This is
called bias. Hence, we will sometimes use the
following definitions and will distinguish
between the sample standard deviation and the
population standard deviation.
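The bias can be seen exactly on a tiny case. The sketch below uses a hypothetical three-element population, lists every ordered sample of size 2 drawn with replacement, and averages the two candidate variance formulas over all samples. It illustrates the effect for the variance and is not a proof.

```python
from fractions import Fraction
from itertools import product

population = [0, 2, 4]  # hypothetical toy population
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N  # divide by N

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

n = 2
# All 9 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)  # 8/3 4/3 8/3
```

Averaged over all samples, dividing by n gives 4/3, which is below the population variance 8/3, while dividing by n - 1 recovers 8/3 exactly.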
22
Example

For the set of observations (sample)
{0, -3, 10, 7, 5, -3, 0}:
Find the range of the sample.
Find the mean and median of the sample.
Find the variance of the sample.
Find the standard deviation of the sample.
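A sketch of the four requested quantities using the statistics module. Because the set is treated as a sample, variance and stdev (which divide by n - 1) are used rather than pvariance and pstdev.

```python
from statistics import mean, median, variance, stdev

sample = [0, -3, 10, 7, 5, -3, 0]

sample_range = max(sample) - min(sample)
print(sample_range)      # 13
print(mean(sample))      # about 2.286
print(median(sample))    # 0
print(variance(sample))  # about 25.905 (divides by n - 1)
print(stdev(sample))     # about 5.090
```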

23
Example

For the two sets of observations, S = {-1, 0, 0, 0, 1}
and T = {-1, -1, -1, -1, 0, 1, 1, 1, 1}:
Find the mean and median for each set.
Find the standard deviation for each set.

We see from the dot plot that the set T has more
points that vary from the mean and hence, has a
larger standard deviation.
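The comparison of S and T can be checked numerically. The slide does not say whether the sets are samples or populations, so this sketch prints both versions of the standard deviation. T is larger under either convention.

```python
from statistics import mean, median, stdev, pstdev

S = [-1, 0, 0, 0, 1]
T = [-1, -1, -1, -1, 0, 1, 1, 1, 1]

for name, data in (("S", S), ("T", T)):
    # stdev divides by n - 1 (sample); pstdev divides by n (population).
    print(name, mean(data), median(data),
          round(stdev(data), 4), round(pstdev(data), 4))
```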
24
Properties of the Standard Deviation

The larger the spread (variation) in the data,
the larger the standard deviation.
The standard deviation is zero if and only if
all elements of the set from which it is computed
are the same, in which case the mean of the set
is this common value.
The standard deviation is influenced by outliers.
This is true because the deviation from the mean
of the set to the outlier is a large number in
absolute value.
The standard deviation yields more information
than the range of the set. (Why?)

25
Example
The following data represents the walking time
(in minutes) from the dorm or apartment to
Professor Bisch's course on operator algebras.
We treat the nine students as the population of
Prof. Bisch's class.
Student Time Student Time
T.S. 39 S.Q. 45
P.C. 21 E.W. 11
A.A. 9 T.B. 12
C.S. 32 G.W. 39
N.G. 30

Find the population mean and standard deviation.
Choose a sample of 4 and compute the mean and
standard deviation of the sample.
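One way to work this example is sketched below. The nine students are treated as the population, so pstdev (divide by n) applies. The particular sample of four students is an arbitrary, hypothetical choice, and stdev (divide by n - 1) is used for it.

```python
from statistics import mean, pstdev, stdev

# Walking times in minutes, copied from the table.
times = {"T.S.": 39, "P.C.": 21, "A.A.": 9, "C.S.": 32, "N.G.": 30,
         "S.Q.": 45, "E.W.": 11, "T.B.": 12, "G.W.": 39}

population = list(times.values())
print(round(mean(population), 2))    # 26.44
print(round(pstdev(population), 2))  # 12.84

# Hypothetical sample of 4 students; any other choice of 4 is equally valid.
sample = [times[s] for s in ("T.S.", "A.A.", "E.W.", "G.W.")]
print(mean(sample), round(stdev(sample), 2))  # 24.5 16.76
```

A different sample of four will generally give a different sample mean and standard deviation, which is the point of the exercise.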

27
Bell-shaped (symmetric) Distributions
Consider a set of observations that is
bell-shaped.
All three distributions have different standard
deviations.
28
Empirical Rule for almost Bell-shaped
Distributions
29
Caution
The Empirical Rule for bell-shaped distributions
is an empirical law, not a fact. The closer the
distribution is to being perfectly bell-shaped,
the more accurate the rule. It is useful in
telling us how the data is concentrated about the
mean of the distribution.
30
Example
31
Detailed Empirical Rule
32
Example

The distribution of the length of bolts produced
by the Acme Bolt Company is approximately
bell-shaped with a mean of 4 inches and a
standard deviation of 0.007 inches.
What is the range of length for 68% of the bolts
produced by this company?
What percentage of bolts will be between 3.986
inches and 4.014 inches?
If the company discards any bolts that are less
than 3.986 inches or greater than 4.014 inches,
what percentage of bolts will be discarded?
What percentage of the bolts will be between
4.007 inches and 4.021 inches?
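The bolt questions can be answered with the usual Empirical Rule percentages (about 68%, 95% and 99.7% of the data within 1, 2 and 3 standard deviations of the mean). The sketch below assumes those approximate percentages and a symmetric distribution. The names WITHIN and interval are illustrative.

```python
MEAN = 4.0   # inches
SD = 0.007   # inches

# Approximate percentages within 1, 2 and 3 standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def interval(k):
    """Lengths within k standard deviations of the mean."""
    return MEAN - k * SD, MEAN + k * SD

low, high = interval(1)
print(f"68% of bolts: {low:.3f} to {high:.3f} inches")  # 3.993 to 4.007

# 3.986 and 4.014 are 2 standard deviations below and above the mean.
kept = WITHIN[2]
discarded = 100 - WITHIN[2]
print(f"Between 3.986 and 4.014: {kept}%, discarded: {discarded}%")

# 4.007 to 4.021 runs from 1 to 3 standard deviations above the mean;
# by symmetry that is half of the band between the 68% and 99.7% regions.
between_1_and_3_sd_above = (WITHIN[3] - WITHIN[1]) / 2
print(f"Between 4.007 and 4.021: {between_1_and_3_sd_above:.2f}%")  # 15.85
```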

33
Chebyshev Inequality
Example: Suppose that a population has a mean of
73.5 and a standard deviation of 5.5. Find an
interval that contains at least 75% of the data
points in the population.
34
Example
In December 2004, the average price of regular
unleaded gasoline excluding taxes in the United
States was $1.37 per gallon. Researchers in the
Department of Energy estimated that the standard
deviation for this mean price was $0.05. Using
Chebyshev's Inequality, estimate the percentage of
gasoline stations that had prices within 3
standard deviations of the mean. What percentage
had prices within 2.5 standard deviations?
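Chebyshev's Inequality states that at least 100(1 - 1/k^2)% of the observations lie within k standard deviations of the mean, for any k > 1. The sketch below applies that bound to the two examples. The function names are hypothetical helpers.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 100 * (1 - 1 / k**2)

def interval(mean, sd, k):
    """Interval extending k standard deviations on each side of the mean."""
    return mean - k * sd, mean + k * sd

# Gasoline example: bounds for k = 3 and k = 2.5.
print(round(chebyshev_lower_bound(3), 1))    # 88.9
print(round(chebyshev_lower_bound(2.5), 1))  # 84.0

# Population example: k = 2 guarantees at least 75%.
print(chebyshev_lower_bound(2))              # 75.0
print(interval(73.5, 5.5, 2))                # (62.5, 84.5)
```

These are lower bounds: the actual percentage within the interval can be higher, and for bell-shaped data it usually is.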
35
Remark

Chebyshev's Inequality does not place any
preconditions on the shape of the data set.
It is true for populations and samples.
The theorem does not say that exactly
100(1 - 1/k^2)% of the points lie within k
standard deviations of the mean, but rather
that at least this percentage do.

36
Mean and Standard Deviation for Grouped Data
Section 3.3
37
Example
38
Example
39
Weighted Mean of a Set
Given a set of numbers, suppose that we believe
that some of the numbers are more important than
other numbers in the set. To reflect this
notion, we define the weighted mean of a set
of numbers.
40
Example
Consider the set S = {-3, 1, 0, 3, -1, 1, 0} and
the weights {1.5, 0, 1, -1, 1, 2, 1}. Find the
weighted mean of this set with respect to the
given weights.
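The slide's formula did not survive in the transcript. The sketch below assumes the standard definition, the sum of weight times value divided by the sum of the weights, and applies it to the set and weights of this example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

S = [-3, 1, 0, 3, -1, 1, 0]
weights = [1.5, 0, 1, -1, 1, 2, 1]

print(round(weighted_mean(S, weights), 4))  # -1.1818
```

Here the weighted sum is -6.5 and the weights add up to 5.5. When all weights are equal, the weighted mean reduces to the ordinary mean.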
41
Approximation for Standard Deviation and Variance
for Grouped Data
42
Example
43
Approximating the Median of grouped Data
44
Example
Bin Frequency Cumulative Frequency
[0,10) 24 24
[10,20) 14 38
[20,30) 39 77
[30,40) 18 95
[40,50] 5 100
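The approximation formula on the previous slide is missing from the transcript. A common choice, assumed here, is linear interpolation inside the bin where the cumulative frequency first reaches n/2: median ≈ L + ((n/2 - CF) / f) × w, with L the lower edge of that bin, CF the cumulative frequency before it, f its frequency and w its width. The slide may use a different variant.

```python
# (lower edge, upper edge, frequency) for each bin in the table.
bins = [(0, 10, 24), (10, 20, 14), (20, 30, 39), (30, 40, 18), (40, 50, 5)]

def grouped_median(bins):
    """Approximate the median by interpolating within the median bin."""
    n = sum(freq for _, _, freq in bins)
    half = n / 2
    cumulative = 0  # cumulative frequency before the current bin
    for lower, upper, freq in bins:
        if freq > 0 and cumulative + freq >= half:
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("bins must contain at least one observation")

print(round(grouped_median(bins), 2))  # 23.08
```

For this table n = 100, the median bin is [20,30) because the cumulative frequency passes 50 there, and the estimate is 20 + (12/39)(10).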
45
Measures of Position in a Distribution

The mean and median give us information about
the center of a set of observations (the
distribution).
The range and standard deviation give us
information about the spread of the
distribution.
We now introduce a concept that describes the
position of an observation within a distribution.
It uses the concept of percentiles. Percentiles
show how the distribution can be divided into parts
(sometimes equal), which in turn gives us the
notion of position within the distribution.

Section 3.4
46
z-score
47
Example
Consider the sample {-1, 0, 1, 5, 19}.
Compute the z-score for each data point.
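The z-score definition on the previous slide is not in the transcript. This sketch assumes the usual one, z = (x - mean) / (standard deviation), and uses the sample standard deviation because the set is called a sample.

```python
from statistics import mean, stdev

sample = [-1, 0, 1, 5, 19]

x_bar = mean(sample)  # 4.8
s = stdev(sample)     # sample standard deviation (divides by n - 1)

z_scores = [(x - x_bar) / s for x in sample]
print([round(z, 2) for z in z_scores])  # [-0.7, -0.58, -0.46, 0.02, 1.72]
```

The z-scores sum to zero (up to rounding), and the value 19 stands out as the only observation more than one standard deviation from the mean.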
48
Application of z-score
The average 20- to 29-year-old man is 69.6
inches tall with a standard deviation of 2.7
inches. The average 20- to 29-year-old woman is
64.1 inches tall with a standard deviation of 2.6
inches. With respect to their populations, who is
relatively taller: a 75-inch man or a 70-inch
woman?
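Comparing the two heights amounts to comparing their z-scores, each computed against its own population. A minimal sketch, with z_score as a hypothetical helper:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

z_man = z_score(75, 69.6, 2.7)
z_woman = z_score(70, 64.1, 2.6)

print(round(z_man, 2), round(z_woman, 2))  # 2.0 2.27
print("woman" if z_woman > z_man else "man", "is relatively taller")
```

The 70-inch woman is about 2.27 standard deviations above her population mean, compared with 2.0 for the 75-inch man, so she is relatively taller.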
49
Percentile
Definition: The kth percentile in a
distribution, Pk, is a value such that k percent
of the observations fall at or below it. In
other words, it subdivides the total area
enclosed by the distribution into two sub-areas,
A1 and A2, so that A1 is k% and A2 is (100 - k)%
of the total area.
50
Algorithm for Percentiles
51
Example
Find the 20th percentile of the set S =
{-1, 0, 3, 5, 9, 12, 15, 18, 25}. Next find the
45th percentile.
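The algorithm slide is empty in the transcript, so the rule below is an assumption: sort the data, compute the position i = (k/100)(n + 1), take the ith value if i is a whole number, and otherwise average the two values on either side of i. This rule reproduces the quartiles Q1 = 0.5 and Q3 = 6.0 quoted later for the set -1,1,5,5,0,7,2,7, but other conventions exist and can give slightly different answers.

```python
def percentile(data, k):
    """kth percentile (integer k from 1 to 99), position i = (k/100)(n + 1)."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    # Integer arithmetic avoids rounding trouble: i = whole + remainder/100.
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1:
        return ordered[0]    # position falls before the first observation
    if whole >= n:
        return ordered[-1]   # position falls at or after the last observation
    if remainder == 0:
        return ordered[whole - 1]
    return (ordered[whole - 1] + ordered[whole]) / 2

S = [-1, 0, 3, 5, 9, 12, 15, 18, 25]
print(percentile(S, 20))  # 0   (position 2)
print(percentile(S, 45))  # 7.0 (position 4.5: average of 5 and 9)
```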
52
Remark
53
Quartiles
When k = 50, half of the observations are above
and half are below this position. One can argue
that this is equivalent to the notion of the
median of the set of observations. When k = 25,
one quarter of the observations are below this
position and three quarters are above. Similarly
for k = 75. These demarcation points are given
special names. When k = 25, it is called the
first quartile (Q1). When k = 50, it is called
the second quartile or median (Q2) and finally,
when k = 75, it is called the third quartile
(Q3).
54
To Find Quartiles

To calculate Q1, we calculate P25.
To calculate Q2, we calculate P50.
To calculate Q3, we calculate P75.

55
Example
Find the quartiles for the set -1,1,5,5,0,7,2,7.
56
Example
Find the quartiles for the set -1,1,5,5,0,7,2,7,2.
This is the same set as in the previous example
with the data point 2 added.
57
Example
Find the median, Q1, and Q3 for the set of
data 68,76,60,88,69,80,75,67,71,100,63,62,71,74,
64,48,100,72,65,50,72,100,63,45,54,60,75,57,74,84,
83.
58
Interquartile Range
Definition: Let Q1, Q2, and Q3 denote the
quartiles for a set of observations. The
interquartile range (IQR) of the set is defined
as IQR = Q3 - Q1. Hence, it is simply the
distance between the first and third quartiles.
Example: Consider -1,1,5,5,0,7,2,7. Previously,
we showed that Q1 = 0.5 and Q3 = 6.0. Hence,
IQR = 6.0 - 0.5 = 5.5.
59
IQR and Outlier Criterion
Criterion: Consider a set of observations. An
observation may be a possible outlier on the left
if it lies below Q1 by more than 1.5(IQR). It may
be a possible outlier on the right if it lies
above Q3 by more than 1.5(IQR). We call these
demarcation values the lower and upper fences of
the set:
LF = Q1 - 1.5(IQR)
UF = Q3 + 1.5(IQR)
60
Example
Consider a set of data points
-1,0,3,5,9,10,26. Does it have any potential
outliers?
61
Example
The following is a sample of the concentration of
dissolved organic carbon (mg/L) in mineral soil:
8.5, 10.3, 5.5, 8.05, 3.02, 12.57, 8.37, 4.6,
7.9, 9.11, 3.91, 11.56, 4.71,10.72, 7.45, 12.89,
7.92, 8.5, 11.72, 8.79, 9.29, 7, 7.66, 21.82,
11.33, 9.81, 17.9, 4.8, 4.85, 21, 3.99, 11.72,
22.62, 7.11, 17.99, 7.31, 4.9, 11.97,10.89, 3.79,
11.8, 10.74, 9.6, 21.4, 16.92, 9.1, 7.85.
Calculate the quartiles and IQR for this sample.
Lastly, compute the upper and lower fences.
We first sort the set 3.02, 3.79, 3.91, 3.99,
4.6, 4.71, 4.8, 4.85, 4.9, 5.5, 7, 7.11, 7.31,
7.45,7.66, 7.85, 7.9, 7.92, 8.05, 8.37, 8.5, 8.5,
8.79, 9.1, 9.11, 9.29, 9.6,9.81, 10.3, 10.72,
10.74, 10.89, 11.33, 11.56, 11.72, 11.72, 11.8,
11.97, 12.57, 12.89, 16.92, 17.9, 17.99, 21,
21.4, 21.82, 22.62 and notice that there are 47
points. Hence, the median (Q2) is the middle
point of the sorted set, 9.1 (the 24th point).
Therefore, Q2 = 9.1. To calculate Q1 and Q3, we
use the 25th and 75th percentiles: Q1 = 7.16 and
Q3 = 11.72. Therefore, IQR = 11.72 - 7.16 = 4.56.
The lower and upper fences are
LF = Q1 - 1.5(IQR) = 7.16 - 1.5(4.56) = 0.32
UF = Q3 + 1.5(IQR) = 11.72 + 1.5(4.56) = 18.56
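The fences can be recomputed and applied to the sample as follows. The sketch takes Q1 = 7.16 and Q3 = 11.72 as given in the example rather than recomputing them, since the quartile convention used to obtain them is not stated. It then flags observations outside the fences, which goes one step beyond what the example asks.

```python
# Dissolved organic carbon sample (mg/L), copied from the example.
doc = [8.5, 10.3, 5.5, 8.05, 3.02, 12.57, 8.37, 4.6,
       7.9, 9.11, 3.91, 11.56, 4.71, 10.72, 7.45, 12.89,
       7.92, 8.5, 11.72, 8.79, 9.29, 7, 7.66, 21.82,
       11.33, 9.81, 17.9, 4.8, 4.85, 21, 3.99, 11.72,
       22.62, 7.11, 17.99, 7.31, 4.9, 11.97, 10.89, 3.79,
       11.8, 10.74, 9.6, 21.4, 16.92, 9.1, 7.85]

Q1, Q3 = 7.16, 11.72  # quartiles as stated in the example

iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr
print(round(iqr, 2), round(lower_fence, 2), round(upper_fence, 2))  # 4.56 0.32 18.56

# Observations beyond either fence are potential outliers.
outliers = sorted(x for x in doc if x < lower_fence or x > upper_fence)
print(outliers)  # [21, 21.4, 21.82, 22.62]
```

No observation falls below the lower fence. The four largest values lie above the upper fence and are potential outliers.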
62
The Five Number Summary of Position
It is often convenient to summarize the quartile
information, the smallest and largest values in
the set as a 5-tuple (smallest, Q1, median, Q3,
largest).
Example: Find the five number summary for the
set -2,-1,0,1,5,6,6,8,10,11,12.
The smallest number is -2, the largest number is
12, Q2 = 6, Q1 = 0 and Q3 = 10. Hence, the
5-tuple is (-2,0,6,10,12).
Section 3.5
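The five number summary above can be reproduced with statistics.quantiles. Its "exclusive" method places quartiles at positions based on n + 1 and interpolates linearly between neighbors, which is an assumption about the convention: it agrees with the values in this example, but for other data sets it can differ slightly from a rule that averages neighboring values. The helper name five_number_summary is hypothetical.

```python
from statistics import quantiles

data = [-2, -1, 0, 1, 5, 6, 6, 8, 10, 11, 12]

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary(data)
print(summary)  # -2, 0, 6, 10, 12 (quartiles are returned as floats)
```

These five numbers are what a box-whisker plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers extend to the smallest and largest values.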
63
Box-whisker Plot of the Five Number Summary
(smallest, Q1, median, Q3, largest)
64
Some Remarks

A box-whisker plot is a very compact way of
summarizing the spread of the distribution.
It does not give the shape of the distribution
and hence, a histogram and a box-whisker plot
often go together.
A box-whisker plot is a convenient way to compare
two sets of data.

65
Comparing Two Sets
Which set has the larger mean? Which set has the
larger median? Which set has the largest
member? Which has the larger standard deviation?
66
Can Descriptive Summaries be Misleading?

Example: Suppose a sample of Vanderbilt students
is asked to estimate how many miles they drove
during the month of August. After receiving the
sample from this population of Vanderbilt
students, we compute the following statistics:
number in sample = 954
smallest value = 0
largest value = 25,000
mean = 2,072.6
median = 1,903
standard deviation = 1,662.9
IQR = 1,908
Is it reasonable to say that the average Vanderbilt
student drove approximately 2,073 miles, with a
median of 1,903, during the month of August?

67
Actual Data Set
0, 1, 5, 9, 13, ..., 2997, 3000, 3010, 3020, 3030, ..., 5000, 24000, 25000
954 data points
25,000 and 24,000 are outliers.
With the outliers removed: mean = 2,025.5,
median = 1,899, SD = 1,307.5
68
Home Ownership in America