Title: Chapter 3 Numerically Summarizing Data
1Chapter 3Numerically Summarizing Data
- 3.1
- Measures of Central Tendency
2The following chart gives a summary of some
background information on 5 students at Joliet
Junior College (JJC). Name Age Gender
Number of Semesters Overall GPA
Completed At
JJC Jennifer 21 Female 1 3.5 Amy
19 Female 2 2.75 Brian 18 Male
4 3.25 Mark 18 Male 2 3.0 Jim
24 Male 3 4.0 Which of the above data
would be qualitative? AnswerGender
3 A parameter is a descriptive measure of a
population.
A statistic is a descriptive measure of a
sample.
A statistic is an unbiased estimator of a
parameter if it does not consistently over- or
underestimate the parameter.
4 The arithmetic mean of a variable is
computed by determining the sum of all the values
of the variable in the data set divided by the
number of observations.
5The population arithmetic mean, is computed using
all the individuals in a population. The
population mean is a parameter.
The population arithmetic mean is denoted by
6(No Transcript)
7The sample arithmetic mean, is computed using
sample data. The sample mean is a statistic that
is an unbiased estimator of the population mean.
The sample arithmetic mean is denoted by
8(No Transcript)
9EXAMPLE Computing a Population Mean and a Sample
Mean
The following chart gives a summary of some
background information on a Calculus class.
Name Age Gender Overall GPA
Jennifer 21 Female
3.5 Amy 19 Female 2.75 Brian 25 Male
3.95 Jane 19 Female 2.75 Mark
18 Male 3.0 Julie 19 Female 3.85 Jim
24 Male 4.0 Ted 25 Male
3.7 Michel 19 Female
3.75 Amanda 19 Female 3.65 Linda
19 Female 4.0
10 Compute the Arithmetic Mean
- Treat the students in this class as a population.
Compute the population mean of the GPA. - Then take a simple random sample of n 5
students. Compute the sample mean of the GPA.
Obtain a second simple random sample of n 5
students. Again compute the sample mean of the
GPA.
11The population(size of 11) mean
- 3.52.75 3.952.75 3.0 3.85 4.0
- 3.7 3.75 3.65 4.034.9
12The median of a variable is the value that lies
in the middle of the data when arranged in
ascending order. That is, half the data is below
the median and half the data is above the median.
We use M to represent the median.
13(No Transcript)
14EXAMPLE Computing the Median of Data
Find the population median of the total GPA from
the earlier example.
2.75 2.75 3.0 3.5 3.95 3.65 3.7 3.75
3.85 4.0 4.0 1 2 3 4
5 6 7 8 9 10 11
15The mode of a variable is the most frequent
observation of the variable that occurs in the
data set. If there is no observation that occurs
with the most frequency, we say the data has no
mode.
16EXAMPLE Finding the Mode of a Data Set
The data on the next slide represent the Vice
Presidents of the United States and their state
of birth. Find the mode.
17(No Transcript)
18(No Transcript)
19The mode is New York.
20The arithmetic mean is sensitive to extreme (very
large or small) values in the data set, while the
median is not. We say the median is resistant to
extreme values, but the arithmetic mean is not.
21When data sets have unusually large or small
values relative to the entire set of data or when
the distribution of the data is skewed, the
median is the preferred measure of central
tendency over the arithmetic mean because it is
more representative of the typical observation.
22(No Transcript)
23(No Transcript)
24(No Transcript)
25(No Transcript)
26EXAMPLE Identifying the Shape of the
Distribution Based on the Mean and Median
The following data represent the asking price of
homes for sale in Lincoln, NE.
Source http//www.homeseekers.com
27Find the mean and median. Use the mean and
median to identify the shape of the distribution.
Verify your result by drawing a histogram of the
data.
28Find the mean and median. Use the mean and
median to identify the shape of the distribution.
Verify your result by drawing a histogram of the
data.
Using MINITAB/Excel/Spss, we find that the mean
asking price is 143,509 and the median asking
price is 131,825. Therefore, we would
conjecture that the distribution is skewed right.
29(No Transcript)
30(No Transcript)
31- 3.2
- Measures of Dispersion
32To order food at a McDonalds Restaurant, one
must choose from multiple lines, while at Wendys
Restaurant, one enters a single line. The
following data represent the wait time (in
minutes) in line for a simple random sample of 30
customers at each restaurant during the lunch
hour. For each sample, answer the following (a)
What was the mean wait time? (b) Draw a histogram
of each restaurants wait time. (c ) Which
restaurants wait time appears more dispersed?
Which line would you prefer to wait in? Why?
33Wait Time at Wendys
1.50 0.79 1.01 1.66 0.94 0.67 2.53 1.20 1.46 0.89
0.95 0.90 1.88 2.94 1.40 1.33 1.20 0.84 3.99 1.90
1.00 1.54 0.99 0.35 0.90 1.23 0.92 1.09 1.72 2.00
Wait Time at McDonalds
3.50 0.00 0.38 0.43 1.82 3.04 0.00 0.26 0.14 0.60
2.33 2.54 1.97 0.71 2.22 4.54 0.80 0.50 0.00 0.28
0.44 1.38 0.92 1.17 3.08 2.75 0.36 3.10 2.19 0.23
34The mean wait time in each line is 1.39 minutes.
35(No Transcript)
36The range, R, of a variable is the difference
between the largest data value and the smallest
data values. That is Range R Largest Data
Value Smallest Data Value
37EXAMPLE Finding the Range of a Set of Data
Find the range of the student GPA collected from
Section 3.1
38The population variance of a variable is the sum
of squared deviations about the population mean
divided by the number of observations in the
population, N.
39The population variance is symbolically
represented by lower case Greek sigma squared.
Note When using the above formula, do not round
until the last computation. Use as many decimals
as allowed by your calculator in order to avoid
round off errors.
40EXAMPLE Computing a Population Variance
Compute the population variance of the population
data collected in Section 3.1.
41The sample variance is computed by determining
the sum of squared deviations about the sample
mean and then dividing this result by n 1.
42Note Whenever a statistic consistently
overestimates or underestimates a parameter, it
is called biased. To obtain an unbiased estimate
of the population variance, we divide the sum of
the squared deviations about the mean by n - 1.
43EXAMPLE Computing a Sample Variance
Compute the sample variance using the sample data
from Section 3.1
44The population standard deviation is denoted by
It is obtained by taking the square root of the
population variance, so that
45EXAMPLE Computing a Population Standard
Deviation and Sample Standard Deviation
Compute the population and sample standard
deviation for the data obtained in Section 3.1
46EXAMPLE Comparing Standard Deviations
Determine the standard deviation waiting time for
Wendys and McDonalds. Which is larger? Why?
47EXAMPLE Comparing Standard Deviations
Determine the standard deviation waiting time for
Wendys and McDonalds. Which is larger? Why?
Sample standard deviation for Wendys 0.738
minutes Sample standard deviation for McDonalds
1.265 minutes
48(No Transcript)
49(No Transcript)
50EXAMPLE Using the Empirical Rule
The following data represent the serum HDL
cholesterol of the 54 female patients of a family
doctor.
41 48 43 38 35 37 44 44 44 62 75 77 58 82 39 85 55
54 67 69 69 70 65 72 74 74 74 60 60 60 61 62 63 6
4 64 64 54 54 55 56 56 56 57 58 59 45 47 47 48 48
50 52 52 53
51(a) Compute the population mean and standard
deviation. (b) Draw a histogram to verify the
data is bell-shaped. (c) Determine the percentage
of patients that have serum HDL within 3 standard
deviations of the mean according to the Empirical
Rule. (d) Determine the percentage of patients
that have serum HDL between 34 and 80.8 according
to the Empirical Rule. (e) Determine the actual
percentage of patients that have serum HDL
between 34 and 80.8.
52(a) Using a TI83 plus graphing calculator, we find
(b)
53(c) According to the Empirical Rule,
approximately 99.7 of the patients will have
serum HDL cholesterol levels within 3 standard
deviations of the mean. That is, approximately
99.7 of the patients will have serum HDL
cholesterol levels greater than or equal to 57.4
- 3(11.7) 22.3 and less than or equal to 57.4
3(11.7) 92.5.
54(d) Because 33.8 is 2 standard deviations below
the mean (57.4 - 2(11.7) 34) and 81 is 2
standard deviations above the mean (57.4
2(11.7) 80.8), the Empirical Rule states that
approximately 95 of the data will lie between 34
and 80.8.
(e) There are no observations below 34. There
are 2 observations greater than 80.8. Therefore,
52/54 96.3 of the data lie between 34 and 80.8.
55(No Transcript)
56EXAMPLE Using Chebyshevs Theorem
Using the data from the previous example, use
Chebyshevs Theorem to (a) determine the
percentage of patients that have serum HDL within
3 standard deviations of the mean. (b) determine
the percentage of patients that have serum HDL
between 34 and 80.8.
57Answer (a) (1-1/9)10088.9 (b) 57.4-3423.4
80.8-57.423.4 23.4/11.72 two standard
deviations, so the percentage of patients that
have serum HDL between two stand deviations is at
least (1-1/4)10075