Title: Describing Distributions with Numbers
1Chapter 12
- Describing Distributions with Numbers
2Today: Math and Cookies
- Pick up one cookie and a handout
- DO NOT EAT IT (yet). You may eat it later, once we
have collected cookie data.
- No clickers today
3Counting rule
- Anything brown counts for a chip.
- Carefully count front and back surfaces.
4Review
- Categorical variables
- pie charts and bar graphs
- Quantitative variables
- stemplots and histograms
- Good and bad graphs
- Shape
- symmetric, skewed
5Quick math overview
- These expressions are algebraically equivalent
6Examples
7Turning Data Into Information
- Center of the data
- mean
- median
- mode
- Spread of the data (variability)
- variance
- standard deviation
- range
- interquartile range
8Centers of Data
- Average: a single data value that represents all
of the data
- mean (arithmetic average)
- median
- mode
9Mean ($\bar{x}$)
- Traditional measure of center
- Sum the values and divide by the number of values
10Sample mean
- Grades: 68, 79, 60, 72, 77, 76, 69, 70, 60, 79.
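The "sum the values and divide by the number of values" rule can be checked on the grades above. This is a minimal sketch; the variable names are illustrative and not from the slides.

```python
# Sample mean: sum the values and divide by the number of values.
grades = [68, 79, 60, 72, 77, 76, 69, 70, 60, 79]

total = sum(grades)        # 710
n = len(grades)            # 10
sample_mean = total / n    # 71.0

print(f"sum = {total}, n = {n}, mean = {sample_mean}")
```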
11Median (M)
- A resistant measure of the data's center
- Median: the center value of the ordered (ranked)
data
- If n is odd, the median is the middle ordered
value
- If n is even, the median is the average of the
two middle ordered values
- The median is at the (n+1)/2-th position in the ordered set
12Median
- Example 1 data: 2, 4, 6
- Median (M) = 4
- Example 2 data: 2, 4, 6, 8
- Median = 5 (avg. of 4 and 6)
- Example 3 data: 6, 2, 4
- Median ≠ 2
- (order the values: 2, 4, 6, so Median = 4)
13Sample median
- Grades: 68, 79, 60, 72, 77, 76, 69, 70, 60, 79.
- Rank the data
- 60, 60, 68, 69, 70, 72, 76, 77, 79, 79
- Find the position
- (1/2)(n+1) = 11/2 = 5.5, so the median is at the 5.5th position
- Locate the median
- M = (70+72)/2 = 71
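The rank, find position, locate steps can be sketched in code. The helper name `median_by_position` is hypothetical, and the sketch assumes a non-empty list of numbers.

```python
import math

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(values)                   # rank the data first
    n = len(ranked)
    position = (n + 1) / 2                    # 1-based; ends in .5 when n is even
    low = ranked[math.floor(position) - 1]    # value at or just below the position
    high = ranked[math.ceil(position) - 1]    # value at or just above the position
    return (low + high) / 2                   # equal to the middle value when n is odd

grades = [68, 79, 60, 72, 77, 76, 69, 70, 60, 79]
print(median_by_position(grades))      # 71.0
print(median_by_position([6, 2, 4]))   # 4.0
```

When n is odd, the position is a whole number, so `low` and `high` are the same value and the average returns it unchanged.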
14- Example
- minutes waiting for the PRT (n = 8)
- x: 5, 11, 9, 15, 33, 3, 7, 12
- Median: RANK DATA FIRST! 3, 5, 7, 9, 11, 12, 15, 33
- The median is at the (1/2)(n+1)th position: (8+1)/2 = 4.5
- The 4.5th position is half-way between 9 and 11: (9+11)/2 = 10
- Median = 10
15Comparing the Mean and Median
- The mean and median of data from a symmetric
distribution should be close together. The
actual (true) mean and median of a symmetric
distribution are exactly the same.
- In a skewed distribution, the mean is farther out
in the long tail than is the median; the mean is
pulled in the direction of the possible
outlier(s).
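As an illustration of the mean being pulled toward the long tail, the PRT waiting times from the earlier example (which include one large value, 33) can be summarized with Python's standard `statistics` module. Dropping the 33 here is only to show sensitivity; it is not a recommendation to discard the value.

```python
from statistics import mean, median

waits = [5, 11, 9, 15, 33, 3, 7, 12]    # PRT waiting times in minutes

print(mean(waits))      # 11.875
print(median(waits))    # 10.0

# Recompute without the largest value to see how far each measure moves.
without_33 = [w for w in waits if w != 33]
mean_shift = mean(waits) - mean(without_33)        # about 3 minutes
median_shift = median(waits) - median(without_33)  # 1 minute
print(mean_shift, median_shift)
```

The mean sits above the median and moves roughly three times as far when the extreme value is removed.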
16Mean vs. Median
- Which should we use?
- Symmetric or approximately symmetric: use the mean
- Significantly skewed: use the median
- The mean is affected by outliers (extreme values)
18Outliers?
- If it is a mistake and is documented, we can
eliminate it
- If it is not a mistake, do not eliminate it
- A statistic is robust if it is not led too far
astray by a few outliers. Means (and standard
deviations) are not robust.
19Mode
- Observed value that occurs with the greatest
frequency
- Note: if there is no mode, write "none", not 0
- If there are two modes, the data are bimodal
20Sample mode
- Grades: 60, 60, 68, 69, 70, 72, 76, 77, 79, 79
- There are two modes (60 and 79), so this data set is bimodal
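A small sketch of finding the mode(s) by counting frequencies. The helper name `modes` is hypothetical, and treating "no value repeats" as "no mode" is an assumption about the convention intended on the slide. The input is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:               # assumption: no repeated value means "none"
        return []
    return sorted(v for v, c in counts.items() if c == top)

grades = [60, 60, 68, 69, 70, 72, 76, 77, 79, 79]
print(modes(grades))       # [60, 79]
print(modes([2, 4, 6]))    # []
```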
21Measures of Dispersion
- Spread: a general term referring to how spread
out or variable a set of numbers is.
- Very large spread
- 0, 100, 9999, 100000
- No spread
- 12, 12, 12, 12, 12
22Spread or Variability
- If all values are the same, then they all equal
the mean. There is no spread.
- Variability exists when some values are different
from (above or below) the mean.
- We will discuss the following measures of spread:
range, interquartile range, variance, standard
deviation.
23Range
- One way to measure spread is to give the smallest
(minimum) and largest (maximum) values in the
data set
- Range = max - min
- (the values range from min to max)
- The range is strongly affected by outliers, and
is rarely used
24Sample range
- Range = highest data value - lowest data value
- Grades: 60, 60, 68, 69, 70, 72, 76, 77, 79, 79
- Range = max - min = 79 - 60 = 19
25Quartiles
- Three numbers that divide the ordered data into
four equal-sized groups.
- Q1 has 25% of the data below it.
- Q2 has 50% of the data below it. (Median)
- Q3 has 75% of the data below it.
26Quartiles: Uniform Histogram
27Obtaining the Quartiles
- Order the data.
- For Q2, just find the median.
- For Q1, look at the lower half of the data
values, those to the left of the median; find
the median of this lower half.
- For Q3, look at the upper half of the data
values, those to the right of the median; find
the median of this upper half.
28Interquartile Range (IQR)
- Used to measure dispersion (spread) with the
median
- Sample IQR = Q3 - Q1
29- minutes waiting for the PRT (n = 8)
- 3, 5, 7, 9, 11, 12, 15, 33
- Recall: the median is half-way between 9 and 11
- M = 10
- Q1 is half-way between 5 and 7
- Q1 = 6
- Q3 is half-way between 12 and 15
- Q3 = 13.5
- IQR = Q3 - Q1 = 13.5 - 6 = 7.5
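The "median of each half" procedure can be sketched as follows. The helper name `quartiles` is hypothetical. When n is odd, this sketch leaves the middle value out of both halves, which is one reading of "to the left/right of the median". Statistical software often uses other quartile conventions, so its output may differ slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]             # values to the left of the median
    upper = ranked[half + n % 2:]     # values to the right (skips the middle value if n is odd)
    return median(lower), median(ranked), median(upper)

waits = [5, 11, 9, 15, 33, 3, 7, 12]  # PRT waiting times in minutes
q1, m, q3 = quartiles(waits)
iqr = q3 - q1
print(q1, m, q3, iqr)                 # 6.0 10.0 13.5 7.5
```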
30The five-number summary and boxplots
- Five-number summary
- Min
- Q1
- M
- Q3
- Max
31Boxplot(from Five-Number Summary)
- Central box spans Q1 and Q3.
- A line in the box marks the median M.
- Lines extend from the box out to the minimum and
maximum.
32PRT example: five-number summary and boxplot
- Min = 3, Q1 = 6, M = 10, Q3 = 13.5, Max = 33
- [Figure: boxplot drawn on an axis running from 0 to 33 minutes]
33OUTLIER BOX PLOTS: What's the fastest you have
ever driven a car? ____ mph.
- Males (87 students): 110, 95, 120, 55, 150
- Females (102 students): 89, 80, 95, 30, 130
- Outliers: values more than 1.5(IQR) below Q1 or above
Q3
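The 1.5(IQR) rule can be tried on the PRT waiting times from the earlier example, using the quartiles found there (Q1 = 6, Q3 = 13.5). This is a sketch; the names `lower_fence` and `upper_fence` are illustrative and do not appear on the slides.

```python
waits = sorted([5, 11, 9, 15, 33, 3, 7, 12])  # PRT waiting times in minutes
q1, q3 = 6, 13.5                              # quartiles from the PRT example

iqr = q3 - q1                     # 7.5
lower_fence = q1 - 1.5 * iqr      # -5.25
upper_fence = q3 + 1.5 * iqr      # 24.75

# A value is flagged when it falls outside the two fences.
outliers = [w for w in waits if w < lower_fence or w > upper_fence]
print(lower_fence, upper_fence, outliers)     # -5.25 24.75 [33]
```

By this rule the 33-minute wait is flagged as an outlier, while no wait is short enough to fall below the lower fence.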
34Standard deviation?
The length of human pregnancies has a mean, or
average, of 266 days for the entire population of
women. It also has a standard deviation of 16
days. What do you think is meant by the term
standard deviation?
35Variance and Standard Deviation
- When variability exists, each data value has an
associated deviation from the mean
- What is a typical deviation from the mean?
(standard deviation)
- Small values of this typical deviation indicate
small spread in the data
- Large values of this typical deviation indicate
large spread in the data
36Variance
- Find the mean
- Find the deviation of each value from the mean
- Square the deviations
- Sum the squared deviations
- Divide the sum by n-1
- (gives typical squared deviation from mean)
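The five steps above can be followed one at a time in code. This sketch uses the ten grades from the sample mean example; the variable names are illustrative.

```python
grades = [68, 79, 60, 72, 77, 76, 69, 70, 60, 79]
n = len(grades)

mean = sum(grades) / n                     # step 1: find the mean (71.0)
deviations = [x - mean for x in grades]    # step 2: deviation of each value
squared = [d ** 2 for d in deviations]     # step 3: square the deviations
sum_sq = sum(squared)                      # step 4: sum them (446.0)
variance = sum_sq / (n - 1)                # step 5: divide by n - 1

print(mean, sum_sq, round(variance, 2))    # 71.0 446.0 49.56
```

The deviations themselves always sum to zero, which is why they are squared before being added.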
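37Variance Formula
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
38Standard Deviation Formula: typical deviation from
the mean
- standard deviation = square root of the
variance
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$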
39- Let's say I have two classes, class A and class
B, and I want to know how my students are doing.
I take a random sample of 10 grades from each
class. Suppose the average in both classes turns
out to be 71. We might infer that class A and
class B are very similar, right?
40- class A: 68, 79, 60, 72, 77, 76, 69, 70, 60, 79.
- class B: 99, 99, 98, 96, 97, 97, 30, 35, 20, 39.
- Something is going terribly wrong in class B!
Some students are doing exceptionally well and
some are failing. Using only the sample mean, I
would think the classes are performing about the
same.
41Calculating s
- class A: 68, 79, 60, 72, 77, 76, 69, 70, 60, 79.
44Sample SD for class A
- The sample standard deviation is s ≈ 7
45- class B: 99, 99, 98, 96, 97, 97, 30, 35, 20, 39
- s ≈ 35
- (5 times as much as class A!)
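The comparison of the two classes can be reproduced with Python's standard `statistics` module, whose `stdev` function divides by n - 1 (the sample standard deviation). This is a quick check, not part of the original slides.

```python
from statistics import mean, stdev

class_a = [68, 79, 60, 72, 77, 76, 69, 70, 60, 79]
class_b = [99, 99, 98, 96, 97, 97, 30, 35, 20, 39]

# Both classes have the same mean, so the mean alone cannot tell them apart.
print(mean(class_a), mean(class_b))    # 71 71

s_a = stdev(class_a)                   # about 7.04
s_b = stdev(class_b)                   # about 34.76
print(round(s_a, 2), round(s_b, 2), round(s_b / s_a, 1))
```

The ratio of the two standard deviations comes out just under 5, matching the "5 times as much" remark.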
46Choosing a Summary
- Outliers affect the values of the mean and
standard deviation.
- The five-number summary should be used to
describe center and spread for skewed
distributions, or when outliers are present.
- Use the mean and standard deviation for
reasonably symmetric distributions that are free
of outliers.
47Distribution of calories in popular candy bars
48Today's concepts
- Numerical Summaries
- Center (mean, median)
- Spread (variance, std. dev., range, IQR)
- Five-number summary and boxplots
- Choosing mean versus median
- Choosing standard deviation versus five-number
summary