Title: Describing distributions with numbers
Chapter 2
Describing distributions with numbers
Chapter Outline
- 1. Measuring center: the mean
- 2. Measuring center: the median
- 3. Comparing the mean and the median
- 4. Measuring spread: the quartiles
- 5. The five-number summary and boxplots
- 6. Measuring spread: the standard deviation
- 7. Choosing measures of center and spread
Measuring center: the mean
- Notation: $\bar{x}$ (read "x bar")
- It is simply the ordinary arithmetic average.
- Suppose that we have n observations (n is the data size,
the number of individuals).
- Observations are denoted as $x_1, x_2, x_3, \ldots, x_n$.
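Measuring center: the mean
- How to get $\bar{x}$? Add the observations and divide by n:
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
- Example 2.1 (P.33)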
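The calculation of the mean can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's example: the function name `mean` and the data values are assumptions chosen only to show the arithmetic.

```python
def mean(observations):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    n = len(observations)
    if n == 0:
        raise ValueError("the mean needs at least one observation")
    return sum(observations) / n


# Illustrative data (an assumption, not the values from Example 2.1).
data = [2, 4, 6, 8]
x_bar = mean(data)
print(x_bar)  # 5.0
```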
Measuring center: the median
- Notation: M
- The median M is the midpoint of a distribution: half
the observations are smaller than M and the
other half are larger than M.
Measuring center: the median
- How to find M?
- 1. Sort all observations in increasing order
(This step is important!)
- 2. If n is odd, the center observation in the ordered list is M. If n is
even, the average of the two center observations is M.
- Note that (n+1)/2 is the location of the median
in the ordered list, not the median value.
Measuring center: the median
- Examples
- Case 1. 11, 21, 13, 24, 15, 26, 17
- Case 2. 11, 21, 13, 24, 15, 26
- Example 2.2, 2.3 (P.35)
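The two cases above can be worked through with a short Python sketch. The function name `median` is an assumption for illustration. The code sorts first, then takes the center observation when n is odd, or the average of the two center observations when n is even.

```python
def median(observations):
    """Return the median M of a list of numbers."""
    ordered = sorted(observations)   # sorting first is essential
    n = len(ordered)
    if n == 0:
        raise ValueError("the median needs at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: location (n + 1) / 2 in the 1-based list is index n // 2 here.
        return ordered[mid]
    # Even n: average the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


case_1 = [11, 21, 13, 24, 15, 26, 17]   # n = 7 (odd)
case_2 = [11, 21, 13, 24, 15, 26]       # n = 6 (even)
print(median(case_1))  # 17
print(median(case_2))  # 18.0
```

Skipping the sort would give 24 for Case 1 (the fourth value as listed), which is why the sorting step matters.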
Mean vs. Median
- The median is more resistant than the mean: a few
extreme observations change it much less.
- The mean and median of a symmetric distribution
are close together. If the distribution is
exactly symmetric, the mean and median are
exactly the same. In a skewed distribution, the
mean is farther out in the long tail than is the
median.
- Example
- 1, 2, 3, 4, 5, 6, 10000
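A quick way to see resistance is to compare the example data with a version that has no extreme value. The sketch below uses Python's standard `statistics` module. The second data set, with 10000 replaced by 7, is an assumption added only for comparison.

```python
import statistics

with_outlier = [1, 2, 3, 4, 5, 6, 10000]   # data from the example
without_outlier = [1, 2, 3, 4, 5, 6, 7]    # assumed variant: 10000 replaced by 7

for label, data in (("without outlier", without_outlier),
                    ("with outlier", with_outlier)):
    # The median stays at 4 in both cases; the mean is pulled toward 10000.
    print(label, "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```

The single extreme observation moves the mean from 4 to about 1431.57, while the median does not move at all.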
Inference
- Strongly skewed distributions are usually reported with
the median rather than the mean.
Measuring spread: the quartiles
- The quartiles mark out the middle half of the
distribution.
Calculating the quartiles
- Step 1.
- Arrange the observations in increasing order
and locate the median M in the ordered list of
observations.
- Step 2.
- The first quartile Q1 is the median of the
observations whose position in the ordered list
is to the left of the location of the overall
median.
- Step 3.
- The third quartile Q3 is the median of the
observations whose position in the ordered list
is to the right of the location of the overall
median.
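The three steps can be sketched in Python as follows. The function names and the data are assumptions for illustration. Note that statistical software often uses slightly different quartile rules, so results from a package may not match this hand method exactly.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(observations)            # Step 1: sort, then locate M
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    # When n is odd, the median itself sits at index `half` and is left out
    # of both halves; when n is even, the list splits cleanly in two.
    lower = ordered[:half]                    # Step 2: left of the median
    upper = ordered[half + (n % 2):]          # Step 3: right of the median
    return (median_of_sorted(lower),
            median_of_sorted(ordered),
            median_of_sorted(upper))


# Illustrative data (an assumption, not taken from Example 2.4 or 2.5).
print(quartiles([11, 21, 13, 24, 15, 26, 17]))  # (13, 17, 24)
print(quartiles([11, 21, 13, 24, 15, 26]))      # (13, 18.0, 24)
```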
Measuring spread: the quartiles
- Example 2.4 (P. 37)
- Example 2.5 (P. 38)
- Note
- (1) It is important to sort the data before
finding the quartiles!
- (2) Quartiles are resistant.
The five-number summary and boxplots
- The five-number summary
- Minimum, Q1, M, Q3, Maximum.
- A boxplot is a graph of the five-number
summary.
- Boxplots are most useful for side-by-side
comparison of several distributions.
Boxplot
- 1. A boxplot is a graph of the five-number
summary.
- 2. A central box spans the quartiles Q1 and Q3.
- 3. A line in the box marks the median M.
- 4. Lines extend from the box out to the minimum
and maximum.
- 5. Range = maximum - minimum
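The numbers a boxplot draws can be computed directly. The sketch below is one possible implementation, assuming the median-of-halves rule for quartiles. The function name `five_number_summary` and the data are hypothetical.

```python
def five_number_summary(observations):
    """Return (minimum, Q1, M, Q3, maximum) for a list of numbers."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")

    def med(values):
        # Median of an already sorted list.
        k = len(values)
        mid = k // 2
        return values[mid] if k % 2 == 1 else (values[mid - 1] + values[mid]) / 2

    half = n // 2
    q1 = med(ordered[:half])               # median of the lower half
    q3 = med(ordered[half + (n % 2):])     # median of the upper half
    return ordered[0], q1, med(ordered), q3, ordered[-1]


# Illustrative data (an assumption, not from the textbook figure).
data = [11, 21, 13, 24, 15, 26, 17]
minimum, q1, m, q3, maximum = five_number_summary(data)
print((minimum, q1, m, q3, maximum))   # (11, 13, 17, 24, 26)
print(maximum - minimum)               # range: 15
```

The box would span 13 to 24 with a line at 17, and the lines would extend to 11 and 26.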
The five-number summary and boxplots
- Figure 2.2 (P.39): side-by-side boxplots comparing
the distributions of earnings for two levels of
education.
Inference
- A boxplot also gives an indication of the symmetry
or skewness of a distribution.
- In a symmetric distribution, Q1 and Q3
are equally distant from the median,
but in a right-skewed distribution the
third quartile is farther above the
median than the first quartile is below it.
Measuring spread: the standard deviation
- The standard deviation says how far the observations are from their
mean.
- The variance $s^2$ of a set of observations is an
average of the squares of the deviations of the
observations from their mean.
- Notation: $s^2$ for variance and s for standard
deviation
Why (n-1)?
- The sum of the deviations $x_i - \bar{x}$
always equals 0, so knowing (n-1) of
them determines the last one.
- Only (n-1) of the squared deviations can vary freely,
so we average by
dividing the total by (n-1).
- The number (n-1) is called the degrees of
freedom of the variance or standard deviation.
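The claim that the deviations always sum to 0 follows from the definition of the mean. A short derivation, using $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (so that $\sum_{i=1}^{n} x_i = n\bar{x}$):

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})
  &= \sum_{i=1}^{n} x_i - n\bar{x} \\
  &= n\bar{x} - n\bar{x} \\
  &= 0
\end{align*}
```

For instance, with the observations 2, 4, 6 the mean is 4 and the deviations are -2, 0, 2. Once the first two are known, the third must be 2.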
Measuring spread: the standard deviation
- To find the variance and the standard deviation:
- 1. Find the mean of the data set.
- 2. Subtract the mean from each observation (we call
each result a deviation).
- 3. Square each deviation.
- 4. Sum all the squares.
- 5. Divide the sum of squares by n-1, where n is
the number of observations. This gives the
variance.
- 6. The standard deviation is the nonnegative square
root of the variance.
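The six steps translate almost line for line into Python. This is a minimal sketch with assumed function names and illustrative data. The standard library's `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math


def variance(observations):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("the variance needs at least two observations")
    x_bar = sum(observations) / n                     # 1. find the mean
    deviations = [x - x_bar for x in observations]    # 2. subtract the mean
    squares = [d ** 2 for d in deviations]            # 3. square each deviation
    total = sum(squares)                              # 4. sum the squares
    return total / (n - 1)                            # 5. divide by n - 1


def standard_deviation(observations):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(variance(observations))          # 6. take the square root


# Illustrative data (an assumption): mean 5, squared deviations 16, 4, 0, 4, 16.
data = [1, 3, 5, 7, 9]
s2 = variance(data)
s = standard_deviation(data)
print(s2)  # 10.0
print(s)
```

Here the squared deviations sum to 40, and dividing by n - 1 = 4 gives a variance of 10, so s is the square root of 10, about 3.16.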
Properties of $s^2$ and s
- s measures spread about the mean and should be
used only when the mean is chosen as the measure
of center.
- $s \ge 0$, and $s = 0$ only when all observations
have the same value.
- s is not resistant.
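Two of these properties are easy to check numerically. The sketch below uses `statistics.stdev` from the Python standard library, which divides by n - 1. All three data sets are assumptions chosen for illustration.

```python
import statistics

identical = [5, 5, 5, 5]                 # no spread at all
baseline = [1, 2, 3, 4, 5, 6, 7]         # no extreme values
with_outlier = [1, 2, 3, 4, 5, 6, 10000]  # one extreme value

s_identical = statistics.stdev(identical)
s_baseline = statistics.stdev(baseline)
s_outlier = statistics.stdev(with_outlier)

print(s_identical)   # 0.0: all observations have the same value
print(s_baseline)    # a little over 2
print(s_outlier)     # in the thousands: one outlier dominates s
```

Changing a single observation from 7 to 10000 inflates s by a factor of more than a thousand, which is what "not resistant" means in practice.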
Choosing measures of center and spread
- With a skewed distribution or a distribution
with extreme outliers, the five-number summary is
better.
- With a symmetric distribution (without outliers), the
mean and standard deviation are better.