Title: Ch5: Describing Distributions Numerically Finding the Center: The Median
1 Ch5: Describing Distributions Numerically
Finding the Center: The Median
- When we think of a typical value, we usually look for the center of the distribution.
- For a unimodal, symmetric distribution, it's easy to find the center: it's just the center of symmetry.
- As a measure of center, the midrange (the average of the minimum and maximum values) is very sensitive to skewed distributions and outliers.
- The median is a more reasonable choice for center than the midrange.
2 Finding the Center: The Median (cont.)
- The median is the value with exactly half the data values below it and half above it.
- It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
- For an even number of data points, average the two middle ones: median(2, 4, 5, 6, 6, 7) = 5.5
- It has the same units as the data.
Figure: Healthy Life Expectancy (HALE) Measure for all UN Members, 2001
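
A minimal Python sketch of the median versus the midrange, using the data set from the slide (2, 4, 5, 6, 6, 7). The `midrange` helper and the extra value 70 are assumptions added only for illustration.

```python
from statistics import median

def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2

data = [2, 4, 5, 6, 6, 7]        # data set from the slide
with_outlier = data + [70]       # hypothetical extreme value, added for illustration

print(median(data), midrange(data))                  # 5.5 4.5
print(median(with_outlier), midrange(with_outlier))  # 6 36.0
```

One extreme value moves the median from 5.5 to 6 but drags the midrange from 4.5 to 36, which is why the median is the more reasonable choice.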
3 Spread: Home on the Range
- Always report a measure of spread along with a measure of center when describing a distribution numerically.
- The range of the data is the difference between the maximum and minimum values:
- Range = max - min
- A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.
4 Spread: The Interquartile Range
- The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data.
- To find the IQR, we first need to find the quartiles, which divide the data into four equal sections.
- The lower quartile is the median of the half of the data below the median.
- The upper quartile is the median of the half of the data above the median.
- If the data has an even number of points, this division is straightforward. If it is odd, then the text tells you to count the median in both halves of the data.
- The difference between the quartiles is the IQR, so
- IQR = upper quartile - lower quartile
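
The quartile rule above can be sketched in Python as follows. The function name `quartiles` is an assumption for illustration; the rule it follows (count the median in both halves when the number of values is odd) is the one the slide attributes to the text.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the halves-of-the-data rule.

    When the number of values is odd, the median is counted in both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2        # size of each half; includes the median when n is odd
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(data), median(upper)

q1, med, q3 = quartiles([2, 4, 5, 6, 6, 7])   # data set from the median slide
iqr = q3 - q1
print(q1, med, q3, iqr)   # 4 5.5 6 2
```

For an odd-sized set such as 1 through 7, the halves are (1, 2, 3, 4) and (4, 5, 6, 7), so the same function returns quartiles of 2.5 and 5.5.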
5 Spread: The Interquartile Range (cont.)
- The lower and upper quartiles are the 25th and 75th percentiles of the data, so
- The IQR contains the middle 50% of the values of the distribution, as shown in Figure 5.3 from the text.
- 5-number summary for HALEs:
- max = 73.6
- Q3 = 62.65
- median = 57.7
- Q1 = 48.9
- min = 26.5
Figure: Healthy Life Expectancy Measure for all UN Members, 2001
6 The Five-Number Summary
- The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
- Example: The five-number summary for the ages at death for 66 rock concert goers who died from being crushed is
7 Making Boxplots
- A boxplot is a graphical display of the five-number summary, along with some additional information, such as outliers.
- Boxplots are particularly useful when comparing groups.
8 Constructing Boxplots
- Draw a single axis spanning the range of the data.
- You can draw boxplots vertically or horizontally, but this one is oriented vertically, so that is how the instructions are described.
- Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
9 Constructing Boxplots (cont.)
- Erect fences around the main part of the data.
- The upper fence is 1.5 IQRs above the upper quartile.
- The lower fence is 1.5 IQRs below the lower quartile.
- Note: the fences only help with constructing the boxplot and should not appear in the final display. (You can leave them in, if you want, but only as dotted lines.)
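
As a worked illustration of the fence rule, the sketch below plugs in the HALE five-number summary reported on the earlier slide. The variable names are assumptions; only the five numbers come from the slides.

```python
# Values from the HALE five-number summary on the earlier slide.
q1, q3 = 48.9, 62.65
minimum, maximum = 26.5, 73.6

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr     # 1.5 IQRs below the lower quartile
upper_fence = q3 + 1.5 * iqr     # 1.5 IQRs above the upper quartile

print(round(iqr, 3), round(lower_fence, 3), round(upper_fence, 3))
print("value below lower fence:", minimum < lower_fence)
print("value above upper fence:", maximum > upper_fence)
```

Here the IQR is 62.65 - 48.9 = 13.75, so the fences sit at about 28.3 and 83.3. The minimum (26.5) falls below the lower fence, so at least one value would be drawn as a low outlier, while the maximum (73.6) is inside the upper fence.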
10 Constructing Boxplots (cont.)
- Use the fences to grow whiskers.
- Draw lines from the ends of the box up and down to the most extreme data values found within the fences.
- (If you look at the original data for rock concert deaths, this would be 29 years for the upper whisker. 13 is the youngest death (and 13 > 9.5), so that's the lower whisker end.)
- If a data value falls outside one of the fences, we do not connect it with a whisker.
11 Constructing Boxplots (cont.)
- Now we add any outliers by displaying any data values beyond the fences with special symbols.
- Often, a different symbol is used for far outliers that are farther than 3 IQRs from the quartiles. (This stylistic differentiation is optional.)
- And we erase the fences (again, optional).
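
The whisker and outlier steps can be sketched as below. This is a rough illustration, not the text's own procedure written out: the function name `boxplot_parts` and the sample values are made up, and the quartiles use the halves rule described earlier in these slides.

```python
from statistics import median

def boxplot_parts(values):
    """Find whisker ends, outliers, and far outliers for a boxplot."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                       # median counted in both halves if n is odd
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Whiskers reach the most extreme values that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    beyond = [x for x in data if x < lower_fence or x > upper_fence]
    # Far outliers lie more than 3 IQRs from the quartiles.
    far = [x for x in beyond if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return {
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in beyond if x not in far],
        "far_outliers": far,
    }

sample = [1, 10, 11, 12, 13, 14, 15, 16, 17, 30, 60]   # hypothetical data
parts = boxplot_parts(sample)
print(parts)
```

For this made-up sample the quartiles are 11.5 and 16.5, the fences are 4 and 24, the whiskers end at 10 and 17, the values 1 and 30 are ordinary outliers, and 60 is a far outlier.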
12 Rock Concert Deaths: Making Boxplots (cont.)
- Compare the histogram and boxplot for Worldwide Rock Concert Deaths, 1999-2000.
- How does each display represent the distribution?
13 Comparing Groups With Boxplots
- The following set of boxplots compares the effectiveness of various travel coffee mugs.
- What does this graphical display tell you?
- Which coffee container would you recommend using?
- Did we really need to see all 4 histograms to reach this conclusion?
Figure: Temperature change for Brands of Coffee Containers
14 Summarizing Symmetric Distributions
- Medians do a good job of identifying the center of skewed distributions.
- When we have symmetric data, the mean is a good measure of center.
- We find the mean by adding up all of the data values and dividing by n (the number of data values we have).
15 Summarizing Symmetric Distributions (cont.)
- The distribution of pulse rates for 52 adults is generally symmetric, with a mean of 72.7 beats per minute (bpm) and a median of 73 bpm.
Figure: Pulse Rates of 52 Adults
16 Mean or Median?
Figure: Healthy Life Expectancy Measure for all UN Members, 2001
- Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.
17 Mean or Median? (cont.)
- In symmetric distributions, the mean and median are approximately the same in value, so either measure of center may be used.
- For significantly skewed data, though, it's better to report the median than the mean as a measure of center.
- Example: Does the HALE data show skew? If so, how?
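
One rough way to probe the HALE question without the full data is to compare distances within the five-number summary reported on the earlier slide. The variable names below are assumptions, and this check is only suggestive; the histogram remains the better guide to shape.

```python
# Five-number summary for the HALE data, from the earlier slide.
minimum, q1, med, q3, maximum = 26.5, 48.9, 57.7, 62.65, 73.6

lower_box = med - q1          # width of the lower half of the box
upper_box = q3 - med          # width of the upper half of the box
lower_tail = med - minimum    # distance from the median down to the minimum
upper_tail = maximum - med    # distance from the median up to the maximum

print(round(lower_box, 2), round(upper_box, 2))
print(round(lower_tail, 2), round(upper_tail, 2))
```

Both comparisons point the same way: the low side (8.8 and 31.2) is stretched out more than the high side (4.95 and 15.9), which suggests the distribution is skewed toward low values.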
18 What About Spread? The Standard Deviation
- A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.
- A deviation is the distance that a data value is from the mean.
- Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.
19 First We Find the Variance
- The variance, denoted s^2, is found by summing the squared deviations and dividing by n - 1.
- The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!
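
In symbols, the recipe can be written as follows. The notation is an assumption here: y stands for a data value, \bar{y} for the mean, and n for the number of data values.

```latex
\bar{y} = \frac{\sum y}{n},
\qquad
s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}
```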
20 Then We Take the Square Root
- The standard deviation, s (or sometimes SD), is just the square root of the variance and is measured in the same units as the original data.
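
A step-by-step Python sketch of the calculation, using the small data set from the median slide (2, 4, 5, 6, 6, 7) as a stand-in. The variable names are assumptions; the last two lines compare the hand calculation with Python's built-in `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
from math import sqrt, isclose
from statistics import mean, variance, stdev

data = [2, 4, 5, 6, 6, 7]                  # small data set from the median slide

y_bar = mean(data)                         # add up the values and divide by n
deviations = [y - y_bar for y in data]     # how far each value is from the mean
squared = [d ** 2 for d in deviations]     # square each deviation
s2 = sum(squared) / (len(data) - 1)        # variance: divide by n - 1
s = sqrt(s2)                               # standard deviation, back in the original units

print(y_bar, sum(deviations), s2, round(s, 3))
print(isclose(s2, variance(data)), isclose(s, stdev(data)))
```

The mean is 5, the deviations (-3, -1, 0, 1, 1, 2) add to zero, the squared deviations add to 16, so the variance is 16/5 = 3.2 and the standard deviation is about 1.79.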
21 Looking at Center and Spread, an Example
- As part of a Human Resources report, assume we've been given annual salaries ($K/yr) for 9 professors as follows.
- Describe the distribution.
- What would be an appropriate measure of center and spread?
22 Looking at Center and Spread, an Example
- Although the data is symmetric, we could still determine the median and the IQR. (First sort the data so it is ordered.)
23 Looking at Center and Spread, an Example
- First calculate the mean, then calculate the standard deviation.
- We can use Excel to work out the calculations step-by-step.
24 Thinking About Variation
- Since Statistics is about variation, spread is an important fundamental concept of Statistics.
- Measures of spread help us talk about what we don't know.
- When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small.
- When the data values are scattered far from the center, the IQR and standard deviation will be large.
25 Shape, Center, and Spread
- When describing a quantitative variable, always report the shape of its distribution, along with a center and a spread.
- If the shape is skewed, report the median and IQR.
- If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.
26 What About Outliers?
- If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.
- Note: The median and IQR are not as likely to be affected by the outliers as the mean and SD.
27 What Can Go Wrong?
- Don't forget to do a reality check: don't let technology do your thinking for you.
- First sort the values before finding the median and quartiles.
- Don't compute numerical summaries of a categorical variable.
- Watch out for multiple modes: multiple modes might indicate multiple groups in your data.
- Be aware of slightly different methods: different statistics packages and calculators may give you different answers for the same data.
- Beware of outliers.
- Make a picture, make a picture, make a picture.
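
To illustrate the point about slightly different methods, the sketch below computes quartiles for the data set from the median slide (2, 4, 5, 6, 6, 7) three ways: the halves rule described in these slides, and the two interpolation rules offered by `statistics.quantiles` in Python 3.8 and later. The variable names are assumptions.

```python
from statistics import median, quantiles

data = sorted([2, 4, 5, 6, 6, 7])
n = len(data)
half = (n + 1) // 2

# Rule from these slides: medians of the lower and upper halves of the data.
halves_q1, halves_q3 = median(data[:half]), median(data[n - half:])

# Two interpolation rules from Python's statistics module.
inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")

print("halves rule:", halves_q1, halves_q3)
print("inclusive:  ", inclusive[0], inclusive[2])
print("exclusive:  ", exclusive[0], exclusive[2])
```

All three agree on the median, but each gives a different lower quartile for this small data set, so it is worth knowing which rule your software uses before comparing answers.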
28 What Can Go Wrong? (cont.)
- Be careful when comparing groups that have very different spreads.
- Consider the first side-by-side boxplots of cotinine levels for 3 different types of subjects.
- The 2nd boxplots show the same data for log(cotinine) values.
- This example is an aside, as re-expressing data is not going to be tested in DS212.
29 What have we learned?
- We can now summarize distributions of quantitative variables numerically.
- The 5-number summary displays the min, Q1, median, Q3, and max.
- Measures of center include the mean and median.
- Measures of spread include the range, IQR, and standard deviation.
- We know which measures to use for symmetric distributions and skewed distributions.
- We can also display distributions with boxplots.
- While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
- Boxplots are useful for comparing groups.
30 Examples
- Based on Pr5: A clerk entering salary data into a spreadsheet accidentally put in an extra 0 on the boss's salary, listing it as $2,000,000/yr instead of $200,000/yr. Explain how this error will affect these summary statistics for the company payroll. (Note: Although the text doesn't say, you can assume this is a company with at least 5 employees and also that the boss earns the largest salary!)
- Measures of center: Median and Mean
- Measures of spread: Range, IQR, and Standard Deviation
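
A quick way to explore the question is to try it on a made-up payroll. In the sketch below, the six staff salaries are hypothetical; only the boss's two figures ($200,000 and $2,000,000) come from the exercise. The helper names `iqr` and `summarize` are also assumptions, and the IQR uses the halves rule from the earlier slides.

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR using medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    return median(data[n - half:]) - median(data[:half])

def summarize(values):
    return {
        "median": median(values),
        "mean": mean(values),
        "range": max(values) - min(values),
        "iqr": iqr(values),
        "sd": stdev(values),
    }

# Hypothetical staff salaries in dollars per year.
staff = [30_000, 35_000, 40_000, 45_000, 50_000, 60_000]
correct = summarize(staff + [200_000])     # boss's salary entered correctly
typo = summarize(staff + [2_000_000])      # boss's salary with the extra 0

for name in correct:
    print(f"{name:>6}: {correct[name]:>12,.0f} -> {typo[name]:>12,.0f}")
```

For this made-up payroll the median and IQR are unchanged by the typo, while the mean, range, and standard deviation all grow substantially.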
31 Examples
- Pr41: The data from the CD-ROM (also available on my website) shows 8th graders' average math test scores for the participating 66 nations.
- Notes: Copy the data from the CD-ROM, rather than typing all the numbers in by hand! (Saves time and is less mistake-prone!)
- Excel's built-in quartile functions may not match the text's method for Q1 and Q3, but you can get those by sorting the data (use Excel's Sort!) and adding a column for rank.
- You can use Excel's built-in functions AVERAGE and STDEV.
- For more guidance, see the Problem Solution I've provided. There are 2 files, one for the Excel work and also a Word document.
32 Examples
- Pr39: In a USA Today advertisement (7/2001), Net2Phone listed long distance rates to 24 of the 250 countries they serve. (Hint: use the Excel data set given!)
33 Examples
- Pr39 (Net2Phone Example), continued
- Display these rates.
- Find the mean and median. Which is the more appropriate measure of center?
- Find the IQR and standard deviation. Which is the more appropriate measure of spread?
- Are there any outliers? Why?
- Write a description of the rates.
- Can you conclude anything about Net2Phone's rates to all the countries they serve?
34 Examples
- Pr39 (Net2Phone Example), answers
- A boxplot is a good way to display these rates.
- What are the outliers?
- Do they have a huge effect?
- Do you feel comfortable making generalizations about Net2Phone's service from 10% of the data, when the company selected that data?
- (See solution set for full details!)