Ch5: Describing Distributions Numerically Finding the Center: The Median - PowerPoint PPT Presentation

Slides: 35
Provided by: Addison6
Transcript and Presenter's Notes

1
Ch5: Describing Distributions Numerically
Finding the Center: The Median
  • When we think of a typical value, we usually look
    for the center of the distribution.
  • For a unimodal, symmetric distribution, it's easy
    to find the center: it's just the center of
    symmetry.
  • As a measure of center, the midrange (the average
    of the minimum and maximum values) is very
    sensitive to skewed distributions and outliers.
  • The median is a more reasonable choice for center
    than the midrange.
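
To see why the midrange is fragile, here is a minimal Python sketch. The data values are made up for illustration, `midrange` is a helper defined only for this example, and `median` comes from Python's standard `statistics` module.

```python
from statistics import median

def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2

# Hypothetical data: the second list is the first plus one extreme value.
data = [2, 4, 5, 6, 6, 7]
with_outlier = data + [100]

print(midrange(data), median(data))                  # 4.5 5.5
print(midrange(with_outlier), median(with_outlier))  # 51.0 6
```

One added value moves the midrange from 4.5 to 51.0, while the median barely shifts.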

2
Finding the Center: The Median (cont.)
  • The median is the value with exactly half the
    data values below it and half above it.
  • It is the middle data value (once the data values
    have been ordered) that divides the histogram
    into two equal areas.
  • For an even number of data points, average the
    two middle ones: median(2,4,5,6,6,7) = 5.5
  • It has the same units as the data.
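
The rule above (sort, then take the middle value or the average of the two middle values) can be sketched in Python as follows. The function name `median` is chosen for this illustration; the first call reuses the slide's own example.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 5, 6, 6, 7]))  # 5.5
print(median([7, 2, 5]))           # 5
```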

Healthy Life Expectancy (HALE) Measure for all UN
Members, 2001
3
Spread: Home on the Range
  • Always report a measure of spread along with a
    measure of center when describing a distribution
    numerically.
  • The range of the data is the difference between
    the maximum and minimum values:
  • Range = max - min
  • A disadvantage of the range is that a single
    extreme value can make it very large and, thus,
    not representative of the data overall.
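
A short Python sketch of the range and its weakness. The values are hypothetical and `data_range` is a name chosen for this example.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

values = [13, 15, 16, 17, 19, 21]     # hypothetical data
print(data_range(values))             # 8
print(data_range(values + [90]))      # 77, driven entirely by the one extreme value
```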

4
Spread: The Interquartile Range
  • The interquartile range (IQR) lets us ignore
    extreme data values and concentrate on the middle
    of the data.
  • To find the IQR, we first need to find the
    Quartiles, which divide the data into four equal
    sections.
  • The lower quartile is the median of the half of
    the data below the median.
  • The upper quartile is the median of the half of
    the data above the median.
  • If the data has an even number of points, this
    division is straightforward. If it is odd, then
    the text tells you to count the median in both
    halves of the data.
  • The difference between the quartiles is the IQR,
    so
  • IQR = upper quartile - lower quartile
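
The quartile rule described above can be sketched in Python. This follows the slide's convention (the median is counted in both halves when the number of points is odd); other software may use a different convention and give slightly different quartiles. The function names and the second data set are assumptions made for illustration.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data.

    When n is odd, the overall median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # size of each half; includes the median if n is odd
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(upper)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

even = [2, 4, 5, 6, 6, 7]          # halves: [2, 4, 5] and [6, 6, 7]
odd = [1, 3, 5, 7, 9]              # halves: [1, 3, 5] and [5, 7, 9]
print(quartiles(even), iqr(even))  # (4, 6) 2
print(quartiles(odd), iqr(odd))    # (3, 7) 4
```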

5
Spread: The Interquartile Range (cont.)
  • The lower and upper quartiles are the 25th and
    75th percentiles of the data, so the IQR contains
    the middle 50% of the values of the distribution,
    as shown in Figure 5.3 from the text.
  • 5-number summary for HALEs:
  • max = 73.6
  • Q3 = 62.65
  • median = 57.7
  • Q1 = 48.9
  • min = 26.5
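
A summary of this form can be computed with a few lines of Python. This is only a sketch: the input list is hypothetical (the HALE data itself is not reproduced here), the function name is made up, and the quartiles use the median-of-halves rule with the median shared between halves when n is odd.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3, and max of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # includes the median in both halves if n is odd
    return {
        "min": ordered[0],
        "Q1": median(ordered[:half]),
        "median": median(ordered),
        "Q3": median(ordered[n - half:]),
        "max": ordered[-1],
    }

print(five_number_summary([9, 1, 7, 3, 5]))
# {'min': 1, 'Q1': 3, 'median': 5, 'Q3': 7, 'max': 9}
```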

Healthy Life Expectancy Measure for all UN
Members, 2001
6
The Five-Number Summary
  • The five-number summary of a distribution reports
    its median, quartiles, and extremes (maximum and
    minimum).
  • Example: The five-number summary for the ages at
    death for 66 rock concert goers who died from
    being crushed is

7
Making Boxplots
  • A boxplot is a graphical display of the
    five-number summary.
  • Boxplots are particularly useful when comparing
    groups.

And also some additional information, such as
other outliers
8
Constructing Boxplots
  • Draw a single axis spanning the range of the data.
  • You can draw boxplots vertically or horizontally,
    but this one is oriented vertically, so that is
    how the instructions are described.
  • Draw short horizontal lines at the lower and
    upper quartiles and at the median. Then connect
    them with vertical lines to form a box.

9
Constructing Boxplots (cont.)
  • Erect fences around the main part of the data.
  • The upper fence is 1.5 IQRs above the upper
    quartile.
  • The lower fence is 1.5 IQRs below the lower
    quartile.
  • Note: the fences only help with constructing the
    boxplot and should not appear in the final
    display. (You can leave them in, if you want, but
    only as dotted lines.)
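
The fence positions are simple arithmetic on the quartiles. A minimal Python sketch, with hypothetical quartile values and a function name chosen for this example:

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = fences(q1=10, q3=20)   # hypothetical quartiles, IQR = 10
print(lower, upper)                   # -5.0 35.0
```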

10
Constructing Boxplots (cont.)
  • Use the fences to grow whiskers.
  • Draw lines from the ends of the box up and down
    to the most extreme data values found within the
    fences.
  • (If you look at the original data for rock
    concert deaths, this would be 29 years for the
    upper whisker. 13 is the youngest death (and 13
    > 9.5), so that's the lower whisker end.)
  • If a data value falls outside one of the fences,
    we do not connect it with a whisker.

11
Constructing Boxplots (cont.)
  • Now we add any outliers by displaying any data
    values beyond the fences with special symbols.
  • Often, a different symbol is used for far
    outliers that are farther than 3 IQRs from the
    quartiles. (This stylistic differentiation is
    optional.)
  • And we erase the fences (again, optional).
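
The whisker and outlier steps can be put together in a short Python sketch. The data set is hypothetical, the quartiles are passed in rather than computed, and `boxplot_parts` is a name invented for this illustration.

```python
def boxplot_parts(values, q1, q3):
    """Whisker ends, outliers, and far outliers for the given quartiles."""
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers reach the most extreme data values that lie within the fences.
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    outside = [v for v in values if v < lo_fence or v > hi_fence]
    # Far outliers lie more than 3 IQRs beyond a quartile.
    far = [v for v in outside if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return {
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(v for v in outside if v not in far),
        "far_outliers": sorted(far),
    }

data = [8, 10, 12, 15, 18, 20, 22, 38, 60]   # hypothetical data
print(boxplot_parts(data, q1=12, q3=22))
# {'whiskers': (8, 22), 'outliers': [38], 'far_outliers': [60]}
```

Here the fences sit at -3 and 37, so the whiskers stop at 8 and 22; 38 is an ordinary outlier and 60 (beyond 22 + 3 × 10 = 52) is a far outlier.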

12
Rock Concert Deaths: Making Boxplots (cont.)
  • Compare the histogram and boxplot for Worldwide
    Rock Concert Deaths, 1999-2000.
  • How does each display represent the distribution?

13
Comparing Groups With Boxplots
  • The following set of boxplots compares the
    effectiveness of various travel coffee mugs
  • What does this graphical display tell you?
  • Which coffee container would you recommend using?
  • Did we really need to see all 4 histograms to
    reach this conclusion?

Temperature change for Brands of Coffee Containers
14
Summarizing Symmetric Distributions
  • Medians do a good job of identifying the center
    of skewed distributions.
  • When we have symmetric data, the mean is a good
    measure of center.
  • We find the mean by adding up all of the data
    values and dividing by n (the number of data
    values we have).
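
In Python, that calculation is a one-liner. The data values below are reused from the earlier median example purely for illustration.

```python
def mean(values):
    """Sum of the data values divided by n, the number of values."""
    return sum(values) / len(values)

print(mean([2, 4, 5, 6, 6, 7]))  # 5.0
```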

15
Summarizing Symmetric Distributions (cont.)
  • The distribution of pulse rates for 52 adults is
    generally symmetric, with a mean of 72.7 beats
    per minute (bpm) and a median of 73 bpm

Pulse Rates of 52 Adults
16
Mean or Median?
Healthy Life Expectancy Measure for all UN
Members, 2001
  • Regardless of the shape of the distribution, the
    mean is the point at which a histogram of the
    data would balance

17
Mean or Median? (cont.)
  • In symmetric distributions, the mean and median
    are approximately the same in value, so either
    measure of center may be used.
  • For significantly skewed data, though, it's
    better to report the median than the mean as a
    measure of center.
  • Example: Does the HALE data show skew? If so,
    how?
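
A small Python sketch of how skew pulls the mean away from the median. Both data sets are hypothetical and are not the HALE data; `mean` and `median` come from the standard `statistics` module.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 50]      # hypothetical: one long tail to the right

print(mean(symmetric), median(symmetric))  # 3 3
print(mean(skewed), median(skewed))        # 12 3
```

In the symmetric set the two measures agree; in the skewed set the mean is dragged toward the tail while the median stays put.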

18
What About Spread? The Standard Deviation
  • A more powerful measure of spread than the IQR is
    the standard deviation, which takes into account
    how far each data value is from the mean.
  • A deviation is the distance that a data value is
    from the mean.
  • Since adding all deviations together would total
    zero, we square each deviation and find an
    average of sorts for the deviations.
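
The claim that deviations always total zero is easy to check numerically. A minimal Python sketch with illustrative data:

```python
data = [2, 4, 5, 6, 6, 7]
m = sum(data) / len(data)                 # mean = 5.0

deviations = [x - m for x in data]
print(deviations)                         # [-3.0, -1.0, 0.0, 1.0, 1.0, 2.0]
print(sum(deviations))                    # 0.0
print(sum(d ** 2 for d in deviations))    # 16.0
```

Squaring removes the signs, so the squared deviations no longer cancel.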

19
First We Find the Variance
  • The variance, denoted s^2, is found by summing
    the squared deviations and dividing by n - 1.
  • The variance will play a role later in our study,
    but it is problematic as a measure of spread: it
    is measured in squared units!

20
Then We Take the Square Root
  • The standard deviation, s, (or sometimes SD) is
    just the square root of the variance and is
    measured in the same units as the original data.
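
The two steps (variance, then square root) can be sketched in Python. The data are illustrative and `sample_variance` is a helper written for this example; the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and are shown only as a cross-check.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

data = [2, 4, 5, 6, 6, 7]
s2 = sample_variance(data)     # 16 / 5
s = math.sqrt(s2)              # back in the original units

print(s2)                      # 3.2
print(round(s, 3))             # 1.789
print(math.isclose(s2, variance(data)), math.isclose(s, stdev(data)))  # True True
```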

21
Looking at Center and Spread, an Example
  • As part of a Human Resources report, assume we've
    been given annual salaries ($K/yr) for 9
    professors as follows.
  • Describe the distribution
  • What would be an appropriate measure of center
    and spread?

22
Looking at Center and Spread, an Example
  • Although the data is symmetric, we could still
    determine the median and the IQR. (First sort
    the data so it is ordered.)

23
Looking at Center and Spread, an Example
  • First calculate the mean, then calculate the
    standard deviation.
  • We can use Excel to work out the calculations
    step-by-step.

24
Thinking About Variation
  • Since Statistics is about variation, spread is an
    important fundamental concept of Statistics.
  • Measures of spread help us talk about what we
    don't know.
  • When the data values are tightly clustered around
    the center of the distribution, the IQR and
    standard deviation will be small.
  • When the data values are scattered far from the
    center, the IQR and standard deviation will be
    large.

25
Shape, Center, and Spread
  • When describing a quantitative variable, always
    report the shape of its distribution, along with
    a center and a spread.
  • If the shape is skewed, report the median and
    IQR.
  • If the shape is symmetric, report the mean and
    standard deviation and possibly the median and
    IQR as well.

26
What About Outliers?
  • If there are any clear outliers and you are
    reporting the mean and standard deviation, report
    them with the outliers present and with the
    outliers removed. The differences may be quite
    revealing.
  • Note: The median and IQR are not as likely to be
    affected by the outliers as the mean and SD.
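
Reporting the summaries with and without the outliers might look like this in Python. The data set and the choice of which values count as outliers are hypothetical; `summarize` is a helper invented for this sketch.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Mean, standard deviation, and median of the data."""
    return {"mean": mean(values), "sd": stdev(values), "median": median(values)}

data = [8, 10, 12, 15, 18, 20, 22, 38, 60]    # hypothetical data
outliers = {38, 60}                           # assumed flagged by the boxplot fences
trimmed = [v for v in data if v not in outliers]

print("with outliers:   ", summarize(data))
print("without outliers:", summarize(trimmed))
```

For this data set, removing the two outliers lowers the mean by more than 7 units but the median by only 3, and the standard deviation shrinks sharply.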

27
What Can Go Wrong?
  • Don't forget to do a reality check: don't let
    technology do your thinking for you.
  • First sort the values before finding the median
    and quartiles.
  • Don't compute numerical summaries of a
    categorical variable.
  • Watch out for multiple modes: multiple modes might
    indicate multiple groups in your data.
  • Be aware of slightly different methods: different
    statistics packages and calculators may give you
    different answers for the same data.
  • Beware of outliers.
  • Make a picture, make a picture, make a picture.

28
What Can Go Wrong? (cont.)
  • Be careful when comparing groups that have very
    different spreads.
  • Consider the first side-by-side boxplots of
    cotinine levels for 3 different types of subjects.
  • The second set of boxplots shows the same data
    for log(cotinine) values.
  • This example is an aside, as re-expressing data
    is not going to be tested in DS212.

29
What have we learned?
  • We can now summarize distributions of
    quantitative variables numerically.
  • The 5-number summary displays the min, Q1,
    median, Q3, and max.
  • Measures of center include the mean and median.
  • Measures of spread include the range, IQR, and
    standard deviation.
  • We know which measures to use for symmetric
    distributions and skewed distributions.
  • We can also display distributions with boxplots.
  • While histograms better show the shape of the
    distribution, boxplots reveal the center, middle
    50%, and any outliers in the distribution.
  • Boxplots are useful for comparing groups.

30
Examples
  • Based on Pr5: A clerk entering salary data into
    a spreadsheet accidentally put in an extra 0 on
    the boss's salary, listing it as $2,000,000/yr
    instead of $200,000/yr. Explain how this error
    will affect these summary statistics for the
    company payroll. (Note: Although the text
    doesn't say, you can assume this is a company
    with at least 5 employees and also that the boss
    earns the largest salary!)
  • Measures of center: median and mean
  • Measures of spread: range, IQR, and standard
    deviation
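
One way to explore this question is to try it on a made-up payroll. The Python sketch below assumes a hypothetical six-person company in which the boss earns the most; the salaries, the `iqr` and `data_range` helpers, and the median-of-halves quartile rule are all assumptions for illustration.

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR from the medians of the lower and upper halves (median shared if n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[n - half:]) - median(ordered[:half])

def data_range(values):
    return max(values) - min(values)

# Hypothetical payroll; the last entry is the boss.
correct = [30_000, 35_000, 40_000, 45_000, 50_000, 200_000]
typo = correct[:-1] + [2_000_000]

for name, stat in [("median", median), ("mean", mean), ("range", data_range),
                   ("IQR", iqr), ("SD", stdev)]:
    print(f"{name:>6}: correct={stat(correct):>12,.0f}  typo={stat(typo):>12,.0f}")
```

With this payroll the median and IQR are identical in both versions, while the mean, range, and standard deviation all jump.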

31
Examples
  • Pr41: The data from the CD-ROM (also available
    on my website) shows 8th graders' average math
    test scores for the participating 66 nations.
  • Note: copy the data from the CD-ROM, rather than
    typing all the numbers in by hand! (Saves time
    and is less mistake-prone!)
  • Excel doesn't have Q1 and Q3 calculations, but
    you can get those by sorting the data (use
    Excel's Sort!) and adding a column for rank.
  • You can use Excel's built-in functions AVERAGE
    and STDEV.
  • For more guidance, see the Problem Solution I've
    provided. There are 2 files, one for the Excel
    work and also a Word document.

32
Examples
  • Pr39: In a USA Today advertisement (7/2001)
    Net2Phone listed long distance rates to 24 of the
    250 countries they serve. (Hint: use the Excel
    data set given!)

33
Examples
  • Pr39 (Net2Phone Example), continued
  • Display these rates
  • Find the mean and median. Which is the most
    appropriate measure of center?
  • Find the IQR and Standard deviation. Which is
    the more appropriate measure of spread?
  • Are there any outliers? Why?
  • Write a description of the rates.
  • Can you conclude anything about Net2Phone's rates
    to all the countries they serve?

34
Examples
  • Pr39 (Net2Phone Example), answers
  • A boxplot is a good way to display these rates.
  • What are the outliers?
  • Do they have a huge effect?
  • Do you feel comfortable making generalizations
    about Net2Phone's service from 10% of the data,
    when the company selected that data?
  • (See solution set for full details!)