Describing Distributions with Graphs and Numbers - PowerPoint PPT Presentation

About This Presentation
Title:

Describing Distributions with Graphs and Numbers

Slides: 34
Provided by: SY54

Transcript and Presenter's Notes


1
Topic 2
  • Describing Distributions with Graphs and Numbers

2
[Diagram labels: Target population (size N); Sampling/experiment; Data (size n); Summary; Visualization; Inference (estimation, testing)]
3
Parameter and Statistic
  • A parameter (in statistics) is a quantity that
    defines a certain characteristic of a population.
  • Example: the average birthweight of all new-born babies.
  • Parameters are estimated based on a sample.
  • A statistic is a summary measure computed from
    sample data. Note that a parameter is a summary
    measure for an entire population.
  • A key use of a statistic is as an estimator for a
    parameter.
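
The distinction can be sketched in R with a simulated population. The population below (10,000 birthweights in grams, drawn from a normal distribution with mean 3400 and SD 500) is purely hypothetical and is not data from these slides; the names `population`, `birth_sample`, `mu`, and `xbar` are chosen for illustration only.

```r
# Hypothetical population: simulated birthweights in grams (an assumption, not real data)
set.seed(1)
population <- rnorm(10000, mean = 3400, sd = 500)

# Parameter: a summary of the entire population (usually unknown in practice)
mu <- mean(population)

# Statistic: the same summary computed from a sample of size n
n <- 100
birth_sample <- sample(population, size = n)
xbar <- mean(birth_sample)

# The statistic xbar serves as an estimate of the parameter mu
c(parameter = mu, statistic = xbar)
```

Drawing a different sample would give a different value of the statistic, while the parameter stays fixed.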

4
Distributions
  • When we say that 62% of TAMUK students are Hispanic,
    32% are white, 3% are African-American, and 3%
    are others, we mean that the DISTRIBUTION of TAMUK
    students according to race is:

Race              Percent
Hispanic          62
White             32
African-American  3
Others            3
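
A minimal R sketch of how such a distribution can be tabulated. The vector `race` is built artificially from the stated percentages (100 hypothetical students), so it is an illustration rather than real enrollment data.

```r
# Build 100 hypothetical students matching the stated percentages
race <- rep(c("Hispanic", "White", "African-American", "Others"),
            times = c(62, 32, 3, 3))

counts <- table(race)                 # frequency of each category
percent <- 100 * prop.table(counts)   # the distribution, in percent

# Show the categories in the order used on the slide
percent[c("Hispanic", "White", "African-American", "Others")]
```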

5
  • The DISTRIBUTION of grades for a class could be:

Grade  Percent
A      20
B      45
C      22
D      10
F      3

6
  • The DISTRIBUTION of weights of all men aged 30 in
    Texas could be:

Weight             Percent
Less than 130 lb.  3
130 to 140 lb.     6
140 to 150 lb.     15
150 to 160 lb.     25
160 to 170 lb.     30
170 to 180 lb.     17
180 lb. or over    4

7
  • So, the DISTRIBUTION of a population describes
    how the population is made up according to
    some characteristic.
  • If one is concerned with the characteristic of
    a population that can be described by a
    categorical variable, e.g., race, he or she may
    be interested in what percent of subjects fall in
    each race category.
  • If one is concerned with the characteristic of
    a population that can be described by a
    continuous variable, e.g., weight, he or she may
    be interested in what proportion of people fall
    in a weight interval.

8
Histograms
  • A histogram is a bar graph in which the
    horizontal scale represents classes of data
    values and the vertical scale represents
    frequencies (or relative frequencies). The
    heights of the bars correspond to the frequency
    (or the relative frequency) values, and the bars
    are drawn adjacent to each other without gaps.

9
  • Example: Construct a histogram for the systolic
    blood pressures (SBP) of 20 men:
  • 93 104 105 108 109 112 114 115 117 119
  • 119 120 121 123 127 130 135 139 139 158

SBP Frequency
90-99 1
100-109 4
110-119 6
120-129 4
130-139 4
140-149 0
150-159 1
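
The frequency table above can be reproduced in R. This is a sketch using `cut()` and `table()`; the class boundaries at the half units (89.5, 99.5, ...) are chosen so that no observation falls on a boundary, and the variable names are illustrative.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

# Class boundaries halfway between the classes
breaks <- seq(89.5, 159.5, by = 10)
labels <- c("90-99", "100-109", "110-119", "120-129",
            "130-139", "140-149", "150-159")

classes <- cut(SBP, breaks = breaks, labels = labels)  # assign each value to a class
freq <- table(classes)                                 # frequencies
rel_freq <- freq / length(SBP)                         # relative frequencies

data.frame(Class = labels,
           Frequency = as.vector(freq),
           Relative = as.vector(rel_freq))
```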
10
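R Code
    # Systolic blood pressures of the 20 men
    SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
             119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
    # Class boundaries at the half units; col = 3 fills the bars in green
    hist(SBP, breaks = c(89.5, 99.5, 109.5, 119.5, 129.5, 139.5,
                         149.5, 159.5, 169.5), col = 3)
  • Copy and paste this code into R, and you
    will see the histogram.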

11
Pie Charts
Pie chart: a circle with one slice for each
category. The size of a slice corresponds to the
percentage of observations in that category.
13
Bar Graph for European Parliament in 2004
14
Pareto Chart: a Bar Graph with Categories Ordered
by Their Frequency, from the Tallest Bar to the
Shortest
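
As a sketch, the charts described above can be drawn in R with `barplot()` and `pie()`. The example reuses the grade percentages from the earlier slide; the variable names `grades` and `pareto` are illustrative.

```r
# Grade distribution (percent) from the class example
grades <- c(A = 20, B = 45, C = 22, D = 10, F = 3)

# A Pareto chart is a bar graph with bars sorted from tallest to shortest
pareto <- sort(grades, decreasing = TRUE)
barplot(pareto, xlab = "Grade", ylab = "Percent",
        main = "Pareto chart of grades")

# The same distribution as a pie chart, for comparison
pie(grades, main = "Pie chart of grades")
```

Sorting is the only difference between an ordinary bar graph and a Pareto chart, which makes the most common categories easy to spot.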
15
Measuring the Center: the Mean and Median
  • The distribution of data or a population can be
    displayed graphically. In practice, we also want
    to know where the center of a distribution is.
    The mean and median are common measures of the
    center of a distribution.
  • The mean of n observations x1, x2, ..., xn, denoted
    $\bar{x}$, is defined as
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
  • Example: The selling prices (in dollars) of 5
    single-family homes are 198000, 219000, 175000,
    260000, 630000. Find the mean price.
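
A short R sketch of the mean calculation for the five home prices, computed both from the definition (sum divided by n) and with the built-in `mean()`.

```r
prices <- c(198000, 219000, 175000, 260000, 630000)

# From the definition: add the observations and divide by how many there are
xbar_manual <- sum(prices) / length(prices)

# Built-in equivalent
xbar <- mean(prices)

c(manual = xbar_manual, builtin = xbar)   # both give 296400
```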

16
The Mean is Sensitive to Outliers
  • If the 5th home were 360000, then the mean price
    would be 242400 instead of 296400. The large
    difference in means is due primarily to the 5th
    price, 630000, which is called an outlier.
  • If we construct a histogram or a stem plot for
    the data of these 5 prices, the distribution of
    the data can be seen to be skewed to the right.
    This skewness is caused by the outlier.
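
The effect of the outlier on the mean can be checked directly. This sketch compares the mean of the original prices with the mean after the 5th price is replaced by 360000; `prices_alt` is an illustrative name.

```r
prices <- c(198000, 219000, 175000, 260000, 630000)

# Replace the 5th price (the outlier) with 360000
prices_alt <- replace(prices, 5, 360000)

mean_original <- mean(prices)       # 296400
mean_alt <- mean(prices_alt)        # 242400

# Changing one observation moves the mean by 54000
mean_original - mean_alt
```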

17
The Median
  • Another measure of center of a distribution is
    the median.
  • Given n observations x1, x2, ..., xn, the median,
    denoted M, is defined as the number such that
    half the observations are smaller and half are
    larger.
  • To find the median of n observations, we first
    sort the observations in order, then pick the
    midpoint.
  • Example: Find the median of the 5 prices 198000,
    219000, 175000, 260000, and 630000.
  • What if we have 6 prices: 198000, 219000, 175000,
    260000, 630000, and 230000?
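
A sketch of both median calculations in R, following the sort-then-pick-the-midpoint recipe. With 5 prices the midpoint is the 3rd ordered value; with 6 prices it is the average of the 3rd and 4th.

```r
prices5 <- c(198000, 219000, 175000, 260000, 630000)
prices6 <- c(prices5, 230000)

sort(prices5)               # 175000 198000 219000 260000 630000
m5 <- median(prices5)       # middle (3rd) value: 219000

sort(prices6)               # 175000 198000 219000 230000 260000 630000
m6 <- median(prices6)       # average of 3rd and 4th values: 224500
```

Unlike the mean, the median of the 5 prices would stay at 219000 if the largest price were changed from 630000 to 360000.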

18
Location of the Median
  • Given n observations, the location of the median
    in the ordered list is always (n+1)/2.
  • When is the location of the median an integer?
    When is it not?
  • If the location of a median is 4.5, it means that
    the median is halfway between the 4th and 5th
    observations in the ordered list. What does it
    mean if the location is 7?
  • Find the median and its location for data 2, 5,
    1, 0, 9.
  • Find the median and its location for data 0, 3,
    1, -2, 7, 4.
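
The location rule can be sketched as a small R helper. The function name `median_location` is hypothetical; it sorts the data, computes the location (n+1)/2, and averages the two neighbouring observations when the location is not a whole number.

```r
median_location <- function(x) {
  xs <- sort(x)
  loc <- (length(xs) + 1) / 2
  # If loc is a whole number, floor and ceiling coincide and pick one value;
  # otherwise the two neighbouring observations are averaged.
  m <- (xs[floor(loc)] + xs[ceiling(loc)]) / 2
  list(location = loc, median = m)
}

a <- median_location(c(2, 5, 1, 0, 9))      # location 3, median 2
b <- median_location(c(0, 3, 1, -2, 7, 4))  # location 3.5, median 2
```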

19
Example: Find the Mean and Median from a Stem Plot
1 | 69
2 | 455
3 | 334477
4 | 0255669
5 |
6 |
7 | 3
(a) What are the observations?
(b) Find the mean.
(c) Find the median and its location.
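
A sketch of the calculation in R, assuming the stems are tens and the leaves are units (so the first row reads 16 and 19, and stems 5 and 6 are empty). The vector `obs` is the list of observations read off under that assumption.

```r
# Observations read from the stem plot (stems = tens, leaves = units)
obs <- c(16, 19,
         24, 25, 25,
         33, 33, 34, 34, 37, 37,
         40, 42, 45, 45, 46, 46, 49,
         73)

stem(obs)                       # redraw a stem plot from the data

n <- length(obs)                # 19 observations
xbar <- mean(obs)               # 703 / 19 = 37
loc <- (n + 1) / 2              # location of the median: 10
M <- sort(obs)[loc]             # 10th ordered value: 37
```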
20
Comparing the Mean and Median
  • For a symmetric distribution, mean = median.
  • For a right-skewed distribution, mean > median.
  • For a left-skewed distribution, mean < median.

21
Mean, Median, and Mode
[Figure panels: the distribution is symmetric; the distribution is skewed to the left; the distribution is skewed to the right]
22
Measuring the Spread: The Quartiles
  • The spread of a distribution measures how
    widely its values are dispersed.
  • The middle half of a distribution is marked out
    by two quartiles:
  • The 1st quartile Q1 is the number such that 25%
    of all values are smaller.
  • The 3rd quartile Q3 is the number such that 75%
    of all values are smaller.
  • The median of a distribution is also called the
    2nd quartile, which is the number such that 50% of
    all values are smaller.
  • Note also that quartiles defined this way are not
    unique.
  • To find these quartiles, we need to sort the
    data and find the locations of the quartiles.

23
Example: Find Quartiles
  • 1. Given the data
  • 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73
    46 45 45,
  • find Q1, M, and Q3.
  • 2. Given the data
  • 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73
    46 45 45 31,
  • find Q1, M, and Q3.
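
Because quartiles are not uniquely defined, the sketch below fixes one convention as an assumption: Q1 is the median of the observations below the location of the overall median, and Q3 is the median of those above it (the middle observation is left out when n is odd). The helper name `textbook_quartiles` is hypothetical.

```r
textbook_quartiles <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  half <- floor(n / 2)              # size of each half; middle value excluded if n is odd
  lower <- xs[1:half]               # observations below the median's location
  upper <- xs[(n - half + 1):n]     # observations above the median's location
  c(Q1 = median(lower), M = median(xs), Q3 = median(upper))
}

d1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
        42, 40, 37, 34, 49, 73, 46, 45, 45)
d2 <- c(d1, 31)

q_d1 <- textbook_quartiles(d1)   # Q1 = 25, M = 37,   Q3 = 45
q_d2 <- textbook_quartiles(d2)   # Q1 = 28, M = 35.5, Q3 = 45
```

R's built-in `quantile()` uses a different interpolation rule by default, so its quartiles may differ slightly from these.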

24
The Five-Number Summary and Boxplots
  • Q1, M, and Q3 give information about the
    middle half of a distribution; the tails of a
    distribution can be described by the smallest
    and largest values of the distribution.
    These five values give an intuitive picture of a
    distribution and are called the 5-number summary.
  • The Five-Number Summary of a distribution
    describes both the center and the spread of a
    distribution.
  • The 5 numbers can be displayed in an (ordinary)
    boxplot, which consists of
  • (a) a central box spanning the quartiles Q1 and
    Q3,
  • (b) a line in the box marking the median M, and
  • (c) two lines extending from the box out to the
    smallest and largest observations.
  • Compared with its competitors, histograms and stem
    plots, a boxplot shows less detail about the
    distribution. Boxplots are best used for
    side-by-side comparison of more than one
    distribution. The boxplot of a distribution
    should be interpreted in terms of skewness, the
    center, and the spread.
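
A sketch of the five-number summary and an ordinary boxplot in R, using the 19 observations from the quartile example. Note that `fivenum()` reports Tukey's hinges, which follow a slightly different rule from the quartile convention used in hand calculation, so the two middle-half values may not match exactly.

```r
d1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
        42, 40, 37, 34, 49, 73, 46, 45, 45)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(d1)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# range = 0 extends the whiskers to the smallest and largest observations,
# which gives the ordinary (unmodified) boxplot described above
boxplot(d1, range = 0, horizontal = TRUE,
        main = "Ordinary boxplot")
```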

25
Compare the two boxplots in terms of skewness,
spread, and center.
The side-by-side boxplot is produced with the
following R code:
    # Grades in the two sections
    x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
           88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
    y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
           76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
    # One row per grade, labelled by section
    z <- data.frame(Grade = c(x, y),
                    Section = c(rep('Section 01', length(x)),
                                rep('Section 02', length(y))))
    # One box per section; col = 2:3 gives each box its own color
    boxplot(Grade ~ Section, data = z, col = 2:3)
26
Spotting Suspected Outliers: The 1.5×IQR Rule
  • In a boxplot, the distance between Q1 and Q3 (the
    range of the center half of the data) is a more
    resistant measure of spread than the full range.
    This distance is called the interquartile range,
    denoted IQR; that is,
  • IQR = Q3 - Q1.
  • The 1.5×IQR Rule for outliers: An observation is
    called a suspected outlier if it falls more than
    1.5×IQR above Q3 or below Q1.
  • Example: Find Q1, Q3, and IQR of the data
  • 72 83 91 84 84 78 90 85 67 91 80 85 67 65 95.
  • Identify any suspected outliers.
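
The example can be worked through in R as follows. This sketch assumes the hand-calculation convention for quartiles (median of the lower and upper halves, leaving out the middle observation when n is odd); the variable names are illustrative.

```r
x <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

xs <- sort(x)
n <- length(xs)
half <- floor(n / 2)

q1 <- median(xs[1:half])              # median of the lower half: 72
q3 <- median(xs[(n - half + 1):n])    # median of the upper half: 90
iqr <- q3 - q1                        # 18

lower_fence <- q1 - 1.5 * iqr         # 45
upper_fence <- q3 + 1.5 * iqr         # 117

# Observations beyond either fence are suspected outliers
suspected <- x[x < lower_fence | x > upper_fence]
length(suspected)                     # 0: no suspected outliers in these data
```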

27
A Modified Boxplot
28
R code
    # Boxplot annotated with the five-number summary, the 1.5×IQR fences,
    # and labels for any suspected outliers
    myBoxPlot <- function(x, col = 'gray') {
      boxplot(x, col = col)
      text(rep(1.3, 5), fivenum(x),
           labels = c('minimum', 'lower hinge', 'median',
                      'upper hinge', 'maximum'), col = 'blue')
      q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
      iqr <- q[3] - q[1]
      lowerfence <- q[1] - 1.5 * iqr
      upperfence <- q[3] + 1.5 * iqr
      abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
      text(rep(1.3, 2), c(lowerfence, upperfence),
           labels = c('lower fence', 'upper fence'), col = 'blue')
      # The product is positive only when x lies outside both fences
      Outliers <- which((x - lowerfence) * (x - upperfence) > 0)
      if (length(Outliers) != 0)
        text(rep(0.63, length(Outliers)), x[Outliers],
             labels = paste('Obs.', Outliers), col = 'red')
    }
    Rainfall <- c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5, 13.0, 10.1, 10.1,
                  10.1, 10.8, 7.8, 14.1, 10.6, 10.0, 11.5, 13.6, 12.1,
                  12.0, 9.3, 7.7, 11.0, 6.9, 9.5, 16.5, 9.3, 9.4, 8.7,
                  9.5, 11.6, 12.1, 8.0, 10.7, 13.9, 11.3, 11.6, 10.4)
    myBoxPlot(Rainfall)
29
Measuring Spread: the Standard Deviation
  • Interestingly, the mean is not among the 5-number
    summary of a distribution. The closest partner of
    the mean is the standard deviation, which is
    another measure of the spread of a distribution.
  • The standard deviation measures how far the
    observations are from their mean.

30
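Calculation of Standard Deviations
  • The variance of a set of n observations is an
    average of the squared deviations from the
    mean, with divisor n - 1:
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
  • The standard deviation s is the square root of
    the variance:
    $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.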

31
The Standard Deviation: Example
  • Example (calculating the standard deviation s):
  • Metabolic rates of 7 men who took part in a
    study of dieting. The units are calories per 24
    hours.
  • 1792 1666 1362 1614 1460 1867 1439
  • Find the mean first:
    $\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600$.

32
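The Standard Deviation: Example (cont'd)
Observations  Deviations  Squared deviations
1792           192        36864
1666            66         4356
1362          -238        56644
1614            14          196
1460          -140        19600
1867           267        71289
1439          -161        25921
sum              0       214870
The variance: $s^2 = \frac{214870}{7 - 1} \approx 35811.67$.
The standard deviation: $s = \sqrt{35811.67} \approx 189.24$ calories.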
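
The table can be reproduced step by step in R. This sketch follows the hand calculation (deviations, squared deviations, division by n - 1) and then compares the result with the built-in `var()` and `sd()`, which use the same n - 1 divisor.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar <- mean(rates)                  # 1600
deviations <- rates - xbar           # these always sum to 0
squared <- deviations^2              # these sum to 214870

s2 <- sum(squared) / (length(rates) - 1)   # variance, about 35811.67
s <- sqrt(s2)                              # standard deviation, about 189.24

# Built-in functions give the same values
c(var = var(rates), sd = sd(rates))
```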
33
Summary of Strategies for Exploring Data on a
Single Quantitative Variable
  • The 5-number summary is always good for
    describing the distribution of quantitative data.
  • The mean and its partner, the standard deviation,
    should be used to describe the center and spread
    of the distribution of quantitative data only
    when the distribution is known to be symmetric,
    since both are sensitive to outliers.
  • The shape of the distribution of quantitative
    data is better described using graphical displays
    such as histograms.