Title: Describing Distributions with Graphs and Numbers
1Topic 2
- Describing Distributions with Graphs and Numbers
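2[Diagram: a sample (data, size n) is obtained from the target population (size N) by sampling or experiment; the data are summarized and visualized, and inference (estimation, testing) is made about the population.]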
3Parameter and Statistic
- A parameter (in statistics) is a quantity that
defines a certain characteristic of a population.
  - Example: the average birthweight of all newborn babies.
- Parameters are estimated based on a sample.
- A statistic is a summary measure computed from
sample data, whereas a parameter is a summary
measure for an entire population.
- A key use of a statistic is as an estimator for a
parameter.
4Distributions
- When we say that 62% of TAMUK students are Hispanic,
32% are white, 3% are African-American, and 3%
are others, we mean that the DISTRIBUTION of TAMUK
students according to race is

Race                Percent
Hispanic            62
White               32
African-American    3
Others              3
5- The DISTRIBUTION of grades for a class could be

Grade    Percent
A        20
B        45
C        22
D        10
F        3
6- The DISTRIBUTION of weights of all men aged 30 in
Texas could be

Weights              Percent
Less than 130 lb.    3
130 to 140 lb.       6
140 to 150 lb.       15
150 to 160 lb.       25
160 to 170 lb.       30
170 to 180 lb.       17
180 lb. or over      4
7- So, the DISTRIBUTION of a population describes
how the population is made up according to
some characteristic.
- If the characteristic of interest is described by a
categorical variable, e.g., race, one may
be interested in what percent of subjects fall in
each category.
- If the characteristic of interest is described by a
continuous variable, e.g., weight, one may
be interested in what proportion of subjects fall
in each interval of values.
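Both cases can be sketched in R. The counts and weights below are made-up illustration values, not data from these slides. `prop.table()` turns counts into proportions for a categorical variable, and `cut()` first groups a continuous variable into intervals.

```r
# Hypothetical counts for a categorical variable (illustration only)
race_counts <- c(Hispanic = 124, White = 64, "African-American" = 6, Others = 6)
race_percent <- 100 * prop.table(race_counts)
print(race_percent)

# Hypothetical weights in lb (illustration only)
weights <- c(128, 135, 142, 147, 151, 155, 158, 162, 164, 166, 169, 171, 175, 183)
# right = FALSE makes each interval include its left endpoint, e.g. [130, 140)
classes <- cut(weights, breaks = c(-Inf, 130, 140, 150, 160, 170, 180, Inf),
               right = FALSE)
weight_percent <- 100 * prop.table(table(classes))
print(round(weight_percent, 1))
```

In either case the percentages add up to 100, which is a quick check that no subject was left out of the distribution.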
8Histograms
- A histogram is a bar graph in which the
horizontal scale represents classes of data
values and the vertical scale represents
frequencies (or relative frequencies). The
heights of the bars correspond to the frequency
(or the relative frequency) values, and the bars
are drawn adjacent to each other without gaps.
9- Example: Construct a histogram for the systolic
blood pressures (SBP) of 20 men:
93 104 105 108 109 112 114 115 117 119
119 120 121 123 127 130 135 139 139 158

SBP        Frequency
90-99      1
100-109    4
110-119    6
120-129    4
130-139    4
140-149    0
150-159    1
10R Code
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
# Class boundaries end in .5 so that every integer reading falls inside one class
hist(SBP, breaks = c(89.5, 99.5, 109.5, 119.5, 129.5, 139.5,
                     149.5, 159.5, 169.5), col = 3)
- Copy and paste this code into R to see the histogram.
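The frequency table can be checked without drawing anything. This is a minimal sketch that calls `hist()` with `plot = FALSE` and the same class boundaries; the names `breaks`, `h`, and `freq` are chosen here for illustration.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
breaks <- seq(89.5, 169.5, by = 10)

# plot = FALSE returns the class counts instead of drawing the histogram
h <- hist(SBP, breaks = breaks, plot = FALSE)

freq <- data.frame(
  class     = paste(breaks[-length(breaks)] + 0.5, breaks[-1] - 0.5, sep = "-"),
  frequency = h$counts,
  relative  = h$counts / length(SBP)
)
print(freq)
```

The last class, 160-169, is empty because the breaks extend one class past the largest reading.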
11Pie Charts
- Pie chart: a circle with one slice for each
category. The size of a slice corresponds to the
percentage of observations in that category.
13Bar Graph for European Parliament in 2004
14Pareto Chart: Bar Graph with Categories Ordered
by Their Frequency, from the Tallest Bar to the
Shortest
15Measuring the Center: the Mean and Median
- The distribution of data or a population can be
displayed graphically. In practice, we also want
to know where the center of a distribution is.
The mean and median are common measures of the
center of a distribution.
- The mean of n observations $x_1, x_2, \ldots, x_n$, denoted
$\bar{x}$, is defined as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Example: The selling prices (in dollars) of 5
single-family homes are 198000, 219000, 175000,
260000, 630000. Find the mean price.
- $\bar{x} = \frac{198000 + 219000 + 175000 + 260000 + 630000}{5} = \frac{1482000}{5} = 296400$
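The mean price can be checked in R. This short sketch computes it once from the definition (sum divided by n) and once with the built-in `mean()`; the variable names are chosen here for illustration.

```r
prices <- c(198000, 219000, 175000, 260000, 630000)
n <- length(prices)

mean_by_formula <- sum(prices) / n   # definition: sum of observations over n
mean_builtin    <- mean(prices)      # built-in equivalent

print(c(by_formula = mean_by_formula, builtin = mean_builtin))
```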
16The Mean is Sensitive to Outliers
- If the 5th home were 360000, then the mean price
would be $\bar{x} = \frac{1212000}{5} = 242400$ rather than 296400. The large difference in means
is due to the 5th price, 630000, which is
called an outlier.
- If we construct a histogram or a stem plot for
the data of these 5 prices, the distribution of
the data can be seen to be skewed to the right.
This skewness is caused by the outlier.
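As a small illustration of this sensitivity, the sketch below compares the mean and the median for the two versions of the price list. The object names are illustrative only.

```r
prices_with_outlier <- c(198000, 219000, 175000, 260000, 630000)
prices_modified     <- c(198000, 219000, 175000, 260000, 360000)

comparison <- rbind(
  with_630000 = c(mean = mean(prices_with_outlier),
                  median = median(prices_with_outlier)),
  with_360000 = c(mean = mean(prices_modified),
                  median = median(prices_modified))
)
print(comparison)
```

Changing the largest price moves the mean but leaves the median where it was, since the middle value of the sorted list does not change.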
17The Median
- Another measure of the center of a distribution is
the median.
- Given n observations $x_1, x_2, \ldots, x_n$, the median,
denoted M, is the number such that
half the observations are smaller and half are larger.
- To find the median of n observations, first
sort the observations in order, then pick the
midpoint.
- Example: Find the median of the 5 prices 198000,
219000, 175000, 260000, and 630000.
- What if we have 6 prices: 198000, 219000, 175000,
260000, 630000, and 230000?
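The two median questions can be worked through in R by sorting first, as the procedure describes, and then comparing with the built-in `median()`. This is a sketch with illustrative variable names.

```r
five_prices <- c(198000, 219000, 175000, 260000, 630000)
six_prices  <- c(five_prices, 230000)

print(sort(five_prices))            # odd n: the median is the middle value
median_five <- median(five_prices)
print(median_five)

print(sort(six_prices))             # even n: average of the two middle values
median_six <- median(six_prices)
print(median_six)
```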
18Location of the Median
- Given n observations, the location of the median
in the ordered list is always (n + 1)/2.
- When is the location of the median an integer? When
is it not?
- If the location of the median is 4.5, it means that
the median is halfway between the 4th and 5th
observations in the ordered list. What does it
mean if the location is 7?
- Find the median and its location for the data 2, 5,
1, 0, 9.
- Find the median and its location for the data 0, 3,
1, -2, 7, 4.
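The location rule can be sketched as a small helper. `median_with_location` is a hypothetical name introduced here, not a built-in R function; it returns the sorted data, the location (n + 1)/2, and the median read off at that location.

```r
median_with_location <- function(x) {
  s <- sort(x)
  loc <- (length(s) + 1) / 2
  # For a non-integer location, floor(loc) and ceiling(loc) are the two
  # middle positions; for an integer location both equal loc.
  m <- (s[floor(loc)] + s[ceiling(loc)]) / 2
  list(sorted = s, location = loc, median = m)
}

a <- median_with_location(c(2, 5, 1, 0, 9))
b <- median_with_location(c(0, 3, 1, -2, 7, 4))
str(a)
str(b)
```

For the first data set the location is an integer, so the median is one of the observations. For the second it ends in .5, so the median is the average of the two neighboring observations.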
19Example: Find the Mean and Median from a Stem Plot
1 | 6 9
2 | 4 5 5
3 | 3 3 4 4 7 7
4 | 0 2 5 5 6 6 9
5 |
6 |
7 | 3
(a) What are the observations? (b) Find the
mean. (c) Find the median and its location.
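A sketch of parts (a) to (c) in R, assuming each stem is the tens digit and each leaf the units digit, so that "1 | 6 9" stands for 16 and 19. The vector `obs` is typed in by hand under that assumption.

```r
# Observations read from the stem plot (stem = tens, leaf = units)
obs <- c(16, 19,
         24, 25, 25,
         33, 33, 34, 34, 37, 37,
         40, 42, 45, 45, 46, 46, 49,
         73)

n        <- length(obs)
obs_mean <- mean(obs)
location <- (n + 1) / 2          # position of the median in the sorted list
obs_med  <- sort(obs)[location]  # valid here because the location is an integer

print(c(n = n, mean = obs_mean, location = location, median = obs_med))

# stem() draws R's own stem plot; R picks the stem width itself,
# so the display may be laid out differently from the slide.
stem(obs)
```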
20Comparing the Mean and Median
- For a symmetric distribution, mean = median.
- For a right-skewed distribution, mean > median.
- For a left-skewed distribution, mean < median.
21Mean, Median, and Mode
- The distribution is symmetric.
- The distribution is skewed to the left.
- The distribution is skewed to the right.
22Measuring the Spread: The Quartiles
- The spread of a distribution measures how
widely its values are dispersed.
- The middle half of a distribution is marked out
by two quartiles:
  - The 1st quartile Q1 is the number such that 25%
of all values are smaller.
  - The 3rd quartile Q3 is the number such that 75%
of all values are smaller.
- The median of a distribution is also called the
2nd quartile, which is the number such that 50% of
all values are smaller.
- Note that quartiles defined this way are not
unique.
- To find the quartiles, we sort the
data and then find the locations of the quartiles.
23Example: Find Quartiles
- 1. Given the data
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45,
find Q1, M, and Q3.
- 2. Given the data
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 31,
find Q1, M, and Q3.
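Because quartiles are not uniquely defined, any code has to pick a convention. The sketch below assumes one common by-hand rule that the slides do not spell out: Q1 is the median of the observations to the left of the median's location and Q3 is the median of those to the right, leaving the median itself out when n is odd. `quartiles_by_halves` is a hypothetical helper name.

```r
quartiles_by_halves <- function(x) {
  s <- sort(x)
  n <- length(s)
  half <- floor(n / 2)              # size of each half; middle value dropped if n is odd
  lower <- s[seq_len(half)]
  upper <- s[(n - half + 1):n]
  c(Q1 = median(lower), M = median(s), Q3 = median(upper))
}

data1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45)
data2 <- c(data1, 31)

q_data1 <- quartiles_by_halves(data1)
q_data2 <- quartiles_by_halves(data2)
print(q_data1)
print(q_data2)
```

Other conventions, including R's default in `quantile()`, can return slightly different quartiles for the same data.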
24The Five-Number Summary and Boxplots
- Q1, M, and Q3 give information about the
middle half of a distribution; the tails of a
distribution can be described by the
smallest and largest values of the distribution.
Together these five values give a quick picture of a
distribution and are called the 5-number summary.
- The Five-Number Summary of a distribution
describes both the center and the spread of a
distribution.
- The 5 numbers can be displayed in an (ordinary)
boxplot, which consists of
  - (a) a central box spanning the quartiles Q1 and Q3,
  - (b) a line in the box marking the median M, and
  - (c) two lines extending from the box out to the
smallest and largest observations.
- Compared with its competitors, histograms and stem
plots, a boxplot shows less detail about the
distribution. Boxplots are best used for
side-by-side comparison of more than one
distribution. The boxplot of a distribution
should be interpreted in terms of skewness,
center, and spread.
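As an illustration, the five-number summary of the 20 systolic blood pressures from the histogram example can be obtained with R's `fivenum()`. Note that `fivenum()` reports Tukey's hinges, which may differ slightly from quartiles found by another rule. Setting `range = 0` in `boxplot()` extends the whiskers to the smallest and largest observations, matching the ordinary boxplot described above.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

five <- fivenum(SBP)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five)

# range = 0: whiskers reach the extremes (ordinary boxplot).
# plot = FALSE only returns the numbers; drop it to draw the boxplot.
bp <- boxplot(SBP, range = 0, plot = FALSE)
print(as.vector(bp$stats))
```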
25Compare the two boxplots in terms of skewness,
spread, and center.
The side-by-side boxplot is produced with the
following R code:
x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
       88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
       76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
z <- data.frame(Grade = c(x, y),
                Section = c(rep('Section 01', length(x)),
                            rep('Section 02', length(y))))
# data = z replaces attach(z); one box is drawn per section
boxplot(Grade ~ Section, data = z, col = 2:3)
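The visual comparison can be backed up with numbers. This sketch applies `fivenum()` to each section's grades; the row labels and the name `summary_by_section` are added here for readability.

```r
x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
       88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
       76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
z <- data.frame(Grade = c(x, y),
                Section = rep(c("Section 01", "Section 02"),
                              times = c(length(x), length(y))))

# One column per section, one row per summary number
summary_by_section <- sapply(split(z$Grade, z$Section), fivenum)
rownames(summary_by_section) <- c("min", "lower hinge", "median",
                                  "upper hinge", "max")
print(summary_by_section)
```

Comparing the median rows addresses center, the distance between the hinges addresses spread, and the position of the median between the hinges and extremes hints at skewness.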
26Spotting Suspected Outliers: The 1.5 × IQR Rule
- In a boxplot, the distance between Q1 and Q3 (the
range of the center half of the data) is a more
resistant measure of spread than the full range. This distance is
called the interquartile range, denoted IQR;
that is,
- IQR = Q3 - Q1.
- The 1.5 × IQR rule for outliers: An observation is
called a suspected outlier if it falls more than
1.5 × IQR above Q3 or below Q1.
- Example: Find Q1, Q3, and IQR of the data
- 72 83 91 84 84 78 90 85 67 91 80 85 67 65 95.
- Identify any suspected outliers.
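The rule can be sketched in R for the example data. The code uses `quantile()` with R's default interpolation rule, which is an assumption of this sketch and can give quartiles that differ from a by-hand calculation; the variable names are illustrative.

```r
obs <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

# names = FALSE drops the "25%" / "75%" labels from the result
q   <- quantile(obs, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Observations beyond either fence are the suspected outliers
suspected <- obs[obs < lower_fence | obs > upper_fence]

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr,
        lower_fence = lower_fence, upper_fence = upper_fence))
print(suspected)
```

Taking Q1 and Q3 by hand as the medians of the lower and upper halves gives Q1 = 72, Q3 = 90, and IQR = 18, with fences at 45 and 117. Under either convention no observation falls outside the fences, so there are no suspected outliers.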
27A Modified Boxplot
28R code
myBoxPlot <- function(x, col = 'gray') {
  boxplot(x, col = col)
  # Label the five-number summary next to the box
  text(rep(1.3, 5), fivenum(x),
       labels = c('minimum', 'lower hinge', 'median', 'upper hinge', 'maximum'),
       col = 'blue')
  # Quartiles, IQR, and the 1.5 x IQR fences
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  lowerfence <- q[1] - 1.5 * iqr
  upperfence <- q[3] + 1.5 * iqr
  abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
  # Two fence labels need two x positions
  text(rep(1.3, 2), c(lowerfence, upperfence),
       labels = c('lower fence', 'upper fence'), col = 'blue')
  # An observation lies outside the fences when this product is positive
  Outliers <- which((x - lowerfence) * (x - upperfence) > 0)
  if (length(Outliers) != 0)
    text(rep(0.63, length(Outliers)), x[Outliers],
         labels = paste('Obs.', Outliers), col = 'red')
}
Rainfall <- c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5,
              13.0, 10.1, 10.1, 10.1, 10.8, 7.8, 14.1, 10.6,
              10.0, 11.5, 13.6, 12.1, 12.0, 9.3, 7.7, 11.0,
              6.9, 9.5, 16.5, 9.3, 9.4, 8.7, 9.5, 11.6, 12.1,
              8.0, 10.7, 13.9, 11.3, 11.6, 10.4)
myBoxPlot(Rainfall)
29Measuring Spread: the Standard Deviation
- Interestingly, the mean is not among the 5-number
summary of a distribution. The natural partner of
the mean is the standard deviation, which is
another measure of the spread of a distribution.
- The standard deviation measures how far the
observations are from their mean.
30Calculation of Standard Deviations
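- The variance $s^2$ of a set of n observations is an
average of the squared deviations from the
mean $\bar{x}$, computed with the divisor n - 1:
- $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- The standard deviation s is the square root of
the variance: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$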
31The Standard Deviation: Example
- Example (Calculating the standard deviation s)
- Metabolic rates of 7 men who took part in a
study of dieting. The units are calories per 24
hours.
- 1792 1666 1362 1614 1460 1867 1439
- Find the mean first: $\bar{x} = \frac{11200}{7} = 1600$
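32The Standard Deviation: Example (Cont'd)
Observations    Deviations    Squared deviations
1792            192           36864
1666            66            4356
1362            -238          56644
1614            14            196
1460            -140          19600
1867            267           71289
1439            -161          25921
sum             0             214870
The variance: $s^2 = \frac{214870}{6} \approx 35811.67$
The standard deviation: $s = \sqrt{35811.67} \approx 189.24$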
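The table can be reproduced step by step in R. This sketch assumes the usual sample variance with divisor n - 1, which is also what R's built-in `var()` and `sd()` use; the variable names are illustrative.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar       <- mean(rates)
deviations <- rates - xbar        # deviations from the mean always sum to 0
squared    <- deviations^2

print(data.frame(observation = rates,
                 deviation = deviations,
                 squared_deviation = squared))

variance <- sum(squared) / (length(rates) - 1)   # divisor n - 1
s        <- sqrt(variance)

print(c(mean = xbar, variance = variance, sd = s))
# Built-in equivalents, for comparison
print(c(var = var(rates), sd = sd(rates)))
```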
33Summary of Strategies for Exploring Data on a
Single Quantitative Variable
- The 5-number summary is always good for
describing the distribution of quantitative data.
- The mean and its partner, the standard deviation,
should be used to describe the center and spread
of the distribution of quantitative data only
when the distribution is known to be symmetric,
since both are sensitive to outliers.
- The shape of the distribution of quantitative
data is better described using graphical displays
such as histograms.