Describing Distributions with Graphs and Numbers - PowerPoint PPT Presentation

About This Presentation

Slides: 34

Provided by: SY54

Learn more at: https://web.stcloudstate.edu

Transcript and Presenter's Notes

Title: Describing Distributions with Graphs and Numbers

1
Topic 2

Describing Distributions with Graphs and Numbers

2
[Diagram: data of size n are obtained from the target population (size N) by sampling or experiment; the data are summarized and visualized, and inference (estimation, testing) is made about the population.]
3
Parameter and Statistic

A parameter (in statistics) is a quantity that
defines a certain characteristic of a population.
Example: the average birthweight of all newborn babies.
Parameters are estimated based on a sample.
A statistic is a summary measure computed from
sample data. Note that a parameter is a summary
measure for an entire population.
A key use of a statistic is as an estimator for a
parameter.

4
Distributions

When we say that 62% of TAMUK students are Hispanic,
32% are white, 3% are African-American, and 3%
are others, we mean the DISTRIBUTION of TAMUK
students according to race is
Race               Percent
Hispanic           62
White              32
African-American   3
Others             3

The DISTRIBUTION of grades for a class could be
Grade   Percent
A       20
B       45
C       22
D       10
F       3

The DISTRIBUTION of weights of all men aged 30 in
Texas could be
Weight              Percent
Less than 130 lb.   3
130 to 140 lb.      6
140 to 150 lb.      15
150 to 160 lb.      25
160 to 170 lb.      30
170 to 180 lb.      17
180 lb. or over     4

So, the DISTRIBUTION of a population describes
how the population is made up according to
some characteristic.
If one is concerned with the characteristic of
a population that can be described by a
categorical variable, e.g., race, he or she may
be interested in what percent of subjects fall in
each race category.
If one is concerned with the characteristic of
a population that can be described by a
continuous variable, e.g., weight, he or she may
be interested in what proportion of people fall
in a weight interval.
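
As a small sketch of how such a distribution can be handled in R, the weight table above can be stored as a named vector of percentages. The object names `weight_pct`, `total`, and `middle` are illustrative choices, not part of the original slides.

```r
# Percent of men in each weight class (values taken from the table above)
weight_pct <- c("<130" = 3, "130-140" = 6, "140-150" = 15, "150-160" = 25,
                "160-170" = 30, "170-180" = 17, "180+" = 4)

# A distribution accounts for the whole population, so the percents add to 100
total <- sum(weight_pct)

# Percent of men whose weight falls between 150 and 170 lb
middle <- sum(weight_pct[c("150-160", "160-170")])

print(total)
print(middle)
```

Adding the percents of adjacent classes gives the proportion of the population in a wider interval.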

8
Histograms

A histogram is a bar graph in which the
horizontal scale represents classes of data
values and the vertical scale represents
frequencies (or relative frequencies). The
heights of the bars correspond to the frequency
(or the relative frequency) values, and the bars
are drawn adjacent to each other without gaps.

Example: Construct a histogram for the systolic
blood pressures (SBP) of 20 men:
93 104 105 108 109 112 114 115
117 119
119 120 121 123 127 130 135 139
139 158

SBP Frequency
90-99 1
100-109 4
110-119 6
120-129 4
130-139 4
140-149 0
150-159 1
10
R Code

SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
# Class boundaries sit at half-units so no observation falls on a boundary
hist(SBP, breaks = c(89.5, 99.5, 109.5, 119.5, 129.5, 139.5,
                     149.5, 159.5, 169.5), col = 3)
Copy and paste this code into R to see the histogram.
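
The class frequencies in the SBP table can also be checked in R. The following is a minimal sketch using `cut()` and `table()`; the names `breaks`, `freq`, and `rel_freq` are illustrative and do not appear in the slides.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

# Class boundaries matching the classes 90-99, 100-109, ..., 150-159
breaks <- seq(89.5, 159.5, by = 10)

# cut() assigns each value to a class; table() counts the values per class,
# including classes that contain no observations
freq <- table(cut(SBP, breaks = breaks))
print(freq)

# Relative frequencies: proportion of the 20 observations in each class
rel_freq <- freq / length(SBP)
print(rel_freq)
```

Plotting `rel_freq` instead of `freq` changes only the vertical scale of the histogram, not its shape.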

11
Pie Charts
Pie chart: a circle with one slice for each
category. The size of each slice corresponds to
the percentage of observations in the category.
13
Bar Graph for European Parliament in 2004
14
Pareto Chart: Bar Graph with Categories Ordered
by Frequency, from the Tallest Bar to the
Shortest
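The ordering behind a Pareto chart can be sketched in R as follows. The example reuses the grade percentages from the Distributions slide; the names `grade_pct` and `pareto_order` and the plot labels are illustrative assumptions.

```r
# Grade distribution (percent) from the Distributions slide
grade_pct <- c(A = 20, B = 45, C = 22, D = 10, F = 3)

# Order the categories from the most frequent to the least frequent
pareto_order <- sort(grade_pct, decreasing = TRUE)
print(pareto_order)

# Drawing the bars in that order gives a Pareto chart
barplot(pareto_order, ylab = "Percent", main = "Pareto chart of grades")
```

Without the `sort()` step the same call draws an ordinary bar graph in the original category order.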
15
Measuring the Center: the Mean and Median

The distribution of data or a population can be
displayed graphically. In practice, we also want
to know where the center of a distribution is.
The mean and median are common measures of the
center of a distribution.
The mean of n observations x1, x2, ..., xn, denoted
$\bar{x}$, is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$.
Example: The selling prices (in dollars) of 5
single-family homes are 198000, 219000, 175000,
260000, 630000. Find the mean price.

16
The Mean is Sensitive to Outliers

If the 5th home were 360000, then the mean price
would be ___. The significant difference in means
is due primarily to the 5th price, which is
called an outlier.
If we construct a histogram or a stem plot for
the data of these 5 prices, the distribution of
the data can be seen to be skewed to the right.
This skewness is caused by the outlier.
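
The effect of the outlier on the mean can be checked numerically. This is a minimal sketch; the variable names are illustrative.

```r
# The five selling prices from the example
prices <- c(198000, 219000, 175000, 260000, 630000)
mean_original <- mean(prices)

# Replace the 5th price with 360000, as described on this slide
prices_alt <- prices
prices_alt[5] <- 360000
mean_alt <- mean(prices_alt)

print(c(mean_original, mean_alt))

# Change in the mean caused by the single replaced observation
print(mean_original - mean_alt)
```

Changing one observation out of five moves the mean from 296400 to 242400, a shift of 54000.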

17
The Median

Another measure of center of a distribution is
the median.
Given n observations x1, x2, ..., xn, the median,
denoted M, is defined as the number such that
half the observations are smaller and half are
larger.
To find the median of n observations, we first
sort the observations in order, then pick the
midpoint.
Example: Find the median of the 5 prices 198000,
219000, 175000, 260000, and 630000.
What if we have 6 prices: 198000, 219000, 175000,
260000, 630000, and 230000?
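
Both cases can be worked in R with `sort()` and `median()`. The sketch below uses illustrative variable names.

```r
# Five prices: the median is the middle (3rd) value of the sorted list
prices5 <- c(198000, 219000, 175000, 260000, 630000)
print(sort(prices5))
median5 <- median(prices5)

# Six prices: the median is the average of the 3rd and 4th sorted values
prices6 <- c(prices5, 230000)
print(sort(prices6))
median6 <- median(prices6)

print(c(median5, median6))
```

With five prices the median is 219000; with six it is (219000 + 230000)/2 = 224500.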

18
Location of the Median

Given n observations, the location of the median
in the ordered list is always (n + 1)/2.
When is the location of a median an integer? When
is it not an integer?
If the location of a median is 4.5, it means that
the median is halfway between the 4th and 5th
observations in the ordered list. What does it
mean if the location is 7?
Find the median and its location for the data 2, 5,
1, 0, 9.
Find the median and its location for the data 0, 3,
1, -2, 7, 4.
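
The location rule can be sketched as a small R function. `median_location` is a hypothetical helper written for this illustration, not a built-in R function.

```r
# Hypothetical helper: returns the location (n + 1)/2 and the median itself
median_location <- function(x) {
  s <- sort(x)
  loc <- (length(s) + 1) / 2
  # floor(loc) and ceiling(loc) coincide when loc is an integer;
  # otherwise they are the two positions on either side of loc
  med <- (s[floor(loc)] + s[ceiling(loc)]) / 2
  list(location = loc, median = med)
}

a <- median_location(c(2, 5, 1, 0, 9))
b <- median_location(c(0, 3, 1, -2, 7, 4))
print(a)
print(b)
```

For the first data set the location is 3 and the median is 2; for the second the location is 3.5, so the median is halfway between 1 and 3, which is also 2.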

19
Example: Find the Mean and Median from a Stem Plot
1 | 69
2 | 455
3 | 334477
4 | 0255669
5 |
6 |
7 | 3
(a) What are the observations? (b) Find the
mean. (c) Find the median and its location.
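One way to check the answers is to read the observations off the stem plot (stem = tens digit, leaf = ones digit) and pass them to R. The vector name `obs` is illustrative, and the layout printed by `stem()` may differ from the slide.

```r
# Observations read from the stem plot: stem = tens, leaf = ones
obs <- c(16, 19,
         24, 25, 25,
         33, 33, 34, 34, 37, 37,
         40, 42, 45, 45, 46, 46, 49,
         73)

n <- length(obs)
location <- (n + 1) / 2   # position of the median in the ordered list

print(mean(obs))
print(median(obs))
print(location)

# R's own text stem plot of the same data
stem(obs)
```

There are 19 observations, the median sits at location 10, and here the mean and the median are both 37.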
20
Comparing the Mean and Median

For a symmetric distribution, mean = median.
For a right-skewed distribution, mean > median.
For a left-skewed distribution, mean < median.

21
Mean, Median, and Mode
The distribution of data is symmetric
The distribution is skewed to the left
The distribution is skewed to the right
22
Measuring the Spread: The Quartiles

The spread of a distribution measures how
dispersed the values of the distribution are.
The middle half of a distribution is marked out
by two quartiles:
The 1st quartile Q1 is the number such that 25%
of all values are smaller.
The 3rd quartile Q3 is the number such that 75%
of all values are smaller.
The median of a distribution is also called the
2nd quartile, which is the number such that 50% of
all values are smaller.
Note also that quartiles defined this way are not
unique: different conventions can give slightly
different values.
To find these quartiles, we need to sort the
data and find the locations of the quartiles.

23
Example: Find Quartiles

1. Given data
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73
46 45 45,
find Q1, M, and Q3.
2. Given data
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73
46 45 45 31,
find Q1, M, and Q3.
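
The slides do not state which quartile convention they use. The sketch below assumes one common textbook rule: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when n is odd. `quartiles_by_halves` is a hypothetical helper, not a built-in R function.

```r
# Hypothetical helper: quartiles as medians of the lower and upper halves,
# excluding the overall median from both halves when n is odd
quartiles_by_halves <- function(x) {
  s <- sort(x)
  n <- length(s)
  half <- floor(n / 2)
  lower <- s[1:half]
  upper <- s[(n - half + 1):n]
  c(Q1 = median(lower), M = median(s), Q3 = median(upper))
}

data1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
           42, 40, 37, 34, 49, 73, 46, 45, 45)
data2 <- c(data1, 31)

q_a <- quartiles_by_halves(data1)
q_b <- quartiles_by_halves(data2)
print(q_a)
print(q_b)

# R's default quantile() interpolates differently
q_r <- quantile(data1, probs = c(0.25, 0.5, 0.75), names = FALSE)
print(q_r)
```

Under this rule the first data set gives Q1 = 25, M = 37, Q3 = 45 and the second gives Q1 = 28, M = 35.5, Q3 = 45. R's default `quantile()` returns 29 for the first quartile of the first data set, which illustrates that quartiles are not unique.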

24
The Five-Number Summary and Boxplots

Q1, M, and Q3 give information about the
middle half of a distribution; the tails of a
distribution can be described by the
smallest and largest values of the distribution.
These five values give an intuitive picture of a
distribution and are called the 5-number summary.
The Five-Number Summary of a distribution
describes both the center and the spread of a
distribution.
The 5 numbers can be displayed in an (ordinary)
boxplot, which consists of
(a) a central box spanning the quartiles Q1 and
Q3,
(b) a line in the box marking the median M, and
(c) two lines extended from the box out to the
smallest and largest
observations.
Compared with its competitors, histograms and stem
plots, a boxplot shows less detail about the
distribution. Boxplots are best used for
side-by-side comparison of more than one
distribution. The boxplot of a distribution
should be interpreted in terms of skewness, the
center, and the spread.

25
Compare the two boxplots in terms of skewness,
spread, and center.
The side-by-side boxplot is produced with the
following R code:
x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
       88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
       76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
# Stack both sections into one data frame with a grouping column
z <- data.frame(Grade = c(x, y),
                Section = c(rep('Section 01', length(x)),
                            rep('Section 02', length(y))))
# One box per section; data = z is used here in place of attach(z)
boxplot(Grade ~ Section, data = z, col = 2:3)
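The numbers behind the two boxplots can be printed side by side with `fivenum()`. This is a minimal sketch; the matrix name `summaries` and the column labels are illustrative. Note that `fivenum()` reports hinges, which may differ slightly from quartiles found by another convention.

```r
x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
       88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
       76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)

# One row of five numbers per section
summaries <- rbind("Section 01" = fivenum(x), "Section 02" = fivenum(y))
colnames(summaries) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(summaries)
```

The medians are 77.5 for Section 01 and 80.5 for Section 02, which is where the center lines of the two boxes are drawn.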
26
Spotting Suspected Outliers: The 1.5×IQR Rule

In a boxplot, the distance between Q1 and Q3 (the
range of the center half of the data) is a more
resistant measure of spread than the full range
from the smallest to the largest observation.
This distance is called the inter-quartile range,
denoted IQR; that is,
IQR = Q3 - Q1.
The 1.5×IQR Rule for outliers: An observation is
called a suspected outlier if it falls more than
1.5×IQR above Q3 or below Q1.
Example: Find Q1, Q3, and IQR of the data
72 83 91 84 84 78 90 85 67 91 80 85 67 65 95.
Identify any suspected outliers.
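
The rule can be sketched in R as follows. The sketch assumes R's default `quantile()` convention for Q1 and Q3, which may differ from the hand rule used in class; the variable names are illustrative.

```r
scores <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

# Q1 and Q3 under R's default quantile convention
q <- quantile(scores, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: 1.5 x IQR below Q1 and above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Observations outside the fences are suspected outliers
suspected <- scores[scores < lower_fence | scores > upper_fence]

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr))
print(c(lower_fence, upper_fence))
print(suspected)
```

Here R gives Q1 = 75, Q3 = 87.5, and IQR = 12.5, so the fences are 56.25 and 106.25. Taking quartiles as medians of the lower and upper halves instead gives Q1 = 72, Q3 = 90, IQR = 18, and fences 45 and 117. Under either convention no observation is a suspected outlier.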

27
A Modified Boxplot
28
R code
myBoxPlot <- function(x, col = 'gray') {
  boxplot(x, col = col)
  # Label the five numbers next to the box
  text(rep(1.3, 5), fivenum(x),
       labels = c('minimum', 'lower hinge', 'median',
                  'upper hinge', 'maximum'), col = 'blue')
  q <- quantile(x, probs = c(0.25, 0.5, 0.75))
  iqr <- q[3] - q[1]
  lowerfence <- q[1] - 1.5 * iqr
  upperfence <- q[3] + 1.5 * iqr
  # Draw and label the two fences
  abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
  text(rep(1.3, 2), c(lowerfence, upperfence),
       labels = c('lower fence', 'upper fence'), col = 'blue')
  # An observation outside the fences makes this product positive
  Outliers <- which((x - lowerfence) * (x - upperfence) > 0)
  if (length(Outliers) != 0)
    text(rep(0.63, length(Outliers)), x[Outliers],
         labels = paste(rep('Obs.', length(Outliers)), Outliers),
         col = 'red')
}
Rainfall <- c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5,
              13.0, 10.1, 10.1, 10.1, 10.8, 7.8, 14.1, 10.6,
              10.0, 11.5, 13.6, 12.1, 12.0, 9.3, 7.7, 11.0,
              6.9, 9.5, 16.5, 9.3, 9.4, 8.7, 9.5, 11.6, 12.1,
              8.0, 10.7, 13.9, 11.3, 11.6, 10.4)
myBoxPlot(Rainfall)
29
Measuring Spread: the Standard Deviation

Interestingly, the mean is not among the 5-number
summary of a distribution. The closest partner of
the mean is the standard deviation, which is
another measure of the spread of a distribution.
The standard deviation measures how far the
observations are from their mean.

30
Calculation of Standard Deviations

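The variance s^2 of a set of n observations is an
average of the squared deviations from the
mean, computed with divisor n - 1:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

The standard deviation s is the square root of
the variance:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$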

31
The standard deviation Example

Example (Calculating the standard deviation s):
Metabolic rates of 7 men who took part in a
study of dieting. The units are calories per 24
hours.
1792 1666 1362 1614 1460 1867 1439
Find the mean first: $\bar{x} = 11200 / 7 = 1600$.

32
(Contd.)
Observation   Deviation   Squared deviation
1792           192        36864
1666            66         4356
1362          -238        56644
1614            14          196
1460          -140        19600
1867           267        71289
1439          -161        25921

sum              0        214870
The variance: $s^2 = 214870 / (7 - 1) \approx 35811.67$
The standard deviation: $s = \sqrt{35811.67} \approx 189.24$
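The table can be reproduced step by step in R. This sketch assumes the usual sample variance with divisor n - 1, which is also what R's `var()` and `sd()` use; the variable names are illustrative.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar <- mean(rates)             # mean of the 7 observations
deviations <- rates - xbar      # second column of the table
squared <- deviations^2         # third column of the table

# Sample variance: sum of squared deviations divided by n - 1
variance <- sum(squared) / (length(rates) - 1)
s <- sqrt(variance)

print(data.frame(rates, deviations, squared))
print(c(sum(deviations), sum(squared)))
print(c(variance = variance, s = s))

# Built-in equivalents for comparison
print(c(var(rates), sd(rates)))
```

The deviations always sum to 0, which is a useful check on the arithmetic before squaring.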
33
Summary of Strategies for Exploring Data on a
Single Quantitative Variable

The 5-number summary is always good for
describing the distribution of quantitative data.
The mean and its partner standard deviation
should be used to describe the center and spread
of the distribution of quantitative data only
when the distribution is known to be symmetric,
since both are sensitive to outliers.
The shape of the distribution of quantitative
data is better described using graphical displays
such as histograms.