Title: Numerical descriptions of distributions
1. Numerical descriptions of distributions
- Describe the shape, center, and spread of a
distribution (for shape, see slide 6 below...)
- Center: mean and median
- Spread: range, IQR, standard deviation
- We treat these as aids to understanding the
distribution of the variable at hand
- The mean is often called the "average" and is in
fact the arithmetic average ("add all the values
and divide by the number of observations").
2. Mathematical notation
Learn right away how to get the mean with your
calculator and with JMP
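The notation shown on this slide did not survive in the transcript. As a sketch, assuming the usual convention of writing the n observations as x_1, ..., x_n and the mean as x-bar (these symbols are an assumption, not taken from the slide), the "add all the values and divide by the number of observations" rule reads:

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```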
3. Your numerical summary must be meaningful!
The distribution of women's heights appears
coherent and symmetrical. The mean is a good
numerical summary.
4. [Figure: plot with a horizontal axis running from 58 to 84]
A single numerical summary here would not make
sense.
5.
- The median (M) is often called the "middle" value
and is the value at the midpoint of the
observations when they are ranked from smallest
to largest value.
- Arrange the data from smallest to largest.
- If n is odd, then the median is the single
observation in the center (at the (n+1)/2
position in the ordering).
- If n is even, then the median is the average of
the two middle observations (the (n+1)/2
position falls in between them).
- In Table 1.10, calculate the mean and median for
the 2-seater cars' city m.p.g. to see that the
mean is more sensitive to outliers than the
median (use the TI-83).
- Also, try with JMP.
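The median rule and the outlier comparison above can be sketched in Python. The values below are made up for illustration (they are not the Table 1.10 data), and `median_by_rank` is a hypothetical helper name, not something from the textbook, JMP, or the TI-83.

```python
from statistics import mean, median


def median_by_rank(values):
    """Median via the (n + 1) / 2 position rule (hypothetical helper)."""
    ordered = sorted(values)  # arrange the data from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single center observation, at 1-based position (n + 1) / 2
        return ordered[mid]
    # n even: average the two observations on either side of position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative values only, NOT the Table 1.10 data
mpg = [17, 18, 19, 20, 21, 22, 23]
mpg_with_outlier = mpg + [60]  # the same data plus one extreme value

print("without outlier:", mean(mpg), median_by_rank(mpg))
print("with outlier:   ", mean(mpg_with_outlier), median_by_rank(mpg_with_outlier))
```

In this made-up data set the single extreme value moves the mean from 20 to 25, while the median only moves from 20 to 20.5.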
6. Skewness
[Figure: relative positions of the mode, median, and mean for three distribution shapes: SYMMETRIC (the three labels appear together), SKEWED LEFT (negatively), and SKEWED RIGHT (positively)]
7. Mean and median of a distribution with outliers
[Figure: axis label "Percent of people dying"]
8. Impact of skewed data
9.
- Spread: percentiles, quartiles (Q1 and Q3), IQR,
5-number summary (and boxplots), range, standard
deviation
- The pth percentile of a variable is a data value
such that p% of the values of the variable are
less than or equal to it.
- The lower (Q1) and upper (Q3) quartiles are
special percentiles dividing the data into
quarters (fourths). Get them by finding the
medians of the lower and upper halves of the data.
- IQR = interquartile range = Q3 - Q1 = spread of
the middle 50% of the data. The IQR is used with
the so-called 1.5 × IQR criterion for outliers.
Know this!
10. Measure of spread: the quartiles
The first quartile, Q1, is the value in the
sample that has 25% of the data less than or
equal to it (that is, the median of the lower half
of the sorted data, excluding M). The third
quartile, Q3, is the value in the sample that has
75% of the data less than or equal to it (that is,
the median of the upper half of the sorted data,
excluding M).
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
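The "median of each half, excluding M" recipe can be sketched in Python as follows. The data set behind the slide's values (2.2, 3.4, 4.35) is not in the transcript, so the numbers below are hypothetical, as is the helper name `quartiles_excluding_median`. Statistical software may use a slightly different quartile convention, so results from JMP or a calculator will not necessarily match this rule exactly.

```python
from statistics import median


def quartiles_excluding_median(values):
    """Return (Q1, M, Q3) using the median-of-halves rule (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations below the median position
    # When n is odd, skip the middle observation (M itself) for the upper half
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Hypothetical data, NOT the data set shown on the slide
data = [1, 3, 4, 6, 7, 9, 10, 12, 15]
q1, m, q3 = quartiles_excluding_median(data)
iqr = q3 - q1

print("Q1 =", q1, " M =", m, " Q3 =", q3, " IQR =", iqr)
```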
11. Five-number summary and boxplot
[Figure: BOXPLOT]
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
Five-number summary: min, Q1, M, Q3, max
12. Boxplots for skewed data
Comparing boxplots for a normal and a
right-skewed distribution
Boxplots remain true to the data and clearly
depict symmetry or skew.
13.
- 5-number summary: min, Q1, median, Q3, max
- When plotted, the 5-number summary is a boxplot.
We can also draw a modified boxplot to show
outliers (mild and extreme). Boxplots have less
detail than histograms and are often used for
comparing distributions, e.g., Fig. 1.19, p. 37
and below...
14.
Distance to Q3 = 7.9 - 4.35 = 3.55
Q3 = 4.35
Interquartile range = Q3 - Q1 = 4.35 - 2.2 = 2.15
Q1 = 2.2
Individual 25 has a value of 7.9 years, which is
3.55 years above the third quartile. This is more
than 3.225 years, which is 1.5 × IQR. Thus,
individual 25 is an outlier by our 1.5 × IQR rule.
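The arithmetic on this slide can be checked with a short Python sketch. The quartiles and the value 7.9 come from the slide; the function name `is_outlier` is hypothetical. The slide only shows the check on the high side; the sketch also tests the low side (more than 1.5 × IQR below Q1), which is the usual form of the rule.

```python
import math


def is_outlier(x, q1, q3):
    """1.5 x IQR rule: flag x if it lies more than 1.5 * IQR outside the quartiles."""
    iqr = q3 - q1
    return x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr


q1, q3 = 2.2, 4.35  # quartiles from the slide
value = 7.9         # individual 25

iqr = q3 - q1                   # about 2.15
threshold = 1.5 * iqr           # about 3.225
distance_above_q3 = value - q3  # about 3.55

# Rounding only hides floating-point noise in the printed values
print(round(iqr, 3), round(threshold, 3), round(distance_above_q3, 3))
print("outlier:", is_outlier(value, q1, q3))
```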
15. (No Transcript)
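The transcript for this slide is missing. Judging from the slides that follow, which mention a standard deviation formula on the previous page and division by n-1, it presumably showed the sample variance and standard deviation. A reconstruction under that assumption, writing the n observations as x_1, ..., x_n and their mean as x-bar:

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```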
16. Look at Example 1.19 on page 41 (section 1.2,
8/11); see Fig. 1.21 for a graph of deviations
from the mean... Metabolic rates for 7 men in a
dieting study: 1792, 1666, 1362, 1614, 1460,
1867, 1439. Mean = 1600 calories, s = 189.24
calories.
Be sure you know how to compute the
standard deviation with JMP and with your
calculator, since it's almost never done by hand
with the previous page's formula...
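As a cross-check on the numbers above, here is a minimal Python sketch that computes the mean and standard deviation of the seven metabolic rates, both step by step (dividing the sum of squared deviations by n-1) and with the standard library's `statistics.stdev`. It stands in for the JMP or calculator steps, which the transcript does not spell out.

```python
from statistics import mean, stdev

# Metabolic rates (calories) for the 7 men, as listed on the slide
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

x_bar = mean(rates)
deviations = [x - x_bar for x in rates]  # deviations from the mean

# Step by step: sum of squared deviations, divided by n - 1, then square root
variance = sum(d ** 2 for d in deviations) / (len(rates) - 1)
s_by_hand = variance ** 0.5

# Library routine for the sample standard deviation (also divides by n - 1)
s = stdev(rates)

print("mean =", x_bar, " s =", round(s, 2))
```

Both routes agree with the slide: a mean of 1600 and s of about 189.24 calories.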
17.
- Why do we square the deviations? There are two
technical reasons that we'll see when we discuss
the normal distribution in the next section.
- Why do we use the standard deviation (s) instead
of the variance (s^2)? s^2 has units which are the
squares of the original units of the data.
- Why do we divide by n-1 instead of n? n-1 is
called the number of degrees of freedom: since
the sum of the deviations is zero, the last
deviation can always be found if we know n-1 of
them. Be careful when using the TI-83, since it
calculates both versions (division by n and by
n-1).
- Which measure of spread is best? The 5-number
summary is better than the mean and s.d. for
skewed data; use the mean and s.d. for symmetric
data.
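The degrees-of-freedom argument can be illustrated with the metabolic rates from the previous slide. This is a sketch in Python rather than on the TI-83: `statistics.stdev` divides by n-1 and `statistics.pstdev` divides by n, which mirrors the two versions a calculator reports.

```python
import math
from statistics import pstdev, stdev

# Metabolic rates (calories) for the 7 men from the previous slide
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
n = len(rates)
x_bar = sum(rates) / n
deviations = [x - x_bar for x in rates]

# The deviations sum to zero, so the last one is fixed by the other n - 1
total = sum(deviations)
last_from_others = -sum(deviations[:-1])

s_n_minus_1 = stdev(rates)  # divides by n - 1
s_n = pstdev(rates)         # divides by n
ratio = s_n_minus_1 / s_n   # should equal sqrt(n / (n - 1))

print("sum of deviations:", total)
print("last deviation:", deviations[-1], " recovered:", last_from_others)
print("divide by n-1:", round(s_n_minus_1, 2), " divide by n:", round(s_n, 2))
```

Dividing by n always gives the smaller of the two values, so it matters which one you read off.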
18. What should you use, when, and why?
- Arithmetic mean or median?
- Middletown is considering imposing an income tax
on citizens. City hall wants a numerical summary
of its citizens' income to estimate the total tax
base.
- In a study of the standard of living of typical
families in Middletown, a sociologist makes a
numerical summary of family income in that city.
- Mean: Although income is likely to be
right-skewed, the city government wants to know
about the total tax base.
- Median: The sociologist is interested in a
typical family and wants to lessen the impact
of extreme incomes.
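A small Python sketch shows why the two users want different summaries. The incomes are hypothetical figures chosen to be right-skewed; they are not Middletown data. The point is that the mean times the number of families recovers the total (the tax base), while the median stays near the bulk of the incomes.

```python
from statistics import mean, median

# Hypothetical family incomes in dollars, right-skewed by one very large value
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 400_000]

# City hall: mean * number of families gives back the total income
total_income = mean(incomes) * len(incomes)

# Sociologist: the median describes a typical family and ignores the extreme
typical_income = median(incomes)

print("mean:", mean(incomes), " total:", total_income)
print("median:", typical_income)
```

Here the mean (100,000) is pulled well above what five of the six families earn, but it is the right number for estimating the total; the median (42,500) is the better description of a typical family.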
19.
- Finish reading section 1.2
- Be sure to go over the Summary at the end of each
section and know all the terminology
- Do 1.56, 1.62-1.64, 1.67, 1.69, 1.75-1.77
(Mean/Median Applet), 1.78, 1.79
- Use JMP for any problem requiring more than very
simple computations, or use the TI-83 for
numerical (but not graphical) analysis...