Title: Chapter 1 Looking at Data Distributions
1Chapter 1 Looking at Data Distributions
2What is statistics?
- The science of collecting, organizing, and interpreting numerical facts (data) with the goal of gaining understanding about a problem
- Always relate calculations back to the problem at hand, as numbers alone are not meaningful
- Requires thinking and judgment about data
3Variables
- A variable is a characteristic of an individual or object of interest (e.g., a person, plant, or animal)
- Variables can take on different values for different individuals
- Ex.
Individual    Variable
Person        Age or height
Flower        Color
Bird          Wingspan
4Distributions
- The distribution of a variable tells us what values the variable takes on (for the group of individuals under consideration) and how often it takes them
- Ex. Consider 10 rose bushes in a garden
- What colors are represented?
- How many of each color?
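The rose bush example can be sketched in a few lines of Python. The ten colors below are made up for illustration (the slides do not list them), and `distribution` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical colors for the 10 rose bushes (not taken from the slides).
rose_colors = ["red", "pink", "red", "white", "yellow",
               "red", "pink", "white", "red", "pink"]

def distribution(values):
    """Return {value: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, 100 * count / n) for value, count in counts.items()}

rose_distribution = distribution(rose_colors)

# Which colors are represented, and how many of each?
for color, (count, percent) in sorted(rose_distribution.items()):
    print(f"{color:<7}{count:>3}{percent:>7.1f}%")
```

The dictionary keys answer "what colors are represented?" and the counts answer "how many of each color?".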
5Variables
Categorical
- Value falls into one of two or more groups, or categories
- Ex. Blood type, hair color
Quantitative
- Takes on numerical values
- Mathematical operations (such as averaging) make sense
- Ex. Height, age, number of credit cards owned
It makes sense to talk about the average height of the students in the class, but not the average blood type.
61.1 Displaying Distributions with Graphs
- For a categorical variable, the distribution lists the categories and the count or percent of individuals who fall into each one.
- How can we visually display this data?
- Bar graphs
- Each category is represented by a bar
- Pie charts
- The slices must represent parts of one whole
7Example: Top 10 causes of death in the United States, 2001
For each individual who died in the United States in 2001, we record the cause of death. The table is a summary of that information.
8Bar graphs: Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.
[Figure: bar graph of the top 10 causes of death in the United States, 2001]
9Top 10 causes of death in the United States, 2001
Bar graph sorted by rank → easy to analyze
Sorted alphabetically → much less useful
10Pie charts: Each slice represents a piece of one
whole. The size of a slice depends on what
percent of the whole this category represents.
Percent of people dying from top 10 causes of
death in the United States in 2000
11Make sure your labels match the data. Make sure all percents add up to 100.
[Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]
12How to Chart Quantitative Variables?
- Histograms: the numerical analog of the bar graph
- The range of values a variable can take on is divided into equal-size intervals (bins)
- The histogram shows the number of data points (observations) that fall into each interval (bin)
- Choosing the bin size is a judgment call
13Histogram
- Ex. Test 1 scores for 10 statistics students
Student   Score
1         75
2         99
3         79
4         71
5         66
6         82
7         89
8         0
9         53
10        73
[Figure: histogram with 10 bins; x-axis: test score, y-axis: number of students]
14What if we change the bin size?
[Figure: histogram with 4 bins; x-axis: test score, y-axis: number of students]
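To see how the bin size changes the picture, the Test 1 scores can be counted into equal-width bins. This is a minimal sketch: the bins are assumed to span 0 to 100 (the slides do not state the bin edges), and `histogram_counts` is a hypothetical helper.

```python
# Test 1 scores for the 10 students (slide 13).
scores = [75, 99, 79, 71, 66, 82, 89, 0, 53, 73]

def histogram_counts(values, low, high, n_bins):
    """Count observations in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        index = min(index, n_bins - 1)  # a value equal to `high` goes in the last bin
        counts[index] += 1
    return counts

# Assumed range 0-100: bins of width 10, then bins of width 25.
ten_bins = histogram_counts(scores, 0, 100, 10)
four_bins = histogram_counts(scores, 0, 100, 4)
print(ten_bins)
print(four_bins)
```

With 10 bins the score of 0 sits alone, far from the rest; with 4 bins the gap is still there but the detail in the 70s and 80s is lost.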
15Interpreting Histograms
- Look for the overall pattern of the data, and for any striking departures from that pattern
- Look for outliers: individual values that fall outside the overall pattern of a distribution
- Always watch out for outliers, and try to identify and explain them
- Ex. Was the statistics test really hard, or were there unusual circumstances for student 8? Did he not show up for class, or did he cheat on his exam? Should he be included in the distribution?
16Stem Plots
- Separate each observation into a stem (all but the final digit) and a leaf (the final digit)
- Write the stems in a vertical column with the smallest value at the top, and draw a vertical line to the right of the column
- Write each leaf in the row to the right of its stem, in increasing order
- Note: Some stems may have no leaves
17Creating a Stem Plot: Test scores of 10 students
Student   Score
1         75
2         99
3         79
4         71
5         66
6         82
7         89
8         0
9         53
10        73
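The stem plot steps can be sketched for these ten scores as follows. `stem_plot` is a hypothetical helper that assumes non-negative whole numbers and uses the final digit as the leaf.

```python
# Test scores of the 10 students (slide 17).
scores = [75, 99, 79, 71, 66, 82, 89, 0, 53, 73]

def stem_plot(values):
    """Return stem plot rows; stem = all but the final digit, leaf = final digit."""
    leaves = {}
    for v in sorted(values):          # sorting keeps leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves.setdefault(stem, []).append(leaf)
    rows = []
    # List every stem from smallest to largest, even those with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}".rstrip())
    return rows

rows = stem_plot(scores)
print("\n".join(rows))
```

Stems 1 through 4 print with no leaves, which makes the gap between the score of 0 and the rest of the class visible.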
18More on Stem Plots
- Back-to-back stem plots with a common stem may be useful for comparing two related distributions
- Stem plots don't work well for large data sets
- If each stem holds a large number of leaves, you can split each stem into two
- One for leaves 0-4
- One for leaves 5-9
- If observed values have too many digits, trim the numbers before making a stemplot
- Ex. Trim 1234 to 123; then 12 is the stem and 3 is the leaf.
- Indicate that the leaf unit is 10.
- See example 1.8 in text
19Describing Distributions
- We can describe the overall pattern of a distribution by its shape, center, and spread
- Center: For now, consider the center to be the midpoint
- The value with approximately half the observations above it and half the observations below it
- Spread: For now, describe it by giving the smallest and largest values
- Shape
- How many peaks does the distribution have?
- If one, unimodal
- If several, multimodal
- Is the distribution symmetric? Or skewed?
20Most common distribution shapes
- A distribution is symmetric if the right and left
sides of the histogram are approximately mirror
images of each other.
- A distribution is skewed to the right if the
right side of the histogram (side with larger
values) extends much farther out than the left
side. It is skewed to the left if the left side
of the histogram extends much farther out than
the right side.
Skewed distribution
21Time Plots
- A time plot of a variable plots each observation against the time at which it was measured
- Time always goes on the horizontal axis!
- Look for patterns over time
- A trend is a rise or fall that persists over time, despite small irregularities
- A pattern that repeats itself at regular intervals of time is called seasonal variation
22Ex. Retail price of fresh oranges over time
Time is on the horizontal (x) axis. The variable of interest (here, the retail price of fresh oranges) goes on the vertical (y) axis.
This time plot shows a regular pattern of yearly
variations. These are seasonal variations in
fresh orange pricing most likely due to similar
seasonal variations in the production of fresh
oranges. There is also an overall upward trend
in pricing over time. It could simply be
reflecting inflation trends or a more fundamental
change in this industry.
231.2 Describing Distributions with Numbers
- Recall: Distributions of variables are described by shape, center, and spread
- We now extend beyond inspecting stemplots and histograms to more precise definitions of center and spread
- Measures of center: the mean and the median
24The Mean (x-bar)
- To find the mean of a set of n observations, x_1, x_2, x_3, ..., x_n, add their values and divide by the number of observations:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
or
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Σ (Sigma) means sum
25Example: Test scores on 2nd exam for 10 statistics students
Exam scores: 80, 73, 92, 85, 75, 98, 93, 55, 80, 90
n = 10
$\bar{x} = \frac{80 + 73 + \dots + 90}{10} = \frac{821}{10} = 82.1$
26- Note: The mean is sensitive to a few extreme observations
- NOT a resistant measure of center
- What if there were an 11th student in the class who didn't show up and received a 0 on the 2nd exam?
- How would this affect the mean?
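One way to answer the question is to compute the mean with and without the extra score of 0. A minimal sketch using the exam scores from slide 25 (`mean` is a hypothetical helper):

```python
# 2nd exam scores for the 10 students (slide 25).
exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

def mean(values):
    """Add the observations and divide by how many there are."""
    return sum(values) / len(values)

mean_10 = mean(exam_scores)          # the original 10 students
mean_11 = mean(exam_scores + [0])    # with the 11th student's 0 included

print(f"mean of 10 scores: {mean_10:.1f}")
print(f"mean of 11 scores: {mean_11:.1f}")
```

A single score of 0 pulls the mean down by several points, which is what "not resistant" means in practice.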
27The Median (M)
- The median is the midpoint of a distribution
- Half the observations are smaller and half the observations are larger than M
- To find the median
- Arrange the data from smallest to largest
- If the number of observations (n) is odd, M is the center observation in the ordered list, located (n+1)/2 observations up from the bottom
- If the number of observations (n) is even, M is the mean of the two center observations in the ordered list. M is still located at the (n+1)/2 position
28Finding the Median
- Consider again the exam scores for 10 students
Exam scores: 80, 73, 92, 85, 75, 98, 93, 55, 80, 90
- Arrange the data from smallest to largest
55, 73, 75, 80, 80, 85, 90, 92, 93, 98
- n = 10, so n is even and M is the mean of the 5th and 6th observations in the ordered list.
- M is located at the (10+1)/2 = 5.5th position in the ordered list
- M = (80+85)/2 = 82.5
29- What happens to M if we include the 11th student who received a 0 in the data set?
Exam scores (in order): 0, 55, 73, 75, 80, 80, 85, 90, 92, 93, 98
- There are now 11 data points, so n = 11, which is odd
- M is therefore the center observation in the ordered list, located in position (11+1)/2 = 6
- M = 80
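The median rule for odd and even n can be sketched as a short function and checked against both versions of the exam data. `median` here is a hypothetical helper written to mirror the steps on the slides.

```python
exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

def median(values):
    """Median by the (n+1)/2 rule: center value, or mean of the two center values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # odd n: the center observation
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: mean of the two centers

median_10 = median(exam_scores)
median_11 = median(exam_scores + [0])

print(median_10, median_11)
```

Adding the 0 moves the median only from 82.5 to 80, a much smaller shift than the one seen in the mean.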
30Comparing the mean and the median
The mean and the median are the same only if the
distribution is symmetrical. The median is a
measure of center that is resistant to skew and
outliers. The mean is not.
[Figure: positions of the mean and median for a symmetric distribution and for left-skewed and right-skewed distributions]
31Impact of skewed data
32Measure of spread: the quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (that is, it is the median of the lower half of the sorted data, excluding M). The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (that is, it is the median of the upper half of the sorted data, excluding M).
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
33Five-number summary and boxplot
Five-number summary: min, Q1, M, Q3, max
Smallest = min = 0.6
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
Largest = max = 6.1
[Figure: boxplot drawn from the five-number summary]
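The quartile rule (median of each half, excluding M) can be sketched in code. The data behind the values 2.2, 3.4, and 4.35 are not listed in the slides, so this example reuses the ten exam scores from slide 25; `five_number_summary` is a hypothetical helper.

```python
exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max), with quartiles as medians of each half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median itself is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary(exam_scores)
print(summary)
```

Software packages often use slightly different quartile conventions, so their Q1 and Q3 may not match this hand rule exactly.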
34Boxplots for skewed data
Comparing box plots for a normal and a
right-skewed distribution
Boxplots remain true to the data and depict
clearly symmetry or skew.
35Suspected Outliers
- Outliers are troublesome data points, and it is important to be able to identify them.
- One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (the distance between Q1 and Q3).
- We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the 1.5 × IQR rule for outliers.
36Q3 = 4.35
Q1 = 2.2
Interquartile range = Q3 - Q1 = 4.35 - 2.2 = 2.15
Distance to Q3 = 7.9 - 4.35 = 3.55
Individual 25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years. Thus, individual 25 is a suspected outlier.
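The same check can be written as a small function. This sketch uses the quartiles quoted on the slide (Q1 = 2.2, Q3 = 4.35); the function names are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_suspected_outlier(x, q1, q3, k=1.5):
    """True if x lies more than k x IQR below Q1 or above Q3."""
    low, high = iqr_fences(q1, q3, k)
    return x < low or x > high

low_fence, high_fence = iqr_fences(2.2, 4.35)
print(f"fences: {low_fence:.3f} to {high_fence:.3f}")
print(is_suspected_outlier(7.9, 2.2, 4.35))  # individual 25
print(is_suspected_outlier(6.1, 2.2, 4.35))  # the largest value in the boxplot
```

Comparing a value with the upper fence Q3 + 1.5 × IQR is equivalent to comparing its distance from Q3 with 1.5 × IQR, as done on the slide.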
37Measure of Spread: Standard Deviation
- The most common numerical description of a distribution is given by the mean to measure center and the standard deviation (s) to measure spread
- Looks at how far observations are from their mean
- The variance of a set of observations (s²) is the average of the squares of the deviations of the observations from their mean, where the sum is divided by n-1 rather than n:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
38- The standard deviation (s) is then given by the square root of the variance:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- The deviations $x_i - \bar{x}$ are large in magnitude if observations lie far from the mean
- Some deviations will be positive and some will be negative, depending on whether the observations are larger or smaller than the mean
- The sum of the deviations of the observations from the mean will always be zero
- s and s² will be large for widely spread distributions and small if observations do not lie far from the mean
39- Why divide by n-1?
- Since the sum of the deviations is zero, the last deviation can be calculated once the other n-1 are known
- Thus we say there are only n-1 degrees of freedom
- Why emphasize s over s²?
- s has the same unit of measurement as the original observations
- Natural measure of spread for the Normal distribution (section 1.3)
40Calculations
Women's height (inches)
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = n - 1 = 13
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
41Mean = 63.4 inches, s = 2.56 inches
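The calculation can be sketched in code. The individual heights are not listed in the slides, so the first part only reproduces the arithmetic from the quoted totals; the second part applies the full n-1 formula to the exam scores from slide 25. The function names are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Slide 40 totals: sum of squared deviations 85.2, df = n - 1 = 13.
variance_heights = 85.2 / 13
sd_heights = math.sqrt(variance_heights)
print(f"{variance_heights:.2f} inches squared, {sd_heights:.2f} inches")

# Full formula on the exam scores (slide 25).
exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
x_bar = sum(exam_scores) / len(exam_scores)
sum_of_deviations = sum(x - x_bar for x in exam_scores)  # zero up to rounding
s_exam = sample_sd(exam_scores)
print(f"s = {s_exam:.2f}")
```

Python's standard `statistics.stdev` uses the same n-1 divisor, so it can serve as a cross-check.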
42Properties of the Standard Deviation
- s measures spread about the mean
- Only use when mean is measure of center
- s = 0 only when there is NO spread
- Occurs when all observations have the same value
- Otherwise, s > 0
- Like the mean, s is not resistant
- A few outliers can make s very large
- Remember, the deviation is squared!
43Choosing among summary statistics
- Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don't have outliers. → Plot the mean and use the standard deviation for error bars.
- Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: a boxplot next to a plot of the mean with SD error bars]
44What should you use, when, and why?
- Arithmetic mean or median?
- Middletown is considering imposing an income tax on citizens. City hall wants a numerical summary of its citizens' income to estimate the total tax base.
- In a study of the standard of living of typical families in Middletown, a sociologist makes a numerical summary of family income in that city.
- Mean: Although income is likely to be right-skewed, the city government wants to know about the total tax base.
- Median: The sociologist is interested in a typical family and wants to lessen the impact of extreme incomes.
45Changing the unit of measurement
- Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: x_new = a + bx.
- Temperatures can be expressed in degrees Fahrenheit or degrees Celsius. Temperature_Fahrenheit = 32 + (9/5) × Temperature_Celsius, which has the form a + bx.
- Linear transformations do not change the basic shape of a distribution (skew, symmetry, multimodal). But they do change the measures of center and spread:
- Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and spread (IQR, s) by b.
- Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles, but it does not change measures of spread (IQR, s).
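These rules can be checked numerically with the Celsius-to-Fahrenheit conversion. The five temperatures below are made up for illustration.

```python
import statistics

# Hypothetical temperatures in degrees Celsius (not from the slides).
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]

# Linear transformation x_new = a + b*x with a = 32 and b = 9/5.
a, b = 32.0, 9.0 / 5.0
fahrenheit = [a + b * c for c in celsius]

mean_c, mean_f = statistics.mean(celsius), statistics.mean(fahrenheit)
median_c, median_f = statistics.median(celsius), statistics.median(fahrenheit)
sd_c, sd_f = statistics.stdev(celsius), statistics.stdev(fahrenheit)

# Center is multiplied by b and shifted by a; spread is only multiplied by b.
print(f"mean:   {mean_c:.2f} C -> {mean_f:.2f} F")
print(f"median: {median_c:.2f} C -> {median_f:.2f} F")
print(f"s:      {sd_c:.2f} C -> {sd_f:.2f} F")
```

The added constant 32 shows up in the mean and median but not in s.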
461.3 Density Curves and Normal Distributions
- A density curve is a mathematical idealization of a distribution of data, picturing the overall pattern of the data and ignoring minor irregularities as well as any outliers
- A smooth approximation to the irregular bars of a histogram
- A density curve is always on or above the horizontal axis, and has area exactly 1 beneath it
47- Recall: in a histogram, the areas of the bars represent either counts or proportions of observations (the two differ only in the scale of the y-axis)
- If proportions, then the total area of all bars is 1, and the area of the shaded bars gives the proportion of test scores 6.0 or lower
- Similarly, the total area under a density curve is 1, and the area under the density curve for a range of values is the proportion of all observations for that range.
Histogram of a sample, with the smoothed density curve theoretically describing the population.
48- Density curves come in any imaginable shape.
- Some are well known mathematically and others aren't.
49Median and mean of a density curve
The median of a density curve is the equal-areas point: the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if it were made of solid material.
The median and mean are the same for a symmetric
density curve. The mean of a skewed curve is
pulled in the direction of the long tail.
50Notation
- We use x̄ and s to denote the mean and standard deviation, respectively, as computed from a set of actual observations
- To distinguish an idealized distribution from a sampled distribution, we denote the mean of a density curve by µ (the Greek letter mu) and the standard deviation of a density curve by σ (the Greek letter sigma)
51Normal (Gaussian) Distributions
- Normal density curves are all symmetric, unimodal, and bell-shaped
- An exact density curve for a normal distribution is completely determined by the mean and standard deviation according to the following mathematical equation:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
- The function gives the height of the density curve
52Normal Distributions
- Mean: at the center of the symmetric distribution
- Standard deviation: natural measure of spread
- Points of inflection of the density curve are located a distance σ on either side of µ (at µ - σ and µ + σ)
- Density curve notation: N(µ, σ)
Smaller σ, less spread out
Larger σ, more spread out
53Why is the Normal distribution so important?
- Good description of data sets such as test scores, characteristics of biological populations, and repeated measurements of the same quantity
- Good approximation to results of chance outcomes such as tossing a coin many times
- Basis for many statistical inference procedures
54A family of density curves
Here, means are the same (µ = 15) while standard deviations are different (σ = 2, 4, and 6).
Here, means are different (µ = 10, 15, and 20) while standard deviations are the same (σ = 3).
55The 68-95-99.7 Rule for Normal Distributions
- About 68% of all observations are within 1 standard deviation (σ) of the mean (µ) (for ALL Normal distributions!).
- About 95% of all observations are within 2σ of the mean µ.
- Almost all (99.7%) observations are within 3σ of the mean.
[Figure: Normal curve with the inflection point marked; mean µ = 64.5, standard deviation σ = 2.5]
N(µ, σ) = N(64.5, 2.5)
Reminder: µ (mu) is the mean of the idealized curve, while x̄ is the mean of a sample. σ (sigma) is the standard deviation of the idealized curve, while s is the s.d. of a sample.
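The three percentages can be checked with Python's standard-library `statistics.NormalDist`, here for the N(64.5, 2.5) curve shown on the slide. `proportion_within` is a hypothetical helper.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The result depends only on k, not on the particular mu and sigma.
within = {k: proportion_within(k, mu=64.5, sigma=2.5) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} sd: {100 * p:.1f}%")
```

The computed areas are slightly more precise than the rounded 68, 95, and 99.7 figures, which is why the rule says "about".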
56The standard Normal distribution
Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(µ, σ) into the standard Normal curve N(0,1).
[Figure: a Normal curve for X transformed into the standard Normal curve for Z]
If a variable X has any Normal distribution N(µ, σ), then the standardized variable Z = (X - µ)/σ has the standard Normal distribution N(0,1).
For each x we calculate a new value, z (called a z-score):
$z = \frac{x - \mu}{\sigma}$
57Standardizing: calculating z-scores
A z-score measures the number of standard deviations that a data value x is from the mean µ.
When x is 1 standard deviation larger than the mean, then z = 1.
When x is 2 standard deviations smaller than the mean, then z = -2.
When x is larger than the mean, z is positive. When x is smaller than the mean, z is negative.
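A z-score is a one-line calculation. The sketch below uses µ = 64.5 and σ = 2.5 from the women's heights example; `z_score` is a hypothetical helper, and 59.5 is an illustrative value chosen to lie two standard deviations below the mean.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

z_above = z_score(67.0, mu=64.5, sigma=2.5)   # 1 standard deviation above the mean
z_below = z_score(59.5, mu=64.5, sigma=2.5)   # 2 standard deviations below the mean
z_mean = z_score(64.5, mu=64.5, sigma=2.5)    # exactly at the mean

print(z_above, z_below, z_mean)
```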
58Ex. Women's heights
N(µ, σ) = N(64.5, 2.5)
Women's heights follow the N(64.5, 2.5) distribution. What percent of women are shorter than 67 inches tall (that's 5'7")?
[Figure: Normal curve with the unknown areas to the left and right of x = 67; mean µ = 64.5", standard deviation σ = 2.5", x (height) = 67"; µ = 64.5 corresponds to z = 0 and x = 67 to z = 1]
We calculate z, the standardized value of x:
z = (x - µ)/σ = (67 - 64.5)/2.5 = 1, so x is 1 standard deviation above the mean.
Because of the 68-95-99.7 rule, we can conclude that the percent of women shorter than 67" should be, approximately, 0.68 + half of (1 - 0.68) = 0.84, or 84%.
59Using the standard Normal table
Table A gives the area under the standard Normal
curve to the left of any z value.
0.0082 is the area under N(0,1) left of z = -2.40
0.0069 is the area under N(0,1) left of z = -2.46
0.0080 is the area under N(0,1) left of z = -2.41
60Percent of women shorter than 67"
For z = 1.00, the area under the standard Normal curve to the left of z is 0.8413.
N(µ, σ) = N(64.5, 2.5)
Conclusion: 84.13% of women are shorter than 67". By subtraction, 1 - 0.8413, or 15.87%, of women are taller than 67".
[Figure: Normal curve with area ≈ 0.84 to the left and area ≈ 0.16 to the right of x = 67 (z = 1); µ = 64.5]
61What percent of women are shorter than 65"?
Height is distributed according to N(µ, σ) = N(64.5, 2.5)
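The slide leaves this as a question. One way to work it, sketched with the standard-library `statistics.NormalDist` in place of Table A (the variable names are illustrative):

```python
from statistics import NormalDist

# Women's heights, N(64.5, 2.5), as given on the slides.
heights = NormalDist(mu=64.5, sigma=2.5)

# Standardize x = 65 the same way as for x = 67.
z_65 = (65 - 64.5) / 2.5

# Area to the left of each height (what Table A would give for the z-score).
p_shorter_65 = heights.cdf(65)
p_shorter_67 = heights.cdf(67)

print(f"z = {z_65:.2f}, percent shorter than 65 in: {100 * p_shorter_65:.2f}%")
print(f"percent shorter than 67 in: {100 * p_shorter_67:.2f}%")
```

The value for 67" agrees with the 0.8413 read from Table A, which is a useful check that the code and the table describe the same area.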
62Tips on using Table A
- Because the Normal distribution is symmetrical,
there are 2 ways that you can calculate the area
under the standard Normal curve to the right of a
z value.
63More Tips on using Table A
To calculate the area between 2 z-values, first
get the area under N(0,1) to the left for each
z-value from Table A.
Then subtract the smaller area from the larger
area.
A common mistake made by students is to subtract both z values. The area between z1 and z2 is NOT the same as the area to the left of (z1 - z2). Instead, with z1 the larger value:
area between z1 and z2 = area left of z1 - area left of z2
Note: The area under N(0,1) for a single value of z is zero.
64Example 1.27. The National Collegiate Athletic
Association (NCAA) requires Division I athletes
to score at least 820 on the combined math and
verbal SAT exam to compete in their first college
year. The SAT scores of 2003 were approximately
normal with mean 1026 and standard deviation 209.
What proportion of all students would be NCAA
qualifiers (SAT ≥ 820)?
z = (820 - 1026)/209 ≈ -0.99, and Table A gives area 0.1611 to the left of z = -0.99.
area right of 820 = total area - area left of 820 = 1 - 0.1611 = 0.8389 ≈ 84%
Note: The actual data may contain students who scored exactly 820 on the SAT. However, the proportion of scores exactly equal to 820 is 0 for a normal distribution. This is a consequence of the idealized smoothing of density curves. So the proportion of students with SAT > 820 is the same as above.
65Ex. 1.28. The NCAA defines a partial qualifier, eligible to practice and receive an athletic scholarship but not to compete, as a student with a combined SAT score of at least 720. What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820?
area between 720 and 820 = area left of 820 - area left of 720 = 0.1611 - 0.0721 = 0.0890 ≈ 9%
About 9% of all students who take the SAT have scores between 720 and 820.
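Both SAT examples can be reproduced without Table A. This is a sketch with the standard-library `statistics.NormalDist`; because it does not round z to two decimals, the areas differ from the table values in the third decimal place.

```python
from statistics import NormalDist

# Combined SAT scores, approximately N(1026, 209), as in Examples 1.27 and 1.28.
sat = NormalDist(mu=1026, sigma=209)

# Example 1.27: area to the right of 820 = 1 - area to the left of 820.
p_qualifier = 1 - sat.cdf(820)

# Example 1.28: area between 720 and 820 = area left of 820 - area left of 720.
p_partial = sat.cdf(820) - sat.cdf(720)

print(f"qualifiers:         {100 * p_qualifier:.1f}%")
print(f"partial qualifiers: {100 * p_partial:.1f}%")
```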
66Inverse normal calculations
- We may also want to find the observed range of values that corresponds to a given proportion/area under the curve.
- For that, we use Table A backward:
- we first find the desired area/proportion in the body of the table,
- we then read the corresponding z-value from the left column and top row.
67Inverse Normal Calculations
Scores on the SAT verbal test in recent years follow the N(505,110) distribution. How high must a student score to place in the top 5% of all students taking the SAT?
1. To be in the top 5%, we must find the z value for the standard normal distribution with 95% of the area to the left of z. Use Table A: the area closest to 0.95 falls between z = 1.64 and z = 1.65. Use z = 1.645.
2. Unstandardize. Transform from z back to the original x scale: x = µ + zσ = 505 + 1.645 × 110 = 685.95.
3. Interpret: This is the x that lies 1.645 standard deviations above the mean on the N(505,110) curve. Scores above 685.95 are in the upper 5% of scores.
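The backward lookup can be sketched with `statistics.NormalDist.inv_cdf`, which returns the value with a given area to its left. The variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: z with 95% of the standard Normal area to its left.
z_95 = NormalDist().inv_cdf(0.95)

# Step 2: unstandardize by hand, x = mu + z * sigma, for N(505, 110).
cutoff_by_hand = 505 + z_95 * 110

# The same result directly from the N(505, 110) curve.
cutoff = NormalDist(mu=505, sigma=110).inv_cdf(0.95)

print(f"z = {z_95:.3f}, cutoff score = {cutoff:.2f}")
```

The cutoff differs from 685.95 by a few hundredths because the slide rounds z to 1.645.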
68Normal quantile plots
- One way to assess if a distribution is indeed approximately normal is to plot the data on a normal quantile plot.
- The data points are ranked and the percentile ranks are converted to z-scores with Table A. The z-scores are then used for the x axis against which the data are plotted on the y axis of the normal quantile plot.
- If the distribution is indeed normal, the plot will show a straight line, indicating a good match between the data and a normal distribution.
- Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
69Good fit to a straight line: the distribution of rainwater pH values is close to normal.
Curved pattern: the data are not normally distributed. Instead, the plot shows a right skew: a few individuals have particularly long survival times.
Normal quantile plots are complex to do by hand,
but they are standard features in most
statistical software.
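The coordinates of a normal quantile plot can still be sketched in a few lines. This example reuses the exam scores from slide 25 and assumes the percentile rank (i - 0.5)/n for the i-th smallest value; software packages use slightly different conventions, and `normal_quantile_points` is a hypothetical helper.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Return (z-score, data value) pairs for a normal quantile plot."""
    ordered = sorted(values)                  # rank the data
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(ordered):
        percentile = (i + 0.5) / n            # assumed plotting position
        z = std_normal.inv_cdf(percentile)    # Table A used backward
        points.append((z, x))
    return points

exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
points = normal_quantile_points(exam_scores)
for z, x in points:
    print(f"{z:6.2f}  {x}")
```

Plotting x against z for these pairs gives the normal quantile plot; a roughly straight line suggests the data are close to normal.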