Title: Lecture 2: Exploratory Data Analysis part 2
1Lecture 2 Exploratory Data Analysis part 2
- Descriptive statistics (continued)
- The Normal approximation (Section 1.3)
2Describing distributions with numbers
- A distribution can be described through measures of its center and of its spread.
- The center: use the mean (average) if the distribution is symmetric, and the median if the distribution is skewed.
- The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger.
- The median is only slightly affected by the presence of extreme observations, no matter how large these observations are.
3Describing distributions with numbers
- The spread or variation
- Standard deviation (used in association with the average) if the distribution is symmetric
- First and third quartiles (used in association with the median) if the distribution is skewed
4Example Shopping in a supermarket
A marketing consultant observed 50 consecutive
shoppers at a supermarket. The histogram below
shows how much each shopper spent in the store.
Summary statistics: Mean = 34.70, Median = 27.855, Q1 = 19.27, Q3 = 45.40, IQR = 45.40 - 19.27 = 26.13
About 50% of the shoppers spent less than 28 dollars, 25% spent less than 20 dollars, and 25% of the customers of the store spent more than 45 dollars. Moreover, 50% of the customers spent between 20 and 45 dollars!
Extreme values: purchases > Q3 + 1.5 × IQR = 84.59
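The extreme-value rule quoted for the shoppers data can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the quartiles from the summary statistics; the variable names are illustrative, not part of the lecture.

```python
# Quartiles quoted in the summary statistics (dollars).
q1, q3 = 19.27, 45.40

iqr = q3 - q1                 # interquartile range: spread of the middle 50%
upper_fence = q3 + 1.5 * iqr  # purchases above this value are flagged as extreme

print(f"IQR = {iqr:.2f}")
print(f"Upper fence = {upper_fence:.3f}")
```

The computed fence is 84.595 before rounding, so it agrees with the 84.59 quoted on the slide up to the last digit.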
5Box plots
[Figure: box plot marking Max, Q3, Median, Q1 and Min, with the Mean also shown]
A boxplot is a graph of the 5-number summary: Min, Q1, Median, Q3, Max.
6Percentiles (also called Quantiles)
In general, the nth percentile is a value such that n% of the observations fall at or below it.
[Figure: n% of the observations lie at or below the nth percentile]
For the shoppers data above: 5th percentile = 9.260, 95th percentile = 85.760, 10th percentile = 13.235, 90th percentile = 65.770.
Hence about 80% of the customers spent between 13 and 66 dollars.
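The definition of a percentile can be turned into a small function. The sketch below uses the "nearest rank" convention, which is only one of several conventions in use (statistical software often interpolates instead, so results can differ slightly). The spending amounts are hypothetical and are NOT the lecture's 50-shopper data, which are not reproduced here.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest observation such that at least p% of the data fall at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]

# Hypothetical spending amounts in dollars (illustration only).
spending = [9.5, 13.0, 18.2, 19.9, 24.1, 27.3, 28.4, 33.0, 41.7, 45.2,
            52.8, 66.0, 85.1, 90.3, 12.4, 21.6, 30.5, 36.9, 48.0, 60.2]

p10 = percentile_nearest_rank(spending, 10)
p90 = percentile_nearest_rank(spending, 90)

# Share of observations above the 10th percentile and at or below the 90th.
share_between = sum(p10 < x <= p90 for x in spending) / len(spending)
print(p10, p90, share_between)
```

As in the shoppers example, the share between the 10th and 90th percentiles comes out at 80%.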
7Example Air flow rate
- The production of integrated circuits (IC) must take place under conditions where the air is free of particulates, because the presence of such particulates in the circuits is a major cause of their failure.
- Hewlett-Packard constructed a clean room using ultra-low particulate filters to maintain uniform air flow over areas in which production occurred. The data were collected in a study comparing the variability in air velocity through clean-room air filters to the variability stated in the purchaser's specifications.
- The data report the average of three flow rates by two technicians at 8 sites on 11 filters. We will compare only two of these sites.
- Description with Box Plots
8Side-by-side Box Plots
Side-by-side box plots display the summary statistics for two or more samples. They are very convenient for comparing sets of observations.
[Figure: side-by-side box plots of Sample A and Sample B, each marking Max, Q3, Median, Q1 and Min]
The plot indicates that, on average, the observations in sample A are larger than the observations in sample B (even though there is a large common range).
9Measuring the spread for symmetric distributions
Example: SAT Math scores of 224 Computer Science students
In a large university, data were collected to study the academic achievements of computer science majors. We'll consider the SAT math (SATM) scores of 224 first-year CS students. The average SATM score is 595.28 with s.d. s = 86.40.
[Figure: histogram of the SATM scores]
Are the average and s.d. good descriptions of the SATM score distribution? Roughly 68% of the students have scores between 510 and 680. Roughly 95% of the students have scores between 422 and 768.
10Measuring the spread for symmetric distributions
- If a distribution is symmetric
- Use the average to measure the center and
- the Standard Deviation to measure the spread.
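- Example: A person's metabolic rate is the rate at which the body consumes energy. Rates of 7 men in a study on dieting: 1792, 1666, 1614, 1460, 1867, 1439, 1362.
- The mean is $\bar{x} = \frac{1792 + 1666 + 1614 + 1460 + 1867 + 1439 + 1362}{7} = \frac{11200}{7} = 1600$.
[Figure: dot plot of the seven metabolic rates on an axis from 1300 to 1900, marking the deviations 1867 - 1600 = 267 and 1600 - 1439 = 161 from the mean]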
11Variance
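- Variance: the average of the squared deviations from the mean.
- In symbols, the variance $s^2$ of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ is $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.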
12Standard Deviation
- Standard deviation measures how far the observations are from the average.
- In symbols, the standard deviation s of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ is $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$.
- Note that two different formulas exist for calculating the variance V and s.
- In many texts, the variance is a direct average. It is computed as the sum of the squares of the deviations divided by the number of observations n. This is sometimes called the population variance.
- However, our text divides by n - 1. This is sometimes referred to as the estimated variance based on a sample.
13For Metabolic rate example: Compute the standard deviation
- Consider the data on metabolic rate: 1792, 1666, 1614, 1460, 1867, 1439, 1362.
- Calculate the mean: 11200 / 7 = 1600.
- Calculate the deviations from the mean (entry - mean): 1792 - 1600 = 192, 1666 - 1600 = 66, 1614 - 1600 = 14, 1867 - 1600 = 267, 1460 - 1600 = -140, 1439 - 1600 = -161, 1362 - 1600 = -238.
- Square the deviations, add them up, divide the sum by n - 1 and take the square root to get the standard deviation: $s = \sqrt{214870 / 6} \approx 189.24$.
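The hand calculation for the metabolic rate data can be reproduced step by step. The sketch below follows the recipe on the slide and then cross-checks it against Python's standard `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by n (the "population" version mentioned earlier). Variable names are illustrative.

```python
import math
import statistics

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]  # metabolic rates of the 7 men

mean = statistics.mean(rates)
deviations = [x - mean for x in rates]          # entry - mean
sum_sq = sum(d ** 2 for d in deviations)        # sum of squared deviations
variance = sum_sq / (len(rates) - 1)            # divide by n - 1, as in the text
s = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {sum_sq}")
print(f"s (n - 1 formula)      = {s:.2f}")
print(f"statistics.stdev check = {statistics.stdev(rates):.2f}")
print(f"population s.d. (n)    = {statistics.pstdev(rates):.2f}")
```

Dividing by n instead of n - 1 always gives a somewhat smaller value; with only 7 observations the difference is noticeable.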
14Properties of the s.d.
- It measures the spread about the mean.
- The s.d. is the square root of the variance.
- Only used in association with the mean. Good descriptive measure for symmetric distributions.
- If s = 0, all the observations have the same value.
- Otherwise it is a POSITIVE value; the larger s is, the more spread out the observations are around the mean.
- It is NOT a resistant measure: a few extreme observations may affect its value.
15Interpreting the s.d.
- For many sets of observations, especially if their histogram is bell-shaped:
- Roughly 68% of the observations in the list lie within 1 standard deviation of the average
- And 95% of the observations lie within 2 standard deviations of the average
[Figure: bell-shaped curve centered at the average, with 68% of the area between Ave - s.d. and Ave + s.d., and 95% between Ave - 2 s.d. and Ave + 2 s.d.]
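As an illustration of the 68% and 95% rule, the sketch below computes the two intervals for the SATM scores of the CS students (mean 595.28, s.d. 86.40, as quoted in this lecture) and the shares that an exact normal curve places within 1 and 2 standard deviations. The function names are hypothetical helpers; `NormalDist` is from Python's standard library.

```python
from statistics import NormalDist

def sd_interval(mean, sd, k):
    """Interval from k standard deviations below the mean to k above it."""
    return mean - k * sd, mean + k * sd

def normal_share_within(k):
    """Share of a normal distribution lying within k standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

satm_mean, satm_sd = 595.28, 86.40
for k in (1, 2):
    low, high = sd_interval(satm_mean, satm_sd, k)
    print(f"within {k} s.d.: {low:.2f} to {high:.2f} "
          f"(normal share {normal_share_within(k):.1%})")
```

The intervals round to 510-680 and 422-768, and the exact normal shares are close to the rounded 68% and 95% used in the rule.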
17CS students example: Descriptive statistics
Mean = 595.28, Std Deviation = 86.40
Max = 800, Min = 300, Q1 = 540, Median = 600.00, Q3 = 650, IQR = 110, 1.5 × IQR = 165
5th percentile = 460, 95th percentile = 750
[Figure: histogram of the SATM scores, with 95% of the scores between 422 and 768]
18Analysis of the scores for male and female students
[Figure: box plots of the SATM scores for men and the SATM scores for women]
19In general, when analyzing data
- Always plot your data
- Look for overall patterns and striking deviations such as outliers
- Calculate a numerical summary to describe the center and the spread
- Symmetric distributions: mean and standard deviation
- Asymmetric distributions: 5-number summary (Min, Q1, Median, Q3, Max)
- NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve.
20Normal distribution
Normal curves provide a simple compact way to
describe symmetric, bell-shaped distributions.
[Figure: SAT math scores for CS students, with the normal curve]
21Money spent in a supermarket
Is the normal curve a good approximation?
22SAT math scores for CS students
The area under the histogram, i.e. the
percentages of the observations, can be
approximated by the corresponding area under the
normal curve. If the histogram is symmetric, we
say that the data are approximately normal (or
normally distributed). We need to know only the
average and the standard deviation of the
observations!!
23SAT math scores for CS students
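The variable SAT math scores is approximately normally distributed with mean m = 595.28 and std deviation s = 86.40. The approximating normal distribution has density function $f(x) = \frac{1}{s\sqrt{2\pi}} e^{-(x - m)^2 / (2 s^2)}$, with m = 595.28 and s = 86.40.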
24Two normal curves with the same mean but
different standard deviation.
25- The normal approximation is commonly used in statistics.
- There is a special normal curve, known as the standard normal curve.
- The standard normal distribution has mean 0 and standard deviation 1.
- The curve is perfectly symmetric around 0.
- Mathematical formula for the standard normal: $f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2 / 2}$
26Benchmarks under the standard normal curve
[Figure: standard normal curve with benchmark areas; 50% of the area lies below 0]
27Normal distribution function F(z)
- It is defined as the area under the standard normal curve to the left of z, that is F(z) = P(Z < z). The values of F(z) are tabulated; see Table A in your textbook.
28Standard normal probabilities F(z) = P(Z < z)
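When Table A is not at hand, F(z) can be computed from the error function, using the identity F(z) = (1 + erf(z / sqrt(2))) / 2. The sketch below is one way to reproduce a few table entries; the function name `phi` is an illustrative choice, and the z values are ones that appear in this lecture's examples.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z, i.e. F(z) = P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Print a few entries in the same four-decimal form as a probability table.
for z in (-2.00, -1.28, -0.81, 0.00, 0.39, 1.21):
    print(f"F({z:+.2f}) = {phi(z):.4f}")
```

Because the curve is symmetric around 0, F(-z) = 1 - F(z), which is why tables sometimes list only one half of the z values.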
29Application of the normal distribution to the data: Standardization
- The normal distribution can be used to approximate the distribution of the data when the data have a symmetric histogram!
- Result: If the data X are normally distributed according to the distribution N(m, s) with mean m and standard deviation s, then the standardized value of X, given by Z = (X - m)/s, is a standard normal variable with distribution N(0, 1), that is, with mean 0 and standard deviation 1.
- Thus, we can compute the relative frequencies for any normal distribution by standardizing and using the probability Table A.
30Example
Mean = 595.28, Std Dev. s = 86.40
The distribution of the SATM scores for the CS students is approximately normal with mean 595.28 and s.d. 86.40: N(595.28, 86.40).
Problem: What is the percentage of CS students that had SAT math scores less than 700?
Answer: Use the normal approximation. Say X is N(595.28, 86.40). The answer is the area under the normal density curve for X < 700.
Standardize: subtract the average, then divide by the standard deviation. X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212.
31- Answer: The answer is the area under the normal density curve for X < 700.
- Standardize: subtract the mean, then divide by the standard deviation.
- X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212.
- Look at Table A: we need to find the area to the left of Z = 1.212, which rounds to z = 1.21.
- Result: F(1.21) = .8869, so about 88.69% of the CS students have SATM scores of 700 or lower.
[Figure: standard normal curve with the area F(z) = .8869 shaded to the left of z = 1.21]
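The standardization step for X < 700 can be sketched in code. The block computes the z-value, looks up the area the way Table A would (z rounded to two decimals), and compares it with the area obtained without rounding. It relies on `NormalDist` from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40   # SATM mean and s.d. for the CS students
x = 700

z = (x - mean) / sd                          # standardize: subtract mean, divide by s.d.
p_table = NormalDist().cdf(round(z, 2))      # Table A style: z rounded to two decimals
p_exact = NormalDist(mean, sd).cdf(x)        # same area, without rounding z

print(f"z = {z:.3f}")
print(f"area from rounded z = {p_table:.4f}")
print(f"area without rounding = {p_exact:.4f}")
```

The two areas differ only in the fourth decimal place, which is the usual size of the error introduced by rounding z for a table lookup.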
32How do we compute it?
- We use the values of the standard Normal distribution function F(z) = P(Z < z).
- Problem: What is the percentage of CS students that had SAT math scores between 600 and 750?
- Approximate answer
- 1) Standardize: $P(600 < X < 750) = P\left(\frac{600 - 595.28}{86.40} < \frac{X - 595.28}{86.40} < \frac{750 - 595.28}{86.40}\right) \approx P(0.05 < Z < 1.79)$
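The remaining steps of this calculation are not shown above, so here is a hedged sketch of how an "area between two values" is usually obtained: subtract the area to the left of the lower value from the area to the left of the upper value. The helper name `share_between` is hypothetical.

```python
from statistics import NormalDist

def share_between(low, high, mean, sd):
    """Area under the N(mean, sd) curve between low and high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)   # F(upper) - F(lower)

share = share_between(600, 750, mean=595.28, sd=86.40)
print(f"Share of SATM scores between 600 and 750: {share:.1%}")
```

Under the normal approximation, roughly 44% of the CS students scored between 600 and 750.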
33Summary Normal distribution calculations
- Follow these steps:
- State the problem. Calculate the sample average and the s.d., and define the interval you are interested in.
- Standardize.
- Compute the area under the standard normal density curve using Table A.
- Online calculators and more info:
- area under Normal
- another one (a bit glitchy)
- sampling distributions
- more on the Normal distribution
34Inverse Problem: What is the lowest SAT math score that a student must have to be in the top 25% of all CS students in the sample?
Mean = 595.28, Std Dev. s = 86.40
[Figure: normal curve with the top 25% of the area shaded to the right of the unknown score "?"; sample Q3 = 650]
Find the value x such that 25% of observations fall at or above it.
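For the inverse problem, the table is read backwards: find the z-value with 75% of the area to its left, then convert it back to the score scale with x = mean + z × s.d. The sketch below does this with `NormalDist.inv_cdf` from Python's standard library; it is a normal-approximation answer, not the sample quartile itself.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40

# Top 25% means 75% of the area lies to the left of the cutoff.
z75 = NormalDist().inv_cdf(0.75)
cutoff = mean + z75 * sd          # transform z back to SATM units

print(f"z = {z75:.3f}, cutoff score = {cutoff:.1f}")
```

The normal approximation gives a cutoff of about 654, close to the sample Q3 of 650.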
35Example: Data on the time between machine failures were collected during a study on machine performance that involved 39 similar machines. From the data we compute the sample mean, 23.35 hours, and the sample standard deviation, 1.67 hours.
- What is the percentage of machines that failed after 24 hours?
- What is the percentage of machines with failure time between 20 and 22 hours?
- How short should the failure time be for a machine to be in the bottom 10%?
36Answers
- The observations are on the variable X = time of failure, which is approximately normal N(23.35, 1.67).
- What is the percentage of machines that failed after 24 hours?
- Compute the percentage for X > 24, which is equal to the area under the normal distribution to the right of 24.
- Standardize X > 24 as $Z = \frac{X - 23.35}{1.67} > \frac{24 - 23.35}{1.67}$, or equivalently Z > 0.39.
- Now use the standard normal probability tables. The area under the standard normal to the right of 0.39 is equal to
- Area to the right of 0.39 = 1 - (Area to the left of 0.39) = 1 - 0.6517 = 0.3483
- The answer is 0.3483. About 35% of the machines failed after 24 hours.
37- What is the percentage of machines with failure time between 20 and 22 hours?
- We need to compute the area under the normal distribution for 20 < X < 22. This is computed by subtracting: (Area for X < 22) - (Area for X < 20).
- Standardize:
- X < 22 is, in standard units, $Z < \frac{22 - 23.35}{1.67} \approx -0.81$
- X < 20 is, in standard units, $Z < \frac{20 - 23.35}{1.67} \approx -2.0$
- Use the standard normal probability tables:
- The area under the standard normal distribution for Z < -0.81 is 0.2090
- The area under the standard normal distribution for Z < -2.00 is 0.0228
- The answer is 0.2090 - 0.0228 = 0.1862
- Conclusion: about 19% of the machines had a failure time between 20 and 22 hours.
38- How short should the failure time be for a machine to be in the bottom 10%?
- We need to compute the value x for X ~ N(23.35, 1.67) such that the area under the normal distribution to the left of x is equal to 0.1.
[Figure: normal curve centered at 23.35 with an area of 0.1 shaded to the left of x]
From the normal probability tables, the standard value z that corresponds to an area P(Z < z) = 0.1 is z = -1.28. Thus, transforming the z-value back to the x-units, we have x = -1.28 × st.dev. + mean = -1.28 × 1.67 + 23.35 = 21.21. So the bottom 10% of the machines have a failure time of 21.21 hours or shorter.
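All three machine-failure questions can be checked in one place. This sketch models the failure time as N(23.35, 1.67) with `NormalDist` from Python's standard library. Small differences from the table-based answers are expected, because a printed table requires rounding z to two decimals.

```python
from statistics import NormalDist

failure_time = NormalDist(mu=23.35, sigma=1.67)  # hours

# 1) Share of machines that failed after 24 hours: area to the right of 24.
p_after_24 = 1 - failure_time.cdf(24)

# 2) Share with failure time between 20 and 22 hours: F(22) - F(20).
p_20_to_22 = failure_time.cdf(22) - failure_time.cdf(20)

# 3) Failure time below which the bottom 10% of machines fall.
bottom_10_cutoff = failure_time.inv_cdf(0.10)

print(f"P(X > 24)      = {p_after_24:.4f}")
print(f"P(20 < X < 22) = {p_20_to_22:.4f}")
print(f"10% cutoff     = {bottom_10_cutoff:.2f} hours")
```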
39Normal approximations
Is the normal approximation appropriate for these data?
[Figure: histogram with a normal curve overlaid; the curve overestimates the area in one region and underestimates it in another]
Use it when the histogram of the observations is bell-shaped!
40Normal quantile plots
- A useful tool for assessing whether the data come from a normal distribution is a graph called a normal quantile plot.
- If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Deviations from a straight line indicate that the data are not normal.
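To make the idea concrete, the sketch below computes the points of a normal quantile plot for the seven metabolic rates used earlier: each sorted observation is paired with the standard normal z-value expected at its rank. The plotting position (i - 0.5)/n is an assumption here; it is one common convention, and statistical packages may use a slightly different one.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (theoretical z, observed value) pairs for a normal quantile plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest observation.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]
points = normal_quantile_points(rates)
for z, x in points:
    print(f"{z:+.2f}  {x}")
```

Plotting the observed values against the z-values gives the normal quantile plot; with so few observations, only gross departures from a straight line would be visible.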
41SAS for Exploratory Data Analysis
- PROC MEANS, PROC UNIVARIATE: to compute descriptive statistics
- PROC CHART (GCHART): to plot histograms
- PROC UNIVARIATE: to plot histograms, normal probability plots, boxplots