Lecture 2: Exploratory Data Analysis part 2 - PowerPoint PPT Presentation

Slides: 42
Provided by: allp
1
Lecture 2: Exploratory Data Analysis part 2
  • Descriptive statistics (continued)
  • The Normal approximation (Section 1.3)

2
Describing distributions with numbers
  • The distribution can be described through the
    measures of its center and of its spread.
  • The center
  • Mean or average, if the distribution is symmetric
  • Median, if the distribution is skewed. (The median
    M is the midpoint of a distribution, the number
    such that half the observations are smaller and
    the other half are larger.)
  • The value of the median is only slightly affected
    by the presence of extreme observations, no matter
    how large these observations are.

3
Describing distributions with numbers
  • The spread or variation
  • Standard deviation (used in association with the
    average) if the distribution is symmetric
  • First and Third Quartile (used in association
    with the Median) if the distribution is skewed

4
Example: Shopping in a supermarket
A marketing consultant observed 50 consecutive
shoppers at a supermarket. The histogram below
shows how much each shopper spent in the store.
Summary statistics: Mean = 34.70, Median = 27.855,
Q1 = 19.27, Q3 = 45.40, IQR = 45.40 - 19.27 = 26.13
About 50% of the shoppers spent less than 28
dollars, 25% spent less than 20 dollars, and 25%
of the customers of the store spent more than 45
dollars. Moreover, 50% of the customers spent
between 20 and 45 dollars! Extreme values are
purchases > Q3 + 1.5 × IQR = 84.59
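The cutoff for extreme purchases quoted above follows directly from the quartiles. A minimal Python sketch of that arithmetic, using only the summary statistics from this slide (the variable names are illustrative, not part of the lecture):

```python
# Quartiles from the shoppers example, in dollars.
q1 = 19.27
q3 = 45.40

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

# Upper cutoff of the 1.5 x IQR rule; purchases above it count as extreme.
upper_cutoff = q3 + 1.5 * iqr

print(f"IQR = {iqr:.2f}")
print(f"Upper cutoff = {upper_cutoff:.3f}")
```

The slide reports the cutoff to two decimals; the sketch only reproduces the arithmetic, not the histogram.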
5
Box plots
[Figure: box plot with Max, Q3, Mean, Median, Q1 and Min marked]
A boxplot is a graph of the 5-number summary:
Min, Q1, Median, Q3, Max
6
Percentiles (also called quantiles): In general,
the nth percentile is a value such that n% of the
observations fall at or below it.
[Figure: the nth percentile, with n% of the observations at or below it]
For the shoppers data above: 5th percentile =
9.260, 95th percentile = 85.760, 10th
percentile = 13.235, 90th percentile = 65.770.
Hence about 80% of the customers spent between
13 and 66 dollars.
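To make the definition concrete, here is a small Python sketch of one percentile convention (the "nearest rank" rule). Statistical packages use several slightly different rules, so this is an assumption rather than the method behind the figures above, and the purchase amounts are hypothetical, not the lecture's data.

```python
import math

def percentile_nearest_rank(values, n):
    """Smallest observation with at least n% of the data at or below it."""
    ordered = sorted(values)
    # Rank of that observation, counting from 1.
    rank = max(1, math.ceil(n * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical purchase amounts in dollars (NOT the lecture's data set).
purchases = [8, 12, 15, 19, 22, 25, 28, 31, 36, 40,
             44, 47, 52, 58, 63, 70, 77, 85, 92, 110]

p10 = percentile_nearest_rank(purchases, 10)
p90 = percentile_nearest_rank(purchases, 90)

# Share of observations above the 10th and at or below the 90th percentile.
share = sum(p10 < x <= p90 for x in purchases) / len(purchases)
print(p10, p90, share)
```

With these made-up amounts, the share between the 10th and 90th percentiles is 80%, mirroring the reasoning used for the shoppers data.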
7
Example: Air flow rate
  • The production of integrated circuits (IC) must
    take place under conditions where the air is free
    of particulates, because the presence of such
    particulates in the circuits is a major cause of
    their failure.
  • Hewlett-Packard constructed a clean room using
    ultra-low particulate filters to maintain uniform
    air flow over areas in which production occurred.
    The data were collected in a study comparing the
    variability in air velocity through clean-room
    air filters to the variability stated in the
    purchaser's specifications.
  • The data report the average of three flow rates
    by two technicians at 8 sites on 11 filters. We
    will compare only two of these sites.
  • Description with Box Plots

8
Side-by-side Box Plots
Side-by-side box plots display the summary
statistics for two or more samples, which makes
them very convenient for comparing sets of
observations.
[Figure: side-by-side box plots of Sample A and Sample B, each marked with Max, Q3, Median, Q1, Min]
The plot indicates that, on average, the
observations in sample A are larger than the
observations in sample B (even though there is a
large common range).
9
Measuring the spread for symmetric distributions
Example: SAT Math scores of 224 Computer Science
students. In a large university, data were
collected to study the academic achievements of
computer science majors. We'll consider the SAT
math scores of 224 first-year CS students. The
average SATM score is 595.28 with s.d. s = 86.40.
[Figure: Histogram of the SATM Scores]
Are the average and s.d. good descriptions of
the SATM scores distribution? Roughly 68% of the
students have scores between 510 and 680. Roughly
95% of the students have scores between 422 and
768.
10
Measuring the spread for symmetric distributions
  • If a distribution is symmetric
  • Use the average to measure the center and
  • the Standard Deviation to measure the spread.
  • Example: A person's metabolic rate is the rate at
    which the body consumes energy. Rates of 7 men in
    a study on dieting: 1792, 1666, 1614, 1460, 1867,
    1439, 1362.
  • The mean is
    $\bar{x} = \frac{1792 + 1666 + 1614 + 1460 + 1867 + 1439 + 1362}{7} = \frac{11200}{7} = 1600$

[Figure: dot plot of metabolic rate on an axis from 1300 to 1900, marking the deviations 1867 - 1600 = 267 and 1600 - 1439 = 161]
11
Variance
  • Variance: the average of the squared deviations
  • In symbols, the variance s^2 of n
    observations x_1, ..., x_n is
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

12
Standard Deviation
  • Standard deviation measures how far the
    observations are from the average.
  • In symbols, the standard deviation s of n
    observations is
    $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • Note that two different formulas exist for
    calculating the variance V and the s.d. s
  • In many texts, the variance is a direct
    average. It is computed as the sum of the squares
    of the deviations divided by the number of
    observations. This is sometimes called the
    population variance.
  • However, our text divides by n - 1. This is
    sometimes referred to as the estimated variance
    based on a sample.
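The two conventions can be compared side by side. The sketch below uses Python's standard `statistics` module on the metabolic rate data from the earlier slide; it is only an illustration of the difference between dividing by n and by n - 1.

```python
import statistics

# Metabolic rates of the 7 men in the dieting study.
rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

# "Population" variance: sum of squared deviations divided by n.
var_n = statistics.pvariance(rates)

# Variance based on a sample: divided by n - 1 (the text's convention).
var_n_minus_1 = statistics.variance(rates)

print(f"divide by n:     {var_n:.2f}")
print(f"divide by n - 1: {var_n_minus_1:.2f}")
```

Dividing by n - 1 always gives the larger value, and the gap shrinks as n grows.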

13
For the metabolic rate example: compute the standard
deviation
  • Consider the data on metabolic rate: 1792, 1666,
    1614, 1460, 1867, 1439, 1362.
  • Calculate the mean: 1600
  • Calculate the deviations from the mean (entry -
    mean):
  • 1792 - 1600 = 192, 1666 - 1600 = 66, 1614 - 1600 = 14,
    1867 - 1600 = 267
  • 1460 - 1600 = -140, 1439 - 1600 = -161,
    1362 - 1600 = -238
  • Square the deviations, add them up, divide the
    sum by n-1 and take the square root to get the
    standard deviation: $s = \sqrt{214870 / 6} \approx 189.24$
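The steps above can be followed line by line in a short Python sketch (variable names are illustrative):

```python
import math

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]
n = len(rates)

# Step 1: the mean.
mean = sum(rates) / n

# Step 2: deviations from the mean (entry - mean).
deviations = [x - mean for x in rates]

# Step 3: square the deviations and add them up.
sum_sq = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 and take the square root.
s = math.sqrt(sum_sq / (n - 1))

print(mean, deviations, sum_sq, round(s, 2))
```

Note that the deviations sum to zero, which is a quick check that the mean was computed correctly.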

14
Properties of the s.d.
  • It measures the spread about the mean.
  • The s.d. is the square root of the variance.
  • Only used in association with the mean. Good
    descriptive measure for symmetric distributions.
  • If s = 0, all the observations have the same
    value.
  • It is a NON-NEGATIVE value; the larger s is, the
    more spread out the observations are around the
    mean.
  • It is NOT a resistant measure; a few extreme
    observations may affect its value.

15
Interpreting the s.d.
  • For many sets of observations, especially if
    their histogram is bell-shaped:
  • Roughly 68% of the observations in the list lie
    within 1 standard deviation of the average
  • And 95% of the observations lie within 2 standard
    deviations of the average

[Figure: bell-shaped curve marked at Ave - 2 s.d., Ave - s.d., Average, Ave + s.d., Ave + 2 s.d.; 68% of the area lies within 1 s.d. and 95% within 2 s.d.]
16
Example: SAT Math scores of 224 Computer Science
students
In a large university, data were collected to
study the academic achievements of computer
science majors. We'll consider the SAT math
scores of 224 first-year CS students. The
average SATM score is 595.28 with s.d. s = 86.40.
[Figure: Histogram of the SATM Scores]
Are the average and s.d. good descriptions of
the SATM scores distribution? Roughly 68% of the
students have scores between 510 and 680. Roughly
95% of the students have scores between 422 and
768.
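The two score ranges come from stepping one and two standard deviations away from the average. A minimal Python sketch of that calculation, assuming only the mean and s.d. quoted on the slide (the helper name `interval` is illustrative):

```python
mean = 595.28
sd = 86.40

def interval(k):
    """Return the range from k s.d. below to k s.d. above the mean."""
    return (mean - k * sd, mean + k * sd)

low1, high1 = interval(1)   # should hold roughly 68% of the scores
low2, high2 = interval(2)   # should hold roughly 95% of the scores

print(f"within 1 s.d.: {low1:.2f} to {high1:.2f}")
print(f"within 2 s.d.: {low2:.2f} to {high2:.2f}")
```

The slide rounds the one-s.d. endpoints to 510 and 680 and the two-s.d. endpoints to 422 and 768.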
17
CS students example: Descriptive statistics
Mean = 595.28, Std Deviation = 86.40
Max = 800, Min = 300, Q1 = 540, Median =
600.00, Q3 = 650, IQR = 110, 1.5 × IQR = 165
5th percentile = 460, 95th percentile = 750

[Figure: Histogram of the SATM Scores, with 95% of scores between 422 and 768]
18
Analysis of the scores for male and female
students
[Figure: box plots of SATM scores for men and SATM scores for women]
19
In general, when analyzing data:
  • Always plot your data
  • Look for overall patterns and striking deviations
    such as outliers
  • Calculate a numerical summary to describe the
    center and the spread
  • Symmetric distributions: mean and standard
    deviation
  • Asymmetric distributions: 5-number summary (Min,
    Q1, Median, Q3, Max)
  • NEXT STEP: sometimes the overall pattern is so
    regular that we can describe it through a smooth
    curve, called a density curve.

20
Normal distribution
Normal curves provide a simple, compact way to
describe symmetric, bell-shaped distributions.
[Figure: normal curve over the histogram of SAT math scores for CS students]
21
Money spent in a supermarket
Is the normal curve a good approximation?
22
SAT math scores for CS students
The area under the histogram, i.e. the
percentages of the observations, can be
approximated by the corresponding area under the
normal curve. If the histogram is symmetric, we
say that the data are approximately normal (or
normally distributed). We need to know only the
average and the standard deviation of the
observations!!
23
SAT math scores for CS students
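The variable SAT math scores is approximately
normally distributed with mean m = 595.28 and std
deviation s = 86.40. The approximating normal
distribution has density function
$f(x) = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{(x - m)^2}{2 s^2}}$, with m = 595.28 and s = 86.40.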
24
Two normal curves with the same mean but
different standard deviation.
25
  • The normal approximation is commonly used in
    statistics.
  • There is a special normal curve that is known
    as the standard normal curve
  • The standard normal distribution has mean 0 and
    standard deviation 1
  • The curve is perfectly symmetric around 0
  • Mathematical formula for the Standard Normal:
    $f(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}$

26
Benchmarks under the standard normal curve
[Figure: benchmark areas under the standard normal curve, including 50%]
27
Normal distribution function F(z)
  • It is defined as the area under the standard
    normal curve to the left of z, that is,
    F(z) = P(Z < z). The values of F(z) are tabulated;
    see Table A in your textbook.
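When Table A is not at hand, F(z) can be evaluated numerically. The sketch below uses the standard identity relating the normal distribution function to the error function, F(z) = (1 + erf(z / sqrt(2))) / 2; the function name `std_normal_cdf` is illustrative.

```python
import math

def std_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A few values to compare against a printed table.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"F({z:+.1f}) = {std_normal_cdf(z):.4f}")
```

By symmetry of the curve, F(-z) = 1 - F(z), which the printed values should reflect.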

28
Standard normal probabilities F(z) = P(Z < z)
29
Application of the normal distribution to the
data: Standardization
  • The normal distribution can be used to
    approximate the distribution of the data, when
    the data have a symmetric histogram!
  • Result
  • If the data X are normally distributed according
    to the distribution N(m, s) with mean m and
    standard deviation s, then the standardized value
    of X, given by Z = (X - m)/s, is a standard normal
    variable with distribution N(0, 1), that is, with
    mean 0 and standard deviation equal to 1
  • Thus, we can compute the relative frequencies for
    any normal distribution, by standardizing and
    using the probability Table A.

30
Example
Mean = 595.28, Std Dev. s = 86.40
The distribution of the SATM scores for the CS
students is approximately normal with mean 595.28
and s.d. 86.40: N(595.28, 86.40)
Problem: What is the percentage of CS students
that had SAT math scores less than 700?
Answer: Use the normal approximation. Say X is
N(595.28, 86.40). The answer is the area under the
normal density curve for X < 700. Standardize:
subtract the average, then divide by the standard
deviation. X < 700 is equivalent to
Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212
31
  • Answer: The answer is the area under the normal
    density curve for X < 700
  • Standardize: subtract the mean, then divide by
    the standard deviation
  • X < 700 is equivalent to
    Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212
  • Look at Table A
  • We need to find the area to the left of Z = 1.212
  • Result: about 88.7% of the CS students have an
    SATM score of 700 or lower

[Figure: standard normal curve with area F(z) = 0.8873 to the left of z = 1.212]
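The table lookup can be cross-checked with Python's standard library. This is a sketch under the slide's assumption that the scores are N(595.28, 86.40); `NormalDist` is part of the `statistics` module (Python 3.8 and later), and the result should be roughly 0.89.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40

# Standardize the cutoff: subtract the mean, divide by the s.d.
z = (700 - mean) / sd

# Area to the left of z under the standard normal curve.
area = NormalDist().cdf(z)

# The same area, computed directly on the original scale.
area_direct = NormalDist(mean, sd).cdf(700)

print(f"z = {z:.3f}")
print(f"P(X < 700) = {area:.4f}")
```

Both routes give the same number, which is the point of standardizing: one table serves every normal distribution.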
32
How do we compute it?
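  • We use the values of the standard Normal
    distribution function F(z) = P(Z < z).
  • Problem: What is the percentage of CS students
    that had SAT math scores between 600 and 750?
  • Approximate answer
  • 1) Standardize: 600 < X < 750 is equivalent to
    $\frac{600 - 595.28}{86.40} < \frac{X - 595.28}{86.40} < \frac{750 - 595.28}{86.40}$

[Figure: normal curve marking the interval from 600 to 750]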
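An interval probability is the difference of two left-hand areas. The sketch below carries the standardization through numerically, again assuming the N(595.28, 86.40) approximation; the helper `area_between` is a hypothetical name introduced for illustration.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40
std_normal = NormalDist()  # mean 0, standard deviation 1

def area_between(low, high):
    """P(low < X < high) via standardization and F(z)."""
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    # Area to the left of z_high minus area to the left of z_low.
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

p = area_between(600, 750)
print(f"P(600 < X < 750) = {p:.4f}")
```

Under this approximation, somewhat under half of the students fall in the 600 to 750 range.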
33
Summary Normal distribution calculations
  • Follow these steps:
  • State the problem. Calculate the sample average
    and the s.d. and define the interval you are
    interested in
  • Standardize
  • Compute the area under the standard normal
    density curve using Table A.
  • Online calculators and more info
  • area under Normal
  • another one (a bit glitchy)
  • sampling distributions
  • more on the Normal distribution

34
Inverse Problem: What is the lowest SAT math
score that a student must have to be in the top
25% of all CS students in the sample?
Mean = 595.28, Std Dev. s = 86.40
[Figure: normal curve with the top 25% of the area marked above the unknown score "?"; Sample Q3 = 650]
Find the value x such that 25% of observations
fall at or above it.
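The inverse problem asks for a quantile rather than an area. One way to sketch it in Python, assuming the normal approximation N(595.28, 86.40), is with `NormalDist.inv_cdf` from the standard library: the score with 25% of the area above it has 75% of the area below it.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40
scores = NormalDist(mean, sd)

# 25% at or above x means 75% below x.
x = scores.inv_cdf(0.75)

# Equivalent route: find the standard value z first, then unstandardize.
z = NormalDist().inv_cdf(0.75)
x_from_z = mean + z * sd

print(f"z = {z:.3f}, x = {x:.2f}")
```

The normal approximation puts the cutoff a few points above the sample third quartile of 650 reported on the slide.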
35
Example: Data on the time between machine
failures were collected during a study on machine
performance that involved 39 similar machines.
From the data we compute the sample mean,
23.35 hours, and the sample standard deviation,
1.67 hours.
  • What is the percentage of machines that failed
    after 24 hours?
  • What is the percentage of machines with failure
    time between 20 and 22 hours?
  • How short should the failure time be for a
    machine to be in the bottom 10%?

36
Answers
  • The observations are on the variable Time of
    failure X that is approximately normal N(23.35,
    1.67).
  • What is the percentage of machines that failed
    after 24 hours?
  • Compute the percentage for X > 24, that is equal
    to the area under the normal distribution to the
    right of 24.
  • Standardize X > 24 as
    Z = (X - 23.35)/1.67 > (24 - 23.35)/1.67
  • Or equivalently Z > 0.39
  • Now use the standard normal probability tables
  • The area under the standard normal to the right
    of 0.39 is equal to
  • Area to the right of 0.39 =
    1 - (Area to the left of 0.39)
  • = 1 - 0.6517
  • = 0.3483
  • The answer is 0.3483. About 35% of the machines
    failed after 24 hours.

37
  • What is the percentage of machines with failure
    time between 20 and 22 hours?
  • We need to compute the area under the normal
    distribution for
  • 20 < X < 22. This is computed by subtracting
  • (Area for X < 22) - (Area for X < 20).
  • Standardize
  • X < 22 is, in standard units,
    Z < (22 - 23.35)/1.67 = -0.81
  • X < 20 is, in standard units,
    Z < (20 - 23.35)/1.67, about -2.00
  • Use the standard normal probability tables
  • The area under the standard normal distribution
    for Z < -0.81 is 0.2090
  • The area under the standard normal distribution
    for Z < -2.00 is 0.0228
  • The answer is 0.2090 - 0.0228 = 0.1862
  • Conclusion: about 19% of the machines had a
    failure time between 20 and 22 hours.

38
  • How short should the failure time be for a
    machine to be in the bottom 10%?
  • We need to compute the value x for X ~ N(23.35,
    1.67), such that the area under the normal
    distribution on the left of x is equal to 0.1.

[Figure: normal curve centered at 23.35 with area 0.1 to the left of x]
From the normal probability tables, the standard
value z that corresponds to an area P(Z < z) = 0.1
is z = -1.28. Thus, transforming the z-value back
to the x-units, we have x = -1.28 × st.dev. +
mean = -1.28 × 1.67 + 23.35 = 21.21. So the bottom
10% of the machines have failure time equal to
21.21 hours or shorter.
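All three machine-failure answers can be cross-checked in a few lines of Python. This is a sketch assuming the N(23.35, 1.67) approximation; because it does not round z to two decimals the way a table lookup does, the results may differ from the table-based answers in the third decimal place.

```python
from statistics import NormalDist

failure_time = NormalDist(23.35, 1.67)  # hours

# 1) Share of machines that failed after 24 hours: area to the right of 24.
p_after_24 = 1 - failure_time.cdf(24)

# 2) Share with failure time between 20 and 22 hours.
p_between = failure_time.cdf(22) - failure_time.cdf(20)

# 3) Failure time that cuts off the bottom 10%.
x_bottom_10 = failure_time.inv_cdf(0.10)

print(f"P(X > 24)      = {p_after_24:.4f}")
print(f"P(20 < X < 22) = {p_between:.4f}")
print(f"10th percentile = {x_bottom_10:.2f} hours")
```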
39
Normal approximations
Is the normal approximation appropriate for these
data?
[Figure: histogram with a normal curve overlaid; the curve overestimates the area in one region and underestimates it in another]
Use it when the histogram of the observations is
bell-shaped!
40
Normal quantile plots
  • A useful tool for assessing if the data come from
    a normal distribution is a graph called a normal
    quantile plot.
  • If the points on a normal quantile plot lie close
    to a straight line, the plot indicates that the
    data are approximately normal. Deviations from a
    straight line indicate that the data are not
    normal.
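The points of a normal quantile plot can be computed by pairing each sorted observation with the standard normal quantile for its position. The sketch below uses the metabolic rate data from earlier in the lecture and the plotting positions (i - 0.5)/n, which is one common convention and an assumption here; statistical software may use a slightly different one.

```python
from statistics import NormalDist

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

ordered = sorted(rates)
n = len(ordered)

# Standard normal quantile for the i-th smallest observation,
# using the assumed plotting position (i - 0.5) / n.
z_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each point is (theoretical z, observed value); plot these and
# check how close they lie to a straight line.
points = list(zip(z_scores, ordered))
for z, x in points:
    print(f"{z:6.2f}  {x}")
```

With only 7 observations the plot is a rough guide at best; it becomes more informative for larger samples.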

41
SAS for Exploratory Data Analysis
  • PROC MEANS, PROC UNIVARIATE: to compute
    descriptive statistics
  • PROC CHART (GCHART): to plot histograms
  • PROC UNIVARIATE: to plot histograms, normal
    probability plots, boxplots.