Title: Review of Midterm
1 Review of Midterm
2 Data
- The W's
- Subjects
- Variables
- Categorical versus quantitative
3 One categorical variable
- Graphs
- Bar chart
- Pie chart
- Numerical summary
- Frequency table
- Relative frequency table
4 Two categorical variables
- Conditional and marginal distribution
- Graphs
- Segmented bar charts
- Side-by-side bar charts
- Side-by-side pie charts
- Numerical summary
- Contingency table
- Table percentage, row percentage, column percentage
5 Problem 28 (page 148)
- Birth order related to major?
- What percent of these students are oldest or only children? (113/223 = 50.7%)
- What percent of Humanities majors are oldest children? (15/43 = 34.9%)
- What percent of oldest children are Humanities students? (15/113 = 13.3%)
- What percent of the students are oldest children majoring in the Humanities? (15/223 = 6.7%)
                Birth Order
Major             1     2     3     4   Total
Math/Science     34    14     6     3      57
Agriculture      52    27     5     9      93
Humanities       15    17     8     3      43
Other            12    11     1     6      30
Total           113    69    20    21     223
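The four answers in Problem 28 differ only in which total goes in the denominator. A minimal Python sketch of that arithmetic, using the counts from the table (the variable and function names are illustrative, not from the slides):

```python
# Counts from the Problem 28 table: one row per major, columns are birth order 1 to 4.
counts = {
    "Math/Science": [34, 14, 6, 3],
    "Agriculture": [52, 27, 5, 9],
    "Humanities": [15, 17, 8, 3],
    "Other": [12, 11, 1, 6],
}

grand_total = sum(sum(row) for row in counts.values())   # all students
oldest_total = sum(row[0] for row in counts.values())    # birth order 1, any major
humanities_total = sum(counts["Humanities"])             # Humanities, any birth order
oldest_humanities = counts["Humanities"][0]              # birth order 1 and Humanities


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


answers = {
    "students who are oldest or only children": pct(oldest_total, grand_total),
    "Humanities majors who are oldest children": pct(oldest_humanities, humanities_total),
    "oldest children who are Humanities majors": pct(oldest_humanities, oldest_total),
    "students who are oldest AND Humanities": pct(oldest_humanities, grand_total),
}

for question, value in answers.items():
    print(f"{question}: {value}%")
```

The second and third questions use the same numerator (15) with a row total and a column total respectively, which is the row-percentage versus column-percentage distinction.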
6 Problem 30 (page 148)
- What is the marginal distribution of majors?
- What is the conditional distribution of majors for the oldest children?
Marginal distribution of majors, count (percent):
Math/Science   Agriculture   Humanities   Other       Total
57 (25.6)      93 (41.7)     43 (19.3)    30 (13.5)   223
Conditional distribution of majors for the oldest children, count (percent):
Math/Science   Agriculture   Humanities   Other       Total
34 (30.1)      52 (46.0)     15 (13.3)    12 (10.6)   113
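Both distributions come from dividing a set of counts by its own total: the Total column for the marginal distribution, the birth-order-1 column for the conditional one. A short Python sketch, with helper names chosen here for illustration:

```python
majors = ["Math/Science", "Agriculture", "Humanities", "Other"]
all_students = [57, 93, 43, 30]   # Total column of the table
oldest_only = [34, 52, 15, 12]    # birth order 1 column of the table


def distribution(counts):
    """Convert counts to percentages of their own total, one decimal place."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


marginal = distribution(all_students)
conditional = distribution(oldest_only)

for major, m, c in zip(majors, marginal, conditional):
    print(f"{major:13s} marginal {m:5.1f}%   given oldest {c:5.1f}%")
```

Comparing the two columns of output is one way to judge whether major looks independent of birth order.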
7 Simpson's Paradox
- Problem 3.38: Two delivery services
Delivery service   Type of service   Number of deliveries   Number of late packages (%)   Overall % of late deliveries
Pack Rats          Regular           400                    12 (3)                        5.6
Pack Rats          Overnight         100                    16 (16)                       5.6
Boxes R Us         Regular           100                     2 (2)                        6
Boxes R Us         Overnight         400                    28 (7)                        6
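The reversal in Problem 3.38 can be checked directly from the counts. The sketch below recomputes the late percentages by service type and overall; the dictionary layout and names are illustrative choices, not part of the problem:

```python
# (number of deliveries, number late) for each company and type of service.
data = {
    "Pack Rats": {"Regular": (400, 12), "Overnight": (100, 16)},
    "Boxes R Us": {"Regular": (100, 2), "Overnight": (400, 28)},
}


def late_pct(deliveries, late):
    """Percentage of deliveries that were late."""
    return 100 * late / deliveries


by_type = {
    company: {kind: late_pct(*cell) for kind, cell in kinds.items()}
    for company, kinds in data.items()
}
overall = {
    company: late_pct(
        sum(n for n, _ in kinds.values()),
        sum(late for _, late in kinds.values()),
    )
    for company, kinds in data.items()
}

for company in data:
    print(company, by_type[company], f"overall {overall[company]:.1f}%")
```

Pack Rats has the higher late rate within each type of service yet the lower rate overall, because most of its deliveries are the easier Regular kind while most of Boxes R Us's are Overnight.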
8 One quantitative variable
- Graphs
- Histogram
- Boxplot
- Qualitative summary
- Number of modes
- Symmetric? Transformation?
- Outliers?
- Numerical summary
- Five-number summary
- Center: mean versus median
- Spread: SD versus IQR
9 Problem 32: Pay
- The 1999 National Occupational Employment and Wage Estimates for Management Occupations
- For chief executives
- Mean: $48.67/hour
- Median: $52.08/hour
- For general and operations managers
- Mean: $31.69/hour
- Median: $27.23/hour
- Are these wage distributions likely to be symmetric, skewed to the left, or skewed to the right?
10 Shifting and rescaling
(x = the statistic changes)
                      Location shift   Rescale
min                         x             x
Q1                          x             x
median                      x             x
Q3                          x             x
max                         x             x
mean                        x             x
Spread:
variance                                  x
standard deviation                        x
IQR                                       x
range                                     x
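The pattern in the table can be verified numerically. The sketch below uses a small made-up data set (hypothetical values, not from the course), adds a constant to it, then multiplies it by a constant, and compares the summaries. Note that `statistics.quantiles` uses its own quartile convention, which may differ slightly from the TI-83's.

```python
import statistics


def summary(values):
    """Position and spread summaries for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values), "Q1": q1, "median": median, "Q3": q3,
        "max": max(values), "mean": statistics.mean(values),
        "variance": statistics.variance(values),
        "sd": statistics.stdev(values),
        "IQR": q3 - q1,
        "range": max(values) - min(values),
    }


data = [2, 4, 4, 5, 7, 9, 11]                 # hypothetical data
base = summary(data)
shifted = summary([v + 10 for v in data])     # location shift by +10
scaled = summary([3 * v for v in data])       # rescale by a factor of 3

for name in base:
    print(f"{name:9s} base {base[name]:7.2f}  shifted {shifted[name]:7.2f}  scaled {scaled[name]:7.2f}")
```

Shifting moves every position measure by the constant and leaves the spread measures alone. Rescaling multiplies the position measures, SD, IQR, and range by the factor, and the variance by the factor squared.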
11 Problem 4.42: Job Growth
- Job growth rates for 20 cities, predicted by Standard & Poor's DRI in 1996
- Are the mean and median very different?
- Which one is more appropriate?
- Mean (2.37) or median (2.235)?
- SD (0.425) or IQR (0.515)?
- If we subtract from these growth rates the predicted U.S. average growth rate of 1.20%, how would this change the above summary statistics?
- If we omit Las Vegas (growth rate 3.72%) from the data, how would you expect the above summary statistics to change?
- How to summarize the distribution of the data?
12 One quantitative variable and one categorical variable
- Comparing groups
- With histograms, boxplots, stem-and-leaf plots
- Transformation when spread is too different across groups
13 Normal model
- Z-score and standard Normal
- Nearly Normal Condition
- Normal probability plot
- Four types of problems
- Given parameters and data values (or z-scores), ask for probabilities
- Given parameters and probabilities, ask for data values (or z-scores)
- Given probabilities and data values (or z-scores), ask for parameters
- Given probabilities, data values (or z-scores), and one parameter, ask for the other parameter
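The four problem types can be sketched with Python's `statistics.NormalDist`, whose `cdf` and `inv_cdf` methods play roughly the role of the calculator's Normcdf and invNorm. The parameters (mean 100, SD 15) and the data values below are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

standard = NormalDist()                 # standard Normal: mean 0, SD 1
model = NormalDist(mu=100, sigma=15)    # hypothetical parameters

# Type 1: parameters and data values given, probability asked.
p_between = model.cdf(115) - model.cdf(85)

# Type 2: parameters and a probability given, data value asked.
value_90th = model.inv_cdf(0.90)

# Type 3: two (value, probability) pairs given, both parameters asked.
# Each pair gives one equation (x - mu) / sigma = z.
x1, p1 = 90, model.cdf(90)
x2, p2 = 120, model.cdf(120)
z1, z2 = standard.inv_cdf(p1), standard.inv_cdf(p2)
sigma_found = (x2 - x1) / (z2 - z1)
mu_found = x1 - z1 * sigma_found

# Type 4: one (value, probability) pair and one parameter given, other parameter asked.
mu_from_sigma = x1 - z1 * 15

print(round(p_between, 4), round(value_90th, 2))
print(round(mu_found, 3), round(sigma_found, 3), round(mu_from_sigma, 3))
```

Types 3 and 4 both reduce to turning each probability into a z-score and solving z = (x - mean)/SD for the unknowns.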
14 Problem 22: Winter Olympics 2002 speed skating
- Top 25 men's and 25 women's 500-m speed skating times
- Mean: 73.46
- SD: 3.33
- If the Normal model is appropriate, what percent of the times should be within 1.67 seconds of 73.46?
- Solution 1: 1.67 = 0.5 SD, Normcdf(-0.5, 0.5, 0, 1)
- Solution 2: Normcdf(71.79, 75.13, 73.46, 3.33)
- In the data, only 6% are within that range. Why are the percentages so different?
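Both solutions to Problem 22 should give the same probability, since one works on the z-score scale and the other on the original scale. A quick check in Python, using `statistics.NormalDist` as a stand-in for the calculator's Normcdf:

```python
from statistics import NormalDist

mean, sd = 73.46, 3.33
half_width = 1.67

# Solution 1: convert the half-width to a z-score and use the standard Normal.
z = half_width / sd
standard = NormalDist()
p_from_z = standard.cdf(z) - standard.cdf(-z)

# Solution 2: work directly with the model for the times.
times = NormalDist(mu=mean, sigma=sd)
p_from_times = times.cdf(mean + half_width) - times.cdf(mean - half_width)

print(f"z = {z:.3f}")
print(f"P(within 1.67 s of the mean): {p_from_z:.4f} and {p_from_times:.4f}")
```

The model predicts roughly 38% of times in that interval, far more than observed, which suggests the Normal model is a poor fit when men's and women's times are pooled.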
16 Problem 39: Assembly time
- Only 25% of the company's customers succeeded in building the desk in under an hour
- 5% said it took them over 2 hours
- Assume that consumer assembly time follows a Normal model
- Mean = ?, SD = ?
- Z-score corresponding to 25%
- (1 - mean)/SD = invNorm(0.25, 0, 1) = -0.6744897495
- Z-score corresponding to 95%
- (2 - mean)/SD = invNorm(0.95, 0, 1) = 1.644853626
- Solving the two equations, we have
- Mean = 1.29
- SD = 0.43
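The two z-score equations can be solved by subtracting one from the other, which eliminates the mean. A minimal sketch in Python, with `NormalDist().inv_cdf` standing in for invNorm:

```python
from statistics import NormalDist

standard = NormalDist()
z_one_hour = standard.inv_cdf(0.25)    # 25% finished within 1 hour
z_two_hours = standard.inv_cdf(0.95)   # 5% took over 2 hours, so 95% finished within 2

# (1 - mean)/sd = z_one_hour and (2 - mean)/sd = z_two_hours.
# Subtracting the first from the second gives (2 - 1)/sd = z_two_hours - z_one_hour.
sd = (2 - 1) / (z_two_hours - z_one_hour)
mean = 1 - z_one_hour * sd

print(f"mean = {mean:.2f} hours, SD = {sd:.2f} hours")
```

The z-score for 25% is negative because one hour lies below the mean.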
17 Problem 39: Assembly time (cont.)
- Mean = 1.29, SD = 0.43
- What assembly time should the company quote in order that 60% of customers succeed in finishing the desk by then?
- invNorm(0.6, 1.29, 0.43)
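For reference, the invNorm call can be evaluated in Python as follows, assuming the rounded mean and SD shown on the slide:

```python
from statistics import NormalDist

# Rounded parameters from the slide.
assembly = NormalDist(mu=1.29, sigma=0.43)

# Time by which 60% of customers are expected to have finished.
quote_hours = assembly.inv_cdf(0.60)

print(f"quote about {quote_hours:.2f} hours")
```

This comes out to roughly 1.4 hours.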
18 Problem 39: Assembly time (cont.)
- Mean = 1.29, SD = 0.43
- The company wishes to improve the one-hour success rate to 60%. If the SD stays the same, what new lower mean time does the company need to achieve?
- Z-score = invNorm(0.6, 0, 1)
- Z-score = (1 - mean)/SD
- Mean = 0.89
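The new mean follows from rearranging z = (1 - mean)/SD with the SD held at 0.43. A short check in Python:

```python
from statistics import NormalDist

sd = 0.43                               # unchanged SD, as rounded on the slide
z_60 = NormalDist().inv_cdf(0.60)       # z-score with 60% of the area below it

# z_60 = (1 - new_mean) / sd, so new_mean = 1 - z_60 * sd.
new_mean = 1 - z_60 * sd

# Sanity check: with this mean, the one-hour success rate should be 60%.
success_rate = NormalDist(mu=new_mean, sigma=sd).cdf(1)

print(f"new mean = {new_mean:.2f} hours, one-hour success rate = {success_rate:.2f}")
```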
19 Correlation
- Sign of r means?
- The range of r?
- X and Y are called uncorrelated if and only if r = 0
- r(x, y) = r(y, x)
- No units
- Affected by shifting or rescaling X, Y, or both?
- Uncorrelated does NOT imply no association
- Sensitive to outliers (does removing a point close to the line fitted through the scatterplot increase or decrease r?)
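Several of these properties can be checked numerically. The sketch below computes r from its definition on small made-up data sets (all values hypothetical) and tests symmetry, shifting and rescaling, an uncorrelated but clearly associated pair, and the effect of one outlier.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5, 6]                       # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                                             # r(x, y) = r(y, x)
r_rescaled = pearson_r([10 * v + 3 for v in x], [0.5 * v - 7 for v in y])  # shift and rescale
r_flipped = pearson_r([-v for v in x], y)                               # negative factor flips the sign

# Uncorrelated does not mean no association: y = x^2 on a symmetric range.
u = [-2, -1, 0, 1, 2]
r_parabola = pearson_r(u, [v ** 2 for v in u])

# One point far from the pattern changes r dramatically.
r_with_outlier = pearson_r(x + [20], y + [0])

print(round(r, 4), round(r_rescaled, 4), round(r_flipped, 4))
print(r_parabola, round(r_with_outlier, 4))
```

Shifting, or rescaling by positive factors, leaves r unchanged; multiplying one variable by a negative number only flips its sign.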
20 Correlation: Review II 13 (page 264)
- What factor most explains differences in Fuel Efficiency among cars? Here's a correlation matrix exploring that relationship for the cars' Weight, Horsepower, engine size (Displacement), and number of Cylinders.
               MPG      Weight   Horsepower   Displacement   Cylinders
MPG            1.000
Weight        -0.903    1.000
Horsepower    -0.871    0.917    1.000
Displacement  -0.786    0.951    0.872        1.000
Cylinders     -0.806    0.917    0.864        0.940          1.000
- Which factor seems most strongly associated with Fuel Efficiency?
- What does the negative correlation indicate?
- Explain the meaning of R^2 for that relationship.
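The first and last questions can be answered from the MPG column alone: the strongest association is the correlation with the largest absolute value, and R^2 is that correlation squared. A small sketch (the dictionary is just the MPG column retyped):

```python
# Correlations with MPG, read from the first column of the matrix.
r_with_mpg = {
    "Weight": -0.903,
    "Horsepower": -0.871,
    "Displacement": -0.786,
    "Cylinders": -0.806,
}

# Strength is judged by the absolute value; the sign only gives the direction.
strongest = max(r_with_mpg, key=lambda name: abs(r_with_mpg[name]))
r_squared = r_with_mpg[strongest] ** 2

print(f"{strongest}: r = {r_with_mpg[strongest]}, R^2 = {100 * r_squared:.1f}%")
```

For a linear model of MPG on that variable, R^2 is the percentage of the variation in MPG accounted for by the model.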
21 Matching r and scatterplots
- Here are several scatterplots. The calculated correlations are 0.85, -0.87, 0.04, and 0.53. Which is which?
22 Linear regression (least squares)
- How to calculate the slope?
- Given the slope and the standard deviations, how to calculate the correlation?
- The line always goes through which point?
- Residual
- Overestimation
- Underestimation
- Causal relationship?
- How to interpret?
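As a reminder of how these pieces fit together, here is a minimal sketch of a least squares fit built from r and the two standard deviations. The data are made up for illustration, and the function name is a hypothetical helper, not something from the course materials.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept, r) using slope = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar   # forces the line through (x_bar, y_bar)
    return slope, intercept, r


x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

slope, intercept, r = least_squares(x, y)
predicted = [intercept + slope * v for v in x]

# Residual = observed - predicted.
# A negative residual means the model overestimated; a positive one, underestimated.
residuals = [obs - pred for obs, pred in zip(y, predicted)]

print(f"y-hat = {intercept:.2f} + {slope:.2f} x, r = {r:.3f}")
print([round(e, 2) for e in residuals])
```

Rearranging the slope formula gives the correlation back as r = slope * s_x / s_y. A strong fit by itself says nothing about whether x causes y.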
23 Diagnostics of a Linear Model
- Visual inspection: does the scatterplot satisfy the Straight Enough Condition? If it looks okay, go on.
- Regression: calculate the regression equation, r, and R^2. (R^2 = r*r gives the percentage of variation in the data explained by the model.) If R^2 is tiny, say < 0.2, a linear model may not be a good choice.
- Residuals: check the residual plot even when R^2 is large. It is a bad sign if we see some pattern. The spread of the residuals is supposed to be about the same across the x-axis if the linear model is appropriate. (You can put either the predicted value or the x-variable on the x-axis.)
- Re-expression: if a linear model is not appropriate for the data, consider re-expressing the data. Remember to repeat the diagnostics every time after fitting a new linear model to the transformed data.
24 Randomness: Simulation
- Simulation component?
- Response variable?
- Trial?
- Example 11.20
- Suppose the chance of passing the driver's test is 34% the first time and 72% for the subsequent retests. Estimate the percentage of those tested who still do not have a driver's license after two attempts.
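One way to set up the simulation for Example 11.20 is sketched below. The component is a single test attempt, a trial is one person taking up to two attempts, and the response variable is whether that person is still without a license. The function names, the number of trials, and the seed are arbitrary choices made for this sketch.

```python
import random


def still_unlicensed(rng, p_first=0.34, p_retest=0.72):
    """One trial: True if the person fails both the first test and the retest."""
    if rng.random() < p_first:       # passed on the first attempt
        return False
    return rng.random() >= p_retest  # True only if the retest is also failed


def simulate(n_trials=100_000, seed=1):
    """Estimated percentage still unlicensed after two attempts."""
    rng = random.Random(seed)        # seeded so the run is repeatable
    failures = sum(still_unlicensed(rng) for _ in range(n_trials))
    return 100 * failures / n_trials


estimate = simulate()
exact = 100 * (1 - 0.34) * (1 - 0.72)   # assumes the two attempts are independent

print(f"simulated: {estimate:.1f}%, exact under independence: {exact:.2f}%")
```

With this many trials the simulated percentage should land close to the exact value of about 18.5%; on a calculator, a few dozen trials with random digits gives a rougher estimate of the same quantity.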
25 Checklist
- Graphs and plots: bar chart, pie chart, histogram, boxplot (modified boxplot on the TI-83), normal probability plot, scatterplot, residual plot. How to make? How to interpret?
- Statistics: mean, median, min, max, range, quartiles, standard deviation, IQR, correlation coefficient. How to calculate? How to interpret?
- Models: Normal distribution, linear regression. How to get the parameters?