Title: AP Statistics
1. AP Statistics
2. Hypothesis Tests: Slopes
- Given: Observed slope relating Education to Job Prestige = 2.47
- Question: Can we generalize this to the population of all Americans?
  - How likely is it that this observed slope was actually drawn from a population with slope = 0?
- Solution: Conduct a hypothesis test
- Notation: sample slope = b, population slope = β
- H0: Population slope β = 0
- H1: Population slope β ≠ 0 (two-tailed test)
3. Review: Slope Hypothesis Tests
- What information lets us do a hypothesis test?
- Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
  - It is the distribution of every value of the slope, based on all possible samples (of size N)
- If certain assumptions are met, the sampling distribution approximates the t-distribution
  - Thus, we can assess the probability that a given value of b would be observed, if β = 0
  - If the probability is low (below alpha), we reject H0
4. Review: Slope Hypothesis Tests
- Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
  - Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
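A minimal simulation sketch of the idea on the two review slides, assuming a made-up population in which Y is unrelated to X (population slope of 0). The sample size, the error spread, and the helper names `ols_slope` and `simulate_null_slopes` are illustrative assumptions, not part of the slides.

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope b for a bivariate regression of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def simulate_null_slopes(n=50, draws=2000, seed=0):
    """Draw many samples of size n from a population whose true slope is 0."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(draws):
        x = [rng.uniform(0, 20) for _ in range(n)]
        y = [40 + rng.gauss(0, 10) for _ in range(n)]  # y ignores x, so slope is 0
        slopes.append(ols_slope(x, y))
    return slopes

slopes = simulate_null_slopes()
print("mean of sampled slopes:", round(statistics.fmean(slopes), 3))
print("std. dev. of sampled slopes:", round(statistics.stdev(slopes), 3))

# How often does chance alone produce a slope as extreme as the observed 2.47?
share_extreme = sum(abs(b) >= 2.47 for b in slopes) / len(slopes)
print("share with |b| >= 2.47:", share_extreme)
```

In this made-up population the sampled slopes cluster around zero. How unusual a slope of 2.47 looks with real data depends on N, the error variance, and the spread of X.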
5. Bivariate Regression Assumptions
- Assumptions for bivariate regression hypothesis tests:
- 1. Random sample
  - Ideally N > 20
  - But different rules of thumb exist (10, 30, etc.)
- 2. Variables are linearly related
  - i.e., the mean of Y changes linearly with X
  - Check scatter plot for general linear trend
  - Watch out for non-linear relationships (e.g., U-shaped)
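A small sketch of why the U-shaped case matters, using made-up numbers: a perfectly U-shaped relationship can produce a fitted slope of zero even though Y depends entirely on X. The helper name `ols_slope` and both data sets are illustrative assumptions.

```python
def ols_slope(x, y):
    """Least-squares slope b for a bivariate regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

x = list(range(11))                      # X = 0, 1, ..., 10
y_linear = [3 + 2 * xi for xi in x]      # straight-line relationship
y_u_shaped = [(xi - 5) ** 2 for xi in x] # U-shaped, symmetric around X = 5

slope_linear = ols_slope(x, y_linear)
slope_u_shaped = ols_slope(x, y_u_shaped)
print("slope, linear data:  ", slope_linear)
print("slope, U-shaped data:", slope_u_shaped)
```

The slope for the U-shaped data is zero because the falling left half and rising right half cancel out, which is why the scatter plot check comes before any hypothesis test.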
6. Bivariate Regression Assumptions
- 3. Y is normally distributed for every outcome of X in the population
  - "Conditional normality"
- Ex: Years of Education = X, Job Prestige = Y
- Suppose we look only at a sub-sample: X = 12 years of education
  - Is a histogram of Job Prestige approximately normal?
  - What about for people with X = 4? X = 16?
- If all are roughly normal, the assumption is met
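One rough way to carry out the sub-sample check is sketched below. It assumes simulated education and prestige values, and it uses the share of cases within one standard deviation of the mean (about 0.68 for a normal distribution) as a crude stand-in for inspecting a histogram. The data, the helper name `share_within_one_sd`, and the choice of statistic are all assumptions for illustration.

```python
import random
import statistics
from collections import defaultdict

def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean.
    For a normal distribution this is roughly 0.68."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= sd for v in values) / len(values)

# Hypothetical (years of education, job prestige) cases, simulated.
rng = random.Random(1)
cases = [(x, 10 + 3 * x + rng.gauss(0, 8))
         for x in (4, 12, 16) for _ in range(1000)]

# Split Y into sub-samples, one per value of X.
prestige_by_education = defaultdict(list)
for education, prestige in cases:
    prestige_by_education[education].append(prestige)

shares = {}
for education, values in sorted(prestige_by_education.items()):
    shares[education] = share_within_one_sd(values)
    print(f"X = {education:2d}: n = {len(values)}, "
          f"mean = {statistics.fmean(values):.1f}, "
          f"share within 1 SD = {shares[education]:.2f}")
```

A share far from 0.68 in some sub-sample would be a hint to look at that histogram more closely. It is not a formal test of normality.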
8. Bivariate Regression Assumptions
- 4. The variances of prediction errors are identical at different values of X
  - Recall: Error is the deviation from the regression line
  - Is dispersion of error consistent across values of X?
  - Definition: "homoskedasticity" = error dispersion is consistent across values of X
  - Opposite: "heteroskedasticity", error dispersion varies with X
- Test: Compare errors for X = 12 years of education with errors for X = 2, X = 8, etc.
  - Are the errors around the line similar? Or different?
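The comparison of errors at different values of X can be sketched as follows. The sketch fits the line, then compares the spread of the errors for low-X and high-X cases. The simulated data, the cutoff of 10, and the helper names `fit_line` and `residual_sd_ratio` are assumptions for illustration. A ratio near 1 suggests similar error dispersion.

```python
import random
import statistics

def fit_line(x, y):
    """Return (a, b): intercept and slope of the least-squares line."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def residual_sd_ratio(x, y, cutoff):
    """Spread of errors for X >= cutoff divided by spread for X < cutoff."""
    a, b = fit_line(x, y)
    low = [yi - (a + b * xi) for xi, yi in zip(x, y) if xi < cutoff]
    high = [yi - (a + b * xi) for xi, yi in zip(x, y) if xi >= cutoff]
    return statistics.pstdev(high) / statistics.pstdev(low)

rng = random.Random(2)
x = [rng.randint(0, 20) for _ in range(400)]
# Equal error variance at every X (homoskedastic).
y_equal = [10 + 2 * xi + rng.gauss(0, 5) for xi in x]
# Error spread grows with X (heteroskedastic).
y_unequal = [10 + 2 * xi + rng.gauss(0, 0.5 + 0.5 * xi) for xi in x]

ratio_equal = residual_sd_ratio(x, y_equal, cutoff=10)
ratio_unequal = residual_sd_ratio(x, y_unequal, cutoff=10)
print("high-X / low-X error spread, equal variance:  ", round(ratio_equal, 2))
print("high-X / low-X error spread, unequal variance:", round(ratio_unequal, 2))
```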
9. Bivariate Regression Assumptions
- Homoskedasticity: Equal Error Variance
- [Scatterplot not reproduced] Here, things look pretty good.
10. Bivariate Regression Assumptions
- Heteroskedasticity: Unequal Error Variance
- [Scatterplot not reproduced] This looks pretty bad.
11. Bivariate Regression Assumptions
- Notes/Comments:
- 1. Overall, regression is robust to violations of assumptions
  - It often gives fairly reasonable results, even when assumptions aren't perfectly met
- 2. Variations of regression can handle situations where assumptions aren't met
- 3. But, there are also further diagnostics to help ensure that results are meaningful
12. Regression Hypothesis Tests
- If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
- Standard deviation of the sampling distribution is called the standard error of the slope (σ_b)
- Population formula of standard error:
  σ_b = sqrt( σ_e^2 / Σ(X_i - X̄)^2 )
- Where σ_e^2 is the variance of the regression error
13. Regression Hypothesis Tests
- Estimating σ_e^2 lets us estimate the standard error:
  s_e^2 = Σ e_i^2 / (N - 2)
- Now we can estimate the S.E. of the slope:
  s_b = sqrt( s_e^2 / Σ(X_i - X̄)^2 )
14. Regression Hypothesis Tests
- Finally: A t-value can be calculated
  - It is the slope divided by the standard error:
    t = b / s_b
- Where s_b is the sample point estimate of the standard error
- The t-value is based on N - 2 degrees of freedom
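The chain of calculations on these slides (error variance, then standard error of the slope, then t-value) can be sketched in a few lines. The five data points are made up for illustration, and the function name `slope_t_test` is an assumption. The error variance is estimated as the sum of squared errors divided by N - 2, and the standard error of the slope is the square root of that variance divided by the sum of squared deviations of X from its mean.

```python
import math

def slope_t_test(x, y):
    """Slope, its estimated standard error, and the t-value for H0: slope = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # sample slope
    a = mean_y - b * mean_x            # sample constant
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e2 = sse / (n - 2)               # estimated variance of the regression error
    s_b = math.sqrt(s_e2 / sxx)        # estimated standard error of the slope
    return {"b": b, "s_e2": s_e2, "s_b": s_b, "t": b / s_b, "df": n - 2}

# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
result = slope_t_test(x, y)
print(result)
```

For these five points the slope is 0.6 and the t-value is about 2.12 on 3 degrees of freedom. The two-tailed .05 critical value from a t table for 3 d.f. is about 3.18, so H0 would not be rejected with so small a sample.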
15. Regression Confidence Intervals
- You can also use the standard error of the slope to estimate confidence intervals:
  C.I. = b ± t_(N-2) · s_b
- Where t_(N-2) is the t-value for a two-tailed test given a desired α-level
- Example: Observed slope = 2.5, S.E. = .10
  - 95% t-value for 102 d.f. is approximately 2
  - 95% C.I. = 2.5 ± 2(.10)
  - Confidence Interval: 2.3 to 2.7
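The worked example can be reproduced with a short helper. The critical t-value is passed in by hand (the slide's approximation of 2 for 102 d.f.) rather than looked up, and the function name `slope_confidence_interval` is an assumption for illustration.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Confidence interval for a slope: b plus or minus t_crit standard errors."""
    margin = t_crit * s_b
    return b - margin, b + margin

# Values from the slide: slope 2.5, S.E. .10, t approximately 2 for 102 d.f.
low, high = slope_confidence_interval(b=2.5, s_b=0.10, t_crit=2.0)
print(f"95% C.I.: {low:.1f} to {high:.1f}")   # 95% C.I.: 2.3 to 2.7

# If the interval excludes zero, the two-tailed test at the same
# alpha-level would reject H0: population slope = 0.
contains_zero = low <= 0 <= high
print("interval contains zero:", contains_zero)
```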
16. Regression Hypothesis Tests
- You can also use a t-test to determine if the constant (a) is significantly different from zero
  - But, this is typically less useful to do
- Hypotheses (α = population parameter of a):
  - H0: α = 0, H1: α ≠ 0
- But, most research focuses on slopes
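A sketch of the same kind of t-test applied to the constant. The slides do not give the standard error of the constant. The version below uses the usual textbook expression, the square root of s_e^2 · (1/N + X̄^2 / Σ(X_i - X̄)^2), which is brought in here as an assumption. The data and the function name `intercept_t_test` are made up for illustration.

```python
import math

def intercept_t_test(x, y):
    """Constant a, its estimated standard error, t-value, and degrees of freedom."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e2 = sse / (n - 2)                                   # error variance estimate
    s_a = math.sqrt(s_e2 * (1 / n + mean_x ** 2 / sxx))    # S.E. of the constant
    return a, s_a, a / s_a, n - 2

# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, s_a, t_value, df = intercept_t_test(x, y)
print(f"a = {a:.2f}, S.E. = {s_a:.3f}, t = {t_value:.3f}, d.f. = {df}")
```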
17. Regression Outliers
- Note: Even if regression assumptions are met, slope estimates can have problems
- Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample
- Outliers can result from:
  - Errors in coding or data entry
  - Highly unusual cases
  - Or, sometimes they reflect important "real" variation
- Even a few outliers can dramatically change estimates of the slope (b)
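A small made-up illustration of how much a single extreme case can move the slope: ten points that lie exactly on a line with slope 2, plus one outlier. The data and the helper name `ols_slope` are assumptions for illustration.

```python
def ols_slope(x, y):
    """Least-squares slope b for a bivariate regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Ten cases lying exactly on the line Y = 2X.
x_clean = list(range(1, 11))
y_clean = [2 * xi for xi in x_clean]

# The same cases plus one extreme case at X = 20, Y = 0.
x_with_outlier = x_clean + [20]
y_with_outlier = y_clean + [0]

slope_clean = ols_slope(x_clean, y_clean)
slope_with_outlier = ols_slope(x_with_outlier, y_with_outlier)
print("slope without outlier:", round(slope_clean, 3))
print("slope with outlier:   ", round(slope_with_outlier, 3))
```

One case out of eleven is enough to pull the slope from 2 down to nearly zero in this example.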
19. Regression Outliers
- Strategy for dealing with outliers:
- 1. Identify them
  - Look at scatterplots for extreme values
  - Or, have computer software compute outlier diagnostic statistics
    - There are several statistics to identify cases that are affecting the regression slope a lot
    - Examples: "Leverage", Cook's D, DFBETA
  - Computer software can even identify "problematic" cases for you, but it is preferable to do it yourself.
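A hand-rolled sketch of the three diagnostics named above, for the bivariate case only. It assumes the usual textbook definitions: leverage h_i = 1/N + (X_i - X̄)^2 / Σ(X_i - X̄)^2, and Cook's D computed from the error, the leverage, and two estimated coefficients. DFBETA is reported here in unstandardized form, as the change in slope when a case is dropped. Statistical packages often report a standardized version instead. The data and function names are made up for illustration.

```python
def fit_line(x, y):
    """Return (a, b): intercept and slope of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def outlier_diagnostics(x, y):
    """Leverage, Cook's D, and unstandardized slope DFBETA for each case."""
    n = len(x)
    a, b = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e2 = sum(e ** 2 for e in errors) / (n - 2)
    rows = []
    for i in range(n):
        leverage = 1 / n + (x[i] - mean_x) ** 2 / sxx
        # Cook's D with p = 2 estimated coefficients (constant and slope).
        cooks_d = errors[i] ** 2 / (2 * s_e2) * leverage / (1 - leverage) ** 2
        # Refit without case i and see how far the slope moves.
        _, b_without = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        rows.append({"case": i, "leverage": leverage,
                     "cooks_d": cooks_d, "dfbeta_slope": b - b_without})
    return rows

# Made-up data: ten cases on the line Y = 2X plus one extreme case.
x = list(range(1, 11)) + [20]
y = [2 * xi for xi in range(1, 11)] + [0]
rows = outlier_diagnostics(x, y)
for row in rows:
    print(row)
most_influential = max(rows, key=lambda r: r["cooks_d"])["case"]
print("case with largest Cook's D:", most_influential)
```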
20. Regression Outliers
- 2. Depending on the circumstances, either:
- A) Drop cases from sample and re-do regression
  - Especially for coding errors, very extreme outliers
  - Or if there is a theoretical reason to drop cases
  - Example: In analysis of economic activity, communist countries differ a lot
- B) Or, sometimes it is reasonable to leave outliers in the analysis
  - e.g., if there are several that represent an important minority group in your data
- When writing papers, identify if outliers were excluded (and the effect that had on the analysis).