Title: Regression
1Regression
2Retrospective
- Week One
- Descriptive statistics
- Exploratory Data Analysis
- Week Two
- Probability
- Binomial Distribution
- Week Three
- Normal Distribution
- Interval Estimation, Hypothesis Testing, Decision
Theory
3Last Thursday and Previous Tuesday
- Bivariate Relationships
- Correlation and Analysis of Variance
4Outline
- A cognitive device to help understand the formulas for estimating the slope and the intercept, as well as the analysis of variance
- Table of Analysis of Variance (ANOVA) for regression
- F distribution for testing the significance of the regression, i.e. does the independent variable, x, significantly explain the dependent variable, y?
5Outline (Cont.)
- The Coefficient of Determination, $R^2$, and the Coefficient of Correlation, r.
- Estimate of the error variance, $\sigma^2$.
- Hypothesis tests on the slope, b.
6Part I A Cognitive Device
7A Cognitive Device The Conceptual Model
- (1) $y_i = a + b x_i + e_i$
- Take expectations, E
- (2) $E[y_i] = a + b E[x_i] + E[e_i]$, where
- assume (3) $E[e_i] = 0$
- Subtract (2) from (1) to obtain the model in deviations
- (4) $y_i - E[y_i] = b(x_i - E[x_i]) + e_i$
- Multiply (4) by $x_i - E[x_i]$ and take expectations
8A Cognitive Device (Cont.)
- (5) $E[(y_i - E[y_i])(x_i - E[x_i])] = b E[(x_i - E[x_i])^2] + E[e_i (x_i - E[x_i])]$, where assume
- $E[e_i (x_i - E[x_i])] = 0$, i.e. e and x are independent
- By definition, (6) $\mathrm{cov}(y, x) = b\,\mathrm{var}(x)$, i.e.
- (7) $b = \mathrm{cov}(y, x)/\mathrm{var}(x)$
- The corresponding empirical estimate, by the method of moments
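As a minimal sketch of the method-of-moments idea, assuming the empirical counterpart of (7) simply replaces cov(y, x) and var(x) with their sample analogues, the slope estimate can be computed as below. The data and the function name `moment_slope` are hypothetical and chosen only for illustration.

```python
# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]


def moment_slope(x, y):
    """Slope estimate: sample cov(y, x) divided by sample var(x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_yx = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    return cov_yx / var_x


b_hat = moment_slope(x, y)
print(b_hat)
```

Dividing both moments by n or by n - 1 gives the same ratio, so the choice of divisor does not affect the slope.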
9A Cognitive Device (Cont.)
- The empirical counterpart to (2)
- Square both sides of (4), and take expectations,
- (10) $E[(y_i - E[y_i])^2] = b^2 E[(x_i - E[x_i])^2] + 2b E[e_i (x_i - E[x_i])] + E[e_i^2]$
- Where (11) $E[e_i (x_i - E[x_i])] = 0$, i.e. the explanatory variable x and the error e are assumed to be independent, $\mathrm{cov}(e, x) = 0$
10A Cognitive Device (Cont.)
- From (10) by definition
- (11) $\mathrm{var}(y) = b^2\,\mathrm{var}(x) + \mathrm{var}(e)$; this is the partition of the total variance in y into the variance explained by x, $b^2\,\mathrm{var}(x)$, and the unexplained or error variance, $\mathrm{var}(e)$.
- The empirical counterpart to (11): the total sum of squares equals the explained sum of squares plus the unexplained sum of squares
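The sum-of-squares partition can be checked numerically. The sketch below fits a least-squares line to hypothetical data (not from the lecture) and compares the total sum of squares with the explained plus unexplained sums of squares; the variable names `tss`, `ess`, and `uss` are illustrative.

```python
# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (the fitted line passes through the means).
s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

fitted = [a_hat + b_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in fitted)           # explained sum of squares
uss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained sum of squares

print(tss, ess + uss)  # the two numbers agree up to rounding error
```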
11A Cognitive Device (Cont.)
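- From Eq. 7, substitute for b in Eq. 11
- $\mathrm{var}(y) = [\mathrm{cov}(y, x)]^2/\mathrm{var}(x) + \mathrm{var}(e)$
- Divide by var y: $1 = [\mathrm{cov}(y, x)]^2/[\mathrm{var}(y)\,\mathrm{var}(x)] + \mathrm{var}(e)/\mathrm{var}(y)$
- or $1 = r^2 + \mathrm{var}(e)/\mathrm{var}(y)$, where r is the correlation coefficient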
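A sample version of this identity can be verified directly. The sketch below uses hypothetical data and treats USS/TSS as the sample analogue of var e / var y, which is an assumption made for illustration.

```python
import math

# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
b_hat = s_xy / s_xx                 # fitted slope

# Residuals from the fitted line written in deviation form.
uss = sum((yi - y_bar - b_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))

identity = r ** 2 + uss / s_yy      # should equal 1 up to rounding error
print(identity)
```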
12Population Model and Sample Model Side by Side
13Conceptual Vs. Fitted Model
- Conceptual
- (1) $y_i = a + b x_i + e_i$
- Take expectations, E
- (2) $E[y] = a + b E[x] + E[e_i]$
- (3) Where $E[e_i] = 0$
- Subtract (2) from (1)
- (4) $y_i - E[y] = b(x_i - E[x]) + e_i$
14Conceptual Vs. Fitted (Cont.)
- Fitted
- First order condition
- Compare (3) and (vi)
- From (v) the fitted line goes through the sample
means
- Conceptual
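- Multiply (4) by $x_i - E[x]$ and take expectations, E
- $E[(y_i - E[y])(x_i - E[x])] = b E[(x_i - E[x])^2] + E[e_i (x_i - E[x])]$,
- (5) where $E[e_i (x_i - E[x])] = 0$
- (6) $\mathrm{cov}(y, x) = b\,\mathrm{var}(x)$
- (7) $b = \mathrm{cov}(y, x)/\mathrm{var}(x)$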
15Conceptual vs. Fitted (Cont.)
16Part II ANOVA in Regression
17ANOVA
- Testing the significance of the regression, i.e. does x significantly explain y?
- $F_{1,\,n-2} = \mathrm{EMS}/\mathrm{UMS}$
- Distributed with the F distribution with 1 degree of freedom in the numerator and n-2 degrees of freedom in the denominator
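A minimal sketch of the F ratio for a bivariate regression, assuming EMS is the explained sum of squares divided by 1 and UMS is the unexplained sum of squares divided by n - 2. The data are hypothetical, and the critical value would still have to be looked up separately.

```python
# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

ess = sum((a_hat + b_hat * xi - y_bar) ** 2 for xi in x)             # explained SS
uss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))    # unexplained SS

ems = ess / 1          # explained mean square, 1 degree of freedom
ums = uss / (n - 2)    # unexplained mean square, n - 2 degrees of freedom
f_stat = ems / ums
print(f_stat)
```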
18Table of Analysis of Variance (ANOVA)
$F_{1,\,n-2}$ = Explained Mean Square / Error Mean Square
19Example from Lab Four
- Linear Trend Model for UC Budget
21Time index: t = 0 for 1968-69, t = 1 for 1969-70, etc.; $\mathrm{UCBUD}(t) = a + bt + e(t)$
22Example from Lab Four
- Exponential trend model for UC Budget
- $\mathrm{UCBud}(t) = \exp[a + bt + e(t)]$
- taking the logarithms of both sides
- $\ln \mathrm{UCBud}(t) = a + bt + e(t)$
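The log transformation can be illustrated with a short sketch. It uses a noise-free synthetic series generated from hypothetical parameters (`a_true`, `b_true`), not the UC budget data, and fits a straight line to the logarithms by least squares.

```python
import math

# Hypothetical parameters and a noise-free synthetic series (not the UC budget data).
a_true, b_true = -0.9, 0.08
t = list(range(10))  # time index 0, 1, ..., 9
series = [math.exp(a_true + b_true * ti) for ti in t]

# Taking logarithms turns the exponential trend into a linear one.
log_series = [math.log(v) for v in series]

n = len(t)
t_bar = sum(t) / n
l_bar = sum(log_series) / n
b_hat = sum((li - l_bar) * (ti - t_bar) for ti, li in zip(t, log_series)) / sum(
    (ti - t_bar) ** 2 for ti in t
)
a_hat = l_bar - b_hat * t_bar

level_at_zero = math.exp(a_hat)      # fitted value of the series at t = 0
growth_rate = math.exp(b_hat) - 1.0  # implied proportional growth per period
print(a_hat, b_hat, level_at_zero, growth_rate)
```

Because the synthetic series has no error term, the fit recovers the assumed parameters up to rounding; with real data the estimates would differ from any true values.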
24$\ln \mathrm{UCBUD}(t) = a + bt + e(t)$
$\exp(-0.929) = 0.3949$
25$\ln \mathrm{UCBUD}(t) = a + bt + e(t)$
26Part III The F Distribution
27The F Distribution
- The density function of the F distribution
- $n_1$ and $n_2$ are the numerator and denominator degrees of freedom.
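For reference, the standard textbook form of the F density can be evaluated with the sketch below. The function name `f_density` is hypothetical, and the formula is the usual one written with gamma functions, assumed here to match the density shown on the slide.

```python
import math


def f_density(x, n1, n2):
    """Density of the F distribution with n1 (numerator) and n2 (denominator)
    degrees of freedom, evaluated at x. Returns 0 for x <= 0."""
    if x <= 0:
        return 0.0
    # Work in logs to avoid overflow in the gamma functions.
    log_const = (
        math.lgamma((n1 + n2) / 2)
        - math.lgamma(n1 / 2)
        - math.lgamma(n2 / 2)
        + (n1 / 2) * math.log(n1 / n2)
    )
    log_kernel = (n1 / 2 - 1) * math.log(x) - ((n1 + n2) / 2) * math.log(
        1 + n1 * x / n2
    )
    return math.exp(log_const + log_kernel)


print(f_density(1.0, 5, 10))
```

Changing `n1` and `n2` changes the shape of the curve, which is one way to explore the family of distributions the density generates.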
28The F Distribution
- This density function generates a rich family of
distributions, depending on the values of n1 and
n2
[Figure: F densities for $n_1 = 5, n_2 = 10$ and $n_1 = 50, n_2 = 10$; and for $n_1 = 5, n_2 = 10$ and $n_1 = 5, n_2 = 1$]
29Determining Values of F
- The values of the F variable can be found in the F table, Table 6(a) in Appendix B, for a type I error of 5%, or in Excel.
- The entries in the table are the values of the F variable for the right-hand tail probability (A), for which $P(F_{n_1,n_2} > F_A) = A$.
30Time index: t = 0 for 1968-69, t = 1 for 1969-70, etc.; $\mathrm{UCBUD}(t) = a + bt + e(t)$
$F_{1,36} = (n-2)R^2/(1 - R^2) = 36(0.934/0.066) = 509$
31$F_{1,36} = 4.13$, with 1 dof in the numerator and 36 dof in the denominator
32Part IV The Pearson Coefficient of Correlation, r
- The Pearson coefficient of correlation, r, is (13) $r = \mathrm{cov}(y, x)/\{[\mathrm{var}(x)]^{1/2}[\mathrm{var}(y)]^{1/2}\}$
- Estimated counterpart
- Comparing (13) to (7) note that (15) $r\,[\mathrm{var}(y)]^{1/2}/[\mathrm{var}(x)]^{1/2} = b$
33A Cognitive Device (Cont.)
- (5) $E[(y_i - E[y_i])(x_i - E[x_i])] = b E[(x_i - E[x_i])^2] + E[e_i (x_i - E[x_i])]$, where assume
- $E[e_i (x_i - E[x_i])] = 0$, i.e. e and x are independent
- By definition, (6) $\mathrm{cov}(y, x) = b\,\mathrm{var}(x)$, i.e.
- (7) $b = \mathrm{cov}(y, x)/\mathrm{var}(x)$
- The corresponding empirical estimate
34Part IV (Cont.) The Coefficient of Determination, $R^2$
- For a bivariate regression of y on a single explanatory variable, x, $R^2 = r^2$, i.e. the coefficient of determination equals the square of the Pearson coefficient of correlation
- Using (14) to square the estimate of r
35Part IV (Cont.)
- Using (8), (16) can be expressed as
- And so
- In general, including multivariate regression, the estimate of the coefficient of determination, $R^2$, can be calculated from (21) $R^2 = 1 - USS/TSS$.
36Part IV (Cont.)
- For the bivariate regression, the F-test can be calculated from
$F_{1,\,n-2} = [(n-2)/1]\,[(ESS/TSS)/(USS/TSS)]$
$F_{1,\,n-2} = [(n-2)/1]\,[ESS/USS] = (n-2)\,R^2/(1 - R^2)$
- For a multivariate regression with k explanatory variables, the F-test can be calculated as
$F_{k,\,n-k-1} = [(n-k-1)/k]\,[ESS/USS]$
$F_{k,\,n-k-1} = [(n-k-1)/k]\,[R^2/(1 - R^2)]$
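The $R^2$ form of the F-test is easy to wrap in a small helper. In the sketch below the function name `f_from_r2` is hypothetical; the example call uses the figures quoted for the UC budget trend regression ($R^2 = 0.934$ with 36 denominator degrees of freedom), with n = 38 inferred from n - 2 = 36.

```python
def f_from_r2(r2, n, k=1):
    """F statistic with k and n - k - 1 degrees of freedom, computed from R^2.

    r2: coefficient of determination
    n:  number of observations
    k:  number of explanatory variables (k = 1 for a bivariate regression)
    """
    return ((n - k - 1) / k) * r2 / (1.0 - r2)


# Figures quoted in the UC budget example: R^2 = 0.934, n - 2 = 36.
f_stat = f_from_r2(0.934, n=38, k=1)
print(round(f_stat))  # 509
```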
37Time index: t = 0 for 1968-69, t = 1 for 1969-70, etc.; $\mathrm{UCBUD}(t) = a + bt + e(t)$
$R^2 = 1 - USS/TSS = 1 - 2.0794/29.6019 = 0.93$
38Part V Estimate of the Error Variance
- $\mathrm{Var}(e_i) = \sigma^2$
- Estimate is unexplained mean square, UMS
- Standard error of the regression is
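A minimal sketch of these two estimates for a bivariate regression, assuming the unexplained mean square is USS/(n - 2) and the standard error of the regression is its square root. The data are hypothetical.

```python
import math

# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

b_hat = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a_hat = y_bar - b_hat * x_bar

# Unexplained (residual) sum of squares.
uss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))

ums = uss / (n - 2)    # estimate of the error variance
ser = math.sqrt(ums)   # standard error of the regression
print(ums, ser)
```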
39Time index: t = 0 for 1968-69, t = 1 for 1969-70, etc.; $\mathrm{UCBUD}(t) = a + bt + e(t)$
40Part VI Hypothesis Tests on the Slope
- Hypotheses, $H_0: b = 0$; $H_A: b > 0$
- Test statistic
- Set probability for the type I error, say 5%
- Note for bivariate regression, the square of the
t-statistic for the null that the slope is zero
is the F-statistic
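The relationship between the t-statistic and the F-statistic can be checked numerically. The sketch below assumes the usual t-ratio for the null that the slope is zero, the slope estimate divided by its standard error, with the standard error taken as the square root of UMS divided by the sum of squared deviations of x. The data are hypothetical.

```python
import math

# Hypothetical data for illustration only (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

uss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
ums = uss / (n - 2)

se_b = math.sqrt(ums / s_xx)   # standard error of the slope estimate
t_stat = b_hat / se_b          # t-statistic for H0: b = 0

ess = b_hat ** 2 * s_xx        # explained sum of squares
f_stat = ess / ums             # F-statistic with 1 and n - 2 degrees of freedom

print(t_stat ** 2, f_stat)     # the two numbers agree up to rounding error
```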
41Time index: t = 0 for 1968-69, t = 1 for 1969-70, etc.; $\mathrm{UCBUD}(t) = a + bt + e(t)$
$F_{1,36} = t^2$: $511 \approx 22.6 \times 22.6$
42Part VII Student's t-Distribution
43The Student t Distribution
- The Student t density function
- n is the parameter of the Student t distribution
- $E(t) = 0$; $V(t) = n/(n - 2)$ (for n > 2)
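For reference, the standard textbook form of the Student t density and the variance formula above can be evaluated as in the sketch below. The function names `t_density` and `t_variance` are hypothetical, and the density is the usual gamma-function expression, assumed here to match the one shown on the slide.

```python
import math


def t_density(t, n):
    """Density of the Student t distribution with n degrees of freedom at t."""
    log_const = (
        math.lgamma((n + 1) / 2)
        - math.lgamma(n / 2)
        - 0.5 * math.log(n * math.pi)
    )
    return math.exp(log_const - ((n + 1) / 2) * math.log(1 + t * t / n))


def t_variance(n):
    """Variance n / (n - 2), defined only for n > 2."""
    if n <= 2:
        raise ValueError("variance requires n > 2")
    return n / (n - 2)


print(t_density(0.0, 3), t_density(0.0, 10))
print(t_variance(3), t_variance(10))
```

The variance falls toward 1 as n grows, which is consistent with the t distribution approaching the standard normal for large n.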
44The Student t Distribution
[Figure: Student t densities for n = 3 and n = 10]
45Determining Student t Values
- The Student t distribution is used extensively in statistical inference.
- Thus, it is important to determine values of $t_A$ associated with a given number of degrees of freedom.
- We can do this using
- t tables, Table 4 Appendix B
- Excel
46Using the t Table
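- The table provides the t values ($t_A$) for which $P(t_n > t_A) = A$
- The t distribution is symmetrical around 0
[Figure: t density with tail area A to the right of $t_A$; marked values -1.812 and 1.812. Table columns: $t_{.100}$, $t_{.05}$, $t_{.025}$, $t_{.01}$, $t_{.005}$]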
48Problem 6.32 in Text: Table of Joint Probabilities
49Problem 6.32
- The method of instruction in college and
university applied statistics courses is
changing. Historically, most courses were taught
with an emphasis on manual calculation. The
alternative is to employ a computer and a
software package to perform the calculations. An
analysis of applied statistics courses
investigated whether the instructor's
educational background is primarily mathematics
(or statistics) or some other field.
50Problem 6.32
- A. What is the probability that a randomly selected applied statistics course instructor whose education was in statistics emphasizes manual calculations?
- What proportion of applied statistics courses employ a computer and software?
- Are the educational background of the instructor and the way his or her course is taught independent?
51Midterm 2000
- (15 points) The following table shows the results of regressing the natural logarithm of California General Fund expenditures, in billions of nominal dollars, against year beginning in 1968 and ending in 2000. A plot of actual, estimated and residual values follows.
- How much of the variance in the dependent variable is explained by trend?
- What is the meaning of the F statistic in the table? Is it significant?
- Interpret the estimated slope.
- If General Fund expenditures were 68.819 billion in California for fiscal year 2000-2001, provide a point estimate for state expenditures for 2001-2002.
52- Cont.
- A state senator believes that state expenditures in nominal dollars have grown over time at 7% a year. Is the senator in the ballpark, or is his impression significantly below the estimated rate, using a 5% level of significance?
- If you were an aide to the Senator, how might you criticize this regression?