Title: Lecture 10: Regression with one X variable
1Lecture 10: Regression with one X variable
- Straight line (linear model) for predicting one variable from another
- Predicted Y = constant + slope × X
- y = c + bx
- Uses method of least squares to choose best line
- R squared measures accuracy of prediction
- Slope tells you
2Method of least squares
- Take any values of c and b (constant and slope)
- Work out prediction for each x in data
- Work out error of prediction
- Work out mean square error (MSE)
- Choose c and b (constant and slope) to make MSE as low as possible
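The steps above can be sketched in Python. This is only an illustration, not the contents of pred1var.xls: the six data points are made up, and a plain gradient descent loop stands in for Excel Solver. The names `predict`, `mse` and `fit_by_descent`, the step size and the number of iterations are all assumptions chosen for this example.

```python
# Hypothetical data: six made-up (x, y) pairs, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]


def predict(c, b, x):
    """Prediction from the straight line y = c + b*x."""
    return c + b * x


def mse(c, b, xs, ys):
    """Mean square error of the line (c, b) over the data."""
    errors = [y - predict(c, b, x) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def fit_by_descent(xs, ys, step=0.01, iterations=20000):
    """Nudge c and b repeatedly in the direction that lowers the MSE.

    Plain gradient descent, used here as a stand-in for Solver.
    """
    c, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [y - predict(c, b, x) for x, y in zip(xs, ys)]
        grad_c = -2 * sum(errors) / n
        grad_b = -2 * sum(e * x for e, x in zip(errors, xs)) / n
        c -= step * grad_c
        b -= step * grad_b
    return c, b


c_best, b_best = fit_by_descent(xs, ys)
print("constant:", round(c_best, 3), "slope:", round(b_best, 3))
print("MSE of fitted line:", round(mse(c_best, b_best, xs, ys), 4))
print("MSE of a guessed line (c=0, b=1.5):", round(mse(0, 1.5, xs, ys), 4))
```

Any guessed pair (c, b) can be passed to `mse` to check that it does no better than the fitted line.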
3How to apply the method of least squares
- Use Excel Solver, as in spreadsheet pred1var.xls
- Advantage is that it shows you what's going on (and it's more flexible)
- Use formulae derived from calculus: the traditional method
- The formulae are not helpful for understanding what's going on, so I will not be covering them
- Easiest to use software with the formulae built in, eg Excel: Tools > Data Analysis > Regression
- If this isn't on the menu, use Tools > Add-Ins
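For comparison, the formula route can be sketched in a few lines of Python. The formulae used are the standard least squares results (slope = sum of products of deviations divided by sum of squared x deviations; constant = mean of y minus slope times mean of x). They are not taken from this lecture, which does not cover them, and the data and the function name `least_squares_line` are made up for the example.

```python
def least_squares_line(xs, ys):
    """Return (constant, slope) of the least squares line y = c + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope


# Hypothetical data: six made-up (x, y) pairs.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

constant, slope = least_squares_line(xs, ys)
prediction_at_7 = constant + slope * 7  # use the model to predict y when x = 7
print("constant:", round(constant, 3), "slope:", round(slope, 3))
print("prediction for x = 7:", round(prediction_at_7, 3))
```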
4An example: think of a story
- With six people / organisations / etc
- And two numerical variables which may be related in an interesting way
- Get the data, or make it up ...
- Do a regression analysis to predict one variable from the other using pred1var.xls, and then the Regression Tool in Excel
- Use the model to make a prediction
- Now try doing it the other way round
- Make sure you understand
5Regression terminology
- See table in the Word handout
6Slope / regression coefficient / x coefficient
- Interpretation obvious and important
- The slope tells you
- A negative slope means
7R squared: easy version
- R squared is the square of the correlation coefficient
- Often used as a measure of how good the model is
- R squared = 1 if correl = 1 or -1: model very good
- R squared = 0 if correl = 0: model very bad
- R squared = 0.5 means the model is half way between good and bad
- R squared = 0.9 means it's good but not perfect
- Etc
8R squared: more detail
- Model based on a variable with zero correlation with the dependent variable would be completely useless
- Best prediction here is the mean
- MSE = variance = square of sd (see Pred1var.xls)
- Model based on straight line relationship is the best possible
- Correlation = 1 or -1
- MSE = 0
- A reasonable measure of the model is the reduction in MSE from the worst model (with MSE = variance)
- Ie the proportional reduction in MSE
- This turns out to be the same as R squared
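The claim in the last three bullets can be checked numerically. The sketch below uses made-up data and assumes the population forms (dividing by n) for the variance and MSE. It works out the proportional reduction in MSE and the square of the correlation separately, and the two agree.

```python
import math

# Hypothetical data: six made-up (x, y) pairs with some scatter.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 2.0, 6.0, 5.0, 9.0, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Worst model: always predict the mean of y, so MSE = variance of y.
mse_worst = syy / n

# Least squares line and its MSE.
slope = sxy / sxx
constant = mean_y - slope * mean_x
mse_line = sum((y - (constant + slope * x)) ** 2 for x, y in zip(xs, ys)) / n

proportional_reduction = (mse_worst - mse_line) / mse_worst
correl = sxy / math.sqrt(sxx * syy)

print("proportional reduction in MSE:", round(proportional_reduction, 4))
print("correlation squared:", round(correl ** 2, 4))
```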
9But don't forget the sample
- Even a model with R squared = 0.9 or 1 may not be as good as it seems if the sample size is small
- This is a separate issue which is not assessed by R squared
- See work on hypothesis (significance) tests and confidence intervals
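An extreme case shows why a high R squared from a small sample proves little. Any two points with different x values and different y values lie exactly on a straight line, so R squared is 1 whatever the numbers are. The sketch below is an added illustration with made-up values; `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Hypothetical sample of only two cases: the "model" fits perfectly
# even though two points tell us almost nothing.
two_point_fit = r_squared([1, 2], [10, 3])
print("R squared from two points:", two_point_fit)
```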
10Edited output from Excel Tool Regression for job
satisfaction data
11Note that
- This output is edited: either read a book on mathematical statistics, or ignore the rest of the output
- Eg we have ignored the t stat. This is just used to calculate p values and confidence intervals
- (I have left the term standard error, although I won't be explaining it in detail. It's a term used for the standard deviation of something when you are using it as a measure of error.)
- In practice you would always want a larger sample! But this illustrates the principle.
12Multiple regression
- Prediction model (linear) using several variables
- Pred Y = const + slope1 × X1 + ... + slopen × Xn
- y = c + b1x1 + ... + bnxn
- Uses method of least squares to choose best line
- R squared (coefficient of determination) measures goodness of fit of model to data
- Slopes tell you impact of each variable on dependent variable
13Mostly same as with single variable regression
- Least squares
- Predmvar.xls or Excel Tool (need independent X variables in a block so you can select them all)
- R squared
- Slope for each variable
- Predicted increase in dependent variable if variable is increased by one without changing other variables
- Category variables represented by 1/0, eg sexn
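The same least squares idea with two X variables can be sketched as follows. This is an illustration under stated assumptions, not the method inside Predmvar.xls or the Excel tool: it solves the standard normal equations with a small hand-written Gaussian elimination. The data are made up, with the second variable a 1/0 category code, and they are constructed to follow y = 20 + 2*x1 + 3*x2 exactly so that the result can be checked. `solve` and `fit_multiple` are hypothetical helper names.

```python
def solve(a, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (m[r][n] - known) / m[r][r]
    return solution


def fit_multiple(rows, ys):
    """Least squares for y = c + b1*x1 + ... + bn*xn.

    Returns [c, b1, ..., bn] by solving the normal equations.
    """
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the constant
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: (x1, x2) where x2 is a category coded 1/0.
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]
ys = [22, 27, 26, 31, 33, 32]  # built from y = 20 + 2*x1 + 3*x2

c, b1, b2 = fit_multiple(rows, ys)
print("constant:", round(c, 6), "slope1:", round(b1, 6), "slope2:", round(b2, 6))
```

Here slope2 is the predicted difference between the two categories when x1 is held fixed.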
14Problems with regression
- Model may not be reasonable (eg infant mortality and GNP)
- Sample too small: coefficients unreliable (check confidence intervals)
- Have you got the right variables?
- Highly correlated variables can give misleading results
- Too many variables
- See reading for more detail
15Uses of regression
- Very widely used in research (over-used?)
- Examples
16Predicting returns from shares
- Dissanaike (1999) produced a regression model to predict the return which investors would receive from investing in a particular security for a period of four years, from the return they would have received if they had invested in the same security in the previous four years. The data on which the model was based were the returns for a sample of large companies over consecutive periods of four years.
- The regression coefficient cited was -0.112, and the value of R squared was 0.0413.
- Suppose you were considering investing in one of two shares, A or B. A has produced a return over the last four years of -5, and B has produced 5. Use the regression model to predict which share is likely to produce the better returns over the next four years, and by how much. How sure would you be?
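One possible way to approach the exercise above, sketched in Python. The constant of the model is not given, so only the difference between the two predictions can be worked out (the constant cancels). The result is in whatever units the past returns of -5 and 5 are measured in, which is an assumption here. The variable names are made up for the example.

```python
import math

slope = -0.112      # regression coefficient cited above
r_squared = 0.0413  # R squared cited above

past_return_a = -5
past_return_b = 5

# Predicted return = constant + slope * past return, so the difference
# between the two predictions does not depend on the unknown constant.
predicted_gap_a_minus_b = slope * (past_return_a - past_return_b)

# The correlation has the same sign as the slope.
correl = -math.sqrt(r_squared)

print("predicted advantage of A over B:", round(predicted_gap_a_minus_b, 2))
print("correlation:", round(correl, 3))
```

The negative slope favours the share that did worse in the past, but with R squared this close to 0 the model reduces the MSE by only about 4%, so the prediction should be treated with very little confidence.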
19Further statistics
20Mathematical notation
- You may need to be familiar with some mathematical notation for more advanced work (this will not be required in the exam)
- Sigma (summation) notation
- Pi (product) notation
- Use of a bar above a symbol for mean (average)
- Subscripts RJK etc
- Standard symbols: n for sample size, t for time, etc
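For readers who find code easier to follow than symbols, the notation above corresponds roughly to the following Python. The list `x` and its values are made up, and the mapping from subscripts to list positions is just one convention.

```python
import math

# Hypothetical values x_1, x_2, x_3 (subscripts pick out individual items).
x = [2, 3, 4]

n = len(x)               # n: sample size
sigma_x = sum(x)         # Sigma notation: x_1 + x_2 + ... + x_n
pi_x = math.prod(x)      # Pi notation: x_1 * x_2 * ... * x_n
x_bar = sigma_x / n      # bar above x: the mean of the x values
first = x[0]             # x_1 (Python counts positions from 0)

print("sum:", sigma_x, "product:", pi_x, "mean:", x_bar)
```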
21Covariances and variances
- The attached handout explains what these are and the relationships between them
- You may need this to follow some mathematical work in finance
- It will not be directly assessed in the exam (although it may improve your answers)
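The handout itself is not reproduced here, so the sketch below only illustrates some standard relationships, which may or may not be the ones it covers: the variance of a variable is its covariance with itself, the correlation is the covariance divided by the product of the two sds, and var(x + y) = var(x) + var(y) + 2 cov(x, y). It assumes the population forms (dividing by n), and the two series are made-up figures.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)


# Hypothetical returns on two shares over five periods.
a = [2.0, -1.0, 3.0, 0.5, 1.5]
b = [1.0, 0.0, 2.5, -0.5, 2.0]

cov_ab = covariance(a, b)
correl_ab = cov_ab / math.sqrt(variance(a) * variance(b))

# Variance of the combined series a + b, computed two ways.
combined = [x + y for x, y in zip(a, b)]
var_direct = variance(combined)
var_from_parts = variance(a) + variance(b) + 2 * cov_ab

print("covariance:", round(cov_ab, 4), "correlation:", round(correl_ab, 4))
print("var(a + b):", round(var_direct, 4), "=", round(var_from_parts, 4))
```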
22Formulae, computers and understanding (1)
- You can usually get the answer (eg sd, correl, regression coefficient) either with a computer or using the formula / method
- Computer is quicker and more accurate, but you may not understand what the answer means or how to use it. This can be serious!
23Formulae, computers and understanding (2)
- Sometimes the formula / method will help you understand what the answer means
- Eg percentiles, Kendall correlation coefficients
- Then it's a good idea to do simple examples with formula/method to help you understand, then use a computer
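As an example of a method that is easy to follow by hand, here is a sketch of a Kendall correlation coefficient worked out by counting pairs. It assumes the simplest version (sometimes called tau-a), which makes no adjustment for ties. The data and the function name `kendall_tau` are made up.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall correlation by counting pairs (no adjustment for ties).

    A pair of cases is concordant if the one with the bigger x also has
    the bigger y, and discordant if it has the smaller y.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: four cases, with one pair out of order.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print("Kendall correlation:", round(tau, 3))
```

With four cases there are six pairs. Five are in the same order on both variables and one is reversed, so the coefficient is (5 - 1) / 6.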
24Formulae, computers and understanding (3)
- Sometimes the formula / method will not help you understand what the answer means
- Eg formulae for a regression coefficient, and normal distribution
- Here you need much more mathematical background to understand properly (especially the normal distribution)
- Then it's a good idea to
- Try to find an alternative approach which is easier to follow (regression), or
- Concentrate on understanding the formula/method in intuitive terms (normal distribution)
25What do I need to understand?
- What the answer means, how it relates to the inputs, assumptions made, and how it can be used
- How to work it out with a computer (although in an exam you will not have a computer and will not be expected to remember details of computer menus, etc)
- In some cases, how to estimate a rough answer
- For easy methods only, how to work it out without a computer