Title: Quantitative Analysis
Slide 1: Quantitative Analysis
Slide 2: Themes
- These lectures will deal with regression analysis.
- Regression estimates a relationship between a dependent variable and one or more explanatory (independent) variables.
  - E.g. between consumption and price.
  - E.g. between sales, price and income.
- We can ask whether the relationship is statistically significant.
- We can estimate the strength of the relationship.
- We can estimate the impact of each explanatory variable on the dependent variable.
- We can use the relationship to make forecasts.
Slide 3: Reference
- Refer to Hildebrand and Ott:
  - Linear Regression and Correlation Methods.
  - Multiple Regression Methods.
  - Constructing a Multiple Regression Model.
- Almost any intermediate business statistics text will have equivalent chapters or sections that would be useful.
Slide 4: A starting point
- Data
  - Consider real GDP and employment.
  - Is there some sort of connection?
  - Probably, but which is the dependent variable?
  - Is real GDP a measure of economic activity that determines the demand for labour?
  - Or is it the level of employment that determines output and, therefore, real GDP?
  - So it's a good question for Macroeconomic Principles!
- Look at the data.
  - Because real GDP is measured in $ billions and employment is measured in millions, it is better to use index numbers for both data sets.
Slide 5: [Chart] It is fairly clear that the two data sets tend to move in the same way over a 20-year period.
Slide 6: Scattergrams
- Plot one variable on the horizontal axis.
  - E.g. RGDP.
- Plot the other variable on the vertical axis.
  - E.g. employment.
- For each observation, plot a point on the graph.
- If the points form something close to a straight line, we have a strong linear relationship between the variables.
Slide 7: Scattergram [figure]
Slide 8: Scattergram [figure]
Slide 9: Simple linear regression
- Model: Y = β0 + β1 X + ε.
- Y = dependent variable.
  - I.e. the variable we are trying to explain or predict.
  - E.g. sales.
- X = independent variable.
  - I.e. the variable we are using to explain or predict Y.
  - E.g. advertising expenditure.
- ε = error (random with mean 0).
  - I.e. the net effect of all variables other than X that influence Y.
  - E.g. weather, prices, incomes and many others.
  - Later we will see how we can bring the more important of these into the model.
- β0 = constant or intercept.
- β1 = coefficient or slope.
- It is like a formula for calculating Y.
  - Actually, for estimating the average value of Y.
  - Since ε is random, we cannot use it in the formula.
  - We don't know what it will be in any instance.
Slide 10: [Figure: the line Y = β0 + β1 X, with Y on the vertical axis and X on the horizontal axis]
Slide 11: Estimation
- Finding β0 and β1.
  - We need data for X and Y.
  - We then require the values of β0 and β1 that make the equation the line of best fit.
  - I.e. the line that is as close to as much of the data on the scattergram as possible.
  - The method most often used is called ordinary least squares (OLS).
- Squares?
  - Whenever we use the equation to predict Y, there will always be an error (e) because we do not know ε.
  - Error: e = actual Y − predicted Y.
  - Some will be positive, some will be negative.
  - We square these errors so that they are all positive.
  - We add them to get a sum of squared errors, or ESS.
  - We choose β0 and β1 to minimise this sum.
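The idea above can be sketched in a few lines of Python: for any candidate line we can compute the ESS, and the OLS line is the one that makes it smallest. The data below are made up purely for illustration.

```python
# Candidate lines compared by their sum of squared errors (ESS).
# The x/y values are illustrative, not from the lectures.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def ess(b0, b1):
    """Sum of squared errors e = actual Y - predicted Y for a given line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# OLS solution via the standard closed-form formulas.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1_ols = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
b0_ols = ybar - b1_ols * xbar

print(ess(0.5, 1.5))        # ESS for an arbitrary guess
print(ess(b0_ols, b1_ols))  # ESS for the OLS line: never larger
```

Any other intercept/slope pair gives an ESS at least as large as the OLS pair, which is exactly what "line of best fit" means here.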
Slide 12: [Figure: regression line Y = β0 + β1 X on axes Y vs X]
Slide 13: [Figure: an observation and the regression line]
Slide 14: [Figure: an observation and the regression line, showing the error]
Slide 15: [Figure: some squared errors]
Slide 16: Estimation (cont.)
- Approach: line of best fit.
  - Changing the constant (β0) and coefficient (β1) changes the squares.
  - It changes the total of the squares.
  - We seek the constant and coefficient that minimise the total area.
  - The following slide shows how the squared errors might change.
Slide 17: [Figure: squared errors. Moving the regression line makes some squares bigger and others smaller.]
Slide 18: Formulas
- We have to minimise the ESS.
- To understand this we need differential calculus.
  - If you know how to use it, it's easy.
- ESS = Σ(Y − β0 − β1 X)².
- Differentiate ESS with respect to β0 and set the derivative equal to 0.
- Differentiate ESS with respect to β1 and set the derivative equal to 0.
- Solve the simultaneous equations for β0 and β1.
- Sounds complicated, but if differential calculus is a mystery to you, you do not have to learn it!
- The results?
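The results the slide alludes to are the standard closed-form OLS solutions (a sketch; the bar notation for sample means is mine):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\,\bar{X}
```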
Slide 19: Exercise
- A problem from Selvanathan.
  - Problem 10.
  - Twelve secretaries at the University of Queensland were asked to take a three-day intensive course to improve their keyboard skills. At the beginning and the end of the course, they were given a particular two-page letter and asked to type it flawlessly.
- The next slide shows the data and the one after that the relevant SAS output.
Slide 20: Exercise (cont.)
- Data

  Typist   Experience (years)   Improvement (wpm)
  A         2                    9
  B         6                   11
  C         3                    8
  D         8                   12
  E        10                   14
  F         5                    9
  G        10                   14
  H        11                   13
  I        12                   14
  J         9                   10
  K         8                    9
  L        10                   10
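As a check, the estimates reported in the SAS output can be reproduced from this data with the standard closed-form OLS formulas (a minimal Python sketch, not part of the original SAS workflow):

```python
# Reproduce the OLS slope and intercept for the typist data by hand.
exper = [2, 6, 3, 8, 10, 5, 10, 11, 12, 9, 8, 10]
imp   = [9, 11, 8, 12, 14, 9, 14, 13, 14, 10, 9, 10]

n = len(exper)
xbar = sum(exper) / n
ybar = sum(imp) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(exper, imp))
sxx = sum((x - xbar) ** 2 for x in exper)

b1 = sxy / sxx          # slope
b0 = ybar - b1 * xbar   # intercept

print(round(b0, 5), round(b1, 5))  # 6.86269 0.53881 -- matches SAS
```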
Slide 21: Scattergram
[Figure] Typists with more experience seem to have larger improvements.
Slide 22: SAS output

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1   6.86269              1.19323          5.75      0.0002
  exper        1   0.53881              0.14194          3.80      0.0035

Regression equation: IMP = 6.863 + 0.539 EXP
Slide 23: Exercise (cont.)
- Interpretation: what does it mean?
- IMP = 6.863 + 0.539 EXP.
- IMP is the predicted value of the dependent variable (improvement) for different values of the independent variable (experience).
- β0 = 6.863.
  - The average improvement after the course of a keyboard operator with no experience (EXP = 0) is 6.863 wpm.
  - This may be a little dangerous because we have no data on typists with little or no experience.
- β1 = 0.539.
  - The average improvement after the course of a keyboard operator rises by 0.539 wpm per additional year of experience.
  - I.e. we would expect the improvement of the average typist with 11 years' experience to be 0.539 wpm more than that of the average typist with 10 years of experience.
Slide 24: Scattergram
[Figure: the data with the regression line.]
Slide 25: Errors
- Large or small?
  - Ideally, we want the errors to be small.
- Looking at the scattergram we can see
  - large and small errors;
  - positive and negative errors.
- The average error?
  - Not a good idea, because the sum of the errors is always 0.
- Standard error of estimate.
  - We average the squared errors instead.
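The standard error of estimate for the typist data can be sketched as follows, using the usual formula s = √(SSE / (n − 2)), which divides by n − 2 rather than n because two parameters were estimated (the divisor is a standard convention, not stated on the slide):

```python
# Standard error of estimate for the typist regression.
import math

exper = [2, 6, 3, 8, 10, 5, 10, 11, 12, 9, 8, 10]
imp   = [9, 11, 8, 12, 14, 9, 14, 13, 14, 10, 9, 10]
b0, b1 = 6.86269, 0.53881   # OLS estimates from the SAS output

# Sum of squared errors around the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(exper, imp))

n = len(exper)
s = math.sqrt(sse / (n - 2))   # n - 2: two estimated parameters
print(round(s, 5))             # matches Root MSE 1.49995 in the SAS output
```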
Slide 26: Scattergram
[Figure: positive errors (under-estimates) above the line, negative errors (over-estimates) below it.]
Slide 27: SAS output

  Root MSE         1.49995    R-Square    0.5903
  Dependent Mean  11.08333    Adj R-Sq    0.5493
  Coeff Var       13.53339
Slide 28: Significance
- Does Y really depend on X?
- Remember that we have a small sample and are trying to estimate a relationship between variables in a target population.
- Consider Y = β0 + β1 X + ε.
  - If X changes, Y must change.
  - If X increases by 1 unit, Y increases by β1 units.
  - Unless, of course, β1 = 0.
  - Then Y doesn't depend on X.
- This is how our test works.
  - H0: β1 = 0. (Y does not depend on X.)
  - HA: β1 ≠ 0. (Y does depend on X.)
Slide 29: Significance (cont.)
- Test statistic.
  - We use Student's t distribution (again).
- Degrees of freedom.
  - DF = number of observations − number of variables.
  - Trust the mathematicians on this.
Slide 30: Significance (cont.)
- t tests.
  - These tests work in exactly the same way as tests of hypotheses concerning mean values.
  - Large t scores lead us to reject the null hypothesis.
  - The same critical values apply.
- The modern approach considers the sig or p values.
  - Reject the null hypothesis if sig or p < 0.05 (or some other reasonable level of significance).
  - p = Pr(β1 = 0) given the sample data.
  - Or p = Pr(X does not explain Y) given the sample data.
- We can perform one-sided tests.
  - H0: β1 = 0.
  - HA: β1 > 0 (positive relationship).
  - HA: β1 < 0 (negative relationship).
  - Divide the p value by 2.
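The t score for the slope is simply the estimate divided by its standard error; a quick sketch using only the figures quoted on the slides:

```python
# t statistic for the typist regression's slope.
b1, se_b1 = 0.53881, 0.14194   # estimate and standard error from SAS
t = b1 / se_b1
print(round(t, 2))   # 3.8 -- SAS reports 3.80 with p = 0.0035
```

Since 3.80 is large (well beyond the usual critical values for 10 degrees of freedom), we reject H0 and conclude that improvement does depend on experience.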
Slide 31: SAS output

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1   6.86269              1.19323          5.75      0.0002
  exper        1   0.53881              0.14194          3.80      0.0035
Slide 32: Coefficient of determination
- How much of the variation in Y is explained by X?
- The coefficient is called R².
  - If the regression line is a perfect fit, R² = 1.
  - If the regression bears no relationship to the data, R² = 0.
  - The regression line would be horizontal.
  - I.e. as X changes, Y doesn't.
Slide 33: Scattergram [figure]
Slide 34: Scattergram [figure]
Slide 35: R² (cont.)
- Definition.
  - R² = ratio of explained variation to total variation.
  - Recall that some variation will be positive and some negative.
  - We have to square and then add.
Slide 36: R² (cont.)
- Adjusted R².
  - If we have small data sets, it is likely that R² will be quite large.
  - If there are only two observations, R² = 1 no matter how unlikely the relationship.
  - E.g. X = maximum daily temperature in Melbourne and Y = daily sales of snakeskin boots in New York.
  - With 2 observations, R² = 1!
- We have an adjusted R² that takes sample size into account.
  - SAS calculates it.
  - This is the one to use if we want to compare models.
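The adjustment SAS applies is, by the usual convention (the formula is not stated on the slides, so treat it as an assumption), 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables:

```python
# The conventional adjusted R-squared, checked against the typist output.
r2, n, k = 0.5903, 12, 1   # R-squared, observations, explanatory variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))    # matches Adj R-Sq 0.5493
```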
Slide 37: SAS output

  Root MSE         1.49995    R-Square    0.5903
  Dependent Mean  11.08333    Adj R-Sq    0.5493
  Coeff Var       13.53339
Slide 38: Forecasting
- Using the equation.
  - We have IMP = 6.863 + 0.539 EXP.
  - We can substitute values of EXP to forecast IMP.
  - E.g. EXP = 5 ⇒ IMP = 6.863 + 0.539 × 5 = 9.56.
- This is a point estimate (not a confidence interval).
- It can be thought about in two ways.
  - The improvement of a particular typist who has 5 years' experience?
  - The average improvement of all typists who have 5 years' experience?
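A point forecast is just substitution into the fitted equation; a tiny sketch (the function name is mine, purely for illustration):

```python
# Point forecast from the fitted typist equation.
def forecast_imp(exp_years):
    """Predicted improvement (wpm) from IMP = 6.863 + 0.539 EXP."""
    return 6.863 + 0.539 * exp_years

print(round(forecast_imp(5), 2))   # 9.56
```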
Slide 39: Forecasting (cont.)
- Reasonable approximations.
  - The exact formulas are shockers!
  - Provided we have reasonably large data sets we can make approximations.
  - t ≈ 2.
  - Only the first term in the square root matters much.
  - The others are relatively small.
- Formulas.
  - For particular values: predicted Y ± 2 sε.
  - For mean values: predicted Y ± 2 sε/√n.
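A sketch of these rough intervals for the typist forecast at EXP = 5, plugging in the Root MSE and sample size from the slides (these are the slide's approximations, not the exact interval formulas):

```python
# Approximate forecast intervals using the slide's rules of thumb.
import math

y_hat = 9.56          # point forecast at EXP = 5
s, n = 1.49995, 12    # Root MSE and sample size from the SAS output

# For a particular typist: y_hat +/- 2s.
lo_ind, hi_ind = y_hat - 2 * s, y_hat + 2 * s
# For the average of all such typists: y_hat +/- 2s/sqrt(n).
half = 2 * s / math.sqrt(n)
lo_mean, hi_mean = y_hat - half, y_hat + half

print(round(lo_ind, 2), round(hi_ind, 2))    # roughly 6.56 to 12.56
print(round(lo_mean, 2), round(hi_mean, 2))  # a much tighter interval
```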
Slide 40: Causality
- Be careful.
  - Finding a significant and strong regression equation between Y and X does not establish causality.
  - It establishes an association.
  - The variables move in related ways.
- E.g. we could expect to see a significant and positive regression between
  - the number of murders per annum in the UK and
  - membership of the Church of England.
- Causality seems doubtful.
- Causal factor?
  - Almost certainly, population growth.
Slide 41: Multiple regression
- More general models.
  - Few interesting problems contain only two variables.
- We cannot produce scattergrams.
  - We cannot draw regression lines.
  - It is hard in 3 dimensions.
  - It is impossible in more than 3 dimensions.
- Fortunately the maths still works.
  - Solutions by hand are just about impossible.
  - SAS can do it at nearly the speed of light!
Slide 42: Multiple (cont.)
- Model: Y = β0 + β1 X1 + β2 X2 + … + βk Xk + ε.
- Y = dependent variable.
  - E.g. sales turnover.
- k independent variables.
  - E.g. X1 = size of local market.
  - E.g. X2 = average household income in local market.
  - E.g. X3 = number of competitors in local market.
- ε = error (random with mean 0).
  - I.e. the net effect of all variables other than X1, X2 and X3 that influence Y.
- β0 = constant or intercept.
- βj = coefficient or slope for variable Xj.
  - I.e. the average increase in Y when Xj increases by 1 unit, ceteris paribus (meaning the other variables don't change).
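The lectures use SAS for this, but the same least-squares fit can be sketched in a few lines of Python with numpy; the observations below are made up purely for illustration.

```python
# Minimal multiple-regression fit by least squares with numpy.
import numpy as np

# Hypothetical data: market size, household income, number of competitors.
X = np.array([
    [10.0, 45.0, 3.0],
    [12.0, 50.0, 2.0],
    [ 8.0, 40.0, 5.0],
    [15.0, 55.0, 1.0],
    [11.0, 48.0, 4.0],
    [ 9.0, 42.0, 6.0],
])
y = np.array([120.0, 150.0, 90.0, 200.0, 130.0, 95.0])  # sales turnover

# Prepend a column of ones so the intercept beta_0 is estimated too.
X1 = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)   # [b0, b1, b2, b3]
```

A defining property of the least-squares solution is that the residuals are orthogonal to every column of the design matrix, which makes a handy sanity check.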
Slide 43: Multiple (cont.)
- Example.
  - Aspinwall (1970), in the Southern Economic Journal, wrote an article entitled "Market Structure and Commercial Mortgage Interest Rates".
  - A market was defined as a standard metropolitan statistical area (SMSA).
  - Aspinwall tested the hypothesis that average mortgage interest rates in SMSAs depend on the amount of a loan relative to the value of the property and monopolization within each SMSA.
- A priori we would expect
  - high interest rates to be associated with higher borrowing ratios;
  - high interest rates to be associated with greater monopolization.
Slide 44: Multiple (cont.)
- Variables
  - INTEREST = average mortgage rate in an SMSA.
  - COVERAGE = average loan/price (of home) in an SMSA.
  - CONCENT = concentration ratio in an SMSA.
  - LENDERS = number of lending institutions in an SMSA.
  - The concentration ratio is the proportion of the market in the hands of the largest 10 businesses.
- Results.
  - The data are limited (31 observations).
  - The findings were not quite what was expected.
  - Textbook results do not always occur in real life.
  - The output here is generated by SAS.
Slide 45: The MEANS Procedure

  Variable   N    Mean     Std Dev   Minimum   Maximum
  interest   31     5.61      0.20      5.22      6.16
  coverage   31    65.33      2.85     60.20     70.60
  concent    31    37.56     14.84     12.30     67.10
  lenders    31   100.41    119.25      7.00    550.00
Slide 46: SAS output

  Root MSE        0.1609    R-Square    0.4505
  Dependent Mean  5.6158    Adj R-Sq    0.3895
  Coeff Var       2.8658
Slide 47: We test the null hypothesis that none of the explanatory variables (LENDERS, COVERAGE or CONCENT) is significant in explaining the dependent variable (INTEREST).

  Analysis of Variance

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              3   0.5734           0.1911        7.38      0.0009
  Error             27   0.6993           0.0259
  Corrected Total   30   1.2727

p < 0.05, so we reject the null and conclude that at least one of LENDERS, COVERAGE or CONCENT is significant in explaining INTEREST.
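The F value in the ANOVA table is simply the model mean square divided by the error mean square, which we can verify from the sums of squares above:

```python
# F statistic from the ANOVA table: ratio of mean squares.
ms_model = 0.5734 / 3     # model sum of squares / model DF
ms_error = 0.6993 / 27    # error sum of squares / error DF
f = ms_model / ms_error
print(round(f, 2))        # 7.38, matching the SAS output
```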
Slide 48: Theory suggests that this [the negative COVERAGE coefficient] is unlikely to be true and that we should expect a positive relationship.

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1    5.71919             0.70962           8.06     <.0001
  coverage     1   -0.00438             0.01077          -0.41     0.6874
  concent      1    0.00627             0.00257           2.44     0.0215
  lenders      1   -0.00052602          0.00032240       -1.63     0.1144
Slide 49: Deleting variables
- When?
- Variables that have large p values.
  - Deleting variables with p values marginally more than 0.05 seems a little too extreme.
  - SAS provides sig or p values for two-sided tests, and in regression we often want to perform one-sided tests.
  - These values could be double what we need to deal with.
- Variables whose coefficients have the wrong sign.
  - If the model is telling us that quantity sold increases when the price increases, ceteris paribus, something is certainly wrong.
- In our example we can delete COVERAGE for both reasons.
Slide 50: This accords with theory: greater monopolization is associated with higher interest rates.

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1    5.43485             0.12046          45.12     <.0001
  concent      1    0.00617             0.00252           2.45     0.0208
  lenders      1   -0.00050474          0.00031335       -1.61     0.1184
Slide 51: This accords with theory: greater competition is associated with lower interest rates.

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1    5.43485             0.12046          45.12     <.0001
  concent      1    0.00617             0.00252           2.45     0.0208
  lenders      1   -0.00050474          0.00031335       -1.61     0.1184
Slide 52: SAS output

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1    5.43485             0.12046          45.12     <.0001
  concent      1    0.00617             0.00252           2.45     0.0208
  lenders      1   -0.00050474          0.00031335       -1.61     0.1184

Equation: INTEREST = 5.435 + 0.006166 CONCENT − 0.000505 LENDERS
Slide 53: Other tests
- Modern regression procedures.
  - Obtaining a plausible model with good p values and high R² might not be enough.
  - Any of the following could lead to regression equations being misleading.
- Multicollinearity.
  - Two or more of the independent variables being highly correlated.
- Autocorrelation.
  - Successive pairs of residuals being highly correlated in models that use time-series data.
- Non-normality.
  - The errors not being normally distributed.
- Heteroskedasticity.
  - The variance (standard deviation squared) of the errors not being constant.
- These tests are outside the scope of this subject.
  - When problems of these sorts are identified, there is often a means of correcting them.
Slide 54: About logarithms
- Base 10.
  - Log 10 = 1.
  - Log 100 = 2.
  - Log 1000 = 3.
  - The logarithm in each case is the power to which we have to raise 10 to get the number.
  - Log 1000000 = 6 means that 10^6 = 1000000.
- Numbers that are not obvious powers of 10?
  - Log 200 = 2.3010.
  - The logarithm of any positive number can be calculated by a power series formula.
- Natural logarithms.
  - For fairly obscure mathematical reasons, we often prefer to use natural logarithms.
  - These have base e (instead of 10), where e is Euler's number (≈ 2.718).
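The examples on this slide are easy to check with Python's math module:

```python
# Checking the slide's base-10 examples and Euler's number.
import math

print(math.log10(1000))           # approximately 3.0
print(round(math.log10(200), 4))  # 2.301
print(round(math.e, 3))           # 2.718
```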
Slide 55: Logarithmic laws
- These apply to all positive numbers and any base.
- Law 1: Log(a × b) = Log(a) + Log(b).
- Law 2: Log(a^n) = n Log(a).
Slide 56: Elasticity
- Concept
  - In economic modelling, we are often interested in the impact of the change in one variable on another,
  - in percentage terms,
  - holding other variables constant (ceteris paribus).
- Example
  - Suppose a price elasticity is η = −1.3.
  - This means that a 10% price increase leads to a 13% decrease in consumption, ceteris paribus.
- Calculation
  - Elasticity = percentage change in Y ÷ percentage change in X.
Slide 57: Constant elasticity models
- Model: Y = A X1^β1 X2^β2 … Xk^βk.
  - The β values are elasticities (as you could demonstrate using calculus).
- Now use the log laws: Log Y = Log A + β1 Log X1 + β2 Log X2 + … + βk Log Xk.
- The variables are logarithms.
- The coefficients are elasticities.
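A quick numerical sketch of why the log-log coefficient is the elasticity: if Y = A X^β, the slope of log Y against log X is exactly β. The numbers below are illustrative only.

```python
# In Y = A * X**beta, the log-log slope recovers beta exactly.
import math

A, beta = 2.0, -1.3
x1, x2 = 10.0, 11.0        # a 10% increase in X
y1 = A * x1 ** beta
y2 = A * x2 ** beta

slope = (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))
print(round(slope, 6))     # -1.3: the elasticity we built in
```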
Slide 58: Dependent Variable: LREVENUE

  Root MSE         0.1007    R-Square    0.7049
  Dependent Mean  11.7259    Adj R-Sq    0.6744
  Coeff Var        0.8593

  Variable    Parameter Estimate   t Value   Pr > |t|
  Intercept    6.6578               8.62     <.0001
  LCOMP       -0.3779              -5.82     <.0001
  LPOP         0.3520               6.29     <.0001
  LINCOME      0.1590               1.88     0.0700

- Competitors up 10% ⇒ revenue down about 3.8%.
- Population up 10% ⇒ revenue up about 3.5%.
- Income up 10% ⇒ revenue up about 1.6%.