Fitting Equations to Data

File title: Statistics Courses beyond Stats 242.3. Author: laverty. Last modified by: laverty. Created: 3/16/2005 5:16:15 PM. Format: PowerPoint presentation.

1
Fitting Equations to Data
2

Suppose that we have a single dependent variable Y (continuous numerical) and one or several independent variables X1, X2, X3, ... (also continuous numerical, although there are techniques that allow you to handle categorical independent variables).
The objective is to fit an equation to the data collected on these measurements that explains the dependence of Y on X1, X2, X3, ...

3
Example

Data collected on n = 110 countries.
Some of the variables:
Y = infant mortality
X1 = population size
X2 = population density
X3 = % urban
X4 = GDP
etc.
Our interest is in determining how Y is related to X1, X2, X3, X4, etc.

4
What is the value of these equations?
5
Equations give very precise and concise
descriptions (models) of data and how dependent
variables are related to independent variables.
6
Examples

Linear models: Y = blood pressure, X = age
$Y = \alpha X + \beta + \varepsilon$

Exponential growth or decay models: Y = average of the 5 best times for the 100 m during an Olympic year, X = the Olympic year.

Logistic Growth models

Gompertz Growth models

Note the presence of the random error term (random noise).
This is an important term in any statistical model.
Without this term the model is deterministic and does not require statistical analysis.

11
What is the value of these equations?

Equations give very precise and concise
descriptions (models) of data and how dependent
variables are related to independent variables.
The parameters of the equations usually have very useful interpretations relative to the phenomenon being studied.
The equations can be used to calculate and estimate very useful quantities related to the phenomenon, such as relative extrema and future or out-of-range values.
Equations can provide the framework for comparison.

12
The Multiple Linear Regression Model
13

Again we assume that we have a single dependent variable Y and p (say) independent variables X1, X2, X3, ..., Xp.
The equation (model) that generally describes the relationship between Y and the independent variables is of the form
$Y = f(X_1, X_2, \dots, X_p \mid \theta_1, \theta_2, \dots, \theta_q) + \varepsilon$
where $\theta_1, \theta_2, \dots, \theta_q$ are unknown parameters of the function f and $\varepsilon$ is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation $\sigma$).

In Multiple Linear Regression we assume the following model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
This model is called the Multiple Linear Regression Model.
Again $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are unknown parameters of the model and $\varepsilon$ is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation $\sigma$.
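
To make the role of the random disturbance concrete, here is a minimal Python sketch that draws observations from a multiple linear regression model. The function name simulate_mlr, the design points and the parameter values are all made up for illustration; none of them come from the slides.

```python
import random

def simulate_mlr(x_rows, betas, sigma, seed=0):
    """Draw Y = b0 + b1*x1 + ... + bp*xp + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    b0, slopes = betas[0], betas[1:]
    ys = []
    for row in x_rows:
        mean = b0 + sum(b * x for b, x in zip(slopes, row))
        ys.append(mean + rng.gauss(0.0, sigma))  # add the random disturbance
    return ys

# Hypothetical design: three observations on p = 2 independent variables.
x_rows = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.5)]
betas = [10.0, 2.0, -1.0]  # made-up intercept and two slopes

noiseless = simulate_mlr(x_rows, betas, sigma=0.0)
noisy = simulate_mlr(x_rows, betas, sigma=1.0)
print(noiseless)  # sigma = 0 gives the deterministic part: [10.0, 13.5, 14.5]
print(noisy)
```

With sigma set to 0 the model is deterministic; any positive sigma scatters the observations around the linear part, which is the situation regression is meant to handle.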

15
The importance of the Linear model

1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is sometimes the first model to be fitted and is abandoned only if it turns out to be inadequate.

2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.

3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables. This important fact ensures the wide utility of the linear model (i.e. the fact that many non-linear models are linearizable).
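
As one illustration of a linearizable model, the sketch below assumes an exponential model Y = a exp(bX) with multiplicative error, so that ln Y = ln a + bX is linear in X. The data are made up and noiseless, and fit_line is a hypothetical helper implementing ordinary simple least squares.

```python
import math

def fit_line(xs, ys):
    """Simple least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up noiseless data from Y = 3 * exp(0.5 * X), for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * math.exp(0.5 * x) for x in xs]

# Taking logs gives ln Y = ln(3) + 0.5 * X, which is linear in X.
intercept, slope = fit_line(xs, [math.log(y) for y in ys])
a_hat, b_hat = math.exp(intercept), slope
print(a_hat, b_hat)  # close to 3 and 0.5
```

The transformed fit recovers the parameters of the original non-linear model by back-transforming the intercept.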

18
An Example

The following data come from an experiment investigating the source from which corn plants in various soils obtain their phosphorus. The concentration of inorganic phosphorus (X1) and the concentration of organic phosphorus (X2) were measured in the soil of n = 18 test plots. In addition, the phosphorus content (Y) of corn grown in the soil was measured. The data are displayed below.

19

20
Equation: $Y = 56.2510241 + 1.78977412\,X_1 + 0.08664925\,X_2$
22
Least Squares for Multiple Regression
23

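Assume we have taken n observations on Y:
$y_1, y_2, \dots, y_n$
for n sets of values of X1, X2, ..., Xp:
$(x_{11}, x_{12}, \dots, x_{1p})$
$(x_{21}, x_{22}, \dots, x_{2p})$
$\dots$
$(x_{n1}, x_{n2}, \dots, x_{np})$
For any choice of the parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ the residual sum of squares is defined to be
$$S(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2$$

The Least Squares estimators of $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are chosen to minimize the residual sum of squares.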

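To achieve this we solve the following system of equations, where S denotes the residual sum of squares:
25
$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right) = 0$$
or
$$n \beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \dots + \beta_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$$
26
Also, for k = 1, 2, ..., p,
$$\frac{\partial S}{\partial \beta_k} = -2 \sum_{i=1}^{n} x_{ik} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right) = 0$$
or
$$\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \dots + \beta_p \sum_{i=1}^{n} x_{ik} x_{ip} = \sum_{i=1}^{n} x_{ik} y_i$$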
27
The system consists of (p + 1) linear equations in the (p + 1) unknowns $\beta_0, \beta_1, \dots, \beta_p$. These equations are called the Normal equations. The solutions $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ are called the least squares estimates.
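
A minimal Python sketch of forming and solving the Normal equations is given here. It uses only the standard library; the helper names solve and least_squares are hypothetical, and the data are made up so that the answer is known in advance. In practice a statistical package or a numerical library would be used instead.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(x_rows, ys):
    """Least squares estimates (intercept first) from the Normal equations."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 for the intercept
    k = len(design[0])  # p + 1 unknowns
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2 (no error term).
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]
estimates = least_squares(x_rows, ys)
print(estimates)  # close to [1, 2, 3]
```

The entries of xtx and xty are the sums and sums of cross-products that appear as the coefficients and right-hand sides of the Normal equations.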
28
The Example

Recall the corn phosphorus experiment: the concentration of inorganic phosphorus (X1) and the concentration of organic phosphorus (X2) were measured in the soil of n = 18 test plots, together with the phosphorus content (Y) of corn grown in that soil.

29

30
The Normal equations.
where
31
The Normal equations.
have solution
32
Equation: $Y = 56.2510241 + 1.78977412\,X_1 + 0.08664925\,X_2$
34
Summary of the Statistics used in Multiple
Regression
35

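The Least Squares Estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$

- the values of $\beta_0, \beta_1, \dots, \beta_p$ that minimize the residual sum of squares $\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2$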
36

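The Analysis of Variance Table Entries (here $\bar y$ is the mean of the observations and $\hat y_i$ is the fitted value for observation i)
a) Adjusted Total Sum of Squares: $\text{SSTotal} = \sum_{i=1}^{n} (y_i - \bar y)^2$
b) Residual Sum of Squares: $\text{SSError} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
c) Regression Sum of Squares: $\text{SSReg} = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$
Note: $\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2$
i.e. SSTotal = SSReg + SSError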

37
The Analysis of Variance Table

Source       Sum of Squares   d.f.     Mean Square                        F
Regression   SSReg            p        MSReg = SSReg/p                    MSReg/s^2
Error        SSError          n-p-1    s^2 = MSError = SSError/(n-p-1)
Total        SSTotal          n-1
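
The entries of the table can be computed directly from fitted values. The sketch below does this for the simplest case p = 1, where the least squares line has a closed form; the function name anova_simple and the five data points are made up for illustration.

```python
def anova_simple(xs, ys):
    """ANOVA table entries for a regression of Y on a single X (p = 1)."""
    n, p = len(xs), 1
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]

    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)
    ms_reg = ss_reg / p
    ms_error = ss_error / (n - p - 1)  # this is s^2, the estimate of the error variance
    return {
        "SSTotal": ss_total, "SSReg": ss_reg, "SSError": ss_error,
        "MSReg": ms_reg, "MSError": ms_error, "F": ms_reg / ms_error,
    }

# Made-up data, for illustration only.
table = anova_simple([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(name, round(value, 4))
```

For these made-up data SSTotal = 6, SSReg = 3.6 and SSError = 2.4 (up to rounding), so the decomposition SSTotal = SSReg + SSError can be checked by hand, and F = 3.6 / 0.8 = 4.5.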

38
Uses

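1. To estimate $\sigma^2$ (the error variance).
- Use $s^2 = \text{MSError}$ to estimate $\sigma^2$.
2. To test the hypothesis
$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$.
Use the test statistic
$F = \text{MSReg} / \text{MSError} = \text{MSReg} / s^2$
- Reject $H_0$ if $F > F_\alpha(p, n-p-1)$.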
39

3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).
a) $R^2$ = the coefficient of determination = SSReg/SSTotal = the proportion of variance in Y explained by X1, X2, ..., Xp.
$1 - R^2$ = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp = SSError/SSTotal.

b) $R_a^2$ = "$R^2$ adjusted" for degrees of freedom = 1 - the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.:
$R_a^2 = 1 - \dfrac{\text{SSError}/(n-p-1)}{\text{SSTotal}/(n-1)}$

c) $R = \sqrt{R^2}$ = the multiple correlation coefficient of Y with X1, X2, ..., Xp = the maximum correlation between Y and a linear combination of X1, X2, ..., Xp.
Comment: The statistics F, $R^2$, $R_a^2$ and R are equivalent statistics: for fixed n and p, each is a monotone function of the others.
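
The relationships among these statistics can be checked numerically. The sketch below takes the sums of squares as given; the function name fit_summary and the input values (SSReg = 80, SSError = 20, n = 23, p = 2) are made up for illustration.

```python
import math

def fit_summary(ss_reg, ss_error, n, p):
    """R^2, adjusted R^2, R and F computed from the ANOVA sums of squares."""
    ss_total = ss_reg + ss_error
    r2 = ss_reg / ss_total
    r2_adj = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))
    f_stat = (ss_reg / p) / (ss_error / (n - p - 1))
    # The same F statistic written purely in terms of R^2, n and p.
    f_from_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))
    return {"R2": r2, "R2_adj": r2_adj, "R": math.sqrt(r2),
            "F": f_stat, "F_from_R2": f_from_r2}

# Made-up sums of squares, for illustration only.
summary = fit_summary(ss_reg=80.0, ss_error=20.0, n=23, p=2)
for name, value in summary.items():
    print(name, round(value, 4))
```

Because F can be written in terms of R^2 alone once n and p are fixed, a larger R^2 always corresponds to a larger F, which is the sense in which the statistics carry the same information.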

42
Using Statistical Packages

To perform Multiple Regression

43
Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS.
44
After starting the SPSS program the following dialogue box appears
45
If you select "Opening an existing file" and press OK the following dialogue box appears
46
The following dialogue box appears
47
If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range.
Once you click OK, two windows will appear
48
One that will contain the output
49
The other containing the data
50
To perform any statistical Analysis select the
Analyze menu
51
Then select Regression and Linear.
52
The following Regression dialogue box appears
53
Select the Dependent variable Y.
54
Select the Independent variables X1, X2, etc.
55
If you select the Method "Enter":
56

All variables will be put into the equation.

There are also several other methods that can be
used

Forward selection
Backward Elimination
Stepwise Regression

58

Forward selection

This method starts with no variables in the
equation
Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
Adds the most significant.
Continues until all variables not in the equation
have no significant effect on the dependent
variable.
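
The forward selection steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the entry rule is a fixed partial F threshold (f_to_enter = 4.0, an arbitrary choice) rather than the p-value criterion a package such as SPSS applies, the helper names are hypothetical, and the data are made up so that only X1 matters.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def rss(columns, ys):
    """Residual sum of squares after regressing ys on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, y in zip(design, ys))

def forward_selection(candidates, ys, f_to_enter=4.0):
    """Add, one at a time, the variable with the largest partial F statistic."""
    n = len(ys)
    chosen = []
    current_rss = rss([], ys)  # start with no variables in the equation
    while len(chosen) < len(candidates):
        best = None
        for name in candidates:
            if name in chosen:
                continue
            cols = [candidates[c] for c in chosen + [name]]
            new_rss = rss(cols, ys)
            df_error = n - len(cols) - 1
            f_stat = (current_rss - new_rss) / (new_rss / df_error)
            if best is None or f_stat > best[0]:
                best = (f_stat, name, new_rss)
        if best[0] < f_to_enter:  # no remaining variable is significant
            break
        chosen.append(best[1])
        current_rss = best[2]
    return chosen

# Made-up data: Y depends on X1 only; X2 is an unrelated alternating variable.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 2, 1, 2, 1, 2, 1]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
ys = [3 + 2 * a + e for a, e in zip(x1, noise)]
chosen = forward_selection({"X1": x1, "X2": x2}, ys)
print(chosen)  # ['X1']
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the variable with the smallest partial F statistic until every remaining variable passes the threshold.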

Backward Elimination

This method starts with all variables in the
equation
Carries out statistical tests on variables in the
equation to see which have no significant effect
on the dependent variable.
Deletes the least significant.
Continues until all variables in the equation
have a significant effect on the dependent
variable.

Stepwise Regression (uses both forward and
backward techniques)

This method starts with no variables in the
equation
Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
It then adds the most significant.
After a variable is added it checks to see if any
variables added earlier can now be deleted.
Continues until all variables not in the equation
have no significant effect on the dependent
variable.

All of these methods are procedures for
attempting to find the best equation

The best equation is the equation that is the
simplest (not containing variables that are not
important) yet adequate (containing variables
that are important)
62
Once the dependent variable, the independent variables and the Method have been selected, press OK and the Analysis will be performed.
63
The output will contain the following table.
R2 and R2 adjusted measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3%).
R is the Multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc.).
64
The next table is the Analysis of Variance Table.
The F test tests whether the regression coefficients of the predictor variables are all zero, namely that none of the independent variables X1, X2, X3, etc. has any effect on Y.
65
The final table in the output gives the estimates of the regression coefficients, their standard errors and the t tests for testing if they are zero.
Note: Engine size has no significant effect on Mileage.
66
The estimated equation from the table below is
67
Note the equation is
Mileage decreases:

- with increases in Engine Size (not significant, p = 0.432)
- with increases in Horsepower (significant, p = 0.000)
- with increases in Weight (significant, p = 0.000)
