Fitting Equations to Data - PowerPoint PPT Presentation

Title: Statistics Courses beyond Stats 242.3 Author: laverty Last modified by: laverty Created Date: 3/16/2005 5:16:15 PM Document presentation format – PowerPoint PPT presentation

Slides: 68
Provided by: laverty

Transcript and Presenter's Notes

1
Fitting Equations to Data
2
  • Suppose that we have a single dependent variable Y
    (continuous numerical) and one or several independent
    variables, X1, X2, X3, ... (also continuous numerical,
    although there are techniques that allow you to handle
    categorical independent variables).
  • The objective will be to fit an equation to
    the data collected on these measurements that
    explains the dependence of Y on X1, X2, X3, ...

3
Example
  • Data collected on n = 110 countries
  • Some of the variables
  • Y = infant mortality
  • X1 = population size
  • X2 = population density
  • X3 = urban
  • X4 = GDP
  • Etc.
  • Our interest is in determining how Y is related
    to X1, X2, X3, X4, etc.

4
What is the value of these equations?
5
Equations give very precise and concise
descriptions (models) of data and how dependent
variables are related to independent variables.
6
Examples
  • Linear models: Y = Blood Pressure, X = age
  • Y = aX + b + ε

7
  • Exponential growth or decay models
  • Y = average of the 5 best times for the 100m during
    an Olympic year, X = the Olympic year.

8
  • Logistic Growth models

9
  • Gompertz Growth models

10
  • Note the presence of the random error term
    (random noise).
  • This is an important term in any statistical
    model.
  • Without this term the model is deterministic and
    doesn't require statistical analysis.

11
What is the value of these equations?
  • Equations give very precise and concise
    descriptions (models) of data and how dependent
    variables are related to independent variables.
  • The parameters of the equations usually have very
    useful interpretations relative to the phenomenon
    that is being studied.
  • The equations can be used to calculate and
    estimate very useful quantities related to the
    phenomenon: relative extrema, and future or
    out-of-range values of the phenomenon.
  • Equations can provide the framework for
    comparison.

12
The Multiple Linear Regression Model
13
  • Again we assume that we have a single dependent
    variable Y and p (say) independent variables X1,
    X2, X3, ... , Xp.
  • The equation (model) that generally describes the
    relationship between Y and the independent
    variables is of the form
  • Y = f(X1, X2, ... , Xp | θ1, θ2, ... , θq) + ε
  • where θ1, θ2, ... , θq are unknown parameters of
    the function f and ε is a random disturbance
    (usually assumed to have a normal distribution
    with mean 0 and standard deviation σ).

14
  • In Multiple Linear Regression we assume the
    following model
  • Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε
  • This model is called the Multiple Linear
    Regression Model.
  • Here β0, β1, β2, ... , βp are unknown
    parameters of the model and ε is a random
    disturbance assumed to have a normal distribution
    with mean 0 and standard deviation σ.

15
The importance of the Linear model
  • 1.     It is the simplest form of a model in
    which each independent variable has some effect on
    the dependent variable Y. When fitting models
    to data one tries to find the simplest form of a
    model that still adequately describes the
    relationship between the dependent variable and
    the independent variables. The linear model is
    often the first model to be fitted and is only
    abandoned if it turns out to be inadequate.

16
  • 2.     In many instances a linear model is the most
    appropriate model to describe the dependence
    relationship between the dependent variable and
    the independent variables. This will be true if
    the dependent variable increases at a constant
    rate as any one of the independent variables is
    increased while holding the other independent
    variables constant.

17
  • 3.     Many non-linear models can be put into the
    form of a linear model by appropriately
    transforming the dependent variable and/or any
    or all of the independent variables. This
    important fact ensures the wide utility of the
    linear model (i.e. the fact that many non-linear
    models are linearizable).
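As a hedged illustration of linearization (this particular model and its symbols are assumptions chosen for the example, not taken from the slides), an exponential growth model with a multiplicative error becomes linear after taking logarithms:

```latex
Y = \alpha\, e^{\beta X}\, \varepsilon
\quad\Longrightarrow\quad
\ln Y = \ln \alpha + \beta X + \ln \varepsilon
```

Writing Y' = ln Y, β0 = ln α, β1 = β and ε' = ln ε gives Y' = β0 + β1 X + ε', which has the form of a linear model.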

18
An Example
  • The following data come from an experiment that
    investigated the source from
    which corn plants in various soils obtain their
    phosphorus. The concentration of inorganic
    phosphorus (X1) and the concentration of organic
    phosphorus (X2) were measured in the soil of n =
    18 test plots. In addition the phosphorus
    content (Y) of corn grown in the soil was also
    measured. The data are displayed below

19


20
Equation: Y = 56.2510241 + 1.78977412 X1
+ 0.08664925 X2
22
Least Squares for Multiple Regression
23
  • Assume we have taken n observations on Y
  • y1, y2, ... , yn
  • for n sets of values of X1, X2, ... , Xp
  • (x11, x12, ... , x1p)
  • (x21, x22, ... , x2p)
  • ...
  • (xn1, xn2, ... , xnp)
  • For any choice of the parameters β0, β1, β2, ... ,
    βp
  • the residual sum of squares is defined to be
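In the notation of this slide, the standard definition of the residual sum of squares can be written as follows (the symbol S is an assumption introduced here for convenience):

```latex
S(\beta_0, \beta_1, \ldots, \beta_p)
  = \sum_{i=1}^{n}
    \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2
```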

24
  • The Least Squares estimators of β0, β1, β2, ... ,
    βp
  • are chosen to minimize the residual sum of
    squares

To achieve this we solve the following system of
equations
25
  • Now

or
26
Also
or
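A sketch of the standard derivation these steps refer to, assuming S(β0, ... , βp) denotes the residual sum of squares Σ (yi − β0 − β1 xi1 − ... − βp xip)²: setting each partial derivative to zero gives one equation per parameter.

```latex
% Derivative with respect to the intercept
\frac{\partial S}{\partial \beta_0}
  = -2 \sum_{i=1}^{n}
    \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right) = 0
\quad\text{or}\quad
n \beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_p \sum_{i=1}^{n} x_{ip}
  = \sum_{i=1}^{n} y_i

% Derivative with respect to each slope, k = 1, ..., p
\frac{\partial S}{\partial \beta_k}
  = -2 \sum_{i=1}^{n} x_{ik}
    \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right) = 0
\quad\text{or}\quad
\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{i1} x_{ik}
  + \cdots + \beta_p \sum_{i=1}^{n} x_{ip} x_{ik}
  = \sum_{i=1}^{n} x_{ik} y_i
```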
27
The system of equations forms
(p + 1) linear equations in the (p + 1)
unknowns β0, β1, ... , βp. These equations are called the Normal
equations. The solutions are
called the least squares estimates
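A minimal sketch of forming and solving the Normal equations in matrix form, (X'X)β = X'y, using only the Python standard library. The function names and the data are made up for illustration; a statistical package would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the Normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(xs, ys):
    """Return [b0, b1, ..., bp] for rows xs[i] = (x_i1, ..., x_ip)."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 for the intercept
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[j] * r[l] for r in design) for l in range(k)] for j in range(k)]
    xty = [sum(r[j] * y for r, y in zip(design, ys)) for j in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no random error),
# so the estimates should recover 1, 2 and 3 up to rounding error.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
estimates = least_squares(xs, ys)
print(estimates)
```

With p = 2 predictors this solves p + 1 = 3 equations in 3 unknowns, matching the count given above.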
28
The Example
  • The following data come from an experiment that
    investigated the source from
    which corn plants in various soils obtain their
    phosphorus. The concentration of inorganic
    phosphorus (X1) and the concentration of organic
    phosphorus (X2) were measured in the soil of n =
    18 test plots. In addition the phosphorus
    content (Y) of corn grown in the soil was also
    measured. The data are displayed below

29


30
The Normal equations.
where
31
The Normal equations.
have solution
32
Equation: Y = 56.2510241 + 1.78977412 X1
+ 0.08664925 X2
34
Summary of the Statistics used in Multiple
Regression
35
  • The Least Squares Estimates

- the values of β0, β1, ... , βp that minimize the residual sum of squares
36
  • The Analysis of Variance Table Entries
  • a) Adjusted Total Sum of Squares (SSTotal)
  • b) Residual Sum of Squares (SSError)
  • c) Regression Sum of Squares (SSReg)
  • Note: SSTotal = SSReg + SSError
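For reference, the standard definitions of the three sums of squares, assuming ȳ denotes the mean of the observations and ŷi the fitted value for observation i (both symbols are introduced here, not in the slides):

```latex
SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{Error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_{Reg}   = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
```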

37
The Analysis of Variance Table
  Source       Sum of Squares   d.f.     Mean Square                      F
  Regression   SSReg            p        MSReg = SSReg/p                  MSReg/s²
  Error        SSError          n-p-1    MSError = s² = SSError/(n-p-1)
  Total        SSTotal          n-1
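A small sketch of how the table entries could be computed from the observations and the fitted values. The function name and the data are made up; for brevity the example uses a single predictor (p = 1), fitted with the closed-form simple regression formulas.

```python
def anova_table(ys, fitted, p):
    """ANOVA entries for a least squares fit with an intercept and p predictors."""
    n = len(ys)
    y_bar = sum(ys) / n
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)
    ms_reg = ss_reg / p                # Regression mean square
    ms_error = ss_error / (n - p - 1)  # s2, the estimate of the error variance
    return {
        "SSReg": ss_reg, "SSError": ss_error, "SSTotal": ss_total,
        "MSReg": ms_reg, "MSError": ms_error, "F": ms_reg / ms_error,
    }


# Made-up data with one predictor.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

table = anova_table(ys, fitted, p=1)
for name, value in table.items():
    print(name, round(value, 4))
```

For these numbers SSTotal = 6, SSReg = 3.6 and SSError = 2.4, so the identity SSTotal = SSReg + SSError holds and F = 3.6 / 0.8 = 4.5.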

38
Uses
  • 1. To estimate σ² (the error variance).
  • - Use s² = MSError to estimate σ².
  • 2. To test the hypothesis
  • H0: β1 = β2 = ... = βp = 0.
  • Use the test statistic F = MSReg/MSError = MSReg/s².

- Reject H0 if F > Fα(p, n-p-1).
39
  • 3. To compute other statistics that are useful in
    describing the relationship between Y (the
    dependent variable) and X1, X2, ... ,Xp (the
    independent variables).
  • a) R² = the coefficient of determination
  • = SSReg/SSTotal
  • = the proportion of variance in Y explained by
  • X1, X2, ... , Xp
  • 1 - R² = the proportion of variance in Y
  • that is left unexplained by X1, X2, ... , Xp
  • = SSError/SSTotal.

40
  • b) Ra² = "R² adjusted" for degrees of freedom
  • = 1 - [the proportion of variance in Y that is
    left
  • unexplained by X1, X2, ... , Xp, adjusted
    for d.f.]
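The usual formula for the adjusted statistic, written with the sums of squares and degrees of freedom from the Analysis of Variance table, is sketched below; the adjustment replaces each sum of squares by its mean square.

```latex
R_a^2
  = 1 - \frac{SS_{Error} / (n - p - 1)}{SS_{Total} / (n - 1)}
  = 1 - \frac{n - 1}{n - p - 1} \left( 1 - R^2 \right)
```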

41
  • c) R = √R² = the multiple correlation
    coefficient of Y with X1, X2, ... , Xp
  • = the maximum correlation between Y and a
    linear combination of X1, X2, ... , Xp
  • Comment: The statistics F, R², Ra² and R are
    equivalent statistics.
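One way to see the equivalence: for fixed n and p, F can be written as an increasing function of R². Dividing the numerator and denominator of MSReg/MSError by SSTotal gives

```latex
F = \frac{SS_{Reg} / p}{SS_{Error} / (n - p - 1)}
  = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}
```

so a larger R² always corresponds to a larger F (and likewise to a larger adjusted R² and a larger R).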

42
Using Statistical Packages
  • To perform Multiple Regression

43
Using SPSS
Note: The use of another statistical package such
as Minitab is similar to using SPSS
44
After starting the SPSS program the following
dialogue box appears
45
If you select Opening an existing file and press
OK the following dialogue box appears
46
The following dialogue box appears
47
If the variable names are in the file ask it to
read the names. If you do not specify the Range
the program will identify the Range
Once you click OK, two windows will appear
48
One that will contain the output
49
The other containing the data
50
To perform any statistical Analysis select the
Analyze menu
51
Then select Regression and Linear.
52
The following Regression dialogue box appears
53
Select the Dependent variable Y.
54
Select the Independent variables X1, X2, etc.
55
If you select the Method "Enter":
56
  • All variables will be put into the equation.

There are also several other methods that can be
used
  1. Forward selection
  2. Backward Elimination
  3. Stepwise Regression

58
  • Forward selection
  1. This method starts with no variables in the
    equation
  2. Carries out statistical tests on variables not in
    the equation to see which have a significant
    effect on the dependent variable.
  3. Adds the most significant.
  4. Continues until all variables not in the equation
    have no significant effect on the dependent
    variable.
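The forward selection steps above can be sketched as follows. This is an illustration only: it uses a partial F statistic with a fixed threshold (`f_enter`, an assumed value) as the significance test, whereas packages such as SPSS typically compare a p-value. The helper `rss`, the function names and the data are all made up.

```python
def rss(columns, ys):
    """Residual sum of squares of the least squares fit of ys on the
    given predictor columns plus an intercept (via the Normal equations)."""
    n = len(ys)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented Normal equations [X'X | X'y]
    m = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * y for r, y in zip(rows, ys))] for a in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (m[i][k] - s) / m[i][i]
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, ys))


def forward_selection(columns, ys, f_enter=4.0):
    """Return the indices of the selected columns, in order of entry."""
    n = len(ys)
    selected = []                          # start with no variables
    remaining = list(range(len(columns)))
    current = rss([], ys)
    while remaining:
        best = None
        df_error = n - (len(selected) + 1) - 1
        for j in remaining:                # test each variable not in the equation
            trial = rss([columns[i] for i in selected + [j]], ys)
            f_stat = (current - trial) / (trial / df_error)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, trial)
        if best[0] < f_enter:              # nothing left is significant
            break
        selected.append(best[1])           # add the most significant variable
        remaining.remove(best[1])
        current = best[2]
    return selected


# Made-up data: y depends on x1 only; x2 is unrelated to the noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
ys = [2 * a + e for a, e in zip(x1, noise)]
chosen = forward_selection([x1, x2], ys)
print(chosen)
```

In this example only the first column (index 0, i.e. x1) should enter the equation. The sketch does not guard against a perfect fit (zero residual sum of squares) or running out of error degrees of freedom.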

59
  • Backward Elimination
  1. This method starts with all variables in the
    equation
  2. Carries out statistical tests on variables in the
    equation to see which have no significant effect
    on the dependent variable.
  3. Deletes the least significant.
  4. Continues until all variables in the equation
    have a significant effect on the dependent
    variable.
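A matching sketch of backward elimination, under the same assumptions: a partial F statistic with a fixed, assumed threshold (`f_remove`) stands in for the package's significance test, and the helper `rss`, the function names and the data are made up.

```python
def rss(columns, ys):
    """Residual sum of squares of the least squares fit of ys on the
    given predictor columns plus an intercept (via the Normal equations)."""
    n = len(ys)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented Normal equations [X'X | X'y]
    m = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * y for r, y in zip(rows, ys))] for a in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (m[i][k] - s) / m[i][i]
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, ys))


def backward_elimination(columns, ys, f_remove=4.0):
    """Return the indices of the columns that remain in the equation."""
    n = len(ys)
    selected = list(range(len(columns)))   # start with all variables
    while selected:
        full = rss([columns[i] for i in selected], ys)
        df_error = n - len(selected) - 1
        worst = None
        for j in selected:                 # test each variable in the equation
            reduced = rss([columns[i] for i in selected if i != j], ys)
            f_stat = (reduced - full) / (full / df_error)
            if worst is None or f_stat < worst[0]:
                worst = (f_stat, j)
        if worst[0] >= f_remove:           # every variable is significant
            break
        selected.remove(worst[1])          # delete the least significant
    return selected


# Made-up data: y depends on x1 only; x2 is unrelated to the noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
ys = [2 * a + e for a, e in zip(x1, noise)]
kept = backward_elimination([x1, x2], ys)
print(kept)
```

Here x2 should be deleted in the first pass and x1 (index 0) retained.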

60
  • Stepwise Regression (uses both forward and
    backward techniques)
  1. This method starts with no variables in the
    equation
  2. Carries out statistical tests on variables not in
    the equation to see which have a significant
    effect on the dependent variable.
  3. It then adds the most significant.
  4. After a variable is added it checks to see if any
    variables added earlier can now be deleted.
  5. Continues until all variables not in the equation
    have no significant effect on the dependent
    variable.
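A sketch of the stepwise procedure, combining a forward step with a backward check after each addition. As an assumption for this illustration, fixed partial F thresholds are used, with `f_remove` set slightly below `f_enter` so that a variable is not added and deleted in a cycle; the helper `rss`, the function names and the data are made up.

```python
def rss(columns, ys):
    """Residual sum of squares of the least squares fit of ys on the
    given predictor columns plus an intercept (via the Normal equations)."""
    n = len(ys)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented Normal equations [X'X | X'y]
    m = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * y for r, y in zip(rows, ys))] for a in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (m[i][k] - s) / m[i][i]
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, ys))


def partial_f(rss_small, rss_big, df_error):
    """Partial F for the one extra variable in the bigger model."""
    return (rss_small - rss_big) / (rss_big / df_error)


def stepwise(columns, ys, f_enter=4.0, f_remove=3.9):
    """Return the indices of the columns in the final equation."""
    n = len(ys)
    selected = []                                   # start with no variables
    while True:
        current = rss([columns[i] for i in selected], ys)
        candidates = [j for j in range(len(columns)) if j not in selected]
        if not candidates:
            return selected
        df_error = n - (len(selected) + 1) - 1
        scored = [(partial_f(current,
                             rss([columns[i] for i in selected + [j]], ys),
                             df_error), j) for j in candidates]
        best_f, best_j = max(scored)
        if best_f < f_enter:                        # nothing significant remains
            return selected
        selected.append(best_j)                     # add the most significant
        # Check whether any variable added earlier can now be deleted.
        while len(selected) > 1:
            full = rss([columns[i] for i in selected], ys)
            df_error = n - len(selected) - 1
            worst_f, worst_j = min(
                (partial_f(rss([columns[i] for i in selected if i != j], ys),
                           full, df_error), j)
                for j in selected[:-1])
            if worst_f >= f_remove:
                break
            selected.remove(worst_j)


# Made-up data: y depends on x1 only; x2 is unrelated to the noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
ys = [2 * a + e for a, e in zip(x1, noise)]
final = stepwise([x1, x2], ys)
print(final)
```

For this data the procedure should add x1 (index 0) and then stop, since x2 has no significant additional effect.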

61
  • All of these methods are procedures for
    attempting to find the best equation

The best equation is the equation that is the
simplest (not containing variables that are not
important) yet adequate (containing variables
that are important)
62
Once the dependent variable, the independent
variables and the Method have been selected,
press OK and the analysis will be performed.
63
The output will contain the following table
R² and adjusted R² measure the proportion of
variance in Y that is explained by X1, X2, X3,
etc. (67.6% and 67.3%)
R is the multiple correlation coefficient (the
maximum correlation between Y and a linear
combination of X1, X2, X3, etc.)
64
The next table is the Analysis of Variance Table
The F test tests whether the regression
coefficients of the predictor variables are all
zero, namely whether none of the independent variables
X1, X2, X3, etc. has any effect on Y
65
The final table in the output
Gives the estimates of the regression
coefficients, their standard errors and the t test
for testing if they are zero.
Note: Engine size has no significant effect on Mileage
66
The estimated equation from the table below
Is
67
Note the equation is
Mileage decreases with
  1. increases in Engine Size (not significant, p = 0.432)
  2. increases in Horsepower (significant, p = 0.000)
  3. increases in Weight (significant, p = 0.000)