Title: Getting Started with Regression
1Getting Started with Regression
- Presented By Larry Zirbel
- Software Techniques, Inc.
- Tim Wilmath, MAI
- Hillsborough County Property Appraisers Office
- Prepared For 69th Annual IAAO Conference
Nashville, TN September 17, 2003
2History of Regression
Francis Galton created Regression Analysis in 1885
when he was attempting to predict a person's
height based on the height of his or her parents.
3History of Regression
Galton found that children born to tall parents
tended to be shorter than their parents, and
children born to short parents tended to be taller
than their parents. Both groups of children
regressed toward the mean height of all
children.
4Uses of Regression
Predicting the Weather
5Uses of Regression
Predicting Election Results
6Uses of Regression
Predicting Sales Prices
7What is Regression?
When Regression Analysis is used to predict sales
prices or establish assessments, it becomes an
Automated Sales Comparison Approach.
8Steps in Regression
1. Data Exploration and cleanup
2. Specifying the model
3. Calibrating the model
4. Interpreting the results
9Data Exploration and Cleanup
Is there a pattern suggesting a relationship
between variables?
Note the outliers. These will adversely affect
our final values if we don't deal with them now.
Because of the potential for extreme values to
influence the mean, modelers often remove or
trim extreme values.
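A minimal sketch of one way to trim extreme values. The slides do not say which rule the presenters used; the interquartile-range (IQR) fence below is an assumption, and the sale prices are hypothetical.

```python
import statistics

def trim_outliers(values, k=1.5):
    """Keep values inside the fences [Q1 - k*IQR, Q3 + k*IQR].

    The IQR fence is an assumed trimming rule, not one taken from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical sale prices; the last one is an extreme value.
prices = [98_000, 105_000, 110_000, 112_000, 118_000, 125_000, 131_000, 900_000]
trimmed = trim_outliers(prices)

# The mean drops sharply once the extreme sale is removed.
print(statistics.mean(prices), statistics.mean(trimmed))
```

Comparing the two means shows how strongly a single extreme sale can pull the average.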
10Model Specification
Specifying the model means picking the
appropriate equation and the variables that
will be used.
Models can be
- Additive - Most common for residential properties
- Multiplicative - Often used for land valuation
- Hybrid - Most advanced
We are going to use an Additive Model in this
presentation.
11Regression Components
- Dependent Variable
- Sales Price
- Independent Variables
- Size
- Age
- Location
- Condition
- Lot size
- Construction
- Quality
- Amenities
12Simple Regression
Simple Regression includes one Dependent Variable
(sales price) and only one Independent Variable,
such as Square Footage.
Using this model, a 1,000 sf home would be valued
at $75,000.
13Simple Regression
Simple Regression using only size as the
independent variable will predict sales prices;
however, it will treat all homes of the same
size equally.
1,000 square feet - $75,000?
1,000 square feet - $75,000
14Multiple Regression
We know square footage is an important
variable, but what other variables should we
include, and how do we decide?
Roof Type
Swimming Pool
Exterior Wall Type
Heated Area
Effective Age
Quality
Heat/Ac Type
Lot Size
Actual Age
Screen Porch
Garage
Location
View
15Correlation Analysis
Pearson's Correlation tells you the strength of the
linear relationship between two variables.
Notice the high correlation between sales price
and size.
Very little correlation between sales price and
dock.
Correlation Analysis also helps identify
Collinearity, which is a high correlation between two
independent variables. For example, the living
area of a home is highly correlated with the number
of bedrooms. It would only be necessary to have
one of these variables in the model.
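A short sketch of how Pearson's correlation can be computed and used to flag collinearity. The five homes are hypothetical, and the 0.8 cutoff is an assumed screening threshold, not a figure from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for five homes.
living_area = [1000, 1400, 1800, 2200, 2600]
bedrooms = [2, 3, 3, 4, 5]

r = pearson(living_area, bedrooms)

# Assumed rule of thumb: treat |r| above 0.8 between two independent
# variables as a sign that only one of them is needed in the model.
collinear = abs(r) > 0.8
print(round(r, 3), collinear)
```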
16Regression Equations
Simple regression: $Y = mX + b$
Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_K X_K$
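A minimal sketch of how the simple equation (Y equals m times X plus b) can be calibrated by least squares. The sales are hypothetical and noise-free, built at an assumed 75 per square foot with no constant so the fit reproduces the earlier 1,000 sf illustration; `fit_simple` is an illustrative name, not a function from any statistical package.

```python
def fit_simple(x, y):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical sales: size in square feet, price at an assumed 75 per sf.
sizes = [800, 1000, 1500, 2200]
prices = [75 * s for s in sizes]

m, b = fit_simple(sizes, prices)
predicted = m * 1000 + b  # value of a 1,000 sf home under this fit
print(m, b, predicted)
```

With real sales the points would not fall exactly on a line, and the fitted m and b would be the values that minimize the squared errors.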
17Running Regression
Statistical Software makes using Regression much
easier, performing the necessary calculations
quickly and accurately.
Let's run this!
18Regression Results
Model 1
The Output tells us how well our model is working.
The closer the Adj. R-Square is to 1, the
better.
And it gives us the coefficients (or
adjustments):
$6,838 + (Bldsize x $75.07) = Property Value
The adjusted R2 statistic measures the proportion of
total variation in sales price explained by the
Regression Model. It ranges from 0.00 to 1.00, with 1.00
being the desired value. A high number, say
0.910, means that approximately 91% of the variation
in value can be explained by the model.
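The slides do not show how the adjusted R2 is computed. The sketch below uses the standard textbook definitions of R2 and adjusted R2; the sample size and variable count in the example are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of total variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_sales, n_variables):
    """Penalize R2 for the number of independent variables in the model."""
    return 1 - (1 - r2) * (n_sales - 1) / (n_sales - n_variables - 1)

# Hypothetical example: R2 of 0.80 from 101 sales and 4 variables.
adj = adjusted_r_squared(0.80, n_sales=101, n_variables=4)
print(round(adj, 4))
```

Because of the penalty, adding a variable raises the adjusted figure only when the variable explains enough additional variation.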
19Regression Results
The output includes the coefficient and the
Constant
The Constant is the base value in the equation: the
portion of value not attributed to any variable in the model.
20Running Regression
Let's add another variable to the model, say
Land Size.
Let's run this model and see if results improve.
21Regression Results
Model 2
Our Adj. R2 went up from .731 to .801!
We also have new coefficients (or adjustments):
$6,119 + (Bldsize x $72.66) + (Landsf x $0.382)
= Property Value
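A rough sketch of the calculation statistical software performs for a model with more than one variable: ordinary least squares solved through the normal equations. The five sales are hypothetical and noise-free, generated from the Model 2 figures read as constant 6,119, 72.66 per building sf, and 0.382 per land sf, so the fit should return those same numbers. `solve` and `fit_ols` are illustrative names, and production software uses more numerically robust methods.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ..., bK] minimizing squared error (normal equations)."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the constant
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical sales: (building sf, land sf).
rows = [(1000, 7000), (1500, 6000), (2000, 10000), (2500, 8000), (1800, 12000)]
prices = [6119 + 72.66 * bld + 0.382 * land for bld, land in rows]

b0, b1, b2 = fit_ols(rows, prices)
print(round(b0, 2), round(b1, 2), round(b2, 3))
```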
22Running Regression
Let's add Age to the model.
If Age is significant to value, the model should
improve. Let's run it.
23Regression Results
Model 3
Our Adj. R2 went up from .801 to .832!
Notice the age coefficient is negative.
$22,855 + (Bldsize x $67.28) + (Landsf x $0.44)
+ (Age x -$630.76) = Property Value
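A small sketch of applying the Model 3 equation to one property. It reads the slide's figures as a constant plus each variable times its coefficient, with the parenthesized age figure taken as negative; the subject property is hypothetical.

```python
def model3_value(bldsize, landsf, age):
    """Estimate value from the Model 3 constant and coefficients."""
    return 22_855 + 67.28 * bldsize + 0.44 * landsf - 630.76 * age

# Hypothetical subject: 1,500 sf building, 8,000 sf lot, 20 years old.
value_20 = model3_value(bldsize=1500, landsf=8000, age=20)

# The same home ten years older loses 10 x 630.76 in estimated value.
value_30 = model3_value(bldsize=1500, landsf=8000, age=30)
print(round(value_20, 2), round(value_30, 2))
```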
24Running Regression
Let's add Building Quality to the model.
We may have a problem. Let's run it and see.
25Regression Results
Model 4
Our Adj. R2 went up from .832 to .854 after
adding quality, but
notice the constant is now negative - that's not
good!
What do we do with this quality adjustment?
26Regression Results
This doesn't make sense because the codes 1, 2, 3,
etc. were not meant to be a rank. Regression treats
the code as a quantity, so each step up in quality
adds the same $26,110.
Quality: 1 - Fair, 2 - Average, 3 - Good, 4 -
Excellent, 5 - Superior
Resulting Adjustment
1 x $26,110 = $26,110
2 x $26,110 = $52,220
3 x $26,110 = $78,330
4 x $26,110 = $104,440
5 x $26,110 = $130,550
27A Note about Data Types
There are 3 primary types of property
characteristics:
- Continuous: Based on a size or measurement.
- Examples: Square Footage or Lot Size
- Discrete: Specific pre-defined value.
- Examples: Roof Material, Building Quality
- Binary: Either the item is present or not.
- Examples: Corner Location, Lakefront Location
28Transformations
To solve the problem, we need to convert the
discrete variable Quality into individual
binary variables, which allows Regression to
distinguish each type.
Quality BECOMES
- Fair - Yes/No
- Average - Yes/No
- Good - Yes/No
- Excellent - Yes/No
- Superior - Yes/No
29Running Regression
Now that we have transformed the variable
Quality, we can put it back in the model.
Notice we left Average out.
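A minimal sketch of the binary transformation of Quality, with Average left out as the base category. The variable names (`Q_FAIR` and so on) are hypothetical. Leaving one category out keeps the binary variables from being redundant with the constant.

```python
QUALITY_LEVELS = ["Fair", "Average", "Good", "Excellent", "Superior"]
BASE_LEVEL = "Average"  # left out of the model; other levels are relative to it

def quality_dummies(quality):
    """Convert one quality label into 0/1 variables, omitting the base level."""
    return {
        f"Q_{level.upper()}": int(quality == level)
        for level in QUALITY_LEVELS
        if level != BASE_LEVEL
    }

good = quality_dummies("Good")
average = quality_dummies("Average")  # base level: every variable is 0
print(good)
print(average)
```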
30Regression Results
Model 5
Our Adj. R2 went up from .832 to .869.
These Quality adjustments are all relative
to Average.
31Running Regression
Let's transform Neighborhood into binary variables and
add them to the model.
Notice we left out the Base Neighborhood (the
most typical).
32Regression Results
Model 6
Our Adj. R2 went up from .869 to .874.
These Neighborhood adjustments are all relative
to our Base Neighborhood
33Running Regression
Multiplicative Transformations combine two
variables into one: Square Footage x Quality
= SQFT1. This reflects the fact that quality may
contribute greater value in larger homes and less
value in smaller homes. In other words, without
combining these variables, all Good Quality homes
get the same adjustment regardless of their size.
Let's add this new combined variable to the model.
Since we combined SF and Quality, we remove them
as stand-alone variables.
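A sketch of the multiplicative transformation, assuming one combined variable per quality level (square footage times that level's 0/1 indicator). The variable names are hypothetical, and the slides' own naming (SQFT1 and so on) may map to levels differently.

```python
QUALITY_LEVELS = ["Fair", "Average", "Good", "Excellent", "Superior"]

def sqft_by_quality(sqft, quality):
    """Return square footage assigned to the home's own quality level only."""
    return {
        f"SQFT_{level.upper()}": sqft * int(quality == level)
        for level in QUALITY_LEVELS
    }

# Hypothetical 2,000 sf home of Good quality.
combined = sqft_by_quality(2000, "Good")
print(combined)
```

Each combined variable then receives its own per-square-foot coefficient, so a larger Good Quality home gets a larger quality adjustment than a smaller one.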
34Regression Results
Model 7
Our Adj. R2 went up from .874 to .879.
Notice the adjustments went from fixed
dollar amounts to per-square-foot amounts.
35Advanced Transformations
Exponential transformations raise a variable to a
power: Land Size ^ .75 = LAND75. This reflects the
principle of diminishing returns. The unit price
of land tends to decrease as size increases.
Without this transformation, every square foot of
land would get the same adjustment, regardless of
lot size. Raising land size to the power of .75
reflects this curve of diminishing returns.
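A short sketch of the LAND75 transformation and the diminishing returns it produces. The lot sizes are hypothetical.

```python
def land75(land_sf):
    """Raise land size to the power of .75."""
    return land_sf ** 0.75

small = land75(10_000)   # roughly 1,000 transformed units
large = land75(20_000)   # less than double the small lot's figure

# Transformed units per square foot fall as the lot gets bigger.
per_sf_small = small / 10_000
per_sf_large = large / 20_000
print(round(small, 1), round(large, 1), per_sf_small > per_sf_large)
```

Doubling the lot size raises the transformed variable by a factor of about 1.68 rather than 2, so the land adjustment grows more slowly than the lot.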
36Running Regression
Let's add our new transformed land variable to
the model.
37Regression Results
Model 8
Our Adj. R2 went up from .879 to .881.
38Running Regression
Let's add garages, pools, and baths just to round
out our model.
39Regression Results
Model 9
Our Adj. R2 went up from .881 to .895.
40Regression Results
The Beta value in column 4 indicates the
partial correlation of the variable. It is used
in stepwise regression in deciding which
variable to add next.
41Regression Results
The significance of each variable to the model
can be determined by looking at the t values.
Rule of Thumb: t scores should be 2.0 or greater
in absolute value.
NB211002, NB211003, and NB211006 are insignificant.
42Regression Results
The t-statistic is calculated by dividing the
coefficient of a variable by its standard error.
For example, for the variable BLDSIZE, the
t-statistic is calculated as follows: 58.537 /
1.045 = 56.0
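A brief sketch of the t-statistic calculation and the 2.0 rule of thumb, using the BLDSIZE figures from the slide. Comparing the absolute value is an assumption made here so that negative coefficients are handled; the second coefficient and standard error are hypothetical.

```python
def t_statistic(coefficient, standard_error):
    """Coefficient divided by its standard error."""
    return coefficient / standard_error

def is_significant(t, threshold=2.0):
    """Rule of thumb: keep variables whose |t| is at least the threshold."""
    return abs(t) >= threshold

t_bldsize = t_statistic(58.537, 1.045)   # figures from the slide
t_weak = t_statistic(1_200.0, 1_000.0)   # hypothetical weak variable

print(round(t_bldsize, 1), is_significant(t_bldsize))
print(round(t_weak, 1), is_significant(t_weak))
```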
43Regression Results
The Standard Error of the Estimate in the
regression model tells us how much a sale
estimate typically varies from the actual sale
price. This number alone is meaningless unless
related to the average sales price in the sale
sample. Dividing the Standard Error by the Average
Sales Price produces the Coefficient of Variation
(COV): $15,854 / $134,043 = 11.82% COV
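The COV arithmetic as a small sketch, using the two figures from the slide. The function name is illustrative.

```python
def coefficient_of_variation(standard_error, average_sale_price):
    """Standard error of the estimate as a percentage of the average price."""
    return 100 * standard_error / average_sale_price

cov = coefficient_of_variation(15_854, 134_043)
print(f"{cov:.1f}%")  # about 11.8 percent
```

Expressing the error as a percentage makes models built on samples with different price levels comparable.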
44Regression Options
Enter is the default regression method in most
statistical software programs. This method
includes all variables entered by the modeler.
Stepwise multiple regression automatically
eliminates redundant or insignificant variables.
Notice that Stepwise Regression kicked out the
neighborhoods that had low t-scores.
45Creating New Assessments
Once you have calibrated your model, the
Regression software allows you to predict the
new values (or assessments) using the
coefficients (or adjustments) you created.
46Reviewing Ratio Statistics
Once the new assessments are created using our
final model, we can review the accuracy of our
new values using traditional ratio statistics.
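The slides do not name the ratio statistics they reviewed. The sketch below assumes three measures commonly used in assessment ratio studies: the median ratio, the coefficient of dispersion (COD), and the price-related differential (PRD). The three sales are hypothetical.

```python
import statistics

def ratio_statistics(assessed, sale_prices):
    """Median ratio, COD, and PRD for paired assessments and sale prices."""
    ratios = [a / s for a, s in zip(assessed, sale_prices)]
    median_ratio = statistics.median(ratios)
    # COD: average absolute deviation from the median, as a percent of it.
    deviations = [abs(r - median_ratio) for r in ratios]
    cod = 100 * statistics.mean(deviations) / median_ratio
    # PRD: mean ratio divided by the sales-weighted mean ratio.
    weighted_mean = sum(assessed) / sum(sale_prices)
    prd = statistics.mean(ratios) / weighted_mean
    return {"median_ratio": median_ratio, "cod": cod, "prd": prd}

# Hypothetical predicted assessments and their sale prices.
assessed = [95_000, 100_000, 105_000]
sale_prices = [100_000, 100_000, 100_000]

stats = ratio_statistics(assessed, sale_prices)
print(stats)
```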
47Valuing the Population
Valuing the population requires transforming the
same variables you used in the model, then
applying the coefficients to those variables.
This can be done internally within some CAMA
systems, using Microsoft Excel or other
spreadsheet software, or within the regression
software. Valuing the population is one of the
most difficult aspects of regression modeling
because changes in the physical attributes of
any one parcel often require re-running the
entire model and re-calculating values.
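A minimal sketch of valuing a population: transform each parcel the same way the model variables were transformed, then apply the coefficients. The coefficients, variable names, and parcels are all hypothetical and are not taken from the final model in the slides.

```python
# Hypothetical coefficients, for illustration only.
COEFFICIENTS = {
    "CONSTANT": 10_000.0,
    "BLDSIZE": 60.0,
    "LAND75": 20.0,
    "AGE": -500.0,
    "Q_GOOD": 15_000.0,
}

def transform(parcel):
    """Build the model variables from a parcel's raw characteristics."""
    return {
        "CONSTANT": 1.0,
        "BLDSIZE": parcel["bldsize"],
        "LAND75": parcel["landsf"] ** 0.75,
        "AGE": parcel["age"],
        "Q_GOOD": 1.0 if parcel["quality"] == "Good" else 0.0,
    }

def value_parcel(parcel):
    """Sum of coefficient times transformed variable."""
    x = transform(parcel)
    return sum(COEFFICIENTS[name] * x[name] for name in COEFFICIENTS)

# Hypothetical population: two parcels differing only in quality.
population = [
    {"id": "A", "bldsize": 1500, "landsf": 10_000, "age": 20, "quality": "Good"},
    {"id": "B", "bldsize": 1500, "landsf": 10_000, "age": 20, "quality": "Average"},
]
values = {p["id"]: value_parcel(p) for p in population}
print(values)
```

Keeping the transformation in one function helps ensure the population is prepared exactly as the sale sample was.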
48Conclusion
Predicting assessments using Regression requires
the appraiser to
- Explore data to determine relationships and clean up outliers
- Specify which model and variables will be used
- Transform variables and run regression
- Review results, modify or add variables
- Create predicted assessments and review ratio statistics
- Value the population using final coefficients
49The End