Dummy Variables

About This Presentation

Title:

Dummy Variables

Slides: 73

Provided by: mois1

Learn more at: http://www.owlnet.rice.edu

Transcript and Presenter's Notes

Title: Dummy Variables

1
Dummy Variables

Some potential explanatory variables are
categorical and cannot be measured on a
quantitative scale.
However, we often need to use these variables
because they are related to the response
variable.
The trick is to create dummy variables, also
called indicator or 0-1 variables.
These are variables that indicate the category a
given observation is in.

2
Dummy Variables -- continued

To create dummy variables we can use an IF
statement or we can use StatPro's Dummy variable
procedure.
The Dummy variable procedure is usually easier,
particularly when there are multiple categories.
Once the dummy variables are created, we can
combine the variables if we like by simply adding
the columns to get the dummy for the new
category.
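
As a rough illustration outside Excel, the same idea can be sketched in Python. The education levels, the helper make_dummies, and the column name ed_2or3 are all hypothetical; this is not StatPro's procedure, only a minimal equivalent of the IF-statement approach.

```python
# Hypothetical education levels for six employees (not from BANK.XLS).
educ_lev = [1, 3, 2, 3, 1, 2]


def make_dummies(values, prefix):
    """Return {column_name: list of 0/1} with one column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


dummies = make_dummies(educ_lev, "Ed")

# Combine two categories by adding their columns, as described above:
# ed_2or3 is 1 when the employee is in category 2 or category 3.
ed_2or3 = [a + b for a, b in zip(dummies["Ed_2"], dummies["Ed_3"])]

print(dummies)
print(ed_2or3)
```

Adding columns works because each observation belongs to exactly one category, so the sum can never exceed 1.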

3
Regression Analysis

In this example we create dummy variables for
Gender and EducLev.
Then we can run a regression analysis with Salary
as the response variable, using any combination
of numerical and dummy explanatory variables.
We must follow two rules.
First, we shouldn't use any of the original
categorical variables that the dummies are based
on.
Second, we should use one fewer dummy than the
number of categories for any categorical variable.

4
Regression Analysis -- continued

This second rule is a technical one. If we
violate it, the software will give us an error
message.
For example, of the six dummies Ed_1-Ed_6, any
five can be used. The omitted dummy then
corresponds to the reference category.
As we will see, the dummy variable coefficients
are all interpreted relative to this reference
category.
To get used to dummy variables in regression
analysis we will proceed in several stages.
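
The reason behind the second rule can be seen in a small sketch. The category labels below are hypothetical. If every dummy is kept, the dummies in each row add up to 1, which duplicates the intercept column, so the coefficients cannot be estimated uniquely.

```python
# Hypothetical category labels (1 to 3) for six observations.
category = [1, 2, 3, 1, 2, 3]
levels = [1, 2, 3]

# One 0/1 dummy per category, stored row by row.
all_dummies = [[1 if c == k else 0 for k in levels] for c in category]

# Every row of dummies sums to 1, which is exactly the intercept column,
# so an intercept plus all the dummies is perfectly collinear.
row_sums = [sum(row) for row in all_dummies]

# Dropping one dummy (here the first, so category 1 is the reference)
# means the remaining columns no longer add up to the intercept column.
kept_dummies = [row[1:] for row in all_dummies]
kept_sums = [sum(row) for row in kept_dummies]

print(row_sums)   # [1, 1, 1, 1, 1, 1]
print(kept_sums)  # [0, 1, 1, 0, 1, 1]
```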

5
Regression Analysis -- continued

We first estimate a regression equation with only
one explanatory variable, Female. The output is
shown in this table.
The resulting equation is
Predicted Salary = 45.505 - 8.296 Female

6
Regression Analysis -- continued

To interpret this equation recall that Female has
only two possible values, 0 and 1. If we
substitute 1 the predicted salary equals
37.209, and if we substitute 0 the predicted
salary is 45.505.
These are the average salaries of females and
males, respectively. Therefore the interpretation
of the -8.296 coefficient of the Female dummy
variable is straightforward: it is the difference
between the average female and male salaries.
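
The claim that a regression on a single dummy reproduces the two group averages can be checked with a short sketch. The seven salaries below are hypothetical (they are not the BANK.XLS data), and the least squares formulas are written out by hand rather than taken from any statistics package.

```python
from statistics import mean

# Hypothetical salaries in $1000s (not the BANK.XLS data).
female = [0, 0, 0, 1, 1, 1, 1]
salary = [44.0, 47.5, 45.0, 36.0, 38.5, 37.0, 37.5]

# Simple least squares: slope = Sxy / Sxx, intercept = ybar - slope * xbar.
x_bar, y_bar = mean(female), mean(salary)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary))
s_xx = sum((x - x_bar) ** 2 for x in female)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

male_mean = mean([y for x, y in zip(female, salary) if x == 0])
female_mean = mean([y for x, y in zip(female, salary) if x == 1])

print(intercept, male_mean)            # intercept equals the male average
print(intercept + slope, female_mean)  # intercept + slope equals the female average
```

In the same way, the two predicted salaries on the slide, 45.505 and 37.209, differ by exactly the dummy coefficient, 8.296.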

7
Regression Analysis -- continued

The above equation tells only part of the story;
it ignores all information except for gender.
We expand this equation by adding the experience
variables. The output is shown in this table.

8
Regression Analysis -- continued

The corresponding equation is
Predicted Salary = 35.492 + 0.988 YrsExper
+ 0.131 YrsPrior - 8.080 Female
It is useful to write two separate equations, one
for females and one for males.
Females: Predicted Salary = 27.412
+ 0.988 YrsExper + 0.131 YrsPrior
Males: Predicted Salary = 35.492
+ 0.988 YrsExper + 0.131 YrsPrior
We interpret the coefficient -8.080 of the Female
dummy variable as the average salary disadvantage
for females relative to males after controlling
for job experience. But there is still more story
to tell.

9
Regression Analysis -- continued

We next add job grade to the equation by
including five of the six job grade dummies.
Although any five can be used, we use Job_2-Job_6.
The resulting output is shown in this table.

10
Regression Analysis -- continued

The estimated regression equation is now
Predicted Salary = 30.230 + 0.408 YrsExper
+ 0.149 YrsPrior - 1.962 Female + 2.57 Job_2
+ 6.295 Job_3 + 10.475 Job_4 + 16.011 Job_5
+ 27.647 Job_6
There are now two categorical variables involved,
gender and job grade.
However, we can still write a separate equation
for any combination of categories by setting the
dummies to the appropriate values.

11
Regression Analysis -- continued

For example, the equation for females at the
fifth job grade is found by setting Female = 1 and
Job_5 = 1 and setting the other job dummies equal
to 0. The resulting equation is
Predicted Salary = 44.279 + 0.408 YrsExper
+ 0.149 YrsPrior
We interpret this equation as follows.
For either gender and any job grade, the expected
increase in salary for one extra year of
experience with Fifth National is $408; the
expected salary increase for one extra year of
experience with another bank is $149. (Salary is
measured in thousands of dollars, so a
coefficient of 0.408 corresponds to $408.)
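
Setting the dummies to the appropriate values is simple arithmetic, sketched below with the coefficients reported on the previous slide. The function names group_intercept and predict_salary are illustrative only, and the sample employee in the last line is hypothetical.

```python
# Coefficients as reported on the slides (Salary in $1000s).
COEF = {
    "Intercept": 30.230, "YrsExper": 0.408, "YrsPrior": 0.149,
    "Female": -1.962, "Job_2": 2.57, "Job_3": 6.295, "Job_4": 10.475,
    "Job_5": 16.011, "Job_6": 27.647,
}


def group_intercept(female, job_grade):
    """Intercept of the salary equation for one gender/job-grade group."""
    value = COEF["Intercept"] + COEF["Female"] * female
    if job_grade != 1:  # grade 1 is the omitted reference category
        value += COEF[f"Job_{job_grade}"]
    return value


def predict_salary(yrs_exper, yrs_prior, female, job_grade):
    """Predicted salary in $1000s for one employee."""
    return (group_intercept(female, job_grade)
            + COEF["YrsExper"] * yrs_exper
            + COEF["YrsPrior"] * yrs_prior)


# Females at the fifth job grade: 30.230 - 1.962 + 16.011
print(round(group_intercept(female=1, job_grade=5), 3))
# Hypothetical employee: 10 years at the bank, 2 years elsewhere.
print(round(predict_salary(10, 2, female=1, job_grade=5), 3))
```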

12
Regression Analysis -- continued

The coefficients of the job dummies indicate the
average increase in salary an employee can expect
relative to the reference (lowest) job grade.
The key coefficient, the -1.962 for Female,
indicates an average salary disadvantage of
$1,962 for females relative to males, given that
they have the same experience levels and are in
the same job grade.
Although the penalty is still substantial, it
is less than a fourth of the $8,080 penalty we
saw before.
It appears that females might be getting paid
less on average partly because they are in the
lower job categories.

13
Regression Analysis -- continued

We can check whether females are
disproportionately in the lower job categories by
using a pivot table with JobGrade in the row
area, Gender in the column area and the count
(expressed as a percentage) of any variable in
the data area.
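
A pivot table of this kind is just a cross-tabulation. The sketch below uses a handful of hypothetical (JobGrade, Gender) records, not the bank's data, and assumes that the percentages are taken within each gender column.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) pairs, not the BANK.XLS records.
records = [(1, "Female"), (1, "Female"), (1, "Male"), (2, "Female"),
           (2, "Male"), (3, "Male"), (3, "Male"), (3, "Female")]

counts = Counter(records)
gender_totals = Counter(gender for _, gender in records)

# Percentage of each gender falling in each job grade (column percentages).
pct = {
    (grade, gender): 100 * n / gender_totals[gender]
    for (grade, gender), n in counts.items()
}

for grade in sorted({g for g, _ in records}):
    row = {gender: round(pct.get((grade, gender), 0.0), 1)
           for gender in ("Female", "Male")}
    print(grade, row)
```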

14
Regression Analysis -- continued

Clearly, females tend to be concentrated at the
lower job grades.
This certainly helps to explain why females get
lower salaries on average, but it doesn't explain
why females are at the lower job grades in the
first place.
We won't be able to provide a thorough analysis
of this issue, but we can add one more piece to
the puzzle now by adding education level, age,
and PCJob to the equation.

15
Regression Analysis -- continued

We don't provide the whole equation, but the
resulting output is shown here.

16
Regression Analysis -- continued

The coefficients can be seen in the output.
The new variables don't appear to add much to the
previous equation. The penalty for females does,
however, go up to $2,555, which is slightly
greater than the $1,962 we saw before.
At face value we can interpret the coefficients
of the education dummies as a benefit (or loss if
negative) of extra education relative to a high
school diploma, the reference category.

17
Regression Analysis -- continued

The coefficient of PCJob implies that an employee
with a computer-related job can expect an extra
$4,923 in salary relative to an employee without a
computer-related job, provided the other
variables are the same for each employee.
The age coefficient is quite small and has little
effect on salary.

18
Conclusion

The main conclusion we can draw from the output
is that there is still a plausible case to be
made for discrimination against females, even
after including information on all the variables
in the database in the regression equation.

Modeling Possibilities

20
BANK.XLS

The Fifth National Bank of Springfield is facing
a gender-discrimination suit. The charge is that
its female employees receive substantially
smaller salaries than its male employees.
The bank's employee database is listed in this
file. Here is a partial list of the data.

21
Question

Earlier we estimated an equation for Salary using
the numerical explanatory variables YrsExper and
YrsPrior and the dummy variable Female.
If we drop the YrsPrior variable from the
equation (for simplicity) and rerun the
regression, we obtain the equation
Predicted Salary = 35.824 + 0.981 YrsExper
- 8.012 Female
The R2 value for this equation is 49.1%. If we
decide to include an interaction variable between
YrsExper and Female in this equation, what is the
effect?

22
Interaction Terms

Algebraically, an interaction variable is the
product of two variables. Its effect is to allow
the effect of one of the variables on Y to depend
on the value of the other variable.
The interaction term allows the slope of the
regression line to differ between the two
categories.

23
Solution

We first need to form an interaction variable
that is the product of YrsExper and Female.
This can be done in two ways in Excel:
we can do it manually by introducing a new
variable that contains the product of the two
variables involved, or
we can use the StatPro/Data Utilities/Create
Interaction Variable menu item.
Using the latter way, we select Female and
YrsExper as the variables, and we do not check
either of the boxes in the dialog box, because
neither should be treated as a categorical
variable.

24
Solution -- continued

Once the interaction variable has been created,
we include it in the regression equation in
addition to the other variables. The multiple
regression output is shown here.

25
Solution -- continued

The estimated regression equation is
Predicted Salary = 30.430 + 1.528 YrsExper
+ 4.098 Female - 1.248 YrsExper_Female
As we discussed before, it is useful to write this
equation as two separate equations, one for
females and one for males. The female equation
is
Predicted Salary = 34.528 + 0.280 YrsExper
and the male equation is
Predicted Salary = 30.430 + 1.528 YrsExper
Next we can show these equations graphically.
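
The link between the two separate lines and the dummy and interaction coefficients can be made explicit with a little arithmetic. The sketch below starts from the female and male equations above. The crossover point is not stated on the slides; it is simply where the two fitted lines intersect.

```python
# (intercept, slope in YrsExper) for each group, as given above.
female_line = (34.528, 0.280)
male_line = (30.430, 1.528)

# The Female coefficient is the difference in intercepts, and the
# interaction coefficient is the difference in slopes.
implied_female_coef = female_line[0] - male_line[0]
implied_interaction_coef = female_line[1] - male_line[1]

# Years of experience at which the two fitted lines cross.
crossover_years = implied_female_coef / -implied_interaction_coef

print(round(implied_female_coef, 3))       # 4.098
print(round(implied_interaction_coef, 3))  # -1.248
print(round(crossover_years, 2))           # 3.28
```

Under these fitted lines the predicted female salary is higher only for roughly the first three years of experience.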

26
Nonparallel Female and Male Salary Lines
27
Solution -- continued

The Y-intercept for the female line is slightly
higher - females with no experience at Fifth
National Bank tend to start out slightly higher
than males - but the slope of the female line is
much lower. That is, males tend to move up the
salary ladder much more quickly than females.
Again, this provides another argument, although a
somewhat different one, for gender discrimination
against females.
The R2 value increased from 49.1% to 63.9%. The
interaction variable has definitely added to the
explanatory power of the equation.

Modeling Possibilities

29
BANK.XLS

The Fifth National Bank of Springfield is facing
a gender-discrimination suit. The charge is that
its female employees receive substantially
smaller salaries than its male employees.
The bank's employee database is listed in this
file. Here is a partial list of the data.

30
Question

A glance at the distribution of salaries of the
208 employees shows some skewness to the right -
a few employees make substantially more than the
majority of employees.
Therefore, it might make sense to use the natural
logarithm of Salary instead of Salary as the
response variable.
If we do this, how do we interpret the results?

31
Solution

All of the analyses we did earlier with this data
set could be repeated except with Log_Salary as
the response variable.
For the sake of discussion we will look only at
the regression equation with Female and YrsExper
as explanatory variables.
After we create the Log_Salary variable and run
the regression, we obtain the output shown here.

32
Regression Output with Log_Salary as Response
Variable
33
Solution

The estimated regression equation is
Predicted Log_Salary = 3.5829 + 0.0188 YrsExper
- 0.1616 Female
The R2 and se values are 42.4% and 0.1794. For
comparison, with Salary as the response variable
these were 49.1% and 8.070.
We first note that neither of these values is
directly comparable to the Salary values.
The two R2 values are percentages explained of
different response variables, Log_Salary and
Salary. The fact that one is smaller does not
mean a worse fit. They simply aren't comparable.

34
Solution -- continued

The situation for se is even worse. Each se is a
measure of a typical residual, but the residuals
in the Log_Salary equation are in log dollars,
whereas the residuals in the Salary equation are
in dollars.
Therefore it is no surprise that the se for the
Log_Salary equation is much smaller than the se
for the Salary equation.
If we want comparable standard error measures for
the two equations, we should take antilogs of the
fitted values from the Log_Salary equation to
convert them back to dollars, subtract these from
the original Salary values, and take the standard
deviation of these residuals.
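
The three steps just described (antilog, subtract, standard deviation) can be sketched as follows. The salaries and fitted log values are hypothetical, and statistics.stdev divides by n - 1, which is an assumption about the intended divisor rather than something the slides specify.

```python
import math
from statistics import stdev

# Hypothetical values (Salary in $1000s); not the BANK.XLS data.
salary = [32.0, 41.5, 38.0, 55.0, 47.2]
fitted_log_salary = [3.50, 3.70, 3.66, 3.95, 3.88]

# Step 1: take antilogs to convert fitted values back to dollars.
fitted_salary = [math.exp(v) for v in fitted_log_salary]

# Step 2: subtract from the original Salary values.
residuals = [y - f for y, f in zip(salary, fitted_salary)]

# Step 3: standard deviation of these dollar-scale residuals.
se_dollars = stdev(residuals)

print([round(r, 2) for r in residuals])
print(round(se_dollars, 2))
```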

35
Solution -- continued

The resulting standard deviation is 7.74. This
is somewhat smaller than the se from the Salary
equation, an indication of a slightly better fit.
Finally we interpret the equation itself.
When the response variable is Log_Y and a term on
the right hand side of the equation is of the
form bX, then whenever X increases by one unit
Y-hat changes by a constant percentage, and this
percentage is approximately equal to b (written
as a percentage).

36
Solution -- continued

This means that for each additional year of
experience with Fifth National, an employee's
salary can be expected to increase by about 1.88%.
The expected percentage decrease in salary for
females is about 16.16%.
In other words, this equation implies that females
can expect to make about 16% less than men for
comparable years of experience.
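
Reading a coefficient b directly as a percentage is an approximation that works best when b is small. The exact percentage change implied by a log-response equation is 100(e^b - 1), which the sketch below evaluates for the two coefficients above. The function name is illustrative.

```python
import math


def exact_pct_change(b):
    """Exact % change in Y for a one-unit increase in X when ln(Y) is the response."""
    return 100 * (math.exp(b) - 1)


for name, b in [("YrsExper", 0.0188), ("Female", -0.1616)]:
    print(name, "approx:", round(100 * b, 2), "exact:", round(exact_pct_change(b), 2))
```

For YrsExper the two readings nearly coincide (about 1.9%), while for Female the exact figure is closer to 15% than to 16%.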

Modeling Possibilities

38
POWER.XLS

The Public Service Electric Company produces
different quantities of electricity each month,
depending on the demand.
This file lists the number of units of
electricity produced (Units) and the total cost
of producing these (Cost) for a 36-month period.
The data set appears on the next slide.
How can regression be used to analyze the
relationship between Cost and Units?

39
Data for Electric Power
40
Solution

A good place to start is with a scatterplot of
Cost versus Units.

41
Solution -- continued

The scatterplot indicates a definite positive
relationship and one that is nearly linear.
However, there is also some evidence of curvature
in the plot. The points increase slightly less
rapidly as Units increase from left to right.
In economic terms, there may be economies of
scale, where the marginal cost of electricity
decreases as more units of electricity are
produced.
Nevertheless, we use regression to estimate a
linear relationship between Cost and Units.

42
Solution -- continued

The resulting regression equation is
Predicted Cost = 23,651 + 30.53 Units
The corresponding R2 and se are 73.6% and 2734.
We also requested a scatterplot of the residuals
versus the fitted values. The scatterplot is on
the next slide. Obtaining this scatterplot is
always a good idea if nonlinearity is suspected.
The sign of nonlinearity in this plot is that the
residuals to the far left and the far right are
all negative, whereas the majority of the
residuals in the middle are positive.

43
Residuals from a Straight-Line Fit
44
Solution -- continued

Admittedly the pattern is far from perfect -
there are a few negatives in the middle - but the
plot does hint at nonlinear behavior.
The negative-positive-negative behavior of the
residuals suggests a parabola, that is, a
quadratic equation with the square of Units
included in the equation.
We first create a new variable Sqr_Units in the
data set. This can be done manually or using
StatPro's Transform Variables menu item.

45
Solution -- continued

Then we use multiple regression to estimate the
equation for Cost with both explanatory
variables, Units and Sqr_Units, included.
The resulting equation from the output on the
next slide is
Predicted Cost = 5793 + 98.3 Units
- 0.0600 Sqr_Units
Note that R2 has increased to 82.2% and se has
decreased to 2281.

46
Regression Output with Squared Term Included
47
Solution -- continued

One way to see how this regression equation fits
the scatterplot of Cost versus Units is to use
Excel's trendline option.
To do so, activate the scatterplot, click on any
point and use the Chart/Add Trendline menu item,
click the Type tab and select the Polynomial type
of order 2, that is, a quadratic.
A graph of the equation is superimposed on the
scatterplot on the following slide. It shows
reasonably good fit, plus an obvious curvature.

48
Quadratic Fit Scatterplot
49
Solution -- continued

The main downside to a quadratic regression
equation is that there is no easy interpretation
of the coefficients of Units and Sqr_Units.
All we can say is that the terms in the equation
combine to explain the nonlinear relationship
between units produced and total cost.
A final note about the equation concerns the
coefficient of Sqr_Units.
First, the fact that it is negative makes the
parabola bend downward. This produces the
decreasing marginal cost behavior, where every
extra unit of electricity incurs a smaller cost.
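
The decreasing marginal cost can be read off the quadratic equation by differentiating it with respect to Units, which gives 98.3 - 2(0.0600)Units. The sketch below evaluates this at a few hypothetical production levels. The result is only meaningful within the range of Units in the data, which the slides do not state.

```python
# Coefficients of Units and Sqr_Units from the quadratic equation above.
B_UNITS, B_SQR_UNITS = 98.3, -0.0600


def marginal_cost(units):
    """Derivative of predicted Cost with respect to Units."""
    return B_UNITS + 2 * B_SQR_UNITS * units


# Hypothetical production levels, chosen only for illustration.
for units in (300, 500, 700):
    print(units, round(marginal_cost(units), 2))
```

Each extra unit adds less to predicted cost as Units grows, which is the downward bend of the parabola.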

50
Solution -- continued

Second, we shouldn't be fooled by the small
magnitude of this coefficient. Remember that it
is the coefficient of Units squared, which is a
large quantity. Therefore, the effect of the
product -0.0600 Sqr_Units is sizable.
One other possibility we might examine is a
logarithmic fit.
In this case we create a new variable Log_Units,
the natural logarithm of Units, and then regress
Cost against the single variable Log_Units.

51
Solution -- continued

To create the new variable we can again use
StatPro's Transform Variables menu item, and then
we can superimpose a logarithmic curve on the
scatterplot of Cost versus Units by using the
trendline feature.
This curve appears in the scatterplot on the next
slide.
To the naked eye, it appears to be similar, and
about as good a fit as the quadratic curve.

52
Logarithmic Fit Scatterplot
53
Solution -- continued

The resulting regression equation is
Predicted Cost = -63,993 + 16,654 Log_Units
The values of R2 and se are 79.8% and 2393.
These latter values indicate that the logarithmic
fit is not quite as good as the quadratic fit.
However, the advantage of the logarithmic
equation is that it is easier to interpret.

54
Solution -- continued

In this case, where the log of the explanatory
variable is used, we can interpret its
coefficient as follows.
Suppose Units increases by 1%, for example from
600 to 606. Then the equation implies that the
expected Cost will increase by approximately
166.54, that is, 16,654 times 0.01.
In words, every 1% increase in Units is
accompanied by an expected 166.54 increase in
Cost.
Note that for larger values of Units, a 1%
increase represents a larger absolute increase.
But each such 1% increase entails the same
increase in Cost. This is another way of
describing the decreasing marginal cost property.
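
This interpretation can be verified directly from the logarithmic equation. The sketch below computes the change in predicted Cost for two different 1% increases in Units. The starting levels 600 and 800 are only examples.

```python
import math

INTERCEPT, B_LOG_UNITS = -63993, 16654  # from the logarithmic equation above


def predicted_cost(units):
    """Predicted Cost from the equation with Log_Units = ln(Units)."""
    return INTERCEPT + B_LOG_UNITS * math.log(units)


change_at_600 = predicted_cost(606) - predicted_cost(600)
change_at_800 = predicted_cost(808) - predicted_cost(800)

print(round(change_at_600, 2))
print(round(change_at_800, 2))
```

Both changes equal 16,654 ln(1.01), about 165.7, which is where the rounder figure 16,654/100 = 166.54 comes from.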

Modeling Possibilities

56
CARDEMAND.XLS

This file contains annual data (1970-1987) on
domestic auto sales in the United States. The
data set is shown here on the next slide.
The variables are defined as follows.
Quantity: annual domestic auto sales (in number
of units)
Price: real price index of new cars
Income: real disposable income
Interest: prime rate of interest
Estimate and interpret a multiplicative (constant
elasticity) relationship between Quantity and
Price, Income and Interest.

57
Car Demand Data
58
Constant Elasticity Relationships

A particular type of nonlinear relationship that
has firm grounding in economic theory is called a
constant elasticity relationship. It is also
called a multiplicative relationship.
One property of this type of relationship is that
the effect of a change in any explanatory
variable Xi on Y depends on the levels of the
other Xs in the equation.

59
Solution

We first take the natural logs of all four
variables.
This can be done in one step using the Transform
Variables menu item, or we can use Excel's LN
function.
We then use multiple regression, with
Log_Quantity as the response variable and
Log_Price, Log_Income, and Log_Interest as the
explanatory variables.
The resulting output is shown on the next slide,
and the corresponding equation is
Predicted Log_Quantity = 4.675 - 1.185 Log_Price
+ 2.183 Log_Income - 0.191 Log_Interest

60
Regression Output for Multiplicative Relationship
61
Solution -- continued

If we like, we can convert this back to the
original variables, that is, back to
multiplicative form, by taking antilogs. The
result is
Predicted Quantity = 107.198 * Price^(-1.185)
* Income^(2.183) * Interest^(-0.191)
In either form the equation implies that the
elasticities are approximately equal to -1.185,
2.183 and -0.191.
When Price increases by 1%, Quantity tends to
decrease by about 1.185%; when Income increases
by 1%, Quantity tends to increase by about
2.183%; and when Interest increases by 1%,
Quantity tends to decrease by about 0.191%.
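
The antilog step and the elasticity reading can both be checked numerically. In the sketch below the coefficients come from the log equation, while the function names and the sample levels of Price, Income and Interest are hypothetical.

```python
import math

CONSTANT = math.exp(4.675)  # antilog of the intercept, roughly 107.2
ELASTICITY = {"Price": -1.185, "Income": 2.183, "Interest": -0.191}


def predicted_quantity(price, income, interest):
    """Multiplicative (constant elasticity) form of the fitted equation."""
    return (CONSTANT
            * price ** ELASTICITY["Price"]
            * income ** ELASTICITY["Income"]
            * interest ** ELASTICITY["Interest"])


def pct_change_for_1pct(variable):
    """Exact % change in Quantity when one variable rises by 1%."""
    return 100 * (1.01 ** ELASTICITY[variable] - 1)


for name in ELASTICITY:
    print(name, round(pct_change_for_1pct(name), 3))

# The percentage effect is the same at any levels of the other variables.
base = predicted_quantity(100.0, 500.0, 8.0)
bumped = predicted_quantity(101.0, 500.0, 8.0)
print(round(100 * (bumped / base - 1), 3))
```

The exact percentage changes are very close to the elasticities themselves, which is why the coefficients can be read as approximate percentage effects.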

62
Conclusions

Does this multiplicative equation provide a
better fit to the automobile data than does an
additive relationship?
Without doing considerably more work it is
difficult to answer this question with
certainty.
As we discussed previously, it is not sufficient
to compare R2 and se values for the two fits.
We will simply state that the multiplicative
relationship provides a reasonably good fit, and
it makes sense economically.

Modeling Possibilities

64
LEARNING.XLS

The Presario Company produces a variety of small
industrial products.
It has just finished producing 22 batches of a
new product (new to Presario) for a customer.
This file contains the times (in hours) to
produce each batch. These data are in the table
on the next slide.
Clearly, the times have tended to decrease as
Presario has gained more experience in making the
product.

65
Data for Learning Curve

Does the multiplicative learning model apply to
these data, and what does it imply about the
learning rate?

66
Learning Curve Model

A final example of a multiplicative relationship
is the learning curve model.
A learning curve relates the unit production time
(or cost) to the cumulative volume of output
since that production process first began.
Empirical studies indicate that production times
tend to decrease by a relatively constant
percentage every time cumulative output doubles.
The constant percentage is called the learning
rate.

67
Solution

One way to check whether the multiplicative
learning model is reasonable is to create the log
variables Log_Time and Log_Batch in the usual way
and then see whether a scatterplot of Log_Time
versus Log_Batch is approximately linear.
The multiplicative model implies that it should
be.
Such a scatterplot is shown on the next slide,
along with a superimposed linear trend line. The
fit appears to be quite good.

68
Scatterplot of Log Variables with Linear Trend
Superimposed
69
Solution -- continued

To estimate the relationship, we regress Log_Time
on Log_Batch. The resulting equation is
Predicted Log_Time = 4.834 - 0.155 Log_Batch
There are a couple of ways of interpreting this
equation.
First, because it is based on a multiplicative
relationship, we can interpret the coefficient
-0.155 as an elasticity. That is, when Batch
increases by 1%, Time tends to decrease by
approximately 0.155%. Although this is correct, it
is not as useful as the doubling
interpretation.

70
Solution -- continued

We know that the estimated learning rate
satisfies
-0.155 = ln(learning rate)/ln(2)
Solving for the learning rate (multiplying
through by ln(2) and then taking antilogs), we
find that it is 0.898, or approximately 90%. In
other words, whenever cumulative production
doubles, the time to produce a batch decreases by
about 10%.
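
The learning rate calculation is a one-liner, since exp(b ln 2) is the same as 2 raised to the power b. The sketch below uses the slope -0.155 from the fitted equation. The variable names are illustrative.

```python
import math

SLOPE = -0.155  # coefficient of Log_Batch in the fitted equation

# Multiply through by ln(2), then take antilogs.
learning_rate = math.exp(SLOPE * math.log(2))  # same as 2 ** SLOPE
pct_decrease_per_doubling = 100 * (1 - learning_rate)

print(round(learning_rate, 3))
print(round(pct_decrease_per_doubling, 1))
```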

71
Predicting Future Production Times

Presario could use this regression equation to
predict future production times.
For example, suppose the customer places an order
for 15 more batches of the same product. We can
use the equation to predict the log of production
time for each batch, then take their antilogs and
sum them to obtain the total production time.
The calculations are shown in rows 26-42 of the
following table. The total predicted time to
finish is about 1115 hours.
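
The prediction steps can be sketched as below. The sketch assumes the 15 new batches are numbered 23 through 37, following the 22 already produced; that numbering is an assumption, not something read from the spreadsheet. The total comes out close to the 1115 hours quoted above.

```python
import math

INTERCEPT, SLOPE = 4.834, -0.155  # coefficients of the fitted equation


def predicted_time(batch):
    """Predicted hours for a given cumulative batch number."""
    log_time = INTERCEPT + SLOPE * math.log(batch)
    return math.exp(log_time)  # antilog converts back to hours


# Assumption: the 15 new batches are batches 23 through 37.
new_batches = range(23, 38)
total_hours = sum(predicted_time(b) for b in new_batches)

print(round(total_hours))
```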

72
Using the Learning Curve Model for Predictions
