Title: Dummy Variables
1Dummy Variables
- Some potential explanatory variables are categorical and cannot be
  measured on a quantitative scale.
- However, we often need to use these variables because they are related
  to the response variable.
- The trick is to create dummy variables, also called indicator or 0-1
  variables.
- These are variables that indicate the category a given observation is in.
2Dummy Variables -- continued
- To create dummy variables we can use an IF statement or we can use
  StatPro's Dummy variable procedure.
- The Dummy variable procedure is usually easier, particularly when there
  are multiple categories.
- Once the dummy variables are created, we can combine categories if we
  like by simply adding the corresponding dummy columns to get the dummy
  for the new, merged category.
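As a rough illustration of the two steps above, here is a minimal Python sketch (the slides themselves use Excel and StatPro). The records and the helper `make_dummies` are hypothetical; only the column names Gender and EducLev come from the example.

```python
# Hypothetical records; the column names mirror the bank example, the values are made up.
employees = [
    {"Gender": "Female", "EducLev": 1},
    {"Gender": "Male", "EducLev": 3},
    {"Gender": "Female", "EducLev": 5},
    {"Gender": "Male", "EducLev": 4},
]


def make_dummies(rows, column, prefix):
    """Return {dummy_name: [0/1, ...]} with one dummy per category found in `column`."""
    categories = sorted({row[column] for row in rows})
    return {
        f"{prefix}_{cat}": [1 if row[column] == cat else 0 for row in rows]
        for cat in categories
    }


# The IF-statement approach: one comparison per observation.
female = [1 if row["Gender"] == "Female" else 0 for row in employees]

educ = make_dummies(employees, "EducLev", "Ed")

# Combining categories: adding two dummy columns gives the dummy for the merged category.
ed_4_or_5 = [a + b for a, b in zip(educ["Ed_4"], educ["Ed_5"])]

print(female)
print(ed_4_or_5)
```

Adding columns works because each observation belongs to exactly one category, so the sum can only be 0 or 1.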
3Regression Analysis
- In this example we create dummy variables for Gender and EducLev.
- Then we can run a regression analysis with Salary as the response
  variable, using any combination of numerical and dummy explanatory
  variables.
- We must follow two rules:
  - We shouldn't use any of the original categorical variables that the
    dummies are based on.
  - We should use one less dummy than the number of categories for any
    categorical variable.
4Regression Analysis -- continued
- This second rule is a technical one. If we violate it, the software
  will give us an error message.
- For example, of the six education dummies Ed_1 through Ed_6, any five
  can be used. The omitted dummy then corresponds to the reference
  category.
- As we will see, the interpretations of the dummy variable coefficients
  are all relative to this reference category.
- To get used to dummy variables in regression analysis we will proceed
  in several stages.
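One way to see why the second rule is needed: if all the dummies for a categorical variable are included, they add up to a column of ones, which duplicates the intercept. The sketch below illustrates this with hypothetical education levels; it is not taken from the bank data.

```python
# Hypothetical education levels for eight employees (categories 1 through 6).
educ_lev = [1, 2, 3, 4, 5, 6, 3, 5]

# One dummy column per category: Ed_1 ... Ed_6.
dummies = {f"Ed_{k}": [1 if lev == k else 0 for lev in educ_lev] for k in range(1, 7)}

# Each observation falls in exactly one category, so the six dummies sum to 1 in every row.
row_sums = [sum(col[i] for col in dummies.values()) for i in range(len(educ_lev))]
print(row_sums)  # a column of ones, identical to the intercept column

# Dropping one dummy (here Ed_1, making level 1 the reference category) removes the redundancy.
kept = {name: col for name, col in dummies.items() if name != "Ed_1"}
print(sorted(kept))
```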
5Regression Analysis -- continued
- We first estimate a regression equation with only one explanatory
  variable, Female. The output is shown in this table. The resulting
  equation is
  Predicted Salary = 45.505 - 8.296 Female
6Regression Analysis -- continued
- To interpret this equation, recall that Female has only two possible
  values, 0 and 1. If we substitute 1, the predicted salary equals 37.209,
  and if we substitute 0, the predicted salary is 45.505.
- These are the average salaries of females and males, respectively.
  Therefore the interpretation of the -8.296 coefficient of the Female
  dummy variable is straightforward: it is the difference between the
  female and male averages.
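The sketch below checks this interpretation in two ways. First it evaluates the reported equation at Female = 0 and 1; then, using a small set of hypothetical salaries (not the bank data), it confirms that a least-squares fit on a single dummy reproduces the two group means.

```python
from statistics import mean

# Evaluate the reported equation for males (0) and females (1).
predicted = {f: 45.505 - 8.296 * f for f in (0, 1)}

# Hypothetical salaries in $1000s; not the bank data.
salary = [40.0, 50.0, 46.0, 35.0, 39.0]
female = [0, 0, 0, 1, 1]

# Least-squares slope and intercept for one explanatory variable.
x_bar, y_bar = mean(female), mean(salary)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary)) / sum(
    (x - x_bar) ** 2 for x in female
)
intercept = y_bar - slope * x_bar

male_mean = mean(y for x, y in zip(female, salary) if x == 0)
female_mean = mean(y for x, y in zip(female, salary) if x == 1)

# The intercept is the male mean; the slope is the female mean minus the male mean.
print(intercept, male_mean)
print(slope, female_mean - male_mean)
```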
7Regression Analysis -- continued
- The equation with Female alone tells only part of the story; it ignores
  all information except gender.
- We expand this equation by adding the experience variables. The output
  is shown in this table.
8Regression Analysis -- continued
- The corresponding equation is
  Predicted Salary = 35.492 + 0.988 YrsExper + 0.131 YrsPrior - 8.080 Female
- It is useful to write two separate equations, one for females and one
  for males:
  Females: Predicted Salary = 27.412 + 0.988 YrsExper + 0.131 YrsPrior
  Males:   Predicted Salary = 35.492 + 0.988 YrsExper + 0.131 YrsPrior
- We interpret the coefficient -8.080 of the Female dummy variable as the
  average salary disadvantage for females relative to males after
  controlling for job experience. But there is still more story to tell.
9Regression Analysis -- continued
- We next add job grade to the equation by including five of the six job
  grade dummies. Although any five can be used, we use Job_2 through
  Job_6. The resulting output is shown in this table.
10Regression Analysis -- continued
- The estimated regression equation is now
  Predicted Salary = 30.230 + 0.408 YrsExper + 0.149 YrsPrior
      - 1.962 Female + 2.57 Job_2 + 6.295 Job_3 + 10.475 Job_4
      + 16.011 Job_5 + 27.647 Job_6
- There are now two categorical variables involved, gender and job grade.
- However, we can still write a separate equation for any combination of
  categories by setting the dummies to the appropriate values.
11Regression Analysis -- continued
- For example, the equation for females at the fifth job grade is found
  by setting Female = 1 and Job_5 = 1 and setting the other job dummies
  equal to 0. The equation formed is
  Predicted Salary = 44.279 + 0.408 YrsExper + 0.149 YrsPrior
- We interpret this equation as follows:
  - For either gender and any job grade, the expected increase in salary
    for one extra year of experience with Fifth National is $408; the
    expected salary increase for one year of experience with another bank
    is $149.
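Setting the dummies to pick out one combination of categories can be sketched as a small function. The coefficients are the ones reported in the text (with salaries apparently in thousands of dollars); the function name and argument layout are illustrative assumptions.

```python
# Coefficients as reported in the text; Job_1 is the omitted reference grade.
COEF = {
    "Intercept": 30.230, "YrsExper": 0.408, "YrsPrior": 0.149, "Female": -1.962,
    "Job_2": 2.57, "Job_3": 6.295, "Job_4": 10.475, "Job_5": 16.011, "Job_6": 27.647,
}


def predicted_salary(yrs_exper, yrs_prior, female, job_grade):
    """Evaluate the fitted equation; `female` is 0 or 1, `job_grade` is 1..6."""
    value = COEF["Intercept"] + COEF["YrsExper"] * yrs_exper + COEF["YrsPrior"] * yrs_prior
    value += COEF["Female"] * female
    if job_grade != 1:  # the reference grade contributes nothing
        value += COEF[f"Job_{job_grade}"]
    return value


# Intercept of the equation for females at job grade 5.
print(round(predicted_salary(0, 0, female=1, job_grade=5), 3))  # 44.279

# One extra year with Fifth National adds the YrsExper coefficient, whatever the category.
gain = predicted_salary(11, 2, 0, 3) - predicted_salary(10, 2, 0, 3)
print(round(gain, 3))  # 0.408
```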
12Regression Analysis -- continued
- The coefficients of the job dummies indicate the average increase in
  salary an employee can expect relative to the reference (lowest) job
  grade.
- The key coefficient, -$1962 for females, indicates the average salary
  disadvantage for females relative to males, given that they have the
  same experience levels and are in the same job grade.
- Although the penalty is still substantial, it is less than a fourth of
  the penalty we saw before.
- It appears that females might be getting paid less on average partly
  because they are in the lower job categories.
13Regression Analysis -- continued
- We can check whether females are
disproportionately in the lower job categories by
using a pivot table with JobGrade in the row
area, Gender in the column area and the count
(expressed as a percentage) of any variable in
the data area.
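Outside Excel, the same cross-tabulation can be sketched with a counter. The records below are hypothetical, and expressing each count as a percentage of its gender column is an assumption about how the pivot table is set up.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) pairs; the real counts are in the bank data set.
records = [(1, "Female"), (1, "Female"), (1, "Male"), (2, "Female"),
           (5, "Male"), (6, "Male"), (6, "Male"), (3, "Female")]

counts = Counter(records)
gender_totals = Counter(gender for _, gender in records)

# Percentage of each gender found at each job grade (each gender column sums to 100).
table = {}
for grade in sorted({g for g, _ in records}):
    table[grade] = {
        gender: 100 * counts[(grade, gender)] / gender_totals[gender]
        for gender in sorted(gender_totals)
    }
    print(grade, table[grade])
```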
14Regression Analysis -- continued
- Clearly, females tend to be concentrated at the lower job grades.
- This certainly helps to explain why females get lower salaries on
  average, but it doesn't explain why females are at the lower job grades
  in the first place.
- We won't be able to provide a thorough analysis of this issue, but we
  can add one more piece to the puzzle now by adding education level,
  age, and PCJob to the equation.
15Regression Analysis -- continued
- We don't provide the whole equation, but the resulting output is shown
  here.
16Regression Analysis -- continued
- The coefficients can be seen in the output.
- The new variables don't appear to add much to the previous equation.
  The penalty for females does, however, go up to $2555, which is
  slightly greater than the $1962 found before.
- At face value we can interpret the coefficients of the education
  dummies as a benefit (or loss if negative) of extra education relative
  to a high school diploma, the reference category.
17Regression Analysis -- continued
- The coefficient of PCJob implies that an employee with a
  computer-related job can expect an extra $4923 in salary relative to an
  employee without a computer-related job, provided the other variables
  are the same for each employee.
- The age coefficient is quite small and has little effect on salary.
18Conclusion
- The main conclusion we can draw from the output
is that there is still a plausible case to be
made for discrimination against females, even
after including information on all the variables
in the database in the regression equation.
20BANK.XLS
- The Fifth National Bank of Springfield is facing a gender-discrimination
  suit. The charge is that its female employees receive substantially
  smaller salaries than its male employees.
- The bank's employee database is listed in this file. Here is a partial
  list of the data.
21Question
- Earlier we estimated an equation for Salary using the numerical
  explanatory variables YrsExper and YrsPrior and the dummy variable
  Female.
- If we drop the YrsPrior variable from the equation (for simplicity) and
  rerun the regression, we obtain the equation
  Predicted Salary = 35.824 + 0.981 YrsExper - 8.012 Female
- The R2 value for this equation is 49.1%. If we decide to include an
  interaction variable between YrsExper and Female in this equation, what
  is the effect?
22Interaction Terms
- Algebraically, an interaction variable is the product of two variables.
  Its effect is to allow the effect of one of the variables on Y to
  depend on the value of the other variable.
- When one of the two variables is a dummy, the interaction term allows
  the slope of the regression line to differ between the two categories.
23Solution
- We first need to form an interaction variable that is the product of
  YrsExper and Female.
- This can be done in two ways in Excel:
  - we can do it manually by introducing a new variable that contains the
    product of the two variables involved, or
  - we can use the StatPro/Data Utilities/Create Interaction Variable
    menu item.
- Using the latter way, we select Female and YrsExper as the variables,
  and we do not check either of the boxes in the dialog box, because
  neither should be treated as a categorical variable.
24Solution -- continued
- Once the interaction variable has been created,
we include it in the regression equation in
addition to the other variables. The multiple
regression output is shown here.
25Solution -- continued
- The estimated regression equation is
  Predicted Salary = 30.430 + 1.528 YrsExper + 4.098 Female
      - 1.248 YrsExper_Female
- As we discussed before, it is useful to write this equation as two
  separate equations, one for females and one for males. The female
  equation is
  Predicted Salary = 34.528 + 0.280 YrsExper
  and the male equation is
  Predicted Salary = 30.430 + 1.528 YrsExper
- Next we can show these equations graphically.
26Nonparallel Female and Male Salary Lines
27Solution -- continued
- The Y-intercept for the female line is slightly higher (females with no
  experience at Fifth National Bank tend to start out slightly higher
  than males), but the slope of the female line is much lower. That is,
  males tend to move up the salary ladder much more quickly than females.
- This provides another argument, although a somewhat different one, for
  gender discrimination against females.
- The R2 value increased from 49.1% to 63.9%. The interaction variable
  has definitely added to the explanatory power of the equation.
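The two nonparallel lines can be recovered from the single interaction equation, as in the sketch below. The Female coefficient is taken as 4.098 here, the value consistent with the female intercept of 34.528 quoted in the text; the crossover calculation is an added illustration, not something reported in the slides.

```python
# Coefficients from the fitted equation with the interaction term.
B0, B_EXPER, B_FEMALE, B_INTERACT = 30.430, 1.528, 4.098, -1.248


def predicted_salary(yrs_exper, female):
    """`female` is 0 or 1; the interaction variable is the product YrsExper * Female."""
    return B0 + B_EXPER * yrs_exper + B_FEMALE * female + B_INTERACT * yrs_exper * female


# Intercept and slope of each gender's line.
female_line = (B0 + B_FEMALE, B_EXPER + B_INTERACT)  # about (34.528, 0.280)
male_line = (B0, B_EXPER)                            # (30.430, 1.528)

# The lines cross where B_FEMALE + B_INTERACT * YrsExper = 0.
crossover_years = -B_FEMALE / B_INTERACT
print(female_line, male_line)
print(round(crossover_years, 2))  # a little over 3 years
```

Under these coefficients the female line starts higher but is overtaken by the male line after roughly three years of experience.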
30Question
- A glance at the distribution of salaries of the 208 employees shows
  some skewness to the right: a few employees make substantially more
  than the majority of employees.
- Therefore, it might make sense to use the natural logarithm of Salary
  instead of Salary as the response variable.
- If we do this, how do we interpret the results?
31Solution
- All of the analyses we did earlier with this data set could be
  repeated, except with Log_Salary as the response variable.
- For the sake of discussion we will look only at the regression equation
  with Female and YrsExper as explanatory variables.
- After we create the Log_Salary variable and run the regression, we
  obtain the output shown here.
32Regression Output with Log_Salary as Response
Variable
33Solution
- The estimated regression equation is
  Predicted Log_Salary = 3.5829 + 0.0188 YrsExper - 0.1616 Female
- The R2 and se values are 42.4% and 0.1794. For comparison, with Salary
  as the response these were 49.1% and 8.070.
- We first note that neither of these values is directly comparable to
  the corresponding Salary value.
- The two R2 values are percentages explained of different response
  variables, Log_Salary and Salary. The fact that one is smaller does not
  mean a worse fit. They simply aren't comparable.
34Solution -- continued
- The situation for se is even worse. Each se is a measure of a typical
  residual, but the residuals in the Log_Salary equation are in log
  dollars, whereas the residuals in the Salary equation are in dollars.
- Therefore it is no surprise that the se for the Log_Salary equation is
  much smaller than the se for the Salary equation.
- If we want comparable standard error measures for the two equations, we
  should take antilogs of the fitted values from the Log_Salary equation
  to convert them back to dollars, subtract these from the original
  Salary values, and take the standard deviation of these residuals.
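The back-transformation just described can be sketched as follows. The salaries and fitted log values are hypothetical placeholders, and `statistics.stdev` (which divides by n - 1) is used as a stand-in for whatever standard deviation formula the original analysis applied.

```python
import math
from statistics import stdev

# Hypothetical salaries and fitted values from a log-response equation.
salary = [32.0, 41.5, 38.0, 55.0, 47.2]
fitted_log_salary = [3.50, 3.70, 3.65, 3.95, 3.85]

# Taking antilogs converts each fitted log value back to the original units.
fitted_salary = [math.exp(v) for v in fitted_log_salary]

# Residuals in the original units, then their standard deviation.
residuals = [y - y_hat for y, y_hat in zip(salary, fitted_salary)]
comparable_se = stdev(residuals)
print(round(comparable_se, 3))
```

Because these residuals are in the same units as Salary, the result can be set beside the se of the Salary equation.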
35Solution -- continued
- The resulting standard deviation is 7.74. This is somewhat smaller than
  the se of 8.070 from the Salary equation, an indication of a slightly
  better fit.
- Finally we interpret the equation itself.
- When the response variable is Log_Y and a term on the right-hand side
  of the equation is of the form bX, then whenever X increases by one
  unit, Y-hat changes by a constant percentage, and this percentage is
  approximately equal to b (written as a percentage).
36Solution -- continued
- This means that for each year of experience with Fifth National, an
  employee's salary can be expected to increase by about 1.88%.
- For females, the expected percentage decrease in salary is about
  16.16%.
- In other words, this equation implies that females can expect to make
  about 16% less than males with comparable years of experience.
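Reading a coefficient b directly as a percentage is an approximation; the exact change is exp(b) - 1. The sketch below compares the two for the reported coefficients. The helper name is illustrative.

```python
import math

B_YRS_EXPER = 0.0188  # coefficient of YrsExper in the Log_Salary equation
B_FEMALE = -0.1616    # coefficient of Female


def exact_pct_change(b):
    """Exact percentage change in Y-hat when X rises by one unit and log(Y) is the response."""
    return 100 * (math.exp(b) - 1)


print(round(exact_pct_change(B_YRS_EXPER), 2))  # about 1.9, close to the approximate 1.88
print(round(exact_pct_change(B_FEMALE), 2))     # about -14.9, versus the approximate -16.16
```

The approximation is very close for small coefficients such as 0.0188 and less accurate for larger ones such as -0.1616.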
38POWER.XLS
- The Public Service Electric Company produces different quantities of
  electricity each month, depending on the demand.
- This file lists the number of units of electricity produced (Units) and
  the total cost of producing these (Cost) for a 36-month period.
- The data set appears on the next slide.
- How can regression be used to analyze the relationship between Cost and
  Units?
39Data for Electric Power
40Solution
- A good place to start is with a scatterplot of
Cost versus Units.
41Solution -- continued
- The scatterplot indicates a definite positive relationship and one that
  is nearly linear.
- However, there is also some evidence of curvature in the plot. The
  points increase slightly less rapidly as Units increases from left to
  right.
- In economic terms, there may be economies of scale, where the marginal
  cost of electricity decreases as more units of electricity are
  produced.
- Nevertheless, we use regression to estimate a linear relationship
  between Cost and Units.
42Solution -- continued
- The resulting regression equation is
  Predicted Cost = 23,651 + 30.53 Units
- The corresponding R2 and se are 73.6% and $2734. We also requested a
  scatterplot of the residuals versus the fitted values. The scatterplot
  is on the next slide. Obtaining this scatterplot is always a good idea
  if nonlinearity is suspected.
- The sign of nonlinearity in this plot is that the residuals to the far
  left and the far right are all negative, whereas the majority of the
  residuals in the middle are positive.
43Residuals from a Straight-Line Fit
44Solution -- continued
- Admittedly the pattern is far from perfect (there are a few negatives
  in the middle), but the plot does hint at nonlinear behavior.
- The negative-positive-negative behavior of the residuals suggests a
  parabola, that is, a quadratic equation with the square of Units
  included in the equation.
- We first create a new variable Sqr_Units in the data set. This can be
  done manually or using StatPro's Transform Variables menu item.
45Solution -- continued
- Then we use multiple regression to estimate the equation for Cost with
  both explanatory variables, Units and Sqr_Units, included.
- The resulting equation from the output on the next slide is
  Predicted Cost = 5793 + 98.3 Units - 0.0600 Sqr_Units
- Note that R2 has increased to 82.2% and se has decreased to $2281.
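To get a feel for what the fitted quadratic implies, the sketch below evaluates it, along with its slope (the marginal cost), at a few production levels. The levels 300, 500 and 700 are hypothetical choices, not values taken from the data set.

```python
def predicted_cost(units):
    """Fitted quadratic from the text: 5793 + 98.3*Units - 0.0600*Units^2."""
    return 5793 + 98.3 * units - 0.0600 * units ** 2


def marginal_cost(units):
    """Slope of the fitted curve at `units` (the derivative of the quadratic)."""
    return 98.3 - 2 * 0.0600 * units


for units in (300, 500, 700):  # hypothetical production levels
    print(units, round(predicted_cost(units)), round(marginal_cost(units), 1))
```

The slope falls as Units grows, which is the decreasing marginal cost behavior the negative squared-term coefficient produces.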
46Regression Output with Squared Term Included
47Solution -- continued
- One way to see how this regression equation fits the scatterplot of
  Cost versus Units is to use Excel's trendline option.
- To do so, activate the scatterplot, click on any point and use the
  Chart/Add Trendline menu item, click the Type tab and select the
  Polynomial type of order 2, that is, a quadratic.
- A graph of the equation is superimposed on the scatterplot on the
  following slide. It shows a reasonably good fit, plus an obvious
  curvature.
48Quadratic Fit Scatterplot
49Solution -- continued
- The main downside to a quadratic regression equation is that there is
  no easy interpretation of the coefficients of Units and Sqr_Units.
- All we can say is that the terms in the equation combine to explain the
  nonlinear relationship between units produced and total cost.
- A final note about the equation concerns the coefficient of Sqr_Units.
- First, the fact that it is negative makes the parabola bend downward.
  This produces the decreasing marginal cost behavior, where every extra
  unit of electricity incurs a smaller additional cost.
50Solution -- continued
- Second, we shouldn't be fooled by the small magnitude of this
  coefficient. Remember that it is the coefficient of Units squared,
  which is a large quantity. Therefore, the effect of the product
  -0.0600 Sqr_Units is sizable.
- One other possibility we might examine is a logarithmic fit.
- In this case we create a new variable Log_Units, the natural logarithm
  of Units, and then regress Cost against the single variable Log_Units.
51Solution -- continued
- To create the new variable we can again use StatPro's Transform
  Variables menu item, and then we can superimpose a logarithmic curve on
  the scatterplot of Cost versus Units by using the trendline feature.
- This curve appears in the scatterplot on the next slide.
- To the naked eye, it appears to be similar to the quadratic curve and
  about as good a fit.
52Logarithmic Fit Scatterplot
53Solution -- continued
- The resulting regression equation is
  Predicted Cost = -63,993 + 16,654 Log_Units
- The values of R2 and se are 79.8% and $2393.
- These latter values indicate that the logarithmic fit is not quite as
  good as the quadratic fit.
- However, the advantage of the logarithmic equation is that it is easier
  to interpret.
54Solution -- continued
- In this case, where the log of the explanatory variable is used, we can
  interpret its coefficient as follows.
- Suppose Units increases by 1%, for example from 600 to 606. Then the
  equation implies that the expected Cost will increase by approximately
  $166.54, which is the coefficient 16,654 divided by 100.
- In words, every 1% increase in Units is accompanied by an expected
  $166.54 increase in Cost.
- Note that for larger values of Units, a 1% increase represents a larger
  absolute increase. But each such 1% increase entails the same increase
  in Cost. This is another way of describing the decreasing marginal cost
  property.
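A quick numerical check of this interpretation, using the fitted logarithmic equation from the text. The second production level (900) is a hypothetical value chosen only to show that the starting level does not matter.

```python
import math


def predicted_cost(units):
    """Fitted logarithmic equation from the text: -63,993 + 16,654 * ln(Units)."""
    return -63993 + 16654 * math.log(units)


# A 1% increase in Units at two different production levels.
increase_at_600 = predicted_cost(606) - predicted_cost(600)
increase_at_900 = predicted_cost(909) - predicted_cost(900)
print(round(increase_at_600, 2), round(increase_at_900, 2))
```

Both increases equal 16,654 times ln(1.01), slightly under the rule-of-thumb figure of 16,654/100 = 166.54, and they are the same at either level.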
56CARDEMAND.XLS
- This file contains annual data (1970-1987) on domestic auto sales in
  the United States. The data set is shown on the next slide.
- The variables are defined as:
  - Quantity: annual domestic auto sales (in number of units)
  - Price: real price index of new cars
  - Income: real disposable income
  - Interest: prime rate of interest
- Estimate and interpret a multiplicative (constant elasticity)
  relationship between Quantity and Price, Income and Interest.
57Car Demand Data
58Constant Elasticity Relationships
- A particular type of nonlinear relationship that has firm grounding in
  economic theory is called a constant elasticity relationship. It is
  also called a multiplicative relationship.
- One property of this type of relationship is that the effect on Y of a
  change in any explanatory variable Xi depends on the levels of the
  other Xs in the equation.
59Solution
- We first take the natural logs of all four variables.
- This can be done in one step using the Transform Variables menu item,
  or we can use Excel's LN function.
- We then use multiple regression, with Log_Quantity as the response
  variable and Log_Price, Log_Income, and Log_Interest as the explanatory
  variables.
- The resulting output is shown on the next slide, and the corresponding
  equation is
  Predicted Log_Quantity = 4.675 - 1.185 Log_Price + 2.183 Log_Income
      - 0.191 Log_Interest
60Regression Output for Multiplicative Relationship
61Solution -- continued
- If we like, we can convert this back to the original variables, that
  is, back to multiplicative form, by taking antilogs. The result is
  Predicted Quantity = 107.198 Price^(-1.185) Income^(2.183) Interest^(-0.191)
- In either form the equation implies that the elasticities are
  approximately equal to -1.185, 2.183 and -0.191.
- When Price increases by 1%, Quantity tends to decrease by about 1.185%;
  when Income increases by 1%, Quantity tends to increase by about
  2.183%; and when Interest increases by 1%, Quantity tends to decrease
  by about 0.191%.
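The antilog step and the elasticity reading can be checked numerically, as sketched below. The coefficients are the rounded values from the text; the price, income and interest levels passed in are hypothetical and serve only to show that the percentage effect does not depend on them.

```python
import math

INTERCEPT = 4.675
ELASTICITY = {"Price": -1.185, "Income": 2.183, "Interest": -0.191}


def predicted_quantity(price, income, interest):
    """Multiplicative form obtained by taking the antilog of the log-log equation."""
    return (math.exp(INTERCEPT) * price ** ELASTICITY["Price"]
            * income ** ELASTICITY["Income"] * interest ** ELASTICITY["Interest"])


# Effect of a 1% price increase, holding the other variables fixed.
base = predicted_quantity(100.0, 3000.0, 8.0)
bumped = predicted_quantity(101.0, 3000.0, 8.0)
pct_change = 100 * (bumped / base - 1)

# Same 1% price increase from a different starting point.
pct_change_other = 100 * (predicted_quantity(202.0, 5000.0, 12.0)
                          / predicted_quantity(200.0, 5000.0, 12.0) - 1)

print(round(math.exp(INTERCEPT), 1))  # close to the 107.198 quoted in the text
print(round(pct_change, 3), round(pct_change_other, 3))
```

The antilog of the rounded intercept differs slightly from 107.198, presumably because the text's figure was computed from an unrounded coefficient.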
62Conclusions
- Does this multiplicative equation provide a better fit to the
  automobile data than does an additive relationship?
- Without doing considerably more work, it is difficult to answer this
  question with certainty.
- As we discussed previously, it is not sufficient to compare R2 and se
  values for the two fits.
- We will simply state that the multiplicative relationship provides a
  reasonably good fit, and it makes sense economically.
64LEARNING.XLS
- The Presario Company produces a variety of small industrial products.
- It has just finished producing 22 batches of a new product (new to
  Presario) for a customer.
- This file contains the times (in hours) to produce each batch. These
  data are in the table on the next slide.
- Clearly, the times have tended to decrease as Presario has gained more
  experience in making the product.
65Data for Learning Curve
- Does the multiplicative learning model apply to
these data, and what does it imply about the
learning rate?
66Learning Curve Model
- A final example of a multiplicative relationship is the learning curve
  model.
- A learning curve relates the unit production time (or cost) to the
  cumulative volume of output since that production process first began.
- Empirical studies indicate that production times tend to decrease by a
  relatively constant percentage every time cumulative output doubles.
- The constant percentage is called the learning rate.
67Solution
- One way to check whether the multiplicative learning model is
  reasonable is to create the log variables Log_Time and Log_Batch in the
  usual way and then see whether a scatterplot of Log_Time versus
  Log_Batch is approximately linear.
- The multiplicative model implies that it should be.
- Such a scatterplot is shown on the next slide, along with a
  superimposed linear trend line. The fit appears to be quite good.
68Scatterplot of Log Variables with Linear Trend
Superimposed
69Solution -- continued
- To estimate the relationship, we regress Log_Time on Log_Batch. The
  resulting equation is
  Predicted Log_Time = 4.834 - 0.155 Log_Batch
- There are a couple of ways of interpreting this equation.
- First, because it is based on a multiplicative relationship, we can
  interpret the coefficient -0.155 as an elasticity. That is, when Batch
  increases by 1%, Time tends to decrease by approximately 0.155%.
  Although this is correct, it is not as useful as the doubling
  interpretation.
70Solution -- continued
- We know that the estimated learning rate satisfies
  -0.155 = ln(learning rate)/ln(2)
  Solving for the learning rate (multiply through by ln(2) and then take
  antilogs), we find that it is 0.898, or approximately 90%. In other
  words, whenever cumulative production doubles, the time to produce a
  batch decreases by about 10%.
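The two algebraic steps (multiply by ln(2), then take the antilog) amount to raising 2 to the power of the slope. A minimal check in Python:

```python
import math

slope = -0.155  # coefficient of Log_Batch in the fitted equation

# Multiply through by ln(2), then take the antilog; this equals 2 ** slope.
learning_rate = math.exp(slope * math.log(2))
print(round(learning_rate, 3))  # 0.898

# Percentage drop in batch time each time cumulative production doubles.
print(round(100 * (1 - learning_rate), 1))  # about 10
```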
71Predicting Future Production Times
- Presario could use this regression equation to predict future
  production times.
- For example, suppose the customer places an order for 15 more batches
  of the same product. We can use the equation to predict the log of
  production time for each batch, then take their antilogs and sum them
  to obtain the total production time.
- The calculations are shown in rows 26-42 of the following table. The
  total predicted time to finish is about 1115 hours.
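The same calculation can be sketched outside the spreadsheet. This assumes the 15 new batches are numbered 23 through 37, following directly on the 22 already produced, and uses the rounded coefficients from the fitted equation.

```python
import math


def predicted_time(batch):
    """Antilog of the fitted equation Log_Time = 4.834 - 0.155 * Log_Batch (hours)."""
    return math.exp(4.834 - 0.155 * math.log(batch))


# 22 batches are complete, so the next 15 are batches 23 through 37.
future_batches = range(23, 38)
total_hours = sum(predicted_time(b) for b in future_batches)
print(round(total_hours))  # about 1115, in line with the figure in the text
```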
72Using the Learning Curve Model for Predictions