Title: Categorical dependent variables
1. Categorical dependent variables
2. Categorical dependent variables
- Firm joins Energy Star or not
- Parcel of land developed as urban, agriculture, or open space
- Species goes extinct or not
- Opinion is Strongly Opposed, Opposed, Neutral, Favorable, or Strongly Favorable
- Residuals will clearly not be normal!
- Relevant probability distributions are binomial and multinomial
- Instead of thinking about means, we think about probabilities of a given result
- Can still do hypothesis testing and fit linear models
3. Hypothesis testing
- Types of hypotheses
  - Single sample: compare to a predicted probability
    - Are ESM students a sample from a population that is 50% female?
    - Are UCSB students a random sample of the California population with regards to race?
  - Two or more samples: are they samples from the same population?
    - Do two states have the same proportion of invaded species?
    - Do land parcels in Ventura and Santa Barbara Counties have the same distribution among developed, agricultural, and open space?
- Types of tests
  - Exact tests
    - Give the precise probability of seeing the data, given the null hypothesis
    - Computer intensive
  - Chi-squared test
    - Oldest, easiest to compute by hand
    - Most commonly used
    - Called Pearson in JMP
  - G-test
    - Called Likelihood Ratio in JMP
  - Both chi-squared and G-test are approximations to the exact test
    - Make different assumptions
    - Require an expected count of at least 5 in each cell for reliable results
  - Exact tests are only implemented for certain cases in JMP
4. Single sample tests
- In a group of 35 students, 24 are female. Is this a random sample from a population with a 50:50 sex ratio?
- Null hypothesis: p = 0.5
- Expected frequencies: 17.5 female, 17.5 male
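A minimal Python sketch of this single-sample test, for readers who want to check the arithmetic outside JMP. The function names are illustrative and not taken from any package. It computes the Pearson chi-squared statistic with its 1-degree-of-freedom p-value, and the exact two-sided binomial probability for comparison.

```python
import math

def chi_squared_1df(observed, expected):
    """Pearson chi-squared statistic and p-value for a two-category test (1 df)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 df the upper-tail chi-squared probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def exact_binomial_two_sided(successes, n, p_null):
    """Exact test: total probability of all outcomes no more likely than the observed one."""
    def pmf(k):
        return math.comb(n, k) * p_null ** k * (1 - p_null) ** (n - k)
    p_obs = pmf(successes)
    # The small tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= p_obs * (1 + 1e-9))

n, females = 35, 24
expected = [n * 0.5, n * 0.5]          # 17.5 female, 17.5 male
chi2_stat, chi2_p = chi_squared_1df([females, n - females], expected)
exact_p = exact_binomial_two_sided(females, n, 0.5)

print(f"chi-squared = {chi2_stat:.3f}, p = {chi2_p:.4f}")
print(f"exact binomial p = {exact_p:.4f}")
```

The chi-squared p-value is an approximation to the exact one, so the two numbers need not agree.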
5. Single sample tests
- In a group of 52 students, 30 are white, 5 are Latino, 7 are black, and 10 are Asian. Is this a random sample from a population that is 45% white, 30% Latino, 15% black, and 10% Asian?
- Null hypothesis: probabilities of each race are as given
- Expected frequencies: 23.4 white, 15.6 Latino, 7.8 black, 5.2 Asian
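The same goodness-of-fit question with four categories can be sketched in Python as below. It computes both the Pearson chi-squared statistic and the G (likelihood ratio) statistic on k - 1 = 3 degrees of freedom. The helper `chi2_upper_tail` is a hypothetical name; it uses the closed-form tail probability that exists for integer degrees of freedom, so no statistics library is assumed.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-squared variable with integer df (closed form)."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1)
    )
    return tail

observed = [30, 5, 7, 10]                 # white, Latino, black, Asian
null_probs = [0.45, 0.30, 0.15, 0.10]     # population proportions under the null
n = sum(observed)
expected = [n * p for p in null_probs]    # 23.4, 15.6, 7.8, 5.2

pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g_stat = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
df = len(observed) - 1

print(f"Pearson chi-squared = {pearson:.3f}, p = {chi2_upper_tail(pearson, df):.4f}")
print(f"G statistic         = {g_stat:.3f}, p = {chi2_upper_tail(g_stat, df):.4f}")
```

All expected counts here are at least 5, so the rule of thumb for using either approximation is met.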
6. Two sample tests
- Alabama has 30 native tree species and 40 exotics; Kansas has 25 natives and 40 exotics. Are these drawn from the same statistical population?
- Null hypothesis: p = 55/135 = 0.41 (proportion native)
- Expected frequencies in Alabama: 28.5 native, 41.5 exotic
- Expected frequencies in Kansas: 26.5 native, 38.5 exotic
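As a worked check of these numbers, the expected count in each cell is the row total times the column total divided by the grand total, and the Pearson statistic sums the scaled squared deviations. The symbols R_i (state totals: 70 and 65), C_j (native and exotic totals: 55 and 80), and N = 135 are notation introduced here, not from the slides.

```latex
\[
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{\text{Alabama, native}} = \frac{70 \times 55}{135} \approx 28.5
\]
\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{(30 - 28.52)^2}{28.52} + \frac{(40 - 41.48)^2}{41.48}
+ \frac{(25 - 26.48)^2}{26.48} + \frac{(40 - 38.52)^2}{38.52}
\approx 0.27
\]
```

With (2 - 1)(2 - 1) = 1 degree of freedom, the 5% critical value is 3.84, so a statistic near 0.27 gives no evidence that the two states differ in their proportion of native species.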
7. Multiple sample tests
- A random sample of 100 land parcels in each of Ventura, SB, and San Luis Obispo (SLO) counties. In Ventura, 30 developed, 40 agricultural, and 30 open space; in SB, 45, 30, and 25; in SLO, 10, 25, and 65. Do the counties have the same land use pattern?
- Null hypothesis: 85/300 = 0.28 developed; 95/300 = 0.32 agriculture; 120/300 = 0.40 open space
- Expected frequencies in each county: 28 developed, 32 agriculture, 40 open space
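For tables with more than two rows or columns the bookkeeping is easier in code. The sketch below, with hypothetical function names, builds the expected counts from the row and column totals of the county table and computes the Pearson statistic on (rows - 1)(columns - 1) degrees of freedom.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    exp = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: Ventura, SB, SLO. Columns: developed, agricultural, open space.
land_use = [
    [30, 40, 30],
    [45, 30, 25],
    [10, 25, 65],
]

stat, df = pearson_chi2(land_use)
# Closed-form upper tail for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-stat / 2) * (1 + stat / 2)

print("expected per county:", [round(e, 1) for e in expected_counts(land_use)[0]])
print(f"chi-squared = {stat:.2f} on {df} df, p = {p_value:.2e}")
```

Because every county has the same sample size, the expected counts are identical across rows; with unequal samples they would scale with each row total.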
8. Tumors and ETU
- Treated foods contain ETU (ethylenethiourea), which may be harmful to health.
- Big question: How does exposure affect the chance of contracting disease?
- Some rats exposed to ETU contracted tumors.
  - How does the probability of a tumor depend on dose?
  - What dose is associated with a 10% tumor rate (to advise on regulation)?
- 6 dose groups (0, 5, 25, 125, 250, 500)
- 70 rats per group.
9. How about OLS regression?
Call:
lm(formula = Tumor ~ Dose)

Residuals:
     Min       1Q   Median       3Q      Max
-0.78572 -0.15976  0.04055  0.04889  1.04889

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.889e-02  1.657e-02  -2.951  0.00335 **
Dose         1.669e-03  7.181e-05  23.244  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2653 on 430 degrees of freedom
Multiple R-Squared: 0.5568, Adjusted R-squared: 0.5558
F-statistic: 540.3 on 1 and 430 DF, p-value: < 2.2e-16
12. Problems with the OLS regression
- Statistical
  - Residuals not normally distributed
  - Maybe some nonlinearity; hard to tell
- Logical
  - Predicted value represents the probability of getting a tumor
  - How do we interpret values less than zero or greater than one?
- Solution: GLM with binomially distributed residuals
  - Logistic regression
  - Uses the appropriate error model
  - Fits a model that is constrained to be between zero and one
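The logical problem can be seen directly from the OLS coefficients reported in the lm() output. The Python sketch below evaluates the fitted line at the six dose levels and contrasts it with the inverse logit, which maps any value of the linear predictor into the interval (0, 1). The function names are illustrative.

```python
import math

# Coefficients copied from the lm() output in the slides.
INTERCEPT = -4.889e-02
SLOPE = 1.669e-03
DOSES = [0, 5, 25, 125, 250, 500]

def linear_prediction(dose):
    """OLS fitted value, interpreted as a tumor probability."""
    return INTERCEPT + SLOPE * dose

def inverse_logit(eta):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-eta))

for dose in DOSES:
    fitted = linear_prediction(dose)
    flag = "  <- not a valid probability" if not 0 <= fitted <= 1 else ""
    print(f"dose {dose:>3}: OLS fitted value = {fitted: .4f}{flag}")

# However extreme the linear predictor, the logistic curve stays inside (0, 1).
for eta in (-10, 0, 10):
    print(f"eta = {eta:>3}: inverse logit = {inverse_logit(eta):.6f}")
```

The fitted line gives negative "probabilities" at the lowest doses, which is the interpretive problem that logistic regression avoids.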
13. Logistic Regression
15. Dose @ 10% Tumor Chance
- What dose gives a 10% chance of contracting a tumor?
- Solve the fitted logistic model for the dose D at which the tumor probability is 0.10
- After a bunch of math, D = 170.24
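The "bunch of math" amounts to inverting the logistic model: if logit(p) = b0 + b1 * D, then D = (ln(p / (1 - p)) - b0) / b1. A minimal Python sketch follows. The fitted coefficients are not given in this transcript, so the values of `b0` and `b1` below are hypothetical placeholders and the result is not expected to reproduce 170.24.

```python
import math

def dose_for_probability(b0, b1, p):
    """Invert logit(p) = b0 + b1 * dose to get the dose giving probability p."""
    return (math.log(p / (1 - p)) - b0) / b1

def tumor_probability(b0, b1, dose):
    """Logistic model: probability of a tumor at a given dose."""
    return 1 / (1 + math.exp(-(b0 + b1 * dose)))

# HYPOTHETICAL coefficients, chosen only to illustrate the calculation.
b0, b1 = -4.0, 0.01

dose_10 = dose_for_probability(b0, b1, 0.10)
print(f"dose giving a 10% tumor chance: {dose_10:.2f}")
print(f"check: probability at that dose = {tumor_probability(b0, b1, dose_10):.4f}")
```

Substituting the intercept and slope from the fitted logistic regression for the placeholders gives the dose to report.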
16. More complex logistic regression and other GLM models
- Can add more variables, interactions, etc.
  - Within the logistic function, the model needs to be linear in parameters
- With multiple logistic regression, use Effect Tests to test significance of individual terms
- When comparing models, look at AIC (smaller is better)
  - AIC = -2 LogLikelihood + 2 df
- Can use other probability models for residuals
  - gaussian, Gamma, inverse.gaussian, poisson
- Can also specify the link: the function that transforms the dependent variable to get linearity
  - identity, log, sqrt, logit, probit, inverse, 1/mu^2 (inverse squared)
- JMP doesn't do any of these
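To make the AIC comparison concrete, the sketch below fits a logistic regression by Newton-Raphson in plain Python and compares it with an intercept-only model using AIC = -2 LogLikelihood + 2 df. The dose levels and group size come from the slides, but the tumor counts are HYPOTHETICAL, invented for illustration because the transcript does not give the observed counts. The function names are also illustrative; in practice a GLM routine such as R's glm would do this fitting.

```python
import math

doses = [0, 5, 25, 125, 250, 500]   # dose groups from the slides
group_size = 70                     # rats per group, from the slides
tumors = [1, 2, 2, 8, 30, 62]       # HYPOTHETICAL counts, not the study's data

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def log_likelihood(b0, b1):
    """Binomial log-likelihood of logit(p) = b0 + b1 * dose for the grouped data."""
    total = 0.0
    for x, y in zip(doses, tumors):
        p = inv_logit(b0 + b1 * x)
        total += (math.log(math.comb(group_size, y))
                  + y * math.log(p) + (group_size - y) * math.log(1 - p))
    return total

def fit_logistic(max_iter=50, tol=1e-10):
    """Newton-Raphson for the two-parameter logistic model."""
    overall = sum(tumors) / (group_size * len(doses))
    b0, b1 = math.log(overall / (1 - overall)), 0.0   # start at the intercept-only fit
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(doses, tumors):
            p = inv_logit(b0 + b1 * x)
            w = group_size * p * (1 - p)
            g0 += y - group_size * p          # score for the intercept
            g1 += x * (y - group_size * p)    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # solve the 2 x 2 Newton system
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1

def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

overall_rate = sum(tumors) / (group_size * len(doses))
null_b0 = math.log(overall_rate / (1 - overall_rate))   # MLE of the intercept-only model
b0_hat, b1_hat = fit_logistic()

aic_null = aic(log_likelihood(null_b0, 0.0), 1)
aic_dose = aic(log_likelihood(b0_hat, b1_hat), 2)
print(f"intercept-only AIC = {aic_null:.1f}")
print(f"dose model AIC     = {aic_dose:.1f}  (smaller is better)")
```

The extra parameter costs 2 AIC units, so the dose model is preferred only if it raises the log-likelihood by more than 1.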