Title: Linear Discriminant Analysis (LDA)
1 Linear Discriminant Analysis (LDA)
2 Goal
- To classify observations into 2 or more groups based on k discriminant functions
  - (Dependent variable Y is categorical with k classes.)
- Assumptions
  - Multivariate Normal Distribution
    - Variables are normally distributed within the classes/groups.
  - Similar Group Covariances
    - The correlations between variables and the variances within each group should be similar across groups.
3 Dependent Variable
- Must be categorical with 2 or more classes (groups).
- If there are only 2 classes, the discriminant analysis procedure will give the same result as the multiple regression procedure.
4 Independent Variables
- Continuous or categorical independent variables
- If categorical, they are converted into binary (dummy) variables, as in multiple linear regression
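The dummy coding mentioned above can be sketched as follows. This is only an illustration and not part of the original slides: the age-group labels and the helper `dummy_code` are hypothetical, and one level is left out as the reference category, as is usual in multiple linear regression.

```python
def dummy_code(values, levels, reference):
    """Convert a categorical variable into binary (dummy) columns.

    One column is created per level except the reference level, so a
    variable with m levels yields m - 1 dummy variables.
    """
    kept = [lvl for lvl in levels if lvl != reference]
    return [[1 if v == lvl else 0 for lvl in kept] for v in values]


# Hypothetical three-level age variable; "31-50" is the reference level.
age_group = ["18-30", "31-50", "over 50", "18-30"]
rows = dummy_code(age_group, levels=["18-30", "31-50", "over 50"], reference="31-50")

for label, row in zip(age_group, rows):
    print(label, row)
```

An observation in the reference level gets 0 in every dummy column, so its effect is absorbed by the intercept.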
5 Output
- Example
  - Assume the dependent variable has 3 classes (Y = 1, 2, 3).
  - f1, f2, f3 are the discriminant function scores; in the rows shown, Pred. Y is the class whose score is highest.

Y    x11   x12   x13   x14   f1    f2    f3    Pred. Y
1    20    25    10    12    85    78    58    1
1    18    16    14    12    80    68    65    1
..   ..    ..    ..
2    15    15    16    17    75    84    70    2
2    14    16    17    18    70    88    67    2
..   ..    ..    ..
3    8     9     9     11    95    86    105   3
3    10    8     8     10    96    84    100   3
..   ..    ..    ..
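The assignment step in this example can be sketched in a few lines. The rule used here (pick the class with the largest discriminant score) is an assumption read off the visible rows of the table, where Pred. Y always matches the highest of f1, f2, f3; the helper name `predict_class` is hypothetical.

```python
# Discriminant scores (f1, f2, f3) for the six rows visible in the example table.
scores = [
    (85, 78, 58),
    (80, 68, 65),
    (75, 84, 70),
    (70, 88, 67),
    (95, 86, 105),
    (96, 84, 100),
]
actual = [1, 1, 2, 2, 3, 3]


def predict_class(f):
    """Return the 1-based index of the largest discriminant score."""
    return max(range(len(f)), key=lambda j: f[j]) + 1


predicted = [predict_class(f) for f in scores]
print(predicted)
```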
6 Binary Dependent - Regression
- If the dependent variable has only 2 classes, multiple regression can be used
- Sample data shown below

Status   Age (18-30)   Age (50)   Income
Y        X1            X2         X3
0        1             0          30
0        1             0          32
..       ..            ..         ..
0        0             0          50
0        0             0          28
0        0             0          75
..       ..            ..         ..
1        0             1          100
1        0             1          90
1        0             1          95
7 Regression Output
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.833615561
R Square            0.694914903
Adjusted R Square   0.649152139
Standard Error      0.301479577
Observations        24

ANOVA
             df   SS            MS            F             Significance F
Regression   3    4.140534632   1.380178211   15.18516005   2.19698E-05
Residual     20   1.817798702   0.090889935
Total        23   5.958333333

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   -0.337942024   0.22002876       -1.535899327   0.14023269    -0.796913973   0.121029925
X1          -0.160950017   0.155728156      -1.033531901   0.313691534   -0.485793257   0.163893223
X2          0.426373823    0.153140052      2.784208421    0.011449703   0.106929273    0.745818373
Income      0.013571735    0.003078379      4.408727859    0.00027065    0.007150349    0.019993121
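The summary statistics in this output follow from the ANOVA sums of squares, which can be checked with a short calculation. This is a sketch using the standard formulas for R Square, Adjusted R Square, F and Standard Error; the variable names are illustrative.

```python
import math

n = 24                 # Observations
p = 3                  # predictors: X1, X2, Income
ss_reg = 4.140534632   # Regression SS
ss_res = 1.817798702   # Residual SS
ss_tot = ss_reg + ss_res

r2 = ss_reg / ss_tot                            # R Square
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # Adjusted R Square
ms_reg = ss_reg / p                             # Regression MS
ms_res = ss_res / (n - p - 1)                   # Residual MS
f_stat = ms_reg / ms_res                        # F
std_err = math.sqrt(ms_res)                     # Standard Error
multiple_r = math.sqrt(r2)                      # Multiple R

print(f"R2={r2:.6f} adjR2={adj_r2:.6f} F={f_stat:.4f} SE={std_err:.6f} R={multiple_r:.6f}")
```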
8 Classification

Status   Age (18-30)   Age (50)   Income
Y        X1            X2         X3       Predicted Y   Class
0        1             0          30       -0.0917       0
0        1             0          32       -0.0646       0
0        1             0          40       0.0440        0
0        1             0          38       0.0168        0
0        1             0          55       0.2476        0
0        1             0          56       0.2611        0
0        0             0          45       0.2728        0
0        0             0          40       0.2049        0
0        0             0          65       0.5442        1
0        0             0          50       0.3406        0
0        0             0          28       0.0421        0
1        0             0          75       0.6799        1
1        0             0          50       0.3406        0
1        1             0          80       0.5868        1
1        0             0          100      1.0192        1
1        0             0          90       0.8835        1
1        0             0          95       0.9514        1
1        0             1          75       1.1063        1
1        0             1          50       0.7670        1
1        0             1          85       1.2420        1
1        0             1          40       0.6313        1
1        0             1          88       1.2827        1
1        0             0          78       0.7207        1
1        0             1          65       0.9706        1
Classification rule in this case: if Pred. Y > 0.5, then Class 1, else Class 0. This model yielded 2 misclassifications out of 24. How good is R-square?
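The classification step can be reproduced from the fitted coefficients and the 24 rows of the table. The sketch below applies the regression equation, uses the 0.5 cutoff stated above, and counts the rows where the assigned class differs from Y; the variable names are illustrative.

```python
# Fitted coefficients from the regression output.
INTERCEPT, B_X1, B_X2, B_INCOME = -0.337942024, -0.160950017, 0.426373823, 0.013571735

# (Y, X1, X2, Income) for the 24 observations in the classification table.
data = [
    (0, 1, 0, 30), (0, 1, 0, 32), (0, 1, 0, 40), (0, 1, 0, 38),
    (0, 1, 0, 55), (0, 1, 0, 56), (0, 0, 0, 45), (0, 0, 0, 40),
    (0, 0, 0, 65), (0, 0, 0, 50), (0, 0, 0, 28), (1, 0, 0, 75),
    (1, 0, 0, 50), (1, 1, 0, 80), (1, 0, 0, 100), (1, 0, 0, 90),
    (1, 0, 0, 95), (1, 0, 1, 75), (1, 0, 1, 50), (1, 0, 1, 85),
    (1, 0, 1, 40), (1, 0, 1, 88), (1, 0, 0, 78), (1, 0, 1, 65),
]

CUTOFF = 0.5

pred_y = [INTERCEPT + B_X1 * x1 + B_X2 * x2 + B_INCOME * inc for _, x1, x2, inc in data]
pred_class = [1 if p > CUTOFF else 0 for p in pred_y]

# A row is misclassified when the assigned class differs from the actual Y.
misclassified = sum(1 for (y, *_), c in zip(data, pred_class) if y != c)
print(f"{misclassified} misclassifications out of {len(data)}")
```

Changing `CUTOFF` shows how the count of misclassifications depends on the chosen threshold.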
9 Crosstab of Pred. Y and Y
- For large datasets, one can format the Predicted Y variable into bands and create a crosstab with Y to see how accurately the model classifies the data (fictitious results shown here).
- The Good and Bad columns give the number of observations with each actual Y value in each Predicted Y band.
Predicted Y (x 1000)   Good   Bad
900 to 1000            410    50
850 to 900             390    70
800 to 850             370    90
750 to 800             350    110
700 to 750             330    130
650 to 700             310    150
600 to 650             290    170
550 to 600             270    190
500 to 550             250    210
450 to 500             230    230
400 to 450             210    250
350 to 400             190    270
300 to 350             170    290
250 to 300             150    310
200 to 250             130    330
150 to 200             110    350
100 to 150             90     370
50 to 100              70     390
0 to 50                50     410
Total                  4370   4370
10 Kolmogorov-Smirnov Test
- Use the crosstab shown on the previous slide to conduct the KS test to determine
  - Cutoff score,
  - Classification accuracy, and
  - Forecasts of model performance.
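One common way to run the KS test on such a crosstab is sketched below. It assumes the KS statistic is taken as the largest gap between the cumulative share of Goods and the cumulative share of Bads, accumulated from the highest score band downward, with the band where the gap peaks suggested as the cutoff. The slides do not spell out the procedure, so this is an illustration on the fictitious counts, not the author's exact method.

```python
# Lower edge of each score band, highest band first, as in the crosstab.
band_lower = list(range(900, -1, -50))   # 900, 850, ..., 0
good = list(range(410, 49, -20))         # 410, 390, ..., 50
bad = list(range(50, 411, 20))           # 50, 70, ..., 410

total_good, total_bad = sum(good), sum(bad)

cum_good = cum_bad = 0
ks, cutoff = 0.0, None
for lower, g, b in zip(band_lower, good, bad):
    cum_good += g
    cum_bad += b
    # Gap between the cumulative shares of Goods and Bads at or above this band.
    gap = abs(cum_good / total_good - cum_bad / total_bad)
    if gap > ks:
        ks, cutoff = gap, lower

print(f"KS = {ks:.4f}, reached at scores >= {cutoff}")
```

Classifying every observation at or above the cutoff as Good and the rest as Bad then gives the classification accuracy directly from the same cumulative counts.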