1
Multivariate Coarse Classing of Nominal Variables

Geraldine E. Rosario
Talk given at Fair Isaac on July 14, 2003
Based on paper Mapping Nominal Values to Numbers
for Effective Visualization, InfoVis 2003.

2
Outline

Motivation
Overview of Distance-Quantification-Classing
approach
Algorithmic Details
Experimental Evaluation
Wrap-Up

3
Those pesky nominal variables

Nominal variable: a variable whose values do not
have a natural ordering or distance
High-cardinality nominal variable: has a large
number of distinct values
Examples?
Examples of business applications using nominal
variables?
Why do you usually pre-process/transform them
before doing data analysis?

4
Visualizing Nominal Variables

Most data visualization
tools are designed for
numeric variables.

What if variable is
nominal?

Most tools that are
designed for nominal
variables cannot handle
a large number of values.

5
Quantified Nominal Variables
Are the order and spacing of values within each
variable believable?
6
Coarse Classing Nominal Variables

Possible ways of classing nominal variables with
high cardinality
Domain expertise
Univariate: using information about the variable
itself, e.g. based on frequency of occurrence of
the attributes
Bivariate: using information from one other
variable, e.g. relationship with predictor
variable
Multivariate: based on the profile across several
other variables, e.g. using cluster analysis
Is multivariate coarse classing better?

7
The approach
8
Proposed Approach

Pre-process nominal variables using a
Distance-Quantification-Classing (DQC) approach
Steps
Distance: transform the data so that the
distance between 2 nominal values can be
calculated (based on the variable's relationship
with other variables)
Quantification: assign order and spacing to the
nominal values
Classing (or intra-dimension clustering):
determine which values are similar to each other
and can be grouped together
Each step can be done by more than one technique.

9
Distance-Quantification-Classing Approach
10
Example: Input to Output
Task: Pre-process color based on its patterns
across quality and size.
Data:
Quality (3): good, ok, bad
Color (6): blue, green, orange, purple, red, white
Size (10): a to j
11
Other Potential Uses of DQC as Pre-Processor

For techniques that require numeric inputs:
linear regression, some clustering algorithms
(can speed up calculations but with some loss of
accuracy)
For techniques that require low-cardinality
nominal variables: scorecards, neural networks,
association rules
FICO-specific:
Multivariate coarse classing
ClusterBots: nominal variables could be
quantified and distance calculations would be
simpler. Could be applied to mixed variables?
Product groups, merchant groups
Can you think of other uses?

12
Details Details
13
Distance Step: Correspondence Analysis

Used for analyzing n-way tables containing some
measure of association between rows and columns
Simple Correspondence Analysis (SCA): for 2
variables
Multiple Correspondence Analysis (MCA): for > 2
variables. Uses SCA.
Focused Correspondence Analysis (FCA): proposed
alternative to MCA when memory is limited. Uses
SCA.
Reinvented as Dual Scaling, Reciprocal Averaging,
Homogeneity Analysis, etc.
Similar to PCA but for nominal variables

14
Simple Correspondence Analysis: The Basic Idea
Calculate the χ² statistic (measures the strength of
association between COLOR and QUALITY, relative to
the assumption of independence). Any deviation
from independence will increase the χ² value.
Can we find similar COLORs based on their
association with QUALITY?
Similar profiles
15
Simple Correspondence Analysis: Steps
Normalize counts table
Row percentage matrix: similar row profiles (blue, purple)
Column percentage matrix: similar column profiles (ok, bad)
Identify a few independent dimensions which can
reconstruct the χ² value (SVD, eigenanalysis).
Eigenvalues
Scale the new dimensions such that χ² distances
between row points are maximized.
Coordinates for Independent Dimensions
         Dim1    Dim2
Blue    -0.02   -0.28
Green   -0.54    0.14
Orange   0.55    0.10
Purple   0      -0.25
Red     -0.50    0.20
White    0.57    0.19
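
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the SAS implementation mentioned in the talk: the counts table is made up, the function name simple_ca is hypothetical, and the scaling shown is the usual "row principal coordinates" convention, which is assumed here to be what the slide means by scaling the dimensions.

```python
import numpy as np

def simple_ca(counts):
    """Simple correspondence analysis of a two-way counts table (rows x cols)."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    P = N / n                                # normalized counts table
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    # Standardized residuals from the independence model r * c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                     # at most min(r, c) - 1 dimensions
    eigenvalues = sv[:k] ** 2                # importance of each dimension
    # Row principal coordinates: distances between rows are chi-square distances
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    return row_coords, eigenvalues, n

# Hypothetical COLOR (rows) by QUALITY (columns: good, ok, bad) counts
colors = ["blue", "green", "orange", "purple", "red", "white"]
counts = np.array([
    [20, 30, 28],
    [45, 10,  5],
    [ 5, 25, 40],
    [18, 28, 30],
    [40, 12,  6],
    [ 4, 22, 38],
])
row_coords, eigenvalues, n = simple_ca(counts)
chi2 = n * eigenvalues.sum()                 # eigenvalues reconstruct the chi-square value
for name, xy in zip(colors, row_coords):
    print(name, np.round(xy, 2))
```

The eigenvalues sum to χ²/n, which is the sense in which the retained dimensions "reconstruct" the χ² value. Every row and column of the table must have a non-zero total for the division by the masses to be valid.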
16
Simple Correspondence Analysis: The Output

Coordinates Matrix
Set of independent dimensions
Dimensions ordered by diminishing importance
Total # of independent dimensions = min(r, c) - 1
Similar to principal components from PCA
Eigenvalues
Indicate the importance of each independent
dimension

17
Distance Step Alternative: Multiple
Correspondence Analysis

Steps
BurtTable(rawdataMatrix) → burtMatrix
SCA(burtMatrix) → coordMatrix, evaluesVector
ReduceNDim(coordMatrix, evaluesVector) →
coordMatrixSubset
Input to SCA: Burt Table crosses all variables
by all variables
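
As a rough illustration of the BurtTable step, the sketch below builds a Burt table from an indicator (one-hot) matrix Z as Z'Z. The function name burt_table, the record layout, and the sample records are assumptions made for this example, not part of the original implementation.

```python
import numpy as np

def burt_table(records):
    """Build the Burt table from a list of records of nominal values."""
    n_vars = len(records[0])
    # One (variable index, value) label per indicator column, in a stable order
    labels = sorted({(j, rec[j]) for rec in records for j in range(n_vars)})
    col = {lab: idx for idx, lab in enumerate(labels)}
    Z = np.zeros((len(records), len(labels)))   # indicator (one-hot) matrix
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            Z[i, col[(j, value)]] = 1
    return Z.T @ Z, labels                       # all variables crossed by all variables

# Hypothetical records: (quality, color, size)
records = [
    ("good", "blue", "a"),
    ("ok", "blue", "b"),
    ("bad", "red", "a"),
    ("good", "green", "b"),
    ("bad", "red", "b"),
]
burt, labels = burt_table(records)
print(labels)
print(burt)
```

The diagonal blocks hold each variable's value frequencies and the off-diagonal blocks are the pairwise counts tables, so the table grows with the square of the total number of distinct values across all variables.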

[Figure: Burt table with X1, X2, X3 along both rows and columns; each block is a counts table, e.g. X1 by X1, X1 by X2]

18
Multiple Correspondence Analysis

Features
For a given variable, determines which values are
similar to each other by comparing value profiles
across all other variables
multivariate
maximizes usage of information
memory-intensive
Simultaneous analysis of all variables
efficient calculations

19
Reduce Number of Dimensions to Keep

Reduce the number of independent dimensions to
keep for subsequent analysis (due to large number of
analysis variables and high cardinality)

[Figure: eigenvalue plotted against dimension (1 to 5)]
20
Distance Step Alternative: Focused Correspondence
Analysis

Proposed alternative to MCA when memory space is
limited
Core idea: instead of comparing value profiles
across all other nominal variables, just compare
value profiles across the nominal variables which
are most correlated with the target variable
Input to Simple CA
[Figure: candidate variables X9, X1, X3; for target variable Xi, the Xi by X3 and Xi by X1 counts tables form the input]
21
Focused Correspondence Analysis

Steps
PairwiseAssociate(rawdataMatrix) → assocMatrix
Set k (# of analysis variables to use)
FCATable(rawdataMatrix, k, assocMatrix) →
fcaInputMatrix
SCA(fcaInputMatrix) → coordMatrix, evaluesVector
ReduceNDim(coordMatrix, evaluesVector) →
coordMatrixSubset
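
A possible reading of the FCATable step is sketched below: for one target variable, the counts tables against the chosen analysis variables are placed side by side. The helper names, the dictionary-of-columns data layout, and the side-by-side arrangement are all assumptions for illustration; in the talk the analysis variables are picked from assocMatrix using k, whereas here they are simply passed in.

```python
import numpy as np

def counts_table(x, y):
    """Cross-tabulate two equally long sequences of nominal values."""
    x_vals, y_vals = sorted(set(x)), sorted(set(y))
    table = np.zeros((len(x_vals), len(y_vals)))
    for a, b in zip(x, y):
        table[x_vals.index(a), y_vals.index(b)] += 1
    return table, x_vals, y_vals

def fca_table(data, target, analysis_vars):
    """Place the target-by-analysis-variable counts tables side by side."""
    blocks = [counts_table(data[target], data[v])[0] for v in analysis_vars]
    return np.hstack(blocks)

# Hypothetical data set, one list per nominal variable
data = {
    "color":   ["blue", "blue", "red", "green", "red", "blue"],
    "quality": ["good", "ok", "bad", "good", "bad", "ok"],
    "size":    ["a", "b", "a", "b", "b", "a"],
}
# Rows: blue, green, red; columns: quality values, then size values
fca_input = fca_table(data, "color", ["quality", "size"])
print(fca_input)
```

Only one target variable's rows are held at a time, which is where the lower and more controllable memory use compared with the full Burt table comes from.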

22
FCA: Calculate Pairwise Association

Used Uncertainty Coefficient U(R|C) to measure
strength of nominal association
Bounded in [0, 1]
U(R|C) = 1 → value of row variable R can be known
precisely given value of column variable C
Example: U(R|C) association matrix
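
For reference, the uncertainty coefficient can be computed from a counts table with the standard entropy-based definition U(R|C) = (H(R) + H(C) - H(RC)) / H(R). The sketch below assumes this is the definition used in the talk (it is the one documented for SAS PROC FREQ); the function names and the two small tables are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability array, ignoring zeros."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def uncertainty_coefficient(counts):
    """U(R|C): share of the row variable's entropy explained by the columns."""
    P = np.asarray(counts, dtype=float)
    P = P / P.sum()
    h_r = entropy(P.sum(axis=1))     # entropy of the row variable R
    h_c = entropy(P.sum(axis=0))     # entropy of the column variable C
    h_rc = entropy(P.ravel())        # joint entropy of (R, C)
    return (h_r + h_c - h_rc) / h_r

# R is fully determined by C: each column has a single non-zero cell
determined = np.array([[10, 0, 0],
                       [0, 5, 7]])
# R and C are independent: every row has the same profile
independent = np.array([[10, 20],
                        [20, 40]])
u_determined = uncertainty_coefficient(determined)
u_independent = uncertainty_coefficient(independent)
print(u_determined, u_independent)
```

The measure is asymmetric, so U(R|C) and U(C|R) generally differ, and it is undefined when the row variable takes only one value (H(R) = 0).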

23
FCA: Determine top k associated variables for
each nominal variable

Set k > 2 to ensure use of at least one analysis
variable per target variable
Cannot use a threshold on the association measure

24
Focused Correspondence Analysis

Features
One-at-a-time analysis
Less/controllable memory usage
Sub-optimal quantification compared to MCA
Requires pre-processing step to determine top
correlated variables per target variable
longer run time

25
Quantification Step: Modified Optimal Scaling
Nominal-to-numeric mapping
Coordinates for Independent Dimensions
         Dim1    Dim2
Blue    -0.02   -0.28
Green   -0.54    0.14
Orange   0.55    0.10
Purple   0      -0.25
Red     -0.50    0.20
White    0.57    0.19
Optimal Scaling
Optimal Scaling goal: maximize the variance of
the scores of the records, where score_i =
average over j of q_ij
26
Quantification Step: Modified Optimal Scaling

Problem with Optimal Scaling: perfect
associations between variables are not recreated
in the quantified versions
Modified Optimal Scaling
Let p = # of eigenvalues equal to 1.0
If p > 1 then set
Else set

27
Classing Step: Hierarchical Cluster Analysis
Cluster Analysis weighted by counts
from FCA
28
Loss of Information due to Classing

Determine variable V with highest association
with target X.
Create X by V counts table.
Calculate total table measure of association (e.g.,
U(X|V)).
Starting from bottom of tree, for every pair of
nodes merged,
calculate cumulative information loss

29
Distance-Quantification-Classing Approach
30
Does this approach work?
31
Experimental Evaluation

Wrong quantification and classing will introduce
artificial patterns and cause errors in
interpretation
Evaluation measures
Perception:
Believability
Quality of Visual Display
Statistical:
Quality of classing
Quality of quantification
Computational:
Space: FCA uses less space
Run time: MCA is faster
32
Test Data Sets
33
Believability and Quality of Visual Display

Given two displays resulting from different
nominal-to-numeric mappings
Which mapping gives a more believable ordering
and spacing?
Based on your domain knowledge, are the values
that are positioned close together similar to
each other?
Are the values that are positioned far from the
rest of the values really outliers?
Which display has less clutter?

34
Automobile Data: Alphabetical
35
Automobile Data: MCA
Are these patterns believable?
36
Automobile Data: FCA
Are these patterns believable?
37
PERF Data: Alphabetical
Region-Country: 1-to-many; Country-Product: many-to-many
Are these associations preserved and revealed?
38
PERF Data: FCA
Region-Country: 1-to-many; Country-Product: many-to-many
Are these associations preserved and revealed?
39
Quality of Classing

Classing A is better than classing B if, given a
classing tree, the rate of information loss with
each merging is slower

Information loss due to classing for one variable
→ The lower the line, the slower the info
loss, the better the classing.
→ Calculate difference between the lines.
40
Which classing is better depends on dataset
Distribution of difference between the lines.
41
Quality of Quantification

A quantification is good if
Data points that are close together in nominal
space are also close together in numeric space
Two variables that are highly associated with each
other have quantified versions that also
have high correlation.

42
MCA gives better quantification
→ Average Squared Correlation: higher value means
better quantification
→ Correlation between MCA and FCA scales: how
close are FCA scales to MCA scales
43
Had enough yet?
44
Going back to Multivariate Coarse Classing

Other issues
Missing values
Mixed or numeric variables as analysis variables
Nominal values with small counts
Robustness of quantification and classing

45
Can you think of other uses of DQC at FICO?

For techniques that require numeric inputs:
linear regression, some clustering algorithms
(can speed up calculations but with some loss of
accuracy)
For techniques that require low-cardinality
nominal variables: scorecards, neural networks,
association rules
FICO-specific:
Multivariate coarse classing
ClusterBots: nominal variables could be
quantified and distance calculations would be
simpler. Could be applied to mixed variables?
Product groups, merchant groups
???????

46
Implementation

SAS version exists
PROC CORRESP, PROC CLUSTER, PROC FREQ
C version in development

47
Summary

DQC is a general-purpose approach for
pre-processing nominal variables for data
analysis techniques requiring numeric variables
or low cardinality nominal variables
DQC is multivariate, data-driven, scalable,
distance-preserving, association-preserving
FCA is a viable alternative to MCA when memory
space is limited
Quality of classing and quantification
depends on strength of associations within the
data set, and
is in the eye of the user

48
Yippee, it's over!

Original InfoVis 2003 paper: Mapping Nominal
Values to Numbers for Effective Visualization.
http://davis.wpi.edu/xmdv/documents.html
XmdvTool Homepage:
http://davis.wpi.edu/xmdv
xmdv@cs.wpi.edu
Code is free for research and education.
Code is free for research and education.

49
References

[Gre93] Greenacre, M. J. (1993). Correspondence
Analysis in Practice. London: Academic Press.
[Gre84] Greenacre, M. (1984). Theory and
Applications of Correspondence Analysis. London:
Academic Press.
[Sta] StatSoft Inc. Correspondence Analysis.
http://www.statsoftinc.com/textbook/stcoran.html
[Fri99] Friendly, M. (1999). "Visualizing
Categorical Data." In Sirken, M. G. et al.
(eds.), Cognition and Survey Research. New York:
John Wiley & Sons.
[Kei97] Keim, D. A. (1997). Visual Techniques for
Exploring Databases. Invited Tutorial, Int.
Conference on Knowledge Discovery in Databases
(KDD'97), Newport Beach, CA.
[Hua97b] Huang, Z. (1997). A Fast Clustering
Algorithm to Cluster Very Large Categorical Data
Sets in Data Mining.
SAS Manuals (PROC CORRESP, PROC CLUSTER, PROC
FREQ)

50
What input tables can SCA accept?

In general, SCA can use as input any table that
has the properties
The table must use the same physical units or
measurements, and
The values in the table must be non-negative.
The FCA input table satisfies these properties.

51
Uncertainty Coefficient U(R|C)
Source: SAS PROC FREQ
52
Average Squared Correlation

Given the raw data matrix R = [r_ij], where the
columns represent the variables. Create new
matrix Q = [q_ij], where q_ij = quantified version of
r_ij. Let Q_j = jth column of Q.
For each record i, calculate score_i = average over j of q_ij.
For each variable j, calculate corr_j = correlation(Q_j, score).
Calculate the average of the squared correlations.
Source: Gre93
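
The calculation above translates almost directly into NumPy. The sketch below is illustrative only: the function name and the small matrix Q_aligned are hypothetical, and Q is assumed to already hold the quantified values.

```python
import numpy as np

def average_squared_correlation(Q):
    """Q: records x variables matrix of quantified (numeric) values."""
    Q = np.asarray(Q, dtype=float)
    score = Q.mean(axis=1)                     # score_i = average over j of q_ij
    corr = np.array([np.corrcoef(Q[:, j], score)[0, 1]
                     for j in range(Q.shape[1])])
    return (corr ** 2).mean()                  # average of the squared correlations

# Hypothetical quantified data in which the two variables are perfectly aligned
Q_aligned = np.array([[0.0, 0.0],
                      [1.0, 2.0],
                      [2.0, 4.0],
                      [3.0, 6.0]])
asc_aligned = average_squared_correlation(Q_aligned)
print(asc_aligned)
```

A column with no variation has an undefined correlation and would make the result NaN, so constant quantified variables should be dropped before calling this.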