Title: Multivariate Coarse Classing of Nominal Variables
1Multivariate Coarse Classing of Nominal Variables
- Geraldine E. Rosario
- Talk given at Fair Isaac on July 14, 2003
- Based on the paper "Mapping Nominal Values to Numbers for Effective Visualization", InfoVis 2003.
2Outline
- Motivation
- Overview of Distance-Quantification-Classing approach
- Algorithmic Details
- Experimental Evaluation
- Wrap-Up
3Those pesky nominal variables
- Nominal variable: a variable whose values do not have a natural ordering or distance
- High-cardinality nominal variable: one with a large number of distinct values
- Examples?
- Examples of business applications using nominal variables?
- Why do you usually pre-process/transform them before doing data analysis?
4Visualizing Nominal Variables
- Most data visualization tools are designed for numeric variables.
- What if a variable is nominal?
- Most tools that are designed for nominal variables cannot handle a large number of values.
5Quantified Nominal Variables
Are the order and spacing of values within each
variable believable?
6Coarse Classing Nominal Variables
- Possible ways of classing nominal variables with high cardinality
  - Domain expertise
  - Univariate: using information about the variable itself, e.g. based on frequency of occurrence of the attributes
  - Bivariate: using information from one other variable, e.g. relationship with predictor variable
  - Multivariate: based on the profile across several other variables, e.g. using cluster analysis
- Is multivariate coarse classing better?
7The approach
8Proposed Approach
- Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach
- Steps
  - Distance: transform the data so that the distance between 2 nominal values can be calculated (based on the variable's relationship with other variables)
  - Quantification: assign order and spacing to the nominal values
  - Classing (or intra-dimension clustering): determine which values are similar to each other and can be grouped together
- Each step can be done by more than one technique.
9Distance-Quantification-Classing Approach
10Example Input to Output
Task: Pre-process color based on its patterns across quality and size.
Data:
- Quality (3): good, ok, bad
- Color (6): blue, green, orange, purple, red, white
- Size (10): a to j
11Other Potential Uses of DQC as Pre-Processor
- For techniques that require numeric inputs: linear regression, some clustering algorithms (can speed up calculations but with some loss of accuracy)
- For techniques that require low-cardinality nominal variables: scorecards, neural networks, association rules
- FICO-specific
  - Multivariate coarse classing
  - ClusterBots: nominal variables could be quantified and distance calculations would be simpler. Could be applied to mixed variables?
  - Product groups, merchant groups
- Can you think of other uses?
12Details Details
13Distance Step Correspondence Analysis
- Used for analyzing n-way tables containing some measure of association between rows and columns
- Simple Correspondence Analysis (SCA): for 2 variables
- Multiple Correspondence Analysis (MCA): for > 2 variables. Uses SCA.
- Focused Correspondence Analysis (FCA): proposed alternative to MCA when memory is limited. Uses SCA.
- Reinvented as Dual Scaling, Reciprocal Averaging, Homogeneity Analysis, etc.
- Similar to PCA but for nominal variables
14Simple Correspondence Analysis The Basic Idea
Calculate the χ² statistic, which measures the strength of association between COLOR and QUALITY against the assumption of independence. Any deviation from independence will increase the χ² value.
Can we find similar COLORs based on their association with QUALITY?
Similar profiles
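
As a minimal sketch of the χ² idea above, the following computes the Pearson χ² statistic for a two-way counts table. The two small tables are hypothetical (they are not the deck's COLOR by QUALITY data), and the function name `chi_square` is chosen here for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a two-way counts table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are two colors, columns are two quality levels.
independent = [[10, 20], [20, 40]]   # both rows have the same profile
associated = [[30, 0], [0, 30]]      # each color maps to exactly one quality

print(chi_square(independent))  # 0.0
print(chi_square(associated))   # 60.0
```

Rows with identical profiles contribute nothing to χ², which is why similar profiles are the natural basis for grouping values.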
15Simple Correspondence Analysis Steps
- Normalize counts table
- Row percentage matrix: similar row profiles (blue, purple)
- Column percentage matrix: similar column profiles (ok, bad)
- Identify a few independent dimensions which can reconstruct the χ² value (SVD, eigenanalysis).
- Eigenvalues
- Scale the new dimensions such that χ² distances between row points are maximized.

Coordinates for independent dimensions:

Color    Dim1   Dim2
Blue    -0.02  -0.28
Green   -0.54   0.14
Orange   0.55   0.10
Purple   0.00  -0.25
Red     -0.50   0.20
White    0.57   0.19
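
The steps above can be sketched with NumPy as follows. This is one standard way to compute simple correspondence analysis (SVD of the standardized residuals), not necessarily the exact procedure used in the talk. The counts are hypothetical and will not reproduce the coordinates shown on the slide; `simple_ca` is an illustrative name.

```python
import numpy as np

def simple_ca(counts):
    """Simple correspondence analysis of a two-way counts table via SVD."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # normalized counts table
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # cell proportions under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                 # at most min(r, c) - 1 dimensions
    eigenvalues = sing[:k] ** 2          # importance of each dimension
    # Principal row coordinates: distances between rows are chi-square distances.
    row_coords = U[:, :k] * sing[:k] / np.sqrt(r)[:, None]
    return row_coords, eigenvalues

# Hypothetical COLOR x QUALITY counts (columns: good, ok, bad).
colors = ["blue", "green", "orange", "purple", "red", "white"]
counts = [[10, 20, 20],
          [30, 10, 5],
          [5, 10, 30],
          [20, 40, 40],   # same profile as blue, twice the volume
          [25, 10, 5],
          [5, 15, 30]]

row_coords, eigenvalues = simple_ca(counts)
for name, xy in zip(colors, row_coords):
    print(name, np.round(xy, 2))
```

In this made-up table blue and purple share a row profile, so they receive the same coordinates. The eigenvalues sum to χ² divided by the total count, which is the sense in which the dimensions reconstruct the χ² value.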
16Simple Correspondence Analysis The Output
- Coordinates matrix
  - Set of independent dimensions
  - Dimensions ordered by diminishing importance
  - Total number of independent dimensions: min(r, c) - 1
  - Similar to principal components from PCA
- Eigenvalues
  - Indicate the importance of each independent dimension
17Distance Step Alternative Multiple Correspondence Analysis
- Steps
  - BurtTable(rawdataMatrix) → burtMatrix
  - SCA(burtMatrix) → coordMatrix, evaluesVector
  - ReduceNDim(coordMatrix, evaluesVector) → coordMatrixSubset
- Input to SCA: the Burt table crosses all variables by all variables
[Figure: Burt table with rows and columns X1, X2, X3; each block is a counts table, e.g. "X1 by X1 counts table" and "X1 by X2 counts table"]
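
A minimal sketch of the BurtTable step, assuming the raw data arrives as a list of dictionaries (one per record). The record layout, the sample records and the function name `burt_table` are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def burt_table(records, variables):
    """Cross every (variable, value) pair with every other pair and count co-occurrences."""
    # One row and one column per (variable, value) pair, in sorted order.
    labels = sorted({(v, rec[v]) for rec in records for v in variables})
    counts = Counter()
    for rec in records:
        present = [(v, rec[v]) for v in variables]
        for a, b in product(present, repeat=2):
            counts[a, b] += 1
    matrix = [[counts[a, b] for b in labels] for a in labels]
    return labels, matrix

# Hypothetical records.
records = [
    {"color": "blue", "quality": "good"},
    {"color": "blue", "quality": "ok"},
    {"color": "red", "quality": "good"},
]
labels, matrix = burt_table(records, ["color", "quality"])
for label, row in zip(labels, matrix):
    print(label, row)
```

The matrix is square with one row per distinct value of every variable, which is why MCA becomes memory-intensive for many high-cardinality variables.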
18Multiple Correspondence Analysis
- Features
- For a given variable, determines which values are
similar to each other by comparing value profiles
across all other variables - multivariate
- maximizes usage of information
- memory-intensive
- Simultaneously analyzes all variables
- efficient calculations
19Reduce Number of Dimensions to Keep
- Reduce the number of independent dimensions to keep for subsequent analysis (due to the large number of analysis variables and high cardinality)
[Figure: plot of eigenvalue by dimension (1 to 5)]
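
The slide does not state the rule used to pick the number of dimensions. One common choice, assumed here, is to keep the leading dimensions until their eigenvalues reach a chosen share of the total. The threshold `min_share` and the eigenvalues below are hypothetical.

```python
def n_dims_to_keep(eigenvalues, min_share=0.85):
    """Smallest number of leading dimensions whose eigenvalues reach min_share of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= min_share:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, already ordered by diminishing importance.
evalues = [0.50, 0.30, 0.10, 0.06, 0.04]
print(n_dims_to_keep(evalues))  # 3
```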
20Distance Step Alternative Focused Correspondence Analysis
- Proposed alternative to MCA when memory space is limited
- Core idea: instead of comparing value profiles across all other nominal variables, just compare value profiles across the nominal variables which are most correlated with the target variable
- Input to Simple CA
[Figure: target variable Xi crossed with analysis variables X9, X1, X3; blocks labeled "Xi by X3 counts table" and "Xi by X1 counts table"]
21Focused Correspondence Analysis
- Steps
  - PairwiseAssociate(rawdataMatrix) → assocMatrix
  - Set k (number of analysis variables to use)
  - FCATable(rawdataMatrix, k, assocMatrix) → fcaInputMatrix
  - SCA(fcaInputMatrix) → coordMatrix, evaluesVector
  - ReduceNDim(coordMatrix, evaluesVector) → coordMatrixSubset
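
A minimal sketch of the FCATable step, assuming the input table is the target variable's counts tables against each selected analysis variable placed side by side. The record layout, the sample records and the name `fca_table` are illustrative assumptions.

```python
from collections import Counter

def fca_table(records, target, analysis_vars):
    """Counts of each target value against every (analysis variable, value) pair."""
    row_labels = sorted({rec[target] for rec in records})
    col_labels = sorted({(v, rec[v]) for rec in records for v in analysis_vars})
    counts = Counter()
    for rec in records:
        for v in analysis_vars:
            counts[rec[target], (v, rec[v])] += 1
    matrix = [[counts[r, c] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, matrix

# Hypothetical records.
records = [
    {"color": "blue", "quality": "good", "size": "a"},
    {"color": "blue", "quality": "ok", "size": "a"},
    {"color": "red", "quality": "good", "size": "b"},
]
rows, cols, table = fca_table(records, "color", ["quality", "size"])
print(rows)   # ['blue', 'red']
print(table)  # [[1, 1, 2, 0], [1, 0, 0, 1]]
```

Only the target's values appear as rows, so the table grows with the k chosen analysis variables rather than with all variables.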
22FCA Calculate Pairwise Association
- Used the Uncertainty Coefficient U(R|C) to measure strength of nominal association
- Bounded in [0, 1]
- U(R|C) = 1 → the value of row variable R can be known precisely given the value of column variable C
- Example U(R|C) association matrix
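
A minimal sketch of the uncertainty coefficient, using the entropy form U(R|C) = (H(R) + H(C) - H(RC)) / H(R). The tables are hypothetical and the function names are illustrative; the row variable is assumed to have more than one observed value so that H(R) is non-zero.

```python
from math import log

def entropy(counts):
    """Shannon entropy of a list of counts (natural log)."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def uncertainty_coefficient(table):
    """U(R|C): share of the row variable's entropy explained by the column variable."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cells = [c for row in table for c in row]
    h_r, h_c, h_rc = entropy(row_totals), entropy(col_totals), entropy(cells)
    return (h_r + h_c - h_rc) / h_r

# Hypothetical tables: the column value fully determines the row value here...
determined = [[30, 30, 0], [0, 0, 30]]
# ...but not the other way round once the table is transposed.
transposed = [list(col) for col in zip(*determined)]

print(uncertainty_coefficient(determined))   # 1.0 up to rounding
print(uncertainty_coefficient(transposed))   # below 1
```

The measure is asymmetric, so an association matrix built from it is in general not symmetric either.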
23FCA Determine top k associated variables for each nominal variable
- Set k > 2 to ensure use of at least one analysis variable per target variable
- Cannot use a threshold on the association measure
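
A minimal sketch of picking the most associated variables for one target from an association matrix. The matrix values are hypothetical, and the parameter `n_analysis` counts analysis variables excluding the target itself; how it relates to the slide's k is left open.

```python
def top_associated(assoc, target, n_analysis):
    """Names of the n_analysis variables most associated with target, excluding itself."""
    others = [(score, name) for name, score in assoc[target].items() if name != target]
    others.sort(reverse=True)  # highest association first
    return [name for score, name in others[:n_analysis]]

# Hypothetical association values for one target variable.
assoc = {
    "color": {"color": 1.0, "quality": 0.40, "size": 0.15, "shape": 0.05},
}
print(top_associated(assoc, "color", 2))  # ['quality', 'size']
```

Taking a fixed count always yields analysis variables for every target, whereas a cut-off on the association value could presumably leave some targets with none.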
24Focused Correspondence Analysis
- Features
  - One-at-a-time analysis
  - Less/controllable memory usage
  - Sub-optimal quantification compared to MCA
  - Requires pre-processing step to determine top correlated variables per target variable
  - Longer run time
25Quantification Step Modified Optimal Scaling
Nominal-to-numeric mapping

Coordinates for independent dimensions:

Color    Dim1   Dim2
Blue    -0.02  -0.28
Green   -0.54   0.14
Orange   0.55   0.10
Purple   0.00  -0.25
Red     -0.50   0.20
White    0.57   0.19

Optimal Scaling
Optimal Scaling goal: maximize the variance of the scores of the records, where score = average($q_{ij}$)
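
A minimal sketch of the nominal-to-numeric mapping, assuming the Dim1 coordinate of each value is used as its numeric scale value (the slide does not spell out which coordinates the mapping uses). The Dim1 values are copied from the coordinates table on this slide; `quantify` is an illustrative name.

```python
# Dim1 coordinates copied from the slide's coordinates table.
dim1 = {"blue": -0.02, "green": -0.54, "orange": 0.55,
        "purple": 0.0, "red": -0.50, "white": 0.57}

def quantify(values, scale):
    """Replace each nominal value with its numeric scale value."""
    return [scale[v] for v in values]

# The mapping gives the colors an order and a spacing.
ordered = sorted(dim1, key=dim1.get)
print(ordered)  # ['green', 'red', 'blue', 'purple', 'orange', 'white']
print(quantify(["red", "white", "blue"], dim1))  # [-0.5, 0.57, -0.02]
```

Under this mapping green and red sit close together at one end, orange and white at the other, and blue and purple in the middle.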
26Quantification Step Modified Optimal Scaling
- Problem with Optimal Scaling: perfect associations between variables are not recreated in the quantified versions
- Modified Optimal Scaling
  - Let p = number of eigenvalues equal to 1.0
  - If p > 1 then set
  - Else set
27Classing Step Hierarchical Cluster Analysis
Cluster Analysis weighted by counts
from FCA
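
A minimal sketch of count-weighted agglomerative clustering on quantified values. The slide does not name the linkage, so a Ward-style merge cost is assumed here; the clustering uses only the Dim1 coordinates from the earlier coordinates table, and the counts are hypothetical.

```python
def weighted_agglomerate(points, weights):
    """Repeatedly merge the pair of clusters with the lowest weighted merge cost."""
    # Each cluster is (member names, weighted mean position, total weight).
    clusters = [([name], x, weights[name]) for name, x in points.items()]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (_, xi, wi), (_, xj, wj) = clusters[i], clusters[j]
                cost = wi * wj / (wi + wj) * (xi - xj) ** 2  # Ward-style cost
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (mi, xi, wi), (mj, xj, wj) = clusters[i], clusters[j]
        merged = (mi + mj, (wi * xi + wj * xj) / (wi + wj), wi + wj)
        history.append((mi + mj, cost))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history

# Dim1 coordinates from the coordinates table; counts are hypothetical.
dim1 = {"blue": -0.02, "green": -0.54, "orange": 0.55,
        "purple": 0.0, "red": -0.50, "white": 0.57}
counts = {"blue": 50, "green": 45, "orange": 45,
          "purple": 100, "red": 40, "white": 50}

for members, cost in weighted_agglomerate(dim1, counts):
    print(members, round(cost, 4))
```

With these inputs the first three merges pair orange with white, blue with purple, and green with red. The order of the merges is the classing tree read from the bottom up.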
28Loss of Information due to Classing
- Determine variable V with highest association with target X.
- Create X by V counts table.
- Calculate total table measure of association (e.g., U(X|V)).
- Starting from the bottom of the tree, for every pair of nodes merged, calculate cumulative information loss.
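
A minimal sketch of measuring information loss for one merge. The slide suggests U(X|V) as the table measure and does not give the loss formula. This sketch swaps in mutual information I(X;V), the numerator of the uncertainty coefficient, because it cannot increase when rows are merged. The loss is defined here as the relative drop in that measure. Both choices are assumptions, and the table is hypothetical.

```python
from math import log

def mutual_information(table):
    """Mutual information between the row and column variables of a counts table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            if c > 0:
                mi += c / n * log(c * n / (row_totals[i] * col_totals[j]))
    return mi

def merge_rows(table, i, j):
    """Class rows i and j of the table together by summing their counts."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [row for k, row in enumerate(table) if k not in (i, j)] + [merged]

def information_loss(original, classed):
    """Relative drop in association after classing (0 means nothing was lost)."""
    return 1.0 - mutual_information(classed) / mutual_information(original)

# Hypothetical X by V counts: rows 0 and 1 have identical profiles, row 2 differs.
x_by_v = [[10, 20, 20], [20, 40, 40], [30, 10, 5]]
print(information_loss(x_by_v, merge_rows(x_by_v, 0, 1)))  # about 0
print(information_loss(x_by_v, merge_rows(x_by_v, 0, 2)))  # clearly positive
```

Applying `merge_rows` repeatedly in tree order and recomputing the loss against the original table gives the cumulative loss curve.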
29Distance-Quantification-Classing Approach
30Does this approach work?
31Experimental Evaluation
- Wrong quantification and classing will introduce artificial patterns and cause errors in interpretation
- Evaluation measures
  - Perception
    - Believability
    - Quality of visual display
  - Statistical
    - Quality of classing
    - Quality of quantification
  - Computational
    - Space: FCA uses less space
    - Run time: MCA is faster
32Test Data Sets
33Believability and Quality of Visual Display
- Given two displays resulting from different nominal-to-numeric mappings
  - Which mapping gives a more believable ordering and spacing?
  - Based on your domain knowledge, are the values that are positioned close together similar to each other?
  - Are the values that are positioned far from the rest of the values really outliers?
  - Which display has less clutter?
34Automobile Data Alphabetical
35Automobile Data MCA
Are these patterns believable?
36Automobile Data FCA
Are these patterns believable?
37PERF Data Alphabetical
Region-Country: 1-many; Country-Product: many-many
Are these associations preserved and revealed?
38PERF Data FCA
Region-Country: 1-many; Country-Product: many-many
Are these associations preserved and revealed?
39Quality of Classing
- Classing A is better than classing B if, given a classing tree, the rate of information loss with each merging is slower
Information loss due to classing for one variable
→ The lower the line, the slower the info loss, the better the classing.
→ Calculate the difference between the lines.
40Which classing is better depends on dataset
Distribution of difference between the lines.
41Quality of Quantification
- A quantification is good if
  - data points that are close together in nominal space are also close together in numeric space
  - two variables that are highly associated with each other have quantified versions that are also highly correlated
42MCA gives better quantification
→ Average Squared Correlation: higher value = better quantification
→ Correlation between MCA and FCA scales: how close are FCA scales to MCA scales?
43Had enough yet?
44Going back to Multivariate Coarse Classing
- Other issues
- Missing values
- Mixed or numeric variables as analysis variables
- Nominal values with small counts
- Robustness of quantification and classing
45Can you think of other uses of DQC at FICO?
- For techniques that require numeric inputs: linear regression, some clustering algorithms (can speed up calculations but with some loss of accuracy)
- For techniques that require low-cardinality nominal variables: scorecards, neural networks, association rules
- FICO-specific
  - Multivariate coarse classing
  - ClusterBots: nominal variables could be quantified and distance calculations would be simpler. Could be applied to mixed variables?
  - Product groups, merchant groups
- ???????
46Implementation
- SAS version exists
- PROC CORRESP, PROC CLUSTER, PROC FREQ
- C version in development
47Summary
- DQC is a general-purpose approach for pre-processing nominal variables for data analysis techniques requiring numeric variables or low-cardinality nominal variables
- DQC: multivariate, data-driven, scalable, distance-preserving, association-preserving
- FCA is a viable alternative to MCA when memory space is limited
- Quality of classing and quantification
  - depends on strength of associations within the data set
  - is in the eye of the user
48Yippee, it's over!
- Original InfoVis 2003 paper: "Mapping Nominal Values to Numbers for Effective Visualization"
  - http://davis.wpi.edu/xmdv/documents.html
- XmdvTool Homepage
  - http://davis.wpi.edu/xmdv
  - xmdv_at_cs.wpi.edu
- Code is free for research and education.
49References
- [Gre93] Greenacre, M.J. (1993). Correspondence Analysis in Practice. London: Academic Press.
- [Gre84] Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
- [Sta] StatSoft Inc. Correspondence Analysis. http://www.statsoftinc.com/textbook/stcoran.html
- [Fri99] Friendly, Michael (1999). "Visualizing Categorical Data." In Sirken, Monroe G. et al. (eds.), Cognition and Survey Research. New York: John Wiley & Sons.
- [Kei97] Keim, D. A. (1997). Visual Techniques for Exploring Databases. Invited Tutorial, Int. Conference on Knowledge Discovery in Databases (KDD'97), Newport Beach, CA.
- [Hua97b] Huang, Zhexue (1997). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.
- SAS Manuals (PROC CORRESP, PROC CLUSTER, PROC FREQ)
50What input tables can SCA accept?
- In general, SCA can use as input any table that has these properties:
  - The table must use the same physical units or measurements, and
  - The values in the table must be non-negative.
- The FCA input table satisfies these properties.
51Uncertainty Coefficient U(R|C)
Source: SAS PROC FREQ
52Average Squared Correlation
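- Given the raw data matrix $R = [r_{ij}]$, where the columns represent the variables, create a new matrix $Q = [q_{ij}]$, where $q_{ij}$ is the quantified version of $r_{ij}$. Let $Q_j$ be the $j$th column of $Q$.
- For each record $i$, calculate $\text{score}_i = \frac{1}{m}\sum_j q_{ij}$, the average of the record's quantified values over the $m$ variables.
- For each variable $j$, calculate $\text{corr}_j = \text{correlation}(Q_j, \text{score})$.
- Calculate the average of the squared correlations.
- Source: [Gre93]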
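
The calculation above can be sketched in plain Python as follows. The matrix Q is hypothetical, the function names are illustrative, and Pearson correlation is assumed for correlation(Q_j, score).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_squared_correlation(Q):
    """Q is a list of records, each a list of quantified values (one per variable)."""
    scores = [sum(rec) / len(rec) for rec in Q]   # score_i = average over j of q_ij
    columns = list(zip(*Q))                       # Q_j = jth column of Q
    corrs = [pearson(col, scores) for col in columns]
    return sum(r * r for r in corrs) / len(corrs)

# Hypothetical quantified data: two unrelated variables, four records.
Q = [[0, 0], [1, 0], [0, 1], [1, 1]]
print(average_squared_correlation(Q))            # 0.5 up to rounding

# Two perfectly aligned variables give the maximum value.
print(average_squared_correlation([[1, 1], [2, 2], [3, 3]]))  # 1.0 up to rounding
```

The sketch assumes every column and the score vector have non-zero variance; otherwise the correlation is undefined.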