1Multivariate Data Analysis
Overview of Methods
2Motivation
- Sophisticated multivariate statistical methods
are becoming standard practice in the physical,
natural and social sciences, as well as in
business
- Variations of existing methods are being
developed, existing techniques are being applied
to new applications, and new methods continue to
be designed
3Motivation
- The accelerated use of advanced multivariate
techniques is being driven by:
- Growing complexity in the topics being addressed
- Ever-larger data sets
- Ability to apply computationally intensive
methods through powerful computer tools
- Academic training
4Overview of Multivariate Data Analysis Methods
5Vocabulary Data types
6Vocabulary Variable Types
- Response vs explanatory
- Response or dependent variable
- Variable to be modeled or predicted
- Explanatory or independent variable
- Variables used to predict or model the dependent
variable
- Importance of identifying data and variable types
- Critical in determining analysis objectives and
the appropriate analysis method
- Avoid inappropriate variable operations
7Classification of Methods
- Dependence techniques
- One variable or a set of variables is regarded
as dependent
- Objective is to predict or explain the value of
the dependent variable(s) based on the values of
a set of independent variables
- Examples
- What is the probability that a loan applicant
will default?
- What factors best differentiate people whose
primary news source is the Internet?
8Classification of Methods
- Dependence techniques
- Multiple regression
- Logistic regression
- Discriminant analysis
- Canonical correlation
- Structural equation modeling
- Analysis of variance
- Decision trees
9Classification of Methods
- Interdependence techniques
- No single group of variables is defined as
dependent or independent
- Objective is to identify and characterize the
underlying structure among the variables
- Examples
- What are the underlying factors that define a
customer's perception of a brand?
- Which signal returns arise from the same object
and how many objects are present?
10Classification of Methods
- Interdependence techniques
- Factor analysis
- Multidimensional scaling
- Correspondence analysis
- Cluster analysis
11Classification of Methods
- Interdependence techniques are valuable data
reduction methods
- Data reduction attempts to manage and interpret
the large amounts of data gathered
- One goal is to combine groups of cases measured
over multiple variables into a relatively small
number of understandable segments
- Another is to group variables together into
categories of latent traits and then characterize
cases with respect to this smaller number of traits
- The reduced data variables are then often used as
variables in dependence techniques
12Multiple Regression
- Multiple regression is a dependence technique
used to model the relationship between the value
of a single metric dependent variable and a set
of metric independent variables
- Categorical variables can be included as dummy
variables
- Model can be applied to predict changes in the
dependent variable's response to changes in the
independent variables
- Regression also indicates the relative importance
of independent variables on the response of the
dependent variable
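As a rough illustration of the dummy-variable idea on this slide, the sketch below codes a categorical variable as 0/1 indicator columns. The function name `dummy_code` and the `region` data are made up for illustration; one level is dropped as the reference so the indicators are not redundant with a regression intercept.

```python
def dummy_code(values, reference=None):
    """Turn a categorical column into 0/1 indicator columns.

    One level (the reference) is dropped so that the indicators are
    not perfectly collinear with a regression intercept.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows


# Hypothetical data: sales region for five customers.
region = ["north", "south", "west", "south", "north"]
columns, coded = dummy_code(region)
```

Each row of `coded` can then be appended to the metric independent variables for that case.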
13Multiple Regression
- For example, a client may be interested in
understanding the effect of price and promotional
activity on a product's market share among both
loyal and not-loyal customers
- Technical result is a linear model of the form
- $Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$
- The best visualizations of the results hold all
but one (or two) of the independent variables
fixed and examine how the value of the dependent
variable changes with respect to the free
independent variables
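The linear model on this slide can be sketched in a few lines. The example below is a minimal illustration, not the method used for the client study: it fits the coefficients by ordinary least squares through the normal equations, and the `price`, `promo` and `share` numbers are invented (they were generated from share = 30 - 2*price + 5*promo, so the fit should recover values close to those).

```python
def fit_linear_model(X, y):
    """Ordinary least squares via the normal equations (X'X) a = X'y.

    X is a list of rows of independent variables; an intercept column
    of ones is added here. Returns [a0, a1, ..., an].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                f = A[k][col] / A[col][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data: each row is [price, promo], where promo is a 0/1 dummy.
price_promo = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 1]]
share = [28, 31, 24, 27, 25]
a0, a1, a2 = fit_linear_model(price_promo, share)
```

Solving the normal equations directly is fine for a toy example; for real data a numerically more stable library routine would normally be preferred.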
14Multiple Regression
[Figure: market share for loyal customers and market share for not-loyal customers]
15Multiple Regression
- Properties
- Single interval scale dependent variable
- Multiple independent variables, preferably on
interval scale
- Familiar and useful technique
- Issues
- Assumes linear relationship between dependent and
independent variables
- Overused, and assumptions are often not fully
checked
- Often misapplied to classification problems
16Logistic Regression
- Logistic regression is a dependence technique
used to model the relationship between a single
categorical dependent variable and a set of
metric independent variables
- Typically the dependent variable takes one of two
values: success/failure, buy/do not buy
- Multinomial formulations handle dependent
variables with more than two values
- A logistic model gives the probability that the
dependent variable takes a target value given the
values of the independent variables
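To make the last point concrete, here is a minimal sketch of a two-value logistic model: a function that turns coefficients and independent-variable values into a probability, plus a plain gradient-ascent fit. The function names, the learning rate, the step count and the data are all assumptions chosen for illustration; statistical packages typically fit the same model with faster, more robust algorithms.

```python
import math


def predict_probability(coefs, x):
    """P(target value | x) under a logistic model; coefs[0] is the intercept."""
    z = coefs[0] + sum(c * xi for c, xi in zip(coefs[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, rate=0.1, steps=5000):
    """Fit coefficients by plain gradient ascent on the log-likelihood."""
    coefs = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(coefs)
        for x, yi in zip(X, y):
            err = yi - predict_probability(coefs, x)
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        coefs = [c + rate * g / len(X) for c, g in zip(coefs, grad)]
    return coefs


# Hypothetical data: one metric predictor, 0/1 outcome (1 = target value).
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
coefs = fit_logistic(X, y)
p_low = predict_probability(coefs, [1])
p_high = predict_probability(coefs, [8])
```

The fitted model returns a probability for any value of the predictor, which is then compared with a cut-off if a yes/no classification is needed.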
17Logistic Regression
- For example, which credit and demographic factors
best predict whether a customer will keep a loan
current?
- Dependent variable is taken as 60 days past due
or worse
- Independent variables are credit and employment
history, and demographic descriptors
18Logistic Regression
- Properties
- Powerful technique for predicting group
membership and identifying important independent
variables
- Becoming more widely used
- Procedures and results similar to linear
regression
- Issues
- Adequate data
- Model validation
- Communicating probabilistic concepts
19Decision Trees
- Decision trees are a dependence technique used to
develop a model to classify the value of a
single dependent variable based on a set of
independent variables
- Dependent and independent variables can be any
data type
- The typical product of a classification and
regression tree (CART) analysis is a
straightforward, easily interpretable set of
segmentation rules
- For example, classify existing customers as high
or low likelihood buyers of a new product based
on demographics and historical purchasing
behavior. Classification could be used to focus
an advertising campaign
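A segmentation rule of the kind described above can be illustrated with a single split. The sketch below is not a full CART implementation (a real tree applies the search recursively and is usually pruned); it only finds the one variable and threshold that minimise the weighted Gini impurity. The customer data and the meaning of the columns are invented.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Find (score, variable index, threshold) minimising weighted Gini impurity.

    Rule form: rows with value <= threshold go left, the rest go right.
    """
    best = None
    for j in range(len(rows[0])):
        # Skip the largest value so the right-hand group is never empty.
        for t in sorted(set(r[j] for r in rows))[:-1]:
            left = [lab for r, lab in zip(rows, labels) if r[j] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Hypothetical customers: [age, past purchases] and buying likelihood.
rows = [[25, 0], [32, 1], [47, 5], [51, 6], [38, 4], [29, 0]]
labels = ["low", "low", "high", "high", "high", "low"]
score, variable, threshold = best_split(rows, labels)
```

The result reads as a rule such as "customers with variable `variable` at or below `threshold` fall in one segment, the rest in the other".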
20Decision Trees
- Decision trees can also be used to examine
profiles of different market segments with
respect to underlying demographic and
psychographic variables
- For example, what are the most significant
demographic variables determining whether the
Internet is a person's most important information
source?
21Decision Trees
22Decision Trees
- Properties
- Single dependent variable of any scale
- Multiple independent variables of any scale
- Free of model assumptions typical in other
dependence techniques
- Powerful statistical learning algorithm able to
identify complex variable interactions
- Issues
- Not as familiar
- Standard inferential statistics not applicable
- Often leads to asymmetric relationships
23Factor Analysis
- Factor analysis is an interdependence technique
used to identify a set of underlying latent
traits (factors) that explain the correlations
between a large number of variables
- Data summarizing
- Derive a set of underlying concepts that
summarize a larger set of variables
- Data reduction
- Develop a set of factor variables that serves as
a more parsimonious description of the data
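The idea of summarising correlated variables with a smaller set of factors can be sketched with a deliberately simplified stand-in: extracting only the first principal component of the correlation matrix and reporting each variable's loading on it. This is the principal component method for a single factor, not a full factor analysis (no rotation, no unique variances, no choice of the number of factors). The ratings and function names are invented.

```python
import math


def correlation_matrix(data):
    """Pearson correlation matrix of the columns of `data` (a list of rows)."""
    n, p = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in data)) for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in data)
             / (sds[i] * sds[j]) for j in range(p)] for i in range(p)]


def first_factor_loadings(R, iterations=200):
    """Loadings on the first principal component of correlation matrix R.

    Power iteration finds the dominant eigenvector v and eigenvalue lam;
    the loadings are v * sqrt(lam).
    """
    v = [1.0] * len(R)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(rij * vj for rij, vj in zip(row, v)) for row in R]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return [x * math.sqrt(lam) for x in v], lam


# Hypothetical survey: six respondents rate four traits on a 1-7 scale.
ratings = [[6, 5, 6, 5], [2, 3, 2, 3], [5, 5, 4, 4],
           [3, 2, 3, 4], [7, 6, 6, 6], [1, 2, 2, 1]]
R = correlation_matrix(ratings)
loadings, eigenvalue = first_factor_loadings(R)
```

The eigenvalue divided by the number of variables gives the share of total variance that this single factor accounts for.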
24Factor Analysis
- Interested in defining the underlying dimensions
influencing the perception of online destinations
- Survey respondents are asked to rate a set of
destinations (including the client's) with
respect to a number of traits
- Factor analysis can be applied to develop a
succinct set of perception dimensions
- This manageable set of dimensions can be used to
characterize a client's site and to develop a
focused plan to reposition it
25Factor Analysis
- Factors can then be used to provide a visual
summary of the data
[Figure: visual summary on the factors Competence, Sophistication, Trustworthy and Exciting]
26Factor Analysis
- Properties
- Very useful in identifying structure and
relationships in data
- Provides tractable set of concepts for both
managerial and analytical uses
- Provides opportunities for visualizations
- Issues
- Questionnaire design
- Variable selection
- Factor interpretation and validity
27Cluster Analysis
- Cluster analysis is an interdependence technique
used to segment cases into homogeneous groups
based on a specified set of variables
- Data reduction
- Develop a more parsimonious description of cases
which can then be used in analytical
classification methods
- Identify similarities between cases with respect
to clustering variables
- Characterize clusters with respect to other sets
of variables
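As one possible illustration of segmenting cases into homogeneous groups, the sketch below implements basic k-means, which is only one of many clustering methods and is not necessarily the one used in the examples here. The points, the starting centers and the function name are invented; real analyses also need to choose the number of clusters and validate the result.

```python
def k_means(points, initial_centers, iterations=20):
    """Basic k-means: assign points to the nearest center, then recompute centers."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(c) for c in initial_centers]  # do not mutate the input
    assignment = []
    for _ in range(iterations):
        assignment = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
                      for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical cases rated on two clustering variables.
points = [[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [9, 9]]
assignment, centers = k_means(points, initial_centers=[[1, 2], [8, 9]])
```

The cluster label in `assignment` can then be cross-tabulated against variables that were not used in the clustering.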
28Cluster Analysis
- Want to identify and then characterize similar
groups of TV pilot shows based on survey
responses rating shows on various traits
- For one or two traits it may be possible to do
this subjectively. Cluster analysis provides an
objective method for multiple traits
- Clusters can be characterized with respect to
variables not used in the analysis, such as show
success, and cluster membership can be used as a
dependent variable in a classification method
29Cluster Analysis
- Cluster 1: Low likelihood of success
- Cluster 2: Moderate likelihood of success
- Cluster 3: High likelihood of success
30Cluster Analysis
- Properties
- Many cluster techniques are available for data of
all scales
- Can identify structure in large data sets that
may be difficult to discover in any other way
- Provides objective segmentation method
- Issues
- Selecting appropriate clustering method
- Determining appropriate number of clusters
- Validating clusters