Title: Visualization of Multivariate Data
1Visualization of Multivariate Data
Christine Steinhoff
Max Planck Institute for Molecular Genetics
Berlin, Germany
2Outline
- Motivation: DATA INTEGRATION
- Data types: EXPRESSION, aCGH, Patients Information
- Problems: DISTRIBUTION, SCALE
- Procedure: DISCRETIZATION, FILTERING, INDICATOR MATRIX, MCA, TOWARDS DISTANCE DEFINITION
- Results
3Data Sources
4... Basic output: Microarrays
Affymetrix
x(i,j): real value
Idea: x(gene i, slide j) should be correlated in some way with the number of mRNA molecules of the sequence of gene i in the sample on slide j.
5... What does arrayCGH aim at?
x(i,j): real value
Idea: x(gene i, slide j) should be correlated in some way with the number of DNA copies of the sequence of gene i in the sample on slide j.
6ArrayCGH
Intuitive idea: just take the same chip but put DNA extract instead of mRNA onto it!
-> Many more sophisticated methods have been developed.
Even though the concept is very similar, there are profound differences.
7DATA INTEGRATION
Patients' covariates: information on the patients under study
8PROBLEMS
- Discrete categories
- After appropriate normalization: approximately lognormal, symmetric
- Not symmetric, skewed
Scale and distribution differ!
9Motivation
10The AIM
Gene x1: Overexpressed
Gene x2: Amplified
Gene x3: Loss
Gene x4: Overexpressed
11Approaches
Generalized Singular Value Decomposition
Berger et al., Huang et al., Jefferey et al.
[Figure: two data matrices over the same m samples: aCGH (n genes) and Expr (p genes)]
Preprocessing: scale and distribution transformation
12Approaches
Berger et al., Huang et al., Jefferey et al.
[Figure: GSVD equations; surviving labels: EV, EV, i]
The columns of X are the generalized singular vectors of R.
13Approaches
- Problems
- Scaling, Distribution transformation
- Only two datasets
- Does not allow for categorical variables
Berger et al Huang et al Jefferey et al
14Data INPUT
Procedure
Discretization
Filtering
Indicator coding
Multiple Correspondence Analysis
15Step 1 Discretization
Patients covariates
arrayCGH
Expression
Categorical, e.g. staging, grading, smoking, mutation, ...
16Step 1 Discretization
arrayCGH: for example CBS (package DNAcopy), segmentation and discretization of arrayCGH data
Expression: for example fold-change criterion
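A fold-change criterion can be sketched in a few lines. The following is only an illustration, not the procedure used in the talk: the function name `discretize_log_ratios` and the cutoff of 1 (a two-fold change on the log2 scale) are assumptions, and the numbers are made up. The three states -1, 0 and 1 follow the state labels used later in the slides.

```python
import numpy as np

def discretize_log_ratios(log_ratios, threshold=1.0):
    """Map log2 ratios to states -1 (down), 0 (no change), 1 (up).

    `threshold` is a hypothetical cutoff: |log2 ratio| >= 1 is a two-fold change.
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    states = np.zeros(log_ratios.shape, dtype=int)
    states[log_ratios >= threshold] = 1
    states[log_ratios <= -threshold] = -1
    return states

# Toy matrix: rows are genes, columns are patients (made-up values).
expr = np.array([[ 1.8,  0.2, -0.1],
                 [-1.3,  0.4,  1.1],
                 [ 0.3, -0.2,  0.0]])
expr_states = discretize_log_ratios(expr)
```

The same function could be applied to segmented arrayCGH log ratios, with a cutoff chosen for that data type.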
17Step 1 Discretization
Patients covariates
arrayCGH
Expression
Typically n ≈ 23,000 -> reduce number
18Step 2 Filtering (optional)
- Suggestion
- Neglect all genes with no change in any patient
- Choose genes with highest variance across patients
- Select for high correlation between arrayCGH and expression
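The three filtering suggestions can be written as small helper functions. This is a sketch under assumptions: the helper names are hypothetical, rows are genes and columns are patients, Pearson correlation stands in for "correlation", and the toy values are invented.

```python
import numpy as np

def changed_in_any_patient(states):
    """True for genes (rows) whose discrete state is non-zero in at least one patient."""
    return np.any(np.asarray(states) != 0, axis=1)

def top_variance(values, k):
    """Indices of the k genes (rows) with the highest variance across patients."""
    variances = np.var(np.asarray(values, dtype=float), axis=1)
    return np.argsort(variances)[::-1][:k]

def rowwise_correlation(a, b):
    """Pearson correlation between matching rows of a and b (NaN for constant rows)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    with np.errstate(divide="ignore", invalid="ignore"):
        return (a_c * b_c).sum(axis=1) / denom

# Toy data: 3 genes x 4 patients (made-up values).
acgh = np.array([[0.9, 0.0, -0.8, 0.1],
                 [0.0, 0.0,  0.0, 0.0],
                 [0.1, 0.5,  0.2, 0.4]])
expr = np.array([[1.8,  0.1, -1.5, 0.2],
                 [0.1, -0.1,  0.0, 0.1],
                 [0.5,  0.1,  0.4, 0.2]])
acgh_states = np.array([[1, 0, -1, 0],
                        [0, 0,  0, 0],
                        [0, 0,  0, 0]])

keep_changed = changed_in_any_patient(acgh_states)
most_variable = top_variance(expr, k=2)
corr = rowwise_correlation(acgh, expr)
```

Each helper returns either a boolean mask or an index array, so the filters can be combined or applied one after another.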
19Step 3 Indicator Matrix - Binary Coding
Indicator matrix With binary coding
Original matrix With categories
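As an illustration of the binary coding, the sketch below turns a matrix of discrete states into an indicator matrix with one 0/1 column per gene and state. The function name `indicator_matrix`, the orientation (patients as rows, genes as columns) and the toy values are assumptions; the column labels follow the "G1 (-1)" style used on the MCA slide.

```python
import numpy as np

def indicator_matrix(states, categories=(-1, 0, 1)):
    """Binary-code a patients x genes matrix of discrete states.

    Returns (Z, labels): Z has one 0/1 column per (gene, category) pair,
    labels names each column, e.g. "G1 (-1)".
    """
    states = np.asarray(states)
    n_patients, n_genes = states.shape
    columns, labels = [], []
    for g in range(n_genes):
        for c in categories:
            columns.append((states[:, g] == c).astype(int))
            labels.append(f"G{g + 1} ({c})")
    return np.column_stack(columns), labels

# Toy example: 3 patients x 2 genes (made-up states).
states = np.array([[ 1, 0],
                   [-1, 0],
                   [ 0, 1]])
Z, labels = indicator_matrix(states)
```

Every row of `Z` sums to the number of genes, because each patient has exactly one state per gene. Columns for states that never occur are all zero and would have to be dropped before a correspondence analysis.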
21Step 4 Appending Matrices
[Figure: the indicator matrices A, E and P appended side by side; A and E are experimental, P is supplemental (covariates)]
22Multiple Correspondence Analysis with
supplementary Information
23Multiple Correspondence Analysis
Example label: Gene 251, state 1
Rows and columns are indexed by gene states: G1 (-1), G1 (0), G1 (1), G2 (-1), ...
Block matrix:
t(E)E   t(E)A
t(A)E   t(A)A
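A compact way to see what the analysis computes is a correspondence analysis of the indicator matrix via a singular value decomposition. The sketch below is an assumption-laden illustration, not the implementation behind the slides: it works on the indicator matrix rather than on the cross-product blocks shown above, the function name `mca` and all toy data are hypothetical, and supplementary columns (for example covariate states) are projected with the standard transition formula without influencing the axes.

```python
import numpy as np

def mca(Z_active, Z_sup=None, n_dims=2):
    """Correspondence analysis of an indicator matrix (patients x category columns).

    Returns principal coordinates of patients and active categories, the
    coordinates of supplementary categories (or None), and the singular values.
    """
    Z = np.asarray(Z_active, dtype=float)
    P = Z / Z.sum()
    r = P.sum(axis=1)                                   # row (patient) masses
    c = P.sum(axis=0)                                   # column (category) masses
    if np.any(r == 0) or np.any(c == 0):
        raise ValueError("drop empty rows/categories before running MCA")
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    U, sigma, Vt = U[:, :n_dims], sigma[:n_dims], Vt[:n_dims]
    rows = (U / np.sqrt(r)[:, None]) * sigma            # patients
    cols = (Vt.T / np.sqrt(c)[:, None]) * sigma         # active category states
    sup = None
    if Z_sup is not None:
        Zs = np.asarray(Z_sup, dtype=float)
        profiles = Zs / Zs.sum(axis=0)                  # column profiles
        sup = profiles.T @ (rows / sigma)               # transition formula
    return rows, cols, sup, sigma

# Toy data: 4 patients; active columns are gene states, empty states removed.
# Column order: G1 (-1), G1 (0), G1 (1), G2 (-1), G2 (0), G2 (1)
Z_active = np.array([[0, 0, 1, 0, 0, 1],
                     [0, 0, 1, 0, 1, 0],
                     [1, 0, 0, 1, 0, 0],
                     [0, 1, 0, 1, 0, 0]])
# Supplementary covariate with two states (hypothetical, e.g. grading high/low).
Z_sup = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])
rows, cols, sup, sigma = mca(Z_active, Z_sup)
```

Gene states and covariate states end up in the same low-dimensional space, which is what makes the angle and length comparisons on the following slides possible. The sketch assumes the retained singular values are non-zero.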
24Patients Information
25EXAMPLE PUBLISHED DATA
26Covariate States Display
27Gene States Display
28Towards Distance Definition
- Determine
- - Angle
- - Vector length
- Select genes according to a predefined angle
- Or
- Select genes according to angle and length
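The angle and vector length can be computed directly from the coordinates of a gene state and a covariate state in the reduced space. The sketch below assumes both are given as points relative to the origin of the plot; the function names, the 10 degree cutoff and the coordinates are hypothetical.

```python
import numpy as np

def angle_and_lengths(gene_point, covariate_point):
    """Angle in degrees between two points seen as vectors from the origin, plus their lengths."""
    g = np.asarray(gene_point, dtype=float)
    v = np.asarray(covariate_point, dtype=float)
    len_g, len_v = np.linalg.norm(g), np.linalg.norm(v)
    cos_angle = np.clip(g @ v / (len_g * len_v), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)), len_g, len_v

def select_genes(gene_points, covariate_point, max_angle_deg, min_length=0.0):
    """Indices of gene states within max_angle_deg of the covariate state
    and with vector length of at least min_length."""
    selected = []
    for i, g in enumerate(gene_points):
        angle, len_g, _ = angle_and_lengths(g, covariate_point)
        if angle <= max_angle_deg and len_g >= min_length:
            selected.append(i)
    return selected

# Made-up coordinates of three gene states and one covariate state.
gene_points = [[1.0, 0.1], [-0.5, 0.5], [0.2, 0.02]]
covariate_point = [1.0, 0.0]

by_angle = select_genes(gene_points, covariate_point, max_angle_deg=10.0)
by_angle_and_length = select_genes(gene_points, covariate_point,
                                   max_angle_deg=10.0, min_length=0.5)
```

Selecting by angle alone keeps gene states that point in the same direction as the covariate state even if they lie close to the origin; the additional length criterion removes those.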
29How to select candidate genes?
x1 = angle
x2 = 1 / (vector length of the covariate state)
x3 = 1 / (vector length of the gene state)
-> Minimize!
$L2_w(x_1, x_2, x_3) = \sqrt{w_1 x_1^2 + w_2 x_2^2 + w_3 x_3^2}$
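The weighted score can be evaluated for many candidate gene states at once and then sorted so that the smallest values come first. The following is a minimal sketch: the weights default to 1, the unit of the angle is left to the caller, and all numbers are invented for illustration.

```python
import numpy as np

def weighted_l2(x1, x2, x3, w=(1.0, 1.0, 1.0)):
    """Weighted L2 score sqrt(w1*x1^2 + w2*x2^2 + w3*x3^2); smaller is better."""
    w1, w2, w3 = w
    x1, x2, x3 = (np.asarray(x, dtype=float) for x in (x1, x2, x3))
    return np.sqrt(w1 * x1 ** 2 + w2 * x2 ** 2 + w3 * x3 ** 2)

# Made-up values for three candidate gene states and one covariate state.
angles = np.array([0.1, 2.0, 0.1])        # x1: angle to the covariate state
covariate_length = 1.0                    # vector length of the covariate state
gene_lengths = np.array([1.0, 0.7, 0.2])  # vector lengths of the gene states

scores = weighted_l2(angles, 1.0 / covariate_length, 1.0 / gene_lengths)
ranking = np.argsort(scores)              # best candidates first
```

The weights control the trade-off: a large w1 emphasizes a small angle, while large w2 and w3 favour states far from the origin.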
30How does the analysis compare with
- aCGH only
- expression only
- joint analysis?
32Explore ERBB2 and MYC, which were found in Berger et al.
ERBB2 amplified in aCGH
ERBB2 normal in aCGH
ERBB2 overexpression
33ERBB2 underexpression
ERBB2 loss in aCGH
34MYC overexpression
MYC amplification
35MYC normal in aCGH
MYC underexpression
36Enrichment of GO Categories
37SUMMARY
Pipeline for joint visualization of
(a) experimental continuous data, e.g. arrayCGH and expression data
(b) patients' covariates
Application: data set with parallel investigation of arrayCGH and expression in breast cancer patients; covariate data available
Determination of candidate gene sets: enrichment of specific cancer-related GO categories
38FURTHER DIRECTIONS AND OPEN QUESTIONS
- Integration of various data sources
- Appropriate discretization methods
- Avoid filtering by choosing an algorithm for decomposition of sparse matrices
- Evaluation scheme (problem of simulation and noise adding)
- Investigation of robustness
- ...
39ACKNOWLEDGEMENT
Sensor Lab, CNR-INFM
Max Planck Institute for Molecular Genetics
Martin Vingron
Matteo Pardo
40Gene Expression Arrays Technology
- Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995
- Lennon GG, Lehrach H. Hybridization analyses of arrayed cDNA libraries. Trends Genet. 1991
Commercial: 1998
Affymetrix
... In the meantime, several more!
41Gene Expression Arrays Technology
Affymetrix
42Which technology platforms are there?
Hybridization
Affymetrix
Red / green
...AATGGGTCAGAAGGACTCCTATGTGGGTG...
TTACCCAGTCTTCCTGAGGATACACCCAC
TTACCCAGTCTTGCTGAGGATACACCCAC
43... Some differences
Affymetrix
Red / green
- Nylon filter
- - one sample
- - radioactive signal
- - many spots possible
- - large area / local effects
- - signal bleeding between spots
- - only one sample per hybridization run
- Glass slide
- - red and green sample
- - fluorescence signal
- - up to 20,000 spots possible
- - simultaneous hybridization of sample and control (red/green)
- Chip
- - one probe set consisting of 16-20 replicates and the corresponding mismatches
- - commercial chip
- - good, reproducible data
- - only one sample per hybridization run