Title: Discrimination Models and Variance Stabilizing Transformations of Metabolomic NMR Data
1Discrimination Models and Variance Stabilizing
Transformations of Metabolomic NMR Data
- Institute on Research and Statistics, Sacramento
- 04/08/04
- Parul Vora Purohit
2Biodata and omics
- Genome Project
- Genomics - Study of Genes
- Proteomics - Study of proteins
- Metabolomics - Study of metabolites
- cellomics, CHOmics, chromonoics, etc.
- Analytical techniques
- Microarray Spectroscopy
- Mass Spectroscopy
- NMR Spectroscopy
3NMR Spectroscopy
- Intense, homogeneous magnetic field
- High-powered RF transmitter capable of delivering short pulses
- 500 MHz pulses stimulate 1H nuclear spin transitions
- Probe which holds the coils used to excite and detect the signal
- Plot of signal vs. shift in frequency from the original pulse
- Shift measured in ppm (ratio relative to the original signal)
- Courtesy: Joseph Medendorp / Public Information / University of Kentucky
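The ppm scale mentioned above can be illustrated with a small sketch. It assumes the usual definition of chemical shift as the frequency offset from the reference divided by the spectrometer frequency, times one million; the function names are hypothetical and not from the slides.

```python
def hz_to_ppm(offset_hz, spectrometer_mhz=500.0):
    """Convert a frequency offset from the reference (Hz) to a shift in ppm."""
    # ppm = offset_hz / (spectrometer_mhz * 1e6) * 1e6, so the 1e6 factors cancel
    return offset_hz / spectrometer_mhz


def ppm_to_hz(shift_ppm, spectrometer_mhz=500.0):
    """Convert a shift in ppm back to a frequency offset in Hz."""
    return shift_ppm * spectrometer_mhz


# On a 500 MHz instrument, 1 ppm corresponds to 500 Hz
print(ppm_to_hz(1.0))
```

Because the shift is a ratio, the same metabolite peak appears at the same ppm value regardless of the field strength of the instrument.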
4NMR Data
- Allows detection of compounds with H content
- Shift characterizes the chemicals (metabolites)
- Examples
- 2.14 ppm - glutamine ? CH2 group
- 2.27 ppm - valine β CH group
- 6.91 ppm - tyrosine C3,5 H ring
- 65,000 points (variables) per sample
5Questions
- Classification: Can we distinguish sick organisms from the healthy ones?
- Identification: Which metabolites play a role in the disease (biomarkers)?
- DIFFERENCES IN THE DETAILS!
6Abalone Data
- A set of 18 abalone
- 8 healthy, 5 stunted, 5 sick
- Tissue from muscle
- Questions
- Can we classify the abalone accurately?
- Can we detect any metabolites that are markers?
7Problems / Solutions
- Multivariate Techniques
- Matrix of 65,000 (variables) x 18 (samples)
- Too many variables compared to the number of samples
- Dimension Reduction by Binning
- Classification and metabolite marker identification using PCA and Cluster Analysis
- Methods assume that the data are normally distributed with a constant variance
- Generalized Log Transformation improves results!
8NMR Data Pre-Processing
- Background Subtraction
- TMSP Peak (standard at 0 ppm) removed
- Water Peak Removal
- (4.72-4.96 ppm removed)
- Normalization
- Integrated intensity normalized to 1.0 to remove the effects of systematic intensity changes between abalone
- Binning / Size
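The peak-removal and normalization steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for the abalone data: background subtraction is omitted, the width of the region removed around the TMSP peak (`tmsp_halfwidth`) is an assumed value, and "integrated intensity" is approximated by a plain sum, which assumes evenly spaced points.

```python
import numpy as np


def preprocess(ppm, intensity, water=(4.72, 4.96), tmsp_halfwidth=0.1):
    """Drop the TMSP and water regions, then scale total intensity to 1.0."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Remove the TMSP standard at 0 ppm (the half-width is an assumption)
    keep = np.abs(ppm) > tmsp_halfwidth
    # Remove the water peak region given on the slide
    keep &= ~((ppm >= water[0]) & (ppm <= water[1]))

    ppm, intensity = ppm[keep], intensity[keep]
    # Normalize so the summed intensity of each spectrum equals 1.0
    return ppm, intensity / intensity.sum()
```

Normalizing each spectrum separately is what removes sample-to-sample differences in overall signal strength before the spectra are compared.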
9Binned Spectrum
Bin Size 0.04 ppm, 239 Bins
- Bin Size Range: 0.00125 ppm to 0.7 ppm
- Intensity of Bin = Integrated Intensity of all points in Bin
- Restricted Region of interest to 0.2 ppm - 10.0 ppm
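The binning rule above (bin intensity equals the summed intensity of all points in the bin) can be sketched with NumPy. The function name and argument defaults are illustrative; the defaults mirror the 0.04 ppm bin size and the 0.2 to 10.0 ppm region from the slide.

```python
import numpy as np


def bin_spectrum(ppm, intensity, lo=0.2, hi=10.0, width=0.04):
    """Sum the intensities of all points falling in each fixed-width ppm bin."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    n_bins = int(round((hi - lo) / width))
    edges = lo + width * np.arange(n_bins + 1)

    # With weights, np.histogram sums the intensity in each bin;
    # points outside [lo, hi] are ignored
    binned, _ = np.histogram(ppm, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, binned
```

With these defaults the region splits into 245 bins. Dropping the six bins that span the removed water region (4.72-4.96 ppm) would leave 239, which matches the count on the slide, although the slides do not state that this is how the number arises.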
10Principal Component Analysis (PCA)
- Technique that allows for the explanation of the variance-covariance of the variables in terms of linear combinations of them
- $X = t_1 p_1^T + t_2 p_2^T + \dots + t_k p_k^T + E$, where the $p_i$ are eigenvectors
- Projections of the original data matrix on these components give the relations between the samples (Scores Plot)
- A plot of the eigenvectors of the covariance matrix gives a relationship between the variables (Loadings Plot)
- Reduces the dimension of the problem: a few components suffice to explain the variance
- Courtesy: Wise, B. M. and Gallagher, N. B., PLS_Toolbox 2.1
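A minimal sketch of the decomposition above, using NumPy's SVD rather than PLS_Toolbox. It assumes the data matrix is arranged as samples x variables (the transpose of the 65,000 x 18 layout quoted earlier) and that the columns are mean-centered before decomposition; the function name is hypothetical.

```python
import numpy as np


def pca(X, k=2):
    """Return scores T (samples x k), loadings P (variables x k) and the
    fraction of variance explained by each of the first k components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each variable

    # Right singular vectors of the centered data are the eigenvectors
    # of its covariance matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    T = U[:, :k] * s[:k]  # scores: projections of the samples
    P = Vt[:k].T          # loadings: the eigenvectors p_i
    explained = s**2 / np.sum(s**2)
    return T, P, explained[:k]
```

Plotting the first two columns of `T` against each other gives a scores plot, and plotting a column of `P` against bin position gives a loadings plot. The centered data equal `T @ P.T` plus the residual E left by the discarded components.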
11PCA Results
Scores Plot
Loadings Plot
12Cluster Analysis - Hierarchical
Transformed Data Groups Clearly Identified
Untransformed Data
13Generalized Log Transformation
- It has been shown that a transformation of the form
- $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$
- can have a variance-stabilizing effect on the data
- The parameter c can be obtained by Maximum Likelihood or ANOVA methods and has the value
- $c = s^2 / S^2$
- where $s^2$ is the variance of the noise and $S^2$ the variance of the high peaks
- Durbin, B., Hardin, J., Rocke, D. M., Bioinformatics, 2002, 18, S105-S110
- Sue Geller, Jeff Gregg, Paul Hagerman, David Rocke, Transformation and Normalization of Oligonucleotide Microarray Data, 2003
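The generalized log transformation is short enough to write out directly. The sketch below uses only the standard library; the function name `glog` is a label chosen here, and the value of c in the example is the one quoted later in the slides for the 0.04 ppm binned data.

```python
import math


def glog(y, c):
    """Generalized log: ln(y + sqrt(y^2 + c)), defined for any real y when c > 0."""
    return math.log(y + math.sqrt(y * y + c))


c = 2.7e-7
for y in (-1e-4, 0.0, 1e-4, 1e-2, 1.0):
    print(y, glog(y, c))
```

For intensities well above sqrt(c) the function behaves like ln(2y), an ordinary log transform. Near zero it is approximately linear in y, so small or slightly negative baseline values stay finite instead of diverging as they would under a plain logarithm.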
14Maximum Likelihood
- Need replicates to determine SSE(c) accurately
- Find the c that gives the minimum SSE
- Step through values of c using Newton's method or educated intervals
- Box, G. and Cox, D. R. (1964) An analysis of transformations. J. Roy. Stat. Soc., Series B (Methodological), 26, 211.
Plot: Error Sum of Squares vs. c
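The search for c can be sketched as a grid search over candidate values ("educated intervals"). This is an illustration under stated assumptions, not the procedure used in the talk: the slides do not spell out how SSE(c) is made comparable across values of c, so the sketch borrows the Box-Cox device of scaling the transformed data by the geometric mean of sqrt(y^2 + c), the inverse of the transformation's derivative. All function names are hypothetical.

```python
import numpy as np


def glog(y, c):
    """Generalized log: ln(y + sqrt(y^2 + c))."""
    return np.log(y + np.sqrt(y**2 + c))


def scaled_sse(reps, c):
    """Error sum of squares of Jacobian-scaled glog data.

    reps: array of shape (n_replicates, n_bins), replicate spectra of one sample.
    """
    reps = np.asarray(reps, dtype=float)
    # d/dy glog(y, c) = 1 / sqrt(y^2 + c); scale by the geometric mean of
    # sqrt(y^2 + c) so SSE values for different c are comparable (assumption)
    scale = np.exp(np.mean(0.5 * np.log(reps**2 + c)))
    z = glog(reps, c) * scale
    resid = z - z.mean(axis=0, keepdims=True)  # deviation from each bin's mean
    return float(np.sum(resid**2))


def estimate_c(reps, grid):
    """Return the candidate c on the grid with the smallest scaled SSE."""
    sse = [scaled_sse(reps, c) for c in grid]
    return float(grid[int(np.argmin(sse))])
```

A logarithmically spaced grid such as `np.logspace(-10, -2, 33)` is a reasonable starting point when the scale of c is unknown; the grid can then be refined around the minimum.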
15Transformed Spectrum
- Calculate c from the replicate data by maximum likelihood methods
- Transform the data to stabilize the variance, using a transformation of the form
- $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$
Bin Size 0.04 ppm, 239 Bins, c = 2.7e-7
16Stabilized Variance
Bin Size 0.04 ppm
Bin Size 0.04 ppm, c = 2.7e-7
17Scores Plot Transformation Effects
Untransformed Data
Transformed Data
18Loadings Plot Transformation Effects
Untransformed Data
Transformed Data
19Cluster Analysis - Hierarchical
Transformed Data Groups Clearly Identified
Untransformed Data
20Raw Spectra Significant Bins
Panels (two plots): Healthy, Stunt., Sick
Glycogen, Sucrose, Fructose?
- Bin 124: 5.38 ppm
- Bin 125: 5.42 ppm
- Bin 126: 5.46 ppm
- Bin 76: 3.22 ppm
- Bin 77: 3.26 ppm
- Bin 78: 3.30 ppm
21Conclusions
- Demonstrated the use of data reduction techniques and multivariate techniques for studying NMR and Mass Spectrometer data
- Demonstrated the use of these techniques to identify metabolite and protein biomarkers
- Showed the value of transformations in rendering the data more useful
22Acknowledgements
- David M. Rocke, CIPIC
- David L. Woodruff, CIPIC
- Mark R. Viant, U. of Birmingham, U. K.