Title: Metabolite fingerprinting: detecting biological features by independent component analysis
1Metabolite fingerprinting: detecting
biological features by independent component
analysis
PCA
ICA
2Outline
- Motivation
- Introduction
- Metabolite fingerprint
- PCA
- ICA
- Materials and Methods
- 96 samples
- QTOF
- Results
- Results
- Kurtosis
- Conclusion
- Our strategy: for ICAT (heavy or light labeling)
3Motivation
- Metabolite fingerprinting is a technology that
provides information from spectra of the total
metabolite composition.
- The question is how metabolite fingerprinting
reflects the biological background.
- In many applications, classical principal
component analysis (PCA) is used to detect
relevant information.
- Because of its independence condition,
independent component analysis (ICA) is more
suitable for our questions than PCA.
- However, ICA was not developed for a small
number of high-dimensional samples, so a
strategy is needed to overcome this limitation.
5Introduction - Metabolite fingerprinting
- These analytical approaches cannot all be carried
out with full-composition metabolomic tests; they
call instead for a cheaper and faster first-round
screening method.
- Such a method groups data according to inherent
biological characteristics and distinguishes these
from unrelated background noise.
- Approaches that do this without individually
determining metabolite identities have been termed
metabolite fingerprinting (Fiehn, 2001) and were
successfully applied to discriminate strains of
bacteria using time-of-flight mass spectrometry
(Vaidyanathan et al., 2001) or other techniques
such as infrared spectroscopy (Thomas et al.,
2000).
- In biomedical fields, the same strategy is used
with nuclear magnetic resonance and is termed
metabonomics.
- One of the main questions in metabolite
fingerprinting is what the major pieces of
information provided by the spectra are, and
whether the information relates to the
experimental conditions or to some interfering
signals.
6PCA vs. ICA
- Principal component analysis seeks directions in
feature space that best represent the data in
the least-squares sense.
- Independent component analysis seeks directions
in the data that are most independent from one
another.
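The least-squares property of PCA can be illustrated with a minimal NumPy sketch. The function name `pca` and the synthetic two-dimensional data are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def pca(X, n_components):
    """PCA of X (samples x variables) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # orthonormal directions, by decreasing variance
    scores = Xc @ components.T                    # coordinates of each sample on those directions
    variances = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, variances

# Synthetic, correlated 2-D data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
scores, components, variances = pca(X, n_components=2)

# Keeping only the first direction gives the best rank-1 fit in the
# least-squares sense; the squared residual is the discarded variance.
scores1, components1, _ = pca(X, n_components=1)
X_hat = X.mean(axis=0) + scores1 @ components1
residual = np.sum((X - X_hat) ** 2)
```

The PCA scores are uncorrelated, but uncorrelated is weaker than independent, which is the property ICA targets.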
7Introduction - PCA
- One well-established technique for dimensionality
reduction and visualization is classical
principal component analysis (PCA), where the
extracted information is represented by a set of
new variables, termed components or features.
Diamantaras and Kung (1996) give a good overview
of different PCA algorithms.
- In the field of metabolomics, PCA became a
popular tool for visualizing datasets and for
extracting relevant information (Ward et al.,
2003; Urbanczyk-Wochniak et al., 2003).
- However, PCA is only powerful if the biological
question is related to the highest variance in
the dataset. If this is not the case, other
techniques of statistics or related fields may be
more helpful, depending on the biological
question, as shown by Goodacre et al. (2003) and
Johnson et al. (2003) for supervised techniques
in combination with validation and
pre-processing.
8Introduction - ICA
- In ICA, an independence condition is optimized,
which often gives more meaningful components than
optimizing only the variance, as PCA does.
- Because of this, the components of ICA are termed
independent components (ICs), meaning that
different ICs represent different, non-overlapping
information.
- To apply ICA, we assume that the observed data
have been determined by some unknown fundamental
factors, which are independent of each other.
- By searching for components that are as
statistically independent as possible, the
required factors can be detected. These
fundamental factors are often termed sources, and
the application field is called blind source
separation (BSS).
9The simple Cocktail Party Problem
[Diagram: the sources s1 and s2 pass through the mixing matrix A to give the observations x1 and x2]
x = As
n sources, m = n observations
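The mixing model x = As can be sketched in a few lines of NumPy. The two source signals and the entries of the mixing matrix below are hypothetical values chosen for illustration.

```python
import numpy as np

n_samples = 1000
t = np.linspace(0, 8, n_samples)

s1 = np.sin(2 * t)                    # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))           # source 2: square wave
S = np.vstack([s1, s2])               # shape (n sources, n_samples)

A = np.array([[1.0, 0.5],             # hypothetical mixing matrix:
              [0.5, 1.0]])            # each microphone hears both sources

X = A @ S                             # observations, one row per microphone

# If A were known, the sources would follow by solving the linear system.
# ICA addresses the harder case in which A is unknown.
S_recovered = np.linalg.solve(A, X)
```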
11Independent Component Analysis (ICA)
Without knowing the positions of the microphones
or what any person is saying, can you isolate each
of the voices?
12Independent Component Analysis (ICA)
Assumption: the sound from each speaker is
unrelated to the others (independent)
13ICA Example
- BSS of recorded speech and music signals.
http://www.cnl.salk.edu/tewon/ica_cnl.html
14ICA Separation
[Diagram: two independent sources are recorded as a mixture at two microphones; ICA gets the independent signals out of the mixture]
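A separation of this kind can be sketched with a small NumPy implementation of symmetric FastICA with a tanh nonlinearity. FastICA is swapped in here as one common ICA algorithm; the slides do not state which algorithm was used. The sources, the mixing matrix and all function names are illustrative assumptions.

```python
import numpy as np

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ W

def whiten(X):
    """Center X (signals x samples) and transform it to identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return (eigvecs / np.sqrt(eigvals)).T @ Xc

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Estimate independent sources from mixtures X (signals x samples)."""
    Z = whiten(X)
    n, m = Z.shape
    W = sym_decorrelate(np.random.default_rng(seed).normal(size=(n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, for every row w.
        W_new = G @ Z.T / m - (1.0 - G ** 2).mean(axis=1)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each new row is parallel to the old one (up to sign).
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Z

rng = np.random.default_rng(0)
n_samples = 5000
S = np.vstack([rng.laplace(size=n_samples),           # peaked, speech-like source
               rng.uniform(-1, 1, size=n_samples)])   # flat source
A = np.array([[1.0, 0.5], [0.5, 1.0]])                # hypothetical mixing matrix
X = A @ S                                             # mixture at two microphones

S_est = fast_ica(X)
# Absolute correlation between each true source (rows) and each estimate (columns).
match = np.abs(np.corrcoef(np.vstack([S, S_est]))[:2, 2:])
```

ICA recovers the sources only up to order, sign and scale, which is why the comparison uses absolute correlations rather than a direct difference.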
15Independent Component Analysis
- Possible applications for ICA include:
- Neurobiological modeling
- Radio and telephone communication
- Preprocessing
- EEG and MEG processing
17Materials and Methods
[Workflow diagram: 96 samples in total (Arabidopsis thaliana) → MS (electrospray/QTOF mass spectra) → weighted density function → 763 variables, 92 samples]
Hybrid vigour, or heterosis, shows up as interesting
features such as higher growth, better fitness
and improved resistance against biotic and
abiotic stress factors.
Therefore, we expected to find the largest
distance between the F1 groups and the parents,
the second largest difference between the two
parents and just a small difference or none at
all between the two F1 genotypes.
18Materials and Methods
763 variables, 92 samples
By applying PCA for visualization we have to
assume that the most interesting information is
directly related to the highest variance in the
data.
[Workflow diagram: PCA gives PCs; ICA then minimizes the dependence and gives ICs; a criterion is defined for sorting these components according to our interest; sorted by kurtosis, this yields the meaningful ICs]
19Fig. 1. Mass spectra comparison of different
Arabidopsis lines and their crosses. The
intensities are plotted against the mass
(mass-to-charge ratios, m/z). From each group one
sample is arbitrarily taken. The global structure
of the spectra is very similar. However, there are
differences between masses of smaller
intensities. Selecting the relevant information
is the challenge for our analysis.
20Fig. 2. Combined spectral data. Above, the
intensities are plotted against the mass (m/z)
for all mass-intensity pairs (given by the
highest peak in the spectra) over all samples.
Only the mass range of 115-119 amu of the total
range of 50-1500 amu is shown. For assigning the
mass values to a set of variables a density
function is used, shown below. The peaks of the
density function (marked by a plus, +) point to
high concentrations of mass values. The masses
around one peak (marked by a circle, ○) are
assigned to one variable; the residual
mass-intensity pairs are removed.
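One way such a density-based assignment could work is sketched below. The Gaussian kernel, the bandwidth, the peak-height threshold, the assignment window and the example masses are all assumptions for illustration; the slides do not give the actual density function or its parameters.

```python
import numpy as np

def density_on_grid(masses, weights, grid, bandwidth):
    """Weighted Gaussian-kernel density of the mass values, evaluated on a grid."""
    z = (grid[:, None] - masses[None, :]) / bandwidth
    return (weights[None, :] * np.exp(-0.5 * z ** 2)).sum(axis=1)

def local_maxima(density, min_height):
    """Indices of grid points that are local maxima and at least min_height high."""
    mid = density[1:-1]
    is_peak = (mid > density[:-2]) & (mid >= density[2:]) & (mid >= min_height)
    return np.where(is_peak)[0] + 1

def assign_to_peaks(masses, peak_masses, window):
    """Index of the nearest peak for each mass, or -1 if it lies outside the window."""
    nearest = np.argmin(np.abs(masses[:, None] - peak_masses[None, :]), axis=1)
    distance = np.abs(masses - peak_masses[nearest])
    return np.where(distance <= window, nearest, -1)

# Hypothetical mass values pooled over samples: three tight groups and one stray value.
masses = np.array([115.09, 115.10, 115.11, 115.99, 116.00, 116.01,
                   117.04, 117.05, 117.06, 118.50])
weights = np.ones_like(masses)        # equal weights; an intensity weighting is another option

grid = np.arange(115.0, 119.0, 0.005)
density = density_on_grid(masses, weights, grid, bandwidth=0.02)
peaks = local_maxima(density, min_height=0.5 * density.max())
peak_masses = grid[peaks]

# Each retained peak becomes one variable; masses labeled -1 are removed.
labels = assign_to_peaks(masses, peak_masses, window=0.05)
```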
21Fig. 3. PCA on normalized data. In each plot the
first two components of PCA are plotted against
each other. PCA is applied to differently
normalized datasets. Without any normalization
there is no clear separation between the
different groups. By scaling the metabolites to
unit variance, the parent generation can be
separated from the F1 generation. By scaling the
samples to unit vector norm, even the
parent-lines can be separated.
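The two normalizations named in this caption can be sketched as follows. The data are random stand-ins with the dimensions given in Materials and Methods (92 samples, 763 variables); whether the original analysis also centered the variables is not stated, so no centering is applied here.

```python
import numpy as np

def scale_variables_to_unit_variance(X):
    """Scale each column (metabolite variable) of X to unit variance."""
    std = X.std(axis=0, ddof=1)
    return X / np.where(std > 0, std, 1.0)       # leave constant columns unchanged

def scale_samples_to_unit_norm(X):
    """Scale each row (sample spectrum) of X to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)   # leave all-zero rows unchanged

# Synthetic stand-in for the intensity matrix (samples x variables).
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(92, 763))

X_unit_variance = scale_variables_to_unit_variance(X)
X_unit_norm = scale_samples_to_unit_norm(X)
```

Scaling the variables gives every metabolite the same weight regardless of its intensity level, while scaling the samples removes differences in overall signal strength between spectra.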
22Fig. 4. PCA on vector normalized data. The first
three principal components (PCs) are plotted
pairwise against each other. Note that the first
PC (of highest variance) is not related to our
problem of separating the sample groups. Better
results are given by components of smaller
variance, PC 2 and PC 3.
23Fig. 5. ICA compared to PCA. The best
PCA-visualization given by PC 2 and PC 3 is
plotted on the left. The different groups are
only partially separable. Compared to this the
ICA result, given by the two ICs of most negative
kurtosis, IC 1 and IC 2, is shown on the right.
ICA gives a projection of the data with a greater
separation between the different groups.
24Fig. 6. The third component of ICA (IC 3) carries no
information about the experimental groups (left).
However, it is related to the time at which
the samples were measured, shown on the right.
This technical factor could not be detected by
PCA.
26Kurtosis
- Kurtosis is a classical measure of
non-Gaussianity and is relatively simple, both
computationally and theoretically. It indicates
whether the data are peaked or flat relative to a
Gaussian (normal) distribution. With the
convention used here (excess kurtosis), a Gaussian
distribution has a kurtosis of zero. Positive
kurtosis indicates a peaked distribution
(super-Gaussian) and negative kurtosis indicates
a flat distribution (sub-Gaussian).
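A minimal sketch of this measure, using the excess-kurtosis convention in which a Gaussian scores zero. The function name and the three test distributions are chosen for illustration.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so that a Gaussian gives about zero."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return np.mean(xc ** 4) / np.mean(xc ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
n = 100_000

k_gaussian = excess_kurtosis(rng.normal(size=n))           # close to 0
k_laplace = excess_kurtosis(rng.laplace(size=n))           # positive: peaked, super-Gaussian
k_uniform = excess_kurtosis(rng.uniform(-1, 1, size=n))    # negative: flat, sub-Gaussian
```

The theoretical values are 0 for the Gaussian, 3 for the Laplace and -1.2 for the uniform distribution; the sample estimates scatter around them.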
High vs. Low Variance
These graphs illustrate the notion of variance.
The one on the left is more dispersed than the
one on the right. It has a higher variance.
27Kurtosis
- Examples of super-Gaussian distributions (highly
positive kurtosis) are speech signals, because
these are predominantly close to zero.
- Negative kurtosis can indicate:
- Cluster structure: this can resolve between two
experimental conditions (high and low
concentrations of metabolites).
- A uniformly distributed factor: this can
represent a continuously changed experimental
factor such as the temperature or the light
intensity.
- Thus the components with the most negative
kurtosis could give us the most relevant
information.
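The sorting criterion can be sketched on four synthetic components: a two-cluster factor, a uniformly distributed factor, Gaussian noise and a peaked, speech-like signal. The components and their parameters are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(Y):
    """Row-wise excess kurtosis of Y (components x samples)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return (Yc ** 4).mean(axis=1) / (Yc ** 2).mean(axis=1) ** 2 - 3.0

rng = np.random.default_rng(0)
n = 20_000

peaked = rng.laplace(size=n)                                   # super-Gaussian
gaussian = rng.normal(size=n)                                  # kurtosis near zero
uniform = rng.uniform(0, 1, size=n)                            # continuously varied factor
clustered = np.concatenate([rng.normal(-2, 0.3, size=n // 2),  # two experimental conditions
                            rng.normal(2, 0.3, size=n // 2)])

components = np.vstack([peaked, gaussian, uniform, clustered])
kurt = excess_kurtosis(components)

# Most negative kurtosis first: the clustered factor, then the uniform one.
order = np.argsort(kurt)
sorted_components = components[order]
```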
28Fig. 7. Left: different numbers of PCs are used
for dimensionality reduction, and ICA is applied
to each of these reduced datasets. Plotted is the
number of extracted ICs with negative kurtosis.
By using the first 6 components of PCA, ICA can
extract the highest number of interesting ICs,
whereas the kurtosis of IC 4 is close to zero.
Right: for this 6-dimensional reduced dataset,
the kurtosis of all extracted ICs is plotted.
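The procedure behind this figure (reduce with PCA, apply ICA, count the ICs with negative kurtosis) can be sketched as below. This is an assumption-laden illustration: FastICA with a tanh nonlinearity stands in for whichever ICA algorithm was used, the data are synthetic with two planted factors, and all function names are hypothetical.

```python
import numpy as np

def pca_whiten(X, k):
    """Reduce X (samples x variables) to k whitened PCA scores, shape (k, samples)."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return np.sqrt(X.shape[0]) * U[:, :k].T

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ W

def fast_ica_whitened(Z, n_iter=200, seed=0):
    """Symmetric FastICA (tanh) on already whitened data Z (components x samples)."""
    n, m = Z.shape
    W = sym_decorrelate(np.random.default_rng(seed).normal(size=(n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W = sym_decorrelate(G @ Z.T / m - (1.0 - G ** 2).mean(axis=1)[:, None] * W)
    return W @ Z

def excess_kurtosis(Y):
    """Row-wise excess kurtosis of Y (components x samples)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return (Yc ** 4).mean(axis=1) / (Yc ** 2).mean(axis=1) ** 2 - 3.0

def n_negative_ics(X, k):
    """Number of ICs with negative kurtosis after reducing X to k PCs."""
    return int(np.sum(excess_kurtosis(fast_ica_whitened(pca_whiten(X, k))) < 0))

# Synthetic stand-in: 92 samples, 763 variables, two planted factors plus noise.
rng = np.random.default_rng(0)
n_samples, n_vars = 92, 763
group = np.repeat([-1.0, 1.0], n_samples // 2)            # two-cluster factor
gradient = rng.uniform(-1, 1, size=n_samples)             # continuously varied factor
factors = np.column_stack([group, gradient])
X = factors @ rng.normal(size=(2, n_vars)) + 0.5 * rng.normal(size=(n_samples, n_vars))

counts = {k: n_negative_ics(X, k) for k in range(2, 9)}   # left panel: count per number of PCs
ics = fast_ica_whitened(pca_whiten(X, 6))
kurt = np.sort(excess_kurtosis(ics))                      # right panel: kurtosis of all ICs
```

Reducing to a few PCs first is what makes ICA usable here, since the number of variables far exceeds the number of samples.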
29The 10 masses of highest influence are shown for
different components. On the left the masses
given by the classical PCA are shown for PC 2 and
PC 3. These are the PCs which are closest to the
first two ICs of ICA, shown on the right. The
masses given by ICA differ from those of PCA
and are more clearly assignable to only one IC. This
sharper separation of masses is shown in Figure 8.
30Fig. 8. Mass influences. For each mass from Table
1 the absolute influence on each component is
plotted. The masses in PCA have a greater
influence on both components than the masses in
ICA, which are assigned more to one or to the
other component.
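Ranking masses by influence can be sketched as follows, taking the influence of a mass on a component to be the absolute value of its loading. That definition, the function name and the toy numbers are assumptions for illustration.

```python
import numpy as np

def top_masses(loadings, masses, component, k=10):
    """Return the k masses with the largest absolute loading on one component,
    together with their absolute loadings on every component."""
    order = np.argsort(-np.abs(loadings[component]))[:k]
    return masses[order], np.abs(loadings[:, order])

# Toy example: 2 components, 4 mass variables (hypothetical values).
masses = np.array([100.0, 150.0, 200.0, 250.0])
loadings = np.array([[0.9, -0.1, 0.3, 0.0],
                     [0.1, -0.8, 0.2, 0.5]])

top, influence = top_masses(loadings, masses, component=1, k=2)
```

Comparing the rows of `influence` shows whether a mass loads mainly on one component or on several.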
31Fig. 9. Outlier detection by ICA. The last two
components with the most positive kurtosis are
plotted against each other. IC 6
clearly indicates an outlier, marked by an arrow.
33Our strategy: for ICAT (heavy or light labeling)
ICA Separation
- ICA (PPT)
- Separation
- Recognition
- IASL ICAT PROCEDURE
[Diagram: heavy and light labeling as two independent sources, observed as a mixture in two experiments; ICA gets the independent signals out of the mixture]
34??? ??
- Denoising
- Missing data recovery
- Speech enhancement and recognition
- Voice Signal Noise 11000