Title: Multivariate Resolution in Chemistry
1Multivariate Resolution in Chemistry
Roma Tauler IIQAB-CSIC, Spain e-mail
rtaqam_at_iiqab.csic.es
2Lecture 1
- Introduction to data structures and soft-modelling methods.
- Factor Analysis of two-way data. Bilinear models.
- Rotation and intensity ambiguities.
- Pseudo-rank, local rank and rank deficiency.
- Evolving Factor Analysis.
3Chemical sensors and analytical data structures
- one variable: x1, e.g. pH
- two variables: x1, x2, e.g. pH and T
- three variables: x1, x2 and x3, e.g. pH, T and P
- n variables: ?
[Figure: axes labelled pH and P]
4Data Structures: Zero order / Zero-way data
x: one sample gives one scalar (tensor of order 0)
Examples:
- selective electrodes, pH
- absorption at one wavelength λ
- height/area of a chromatographic peak
Assumptions:
- total selectivity
- known linear response
Tools:
- univariate algebra and statistics
Advantages:
- simple and easy to understand
Disadvantages:
- information on only one compound
- total selectivity is required
- one sensor for every analyte
- low information content
[Figure: plots with axes labelled Time, h_i and c_i]
5Data Structures: First order / One-way data
x1, x2, ..., xn: one sample gives one vector (tensor of order 1)
Examples:
- matrix of sensors
- absorption at many wavelengths λ (spectra)
- chromatograms at a single λ
- current intensities at many potentials E
- readings with time (kinetics)
- ...
Assumptions:
- known linear responses
- different and independent responses
Tools:
- linear algebra
- multivariate statistics
- spectral analysis
- chemometrics (PCA, MLR, PCR, PLS, ...)
Advantages:
- calibration in the presence of interferences is possible
- multicomponent analysis is possible
Disadvantages:
- interferences must be present in the calibration samples
[Figure: chromatogram against time]
6Data Structures: Second order / Two-way data
xij: each sample gives a data table/matrix (tensor of order 2), $X = \sum_k x_k y_k^T$
Examples:
- LC-DAD, LC-FTIR, GC-MS, LC-MS, FIA-DAD, CE-MS, ... (hyphenated techniques)
- excitation/emission (fluorescence) spectroscopy
- MS/MS, 2D NMR, GCxGC-MS, ...
- spectroscopic/voltammetric monitoring of chemical reactions/processes with pH, time, T, etc.
Assumptions:
- linear responses
- sufficient rank (of the data matrices)
Tools:
- linear algebra
- chemometrics
Advantages:
- calibration for the analyte in the presence of interferences not modelled in the calibration samples is possible
- full characterization of the analyte and interferents may be possible
- few calibration samples are needed (calibration with only one sample)
7Data Structures: Third order / Three-way data
xijk: each sample gives a data cube (tensor of order 3), $X = \sum_k x_k \otimes y_k \otimes z_k$
Examples:
- several spectroscopic matrices
- several hyphenated chromatographic matrices
- hyphenated multidimensional chromatography (GC x GC / MS)
- excitation/emission/time
- ...
Assumptions:
- bilinear/trilinear responses
- sufficient rank (of the data matrices)
Tools:
- multilinear and tensor algebra
- chemometrics
Advantages:
- unique solutions (no ambiguities)
- calibration for the analyte in the presence of interferences not modelled in the calibration samples is possible
- full characterization of the analyte and interferents is possible
- few calibration samples are needed (calibration with only one sample)
Multi-way data analysis (PARAFAC, GRAM)
Extended multivariate resolution
8
- 0th-order data: ISE, pH, ...
- 1st-order data: spectra
- 2nd-order data: LC/DAD, GC/MS, fluorescence
- 3rd-order data: time/excitation/emission
9Examples: Monitoring chemical reactions. Chemical reaction systems are monitored using spectroscopic measurements (even at the femtosecond scale) to follow the evolution of a reaction with time, pH, temperature, etc., and to detect the formation and disappearance of intermediate and transient species.
[Figure: experimental setup with peristaltic pump (0.050 ml), spectrophotometer and thermostatic bath (T = 37 °C); data matrix D(NR,NC) with axes pH and wavelength]
10Examples: Process analysis. Quality control and optimisation of industrial batch reactions and processes, where on-line measurements are applied to monitor the process.
[Figure: probe; data matrix D(NR,NC) with axes time and wavelength]
11Examples: Chromatographic hyphenated techniques (LC-DAD, GC-MS, LC-MS, LC-MS/MS, ...). Analytical characterisation of complex environmental, industrial and food mixtures using hyphenated techniques (chromatography or continuous flow methods with spectroscopic detection).
[Figure: data matrix D(NR,NC) with axes time and wavelength]
12Examples: FIA-DAD-UV with pH gradient for the analysis of a mixture of drugs.
[Figure: data matrix D(NR,NC) with axes pH and wavelength]
13Examples: Excitation-emission (fluorescence) EEM techniques. Analytical characterisation of complex sea-water samples by means of excitation-emission spectra, for an unknown with triphenyltin (in the reaction with flavonol).
[Figure: data matrix D(NR,NC) with axes excitation and emission]
14Examples: Conformation changes. Protein folding and dynamic protein-nucleic acid interaction processes. In the post-genomic era, understanding these complex, evolving biochemical processes is one of the main challenges of current proteomics research.
[Figure: levels of protein structure. Primary structure: amino acids (Val Leu Ser Ala Asp Ala Trp Gly Val His). Secondary structure: helix and sheet formation (α-helix, β-sheet, turn, random coil). Tertiary structure: globule formation. Quaternary structure: assembled subunits. Data matrix D(NR,NC) with axes temperature and wavelength]
15Examples: Spectroscopic image analysis. Image analysis of spatially distributed chemicals on 2D surfaces, measured using coupled microscopy-spectroscopy techniques in geological samples, biological tissues or food samples.
16Data Structures in Chemistry
Experimental data: two orders/ways/modes of measurement, D(NR,NC)
- row order (way, mode): usually a change in chemical composition (concentration order)
- column order (way, mode): usually a change in system properties, as in spectroscopy, voltammetry, ... (spectral order)
17Chemical data tables (two-way data)
Data table or matrix D:
- J variables (wavelengths): instrumental measurements (spectra, voltammograms, ...)
- I spectra (times): measurements following concentration changes (time, temperature, pH, ...)
- Plot of spectra (rows)
- Plot of elution profiles (columns)
18Chemical data modelling
- Chemical data modelling methods may be divided into:
  - Hard-modelling methods (deterministic)
  - Soft-modelling methods (data driven)
  - Hybrid hard-soft modelling methods
[Diagram. Hard modelling: Data and a Physical Hard Model give Analytical Information. Soft modelling: Data give a Data-driven soft model, which gives Analytical Information and (?) a Physical Model]
19Hard-modelling
- Hard-modelling approaches for chemical (stationary, dynamic, evolving) systems are based on an accurate physical description of the system and on the solution of complex systems of (differential) equations that are fitted to the experimental measurements describing the evolution and dynamics of these systems. They are deterministic models.
- Hard-modelling methods usually use non-linear least-squares regression (Marquardt algorithm) and optimisation methods to find the best values for the parameters of the model.
- Hard-modelling usually deals with univariate data. It was widely used in the past, until the advent of modern instrumentation and computers producing large amounts of data.
- Hard-modelling is often successful for laboratory experiments, where all the variables are under control and the physicochemical nature of the dynamic model is known and can be fully described using a known mathematical model.
20Hard-modelling
- However, even at the laboratory level, there are examples where hard-modelling requirements and constraints are not totally fulfilled, or where no physicochemical model is known to describe the process (e.g. in chromatographic separations or in protein folding experiments).
- Data sets obtained from the study of natural and industrial evolving processes are too complex and difficult to analyse using hard-modelling methods. In these cases, there is no known physical model available, or it is too complex to be set in a general way.
- Advanced hard-modelling in industrial applications has attempted to model experimental difficulties such as changes in temperature, pH, ionic strength and activity coefficients. This is a very difficult task!
Reference: P. Gans, Data Fitting in the Chemical Sciences, John Wiley and Sons, New York, 1992
21Hard modelling
- Output C, S and model parameters.
- The model should describe all the variation in
the experimental measurements.
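A minimal sketch of hard-modelling applied to a multivariate data set, assuming a hypothetical first-order reaction A → B with rate constant k. The kinetic model gives C(k), the spectra S^T follow by linear least squares, and k is chosen to minimise the residuals. A simple grid search over k stands in here for the Marquardt algorithm mentioned above; the function names, the simulated spectra and the noise level are all illustrative assumptions.

```python
import numpy as np

def conc_first_order(k, t):
    """Concentration profiles of A and B for A -> B (hypothetical model)."""
    a = np.exp(-k * t)
    return np.column_stack([a, 1.0 - a])

def residual_ss(k, t, D):
    """Sum of squared residuals of D - C(k) S^T, with S^T fitted linearly."""
    C = conc_first_order(k, t)
    St = np.linalg.lstsq(C, D, rcond=None)[0]
    return np.sum((D - C @ St) ** 2)

# Simulated experiment (all values are assumptions for illustration)
t = np.linspace(0, 10, 30)
w = np.linspace(0, 1, 25)
St_true = np.vstack([np.exp(-((w - 0.3) / 0.15) ** 2),
                     np.exp(-((w - 0.6) / 0.15) ** 2)])
k_true = 0.4
rng = np.random.default_rng(5)
D = conc_first_order(k_true, t) @ St_true + 1e-3 * rng.standard_normal((30, 25))

# Grid search over the single non-linear parameter k (step 0.01)
k_grid = np.linspace(0.05, 1.0, 96)
ssq = np.array([residual_ss(k, t, D) for k in k_grid])
k_fit = k_grid[np.argmin(ssq)]

# Outputs of the hard model: C, S^T and the model parameter
C_fit = conc_first_order(k_fit, t)
St_fit = np.linalg.lstsq(C_fit, D, rcond=None)[0]
```

The outputs match the list above: C, S and the model parameter. With more than one non-linear parameter, a gradient-based optimiser would normally replace the grid.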
22Soft-modelling
- Soft-modelling, instead, attempts to describe these systems without postulating an a priori physical or (bio)chemical model. The goal of these methods is to explain the variation observed in the systems using minimal and softer assumptions about the data. They are data-driven models.
- Soft models usually give an improved analytical description of the analysed process.
- Soft-modelling needs more data than hard-modelling. Soft-modelling methods deal with multivariate data. Their use has increased in recent years because of the advent of modern analytical instrumentation and computers providing large amounts of data.
- The disadvantage of soft models is their poorer extrapolating capabilities (compared with hard-modelling).
23Soft-modelling
- A soft model can hardly predict the behaviour of the system under conditions very different from those under which it was derived.
- Complex multivariate soft-modelling data analysis methods, such as Factor Analysis derived methods, have been introduced for the study of chemical processes/systems.
- Factor Analysis is a multivariate technique for reducing matrices of data to their lowest dimensionality by the use of orthogonal factor space and transformations that yield predictions and/or recognizable factors.
Reference: E.R. Malinowski, Factor Analysis in Chemistry, 3rd Edition, Wiley, New York, 2002
24Soft modelling
[Diagram: D, C, S^T]
Constrained ALS optimisation:
- $\mathrm{LS}(D, C) \rightarrow S^T$
- $\mathrm{LS}(D, S^T) \rightarrow C$
- $\min \lVert D - C S^T \rVert$
- Output: C and S.
- All absorbing contributions in and out of the process are modelled.
25Lecture 1
- Introduction to data structures and soft-modelling methods.
- Factor Analysis of two-way data. Bilinear models.
- Rotation and intensity ambiguities.
- Pseudo-rank, local rank and rank deficiency.
- Evolving Factor Analysis.
26Soft-modelling
Factor Analysis (Bilinear Model)
Experimental data are modelled as a linear sum of weighted (scores) factors (loadings).
In matrix form: data = scores × loadings
27Soft-modelling
BILINEARITY
Assumption: bilinearity (the contributions of the components in the two orders of measurement are additive)
28Soft-modelling
GOALS OF BILINEAR MODEL
[Figure: two panels of profiles; vertical axes from 0 to 0.35, horizontal axes from 0 to 100 and from 0 to 60]
Recovery of the responses of every component (chemical species) in the different modes of measurement
29Soft-modelling Factor Analysis
Real Factor Models
Predictions
30Soft-modelling: Factor Analysis (traditional approach)
[Diagram: Data matrix → (matrix multiplication) → Covariance matrix → (decomposition) → Abstract Factors. From the Abstract Factors: (combination) → abstract reproduction; (target transformation) → Real Factors; (abstract rotation) → New Abstract Factors]
31Soft-modelling methods (I)
- Factor Analysis methods based on the use of latent variables or eigenvalue/singular value data matrix decompositions. Examples:
  - PCA, SVD, rotation FA methods
  - Evolving Factor Analysis methods
  - Rank Annihilation methods
  - Window Factor Analysis methods
  - Heuristic Evolving Latent Projections methods
  - Subwindow Factor Analysis methods
  - ...
32Soft-modelling methods (II)
- Multivariate Resolution methods decompose a data matrix into its pure components without explicitly using latent-variable analysis techniques. Examples:
  - SIMPLISMA
  - Orthogonal Projection Approach (OPA)
  - Positive Matrix Factorization methods (and Multilinear Engine extensions)
  - Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS)
  - GENTLE
  - ...
33Soft-modelling methods (III)
- Three-way and multiway methods, which decompose three-way or multiway data structures. Examples:
  - Multiway and multiset extensions of PCA
  - Generalized Rank Annihilation (GRAM), Direct Trilinear Decomposition (DTD, TLD)
  - Multiway and multiset extensions of MCR-ALS methods
  - PARAFAC-ALS
  - Tucker3-ALS
  - ...
34Soft-modelling
- E.R. Malinowski, Factor Analysis in Chemistry, 3rd Ed., John Wiley & Sons, New York, 2002
- I.T. Jolliffe, Principal Component Analysis, 2nd Ed., Springer, Berlin, 2002
- A. Smilde, R. Bro and P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, New York, 2004
- P. Geladi, Multivariate Image Analysis, John Wiley and Sons, 1996
- A. de Juan, E. Casassas and R. Tauler, Soft modeling of Analytical Data, in Encyclopedia of Analytical Chemistry: Instrumentation and Applications, edited by R.A. Meyers, John Wiley & Sons, 2000, Vol. 11, 9800-9837
35Soft-modelling
Data structures and types of models:
- One-way data (vectors): linear and non-linear models, $d_i = b_0 + b\,c_i$ or $d_i = f_{\text{non-linear}}(c_i)$
- Two-way data (matrices): bilinear and non-bilinear models. Non-bilinear data can still be linear in one of the two modes.
- Three-way data (cubes): trilinear and non-trilinear models. Non-trilinear data can still be bilinear in two modes.
[Figure: vector with elements d_i (I samples); matrix D with elements d_ij (I samples, J variables)]
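36Soft-modelling
Bilinear models for two way data
[Figure: data matrix D (I rows, J columns) with elements d_ij]
$d_{ij} = \sum_{n=1}^{N} c_{in}\, s_{nj} + e_{ij}$
- $d_{ij}$ is the data measurement (response) of variable j in sample i
- n = 1, ..., N runs over the components (species, sources, ...)
- $c_{in}$ is the concentration of component n in sample i
- $s_{nj}$ is the response of component n at variable j
- $e_{ij}$ is the part of $d_{ij}$ not explained by the N components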
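A short numerical check of the bilinear model, using the symbols defined above (d_ij, c_in, s_nj) and leaving out the noise term. The sizes I, J, N and the random values are arbitrary assumptions; the point is that the element-wise sum, the matrix product and the sum of N rank-one matrices are the same thing.

```python
import numpy as np

I, J, N = 5, 4, 2                      # samples, variables, components
rng = np.random.default_rng(1)
C = rng.random((I, N))                 # c_in: concentration of component n in sample i
St = rng.random((N, J))                # s_nj: response of component n at variable j

# Matrix form of the noise-free bilinear model
D = C @ St

# Element-wise form: d_ij = sum over n of c_in * s_nj (here i = 2, j = 3)
d_23 = sum(C[2, n] * St[n, 3] for n in range(N))

# Sum of N rank-one contributions, one per component
D_outer = sum(np.outer(C[:, n], St[n, :]) for n in range(N))
```

Each `np.outer(C[:, n], St[n, :])` is the contribution of one pure component to the whole data table.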
37Soft-modelling
Bilinear models for two way data
[Diagram: D (I × J) = U or C (I × N) times V^T or S^T (N × J) + E (I × J), with N << I or J]
PCA: $D = U V^T + E$
- U orthogonal, V^T orthonormal
- V^T in the direction of maximum variance
- Unique solutions but without physical meaning
- Useful for interpretation but not for resolution!
MCR: $D = C S^T + E$
- Other constraints (non-negativity, unimodality, local rank, ...)
- C (in place of U) and S^T (in place of V^T) non-negative, ...
- C or S^T normalization
- Non-unique solutions but with physical meaning
- Useful for resolution (and obviously for interpretation)!
38PCA Model (Principal Component Analysis): $X = U V^T + E$
- U: scores matrix (orthogonal)
- V^T: loadings matrix (orthonormal)
SVD Model (Singular Value Decomposition): $D = U S V^T + E$
- U: scores matrix (orthonormal)
- S: diagonal matrix of the singular values s, with $s = \lambda^{1/2}$
- λ: eigenvalues of the covariance matrix $D D^T$
- V^T: loadings matrix (orthonormal)
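The relation between the two models can be checked numerically. This is a sketch on random rank-3 data (an assumption), without centring, so "covariance matrix" is taken literally as the product D D^T used above. PCA scores are obtained as the SVD left vectors scaled by the singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.random((6, 3)) @ rng.random((3, 8))     # noise-free data of rank 3

# SVD model: D = U S V^T, with U and V^T orthonormal
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# PCA model: scores = U S (orthogonal, not normalised), loadings = V^T
scores = U * s
reconstruction_ok = np.allclose(scores @ Vt, D)

# Singular values are the square roots of the eigenvalues of D D^T
eigvals = np.linalg.eigvalsh(D @ D.T)[::-1]     # sorted in descending order
```

Only the first three singular values are significantly different from zero here, because the simulated data contain three components and no noise.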
39PCA Model: $D = U V^T + E$
- U: scores
- V^T: loadings (projections)
- E: unexplained variance
$D = u_1 v_1^T + u_2 v_2^T + \dots + u_n v_n^T + E$
n: number of components (<< number of variables in D)
Each term $u_k v_k^T$ is a matrix of rank 1.
40PCA Model
- $X = U V^T + E$
- X = structure + noise
- It is an approximation to the experimental data matrix X.
- Loadings, Projections V^T: relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in V^T (loadings) are orthonormal (orthogonal and normalized).
- Scores, Targets U: relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in U (scores) are orthogonal.
- Noise E: experimental error, non-explained variance.
41Summary of Principal Component Analysis (PCA)
- 1. Formulation of the problem to solve.
- 2. Plot of the original data.
- 3. Data pretreatment (data centering, autoscaling, logarithmic transformation).
- 4. Building the PCA model. Determination of the number of components (graphical inspection of explained/residual plots).
- 5. Study of the PCA model. Multivariate data exploration:
  - loadings plot => map of the variables
  - scores plot => map of the samples
- 6. Interpretation of the PCA model. Identification of the main sources of data variance.
- 7. Analysis of the residuals matrix $E = D - U V^T$.
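Steps 3, 4 and 7 of the summary can be sketched numerically as follows. The data are simulated, and the choice of autoscaling and of n = 2 components is an assumption made only for illustration; in practice both depend on the problem and on the inspection of the explained-variance plot.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((20, 3)) @ rng.random((3, 6)) + 0.01 * rng.standard_normal((20, 6))

# Step 3: pretreatment (column centring followed by autoscaling)
Xc = X - X.mean(axis=0)
Xa = Xc / X.std(axis=0, ddof=1)

# Step 4: PCA model via SVD and fraction of variance explained per component
U, s, Vt = np.linalg.svd(Xa, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)

n = 2                                  # number of components retained (assumed)
scores = U[:, :n] * s[:n]              # map of the samples
loadings = Vt[:n]                      # map of the variables

# Step 7: residuals matrix E = D - U V^T for the n-component model
E = Xa - scores @ loadings
```

The sum of squares of E equals the sum of the squared singular values that were left out of the model, which is what the explained/residual plots display.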
42PCA
U scores
VT loadings
43Multivariate Curve Resolution (MCR)
Pure component information
[Diagram: C with columns c_1 ... c_n (retention times) and S^T with rows s_1 ... s_n]
- Pure signals (S^T): compound identity, source identification and interpretation
- Pure concentration profiles (C): chemical model, process evolution, compound contribution, relative quantitation
44Lecture 1
- Introduction to data structures and soft-modelling methods.
- Factor Analysis of two-way data. Bilinear models.
- Rotation and intensity ambiguities.
- Pseudo-rank, local rank and rank deficiency.
- Evolving Factor Analysis.
45Factor Analysis: ambiguities in the analysis of a data matrix (two-way data)
Rotation and scale/intensity ambiguities
Rotation Ambiguities
- Factor Analysis (PCA) data matrix decomposition: $D = U V^T + E$
- True data matrix decomposition: $D = C S^T + E$
46Factor Analysis: ambiguities in the analysis of a data matrix (two-way data)
Rotation and scale/intensity ambiguities
Rotation Ambiguities
$D = U T T^{-1} V^T + E = C S^T + E$
$C = U T$
$S^T = T^{-1} V^T$
How to find the rotation matrix T?
47Rotation and scale/intensity ambiguities
- $D = C S^T + E = D^* + E$, where $D^* = C S^T$
- $C_{new} = C\,T$, with dimensions (NR,N) = (NR,N)(N,N)
- $S^T_{new} = T^{-1} S^T$, with dimensions (N,NC) = (N,N)(N,NC)
- $D^* = C S^T = C_{new} S^T_{new}$
- Matrix decomposition is not unique!
- T(N,N) is any non-singular matrix
- Rotational freedom for any T
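The non-uniqueness can be verified directly. In this sketch C, S^T and T are arbitrary assumed values; any non-singular T(N,N) gives a different pair of factors that reproduces the same data.

```python
import numpy as np

rng = np.random.default_rng(3)
C = rng.random((10, 2))                # (NR, N)
St = rng.random((2, 7))                # (N, NC)

# Any non-singular N x N matrix (values chosen arbitrarily)
T = np.array([[1.0, 0.3],
              [0.2, 1.0]])

C_new = C @ T                          # (NR, N) = (NR, N)(N, N)
St_new = np.linalg.inv(T) @ St         # (N, NC) = (N, N)(N, NC)

same_data = np.allclose(C @ St, C_new @ St_new)
same_profiles = np.allclose(C, C_new)
```

Choosing a diagonal T, for example `np.diag([k, 1.0])`, rescales one profile by k and its partner by 1/k, which is the intensity (scale) ambiguity as a special case.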
48Rotation and scale/intensity ambiguities
Rotation ambiguities and rotation matrix T(N,N)
49Rotation and scale/intensity ambiguities
Intensity (scale) ambiguities
For any scalar k ≠ 0: $c\,s^T = (k\,c)\left(\tfrac{1}{k}\,s^T\right)$
Intensity/scale ambiguities make it difficult to obtain quantitative information. Once they are solved, quantitative information can also be obtained.
50Rotation and scale/intensity ambiguities
Intensity (scale) ambiguities
$c_{old} \times k = c_{new}$
$\tfrac{1}{k} \times s_{old} = s_{new}$
$c_{old}\, s_{old} = (c_{old} \times k)\left(\tfrac{1}{k} \times s_{old}\right) = c_{new}\, s_{new}$
51Rotation and scale/intensity ambiguities
- Questions to answer:
  - Is it possible to have unique solutions?
  - What are the conditions to have unique solutions?
- If totally unique solutions are not possible:
  - Is it still possible at least to find some of the possible solutions?
  - Is it possible to estimate the band or range of possible/feasible solutions?
  - How can this range of feasible solutions be reduced?
52Lecture 1
- Introduction to data structures and soft-modelling methods.
- Factor Analysis of two-way data. Bilinear models.
- Rotation and intensity ambiguities.
- Pseudo-rank, local rank and rank deficiency.
- Evolving Factor Analysis.
53Definitions
- Mathematical rank of a data matrix: the minimum number of linearly independent rows or columns describing the variance of the whole data set, i.e. the minimum number of basis vectors spanning the row and column vector spaces. It may be obtained by SVD or PCA.
- Pseudo-rank or chemical rank: the mathematical rank in the absence of experimental error/noise. Usually it is equal to the number of chemical/physical components contributing to the observed data variance, apart from experimental noise/error. It is obtained from the number of larger components from PCA, SVD or other FA methods.
- Local rank: the chemical rank of data submatrices. It is obtained from EFA, EFF, SIMPLISMA, OPA or other FA submatrix analysis methods.
- Rank deficiency: the chemical rank is lower than the known number of contributions. Rank deficiency may be broken/solved by data matrix augmentation and perturbation strategies.
- Rank overlap: rank deficiency caused by equal vector profiles of different chemical/physical components in one or more modes.
54Pseudo Rank Number of contributions (factors,
components)
Principal Component Analysis
Gives an abstract (orthogonal) bilinear model to
describe optimally the variation in our data set.
Useful chemical information Size of the
model (chemical rank)
Number of chemical contributions
55Pseudo Rank Number of contributions (factors,
components)
56Pseudo Rank Number of contributions (factors,
components)
Principal Component Analysis (SVD algorithm)
[Figure: components separated into those of large size and those of small size]
57Pseudo Rank Number of contributions (factors,
components)
- Overestimations of rank (overfitting):
  - Large overestimation: the measurements may not follow a bilinear model.
  - Small overestimation: presence of structured noise or high noise levels.
- Underestimations of rank (rank deficiency):
  - Linear dependencies.
  - Contributions with very similar signals or concentration profiles.
  - Compounds with non-measurable signals.
  - Minor compounds.
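A sketch of how the pseudo-rank can be read from the size of the singular values. The three-component simulated data, the noise level and especially the threshold rule are assumptions made for illustration; in practice the cut between "large" and "small" is usually made by inspecting the singular values or using a known noise level.

```python
import numpy as np

rng = np.random.default_rng(4)

# Three overlapping Gaussian concentration profiles (50 rows) and spectra (40 columns)
x = np.arange(50)[:, None]
C = np.exp(-((x - np.array([15, 25, 35])) / 6.0) ** 2)            # shape (50, 3)
w = np.arange(40)
St = np.exp(-((w - np.array([[10], [20], [30]])) / 8.0) ** 2)     # shape (3, 40)

noise_sd = 1e-3
D = C @ St + noise_sd * rng.standard_normal((50, 40))

s = np.linalg.svd(D, compute_uv=False)

# Hypothetical rule: singular values well above the expected noise contribution
threshold = 10 * noise_sd * np.sqrt(max(D.shape))
pseudo_rank = int(np.sum(s > threshold))

# The mathematical rank of the noisy matrix is full
math_rank = np.linalg.matrix_rank(D)
```

With noise present the mathematical rank equals the smaller dimension of D, while the pseudo-rank counts only the three chemical contributions.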
58Rank deficiency
- Are all the signals distinguishable and independent?
- Are all the concentration profiles distinguishable and independent?
No → rank-deficient systems
Detectable rank < number of process contributions
Examples:
1) 2nd-order reaction A + B → C: three chemical species/contributions, but rank = 2.
2) Enantiomer conversion monitored by UV, where the spectrum of D equals the spectrum of L: two chemical species/components, but rank = 1 (rank overlap).
59Rank deficiency
- Closed reaction systems. Some concentration profiles are described as linear combinations of others.
- System HA / A-, HB / B-
- $C_A = [HA] + [A^-]$
- $C_B = [HB] + [B^-]$
- $C_B = k\,C_A$
- $[HA], [HB]$
- $[A^-] = C_A - [HA]$
- $[B^-] = C_B - [HB] = k\,C_A - [HB]$
Rank 3: $[HA], [HB], [A^-], [B^-] = f([HA], [HB], C_A)$
60Breaking rank-deficiency by matrix augmentation
Rank deficiency
- Matrix Augmentation in the rank-deficient
direction
Data set
Rank 4
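The closed acid-base system HA / A-, HB / B- can be simulated to see both the rank deficiency and its removal by augmentation. This is a sketch under assumptions: the pKa values, the pH range, the varying total concentration C_A and the two values of k are all hypothetical, and the two experiments are stacked row-wise, i.e. in the concentration direction, which is the rank-deficient one here.

```python
import numpy as np

def acid_base_profiles(pH, c_a, k, pka_a=4.0, pka_b=7.0):
    """Columns [HA], [HB], [A-], [B-] for a closed system with C_B = k * C_A."""
    h = 10.0 ** (-pH)
    ka, kb = 10.0 ** (-pka_a), 10.0 ** (-pka_b)
    c_b = k * c_a
    ha = c_a * h / (h + ka)
    hb = c_b * h / (h + kb)
    a = c_a - ha                        # [A-] = C_A - [HA]
    b = c_b - hb                        # [B-] = k C_A - [HB]
    return np.column_stack([ha, hb, a, b])

rng = np.random.default_rng(7)
pH = np.linspace(2, 10, 30)
c_a = 1.0 + rng.random(30)              # total concentration of A in each sample

C1 = acid_base_profiles(pH, c_a, k=1.0)   # experiment 1
C2 = acid_base_profiles(pH, c_a, k=2.0)   # experiment 2, different C_B / C_A ratio

rank_single = np.linalg.matrix_rank(C1)
rank_augmented = np.linalg.matrix_rank(np.vstack([C1, C2]))
```

Four species give rank 3 in a single experiment because of the closure C_B = k C_A; stacking a second experiment with a different k breaks that linear dependency and the augmented set reaches rank 4. The same ranks carry over to the data matrices D = C S^T when the four spectra are linearly independent.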
61Lecture 1
- Introduction to data structures and soft-modelling methods.
- Factor Analysis of two-way data. Bilinear models.
- Rotation and intensity ambiguities.
- Pseudo-rank and rank deficiency.
- Local Rank and Evolving Factor Analysis.
62Local exploratory analysis
- Study of the variation of the number of contributions in the process or system. Study of the rank variation during the process.
- Evolving Factor Analysis (EFA)
- Fixed Size Moving Window - Evolving Factor Analysis (FSMW-EFA)
63Evolving Factor Analysis
- Stepwise chemometric monitoring of a process.
- Forward Evolving FA (from beginning to end)
- Backward Evolving FA (from end to beginning)
Working procedure: display of successive PCA analyses along gradually increasing data set windows.
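The working procedure can be sketched as below. This is an illustrative implementation, not the published EFA code: singular values are used in place of eigenvalues, the function names `efa` and `bump` are hypothetical, and the three-component process with consecutive emergence-decay profiles is simulated.

```python
import numpy as np

def bump(x, start, stop):
    """sin^2 concentration profile, exactly zero outside (start, stop)."""
    inside = (x > start) & (x < stop)
    return np.where(inside, np.sin(np.pi * (x - start) / (stop - start)) ** 2, 0.0)

def efa(D, n_sv=4):
    """Forward and backward EFA: singular values of growing row windows."""
    n_rows = D.shape[0]
    fwd = np.full((n_rows, n_sv), np.nan)
    bwd = np.full((n_rows, n_sv), np.nan)
    for i in range(1, n_rows + 1):
        s = np.linalg.svd(D[:i], compute_uv=False)[:n_sv]           # rows 0 .. i-1
        fwd[i - 1, :len(s)] = s
        s = np.linalg.svd(D[n_rows - i:], compute_uv=False)[:n_sv]  # last i rows
        bwd[n_rows - i, :len(s)] = s
    return fwd, bwd

# Simulated process: three compounds appearing and disappearing in sequence
x = np.arange(50.0)
C = np.column_stack([bump(x, 5, 25), bump(x, 15, 35), bump(x, 25, 45)])
w = np.arange(40.0)
St = np.exp(-((w - np.array([[10.0], [20.0], [30.0]])) / 8.0) ** 2)
rng = np.random.default_rng(8)
D = C @ St + 1e-3 * rng.standard_normal((50, 40))

fwd, bwd = efa(D)
noise_level = 0.05                      # assumed threshold above the noise
n_forward = (fwd > noise_level).sum(axis=1)   # compounds that have emerged so far
```

Plotting the logarithm of `fwd` and `bwd` against the row index gives the forward and backward EFA plots; a new singular value rising above the noise level marks the emergence (forward) or disappearance (backward) of a compound.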
64Evolving Factor Analysis
D
65Evolving Factor Analysis
- Forward Evolving Factor Analysis
66Evolving Factor Analysis
- Forward Evolving Factor Analysis
Noise level
67Evolving Factor Analysis
- Backward Evolving Factor Analysis
68Evolving Factor Analysis
- Backward Evolving Factor Analysis
Noise level
69Evolving Factor Analysis
- Combined EFA plot (forward and backward EFA)
70Evolving Factor Analysis
Consecutive emergence-decay profiles. No embedded
compounds.
71Evolving Factor Analysis
- Approximate concentration profiles
EFA derived concentration profiles
Real concentration profiles
72Fixed Size Moving Window-Evolving FA (FSMW-EFA)
- Local rank map along the process direction or the signal direction.
Working procedure: successive PCA in fixed-size windows moving stepwise along the data set.
Window size ≥ min(number of components + 1)
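A sketch of the moving-window procedure, again with singular values in place of eigenvalues and with simulated data. The function names, the window size of 5 rows and the noise threshold are assumptions chosen for illustration.

```python
import numpy as np

def bump(x, start, stop):
    """sin^2 concentration profile, exactly zero outside (start, stop)."""
    inside = (x > start) & (x < stop)
    return np.where(inside, np.sin(np.pi * (x - start) / (stop - start)) ** 2, 0.0)

def fsmw_efa(D, window=5):
    """Singular values of every fixed-size row window moved one row at a time."""
    n_win = D.shape[0] - window + 1
    sv = np.empty((n_win, min(window, D.shape[1])))
    for i in range(n_win):
        sv[i] = np.linalg.svd(D[i:i + window], compute_uv=False)
    return sv

# Simulated process: three compounds appearing and disappearing in sequence
x = np.arange(50.0)
C = np.column_stack([bump(x, 5, 25), bump(x, 15, 35), bump(x, 25, 45)])
w = np.arange(40.0)
St = np.exp(-((w - np.array([[10.0], [20.0], [30.0]])) / 8.0) ** 2)
rng = np.random.default_rng(9)
D = C @ St + 1e-3 * rng.standard_normal((50, 40))

sv = fsmw_efa(D, window=5)
noise_level = 0.05                      # assumed threshold above the noise
local_rank = (sv > noise_level).sum(axis=1)   # local rank map along the process
```

Here `local_rank[i]` is the number of compounds detected in rows i to i+4. Applying the same function to `D.T` gives the local rank map along the signal direction.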
73FSMW-EFA
74FSMW-EFA
Noise level
75Local rank detection
[Figure: EFA plots and FSMW-EFA plots]
76FSMW-EFA vs. EFA
- EFA
  - Displays the evolution of the process.
  - The compounds are well identified (concentration windows).
  - Local rank information is not easily interpreted.
- FSMW-EFA
  - Clear definition of local rank.
  - Sensitive to detection of minor compounds.
  - The idea of process evolution is not preserved.
77Getting local rank information from Evolving Factor Analysis methods
- Detection of the selective windows or regions where only one species exists (total selectivity)
- Detection of zero-concentration windows or regions (no species is present)
- Detection of windows or regions where a particular species is not present
- Detection of the concentration windows or regions where one species is present (other species can coexist)
78References
- EFA
  - H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbühler. Talanta, 32, 1133-1139 (1985).
  - M. Maeder. Anal. Chem., 59, 527-530 (1987).
- FSMW-EFA
  - H.R. Keller and D.L. Massart. Anal. Chim. Acta, 246, 379-390 (1991).
- SIMPLISMA
  - W. Windig and J. Guilment. Anal. Chem., 63, 1425-1432 (1991).