Title: MS Preprocessing and Evaluation
1MS Preprocessing and Evaluation
2Introduction
- Context
- diagnosis and biomarker extraction from SELDI/MALDI mass spectra
- Issues
- preprocessing mass spectra
- tune preprocessing parameters
- evaluate preprocessing quality
3Work Flow
- Preprocessing
- Baseline Estimation
- MS Normalisation (TIC)
- Noise Estimation/Elimination
- Peak Detection
- Extract Peak Characteristics
- Peak Alignment
Spectra with Control/Diseased labels
List of Discriminant Features
A learning dataset
Classification model for prediction of patient
state from its mass spectrum
Machine Learning
4Signal Distortions Correction
- Baseline estimated with an opening operator (local maxima of the local minima) in a sliding window
- Total Ion Current (TIC) normalization computed on the baseline-corrected signal, using only the part of the signal above 2000 Da
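The two corrections above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the slides: the function names, the `half_width` window parameter and the handling of the spectrum edges are assumptions; only the opening operator and the 2000 Da cutoff come from the slide.

```python
def sliding_min(values, half_width):
    """Local minimum in a window of up to 2*half_width+1 samples."""
    n = len(values)
    return [min(values[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def sliding_max(values, half_width):
    """Local maximum in a window of up to 2*half_width+1 samples."""
    n = len(values)
    return [max(values[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def estimate_baseline(intensities, half_width):
    """Opening operator: local maxima of the local minima."""
    return sliding_max(sliding_min(intensities, half_width), half_width)


def correct_and_normalize(mz, intensities, half_width, mz_min=2000.0):
    """Subtract the baseline, then divide by the TIC measured above mz_min."""
    baseline = estimate_baseline(intensities, half_width)
    corrected = [y - b for y, b in zip(intensities, baseline)]
    # TIC restricted to the part of the signal above mz_min (2000 Da on the slide).
    tic = sum(c for m, c in zip(mz, corrected) if m > mz_min)
    if tic <= 0:
        raise ValueError("no signal above mz_min; cannot normalize")
    return [c / tic for c in corrected]
```

The window must be wider than the widest peak, otherwise the opening follows the peaks and removes part of them along with the baseline.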
5Noise Estimation / Elimination
- Noise estimated by standard deviation in a sliding window
- To determine what is a peak and what is not
- Possibility of signal smoothing (e.g. with wavelet, FFT)
- Better to work with raw data
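A minimal sketch of a sliding-window noise estimate, assuming a symmetric window of `half_width` samples on each side (a hypothetical parameter) and the population standard deviation. Windows that contain a peak give an inflated value; the slides do not say how that case is handled.

```python
import statistics


def local_noise(intensities, half_width):
    """Standard deviation of the signal in a window centred on each sample."""
    n = len(intensities)
    noise = []
    for i in range(n):
        # The window is truncated at the spectrum edges.
        window = intensities[max(0, i - half_width):min(n, i + half_width + 1)]
        noise.append(statistics.pstdev(window))
    return noise
```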
6Peak Detection
7Extracting Peak Area
- p: fixed point found by peak detection
- The signal is split into regions according to the minima between two consecutive peaks
- In each region, pl and pr are found by least-squares fitting of a piecewise linear model in two segments (horizontal and oblique)
- Area of the peak is given by the area of the triangle (pl, p, pr)
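One possible reading of this procedure is sketched below. It is an assumption, not the authors' code: each side of the peak is scanned from the apex outwards, every sample is tried as the breakpoint between an oblique least-squares line and a horizontal segment, and the breakpoint with the smallest total squared error is kept as pl or pr. The names `find_foot` and `triangle_area` are hypothetical, and the two segments are not forced to meet.

```python
def _line_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def _const_sse(ys):
    """Sum of squared residuals around the mean (horizontal segment)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def find_foot(xs, ys):
    """Index of the best breakpoint between the oblique and horizontal segments.

    xs, ys run from the apex (index 0) outwards to the region boundary.
    Returns None when fewer than three samples are available.
    """
    best_k, best_sse = None, float("inf")
    for k in range(1, len(xs) - 1):
        sse = _line_sse(xs[:k + 1], ys[:k + 1]) + _const_sse(ys[k:])
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


def triangle_area(pl, p, pr):
    """Area of the triangle (pl, p, pr); each point is an (m/z, intensity) pair."""
    (xl, yl), (xp, yp), (xr, yr) = pl, p, pr
    return 0.5 * abs(xl * (yp - yr) + xp * (yr - yl) + xr * (yl - yp))
```

For the left side of a peak the samples are passed in decreasing m/z order, so that index 0 is still the apex.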
8Peak Alignment and Missing Values
- Peak alignment performed by hierarchical clustering (closest peaks are merged)
- Two strategies for missing values
- set missing values to zero because there is no peak
- retrieve signal intensity (not obvious for peak area)
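The alignment step can be illustrated with a small agglomerative clustering on the m/z axis. This sketch makes several assumptions that the slide does not state: clusters are compared by their mean m/z, merging stops when the closest pair is further apart than a hypothetical tolerance `tol`, and two peaks from the same spectrum may end up in the same cluster (the larger intensity is kept). Missing values are filled with zero, the first of the two strategies listed above.

```python
def align_peaks(spectra, tol):
    """Align peaks across spectra by merging the closest clusters first.

    spectra: list of peak lists, each peak an (mz, intensity) tuple.
    Returns (centers, matrix): one row per spectrum, one column per cluster.
    """
    # Start with one cluster per detected peak, sorted by m/z.
    clusters = sorted(
        ([(mz, idx, inten)] for idx, peaks in enumerate(spectra)
         for mz, inten in peaks),
        key=lambda c: c[0][0],
    )

    def center(cluster):
        return sum(mz for mz, _, _ in cluster) / len(cluster)

    while len(clusters) > 1:
        centers = [center(c) for c in clusters]
        # On a 1D axis the closest pair of clusters is always adjacent.
        gaps = [centers[i + 1] - centers[i] for i in range(len(centers) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > tol:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]

    centers = [center(c) for c in clusters]
    # Zero filling: a spectrum with no peak in a cluster keeps 0.0.
    matrix = [[0.0] * len(clusters) for _ in spectra]
    for col, cluster in enumerate(clusters):
        for _, row, inten in cluster:
            matrix[row][col] = max(matrix[row][col], inten)
    return centers, matrix
```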
9Data Representation
- 3 possible data representations
- peak intensity, with signal intensity for missing values (is)
- peak intensity, with zero filling of missing values (iz)
- peak area, with zero filling of missing values (az)
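The difference between the is and iz representations can be shown with a short sketch. It assumes that a missing value is marked with `None` and that "signal intensity" means the intensity of the sample closest to the aligned peak position; both are illustrative choices, and the function names are hypothetical.

```python
from bisect import bisect_left


def signal_at(mz_axis, intensities, target_mz):
    """Intensity of the sample closest to target_mz (mz_axis must be sorted)."""
    i = bisect_left(mz_axis, target_mz)
    if i == 0:
        return intensities[0]
    if i == len(mz_axis):
        return intensities[-1]
    left_closer = target_mz - mz_axis[i - 1] <= mz_axis[i] - target_mz
    return intensities[i - 1] if left_closer else intensities[i]


def fill_missing(row, centers, mz_axis, intensities, mode):
    """Fill the missing entries of one spectrum's feature row.

    row[j] is a peak intensity, or None when no peak was detected
    near the aligned position centers[j].
    """
    if mode == "iz":
        return [0.0 if v is None else v for v in row]
    if mode == "is":
        return [signal_at(mz_axis, intensities, c) if v is None else v
                for v, c in zip(row, centers)]
    raise ValueError("mode must be 'is' or 'iz'")
```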
10Preprocessing Evaluation
- Solution 1 (ideal case): have samples with known content, or use an MS simulator, and estimate peak detection performance.
- Solution 2: acquire replicate spectra and estimate peak detection stability.
- Solution 3: in diagnostic applications, choose the preprocessing parameters that minimise the generalisation error.
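Solution 3 amounts to a model-selection loop over the preprocessing parameter. The sketch below is only an illustration of that idea: it estimates the generalisation error by leave-one-out with a 1-nearest-neighbour classifier, which stands in for whichever learner is actually used, and it assumes the dataset has already been preprocessed once per candidate parameter value. The function names are hypothetical.

```python
def loo_error_1nn(features, labels):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, x in enumerate(features):
        # Nearest other sample by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, features[j])),
        )
        errors += labels[nearest] != labels[i]
    return errors / len(features)


def choose_parameter(datasets, labels):
    """Pick the parameter whose preprocessed dataset gives the lowest error.

    datasets maps each candidate parameter value to its feature matrix.
    """
    errors = {p: loo_error_1nn(X, labels) for p, X in datasets.items()}
    return min(errors, key=errors.get), errors
```

Selecting a parameter on the same error estimate that is later reported gives an optimistic figure, so a held-out set or a nested loop is needed for an unbiased final estimate.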
11Choosing Peak Detection Parameter
- Compare detections between a normal MS spectrum and a blank one
- Result: we can use a peak detection parameter of 2.5
12Preprocessing Evaluation in Diagnostic (1/2): Data Representation Evaluation
- is/iz: filling missing values with signal intensity instead of zeroes retains more discriminatory information
- iz/az: using area or intensity does not result in significant differences
- is/raw: no significant information is lost in preprocessing, and the representation is much more compact
13Preprocessing Evaluation in Diagnostic (2/2): Influence of Peak Detection Parameter
- With the is representation and the SMO algorithm, what is the influence of the peak detection parameter on the information content of the preprocessed datasets?
14Preprocessing Evaluation in Presence of
Replicates (1/2)
- perform peak detection and alignment
15Preprocessing Evaluation in Presence of
Replicates (2/2)
- Estimate the percentage of peaks found in at least 10/10, 9/10, 8/10 ... 1/10 replicates
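Once the replicates are aligned, this stability measure is a simple count. The sketch assumes a boolean presence matrix with one row per replicate and one column per aligned peak; the name `stability_profile` is hypothetical.

```python
def stability_profile(presence):
    """Percentage of aligned peaks found in at least k replicates.

    presence[r][j] is True when aligned peak j was detected in replicate r.
    Returns a dict {k: percentage} for k = n_replicates down to 1.
    """
    n_rep = len(presence)
    n_peaks = len(presence[0])
    # Number of replicates in which each aligned peak was detected.
    counts = [sum(1 for r in range(n_rep) if presence[r][j])
              for j in range(n_peaks)]
    return {k: 100.0 * sum(1 for c in counts if c >= k) / n_peaks
            for k in range(n_rep, 0, -1)}
```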
16Conclusion
- Results depend heavily on parameter tuning, which should be done in an informed manner
- manual selection
- automatic selection
- We saw the preprocessing pipeline for SELDI data. Preprocessing LC-MS data brings new issues
- More dimensions (2D, 3D)
- How to perform real-time peak detection and alignment of LC-MS data?
- How to perform real-time protein identification and guide MS/MS selection?
- Challenge: how to build learning methods able to deal with raw data?
17Peak Definition
- Valley definition for a point p: the lowest points to the left and right of p such that no point between them has an intensity higher than that of p.
- Peak definition: a spectrum point p is considered a peak if its left and right valleys are deeper than the noise level.
- Remark
- No assumption on peak width
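The definition translates almost directly into code. In this sketch the valley depth is measured from the intensity of p, the noise level is given per sample, and a hypothetical multiplier `factor` scales it; whether this multiplier corresponds to the "peak detection parameter" of the slides is an assumption. Samples of equal height on a plateau are each reported as peaks, which the definition as stated does not rule out.

```python
def valley_depth(intensities, p, step):
    """Depth of the valley on one side of p (step=-1 for left, +1 for right).

    Walks away from p while no sample is higher than p and returns the
    difference between the intensity at p and the lowest sample met.
    """
    lowest = intensities[p]
    i = p + step
    while 0 <= i < len(intensities) and intensities[i] <= intensities[p]:
        lowest = min(lowest, intensities[i])
        i += step
    return intensities[p] - lowest


def detect_peaks(intensities, noise, factor=1.0):
    """Indices of samples whose left and right valleys exceed factor * noise."""
    peaks = []
    for p in range(len(intensities)):
        threshold = factor * noise[p]
        if (valley_depth(intensities, p, -1) > threshold
                and valley_depth(intensities, p, +1) > threshold):
            peaks.append(p)
    return peaks
```

Because only valley depths are compared, narrow and wide peaks are treated alike, in line with the remark that no assumption is made on peak width.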
18Data Representation Evaluation
- Error evaluation of 3 classification algorithms
- Instance-Based Learning (IBk)
- Decision Tree (J48)
- Support Vector Machine (SMO)
- On 3 data representations
- peak intensity, with signal intensity for missing values (is)
- peak intensity, with zero filling of missing values (iz)
- peak area, with zero filling of missing values (az)
- For 3 datasets
- Stroke (Stk)
- Prostate Cancer (Pro)
- Ovarian Cancer (Ova)
19Perspectives
- Preprocessing pipeline used in a reproducibility study of MALDI-TOF MS: Zeferos et al., Sample Preparation and Bioinformatics in MALDI Profiling of Urinary Proteins. Submitted, JChromat, 2006
- Peak detection algorithm has been extended to 2D (and even nD) and applied on Nano-LC MS
202D Nano-LC Peak Detection
21Thank you !
22Preprocessing Objectives
- Correct signal distortions (baseline, normalization)
- Reduce dimensionality of the learning problem (peak detection)
- Avoid removing discriminative information
23vd vs Error