Title: Popitam
1 Popitam, a mutation/modification-tolerant method
for identifying proteins from tandem mass
spectrometry (MS/MS) data
Patricia Hernandez, Swiss Institute of
Bioinformatics
2 Overview
- proteomics
  - proteome
  - proteome visualization: 2D gels
- protein identification
  - classical workflow
  - shared peak count
- modifications and identification
  - modified peptides
  - SPC
  - spectral alignment, de novo sequencing, tag extraction
- Popitam
  - overview
  - tags
  - scoring function, genetic programming
- some results
3 proteomics
Proteome
--> Proteomics: the science that studies the proteins expressed by a genome
--> proteome
    --> changes with the state of development, the tissue or the environmental conditions
--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins ...
4 proteomics
2D gels
--> a simple way to "see" a proteome
--> numerous proteins from a biological sample (for example blood) are separated according to 2 criteria:
    - molecular weight of the protein
    - isoelectric point
--> this method separates thousands of proteins simultaneously and displays them on a two-dimensional map
--> one spot = (generally) one purified protein
--> we can "see" the proteins, but we do not know which protein a given spot corresponds to...
5 protein identification
Spot identification: classical workflow
--> identifying a spot = giving a protein name to a spot
--> protein databases (for example SwissProt)
    - record all known protein sequences
    - annotated
MS identification (PMF)
--> select an unknown purified protein
    MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQ...
--> cut the amino acid chain into peptides (after every K and R)
--> measure the masses of the peptides by MS
MS/MS identification
--> select a peptide
--> fragment it
    MGMGQ MGQGWAWATWATA...
--> measure the masses of the fragments by MS
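The digestion step can be sketched in a few lines of Python. This is only an illustration of the "cut after every K and R" rule stated on the slide; the function name `digest` is hypothetical, and the sketch ignores refinements such as missed cleavages or the usual exception for a proline following the cleavage site.

```python
import re

def digest(sequence: str) -> list[str]:
    """Cut a protein sequence after every K and R (simplified rule from the slide)."""
    # Each match is either a run of residues ending in K or R,
    # or the trailing run that contains no K or R at all.
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)

protein = "MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK"
for peptide in digest(protein):
    print(peptide)
```

Joining the returned peptides gives back the original sequence, which is a quick sanity check on the cutting rule.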
6 protein identification
Shared peak count
MS spectrum: list of the masses of the peptides that constitute the protein of interest
MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest
MS: virtually cut the theoretical sequence into peptides and compute the masses
MS/MS: virtually cut the theoretical sequence into peptides, further cut the peptides into fragments, and compute the masses
Compare the lists of experimental and theoretical masses in order to find the best match between the experimental and virtual spectra
--> detection
--> ions
--> noise
[Figure: candidate entries from a protein database, e.g. hbb_human]
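A minimal sketch of a shared peak count, assuming both spectra are reduced to plain lists of masses and that a peak "matches" when it lies within a fixed tolerance. The names `shared_peak_count` and `best_match`, the tolerance value and the example masses are all hypothetical; real tools also deal with intensities, ion types and noise.

```python
def shared_peak_count(experimental, theoretical, tolerance=0.5):
    """Count experimental masses that match some theoretical mass within tolerance."""
    count = 0
    for mass in experimental:
        if any(abs(mass - t) <= tolerance for t in theoretical):
            count += 1
    return count

def best_match(experimental, candidates, tolerance=0.5):
    """Return the name of the candidate whose virtual spectrum shares the most peaks."""
    return max(
        candidates,
        key=lambda name: shared_peak_count(experimental, candidates[name], tolerance),
    )

# Made-up masses, for illustration only.
experimental = [100.1, 250.0, 399.8, 512.3]
candidates = {
    "pepA": [100.0, 250.2, 400.0, 600.0],
    "pepB": [120.0, 250.0, 700.0],
}
print(best_match(experimental, candidates))
```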
7 modifications and identification
Modified peptides (1)
PTMs
--> most eukaryotic proteins
--> addition of a chemical group
    - methylation: +14
    - phosphorylation: +80
    - glycosylation: > 800 ...
--> participate in
    - protein structures
    - protein functions
    - control of metabolic pathways
The database sequence may differ from the experimental peptide
- CONFLICT (different sources report differing sequences) --> in about 4'600 human entries
- VARIANT (authors report that sequence variants exist), alleles --> in about 2'200 human entries
- MUTATIONS associated with diseases --> 187 references to mutations and diseases in the COMMENTS section
8 modifications and identification
Modified peptides (2)
[Figure: a modified protein --> digestion --> MS, selection of the peptide --> fragmentation]
9 modifications and identification
SPC and modified peptides
[Figure: experimental MS/MS spectrum and modified experimental MS/MS spectrum (intensity vs. m/z), each compared with the spectrum of the theoretical peptide]
"Shared peak count" algorithms have to introduce
modifications into the theoretical peptide
databases.
10 modifications and identification
Database size (1)
Initial database:
AAIEGK LMQR APALK
New database, if the two following modifications are taken into account:
- modification occurring on amino acid A: A --> a
- modification occurring on amino acids L and E: L --> l and E --> e
AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK
LMQR lMQR
APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK
i.e. all the peptides from the initial database, plus all the modified peptides that can be built from the initial database
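The expansion above can be reproduced with a short sketch. The function name `modified_variants` is hypothetical; following the slide's notation, a modified residue is written in lower case.

```python
from itertools import product

def modified_variants(peptide, modifiable):
    """Return the peptide plus every variant with a subset of modifiable residues modified."""
    # Each modifiable residue has two states (unmodified, modified); others have one.
    options = [(aa, aa.lower()) if aa in modifiable else (aa,) for aa in peptide]
    return ["".join(choice) for choice in product(*options)]

database = ["AAIEGK", "LMQR", "APALK"]
expanded = [v for peptide in database for v in modified_variants(peptide, "ALE")]
print(len(expanded), expanded)
```

With three peptides and modifications on A, L and E, the database grows from 3 to 18 entries (8 + 2 + 8), which matches the list on the slide.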
11 modifications and identification
Database size (2)
Aim: assess the number of peptides that contain zero, one, two... "positions" for a possible modification
B(L,p,k) gives the probability of having k positions of modification in a sequence of length L, if p is the probability that a position may be modified (we assume the positions to be independent):
B(L,p,k) = C(L,k) p^k (1-p)^(L-k)
N0: xxxx
N1: oxxx xoxx xxox xxxo
N2: ooxx oxox oxxo xoox xoxo xxoo
Expected number of peptides with k positions (L = 10, 800'000 peptides):
              k = 0     k = 1     k = 2     k = 3     k = 4
p = 1/20      478'990   252'100   59'710    8'380     771
p = 5/20      45'050    150'169   225'254   200'225   116'798
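The figures in the table can be checked with a binomial computation. This is a small sketch; the function names are hypothetical, and the slide's values appear to be rounded, so small differences are expected.

```python
from math import comb

def B(L, p, k):
    """Probability of exactly k modifiable positions in a sequence of length L."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)

def expected_counts(n, L, p, max_k):
    """Expected number of peptides (out of n) with 0, 1, ..., max_k modifiable positions."""
    return [n * B(L, p, k) for k in range(max_k + 1)]

for p in (1 / 20, 5 / 20):
    print(p, [round(x) for x in expected_counts(800_000, 10, p, 4)])
```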
12 modifications and identification
Database size (3)
Expected number of peptides that may contain exactly M modifications
Expected size of the database when taking into account 0 to M modifications
[Figure: modification patterns for N0, N1, N2: xxxx / oxxx xoxx xxox xxxo / ooxx ooxx oxox oxox ...]
13 modifications and identification
Database size (3)
SwissProt Human, 10'000 proteins
n = 806'787 peptides [300, 3000] (from 3 to 30 aa)
L = 11 amino acids
- 0 to 3 modifications occurring on one specific amino acid:
  p = 1/20, P0to3_mod = 1'375'700
- 0 to 3 modifications that may occur on several loci, e.g. phosphorylation on H, D, S, T, Y (eukaryotes):
  p = 5/20, P0to3_mod = 4'865'100
- 0 to 3 modifications that may occur on every amino acid:
  p = 1, P0to3_mod = 3.97e12
Mutation scenario: each amino acid may mutate into one of the remaining 19 amino acids
All possible words: 19k-1, P1_mut = 1.16e14
14 modifications and identification
Other strategies
2 major problems
- size of the database
- a priori knowledge of the deltaMass due to the modification
Solutions: define an identification algorithm that is not based on SPC
--> spectral convolution/alignment
    - PEDANTA (2000)
--> de novo sequencing followed by sequence matching
    - extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...
    - extraction of one or several small tags: PeptideSearch (1994), Patchwork sequencing...
--> Popitam (2003): "guided" sequencing
15 modifications and identification
Spectral convolution/alignment
Pevzner PA, Dancik V, Tang CL. Mutation-tolerant
protein identification by mass spectrometry.
J. Comput. Biol. 2000, 7:777-787
Key idea: k-similarity D(k)
Given Sexp and Stheo, the goal is to find a series of k shifts in Sexp
that makes Sexp and Stheo as similar as
possible. D(k) represents the maximum number of
elements in common between a theoretical and an
experimental spectrum after k shifts
[Figure: theoretical MS/MS spectrum vs. experimental MS/MS spectrum]
SPC score: D(k=0) = 2
SA score: D(k=2) = 6
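A simplified way to see why shifts help is the spectral convolution: tabulate all mass differences between experimental and theoretical peaks and count how often each difference occurs. The count at difference 0 is the plain shared peak count, and a second frequent difference suggests a mass shift caused by a modification. The sketch below only covers this counting step, not the full k-shift alignment of the cited paper; the function name, the rounding precision and the masses are hypothetical.

```python
from collections import Counter

def spectral_convolution(experimental, theoretical, precision=1):
    """Count how many peak pairs share each (rounded) mass difference exp - theo."""
    return Counter(
        round(e - t, precision) for e in experimental for t in theoretical
    )

# Made-up spectra: the last two experimental peaks are shifted by +80.
theoretical = [100.0, 250.0, 420.0, 600.0]
experimental = [100.0, 250.0, 500.0, 680.0]

conv = spectral_convolution(experimental, theoretical)
print("shared peaks without shift:", conv[0])
print("shared peaks after a shift of 80:", conv[80])
```

In this toy case, two peaks match directly and two more match once a shift of 80 is allowed, so accepting one shift doubles the number of explained peaks.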
16 modifications and identification
De novo sequencing
Taylor JA, Johnson RS. Sequence database searches
via de novo peptide sequencing by tandem mass
spectrometry. Rapid Commun. Mass Spectrom. 1997,
11:1067-1075
Longest path problem in a directed acyclic graph
--> dynamic programming
--> complete sequences
--> mutations, but no modifications
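The longest-path formulation can be sketched as follows, assuming the spectrum has already been reduced to a list of prefix masses (one ion type, no noise model). Nodes are masses, an edge links two nodes whose mass difference equals an amino acid residue mass, and dynamic programming finds the longest labelled path. The function names and the tolerance are hypothetical; I and L share the same mass and are both reported as L.

```python
# Monoisotopic residue masses in Da (I is omitted: same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}

def edge_label(delta, tol=0.01):
    """Return the amino acid whose residue mass matches delta, or None."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tol:
            return aa
    return None

def longest_sequence(masses, tol=0.01):
    """Longest path in the spectrum graph, by dynamic programming over sorted nodes."""
    nodes = sorted(masses)
    best_len = [0] * len(nodes)      # edges in the longest path ending at node j
    best_prev = [None] * len(nodes)  # (previous node, edge label) on that path
    for j in range(len(nodes)):
        for i in range(j):
            aa = edge_label(nodes[j] - nodes[i], tol)
            if aa is not None and best_len[i] + 1 > best_len[j]:
                best_len[j] = best_len[i] + 1
                best_prev[j] = (i, aa)
    end = max(range(len(nodes)), key=lambda k: best_len[k])
    sequence = []
    while best_prev[end] is not None:
        end, aa = best_prev[end]
        sequence.append(aa)
    return "".join(reversed(sequence))

# Prefix masses of G, GA, GAS plus one noise peak at 150.0.
print(longest_sequence([0.0, 57.02146, 128.05857, 150.0, 215.0906]))
```

Sorting the nodes by mass guarantees that every edge points forward, which is what makes the graph acyclic and the single pass of dynamic programming valid.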
17 modifications and identification
Tag extraction
Mann M, Wilm M. Error-tolerant identification of
peptides in sequence databases by peptide
sequence tags. Anal. Chem. 1994, 66:4390-4399
- island of sequence ions
- the tags (m1-SEQ-m2) are manually extracted
- 2 steps: tags as filtering, then SPC
Schlosser A, Lehmann WD. Patchwork peptide
sequencing: extraction of sequence information
from accurate mass data of peptide tandem mass
spectra recorded at high resolution. Proteomics
2002, 2:524-533
- based on very accurate masses (10 mDa)
- small tags (2 aa) are extracted from low mass regions
18 Popitam
Popitam's key ideas
Spectrum graph
--> good way to structure the information contained in the MS/MS spectrum, allows mutations
Tags
--> modified source peptides
--> fragmented spectra
Search space
--> use database information during tag extraction
--> take into account only mutations compatible with the spectrum (graph)
--> make only modification scenarios compatible with the current theoretical peptide
Scoring function
--> take into account a lot of parameters
--> genetic programming
19 Popitam
Popitam overview
[Figure: Popitam workflow. Labels: any source of biological sequences; peptide sequence database (P1, P2, ...); MS/MS; initial node, final node; I(P1), I(P2), ...; filter; IDENTIFICATION]
20 Popitam
Spectrum graph
21 Popitam
Tag extraction
LTE Let Lvm ITE Iet Ivm tlE
ckTEetvmgoEV
peLTE peLet peLvm peITE peIet peIvm petlE
9 nodes, 11 edges --> 21 tags
22 Popitam
Tag extraction (2)
LVNELTEFAK (125 peaks)
Pentium, 1.6 GHz
AIGGGLSSVGGSSTIK (1159 peaks)
1   16/97     5.6e4    0m02s
2   30/338    5.4e6    0m27s
3   44/692    5.7e7    3m16s
4   58/1121   3.4e8    21m09s
5   72/1667   2.3e9    2h17m07s
AHFSISNSAEDPFIAIHADSK (145 peaks)
1   24/121    6.1e4    0m02s
2   46/308    1.9e8    16m15s
3   68/831    2.0e10   22h06m47s
23 Popitam
Tag extraction (3)
Recursively extract from the graph all tags that
are compatible with the current theoretical
peptide
--> a tag = a path (bMass, edge label, ionic hypothesis)
[Figure: recursion tree for the theoretical peptide ACCACMCAK, with branches labelled A, C and k leading to the remaining suffixes CACMCAK, CMCAK and MCAK]
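The guided extraction can be illustrated with a small recursive sketch. It assumes a spectrum graph given as a dictionary mapping each node to its outgoing (edge label, next node) pairs, and it treats a tag as compatible when its label string occurs in the theoretical peptide. The representation, the function name `extract_tags` and the toy graph are hypothetical, and the sketch leaves out the bMass and ionic hypothesis attached to a real tag.

```python
def extract_tags(graph, peptide, min_length=2):
    """Return (sequence, node path) for every path whose labels occur in the peptide."""
    tags = []

    def extend(node, sequence, path):
        if len(sequence) >= min_length:
            tags.append((sequence, tuple(path)))
        for label, nxt in graph.get(node, []):
            candidate = sequence + label
            # Prune: follow an edge only if the extended tag still fits the peptide.
            if candidate in peptide:
                extend(nxt, candidate, path + [nxt])

    for start in graph:
        extend(start, "", [start])
    return tags

# Toy spectrum graph: node -> list of (amino acid label, next node).
graph = {
    0: [("L", 1), ("A", 2)],
    1: [("T", 3)],
    2: [("T", 3)],
    3: [("E", 4)],
    4: [],
}
for sequence, path in extract_tags(graph, "PELTEK"):
    print(sequence, path)
```

The pruning test is what keeps the search small: the branch starting with A is abandoned at once because A does not occur in the theoretical peptide.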
24 Popitam
Tag processing
- discard subtags
- discard tags that begin the theoretical peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags
- AVVQDPALKPLALVYGEATSR
- PeakNb 1260
- ParentMass 2197.15
- NodeNb 86
- EdgeNb 142 / 1098
29 tags --> 13 subSeqs
KplALVYGE    30 39 43 45 50 58 63 64 68
plALVYGE     39 43 45 50 58 63 64 68
ALVYGE       43 45 50 58 63 64 68
LVYGE        45 50 58 63 64 68
VYGE         50 58 63 64 68
YGE          58 63 64 68
paLKplALvy   0 4 10 16 22 26 31 42
LKplALvy     4 10 16 22 26 31 42
KplALvy      10 16 22 26 31 42
plALvy       16 22 26 31 42
ALvy         22 26 31 42
LKPla        10 13 19 22 31
LKPla        10 14 19 22 31
KPla         13 19 22 31
KPla         14 19 22 31
PLAlv        29 35 40 42 48
LAlv         35 40 42 48
DpaL         65 69 78 84
LKP          11 15 20 24
LVY          16 19 24 29
LVY          44 49 57 62
PAL          19 22 26 31
QDP          10 16 20 24
alkpL        54 63 71 75
avVqd        0 5 9 18
dpAL         37 43 45 50
avVQD        55 60 65 70 75
VQD          60 65 70 75
paLK         59 66 69 75
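The "discard subtags" step can be sketched as follows, under the assumption that the numbers listed after each tag are the graph nodes it visits, so that a subtag is a tag whose node path is a contiguous part of a longer tag's path. The function names are hypothetical; the example tags are taken from the listing above.

```python
def is_subpath(short, long):
    """True if `short` is a strictly shorter contiguous slice of `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))

def discard_subtags(tags):
    """Keep only tags whose node path is not contained in another tag's path."""
    return [
        tag for tag in tags
        if not any(is_subpath(tag[1], other[1]) for other in tags)
    ]

tags = [
    ("KplALVYGE", (30, 39, 43, 45, 50, 58, 63, 64, 68)),
    ("plALVYGE", (39, 43, 45, 50, 58, 63, 64, 68)),
    ("ALVYGE", (43, 45, 50, 58, 63, 64, 68)),
    ("LVYGE", (45, 50, 58, 63, 64, 68)),
    ("LVY", (16, 19, 24, 29)),
    ("LVY", (44, 49, 57, 62)),
]
for sequence, path in discard_subtags(tags):
    print(sequence, path)
```

Comparing node paths rather than letters matters here: the two LVY tags read the same but run through different nodes, so neither is a subtag of KplALVYGE and both are kept.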
25 Popitam
Subsequence processing (1)
Aim: find all possible arrangements of subsequences, given the theoretical peptide, BUT do not include in the same arrangement tags that are incompatible with the others.
Compatibility rules
--> no peak shared
--> beginMasses must respect the positions in the sequence
A V V Q D P A L K P L A L V Y G E A T S R
0         5         10        15
[Figure: compatibility graph, shown as a matrix over subsequences 0, 1, 2, 3, 4, 5, ...]
0  KplALVYGE  794.41   0 1 2 6 15 19 21 27 30
1  LKPla      282.17   2 7 29 33 41
2  PLAlv      785.34   6 8 19 21 28
3  DpaL       1673.89  14 20 31 36
4  LKP        284.11   17 22 32 36
5  LVY        410.26   14 22 28 29
...
Each clique found in the graph is a possible arrangement of subsequences. Here, 91 cliques, but most of them are really uninteresting.
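A sketch of the clique search, restricted to the six subsequences listed above and to the first compatibility rule only (no peak shared). It assumes that the numbers following each mass are the peaks used by the subsequence; the second rule, on beginMasses and sequence positions, is not modelled. Maximal cliques are enumerated with the basic Bron-Kerbosch procedure, which is one standard choice and not necessarily what Popitam uses.

```python
from itertools import combinations

def compatibility_graph(peaks):
    """Link two subsequences when they share no peak."""
    adjacency = {i: set() for i in peaks}
    for a, b in combinations(peaks, 2):
        if not set(peaks[a]) & set(peaks[b]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def maximal_cliques(adjacency):
    """Enumerate maximal cliques (Bron-Kerbosch without pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)

peaks = {
    0: [0, 1, 2, 6, 15, 19, 21, 27, 30],  # KplALVYGE
    1: [2, 7, 29, 33, 41],                # LKPla
    2: [6, 8, 19, 21, 28],                # PLAlv
    3: [14, 20, 31, 36],                  # DpaL
    4: [17, 22, 32, 36],                  # LKP
    5: [14, 22, 28, 29],                  # LVY
}
print(maximal_cliques(compatibility_graph(peaks)))
```

Each printed clique is a set of subsequences that can coexist in one arrangement under this single rule; adding the position rule would remove further edges and therefore further cliques.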
26 Popitam
Scoring function (1)
--> 2-level scoring
- scoring linked to the subsequences (local); subscores:
    - number of tags that compose the subsequence
    - length of the subsequence
    - occurrence probabilities of the hypothesized ion types (geometric/arithmetic mean)
- scoring linked to the arrangement (global); subscores:
    - global coverage
    - linear regression
Arrangements for AVVQDPALKPLALVYGEATSR:
KplALVYGE 794.4, LKP 284.1, LVY 1202.7
KplALVYGE 794.4, LKP 284.1, LVY 1202.7, avVqd 1.0
KplALVYGE 794.4, LKP 284.1, avVqd 1.0
KplALVYGE 794.4, LVY 1202.7
...
27 Popitam
Scoring function (2)
How can we combine the subscores in order to build an efficient scoring function?
--> empirical function (expert knowledge)
--> probabilistic function
--> function built using GENETIC PROGRAMMING
GENETIC PROGRAMMING
population of "programs" = trees
nodes:
- mathematical operators (+, -, *, /, ...)
- boolean operators (AND, OR, NOT...)
- conditional operators (if-then-else...)
- iterative functions (do-until...)
- other specific functions...
leaves: subscores, coefficients
28 Popitam
Genetic operators (1)
Initialization: programs are initially determined at random (structure, functions, values)
Iterations: at each iteration, the programs are evaluated (fitness function). Only
the best are allowed to reproduce, using genetic
operators (permutation, mutation,
crossing-over...).
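The loop described above can be sketched with a toy genetic programming run. Everything specific here is an assumption made for illustration: programs are nested tuples with three arithmetic operators, the leaves are two hypothetical subscores (`coverage`, `tag_count`) or constants, and the fitness is simply the fraction of training cases in which the correct candidate outscores a wrong one. Only mutation and crossing-over are shown.

```python
import operator
import random

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
LEAVES = ["coverage", "tag_count"]  # hypothetical subscore names

def random_tree(rng, depth=2):
    """Random program: a leaf (subscore or coefficient) or (operator, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES) if rng.random() < 0.7 else round(rng.uniform(-1, 1), 2)
    return (rng.choice(list(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, subscores):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, subscores), evaluate(right, subscores))
    return subscores[tree] if isinstance(tree, str) else tree

def mutate(tree, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def crossover(a, b, rng):
    """Graft a randomly chosen subtree of b at a random position in a."""
    donor = b
    while isinstance(donor, tuple) and rng.random() < 0.5:
        donor = donor[rng.choice((1, 2))]
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return donor
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(left, b, rng), right)
    return (op, left, crossover(right, b, rng))

def fitness(tree, cases):
    """Fraction of (correct, wrong) pairs where the correct candidate scores higher."""
    return sum(evaluate(tree, good) > evaluate(tree, bad) for good, bad in cases) / len(cases)

def evolve(cases, generations=20, size=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, cases), reverse=True)
        history.append(fitness(population[0], cases))
        parents = population[: size // 3]  # only the best reproduce
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    population.sort(key=lambda t: fitness(t, cases), reverse=True)
    return population[0], history

# Made-up training cases: (subscores of the correct peptide, subscores of a wrong one).
cases = [
    ({"coverage": 0.8, "tag_count": 5}, {"coverage": 0.3, "tag_count": 6}),
    ({"coverage": 0.6, "tag_count": 3}, {"coverage": 0.5, "tag_count": 1}),
    ({"coverage": 0.9, "tag_count": 2}, {"coverage": 0.2, "tag_count": 4}),
]
best, history = evolve(cases)
print(best, history[-1])
```

Because the parents are carried over unchanged to the next generation, the best fitness recorded in `history` can never decrease from one iteration to the next.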
29 Popitam
Genetic operators (2)
30 Popitam
Genetic programming
Genetic programming allows testing several
scoring functions and making them evolve "cleverly"
in order to find an optimal one
[Figure: tree population; each scoring function (scoring function1, scoring function2, scoring function3) is run through Popitam and receives a fitness]
if (correctId())            si in [0.5, 1]   (according to the discriminative power)
else if (belongToList())    si in [0, 0.5]   (according to the position in the list)
else                        si = 0
31 Popitam
Some results