Title: In silico methods to predict protein phosphorylation sites
1In silico methods to predict protein
phosphorylation sites
C.C. King, Ph.D.
3Outline
Section 1: Introduction
1) Overview of Grb14 and PDK-1
2) Searching for protein sequences on the internet
Section 2: Phosphorylation prediction programs
1) Literature and unpublished/published high-throughput datasets
2) Computational
3) Orientated peptide libraries
4) The consensus on Grb14
Section 3: In-class exercise
5 3-phosphoinositide-dependent kinase, PDK-1
Linker Region
PH domain
Kinase Domain
6Growth factor receptor-bound protein 14 (Grb14)
Ras Association Domain
SH2 domain
PH domain
Proline Rich Regions
Function: adaptor protein (Insulin Receptor, PDGF-R, EGF-R, FGF-R); plays a critical role in terminating insulin signaling; binds PDK-1 through a PIF sequence
8 Searching for protein sequences on the internet
Protein Sequence Information
NCBI - http://www.ncbi.nlm.nih.gov/
Swiss-Prot/UniProt - http://www.uniprot.org/
Human Protein Reference Database - http://www.hprd.org/
Protein Kinase Resources
Kinase.com - http://www.kinase.com/
Highlights: KinBase allows one to search for kinase information across species, with links to major protein databases. Updates on various Kinome projects.
Protein Kinase Resource - http://www.nih.go.jp/mirror/Kinases/
Highlights: protocols, structure links, sequence alignments, list of researchers working on kinases
9Protein Sequence Information UniProt (part 1)
uniprot.org > query: grb14 or q14449
10Protein Sequence Information UniProt (part 2)
11FASTA format
A text-based format comprising two parts:
1) a single-line description of the protein (marked by a greater-than symbol, ">")
2) lines of sequence data representing single-letter codes for nucleic acid or peptide sequences
>sp|Q14449|GRB14_HUMAN Growth factor receptor-bound protein 14 OS=Homo sapiens GN=GRB14 PE=1 SV=2
MTTSLQDGQSAASRAAARDSPLAAQVCGAAQ
GRGDAHDLAPAPWLHARALLPLPDGTRGCAADRRKKKDLDVPEMPSIPNP
FPELCCSPFTSVLSADLFPKANSRKKQVIKVYSEDETSRALDVPSDITAR
DVCQLLILKNHYIDDHSWTLFEHLPHIGVERTIEDHELVIEVLSNWGIEE
ENKLYFRKNYAKYEFFKNPMYFFPEHMVSFATETNGEISPTQILQMFLSS
STYPEIHGFLHAKEQGKKSWKKIYFFLRRSGLYFSTKGTSKEPRHLQFFS
EFGNSDIYVSLAGKKKHGAPTNYGFCFKPNKAGGPRDLKMLCAEEEQSRT
CWVTAIRLLKYGMQLYQNYMHPYQGRSGCSSQSISPMRSISENSLVAMDF
SGQKSRVIENPTEALSVAVEEGLAWRKKGCLRLGTHGSPTASSQSSATNM
AIHRSQPWFHHKISRDEAQRLIIQQGLVDGVFLVRDSQSNPKTFVLSMSH
GQKIKHFQIIPVEDDGEMFHTLDDGHTRFTDLIQLVEFYQLNKGVLPCKL
KHYCARIAL
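The two-part FASTA layout described on this slide can be read with a few lines of code. The following is a minimal sketch, not a replacement for an established parser; the function name `parse_fasta` is hypothetical, and the example uses only the start of the Grb14 record shown above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Only the first two sequence lines of the Grb14 entry are used here.
example = """>sp|Q14449|GRB14_HUMAN Growth factor receptor-bound protein 14
MTTSLQDGQSAASRAAARDSPLAAQVCGAAQ
GRGDAHDLAPAPWLHARALLPLPDGTRGCAADRRKKKDLDVPEMPSIPNP
"""
records = parse_fasta(example)
description, sequence = records[0]
```

Joining the wrapped lines before any position lookup matters, because the prediction programs later in the talk report residue numbers relative to the full, unbroken sequence.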
Database: identifier syntax
GenBank: gi|gi-number|gb|accession|locus
EMBL Data Library: gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
NBRF PIR: pir||entry
Protein Research Foundation: prf||name
SWISS-PROT: sp|accession|name
Brookhaven Protein Data Bank (1): pdb|entry|chain
Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE
Patents: pat|country|number
GenInfo Backbone Id: bbs|number
General database identifier: gnl|database|identifier
NCBI Reference Sequence: ref|accession|locus
Local Sequence identifier: lcl|identifier
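As an illustration of the SWISS-PROT identifier layout (database tag, accession, entry name, separated by vertical bars), the sketch below splits the first token of a description line into its fields. The helper name `parse_swissprot_id` is hypothetical, and the other database layouts in the list would need their own handling.

```python
def parse_swissprot_id(defline):
    """Split a SWISS-PROT style FASTA identifier into its three fields."""
    token = defline.lstrip(">").split()[0]  # identifier is the first word
    fields = token.split("|")
    if len(fields) != 3 or fields[0] != "sp":
        raise ValueError(f"not a SWISS-PROT style identifier: {token!r}")
    db, accession, name = fields
    return {"db": db, "accession": accession, "name": name}


grb14_id = parse_swissprot_id(
    ">sp|Q14449|GRB14_HUMAN Growth factor receptor-bound protein 14"
)
```

The accession (here Q14449) is the stable key to carry between databases; entry names and free-text descriptions are less reliable for lookups.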
12Which of the following is NOT an attribute of protein sequence databases like UniProt/NCBI?
- Identification of conserved domains
- Links to primary literature in PubMed
- Identified phosphorylation sites
- FASTA sequences with links to BLAST
- None of the above
14Method for prediction: Literature
Program: PhosphoMotif Finder
URL: http://www.hprd.org/PhosphoMotif_finder
15A Brief Introduction to the HPRD (Human Protein Reference Database)
URL: http://www.hprd.org
A comprehensive site that provides information
about the interactions, sub-cellular
localization, post-translational modifications,
tissue distribution, and domain structure of
specific proteins
16cDNA and protein sequence of Grb14
Amino Acid sequence within domains
NCBI links
17Experimentally identified interactions of
Grb14 (with embedded PubMed Links)
18Experimentally identified PTMs of Grb14 - HPRD
19Method for prediction: Literature and unpublished/published high-throughput datasets
Program: Phospho.ELM
URL: http://phospho.elm.eu.org/
20Experimentally identified PTMs of Grb14 Phospho
ELM
21Method for prediction: Literature and unpublished/published high-throughput datasets
Program: PhosphoSitePlus
URL: http://www.phosphosite.org/
22Experimentally identified PTMs of Grb14
PhosphoSite Plus
23Experimentally identified PTMs of Grb14
PhosphoSite Plus
25Experimentally identified PTMs of Grb14
PhosphoSite Plus
26A second example of PhosphoSitePlus: a brief introduction to PHLPP and mass spectrometry
28PhosphoSitePlus and PHLPP: mass spectrometry data
The amino acid masses
29Literature and unpublished/published databases provide the user with all of the following information except:
- Identified phosphorylation sites from the literature
- Protein-protein interactions
- Predicted phosphorylation sites from large scale mass spectrometry experiments
- Previously unidentified phosphorylation sites
- None of the above
31Method for prediction: Computational
Program: NetPhos 2.0 (artificial neural network)
URL: http://www.cbs.dtu.dk/services/NetPhos/
32Method for prediction NetPhos 2.0
Grb14 sequence
MTTSLQDGQSAASRAAARDSPLAAQVCGAAQGRGDAHDLAPAPWLHARALLPLPDGTRGCAADRRKKKDLDVPEMPSIPN    80
PFPELCCSPFTSVLSADLFPKANSRKKQVIKVYSEDETSRALDVPSDITARDVCQLLILKNHYIDDHSWTLFEHLPHIGV   160
ERTIEDHELVIEVLSNWGIEEENKLYFRKNYAKYEFFKNPMYFFPEHMVSFATETNGEISPTQILQMFLSSSTYPEIHGF   240
LHAKEQGKKSWKKIYFFLRRSGLYFSTKGTSKEPRHLQFFSEFGNSDIYVSLAGKKKHGAPTNYGFCFKPNKAGGPRDLK   320
MLCAEEEQSRTCWVTAIRLLKYGMQLYQNYMHPYQGRSGCSSQSISPMRSISENSLVAMDFSGQKSRVIENPTEALSVAV   400
EEGLAWRKKGCLRLGTHGSPTASSQSSATNMAIHRSQPWFHHKISRDEAQRLIIQQGLVDGVFLVRDSQSNPKTFVLSMS   480
HGQKIKHFQIIPVEDDGEMFHTLDDGHTRFTDLIQLVEFYQLNKGVLPCKLKHYCARIAL   560
............S......S............................................................    80
.......................S.........S...T..........T.............Y....S............   160
.........................Y...............Y.................S.............Y......   240
.........S.............Y.S...TS.................Y...............................   320
.........................................S.S.S...S.S........................S...   400
.......................S..S.................S......................S............   480
............................................................   560
Phosphorylation sites predicted: Ser 19, Thr 3, Tyr 6
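The dotted lines in the NetPhos output are aligned with the sequence: a dot means no prediction, and a letter marks a predicted phosphorylated residue at that position. A minimal sketch of converting such a line back into residue numbers follows; the function name `predicted_positions` is hypothetical, and only the first 30 residues of Grb14 are used.

```python
def predicted_positions(prediction, offset=1):
    """Return (position, residue) pairs for every non-dot character.

    `offset` is the residue number of the first character of the line,
    so later lines of the output can be converted with offset=81, 161, ...
    """
    return [(i + offset, ch) for i, ch in enumerate(prediction) if ch != "."]


sequence = "MTTSLQDGQSAASRAAARDSPLAAQVCGAA"    # Grb14 residues 1-30
prediction = "............S......S.........."  # matching NetPhos marks
sites = predicted_positions(prediction)
```

Checking that `sequence[position - 1]` equals the reported letter is a cheap way to catch an off-by-one error before planning a mutation.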
33Serine predictions
Name      Pos  Context    Score  Pred
___________________v_________________
Sequence    4  -MTTSLQDG  0.027  .
Sequence   10  QDGQSAASR  0.009  .
Sequence   13  QSAASRAAA  0.960  S
Sequence   20  AARDSPLAA  0.968  S
Sequence   77  PEMPSIPNP  0.036  .
Sequence   88  ELCCSPFTS  0.020  .
Sequence   92  SPFTSVLSA  0.050  .
Sequence   95  TSVLSADLF  0.073  .
Sequence  104  PKANSRKKQ  0.993  S
Sequence  114  IKVYSEDET  0.968  S
Sequence  119  EDETSRALD  0.185  .
Sequence  126  LDVPSDITA  0.221  .
Sequence  148  IDDHSWTLF  0.761  S
Sequence  175  IEVLSNWGI  0.011  .
Sequence  210  EHMVSFATE  0.013  .
Sequence  220  NGEISPTQI  0.969  S
Sequence  230  QMFLSSSTY  0.008  .
Sequence  231  MFLSSSTYP  0.024  .
Sequence  232  FLSSSTYPE  0.427  .
Sequence  250  QGKKSWKKI  0.995  S
Sequence  261  FLRRSGLYF  0.104  .
Sequence  266  GLYFSTKGT  0.982  S
Sequence  271  TKGTSKEPR  0.984  S
Sequence  281  LQFFSEFGN  0.004  .
Sequence  286  EFGNSDIYV  0.024  .
Sequence  291  DIYVSLAGK  0.064  .
Sequence  329  EEEQSRTCW  0.380  .
Sequence  358  YQGRSGCSS  0.337  .
Sequence  361  RSGCSSQSI  0.033  .
Sequence  362  SGCSSQSIS  0.766  S
Sequence  364  CSSQSISPM  0.625  S
Sequence  366  SQSISPMRS  0.987  S
Sequence  370  SPMRSISEN  0.991  S
Sequence  372  MRSISENSL  0.896  S
Sequence  375  ISENSLVAM  0.453  .
Sequence  382  AMDFSGQKS  0.028  .
Sequence  386  SGQKSRVIE  0.060  .
Sequence  397  TEALSVAVE  0.680  S
Sequence  419  GTHGSPTAS  0.070  .
Sequence  423  SPTASSQSS  0.192  .
Sequence  424  PTASSQSSA  0.930  S
Sequence  426  ASSQSSATN  0.137  .
Sequence  427  SSQSSATNM  0.901  S
Sequence  436  AIHRSQPWF  0.013  .
Sequence  445  HHKISRDEA  0.938  S
Sequence  468  LVRDSQSNP  0.996  S
Sequence  470  RDSQSNPKT  0.091  .
Sequence  478  TFVLSMSHG  0.479  .
Sequence  480  VLSMSHGQK  0.312  .
_____________________________________
Method for prediction NetPhos 2.0
Threonine predictions
Name      Pos  Context    Score  Pred
___________________v_________________
Sequence    2  ---MTTSLQ  0.052  .
Sequence    3  --MTTSLQD  0.273  .
Sequence   57  LPDGTRGCA  0.120  .
Sequence   91  CSPFTSVLS  0.102  .
Sequence  118  SEDETSRAL  0.903  T
Sequence  129  PSDITARDV  0.929  T
Sequence  150  DHSWTLFEH  0.204  .
Sequence  163  GVERTIEDH  0.268  .
Sequence  213  VSFATETNG  0.048  .
Sequence  215  FATETNGEI  0.158  .
Sequence  222  EISPTQILQ  0.008  .
Sequence  233  LSSSTYPEI  0.023  .
Sequence  267  LYFSTKGTS  0.033  .
Sequence  270  STKGTSKEP  0.977  T
Sequence  302  HGAPTNYGF  0.478  .
Sequence  331  EQSRTCWVT  0.063  .
Sequence  335  TCWVTAIRL  0.010  .
Sequence  393  IENPTEALS  0.019  .
Sequence  416  LRLGTHGSP  0.277  .
Sequence  421  HGSPTASSQ  0.028  .
Sequence  429  QSSATNMAI  0.006  .
Sequence  474  SNPKTFVLS  0.038  .
Sequence  502  EMFHTLDDG  0.350  .
Sequence  508  DDGHTRFTD  0.026  .
Sequence  511  HTRFTDLIQ  0.041  .
_____________________________________
Tyrosine predictions
Name      Pos  Context    Score  Pred
___________________v_________________
Sequence  113  VIKVYSEDE  0.008  .
Sequence  143  LKNHYIDDH  0.577  Y
Sequence  186  ENKLYFRKN  0.695  Y
Sequence  191  FRKNYAKYE  0.066  .
Sequence  194  NYAKYEFFK  0.286  .
Sequence  202  KNPMYFFPE  0.571  Y
Sequence  234  SSSTYPEIH  0.806  Y
Sequence  255  WKKIYFFLR  0.096  .
Sequence  264  RSGLYFSTK  0.592  Y
Sequence  289  NSDIYVSLA  0.987  Y
Sequence  304  APTNYGFCF  0.021  .
Sequence  342  RLLKYGMQL  0.028  .
Sequence  347  GMQLYQNYM  0.336  .
Sequence  350  LYQNYMHPY  0.020  .
Sequence  354  YMHPYQGRS  0.121  .
Sequence  520  LVEFYQLNK  0.039  .
Sequence  534  KLKHYCARI  0.008  .
_____________________________________
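In the serine, threonine, and tyrosine listings, every residue flagged S, T, or Y has a score above 0.5 and every unflagged residue scores below it, which is consistent with a 0.5 decision threshold. The sketch below applies that cut-off to the tyrosine scores copied from the listing; the threshold constant is an assumption inferred from these numbers, and the function name is hypothetical.

```python
THRESHOLD = 0.5  # assumed cut-off, inferred from the scores in the listing

# Position -> NetPhos score for every tyrosine in Grb14 (from the listing).
tyrosine_scores = {
    113: 0.008, 143: 0.577, 186: 0.695, 191: 0.066, 194: 0.286,
    202: 0.571, 234: 0.806, 255: 0.096, 264: 0.592, 289: 0.987,
    304: 0.021, 342: 0.028, 347: 0.336, 350: 0.020, 354: 0.121,
    520: 0.039, 534: 0.008,
}


def predicted_sites(scores, threshold=THRESHOLD):
    """Return the positions whose score exceeds the threshold, in order."""
    return sorted(pos for pos, score in scores.items() if score > threshold)


predicted_tyr = predicted_sites(tyrosine_scores)
strongest = max(tyrosine_scores, key=tyrosine_scores.get)
```

Raising the threshold trades sensitivity for specificity: at 0.9, for example, only Tyr 289 would remain on the list.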
34Method for prediction: Computational
Program: KinasePhos (Hidden Markov Model, HMM)
URL: http://kinasephos.mbc.nctu.edu.tw/
35Method for prediction KinasePhos
36KinasePhos Results Page 1
Location: position of the potentially phosphorylated amino acid
Phosphorylated Sites: AA sequence surrounding the potential site
HMM Bit Score: the score is the base-two logarithm of the ratio between the probability that the query sequence is a significant match and the probability that it is generated by a random model.
E-value: the E-value represents the expected number of sequences with a score greater than or equal to the returned HMMER bit score. While decreasing the E-value threshold favors finding true positives, increasing the E-value threshold favors finding true negatives.
We select the HMMER score as the criterion to define an HMM match. A search with a HMMER score greater than 0 is taken as a match (positive prediction), i.e., an HMM recognizes a phosphorylation site.
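The bit score definition above is a log-odds ratio, and the "greater than 0" rule simply asks whether the site model explains the sequence better than the random model. A minimal sketch of that arithmetic follows; it is not KinasePhos or HMMER code, the function names are hypothetical, and the probabilities are made-up inputs.

```python
import math


def hmm_bit_score(p_model, p_random):
    """Base-two log of the ratio between model and random-model probability."""
    return math.log2(p_model / p_random)


def is_match(bit_score, threshold=0.0):
    """Positive prediction when the bit score exceeds the threshold."""
    return bit_score > threshold


# Hypothetical probabilities: the site model is 8 times more likely
# than the random model, which corresponds to a score of 3 bits.
score = hmm_bit_score(2 ** -10, 2 ** -13)
```

A score of exactly 0 means both models explain the sequence equally well, so it is not counted as a match under the rule quoted above.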
37KinasePhos Results Page 2
1) Summary of potential phosphorylation sites
2) Predicted kinases
39Method for prediction: Orientated Peptide Libraries
Program: ScanSite
URL: http://scansite.mit.edu
40Method for prediction Orientated Peptide
Libraries
Amino acid distribution frequencies surrounding serine, threonine, and tyrosine residues in proteomes and within mapped phosphorylation sites.
- The relative occurrence of each amino acid in each flanking position in the bacterial and mammalian subsets of GenPept was normalized to the corresponding frequency of that amino acid within the entire database subset, and plotted topographically. Distribution values range from 0.7 (dark blue) to 1.3 (bright red).
- Relative occurrences of each amino acid within mapped phosphorylation sites for PKA, Src, and EGFR kinases using data from PhosphoBase. The frequency of each amino acid has been normalized to the corresponding global amino acid frequency in mammalian GenPept. To facilitate comparisons, the color bar has been divided into two linear scales, one ranging from 0.0 (negative selection, dark blue) to 1.0 (neutral selection, gray), and one ranging from 1.0 (neutral selection) to the maximum value seen in these conserved motifs (bright red).
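The normalization described in this caption divides the frequency of an amino acid at a given flanking position by its overall (background) frequency, so that 1.0 means neutral, values above 1.0 mean enrichment, and values below 1.0 mean depletion. The sketch below shows that calculation on toy data; the site windows, the uniform background, and the function name are all hypothetical and are not taken from GenPept or PhosphoBase.

```python
from collections import Counter


def normalized_frequencies(windows, background):
    """For each column of equal-length windows, return {aa: observed / background}."""
    n = len(windows)
    width = len(windows[0])
    table = []
    for column in range(width):
        counts = Counter(window[column] for window in windows)
        table.append(
            {aa: (count / n) / background[aa] for aa, count in counts.items()}
        )
    return table


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical background: every amino acid equally common (5% each).
uniform_background = {aa: 1 / 20 for aa in AMINO_ACIDS}
# Hypothetical 5-residue windows with the phosphorylated serine at index 3.
toy_windows = ["RRASL", "RRGSV", "KRASI", "RKSSL"]
table = normalized_frequencies(toy_windows, uniform_background)
```

With a realistic background the ratios are far less extreme than in this toy case, which is why the first panel of the figure spans only 0.7 to 1.3.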
41Method for prediction Orientated Peptide
Libraries
Comparison of the optimal sequence determined by the peptide library with sequences at the same region of known substrates and inhibitors of PKA. Bold residues emphasize positions that are important for phosphorylation.
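One way to use a peptide-library result is to turn the preferences at each flanking position into a scoring matrix and slide it along a protein. The sketch below illustrates the idea only; it is not ScanSite's scoring algorithm, the selectivity values are hypothetical (they merely favor arginine at the -3 and -2 positions), and the function name is made up. The test fragment is the stretch of Grb14 around Ser 261.

```python
import math

# Hypothetical selectivity values keyed by offset from the site
# (1.0 = neutral; anything not listed is treated as neutral).
HYPOTHETICAL_MATRIX = {
    -3: {"R": 4.0},
    -2: {"R": 4.0},
}


def score_sites(sequence, matrix, targets="ST", flank=3):
    """Score every S/T with full flanks; return hits sorted by score."""
    hits = []
    for i, residue in enumerate(sequence):
        if residue not in targets:
            continue
        if i - flank < 0 or i + flank >= len(sequence):
            continue  # skip sites too close to either end
        score = 0.0
        for offset in range(-flank, flank + 1):
            if offset == 0:
                continue  # the phosphoacceptor itself is not scored
            neighbour = sequence[i + offset]
            selectivity = matrix.get(offset, {}).get(neighbour, 1.0)
            score += math.log2(selectivity)
        window = sequence[i - flank:i + flank + 1]
        hits.append((i + 1, window, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


fragment = "FFLRRSGLYF"  # Grb14 residues around Ser 261
hits = score_sites(fragment, HYPOTHETICAL_MATRIX)
```

Positions in the result are numbered within the fragment, so they need to be shifted by the fragment's start position before being compared with the other programs.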
42Method for prediction Orientated Peptide
Libraries
q14449
43Method for prediction Orientated Peptide
Libraries
Change settings: Stringency (Low vs High)
Domain Information - Pfam
GeneCard Link
Sequence Blast link
Score
PDK-1
44Method for prediction: Orientated Peptide Libraries
Program: ScanSite
URL: http://scansite.mit.edu
Search using a ScanSite Motif
45Method for prediction Orientated Peptide
Libraries Program ScanSite Query Page
46Method for prediction Orientated Peptide
Libraries Program ScanSite Results 1-20 (of
2000)
48The consensus on Grb14
MTTSLQDGQSAASRAAARDSPLAAQVCGAAQGRGDAHDLAPAPWLHAR
ALLPLPDGTRGCAADRRKKKDLDVPEMPSIPNPFPELCCSPFTSVLSADL
FPKANSRKKQVIKVYSEDETSRALDVPSDITARDVCQLLILKNHYIDDHS
WTLFEHLPHIGVERTIEDHELVIEVLSNWGIEEENKLYFRKNYAKYEFFK
NPMYFFPEHMVSFATETNGEISPTQILQMFLSSSTYPEIHGFLHAKEQGK
KSWKKIYFFLRRSGLYFSTKGTSKEPRHLQFFSEFGNSDIYVSLAGKKKH
GAPTNYGFCFKPNKAGGPRDLKMLCAEEEQSRTCWVTAIRLLKYGMQLYQ
NYMHPYQGRSGCSSQSISPMRSISENSLVAMDFSGQKSRVIENPTEALSV
AVEEGLAWRKKGCLRLGTHGSPTASSQSSATNMAIHRSQPWFHHKISRDE
AQRLIIQQGLVDGVFLVRDSQSNPKTFVLSMSHGQKIKHFQIIPVEDDGE
MFHTLDDGHTRFTDLIQLVEFYQLNKGVLPCKLKHYCARIAL
NetPhos
KinasePhos
ScanSite
Database
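A consensus across programs can be built by counting how many sources report each position. The sketch below shows one simple way to do this; the NetPhos set holds the tyrosines flagged Y in the NetPhos 2.0 listing, while the other two entries are hypothetical placeholders, not real output from any program or database, and the function name is made up.

```python
from collections import Counter


def consensus_sites(predictions, min_votes=2):
    """Return {position: votes} for positions reported by >= min_votes sources."""
    votes = Counter(
        pos for sites in predictions.values() for pos in set(sites)
    )
    return {pos: n for pos, n in sorted(votes.items()) if n >= min_votes}


predictions = {
    # Tyrosines flagged 'Y' in the NetPhos 2.0 listing for Grb14.
    "NetPhos": {143, 186, 202, 234, 264, 289},
    # Hypothetical placeholders standing in for other sources.
    "hypothetical_tool_B": {186, 289, 347},
    "hypothetical_tool_C": {289, 520},
}
consensus = consensus_sites(predictions)
```

Equal votes are a simplification: an experimentally observed site from a curated database would normally outweigh any number of purely computational predictions.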
49Grb14 is tyrosine phosphorylated. Given all of
the information you now have about databases and
predicting phosphorylation sites, which tyrosine
would you mutate?
- Site 1
- Site 2
- Site 3
- Site 4
- Another tyrosine, I am smarter than all of these
prediction programs
51In class exercise: Predict phosphorylation sites.
3 proteins that interact with Grb14
Split into three groups and use as many of the databases as you can in the time remaining to identify domain structure and known/predicted phosphorylation sites.
- Sequestosome (AKA p62, ZIP): UniProt Q13501
- Tankyrase 2: UniProt Q9H2K2
- IRS-1: UniProt P35568
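For the exercise, the FASTA records can also be retrieved programmatically rather than through the web page. The sketch below assumes UniProt's REST endpoint of the form `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, which is not part of the original slides and may change; the function and constant names are hypothetical, and calling `fetch_fasta` requires network access.

```python
from urllib.request import urlopen

# Assumed URL pattern for UniProt's REST service (not from the slides).
UNIPROT_FASTA = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

# Accessions listed for the three Grb14-interacting proteins.
EXERCISE_PROTEINS = {
    "Sequestosome (p62, ZIP)": "Q13501",
    "Tankyrase 2": "Q9H2K2",
    "IRS-1": "P35568",
}


def fasta_url(accession):
    """Build the download URL for one UniProt accession."""
    return UNIPROT_FASTA.format(accession=accession)


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text (needs an internet connection)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


urls = {name: fasta_url(acc) for name, acc in EXERCISE_PROTEINS.items()}
```

For example, `fetch_fasta("Q13501")` should return the Sequestosome record as text, which can then be pasted into any of the prediction programs covered in Section 2.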