Title: Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions
1 Exploiting Basic Evolutionary Principles for
the Quality Control of Gene Predictions
László Patthy, Institute of Enzymology, Budapest
Darwin Day, Collegium Budapest, February 9, 2009
2 In the last decade the genomes of numerous
organisms have been sequenced; however,
conversion of raw genome sequence data into
biological knowledge remains a difficult task.
Genome annotation - the process that maps
biological knowledge onto the relevant
genome elements - requires defining the
positions of all protein-coding (and non-coding)
genes along the genome sequence and identifying
their coding regions, regulatory sequences,
promoters, etc.
3 Although a large number of programs have been
developed for computational gene identification,
correct prediction of the structure of all
protein-coding genes of higher eukaryotes is
still an elusive goal. The uncertainties
associated with gene finding may be illustrated
by the fact that - eight years after the
publication of the draft genome sequence (2001) -
the exact number of protein-coding genes in the
human genome is still unknown.
4 Finishing the euchromatic sequence of the human
genome. Nature. 2004 Oct 21;431(7011):931-45.
5 Proc Natl Acad Sci U S A. 2007 Dec
4;104(49):19428-33.
6 Since direct evidence of protein existence is
generally absent, the criterion often employed to
annotate a transcript as protein-coding is the
existence of an Open Reading Frame (ORF).
However, this criterion has been recently
questioned by a number of methods developed to
assess the quality of protein-coding gene
annotations.
7 The rationale of the method of Clamp et al. is
that functional protein-coding genes are subject
to purifying selection and are therefore
expected to show evolutionary conservation. The
authors used two measures to assess the
evolutionary conservation of predicted human
genes: reading frame conservation (RFC, based on
the observation that indels do not significantly
affect the size of functional proteins) and
codon substitution frequency (CSF, based on the
observation that the pattern of nucleotide
substitution in functional protein-coding genes
differs from that observed in random DNA).
Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF,
Kellis M, Lindblad-Toh K, Lander ES.
Distinguishing protein-coding and noncoding genes
in the human genome. Proc Natl Acad Sci U S A.
2007 Dec 4;104(49):19428-33.
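The reading frame conservation idea can be illustrated with a small sketch. This is a simplified stand-in, not the procedure used by Clamp et al.: the function name, the gapped-alignment input and the toy sequences are all assumptions made for illustration. It reports the largest fraction of aligned nucleotides that share a single reading-frame offset, so indels whose length is a multiple of three leave the score at 1.0 while frame-shifting indels lower it.

```python
def reading_frame_conservation(ref_aln: str, other_aln: str) -> float:
    """Fraction of aligned nucleotides that share the dominant frame offset.

    Both arguments are rows of one pairwise alignment, with '-' for gaps.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")

    i = j = 0              # ungapped positions in each sequence
    counts = [0, 0, 0]     # aligned columns per relative frame offset
    aligned = 0
    for a, b in zip(ref_aln, other_aln):
        if a != "-" and b != "-":
            aligned += 1
            counts[(i - j) % 3] += 1
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return max(counts) / aligned if aligned else 0.0


# Toy alignments (hypothetical sequences).
print(reading_frame_conservation("ATGAAACCCGGG", "ATG---CCCGGG"))  # in-frame deletion
print(reading_frame_conservation("ATGAAACCCGGG", "ATG-AACCCGGG"))  # frame-shifting deletion
```

In the first toy alignment a whole codon is deleted and every aligned nucleotide stays in frame; in the second, a single-base deletion shifts the frame of everything downstream.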
8 In their analysis of a number of human gene
reference sets, Clamp et al. identified 1200
human orphans: ORFs that lack homology to known
genes. Both RFC and CSF analysis revealed that
many of these human orphans exhibit a behavior
that is essentially indistinguishable from
matched random controls and very different from
that observed in non-orphan protein-coding genes.
From this, the authors concluded that overall
about 15% of the entries in the gene catalogues
investigated are not valid protein-coding genes.
Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF,
Kellis M, Lindblad-Toh K, Lander ES.
Distinguishing protein-coding and noncoding genes
in the human genome. Proc Natl Acad Sci U S A.
2007 Dec 4;104(49):19428-33.
9 While the quality control method of Clamp et al.
can distinguish protein-coding genes from
noncoding sequences, it is less suitable for
identifying gene predictions that are only
partially correct. If an annotated gene misses
one or more exons, or a fraction of one exon, it
may still exhibit the expected evolutionary
characteristics of protein-coding genes.
10 Indeed, in addition to uncertainty about the
number of protein-coding genes, a very serious
problem is that the structure of a significant
proportion of human genes is incorrectly
predicted. According to recent analyses, the
predicted genomic structure is estimated to be
correct for only about half of the predicted
human genes. Obviously, erroneous prediction of
the structure of protein-coding genes leads to
serious problems in predicting the structure and
function of the proteins they encode and hinders
the identification of elements that regulate
their expression.
11 A recent study systematically compared the
performance of various computational methods for
predicting human protein-coding genes. A set of
well-annotated ENCODE sequences was
blind-analyzed with the different gene-finding
programs, and the predictions obtained were
compared with the annotations. Predictions were
analyzed at the nucleotide, exon, transcript and
gene levels to evaluate how well they reproduce
the annotation.
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde
J, Denoeud F, Antonarakis S, Ashburner M, Bajic
VB, Birney E, Castelo R, Eyras E, Ucla C,
Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese
MG. EGASP: the human ENCODE Genome Annotation
Assessment Project. Genome Biol. 2006;7 Suppl
1:S2.1-31. Epub 2006 Aug 7. Review.
12 The computational methods compared were
classified as:
- EST-, mRNA-, and protein-based methods
(AUGUSTUS-EST, PARAGONNSCAN_EST, ACEVIEW,
ENSEMBL, EXOGEAN, EXONHUNTER, ACEMBLY, ECGene,
MGCGene)
- single-genome ab initio methods (AUGUSTUSabinit,
GENEMARKhmm, GENEZILLA, GENEID, GENESCAN)
- dual- or multiple-genome based comparative
genomic methods (AUGUST-dual, ACESCAN, DOGFISH-C,
NSCAN, SAGA, MARS, SGP2, TWINSCAN)
- complex methods using any type of available
information (AUGUSTUSany, FGENESH, JIGSAW,
PARAGONany, CCDSGene, KNOWNGene, REFSEQ)
13 At all levels, two basic measures were
computed:
- sensitivity (Sn): the proportion of annotated
features (nucleotide, exon, gene) that have been
correctly predicted
- specificity (Sp): the proportion of predicted
features that are correct
The average of sensitivity and specificity
((Sn + Sp)/2) was also calculated for each program.
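As a minimal sketch of these two measures at the exon level, the snippet below treats each exon as a (start, end) pair and counts a prediction as correct only when both coordinates match an annotated exon. The function name and the coordinates are hypothetical and are not taken from the EGASP evaluation itself.

```python
def sensitivity_specificity(annotated: set, predicted: set):
    """Return (Sn, Sp, (Sn + Sp) / 2) for two sets of features.

    Sn = correctly predicted features / annotated features
    Sp = correctly predicted features / predicted features
    """
    correct = annotated & predicted
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


# Hypothetical exon coordinates for one locus.
annotated_exons = {(100, 200), (300, 400), (500, 600), (700, 800)}
predicted_exons = {(100, 200), (300, 400), (450, 480)}

sn, sp, average = sensitivity_specificity(annotated_exons, predicted_exons)
print(f"Sn={sn:.2f} Sp={sp:.2f} (Sn + Sp)/2={average:.2f}")
```

Here two of four annotated exons are recovered (missing exons lower Sn), and one of three predicted exons has no annotated counterpart (wrong exons lower Sp).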
14 Gene feature projection for evaluation of the
accuracy of predictions
[Figure: KNOWN annotation compared with PREDICTION; marked features: missing exons, wrong exons]
Guigó et al., Genome Biol. 2006;7 Suppl
1:S2.1-31.
15 Gene prediction accuracy at the transcript level.
Boxplots of the average sensitivity and
specificity ((Sn + Sp)/2) for each program.
A transcript is accurately predicted if the
beginning and end of translation are correctly
annotated and each of the 5' and 3' splice sites
of the coding exons is correct.
Guigó et al., Genome Biol. 2006;7 Suppl
1:S2.1-31.
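The transcript-level criterion can be sketched as an exact comparison of coding-exon coordinates: if every coding exon matches, then the translation start, the translation end and all splice sites match as well. The function names and coordinates below are assumptions for illustration only.

```python
def transcript_is_correct(annotated_cds, predicted_cds) -> bool:
    """True only if all coding exon (start, end) pairs are identical."""
    return sorted(annotated_cds) == sorted(predicted_cds)


def transcript_sensitivity(annotated, predicted) -> float:
    """Fraction of annotated transcripts reproduced exactly by a prediction."""
    if not annotated:
        return 0.0
    hits = sum(
        any(transcript_is_correct(a, p) for p in predicted) for a in annotated
    )
    return hits / len(annotated)


# Hypothetical coding exons of two annotated and two predicted transcripts.
annotated = [[(100, 200), (300, 400)], [(100, 200), (350, 400)]]
predicted = [[(100, 200), (300, 400)], [(100, 220), (350, 400)]]
print(transcript_sensitivity(annotated, predicted))
```

The second prediction misplaces a single splice site (220 instead of 200) and therefore does not count, which is why this level is the most stringent.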
16 These studies have revealed that:
- none of the strategies produced perfect
predictions
- prediction methods that rely on mRNA and protein
sequences, and those that used combined
information (including expressed sequence
information), were generally the most accurate
- the dual- or multiple-genome methods were more
accurate than the single-genome ab initio
prediction methods
- at the transcript level (the most stringent
criterion), no prediction method correctly
identified more than 45% of the coding
transcripts
Guigó et al., Genome Biol. 2006;7 Suppl
1:S2.1-31.
17 The MisPred project
The implicit question is: are there signs that
could indicate that the predicted structure of a
protein-coding gene may be incorrect? The
rationale of our MisPred project is that a
protein-coding gene is suspected to be
mispredicted if some of its features conflict
with our current knowledge about protein-coding
genes and proteins.
MisPred: Database of mispredicted and abnormal
proteins, http://mispred.enzim.hu/
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E,
Bányai L, Patthy L. Identification and correction
of abnormal, incomplete and mispredicted proteins
in public databases. BMC Bioinformatics. 2008 Aug
27;9:353.
18 Several quality control tools of MisPred
address the question of whether the predicted
protein is able to reach the cellular compartment
where it can be properly folded, stable and
functional. The rationale of these tools is
that protein domains have adapted to different
subcellular compartments during evolution, and
they are usually misfolded, unstable and
non-functional if mislocalized.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E,
Bányai L, Patthy L. Identification and correction
of abnormal, incomplete and mispredicted proteins
in public databases. BMC Bioinformatics. 2008 Aug
27;9:353.
19 As a corollary, in multidomain proteins
domain types do not co-occur at random:
- in extracellular proteins, domains adapted to
the extracellular milieu are used
- extracellular and intracellular domains can
co-occur only in transmembrane proteins
- nuclear and extracellular domains do not
co-occur in a single protein, etc.
[Figure: Domain co-occurrence network of metazoan multidomain proteins; legend: extracellular domain, cytoplasmic signalling domain, nuclear domain]
Tordai H, Nagy A, Farkas K, Bányai L, Patthy L.
Modules, multidomain proteins and organismic
complexity. FEBS J. 2005 Oct;272(19):5064-78.
20 Some mislocalization-based MisPred tools used for
the identification of abnormal or mispredicted
proteins:
- Conflict between the presence of extracellular
domains and the absence of the appropriate
sequence signals.
- Conflict between the presence of extracellular
and intracellular signaling domains and the
absence of transmembrane domains.
- Co-occurrence of extracellular and nuclear
domains.
21 Conflict between the presence of extracellular
domains and the absence of the appropriate
sequence signals.
Rationale: proteins containing domains that
occur exclusively in the extracellular space
(e.g. in secreted extracellular proteins, in
the extracellular part of type I, type II or type
III single-pass transmembrane proteins, or in
multispanning transmembrane proteins) have a
cleavable signal peptide at the N-terminal end
and/or transmembrane segments. Accordingly,
proteins that contain extracellular domains
but lack both a signal peptide and transmembrane
segments are considered abnormal.
[Figure: latrophilin-2, with signal peptide (SP) and transmembrane (TM) regions marked]
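The rule on this slide can be written as a simple predicate. The sketch below is not the MisPred implementation: the `Prediction` record, the `EXTRACELLULAR_ONLY` set and the example entries are hypothetical, and it assumes that signal peptides, transmembrane segments and domain families have already been assigned by separate tools.

```python
from dataclasses import dataclass

# Hypothetical list of domain families regarded as extracellular-only.
EXTRACELLULAR_ONLY = frozenset({"Sushi", "Cadherin", "Kunitz_BPTI"})


@dataclass(frozen=True)
class Prediction:
    name: str
    domains: frozenset        # domain family names found in the protein
    has_signal_peptide: bool  # result of a signal peptide predictor
    n_tm_segments: int        # number of predicted transmembrane segments


def lacks_localization_signal(p: Prediction) -> bool:
    """Extracellular domain present, but neither signal peptide nor TM segment."""
    has_extracellular = bool(p.domains & EXTRACELLULAR_ONLY)
    has_signal = p.has_signal_peptide or p.n_tm_segments > 0
    return has_extracellular and not has_signal


complete = Prediction("complete", frozenset({"Sushi"}), True, 0)
truncated = Prediction("n_terminally_truncated", frozenset({"Sushi"}), False, 0)
print(lacks_localization_signal(complete))   # False
print(lacks_localization_signal(truncated))  # True
```

A prediction that has lost its N-terminal exon, and with it the signal peptide, is flagged, whereas the complete protein is not.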
22 Complement factor H, isoform b.
[Figure labels: SP; Q8N708, CORRECT]
enst00000359637.1.pep UNI_TREMBL:Q8N708
ID Q8N708 PRELIMINARY; PRT; 449 AA.
AC Q8N708;
DT 01-OCT-2002 (TrEMBLrel. 22, Created)
DT 01-OCT-2002 (TrEMBLrel. 22, Last sequence update)
DT 01-MAR-2003 (TrEMBLrel. 23, Last annotation update)
DE HF1 protein. . . .
SCORES Init1: 3167 Initn: 3167 Opt: 3167 z-score: 3657.9 E(): 1.1e-195
>>UNI_TREMBL:Q8N708 (449 aa)
initn: 3167 init1: 3167 opt: 3167 Z-score: 3657.9 expect(): 1.1e-195
Smith-Waterman score: 3167; 99.5% identity in 430 aa overlap (1-430:20-449)
10 20 30 40
enst00000359
DCNELPPRRNTEILTGSWSDQTYPEGTQAIYKCRPGYRSLG
Q8N708 MRLLAKIICLMLWAICVAEDC
NELPPRRNTEILTGSWSDQTYPEGTQAIYKCRPGYRSLG
10 20 30 40
50 60 50 60
70 80 90 100
enst00000359 NVIMVCRKGEWVALNPLRKCQKRPCGHPGDTPFGTF
TLTGGNVFEYGVKAVYTCNEGYQL
Q8N708 NVIMVCRKGEWVALNPLRKCQKRPCG
HPGDTPFGTFTLTGGNVFEYGVKAVYTCNEGYQL
70 80 90 100 110
120 110 120
130 140 150 160 enst00000359
LGEINYRECDTDGWTNDIPICEVVKCLPVTAPENGKIVSSAMEPDREYHF
GQAVRFVCNS
Q8N708
LGEINYRECDTDGWTNDIPICEVVKCLPVTAPENGKIVSSAMEPDREYHF
GQAVRFVCNS 130 140
150 160 170 180
170 180 190 200 210
220 enst00000359 GYKIEGDEEMHCSDDGFWSKEKPKCVE
ISCKSPDVINGSPISQKIIYKENERFQYKCNMG
Q8N708 GYKIEGDEEMHCSDDGFWSKEKPKCV
EISCKSPDVINGSPISQKIIYKENERFQYKCNMG
190 200 210 220 230
240 230 240
250 260 270 280 enst00000359
YEYSERGDAVCTESGWRPLPSCEEKSCDNPYIPNGDYSPLRIKHRTGDEI
TYQCRNGFYP
Q8N708
YEYSERGDAVCTESGWRPLPSCEEKSCDNPYIPNGDYSPLRIKHRTGDEI
TYQCRNGFYP 250 260
270 280 290 300
290 300 310 320 330
340 enst00000359 ATRGNTAKCTSTGWIPAPRCTLKPCDY
PDIKHGGLYHENMRRPYFPVAVGKYYSYYCDEH
Q8N708 ATRGNTAKCTSTGWIPAPRCTLKPCD
YPDIKHGGLYHENMRRPYFPVAVGKYYSYYCDEH
310 320 330 340 350
360 350 360
370 380 390 400 enst00000359
FETPSGSYWDHIHCTQDGWSPAVPCLRKCYFPYLENGYNQNHGRKFVQGK
SIDVACHPGY
Q8N708
FETPSGSYWDHIHCTQDGWSPAVPCLRKCYFPYLENGYNQNYGRKFVQGK
SIDVACHPGY 370 380
390 400 410 420
410 420 430 enst00000359
ALPKAQTTVTCMENGWSPTPRCIRVKFTL
Q8N708
ALPKAQTTVTCMENGWSPTPRCIRVSFTL
430 440
ENSP00000352658.1
MISPREDICTED
23 Interleukin-2 receptor alpha chain precursor
[Figure labels: SP, TM; P01589, CORRECT]
enst00000256876.3.pep UNI_SPROT:IL2A_HUMAN
ID IL2A_HUMAN STANDARD; PRT; 272 AA.
AC P01589;
DT 21-JUL-1986 (Rel. 01, Created)
DT 21-JUL-1986 (Rel. 01, Last sequence update)
DT 01-OCT-2004 (Rel. 45, Last annotation update)
DE Interleukin-2 receptor alpha chain precursor (IL-2 receptor alpha . . .
SCORES Init1: 637 Initn: 719 Opt: 637 z-score: 759.2 E(): 3.2e-34
>>UNI_SPROT:IL2A_HUMAN (272 aa)
initn: 719 init1: 637 opt: 637 Z-score: 759.2 expect(): 3.2e-34
Smith-Waterman score: 637; 100.0% identity in 92 aa overlap (1-92:31-122)
10
20 30 enst00000256
IPHATFKAMAYKEGTMLNCECKRGFRRIKS
IL2A_HUMAN
MDSYLLMWGLLTFIMVPGCQAELCDDDPPEIPHATFKAMAYKEGTMLNCE
CKRGFRRIKS 10 20
30 40 50 60
40 50 60 70 80
90 enst00000256 GSLYMLCTGNSSHSSWDNQCQCTSSAT
RNTTKQVTPQPEEQKERKTTEMQSPMQPVDQAS
IL2A_HUMAN GSLYMLCTGNSSHSSWDNQCQCTSSA
TRNTTKQVTPQPEEQKERKTTEMQSPMQPVDQAS
70 80 90 100 110
120 100
enst00000256
LPDFQIQTEMAATMETSI
IL2A_HUMAN
LPGHCREPPPWENEATERIYHFVVGQMVYYQCVQGYRALHRGPAESVCKM
THGKTRWTQP 130 140
150 160 170 180
ENSP00000256876.3
MISPREDICTED
24 CD209 antigen-like protein 1
enst00000248228.2.pep UNI_SPROT:209L_HUMAN
ID 209L_HUMAN STANDARD; PRT; 399 AA.
AC Q9H2X3; Q969M4; Q96QP3; Q96QP4; Q96QP5; Q96QP6; Q9BXS3; Q9H2Q9;
AC Q9H8F0; Q9Y2A8;
DT 05-JUL-2004 (Rel. 44, Created)
DT 05-JUL-2004 (Rel. 44, Last sequence update)
DT 01-OCT-2004 (Rel. 45, Last annotation update) . . .
SCORES Init1: 1721 Initn: 1997 Opt: 1732 z-score: 1784.9 E(): 2.4e-91
>>UNI_SPROT:209L_HUMAN (399 aa)
initn: 1997 init1: 1721 opt: 1732 Z-score: 1784.9 expect(): 2.4e-91
Smith-Waterman score: 2034; 82.2% identity in 399 aa overlap (1-332:1-399)
10 20 30 40
enst00000248 MSDSKEPRVQQLGLLEEDPT
TSGIRLFPRDFQFQQIHGHKSST------------VPFLL
209L_HUMAN MSDSKEPRVQQLGLLEEDPTTS
GIRLFPRDFQFQQIHGHKSSTGCLGHGALVLQLLSFML
10 20 30 40
50 60 50 60
70 80 90
enst00000248 --G-------PVSKVPSSLSQEQSEQDAIYQNLTQL
KAAVGELSEKSKLQEIYQELTQLK
209L_HUMAN LAGVLVAILVQVSKVPSSLSQEQSEQDAIYQNLTQLK
AAVGELSEKSKLQEIYQELTQLK 70
80 90 100 110 120
100 110 120 130
140 150 enst00000248
AAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQEIYQELTRLKAAVG
ELPEKSKLQE
209L_HUMAN
AAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQEIYQELTRLKAAVG
ELPEKSKLQE 130 140
150 160 170 180
160 170 180 190
enst00000248 IYQELTRLKAAVGELPEKSKLQEI
YQELTELKAAV-------------------------
209L_HUMAN IYQELTRLKAAVGELPEKSKLQEIYQ
ELTELKAAVGELPEKSKLQEIYQELTQLKAAVGE
190 200 210 220 230
240
200 210 220 230 enst00000248
---------------------ERLCRHCPKDWTFFQGNCYFMSNSQRNWH
DSVTACQEVR
209L_HUMAN
LPDQSKQQQIYQELTDLKTAFERLCRHCPKDWTFFQGNCYFMSNSQR
NWHDSVTACQEVR 250 260
270 280 290 300
240 250 260 270 280
290 enst00000248 AQLVVIKTAEEQNFLQLQTSRSNR
FSWMGLSDLNQEGTWQWVDGSPLSPSFQRYWNSGEP
209L_HUMAN AQLVVIKTAEEQNFLQLQTSRSNRFS
WMGLSDLNQEGTWQWVDGSPLSPSFQRYWNSGEP
310 320 330 340 350
360 300 310
320 330 enst00000248 NNSGNEDCAEFSGSGWNDNRC
DVDNYWICKKPAACFRDE
209L_HUMAN
NNSGNEDCAEFSGSGWNDNRCDVDNYWICKKPAACFRDE
370 380 390
TM
CORRECT
Q9H2X3
ENSP00000248228.2
MISPREDICTED
25 Cadherin-2 precursor
[Figure labels: SP, TM]
ABNORMAL: AC110015.1_002, 191 residues
CORRECT: CADH2_HUMAN, 906 residues
1
50 cadh2_human MCRIAGALRT
LLPLLLALLQ ASVEASGEIA LCKTGFPEDV YSAVLSKDVH
ac110015
51
100 cadh2_human EGQPLLN... VKFSNCNGKR KVQYESSEPA
DFKVDEDGMV YAVRSFPLSS ac110015 MCNTQR
MKFSNCNGKR KVQYESSEPA DFKVDEDGMV YAVRSFPLSS
101
150 cadh2_human EHAKFLIYAQ DKETQEKWQV
AVKLSLKPTL TEESVKESAE VEEIVFPRQF ac110015
EHAKFLIYAQ DKETQEKWQV AVKLSLKPTL TEESVKESAE
VEEIVFPRQF 151
200 cadh2_human
SKHSGHLQRQ KRDWVIPPIN LPENSRGPFP QELVRIRSDR
DKNLSLRYSV ac110015 SKHSGHLQRQ KRDWVIPPIN
LPENSRGPFP QELVRIRSDR DKNLSLRYSV
201
250 cadh2_human TGPGADQPPT GIFIINPISG
QLSVTKPLDR EQIARFHLRA HAVDINGNQV ac110015
TGPGADQPPT GIFIINPISG QLSVTKPLDR EQIARFHLRA
HAVDI 251
300 cadh2_human
ENPIDIVINV IDMNDNRPEF LHQVWNGTVP EGSKPGTYVM
TVTAIDADDP ac110015
301
350 cadh2_human NALNGMLRYR IVSQAPSTPS
PNMFTINNET GDIITVAAGL DREKVQQYTL ac110015
351
400 cadh2_human
IIQATDMEGN PTYGLSNTAT AVITVTDVND NPPEFTAMTF
YGEVPENRVD ac110015
401
450 cadh2_human IIVANLTVTD KDQPHTPAWN
AVYRISGGDP TGRFAIQTDP NSNDGLVTVV ac110015
451
500 cadh2_human
KPIDFETNRM FVLTVAAENQ VPLAKGIQHP PQSTATVSVT
VIDVNENPYF ac110015
501
550 cadh2_human APNPKIIRQE EGLHAGTMLT
TFTAQDPDRY MQQNIRYTKL SDPANWLKID ac110015
551
600 cadh2_human
PVNGQITTIA VLDRESPNVK NNIYNATFLA SDNGIPPMSG
TGTLQIYLLD ac110015
601
650 cadh2_human INDNAPQVLP QEAETCETPD
PNSINITALD YDIDPNAGPF AFDLPLSPVT ac110015
651
700 cadh2_human
IKRNWTITRL NGDFAQLNLK IKFLEAGIYE VPIIITDSGN
PPKSNISILR ac110015
701
750 cadh2_human VKVCQCDSNG DCTDVDRIVG
AGLGTGAIIA ILLCIIILLI LVLMFVVWMK ac110015
751
800 cadh2_human
RRDKERQAKQ LLIDPEDDVR DNILKYDEEG GGEEDQDYDL
SQLQQPDTVE ac110015
801
850 cadh2_human PDAIKPVGIR RMDERPIHAE
PQYPVRSAAP HPGDIGDFIN EGLKAADNDP ac110015
851
900 cadh2_human
TAPPYDSLLV FDYEGSGSTA GSLSSLNSSS SGGEQDYDYL
NDWGPRFKKL ac110015
901 cadh2_human ADMYGGGDD ac110015
26 Conflict between the presence of extracellular
and intracellular signalling domains and the
absence of transmembrane domains.
Rationale: extracellular domains and intracellular
signalling domains can co-occur in multidomain
proteins only if transmembrane segments separate
these two types of domains.
Accordingly, proteins that contain extracellular
and intracellular signalling domains but lack a
transmembrane segment separating them are
considered abnormal.
[Figure: Domain co-occurrence network of metazoan multidomain proteins; legend: extracellular module, cytoplasmic signalling module, nuclear module]
Tordai et al., FEBS J. 2005;272(19):5064-78.
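A minimal sketch of this topology check is shown below, assuming domains and transmembrane segments are available as (start, end) residue intervals. The function name and all coordinates are hypothetical; this is an illustration of the rule, not the MisPred code.

```python
def missing_separating_tm(extracellular, intracellular, tm_segments) -> bool:
    """True if some extracellular/intracellular domain pair has no TM segment
    lying entirely between the two domains."""
    for ext in extracellular:
        for cyt in intracellular:
            first, second = sorted((ext, cyt))       # order by start position
            gap_start, gap_end = first[1], second[0]
            separated = any(
                gap_start <= tm_start and tm_end <= gap_end
                for tm_start, tm_end in tm_segments
            )
            if not separated:
                return True
    return False


# Hypothetical receptor-like layout: ectodomain, TM helix, kinase domain.
ectodomain = [(30, 200)]
kinase = [(600, 860)]
print(missing_separating_tm(ectodomain, kinase, [(550, 572)]))  # False
print(missing_separating_tm(ectodomain, kinase, []))            # True
```

The second call models a prediction in which the exon encoding the transmembrane helix was missed, so the two domain types appear without anything separating them.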
27 ENSXETP00000040601 (Xenopus tropicalis) is
erroneous, since it lacks a transmembrane segment
although it contains both extracellular and
cytoplasmic signaling domains.
28 The chicken ortholog of ENSXETP00000040601
(Xenopus tropicalis), EPHA7_CHICK, Ephrin type-A
receptor 7 (np_990414), does contain a
transmembrane segment.
ENSXETP00000040601 (Xenopus tropicalis) deviates
most significantly from EPHA7_CHICK in this region.
Query 181 SDVTYRVVCKRCSWEQGECIPCANTIGYVPQQSGLVD
TYISIVDLVAHANYTFEVEAVNG 240
DVTYRCKRCSWEQGECPC IGYPQQGLVD
YDLAHANYTFEVEAVNG Sbjct 361
NDVTYRILCKRCSWEQGECVPCGSNIGYMPQQTGLVDNYVTVMDLLAHAN
YTFEVEAVNG 420 Query 241 VSDLSRSQRLFAAVSVTTGQA
APSQVSGVMKERVLQRAVDLSWQEPEHPNGVITEYEIKY 300
VSDLSRSQRLFAAVSTTGQAAPSQVSGVMKERVLQRVLSW
QEPEHPNGVITEYEIKY Sbjct 421 VSDLSRSQRLFAAVSITTGQ
AAPSQVSGVMKERVLQRSVELSWQEPEHPNGVITEYEIKY
480 Query 301 YEKDQRERTYSTLKTKSTSVSINNLRPGTAYIF
QIRAFTAAGYGMYSPRLDVSTLEEATV 360
YEKDQRERTYSTKTKSTS SINNLPGT YFQIRAFTAAGYG
YSPRLDVTLEEAT Sbjct 481 YEKDQRERTYSTVKTKSTSASI
NNLKPGTVYVFQIRAFTAAGYGNYSPRLDVATLEEATA
540 Query 361 YYIFA-CSYCIAYIMGSQSSLLLCLQIALQLLI
NSSSLYYTAALCDLNYNKSLKMHFPSG 419
I I Y A D L
HF Sbjct 541 TAVSSEQNPVIIIAVVAVAGTIILVFMVFGF
IIGRRHCGYSKA--DQEGDEELYFHF--- 595 Query 420
LVKFPGTKTYIDPETYEDPNRAVHQFAKELDASCIKIERVIGAGEFGEVC
SGRLKLPGKR 479 KFPGTKTYIDPETYEDPNRA
VHQFAKELDASCIKIERVIGAGEFGEVCSGRLKLPGKR Sbjct 596
--KFPGTKTYIDPETYEDPNRAVHQFAKELDASCIKIERVIGAGEFGEV
CSGRLKLPGKR 653 Query 480 DVPVAIKTLKVGYTEKQRRD
FLCEASIMGQFDHPNVVHLEGVVTRGKPVMIVIEFMENGA 539
DV VAIKTLKVGYTEKQRRDFLCEASIMGQFDHPNVVHLEGV
VTRGKPVMIVIEMENGA Sbjct 654 DVAVAIKTLKVGYTEKQRR
DFLCEASIMGQFDHPNVVHLEGVVTRGKPVMIVIEYMENGA
713 Query 540 LDAFLRKLDGQFTVIQLVGMLRGIAAGMRYLAD
MGYVHRDLAARNILVNSNLVCKVSDFG 599
LDAFLRK DGQFTVIQLVGMLRGIAAGMRYLADMGYVHRDLAARNILVNS
NLVCKVSDFG Sbjct 714 LDAFLRKHDGQFTVIQLVGMLRGIAAG
MRYLADMGYVHRDLAARNILVNSNLVCKVSDFG 773 Query
600 LSRIIEDDPDAVYTTTQGGKIPVRWTAPEAIQYRKFTSASDVWSY
GIVMWEVMSYGERPY 659 LSRIEDDPAVYTTT
GGKIPVRWTAPEAIQYRKFTSASDVWSYGIVMWEVMSYGERPY Sbjct
774 LSRVIEDDPEAVYTTT-GGKIPVRWTAPEAIQYRKFTSASDVWS
YGIVMWEVMSYGERPY 832 Query 660
WDMSNQDVIKAIEEGYRLPAPMDCPAGLHQLMLDCWQKERGERPKFEQIV
GILDKMIRNP 719 WDMSNQDVIKAIEEGYRLPAPM
DCPAGLHQLMLDCWQKERGERPKFEQIVGILDKMIRNP Sbjct 833
WDMSNQDVIKAIEEGYRLPAPMDCPAGLHQLMLDCWQKERGERPKFEQI
VGILDKMIRNP 892 Query 720 NSLKTPMGTCNRPTSPLLDQ
NTLDFNSFCSVGEWLEAIKMERYKENFSSSGYNSLESVAR 779
NSLKTPGTCRP SPLLDQNT DF
FCSVGEWLAIKMERYKNFGYNSLESVAR Sbjct 893
NSLKTPLGTCSRPISPLLDQNTPDFTTFCSVGEWLQAIKMERYKDNFTAA
GYNSLESVAR 952 Query 780 MSIDDVISLGITLVGHQKKIM
NSIQTMRAQMLQLHGTGI 818
MIDVSLGITLVGHQKKIMSIQTMRAQML LHGTGI Sbjct
953 MTIEDVMSLGITLVGHQKKIMSSIQTMRAQMLHLHGTGI 991
29 The erroneous part of ENSXETP00000040601
(Xenopus tropicalis) could be corrected by
identifying the exons encoding the missing
transmembrane segment.
451
500 ensxetp00000040
601_corrected KERVLQRAVD LSWQEPEHPN GVITEYEIKY
YEKDQRERTY STLKTKSTSV
np_990414 KERVLQRSVE LSWQEPEHPN GVITEYEIKY
YEKDQRERTY STVKTKSTSA
ensxetp00000040601 KERVLQRAVD LSWQEPEHPN
GVITEYEIKY YEKDQRERTY STLKTKSTSV
501
550 ensxetp00000040601_corrected
SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATA
TAVSTEQNPV np_990414
SINNLKPGTV YVFQIRAFTA AGYGNYSPRL DVATLEEATA
TAVSSEQNPV ensxetp00000040601
SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATV
YYIFACSYCI 551
600 ensxetp00000040601_corrected IIIAVVAVAG
TIILVFMVFG FIIGRRHCGY SKA..DQEGD EELYFHC...
np_990414 IIIAVVAVAG TIILVFMVFG
FIIGRRHCGY SKA..DQEGD EELYFHF...
ensxetp00000040601 AYI.MGSQSS LLLCLQIALQ
LLINSSSLYY TAALCDLNYN KSLKMHFPSG
601
650 ensxetp00000040601_corrected
......TKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV
IGAGEFGEVC np_990414
..KFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV
IGAGEFGEVC ensxetp00000040601
LVKFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV
IGAGEFGEVC 651
700 ensxetp00000040601_corrected SGRLKLPGKR
DVPVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE
np_990414 SGRLKLPGKR DVAVAIKTLK
VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE
ensxetp00000040601 SGRLKLPGKR DVPVAIKTLK
VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE
30 Co-occurrence of extracellular and nuclear
domains.
Rationale: nuclear domains do not
co-occur with extracellular domains in
multidomain proteins. Accordingly, proteins that
contain both extracellular and nuclear domains
are considered abnormal.
[Figure: Domain co-occurrence network of metazoan multidomain proteins; legend: extracellular module, cytoplasmic signalling module, nuclear module]
Tordai et al., FEBS J. 2005;272(19):5064-78.
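This co-occurrence rule reduces to a set intersection once each domain family has been assigned to a compartment. In the sketch below the two category sets and the function name are assumptions for illustration; a real tool would need a curated classification of domain families.

```python
# Hypothetical compartment assignments for a few domain families.
NUCLEAR = frozenset({"Homeobox"})
EXTRACELLULAR = frozenset({"Kunitz_BPTI"})


def has_nuclear_extracellular_conflict(domains) -> bool:
    """True if one protein carries both nuclear and extracellular domains."""
    found = set(domains)
    return bool(found & NUCLEAR) and bool(found & EXTRACELLULAR)


# A chimeric prediction that joins two neighbouring genes is flagged,
# while each constituent protein on its own is not.
print(has_nuclear_extracellular_conflict(["Homeobox", "Kunitz_BPTI"]))  # True
print(has_nuclear_extracellular_conflict(["Homeobox"]))                 # False
print(has_nuclear_extracellular_conflict(["Kunitz_BPTI"]))              # False
```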
31 Mispredicted protein (Swiss-Prot entry)
containing nuclear and extracellular domains
MISPREDICTED: YL15_CAEEL, Hypothetical homeobox
protein C02F12.5 in chromosome X
1
50 q619j1_caebr
yl15_caeel MTSKTNMTSN KFAYDFFPWS NDTNSSQQIK
NIKPPPKRSN RPTKRTTFTS 51
100 q619j1_caebr
MFVWSAAVLI FSSVVPTFAQ YGCI....SE yl15_caeel
EQVTLLELEF AKNEYICKDR RGELAQTIEL TECQVKTWFQ
NRRTKKRSSE 101
150 q619j1_caebr
LTFGKACPQN KTSTKWFFDA KLSFCYPYQF LGCDEGSNSF
ESSDICLESC yl15_caeel LKFGTACSEN KTSTKWYYDS
KLLFCYPYKY LGCGEGSNSF ESNENCLESC
151
200 q619j1_caebr KPADQFSCGG NTDADGICFS
PSDSGCKKGT DCVMGGNIGF CCNKATQDEW yl15_caeel
KPADQFSCGG NTGPDGVCFA HGDQGCKKGT VCVMGGMVGF
CCDKKIQDEW 201
250 q619j1_caebr
NKEHSPTCSK GSVVQFKQWF GMTPLIGRNC AHKFCPAGST
CIQGKWTAHC yl15_caeel NKENSPKCLK GQVVQFKQWF
GMTPLIGRSC SHNFCPEKST CVQGKWTAYC
251 q619j1_caebr CQ yl15_caeel CQ
Exons belonging to tandem genes on C. elegans
chromosome X have been incorrectly joined.
32 Corrected predictions for the distinct
constituent proteins containing the nuclear
homeobox and extracellular KUNITZ_BPTI domains
CORRECT
1
50 yl15_caeel_corr MTSKTNMTSN
KFAYDFFPWS NDTNSSQQIK NIKPPPKRSN RPTK.RTTFT
hm07_caeel MK HEMVFTFLLM
MVRPEASTSR IPRR.RTTFT q7qbz2_anoga
MLFTTS YSRNKPTNNS NVARRRKKEG
RPRRQRTTFS 51
100 yl15_caeel_corr
SEQVTLLELE FAKNEYICKD RRGELAQTIE LTECQVKTWF
QNRRTKKRSF hm07_caeel VEQLYLLEMY
FAQSQYVGCD ERERLARILS LDEYQVKIWF QNRRIRMRRE
q7qbz2_anoga SEQTLRLEVE FHRNEYISRG RRFELAEVLK
LSETQIKIWF QNRRAKDKRI 101
115 yl15_caeel_corr I
hm07_caeel ANK q7qbz2_anoga
EKAQIDQQYR SVRIK
CORRECT
1
50 q619j1_caebr MFVWSAAVLI
FSSVVPTFAQ YGCISELTFG KACPQNKTST KWFFDAKLSF
yl15_caeel_corr1 MLFFTLLIQL F..LVPVLCQ
YACSSELKFG TACSENKTST KWYYDSKLLF
51
100 q619j1_caebr CYPYQFLGCD EGSNSFESSD
ICLESCKPAD QFSCGGNTDA DGICFSPSDS
yl15_caeel_corr1 CYPYKYLGCG EGSNSFESNE
NCLESCKPAD QFSCGGNTGP DGVCFAHGDQ
101
150 q619j1_caebr GCKKGTDCVM GGNIGFCCNK
ATQDEWNKEH SPTCSKGSVV QFKQWFGMTP
yl15_caeel_corr1 GCKKGTVCVM GGMVGFCCDK
KIQDEWNKEN SPKCLKGQVV QFKQWFGMTP
151 178
q619j1_caebr LIGRNCAHKF CPAGSTCIQG
KWTAHCCQ yl15_caeel_corr1 LIGRSCSHNF CPEKSTCVQG
KWTAYCCQ
33 Another MisPred tool detects errors in gene
prediction based on domain size deviation. The
rationale of this tool is that the highly
cooperative, rapid folding of protein domains is
the result of natural selection; therefore,
insertion or deletion of larger segments into or
from protein domains may yield macromolecules
that are unable to rapidly adopt a correctly
folded, viable and stable three-dimensional
structure. Accordingly, proteins containing
domains that consist of a significantly larger or
smaller number of residues than closely related
members of the same family are suspected of
being unable to fold efficiently into a viable
and stable domain/protein.
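One simple way to express a domain size check is to compare the length of a predicted domain with the typical length in its family. The sketch below uses the family median and a relative tolerance; the 10% threshold, the function name and the lengths are assumptions for illustration, not the cut-off used by MisPred.

```python
from statistics import median


def deviates_in_size(domain_length, family_lengths, max_fraction=0.1) -> bool:
    """True if the domain length differs from the family median by more than
    max_fraction of that median (an assumed, illustrative threshold)."""
    typical = median(family_lengths)
    return abs(domain_length - typical) > max_fraction * typical


# Hypothetical lengths of closely related members of one domain family.
family = [590, 600, 610]
print(deviates_in_size(518, family))  # True: internally deleted domain
print(deviates_in_size(595, family))  # False: within the tolerance
```

The median is used rather than the mean so that a few mispredicted family members do not shift the reference length.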
34 RP11-247A12.5-001 encodes an internally deleted
Carn_acyltransf domain
ABNORMAL: RP11-247A12.5-001, 544 aa
CORRECT: CACP_HUMAN, Carnitine O-acetyltransferase, 626
residues
[Figure label: Region missing from RP11-247A12.5-001]
1
100 cacp_human MLAFAARTVV
KPLGFLKPFS LMKASSRFKA HQDALPRLPV PPLQQSLDHY
LKALQPIVSE EEWAHTKQLV DEFQASGGVG ERLQKGLERR
ARKTENWLSE rp11-247a12 MLAFAARTVV KPLGFLKPFS
LMKASSRFKA HQDALPRLPV PPLQQSLDHY LKALQPIVSE
EEWAHTKQLV DEFQASGGVG ERLQKGLERR ARKTENWLSE
101
200 cacp_human WWLKTAYLQY
RQPVVIYSSP GVMLPKQDFV DLQGQLRFAA KLIEGVLDFK
VMIDNETLPV EYLGGKPLCM NQYYQILSSC RVPGPKQDTV
SNFSKTKKPP rp11-247a12 WWLKTAYLQY RQPVVIYSSP
GVMLPKQDFV DLQGQLRFAA KLIEGVLDFK VMIDNETLPV
EYLGGKPLCM NQYYQILSSC RVPGPKQDTV SNFSKTKKPP
201
300 cacp_human THITVVHNYQ
FFELDVYHSD GTPLTADQIF VQLEKIWNSS LQTNKEPVGI
LTSNHRNSWA KAYNTLIKDK VNRDSVRSIQ KSIFTVCLDA
TMPRVSEDVY rp11-247a12 THITVVHNYQ FFELDVYHSD
GTPLTADQIF VQLEKIWNSS LQTNKEPVGI LTSNHRNSWA
KAYNTLIKDK VNRDSVRSIQ .......... ..........
301
400 cacp_human RSHVAGQMLH
GGGSRLNSGN RWFDKTLQFI VAEDGSCGLV YEHAAAEGFP
IVTLLDYVIE YTKKPELVRS PMVPLPMPKK LRFNITPEIK
SDIEKAKQNL rp11-247a12 .......... ..........
.......... .......... .......... ..........
..KKPELVRS PLVPLPMPKK LRFNITPEIK SDIEKAKQNL
401
500 cacp_human SIMIQDLDIT
VMVFHHFGKD FPKSEKLSPD AFIQMALQLA YYRIYGQACA
TYESASLRMF HLGRTDTIRS ASMDSLTFVK AMDDSSVTEH
QKVELLRKAV rp11-247a12 SIMIQDLDIT VMVFHHFGKD
FPKSEKLSPD AFIQMALQLA YYRIYGQACA TYESASLRMF
HLGRTDTIRS ASMDSLTFVK AMDDSSVTEH QKVELLRKAV
501
600 cacp_human QAHRGYTDRA
IRGEAFDRHL LGLKLQAIED LVSMPDIFMD TSYAIAMHFH
LSTSQVPAKT DCVMFFGPVV PDGYGVCYNP MEAHINFSLS
AYNSCAETNA rp11-247a12 QAHRGYTDRA IRGEAFDRHL
LGLKLQAIED LVSMPDIFMD TSYAIAMHFH LSTSQVPAKT
DCVMFFGPVV PDGYGVCYNP MEAHINFSLS AYNSCAETNA
601 626
cacp_human ARLAHYLEKA LLDMRALLQS
HPRAKL rp11-247a12 ARLAHYLEKA LLDMRALLQS HPRAKL
35 Three-dimensional structure of human carnitine
O-acetyltransferase (1NM8.pdb).
[Figure: structure of human carnitine acetyltransferase, 1NM8.pdb, with His 343 labelled]
The region highlighted in yellow is missing from
transcript RP11-247A12.5-001. This region also
contains the catalytic residue His-343.
36 1
50 epha5_human MRGSGPRGAG HRRPP..SGG
GDTPITPASL AGCYSAPRRA PLWTCLLLCA epha5_rat
MRGSGPRGAG RRRTQGRGGG GDTPRVPASL AGCYSAPLKG
PLWTCLLLCA epha5_chick M...GLRGGG .....GRAGG
......PA.. .......... PGWTCLLLCA epha5_mouse
MRGSGPRGAG HRRTQGRGGG DDTPRVPASL AGCYSAPLKG
PLWTCLLLCA 51
100 epha5_human
ALRTLLASPS NEVNLLDSRT VMGDLGWIAF PKNGWEEIGE
VDENYAPIHT epha5_rat ALRTLLASPS NEVNLLDSRT
VLGDLGWIAF PKNGWEEIGE VDENYAPIHT epha5_chick
ALRSLLASPG SEVNLLDSRT VMGDLGWIAY PKNGWEEIGE
VDENYAPIHT epha5_mouse ALRTLLASPS NEVNLLDSRT
VMGDLGWIAF PKNGWEEIGE VDENYAPIHT
101
150 epha5_human YQVCKVMEQN QNNWLLTSWI
SNEGASRIFI ELKFTLRDCN SLPGGLGTCK epha5_rat
YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN
SLPGGLGTCK epha5_chick YQVCKVMEQN QNNWLLTSWI
SNEGRPASSF ELKFTLRDCN SLPGGLGTCK epha5_mouse
YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN
SLPGGLGTCK 151
200 epha5_human
ETFNMYYFES DDQNGRNIKE NQYIKIDTIA ADESFTELDL
GDRVMKLNTE epha5_rat ETFNMYYFES DDENGRNIKD
NQYIKIDTIA ADESFTELDL GDRVMKLNTE epha5_chick
ETFNMYYFES DDEDGRNIRE NQYIKIDTIA ADESFTELDL
GDRVMKLNTE epha5_mouse ETFNMYYFES DDENGRSIKE
NQYIKIDTIA ADESFTELDL GDRVMKLNTE
201
250 epha5_human VRDVGPLSKK GFYLAFQDVG
ACIALVSVRV YYKKCPSVVR HLAVFPDTIT epha5_rat
VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR
HLAVFPDTIT epha5_chick VRDVGPLTKK GFYLAFQDVG
ACIALVSVRV YYKKCPSVIR NLARFPDTIT epha5_mouse
VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR
HLAIFPDTIT 251
300 epha5_human
GADSSQLLEV SGSCVNHSVT DEPPKMHCSA EGEWLVPIGK
CMCKAGYEEK epha5_rat GADSSQLLEV SGSCVNHSVT
DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK epha5_chick
GADSSQLLEV SGVCVNHSVT DEAPKMHCSA EGEWLVPIGK
CLCKAGYEEK epha5_mouse GADSSQLLEV SGSCVNHSVT
DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK
301
350 epha5_human NGTCQVCRPG FFKASPHIQS
CGKCPPHSYT HEEASTSCVC EKDYFRRESD epha5_rat
NGTCQVCRPG FFKASPHSQT CSKCPPHSYT HEEASTSCVC
EKDYFRRESD epha5_chick NNTCQVCRPG FFKASPHSPS
CSKCPPHSYT LDEASTSCLC EEHYFRRESD epha5_mouse
NGTCQ..... .......... .......... ..........
.......... 351
400 epha5_human
PPTMACTRPP SAPRNAISNV NETSVFLEWI PPADTGGRKD
VSYYIACKKC epha5_rat PPTMACTRPP SAPRNAISNV
NETSVFLEWI PPADTGGGKD VSYYILCKKC epha5_chick
PPTMACTRPP SAPRSAISNV NETSVFLEWI PPADTGGRKD
VSYYIACKKC epha5_mouse .......... ..........
.......... .......... ..........
401
450 epha5_human NSHAGVCEEC GGHVRYLPRQ
SGLKNTSVMM VDLLAHTNYT FEIEAVNGVS epha5_rat
NSHAGVCEEC GGHVRYLPQQ IGLKNTSVMM ADPLAHTNYT
FEIEAVNGVS epha5_chick NSHSGLCEAC GSHVRYLPQQ
TGLKNTSVMM VDLLAHTNYT FEIEAVNGVS epha5_mouse
.......... .......... .......... ..........
.......... 451
500 epha5_human
DLSPGARQYV SVNVTTNQAA PSPVTNVKKG KIAKNSISLS
WQEPDRPNGI epha5_rat DLSPGTRQYV SVNVTTNQAA
PSPVTNVKKG KIAKNSISLS WQEPDRPNGI epha5_chick
DQNPGARQFV SVNVTTNQAA PSPVSSVKKG KITKNSISLS
WQEPDRPNGI epha5_mouse .......... .........A
PSPVTNVKKG KIAKNSISLS WQEPDRPNGI
501
550 epha5_human ILEYEIKHFE KDQETSYTII
KSKETTITAE GLKPASVYVF QIRARTAAGY epha5_rat
ILEYEIKYFE KDQETSYTII KSKETTITAE GLKPASVYVF
QIRARTAAGY epha5_chick ILEYEIKYFE KDQETSYTII
KSKETAITAD GLKPGSAYVF QIRARTAAGY epha5_mouse
ILEYEIKYFE KDQETSYTII KSKETSITAE GLKPASVYVF
QIRARTAAGY 551
600 epha5_human
GVFSRRFEFE TTPV.FAASS DQSQIPVIAV SVTVGVILLA
VVIGVLLSGS epha5_rat GVFSRRFEFE TTPV.FGASN
DQSQIPIIGV SVTVGVILLA VMIGFLLSGS epha5_chick
GGFSRRFEFE TSPV.LAASS DQSQIPIIVV SVTVGVILLA
VVIGFLLSGS epha5_mouse GVFSRRFEFE TTPVSVAASN
DQSQIPIIAV SVTVGVILLA VMIGFLLSGS
601
650 epha5_human CCECGCGRAS SLCAVAHPIL
IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV epha5_rat
CCECGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF
HNGHIKLPGV epha5_chick CCDHGCGWAS SLRAVAYPSL
IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV epha5_mouse
CCDCGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF
HNGHIKLPGV 651
700 epha5_human
RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG
EVCSGRLKLP epha5_rat RTYIDPHTYE DPTQAVHEFG
KEIEASCITI ERVIGAGEFG EVCSGRLKLP
epha5_chick RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGRLKLQ
epha5_mouse RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGCLKLP

            701                                                750
epha5_human GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK
epha5_rat   GKRELPVATK TLKVGYTEKQ RRDFLSEASI MGQFDHPNII HLEGVVTKSK
epha5_chick GKREFPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK
epha5_mouse GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK

            751                                                800
epha5_human PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGISAG MKYLSDMGYV
epha5_rat   PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV
epha5_chick PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIASG MKYLSDMGYV
epha5_mouse PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV

            801                                                850
epha5_human HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_rat   HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_chick HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP
epha5_mouse HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

            851                                                900
epha5_human EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_rat   EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_chick EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL
epha5_mouse EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

            901                                                950
epha5_human PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN
epha5_rat   PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_chick PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN
epha5_mouse PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

            951                                               1000
epha5_human ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV
epha5_rat   ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_chick ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV
epha5_mouse ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

            1001                                     1041
epha5_human AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_rat   AQVTLE.... .......... .......... .......... .
epha5_chick AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_mouse AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V

EPHA5_RAT contains a C-terminally truncated SAM_1
domain, although Swiss-Prot does not annotate the
entry as a fragment. Notably, the orthologs from
mouse, human and chicken contain an intact SAM_1
domain.
EPHA5_RAT ephrin type-a receptor 5 precursor,
1005 residues
EPHA5_HUMAN ephrin type-a receptor 5 precursor,
1037 residues
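
The truncation can be spotted mechanically from the alignment itself. The following is a minimal sketch, not part of MisPred: it takes the rows of the final alignment block (positions 1001-1041) as they appear on the slide, counts gap characters at the C-terminal end of each row, and flags rows whose trailing gap run reaches a threshold. The function names and the threshold of 10 are assumptions chosen for illustration.

```python
# Rows of the final alignment block (positions 1001-1041), copied from the slide.
BLOCK = {
    "epha5_human": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
    "epha5_rat":   "AQVTLE.... .......... .......... .......... .",
    "epha5_chick": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
    "epha5_mouse": "AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V",
}


def trailing_gaps(row: str, gap: str = ".") -> int:
    """Count gap characters at the C-terminal end of one aligned row."""
    residues = row.replace(" ", "")  # drop the 10-residue column spacing
    return len(residues) - len(residues.rstrip(gap))


def flag_c_terminal_truncation(block: dict, min_gap: int = 10) -> list:
    """Return names of rows ending in at least `min_gap` gap characters.

    `min_gap` is a hypothetical threshold, not a published MisPred value.
    """
    return [name for name, row in block.items() if trailing_gaps(row) >= min_gap]


if __name__ == "__main__":
    for name, row in BLOCK.items():
        print(name, trailing_gaps(row))
    print("possibly truncated:", flag_c_terminal_truncation(BLOCK))
```

Only the rat row ends in a long gap run here; the single internal gap in the human and chicken rows is not counted because it is not at the C-terminal end.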
37
                    851                                                900
epha5_rat_corrected EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_rat           EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_human         EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL
epha5_chick         EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL
epha5_mouse         EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

                    901                                                950
epha5_rat_corrected PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_rat           PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN
epha5_human         PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN
epha5_chick         PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN
epha5_mouse         PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

                    951                                               1000
epha5_rat_corrected ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_rat           ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV
epha5_human         ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV
epha5_chick         ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV
epha5_mouse         ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

                    1001                                     1042
epha5_rat_corrected AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP V
epha5_rat           AQVTLE
epha5_human         AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_chick         AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L
epha5_mouse         AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V

38Conclusions from MisPred analyses of various
databases
The number of UniProtKB/Swiss-Prot entries
identified by MisPred as erroneous is very low,
attesting to both the high quality of this
manually curated database and the reliability of
the MisPred approach. In the case of
UniProtKB/TrEMBL, MisPred identified a large
proportion of entries as erroneous, the majority
of which were missing signal peptides or suffered
from domain size deviation. This is due primarily
to the fact that these TrEMBL entries are
translated in silico from non-full-length cDNAs.
In the case of the EnsEMBL- and
NCBI/GNOMON-predicted sequences, MisPred
identified 3-4% of human sequences as erroneous.
Here too, the majority of errors were identified
on the basis of missing signal peptides and
domain size deviation, probably reflecting the
influence of non-full-length or abnormal cDNAs on
gene predictions.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E,
Bányai L, Patthy L. Identification and correction
of abnormal, incomplete and mispredicted proteins
in public databases. BMC Bioinformatics. 2008 Aug
27;9:353.
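
To make the idea of domain size deviation concrete, here is a minimal sketch. It assumes a simple rule: flag a domain whose length lies more than a chosen number of standard deviations from the mean length of its family. The actual MisPred criteria and thresholds are described in Nagy et al. (2008) and may differ. The family lengths, the function name and the `max_sd` threshold below are hypothetical.

```python
from statistics import mean, stdev


def deviates_in_size(domain_len: int, family_lens: list, max_sd: float = 2.0) -> bool:
    """Return True if `domain_len` is an outlier relative to `family_lens`.

    Assumed rule (not taken from MisPred): outlier means more than
    `max_sd` sample standard deviations away from the family mean.
    """
    mu = mean(family_lens)
    sd = stdev(family_lens)
    if sd == 0:
        # All family members have the same length: any difference is a deviation.
        return domain_len != mu
    return abs(domain_len - mu) > max_sd * sd


if __name__ == "__main__":
    # Hypothetical lengths of intact members of one domain family.
    family = [64, 66, 65, 67, 63]
    print(deviates_in_size(66, family))  # a typical length
    print(deviates_in_size(30, family))  # a strongly truncated domain
```

A rule of this kind is only meaningful for domain families whose members have a narrow length distribution; for families with highly variable lengths it would flag little or nothing.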
39Conclusions from MisPred analyses of various
databases
Application of the MisPred tools to GENCODE
peptides revealed that many of the potential
alternative gene products encode proteins that
are likely to be mislocalized and/or misfolded,
suggesting that they do not have a role as
functional proteins.
Tress ML, Martelli PL, Frankish A, Reeves GA,
Wesselink JJ, Yeats C, Olason PI, Albrecht M,
Hegyi H, Giorgetti A, Raimondo D, Lagarde J,
Laskowski RA, López G, Sadowski MI, Watson JD,
Fariselli P, Rossi I, Nagy A, Kai W, Størling Z,
Orsini M, Assenov Y, Blankenburg H, Huthmacher C,
Ramírez F, Schlicker A, Denoeud F, Jones P,
Kerrien S, Orchard S, Antonarakis SE, Reymond A,
Birney E, Brunak S, Casadio R, Guigo R, Harrow J,
Hermjakob H, Jones DT, Lengauer T, Orengo CA,
Patthy L, Thornton JM, Tramontano A, Valencia
A. The implications of alternative splicing in
the ENCODE protein complement. Proc Natl Acad Sci
U S A. 2007 Mar 27;104(13):5495-500. Epub 2007
Mar 19.
41 Although large scale whole genome analyses have
shown that mammalian transcriptomes are made of a
swarming mass of different overlapping
transcripts, little evidence exists that the
majority of this transcript complexity leads to
protein complexity. The average of 5.7 transcripts
per coding locus annotated in GENCODE translates
to only 1.7 proteins per locus (since a large
fraction of transcript variation corresponds to
non-coding transcripts or accumulates in the UTRs
of coding transcripts). Moreover, if the GENCODE
proteins flagged as problematic by the protein
assessment methods, such as MisPred, are ignored,
there are barely 1.3 annotated proteins per
locus. The discrepancy between a complex,
variable and largely unexplored population of RNA
molecules, and a relatively small, stable, and
well defined population of proteins, constitutes
one of the challenges that Molecular Biology
needs to address to fully elucidate cellular
function.
Harrow J, Nagy A, Reymond A, Alioto T, Patthy L,
Antonarakis SE and Guigó R. Identifying
protein-coding genes in genomic sequences. Genome
Biology 2009, 10:201.
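
The per-locus averages quoted above can be turned into two simple ratios. This is only an arithmetic illustration of the numbers on the slide; the variable names are chosen here and are not GENCODE terminology.

```python
# Averages per coding locus quoted in the text (GENCODE).
transcripts_per_locus = 5.7
proteins_per_locus = 1.7
proteins_after_filtering = 1.3  # after ignoring proteins flagged as problematic

# Distinct annotated proteins per annotated transcript.
coding_fraction = proteins_per_locus / transcripts_per_locus
# Fraction of annotated proteins that remain after the quality filter.
retained_fraction = proteins_after_filtering / proteins_per_locus

print(f"distinct proteins per transcript: {coding_fraction:.2f}")
print(f"proteins retained after filtering: {retained_fraction:.2f}")
```

With these numbers, roughly 0.30 distinct proteins are annotated per transcript, and about three quarters of the annotated proteins remain once the flagged ones are ignored.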
42 László Bányai, Krisztina Farkas, Hédi Hegyi,
Evelin Kozma, Alinda Nagy, Hedvig Tordai
This work was carried out as part of the
BioSapiens project. The BioSapiens project is
funded by the European Commission within its FP6
Programme, under the thematic area "Life sciences,
genomics and biotechnology for health", contract
number LSHG-CT-2003-503265. The authors
acknowledge the partial support of the National
Office for Research and Technology under grant no.
eScience RET14/2005.