Title: Facts and Artefacts:
1. Facts and Artefacts: Database Anomalies Revealed by the Analysis of Rat Ly6 Proteins
Dr Christopher Southan, Oxford GlycoSciences (UK) Ltd
Harwell Seminar, November 2002
2. Introduction: Quirks that Lurk in Databases
- The sequence deluge into the primary databases necessitates automated pipelines to produce 'value added' secondary databases
- But, however sophisticated the data parsing or curation, anomalies will get through
- Most things that could have gone wrong, have
- Although the overall quirk frequency is low, they present pitfalls for the unwary
- Responsibility for primary annotation and sequence quality lies solely with submitting authors
- Few originating authors correct, update or withdraw their primary sequence entries
- It is difficult to discriminate between in vitro artefacts and rare in vivo events
3. Outline
- Proteomic analysis of rat urine led to the identification of 2 novel secreted proteins in EST data
- Further searching expanded these findings to a large family of rat and mouse proteins, and vertebrate homologues of short Ly-6 proteins with unknown biology
- Bioinformatic analysis of database matches exposed a swathe of anomalies, including
  - chimeric and pre-mRNAs
  - sequence errors
  - naming ambiguities
  - equivocal functional data
- The OGS Protein Atlas of the human genome includes peptide data from one short Ly6 homologue, Lynx1
- Combining proteomic data with sequence analysis delineated the Lynx1 gene product and inferred biochemical properties of the protein
4. Rat Urine → 2D gel → Trypsin → MS/MS → PepSea Search → EST hits
- Spot area 1 gave two different peptide matches
  - CTSFDSTGFCHVGR contained within rat EST AA893514
  - CESLDSTGLCR contained within the nucleotide sequence of EST AA800439
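The last step of this workflow, locating an MS/MS peptide inside a nucleotide EST, can be illustrated with a six-frame translation. This is a minimal sketch using the standard genetic code, not a description of how PepSea works; the function names (`translate`, `six_frames`, `find_peptide`) are hypothetical and introduced only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def find_peptide(peptide, dna):
    """List the frames whose translation contains the peptide exactly."""
    return [frame for frame, protein in sorted(six_frames(dna).items())
            if peptide in protein]
```

Exact substring matching is the simplest possible criterion. A real search would also have to tolerate the sequencing errors and frameshifts that are common in single-pass EST reads.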
5. Rat Urine → HPLC → MALDI → N-Terminal Sequence
6. EST AA893514 vs dbEST: 30 Rat Hits at 95 to 100% Identity
7. Assembly of Rat Urinary Proteins 1 and 2
- 9 EST sequences, the MS/MS sequences, and the N-terminal data were all consistent with two paralogous proteins
- 90% identical at the AA level and 96% identical at the DNA level
- One N-glycosylation site
- Secreted forms abundant in male rat urine by HPLC
- Highly represented in rat liver ESTs
8. RUP-3 Confirmed by Data from Wait et al., Electrophoresis 22, 3043-3052 (2001)
RUP1  MGKPILLLPLGLSLLMSSLLALQCFRCESLDSTGLCRVGRRICQTYPDEICAWVVVTTRD
RUP2  MGKHILLLPLGLSLLMSSLLALQCFRCTSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRD
RUP3  MGKHILLLPLGLSLLMSSLLALQCFRCISFDSTGFCYVGRHICQTYPDEICAWVVVTTRD

RUP1  GKFVYGNQSCAECIGTTVEHGSLIISTNCCSATPFCNMVHP  EST AA800439
RUP2  GKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR  EST AA893514
RUP3  GKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR  EST AA893518
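The amino-acid identity quoted on slide 7 can be checked directly from the RUP1 and RUP2 sequences on slide 8. The sketch below is a simple position-by-position count over the two equal-length sequences as transcribed from the slide, signal peptide included. The function name `count_identical` is hypothetical, and no gaps are handled.

```python
# RUP1 and RUP2 as shown on slide 8 (both blocks joined, 101 residues each).
RUP1 = ("MGKPILLLPLGLSLLMSSLLALQCFRCESLDSTGLCRVGRRICQTYPDEICAWVVVTTRD"
        "GKFVYGNQSCAECIGTTVEHGSLIISTNCCSATPFCNMVHP")
RUP2 = ("MGKHILLLPLGLSLLMSSLLALQCFRCTSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRD"
        "GKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR")


def count_identical(a, b):
    """Return (identical positions, alignment length) for an ungapped pair."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, len(a)


matches, length = count_identical(RUP1, RUP2)
print(f"{matches}/{length} identical = {100 * matches / length:.1f}%")
```

On these transcribed sequences the count is 90 identical positions out of 101, about 89%, in line with the roughly 90% figure on slide 7.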
9. TIGR: One Assembly but at Least Two Gene Products
10. Solid Matches between RUP2 and Three Unrelated mRNAs
- Rat mitochondrial IF1 protein mRNA, L07806, 883 bp
- Rat casein kinase II alpha subunit (CK2), L15618, 2180 bp
- Rat mitochondrial succinyl-CoA synthetase alpha subunit, J03621, 1684 bp
- Matches of 92 to 100% identity over 300-500 bases
- Two in reverse frame, one in forward frame
11. Searches Against Rat ESTs Confirmed the Three mRNAs as Chimeras
L07806
L15618
J03621
12. Translation Matches for the Chimeras Reveal a Cryptic Ly-6 Protein, Verified by a Rat Genomic Hit

L07806 Rattus rattus mitochondrial IF1 protein mRNA
RUP-2    28 TSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR 101
            TSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR
L07806  417 TSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR 196

L15618 Rat casein kinase II alpha subunit (CK2) mRNA
RUP-2    59 RDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR 101
            RDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR
L15618  708 RDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMVHR 580

J03621 Rat mitochondrial succinyl-CoA synthetase alpha subunit
RUP-2    24 CFRCTSFDSTGFCHVGRQKCQTYPDEICAWVVVTTRDGKFVYGNQSCAECNATTVEHGSLIVSTNCCSATPFCNMV 99
            CF C    S G C      C   P E CA  V T  DGKFVYGNQSCAEC   TVEHGSLIVSTNCCSAT FCN V
J03621   50 CFECGNLNSMGICNFRTAVCYAHPGEVCA-SVLTYKDGKFVYGNQSCAECSGRTVEHGSLIVSTNCCSATSFCNIV 274
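The coordinates in these translated alignments carry two pieces of information: the direction of the mRNA coordinates gives the strand of the match, and the nucleotide span should equal three times the number of aligned subject residues. The sketch below works through that arithmetic for the three hits, using the start and end positions shown on this slide. The function `describe_hit` and the `HITS` table are hypothetical names, and the single subject gap for J03621 is taken from the "-" in its alignment.

```python
def describe_hit(q_start, q_end, s_start, s_end, query_gaps=0, subject_gaps=0):
    """Return (strand, span_is_consistent) for a protein-vs-translated-DNA hit."""
    columns = (q_end - q_start + 1) + query_gaps   # alignment columns
    subject_residues = columns - subject_gaps      # residues encoded by the mRNA
    nt_span = abs(s_end - s_start) + 1             # nucleotides covered
    strand = "forward" if s_end >= s_start else "reverse"
    return strand, nt_span == 3 * subject_residues


# (RUP-2 start, RUP-2 end, mRNA start, mRNA end, query gaps, subject gaps)
HITS = {
    "L07806": (28, 101, 417, 196, 0, 0),
    "L15618": (59, 101, 708, 580, 0, 0),
    "J03621": (24, 99, 50, 274, 0, 1),
}

for accession, hit in HITS.items():
    strand, consistent = describe_hit(*hit)
    print(accession, strand, "span consistent" if consistent else "span mismatch")
```

With these coordinates, L07806 and L15618 run in the reverse direction and J03621 runs forward, which agrees with the "two in reverse frame, one in forward frame" summary on slide 10.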
13. Comparing the J03621 Chimera: EST Matches against Rat Genome HTG
J0362 mRNA vs rat ESTs
J0362 mRNA vs rat genomic
14. What Caused the Chimeras?
- Each of the chimeric cDNAs was submitted by a different research group, 1988-1993
- All prepared from rat cDNA libraries
- Two of these genes encode nuclear-encoded mitochondrial proteins
- L07806 has 2 non-chimeric counterparts
- The 3 'host' transcripts are at different loci in humans (no rat map data yet)
- The 5' insertions differ in sequence, length and orientation
- Only L15618 has a single-exon insert
- Hits to unfinished rat genome data confirm the chimeras
- Is the insertion of RUP2-like genes an in vitro artefact or a rare event in vivo?
15. mRNA Anomaly No. 4: Unspliced?
- LOCUS AF368860 1197 bp mRNA 13-JUN-2001
- CDS 10..96 "MGKHILLLPLVLSLLMSSLQDSCGHEPS"
TrEMBL Q91XP0 - Rattus norvegicus 3' non-translated beta-F1-ATPase mRNA-binding protein mRNA, complete cds.
"Identification of a liver specific cDNA clone chaperoning the differential assembly of ribonucleoprotein complexes at the 3' UTR of the mRNAs of oxidative phosphorylation"
BLAST vs Rat ESTs
16. The L07806 Chimera Caused Major Errors in the UniGene Cluster for Rn.1658
Atpi ATPase inhibitor (rat mitochondrial IF1 protein)
17. Sequence Conflict Arising from the L07806-Derived Chimeric ORF is Flagged by SwissProt
- But the L07806-derived protein, without the targeting sequence, was expressed as a maltose-binding protein fusion in E. coli and was fully active!
18. RUP Homologues Expand a New Family of Secreted Ly-6 Proteins but not (yet?) Recognised by InterPro
19. Confusion Over Caltrin: 5 Different Sequences in SwissProt, 22 PubMed Citations
- Caltrin: inhibition of Ca2+ uptake into spermatozoa
- CALTRIN PRECURSOR (CALCIUM TRANSPORT INHIBITOR). - Mus musculus (a Ly-6 protein)
- CALTRIN PRECURSOR (CALCIUM TRANSPORT INHIBITOR) (SEMINALPLASMIN) (SPLN). - Bos taurus (PYY-like)
- CALTRIN-LIKE PROTEIN I. - Cavia porcellus (weak protease inhibitor match)
- CALTRIN-LIKE PROTEIN II. - Cavia porcellus (elastase inhibitor like)
- PANCREATIC SECRETORY TRYPSIN INHIBITOR II PRECURSOR (PSTI-II) (CALTRIN) (CALCIUM TRANSPORT INHIBITOR). - Rattus norvegicus (trypsin inhibitor identity)
20. Mouse Ly-6-like Caltrin: Sequence Errors, Unverified Reported Function, New Name and New Function?
21. ARS component B, Antineoplastic Urinary Protein and Secreted Mammalian Ly-6/uPAR Related Protein: Updates for SwissProt P55000
22. Linking Sequence to Function: the Lost Keyword Problem (PubMed Queries in red)
- Adermann et al. "Structural and phylogenetic characterisation of human SLURP-1, the first secreted mammalian member of the Ly-6/uPAR protein superfamily" Protein Sci. 1999. From blood and urine peptide libraries. "SLURP-1 is encoded by the ARS (component B)-81/s locus, and appears to be the first mammalian member of the Ly-6/uPAR family lacking a GPI-anchoring signal sequence ..." SLURP-1 (+) Ly-6 (+) ANUP (-)
- Katz et al. "A partial catalog of proteins secreted by epidermal keratinocytes in culture." J Invest Dermatol. 1999. "Proteins secreted by adult human epidermal keratinocytes included anti-neoplastic urinary protein" (+) ANUP (-) SLURP-1 (-) Ly-6 (-)
- Fischer et al. "Mutations in the gene encoding SLURP-1 in Mal de Meleda". Hum Mol Genet. 2001. "Three different homozygous mutations (a deletion, a nonsense and a splice site mutation) were detected in 19 families of Algerian and Croatian origin ... first instance of a secreted protein being involved in a palmoplantar keratoderma ..." SLURP-1 (+) Ly-6 (+) ANUP (-)
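The lost keyword problem can be made concrete by checking which of a protein's alternative names actually occur in a given piece of text, and by building a single OR query from all of them. This is a minimal sketch: the helper names (`keyword_hits`, `or_query`) and the `SYNONYMS` list are hypothetical, matching is plain case-insensitive substring search, and the two test strings are the title and abstract fragment quoted on this slide.

```python
# Alternative names for the same gene product, taken from slides 21 and 22.
SYNONYMS = ["SLURP-1", "Ly-6", "ANUP", "anti-neoplastic urinary protein"]

ADERMANN_TITLE = ("Structural and phylogenetic characterisation of human "
                  "SLURP-1, the first secreted mammalian member of the "
                  "Ly-6/uPAR protein superfamily")
KATZ_FRAGMENT = ("proteins secreted by adult human epidermal keratinocytes "
                 "included anti-neoplastic urinary protein")


def keyword_hits(text, names):
    """Map each name to True if it occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return {name: name.lower() in lowered for name in names}


def or_query(names):
    """Join quoted names into one boolean OR query string."""
    return " OR ".join(f'"{name}"' for name in names)


for label, text in [("Adermann", ADERMANN_TITLE), ("Katz", KATZ_FRAGMENT)]:
    found = [name for name, hit in keyword_hits(text, SYNONYMS).items() if hit]
    print(label, "matches:", found)
print(or_query(SYNONYMS))
```

A search on any single name retrieves only one of these two texts, whereas the combined query would retrieve both. That is the practical argument for broad, synonym-aware keyword searching.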
23. First RUP Orthologue in Mouse: Ensembl, Chr 9
24. NCBI Genomic Pipeline also Predicts New Orthologues in Mouse
25. Not all Pipelines are Equal: Matching NCBI XP_135421 to the UCSC Mouse Genome
26. Unknown Biology for the Short Ly-6 Proteins
- Single domain proteins, 85-100 residues, mostly with signal peptide
- Probable ligands by inference from toxin structures?
- Recently duplicated rat paralogous family of up to 10 gene loci
- Liver and spleen expression in rat
- SLURP linked to skin physiology?
- Caltrin/SVS VII: phospholipid binding?
- Foetal expression for probable pig and bovine orthologues
- Fast-evolving orthologues in mouse but only homologues in human?
- Distant homologues involved in myelopoiesis in Xenopus and liver acute phase in rainbow trout
- Distant homologues in C. elegans
27. Proteins in SwissProt/TrEMBL Submitted or Updated from this Work
- Submitted
  - RSP1_RAT (Q9QXN2) Spleen protein 1
  - UP1_RAT (P81827) Urinary protein 1 (RUP-1)
  - UP2_RAT (P81828) Urinary protein 2 (RUP-2)
  - P8312 Urinary protein 3 (RUP-3) Rat
  - P83106 PIP1 protein (PIP1) - Sus scrofa
  - P83107 BOP1 protein (BOP1) - Bos taurus
  - Q9BZG9 Ly-6 neurotoxin-like protein Lynx1 - Homo sapiens
- Updated/corrected links
  - SLUR_MOUSE (Q9Z0K7) Secreted Ly-6/uPAR protein
  - SLUR_HUMAN (P55000) Secreted Ly-6/uPAR protein
  - CALT_MOUSE (Q09098) Caltrin (now renamed Seminal Vesicle Protein 7)
28. The Pitfall List (1)
- TIGR EST assembly has merged paralogues by shared identity
- The chimeric and pre-mRNAs lead to
  - Artefactual clustering of ESTs and non-homologous gene products in UniGene
  - Protein database conflicts and artefacts
  - Annotation errors in LocusLink and dbEST
  - RUP gene products missed in LocusLink/UniGene
  - Chimeric mRNA picked as RefSeq and therefore transitively propagated to RATMAP
  - Translation of cryptic novel protein not captured
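The first pitfall, paralogues merged by shared identity, follows from how threshold-based clustering behaves. The sketch below uses generic single-linkage clustering with a union-find structure; it is not TIGR's actual assembly algorithm. The sequence labels and the two thresholds are hypothetical, while the 96% DNA identity between the paralogues is the figure from slide 7.

```python
def single_linkage(names, identities, threshold):
    """Cluster names, joining any pair whose identity meets the threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical labels for ESTs derived from the two paralogous genes.
NAMES = ["RUP1_EST", "RUP2_EST"]
IDENTITIES = {("RUP1_EST", "RUP2_EST"): 0.96}   # 96% at the DNA level (slide 7)

print(single_linkage(NAMES, IDENTITIES, threshold=0.95))  # one merged cluster
print(single_linkage(NAMES, IDENTITIES, threshold=0.97))  # two separate clusters
```

Any threshold at or below the paralogue identity collapses the two gene products into one assembly, so a cut-off chosen to tolerate EST sequencing errors can easily be too permissive for recently duplicated genes.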
29. The Pitfall List (2)
- Loose ends and sequence errors in old data captured by SwissProt but unresolved by authors
- Equivocal functional annotation transitively perpetuated
- Sequence-literature links broken by gene name ambiguities
- Incorrect signal peptide annotation
- Similarity scores for Ly-6 homologues fall below those used in domain databases
- Big time lag for sequences without mRNA appearing in NCBI system
30. Conclusions
- Finding quirks in database entries is definitely part of the fun of bioinformatics, but
- Sequence anomalies can seriously confound automated and manual annotation
- They can only be unravelled, or at least exposed, by
  - transitive and broad sequence/keyword searching
  - detailed examination of sequence and literature links
  - understanding the technology of sequencing from different sources and database building procedures
- Conflicting data links should ideally be resolved by new data but may have to be resolved by judgment
- Inferring biological meaning from database search results requires an understanding of the experiments and the in-silico analyses underpinning the annotations
31. Acknowledgements
Southan, C., Cutler, P., Birrell, H., Connelly, J., Fantom, K.G.M., Sims, M., Shaikh, N., and Schneider, K. "The characterisation of novel secreted Ly-6 proteins from rat urine by the combined use of 2-dimensional gel electrophoresis, microbore HPLC and expressed sequence tag data" Proteomics (2002) 2, 187-196