Title: Protein Sequence Databases for Proteomics
1Protein Sequence Databases for Proteomics
- Nathan Edwards
- Center for Bioinformatics and Computational
Biology
- University of Maryland, College Park
2Protein Sequence Databases
- Link between mass spectra and proteins
- A protein's amino-acid sequence provides a basis
for interpreting
- Enzymatic digestion
- Separation protocols
- Fragmentation
- Peptide ion masses
- We must interpret database information as
carefully as mass spectra.
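To make the link between sequence and spectra concrete, here is a minimal sketch of an in-silico tryptic digest followed by a peptide mass calculation. The function names are illustrative, the cleavage rule is the common simplification (cut after K or R unless the next residue is P, no missed cleavages), and the masses are standard monoisotopic residue masses for unmodified peptides. It is not the method of any particular search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide (free N- and C-termini)


def tryptic_digest(sequence):
    """Split a protein after K or R, except when the next residue is P."""
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if aa in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Toy sequence, made up for illustration only.
for pep in tryptic_digest("AAKRPGGRSS"):
    print(pep, round(peptide_mass(pep), 4))
```

A wrong or missing residue in the database sequence changes both the peptide list and the masses, which is why the database deserves the same scrutiny as the spectra.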
3More than sequence
- Protein sequence databases provide much more than
sequence
- Names
- Descriptions
- Facts
- Predictions
- Links to other information sources
- Protein databases provide a link to the current
state of our understanding about a protein.
4Much more than sequence
- Names
- Accession, Name, Description
- Biological Source
- Organism, Source, Taxonomy
- Literature
- Function
- Biological process, molecular function, cellular
component
- Known and predicted
- Features
- Polymorphism, Isoforms, PTMs, Domains
- Derived Data
- Molecular weight, pI
5Database types
6SwissProt
- From ExPASy
- Expert Protein Analysis System
- Swiss Institute of Bioinformatics
- 180,000 protein sequence entries
- 9,000 species represented
- 12,000 Human proteins
- Highly curated
- Minimal redundancy
- Some restrictions on commercial use
7PIR
- Protein Information Resource
- Georgetown University Medical Center
- 280,000 protein sequence entries
- Highly curated
- Public domain resource
- 10,500 Human proteins
- Grew out of the Atlas of Protein Sequence and
Structure (1965-1978) edited by Margaret Dayhoff.
8TrEMBL
- Translated EMBL nucleotide sequences
- European Molecular Biology Laboratory
- European Bioinformatics Institute (EBI)
- Computer annotated
- Only sequences absent from SwissProt
- 165,000 protein sequence entries
- 88,000 species
- 52,000 Human proteins
9RefSeq
- Reference Sequence
- From NCBI (National Center for Biotechnology
Information), NLM, NIH
- Integrated genomic, transcript, and protein
sequences.
- Varying levels of curation
- Reviewed, Validated, ..., Predicted, ...
- 1,350,000 protein sequence entries
- 44,000 reviewed
- 28,000 Human proteins
10RefSeq
- Particular focus on major research organisms
- Tightly integrated with genome projects.
- Curated entries: NP accessions
- Predicted entries: XP accessions
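The NP / XP distinction can be checked mechanically. The sketch below assumes only what the slide states (NP for curated, XP for predicted protein entries); the function name is hypothetical, and other RefSeq prefixes are deliberately left unclassified.

```python
# Prefix meanings as given on the slide; other prefixes are not handled here.
REFSEQ_PROTEIN_PREFIXES = {"NP": "curated", "XP": "predicted"}


def refseq_protein_status(accession):
    """Return 'curated', 'predicted', or None for anything else."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not an NP_/XP_ style accession
    return REFSEQ_PROTEIN_PREFIXES.get(prefix.upper())


print(refseq_protein_status("NP_060298.2"))
```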
11UniProt
- Universal Protein Resource
- Combination of
- Swiss-Prot
- TrEMBL
- PIR
- Knowledgebase is highly curated
- Similar sequence clusters are available
- 50%, 90%, 100% sequence similarity
12IPI
- International Protein Index
- From EBI
- For a specific species, combines
- UniProt, RefSeq, Ensembl
- Species specific databases
- 48,000 protein sequence entries
- Human, mouse, rat, zebrafish, Arabidopsis
13NCBI's nr
- non-redundant
- Contains
- GenBank CDS translations
- RefSeq Proteins
- Protein Data Bank (PDB)
- SwissProt, TrEMBL, PIR
- Others
- Similar sequences suppressed
- 100% sequence similarity
- 1,800,000 protein sequence entries
- 33,000 species
14MSDB
- From the Imperial College (London)
- Combines
- PIR, TrEMBL, GenBank, SwissProt
- Distributed with Mascot
- so well integrated with Mascot
15Others
- HPRD
- Manually curated integration of literature
- PDB
- Focus on protein structure
- dbEST
- Part of GenBank - EST sequences
- Genome Sequences
16Human Sequences
- Number of Human Genes is believed to be between
20,000 and 25,000
17DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
18Genome Browsers
- Link genomic, transcript, and protein sequence in
a graphical manner
- Genes, ESTs, SNPs, cross-species, etc.
- UC Santa Cruz
- http://genome.ucsc.edu
- Ensembl
- http://www.ensembl.org
- NCBI Map View
- http://www.ncbi.nlm.nih.gov/mapview
19UCSC Genome Browser
- Shows many sources of protein sequence evidence
in a unified display
- Can use EST accession as a location!
20Accessions
- Permanent labels
- Short, machine readable
- Enable precise communication
- Typos render them unusable!
- Each database uses a different format
- Swiss-Prot: P17947
- Ensembl: ENSG00000066336
- PIR: S60367
- GO: GO:0003700
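Because each database uses its own format, an accession can often be recognized, and a typo caught, by pattern alone. The sketch below is an assumption-laden illustration: the Swiss-Prot pattern follows the published UniProt accession format, the Ensembl pattern covers only human gene identifiers (ENSG plus 11 digits), and the GO pattern assumes the usual "GO:" plus 7 digits. PIR is omitted because its format is not described here.

```python
import re

# Patterns are simplified assumptions; see the lead-in for their scope.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the name of the first database whose pattern fully matches."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(text):
            return database
    return None  # unknown format, or a typo


for accession in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
    print(accession, classify_accession(accession))
```

A pattern check only confirms that the string is well formed; it cannot tell whether the accession actually exists in the database.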
21Names / IDs
- Compact mnemonic labels
- Not guaranteed permanent
- Require careful curation
- Conceptual objects
- Swiss-Prot names changed recently!
- ALBU_HUMAN
- Serum Albumin
- RT30_HUMAN
- Mitochondrial 28S ribosomal protein S30
- CP3A7_HUMAN
- Cytochrome P450 3A7
22Description / Name
- Free text description
- Human readable
- Space limited
- Hard for computers to interpret!
- No standard nomenclature or format
- Often abused.
- COX7R_HUMAN
- Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial [Precursor]
23FASTA Format
24FASTA Format
- >
- Accession number
- No uniform format
- Multiple accessions separated by |
- One line of description
- Usually pretty cryptic
- Organism of sequence?
- No uniform format
- Official Latin name not necessarily used
- Amino-acid sequence in single-letter code
- Usually spread over multiple lines.
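The points above translate into a very small reader. This is a minimal sketch, not a replacement for an established parser: it assumes each record starts with ">" and that the first whitespace-delimited token of the header is the accession, which the slide warns is not guaranteed. The identifiers in the example are made up.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into (accession, description) at the first space."""
    accession, _, description = header.partition(" ")
    return accession, description


# Made-up records for illustration.
EXAMPLE = """>ID1 first example protein
MAVMAP
RTLLLL
>ID2 second example protein
SSQPTI
"""

for header, sequence in parse_fasta(EXAMPLE.splitlines()):
    print(split_header(header), len(sequence))
```

Anything beyond the accession, including the organism, has to be dug out of the free-text description with database-specific rules.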
25Organism / Species / Taxonomy
- The protein's organism
- or the source of the biological sample
- The most reliable sequence annotation available
- Useful only to the extent that it is correct
- NCBI's taxonomy is widely used
- Provides a standard of sorts; hierarchical
- Other databases don't necessarily keep up
- Organism specific sequence databases starting to
become available.
26Organism / Species / Taxonomy
- Buffalo rat
- Gunn rats
- Norway rat
- Rattus PC12 clone IS
- Rattus norvegicus
- Rattus norvegicus8
- Rattus norwegicus
- Rattus rattiscus
- Rattus sp.
- Rattus sp. strain Wistar
- Sprague-Dawley rat
- Wistar rats
- brown rat
- laboratory rat
- rat
- rats
- zitter rats
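One way to cope with a list like this is to map free-text organism labels onto a single taxonomy identifier. The sketch below is illustrative only: the synonym set is a hand-picked subset of the labels above that a curator might reasonably assign to Rattus norvegicus (NCBI taxonomy ID 10116), and the function name is hypothetical. Labels outside the set are returned as unresolved rather than guessed.

```python
RATTUS_NORVEGICUS = ("Rattus norvegicus", 10116)  # name, NCBI taxonomy ID

# Assumed, hand-curated synonyms (lower case, single spaces).
SYNONYMS = {
    "rattus norvegicus",
    "rattus norwegicus",   # misspelling
    "rattus norvegicus8",  # stray character
    "norway rat",
    "brown rat",
}


def normalize_organism(label):
    """Return (name, taxid) for a recognized label, else None."""
    key = " ".join(label.lower().split())  # fold case and whitespace
    if key in SYNONYMS:
        return RATTUS_NORVEGICUS
    return None  # leave for manual review


for label in ["Norway rat", "Rattus norwegicus", "Rattus sp."]:
    print(label, "->", normalize_organism(label))
```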
27Controlled Vocabulary
- Middle ground between computers and people
- Provides precision for concepts
- Searching, sorting, browsing
- Concept relationships
- Vocabulary / Ontology must be established
- Human curation
- Link between concept and object
- Manually curated
- Automatic / Predicted
28Controlled Vocabulary
29Controlled Vocabulary
30Controlled Vocabulary
31Controlled Vocabulary
32Controlled Vocabulary
33Ontology Structure
- NCBI Taxonomy
- Tree
- Gene Ontology (GO)
- Molecular function
- Biological process
- Cellular component
- Directed, Acyclic Graph (DAG)
- Unstructured labels
- Overlapping?
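The practical difference between a tree and a DAG is that a DAG term can have several parents, so "all ancestors" must be collected by a graph walk rather than by following a single chain. The sketch below uses a made-up four-term graph (the term names are placeholders, not real GO terms).

```python
# Toy DAG: each term maps to its list of parents. "term_d" has two parents,
# which a tree would not allow.
PARENTS = {
    "term_a": [],
    "term_b": ["term_a"],
    "term_c": ["term_a"],
    "term_d": ["term_b", "term_c"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


print(sorted(ancestors("term_d", PARENTS)))
```

A protein annotated with a specific term is implicitly annotated with all of that term's ancestors, so searches by a general concept need this kind of traversal.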
34Ontology Structure
35Protein Families
- Similar sequence implies similar function
- Similar structure implies similar function
- Common domains imply similar function
- Bootstrap up from small sets of proteins with
well understood characteristics
- Usually a hybrid manual / automatic approach
36Protein Families
37Protein Families
38Protein Families
- PROSITE, Pfam, InterPro, PRINTS
- Gene Ontology
- Swiss-Prot keywords
- Differences
- Motif style, ontology structure, degree of manual
curation
- Similarities
- Primarily sequence based, cross species
39Sequence Variants
- Protein sequence can vary due to
- Polymorphism
- Alternative splicing
- Post-translational modification
- Sequence databases typically do not capture all
versions of a protein's sequence
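To see why databases rarely store every version, consider how quickly the combinations grow. The sketch below enumerates all sequences obtainable from a set of independent single-residue substitutions. It is a simplified illustration with hypothetical names: it ignores splicing and modifications, and it is not a description of how any existing tool works.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """List all sequences from applying any subset of substitutions.

    substitutions maps a 0-based position to an alternative residue.
    """
    positions = sorted(substitutions)
    variants = []
    for choice in product([False, True], repeat=len(positions)):
        residues = list(sequence)
        for position, use_alternative in zip(positions, choice):
            if use_alternative:
                residues[position] = substitutions[position]
        variants.append("".join(residues))
    return variants


# A short fragment with one E -> K substitution at position 3.
print(enumerate_variants("GRGEPRF", {3: "K"}))
```

With n independent substitution sites the list has 2^n entries, so full enumeration is only practical for a handful of sites per protein.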
40Sequence Variants
- Swiss-Prot variants, isoforms and conflicts are
retained as features
- Script varsplic.pl can enumerate all sequence
variants
- Command-line options for full enumeration
- -which full -varsplic -variant -conflict
41Swiss-Prot Variant Annotations
42Swiss-Prot Variant Annotations
43Swiss-Prot Variant Annotations
Feature viewer
Variants
44Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF
45Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ
46Omnibus Database Redundancy Elimination
- Source databases often contain the same sequences
with different descriptions
- Omnibus databases keep one copy of the sequence,
and
- An arbitrary description, or
- All descriptions, or
- Particular description, based on source
preference
- Good definitions can be lost, including taxonomy
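The three policies can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used by any particular omnibus database: headers are assumed to follow the NCBI "gi|number|source|accession|" layout, the preference order of source tags is invented for the example, and the shared sequence is a placeholder.

```python
from collections import defaultdict

# Assumed preference order of source tags; a real database may differ.
SOURCE_PREFERENCE = ["sp", "ref", "pir", "gb", "emb", "dbj", "prf", "pdb"]


def source_tag(header):
    """Third pipe-delimited field of a gi-style header, or '' if absent."""
    fields = header.split("|")
    return fields[2] if len(fields) > 2 else ""


def rank(header):
    tag = source_tag(header)
    if tag in SOURCE_PREFERENCE:
        return SOURCE_PREFERENCE.index(tag)
    return len(SOURCE_PREFERENCE)  # unknown sources sort last


def merge_redundant(entries, keep="preferred"):
    """Collapse identical sequences; keep one preferred or all headers."""
    by_sequence = defaultdict(list)
    for header, sequence in entries:
        by_sequence[sequence].append(header)
    if keep == "all":
        return dict(by_sequence)
    return {seq: [min(headers, key=rank)] for seq, headers in by_sequence.items()}


SEQ = "ACDEFGHIK"  # placeholder, not the real protein sequence
ENTRIES = [
    ("gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]", SEQ),
    ("gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]", SEQ),
    ("gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4", SEQ),
]

print(merge_redundant(ENTRIES)[SEQ])
```

In this example the preferred header is the one without an organism in brackets, which shows how taxonomy can be lost when only one description survives.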
47Description Elimination
- gi|12053249|emb|CAB66806.1| hypothetical protein
[Homo sapiens]
- gi|46255828|gb|AAH68998.1| COMMD4 protein
[Homo sapiens]
- gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
- gi|21361661|ref|NP_060298.2| COMM domain
containing 4 [Homo sapiens]
- gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain
containing protein 4
- gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
48Description Elimination
- gi|2947219|gb|AAC39645.1| UDP-galactose 4'
epimerase [Homo sapiens]
- gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase
[Homo sapiens]
- gi|14277913|pdb|1HZJ|B Chain B, Human
Udp-Galactose 4-Epimerase: Accommodation Of
Udp-N-Acetylglucosamine Within The Active Site
- gi|14277912|pdb|1HZJ|A Chain A, Human
Udp-Galactose 4-Epimerase: Accommodation Of
Udp-N-Acetylglucosamine Within The Active Site
- gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose
4-epimerase (Galactowaldenase) (UDP-galactose
4-epimerase)
- gi|1585500|prf||2201313A UDP galactose
4'-epimerase
49Description Elimination
- gi|4261710|gb|AAD14010.1| chlordecone reductase
[Homo sapiens]
- gi|2117443|pir||A57407 chlordecone reductase (EC
1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase
(EC 1.1.1.-) I validated human
- gi|1839264|gb|AAB47003.1| HAKRa
product/3 alpha-hydroxysteroid dehydrogenase
homolog [human, liver, Peptide, 323 aa]
- gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto
reductase family 1 member C4 (Chlordecone
reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase)
(3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4)
(HAKRA)
- gi|7328948|dbj|BAA92885.1| dihydrodiol
dehydrogenase 4 [Homo sapiens]
- gi|7328971|dbj|BAA92893.1| dihydrodiol
dehydrogenase 4 [Homo sapiens]
50Summary
- Protein sequence databases should be interpreted
with as much care as mass spectra
- Protein sequences come from genes
- Use controlled vocabularies
- Understand the structure of ontologies
- Take advantage of computational predictions
- Look for sequence variants
- Be careful with omnibus databases