Protein Sequence Databases for Proteomics - PowerPoint PPT Presentation

About This Presentation

Title:

Protein Sequence Databases for Proteomics

Slides: 51

Provided by: nathanjoh

Transcript and Presenter's Notes

Title: Protein Sequence Databases for Proteomics

1
Protein Sequence Databases for Proteomics

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
Protein Sequence Databases

Link between mass spectra and proteins
A protein's amino-acid sequence provides a basis
for interpreting
Enzymatic digestion
Separation protocols
Fragmentation
Peptide ion masses
We must interpret database information as
carefully as mass spectra.
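
As a rough illustration of how a database sequence turns into predicted peptide ion masses, the sketch below digests a sequence in silico and sums residue masses. It assumes trypsin (cleave after K or R unless the next residue is P), no missed cleavages, and no modifications; the function names are illustrative and not from the slides. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Monoisotopic residue masses in daltons (standard values, rounded to 5 places).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added per free peptide
PROTON = 1.00728   # mass of the charge-carrying proton


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge=1):
    """Mass-to-charge ratio of the protonated peptide ion."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in tryptic_digest("MAVMAPRTLLLLLSGALALTQTWAGSHSMR"):
        print(pep, round(mz(pep, charge=2), 4))
```

Any error in the database sequence (a variant residue, a missing isoform) shifts these predicted masses, which is why the sequence deserves the same scrutiny as the spectrum.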

3
More than sequence

Protein sequence databases provide much more than
sequence
Names
Descriptions
Facts
Predictions
Links to other information sources
Protein databases provide a link to the current
state of our understanding about a protein.

4
Much more than sequence

Names
Accession, Name, Description
Biological Source
Organism, Source, Taxonomy
Literature
Function
Biological process, molecular function, cellular
component
Known and predicted
Features
Polymorphism, Isoforms, PTMs, Domains
Derived Data
Molecular weight, pI

5
Database types
6
SwissProt

From ExPASy
Expert Protein Analysis System
Swiss Institute of Bioinformatics
180,000 protein sequence entries
9,000 species represented
12,000 Human proteins
Highly curated
Minimal redundancy
Some restrictions on commercial use

7
PIR

Protein Information Resource
Georgetown University Medical Center
280,000 protein sequence entries
Highly curated
Public domain resource
10,500 Human proteins
Grew out of the Atlas of Protein Sequence and
Structure (1965-1978) edited by Margaret Dayhoff.

8
TrEMBL

Translated EMBL nucleotide sequences
European Molecular Biology Laboratory
European Bioinformatics Institute (EBI)
Computer annotated
Only sequences absent from SwissProt
165,000 protein sequence entries
88,000 species
52,000 Human proteins

9
RefSeq

Reference Sequence
From NCBI (National Center for Biotechnology
Information), NLM, NIH
Integrated genomic, transcript, and protein
sequences.
Varying levels of curation
Reviewed, Validated, ..., Predicted, ...
1,350,000 protein sequence entries
44,000 reviewed
28,000 Human proteins

10
RefSeq

Particular focus on major research organisms
Tightly integrated with genome projects.
Curated entries: NP accessions
Predicted entries: XP accessions

11
UniProt

Universal Protein Resource
Combination of
Swiss-Prot
TrEMBL
PIR
Knowledgebase is highly curated
Similar sequence clusters are available
50%, 90%, 100% sequence similarity

12
IPI

International Protein Index
From EBI
For a specific species, combines
UniProt, RefSeq, Ensembl
Species specific databases
48,000 protein sequence entries
Human, mouse, rat, zebrafish, Arabidopsis

13
NCBI's nr

non-redundant
Contains
GenBank CDS translations
RefSeq Proteins
Protein Data Bank (PDB)
SwissProt, TrEMBL, PIR
Others
Similar sequences suppressed
100% sequence similarity
1,800,000 protein sequence entries
33,000 species

14
MSDB

From the Imperial College (London)
Combines
PIR, TrEMBL, GenBank, SwissProt
Distributed with Mascot
so well integrated with Mascot

15
Others

HPRD
Manually curated integration of literature
PDB
Focus on protein structure
dbEST
Part of GenBank - EST sequences
Genome Sequences

16
Human Sequences

Number of Human Genes is believed to be between
20,000 and 25,000

17
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
18
Genome Browsers

Link genomic, transcript, and protein sequence in
a graphical manner
Genes, ESTs, SNPs, cross-species, etc.
UC Santa Cruz
http://genome.ucsc.edu
Ensembl
http://www.ensembl.org
NCBI Map View
http://www.ncbi.nlm.nih.gov/mapview

19
UCSC Genome Browser

Shows many sources of protein sequence evidence
in a unified display
Can use EST accession as a location!

20
Accessions

Permanent labels
Short, machine readable
Enable precise communication
Typos render them unusable!
Each database uses a different format
Swiss-Prot: P17947
Ensembl: ENSG00000066336
PIR: S60367
GO: GO:0003700
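
Because each database uses its own accession format, a format check can catch some typos before a lookup fails. The sketch below is a minimal classifier under stated assumptions: the UniProt pattern follows the accession format published by UniProt, while the Ensembl, RefSeq, GO, and PIR patterns are simplified guesses based on the examples on this slide. The function name `classify_accession` is illustrative.

```python
import re

# Patterns are tried in order; the PIR pattern is deliberately last because
# "one letter plus five digits" also describes many UniProt accessions.
ACCESSION_PATTERNS = [
    ("UniProt", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("Ensembl gene", re.compile(r"ENSG[0-9]{11}")),
    ("GO term", re.compile(r"GO:[0-9]{7}")),
    ("RefSeq protein", re.compile(r"[NX]P_[0-9]+(?:\.[0-9]+)?")),
    ("PIR", re.compile(r"[A-Z][0-9]{5}")),  # assumption: simplified format
]


def classify_accession(text):
    """Return the first database whose format matches, or None."""
    for database, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(text):
            return database
    return None


if __name__ == "__main__":
    for acc in ["P17947", "ENSG00000066336", "S60367", "GO:0003700", "P1794"]:
        print(acc, "->", classify_accession(acc))
```

A format match says nothing about whether the accession exists; a typo that swaps two digits still passes, so the check is a filter and not a validator.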

21
Names / IDs

Compact mnemonic labels
Not guaranteed permanent
Require careful curation
Conceptual objects
Swiss-Prot names changed recently!
ALBU_HUMAN
Serum Albumin
RT30_HUMAN
Mitochondrial 28S ribosomal protein S30
CP3A7_HUMAN
Cytochrome P450 3A7

22
Description / Name

Free text description
Human readable
Space limited
Hard for computers to interpret!
No standard nomenclature or format
Often abused.
COX7R_HUMAN
Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial Precursor

23
FASTA Format
24
FASTA Format

>
Accession number
No uniform format
Multiple accessions separated by |
One line of description
Usually pretty cryptic
Organism of sequence?
No uniform format
Official Latin name not necessarily used
Amino-acid sequence in single-letter code
Usually spread over multiple lines.
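
A minimal FASTA reader along these lines might look like the sketch below. It assumes a header line starting with ">", an identifier field ending at the first space with parts separated by "|", and sequence lines that are simply concatenated. The example records are made up for illustration, and because header layouts are not uniform, the `split_header` helper is only a first guess at structure.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence is usually spread over several lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into its '|'-separated identifier parts and description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


if __name__ == "__main__":
    example = [
        ">db|ACC0001|NAME_ORG hypothetical example protein",
        "MAVMAPRTLL",
        "LLLSGALALT",
        ">ACC0002",
        "SSQPTIPIVG",
    ]
    for header, sequence in read_fasta(example):
        print(split_header(header), len(sequence))
```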

25
Organism / Species / Taxonomy

The protein's organism
or the source of the biological sample
The most reliable sequence annotation available
Useful only to the extent that it is correct
NCBI's taxonomy is widely used
Provides a standard of sorts; hierarchical
Other databases don't necessarily keep up
Organism specific sequence databases starting to
become available.

26
Organism / Species / Taxonomy

Buffalo rat
Gunn rats
Norway rat
Rattus PC12 clone IS
Rattus norvegicus
Rattus norvegicus8
Rattus norwegicus
Rattus rattiscus
Rattus sp.

Rattus sp. strain Wistar
Sprague-Dawley rat
Wistar rats
brown rat
laboratory rat
rat
rats
zitter rats
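
One way to cope with name lists like this is to map free-text organism names onto a taxonomy identifier. The sketch below assumes a small hand-written synonym table (not taken from NCBI's own synonym list) pointing at NCBI Taxonomy ID 10116, Rattus norvegicus, and falls back to fuzzy matching to catch near-miss spellings. The table contents, the 0.9 cutoff, and the name `to_taxid` are all illustrative choices.

```python
import difflib

RATTUS_NORVEGICUS = 10116  # NCBI Taxonomy ID

# Illustrative synonym table; a real one would come from the taxonomy itself.
KNOWN_NAMES = {
    "rattus norvegicus": RATTUS_NORVEGICUS,
    "norway rat": RATTUS_NORVEGICUS,
    "brown rat": RATTUS_NORVEGICUS,
    "laboratory rat": RATTUS_NORVEGICUS,
    "rat": RATTUS_NORVEGICUS,
    "rats": RATTUS_NORVEGICUS,
}


def to_taxid(name, cutoff=0.9):
    """Map an organism string to a taxonomy ID, or None if nothing is close."""
    key = " ".join(name.lower().split())  # normalise case and whitespace
    if key in KNOWN_NAMES:
        return KNOWN_NAMES[key]
    close = difflib.get_close_matches(key, list(KNOWN_NAMES), n=1, cutoff=cutoff)
    return KNOWN_NAMES[close[0]] if close else None


if __name__ == "__main__":
    for raw in ["Norway rat", "Rattus norwegicus", "Rattus norvegicus8",
                "Sprague-Dawley rat", "Rattus sp."]:
        print(repr(raw), "->", to_taxid(raw))
```

Strain names and vague labels such as "Rattus sp." stay unmapped here on purpose: deciding what they refer to is a curation question that string similarity cannot settle.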

27
Controlled Vocabulary

Middle ground between computers and people
Provides precision for concepts
Searching, sorting, browsing
Concept relationships
Vocabulary / Ontology must be established
Human curation
Link between concept and object
Manually curated
Automatic / Predicted

28
Controlled Vocabulary
29
Controlled Vocabulary
30
Controlled Vocabulary
31
Controlled Vocabulary
32
Controlled Vocabulary
33
Ontology Structure

NCBI Taxonomy
Tree
Gene Ontology (GO)
Molecular function
Biological process
Cellular component
Directed, Acyclic Graph (DAG)
Unstructured labels
Overlapping?
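
The practical difference between a tree and a DAG is that a DAG term can have more than one parent, so collecting a term's ancestors means following every parent link and remembering what has been visited. The sketch below uses a toy ontology with invented term names, not real GO terms.

```python
# Toy ontology: each term maps to its direct parents. The term names are
# invented for illustration; "dna-binding enzyme" has two parents, which a
# tree such as the NCBI Taxonomy would not allow.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna binding": ["binding"],
    "dna-binding enzyme": ["dna binding", "catalysis"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:       # shared ancestors are visited only once
            seen.add(node)
            stack.extend(parents[node])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("dna-binding enzyme")))
```

A protein annotated with a specific term is implicitly annotated with all of that term's ancestors, so a search for a general concept needs this kind of traversal to find it.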

34
Ontology Structure
35
Protein Families

Similar sequence implies similar function
Similar structure implies similar function
Common domains imply similar function
Bootstrap up from small sets of proteins with
well understood characteristics
Usually a hybrid manual / automatic approach

36
Protein Families
37
Protein Families
38
Protein Families

PROSITE, PFam, InterPro, PRINTS
Gene Ontology
Swiss-Prot keywords
Differences
Motif style, ontology structure, degree of manual
curation
Similarities
Primarily sequence based, cross species
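
To make "motif style" concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression. It covers only the common elements of the syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > as terminal anchors) and does not handle anchors placed inside brackets. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site motif; the function name is illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            token = core                       # literal residue or [..] set
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + token + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex, re.findall(regex, "AANGSAANPSA"))
```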

39
Sequence Variants

Protein sequence can vary due to
Polymorphism
Alternative splicing
Post-translational modification
Sequence databases typically do not capture all
versions of a protein's sequence

40
Sequence Variants

Swiss-Prot variants, isoforms and conflicts are
retained as features
Script varsplic.pl can enumerate all sequence
variants
Command-line options for full enumeration
-which full -varsplic -variant -conflict
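
The idea of full enumeration can be sketched as taking every combination of "apply" or "do not apply" over the annotated features. The code below is not a reimplementation of varsplic.pl: it assumes only single-residue substitutions at distinct positions, given as hypothetical (position, reference, variant) tuples, and ignores splice isoforms and conflicts.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Return all sequences obtained by applying any subset of substitutions.

    `substitutions` is a list of (1-based position, reference residue,
    variant residue) tuples at distinct positions.
    """
    for position, reference, _ in substitutions:
        if sequence[position - 1] != reference:
            raise ValueError(f"expected {reference} at position {position}")
    results = []
    # One True/False flag per substitution: 2**n combinations in total.
    for flags in product([False, True], repeat=len(substitutions)):
        residues = list(sequence)
        for applied, (position, _, variant) in zip(flags, substitutions):
            if applied:
                residues[position - 1] = variant
        results.append("".join(residues))
    return results


if __name__ == "__main__":
    for seq in enumerate_variants("RGEPRF", [(3, "E", "K")]):
        print(seq)
```

The count doubles with each additional feature, which is one reason databases store variants as annotations on a single reference sequence instead of listing every combination.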

41
Swiss-Prot Variant Annotations
42
Swiss-Prot Variant Annotations
43
Swiss-Prot Variant Annotations
Feature viewer
Variants
44
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF

45
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ

46
Omnibus Database Redundancy Elimination

Source databases often contain the same sequences
with different descriptions
Omnibus databases keep one copy of the sequence,
and
An arbitrary description, or
All descriptions, or
Particular description, based on source
preference
Good definitions can be lost, including taxonomy
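
The three strategies can be sketched in a few lines. The code below groups records by exact sequence, keeps every description, and also picks one "preferred" description by source rank. The source codes and their ordering in `preference` are assumptions for illustration, and the sequences in the example are placeholders, not real database entries.

```python
def collapse(records, preference=("sp", "ref", "gb")):
    """Merge records with identical sequences.

    `records` is an iterable of (source, description, sequence) tuples.
    Returns one dict per distinct sequence, in first-seen order.
    """
    by_sequence = {}
    for source, description, sequence in records:
        by_sequence.setdefault(sequence, []).append((source, description))

    def rank(entry):
        source = entry[0]
        # Unlisted sources sort after every listed one.
        return preference.index(source) if source in preference else len(preference)

    merged = []
    for sequence, entries in by_sequence.items():
        merged.append({
            "sequence": sequence,
            "preferred": min(entries, key=rank)[1],
            "all_descriptions": [description for _, description in entries],
        })
    return merged


if __name__ == "__main__":
    example = [
        ("emb", "hypothetical protein", "MAVMAPRT"),
        ("sp", "COMM domain containing protein 4", "MAVMAPRT"),
        ("gb", "COMMD4 protein", "MAVMAPRT"),
    ]
    for entry in collapse(example):
        print(entry["preferred"], entry["all_descriptions"])
```

Keeping only `preferred` is the compact option, but it is exactly where an informative description or an organism label can disappear, so retaining `all_descriptions` is the safer default when storage allows.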

47
Description Elimination

gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

48
Description Elimination

gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi|1585500|prf||2201313A UDP galactose 4'-epimerase

49
Description Elimination

gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I validated human
gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog human, liver, Peptide, 323 aa
gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

50
Summary

Protein sequence databases should be interpreted
with as much care as mass spectra
Protein sequences come from genes
Use controlled vocabularies
Understand the structure of ontologies
Take advantage of computational predictions
Look for sequence variants
Be careful with omnibus databases
