Protein Sequence Databases for Proteomics

1
Protein Sequence Databases for Proteomics

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
Protein Sequence Databases

Link between mass spectra and proteins
A protein's amino-acid sequence provides a basis
for interpreting:
Enzymatic digestion
Separation protocols
Fragmentation
Peptide ion masses
We must interpret database information as
carefully as mass spectra.
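
To make the link between sequence and peptide ion masses concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the common rule of thumb that trypsin cleaves after K or R unless the next residue is P, and it uses standard monoisotopic residue masses. The function names are illustrative and not taken from the presentation; missed cleavages and modifications are ignored.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_peptides(sequence):
    """Split after K or R, except when the next residue is P (assumed rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    # First 60 residues of the P13746 sequence shown later in the deck.
    seq = "MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF"
    for pep in tryptic_peptides(seq):
        print(f"{pep:25s} {peptide_mass(pep):10.4f}")
```

A single residue change in the database entry shifts every peptide mass that contains it, which is why database errors propagate directly into spectrum interpretation.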

3
More than sequence

Protein sequence databases provide much more than
sequence
Names
Descriptions
Facts
Predictions
Links to other information sources
Protein databases provide a link to the current
state of our understanding about a protein.

4
Much more than sequence

Names
Accession, Name, Description
Biological Source
Organism, Source, Taxonomy
Literature
Function
Biological process, molecular function, cellular
component
Known and predicted
Features
Polymorphism, Isoforms, PTMs, Domains
Derived Data
Molecular weight, pI

5
Database types
6
SwissProt

From ExPASy
Expert Protein Analysis System
Swiss Institute of Bioinformatics
260,000 protein sequence entries
11,000 species represented
16,000 Human proteins
Highly curated
Minimal redundancy
Part of UniProt Consortium

7
TrEMBL

Translated EMBL nucleotide sequences
European Molecular Biology Laboratory
European Bioinformatics Institute (EBI)
Computer annotated
Only sequences absent from SwissProt
3.9M protein sequence entries
130,000 species
53,000 Human proteins
Part of UniProt Consortium

8
UniProt

Universal Protein Resource
Combination of
Swiss-Prot
TrEMBL
PIR (Georgetown Medical Center)
Knowledgebase is highly curated
Similar sequence clusters are available
50%, 90%, 100% sequence similarity

9
RefSeq

Reference Sequence
From NCBI (National Center for Biotechnology
Information), NLM, NIH
Integrated genomic, transcript, and protein
sequences.
Varying levels of curation
Reviewed, Validated, ..., Predicted, ...
3.2M protein sequence entries
89,000 reviewed
33,888 Human proteins

10
RefSeq

Particular focus on major research organisms
Tightly integrated with genome projects.
Curated entries: NP accessions
Predicted entries: XP accessions

11
IPI

International Protein Index
From EBI
For a specific species, combines
UniProt, RefSeq, Ensembl
Species specific databases
68,000 (from 228,000) human protein sequence
entries
Human, mouse, rat, zebrafish, Arabidopsis

12
NCBI's nr

non-redundant
Contains
GenBank CDS translations
RefSeq Proteins
Protein Data Bank (PDB)
SwissProt, TrEMBL, PIR
Others
Similar sequences suppressed
100% sequence similarity
4.7M protein sequence entries

13
MSDB

From the Imperial College (London)
Combines
PIR, TrEMBL, GenBank, SwissProt
Distributed with Mascot
so well integrated with Mascot
3.2M protein sequence entries
Similar sequences suppressed
100% sequence similarity

14
Others

HPRD
Manually curated integration of literature
PDB
Focus on protein structure
dbEST
Part of GenBank - EST sequences
Genome Sequences

15
Human Sequences

Number of Human Genes is believed to be between
20,000 and 25,000

16
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
17
Genome Browsers

Link genomic, transcript, and protein sequence in
a graphical manner
Genes, ESTs, SNPs, cross-species, etc.
UC Santa Cruz
http://genome.ucsc.edu
Ensembl
http://www.ensembl.org
NCBI Map View
http://www.ncbi.nlm.nih.gov/mapview

18
UCSC Genome Browser

Shows many sources of protein sequence evidence
in a unified display
Can use EST accession as a location!

19
Accessions

Permanent labels
Short, machine readable
Enable precise communication
Typos render them unusable!
Each database uses a different format
Swiss-Prot: P17947
Ensembl: ENSG00000066336
PIR: S60367 S60367
GO: GO:0003700
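
Because each database uses its own accession format, a small pattern table can catch typos before they reach a search. The sketch below is only illustrative: the regular expressions are simplified assumptions about each format (for example, the classic six-character Swiss-Prot pattern and the colon-separated GO identifier), not official specifications, and `classify_accession` is a hypothetical helper.

```python
import re

# Simplified, assumed patterns; real formats have more cases than shown here.
ACCESSION_PATTERNS = {
    "Swiss-Prot": r"[OPQ][0-9][A-Z0-9]{3}[0-9]",
    "Ensembl gene": r"ENSG[0-9]{11}",
    "RefSeq protein": r"[NX]P_[0-9]+(\.[0-9]+)?",
    "GO term": r"GO:[0-9]{7}",
}


def classify_accession(text):
    """Return the first database whose pattern matches the whole string."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if re.fullmatch(pattern, text):
            return database
    return None  # unrecognized, possibly a typo


if __name__ == "__main__":
    for acc in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
        print(acc, "->", classify_accession(acc))
```

A dropped character, as in the last example, matches nothing, which is the practical meaning of "typos render them unusable".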

20
Names / IDs

Compact mnemonic labels
Not guaranteed permanent
Require careful curation
Conceptual objects
ALBU_HUMAN
Serum Albumin
RT30_HUMAN
Mitochondrial 28S ribosomal protein S30
CP3A7_HUMAN
Cytochrome P450 3A7

21
Description / Name

Free text description
Human readable
Space limited
Hard for computers to interpret!
No standard nomenclature or format
Often abused.
COX7R_HUMAN
Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial Precursor

22
FASTA Format
23
FASTA Format

>
Accession number
No uniform format
Multiple accessions separated by |
One line of description
Usually pretty cryptic
Organism of sequence?
No uniform format
Official latin name not necessarily used
Amino-acid sequence in single-letter code
Usually spread over multiple lines.
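
The FASTA layout described above can be read with a few lines of code. This is a minimal sketch, not a full parser: it assumes headers start with `>`, that the identifier is the first whitespace-delimited token, and that multiple accessions inside it are separated by `|`. The names `read_fasta` and `split_header` are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into its '|'-separated identifier fields and description."""
    ident, _, description = header.partition(" ")
    return [field for field in ident.split("|") if field], description


if __name__ == "__main__":
    example = [
        ">P13746-00-00-00 example entry",
        "MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG",
        "YVDDTQFVRF",
    ]
    for header, sequence in read_fasta(example):
        print(split_header(header), len(sequence))
```

Nothing in the format says where the organism lives in the description, so any code that extracts it has to be written per database.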

24
Organism / Species / Taxonomy

The protein's organism
or the source of the biological sample
The most reliable sequence annotation available
Useful only to the extent that it is correct
NCBI's taxonomy is widely used
Provides a standard of sorts
Hierarchical
Other databases don't necessarily keep up
Organism specific sequence databases starting to
become available.

25
Organism / Species / Taxonomy

Buffalo rat
Gunn rats
Norway rat
Rattus PC12 clone IS
Rattus norvegicus
Rattus norvegicus8
Rattus norwegicus
Rattus rattiscus
Rattus sp.

Rattus sp. strain Wistar
Sprague-Dawley rat
Wistar rats
brown rat
laboratory rat
rat
rats
zitter rats
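
One way to tame a list like this is to map free-text organism labels onto a single taxonomy identifier. The sketch below assumes a small hand-curated synonym table; the table itself is hypothetical and deliberately incomplete, and only the NCBI taxonomy ID 10116 for Rattus norvegicus is taken as given.

```python
RATTUS_NORVEGICUS = 10116  # NCBI taxonomy ID for Rattus norvegicus

# Hypothetical, hand-curated synonym table (lower-cased keys).
ORGANISM_SYNONYMS = {
    "rattus norvegicus": RATTUS_NORVEGICUS,
    "rattus norwegicus": RATTUS_NORVEGICUS,  # misspelling seen in the list
    "norway rat": RATTUS_NORVEGICUS,
    "brown rat": RATTUS_NORVEGICUS,
    "sprague-dawley rat": RATTUS_NORVEGICUS,
    "wistar rats": RATTUS_NORVEGICUS,
}


def normalize_organism(label):
    """Return a taxonomy ID for a free-text label, or None if unknown."""
    key = " ".join(label.lower().split())  # case-fold, collapse whitespace
    return ORGANISM_SYNONYMS.get(key)


if __name__ == "__main__":
    for label in ["Norway rat", "Rattus  norwegicus", "Rattus sp."]:
        print(repr(label), "->", normalize_organism(label))
```

Labels that are genuinely ambiguous, such as "Rattus sp.", are left unmapped rather than guessed, so a curator still has to decide.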

26
Controlled Vocabulary

Middle ground between computers and people
Provides precision for concepts
Searching, sorting, browsing
Concept relationships
Vocabulary / Ontology must be established
Human curation
Link between concept and object
Manually curated
Automatic / Predicted

27
Controlled Vocabulary
28
Controlled Vocabulary
29
Controlled Vocabulary
30
Controlled Vocabulary
31
Controlled Vocabulary
32
Ontology Structure

NCBI Taxonomy
Tree
Gene Ontology (GO)
Molecular function
Biological process
Cellular component
Directed, Acyclic Graph (DAG)
Unstructured labels
Overlapping?
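
The difference between a tree and a DAG matters when collecting everything a term implies: in a DAG a term can have several parents, so ancestors must be gathered by graph traversal rather than by walking a single lineage. A minimal sketch with a toy ontology follows; the term names are placeholders, not real GO terms.

```python
# Toy ontology: each term maps to its list of parents.
# "term X" has two parents, which a tree could not represent.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "term X": ["binding", "catalytic activity"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("term X", PARENTS)))
```

A protein annotated with "term X" is implicitly annotated with all of these ancestors, which is what makes searching and browsing by concept work.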

33
Ontology Structure
34
Protein Families

Similar sequence implies similar function
Similar structure implies similar function
Common domains imply similar function
Bootstrap up from small sets of proteins with
well understood characteristics
Usually a hybrid manual / automatic approach

35
Protein Families
36
Protein Families
37
Protein Families

PROSITE, Pfam, InterPro, PRINTS
Swiss-Prot keywords
Differences
Motif style, ontology structure, degree of manual
curation
Similarities
Primarily sequence based, cross species

38
Gene Ontology

Hierarchical
Molecular function
Biological process
Cellular component
Describes the vocabulary only!
Protein families provide GO association
Not necessarily any appropriate GO category.
Not necessarily in all three hierarchies.
Sometimes general categories are used because
none of the specific categories are correct.

39
Protein Family / Gene Ontology
40
Sequence Variants

Protein sequence can vary due to
Polymorphism
Alternative splicing
Post-translational modification
Sequence databases typically do not capture all
versions of a protein's sequence

41
Sequence Variants

Swiss-Prot a curated protein sequence database
which strives to provide a high level of
annotation (such as the description of the
function of a protein, its domains structure,
post-translational modifications, variants,
etc.), a minimal level of redundancy and high
level of integration with other databases
- Swiss-Prot web site front page

42
Sequence Variants

b) Minimal redundancy
Many sequence databases contain, for a given
protein sequence, separate entries which
correspond to different literature reports. In
Swiss-Prot we try as much as possible to merge
all these data so as to minimize the redundancy
of the database. If conflicts exist between
various sequencing reports, they are indicated in
the feature table of the corresponding entry.
- Swiss-Prot User Manual, Section 1.1

43
Sequence Variants

IPI provides a top level guide to the main
databases that describe the proteomes of higher
eukaryotic organisms. IPI
1. effectively maintains a database of cross
references between the primary data sources
2. provides minimally redundant yet maximally
complete sets of proteins for featured species
(one sequence per transcript)
3. maintains stable identifiers (with
incremental versioning) to allow the tracking of
sequences in IPI between IPI releases.
- IPI web site front page

44
Sequence Variants

Swiss-Prot variants, isoforms and conflicts are
retained as features
Script varsplic.pl can enumerate all sequence
variants
Command-line options for full enumeration
-which full -varsplic -variant -conflict
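
The idea of full enumeration can be illustrated with a small sketch. This is not how varsplic.pl works internally; it only assumes that each annotated site offers one or more alternative residues and that every combination should be emitted. Splice isoforms, insertions, and deletions are ignored, and `enumerate_variants` is a hypothetical name.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Yield every sequence obtained by combining single-residue substitutions.

    substitutions maps a 0-based position to a list of alternative residues.
    The unmodified sequence is always the first result.
    """
    positions = sorted(substitutions)
    # At each site the choices are the reference residue plus its alternatives.
    choices = [[sequence[p]] + list(substitutions[p]) for p in positions]
    for combination in product(*choices):
        residues = list(sequence)
        for position, residue in zip(positions, combination):
            residues[position] = residue
        yield "".join(residues)


if __name__ == "__main__":
    # E -> K substitution within "RGEPRF", as seen in the P13746 output.
    for variant in enumerate_variants("RGEPRF", {2: ["K"]}):
        print(variant)
```

The number of outputs grows as the product of the choices at each site, so full enumeration is only practical when a protein has a handful of annotated variants.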

45
Swiss-Prot Variant Annotations
46
Swiss-Prot Variant Annotations
47
Swiss-Prot Variant Annotations
Feature viewer
Variants
48
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF

49
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ

50
Omnibus Database Redundancy Elimination

Source databases often contain the same sequences
with different descriptions
Omnibus databases keep one copy of the sequence,
and
An arbitrary description, or
All descriptions, or
Particular description, based on source
preference
Good definitions can be lost, including taxonomy
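
The three description policies above can be compared side by side in a short sketch. It assumes each entry is a (source, description, sequence) triple and that identical sequences are merged; the source ranking `SOURCE_PREFERENCE` and the function `merge_entries` are hypothetical, and the placeholder sequence is not a real protein.

```python
# Hypothetical ranking: earlier sources are preferred when choosing a description.
SOURCE_PREFERENCE = ["sp", "ref", "gb", "emb", "dbj", "pir", "prf", "pdb"]


def source_rank(source):
    """Position in the preference list; unknown sources sort last."""
    if source in SOURCE_PREFERENCE:
        return SOURCE_PREFERENCE.index(source)
    return len(SOURCE_PREFERENCE)


def merge_entries(entries, keep="preferred"):
    """Collapse identical sequences; keep 'first', 'all', or 'preferred' descriptions."""
    grouped = {}
    for source, description, sequence in entries:
        grouped.setdefault(sequence, []).append((source, description))
    merged = {}
    for sequence, described in grouped.items():
        if keep == "all":
            merged[sequence] = [d for _, d in described]
        elif keep == "first":  # an arbitrary description: whichever came first
            merged[sequence] = [described[0][1]]
        else:  # choose by source preference
            best = min(described, key=lambda item: source_rank(item[0]))
            merged[sequence] = [best[1]]
    return merged


if __name__ == "__main__":
    entries = [
        ("emb", "hypothetical protein [Homo sapiens]", "MSEQ"),
        ("ref", "COMM domain containing 4 [Homo sapiens]", "MSEQ"),
        ("sp", "COMM domain containing protein 4", "MSEQ"),
    ]
    for policy in ("first", "all", "preferred"):
        print(policy, merge_entries(entries, keep=policy))
```

With the "first" policy the merged entry is labeled "hypothetical protein" even though better descriptions exist, and the preferred description here carries no organism, which illustrates how taxonomy can be lost.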

51
Description Elimination

gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

52
Description Elimination

gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi|1585500|prf||2201313A UDP galactose 4'-epimerase

53
Description Elimination

gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I validated human
gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog human, liver, Peptide, 323 aa
gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

54
Summary

Protein sequence databases should be interpreted
with as much care as mass spectra
Protein sequences come from genes
Use controlled vocabularies
Understand the structure of ontologies
Take advantage of computational predictions
Look for sequence variants
Be careful with omnibus databases
