Protein Sequence Databases - PowerPoint PPT Presentation

Protein Sequence Databases Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center – PowerPoint PPT presentation

Slides: 67

Provided by: edwardsla5

Learn more at: http://edwardslab.bmcb.georgetown.edu

Transcript and Presenter's Notes

Title: Protein Sequence Databases

1
Protein Sequence Databases

Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center

2
Protein Sequence Databases

Link between mass spectra and proteins
A protein's amino-acid sequence provides a basis
for interpreting
Enzymatic digestion
Separation protocols
Fragmentation
Peptide ion masses
We must interpret database information as
carefully as mass spectra.
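
As a rough illustration of how a sequence underpins digestion and peptide ion masses, the sketch below digests a sequence in silico and computes monoisotopic peptide masses. The function names and the example sequence are made up for illustration. The cleavage rule (after K or R, not before P, no missed cleavages) is a common simplification of trypsin specificity, not something prescribed in these slides.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the residue sum for a free peptide


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Hypothetical sequence, used only to exercise the functions.
for peptide in tryptic_peptides("MKWVTFRPLLK"):
    print(peptide, round(monoisotopic_mass(peptide), 4))
```

Modifications, missed cleavages, and charge states are deliberately left out; each would change the masses a search engine has to consider.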

3
More than sequence

Protein sequence databases provide much more than
sequence
Names
Descriptions
Facts
Predictions
Links to other information sources
Protein databases provide a link to the current
state of our understanding about a protein.

4
Much more than sequence

Names
Accession, Name, Description
Biological Source
Organism, Source, Taxonomy
Literature
Function
Biological process, molecular function, cellular
component
Known and predicted
Features
Polymorphism, Isoforms, PTMs, Domains
Derived Data
Molecular weight, pI

5
Database types
Curated: Swiss-Prot, UniProt, RefSeq NP
Translated: TrEMBL, RefSeq XP, ZP
Omnibus: NCBI's nr, MSDB, IPI
Other: PDB, HPRD, EST, Genomic
6
SwissProt

From ExPASy
Expert Protein Analysis System
Swiss Institute of Bioinformatics
515,000 protein sequence entries
12,000 species represented
20,000 Human proteins
Highly curated
Minimal redundancy
Part of UniProt Consortium

7
TrEMBL

Translated EMBL nucleotide sequences
European Molecular Biology Laboratory
European Bioinformatics Institute (EBI)
Computer annotated
Only sequences absent from SwissProt
10.5 M protein sequence entries
230,000 species
75,000 Human proteins
Part of UniProt Consortium

8
UniProt

Universal Protein Resource
Combination of sequences from
Swiss-Prot
TrEMBL
Mixture of highly curated (Swiss-Prot) and
computer annotation (TrEMBL)
Similar sequence clusters are available
50%, 90%, 100% sequence similarity

9
RefSeq

Reference Sequence
From NCBI (National Center for Biotechnology
Information), NLM, NIH
Integrated genomic, transcript, and protein
sequences.
Varying levels of curation
Reviewed, Validated, ..., Predicted, ...
9.7 M protein sequence entries
209,000 reviewed, 90,000 validated
39,000 Human proteins

10
RefSeq

Particular focus on major research organisms
Tightly integrated with genome projects.
Curated entries: NP accessions
Predicted entries: XP accessions
Others: YP, ZP, AP
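
The prefix convention above can be turned into a quick filter. This is a minimal sketch: the function name and the "curated" / "predicted" / "other" labels are choices made for this example, and only the prefixes named on this slide are covered.

```python
# Prefix -> category, following the slide (NP curated, XP predicted, rest "other").
REFSEQ_PROTEIN_PREFIXES = {
    "NP": "curated",
    "XP": "predicted",
    "YP": "other",
    "ZP": "other",
    "AP": "other",
}


def refseq_category(accession):
    """Return the category for a RefSeq-style protein accession, else None."""
    prefix, separator, _rest = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PROTEIN_PREFIXES.get(prefix.upper())


print(refseq_category("NP_060298.2"))
```

A search result dominated by XP accessions rests on predicted sequences, which is worth knowing before interpreting it.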

11
IPI

International Protein Index
From EBI
For a specific species, combines
UniProt, RefSeq, Ensembl
Species specific databases
HInv-DB, VEGA, TAIR
87,000 (from 307,000) human protein sequence
entries
Human, mouse, rat, zebrafish, Arabidopsis,
chicken, cow

12
MSDB

From the Imperial College (London)
Combines
PIR, TrEMBL, GenBank, SwissProt
Distributed with Mascot
so well integrated with Mascot
3.2M protein sequence entries
Similar sequences suppressed
100% sequence similarity
Not updated since September 2006 (obsolete)

13
NCBI's nr

non-redundant
Contains
GenBank CDS translations
RefSeq Proteins
Protein Data Bank (PDB)
SwissProt, TrEMBL, PIR
Others
Similar sequences suppressed
100% sequence similarity
10.5 M protein sequence entries

14
Others

HPRD
Manually curated integration of literature
PDB
Focus on protein structure
dbEST
Part of GenBank - EST sequences
Genome Sequences

15
Human Sequences

Number of Human genes is believed to be between
20,000 and 25,000

SwissProt 20,000
RefSeq 39,000
TrEMBL 75,000
IPI-HUMAN 87,000
MSDB 130,000
nr 230,000
16
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
17
Genome Browsers

Link genomic, transcript, and protein sequence in
a graphical manner
Genes, ESTs, SNPs, cross-species, etc.
UC Santa Cruz
http://genome.ucsc.edu
Ensembl
http://www.ensembl.org
NCBI Map View
http://www.ncbi.nlm.nih.gov/mapview

18
UCSC Genome Browser

Shows many sources of protein sequence evidence
in a unified display

19
PeptideMapper Web Service
I'm Feeling Lucky
20
PeptideMapper Web Service
I'm Feeling Lucky
21
Unannotated Splice Isoform
22
Accessions

Permanent labels
Short, machine readable
Enable precise communication
Typos render them unusable!
Each database uses a different format
Swiss-Prot P17947
Ensembl ENSG00000066336
PIR S60367 S60367
GO GO:0003700
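
Because each database uses its own accession format, a pattern check can catch typos before they propagate. The sketch below is illustrative only: the table covers three of the formats listed here (GO identifiers written with a colon, as in GO:0003700), the UniProt pattern follows the accession format UniProt documents, and the Ensembl pattern assumes the human gene form "ENSG" plus 11 digits. PIR accessions are not covered.

```python
import re

# (label, pattern) pairs; order matters only if patterns could overlap.
ACCESSION_PATTERNS = [
    ("GO", re.compile(r"GO:\d{7}")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("UniProt", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def classify_accession(text):
    """Return the label of the first pattern matching the whole string, else None."""
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return None


for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
    print(candidate, classify_accession(candidate))
```

A format check only shows that a string looks like an accession; it cannot tell whether the accession exists or refers to the intended entry.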

23
Names / IDs

Compact mnemonic labels
Not guaranteed permanent
Require careful curation
Conceptual objects
ALBU_HUMAN
Serum Albumin
RT30_HUMAN
Mitochondrial 28S ribosomal protein S30
CP3A7_HUMAN
Cytochrome P450 3A7

24
Description / Name

Free text description
Human readable
Space limited
Hard for computers to interpret!
No standard nomenclature or format
Often abused.
COX7R_HUMAN
Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial Precursor

25
FASTA Format
26
FASTA Format

>
Accession number
No uniform format
Multiple accessions separated by |
One line of description
Usually pretty cryptic
Organism of sequence?
No uniform format
Official Latin name not necessarily used
Amino-acid sequence in single-letter code
Usually spread over multiple lines.
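
A minimal reader for this layout might look like the sketch below. It assumes only what the slide describes: a ">" header line whose first whitespace-delimited field holds one or more accessions separated by "|", followed by sequence lines. The function names and the toy records are invented for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into its accession fields and free-text description."""
    accession_field, _, description = header.partition(" ")
    return accession_field.split("|"), description


# Toy records, not taken from any real database.
toy = [">acc1|acc2 toy protein", "MKWV", "TFRP", ">acc3 another toy", "GG"]
for header, sequence in read_fasta(toy):
    print(split_header(header), sequence)
```

The organism is deliberately not parsed out: as the slide notes, there is no uniform place or format for it in the description line.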

27
Organism / Species / Taxonomy

The protein's organism
or the source of the biological sample
The most reliable sequence annotation available
Useful only to the extent that it is correct
NCBI's taxonomy is widely used
Provides a standard of sorts; hierarchical
Other databases don't necessarily keep up
Organism specific sequence databases starting to
become available.

28
Organism / Species / Taxonomy

Buffalo rat
Gunn rats
Norway rat
Rattus PC12 clone IS
Rattus norvegicus
Rattus norvegicus8
Rattus norwegicus
Rattus rattiscus
Rattus sp.

Rattus sp. strain Wistar
Sprague-Dawley rat
Wistar rats
brown rat
laboratory rat
rat
rats
zitter rats
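
One way to cope with labels like these is to map them onto a single taxonomy entry through a hand-built synonym table. The sketch below is an assumption-laden example: the table and function name are invented, and deciding which labels count as the same organism is a curator's judgement. Rattus norvegicus has NCBI taxonomy ID 10116. Labels that cannot be resolved safely return None so they can be reviewed by hand.

```python
import re

# Scientific name and NCBI taxonomy ID that the synonyms resolve to.
CANONICAL = ("Rattus norvegicus", 10116)

# Hand-built synonym set (lower case); an illustration, not a complete list.
SYNONYMS = {
    "rattus norvegicus", "rattus norwegicus", "norway rat", "brown rat",
    "laboratory rat", "rat", "rats", "wistar rats", "sprague-dawley rat",
    "buffalo rat", "gunn rats", "zitter rats",
}


def normalize_organism(name):
    """Return (scientific name, taxonomy ID) for a known synonym, else None."""
    key = re.sub(r"\s+", " ", name).strip().lower()
    return CANONICAL if key in SYNONYMS else None


for label in ["Norway rat", "Rattus norwegicus", "Rattus sp.", "Rattus norvegicus8"]:
    print(label, "->", normalize_organism(label))
```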

29
Controlled Vocabulary

Middle ground between computers and people
Provides precision for concepts
Searching, sorting, browsing
Concept relationships
Vocabulary / Ontology must be established
Human curation
Link between concept and object
Manually curated
Automatic / Predicted

30
Controlled Vocabulary
31
Controlled Vocabulary
32
Controlled Vocabulary
33
Controlled Vocabulary
34
Controlled Vocabulary
35
Controlled Vocabulary
36
Controlled Vocabulary
37
Controlled Vocabulary
38
Controlled Vocabulary
39
Controlled Vocabulary
40
Controlled Vocabulary
41
Controlled Vocabulary
42
Controlled Vocabulary
43
Controlled Vocabulary
44
Ontology Structure

NCBI Taxonomy
Tree
Gene Ontology (GO)
Molecular function
Biological process
Cellular component
Directed, Acyclic Graph (DAG)
Unstructured labels
Overlapping?
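
The practical difference between a tree and a DAG is that a DAG term can have several parents, so "all ancestors" requires a graph walk rather than a single path to the root. The sketch below uses a toy parent table; the term names are hypothetical placeholders, not real GO or taxonomy terms.

```python
# child -> list of parents ("is a" links). Toy terms, invented for illustration.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "regulator": ["root"],
    "dna binding": ["binding"],
    "transcription regulator": ["regulator"],
    "dna-binding transcription factor": ["dna binding", "transcription regulator"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def is_tree(parents):
    """True when no term has more than one parent."""
    return all(len(p) <= 1 for p in parents.values())


print(sorted(ancestors("dna-binding transcription factor", PARENTS)))
print(is_tree(PARENTS))
```

With a tree such as a taxonomy, the same function returns one lineage; with a DAG, an annotation to one term implies annotations along every path to the root.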

45
Ontology Structure
46
Protein Families

Similar sequence implies similar function
Similar structure implies similar function
Common domains imply similar function
Bootstrap up from small sets of proteins with
well understood characteristics
Usually a hybrid manual / automatic approach

47
Protein Families
48
Protein Families
49
Protein Families

PROSITE, Pfam, InterPro, PRINTS
Swiss-Prot keywords
Differences
Motif style, ontology structure, degree of manual
curation
Similarities
Primarily sequence based, cross species

50
Gene Ontology

Hierarchical
Molecular function
Biological process
Cellular component
Describes the vocabulary only!
Protein families provide GO association
Not necessarily any appropriate GO category.
Not necessarily in all three hierarchies.
Sometimes general categories are used because
none of the specific categories are correct.

51
Protein Family / Gene Ontology
52
Sequence Variants

Protein sequence can vary due to
Polymorphism
Alternative splicing
Post-translational modification
Sequence databases typically do not capture all
versions of a protein's sequence

53
Sequence Variants

Swiss-Prot a curated protein sequence database
which strives to provide a high level of
annotation (such as the description of the
function of a protein, its domains structure,
post-translational modifications, variants,
etc.), a minimal level of redundancy and high
level of integration with other databases
- Swiss-Prot web site front page

54
Sequence Variants

b) Minimal redundancy
Many sequence databases contain, for a given
protein sequence, separate entries which
correspond to different literature reports. In
Swiss-Prot we try as much as possible to merge
all these data so as to minimize the redundancy
of the database. If conflicts exist between
various sequencing reports, they are indicated in
the feature table of the corresponding entry.
- Swiss-Prot User Manual, Section 1.1

55
Sequence Variants

IPI provides a top level guide to the main
databases that describe the proteomes of higher
eukaryotic organisms. IPI
1. effectively maintains a database of cross
references between the primary data sources
2. provides minimally redundant yet maximally
complete sets of proteins for featured species
(one sequence per transcript)
3. maintains stable identifiers (with
incremental versioning) to allow the tracking of
sequences in IPI between IPI releases.
- IPI web site front page

56
Swiss-Prot Variant Annotations
57
Swiss-Prot Variant Annotations
58
Swiss-Prot Variant Annotations
59
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
60
Peptides to Proteins
61
Peptides to Proteins

A peptide sequence may occur in many different
protein sequences
Variants, paralogues, protein families
Separation, digestion and ionization are not well
understood
Proteins in sequence databases are extremely
non-random, and very dependent
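
The first point can be made concrete with a plain substring search: the same peptide often maps to several database entries. The sketch below uses invented toy sequences and identifiers, and exact matching only. It is not the inference method of the paper cited on the earlier slide.

```python
def map_peptides(peptides, proteins):
    """Return {peptide: sorted IDs of proteins whose sequence contains it}."""
    return {
        peptide: sorted(pid for pid, seq in proteins.items() if peptide in seq)
        for peptide in peptides
    }


# Toy sequences standing in for two isoforms and a paralogue.
toy_proteins = {
    "isoform_1": "MKTAYIAKQR",
    "isoform_2": "MKTAYIAKGG",
    "paralog": "MSTAYIAKQR",
}

hits = map_peptides(["TAYIAK", "MSTAYIAK", "GG"], toy_proteins)
for peptide, ids in hits.items():
    label = "unique" if len(ids) == 1 else "shared"
    print(peptide, ids, label)
```

Only peptides that map to a single entry distinguish between related proteins; shared peptides support the group as a whole.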

62
Omnibus Database Redundancy Elimination

Source databases often contain the same sequences
with different descriptions
Omnibus databases keep one copy of the sequence,
and
An arbitrary description, or
All descriptions, or
Particular description, based on source
preference
Good definitions can be lost, including taxonomy
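
The three description policies above can be sketched as follows. The function name, the policy labels, and the source tags are choices made for this example, and the sequence string is a placeholder rather than the real protein sequence.

```python
def merge_redundant(entries, policy="all", preference=()):
    """Collapse identical sequences; entries are (source, description, sequence)."""
    groups = {}
    for source, description, sequence in entries:
        groups.setdefault(sequence, []).append((source, description))

    merged = {}
    for sequence, labels in groups.items():
        if policy == "all":  # keep every description
            kept = [description for _, description in labels]
        elif policy == "first":  # arbitrary: whichever entry came first
            kept = [labels[0][1]]
        elif policy == "preferred":  # best-ranked source wins
            rank = {source: i for i, source in enumerate(preference)}
            best = min(labels, key=lambda label: rank.get(label[0], len(rank)))
            kept = [best[1]]
        else:
            raise ValueError(f"unknown policy: {policy}")
        merged[sequence] = kept
    return merged


PLACEHOLDER = "MAAAAAAAAA"  # stands in for one shared sequence
entries = [
    ("emb", "hypothetical protein", PLACEHOLDER),
    ("gb", "COMMD4 protein", PLACEHOLDER),
    ("ref", "COMM domain containing 4", PLACEHOLDER),
    ("sp", "COMM domain containing protein 4", PLACEHOLDER),
]
print(merge_redundant(entries, policy="first"))
print(merge_redundant(entries, policy="preferred", preference=("sp", "ref")))
```

With the "first" policy the informative names are discarded in favour of "hypothetical protein", which is exactly the loss the slide warns about.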

63
Description Elimination

gi|12053249|emb|CAB66806.1| hypothetical protein Homo sapiens
gi|46255828|gb|AAH68998.1| COMMD4 protein Homo sapiens
gi|42632621|gb|AAS22242.1| COMMD4 Homo sapiens
gi|21361661|ref|NP_060298.2| COMM domain containing 4 Homo sapiens
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 Homo sapiens

64
Description Elimination

gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase Homo sapiens
gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase Homo sapiens
gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi|1585500|prf||2201313A UDP galactose 4'-epimerase

65
Description Elimination

gi|4261710|gb|AAD14010.1| chlordecone reductase Homo sapiens
gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I validated human
gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog human, liver, Peptide, 323 aa
gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 Homo sapiens
gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 Homo sapiens

66
Summary

Protein sequence databases should be interpreted
with as much care as mass spectra
Protein sequences come from genes
Use controlled vocabularies
Understand the structure of ontologies
Take advantage of computational predictions
Look for sequence variants
Mapping peptides to proteins is not as simple as it seems
Be careful with omnibus databases
