Protein Sequence Databases for Proteomics - PowerPoint PPT Presentation

Provided by: nathanjoh
Transcript and Presenter's Notes

1
Protein Sequence Databases for Proteomics
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Protein Sequence Databases
  • Link between mass spectra and proteins
  • A protein's amino-acid sequence provides a basis
    for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
  • We must interpret database information as
    carefully as mass spectra.
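
As a rough illustration of how a database sequence leads to predicted peptide ion masses, the sketch below digests a sequence in silico and sums residue masses. It assumes trypsin as the enzyme (the slide does not name one), uses the simplified rule "cleave after K or R unless the next residue is P", and ignores missed cleavages and modifications. The function names are illustrative only.

```python
# Monoisotopic residue masses in daltons (standard reference values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added once per peptide for the free termini


def tryptic_peptides(sequence):
    """Cleave after K or R unless followed by P (assumed, simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    # N-terminal fragment of P13746 as shown in the VarSplic output slide.
    fragment = "MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF"
    for pep in tryptic_peptides(fragment):
        print(pep, round(peptide_mass(pep), 4))
```

A single residue difference in the database entry changes the mass of every peptide that contains it, which is one reason the choice of database matters as much as the spectra.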

3
More than sequence
  • Protein sequence databases provide much more than
    sequence
  • Names
  • Descriptions
  • Facts
  • Predictions
  • Links to other information sources
  • Protein databases provide a link to the current
    state of our understanding about a protein.

4
Much more than sequence
  • Names
  • Accession, Name, Description
  • Biological Source
  • Organism, Source, Taxonomy
  • Literature
  • Function
  • Biological process, molecular function, cellular
    component
  • Known and predicted
  • Features
  • Polymorphism, Isoforms, PTMs, Domains
  • Derived Data
  • Molecular weight, pI

5
Database types
6
SwissProt
  • From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
  • 260,000 protein sequence entries
  • 11,000 species represented
  • 16,000 Human proteins
  • Highly curated
  • Minimal redundancy
  • Part of UniProt Consortium

7
TrEMBL
  • Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
  • Computer annotated
  • Only sequences absent from SwissProt
  • 3.9M protein sequence entries
  • 130,000 species
  • 53,000 Human proteins
  • Part of UniProt Consortium

8
UniProt
  • Universal Protein Resource
  • Combination of
  • Swiss-Prot
  • TrEMBL
  • PIR (Georgetown Medical Center)
  • Knowledgebase is highly curated
  • Similar sequence clusters are available
  • 50%, 90%, 100% sequence similarity

9
RefSeq
  • Reference Sequence
  • From NCBI (National Center for Biotechnology
    Information), NLM, NIH
  • Integrated genomic, transcript, and protein
    sequences.
  • Varying levels of curation
  • Reviewed, Validated, ..., Predicted, ...
  • 3.2M protein sequence entries
  • 89,000 reviewed
  • 33,888 Human proteins

10
RefSeq
  • Particular focus on major research organisms
  • Tightly integrated with genome projects.
  • Curated entries: NP accessions
  • Predicted entries: XP accessions

11
IPI
  • International Protein Index
  • From EBI
  • For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases
  • 68,000 (from 228,000) human protein sequence
    entries
  • Human, mouse, rat, zebrafish, Arabidopsis

12
NCBI's nr
  • non-redundant
  • Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • SwissProt, TrEMBL, PIR
  • Others
  • Similar sequences suppressed
  • 100% sequence similarity
  • 4.7M protein sequence entries

13
MSDB
  • From the Imperial College (London)
  • Combines
  • PIR, TrEMBL, GenBank, SwissProt
  • Distributed with Mascot
  • so well integrated with Mascot
  • 3.2M protein sequence entries
  • Similar sequences suppressed
  • 100% sequence similarity

14
Others
  • HPRD
  • Manually curated integration of literature
  • PDB
  • Focus on protein structure
  • dbEST
  • Part of GenBank - EST sequences
  • Genome Sequences

15
Human Sequences
  • Number of Human Genes is believed to be between
    20,000 and 25,000

16
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
17
Genome Browsers
  • Link genomic, transcript, and protein sequence in
    a graphical manner
  • Genes, ESTs, SNPs, cross-species, etc.
  • UC Santa Cruz
  • http://genome.ucsc.edu
  • Ensembl
  • http://www.ensembl.org
  • NCBI Map View
  • http://www.ncbi.nlm.nih.gov/mapview

18
UCSC Genome Browser
  • Shows many sources of protein sequence evidence
    in a unified display
  • Can use EST accession as a location!

19
Accessions
  • Permanent labels
  • Short, machine readable
  • Enable precise communication
  • Typos render them unusable!
  • Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700
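
Because each database uses its own accession format, a script can often guess the source from the shape of the string, and a typo usually makes the string match nothing. The sketch below is only an illustration: the UniProt pattern follows the accession format published by UniProt, the Ensembl pattern covers human gene identifiers only, and PIR is left out because no pattern for it is assumed here. The function name is hypothetical.

```python
import re

# Patterns for three of the accession styles listed on the slide.
PATTERNS = {
    "UniProt/Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl human gene": re.compile(r"ENSG[0-9]{11}"),
    "Gene Ontology term": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the first database whose pattern matches the whole string."""
    for database, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return database
    return None  # unknown format, or a typo


if __name__ == "__main__":
    for accession in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
        print(accession, "->", classify_accession(accession))
```

A format check like this catches truncated or mistyped accessions, but it cannot tell whether a well-formed accession actually exists in the database.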

20
Names / IDs
  • Compact mnemonic labels
  • Not guaranteed permanent
  • Require careful curation
  • Conceptual objects
  • ALBU_HUMAN
  • Serum Albumin
  • RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
  • CP3A7_HUMAN
  • Cytochrome P450 3A7

21
Description / Name
  • Free text description
  • Human readable
  • Space limited
  • Hard for computers to interpret!
  • No standard nomenclature or format
  • Often abused.
  • COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related
    protein, mitochondrial Precursor

22
FASTA Format
23
FASTA Format
  • >
  • Accession number
  • No uniform format
  • Multiple accessions separated by |
  • One line of description
  • Usually pretty cryptic
  • Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
  • Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
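
The layout described above can be read with a few lines of code. This is a minimal sketch, assuming that the header starts with ">", that the identifier runs up to the first space, and that multiple accessions inside it are separated by "|". Real headers vary between databases, so the splitting rule is an assumption and not a standard. The sequence in the example is a placeholder, not the real protein sequence.

```python
def parse_fasta(lines):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # Identifier is everything before the first space; the rest is free text.
    identifier, _, description = header.partition(" ")
    accessions = [part for part in identifier.split("|") if part]
    return accessions, description, "".join(chunks)


if __name__ == "__main__":
    example = [
        ">gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4",
        "ACDEFGHIKL",  # placeholder residues for illustration only
        "MNPQRSTVWY",
    ]
    for accessions, description, sequence in parse_fasta(example):
        print(accessions, description, len(sequence))
```

Note that the organism is not recoverable from this header at all, which is the point the slide makes about missing or non-uniform organism information.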

24
Organism / Species / Taxonomy
  • The protein's organism
  • or the source of the biological sample
  • The most reliable sequence annotation available
  • Useful only to the extent that it is correct
  • NCBI's taxonomy is widely used
  • Provides a standard of sorts: hierarchical
  • Other databases don't necessarily keep up
  • Organism specific sequence databases starting to
    become available.

25
Organism / Species / Taxonomy
  • Buffalo rat
  • Gunn rats
  • Norway rat
  • Rattus PC12 clone IS
  • Rattus norvegicus
  • Rattus norvegicus8
  • Rattus norwegicus
  • Rattus rattiscus
  • Rattus sp.
  • Rattus sp. strain Wistar
  • Sprague-Dawley rat
  • Wistar rats
  • brown rat
  • laboratory rat
  • rat
  • rats
  • zitter rats
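
Free-text organism names like those above have to be mapped to a single identifier before sequences can be filtered by species. The sketch below assumes that every name on the slide is meant to denote Rattus norvegicus (NCBI Taxonomy ID 10116). The lookup table and function name are illustrative and are not part of any database API.

```python
RAT_TAXID = 10116  # NCBI Taxonomy ID for Rattus norvegicus

# Names copied from the slide; assumed here to refer to the same taxon.
RAT_NAMES = [
    "Buffalo rat", "Gunn rats", "Norway rat", "Rattus PC12 clone IS",
    "Rattus norvegicus", "Rattus norvegicus8", "Rattus norwegicus",
    "Rattus rattiscus", "Rattus sp.", "Rattus sp. strain Wistar",
    "Sprague-Dawley rat", "Wistar rats", "brown rat", "laboratory rat",
    "rat", "rats", "zitter rats",
]
SYNONYMS = {name.lower(): RAT_TAXID for name in RAT_NAMES}


def normalize_organism(name):
    """Map a free-text organism name to a taxonomy ID, or None if unknown."""
    key = " ".join(name.lower().split())  # fold case, collapse whitespace
    return SYNONYMS.get(key)


if __name__ == "__main__":
    for name in ["Norway rat", "  rattus   NORWEGICUS ", "Mus musculus"]:
        print(repr(name), "->", normalize_organism(name))
```

An exact lookup is deliberately conservative: a name that is not in the table returns None and can be reviewed by hand instead of being silently assigned to the wrong species.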

26
Controlled Vocabulary
  • Middle ground between computers and people
  • Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
  • Vocabulary / Ontology must be established
  • Human curation
  • Link between concept and object
  • Manually curated
  • Automatic / Predicted

27
Controlled Vocabulary
28
Controlled Vocabulary
29
Controlled Vocabulary
30
Controlled Vocabulary
31
Controlled Vocabulary
32
Ontology Structure
  • NCBI Taxonomy
  • Tree
  • Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
  • Directed, Acyclic Graph (DAG)
  • Unstructured labels
  • Overlapping?

33
Ontology Structure
34
Protein Families
  • Similar sequence implies similar function
  • Similar structure implies similar function
  • Common domains imply similar function
  • Bootstrap up from small sets of proteins with
    well understood characteristics
  • Usually a hybrid manual / automatic approach

35
Protein Families
36
Protein Families
37
Protein Families
  • PROSITE, Pfam, InterPro, PRINTS
  • Swiss-Prot keywords
  • Differences
  • Motif style, ontology structure, degree of manual
    curation
  • Similarities
  • Primarily sequence based, cross species

38
Gene Ontology
  • Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
  • Describes the vocabulary only!
  • Protein families provide GO association
  • Not necessarily any appropriate GO category.
  • Not necessarily in all three hierarchies.
  • Sometimes general categories are used because
    none of the specific categories are correct.

39
Protein Family / Gene Ontology
40
Sequence Variants
  • Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
  • Sequence databases typically do not capture all
    versions of a protein's sequence

41
Sequence Variants
  • Swiss-Prot: "a curated protein sequence database
    which strives to provide a high level of
    annotation (such as the description of the
    function of a protein, its domains structure,
    post-translational modifications, variants,
    etc.), a minimal level of redundancy and high
    level of integration with other databases"
  • - Swiss-Prot web site front page

42
Sequence Variants
  • b) Minimal redundancy
  • "Many sequence databases contain, for a given
    protein sequence, separate entries which
    correspond to different literature reports. In
    Swiss-Prot we try as much as possible to merge
    all these data so as to minimize the redundancy
    of the database. If conflicts exist between
    various sequencing reports, they are indicated in
    the feature table of the corresponding entry."
  • - Swiss-Prot User Manual, Section 1.1

43
Sequence Variants
  • "IPI provides a top level guide to the main
    databases that describe the proteomes of higher
    eukaryotic organisms. IPI:
  • 1. effectively maintains a database of cross
    references between the primary data sources
  • 2. provides minimally redundant yet maximally
    complete sets of proteins for featured species
    (one sequence per transcript)
  • 3. maintains stable identifiers (with
    incremental versioning) to allow the tracking of
    sequences in IPI between IPI releases."
  • - IPI web site front page

44
Sequence Variants
  • Swiss-Prot variants, isoforms and conflicts are
    retained as features
  • Script varsplic.pl can enumerate all sequence
    variants
  • Command-line options for full enumeration
  • -which full -varsplic -variant -conflict

45
Swiss-Prot Variant Annotations
46
Swiss-Prot Variant Annotations
47
Swiss-Prot Variant Annotations
Feature viewer
Variants
48
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF


49
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ


50
Omnibus Database Redundancy Elimination
  • Source databases often contain the same sequences
    with different descriptions
  • Omnibus databases keep one copy of the sequence,
    and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source
    preference
  • Good definitions can be lost, including taxonomy

51
Description Elimination
  • gi|12053249|emb|CAB66806.1| hypothetical protein
    [Homo sapiens]
  • gi|46255828|gb|AAH68998.1| COMMD4 protein
    [Homo sapiens]
  • gi|42632621|gb|AAS22242.1| COMMD4
    [Homo sapiens]
  • gi|21361661|ref|NP_060298.2| COMM domain
    containing 4 [Homo sapiens]
  • gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain
    containing protein 4
  • gi|49065330|emb|CAG38483.1| COMMD4
    [Homo sapiens]

52
Description Elimination
  • gi|2947219|gb|AAC39645.1| UDP-galactose 4'
    epimerase [Homo sapiens]
  • gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase
    [Homo sapiens]
  • gi|14277913|pdb|1HZJ|B Chain B, Human
    Udp-Galactose 4-Epimerase Accommodation Of
    Udp-N-Acetylglucosamine Within The Active Site
  • gi|14277912|pdb|1HZJ|A Chain A, Human
    Udp-Galactose 4-Epimerase Accommodation Of
    Udp-N-Acetylglucosamine Within The Active Site
  • gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose
    4-epimerase (Galactowaldenase) (UDP-galactose
    4-epimerase)
  • gi|1585500|prf||2201313A UDP galactose
    4'-epimerase

53
Description Elimination
  • gi|4261710|gb|AAD14010.1| chlordecone reductase
    [Homo sapiens]
  • gi|2117443|pir||A57407 chlordecone reductase (EC
    1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase
    (EC 1.1.1.-) I validated human
  • gi|1839264|gb|AAB47003.1| HAKRa
    product/3 alpha-hydroxysteroid dehydrogenase
    homolog human, liver, Peptide, 323 aa
  • gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto
    reductase family 1 member C4 (Chlordecone
    reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase)
    (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4)
    (HAKRA)
  • gi|7328948|dbj|BAA92885.1| dihydrodiol
    dehydrogenase 4 [Homo sapiens]
  • gi|7328971|dbj|BAA92893.1| dihydrodiol
    dehydrogenase 4 [Homo sapiens]

54
Summary
  • Protein sequence databases should be interpreted
    with as much care as mass spectra
  • Protein sequences come from genes
  • Use controlled vocabularies
  • Understand the structure of ontologies
  • Take advantage of computational predictions
  • Look for sequence variants
  • Be careful with omnibus databases