A. Auchincloss - PowerPoint PPT Presentation

1
UniProtKB/Swiss-Prot and ExPASy: Protein sequence
databases and proteomics tools developed at the
Swiss Institute of Bioinformatics
Andrea Auchincloss (andrea.auchincloss_at_isb-sib.ch)
Tunis, March 19, 2007
2
Outline
  • The Swiss Institute of Bioinformatics
  • What is UniProt?
  • UniProt Knowledgebase: Swiss-Prot and TrEMBL
  • HPI, post-translational modifications, HAMAP
  • UniRef and UniParc
  • Databases for protein function and domains:
    PROSITE, InterPro etc.
  • ExPASy: other tools

3
Swiss Institute of Bioinformatics (SIB)
  • Non-profit foundation created in 1998
  • Groups in Geneva, Lausanne and Basel
  • Federation of several groups (some of which
    existed and collaborated long before the
    foundation of the institute), about 170
    researchers in 2006.

4
www.isb-sib.ch
5
SIB missions
  • Development of databases and software tools
  • High-quality bioinformatics research program
  • Courses and seminars for the training of
    bioinformatics research scientists. This includes
    a masters degree in proteomics and
    bioinformatics, several weekly courses and a
    doctoral school
  • Services to the Swiss Life Sciences community
    (EMBnet node).

6
Swiss Institute of Bioinformatics: 20 research
and service groups
7
Proteins are organic compounds made of amino
acids arranged in a linear chain and joined by
peptide bonds (Wikipedia)
8
Proteins are composed of 20 "standard" amino
acids, symbolised by a LETTER.
Different views of a protein
9
Proteins can also work together to perform a
particular function, and they often associate to
form complexes.
10
Proteins are essential parts of all living
organisms and participate in every process within
cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response,
   cell adhesion, cell cycle, toxins.
Proteins are a necessary component in
our diet, since animals cannot synthesize all the
amino acids and must obtain essential amino acids
from food.
11
Protein/Gene number
Organism         Number
Bacteria         182-8,591
S. cerevisiae    6,127
C. elegans       17,947
Drosophila       13,849
A. thaliana      25,674
Human            21,000
12
The universe in which protein databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20
13
179,000,021,000
1st estimate: 30 million species (1.5 million
named)
2nd estimate:
  20 million bacteria/archaea x 4,000 genes
  5 million protists x 6,000 genes
  3 million insects x 14,000 genes
  1 million fungi x 6,000 genes
  0.6 million plants x 20,000 genes
  0.2 million molluscs, worms, arachnids, etc.
    x 20,000 genes
  0.2 million vertebrates x 21,000 genes
The calculation: 2x10^7 x 4,000 + 5x10^6 x 6,000
+ 3x10^6 x 14,000 + 10^6 x 6,000 + 6x10^5 x 20,000
+ 2x10^5 x 20,000 + 2x10^5 x 21,000 + 21,000 (you!)
Caveat: this is an estimate of the number of
potential sequence entries, but not that of the
number of distinct protein entities in the
biosphere.
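
The arithmetic on this slide can be re-run as a quick check. The sketch below is only an illustration in Python (the variable names are made up); it multiplies each group's species count by its genes-per-species figure as transcribed here and adds the 21,000 human genes. With these inputs the total comes out slightly below the headline number on the slide, so one of the transcribed figures may differ from the original.

```python
# Species counts and genes per species, as listed on the slide.
GROUPS = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 5_000_000, 6_000),
    ("insects", 3_000_000, 14_000),
    ("fungi", 1_000_000, 6_000),
    ("plants", 600_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 200_000, 20_000),
    ("vertebrates", 200_000, 21_000),
]
HUMAN_GENES = 21_000  # the "(you!)" term

total_species = sum(species for _, species, _ in GROUPS)
total_sequences = sum(species * genes for _, species, genes in GROUPS) + HUMAN_GENES

print(f"species:   {total_species:,}")     # 30,000,000
print(f"sequences: {total_sequences:,}")   # 178,200,021,000
```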
AMB, SP20
14
What sequencing is underway right now?
Many eukaryotic and bacterial genomes (varying sizes)
Metagenomics (environmental samples): 6
million sequences submitted/published in
December 2006, 17 million sequences being
generated at the Venter Institute, 6 million
proteins being submitted from the GOS (Global
Ocean Sampling) trip
15
Protein sequences: what is sequenced?
Currently about 3.5 to 4.0 million known protein
sequences.
More than 99% of these are derived by
translation of nucleotide sequences.
Less than 1%: direct protein sequencing (Edman, MS/MS).
-> It is important that users know where the
protein sequence comes from (sequence and gene
prediction quality)!
16
  • Level of DNA/RNA sequence quality
  • DNA/RNA sequencing quality (genome or WGS, cDNA
    or EST)
  • Gene prediction quality: programs used, is there
    manual intervention afterwards?
  • For example
  • Authors can specify the nature of the CDS in the
    nucleotide databases by using qualifiers
  • "/evidence=experimental" or
    "/evidence=not_experimental".
  • Very rarely done

17
The hectic life of a sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Public nucleic acid databases
EMBL, GenBank, DDBJ
if the submitters provide an annotated Coding
Sequence (CDS)
Public protein sequence databases
18
CDS: CoDing Sequence
CDS provided by the submitters
The first Met !
CDS translation provided by EMBL
19
Data not submitted
Complete genome (submitted): only 1,858 CDS
available!
20
Issue for the users: the protein database jungle
22
The hectic life of a sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Scientific publications derived sequences
EMBL, GenBank, DDBJ
CoDing Sequences provided by submitters
TrEMBL GenPept
RefSeq
PRF
PIR
UniProtKB
IPI
Swiss-Prot
UniParc
Manually annotated
EnsEMBL
PDB
CCDS
Also gene prediction
species-specific databases (EcoGene,
TubercuList, TIGR)
23
(No Transcript)
24
Major public protein sequence database sources
Integrated resources (cross-references):
  • UniProtKB: Swiss-Prot + TrEMBL
  • NCBI-nr: Swiss-Prot + GenPept + PIR + PDB +
    PRF + RefSeq
Separated resources:
  • UniProtKB/Swiss-Prot: manually annotated protein
    sequences (11,000 species)
  • UniProtKB/TrEMBL: submitted CDS (EMBL) with
    automated annotation; non-redundant with
    Swiss-Prot (127,000 species)
  • GenPept: submitted CDS (GenBank); redundant with
    UniProtKB (about 130,000 species)
  • PIR: Protein Information Resource; an archive
    since 2003; integrated into UniProtKB
  • PDB: Protein Databank; 3D data and associated
    sequences
  • PRF: journal scan of published peptide sequences
  • RefSeq: Reference Sequence for DNA, RNA, protein;
    includes gene prediction (4,000 species)
25
Other protein sequence databases
  • CCDS (EBI, NCBI, Wellcome Trust Sanger, UC
    Santa Cruz; 2 species): consensus human and mouse
    sequences between 4 institutions. Combines
    different approaches (ab initio, by similarity)
    and takes advantage of the expertise acquired by
    different institutes, including manual
    annotation.
  • EnsEMBL (UniProtKB, RefSeq, gene prediction;
    31 species): aligns some eukaryotic genomic
    sequences with all the sequences found in EMBL,
    UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL
    (-> known genes). Also does some gene prediction
    (-> novel genes).
  • IPI (UniProtKB, RefSeq, EnsEMBL, plus H-InvDB,
    TAIR, VEGA; 7 species): provides a guide to the
    main databases that describe the human, mouse,
    rat, zebrafish, Arabidopsis, chicken, and cow
    proteomes.
26
The UniProt consortium
27
The UniProt Consortium
UniProt (Universal Protein Resource): the world's
most comprehensive catalogue of protein
information. www.uniprot.org, Wu et al. Nucleic
Acids Res. 34:D187-D191 (2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and
Environmental Sequences)
28
The Universal Protein resource components
UniProtKB Release 9.7 consists of:
UniProt Knowledgebase (UniProtKB), produced by SIB
and EBI
  • UniProtKB/Swiss-Prot: manually annotated protein
    sequences; 260,000 entries; 10,000 species
  • UniProtKB/TrEMBL: computer-annotated protein
    sequences; 3,600,000 entries; 100,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
  • Allow comprehensive BLAST similarity searches
    by providing sets of representative sequences
  • One UniRef100 entry: all identical sequences
    (including fragments).
  • One UniRef90 entry: sequences that have at least
    90% identity.
  • One UniRef50 entry: sequences that have at least
    50% identity.
  • Independent of species.
UniProt Archive (UniParc), produced by EBI:
8,000,000 entries
  • Archived raw protein sequences found in
    publicly accessible databases: Swiss-Prot,
    TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq,
    FlyBase, WormBase, Patent Offices.
  • Use with extreme caution: contains pseudogenes,
    incorrect CDS predictions, etc.
31
The Universal Protein resource components
UniProt Knowledgebase (UniProtKB), produced by SIB
and EBI
  • UniProtKB/Swiss-Prot: manually annotated protein
    sequences; 260,000 entries; 11,000 species
  • UniProtKB/TrEMBL: computer-annotated protein
    sequences; 3,900,000 entries; 127,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
  • Allow comprehensive BLAST similarity searches
    by providing sets of representative sequences
  • One UniRef100 entry: all identical sequences
    (including fragments).
  • One UniRef90 entry: sequences that have at least
    90% identity.
  • One UniRef50 entry: sequences that have at least
    50% identity.
  • Independent of species.
UniProt Archive (UniParc), produced by EBI:
8,800,000 entries
  • Archived raw protein sequences found in
    publicly accessible databases: Swiss-Prot,
    TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq,
    FlyBase, WormBase, Patent Offices.
  • Use with extreme caution: contains pseudogenes,
    incorrect CDS predictions, etc.
32
UniProt web sites
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful
search engine.
33
http://beta.uniprot.org/
Test it! Logon: guest, Password: amazing
34
The UniProt groups from SIB, EBI and PIR
(Antibes, September 2004)
In Geneva (SIB):
  2 Group Leaders
  44 Annotators
  4 Prosite annotators
  22 Programmers and Researchers
  5 Administrators, science communicators
  3 System Administrators
  4 Students
  1 GISAID
  ------------------
  85 people
At PIR:
  1 Group Leader
  13 Protein Science Team
  12 Informatics Team
  ------------------
  26 people
At EBI (Swiss-Prot, EMBL, TrEMBL): 75 people
(29 Annotators)
35
  • UniProtKB has biweekly releases available from
    about 100 servers, the main sources being ExPASy
    and www.uniprot.org

36
UniProtKB: From EMBL (DNA) to TrEMBL (protein)
37
Gene/protein name
Taxonomy
Reference
CDS
TrEMBL
Automated extract of the protein sequence (CDS),
gene name, taxonomy and references. Automated
annotation (KWs and protein family).
EMBL
38
! TrEMBL does not translate DNA sequences, nor
does it use gene prediction programs: it only takes
the existing CDS proposed by the submitting
authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived protein
sequences can be experimentally proven or derived
from gene prediction programs (this is not
obvious from the TrEMBL entry).
TrEMBL does not validate any sequences.
39
!!!! The quality of UniProtKB/TrEMBL data is
directly dependent on the information provided by
the submitter of the original nucleotide entry.
40
UniProtKB: From TrEMBL to Swiss-Prot
41
CDS
Manual annotation of the sequence and associated
biological information (derived from literature,
external experts, databases)
Automated extraction of the protein sequence
(CDS), gene name and references. Automated
annotation.
TrEMBL
Annotation of sequence differences (conflicts,
variants, splicing)
Average of 6 independent sequence reports for
each human protein
EMBL
42
Distinguishing Swiss-Prot and TrEMBL
  • A TrEMBL entry is a computer-annotated record
    derived from a coding sequence (CDS) in the
    nucleotide sequence databases, not in Swiss-Prot,
    after some redundancy removal and automated
    annotation.
  • A Swiss-Prot entry is a manually annotated record
    for a given protein.

43
UniProtKB: From TrEMBL to Swiss-Prot
Step 1: Sequence check
44
UniProtKB/Swiss-Prot
Non-redundant: 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS
and amino acid) derived from the same gene ->
decreases redundancy and improves sequence
reliability
ii) Annotation of the sequence
differences (including conflicts,
polymorphisms, splice variants etc.) ->
annotation of protein diversity
45
Redundancy
UniProtKB/Swiss-Prot 11,000 species
UniProtKB/TrEMBL 127,000 species
  • 260,000 3,800,000 ? 3,600,000

Redundancy in TrEMBL Redundancy between TrEMBL
and Swiss-Prot
In the future redundancy is going to decrease
"new" genome sequencing ? "new" proteins
46
- 13 sequences (complete or partial)
- derived from mRNA (n = 6) or genomic
DNA (n = 7)
47
All alternatively spliced sequences are available
for BLAST searches and protein identification
tools, and are downloadable.
Human: 2/3 of the human genes are alternatively
spliced
48
- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR
49
Multiple alignment of the available clpB sequences
50
(No Transcript)
51
Within Swiss-Prot?
  • A snapshot of the situation (December 2006)
  • 28,200 entries with 82,000 sequence conflicts
  • 2,600 entries with corrected frameshifts
  • 15,100 entries with corrected initiation sites
  • 4,300 entries with other sequence problems.
  • At least 43,000 entries (19% of Swiss-Prot)
    required a minimal amount of annotation effort to
    obtain the correct sequence.

52
Quality of protein information from genome
projects
  • Proteins originating from different genome
    projects
  • Drosophila: what a curated (thanks to FlyBase)
    genome effort should look like; only 1.8% of the
    gene models conflict with what we have in
    UniProtKB/Swiss-Prot
  • Arabidopsis: a genome where lots of work was done
    to annotate it when it was sequenced, but where
    nothing has been done since (at least in the
    public view); 19.5% of the gene models are
    erroneous
  • Tetraodon nigroviridis: a quick and dirty
    automatic run through a genome with no manual
    intervention; >90% of the gene models produce
    incorrect proteins.
  • Bacteria and Archaea have almost no splicing, so
    prediction is easier; however, errors are still
    made

53
  • Producing a clean set of sequences is not a
    trivial task
  • It is not getting easier as more and more types
    of sequence data are submitted
  • It is important to pursue our efforts in making
    sure we provide to our users the most correct set
    of sequences for a given organism.

54
New Protein existence evidence tag
  • As most protein sequences are derived from
    translation of nucleotide sequence and are only
    predictions, the new PE line indicates whether
    there is any evidence that proves the existence
    of a protein
  • The Protein existence evidence will have 5
    different qualifiers
  • Evidence at protein level
  • Evidence at transcript level
  • Inferred from homology
  • Predicted
  • Unassigned (used mostly in TrEMBL)
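
A tool reading a local copy of the flat file could pick up this tag with a few lines of code. The Python sketch below is an illustration only: it assumes the PE line has the layout `PE   <level>: <text>;` and that the five qualifiers are numbered 1 to 5 in the order listed above. Both are assumptions to check against the current user manual, and the names `PE_LEVELS` and `parse_pe_line` are made up for this example.

```python
import re
from typing import Optional, Tuple

# Assumed numbering: qualifiers 1-5 in the order given on the slide.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Unassigned",
}

# Assumed line layout: "PE   1: Evidence at protein level;"
PE_LINE = re.compile(r"^PE\s+(\d):\s*(.+?);?\s*$")


def parse_pe_line(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, description) for a PE line, or None for any other line."""
    match = PE_LINE.match(line)
    if match is None:
        return None
    level = int(match.group(1))
    # Fall back to the text found in the file if the level is not in the table.
    return level, PE_LEVELS.get(level, match.group(2))


def has_protein_evidence(line: str) -> bool:
    """True only when the line reports evidence at protein level."""
    parsed = parse_pe_line(line)
    return parsed is not None and parsed[0] == 1
```

Filtering on `has_protein_evidence` would keep only entries whose existence is supported at protein level, which is one way to separate experimentally supported sequences from predictions.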

55
Righting the wrongs
Sequences are rarely deposited in a mature state:
as with all scientific research, DNA and protein
annotation is a continual process of learning,
revision and corrections.
Sequencing error rates: 1 base in 10,000
Making people aware of errors is good and great;
making people aware that they're responsible also
for correcting errors is even greater.
C. Hardley, EMBO reports, 4(9), 2003.
56
UniProtKB: From TrEMBL to Swiss-Prot
Step 2: Annotation (literature, controlled
vocabulary)
57
Annotation
  • The focal point of the efforts to maintain and
    develop UniProtKB/Swiss-Prot
  • It is becoming more and more important as it
    provides
  • a summary of what is known about a protein
  • creates template for automatic annotation for the
    many organisms whose genome sequence is/will be
    available but whose proteins will not be
    characterized
  • provides well annotated (corpus) entries to train
    literature mining tools (text mining).

58
  • Source of data
  • publications (> 1,700 journals cited)
  • also external scientific expertise and other
    databases

59
Comments: structured free text, 27 defined
topics
Manually annotated: information from papers,
specialized databases, computer prediction,
external experts, brainstorming
Distinction between data obtained experimentally
and computerized inferences
60
UniProtKB: From TrEMBL to Swiss-Prot
Step 3: Sequence analysis (bioinformatics tools)
61
The annotation platform
  • Annotators could not work without the help of
    our software developers

62
Anabelle: much more than a domain annotation
platform
63
(No Transcript)
64
We manually check the results !
65
What else is in a UniProtKB/Swiss-Prot entry?
66
Cross-references: a central hub
Gasteiger E. et al, Curr. Issues Mol. Biol.
3:47-55 (2001)
www.expasy.org/cgi-bin/lists?dbxref.txt
  • Swiss-Prot was the first database with
    X-references
  • Explicitly X-referenced to 85 databases
  • DNA (EMBL/GenBank/DDBJ),
  • 3D-structure (PDB)
  • Family and domain (InterPro, HAMAP, PROSITE,
    Pfam, etc.)
  • genomic (OMIM, MGI, FlyBase, SGD, SubtiList,
    etc.)
  • 2D-gel (e.g. SWISS-2DPAGE)
  • specialized db (e.g.GlycoSuiteDB, PhosSite,
    MEROPS)
  • literature (PubMed)
  • Each UniProtKB/Swiss-Prot entry can be seen as a
    central hub for the data available about the
    protein it describes

67
UniProtKB/Swiss-Prot explicit links
  • Organism-specific databases: AGD, CYGD,
    DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase,
    GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC,
    HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB,
    MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD,
    StyGene, SubtiList, TAIR, TubercuList, WormBase,
    WormPep, ZFIN
  • Family and domain databases: Gene3D, HAMAP,
    InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom,
    PROSITE, SMART, TIGRFAMs
  • Enzyme and pathway databases: BioCyc, Reactome
  • Sequence databases: EMBL, PIR, UniGene
  • 2D-gel databases: ANU-2DPAGE,
    Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE,
    Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE,
    HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE,
    Rat-heart-2DPAGE, REPRODUCTION-2DPAGE,
    Siena-2DPAGE, SWISS-2DPAGE
  • Miscellaneous: ArrayExpress, dbSNP, DIP,
    DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
  • Protein family/group databases: GermOnline,
    MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC
  • Genome annotation databases: Ensembl,
    GenomeReviews, KEGG, TIGR
  • 3D structure databases: HSSP, PDB, SMR
  • PTM databases: GlycoSuiteDB, PhosSite
68
Implicit cross-references on new web server and
ExPASy
  • Implicit X-references to 26 additional db added
    by the ExPASy server on the www (i.e. GeneCards,
    ModBase, etc.)
  • These X-refs are not present as hard-coded DR
    lines in the Swiss-Prot entry as it can be
    downloaded by ftp, but are added on the fly when
    someone views an entry on ExPASy. This can be
    done because enough information is present in the
    UniProtKB entry to access the related information
    in another db.
  • Example: all Swiss-Prot/TrEMBL entries are linked
    to the BLOCKS domain db via the Swiss-Prot/TrEMBL
    accession number
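
The idea of an implicit cross-reference, a link computed at display time from information already in the entry, can be sketched in a few lines of Python. Everything named here is hypothetical: the database labels and the `example.org` URL templates are placeholders, not the addresses actually used by ExPASy.

```python
from typing import Dict

# Hypothetical URL templates keyed by target database; {ac} is replaced by
# the accession number. Real servers would use their own addresses.
LINK_TEMPLATES: Dict[str, str] = {
    "ExampleDomainDB": "https://example.org/domains?uniprot_ac={ac}",
    "ExampleModelDB": "https://example.org/models/{ac}",
}


def implicit_links(accession: str) -> Dict[str, str]:
    """Build links on the fly; nothing is stored in the entry itself."""
    return {db: url.format(ac=accession) for db, url in LINK_TEMPLATES.items()}


for db, url in implicit_links("P99999").items():
    print(db, url)
```

Because the links are derived from the accession number alone, adding a new target database only requires a new template, not an update of every entry.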

69
Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further
facilitate information retrieval via controlled
vocabularies
70
In a UniProtKB/Swiss-Prot entry, you can expect
to find
  • All the names of a given protein (and of its
    gene)
  • Its biological origin with links to the taxonomic
    databases
  • A selection of references
  • A summary of what is known about the protein:
    function, alternative products, PTM, tissue
    expression, disease, 3D-structures, etc.
  • Numerous cross-references
  • Selected keywords
  • A description of important sequence features:
    domains, PTMs, variations, etc.
  • A (often corrected) protein sequence and the
    description of various isoforms/variants.

71
Monitoring entry history: the UniProtKB
Sequence/Annotation Version archive
72
(No Transcript)
73
and many useful links
74
And on the new website
other tools are not yet available
75
UniProt Knowledgebase
  • Swiss-Prot Manually annotated section
  • TrEMBL Automatically annotated section

76
Distinguishing Swiss-Prot and TrEMBL
77
(No Transcript)
78
Accession number: to be used when you cite a
UniProt entry anywhere (never cite the entry
name (ID) alone)
79
Non-Redundant Complete Proteome Sets
  • Text search: UniProtKB keyword "Complete
    proteome", combined with an organism name
  • Or download precomputed sets (bacteria, archaea,
    some eukaryotes):
    ftp://ftp.expasy.org/databases/complete_proteomes/entries
  • Or EBI Integr8: http://www.ebi.ac.uk/integr8/

80
Swiss-Prot annotation priorities
  • The main annotation programs
  • HAMAP (High quality Automated and Manual
    Annotation of microbial Proteomes bacteria,
    archaea, plastids)
  • HPI (Human Proteomics Initiative)
  • PPAP (Plant Proteome Annotation Project)
  • FPAP (Fungal Proteome Annotation Project)
  • Viral proteins
  • Tox-Prot (Toxin Annotation Project)
  • ENZYMES (proteins with EC numbers)
  • PTMs
  • 3D-structure
  • Protein-protein interactions
  • Quality assurance, includes controlled
    vocabularies

81
Model organisms
  • Organisms for which we want to have a more
    in-depth coverage
  • Completeness, links with specialized databases,
    specific documents
  • Examples E.coli, B.subtilis, human, mouse,
    fruitfly, C.elegans, yeast, S.pombe, A.thaliana.

82
Human Proteomics Initiative (HPI)
83
From genome to proteome
genome
21,000 human genes
Considerable increase in complexity
84
In the case of human genes, the Swiss-Prot/TrEMBL
redundancy is still very high 15,803 53,100
? about 20,000
Human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human
gene products, but has not identified
significant numbers of unpredicted proteins
85
(No Transcript)
86
Post-translational modifications (PTMs)
87
PTM definition
a post-translational modification or PTM is
a modification of a polypeptide chain involving
the making or the breaking of covalent bond(s)
that occurs during (co-translational class) or
after translation.
88
PTMs influence or even define protein function
  • phosphorylation and possibly GlcNAcylation and
    S-nitrosylation are a means of transducing
    extracellular signals to the inside of the cells.
  • methylation has a role in nuclear protein import.
  • lipid addition allows protein to membrane
    association (e.g. GPI-anchor, myristate,
    palmitate).
  • intrachain disulfide bonds and N-glycosylation
    influence protein folding.
  • interchain disulfide bonds bind subunits
    together.
  • other PTMs are directly involved in the protein
    function, as for example the binding of cofactors
    (e.g. pyridoxal phosphate), or the synthesis of a
    cofactor by the modification of amino acids
    present in the protein (e.g. quinones).

89
PTM variety
Each protein can be modified at various
sites, which gives a high number of alternative
peptides.
283 different protein modifications
are annotated in UniProtKB/Swiss-Prot
90
Large scale experiments (LSE) for PTMs!
  • PTM information can now be obtained from results
    of proteomics large scale experiments (LSE)
  • In the past 12 months we have added about 6000
    experimental PTMs using data originating from
    some of these projects.

AMB, SP20
91
  • Proteomic studies have led to the updating of
    2,767 human Swiss-Prot entries, mainly with PTM
    information (UniProt release 10.0, March 2007)

92
Bacteria and Archaea (HAMAP)
93
In 2006, 130 new bacterial and archaeal genomes
(not WGS) were submitted to the DNA
databases.
If on "average" 4,000 proteins/genome:
> 500,000 proteins!
How to cope????
94
  • High quality
  • Automated and
  • Manual
  • Annotation of microbial
  • Proteomes

Lots of microbial genomes, lots of proteins. What
should we do with them in UniProt?
95
http://www.expasy.org/unirule/MF_00319
96
Automatic annotation of proteins belonging to
specified families (1)
  • This program requires the continuous development
    and adaptation of software tools as well as the
    development of a database of annotation rules for
    each family (so far about 1,400).

97
  • Allows us to annotate automatically, yet with a
    very high level of quality, proteins that belong
    to well defined protein families
  • Can be applied to both characterized proteins and
    to some UPFs (Uncharacterized Protein Family)

The families are based on UniProtKB/Swiss-Prot
entries, so we first do all the annotation steps
described earlier!
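
Rule-based family annotation can be pictured with a small sketch. This is not HAMAP's software or rule format; it is a hypothetical Python illustration in which a protein already assigned to a family (for example by a profile match done elsewhere) inherits that family's annotation, provided the rule's taxonomic condition holds. All class names, fields and the identifier `MF_99999` are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FamilyRule:
    family_id: str                       # hypothetical rule identifier
    annotations: Dict[str, str]          # annotation to propagate
    allowed_kingdoms: List[str] = field(default_factory=list)  # empty = any


@dataclass
class Protein:
    accession: str
    kingdom: str
    family_id: Optional[str]             # assigned upstream, None if no match
    annotations: Dict[str, str] = field(default_factory=dict)


def annotate(protein: Protein, rules: Dict[str, FamilyRule]) -> bool:
    """Copy the family annotation onto the protein; return True if a rule applied."""
    rule = rules.get(protein.family_id) if protein.family_id else None
    if rule is None:
        return False
    if rule.allowed_kingdoms and protein.kingdom not in rule.allowed_kingdoms:
        return False
    for topic, text in rule.annotations.items():
        # Never overwrite annotation that is already present.
        protein.annotations.setdefault(topic, text)
    return True


RULES = {
    "MF_99999": FamilyRule(
        family_id="MF_99999",
        annotations={"FUNCTION": "Example function text."},
        allowed_kingdoms=["Bacteria", "Archaea"],
    )
}
```

Proteins that match no rule are left untouched, which reflects the point made above that only members of well-defined families are annotated automatically.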
98
(No Transcript)
99
www.expasy.ch/sprot/hamap/
Using HAMAP, we can currently annotate to
Swiss-Prot quality level between 10% and 50% of a
complete microbial proteome (next step: HAMAP for
Fungi)
100
Updates
  • DNA sequence archives
  • EMBL/GenBank/DDBJ is an archive
  • All submitted data goes into the archive
  • Submitters are responsible for the submitted
    sequences and the accompanying annotation
  • Nobody else can change them (including the
    curators at EMBL/GenBank/DDBJ)
  • Protein sequence databases
  • UniProtKB/Swiss-Prot is NOT an archive
  • Swiss-Prot chooses what goes into the database
    and where to place it
  • Swiss-Prot updates annotation and sequences when
    necessary

101
ZB SYP, 28-NOV-2003 ALB, 16-NOV-2004 MIM,
31-Jan-2006 ZB BER, 13-FEB-2006 LYG,
14-JUN-2006 LYG, 21-SEP-2006 ZB CHH,
05-DEC-2006
102
  • User updates or annotation requests

103
Accessing and searching UniProtKB
  • Direct access (keyword search)
  • New search tool: we'll use it later
  • Sequence Retrieval System (SRS, Europe), will
    disappear
  • Entrez (NCBI, USA): UniProtKB/Swiss-Prot (not
    TrEMBL) is integrated in GenPept, but with a
    changed format, and some information (e.g.
    implicit cross-references) is missing
  • Query tools on ExPASy and UniProt
    (http://www.expasy.org/sprot/,
    http://www.uniprot.org)
  • Indirect access (sequence search)
  • Bioinformatics sequence analysis tools (Blast,
    Fasta, GCG, Emboss, MS Identification tools)

104
Downloading the UniProt Knowledgebase
http://www.expasy.org/sprot/download.html
  • Swiss-Prot and TrEMBL form a complete,
    non-redundant database, the UniProt Knowledgebase
  • Can be downloaded from
    ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
  • In Swiss-Prot format, fasta or xml format
  • Complemented by sequences of alternative splice
    isoforms
  • everything about all proteins! (at least all
    CDS submitted to the public nucleotide sequence
    databases)

105
If you want to develop tools to work with your
local copy of UniProtKB
  • Swissknife: a Perl parser for UniProtKB
  • Constantly updated according to latest format
    changes
  • Advantage: you do not need to know exactly how
    the information is stored in the flat file
  • http://swissknife.sourceforge.net/
  • ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
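
For readers who want a feel for what such a parser does, here is a minimal Python sketch (not Swissknife, which is Perl, and far less complete). It assumes the flat-file conventions of a two-letter line code followed by three spaces, and `//` as entry terminator. The sample entry, its accession numbers and the helper names are invented for illustration, and its ID and SQ lines are abbreviated.

```python
from typing import Dict, Iterable, Iterator, List

Entry = Dict[str, List[str]]

# Fictitious, abbreviated entry used only to exercise the code.
SAMPLE = """\
ID   EXMPL_HUMAN             Reviewed;          10 AA.
AC   P00000; Q00000;
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Group lines by their two-letter code; '//' closes an entry."""
    entry: Entry = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = {}
            continue
        if len(line) < 2:
            continue
        # Sequence data lines start with blanks and end up under the code "  ".
        code, data = line[:2], line[5:].strip()
        entry.setdefault(code, []).append(data)
    if entry:
        yield entry


def entry_name(entry: Entry) -> str:
    """First token of the ID line (the entry name, which may change)."""
    return entry["ID"][0].split()[0]


def accessions(entry: Entry) -> List[str]:
    """All accession numbers; the first one is the primary, citable AC."""
    return [ac.strip() for data in entry.get("AC", [])
            for ac in data.split(";") if ac.strip()]


for e in read_entries(SAMPLE.splitlines()):
    print(entry_name(e), accessions(e)[0])
```

Keying any local index on the primary accession rather than the entry name follows the advice given earlier about citing entries.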

106
Take home message
  • Swiss-Prot is the non redundant, manually
    annotated and highly cross-referenced section of
    the UniProt Knowledgebase
  • Be aware of the differences between
    UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
  • Computer vs. Human
  • Redundant vs. Non-redundant
  • Always cite the Accession number, not the entry
    name
  • The AC is stable
  • The entry name might change

108
UniRef100, 90 and 50 clusters
  • One UniRef100 entry -> all identical sequences
    from UniProtKB and some sections of UniParc
    (including fragments, Swiss-Prot splice
    variants).
  • One UniRef90 entry -> sequences that have at
    least 90% identity.
  • One UniRef50 entry -> sequences that are at least
    50% identical.

109
UniRef100, 90 and 50 clusters
  • One cluster can contain sequences of several
    species, clustering is done independently of the
    organism
  • Each cluster has a representative, reference
    sequence, preferably that of the best-annotated
    Swiss-Prot entry
  • UniRef identifiers are of the form
    UniRef100_P99999, UniRef50_P00414; they are not
    stable, as clusters are recomputed with every
    biweekly release, and cluster representatives
    can change!
  • UniRef is useful for comprehensive BLAST sequence
    searches by providing sets of representative
    sequences.
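
Since a UniRef identifier combines the identity level with the identifier of the representative sequence, it can be split mechanically. The short Python sketch below relies only on the identifier form shown above; the function name is made up, and because identifiers are not stable, the result should be treated as valid for one release only.

```python
import re
from typing import Tuple

UNIREF_ID = re.compile(r"^UniRef(100|90|50)_(\S+)$")


def parse_uniref_id(identifier: str) -> Tuple[int, str]:
    """Split e.g. 'UniRef100_P99999' into (100, 'P99999')."""
    match = UNIREF_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a UniRef identifier: {identifier!r}")
    return int(match.group(1)), match.group(2)


print(parse_uniref_id("UniRef100_P99999"))
print(parse_uniref_id("UniRef50_P00414"))
```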

110
Implicit cross-link UniProtKB to UniRef
new web view
112
UniParc the UniProt Archive
  • 8.8 million sequences
  • Sequences and cross-references (AC numbers)
  • A comprehensive collection of the raw protein
    sequences in public databases (including those
    not submitted to the DNA databases)
  • Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI,
    PDB, RefSeq, FlyBase, WormBase, Patent Offices.
  • UniParc can be used to track sequence versions

Use with extreme caution: also contains
pseudogenes, incorrect CDS predictions, etc., and
is highly redundant!
113
UniParc tracks a protein sequence and its
integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
114
UniParc entry UPI0000033477 part 2
115
(No Transcript)
116
www.expasy.ch/prosite
117
PROSITE
  • A database of protein families and domains using
    two kinds of motif descriptors
  • Patterns or regular expressions
  • User friendly (easy to understand and to use)
  • Well designed for the detection of biologically
    meaningful sites such as residues playing a
    structural or functional role
  • Can be used to scan a protein database in
    reasonable time on any computer
  • Generalized profiles or weight matrices
  • Well adapted to cover the full length of the
    protein or domain
  • Are able to detect highly divergent families or
    domains with only a few well conserved positions
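
Because PROSITE patterns are close to regular expressions, a pattern can be translated and run with any regex engine. The Python sketch below handles the common pattern elements as I understand the syntax: `-` separators, `x` for any residue, `[..]` allowed residues, `{..}` forbidden residues, `(n)` or `(n,m)` repeats, and `<` / `>` anchors at the pattern ends. It is a simplification (for instance, anchors inside brackets are not handled) and is not the ScanProsite implementation; the test sequence is invented.

```python
import re
from typing import List, Tuple

ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern applied to a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("N-{P}-[ST]-{P}", "AANGSAANPSA"))
```

Patterns scan quickly this way, which is the practical advantage noted above; profiles need a scoring matrix and cannot be reduced to a regular expression.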

118
Identification of protein domains and families
  • There are two non-exclusive approaches for
    determining the function of an
    uncharacterized protein:
      • Comparison with a complete sequence database
        (BLAST)
      • Scanning a database of patterns and profiles
  • Most proteins can be grouped into families.
    Proteins belonging to a particular family share
    functional attributes and are derived from a
    common ancestor
  • Some regions in the sequence are more conserved
    than others during evolution because they are
    important for the function or the structure of
    the protein
  • Like fingerprints for police identification,
    signatures built out of sequence patterns or
    profiles can be used to formulate hypotheses
    about the function of uncharacterized proteins.

119
Definitions of conserved regions
  • Conserved regions can be classified into 5
    different groups:
  • Families: proteins that have the same domain
    arrangement, be it 1 or many domains.
  • Domains: specific combinations of secondary
    structures that assume characteristic
    three-dimensional structures or folds.
  • Repeats: structural units always found in two or
    more copies that assemble into a specific fold.
    Assemblies of repeats might also be thought of as
    domains.
  • Motifs: short regions with conserved active or
    binding sites that usually adopt a folded
    conformation only in association with their
    ligands.
  • Sites: functional residues (active sites,
    disulfide bridges, post-translationally modified
    residues).

120
Conserved regions (2)
Binding cleft (motif)
Cys 181: active site residue
PPID family: 1 CSA_PPIASE domain, 3 TPR repeats
121
http://www.expasy.org/tools/scanprosite/
123
Functionally and structurally relevant residues
in PROSITE motif descriptors: a new concept to
extract more information from profiles
  • Principle
      • Combining the advantages of profiles (high
        sensitivity) and patterns (position-specific
        information)
      • Tagging of amino acids at precise positions in
        the profile and checking their presence in the
        matched sequence
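The principle of tagging positions and checking them in the matched sequence can be sketched as follows. This is a simplified illustration, not the ProRule implementation: it assumes an ungapped match, so a tagged profile position maps to the sequence by a plain offset, whereas a real profile match would map positions through the alignment. The names TaggedPosition and check_tagged_positions, the sequence and the tags are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPosition:
    offset: int          # 0-based position inside the matched region
    allowed: frozenset   # residues expected at that position
    feature: str         # annotation to report when the residue is present


def check_tagged_positions(sequence, match_start, tags):
    """Return (1-based position, residue, feature) for every tag that is satisfied.

    match_start is the 0-based index where the profile match begins.
    """
    annotations = []
    for tag in tags:
        index = match_start + tag.offset
        if 0 <= index < len(sequence) and sequence[index] in tag.allowed:
            annotations.append((index + 1, sequence[index], tag.feature))
    return annotations


tags = [
    TaggedPosition(offset=2, allowed=frozenset("S"), feature="active site"),
    TaggedPosition(offset=7, allowed=frozenset("H"), feature="binding site"),
]
# The match starts at index 2; only the first tag finds its expected residue.
print(check_tagged_positions("AAGDSGGPLC", 2, tags))  # [(5, 'S', 'active site')]
```

A tag whose residue is absent is simply not reported, which is how a sensitive profile match can still avoid annotating a site that the sequence has lost.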

124
ProRule
  • Aim
      • Provide users with biologically meaningful
        functional and structural information:
          • active sites,
          • post-translational modification sites,
          • binding sites,
          • disulfide bonds,
          • transmembrane regions.
      • Help the UniProtKB/Swiss-Prot annotation and
        provide enhanced homogeneity:
          • domain name and boundaries,
          • keywords and linked GO terms,
          • EC numbers,
          • false negative PROSITE patterns.

125
www.expasy.ch/prosite/prorule.html
Sigrist et al. Bioinformatics 21:4060-4066 (2005)
126
Other methods for protein/domain identification
  • Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD:
    Hidden Markov Models (HMM), probabilistic models
  • PRINTS: unweighted matrices, protein
    fingerprints
  • BLOCKS: weight matrix derived from ungapped
    alignments
  • PIRSF, SUPERFAMILY: classification system based
    on evolutionary relationship of whole proteins
  • ProDom: automatic compilation of homologous
    domains based on recursive PSI-BLAST searches.
127
The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein
Families, Domains and Functional Sites
128
The InterPro project
www.ebi.ac.uk/interpro
  • Unification of PROSITE, PRINTS, Pfam and ProDom
    into an integrated resource of protein families,
    domains and functional sites in 2000
  • Joint effort in creating a unified yet
    methodologically diverse system for protein
    family/domain identification
  • Single set of documents linked to the various
    methods
  • Distributed with tools by anonymous FTP and
    through www servers
  • Used to enhance the functional annotation of
    UniProtKB (Swiss-Prot and TrEMBL)
  • Has progressively incorporated other databases

129
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam,
PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF,
SCOP-based SUPERFAMILY, Gene3D and PANTHER, and
the current UniProtKB/Swiss-Prot + TrEMBL data
(for details see
http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains
13,953 entries, representing 3,911 domains, 9,610
families, 232 repeats, 34 active sites, 20
binding sites and 19 post-translational
modification sites. Overall, there are 15,880,845
InterPro hits from 3,100,874 UniProtKB protein
sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein
sequences have one or more InterPro hits.

130
http://www.ebi.ac.uk/interpro/
131
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
132
InterPro Graphical domain representation
133
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
134
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
135
The ExPASy www server
www.expasy.org
  • First molecular biology server on the Web (August
    1993); 500 million accesses since then
  • Dedicated to proteomics
  • Databases: UniProtKB, PROSITE, Swiss-2DPAGE,
    etc.
  • Many 2D/MS protein identification/characterization
    and sequence analysis tools
  • Mirror sites in Australia, Brazil, Canada, China
    and Korea: http://[au|br|ca|cn|kr|www].expasy.org

137
ExPASy software tools
  • Tools for the display and management of databases
    (NiceProt, Swiss-Shop sequence alerting system,
    etc.)
  • Tools for sequence analysis (ScanProsite,
    ProtParam, ProtScale, RandSeq, Translate, etc.)
  • Proteomics tools (AACompIdent, FindMod, FindPept,
    Aldente, PeptideMass, TagIdent, etc.)
  • 3D-structure analysis and display tools
    (Swiss-Model, Swiss-PDBviewer)

138
http://www.expasy.org/tools/
Identification: Aldente, TagIdent, AACompIdent,
MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph,
PeptideCutter, ProtScale, ProtParam
- Use annotation in Swiss-Prot and TrEMBL
(preprocessing, PTMs, etc.)
- Hyper-links between tools and databases
139
http://www.expasy.org/links.html
140
Finding out about recent developments
UniProtKB/Swiss-Prot recent format changes:
http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes:
http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins:
http://www.expasy.org/swiss-flash/
What's new on ExPASy:
http://www.expasy.org/history.html
141
References (1)
  • UniProtKB/Swiss-Prot
  • http://www.expasy.org/sprot/sprot-ref.html
  • Wu C. et al. The Universal Protein Resource
    (UniProt): an expanding universe of protein
    information. Nucleic Acids Res.
    34:D187-191 (2006).
  • Boeckmann B. et al. Protein variety and
    functional diversity: Swiss-Prot annotation in
    its biological context. Comptes Rendus Biologies
    328:882-99 (2005).
  • Bairoch A. Swiss-Prot: Juggling between evolution
    and stability. Brief. Bioinform. 5:39-55 (2004).
  • Farriol-Mathis N. et al. Annotation of
    post-translational modifications in the
    Swiss-Prot knowledgebase. Proteomics
    4:1537-1550 (2004).
  • Gasteiger E. et al. Swiss-Prot: Connecting
    biological knowledge via a protein database.
    Curr. Issues Mol. Biol. 3:47-55 (2001).

142
References (2)
  • PROSITE
  • Hulo N. et al. The PROSITE database.
    Nucleic Acids Res. 34:D227-D230 (2006).
  • Sigrist C.J.A. et al. PROSITE: a documented
    database using patterns and profiles as motif
    descriptors. Brief. Bioinform. 3:265-274 (2002).
  • Gattiker A. et al. ScanProsite: a reference
    implementation of a PROSITE scanning tool.
    Applied Bioinformatics 1:107-108 (2002).
  • Sigrist C.J.A. et al. ProRule: a new database
    containing functional and structural information
    on PROSITE profiles. Bioinformatics
    21(21):4060-6 (2005).
  • ExPASy
  • Gasteiger E. et al. ExPASy: the proteomics
    server for in-depth protein knowledge and
    analysis. Nucleic Acids Res.
    31:3784-3788 (2003).
143
Useful general publications
  • Nucleic Acids Res. Database issue 2006, vol. 34,
    supplement 1:
    http://nar.oupjournals.org/content/vol34/suppl_1/
  • Nucleic Acids Res. Web server issue 2005, vol.
    33, supplement 2:
    http://nar.oupjournals.org/content/vol33/suppl_2/
  • Book: Bioinformatics for Dummies, by J.-M.
    Claverie and C. Notredame
  • Publisher: For Dummies, 2nd edition (December
    2006)
  • ISBN 0764516965

144
Take home message
Or via the website
145
Before the introduction to Swiss-Prot/ExPASy
After the introduction to Swiss-Prot /ExPASy
146
  • Some practical exercises
  • http://education.expasy.org/cours/Tunis/
  • Finding databases
  • Comparing protein databases
  • Comparing BLAST programs
  • BLAST output
  • Bacterial start sites
  • UniRef
  • Different views of UniProtKB
  • Environmental sequences
  • Inter-database links PROSITE
  • InterPro
  • Using UniProtKB/Swiss-Prot to create datasets