Title: A. Auchincloss
1 UniProtKB/Swiss-Prot and ExPASy: protein sequence
databases and proteomics tools developed at the
Swiss Institute of Bioinformatics
Andrea Auchincloss (andrea.auchincloss_at_isb-sib.ch)
Tunis, March 19, 2007
2 Outline
- The Swiss Institute of Bioinformatics
- What is UniProt?
- UniProt Knowledgebase: Swiss-Prot and TrEMBL
- HPI, post-translational modifications, HAMAP
- UniRef and UniParc
- Databases for protein function and domains: PROSITE, InterPro, etc.
- ExPASy: other tools
3 Swiss Institute of Bioinformatics (SIB)
- Non-profit foundation created in 1998
- Groups in Geneva, Lausanne and Basel
- Federation of several groups (some of which
existed and collaborated long before the
foundation of the institute), about 170
researchers in 2006.
4 www.isb-sib.ch
5 SIB missions
- Development of databases and software tools
- High-quality bioinformatics research program
- Courses and seminars for the training of
bioinformatics research scientists. This includes
a master's degree in proteomics and
bioinformatics, several weekly courses and a
doctoral school
- Services to the Swiss life sciences community
(EMBnet node).
6 Swiss Institute of Bioinformatics: 20 research
and service groups
7 Proteins are organic compounds made of amino
acids arranged in a linear chain and joined by
peptide bonds (Wikipedia).
8 Proteins are composed of 20 "standard" amino
acids, each symbolised by a letter.
Different views of a protein
9 Proteins can also work together to perform a
particular function, and they often associate to
form complexes.
10 Proteins are essential parts of all living
organisms and participate in every process within
cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response,
cell adhesion, cell cycle, toxins.
Proteins are a necessary component in
our diet, since animals cannot synthesize all the
amino acids and must obtain essential amino acids
from food.
11 Protein/gene number
Organism        Number
Bacteria        182-8,591
S. cerevisiae   6,127
C. elegans      17,947
Drosophila      13,849
A. thaliana     25,674
Human           21,000
12 The universe in which protein databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20
13 179,000,021,000
1st estimate: 30 million species (1.5 million named)
2nd estimate:
- 20 million bacteria/archaea x 4,000 genes
- 5 million protists x 6,000 genes
- 3 million insects x 14,000 genes
- 1 million fungi x 6,000 genes
- 0.6 million plants x 20,000 genes
- 0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
- 0.2 million vertebrates x 21,000 genes
The calculation:
$2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times21000 + 21000$ (you!)
Caveat: this is an estimate of the number of
potential sequence entries, but not that of the
number of distinct protein entities in the
biosphere.
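The second estimate can be rechecked with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and the figures are copied from the slide. Summing the listed terms gives 178,200,021,000, slightly below the headline figure, which suggests the headline was rounded.

```python
# (group, number of species, average genes per species), as listed on the slide.
ESTIMATE = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 5_000_000, 6_000),
    ("insects", 3_000_000, 14_000),
    ("fungi", 1_000_000, 6_000),
    ("plants", 600_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 200_000, 20_000),
    ("vertebrates", 200_000, 21_000),
]
HUMAN_GENES = 21_000  # the final "+ 21000 (you!)" term


def potential_sequence_entries(groups=ESTIMATE, extra=HUMAN_GENES):
    """Sum species x genes over all groups, plus the human term."""
    return sum(species * genes for _, species, genes in groups) + extra


if __name__ == "__main__":
    for name, species, genes in ESTIMATE:
        print(f"{name:35s} {species * genes:>18,d}")
    print(f"{'total':35s} {potential_sequence_entries():>18,d}")
```

As the caveat says, the result counts potential sequence entries, not distinct protein entities.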
14 What sequencing is underway right now?
- Many eukaryotic and bacterial genomes (varying sizes)
- Metagenomics (environmental samples): 6
million sequences submitted/published in
December 2006, 17 million sequences being
generated at the Venter Institute, 6 million
proteins being submitted from the GOS (Global
Ocean Sampling) trip
15 Protein sequences: what is sequenced?
- Currently about 3.5 to 4.0 million known protein sequences
- More than 99% of these are derived by
translation of nucleotide sequences
- Less than 1% come from direct protein sequencing (Edman, MS/MS)
-> It is important that users know where the
protein sequence comes from (sequence and gene
prediction quality)!
16
- Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA or EST)
- Gene prediction quality: which programs were used, and is there
manual intervention afterwards?
- For example:
- Authors can specify the nature of the CDS in the
nucleotide databases by using the qualifiers
"/evidence=experimental" or "/evidence=not_experimental".
- Very rarely done
17 The hectic life of a sequence
- cDNAs, ESTs, genomes, ...
- Data not submitted to public databases, delayed
or cancelled
- Public nucleic acid databases: EMBL, GenBank, DDBJ
- If the submitters provide an annotated coding
sequence (CDS): public protein sequence databases
18 CDS: CoDing Sequence
CDS provided by the submitters
The first Met!
CDS translation provided by EMBL
19 Data not submitted
Complete genome (submitted): only 1,858 CDS
available!
20 Issue for the users: the protein database jungle
22 The hectic life of a sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Scientific publications derived sequences
EMBL, GenBank, DDBJ
CoDing Sequences provided by submitters
TrEMBL GenPept
RefSeq
PRF
PIR
UniProtKB
IPI
Swiss-Prot
UniParc
Manually annotated
EnsEMBL
PDB
CCDS
Also gene prediction
species-specific databases (EcoGene,
TubercuList, TIGR)
24 Major public protein sequence database sources
Integrated resources (cross-references):
- UniProtKB: Swiss-Prot + TrEMBL
- NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
Separated resources:
- UniProtKB/Swiss-Prot: manually annotated protein
sequences (11,000 species)
- UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation;
non-redundant with Swiss-Prot (127,000 species)
- GenPept: submitted CDS (GenBank);
redundant with UniProtKB (about 130,000 species)
- PIR: Protein Information Resource;
archive since 2003; integrated into UniProtKB
- PDB: Protein Data Bank; 3D data and
associated sequences
- PRF: journal scan of published peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction
(4,000 species)
25 Other protein sequence databases
- CCDS = EBI + NCBI + Wellcome Trust Sanger + UC
Santa Cruz (2 species): consensus human and mouse
sequences between 4 institutions. Combines
different approaches (ab initio, by similarity)
and takes advantage of the expertise acquired by
different institutes, including manual annotation.
- EnsEMBL = UniProtKB + RefSeq + gene
prediction (31 species): aligns some eukaryotic
genomic sequences with all the sequences found in
EMBL, UniProtKB/Swiss-Prot, RefSeq and
UniProtKB/TrEMBL (-> known genes). Also does some
gene prediction (-> novel genes).
- IPI = UniProtKB + RefSeq + EnsEMBL (+ H-InvDB, TAIR, VEGA) (7
species): provides a guide to the main databases
that describe the human, mouse, rat, zebrafish,
Arabidopsis, chicken, and cow proteomes.
26 The UniProt consortium
27 The UniProt Consortium
UniProt (Universal Protein Resource): the world's
most comprehensive catalogue of protein
information. www.uniprot.org; Wu et al., Nucleic
Acids Res. 34:D187-191 (2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and
Environmental Sequences)
28 The Universal Protein Resource components
UniProtKB Release 9.7 consists of:
UniProt Knowledgebase (UniProtKB), produced by SIB and EBI
- UniProtKB/Swiss-Prot: manually annotated protein
sequences; 260,000 entries; 10,000 species
- UniProtKB/TrEMBL: computer-annotated protein
sequences; 3,600,000 entries; 100,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
- One UniRef100 entry: all identical sequences (including fragments).
- One UniRef90 entry: sequences that have at least 90% identity.
- One UniRef50 entry: sequences that have at least 50% identity.
- Independent of species.
- Allows comprehensive BLAST similarity searches
by providing sets of representative sequences
UniProt Archive (UniParc), produced by EBI; 8,000,000 entries
- Archived raw protein sequences, found in
publicly accessible databases: Swiss-Prot,
TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq,
FlyBase, WormBase, Patent Offices.
- Use with extreme caution: contains pseudogenes,
incorrect CDS predictions, etc.
31 The Universal Protein Resource components
UniProt Knowledgebase (UniProtKB), produced by SIB and EBI
- UniProtKB/Swiss-Prot: manually annotated protein
sequences; 260,000 entries; 11,000 species
- UniProtKB/TrEMBL: computer-annotated protein
sequences; 3,900,000 entries; 127,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
- One UniRef100 entry: all identical sequences (including fragments).
- One UniRef90 entry: sequences that have at least 90% identity.
- One UniRef50 entry: sequences that have at least 50% identity.
- Independent of species.
- Allows comprehensive BLAST similarity searches
by providing sets of representative sequences
UniProt Archive (UniParc), produced by EBI; 8,800,000 entries
- Archived raw protein sequences, found in
publicly accessible databases: Swiss-Prot,
TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq,
FlyBase, WormBase, Patent Offices.
- Use with extreme caution: contains pseudogenes,
incorrect CDS predictions, etc.
32 UniProt web sites
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful search
engine.
33 http://beta.uniprot.org/
Test it! Login: guest, password: amazing
34 The UniProt groups from SIB, EBI and PIR
(Antibes, September 2004)
In Geneva (SIB):
- 2 Group Leaders
- 44 Annotators
- 4 PROSITE annotators
- 22 Programmers and Researchers
- 5 Administrators, science communicators
- 3 System Administrators
- 4 Students
- 1 GISAID
------------------ 85 people
At PIR:
- 1 Group Leader
- 13 Protein Science Team
- 12 Informatics Team
------------------ 26 people
At EBI (Swiss-Prot, EMBL, TrEMBL): 75 people
(29 Annotators)
35
- UniProtKB has biweekly releases, available from
about 100 servers; the main sources are ExPASy
and www.uniprot.org
36 UniProtKB: From EMBL (DNA) to TrEMBL (protein)
37
- EMBL: gene/protein name, taxonomy, reference, CDS
- TrEMBL: automated extraction of the protein sequence (CDS),
gene name, taxonomy and references. Automated
annotation (keywords and protein family).
38 ! TrEMBL does not translate DNA sequences, nor
does it use gene prediction programs: it only takes
the existing CDS proposed by the submitting
authors in the EMBL/GenBank/DDBJ entry. In
particular, the proposed CDS and derived protein
sequences can be experimentally proven or derived
from gene prediction programs (this is not
obvious from the TrEMBL entry). TrEMBL does not
validate any sequences.
39 !!!! The quality of UniProtKB/TrEMBL data is
directly dependent on the information provided by
the submitter of the original nucleotide entry.
40 UniProtKB: From TrEMBL to Swiss-Prot
41
- EMBL: CDS
- TrEMBL: automated extraction of the protein sequence
(CDS), gene name and references. Automated
annotation.
- Manual annotation of the sequence and associated
biological information (derived from literature,
external experts, databases)
- Annotation of sequence differences (conflicts,
variants, splicing)
- Average of 6 independent sequence reports for
each human protein
42 Distinguishing Swiss-Prot and TrEMBL
- A TrEMBL entry is a computer-annotated record
derived from a coding sequence (CDS) in the
nucleotide sequence databases that is not in Swiss-Prot,
after some redundancy removal and automated
annotation.
- A Swiss-Prot entry is a manually annotated record
for a given protein.
43 UniProtKB: From TrEMBL to Swiss-Prot. Step 1:
Sequence check
44 UniProtKB/Swiss-Prot
Non-redundant: 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS
and amino acid) derived from the same gene ->
decreases redundancy and improves sequence
reliability
ii) Annotation of the sequence
differences (including conflicts,
polymorphisms, splice variants, etc.) ->
annotation of protein diversity
45 Redundancy
UniProtKB/Swiss-Prot (11,000 species) + UniProtKB/TrEMBL (127,000 species):
260,000 + 3,800,000 -> 3,600,000
- Redundancy in TrEMBL
- Redundancy between TrEMBL and Swiss-Prot
In the future, redundancy is going to decrease:
"new" genome sequencing -> "new" proteins
46
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
47 All alternatively spliced sequences are available
for BLAST searches and protein identification tools,
and are downloadable. Human: 2/3 of the human
genes are alternatively spliced.
48
- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR
49 Multiple alignment of the available clpB sequences
51 Within Swiss-Prot?
- A snapshot of the situation (December 2006):
- 28,200 entries with 82,000 sequence conflicts
- 2,600 entries with corrected frameshifts
- 15,100 entries with corrected initiation sites
- 4,300 entries with other sequence problems.
- At least 43,000 entries (19% of Swiss-Prot)
required a minimal amount of annotation effort to
obtain the correct sequence.
52 Quality of protein information from genome
projects
- Proteins originating from different genome projects:
- Drosophila: what a curated (thanks to FlyBase)
genome effort should look like; only 1.8% of the
gene models conflict with what we have in
UniProtKB/Swiss-Prot
- Arabidopsis: a genome where lots of work was done
to annotate it when it was sequenced, but where
nothing has been done since (at least in the
public view); 19.5% of the gene models are
erroneous
- Tetraodon nigroviridis: a quick and dirty
automatic run through a genome with no manual
intervention; >90% of the gene models produce
incorrect proteins.
- Bacteria and Archaea have almost no splicing, so
prediction is easier; however, errors are still
made
53
- Producing a clean set of sequences is not a
trivial task
- It is not getting easier as more and more types
of sequence data are submitted
- It is important to pursue our efforts to make
sure we provide our users with the most correct set
of sequences for a given organism.
54 New: Protein existence evidence tag
- As most protein sequences are derived from
translation of nucleotide sequences and are only
predictions, the new PE line indicates whether
there is any evidence that proves the existence
of a protein
- The Protein existence evidence will have 5
different qualifiers:
- Evidence at protein level
- Evidence at transcript level
- Inferred from homology
- Predicted
- Unassigned (used mostly in TrEMBL)
55 Righting the wrongs
Sequences are rarely deposited in a mature state: as with all
scientific research, DNA and protein annotation
is a continual process of learning, revision and
corrections.
Sequencing error rate: 1 base in 10,000
Making people aware of errors is
good and great; making people aware that they're
responsible also for correcting errors is even
greater. C. Hadley, EMBO reports, 4(9), 2003.
56 UniProtKB: From TrEMBL to Swiss-Prot. Step
2: Annotation (literature, controlled
vocabulary)
57 Annotation
- The focal point of the efforts to maintain and
develop UniProtKB/Swiss-Prot
- It is becoming more and more important as it:
- provides a summary of what is known about a protein
- creates templates for automatic annotation for the
many organisms whose genome sequence is/will be
available but whose proteins will not be
characterized
- provides well-annotated entries (a corpus) to train
literature mining tools (text mining).
58
- Source of data:
- publications (> 1,700 journals cited)
- also external scientific expertise and other databases
59 Comments: structured free text, 27 defined
topics
Manually annotated. Information from papers,
specialized databases, computer prediction,
external experts, brainstorming. Distinction
between data obtained experimentally and
computerized inferences.
60 UniProtKB: From TrEMBL to Swiss-Prot. Step
3: Sequence analysis (bioinformatics tools)
61 The annotation platform
- Annotators could not work without the help of
our software developers
62 Anabelle: much more than a domain annotation
platform
64 We manually check the results!
65 What else is in a UniProtKB/Swiss-Prot entry?
66 Cross-references: a central hub
Gasteiger E. et al., Curr. Issues Mol. Biol.
3:47-55 (2001); www.expasy.org/cgi-bin/lists?dbxref.txt
- Swiss-Prot was the first database with
cross-references (X-references)
- Explicitly X-referenced to 85 databases:
- DNA (EMBL/GenBank/DDBJ)
- 3D-structure (PDB)
- Family and domain (InterPro, HAMAP, PROSITE,
Pfam, etc.)
- genomic (OMIM, MGI, FlyBase, SGD, SubtiList,
etc.)
- 2D-gel (e.g. SWISS-2DPAGE)
- specialized db (e.g. GlycoSuiteDB, PhosSite,
MEROPS)
- literature (PubMed)
- Each UniProtKB/Swiss-Prot entry can be seen as a
central hub for the data available about the
protein it describes
67 UniProtKB/Swiss-Prot explicit links
- Organism-specific databases: AGD, CYGD,
DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase,
GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA,
LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM,
MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList,
TAIR, TubercuList, WormBase, WormPep, ZFIN
- Family and domain databases: Gene3D, HAMAP, InterPro,
PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART,
TIGRFAMs
- Enzyme and pathway databases: BioCyc, Reactome
- Sequence databases: EMBL, PIR, UniGene
- 2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE,
COMPLUYEAST-2DPAGE, Cornea-2DPAGE,
DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP,
PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE,
REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
- Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank,
GO, IntAct, LinkHub, RZPD-ProtExp
- Protein family/group databases: GermOnline, MEROPS,
PeroxiBase, PptaseDB, REBASE, TRANSFAC
- Genome annotation databases: Ensembl, GenomeReviews,
KEGG, TIGR
- 3D structure databases: HSSP, PDB, SMR
- PTM databases: GlycoSuiteDB, PhosSite
68 Implicit cross-references on new web server and
ExPASy
- Implicit X-references to 26 additional db, added
by the ExPASy server on the www (e.g. GeneCards,
ModBase, etc.)
- These X-refs are not present as hard-coded DR
lines in the Swiss-Prot entry as it can be
downloaded by ftp, but are added on the fly when
someone views an entry on ExPASy. This can be
done because enough information is present in the
UniProtKB entry to access the related information
in another db.
- Example: all Swiss-Prot/TrEMBL entries are linked to the
BLOCKS domain db via the Swiss-Prot/TrEMBL
accession number
69 Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further
facilitate information retrieval via controlled
vocabularies
70 In a UniProtKB/Swiss-Prot entry, you can expect
to find:
- All the names of a given protein (and of its
gene)
- Its biological origin, with links to the taxonomic
databases
- A selection of references
- A summary of what is known about the protein:
function, alternative products, PTM, tissue
expression, disease, 3D-structures, etc.
- Numerous cross-references
- Selected keywords
- A description of important sequence features:
domains, PTMs, variations, etc.
- An (often corrected) protein sequence and the
description of various isoforms/variants.
71 Monitoring entry history: the UniProtKB
Sequence/Annotation Version archive
73 ... and many useful links
74 And on the new website
other tools are not yet available
75 UniProt Knowledgebase
- Swiss-Prot: manually annotated section
- TrEMBL: automatically annotated section
76 Distinguishing Swiss-Prot and TrEMBL
78 Accession number: to be used when you cite a
UniProt entry anywhere (never cite the entry
name (ID) alone)
79 Non-Redundant Complete Proteome Sets
- Text search: UniProtKB keyword "Complete
proteome", combined with an organism name
- Or download precomputed sets (bacteria, archaea,
some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries
- Or EBI Integr8: http://www.ebi.ac.uk/integr8/
80 Swiss-Prot annotation priorities
- The main annotation programs:
- HAMAP (High quality Automated and Manual
Annotation of microbial Proteomes: bacteria,
archaea, plastids)
- HPI (Human Proteomics Initiative)
- PPAP (Plant Proteome Annotation Project)
- FPAP (Fungal Proteome Annotation Project)
- Viral proteins
- Tox-Prot (Toxin Annotation Project)
- ENZYMES (proteins with EC numbers)
- PTMs
- 3D-structure
- Protein-protein interactions
- Quality assurance, includes controlled
vocabularies
81 Model organisms
- Organisms for which we want to have a more
in-depth coverage
- Completeness, links with specialized databases,
specific documents
- Examples: E. coli, B. subtilis, human, mouse,
fruit fly, C. elegans, yeast, S. pombe, A. thaliana.
82 Human Proteomics Initiative (HPI)
83 From genome to proteome
genome: 21,000 human genes
Considerable increase in complexity
84 In the case of human genes, the Swiss-Prot/TrEMBL
redundancy is still very high: 15,803 + 53,100
-> about 20,000
Human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human
gene products, but has not identified
significant numbers of unpredicted proteins
86 Post-translational modifications (PTMs)
87 PTM definition
A post-translational modification (PTM) is
a modification of a polypeptide chain involving
the making or the breaking of covalent bond(s)
that occurs during (co-translational class) or
after translation.
88 PTMs influence or even define protein function
- phosphorylation and possibly GlcNAcylation and
S-nitrosylation are a means of transducing
extracellular signals to the inside of the cells.
- methylation has a role in nuclear protein import.
- lipid addition allows protein association with
membranes (e.g. GPI-anchor, myristate,
palmitate).
- intrachain disulfide bonds and N-glycosylation
influence protein folding.
- interchain disulfide bonds bind subunits
together.
- other PTMs are directly involved in the protein
function, for example the binding of cofactors
(e.g. pyridoxal phosphate), or the synthesis of a
cofactor by the modification of amino acids
present in the protein (e.g. quinones).
89 PTM variety
Each protein can be modified at various
sites, which gives a high number of alternative
peptides. 283 different protein modifications
are annotated in UniProtKB/Swiss-Prot.
90 Large scale experiments (LSE) for PTMs!
- PTM information can now be obtained from results
of proteomics large scale experiments (LSE)
- In the past 12 months we have added about 6,000
experimental PTMs using data originating from
some of these projects.
91
- Proteomic studies have led to the updating of
2,767 human Swiss-Prot entries, mainly with PTM information
(UniProt release 10.0, March 2007)
92 Bacteria and Archaea (HAMAP)
93 In 2006, 130 new bacterial and archaeal genomes
(not WGS) were submitted to the DNA
databases. If on "average" 4,000
proteins/genome: > 500,000 proteins!
How to cope????
94 HAMAP: High quality Automated and Manual
Annotation of microbial Proteomes
Lots of microbial genomes, lots of proteins. What
should we do with them in UniProt?
95 http://www.expasy.org/unirule/MF_00319
96 Automatic annotation of proteins belonging to
specified families (1)
- This program requires the continuous development
and adaptation of software tools as well as the
development of a database of annotation rules for
each family (so far about 1,400).
97
- Allows us to annotate automatically, yet with a
very high level of quality, proteins that belong
to well-defined protein families
- Can be applied to both characterized proteins and
to some UPFs (Uncharacterized Protein Families)
The families are based on UniProtKB/Swiss-Prot
entries, so we first do all the annotation steps
described earlier!
99 www.expasy.ch/sprot/hamap/
Using HAMAP, we can currently annotate to
Swiss-Prot quality level between 10 and 50% of a
complete microbial proteome (next step: HAMAP for
Fungi)
100 Updates
- DNA sequence archives
- EMBL/GenBank/DDBJ is an archive
- All submitted data goes into the archive
- Submitters are responsible for the submitted
sequences and the accompanying annotation
- Nobody else can change them (including the
curators at EMBL/GenBank/DDBJ)
- Protein sequence databases
- UniProtKB/Swiss-Prot is NOT an archive
- Swiss-Prot chooses what goes into the database
and where to place it
- Swiss-Prot updates annotation and sequences when
necessary
101
ZB SYP, 28-NOV-2003 ALB, 16-NOV-2004 MIM, 31-Jan-2006
ZB BER, 13-FEB-2006 LYG, 14-JUN-2006 LYG, 21-SEP-2006
ZB CHH, 05-DEC-2006
102
- User updates or annotation requests
103 Accessing and searching UniProtKB
- Direct access (keyword search)
- New search tool: we'll use it later
- Sequence Retrieval System (SRS, Europe), will
disappear
- Entrez (NCBI, USA): UniProtKB/Swiss-Prot (not
TrEMBL) is integrated in GenPept, but with a
changed format, and some information (e.g.
implicit cross-references) is missing
- Query tools on ExPASy and UniProt
(http://www.expasy.org/sprot/, http://www.uniprot.org)
- Indirect access (sequence search)
- Bioinformatics sequence analysis tools (BLAST,
FASTA, GCG, EMBOSS, MS identification tools)
104 Downloading the UniProt Knowledgebase:
http://www.expasy.org/sprot/download.html
- Swiss-Prot and TrEMBL form a complete,
non-redundant database, the UniProt Knowledgebase
- Can be downloaded from
ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
- In Swiss-Prot format, FASTA or XML format
- Complemented by sequences of alternative splice
isoforms
- Everything about all proteins! (at least all
CDS submitted to the public nucleotide sequence
databases)
105 If you want to develop tools to work with your
local copy of UniProtKB
- Swissknife: a Perl parser for UniProtKB
- Constantly updated according to latest format
changes
- Advantage: you do not need to know exactly how
the information is stored in the flat file
- http://swissknife.sourceforge.net/
- ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
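To give an idea of what such a parser hides from you, here is a minimal Python sketch that reads the flat-file layout (two-letter line code, data from column 6, indented sequence lines, "//" as end-of-entry marker). It is not Swissknife and does not reproduce its API; the function names and the toy entry below are invented for illustration, and a real parser handles many more line types and format details.

```python
from collections import defaultdict

# Made-up toy entry, laid out like a flat-file record.
SAMPLE = """\
ID   TOY_EXAMPLE             Reviewed;          10 AA.
AC   X00001; X00002;
DE   Hypothetical toy protein.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""


def parse_flat_file(lines):
    """Yield one dict per entry, mapping line code (ID, AC, ...) to its lines."""
    entry = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):        # end-of-entry marker
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
        elif line.startswith("  "):      # sequence data lines are indented
            entry["SEQ"].append(line.replace(" ", ""))
        elif len(line) > 5:              # line code, three spaces, then data
            entry[line[:2]].append(line[5:])


def accessions(entry):
    """Return all accession numbers; the first one is the primary AC."""
    acs = []
    for ac_line in entry.get("AC", []):
        acs.extend(a.strip() for a in ac_line.split(";") if a.strip())
    return acs


def entry_name(entry):
    """Return the entry name (ID), which may change between releases."""
    return entry["ID"][0].split()[0]


if __name__ == "__main__":
    for e in parse_flat_file(SAMPLE.splitlines()):
        print(entry_name(e), accessions(e), "".join(e["SEQ"]))
```

Keeping the accession list separate from the entry name mirrors the advice to cite the accession number rather than the ID.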
106 Take home message
- Swiss-Prot is the non-redundant, manually
annotated and highly cross-referenced section of
the UniProt Knowledgebase
- Be aware of the differences between
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
- Computer vs. Human
- Redundant vs. Non-redundant
- Always cite the Accession number, not the entry
name
- The AC is stable
- The entry name might change
108 UniRef100, 90 and 50 clusters
- One UniRef100 entry -> all identical sequences
from UniProtKB and some sections of UniParc
(including fragments, Swiss-Prot splice
variants).
- One UniRef90 entry -> sequences that have at
least 90% identity.
- One UniRef50 entry -> sequences that are at least
50% identical.
109 UniRef100, 90 and 50 clusters
- One cluster can contain sequences of several
species; clustering is done independently of the
organism
- Each cluster has a representative, reference
sequence, preferably that of the best-annotated
Swiss-Prot entry
- UniRef identifiers are of the form
UniRef100_P99999, UniRef50_P00414; they are not stable,
as clusters are recomputed with every biweekly
release, and cluster representatives can change!
- UniRef is useful for comprehensive BLAST sequence
searches by providing sets of representative
sequences.
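The clustering idea can be illustrated with a deliberately simplified Python sketch. It only groups exactly identical sequences regardless of organism and picks a reviewed (Swiss-Prot) record as representative when one exists. It is not the UniRef procedure: fragments, splice variants and the 90%/50% identity levels are not handled, and the record type, the "Cluster100_" prefix and the sample accessions are all invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    accession: str
    organism: str
    reviewed: bool   # True for Swiss-Prot, False for TrEMBL
    sequence: str


def cluster_identical(records):
    """Group records with exactly the same sequence, whatever the organism."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.sequence].append(rec)
    clusters = {}
    for members in groups.values():
        # Prefer a reviewed entry as representative; ties keep input order.
        representative = min(members, key=lambda r: not r.reviewed)
        clusters[f"Cluster100_{representative.accession}"] = members
    return clusters


if __name__ == "__main__":
    records = [
        Record("A00001", "Species one", False, "MKV"),
        Record("A00002", "Species two", True, "MKV"),
        Record("A00003", "Species one", False, "MKL"),
    ]
    for name, members in cluster_identical(records).items():
        print(name, [m.accession for m in members])
```

Because the cluster name is derived from the representative, it changes whenever the representative changes, which is one reason such identifiers are not stable.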
110 Implicit cross-link: UniProtKB to UniRef
(new web view)
112 UniParc: the UniProt Archive
- 8.8 million sequences
- Sequences and cross-references (AC numbers)
- A comprehensive collection of the raw protein
sequences in public databases (including those
not submitted to the DNA databases)
- Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI,
PDB, RefSeq, FlyBase, WormBase, Patent Offices.
- UniParc can be used to track sequence versions
Use with extreme caution: also contains
pseudogenes, incorrect CDS predictions, etc., and
is highly redundant!
113 UniParc tracks a protein sequence and its
integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
114 UniParc entry UPI0000033477, part 2
116 www.expasy.ch/prosite
117 PROSITE
- A database of protein families and domains using
two kinds of motif descriptors:
- Patterns or regular expressions
- User friendly (easy to understand and to use)
- Well designed for the detection of biologically
meaningful sites such as residues playing a
structural or functional role
- Can be used to scan a protein database in
reasonable time on any computer
- Generalized profiles or weight matrices
- Well adapted to cover the full length of the
protein or domain
- Are able to detect highly divergent families or
domains with only a few well conserved positions
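Because PROSITE patterns are essentially regular expressions, they can be translated into the regex syntax of a general-purpose language. The Python sketch below covers the common pattern elements (x, [..], {..}, repetitions and the < and > anchors); the function name is invented for illustration, rarer constructs are not handled, and the test sequence is synthetic.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # N-terminal anchor
        anchor_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residues, low, high = match.groups()
        if residues == "x":                       # any amino acid
            core = "."
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"    # any amino acid except these
        else:
            core = residues                       # single letter or [ABC] class
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
    print(regex)
    hit = re.search(regex, "AACAACAAALAAAAAAAAHAAAH")  # synthetic sequence
    print(hit.start() + 1, hit.end())                   # 1-based match range
```

A plain regex scan like this is fast, which is the "reasonable time on any computer" property of patterns; profiles need a different, score-based search.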
118 Identification of protein domains and families
- There are two non-exclusive approaches for the
determination of the function of an
uncharacterized protein:
- Comparison with a complete sequence database
(BLAST)
- Scanning a database of patterns and profiles
- Most proteins can be grouped into families.
Proteins belonging to a particular family share
functional attributes and are derived from a
common ancestor
- Some regions in the sequence are more conserved
than others during evolution because they are
important for the function or the structure of
the protein
- Like fingerprints for police identification,
signatures built out of sequence patterns or
profiles can be used to formulate hypotheses
about the function of uncharacterized proteins.
119 Definitions of conserved regions
- Conserved regions can be classified into 5
different groups:
- Families: proteins that have the same domain
arrangement, be it 1 or many domains.
- Domains: specific combinations of secondary
structures that assume characteristic three-dimensional
structures or folds.
- Repeats: structural units always found in two or
more copies that assemble in a specific fold.
Assemblies of repeats might also be thought of as
domains.
- Motifs: short regions with conserved active or
binding sites that usually adopt a folded
conformation only in association with their
ligands.
- Sites: functional residues (active sites,
disulfide bridges, post-translationally modified
residues)
120 Conserved regions (2)
Binding cleft (motif)
Cys 181: active site residue
PPID family; 1 CSA_PPIASE domain; 3 TPR repeats
121 http://www.expasy.org/tools/scanprosite/
123Functionally and structurally relevant residues in PROSITE motif descriptors: a new concept to extract more information from profiles
- Principle
- Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
- Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
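The tagging principle can be sketched as follows. This is an illustration, not the ProRule format or implementation: it assumes that a profile match has already been computed and is available as a mapping from profile positions to sequence positions, and that each tagged profile position lists the residues expected there. The function name, the mapping, the tags and the sequence are all hypothetical.

```python
def check_tagged_residues(sequence, profile_to_seq, tagged):
    """Check whether tagged profile positions hold an expected residue.

    sequence       -- the matched protein sequence
    profile_to_seq -- {profile position: 1-based sequence position}
    tagged         -- {profile position: (allowed residues, annotation)}
    Returns a list of (annotation, sequence position or None, residue present?).
    """
    results = []
    for profile_pos, (allowed, annotation) in sorted(tagged.items()):
        seq_pos = profile_to_seq.get(profile_pos)
        if seq_pos is None:  # profile position falls in a gap of the match
            results.append((annotation, None, False))
            continue
        residue = sequence[seq_pos - 1]
        results.append((annotation, seq_pos, residue in allowed))
    return results

# Hypothetical data: a short sequence and a four-position profile match
toy_sequence = "MAGCDHKL"
toy_match = {1: 3, 2: 4, 3: 5, 4: 6}
toy_tags = {2: ("C", "active site"), 4: ("ST", "phosphorylation site")}

report = check_tagged_residues(toy_sequence, toy_match, toy_tags)
print(report)
```

Only the annotations whose residue check succeeds would be propagated to the matched sequence; here the active-site cysteine is present, while the second tagged position holds a histidine and its annotation is withheld.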
124ProRule
- Aim
- Provide users with biologically meaningful functional and structural information:
- active sites,
- post-translational modification sites,
- binding sites,
- disulfide bonds,
- transmembrane regions.
- Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity:
- domain name and boundaries,
- keywords and linked GO terms,
- EC numbers,
- false negative PROSITE patterns.
125 www.expasy.ch/prosite/prorule.html
Sigrist et al. Bioinformatics 21:4060-4066 (2005)
126Other methods for protein/domain identification
- Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden Markov Models (HMM), probabilistic models
- PRINTS: unweighted matrices, protein fingerprints
- BLOCKS: weight matrix derived from ungapped alignments
- PIRSF, SUPERFAMILY: classification system based on evolutionary relationship of whole proteins
- ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.
127The InterPro project (www.ebi.ac.uk/interpro)
Integrated Documentation Resource of Protein
Families, Domains and Functional Sites
128The InterPro project (www.ebi.ac.uk/interpro)
- Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000
- Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification
- Single set of documents linked to the various methods
- Distributed with tools by anonymous FTP and through WWW servers
- Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL)
- Has progressively incorporated other databases
129Current status of InterPro
Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProtKB/Swiss-Prot and TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites. Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.
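A quick calculation helps to read these figures. The snippet below uses only the numbers quoted on the slide; the variable names are chosen for this example.

```python
# Figures quoted for InterPro release 14.1
total_entries = 13_953
entry_types = {
    "domains": 3_911,
    "families": 9_610,
    "repeats": 232,
    "active sites": 34,
    "binding sites": 20,
    "post-translational modification sites": 19,
}
interpro_hits = 15_880_845
sequences_with_hits = 3_100_874

# Average number of InterPro hits per matched UniProtKB sequence
average_hits = interpro_hits / sequences_with_hits

# Entries not covered by the six categories listed on the slide
listed = sum(entry_types.values())
unlisted = total_entries - listed

print(f"average hits per sequence: {average_hits:.2f}")
print(f"entries in listed categories: {listed}, not listed: {unlisted}")
```

A matched sequence therefore carries about five InterPro hits on average, and the six categories named on the slide add up to 13,826 entries, leaving 127 entries that the breakdown does not account for.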
130 http://www.ebi.ac.uk/interpro/
131 http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
132InterPro Graphical domain representation
133 http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
134 http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
135The ExPASy www server
www.expasy.org
- First molecular biology server on the Web (August 1993); 500 million accesses since
- Dedicated to proteomics
- Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.
- Many 2D/MS protein identification/characterization and sequence analysis tools
- Mirror sites in Australia, Brazil, Canada, China and Korea: http://[au|br|ca|cn|kr|www].expasy.org
137ExPASy software tools
- Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.)
- Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.)
- Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.)
- 3D-structure analysis and display tools (Swiss-Model, Swiss-PDBviewer)
138Identification: Aldente, TagIdent, AACompIdent, MultiIdent
http://www.expasy.org/tools/
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph, PeptideCutter, ProtScale, ProtParam
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyperlinks between tools and databases
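The slides name these tools without describing their internals. As a rough illustration of the kind of calculation behind a peptide mass tool, the sketch below digests a sequence with the commonly quoted trypsin rule (cleave after K or R unless the next residue is P) and sums standard monoisotopic residue masses. It is not the PeptideMass implementation, it ignores missed cleavages and modifications, and the function names and example sequence are made up.

```python
# Standard monoisotopic residue masses in daltons
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):  # trailing peptide without a cleavage site
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(MONOISOTOPIC[residue] for residue in peptide) + WATER

toy_protein = "MKRPGASKLR"  # made-up sequence
for peptide in tryptic_digest(toy_protein):
    print(peptide, round(peptide_mass(peptide), 4))
```

Annotated processing events and PTMs change which peptides exist and what they weigh, which is why the tools draw on Swiss-Prot and TrEMBL annotation.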
139 http://www.expasy.org/links.html
140Finding out about recent developments
UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/
What's new on ExPASy: http://www.expasy.org/history.html
141References (1)
- UniProtKB/Swiss-Prot
- http://www.expasy.org/sprot/sprot-ref.html
- Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191 (2006).
- Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-99 (2005).
- Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55 (2004).
- Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550 (2004).
- Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55 (2001).
142References (2)
PROSITE
- Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230 (2006).
- Sigrist C.J.A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3:265-274 (2002).
- Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108 (2002).
- Sigrist C.J.A. et al. ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-4066 (2005).
ExPASy
- Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788 (2003).
143Useful general publications
- Nucleic Acids Res. Database issue 2006, vol. 34, supplement 1: http://nar.oupjournals.org/content/vol34/suppl_1/
- Nucleic Acids Res. Web server issue 2005, vol. 33, supplement 2: http://nar.oupjournals.org/content/vol33/suppl_2/
- Book: Bioinformatics for Dummies, by J.-M. Claverie and C. Notredame
- Publisher: For Dummies, 2nd edition (December 2006)
- ISBN 0764516965
144Take home message
Or via the website
145Before the introduction to Swiss-Prot/ExPASy
After the introduction to Swiss-Prot/ExPASy
146- Some practical exercises
- http://education.expasy.org/cours/Tunis/
- Finding databases
- Comparing protein databases
- Comparing BLAST programs
- BLAST output
- Bacterial start sites
- UniRef
- Different views of UniProtKB
- Environmental sequences
- Inter-database links PROSITE
- InterPro
- Using UniProtKB/Swiss-Prot to create datasets
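For the last exercise, a dataset can also be assembled locally from a downloaded flat file. The sketch below is a minimal illustration, assuming the Swiss-Prot text format in which each line starts with a two-letter code (ID, AC, OS, SQ), sequence lines are indented, and `//` ends an entry. The two entries, their accession numbers and the function name are invented, and only a handful of line types are handled.

```python
# Two invented entries in an abbreviated Swiss-Prot-like flat-file layout
TOY_FLAT_FILE = """\
ID   TOY1_HUMAN              Reviewed;          12 AA.
AC   X00001; X00002;
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MKNGSAANPS LR
//
ID   TOY2_YEAST              Reviewed;           8 AA.
AC   X00003;
OS   Saccharomyces cerevisiae (Baker's yeast).
SQ   SEQUENCE   8 AA;  900 MW;  0000000000000000 CRC64;
     MAGCDHKL
//
"""

def parse_entries(text):
    """Collect ID, accessions, organism and sequence for each entry."""
    entries, entry, in_sequence = [], None, False
    for line in text.splitlines():
        if line.startswith("//"):  # end of the current entry
            if entry is not None:
                entries.append(entry)
            entry, in_sequence = None, False
            continue
        if not line.strip():
            continue
        if entry is None:
            entry = {"id": "", "accessions": [], "organism": "", "sequence": ""}
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            entry["id"] = value.split()[0]
        elif code == "AC":
            entry["accessions"] += [a.strip() for a in value.split(";") if a.strip()]
        elif code == "OS":
            entry["organism"] = (entry["organism"] + " " + value).strip()
        elif code == "SQ":
            in_sequence = True      # sequence lines follow the SQ header
        elif in_sequence and code == "  ":
            entry["sequence"] += value.replace(" ", "")
    return entries

entries = parse_entries(TOY_FLAT_FILE)
human_ids = [e["id"] for e in entries if "Homo sapiens" in e["organism"]]
print(human_ids)
```

The same filter can be applied to any annotation line, for example keywords or feature lines, to build a dataset restricted to proteins with a given property.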