A. Auchincloss

1
UniProtKB/Swiss-Prot and ExPASy: Protein sequence
databases and proteomics tools developed at the
Swiss Institute of Bioinformatics
Andrea Auchincloss (andrea.auchincloss_at_isb-sib.ch)
Tunis, March 19, 2007
2
Outline

The Swiss Institute of Bioinformatics
What is UniProt?
UniProt Knowledgebase: Swiss-Prot and TrEMBL
HPI, post-translational modifications, HAMAP
UniRef and UniParc
Databases for protein function and domains
PROSITE, InterPro etc.
ExPASy: other tools

3
Swiss Institute of Bioinformatics (SIB)

Non-profit foundation created in 1998
Groups in Geneva, Lausanne and Basel
Federation of several groups (some of which
existed and collaborated long before the
foundation of the institute), about 170
researchers in 2006.

4
www.isb-sib.ch
5
SIB missions

Development of databases and software tools
High-quality bioinformatics research program
Courses and seminars for the training of
bioinformatics research scientists. This includes
a masters degree in proteomics and
bioinformatics, several weekly courses and a
doctoral school
Services to the Swiss Life Sciences community
(EMBnet node).

6
Swiss Institute of Bioinformatics: 20 research
and service groups
7
Proteins are organic compounds made of amino
acids arranged in a linear chain and joined by
peptide bonds (Wikipedia)
8
Proteins are composed of 20 "standard" amino
acids, symbolised by a LETTER.
Different views of a protein
9
Proteins can also work together to perform a
particular function, and they often associate to
form complexes.
10
Proteins are essential parts of all living
organisms and participate in every process within
cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response,
cell adhesion, cell cycle, toxins.
Proteins are a necessary component in our diet,
since animals cannot synthesize all the amino
acids and must obtain essential amino acids from
food.
11
Protein/Gene number
Organism         Number
Bacteria         182-8,591
S. cerevisiae    6,127
C. elegans       17,947
Drosophila       13,849
A. thaliana      25,674
Human            21,000
12
The universe in which protein databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20
13
179,000,021,000
1st estimate: 30 million species (1.5 million named)
2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes
The calculation: 2x10^7 x 4000 + 5x10^6 x 6000 + 3x10^6 x 14000
+ 10^6 x 6000 + 6x10^5 x 20000 + 2x10^5 x 20000 + 2x10^5 x 21000
+ 21000 (you!)
Caveat: this is an estimate of the number of
potential sequence entries, but not that of the
number of distinct protein entities in the
biosphere.
AMB, SP20
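The back-of-the-envelope estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the species counts and per-species gene counts as they appear in the transcript; the variable names are illustrative only. Summing the terms as transcribed gives roughly 178 billion, slightly below the headline figure on the slide, which may simply reflect rounding on the original.

```python
# Species counts and average genes per species, as transcribed from the slide.
groups = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 21_000),
}

# Total number of species in the 2nd estimate.
total_species = sum(count for count, _ in groups.values())

# Potential sequence entries: species x genes, plus one human genome ("you!").
total_sequences = sum(count * genes for count, genes in groups.values()) + 21_000

print(f"{total_species:,} species")
print(f"{total_sequences:,} potential sequence entries")
```

As the caveat on the slide says, this counts potential sequence entries, not distinct protein entities.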
14
What sequencing is underway right now?
Many eukaryotic and bacterial genomes (varying sizes)
Metagenomics (environmental samples): 6 million
sequences submitted/published in December 2006,
17 million sequences being generated at the
Venter Institute, 6 million proteins are being
submitted from the GOS (Global Ocean Sampling)
trip
15
Protein sequences: what is sequenced?
Currently about 3.5 to 4.0 million known protein
sequences
More than 99% of these are derived by translation
of nucleotide sequences
Less than 1%: direct protein sequencing (Edman,
MS/MS)
-> It is important that users know where the
protein sequence comes from (sequence and gene
prediction quality)!
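Since almost all known protein sequences come from translating a coding sequence, it may help to see what that translation step looks like. The following is a minimal sketch assuming the standard genetic code only and a complete CDS that starts at the initiator codon; the function name `translate_cds` is hypothetical, and real pipelines also handle alternative genetic codes, partial CDS and ambiguous bases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTGGTAA"))  # prints MAKW
```

The translation itself is mechanical; the uncertainty the slide warns about lies upstream, in whether the CDS boundaries and exons were predicted correctly.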
16

Level of DNA/RNA sequence quality
DNA/RNA sequencing quality (genome or WGS, cDNA
or EST)
Gene prediction quality: programs used, is there
manual intervention afterwards?
For example:
Authors can specify the nature of the CDS in the
nucleotide databases by using the qualifiers
"/evidence=experimental" or
"/evidence=not_experimental".
Very rarely done

17
The hectic life of a sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Public nucleic acid databases
EMBL, GenBank, DDBJ
if the submitters provide an annotated Coding
Sequence (CDS)
Public protein sequence databases
18
CDS CoDing Sequence (CDS)
CDS provided by the submitters
The first Met !
CDS translation provided by EMBL
19
Data not submitted
Complete genome (submitted): only 1,858 CDS
available!
20
Issue for the users: the protein database jungle
22
The hectic life of a sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Scientific publications derived sequences
EMBL, GenBank, DDBJ
CoDing Sequences provided by submitters
TrEMBL GenPept
RefSeq
PRF
PIR
UniProtKB
IPI
Swiss-Prot
UniParc
Manually annotated
EnsEMBL
PDB
CCDS
Also gene prediction
species-specific databases (EcoGene,
TubercuList, TIGR)
24
Major public protein sequence database sources
Integrated resources cross-references
PIR
PDB
PRF
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
Separated resources
UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (127,000 species)
GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Databank; 3D data and associated sequences
PRF: journal scan of published peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)
25
Other protein sequence databases
CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species)
Consensus human and mouse sequences between 4 institutions.
Combining different approaches (ab initio, by similarity) and
taking advantage of the expertise acquired by different
institutes, including manual annotation.
EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species)
Aligns some eukaryotic genomic sequences with all the sequences
found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL
(-> known genes). Also does some gene prediction (-> novel genes).
IPI: UniProtKB + RefSeq + EnsEMBL (+ H-InvDB, TAIR, VEGA) (7 species)
Provides a guide to the main databases that describe the human,
mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes.
26
The UniProt consortium
27
The UniProt Consortium
UniProt (Universal Protein Resource): the world's
most comprehensive catalogue of protein
information. www.uniprot.org, Wu et al. Nucleic
Acids Res. 34:D187-191 (2006).
Provides 3 databases
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and
Environmental Sequences)
28
The Universal Protein resource components
UniProt KnowledgeBase (UniProtKB), produced by SIB and EBI
UniProtKB Release 9.7 consists of:
UniProtKB/Swiss-Prot: manually annotated protein sequences, 260,000 entries, 10,000 species
UniProtKB/TrEMBL: computer-annotated protein sequences, 3,600,000 entries, 100,000 species

UniRef100, UniRef90, UniRef50, produced by PIR
Allows comprehensive BLAST similarity searches by providing sets of representative sequences
One UniRef100 entry: all identical sequences (including fragments).
One UniRef90 entry: sequences that have at least 90% identity.
One UniRef50 entry: sequences that have at least 50% identity.
Independent of species.

UniProt Archives, produced by EBI: 8,000,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
30
The Universal Protein resource components
UniProt KnowledgeBase (UniProtKB), produced by SIB and EBI
UniProtKB/Swiss-Prot: manually annotated protein sequences, 260,000 entries, 11,000 species
UniProtKB/TrEMBL: computer-annotated protein sequences, 3,900,000 entries, 127,000 species

UniRef100, UniRef90, UniRef50, produced by PIR
Allows comprehensive BLAST similarity searches by providing sets of representative sequences
One UniRef100 entry: all identical sequences (including fragments).
One UniRef90 entry: sequences that have at least 90% identity.
One UniRef50 entry: sequences that have at least 50% identity.
Independent of species.

UniProt Archives, produced by EBI: 8,000,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
32
UniProt web sites
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful search engine.
33
http://beta.uniprot.org/
Test it! Logon: guest, Password: amazing
34
The UniProt groups from SIB, EBI and PIR
(Antibes, September 2004)
In Geneva (SIB): 2 Group Leaders, 44 Annotators,
4 Prosite annotators, 22 Programmers and
Researchers, 5 Administrators and science
communicators, 3 System Administrators,
4 Students, 1 GISAID. Total: 85 people
At PIR: 1 Group Leader, 13 Protein Science Team,
12 Informatics Team. Total: 26 people
At EBI (Swiss-Prot + EMBL + TrEMBL): 75 people
(29 Annotators)
35

UniProtKB has biweekly releases available from
about 100 servers, the main sources being ExPASy
and www.uniprot.org

36
UniProtKB: From EMBL (DNA) to TrEMBL (protein)
37
Gene/protein name
Taxonomy
Reference
CDS
TrEMBL
Automated extract of the protein sequence (CDS),
gene name, taxonomy and references. Automated
annotation (KWs and protein family).
EMBL
38
! TrEMBL does not translate DNA sequences, nor
does it use gene prediction programs: it only
takes the existing CDS proposed by the submitting
authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived
protein sequences can be experimentally proven or
derived from gene prediction programs (this is
not obvious from the TrEMBL entry).
TrEMBL does not validate any sequences.
39
!!!! The quality of UniProtKB/TrEMBL data is
directly dependent on the information provided by
the submitter of the original nucleotide entry.
40
UniProtKB: From TrEMBL to Swiss-Prot
41
CDS
Manual annotation of the sequence and associated
biological information (derived from literature,
external experts, databases)
Automated extraction of the protein sequence
(CDS), gene name and references. Automated
annotation.
TrEMBL
Annotation of sequence differences (conflicts,
variants, splicing)
Average of 6 independent sequence reports for
each human protein
EMBL
42
Distinguishing Swiss-Prot and TrEMBL

A TrEMBL entry is a computer-annotated record
derived from a coding sequence (CDS) in the
nucleotide sequence databases, not in Swiss-Prot,
after some redundancy removal and automated
annotation.
A Swiss-Prot entry is a manually annotated record
for a given protein.

43
UniProtKB: From TrEMBL to Swiss-Prot
Step 1: Sequence check
44
UniProtKB/Swiss-Prot
Non-redundant: 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and
amino acid) derived from the same gene
-> decreases redundancy and improves sequence
reliability
ii) Annotation of the sequence differences
(including conflicts, polymorphisms, splice
variants etc.)
-> annotation of protein diversity
45
Redundancy
UniProtKB/Swiss-Prot 11,000 species
UniProtKB/TrEMBL 127,000 species

260,000 3,800,000 ? 3,600,000

Redundancy in TrEMBL
Redundancy between TrEMBL and Swiss-Prot
In the future redundancy is going to decrease:
"new" genome sequencing -> "new" proteins
46
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
47
All alternatively spliced sequences are available
for BLAST searches and protein identification
tools, and are downloadable.
Human: 2/3 of the human genes are alternatively
spliced
48
- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR
49
Multiple alignment of the available clpB sequences
51
Within Swiss-Prot?

A snapshot of the situation (December 2006)
28,200 entries with 82,000 sequence conflicts
2,600 entries with corrected frameshifts
15,100 entries with corrected initiation sites
4,300 entries with other sequence problems.
At least 43,000 entries (19% of Swiss-Prot)
required a minimal amount of annotation effort to
obtain the correct sequence.

52
Quality of protein information from genome
projects

Proteins originating from different genome
projects
Drosophila: what a curated (thanks to FlyBase)
genome effort should look like; only 1.8% of the
gene models conflict with what we have in
UniProtKB/Swiss-Prot
Arabidopsis: a genome where lots of work was done
to annotate it when it was sequenced, but where
nothing has been done since (at least in the
public view); 19.5% of the gene models are
erroneous
Tetraodon nigroviridis: a quick and dirty
automatic run through a genome with no manual
intervention; >90% of the gene models produce
incorrect proteins.
Bacteria and Archaea have almost no splicing, so
prediction is easier; however, errors are still
made

Producing a clean set of sequences is not a
trivial task
It is not getting easier as more and more types
of sequence data are submitted
It is important to pursue our efforts in making
sure we provide to our users the most correct set
of sequences for a given organism.

54
New Protein existence evidence tag

As most protein sequences are derived from
translation of nucleotide sequence and are only
predictions, the new PE line indicates whether
there is any evidence that proves the existence
of a protein
The Protein existence evidence will have 5
different qualifiers
Evidence at protein level
Evidence at transcript level
Inferred from homology
Predicted
- Unassigned (used mostly in TrEMBL)

55
Righting the wrongs
Sequences are rarely deposited in a mature state:
as with all scientific research, DNA and protein
annotation is a continual process of learning,
revision and corrections.
Sequencing error rates: 1 base in 10,000
"Making people aware of errors is good and great;
making people aware that they're responsible also
for correcting errors is even greater."
C. Hardley, EMBO reports, 4(9), 2003.
56
UniProtKB: From TrEMBL to Swiss-Prot
Step 2: Annotation (literature, controlled
vocabulary)
57
Annotation

The focal point of the efforts to maintain and
develop UniProtKB/Swiss-Prot
It is becoming more and more important as it
provides
a summary of what is known about a protein
creates template for automatic annotation for the
many organisms whose genome sequence is/will be
available but whose proteins will not be
characterized
provides well annotated (corpus) entries to train
literature mining tools (text mining).

58

Source of data
- publications (> 1,700 journals cited)
- also external scientific expertise and other
databases

59
Comments: structured free text, 27 defined topics
Manually annotated
Information from papers, specialized databases,
computer prediction, external experts, brain
storming
Distinction between data obtained experimentally
and computerized inferences
60
UniProtKB: From TrEMBL to Swiss-Prot
Step 3: Sequence analysis (bioinformatics tools)
61
The annotation platform

Annotators could not work without the help of
our software developers

62
Anabelle much more than a domain annotation
platform
64
We manually check the results !
65
What else is in a UniProtKB/Swiss-Prot entry?
66
Cross-references a central hub
Gasteiger E. et al, Curr. Issues Mol. Biol.
3:47-55 (2001)
www.expasy.org/cgi-bin/lists?dbxref.txt

Swiss-Prot was the first database with
X-references
Explicitly X-referenced to 85 databases
DNA (EMBL/GenBank/DDBJ),
3D-structure (PDB)
Family and domain (InterPro, HAMAP, PROSITE,
Pfam, etc.)
genomic (OMIM, MGI, FlyBase, SGD, SubtiList,
etc.)
2D-gel (e.g. SWISS-2DPAGE)
specialized db (e.g.GlycoSuiteDB, PhosSite,
MEROPS)
literature (PubMed)
Each UniProtKB/Swiss-Prot entry can be seen as a
central hub for the data available about the
protein it describes

67
UniProtKB/Swiss-Prot explicit links
Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN
Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Enzyme and pathway databases: BioCyc, Reactome
Sequence databases: EMBL, PIR, UniGene
2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC
Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR
3D structure databases: HSSP, PDB, SMR
PTM databases: GlycoSuiteDB, PhosSite
68
Implicit cross-references on new web server and
ExPASy

Implicit X-references to 26 additional db added
by the ExPASy server on the www (i.e. GeneCards,
ModBase, etc.)
These X-refs are not present as hard-coded DR
lines in the Swiss-Prot entry as it can be
downloaded by ftp, but are added on the fly when
someone views an entry on ExPASy. This can be
done because enough information is present in the
UniProtKB entry to access the related information
in another db.
Example: all Swiss-Prot/TrEMBL entries are linked
to the BLOCKS domain db via the Swiss-Prot/TrEMBL
accession number

69
Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further
facilitate information retrieval via controlled
vocabularies
70
In a UniProtKB/Swiss-Prot entry, you can expect
to find

All the names of a given protein (and of its
gene)
Its biological origin with links to the taxonomic
databases
A selection of references
A summary of what is known about the protein
function, alternative products, PTM, tissue
expression, disease, 3D-structures, etc.
Numerous cross-references
Selected keywords
A description of important sequence features
domains, PTMs, variations, etc.
A (often corrected) protein sequence and the
description of various isoforms/variants.

71
Monitoring entry history The UniProtKB
Sequence/Annotation Version archive
73
and many useful links
74
And on the new website
other tools are not yet available
75
UniProt Knowledgebase

Swiss-Prot Manually annotated section
TrEMBL Automatically annotated section

76
Distinguishing Swiss-Prot and TrEMBL
78
Accession number: to be used when you cite a
UniProt entry anywhere (never cite the entry
name (ID) alone)
79
Non-Redundant Complete Proteome Sets

Text search: UniProtKB keyword "Complete
proteome", combined with an organism name
Or download precomputed sets (bacteria, archaea,
some eukaryotes):
ftp://ftp.expasy.org/databases/complete_proteomes/entries
Or EBI Integr8: http://www.ebi.ac.uk/integr8/

80
Swiss-Prot annotation priorities

The main annotation programs
HAMAP (High quality Automated and Manual
Annotation of microbial Proteomes bacteria,
archaea, plastids)
HPI (Human Proteomics Initiative)
PPAP (Plant Proteome Annotation Project)
FPAP (Fungal Proteome Annotation Project)
Viral proteins
Tox-Prot (Toxin Annotation Project)
ENZYMES (proteins with EC numbers)
PTMs
3D-structure
Protein-protein interactions
Quality assurance, includes controlled
vocabularies

81
Model organisms

Organisms for which we want to have a more
in-depth coverage
Completeness, links with specialized databases,
specific documents
Examples E.coli, B.subtilis, human, mouse,
fruitfly, C.elegans, yeast, S.pombe, A.thaliana.

82
Human Proteomics Initiative (HPI)
83
From genome to proteome
genome
21,000 human genes
Considerable increase in complexity
84
In the case of human genes, the Swiss-Prot/TrEMBL
redundancy is still very high: 15,803 + 53,100
-> about 20,000
Human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human
gene products, but has not identified
significant numbers of unpredicted proteins
86
Post-translational modifications (PTMs)
87
PTM definition
a post-translational modification or PTM is
a modification of a polypeptide chain involving
the making or the breaking of covalent bond(s)
that occurs during (co-translational class) or
after translation.
88
PTMs influence or even define protein function

phosphorylation and possibly GlcNAcylation and
S-nitrosylation are a means of transducing
extracellular signals to the inside of the cells.
methylation has a role in nuclear protein import.
lipid addition allows protein to membrane
association (e.g. GPI-anchor, myristate,
palmitate).
intrachain disulfide bonds and N-glycosylation
influence protein folding.
interchain disulfide bonds bind subunits
together.
other PTMs are directly involved in the protein
function, as for example the binding of cofactors
(e.g. pyridoxal phosphate), or the synthesis of a
cofactor by the modification of amino acids
present in the protein (e.g. quinones).

89
PTM variety
Each protein can be modified at various sites,
which gives a high number of alternative
peptides.
283 different protein modifications are annotated
in UniProtKB/Swiss-Prot
90
Large scale experiments (LSE) for PTMs!

PTM information can now be obtained from results
of proteomics large scale experiments (LSE)
In the past 12 months we have added about 6000
experimental PTMs using data originating from
some of these projects.

AMB, SP20
91

Proteomic studies have led to the updating of
2767 human
Swiss-Prot entries, mainly with PTM information
(UniProt release 10.0, March 2007)

92
Bacteria and Archaea (HAMAP)
93
In 2006, 130 new bacterial and archaeal genomes
(not WGS) were submitted to the DNA databases.
If on "average" 4,000 proteins/genome
-> >500,000 proteins!
How to cope????
94

High quality
Automated and
Manual
Annotation of microbial
Proteomes

Lots of microbial genomes, lots of proteins. What
should we do with them in UniProt?
95
http://www.expasy.org/unirule/MF_00319
96
Automatic annotation of proteins belonging to
specified families (1)

This program requires the continuous development
and adaptation of software tools as well as the
development of a database of annotation rules for
each family (so far about 1,400).

Allows us to annotate automatically, yet with a
very high level of quality, proteins that belong
to well defined protein families
Can be applied to both characterized proteins and
to some UPFs (Uncharacterized Protein Family)

The families are based on UniProtKB/Swiss-Prot
entries, so we first do all the annotation steps
described earlier!
99
www.expasy.ch/sprot/hamap/
Using HAMAP, we can currently annotate to
Swiss-Prot quality level between 10% and 50% of a
complete microbial proteome (next step: HAMAP for
Fungi)
100
Updates

DNA sequence archives
EMBL/GenBank/DDBJ is an archive
All submitted data goes into the archive
Submitters are responsible for the submitted
sequences and the accompanying annotation
Nobody else can change them (including the
curators at EMBL/GenBank/DDBJ)
Protein sequence databases
UniProtKB/Swiss-Prot is NOT an archive
Swiss-Prot chooses what goes into the database
and where to place it
Swiss-Prot updates annotation and sequences when
necessary

101
ZB SYP, 28-NOV-2003 ALB, 16-NOV-2004 MIM,
31-Jan-2006 ZB BER, 13-FEB-2006 LYG,
14-JUN-2006 LYG, 21-SEP-2006 ZB CHH,
05-DEC-2006
102

User updates or annotation requests

103
Accessing and Searching UniProtKB

Direct access (keyword search)
New search tool: we'll use it later
Sequence Retrieval System (SRS, Europe), will
disappear
Entrez (NCBI, USA): UniProtKB/Swiss-Prot (not
TrEMBL) is integrated in GenPept, but with a
changed format, and some information (e.g.
implicit cross-references) is missing
Query tools on ExPASy and UniProt
(http://www.expasy.org/sprot/,
http://www.uniprot.org)
Indirect access (sequence search)
Bioinformatics sequence analysis tools (Blast,
Fasta, GCG, Emboss, MS Identification tools)

104
Downloading the UniProt Knowledgebase
http://www.expasy.org/sprot/download.html

Swiss-Prot and TrEMBL form a complete,
non-redundant database, the UniProt Knowledgebase
Can be downloaded from
ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
In Swiss-Prot format, fasta or xml format
Complemented by sequences of alternative splice
isoforms
everything about all proteins! (at least all
CDS submitted to the public nucleotide sequence
databases)

105
If you want to develop tools to work with your
local copy of UniProtKB

Swissknife: a Perl parser for UniProtKB
Constantly updated according to latest format
changes
Advantage: you do not need to know how exactly
the information is stored in the flat file
http://swissknife.sourceforge.net/
ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
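
To give a feel for what such a parser hides, here is a minimal Python sketch (not Swissknife, and not a complete parser) that reads only the ID, AC and PE line types from UniProtKB-style flat-file text. The two records in `SAMPLE` are made up for illustration, and the function name `parse_entries` is hypothetical; for real work a maintained parser is the safer choice, since the flat-file format changes over time.

```python
import re

# Two made-up records in UniProtKB flat-file style (not real entries).
SAMPLE = """\
ID   EXAMPLE_HUMAN           Reviewed;         105 AA.
AC   P00000; Q00000;
PE   1: Evidence at protein level;
//
ID   Q00001_ECOLI            Unreviewed;       250 AA.
AC   Q00001;
PE   4: Predicted;
//
"""


def parse_entries(text):
    """Yield one dict per entry, reading only the ID, AC and PE lines."""
    entry = {"accessions": []}
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            fields = value.split()
            entry["entry_name"] = fields[0]
            # "Reviewed" marks Swiss-Prot, "Unreviewed" marks TrEMBL.
            entry["reviewed"] = fields[1].rstrip(";") == "Reviewed"
        elif code == "AC":
            # AC lines may repeat; the first accession is the primary one.
            entry["accessions"] += [ac for ac in re.split(r";\s*", value) if ac]
        elif code == "PE":
            entry["protein_existence"] = value.rstrip(";")
        elif code == "//":
            yield entry
            entry = {"accessions": []}


for e in parse_entries(SAMPLE):
    section = "Swiss-Prot" if e["reviewed"] else "TrEMBL"
    print(e["accessions"][0], e["entry_name"], section, e["protein_existence"])
```

The sketch keys each record on its primary accession rather than its entry name, in line with the advice to cite the accession number.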

106
Take home message

Swiss-Prot is the non redundant, manually
annotated and highly cross-referenced section of
the UniProt Knowledgebase
Be aware of the differences between
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
Computer vs. Human
Redundant vs. Non-redundant
Always cite the Accession number, not the entry
name
The AC is stable
The entry name might change

108
UniRef100, 90 and 50 clusters

One UniRef100 entry -> all identical sequences
from UniProtKB and some sections of UniParc
(including fragments, Swiss-Prot splice
variants).
One UniRef90 entry -> sequences that have at
least 90% identity.
One UniRef50 entry -> sequences that are at least
50% identical.

109
UniRef100, 90 and 50 clusters

One cluster can contain sequences of several
species, clustering is done independently of the
organism
Each cluster has a representative, reference
sequence, preferably that of the best-annotated
Swiss-Prot entry
UniRef identifiers are of the form
UniRef100_P99999, UniRef50_P00414. They are not
stable, as clusters are recomputed with every
biweekly release, and cluster representatives can
change!
UniRef is useful for comprehensive BLAST sequence
searches by providing sets of representative
sequences.
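
The idea behind a UniRef100-style cluster can be sketched in a few lines. This is a simplified illustration, not the UniRef production procedure: it groups only exactly identical sequences (it does not merge fragments, and UniRef90/UniRef50 would additionally need a sequence identity calculation), and it approximates "best-annotated" by "reviewed". The records and the function name `cluster_identical` are made up.

```python
from collections import defaultdict

# (accession, is_reviewed, sequence): made-up records for illustration only.
records = [
    ("A00001", False, "MKTAYIAKQR"),
    ("A00002", True, "MKTAYIAKQR"),
    ("A00003", False, "MKTAYIAKQR"),
    ("A00004", False, "MSDNELLKQA"),
]


def cluster_identical(records):
    """Group identical sequences; prefer a reviewed entry as representative."""
    by_sequence = defaultdict(list)
    for accession, reviewed, sequence in records:
        by_sequence[sequence].append((accession, reviewed))

    clusters = {}
    for members in by_sequence.values():
        # Reviewed members sort first; sorted() is stable, so ties keep input order.
        representative = sorted(members, key=lambda m: not m[1])[0][0]
        clusters[f"UniRef100_{representative}"] = [acc for acc, _ in members]
    return clusters


print(cluster_identical(records))
```

Because the cluster name is derived from the representative, adding a reviewed entry to a cluster renames it, which illustrates why such identifiers are not stable between releases.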

110
Implicit cross-link UniProtKB to UniRef
new web view
112
UniParc the UniProt Archive

8.8 million sequences
Sequences and cross-references (AC numbers)
A comprehensive collection of the raw protein
sequences in public databases (including those
not submitted to the DNA databases)
Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI,
PDB, RefSeq, FlyBase, WormBase, Patent Offices.
UniParc can be used to track sequence versions

Use with extreme caution: also contains
pseudogenes, incorrect CDS predictions, etc. and
is highly redundant!
113
UniParc tracks a protein sequence and its
integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
114
UniParc entry UPI0000033477 part 2
116
www.expasy.ch/prosite
117
PROSITE

A database of protein families and domains using
two kinds of motif descriptors
Patterns or regular expressions
User friendly (easy to understand and to use)
Well designed for the detection of biologically
meaningful sites such as residues playing a
structural or functional role
Can be used to scan a protein database in
reasonable time on any computer
Generalized profiles or weight matrices
Well adapted to cover the full length of the
protein or domain
Are able to detect highly divergent families or
domains with only a few well conserved positions
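
PROSITE patterns map closely onto regular expressions, which is why they can scan a database quickly on any computer. The sketch below converts the core of the pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` for terminal anchors) into a Python regex. The function name `prosite_to_regex` is hypothetical, and rarer syntax such as an anchor inside square brackets is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by a period
    regex = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition, e.g. x(2) or x(2,4).
        core, _, repeat = element.partition("(")
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        if repeat:
            core += "{" + repeat.rstrip(")") + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)


# N-glycosylation site pattern, scanned against a made-up sequence.
n_glyc = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
match = n_glyc.search("MKANGSAVL")
print(match.start(), match.group())  # prints 3 NGSA
```

Note that `re.search` and `re.finditer` report non-overlapping matches only, so a dedicated scanner may report more hits than this sketch.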

118
Identification of protein domains and families

There are two non-exclusive approaches for the
determination of the function of an
uncharacterized protein
Comparison with a complete sequence database
(BLAST)
Scanning a database of patterns and profiles
Most proteins can be grouped into families.
Proteins belonging to a particular family share
functional attributes and are derived from a
common ancestor
Some regions in the sequence are more conserved
than others during evolution because they are
important for the function or the structure of
the protein
Like fingerprints for police identification,
signatures built out of sequence patterns or
profiles can be used to formulate hypotheses
about the function of uncharacterized proteins.

119
Definitions of conserved regions

Conserved regions can be classified into 5
different groups
Families proteins that have the same domain
arrangement, be 1 or many domains.
Domains specific combination of secondary
structures that assume characteristic three
dimensional structures or folds.
Repeats structural units always found in two or
more copies that assemble in specific fold.
Assemblies of repeats might also be thought of as
domains.
Motifs short regions with conserved active- or
binding-sites that usually adopt a folded
conformation only in association with their
ligands.
Sites functional residues (active sites,
disulfide bridges, post-translationally modified
residues)

120
Conserved regions (2)
Binding cleft (motif)
Cys 181 active site residue
PPID family 1 CSA_PPIASE domain 3 TPR repeat
121
http://www.expasy.org/tools/scanprosite/
123
Functionally and structurally relevant residues
in PROSITE motif descriptors: a new concept to
extract more information from profiles

Principle
Combining the advantages of profiles (high
sensitivity) and patterns (position-specific
information)
Tagging of amino acids at precise positions in
the profile and checking their presence in the
matched sequence

124
ProRule

Aim
Provide users with biologically meaningful
functional and structural information
active sites,
post-translational modification sites,
binding sites,
disulfide bonds,
transmembrane regions.
Help the UniProtKB/Swiss-Prot annotation and
provide enhanced homogeneity
domain name and boundaries,
keywords and linked GO terms,
EC numbers,
false negative PROSITE patterns.

125
www.expasy.ch/prosite/prorule.html
Sigrist et al. Bioinformatics 21:4060-4066 (2005)
126
Other methods for protein/domain identification
Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD:
Hidden Markov Models (HMM), probabilistic models
PRINTS: unweighted matrices (protein
fingerprints)
BLOCKS: weight matrices derived from ungapped
alignments
PIRSF, SUPERFAMILY: classification systems based
on evolutionary relationship of whole proteins
ProDom: automatic compilation of homologous
domains based on recursive PSI-BLAST searches.
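To make the idea of a weight matrix derived from an ungapped alignment concrete, here is a minimal Python sketch. It is not the BLOCKS procedure: the uniform background frequencies, the simple pseudocount, the function names and the toy alignment block are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_weight_matrix(block, pseudocount=1.0):
    """Build a log-odds weight matrix (one dict per column) from an ungapped block."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in an ungapped block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for col in range(length):
        counts = Counter(seq[col] for seq in block)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(matrix, segment):
    """Sum the column weights of a segment that has the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))

# Hypothetical ungapped block of four aligned segments.
block = ["GDSGG", "GDSGG", "GDAGG", "GNSGG"]
matrix = build_weight_matrix(block)
print(score(matrix, "GDSGG"), score(matrix, "AAAAA"))
```

Residues that are frequent in a column receive positive weights and unseen residues receive negative ones, so a segment resembling the block scores higher than an unrelated one.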
127
The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein
Families, Domains and Functional Sites
128
The InterPro project
www.ebi.ac.uk/interpro

Unification of PROSITE, PRINTS, Pfam and ProDom
into an integrated resource of protein families,
domains and functional sites in 2000
Joint effort in creating a unified yet
methodologically diverse system for protein
family/domain identification
Single set of documents linked to the various
methods
Distributed with tools by anonymous FTP and
through www servers
Used to enhance the functional annotation of
UniProtKB (Swiss-Prot and TrEMBL)
Has progressively incorporated other databases

129
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam,
PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF,
SCOP-based SUPERFAMILY, Gene3D and PANTHER, and
the current UniProt/Swiss-Prot + TrEMBL data
(for details see
http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains 13,953 entries,
representing 3,911 domains, 9,610 families, 232
repeats, 34 active sites, 20 binding sites and 19
post-translational modification sites. Overall,
there are 15,880,845 InterPro hits from 3,100,874
UniProtKB protein sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein
sequences have one or more InterPro hits.

130
http://www.ebi.ac.uk/interpro/
131
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
132
InterPro Graphical domain representation
133
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
134
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
135
The ExPASy www server
www.expasy.org

First molecular biology server on the Web (August
1993); 500 million accesses since then
Dedicated to proteomics
Databases: UniProtKB, PROSITE, Swiss-2DPAGE,
etc.
Many 2D/MS protein identification/characterization
and sequence analysis tools
Mirror sites in Australia, Brazil, Canada, China
and Korea: http://[au|br|ca|cn|kr|www].expasy.org

137
ExPASy software tools

Tools for the display and management of databases
(NiceProt, Swiss-Shop sequence alerting system,
etc.)
Tools for sequence analysis (ScanProsite,
ProtParam, ProtScale, RandSeq, Translate, etc.)
Proteomics tools (AACompIdent, FindMod, FindPept,
Aldente, PeptideMass, TagIdent, etc.)
3D-structure analysis and display tools
(Swiss-Model, Swiss-PDBviewer)

138
Identification: Aldente, TagIdent, AACompIdent,
MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph,
PeptideCutter, ProtScale, ProtParam
http://www.expasy.org/tools/
- Use annotation in Swiss-Prot and TrEMBL
(preprocessing, PTMs, etc.)
- Hyperlinks between tools and databases
139
http://www.expasy.org/links.html
140
Finding out about recent developments
UniProtKB/Swiss-Prot recent format changes:
http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes:
http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins:
http://www.expasy.org/swiss-flash/
What's new on ExPASy:
http://www.expasy.org/history.html
141
References (1)

UniProtKB/Swiss-Prot
http://www.expasy.org/sprot/sprot-ref.html
Wu C. et al. The Universal Protein Resource
(UniProt): an expanding universe of protein
information. Nucleic Acids Res.
34:D187-191 (2006).
Boeckmann B. et al. Protein variety and
functional diversity: Swiss-Prot annotation in
its biological context. Comptes Rendus Biologies
328:882-99 (2005).
Bairoch A. Swiss-Prot: Juggling between evolution
and stability. Brief. Bioinform. 5:39-55 (2004).
Farriol-Mathis N. et al. Annotation of
post-translational modifications in the
Swiss-Prot knowledgebase. Proteomics
4:1537-1550 (2004).
Gasteiger E. et al. Swiss-Prot: Connecting
biological knowledge via a protein database.
Curr. Issues Mol. Biol. 3:47-55 (2001).

142
References (2)
PROSITE
Hulo N., et al., The PROSITE database.
Nucleic Acids Res. 34:D227-D230 (2006).
Sigrist C.J.A., et al., PROSITE: a documented
database using patterns and profiles as motif
descriptors. Brief. Bioinform. 3:265-274 (2002).
Gattiker A., et al., ScanProsite: a reference
implementation of a PROSITE scanning tool.
Applied Bioinformatics 1:107-108 (2002).
Sigrist C.J.A., et al., ProRule: a new database
containing functional and structural information
on PROSITE profiles. Bioinformatics
21:4060-4066 (2005).
ExPASy
Gasteiger E. et al. ExPASy: the proteomics server
for in-depth protein knowledge and analysis.
Nucleic Acids Res. 31:3784-3788 (2003).
143
Useful general publications

Nucleic Acids Res. Database issue 2006, vol. 34,
supplement 1:
http://nar.oupjournals.org/content/vol34/suppl_1/
Nucleic Acids Res. Web server issue 2005, vol.
33, supplement 2:
http://nar.oupjournals.org/content/vol33/suppl_2/
Book: Bioinformatics for Dummies, by J.-M.
Claverie and C. Notredame
Publisher: For Dummies, 2nd edition (December
2006)
ISBN: 0764516965

144
Take home message
Or via the website
145
Before the introduction to Swiss-Prot/ExPASy
After the introduction to Swiss-Prot /ExPASy
146

Some practical exercises
http://education.expasy.org/cours/Tunis/
Finding databases
Comparing protein databases
Comparing BLAST programs
BLAST output
Bacterial start sites
UniRef
Different views of UniProtKB
Environmental sequences
Inter-database links PROSITE
InterPro
Using UniProtKB/Swiss-Prot to create datasets
