Classifying the protein universe - PowerPoint PPT Presentation

1 / 43

About This Presentation

Title:

Classifying the protein universe

Description:

Protein families are defined by homology: In a family, everyone is related to everyone ... describes essential features of the protein family ... – PowerPoint PPT presentation

Number of Views:172

Avg rating:3.0/5.0

Slides: 44

Provided by: ekhidnaBi

Category:

more less

Transcript and Presenter's Notes

Title: Classifying the protein universe

1
Classifying the protein universe
Ashwin Sivakumar
Wu et al, 2002. EMBO J 195740-5751
2
Domain Analysis and Protein Families

Introduction
What are protein families?
Protein families
Description Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

3
Protein Families

Protein families are defined by homology
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor

4
Homology versus Similarity

Homologous proteins have similar 3D structures
and (usually) share common ancestry
1chg and 1sgt ? 31 identity, 43 similarity
We can infer homology from similarity!

Superfamily Trypsin-like Serine Proteases
5
Homology versus Similarity

But Homologous proteins may not share sequence
similarity

Superfamily Trypsin-like Serine Proteases
1chg and 1sgc ? 15 identity, 25 similarity We
cannot infer similarity from homology
6
Homology versus Similarity

Similar sequences may not have structural
similarity

1chg and 2baa ? 30 similarity, 140/245 aa We
cannot assume homology from similarity!
7
Homology versus Similarity

Summary
Sequences can be similar without being homologous
Sequences can be homologous without being similar

Evolution / Homology
BLAST Similarity
Families ??
8
Domain Analysis and Protein Families

Introduction
What are protein families?
Protein families
Description Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

9
Description of a Protein Family

Lets assume we know some members of a protein
family
What is common to them all?
Multiple alignment!

10
Describing Sequences in a Protein Family

As a motif or rule
describes essential features of the protein
family
catalytic residues, important structural residues
As a profile
describes variability in the family alignment

11
Techniques for searching sequence databases
to Some common strategies to uncover common
domains/motifs of biological significance that
categorize a protein into a family Pattern - a
deterministic syntax that describes multiple
combinations of possible residues within a
protein string Profile - probabilistic
generalizations that assign to every segment
position, a probability that each of the 20 aa
will occur

12
Consensus - mathematical probability that a
particular amino acid will be located at a given
position. Probabilistic pattern constructed
from a MSA. Opportunity to assign penalties for
insertions and deletions PSSM - (Position
Specific Scoring Matrix) Represents the
sequence profile in tabular form Columns of
weights for every aa corresponding to each column
of a MSA.
13
HMMs

Hidden Markov Models are Statistical methods that
consider all the possible combinations of
matches, mismatches, and gaps to generate a
consensus (Higgins, 2000)
Sequence ordering and alignments are not
necessary at the onset (but in many cases
alignments are recommended)
More the number of sequences better the models.
One can Generate a model (profile/PSSM), then
search a database with it (Eg PFAM)

14
Motif Description of a Protein Family

Regular expressions

........C.............S...L..I..DRY..I............
...........W... I E
W V
/ C x13 S x3 LI x2 I x2 DE R YW
x2 IV x10 x12 W /
x AC-IK-NP-TVWY
15
Motif Description of a Protein Family

Database PROSITE
PROSITE is a database of protein families and
domains. It is based on the observation that,
while there is a huge number of different
proteins, most of them can be grouped, on the
basis of similarities in their sequences, into a
limited number of families. Proteins or protein
domains belonging to a particular family
generally share functional attributes and are
derived from a common ancestor. It is apparent,
when studying protein sequence families, that
some regions have been better conserved than
others during evolution. These regions are
generally important for the function of a protein
and/or for the maintenance of its
three-dimensional structure. By analyzing the
constant and variable properties of such groups
of similar sequences, it is possible to derive a
signature for a protein family or domain, which
distinguishes its members from all other
unrelated proteins.
http//au.expasy.org/prosite/prosite_details.html

16
Automated Motif Discovery

Given a set of sequences
GIBBS Sampler
http//bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?d
ata_typeprotein
MEME
http//meme.sdsc.edu/meme/
PRATT
http//www.ebi.ac.uk/pratt
TEIRESIAS
http//cbcsrv.watson.ibm.com/Tspd.html

17
Automated Profile Generation

Any multiple alignment is a profile!
PSIBLAST
Algorithm
Start from a single query sequence
Perform BLAST search
Build profile of neighbours
Repeat from 2
Very sensitive method for database search

18
PSI-Blast

Starts with a sequence, BLAST it,
align select results to query sequence, estimate
a profile with the MSA, search database with the
profile - constructs PSSM
Iterate until process stabilizes
Focus here is on domains, not entire sequences
Greatly improves sensitivity

19
PSIBLAST

Position Specific Iterative Blast

20
Benchmarking a motif/profile

You have a description of a protein family, and
you do a database search
Are all hits truly members of your protein
family?
Benchmarking

TP true positive TN true negative FP false
positive FN false negative
Result
family member
Dataset
not a family member
unknown
21
Benchmarking a motif/profile

Precision / Selectivity
Precision TP / (TP FP)
Sensitivity / Recall
Sensitivity TP / (TP FN)
Balancing both
Precision 1, Recall 0 easy but useless
Precision 0, Recall 1 easy but useless
Precision 1, Recall 1 perfect but very
difficult

22
Domain Analysis and Protein Families

Introduction
What are protein families?
Protein families
Description Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

23
The Modular Architecture of Proteins

BLAST search of a multi-domain protein

24
What are domains?

Functional - from experiments
example Decay Accelerating Factor (DAF) or CD55

Has six domains (units)
4x Sushi domain (complement regulation)
1x ST-rich stalk
1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)

P Williams et al (2003) Mapping CD55 Function. J
Biol Chem 278(12) 10691-10696
25
There is only so much we can conclude

Classifying domains To aid structure prediction
(predict structural domains, molecular function
of the domain)
Classifying complete sequences (predicting
molecular function of proteins, large scale
annotation)
Majority of proteins are multi-domain proteins.

26
What are domains?

Structural - from structures

MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQ
GMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVY
GQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRL
DCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELI
YANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAE
VAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLN
LAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVL
HRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE
Are these domains? Yes - structural domains!
1phh
M A Marti-Renom (2003) Identification of
Structural Domains in Proteins. DIMACS, Rutgers
University, Piscataway, NJ, Feb 27 2003.
27
What are domains?

Mobile Sequence Domains

Protein 1 Protein 2 Protein 3 Protein 4
Mobile module
28
Domains are...

...evolutionary building blocks
Families of evolutionarily-related sequence
segments
Domain assignment often coupled with
classification
With one or more of the following properties
Globular
Independently foldable
Recurrence in different contexts
To be precise,
we say protein family
we mean protein domain family

29
Example global alignment

Phthalate dioxygenase reductase (PDR_BURCE)
Toluene - 4 -monooxygenase electron transfer
component (TMOF_PSEME)

Global alignment fails! Only aligns largest
domain.
30
Sometimes even more complex!
PGBM_HUMAN Basement membrane-specific heparan
sulphate proteoglycan core protein precursor
980 1960 2940 3920 4391
45 domains of 9 different type, according to PFam
http//www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.
pl?nameP98160 http//www.glycoforum.gr.jp/science
/word/proteoglycan/PGA09E.html
31
Domain Analysis and Protein Families

Introduction
What are protein families?
Protein families
Description Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification

32
Categories of Domain Definitions
Sequence(continuous domains)
Structure(discontinuous domains)
PFAM
SCOP
Curated
SMART
CATH
PROSITE
PRINTS
ADDA
DALI PUU DETEKTIVE DOMAINPARSER 1
2 DIAL STRUDL DOMAK
DOMO TRIBE-MCL GENERAGE SYSTERS PROTOMAP
Automatic
33
Pfam-Protein family database

Families of HMM profiles built from hand
curated multiple alignments. (Pfam A)
Pfam A covers 7973 protein families.
You can search your sequence against these
profiles to decipher family membership for your
sequence.

7973
34
Sequence Space Graph

Why we need to consider domains

Sequence
Alignment

Topology
80 of all sequences in one giant component
10 smaller groups
10 in singletons

35
Automatic domain definitions

Rely on alignment information
Alignment information is unreliable
Incomplete sequences (fragments)
Spurious alignments
Conserved motifs in mostly disordered region
How to remove the noise?

UREA_CANEN three domain protein
36

Sequence Space Graph
Where to cut connections?
What is real, what is noise?
Precision vs Sensitivity

37
ADDA

HolmGroup in-house database!
http//ekhidna.biocenter.helsinki.fi9801/sqgraph/
pairsdb
Classification of non-redundant sequences
100 level 1562243 sequences, 2697368 domains
40 level 479740 sequences, 827925 domains
PFAM-A benchmark
Sensitivity 87 (average unification in single
cluster)
Selectivity 98 (average purity of cluster)
Coverage 100 (all known proteins) Pfam 50

38
Example ABC transporter
PFAM
PRODOM
DOMO
ADDA
UniProt id CFTR_BOVIN
39
Properties of domains

Most domains size approx 75 200 residues

40
So, you have a sequence...

...look it up in existing database
SRS http//srs.ebi.ac.uk
INTERPRO http//www.ebi.ac.uk/interpro
...search against existing family descriptions
PFAM http//www.sanger.ac.uk/Software/Pfam
SMART http//smart.embl-heidelberg.de
PRINTS http//bioinf.man.ac.uk/dbbrowser/PRINTS
PROSITE http//us.expasy.org/prosite
...look it up in ADDA

41
Manually Curated Protein Family Databases

PFAM (Hidden Markov Models)
http//www.sanger.ac.uk/Software/Pfam
SMART (Hidden Markov Models)
http//smart.embl-heidelberg.de
PROSITE (Regular Expressions, Profiles)
http//au.expasy.org/prosite
PRINTS (combination of Profiles)
http//bioinf.man.ac.uk/dbbrowser/PRINTS

42
Why a multiple alignment?

With a multiple alignment, we can
guess which residues are important
secondary structure prediction
transmembrane segments prediction
homology modelling
guide to wet-lab EXPERIMENTATION!
build a motif/profile and find more family
members
build phylogenetic trees

Multiple Alignments are THE central object in
protein sequence analysis!
43
From sequence to function
3-motif resource The server seems to be down
today!
Methylmalanoyl CoA Decarboxylase Pattern
ILV-x(3)-E-x(7)-V-GA-x-IVL-x-L-N-R-P mapped
on the structure of 1DUB. Ball representation in
pink shows the potential ligands and its binding
pockets. The balls in blue represent the residues
making up the motif on the known structure.

Write a Comment

User Comments (0)