1
Classifying the protein universe
Ashwin Sivakumar
Wu et al., 2002. EMBO J 19:5740-5751
2
Domain Analysis and Protein Families
  • Introduction
  • What are protein families?
  • Protein families
  • Description Definition
  • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification

3
Protein Families
  • Protein families are defined by homology
  • In a family, everyone is related to everyone
  • Everybody in a family shares a common ancestor

4
Homology versus Similarity
  • Homologous proteins share common ancestry and
    (usually) have similar 3D structures
  • 1chg and 1sgt: 31% identity, 43% similarity
  • We can infer homology from similarity!
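
Identity and similarity percentages like the ones above are read off a pairwise alignment. A minimal sketch of that calculation, assuming two already-aligned sequences of equal length with '-' for gaps. The similarity groups and the example sequences are illustrative assumptions, not the scheme or data behind the 1chg/1sgt figures.

```python
# Hypothetical groups of "similar" residues, chosen for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"),
                  set("ST"), set("NQ"), set("AG")]

def percent_identity_similarity(a, b):
    """Return (% identity, % similarity) over aligned columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared

# Invented aligned pair, not real chymotrypsinogen/trypsin data.
identity, similarity = percent_identity_similarity("IVGGYTCGAN", "IVGGFSC-KQ")
print(f"{identity:.0f}% identity, {similarity:.0f}% similarity")
```

Other tools may normalise by alignment length or by the shorter sequence instead, so percentages from different programs are not always directly comparable.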

Superfamily Trypsin-like Serine Proteases
5
Homology versus Similarity
  • But homologous proteins may not share sequence
    similarity

Superfamily Trypsin-like Serine Proteases
1chg and 1sgc: 15% identity, 25% similarity
We cannot infer similarity from homology!
6
Homology versus Similarity
  • Similar sequences may not have structural
    similarity

1chg and 2baa: 30% similarity, 140/245 aa
We cannot assume homology from similarity!
7
Homology versus Similarity
  • Summary
  • Sequences can be similar without being homologous
  • Sequences can be homologous without being similar

Evolution / Homology
BLAST Similarity
Families ??
8
Domain Analysis and Protein Families
  • Introduction
  • What are protein families?
  • Protein families
  • Description Definition
  • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification

9
Description of a Protein Family
  • Let's assume we know some members of a protein
    family
  • What is common to them all?
  • Multiple alignment!

10
Describing Sequences in a Protein Family
  • As a motif or rule
  • describes essential features of the protein
    family
  • catalytic residues, important structural residues
  • As a profile
  • describes variability in the family alignment

11
Techniques for searching sequence databases
Some common strategies to uncover common domains/motifs of
biological significance that categorize a protein into a family:
  • Pattern - a deterministic syntax that describes multiple
    combinations of possible residues within a protein string
  • Profile - probabilistic generalizations that assign to every
    segment position a probability that each of the 20 aa
    will occur

12
  • Consensus - mathematical probability that a particular
    amino acid will be located at a given position
  • Probabilistic pattern constructed from an MSA
  • Opportunity to assign penalties for insertions and deletions
  • PSSM (Position Specific Scoring Matrix) - represents the
    sequence profile in tabular form
  • Columns of weights for every aa corresponding to each column
    of an MSA
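
A PSSM as described above can be sketched in a few lines. This is a minimal illustration, assuming a gapless alignment, a uniform background frequency of 1/20 and a simple additive pseudocount; real tools use measured background frequencies, sequence weighting and gap handling. The toy alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, segment):
    """Sum the column weights for a segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

msa = ["DRYLA", "DRYVA", "ERYLS", "DRWIA"]  # toy alignment, for illustration
pssm = build_pssm(msa)
print(round(score(pssm, "DRYLA"), 2), round(score(pssm, "GGGGG"), 2))
```

A segment that resembles the alignment gets a positive score, an unrelated one a negative score; the pseudocount keeps unseen residues from scoring minus infinity.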
13
HMMs
  • Hidden Markov Models are statistical methods that
    consider all the possible combinations of
    matches, mismatches, and gaps to generate a
    consensus (Higgins, 2000)
  • Sequence ordering and alignments are not
    necessary at the outset (but in many cases
    alignments are recommended)
  • The more sequences, the better the model
  • One can generate a model (profile/PSSM), then
    search a database with it (e.g. Pfam)
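
To give a feel for how an HMM weighs alternative explanations of a sequence, here is a minimal Viterbi sketch. It is a generic two-state HMM, not the match/insert/delete profile HMM architecture used for protein families; the states ("core", "loop"), the H/P alphabet and all probabilities are invented for illustration, and every probability is assumed to be non-zero.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, most probable state path)."""
    # best[t][s]: best log probability of any path ending in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, logp = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = logp + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])  # follow back-pointers
    path.reverse()
    return best[-1][last], path

states = ["core", "loop"]
start = {"core": 0.5, "loop": 0.5}
trans = {"core": {"core": 0.8, "loop": 0.2}, "loop": {"core": 0.2, "loop": 0.8}}
emit = {"core": {"H": 0.9, "P": 0.1}, "loop": {"H": 0.1, "P": 0.9}}
print(viterbi("HHHPPP", states, start, trans, emit)[1])
```

Working in log space avoids numerical underflow on long sequences, which matters as soon as the model is applied to full-length proteins.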

14
Motif Description of a Protein Family
  • Regular expressions

........C.............S...L..I..DRY..I.......................W...
                          I     E W  V
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} x{12} W /
x = [AC-IK-NP-TVWY]
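
The motif above can be tried directly as a regular expression. A sketch in Python, assuming the slide's notation means x13 = thirteen arbitrary residues and LI = "L or I" at one position (brackets and braces appear to have been lost in the transcript). The trailing "x10 x12 W" part is left out because its intended meaning is unclear, and the test sequence is invented.

```python
import re

# x = any of the 20 standard amino acids, written as the slide's character class
X = "[AC-IK-NP-TVWY]"

MOTIF = re.compile(
    "C" + X + "{13}" + "S" + X + "{3}" + "[LI]" + X + "{2}" + "I" + X + "{2}"
    + "[DE]R[YW]" + X + "{2}" + "[IV]"
)

def find_motif(sequence):
    """Return the (start, end) span of the first motif match, or None."""
    match = MOTIF.search(sequence.upper())
    return match.span() if match else None

# Invented sequence containing one instance of the motif.
seq = ("MKT" + "C" + "A" * 13 + "S" + "GGG" + "L" + "TT" + "I" + "AA"
       + "DRY" + "VV" + "I" + "KKK")
print(find_motif(seq))
```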
15
Motif Description of a Protein Family
  • Database PROSITE
  • PROSITE is a database of protein families and
    domains. It is based on the observation that,
    while there is a huge number of different
    proteins, most of them can be grouped, on the
    basis of similarities in their sequences, into a
    limited number of families. Proteins or protein
    domains belonging to a particular family
    generally share functional attributes and are
    derived from a common ancestor. It is apparent,
    when studying protein sequence families, that
    some regions have been better conserved than
    others during evolution. These regions are
    generally important for the function of a protein
    and/or for the maintenance of its
    three-dimensional structure. By analyzing the
    constant and variable properties of such groups
    of similar sequences, it is possible to derive a
    signature for a protein family or domain, which
    distinguishes its members from all other
    unrelated proteins.
  • http://au.expasy.org/prosite/prosite_details.html

16
Automated Motif Discovery
  • Given a set of sequences
  • GIBBS Sampler
  • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
  • MEME
  • http://meme.sdsc.edu/meme/
  • PRATT
  • http://www.ebi.ac.uk/pratt
  • TEIRESIAS
  • http://cbcsrv.watson.ibm.com/Tspd.html

17
Automated Profile Generation
  • Any multiple alignment is a profile!
  • PSI-BLAST
  • Algorithm
    1. Start from a single query sequence
    2. Perform BLAST search
    3. Build profile of neighbours
    4. Repeat from 2
  • Very sensitive method for database search

18
PSI-BLAST
  • Start with a sequence and BLAST it
  • Align selected results to the query sequence, estimate
    a profile (PSSM) from the MSA, then search the database
    with the profile
  • Iterate until the process stabilizes
  • Focus here is on domains, not entire sequences
  • Greatly improves sensitivity
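
The iterate-until-stable idea can be shown with a toy loop. This is only a sketch of the principle, not PSI-BLAST itself: it assumes equal-length, ungapped sequences, a uniform background and a fixed score threshold, whereas the real program uses BLAST heuristics, gapped alignments and E-value cut-offs. The function names and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_from(seqs, pseudocount=1.0):
    """Column-wise log-odds scores against a uniform background."""
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({aa: math.log2((counts[aa] + pseudocount) / total
                                      * len(AMINO_ACIDS))
                        for aa in AMINO_ACIDS})
    return profile

def iterative_search(query, database, threshold, max_rounds=10):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    members = [query]
    for _ in range(max_rounds):
        profile = profile_from(members)
        hits = [s for s in database
                if sum(col[aa] for col, aa in zip(profile, s)) >= threshold]
        new_members = [query] + [s for s in hits if s != query]
        if set(new_members) == set(members):
            break  # the hit list has stabilised
        members = new_members
    return members

database = ["DRYLA", "DRYVA", "ERWVA", "GGGGG"]  # invented toy database
print(iterative_search("DRYLA", database, threshold=3.0))
```

In this toy database the distant member ERWVA scores below the threshold against the query alone and is only picked up once DRYVA has been folded into the profile, which is the sensitivity gain the slide refers to.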

19
PSI-BLAST
  • Position-Specific Iterative BLAST

20
Benchmarking a motif/profile
  • You have a description of a protein family, and
    you do a database search
  • Are all hits truly members of your protein
    family?
  • Benchmarking

TP: true positive, TN: true negative, FP: false positive, FN: false negative
[Figure: search result compared with the dataset; labels: family member, not a family member, unknown]
21
Benchmarking a motif/profile
  • Precision / Selectivity
  • Precision = TP / (TP + FP)
  • Sensitivity / Recall
  • Sensitivity = TP / (TP + FN)
  • Balancing both
  • Precision = 1, Recall = 0: easy but useless
  • Precision = 0, Recall = 1: easy but useless
  • Precision = 1, Recall = 1: perfect but very
    difficult
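
The two measures are easy to compute once the hits and the known family members are available as sets. A minimal sketch; returning 0.0 when a ratio is undefined (no hits, or an empty family) is a convention chosen here, and the identifiers are made up.

```python
def precision_recall(hits, family):
    """Return (precision, recall) for search hits against known family members."""
    hits, family = set(hits), set(family)
    tp = len(hits & family)   # hits that are family members
    fp = len(hits - family)   # hits that are not family members
    fn = len(family - hits)   # family members that were missed
    precision = tp / (tp + fp) if hits else 0.0
    recall = tp / (tp + fn) if family else 0.0
    return precision, recall

# Invented example: 4 hits, 3 of them correct, out of 5 true members.
print(precision_recall({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"}))
```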

22
Domain Analysis and Protein Families
  • Introduction
  • What are protein families?
  • Protein families
  • Description Definition
  • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification

23
The Modular Architecture of Proteins
  • BLAST search of a multi-domain protein

24
What are domains?
  • Functional - from experiments
  • Example: Decay Accelerating Factor (DAF) or CD55
  • Has six domains (units)
  • 4x Sushi domain (complement regulation)
  • 1x ST-rich stalk
  • 1x GPI anchor (membrane attachment)
  • PDB entry 1ojy (sushi domains only)

P Williams et al. (2003) Mapping CD55 Function. J
Biol Chem 278(12):10691-10696
25
There is only so much we can conclude
  • Classifying domains: to aid structure prediction
    (predicting structural domains, molecular function
    of the domain)
  • Classifying complete sequences: predicting
    molecular function of proteins, large-scale
    annotation
  • The majority of proteins are multi-domain proteins.

26
What are domains?
  • Structural - from structures

MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQ
GMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVY
GQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRL
DCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELI
YANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAE
VAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLN
LAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVL
HRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE
Are these domains? Yes - structural domains!
1phh
M A Marti-Renom (2003) Identification of
Structural Domains in Proteins. DIMACS, Rutgers
University, Piscataway, NJ, Feb 27 2003.
27
What are domains?
  • Mobile Sequence Domains

Protein 1 Protein 2 Protein 3 Protein 4
Mobile module
28
Domains are...
  • ...evolutionary building blocks
  • Families of evolutionarily-related sequence
    segments
  • Domain assignment often coupled with
    classification
  • With one or more of the following properties
  • Globular
  • Independently foldable
  • Recurrence in different contexts
  • To be precise,
  • we say protein family
  • we mean protein domain family

29
Example global alignment
  • Phthalate dioxygenase reductase (PDR_BURCE)
  • Toluene-4-monooxygenase electron transfer
    component (TMOF_PSEME)

Global alignment fails! Only aligns largest
domain.
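
Why a global alignment struggles when only one domain is shared can be seen by comparing global (Needleman-Wunsch) and local (Smith-Waterman) scores. A minimal sketch, assuming a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps; the two sequences are invented and are not PDR_BURCE or TMOF_PSEME.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global score, best local score) with a linear gap penalty."""
    prev_g = [j * gap for j in range(len(b) + 1)]  # Needleman-Wunsch row
    prev_l = [0] * (len(b) + 1)                    # Smith-Waterman row
    best_local = 0
    for i in range(1, len(a) + 1):
        cur_g = [i * gap] + [0] * len(b)
        cur_l = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur_g[j] = max(prev_g[j - 1] + s, prev_g[j] + gap, cur_g[j - 1] + gap)
            # local alignment may restart anywhere, hence the floor at 0
            cur_l[j] = max(0, prev_l[j - 1] + s, prev_l[j] + gap, cur_l[j - 1] + gap)
            best_local = max(best_local, cur_l[j])
        prev_g, prev_l = cur_g, cur_l
    return prev_g[len(b)], best_local

# Two invented proteins sharing the "domain" WLDRYCK in different contexts.
print(alignment_scores("AAAAWLDRYCK", "WLDRYCKGGGGGG"))
```

The local score reflects the shared segment alone, while the global score is dragged down by the unrelated flanking regions that have to be aligned or gapped.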
30
Sometimes even more complex!
PGBM_HUMAN: Basement membrane-specific heparan
sulphate proteoglycan core protein precursor
[Figure: domain architecture; sequence axis ticks at 980, 1960, 2940, 3920, 4391]
45 domains of 9 different types, according to Pfam
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
31
Domain Analysis and Protein Families
  • Introduction
  • What are protein families?
  • Protein families
  • Description Definition
  • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification

32
Categories of Domain Definitions
            Sequence (continuous domains)      Structure (discontinuous domains)
Curated     PFAM, SMART, PROSITE, PRINTS       SCOP, CATH
Automatic   ADDA, DOMO, TRIBE-MCL, GENERAGE,   DALI, PUU, DETEKTIVE, DOMAINPARSER 1 and 2,
            SYSTERS, PROTOMAP                  DIAL, STRUDL, DOMAK
33
Pfam - Protein family database
  • Families of HMM profiles built from hand-curated
    multiple alignments (Pfam-A)
  • Pfam-A covers 7973 protein families.
  • You can search your sequence against these
    profiles to decipher family membership for your
    sequence.
34
Sequence Space Graph
  • Why we need to consider domains

Sequence
Alignment
  • Topology
  • 80% of all sequences in one giant component
  • 10% in smaller groups
  • 10% in singletons
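
The topology figures above are component sizes of a graph whose nodes are sequences and whose edges are alignments. A minimal union-find sketch for computing such component sizes; the node names and edges are invented.

```python
from collections import Counter

def component_sizes(nodes, edges):
    """Return connected-component sizes, largest first, using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps trees shallow
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    sizes = Counter(find(n) for n in nodes)
    return sorted(sizes.values(), reverse=True)

nodes = list("ABCDEFGH")
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "G")]
print(component_sizes(nodes, edges))
```

A single multi-domain protein that aligns to two otherwise unrelated families is enough to merge their components, which is one way a giant component arises when whole sequences rather than domains are used as nodes.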

35
Automatic domain definitions
  • Rely on alignment information
  • Alignment information is unreliable
  • Incomplete sequences (fragments)
  • Spurious alignments
  • Conserved motifs in mostly disordered region
  • How to remove the noise?

UREA_CANEN: three-domain protein
36
  • Sequence Space Graph
  • Where to cut connections?
  • What is real, what is noise?
  • Precision vs Sensitivity

37
ADDA
  • Holm Group in-house database!
  • http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb
  • Classification of non-redundant sequences
  • 100% level: 1562243 sequences, 2697368 domains
  • 40% level: 479740 sequences, 827925 domains
  • PFAM-A benchmark
  • Sensitivity: 87% (average unification in single
    cluster)
  • Selectivity: 98% (average purity of cluster)
  • Coverage: 100% (all known proteins); Pfam: 50%
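
One plausible reading of the two benchmark measures is sketched below. It assumes "unification" is the fraction of a reference family's members found in the single cluster that holds most of them, and "purity" is the fraction of a cluster's members that belong to its dominant family; the actual ADDA benchmark may define them differently. All identifiers are invented.

```python
from collections import Counter, defaultdict

def unification_and_purity(cluster_of, family_of):
    """Return (average unification per family, average purity per cluster).

    cluster_of and family_of map each domain id to its cluster and to its
    reference family respectively.
    """
    by_family = defaultdict(list)
    by_cluster = defaultdict(list)
    for domain, family in family_of.items():
        cluster = cluster_of[domain]
        by_family[family].append(cluster)
        by_cluster[cluster].append(family)
    # unification: share of a family's members that sit in its best cluster
    unification = [Counter(c).most_common(1)[0][1] / len(c)
                   for c in by_family.values()]
    # purity: share of a cluster's members that come from its dominant family
    purity = [Counter(f).most_common(1)[0][1] / len(f)
              for f in by_cluster.values()]
    return sum(unification) / len(unification), sum(purity) / len(purity)

family_of = {"d1": "F1", "d2": "F1", "d3": "F1", "d4": "F1", "d5": "F2", "d6": "F2"}
cluster_of = {"d1": "c1", "d2": "c1", "d3": "c1", "d4": "c2", "d5": "c2", "d6": "c2"}
print(unification_and_purity(cluster_of, family_of))
```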

38
Example ABC transporter
PFAM
PRODOM
DOMO
ADDA
UniProt id CFTR_BOVIN
39
Properties of domains
  • Most domains are approximately 75-200 residues long

40
So, you have a sequence...
  • ...look it up in an existing database
  • SRS: http://srs.ebi.ac.uk
  • INTERPRO: http://www.ebi.ac.uk/interpro
  • ...search against existing family descriptions
  • PFAM: http://www.sanger.ac.uk/Software/Pfam
  • SMART: http://smart.embl-heidelberg.de
  • PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS
  • PROSITE: http://us.expasy.org/prosite
  • ...look it up in ADDA

41
Manually Curated Protein Family Databases
  • PFAM (Hidden Markov Models)
  • http://www.sanger.ac.uk/Software/Pfam
  • SMART (Hidden Markov Models)
  • http://smart.embl-heidelberg.de
  • PROSITE (Regular Expressions, Profiles)
  • http://au.expasy.org/prosite
  • PRINTS (combination of Profiles)
  • http://bioinf.man.ac.uk/dbbrowser/PRINTS

42
Why a multiple alignment?
  • With a multiple alignment, we can
  • guess which residues are important
  • secondary structure prediction
  • transmembrane segments prediction
  • homology modelling
  • guide to wet-lab EXPERIMENTATION!
  • build a motif/profile and find more family
    members
  • build phylogenetic trees

Multiple Alignments are THE central object in
protein sequence analysis!
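
"Guessing which residues are important" usually starts with per-column conservation. A minimal sketch using Shannon entropy, assuming an alignment given as equal-length strings; gaps are treated as just another symbol, and the toy alignment is invented.

```python
import math
from collections import Counter

def column_entropies(alignment):
    """Shannon entropy (bits) of each column; 0 means fully conserved."""
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        entropies.append(-sum((n / total) * math.log2(n / total)
                              for n in counts.values()))
    return entropies

def conserved_columns(alignment, max_entropy=0.0):
    """Return 0-based indices of columns with entropy <= max_entropy."""
    return [i for i, h in enumerate(column_entropies(alignment))
            if h <= max_entropy]

msa = ["DRYLA", "DRYVA", "ERYLS", "DRWIA"]  # toy alignment, for illustration
print(conserved_columns(msa))
```

Raising max_entropy admits columns with conservative variation, which is often closer to what a motif such as [DE]-R-[YW] captures than strict invariance.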
43
From sequence to function
3-motif resource (the server seems to be down today!)
Methylmalonyl-CoA decarboxylase pattern
[ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped
on the structure of 1DUB. Ball representation in
pink shows the potential ligands and their binding
pockets. The balls in blue represent the residues
making up the motif on the known structure.
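
A PROSITE-style pattern like the one above can be turned into an ordinary regular expression for scanning sequences. A minimal sketch covering a common subset of the syntax (x, single residues, [..] allowed sets, {..} forbidden sets, x(n) and x(n,m) repeats, < and > anchors). It assumes the pattern reads [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P, with the square brackets restored; the test sequence is invented and is not the 1DUB sequence.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

regex = prosite_to_regex("[ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P")
seq = "MA" + "I" + "AAA" + "E" + "A" * 7 + "V" + "G" + "A" + "I" + "A" + "LNRP" + "K"
print(regex, re.search(regex, seq).span())
```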