Sequence analysis: Macromolecular motif recognition - PowerPoint PPT Presentation

Provided by: bioche2

Transcript and Presenter's Notes

Title: Sequence analysis: Macromolecular motif recognition

1
Sequence analysis: Macromolecular motif recognition

Sylvia Nagl

2
DNA sequence
Automatic translation
Amino acid primary sequence
Physico-chemical properties (e.g., using the EMBOSS suite)
Primary database searches: FASTA, BLAST
1. Search for sequence homologue(s) and construct an alignment
2. Homologue(s) with known 3D structure available? Homology modelling
3. Motif recognition: search secondary databases
Secondary structure prediction
Fold assignment
3
Terminology

Motif: the biological object one attempts to model, such as a functional or structural domain, active site, phosphorylation site, etc.
Pattern: a qualitative motif description based on a regular expression-like syntax
Profile: a quantitative motif description that assigns a degree of similarity to a potential match

4
Active site recognition
Example: cathepsin A, peptidase family S10, EC 3.4.16.5
3-D representation
3D profile (PROCAT)
5
Active site motifs
Conserved seq patterns
1ac5: 438 LTFVSVYNASHMVPFDKS 455
1ivy: 419 IAFLTIKGAGHMVPTDKP 436
6
Domain recognition
Kringle domain from plasminogen protein
EGF-like domain from coagulation factor X
7
Macromolecular motif recognition

Why search for motifs?
to find homologous sequences
apply existing information to new sequence
find functionally important sites
to find templates for homology modelling (see the lecture on homology modelling)

8
Different analysis methods
[Figure: percent identity scale (100 down to 0, in steps of 10) against method of analysis. Labels, in the order given: automatic pairwise alignment (BLAST, FASTA); macromolecular motif recognition; twilight zone; structure prediction; midnight zone.]
9
Macromolecular motif recognition

What do we need?
Method for defining motifs
Algorithm for finding them
Statistics to evaluate matches

10
Macromolecular motif recognition

Methods for defining motifs
Regular expression (patterns)
Profiles
Hidden Markov Model (HMM)

11
Macromolecular motif recognition
1-D representation: primary amino acid sequence
MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGL
AKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSL
DGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKF
YATNDTEVAQSNFEALQDFFRLFPEYKNNKL...
Computational sequence analysis
Query secondary databases over the Internet
http://www.ebi.ac.uk/interpro/
12
Macromolecular motif recognition

Single motif: exact regular expression (PROSITE)
Full domain alignment: profile (PROSITE), Hidden Markov Model (Pfam, PROSITE)
Multiple motifs: residue frequency matrices (PRINTS)
13
Active site motifs
Conserved seq patterns
1ac5: 438 LTFVSVYNASHMVPFDKS 455
1ivy: 419 IAFLTIKGAGHMVPTDKP 436
14
Motif modelling methods
PROSITE regular expressions
CARBOXYPEPT_SER_HIS: [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched.

15
Regular expressions contd.

Basic rules for regular expressions
Each position is separated by a hyphen -
A symbol X is a regular expression matching
itself
x means any residue
[ ] surround ambiguities: a string [XYZ] matches any of the enclosed symbols
A string [R]* matches any number of strings that match [R]
{ } surround forbidden residues
( ) surround repeat counts
Model formation
Restricted to key conserved features in order to
reduce the noise level
Built by hand in a stepwise fashion from multiple
alignments
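The rules above map almost directly onto conventional regular expression syntax. The following is a minimal sketch, not the PROSITE scanning software itself: the function name `prosite_to_regex` is hypothetical, the pattern is written with the square brackets of standard PROSITE notation, and anchors such as `<` and `>` are not handled. It checks the pattern against the two active-site segments shown earlier for 1ac5 and 1ivy.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):          # positions are separated by hyphens
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                          # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # { } surround forbidden residues
        # [ ] ambiguities and single residues are already valid regex syntax.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


CARBOXYPEPT_SER_HIS = (
    "[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]"
)
regex = re.compile(prosite_to_regex(CARBOXYPEPT_SER_HIS))

for name, segment in [("1ac5", "LTFVSVYNASHMVPFDKS"), ("1ivy", "IAFLTIKGAGHMVPTDKP")]:
    print(name, bool(regex.fullmatch(segment)))
```

Both 18-residue segments satisfy the pattern position by position, which is what one would expect if the pattern was built from an alignment containing them.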

16
Regular expressions contd.
Regular expressions, such as PROSITE patterns,
are matched to primary amino acid sequences using
finite state automata.
Matching is all-or-none: a sequence either matches the pattern or it does not.
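To illustrate the all-or-none behaviour, here is a small sketch of a linear finite state automaton in which state i accepts one residue from a fixed set and moves to state i + 1. The six-position pattern used, [SAGV]-[SG]-H-x-[IVAQ]-P, is only a fragment of the CARBOXYPEPT_SER_HIS pattern chosen for brevity, and the helper names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One set of permitted residues per automaton state.
FRAGMENT = [frozenset(s) for s in ("SAGV", "SG", "H", AMINO_ACIDS, "IVAQ", "P")]


def accepts(automaton, segment):
    """Run the automaton over one segment; any failed transition rejects it."""
    state = 0
    for residue in segment:
        if state == len(automaton) or residue not in automaton[state]:
            return False                 # no transition available: reject outright
        state += 1
    return state == len(automaton)       # accept only if the final state is reached


def scan(automaton, sequence):
    """Return the 0-based start of every window the automaton accepts."""
    width = len(automaton)
    return [i for i in range(len(sequence) - width + 1)
            if accepts(automaton, sequence[i:i + width])]


print(scan(FRAGMENT, "LTFVSVYNASHMVPFDKS"))   # native 1ac5 segment
print(scan(FRAGMENT, "LTFVSVYNASNMVPFDKS"))   # same segment with H replaced by N
```

A single substitution at the conserved histidine removes the hit entirely. There is no partial score, which is the limitation that profiles and HMMs address.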
17
Motif modelling methods
PRINTS: residue frequency matrices

Motif 1
NPESWTNFANMLW
NPYSWVNLTNVLW
REYSWHQNHHMIY
NEGSWISKGDLLF
NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY
NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY
NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY
NPYSWNLIANVLY
NEYRWNKVANVLF

Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY

Motif 3
FFQHFPEYQTNDFHIAGESYAGHYIP
FFNKFPEYQNRPFYITGESYGGIYVP
WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP

Motif 4
LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE

a collection of protein fingerprints that
exploit groups of motifs to build characteristic
family signatures
motifs are encoded in ungapped raw sequence
format
different scoring methods may be superimposed onto the data, e.g. BLAST
improved diagnostic reliability
mutual context provided by motif neighbours
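As a rough illustration of how an ungapped motif becomes a residue frequency matrix, the sketch below counts the residues in each column of Motif 2 from this slide. The function names are hypothetical, and the raw-count score is one simple choice among the scoring methods that could be superimposed; it is not claimed to be the scoring PRINTS itself applies.

```python
from collections import Counter

# Motif 2 as listed on the slide (15 ungapped sequences, 11 residues each).
MOTIF_2 = [
    "LDQPFGTGYSQ", "VDNPVGAGFSY", "VDQPVGTGFSL", "VDQPGGTGFSS", "IDNPVGTGFSF",
    "IDQPTGTGFSV", "VDQPLGTGYSY", "IDQPAGTGFSP", "LESPIGVGFSY", "LDQPVGSGFSY",
    "LDQPVGSGFSY", "LDQPINTGFSN", "LDQPIGAGFSY", "LDAPAGVGFSY", "LDQPVGAGFSY",
]


def frequency_matrix(motif):
    """Return one Counter of residue counts per column of an ungapped motif."""
    width = len(motif[0])
    if any(len(seq) != width for seq in motif):
        raise ValueError("motif sequences must be ungapped and of equal length")
    return [Counter(column) for column in zip(*motif)]


def frequency_score(matrix, segment):
    """Sum the observed counts of each residue of the segment, column by column."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal motif width")
    return sum(column[residue] for column, residue in zip(matrix, segment))


matrix = frequency_matrix(MOTIF_2)
for position, column in enumerate(matrix, start=1):
    print(position, dict(column.most_common(3)))
```

Columns 4 and 8 are invariant (P and G respectively), while others tolerate a handful of alternatives. A fingerprint combines several such matrices, so a weak hit to one motif can be rescued by its neighbours.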

18
Motif modelling methods
PROSITE profiles
Feature is represented as a matrix with a score for every possible character. Matrix is derived from a sequence alignment, e.g.

F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
19
Profiles contd.
Derived matrix (rows: amino acids; columns: alignment positions)

     1    2    3    4    5    6    7    8    9   10
A  -18  -10   -1   -8    8   -3    3  -10   -2   -8
C  -22  -33  -18  -18  -22  -26   22  -24  -19   -7
D  -35    0  -32  -33   -7    6  -17  -34  -31    0
E  -27   15  -25  -26   -9   23   -9  -24  -23   -1
F   60  -30   12   14  -26  -29  -15    4   12  -29
G  -30  -20  -28  -32   28  -14  -23  -33  -27   -5
H  -13  -12  -25  -25  -16   14  -22  -22  -23  -10
I    3  -27   21   25  -29  -23   -8   33   19  -23
K  -26   25  -25  -27   -6    4  -15  -27  -26    0
L   14  -28   19   27  -27  -20   -9   33   26  -21
M    3  -15   10   14  -17  -10   -9   25   12  -11
N  -22   -6  -24  -27    1    8  -15  -24  -24   -4
P  -30   24  -26  -28  -14  -10  -22  -24  -26  -18
Q  -32    5  -25  -26   -9   24  -16  -17  -23    7
R  -18    9  -22  -22  -10    0  -18  -23  -22   -4
S  -22   -8  -16  -21   11    2   -1  -24  -19   -4
T  -10  -10   -6   -7   -5   -8    2  -10   -7  -11
V    0  -25   22   25  -19  -26    6   19   16  -16
W    9  -25  -18  -19  -25  -27  -34  -20  -17  -28
Y   34  -18   -1    1  -23  -12  -19    0    0  -18
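A quick way to see how such a matrix is used is to score a 10-residue window by summing, for each alignment position, the entry for the residue found there. The sketch below transcribes the matrix values from this slide; the function names are hypothetical, and gaps are deliberately ignored here.

```python
# Position-specific scores transcribed from the slide: one row per amino acid,
# one column per alignment position (1 to 10).
PROFILE = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}
WIDTH = 10


def profile_score(segment):
    """Ungapped score: sum of the matrix entries for each residue at its position."""
    if len(segment) != WIDTH:
        raise ValueError("segment must span all 10 alignment positions")
    return sum(PROFILE[residue][pos] for pos, residue in enumerate(segment))


def best_window(sequence):
    """Return (score, 0-based start) of the highest-scoring ungapped window."""
    return max((profile_score(sequence[i:i + WIDTH]), i)
               for i in range(len(sequence) - WIDTH + 1))


print(profile_score("FKLLSHCLLV"))        # first sequence of the alignment
print(best_window("AAAFKLLSHCLLVAAA"))    # same segment embedded in flanking alanines
```

Unlike a regular expression, every window receives a score, so a near-miss is ranked rather than discarded.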
20
Profiles contd.

inclusion of all possible information to maximise
overall signal of protein/domain
i.e., a full representation of features in the aligned sequences
can detect distant relationships with only few
well conserved residues
position-dependent weights/penalties for all 20
amino acids -- BASED ON AMINO ACID SUBSTITUTION
MATRICES -- and for gaps and insertions
dynamic programming algorithms for scoring hits
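The dynamic programming step can be sketched as a global alignment of a sequence against the profile columns. This is a simplified illustration under stated assumptions: the three-column profile and its scores are hypothetical toy values, and a single uniform gap penalty stands in for the position-dependent gap and insertion penalties that real profiles carry.

```python
# Hypothetical toy profile: one dict of residue scores per column.
TOY_PROFILE = [
    {"F": 6, "Y": 3},
    {"K": 3, "P": 2},
    {"L": 2, "I": 2, "V": 2},
]
MISMATCH = -3   # assumed score for residues not listed in a column
GAP = -4        # assumed uniform penalty for an insertion or a deletion


def align_to_profile(profile, sequence, mismatch=MISMATCH, gap=GAP):
    """Best global alignment score of a sequence against the profile columns."""
    n, m = len(profile), len(sequence)
    # dp[i][j]: best score using the first i columns and the first j residues.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap               # leading columns all deleted
    for j in range(1, m + 1):
        dp[0][j] = j * gap               # leading residues all inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + profile[i - 1].get(sequence[j - 1], mismatch)
            delete = dp[i - 1][j] + gap  # column i has no residue
            insert = dp[i][j - 1] + gap  # residue j sits between columns
            dp[i][j] = max(match, delete, insert)
    return dp[n][m]


print(align_to_profile(TOY_PROFILE, "FKL"))    # ungapped match
print(align_to_profile(TOY_PROFILE, "FAKL"))   # one inserted residue
print(align_to_profile(TOY_PROFILE, "FL"))     # one deleted column
```

Making the gap penalty a per-column value, as the slide describes, would only change the two lines that add `gap` inside the loop.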

21
Macromolecular motif recognition

Pfam and PROSITE Hidden Markov Models (HMMs)
Feature is represented by a probabilistic model of interconnecting match, delete or insert states
contains statistical information on observed and expected positional variation: a "platonic ideal" of the protein family

[Figure: HMM state diagram with states labelled B, Mi, Ii, Di and E]
22
Macromolecular motif recognition
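Pfam and PROSITE Hidden Markov Models (HMMs)
Probability (P) that a given amino acid occurs in a particular state (M, I, D) at a particular position in the sequence (for all 20 amino acids, profile-like)
Probability (P) of a transition between states
[Figure: HMM state diagram with states labelled B, Mi, Ii, Di and E]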
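The two kinds of probability combine multiplicatively along a path through the model. The sketch below computes the log probability of a sequence along the all-match path of a two-position model. Every number in it is a hypothetical illustration, and insert and delete states are left out to keep it short, so it is not a full profile HMM.

```python
import math

# Hypothetical two-position model; every number below is illustrative only.
EMISSIONS = {
    "M1": {"G": 0.7, "S": 0.2, "A": 0.1},   # P(residue | match state 1)
    "M2": {"H": 0.9, "N": 0.1},             # P(residue | match state 2)
}
TRANSITIONS = {("B", "M1"): 0.9, ("M1", "M2"): 0.8, ("M2", "E"): 0.9}
MATCH_PATH = ["B", "M1", "M2", "E"]


def match_path_log_prob(sequence):
    """Log probability of emitting the sequence along B -> M1 -> M2 -> E."""
    if len(sequence) != len(MATCH_PATH) - 2:
        raise ValueError("the all-match path emits exactly one residue per match state")
    # Transition probabilities along the path.
    log_p = sum(math.log(TRANSITIONS[step]) for step in zip(MATCH_PATH, MATCH_PATH[1:]))
    # Emission probabilities in each match state.
    for state, residue in zip(MATCH_PATH[1:-1], sequence):
        p = EMISSIONS[state].get(residue, 0.0)
        if p == 0.0:
            return float("-inf")         # residue never observed in this state
        log_p += math.log(p)
    return log_p


print(match_path_log_prob("GH"))
print(match_path_log_prob("SN"))
```

A full model would also sum or maximise over paths that pass through insert and delete states, typically with the forward or Viterbi algorithm.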
23
Statistical significance

Statistical tests aim to assess the likelihood
that a match of a query sequence to a profile,
regular expression, HMM, etc, is the result of
chance.
They control for such factors as sequence (match)
length, amino acid composition and size of the
database searched.

24
Statistical significance

log-odds score: this number is the log of the ratio between two probabilities: P that the sequence belongs to the positive set, and P that the result was obtained by chance due to the amino acid distribution in the positive set (random model).
Z-score: one needs to estimate an average score and a standard deviation as a function of sequence length. Then, one uses the number of standard deviations each sequence is away from the average as the score.
e-value (Expect value): given a database search result with alignment score S, the e-value is the expected number of sequences of score > S that would be found by random chance.
p-value: the probability that one or more sequences of score > S would have been found randomly.
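These four quantities are related by short formulas, sketched below. Several choices here are assumptions rather than anything stated on the slide: the log-odds score is taken in base 2 (bits), the Z-score uses a single set of background scores instead of a length-dependent fit, and the p-value is derived from the e-value by assuming the number of chance hits is Poisson distributed.

```python
import math
from statistics import mean, stdev


def log_odds(p_model, p_random):
    """Log (base 2, an assumption) of the ratio of model to random probability."""
    return math.log2(p_model / p_random)


def z_score(score, background_scores):
    """Number of sample standard deviations a score lies above the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


def p_value_from_e_value(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)


print(log_odds(0.5, 0.125))
print(z_score(14, [8, 10, 12]))
print(p_value_from_e_value(0.01))
```

For small e-values the p-value and e-value are nearly equal, which is why the two are often quoted interchangeably for strong hits.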

25
INTERPRO

The InterPro database allows efficient searching.
An integrated annotation resource for protein families, domains and functional sites that amalgamates the efforts of the PROSITE, PRINTS, Pfam, ProDom, SMART and TIGRFAMs secondary database projects.
http://www.ebi.ac.uk/interpro
