Proteiinianalyysi 3
Provided by: luh8

Transcript and Presenter's Notes

1
Proteiinianalyysi 3
  • Multiple sequence alignment
  • http://www.bioinfo.biocenter.helsinki.fi/downloads
    /teaching/spring2006/proteiinianalyysi

2
Multiple sequence alignment (msa)
  • A) define characters for phylogenetic analysis
  • B) search for additional family members

SeqA  N F L S
SeqB  N F S
SeqC  N K Y L S
SeqD  N Y L S

[Figure: phylogenetic tree relating NFLS, NFS, NKYLS and NYLS, with the character changes +K, -L and Y→F on its branches]
3
Complexity of performing msa
  • Sum of pairs (SP) score
  • Extension of sequence pair alignment by dynamic
    programming
  • O(L^N), where L is sequence length and N is number
    of sequences
  • The MSA algorithm searches the subspace bounded by
  • the heuristic multiple alignment (suboptimal) and
  • the pairwise optimal alignments
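The sum-of-pairs score above can be sketched in a few lines. This is a minimal illustration, not any particular program's scoring scheme: the match/mismatch/gap scores and the alignment of the four slide sequences are assumptions chosen for the example.

```python
from itertools import combinations

def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: add up the score of every sequence pair
    in every alignment column (scores here are hypothetical)."""
    score = 0
    for col in zip(*msa):                  # iterate over columns
        for a, b in combinations(col, 2):  # every pair of sequences
            if a == "-" and b == "-":
                continue                   # gap-gap pairs score 0
            elif a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score

# illustrative alignment of the four slide sequences
msa = ["N-FLS", "N-F-S", "NKYLS", "N-YLS"]
print(sp_score(msa))  # 1
```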

4
Pairwise projection
  • Pairwise optimal alignment
  • Greedy multiple alignment
  • Global msa optimum (sequences A, B, C, ...) is
    probably found in the area in-between

[Figure: alignment paths projected onto the Sequence A vs Sequence B plane]
5
Dynamic programming
[Figure: edit graph with nodes BEGIN, B, D and END]
  • Maximal path sum BEGIN → END?
  • Enumerate every path: brute force
  • Use induction: only one optimal path up to any
    node in the graph.

6
Example: all paths leading to B

[Figure: the example graph with numeric edge weights between BEGIN, B, D and END; the optimal path to B is found by comparing the weight sums of all paths leading to B]
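The induction idea from the previous slide can be sketched directly. The graph and its edge weights below are illustrative (the slide's exact numbers are not recoverable from the transcript); the point is that memoized induction and brute-force path enumeration agree.

```python
# Toy edit graph; edge weights are illustrative
edges = {
    "BEGIN": {"B": 3, "D": 7},
    "B": {"D": 1, "END": 8},
    "D": {"END": 2},
}

def brute_force(node):
    """Enumerate every path from node to END, yielding its weight sum."""
    if node == "END":
        yield 0
        return
    for nxt, w in edges[node].items():
        for rest in brute_force(nxt):
            yield w + rest

memo = {}
def best_path_sum(node):
    """Induction: only one optimal continuation from any node, so the
    best sum from a node is its best edge plus the best sum onward."""
    if node == "END":
        return 0
    if node not in memo:
        memo[node] = max(w + best_path_sum(nxt)
                         for nxt, w in edges[node].items())
    return memo[node]

print(max(brute_force("BEGIN")), best_path_sum("BEGIN"))  # both give 11
```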
7
Motivation: existing alignment methods
Heuristic alignment algorithms
8
Progressive alignment
  • 1. derive a guide tree from pairwise sequence
    distances
  • 2. align the closest (groups of) sequences
  • Pluses: works with many sequences
  • Minuses: errors made early on are frozen in the
    final alignment
  • Programs: Clustal, Pileup, T-Coffee
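Step 1 above can be illustrated with a toy distance computation. This sketch only orders sequences by edit distance; real programs such as Clustal derive distances from pairwise alignment scores and build a full guide tree, so this is an assumption-laden simplification.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

# the four sequences used on the following slides
seqs = ["TACAGTA", "TACGTC", "TATA", "CACCTA"]

# pairwise distances; the closest pair is aligned first
pairs = sorted(combinations(seqs, 2), key=lambda p: edit_distance(*p))
print(pairs[0])  # ('TACAGTA', 'TACGTC') is joined first by the guide tree
```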

9
Motivation: existing alignment methods
Progressive pairwise alignment

[Figure: a tree in which the ancestral sequence TACAGTA evolves through TACGTA into the observed sequences TACGTC, TACCTA, TATA and CACCTA; evolution runs forward in time, alignment progresses in the opposite direction]
10
Motivation: existing alignment methods
Progressive pairwise alignment

[Figure: step 1 aligns TACAGTA with TACGTC; TATA and CACCTA are added in steps 2 and 3 as the alignment progresses]
11
Motivation: existing alignment methods
Progressive pairwise alignment

[Figure: steps 2 and 3 add TATA and CACCTA to the alignment of TACAGTA and TACGTC, giving the final multiple alignment of TACAGTA, TACGTC, TATA and CACCTA]
12
Motivation: existing alignment methods
Profile HMMs
13
Iterative methods
  • Gotoh
  • inner loop uses progressive alignment
  • outer loop recomputes distances and guide tree
    based on the msa obtained
  • DIALIGN
  • identify high-scoring ungapped segment pairs
  • build up the msa using consistent segment pairs
  • greedy algorithm: DIALIGN2; exact: DIALIGN3

14
Motifs

  • several proteins are grouped together by
    similarity searches
  • they share a conserved motif
  • the motif is stringent enough to retrieve the
    family members from the complete protein database
15
Local msa
  • Profile analysis: a highly conserved portion of
    the alignment produces a type of scoring matrix
    called a profile. New sequences can be aligned to
    the profile using dynamic programming.
  • Block analysis: scans a global msa for ungapped
    regions, called blocks, and these blocks are then
    used in sequence alignments.
  • Pattern-searching or statistical methods: find
    recurrent regions of sequence similarity in a set
    of initially unaligned sequences.

16
TEIRESIAS
  • Finds all patterns with minimum support in an
    input set of unaligned sequences
  • Maximum spacer length
  • For example: LG..A.LL
  • Enumerative algorithm
  • Build up longer patterns by merging overlapping
    short patterns
  • For example: A.L + LL → A.LL
  • Fewer instances occur by chance when more
    positions are specified
  • Pluses: exact
  • Minuses: biological significance of the patterns?
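The two core operations named above, counting a pattern's support and merging overlapping patterns, can be sketched as follows. This is not the TEIRESIAS algorithm itself, just an illustration of its building blocks; the input sequences are made up.

```python
import re

def support(pattern, seqs):
    """Number of input sequences matching the pattern ('.' = any residue)."""
    rx = re.compile(pattern)
    return sum(1 for s in seqs if rx.search(s))

def merge(p, q, overlap):
    """Merge two patterns whose suffix/prefix agree on `overlap`
    positions, e.g. merge('A.L', 'LL', 1) -> 'A.LL'."""
    head, tail = p[-overlap:], q[:overlap]
    for a, b in zip(head, tail):
        if a != b and a != "." and b != ".":
            return None  # incompatible overlap
    # keep the more specific character wherever one side is a wildcard
    joined = "".join(b if a == "." else a for a, b in zip(head, tail))
    return p[:-overlap] + joined + q[overlap:]

# hypothetical unaligned input sequences
seqs = ["MAKLLG", "TADLLS", "GAPLLQ"]
print(merge("A.L", "LL", 1))   # 'A.LL'
print(support("A.LL", seqs))   # 3: the merged pattern keeps full support
```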

17
Expectation maximization (EM)
  • Initial guess of motif location in each sequence
  • E-step: estimate frequencies of amino acids in
    each motif column. Columns not in the motif
    provide background frequencies. For each possible
    site location, calculate the probability that the
    site starts there.
  • M-step: use the site probabilities as weights to
    provide a new table of expected values for base
    counts for each of the site positions.
  • Repeat the E-step and M-step until convergence.

18
EM procedure
  • Let the observed variables be known as y and the
    latent variables as z. Together, y and z form the
    complete data. Assume that p is a joint model of
    the complete data with parameters θ: p(y,z | θ).
    An EM algorithm will then iteratively improve an
    initial estimate θ0 and construct new estimates
    θ1 through θN. An individual re-estimation step
    that derives θn+1 from θn takes the following
    form (shown for the discrete case; the continuous
    case is similar):
  • θn+1 = argmax over θ of Σz p(z | y, θn) log p(y,z | θ)   (Eqn 1)
  • In other words, θn+1 is the value that maximizes
    (M) the expectation (E) of the complete data
    log-likelihood with respect to the conditional
    distribution of the latent data under the
    previous parameter value. This expectation is
    usually denoted as Q(θ):
  • Q(θ) = Σz p(z | y, θn) log p(y,z | θ)   (Eqn 2)
  • Speaking of an expectation (E) step is a bit of a
    misnomer. What is calculated in the first step
    are the fixed, data-dependent parameters of the
    function Q. Once the parameters of Q are known,
    it is fully determined and is maximized in the
    second (M) step of an EM algorithm.
  • It can be shown that an EM iteration does not
    decrease the observed data likelihood function,
    and that the only stationary points of the
    iteration are the stationary points of the
    observed data likelihood function. In practice,
    this means that an EM algorithm will converge to
    a local maximum of the observed data likelihood
    function.
  • EM is particularly useful when maximum likelihood
    estimation of a complete data model is easy. If
    closed-form estimators exist, the M step is often
    trivial. A classic example is maximum likelihood
    estimation of a finite mixture of Gaussians,
    where each component of the mixture can be
    estimated trivially if the mixing distribution is
    known.
  • "Expectation-maximization" is a description of a
    class of related algorithms, not a specific
    algorithm: EM is a recipe or meta-algorithm which
    is used to devise particular algorithms. The
    Baum-Welch algorithm is an example of an EM
    algorithm applied to hidden Markov models.
    Another example is the EM algorithm for fitting a
    mixture density model.
  • An EM algorithm can also find maximum a
    posteriori (MAP) estimates, by performing MAP
    estimation in the M step, rather than maximum
    likelihood.
  • There are other methods for finding maximum
    likelihood estimates, such as gradient descent,
    conjugate gradient or variations of the
    Gauss-Newton method. Unlike EM, such methods
    typically require the evaluation of first and/or
    second derivatives of the likelihood function.
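The Gaussian-mixture case mentioned above can be sketched in a few lines. This is a hypothetical 1-D example with two components and variances fixed at 1, so the M-step stays closed-form; a full implementation would also update the variances.

```python
import math
import random

random.seed(0)
# synthetic 1-D data: two Gaussian components with unit variance
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])

def em(data, steps=50):
    """EM for a two-component 1-D Gaussian mixture (unit variances)."""
    mu = [min(data), max(data)]   # crude initial guess for the means
    w = [0.5, 0.5]                # mixing weights
    for _ in range(steps):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            like = [w[k] * math.exp(-0.5 * (x - mu[k]) ** 2)
                    for k in range(2)]
            tot = sum(like)
            resp.append([l / tot for l in like])
        # M-step: closed-form updates for means and mixing weights
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            w[k] = nk / len(data)
    return mu, w

mu, w = em(data)
print([round(m, 1) for m in mu], [round(v, 2) for v in w])
```

With well-separated components the means converge close to the true values 0 and 5 and the weights close to 0.5 each.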

19
Exercise
  • Analyze the following ten DNA sequences by the
    expectation maximization algorithm. Assume that
    the background base frequencies are each 0.25 and
    that the middle three positions are a motif. The
    size of the motif is a guess that is based on a
    molecular model. The alignment of the sequences
    is also a guess.

Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A
20
(a) Calculate the observed frequency of each base
at each of the three middle positions
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A
Base 1st 2nd 3rd
A 0.1 0.6 0.2
C 0.1 0.1 0.2
G 0.1 0.1 0.2
T 0.7 0.2 0.4
21
(b) Calculate the odds likelihood of finding the
motif at each of the possible locations in
sequence 5
Seq5 CAGAT
CAGAT (CAG): 0.1 · 0.6 · 0.2 / (0.25 · 0.25 · 0.25) = 0.768
CAGAT (AGA): 0.1 · 0.1 · 0.2 / (0.25 · 0.25 · 0.25) = 0.128
CAGAT (GAT): 0.1 · 0.6 · 0.4 / (0.25 · 0.25 · 0.25) = 1.536
Non-motif sites: 0.25/0.25 = 1
Base 1st 2nd 3rd
A 0.1 0.6 0.2
C 0.1 0.1 0.2
G 0.1 0.1 0.2
T 0.7 0.2 0.4
22
(c) Calculate the probability of finding the
motif at each position of sequence 5
Seq5 CAGAT
CAGAT (CAG): 0.1 · 0.6 · 0.2 · 0.25 · 0.25 = 0.00075
CAGAT (AGA): 0.25 · 0.1 · 0.1 · 0.2 · 0.25 = 0.000125
CAGAT (GAT): 0.25 · 0.25 · 0.1 · 0.6 · 0.4 = 0.0015
Non-motif sites: p = 0.25
Base 1st 2nd 3rd
A 0.1 0.6 0.2
C 0.1 0.1 0.2
G 0.1 0.1 0.2
T 0.7 0.2 0.4
23
(d) Calculate what change will be made to the
base count in each column of the motif table as a
result of matching the motif to sequence 5.
CAGAT (CAG): 0.1 · 0.6 · 0.2 · 0.25 · 0.25 = 0.00075, rel. weight 0.32
CAGAT (AGA): 0.25 · 0.1 · 0.1 · 0.2 · 0.25 = 0.000125, rel. weight 0.05
CAGAT (GAT): 0.25 · 0.25 · 0.1 · 0.6 · 0.4 = 0.0015, rel. weight 0.63
Base 1st            2nd            3rd
A    0.05           0.32+0.63=0.95 0.05
C    0.32           0              0
G    0.63           0.05           0.32
T    0              0              0.63
Updated counts from sequence 5 (initial location
guess shaded in the original slide).
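The computation of steps (b) to (d) can be reproduced directly from the frequency table of step (a):

```python
# Motif base frequencies from step (a); background is 0.25 per base
freq = {
    "A": [0.1, 0.6, 0.2],
    "C": [0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.2],
    "T": [0.7, 0.2, 0.4],
}
bg = 0.25
seq = "CAGAT"  # sequence 5

# probability of the whole sequence with the motif at each offset
probs = []
for start in range(len(seq) - 2):
    p = 1.0
    for i, base in enumerate(seq):
        if start <= i < start + 3:
            p *= freq[base][i - start]  # position inside the motif
        else:
            p *= bg                     # background position
    probs.append(p)

total = sum(probs)                      # 0.002375
weights = [round(p / total, 2) for p in probs]
print(probs)    # [0.00075, 0.000125, 0.0015] up to float rounding
print(weights)  # [0.32, 0.05, 0.63] as on the slide
```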
24
(e) What other steps are taken to update or
maximize the table values?
  • The weighted sequence data from the remaining
    sequences are also added to the counts table
  • The base frequencies in the new table are used as
    an updated estimate of the site residue
    composition.
  • The expectation and maximization steps are
    repeated until the base frequencies do not change.

25
Gibbs sampler
  • Start from a random msa; realign one sequence
    against the profile derived from the other
    sequences; iterate
  • Finds the most probable pattern common to all of
    the sequences by sliding them back and forth
    until the ratio of the motif probability to the
    background probability is a maximum.

26
A. Estimate the amino acid frequencies in the
motif columns of all but one sequence. Also
obtain the background frequencies.
[Figure: random start; a motif position (M) is chosen in each sequence, e.g. xxxMxxxxx]
B. Use the estimates from A to calculate the ratio
of the motif probability to the background
probability at each position in the left-out
sequence. This ratio for each possible location
is the weight of the position.
[Figure: the motif window (M?) is slid along the left-out sequence]
27
C. Choose a new location for the motif in the
left-out sequence by a random selection using the
weights to bias the choice.
[Figure: estimated location of the motif in the left-out sequence, xxxxxxxMxx]
D. Repeat steps A to C many (>100) times.
28
Exercise Gibbs sampler
  • Analyze the left-hand-side DNA sequences by the
    Gibbs sampling algorithm.

Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A
29
(a) Assuming that the background base frequencies
are 0.25, calculate a log odds matrix for the
central three positions.
Base 1st 2nd 3rd
A -1.32 1.26 -0.32
C -1.32 -1.32 -0.32
G -1.32 -1.32 -0.32
T 1.49 -0.32 0.68
Base 1st 2nd 3rd
A 0.1 0.6 0.2
C 0.1 0.1 0.2
G 0.1 0.1 0.2
T 0.7 0.2 0.4
30
(b) Assuming that another sequence GTTTG is the
left-out sequence, slide the log-odds matrix
along the left-out sequence and find the log-odds
score at each of three possible positions.
Base 1st 2nd 3rd
A -1.32 1.26 -0.32
C -1.32 -1.32 -0.32
G -1.32 -1.32 -0.32
T 1.49 -0.32 0.68
GTTTG (GTT): -1.32 - 0.32 + 0.68 = -0.96
GTTTG (TTT): 1.49 - 0.32 + 0.68 = 1.85
GTTTG (TTG): 1.49 - 0.32 - 0.32 = 0.85
31
(c) Calculate the probability of a match at each
position in the left-out sequence
Base 1st 2nd 3rd
A 0.1 0.6 0.2
C 0.1 0.1 0.2
G 0.1 0.1 0.2
T 0.7 0.2 0.4
log-odds = log2(p/bg), so p = 2^s · bg, where bg = 0.25³ = 1/64
GTTTG (GTT): 2^(-0.96)/64 ≈ 0.008; directly: 0.1 · 0.2 · 0.4 = 0.008
GTTTG (TTT): 2^(1.85)/64 ≈ 0.056; directly: 0.7 · 0.2 · 0.4 = 0.056
GTTTG (TTG): 2^(0.85)/64 ≈ 0.028; directly: 0.7 · 0.2 · 0.2 = 0.028
32
(d) How do we choose a possible location for the
motif in the left-out sequence?
  • 0.008 + 0.056 + 0.028 = 0.092
  • Normalised weights:
  • GTTTG (GTT): 0.008/0.092 = 0.09
  • GTTTG (TTT): 0.056/0.092 = 0.61
  • GTTTG (TTG): 0.028/0.092 = 0.30
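Steps (b) to (d) of the Gibbs exercise can be reproduced from the log-odds matrix of step (a): score each window, back-transform the scores to probabilities, then normalise.

```python
# Log-odds matrix from step (a): log2(motif frequency / 0.25)
logodds = {
    "A": [-1.32, 1.26, -0.32],
    "C": [-1.32, -1.32, -0.32],
    "G": [-1.32, -1.32, -0.32],
    "T": [1.49, -0.32, 0.68],
}
bg3 = 0.25 ** 3   # background probability of any 3-mer, 1/64
seq = "GTTTG"     # left-out sequence

# log-odds score of each 3-base window
scores = [sum(logodds[seq[start + i]][i] for i in range(3))
          for start in range(len(seq) - 2)]

# back-transform scores to probabilities, then normalise to weights
probs = [2 ** s * bg3 for s in scores]
weights = [p / sum(probs) for p in probs]
print([round(s, 2) for s in scores])   # [-0.96, 1.85, 0.85]
print([round(v, 2) for v in weights])  # [0.09, 0.61, 0.3]
```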

33
Modelling protein families
34
PSSM
  • The PSSM is constructed by a logarithmic
    transformation of a matrix giving the frequency
    of each amino acid in the motif.
  • If a good sampling of sequences is available, the
    number of sequences is sufficiently large, and
    the motif structure is not too complex, it
    should, in principle, be possible to produce a
    PSSM that is highly representative of the same
    motif in other sequences also.
  • Pseudocounts
  • Sequence logo, information content

35
Exercise construct a PSSM
Base Column 1 frequency Column 2 frequency Column 3 frequency Column 4 frequency
A 0.6 0.1 0.2 0.1
C 0.1 0.7 0.1 0.1
G 0.1 0.1 0.6 0.1
T 0.2 0.1 0.1 0.7
36
Exercise cntd
  • (a) Assuming the background frequency is 0.25 for
    each base, calculate a log odds score for each
    table position, i.e. log to the base 2 of the
    ratio of each observed value to the expected
    frequency.

37
Observed/expected frequencies
Base  Column 1        Column 2        Column 3        Column 4
A     0.6/0.25 = 2.4  0.1/0.25 = 0.4  0.2/0.25 = 0.8  0.1/0.25 = 0.4
C     0.1/0.25 = 0.4  0.7/0.25 = 2.8  0.1/0.25 = 0.4  0.1/0.25 = 0.4
G     0.1/0.25 = 0.4  0.1/0.25 = 0.4  0.6/0.25 = 2.4  0.1/0.25 = 0.4
T     0.2/0.25 = 0.8  0.1/0.25 = 0.4  0.1/0.25 = 0.4  0.7/0.25 = 2.8
38
Log-odds scores
Base  Column 1           Column 2           Column 3           Column 4
A     log2(2.4) = 1.26   log2(0.4) = -1.32  log2(0.8) = -0.32  log2(0.4) = -1.32
C     log2(0.4) = -1.32  log2(2.8) = 1.49   log2(0.4) = -1.32  log2(0.4) = -1.32
G     log2(0.4) = -1.32  log2(0.4) = -1.32  log2(2.4) = 1.26   log2(0.4) = -1.32
T     log2(0.8) = -0.32  log2(0.4) = -1.32  log2(0.4) = -1.32  log2(2.8) = 1.49
39
Exercise cntd
  • (b) Align the matrix with each position in the
    sequence TCACGTAA starting at position 1,2, etc.,
    and calculate the log odds score for the matrix
    to that position.

40
Base  Column 1  Column 2  Column 3  Column 4
A     1.26      -1.32     -0.32     -1.32
C     -1.32     1.49      -1.32     -1.32
G     -1.32     -1.32     1.26      -1.32
T     -0.32     -1.32     -1.32     1.49

TCACGTAA (TCAC): -0.32 + 1.49 - 0.32 - 1.32 = -0.47
TCACGTAA (CACG): -1.32 - 1.32 - 1.32 - 1.32 = -5.28
TCACGTAA (ACGT): 1.26 + 1.49 + 1.26 + 1.49 = 5.50
TCACGTAA (CGTA): -1.32 - 1.32 - 1.32 - 1.32 = -5.28
TCACGTAA (GTAA): -1.32 - 1.32 - 0.32 - 1.32 = -4.28
41
(c) Calculate the probability of the best
matching position.
  • p(best match, ACGT) = 0.6 · 0.7 · 0.6 · 0.7 = 0.1764
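The whole exercise of slides 35 to 41, building the PSSM from the frequency table and scanning it along TCACGTAA, can be reproduced in a few lines:

```python
import math

# Base frequencies per motif column (slide 35); background is 0.25
freq = {
    "A": [0.6, 0.1, 0.2, 0.1],
    "C": [0.1, 0.7, 0.1, 0.1],
    "G": [0.1, 0.1, 0.6, 0.1],
    "T": [0.2, 0.1, 0.1, 0.7],
}
bg = 0.25
# PSSM: log2 of observed over expected frequency
pssm = {b: [math.log2(f / bg) for f in fs] for b, fs in freq.items()}

seq = "TCACGTAA"
# score every 4-base window against the PSSM
scores = {start: sum(pssm[seq[start + i]][i] for i in range(4))
          for start in range(len(seq) - 3)}
best = max(scores, key=scores.get)
print(seq[best:best + 4], round(scores[best], 2))  # ACGT 5.5
# probability of the best window: 0.6 * 0.7 * 0.6 * 0.7 = 0.1764
```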

42
Alignment of sequences with a structure: hidden
Markov models
Hidden Markov Models
  • HMMs are well suited to describing correlation
    among neighbouring sites
  • the probabilistic framework allows for realistic
    modelling
  • well developed mathematical methods provide
  • the best solution (most probable path through the
    model)
  • a confidence score (posterior probability of any
    single solution)
  • inference of structure (posterior probability of
    using model states)
  • we can align multiple sequences using complex
    models and simultaneously predict their internal
    structure
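The "most probable path through the model" is computed with the Viterbi algorithm. Below is a minimal sketch on a hypothetical two-state (match/insert) model; all probabilities are made up for illustration and a real profile HMM has many more states.

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable state path for an observation sequence."""
    # V[t][s] = best probability of any path ending in state s at time t
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            p, best_prev = max((prev[r] * trans[r][s], r) for r in states)
            col[s] = p * emit[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # trace back from the best final state
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# toy two-state model over symbols A, C and the gap symbol '-'
states = ["match", "insert"]
start = {"match": 0.8, "insert": 0.2}
trans = {"match": {"match": 0.7, "insert": 0.3},
         "insert": {"match": 0.4, "insert": 0.6}}
emit = {"match": {"A": 0.4, "C": 0.4, "-": 0.2},
        "insert": {"A": 0.1, "C": 0.1, "-": 0.8}}
print(viterbi("AC--A", states, start, trans, emit))
# ['match', 'match', 'insert', 'insert', 'match']
```

The path labels the two gap symbols as insert states, which is the kind of structural inference the bullet points describe.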