1
Multiple sequence alignment methods

Corné Hoogendoorn
Denis Miretskiy

2
Overview

What a multiple alignment means
Scoring a multiple alignment
Break
Multidimensional dynamic programming
Progressive alignment methods

3
What a multiple alignment means

Homologous residues are aligned in columns
Structurally homologous: similar 3D structural positions
Evolutionarily homologous: diverging from a common ancestral residue

4
Multiple alignment - issues

Identifying unambiguously homologous positions is
not possible
A need to identify which alignment is best
Protein structures and sequences evolve
Sequences not entirely superposable

5
Multiple alignment - issues

In principle there is always an unambiguously correct
evolutionary alignment
Sequences diverge from a common ancestral sequence
In practice it is nearly impossible to infer the full
evolutionary history
Usually easier to construct a structural alignment

6
Multiple alignment - issues

Sequence diverges even faster than structure
Structurally unalignable protein parts cannot be
aligned by sequence either
Some parts are very well alignable
Use these parts to align whatever can be aligned
Disregard the rest when assessing alignment quality
This keeps unalignable regions from biasing the assessment

7
Scoring an alignment

Some positions are more conserved than others
Position-specific scoring
Sequences are not independent
Related to each other by a phylogenetic tree
Specify a complete probabilistic model of
molecular sequence evolution

8
Complete probabilistic model

Probabilities of all evolutionary events
Prior probability of root ancestral sequence
Probabilities of evolutionary change depend on
evolutionary time
Position-specific structural and functional
constraints
We just don't have all the necessary data

9
Workable approximations

Assume that all columns are statistically
independent

$S(m) = G + \sum_i S(m_i)$

$S(m)$: score for multiple alignment m
$G$: gap score/penalty
$S(m_i)$: score for column i in the multiple alignment m
10
Scoring an alignment

Notations

11
Minimum entropy: further simplification

We already assumed independence between columns
Complex statistical dependence between sequences
(within columns) if their phylogenetic tree has
many intermediate ancestors
We assume independence between and within columns

12
Minimum entropy

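Probability of column $m_i$: $P(m_i) = \prod_a p_{ia}^{c_{ia}}$
where $c_{ia}$ is the count of residue a in column i and $p_{ia}$ is
the probability of residue a in column i
Score of column $m_i$ can be defined as the negative
logarithm: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$

$p_{ia}$ can be estimated as $c_{ia} / \sum_{a'} c_{ia'}$, or by a
regularized probability estimate as used in chapter 5
The score is an entropy measure directly related to the
Shannon entropy (chapter 11)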
13
Example (1)
14
Example (2)
15
Example (3)
Will this ever be 0 in reality? Why (not)?
16
Example (4)
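
The minimum entropy column score can be tried out on small columns with a short Python sketch. It assumes the score is the negative log (base 2 here) of the column probability, with residue probabilities estimated from the column counts; the function names, the optional pseudocount, and the treatment of a gap as just another symbol are assumptions made for illustration, not part of the slides.

```python
import math
from collections import Counter

def column_entropy_score(column, pseudocount=0.0, alphabet=None):
    """Return S(m_i) = -sum_a c_ia * log2(p_ia) for one alignment column.

    p_ia is estimated as (c_ia + pseudocount) / total. With pseudocount=0
    this is the plain frequency estimate; a positive pseudocount is one
    simple (assumed) form of regularization.
    """
    counts = Counter(column)
    symbols = alphabet if alphabet is not None else sorted(counts)
    total = sum(counts[a] + pseudocount for a in symbols)
    score = 0.0
    for a in symbols:
        c = counts[a]
        if c == 0:
            continue  # a residue that never occurs contributes nothing
        p = (c + pseudocount) / total
        score -= c * math.log2(p)
    return score

def alignment_entropy_score(rows):
    """Sum the column scores, treating columns as independent."""
    return sum(column_entropy_score(col) for col in zip(*rows))

if __name__ == "__main__":
    for col in ("AAAA", "AACC", "ACGT"):
        print(col, column_entropy_score(col))
```

A fully conserved column scores 0 under the plain frequency estimate, and the score grows as the column becomes more variable. With a positive pseudocount over a fixed alphabet the score of a conserved column is small but no longer exactly 0.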
17
Minimum entropy

Very near to the HMM formulation
Choose the sequences carefully
Usually the sample of sequences is biased
Weighting schemes as discussed in chapter 5 are
necessary
This partially compensates for the defects of the
assumption of sequence independence

18
Sum of pairs

Also assumes statistical independence between
columns
Uses substitution matrices
For simple linear gap costs, s(a,-) and s(-,a) are
defined as the gap cost, with s(-,-) = 0

Column score: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
Scores s(a,b) come from substitution matrices
like PAM or BLOSUM
19
Sum of pairs

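Substitution scores are usually log-odds scores
for pairwise comparisons
For three sequences the SP score is
$\log\frac{p_{ab}}{q_a q_b} + \log\frac{p_{bc}}{q_b q_c} + \log\frac{p_{ac}}{q_a q_c}$
rather than the three-way log-odds score $\log\frac{p_{abc}}{q_a q_b q_c}$
Each sequence is scored as if it descended from
the N-1 other sequences instead of a single ancestor
Evolutionary events are over-counted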

20
Problem with SP scores

Consider an alignment of N sequences
All have leucine (L) at position i

Number of symbol pairs in the column: N(N-1)/2
Score for an L-L alignment according to the
BLOSUM50 matrix: 5
Score of the column: 5 × N(N-1)/2
21
Problem with SP scores

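What if one sequence has glycine (G) at i?
G-L pair scores -4, difference with L-L is 9
There are N-1 G-L pairs, so the column score drops by 9(N-1)
The score is worse than the all-leucine column by
a fraction $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$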
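
The arithmetic on these two slides can be checked with a small Python sketch. It uses only the two BLOSUM50 entries quoted above (L-L = 5, G-L = -4); the function names are made up for illustration. The relative drop should equal 18/(5N), so it shrinks as more sequences are added, even though a single glycine among many leucines arguably becomes more surprising.

```python
from itertools import combinations

# Only the two BLOSUM50 entries quoted on the slides are needed here.
BLOSUM50_SUBSET = {("L", "L"): 5, ("G", "L"): -4}

def pair_score(a, b):
    """Look up s(a, b); the key is sorted so that s(a, b) == s(b, a)."""
    return BLOSUM50_SUBSET[tuple(sorted((a, b)))]

def sp_column_score(column):
    """Sum-of-pairs score: add s(a, b) over all N(N-1)/2 symbol pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def relative_drop(n):
    """Fraction by which one G among n-1 L's lowers the column score."""
    all_leucine = sp_column_score("L" * n)
    one_glycine = sp_column_score("L" * (n - 1) + "G")
    return (all_leucine - one_glycine) / all_leucine

if __name__ == "__main__":
    for n in (3, 5, 10, 50):
        print(n, relative_drop(n), 18 / (5 * n))
```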

22
What a multiple alignment means / Scoring a
multiple alignment

Questions?
Break

23
Multidimensional dynamic programming

We assume that columns of an alignment are
statistically independent
Gaps are scored with a linear gap cost
Now we can calculate the overall score $S(m) = \sum_i S(m_i)$
Where $S(m_i)$ is the score for column i

24
Calculating the overall score

Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an
alignment up to the subsequences ending with
$x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

25
(No Transcript)
26
Simple notation

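Introduce $\Delta_i$, which is 0 or 1, and define the
product $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
Now the recursion can be written as follows:
$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \}$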

27
Complexity of algorithm

The algorithm requires the computation of the
whole dynamic programming matrix with $L_1 L_2 \cdots L_N$
entries.
We have to consider $2^N - 1$ combinations of gaps in a
column.
Assume all sequences have roughly the same length $\bar{L}$
Memory complexity of the algorithm is $O(\bar{L}^N)$
Time complexity is $O(2^N \bar{L}^N)$
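
To make the cost concrete, here is a minimal Python sketch of multidimensional dynamic programming that returns only the optimal score. It assumes a sum-of-pairs column score with toy match/mismatch/gap values (1, -1, -2); those values and all function names are illustrative assumptions, not taken from the slides. The table holds one entry per cell of the N-dimensional matrix, and each cell tries all 2^N - 1 gap patterns.

```python
from itertools import combinations, product

def sp_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; '-' is a gap and s(-,-) = 0."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def multi_align_score(seqs, column_score=sp_score):
    """Return the optimal alignment score stored at cell (L_1, ..., L_N)."""
    n = len(seqs)
    # All 2^N - 1 non-zero choices of (Delta_1, ..., Delta_N).
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {(0,) * n: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # (componentwise <= the current cell) has already been filled in.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue  # the origin is already initialised
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue  # this gap pattern would step outside the matrix
            column = tuple(
                s[i - 1] if di else "-" for s, i, di in zip(seqs, cell, d)
            )
            cand = alpha[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        alpha[cell] = best
    return alpha[tuple(len(s) for s in seqs)]

if __name__ == "__main__":
    print(multi_align_score(["AC", "AC", "AC"]))
    print(multi_align_score(["ACG", "AG"]))
```

Even in this toy form the table size is the product of (length + 1) over all sequences, which is why the full algorithm is only practical for a handful of short sequences.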

28
MSA

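Let $a^{kl}$ denote the pairwise alignment between
sequences k and l induced by the multiple alignment a
With SP scoring, the score of the complete alignment is given
by $S(a) = \sum_{k<l} S(a^{kl})$
Let $\hat{a}^{kl}$ be the optimal pairwise alignment of k,
l
Obviously $S(a^{kl}) \le S(\hat{a}^{kl})$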

29
Lower bound

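Assume that we have a lower bound $\sigma(a)$ on the score of the optimal
multiple alignment a, so $\sigma(a) \le S(a)$
Then, for every pair k, l:
$\sigma(a) \le S(a) = \sum_{k'<l'} S(a^{k'l'}) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$
In other words $S(a^{kl}) \ge \beta^{kl}$
Where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$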

30
Lower bound

Now we can look only at pairwise alignments of k
and l that score better than $\beta^{kl}$
We need to obtain $\sigma(a)$, and this can be done by
using a progressive alignment algorithm

31
Restricted algorithm

For each pair k, l we can find the complete set
$B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the
best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$
scores more than $\beta^{kl}$
Now we only have to look at cells $(i_1, i_2, \ldots, i_N)$
which meet the following condition:
$(i_k, i_l)$ is in $B^{kl}$ for all k, l
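
One way to find the set of allowed coordinate pairs for a single pair of sequences is to combine a forward and a backward pairwise dynamic programming matrix: the best global alignment passing through cell (i, j) scores forward[i][j] + backward[i][j]. The Python sketch below assumes a toy match/mismatch/linear-gap scheme (1, -1, -2) and hypothetical function names; it keeps cells whose best score is at least the bound, which is the safe reading of the inequality.

```python
def forward_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """F[i][j] = best global alignment score of x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * gap
    for j in range(1, len(y) + 1):
        F[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align x_i with y_j
                F[i - 1][j] + gap,    # x_i against a gap
                F[i][j - 1] + gap,    # y_j against a gap
            )
    return F

def allowed_cells(x, y, beta, **scores):
    """Cells (i, j) whose best alignment through them scores >= beta."""
    F = forward_matrix(x, y, **scores)
    # Backward scores: align the reversed sequences, then flip the indices,
    # so R[n - i][m - j] is the best score of x[i:] against y[j:].
    R = forward_matrix(x[::-1], y[::-1], **scores)
    n, m = len(x), len(y)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if F[i][j] + R[n - i][m - j] >= beta
    }

if __name__ == "__main__":
    print(sorted(allowed_cells("ACG", "ACG", beta=3)))
```

With identical sequences and the bound set to the optimal pairwise score, only the main diagonal survives; a looser bound admits a wider band of cells around it.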

32
(No Transcript)
33
Progressive alignment methods

The algorithms differ in several ways
Choice of order to do the alignment
Whether the progression involves only alignment
of sequences to a single growing alignment or
whether subfamilies are built upon a tree
structure

34
Feng-Doolittle progressive multiple alignment

Calculate a triangular matrix of N(N-1)/2 distances
between all pairs of N sequences by standard
pairwise alignment
Construct a guide tree from the distance matrix
using the Fitch-Margoliash clustering algorithm
Starting from the first node added to the tree,
align the child nodes
Repeat until all sequences have been aligned.

35
Converting scores to distances

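$D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}$
Where
$S_{max}$ is the maximum score (the average of the scores of
aligning each of the two sequences to itself)
$S_{obs}$ is the observed pairwise alignment score
$S_{rand}$ is the expected score for aligning two
random sequences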
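
As a small illustration, the conversion can be written as a Python function, assuming the normalized form D = -log((S_obs - S_rand) / (S_max - S_rand)). The function name and the error handling for non-positive effective scores are assumptions of this sketch.

```python
import math

def score_to_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score to an approximate distance.

    The effective score rescales s_obs so that a random alignment maps
    to 0 and a perfect one to 1; the distance is its negative logarithm.
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The logarithm is undefined when the observed score is no better
        # than random; how to handle that case is left to the caller.
        raise ValueError("observed score must exceed the random score")
    return -math.log(s_eff)

if __name__ == "__main__":
    print(score_to_distance(s_obs=100, s_max=100, s_rand=20))
    print(score_to_distance(s_obs=60, s_max=100, s_rand=20))
```

Identical sequences (s_obs equal to s_max) get distance 0, and the distance grows without bound as the observed score approaches the random expectation.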

36
Profile alignment

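Linear gap scores can be included in the SP
score by setting s(-,a) = s(a,-) = -g and s(-,-) = 0
Global alignment score of two profiles, one holding sequences 1..n
and the other sequences n+1..N:
$\sum_i S(m_i) = \sum_i \sum_{k<l\le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l\le N} s(m_i^k, m_i^l) + \sum_i \sum_{k\le n,\, n<l\le N} s(m_i^k, m_i^l)$
Only the last (cross) term depends on how the two profiles are
aligned to each other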

37
CLUSTALW progressive alignment

Construct a distance matrix of all N(N-1)/2 pairs
by pairwise dynamic programming alignment.
Construct a guide tree by a neighbor-joining
clustering algorithm (Saitou & Nei).
Progressively align at nodes in order of
decreasing similarity, using sequence-sequence,
sequence-profile and profile-profile alignment.

38
CLUSTALW properties

Sequences are weighted to compensate for biased
representation.
The substitution matrix used to score an
alignment is chosen based on the expected
similarity of the sequences
Position-specific gap-open profile penalties are
multiplied by a modifier that is a function of
the residues observed at the position.

39
CLUSTALW properties

Gap-open penalties are also decreased if the
position is spanned by a consecutive stretch of
five or more hydrophilic residues.
Both gap-open and gap-extend penalties are
increased if there are no gaps in a column but gaps occur nearby
in the alignment.
In the progressive alignment stage, if the score
of an alignment is low, that alignment is deferred until later,
when more profile information has been accumulated

40
Iterative refinement methods: Barton-Sternberg
multiple alignment

Find two sequences with the highest pairwise
similarity and align them using standard pairwise
dynamic programming alignment.
Find the sequence that is most similar to a
profile of the alignment of the first two and
align it to the first two by profile-sequence
alignment. Repeat until all sequences have been
included in the multiple alignment.

41
Iterative refinement methods: Barton-Sternberg
multiple alignment

Remove sequence $x^1$ and realign it to a profile of
the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence
alignment. Repeat for sequences $x^2, \ldots, x^N$.
Repeat the previous realignment step a fixed
number of times or until the alignment score
converges.
