Corné Hoogendoorn - PowerPoint PPT Presentation

1
Multiple sequence alignment methods
  • Corné Hoogendoorn
  • Denis Miretskiy

2
Overview
  • What a multiple alignment means
  • Scoring a multiple alignment
  • Break
  • Multidimensional dynamic programming
  • Progressive alignment methods

3
What a multiple alignment means
  • Homologous residues are aligned in columns
  • Structurally homologous
  • Evolutionarily homologous
  • Similar 3D structural positions
  • Diverging from a common ancestral residue

4
Multiple alignment - issues
  • Identifying unambiguously homologous positions is
    not possible
  • A need to identify which alignment is best
  • Protein structures and sequences evolve
  • Sequences not entirely superposable

5
Multiple alignment - issues
  • In principle, there is always an unambiguously
    correct evolutionary alignment
  • Common ancestral sequence
  • In practice, it is nearly impossible to infer the
    evolutionary history
  • Usually easier to construct a structural alignment

6
Multiple alignment - issues
  • Sequence diverges even faster than structure
  • Structurally unalignable protein parts cannot be
    aligned by sequence either
  • Some parts are very well alignable
  • Use these parts to align whatever can be aligned
  • Disregard the rest to assess alignment quality
  • Supposedly meaningless biases are omitted

7
Scoring an alignment
  • Some positions are more conserved than others
  • Position-specific scoring
  • Sequences are not independent
  • Related to each other by a phylogenetic tree
  • Specify a complete probabilistic model of
    molecular sequence evolution

8
Complete probabilistic model
  • Probabilities of all evolutionary events
  • Prior probability of root ancestral sequence
  • Probabilities of evolutionary change depend on
    evolutionary time
  • Position-specific structural and functional
    constraints
  • We just don't have all the necessary data

9
Workable approximations
  • Assume that all columns are statistically
    independent

  • Score for multiple alignment m:
    $S(m) = G + \sum_i S(m_i)$
  • G is the gap score/penalty
  • S(m_i) is the score for column i in the multiple
    alignment m

10
Scoring an alignment
  • Notations

11
Minimum entropy: further simplification
  • We already assumed independence between columns
  • Complex statistical dependence between sequences
    (within columns) if their phylogenetic tree has
    many intermediate ancestors
  • We assume independence between and within columns

12
Minimum entropy
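  • Probability of column m_i:
    $P(m_i) = \prod_a p_{ia}^{c_{ia}}$
  • Score of column m_i can be defined as the negative
    logarithm:
    $S(m_i) = -\sum_a c_{ia} \log p_{ia}$
  • c_ia is the count of residue a in column i
  • p_ia is a regularized probability estimate as used
    in chapter 5
  • S(m_i) is an entropy measure directly related to
    the Shannon entropy (chapter 11)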
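
The minimum entropy column score can be sketched in a few lines of Python. This is only an illustration, assuming the score is the negative log-probability of the column, -sum_a c_a * log(p_a), with p_a estimated from the column counts. The function name and the `pseudocount` parameter are assumptions made for this sketch (a simple stand-in for the regularized estimate the slide mentions), and the column is assumed to contain only symbols from `alphabet`.

```python
import math
from collections import Counter


def min_entropy_column_score(column, alphabet, pseudocount=0.0):
    """Return -sum_a c_a * log(p_a) for one alignment column.

    column      -- the residues found in this column, one per sequence
    alphabet    -- all residue symbols that may occur
    pseudocount -- hypothetical regularizer added to every count
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    score = 0.0
    for a in alphabet:
        c = counts.get(a, 0)
        if c == 0:
            continue  # residues that never occur contribute nothing
        p = (c + pseudocount) / total
        score -= c * math.log(p)
    return score
```

A fully conserved column gets the minimum score of 0 when no pseudocount is used, and the score grows as the column becomes more variable.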
13
Example (1)
14
Example (2)
15
Example (3)
Will this ever be 0 in reality? Why (not)?
16
Example (4)
17
Minimum entropy
  • Very near to the HMM formulation
  • Choose the sequences carefully
  • Usually the sample of sequences is biased
  • Weighting schemes as discussed in chapter 5 are
    necessary
  • This partially compensates for the defects of the
    assumption of sequence independence

18
Sum of pairs
  • Also assumes statistical independence between
    columns
  • Uses substitution matrices
  • For simple linear gap costs, s(a,-), s(-,a) and
    s(-,-) are defined, with s(-,-) = 0

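  • Column score: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$,
    where m_i^k is the symbol of sequence k in column i
  • Scores s(a,b) come from substitution matrices
    like PAM or BLOSUM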
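
As a rough illustration of sum-of-pairs scoring, the sketch below adds up s(a,b) over all pairs of symbols in one column. The function name, the `TOY_SCORES` table and the `gap_cost` value are assumptions for this example. Only the L-L and G-L entries are taken from the BLOSUM50 values quoted in these slides, and a real implementation would load a full substitution matrix.

```python
from itertools import combinations

# Toy substitution scores (hypothetical table holding only the two
# BLOSUM50 entries quoted in the slides).
TOY_SCORES = {("L", "L"): 5, ("G", "L"): -4}


def sp_column_score(column, scores=TOY_SCORES, gap_cost=8):
    """Sum-of-pairs score of one column with a linear gap cost.

    s(a,-) = s(-,a) = -gap_cost and s(-,-) = 0; gap_cost is an assumed value.
    """
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # s(-,-) = 0
        if a == "-" or b == "-":
            total -= gap_cost
        else:
            # the table is symmetric, so look the pair up in either order
            total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total
```

For example, `sp_column_score("LLLL")` sums six L-L pairs, while `sp_column_score("LL-")` sums one L-L pair and two residue-gap pairs.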
19
Sum of pairs
  • Substitution scores are usually log-odds scores
    for pairwise comparisons
  • For three sequences, the SP score is
    log(pab/qaqb) + log(pbc/qbqc) + log(pac/qaqc)
  • The correct log-odds score would be
    log(pabc/qaqbqc)
  • Each sequence is scored as if it descended from
    the N-1 other sequences
  • Evolutionary events are over-counted

20
Problem with SP scores
  • Consider an alignment of N sequences
  • All have leucine (L) at position i

  • Score of the column: $5 \times N(N-1)/2$
  • N(N-1)/2 is the number of symbol pairs in the column
  • 5 is the score for an L-L alignment according to
    the BLOSUM50 matrix

21
Problem with SP scores
  • What if one sequence has glycine (G) at i?
  • A G-L pair scores -4, so each of the N-1 G-L pairs
    loses 9 compared with an L-L pair
  • The score is worse than the all-leucine column by
    a fraction $9(N-1) / (5N(N-1)/2) = 18/(5N)$
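
The arithmetic behind this example can be checked with a short sketch. It assumes the BLOSUM50 values quoted in the slides (L-L = 5, G-L = -4). The function names are made up for illustration.

```python
from fractions import Fraction


def all_leucine_score(n, ll=5):
    """SP score of a column with n leucines: n(n-1)/2 L-L pairs."""
    return ll * n * (n - 1) // 2


def one_glycine_score(n, ll=5, gl=-4):
    """SP score when one of the n leucines is replaced by a glycine.

    There are n-1 G-L pairs and (n-1)(n-2)/2 remaining L-L pairs.
    """
    return gl * (n - 1) + ll * (n - 1) * (n - 2) // 2


def relative_drop(n):
    """Fraction by which the score falls below the all-leucine column."""
    full = all_leucine_score(n)
    return Fraction(full - one_glycine_score(n), full)
```

`relative_drop(n)` equals 18/(5n), so the relative penalty for the single glycine shrinks as more sequences are added. That is the counterintuitive behaviour of SP scores, since more leucines should make the column look more strongly conserved.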

22
What a multiple alignment means / Scoring a
multiple alignment
  • Questions?
  • Break

23
Multidimensional dynamic programming
  • We assume that columns of an alignment are
    statistically independent
  • Gaps are scored with a linear gap cost
  • Now we can calculate overall score
    $S(m) = \sum_i S(m_i)$
  • Where S(m_i) is a score for column i

24
Calculating the overall score
  • Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an
    alignment up to the subsequences ending with
    $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

26
Simple notation
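  • Introduce $\Delta_i$, which is 0 or 1, and define the
    product $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and
    $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
  • Now recursion can be written as follows:
    $\alpha_{i_1,\ldots,i_N} = \max_{\Delta_1+\cdots+\Delta_N > 0} \{ \alpha_{i_1-\Delta_1,\ldots,i_N-\Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \}$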
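
A minimal sketch of multidimensional dynamic programming is given below, assuming sum-of-pairs column scores with a linear gap cost. The match, mismatch and gap values are arbitrary toy choices, and the function names are invented for this example. Each move is a 0/1 vector saying which sequences contribute a residue to the next column (0 means a gap), and the table is filled over all index tuples. It returns only the optimal score, not the alignment.

```python
from itertools import combinations, product


def sp_score(column, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score for one column (assumed scoring values)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # gap against gap scores 0
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def msa_score(seqs):
    """Optimal multiple alignment score by full N-dimensional DP."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    origin = (0,) * n
    # all 2^N - 1 non-zero choices of which sequences advance
    moves = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {origin: 0}
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell has already been filled in
    for cell in product(*(range(length + 1) for length in lengths)):
        if cell == origin:
            continue
        best = None
        for d in moves:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue  # move would step outside the table
            column = [seqs[k][cell[k] - 1] if d[k] else "-" for k in range(n)]
            cand = alpha[prev] + sp_score(column)
            if best is None or cand > best:
                best = cand
        alpha[cell] = best
    return alpha[tuple(lengths)]
```

The table holds one entry per index tuple and each entry tries every non-zero move, which is why this approach is only practical for a handful of short sequences.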

27
Complexity of algorithm
  • The algorithm requires the computation of the
    whole dynamic programming matrix with
    L1 × L2 × ... × LN entries.
  • We have to consider 2^N - 1 combinations of gaps
    in a column.
  • Assume all sequences have roughly the same
    length L
  • Memory complexity of the algorithm is $O(L^N)$
  • Time complexity is $O(2^N L^N)$

28
MSA
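  • Let a^kl denote the pairwise alignment between
    sequences k and l induced by the multiple
    alignment a
  • The score of the complete alignment is given by
    $S(a) = \sum_{k<l} S(a^{kl})$
  • Let â^kl be the optimal pairwise alignment of k,
    l
  • Obviously $S(a^{kl}) \le S(\hat{a}^{kl})$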

29
Lower bound
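  • Assume that we have a lower bound σ(a) on the
    score of the optimal multiple alignment a, so
    $\sigma(a) \le S(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$
  • In other words $S(a^{kl}) \ge \beta^{kl}$
  • Where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$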

30
Lower bound
  • Now we can look only at pairwise alignments of k
    and l that score better than β^kl
  • We need to obtain the lower bound σ(a), and this
    can be done by using a progressive alignment
    algorithm

31
Restricted algorithm
  • For each pair k, l we can find the complete set
    B^kl of coordinate pairs (i_k, i_l) such that the
    best alignment of x^k to x^l through (i_k, i_l)
    scores more than β^kl
  • Now we only have to look at cells (i_1, i_2, ..., i_N)
    which meet the following condition
  • (i_k, i_l) is in B^kl for all k, l
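
One way to find the set of allowed coordinate pairs for a single pair of sequences is sketched below. It combines a forward and a backward pairwise dynamic programming table, so that the best alignment passing through a cell scores forward plus backward. This is an illustrative sketch, not the MSA program's actual code. The function names and the match, mismatch and gap values are assumptions, and the bound `beta` is simply passed in by the caller.

```python
def forward_table(x, y, match=1, mismatch=-1, gap=-2):
    """F[i][j] = best global score of aligning x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:
                s = match if x[i - 1] == y[j - 1] else mismatch
                cands.append(F[i - 1][j - 1] + s)
            if i:
                cands.append(F[i - 1][j] + gap)  # x residue against a gap
            if j:
                cands.append(F[i][j - 1] + gap)  # y residue against a gap
            F[i][j] = max(cands)
    return F


def cells_above_bound(x, y, beta):
    """Coordinate pairs (i, j) whose best alignment through them scores >= beta."""
    n, m = len(x), len(y)
    F = forward_table(x, y)
    # Aligning the reversed sequences gives suffix scores:
    # R[a][b] = best score of the last a residues of x vs the last b of y.
    R = forward_table(x[::-1], y[::-1])
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if F[i][j] + R[n - i][m - j] >= beta  # ties are kept, which is safe
    }
```

With a tight bound only cells on optimal paths survive, and a looser bound admits more of the table. The multidimensional search is then restricted to cells whose projections all lie in the corresponding sets.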

33
Progressive alignment methods
  • The algorithms differ in several ways
  • Choice of order to do the alignment
  • Whether the progression involves only alignment
    of sequences to a single growing alignment or
    whether subfamilies are built upon a tree
    structure

34
Feng-Doolittle progressive multiple alignment
  • Calculate a triangular matrix of N(N-1)/2 distances
    between all pairs of N sequences by standard
    pairwise alignment
  • Construct a guide tree from the distance matrix
    using the Fitch-Margoliash clustering algorithm
  • Starting from the first node added to the tree,
    align the child nodes
  • Repeat until all sequences have been aligned.

35
Converting scores to distances
  • $D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}$
  • Where
  • Smax is the maximum score
  • Sobs is the observed pairwise alignment score
  • Srand is the expected score for aligning two
    random sequences
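
The conversion from scores to distances can be sketched as a small function, assuming the usual Feng-Doolittle form D = -log((Sobs - Srand) / (Smax - Srand)). The function name and the error handling are choices made for this example.

```python
import math


def score_to_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into a distance.

    s_obs  -- observed pairwise alignment score
    s_max  -- maximum score
    s_rand -- expected score for aligning two random sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # a score no better than random has no finite distance
        raise ValueError("observed score must exceed the random score")
    return -math.log(s_eff)
```

The distance is 0 when the observed score reaches the maximum and grows without bound as the observed score approaches the random expectation.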

36
Profile alignment
  • Linear gap scores can be included in the SP
    score
  • Global alignment score

37
CLUSTALW progressive alignment
  • Construct a distance matrix of all N(N-1)/2 pairs
    by pairwise dynamic programming alignment.
  • Construct a guide tree by a neighbor-joining
    clustering algorithm (Saitou & Nei).
  • Progressively align at nodes in order of
    decreasing similarity, using sequence-sequence,
    sequence-profile and profile-profile alignment.

38
CLUSTALW properties
  • Sequences are weighted to compensate for biased
    representation.
  • The substitution matrix used to score an
    alignment is chosen based on the expected
    similarity of the sequences
  • Position-specific gap-open profile penalties are
    multiplied by a modifier that is a function of
    the residues observed at the position.

39
CLUSTALW properties
  • Gap-open penalties are also decreased if the
    position is spanned by a consecutive stretch of
    five or more hydrophilic residues.
  • Both gap-open and gap-extend penalties are
    increased if there are no gaps in a column but
    gaps occur nearby in the alignment.
  • In the progressive alignment stage, if the score
    of an alignment is low, the alignment is deferred
    until more profile information has accumulated

40
Iterative refinement methods: Barton-Sternberg
multiple alignment
  • Find two sequences with the highest pairwise
    similarity and align them using standard pairwise
    dynamic programming alignment.
  • Find the sequence that is most similar to a
    profile of the alignment of the first two and
    align it to the first two by profile-sequence
    alignment. Repeat until all sequences have been
    included in the multiple alignment.

41
Iterative refinement methods: Barton-Sternberg
multiple alignment
  • Remove sequence x1 and realign it to a profile of
    the other aligned sequences x2, ..., xN by
    profile-sequence alignment. Repeat for sequences
    x2, ..., xN.
  • Repeat the previous realignment step a fixed
    number of times or until the alignment score
    converges.