Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

Slides: 43
Provided by: john244
Learn more at: https://www.cse.sc.edu
Transcript and Presenter's Notes

Title: Bioinformatics Algorithms and Data Structures


1
Bioinformatics Algorithms and Data Structures
  • CLUSTAL W Algorithm
  • Lecturer: Dr. Rose
  • Slides by: Dr. Rose
  • April 3, 2003

2
Multiple Sequence Alignment
  • CLUSTAL is an algorithm for aligning multiple
    sequences.
  • Reasons for computing multiple alignments
  • Characterizing protein families
  • Detect homology between sequences and families of
    sequences
  • Predict secondary and tertiary structures of new
    sequences.
  • Needed for creating phylogenetic trees.

3
Multiple Sequence Alignment
  • Recall: dynamic programming (DP) is used for
    pairwise (two-sequence) alignment
  • Guarantees optimal alignment relative to the
    scoring table that is used.
  • DP is only practical for small numbers of short
    sequences.
  • Impractical for
  • large numbers of sequences
  • Very long sequences
  • i.e., more than 8 proteins of average length.

4
Progressive Algorithms
  • Progressive Approaches
  • Exploit idea that homologous sequences are
    related by evolution.
  • Multiple alignments can be built up from pairwise
    alignments.
  • The pairwise alignments follow branching in the
    guide tree.
  • The most closely related sequences are aligned
    first.
  • The more distant related sequences are gradually
    added.

5
Progressive Algorithms
  • Empirical observations
  • For simple cases
  • correctly align domains of known secondary and
    tertiary structures.
  • closely related sequences are less sensitive to
    parameter settings, i.e., gap penalties and
    weight matrix.
  • In all cases
  • gaps are preserved, i.e., once a gap always a
    gap.
  • progressive alignment gives an idea of the
    variability at each position before more distant
    sequences are added.

6
Progressive Algorithms
  • Empirical observations
  • For more complicated cases
  • Progressive approach is less reliable for highly
    divergent sequences (less than 25-30% identity).
  • gives a good starting point for further
    manual/automatic refinement.

7
Problems with Progressive Algorithms
  • Local minimum problem
  • Recall this is a greedy algorithm approach
  • Sequences are added greedily
  • Multiple alignments are built up from pairwise
    alignments.
  • The pairwise alignments follow branching in the
    initial guide tree. (more on this later)
  • No guarantee of a global optimum
  • Regions misaligned early on cannot be corrected
    later on.

8
Problems with Progressive Algorithms
  • Sensitivity to alignment parameters
  • problematic also for iterative and stochastic
    algorithms.
  • Traditional parameters
  • weight table
  • cost of opening a gap
  • cost of extending a gap
  • Expectation is one set of parameters works well
    over
  • all sequences in the set
  • all parts of each sequence

9
Problems with Progressive Algorithms
  • Sensitivity to alignment parameters (continued)
  • A single weight matrix choice will generally work
    for closely related sequences.
  • weight matrices give highest weight to identities
  • Any weight matrix will work ok if identities
    dominate
  • For divergent sequences
  • Nonidentical residues are more significant
  • Scores to these residues are critical
  • Different weight matrices will be required for
  • different evolutionary distances
  • Different classes of proteins

10
Problems with Progressive Algorithms
  • Sensitivity to alignment parameters (continued)
  • A range of gap penalty values will generally work
    for closely related sequences.
  • For divergent sequences
  • The specific choice of gap penalty value becomes
    critical
  • For proteins, gaps don't occur randomly.
  • Recall our discussion of conserved secondary
    features
  • Gaps occur between alpha helices and beta strands
    rather than within them

11
CLUSTAL W Contributions
  • Dynamically vary gap penalties according to
    position and residue
  • Local gap opening penalty adjustment
  • relative to the observed relative frequency of
    gaps next to each of the 20 amino acids.
  • reduced for loop or random coil regions (as
    indicated by short stretches of hydrophilic
    residues)
  • reduced for gaps found in early alignments
  • increased within 8 residues of existing gaps
    (observation: gaps tend not to be closer than 8
    residues)

12
CLUSTAL W Contributions
  • Weight matrices are chosen dynamically
  • PAM series and BLOSUM series are main series of
    amino acid weight matrices in use.
  • Choice of weight matrix is by estimation of
    divergence of sequences being aligned at each
    step.
  • Different weight matrices are appropriate
    depending on similarity of sequences

13
CLUSTAL W Contributions
  • Different weight matrices are appropriate
    depending on similarity of sequences
  • For closely related sequences
  • identities predominate
  • Only frequent conservative substitutions are
    scored high
  • For evolutionary divergent sequences
  • Less weight should be given to identities
  • Weight matrix should be tuned to greater
    evolutionary distance

14
CLUSTAL W Contributions
  • Weighting of sequences
  • corrects for unequal sampling across the
    evolutionary distance in the data set.
  • Downweights similar sequences
  • Upweights divergent sequences
  • Weights are calculated from the branch lengths of
    the initial guide tree.

15
CLUSTAL W Contributions
  • Neighbor-Joining method used to calculate guide
    tree
  • Less sensitive to unequal evolutionary rates in
    different branches.
  • Significance: branch lengths are used to derive
    sequence weights.
  • Accuracy of distance calculations for guide tree
  • Tree constructed from pairwise distance matrix
  • Fast approximate alignment
  • Full dynamic programming
  • User selectable

16
CLUSTAL W Algorithm
  • Basic method
  • Distance matrix is calculated
  • Distances are pairwise alignment scores
  • Gives divergence of each pair of sequences
  • Guide tree built from distance matrix
  • Progressive alignment according to guide tree
  • Branching order of tree specifies alignment order
  • Alignment progresses from leaves to root.

17
CLUSTAL W Algorithm
  • Distance matrix/pairwise alignments phase
  • Two choices: fast approximation or DP
  • Fast approximation
  • Definition: a k-tuple match is a run of k
    identical residues, with k typically
  • 1 to 2 for proteins
  • 2 to 4 for nucleotide sequences
  • Scores are calculated as (number of k-tuple
    matches) - (fixed penalty per gap)
  • Score is initially calculated as a percent
    identity score.
  • Distance = 1.0 - (score/100)
18
CLUSTAL W Algorithm
  • Distance matrix/pairwise alignments phase
  • Full DP alignment
  • Alignment uses
  • gap opening penalties
  • gap extension penalties
  • full amino acid weight matrix.
  • Scores are calculated as (identities)/(residues
    compared), gap positions not included
  • Score is initially calculated as a percent
    identity score.
  • Distance = 1.0 - (score/100)
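
The conversion from a pairwise alignment to a distance can be sketched as below. This is a minimal illustration, not the CLUSTAL W source: the function names and the toy sequences are hypothetical, and the alignment itself is assumed to have been computed already.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gap positions are not included in the score
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def distance(aligned_a: str, aligned_b: str) -> float:
    """Distance = 1.0 - (score / 100), with score as percent identity."""
    return 1.0 - percent_identity(aligned_a, aligned_b) / 100.0


if __name__ == "__main__":
    # Hypothetical aligned pair: 6 compared columns, 5 of them identical.
    print(distance("ACD-EFG", "ACDHEYG"))
```

Identical sequences get distance 0.0, and the distance grows toward 1.0 as identity drops.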

19
NJ Algorithm
  • Neighbor Joining to Calculate the Guide Tree
    Phase
  • does not require a uniform molecular clock
  • the raw data are provided as a distance matrix
  • the initial tree is a star tree
  • distance matrix is modified
  • distance between node pairs is adjusted on the
    basis of their average divergence from all other
    nodes.
  • the least-distant pair of nodes are linked.

20
NJ Algorithm
  • Neighbor Joining to Calculate the Guide Tree
    Phase
  • When two nodes are linked
  • Add their common ancestral node to the tree
  • delete the terminal nodes with their branches
  • the common ancestor is now a terminal node on a
    smaller tree
  • At each step, two terminal nodes are replaced by
    one new node
  • The process is complete when there are only two
    nodes separated by a single branch

21
NJ Algorithm
  • Advantages of Neighbor Joining
  • Fast.
  • Can be used on large datasets
  • Can support bootstrap analysis
  • Can handle lineages with largely different branch
    lengths (different molecular evolutionary rates)
  • Can be used with methods that use correction for
    multiple substitutions

22
NJ Algorithm
  • Disadvantages of Neighbor Joining
  • sequence information is reduced
  • Sequences are boiled down to distances
  • No secondary or tertiary features used
  • gives only one possible tree
  • strongly dependent on the model of evolution used

23
NJ Algorithm
  • NJ example from
    http://www.icp.ucl.ac.be/opperd/private/neighbor.html
  • Consider the following tree
  • Notice that the branches for D and B are longer.
  • This expresses the idea that they have a faster
    molecular clock than the other OTUs.

24
NJ Algorithm
  • The distance matrix for the tree is

Normally, we create the tree from the
distances. In this example, we use the tree to
derive the distances.

25
NJ Algorithm
  • We start with a star tree.
  • Notice that we have 6 operational taxonomic units
    (OTUs)
  • The star tree has a leaf for each OTU

26
NJ Algorithm
  • Step 1: Calculate the net divergence r(i) for each
    OTU.
  • The net divergence is the sum of distances from i
    to all other OTUs.

r(A) = 5 + 4 + 7 + 6 + 8 = 30, r(B) = 42, r(C) = 32,
r(D) = 38, r(E) = 34, r(F) = 44

27
NJ Algorithm
  • Step 2: Calculate a new distance matrix based on
    average divergence
  • M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
  • Example: A,B
  • M(AB) = d(AB) - [r(A) + r(B)]/(N-2)
    = 5 - (30 + 42)/4 = -13

Recall: r(A) = 30, r(B) = 42, and N = 6
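
Steps 1 and 2 can be checked with a short sketch. The distance matrix itself did not survive in this transcript, so the values in `LOWER` are an assumption: they were reconstructed so that they reproduce every r(i) value and M(AB) = -13 stated on the slides. All names are hypothetical.

```python
NAMES = ["A", "B", "C", "D", "E", "F"]
# ASSUMPTION: reconstructed lower triangle; row k holds the distances
# from NAMES[k + 1] to NAMES[0..k]. Not copied from the transcript.
LOWER = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]


def full_matrix(names, lower):
    """Expand a lower triangle into a symmetric dict-of-dicts matrix."""
    d = {a: {b: 0 for b in names} for a in names}
    for k, row in enumerate(lower):
        for j, value in enumerate(row):
            d[names[k + 1]][names[j]] = value
            d[names[j]][names[k + 1]] = value
    return d


def net_divergence(d):
    """Step 1: r(i) is the sum of distances from i to all other OTUs."""
    return {i: sum(d[i].values()) for i in d}


def adjusted_matrix(d):
    """Step 2: M(ij) = d(ij) - [r(i) + r(j)] / (N - 2)."""
    n = len(d)
    r = net_divergence(d)
    names = list(d)
    return {(i, j): d[i][j] - (r[i] + r[j]) / (n - 2)
            for a, i in enumerate(names) for j in names[a + 1:]}


if __name__ == "__main__":
    d = full_matrix(NAMES, LOWER)
    print(net_divergence(d))
    m = adjusted_matrix(d)
    print(m[("A", "B")], m[("D", "E")])
```

With these values the two smallest entries of M are the pairs A,B and D,E.
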
28
NJ Algorithm
  • Step 2 (continued)
  • M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

[Tables not reproduced in this transcript: distance matrix, average divergence matrix]

29
NJ Algorithm
  • Step 3: Choose the two OTUs for which M(ij) is the
    smallest.
  • the possible choices are A,B and D,E
  • arbitrarily choose A and B
  • form a new node called U, the parent of A and B.
  • calculate the branch lengths from U to A and B.
  • S(AU) = d(AB)/2 + [r(A) - r(B)]/[2(N-2)]
    = 5/2 + (30 - 42)/8 = 1
  • S(BU) = d(AB) - S(AU) = 5 - 1 = 4
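
The branch-length arithmetic of step 3 can be written as a small function. This is a sketch under the slide's own numbers, d(AB) = 5, r(A) = 30, r(B) = 42, N = 6; the function name is hypothetical.

```python
def branch_lengths(d_ij: float, r_i: float, r_j: float, n: int):
    """Return (S(iU), S(jU)) for the new parent node U of i and j."""
    s_iu = d_ij / 2 + (r_i - r_j) / (2 * (n - 2))
    s_ju = d_ij - s_iu  # the two branches must add up to d(ij)
    return s_iu, s_ju


if __name__ == "__main__":
    print(branch_lengths(5, 30, 42, 6))
```

The OTU with the larger net divergence (here B) receives the longer branch.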

30
NJ Algorithm
  • The tree after U is added.

31
NJ Algorithm
  • Step 4: Define distances from U to the other
    terminal nodes
  • d(CU) = [d(AC) + d(BC) - d(AB)]/2 = 3
  • d(DU) = [d(AD) + d(BD) - d(AB)]/2 = 6
  • d(EU) = [d(AE) + d(BE) - d(AB)]/2 = 5
  • d(FU) = [d(AF) + d(BF) - d(AB)]/2 = 7
  • Note: no change in the pairwise distances among
    C, D, E, F
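
The step 4 update is the same formula applied to every remaining OTU. In the sketch below the input distances are an assumption: they were reconstructed to match the results 3, 6, 5 and 7 given on the slide, since the original matrix is not in this transcript. The function name is hypothetical.

```python
def distance_to_new_node(d_ik: float, d_jk: float, d_ij: float) -> float:
    """d(kU) = [d(ik) + d(jk) - d(ij)] / 2 for the new node U joining i and j."""
    return (d_ik + d_jk - d_ij) / 2


if __name__ == "__main__":
    d_ab = 5
    # ASSUMPTION: (d(Ak), d(Bk)) pairs reconstructed from the slide results.
    to_a_and_b = {"C": (4, 7), "D": (7, 10), "E": (6, 9), "F": (8, 11)}
    for k, (d_ak, d_bk) in to_a_and_b.items():
        print(k, distance_to_new_node(d_ak, d_bk, d_ab))
```
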

32
NJ Algorithm
  • Now N = N - 1 = 5
  • Repeat steps 1 through 4
  • Stop when N = 2
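
Putting steps 1 through 4 in a loop gives the whole procedure. The following is a minimal sketch, not the CLUSTAL W implementation. It assumes the reconstructed distance matrix in `LOWER` (not present in this transcript), names internal nodes U0, U1, ... (a hypothetical convention), and breaks ties by taking the first minimal pair.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E", "F"]
# ASSUMPTION: reconstructed lower triangle; row k holds the distances
# from NAMES[k + 1] to NAMES[0..k].
LOWER = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
DIST = {frozenset((NAMES[k + 1], NAMES[j])): value
        for k, row in enumerate(LOWER) for j, value in enumerate(row)}


def neighbor_joining(names, dist):
    """Return the unrooted tree as a list of (child, parent, length) edges."""
    nodes = list(names)
    dist = dict(dist)  # work on a copy
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: net divergence of every current terminal node.
        r = {i: sum(dist[frozenset((i, j))] for j in nodes if j != i)
             for i in nodes}
        # Steps 2-3: pick the pair with the smallest M(ij).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[frozenset(p)]
                   - (r[p[0]] + r[p[1]]) / (n - 2))
        d_ij = dist[frozenset((i, j))]
        s_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        u = f"U{next_id}"
        next_id += 1
        edges.append((i, u, s_i))
        edges.append((j, u, d_ij - s_i))
        # Step 4: distances from the new node to the remaining nodes.
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            dist[frozenset((u, k))] = (dist[frozenset((i, k))]
                                       + dist[frozenset((j, k))] - d_ij) / 2
        nodes = rest + [u]  # N decreases by one
    a, b = nodes
    edges.append((a, b, dist[frozenset((a, b))]))
    return edges


if __name__ == "__main__":
    for child, parent, length in neighbor_joining(NAMES, DIST):
        print(f"{child} - {parent}: {length}")
```

The first two edges printed are A - U0 and B - U0 with lengths 1 and 4, matching step 3.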

33
CLUSTAL W Algorithm
  • The final result of the tree produced by NJ is an
    unrooted tree.
  • The branch lengths are proportional to the
    estimated divergence.
  • A mid-point method is used to place the root
  • The mid point is defined at the point where the
    means of the branch lengths on either side are
    equal.

34
CLUSTAL W Algorithm
  • Basic Progressive Alignment Phase
  • Use a series of pairwise alignments
  • The alignments follow the branching order of the
    guide tree
  • The alignments start from the leaves and progress
    towards the root
  • Full DP with a residue weight matrix is used
  • Gaps are preserved
  • Newly created gaps get full opening and extension
    penalties

35
CLUSTAL W Algorithm
  • Basic Progressive Alignment Phase
  • Each step involves two existing alignments or
    sequences
  • The score at a given position is the average of
    the pairwise weight matrix scores.
  • Example: aligning 2 alignments with 3 and 4
    sequences, respectively
  • The score at a given position is the average of
    the 3 x 4 = 12 comparisons.
  • The weight matrix has only positive scores
  • Each gap versus a residue is scored a zero, the
    worst value
  • This is the average linkage cluster distance
    metric
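
The column-versus-column score can be sketched as follows. The weight matrix `TOY` is a made-up two-residue table with positive scores only, and the function name is hypothetical; a real run would use a PAM or BLOSUM matrix rescaled to be positive.

```python
# Hypothetical toy weight matrix, keyed by the unordered residue pair.
TOY = {frozenset("A"): 4, frozenset("S"): 4, frozenset("AS"): 2}


def column_score(col_a: str, col_b: str, matrix, gap: str = "-") -> float:
    """Average of all pairwise scores between two alignment columns.

    Any comparison involving a gap contributes zero, the worst value.
    """
    total = 0.0
    for x in col_a:
        for y in col_b:
            if x == gap or y == gap:
                continue
            total += matrix[frozenset((x, y))]
    return total / (len(col_a) * len(col_b))


if __name__ == "__main__":
    # One column from a 3-sequence alignment against one column from a
    # 4-sequence alignment: 3 x 4 = 12 comparisons.
    print(column_score("AAS", "AS-S", TOY))
```
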

36
CLUSTAL W Algorithm
  • Example
  • (1) A and B are aligned
  • (2) C is aligned with the result of (1)
  • (3) D and E are aligned
  • (4) The results of (2) and (3) are aligned
  • (5) F is aligned with the result of (4)
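
This order falls out of a leaves-to-root (post-order) walk of the rooted guide tree. The sketch below assumes a tree shape consistent with the example, written as nested pairs; the shape, the names, and the string labels for intermediate alignments are all hypothetical.

```python
# ASSUMPTION: rooted guide tree consistent with the alignment order above.
TREE = (((("A", "B"), "C"), ("D", "E")), "F")


def alignment_order(tree):
    """List the pairwise alignment steps in leaves-to-root order."""
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node  # a leaf is a single sequence
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))  # align the two sub-results
        return left + right  # label for the merged alignment

    visit(tree)
    return steps


if __name__ == "__main__":
    for number, (left, right) in enumerate(alignment_order(TREE), start=1):
        print(f"({number}) align {left} with {right}")
```
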

37
CLUSTAL W Algorithm
  • Improvement to Progressive Alignment Phase
  • Sequence weighting
  • Calculated from the guide tree
  • Normalized so that largest weight is 1.0
  • Closely related sequences receive lower weights
  • They over-represent their common information
  • A lower weight seeks to reduce this influence
  • Divergent sequences receive higher weights
  • Sequence weight impacts alignment scores
  • each weight matrix value is multiplied by the
    weights of the two sequences.
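
A sketch of how the weights might enter the column score. Normalizing to a largest weight of 1.0 and multiplying each matrix value by both sequence weights come from the slide; averaging the weighted values over all pairs is an assumption carried over from the unweighted case. The raw weights and the `toy_score` function are hypothetical.

```python
def normalize(raw_weights):
    """Scale weights so that the largest weight is 1.0."""
    largest = max(raw_weights)
    return [w / largest for w in raw_weights]


def weighted_column_score(col_a, w_a, col_b, w_b, score, gap="-"):
    """Average pairwise score, each value multiplied by both weights."""
    total = 0.0
    for x, wx in zip(col_a, w_a):
        for y, wy in zip(col_b, w_b):
            if x != gap and y != gap:
                total += score(x, y) * wx * wy
    return total / (len(col_a) * len(col_b))


def toy_score(x, y):
    """Hypothetical positive-only weight matrix."""
    return 4 if x == y else 2


if __name__ == "__main__":
    # Two closely related sequences (low weight) and one divergent one.
    w = normalize([0.2, 0.2, 0.4])
    print(w)
    print(weighted_column_score("AA", w[:2], "S", w[2:], toy_score))
```
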

38
CLUSTAL W Algorithm
  • Improvement to Progressive Alignment Phase
  • Two gap penalty types
  • Gap opening (GOP)
  • Gap extension (GEP)
  • Actual assessed penalty depends on
  • Weight matrix: GOP is scaled by the average score
    of mismatched residues
  • Similarity of sequences: percent identity is used
    to
  • increase GOP for similar sequences
  • decrease GOP for divergent sequences

39
CLUSTAL W Algorithm
  • Actual assessed penalty depends on (continued)
  • Length of sequences: the logarithm of the length
    of the shorter sequence is used to increase GOP
    with sequence length
  • GOP -> (GOP + log(min(N,M))) * (avg residue
    mismatch score) * (% identity scaling factor)
  • Difference in sequence lengths: GEP is increased
    to inhibit many long gaps in the shorter sequence.
  • GEP -> GEP * (1.0 + |log(N/M)|)
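
The two scaling rules can be sketched as plain functions. The natural logarithm is an assumption (the slides do not state the base), and the mismatch score and identity scaling factor are passed in as plain numbers because their derivation is not given here. Function and parameter names are hypothetical.

```python
import math


def scaled_gop(gop, n, m, avg_mismatch_score, identity_scaling):
    """GOP -> (GOP + log(min(N, M))) * mismatch score * identity factor."""
    return (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scaling


def scaled_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N / M)|); unchanged when N equals M."""
    return gep * (1.0 + abs(math.log(n / m)))


if __name__ == "__main__":
    print(scaled_gop(10.0, 100, 250, 1.0, 1.0))
    print(scaled_gep(0.1, 100, 250))
```

The absolute value makes the GEP adjustment symmetric in the two sequence lengths.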

40
CLUSTAL W Algorithm
  • Improvement to Progressive Alignment Phase
  • Position-specific gap penalties
  • Lowered GOP at existing gaps
  • if a position already has gaps, GOP is reduced
    relative to the number of sequences with a gap at
    that position
  • GOP -> GOP * 0.3 * (# sequences w/o gap)/(#
    sequences)
  • Increased GOP near existing gaps
  • New gap within 8 residues of an existing gap
  • GOP -> GOP * (2 + ((8 - distance from gap) * 2) /
    8)
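
Both position-specific rules are simple rescalings, sketched below. Treating a distance of exactly 8 as still "within 8 residues" is an assumption about the boundary; the function names are hypothetical.

```python
def gop_at_existing_gap(gop, n_without_gap, n_sequences):
    """GOP -> GOP * 0.3 * (# sequences without a gap) / (# sequences)."""
    return gop * 0.3 * n_without_gap / n_sequences


def gop_near_gap(gop, distance_from_gap):
    """GOP -> GOP * (2 + ((8 - distance) * 2) / 8) within 8 residues."""
    if distance_from_gap > 8:
        return gop  # far enough from any gap: no increase
    return gop * (2 + ((8 - distance_from_gap) * 2) / 8)


if __name__ == "__main__":
    print(gop_at_existing_gap(10.0, 3, 4))  # 3 of 4 sequences have no gap
    print(gop_near_gap(10.0, 4))            # 4 residues from a gap
```

The closer a position is to an existing gap, the larger the multiplier, so new gaps are discouraged right next to old ones.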

41
CLUSTAL W Algorithm
  • Improvement to Progressive Alignment Phase
  • Position-specific gap penalties continued
  • Reduced GOP in hydrophilic stretches
  • A stretch is 5 or more consecutive hydrophilic
    residues
  • Hydrophilic residues are D, E, G, K, N, Q, P, R, S
  • GOP is reduced by a third if there is no gap in a
    stretch
  • Residue specific penalty
  • GOP is modified if there is no gap and no
    hydrophilic stretch
  • There is an adjustment factor for each of the 20
    residues
  • For mixtures, the factor is the average of all
    contributing residues
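
Finding hydrophilic stretches is a run-length scan, sketched below with the residue set and the minimum length of 5 from the slide. The function names and the test sequence are hypothetical, and the per-residue adjustment factors are not shown because their values are not given here.

```python
HYDROPHILIC = set("DEGKNQPRS")


def hydrophilic_stretches(seq: str, min_len: int = 5):
    """Return (start, end) index pairs, end exclusive, of hydrophilic runs."""
    stretches = []
    start = None
    for i, residue in enumerate(seq + "#"):  # sentinel closes a final run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                stretches.append((start, i))
            start = None
    return stretches


def gop_in_stretch(gop: float) -> float:
    """GOP reduced by a third inside a gap-free hydrophilic stretch."""
    return gop * 2.0 / 3.0


if __name__ == "__main__":
    print(hydrophilic_stretches("AVLDEGKNQAVL"))
    print(gop_in_stretch(9.0))
```
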

42
  • The End