Title: Trees, Stars, and Multiple Biological Sequence Alignment
1Trees, Stars, and Multiple Biological Sequence Alignment
- Jesse Wolfgang
- CSE 497
- February 19, 2004
2Importance?
- Molecular evolution (Dayhoff)
- RNA folding (Trifonov, Bolshoi)
- Gene regulation (Galas et al.)
- Protein structure-function relationships (Wu, Kabat)
3Introduction
- Original sequence unknown
- Must consider all possible transformations
- Including insertions, deletions, and replacements
- Choose the most likely set of transformations
- With a given model of protein evolution
4Sequences and Alignments
5Alignments
- Ex.: sequences DQLF, DNVQ, QGL
6Lattices and Paths
7Paths
- n dimensions: 2^n - 1 possible paths, O(2^n)
- 2 dimensions: 3 possible paths
- 3 dimensions: 7 possible paths
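The counts above (3 in two dimensions, 7 in three) can be checked by enumeration. This is a minimal sketch, assuming each path out of a lattice node advances at least one of the n coordinates by one; `lattice_steps` is a hypothetical helper name, not something from the slides.

```python
from itertools import product


def lattice_steps(n):
    """List the unit steps available from a node of an n-dimensional lattice."""
    # Each coordinate either advances (1) or stays put (0).
    # The all-zero step is excluded because it makes no progress.
    return [step for step in product((0, 1), repeat=n) if any(step)]


for n in (2, 3, 4):
    print(n, len(lattice_steps(n)))
```

Each step corresponds to one alignment column: a 1 means that sequence contributes a residue, a 0 means it contributes a gap.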
8Paths
- Sequences DQLF, DNVQ, QGL
9Paths and Sequence Length
10Paths and Sequence Length
- Sequences ABCD, EFGH, IJK
11Projections
- Sequences DQLF, DNVQ, QGL
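A projection can be sketched in a few lines. The alignment below is a hypothetical arrangement of the slide's three sequences, chosen only for illustration, and `project` is an assumed helper name. The sketch keeps two rows of a multiple alignment and drops every column in which both rows hold a gap.

```python
def project(alignment, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j.

    Columns where both rows hold a gap say nothing about the pair,
    so they are removed from the pairwise alignment.
    """
    pairs = [(a, b) for a, b in zip(alignment[i], alignment[j])
             if not (a == gap and b == gap)]
    return "".join(a for a, _ in pairs), "".join(b for _, b in pairs)


# Hypothetical alignment of DQLF, DNVQ, QGL (not taken from the slides).
alignment = ["D--QLF",
             "DNVQ--",
             "---QGL"]

print(project(alignment, 0, 2))  # ('DQLF', '-QGL')
```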
12Optimal Paths
13Calculating Optimal Paths
14Problems with This Algorithm
- Calculates a weighted sum of its projected pairwise alignments
- Called Sum-of-the-Pairs (SP)
- Other methods fit biological intuition more closely
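The SP measure can be written down directly as a sum over all pairs of rows. The sketch below makes several assumptions: the costs (1 for two different letters, 2 for a letter against a null) are borrowed from the example later in the deck, gaps are charged per position, all weights default to 1, and the alignment itself is hypothetical.

```python
from itertools import combinations


def pair_cost(row_a, row_b, mismatch=1, indel=2, gap="-"):
    """Cost of the pairwise alignment obtained by projecting onto two rows."""
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # this column vanishes in the projection
        if a == gap or b == gap:
            cost += indel
        elif a != b:
            cost += mismatch
    return cost


def sp_cost(alignment, weights=None):
    """Weighted sum of the costs of all projected pairwise alignments."""
    total = 0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1 if weights is None else weights[(i, j)]
        total += w * pair_cost(alignment[i], alignment[j])
    return total


# Hypothetical alignment of DQLF, DNVQ, QGL (not taken from the slides).
alignment = ["D--QLF", "DNVQ--", "---QGL"]
print(sp_cost(alignment))  # 8 + 4 + 10 = 22
```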
15Tree-Alignment
- Treat sequences as leaves of an evolutionary tree
- Reconstruct ancestral sequences which minimize the cost of the tree
- Must assign sequences to internal nodes
- Align the given and reconstructed sequences
- Star-alignment: a tree with only one internal node
16Tree-Alignment
- Many different methods for calculating tree alignments
- Discuss version used by ClustalX
17Tree-Alignment in ClustalX
- Three main parts
- Perform pairwise alignment on all sequences to calculate a distance matrix
- Use distance matrix to calculate a guide tree
- Sequences are progressively aligned using the branching order in the guide tree
http://bimas.dcrt.nih.gov/clustalw/clustalw.html
18Calculating Distance Matrix
- Use standard dynamic programming to find the best alignment
- Gap penalties for opening a gap and continuing a gap (possibly different)
- Divide number of matches by total number of residues compared (excluding gaps)
- Convert to distances by dividing by 100 and subtracting from 1
- Gives one entry in the n by n matrix
19Calculating Distance Matrix
- Ex.: sequences ATCG, ATCC, AGGC, AGCC
20Calculating Distance Matrix
        ATCG    ATCT    AGGC    GCAA
ATCG    --      --      --      --
ATCT    .9925   --      --      --
AGGC    .9975   .9975   --      --
GCAA    1       1       1       --
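The entries above can be reproduced with a short sketch. It assumes the four sequences are already aligned without gaps, skipping the dynamic programming step, and it follows the conversion literally with identity as a fraction, which is how the table's values come out. ClustalW's documentation describes the score as a percent identity; the hypothetical `as_percent` flag shows that reading for comparison.

```python
def identity_distance(row_a, row_b, as_percent=False, gap="-"):
    """Distance between two aligned rows: 1 - identity / 100."""
    # Only positions where neither row has a gap are compared.
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not compared:
        raise ValueError("no residues to compare")
    identity = sum(a == b for a, b in compared) / len(compared)
    if as_percent:
        identity *= 100  # percent-identity reading (assumption, see lead-in)
    return 1 - identity / 100


seqs = ["ATCG", "ATCT", "AGGC", "GCAA"]

# Lower triangle of the n by n matrix, keyed by (row, column) sequence.
matrix = {(a, b): identity_distance(a, b)
          for i, a in enumerate(seqs) for b in seqs[:i]}

for (a, b), d in matrix.items():
    print(a, b, round(d, 4))
```

With `as_percent=True` the ATCG/ATCT entry becomes 0.25 instead of 0.9925, so the choice of reading changes the scale of every distance in the matrix.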
21Calculating a Guide Tree
- Use the Neighbor-Joining method to group sequences
- Results in an unrooted tree
- Branch lengths proportional to estimated divergence
- Mid-point method used to determine root
- Means of the branch lengths to each side of the root are equal (or approximately equal)
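The grouping step can be sketched with neighbor-joining, the clustering method that ClustalW's documentation describes for guide trees. This is a minimal illustration, not ClustalX's code: the taxon labels and distances are hypothetical, `neighbor_joining` is an assumed name, and midpoint rooting is left out.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return (joins, last_edge) for an unrooted neighbor-joining tree.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    Each join is (left, right, left_length, right_length, parent).
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dab = d[frozenset((a, b))]
        len_a = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        parent = f"({a},{b})"
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((parent, c))] = (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - dab) / 2
        nodes = [c for c in nodes if c not in (a, b)] + [parent]
        joins.append((a, b, len_a, len_b, parent))
    # The two remaining nodes are connected by a single branch.
    a, b = nodes
    return joins, (a, b, d[frozenset((a, b))])


# Hypothetical distances between five taxa, for illustration only.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in raw.items()}

joins, last_edge = neighbor_joining("abcde", dist)
print(joins[0])  # ('a', 'b', 2.0, 3.0, '(a,b)')
```

The branch lengths returned with each join are what a midpoint-rooting step would then use to place the root.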
22Calculating a Guide Tree
[Figure: guide tree with leaves AGAA, GCAA, AGCC, AGGC, ATCG, ATCT, ATCG]
23Calculating a Guide Tree
24Progressive Alignment
- Perform a series of pairwise alignments
- Slowly align larger and larger groups of sequences
- Follow the branching order of the tree
- From leaves to root
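The order of the pairwise alignments follows from a post-order walk of the guide tree. The sketch below only derives that order; it does not align the groups themselves. The nested-tuple tree encoding, the branching order shown, and the name `merge_order` are all assumptions made for illustration.

```python
def merge_order(tree):
    """List the alignment steps for a guide tree, from leaves to root.

    A leaf is a string; an internal node is a (left, right) tuple.
    Each step pairs two groups of sequences that are already aligned.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))  # children are done before their parent
        return left + right

    visit(tree)
    return steps


# Hypothetical branching order, not the tree from the slides.
tree = (("ATCG", "ATCT"), ("AGGC", "GCAA"))

for left, right in merge_order(tree):
    print(left, "+", right)
```

The last step always joins the two groups on either side of the root, so the groups grow larger as the walk moves up the tree.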
25Progressive Alignment
[Figure: progressive alignment of AGCC, ATCG, AGAA]
26Alignment Costs
[Figure: three panels, each showing the column A, A, A, C, C; the first is labeled Traditional]
27Alignment Inconsistencies
- Different definitions of multiple alignments can yield different optimal alignments
- Optimal tree-alignments minimize number of mutations from theorized common ancestors
- SP-alignments maximize number of positions where aligned sequences agree
- Sometimes makes more biological sense since certain regions of proteins more likely to mutate
28Alignment Inconsistencies
- Ex.: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null
- Sequences ACC, ACC, TCT, ATCT
Input sequences    Reconstructed sequences
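The cost model in this example can be tried out on a star, the tree with one internal node. The sketch below is an illustration under stated assumptions rather than the reconstruction shown on the slide: the candidate ancestors are restricted to the input sequences themselves, and `edit_cost` and `star_cost` are hypothetical helper names.

```python
def edit_cost(a, b, mismatch=1, indel=2):
    """Minimum cost of aligning a with b by dynamic programming."""
    # prev[j] is the cost of aligning the processed prefix of a with b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def star_cost(center, leaves):
    """Cost of a star tree: sum of the costs from the center to each leaf."""
    return sum(edit_cost(center, leaf) for leaf in leaves)


leaves = ["ACC", "ACC", "TCT", "ATCT"]
for center in sorted(set(leaves)):
    print(center, star_cost(center, leaves))
```

Among these three candidates, ACC gives the lowest star cost (5), against 6 for TCT and 8 for ATCT. A full tree alignment would also consider ancestors that are not among the inputs.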
29ClustalX Demo
- Multiple sequence alignment program
- For more information on ClustalX
- http://www.at.embnet.org/embnet/progs/clustal/clustalx.htm