Title: Jacques.van.Helden@ulb.ac.be
1Sequence analysis, Part 4. Multiple sequence alignments
- Introduction to Bioinformatics
2Multiple sequence alignment
3Dynamic programming - multiple alignment
- Dynamic programming can be extended to treat a set of 3 sequences:
- build a 3-dimensional matrix;
- the best score of each cell is calculated on the basis of the preceding cells in the 3 directions, and a scoring scheme (substitution matrix + gap cost).
- It can be extended to n sequences by using an n-dimensional hyper-cube.
- Problem: matrix size and execution time increase exponentially with the number of sequences.
- 2 sequences: L1 x L2
- 3 sequences: L1 x L2 x L3
- 4 sequences: L1 x L2 x L3 x L4
- n sequences: L1 x L2 x ... x Ln
- Aligning n sequences of length L with dynamic programming requires O(L^n) operations, which very rapidly becomes impractical.
- The efficiency can be improved by only considering a subspace of the n-dimensional matrix. However, even with this kind of algorithmic improvement, the number of sequences that can be aligned is still restricted (8 sequences maximum).
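The 3-dimensional case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sum-of-pairs column score and the values match=1, mismatch=-1, gap=-2 are chosen for the example and are not taken from the slides. Each cell looks at up to 7 preceding cells (every non-empty combination of steps along the 3 directions), and the matrix holds (L1+1) x (L2+1) x (L3+1) cells.

```python
from itertools import product


def sp_column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column ('-' denotes a gap).

    The scoring values are illustrative assumptions, not a real
    substitution matrix.
    """
    score = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue  # gap facing gap: no cost
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def align3_score(s1, s2, s3):
    """Best global alignment score of three sequences (score only)."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    # 3-dimensional matrix with (n1+1) x (n2+1) x (n3+1) cells
    F = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    # The 7 possible moves: advance in at least one of the 3 directions
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    a = s1[pi] if di else "-"
                    b = s2[pj] if dj else "-"
                    c = s3[pk] if dk else "-"
                    best = max(best, F[pi][pj][pk] + sp_column_score(a, b, c))
                F[i][j][k] = best
    return F[n1][n2][n3]
```

With three sequences of length 100 this already fills about one million cells, which illustrates why the approach does not scale to many sequences.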
4Progressive alignment
- Another approach to align multiple sequences is to perform a progressive alignment. The algorithm proceeds in several steps:
- Calculate a distance matrix, representing the distance between each pair of sequences.
- From this matrix, build a phylogenetic tree.
- Use this tree as a guide to progressively align the sequences.
- This is a heuristic:
- it is a practically tractable approach, but it cannot guarantee to return the optimal solution.
5Progressive alignment - distance matrix
- Perform a pairwise alignment between each pair of sequences (dynamic programming or a faster heuristic algorithm).
- From each pairwise alignment, calculate the distance between the two sequences.
- $d_{i,j} = s_{i,j} / L_{i,j}$
- $d_{i,j}$: distance between sequences i and j
- $L_{i,j}$: length of the alignment
- $s_{i,j}$: number of substitutions
- Remarks
- Gaps are not taken into account in the distance metric.
- The matrix is symmetrical ($d_{i,j} = d_{j,i}$).
- Diagonal elements are null ($d_{i,i} = 0$).
- For n sequences: n(n-1)/2 pairwise alignments.
[Figure: unaligned sequences -> all pairwise distances -> distance matrix]
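As a minimal sketch of the distance defined on this slide (substitutions divided by alignment length, gaps ignored), the following Python code computes the distance for one pairwise alignment and fills a symmetrical matrix. The function names and the `pairwise_alignments` dictionary are hypothetical helpers introduced for the example; the pairwise alignments themselves are assumed to have been computed beforehand.

```python
from itertools import combinations


def alignment_distance(aligned_a, aligned_b):
    """d = s / L for two gapped strings of equal length.

    s counts columns where both sequences carry a residue and the
    residues differ; columns containing a gap are not counted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    substitutions = sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x != "-" and y != "-" and x != y
    )
    return substitutions / len(aligned_a)


def distance_matrix(names, pairwise_alignments):
    """Build a symmetrical matrix with a null diagonal.

    pairwise_alignments (hypothetical input) maps (name_i, name_j)
    to the two gapped strings of that pairwise alignment.
    """
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):  # n(n-1)/2 pairs
        d = alignment_distance(*pairwise_alignments[(a, b)])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Two ungapped sequences of equal length, differing at one position
names = ["Seq5", "Seq1"]
pairwise = {("Seq5", "Seq1"): ("GATTGTAGTA", "GATGGTAGTA")}
matrix = distance_matrix(names, pairwise)
```

Here the two sequences differ at one position out of 10, so the distance is 1/10.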
6Progressive alignment - guide tree
- A phylogenetic tree is calculated from the distance matrix:
- first, group the two closest sequences (e.g. node 1);
- next, group either
- the two closest sequences (e.g. node 2),
- one sequence with a previous cluster (e.g. node 4), or
- one cluster with another cluster (e.g. node 3).
- This tree will then be used as a guide to determine the order of incorporation of the sequences in the multiple alignment.
[Figure: unaligned sequences -> all pairwise distances -> distance matrix -> tree calculation -> guide tree]
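The grouping procedure can be sketched as a simple agglomerative clustering. The slide does not name the clustering method, so the average-linkage rule used below (UPGMA-like) is an assumption, and the distances in the example are invented for illustration only.

```python
from itertools import combinations


def leaves(tree):
    """List the sequence names under a (possibly nested) cluster."""
    if isinstance(tree, str):
        return [tree]
    return [name for subtree in tree for name in leaves(subtree)]


def cluster_distance(t1, t2, dist):
    """Average distance between the members of two clusters (assumed rule)."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; returns nested tuples."""
    clusters = list(names)
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append((a, b))
    return clusters[0]


# Invented distances, for illustration only
names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
dist = {frozenset(p): 0.3 for p in combinations(names, 2)}
dist[frozenset(("Seq1", "Seq5"))] = 0.1
dist[frozenset(("Seq2", "Seq4"))] = 0.15
for other in ("Seq1", "Seq2", "Seq4", "Seq5"):
    dist[frozenset(("Seq3", other))] = 0.5

tree = guide_tree(names, dist)
```

With these distances, the two closest sequences are grouped first, then a second pair, then the two clusters together, and finally the most distant sequence joins the existing cluster.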
7Progressive alignment - multiple alignment
- Build a multiple alignment, by progressively
incorporating the sequences according to the
guide tree.
[Figure: unaligned sequences -> all pairwise distances -> distance matrix -> tree calculation -> guide tree (nodes 1 to 4) -> progressive alignment -> multiple alignment]
Pairwise alignments:
Seq5 GATTGTAGTA
Seq1 GATGGTAGTA
Seq4 GATTGTTC--GTA
Seq2 GATTGTTCGGGTA
Intermediate alignment (Seq5, Seq1, Seq4, Seq2):
Seq5 GATTGTA---GTA
Seq1 GATGGTA---GTA
Seq4 GATTGTTC--GTA
Seq2 GATTGTTCGGGTA
Multiple alignment (all five sequences):
Seq5 GATTGTA-----GTA
Seq1 GATGGTA-----GTA
Seq4 GATTGTTC----GTA
Seq2 GATTGTTCGG--GTA
Seq3 GATGGTAGGCGTGTA
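A possible sketch of the progressive step: each node of the guide tree aligns the two alignments below it, column against column, with a Needleman-Wunsch style recursion. The scoring values (match=1, mismatch=-1, gap=-2), the averaging of pair scores between columns, the guide tree and the ungapped Seq4 (obtained by removing the gaps from its aligned form) are assumptions made for the example. The result is not guaranteed to reproduce the alignment shown on the slide.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Average pair score between two alignment columns (assumed scheme)."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total / (len(col_a) * len(col_b))


def align_profiles(rows_a, rows_b, gap=-2):
    """Globally align two alignments (lists of equal-length gapped strings)."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    la, lb = len(cols_a), len(cols_b)
    score = [[0.0] * (lb + 1) for _ in range(la + 1)]
    move = [[""] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0], move[i][0] = i * gap, "up"
    for j in range(1, lb + 1):
        score[0][j], move[0][j] = j * gap, "left"
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j], move[i][j] = max(
                (score[i - 1][j - 1]
                 + column_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            )
    # Traceback: existing gaps are kept, new gap columns are inserted
    gap_a, gap_b = ("-",) * len(rows_a), ("-",) * len(rows_b)
    merged, i, j = [], la, lb
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def progressive_align(tree, sequences):
    """Follow the guide tree (nested tuples of names) from leaves to root."""
    if isinstance(tree, str):
        return [tree], [sequences[tree]]
    left_names, left_rows = progressive_align(tree[0], sequences)
    right_names, right_rows = progressive_align(tree[1], sequences)
    return left_names + right_names, align_profiles(left_rows, right_rows)


sequences = {
    "Seq1": "GATGGTAGTA",
    "Seq2": "GATTGTTCGGGTA",
    "Seq3": "GATGGTAGGCGTGTA",
    "Seq4": "GATTGTTCGTA",
    "Seq5": "GATTGTAGTA",
}
# Assumed guide tree for the example
tree = ((("Seq5", "Seq1"), ("Seq4", "Seq2")), "Seq3")
names, rows = progressive_align(tree, sequences)
for name, row in zip(names, rows):
    print(name, row)
```

Because gaps introduced at an early node are never revisited, an early misalignment propagates to the final result, which is one reason why the method remains a heuristic.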
8Progressive alignment and NJ tree with clustalX
- Multiple alignment: unaligned sequences (.fasta) -> all distances between sequence pairs, on the basis of pairwise alignments -> distance matrix (not exported) -> tree calculation -> guide tree (.dnd) -> progressive alignment -> multiple alignment (.aln)
- Phylogenetic inference by Neighbour Joining (warning: not the best method): aligned sequences (.aln) -> distances between each sequence pair WITHIN the multiple alignment -> distance matrix (not exported) -> tree calculation -> phylogenetic tree (.ph)
9Global multiple alignment with clustalX: Homoserine-O-dehydrogenase
10Alignment of Zinc cluster proteins
- The alignment of yeast Zn(2)Cys(6) binuclear cluster proteins is a difficult case.
- The conserved region is restricted to the Zinc cluster domain.
- This domain is not contiguous: it contains conserved and variable positions.
- The alignment highlights 5 of the 6 characteristic cysteines.
11Local multiple alignment
12Progressive alignment - summary
- Processing time
- Building the tree: proportional to n x n
- Aligning sequences: linear with the number of sequences
- Heuristic method
- cannot guarantee to return the optimal alignment.
- clustalX is a window-based environment for clustalw, which provides additional functionalities
- Mark low scoring segments
- The alignment can be refined manually
- Realign selected sequences
- Realign selected positions
13Sequence motifs
14Profile matrices (position-specific scoring matrices, PSSM)
- Starting from a multiple alignment, one can build a matrix which reflects the preferred residues at each position
- Each column represents a position
- Each row represents a residue (20 rows for proteins, 4 rows for DNA)
- The cells indicate the frequency of each residue at each position of the multiple alignment.
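As an illustration, a frequency matrix can be built from a multiple alignment as follows. The toy DNA alignment is invented for the example, and dividing by the number of sequences (so that gaps count in the denominator) is an assumption.

```python
from collections import Counter


def frequency_matrix(alignment, alphabet="ACGT"):
    """One row per residue, one column per alignment position.

    Frequencies are relative to the number of aligned sequences
    (assumption: gap characters are counted in the denominator).
    """
    n_sequences = len(alignment)
    matrix = {residue: [] for residue in alphabet}
    for column in zip(*alignment):
        counts = Counter(column)
        for residue in alphabet:
            matrix[residue].append(counts[residue] / n_sequences)
    return matrix


# Toy alignment, for illustration only
alignment = ["GATTG", "GATGG", "GATTG", "GACTG"]
profile = frequency_matrix(alignment)
for residue, row in profile.items():
    print(residue, row)
```

For protein alignments, the same function can be called with the 20 amino acid letters as `alphabet`.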
16Weight matrix
17Scoring a sequence with a profile matrix
18Scoring a sequence with a profile matrix
19Scoring a sequence with a profile matrix
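The transcript of these slides is empty, so the following is only one common way to do it, not necessarily the formulation used in the course. Frequencies are converted to log-odds weights with a pseudo-count, and a sequence segment is scored by summing the weight of its residue at each position. The pseudo-count value, the uniform background frequencies and the toy data are all assumptions.

```python
import math
from collections import Counter


def weight_matrix(alignment, alphabet="ACGT", pseudo=1.0):
    """Log-odds weights per position (assumed formulation).

    weight = ln( ((count + pseudo * prior) / (n + pseudo)) / prior ),
    with a uniform prior over the alphabet.
    """
    n_sequences = len(alignment)
    prior = 1.0 / len(alphabet)
    weights = []
    for column in zip(*alignment):
        counts = Counter(column)
        weights.append({
            residue: math.log(
                ((counts[residue] + pseudo * prior) / (n_sequences + pseudo))
                / prior
            )
            for residue in alphabet
        })
    return weights


def score_segment(weights, segment):
    """Sum of the position-specific weights along one segment."""
    return sum(w[residue] for w, residue in zip(weights, segment))


def scan(weights, sequence):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(weights)
    return [
        (pos, score_segment(weights, sequence[pos:pos + width]))
        for pos in range(len(sequence) - width + 1)
    ]


# Toy data, for illustration only
alignment = ["GATTG", "GATGG", "GATTG", "GACTG"]
weights = weight_matrix(alignment)
hits = scan(weights, "CCGATTGCC")
best_position, best_score = max(hits, key=lambda hit: hit[1])
```

The pseudo-count avoids taking the logarithm of zero for residues that are never observed at a position.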
20PSI-BLAST
- PSI-BLAST stands for Position-Specific Iterated BLAST (Altschul et al., 1997)
- (1) BLAST runs a first time in normal mode.
- (2) The resulting sequences are aligned together (multiple sequence alignment) and a PSSM is calculated.
- (3) This PSSM is used to scan the database for new matches.
- Steps 2-3 can be iterated several times.
- The PSSM increases the sensitivity of the search.
21References
- Substitution matrices
- PAM series
- Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345-352.
- BLOSUM substitution matrices
- Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
- Gonnet matrices, built by an iterative procedure
- Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science 256, 1443-5.
- Sequence alignment algorithms
- Needleman-Wunsch (pairwise, global)
- Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
- Smith-Waterman (pairwise, local)
- Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.
- FastA (database searches, pairwise, local)
- Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.
- BLAST (database searches, pairwise, local)
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
- Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402.
- Clustal (multiple, global)
- Higgins, D. G. & Sharp, P. M. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-44.