Title: BCB 444/544 Introduction to Bioinformatics
1 BCB 444/544 - Introduction to Bioinformatics
Lab (Oct 5) Background: Phylogenetic Methods
2 Background: Phylogenetic Methods
- Multiple Sequence Alignment
- Building Phylogenetic Trees
3 Multiple Sequence Alignment (MSA): Motivation
- Correspondence: find out which parts "do the same thing"
  - Similar genes are conserved across widely divergent species, often performing similar functions
- Protein structure prediction
  - Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  - Structure is more conserved than sequence
- Create profiles for protein families
  - Allow us to search for other members of the family
- Genome assembly: automated reconstruction of contig maps of genomic fragments such as ESTs
- MSA is the starting point for phylogenetic analysis
4 Multiple Sequence Alignment
- VTISCTGSSSNIGAG-NHVKWYQQLPG
- VTISCTGTSSNIGS--ITVNWYQQLPG
- LRLSCSSSGFIFSS--YAMYWVRQAPG
- LSLTCTVSGTSFDD--YYSTWVRQPPG
- PEVTCVVVDVSHEDPQVKFNWYVDG--
- ATLVCLISDFYPGA--VTVAWKADS--
- ATLVCLISDFYPGA--VTVAWKADS--
- AALGCLVKDYFPEP--VTVSWNSG---
- VSLTCLVKGFYPSD--IAVEWESNG--
- Goal: bring the greatest number of similar characters into the same column of the alignment
- Similar to the alignment of two sequences
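One way to make the goal concrete is to count, column by column, how many pairs of rows share the same residue. The sketch below does only that; the function name and the toy sequences are illustrative, and real MSA programs typically use substitution matrices and gap penalties rather than plain identity.

```python
from collections import Counter


def column_identity_score(alignment):
    """Count pairs of identical residues that share a column (gaps ignored)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    score = 0
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        # n identical residues in one column contribute n*(n-1)/2 matching pairs
        score += sum(n * (n - 1) // 2 for n in counts.values())
    return score


# Toy data (hypothetical): the same three sequences with the gap placed two ways.
well_placed_gap = ["ACGT", "ACGT", "A-GT"]
badly_placed_gap = ["ACGT", "ACGT", "AGT-"]

print(column_identity_score(well_placed_gap))
print(column_identity_score(badly_placed_gap))
```

Placing the gap so that G and T line up with the other rows gives the higher score, which is the sense in which an alignment "brings similar characters into the same column".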
5 Multiple Sequence Alignment: Approaches
- Optimal global alignments: dynamic programming
  - Generalization of Needleman-Wunsch
  - Find the alignment that maximizes a score function
  - Computationally expensive: time grows as the product of the sequence lengths
- Global progressive alignments: match closely related sequences first
- Global iterative alignments: multiple re-building attempts to find the best alignment
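For reference, here is a minimal sketch of the two-sequence Needleman-Wunsch recurrence that the optimal approach generalizes. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # a[i-1] aligned to a gap
            left = score[i][j - 1] + gap    # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has one axis per sequence, so two sequences fill a 2-D table; with k sequences the same idea needs a k-dimensional table, which is where the cost comes from.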
6 Dynamic Programming MSA: General Case
- For $k$ sequences of length $n$, the dynamic programming algorithm does $(2^k - 1)\,n^k$ operations
  - Example: 6 sequences of length 100 require $(2^6 - 1) \times 100^6 \approx 6.3 \times 10^{13}$ calculations
- Space for the table is $n^k$
- Implementations (e.g., WashU MSA 2.1) use tricks and search only a subset of the dynamic programming table
  - Even this is expensive. E.g., the Baylor College of Medicine (BCM) Search Launcher limits MSA to 8 sequences of 800 characters and 10 minutes of processing time
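The growth is easy to check numerically. This sketch assumes the operation count on the slide reads (2^k - 1) * n^k with a table of n^k cells; the function name is illustrative.

```python
def msa_dp_cost(k, n):
    """Return (operations, table_cells) for exact DP over k sequences of length n."""
    operations = (2 ** k - 1) * n ** k  # each cell looks at 2^k - 1 predecessor cells
    table_cells = n ** k                # one cell per combination of positions
    return operations, table_cells


for k in (2, 3, 6):
    operations, cells = msa_dp_cost(k, 100)
    print(f"k={k}: {operations:.2e} operations, {cells:.2e} table cells")
```

Going from 2 to 6 sequences of length 100 raises the table from ten thousand cells to a trillion, which is why practical tools restrict the search or fall back on heuristics.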
7 What is a phylogeny?
www.rci.rutgers.edu/dvhowe/invertzoo/lecture1_2006slides.pdf
8 Phylogenetic (evolutionary) trees
Describe evolutionary relationships between species
or
Cannot be known with certainty!
Nevertheless, phylogenies can be useful
9 Applications of Phylogenetic Analysis
- Inferring function
  - Closely related sequences occupy neighboring branches of the tree
- Tracking changes in rapidly evolving populations (e.g., viruses)
  - Which genes are under selection?
10 Methods
- Distance-based
- Parsimony
- Maximum likelihood
11 Distance Matrices
[Figure: labels a, b, c, d]
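A distance matrix for four taxa can be built directly from aligned sequences. The sketch below uses the p-distance (fraction of differing aligned sites) as one simple choice of distance; the sequences for a, b, c, d are made up for illustration, and corrected distances based on a substitution model are a common alternative.

```python
def p_distance(s, t):
    """Fraction of aligned, non-gap positions at which s and t differ."""
    pairs = [(x, y) for x, y in zip(s, t) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Map every ordered pair of names to the p-distance of their sequences."""
    names = sorted(seqs)
    return {(p, q): p_distance(seqs[p], seqs[q]) for p in names for q in names}


# Hypothetical aligned sequences for the four taxa.
toy_seqs = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "ACGAACTA",
    "d": "TCGAACTA",
}
dist = distance_matrix(toy_seqs)
for p in sorted(toy_seqs):
    print(p, [dist[(p, q)] for q in sorted(toy_seqs)])
```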
12 Least Squares
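The slide gives only the title, so the following is a sketch under a common reading: choose the tree and branch lengths whose leaf-to-leaf path lengths are closest, in unweighted squared error, to the observed distances. The four-leaf tree shape, the branch-length names, and the numbers are all assumptions for illustration.

```python
def quartet_distances(la, lb, lc, ld, li):
    """Path lengths between leaves of the unrooted tree ((a,b),(c,d)).

    la..ld are the leaf branch lengths, li is the internal branch length.
    """
    return {
        ("a", "b"): la + lb,
        ("c", "d"): lc + ld,
        ("a", "c"): la + li + lc,
        ("a", "d"): la + li + ld,
        ("b", "c"): lb + li + lc,
        ("b", "d"): lb + li + ld,
    }


def least_squares_error(observed, tree_dist):
    """Sum of squared differences between observed and tree-implied distances."""
    return sum((observed[pair] - tree_dist[pair]) ** 2 for pair in tree_dist)


# Hypothetical observed distances, generated here from known branch lengths.
observed = quartet_distances(1, 2, 3, 4, 5)
print(least_squares_error(observed, quartet_distances(1, 2, 3, 4, 5)))
print(least_squares_error(observed, quartet_distances(1, 2, 3, 4, 4)))
```

A perfect fit gives an error of zero; shortening the internal branch by one unit changes the four distances that cross it, so the error rises.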
14 Parsimony
Goal: find the tree with the least number of evolutionary changes
[Figure: example tree; node labels a, b, c, d, e, f]
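For a fixed tree and a single site, the minimum number of changes can be counted with Fitch's algorithm, which is one standard way to score a tree under parsimony (the slide does not name a specific algorithm). Trees are written as nested pairs of leaf names; the leaf states below are hypothetical.

```python
def fitch_changes(tree, leaf_states):
    """Minimum number of state changes on a rooted binary tree for one site.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    def visit(node):
        if isinstance(node, str):
            return {leaf_states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can agree on a state, so no extra change is needed.
            return common, left_cost + right_cost
        # The children disagree: keep both options and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical site: a and b carry A, c and d carry G.
states = {"a": "A", "b": "A", "c": "G", "d": "G"}
print(fitch_changes((("a", "b"), ("c", "d")), states))
print(fitch_changes((("a", "c"), ("b", "d")), states))
```

The tree that groups a with b needs fewer changes for this site than the tree that groups a with c, so parsimony prefers it.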
16 Markov models on trees
- Observed: the species labeling the leaves
- Hidden: the ancestral states
- Transition probabilities: the mutation probabilities
- Assumptions
  - Only mutations are allowed
  - Sites are independent
  - Evolution at each site occurs according to a Markov process
17 Models of evolution at a site
- Transition probability matrix $M = (m_{ij})$, $i, j \in \{A, C, T, G\}$, where $m_{ij} = \mathrm{Prob}(i \to j \text{ mutation in 1 time unit})$
- Different branches of the tree may have different lengths
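Under the Markov assumption, a branch of length t time units has transition matrix M multiplied by itself t times. The sketch below assumes integer branch lengths and a hypothetical model in which every base changes to each other base with the same probability per time unit; neither choice comes from the slides.

```python
BASES = "ACTG"


def uniform_matrix(p):
    """Hypothetical M: each base mutates to each other base with probability p."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def branch_matrix(m, t):
    """Transition matrix for a branch of integer length t, i.e. M to the power t."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


M = uniform_matrix(0.01)
for t in (1, 2, 10):
    stay = branch_matrix(M, t)[0][0]
    print(f"branch length {t}: Prob(A stays A) = {stay:.4f}")
```

Longer branches leave more room for change, so the probability of ending in the starting state drops as t grows.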
18 The probability of an assignment
[Figure: tree with root T, internal nodes G and T, and leaves A, G, C, T]
Probability $= m_{TG}\, m_{GA}\, m_{GG}\, m_{TT}\, m_{TC}\, m_{TT}$
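The product is one factor per edge of the tree, read as (parent state, child state). A minimal sketch, assuming unit-length branches and a hypothetical matrix in which each base changes to each other base with probability 0.1:

```python
BASES = "ACTG"
P_CHANGE = 0.1  # hypothetical per-unit mutation probability, not from the slides

# m[i][j] = Prob(i -> j mutation in 1 time unit) under the assumed uniform model
m = {i: {j: (1 - 3 * P_CHANGE if i == j else P_CHANGE) for j in BASES}
     for i in BASES}


def assignment_probability(matrix, edges):
    """Multiply the transition probabilities over all (parent, child) edges."""
    probability = 1.0
    for parent, child in edges:
        probability *= matrix[parent][child]
    return probability


# Root T with children G and T; G has leaves A, G; T has leaves C, T.
EDGES = [("T", "G"), ("G", "A"), ("G", "G"), ("T", "T"), ("T", "C"), ("T", "T")]
prob = assignment_probability(m, EDGES)
print(prob)
```

With these assumed values, the three unchanged edges contribute 0.7 each and the three mutated edges 0.1 each.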
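19 Ancestral reconstruction: most likely assignment
[Figure: tree with root X, internal nodes Y and Z, and leaves A, G, C, T]
$L = \max_{X,Y,Z}\; m_{XY}\, m_{YA}\, m_{YG}\, m_{XZ}\, m_{ZC}\, m_{ZT}$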
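With only three hidden nodes, the maximum can be found by trying all 4^3 = 64 assignments. The sketch below does this by brute force under a hypothetical uniform mutation matrix (probability 0.1 per change); it is meant to illustrate the formula, not an efficient method for larger trees.

```python
from itertools import product

BASES = "ACTG"
P_CHANGE = 0.1  # hypothetical per-unit mutation probability, not from the slides

m = {i: {j: (1 - 3 * P_CHANGE if i == j else P_CHANGE) for j in BASES}
     for i in BASES}


def most_likely_ancestors(matrix, bases=BASES):
    """Maximize m_XY m_YA m_YG m_XZ m_ZC m_ZT over all states of X, Y, Z."""
    best_prob, best_states = -1.0, None
    for x, y, z in product(bases, repeat=3):
        prob = (matrix[x][y] * matrix[y]["A"] * matrix[y]["G"]
                * matrix[x][z] * matrix[z]["C"] * matrix[z]["T"])
        if prob > best_prob:
            best_prob, best_states = prob, (x, y, z)
    return best_prob, best_states


likelihood, (x, y, z) = most_likely_ancestors(m)
print(likelihood, x, y, z)
```

Under this symmetric model several assignments tie for the maximum, so the specific states returned depend on iteration order; each best assignment has Y matching one of its leaves, Z matching one of its leaves, and X matching Y or Z.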