Title: Multiple Sequence Alignments and Phylogenetic Trees
1 Multiple Sequence Alignments and Phylogenetic Trees
2 Two Questions
- When we used BLAST to compare sequences, two questions may have popped into our minds:
- How are these sequences actually related? Are they homologous?
- How are the organisms that these sequences came from related? Do they have a close common ancestor?
- We will now learn how evolutionary biologists can use sequence comparisons as evidence of evolutionary relatedness between organisms.
3 Taxonomy
- Naming and grouping organisms based on similarities and differences
- Carolus Linnaeus (1707-1778) is the father of taxonomy
- His taxonomy was based on traits of organisms
5 Phylogeny
- Inferring evolutionary relationships based on similarities
- Molecular versus morphological traits
- Phenotype to infer genotype
- Willi Hennig, entomologist, 1950
- Phylogenetic tree
- Relying on traits has limitations
- Convergent evolution
- Eyes as a trait
- Phenotypes and bacteria
- Distantly related organisms
6 Molecular Phylogeny
- G. H. F. Nuttall
- 1902: molecular similarities between organisms
- Immune response
- Protein electrophoresis (1950s)
- Comparison of related proteins (size, charge)
- Genome cross-hybridization
- Protein Sequencing (1960s)
- Genomic information (1970s)
- Restriction maps
- Whole gene sequences
- Computational molecular phylogeny
- Application of computational algorithms, methods and programs to molecular phylogenetic analyses
- Sequence data
- Amino acid or nucleic acid
- Carl Woese and 16S rRNA
7 Assumptions Made
- The sequences are correct
- The sequences are homologous
- Each position is homologous
- The sampling of taxa or genes is sufficient to resolve the problem of interest
- Sequence variation is representative of the broader group of interest
- Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest
- Each position in the sequence evolved independently
8 Phylogenetic Terms
- Nodes represent distinct taxonomic units
- Terminal node (data collected)
- Internal node (no data; inferred common ancestor)
- Branches (lineages): branching order
- Represent the evolutionary relationship between organisms or nodes
- Scaled tree: each branch of the tree is proportional to the evolutionary distance (substitution rate) between the organisms in the tree
- A phylogram is usually used to represent this type of tree
- Unscaled tree: relative kinship only
- A cladogram is usually used to represent this type of tree
9 Note the nodes and branches.
10 Phylogenetics
- Computer programs use Newick format for the tree.
- (((I,II),(III,IV)),V)
- Bifurcating nodes (especially shallow branches)
- More common
- Two lineages descend from the node (A and B)
- Multifurcating nodes
- More than two lineages from the same ancestor?
- Or two or more bifurcations whose order is unknown
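To make the Newick notation concrete, here is a minimal sketch of a reader for strings like the one on this slide. It assumes names only (no branch lengths, no quoted labels), and the function names `parse_newick` and `leaves` are made up for illustration; real phylogenetics packages handle the full format.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested tuples (assumed subset of the format)."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]  # a terminal node name

    return node()


def leaves(tree):
    """Return terminal node names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("(((I,II),(III,IV)),V)")
print(tree)          # nested tuples mirror the nested parentheses
print(leaves(tree))  # the five terminal nodes
```

Each pair of parentheses becomes one internal node, so a node with two children is bifurcating and a node with three or more, such as `(A,B,C)`, is multifurcating.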
11 Rooted Trees Versus Unrooted Trees
- Rooted trees: all organisms in the tree have a common ancestor
- A unique path leads through the common ancestor to all others
- Unrooted trees: only relationships between nodes are shown
- No direction is given by which evolution begins and ends
- To root a tree when building it, you must assign an outgroup (you must know something about the organisms you are comparing; look to the fossil record or other evidence)
- Not always easy to do
- A good outgroup for a bacterial tree is an archaeon.
12 Rooted versus Unrooted
[Figure: a rooted tree and an unrooted tree of taxa I, II and III]
13 Numbers of Trees!!!
- The number of possible evolutionary paths that can be drawn from a dataset is staggering, and it depends on how many organisms you are comparing (n = number of organisms in the tree)
- Rooted trees: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
- Unrooted trees: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
- Example: if $n = 5$, then $N_R = 105$ and $N_U = 15$
- Most trees are inferred trees!
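The two counting formulas can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes strictly bifurcating trees with n of at least 3.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees: (2n-3)! / (2^(n-2) * (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees: (2n-5)! / (2^(n-3) * (n-3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
# n = 5 gives 105 rooted and 15 unrooted trees, as in the slide's example
```

By n = 10 the counts already reach the millions, which is why exhaustive evaluation of every tree quickly becomes impractical.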
14 Phylogenetics
- Gene trees versus species trees
- Reminder: horizontal gene transfer events can cause massive divergence between two genes found in the same species (small divergence between the species to make new strains, but major divergence between two genes)
- Be careful
- A tree constructed from a single gene doesn't always indicate species evolution
- Ribosomal RNA trees are among the only single-gene trees that are fairly well accepted as species trees
- Some controversy here too!
15 Which Sequence to Choose
- Different sequences change at different rates; choose a level of variation that is appropriate to the group of organisms being studied.
- Diverse group versus tight group (all mammals versus primates only)
- Proteins (or cDNAs) are constrained by natural selection, while nucleic acid is not always
- Some sequences are highly variable (rRNA spacer regions, HLA genes), while others are highly conserved (actin, histones)
- Different regions within a single gene can evolve at different rates
- Different functional constraints
16 Phylogenetics
- Character sets
- Anatomical feature, color, timed response to a stimulus, nucleic acid, amino acid
- Character states
- DNA: 4 possible states; protein: 20 possible states
- Distance sets
- A measure of overall pairwise differences between two character sets
- Comparison of sequence data: matches, mismatches, gaps, matrix data
- Simple distance calculation
- Ratio of identities
- D = m/t (m = matching positions, t = total positions compared)
- Complex distance calculation
- Jukes and Cantor
- Kimura (transition/transversion)
- 12-parameter model
17 Two Approaches to Molecular Phylogeny
- Distance matrix based methods
- Multiple sequence alignment
- Calculate the distance between all possible pairs of sequences using the JC, Kimura or 12-parameter model
- Cluster your organisms based on distance
- UPGMA or Neighbor Joining algorithm
- Optimality methods
- Multiple sequence alignment
- Purely statistical approach to determining relationships between organisms
- Probability of every nucleotide or amino acid substitution
- No distance calculated
- Maximum Parsimony or Maximum Likelihood
18 Distance Based Method
- MSA
- Calculate Distance from characters
- Clustering algorithm to build topology
- Branch lengths
19 Multiple Sequence Alignment
hsb051bc    CGTAACACGT ATGCAACCTA CCCAAAACAG
hsb098bc    CGTAACACGT ATGCAACCTA CCTTGTACAG
hsb090bc    CGTAACACGT ATGCAATCTG CCCGGAACTG
hsb083bc    CGTAACACGT ATGCAATCTA CCCGAAACAG
hsb074bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb104bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb073bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb065bc    AGTAATACAT CG-GAACATG TCCTGGAGTG
(Consensus) mGTAAyrCrT mknsAAysTr yCydvdwswG
- DNA is aligned optimally by bringing the greatest number of similar residues into register in the same column of the alignment.
20 Challenges
- Sequences are long!
- Placement of gaps, mismatches and matches
- Cannot use the same optimal methods that you learned for pairwise alignments (i.e. BLAST alignments)
- More rules must be applied to the algorithm
- The less similarity between sequences, the more difficult the task.
- Obtaining a cumulative score for substitutions in each column
- Placement and scoring of gaps
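One common way to get a cumulative score for a column is the sum-of-pairs score, which adds up a pairwise score over every pair of residues in that column. The sketch below assumes made-up scoring values (match +1, mismatch -1, gap -2); the slides do not specify a scheme, and real aligners use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score for one alignment column (scoring values are assumptions)."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # two gaps facing each other are not scored
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def alignment_score(rows):
    """Total score: the sum of the column scores across the alignment."""
    return sum(column_score(column) for column in zip(*rows))


print(column_score("AAAA"))  # 6 matching pairs
print(column_score("AAC-"))  # one match, two mismatches, three gap pairs
print(alignment_score(["ACG-", "ACGT", "AC-T"]))
```

The number of pairs per column grows with the square of the number of sequences, which is one reason multiple alignment is harder than pairwise alignment.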
21 Try it Yourself!
eek one cat ate the dog s two cat ate the dog
one rat ate the dog two rat ate the dog poo poo
two cat ate the dog eek eek one cat ate the dog
22 Methods For Alignment
- Progressive
- Align the most similar sequences first, then build by adding less similar sequences
- EX) Clustal W algorithm
- Iterative
- Align groups of similar sequences initially, then revise the alignment when groups are placed together
- EX) DIALIGN
- Local
- Align only the locally conserved patterns in all sequences and let the rest of the sequences fall where they may
- Best when internal gaps are not expected
- EX) Asset (by NCBI)
- Statistical and probabilistic models of sequences
- All possible optimal pairwise alignments (species A, B, C, D)
- Use a statistical approach to construct an initial tree (UPGMA)
- Reconstruct alignments progressively in order of relatedness according to the tree
- New UPGMA tree
23 Distance
- Now that we have the alignment, what do we do next?
- Determine distance or substitution rate
- A single measurement of the amount of evolutionary change between two sequences
- Pairwise distances must be calculated between every pair of organisms included in a tree
- Distance algorithms (nucleic acids)
- A naïve algorithm would be
- ED = differences / total
- Kimura
- Jukes-Cantor
- 12-parameter
- Indels are usually also taken into account in these algorithms (gap penalties)
- Protein sequence distance (amino acids)
- BLOSUM and PAM matrices are used to calculate distance
- Gap penalties are also calculated
- Since distance determinations are pairwise, all distances are entered into a distance matrix
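The nucleic acid distances listed above can be sketched as follows. The naïve distance is differences / total; the Jukes-Cantor and Kimura corrections use their standard textbook forms, which are not spelled out on the slide. Skipping gapped columns is a simplifying assumption made here, and the function names are illustrative.

```python
import math

PURINES = {"A", "G"}


def proportions(seq_a, seq_b):
    """Return (transition, transversion) proportions, skipping gapped columns."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: ignore columns that contain a gap
        compared += 1
        if a != b:
            if (a in PURINES) == (b in PURINES):
                transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
            else:
                transversions += 1  # purine<->pyrimidine
    return transitions / compared, transversions / compared


def naive_distance(seq_a, seq_b):
    """differences / total."""
    p, q = proportions(seq_a, seq_b)
    return p + q


def jukes_cantor(seq_a, seq_b):
    """d = -3/4 * ln(1 - 4p/3), where p is the naive distance."""
    return -0.75 * math.log(1 - 4 * naive_distance(seq_a, seq_b) / 3)


def kimura(seq_a, seq_b):
    """d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), P = transitions, Q = transversions."""
    p, q = proportions(seq_a, seq_b)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


a = "CGTAACACGTATGCAACCTACCCAAAACAG"  # hsb051bc from the alignment slide
b = "CGTAACACGTATGCAACCTACCTTGTACAG"  # hsb098bc from the alignment slide
print(naive_distance(a, b), jukes_cantor(a, b), kimura(a, b))
```

Both corrected distances come out larger than the naïve one because they estimate substitutions hidden by multiple changes at the same site.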
24 Distance Matrix
001         CGTAACACGT ATGCAACCTA CCCAAAACAG
002         CGTAACACGT ATGCAACCTA CCTTGTACAG
003         CGTAACACGT ATGCAATCTG CCCGGAACTG
004         CGTAACACGT ATGCAATCTA CCCGAAACAG
005         AGTAATGCAT CG-GAACGTG TCCTCTTGTG
(Consensus) mGTAAyrCrT mknsAAysTr yCydvdwswG
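As a rough illustration, the five sequences on this slide can be turned into a pairwise distance matrix using the naïve differences / total measure. Ignoring columns that contain a gap is an assumption made for simplicity, and the helper names are hypothetical.

```python
from itertools import combinations

# Sequences 001-005 from the slide, with the spaces between blocks removed.
ALIGNMENT = {
    "001": "CGTAACACGTATGCAACCTACCCAAAACAG",
    "002": "CGTAACACGTATGCAACCTACCTTGTACAG",
    "003": "CGTAACACGTATGCAATCTGCCCGGAACTG",
    "004": "CGTAACACGTATGCAATCTACCCGAAACAG",
    "005": "AGTAATGCATCG-GAACGTGTCCTCTTGTG",
}


def naive_distance(seq_a, seq_b):
    """differences / total over columns where neither sequence has a gap (assumed rule)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(seqs):
    """Return a symmetric nested dict of pairwise distances with zeros on the diagonal."""
    names = list(seqs)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = naive_distance(seqs[a], seqs[b])
    return matrix


matrix = distance_matrix(ALIGNMENT)
for name in ALIGNMENT:
    print(name, " ".join(f"{matrix[name][other]:.3f}" for other in ALIGNMENT))
```

A corrected distance such as Jukes-Cantor or Kimura could be substituted for `naive_distance` without changing the rest of the sketch.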
25 Clustering Algorithms
- Now what do you do with the distance matrix?
- Apply a clustering algorithm to build the trees.
- UPGMA (unweighted pair group method with arithmetic mean)
- Cluster the 2 species with the smallest distance first
- The distances between this new cluster and the other organisms are then calculated (new matrix)
- The pair with the smallest distance in this new matrix becomes a new cluster.
26 UPGMA
[Figure: UPGMA clustering; D and E have the closest distance, followed by C and A]
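The UPGMA steps can be sketched in a few lines. The distances below are invented so that D and E join first and A and C join second; they are not taken from the slide's figure. The function returns only the topology as nested tuples, without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA. dist maps frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}  # cluster -> number of taxa it contains
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # New distances are size-weighted averages, i.e. the arithmetic
        # mean over all original taxa in the merged cluster.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distances for five taxa.
raw = {
    ("D", "E"): 2, ("A", "C"): 4, ("A", "B"): 8, ("B", "C"): 8,
    ("A", "D"): 10, ("A", "E"): 10, ("B", "D"): 10, ("B", "E"): 10,
    ("C", "D"): 10, ("C", "E"): 10,
}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree = upgma(["A", "B", "C", "D", "E"], dist)
print(tree)
```

With these values the first merge is D with E, the second is A with C, then B joins the A and C cluster, and the final merge links the two remaining clusters.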
27 Branch Lengths
- Remember, the tree is still unscaled if the branch lengths don't reflect evolutionary distance.
- How do we turn this into a scaled tree?
- If we can assume that evolution in all species occurred at a constant rate, then use the matrix!
- If not, then it is more complicated
- Neighbor Joining algorithm instead of UPGMA
- The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
- Disadvantage of UPGMA: assumes a constant evolutionary rate across all lineages
- Disadvantage of Neighbor Joining: produces an unrooted tree only
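The adjustment that Neighbor Joining applies to the distance matrix is usually written as a Q criterion, Q(i, j) = (n - 2) d(i, j) - sum of d(i, k) - sum of d(j, k). The sketch below computes only that first step for a set of hypothetical distances; the joining and matrix update steps of the full algorithm are omitted, and the names are illustrative.

```python
from itertools import combinations


def q_matrix(names, d):
    """Neighbor Joining criterion for every pair; d maps frozenset({i, j}) to a distance."""
    n = len(names)
    totals = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}
    return {
        (i, j): (n - 2) * d[frozenset((i, j))] - totals[i] - totals[j]
        for i, j in combinations(names, 2)
    }


# Hypothetical distances for five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(pair): value for pair, value in raw.items()}

q = q_matrix(names, d)
first_pair = min(q, key=q.get)
print(first_pair, q[first_pair])  # the pair joined first
print(q[("d", "e")])              # the pair with the smallest raw distance
```

Here d and e have the smallest raw distance, yet a and b have the lowest Q value and would be joined first, which shows how the rate adjustment can change the clustering order relative to UPGMA.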
28 Optimality (Character) Based Methods
- MSA
- Build Tree
- Branch lengths
29 Parsimony
- Parsimony
- Allows the use of all known evolutionary information in building a tree
- In contrast, distance methods compress all of the differences between pairs of sequences into a single number
- Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
- The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
- The most parsimonious tree is the one with the minimum tree length needed to explain the observed distributions of all characters
- Assumes one rate of evolution across all lineages in a given tree
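One standard way to count the minimum number of changes a given tree needs is Fitch's algorithm, sketched below for rooted bifurcating trees written as nested tuples. The slides do not name a scoring algorithm, so this is a swapped-in illustration, and the four short sequences W, X, Y and Z are invented.

```python
def fitch(tree, site):
    """Return (possible ancestral states, changes) for one column on a rooted binary tree."""
    if isinstance(tree, str):
        return {site[tree]}, 0  # terminal node: the observed base, no changes
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one substitution is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences.
seqs = {"W": "AAG", "X": "AAA", "Y": "GGA", "Z": "AGA"}
tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))
print(parsimony_score(tree_1, seqs), parsimony_score(tree_2, seqs))
```

The tree with the lower score is the more parsimonious of the two. A full parsimony search would repeat this scoring over every candidate tree.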
30 Parsimony Example
- Consider four sequences: ATCG, TTCG, ATCC, and TCCG
- Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.
- Then each branch splits, for a total of 3 internal nodes on the tree (Tree 1)
31
- Compare Tree 1 with one that first divides ATCC onto its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree 2).
- Trees 1 and 2 both have three nodes, but when all of the distances back to the root (number of nodes crossed) are summed, the total is equal to 8 for Tree 1 and 9 for Tree 2.
[Figure: Tree 1 and Tree 2]
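The tally in this example can be reproduced by writing both trees as nested tuples and summing, for each sequence, the number of nodes crossed on the way back to the root. This sketch measures only that path-length total as described on the slide; it does not count base substitutions, and the function name is illustrative.

```python
def nodes_crossed(tree, depth=0):
    """Sum over all terminal nodes of the number of internal nodes back to the root."""
    if isinstance(tree, str):
        return depth
    return sum(nodes_crossed(child, depth + 1) for child in tree)


# Tree 1: the first split separates (ATCG, ATCC) from (TTCG, TCCG).
tree_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
# Tree 2: ATCC splits off first, then ATCG, then TTCG from TCCG.
tree_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))

print(nodes_crossed(tree_1))  # 2 + 2 + 2 + 2
print(nodes_crossed(tree_2))  # 1 + 2 + 3 + 3
```

The balanced Tree 1 gives the smaller total, matching the 8 versus 9 comparison on the slide.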
32 Maximum Likelihood
- Directly comparable to maximum parsimony
- Statistical method that examines the probability of all possible substitutions in the alignment; the most likely tree is the one that involves the fewest changes and the most probable substitutions.
- Different from parsimony
- Different rates of evolution in different lineages
- Evolution at different sites and along different lineages is statistically independent.
- Great for distantly related sequences
- Computationally taxing
33 Scaled Tree
- We can also get a scaled character-based tree if we reflect the amount of evolutionary change in the branch lengths by applying the relative amount of change occurring between each organism.
35 Is My Tree Correct?
- Not a good question.
- There are a number of different algorithms and each can give a slightly different tree
- No single algorithm is more powerful or more accepted by the scientific community than another!
- The trick is to use more than one algorithm, build more than one tree, and support your hypothesis more than once. Sound familiar?
- Can any tree be scientifically proven to be correct?
- Are hypotheses proven or supported? Think about it.
36 Bootstrapping
- Allows for estimation of confidence levels
- Build trees 100X and determine how often groups cluster together
[Figure: tree of A, B, C and D with a bootstrap value of 98 on one node]
Bootstrap value 98: this means that the relationship shown between A, B, C and D occurred 98 times when the tree was constructed 100 different times from resampled versions of the same alignment.
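A common formulation of bootstrapping resamples the alignment columns with replacement, rebuilds the tree from each resampled alignment, and reports how often each group reappears. The sketch below assumes that formulation. To stay short it replaces real tree building with a toy `closest_pair` step that reports only the single most similar pair of sequences; the sequences and all function names are hypothetical.

```python
import random
from itertools import combinations


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    length = len(next(iter(seqs.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in seqs.items()}


def closest_pair(seqs):
    """Toy stand-in for tree building: report the least-different pair as one group."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return [frozenset(min(combinations(seqs, 2), key=differences))]


def bootstrap_support(seqs, build_groups, replicates=100, seed=0):
    """Percentage of replicates in which each group appears."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        for group in build_groups(bootstrap_alignment(seqs, rng)):
            counts[group] = counts.get(group, 0) + 1
    return {group: 100 * count / replicates for group, count in counts.items()}


# Hypothetical aligned sequences.
SEQS = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACTTGCGA", "D": "TCTTGCCA"}
support = bootstrap_support(SEQS, closest_pair)
for group, percent in support.items():
    print(sorted(group), percent)
```

In practice `build_groups` would run a full tree-building method and return every cluster in the resulting tree, so that each internal node receives its own support value.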
37 MSA
- What can it be used for?
- Study evolution
- Determine common ancestry between organisms (to build phylogenetic trees)
- Determine how a protein evolved
- Find conserved regions in the same gene of a number of organisms
- For primer searching
- Predict functional and structural locations within cDNA
- Assembling sequence fragments into a larger sequence
38 Assembly of Sequence Fragments
Sequence fragment 1: 5'-CGTAACACGTATGCAACCTA-3'
Sequence fragment 2: 5'-ATGCAACCTACCCAAAACAG-3'
align
CGTAACACGTATGCAACCTA
          ATGCAACCTACCCAAAACAG
assemble
5'-CGTAACACGTATGCAACCTACCCAAAACAG-3'
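The align and assemble steps for these two fragments can be sketched as a search for the longest suffix of the first fragment that equals a prefix of the second. This assumes an exact, error-free overlap on the same strand. The function name and the `min_overlap` parameter are illustrative, and real assemblers tolerate mismatches and handle many fragments at once.

```python
def merge_fragments(left, right, min_overlap=3):
    """Join two fragments on the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None  # no overlap of at least min_overlap bases


fragment_1 = "CGTAACACGTATGCAACCTA"
fragment_2 = "ATGCAACCTACCCAAAACAG"
assembled = merge_fragments(fragment_1, fragment_2)
print(assembled)
```

The two fragments share the 10 bases ATGCAACCTA, so the merged sequence is 30 bases long.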