Multiple Sequence Alignments and Phylogenetic Trees - PowerPoint PPT Presentation


About This Presentation

Title:

Multiple Sequence Alignments and Phylogenetic Trees



Slides: 39

Provided by: MICHELLE6


Transcript and Presenter's Notes


1
Multiple Sequence Alignments and Phylogenetic
Trees
2
Two Questions

When we used BLAST to compare sequences, two
questions may have popped into our minds:
How are these sequences actually related? Are
they homologous?
How are the organisms that these sequences came
from related? Do they have a close common
ancestor?
We will now learn how evolutionary biologists can
use sequence comparisons to help them show
evidence of evolutionary relatedness between
organisms.

3
Taxonomy

Naming and grouping organisms based on
similarities and differences
Carolus Linnaeus (1707-1778) is the father of
taxonomy
His taxonomy was based on traits of organisms

4
(No Transcript)
5
Phylogeny

Inferring evolutionary relationships based on
similarities
molecular versus morphological traits
Phenotype to infer genotype
Willi Hennig, entomologist, 1950
Phylogenetic tree
Relying on traits has limitations
Convergent evolution
Eyes as a trait
Phenotypes and bacteria
Distantly related organisms

6
Molecular Phylogeny

GHF Nuttall
1902 molecular similarities between organisms
Immune response
Protein electrophoresis (1950s)
Comparison of related proteins (size, charge)
Genome cross hybridization
Protein Sequencing (1960s)
Genomic information (1970s)
Restriction maps
Whole gene sequences
Computational molecular phylogeny
Application of computational algorithms, methods
and programs to molecular phylogenetic analyses
Sequence data
amino acid or nucleic acid
Carl Woese and 16S rRNA

7
Assumptions Made

The sequences are correct
The sequences are homologous
Each position is homologous
The sampling of taxa or genes is sufficient to
resolve the problem of interest
Sequence variation is representative of the
broader group of interest
Sequence variation contains sufficient
phylogenetic signal (as opposed to noise) to
resolve the problem of interest
Each position in the sequence evolved
independently

8
Phylogenetic Terms

Nodes represent a distinct taxonomical unit
Terminal node (data collected)
Internal node (no data; inferred common ancestor)
Branches (lineages): branching order
represent the evolutionary relationship between
organisms or nodes
Scaled tree: when each branch of the tree is
proportional to the evolutionary distance
(substitution rates) between organisms in the
tree, the tree is scaled.
A phylogram is usually used to represent these
types of trees
Unscaled tree: shows relative kinship only
A cladogram is usually used to represent these
types of trees

9
Note the nodes and branches.
10
Phylogenetics

Computer programs use Newick format for the tree.
(((I,II),(III,IV)),V);
Bifurcating nodes (especially shallow branches)
More common
Two lineages nodes A and B
Multifurcating nodes
More lineages from same ancestor?
Two or more bifurcations, but order is unknown
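
To make the Newick notation above concrete, here is a minimal sketch of reading such a string into nested tuples. It assumes the simplest form of the format (names, commas and parentheses only, no branch lengths); the function names `parse_newick`, `leaves` and `is_bifurcating` are illustrative and do not come from any particular program.

```python
def parse_newick(text):
    """Parse a simple Newick string (names and parentheses only) into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return tuple(children)
        # Otherwise read a terminal node name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(tree):
    """Return the terminal node names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def is_bifurcating(tree):
    """True when every internal node has exactly two descendants."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)


tree = parse_newick("(((I,II),(III,IV)),V)")
print(tree)
print(leaves(tree))
print(is_bifurcating(tree))
```

Each pair of parentheses becomes one internal node, so a multifurcating node shows up as a tuple with more than two members.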

11
Rooted Trees Versus Unrooted Trees

Rooted trees: all organisms in the tree have a
common ancestor
A unique path leads through the common ancestor
to all others
Unrooted trees: only relationships between nodes
are shown
No direction is given by which evolution begins
and ends
To root a tree when building it you must assign
an outgroup (you must know something about the
organisms you are comparing; look to the fossil
record or other evidence)
Not always easy to do
A good outgroup for a bacterial tree is an
archaeal cell.

12
Rooted versus Unrooted
(Figure: rooted and unrooted trees drawn for taxa I, II and III)
13
Numbers of Trees!!!

The number of possible evolutionary paths that
can be inferred from a dataset is staggering,
depending on how many organisms you are comparing
(n = number of organisms in the tree)
Rooted trees: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted trees: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
EX) if n = 5, $N_R = 105$ and $N_U = 15$
Most trees are inferred trees!
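
The two tree-count formulas on this slide can be checked with a few lines of Python. This is a small sketch; the function names are illustrative, and the formulas are taken as valid for n of at least 2 (rooted) and at least 3 (unrooted).

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n organisms: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n organisms: (2n-5)! / (2^(n-3) (n-3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For n = 5 this reproduces the 105 rooted and 15 unrooted trees of the example, and the counts for larger n show why trees have to be inferred rather than enumerated.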

14
Phylogenetics

Gene trees versus species trees
Reminder horizontal gene transfer events can
cause massive divergence between two genes found
in the same species (small divergence between the
species to make new strains, but major divergence
between 2 genes)
Be careful
if a tree is constructed from a single gene it
doesn't always indicate species evolution
One of the only trees that are fairly well
accepted as species trees and involve comparison
of only a single gene is the ribosomal RNA trees
Some controversy here too!

15
Which Sequence to Choose

Different sequences change at different rates -
choose a level of variation that is appropriate to
the group of organisms being studied.
Diverse group versus tight group (all mammals
versus primates only)
Proteins (or cDNAs) are constrained by natural
selection, while nucleic acid is not always
Some sequences are highly variable (rRNA spacer
regions, HLA genes), while others are highly
conserved (actin, histones)
Different regions within a single gene can evolve
at different rates
Different functional constraints

16
Phylogenetics

Character sets
Anatomical feature, color, timed response to a
stimulus, nucleic acid, amino acid
Character states
DNA: 4 possible states; protein: 20 possible
states
Distance sets
A measure of overall pairwise differences between
two character sets
Comparison of sequence data: matches,
mismatches, gaps, matrix data
Simple distance calculation
ratio of identities
D = m/t (m = matching positions, t = total positions compared)
Complex distance calculation
Jukes and Cantor
Kimura (transition/transversion)
12 parameter model
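
As a rough illustration of the distance calculations listed above, the sketch below computes the proportion of differing sites and the standard Jukes-Cantor and Kimura two-parameter corrections for one pair of aligned sequences. The function names are illustrative, gap columns are simply skipped (an assumption, not something the slides specify), and the 12 parameter model is not covered.

```python
import math

PURINES = {"A", "G"}


def site_counts(a, b):
    """Count compared sites, transitions and transversions, skipping gap columns."""
    total = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        total += 1
        if x != y:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if (x in PURINES) == (y in PURINES):
                transitions += 1
            else:
                transversions += 1
    return total, transitions, transversions


def p_distance(a, b):
    """Proportion of compared sites that differ."""
    total, ts, tv = site_counts(a, b)
    return (ts + tv) / total


def jukes_cantor(a, b):
    """Jukes-Cantor distance: d = -3/4 ln(1 - 4p/3)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(a, b):
    """Kimura distance from transition (P) and transversion (Q) proportions."""
    total, ts, tv = site_counts(a, b)
    P, Q = ts / total, tv / total
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


seq1 = "CGTAACACGTATGCAACCTACCCAAAACAG"  # hsb051bc from the alignment slide
seq2 = "CGTAACACGTATGCAACCTACCTTGTACAG"  # hsb098bc from the alignment slide
print(site_counts(seq1, seq2))
print(p_distance(seq1, seq2), jukes_cantor(seq1, seq2), kimura_2p(seq1, seq2))
```

Both corrected distances come out larger than the raw proportion of differences because they allow for multiple substitutions at the same site.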

17
Two Approaches to Molecular Phylogeny

Distance Matrix Based method
Multiple sequence alignment
calculate distance in all possible pairs of
sequences using JC, Kimura or 12 parameter model
Cluster your organisms based on distance
UPGMA or Neighbor Joining algorithm
Optimality Methods
Multiple sequence alignment
Purely statistical approach to determining
relationships between organisms
Probability of every nucleotide or amino acid
substitution
No distance calculated
Maximum Parsimony or Maximum likelihood

18
Distance Based method

MSA
Calculate Distance from characters
Clustering algorithm to build topology
Branch lengths

19
Multiple Sequence Alignment
hsb051bc    CGTAACACGT ATGCAACCTA CCCAAAACAG
hsb098bc    CGTAACACGT ATGCAACCTA CCTTGTACAG
hsb090bc    CGTAACACGT ATGCAATCTG CCCGGAACTG
hsb083bc    CGTAACACGT ATGCAATCTA CCCGAAACAG
hsb074bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb104bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb073bc    AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb065bc    AGTAATACAT CG-GAACATG TCCTGGAGTG
(Consensus) mGTAAyrCrT mknsAAysTr yCydvdwswG

DNA aligned optimally by bringing the greatest
number of similar residues into register in same
column of alignment.

20
Challenges

Sequences are long!
Placement of gaps, mismatches and matches
Cannot use the same optimal methods that you
learned for pairwise alignments (i.e. Blast
alignments)
More rules must be applied to algorithm
The less homology between sequences the more
difficult the task.
Obtaining a cumulative score for substitutions in
each column
Placement and scoring of gaps
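
One common way to obtain a cumulative score for each column is the sum-of-pairs score, sketched below. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; real programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum the scores of every pair of residues in one alignment column."""
    score = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue  # two gaps facing each other are neither rewarded nor penalized
        elif x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(alignment, **scores):
    """Total score of an alignment given as a list of equal-length strings."""
    return sum(column_score(column, **scores) for column in zip(*alignment))


alignment = ["ACG", "A-G", "ATG"]
print([column_score(column) for column in zip(*alignment)])
print(sum_of_pairs(alignment))
```

Moving a gap changes the column scores, so comparing totals is one way to judge alternative gap placements.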

21

Try it Yourself!
eek one cat ate the dog s two cat ate the dog
one rat ate the dog two rat ate the dog poo poo
two cat ate the dog eek eek one cat ate the dog
22
Methods For Alignment

Progressive
Align most similar sequences first then build by
adding less similar sequences
EX) Clustal W algorithm
Iterative
Align groups of sequences that are similar
initially, then revise the alignment when groups
are placed together
EX) DIALIGN
Local
Align only the locally conserved patterns in all
sequences and let the rest of the sequences fall
where they may
Best when internal gaps are not expected
EX) Asset (by NCBI)
Statistical and probabilistic models of sequences
All possible optimal pairwise alignments (species
A, B, C, D)
Use statistical approach to construct initial
tree (UPGMA)
Reconstruct alignments progressively in order of
relatedness according to tree
New UPGMA tree

23
Distance

Now we have the alignment what do we do next?
Determine distance or substitution rate
A single measurement of amount of evolutionary
change between two sequences
Pairwise distances must be calculated between
every organism included in a tree
Distance Algorithms (nucleic acids)
A naïve algorithm would be
ED = differences / total
Kimura
Jukes Cantor
12 parameter
Indels are usually also taken into account in
these 2 algorithms (gap penalties)
Protein sequence distance (amino acids)
BLOSUM and PAM are used to calculate distance
Gap penalties calculated also
Since distance determinations are pairwise all
distances are entered into a distance matrix

24
Distance Matrix
001         CGTAACACGT ATGCAACCTA CCCAAAACAG
002         CGTAACACGT ATGCAACCTA CCTTGTACAG
003         CGTAACACGT ATGCAATCTG CCCGGAACTG
004         CGTAACACGT ATGCAATCTA CCCGAAACAG
005         AGTAATGCAT CG-GAACGTG TCCTCTTGTG
(Consensus) mGTAAyrCrT mknsAAysTr yCydvdwswG
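
As an illustration, a pairwise distance matrix for these five sequences can be built with the naive differences/total measure. This is a sketch only: columns where either sequence has a gap are skipped, which is an assumption, and `naive_distance` is an illustrative name.

```python
from itertools import combinations

sequences = {
    "001": "CGTAACACGTATGCAACCTACCCAAAACAG",
    "002": "CGTAACACGTATGCAACCTACCTTGTACAG",
    "003": "CGTAACACGTATGCAATCTGCCCGGAACTG",
    "004": "CGTAACACGTATGCAATCTACCCGAAACAG",
    "005": "AGTAATGCATCG-GAACGTGTCCTCTTGTG",
}


def naive_distance(a, b):
    """Differences / total over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


names = list(sequences)
matrix = {n: {m: 0.0 for m in names} for n in names}
for a, b in combinations(names, 2):
    # Distances are pairwise and symmetric, so fill both halves of the matrix.
    matrix[a][b] = matrix[b][a] = naive_distance(sequences[a], sequences[b])

print("      " + " ".join(f"{n:>6}" for n in names))
for n in names:
    print(f"{n:>6} " + " ".join(f"{matrix[n][m]:6.3f}" for m in names))
```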
25
Clustering Algorithms

Now what do you do with the distance matrix?
Apply a clustering Algorithm to build the trees.
UPGMA (unweighted pair group method with
arithmetic mean)
Clusters the 2 species with the smallest distance first
The distances between this new cluster and
the other organisms are then calculated (new
matrix)
The pair with the smallest distance in
this new matrix becomes the next cluster.
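
The UPGMA steps above can be sketched in a short function. The distance values below are hypothetical, chosen only so that D and E join first and A and C join next; the function name and the nested-tuple output are illustrative choices, and branch lengths are not computed.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of labels; dist: {(a, b): distance} with each pair listed once.
    Returns the tree topology as nested tuples.
    """
    d = {}
    for (a, b), value in dist.items():
        d[a, b] = d[b, a] = value
    sizes = {name: 1 for name in names}  # active clusters and how many taxa each holds
    while len(sizes) > 1:
        active = list(sizes)
        # Find the closest pair among the active clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda pair: d[pair],
        )
        merged = (a, b)
        # New matrix row: size-weighted average distance to every other cluster.
        for c in active:
            if c not in (a, b):
                avg = (sizes[a] * d[a, c] + sizes[b] * d[b, c]) / (sizes[a] + sizes[b])
                d[merged, c] = d[c, merged] = avg
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical distances for five taxa (not taken from the slides).
example = {
    ("A", "B"): 8, ("A", "C"): 4, ("A", "D"): 10, ("A", "E"): 10,
    ("B", "C"): 8, ("B", "D"): 10, ("B", "E"): 10,
    ("C", "D"): 10, ("C", "E"): 10, ("D", "E"): 2,
}
print(upgma(["A", "B", "C", "D", "E"], example))
```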

26
UPGMA
D and E closest Distance
C and A
27
Branch Lengths

Remember it is still unscaled if the branch
lengths don't reflect evolutionary distance.
How do we turn this into a scaled tree?
If we can assume that evolution between all
species occurred at a constant rate then use the
matrix!
If not then it is more complicated
Neighbor Joining algorithm instead of UPGMA
The distance matrix is adjusted for differences
in the rate of evolution of each taxon (branch).
Disadvantage of UPGMA: assumes a constant
evolutionary rate across all lineages
Disadvantage of Neighbor Joining: produces an
unrooted tree only
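
The rate adjustment in Neighbor Joining can be illustrated with its standard selection criterion, Q(i,j) = (n-2) d(i,j) - sum of row i - sum of row j, where the pair with the smallest Q is joined. The sketch below performs only this first selection step. The distances are hypothetical and come from an imagined tree ((A,B),(C,D)) with branch lengths A = 1, B = 5, C = 1, D = 5 and an internal branch of 1, so B and D evolve faster than A and C.

```python
def nj_pair(names, d):
    """Return the pair Neighbor Joining would join first, with its Q value."""
    n = len(names)
    row_sum = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best, best_q = None, float("inf")
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Subtracting the row sums corrects for taxa that are far from everything.
            q = (n - 2) * d[a][b] - row_sum[a] - row_sum[b]
            if q < best_q:
                best, best_q = (a, b), q
    return best, best_q


# Hypothetical distances implied by the imagined tree described above.
pairs = {
    ("A", "B"): 6, ("A", "C"): 3, ("A", "D"): 7,
    ("B", "C"): 7, ("B", "D"): 11, ("C", "D"): 6,
}
names = ["A", "B", "C", "D"]
d = {a: {} for a in names}
for (a, b), value in pairs.items():
    d[a][b] = d[b][a] = value

print(min(pairs, key=pairs.get))  # pair with the smallest raw distance
print(nj_pair(names, d))          # pair chosen by the Q criterion
```

In this example the smallest raw distance is between A and C, which are not neighbors in the imagined tree, while the Q criterion picks A and B, which are.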

28
Optimality (character) Based Methods

MSA
Build Tree
Branch lengths

29
Parsimony

Parsimony
Allows the use of all known evolutionary
information in building a tree
In contrast, distance methods compress all of the
differences between pairs of sequences into a
single number
Parsimony involves evaluating all possible trees
and giving each a score based on the number of
evolutionary changes that are needed to explain
the observed data.
The best tree is the one that requires the fewest
base changes for all sequences to derive from a
common ancestor.
The most parsimonious tree is the one with the
minimum tree length needed to explain the observed
distributions of all characters
Assumes one rate of evolution across all lineages
in a given tree

30
Parsimony Example

Consider four sequences ATCG, TTCG, ATCC, and
TCCG
Imagine a tree that branches at the first
position, grouping ATCG and ATCC on one branch,
TTCG and TCCG on the other branch.
Then each branch splits, for a total of 3
internal nodes on the tree (Tree 1)

Compare Tree 1 with one that first divides ATCC
on its own branch, then splits off ATCG, and
finally divides TTCG from TCCG (Tree 2).
Trees 1 and 2 both have three nodes, but when
all of the distances back to the root (number of
nodes crossed) are summed, the total is equal to
8 for Tree 1 and 9 for Tree 2.
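
The node counts in this example can be reproduced with a short sketch, shown together with the usual Fitch count of base changes for comparison. The trees are written as nested tuples, and the function names are illustrative. Note that the Fitch count is a different measure from the node count used above, and on this tiny data set it is the same for both trees.

```python
def depth_sum(tree, depth=0):
    """Sum over all sequences of the number of nodes crossed back to the root."""
    if isinstance(tree, str):
        return depth
    return sum(depth_sum(child, depth + 1) for child in tree)


def fitch(tree, site):
    """Return (possible ancestral states, changes needed) for one column."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, site)
    right_states, right_changes = fitch(right, site)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one base change is required at this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, length=4):
    """Minimum number of base changes over all columns (Fitch's algorithm)."""
    return sum(fitch(tree, site)[1] for site in range(length))


tree1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
tree2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))
print(depth_sum(tree1), depth_sum(tree2))
print(parsimony_score(tree1), parsimony_score(tree2))
```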

(Figure: Tree 1 and Tree 2)
32
Maximum Likelihood

Directly comparable to maximum parsimony
Statistical method that examines the probability
of all possible substitutions in the alignment;
the most likely tree is the one that involves the
fewest changes and the most probable
substitutions.
Different from parsimony
Different rates of evolution in different
lineages
Evolution at different sites and along different
lineages is treated as statistically independent.
Great for distantly related sequences
Computationally taxing
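
A full maximum likelihood tree search is beyond a short example, but the core idea can be sketched for just two sequences: try many branch lengths and keep the one that makes the observed alignment most probable. The sketch assumes the Jukes-Cantor model, treats sites as independent, skips gap columns, and uses a simple grid search; the function names are illustrative.

```python
import math


def jc_log_likelihood(a, b, t):
    """Log-likelihood of distance t (substitutions per site) under Jukes-Cantor."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    logl = 0.0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        # Sites are independent, so their log-probabilities add up.
        logl += math.log(0.25) + math.log(p_same if x == y else p_diff)
    return logl


def best_branch_length(a, b, step=0.001, max_t=2.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(max_t / step) + 1)]
    return max(grid, key=lambda t: jc_log_likelihood(a, b, t))


seq1 = "CGTAACACGTATGCAACCTACCCAAAACAG"  # hsb051bc from the alignment slide
seq2 = "CGTAACACGTATGCAACCTACCTTGTACAG"  # hsb098bc from the alignment slide
print(best_branch_length(seq1, seq2))
```

For a single pair the best value agrees, to within the grid step, with the Jukes-Cantor distance formula; a real program repeats this kind of calculation over every branch of every candidate tree, which is why the method is computationally taxing.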

33
Scaled Tree

We can also get a scaled character based tree if
we reflect the amount of evolutionary change in
the branch lengths by applying relative amount of
change occurring between each organism.

34
(No Transcript)
35
Is My Tree Correct?

Not a good question.
There are a number of different algorithms and
each can give a slightly different tree
No single algorithm is more powerful or more
accepted by scientific community than another!!!!
The trick is to use more than one algorithm and
build more than one tree and support your
hypothesis more than once. Sound familiar?
Can any tree be scientifically proven to be
correct?
Are hypotheses proven or supported? Think about
it.

36
Bootstrapping

Allows for estimation of confidence levels
Build trees 100X and determine how often groups
cluster together

(Figure: tree with taxa A, B, C and D; one node is labeled with the bootstrap value 98)
Bootstrap value 98: this means that the
relationship between the cluster of A, B and C
and D occurred 98 times when the tree was
constructed 100 different times from resampled
versions of the alignment.
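
A minimal sketch of the bootstrap idea follows: alignment columns are resampled with replacement, the analysis is repeated on each resampled alignment, and the number of replicates in which a grouping reappears is counted. To stay short, the "tree" here is reduced to a single question (which two sequences are closest), which is a deliberate simplification; the function names and the fixed random seed are illustrative choices.

```python
import random


def p_distance(a, b):
    """Proportion of columns in which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Pair of sequence names with the smallest distance (ties go to the first pair)."""
    names = sorted(seqs)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: p_distance(seqs[p[0]], seqs[p[1]]))


def bootstrap_support(seqs, pair, replicates=100, seed=0):
    """Count the replicates in which `pair` is again the closest pair."""
    rng = random.Random(seed)
    length = len(next(iter(seqs.values())))
    hits = 0
    for _ in range(replicates):
        # Draw columns with replacement to make a pseudo-alignment of the same length.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {n: "".join(s[c] for c in cols) for n, s in seqs.items()}
        if closest_pair(resampled) == pair:
            hits += 1
    return hits


seqs = {
    "001": "CGTAACACGTATGCAACCTACCCAAAACAG",
    "002": "CGTAACACGTATGCAACCTACCTTGTACAG",
    "003": "CGTAACACGTATGCAATCTGCCCGGAACTG",
    "004": "CGTAACACGTATGCAATCTACCCGAAACAG",
}
pair = closest_pair(seqs)
print(pair, bootstrap_support(seqs, pair))
```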
37
MSA

What can it be used for?
Study evolution
Determine common ancestry between organisms (to
build phylogenetic trees)
Determine how a protein evolved
Find conserved regions in same gene of a number
of organisms
For primer searching
Predict functional and structural locations
within cDNA
Assembling sequence fragments into a larger
sequence

38
Assembly of Sequence Fragments
Sequence fragment 1: 5'-CGTAACACGTATGCAACCTA-3'
Sequence fragment 2: 5'-ATGCAACCTACCCAAAACAG-3'
align
CGTAACACGTATGCAACCTA
          ATGCAACCTACCCAAAACAG
assemble
5'-CGTAACACGTATGCAACCTACCCAAAACAG-3'
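
The align-then-assemble step can be sketched as finding the longest end of fragment 1 that exactly matches the start of fragment 2. The function name and the `min_overlap` threshold are illustrative assumptions, and real assemblers also tolerate mismatches and sequencing errors, which this sketch does not.

```python
def merge_fragments(left, right, min_overlap=5):
    """Join two fragments on the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            # Keep `left` whole and append only the non-overlapping part of `right`.
            return left + right[size:]
    return None  # no overlap of at least min_overlap bases


fragment1 = "CGTAACACGTATGCAACCTA"
fragment2 = "ATGCAACCTACCCAAAACAG"
print(merge_fragments(fragment1, fragment2))
```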
