Multiple Sequence Alignments and Phylogenetic Trees

Provided by: MICHELLE6
Transcript and Presenter's Notes

1
Multiple Sequence Alignments and Phylogenetic Trees
2
Two Questions
  • When we used BLAST to compare sequences, two
    questions may have come to mind:
  • How are these sequences actually related? Are
    they homologous?
  • How are the organisms that these sequences came
    from related? Do they share a recent common
    ancestor?
  • We will now learn how evolutionary biologists
    use sequence comparisons as evidence of
    evolutionary relatedness between organisms.

3
Taxonomy
  • Naming and grouping organisms based on
    similarities and differences
  • Carolus Linnaeus (1707-1778) is the father of
    taxonomy
  • His taxonomy was based on traits of organisms

4
(No Transcript)
5
Phylogeny
  • Inferring evolutionary relationships based on
    similarities
  • molecular versus morphological traits
  • Phenotype to infer genotype
  • Willi Hennig, entomologist, 1950
  • Phylogenetic tree
  • Relying on traits has limitations
  • Convergent evolution
  • Eyes as a trait
  • Phenotypes and bacteria
  • Distantly related organisms

6
Molecular Phylogeny
  • GHF Nuttall
  • 1902 molecular similarities between organisms
  • Immune response
  • Protein electrophoresis (1950s)
  • Comparison of related proteins (size, charge)
  • Genome cross hybridization
  • Protein Sequencing (1960s)
  • Genomic information (1970s)
  • Restriction maps
  • Whole gene sequences
  • Computational molecular phylogeny
  • Application of computational algorithms, methods
    and programs to molecular phylogenetic analyses
  • Sequence data
  • amino acid or nucleic acid
  • Carl Woese and 16S rRNA

7
Assumptions Made
  • The sequences are correct
  • The sequences are homologous
  • Each position is homologous
  • The sampling of taxa or genes is sufficient to
    resolve the problem of interest
  • Sequence variation is representative of the
    broader group of interest
  • Sequence variation contains sufficient
    phylogenetic signal (as opposed to noise) to
    resolve the problem of interest
  • Each position in the sequence evolved
    independently

8
Phylogenetic Terms
  • Nodes represent distinct taxonomic units
  • Terminal node (data collected)
  • Internal node (no data; inferred common ancestor)
  • Branches (lineages) define the branching order
  • They represent the evolutionary relationships
    between organisms or nodes
  • Scaled tree: each branch length is proportional
    to the evolutionary distance (amount of
    substitution) between the nodes it connects
  • A phylogram is usually used to represent this
    type of tree
  • Unscaled tree: shows relative kinship only
  • A cladogram is usually used to represent this
    type of tree

9
Note the nodes and branches.
10
Phylogenetics
  • Computer programs use the Newick format to
    describe a tree
  • (((I,II),(III,IV)),V);
  • Bifurcating nodes (especially shallow branches)
  • More common
  • Two lineages descend from one node (A and B)
  • Multifurcating nodes
  • More than two lineages from the same ancestor?
  • Or two or more bifurcations whose order is unknown
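The Newick string above can be read into a nested structure with a few lines of code. The following is a minimal sketch, assuming leaf names only (no branch lengths, no labels on internal nodes); the function names `parse_newick`, `leaves` and `is_bifurcating` are illustrative and do not come from any particular program.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # otherwise read a leaf name
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def leaves(tree):
    """List the terminal nodes from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def is_bifurcating(tree):
    """True if every internal node has exactly two descendants."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)


tree = parse_newick("(((I,II),(III,IV)),V);")
print(tree, leaves(tree), is_bifurcating(tree))
```

A string such as (A,B,C); parses to a node with three children, which is how a multifurcating node appears in this representation.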

11
Rooted Trees Versus Unrooted Trees
  • Rooted trees: all organisms in the tree have a
    common ancestor
  • A unique path leads from the common ancestor
    to each of the others
  • Unrooted trees: only the relationships between
    nodes are shown
  • No direction is given for where evolution begins
    and ends
  • To root a tree when building it you must assign
    an outgroup (you must know something about the
    organisms you are comparing; look to the fossil
    record or other evidence)
  • Not always easy to do
  • A good outgroup for a bacterial tree is an
    archaeon.

12
Rooted versus Unrooted
(Figure: rooted and unrooted trees relating taxa I, II and III)
13
Numbers of Trees!!!
  • The number of possible evolutionary paths that
    can be drawn from a dataset is staggering, and it
    depends on how many organisms you are comparing
    (n = number of organisms in the tree)
  • Rooted trees: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
  • Unrooted trees: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
  • EX) if n = 5, $N_R = 105$ and $N_U = 15$
  • Most trees are inferred trees!
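To get a feel for how quickly these counts grow, the two formulas can be evaluated directly. This is a small sketch; the function names `rooted_trees` and `unrooted_trees` are illustrative.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n organisms (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n organisms (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For n = 5 this reproduces the 105 and 15 quoted in the example; for n = 10 there are already more than two million unrooted trees.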

14
Phylogenetics
  • Gene trees versus species trees
  • Reminder: horizontal gene transfer events can
    cause massive divergence between two genes found
    in the same species (small divergence between the
    species, enough to make new strains, but major
    divergence between the 2 genes)
  • Be careful
  • A tree constructed from a single gene does not
    always indicate species evolution
  • Among the only single-gene trees that are fairly
    well accepted as species trees are the ribosomal
    RNA trees
  • Some controversy here too!

15
Which Sequence to Choose
  • Different sequences change at different rates;
    choose a level of variation that is appropriate to
    the group of organisms being studied.
  • Diverse group versus tight group (all mammals
    versus primates only)
  • Proteins (or cDNAs) are constrained by natural
    selection, while nucleic acid sequence is not always
  • Some sequences are highly variable (rRNA spacer
    regions, HLA genes), while others are highly
    conserved (actin, histones)
  • Different regions within a single gene can evolve
    at different rates
  • Different functional constraints

16
Phylogenetics
  • Character sets
  • Anatomical feature, color, timed response to a
    stimulus, nucleic acid, amino acid
  • Character states
  • DNA: 4 possible states; protein: 20 possible
    states
  • Distance sets
  • A measure of overall pairwise differences between
    two character sets
  • Comparison of sequence data: matches,
    mismatches, gaps, matrix data
  • Simple distance calculation
  • Ratio of identities
  • D = m/t (m = matching positions, t = total
    positions compared)
  • Complex distance calculation
  • Jukes and Cantor
  • Kimura (transition/transversion)
  • 12 parameter model
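For reference, the first two corrections named above are usually written as follows. The symbols are not from the slides and are introduced here as assumptions: p is the proportion of sites that differ between two aligned sequences (so p = 1 - m/t in the notation of the simple calculation), P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion.

```latex
% Jukes and Cantor: every substitution is equally likely
d_{JC} = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p\right)

% Kimura two-parameter: transitions and transversions have separate rates
d_{K2P} = -\frac{1}{2}\,\ln\!\left[\left(1 - 2P - Q\right)\sqrt{1 - 2Q}\right]
```

Both corrections give a distance larger than the raw proportion of differences, because they allow for positions that have changed more than once.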

17
Two Approaches to Molecular Phylogeny
  • Distance Matrix Based method
  • Multiple sequence alignment
  • calculate distance in all possible pairs of
    sequences using JC, Kimura or 12 parameter model
  • Cluster your organisms based on distance
  • UPGMA or Neighbor Joining algorithm
  • Optimality Methods
  • Multiple sequence alignment
  • Purely statistical approach to determining
    relationships between organisms
  • Probability of every nucleotide or amino acid
    substitution
  • No distance calculated
  • Maximum Parsimony or Maximum likelihood

18
Distance Based method
  • MSA
  • Calculate Distance from characters
  • Clustering algorithm to build topology
  • Branch lengths

19
Multiple Sequence Alignment
hsb051bc     CGTAACACGT ATGCAACCTA CCCAAAACAG
hsb098bc     CGTAACACGT ATGCAACCTA CCTTGTACAG
hsb090bc     CGTAACACGT ATGCAATCTG CCCGGAACTG
hsb083bc     CGTAACACGT ATGCAATCTA CCCGAAACAG
hsb074bc     AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb104bc     AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb073bc     AGTAATGCAT CG-GAACGTG TCCTCTTGTG
hsb065bc     AGTAATACAT CG-GAACATG TCCTGGAGTG
(Consensus)  mGTAAyrCrT mknsAAysTr yCydvdwswG
  • DNA aligned optimally by bringing the greatest
    number of similar residues into register in same
    column of alignment.

20
Challenges
  • Sequences are long!
  • Placement of gaps, mismatches and matches
  • Cannot use the same optimal methods that you
    learned for pairwise alignments (e.g. BLAST
    alignments)
  • More rules must be applied to the algorithm
  • The less similar the sequences, the more
    difficult the task.
  • Obtaining a cumulative score for substitutions in
    each column
  • Placement and scoring of gaps
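One common way to obtain a cumulative score for a column is sum-of-pairs scoring: score every pair of residues in the column and add the results. The sketch below assumes simple match, mismatch and gap values chosen only for illustration; real programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations

# Assumed scores for illustration only.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(x, y):
    """Score one pair of characters taken from the same column."""
    if x == "-" and y == "-":
        return 0                 # two gaps facing each other are ignored
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def column_score(column):
    """Sum the scores of all pairs of characters in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def alignment_score(rows):
    """Sum the column scores over an alignment given as equal-length rows."""
    return sum(column_score(col) for col in zip(*rows))


rows = ["ATGCAACC", "ATGCAATC", "CG-GAACG"]
print([column_score(col) for col in zip(*rows)], alignment_score(rows))
```

Moving a gap changes several column scores at once, which is part of why gap placement is hard to optimize across many sequences.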

21

Try it Yourself!
eek one cat ate the dog s two cat ate the dog
one rat ate the dog two rat ate the dog poo poo
two cat ate the dog eek eek one cat ate the dog
22
Methods For Alignment
  • Progressive
  • Align most similar sequences first then build by
    adding less similar sequences
  • EX) Clustal W algorithm
  • Iterative
  • Align groups of sequences that are similar
    initially, then revise the alignment when groups
    are placed together
  • EX) DIALIGN
  • Local
  • Align only the locally conserved patterns in all
    sequences and let the rest of the sequences fall
    where they may
  • Best when internal gaps are not expected
  • EX) Asset (by NCBI)
  • Statistical and probabilistic models of sequences
  • All possible optimal pairwise alignments (species
    A, B, C, D)
  • Use statistical approach to construct initial
    tree (UPGMA)
  • Reconstruct alignments progressively in order of
    relatedness according to tree
  • New UPGMA tree

23
Distance
  • Now that we have the alignment, what do we do next?
  • Determine distance or substitution rate
  • A single measurement of the amount of evolutionary
    change between two sequences
  • Pairwise distances must be calculated between
    every pair of organisms included in a tree
  • Distance algorithms (nucleic acids)
  • A naïve algorithm would be
  • ED = differences / total
  • Kimura
  • Jukes Cantor
  • 12 parameter
  • Indels are usually also taken into account in
    these 2 algorithms (gap penalties)
  • Protein sequence distance (amino acids)
  • BLOSUM and PAM matrices are used to calculate
    distance
  • Gap penalties are calculated also
  • Since distance determinations are pairwise, all
    distances are entered into a distance matrix
24
Distance Matrix
001          CGTAACACGT ATGCAACCTA CCCAAAACAG
002          CGTAACACGT ATGCAACCTA CCTTGTACAG
003          CGTAACACGT ATGCAATCTG CCCGGAACTG
004          CGTAACACGT ATGCAATCTA CCCGAAACAG
005          AGTAATGCAT CG-GAACGTG TCCTCTTGTG
(Consensus)  mGTAAyrCrT mknsAAysTr yCydvdwswG
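A distance matrix for these five sequences can be computed as sketched below. This is an illustration under stated assumptions, not the procedure of any particular program: columns containing a gap are simply skipped, the proportion of differing sites is used as the naive distance, and the Jukes-Cantor correction is applied on top of it. The names `p_distance`, `jukes_cantor` and `distance_matrix` are hypothetical.

```python
import math
from itertools import combinations

ALIGNMENT = {
    "001": "CGTAACACGTATGCAACCTACCCAAAACAG",
    "002": "CGTAACACGTATGCAACCTACCTTGTACAG",
    "003": "CGTAACACGTATGCAATCTGCCCGGAACTG",
    "004": "CGTAACACGTATGCAATCTACCCGAAACAG",
    "005": "AGTAATGCATCG-GAACGTGTCCTCTTGTG",
}


def p_distance(a, b):
    """Naive distance: differences / total, ignoring gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (defined only for p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment, correct=jukes_cantor):
    """Return {(name_i, name_j): distance} for every pair of sequences."""
    return {
        (i, j): correct(p_distance(alignment[i], alignment[j]))
        for i, j in combinations(list(alignment), 2)
    }


matrix = distance_matrix(ALIGNMENT)
for pair, d in sorted(matrix.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

Passing `correct=lambda p: p` returns the uncorrected proportions instead, which makes it easy to see how much the correction stretches the larger distances.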
25
Clustering Algorithms
  • Now what do you do with the distance matrix?
  • Apply a clustering algorithm to build the trees.
  • UPGMA (unweighted pair group method with
    arithmetic mean)
  • Clusters the 2 species with the smallest distance
    first
  • The distances between this new cluster and the
    other organisms are then calculated (new
    matrix)
  • The pair with the smallest distance in this new
    matrix becomes the next cluster.
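The clustering steps above can be sketched in a few lines. The distance values for taxa A to E below are hypothetical and chosen only for illustration, and the function returns the topology as a Newick string without branch lengths.

```python
from itertools import combinations

# Hypothetical pairwise distances between five taxa.
RAW = {
    ("A", "B"): 5, ("A", "C"): 9, ("A", "D"): 9, ("A", "E"): 8,
    ("B", "C"): 10, ("B", "D"): 10, ("B", "E"): 9,
    ("C", "D"): 8, ("C", "E"): 7, ("D", "E"): 3,
}
DIST = {frozenset(pair): value for pair, value in RAW.items()}
NAMES = ["A", "B", "C", "D", "E"]


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology in Newick format."""
    sizes = {name: 1 for name in names}      # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = "(%s,%s)" % (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # New matrix: size-weighted average distance to every other cluster,
        # which equals the mean over all leaf pairs.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


print(upgma(NAMES, DIST))
```

With these values D and E are joined first (distance 3), then A and B, then C joins the D-E cluster.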

26
UPGMA
(Figure: UPGMA example. Labels: D and E, closest distance; C and A)
27
Branch Lengths
  • Remember, the tree is still unscaled if the branch
    lengths do not reflect evolutionary distance.
  • How do we turn this into a scaled tree?
  • If we can assume that evolution occurred at a
    constant rate in all species, then use the
    matrix!
  • If not, then it is more complicated
  • Use the Neighbor Joining algorithm instead of UPGMA
  • The distance matrix is adjusted for differences
    in the rate of evolution of each taxon (branch).
  • Disadvantage of UPGMA: assumes a constant
    evolutionary rate across all lineages
  • Disadvantage of Neighbor Joining: produces an
    unrooted tree only
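The adjustment that Neighbor Joining makes can be illustrated with its first step. For n taxa, the standard criterion is Q(i,j) = (n-2)·d(i,j) - Σd(i,k) - Σd(j,k), and the pair with the smallest Q is joined. The sketch below computes only this first step on a hypothetical five-taxon matrix; it is not a full tree builder.

```python
from itertools import combinations

# Hypothetical pairwise distances between five taxa.
RAW = {
    ("A", "B"): 5, ("A", "C"): 9, ("A", "D"): 9, ("A", "E"): 8,
    ("B", "C"): 10, ("B", "D"): 10, ("B", "E"): 9,
    ("C", "D"): 8, ("C", "E"): 7, ("D", "E"): 3,
}
DIST = {frozenset(pair): value for pair, value in RAW.items()}
NAMES = ["A", "B", "C", "D", "E"]


def nj_first_pair(names, dist):
    """Return the pair Neighbor Joining would join first, plus the Q values."""
    n = len(names)
    # Total distance from each taxon to all others (its overall divergence).
    total = {
        i: sum(dist[frozenset((i, k))] for k in names if k != i) for i in names
    }
    q = {
        (i, j): (n - 2) * dist[frozenset((i, j))] - total[i] - total[j]
        for i, j in combinations(names, 2)
    }
    return min(q, key=q.get), q


pair, q = nj_first_pair(NAMES, DIST)
closest = min(RAW, key=RAW.get)
print("Neighbor Joining joins", pair, "first; smallest raw distance is", closest)
```

With these values the smallest raw distance is between D and E, which is what UPGMA would join first, but the rate-adjusted criterion selects A and B instead.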

28
Optimality (character) Based Methods
  • MSA
  • Build Tree
  • Branch lengths

29
Parsimony
  • Parsimony
  • Allows the use of all known evolutionary
    information in building a tree
  • In contrast, distance methods compress all of the
    differences between pairs of sequences into a
    single number
  • Parsimony involves evaluating all possible trees
    and giving each a score based on the number of
    evolutionary changes that are needed to explain
    the observed data.
  • The best tree is the one that requires the fewest
    base changes for all sequences to derive from a
    common ancestor.
  • The most parsimonious tree is the one with the
    minimum tree length needed to explain the observed
    distributions of all characters
  • Assumes one rate of evolution across all lineages
    in a given tree

30
Parsimony Example
  • Consider four sequences ATCG, TTCG, ATCC, and
    TCCG
  • Imagine a tree that branches at the first
    position, grouping ATCG and ATCC on one branch,
    TTCG and TCCG on the other branch.
  • Then each branch splits, for a total of 3
    internal nodes on the tree (Tree 1)

31
  • Compare Tree 1 with one that first divides ATCC
    on its own branch, then splits off ATCG, and
    finally divides TTCG from TCCG (Tree 2).
  • Trees 1 and 2 both have three nodes, but when
    all of the distances back to the root (number of
    nodes crossed) are summed, the total is equal to 8
    for Tree 1 and 9 for Tree 2.

(Figure: Tree 1 and Tree 2)
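A standard way to count the base changes a tree requires is Fitch's algorithm, which is a different measure from the node-crossing sum used in the slide. The sketch below applies it to the four example sequences. Tree 1 and Tree 2 are written as nested tuples following their descriptions, and `ALTERNATIVE` is a hypothetical third grouping added only for comparison.

```python
SEQS = ["ATCG", "TTCG", "ATCC", "TCCG"]

TREE_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
TREE_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))
ALTERNATIVE = (("ATCG", "TTCG"), ("ATCC", "TCCG"))   # hypothetical grouping


def fitch_site(tree, site):
    """Return (possible ancestral bases, changes needed) for one position."""
    if isinstance(tree, str):
        return {tree[site]}, 0               # a leaf: its observed base
    left, right = tree
    left_set, left_changes = fitch_site(left, site)
    right_set, right_changes = fitch_site(right, site)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # No base in common: one substitution is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, length=4):
    """Total number of changes over all positions of the sequences."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


for name, tree in (("Tree 1", TREE_1), ("Tree 2", TREE_2),
                   ("Alternative", ALTERNATIVE)):
    print(name, parsimony_score(tree))
```

Under this count Tree 1 and Tree 2 need the same number of changes, because they group the sequences identically and differ only in where the root sits, while the alternative grouping needs one more.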
32
Maximum Likelihood
  • Directly comparable to maximum parsimony
  • Statistical method that examines the probability
    of all possible substitutions in the alignment;
    the most likely tree is the one that involves the
    fewest changes and the most probable
    substitutions.
  • Different from parsimony
  • Different rates of evolution in different
    lineages
  • Evolution at different sites and in different
    lineages is statistically independent.
  • Great for distantly related sequences
  • Computationally taxing
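The idea of choosing the most probable explanation can be shown on the smallest possible case: two sequences and one branch length. The sketch below assumes the Jukes-Cantor substitution model and uses a simple grid search, so it illustrates the principle rather than how real programs search tree space. The counts come from comparing sequences 001 and 002 of the earlier alignment (26 identical sites, 4 different).

```python
import math


def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a pairwise alignment at distance d under Jukes-Cantor."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay      # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay      # probability of one specific different base
    # Each site also carries a factor 1/4 for the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


n_same, n_diff = 26, 4
grid = [i / 1000 for i in range(1, 2001)]            # candidate distances
best_d = max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))

p = n_diff / (n_same + n_diff)
analytic = -0.75 * math.log(1 - 4 * p / 3)           # Jukes-Cantor distance
print(best_d, round(analytic, 4))
```

For two sequences the likelihood peak coincides with the Jukes-Cantor distance. With many sequences the same calculation has to be repeated over every site, branch and candidate tree, which is why the method is computationally taxing.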

33
Scaled Tree
  • We can also get a scaled character based tree if
    we reflect the amount of evolutionary change in
    the branch lengths by applying relative amount of
    change occurring between each organism.

34
(No Transcript)
35
Is My Tree Correct?
  • Not a good question.
  • There are a number of different algorithms and
    each can give a slightly different tree
  • No single algorithm is more powerful or more
    accepted by scientific community than another!!!!
  • The trick is to use more than one algorithm,
    build more than one tree, and support your
    hypothesis more than once. Sound familiar?
  • Can any tree be scientifically proven to be
    correct?
  • Are hypotheses proven or supported? Think about
    it.

36
Bootstrapping
  • Allows for estimation of confidence levels
  • Build the tree 100 times and determine how often
    groups cluster together

(Figure: tree of A, B, C and D with a bootstrap value of 98 on one node)
Bootstrap value 98: this means that the grouping at
that node was recovered 98 times when the tree was
rebuilt 100 times, each time from a resampled
version of the alignment.
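The resampling behind a bootstrap value can be sketched as follows. Each replicate draws alignment columns at random with replacement, and the analysis is repeated on the resampled alignment. To keep the example short, the "tree" is reduced to a single question (which two sequences are closest?), which is a simplification and not a full tree comparison; the function names and the seed parameter are illustrative.

```python
import random
from itertools import combinations

ALIGNMENT = {
    "001": "CGTAACACGTATGCAACCTACCCAAAACAG",
    "002": "CGTAACACGTATGCAACCTACCTTGTACAG",
    "003": "CGTAACACGTATGCAATCTGCCCGGAACTG",
    "004": "CGTAACACGTATGCAATCTACCCGAAACAG",
}


def p_distance(a, b):
    """Proportion of differing positions (gaps are treated as characters)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """The pair of sequence names with the smallest distance."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]),
    )


def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, pair, replicates=100, seed=0):
    """Percentage of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return 100 * hits / replicates


print(bootstrap_support(ALIGNMENT, ("001", "004")))
```

In a real analysis each replicate would rebuild the whole tree, and the support for a node is the percentage of replicate trees that contain the same grouping.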
37
MSA
  • What can it be used for?
  • Study evolution
  • Determine common ancestry between organisms (to
    build phylogenetic trees)
  • Determine how a protein evolved
  • Find conserved regions in same gene of a number
    of organisms
  • For primer searching
  • Predict functional and structural locations
    within cDNA
  • Assembling sequence fragments into a larger
    sequence

38
Assembly of Sequence Fragments
Sequence fragment 1: 5'-CGTAACACGTATGCAACCTA-3'
Sequence fragment 2: 5'-ATGCAACCTACCCAAAACAG-3'
Align:
CGTAACACGTATGCAACCTA
          ATGCAACCTACCCAAAACAG
Assemble:
5'-CGTAACACGTATGCAACCTACCCAAAACAG-3'
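The align-and-assemble step for these two fragments can be sketched as a search for the longest exact overlap between the end of one fragment and the start of the other. This is a simplified illustration: it assumes error-free reads on the same strand, and the `min_overlap` threshold is an arbitrary choice.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(left, right, min_overlap=3):
    """Merge two fragments that overlap end to start."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        raise ValueError("fragments do not overlap sufficiently")
    return left + right[size:]


fragment_1 = "CGTAACACGTATGCAACCTA"
fragment_2 = "ATGCAACCTACCCAAAACAG"
print(overlap_length(fragment_1, fragment_2))
print(assemble(fragment_1, fragment_2))
```

The two fragments share a 10-base overlap (ATGCAACCTA), so the merged sequence is 30 bases long.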