Stuart M' Brown - PowerPoint PPT Presentation

1
Molecular Phylogenetics: Computing Evolution
presented by
  • Stuart M. Brown
  • New York University School of Medicine

2
Evolution
  • The theory of evolution is the foundation upon
    which all of modern biology is built.
  • From anatomy to behavior to genomics, the
    scientific method requires an appreciation of
    changes in organisms over time.
  • By looking at gene sequences, one can see
    evolution in action.

3
Taxonomy
  • The study of the relationships between groups of
    organisms is called taxonomy, an ancient and
    venerable branch of classical biology.
  • Taxonomy is the art of classifying things into
    groups, a quintessential human behavior,
    established as a mainstream scientific field by
    Carolus Linnaeus (1707-1778).

5
Phylogenetics
  • Evolutionary theory states that groups of
    similar organisms are descended from a common
    ancestor.
  • Phylogenetic systematics (cladistics) is a method
    of taxonomic classification of organisms based on
    their evolutionary history.
  • Cladistics was developed by Willi Hennig, a
    German entomologist, in 1950.

6
DNA is a good tool for taxonomy
  • DNA sequences have many advantages over
    morphological taxonomic characters
  • Character states can be scored unambiguously
  • Large numbers of characters can be scored for
    each individual

7
  • New alleles are created by mutations in the DNA
    sequence

8
Definition
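  • Homology: related by descent
  • Homologous sequence positions

(Figure: the sequences ATTGCGC and ATCCGC compared position by position; ATCCGC is written with a gap, as AT-CCGC, so that homologous positions line up.)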
9
Dynamic Programming
  • Dynamic Programming is a very general programming
    technique.
  • It is applicable when a large search space can be
    structured into a succession of stages, such
    that:
  • the initial stage contains trivial solutions to
    sub-problems
  • each partial solution in a later stage can be
    calculated by recurring on a fixed number of
    partial solutions in an earlier stage
  • the final stage contains the overall solution
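
The three stages above can be sketched with a global alignment score computed Needleman-Wunsch style. This is a minimal illustration, not code from the presentation: the function name and the match, mismatch and gap scores are assumptions chosen for readability.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("ATTGCGC", "ATCCGC"))
```

Only the score is returned here; recovering the alignment itself would require a traceback through the table, which is omitted to keep the sketch short.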

11
Global vs. Local Alignments
  • Global alignment algorithms start at the
    beginning of two sequences and add gaps to each
    until the ends are reached, so the alignment
    spans the full length of both sequences.
  • Local alignment algorithms find the region (or
    regions) of highest similarity between two
    sequences and build the alignment outward from
    there.
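
One way to see the difference is the Smith-Waterman scoring scheme, a standard dynamic-programming formulation of local alignment. The slides do not name a specific algorithm, so this choice, the function name and the scores are assumptions. The only change from a global score is that a cell may reset to zero, which lets the best-scoring region start anywhere.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    Scores are illustrative assumptions.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # restart: drop a poorly scoring prefix
                prev[j - 1] + pair,  # extend the local alignment diagonally
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared region "ACGT" is found even though the flanks are unrelated.
print(local_alignment_score("XXXXACGTYYYY", "ZZACGTWW"))
```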

13
Phylogenetics starts with Multiple Alignment
  • Only homologous sequences can be aligned
  • Pairwise alignment can be exact
  • exhaustive computation by dynamic programming
  • Multiple alignment is always approximate
    (heuristic) - the computational cost increases
    exponentially with the number of sequences to be
    aligned
  • Progressive pairwise method
  • Software of choice is CLUSTAL (emma in EMBOSS)

14
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN    MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP    MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG      MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN    MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT   MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE    MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT      MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------

15
A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..

Each nucleotide difference is a character
16
Sequences Reflect Relationships
  • After working with sequences for a while, one
    develops an intuitive understanding that for a
    given gene, closely related organisms have
    similar sequences and more distantly related
    organisms have more dissimilar sequences. These
    differences can be quantified.
  • Given a set of gene sequences, it should be
    possible to reconstruct the evolutionary
    relationships among genes and among organisms.

18
Gap Penalties
  • Linear gap penalties vs. affine gap penalties
  • Affine penalty: p = o + (l × e)
  • o = gap opening penalty, e = gap extension
    penalty, l = gap length
  • Multiple nearby gaps are penalized
  • Protein-specific penalties (on by default)
  • Increase the probability of gaps associated with
    certain residues
  • Increase the chances of gaps in loop regions (> 5
    hydrophilic residues)
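
A small sketch of the two penalty schemes may help. It assumes the affine form p = o + l × e, with o the opening penalty, e the extension penalty and l the gap length; the function names and default values are hypothetical and are not the defaults of any particular program.

```python
def linear_gap_penalty(length, per_position=2.0):
    """Every gap position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, opening=10.0, extension=0.5):
    """Opening a gap is expensive; extending it is cheap: p = o + l * e."""
    if length <= 0:
        return 0.0
    return opening + length * extension


# With a linear penalty, one gap of 4 costs the same as two gaps of 2.
print(linear_gap_penalty(4), 2 * linear_gap_penalty(2))

# With an affine penalty, one long gap is cheaper than two short ones.
print(affine_gap_penalty(4), 2 * affine_gap_penalty(2))
```

The affine form reflects the idea that a single insertion or deletion event can add or remove several residues at once, so one long gap is more plausible than many scattered ones.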

20
Multiple Alignment Tips
  • Align pairs of sequences using an optimal method
  • Use progressive alignment programs such as
    ClustalX for multiple alignment
  • Carefully choose representative sequences to
    align
  • Choose sequences of comparable lengths
  • Progressive alignment programs may be combined
  • Review the alignment by eye and edit it
  • If you have a choice, align amino acid sequences
    rather than nucleotides

21
Genome Data
  • Recent genome sequencing projects have
    contributed a huge store of data to publicly
    accessible databases
  • These data make it quite easy to conduct
    phylogenetic experiments from any
    Internet-connected computer.

22
Calculating Trees from Sequences
  • First, collect a set of homologous sequences.
  • Make a multiple alignment (line them up)
  • Calculate differences (distance matrix)
  • Join pairs with the fewest differences
  • Then join groups
  • Repeat until all sequences are included in the
    tree
  • Draw a pretty looking tree.

23
What Sequences to Study?
  • Different sequences accumulate changes at
    different rates - choose a level of variation
    that is appropriate to the group of organisms
    being studied.
  • Proteins (or protein coding DNAs) are constrained
    by natural selection - better for very distant
    relationships
  • Some sequences are highly variable (rRNA spacer
    regions, immunoglobulin genes), while others are
    highly conserved (actin, rRNA coding regions)
  • Different regions within a single gene can evolve
    at different rates (conserved vs. variable
    domains)

24
Genes vs. Species
  • Relationships calculated from sequence data
    represent the relationships between genes; these
    are not necessarily the same as the
    relationships between species.
  • Your sequence data may not have the same
    phylogenetic history as the species from which
    they were isolated
  • Different genes evolve at different speeds, and
    there is always the possibility of horizontal
    gene transfer (hybridization, vector mediated DNA
    movement, or direct uptake of DNA).

25
Orthologs vs. Paralogs
  • When comparing gene sequences, it is important to
    distinguish between identical and merely similar
    genes in different organisms.
  • Orthologs are homologous genes in different
    species that arose by speciation; they typically
    have analogous functions.
  • Paralogs are homologous genes that are the result
    of a gene duplication.
  • A phylogeny that includes both orthologs and
    paralogs is likely to be incorrect.
  • Sometimes phylogenetic analysis is the best way
    to determine if a new gene is an ortholog or
    paralog to other known genes.

26
(Figure: an ancestral globin gene A is duplicated into A (myoglobin) and B (hemoglobin); after speciation, mouse and human each carry a copy of both genes, labeled A1, B1 and A2, B2.)
27
Distance Measurements
  • It is often useful to measure the genetic
    distance between two species, between two
    populations, or even between two individuals.
  • The entire concept of numerical taxonomy is based
    on computing phylogenies from a table of
    distances.
  • In the case of sequence data, pairwise distances
    must be calculated between all sequences that
    will be used to build the tree - thus creating a
    distance matrix.
  • Distance methods give a single measurement of the
    amount of evolutionary change between two
    sequences since divergence from a common
    ancestor.
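
As a rough sketch of how such a table is built, the code below counts mismatched positions for every pair of aligned sequences and stores the result symmetrically. The sequence names and sequences are hypothetical, and a gap opposite a residue is simply counted as one difference, which is a simplifying assumption.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(aligned):
    """Symmetric table of pairwise distances: matrix[p][q] == matrix[q][p]."""
    matrix = {name: {name: 0} for name in aligned}
    for p, q in combinations(aligned, 2):
        d = count_differences(aligned[p], aligned[q])
        matrix[p][q] = d
        matrix[q][p] = d
    return matrix


# Hypothetical aligned sequences, used only for illustration.
example = {
    "seq_A": "AATTCGCTT",
    "seq_B": "AATTCACTG",
    "seq_C": "AATTCACTC",
}
for name, row in distance_matrix(example).items():
    print(name, [row[other] for other in example])
```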

28
Computing a Distance Matrix
Reading sequences...
    gtr1_human   548 total, 548 read
    gtr2_human   548 total, 548 read
    gtr3_human   548 total, 548 read
    gtr4_human   548 total, 548 read
    gtr5_human   548 total, 548 read

Computing distances using Kimura method...
    1 x 2    48.61
    1 x 3    45.50
    1 x 4    65.74
    1 x 5   107.70
    2 x 3    61.53
    2 x 4    74.57
    2 x 5   113.82
    3 x 4    68.93
    3 x 5   104.43
    4 x 5   110.86

Matrix 1
        1       2       3       4       5
_________________________________________
1    0.00   48.61   45.50   65.74  107.70
2            0.00   61.53   74.57  113.82
3                    0.00   68.93  104.43
4                            0.00  110.86
5                                    0.00
29
DNA Distances
  • Distances between pairs of DNA sequences are
    relatively simple to compute as the sum of all
    base pair differences between the two sequences.
  • This type of algorithm can only work for pairs of
    sequences that are similar enough to be aligned
  • Generally all base changes are considered equal
  • Insertion/deletions are generally given a larger
    weight than replacements (gap penalties).
  • It is also possible to correct for multiple
    substitutions at a single site, which is common
    in distant relationships and for rapidly evolving
    sites.
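
A minimal sketch of both ideas: the proportion of differing sites (often called the p-distance) and a correction for multiple substitutions at one site. The Jukes-Cantor formula is used as the example correction because it is the simplest standard one; the slides do not say which correction they have in mind, so that choice is an assumption, as are the function names. Gapped positions are skipped rather than weighted.

```python
import math


def p_distance(a, b):
    """Fraction of compared (ungapped) aligned sites that differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed proportion, and the gap between the two widens as sequences diverge, which is why the correction matters most for distant relationships.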

31
Amino Acid Distances
  • Distances between amino acid sequences are a bit
    more complicated to calculate.
  • Some amino acids can replace one another with
    relatively little effect on the structure and
    function of the final protein while other
    replacements can be functionally devastating.
  • From the standpoint of the genetic code, some
    amino acid changes can be made by a single DNA
    mutation while others require two or even three
    changes in the DNA sequence.
  • In practice, what has been done is to calculate
    tables of frequencies of all amino acid
    replacements within families of related protein
    sequences in the databanks, e.g. the PAM and
    BLOSUM matrices

32
The PAM 250 scoring matrix
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    2
R   -2   6
N    0   0   2
D    0  -1   2   4
C   -2  -4   4  -5   4
Q    0   1   1   2  -5   4
E    0  -1   1   3  -5   2   4
G    1  -3   0   1  -3  -1   0   5
H   -1   2   2   1  -3   3   1  -2   6
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F   -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P    1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.
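
To show how such a matrix is used, the sketch below scores two aligned peptides by summing the matrix entry for each aligned pair. Only a handful of entries (for A, R, I, L and K) are copied from the table above to keep the example short; the dictionary layout, function names and peptides are assumptions for illustration.

```python
# A few PAM 250 entries for the residues A, R, I, L and K.
# Each unordered pair is stored once; lookup tries both orders.
PAM250_SUBSET = {
    ("A", "A"): 2, ("R", "R"): 6, ("I", "I"): 5, ("L", "L"): 6, ("K", "K"): 5,
    ("A", "R"): -2, ("A", "I"): -1, ("A", "L"): -2, ("A", "K"): -1,
    ("R", "I"): -2, ("R", "L"): -3, ("R", "K"): 3,
    ("I", "L"): 2, ("I", "K"): -2, ("L", "K"): -3,
}


def pair_score(x, y):
    """Score for aligning residue x with residue y."""
    if (x, y) in PAM250_SUBSET:
        return PAM250_SUBSET[(x, y)]
    return PAM250_SUBSET[(y, x)]  # raises KeyError for residues not listed


def score_alignment(a, b):
    """Sum of pair scores over an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


# Conservative swaps (L/I, K/R) still score positively.
print(score_alignment("ALKR", "ALKR"), score_alignment("ALKR", "AIRK"))
```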
33
Clustering Algorithms
  • Clustering algorithms use distances to calculate
    phylogenetic trees. These trees are based solely
    on the relative numbers of similarities and
    differences between a set of sequences.
  • Start with a matrix of pairwise distances
  • Cluster methods construct a tree by linking the
    least distant pairs of taxa, followed by
    successively more distant taxa.

34
UPGMA
  • The simplest of the distance methods is the UPGMA
    (Unweighted Pair Group Method using Arithmetic
    averages)
  • The PHYLIP programs DNADIST and PROTDIST
    calculate absolute pairwise distances between a
    group of sequences. Then the GCG program GROWTREE
    uses UPGMA to build a tree.
  • Many multiple alignment programs such as CLUSTAL
    use a variant of UPGMA to create a dendrogram of
    DNA sequences which is then used to guide the
    multiple alignment algorithm.
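
A compact sketch of UPGMA, assuming distances are supplied as a dictionary keyed by unordered pairs. It is not the GROWTREE or CLUSTAL implementation; the data layout and names are hypothetical. The example distances are the ten Kimura distances from the "Computing a Distance Matrix" slide, and branch lengths are omitted, so only the branching order is returned.

```python
def upgma(distances):
    """Cluster taxa by UPGMA.

    distances maps frozenset({x, y}) to the distance between x and y.
    Returns the branching order as nested tuples.
    """
    d = dict(distances)
    size = {taxon: 1 for pair in d for taxon in pair}  # leaves per cluster

    while len(size) > 1:
        # Join the two clusters separated by the smallest distance.
        closest = min(d, key=d.get)
        # Sorting only makes the printed order reproducible.
        a, b = sorted(closest, key=repr)
        merged = (a, b)

        # Distance from the new cluster to each other cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        for other in size:
            if other in (a, b):
                continue
            da = d.pop(frozenset((a, other)))
            db = d.pop(frozenset((b, other)))
            total = size[a] + size[b]
            d[frozenset((merged, other))] = (size[a] * da + size[b] * db) / total

        del d[closest]
        size[merged] = size.pop(a) + size.pop(b)

    return next(iter(size))


names = ["gtr1_human", "gtr2_human", "gtr3_human", "gtr4_human", "gtr5_human"]
values = {
    (1, 2): 48.61, (1, 3): 45.50, (1, 4): 65.74, (1, 5): 107.70,
    (2, 3): 61.53, (2, 4): 74.57, (2, 5): 113.82,
    (3, 4): 68.93, (3, 5): 104.43, (4, 5): 110.86,
}
example = {
    frozenset((names[i - 1], names[j - 1])): v for (i, j), v in values.items()
}
print(upgma(example))
```

On these distances gtr1 and gtr3 are joined first, then gtr2, gtr4 and finally gtr5, which matches what one would expect from reading the matrix by eye.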

35
Neighbor Joining
  • The Neighbor Joining method is the most popular
    way to build trees from distance measurements
  • (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
  • Neighbor Joining corrects the UPGMA method for
    its (frequently invalid) assumption that the same
    rate of evolution applies to each branch of a
    tree.
  • The distance matrix is adjusted for differences
    in the rate of evolution of each taxon (branch).
  • Neighbor Joining has given the best results in
    simulation studies and it is the most
    computationally efficient of the distance
    algorithms (N. Saitou and T. Imanishi, Mol.
    Biol. Evol. 6:514 (1989))
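
The rate adjustment can be illustrated with the pair-selection step of Neighbor Joining. Instead of joining the pair with the smallest raw distance, the method minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from taxon i to all others. The sketch below covers only this one step, not the full algorithm, and the five-taxon distances are hypothetical values chosen so that the two criteria disagree.

```python
from itertools import combinations


def nj_pick_pair(taxa, d):
    """Return the pair of taxa that Neighbor Joining would join first.

    d maps frozenset({i, j}) to the distance between taxa i and j.
    """
    n = len(taxa)
    # r[i]: total distance from taxon i to every other taxon.
    r = {i: sum(d[frozenset((i, k))] for k in taxa if k != i) for i in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    return min(combinations(taxa, 2), key=q)


# Hypothetical distances for five taxa.
taxa = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(pair): value for pair, value in raw.items()}

print("smallest raw distance:", sorted(min(d, key=d.get)))
print("pair chosen by the Q criterion:", nj_pick_pair(taxa, d))
```

Here d and e have the smallest raw distance, but a and b are both far from everything else, so the Q criterion joins them first. A full implementation would then replace the joined pair with a new node, recompute distances to it and repeat.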

38
Other Methods
  • Distance methods do not make use of all
    information in an alignment
  • differences between each pair of sequences are
    summarized as a single number
  • each character has information
  • Parsimony allows the use of all known
    evolutionary information in building a tree
  • Maximum Likelihood is even more computationally
    intense (creates a tree for each character)
  • These methods are beyond the scope of this course

39
Computer Software for Phylogenetics
  • Due to the lack of consensus among evolutionary
    biologists about basic principles for
    phylogenetic analysis, it is not surprising that
    there is a wide array of computer software
    available for this purpose.
  • PHYLIP is a free package that includes 30
    programs that compute various phylogenetic
    algorithms on different kinds of data.
  • PAUP (Phylogenetic Analysis Using Parsimony)
  • CLUSTALX is a multiple alignment program that
    includes the ability to create trees based on
    Neighbor Joining.
  • MacClade is a well designed cladistics program
    that allows the user to explore possible trees
    for a data set.
  • EMBOSS: emma for multiple alignment, plus many
    dozens of phylogeny programs (includes the
    entire PHYLIP package: fprotdist, fdrawtree,
    etc.)

40
Phylogenetics on the Web
  • There are several phylogenetics servers available
    on the Web
  • some of these will change or disappear in the
    near future
  • these programs can be very slow so keep your
    sample sets small
  • The Institut Pasteur, Paris has a PHYLIP server
    at
  • http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
  • Louxin Zhang at the Natl. University of
    Singapore has a WebPhylip server
  • http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
  • The Belozersky Institute at Moscow State
    University has their own "GeneBee" phylogenetics
    server
  • http://www.genebee.msu.su/services/phtree_reduced.html
  • The Phylodendron website is a tree drawing
    program with a nice user interface and a lot of
    options; however, the output is limited to GIFs
    at 72 dpi - not publication quality.
  • http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

41
Other Web Resources
  • Joseph Felsenstein (author of PHYLIP) maintains a
    comprehensive list of phylogeny programs at
  • http://evolution.genetics.washington.edu/phylip/software.html
  • Introduction to Phylogenetic Systematics,
  • Peter H. Weston & Michael D. Crisp, Society of
    Australian Systematic Biologists
  • http://www.science.uts.edu.au/sasb/WestonCrisp.html
  • University of California, Berkeley Museum of
    Paleontology (UCMP)
  • http://www.ucmp.berkeley.edu/clad/clad4.html

42
Software Hazards
  • There are a variety of programs for Macs and PCs,
    but you can easily tie up your machine for many
    hours with even moderately sized data sets (e.g.
    fifty 300 bp sequences)
  • Moving sequences into different programs can be a
    major hassle due to incompatible file formats.
  • Just because a program can perform a given
    computation on a set of data does not mean that
    that is the appropriate algorithm for that type
    of data.

43
Conclusions
  • Given the huge variety of methods for computing
    phylogenies, how can the biologist determine what
    is the best method for analyzing a given data
    set?
  • Published papers that address phylogenetic issues
    generally make use of several different
    algorithms and data sets in order to support
    their conclusions.
  • Using several alternate methods can give an
    indication of the robustness of a given
    conclusion.
  • Bootstrapping methods can show if your results
    depend on just a small part of the data, or are
    generally supported
  • Analyze several different genes!!
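
The bootstrapping idea can be sketched as resampling alignment columns with replacement. Each replicate alignment would then be passed through the same tree-building method, and the fraction of replicates that recover a given grouping indicates how well the data support it. The function name and the example sequences are hypothetical, and only the resampling step is shown.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Return one bootstrap replicate of an alignment.

    Columns are drawn with replacement, so some columns appear several
    times and others not at all; every sequence uses the same draw.
    """
    length = len(next(iter(aligned.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns) for name, seq in aligned.items()
    }


# Hypothetical aligned sequences, used only for illustration.
alignment = {
    "seq_A": "AATTCGCTT",
    "seq_B": "AATTCACTG",
    "seq_C": "AATTCACTC",
}
rng = random.Random(0)  # fixed seed so the replicates are repeatable
for _ in range(3):
    print(bootstrap_alignment(alignment, rng))
```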

44
Biodiversity and Conservation
  • Phylogenetics also overlaps significantly with
    other branches of evolutionary biology.
  • Check out the Tree of Life project for an
    introduction to phylogenetics and its
    relationship to biodiversity.
  • http://phylogeny.arizona.edu/tree/phylogeny.html
  • Measurements of DNA sequence differences are now
    being used to implement plans for the
    conservation of genetic resources.