Title: Stuart M. Brown
1Molecular Phylogenetics: Computing Evolution
presented by
- Stuart M. Brown
- New York University School of Medicine
2Topics
- A. Molecular Evolution
- B. Calculating Distances
- C. Clustering Algorithms
- D. Computer Software
Portions of this lecture have been inspired by
web pages created by Dr. Brian Golding,
Department of Biology, McMaster University,
Hamilton, Ontario, Canada, L8S 4K1
3Evolution
- The theory of evolution is the foundation upon
which all of modern biology is built.
- From anatomy to behavior to genomics, the
scientific method requires an appreciation of
changes in organisms over time.
- It is impossible to evaluate relationships among
gene sequences without taking into consideration
the way these sequences have been modified over
time.
4Relationships
- Similarity searches and multiple alignments
of sequences naturally lead to the question:
- How are these sequences related?
- and, more generally:
- How are the organisms from which these
sequences come related?
5Taxonomy
- The study of the relationships between groups of
organisms is called taxonomy, an ancient and
venerable branch of classical biology.
- Taxonomy is the art of classifying things into
groups, a quintessential human behavior. It was
established as a mainstream scientific field by
Carolus Linnaeus (1707-1778).
7Phylogenetics
- Evolutionary theory states that groups of
similar organisms are descended from a common
ancestor.
- Phylogenetic systematics (cladistics) is a method
of taxonomic classification that groups organisms
according to their evolutionary history.
- It was developed by Willi Hennig, a German
entomologist, in 1950.
8Molecular Evolution
- Phylogenetics often makes use of numerical data
(numerical taxonomy), which can be scores for
various character states, such as the size of a
visible structure, or DNA sequences.
- Similarities and differences between organisms
can be coded as a set of characters, each with
two or more alternative character states.
- In an alignment of DNA sequences, each position
is a separate character, with four possible
character states, the four nucleotides.
9DNA is a good tool for taxonomy
- DNA sequences have many advantages over
classical types of taxonomic characters:
- Character states can be scored unambiguously
- Large numbers of characters can be scored for
each individual
- Information on both the extent and the nature of
divergence between sequences is available
(nucleotide substitutions, insertions/deletions,
or genome rearrangements)
10
A aat tcg ctt cta gga atc tgc cta atc ctg
B ... ..a ..g ..a .t. ... ... t.. ... ..a
C ... ..a ..c ..c ... ..t ... ... ... t.a
D ... ..a ..a ..g ..g ..t ... t.t ..t t..
Each nucleotide difference is a character
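As a minimal sketch of how the alignment above can be scored, the following Python snippet expands the dot notation (assuming a dot means "same base as row A at this position") and counts the characters at which each sequence differs from A. The helper names `expand` and `count_differences` are hypothetical, chosen only for this illustration.

```python
# Alignment from the slide: row A is the reference; in rows B-D a dot is
# assumed to mean "same base as A at this position".
rows = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(reference, row):
    """Replace each dot in `row` with the reference base at that position."""
    return "".join(ref if base == "." else base for ref, base in zip(reference, row))


def count_differences(seq1, seq2):
    """Count aligned positions (characters) whose states differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


# Spaces only separate codons for readability, so strip them first.
reference = rows["A"].replace(" ", "")
aligned = {name: expand(reference, row.replace(" ", "")) for name, row in rows.items()}

for name in ("B", "C", "D"):
    print(f"A vs {name}: {count_differences(aligned['A'], aligned[name])} differences")
```

Each position of the expanded alignment is one character with four possible states, so the count is simply the number of columns where the two rows disagree.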
11Sequences Reflect Relationships
- After working with sequences for a while, one
develops an intuitive understanding that for a
given gene, closely related organisms have
similar sequences and more distantly related
organisms have more dissimilar sequences. These
differences can be quantified.
- Given a set of gene sequences, it should be
possible to reconstruct the evolutionary
relationships among genes and among organisms.
13Protein Evolution
- Protein sequences can be used to study more
distant evolutionary relationships.
- Related proteins have conservative substitutions
(amino acids replaced by others with similar
biochemical properties).
14What Sequences to Study?
- Different sequences accumulate changes at
different rates, so choose a level of variation that
is appropriate to the group of organisms being
studied.
- Proteins (or protein-coding DNAs) are constrained
by natural selection.
- Some sequences are highly variable (rRNA spacer
regions, immunoglobulin genes), while others are
highly conserved (actin, rRNA coding regions).
- Different regions within a single gene can evolve
at different rates (conserved vs. variable
domains).
15Orthologs vs. Paralogs
- When comparing gene sequences, it is important to
distinguish between the same gene in different
organisms and genes that are merely similar.
- Orthologs are homologous genes in different
species that diverged through speciation and
typically have analogous functions.
- Paralogs are similar genes that are the result of
a gene duplication.
- A phylogeny that includes both orthologs and
paralogs is likely to be incorrect.
- Sometimes phylogenetic analysis is the best way
to determine whether a new gene is an ortholog or
a paralog of other known genes.
16Biodiversity and Conservation
- Phylogenetics also overlaps significantly with
other branches of evolutionary biology.
- Check out the Tree of Life project for an
introduction to phylogenetics and its
relationship to biodiversity.
- http://phylogeny.arizona.edu/tree/phylogeny.html
- Measurements of DNA sequence differences are now
being used to implement plans for the
conservation of genetic resources.
17Genes vs. Species
- Relationships calculated from sequence data
represent the relationships between genes; these
are not necessarily the same as relationships
between species.
- Your sequence data may not have the same
phylogenetic history as the species from which
they were isolated.
- Different genes evolve at different speeds, and
there is always the possibility of horizontal
gene transfer (hybridization, vector-mediated DNA
movement, or direct uptake of DNA).
18Distance Measurements
- It is often useful to measure the genetic
distance between two species, between two
populations, or even between two individuals.
- The entire concept of numerical taxonomy is based
on computing phylogenies from a table of
distances.
- In the case of sequence data, pairwise distances
must be calculated between all sequences that
will be used to build the tree, thus creating a
distance matrix.
- Distance methods give a single measurement of the
amount of evolutionary change between two
sequences since divergence from a common
ancestor.
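A minimal sketch of building such a distance matrix, assuming the simplest possible measure: the uncorrected proportion of differing sites (often called the p-distance). The four sequences are rows A-D of the small alignment shown earlier in the lecture, with the dots expanded; the function names are hypothetical.

```python
from itertools import combinations

# Rows A-D of the lecture's example alignment with the dots expanded to the
# bases of row A (spaces separate codons only for readability).
alignment = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "aat tca ctg cta gta atc tgc tta atc cta",
    "C": "aat tca ctc ctc gga att tgc cta atc tta",
    "D": "aat tca cta ctg ggg att tgc ttt att ttg",
}
sequences = {name: row.replace(" ", "") for name, row in alignment.items()}


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def distance_matrix(seqs):
    """Return a symmetric nested dict: matrix[a][b] is the distance from a to b."""
    names = sorted(seqs)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


matrix = distance_matrix(sequences)
for a in matrix:
    print(a, " ".join(f"{matrix[a][b]:.3f}" for b in matrix))
```

Every pair of sequences is compared exactly once and the value is mirrored, so n sequences require n(n-1)/2 comparisons.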
19Computing a Distance Matrix
Reading sequences...
  gtr1_human   548 total, 548 read
  gtr2_human   548 total, 548 read
  gtr3_human   548 total, 548 read
  gtr4_human   548 total, 548 read
  gtr5_human   548 total, 548 read
Computing distances using Kimura method...
  1 x 2    48.61
  1 x 3    45.50
  1 x 4    65.74
  1 x 5   107.70
  2 x 3    61.53
  2 x 4    74.57
  2 x 5   113.82
  3 x 4    68.93
  3 x 5   104.43
  4 x 5   110.86

Matrix 1
           1        2        3        4        5
  ____________________________________________________________ ..
  1     0.00    48.61    45.50    65.74   107.70
  2              0.00    61.53    74.57   113.82
  3                       0.00    68.93   104.43
  4                                0.00   110.86
  5                                         0.00
20DNA Distances
- Distances between pairs of DNA sequences are
relatively simple to compute as the sum of all
base pair differences between the two sequences.
- This type of algorithm can only work for pairs of
sequences that are similar enough to be aligned.
- Generally all base changes are considered equal.
- Insertions/deletions are generally given a larger
weight than replacements (gap penalties).
- It is also possible to correct for multiple
substitutions at a single site, which are common
in distant relationships and at rapidly evolving
sites.
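The slide does not say which correction is meant. As one illustration, the sketch below uses the Jukes-Cantor model, a standard one-parameter correction that treats all base changes as equally likely; it is an assumption swapped in here, not necessarily the method used in the lecture. It converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def jukes_cantor(p):
    """Estimate substitutions per site from the observed proportion p of
    differing sites, assuming the Jukes-Cantor model (all changes equally
    likely). The formula is undefined once p reaches 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction is negligible for similar sequences and grows with divergence.
for p in (0.01, 0.10, 0.30, 0.50):
    print(f"observed p = {p:.2f}  corrected d = {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as the observed one, because some sites have changed more than once and those extra changes are invisible in a direct count.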
22Clustering Algorithms
- Clustering algorithms use distances to calculate
phylogenetic trees. These trees are based solely
on the relative numbers of similarities and
differences between a set of sequences.
- Start with a list of related genes
- Make a multiple alignment
- Compute a matrix of pairwise distances
- Clustering methods construct a tree by linking
the least distant pairs, followed by successively
more distant pairs.
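The linking procedure can be sketched as follows. This assumes average linkage (UPGMA-style) as the rule for the distance between a merged cluster and the remaining ones; the slide does not name a specific rule, so that choice and the function name `average_linkage_tree` are assumptions. The input is the Kimura distance table for the five gtr sequences shown in the lecture, and the output is a nested-parentheses (Newick-like) grouping without branch lengths.

```python
from itertools import combinations


def average_linkage_tree(distances):
    """Repeatedly join the two least distant clusters.

    distances: dict mapping (name1, name2) tuples to pairwise distances.
    Returns the final grouping as a nested-parentheses string.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {name: 1 for pair in distances for name in pair}
    while len(sizes) > 1:
        closest = min(d, key=d.get)          # least distant pair of clusters
        a, b = sorted(closest)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:                  # every cluster except a and b
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            # Size-weighted mean keeps the value equal to the average over
            # all original sequence pairs between the two clusters.
            d[frozenset((merged, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        del d[closest]
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


names = ["gtr1_human", "gtr2_human", "gtr3_human", "gtr4_human", "gtr5_human"]
# Pairwise values in the order 1x2, 1x3, 1x4, 1x5, 2x3, 2x4, 2x5, 3x4, 3x5, 4x5.
values = [48.61, 45.50, 65.74, 107.70, 61.53, 74.57, 113.82, 68.93, 104.43, 110.86]
distances = dict(zip(combinations(names, 2), values))

tree = average_linkage_tree(distances)
print(tree)
```

With this table the first join is gtr1_human with gtr3_human, the pair with the smallest distance (45.50).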
23Neighbor Joining
- The Neighbor Joining method is the most popular
way to build trees from distance measurements
- (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
- Neighbor Joining corrects the distance method
for its (frequently invalid) assumption that the
same rate of evolution applies to each branch of
a tree.
- The distance matrix is adjusted for differences
in the rate of evolution of each taxon (branch).
- Neighbor Joining has given the best results in
simulation studies and it is the most
computationally efficient of the distance
algorithms (N. Saitou and T. Imanishi, Mol.
Biol. Evol. 6:514 (1989)).
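A minimal sketch of the adjustment, covering only the pair-selection step of Neighbor Joining rather than the full tree-building loop. For n taxa, each distance d(i,j) is converted to Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from taxon i to all others, and the pair with the smallest Q is joined. The function name `nj_pick_pair` is hypothetical; the data are the Kimura distances for the five gtr sequences shown in the lecture.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair Neighbor Joining would join first, plus the branch
    lengths from each member of the pair to the new internal node.

    dist maps frozenset({name1, name2}) to the pairwise distance.
    """
    n = len(names)
    if n < 3:
        raise ValueError("the selection criterion needs at least three taxa")

    def d(a, b):
        return dist[frozenset((a, b))]

    # r(i): total distance from each taxon to all the others.
    totals = {a: sum(d(a, b) for b in names if b != a) for a in names}
    # Rate-adjusted criterion: taxa on long branches have large totals, which
    # lowers Q for pairs involving them.
    q = {
        (a, b): (n - 2) * d(a, b) - totals[a] - totals[b]
        for a, b in combinations(names, 2)
    }
    a, b = min(q, key=q.get)
    length_a = d(a, b) / 2 + (totals[a] - totals[b]) / (2 * (n - 2))
    length_b = d(a, b) - length_a
    return (a, b), (length_a, length_b)


names = ["gtr1_human", "gtr2_human", "gtr3_human", "gtr4_human", "gtr5_human"]
# Pairwise values in the order 1x2, 1x3, 1x4, 1x5, 2x3, 2x4, 2x5, 3x4, 3x5, 4x5.
values = [48.61, 45.50, 65.74, 107.70, 61.53, 74.57, 113.82, 68.93, 104.43, 110.86]
dist = {frozenset(pair): v for pair, v in zip(combinations(names, 2), values)}

pair, lengths = nj_pick_pair(names, dist)
print(pair, [round(x, 2) for x in lengths])
```

On this table the criterion selects gtr4_human and gtr5_human, even though the smallest raw distance is 1 x 3 (45.50). The two branch lengths are allowed to differ, which is how the method avoids assuming the same rate of evolution on every branch.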
24Computer Software for Phylogenetics
- Due to the lack of consensus among evolutionary
biologists about basic principles for
phylogenetic analysis, it is not surprising that
there is a wide array of computer software
available for this purpose.
- PHYLIP is a free package that includes 30
programs that apply various phylogenetic
algorithms to different kinds of data.
- The GCG package (available at most research
institutions) contains a full set of programs for
phylogenetic analysis, including simple
distance-based clustering and the complex
cladistic analysis program PAUP (Phylogenetic
Analysis Using Parsimony).
- CLUSTALX is a multiple alignment program that
includes the ability to create trees based on
Neighbor Joining.
25Phylogenetics on the Web
- There are several phylogenetics servers available
on the Web
- some of these will change or disappear in the
near future
- these programs can be very slow so keep your
sample sets small
- The Institut Pasteur, Paris has a PHYLIP server
at
- http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
- Louxin Zhang at the Natl. University of
Singapore has a WebPhylip server
- http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
- The Belozersky Institute at Moscow State
University has their own "GeneBee" phylogenetics
server
- http://www.genebee.msu.su/services/phtree_reduced.html
- The Phylodendron website is a tree drawing
program with a nice user interface and a lot of
options; however, the output is limited to GIFs
at 72 dpi, which is not publication quality.
- http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
26Other Web Resources
- Joseph Felsenstein (author of PHYLIP) maintains a
comprehensive list of Phylogeny programs at
- http://evolution.genetics.washington.edu/phylip/software.html
- Introduction to Phylogenetic Systematics,
- Peter H. Weston and Michael D. Crisp, Society of
Australian Systematic Biologists
- http://www.science.uts.edu.au/sasb/WestonCrisp.html
- University of California, Berkeley Museum of
Paleontology (UCMP)
- http://www.ucmp.berkeley.edu/clad/clad4.html
27Software Hazards
- There are a variety of programs for Macs and PCs,
but you can easily tie up your machine for many
hours with even moderately sized data sets (e.g.,
ten 300 bp sequences).
- Moving sequences into different programs can be a
major hassle due to incompatible file formats.
- Just because a program can perform a given
computation on a set of data does not mean that
it is the appropriate algorithm for that type
of data.