Title: Stuart M. Brown
1Molecular Phylogenetics: Computing Evolution
presented by
- Stuart M. Brown
- New York University School of Medicine
2Evolution
- The theory of evolution is the foundation upon
which all of modern biology is built.
- From anatomy to behavior to genomics, the
scientific method requires an appreciation of
changes in organisms over time.
- By looking at gene sequences, one can see
evolution in action.
3Taxonomy
- The study of the relationships between groups of
organisms is called taxonomy, an ancient and
venerable branch of classical biology.
- Taxonomy is the art of classifying things into
groups (a quintessential human behavior),
established as a mainstream scientific field by
Carolus Linnaeus (1707-1778).
5Phylogenetics
- Evolutionary theory states that groups of
similar organisms are descended from a common
ancestor.
- Phylogenetic systematics (cladistics) is a method
of taxonomic classification of organisms based on
their evolutionary history.
- Cladistics was developed by Willi Hennig, a
German entomologist, in 1950.
6DNA is a good tool for taxonomy
- DNA sequences have many advantages over
morphological taxonomic characters:
- Character states can be scored unambiguously
- Large numbers of characters can be scored for
each individual
7
- New alleles are created by mutations in the DNA
sequence
8Definition
- Homology: related by descent
- Homologous sequence positions
(Figure: sequences descended from an unknown ancestral sequence "?", aligned so that homologous positions line up)
ATTGCGC
AT-CCGC
9Dynamic Programming
- Dynamic programming is a very general programming
technique.
- It is applicable when a large search space can be
structured into a succession of stages, such
that:
  - the initial stage contains trivial solutions to sub-problems
  - each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
  - the final stage contains the overall solution
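The three stages listed above can be sketched with a small global alignment scorer in the Needleman-Wunsch style. This is only an illustration: the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    The scoring values are illustrative assumptions, not slide values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("ATTGCGC", "ATCCGC"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.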
11Global vs. Local Alignments
- Global alignment algorithms start at the
beginning of two sequences and add gaps to each
until the end of one is reached.
- Local alignment algorithms find the region (or
regions) of highest similarity between two
sequences and build the alignment outward from
there.
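As a rough sketch of the local case, the function below scores the best-matching region of two sequences in the Smith-Waterman style. The technique name, the function name, and the scoring values (match +2, mismatch -1, gap -2) are assumptions for illustration. The key difference from a global scorer is that a cell is never allowed to drop below zero, so an alignment can start and end anywhere.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the score of the best local alignment between a and b.

    Scoring values are illustrative assumptions, not slide values.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # extend along the diagonal
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core GCGC is found even though the flanking regions differ.
print(local_alignment_score("TTGCGCTT", "AAGCGCAA"))
```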
13Phylogenetics starts with Multiple Alignment
- Can only align homologous sequences
- Pairwise alignment can be exact
  - exhaustive computation by dynamic programming
- Multiple alignment is always approximate (heuristic)
  - the problem increases exponentially with the number of sequences to be aligned
  - progressive pairwise method
- Software of choice is CLUSTAL (emma in EMBOSS)
14Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
15
A aat tcg ctt cta gga atc tgc cta atc ctg
B ... ..a ..g ..a .t. ... ... t.. ... ..a
C ... ..a ..c ..c ... ..t ... ... ... t.a
D ... ..a ..a ..g ..g ..t ... t.t ..t t..
Each nucleotide difference is a character
16Sequences Reflect Relationships
- After working with sequences for a while, one
develops an intuitive understanding that for a
given gene, closely related organisms have
similar sequences and more distantly related
organisms have more dissimilar sequences. These
differences can be quantified.
- Given a set of gene sequences, it should be
possible to reconstruct the evolutionary
relationships among genes and among organisms.
18Gap Penalties
- Linear gap penalties vs. affine gap penalties
- p = o + (l × e), where o is the gap opening penalty, e is the gap extension penalty, and l is the gap length
- Gap opening / gap extension
- Multiple nearby gaps are penalized
- Protein-specific penalties (on by default)
  - Increase the probability of gaps associated with certain residues
  - Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
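A small sketch of the two penalty schemes, reading the slide's formula as p = o + l × e (gap opening o, gap extension e, gap length l). That reading is an assumption, as are the function names and the numbers in the usage lines; some programs charge the extension only for residues after the first.

```python
def linear_gap_penalty(length, per_residue):
    """Linear scheme: every gapped position costs the same."""
    return length * per_residue


def affine_gap_penalty(length, opening, extension):
    """Affine scheme, assuming p = o + l * e as read from the slide."""
    if length <= 0:
        return 0.0
    return opening + length * extension


# With hypothetical values o = 10 and e = 0.5, one gap of length 4 is far
# cheaper than four separate gaps of length 1.
print(affine_gap_penalty(4, 10, 0.5))
print(4 * affine_gap_penalty(1, 10, 0.5))
```

Under the affine scheme, the high opening cost discourages scattering many short gaps, while the low extension cost tolerates a single long insertion or deletion.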
20Multiple Alignment Tips
- Align pairs of sequences using an optimal method
- Use progressive alignment programs such as ClustalX
for multiple alignment
- Choose representative sequences to align
carefully
- Choose sequences of comparable lengths
- Progressive alignment programs may be combined
- Review alignment by eye and edit
- If you have a choice, align amino acid sequences
rather than nucleotides
21Genome Data
- Recent genome sequencing projects have
contributed a huge store of data to publicly
accessible databases.
- These data make it quite easy to conduct
phylogenetic experiments from any Internet-connected
computer.
22Calculating Trees from Sequences
- First, collect a set of homologous sequences.
- Make a multiple alignment (line them up)
- Calculate differences (distance matrix)
- Join pairs with the fewest differences
- Then join groups
- Repeat until all sequences are included in the
tree
- Draw a pretty looking tree.
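The "calculate differences" and "join pairs with the fewest differences" steps can be sketched as follows. The four aligned sequences and all function names are invented for this example; a real analysis would start from an actual multiple alignment.

```python
from itertools import combinations


def count_differences(seq1, seq2):
    """Count mismatched columns between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(seq1, seq2) if x != y)


def difference_matrix(aligned):
    """Map each pair of sequence names to its number of differences."""
    return {
        (m, n): count_differences(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


def closest_pair(matrix):
    """Return the pair of names with the fewest differences."""
    return min(matrix, key=matrix.get)


# Hypothetical aligned sequences (invented for illustration only).
aligned = {
    "A": "AATTCGCTTC",
    "B": "AATTCACTTC",
    "C": "AACTCACTGC",
    "D": "GACTCACTGA",
}
matrix = difference_matrix(aligned)
for pair, diffs in matrix.items():
    print(pair, diffs)
print("join first:", closest_pair(matrix))
```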
23What Sequences to Study?
- Different sequences accumulate changes at
different rates; choose a level of variation that
is appropriate to the group of organisms being
studied.
- Proteins (or protein-coding DNAs) are constrained
by natural selection, so they are better for very
distant relationships.
- Some sequences are highly variable (rRNA spacer
regions, immunoglobulin genes), while others are
highly conserved (actin, rRNA coding regions).
- Different regions within a single gene can evolve
at different rates (conserved vs. variable
domains).
24Genes vs. Species
- Relationships calculated from sequence data
represent the relationships between genes; these
are not necessarily the same as relationships
between species.
- Your sequence data may not have the same
phylogenetic history as the species from which
they were isolated.
- Different genes evolve at different speeds, and
there is always the possibility of horizontal
gene transfer (hybridization, vector-mediated DNA
movement, or direct uptake of DNA).
25Orthologs vs. Paralogs
- When comparing gene sequences, it is important to
distinguish between identical vs. merely similar
genes in different organisms.
- Orthologs are homologous genes in different
species with analogous functions.
- Paralogs are similar genes that are the result of
a gene duplication.
- A phylogeny that includes both orthologs and
paralogs is likely to be incorrect.
- Sometimes phylogenetic analysis is the best way
to determine if a new gene is an ortholog or
paralog to other known genes.
26Ancestral gene
(Figure: an ancestral gene A (globin) undergoes duplication into A (myoglobin) and B (hemoglobin); speciation then gives the copies A1, B1, A2, and B2 in mouse and human.)
27Distance Measurements
- It is often useful to measure the genetic
distance between two species, between two
populations, or even between two individuals.
- The entire concept of numerical taxonomy is based
on computing phylogenies from a table of
distances.
- In the case of sequence data, pairwise distances
must be calculated between all sequences that
will be used to build the tree, thus creating a
distance matrix.
- Distance methods give a single measurement of the
amount of evolutionary change between two
sequences since divergence from a common
ancestor.
28Computing a Distance Matrix
Reading sequences...
  gtr1_human   548 total, 548 read
  gtr2_human   548 total, 548 read
  gtr3_human   548 total, 548 read
  gtr4_human   548 total, 548 read
  gtr5_human   548 total, 548 read
Computing distances using Kimura method...
  1 x 2   48.61
  1 x 3   45.50
  1 x 4   65.74
  1 x 5  107.70
  2 x 3   61.53
  2 x 4   74.57
  2 x 5  113.82
  3 x 4   68.93
  3 x 5  104.43
  4 x 5  110.86

Matrix 1
          1       2       3       4       5
  _________________________________________
  1    0.00   48.61   45.50   65.74  107.70
  2            0.00   61.53   74.57  113.82
  3                    0.00   68.93  104.43
  4                            0.00  110.86
  5                                    0.00
29DNA Distances
- Distances between pairs of DNA sequences are
relatively simple to compute as the sum of all
base pair differences between the two sequences.
  - This type of algorithm can only work for pairs of sequences that are similar enough to be aligned.
  - Generally all base changes are considered equal.
  - Insertions/deletions are generally given a larger weight than replacements (gap penalties).
  - It is also possible to correct for multiple substitutions at a single site, which are common in distant relationships and for rapidly evolving sites.
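One common way to make the multiple-substitution correction concrete is the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The slides do not name this model (the program output on the earlier slide uses a Kimura method), so treat it as a swapped-in illustration. The function names are hypothetical, and for simplicity this sketch skips gapped columns instead of weighting them.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences.

    Assumption for this sketch: columns with a gap in either sequence
    are ignored rather than given a gap weight.
    """
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAT")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed one, and the gap between them widens as sequences diverge, which is the effect the last bullet describes.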
31Amino Acid Distances
- Distances between amino acid sequences are a bit
more complicated to calculate.
- Some amino acids can replace one another with
relatively little effect on the structure and
function of the final protein, while other
replacements can be functionally devastating.
- From the standpoint of the genetic code, some
amino acid changes can be made by a single DNA
mutation, while others require two or even three
changes in the DNA sequence.
- In practice, what has been done is to calculate
tables of frequencies of all amino acid
replacements within families of related protein
sequences in the databanks, such as PAM and BLOSUM.
32The PAM 250 scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  2
R -2  6
N  0  0  2
D  0 -1  2  4
C -2 -4 -4 -5 12
Q  0  1  1  2 -5  4
E  0 -1  1  3 -5  2  4
G  1 -3  0  1 -3 -1  0  5
H -1  2  2  1 -3  3  1 -2  6
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4

Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in
proteins. In Atlas of Protein Sequence and
Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff,
ed., National Biomedical Research Foundation,
Silver Spring, MD.
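A minimal sketch of how such a table is used: each aligned pair of residues is looked up and the scores are summed. Only a handful of entries are included, copied from the PAM 250 table on this slide; the dictionary layout, the function names, and the short peptide strings are assumptions made for the example.

```python
# A few PAM 250 entries read from the slide's table. Keys are stored with
# the two residues in alphabetical order so each pair appears only once.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("L", "L"): 6,
    ("I", "L"): 2,
    ("K", "K"): 5,
    ("K", "R"): 3,
    ("W", "W"): 17,
    ("G", "W"): -7,
}


def pair_score(x, y, matrix):
    """Look up the score for residues x and y regardless of order."""
    return matrix[tuple(sorted((x, y)))]


def alignment_score(seq1, seq2, matrix):
    """Sum substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))


# Conservative replacements (L/I, K/R) still score positively, while
# replacing tryptophan with glycine is heavily penalized.
print(alignment_score("ALKW", "AIRW", PAM250_SUBSET))
print(alignment_score("ALKW", "ALKG", PAM250_SUBSET))
```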
33Clustering Algorithms
- Clustering algorithms use distances to calculate
phylogenetic trees. These trees are based solely
on the relative numbers of similarities and
differences between a set of sequences.
- Start with a matrix of pairwise distances
- Cluster methods construct a tree by linking the
least distant pairs of taxa, followed by
successively more distant taxa.
34UPGMA
- The simplest of the distance methods is the UPGMA
(Unweighted Pair Group Method using Arithmetic
averages).
- The PHYLIP programs DNADIST and PROTDIST
calculate absolute pairwise distances between a
group of sequences. Then the GCG program GROWTREE
uses UPGMA to build a tree.
- Many multiple alignment programs such as CLUSTAL
use a variant of UPGMA to create a dendrogram of
DNA sequences, which is then used to guide the
multiple alignment algorithm.
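A compact sketch of UPGMA topology building, not the code used by GROWTREE or CLUSTAL. The distances are the Kimura values from the "Computing a Distance Matrix" slide; the labels gtr1 to gtr5, the function name, and the nested-tuple output format are assumptions for illustration. Branch lengths are omitted to keep the example short.

```python
def upgma(labels, distances):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    distances maps (label_a, label_b) tuples to pairwise distances.
    """
    clusters = list(labels)
    size = {c: 1 for c in clusters}
    d = {}
    for (a, b), value in distances.items():
        d[(a, b)] = d[(b, a)] = value

    while len(clusters) > 1:
        # Link the least distant pair of current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[pair],
        )
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # average over all member sequences (size-weighted mean).
        for c in clusters:
            if c not in (a, b):
                value = (size[a] * d[(a, c)] + size[b] * d[(b, c)]) / (size[a] + size[b])
                d[(merged, c)] = d[(c, merged)] = value
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]


labels = ["gtr1", "gtr2", "gtr3", "gtr4", "gtr5"]
kimura = {
    ("gtr1", "gtr2"): 48.61, ("gtr1", "gtr3"): 45.50, ("gtr1", "gtr4"): 65.74,
    ("gtr1", "gtr5"): 107.70, ("gtr2", "gtr3"): 61.53, ("gtr2", "gtr4"): 74.57,
    ("gtr2", "gtr5"): 113.82, ("gtr3", "gtr4"): 68.93, ("gtr3", "gtr5"): 104.43,
    ("gtr4", "gtr5"): 110.86,
}
tree = upgma(labels, kimura)
print(tree)
```

Sequences 1 and 3 are joined first because 45.50 is the smallest entry in the matrix; sequence 5, which is distant from all the others, joins last.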
35Neighbor Joining
- The Neighbor Joining method is the most popular
way to build trees from distance measurements
(Saitou and Nei 1987, Mol. Biol. Evol. 4:406).
- Neighbor Joining corrects the UPGMA method for
its (frequently invalid) assumption that the same
rate of evolution applies to each branch of a
tree.
- The distance matrix is adjusted for differences
in the rate of evolution of each taxon (branch).
- Neighbor Joining has given the best results in
simulation studies and it is the most
computationally efficient of the distance
algorithms (N. Saitou and T. Imanishi, Mol.
Biol. Evol. 6:514 (1989)).
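The rate adjustment can be illustrated with one step of Neighbor Joining. For n taxa, the standard criterion is Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and the pair with the smallest Q is joined. The sketch below performs only this first step; the five-taxon distances and the function name are invented for the example.

```python
from itertools import combinations


def nj_first_join(labels, d):
    """Pick the first pair to join under the Neighbor Joining criterion.

    d maps frozenset({a, b}) to a distance. Returns the chosen pair, its
    Q value, and the branch lengths from each member to the new node.
    Requires at least three taxa.
    """
    n = len(labels)
    # Sum of distances from each taxon to all others; subtracting these
    # totals is the adjustment for unequal rates of evolution.
    total = {i: sum(d[frozenset((i, k))] for k in labels if k != i) for i in labels}
    q = {
        (i, j): (n - 2) * d[frozenset((i, j))] - total[i] - total[j]
        for i, j in combinations(labels, 2)
    }
    i, j = min(q, key=q.get)
    dij = d[frozenset((i, j))]
    branch_i = dij / 2 + (total[i] - total[j]) / (2 * (n - 2))
    return (i, j), q[(i, j)], (branch_i, dij - branch_i)


# Hypothetical distances for five taxa (invented for illustration).
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(pair): value for pair, value in raw.items()}
pair, q_value, branches = nj_first_join(["a", "b", "c", "d", "e"], d)
print(pair, q_value, branches)
```

In this data set d and e have the smallest raw distance (3), yet a and b are joined first. A plain clustering method such as UPGMA would start with d and e, so the example shows how the adjustment can change the result.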
38Other Methods
- Distance methods do not make use of all
information in an alignment
  - differences between each pair of sequences are summarized as a single number
  - each character has information
- Parsimony allows the use of all known
evolutionary information in building a tree
- Maximum Likelihood is even more computationally
intense (creates a tree for each character)
- These methods are beyond the scope of this course
39Computer Software for Phylogenetics
- Due to the lack of consensus among evolutionary
biologists about basic principles for
phylogenetic analysis, it is not surprising that
there is a wide array of computer software
available for this purpose.
- PHYLIP is a free package that includes 30
programs that compute various phylogenetic
algorithms on different kinds of data.
- PAUP (Phylogenetic Analysis Using Parsimony)
- CLUSTALX is a multiple alignment program that
includes the ability to create trees based on
Neighbor Joining.
- MacClade is a well-designed cladistics program
that allows the user to explore possible trees
for a data set.
- EMBOSS: emma for multiple alignment, plus many dozens of
phylogeny programs (includes the entire PHYLIP
package: fprotdist, fdrawtree, etc.)
40Phylogenetics on the Web
- There are several phylogenetics servers available
on the Web
  - some of these will change or disappear in the near future
  - these programs can be very slow, so keep your sample sets small
- The Institut Pasteur, Paris has a PHYLIP server
at http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
- Louxin Zhang at the Natl. University of
Singapore has a WebPhylip server: http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
- The Belozersky Institute at Moscow State
University has their own "GeneBee" phylogenetics
server: http://www.genebee.msu.su/services/phtree_reduced.html
- The Phylodendron website is a tree drawing
program with a nice user interface and a lot of
options; however, the output is limited to GIFs
at 72 dpi, which is not publication quality.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
41Other Web Resources
- Joseph Felsenstein (author of PHYLIP) maintains a
comprehensive list of phylogeny programs at
http://evolution.genetics.washington.edu/phylip/software.html
- Introduction to Phylogenetic Systematics,
Peter H. Weston and Michael D. Crisp, Society of
Australian Systematic Biologists:
http://www.science.uts.edu.au/sasb/WestonCrisp.html
- University of California, Berkeley Museum of
Paleontology (UCMP): http://www.ucmp.berkeley.edu/clad/clad4.html
42Software Hazards
- There are a variety of programs for Macs and PCs,
but you can easily tie up your machine for many
hours with even moderately sized data sets (e.g.
fifty 300 bp sequences).
- Moving sequences into different programs can be a
major hassle due to incompatible file formats.
- Just because a program can perform a given
computation on a set of data does not mean that
it is the appropriate algorithm for that type
of data.
43Conclusions
- Given the huge variety of methods for computing
phylogenies, how can the biologist determine what
is the best method for analyzing a given data
set?
- Published papers that address phylogenetic issues
generally make use of several different
algorithms and data sets in order to support
their conclusions.
- Using several alternate methods can give an
indication of the robustness of a given
conclusion.
- Bootstrapping methods can show if your results
depend on just a small part of the data, or are
generally supported.
- Analyze several different genes!!
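The bootstrapping idea can be sketched by resampling alignment columns with replacement and checking how often a result is recovered. Here the "result" is simply which pair of sequences is closest, a deliberately small stand-in for a full tree. The function names, the toy sequences, and the replicate count are all assumptions for illustration.

```python
import random


def resample_columns(aligned, rng):
    """Build one bootstrap replicate by sampling columns with replacement."""
    length = len(next(iter(aligned.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in aligned.items()}


def closest_pair(aligned):
    """Return the pair of names with the fewest differences.

    Ties are resolved by name order.
    """
    names = sorted(aligned)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    def differences(pair):
        return sum(x != y for x, y in zip(aligned[pair[0]], aligned[pair[1]]))

    return min(pairs, key=differences)


def bootstrap_support(aligned, statistic, replicates=100, seed=0):
    """Fraction of replicates in which the observed result is recovered."""
    rng = random.Random(seed)
    observed = statistic(aligned)
    hits = sum(
        statistic(resample_columns(aligned, rng)) == observed
        for _ in range(replicates)
    )
    return observed, hits / replicates


# Hypothetical alignment (invented for illustration only).
aligned = {
    "A": "AATTCGCTTC",
    "B": "AATTCACTTC",
    "C": "AACTCACTGC",
    "D": "GACTCACTGA",
}
print(bootstrap_support(aligned, closest_pair))
```

A support value near 1 suggests the result is spread across many columns; a low value suggests it depends on just a few sites.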
44Biodiversity and Conservation
- Phylogenetics also overlaps significantly with
other branches of evolutionary biology.
- Check out the Tree of Life project for an
introduction to phylogenetics and its
relationship to biodiversity.
http://phylogeny.arizona.edu/tree/phylogeny.html
- Measurements of DNA sequence differences are now
being used to implement plans for the
conservation of genetic resources.