Motivation - PowerPoint PPT Presentation

Slides: 116
Provided by: Ole105

Transcript and Presenter's Notes

Title: Motivation


1
(No Transcript)
2
Motivation
Tree Basics
Homoplasy
Molecular Clock Hypothesis
Prediction Methods
  • Character-based: perfect phylogeny, maximum parsimony
  • Distance-based: ultrametric trees, additive trees (e.g., Fitch-Margoliash, Nearest Neighbors), unweighted pair group method with arithmetic mean (UPGMA)
  • Maximum Likelihood
Evaluating trees and data
3
Evolution
  • Recall that DNA encodes the blueprint of life
  • Living things pass DNA information to their children
  • Due to mutations, DNA changes a little bit
  • After a long time, different species evolve
  • Phylogenetics studies the genetic relationships between different species
4
Similarity searches and multiple alignments
of sequences naturally lead to the question
"How are these sequences related?" and, more
generally, "How are the organisms from which
these sequences come related?"
5
Phylogenetic systematics is a method of taxonomic
classification of organisms based on their evolutionary history
  • Willi Hennig, a German entomologist, 1950.

6
Phenetic versus cladistic analysis
Phenetics is the study of relationships among a
group of organisms on the basis of the degree of
similarity between them, be that similarity
molecular, phenotypic, or anatomical. A tree-like
network expressing phenetic relationships is
called a phenogram. Cladistics can be defined
as the study of the pathways of evolution. In
other words, cladists are interested in such
questions as: how many branches are there among a
group of organisms, which branch connects to
which other branch, and what is the branching
sequence? A tree-like network that expresses such
ancestor-descendant relationships is called a
cladogram. Thus, a cladogram refers to the
topology of a rooted phylogenetic tree.
The maximum parsimony method is a typical
representative of the cladistic approach, whereas
the UPGMA method is a typical phenetic method.
7
Character-based approach
Trees are constructed on the basis of gain or loss of
characters (or traits), NOT connected explicitly
to a measure of distance. Best for small sets of
sequences with high similarity.
  • Distance measures are not necessary
  • Traditionally, morphological features are used, e.g.:
  • Has backbone
  • Has feathers

8
Molecular examples of characters:
  • Has a certain amino acid at position i
  • Whether a certain gap is present in a multiple sequence alignment
  • Whether or not protein X regulates protein Y
9
Character-based trees interpreted as
evolutionary trees
  • Root represents an ancestral object with none of the present m characters: (0, 0, 0, 0)
  • Each of the characters changes from 0 to 1 exactly once and never changes back
  • Each character labels one edge
  • Evolutionary history is measured by mutation events, not time
10
Independent evolution of tails
12
Distance based approach
Cladistic Methods
  • Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
  • Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
  • A phylogram is a tree with branches that are proportional to evolutionary distances.

13
(No Transcript)
14
Types of data used in phylogenetic inference

Character-based methods: use the aligned characters, such as DNA or protein
sequences, directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: transform the sequence data into pairwise
distances (dissimilarities), and then use the matrix during tree building.

             A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
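The transformation from aligned characters to pairwise dissimilarities can be sketched as follows. This is a minimal illustration, assuming the simplest possible measure (the fraction of differing positions, often called the p-distance); the function name `p_distance` is made up for this example. For the pairs A-B, A-C and C-D it gives 0.20, 0.50 and 0.15, which agrees with the entries above the diagonal of the matrix.

```python
from itertools import combinations

# Sequences copied from the character table on this slide.
seqs = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(s, t):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


# One entry per unordered pair of taxa.
p_dist = {(a, b): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}

for (a, b), d in p_dist.items():
    print(f"{a}-{b}: {d:.2f}")
```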
15
PHYLOGENETIC ANALYSIS
  • Evolution at a molecular level: Linus Pauling and Emile Zuckerkandl, 1965 (mutation rate)
  • The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics
Mutations: Random (?) Accumulate (?) Ancestor
  • Genetic drift (identical genes in different species)
  • Gene duplication
  • Recombination
  • Exchange

16
Assumptions of Phylogenies
  • All sequences are homologous.
  • No duplicate sequences are present.
  • Back mutation/reversal
  • Optimal alignments
  • Reproductive isolation
  • Limited horizontal gene transfer

17
Purpose of phylogenetic predictions
  • Understand the lineage of different species
  • Organizing principle to sort species into a
    taxonomy
  • Understand how various functions evolved
  • Understand forces and constraints on evolution
  • Perform multiple sequence alignment

18
Homoplasy
  • Homoplasy is similarity that is not homologous (not due to common ancestry)
  • Homoplasy is the result of independent evolution (convergence, parallelism, reversal)
  • Homoplasy can provide misleading evidence of phylogenetic relationships

19
Homoplasy (continued)
  • Significantly similar molecular sequences are very unlikely to arise by chance, i.e., homoplasy on the molecular level is very unlikely
  • Open question: horizontal transfer of sequences from one organism to another?
20
Orthologs vs. Paralogs
  • When comparing gene sequences, it is important to
    distinguish between identical vs. merely similar
    genes in different organisms.
  • Orthologs are homologous genes in different
    species with analogous functions.
  • Paralogs are similar genes that are the result of
    a gene duplication.
  • A phylogeny that includes both orthologs and
    paralogs is likely to be incorrect.
  • Sometimes phylogenetic analysis is the best way
    to determine if a new gene is an ortholog or
    paralog to other known genes.

21
1. Alignment
2. Substitution model building
3. Tree building
4. Tree evaluation
22
Progressive alignment
Closely related sequences distantly
related sequences
Independent (RNA?????)
GAPS?
23
  • Alignment parameter estimation
  • Placement of indels (insertion/deletion events) in an alignment of length-variable sequences depends on all parameters of the evolutionary model and should be consistent with those observed in a tree inferred from the alignment
  • Extreme approach: delete from the analysis all sites that include gaps (phylogenetic signal in these regions will be lost)
  • Another approach: incorporate gaps as characters (as an additional state, or independent of base substitution states)
  • Parameters should vary dynamically with evolutionary divergence

24
PHYLOGENETIC ANALYSIS
  • Alignment: attributes and options
  • Computer dependence: none, partial, complete
  • Phylogeny invocation: none, a priori, recursive
  • Alignment parameter estimation: a priori, dynamic, recursive
  • Alignment features: primary structure, higher-order structures
  • Mathematical optimization: statistical, nonstatistical

25
CLUSTAL W
  • Partial computational dependence
  • Phylogeny criteria invoked a priori (guide tree)
  • Alignment parameter estimation a priori or dynamically (optional)
  • Alignment of primary structure (partial structural basis in the case of hydrophilic amino acids)
  • Mathematical optimization: nonstatistical

26
  • Alignment of primary versus higher-order structures
  • Aligning according to secondary or higher-order structures is more reliable

TREE BUILDING PROGRAMS IN ALIGNMENT PACKAGES
ARE NOT RIGOROUS!
27
Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same
thing, even though evolutionary relationship is
inferred from certain types of similarity.
  • Similar: having likeness or resemblance (an observation)
  • Related: genetically connected (a historical fact)
Two taxa can be most similar without being most closely related:
C is more similar in sequence to A (d = 3) than
to B (d = 7), but C and B are most
closely related (that is, C and B shared a common
ancestor more recently than either did with A).
28
  • Substitution model building
  • Very important, since it will influence alignment as well as tree building
  • For nucleotide sequences, two models:
  • Substitutions between particular bases
  • Substitutions among different sites in a sequence

Transitions (A-G, G-A, C-T, T-C) are more frequent than
transversions (A-C, A-T, C-G, G-T).
Rates of substitution take the form of a square
matrix: 4 x 4 for bases, 20 x 20 for amino acids, 61 x 61 for codons. Fixed cost
matrices are used in the weighted parsimony method.
29
Character weight matrix and application in
phylogenetic analysis
[Figure: reconstructions of evolution from 8 sequences, with the changes C-T, G-A, G-C and A-T labeled on the branches]
  • If unweighted: 3 steps
  • If weighted: ?
Distance matrices are much more complicated
30
Basic Considerations
  • Codon bias
  • The genetic code is degenerate, with wobble in the third codon position.
  • Yeasts, protozoa, and animals have different codon preferences, which result in differences in DNA sequence that are related to codon bias and not to evolution.
  • Also, some protozoa use the codons TAA and TAG to encode glutamine, rather than STOP, and in mitochondria the codon TGA encodes tryptophan, rather than STOP.

Relationships between genes are not necessarily
the same as the relationships between whole
organisms.
  • Phylogenies based on DNA are better than those based on proteins, because the degeneracy of the genetic code masks mutations at the protein level
  • DNA is under less selective pressure than protein
  • DNA comparison is more sensitive for picking up divergence between closely related sequences.
  • But DNA sequence alignment is less reliable than protein sequence alignment
31
Distance Measurements
Distance score: the score between two sequences,
representing the number of mismatched positions
in the alignment (the number of positions that
would have to be changed)
  • It is often useful to measure the genetic
    distance between two species, between two
    populations, or even between two individuals.
  • The entire concept of numerical taxonomy is based
    on computing phylogenies from a table of
    distances.
  • In the case of sequence data, pairwise distances
    must be calculated between all sequences that
    will be used to build the tree - thus creating a
    distance matrix.
  • Distance methods give a single measurement of the
    amount of evolutionary change between two
    sequences since divergence from a common
    ancestor.

32
Finding Distance Between Two Species
  • Consider two species with these DNA fragments:
    Species i: (A, C, G, C, T)
    Species j: (C, C, A, C, T)
  • 2 mismatches, so we can estimate the distance to be 2
  • Looks reasonable, as 2 mismatches can be thought of as 2 mutations
  • However, this fails to capture multiple mutations at the same site
  • In practice, we need to apply some corrective distance transformation
33
Conversion of Alignment Scores to Distances
  • Alignment scores are large for similar sequences.
  • Distance methods require that the distances
    between similar sequences are smaller than the
    distances between less similar sequences.
  • Large alignment scores need to be mapped to small
    distances and vice versa.

34
Computing a Distance Matrix
Reading sequences...
  gtr1_human   548 total, 548 read
  gtr2_human   548 total, 548 read
  gtr3_human   548 total, 548 read
  gtr4_human   548 total, 548 read
  gtr5_human   548 total, 548 read
Computing distances using Kimura method...
  1 x 2    48.61
  1 x 3    45.50
  1 x 4    65.74
  1 x 5   107.70
  2 x 3    61.53
  2 x 4    74.57
  2 x 5   113.82
  3 x 4    68.93
  3 x 5   104.43
  4 x 5   110.86

Matrix 1
         1        2        3        4        5
  1    0.00    48.61    45.50    65.74   107.70
  2             0.00    61.53    74.57   113.82
  3                      0.00    68.93   104.43
  4                               0.00   110.86
  5                                        0.00
35
DNA Distances
  • Distances between pairs of DNA sequences are
    relatively simple to compute as the sum of all
    base pair differences between the two sequences.
  • This type of algorithm only works for pairs of
    sequences that are similar enough to be aligned
  • Generally all base changes are considered equal
  • Insertion/deletions are generally given a larger
    weight than replacements (gap penalties).
  • It is also possible to correct for multiple
    substitutions at a single site, which is common
    in distant relationships and for rapidly evolving
    sites.

36
Mutation rate?
37
Genetic distance: an attempt to answer the
question of how much evolutionary change has
occurred between sequences
Jukes-Cantor distance: mutation occurs at a
constant rate, and each nucleotide is equally
likely to mutate into any other nucleotide, with
rate α.
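A short sketch of the Jukes-Cantor correction, assuming the standard closed form $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$, where $p$ is the observed fraction of differing sites. The function name is illustrative; the two five-base fragments are the species i and species j example from the earlier slide.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from the observed fraction of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Species i and species j from the "Finding Distance Between Two Species" slide.
s_i, s_j = "ACGCT", "CCACT"
p = sum(x != y for x, y in zip(s_i, s_j)) / len(s_i)  # 2 mismatches out of 5 sites
d_jc = jukes_cantor(p)

print(f"observed p = {p:.2f}, corrected d = {d_jc:.3f}")
```

The corrected value is larger than the raw mismatch fraction, which is the point of the correction: some sites are assumed to have changed more than once.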
38
Kimura two-parameter distance: allows
different rates for transitions and
transversions.
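As a hedged illustration, the two-parameter distance can be computed from the observed proportions of transitions ($P$) and transversions ($Q$) with the standard formula $d = -\tfrac{1}{2}\ln\big((1 - 2P - Q)\sqrt{1 - 2Q}\big)$. The sketch assumes ungapped, uppercase A/C/G/T sequences; the example sequences are invented for the demonstration and do not come from the slides.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(s, t):
    """Kimura two-parameter distance between two aligned, ungapped DNA sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(s, t):
        if x == y:
            continue
        # A<->G and C<->T stay within one chemical class: transitions.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(s)
    Q = transversions / len(s)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical pair: one transition (A/G) and one transversion (A/C) in 10 sites.
d_k2p = kimura_2p("AAAAAAAAAA", "GAAAAAAAAC")
print(round(d_k2p, 4))
```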
39
Amino Acid Distances
  • Distances between amino acid sequences are a bit
    more complicated to calculate.
  • Some amino acids can replace one another with
    relatively little effect on the structure and
    function of the final protein while other
    replacements can be functionally devastating.
  • From the standpoint of the genetic code, some
    amino acid changes can be made by a single DNA
    mutation while others require two or even three
    changes in the DNA sequence.
  • In practice, what has been done is to calculate
    tables of frequencies of all amino acid
    replacements within families of related protein
    sequences in the databanks, e.g., PAM and BLOSUM

40
EVOLUTIONARY TIME?
41
Molecular clock hypothesis
First proposed by Zuckerkandl and Pauling (1962, 1965); given a theoretical basis by the neutral theory of Motoo Kimura (1968).
  • The controversial hypothesis of molecular clock
    (MC) is a consequence of the neutral theory of
    evolution.
  • It holds that in any given DNA /protein sequence,
    mutations accumulate at an approximately constant
    rate as long as the DNA sequence retains its
    original functions.
  • The difference between the sequences of a DNA
    segment (or protein) in two species would then be
    proportional to the time since the species
    diverged from a common ancestor (coalescence
    time).
  • This time may be measured in arbitrary units and
    then it can be calibrated in millions of years
    for any given gene if the fossil record of that
    species happens to be rich.

42
(No Transcript)
43
The rate of evolution: $k = \nu_T f_0$, where $k$ = rate
of nucleotide substitution, $\nu_T$ = the mutation
rate, and $f_0$ = the fraction of new alleles that are
selectively neutral.
Under a molecular clock, the expected divergence between two
populations is $2\mu t$, where $\mu$ =
mutation rate and $t$ = time since the last common
ancestor.
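Read the other way round, the clock relation gives a divergence time from an observed distance: if the divergence is $2\mu t$, then $t = d / (2\mu)$. The sketch below assumes a strict clock and uses invented numbers purely for illustration; neither value comes from the slides.

```python
def divergence_time(d, mu):
    """Time since the last common ancestor under a strict clock, from d = 2 * mu * t."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return d / (2 * mu)


# Hypothetical values: 0.02 substitutions per site between two species,
# and 1e-9 substitutions per site per year along each lineage.
t = divergence_time(d=0.02, mu=1e-9)
print(f"{t:.3g} years")
```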
44
  • Neutral theory of Evolution most variation that
    is observed is of no interest to natural
    selection (fitness).
  • Most mutations are so nearly selectively neutral
    in their effects that their fate is determined
    largely through random genetic drift and other
    alleles are deleterious and removed by selection.
  • silent substitutions and substitutions in
    noncoding regions will occur more often because
    they are likely to be selectively neutral.
  • Replacement substitutions will occur less often
    because of selective pressure.

45
  • Rate of accepted mutations may be different for
    different proteins (depending on their tolerance
    for mutations)
  • Different parts of a protein may evolve at
    different rates

46
Clustering Algorithms
Distances
Tree
  • Clustering algorithms use distances to calculate
    phylogenetic trees. These trees are based solely
    on the relative numbers of similarities and
    differences between a set of sequences.
  • Start with a matrix of pairwise distances
  • Cluster methods construct a tree by linking the
    least distant pairs of taxa, followed by
    successively more distant taxa.
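The linking procedure in the last bullet can be sketched with UPGMA, one common clustering method. This is a minimal sketch rather than a reference implementation: the function name and the string labels for merged clusters are choices made for the example. The input is the Kimura distance matrix for the five gtr sequences shown on the "Computing a Distance Matrix" slide.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of (cluster_label, node_height) merges made by UPGMA."""
    sizes = {n: 1 for n in names}                     # cluster label -> number of taxa
    d = {frozenset(k): v for k, v in dist.items()}    # unordered pair -> distance
    merges = []
    while len(sizes) > 1:
        # Link the least distant pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = f"({a},{b})"
        # Distance to the new cluster: size-weighted average of the old distances.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((new, height))
    return merges


names = ["1", "2", "3", "4", "5"]
dist = {
    ("1", "2"): 48.61, ("1", "3"): 45.50, ("1", "4"): 65.74, ("1", "5"): 107.70,
    ("2", "3"): 61.53, ("2", "4"): 74.57, ("2", "5"): 113.82,
    ("3", "4"): 68.93, ("3", "5"): 104.43, ("4", "5"): 110.86,
}
merges = upgma(names, dist)
for label, height in merges:
    print(f"{label} joined at height {height:.3f}")
```

Sequences 1 and 3 are joined first because 45.50 is the smallest entry; sequence 5 joins last.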

47
TREE BUILDING
Species or genes tree
A tree is a 2-dimensional graph showing
evolutionary relationships among organisms, or
in our case, in certain genes from separate
organisms. We refer to these separate sources
of sequences as taxa (singular taxon), defined
as phylogenetically distinct units on the tree.
The tree is composed of nodes representing the
taxa and branches representing the relationships
among the taxa. The lengths of the branches are
often drawn proportional to the number of
sequence changes in the branch.
48
The sum of all branch lengths = tree length.
The tree is a bifurcating (binary) tree.
49
The sum of all branch lengths = tree length.
The tree is a bifurcating (binary) tree. When taxa are too
close and hard to resolve, several branches may leave the same
node.
50
  • PROPERTIES OF TREES
  • A unique path leads from the root node to any other node, and the direction indicates evolutionary time
  • The root is the common ancestor of all taxa
  • The root is defined by including a taxon which we are reasonably sure branched off earlier than the other taxa under study, but which should be related to the remaining taxa
  • If we do not have a taxon to define the root, we can predict relationships by an UNROOTED TREE

51
  • Rooted trees
  • Single common ancestor
  • Requires more information

Unrooted trees: insufficient information to tell
whether or not a given internal node is a common
ancestor of any 2 leaves
THERE ARE A LARGE POSSIBLE NUMBER OF TREES AND
ONLY ONE TREE IS THE CORRECT ONE. THE OBJECTIVE
OF THE ANALYSIS IS TO FIND THE CORRECT TREE.

TAXA    # OF ROOTED TREES    # OF UNROOTED TREES
3       3                    1
4       15                   3
5       105                  15
7       10,395               945
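The counts in this table follow the standard formulas for fully bifurcating trees: $(2n-3)!!$ rooted and $(2n-5)!!$ unrooted trees for $n$ taxa. A small sketch (function names are illustrative) reproduces them:

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


for n in (3, 4, 5, 7, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```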
52
Phylogenetic Tree Construction
  • Processes
  • Topology construction
  • Length estimation
  • Methods
  • Distance methods
  • Maximum parsimony methods
  • Maximum likelihood methods

53
2 methods
54
Method 1
OUTGROUP
  • The outgroup sequence should be closely related to the rest of the sequences, but there should also be significantly more difference between the outgroup and the rest of the sequences than among them
  • An outgroup that is too distant may lead to an incorrect tree because of the more random, complex nature of the differences between the outgroup and the rest of the sequences
  • In choosing an outgroup, one assumes that the evolutionary history of the gene is the same as that of the rest of the sequences. If this assumption is incorrect (e.g., horizontal gene transfer has occurred), an incorrect analysis could result

55
Method 2
Statistical tools can root trees
automatically (e.g., mid-point rooting).
This necessarily involves assumptions: BEWARE!
56
METRIC DISTANCES between any two or three taxa
(a, b, and c) have the following properties:
  • Property 1: $d(a, b) \ge 0$ (non-negativity)
  • Property 2: $d(a, b) = d(b, a)$ (symmetry)
  • Property 3: $d(a, b) = 0$ if and only if $a = b$ (distinctness)
  • Property 4: $d(a, c) \le d(a, b) + d(b, c)$ (triangle inequality)
57
ULTRAMETRIC DISTANCES must satisfy the previous
four conditions, plus
  • Property 5: $d(a, b) \le \max[d(a, c), d(b, c)]$
This implies that the two largest distances are
equal, so that they define an isosceles triangle.
Similarity = relationship if the distances are
ultrametric!
If distances are ultrametric, then the sequences
are evolving in a perfectly clock-like manner, so they
can be used in UPGMA trees and for the most
precise calculations of divergence dates.
58
ADDITIVE DISTANCES
  • Property 6: $d(a, b) + d(c, d) \le \max[d(a, c) + d(b, d),\ d(a, d) + d(b, c)]$
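Properties 5 and 6 can be checked mechanically. The sketch below tests the three-point condition (the two largest distances in every triple are equal) and the four-point condition (the two largest pairwise sums in every quadruple are equal). Function names and both example matrices are illustrative: the first matrix is invented to be clock-like, the second uses the values 3, 7, 8, 6, 7, 3 that appear in the distance table on a later slide.

```python
from itertools import combinations


def symmetric(pairs):
    """Build a nested dict d[a][b] == d[b][a] from {(a, b): distance}."""
    d = {}
    for (a, b), v in pairs.items():
        d.setdefault(a, {})[b] = v
        d.setdefault(b, {})[a] = v
    return d


def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        x, y, z = sorted([d[a][b], d[a][c], d[b][c]])
        if z - y > tol:
            return False
    return True


def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: in every quadruple the two largest sums are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
        if sums[2] - sums[1] > tol:
            return False
    return True


taxa = ["A", "B", "C", "D"]
clock_like = symmetric({("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 6,
                        ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6})
unequal_rates = symmetric({("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
                           ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3})

print(is_ultrametric(clock_like, taxa), is_additive(clock_like, taxa))
print(is_ultrametric(unequal_rates, taxa), is_additive(unequal_rates, taxa))
```

The second matrix is additive but not ultrametric, which matches the idea that additive distances generalize ultrametric ones.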
59
METHODS OF PHYLOGENETIC ANALYSIS
(Phenetic-cladistic; phenograms-cladograms)

Character-based methods: derive trees that optimize the distribution of
the data patterns for each character (distances are not fixed)
  • Maximum parsimony method: a multiple sequence alignment is produced in order to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This analysis is continued for every position in the sequence alignment. Finally, those trees which produce the smallest number of changes overall for all sequence positions are identified.
  • Maximum likelihood method: like the maximum parsimony method, the maximum likelihood method depends upon first obtaining a reliable multiple sequence alignment and then examining the changes in each column in the alignment. In this case, however, the likelihood of a particular tree is calculated using an expected model of change in the sequences (Swofford and Olsen 1990). For example, in the Jukes-Cantor model all nucleotides are assumed to be equally frequent and the probability of change of any nucleotide to any other nucleotide is assumed to be the same. For each possible tree, the likelihood of finding the actual sequence changes at each column in the aligned sequences is calculated. The probabilities for each aligned position are then multiplied to provide a likelihood for each tree. The tree which provides the maximum likelihood value is the most probable tree.

Distance-based methods: compute pairwise distances according to some
measure, then discard the actual data (distances are fixed)
  • All possible pairs of sequences are aligned to determine which pairs are the most similar or closely related. These alignments provide a measure of the genetic distance between the sequences. These distance measurements are then used to predict the evolutionary relationship.
60
(No Transcript)
61

[Flowchart]
  • Choose a set of related sequences
  • Obtain a multiple sequence alignment
  • Is there strong sequence similarity? If yes: maximum parsimony methods
  • If no: is there clearly recognizable sequence similarity? If yes: distance methods
  • If no: maximum likelihood method
  • Analyze how well the data support the prediction
62
Character-based
  • maximum parsimony method
  • Find tree which minimizes number of changes
    needed to explain data
  • A multiple sequence alignment is produced in
    order to predict which sequence positions are
    likely to correspond.
  • These positions will appear in vertical columns
    in the multiple sequence alignment.
  • For each aligned position, phylogenetic trees
    that require the smallest number of evolutionary
    changes to produce the observed sequence changes
    are identified.
  • This analysis is continued for every position in
    the sequence alignment.
  • Finally, those trees which produce the smallest
    number of changes overall for all sequence
    positions are identified.

63
Character-based
A subset of all possible trees is examined. The
most parsimonious tree is the one that requires
the fewest evolutionary changes for all
sequences to derive from a common ancestor
(minimum evolution)
  • Consider four sequences ATCG, TTCG, ATCC, and
    TCCG
  • Imagine a tree that branches at the first
    position, grouping ATCG and ATCC on one branch,
    TTCG and TCCG on the other branch.
  • Then each branch splits, for a total of 3 nodes
    on the tree (Tree 1)
  • Compare Tree 1 with one that first divides ATCC
    on its own branch, then splits off ATCG, and
    finally divides TTCG from TCCG (Tree 2).
  • Trees 1 and 2 both have three nodes, but when
    all of the distances back to the root (number of nodes
    crossed) are summed, the total is equal to 8 for
    Tree 1 and 9 for Tree 2.
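The totals 8 and 9 above count nodes crossed on the way to the root. The parsimony criterion itself counts character changes, and that count can be sketched with Fitch's algorithm for unordered characters. In the sketch below the nested-tuple tree encoding and the function names are choices made for the example; the four sequences and the two topologies are the ones described above.

```python
def fitch_site(tree, states):
    """Return (candidate state set, minimum changes) for one alignment column."""
    if isinstance(tree, str):              # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                             # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def fitch_score(tree, seqs):
    """Minimum total number of changes over all columns for a rooted binary tree."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


seqs = {"s1": "ATCG", "s2": "TTCG", "s3": "ATCC", "s4": "TCCG"}
tree1 = (("s1", "s3"), ("s2", "s4"))      # ATCG+ATCC on one branch, TTCG+TCCG on the other
tree2 = ("s3", ("s1", ("s2", "s4")))      # ATCC splits first, then ATCG, then TTCG/TCCG

score1 = fitch_score(tree1, seqs)
score2 = fitch_score(tree2, seqs)
print(score1, score2)
```

Under this change-counting criterion both topologies need the same number of changes for these four short sequences, so the changes alone do not separate them.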

[Figure: Tree 1 and Tree 2]
64
How do you search through all trees?
  • Enumerate all trees (too many)
  • Use techniques that limit the search space (e.g., branch and bound)
  • Or use heuristics (many possibilities), e.g., nearest neighbor interchange: start with a tree and consider neighboring trees. If any neighboring tree has fewer changes, take it as the current tree. Stop when there are no improvements
65
Character-based



-informative sites
RULES
  • 4 taxa: three possible unrooted trees
  • Some sites are informative, some not
  • Only informative sites need to be analyzed

COST of CHANGE??????
66
Character-based
Maximum parsimony - scoring
  • Step matrices
  • Consistency Index (CI): CI = (minimum possible tree length) / (actual tree length)
  • Codon position: variable weighting
  • Only mutations leading to amino acid changes are scored

67
Implementation of step matrices
  • Character-state trees describing possible pathways explicitly assign a weight (cost) to a particular sort of change
  • Parsimony methods attempt to minimize the summed cost of all changes
  • Summed cost = number of steps in a character (step = unit of cost)
  • One of the key assumptions used in parsimony analysis is the assignment of relative weights or costs to each type of change, summarized in a cost or step matrix
  • The structure of the step matrix depends on the rules you think characters are evolving under
  • In programs, you must choose a general step matrix (or the default chooses one for you)
68
1. Unordered characters: a change from any state to any other is counted as one step (Fitch parsimony; Fitch, 1971). Used for nucleotide sequence data, although transversions (TV) may be given a higher cost. [Diagram: states 0, 1, 2, 3, each connected directly to every other]
2. Ordered characters: the number of steps from one state to another = the difference between the state numbers (Wagner parsimony; Farris, 1970), i.e., steps = lines in the path (e.g., morphological characters on a continuum): 0-1-2-3-4
3. Irreversible characters: the number of steps between states = the difference between the state numbers, where decreases in state number do not occur (Camin-Sokal parsimony; Camin and Sokal, 1965). Multiple gains are allowed, no losses: 0→1→2→3→4
69
Step matrices (rows = from state, columns = to state; ∞ = change not allowed)

Unordered            Ordered              Irreversible
    0 1 2 3              0 1 2 3              0 1 2 3
0   0 1 1 1          0   0 1 2 3          0   0 1 2 3
1   1 0 1 1          1   1 0 1 2          1   ∞ 0 1 2
2   1 1 0 1          2   2 1 0 1          2   ∞ ∞ 0 1
3   1 1 1 0          3   3 2 1 0          3   ∞ ∞ ∞ 0

Step matrices can be elaborated for any
number of transformation or weighting schemes.
A common one is weighting transversions more
heavily in molecular data:

    A  C  G  T
A   -  2  1  2
C   2  -  2  1
G   1  2  -  2
T   2  1  2  -
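The three general step matrices can be generated from their rules, which makes the difference between them explicit. This is an illustrative sketch: the function names are invented, rows are read as "from" and columns as "to", and `math.inf` stands for a change that is not allowed.

```python
import math


def unordered(n):
    """Any change between different states costs one step (Fitch)."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]


def ordered(n):
    """Cost is the difference between state numbers (Wagner)."""
    return [[abs(i - j) for j in range(n)] for i in range(n)]


def irreversible(n):
    """Like ordered, but decreases in state number are forbidden (Camin-Sokal)."""
    return [[j - i if j >= i else math.inf for j in range(n)] for i in range(n)]


def path_cost(matrix, states):
    """Summed cost of a sequence of state changes, e.g. [0, 2, 1]."""
    return sum(matrix[a][b] for a, b in zip(states, states[1:]))


for name, build in (("unordered", unordered), ("ordered", ordered),
                    ("irreversible", irreversible)):
    print(name, path_cost(build(4), [0, 2, 1]))
```

The same history (0 to 2, then back to 1) costs 2 steps when unordered, 3 when ordered, and is impossible when characters are irreversible.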
70
Character-based
PAUP (Phylogenetic Analysis Using Parsimony) - GCG
http://evolution.genetics.washington.edu/phylip/software.pars.html#PAUP
(no web interface)
MACCLADE: Macintosh program that contains many tools
for entering and editing data, producing trees,
and obtaining diagnostic feedback
71
Character-based
Maximum parsimony
Provides misleading information when rates of
sequence change vary in the different branches of
the tree represented by the sequence data
[Figure: real tree versus predicted tree for Taxa 1-4, with bases a and g at the tips. If rates of change are assumed to be equal, an incorrect tree is predicted.]
72
Character-based
Maximum parsimony
  • Approaches to solve the problem
  • Break up long branches by adding additional taxa closely related to the taxa in question
  • Lake's method (PAUP)
  • Only transversions are scored: (A, G) <-> (C, T)
  • Transversions are assumed to occur at a constant rate
  • Also, independent of position

[Figure: four-taxon trees A and B for Taxa 1-4, showing one position with bases a/g and another position with base c in all taxa. Evolutionary change or by chance?]
73
Minimum evolution (ME) methods
  • Optimality criterion: the tree(s) with the
    shortest sum of the branch lengths (or overall
    tree length) is chosen as the best tree.
  • Advantages
  • Can be used on indirectly-measured distances
    (immunological, hybridization).
  • Distances can be corrected for unseen events.
  • Usually faster than character-based methods.
  • Can be used for some rate analyses.
  • Has an objective function (as compared to
    clustering methods).
  • Disadvantages
  • Information lost when characters transformed to
    distances.
  • Slower than clustering methods.

74
Character-based
Maximum Likelihood (ML)
  • The term Maximum Likelihood does not refer to a single statistical method, but rather to a general approach.
  • ML methods take what has been described as an "inside out" approach. In their simplest form, they begin by listing all possible models, and then calculating the probability that each model would generate the data actually observed.
  • The model with the highest probability of generating the observed data is chosen as the best model.
  • Joe Felsenstein's application of ML to phylogeny is implemented in DNAML in the PHYLIP package, and in a modified version of DNAML called fastDNAml, written by Gary Olsen.
  • Uses an explicit model of evolution, therefore more diverse sequences may be analyzed
  • Uses probability calculations to find a tree; similar to the parsimony method in that the analysis is performed on each column of the multiple alignment
  • All possible trees are considered
  • Trees with the least number of changes are considered

75
The Maximum Likelihood approach resembles the MP
method but offers an additional opportunity to
evaluate trees: variations in mutation
rates (Jukes-Cantor and Kimura models)
76
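Sequence a  A C G C G T T G G G
Sequence b  A C G C G T T G G G
Sequence c  A C G C A A T G A A
Sequence d  A C A C A G G G A A

[Figure: unrooted tree of A, B, C, D (one of three) with bases T, T, A, G at one site; rooted tree of a, b, c, d (one of five) with branch likelihoods L0-L6]
  • L = likelihood values (probabilities) for the branches
  • Consider every possible base assignment to the internal nodes: $4^3 = 64$ in total for three node positions
[Figure: rooted tree with base assignments; tips T, T, A, G and internal nodes T, G, T]
  • Transition: $2 \times 10^{-6}$; transversion: $10^{-6}$
  • $L = L_0 \times L_1 \times L_2 \times L_3 \times L_4 \times L_5 \times L_6 = 0.25 \times 1 \times 2 \times 10^{-6} \times 1 \times 1 \times 1 \times 10^{-6} = 5 \times 10^{-13}$
  • Next tree and so on: $L(\text{Tree}) = L(\text{Tree}_1) + L(\text{Tree}_2) + \dots$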
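The arithmetic for one base assignment can be reproduced in a few lines. The seven factors are the L0 to L6 values quoted above; the variable names are illustrative. The log-likelihood is included because, in practice, products of many small probabilities are usually handled as sums of logarithms to avoid numerical underflow.

```python
import math

# L0 .. L6 for the base assignment shown on the slide:
# 0.25 for the root base, 1 for branches with no change,
# 2e-6 for the transition and 1e-6 for the transversion.
branch_terms = [0.25, 1, 2e-6, 1, 1, 1, 1e-6]

likelihood = math.prod(branch_terms)
log_likelihood = sum(math.log(x) for x in branch_terms)

print(f"L = {likelihood:.1e}, ln L = {log_likelihood:.2f}")
```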
77
Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate
phylogenetic hypotheses in terms of the
probability that a proposed model of the
evolutionary process and the proposed unrooted
tree would give rise to the observed data. The
tree found to have the highest ML value
is considered to be the preferred tree.
  • Advantages
  • Are inherently statistical and evolutionary
    model-based.
  • Usually the most consistent of the methods
    available.
  • Can be used for character (can infer the exact
    substitutions) and rate analysis.
  • Can be used to infer the sequences of the
    extinct (hypothetical) ancestors.
  • Can help account for branch-length effects in
    unbalanced trees.
  • Can be applied to nucleotide or amino acid
    sequences, and other types of data.
  • Disadvantages
  • Are not as simple and intuitive as many other
    methods.
  • Are computationally very intense (limits number
    of taxa and length of sequence).
  • Like parsimony, can be fooled by high levels of
    homoplasy.
  • Violations of the assumed model can lead to
    incorrect trees.

78
Distance-based methods
DISTANCES TREE
  • Tree is constructed using distances between
    species (number of mutations, time, other
    distance measures)
  • Neighbors: sequence pairs with the smallest number
    of changes
  • Trees are rooted, i.e. sequences share a common
    ancestor
  • First step is producing MSAs (e.g. CLUSTALW)
  • DNA
  • Distance matrix is created
  • Relatively simple for pairs of homologous
    sequences that can be aligned without large
    insertions, deletions etc.
  • Proteins
  • Matrices, such as PAM, are used
  • Multiple substitutions at one site are always a
    problem
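Multiple substitutions at one site are the reason observed differences underestimate the true distance. A minimal sketch of the two standard corrections named earlier in these slides (Jukes-Cantor and Kimura two-parameter) is given below; the function names are illustrative, and the inputs are the observed fractions of differing sites.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from the observed fractions of
    transitions (P) and transversions (Q)."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences are too divergent for the correction")
    return -0.5 * math.log(a * math.sqrt(b))


# The corrected distance is always at least as large as the observed one.
print(jukes_cantor(0.3), kimura_2p(0.2, 0.1))
```

Both corrections grow without bound as the sequences approach saturation, which is why very divergent pairs give unreliable distances.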

79
  • Distance method applications
  • CLUSTALW
  • PAUP
  • PHYLIP

Methods of phylogenetic tree estimation
DNADIST: distances among nucleic acid sequences
PROTDIST: distances among amino acid sequences
Distance matrices (distance tables): the outfile with a
distance table becomes the infile of the next program
FITCH: Fitch-Margoliash method, does not assume a
molecular clock
KITSCH: Fitch-Margoliash method, assumes a molecular
clock
NEIGHBOR: neighbor-joining (does not assume a
molecular clock, unrooted tree) or unweighted pair
group method (UPGMA)
Distance score: the score between two sequences,
representing the number of mismatched positions
in the alignment (the number of positions that
would have to be changed)
Distance method will be successful if the
distances between the sequences can be made
additive on a predicted tree.
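As a small illustration of the distance score, the sketch below counts mismatched positions for every pair of aligned sequences. It reuses the four ten-base sequences from the maximum likelihood example in these slides; the function and variable names are hypothetical.

```python
from itertools import combinations


def mismatch_distance(s1, s2):
    """Number of aligned positions at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s1, s2))


alignment = {
    "a": "ACGCGTTGGG",
    "b": "ACGCGTTGGG",
    "c": "ACGCAATGAA",
    "d": "ACACAGGGAA",
}

# Distance table: one entry per unordered pair of sequences.
table = {(i, j): mismatch_distance(alignment[i], alignment[j])
         for i, j in combinations(alignment, 2)}

for pair, dist in table.items():
    print(pair, dist)
```

A table like this is what a tree-building program then takes as input.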
80
DISTANCE-BASED
Sequence A xxxxxxxxxxxxxxxxxxxxxx
Sequence B xxxxxxxxxxxxxxxxxxxxxx
Sequence C xxxxxxxxxxxxxxxxxxxxxx
Sequence D xxxxxxxxxxxxxxxxxxxxxx
Distances: the number of steps required to change
one sequence into another
Distance table: nAB = 3, nAC = 7, nAD = 8, nBC = 6,
nBD = 7, nCD = 3
Phylogenetic Tree
dAB + dCD < dAC + dBD = dAD + dBC
Principle of additivity for this tree
Each change occurs once?
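The additivity condition for four sequences can be checked directly: of the three pairwise sums, the two largest must be equal. A minimal sketch using the distances from this slide follows; the function names are hypothetical.

```python
def quartet_sums(d):
    """The three pairwise sums used in the four-point (additivity) condition."""
    return {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }


def is_additive(d, tol=1e-9):
    """True if the two largest sums are equal (within tol)."""
    s = sorted(quartet_sums(d).values())
    return abs(s[2] - s[1]) <= tol


def internal_branch_length(d):
    """Length of the internal branch implied by an additive quartet."""
    s = sorted(quartet_sums(d).values())
    return (s[2] - s[0]) / 2


distances = {"AB": 3, "AC": 7, "AD": 8, "BC": 6, "BD": 7, "CD": 3}
print(quartet_sums(distances), is_additive(distances),
      internal_branch_length(distances))
```

The smallest sum identifies the pairing on the tree, here A with B and C with D.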
81
Additive Trees
  • Generalization of ultrametric trees
  • In ultrametric trees, the number of mutations was
    assumed to be proportional to the temporal
    distance of a node to its ancestor
  • It was also assumed that mutations took place at
    the same rate in all branches
  • Additive trees model different rates of mutation
    along different branches

82
Fitch-Margoliash method
DISTANCE-BASED
  • Draw unrooted tree
  • Calculate the length of tree branches
    algebraically

From A to B: a + b = 22 (1)
From A to C: a + c = 39 (2)
From B to C: b + c = 41 (3)
Subtract (3) from (2): a - b = -2 (4)
Add (1) and (4): 2a = 20, a = 10
From (1) and (2): b = 12, c = 29
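The same algebra can be written as a closed form for any three taxa. The sketch below is illustrative (the function name is hypothetical) and reproduces the numbers from this slide.

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Branch lengths a, b, c of the unrooted three-taxon tree, solved from
    a + b = d_ab, a + c = d_ac and b + c = d_bc."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_taxon_branches(22, 39, 41)
print(a, b, c)
```

For more than three sequences the same calculation is applied repeatedly, treating all remaining sequences as a single composite third taxon.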
83
Fitch-Margoliash method
DISTANCE BASED
  • Uses distance table
  • Calculates the length of tree branches
    algebraically
  • Draws unrooted tree

for n-sequences
  • Simple extension of 3-sequence method
  • Closest sequence pair is chosen
  • The rest of the sequences are agglomerated
  • Distance between the pair is computed
  • The matrix is recomputed with the sequence pair
    combined into single node
  • Process is repeated till the sequences are
    combined

84
Fitch-Margoliash method
DISTANCE -BASED
  • Advantages
  • tests more than one tree
  • still pretty fast
  • can use empirical substitution scoring methods
  • global optimization of tree by statistical
    criteria
  • Disadvantages
  • Requires longer execution time than Neighbor
    Joining, but still quite practical on most
    computers, for typical datasets.
  • does not consider intermediate ancestors, meaning
    that there is no requirement for an
    internally-consistent evolutionary model
  • misses homoplasies, especially over long
    distances; long evolutionary distances will be
    underestimated.

85
The Neighbor-joining method
DISTANCE -BASED
Similar to Fitch-Margoliash.
The choice of which sequences to pair is determined
by a different algorithm.
Pairs sequences based on the effect of pairing on
the sum of the branch lengths of the tree.
  • The distances between the sequences are used to
    calculate the sum of branch lengths in a
    star-like tree
  • Decompose/modify the tree by combining pairs of
    sequences
  • The sum of the branch lengths of a new tree is
    calculated
  • A new distance table is made by combining A with
    B (composite sequence)
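One pairing step can be sketched as follows. Note the swap: instead of computing the total branch length of each candidate tree as the slide describes, this sketch uses the commonly used Q-criterion, which selects the same pair. The distance values come from the four-sequence distance table in the earlier additivity example; all names are hypothetical.

```python
from itertools import combinations


def nj_pick_pair(taxa, d):
    """One neighbor-joining step: choose the pair to join and the lengths
    of the two branches leading to the new node.

    d maps frozenset({x, y}) to the distance between taxa x and y.
    """
    n = len(taxa)
    # Net divergence: sum of distances from each taxon to all others.
    r = {i: sum(d[frozenset((i, j))] for j in taxa if j != i) for i in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    best = min(combinations(taxa, 2), key=q)   # ties: first pair wins
    i, j = best
    d_ij = d[frozenset(best)]
    len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    return best, q(best), (len_i, d_ij - len_i)


dist = {frozenset("AB"): 3, frozenset("AC"): 7, frozenset("AD"): 8,
        frozenset("BC"): 6, frozenset("BD"): 7, frozenset("CD"): 3}
pair, q_value, branches = nj_pick_pair(list("ABCD"), dist)
print(pair, q_value, branches)
```

In this data set the pairs (A, B) and (C, D) tie on the criterion, which is expected because both are neighbor pairs on the same tree. After a join, the distance table is rebuilt with the pair replaced by the composite node and the step repeats.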

86
The Neighbor-joining method
  • Advantages
  • fastest tree building method
  • can use empirical substitution scoring methods
  • not influenced by variations in the rates of
    change along the branches of the tree
  • Disadvantages
  • tests only a single tree
  • does not consider intermediate ancestors, meaning
    that there is no requirement for an
    internally-consistent evolutionary model
  • misses homoplasies, especially over long
    distances; long evolutionary distances will be
    underestimated.

87
DISTANCE -BASED
Unweighted pair group method with arithmetic mean
(UPGMA)
  • The rate of change along the branches of the tree
    is constant
  • Distances are approximately ultrametric
  • Simplest method
  • Can lead to the wrong tree if the rates of
    mutation are not uniform

88
Distance Matrix
89
  • dAB is the smallest distance
  • Group A and B
  • Branch length dAB/2 (here we say evolution rate
    is constant..)
  • Recalculate distances from AB to other taxa as
    average
  • d(AB)C = (dAC + dBC)/2

[Figure: A and B joined, each on a branch of length 0.15/2]
90
  • new distance matrix
  • Find smallest distance and continue as before
  • Repeat until all taxa are on tree
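The whole procedure can be sketched compactly. Only dAB = 0.15 comes from these slides; the other distances are hypothetical values chosen to be ultrametric. When a cluster already holds more than one taxon, this sketch weights the average by cluster size, the usual UPGMA convention, which reduces to the simple average on the slide when both clusters are single taxa.

```python
def upgma(names, dist):
    """Return the list of merges as (cluster, node_height) pairs.

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    Clusters are represented as nested tuples of leaf names.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        x, y = sorted(pair, key=str)
        merged = (x, y)
        merges.append((merged, d[pair] / 2))   # constant rate: height = d/2
        for k in [c for c in sizes if c not in pair]:
            dxk = d.pop(frozenset((x, k)))
            dyk = d.pop(frozenset((y, k)))
            # Average over all leaf pairs between the new cluster and k.
            d[frozenset((merged, k))] = (
                (sizes[x] * dxk + sizes[y] * dyk) / (sizes[x] + sizes[y]))
        del d[pair]
        sizes[merged] = sizes.pop(x) + sizes.pop(y)
    return merges


dist = {frozenset("AB"): 0.15,                       # from the slide
        frozenset("AC"): 0.40, frozenset("BC"): 0.40,  # hypothetical
        frozenset("AD"): 0.60, frozenset("BD"): 0.60,  # hypothetical
        frozenset("CD"): 0.60}                         # hypothetical
merges = upgma(list("ABCD"), dist)
for cluster, height in merges:
    print(cluster, round(height, 3))
```

Each merge height is half the distance between the two clusters joined, which is where the assumption of a constant rate of evolution enters.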

[Figure: resulting tree; A and B join at height dAB/2,
C joins at d(AB)C/2, D joins at d(ABC)D/2]
http://www.icp.ucl.ac.be/opperd/private/upgma.html
91
Clustering methods (UPGMA & N-J)
  • Optimality criterion: NONE. The algorithm
    itself builds the tree.
  • Advantages
  • Can be used on indirectly-measured distances
    (immunological, hybridization).
  • Distances can be corrected for unseen events.
  • The fastest of the methods available (N-J is
    screamingly fast!).
  • Can therefore analyze very large datasets
    quickly (needed for HIV, etc.).
  • Can be used for some types of rate and date
    analysis.
  • Disadvantages
  • Similarity and relationship are not necessarily
    the same thing, so clustering by similarity does
    not necessarily give an evolutionary tree.
  • Cannot be used for character analysis!
  • Have no explicit optimization criteria, so one
    cannot even know if the program worked properly
    to find the correct tree for the method.

92
GrowTree: When you run GrowTree, SeqWeb
seamlessly links together these programs (in the
order given) to perform the analysis.
1. PileUp 2. Distances 3. GrowTree
For alignment, a simplification of the
progressive alignment method of Feng & Doolittle
(1987) is used (clusters are created).
GrowTree reconstructs a phylogenetic tree
from a distance matrix such as the one
created by Distances. Two methods are
available for reconstructing the tree: UPGMA
(unweighted pair group method using
arithmetic averages; assumes the same rate of
evolution) and neighbor-joining.
NEXUS: Trees from file hum_gtr.distances
begin trees;
utree Tree_1 = ((('Gtr1_Human':18.43,'Gtr3_Human':30.18):4.34,'Gtr4_Human':24.87):3.19,('Gtr2_Human':35.98,'Gtr5_Human':74.88):3.19):0.00;
endblock;
93
The NEXUS file is your actual ToL-MacClade data
file. It is the file you edit when working with
ToL-MacClade, and it is the file you write when
choosing Save from ToL-MacClade's File menu. The
function of the NEXUS file is to store the
information necessary to build your Tree .
ToL-MacClade uses a special format, the NEXUS
format, to store your list of taxa, your
phylogenetic tree, and the information you
entered in the various windows and boxes. The
NEXUS format has been created to allow for
compatibility between a number of different
programs for phylogenetic analysis. You will be
able to view and edit NEXUS files in ToL-MacClade
or in a word processor
GCG/SEQWEB
94
New Hampshire (Newick) Format
Taxa: Human, Mouse, Drosophila, Honey bee, Fern,
Wheat, Pine
(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat, Pine)));
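The nesting of parentheses maps directly onto a nested data structure. As an illustration, the sketch below turns a tree written as nested Python tuples into a Newick string (topology only, no branch lengths); the function name is hypothetical.

```python
def to_newick(tree):
    """Serialize a nested-tuple tree (leaves are strings) as a Newick string."""
    def fmt(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(fmt(child) for child in node) + ")"
    return fmt(tree) + ";"


tree = ((("Human", "Mouse"), ("Dros", "Bee")), ("Fern", ("Wheat", "Pine")))
print(to_newick(tree))
```

Every opening parenthesis starts a group of descendants sharing a common ancestor, every closing parenthesis ends one, and the string ends with a semicolon.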
95
Q08832 P25123 P34903 P18505 o14764 P78334 Q99928
o05591 P24046 p23415
96
UPGMA
Neighbor joining
Kimura distance
Uncorrected distance
97
PAUP (phylogenetic analysis using parsimony): GCG
version 10 has an option to perform phylogenetic
analysis using distance methods
PHYLIP (phylogenetic inference package)
http://evolution.genetics.washington.edu/phylip/phylipweb.html
FITCH: estimates a phylogenetic tree assuming
additivity of branch lengths using the
Fitch-Margoliash method (no molecular clock; rates
of evolution along branches can vary)
KITSCH: the same, but with a molecular clock
NEIGHBOR: neighbor-joining method or unweighted pair
group method with arithmetic mean (UPGMA)
98
maximum likelihood method
Computationally intensive and time consuming
PHYLIP (phylogenetic inference package)
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
DNAML: allows for variable frequencies of the 4
nucleotides and for unequal rates of transitions
and transversions
DNAMLK: the same, but a molecular clock is
taken into account
99
  • There are several phylogenetics servers available
    on the Web
  • Some of these will change or disappear in the
    near future
  • These programs can be very slow, so keep your
    sample sets small
  • The Institut Pasteur, Paris has a PHYLIP server at
    http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
  • The Belozersky Institute at Moscow State
    University has its own "GeneBee" phylogenetics
    server
    http://www.genebee.msu.su/services/phtree_reduced.html
  • The Phylodendron website is a tree drawing
    program with a nice user interface and a lot of
    options; however, the output is limited to GIFs at
    72 dpi, which is not publication quality.
    http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
  • The most important factor is the quality of the
    input data
  • Use each of the three methods and compare trees
  • Different results can be obtained depending on the
    order in which sequences appear in the input file
    (jumble option)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=taxonomy
http://phylogeny.arizona.edu/tree/phylogeny.html
http://evolution.genetics.washington.edu/phylip/software.html#Plotting
100
  • Introduction to Phylogenetic Systematics,
    Peter H. Weston & Michael D. Crisp, Society of
    Australian Systematic Biologists
    http://www.science.uts.edu.au/sasb/WestonCrisp.html
  • University of California, Berkeley Museum of
    Paleontology (UCMP)
    http://www.ucmp.berkeley.edu/clad/clad4.html
  • Formats conversion
    http://www.swbic.org/products/bioinfo/transform/transform_help.php

101
Alignment in FASTA format
http://prodes.toulouse.inra.fr/multalin/multalin.html
PHYLIP server
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Distances from protein sequences (protdist)
http://bioweb.pasteur.fr/seqanal/interfaces/protdist-simple.html
Tree from the same file: outfile, then Neighbor
(neighbor joining)
protpars
Alignment in FASTA format
http://prodes.toulouse.inra.fr/multalin/multalin.html
READSEQ conversion to PHYLIP format
http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html
Parsimony
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
http://www.med.nyu.edu/rcr/nccu/phylogen-ex.txt
102
Evaluating Phylogenies
Non-random sequence order might introduce a bias
into the dataset.
We have no way of knowing whether the tree
inferred from the data is truly representative of
the evolutionary history of the gene family.
103
Evaluating Phylogenies
1. Jumbling sequence addition order
  • Most methods for phylogeny construction are
    sensitive to the order in which sequences are
    added to the tree. Consequently,  the simplest
    way to test a phylogeny is to repeat the analysis
    several times with different addition orders.
  • All PHYLIP programs, and most other phylogeny
    programs, have an option called JUMBLE, that uses
    a random number generator to choose which
    sequence to add at each step, rather than adding
    them in the order in which they appear in the
    file. The user is asked to supply a random number
    to use as a "seed" in generating a random number
    chain.
  • Therefore, even when doing only one run on a
    phylogeny, it is probably a good idea to jumble
    the order of sequences.
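The idea behind jumbling can be illustrated without any phylogeny software: shuffle the input order with a seeded random number generator so that each run is reproducible. This is only a sketch of the principle, not the code used by PHYLIP; the function name and sequence names are hypothetical.

```python
import random


def jumble(names, seed):
    """Return the names in a pseudo-random order determined by the seed."""
    rng = random.Random(seed)   # private generator, reproducible per seed
    order = list(names)
    rng.shuffle(order)
    return order


taxa = ["Human", "Mouse", "Dros", "Bee", "Fern", "Wheat", "Pine"]
for seed in (1, 3, 5):
    print(seed, jumble(taxa, seed))
```

Running the tree-building step once per shuffled order and comparing the resulting trees shows whether the result depends on input order.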

2. Bootstrap and jackknife replicates
Assumption: the statistical properties of a
sample should be similar to the statistical
properties of the population from which that
sample was drawn. The larger the sample, the
more representative it should be of the
population. Conversely, if the original sample
was large enough, it should also be possible to
take smaller samples from the larger sample, and
expect that the smaller samples would also retain
most of the statistical properties of the
original population.
104
  • Phylogenies: if we create smaller alignments
    containing only some of the positions from the
    total alignment, and use these mini-alignments to
    construct a tree, we should still get the same
    tree each time.
  • If we get the same tree each time the data is
    sampled, then we are strongly confident that all
    the data is consistent with the tree.
  • If we get a different tree with each sample, then
    no tree is strongly supported by the data.
  • Jackknife resampling has the drawback that the
    subreplicates are of a smaller size than the
    original dataset, which may change the
    statistical properties of the samples. For that
    reason, jackknife resampling has largely been
    replaced by bootstrap resampling.
  • Bootstrap resampling is sampling with
    replacement. In the case of a multiple sequence
    alignment,  sites are sampled at random until the
    dataset is equal in length to the original
    alignment.
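Both resampling schemes operate on alignment columns and can be sketched in a few lines. The toy alignment reuses the four sequences from the maximum likelihood example in these slides; the function names and the 50% jackknife fraction are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample columns with replacement until the replicate has the
    same length as the original alignment."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, keep=0.5):
    """Sample a fraction of the columns without replacement, so the
    replicate is shorter than the original alignment."""
    length = len(next(iter(alignment.values())))
    cols = sorted(rng.sample(range(length), int(length * keep)))
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


alignment = {"a": "ACGCGTTGGG", "b": "ACGCGTTGGG",
             "c": "ACGCAATGAA", "d": "ACACAGGGAA"}
rng = random.Random(42)
boot = bootstrap_alignment(alignment, rng)
jack = jackknife_alignment(alignment, rng)
print(boot)
print(jack)
```

The same column indices are applied to every sequence, so each resampled column is still a column of the original alignment.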

106
Assessing Reliability: Bootstrap
108
For bootstrap resampling of a sequence alignment,
it is best to create at least 100 bootstrapped
datasets, and redo the phylogeny for each one.
A consensus tree can then be built which
indicates, for each branch in the tree, how often
it occurs in the population of replicate samples.
Certain positions are overrepresented in each
replicate, while others are underrepresented.
However, with enough replicates, all sites will be
weighted equally.
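Counting how often each group appears across the replicate trees is what the consensus step does. The sketch below is a simplified illustration: trees are nested tuples treated as rooted, the replicate trees are made up for the example, and dedicated consensus programs handle unrooted trees and many more details.

```python
from collections import Counter


def clades(tree):
    """Set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        group = frozenset().union(*(leaves(child) for child in node))
        found.add(group)
        return group

    leaves(tree)
    return found


def clade_support(replicates):
    """Percentage of replicate trees that contain each clade."""
    counts = Counter(c for tree in replicates for c in clades(tree))
    return {c: 100 * k / len(replicates) for c, k in counts.items()}


# Hypothetical trees obtained from four bootstrap replicates.
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
]
support = clade_support(replicates)
for clade, pct in support.items():
    print(sorted(clade), pct)
```

With 100 or more replicates the percentages become the bootstrap values written on the branches of the consensus tree.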
109
Simulations have shown that "bootstrap values
greater than 70% correspond to a probability
greater than 95%"
110
PROBLEM: The disadvantage of bootstrap
resampling is that it drastically increases the
time required to construct a phylogeny.
Where large numbers of sequences must be used, it
is only practical with distance methods.
111
Are there Correct trees??
  • Despite all of these caveats, it is actually
    quite simple to use computer programs to calculate
    phylogenetic trees for data sets.
  • Provided the data are clean, outgroups are
    correctly specified, appropriate algorithms are
    chosen, no assumptions are violated, etc., can
    the true, correct tree be found and proven to be
    scientifically valid?
  • Unfortunately, it is impossible to ever
    conclusively state what is the "true" tree for a
    group of sequences (or a group of organisms);
    taxonomy is constantly under revision as new data
    are gathered.

113
Some simple practical considerations
  • The most important factor is not the method but
    the quality of the input data
  • Use each of the three methods and compare trees for
    consistency (though consistency does not mean that
    the result is statistically significant)
  • The choice of outgroup taxa can have as much
    influence on the analysis as the choice of ingroup
    taxa
  • Different answers can be obtained depending on
    the order in which sequences appear in the input
    file (jumble option)
  • Put problematic sequences at the end

114
Application of Phylogeny
  • Understanding the history of life
  • Understanding rapidly mutating viruses (like HIV)
  • Helping to predict protein/RNA structure
  • Helping to do multiple sequence alignment
  • Explaining and predicting gene expression
  • Explaining and predicting ligands
  • Helping to design enhanced organisms
  • Helping to design drugs
115
>Clostridium_perfringens
MKGIYSALLVSFDKDGNINEKGLRE
IIRHNIDVCKIDGLYVGGSTGENFMLSTDEKKRIFEIAMDEAKGQ VKLI
AQVGSVNLKEAVELAKFTTDLGYDAISAVTPFYYKFDFNEIKHYYETIIN
SVDNKLIIYSIPFLTG VNMSIEQFAELFENDKIIGVKFTAADFYLLERM
RKAFPDKLIFAGFDEMMLPATVLGVDGAIGSTFNVNG VRARQIFEAAQK
GDIETALEVQHVTNDLITDILNNGLYQTIKLILQEQGVDAGYCRQPMKEA
TEEMIAKA KEINKKYF
>Mus_musculus
MAFPKKKLRGLVAATITP
MTENGEINFPVIGQYVDYLVKEQGVKNIFVNGTTGEGLSLSVSERRQVAE
EW VNQGRNKLDQVVIHVGALNVKESQELAQHAAEIGADGIAVIAPFFFK
SQNKDALISFLREVAAAAPTLPF YYYHMPSMTGVKIRAEELLDGIQDKI
PTFQGLKFTDTDLLDFGQCVDQNHQRQFALLFGVDEQLLSALVM GATGA
VGSTYNYLGKKTNQMLEAFEQKDLASALSYQFRIQRFINYVIKLGFGVSQ
TKAIMTLVSGIPMGP PRLPLQKATQEFTAKAEAKLKSLDFLSSPSVKEG
KPLASA
>Sinorhizobium_meliloti
MKLEGIYSALLTPFSEDES
IDRQAIGALVDFQVRLGIDGVYVGGSSGEAMLQSLDERADYLSDVAAAAS
G RLTLIAHVGTIATRDALRLSQHAAKSGYQAISAIPPFYYDFSRPEVMA
HYRELADVSALPLIVYNFPART SGFTLPELVELLSHPNIIGIKHTSSDM
FQLERIRHAVPDAIVYNGYDEMCLAGFAMGAQGAIGTTYNFMG DLFVAL
RDCAAAGRIEEARRLQAMANRVIQVLIKVGVMPGSKALLGIMGLPGGPSR
RPFRKVEEADLAAL REAVAPVLAWRESTSRKSM gtBacillus_subti
lis MNFGNVSTAMITPFDNKGNVDFQKLSTLIDYLLKNGTDSLVVAGTT
GESPTLSTEEKIALFEYTVKEVNG RVPVIAGTGSNNTKDSIKLTKKAEE
AGVDAVMLVTPYYNKPSQEGMYQHFKAIAAETSLPVMLYNVPGRT VASL
APETTIRLAADIPNVVAIKEASGDLEAITKIIAETPEDFYVYSGDDALTL
PILSVGGRGVVSVASH IAGTDMQQMIKNYTNGQTANAALIHQKLLPIMK
ELFKAPNPAPVKTALQLRGLDVGSVRLPLVPLTEDER LSLSSTISEL
>Escherichia_coli_O157
MATNLRGVMAALLTPFDQQQALDKASLR
RLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA KGKIKLI
AHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAIID
SADGLPMVVYNIP ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQMEQ
IRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY NIMGWRYQGIVKALK
EGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPFG
PVDEK YLPELKALAQQLMQERG
>Pasteurella_multocida
MKN
LKGIFSALLVSFNADGSINEKGLRQIVRYNIDKMKVDGLYVGGSTGENFM
LSTEEKKEIFRIAKDEA KDEIALIAQVGSVNLQEAIELGKYATELGYDS
LSAVTPFYYKFSFPEIKHYYDSIIEATGNYMIVYSIPF LTGVNIGVEQF
GELYKNPKVLGVKFTAGDFYLLERLKKAYPNHLIWAGFDEMMLPAASLGV
DGAIGSTFN VNGVRARQIFELTQAGKLKEALEIQHVTNDLIEGILANGL
YLTIKELLKLDGVEAGYCREPMTKELSPEK VAFAKELKAKYLS
>Yersinia_pestis
MKKLTGLIAAPHTPFDEQGEVNYPVIDQIAEHLINDGV
KGVYVCGTTGEGIHCSVDERKKIAERWVNAAQ GKLSITLHTGALSIKDA
VDLSRHAETLDIFATSAIGPCFFKPGNLDDLIAYCQAIAAAAPSKGFYYY
HSG MSGVNLDMEQFLIKAESKIPNLSGIKFNNADLYEFQRCLRVSGGKF
DIPFGVDEHLPGGLAVGAIGAVGS TYNYAAPLFHKIIADFNAGDQVAVQ
RGMDHVIALIRVLVEFGGVAAGKAAMQLHGIDAGNPRLPLRALTK EQKQ
TVVNRMRDAITLQ
>E._coli
MATNLRGVMAALLTPFDQQQALDKASL
RRLVQFNIQQGIDGLYVGGSTGEAFVQSLSEREQVLEIVAEEA KGKIKL
IAHVGCVSTAESQQLAASAKRYGFDAVSAVTPFYYPFSFEEHCDHYRAII
DSADGLPMVVYNIP ALSGVKLTLDQINTLVTLPGVGALKQTSGDLYQME
QIRREHPDLVLYNGYDEIFASGLLAGADGGIGSTY NIMGWRYQGIVKAL
KEGDIQTAQKLQTECNKVIDLLIKTGVFRGLKTVLHYMDVVSVPLCRKPF
GPVDEK YLPELKALAQQLMQERG
>Vibrio_cholerae
MKKLTGLI
AAPHTPFTKDNKVNFAAIDQIAELLIEQGVKGAYVCGTTGEGIHCSVEER
KAIAERWVKAVD GKLDVILHTGALSIVDTINLTEHAETLDIFATSAIGP
CFFKPGSVDDLVEYCAQVAAAAPSKGFYYYHSG MSGVNLDLEQFLIKGE
QRIPNLYGAKFNNADLYEYQRCVRVSNRKFDIPFGVDEFLPAGLAVGAVG
AVGS TYNYAAPLYLKIIEAFNHGKHDEVAALMDKVIAIIRVLVEYGGVA
AGKVAMQLHGIDAGDPRLPIRSLND KQKADVLAKMRDAGFLSI
>Homo_sapiens
MAFPKKKLQGLVAATITPMTENGEINFSVIGQYVDYLVKEQ
GVKNIFVNGTTGEGLSLSVSERRQVAEEW VTKGKDKLDQVIIHVGALSL
KESQELAQHAAEIGADGIAVIAPFFLKPWTKDILINFLKEVAAAAPALPF
YYYHIPALTGVKIRAEELLDGILDKIPTFQGLKFSDTDLLDFGQCVDQN
RQQQFAFLFGVDEQLLSALVM GATGAVGSTYNYLGKKTNQMLEAFEQKD
FSLALNYQFCIQRFINFVVKLGFGVSQTKAIMTLVSGIPMGP PRLPLQK
ASREFTDSAEAKLKSLDFLSFTDLKDGNLEAGS
>Neisseria_meningitidis
MLQGSLVALITPMNQDGSIHYEQLRDLIDWHIENGTDGIVAV
GTTGESATLSVEEHTAVIEAVVKHVAKR VPVIAGTGANNTVEAIALSQA
AEKAGADYTLSVVPYYNKPSQEGMYRHFKAVAEAAAIPMILYNVPGRTV
VSMNNETILRLAEIPNIVGVKEASGNIGSNIELINRAPEGFVVLSGD