Title: EntoGene 606 Distance Methods I
1 Ento/Gene 606: Distance Methods I
(Another Place, by Chris Howells)
2 Phenetics or Numerical Taxonomy
Joseph Camin (1970s)
Robert Sokal (1964)
3 Phenetics
- Hypotheses of phylogenetic relationships are unnecessary to the process of classification
- In fact, they interfere with the process by introducing uncertainty and subjective decisions
4 Paul Ehrlich (1965)
- Into which group of taxonomists do you wish to be classified?
- Members of the old school, who would like to see a pinch of phylogenetic specialization mixed into their basic data (presumably for sentimental reasons), or
- Those who wish to look forward
5 Phenetic Methods
- Group taxa based on overall similarity
- Do not distinguish between ancestral and derived similarity
- One common methodology
- Compute pairwise distances between taxa
- Cluster similar taxa together using clustering algorithms
6 Why use phenetic methods?
- Some types of data can only be compared using distances
- Immunological distances
- DNA annealing/melting points
- Distance methods are fast
- Analyze large data sets
7 Measures of Distance (Dissimilarity)
- Continuous, quantitative characters
- DNA sequence data
- Allozymes, microsatellites, RFLP, AFLP, etc.
8 Properties of an ideal distance measure: 1. Symmetry
- $D_{ij} = D_{ji}$
- Distance from i to j is the same as the distance computed from j to i
9 Properties of an ideal distance measure: 2. Distinguishability
- If $D_{ij} > 0$, then i is not the same as j
- If $D_{ij} = 0$, then i is the same as j
10 Properties of an ideal distance measure: 3. Satisfies Triangle Inequality
- $D_{ij} \le D_{ik} + D_{kj}$ for any third taxon k
Why is this important?
11 3. Triangle Inequality
- Distance measures that satisfy triangle inequality are called Metrics
- Failure of triangle inequality implies that the distance measure is generating negative distances
- Negative distances are an artifact of the way in which distance is computed
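The three properties can be checked mechanically on any square distance matrix. The following is a minimal sketch, not part of the original slides; the function name `check_metric`, the tolerance, and the example matrices are assumptions chosen for illustration.

```python
from itertools import permutations


def check_metric(D, tol=1e-12):
    """Report which of the three properties a square distance matrix D satisfies.

    D is a list of lists; D[i][j] is the distance from taxon i to taxon j.
    Distinct indices are assumed to be distinct taxa.
    """
    idx = range(len(D))
    # 1. Symmetry: D[i][j] equals D[j][i] for every pair.
    symmetry = all(abs(D[i][j] - D[j][i]) <= tol for i in idx for j in idx)
    # 2. Distinguishability: zero on the diagonal, positive everywhere else.
    distinguishability = all(
        (abs(D[i][j]) <= tol) if i == j else (D[i][j] > 0)
        for i in idx for j in idx
    )
    # 3. Triangle inequality: going through a third taxon k is never shorter.
    triangle = all(
        D[i][j] <= D[i][k] + D[k][j] + tol
        for i, j, k in permutations(idx, 3)
    )
    return {
        "symmetry": symmetry,
        "distinguishability": distinguishability,
        "triangle": triangle,
    }


# Hypothetical matrices: the second violates the triangle inequality (5 > 1 + 1).
metric_like = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
non_metric = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(check_metric(metric_like))
print(check_metric(non_metric))
```

A measure that passes all three checks on every data set behaves as a metric; failing the third check on even one triplet is enough to rule that out.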
12 Nei's unbiased genetic distance
- Pairwise distances between O.T.U.s based on allele frequencies
13 Nei's Genetic Distance
- Estimated accumulated number of allele substitutions per locus
- 0 = all alleles in identical frequencies in both populations
- 1 = alleles have been replaced at each locus
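The transcript does not preserve the formula itself. As a hedged illustration, the sketch below computes Nei's (1972) standard genetic distance, $D = -\ln I$ with $I = J_{xy} / \sqrt{J_x J_y}$, directly from allele frequencies. This is the uncorrected form, not the unbiased small-sample version named on the earlier slide; the function name and the example frequencies are assumptions.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's (1972) standard genetic distance between two populations.

    pop_x and pop_y are lists of loci; each locus is a list of allele
    frequencies, with alleles in the same order in both populations.
    """
    n_loci = len(pop_x)
    # Average homozygosity-like terms over loci.
    j_x = sum(sum(p * p for p in locus) for locus in pop_x) / n_loci
    j_y = sum(sum(q * q for q in locus) for locus in pop_y) / n_loci
    j_xy = sum(
        sum(p * q for p, q in zip(locus_x, locus_y))
        for locus_x, locus_y in zip(pop_x, pop_y)
    ) / n_loci
    identity = j_xy / math.sqrt(j_x * j_y)
    if identity == 0:
        return math.inf  # no shared alleles at any locus
    return -math.log(identity)


# Hypothetical frequencies: two loci, two alleles each.
same = [[0.5, 0.5], [1.0, 0.0]]
print(nei_standard_distance(same, same))            # identical frequencies
print(nei_standard_distance([[1.0, 0.0]], [[0.5, 0.5]]))
```

Identical allele frequencies give a distance of zero, matching the first interpretation point on the slide.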
14 Nei's genetic distance (Hillis, 1984, Systematic Zoology 33: 238-240)
- Assumes all loci evolving at equal rates
- Hillis (1984) published correction for this assumption
- Assumes all loci in genetic equilibrium
- Is not a metric distance in some cases!
15 Other Measures of Genetic Distance
- Rogers
- Acts as a metric
- Microsatellites
- Stepwise mutation model
- Infinite alleles model
- RFLP data
16 DNA distances
- Uncorrected $P$ = apparent substitutions / total nucleotides

1  a c g t t c g a c g
2  a c a t t c g a c g
3  a c a c c c g a c g

$P_{12} = 1/10 = 0.10$
$P_{13} = 3/10 = 0.30$
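The uncorrected distance is simply the proportion of aligned sites that differ. A minimal sketch using the three sequences from the slide (the function name `p_distance` is an assumption; gaps and ambiguity codes are not handled):

```python
def p_distance(seq_a, seq_b):
    """Uncorrected P: differing sites divided by total aligned sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# The three aligned sequences shown on the slide.
seq1 = "acgttcgacg"
seq2 = "acattcgacg"
seq3 = "acacccgacg"

print(p_distance(seq1, seq2))  # one difference in ten sites
print(p_distance(seq1, seq3))  # three differences in ten sites
```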
18 Saturation of gene sequences
A a g t c t c c a g g t g c a c g t c t t
B a g t c c c c a g g t g c a c g t c t t
C a g t c t c c a g g t g c a c g t c t t
D a g t c a c c a g g t g c a c g t c t t
21 [Figure: taxa A, B, C, D with successive substitutions at the same site: t -> c, c -> t, t -> a]
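Sequences A to D on the saturation slides differ only at the fifth site, where the figure lists three successive changes. The sketch below counts observed differences between the sequences and compares them with that list. Reading the figure as a single chain of changes from A through B and C to D is an assumption, as are the variable and function names.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Sequences A-D as shown on the slides (20 sites each).
sequences = {
    "A": "agtctccaggtgcacgtctt",
    "B": "agtccccaggtgcacgtctt",
    "C": "agtctccaggtgcacgtctt",
    "D": "agtcaccaggtgcacgtctt",
}

# Substitutions listed in the figure, assumed to occur in this order.
substitutions = ["t->c", "c->t", "t->a"]

observed_a_c = count_differences(sequences["A"], sequences["C"])
observed_a_d = count_differences(sequences["A"], sequences["D"])
print("substitutions listed:", len(substitutions))
print("observed differences A vs C:", observed_a_c)
print("observed differences A vs D:", observed_a_d)
```

Under that reading, the observed differences undercount the changes that actually happened, which is the motivation for corrected distances.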
24 Page and Holmes (1998) Molecular Evolution
26 (Galewski et al., BMC Evol. Biol. 2006, 6: 80)
27 Corrected DNA distances
- Jukes-Cantor: $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$, where D is the uncorrected distance (P)
- Assumes equal rates of substitution between all nucleotides
- Assumes equal nucleotide frequencies
- One correction to overall distance
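A minimal sketch of the Jukes-Cantor correction applied to an uncorrected distance. The function name is an assumption; the input values 0.10 and 0.30 are the uncorrected distances from the earlier DNA distances slide.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from an uncorrected proportion p.

    The correction is undefined once p reaches 0.75, the level expected
    between two random sequences with equal base frequencies.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


for p in (0.10, 0.30):
    print(p, round(jukes_cantor(p), 4))
```

The corrected value is always at least as large as the uncorrected one, and the gap widens as sequences become more divergent.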
28 Kimura 2-Parameter Distance (K2P)
- Two rates of substitution, one for transitions and one for transversions
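The slide does not give the formula. The sketch below uses the standard K2P expression, $d = -\frac{1}{2} \ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$, where P and Q are the proportions of sites showing transitions and transversions. Function and variable names are assumptions, and only the four unambiguous bases are handled.

```python
import math

PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a.lower(), seq_b.lower()):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_a)
    q = transversions / len(seq_a)
    # math.log raises ValueError if the sequences are too divergent.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Sequences 1-3 from the uncorrected-distance example.
seq1 = "acgttcgacg"
seq2 = "acattcgacg"
seq3 = "acacccgacg"
print(round(k2p_distance(seq1, seq2), 4))
print(round(k2p_distance(seq1, seq3), 4))
```

All differences in these example sequences happen to be transitions, so Q is zero in both comparisons.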
29 Generalized Time Reversible Distance (GTR)
- Six different substitution rates
- Backward rates are same as forward rates
30 Substitution rates
- All rates equal (essentially the Jukes-Cantor model)
- Transitions have different rate than transversions
- Two different kinds of transversions (this gives us three rates)
- Maximally, all types of substitutions are allowed to have their own rate
- But note that substitution rates are not independent of base frequencies in the data!
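One common way to express these nested models is a rate matrix in which the rate of change from base i to base j is an exchangeability for the pair multiplied by the frequency of the target base j. The sketch below assumes that parameterization; the function name, the dictionary layout, and the example values are illustrative, and the matrix is left unscaled.

```python
BASES = "acgt"


def rate_matrix(exchange, freqs):
    """Build a time-reversible rate matrix as nested dicts Q[i][j].

    exchange maps each unordered base pair ("ac", "ag", "at", "cg", "ct",
    "gt") to a relative rate; freqs maps each base to its frequency.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))
                row[j] = exchange[pair] * freqs[j]
        row[i] = -sum(row.values())  # each row sums to zero
        q[i] = row
    return q


pairs = ["ac", "ag", "at", "cg", "ct", "gt"]
equal_rates = {pair: 1.0 for pair in pairs}

# All rates and frequencies equal: essentially the Jukes-Cantor model.
jc = rate_matrix(equal_rates, {b: 0.25 for b in BASES})

# Same rates, but hypothetical A-T rich frequencies.
at_rich = rate_matrix(equal_rates, {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4})
print(jc["a"])
print(at_rich["c"]["a"], at_rich["a"]["c"])
```

With A-T rich frequencies, changes toward a or t get higher rates than changes toward c or g even though the six exchangeabilities are identical, which is the dependence the last bullet warns about.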
31 Base Frequencies
- For example, insect mtDNA is A-T rich
- much more adenine and thymine
- much less cytosine and guanine
- What does this do to frequencies of base changes?
34 Base Frequencies
- We can assume they are all equal
- We can count them in the data matrix
- But this will not include the base frequencies at all the internal nodes of the tree (HTUs) once it is constructed
- We can build the tree, step by step, changing the estimates of base frequencies as we go
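Counting base frequencies in the data matrix, the second option above, can be sketched as follows. The function name is an assumption, and the example reuses the three short sequences from the uncorrected-distance slide; characters other than a, c, g, t are ignored.

```python
from collections import Counter


def base_frequencies(sequences):
    """Empirical base frequencies pooled over all sequences in the matrix."""
    counts = Counter(
        base for seq in sequences for base in seq.lower() if base in "acgt"
    )
    total = sum(counts.values())
    return {base: counts[base] / total for base in "acgt"}


data_matrix = ["acgttcgacg", "acattcgacg", "acacccgacg"]
freqs = base_frequencies(data_matrix)
print(freqs)
```

These counts describe only the observed terminal sequences, so they say nothing about the internal nodes mentioned in the third bullet.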
35 Substitution Rate Models
- GTR: general time reversible
- JC: Jukes-Cantor
- Relax each assumption to move to the model below
38 Matrix of Distances
39 Clustering Methods
- Visualize as much as possible of the information in a matrix of distances or associations in a two-dimensional scheme
- Order O.T.U.s into connected groups
40 Clustering Methods
- Divisive Methods
- Start with all O.T.U.s in one group
- Divide into two groups
- Continue until only individuals are left
[Figure: O.T.U.s A-G, all in one group, are divided into two groups, which are divided again in turn (etc.)]
44 Clustering Methods
- Agglomerative Methods
- Start with individual O.T.U.s
- Join two most similar O.T.U.s to form 1st cluster
- Find which O.T.U. is most similar to cluster, or which two O.T.U.s are most similar to each other
- Join to form second cluster
- Continue until all O.T.U.s have been clustered
[Figure: O.T.U.s A-G joined step by step into clusters]
52 - Hierarchical
- Members of lower level cluster also members of higher level clusters
- Non-Overlapping
- An O.T.U. is a member of only one cluster at a particular level of similarity or distance
53 Agglomerative Clustering Methods
- Start with distance or association matrix
- Each method has a rule to join the most similar O.T.U.s to form clusters
- Each method has rules to determine the distance of unclustered O.T.U.s to existing clusters
- Each method has rules to determine the distance between two clusters
54 Single Linkage or Nearest Neighbor Method
- $D_{a(b,c)} = \min(D_{ab}, D_{ac})$
- $D_{(a,b)(c,d)} = \min(D_{ac}, D_{bc}, D_{ad}, D_{bd})$
[Figure: O.T.U.s A-F grouped into clusters]
56 Complete Linkage or Furthest Neighbor Method
- $D_{a(b,c)} = \max(D_{ab}, D_{ac})$
- $D_{(a,b)(c,d)} = \max(D_{ac}, D_{bc}, D_{ad}, D_{bd})$
[Figure: O.T.U.s A-F grouped into clusters]
58 Averaging Methods
- $D_{a(b,c)} = \text{average}(D_{ab}, D_{ac})$
- $D_{(a,b)(c,d)} = \text{average}(D_{ac}, D_{bc}, D_{ad}, D_{bd})$
- Example: $D_{a(b,c,d)} = \frac{1}{3}(D_{ab} + D_{ac} + D_{ad})$
- Example: $D_{(e,f)(b,c,d)} = \frac{1}{6}(D_{eb} + D_{ec} + D_{ed} + D_{fb} + D_{fc} + D_{fd})$
[Figure: O.T.U.s A-F grouped into clusters]
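The single linkage, complete linkage, and averaging rules differ only in how the pairwise distances between members of two clusters are summarized. A minimal sketch follows; the function name, the `frozenset`-keyed distance table, and the numeric distances are hypothetical choices for illustration.

```python
def cluster_distance(distances, cluster_1, cluster_2, rule="average"):
    """Distance between two clusters under a chosen linkage rule.

    distances maps frozenset({x, y}) to the pairwise distance between
    O.T.U.s x and y; clusters are sequences of O.T.U. names.
    """
    pairwise = [distances[frozenset((x, y))] for x in cluster_1 for y in cluster_2]
    if rule == "single":      # nearest neighbor
        return min(pairwise)
    if rule == "complete":    # furthest neighbor
        return max(pairwise)
    if rule == "average":     # mean of all between-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("rule must be 'single', 'complete', or 'average'")


# Hypothetical pairwise distances among O.T.U.s a, b, c.
distances = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "c")): 4.0,
    frozenset(("b", "c")): 1.0,
}
for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(distances, ["a"], ["b", "c"], rule))
```

A single unclustered O.T.U. is treated as a cluster of size one, so the same function covers both the O.T.U.-to-cluster and cluster-to-cluster cases.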
60 UPGMA clustering
- Unweighted Pair Group Method using Arithmetic Averages
- Agglomerative method
- Computes distances between clusters and between clusters and O.T.U.s as averages
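A compact sketch of UPGMA as an agglomerative loop is shown here. It is one possible implementation under stated assumptions, not the course's own code: the function name, the data layout, and the four-taxon distances are hypothetical. Each merge records the two clusters joined and the distance at which they were joined.

```python
def upgma(labels, distances):
    """Cluster O.T.U.s by UPGMA; return the list of merges in order.

    distances maps frozenset({x, y}) to the distance between O.T.U.s x and y.
    Each merge is (cluster_1, cluster_2, joining_distance), with clusters
    given as tuples of O.T.U. names.
    """
    sizes = {(name,): 1 for name in labels}
    dist = {
        frozenset(((a,), (b,))): distances[frozenset((a, b))]
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }
    merges = []
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)          # most similar pair of clusters
        c1, c2 = sorted(pair)
        merges.append((c1, c2, dist[pair]))
        new = tuple(sorted(c1 + c2))
        n1, n2 = sizes.pop(c1), sizes.pop(c2)
        for other in sizes:
            d1 = dist.pop(frozenset((c1, other)))
            d2 = dist.pop(frozenset((c2, other)))
            # Size-weighted mean equals the plain average over all O.T.U. pairs.
            dist[frozenset((new, other))] = (n1 * d1 + n2 * d2) / (n1 + n2)
        del dist[pair]
        sizes[new] = n1 + n2
    return merges


# Hypothetical distances among four O.T.U.s.
labels = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("A", "D")): 8.0,
    frozenset(("B", "C")): 4.0,
    frozenset(("B", "D")): 8.0,
    frozenset(("C", "D")): 8.0,
}
merges = upgma(labels, distances)
for merge in merges:
    print(merge)
```

The sequence of merges and joining distances is the information a dendrogram displays, with each cluster drawn at the level at which it formed.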
62 Dendrogram or Phenogram
- Graphical display of clustering algorithm
- Branch lengths of clusters in units of similarity or distance