Provided by: hymenopt
1
Ento/Gene 606: Distance Methods I
(Another Place, by Chris Howells)
2
Phenetics or Numerical Taxonomy
  • Joseph Camin (1970s)
  • Robert Sokal (1964)
  • Peter Sneath (1959)

3
Phenetics
  • Hypotheses of phylogenetic relationships are
    unnecessary to the process of classification
  • In fact, they interfere with the process by
    introducing uncertainty and subjective decisions

4
Paul Ehrlich (1965)
  • Into which group of taxonomists do you wish to be
    classified?
  • Members of the old school, who would like to
    see a pinch of phylogenetic specialization mixed
    into their basic data (presumably for sentimental
    reasons)
  • or
  • Those who wish to look forward

5
Phenetic Methods
  • Group taxa based on overall similarity
  • Do not distinguish between ancestral and derived
    similarity
  • One common methodology
      • Compute pairwise distances between taxa
      • Cluster similar taxa together using clustering
        algorithms

6
Why use phenetic methods?
  • Some types of data can only be compared using
    distances
      • Immunological distances
      • DNA annealing/melting points
  • Distance methods are fast
      • Analyze large data sets

7
Measures of Distance (Dissimilarity)
  • Continuous, quantitative characters
  • DNA sequence data
  • Allozymes, microsatellites, RFLP, AFLP, etc.

8
Properties of an ideal distance measure
1. Symmetry
  • $D_{ij} = D_{ji}$
  • Distance from i to j is the same as distance computed
    from j to i

9
Properties of an ideal distance measure
2. Distinguishability
  • If $D_{ij} > 0$, then i is not the same as j
  • If $D_{ij} = 0$, then i is the same as j

10
Properties of an ideal distance measure
3. Satisfies Triangle Inequality
  • $D_{ik} \le D_{ij} + D_{jk}$

Why is this important?
11
3. Triangle Inequality
  • Distance measures that satisfy the triangle
    inequality are called metrics
  • Failure of the triangle inequality implies that the
    distance measure can generate negative distances
    (for example, a negative branch length when three
    taxa are fitted to a tree)
  • Negative distances are an artifact of the way in
    which distance is computed
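
The three properties above can be checked mechanically on any square distance matrix. The sketch below is only illustrative: the function name `check_distance_properties`, the tolerance, and both example matrices are assumptions, not part of the lecture.

```python
from itertools import permutations

def check_distance_properties(D, tol=1e-12):
    """Report which of the three properties a square distance matrix satisfies."""
    n = len(D)
    # 1. Symmetry: D[i][j] equals D[j][i] for every pair.
    symmetric = all(abs(D[i][j] - D[j][i]) <= tol
                    for i in range(n) for j in range(n))
    # 2. Distinguishability: zero on the diagonal, positive everywhere else.
    distinguishable = all(
        (abs(D[i][j]) <= tol) if i == j else (D[i][j] > tol)
        for i in range(n) for j in range(n))
    # 3. Triangle inequality: D[i][k] <= D[i][j] + D[j][k] for all triples.
    triangle = all(D[i][k] <= D[i][j] + D[j][k] + tol
                   for i, j, k in permutations(range(n), 3))
    return {"symmetry": symmetric,
            "distinguishability": distinguishable,
            "triangle_inequality": triangle}

# Hypothetical matrices for three O.T.U.s (values invented for illustration).
metric_like = [[0.0, 0.1, 0.3],
               [0.1, 0.0, 0.2],
               [0.3, 0.2, 0.0]]
not_metric = [[0.0, 0.1, 0.5],   # 0.5 > 0.1 + 0.2, so the triangle fails
              [0.1, 0.0, 0.2],
              [0.5, 0.2, 0.0]]

print(check_distance_properties(metric_like))
print(check_distance_properties(not_metric))
```

In the second matrix, fitting the three taxa to a tree would give the middle taxon a branch length of (0.1 + 0.2 - 0.5) / 2, which is negative.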

12
Nei's unbiased genetic distance
  • Pairwise distances between O.T.U.s based on
    allele frequencies

13
Nei's Genetic Distance
  • Estimated accumulated number of allele
    substitutions per locus
  • 0 = all alleles at identical frequencies in both
    populations
  • 1 = alleles have been replaced at each locus
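
The slides do not give a formula, so the sketch below assumes Nei's standard (1972) form, D = -ln(I) with I = J_xy / sqrt(J_x J_y), rather than the unbiased estimator with its sample-size correction. The function name and the allele frequencies are hypothetical.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance between two populations.

    freqs_x, freqs_y: one list per locus, each holding allele frequencies
    in the same allele order for both populations (an assumed input layout).
    """
    jx = jy = jxy = 0.0
    for locus_x, locus_y in zip(freqs_x, freqs_y):
        jx += sum(p * p for p in locus_x)                     # homozygosity in X
        jy += sum(q * q for q in locus_y)                     # homozygosity in Y
        jxy += sum(p * q for p, q in zip(locus_x, locus_y))   # shared identity
    if jxy == 0:
        return math.inf   # no alleles shared at any locus
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)

# Hypothetical frequencies: two loci, two alleles each.
pop_a = [[0.5, 0.5], [0.9, 0.1]]
pop_b = [[0.5, 0.5], [0.9, 0.1]]
pop_c = [[1.0, 0.0], [0.2, 0.8]]

print(nei_standard_distance(pop_a, pop_b))   # identical frequencies: zero up to rounding
print(nei_standard_distance(pop_a, pop_c))   # diverged frequencies: positive
```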

14
Nei's genetic distance (Hillis, 1984, Systematic
Zoology 33: 238-240)
  • Assumes all loci are evolving at equal rates
  • Hillis (1984) published a correction for this
    assumption
  • Assumes all loci are in genetic equilibrium
  • Is not a metric distance in some cases!

15
Other Measures of Genetic Distance
  • Rogers' distance
  • Acts as a metric
  • Microsatellites
  • Stepwise mutation model
  • Infinite alleles model
  • RFLP data

17
DNA distances
  • Uncorrected $P$ = apparent substitutions / total nucleotides

1  a c g t t c g a c g
2  a c a t t c g a c g
3  a c a c c c g a c g

$P_{12} = 1/10 = 0.10$
$P_{13} = 3/10 = 0.30$
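
As a minimal sketch, the uncorrected P calculation for the three sequences on this slide can be written as follows. The function name `p_distance` is an assumption, and the code assumes the sequences are already aligned and contain no gaps.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected P: proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

# The three sequences from the slide, with spaces removed.
seqs = {
    1: "acgttcgacg",
    2: "acattcgacg",
    3: "acacccgacg",
}

print(p_distance(seqs[1], seqs[2]))   # one difference in ten sites
print(p_distance(seqs[1], seqs[3]))   # three differences in ten sites
```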
20
Saturation of gene sequences
A a g t c t c c a g g t g c a c g t c t t
B a g t c c c c a g g t g c a c g t c t t
C a g t c t c c a g g t g c a c g t c t t
D a g t c a c c a g g t g c a c g t c t t
23
(Figure: tree relating sequences A-D, with successive
substitutions at the same site: t > c, c > t, t > a)
24
Page and Holmes (1998) Molecular Evolution
26
(Galewski et al. 2006, BMC Evol. Biol. 6: 80)

27
Corrected DNA distances
  • Jukes-Cantor: $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$,
    where $D$ is the uncorrected distance
  • Assumes equal rates of substitution between all
    nucleotides
  • Assumes equal nucleotide frequencies
  • One correction to overall distance
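
A minimal sketch of the Jukes-Cantor correction, applied to uncorrected distances of 0.10 and 0.30. The function name is an assumption. The correction is undefined once the uncorrected distance reaches 0.75, so the sketch raises an error there.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance from an uncorrected proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4.0 * p / 3.0)

for p in (0.10, 0.30):
    # The corrected value exceeds p, and the gap widens as p grows,
    # reflecting the hidden multiple substitutions at saturated sites.
    print(p, round(jukes_cantor(p), 4))
```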

28
Kimura 2-Parameter Distance (K2P)
  • Two rates of substitutions, for transitions and
    transversions
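
The slide does not show the K2P formula. The sketch below assumes the usual form, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P is the proportion of sites showing a transition and Q the proportion showing a transversion. The function name is hypothetical, and the code assumes aligned sequences containing only a, c, g and t.

```python
import math

PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}

def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq_a.lower(), seq_b.lower()):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition;
        # a change between the two classes is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    P = transitions / n
    Q = transversions / n
    # math.log raises ValueError if the sequences are too divergent.
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# Two ten-base sequences that differ by three transitions and no transversions.
print(round(k2p_distance("acgttcgacg", "acacccgacg"), 4))
```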

29
Generalized Time Reversible Distance (GTR)
  • Six different substitution rates
  • Backward rates are the same as forward rates

30
Substitution rates
  • All rates equal (essentially the Jukes-Cantor
    model)
  • Transitions have a different rate than
    transversions
  • Two different kinds of transversions (this gives
    us three rates)
  • Maximally, all types of substitutions are allowed
    to have their own rate
  • But note that substitution rates are not
    independent of base frequencies in the data!

31
Base Frequencies
  • For example, insect mtDNA is A-T rich
      • much more adenine and thymine
      • much less cytosine and guanine
  • What does this do to frequencies of base changes?

34
Base Frequencies
  • We can assume they are all equal
  • We can count them in the data matrix
  • But this will not include the base frequencies at
    all the internal nodes of the tree (HTUs) once
    it is constructed
  • We can build the tree, step by step, changing the
    estimates of base frequencies as we go
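
Counting base frequencies in the data matrix, the second option above, might look like the sketch below. The function name and the small alignment are illustrative assumptions. As the slide notes, such counts cover only the observed terminal sequences, not the internal nodes of the tree.

```python
from collections import Counter

def base_frequencies(sequences):
    """Empirical frequencies of a, c, g, t pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore gaps and ambiguity codes; count only the four bases.
        counts.update(base for base in seq.lower() if base in "acgt")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "acgt"}

# Small illustrative alignment (three sequences, ten sites each).
alignment = ["acgttcgacg", "acattcgacg", "acacccgacg"]
freqs = base_frequencies(alignment)
for base, freq in freqs.items():
    print(base, round(freq, 3))
```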

35
Substitution Rate Models
  • GTR = general time reversible
  • JC = Jukes-Cantor
  • Relax each assumption to move to the model below

38
Matrix of Distances
39
Clustering Methods
  • Visualize as much information as possible from the
    matrix of distances or associations in a
    two-dimensional scheme
  • Order O.T.U.s into connected groups

43
Clustering Methods
  • Divisive Methods
      • Start with all O.T.U.s in one group
      • Divide into two groups
      • Continue until only individuals are left

(Figure: O.T.U.s A-G start in a single group, which is
split step by step into smaller groups)
  • Agglomerative Methods
      • Start with individual O.T.U.s
      • Join the two most similar O.T.U.s to form the
        first cluster
      • Find which O.T.U. is most similar to the cluster,
        or which two O.T.U.s are most similar to each other
      • Join to form the second cluster
      • Continue until all O.T.U.s have been clustered

(Figure: O.T.U.s joined step by step into nested clusters)
52
  • Hierarchical
      • Members of a lower-level cluster are also members
        of higher-level clusters
  • Non-Overlapping
      • An O.T.U. is a member of only one cluster at a
        particular level of similarity or distance

53
Agglomerative Clustering Methods
  • Start with distance or association matrix
  • Each method has a rule to join the most similar
    O.T.U.s to form clusters
  • Each method has rules to determine the distance of
    unclustered O.T.U.s to existing clusters
  • Each method has rules to determine the distance
    between two clusters

55
Single Linkage or Nearest Neighbor Method
  • $D_{a(b,c)} = \min(D_{ab}, D_{ac})$
  • $D_{(a,b)(c,d)} = \min(D_{ac}, D_{bc}, D_{ad}, D_{bd})$

(Figure: single-linkage example with O.T.U.s A-F)
57
Complete Linkage or Furthest Neighbor Method
  • $D_{a(b,c)} = \max(D_{ab}, D_{ac})$
  • $D_{(a,b)(c,d)} = \max(D_{ac}, D_{bc}, D_{ad}, D_{bd})$

(Figure: complete-linkage example with O.T.U.s A-F)
58
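Averaging Methods
  • $D_{a(b,c)} = \operatorname{average}(D_{ab}, D_{ac})$
  • $D_{(a,b)(c,d)} = \operatorname{average}(D_{ac}, D_{bc}, D_{ad}, D_{bd})$
  • $D_{a(b,c,d)} = \frac{1}{3}(D_{ab} + D_{ac} + D_{ad})$

(Figure: averaging example with O.T.U.s A-D)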
59
Averaging Methods
  • $D_{(e,f)(b,c,d)} = \frac{1}{6}(D_{eb} + D_{ec} + D_{ed} + D_{fb} + D_{fc} + D_{fd})$

(Figure: averaging example with O.T.U.s A-F)
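
The single-linkage, complete-linkage and averaging rules differ only in how they summarize the pairwise distances between the members of two clusters. The sketch below makes that explicit. The function name, the dictionary layout keyed by unordered pairs, and the distance values are all assumptions made for illustration.

```python
def cluster_distance(D, cluster_x, cluster_y, rule="single"):
    """Distance between two clusters under one of the three linkage rules.

    D maps frozenset({i, j}) to the distance between O.T.U.s i and j.
    """
    pairwise = [D[frozenset((i, j))] for i in cluster_x for j in cluster_y]
    if rule == "single":      # nearest neighbor
        return min(pairwise)
    if rule == "complete":    # furthest neighbor
        return max(pairwise)
    if rule == "average":     # mean of all between-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical distances among four O.T.U.s.
D = {
    frozenset("ab"): 2.0, frozenset("ac"): 4.0, frozenset("ad"): 6.0,
    frozenset("bc"): 3.0, frozenset("bd"): 5.0, frozenset("cd"): 1.0,
}

for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(D, ("a", "b"), ("c", "d"), rule))
```

For clusters (a,b) and (c,d) the four between-cluster distances are 4, 3, 6 and 5, so the three rules give 3, 6 and 4.5 respectively.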
60
UPGMA clustering
  • Unweighted Pair Group Method using Arithmetic
    Averages
  • Agglomerative method
  • Computes distances between clusters, and between
    clusters and O.T.U.s, as averages
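
A compact sketch of UPGMA under these rules follows. It recomputes every cluster-to-cluster average from the original pairwise distances at each step, which is simple rather than efficient. The function names, the input layout and the distance values are assumptions.

```python
from itertools import combinations

def average_distance(D, x, y):
    """Mean of all pairwise distances between members of clusters x and y."""
    return sum(D[frozenset((i, j))] for i in x for j in y) / (len(x) * len(y))

def upgma(names, D):
    """Return the list of merges as (cluster_x, cluster_y, joining_distance)."""
    clusters = [(name,) for name in names]   # start with individual O.T.U.s
    merges = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(D, pair[0], pair[1]))
        merges.append((x, y, average_distance(D, x, y)))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
    return merges

# Hypothetical distances among four O.T.U.s.
D = {
    frozenset("ab"): 2.0, frozenset("ac"): 4.0, frozenset("ad"): 6.0,
    frozenset("bc"): 3.0, frozenset("bd"): 5.0, frozenset("cd"): 1.0,
}

merges = upgma(["a", "b", "c", "d"], D)
for x, y, d in merges:
    print(x, "+", y, "joined at distance", d)
```

The joining distances give the levels at which clusters would be drawn in a dendrogram; many implementations place each node at half the joining distance.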

61
(Figure: two diagrams of O.T.U.s A-G, each listed in the
order E, F, A, C, D, B, G)
62
Dendrogram or Phenogram
  • Graphical display of clustering algorithm
  • Branch lengths of clusters in units of similarity
    or distance