Title: Introduction to
1Introduction to
Bioinformatics
2Introduction to Bioinformatics.
LECTURE 5 Variation within and between
species Chapter 5 Are Neanderthals among
us?
3Neandertal, Germany, 1856
Initial interpretations: bear skull, pathological idiot, Old Dutchman ...
4Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
7Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
- 5.1 Variation in DNA sequences
- Even closely related individuals differ in genetic sequences
- (Point) mutations: copy error at a certain location
- Sexual reproduction: diploid genome
8Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Diploid chromosomes
9Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Mitosis: diploid reproduction
10Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Meiosis: diploid (double) → haploid (single)
11Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Typing error rate: a very good typist makes 1 error / 1K typed letters.
All our diploid cells constantly reproduce 7 billion letters.
Typical cell copying error rate is 1 error / 1 Gbp.
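The comparison above can be turned into a quick back-of-the-envelope calculation. This is only a sketch using the round numbers quoted on the slide; the constant and function names are illustrative.

```python
# Round figures quoted on the slide (illustrative constants, not measured data).
GENOME_LETTERS = 7_000_000_000         # letters copied per diploid cell division
CELL_ERROR_RATE = 1 / 1_000_000_000    # 1 error per 1 Gbp
TYPIST_ERROR_RATE = 1 / 1_000          # 1 error per 1K typed letters


def expected_errors(letters: int, error_rate: float) -> float:
    """Expected number of copy errors when copying `letters` symbols."""
    return letters * error_rate


cell_errors = expected_errors(GENOME_LETTERS, CELL_ERROR_RATE)
typist_errors = expected_errors(GENOME_LETTERS, TYPIST_ERROR_RATE)

print(f"cell:   about {cell_errors:.0f} errors per genome copy")
print(f"typist: about {typist_errors:.0f} errors per genome copy")
```

With these figures a cell makes on the order of a handful of errors per genome copy, while the typist would make millions.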
12Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
- GERM LINE
- Reverse time and follow your cells
- Now you count $10^{13}$ cells
- One generation ago you had 2 cells somewhere in your parents' bodies
- Small T generations ago you had ($2^T$ − multiple ancestors) cells
- Large T generations ago you counted (fertile ancestors) cells
- Congratulations, you are 3.4 billion years old!!!
- Fast-forward time and follow your cells
- Only a few cells in your reproductive organs have a chance to live on in the next generations
- The rest (including you) will die
13Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
GERM LINE MUTATIONS
This potentially immortal lineage of (germ) cells is called the GERM LINE.
All mutations that we have accumulated are en route on the germ line.
14Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Polymorphism: multiple possibilities for a nucleotide (allele)
Single Nucleotide Polymorphism, SNP (snip): point mutation; example: AAATAAA vs AAACAAA
Humans: SNP = 1/1500 bases = 0.067%
STR: Short Tandem Repeats (microsatellites); example: CACACACACACACACACA
Transition - transversion
15Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Purines Pyrimidines
16Introduction to Bioinformatics5.1 VARIATION IN
DNA SEQUENCES
Transitions Transversions
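A point mutation is a transition when it stays within the same class (purine to purine, or pyrimidine to pyrimidine) and a transversion when it crosses classes. A minimal sketch of that rule follows; the function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify the change a -> b as 'identical', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("nucleotides must be one of A, C, G, T")
    if a == b:
        return "identical"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


# The SNP example AAATAAA vs AAACAAA differs by T -> C.
print(classify_substitution("T", "C"))
print(classify_substitution("A", "T"))
```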
17Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
- 5.2 Mitochondrial DNA
- Mitochondria are inherited only via the maternal line!!!
- Very suitable for comparing evolution: not reshuffled
18Introduction to Bioinformatics 5.2 MITOCHONDRIAL
DNA
H. sapiens mitochondrion
19Introduction to Bioinformatics 5.2 MITOCHONDRIAL
DNA
EM photograph of H. sapiens mtDNA
20Introduction to Bioinformatics 5.2 MITOCHONDRIAL
DNA
21Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
- 5.3 Variation between species
- Genetic variation accounts for morphological-physiological-behavioral variation
- Genetic variation (or distance) relates to phylogenetic relation (relationship)
- Necessity to measure distances between sequences: a metric
22Introduction to Bioinformatics5.3 VARIATION
BETWEEN SPECIES
Substitution rate
Mutations originate in single individuals.
Mutations can become fixed in a population.
Mutation rate: rate at which new mutations arise.
Substitution rate: rate at which a species fixes new mutations.
For neutral mutations ...
23Introduction to Bioinformatics5.3 VARIATION
BETWEEN SPECIES
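Substitution rate and mutation rate
For neutral mutations: $\rho = 2N\mu \cdot \frac{1}{2N} = \mu \approx \frac{K}{2T}$
(substitution rate $\rho$, mutation rate $\mu$; a diploid population of size $N$ produces $2N\mu$ new mutations per generation, each fixed with probability $1/(2N)$)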
24Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
5.4 Estimating genetic distance
Substitutions are independent (?)
Substitutions are random
Multiple substitutions may occur
Back-mutations mutate a nucleotide back to an earlier value
25Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
Multiple substitutions and Back-mutations
conceal the real genetic distance
GACTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCTTTGGAACTGATCGT
TTCTGATCCACCTCTGATCCATCGGAACTGATCGT
GTCTGATCCACCTCTGATCCATTGGAACTGATCGT
observed: 2 (= d), actual: 4 (= K)
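The observed count can be checked directly by comparing the first and last sequence of the example position by position. A minimal sketch (the function name is illustrative):

```python
def observed_differences(seq_a: str, seq_b: str) -> tuple[int, float]:
    """Return (number of differing sites, proportion of differing sites)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    mismatches = sum(x != y for x, y in zip(seq_a, seq_b))
    return mismatches, mismatches / len(seq_a)


first = "GACTGATCCACCTCTGATCCTTTGGAACTGATCGT"
last = "GTCTGATCCACCTCTGATCCATTGGAACTGATCGT"

count, proportion = observed_differences(first, last)
print(count, round(proportion, 4))
```

Only the sites that still differ are counted; substitutions that were later reversed or overwritten leave no trace in this number.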
26Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
Saturation: on average one substitution per site.
Two random sequences of equal length will match at approximately ¼ of their sites.
At saturation, therefore, the observed proportional genetic distance is ¾.
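The ¼ match rate of random sequences is easy to check by simulation. The sketch below assumes uniformly random, independent nucleotides; the sequence length and seed are arbitrary choices.

```python
import random


def match_fraction(n: int, seed: int = 0) -> float:
    """Fraction of matching sites between two independent random sequences."""
    rng = random.Random(seed)
    a = [rng.choice("ACGT") for _ in range(n)]
    b = [rng.choice("ACGT") for _ in range(n)]
    return sum(x == y for x, y in zip(a, b)) / n


fraction = match_fraction(100_000)
print(f"matching sites:  {fraction:.3f}")
print(f"differing sites: {1 - fraction:.3f}")
```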
27Introduction to Bioinformatics5.4 ESTIMATING
GENETIC DISTANCE
True genetic distance (proportion): K
Observed proportion of differences: d
Due to back-mutations: $K \ge d$
28Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
SEQUENCE EVOLUTION is a Markov process: a sequence at generation (= time) t depends only on the sequence at generation t-1
29Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
The Jukes-Cantor model
Correction for multiple substitutions
Substitution probability per site per second is $\alpha$
Substitution means there are 3 possible replacements (e.g. C → A, G, T)
Non-substitution means there is 1 possibility (e.g. C → C)
30Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
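Therefore, the one-step Markov process has the following transition matrix $M_{JC}$:

$$
M_{JC} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
C & \alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
G & \alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
T & \alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{array}
$$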
31Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
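After t generations the substitution probability is $M(t) = M_{JC}^t$
Eigenvalues and (orthonormal) eigenvectors of $M_{JC}$:
$\lambda_1 = 1$ (multiplicity 1): $v_1 = \tfrac12 (1\ 1\ 1\ 1)^T$
$\lambda_{2..4} = 1 - 4\alpha/3$ (multiplicity 3): $v_2 = \tfrac12 (-1\ {-1}\ 1\ 1)^T$, $v_3 = \tfrac12 (1\ {-1}\ {-1}\ 1)^T$, $v_4 = \tfrac12 (1\ {-1}\ 1\ {-1})^T$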
32Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
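Spectral decomposition of M(t): $M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$
Define M(t) as

$$
M(t) = M_{JC}^t =
\begin{pmatrix}
r(t) & s(t) & s(t) & s(t) \\
s(t) & r(t) & s(t) & s(t) \\
s(t) & s(t) & r(t) & s(t) \\
s(t) & s(t) & s(t) & r(t)
\end{pmatrix}
$$

Therefore, the substitution probability s(t) per site (to one particular other nucleotide) after t generations is $s(t) = \tfrac14 - \tfrac14 \left(1 - \tfrac{4\alpha}{3}\right)^t$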
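The closed form for s(t) can be checked numerically against repeated multiplication of the one-step Jukes-Cantor matrix. This is a sketch in plain Python; the helper names and the values of alpha and t are arbitrary choices for illustration.

```python
def jc_matrix(alpha: float) -> list[list[float]]:
    """One-step Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def mat_pow(m, t: int):
    """t-th power of a 4x4 matrix by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def s_closed_form(alpha: float, t: int) -> float:
    """Off-diagonal entry of the t-step matrix: 1/4 - 1/4 (1 - 4 alpha / 3)^t."""
    return 0.25 - 0.25 * (1 - 4 * alpha / 3) ** t


alpha, t = 0.01, 50
m_t = mat_pow(jc_matrix(alpha), t)
s_t = s_closed_form(alpha, t)
print(m_t[0][1], s_t)          # off-diagonal entry s(t), two ways
print(m_t[0][0], 1 - 3 * s_t)  # diagonal entry r(t) = 1 - 3 s(t)
```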
33Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
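Substitution probability s(t) per site (to one particular other nucleotide) after t generations: $s(t) = \tfrac14 - \tfrac14 \left(1 - \tfrac{4\alpha}{3}\right)^t$
Observed genetic distance d after t generations (three possible replacements): $d = 3\,s(t) = \tfrac34 - \tfrac34 \left(1 - \tfrac{4\alpha}{3}\right)^t$
For small $\alpha$: $\left(1 - \tfrac{4\alpha}{3}\right)^t \approx e^{-4\alpha t/3}$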
34Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
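For small $\alpha$ the observed genetic distance is $d \approx \tfrac34 \left(1 - e^{-4\alpha t/3}\right)$
The actual genetic distance is (of course) $K = \alpha t$
So $K \approx -\tfrac34 \ln\left(1 - \tfrac43 d\right)$
This is the Jukes-Cantor formula: independent of $\alpha$ and t.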
35Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
The Jukes-Cantor formula: $K = -\tfrac34 \ln\left(1 - \tfrac43 d\right)$
For small d, using $\ln(1+x) \approx x$: $K \approx d$
So actual distance ≈ observed distance
For saturation: $d \to \tfrac34$, $K \to \infty$
So if the observed distance corresponds to the random sequence distance, then the actual distance becomes indeterminate
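Both limits are easy to see numerically. The sketch below implements the standard Jukes-Cantor correction $K = -\tfrac34 \ln(1 - \tfrac43 d)$; the function name and the sample values of d are illustrative.

```python
import math


def jukes_cantor(d: float) -> float:
    """Jukes-Cantor corrected distance K for an observed proportion d of differing sites."""
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 3/4; at d = 3/4 the distance is indeterminate")
    if d == 0:
        return 0.0
    return -0.75 * math.log(1 - 4 * d / 3)


for d in (0.01, 0.1, 0.3, 0.6, 0.74):
    print(f"d = {d:.2f}  ->  K = {jukes_cantor(d):.4f}")
```

For small d the correction is negligible, while close to d = ¾ the estimate grows without bound.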
36Jukes-Cantor
37Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
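Variance in K
If $K = f(d)$ then $\delta K = \frac{\partial f}{\partial d}\,\delta d$
So $\mathrm{Var}(K) = \left(\frac{\partial f}{\partial d}\right)^2 \mathrm{Var}(d)$
Generation of a sequence of length n with substitution rate d is a binomial process, and therefore has variance $\mathrm{Var}(d) = d(1-d)/n$
Because of the Jukes-Cantor formula: $\frac{\partial K}{\partial d} = \frac{1}{1 - \frac43 d}$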
38Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
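Variance in K
Variance: $\mathrm{Var}(d) = d(1-d)/n$
Jukes-Cantor: $K = -\tfrac34 \ln\left(1 - \tfrac43 d\right)$
So $\mathrm{Var}(K) \approx \dfrac{d(1-d)}{n \left(1 - \frac43 d\right)^2}$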
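As a rough illustration of how the uncertainty in K grows with d, the sketch below evaluates the standard first-order (delta-method) approximation $\mathrm{Var}(K) \approx d(1-d) / \big(n (1 - \tfrac43 d)^2\big)$. The function name and the sample values are illustrative.

```python
import math


def jc_variance(d: float, n: int) -> float:
    """First-order variance of the Jukes-Cantor distance for n sites and observed proportion d."""
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 3/4")
    binomial_variance = d * (1 - d) / n   # Var(d)
    slope = 1 / (1 - 4 * d / 3)           # dK/dd from the Jukes-Cantor formula
    return slope ** 2 * binomial_variance


n = 1000
for d in (0.05, 0.1, 0.3, 0.6, 0.7):
    var_k = jc_variance(d, n)
    print(f"d = {d:.2f}  Var(K) = {var_k:.6f}  sd(K) = {math.sqrt(var_k):.4f}")
```

The variance of K is always at least the binomial variance of d, and it blows up as d approaches ¾.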
39Var(K)
40Introduction to Bioinformatics 5.4 THE
JUKES-CANTOR MODEL
EXAMPLE 5.4 on page 90
Create artificial data with n = 1000
Generate K mutations
Count d
With the Jukes-Cantor relation, reconstruct the estimate K(d)
Plot K(d) against K
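The experiment can be sketched in a few lines. This is not the book's code: it assumes a uniformly random starting sequence, substitutions at uniformly random sites (a site may be hit more than once), and each substitution choosing one of the three other nucleotides with equal probability. All names and the seed are illustrative, and the output is printed as a table rather than plotted.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def mutate(seq: str, k: int, rng: random.Random) -> str:
    """Apply k substitutions at random sites; repeated hits on a site are allowed."""
    sites = list(seq)
    for _ in range(k):
        i = rng.randrange(len(sites))
        sites[i] = rng.choice([b for b in NUCLEOTIDES if b != sites[i]])
    return "".join(sites)


def jukes_cantor(d: float) -> float:
    """Jukes-Cantor estimate of K; infinite once d reaches the saturation value 3/4."""
    if d >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * d / 3) if d > 0 else 0.0


def simulate(n: int = 1000, k: int = 100, seed: int = 0):
    """Return (true K per site, observed d, Jukes-Cantor estimate of K)."""
    rng = random.Random(seed)
    original = "".join(rng.choice(NUCLEOTIDES) for _ in range(n))
    mutated = mutate(original, k, rng)
    d = sum(a != b for a, b in zip(original, mutated)) / n
    return k / n, d, jukes_cantor(d)


print("true K   observed d   estimated K")
for k in (50, 100, 200, 400, 800):
    true_k, d, est_k = simulate(k=k)
    print(f"{true_k:6.2f}   {d:10.3f}   {est_k:11.3f}")
```

The observed d falls increasingly short of the true K as more substitutions are applied, while the corrected estimate stays closer to it, with growing scatter.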
41Introduction to Bioinformatics 5.4 EXAMPLE 5.4
on page 90
44Introduction to Bioinformatics 5.4 EXAMPLE 5.4
on page 90 ( FIG 5.3)
45Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
The Kimura 2-parameter model
Include substitution bias in the correction factor
Transition probability (G↔A and T↔C) per site per second is $\alpha$
Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second is $\beta$
46Introduction to Bioinformatics 5.4 THE KIMURA
2-PARAM MODEL
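The one-step Markov process substitution matrix now becomes $M_{K2P}$:

$$
M_{K2P} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-\alpha-2\beta & \beta & \alpha & \beta \\
C & \beta & 1-\alpha-2\beta & \beta & \alpha \\
G & \alpha & \beta & 1-\alpha-2\beta & \beta \\
T & \beta & \alpha & \beta & 1-\alpha-2\beta
\end{array}
$$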
47Introduction to Bioinformatics 5.4 THE KIMURA
2-PARAM MODEL
After t generations the substitution probability is $M(t) = M_{K2P}^t$
Determine the eigenvalues $\lambda_i$ and eigenvectors $v_i$ of $M_{K2P}$
48Introduction to Bioinformatics 5.4 THE KIMURA
2-PARAM MODEL
Spectral decomposition of M(t): $M_{K2P}^t = \sum_i \lambda_i^t v_i v_i^T$
Determine the fraction of transitions per site after t generations: P(t)
Determine the fraction of transversions per site after t generations: Q(t)
Genetic distance: $K \approx -\tfrac12 \ln(1 - 2P - Q) - \tfrac14 \ln(1 - 2Q)$
Fraction of substitutions: $d = P + Q$ → Jukes-Cantor
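A minimal sketch of the Kimura 2-parameter distance, using the standard formula $K = -\tfrac12 \ln(1 - 2P - Q) - \tfrac14 \ln(1 - 2Q)$. Function names and the toy input are illustrative. When transitions make up one third of all differences (P = d/3, Q = 2d/3), the result coincides with the Jukes-Cantor distance.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_fractions(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (P, Q): fractions of sites differing by a transition / a transversion."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(seq_a)
    return transitions / n, transversions / n


def kimura_2p(p: float, q: float) -> float:
    """Kimura 2-parameter distance from transition fraction p and transversion fraction q."""
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("distance is undefined: sequences are saturated")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


p, q = transition_transversion_fractions("AAGT", "GATT")  # toy input, not real data
print(p, q, kimura_2p(p, q))
```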
49Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
Other models for nucleotide evolution
Different types of transitions/transversions
Pairwise substitutions: GTR (General Time Reversible) model
Amino-acid substitution matrices
50Introduction to Bioinformatics 5.4 ESTIMATING
GENETIC DISTANCE
Other models for nucleotide evolution
DEFICIT: all of the above models assume symmetric substitution probabilities, prob(A→T) = prob(T→A)
Now strong evidence that this assumption is not true
Challenge: incorporate this in a self-consistent model
51Introduction to BioinformaticsLECTURE 5 INTER-
AND INTRASPECIES VARIATION
- 5.5 CASE STUDY Neanderthals
- mtDNA of 206 H. sapiens from different regions
- Fragments of mtDNA of 2 H. neanderthalensis, including the original 1856 specimen
- All 208 samples from GenBank
- A homologous sequence of 800 bp of the HVR could be found in all 208 specimens
52Introduction to Bioinformatics5.5 CASE STUDY
Neanderthals
Pairwise genetic difference corrected with the Jukes-Cantor formula
d(i,j) is the JC-corrected genetic difference between pair (i,j); $d^T = d$
MDS (Multi Dimensional Scaling): translate distance table d to an nD-map X, here a 2D-map
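The distance table can be sketched as follows. The three short sequences are made-up toy data, not the GenBank samples of the case study, and the function names are illustrative. The resulting symmetric matrix is the input a multidimensional scaling routine would take; the MDS step itself is not shown here.

```python
import math


def jc_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    d = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if d == 0:
        return 0.0
    if d >= 0.75:
        return math.inf   # saturated: the corrected distance is indeterminate
    return -0.75 * math.log(1 - 4 * d / 3)


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric table of pairwise JC-corrected distances d(i, j)."""
    n = len(seqs)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            table[i][j] = table[j][i] = jc_distance(seqs[i], seqs[j])
    return table


# Hypothetical toy alignment, for illustration only.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
table = distance_matrix(toy)
for row in table:
    print("  ".join(f"{value:.3f}" for value in row))
```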
53Introduction to Bioinformatics5.5 CASE STUDY
Neanderthals
distance map d(i,j)
54Introduction to Bioinformatics5.5 CASE STUDY
Neanderthals
MDS
55Introduction to Bioinformatics5.5 CASE STUDY
Neanderthals
phylogenetic tree
56END of LECTURE 5