Title: Comparative Genomics and Evolution
1- Comparative Genomics and Evolution
Pollard, K.S., et al., Forces Shaping the Fastest
Evolving Regions in the Human Genome. PLoS
Genetics 2(10), 2006.
McLean, C., and Bejerano, G., Dispensability of
Mammalian DNA. Genome Research 18, 1743-1751
(2008).
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA. Genome Research
18, 1743-1751 (2008).
Image source: http://mbbnet.umn.edu
2 Forces Shaping the Fastest Evolving Regions in the Human Genome, by Katherine S. Pollard et al.
3 Image sources: http://pro.corbis.com, http://www.science.psu.edu
4- Humans have higher brainpower
- Examples: creativity, problem solving, language
- Which part of the genome is responsible?
Image source: http://www.spaceflight.esa.int
5- Human and chimpanzee DNA is 98% similar
- The 2% difference is 29 million bases (mostly in non-coding DNA)
Image source: http://en.wikipedia.org
6- Human and rodent genomes are often compared to identify conserved (presumably functional) elements.
- Human and chimpanzee genomes are compared to understand what is uniquely human about our genome.
Image source: http://genome.ucsc.edu
7- Look at HARs in the human genome
- HAR: human accelerated region. High rate of nucleotide substitution in humans, low in other vertebrates.
- The fastest is HAR1, a novel RNA gene expressed during development of the neocortex (language, conscious thought).
8- 100 bp, mostly non-coding
- Function is likely to be gene regulation.
- They seem to have been under strong negative selection up to the common ancestor of chimp and human.
- Rapid positive selection then started in humans only.
Image source: http://www.shutterstock.com
9 Evolution of vertebrates
- Evolutionary tree based on the comparison of conserved regions in whole-genome alignments between species.
- Branch lengths are given in substitutions per base, or in millions of years.
Image from Pollard, K.S., et al., Forces Shaping the Fastest Evolving Regions in the Human Genome.
10- Find HARs by using the LRT, the likelihood ratio test.
- In statistical hypothesis testing, the likelihood ratio (Λ) is the ratio of the maximum probability of a result under a null hypothesis to the maximum probability under an alternative hypothesis.
- The LRT decides between the two hypotheses based on the value of the likelihood ratio.
11- Two models were used for the genomic LRT.
- Model 1: the human substitution rate is held proportional to the other substitution rates in the evolutionary tree.
- Model 2: the human substitution rate can be accelerated relative to the rates in the rest of the tree.
12 [Figure: the human sequence aligned with the other vertebrates, across all the conserved alignments]
13 Model 1
[Figure: the human sequence aligned with the other vertebrates]
- Determine the 1st, 2nd, and 3rd sets of rates
- Scale all by the same amount
14 Model 2
[Figure: the human sequence aligned with the other vertebrates]
- Scale all by the same amount
- Scale the human rates separately
18 Identify regions conserved between human and other vertebrates (34,498 of them)
For all regions, fit model 1 and determine the proportional rates that maximize the likelihood of the tree
Obtain P1 (max probability 1)
Loop over all conserved regions. For each region, do:
Fit model 2 to the region in human, and find the acceleration for that region that maximizes the likelihood of the tree
Obtain P2 (max probability 2)
Calculate LRT for the region as ? log(P2 / P1)
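The per-region comparison of model 1 and model 2 can be sketched with a toy example. This is only an illustration under stated assumptions, not the phylogenetic models used by Pollard et al.: it assumes the number of substitutions on the human branch is Poisson distributed, that model 1 supplies an expected count (`expected_subs`, a hypothetical input), and that model 2 multiplies that expectation by a single acceleration factor of at least 1. All names are illustrative.

```python
import math


def poisson_log_prob(count, mean):
    """Log probability of observing `count` events under Poisson(mean)."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def lrt_statistic(human_subs, expected_subs):
    """Toy log(P2 / P1) for one conserved region.

    human_subs:    observed substitutions on the human branch (assumed input)
    expected_subs: expected count under model 1, i.e. with the human rate
                   held proportional to the rest of the tree (must be > 0)
    """
    # Model 1: no human-specific acceleration.
    log_p1 = poisson_log_prob(human_subs, expected_subs)
    # Model 2: the acceleration factor that maximizes the likelihood,
    # constrained to be at least 1 (acceleration only).
    acceleration = max(1.0, human_subs / expected_subs)
    log_p2 = poisson_log_prob(human_subs, acceleration * expected_subs)
    return log_p2 - log_p1


# Hypothetical regions: (name, observed human substitutions, expected count).
regions = [("region_a", 2, 2.0), ("region_b", 10, 2.0)]
scores = {name: lrt_statistic(obs, exp) for name, obs, exp in regions}
```

In this sketch a region whose human branch matches the model 1 expectation scores 0, while a region with many more human substitutions than expected gets a large positive score.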
19- A big LRT value indicates an HAR. How big is big?
- Do 1 million simulations of the 34,498 conserved alignments.
- To create each simulation, use the model 1 proportional rates.
- Repeat the LRT calculation for each simulation.
- Then, for each region, find the proportion of simulated LRTs that are bigger than its original LRT.
- That proportion is a p-value that tells whether the region is an HAR.
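The last two steps amount to an empirical p-value. A minimal sketch, assuming the simulated LRT values for a region are already available as a list (the values below are made up, and the function name is illustrative):

```python
def empirical_p_value(observed_lrt, simulated_lrts):
    """Proportion of simulated LRT values strictly bigger than the observed one."""
    if not simulated_lrts:
        raise ValueError("need at least one simulated LRT value")
    bigger = sum(1 for value in simulated_lrts if value > observed_lrt)
    return bigger / len(simulated_lrts)


# Hypothetical null LRT values for one region (the slides use 1 million).
null_lrts = [0.0, 0.1, 0.0, 2.5, 0.4, 0.0, 9.1, 0.3, 0.0, 1.2]
p_value = empirical_p_value(3.0, null_lrts)
```

With only ten simulated values the smallest nonzero p-value is 0.1, which is why a very large number of simulations is needed to resolve small p-values.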
20- Note on methods: the vertebrates that were used in selecting the conserved regions (chimp, macaque, mouse, rat, rabbit) were omitted from any LRT analysis.
- This ensured that the LRT is independent of the method used to select the conserved regions.
21- Result: 202 HARs were found in the human genome.
Image source: http://www.3dscience.com
22- Results for Conserved Elements
- 80.4% of the 34,498 conserved regions are non-coding.
- 45.4% of non-coding regions are intronic, 31% are intergenic.
- Non-coding regions are enriched for transcription factors, DNA-binding proteins, and regulators of nucleic acid metabolism
23- 202 HARs have p < 0.1, 49 of them have p < 0.05
- HAR1 through HAR5 have p < 4.5e-4, very accelerated
- Most HARs are non-coding
- 66.3% are intergenic, 31.7% are intronic, only 1.5% are coding
- Results support the hypothesis (King and Wilson) that most chimp-human differences are regulatory.
24- Results: Confirming Accelerated Selection in HARs
Negative selection
Positive selection
- Are the HARs just due to relaxation of negative selection?
- No. Compare to the neutral rate for 4D (four-fold degenerate) sites to see.
Image source: http://cs273a.stanford.edu
Bejerano Aut 08/09
25 [Figure labels: genome-wide neutral rate for 4D sites in human and chimp in chromosome end bands; genome-wide neutral rate for 4D sites in human and chimp. H, human; C, chimp.]
The chimp rates in all five elements fall well below the human rates, which exceed the background rates by as much as an order of magnitude.
Image from Pollard, K.S., et al., Forces Shaping the Fastest Evolving Regions in the Human Genome.
26- Results: W → S Bias in HARs
AT → GC substitution bias in HARs
[Figure: AT → GC and GC → AT substitutions in HAR1 through HAR5, HAR6 through HAR49, HAR50 through HAR202, and the rest of the 34,000 conserved elements]
- Dramatic AT → GC bias was observed in HARs.
Image from Pollard, K.S., et al., Forces Shaping the Fastest Evolving Regions in the Human Genome.
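Counting the two substitution classes is simple once an ancestral and a derived sequence are aligned. The sketch below is illustrative only: the sequences are invented, and how the ancestral state is inferred is outside its scope.

```python
WEAK = {"A", "T"}
STRONG = {"G", "C"}


def count_biased_substitutions(ancestral, derived):
    """Count weak-to-strong (AT -> GC) and strong-to-weak (GC -> AT) changes.

    Both arguments are aligned, equal-length sequences; columns containing
    gaps or other symbols are skipped.
    """
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")
    weak_to_strong = strong_to_weak = 0
    for old, new in zip(ancestral.upper(), derived.upper()):
        if old in WEAK and new in STRONG:
            weak_to_strong += 1
        elif old in STRONG and new in WEAK:
            strong_to_weak += 1
    return weak_to_strong, strong_to_weak


# Hypothetical inferred ancestor vs. human sequence for one element.
ancestor = "ATTAGCATAT"
human = "GTCAGCGTAC"
ws, sw = count_biased_substitutions(ancestor, human)
```

A large excess of weak-to-strong over strong-to-weak changes is the kind of AT → GC bias the slide describes.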
27- Results: W → S Bias in HARs
- The top 49 HARs are 2.7 times as likely to be located near final chromosomal bands as the other conserved elements
- Interestingly, HAR1 and HAR5 are also in end regions in other mammals, but are not accelerated.
Image source: http://www.intelihealth.com
28- Results: W → S Bias in HARs
- HARs tend to be located in regions of high recombination in humans.
- All of this evidence points to biased gene conversion (BGC) as the driving force behind HARs.
29- Paired chromosomes can exchange homologous pieces
- Typically occurs during meiosis
34 Meiosis
[Figure: a diploid germ cell with paternal chromosome A and maternal chromosome A goes through DNA replication (sister chromatids joined at the centromere), recombination, and segregation, giving haploid gametes]
35 Recombination hotspot
Recombination
36 [Figure: duplex 1 and duplex 2; formation of Holliday junction intermediate; horizontal resolution with gene conversion, or vertical resolution with crossover; mismatch repair]
Image source: http://www.sanger.ac.uk
37- Genetic Recombination
- Chromosomal Crossover
Homologous chromosomes
Recombinant chromatids
- Chromosomal crossover results in exchange of DNA
pieces
Image source: http://www.emc.maricopa.edu
38- Genetic Recombination
- Gene Conversion
Mismatch repair causes DNA to revert back to its
original form
Recombinant chromatids
- Gene conversion results in nonreciprocal
transfer of DNA
Image source: http://www.emc.maricopa.edu
39- Genetic Recombination
- Gene Conversion
haploid gametes
- The result is a nonstandard ratio of alleles, such as 3:1
- This causes homogenization of a species' gene pool
Image source: http://www.emc.maricopa.edu
40 A - T is a weak pairing
G - C is a strong pairing
- DNA repair machinery tends to replace weak pairings with strong pairings during gene conversion.
Image source: http://commons.wikimedia.org
41 Biased Gene Conversion
Recombinant chromatids
A - T replaced by G - C during mismatch repair
- Biased gene conversion results in G - C enrichment of a species' gene pool (in addition to causing homogenization)
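How a small repair bias can enrich G - C alleles in a gene pool can be illustrated with a toy population model. This is a sketch under assumptions that are not in the slides: random mating, no selection or drift, and a hypothetical `bias` parameter such that A - T/G - C heterozygotes transmit the G - C allele with probability (1 + bias) / 2 instead of 1/2.

```python
def gc_allele_trajectory(start_freq, bias, generations):
    """Frequency of a G-C allele over time under biased gene conversion.

    Toy deterministic model: heterozygotes (frequency 2p(1-p)) transmit the
    G-C allele with probability (1 + bias) / 2, so each generation
    p' = p + bias * p * (1 - p).
    """
    freqs = [start_freq]
    p = start_freq
    for _ in range(generations):
        p = p + bias * p * (1.0 - p)
        freqs.append(p)
    return freqs


# Hypothetical starting frequency and bias, for illustration only.
unbiased = gc_allele_trajectory(0.1, bias=0.0, generations=100)
biased = gc_allele_trajectory(0.1, bias=0.05, generations=100)
```

Without bias the frequency stays where it started; with even a modest bias the G - C allele climbs steadily toward fixation in this model.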
42- HARs and Recombination Hotspots
- HARs tend to be located near recombination
hotspots in humans
43- Mysterious
- Extremely different between chimps and humans (change rapidly during evolution)
- Not caused by the local DNA sequence (it is the same in human and chimp)
44 Recombination hotspots
?
45- Recombination-caused BGC (often seen negatively)
played a big role in the development of our
species.
46 [Figure labels: HAR, HAR, Isochore]
- Isochore: DNA region (100 kb) with high gene concentration
- Isochores are stabilized by many strong (GC) pairings
47- Theory (Bernardi et al.): weakly deleterious changes drive an isochore to a critical point of destabilization
- At the critical point, GC content cannot decrease further; otherwise the isochore becomes unstable
- An AT → GC substitution in the isochore suddenly gains a selective advantage and sweeps through the population
48- Isochore selective sweep theory vs. the BGC theory.
- An isochore sweep has a different DNA signature than BGC
[Figure: in an isochore selective sweep the GC changes are spread over 100 kb; in biased gene conversion they fall within 100 bases]
49- Evidence so far favors the BGC explanation for HARs
- However, the results are not yet conclusive
50 Dispensability of Mammalian DNA, by Gill Bejerano and Cory McLean
51- Are mammalian CNEs dispensable?
- CNE: conserved non-exonic element
- Examples: cis-regulatory DNA, ultraconserved DNA
Image source: http://apps.co.marion.or.us
52- Cis-regulatory DNA elements
promoter or inhibitor
Image source: http://cnx.org
53- Cis-regulatory DNA elements
Image source: http://cnx.org
54- 200 bp and up, many seem to be regulatory
- 100% identity with no insertions or deletions between orthologous regions of the human, rat, and mouse genomes.
- Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish.
- (quotes from "Ultraconserved elements in the human genome" by Bejerano et al.)
55- Are mammalian CNEs dispensable?
- About 20% of gene knockout experiments, including cis-regulatory and ultraconserved knockouts, produce no phenotype measurable in lab settings.
Image source: http://www.sciencedaily.com
56- Are mammalian CNEs dispensable?
Do CNEs have functional redundancy?
OR
Are CNEs indispensable, but in a way that cannot
be observed in the lab?
- Approach: look at CNEs lost in rodents during evolution
57- Finding CNEs lost by rodents
Computational Pipeline
- Identify conserved mammalian sequences
- Pick out the ones absent in rodents
- Remove artifacts due to assembly, alignment, structural RNA migration
58 Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
59 Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
60 To avoid assembly artifacts:
- Use UCSC chains and nets
- Ignore multi-level nets
Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
61 Identify lost DNA
Validate quality of results
Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
62- Identifying DNA lost by rodents
[Figure: aligned primate, dog, and rodent sequences; a column with A in primates and G in dog marks different bases between primates and dog]
Look at the aligned orthologous sequences in primates (human, macaque), dog, and rodents (mouse, rat).
63- Identifying DNA lost by rodents
Compute the primate-dog %id (percentage of identical alignment columns) in a 100 bp window
64- Identifying DNA lost by rodents
Compute primate-dog %id
65- Identifying DNA lost by rodents
Compute primate-dog %id
Deletion in rodents
66- Identifying DNA lost by rodents
Ultraconserved-like element between primates and dog
67- Identifying DNA lost by rodents
Ultraconserved-like element that was lost in rodents
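The window scan described on these slides can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it takes three rows of a single alignment as plain strings, and it treats a window as lost only when the rodent row is entirely gaps, which is an assumption made for this example. Function and variable names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of alignment columns where both sequences have the same base."""
    same = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * same / len(seq_a)


def lost_in_rodents(primate, dog, rodent, window=100, min_id=100.0):
    """Start positions of windows conserved in primate-dog but deleted in rodent.

    The three arguments are rows of one alignment (equal length, '-' for gaps).
    """
    hits = []
    for start in range(0, len(primate) - window + 1):
        end = start + window
        conserved = percent_identity(primate[start:end], dog[start:end]) >= min_id
        deleted = set(rodent[start:end]) == {"-"}
        if conserved and deleted:
            hits.append(start)
    return hits


# Tiny hypothetical alignment, using a 4-column window instead of 100 bp.
primate = "ACGTACGTAC"
dog = "ACGTACGTGC"
rodent = "AC----GTAC"
hits = lost_in_rodents(primate, dog, rodent, window=4)
```

Lowering `min_id` below 100 would extend the same scan from ultraconserved-like windows to less conserved ones.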
68- Results for non-exonic ultras
- 1,691,090 bp of ultraconserved-like sequences were found
- 1147 bp of these sequences were lost in rodents
- Thus only 0.086% of ultras is lost in rodents
- In comparison, ¼ of neutrally-evolving DNA (50%id to 65%id) is lost in rodents
- Thus ultraconserved-like sequences are 300 times more indispensable than neutrally-evolving DNA
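The fold difference quoted in the last bullet can be checked from the two loss rates on this slide. The sketch below assumes that the "0.086" figure is a percentage; the variable names are illustrative.

```python
# Figures from the slide; reading "0.086" as a percentage is an assumption.
ultra_lost_fraction = 0.086 / 100   # ultraconserved-like bases lost in rodents
neutral_lost_fraction = 1 / 4       # neutrally evolving bases lost in rodents

fold_difference = neutral_lost_fraction / ultra_lost_fraction
print(f"Neutral DNA is lost about {fold_difference:.0f} times as often")
```

The ratio comes out at roughly 290, which the slide rounds to 300.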
69- Expected: a uniform rate of loss for neutrally-evolving DNA
- Observed: less conserved sequences are retained more often
Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
70- This phenomenon is due to poorly conserved sequences being adjacent to exons, and thus shielded from being lost
- Larger deletions are biased away from gene structures
Image from McLean, C., and Bejerano, G., Dispensability of Mammalian DNA.
71- Separating DNA under selection from neutral DNA
- Moving away from 100%id, there is a mixing of DNA under purifying selection and neutrally evolving DNA
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
72- Separating DNA under selection from neutral DNA
- To distinguish neutral DNA from conserved DNA in
the mix, use longer evolutionary tree branch
lengths
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
73- Separating DNA under selection from neutral DNA
- Example: the human-dog-horse alignment has a longer cumulative branch length than human-macaque-dog
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
75- Separating DNA under selection from neutral DNA
- Thus the human-dog-horse alignment has lower %id for neutral DNA than human-macaque-dog
- This shifts the neutral DNA curve to the right
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
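Why longer branches lower the identity of neutral DNA can be illustrated with a standard substitution model. The sketch below uses the Jukes-Cantor (JC69) formula for two sequences, which is a simplification chosen for this example (the slides do not name a model, and the alignments there involve three species); the branch lengths are hypothetical.

```python
import math


def expected_identity(branch_length):
    """Expected fraction of identical sites between two neutral sequences.

    Uses the Jukes-Cantor (JC69) model, with branch_length measured in
    expected substitutions per site along the path joining the two sequences.
    """
    return 0.25 + 0.75 * math.exp(-4.0 * branch_length / 3.0)


# Hypothetical path lengths (substitutions per site), for illustration only.
short_path = expected_identity(0.2)
long_path = expected_identity(0.6)
```

Sequences under purifying selection accumulate fewer substitutions and stay near 100%id, so lengthening the branches pulls the neutral distribution away from them.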
76- Results for DNA under purifying selection
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
77- Results for DNA under purifying selection
- 80%id to 100%id identified as DNA under purifying selection
- As is visible from the figure, practically none of this DNA is lost in the primates (only 0.154 of bases are lost)
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
78- Results for DNA under purifying selection
- The previous results were for CNEs
- Those results can be compared with the numbers for lost coding DNA
- Fraction of lost CNEs: 0 at 100%id, 0.00122 at 80%id
- Fraction of lost exons: 0 at 100%id, 0.0000861 at 80%id
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
79- Results for DNA under purifying selection
- Thus CNEs under purifying selection are
indispensable, similarly to coding elements.
80- CNE dispensability ranking
[Figure labels: deepest in vertebrate tree, so corresponds to the most indispensable CNEs; in primates; in rodents; region of high conservation (CNEs)]
- Left plot explanation (right plot is similar): take the human-macaque-dog (h-m-d) alignments and find their conservation %id in each of the shown species. Then, for each of those species, plot the fraction of DNA lost in rodents vs. the %id.
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
81- CNE dispensability ranking
Image from McLean, C., and Bejerano, G.,
Dispensability of Mammalian DNA.
82- Many mammalian CNE knockouts produce no observable phenotype in the lab, suggesting great functional redundancy.
- However, evolutionary analysis shows that the CNEs, and particularly ultraconserved regions, are indispensable.
- It seems that the phenotype in knockouts is subtle, but very important.
Image source: http://apps.co.marion.or.us