Comparative genomics of 24 mammals - PowerPoint PPT Presentation

About This Presentation
Title:

Comparative genomics of 24 mammals

Description:

Comparative genomics of 24 mammals Manolis Kellis MIT Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory – PowerPoint PPT presentation

Number of Views:148
Avg rating:3.0/5.0
Slides: 43
Provided by: mitEdu69
Learn more at: https://compbio.mit.edu
Category:

less

Transcript and Presenter's Notes

Title: Comparative genomics of 24 mammals


1
Comparative genomics of 24 mammals
  • Manolis Kellis
  • MIT

Broad Institute of MIT and Harvard
MIT Computer Science Artificial Intelligence
Laboratory
2
Sequencing the mammalian phylogeny
  • Species Center Covg
  • H1 Human Done Full
  • H2 Chimp Done Full
  • H3 Rhesus Done Full
  • H4 Mouse Done Full
  • H5 Rat Done Full
  • H6 Dog Done Full
  • H7 Cow Done Full
  • 1 Elephant Broad 1.94x
  • 2 Armadillo Broad 1.98x
  • 3 Tenrec Broad 1.90x
  • 4 Rabbit Broad 1.95x
  • 5 Guinea Pig Broad 1.92x
  • 6 Hedgehog Broad 1.86x
  • 7 Shrew Broad 1.92x
  • 8 Microbat Broad 1.84x
  • 9 Tree Shrew Broad 1.89x
  • 10 Squirrel Broad 1.90x
  • 11 Bushbaby Broad 1.87x

Kerstin Lindblad-Toh, Sante Gnerre, Federica
DiPalma Broad, Baylor, WashU, Arachne, UCSC
3
Comparative genomics of mammalian species
  • Goal 1 Discover regions of increased selection
  • Detect functional elements by their increased
    conservation
  • More genomes detect smaller elements, subtle
    selection
  • Goal 2 Discover different classes of functional
    elements
  • Patterns of change distinguish different types of
    functional elements
  • Specific function ? Selective pressures ?
    Patterns of mutation/inse/del
  • Develop evolutionary signatures characteristic of
    each function

4
Protein-coding genes
  • Mike Lin

5
Evolutionary signatures for protein-coding genes
  • Same conservation levels, distinct patterns of
    divergence
  • Gaps are multiples of three (preserve amino acid
    translation)
  • Mutations are largely 3-periodic (silent codon
    substitutions)
  • Specific triplets exchanged more frequently
    (conservative substs.)
  • Conservation boundaries are sharp (pinpoint
    individual splicing signals)

6
Protein-coding evolution vs nucleotide
conservation
Protein-coding exons
Highly conserved non-coding elements
  • Evolutionary signatures specific to each function
  • Distinguish protein-coding from non-coding
    conservation
  • Genome-wide run (CSF only) 81 sens., 91
    precision
  • Incorporating additional signatures RFC,
    single-species

7
Many new genes confirmed by chromatin domains
Missed exon
Example MM14qC3
  • Several hundred new exons, many in clusters

8
Genome-wide curation / experimental follow-up
G
PI Tim Hubbard, Sanger Center. HAVANA curators,
experimental validation.
  • Novel candidate genes and exons
  • Experimental cDNA sequencing and validation
  • Curation of gene structures integrating evidence
  • Revising existing annotations
  • Identify dubious genes with non-protein-like
    evolution
  • Refine boundaries and exon sets of existing genes
  • Curation evaluate evidence supporting that
    annotation
  • Unusual gene structures
  • Evolutionary evidence in absence of primary
    signals
  • Reveal new and unusual biological mechanisms

9
Unusual protein-coding events
  • Mike Lin

10
When primary sequence signals are ignored
  • Typical gene (MEF2A). Evolutionary signal stops
    at the stop codon.
  • Unusual gene (GPX2). Protein-coding signal
    continues past the stop.
  • GPX2 is a known selenoprotein! Additional
    candidates found.

11
Translational read-through in neuronal proteins
Novel candidate OPRL1 neurotransmitter
Continued protein-coding conservation
Protein-coding conservation
No more conservation
Stop codon read through
  • New mechanism of post-transcriptional control.
  • Conserved in both mammals (5 candidates) and
    flies (150 candidates)
  • Strongly enriched for neurotransmitters and
    brain-expressed proteins
  • Read-through stop codon (surrounding) shows
    increased conservation
  • Many questions remain
  • Role of editing? Cryptic splice sites? RNA
    secondary structure?

Lin et al, Genome Research 2007
12
Measuring excess constraint within protein-coding
exons
Typical protein-coding exon (Numerous mutations,
at each column)
Excess-conservation exon conserved above and
beyond the call of duty ? Likely to have
additional functions, overlapping selective
pressures
13
Searching for excess-constraint coding sequence
  • (1) Build a model for expected substitution counts

Syn.subs. correlate w/ degeneracy CpG
Distribution for each ancestral codon
(2) Score windows for depletion in syn. subst.
  • Z-score P(obs. subst expected for each codon)

(3) Top candidate exons with excess constraint
  • PCPB2 derived from ancestral transposon
  • Hox B5 gene start 52 AA before 1 syn.subst
  • C6orf111 predicted ORF on chr. 6
  • EIF4G2 overlaps spliced EvoFold prediction

14
Examples Top candidate exons showing increased
selection
  • HoxB5 52 amino acids before the first
    synonymous substitution
  • Overlaps highly conserved RNA secondary structure
  • C6orf11 Predicted ORF, protein-coding,
    extremely conserved
  • EIF4G2 Several consecutive exons, conserved RNA
    struct.

15
microRNA genes
  • Alex Stark
  • Pouya Kheradpour

16
Evolutionary signatures for microRNA genes
  • Conservation
  • profile

Combine with 10 other features ? 4,500-fold
enrichment
17
Novel miRNAs validated by sequencing reads
Stark et al, Genome Research (GR) 2007. Ruby et
al GR 2007
  • In fly genome 101 hairpins above 0.95 cutoff
  • 60 of 74 (81) known Rfam miRNAs rediscovered
  • 24 novel expression-validated by 454Solexa
    (Bartel/Hannon)
  • 17 additional candidates show diverse evidence
    of function
  • In mammals combine experimental evolutionary
    info
  • Rely on reads for discovery, use evolutionary
    signal to study function

18
Surprise 1 microRNA microRNA function
Drosophila Hox
  • Both hairpin arms of a microRNA can be functional
  • High scores, abundant processing, conserved
    targets
  • Hox miRNAs miR-10 and miR-iab-4 as master Hox
    regulators

Stark et al, Genome Research 2007
19
Surprise 2 microRNA-anti-sense function
anti- sense
sense
Stark et al, GenesDevelopment 2007
  • A single miRNA locus transcribed from both
    strands
  • The two transcripts show distinct expression
    domains (mutually exclusive)
  • Both processed to mature miRNAs mir-iab-4,
    miR-iab-4AS (anti-sense)

20
miR-iab-4AS leads to homeotic transformations
?wing w/bristles
Sensory bristles
haltere
haltere
?wing
WT
Note C,D,E same magnification
?wing
sense
Antisense
  • Mis-expression of mir-iab-4S AS alteres?wings
    homeotic transform.
  • Stronger phenotype for AS miRNA
  • Sense/anti-sense pairs as general building blocks
    for miRNA regulation
  • 10 sense/anti-sense miRNAs in mouse

Stark et al, GenesDevelopment 2007
21
Function of miRNA arms and anti-sense miRNAs
  • Denser Hox miRNA targeting network

22
Measuring selection
  • Michele Clamp
  • Manuel Garber
  • Xiaohui Xie

23
Detecting Purifying Selection (?)
?
Neutral sequence
Constrained sequence
  • Estimating intensity of constraint (?)
  • Probabilistic evolutionary model
  • Maximum Likelihood (ML) estimation of ?
  • sitewise (evaluate every k-long window)
  • windows-based (increased power)
  • Reports ?, and its log odds score (LODS).
  • Theoretical p-value (LODS distributes ?2 with df
    1)

Manuel Garber, Michele Clamp, Xiaohui Xie
24
Detecting other constraint signatures (p)
? 0 0 0.8 0.5 0.6 3.2 0
0
  • Repeated C?G transversion
  • Has happened at least 4 times.
  • Very unlikely given neutral model.

?
  • Goal Identify sites with unlikely substitution
    pattern.
  • Approach Probabilistic method to detect a
  • stationary distribution that is different from
    background.
  • Solution Implement ML estimator (?) of this
    vector
  • Provides a Position Weight Matrix for any given
    k-mer in the genome.
  • Scores every base in the genome (LODS).

Manuel Garber, Michele Clamp, Xiaohui Xie
25
Estimation of genome-wide constraint
Pilot Encode Regions (1)
9.4 conserved 5.7 above FDR cutoff
10.5 conserved 6 above FDR cutoff
Genome-wide
Across entire genome 5 under selection. Same as
for Human-Mouse. Whats different?
Manuel Garber, Michele Clamp, Xiaohui Xie
26
More mammals We can actually tell which 5 it is!
Constraint calculated over a 50mer
21 mammals
4 mammals
5 FDR
gt40 FDR
Constraint calculated over a 12mer
21 mammals
4 mammals
5 FDR
gt40 FDR
Michele Clamp
27
Individual conserved elements match known TF sites
Example TNNC1 (Troponin C)
?5
Constraint score
Promoter alignment
Known TF binding sites
?5
TATA
SP-1
CEF-2
CEF1
Binding site resolution, even without known motif
model
Michele Clamp
28
Binding sites for known regulators
  • Pouya Kheradpour
  • Alex Stark

29
Computing Branch Length Score (BLS)
  • Allows for
  • Mutations permitted by motif degeneracy
  • Misalignment/movement of motifs within window (up
    to hundreds of nucleotides)
  • Missing motif in dense species tree

30
Branch Length Score ? Confidence
  • Use motif-specific shuffled control motifs
    determine the expected number of instances at
    each BLS by chance alone (or due to non-motif
    conservation)
  • Compute Confidence Score as fraction of instances
    over noise at a given BLS(1 false discovery
    rate)
  • Many species are needed to confidently predict
    instances

31
Performance on vertebrate Transfac motifs
Median number of instances (at fixed confidence)
  • Most motifs have confident instances into 90
    confidence with 18 mammals
  • Substantial increase in the number of instances
    compared to only human, mouse rat and dog.

32
Intersection with CTCF ChIP-Seq regions
ChIP data from Barski, et al., Cell (2007)
  • ChIP-Seq and ChIP-Chip technologies allow for
    identifying binding sites of a motif
    experimentally
  • Conserved CTCF motif instances highly enriched in
    ChIP-Seq sites
  • High enrichment does not require low sensitivity
  • Many motif instances are verified

33
Enrichment also found for other factors
Barski, et al., Cell (2007)
34
Enrichment increases in conserved bound regions
  • ChIP bound regions may not be conserved
  • For CTCF we also have binding data in mouse
  • Enrichment in intersection is dramatically higher

Human Barski, et al., Cell (2007) Mouse
Bernstein, unpublished
35
Enrichment increases in conserved bound regions
  • ChIP bound regions may not be conserved
  • For CTCF we also have binding data in mouse
  • Enrichment in intersection is dramatically higher
  • Trend persists for other factors where we have
    multi-species ChIP data

36
Motif discovery
  • Pouya Kheradpour
  • Alex Stark

37
Using confidence for motif discovery
  • Use motif-specific shuffled control motifs
    determine the expected number of instances at
    each BLS by chance alone (or due to non-motif
    conservation)
  • Compute Confidence Score as fraction of instances
    over noise at a given BLS(1 false discovery
    rate)

38
Motif discovery pipeline
  • Enumerate motif seeds
  • Six non-degenerate characters with variable size
    gap in the middle
  • Score seed motifs
  • Use a conservation ratio corrected for
    composition and small counts to rank seed motifs
  • Expand seed motifs
  • Use expanded nucleotide IUPAC alphabet to fill
    unspecified bases around seed using hill climbing
  • Cluster to remove redundancy
  • Using sequence similarity

39
Motif discovery in enhancer regions
Heinzman et al, Bing Rens lab
  • Collaboration with Ren, White, Posakony labs
  • Predict novel enhancer / promoter / insulator
    elements
  • Identify motifs associated with these regions
  • Validate predicted regions for in vivo function
  • Initial results in human genome
  • Motif combinations predictive of enhancer regions
    (5X)

40
Motif discovery in 3UTRs
  • Perform motif discovery by ranking 7-mers in
    3UTRs by the highest confidence they reach with
    100 instances.

41
Summary
  • Measuring increased selection
  • Scaling of branch lengths ?
  • Non-random stationary distribution p
  • Increased resolution individual binding sites
  • Protein-coding genes
  • Distinct evolutionary signatures
  • Novel genes, revised genes
  • Unusual structures read-through, increased
    selection
  • microRNAs
  • Function of miRNA/miRNA and sense/anti-sense
    pairs
  • Dense miRNA targeting network for Hox cluster
  • Regulatory motifs
  • Measure increased selection, derive confidence
    score
  • High sensitivity / high specificity for known
    motifs
  • Use enumeration/confidence metric for motif
    discovery

42
Acknowledgements
MIT Computer Science and AI Lab
Broad Institute of MIT and Harvard
Pouya Kheradpour
Kerstin Lindblad-Toh
Michele Clamp
Manuel Garber
Mike Lin
Xiaohui Xie
Alex Stark
Matt Rasmussen
Sante Gnerre, David Jaffe Issao Fujiwara Federica
Di Palma Arachne Assembly Team Broad Sequencing
Platform Eric Lander
Sequencing Baylor, WashU, Agencourt. Funding
NHGRI miRNAs Julius Brennecke, Graham Ruby, Greg
Hannon, David Bartel iab-4AS Natascha Bushati,
Steve Cohen, Julius, Greg Hannon
Write a Comment
User Comments (0)
About PowerShow.com