Title: Comparative genomics and Target discovery
1. Comparative genomics and Target discovery
- Maarten Sollewijn Gelpke
- MDI, Organon
2.
- What is comparative genomics?
- What can we learn from comparative genomics?
- What is target discovery?
- What are the implications of comparative genomics for target discovery?
- What issues in target discovery can be addressed by comparative genomics?
3. Overview
- Introduction to genomes and sequencing.
- Comparative genomics aspects.
- Phylogenomics concepts.
- Examples of comparative genomics.
4. Sequence availability
- Availability of gene and protein sequences has increased enormously during the last 2 decades.
- Current capacity of the main sequencing centers is >3 Gb per month per center.
- This will increase again dramatically with the development of new superfast sequencing techniques.
Currently >100 Gbases
5. Genomes sequenced
[Timeline figure; recoverable labels: first bacterial genomes sequenced (H. influenzae and M. genitalium), the yeast genome, E. coli K12, C. elegans, full sequence of chr. 22 (1999), D. melanogaster genome, chr. 21, A. thaliana, human draft, human finished, rat, chicken, chimpanzee, Xenopus, zebrafish.]
6. Genome sequencing
Evolutionary relationship between metazoans
(multicellular animals) that have been sequenced
or are due for sequencing.
7. Genome sequencing
- BAC fingerprinting → shotgun approach
  - Accurate but laborious!
- Shotgun sequencing (WGS)
[Figure labels: BAC clones (100-200 kb each); whole genome (30 Mb, 3 Gb); assembly; finishing.]
8. Genome sequencing
- Current state of sequenced organisms
  - >316 Prokaryotes
  - >27 Archaea
  - >280 Eukaryotes (complete, in assembly or in progress)
  - >1600 Viruses and >500 mitochondria/chloroplasts
- Some ongoing genome sequencing projects
  - Poplar, gibbon, platypus, Drosophila species, variety of pathogenic fungi and bacteria, etc.
  - Meta-genomic projects on environmental samples (soil, deep-sea, waste sites)
9. Future of genome sequencing?
- New complete genomes.
- New low-redundancy genomes.
- New (low-redundancy) genome areas.
- Meta-genomics: sequencing of microbial communities.
- Sequencing of extinct species.
  - 40,000-year-old cave bear: 26k, 21 genes.
  - 45,000-year-old Neanderthal: 75k → diverged from human lineage 315,000 years ago
10. Comparative genomics
- Discover what lies hidden in genomic sequence by comparing sequence information.
- Main areas
  - Whole genome alignment
  - Gene prediction
  - Regulatory element prediction
  - Phylogenomics
  - Pharmacogenetics
- Affected by evolutionary aspects
  - Mutational forces (introduce random mutations)
  - Selection pressures
    - Ratio of non-synonymous to synonymous substitutions
    - Mutation rates lower or higher than neutral
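To make the non-synonymous versus synonymous distinction concrete, here is a minimal Python sketch that classifies codon differences between two aligned coding sequences. It is an illustration only: the function name `count_substitutions` is hypothetical, the input is assumed to be gap-free and in frame, and the raw counts are not a dN/dS ratio, since a proper estimate also normalizes by the number of synonymous and non-synonymous sites.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(seq_a, seq_b):
    """Return (synonymous, non_synonymous) codon differences.

    Assumes two aligned, gap-free, in-frame, upper-case coding sequences.
    Simplification: a differing codon is classified only by whether the
    encoded amino acid changes, regardless of how many bases differ.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must have equal length, a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # amino acid changed: non-synonymous change
    return syn, nonsyn


# AAA -> AAG keeps lysine; GAT -> GAA changes aspartate to glutamate.
print(count_substitutions("ATGAAAGAT", "ATGAAGGAA"))
```

A high share of non-synonymous changes relative to the neutral expectation hints at relaxed or positive selection; a low share hints at purifying selection.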
11. Comparing sequences, methods.
- Pairwise comparison of sequences (alignments)
  - proteins or genes
  - variety of local alignment tools like BLAST, Smith-Waterman etc.
- multiple sequence comparisons (ClustalW, Muscle etc.)
- results may be dependent on alignment settings
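As an illustration of local pairwise alignment, the following Python sketch computes a Smith-Waterman score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example; changing them changes the result, which is one way alignment settings influence outcomes.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so it returns the score, not the alignment.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACGT" scores 4 matches x 2 = 8.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```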
12. Comparing sequences, methods.
- Whole genome comparisons
  - Large stretches of sequence
  - Divergence up to 450 Mya (fugu-human) with sufficient similarity remaining.
  - BLAT, BLASTZ, Phusion/BlastN
  - Seeding strategy → alignment extension → gapped alignments
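The seeding and extension steps can be sketched in a few lines of Python. This is a toy version under stated assumptions: seeds are exact k-mer matches, extension is ungapped and stops at the first mismatch, and the final gapped-alignment stage is omitted. Real tools such as BLAT or BLASTZ use more elaborate seeds and scoring; the name `seed_and_extend` is hypothetical.

```python
from collections import defaultdict


def seed_and_extend(query, target, k=4):
    """Return sorted (query_start, target_start, length) exact-match segments.

    Step 1 (seeding): index every k-mer of the target.
    Step 2 (extension): grow each seed hit left and right while bases match.
    """
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)

    segments = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            q_start, t_start = q, t
            while (q_start > 0 and t_start > 0
                   and query[q_start - 1] == target[t_start - 1]):
                q_start -= 1
                t_start -= 1
            q_end, t_end = q + k, t + k
            while (q_end < len(query) and t_end < len(target)
                   and query[q_end] == target[t_end]):
                q_end += 1
                t_end += 1
            # Seeds on the same diagonal collapse into one segment.
            segments.add((q_start, t_start, q_end - q_start))
    return sorted(segments)


print(seed_and_extend("CCACGTGCACC", "TTTTACGTGCATTTT"))
```

In a fuller pipeline, the segments returned here would be the anchors between which gapped alignments are computed.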
13. Whole genome comparison
- Conservation of synteny!
  - Cross-reference of any genetic traits (diseases!) from one organism (e.g. mouse) to genes in the syntenic regions in the other organism (e.g. human).
- Genome expansion and contraction
  - Genome duplications and segmental duplications are an important mechanism for generating new genes.
- (GC) content, CpG islands
  - Reflect different mutational or DNA repair processes?
- Repeats
  - Transposable elements (TEs) are a main force in reshaping genomes. TEs (or remnants thereof) can be used to measure evolutionary forces acting on the genome.
  - Neutral mutation rate.
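For the GC content and CpG island bullet above, a short Python sketch of the two quantities usually computed per sequence window: the GC fraction and the observed/expected CpG ratio, here taken as (number of CpG x length) / (number of C x number of G). The function names are illustrative, and any thresholds used to call a CpG island are left out because the slides do not give them.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio undefined without both C and G; report 0
    return seq.count("CG") * len(seq) / (c_count * g_count)


window = "CGCGCGCG"
print(gc_content(window), cpg_obs_exp(window))
```

In practice these values are computed in sliding windows along a genome, so that regions of unusual composition stand out against the background.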
14. Gene prediction
- Comparing sequences has contributed enormously to the accuracy of gene prediction.
- Evidence based method.
  - Use cDNAs, ESTs and proteins from various organisms.
  - Apply gene feature rules.
15. Gene prediction
- De novo methods.
  - Alignment of genomic sequences
  - Splicing rules and other gene features
- De novo gene prediction by comparing sequences attempts to model negative selection against mutations. Regions with fewer mutations are conserved because mutations there were detrimental to the organism.
- Prediction of similar proteins in both genomes.
Newly predicted protein in mouse and human, similar to the disease related gene dystrophin.
16. Regulatory element prediction
- The complexity of higher eukaryotes, despite their relatively low number of genes, can be explained partially by the importance of transcriptional regulation.
- Identification of regulatory elements (REs) will have an extensive impact on understanding gene expression patterns (expression intensity, tissue specificity) and the relations within them, and on inferring biological systems or networks.
17. Regulatory element prediction
- No formal models for regulatory motifs
- Attempt to find conserved regions or motifs based on the global alignment of similar sequences of different organisms (phylogenetic footprinting).
  - Which species to compare? Evolutionary distance?
  - What regions around gene models to investigate? 5' and 3' flanking regions, introns?
  - Take expression patterns into account?
  - How does evolution affect REs?
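Phylogenetic footprinting can be sketched as a sliding-window scan over a multiple alignment. The Python example below is a simplified illustration, assuming the sequences are already globally aligned to equal length; the function name, the window size and the identity threshold are hypothetical choices, not values from the slides.

```python
def conserved_windows(aligned, window=4, min_identity=1.0):
    """Return start positions of windows that are conserved across species.

    aligned: list of equal-length aligned sequences ("-" marks a gap).
    A column counts as conserved when all sequences share one non-gap base;
    a window qualifies when its fraction of conserved columns reaches
    min_identity.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    conserved = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*aligned)
    ]
    starts = []
    for start in range(length - window + 1):
        fraction = sum(conserved[start:start + window]) / window
        if fraction >= min_identity:
            starts.append(start)
    return starts


alignment = ["ACGTACGTAA", "ACGTACGTCC", "ACGTACGTGG"]
print(conserved_windows(alignment))
```

Adding more species, or more distant ones, makes it less likely that a window passes by chance, which is the point behind the "which species to compare" question.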
18. Phylogenomics
- Comparison of genes and gene products across a number of species (whole genomes), characterizing homologues and gaining insight into the evolutionary process itself.
- Pharmacophylogenomics is the use of phylogenomics in aid of drug discovery, through improved target selection and validation.
19. Orthology and paralogy
Phylogenetic tree of gene X
- Orthologs: genes in different species that arose from a single gene in the most recent common ancestor, by speciation.
- Paralogs: genes in the same species that arose from a single gene in an ancestral species, by a process of gene duplication.
20. Target orthology
- Species differences frequently affect progression of targets and compounds. Orthology maps in combination with expression studies may explain these differences.
- Establishing orthology
  - Reciprocal highest scoring BLAST hit.
  - Conservation of synteny.
  - Gene loss or rate of evolution issues.
- Orthology does not guarantee common function (functional shift).
  - Extensive sequence divergence
  - High non-synonymous over synonymous nucleotide substitution ratios.
  - Comparison of regulatory regions?
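The reciprocal highest scoring hit criterion can be expressed compactly. The Python sketch below assumes the similarity searches have already been run in both directions and summarized as dictionaries mapping (query, subject) pairs to a score such as a BLAST bit score; the gene identifiers and scores in the example are invented for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical human (hA*) versus mouse (mB*) search results.
human_vs_mouse = {("hA1", "mB1"): 200, ("hA1", "mB2"): 50,
                  ("hA2", "mB2"): 120, ("hA3", "mB1"): 90}
mouse_vs_human = {("mB1", "hA1"): 190, ("mB2", "hA2"): 110,
                  ("mB2", "hA1"): 40}
print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

Here hA3 is left without an ortholog call because its best hit, mB1, prefers hA1. Gene loss or uneven evolutionary rates can produce exactly this kind of unresolved case, which is why synteny is used as supporting evidence.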
21. Target paralogy
- Key insights into large pharmacologically relevant families (NRs, GPCRs) can be gained from paralogy analysis.
- Paralogy is inter-related with several other gene-to-function phenomena that can seriously affect the suitability of genes as drug targets.
Schematic representation of various mappings of genes to functions.
22.
- Pleiotropy
  - Suggested to precede paralogy
  - Relaxed substrate or ligand specificity
  - Multiple protein domains
  - Tissue or cellular localization
- Redundancy
  - Total or partial redundancy of function
  - Directly linked to paralogy
  - Robustness against gene knock-outs (target validation)
  - PPAR-δ / PPAR-α in skeletal muscle; PXR / FXR in bile acid signaling; dopamine transporters / serotonin transporters in adjacent neurons.
23.
- Heteromery
  - Formation of heteromers between paralogs
  - Known examples in major classes of drug targets
    - GPCRs: GABA-B receptors
    - NRs: formation of heterodimers with retinoid X receptors (RXR)
    - Ion channels
- Crosstalk
  - Combination of pleiotropy and redundancy
  - May be regulated in time and space (expression and localization)
  - Action of cytokines (interleukins) on immune cell types.
24.
- Alternative transcription
  - Intermediate between paralogy and pleiotropy: "paralogy in place"
  - Increases effective size of the genome (estimated >30% of human genes show alternative transcription!)
25. Effects on drug discovery
- Functional shifts, pleiotropy and redundancy can be good or bad news for drug discovery.
- Functional shifts
  - Misleading or unavailable animal model
  - Animal toxicity irrelevant for humans
- Pleiotropy
  - Unintended drug effects
  - Opportunities for multiple indications
- Redundancy
  - Disease resistant to treatment (multi-functionality)
  - Highly selective treatment for complex diseases.
26. Pharmacogenetics
- Within species comparative genomics
- Single Nucleotide Polymorphisms (SNPs)
  - Current focus is on coding regions, expected to expand to sites of transcription regulation.
  - Determine the site of a SNP and the allele frequencies from ethnic or multi-ethnic panels of individuals (e.g. 100)
- Pharmacogenetics (PGx) relates SNP information to efficacy and safety issues during the drug development process.
  - Efficacy PGx: select/predict drug responders, increase confidence in a certain drug in development.
  - Safety PGx: identification of individuals with adverse effects to a drug
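Determining allele frequencies from a genotyped panel is simple arithmetic, sketched here in Python. The genotype labels 1-1, 1-2 and 2-2 follow the notation used later in the deck; the counts in the example are invented for a hypothetical panel of 100 individuals.

```python
def allele_frequencies(n11, n12, n22):
    """Return (freq_allele_1, freq_allele_2) from genotype counts.

    n11, n12, n22: numbers of individuals with genotypes 1-1, 1-2 and 2-2.
    Each individual carries two alleles; heterozygotes contribute one of each.
    """
    total_alleles = 2 * (n11 + n12 + n22)
    if total_alleles == 0:
        raise ValueError("panel is empty")
    freq_1 = (2 * n11 + n12) / total_alleles
    return freq_1, 1.0 - freq_1


# Hypothetical panel: 49 x 1-1, 42 x 1-2, 9 x 2-2.
print(allele_frequencies(49, 42, 9))
```

With 100 individuals there are 200 alleles, of which 2 x 49 + 42 = 140 are allele 1, giving frequencies of 0.7 and 0.3.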
27. Examples
- New genes and REs from yeast genomes.
- Multi species comparisons from targeted genomic regions.
- Comparative genomics at the vertebrate extremes.
- Pharmacogenetics in drug efficacy
28. Comparison of yeast species to identify genes and regulatory elements. (Kellis et al, Nature 2003)
- Saccharomyces cerevisiae and 3 related species
  - 7x coverage WGS of each species
  - Assembly of draft genome sequence
- S. cerevisiae genome aligned to others using ORFs as seeds
  - Most ORFs have 1:1 matches. Considerable conserved synteny.
  - Most genomic rearrangements clustered in telomeric regions.
  - Local gene family expansion/contraction, creating phenotypic diversity over evolutionary time.
- Balance between conservation and divergence allows for accurate gene identification and recognition of REs as well!
29. Identification of genes
- Original S. cerevisiae genome (1996): 6275 ORFs
- Re-analysis and other evidence (2002): 6062 ORFs
- This study validates all ORFs using a reading frame conservation score (very sensitive).
  - 5538 ORFs, 20 unresolved, 504 rejected ORFs!
- In addition to gene recognition, also largely improved gene structure definitions (start, stop, intron).
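The idea behind a reading frame conservation score can be illustrated with a simplified Python sketch. This is an assumption about how such a score could work, not the published procedure: given a pairwise gapped alignment of an ORF and its counterpart in another species, it measures the largest fraction of aligned bases that share one frame offset. Indels whose length is a multiple of three leave the frame intact; other indels shift it.

```python
def reading_frame_conservation(aln_a, aln_b):
    """Fraction of aligned bases lying in the dominant relative frame.

    aln_a, aln_b: equal-length gapped alignment rows ("-" marks a gap).
    For each column without a gap, the frame offset is the difference of
    the ungapped positions modulo 3; the score is the share of columns in
    the most common offset.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have the same length")
    pos_a = pos_b = 0
    offset_counts = [0, 0, 0]
    for base_a, base_b in zip(aln_a, aln_b):
        if base_a != "-" and base_b != "-":
            offset_counts[(pos_a - pos_b) % 3] += 1
        if base_a != "-":
            pos_a += 1
        if base_b != "-":
            pos_b += 1
    aligned = sum(offset_counts)
    return max(offset_counts) / aligned if aligned else 0.0


print(reading_frame_conservation("ATGAAACCCGGG", "ATGAAA---GGG"))  # in-frame gap
print(reading_frame_conservation("ATGAAACCCGGG", "ATGAA-CCCGGG"))  # frameshift
```

A real protein-coding ORF is expected to tolerate in-frame indels far more often than frameshifts, so a low score across related species is evidence against the ORF being a gene.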
30. Identification of regulatory elements
- REs are difficult to identify
  - Short (6-15 bp), sequence variation, few known rules
- De novo discovery of REs directly from genomic sequence.
  - Develop a motif conservation score system based on known motifs
  - 78 motifs discovered, overlapping with 36 of 55 known motifs
  - Putative annotation of motifs using adjacent genes (GO).
  - 25 of 42 new motifs show high category annotation correlation
  - Discovery of combinatorial control of REs
- Applications to human genome?
  - Increase number of species in comparison to improve the low signal to noise ratio in humans.
31. Multi species comparisons from targeted genomic regions. (Thomas et al, Nature 2003)
- Comparing targeted regions in multiple evolutionarily diverse vertebrates (conservation is less likely to occur by chance)
- ENCODE project
  - 44 genomic regions (14 manually selected, some of which are disease related; 30 random) of diverse gene density and non-exonic conservation
  - primates, bat, alligator, elephant, cat, emu, leopard, salmon etc.
- Initial analysis: 1.8 Mb on chromosome 7 containing 10 genes, including CFTR, from 12 species.
  - Detection of 1000 multi-species conserved sequences, of which >60% would not be detected by a 2 species comparison.
32. Comparative genomics at the vertebrate extremes (Boffelli et al, Nature Reviews Genetics 2004)
- What can be learned from comparisons of genomes that are distant or closely related in evolution?
- Distant comparisons reveal the most constrained sequence elements.
  - Most of the conserved human-fish non-coding sequences are found near genes with roles in embryonic development.
  - Mutations in these sequences can have an important role in human disease
33.
- Human-Fugu conservation of non-coding sequence in the DACH gene area (development of brain, limbs, sensory organs).
- Validation of identified enhancer regions by driving expression of a reporter in mouse embryos.
34. Comparative genomics at the vertebrate extremes
- Intraspecies sequence comparisons allow identification of species specific sequences
- Phylogenetic shadowing
- Requires high rate of polymorphism
- Comparison among primates shows human specific sequences
- Analysis of regulatory sequence of ApoA (involved in human heart disease)
A. Mutation rate analysis of the 5' region of the forkhead gene in Ciona intestinalis. B. Validation of identified potential regulatory elements in Ciona larvae.
35. Pharmacogenetics in drug efficacy
Efficacy PGx for an obesity drug. Compare genotypes 1-1, 1-2 and 2-2.