Title: Comparative Genomics
1Comparative Genomics Annotation
- The Foundation of Comparative Genomics
- Non-Comparative Annotation
- Three methodological tasks of CG Annotation: Protein Gene Finding, RNA Structure Prediction, Signal Finding
- Challenges
- Empirical Investigations: Genes, Signals, Functional Stories, Positive Selection
- Open Questions
2Hidden Markov Models in Bioinformatics
- Definition
- Three Key Algorithms
- Summing over Unknown States
- Most Probable Unknown States
- Marginalizing Unknown States
- Key Bioinformatic Applications
- Pedigree Analysis
- Profile HMM Alignment
- Fast/Slowly Evolving States
- Statistical Alignment
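As a rough illustration of the first two key algorithms, the sketch below sums over unknown states (forward algorithm) and finds the most probable unknown states (Viterbi) for a tiny HMM, then checks both against brute-force enumeration of every path. The two-state "fast/slow" model, its probabilities and the observation symbols are made up for illustration and are not taken from the slides.

```python
from itertools import product

def forward(obs, states, start, trans, emit):
    """P(obs), summing over all hidden state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden state path and its joint probability."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            nxt[s] = v[best] * trans[best][s] * emit[s][x]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):          # trace the pointers backwards
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

def brute_force(obs, states, start, trans, emit):
    """Enumerate every state path (only feasible for tiny inputs)."""
    total, best = 0.0, 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][x]
        total += p
        best = max(best, p)
    return total, best

# Hypothetical model: alignment columns are "same" or "diff",
# hidden states are fast or slowly evolving.
STATES = ("fast", "slow")
START = {"fast": 0.5, "slow": 0.5}
TRANS = {"fast": {"fast": 0.8, "slow": 0.2},
         "slow": {"fast": 0.3, "slow": 0.7}}
EMIT = {"fast": {"same": 0.4, "diff": 0.6},
        "slow": {"same": 0.9, "diff": 0.1}}
OBS = ["same", "same", "diff", "diff", "same"]

likelihood = forward(OBS, STATES, START, TRANS, EMIT)
best_path, best_prob = viterbi(OBS, STATES, START, TRANS, EMIT)
total, brute_best = brute_force(OBS, STATES, START, TRANS, EMIT)
```

The dynamic programming versions cost time linear in the sequence length, whereas the brute-force check grows exponentially; it is included only to confirm the recursions on a small case.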
3Further Examples
Isochore (Churchill, 1989, 1992)
Lp(C) = Lp(G) = 0.1, Lp(A) = Lp(T) = 0.4,
Lr(C) = Lr(G) = 0.4, Lr(A) = Lr(T) = 0.1
Likelihood Recursions
Likelihood Initialisations
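The likelihood recursions for the isochore model can be sketched with a forward-backward pass that returns, for each position, the posterior probability of the rich state. The emission values are the Lp and Lr values listed on this slide, read here as GC-poor and GC-rich. The transition probability `STAY` and the uniform start distribution are assumptions, since the slide text does not give them.

```python
# Emission probabilities from the slide: p = poor, r = rich.
EMIT = {"p": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "r": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
STAY = 0.9                                  # assumed, not from the slide
TRANS = {"p": {"p": STAY, "r": 1 - STAY},
         "r": {"p": 1 - STAY, "r": STAY}}
START = {"p": 0.5, "r": 0.5}                # assumed, not from the slide
STATES = "pr"

def posterior_rich(seq):
    """Return (P(state_i = rich | seq) for each i, P(seq))."""
    # Initialisation and forward recursion.
    fwd = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        prev = fwd[-1]
        fwd.append({s: EMIT[s][x] * sum(prev[r] * TRANS[r][s] for r in STATES)
                    for s in STATES})
    # Backward recursion, built from the last position towards the first.
    bwd = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(TRANS[s][r] * EMIT[r][x] * nxt[r] for r in STATES)
                       for s in STATES})
    likelihood = sum(fwd[-1].values())
    post = [fwd[i]["r"] * bwd[i]["r"] / likelihood for i in range(len(seq))]
    return post, likelihood

post, lik = posterior_rich("ATATATATGCGCGCGC")
```

No rescaling is done, so this version is only suitable for short sequences; genome-scale input would need scaled or log-space recursions.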
Gene Finding (Burge and Karlin, 1996)
- Simple Prokaryotic
- Simple Eukaryotic
4Further Examples
Secondary Structure Elements (Goldman, 1996)
       α       β       L
α    .909    .0005   .091
β    .005    .881    .184
L    .062    .086    .852
     .325    .212    .462
HMM for SSEs
Adding Evolution
SSE Prediction
Profile HMM Alignment (Krogh et al., 1994)
5Grammars: Finite Set of Rules for Generating Strings
6Simple String Generators
Non-terminals (capital) --- Terminals (small)
i. Start with S.
S --> aT | bS
T --> aS | bT | ε
One sentence (odd number of a's): S --> aT --> aaS --> aabS --> aabaT --> aaba
ii. S --> aSa | bSb | aa | bb
One sentence (even-length palindromes): S --> aSa --> abSba --> abaaba
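A small sketch of how the two grammars generate strings: the function below expands the leftmost non-terminal breadth-first and collects every terminal string up to a length bound. The function name `generate` and the dictionary encoding of the rules are illustrative choices, not part of the slides.

```python
from collections import deque

# Rules from the slide; "" stands for the empty string (epsilon).
GRAMMAR_I = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}
GRAMMAR_II = {"S": ["aSa", "bSb", "aa", "bb"]}

def generate(rules, start="S", max_len=4):
    """All terminal strings with at most max_len symbols derivable from start."""
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Non-terminals are capital letters; find the leftmost one.
        idx = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if idx is None:
            results.add(form)           # only terminals left: a sentence
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            n_terminals = sum(ch.islower() for ch in new)
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))

odd_a = generate(GRAMMAR_I, max_len=3)
palindromes = generate(GRAMMAR_II, max_len=4)
```

For grammar i every generated string has an odd number of a's, and for grammar ii every string is an even-length palindrome, which matches the two example sentences.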
7Stochastic Grammars
The grammars above classify every string as belonging to the language or not.
Every variable has a finite set of substitution rules. Assigning probabilities to the use of each rule assigns probabilities to the strings in the language.
If a string has a unique (1-1) derivation, its probability is the product of the probabilities of the applied rules.
i. Start with S.
S --> (0.3)aT | (0.7)bS
T --> (0.2)aS | (0.4)bT | (0.2)ε
S --(0.3)--> aT --(0.2)--> aaS --(0.7)--> aabS --(0.3)--> aabaT --(0.2)--> aaba
ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb
S --(0.3)--> aSa --(0.5)--> abSba --(0.1)--> abaaba
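The product rule for a derivation can be checked with a few lines of code. This is a minimal sketch: the rule probabilities are copied from the slide as printed, and the helper name `derive` and the `(left-hand side, right-hand side)` encoding are illustrative.

```python
# Rule probabilities as printed on the slide ("" is epsilon).
# Note: the T rules as printed sum to 0.8; they are used verbatim here.
RULES_I = {("S", "aT"): 0.3, ("S", "bS"): 0.7,
           ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2}
RULES_II = {("S", "aSa"): 0.3, ("S", "bSb"): 0.5,
            ("S", "aa"): 0.1, ("S", "bb"): 0.1}

def derive(rule_probs, steps, start="S"):
    """Apply each (lhs, rhs) rule to the leftmost lhs; return (string, probability)."""
    form, prob = start, 1.0
    for lhs, rhs in steps:
        i = form.index(lhs)
        form = form[:i] + rhs + form[i + 1:]
        prob *= rule_probs[(lhs, rhs)]
    return form, prob

# S -> aT -> aaS -> aabS -> aabaT -> aaba
string_i, prob_i = derive(
    RULES_I, [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")])

# S -> aSa -> abSba -> abaaba
string_ii, prob_ii = derive(
    RULES_II, [("S", "aSa"), ("S", "bSb"), ("S", "aa")])
```

With these rules the first derivation has probability 0.3 × 0.2 × 0.7 × 0.3 × 0.2 = 0.00252 and the second 0.3 × 0.5 × 0.1 = 0.015.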
8Finding Regulatory Signals in Genomes
The Computational Problem
- Non-homologous/homologous sequences
- Known/unknown signal
- 1 common signal/complex signals/additional information
- Combinations
Regulatory signals known from molecular biology
Different Kinds of Signals
- Promoters
- Enhancers
- Splicing Signals
α-globins in humans
9Weight Matrices and Sequence Logos
Wasserman and Sandelin (2004) Applied Bioinformatics for the Identification of Regulatory Elements. Nature Reviews Genetics 5(4): 276
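To make the weight matrix idea concrete, the sketch below builds position frequencies from a set of aligned sites, converts them to log-odds weights, computes the per-column information content that a sequence logo displays, and scans a sequence for the best-scoring window. The aligned sites, the pseudocount and the uniform background of 0.25 are illustrative assumptions, not data from the cited review.

```python
import math

BASES = "ACGT"
# Hypothetical aligned binding sites, invented for illustration.
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATATT", "GATAAT"]

def frequencies(sites, pseudo=1.0):
    """Per-position base frequencies with a pseudocount for each base."""
    freqs = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        total = len(column) + 4 * pseudo
        freqs.append({b: (column.count(b) + pseudo) / total for b in BASES})
    return freqs

def weight_matrix(freqs, background=0.25):
    """Log-odds weights (bits) against a uniform background."""
    return [{b: math.log2(f[b] / background) for b in BASES} for f in freqs]

def information_content(freqs):
    """Column heights of a sequence logo: 2 bits minus the column entropy."""
    return [2.0 + sum(p * math.log2(p) for p in f.values()) for f in freqs]

def best_hit(seq, pwm):
    """(score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((sum(pwm[j][seq[i + j]] for j in range(w)), i)
               for i in range(len(seq) - w + 1))

freqs = frequencies(SITES)
pwm = weight_matrix(freqs)
info = information_content(freqs)
score, start = best_hit("GGCGTATAATCGC", pwm)
```

The information content here omits the small-sample correction that logo programs often apply.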
10Motifs in Biological Sequences
1990 Lawrence and Reilly: An Expectation Maximisation (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences. Proteins 7: 41-51.
1992 Cardon and Stormo: Expectation Maximisation Algorithm for Identifying Protein-binding Sites with Variable Lengths from Unaligned DNA Fragments. J. Mol. Biol. 223: 159-170.
1993 Lawrence, ..., Liu: Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262: 208-214.
Θ = (θ_{1,A}, ..., θ_{w,T}): probabilities of the different bases in the window
A = (a_1, ..., a_K): positions of the windows
θ_0 = (θ_A, ..., θ_T): background frequencies of nucleotides
Priors: A has a uniform prior; Θ_j has a Dirichlet(N0 α) prior, where α is the base frequency in the genome and N0 is the number of pseudocounts.
11The Gibbs Sampler
For i = 1,..,d: draw x_i(t+1) from the conditional distribution p(. | x_{-i}(t)) and leave the remaining components unchanged, i.e. x_{-i}(t+1) = x_{-i}(t)
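A minimal sketch of this update on a standard textbook target, a bivariate normal with unit variances and correlation rho, where each conditional is itself normal: x_i given the other component is N(rho × other, 1 − rho²). The target distribution, the function names and the burn-in length are choices made for illustration only.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5       # conditional standard deviation
    x = [0.0, 0.0]
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(2):
            # Draw component i from p(. | other component); the other
            # component is left unchanged during this step.
            x[i] = rng.gauss(rho * x[1 - i], sd)
        if t >= burn_in:
            samples.append((x[0], x[1]))
    return samples

def correlation(samples):
    """Sample correlation of the two components."""
    n = len(samples)
    mx = sum(a for a, _ in samples) / n
    my = sum(b for _, b in samples) / n
    sxy = sum((a - mx) * (b - my) for a, b in samples)
    sxx = sum((a - mx) ** 2 for a, _ in samples)
    syy = sum((b - my) ** 2 for _, b in samples)
    return sxy / (sxx * syy) ** 0.5

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
mean_x = sum(a for a, _ in samples) / len(samples)
corr = correlation(samples)
```

After burn-in the sample mean should be close to 0 and the sample correlation close to rho, although successive draws are autocorrelated.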
12The Gibbs sampler
Gibbs iteration
From Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262: 208-
13The Gibbs sampler example
From Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262: 208-
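The Gibbs iteration for motif finding can be sketched as a simplified site sampler in the spirit of Lawrence et al. (1993): one window per sequence, a fixed window width, and a profile rebuilt from all other sequences before each draw. This is not the published implementation. The sequences, pseudocount, sweep count and function names are invented for illustration, and the background is estimated from all positions rather than from non-motif positions only.

```python
import random

BASES = "ACGT"
# Hypothetical input sequences, invented for illustration.
SEQS = ["GCGTATAATGCC", "ACCTATGATGGA", "TTGCGTACAATC",
        "CGTATATTGGCA", "GGATAATCCGCA"]
W = 6                                    # assumed window width

def background(seqs, pseudo=0.5):
    """Base frequencies over all sequences (a simplification)."""
    counts = {b: pseudo for b in BASES}
    for s in seqs:
        for ch in s:
            counts[ch] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def profile(seqs, pos, w, skip, pseudo=0.5):
    """Window base frequencies from every sequence except index skip."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for k, (s, a) in enumerate(zip(seqs, pos)):
        if k != skip:
            for j in range(w):
                counts[j][s[a + j]] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def gibbs_motif(seqs, w, sweeps=100, seed=0):
    """Return one sampled window start per sequence."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    q0 = background(seqs)
    for _ in range(sweeps):
        for z, s in enumerate(seqs):
            q = profile(seqs, pos, w, skip=z)
            starts = range(len(s) - w + 1)
            weights = []
            for a in starts:
                ratio = 1.0              # window likelihood / background likelihood
                for j in range(w):
                    ratio *= q[j][s[a + j]] / q0[s[a + j]]
                weights.append(ratio)
            # Sample the new start in proportion to the likelihood ratios.
            pos[z] = rng.choices(starts, weights=weights)[0]
    return pos

positions = gibbs_motif(SEQS, W)
```

Because the sampler is stochastic, different seeds can settle on different windows; running several chains and comparing the resulting profiles is a common precaution.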
14Natural Extensions to Basic Model I
Modified from Liu
15Natural Extensions to Basic Model II
16Combining Signals and other Data
Modified from Liu
17MEME- Multiple EM for Motif Elicitation
Motif nucleotide distribution M_{p,q}, where p is the position and q the nucleotide.
Background distribution B_q; λ is the probability that Z_{i,j} = 1 (a motif occurrence starts at position j of sequence i).
Find M, B, λ, Z that maximize Pr(X, Z | M, B, λ).
Expectation Maximization to find a local maximum. Iteration t:
Expectation-step: Z(t) = E(Z | X, (M, B, λ)(t))
Maximization-step: find (M, B, λ)(t+1) that maximizes Pr(X, Z(t) | (M, B, λ)(t+1))
Bailey, T. L. and C. Elkan (1994). "Fitting a mixture model by expectation maximization to discover motifs in biopolymers." Proc Int Conf Intell Syst Mol Biol 2: 28-36.
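The E-step and M-step above can be sketched for a two-component mixture in which every length-w window is either a motif occurrence (probability λ) or background. This is a simplified illustration and not the MEME program, which adds features such as multiple starting points and motif erasing. The sequences, the seeding helper `seed_motif`, the pseudocount and the iteration count are assumptions.

```python
BASES = "ACGT"
# Hypothetical input sequences, invented for illustration.
SEQS = ["GCGTATAATGCC", "ACCTATGATGGA", "TTGCGTACAATC", "CGTATATTGGCA"]

def seed_motif(word, weight=0.7):
    """Hypothetical starting point: favour the letters of a seed word."""
    rest = (1.0 - weight) / 3.0
    return [{b: (weight if b == ch else rest) for b in BASES} for ch in word]

def em_motif(seqs, M, lam=0.1, iters=30, pseudo=0.1):
    """Return (M, B, lam, Z); Z holds one value per window, in scan order."""
    w = len(M)
    X = [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]
    B = {b: 0.25 for b in BASES}
    Z = []
    for _ in range(iters):
        # E-step: Z = posterior probability that each window is a motif.
        Z = []
        for x in X:
            pm, pb = lam, 1.0 - lam
            for p, ch in enumerate(x):
                pm *= M[p][ch]
                pb *= B[ch]
            Z.append(pm / (pm + pb))
        # M-step: re-estimate lambda, M and B from Z-weighted counts.
        lam = sum(Z) / len(Z)
        mc = [{b: pseudo for b in BASES} for _ in range(w)]
        bc = {b: pseudo for b in BASES}
        for x, z in zip(X, Z):
            for p, ch in enumerate(x):
                mc[p][ch] += z
                bc[ch] += 1.0 - z
        M = [{b: c[b] / sum(c.values()) for b in BASES} for c in mc]
        B = {b: bc[b] / sum(bc.values()) for b in BASES}
    return M, B, lam, Z

M_hat, B_hat, lam_hat, Z_hat = em_motif(SEQS, seed_motif("TATAAT"))
```

EM only finds a local maximum, so the result depends on the starting motif; trying several seeds and keeping the best-scoring model is the usual workaround.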
18Phylogenetic Footprinting (homologous detection)
Blanchette and Tompa (2003) FootPrinter: a program designed for phylogenetic footprinting. NAR 31(13): 3840-
20Statistical Alignment and Footprinting
Solution: Cartesian Product of HMMs
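One way to read "Cartesian product of HMMs" is a combined chain whose states are pairs (alignment state, structure state). The sketch below builds its transition matrix under the assumption that the two component chains move independently, which the slide text does not state. The state names and all probabilities are hypothetical.

```python
# Hypothetical alignment HMM over match / insert / delete states.
ALIGN = {"M": {"M": 0.90, "I": 0.05, "D": 0.05},
         "I": {"M": 0.60, "I": 0.30, "D": 0.10},
         "D": {"M": 0.60, "I": 0.10, "D": 0.30}}
# Hypothetical structure HMM: inside a signal or in background sequence.
STRUCT = {"signal": {"signal": 0.8, "background": 0.2},
          "background": {"signal": 0.1, "background": 0.9}}

def product_hmm(trans_a, trans_s):
    """Transitions over pair states (a, s), assuming independent moves."""
    return {
        (a, s): {(a2, s2): pa * ps
                 for a2, pa in trans_a[a].items()
                 for s2, ps in trans_s[s].items()}
        for a in trans_a for s in trans_s
    }

PAIR = product_hmm(ALIGN, STRUCT)
```

The product chain has 3 × 2 = 6 states, and each row still sums to 1 because each component row does.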
21Structure does not stem from an evolutionary model
- The equilibrium annotation does not follow a Markov chain
- Each alignment from the Alignment HMM is annotated by the Structure HMM
- No ideal way of simulating:
  - using the HMM at the alignment will give other distributions on the leaves
  - using the HMM at the root will give other distributions on the leaves
22(Homologous + Non-homologous) detection
Wang and Stormo (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19(18): 2369-80
23Regulatory Signals in Humans
Transcription in Eukaryotes is done by RNA
Polymerase II. 1850 DNA-binding proteins in the
human genome.
- Transcription Start Site - TSS
- Core Promoter - within 100 bp of TSS
- Proximal Promoter Elements - within 1 kb of TSS
- Locus Control Region - LCR
- Insulator
- Silencer
- Enhancer
Source: Transcriptional Regulatory Elements in the Human Genome. Glenn A. Maston, Sara K. Evans, Michael R. Green. Annual Review of Genomics and Human Genetics, Volume 7, Sep 2006
24Source: Transcriptional Regulatory Elements in the Human Genome. Glenn A. Maston, Sara K. Evans, Michael R. Green. Annual Review of Genomics and Human Genetics, Volume 7, Sep 2006
25α-globins
Multispecies Conserved Sequences - MCSs
Analyzed 238 kb in 22 species. Found 24 MCSs.
Programs used: GUMBY - VISTA - MULTIPIPMAKER - MULTILAGAN - CLUSTALW - DIALIGN - TRANSFAC 6.0 - TRES
Experimental knowledge of the region: Hypersensitive sites (DHSs), DNA methylation
The region lies in a CG-rich, gene-rich region close to the telomeres. It is not easy to align CG-islands.
26Promoters in α-globins
- 94.273-114.273 vista illus.
- 5 MCSs
- Divergence relative to human
- Promoters MCSs - 11
- Regulatory MCSs - 4
- Intronic MCSs - 2
- Exonic MCSs - 4
- Unknown - 3
Source: Hughes et al. (2005) Annotation of cis-regulatory elements by identification, subclassification, and functional assessment of multispecies conserved sequences. PNAS 102: 9830-9835
27Regulatory Protein-DNA Complexes
28Challenges
Open Problems