Title: Biological sequence analysis
1Biological sequence analysis
- Terry Speed
- Wald Lecture II,
- August 8, 2001
2The objects of our study
- DNA, RNA and proteins macromolecules which
are unbranched polymers built up from smaller
units. - DNA units are the nucleotide residues A, C, G
and T - RNA units are the nucleotide residues A, C,
G and U - Proteins units are the amino acid residues A,
C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T,
V, W and Y. - To a considerable extent, the chemical
properties of DNA, RNA and protein molecules are
encoded in the linear sequence of these basic
units their primary structure.
3The use of statistics to study linear sequences
of biomolecular units
- Can be descriptive, predictive or everything
else in between..almost business as usual. - Stochastic mechanisms should never be taken
literally, but nevertheless can be amazingly
useful. - Care is always needed a model or method can
break down at any time without notice. - Biological confirmation of predictions is
almost always necessary.
4The statistics of biological sequences can
be global or local
- Base composition of genomes
- E. coli 25 A, 25 C, 25 G, 25 T
- P. falciparum 82AT
- Translation initiation
- ATG is the near universal motif indicating the
- start of translation in DNA coding sequence.
5From certainty to statistical models a brief
case study
1 ZNF Cys-Cys-His-His zinc finger DNA binding
domain
6Cys-Cys-His-His zinc finger DNA binding domain
- Its characteristic motif has regular
expression - C-x(2,4)-C-x(3)-LIVMFYWC-x(8)-H-x(3,5)-H
-
-
- 1ZNF XYKCGLCERSFVEKSALSRHQRVHKNX
- Regular expressions can be limiting, and
position-specific distributions came to represent
the variability in motifs.
7Cys-Cys-His-His profile sequence logo form
A sequence logo is a scaled position-specific
a.a.distribution. Scaling is by a measure of a
positions information content.
8Representation of motifs the next steps
- Missing from the position-specific
distribution - representation of motifs are good ways of
dealing with - Length distributions for insertions/deletions
- Cross-position association of amino acids
- Hidden Markov models help with the first.
The second remains a hard unsolved problem.
9Hidden Markov models
- Processes (St,Ot), t1,, where St is the
hidden - state and Ot the observation at time t, such
that -
- pr(St St-1,Ot-1,St-2 ,Ot-2 ) pr(St
St-1) - pr(Ot St-1,Ot-1,St-2 ,Ot-2 ) pr(Ot
St, St-1) - The basics of HMMs were laid bare in a series
of beautiful papers by L E Baum and colleagues
around 1970, and their formulation has been used
almost unchanged to this day.
10Hidden Markov modelsextensions
- Many variants are now used. For example, the
distribution of O may not depend on previous S
but on previous O values, -
- pr(Ot St , St-1 , Ot-1 ,.. ) pr(Ot
St ), or - pr(Ot St , St-1 , Ot-1 ,.. ) pr(Ot
St , St-1 ,Ot-1) . -
- Most importantly for us, the times of S and O
may be decoupled, permitting the Observation
corresponding to State time t to be a string
whose length and composition depends on St (and
possibly St-1 and part or all of the previous
Observations). This is called a hidden
semi-Markov or generalized hidden Markov model. -
11A simple HMM (Churchill, 1989)
O.O1
O.99
O.9
O.1
hidden states
.
.
.
.
.
.
observations
.
.
.
.
.
.
12The algorithms
- As the name suggests, with an HMM the series
O (O1,O2 ,O3 ,., OT) is observed, while
the states S (S1 ,S2 ,S3 ,., ST) are not. -
- There are elegant algorithms for calculating
pr(O?), arg max? pr(O?) in certain special
cases, and arg maxS pr(SO,?). - Here ? are the parameters of the model, e.g.
transition and observation probabilities.
13Some early applications of HMMs
finance, but we never saw them
speech recognition modelling
ion channels
In the mid-late 1980s HMMs entered genetics
and molecular biology, and they are now firmly
entrenched.
-
- Some current applications of HMMs to biology
- mapping chromosomes
- aligning biological sequences
- predicting sequence structure
- inferring evolutionary
relationships - finding genes in DNA sequence
14HMMs representing coiled-coil domains
2TMA Tropomyosin
15Coiled-coil domains, schematically
dimeric parallel helices, heptad repeats,
knobs-into-holes
16(hydrophobic)
17Designing the HMM, I
18Designingthe HMM, 2
19Designing the HMM, 3
20HMM decoding
WGP ARQLNES VKDINKM LER HP BBB CCCCCCC CCCCCCC
CCC BB 000 abcdefg abcdefg abc 00 00c defgabc
defgabc def g0
Sequence Labels Path1 Path2
- VITERBI decoding of all possible state-paths, we
determine the maximum probability one given the
amino acid sequence O - POSTERIOR decoding at each position, we
determine the state with the highest probability
given O. - Issue how to measure the strength of a potential
CC domain, and how should this depend on the
length of a domain?
21CC-PROBABILITY PROFILE
Fusion protein of simian parainfluenza virus 5
22Assessing performance terms
TP true positive a predicted fragment that
overlaps the annotated fragment (aa in the
annotated region) FP false positive a
predicted fragment does not overlap any
annotated fragment (aa not in the
annotated region) LS learning
set of sequences NTS negative test set
sequences with no CCD used for
estimating FP PTS positive test set used for
estimating TP Much care/effort required to create
these sets
23Assessing performance study design
Study the variability of performance under
variation of the sequences used for
determining the model parameters. Compare
methods using the same set of
aa-frequencies / emission probabilities. Use
the same set of domains for learning and testing
instead of testing on different protein
families. Choose a number of FP-rates and
calculate the corresponding TP-rates (ROC curve).
24 PTS subdivided 150 times at random
(stratified) into 2/3 for learning and the rest
for testing.
LEARNING PHASE gt PARAMETERS
TESTING ON NTS gt THRESHOLDS
150 X
TESTING ON PTS gt TP-VALUES
ANALYSIS OF RESULTS
25Assessing performance summaries
- TP-rate at given FP-rate per family / per
length-class - TP- and FP-rates for aas per family / per
length-class - Accuracy at the borders and of length prediction
26Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
27What is a (protein-coding) gene?
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
28What is a gene, ctd?
- In general the transcribed sequence is longer
than the translated portion parts called introns
(intervening sequence) are removed, leaving exons
(expressed sequence), and yet other regions
remain untranslated. The translated sequence
comes in triples called codons, beginning and
ending with a unique start (ATG) and one of three
stop (TAA, TAG, TGA) codons. - There are also characteristic intron-exon
boundaries called splice donor and acceptor
sites, and a variety of other motifs promoters,
transcription start sites, polyA sites,branching
sites, and so on. - All of the foregoing have statistical
characterizations.
29In more detail (color state)
30Some facts about human genes
- Comprise about 3 of the genome
- Average gene length 8,000 bp
- Average of 5-6 exons/gene
- Average exon length 200 bp
- Average intron length 2,000 bp
- 8 genes have a single exon
- Some exons can be as small as 1 or 3 bp.
- HUMFMR1S is not atypical 17 exons 40-60 bp long,
comprising 3 of a 67,000 bp gene
31The idea behind a GHMM genefinder
- States represent standard gene features
intergenic region, exon, intron, perhaps more
(promotor, 5UTR, 3UTR, Poly-A,..). - Observations embody state-dependent base
composition, dependence, and signal features. -
- In a GHMM, duration must be included as well.
-
- Finally, reading frames and both strands must be
dealt with.
32Half a model for a genefinder
33Splice sites can be included in the exons
34Beyond position-specific distributions
- The bases in splice sites exhibit dependence, and
not simply of the nearest neighbor kind. - High-order (non-stationary) Markov models would
be one option, but the number of parameters in
relation to the amount of data rules them out.
The class of variable length Markov models
(VLMMs) deriving from early research by Rissanen
prove to be valuable in this context. However,
there is likely to be room for more research
here.
35GENSCAN (Burge Karlin)
36Remark
- In general the problem of identifying
(annotating) human genes is considerably harder
than ß-globin might suggest. - The human factor VIII gene (whose mutations
cause hemophilia A) is spread over 186,000 bp.
It consists of 26 exons ranging in size from 69
to 3,106 bp, and its 25 introns range in size
from 207 to 32,400 bp. The complete gene is thus
9 kb of exon and 177 kb of intron. -
- The biggest human gene yet is for dystrophin.
It has gt 30 exons and is spread over 2.4
million bp.
37Challenges in the analysis of sequence data
- Understanding the biology well enough to
begin. - Designing HMM architecture, e.g. in Marcoil
for coiled-coils. - Modelling the parts, e.g. VLMMs for splice
sites. - Coding software engineering, is the hardest
and most important task of all making it all
work. - Obtaining good data sets for use in careful
evaluation and comparison with competing
algorithms designing the studies. - Opportunities for methodological research.
38Topics not mentioned include
- Molecular evolution, including phylogenetic
inference (building trees from aligned sequence
data) - Sequence alignment (pairwise, multiple),
including use of Gibbs sampler - Stochastic context-free grammar models and the
analysis of RNA sequence data.
39Acknowledgements
- Mauro Delorenzi (WEHI)
- Simon Cawley (Affymetrix)
- Tony Wirth (CS, Princeton)
- Lior Pachter (Math, UCB)
- Marina Alexandersson (Stat, UCB)
40References
- Biological Sequence Analysis
- R Durbin, S Eddy, A Krogh and G Mitchison
- Cambridge University Press, 1998.
- Bioinformatics The machine learning approach
- P Baldi and S Brunak
- The MIT Press, 1998
- Post-Genome Informatics
- M Kanehisa
- Oxford University Press, 2000