Title: Statistical modeling and classification in Biological Sequence Space
1 Statistical modeling and classification in Biological Sequence Space
- April 26, 2004, 9.520
- Gene Yeo
- Poggio, Burge @ MIT
2 Framework/Issues
- Build models around known biology
- In the process, extend knowledge about known biology
- Predict new examples
- Validate predictions by
  - prediction accuracy
  - experimental validation
  - higher-level traits of predictions
  - conservation in other genomes
3 Biological sequences
- DNA, RNA and proteins: macromolecules built up from smaller units.
- DNA: units are the nucleotide residues A, C, G and T
- RNA: units are the nucleotide residues A, C, G and U
- Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
- To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.
4
- Statistical models can be descriptive and/or predictive.
- Given a known biological signal -> describe the signal with statistical modeling -> find unknown examples of the same signal
  - Gene-finding (protein-coding genes)
  - Noncoding RNA genes
  - Protein domains
- Warning: although successful, models are not to be taken literally.
- Most important: biological confirmation of predictions is almost always necessary.
5 Sequences are full of signals!
ACGTAGCTAGCATGCATGCATGACTACGATCGACTACGATCAACGATGCA
TGCATCGACTACGATCAGCTACGATCAGCATCGACTAGCATCGATCAGCA
TCGATCAGCATCGACTAGCTACGACTAGCGCTAC
How do we model/describe these motifs?
6 Different models
[Figure: modeling tasks arranged by complexity and by sequence type (DNA, RNA, Protein)]
- Splice site motif (WMM, MM, SVM, NN)
- Protein gene (HMM, NN)
- RNA gene (Covariation, SCFG, NN, SVM)
- Protein structure (a variety of methods)
7 Modeling dependencies in biological sequence motifs
Model                                     Assumptions
Weight Matrix Model (WMM)                 Independence (easy)
Hidden Markov Model (HMM)                 Local dependence (medium)
Stochastic Context-Free Grammar (SCFG)    Non-local pairwise dependence (hard)
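The difference between the independence and local-dependence assumptions can be illustrated with a small sketch. It is not taken from the lecture: the training sites are toy data made up for the example, and the function names `wmm_prob` and `mm1_prob` are hypothetical. The second function is a position-specific first-order Markov model, used here as the simplest stand-in for "local dependence".

```python
# Toy aligned training sites (hypothetical data, chosen only for illustration).
SITES = ["ACGT", "ACGA", "TGCA", "TGCT"]


def wmm_prob(site, sites=SITES):
    """P(site) under a weight matrix model: every position is independent."""
    p = 1.0
    for i, base in enumerate(site):
        column = [s[i] for s in sites]
        p *= column.count(base) / len(sites)
    return p


def mm1_prob(site, sites=SITES):
    """P(site) under a first-order Markov model: each position is
    conditioned on the base observed at the previous position."""
    first_column = [s[0] for s in sites]
    p = first_column.count(site[0]) / len(sites)
    for i in range(1, len(site)):
        # Training sites that agree with `site` at the previous position.
        context = [s for s in sites if s[i - 1] == site[i - 1]]
        if not context:
            return 0.0
        p *= sum(1 for s in context if s[i] == site[i]) / len(context)
    return p


if __name__ == "__main__":
    for candidate in ("ACGT", "AGGT"):
        print(candidate, wmm_prob(candidate), mm1_prob(candidate))
```

In this toy set, the weight matrix model cannot tell "ACGT" from "AGGT" because each column is scored on its own, while the first-order model notices that A was never followed by G in the training sites.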
8 A case study in computational biology: modeling signals in genes
With so many genomes being sequenced, it
remains important to be able to identify genes
and the signals within and around genes
computationally.
9 What is a (protein-coding) gene?
DNA:     CCTGAGCCAACTATTGATGAA
RNA:     CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
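The three lines on this slide can be reproduced with a short sketch. It assumes the standard genetic code and treats the DNA line as the coding strand, so transcription is just T -> U; the function names `transcribe` and `translate` are illustrative, not from the lecture.

```python
# Standard genetic code, with codons enumerated in the usual T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon ('*' marks a stop codon).
    Trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    print(transcribe(dna))
    print(translate(dna))
```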
10 What is a gene, ctd.?
- In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning and ending with a unique start (ATG) and one of three stop (TAA, TAG, TGA) codons.
- There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites, branching sites, and so on.
- All of the foregoing have statistical characterizations.
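The start/stop codon structure described above is the basis of the simplest possible gene signal, the open reading frame. The sketch below is an assumption-laden illustration, not the lecture's method: it scans a single strand, ignores introns entirely, and the name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna):
    """Return (start, end) index pairs, end exclusive, for every ATG that is
    followed in the same reading frame by a stop codon."""
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until the first stop.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))
                break
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATGGTAACC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

In the example sequence, the second ATG (inside the first reading frame) yields nothing because no in-frame stop codon follows it.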
12 Some facts about human genes
- Comprise about 3% of the genome
- Average gene length: 8,000 bp
- Average of 5-6 exons/gene
- Average exon length: 200 bp
- Average intron length: 2,000 bp
- 8% of genes have a single exon
13 The idea behind an HMM genefinder
- States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5' UTR, 3' UTR, Poly-A, ...).
- Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
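A minimal sketch of the idea, assuming a toy two-state model in which "exon" is GC-rich and "intron" is AT-rich. All probabilities below are invented for illustration and are not GENSCAN's or any real genefinder's parameters; the decoding step is the standard Viterbi algorithm in log space.

```python
import math

STATES = ("exon", "intron")
# Hypothetical parameters, chosen only to make the example readable.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},    # state-dependent
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # base composition
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best_prev = max(STATES, key=lambda r: scores[-1][r] + math.log(TRANS[r][s]))
            current[s] = (scores[-1][best_prev] + math.log(TRANS[best_prev][s])
                          + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("GCGCGCGCATATATAT"))
```

A real genefinder would add more states (intergenic, UTRs, splice signals) and richer emissions, but the decoding step keeps this shape.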
14 GENSCAN (Burge & Karlin)
15 A simple genefinder
16 Splice sites can be an important signal
17 Regular expressions can be limiting
5' splice junction in eukaryotes: (C/A)AGGT(A/G)AGT
3' splice junction: (T/C)11 N (T/C) AGC
Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
Position-specific distributions came to represent the variability in motif composition.
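A short sketch of why a consensus pattern is brittle. It assumes the 5' splice junction consensus is read as the regular expression `[CA]AGGT[AG]AGT` (an interpretation of the stacked letters on the slide); the helper names are hypothetical.

```python
import re

# Assumed reading of the 5' splice junction consensus as a regular expression.
DONOR_REGEX = re.compile(r"[CA]AGGT[AG]AGT")
# The same consensus as per-position sets of allowed bases.
ALLOWED = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def matches_consensus(site):
    """True only if every position of the 9-mer agrees with the consensus."""
    return DONOR_REGEX.fullmatch(site) is not None


def mismatch_count(site):
    """Number of positions at which the 9-mer departs from the consensus."""
    return sum(base not in allowed for base, allowed in zip(site, ALLOWED))


if __name__ == "__main__":
    for site in ("CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGC"):
        print(site, matches_consensus(site), mismatch_count(site))
```

The regular expression gives the same flat "no" to a site with one mismatch as to a random 9-mer, which is the limitation that position-specific distributions address by scoring each position separately.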
18 Position-specific scoring matrix (PSSM)
Pos   -3   -2   -1    1    2    3    4    5    6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
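The matrix on this slide can be used to score a candidate 9-mer. The sketch below uses the slide's frequencies (positions -3 to 6) and, as an assumption not stated in the lecture, a uniform background of 0.25 per base; scoring as a log-odds ratio in bits is one common choice, and the function name `log_odds` is illustrative.

```python
import math

# Frequencies from the slide, columns ordered by position -3, -2, -1, 1, ..., 6.
PSSM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
BACKGROUND = 0.25  # assumed uniform background, not from the slide


def log_odds(site):
    """Log2 odds of a 9-mer under the matrix versus the background.
    A base with frequency 0.0 at its position gives minus infinity."""
    if len(site) != 9:
        raise ValueError("site must have 9 bases")
    score = 0.0
    for i, base in enumerate(site):
        p = PSSM[base][i]
        if p == 0.0:
            return float("-inf")
        score += math.log2(p / BACKGROUND)
    return score


if __name__ == "__main__":
    for site in ("CAGGTAAGT", "TTTGTCCGT", "CAGATAAGT"):
        print(site, log_odds(site))
```

In practice the zero entries are usually smoothed with pseudocounts so that a single unseen base does not veto a site outright.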
19 Ok, so we got the genes
- molecular biology (transcription, splicing)
- signals are modeled as states (HMM) or separately, i.e. PSSMs
- Here's another catch: there isn't just one version of each gene.
- But sometimes several
20 E.g. alternative splicing - CD44
Human chromosome 11p
Zhu et al., Science (2003)
21 Alternative splicing
- is a major determinant of protein diversity (Lander 2001, Zavolan 2003)
- 30-50% of human diseases involve alt. splicing
22 Defining constitutive and alternative exons
- Constitutive exon
- Skipped exon
- 3' alternative exon
- 5' alternative exon
- Intron retention
- Mutually exclusive exons
23 Conserved alternative, skipped exon - FXR1
Fragile X Related Gene, FXR1
24 Another example of genes containing CSE: DMWD
Myotonic Dystrophy-containing WD Repeat, DMWD
25 Predicting new alternatively spliced exons
- The problem is ill-posed
- High-dimensional space
- Do not overfit the data
- Simple feature selection
- Unbalanced data set sizes
- Labels are more flexible
26 E.g. of experimentally validated
27 Biological sequence space challenges
- Models that represent as much of the biology as possible.
- Biologically motivated features are important
- Validating attributes
- Conservation of events is key in computational biology
- Higher-level consistency with known biology
- Experimental validation of predictions is essential
28 Framework/Issues
- Build models around known biology
- In the process, extend knowledge about known biology
- Predict new examples
- Validate predictions by
  - prediction accuracy
  - experimental validation
  - higher-level traits of predictions
  - conservation in other genomes
29 Modeling higher order interactions: Yeast Phe tRNA
If time permits
[Figure panels: Secondary Structure; Tertiary Structure]
30 The Hammerhead Ribozyme
[Figure panels: Secondary structure; Tertiary structure]
31 One example on how to model and predict RNA 2° Structure
Covariation (using comparative genomics)
Seq1  A C G A A A G U
Seq2  U A G U A A U A
Seq3  A G G U G A C U
Seq4  C G G C A A U G
Seq5  G U G G G A A C
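32 Mutual information statistic for pair of columns in a multiple alignment
M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}
- f_{x_i x_j}: fraction of seqs w/ nt. x_i in col. i and nt. x_j in col. j
- f_{x_i}: fraction of seqs w/ nt. x_i in col. i
- sum over x_i, x_j in {A, C, G, U}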
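As a rough illustration, the statistic can be computed for every pair of columns in the five-sequence alignment from the covariation slide. The sketch uses the standard mutual information definition in bits; the function name `mutual_information` and the 0-based column indices are choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations

# The five aligned sequences from the covariation slide.
ALIGNMENT = ["ACGAAAGU", "UAGUAAUA", "AGGUGACU", "CGGCAAUG", "GUGGGAAC"]


def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j (0-based)."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)
    total = 0.0
    for (x, y), count in pair_counts.items():
        f_xy = count / n
        total += f_xy * math.log2(f_xy / ((i_counts[x] / n) * (j_counts[y] / n)))
    return total


if __name__ == "__main__":
    width = len(ALIGNMENT[0])
    ranked = sorted(
        combinations(range(width), 2),
        key=lambda pair: mutual_information(ALIGNMENT, *pair),
        reverse=True,
    )
    for i, j in ranked[:3]:
        print(i + 1, j + 1, round(mutual_information(ALIGNMENT, i, j), 3))
```

Columns 1 and 8 covary perfectly in this alignment (A-U, U-A, A-U, C-G, G-C), so they score highest, while invariant columns such as 3 and 6 carry no mutual information. With only five sequences, unrelated columns also score well above zero, so such values need many more sequences to be meaningful.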
33 Inferring 2° Structure from Covariation
34 Stochastic Context-Free Grammars (SCFGs)
A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)
Ref: Durbin et al., Biological Sequence Analysis, 1998
35 An SCFG Model of RNA 2° Structure
Production Rules
P → aWb  (pair)
L → aW   (left bulge/loop)
R → Wa   (right bulge/loop)
B → SS   (bifurcation)
S → W    (start)
E → ε    (end)
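To make the production rules concrete, the sketch below expands a hand-written derivation into a sequence and its dot-bracket structure. It is only an illustration: rule probabilities (the "stochastic" part) are left out, the tuple encoding of the derivation tree is an assumption of this example, and the derivation shown is one possible parse of Seq1 from the covariation slide, not a predicted structure.

```python
def emit(node):
    """Expand a derivation tree into (sequence, dot-bracket structure).
    Nodes are tuples whose first element names the production used."""
    kind = node[0]
    if kind == "E":                      # E -> epsilon
        return "", ""
    if kind == "S":                      # S -> W
        return emit(node[1])
    if kind == "P":                      # P -> a W b
        _, a, child, b = node
        seq, struct = emit(child)
        return a + seq + b, "(" + struct + ")"
    if kind == "L":                      # L -> a W
        _, a, child = node
        seq, struct = emit(child)
        return a + seq, "." + struct
    if kind == "R":                      # R -> W a
        _, child, a = node
        seq, struct = emit(child)
        return seq + a, struct + "."
    if kind == "B":                      # B -> S S
        _, left, right = node
        left_seq, left_struct = emit(left)
        right_seq, right_struct = emit(right)
        return left_seq + right_seq, left_struct + right_struct
    raise ValueError("unknown production: %r" % (kind,))


# One possible derivation of Seq1 (ACGAAAGU): two pairs closing a 4-base loop.
HAIRPIN = ("S", ("P", "A", ("P", "C",
           ("L", "G", ("L", "A", ("L", "A", ("L", "A", ("E",))))),
           "G"), "U"))

if __name__ == "__main__":
    sequence, structure = emit(HAIRPIN)
    print(sequence)
    print(structure)
```

Each pair rule emits two bases at once, one on each side of whatever lies between them, which is how the grammar captures the non-local dependence between paired positions.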
36 Last page
- Some of the slides were obtained from various places
- Slides available online (primarily from lectures by Terry Speed).
- Slides from Chris Burge, Dirk Holste