Statistical modeling and classification in Biological Sequence Space

Slides: 33

Provided by: gene91

Learn more at: http://web.mit.edu


Title: Statistical modeling and classification in Biological Sequence Space

1
Statistical modeling and classification in
Biological Sequence Space

April 26, 2004, 9.520
Gene Yeo
Poggio, Burge @ MIT

2
Framework/Issues

Build models around known biology
In the process, extend knowledge about known
biology
Predict new examples
Validate predictions by
prediction accuracy
experimental validation
higher-level traits of predictions
conservation in other genomes

3
Biological sequences

DNA, RNA and proteins are macromolecules built up
from smaller units.
DNA units are the nucleotide residues A, C, G
and T.
RNA units are the nucleotide residues A, C,
G and U.
Protein units are the amino acid residues A,
C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T,
V, W and Y.
To a considerable extent, the chemical
properties of DNA, RNA and protein molecules are
encoded in the linear sequence of these basic
units: their primary structure.

4
Statistical models can be descriptive and/or
predictive.
Given a known biological signal -> describe the
signal with statistical modeling -> find unknown
examples of the same signal:
Gene-finding (protein-coding genes)
Noncoding RNA genes
Protein domains
Warning: although successful, models are not to
be taken literally.
Most important: biological confirmation of
predictions is almost always necessary.

5
Sequences are full of signals!
ACGTAGCTAGCATGCATGCATGACTACGATCGACTACGATCAACGATGCA
TGCATCGACTACGATCAGCTACGATCAGCATCGACTAGCATCGATCAGCA
TCGATCAGCATCGACTAGCTACGACTAGCGCTAC
How do we model/describe these motifs?
6
Different models
RNA gene (covariation, SCFG, NN, SVM)
Protein structure (a variety of methods)
Protein gene (HMM, NN)
Splice site motif (WMM, MM, SVM, NN)
[Figure axes: complexity vs. DNA, RNA, protein]
7
Modeling dependencies in biological sequence
motifs
Object Model
Assumptions
Weight Matrix Model (WMM): independence (easy)
Hidden Markov Model (HMM): local dependence (medium)
Stochastic Context-Free Grammar (SCFG): non-local pairwise dependence (hard)
8
A case study in computational biology: modeling
signals in genes
With so many genomes being sequenced, it
remains important to be able to identify genes
and the signals within and around genes
computationally.
9
What is a (protein-coding) gene?
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
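The three lines above trace DNA to RNA to protein. As a minimal sketch (the function names `transcribe` and `translate` are illustrative, and the standard genetic code is assumed), the same steps can be reproduced in Python:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T -> U)."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate an RNA sequence codon by codon, stopping at the first stop codon."""
    dna = rna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The sketch reads the sequence in a single frame from its first base; a real gene would first need its start codon and exon structure identified.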
10
What is a gene, ctd?

In general the transcribed sequence is longer
than the translated portion: parts called introns
(intervening sequence) are removed, leaving exons
(expressed sequence), and yet other regions
remain untranslated. The translated sequence
comes in triples called codons, beginning with
a unique start codon (ATG) and ending with one of
three stop codons (TAA, TAG, TGA).
There are also characteristic intron-exon
boundaries called splice donor and acceptor
sites, and a variety of other motifs: promoters,
transcription start sites, polyA sites, branching
sites, and so on.
All of the foregoing have statistical
characterizations.
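To make the codon structure concrete, here is a rough sketch of an open reading frame scan: it looks for an ATG and reads in-frame triples until one of the three stop codons. The function name `find_orfs` and the `min_codons` threshold are illustrative assumptions, and the scan ignores introns, so it only applies to already-spliced sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches on the given strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop (start codon included).
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk downstream in steps of three until an in-frame stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


dna = "CCATGGCCTGATT"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

A scan like this only captures the start/stop rule; the splice sites and other motifs listed above need their own statistical models.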

11
(No Transcript)
12
Some facts about human genes

Comprise about 3% of the genome
Average gene length 8,000 bp
Average of 5-6 exons/gene
Average exon length 200 bp
Average intron length 2,000 bp
8% of genes have a single exon

13
The idea behind a HMM genefinder

States represent standard gene features:
intergenic region, exon, intron, perhaps more
(promoter, 5' UTR, 3' UTR, poly-A, ...).
Observations embody state-dependent statistics,
such as base composition, dependence, and signal
features.
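The idea can be illustrated with a deliberately tiny HMM. The sketch below is not GENSCAN or any published genefinder: it assumes just two states (N for intergenic, E for exon), and every probability in it is a made-up value chosen so that exons look GC-rich. Viterbi decoding then labels each base with its most likely state.

```python
import math

# Hypothetical toy parameters, for illustration only.
STATES = ("N", "E")  # N = intergenic, E = exon
START = {"N": 0.5, "E": 0.5}
TRANS = {"N": {"N": 0.9, "E": 0.1}, "E": {"N": 0.1, "E": 0.9}}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich background
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich "coding" state
}


def viterbi(seq):
    """Return the most probable state path for seq, one state letter per base."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


seq = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
print(seq)
print(viterbi(seq))
```

A realistic genefinder would add intron, UTR and signal states, model exon length and reading frame, and estimate the parameters from annotated genes.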

14
GENSCAN (Burge & Karlin)
15
a simple genefinder
16
Splice sites can be an important signal
17
Regular expressions can be limiting
5' splice junction in eukaryotes: (C/A)AG GT(A/G)AGT
3' splice junction: (T/C)11 N (T/C)AG C
Most protein binding sites are characterized by
some degree of sequence specificity, but
seeking a consensus sequence is often an
inadequate way to recognize sites.
Position-specific distributions came to represent
the variability in motif composition.
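A short sketch of why a consensus pattern is limiting, assuming the 5' splice junction consensus (C/A)AG GT(A/G)AGT written as a Python regular expression. The two test sequences are made up for illustration: the second differs from the consensus at a single position and is rejected outright, even though real donor sites often deviate from the consensus at one or more positions.

```python
import re

# 5' splice junction consensus (C/A)AG|GT(A/G)AGT as a regular expression.
DONOR_CONSENSUS = re.compile(r"[CA]AGGT[AG]AGT")


def has_consensus_donor(seq):
    """True if seq contains an exact match to the consensus pattern."""
    return DONOR_CONSENSUS.search(seq) is not None


print(has_consensus_donor("TTCAGGTAAGTCC"))  # exact consensus site
print(has_consensus_donor("TTCAGGTACGTCC"))  # one mismatch: no partial credit
```

A regular expression gives a yes/no answer, whereas a position-specific distribution can score how close a candidate is to the motif.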
18
Position-specific scoring matrix (PSSM)
Pos   -3    -2    -1    1     2     3     4     5     6
A     0.3   0.6   0.1   0.0   0.0   0.4   0.7   0.1   0.1
C     0.4   0.1   0.0   0.0   0.0   0.1   0.1   0.1   0.2
G     0.2   0.2   0.8   1.0   0.0   0.4   0.1   0.8   0.2
T     0.1   0.1   0.1   0.0   1.0   0.1   0.1   0.0   0.5
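As a sketch of how such a matrix is used, the code below stores the frequencies from this slide (positions -3 to 6) and scores a 9-base candidate site under the independence assumption, multiplying one per-position frequency per base. The log-odds variant assumes a uniform background of 0.25 per base and a small pseudocount to avoid log(0); both are assumptions for illustration, not values from the slide.

```python
import math

# Frequencies from the slide; tuple index 0..8 maps to positions -3,-2,-1,1,...,6.
PSSM = {
    "A": (0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1),
    "C": (0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2),
    "G": (0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2),
    "T": (0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5),
}
SITE_LENGTH = 9
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 0.01  # assumed smoothing so zero entries do not give log(0)


def site_probability(site):
    """Probability of the site under the matrix (positions independent)."""
    if len(site) != SITE_LENGTH:
        raise ValueError("site must be 9 bases long")
    prob = 1.0
    for i, base in enumerate(site):
        prob *= PSSM[base][i]
    return prob


def log_odds(site):
    """Sum over positions of log2(smoothed frequency / background)."""
    if len(site) != SITE_LENGTH:
        raise ValueError("site must be 9 bases long")
    score = 0.0
    for i, base in enumerate(site):
        smoothed = (PSSM[base][i] + PSEUDOCOUNT) / (1 + 4 * PSEUDOCOUNT)
        score += math.log2(smoothed / BACKGROUND)
    return score


for site in ("CAGGTAAGT", "AAGGTGAGT", "CAGGCAAGT"):
    print(site, site_probability(site), round(log_odds(site), 2))
```

Unlike a consensus match, the score degrades gradually: a site with a less common base at a variable position still scores well, while one that breaks the invariant GT at positions 1 and 2 is heavily penalized.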
19
Ok, so we got the genes

molecular biology (transcription, splicing)
signals are modeled as states (HMM) or
separately, i.e. PSSMs

Here's another catch: there isn't just one
version of each gene,
but sometimes several.

20
E.g. alternative splicing: CD44
Human chromosome 11p
Zhu et al., Science (2003)
21
Alternative splicing

is a major determinant of protein diversity
(Lander 2001, Zavolan 2003)
30-50% of human diseases involve alt. splicing

22
Defining constitutive and alternative exons
Constitutive exon
Skipped exon
3' alternative exon
5' alternative exon
Intron retention
Mutually exclusive exons
23
Conserved alternative, skipped exon - FXR1
Fragile X Related Gene, FXR1
24
Another example of genes containing CSE: DMWD
Myotonic Dystrophy-containing WD Repeat, DMWD
25
Predicting new alternatively spliced exons

The problem is ill-posed
High-dimensional space
Not overfit data
Simple feature selection
Unbalanced data set sizes
Labels are more flexible

26
Eg. of experimentally validated
27
Biological sequence space challenges

Models that represent as much of the biology as
possible.
Biologically motivated features are important
Validating attributes
Conservation of events is key in computational
biology
Higher-level consistency with known biology
Experimental validation of predictions is
essential

28
Framework/Issues

Build models around known biology
In the process, extend knowledge about known
biology
Predict new examples
Validate predictions by
prediction accuracy
experimental validation
higher-level traits of predictions
conservation in other genomes

29
Modeling higher order interactions: Yeast Phe tRNA
If time permits
Secondary Structure
Tertiary Structure
30
The Hammerhead Ribozyme
Secondary structure
Tertiary structure
31
One example on how to model and predict RNA 2°
Structure
Covariation (using comparative genomics)
Seq1  A C G A A A G U
Seq2  U A G U A A U A
Seq3  A G G U G A C U
Seq4  C G G C A A U G
Seq5  G U G G G A A C
32
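Mutual information statistic for a pair of columns i, j
in a multiple alignment:
$$M_{ij} = \sum_{x,y} f_{xy}^{(i,j)} \log_2 \frac{f_{xy}^{(i,j)}}{f_x^{(i)} \, f_y^{(j)}}$$
$f_{xy}^{(i,j)}$: fraction of seqs w/ nt. x in col. i, nt. y
in col. j
$f_x^{(i)}$: fraction of seqs w/ nt. x in col. i
sum over x, y in {A, C, G, U}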
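A minimal sketch of this statistic, applied to the five-sequence alignment from the covariation slide. It assumes the usual mutual information definition (joint frequency times the log of joint over product of marginals, summed over observed nucleotide pairs) with log base 2, so the result is in bits; the function name is illustrative.

```python
import math
from collections import Counter

# Alignment from the covariation slide (Seq1..Seq5), one string per sequence.
ALIGNMENT = [
    "ACGAAAGU",
    "UAGUAAUA",
    "AGGUGACU",
    "CGGCAAUG",
    "GUGGGAAC",
]


def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j (0-based)."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)
    mi = 0.0
    # Pairs that never occur contribute 0, so only observed pairs are summed.
    for (x, y), count in pair_counts.items():
        f_xy = count / n
        f_x = i_counts[x] / n
        f_y = j_counts[y] / n
        mi += f_xy * math.log2(f_xy / (f_x * f_y))
    return mi


# Columns 1 and 8 always form Watson-Crick pairs (A-U, U-A, C-G, G-C).
print(round(mutual_information(ALIGNMENT, 0, 7), 3))
# Column 3 is invariant (always G), so it carries no covariation signal.
print(round(mutual_information(ALIGNMENT, 0, 2), 3))
```

With only five sequences the estimate is noisy; in practice many more aligned sequences are needed before high-scoring column pairs can be trusted as base pairs.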
33
Inferring 2° Structure from Covariation
34
Stochastic Context-Free Grammars (SCFGs)
A generalized model which is capable of
handling non-local dependencies between words in
a language (or bases in an RNA)
Ref: Durbin et al., Biological Sequence
Analysis, 1998
35
An SCFG Model of RNA 2o Structure
Production Rules
P -> aWb (pair)
L -> aW (left bulge/loop)
R -> Wa (right bulge/loop)
B -> SS (bifurcation)
S -> W (start)
E -> ε (end)
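To show how these productions generate a structured sequence, here is a small sketch that expands a hand-written derivation into an RNA string and its base pairs. It is an illustration only: it covers the P, L, R and E rules, leaves out bifurcation (B), attaches no probabilities to the rules, and the `derive` function and the step encoding are assumptions made for this example.

```python
def derive(steps):
    """Expand a linear derivation into (sequence, base_pairs).

    steps is a list of (rule, bases) tuples:
      ("P", "GC") applies P -> a W b with a='G', b='C' (a paired with b)
      ("L", "A")  applies L -> a W   (unpaired base on the 5' side)
      ("R", "A")  applies R -> W a   (unpaired base on the 3' side)
      ("E", "")   applies E -> epsilon and ends the derivation
    """
    left, right = [], []   # bases emitted on the 5' and 3' side of W
    pair_ranks = []        # (index in left, index in right) for each pair
    for rule, bases in steps:
        if rule == "P":
            a, b = bases
            pair_ranks.append((len(left), len(right)))
            left.append(a)
            right.append(b)
        elif rule == "L":
            left.append(bases)
        elif rule == "R":
            right.append(bases)
        elif rule == "E":
            break
        else:
            raise ValueError("unsupported rule: " + rule)
    # 3'-side bases were emitted outside-in, so reverse them.
    sequence = "".join(left) + "".join(reversed(right))
    n = len(sequence)
    pairs = [(i, n - 1 - k) for i, k in pair_ranks]
    return sequence, pairs


# A two-pair stem closed by a three-base loop.
hairpin = [("P", "GC"), ("P", "GC"), ("L", "A"), ("L", "A"), ("L", "A"), ("E", "")]
print(derive(hairpin))
```

Because each P step emits both partners of a pair at once, the grammar captures the non-local pairwise dependence that a WMM or HMM cannot.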
36
last page