Biological sequence analysis - PowerPoint PPT Presentation

1 / 40
About This Presentation
Title:

Biological sequence analysis

Description:

Biological sequence analysis Terry Speed Wald Lecture II, August 8, 2001 – PowerPoint PPT presentation

Number of Views:119
Avg rating:3.0/5.0
Slides: 41
Provided by: WEH75
Category:

less

Transcript and Presenter's Notes

Title: Biological sequence analysis


1
Biological sequence analysis
  • Terry Speed
  • Wald Lecture II,
  • August 8, 2001

2
The objects of our study
  • DNA, RNA and proteins macromolecules which
    are unbranched polymers built up from smaller
    units.
  • DNA units are the nucleotide residues A, C, G
    and T
  • RNA units are the nucleotide residues A, C,
    G and U
  • Proteins units are the amino acid residues A,
    C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T,
    V, W and Y.
  • To a considerable extent, the chemical
    properties of DNA, RNA and protein molecules are
    encoded in the linear sequence of these basic
    units their primary structure.

3
The use of statistics to study linear sequences
of biomolecular units
  • Can be descriptive, predictive or everything
    else in between..almost business as usual.
  • Stochastic mechanisms should never be taken
    literally, but nevertheless can be amazingly
    useful.
  • Care is always needed a model or method can
    break down at any time without notice.
  • Biological confirmation of predictions is
    almost always necessary.

4
The statistics of biological sequences can
be global or local
  • Base composition of genomes
  • E. coli 25 A, 25 C, 25 G, 25 T
  • P. falciparum 82AT
  • Translation initiation
  • ATG is the near universal motif indicating the
  • start of translation in DNA coding sequence.

5
From certainty to statistical models a brief
case study
1 ZNF Cys-Cys-His-His zinc finger DNA binding
domain
6
Cys-Cys-His-His zinc finger DNA binding domain
  • Its characteristic motif has regular
    expression
  • C-x(2,4)-C-x(3)-LIVMFYWC-x(8)-H-x(3,5)-H
  • 1ZNF XYKCGLCERSFVEKSALSRHQRVHKNX
  • Regular expressions can be limiting, and
    position-specific distributions came to represent
    the variability in motifs.

7
Cys-Cys-His-His profile sequence logo form
A sequence logo is a scaled position-specific
a.a.distribution. Scaling is by a measure of a
positions information content.
8
Representation of motifs the next steps
  • Missing from the position-specific
    distribution
  • representation of motifs are good ways of
    dealing with
  • Length distributions for insertions/deletions
  • Cross-position association of amino acids
  • Hidden Markov models help with the first.
    The second remains a hard unsolved problem.

9
Hidden Markov models
  • Processes (St,Ot), t1,, where St is the
    hidden
  • state and Ot the observation at time t, such
    that
  • pr(St St-1,Ot-1,St-2 ,Ot-2 ) pr(St
    St-1)
  • pr(Ot St-1,Ot-1,St-2 ,Ot-2 ) pr(Ot
    St, St-1)
  • The basics of HMMs were laid bare in a series
    of beautiful papers by L E Baum and colleagues
    around 1970, and their formulation has been used
    almost unchanged to this day.

10
Hidden Markov modelsextensions
  • Many variants are now used. For example, the
    distribution of O may not depend on previous S
    but on previous O values,
  • pr(Ot St , St-1 , Ot-1 ,.. ) pr(Ot
    St ), or
  • pr(Ot St , St-1 , Ot-1 ,.. ) pr(Ot
    St , St-1 ,Ot-1) .
  • Most importantly for us, the times of S and O
    may be decoupled, permitting the Observation
    corresponding to State time t to be a string
    whose length and composition depends on St (and
    possibly St-1 and part or all of the previous
    Observations). This is called a hidden
    semi-Markov or generalized hidden Markov model.

11
A simple HMM (Churchill, 1989)
O.O1
O.99
O.9
O.1
hidden states
.

.

.
.

.

.
observations
.

.

.
.

.

.
12
The algorithms
  • As the name suggests, with an HMM the series
    O (O1,O2 ,O3 ,., OT) is observed, while
    the states S (S1 ,S2 ,S3 ,., ST) are not.
  • There are elegant algorithms for calculating
    pr(O?), arg max? pr(O?) in certain special
    cases, and arg maxS pr(SO,?).
  • Here ? are the parameters of the model, e.g.
    transition and observation probabilities.

13

Some early applications of HMMs
finance, but we never saw them
speech recognition modelling
ion channels
In the mid-late 1980s HMMs entered genetics
and molecular biology, and they are now firmly
entrenched.
  • Some current applications of HMMs to biology
  • mapping chromosomes
  • aligning biological sequences
  • predicting sequence structure
  • inferring evolutionary
    relationships
  • finding genes in DNA sequence

14
HMMs representing coiled-coil domains
2TMA Tropomyosin
15
Coiled-coil domains, schematically
dimeric parallel helices, heptad repeats,
knobs-into-holes
16
(hydrophobic)
17
Designing the HMM, I
18
Designingthe HMM, 2
19
Designing the HMM, 3
20
HMM decoding
WGP ARQLNES VKDINKM LER HP BBB CCCCCCC CCCCCCC
CCC BB 000 abcdefg abcdefg abc 00 00c defgabc
defgabc def g0
Sequence Labels Path1 Path2
  • VITERBI decoding of all possible state-paths, we
    determine the maximum probability one given the
    amino acid sequence O
  • POSTERIOR decoding at each position, we
    determine the state with the highest probability
    given O.
  • Issue how to measure the strength of a potential
    CC domain, and how should this depend on the
    length of a domain?

21
CC-PROBABILITY PROFILE
Fusion protein of simian parainfluenza virus 5
22
Assessing performance terms
TP true positive a predicted fragment that
overlaps the annotated fragment (aa in the
annotated region) FP false positive a
predicted fragment does not overlap any
annotated fragment (aa not in the
annotated region) LS learning
set of sequences NTS negative test set
sequences with no CCD used for
estimating FP PTS positive test set used for
estimating TP Much care/effort required to create
these sets
23
Assessing performance study design
Study the variability of performance under
variation of the sequences used for
determining the model parameters. Compare
methods using the same set of
aa-frequencies / emission probabilities. Use
the same set of domains for learning and testing
instead of testing on different protein
families. Choose a number of FP-rates and
calculate the corresponding TP-rates (ROC curve).
24
PTS subdivided 150 times at random
(stratified) into 2/3 for learning and the rest
for testing.
LEARNING PHASE gt PARAMETERS
TESTING ON NTS gt THRESHOLDS
150 X
TESTING ON PTS gt TP-VALUES
ANALYSIS OF RESULTS
25
Assessing performance summaries
  • TP-rate at given FP-rate per family / per
    length-class
  • TP- and FP-rates for aas per family / per
    length-class
  • Accuracy at the borders and of length prediction

26
Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
27
What is a (protein-coding) gene?
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
28
What is a gene, ctd?
  • In general the transcribed sequence is longer
    than the translated portion parts called introns
    (intervening sequence) are removed, leaving exons
    (expressed sequence), and yet other regions
    remain untranslated. The translated sequence
    comes in triples called codons, beginning and
    ending with a unique start (ATG) and one of three
    stop (TAA, TAG, TGA) codons.
  • There are also characteristic intron-exon
    boundaries called splice donor and acceptor
    sites, and a variety of other motifs promoters,
    transcription start sites, polyA sites,branching
    sites, and so on.
  • All of the foregoing have statistical
    characterizations.

29
In more detail (color state)
30
Some facts about human genes
  • Comprise about 3 of the genome
  • Average gene length 8,000 bp
  • Average of 5-6 exons/gene
  • Average exon length 200 bp
  • Average intron length 2,000 bp
  • 8 genes have a single exon
  • Some exons can be as small as 1 or 3 bp.
  • HUMFMR1S is not atypical 17 exons 40-60 bp long,
    comprising 3 of a 67,000 bp gene

31
The idea behind a GHMM genefinder
  • States represent standard gene features
    intergenic region, exon, intron, perhaps more
    (promotor, 5UTR, 3UTR, Poly-A,..).
  • Observations embody state-dependent base
    composition, dependence, and signal features.
  • In a GHMM, duration must be included as well.
  • Finally, reading frames and both strands must be
    dealt with.

32
Half a model for a genefinder
33
Splice sites can be included in the exons
34
Beyond position-specific distributions
  • The bases in splice sites exhibit dependence, and
    not simply of the nearest neighbor kind.
  • High-order (non-stationary) Markov models would
    be one option, but the number of parameters in
    relation to the amount of data rules them out.
    The class of variable length Markov models
    (VLMMs) deriving from early research by Rissanen
    prove to be valuable in this context. However,
    there is likely to be room for more research
    here.

35
GENSCAN (Burge Karlin)
36
Remark
  • In general the problem of identifying
    (annotating) human genes is considerably harder
    than ß-globin might suggest.
  • The human factor VIII gene (whose mutations
    cause hemophilia A) is spread over 186,000 bp.
    It consists of 26 exons ranging in size from 69
    to 3,106 bp, and its 25 introns range in size
    from 207 to 32,400 bp. The complete gene is thus
    9 kb of exon and 177 kb of intron.
  • The biggest human gene yet is for dystrophin.
    It has gt 30 exons and is spread over 2.4
    million bp.

37
Challenges in the analysis of sequence data
  • Understanding the biology well enough to
    begin.
  • Designing HMM architecture, e.g. in Marcoil
    for coiled-coils.
  • Modelling the parts, e.g. VLMMs for splice
    sites.
  • Coding software engineering, is the hardest
    and most important task of all making it all
    work.
  • Obtaining good data sets for use in careful
    evaluation and comparison with competing
    algorithms designing the studies.
  • Opportunities for methodological research.

38
Topics not mentioned include
  • Molecular evolution, including phylogenetic
    inference (building trees from aligned sequence
    data)
  • Sequence alignment (pairwise, multiple),
    including use of Gibbs sampler
  • Stochastic context-free grammar models and the
    analysis of RNA sequence data.

39
Acknowledgements
  • Mauro Delorenzi (WEHI)
  • Simon Cawley (Affymetrix)
  • Tony Wirth (CS, Princeton)
  • Lior Pachter (Math, UCB)
  • Marina Alexandersson (Stat, UCB)

40
References
  • Biological Sequence Analysis
  • R Durbin, S Eddy, A Krogh and G Mitchison
  • Cambridge University Press, 1998.
  • Bioinformatics The machine learning approach
  • P Baldi and S Brunak
  • The MIT Press, 1998
  • Post-Genome Informatics
  • M Kanehisa
  • Oxford University Press, 2000
Write a Comment
User Comments (0)
About PowerShow.com