Title: Algorithms for Finding Genes
1. Algorithms for Finding Genes
- Rhys Price Jones
- Anne R. Haake
2. There's something about Genes
- The human genome consists of about 3 billion base pairs
- Some of it can be transcribed and translated to produce proteins
- Most of it cannot
- What is it that makes some sequences of DNA encode protein?
3. The Genetic Code
- Pick a spot in your sequence
- Group nucleotides from that point on into triplets (a codon-scanning sketch follows this list)
- Each triplet encodes an amino acid, a stop, or a start
- TGA, TAG, TAA code for stop
- But there are exceptions!
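- A minimal sketch of that grouping, assuming the sequence is held as a Scheme string; codons and stop-codon? are illustrative names, and only the TGA/TAG/TAA rule comes from the slide:
  ;; Split a DNA string into triplets, starting at position start
  ;; (any incomplete trailing triplet is dropped).
  (define codons
    (lambda (dna start)
      (if (< (- (string-length dna) start) 3)
          '()
          (cons (substring dna start (+ start 3))
                (codons dna (+ start 3))))))
  ;; TGA, TAG and TAA signal "stop" in the standard code.
  (define stop-codon?
    (lambda (codon)
      (if (member codon '("TGA" "TAG" "TAA")) #t #f)))
  ;; Example: (codons "ATGGCTTGA" 0)                    => ("ATG" "GCT" "TGA")
  ;;          (map stop-codon? (codons "ATGGCTTGA" 0))  => (#f #f #t)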
4. Complications
- In the 1960s it was thought that a gene was a linear structure of nucleotides directly linked in triplets to amino acids in the resulting protein
- This view did not last long, with the discovery in the late 1960s of genes within genes and overlapping genes
- In 1977, split genes were discovered
5. Introns and Exons
- Most eukaryotic genes are interrupted by non-coding sections and broken into pieces called exons. The interrupting sequences are called introns
- Some researchers have used a biological approach and searched for splicing sites at intron-exon junctions
- Catalogs of splice sites were created in the 1980s
- But unreliable
6. Can you tell if it's a gene?
- What is it about some sections of DNA that makes them encode proteins?
- Compare
- TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTGCATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCACC
- GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACCCGCCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTCGAGA
- Is there anything that distinguishes the second from the first?
- We'll return to this issue
7. Before we embark on the biological problem
- We will look at an easier analogy
- Sometimes called the Crooked Casino
- Basic idea
- sequence of values generated by one or more processes
- from the values, try to deduce which process produced which parts of the sequence
8. A useful analogy
- The loaded die
- I rolled a die 80 times and got 23636626516113441416663242646331422242145353556233136631321454662631116546622542
- Some of the time it was a fair die
- Some of the time the die produced 6s half the time and 1-5 randomly the rest of the time
9. Confession
- (append (fair 20) (loaded 10) (fair 20) (loaded 10) (fair 20))
- ;; fair: each of 1-6 equally likely
  (define fair
    (lambda (n)
      (if (zero? n)
          '()
          (cons (+ 1 (random 6)) (fair (- n 1))))))
- ;; loaded: 6 half the time, otherwise 1-5 at random
  (define loaded
    (lambda (n)
      (if (zero? n)
          '()
          (cons (if (< (random 2) 1) 6 (+ 1 (random 5)))
                (loaded (- n 1))))))
- 23636626516113441416663242646331422242145353556233136631321454662631116546622542
10. There are no computer algorithms to recognize genes reliably
- Statistical approaches
- Similarity-based approaches
- Others
11. Similarity Methods
- Searching sequences for regions that have the "look and feel of a gene" (Dan Gusfield)
- An early step is the investigation of ORFs (open reading frames)
- (in eukaryotes) look for possible splicing sites and try to assemble exons
- Combine sequence comparison and database search, seeking similar genes
12. Statistical Approaches
- Look for features that appear frequently in genes but not elsewhere
- As for the loaded die sequence above, that is not 100% reliable
- But then, what is? This is Biology, and it's had billions of years to figure out how to outfox us!
- "If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention." (R. Feynman)
13. Relative Frequency of Stop Codons
- 3 of the 64 possible nucleotide triplets are stop codons
- We would thus expect stop codons to form approximately 5% of any stretch of random DNA, giving an average distance between stop codons of about 20 codons
- Very long stretches of triplets containing no stop codons (long ORFs) might, therefore, indicate gene possibilities (a sketch of this heuristic follows this list)
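- A sketch of the long-ORF heuristic, reusing codons and stop-codon? from the genetic-code slide; flagging unusually long stop-free runs is the idea from the bullet above, while the particular cutoff in the comment is only an illustration:
  ;; Longest run of consecutive codons containing no stop codon.
  (define longest-open-run
    (lambda (codon-list)
      (let loop ((cs codon-list) (current 0) (best 0))
        (cond ((null? cs) (max current best))
              ((stop-codon? (car cs)) (loop (cdr cs) 0 (max current best)))
              (else (loop (cdr cs) (+ current 1) best))))))
  ;; In random DNA a stop turns up roughly every 64/3 (about 21) codons,
  ;; so a run far longer than that, e.g.
  ;;   (> (longest-open-run (codons dna 0)) 100)
  ;; hints that the region may be coding.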
14. Rather like looking for regions where my die was fair
- By seeking a shortage of 6s
- Similarly, there is (organism-specific) codon bias of various kinds, and this can be exploited by gene-finders
- People and programs have looked for such bias features as
- regularity in codon frequencies (a tallying sketch follows this list)
- regularity in bicodon frequencies
- periodicity
- homogeneity vs. complexity
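- One way to get at codon-frequency regularity, sketched with a plain association list; tally-codons is an illustrative name, not code from the course:
  ;; Count how often each codon occurs, as an alist of (codon . count)
  ;; pairs, using codons from the earlier sketch.
  (define tally-codons
    (lambda (codon-list)
      (let loop ((cs codon-list) (counts '()))
        (if (null? cs)
            counts
            (let ((entry (assoc (car cs) counts)))
              (if entry
                  (begin (set-cdr! entry (+ 1 (cdr entry)))
                         (loop (cdr cs) counts))
                  (loop (cdr cs) (cons (cons (car cs) 1) counts))))))))
  ;; Comparing such tallies for known genes against tallies for random
  ;; DNA is one crude way of measuring codon bias.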
15. CpG Islands
- CG dinucleotides are often written CpG to avoid confusion with the base pair C-G
- In the human genome CpG is a rarer bird than would be expected in purely random sequences (there are chemical reasons for this involving methylation)
- In the start regions of many genes, however, the methylation process is suppressed, and CpG dinucleotides appear more frequently than elsewhere (a counting sketch follows this list)
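- A sketch of how that excess can be measured, assuming the sequence is a Scheme string; count-char, count-cpg and cpg-ratio are illustrative names, and the observed/expected ratio is one common way of putting a number on "more frequently than elsewhere":
  ;; Occurrences of a single base in a string.
  (define count-char
    (lambda (ch str)
      (let loop ((i 0) (n 0))
        (if (= i (string-length str))
            n
            (loop (+ i 1)
                  (if (char=? (string-ref str i) ch) (+ n 1) n))))))
  ;; Occurrences of C immediately followed by G (the CpG dinucleotide).
  (define count-cpg
    (lambda (str)
      (let loop ((i 0) (n 0))
        (if (>= i (- (string-length str) 1))
            n
            (loop (+ i 1)
                  (if (and (char=? (string-ref str i) #\C)
                           (char=? (string-ref str (+ i 1)) #\G))
                      (+ n 1)
                      n))))))
  ;; Observed CpG count divided by the count expected from the C and G
  ;; content alone; this ratio is noticeably higher inside CpG islands.
  (define cpg-ratio
    (lambda (str)
      (/ (* (count-cpg str) (string-length str))
         (* 1.0 (max 1 (* (count-char #\C str) (count-char #\G str)))))))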
16. Rather like looking for stretches where my die was loaded
- By seeking a plethora of 6s
- But now we'll look for high incidences of CpG
17. How can we detect shortage or plethora of CpGs?
- Markov Models
- Assume that the next state depends only on the previous state
- If the states are x1, x2, ..., xn
- We have a matrix of probabilities pij
- pij is the probability that state xj will follow xi
18. Transition matrices
- Each row gives the probabilities (in %) of the next nucleotide, given the current one
- CpG islands:
        A    C    G    T
   A   18   27   43   12
   C   17   37   27   19
   G   16   34   37   13
   T    8   35   38   18
- elsewhere:
        A    C    G    T
   A   30   20   28   21
   C   32   30    8   30
   G   25   25   30   20
   T   18   24   29   29
- (adapted from Durbin et al., 1998)
19. How do we get those numbers?
- empirically
- lots of hard wet and dry lab work
- We can think of these numbers as parameters of our model
- later we'll pay more attention to model parameters (a counting sketch follows this list)
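- The counting behind such parameters can be sketched as follows: given a training string known to lie inside (or outside) CpG islands, tally how often each base follows each base; dividing each count by its row total then estimates the transition probabilities in the tables above. transition-counts is an illustrative name, not the actual lab or course procedure:
  ;; For one training string, count how often base b follows base a.
  ;; Result: an alist of ((a . b) . count) pairs.
  (define transition-counts
    (lambda (str)
      (let loop ((i 0) (counts '()))
        (if (>= i (- (string-length str) 1))
            counts
            (let* ((key (cons (string-ref str i) (string-ref str (+ i 1))))
                   (entry (assoc key counts)))
              (if entry
                  (begin (set-cdr! entry (+ 1 (cdr entry)))
                         (loop (+ i 1) counts))
                  (loop (+ i 1) (cons (cons key 1) counts))))))))
  ;; Example: (transition-counts "ACGCG") counts A->C once, C->G twice, G->C once.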
20. For our loaded die example we have
- 0 means non-6
- loaded:
        0    6
   0   50   50
   6   50   50
- unloaded:
        0    6
   0   83   17
   6   83   17
21. Discrimination
- Given a sequence x we can calculate a log-odds score that it was generated by one of two models:
- S(x) = log(P(x | model A) / P(x | model B)) = sum over i of log((prob of the transition xi-1 to xi in A) / (prob of xi-1 to xi in B))
- Demonstration (here A is the fair model and B the loaded one):
  (define beta
    (lambda (x y)
      (cond ((and (= x 6) (< y 6)) (log (/ .83 .5)))   ; log(P(non-6 | fair) / P(non-6 | loaded))
            ((and (= x 6) (= y 6)) (log (/ .17 .5)))   ; log(P(6 | fair) / P(6 | loaded))
            ((and (< x 6) (< y 6)) (log (/ .83 .5)))
            ((and (< x 6) (= y 6)) (log (/ .17 .5))))))
  (define s
    (lambda (l)
      (cond ((null? (cdr l)) 0)
            (else (+ (beta (car l) (cadr l))
                     (s (cdr l)))))))
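- With fair and loaded from the confession slide, a quick sanity check of the sign convention (since beta puts the fair-model probability in the numerator, positive scores favour the fair die, negative scores the loaded one):
  (s (fair 30))     ; usually positive
  (s (loaded 30))   ; usually negative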
22. Similar Analysis for CpG Islands
- Visit the code (a sketch of what a scorecpg-style scorer might look like follows this list)
- (show mary)
- (show gene)
- (scorecpg mary)
- (scorecpg gene)
- Sliding Window
- (slidewindow 20 mary)
- (show (classify mary))
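- The course code itself is not reproduced here, but a scorecpg-style scorer could look like the sketch below: walk a string and sum the log-odds of each transition under the two matrices from the Transition matrices slide. The tables are transcribed from that slide; the names score-cpg, cpg-prob, bg-prob and lookup, and the string representation, are assumptions:
  ;; Transition probabilities (as fractions) inside CpG islands and
  ;; elsewhere, stored as ((from . to) . prob) alists.
  (define cpg-prob
    '(((#\A . #\A) . .18) ((#\A . #\C) . .27) ((#\A . #\G) . .43) ((#\A . #\T) . .12)
      ((#\C . #\A) . .17) ((#\C . #\C) . .37) ((#\C . #\G) . .27) ((#\C . #\T) . .19)
      ((#\G . #\A) . .16) ((#\G . #\C) . .34) ((#\G . #\G) . .37) ((#\G . #\T) . .13)
      ((#\T . #\A) . .08) ((#\T . #\C) . .35) ((#\T . #\G) . .38) ((#\T . #\T) . .18)))
  (define bg-prob
    '(((#\A . #\A) . .30) ((#\A . #\C) . .20) ((#\A . #\G) . .28) ((#\A . #\T) . .21)
      ((#\C . #\A) . .32) ((#\C . #\C) . .30) ((#\C . #\G) . .08) ((#\C . #\T) . .30)
      ((#\G . #\A) . .25) ((#\G . #\C) . .25) ((#\G . #\G) . .30) ((#\G . #\T) . .20)
      ((#\T . #\A) . .18) ((#\T . #\C) . .24) ((#\T . #\G) . .29) ((#\T . #\T) . .29)))
  (define lookup
    (lambda (from to table)
      (cdr (assoc (cons from to) table))))
  ;; Sum of log((prob in islands) / (prob elsewhere)) over the string:
  ;; positive totals suggest a CpG island, negative ones background DNA.
  (define score-cpg
    (lambda (str)
      (let loop ((i 0) (total 0))
        (if (>= i (- (string-length str) 1))
            total
            (loop (+ i 1)
                  (+ total
                     (log (/ (lookup (string-ref str i) (string-ref str (+ i 1)) cpg-prob)
                             (lookup (string-ref str i) (string-ref str (+ i 1)) bg-prob)))))))))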
23. Hidden Markov Models
- The Markov model we have just seen could be used for locating CpG islands
- For example, you could calculate the log-odds score of a moving window of, say, 50 nucleotides
- Hidden Markov Models try to combine the two models in a single model
- For each state, say x, in the Markov model, we imagine two states xA and xB for the two models A and B
- Both new states emit the same character
- But we don't know whether it was CA or CB that emitted the nucleotide C
24. HMMs
- Hidden Markov Models are so prevalent in bioinformatics, we'll just refer to them as HMMs
- Given
- a sequence of emitted signals -- the observed sequence
- and a matrix of transition probabilities and emission probabilities
- For a path of hidden states, calculate the conditional probability of the sequence given the path
- Then try to find an optimal path for the sequence such that the conditional probability is maximized (a probability sketch follows this list)
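- A sketch of the probability calculation for one particular path: multiply, position by position, the probability of moving into each hidden state and the probability that the state emits the observed symbol. The representation (states and symbols as lists, trans and emit as two-argument functions, a begin-state marker) is an assumption for illustration:
  ;; Joint probability of an observed sequence and one path of hidden
  ;; states, given transition and emission probability functions.
  (define path-probability
    (lambda (states symbols trans emit begin-state)
      (let loop ((prev begin-state) (sts states) (syms symbols) (p 1))
        (if (null? sts)
            p
            (loop (car sts) (cdr sts) (cdr syms)
                  (* p
                     (trans prev (car sts))
                     (emit (car sts) (car syms))))))))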
25. HMM combines Markov Models
- [Diagram: the four CpG-island states A, C, G, T and the four non-island states A-, C-, G-, T-, each labelled with the transition probabilities from the two matrices above]
26. Must be able to move between models
- Figure out these transition probabilities
- [Diagram: the same eight states A, C, G, T and A-, C-, G-, T-, now with transitions between the two models]
27. Add a begin and an end state
- and figure even more transition probabilities
- [Diagram: the eight emitting states plus a begin state B and an end state E]
28. Basic Idea
- Each state emits a nucleotide (except B and E)
- A and A- both emit A
- C and C- both emit C
- You know the sequence of emissions
- You don't know the exact sequence of states
- Was it A C C G- that produced ACCG?
- Or was it A- C- C G-?
- You want to find the most likely sequence of states (a brute-force sketch follows this list)
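- For a short emitted sequence you can simply try every assignment of hidden states and keep the one with the highest joint probability, as in the sketch below (it reuses path-probability from the HMMs slide). Real gene finders answer the same question efficiently with the Viterbi dynamic-programming algorithm rather than brute force; all-paths and best-path are illustrative names:
  ;; Every way of choosing one hidden state per observed symbol.
  (define all-paths
    (lambda (state-choices n)
      (if (zero? n)
          '(())
          (apply append
                 (map (lambda (st)
                        (map (lambda (rest) (cons st rest))
                             (all-paths state-choices (- n 1))))
                      state-choices)))))
  ;; Brute-force most likely path for a short sequence of symbols.
  (define best-path
    (lambda (symbols state-choices trans emit begin-state)
      (let loop ((paths (all-paths state-choices (length symbols)))
                 (best '())
                 (best-p -1))
        (if (null? paths)
            best
            (let ((p (path-probability (car paths) symbols trans emit begin-state)))
              (if (> p best-p)
                  (loop (cdr paths) (car paths) p)
                  (loop (cdr paths) best best-p)))))))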
29. Parameters of the HMM
- How do we know the transition probabilities between hidden states?
- empirical, hard lab work
- as above for the CpG island transition probabilities
- imperfect data will lead to an imperfect model
- Is there an alternative?
30. Machine Learning
- Grew out of other disciplines, including statistical model fitting
- Tries to automate the process as much as possible
- use very flexible models
- lots of parameters
- let the program figure them out
- Based on ...
- known data and properties
- training sets
- Induction and inference
31. Bayesian Modeling
- Understand the hypotheses (model)
- Assign prior probabilities to the hypotheses
- Use Bayes' theorem (and the rest of the probability calculus) to evaluate posterior probabilities for the hypotheses in light of actual data (a worked sketch follows this list)
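- A small worked version of that last step, with made-up numbers purely for illustration: a prior probability that a window is a CpG island, a likelihood of the observed data under each hypothesis, and the posterior that Bayes' theorem gives back:
  ;; posterior = prior * likelihood / total probability of the data
  (define posterior
    (lambda (prior likelihood-h likelihood-not-h)
      (/ (* prior likelihood-h)
         (+ (* prior likelihood-h)
            (* (- 1 prior) likelihood-not-h)))))
  ;; Made-up numbers: prior 0.1 that the window is an island, and the
  ;; observed window is 4 times as likely under the island model as
  ;; under the background model.
  ;; (posterior 0.1 0.04 0.01)  =>  about 0.31 (i.e. 4/13)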
32. Neural Networks
- Very flexible
- Inputs
- Thresholds
- Weights
- Output
- Variations
- additional layers
- loops
33. Training Neural Networks
- Need a set of inputs and desired outputs
- Run the network with a given input
- Compare to the desired output
- Adjust weights to improve
- Repeat... (a single-neuron sketch follows this list)
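- A minimal sketch of that loop for a single neuron with a threshold (a perceptron-style weight update); the training pairs, learning rate and update rule here are generic illustrations, not any of the gene-finding networks named later:
  ;; Output of one neuron: 1 if the weighted sum of the inputs exceeds
  ;; the threshold, else 0.
  (define neuron-output
    (lambda (weights inputs threshold)
      (if (> (apply + (map * weights inputs)) threshold) 1 0)))
  ;; One pass over a training set of (inputs . desired-output) pairs:
  ;; compare the actual output with the desired one and nudge each
  ;; weight in the direction that reduces the error.
  (define train-once
    (lambda (weights pairs threshold rate)
      (if (null? pairs)
          weights
          (let* ((inputs (car (car pairs)))
                 (desired (cdr (car pairs)))
                 (err (- desired (neuron-output weights inputs threshold))))
            (train-once
             (map (lambda (w x) (+ w (* rate err x))) weights inputs)
             (cdr pairs) threshold rate)))))
  ;; Calling train-once again and again is the "Repeat..." step above.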
34. Training HMMs
- Similar idea
- Begin with guesses for the parameters of the HMM
- Run it on your training set, and see how well it predicts whatever you designed it for
- Use the results to adjust the parameters
- Repeat...
35. Hope that
- When you provide brand new inputs
- The trained network will produce useful outputs
36. Some gene finder programs
- GRAIL -- neural net discriminant
- Genie -- HMM
- GENSCAN -- semi-Markov model
- GeneParser -- neural net
- FGENEH etc. (Baylor) -- classification by Linear Discriminant Analysis (LDA), a well-known statistical technique
- GeneID -- rule-based