Title: Algorithms for Finding Genes
1. Algorithms for Finding Genes
- Rhys Price Jones
- Anne R. Haake
2. There's something about Genes
- The human genome consists of about 3 billion base pairs
- Some of it can be transcribed and translated to produce proteins
- Most of it cannot
- What is it that makes some sequences of DNA encode protein?
3. The Genetic Code
- Pick a spot in your sequence
- Group nucleotides from that point on into triplets (a codon-scanning sketch follows this list)
- Each triplet encodes an amino acid, a stop, or a start
- TGA, TAG, TAA code for stop
- But there are exceptions!
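- A minimal sketch of that grouping, assuming the sequence is held as a Scheme string; codons and stop-codon? are illustrative names, and only the TGA/TAG/TAA rule comes from the slide:
  ;; Split a DNA string into triplets, starting at position start
  ;; (any incomplete trailing triplet is dropped).
  (define codons
    (lambda (dna start)
      (if (< (- (string-length dna) start) 3)
          '()
          (cons (substring dna start (+ start 3))
                (codons dna (+ start 3))))))
  ;; TGA, TAG and TAA signal "stop" in the standard code.
  (define stop-codon?
    (lambda (codon)
      (if (member codon '("TGA" "TAG" "TAA")) #t #f)))
  ;; Example: (codons "ATGGCTTGA" 0)                    => ("ATG" "GCT" "TGA")
  ;;          (map stop-codon? (codons "ATGGCTTGA" 0))  => (#f #f #t)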
4. Complications
- In the 1960s it was thought that a gene was a linear structure of nucleotides directly linked in triplets to amino acids in the resulting protein
- This view did not last long, with the discovery in the late 1960s of genes within genes and overlapping genes
- In 1977, split genes were discovered
5. Introns and Exons
- Most eukaryotic genes are interrupted by non-coding sections and broken into pieces called exons. The interrupting sequences are called introns
- Some researchers have used a biological approach and searched for splicing sites at intron-exon junctions
- Catalogs of splice sites were created in the 1980s
- But unreliable
6. Can you tell if it's a gene?
- What is it about some sections of DNA that makes them encode proteins?
- Compare
- TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTGCATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCACC
- GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACCCGCCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTCGAGA
- Is there anything that distinguishes the second from the first?
- We'll return to this issue
7. Before we embark on the biological problem
- We will look at an easier analogy
- Sometimes called the Crooked Casino
- Basic idea
- sequence of values generated by one or more processes
- from the values, try to deduce which process produced which parts of the sequence
8. A useful analogy
- The loaded die
- I rolled a die 80 times and got 23636626516113441416663242646331422242145353556233136631321454662631116546622542
- Some of the time it was a fair die
- Some of the time the die produced 6s half the time and 1-5 randomly the rest of the time
9. Confession
- (append (fair 20) (loaded 10) (fair 20) (loaded 10) (fair 20))
- ;; fair: each of 1-6 equally likely
  (define fair
    (lambda (n)
      (if (zero? n)
          '()
          (cons (+ 1 (random 6)) (fair (- n 1))))))
- ;; loaded: 6 half the time, otherwise 1-5 at random
  (define loaded
    (lambda (n)
      (if (zero? n)
          '()
          (cons (if (< (random 2) 1) 6 (+ 1 (random 5)))
                (loaded (- n 1))))))
- 23636626516113441416663242646331422242145353556233136631321454662631116546622542
10. There are no computer algorithms to recognize genes reliably
- Statistical approaches
- Similarity-based approaches
- Others
11. Similarity Methods
- Searching sequences for regions that have the "look and feel of a gene" (Dan Gusfield)
- An early step is the investigation of ORFs (open reading frames)
- (in eukaryotes) look for possible splicing sites and try to assemble exons
- Combine sequence comparison and database search, seeking similar genes
12. Statistical Approaches
- Look for features that appear frequently in genes but not elsewhere
- As for the loaded die sequence above, that is not 100% reliable
- But then, what is? This is Biology, and it's had billions of years to figure out how to outfox us!
- "If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention." (R. Feynman)
13. Relative Frequency of Stop Codons
- 3 of the 64 possible nucleotide triplets are stop codons
- We would thus expect stop codons to form approximately 5% of any stretch of random DNA, giving an average distance between stop codons of about 20 codons
- Very long stretches of triplets containing no stop codons (long ORFs) might, therefore, indicate gene possibilities (a sketch of this heuristic follows this list)
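- A sketch of the long-ORF heuristic, reusing codons and stop-codon? from the genetic-code slide; flagging unusually long stop-free runs is the idea from the bullet above, while the particular cutoff in the comment is only an illustration:
  ;; Longest run of consecutive codons containing no stop codon.
  (define longest-open-run
    (lambda (codon-list)
      (let loop ((cs codon-list) (current 0) (best 0))
        (cond ((null? cs) (max current best))
              ((stop-codon? (car cs)) (loop (cdr cs) 0 (max current best)))
              (else (loop (cdr cs) (+ current 1) best))))))
  ;; In random DNA a stop turns up roughly every 64/3 (about 21) codons,
  ;; so a run far longer than that, e.g.
  ;;   (> (longest-open-run (codons dna 0)) 100)
  ;; hints that the region may be coding.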
14. Rather like looking for regions where my die was fair
- By seeking a shortage of 6s
- Similarly, there is (organism-specific) codon bias of various kinds, and this can be exploited by gene-finders
- People and programs have looked for such bias features as
- regularity in codon frequencies (a tallying sketch follows this list)
- regularity in bicodon frequencies
- periodicity
- homogeneity vs. complexity
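- One way to get at codon-frequency regularity, sketched with a plain association list; tally-codons is an illustrative name, not code from the course:
  ;; Count how often each codon occurs, as an alist of (codon . count)
  ;; pairs, using codons from the earlier sketch.
  (define tally-codons
    (lambda (codon-list)
      (let loop ((cs codon-list) (counts '()))
        (if (null? cs)
            counts
            (let ((entry (assoc (car cs) counts)))
              (if entry
                  (begin (set-cdr! entry (+ 1 (cdr entry)))
                         (loop (cdr cs) counts))
                  (loop (cdr cs) (cons (cons (car cs) 1) counts))))))))
  ;; Comparing such tallies for known genes against tallies for random
  ;; DNA is one crude way of measuring codon bias.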
15. CpG Islands
- CG dinucleotides are often written CpG to avoid confusion with the base pair C-G
- In the human genome CpG is a rarer bird than would be expected in purely random sequences (there are chemical reasons for this involving methylation)
- In the start regions of many genes, however, the methylation process is suppressed, and CpG dinucleotides appear more frequently than elsewhere (a counting sketch follows this list)
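- A sketch of how that excess can be measured, assuming the sequence is a Scheme string; count-char, count-cpg and cpg-ratio are illustrative names, and the observed/expected ratio is one common way of putting a number on "more frequently than elsewhere":
  ;; Occurrences of a single base in a string.
  (define count-char
    (lambda (ch str)
      (let loop ((i 0) (n 0))
        (if (= i (string-length str))
            n
            (loop (+ i 1)
                  (if (char=? (string-ref str i) ch) (+ n 1) n))))))
  ;; Occurrences of C immediately followed by G (the CpG dinucleotide).
  (define count-cpg
    (lambda (str)
      (let loop ((i 0) (n 0))
        (if (>= i (- (string-length str) 1))
            n
            (loop (+ i 1)
                  (if (and (char=? (string-ref str i) #\C)
                           (char=? (string-ref str (+ i 1)) #\G))
                      (+ n 1)
                      n))))))
  ;; Observed CpG count divided by the count expected from the C and G
  ;; content alone; this ratio is noticeably higher inside CpG islands.
  (define cpg-ratio
    (lambda (str)
      (/ (* (count-cpg str) (string-length str))
         (* 1.0 (max 1 (* (count-char #\C str) (count-char #\G str)))))))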
16. Rather like looking for stretches where my die was loaded
- By seeking a plethora of 6s
- But now we'll look for high incidences of CpG
17. How can we detect shortage or plethora of CpGs?
- Markov Models
- Assume that the next state depends only on the previous state
- If the states are x1, x2, ..., xn
- We have a matrix of probabilities pij
- pij is the probability that state xj will follow xi
18. Transition matrices
- Each row gives the probabilities (in %) of the next nucleotide, given the current one
- CpG islands:
        A    C    G    T
   A   18   27   43   12
   C   17   37   27   19
   G   16   34   37   13
   T    8   35   38   18
- elsewhere:
        A    C    G    T
   A   30   20   28   21
   C   32   30    8   30
   G   25   25   30   20
   T   18   24   29   29
- (adapted from Durbin et al., 1998)
19. How do we get those numbers?
- empirically
- lots of hard wet and dry lab work
- We can think of these numbers as parameters of our model
- later we'll pay more attention to model parameters (a counting sketch follows this list)
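- The counting behind such parameters can be sketched as follows: given a training string known to lie inside (or outside) CpG islands, tally how often each base follows each base; dividing each count by its row total then estimates the transition probabilities in the tables above. transition-counts is an illustrative name, not the actual lab or course procedure:
  ;; For one training string, count how often base b follows base a.
  ;; Result: an alist of ((a . b) . count) pairs.
  (define transition-counts
    (lambda (str)
      (let loop ((i 0) (counts '()))
        (if (>= i (- (string-length str) 1))
            counts
            (let* ((key (cons (string-ref str i) (string-ref str (+ i 1))))
                   (entry (assoc key counts)))
              (if entry
                  (begin (set-cdr! entry (+ 1 (cdr entry)))
                         (loop (+ i 1) counts))
                  (loop (+ i 1) (cons (cons key 1) counts))))))))
  ;; Example: (transition-counts "ACGCG") counts A->C once, C->G twice, G->C once.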
20. For our loaded die example we have
- 0 means non-6
- loaded:
        0    6
   0   50   50
   6   50   50
- unloaded:
        0    6
   0   83   17
   6   83   17
21. Discrimination
- Given a sequence x we can calculate a log-odds score that it was generated by one of two models:
- S(x) = log(P(x | model A) / P(x | model B)) = sum over i of log((prob of the transition xi-1 to xi in A) / (prob of xi-1 to xi in B))
- Demonstration (here A is the fair model and B the loaded one):
  (define beta
    (lambda (x y)
      (cond ((and (= x 6) (< y 6)) (log (/ .83 .5)))   ; log(P(non-6 | fair) / P(non-6 | loaded))
            ((and (= x 6) (= y 6)) (log (/ .17 .5)))   ; log(P(6 | fair) / P(6 | loaded))
            ((and (< x 6) (< y 6)) (log (/ .83 .5)))
            ((and (< x 6) (= y 6)) (log (/ .17 .5))))))
  (define s
    (lambda (l)
      (cond ((null? (cdr l)) 0)
            (else (+ (beta (car l) (cadr l))
                     (s (cdr l)))))))
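- With fair and loaded from the confession slide, a quick sanity check of the sign convention (since beta puts the fair-model probability in the numerator, positive scores favour the fair die, negative scores the loaded one):
  (s (fair 30))     ; usually positive
  (s (loaded 30))   ; usually negative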
22. Similar Analysis for CpG Islands
- Visit the code (a sketch of what a scorecpg-style scorer might look like follows this list)
- (show mary)
- (show gene)
- (scorecpg mary)
- (scorecpg gene)
- Sliding Window
- (slidewindow 20 mary)
- (show (classify mary))
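- The course code itself is not reproduced here, but a scorecpg-style scorer could look like the sketch below: walk a string and sum the log-odds of each transition under the two matrices from the Transition matrices slide. The tables are transcribed from that slide; the names score-cpg, cpg-prob, bg-prob and lookup, and the string representation, are assumptions:
  ;; Transition probabilities (as fractions) inside CpG islands and
  ;; elsewhere, stored as ((from . to) . prob) alists.
  (define cpg-prob
    '(((#\A . #\A) . .18) ((#\A . #\C) . .27) ((#\A . #\G) . .43) ((#\A . #\T) . .12)
      ((#\C . #\A) . .17) ((#\C . #\C) . .37) ((#\C . #\G) . .27) ((#\C . #\T) . .19)
      ((#\G . #\A) . .16) ((#\G . #\C) . .34) ((#\G . #\G) . .37) ((#\G . #\T) . .13)
      ((#\T . #\A) . .08) ((#\T . #\C) . .35) ((#\T . #\G) . .38) ((#\T . #\T) . .18)))
  (define bg-prob
    '(((#\A . #\A) . .30) ((#\A . #\C) . .20) ((#\A . #\G) . .28) ((#\A . #\T) . .21)
      ((#\C . #\A) . .32) ((#\C . #\C) . .30) ((#\C . #\G) . .08) ((#\C . #\T) . .30)
      ((#\G . #\A) . .25) ((#\G . #\C) . .25) ((#\G . #\G) . .30) ((#\G . #\T) . .20)
      ((#\T . #\A) . .18) ((#\T . #\C) . .24) ((#\T . #\G) . .29) ((#\T . #\T) . .29)))
  (define lookup
    (lambda (from to table)
      (cdr (assoc (cons from to) table))))
  ;; Sum of log((prob in islands) / (prob elsewhere)) over the string:
  ;; positive totals suggest a CpG island, negative ones background DNA.
  (define score-cpg
    (lambda (str)
      (let loop ((i 0) (total 0))
        (if (>= i (- (string-length str) 1))
            total
            (loop (+ i 1)
                  (+ total
                     (log (/ (lookup (string-ref str i) (string-ref str (+ i 1)) cpg-prob)
                             (lookup (string-ref str i) (string-ref str (+ i 1)) bg-prob)))))))))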
23. Hidden Markov Models
- The Markov model we have just seen could be used for locating CpG islands
- For example, you could calculate the log-odds score of a moving window of, say, 50 nucleotides
- Hidden Markov Models try to combine the two models in a single model
- For each state, say x, in the Markov model, we imagine two states xA and xB for the two models A and B
- Both new states emit the same character
- But we don't know whether it was CA or CB that emitted the nucleotide C
24. HMMs
- Hidden Markov Models are so prevalent in bioinformatics, we'll just refer to them as HMMs
- Given
- a sequence of emitted signals -- the observed sequence
- and a matrix of transition probabilities and emission probabilities
- For a path of hidden states, calculate the conditional probability of the sequence given the path
- Then try to find an optimal path for the sequence such that the conditional probability is maximized (a probability sketch follows this list)
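- A sketch of the probability calculation for one particular path: multiply, position by position, the probability of moving into each hidden state and the probability that the state emits the observed symbol. The representation (states and symbols as lists, trans and emit as two-argument functions, a begin-state marker) is an assumption for illustration:
  ;; Joint probability of an observed sequence and one path of hidden
  ;; states, given transition and emission probability functions.
  (define path-probability
    (lambda (states symbols trans emit begin-state)
      (let loop ((prev begin-state) (sts states) (syms symbols) (p 1))
        (if (null? sts)
            p
            (loop (car sts) (cdr sts) (cdr syms)
                  (* p
                     (trans prev (car sts))
                     (emit (car sts) (car syms))))))))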
25. HMM combines Markov Models
- [Diagram: the four CpG-island states A, C, G, T and the four non-island states A-, C-, G-, T-, each labelled with the transition probabilities from the two matrices above]
26. Must be able to move between models
- Figure out these transition probabilities
- [Diagram: the same eight states A, C, G, T and A-, C-, G-, T-, now with transitions between the two models]
27. Add a begin and an end state
- and figure even more transition probabilities
- [Diagram: the eight emitting states plus a begin state B and an end state E]
28. Basic Idea
- Each state emits a nucleotide (except B and E)
- A and A- both emit A
- C and C- both emit C
- You know the sequence of emissions
- You don't know the exact sequence of states
- Was it A C C G- that produced ACCG?
- Or was it A- C- C G-?
- You want to find the most likely sequence of states (a brute-force sketch follows this list)
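- For a short emitted sequence you can simply try every assignment of hidden states and keep the one with the highest joint probability, as in the sketch below (it reuses path-probability from the HMMs slide). Real gene finders answer the same question efficiently with the Viterbi dynamic-programming algorithm rather than brute force; all-paths and best-path are illustrative names:
  ;; Every way of choosing one hidden state per observed symbol.
  (define all-paths
    (lambda (state-choices n)
      (if (zero? n)
          '(())
          (apply append
                 (map (lambda (st)
                        (map (lambda (rest) (cons st rest))
                             (all-paths state-choices (- n 1))))
                      state-choices)))))
  ;; Brute-force most likely path for a short sequence of symbols.
  (define best-path
    (lambda (symbols state-choices trans emit begin-state)
      (let loop ((paths (all-paths state-choices (length symbols)))
                 (best '())
                 (best-p -1))
        (if (null? paths)
            best
            (let ((p (path-probability (car paths) symbols trans emit begin-state)))
              (if (> p best-p)
                  (loop (cdr paths) (car paths) p)
                  (loop (cdr paths) best best-p)))))))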
29. Parameters of the HMM
- How do we know the transition probabilities between hidden states?
- empirical, hard lab work
- as above for the CpG island transition probabilities
- imperfect data will lead to an imperfect model
- Is there an alternative?
30. Machine Learning
- Grew out of other disciplines, including statistical model fitting
- Tries to automate the process as much as possible
- use very flexible models
- lots of parameters
- let the program figure them out
- Based on ...
- known data and properties
- training sets
- Induction and inference
31. Bayesian Modeling
- Understand the hypotheses (model)
- Assign prior probabilities to the hypotheses
- Use Bayes' theorem (and the rest of the probability calculus) to evaluate posterior probabilities for the hypotheses in light of actual data (a worked sketch follows this list)
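- A small worked version of that last step, with made-up numbers purely for illustration: a prior probability that a window is a CpG island, a likelihood of the observed data under each hypothesis, and the posterior that Bayes' theorem gives back:
  ;; posterior = prior * likelihood / total probability of the data
  (define posterior
    (lambda (prior likelihood-h likelihood-not-h)
      (/ (* prior likelihood-h)
         (+ (* prior likelihood-h)
            (* (- 1 prior) likelihood-not-h)))))
  ;; Made-up numbers: prior 0.1 that the window is an island, and the
  ;; observed window is 4 times as likely under the island model as
  ;; under the background model.
  ;; (posterior 0.1 0.04 0.01)  =>  about 0.31 (i.e. 4/13)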
32. Neural Networks
- Very flexible
- Inputs
- Thresholds
- Weights
- Output
- Variations
- additional layers
- loops
33. Training Neural Networks
- Need a set of inputs and desired outputs
- Run the network with a given input
- Compare to the desired output
- Adjust weights to improve
- Repeat... (a single-neuron sketch follows this list)
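- A minimal sketch of that loop for a single neuron with a threshold (a perceptron-style weight update); the training pairs, learning rate and update rule here are generic illustrations, not any of the gene-finding networks named later:
  ;; Output of one neuron: 1 if the weighted sum of the inputs exceeds
  ;; the threshold, else 0.
  (define neuron-output
    (lambda (weights inputs threshold)
      (if (> (apply + (map * weights inputs)) threshold) 1 0)))
  ;; One pass over a training set of (inputs . desired-output) pairs:
  ;; compare the actual output with the desired one and nudge each
  ;; weight in the direction that reduces the error.
  (define train-once
    (lambda (weights pairs threshold rate)
      (if (null? pairs)
          weights
          (let* ((inputs (car (car pairs)))
                 (desired (cdr (car pairs)))
                 (err (- desired (neuron-output weights inputs threshold))))
            (train-once
             (map (lambda (w x) (+ w (* rate err x))) weights inputs)
             (cdr pairs) threshold rate)))))
  ;; Calling train-once again and again is the "Repeat..." step above.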
34. Training HMMs
- Similar idea
- Begin with guesses for the parameters of the HMM
- Run it on your training set, and see how well it predicts whatever you designed it for
- Use the results to adjust the parameters
- Repeat...
35. Hope that
- When you provide brand new inputs
- The trained network will produce useful outputs
36. Some gene finder programs
- GRAIL -- neural net discriminant
- Genie -- HMM
- GENSCAN -- semi-Markov model
- GeneParser -- neural net
- FGENEH etc. (Baylor) -- classification by Linear Discriminant Analysis (LDA), a well-known statistical technique
- GeneID -- rule-based