Algorithms for Finding Genes - PowerPoint PPT Presentation

Learn more at: http://www.cs.rit.edu

Transcript and Presenter's Notes


1
Algorithms for Finding Genes
  • Rhys Price Jones
  • Anne R. Haake

2
There's something about Genes
  • The human genome consists of about 3 billion base
    pairs
  • Some of that can be transcribed and translated to
    produce proteins
  • Most of it cannot
  • What is it that makes some sequences of DNA
    encode protein?

3
The Genetic Code
  • Pick a spot in your sequence
  • Group nucleotides from that point on in triplets.
  • Each triplet codes for an amino acid, a stop,
    or a start
  • TGA, TAG, TAA code for stop
  • But there are exceptions!
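
The first three bullets can be sketched in Scheme as follows. This is a minimal illustration, not code from the talk: the names codons, stop-codons, stop? and stop-positions are hypothetical, and only the three standard stop codons listed above are handled (none of the exceptions).

```scheme
;; Split a DNA string into triplets starting at offset `start`,
;; dropping any incomplete triplet at the end.
(define (codons dna start)
  (let loop ((i start) (acc '()))
    (if (> (+ i 3) (string-length dna))
        (reverse acc)
        (loop (+ i 3) (cons (substring dna i (+ i 3)) acc)))))

;; The three stop codons named on the slide.
(define stop-codons '("TGA" "TAG" "TAA"))

(define (stop? codon)
  (if (member codon stop-codons) #t #f))

;; Zero-based positions, counted in codons from `start`,
;; of every stop codon in that reading frame.
(define (stop-positions dna start)
  (let loop ((cs (codons dna start)) (k 0) (acc '()))
    (cond ((null? cs) (reverse acc))
          ((stop? (car cs)) (loop (cdr cs) (+ k 1) (cons k acc)))
          (else (loop (cdr cs) (+ k 1) acc)))))
```

For example, (codons "ATGGCTTAAGG" 0) gives ("ATG" "GCT" "TAA") and (stop-positions "ATGGCTTAAGG" 0) gives (2), while starting one nucleotide later, (stop-positions "ATGGCTTAAGG" 1), gives (): the choice of starting spot changes what the sequence appears to encode.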

4
Complications
  • In the 1960s it was thought that a gene was a
    linear structure of nucleotides directly linked
    in triplets to amino acids in the resulting
    protein
  • This view did not last long with the discovery in
    the late 1960s of genes within genes and
    overlapping genes
  • In 1977, split genes were discovered

5
Introns and Exons
  • Most eukaryotic genes are interrupted by
    non-coding sections and broken into pieces called
    exons. The interrupting sequences are called
    introns
  • Some researchers have used a biological approach
    and searched for splicing sites at intron-exon
    junctions
  • Catalogs of splice sites were created in the
    1980s
  • But these proved unreliable

6
Can you tell if it's a gene?
  • What is it about some sections of DNA that makes
    them encode proteins?
  • Compare
  • TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTGC
    ATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCACC
  • GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACCCG
    CCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTCGAGA
  • Is there anything that distinguishes the second
    from the first?
  • We'll return to this issue

7
Before we embark on the biological problem
  • We will look at an easier analogy
  • Sometimes called the Crooked Casino
  • Basic idea
  • sequence of values generated by one or more
    processes
  • from the values, try to deduce which process
    produced which parts of the sequence

8
A useful analogy
  • The loaded die
  • I rolled a die 80 times and got
    23636626516113441416663242646331422242145353556233
    136631321454662631116546622542
  • Some of the time it was a fair die
  • Some of the time the die produced 6s half the
    time and 1-5 randomly the rest of the time

9
Confession
  • (append (fair 20) (loaded 10) (fair 20) (loaded 10) (fair 20))
  • ; n rolls of a fair die: 1 plus a random integer in 0..5
    (define fair
      (lambda (n)
        (if (zero? n)
            '()
            (cons (1+ (random 6)) (fair (1- n))))))
  • ; n rolls of the loaded die: 6 half the time, otherwise 1..5
    (define loaded
      (lambda (n)
        (if (zero? n)
            '()
            (cons (if (< (random 2) 1) 6 (1+ (random 5)))
                  (loaded (1- n))))))
  • 23636626516113441416663242646331422242145353556233
    136631321454662631116546622542

10
There are no computer algorithms to recognize
genes reliably
  • Statistical approaches
  • Similarity-based approaches
  • Others

11
Similarity Methods
  • Searching sequences for regions that have the
    look and feel of a gene (Dan Gusfield)
  • An early step is the investigation of ORFs
  • (in Eukaryotes) look for possible splicing sites
    and try to assemble exons
  • Combine sequence comparison and database search,
    seeking similar genes

12
Statistical Approaches
  • Look for features that appear frequently in genes
    but not elsewhere
  • As for the loaded die sequence above, that is not
    100% reliable
  • But then, what is? This is Biology and it's had
    billions of years to figure out how to outfox us!
  • "If you want to learn about nature, to appreciate
    nature, it is necessary to understand the
    language that she speaks in. She offers her
    information only in one form; we are not so
    unhumble as to demand that she change before we
    pay any attention." (R. Feynman)

13
Relative Frequency of Stop Codons
  • 3 of the possible 64 nucleotide triplets are stop
    codons
  • We would thus expect stop codons to form
    approximately 5% of any stretch of random DNA,
    giving an average distance between stop codons of
    about 20 codons
  • Very long stretches of triplets containing no
    stop codons (long ORFs) might, therefore,
    indicate gene possibilities
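
The arithmetic behind these figures can be written out as below. It assumes, as the slide does, that all 64 triplets are equally likely and independent of one another; the stretch length n and the value n = 100 are illustrative choices, not from the slides.

```latex
\begin{align*}
P(\text{stop}) &= \frac{3}{64} \approx 0.047 \\
E[\text{codons from one stop to the next}] &= \frac{1}{P(\text{stop})} = \frac{64}{3} \approx 21 \\
P(\text{no stop in } n \text{ consecutive codons}) &= \left(\frac{61}{64}\right)^{n},
\qquad \left(\frac{61}{64}\right)^{100} \approx 0.008
\end{align*}
```

Under this model a stop-free run of 100 codons would arise by chance less than 1% of the time, which is why long ORFs are worth a second look.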

14
Rather like looking for regions where my die was
fair
  • By seeking a shortage of 6s
  • Similarly, there is (organism-specific) codon
    bias of various kinds, and this can be exploited
    by gene-finders
  • People and programs have looked for such bias
    features as
  • regularity in codon frequencies
  • regularity in bicodon frequencies
  • periodicity
  • homogeneity vs. complexity

15
CpG Islands
  • CG dinucleotides are often written CpG to avoid
    confusion with the base pair C-G
  • In the human genome CpG is a rarer bird than
    would be expected in purely random sequences
    (there are chemical reasons for this involving
    methylation)
  • In the start regions of many genes, however, the
    methylation process is suppressed, and CpG
    dinucleotides appear more frequently than
    elsewhere

16
Rather like looking for stretches where my die
was loaded
  • By seeking a plethora of 6s
  • But now we'll look for high incidences of CpG

17
How can we detect shortage or plethora of CpGs?
  • Markov Models
  • Assume that the next state depends only on the
    previous state
  • If the states are x_1, x_2, ..., x_n
  • We have a matrix of probabilities p_ij
  • p_ij is the probability that state x_j will follow
    x_i

18
Transition matrices
  • CpG islands (%; row = current nucleotide, column = next nucleotide)
           A    C    G    T
      A   18   27   43   12
      C   17   37   27   19
      G   16   34   37   13
      T    8   35   38   18
  • Elsewhere (%; same layout)
           A    C    G    T
      A   30   20   28   21
      C   32   30    8   30
      G   25   25   30   20
      T   18   24   29   29
    (adapted from Durbin et al., 1998)

19
How do we get those numbers?
  • empirically
  • lots of hard wet and dry lab work
  • We can think of these numbers as parameters of
    our model
  • Later we'll pay more attention to model parameters
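
One way to picture the "dry lab" half of that work: given DNA already labelled as belonging to one class, count how often each nucleotide follows each other nucleotide and normalise the counts. The sketch below assumes that simple counting approach; count-pair, nucleotides and transition-row are hypothetical names, and a real estimate would need far more training data than a short string.

```scheme
;; Count how often nucleotide `from` is immediately followed by `to` in `dna`.
(define (count-pair dna from to)
  (let loop ((i 0) (n 0))
    (if (>= (+ i 1) (string-length dna))
        n
        (loop (+ i 1)
              (if (and (char=? (string-ref dna i) from)
                       (char=? (string-ref dna (+ i 1)) to))
                  (+ n 1)
                  n)))))

(define nucleotides '(#\A #\C #\G #\T))

;; One row of the estimated transition matrix: for a given `from`,
;; the fraction of its successors that are A, C, G and T, in that order.
;; Returns #f if `from` never has a successor in the training string.
(define (transition-row dna from)
  (let* ((counts (map (lambda (to) (count-pair dna from to)) nucleotides))
         (total (apply + counts)))
    (if (zero? total)
        #f
        (map (lambda (c) (/ c total)) counts))))
```

For the toy string "ACGCGT", (transition-row "ACGCGT" #\C) gives (0 0 1 0) because C is always followed by G, and (transition-row "ACGCGT" #\G) gives (0 1/2 0 1/2).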

20
For our loaded die example we have
  • 0 means non-6
  • Loaded (%; row = current roll, column = next roll)
           0    6
      0   50   50
      6   50   50
  • Unloaded (%; same layout)
           0    6
      0   83   17
      6   83   17

21
Discrimination
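  • Given a sequence x we can calculate a log-odds
    score comparing how likely it is under two models.
  • S(x) = \log \frac{P(x \mid \text{model A})}{P(x \mid \text{model B})}
         = \sum_i \log \frac{P_A(x_{i-1} \to x_i)}{P_B(x_{i-1} \to x_i)}
    where P_A(x_{i-1} \to x_i) is the probability of the
    transition from x_{i-1} to x_i in model A, and
    likewise P_B for model B
  • Demonstration (model A is the fair die, model B the loaded die)
    ; log-odds of the transition x -> y, fair over loaded
    (define beta
      (lambda (x y)
        (cond ((and (= x 6) (< y 6)) (log (/ .83 .5)))
              ((and (= x 6) (= y 6)) (log (/ .17 .5)))
              ((and (< x 6) (< y 6)) (log (/ .83 .5)))
              ((and (< x 6) (= y 6)) (log (/ .17 .5))))))
    ; sum of beta over consecutive pairs of the list l
    (define s
      (lambda (l)
        (cond ((null? (cdr l)) 0)
              (else (+ (beta (car l) (cadr l))
                       (s (cdr l)))))))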
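
To locate the loaded stretches rather than score the whole sequence at once, the same log-odds sum can be taken over a sliding window. The following is a self-contained sketch under that assumption; step-score, score, take-first and dice-window-scores are hypothetical names, and the window length is left to the caller.

```scheme
;; Log-odds contribution of one transition, fair model over loaded model.
;; Only the second roll matters, because in both dice tables above the
;; next roll does not depend on the current one.
(define (step-score y)
  (if (= y 6)
      (log (/ 0.17 0.5))
      (log (/ 0.83 0.5))))

;; Score of a list of rolls: sum over consecutive pairs.
(define (score rolls)
  (if (or (null? rolls) (null? (cdr rolls)))
      0
      (+ (step-score (cadr rolls)) (score (cdr rolls)))))

;; First k elements of a list (the list must have at least k elements).
(define (take-first lst k)
  (if (zero? k)
      '()
      (cons (car lst) (take-first (cdr lst) (- k 1)))))

;; Scores of every window of w consecutive rolls, left to right.
(define (dice-window-scores rolls w)
  (if (< (length rolls) w)
      '()
      (cons (score (take-first rolls w))
            (dice-window-scores (cdr rolls) w))))
```

With (dice-window-scores '(6 6 6 1 2 3) 3) the first window scores negative (it looks loaded) and the last scores positive (it looks fair). Short windows are noisy, so the sign of a single window is a hint, not a verdict.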

22
Similar Analysis for CpG Islands
  • Visit the code
  • (show mary)
  • (show gene)
  • (scorecpg mary)
  • (scorecpg gene)
  • Sliding Window
  • (slidewindow 20 mary)
  • (show (classify mary))
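
The code behind scorecpg and slidewindow is not part of this transcript. As a stand-in, here is a minimal sketch of how a CpG log-odds score and a sliding window could be built from the two transition matrices on slide 18. All names (island-table, elsewhere-table, column-index, lookup, cpg-log-odds, cpg-window-scores) are hypothetical and this is not the authors' implementation.

```scheme
;; Transition percentages from slide 18 (adapted from Durbin et al., 1998).
;; Each entry is (from a c g t): the percentages for the next nucleotide
;; being A, C, G, T in that order.
(define island-table
  '((#\A 18 27 43 12) (#\C 17 37 27 19) (#\G 16 34 37 13) (#\T 8 35 38 18)))
(define elsewhere-table
  '((#\A 30 20 28 21) (#\C 32 30 8 30) (#\G 25 25 30 20) (#\T 18 24 29 29)))

(define (column-index ch)
  (cond ((char=? ch #\A) 0) ((char=? ch #\C) 1)
        ((char=? ch #\G) 2) ((char=? ch #\T) 3)
        (else (error "not a nucleotide:" ch))))

;; Percentage for the transition from -> to in the given table.
(define (lookup table from to)
  (list-ref (cdr (assv from table)) (column-index to)))

;; Log-odds score of a DNA string: positive favours "CpG island",
;; negative favours "elsewhere".
(define (cpg-log-odds dna)
  (let loop ((i 1) (total 0))
    (if (>= i (string-length dna))
        total
        (let ((from (string-ref dna (- i 1)))
              (to (string-ref dna i)))
          (loop (+ i 1)
                (+ total
                   (log (/ (lookup island-table from to)
                           (lookup elsewhere-table from to)))))))))

;; Scores of every window of w nucleotides, left to right.
(define (cpg-window-scores dna w)
  (let loop ((i 0) (acc '()))
    (if (> (+ i w) (string-length dna))
        (reverse acc)
        (loop (+ i 1)
              (cons (cpg-log-odds (substring dna i (+ i w))) acc)))))
```

A single C followed by G already scores positive, since that transition is 27% inside islands against 8% elsewhere; A followed by T scores negative.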

23
Hidden Markov Models
  • The Markov model we have just seen could be used
    for locating CpG islands
  • For example you could calculate the log odds
    score of a moving window of, say 50 nucleotides
  • Hidden Markov Models try to combine the two
    models in a single model
  • For each state, say x, in the Markov model, we
    imagine two states x_A and x_B for the two models A
    and B
  • Both new states emit the same character
  • But we don't know whether it was C_A or C_B that
    emitted the nucleotide C

24
HMMs
  • Hidden Markov Models are so prevalent in
    bioinformatics, we'll just refer to them as HMMs
  • Given
  • a sequence of emitted signals -- the observed
    sequence
  • and a matrix of transition probabilities and
    emission probabilities
  • For a path of hidden states, calculate the
    conditional probability of the sequence given the
    path
  • Then try to find an optimal path for the sequence
    such that the conditional probability is maximized

25
HMM combines Markov Models
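[Figure: state diagram of the combined model. The CpG-island states C, A, G, T and the non-island states T-, G-, C-, A- are drawn with the transition percentages from the two matrices on slide 18 labelling the edges within each group.]

26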
Must be able to move between models
  • Figure out these transition probabilities

[Figure: the same eight states C, A, G, T and T-, G-, C-, A-, now with transitions between the two groups.]

27
Add a begin and an end state
  • and figure even more transition probabilities

[Figure: the eight states C, A, G, T and T-, G-, C-, A-, with a begin state B and an end state E added.]

28
Basic Idea
  • Each state emits a nucleotide (except B and E)
  • A and A- both emit A
  • C and C- both emit C
  • You know the sequence of emissions
  • You don't know the exact sequence of states
  • Was it A C C G- that produced ACCG?
  • Or was it A- C- C G-?
  • You want to find the most likely sequence
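
The standard way to find the most likely state sequence is the Viterbi algorithm, which the slides do not name. A minimal sketch for this CpG model follows. Because each state emits exactly one nucleotide, every observed letter leaves only two candidate states (island or not), so the sketch tracks the best path ending in each. Several things here are assumptions for illustration: the probability p-stay of remaining in the same model at each step, the equal starting probabilities, the omission of the B and E states, and all the names used.

```scheme
;; Transition percentages from slide 18 (rounded, so a row may not sum to 100).
(define island-table
  '((#\A 18 27 43 12) (#\C 17 37 27 19) (#\G 16 34 37 13) (#\T 8 35 38 18)))
(define elsewhere-table
  '((#\A 30 20 28 21) (#\C 32 30 8 30) (#\G 25 25 30 20) (#\T 18 24 29 29)))

(define (column-index ch)
  (cond ((char=? ch #\A) 0) ((char=? ch #\C) 1)
        ((char=? ch #\G) 2) ((char=? ch #\T) 3)
        (else (error "not a nucleotide:" ch))))

;; Log of the probability of from -> to within one model.
(define (log-trans table from to)
  (log (/ (list-ref (cdr (assv from table)) (column-index to)) 100.0)))

;; A candidate is (log-score . reversed-path). Keep the better of two.
(define (best a b) (if (>= (car a) (car b)) a b))

;; Extend a candidate by one step into a state of the model `label`.
(define (extend cand log-switch table from to label)
  (cons (+ (car cand) log-switch (log-trans table from to))
        (cons label (cdr cand))))

;; Most likely labelling of each nucleotide as island or elsewhere.
;; p-stay (assumed) is the chance of staying in the same model per step.
(define (viterbi dna p-stay)
  (if (zero? (string-length dna))
      '()
      (let ((ls (log p-stay)) (lx (log (- 1 p-stay))))
        (let loop ((i 1)
                   (plus (cons (log 0.5) (list 'island)))
                   (minus (cons (log 0.5) (list 'elsewhere))))
          (if (>= i (string-length dna))
              (reverse (cdr (best plus minus)))
              (let ((from (string-ref dna (- i 1)))
                    (to (string-ref dna i)))
                (loop (+ i 1)
                      (best (extend plus ls island-table from to 'island)
                            (extend minus lx island-table from to 'island))
                      (best (extend minus ls elsewhere-table from to 'elsewhere)
                            (extend plus lx elsewhere-table from to 'elsewhere)))))))))
```

With p-stay set to 0.9, (viterbi "CGCGCGCG" 0.9) labels every position island and (viterbi "ATATATAT" 0.9) labels every position elsewhere. The value chosen for p-stay controls how readily the path switches models, which is exactly the kind of parameter the next slide asks about.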

29
Parameters of the HMM
  • How do we know the transition probabilities
    between hidden states?
  • empirical, hard lab work
  • as above for the CpG island transition
    probabilities
  • imperfect data will lead to imperfect model
  • Is there an alternative?

30
Machine Learning
  • Grew out of other disciplines, including
    statistical model fitting
  • Tries to automate the process as much as possible
  • use very flexible models
  • lots of parameters
  • let the program figure them out
  • Based on ...
  • known data and properties
  • training sets
  • Induction and inference

31
Bayesian Modeling
  • Understand the hypotheses (model)
  • Assign prior probabilities to the hypotheses
  • Use Bayes' theorem (and the rest of probability
    calculus) to evaluate posterior probabilities
    for the hypotheses in light of actual data
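
A worked instance using the loaded-die example: the hypotheses are L (the die is loaded) and F (the die is fair), and the data is a single roll showing 6. The prior P(L) = 1/4 is an assumption made here, chosen to match the 20 loaded rolls out of 80 in the confession on slide 9; the likelihoods 1/2 and 1/6 come from the description of the two dice.

```latex
\begin{align*}
P(L \mid 6)
  &= \frac{P(6 \mid L)\,P(L)}{P(6 \mid L)\,P(L) + P(6 \mid F)\,P(F)} \\
  &= \frac{\tfrac{1}{2}\cdot\tfrac{1}{4}}
          {\tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{6}\cdot\tfrac{3}{4}}
   = \frac{\tfrac{1}{8}}{\tfrac{1}{8} + \tfrac{1}{8}}
   = \frac{1}{2}
\end{align*}
```

Seeing one 6 moves the probability that the die is loaded from the prior 1/4 to a posterior 1/2.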

32
Neural Networks
  • Very flexible
  • Inputs
  • Thresholds
  • Weights
  • Output
  • Variations
  • additional layers
  • loops

33
Training Neural Networks
  • Need set of inputs and desired outputs
  • Run the network with a given input
  • Compare to desired output
  • Adjust weights to improve
  • Repeat...
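
The loop on this slide can be sketched for the simplest possible network, a single threshold unit. The update rule used is the classic perceptron rule, picked for illustration; the slides do not say which rule or network the authors had in mind, and every name below (run-unit, train-step, train-pass, train) is hypothetical.

```scheme
;; Run the unit: output 1 if the weighted sum of the inputs
;; exceeds the threshold, otherwise 0.
(define (run-unit weights threshold inputs)
  (if (> (apply + (map * weights inputs)) threshold) 1 0))

;; Compare with the desired output and adjust (perceptron rule):
;; each weight moves by rate * error * input, the threshold the other way.
;; Returns (new-weights . new-threshold).
(define (train-step weights threshold inputs desired rate)
  (let ((err (- desired (run-unit weights threshold inputs))))
    (cons (map (lambda (w x) (+ w (* rate err x))) weights inputs)
          (- threshold (* rate err)))))

;; One pass over a list of (inputs . desired) training examples.
(define (train-pass weights threshold examples rate)
  (if (null? examples)
      (cons weights threshold)
      (let ((next (train-step weights threshold
                              (car (car examples))
                              (cdr (car examples))
                              rate)))
        (train-pass (car next) (cdr next) (cdr examples) rate))))

;; Repeat for a fixed number of passes.
(define (train weights threshold examples rate passes)
  (if (zero? passes)
      (cons weights threshold)
      (let ((next (train-pass weights threshold examples rate)))
        (train (car next) (cdr next) examples rate (- passes 1)))))
```

As a usage example, training on logical AND with (train '(0 0) 0 '(((0 0) . 0) ((0 1) . 0) ((1 0) . 0) ((1 1) . 1)) 1 20) returns a (weights . threshold) pair for which run-unit outputs 1 only on the input (1 1). A single unit like this can only learn linearly separable patterns, which is one reason for the additional layers mentioned on the previous slide.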

34
Training HMMs
  • Similar idea
  • Begin with guesses for the parameters of the HMM
  • Run it on your training set, and see how well it
    predicts whatever you designed it for
  • Use the results to adjust the parameters
  • Repeat...

35
Hope that
  • When you provide brand new inputs
  • The trained network will produce useful outputs

36
Some gene finder programs
  • GRAIL: neural net discriminant
  • Genie: HMM
  • GENSCAN: semi-Markov model
  • GeneParser: neural net
  • FGENEH etc. (Baylor): classification by linear
    discriminant analysis (LDA), a well-known
    statistical technique
  • GeneID: rule-based