Bioinformatics: Applications

Provided by: jonath76

1
Bioinformatics Applications
  • ZOO 4903
  • Fall 2006, MW 10:30-11:45
  • Sutton Hall, Room 312
  • Basics of gene finding - Prokaryotes

2
First
  • Short discussion & feedback on Exam 1
  • Class report instructions to be handed out

3
Lecture overview
  • What we've talked about so far
  • DNA is the blueprint for all organisms
  • Overview
  • RNA, not DNA, is the marker of cellular activity
    and changes
  • Gene finding in prokaryotes

4
DNA guides the transcription of RNA in the nucleus
5
Gene number generally increases with phylogenetic
complexity
6
Genes & genome complexity
  • There is almost no correlation between the amount
    of DNA in a species and its evolutionary
    complexity (C-value paradox).
  • There is a correlation between the amount of
    non-protein coding regions and complexity.

7
Gene finding approaches
  • Rule-based (e.g., start & stop codons)
  • Content-based (e.g., codon bias, promoter sites)
  • Similarity-based (e.g., orthologs)
  • Extrinsic-based (e.g., known proteins, ESTs)
  • Pattern-based (e.g., machine-learning)

8
Prokaryotes
  • Advantages
  • Simple gene structure
  • Small genomes (0.5 to 10 million bp)
  • No introns
  • Genes are called Open Reading Frames (ORFs)
  • High coding density (>90%)
  • Disadvantages
  • Some genes overlap (nested)
  • Some genes are quite short (<60 bp)

9
Gene structure comparisons
10
Prokaryotic gene structure
[Figure: prokaryotic gene structure, with labels for the TATA box, start codon, ORF (open reading frame), and stop codon, drawn over the sequence below in reading frames 1-3]
ATGACAGATTACAGATTACAGATTACAGGATAG
11
Prokaryotes stack multiple genes together for
expression (operons)
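[Figure: an operon. A promoter is followed by Gene 1, Gene 2, ... Gene N and a terminator. RNA polymerase transcribes them into a single mRNA (5' to 3'), which is translated into separate polypeptides, each with its own N- and C-terminus.]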
12
Prokaryotic genomes
E. coli
13
Simple rule-based gene finding in prokaryotes,
based on ORFs
  • Look for putative start codon (ATG)
  • Staying in same frame, scan in groups of three
    until a stop codon is found
  • If # of codons > 50, assume it's a gene
  • If # of codons < 50, go back to last start codon,
    increment by 1 & start again
  • At end of chromosome, repeat process for reverse
    complement
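The steps above can be sketched in Python. This is a minimal illustration, not the method used by any of the tools named in this lecture. Several details are assumptions the slide does not specify: the stop codons are taken to be TAA, TAG and TGA, the codon count includes the start and stop codons, and scanning resumes after the stop codon once an ORF is accepted. The names `scan_strand` and `find_orfs` are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # assumed stop codon set
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons=50):
    """Return (start, end) pairs of ORFs on one strand; end is exclusive."""
    orfs = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)                 # putative start codon
        if start == -1:
            break
        end = None
        for i in range(start + 3, len(seq) - 2, 3):  # stay in the same frame
            if seq[i:i + 3] in STOP_CODONS:
                end = i + 3
                break
        # Codon count here includes the start and stop codons (assumption).
        if end is not None and (end - start) // 3 > min_codons:
            orfs.append((start, end))
            pos = end          # assumption: resume after the stop codon
        else:
            pos = start + 1    # go back to the start codon, increment by 1
    return orfs


def find_orfs(genome, min_codons=50):
    """Scan the forward strand, then the reverse complement."""
    genome = genome.upper()
    forward = [("+", s, e) for s, e in scan_strand(genome, min_codons)]
    reverse = [("-", s, e)
               for s, e in scan_strand(reverse_complement(genome), min_codons)]
    return forward + reverse


# The short example sequence from the gene structure slide has only 11 codons,
# so the threshold is lowered here purely for demonstration.
print(find_orfs("ATGACAGATTACAGATTACAGATTACAGGATAG", min_codons=10))
```

Coordinates reported for the "-" strand are positions on the reverse-complemented sequence, not on the original strand.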

14
Example ORF
15
ORF Finding Tools
  • NCBI - http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • Diogenes - http://www.cbc.umn.edu/diogenes/diogenes.html
  • Frameplot - http://www.nih.go.jp/jun/cgi-bin/frameplot.pl

16
Problems with rule-based approaches
  • Advantages
  • Simple and fairly sensitive (>50%)
  • Disadvantages
  • Prokaryotic genes are not always so simple to
    find
  • ATG is not the only possible start site (e.g.,
    CTG, TTG - class I alternates)
  • Small genes tend to be overlooked and long ones
    over-predicted
  • Solution? Use additional information to increase
    confidence in predictions

17
Gene finding approaches
  • Rule-based (e.g., start & stop codons)
  • Content-based (e.g., codon bias, promoter sites)
  • Similarity-based (e.g., orthologs)
  • Extrinsic-based (e.g., known proteins, ESTs)
  • Pattern-based (e.g., machine-learning)

18
Key prokaryotic gene features
  • RNA polymerase promoter site (-10, -35 site or
    TATA box)
  • Shine-Dalgarno sequence (+10, Ribosome Binding
    Site) to initiate protein translation
  • Codon biases
  • High GC content
  • Stem-loop (rho-independent) terminators

19
Promoter structure in prokaryotes (E. coli)
  • Transcription starts at offset 0.
  • Pribnow Box (-10)
  • Gilbert Box (-35)
  • Ribosomal Binding Site (+10)

20
RNAP binds a region of DNA from -40 to +20
The sequence of the non-template strand is shown
-35 region: TTGACA
Spacer: 16-19 bp
-10 region: TATAAT
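As a rough illustration of how these consensus elements could be searched for, the sketch below looks for an exact TTGACA, a 16-19 bp spacer, and an exact TATAAT using a regular expression. The function name `find_consensus_promoters` is made up for this example. Exact matching is a strong simplification: real promoters often deviate from the consensus, so a search like this would miss many of them.

```python
import re

# -35 hexamer, 16-19 bp spacer, -10 hexamer, as given on the slide.
PROMOTER = re.compile(r"TTGACA[ACGT]{16,19}TATAAT")


def find_consensus_promoters(seq):
    """Return (start, end, matched_text) for each exact consensus match."""
    return [(m.start(), m.end(), m.group())
            for m in PROMOTER.finditer(seq.upper())]


# Hypothetical test sequence with a 17 bp spacer of A's.
example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
for start, end, text in find_consensus_promoters(example):
    print(start, end, text)
```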
21
Example lexA Gene
  • Three potential binding sites for the lexA
    product to the promoter region
  • Promoter sites (-10, -35) for interaction with
    the RNA polymerase
  • Ribosomal binding site on the mRNA product
    complementary to ribosomal RNA
  • Open reading frame devoid of introns

22
Codon Bias
  • The genetic code is degenerate
  • Equivalent triplet codons code for the same amino
    acid
  • Codon usage varies
  • Organism to organism (fortunately)
  • Gene to gene (unfortunately)
  • Can be calculated (http://www.kazusa.or.jp/codon/)
  • Biological basis
  • Avoidance of codons similar to stop
  • Preference for codons that correspond to abundant
    tRNAs within the organism
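Codon usage is commonly reported as counts per 1000 codons. A minimal sketch of that calculation for a single in-frame coding sequence is shown below. The function name `codon_usage_per_thousand` is made up for this example, and a real codon usage table would be built from many genes rather than one short sequence.

```python
from collections import Counter


def codon_usage_per_thousand(coding_seq):
    """Return {codon: occurrences per 1000 codons} for an in-frame sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return {}
    counts = Counter(codons)
    return {codon: 1000 * n / len(codons) for codon, n in counts.items()}


# Short example sequence from the gene structure slide (11 codons).
usage = codon_usage_per_thousand("ATGACAGATTACAGATTACAGATTACAGGATAG")
for codon, per_thousand in sorted(usage.items()):
    print(codon, round(per_thousand, 1))
```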

23
Codon Adaptation Index example
Counts per 1000 codons
24
Terminator Stem-loops
25
Content-based recognition
  • Advantages
  • Increases accuracy over rule-based
  • Disadvantages
  • Features are degenerate
  • Features are not always present

26
Dealing with degenerate signals
  • Use a profile-based method, sometimes called a
    position specific scoring matrix (PSSM) built
    from multiple sequence alignments

A PSSM
27
Building a feature profile/PSSM
Multiple Alignment
A T T T A G T A T C
G T T C T G T A A C
A T T T T G T A G C
A A G C T G T A A C
C A T T - G T A C A
Table of Occurrences
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0
28
Building a feature profile/PSSM
Table of Occurrences
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0
PSSM with no pseudovalues
A  .6  .4  0   0   .2  0   0   1   .4  .2
C  .2  0   0   .4  0   0   0   0   .2  .8
G  .2  0   .2  0   0   1   0   0   .2  0
T  0   .6  .8  .6  .6  0   1   0   .2  0
-  0   0   0   0   .2  0   0   0   0   0
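The two tables on this slide can be reproduced from the five aligned sequences with a few lines of Python. This is a minimal sketch; the function names `count_table` and `frequency_table` are made up for this example.

```python
# The five aligned sequences from the multiple alignment slide.
ALIGNMENT = [
    "ATTTAGTATC",
    "GTTCTGTAAC",
    "ATTTTGTAGC",
    "AAGCTGTAAC",
    "CATT-GTACA",
]
SYMBOLS = "ACGT-"


def count_table(alignment):
    """Count how often each symbol occurs in each alignment column."""
    length = len(alignment[0])
    return {s: [sum(row[i] == s for row in alignment) for i in range(length)]
            for s in SYMBOLS}


def frequency_table(alignment):
    """Divide the counts by the number of sequences (no pseudovalues)."""
    n = len(alignment)
    return {s: [c / n for c in counts]
            for s, counts in count_table(alignment).items()}


for symbol, row in count_table(ALIGNMENT).items():
    print(symbol, row)
for symbol, row in frequency_table(ALIGNMENT).items():
    print(symbol, row)
```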
29
Why pseudovalues?
How well a sequence fits a profile is often
calculated by multiplying the probabilities
together. Consider the following case when a new
sequence (blue) is compared to a profile.
New sequence: A T T T T G T A C C
Profile:
A  .9  .1  0   0   .2  0   0   1   .9  .2
C  .1  0   0   .2  0   0   0   0   0   .8
G  0   0   .2  0   0   1   0   0   0   0
T  0   .9  .8  .8  .8  0   1   0   .1  0
-  0   0   0   0   0   0   0   0   0   0
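The sketch below multiplies the profile probabilities for the new sequence. The sequence carries the profile's most likely base at nine of its ten positions, but the profile gives C a probability of 0 at position 9, so the whole product collapses to zero. The function name `profile_probability` is made up for this example, and the gap row is omitted because the sequence contains no gaps.

```python
# Profile values copied from the slide (gap row omitted).
PROFILE = {
    "A": [0.9, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 1.0, 0.9, 0.2],
    "C": [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
    "G": [0.0, 0.0, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "T": [0.0, 0.9, 0.8, 0.8, 0.8, 0.0, 1.0, 0.0, 0.1, 0.0],
}


def profile_probability(seq, profile):
    """Multiply the per-position probabilities of each base in seq."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= profile[base][i]
    return p


print(profile_probability("ATTTTGTACC", PROFILE))  # sequence from the slide
print(profile_probability("ATTTTGTAAC", PROFILE))  # same, with A at position 9
```

The second call changes only position 9 and gives a non-zero probability, which shows how much a single zero entry dominates the score.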
30
Building a feature profile/PSSM
PSSM with no pseudovalues
A  .6  .4  0   0   .2  0   0   1   .4  .2
C  .2  0   0   .4  0   0   0   0   .2  .8
G  .2  0   .2  0   0   1   0   0   .2  0
T  0   .6  .8  .6  .6  0   1   0   .2  0
-  0   0   0   0   .2  0   0   0   0   0
PSSM with Pseudovalues
A  .58 .38 .09 .09 .24 .04 .09 .79 .38 .24
C  .17 .06 .06 .33 .05 .06 .05 .05 .18 .61
G  .17 .06 .19 .06 .05 .75 .05 .05 .18 .04
T  .05 .51 .65 .51 .65 .09 .79 .09 .24 .09
-  .05 .02 .04 .05 .02 .05 .02 .02 .02 .02
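The slide does not say how its pseudovalues were computed. One simple and widely used scheme is additive smoothing: add a small constant to every count before normalizing each column. The sketch below uses that scheme as an assumption, so its output will not match the table above exactly. The function name `smoothed_frequencies` and the default pseudocount of 1 are choices made for this example.

```python
# Table of occurrences from the earlier slide.
COUNTS = {
    "A": [3, 2, 0, 0, 1, 0, 0, 5, 2, 1],
    "C": [1, 0, 0, 2, 0, 0, 0, 0, 1, 4],
    "G": [1, 0, 1, 0, 0, 5, 0, 0, 1, 0],
    "T": [0, 3, 4, 3, 3, 0, 5, 0, 1, 0],
    "-": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}


def smoothed_frequencies(counts, pseudocount=1.0):
    """Add a pseudocount to every cell, then normalize each column."""
    symbols = list(counts)
    length = len(counts[symbols[0]])
    table = {s: [] for s in symbols}
    for i in range(length):
        total = sum(counts[s][i] for s in symbols) + pseudocount * len(symbols)
        for s in symbols:
            table[s].append((counts[s][i] + pseudocount) / total)
    return table


for symbol, row in smoothed_frequencies(COUNTS).items():
    print(symbol, [round(v, 2) for v in row])
```

With any positive pseudocount no entry is zero, so a single unseen base can no longer force a sequence's score to zero.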
31
Gene finding approaches
  • Rule-based (e.g., start & stop codons)
  • Content-based (e.g., codon bias, promoter sites)
  • Similarity-based (e.g., orthologs)
  • Pattern-based (e.g., machine-learning)
  • Extrinsic-based (e.g., known proteins, ESTs)

32
Similarity-based gene finding
  • Take all known genes from a related genome and
    compare them to the query genome via BLAST
  • Advantages
  • Predictions are made based upon confirmed genes

33
Similarity-based gene finding
  • Take all known genes from a related genome and
    compare them to the query genome via BLAST
  • Disadvantages
  • Orthologs/paralogs sometimes lose function and
    become pseudogenes
  • Not all genes will always be known in the
    comparison genome (big circularity problem)
  • The best species for comparison isn't always
    obvious
  • Summary: Similarity comparisons are good
    supporting evidence for prediction validity

34
Gene finding approaches
  • Rule-based (e.g., start & stop codons)
  • Content-based (e.g., codon bias, promoter sites)
  • Similarity-based (e.g., orthologs)
  • Extrinsic-based (e.g., known proteins, ESTs)
  • Pattern-based (e.g., machine-learning)

35
Why not just use extrinsic evidence?
  • Proteins and ESTs (mRNAs) are expressed under
    specific circumstances
  • ESTs are often too short to determine complete
    gene structure
  • In eukaryotes, many are only expressed sometime
    during development
  • In prokaryotes, some are only expressed when
    certain conditions are met (e.g. environmental)

36
Gene finding approaches
  • Rule-based (e.g., start & stop codons)
  • Content-based (e.g., codon bias, promoter sites)
  • Similarity-based (e.g., orthologs)
  • Extrinsic-based (e.g., known proteins, ESTs)
  • Pattern-based (e.g., machine-learning)

37
Markov Models
  • Begin with a set of states
  • The transition from any state to any other state,
    including itself, is probabilistic
  • The odds of moving from one state to another
    depend only upon the current state
  • Can be created from multiple sequence alignment
    (e.g., for feature recognition)

38
A Markov Model of DNA mutations
State Transition Matrix
39
Nth order Markov Models
  • What is the probability of observing GTCACT in a
    region?
  • 1st order
  • P(GTCACT) = P(G→T) P(T→C) P(C→A) P(A→C) P(C→T)
  • 2nd order
  • P(GTCACT) = P(GT→CA) P(CA→CT)
  • 3rd order
  • P(GTCACT) = P(GTC→ACT)
  • Etc
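The 1st order case can be sketched as a product of transition probabilities. The transition matrix below is a hypothetical uniform one, chosen only so the arithmetic is easy to check. The optional `initial` argument is an addition for this example: many formulations also multiply by the probability of the first base, which the slide's formula leaves out.

```python
def first_order_probability(seq, transition, initial=None):
    """Multiply P(prev -> cur) over each adjacent pair of bases in seq."""
    p = 1.0 if initial is None else initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p


# Hypothetical matrix: every transition equally likely.
UNIFORM = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}

# Five transitions: G->T, T->C, C->A, A->C, C->T.
print(first_order_probability("GTCACT", UNIFORM))
```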

40
Transition probabilities are compared to another
model
  • CTAGCGACGGCTCAGCGGTGCTACGCGC
  • Gene sequence
  • GTATGCGCGATCGATCGCGACCGATCGT
  • Random
  • TACACTATAGTACGACTATCAATACTCA
  • Intragenic sequence

41
Markov Models in gene prediction
  • Judge how likely it is that a given sequence of
    bases belongs to one class of DNA vs. another
  • Codon vs. intragenic: 3rd order MM
  • Intron vs. exon (eukaryotes): 3rd order MM
  • Binding site vs. other: nth order MM
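One common way to make this comparison is a log-odds score: score the sequence under two models and take the log of the ratio. The sketch below uses 1st order models for brevity, not the 3rd order models listed above, and trains each one on a single example sequence from the previous slide, which is far too little data for real use. The function names and the add-one pseudocount are assumptions made for this example.

```python
import math

BASES = "ACGT"


def train_transitions(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}


def log_odds(seq, model_a, model_b):
    """Positive values favor model_a, negative values favor model_b."""
    return sum(math.log(model_a[p][c] / model_b[p][c])
               for p, c in zip(seq, seq[1:]))


# Example sequences from the previous slide, used here as toy training data.
gene_model = train_transitions(["CTAGCGACGGCTCAGCGGTGCTACGCGC"])
other_model = train_transitions(["TACACTATAGTACGACTATCAATACTCA"])

print(log_odds("GCGCGACG", gene_model, other_model))
print(log_odds("TATACTAT", gene_model, other_model))
```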

42
Question
  • Q: MMs work well when we know what kind of
    comparison to make, but what if we don't know
    anything about the sequence we're analyzing (e.g.
    when a state transition occurs)?

43
Question
  • Q: MMs work well when we know what kind of
    comparison to make, but what if we don't know
    anything about the sequence we're analyzing (e.g.
    when a state transition occurs)?
  • A: We have to look for a model that best fits our
    observations. We assume that the real states are
    hidden from our view.

44
Markov Chains
[Figure: state diagram with the states Rain, Sunny, and Cloudy]
State transition matrix: the probability of the
weather given the previous day's weather.
States: three states - sunny, cloudy, rainy.
Initial distribution: the probability of the
system being in each of the states at time 0.
45
Hidden Markov Models
Hidden states: the true states of a system that
may be described by a Markov process (e.g., the
weather).
Observable states: the states of the process that
are 'visible' to an observer (e.g., damp grass).
46
Components of an HMM
Grass
Output matrix: the probability of observing a
particular observable state given that the hidden
model is in a particular hidden state.
Initial distribution: the probability of the
(hidden) model being in a particular hidden state
at time t = 1.
State transition matrix: the probability of a
hidden state given the previous hidden state.
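The three components can be written down directly and used to recover the most likely hidden weather sequence from observations of the grass. The sketch below uses the Viterbi algorithm, a standard decoding method for HMMs that these slides do not name. Every probability is a made-up value for illustration, and the observation labels "dry" and "damp" are assumptions.

```python
STATES = ("sunny", "cloudy", "rainy")

# Initial distribution (hypothetical values).
INITIAL = {"sunny": 1 / 3, "cloudy": 1 / 3, "rainy": 1 / 3}

# State transition matrix: P(today's weather | yesterday's weather).
TRANSITION = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}

# Output matrix: P(grass observation | hidden weather state).
OUTPUT = {
    "sunny":  {"dry": 0.9, "damp": 0.1},
    "cloudy": {"dry": 0.6, "damp": 0.4},
    "rainy":  {"dry": 0.1, "damp": 0.9},
}


def viterbi(observations):
    """Return the most probable sequence of hidden states."""
    # best[s]: probability of the most likely path that ends in state s.
    best = {s: INITIAL[s] * OUTPUT[s][observations[0]] for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] * TRANSITION[p][s])
            new_best[s] = best[prev] * TRANSITION[prev][s] * OUTPUT[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi(["dry", "dry", "damp", "damp"]))
```

For long observation sequences the products become very small, so practical implementations usually work with log probabilities instead.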
47
Applied to gene finding (color = state)
48
Some HMMs can get complex
RBS site
promoter site
49
Markov Model caveat
  • Only works when each base pair is not linked to
    any other in the sequence. For example

[Figure: example based on the sequence GACCCTC, with three variants labeled OK, non-functional, and OK]
50
Prokaryotic gene prediction software using the
methods discussed
  • GLIMMER
  • http://cbcb.umd.edu/software/glimmer/
  • Uses interpolated Markov models (IMMs)
  • Requires training of sample genes
  • Takes about 1 minute/genome
  • GeneMark
  • http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
  • Available as a web server
  • Uses Hidden Markov Models

51
Glimmer Performance
52
Bottom Line...
  • Gene finding in prokaryotes is pretty much a
    solved problem
  • Accuracy of the best methods approaches 99%
  • Gene predictions should always be compared
    against extrinsic evidence (protein, ESTs) and
    similarity to other genes (BLAST) to ensure
    accuracy and to catch possible sequencing errors

53
For next time
  • Read Mount, Chapter 9