Title: Bioinformatics: Applications
1Bioinformatics Applications
- ZOO 4903
- Fall 2006, MW 10:30-11:45
- Sutton Hall, Room 312
- Basics of gene finding - Prokaryotes
2First
- Short discussion and feedback on Exam 1
- Class report instructions to be handed out
3Lecture overview
- What we've talked about so far
- DNA is the blueprint for all organisms
- RNA, not DNA, is the marker of cellular activity and changes
- Overview
- Gene finding in prokaryotes
4DNA guides the transcription of RNA in the nucleus
5Gene number generally increases with phylogenetic
complexity
6Genes and genome complexity
- There is almost no correlation between the amount of DNA in a species and its evolutionary complexity (C-value paradox).
- There is a correlation between the amount of non-protein coding regions and complexity.
7Gene finding approaches
- Rule-based (e.g., start and stop codons)
- Content-based (e.g., codon bias, promoter sites)
- Similarity-based (e.g., orthologs)
- Extrinsic-based (e.g., known proteins, ESTs)
- Pattern-based (e.g., machine-learning)
8Prokaryotes
- Advantages
- Simple gene structure
- Small genomes (0.5 to 10 million bp)
- No introns
- Genes are called Open Reading Frames (ORFs)
- High coding density (>90%)
- Disadvantages
- Some genes overlap (nested)
- Some genes are quite short (<60 bp)
9Gene structure comparisons
10Prokaryotic gene structure
[Figure: prokaryotic gene structure. Labels: TATA box, start codon, ORF (open reading frame), stop codon. The example sequence ATGACAGATTACAGATTACAGATTACAGGATAG is shown in reading frames 1, 2 and 3.]
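To make the three forward reading frames concrete, the following Python sketch splits the example sequence from this slide into codons for each frame. The helper name `codons_in_frame` is made up for this illustration, and only complete codons are kept.

```python
def codons_in_frame(seq, frame):
    """Return the complete codons of seq read in forward frame 1, 2 or 3."""
    start = frame - 1  # frame 1 starts at index 0, frame 2 at index 1, ...
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

seq = "ATGACAGATTACAGATTACAGATTACAGGATAG"
for frame in (1, 2, 3):
    print(f"Frame {frame}:", " ".join(codons_in_frame(seq, frame)))
```

Only frame 1 reads this sequence as an ATG start codon followed, in frame, by a TAG stop codon.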
11Prokaryotes stack multiple genes together for
expression (operons)
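[Figure: operon layout. On the DNA: Promoter, Gene 1, Gene 2, ..., Gene N, Terminator. RNA polymerase transcribes them into a single mRNA (5' to 3') carrying regions 1, 2, ..., N, each of which is translated into its own polypeptide (N-terminus to C-terminus).]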
12Prokaryotic genomes
E. coli
13Simple rule-based gene finding in prokaryotes,
based on ORFs
- Look for a putative start codon (ATG)
- Staying in the same frame, scan in groups of three until a stop codon is found
- If the number of codons is >50, assume it's a gene
- If the number of codons is <50, go back to the last start codon, increment by 1 and start again
- At the end of the chromosome, repeat the process for the reverse complement
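The scanning rule above can be sketched in Python roughly as follows. This is a minimal illustration, not a production gene finder. The function names are invented here; the codon count includes the start and stop codons; and resuming the scan right after the stop codon of an accepted ORF is an assumption, since the slide does not say where to continue.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=50):
    """Return half-open (start, end) index pairs of ORFs on one strand."""
    orfs = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)
        if start == -1:
            break
        end = None
        # Stay in the frame of this ATG and read codon by codon.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                end = i + 3
                break
        if end is not None and (end - start) // 3 > min_codons:
            orfs.append((start, end))
            pos = end        # assumption: resume after the stop codon
        else:
            pos = start + 1  # too short or no stop: move one base past this ATG
    return orfs

def find_orfs_both_strands(seq, min_codons=50):
    """Scan the forward strand, then the reverse complement."""
    forward = [("+", s, e) for s, e in find_orfs(seq, min_codons)]
    reverse = [("-", s, e) for s, e in find_orfs(reverse_complement(seq), min_codons)]
    return forward + reverse
```

Coordinates reported for the "-" strand are positions on the reverse-complemented sequence, not on the original one.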
14Example ORF
15ORF Finding Tools
- NCBI - http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- Diogenes - http://www.cbc.umn.edu/diogenes/diogenes.html
- Frameplot - http://www.nih.go.jp/jun/cgi-bin/frameplot.pl
16Problems with rule-based approaches
- Advantages
- Simple and fairly sensitive (>50%)
- Disadvantages
- Prokaryotic genes are not always so simple to find
- ATG is not the only possible start site (e.g., CTG, TTG class I alternates)
- Small genes tend to be overlooked and long ones over-predicted
- Solution? Use additional information to increase confidence in predictions
17Gene finding approaches
- Rule-based (e.g., start and stop codons)
- Content-based (e.g., codon bias, promoter sites)
- Similarity-based (e.g., orthologs)
- Extrinsic-based (e.g., known proteins, ESTs)
- Pattern-based (e.g., machine-learning)
18Key prokaryotic gene features
- RNA polymerase promoter site (-10, -35 site or TATA box)
- Shine-Dalgarno sequence (+10, ribosome binding site) to initiate protein translation
- Codon biases
- High GC content
- Stem-loop (rho-independent) terminators
19Promoter structure in prokaryotes (E. coli)
- Transcription starts at offset 0.
- Pribnow Box (-10)
- Gilbert Box (-35)
- Ribosomal Binding Site (+10)
20RNAP binds a region of DNA from -40 to +20
The sequence of the non-template strand is shown
-35 region: TTGACA
Spacer: 16-19 bp
-10 region: TATAAT
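As a rough sketch of how the -35/spacer/-10 layout could be turned into a search pattern, the Python snippet below looks for the exact consensus hexamers separated by 16 to 19 bases. Real promoters rarely match the consensus exactly, so this is illustrative only; the function name and the demo sequence are made up.

```python
import re

# Exact -35 and -10 consensus hexamers with a 16-19 bp spacer between them.
PROMOTER = re.compile(r"TTGACA[ACGT]{16,19}TATAAT")

def find_consensus_promoters(seq):
    """Return (start index, matched text) for each exact consensus match."""
    return [(m.start(), m.group()) for m in PROMOTER.finditer(seq)]

# Made-up sequence with a 17 bp spacer between the two hexamers.
demo = "GGC" + "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT" + "GCGC"
print(find_consensus_promoters(demo))
```

Because the signals are degenerate in practice, an exact-match pattern like this misses most true promoters, which is the motivation for the profile-based methods later in the lecture.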
21Example lexA Gene
- Three potential binding sites for the lexA product in the promoter region
- Promoter sites (-10, -35) for interaction with the RNA polymerase
- Ribosomal binding site on the mRNA product, complementary to ribosomal RNA
- Open reading frame devoid of introns
22Codon Bias
- The genetic code is degenerate
- Equivalent triplet codons code for the same amino acid
- Codon usage varies
- Organism to organism (fortunately)
- Gene to gene (unfortunately)
- Can be calculated (http://www.kazusa.or.jp/codon/)
- Biological basis
- Avoidance of codons similar to stop
- Preference for codons that correspond to abundant
tRNAs within the organism
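Codon usage is commonly reported as counts per 1000 codons. A minimal Python sketch of that calculation, assuming the input is a single in-frame coding sequence (the function name and the demo sequence are invented for illustration):

```python
from collections import Counter

def codon_usage_per_thousand(coding_seq):
    """Count in-frame codons and scale the counts to occurrences per 1000 codons."""
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000 * n / total for codon, n in counts.items()}

# Made-up coding sequence: two leucine codons (CTG, TTA) used unequally.
demo = "ATG" + "CTG" * 6 + "TTA" * 2 + "TAA"
usage = codon_usage_per_thousand(demo)
print(usage)
```

In this toy sequence CTG is used three times as often as TTA although both encode leucine, which is the kind of bias a content-based gene finder can exploit.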
23Codon Adaptation Index example
Counts per 1000 codons
24Terminator Stem-loops
25Content-based recognition
- Advantages
- Increases accuracy over rule-based
- Disadvantages
- Features are degenerate
- Features are not always present
26Dealing with degenerate signals
- Use a profile-based method, sometimes called a position-specific scoring matrix (PSSM), built from multiple sequence alignments
A PSSM
27Building a feature profile/PSSM
Multiple Alignment:
A T T T A G T A T C
G T T C T G T A A C
A T T T T G T A G C
A A G C T G T A A C
C A T T - G T A C A

Table of Occurrences:
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0
28Building a feature profile/PSSM
Table of Occurrences:
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0

PSSM with no pseudovalues:
A  .6 .4  0  0 .2  0  0  1 .4 .2
C  .2  0  0 .4  0  0  0  0 .2 .8
G  .2  0 .2  0  0  1  0  0 .2  0
T   0 .6 .8 .6 .6  0  1  0 .2  0
-   0  0  0  0 .2  0  0  0  0  0
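The two tables on this slide can be reproduced from the five aligned sequences with a few lines of Python. This is a sketch only; the function names are chosen for this example.

```python
ALPHABET = "ACGT-"

def count_table(alignment):
    """Count each symbol in every column of an equal-length alignment."""
    length = len(alignment[0])
    return {sym: [sum(seq[col] == sym for seq in alignment) for col in range(length)]
            for sym in ALPHABET}

def frequency_table(counts, n_sequences):
    """Divide the counts by the number of sequences (no pseudovalues)."""
    return {sym: [c / n_sequences for c in row] for sym, row in counts.items()}

alignment = ["ATTTAGTATC", "GTTCTGTAAC", "ATTTTGTAGC", "AAGCTGTAAC", "CATT-GTACA"]
counts = count_table(alignment)
pssm = frequency_table(counts, len(alignment))
for sym in ALPHABET:
    print(sym, counts[sym], pssm[sym])
```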
29Why pseudovalues?
How well a sequence fits a profile is often calculated by multiplying the probabilities together. Consider the following case, when a new sequence (top row) is compared to a profile.
    A  T  T  T  T  G  T  A  C  C
A  .9 .1  0  0 .2  0  0  1 .9 .2
C  .1  0  0 .2  0  0  0  0  0 .8
G   0  0 .2  0  0  1  0  0  0  0
T   0 .9 .8 .8 .8  0  1  0 .1  0
-   0  0  0  0  0  0  0  0  0  0
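Multiplying out this comparison shows the problem. The query matches the profile well at nine positions, but the C at position 9 was never observed there, so its probability is 0 and the whole product collapses to 0. A short Python sketch using the profile values from this slide (the function name is invented for this example):

```python
import math

PROFILE = {
    "A": [.9, .1, 0, 0, .2, 0, 0, 1, .9, .2],
    "C": [.1, 0, 0, .2, 0, 0, 0, 0, 0, .8],
    "G": [0, 0, .2, 0, 0, 1, 0, 0, 0, 0],
    "T": [0, .9, .8, .8, .8, 0, 1, 0, .1, 0],
    "-": [0] * 10,
}

def profile_probability(seq, profile):
    """Multiply the per-position probabilities of seq under the profile."""
    return math.prod(profile[base][i] for i, base in enumerate(seq))

query = "ATTTTGTACC"
per_position = [PROFILE[base][i] for i, base in enumerate(query)]
print(per_position)
print(profile_probability(query, PROFILE))
```

Pseudovalues replace such zeros with small positive numbers, so that one unseen base lowers the score without vetoing the match.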
30Building a feature profile/PSSM
PSSM with no pseudovalues:
A  .6 .4  0  0 .2  0  0  1 .4 .2
C  .2  0  0 .4  0  0  0  0 .2 .8
G  .2  0 .2  0  0  1  0  0 .2  0
T   0 .6 .8 .6 .6  0  1  0 .2  0
-   0  0  0  0 .2  0  0  0  0  0

PSSM with pseudovalues:
A  .58 .38 .09 .09 .24 .04 .09 .79 .38 .24
C  .17 .06 .06 .33 .05 .06 .05 .05 .18 .61
G  .17 .06 .19 .06 .05 .75 .05 .05 .18 .04
T  .05 .51 .65 .51 .65 .09 .79 .09 .24 .09
-  .05 .02 .04 .05 .02 .05 .02 .02 .02 .02
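The slide does not say how its pseudovalues were derived, so the sketch below uses a different, very common scheme as a stand-in: add-one (Laplace) smoothing, where every count is increased by 1 before normalizing. It will not reproduce the numbers on the slide; it only illustrates that every entry becomes positive, which in turn allows scores to be summed in log space.

```python
import math

COUNTS = {  # table of occurrences from the alignment of five sequences
    "A": [3, 2, 0, 0, 1, 0, 0, 5, 2, 1],
    "C": [1, 0, 0, 2, 0, 0, 0, 0, 1, 4],
    "G": [1, 0, 1, 0, 0, 5, 0, 0, 1, 0],
    "T": [0, 3, 4, 3, 3, 0, 5, 0, 1, 0],
    "-": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}
N_SEQUENCES = 5

def smoothed_pssm(counts, n_sequences, pseudocount=1.0):
    """Assumed scheme: (count + pseudocount) / (N + pseudocount * alphabet size)."""
    denom = n_sequences + pseudocount * len(counts)
    return {sym: [(c + pseudocount) / denom for c in row] for sym, row in counts.items()}

def log_score(seq, pssm):
    """Sum of log probabilities; finite because no entry is zero."""
    return sum(math.log(pssm[base][i]) for i, base in enumerate(seq))

pssm = smoothed_pssm(COUNTS, N_SEQUENCES)
print(log_score("TTTTAGTATC", pssm))  # T was never seen at position 1
```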
31Gene finding approaches
- Rule-based (e.g., start and stop codons)
- Content-based (e.g., codon bias, promoter sites)
- Similarity-based (e.g., orthologs)
- Pattern-based (e.g., machine-learning)
- Extrinsic-based (e.g., known proteins, ESTs)
32Similarity-based gene finding
- Take all known genes from a related genome and compare them to the query genome via BLAST
- Advantages
- Predictions are made based upon confirmed genes
33Similarity-based gene finding
- Take all known genes from a related genome and compare them to the query genome via BLAST
- Disadvantages
- Orthologs/paralogs sometimes lose function and become pseudogenes
- Not all genes will always be known in the comparison genome (big circularity problem)
- The best species for comparison isn't always obvious
- Summary: similarity comparisons are good supporting evidence for prediction validity
34Gene finding approaches
- Rule-based (e.g., start and stop codons)
- Content-based (e.g., codon bias, promoter sites)
- Similarity-based (e.g., orthologs)
- Extrinsic-based (e.g., known proteins, ESTs)
- Pattern-based (e.g., machine-learning)
35Why not just use extrinsic evidence?
- Proteins and ESTs (mRNAs) are expressed under specific circumstances
- ESTs are often too short to determine complete gene structure
- In eukaryotes, many are only expressed sometime during development
- In prokaryotes, some are only expressed when certain conditions are met (e.g., environmental)
36Gene finding approaches
- Rule-based (e.g., start and stop codons)
- Content-based (e.g., codon bias, promoter sites)
- Similarity-based (e.g., orthologs)
- Extrinsic-based (e.g., known proteins, ESTs)
- Pattern-based (e.g., machine-learning)
37Markov Models
- Begin with a set of states
- The transition from any state to any other state, including itself, is probabilistic
- The probability of moving from one state to another depends only upon the current state
- Can be created from multiple sequence alignments (e.g., for feature recognition)
38A Markov Model of DNA mutations
State Transition Matrix
39Nth order Markov Models
- What is the probability of observing GTCACT in a region?
- 1st order
- P(GTCACT) = P(G→T) P(T→C) P(C→A) P(A→C) P(C→T)
- 2nd order
- P(GTCACT) = P(GT→CA) P(CA→CT)
- 3rd order
- P(GTCACT) = P(GTC→ACT)
- Etc.
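For the 1st-order case, the product of transition probabilities can be sketched in Python as below. The transition values are hypothetical (the slides do not give any), and, like the slide, the sketch leaves out the probability of the first base, which a full Markov chain would also include.

```python
# Hypothetical transition probabilities P(next | previous); each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
}

def first_order_probability(seq, transitions):
    """Multiply P(previous -> next) over every pair of adjacent bases."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= transitions[prev][nxt]
    return p

p = first_order_probability("GTCACT", TRANSITIONS)
print(p)  # 0.3 * 0.4 * 0.3 * 0.2 * 0.3 with these made-up values
```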
40Transition probabilities are compared to another
model
- CTAGCGACGGCTCAGCGGTGCTACGCGC
- Gene sequence
- GTATGCGCGATCGATCGCGACCGATCGT
- Random
- TACACTATAGTACGACTATCAATACTCA
- Intergenic sequence
41Markov Models in gene prediction
- Judge how likely it is that a given sequence of bases belongs to one class of DNA vs. another
- Codon vs. intergenic: 3rd-order MM
- Intron vs. exon (eukaryotes): 3rd-order MM
- Binding site vs. other: nth-order MM
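One common way to make such a comparison is a log-odds score: train one Markov model per class and sum log(P_class1 / P_class2) over the transitions in the query. The Python sketch below does this with 1st-order models to stay short (the slide suggests 3rd order), uses add-one smoothing as an assumption, and trains on the single example sequences from the previous slide, which is far too little data for real use.

```python
import math

BASES = "ACGT"

def train_first_order(sequences, pseudocount=1.0):
    """Estimate P(next | previous) from adjacent-base counts with smoothing."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def log_odds(seq, model_a, model_b):
    """Sum of log(P_a / P_b) over transitions; positive values favor model_a."""
    return sum(math.log(model_a[p][n] / model_b[p][n]) for p, n in zip(seq, seq[1:]))

GENE = "CTAGCGACGGCTCAGCGGTGCTACGCGC"        # "gene sequence" example
INTERGENIC = "TACACTATAGTACGACTATCAATACTCA"  # "intergenic sequence" example
gene_model = train_first_order([GENE])
intergenic_model = train_first_order([INTERGENIC])

print(log_odds(GENE, gene_model, intergenic_model))
print(log_odds(INTERGENIC, gene_model, intergenic_model))
```

A positive score suggests the query looks more like the gene training data, a negative score more like the intergenic training data.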
42Question
- Q: MMs work well when we know what kind of comparison to make, but what if we don't know anything about the sequence we're analyzing (e.g., when a state transition occurs)?
43Question
- Q: MMs work well when we know what kind of comparison to make, but what if we don't know anything about the sequence we're analyzing (e.g., when a state transition occurs)?
- A: We have to look for a model that best fits our observations. We assume that the real states are hidden from our view.
44Markov Chains
[Figure: state diagram with three states: Rain, Sunny, Cloudy.]
State transition matrix: the probability of the weather given the previous day's weather.
States: three states (sunny, cloudy, rainy).
Initial distribution: the probability of the system being in each of the states at time 0.
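These three components are enough to compute the weather distribution on any later day. A minimal Python sketch follows; all numbers are hypothetical, since the slide's transition matrix is not reproduced in the text.

```python
STATES = ["sunny", "cloudy", "rainy"]

# Hypothetical values: row = today's weather, column = tomorrow's weather.
TRANSITION = [
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.4, 0.4],
]
INITIAL = [1.0, 0.0, 0.0]  # assume day 0 is certainly sunny

def step(distribution, transition):
    """Advance the state distribution by one day."""
    n = len(distribution)
    return [sum(distribution[i] * transition[i][j] for i in range(n))
            for j in range(n)]

def distribution_after(days, initial, transition):
    """Apply the transition matrix once per day, starting from the initial distribution."""
    dist = list(initial)
    for _ in range(days):
        dist = step(dist, transition)
    return dist

day2 = distribution_after(2, INITIAL, TRANSITION)
print(dict(zip(STATES, day2)))
```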
45Hidden Markov Models
Hidden states: the true states of a system that may be described by a Markov process (e.g., the weather).
Observable states: the states of the process that are "visible" to an observer (e.g., damp grass).
46Components of an HMM
Grass
Output matrix: contains the probability of observing a particular observable state given that the hidden model is in a particular hidden state.
Initial distribution: contains the probability of the (hidden) model being in a particular hidden state at time t = 1.
State transition matrix: holds the probability of a hidden state given the previous hidden state.
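Given these three components, the Viterbi algorithm is one standard way to recover the most probable sequence of hidden states from a series of observations. The Python sketch below applies it to the weather and grass example; every probability in it is hypothetical, and "dry"/"damp" are assumed observation labels.

```python
HIDDEN = ["sunny", "cloudy", "rainy"]
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
TRANSITION = {  # P(today's hidden state | yesterday's hidden state)
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.4, "rainy": 0.4},
}
OUTPUT = {  # P(observed grass state | hidden weather state)
    "sunny":  {"dry": 0.9, "damp": 0.1},
    "cloudy": {"dry": 0.6, "damp": 0.4},
    "rainy":  {"dry": 0.1, "damp": 0.9},
}

def viterbi(observations):
    """Return (probability, path) of the most probable hidden-state path."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (INITIAL[s] * OUTPUT[s][observations[0]], [s]) for s in HIDDEN}
    for obs in observations[1:]:
        new_best = {}
        for s in HIDDEN:
            prob, path = max(
                ((best[r][0] * TRANSITION[r][s], best[r][1]) for r in HIDDEN),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * OUTPUT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

prob, path = viterbi(["dry", "damp", "damp"])
print(path, prob)
```

In gene finding the hidden states would be labels such as coding and non-coding, and the observations would be the bases themselves.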
47Applied to gene finding (color state)
48Some HMMs can get complex
RBS site
promoter site
49Markov Model caveat
- Only works when each base is not linked to any other base in the sequence. For example:
[Figure: variants of the sequence GACCCTC in which pairs of positions vary together; two combinations are labeled "OK" and one "non-functional".]
50Prokaryotic gene prediction software using the
methods discussed
- GLIMMER
- http://cbcb.umd.edu/software/glimmer/
- Uses interpolated Markov models (IMMs)
- Requires training on sample genes
- Takes about 1 minute/genome
- GeneMark
- http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
- Available as a web server
- Uses Hidden Markov Models
51Glimmer Performance
52Bottom Line...
- Gene finding in prokaryotes is pretty much a solved problem
- Accuracy of the best methods approaches 99%
- Gene predictions should always be compared against extrinsic evidence (proteins, ESTs) and similarity to other genes (BLAST) to ensure accuracy and to catch possible sequencing errors
53For next time