Bioinformatics: Applications

Provided by: jonath76

1
Bioinformatics Applications

ZOO 4903
Fall 2006, MW 10:30-11:45
Sutton Hall, Room 312
Basics of gene finding - Prokaryotes

2
First

Short discussion and feedback on Exam 1
Class report instructions to be handed out

3
Lecture overview

What we've talked about so far
DNA is the blueprint for all organisms
Overview
RNA, not DNA, is the marker of cellular activity
and changes
Gene finding in prokaryotes

4
DNA guides the transcription of RNA in the nucleus
5
Gene number generally increases with phylogenetic
complexity
6
Genes & genome complexity

There is almost no correlation between the amount
of DNA in a species and its evolutionary
complexity (C-value paradox).
There is a correlation between the amount of
non-protein coding regions and complexity.

7
Gene finding approaches

Rule-based (e.g., start & stop codons)
Content-based (e.g., codon bias, promoter sites)
Similarity-based (e.g., orthologs)
Extrinsic-based (e.g., known proteins, ESTs)
Pattern-based (e.g., machine-learning)

8
Prokaryotes

Advantages
Simple gene structure
Small genomes (0.5 to 10 million bp)
No introns
Genes are called Open Reading Frames (ORFs)
High coding density (>90%)
Disadvantages
Some genes overlap (nested)
Some genes are quite short (<60 bp)

9
Gene structure comparisons
10
Prokaryotic gene structure
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
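
A minimal sketch of how the three forward reading frames partition the example sequence on this slide. The function name codons_in_frame is illustrative and not part of the lecture.

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons for forward frame 1, 2 or 3."""
    start = frame - 1
    # Stop early enough that every slice is a full three-base codon.
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


seq = "ATGACAGATTACAGATTACAGATTACAGGATAG"
for frame in (1, 2, 3):
    print(f"Frame {frame}:", " ".join(codons_in_frame(seq, frame)))
```

Only frame 1 reads this sequence as an ORF: it opens with the start codon ATG and closes with the stop codon TAG.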
11
Prokaryotes stack multiple genes together for
expression (operons)
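[Figure: operon layout. A promoter is followed by Gene 1, Gene 2, ... Gene N and a terminator. RNA polymerase transcribes the whole operon into one mRNA (5' to 3'), which is translated into separate polypeptides, each with its own N- and C-terminus.]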
12
Prokaryotic genomes
E. coli
13
Simple rule-based gene finding in prokaryotes,
based on ORFs

Look for putative start codon (ATG)
Staying in same frame, scan in groups of three
until a stop codon is found
If # of codons > 50, assume it's a gene
If # of codons < 50, go back to last start codon,
increment by 1 & start again
At end of chromosome, repeat process for reverse
complement
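
The steps on this slide can be sketched roughly as follows. This is an illustration, not the lecture's own code. It assumes an uppercase A/C/G/T string, counts the start and stop codons toward the codon total, and resumes scanning after an accepted ORF (the slide does not say where to resume). All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs_one_strand(seq, min_codons=50):
    """Return (start, end) index pairs of ORFs longer than min_codons."""
    orfs = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)      # putative start codon
        if start == -1:
            break
        end = None
        # Stay in the same frame and scan in groups of three.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                end = i + 3
                break
        if end is not None and (end - start) // 3 > min_codons:
            orfs.append((start, end))
            pos = end                     # assumption: resume after the ORF
        else:
            pos = start + 1               # go back, increment by 1, retry
    return orfs


def find_orfs(seq, min_codons=50):
    """Scan the forward strand, then the reverse complement."""
    hits = [("+", s, e) for s, e in find_orfs_one_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    # Reverse-strand coordinates refer to the reverse-complemented string.
    hits += [("-", s, e) for s, e in find_orfs_one_strand(rc, min_codons)]
    return hits
```

Because the scan resumes after each accepted ORF, this sketch cannot report overlapping (nested) genes on the same strand.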

14
Example ORF
15
ORF Finding Tools

NCBI - http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Diogenes - http://www.cbc.umn.edu/diogenes/diogenes.html
Frameplot - http://www.nih.go.jp/jun/cgi-bin/frameplot.pl

16
Problems with rule-based approaches

Advantages
Simple and fairly sensitive (>50%)
Disadvantages
Prokaryotic genes are not always so simple to
find
ATG is not the only possible start site (e.g.,
CTG, TTG: class I alternates)
Small genes tend to be overlooked and long ones
over-predicted
Solution? Use additional information to increase
confidence in predictions

17
Gene finding approaches

Rule-based (e.g., start & stop codons)
Content-based (e.g., codon bias, promoter sites)
Similarity-based (e.g., orthologs)
Extrinsic-based (e.g., known proteins, ESTs)
Pattern-based (e.g., machine-learning)

18
Key prokaryotic gene features

RNA polymerase promoter site (-10, -35 site or
TATA box)
Shine-Dalgarno sequence (+10, Ribosome Binding
Site) to initiate protein translation
Codon biases
High GC content
Stem-loop (rho-independent) terminators

19
Promoter structure in prokaryotes (E. coli)

Transcription starts at offset 0.
Pribnow Box (-10)
Gilbert Box (-35)
Ribosomal Binding Site (+10)

20
RNAP binds a region of DNA from -40 to +20
The sequence of the non-template strand is shown
-35 region: TTGACA ... 16-19 bp spacer ... -10 region: TATAAT
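
A minimal sketch of searching for an exact match to the consensus on this slide (TTGACA, a 16-19 bp spacer, then TATAAT). The pattern and function names are illustrative.

```python
import re

# -35 hexamer, 16-19 bp spacer, -10 hexamer (consensus only).
# The lookahead lets overlapping candidates be reported as well.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{16,19}TATAAT))")


def find_consensus_promoters(seq):
    """Return (position, matched text) for each exact consensus match."""
    return [(m.start(), m.group(1)) for m in PROMOTER.finditer(seq)]


example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_consensus_promoters(example))
```

Real promoters rarely match the consensus exactly, so an exact pattern like this misses most true sites.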
21
Example lexA Gene

Three potential binding sites for the lexA
product to the promoter region
Promoter sites (-10, -35) for interaction with
the RNA polymerase
Ribosomal binding site on the mRNA product
complementary to ribosomal RNA
Open reading frame devoid of introns

22
Codon Bias

The genetic code is degenerate
Equivalent triplet codons code for the same amino
acid
Codon usage varies
Organism to organism (fortunately)
Gene to gene (unfortunately)
Can be calculated (http://www.kazusa.or.jp/codon/)
Biological basis
Avoidance of codons similar to stop
Preference for codons that correspond to abundant
tRNAs within the organism

23
Codon Adaptation Index example
Counts per 1000 codons
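
A small sketch of how "counts per 1000 codons" could be tabulated from a coding sequence. It computes usage frequencies only, not the Codon Adaptation Index itself. The function name and the toy sequence are made up for illustration.

```python
from collections import Counter


def codon_usage_per_thousand(cds):
    """Return codon frequencies per 1000 codons for an in-frame sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000 * n / total for codon, n in counts.items()}


usage = codon_usage_per_thousand("ATGGCTGCTGCCTAA")
for codon, per_thousand in sorted(usage.items()):
    print(codon, per_thousand)
```

In practice the counts would be pooled over many known genes from one organism before comparing a candidate ORF against them.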
24
Terminator Stem-loops
25
Content-based recognition

Advantages
Increases accuracy over rule-based
Disadvantages
Features are degenerate
Features are not always present

26
Dealing with degenerate signals

Use a profile-based method, sometimes called a
position specific scoring matrix (PSSM) built
from multiple sequence alignments

A PSSM
27
Building a feature profile/PSSM
Multiple Alignment
A T T T A G T A T C
G T T C T G T A A C
A T T T T G T A G C
A A G C T G T A A C
C A T T - G T A C A

Table of Occurrences
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0
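
A minimal sketch of how the table of occurrences is derived from the alignment on this slide. The function name count_table is illustrative.

```python
ALPHABET = "ACGT-"

# The five aligned sequences from the slide, ten columns each.
ALIGNMENT = [
    "ATTTAGTATC",
    "GTTCTGTAAC",
    "ATTTTGTAGC",
    "AAGCTGTAAC",
    "CATT-GTACA",
]


def count_table(alignment, alphabet=ALPHABET):
    """Count how often each symbol occurs in each alignment column."""
    length = len(alignment[0])
    return {
        symbol: [sum(row[i] == symbol for row in alignment)
                 for i in range(length)]
        for symbol in alphabet
    }


counts = count_table(ALIGNMENT)
for symbol in ALPHABET:
    print(symbol, *counts[symbol])
```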
28
Building a feature profile/PSSM
Table of Occurrences
A  3 2 0 0 1 0 0 5 2 1
C  1 0 0 2 0 0 0 0 1 4
G  1 0 1 0 0 5 0 0 1 0
T  0 3 4 3 3 0 5 0 1 0
-  0 0 0 0 1 0 0 0 0 0

PSSM with no pseudovalues
A  .6 .4  0  0 .2  0  0  1 .4 .2
C  .2  0  0 .4  0  0  0  0 .2 .8
G  .2  0 .2  0  0  1  0  0 .2  0
T   0 .6 .8 .6 .6  0  1  0 .2  0
-   0  0  0  0 .2  0  0  0  0  0
29
Why pseudovalues?
How well a sequence fits a profile is often
calculated by multiplying the probabilities
together. Consider the following case when a new
sequence (top row) is compared to a profile.
    A  T  T  T  T  G  T  A  C  C
A  .9 .1  0  0 .2  0  0  1 .9 .2
C  .1  0  0 .2  0  0  0  0  0 .8
G   0  0 .2  0  0  1  0  0  0  0
T   0 .9 .8 .8 .8  0  1  0 .1  0
-   0  0  0  0  0  0  0  0  0  0
The C at position 9 has probability 0 in the profile, so the product for the whole sequence is 0.
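
A short sketch of the multiplication described on this slide, using the profile values shown. The function name profile_probability is illustrative.

```python
from math import prod

# Profile from the slide: one list of ten column probabilities per symbol.
PROFILE = {
    "A": [.9, .1, 0, 0, .2, 0, 0, 1, .9, .2],
    "C": [.1, 0, 0, .2, 0, 0, 0, 0, 0, .8],
    "G": [0, 0, .2, 0, 0, 1, 0, 0, 0, 0],
    "T": [0, .9, .8, .8, .8, 0, 1, 0, .1, 0],
    "-": [0] * 10,
}


def profile_probability(seq, profile):
    """Multiply the per-column probabilities of each base in seq."""
    return prod(profile[base][i] for i, base in enumerate(seq))


print(profile_probability("ATTTTGTACC", PROFILE))  # the new sequence
print(profile_probability("ATTTTGTAAC", PROFILE))  # same, but A at position 9
```

A single unseen base zeroes the score of an otherwise well-matching sequence, which is the motivation for pseudovalues.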
30
Building a feature profile/PSSM
PSSM with no pseudovalues
A  .6 .4  0  0 .2  0  0  1 .4 .2
C  .2  0  0 .4  0  0  0  0 .2 .8
G  .2  0 .2  0  0  1  0  0 .2  0
T   0 .6 .8 .6 .6  0  1  0 .2  0
-   0  0  0  0 .2  0  0  0  0  0

PSSM with Pseudovalues
A  .58 .38 .09 .09 .24 .04 .09 .79 .38 .24
C  .17 .06 .06 .33 .05 .06 .05 .05 .18 .61
G  .17 .06 .19 .06 .05 .75 .05 .05 .18 .04
T  .05 .51 .65 .51 .65 .09 .79 .09 .24 .09
-  .05 .02 .04 .05 .02 .05 .02 .02 .02 .02
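
A sketch of turning the table of occurrences into frequencies, with an optional pseudocount. With a pseudocount of 0 it gives the "no pseudovalues" table. The slide does not state which pseudovalue scheme it used, so the add-one scheme below is an assumption and does not reproduce the slide's numbers.

```python
# Table of occurrences from the earlier slide.
COUNTS = {
    "A": [3, 2, 0, 0, 1, 0, 0, 5, 2, 1],
    "C": [1, 0, 0, 2, 0, 0, 0, 0, 1, 4],
    "G": [1, 0, 1, 0, 0, 5, 0, 0, 1, 0],
    "T": [0, 3, 4, 3, 3, 0, 5, 0, 1, 0],
    "-": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}


def frequencies(counts, pseudocount=0.0):
    """Convert column counts to frequencies, adding pseudocount per cell."""
    symbols = list(counts)
    length = len(counts[symbols[0]])
    table = {s: [] for s in symbols}
    for i in range(length):
        total = sum(counts[s][i] for s in symbols) + pseudocount * len(symbols)
        for s in symbols:
            table[s].append((counts[s][i] + pseudocount) / total)
    return table


plain = frequencies(COUNTS)                    # no pseudovalues
smoothed = frequencies(COUNTS, pseudocount=1)  # assumed add-one scheme
print(plain["A"])
print([round(p, 2) for p in smoothed["A"]])
```

With any positive pseudocount no cell is zero, so one unseen base can no longer force a sequence's score to zero.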
31
Gene finding approaches

Rule-based (e.g., start & stop codons)
Content-based (e.g., codon bias, promoter sites)
Similarity-based (e.g., orthologs)
Pattern-based (e.g., machine-learning)
Extrinsic-based (e.g., known proteins, ESTs)

32
Similarity-based gene finding

Take all known genes from a related genome and
compare them to the query genome via BLAST
Advantages
Predictions are made based upon confirmed genes

33
Similarity-based gene finding

Take all known genes from a related genome and
compare them to the query genome via BLAST
Disadvantages
Orthologs/paralogs sometimes lose function and
become pseudogenes
Not all genes will always be known in the
comparison genome (big circularity problem)
The best species for comparison isn't always
obvious
Summary: Similarity comparisons are good
supporting evidence for prediction validity

34
Gene finding approaches

Rule-based (e.g., start & stop codons)
Content-based (e.g., codon bias, promoter sites)
Similarity-based (e.g., orthologs)
Extrinsic-based (e.g., known proteins, ESTs)
Pattern-based (e.g., machine-learning)

35
Why not just use extrinsic evidence?

Proteins and ESTs (mRNAs) are expressed under
specific circumstances
ESTs are often too short to determine complete
gene structure
In eukaryotes, many are only expressed sometime
during development
In prokaryotes, some are only expressed when
certain conditions are met (e.g. environmental)

36
Gene finding approaches

Rule-based (e.g., start & stop codons)
Content-based (e.g., codon bias, promoter sites)
Similarity-based (e.g., orthologs)
Extrinsic-based (e.g., known proteins, ESTs)
Pattern-based (e.g., machine-learning)

37
Markov Models

Begin with a set of states
The transition from any state to any other state,
including itself, is probabilistic
The odds of moving from one state to another
depend only upon the current state
Can be created from multiple sequence alignment
(e.g., for feature recognition)
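
A minimal sketch of estimating first-order transition probabilities from example sequences. It assumes uppercase sequences containing only A, C, G and T. The function name and the toy input are illustrative.

```python
def transition_probabilities(sequences, alphabet="ACGT"):
    """Estimate P(next base | current base) from observed adjacent pairs."""
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        # A base never seen as "current" gets an all-zero row.
        probs[a] = {b: (counts[a][b] / total if total else 0.0)
                    for b in alphabet}
    return probs


model = transition_probabilities(["AACG"])
print(model["A"])
```

Each row depends only on the current base, which is the Markov property stated above.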

38
A Markov Model of DNA mutations
State Transition Matrix
39
Nth order Markov Models

What is the probability of observing GTCACT in a
region?
1st order
P(GTCACT) = P(G→T) P(T→C) P(C→A) P(A→C) P(C→T)
2nd order
P(GTCACT) = P(GT→CA) P(CA→CT)
3rd order
P(GTCACT) = P(GTC→ACT)
Etc
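
A small sketch of the first-order calculation for GTCACT. The uniform transition table is a placeholder assumption, not data from the lecture, and the initial-state term is omitted as on the slide.

```python
from math import prod


def first_order_probability(seq, transitions):
    """Multiply the transition probabilities along consecutive bases."""
    return prod(transitions[a][b] for a, b in zip(seq, seq[1:]))


# Hypothetical model: every transition equally likely.
UNIFORM = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}

p = first_order_probability("GTCACT", UNIFORM)
print(p)  # five transitions, so 0.25 ** 5
```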

40
Transition probabilities are compared to another
model

Gene sequence: CTAGCGACGGCTCAGCGGTGCTACGCGC
Random: GTATGCGCGATCGATCGCGACCGATCGT
Intergenic sequence: TACACTATAGTACGACTATCAATACTCA

41
Markov Models in gene prediction

Judge how likely it is that a given sequence of
bases belongs to one class of DNA vs. another
Codon vs. intergenic: 3rd order MM
Intron vs. exon (eukaryotes): 3rd order MM
Binding site vs. other: nth order MM
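
A sketch of comparing two models with a log-odds score. For brevity it uses first-order models rather than the 3rd order ones named above, and both transition tables are invented for illustration. All probabilities must be non-zero, because log(0) raises an error.

```python
from math import log


def log_likelihood(seq, transitions):
    """Sum of log transition probabilities along seq."""
    return sum(log(transitions[a][b]) for a, b in zip(seq, seq[1:]))


def log_odds(seq, model_a, model_b):
    """Positive values favor model_a, negative values favor model_b."""
    return log_likelihood(seq, model_a) - log_likelihood(seq, model_b)


# Hypothetical models: one favors moving to G/C, the other to A/T.
GC_RICH = {a: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15} for a in "ACGT"}
AT_RICH = {a: {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35} for a in "ACGT"}

print(log_odds("GCGCGC", GC_RICH, AT_RICH))
print(log_odds("ATATAT", GC_RICH, AT_RICH))
```

Summing logs instead of multiplying probabilities avoids numerical underflow on long sequences.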

42
Question

Q: MMs work well when we know what kind of
comparison to make, but what if we don't know
anything about the sequence we're analyzing (e.g.,
when a state transition occurs)?

43
Question

Q: MMs work well when we know what kind of
comparison to make, but what if we don't know
anything about the sequence we're analyzing (e.g.,
when a state transition occurs)?
A: We have to look for a model that best fits our
observations. We assume that the real states are
hidden from our view.

44
Markov Chains
Rain
Sunny
Cloudy
State transition matrix: The probability of the
weather given the previous day's weather.
States: Three states - sunny, cloudy, rainy.
Initial Distribution: Defining the probability
of the system being in each of the states at time
0.
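
A minimal sketch of the three components on this slide, advancing the weather distribution by one day. The transition values and the initial distribution are hypothetical numbers chosen for illustration.

```python
# Hypothetical state transition matrix; each row sums to 1.
TRANSITIONS = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}

# Hypothetical initial distribution: certainly sunny at time 0.
INITIAL = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}


def step(distribution, transitions):
    """Return the state distribution one time step later."""
    return {
        nxt: sum(distribution[cur] * transitions[cur][nxt]
                 for cur in distribution)
        for nxt in transitions
    }


day1 = step(INITIAL, TRANSITIONS)
day2 = step(day1, TRANSITIONS)
print(day1)
print(day2)
```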
45
Hidden Markov Models
Hidden states: The true states of a system that
may be described by a Markov process (e.g., the
weather).
Observable states: The states of the
process that are 'visible' to an observer (e.g.,
damp grass).
46
Components of an HMM
Output matrix: contains the probability of
observing a particular observable state given
that the hidden model is in a particular hidden
state.
Initial Distribution: contains the
probability of the (hidden) model being in a
particular hidden state at time t = 1.
State transition matrix: holds the probability of a
hidden state given the previous hidden state.
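
A sketch of how the three components fit together to recover the most likely hidden weather from grass observations. It uses the Viterbi algorithm, which the slides do not name. Every probability below is a made-up illustration value.

```python
STATES = ("sunny", "cloudy", "rainy")
INITIAL = {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.25}
TRANSITIONS = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}
# Output matrix: P(observed grass state | hidden weather state).
OUTPUT = {
    "sunny":  {"dry": 0.9, "damp": 0.1},
    "cloudy": {"dry": 0.6, "damp": 0.4},
    "rainy":  {"dry": 0.1, "damp": 0.9},
}


def viterbi(observations, states, initial, transitions, output):
    """Return the most probable hidden-state path for the observations."""
    # Each column maps state -> (best path probability, previous state).
    best = [{s: (initial[s] * output[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * transitions[p][s], p)
                             for p in states)
            column[s] = (prob * output[s][obs], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


print(viterbi(["dry", "damp", "damp"], STATES, INITIAL, TRANSITIONS, OUTPUT))
```

With these numbers the dry day is best explained by sun and the two damp days by rain. In gene finding the hidden states would be labels such as coding and non-coding, and the observations would be the bases.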
47
Applied to gene finding (color state)
48
Some HMMs can get complex
RBS site
promoter site
49
Markov Model caveat

Only works when each base pair is not linked to
any other in the sequence. For example:

[Figure: the sequence GACCCTC drawn as base-paired structures; the variants shown carry the labels "OK", "non-functional" and "OK".]
50
Prokaryotic gene prediction software using the
methods discussed

GLIMMER
http://cbcb.umd.edu/software/glimmer/
Uses interpolated Markov models (IMMs)
Requires training of sample genes
Takes about 1 minute/genome
GeneMark
http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Available as a web server
Uses Hidden Markov Models

51
Glimmer Performance
52
Bottom Line...

Gene finding in prokaryotes is pretty much a
solved problem
Accuracy of the best methods approaches 99%
Gene predictions should always be compared
against extrinsic evidence (protein, ESTs) and
similarity to other genes (BLAST) to ensure
accuracy and to catch possible sequencing errors

53
For next time