Title: BCB 444/544
1 BCB 444/544
- Lecture 28
- Gene Prediction - finish it
- Promoter Prediction
- 28_Oct29
2 Required Reading (before lecture)
- Mon Oct 29 - Lecture 28
- Promoter & Regulatory Element Prediction
- Chp 9 - pp 113 - 126
- Wed Oct 31 - Lecture 29
- Phylogenetics Basics
- Chp 10 - pp 127 - 141
- Thurs Nov 1 - Lab 9
- Gene & Regulatory Element Prediction
- Fri Nov 2 - Lecture 30
- Phylogenetic Tree Construction Methods & Programs
- Chp 11 - pp 142 - 169
3 Assignments & Announcements
- Mon Oct 29 - HW5 - will be posted today
- HW5: Hands-on exercises with phylogenetics and tree-building software
- Due Mon Nov 5 (not Fri Nov 1 as previously posted)
4 BCB 544 "Team" Projects
- Last week of classes will be devoted to Projects
- Written reports due:
- Mon Dec 3 (no class that day)
- Oral presentations (20-30') will be:
- Wed-Fri Dec 5, 6, 7
- 1 or 2 teams will present during each class period
- See Guidelines for Projects posted online
5 BCB 544 Only: New Homework Assignment
- 544 Extra2
- Due: Part 1 - ASAP
- Part 2 - meeting prior to 5 PM Fri Nov 2
- Part 1 - Brief outline of Project, email to Drena & Michael
- after response/approval, then:
- Part 2 - More detailed outline of project
- Read a few papers and summarize status of problem
- Schedule meeting with Drena & Michael to discuss ideas
6 Seminars this Week
- BCB List of URLs for Seminars related to Bioinformatics
- http://www.bcb.iastate.edu/seminars/index.html
- Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
- Todd Yeates, UCLA: TBA - something cool about structure and evolution?
- Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
- Bob Jernigan, BBMB, ISU
- Control of Protein Motions by Structure
7 Chp 8 - Gene Prediction
- SECTION III GENE AND PROMOTER PREDICTION
- Xiong Chp 8 Gene Prediction
- Categories of Gene Prediction Programs
- Gene Prediction in Prokaryotes
- Gene Prediction in Eukaryotes
8 Computational Gene Prediction: Approaches
- Ab initio methods
- Search by signal: find DNA sequences involved in gene expression
- Search by content: test statistical properties distinguishing coding from non-coding DNA
- Similarity-based methods
- Database search: exploit similarity to proteins, ESTs, cDNAs
- Comparative genomics: exploit aligned genomes
- Do other organisms have similar sequence?
- Hybrid methods - best
9 Computational Gene Prediction: Algorithms
- Neural Networks (NNs) (more on these later)
- e.g., GRAIL
- Linear discriminant analysis (LDA) (see text)
- e.g., FGENES, MZEF
- Markov Models (MMs) & Hidden Markov Models (HMMs)
- e.g., GeneSeqer - uses MMs
- GENSCAN - uses 5th order HMMs (see text)
- HMMgene - uses conditional maximum likelihood (see text)
10 Signals Search
- Approach: Build models (PSSMs, profiles, HMMs, ...) and search against DNA. Detected instances provide evidence for genes
11 Content Search
- Observation: Encoding a protein affects statistical properties of DNA sequence
- Nucleotide/amino acid distribution
- GC content (CpG islands, exon/intron)
- Uneven usage of synonymous codons (codon bias)
- Hexamer frequency - most discriminative of these for identifying coding potential
- Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions
12 Human Codon Usage
13 Predicting Genes based on Codon Usage Differences
- Algorithm
- Process sliding window
- Use codon frequencies to compute probability of coding versus non-coding
- Plot log-likelihood ratio
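A minimal sketch of the sliding-window idea, assuming a uniform non-coding model (every codon at 1/64) and a made-up coding table. The names `codon_llr`, `sliding_llr` and `TOY_CODING_FREQ` are hypothetical, and the frequencies are NOT real human codon usage.

```python
import math

UNIFORM = 1.0 / 64  # assumed non-coding model: all 64 codons equally likely

def codon_llr(window, coding_freq):
    """Log-likelihood ratio (coding vs non-coding) summed over in-frame codons."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        # Codons missing from the toy table fall back to the uniform frequency.
        score += math.log(coding_freq.get(codon, UNIFORM) / UNIFORM)
    return score

def sliding_llr(seq, coding_freq, window=30, step=3):
    """Return (start, score) for each window along seq."""
    return [(start, codon_llr(seq[start:start + window], coding_freq))
            for start in range(0, len(seq) - window + 1, step)]

# Hypothetical codon frequencies for illustration only.
TOY_CODING_FREQ = {"GCC": 4 / 64, "CTG": 4 / 64}

for start, score in sliding_llr("GCCCTGGCCTTTTTTTTT", TOY_CODING_FREQ, window=9):
    print(start, round(score, 2))
```

Plotting score against window start gives the log-likelihood ratio curve; positive stretches would suggest coding potential in that reading frame. A real analysis would repeat this for every frame and use measured codon frequencies.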
14 Similarity-Based Methods: Database Search
- In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
- Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
- Problems
- Will not find new or RNA genes (non-coding genes)
- Limits of similarity are hard to define
- Small exons might be overlooked
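To make "all 6 reading frames" concrete, here is a small sketch using the standard genetic code. It only illustrates the translation step (search tools such as BLASTX do this internally); the function names are hypothetical, and codons containing anything other than A/C/G/T are shown as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return the 3 forward (+1..+3) and 3 reverse (-1..-3) translations."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

print(six_frames("ATGGCCTAA"))
```

Each of the six protein strings would then be compared against a protein database; "*" marks a stop codon.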
15 Similarity-Based Methods: Comparative Genomics
- Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
- Advantages
- May find uncharacterized or RNA genes
- Problems
- Finding suitable evolutionary distance
- Finding limits of high similarity (functional regions)
16 Human-Mouse Homology
- Comparison of 1196 orthologous genes
- Sequence identity between genes in human vs mouse
- Exons: 84.6%
- Protein: 85.4%
- Introns: 35%
- 5' UTRs: 67%
- 3' UTRs: 69%
17 Thanks to Volker Brendel, ISU for the following Figs & Slides
- Slightly modified from:
- BBSI Genome Informatics Module
- http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
- V Brendel vbrendel_at_iastate.edu
Brendel et al (2004) Bioinformatics 20: 1157
18 Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
- Perform pairwise alignment with large gaps in one sequence (due to introns)
- Align genomic DNA with cDNA, ESTs, protein sequences
- Score semi-conserved sequences at splice junctions
- Using Bayesian probability model & 1st order MM
- Score coding constraints in translated exons
- Using Bayesian model
Brendel 2005
19 Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal?
YES
i = ith position in sequence
I_avg = avg information content over all positions >20 nt from splice site
σ(I_avg) = sample standard deviation of I_avg
Brendel 2005
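A sketch of how per-position information content could be computed from aligned splice-site sequences. It assumes the usual Shannon form for DNA, I_i = 2 + Σ f_ib log2 f_ib (the slide does not spell out the formula), and the toy alignment below is invented. `background_stats` follows the legend above: mean and sample standard deviation over positions more than 20 nt from the site.

```python
import math
import statistics
from collections import Counter

def information_content(column):
    """Information (bits) at one alignment column: 2 + sum(f * log2 f)."""
    n = len(column)
    return 2.0 + sum((c / n) * math.log2(c / n)
                     for c in Counter(column).values())

def profile(aligned_sites):
    """Information content at every position of equal-length aligned sequences."""
    return [information_content(col) for col in zip(*aligned_sites)]

def background_stats(values, site_index, min_distance=20):
    """Mean and sample std dev over positions > min_distance from the splice site."""
    far = [v for i, v in enumerate(values) if abs(i - site_index) > min_distance]
    return statistics.mean(far), statistics.stdev(far)

# Invented toy alignment: position 0 is random, positions 1-3 are invariant.
toy_sites = ["AGGT", "CGGT", "TGGT", "GGGT"]
print(profile(toy_sites))
```

Positions whose information content stands well above the background mean are candidates for contributing to the splicing signal; the cutoff to use is left open here.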
20 Information Content vs Position
Which sequences are exons & which are introns?
How can you tell?
Brendel 2005
21 Markov Model for Spliced Alignment
Brendel 2005
22 Evaluation of Splice Site Prediction
TP = positive instance correctly predicted as positive
FP = negative instance incorrectly predicted as positive
TN = negative instance correctly predicted as negative
FN = positive instance incorrectly predicted as negative
Right!
Fig 5.11 Baxevanis & Ouellette 2005
23 Evaluation of Predictions
Predicted Positives
True Positives
False Positives
Coverage
Recall
Do not memorize this!
24 Evaluation of Predictions - in English
Coverage
IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!
In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
Recall
IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
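The two "in English" definitions translate directly into code. This is a minimal sketch with hypothetical function names; "specificity" here follows the definition on this slide, TP / (TP + FP), i.e. positive predictive value, and not the medical one.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN) from parallel lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of all positive instances that were predicted positive."""
    return tp / (tp + fn)

def specificity_ppv(tp, fp):
    """Fraction of all predicted positives that are true positives."""
    return tp / (tp + fp)

# Labeling everything positive: perfect sensitivity, poor specificity.
actual = [True, True, False, False, False]
predicted = [True] * 5
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(sensitivity(tp, fn), specificity_ppv(tp, fp))
```

The example reproduces the warning above: calling every case positive gives 100% sensitivity while only 2 of the 5 predicted positives are real.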
25 Best Measures for Comparison?
- ROC curves (Receiver Operating Characteristic?!!)
- http://en.wikipedia.org/wiki/Roc_curve
- Correlation Coefficient
- Matthews correlation coefficient (MCC)
- MCC = 1 for a perfect prediction
- 0 for a completely random assignment
- -1 for a "perfectly incorrect" prediction
In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
Do not memorize this!
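For reference only, a sketch of both measures with hypothetical function names. MCC uses the standard formula (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); returning 0 when the denominator is 0 is a common convention assumed here. Note that the ROC axes use FPR = FP / (FP + TN), i.e. the conventional specificity, not the positive-predictive-value definition used on the previous slide.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_points(scores, labels):
    """(FPR, TPR) at each distinct score threshold, highest threshold first.

    Assumes labels contain at least one positive and one negative.
    """
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        points.append((fp / neg, tp / pos))
    return points

print(mcc(5, 0, 5, 0), mcc(0, 5, 0, 5))
print(roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False]))
```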
26 GeneSeqer Input
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel 2005
27 GeneSeqer Output
Brendel 2005
28 GeneSeqer Gene Evidence Summary
Brendel 2005
29 Gene Prediction - Problems & Status?
- Common errors?
- False positive intergenic regions
- 2 annotated genes actually correspond to a single gene
- False negative intergenic region
- One annotated gene structure actually contains 2 genes
- False negative gene prediction
- Missing gene (no annotation)
- Other
- Partially incorrect gene annotation
- Missing annotation of alternative transcripts
- Current status?
- For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
- Limitation? Training data: predictions are organism specific
- Combined ab initio/homology based predictions: improved accuracy
- Limitation? Availability of identifiable sequence homologs in databases
30 Recommended Gene Prediction Software
- Ab initio
- GENSCAN http://genes.mit.edu/GENSCAN.html
- GeneMark.hmm http://exon.gatech.edu/GeneMark/
- others: GRAIL, FGENES, MZEF, HMMgene
- Similarity-based
- BLAST, GenomeScan, EST2Genome, Twinscan
- Combined
- GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
- ROSETTA
- Consensus: because results depend on organisms & specific task, always use more than one program!
- Two servers that report consensus predictions
- GeneComber
- DIGIT
31 Other Gene Prediction Resources at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/
32 Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
- Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
- Chapter 4 Finding Genes
- 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
- 4.2 Using MZEF To Find Internal Coding Exons
- 4.3 Using GENEID to Identify Genes
- 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
- 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
- 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
- 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
- 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
- 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
- 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
33 Chp 9 - Promoter & Regulatory Element Prediction
- SECTION III GENE AND PROMOTER PREDICTION
- Xiong Chp 9 Promoter & Regulatory Element Prediction
- Promoter & Regulatory Elements in Prokaryotes
- Promoter & Regulatory Elements in Eukaryotes
- Prediction Algorithms
34 Eukaryotes vs Prokaryotes: Genomes
- Eukaryotic genomes
- Are packaged in chromatin & sequestered in a nucleus
- Are larger and have multiple linear chromosomes
- Contain mostly non-protein coding DNA (98-99%)
- Prokaryotic genomes
- DNA is associated with a nucleoid, but no nucleus
- Much smaller, usually single, circular chromosome
- Contain mostly protein encoding DNA
35 Eukaryotes vs Prokaryotes: Gene Structure
36 Eukaryotes vs Prokaryotes: Genes
- Eukaryotic genes
- Are larger and more complex than in prokaryotes
- Contain introns that are spliced out to generate mature mRNAs
- Often undergo alternative splicing, giving rise to multiple RNAs
- Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
- In biology, statements such as this include an implicit "usually" or "often"
37 Eukaryotes vs Prokaryotes: Levels of Gene Regulation
- Primary level of control?
- Prokaryotes: Transcription initiation
- Eukaryotes: Transcription is also very important, but
- Expression is regulated at multiple levels, many of which are post-transcriptional
- RNA processing, transport, stability
- Translation initiation
- Protein processing, transport, stability
- Post-translational modification (PTM)
- Subcellular localization
- Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels
38 Eukaryotes vs Prokaryotes: Regulatory Elements
- Prokaryotes
- Promoters & operators (for operons) - cis-acting DNA signals
- Activators & repressors - trans-acting proteins
- (we won't discuss these)
- Eukaryotes
- Promoters & enhancers (for single genes) - cis-acting
- Transcription factors - trans-acting
- Important difference?
- What the RNA polymerase actually binds
39 Prokaryotic Promoters
- RNA polymerase complex recognizes promoter sequences located very close to and on 5' side (upstream) of transcription initiation site
- Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for transcription factors binding first
- Prokaryotic promoter sequences are highly conserved
- -10 region
- -35 region
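Because the -35 and -10 regions are so conserved, even a naive consensus scan finds candidates. The sketch below is only an illustration: the E. coli sigma-70 consensus hexamers (TTGACA and TATAAT) and the 16-18 bp spacer are textbook values that are not given on this slide, and `find_promoters` is a hypothetical helper, not one of the programs discussed in lecture.

```python
MINUS35 = "TTGACA"  # assumed sigma-70 -35 consensus (not from the slide)
MINUS10 = "TATAAT"  # assumed sigma-70 -10 consensus (not from the slide)

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatches=2, spacers=(16, 17, 18)):
    """Return (pos35, pos10, total_mismatches) for candidate promoter pairs."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(MINUS35) + 1):
        m35 = mismatches(seq[i:i + 6], MINUS35)
        for gap in spacers:
            j = i + 6 + gap                 # start of the candidate -10 hexamer
            if j + 6 > len(seq):
                continue
            total = m35 + mismatches(seq[j:j + 6], MINUS10)
            if total <= max_mismatches:
                hits.append((i, j, total))
    return hits

toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
print(find_promoters(toy, max_mismatches=0))
```

Loosening `max_mismatches` quickly increases the number of hits, which is one reason real predictors add further evidence beyond the two hexamers.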
40 Eukaryotic Promoters
- Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
- Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
- Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
- -30 region "TATA" box
- -100 region "CCAAT" box
41 Eukaryotic Promoters vs Enhancers
- Both promoters & enhancers are binding sites for transcription factors (TFs)
- Promoters
- essential for initiation of transcription
- located relatively close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)
- Enhancers
- needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
- can be very far from start site (sometimes >100 kb)
42 Eukaryotic genes are transcribed by 3 different RNA polymerases (Location of promoter regions, TFBSs & TFs differ, too)
Brown Fig 9.18
BIOS Scientific Publishers Ltd, 1999
43 Prokaryotic Genes & Operons
- Genes with related functions are often clustered within operons (e.g., lac operon)
- Operons: genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
- mRNAs produced from operons are polycistronic - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule
44 Promoter of lac operon in E. coli (Transcribed by prokaryotic RNA polymerase)
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999
45 Eukaryotic genes
- Genes with related functions are occasionally, but not usually, clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
- Chromatin structure must also be active for transcription to occur
46 Eukaryotic genes have large & complex regulatory regions
- Cis-acting regulatory elements include:
- Promoters, enhancers, silencers
- Trans-acting regulatory factors include:
- Transcription factors (TFs), chromatin remodeling complexes, small RNAs
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999
47 Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site
Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
250 bp
Pre-mRNA
48 Eukaryotic promoters & enhancer regions often contain many different TFBS motifs
Fig 9.13 Mount 2004
49 Simplified View of Promoters in Eukaryotes
Fig 5.12 Baxevanis & Ouellette 2005
50 Eukaryotic Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
Repressors block the action of activators
51 Eukaryotic Transcription Factors (TFs)
- Transcription factors: proteins that interact with the RNA polymerase complex to activate or repress transcription
- TFs often contain both:
- a trans-activating domain
- a DNA binding domain or motif
- TFs recognize and bind specific short DNA sequence motifs called transcription factor binding sites (TFBSs)
- Databases for TFs & TFBSs include:
- TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
- JASPAR
Here (DNA binding motif), motif = amino acid sequence in protein
Here (TFBS), motif = nucleotide sequence in DNA
52 Zinc Finger Proteins - Transcription Factors
- Common in eukaryotic proteins
- 1% of mammalian genes encode zinc-finger proteins (ZFPs)
- In C. elegans, there are >500!
- Can be used as highly specific DNA binding modules
- Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!
- Did you go to Dave Segal's seminar?
- Your TAs Pete & Jeff work on designing better ZFPs!
Brown Fig 9.12
BIOS Scientific Publishers Ltd, 1999
53 Promoter Prediction Algorithms & Software
Xiong -
54 Eukaryotes vs Prokaryotes: Promoter Prediction
Promoter prediction is much easier in prokaryotes
Why?
- Highly conserved
- Simpler gene structures
- More sequenced genomes! (for comparative approaches)
Methods?
- Previously: mostly HMM-based
- Now: similarity-based comparative methods, because so many genomes available
Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM - uses linear discriminant function
56 Predicting Promoters in Eukaryotes
- Closely related to gene prediction!
- Obtain genomic sequence
- Use sequence-similarity based comparison (BLAST, MSA) to find related genes
- But "regulatory" regions are much less well-conserved than coding regions
- Locate ORFs
- Identify Transcription Start Site (TSS) (if possible!)
- Use Promoter Prediction Programs
- Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)
57 Predicting promoters: Steps & Strategies
- Identify TSS - if possible?
- One of biggest problems is determining exact TSS!
- Not very many full-length cDNAs!
- Good starting point? (human & vertebrate genes)
- Use FirstEF
- found within UCSC Genome Browser
- or submit to FirstEF web server
Fig 5.10 Baxevanis & Ouellette 2005
58 Automated Promoter Prediction Strategies
- Pattern-driven algorithms (ab initio)
- Sequence-driven algorithms (homology based)
- Combined "evidence-based"
- BEST RESULTS? Combined, sequential
59 1) Pattern-driven Algorithms
- Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
- Tend to produce very large numbers of false positives (FPs)
- Why?
- Binding sites for specific TFs are often variable
- Binding sites are short (typically 6-10 bp)
- Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
- One binding site often recognized by multiple TFs
- Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this
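A back-of-the-envelope calculation shows why short sites give so many false positives. It assumes random sequence with equal base frequencies and an exact-match site (both simplifications), and a genome length of about 3 x 10^9 bp for human, which is an assumed round number not taken from the slides.

```python
def expected_random_hits(genome_length, site_length, both_strands=True):
    """Expected chance matches of one exact k-mer in random, equiprobable DNA."""
    per_strand = (genome_length - site_length + 1) / 4 ** site_length
    return 2 * per_strand if both_strands else per_strand

HUMAN_GENOME_BP = 3_000_000_000  # assumed approximate size, for illustration

for k in (6, 8, 10):
    print(k, round(expected_random_hits(HUMAN_GENOME_BP, k)))
```

Even a 10 bp exact site is expected thousands of times by chance at this genome size, and allowing variable positions in the site raises the count further.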
60 Ways to Reduce FPs in ab initio Prediction
- Take sequence context/biology into account
- Eukaryotes: clusters of TFBSs are common
- Prokaryotes: knowledge of σ (sigma) factors helps
- Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
- But: What about enhancers? (no TSS nearby!)
- only a small fraction of TSSs have been experimentally determined
- Do the wet lab experiments!
- But: Promoter-bashing can be tedious
61 2) Sequence-driven Algorithms
- Assumption: Common functionality can be deduced from sequence conservation (homology)
- Alignments of co-regulated genes should highlight elements involved in regulation
- Careful: How to determine co-regulation?
- Orthologous genes from different species
- Genes experimentally shown to be co-regulated (using microarrays??)
- Comparative promoter prediction
- Phylogenetic footprinting
- Expression Profiling
62 Phylogenetic Footprinting
- Based on increasing availability of whole genome DNA sequences from many different species
- Selection of organisms for comparison is important
- not too close, not too far; good: human vs mouse
- To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
- use MSA algorithms (e.g., CLUSTAL)
- more sensitive methods
- Gibbs sampling
- Expectation Maximization (EM) methods
- Examples of programs
- Consite, rVISTA, PromH(W), Bayes aligner, Footprinter
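The core of footprinting can be sketched as a scan for highly conserved windows in an existing pairwise alignment of non-coding sequence. This is a simplified stand-in, not how the listed programs work internally; the window size, identity cutoff and function name are arbitrary illustrative choices.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above min_identity.

    aln_a and aln_b are two rows of one alignment (same length, '-' for gaps);
    gap columns never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in cols if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# Invented toy alignment rows (not real human/mouse sequence).
print(conserved_windows("ACGTACGTAAAA", "ACGTACGTTTTT", window=8, min_identity=1.0))
```

If the whole region is highly conserved, every window passes and the scan is uninformative, which is why the choice of species matters.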
63 Expression Profiling
- Based on increasing availability of whole genome mRNA expression data, esp. microarray data
- High-throughput simultaneous monitoring of expression levels of thousands of genes
- Assumptions (sometimes valid, sometimes NOT)
- Co-expression implies co-regulation
- Co-regulated genes share common regulatory elements
- Drawbacks
- Signals are short & weak!
- Requires Gibbs sampling or EM, e.g., MEME, AlignACE, Melina
- Prediction depends on determining which genes are co-expressed - usually by clustering - which can be error prone
- Examples of programs
- INCLUSive - combined microarray analysis & motif detection
- PhyloCon - combined phylogenetic footprinting & expression profiling
64 Problems with Sequence-driven Algorithms
- Need sets of co-regulated genes
- For comparative (phylogenetic) methods
- Must choose appropriate species
- Different genomes evolve at different rates
- Classical alignment methods have trouble with translocations or inversions that change order of functional elements
- If background conservation of entire region is high, comparison is useless
- Not enough data (but Prokaryotes >>> Eukaryotes)
- Complexity: many regulatory elements are not conserved across species!
65 TRANSFAC Matrix Entry for TATA box
- Fields
- Accession ID
- Brief description
- TFs associated with this entry
- Weight matrix
- Number of sites used to build
- Other info
Fig 5.13 Baxevanis & Ouellette 2005
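To show how a weight matrix is used, the sketch below turns per-position base counts into log-odds scores and scans a sequence. The counts are invented (they are NOT the TRANSFAC TATA-box entry), and the pseudocount of 1, the uniform background of 0.25 and the function names are assumptions made for illustration.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds vs a uniform background."""
    matrix = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        matrix.append({b: math.log2(((col.get(b, 0) + pseudocount) / total) / background)
                       for b in "ACGT"})
    return matrix

def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold.

    Expects an uppercase sequence containing only A, C, G, T.
    """
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(matrix[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

# Hypothetical counts for a 4-bp "TATA"-like motif built from 10 sites.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
pwm = log_odds_matrix(toy_counts)
print(scan("GGTATAGG", pwm, threshold=5.0))
```

Lowering the threshold admits weaker matches, which is the sensitivity vs false-positive trade-off discussed for pattern-driven algorithms.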
66 Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS)
Fig 5.14 Baxevanis & Ouellette 2005
67 Annotated Lists of Promoter Databases & Promoter Prediction Software
- URLs from Mount textbook
- Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
- Table in Wasserman & Sandelin Nat Rev Genet article
- http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
- URLs from Baxevanis & Ouellette textbook
- http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
- More lists
- http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
- http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
- http://www3.oup.co.uk/nar/database/subcat/1/4/
68 Check out Optional Review & Try Associated Tutorial
- Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
- http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/
Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!