Title: Lecture 6: Gene Prediction
1. Lecture 6: Gene Prediction
- Chapter 6
- Eukaryotic gene and intron splicing prediction
- Genome and gene organization
2. Eukaryote Gene Prediction
- It's like finding a 2-gram needle in a 6,000 kg haystack.
3. Finding Gene-Rich Regions in a Whole Genome
- Isochores (high density versus low density)
- CpG islands
- Codon biased regions
- Mask the junk (mask the repeats)
4. Isochores
- Genome organization
  - Nucleotides → genes → isochores → chromosomes
  - Just another level of organization to help organize our information!
- Isochore
  - On the order of 1 Mb long
  - Homogeneous base composition (GC content is uniform within an individual isochore)
- Human example
  - 5 different classes of isochores
  - L1, L2, H1, H2, H3
  - L = low density (lower GC and lower gene content)
  - H = high density (higher GC and higher gene content)
Where would you start to sequence a genome?
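One way to look for isochore-like regions is to measure GC content in fixed windows along a sequence. The sketch below is only an illustration: the function names, the toy sequence, and the tiny window size are assumptions chosen to keep the example short, while real isochore scans use windows of many kilobases.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (case-insensitive); 0.0 for empty input."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """Yield (start, gc_fraction) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


# Toy sequence (hypothetical): an AT-rich half followed by a GC-rich half.
toy = "ATATATTAAT" * 5 + "GCGCGGCCGC" * 5
profile = list(windowed_gc(toy, window=50, step=50))
for start, gc in profile:
    print(start, round(gc, 2))
```

Windows with consistently high GC would be candidates for the H classes, and windows with low GC for the L classes.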
5. CpG Islands
- Most dinucleotide pairs occur at roughly random frequency
  - CpG is found at only 20% of the frequency expected by chance overall
- Some regions are more dense with CpG than others
- CpG islands
  - Higher density of CpG in a given area compared to the overall percentage of CpG
- CpG islands occur at specific sites
  - 45,000 islands in the human genome
  - Typically found in housekeeping genes and at many regulatory binding sites or promoter regions
- 1-2 kb at the 5' ends of many human genes
  - About 4,500 islands in human genes
- Far less common in introns, junk DNA, and gene-free regions
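The idea of CpG being under-represented or over-represented can be made concrete with an observed/expected ratio: the number of CG dinucleotides seen, divided by the number expected if C and G were placed independently. This is a minimal sketch; the function name and toy sequences are assumptions, and any cutoff for calling a region an island would have to come from outside this lecture.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for seq.

    Expected count assumes C and G occur independently:
    expected = (#C * #G) / length, so ratio = #CG * length / (#C * #G).
    Returns 0.0 when the sequence has no C or no G.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)


print(cpg_obs_exp("CGCGCGCG"))  # CpG-dense toy sequence
print(cpg_obs_exp("GGCCGGCC"))  # same base composition, fewer CpGs
```

A ratio near 0.2 would match the genome-wide depletion described on the slide, while a window with a ratio near or above 1 is dense in CpG relative to its base composition.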
6. Codon Bias
- Every organism seems to prefer some codons over others within genes
  - Yeast: AGA is used for arginine 48% of the time
- Codon bias occurs in exons, but not in introns
  - A region lacking the bias can be used as additional evidence for the presence of an intron
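Codon bias can be measured by counting how often one codon is used among all the synonymous codons for the same amino acid. The sketch below is illustrative only: the function names and the toy coding sequence are assumptions, and the six arginine codons are taken from the standard genetic code.

```python
from collections import Counter

# The six arginine codons in the standard genetic code (DNA alphabet).
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def codon_counts(cds):
    """Count in-frame codons in cds, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_fraction(cds, codon, synonymous):
    """Fraction of the synonymous codons in cds that are `codon`."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in synonymous)
    return counts[codon] / total if total else 0.0


# Toy coding sequence (hypothetical): ATG AGA AGA CGT AGA TAA
toy_cds = "ATGAGAAGACGTAGATAA"
print(codon_fraction(toy_cds, "AGA", ARG_CODONS))
```

Run over the annotated genes of an organism, the same count gives a usage table; a candidate region whose codon usage resembles that table looks more like an exon than one that does not.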
7. Codon Usage in a Properly Spliced Gene
8. Locating Genes and Coding Regions in Eukaryote Gene-Rich Regions
- ORF locating versus coding region location
- Not easy!
- Large genomes
  - About 1% of human DNA encodes functional genes.
  - In contrast, 85% of bacterial DNA encodes functional genes.
- Genes are interspersed among long stretches of non-coding DNA.
- More than one chromosome and more than one copy of a chromosome
- Repeats, pseudogenes, and introns confound matters
- It will take at least 15 years to find all genes in a given genome once the genome is sequenced.
9. Eukaryote Prediction
- A variety of features must be found, and the features must be found in specific locations
- Any one feature (e.g., a promoter region) cannot be used by itself to predict a functional gene
- Prediction algorithms have an accuracy of about 50% for eukaryotes
- Focus for finding genes
  - Promoter elements
  - Regulatory binding sites
- Focus for finding coding regions
  - Splice junctions
  - Codon bias
- All DNA motifs are written in the 5' to 3' direction on the coding/sense strand
10. Finding Genes: Promoter Elements
- Binding of transcription factors to the promoter
  - Assists in RNA polymerase II binding
- TATA box
  - TATAWAW, where W = A or T
  - At -25 relative to the transcription start site
- Inr sequence (initiator sequence)
  - YYCARR, where Y = C or T and R = G or A
  - At +1, downstream of the TATA box
- Subtle differences in these basal transcription factors make them unreliable by themselves
  - Need more evidence
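Degenerate motifs such as TATAWAW and YYCARR can be searched for by expanding each ambiguity letter into a character class. This is a minimal sketch: the function name and the toy sequence are assumptions, and only the ambiguity codes used in this lecture (plus N) are included.

```python
import re

# Ambiguity codes used here: W = A or T, Y = C or T, R = G or A, N = any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "Y": "[CT]", "R": "[AG]", "N": "[ACGT]",
}


def find_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    body = "".join(IUPAC[ch] for ch in motif.upper())
    # A lookahead lets re.finditer report overlapping occurrences too.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


# Toy promoter fragment (hypothetical): a TATA-like box, then an Inr-like site.
toy = "GGTATAAAAGGGGCTCAGATT"
print(find_motif(toy, "TATAWAW"))
print(find_motif(toy, "YYCARR"))
```

Because these motifs are short, chance matches are frequent in real genomic sequence, which is why a hit on its own is weak evidence for a promoter.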
11. Finding Genes: Regulatory Protein Binding Sites
- Found upstream from the transcription start site
- Regulatory proteins bind to DNA and promote or prevent transcription
- CAAT box found at -80 from the actual gene
- Enhancers
  - Ex.: GGGCGG site → binding site for Sp1 protein
  - -500 to +500 relative to the transcription start site
  - Bend DNA → change DNA orientation to initiate transcription
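Positions of binding sites are usually quoted relative to the transcription start site, with the start site itself numbered +1 and the base just before it numbered -1 (there is no position 0). The sketch below converts ordinary 0-based string indices into that convention; the function names, the toy sequence, and the chosen start-site index are all assumptions for illustration.

```python
def relative_to_tss(index, tss_index):
    """Convert a 0-based index to a TSS-relative coordinate.

    The start site itself is +1 and the preceding base is -1; 0 is skipped.
    """
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


def find_sites(seq, site, tss_index):
    """TSS-relative start coordinates of every exact occurrence of site."""
    seq, site = seq.upper(), site.upper()
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(relative_to_tss(start, tss_index))
        start = seq.find(site, start + 1)
    return hits


# Toy sequence (hypothetical): a GGGCGG site 80 bases before index 100.
toy = "A" * 20 + "GGGCGG" + "A" * 84
print(find_sites(toy, "GGGCGG", tss_index=100))
```

Reporting hits this way makes it easy to check whether a candidate site falls in the window where that kind of element is expected.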
12. Finding Coding Regions Within Genes: Splice Junctions
- Junctions between an intron and its 2 flanking exons
- GT-AG rule (usually)
  - The first 2 nucleotides of the intron in the pre-mRNA are GU (GT in the coding DNA)
  - The last 2 nucleotides of the intron are AG
- Introns must be at least 60 bp, but there is no upper limit on length
- No limits on distribution
- Other splice junctions exist and differ for different genes
- Alternative splicing complicates things here.
  - 20% of all human genes!
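The GT-AG rule and the 60 bp minimum can be turned into a simple filter that lists every stretch starting with GT, ending with AG, and meeting the length limit. This is a sketch under those two rules only; the function name, the optional `max_len` parameter, and the toy sequence are assumptions, and in real sequence the filter yields many false candidates that need further evidence.

```python
MIN_INTRON = 60  # minimum intron length from the slide


def candidate_introns(seq, min_len=MIN_INTRON, max_len=None):
    """Return (start, end) pairs where seq[start:end] begins with GT,
    ends with AG, and has a length within the given limits."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    hits = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # intron includes both GT and AG
            if length < min_len:
                continue
            if max_len is not None and length > max_len:
                continue
            hits.append((d, a + 2))
    return hits


# Toy gene (hypothetical): exon, 64 bp intron (GT ... AG), exon.
toy = "ATGCCC" + "GT" + "C" * 60 + "AG" + "CCCTAA"
print(candidate_introns(toy))
```

Removing a candidate span and joining the flanking sequence gives a putative spliced transcript that can then be checked with other evidence such as codon bias.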
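13. Finding Coding Regions Within Genes: Splice Junction
[Figure: an intron between two exons, with the 5' splice junction (5'-GT-3') at its start and the 3' splice junction (5'-AG-3') at its end]
Exon | G(100) T(100) A(62) A(68) G(64) T(63) ... (C/T) x 12, N, C(68) A(100) G(100) | Exon
- The number after each base (a subscript on the slide) is the percentage of the time that particular base is seen at that position.
- Dots represent the length of the intron (can be any number of nucleotides greater than about 60).
- Consensus: GTAAGT...YYYYYYYYYYYYNCAG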
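The percentages on this slide can be used for a crude match score: a candidate 5' splice site earns the consensus percentage at each position where it agrees with GTAAGT, normalized so a perfect match scores 1.0. This is only a sketch of the idea, not a full position weight matrix, because the slide gives the frequency of the consensus base alone; the function name and scoring scheme are assumptions.

```python
# Consensus base and its percentage at each of the first six intron positions,
# as given on the slide: G100 T100 A62 A68 G64 T63.
DONOR_CONSENSUS = [("G", 100), ("T", 100), ("A", 62), ("A", 68), ("G", 64), ("T", 63)]


def donor_score(hexamer):
    """Crude 5' splice site score in [0, 1] for a 6-base candidate.

    Sums the consensus percentage at every matching position and divides
    by the total for a perfect match.
    """
    hexamer = hexamer.upper()
    if len(hexamer) != len(DONOR_CONSENSUS):
        raise ValueError("expected a 6-base sequence")
    total = sum(pct for _, pct in DONOR_CONSENSUS)
    matched = sum(pct for base, (cons, pct) in zip(hexamer, DONOR_CONSENSUS)
                  if base == cons)
    return matched / total


print(donor_score("GTAAGT"))            # perfect consensus match
print(round(donor_score("GTGAGT"), 3))  # mismatch at the third position
```

A mismatch at the invariant G or T costs more than a mismatch at the less conserved positions, which mirrors how strongly each position is constrained.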
14. Finding Coding Regions Within Genes: Alternative Splicing
Mouse troponin T: cardiac muscle or skeletal muscle
[Figure: exon-intron diagrams for two transcripts, labeled Tnni3 (cardiac) and Tnnt1 (skeletal)]
15. Finding Coding Regions Within Genes: Codon Bias
- Also used for predicting coding regions
16. Gene Prediction Software: Great for Eukaryotes
- GENSCAN and HMMgene
  - Used to predict exon locations and repeated elements
  - Splice the exons and translate (when more than one is present) so you can do a BLASTP
  - http://genes.mit.edu/GENSCAN.html
  - http://www.cbs.dtu.dk/services/HMMgene/
- Used primarily for human/vertebrate genomic sequences
  - Not good for invertebrate sequences
- Practice together