Title: Module 5: Gene prediction
1Module 5 Gene prediction
sequence
2What is a gene?
- Prokaryotic genes
- Eukaryotic genes
3Prokaryotic gene
- Small genomes, high gene density
- Haemophilus influenzae genome: 85% genic
- Operons
- One transcript, many genes
- No introns.
- One gene, one protein
- Open reading frames
- One ORF per gene
- ORFs begin with a start codon and end with a stop codon
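The "one ORF per gene" idea can be illustrated with a minimal ORF scan. This is only a sketch under simplifying assumptions: it reads the forward strand only, treats ATG as the sole start codon, and reports the first ATG after each stop. The function name `find_orfs` and the `min_codons` parameter are hypothetical, not taken from any tool named in this module.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a production tool).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs: 0-based, end exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                # Keep the ORF only if it spans enough codons (stop included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    demo = "CCATGAAATTTTAGGG"
    for s, e in find_orfs(demo):
        print(s, e, demo[s:e])  # 2 14 ATGAAATTTTAG
```

A real prokaryotic gene finder would also scan the reverse complement, allow alternative start codons, and apply a much larger minimum length.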
4Eukaryotic Gene
- Much lower gene density.
- Undergo several post-transcriptional modifications:
- 5' cap
- Poly A tail
- Splicing
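As a rough illustration of two of these modifications, the sketch below removes introns from a pre-mRNA string and appends a poly(A) tail. The intron coordinates are assumed to be already known, the function name `process_transcript` and its parameters are hypothetical, and the 5' cap is not modelled because it is a modified nucleotide rather than part of the sequence.

```python
def process_transcript(pre_mrna, introns, poly_a_length=20):
    """Splice out introns (0-based, end-exclusive intervals) and add a poly(A) tail."""
    exons = []
    pos = 0  # start of the exon currently being collected
    for start, end in sorted(introns):
        if start < pos or end > len(pre_mrna) or start >= end:
            raise ValueError("introns must be non-overlapping and inside the transcript")
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])  # final exon after the last intron
    return "".join(exons) + "A" * poly_a_length


if __name__ == "__main__":
    # Toy transcript: exon, intron (starts GU, ends AG), exon.
    pre = "AUGGCC" + "GUAAGUCCCAG" + "UUUUAA"
    print(process_transcript(pre, [(6, 17)], poly_a_length=3))
```

In a real genome the hard part is finding the intron boundaries in the first place, which is why eukaryotic gene prediction is harder than scanning for ORFs.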
5Sequencing genomes
The Hybrid Approach
6Supporting evidence
mRNA
7Collating the evidence
DNA databases (EMBL/GenBank/DDBJ)
Protein databases (Swall)
TrEMBL (automatic translation of CDS from DNA databases)
Swiss-Prot (curated data)
mRNA (cDNA)
Genomic (finished, draft)
dbEST (ESTs)
8Genome Browsers
- Ensembl (www.ensembl.org): EBI and Sanger collaboration. Gene build, predicts novel genes.
- UCSC (genome.ucsc.edu): University of California, Santa Cruz. Annotates other gene builds.
- NCBI (www.ncbi.nlm.nih.gov/mapview/): NCBI Map Viewer. Gene build, predicts novel genes.
9Genes
- Known genes, as catalogued by the reference sequence project
  - Ensembl known genes (red genes)
  - NCBI known genes
- Novel genes (1): based on similarity to known genes or cDNAs; these need not have 100% matching supporting evidence
  - Ensembl novel genes (black)
  - NCBI Loc genes
10Genes
- Novel genes (2): based on the presence of ESTs, a resource for alternative splicing
  - EST genes in Ensembl (purple)
  - Database of transcribed sequences (DoTS)
  - Acembly
- Ab initio gene prediction
  - Single organism: Genscan
  - Comparative information: Twinscan
- Pseudogenes: match a known gene but with a disrupted ORF. A minefield!
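One crude way to flag a "disrupted ORF" is to check a candidate coding sequence for the obvious defects: a length that is not a multiple of three, a missing start codon, an in-frame premature stop, or a missing terminal stop. The sketch below does only that. The function name `orf_disruptions` is hypothetical, and real pseudogene detection relies on alignment to the parent gene rather than on these checks alone.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_disruptions(cds):
    """Return a list of reasons why `cds` does not look like an intact ORF."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3 (possible frameshift)")
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    # Split into complete codons only; a trailing partial codon is ignored here.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOPS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
            break  # report only the first premature stop
    if not codons or codons[-1] not in STOPS:
        problems.append("no terminal stop codon")
    return problems


if __name__ == "__main__":
    print(orf_disruptions("ATGAAATTTTAG"))  # intact toy ORF: []
    print(orf_disruptions("ATGTAAATTTAG"))  # in-frame stop at codon 2
```

An empty list does not prove that a sequence is a functional gene; it only means none of these simple defects was found.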
11Gene - www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=gene
12RefSeq - http://www.ncbi.nlm.nih.gov/RefSeq/
13Genes in Ensembl
14Genes in Ensembl
15Supporting evidence in Ensembl
DNA
Protein
16Genes in UCSC
Put in view of UCSC
17Genes in UCSC
18Gene prediction programs
- Ab initio gene prediction
- First ones predicted single exons, e.g. GRAIL (Uberbacher, 91) or MZEF (Zhang, 97)
- Later ones predict entire genes, e.g. Genscan (Burge, 97) and Fgenesh (Solovyev, 95)
- Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes
- Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, 00)
- Gene prediction methods alone cannot accurately identify every gene in a genome
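The codon-usage part of ab initio prediction can be sketched as a log-odds score: estimate codon frequencies from known coding sequences, then score a candidate region against a uniform background of 1/64 per codon. This is a toy illustration, not the model used by GRAIL, MZEF, Genscan, or Fgenesh. The function names, the pseudocount, and the two training sequences are all made up for the example.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter({codon: pseudocount for codon in CODONS})
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skips codons containing N or other symbols
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def coding_score(seq, freqs, frame=0):
    """Sum of log2(f_codon / (1/64)) over the codons read in `frame`."""
    seq = seq.upper()
    score = 0.0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in freqs:
            score += math.log2(freqs[codon] * 64)
    return score


def best_frame(seq, freqs):
    """Return the reading frame (0, 1, or 2) with the highest coding score."""
    return max(range(3), key=lambda f: coding_score(seq, freqs, f))


if __name__ == "__main__":
    training = ["ATGGCTGCTGCTAAA", "ATGGCTAAAGCTGCT"]  # made-up toy training set
    freqs = codon_frequencies(training)
    print(best_frame("AGCTGCTGCTG", freqs))  # frame 1 reads GCT GCT GCT
```

Real programs combine a score of this kind with signal models for start codons, stop codons, and splice sites before assembling exons into genes.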
19Gene prediction programs
- Sn: Sensitivity = TP / (TP + FN)
- How many exons were found out of the total present?
- Sp: Specificity = TP / (TP + FP)
- How many predicted exons were correct out of the total exons predicted?
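A small worked example of the two measures at the exon level, with Sn = TP / (TP + FN) and Sp = TP / (TP + FP). It assumes an exon counts as a true positive only when both of its boundaries match the annotation exactly. The function name `exon_accuracy` and the coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp) for exons given as (start, end) tuples."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted exons that match the annotation
    fn = len(annotated - predicted)   # annotated exons that were missed
    fp = len(predicted - annotated)   # predicted exons with no exact match
    sn = tp / (tp + fn) if annotated else float("nan")
    sp = tp / (tp + fp) if predicted else float("nan")
    return sn, sp


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (300, 400), (500, 650)]
    sn, sp = exon_accuracy(predicted, annotated)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.50, Sp = 0.67
```

Here two of four annotated exons are found (Sn = 0.50) and two of three predictions are correct (Sp = 0.67). The third prediction overlaps a real exon but has the wrong end, so it counts as both a false positive and a missed exon.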
20Twinscan
- Gene structure prediction model
- Extends the probability model of GENSCAN
- Exploits homology between two related genomes
- Notable improvement on GENSCAN
21Twinscan
22Twinscan - genes.cs.wustl.edu/
23Other sources of gene prediction
- ORF detectors
- NCBI http://www.ncbi.nih.gov/gorf/gorf.html
- Promoter predictors
- CSHL http://rulai.cshl.org/software/index1.htm
- BDGP fruitfly.org/seq_tools/promoter.html
- ICG TATA-Box predictor
- PolyA signal predictors
- CSHL argon.cshl.org/tabaska/polyadq_form.html
- Splice site predictors
- BDGP http://www.fruitfly.org/seq_tools/splice.html
- Start-/stop-codon identifiers
- DNALC Translator/ORF-Finder
- BCM Searchlauncher