Title: Sophie Brachat, Applied Microbiology, Biozentrum der Universität Basel
1Gene prediction and genome annotation
Bioinformatics I
- Sophie Brachat, Applied Microbiology, Biozentrum
der Universität Basel
2Sequenced genomes Prokaryotic genomes
- 97 complete microbial genomes (November 2002)
(http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html)
- 230 microbial genomes in progress
- About 2 microbial genomes are completed every month.
3Sequenced genomes Eukaryotic genomes
- 8 complete eukaryotic genomes
- Fungi
- Saccharomyces cerevisiae (yeast) (1996)
- Schizosaccharomyces pombe (fission yeast) (2002)
- Drosophila melanogaster (fly) (2000)
- C. elegans (worm) (1998)
- Homo sapiens (2000) (draft!)
- Plant genomes
- Arabidopsis thaliana (2000)
- Medicago truncatula (barrel medic) (2002) (not public!)
- Oryza sativa (rice) (2002) (not public!)
- 15 sequencing projects in progress (Ashbya gossypii, Candida albicans, Neurospora crassa, Aspergillus fumigatus, Magnaporthe grisea, Mus musculus, Rattus norvegicus...)
- And many more, being sequenced by pharmaceutical/biotech companies and not publicly available.
4Sequenced eukaryotic genomes where to find the
information?
- Baker's yeast: Saccharomyces cerevisiae, 13 Mbp
http://genome-www.stanford.edu/Saccharomyces
- Nematode worm: Caenorhabditis elegans, 97 Mbp
http://www.sanger.ac.uk/Projects/C_elegans/
- Fruit fly: Drosophila melanogaster, 137 Mbp
http://www.fruitfly.org/
- Mustard plant: Arabidopsis thaliana, 119 Mbp
http://arabidopsis.org/info/agi.html
- Human: Homo sapiens, 3,200 Mbp
http://www.nature.com/genomics/human
http://www.sciencemag.org/content/vol291/issue5507/index.shtml
5And what do we do with a genome sequence?
- We have to translate the sequence into a language human beings can understand: genome annotation.
6We have the human genome sequence
- So, what is the problem?
- Well...
- We don't know how many genes there are!
- We don't know where they are!
- We don't know what they do!
7Definitions of Annotation
- Interpreting raw sequence data into useful biological information
- Information attached to genomic coordinates with start and end points; can occur at different levels
- Addition of as much reliable and up-to-date information as possible to describe a sequence
- Identification, structural description, and characterization of putative protein products and other features in the primary genomic sequence
8Genome annotation
- Two main levels
- Structural annotation (nucleotide- and protein-level annotation): finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations
- Functional annotation: the objects are used in database searches (and experiments); the aim is to attribute biologically relevant information to the whole sequence and to individual objects
Large-scale genome analysis projects
- The rate-limiting step is annotation
9Part I Structural annotation gene prediction
- This step consists of identifying the coding genes in the DNA sequence. Coding genes have numerous properties that can be used to detect them in a genomic sequence.
10Gene prediction Methods
- Gene Prediction can be based upon
- Coding statistics
- Gene structure/Statistical approaches
- Comparison/homology
12Gene prediction Coding statistics
- Coding regions of the sequence have different properties than non-coding regions: coding regions have non-random properties.
- GC content
- Codon bias (CODON FREQUENCY).
- Third base composition: in a coding region, every third base tends to be the same one much more often than expected by chance alone (TESTCODE).
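The two simplest statistics in the list above can be computed directly from a sequence. The following is a minimal sketch and not the TESTCODE algorithm itself; the function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def third_base_composition(seq, frame=0):
    """Relative frequency of each base at the third codon position of one frame."""
    seq = seq.upper()
    thirds = seq[frame + 2::3]          # every third base, starting at codon position 3
    counts = Counter(thirds)
    total = len(thirds)
    if total == 0:
        return {}
    return {base: counts.get(base, 0) / total for base in "ACGT"}

# Toy coding sequence (invented for illustration): ATG GCC GCC GAC GGC TAA
toy = "ATGGCCGCCGACGGCTAA"
print("GC content:", round(gc_content(toy), 3))
print("Third-base composition:", third_base_composition(toy))
```

A strongly skewed third-base composition in one frame, compared with the other two frames, is the kind of signal a coding-statistics program would look for.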
13Gene prediction Codon bias
- Synonymous codons encode the same amino acid (the genetic code is degenerate).
- In each species, the synonymous codons for a given amino acid are not used equally often; their usage varies with the relative abundance of the corresponding tRNAs. This is codon bias.
- This holds only for coding regions. In non-coding regions, codons appear at random.
Example graphical output of the codonpreference program of GCG
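Codon bias can be made concrete by counting, for each amino acid, how often each of its synonymous codons is used. The sketch below assumes the standard genetic code and an invented toy sequence; it is not the GCG codonpreference program, only an illustration of the underlying count.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds):
    """Relative usage of each observed codon among the codons for the same amino acid."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3            # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]] for codon, n in counts.items()}

# Toy coding sequence (invented): ATG CTG CTG CTA TTG AAA AAG TAA
usage = codon_usage("ATGCTGCTGCTATTGAAAAAGTAA")
for codon in sorted(usage):
    print(codon, CODON_TABLE[codon], round(usage[codon], 2))
```

In practice the table would be built from a set of known genes of the organism, and a window of genomic sequence would then be scored against it.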
14Gene prediction Methods
- Gene Prediction can be based upon
- Coding statistics
- Gene structure/Statistical approaches
- Comparison/homology
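15Gene structure in Prokaryotes
Figure labels (5' to 3'): promoter, transcription start site, untranslated region, RBS, start codon, coding region, stop codon, untranslated region, transcription stop site. The transcribed region runs from the transcription start site to the transcription stop site.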
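16Gene structure in Eukaryotes
Figure labels (5' to 3'): promoter, transcription start site, untranslated region, start codon, exons separated by introns (donor and acceptor sites, GT ... AG), stop codon, untranslated region, transcription stop site. The transcribed region runs from the transcription start site to the transcription stop site.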
17Gene prediction Finding ORFs
- The coding region of every protein-coding gene starts with a START codon and ends with a STOP codon. Such ORFs (Open Reading Frames) can be searched for in the genome sequence. This is valid only for prokaryotes and lower eukaryotes (few or no introns).
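An ORF search as described above can be sketched in a few lines. This is a simplified illustration, assuming ATG as the only start codon and the three standard stop codons; the function names and the minimum-length parameter are assumptions, and real gene finders such as Glimmer do considerably more.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs in all six frames.

    Coordinates are 0-based and end-exclusive on the strand that was scanned.
    min_codons counts the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                     # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

On a real genome the minimum length would be set much higher (for example 100 codons) to suppress the many short ORFs that occur by chance.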
18Gene prediction Features that can be searched for
- Prokaryotes
- ORFs
- RBS (Ribosome Binding Site, Shine-Dalgarno sequence) (RBS finder).
- Promoters: promoter regions of genes often have a particular DNA structure/sequence, TTGACA-(N)17-TATAAT
- Program used for most of the complete microbial genomes: Glimmer (97-98% of genes predicted accurately)
- Eukaryotes
- Poly-adenylation signal
- Splice sites (consensus sequences for splice sites)
- CpG islands
- Promoters, binding sites of transcriptional regulators
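The prokaryotic promoter pattern listed above can be searched for with a regular expression. This sketch assumes the textbook E. coli consensus (TTGACA at -35, a 17 bp spacer, TATAAT at -10) and exact matches only; real promoters deviate from the consensus, so a practical search would tolerate mismatches and a variable spacer.

```python
import re

# -35 box, exactly 17 bases of spacer, -10 box. The lookahead allows overlapping hits.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{17}TATAAT))")

def find_promoter_like(seq):
    """Return (0-based position, matched sequence) for every exact consensus match."""
    return [(m.start(1), m.group(1)) for m in PROMOTER.finditer(seq.upper())]

# Toy sequence (invented): two flanking bases around a perfect consensus promoter.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(find_promoter_like(toy))
```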
19Eukaryotes: the problem
- Consensus sequences are neither strong nor unique
Solution: use a combination of all prediction criteria
- All parameters are studied in parallel.
- Programs are trained to evaluate the predictive power of each parameter and learn to recognize genes
- Based on probability: HMM
- Based on artificial networks: neural networks
- Programs need to be trained on your favorite organism!!
20Hidden Markov Models (HMM) for gene prediction
- What is an HMM?
- An HMM describes the probability of transition
between the hidden states of a model.
ATGCGTGCAGTCACCAGCAGTCAGTCG
Genomic sequence
21Hidden Markov Models (HMM) for gene prediction
- What is an HMM?
- An HMM describes the probability of transition
between the hidden states of a model.
Figure: the genomic sequence ATGCGTGCAGTCACCAGCAGTCAGTCG with the HIDDEN STATES (exon, intron) assigned along it.
22Hidden Markov Models (HMM) for gene prediction
Figure: the hidden states Exon and Introns along the genomic sequence ATGCGTGCAGTCACCAGCAGTCAGTCG, labeled with the example probabilities P = 0.5 and P = 0.8.
The probability that a base pair is in a particular state depends on the state of the previous base pair. The probability of a transition to another state depends on the appearance of a transition signal (splice site) and/or the average number of bp spent in a given hidden state (size of exons/introns).
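How an HMM assigns hidden states to a sequence can be illustrated with the Viterbi algorithm on a two-state toy model. All probabilities below are invented for illustration (they are not the values from the slide and not trained on any organism), and the model ignores splice-site signals entirely; it only shows how transition and emission probabilities combine into a most probable exon/intron labeling.

```python
import math

STATES = ("exon", "intron")
# Toy parameters (assumptions): GC-rich exons, AT-rich introns, rare state changes.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Most probable hidden-state path for seq under the toy model (log space)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []                                   # back-pointers, one dict per position
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):             # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCATATATAT"))
```

Gene finders based on HMMs use many more states (initial, internal and final exons, introns in each phase, signals) and train the probabilities on known genes, but the decoding step follows the same principle.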
23Hidden Markov Models (HMM) for gene prediction
- Basic probabilistic model of gene structure.
Figure: state diagram with the states 5', EI, E, EF, SE, I and 3'.
Signals: B = begin sequence, S = start translation, A = acceptor site, D = donor site, T = stop translation, F = end sequence
24Neural Networks for gene prediction (1)
- What are Neural Networks?
- A neural network is a computer program that, given a training set of data containing a certain pattern, learns to recognize that pattern.
- The name derives from the fact that they were originally intended to imitate the human brain.
- Like brain cells, a neural network consists of a central decision-making unit connected to other units with the same topology.
25Neural Networks for gene prediction (2)
- Artificial neurons: the nodes of the network
26Neural Networks for gene prediction (3)
- Weighting factor: a neuron receives many simultaneous inputs. Each input has its own relative weight (w).
- Summation function: processing in the usual artificial neuron consists of computing a weighted sum.
- Transfer function: the result of the summation function is passed through a transfer function. The transfer function usually compares the weighted sum against some threshold value and may transfer no signal if the value is below the threshold.
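The three components above (weights, summation, transfer function) fit into a few lines of code. This is a minimal sketch of a single artificial neuron with a step transfer function; the input values, weights and threshold are arbitrary illustrative numbers, not parameters of GRAIL or any other program.

```python
def neuron(inputs, weights, threshold):
    """Single artificial neuron: weighted sum followed by a step transfer function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Transfer function: pass a signal (1) only if the sum reaches the threshold.
    return 1 if weighted_sum >= threshold else 0

# Hypothetical inputs, e.g. three sequence features scored as present (1) or absent (0).
print(neuron([1, 0, 1], [0.5, 0.9, 0.4], threshold=0.8))
print(neuron([0, 1, 0], [0.5, 0.9, 0.4], threshold=1.0))
```

Training a network means adjusting the weights (and thresholds) until the outputs on the training set match the known answers.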
27Neural Networks for gene prediction GRAIL II
- Neural Network of gene structure.
28Gene prediction Statistical methods programs
- Grail II
- Genscan
- GeneMark
- Veil
- GeneParser
- FGENES
Any HMM or neural network method needs to be trained on your model organism!!! Do not trust the results of a single program, but rather compare the gene structures proposed by different programs.
29How do you train learning programs?
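Figure: training scheme. The whole genome sequence contains known genes and unknown genes. The known genes are split in two: 1/2 forms the training set (sequence and annotation) and 1/2 forms the verification set (sequence). The program is trained on the training set, giving a program with adapted parameters (weight functions, probabilities), which is then checked for a good prediction on the verification set.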
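The half/half split of known genes into a training set and a verification set can be sketched as follows. The function names, the fixed random seed and the simple "fraction of annotated genes recovered" measure are assumptions made for illustration; gene identifiers are hypothetical.

```python
import random

def split_known_genes(known_genes, seed=0):
    """Shuffle the known genes and split them into training and verification halves."""
    genes = list(known_genes)
    random.Random(seed).shuffle(genes)        # fixed seed keeps the split reproducible
    half = len(genes) // 2
    return genes[:half], genes[half:]

def fraction_recovered(predicted, annotated):
    """Fraction of annotated genes in the verification set that were predicted."""
    annotated = set(annotated)
    if not annotated:
        return 0.0
    return len(annotated & set(predicted)) / len(annotated)

# Hypothetical gene identifiers.
training, verification = split_known_genes(["g1", "g2", "g3", "g4", "g5", "g6"])
print("training:", training)
print("verification:", verification)
print(fraction_recovered(["a", "b"], ["a", "c"]))
```

The verification genes must never be used during training; otherwise the measured accuracy overestimates how well the program performs on unknown genes.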
30Gene prediction Methods
- Gene Prediction can be based upon
- Coding statistics
- Gene structure/Statistical approaches
- Comparison/homology
31EST alignment to predict Intron/Exon boundaries
- EST: Expressed Sequence Tag. cDNA is produced from mRNA and sequenced.
- Very powerful
- If several ESTs are available, they allow the identification of alternative splicing products
- Programs: EST-GENOME, GeneSeqer
- BUT
- EST sequences are usually of very poor quality (sequence errors)
- EST sequences are often contaminated
- The presence of an EST sequence depends on expression (level, tissue...)
32Gene prediction sequence conservation
- Between organisms, protein sequences can be conserved (homology). Homology will be detectable only in the coding regions.
- Database search programs such as BLAST or tFASTA can be used to search the DNA sequence against a protein database. The DNA sequence is translated in all six reading frames, and each translation is searched individually against the database.
Example output (slide callouts in square brackets; the middle line of each block shows the identical residues):
>YMR272C GENE YMR272C CHR. XIIIC REV FROM 209623 TO 210777   [Homologous protein name]
Length = 384
Score = 485 bits (1248), Expect = e-137   [Expect value]
Identities = 232/383 (60%), Positives = 274/383 (70%), Gaps = 4/383 (1%)
Frame = +3   [DNA frame where the hit was found]

[Coordinates of the hit on the DNA sequence: the Query positions]
Query: 3708 SKMVSKTLPLYSKATLQKHTDRTSCWVSVGNRKIYDVSQFLDEHPGGDQYILDYAGKDIT 3887
            S   SKTL L SK T Q H     CWV   NRKIYDV  FL EHPGGD  ILDYAGKDIT
Sbjct: 2    STNTSKTLELFSKKTVQEHNTANDCWVTYQNRKIYDVTRFLSEHPGGDESILDYAGKDIT 61

Query: 3888 AVLKDKLIHEHTEAAYEILDESYLVGYLATEEEEIKLLTNEKHVMEVTPE----NLDTTT 4055
               KD   HEH   AYEIL   YL GYLAT EE   LLTN  H  EV         D TT
Sbjct: 62   EIMKDSDVHEHSDSAYEILEDEYLIGYLATDEEAARLLTNKNHKVEVQLSADGTEFDSTT 121

Query: 4056 FVKELPAEEVLSVATDFGTDYTKHHFLDLNKPLLMQVLRGNFTRDFYIDQIHRPRHYGKG 4235
            FVKELPAEE LS ATD   DY KH FLDLN PLLMQ LR  F  DFY DQIHRPRHYGKG
Sbjct: 122  FVKELPAEEKLSIATDYSNDYKKHKFLDLNRPLLMQILRSDFKKDFYVDQIHRPRHYGKG 181

Here must be a gene!!!
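The six-frame translation that precedes such a protein database search can be sketched as follows. The code assumes the standard genetic code and numbers the frames +1 to +3 and -1 to -3 in the way BLAST reports them; the function names and the toy sequence are illustrative assumptions, and the database search itself is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translation(dna):
    """Return {frame: protein} for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(frame, protein)
```

Each of the six translations would then be compared against the protein database; a significant hit in one frame, as in the example output above, points to a coding region in that frame.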
33Comparative genomics approach to annotation
- Ashbya/Yeast as an example of synteny.
34Gene prediction in higher eukaryotes: Take home message
- The problem: INTRONS. The detection of the numerous introns in higher eukaryotic genes is difficult
- It does not help to search for ORFs
- There are often many introns per gene
- The intron splice sites do not always have a strict consensus.
- The existence of alternative splicing makes things even more difficult.
- The potential solutions
- Base the gene prediction on homology (ESTs or related species).
- Exon/intron prediction programs (HMM or neural network based) are trained on known gene sequences to recognize intron/exon boundaries. They can then be used to search new sequences.
- None of the methods is good alone.
- Very often a combination of all these methods is used to increase the accuracy, but gene prediction in higher eukaryotes is still a challenge.
35Genome annotation and submission tools
- Oakridge Genome Annotation Channel (http://compbio.ornl.gov/channel/)
- ENSEMBL (http://ensembl.ebi.ac.uk)
- Artemis (http://www.sanger.ac.uk/Software/Artemis): sequence viewer and annotation tool
- GeneQuiz (http://www.sander.ebi.ac.uk/genequiz/): system for automated annotation of sequences, web access required
- Genome Annotation Assessment Project (GASP1) (http://www.fruitfly.org/GASP1)
- Sequin submission tool: ftp://ftp.ebi.ac.uk/pub/software/sequin/
37SEQUIN Submission System
- Multi-platform (Mac/PC/Unix) stand-alone software tool
- Allows direct submissions to EMBL, GenBank and DDBJ
- Available from EBI: ftp://ftp.ebi.ac.uk/pub/software/sequin/
- Free
38Artemis
- Multi-platform (Mac/PC/Unix) stand-alone software tool
- Nice visualization of the annotation.
- Easy extraction of the data.
- Available from the Sanger Centre: http://www.sanger.ac.uk/Software/Artemis/
- Free
39Finding tRNA genes using tRNAscan-SE
- Availability
- Web search: http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
- UNIX source code also available at that address.
- Prediction is based on
- Identification of RNA pol III intragenic promoters
- Secondary structure prediction
>AGCHR1_3 agchr1_3.seq Continuation (3 of 7) of agchr1 from base 200001
GCTACTCCGGGCCCAAATGAAGGAAG
AAGTTGAAAAGGTGTTCAGGAGACATGGCGGTAT CGAGAACAATGAACC
ACCCATTATTTTCCCCAAAGCTCCATTCTACTCGTCTCAAAATGT GTAT
GAGGTATTGGATAGAGGGGGTTCTGTGTTGCAGCTGCAATATGATTTAAC
GTACCC TATGGCGCGCTATCTTTCTAAGAACCCTCATTGCATATCAAAA
CAGTACAGAATGCAGTC AGTATACCGCCCAGCAGAACAGCAGCATGGCA
GCGTTGAACCACGAAGATTCGGAGAAAT AGATTTTGATATTATATCTGG
ATCATCTGCGGATTCAGCTTTATACGACGCTGAAAGTAT TAAAATCATT
GATGAACTGATATCAGTGTTTCCTGTCTTCGAAAAGACTAATACTTTGAT
TATTGTGAATCACTCAGATATTATGGAAAGTATCTTCAACCTTTGTTCT
ATTGATAAAGC CCAACGTTCCCTCGTATCTCAGATGCTGTCTCAGGTTG
GCTTTGCCAAGTCGTTTAAAGA TGTCAAAACCGAGCTGAAGGCCCAGTT
AAATATATCTTCTACCTCCTTGAACGATTTGGA GATGTTCGATTTCAAG
GTGGATTTTGACAATGCAAAAAAGAGGCTCAACAAACTGATGAT CGATA
GTCCGCACCTAACCAAGGTTGAGGAATCGCTTTCATATATATTCAAAGTG
TTGAA CTTCCTGAAGCCTCTTGGTGTAACACGAAATGTGGTGGTATCCC
CGTTAAGCAATTATAA CAGTGCCTTCTACAAGGGCGGCATCATGTTCCA
GGCCATATACGATAGCGGCCGTGTAAA AAGTTTGTTGGCAGCTGGTGGA
CGTTACGATAATTTGATTTCTTACATTGCAAGGCCATC
Sequence    tRNA     Bounds          tRNA  Anti   Intron Bounds   Cove
Name        tRNA #   Begin   End     Type  Codon  Begin   End     Score
--------    ------   ------  ------  ----  -----  ------  ------  ------
AGCHR1_3    1        84548   84659   Leu   CAA    84586   84615   54.86
AGCHR1_3    2        105389  105459  Gly   GCC    0       0       62.63
AGCHR1_3    3        83748   83656   Phe   GAA    83711   83692   68.65
AGCHR1_3    4        53864   53792   Val   CAC    0       0       76.92
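A result table like the one above is easy to post-process. The sketch below parses the four result rows shown on the slide; it assumes whitespace-separated columns in the order name, number, begin, end, type, anticodon, intron begin, intron end, score, and it interprets begin > end as a gene on the reverse strand and intron bounds of 0 as "no intron". The function name and the returned field names are illustrative.

```python
def parse_trna_rows(text):
    """Parse tRNAscan-SE style result rows into dictionaries (header lines are skipped)."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[1].isdigit():
            continue                      # header, separator or empty line
        begin, end = int(fields[2]), int(fields[3])
        intron_begin, intron_end = int(fields[6]), int(fields[7])
        records.append({
            "sequence": fields[0],
            "number": int(fields[1]),
            "begin": begin,
            "end": end,
            "strand": "+" if begin < end else "-",
            "type": fields[4],
            "anticodon": fields[5],
            "has_intron": intron_begin != 0 and intron_end != 0,
            "score": float(fields[8]),
        })
    return records

rows = """AGCHR1_3 1 84548 84659 Leu CAA 84586 84615 54.86
AGCHR1_3 2 105389 105459 Gly GCC 0 0 62.63
AGCHR1_3 3 83748 83656 Phe GAA 83711 83692 68.65
AGCHR1_3 4 53864 53792 Val CAC 0 0 76.92"""

for record in parse_trna_rows(rows):
    print(record["type"], record["anticodon"], record["strand"], record["has_intron"])
```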
40After gene prediction and structural annotation...
MGWCDSLAIVTSI...
- "Endless strings of four-lettered DNA can be translated to twenty-lettered proteins but other as yet unknown translations will be necessary to convert this alphabetical soup to biology" (S. Fields)
...Functional annotation
41Part II Functional annotation is the
description of
- Function(s) of the protein
- Post-translational modification(s)
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Diseases associated with deficiencies in the protein
- Sequence conflicts, variants, etc.
42Functional annotation sources
- Publications that report experimental data
- Protein sequence analysis
- Search for characteristic domains (patterns in protein sequences found in all proteins carrying out the same function: DNA binding domain, kinase domain, transmembrane domain)
- Comparison with other, related sequenced organisms
- Homology to proteins of known function
- Experimental data (see functional genomics lecture)
- Expression studies
- Biochemical studies
- 3D structure determination
- Loss of function phenotype
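The simplest form of the pattern search mentioned above is a short sequence motif matched against the protein. The sketch below converts a PROSITE-style pattern into a regular expression and reports the matching positions; it handles only the basic syntax (single residues, [..] alternatives, {..} exclusions, x wildcards and repeat counts), and the example uses the well-known N-glycosylation site pattern N-{P}-[ST]-{P} on an invented protein fragment. Domain databases such as InterPro combine several more elaborate methods, so this is an illustration of the idea only.

```python
import re

ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        core, low, high = ELEMENT.fullmatch(element).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        # single letters and [..] groups are already valid regex syntax
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

def find_sites(pattern, protein):
    """Return (1-based position, matched residues) for all, possibly overlapping, matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]

print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "MKNASAANPSA"))   # invented protein fragment
```

A match to such a short pattern is only a hint: short motifs occur by chance, so the prediction should be supported by other evidence.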
43From sequence to function
44Example of annotation pipeline
NB look out for multi-domain proteins, put into
genome context
Supplement with manual curation and use evidence
tags
45Example of Interpro search domain search
In all these proteins, the InterPro domain IPR002198 (Short-chain dehydrogenase/reductase SDR) was identified. The short-chain dehydrogenases/reductases family (SDR) [1] is a very large family of enzymes, most of which are known to be NAD- or NADP-dependent oxidoreductases.
46Limits of annotation
- Databases are biased in sequence and amino acid composition, and the search is dependent on size
- If no homology is found, only a limited amount of information can be inferred
- Incorrect functional annotation can be propagated very fast. If a functional annotation is wrong, then all the proteins with homology to that protein discovered afterwards will have a wrong functional annotation.
- No answers to tissue specificity, binding of ligands, or the relationship between genotype and phenotype
47IMPORTANT TO NOTE
- DON'T COMPLETELY TRUST COMPUTER RESULTS
- CHECK LITERATURE
- CONFIRM WITH WETLAB WORK