Sophie Brachat, Applied Microbiology, Biozentrum der Universität Basel


1
Gene prediction and genome annotation
Bioinformatics I

Sophie Brachat, Applied Microbiology, Biozentrum
der Universität Basel

2
Sequenced genomes: Prokaryotic genomes

97 complete microbial genomes (November 2002)
(http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html)
230 microbial genomes in progress
About 2 microbial genomes are completed every month.

3
Sequenced genomes: Eukaryotic genomes

8 complete eukaryotic genomes
Fungi
Saccharomyces cerevisiae (yeast) (1996)
Schizosaccharomyces pombe (fission yeast) (2002)
Animals
Drosophila melanogaster (fly) (2000)
C. elegans (worm) (1998)
Homo sapiens (2000) (draft!)
Plant genomes
Arabidopsis thaliana (2000)
Medicago truncatula (barrel medic) (2002) (not public!)
Oryza sativa (rice) (2002) (not public!)
15 sequencing projects in progress (Ashbya gossypii, Candida albicans, Neurospora crassa, Aspergillus fumigatus, Magnaporthe grisea, Mus musculus, Rattus norvegicus...)
And many more, being sequenced by pharmaceutical/biotech companies and not publicly available.

4
Sequenced eukaryotic genomes: where to find the information?

Baker's yeast
Saccharomyces cerevisiae, 13 Mbp
http://genome-www.stanford.edu/Saccharomyces
Nematode worm
Caenorhabditis elegans, 97 Mbp
http://www.sanger.ac.uk/Projects/C_elegans/
Fruit fly
Drosophila melanogaster, 137 Mbp
http://www.fruitfly.org/
Mustard plant
Arabidopsis thaliana, 119 Mbp
http://arabidopsis.org/info/agi.html
Human
Homo sapiens, 3,200 Mbp
http://www.nature.com/genomics/human
http://www.sciencemag.org/content/vol291/issue5507/index.shtml

5
And what do we do with a genome sequence?

We have to translate the sequence into a language human beings can understand: genome annotation.

6
We have the human genome sequence

So, what is the problem?
Well...
We don't know how many genes there are!
We don't know where they are!
We don't know what they do!

7
Definitions of Annotation

Interpreting raw sequence data into useful biological information
Information attached to genomic coordinates with a start and an end point; it can occur at different levels
Addition of as much reliable and up-to-date information as possible to describe a sequence
Identification, structural description and characterization of putative protein products and other features in the primary genomic sequence

8
Genome annotation

Two main levels:
Structural annotation (nucleotide- and protein-level annotation): finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations
Functional annotation: the objects are used in database searches (and experiments); the aim is to attribute biologically relevant information to the whole sequence and to individual objects

Large-scale genome analysis projects

The rate-limiting step is annotation

9
Part I: Structural annotation, gene prediction

This step consists of identifying the coding genes in the DNA sequence. Coding genes have numerous properties that can be used to detect them in a genomic sequence.

10
Gene prediction: Methods

Gene prediction can be based upon:
Coding statistics
Gene structure/Statistical approaches
Comparison/homology

12
Gene prediction: Coding statistics

Coding regions of the sequence have different properties from non-coding regions: the non-random properties of coding regions.
GC content
Codon bias (CODON FREQUENCY)
Third base composition (every third base in a coding region tends to be the same one much more often than by chance alone) (TESTCODE)
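The two simplest statistics above can be sketched in a few lines of Python. This is only an illustration, not the method used by CODON FREQUENCY or TESTCODE; the function names and the `frame` parameter are assumptions made for the example.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def position_composition(seq, frame=0):
    """Base counts at codon positions 1, 2 and 3, reading from offset `frame`.

    Trailing bases that do not fill a complete codon are ignored.
    """
    seq = seq.upper()
    counts = [Counter(), Counter(), Counter()]
    end = len(seq) - (len(seq) - frame) % 3
    for i in range(frame, end):
        counts[(i - frame) % 3][seq[i]] += 1
    return counts

# Usage: in a coding region read in the right frame, the third Counter
# is expected to be more skewed than in non-coding sequence.
first, second, third = position_composition("ATGATGATG")
```

Sliding such a calculation along the genome in windows, one frame at a time, is one simple way to turn these counts into a coding-potential profile.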

13
Gene prediction: Codon bias

Synonymous codons encode the same amino acid (degenerate genetic code)
In each species, the usage of the synonymous codons for a given amino acid varies with the relative abundance of the corresponding tRNAs: codon bias.
This is true only for coding regions. In non-coding regions, codons occur at random.

Example graphical output of the codonpreference
program of GCG
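As a rough illustration of codon bias (not the algorithm of the GCG codonpreference program), the relative usage of synonymous codons in a coding sequence can be counted as below. The `SYNONYMOUS` table is deliberately cut down to two amino acids, and all names are assumptions made for the example.

```python
from collections import Counter

# Deliberately tiny table (two amino acids only) to keep the sketch short.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}

def codon_counts(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_usage(cds):
    """For each amino acid, the fraction of its occurrences using each codon."""
    counts = codon_counts(cds)
    usage = {}
    for amino_acid, codons in SYNONYMOUS.items():
        total = sum(counts[c] for c in codons)
        if total:
            usage[amino_acid] = {c: counts[c] / total for c in codons}
    return usage

# Usage: a sequence that prefers AAG over AAA for lysine.
usage = relative_usage("AAGAAGAAGAAATTT")
```

Comparing such fractions for a candidate region with those of known genes from the same species gives a simple score for how gene-like the region looks.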
14
Gene prediction: Methods

Gene prediction can be based upon:
Coding statistics
Gene structure/Statistical approaches
Comparison/homology

15
Gene structure in prokaryotes
Figure labels (5' to 3'): promoter, transcription start site, RBS, start codon, coding region, stop codon, transcription stop site. The transcribed region contains the coding region flanked by untranslated regions.
16
Gene structure in eukaryotes
Figure labels (5' to 3'): promoter, transcription start site, start codon, exons, introns, stop codon, transcription stop site. Introns begin with GT and end with AG (donor and acceptor sites). The transcribed region includes untranslated regions.
17
Gene prediction: Finding ORFs

The coding region of all protein-coding genes starts with a START codon and ends with a STOP codon. These so-called ORFs (Open Reading Frames) can be searched for in the genome sequence. Valid only for prokaryotes or lower eukaryotes (few or no introns).
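A minimal ORF scan might look like the sketch below. It assumes ATG as the only start codon and the three standard stop codons, and it reports coordinates on the strand that was scanned; the function names and the `min_codons` cutoff are assumptions for illustration, not the behaviour of any particular program.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for both strands, all 3 frames.

    start/end are 0-based, end-exclusive positions on the scanned strand;
    an ORF runs from the first ATG in a frame to the next in-frame stop.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs

# Usage: one short ORF on the forward strand.
orfs = find_orfs("CCATGAAATTTTAGGG")
```

In practice the length cutoff is set much higher (often around 100 codons), since short ORFs occur frequently by chance.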

18
Gene prediction: Features that can be searched for

Prokaryotes
ORFs
RBS (Ribosome Binding Site, Shine-Dalgarno) (RBS finder)
Promoters (promoter regions of genes often have a particular DNA structure/sequence: TTGACAT(..)17TATAAT)
Program used for most of the complete microbial genomes: Glimmer (97-98% of genes predicted accurately)
Eukaryotes
Polyadenylation signal
Splice sites (consensus for splice sites)
CpG islands
Promoters, binding sites of transcriptional regulators

19
Eukaryotes: the problem

Consensus sequences are neither strong nor unique

Solution: use a combination of all prediction criteria

All parameters are studied in parallel.
Programs are trained to evaluate the predictive power of each parameter and learn to recognize genes
Based on probability: HMM
Based on artificial networks: Neural Networks
Programs need to be trained on your favorite organism!!

20
Hidden Markov Models (HMM) for gene prediction

What is an HMM?
An HMM describes the probability of transition
between the hidden states of a model.

ATGCGTGCAGTCACCAGCAGTCAGTCG
Genomic sequence
21
Hidden Markov Models (HMM) for gene prediction

What is an HMM?
An HMM describes the probability of transition
between the hidden states of a model.

Exon
Genomic sequence
ATGCGTGCAGTCACCAGCAGTCAGTCG
HIDDEN STATES
Introns
22
Hidden Markov Models (HMM) for gene prediction
Figure labels: hidden states Exon and Introns drawn over the genomic sequence ATGCGTGCAGTCACCAGCAGTCAGTCG; probabilities shown: P 0.5, P 0.8.
The probability that a base pair is in a particular state depends on the state of the previous base pair. The transition probability to another state depends on the appearance of a transition signal (splice site) and/or the average number of bp in a given hidden state (size of exons/introns).
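To make the idea concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is only a sketch: all probabilities are made up for illustration (0.8 and 0.5 are reused from the slide, but their assignment to particular transitions is an assumption), and real gene finders use many more states and signal models.

```python
import math

STATES = ("exon", "intron")
# All numbers below are illustrative assumptions, not trained parameters.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.8, "intron": 0.2},
    "intron": {"exon": 0.5, "intron": 0.5},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}

def viterbi(seq):
    """Most probable hidden-state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per base after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Usage: label each base of the example sequence as exon or intron.
labels = viterbi("ATGCGTGCAGTCACCAGCAGTCAGTCG")
```

Training, in this picture, means estimating the START, TRANS and EMIT tables from annotated genes of the organism of interest.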
23
Hidden Markov Models (HMM) for gene prediction

Basic probabilistic model of gene structure.

Figure labels: E, EF, EI, SE, I, 5', 3'
Signals: B = Begin sequence, S = Start translation, A = Acceptor site, D = Donor site, T = Stop translation, F = End sequence
24
Neural Networks for gene prediction (1)

What are Neural Networks?
A neural network is a computer program that, given a training set of data containing a certain pattern, learns to recognize that pattern.
The name derives from the fact that they were originally intended to imitate the human brain.
Like brain cells, a neural network consists of decision-making units, each connected to other units with the same topology.

25
Neural Networks for gene prediction (2)

Artificial neurons: the nodes of the network

26
Neural Networks for gene prediction (3)

Weighting factor: a neuron receives many simultaneous inputs. Each input has its own relative weight (w).
Summation function: processing in the usual artificial neuron consists of computing a weighted sum.
Transfer function: the result of the summation function is passed through a transfer function. The transfer function usually compares the weighted sum against some threshold value and may transfer no signal if the value is below the threshold.
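The three ingredients above fit in a few lines. The sketch below uses a simple step function as the transfer function, which is an assumption made for illustration; practical networks usually use smooth functions, and the function name is hypothetical.

```python
def neuron(inputs, weights, threshold):
    """Single artificial neuron: weighted sum followed by a step transfer function.

    Returns 1 if the weighted sum reaches the threshold, otherwise 0
    (no signal is transferred).
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Usage: three inputs, e.g. scores from three hypothetical sequence sensors.
fires = neuron([1.0, 0.0, 1.0], [0.5, 0.9, 0.25], threshold=0.7)
```

Learning consists of adjusting the weights (and thresholds) until the outputs on the training set match the known answers.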

27
Neural Networks for gene prediction: GRAIL II

Neural Network of gene structure.

28
Gene prediction: Statistical methods programs

Grail II
Genscan
GeneMark
Veil
GeneParser
FGENES

Any HMM or neural network method needs to be trained on your model organism!!! Do not trust the results of a single program, but rather look at the gene structures proposed by different programs.
29
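How do you train learning programs?
Figure labels: the whole genome sequence contains known genes and unknown genes. The known genes are split 1/2 and 1/2 into a training set (sequence and annotation) and a verification set (sequence). Training turns the program into a program with adapted parameters (weight functions, probabilities), which should give a good prediction on the verification set.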
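The 1/2 and 1/2 split of the known genes can be sketched as follows. The function names, the fixed random seed and the use of sensitivity as the quality measure are assumptions for illustration; the slide does not specify how a good prediction is scored.

```python
import random

def split_known_genes(known_genes, seed=0):
    """Randomly split known genes into a training half and a verification half."""
    genes = list(known_genes)
    random.Random(seed).shuffle(genes)
    half = len(genes) // 2
    return genes[:half], genes[half:]

def sensitivity(predicted, annotated):
    """Fraction of annotated (verification) genes that the program recovered."""
    annotated = set(annotated)
    if not annotated:
        return 0.0
    return len(annotated & set(predicted)) / len(annotated)

# Usage with hypothetical gene identifiers.
training, verification = split_known_genes(["g%d" % i for i in range(10)])
score = sensitivity(predicted={"g1", "g2"}, annotated={"g1", "g2", "g3", "g4"})
```

Keeping the verification genes out of training is what makes the measured accuracy a fair estimate for the unknown genes.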
30
Gene prediction: Methods

Gene prediction can be based upon:
Coding statistics
Gene structure/Statistical approaches
Comparison/homology

31
EST alignment to predict intron/exon boundaries

EST: Expressed Sequence Tag. cDNA is produced from mRNA and sequenced.

Very powerful
If several ESTs are available, allows the identification of alternative splicing products
Programs: EST-GENOME, GeneSeqer
BUT
EST sequences are usually of very poor quality (sequence errors)
EST sequences are often contaminated
Presence of an EST sequence depends on expression (level, tissue...)

32
Gene prediction: sequence conservation

Protein sequences can be conserved between organisms (homology). Homology will be detectable only in the coding regions.
Database search programs such as BLAST or tFasta can be used to search the DNA sequence against a protein database. The DNA sequence is translated in all six reading frames, and each translation is searched individually against the database.

Homologous protein name
Expect value
Coordinate of the hit on the DNA sequence
>YMR272C GENE YMR272C CHR. XIIIC REV FROM 209623 TO 210777
Length = 384
Score = 485 bits (1248), Expect = e-137
Identities = 232/383 (60%), Positives = 274/383 (70%), Gaps = 4/383 (1%)
Frame = +3

Query: 3708 SKMVSKTLPLYSKATLQKHTDRTSCWVSVGNRKIYDVSQFLDEHPGGDQYILDYAGKDIT 3887
            S   SKTL L SK T Q H     CWV   NRKIYDV  FL EHPGGD  ILDYAGKDIT
Sbjct: 2    STNTSKTLELFSKKTVQEHNTANDCWVTYQNRKIYDVTRFLSEHPGGDESILDYAGKDIT 61

Query: 3888 AVLKDKLIHEHTEAAYEILDESYLVGYLATEEEEIKLLTNEKHVMEVTPE----NLDTTT 4055
               KD   HEH   AYEIL   YL GYLAT EE   LLTN  H  EV         D TT
Sbjct: 62   EIMKDSDVHEHSDSAYEILEDEYLIGYLATDEEAARLLTNKNHKVEVQLSADGTEFDSTT 121

Query: 4056 FVKELPAEEVLSVATDFGTDYTKHHFLDLNKPLLMQVLRGNFTRDFYIDQIHRPRHYGKG 4235
            FVKELPAEE LS ATD   DY KH FLDLN PLLMQ LR  F  DFY DQIHRPRHYGKG
Sbjct: 122  FVKELPAEEKLSIATDYSNDYKKHKFLDLNRPLLMQILRSDFKKDFYVDQIHRPRHYGKG 181
DNA frame where the hit was found
Here must be a gene!!!
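The six-frame translation that precedes such a search can be sketched as below, assuming the standard genetic code. The frame numbering (+1 to +3 forward, -1 to -3 on the reverse complement) follows the usual convention, and the function names are assumptions for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate seq from its first base; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rev[offset:])
    return frames

# Usage: each of the six translations would be searched against the database.
frames = six_frame("ATGAAATAG")
```

A hit that extends over a long stretch in one frame, as in the output above, is strong evidence that a gene lies at that position.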
33
Comparative genomics approach to annotation

Ashbya/Yeast as an example of synteny.

34
Gene prediction in higher eukaryotes: Take-home message

The problem: INTRONS. Detecting the numerous introns in higher eukaryotic genes is difficult
Searching for ORFs does not help
There are often many introns per gene
The intron splice sites do not always have a strict consensus.
Alternative splicing makes things even more difficult.
The potential solutions
Base the gene prediction on homology (ESTs or related species).
Exon/intron prediction programs (HMM or neural network based) are trained on known gene sequences to recognize intron/exon boundaries. They can then be used to search new sequences.
None of the methods is good alone.
Very often a combination of all these methods is used to increase the accuracy, but gene prediction in higher eukaryotes is still a challenge.

35
Genome annotation and submission tools

Oak Ridge Genome Annotation Channel (http://compbio.ornl.gov/channel/)
ENSEMBL (http://ensembl.ebi.ac.uk)
Artemis (http://www.sanger.ac.uk/Software/Artemis): sequence viewer and annotation tool
GeneQuiz (http://www.sander.ebi.ac.uk/genequiz/): system for automated annotation of sequences, web access required
Genome Annotation Assessment Project (GASP1) (http://www.fruitfly.org/GASP1)
Sequin submission tool: ftp://ftp.ebi.ac.uk/pub/software/sequin/

37
SEQUIN Submission System

Multi-platform (Mac/PC/Unix) stand-alone software tool
Allows direct submissions to EMBL, GenBank and DDBJ
Available from EBI: ftp://ftp.ebi.ac.uk/pub/software/sequin/
Free

38
Artemis

Multi-platform (Mac/PC/Unix) stand-alone software tool
Nice visualization of the annotation.
Easy extraction of the data.
Available from the Sanger Centre: http://www.sanger.ac.uk/Software/Artemis/
Free

39
Finding tRNA genes using tRNAscan-SE

Availability
Web search: http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
UNIX source code also available at that address.
Prediction is based on:
Identification of RNA pol III intragenic promoters
Secondary structure prediction
>AGCHR1_3 agchr1_3.seq Continuation (3 of 7) of agchr1 from base 200001
GCTACTCCGGGCCCAAATGAAGGAAG
AAGTTGAAAAGGTGTTCAGGAGACATGGCGGTAT CGAGAACAATGAACC
ACCCATTATTTTCCCCAAAGCTCCATTCTACTCGTCTCAAAATGT GTAT
GAGGTATTGGATAGAGGGGGTTCTGTGTTGCAGCTGCAATATGATTTAAC
GTACCC TATGGCGCGCTATCTTTCTAAGAACCCTCATTGCATATCAAAA
CAGTACAGAATGCAGTC AGTATACCGCCCAGCAGAACAGCAGCATGGCA
GCGTTGAACCACGAAGATTCGGAGAAAT AGATTTTGATATTATATCTGG
ATCATCTGCGGATTCAGCTTTATACGACGCTGAAAGTAT TAAAATCATT
GATGAACTGATATCAGTGTTTCCTGTCTTCGAAAAGACTAATACTTTGAT
TATTGTGAATCACTCAGATATTATGGAAAGTATCTTCAACCTTTGTTCT
ATTGATAAAGC CCAACGTTCCCTCGTATCTCAGATGCTGTCTCAGGTTG
GCTTTGCCAAGTCGTTTAAAGA TGTCAAAACCGAGCTGAAGGCCCAGTT
AAATATATCTTCTACCTCCTTGAACGATTTGGA GATGTTCGATTTCAAG
GTGGATTTTGACAATGCAAAAAAGAGGCTCAACAAACTGATGAT CGATA
GTCCGCACCTAACCAAGGTTGAGGAATCGCTTTCATATATATTCAAAGTG
TTGAA CTTCCTGAAGCCTCTTGGTGTAACACGAAATGTGGTGGTATCCC
CGTTAAGCAATTATAA CAGTGCCTTCTACAAGGGCGGCATCATGTTCCA
GGCCATATACGATAGCGGCCGTGTAAA AAGTTTGTTGGCAGCTGGTGGA
CGTTACGATAATTTGATTTCTTACATTGCAAGGCCATC
Sequence          tRNA Bounds     tRNA  Anti   Intron Bounds Cove
Name      tRNA #  Begin   End     Type  Codon  Begin  End    Score
--------  ------  ------  ------  ----  -----  -----  -----  -----
AGCHR1_3  1       84548   84659   Leu   CAA    84586  84615  54.86
AGCHR1_3  2       105389  105459  Gly   GCC    0      0      62.63
AGCHR1_3  3       83748   83656   Phe   GAA    83711  83692  68.65
AGCHR1_3  4       53864   53792   Val   CAC    0      0      76.92
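Output rows like these are easy to post-process. The sketch below parses the four result rows shown above; it assumes whitespace-separated columns, that Begin greater than End marks a gene on the reverse strand, and that intron bounds of 0 and 0 mean no intron. The function and field names are made up for the example.

```python
ROWS = """\
AGCHR1_3 1 84548 84659 Leu CAA 84586 84615 54.86
AGCHR1_3 2 105389 105459 Gly GCC 0 0 62.63
AGCHR1_3 3 83748 83656 Phe GAA 83711 83692 68.65
AGCHR1_3 4 53864 53792 Val CAC 0 0 76.92
"""

def parse_trna_rows(text):
    """Parse result rows into dicts; lines without 9 columns are skipped."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue
        name, number, begin, end, aa, anticodon, i_begin, i_end, score = fields
        begin, end, i_begin, i_end = map(int, (begin, end, i_begin, i_end))
        records.append({
            "sequence": name,
            "number": int(number),
            "begin": begin,
            "end": end,
            "type": aa,
            "anticodon": anticodon,
            "strand": "+" if begin <= end else "-",   # assumption, see above
            "length": abs(end - begin) + 1,
            "has_intron": (i_begin, i_end) != (0, 0),
            "cove_score": float(score),
        })
    return records

# Usage: list the predicted tRNA genes that contain an intron.
with_intron = [r["type"] for r in parse_trna_rows(ROWS) if r["has_intron"]]
```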
40
After gene prediction and structural annotation...
MGWCDSLAIVTSI...

Endless strings of four-lettered DNA can be
translated to twenty-lettered proteins but other
as yet unknown translations will be necessary to
convert this alphabetical soup to biology
S. Fields

...Functional annotation
41
Part II: Functional annotation is the description of:

Function(s) of the protein
Post-translational modification(s)
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Diseases associated with deficiencies in the
protein
Sequence conflicts, variants, etc.

42
Functional annotation: sources

Publications that report experimental data
Protein sequence analysis
Search for characteristic domains (patterns in protein sequences found in all proteins carrying the same function: DNA-binding domain, kinase domain, transmembrane domain)
Comparison with other, related sequenced organisms
Homology to proteins of known function
Experimental data (see functional genomics lecture)
Expression studies
Biochemical studies
3D structure determination
Loss-of-function phenotype

43
From sequence to function
44
Example of annotation pipeline
NB: look out for multi-domain proteins, and put them into genome context
Supplement with manual curation and use evidence tags
45
Example of InterPro domain search
In all these proteins, the InterPro domain IPR002198 (Short-chain dehydrogenase/reductase SDR) was identified. The short-chain dehydrogenases/reductases family (SDR) is a very large family of enzymes, most of which are known to be NAD- or NADP-dependent oxidoreductases.
46
Limits of annotation

Databases are biased in sequence and AA composition, and searches depend on size
If no homology is found, only a limited amount of information can be inferred
Incorrect functional annotation can propagate very fast. If a functional annotation is wrong, then all the proteins discovered afterwards with homology to that protein will inherit the wrong functional annotation.
No answers to tissue specificity, binding of ligands, relationship between genotype and phenotype

47
IMPORTANT TO NOTE