Sophie Brachat, Applied Microbiology, Biozentrum der Universität Basel

Transcript and Presenter's Notes

1
Gene prediction and genome annotation
Bioinformatics I
  • Sophie Brachat, Applied Microbiology, Biozentrum
    der Universität Basel

2
Sequenced genomes: Prokaryotic genomes
  • 97 complete microbial genomes (November 2002)
    (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html)
  • 230 microbial genomes in progress
  • About 2 microbial genomes are completed every
    month.

3
Sequenced genomes: Eukaryotic genomes
  • 8 complete eukaryotic genomes
  • Fungi
  • Saccharomyces cerevisiae (yeast) (1996)
  • Schizosaccharomyces pombe (fission yeast) (2002)
  • Drosophila melanogaster (fly) (2000)
  • C. elegans (worm) (1998)
  • Homo sapiens (2000) (draft!)
  • Plant genomes
  • Arabidopsis thaliana (2000)
  • Medicago truncatula (barrel medic) (2002) (not
    public!)
  • Oryza sativa (rice) (2002) (not public!)
  • 15 sequencing projects in progress (Ashbya
    gossypii, Candida albicans, Neurospora crassa,
    Aspergillus fumigatus, Magnaporthe grisea, Mus
    musculus, Rattus norvegicus...)
  • And many more, being sequenced by
    pharmaceutical/biotech companies and not publicly
    available.

4
Sequenced eukaryotic genomes: where to find the
information?
  • Baker's yeast
  • Saccharomyces cerevisiae, 13 Mbp
    http://genome-www.stanford.edu/Saccharomyces
  • Nematode worm
  • Caenorhabditis elegans, 97 Mbp
    http://www.sanger.ac.uk/Projects/C_elegans/
  • Fruit fly
  • Drosophila melanogaster, 137 Mbp
    http://www.fruitfly.org/
  • Mustard plant
  • Arabidopsis thaliana, 119 Mbp
    http://arabidopsis.org/info/agi.html
  • Human
  • Homo sapiens, 3,200 Mbp
    http://www.nature.com/genomics/human
  • http://www.sciencemag.org/content/vol291/issue5507/index.shtml

5
And what do we do with a genome sequence?
  • We have to translate the sequence into a
    language human beings can understand: genome
    annotation.

6
We have the human genome sequence
  • So, what is the problem?
  • Well...
  • We don't know how many genes there are!
  • We don't know where they are!
  • We don't know what they do!

7
Definitions of Annotation
  • Interpreting raw sequence data into useful
    biological information
  • Information attached to genomic coordinates with
    start and end point, can occur at different
    levels
  • Addition of as much reliable and up-to-date
    information as possible to describe a sequence
  • Identification, structural description,
    characterization of putative protein products and
    other features in primary genomic sequence

8
Genome annotation
  • Two main levels:
  • Structural annotation (nucleotide/protein-level
    annotation): finding genes and other biologically
    relevant sites, thus building up a model of the genome
    as objects with specific locations
  • Functional annotation: objects are used in
    database searches (and experiments); the aim is
    to attribute biologically relevant information to
    the whole sequence and to individual objects

Large-scale genome analysis projects
  • Rate-limiting step is annotation
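
The idea of annotation as "objects with specific locations" can be sketched as a small data structure. This is only an illustration: the class name `Feature`, its fields, and the `notes` dictionary are assumptions made for this example, not a format used in the lecture. The coordinates are those of the yeast hit YMR272C quoted in the BLAST example of this lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated object: a location plus attached information (hypothetical layout)."""
    seqid: str          # chromosome or contig name
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str         # "+" or "-"
    kind: str           # structural level: "gene", "exon", "tRNA", ...
    notes: dict = field(default_factory=dict)  # functional level: free-form annotation

    def length(self) -> int:
        return self.end - self.start + 1


# Structural annotation: where the object is.
gene = Feature("chrXIII", 209623, 210777, "-", "gene")
# Functional annotation: what is known about it (placeholder value).
gene.notes["function"] = "unknown (placeholder)"
print(gene.length())
```

Keeping location and description in one record mirrors the two levels above: the coordinates are filled in by gene prediction, the notes by database searches and experiments.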

9
Part I: Structural annotation: gene prediction
  • This step consists of identifying the coding
    genes in the DNA sequence. Coding genes have
    numerous properties that can be used to detect
    them in a genomic sequence.

10
Gene prediction: Methods
  • Gene prediction can be based upon:
  • Coding statistics
  • Gene structure/Statistical approaches
  • Comparison/homology

12
Gene prediction: Coding statistics
  • Coding regions of the sequence have different
    properties than non-coding regions: coding
    regions have non-random properties.
  • GC content
  • Codon bias (CODON FREQUENCY).
  • Third-base composition (every third base in a
    coding region tends to be the same one much more
    often than by chance alone) (TESTCODE).
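
Two of these statistics are easy to compute directly. The sketch below is a minimal illustration, not the algorithm of CODON FREQUENCY or TESTCODE; the function names `gc_content` and `position_composition` are made up for this example.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def position_composition(seq: str, frame: int = 0) -> list:
    """Base counts at codon positions 1, 2 and 3, reading in the given frame."""
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [Counter(seq[pos:usable:3]) for pos in range(3)]


window = "ATGGCCGCC"
print(gc_content(window))
print(position_composition(window)[2])  # composition of the third codon position
```

In a coding window the third-position counts are typically skewed towards a few bases; in non-coding sequence the three positions look alike.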

13
Gene prediction: Codon bias
  • Synonymous codons encode the same amino acid
    (degenerate genetic code).
  • In each species, the synonymous codons for a given
    amino acid are used at different frequencies,
    depending on the relative abundance of the
    corresponding tRNAs: codon bias.
  • This is true only for coding regions. In non-coding
    regions, codons appear at random.

Example graphical output of the codonpreference
program of GCG
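
Codon bias can be made concrete by counting, for each amino acid, how often each synonymous codon is used in a coding sequence. This is a minimal sketch assuming the standard genetic code; it is not the codonpreference program, and the function name `codon_usage` is an assumption for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds: str) -> dict:
    """For each amino acid, the relative frequency of each synonymous codon."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)  # codons with ambiguous bases are skipped
        if aa is not None:
            by_aa[aa][codon] = n
    return {aa: {c: n / sum(cs.values()) for c, n in cs.items()}
            for aa, cs in by_aa.items()}


usage = codon_usage("GCTGCTGCCAAA")
print(usage["A"])  # alanine: GCT used twice, GCC once
```

A usage table built from known genes of one species can then be compared with the codon frequencies in a candidate window of the same species.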
14
Gene prediction: Methods
  • Gene prediction can be based upon:
  • Coding statistics
  • Gene structure/Statistical approaches
  • Comparison/homology

15
Gene structure in Prokaryotes
[Figure: diagram of a prokaryotic gene, 5' to 3'. Labels: promoter, transcription start site, transcribed region, untranslated regions, RBS, start codon, coding region, stop codon, transcription stop site.]
16
Gene structure in Eukaryotes
[Figure: diagram of a eukaryotic gene, 5' to 3'. Labels: promoter, transcription start site, transcribed region, untranslated regions, start codon, exons, introns, donor and acceptor sites (GT ... AG), stop codon, transcription stop site.]
17
Gene prediction: Finding ORFs
  • The coding region of every protein-coding gene
    starts with a START codon and ends with a STOP
    codon. These so-called ORFs (Open Reading Frames) can
    be searched for in the genome sequence. Valid only
    for prokaryotes or lower eukaryotes (few or no
    introns).
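
An ORF search of this kind can be sketched in a few lines. The code below is an illustration under simple assumptions: ATG is the only start codon, the standard stop codons are used, and the name `find_orfs` and the `min_codons` cutoff are choices made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end, orf) for ATG...STOP stretches.

    start/end are 0-based, end-exclusive, on the strand that was scanned;
    the codon count includes the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame


for orf in find_orfs("CCATGAAATTTTAGCC", min_codons=1):
    print(orf)
```

On a real genome a length cutoff of about 100 codons is a common way to suppress the many short ORFs that occur by chance; the best value depends on the organism.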

18
Gene prediction: Features that can be searched for
  • Prokaryotes
  • ORFs
  • RBS (Ribosome Binding Site) (Shine-Dalgarno)
    (RBS finder).
  • Promoters (promoter regions of genes often have
    a particular DNA structure/sequence:
    TTGACA-(N)17-TATAAT)
  • Program used for most of the complete microbial
    genomes: Glimmer (97-98% of genes predicted
    accurately)
  • Eukaryotes
  • Polyadenylation signal
  • Splice sites (consensus for splice sites)
  • CpG islands
  • Promoters, transcriptional regulator binding
    sites
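
As an illustration of searching for such a signal, the sketch below scans for the bacterial promoter consensus with a regular expression. It assumes the canonical -35 box TTGACA and -10 box TATAAT separated by 15 to 19 bp and requires exact matches; real promoters deviate from the consensus, so this finds only ideal sites. The function name is hypothetical.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{15,19}TATAAT))")


def find_promoter_candidates(seq: str) -> list:
    """Return (start, end, matched sequence) for exact consensus matches."""
    seq = seq.upper()
    return [(m.start(), m.start() + len(m.group(1)), m.group(1))
            for m in PROMOTER.finditer(seq)]


test_seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoter_candidates(test_seq))
```

Because consensus sequences are weak, practical tools score sites with position weight matrices instead of demanding an exact match.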

19
Eukaryotes: the problem
  • Consensus sequences are neither strong nor unique

Solution: use a combination of all prediction
criteria
  • All parameters are studied in parallel.
  • Programs are trained to evaluate the predictive
    power of each of the parameters and learn
    to recognize genes
  • Based on probability: HMMs
  • Based on artificial networks: neural networks
  • Programs need to be trained on your favorite
    organism!!

20
Hidden Markov Models (HMM) for gene prediction
  • What is an HMM?
  • An HMM describes the probability of transition
    between the hidden states of a model.

ATGCGTGCAGTCACCAGCAGTCAGTCG
Genomic sequence
21
Hidden Markov Models (HMM) for gene prediction
  • What is an HMM?
  • An HMM describes the probability of transition
    between the hidden states of a model.

Exon
Genomic sequence
ATGCGTGCAGTCACCAGCAGTCAGTCG
HIDDEN STATES
Introns
22
Hidden Markov Models (HMM) for gene prediction
[Figure: the two hidden states, Exon and Introns, above the genomic sequence ATGCGTGCAGTCACCAGCAGTCAGTCG, with transition probabilities P = 0.5 and P = 0.8.]
The probability that a base pair is in a
particular state depends on the state of the
previous base pair. The transition probability to
another state depends on the appearance of a
transition signal (splice site) and/or the
average number of bp in a certain hidden state
(size of exons/introns).
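
To make the idea concrete, here is a minimal two-state HMM (exon "E", intron "I") decoded with the Viterbi algorithm, which finds the most probable sequence of hidden states for a DNA sequence. All numbers are hypothetical values chosen for illustration (they are not the probabilities on the slide and not those of any real gene finder), and the model ignores splice signals and codon structure.

```python
import math

STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
# Probability of staying in or leaving a state (hypothetical values).
TRANS = {"E": {"E": 0.9, "I": 0.1},
         "I": {"E": 0.1, "I": 0.9}}
# Base composition per state: exons GC-rich, introns AT-rich (hypothetical).
EMIT = {"E": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}


def viterbi(seq: str) -> str:
    """Most probable state path for seq, one state letter per base."""
    seq = seq.upper()
    if not seq:
        return ""
    # Work with log probabilities to avoid underflow on long sequences.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = [{}]
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            col[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("GCGCGCGCGCATATATATAT"))
```

Gene-finding HMMs use many more states (start, donor, acceptor, stop, one state per codon position) and parameters estimated from known genes of the organism.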
23
Hidden Markov Models (HMM) for gene prediction
  • Basic probabilistic model of gene structure.

[Figure: state diagram with states E, EI, EF, SE, I, 5' and 3'.]
Signals: B = begin sequence; S = start translation;
A = acceptor site; D = donor site; T = stop
translation; F = end sequence
24
Neural Networks for gene prediction (1)
  • What are Neural Networks?
  • A neural network is a computer program that, given
    a training set of data containing a certain
    pattern, learns to recognize that pattern.
  • The name derives from the fact that originally
    they were intended to imitate the human brain.
  • Like brain cells, a neural network consists of
    decision-making units, each connected to other
    units with the same topology.

25
Neural Networks for gene prediction (2)
  • Artificial neurons: the nodes of the network

26
Neural Networks for gene prediction (3)
  • Weighting factor: a neuron receives many
    simultaneous inputs. Each input has its own
    relative weight (w).
  • Summation function: processing in the usual
    artificial neuron consists of computing the
    weighted sum of the inputs.
  • Transfer function: the result of the summation
    function is passed through a transfer function,
    which usually compares the weighted
    sum against some threshold value and may transfer
    no signal if the value is below the threshold.
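
The three steps can be written out for a single artificial neuron. This is a minimal sketch using a simple step transfer function; the function name `neuron` and the example weights are assumptions for illustration.

```python
def neuron(inputs, weights, threshold):
    """Weighted sum of the inputs followed by a threshold transfer function."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))  # summation function
    return 1 if weighted_sum >= threshold else 0                # transfer function


# Three inputs, e.g. scores from three different coding measures (hypothetical).
print(neuron([1, 0, 1], [0.5, 0.9, 0.5], threshold=1.0))
print(neuron([0, 1, 0], [0.5, 0.9, 0.5], threshold=1.0))
```

Training a network means adjusting the weights (and thresholds) until the outputs on the training set match the known answers.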

27
Neural Networks for gene prediction: GRAIL II
  • Neural Network of gene structure.

28
Gene prediction: Statistical methods, programs
  • Grail II
  • Genscan
  • GeneMark
  • Veil
  • GeneParser
  • FGENES

Any HMM or neural network method needs to be
trained on your model organism!!! Do not trust
the results of a single program, but rather compare
the gene structures proposed by different
programs.
29
How do you train learning programs?
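[Figure: training workflow. The whole genome sequence contains known genes and unknown genes. The known genes are split into two halves (1/2 and 1/2). One half is the training set (sequence and annotation), used to adapt the parameters of the program (weight functions, probabilities). The other half is the verification set (sequence only): the program with adapted parameters is run on it and should give a good prediction.]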
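
The split into a training set and a verification set, and a simple check of the predictions, can be sketched as follows. The function names, the fixed random seed and the exact-match comparison of gene coordinates are all assumptions made for this example; real evaluations also score partial overlaps and exon-level accuracy.

```python
import random


def split_known_genes(known_genes, seed=0):
    """Shuffle the known genes and split them into two halves."""
    genes = list(known_genes)
    random.Random(seed).shuffle(genes)
    half = len(genes) // 2
    return genes[:half], genes[half:]  # training set, verification set


def sensitivity(predicted, annotated):
    """Fraction of annotated genes that were predicted exactly."""
    annotated = set(annotated)
    if not annotated:
        return 0.0
    return len(annotated & set(predicted)) / len(annotated)


training, verification = split_known_genes(range(10))
print(len(training), len(verification))
print(sensitivity([(1, 10), (20, 30)], [(1, 10), (40, 50)]))
```

Keeping the verification genes out of training is what makes the measured accuracy meaningful for genes the program has never seen.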
30
Gene prediction: Methods
  • Gene prediction can be based upon:
  • Coding statistics
  • Gene structure/Statistical approaches
  • Comparison/homology

31
EST alignment to predict intron/exon boundaries
  • EST: Expressed Sequence Tag. cDNA is produced
    from mRNA and sequenced.
  • Very powerful
  • If several ESTs are available, allows the
    identification of alternative splicing products
  • Programs: EST-GENOME, GeneSeqer
  • BUT
  • EST sequences are usually of very poor quality
    (sequence errors)
  • EST sequences are often contaminated
  • Presence of an EST sequence depends on
    expression (level, tissue...)

32
Gene prediction: sequence conservation
  • Between organisms, protein sequences
    can be conserved (homology). This conservation is
    detectable only in the coding regions.
  • Database search programs such as BLAST or tFasta
    can be used to search the DNA sequence against a
    protein database. The DNA sequence is translated
    in all six frames and each translation is searched
    individually against the database.
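
The six-frame translation step can be sketched as follows, assuming the standard genetic code. The frame labels "+1" to "-3" and the function names are choices made for this example; the database search itself is not shown.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate codon by codon; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    rc = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(name, protein)
```

Each of the six protein strings is then used as a query against the protein database; a strong hit in one frame points to a gene on that strand and in that frame.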

[Annotated BLAST output. Callouts: homologous protein name; expect value; coordinates of the hit on the DNA sequence; DNA frame where the hit was found. Conclusion: "Here must be a gene!!!"]

>YMR272C GENE YMR272C CHR. XIIIC REV FROM 209623 TO 210777
          Length = 384

 Score = 485 bits (1248), Expect = e-137
 Identities = 232/383 (60%), Positives = 274/383 (70%), Gaps = 4/383 (1%)
 Frame = +3

Query: 3708 SKMVSKTLPLYSKATLQKHTDRTSCWVSVGNRKIYDVSQFLDEHPGGDQYILDYAGKDIT 3887
Sbjct: 2    STNTSKTLELFSKKTVQEHNTANDCWVTYQNRKIYDVTRFLSEHPGGDESILDYAGKDIT 61

Query: 3888 AVLKDKLIHEHTEAAYEILDESYLVGYLATEEEEIKLLTNEKHVMEVTPE----NLDTTT 4055
Sbjct: 62   EIMKDSDVHEHSDSAYEILEDEYLIGYLATDEEAARLLTNKNHKVEVQLSADGTEFDSTT 121

Query: 4056 FVKELPAEEVLSVATDFGTDYTKHHFLDLNKPLLMQVLRGNFTRDFYIDQIHRPRHYGKG 4235
Sbjct: 122  FVKELPAEEKLSIATDYSNDYKKHKFLDLNRPLLMQILRSDFKKDFYVDQIHRPRHYGKG 181
33
Comparative genomics approach to annotation
  • Ashbya/Yeast as an example of synteny.

34
Gene prediction in higher eukaryotes: Take-home
message
  • The problem: INTRONS. The detection of the
    numerous introns in higher eukaryotic genes is
    difficult
  • It does not help to search for ORFs
  • There are often many introns per gene
  • The intron splice sites do not always have a
    strict consensus.
  • The existence of alternative splicing makes
    things even more difficult.
  • The potential solutions:
  • Base the gene prediction on homology (ESTs or
    related species).
  • Exon/intron prediction programs (HMM or neural
    network based) are trained on known gene
    sequences to recognize intron/exon boundaries.
    They can then be used to search new sequences.
  • None of the methods is good alone.
  • Very often a combination of all these methods is
    used to increase the accuracy, but gene
    prediction in higher eukaryotes is still a challenge.

35
Genome annotation and submission tools
  • Oak Ridge Genome Annotation Channel
    (http://compbio.ornl.gov/channel/)
  • ENSEMBL (http://ensembl.ebi.ac.uk)
  • Artemis (http://www.sanger.ac.uk/Software/Artemis):
    sequence viewer and annotation tool
  • GeneQuiz (http://www.sander.ebi.ac.uk/genequiz/):
    system for automated annotation of sequences, web
    access required
  • Genome Annotation Assessment Project (GASP1)
    (http://www.fruitfly.org/GASP1)
  • Sequin submission tool:
    ftp://ftp.ebi.ac.uk/pub/software/sequin/

37
SEQUIN Submission System
  • Multi-platform (Mac/PC/Unix) stand-alone software
    tool
  • Allows direct submissions to EMBL, GenBank and
    DDBJ
  • Available from EBI:
    ftp://ftp.ebi.ac.uk/pub/software/sequin/
  • Free

38
Artemis
  • Multi-platform (Mac/PC/Unix) stand-alone software
    tool
  • Nice visualization of the annotation.
  • Easy extraction of the data.
  • Available from the Sanger Centre:
    http://www.sanger.ac.uk/Software/Artemis/
  • Free

39
Finding tRNA genes using tRNAscan-SE
  • Availability
  • Web search:
    http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
  • UNIX source code also available at that address.
  • Prediction is based on
  • Identification of RNA pol III intragenic
    promoters
  • Secondary structure prediction

>AGCHR1_3 agchr1_3.seq Continuation (3 of 7) of agchr1 from base 200001
GCTACTCCGGGCCCAAATGAAGGAAGAAGTTGAAAAGGTGTTCAGGAGACATGGCGGTAT
CGAGAACAATGAACCACCCATTATTTTCCCCAAAGCTCCATTCTACTCGTCTCAAAATGT
GTATGAGGTATTGGATAGAGGGGGTTCTGTGTTGCAGCTGCAATATGATTTAACGTACCC
TATGGCGCGCTATCTTTCTAAGAACCCTCATTGCATATCAAAACAGTACAGAATGCAGTC
AGTATACCGCCCAGCAGAACAGCAGCATGGCAGCGTTGAACCACGAAGATTCGGAGAAAT
AGATTTTGATATTATATCTGGATCATCTGCGGATTCAGCTTTATACGACGCTGAAAGTAT
TAAAATCATTGATGAACTGATATCAGTGTTTCCTGTCTTCGAAAAGACTAATACTTTGAT
TATTGTGAATCACTCAGATATTATGGAAAGTATCTTCAACCTTTGTTCTATTGATAAAGC
CCAACGTTCCCTCGTATCTCAGATGCTGTCTCAGGTTGGCTTTGCCAAGTCGTTTAAAGA
TGTCAAAACCGAGCTGAAGGCCCAGTTAAATATATCTTCTACCTCCTTGAACGATTTGGA
GATGTTCGATTTCAAGGTGGATTTTGACAATGCAAAAAAGAGGCTCAACAAACTGATGAT
CGATAGTCCGCACCTAACCAAGGTTGAGGAATCGCTTTCATATATATTCAAAGTGTTGAA
CTTCCTGAAGCCTCTTGGTGTAACACGAAATGTGGTGGTATCCCCGTTAAGCAATTATAA
CAGTGCCTTCTACAAGGGCGGCATCATGTTCCAGGCCATATACGATAGCGGCCGTGTAAA
AAGTTTGTTGGCAGCTGGTGGACGTTACGATAATTTGATTTCTTACATTGCAAGGCCATC

Sequence           tRNA     Bounds    tRNA  Anti   Intron Bounds   Cove
Name       tRNA #  Begin    End       Type  Codon  Begin   End     Score
--------   ------  ----     ------    ----  -----  -----   ----    ------
AGCHR1_3   1       84548    84659     Leu   CAA    84586   84615   54.86
AGCHR1_3   2       105389   105459    Gly   GCC    0       0       62.63
AGCHR1_3   3       83748    83656     Phe   GAA    83711   83692   68.65
AGCHR1_3   4       53864    53792     Val   CAC    0       0       76.92
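
A result table like this one can be read into records for further processing. The sketch below only parses whitespace-separated text of the layout shown above; it is not part of tRNAscan-SE, and the class and field names are made up for this example. It assumes that a hit with Begin greater than End lies on the reverse strand and that intron bounds of 0 mean no intron.

```python
from typing import NamedTuple


class TrnaHit(NamedTuple):
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron_begin: int
    intron_end: int
    cove_score: float

    @property
    def strand(self) -> str:
        # Assumption: begin > end marks a hit on the reverse strand.
        return "+" if self.begin <= self.end else "-"

    @property
    def has_intron(self) -> bool:
        # Assumption: intron bounds of 0 mean that no intron was predicted.
        return self.intron_begin != 0


def parse_trnascan_rows(text: str) -> list:
    """Parse data rows; header and separator lines are skipped."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        name, num, begin, end, aa, anti, ib, ie, score = fields
        hits.append(TrnaHit(name, int(num), int(begin), int(end),
                            aa, anti, int(ib), int(ie), float(score)))
    return hits


table = """\
AGCHR1_3 1 84548 84659 Leu CAA 84586 84615 54.86
AGCHR1_3 2 105389 105459 Gly GCC 0 0 62.63
AGCHR1_3 3 83748 83656 Phe GAA 83711 83692 68.65
AGCHR1_3 4 53864 53792 Val CAC 0 0 76.92
"""
for hit in parse_trnascan_rows(table):
    print(hit.trna_type, hit.anticodon, hit.strand, hit.has_intron)
```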
40
After gene prediction and structural annotation...
MGWCDSLAIVTSI...
  • Endless strings of four-lettered DNA can be
    translated to twenty-lettered proteins but other
    as yet unknown translations will be necessary to
    convert this alphabetical soup to biology
  • S. Fields

...Functional annotation
41
Part II: Functional annotation is the
description of:
  • Function(s) of the protein
  • Post-translational modification(s)
  • Domains and sites
  • Secondary structure
  • Quaternary structure
  • Similarities to other proteins
  • Diseases associated with deficiencies in the
    protein
  • Sequence conflicts, variants, etc.

42
Functional annotation sources
  • Publications that report experimental data
  • Protein sequence analysis
  • Search for characteristic domains (patterns in
    protein sequences found in all proteins carrying
    out the same function: DNA-binding domain, kinase
    domain, transmembrane domain)
  • Comparison with other, related sequenced
    organisms
  • Homology to protein of known function
  • Experimental data (see functional genomics
    lecture)
  • Expression studies
  • Biochemical studies
  • 3D structure determination
  • Loss of function phenotype

43
From sequence to function
44
Example of annotation pipeline
NB: look out for multi-domain proteins; put into
genome context.
Supplement with manual curation and use evidence
tags.
45
Example of InterPro domain search
In all these proteins, the InterPro domain
IPR002198 (Short-chain dehydrogenase/reductase,
SDR) was identified. The short-chain
dehydrogenases/reductases family (SDR) is a
very large family of enzymes, most of which are
known to be NAD- or NADP-dependent
oxidoreductases.
46
Limits of annotation
  • Databases are biased in sequence and AA
    composition and search is dependent on size
  • If no homology is found, only a limited amount of
    information can be inferred
  • Incorrect functional annotation can be propagated
    very fast. If a functional annotation is wrong,
    then all the proteins with homology to that
    protein discovered afterwards will have a wrong
    functional annotation.
  • No answers to tissue-specificity, binding of
    ligands, relationship between genotype and
    phenotype

47
IMPORTANT TO NOTE
  • DON'T COMPLETELY TRUST COMPUTER RESULTS
  • CHECK LITERATURE
  • CONFIRM WITH WETLAB WORK