Computational Biology: Genome annotation formats - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

Computational Biology: Genome annotation formats

Description:

Department of Bioengineering. University of California, Berkeley ... In which format can a genome annotation be saved to files? ... – PowerPoint PPT presentation

Number of Views:59

Avg rating:3.0/5.0

Slides: 29

Provided by: biow

Category:

more less

Transcript and Presenter's Notes

Title: Computational Biology: Genome annotation formats

1
Computational BiologyGenome annotation formats

October 2004
Ian Holmes
Department of Bioengineering
University of California, Berkeley
From an original lecture by Irmtraud Meyer

2
Overview

What is genome annotation?
In which format can a genome annotation be saved
to files?
Definition of the gff genome annotation format
Other genome annotation formats
Application evaluating the performance of a
gene prediction program
Exercises

3
What is genome annotation ?

genome annotation is the localisation of
functional elements in a genomic sequence
For example the location of
protein coding genes
tRNA and other RNA genes
promoters
...

4
Example 1 protein coding genes

Unannotated DNA
5'
3'
Annotated DNA
Legend
Exon (protein coding)
Intron
Intergenic sequence
5
Formats for saving annotations
Example 1 DNA with protein coding genes

Motivation
To save information on a gene, a format should be
able to record
the location of the gene in the genome
the position of its exon-intron boundaries
the strand of DNA on which the gene lies
the source of annotation
the completeness of the gene structure

6
The GFF format

GFF Genefinding File Format
a format used to save gene structures
idea divide gene into its constituents
Exon transcribed sections of a gene
CDS translated sections of a gene
Start_Codon
Stop_Codon

7
The GFF format

Gene structure
Splicing pattern
Exons
1
2
3
5
4
CDS
Start_Codon
Stop_Codon

The information on each Exon, CDS, Start_Codon
and Stop_Codon corresponds to one line within
the gff file

8
The GFF format

Format of each gff-line
name source feature start end score strand
frame group
where
name the name of sequence (string)
source the name of the source of annotation
(string)
feature feature type Exon, CDS,
Start_Codon, Stop_Codon (string)
start start position of feature (integer)
end end position of feature (integer)
score score (rational number) associated with
feature, set to . if score not used
strand strand on which feature lies, possible
values are or -
frame 0, 1 or 2 for CDS, Start_Codon and
Stop_Codon, . for Exon

9
The GFF format remarks

the fields in a gff line are tab delimited
start lt end (important to keep in mind when
dealing with genes on the reverse strand !)
the start and end positions are the corresponding
positions on the strand
definition of frame for CDS, Start_Codon and
Stop_Codon features
0 first nucleotide in feature has codon
position 0
1 first nucleotide in feature has codon
position 2
2 first nucleotide in feature has codon
position 1
gt note that the frame of a CDS is NOT its
length modulo 3 and that the frame of a
Start_Codon and Stop_Codon always has to be 0
(Why ?)
Exons do not have a frame, use . as the value
of their frame
if there is no score associated with a feature,
use .

1
0
2
0
2
1
2
1
0
10
The GFF format more remarks

the terminal CDS does not comprise the positions
of the Stop_Codon as the Stop_Codon is not
translated
the initial CDS does comprise the positions of
the Start_Codon as it is translated
the order of lines in a gff file is irrelevant
although it makes sense to group them by genes

11
The GFF format

Example 2
A valid description of this gene in gff format
is for example
Chr1 src Exon 150 200 . . gene_id 1
transcript_id 1 exon_number 1
Chr1 src Exon 300 401 . . gene_id 1
transcript_id 1 exon_number 2
Chr1 src CDS 380 401 . 0 gene_id 1
transcript_id 1 exon_number 2
Chr1 src Exon 501 650 . . gene_id 1
transcript_id 1 exon_number 3
Chr1 src CDS 501 650 . 2 gene_id 1
transcript_id 1 exon_number 3
Chr1 src Exon 700 800 . . gene_id 1
transcript_id 1 exon_number 4
Chr1 src CDS 700 707 . 2 gene_id 1
transcript_id 1 exon_number 4
Chr1 src Exon 900 1000 . . gene_id 1
transcript_id 1 exon_number 5
Chr1 src Start_Codon 380 382 . 0 gene_id
1 transcript_id 1 exon_number 2
Chr1 src Stop_Codon 708 709 . 0 gene_id 1
transcript_id 1 exon_number 4

Towards larger numbers
extra information in the "group" field
12
The GFF format

Example 3 a gene on the reverse strand
The valid description of this gene in gff format
is for example
Chr22 src Exon 649 700 . - . gene_id 1
transcript_id 1 exon_number 1
Chr22 src CDS 649 700 . - 0 gene_id 1
transcript_id 1 exon_number 1
Chr22 src Exon 351 500 . - . gene_id 1
transcript_id 1 exon_number 2
Chr22 src CDS 351 500 . - 2 gene_id 1
transcript_id 1 exon_number 2
Chr22 src Exon 150 250 . - . gene_id 1
transcript_id 1 exon_number 3
Chr22 src CDS 153 250 . - 2 gene_id 1
transcript_id 1 exon_number 3
Chr22 src Start_Codon 698 700 . - 0 gene_id
1 transcript_id 1 exon_number 1
Chr22 src Stop_Codon 150 152 . - 0 gene_id
1 transcript_id 1 exon_number 3

Towards larger numbers

13
Other genome annotation formats

DAS XML version of GFF
uses tags to delimit fields, not whitespace
a lirrle more structured
GAME Genome Annotation Markup Elements
The format definition can be found at
http//www.bioxml.org/Projects/game

14
Uses of a genome annotation format

exchanging annotation information
checking an annotation
comparing differrent annotations
visualising an annotation, see for example
www.ensembl.org

15
Evaluating the performance of a gene prediction
program
Motivation need a measure to evaluate the
quality of a gene prediction and to compare the
quality of different gene prediction or gene
annotation methods Idea compare a set
of known genes (annotation) to a set predicted
genes (prediction) by comparing them on three
different levels - nucleotide level fine
scale compare nucleotides - intermediate
level medium scale compare entire CDS, start
and stop codons - gene level coarse
scale compare entire gene structures

16
Evaluation on different levels

Evaluation on gene level

Annotation
Prediction
Tp
Tp(overlapping)
Fn
Fp

Evaluation on intermediate level, for example
for the CDS label

Annotation
Prediction
Tp
Tp(overlapping)
Fn
Fp

Definitions
Tp true positive, Tn true negative, Fp
false positive, Fn false negative

17
Evaluation on different levels (cont'd)

Evaluation on nucleotide level, for example for
the CDS label

Annotation
Prediction
Fp
Tp
Tn
Fn
18
Measures of performance

For a given entity and label one can compute
sensitivity ( tp) / ( tp fn)
the fraction of annotated entities which were
correctly predicted
specificity ( tp) / ( tp fp)
the fraction of predicted entities which are
correct
missing ( fn) / ( tp fn)
the fraction of annotated entities which are
missing in the prediction
overlapping_1 ( tp(overlapping)) / ( tp
fn)
the fraction of annotated entities which are
overlapped by a predicted entity
overlapping_2 ( tp(overlapping)) / ( tp
fp)
the fraction of predicted entities which are
overlapping an annotated entity
wrong ( fp) / ( tp fp)
the fraction of predicted entities which do not
overlap any annotated entity
Where a label is for example CDS or
Start_Codon and an entity can be either a
nucleotide, a CDS, Exon, Start_Codon, Stop_Codon
or an entire gene

19
Exercises

1.) Check that you can reproduce the frames of
the CDS lines in example 3 knowing the
positions of the CDSs, the start codon and
the stop codon.
2.) What do the terms ( tp fp) and ( tp
fn) stand for ?
3.) Looking at a gff entry of a gene, can you
deduce if the annotation of the gene is
complete ?
4.) In which interval of numbers do the values
of sensitivity and specificity fall ?

20
Exercises

5.) This exercise prepares you for the
practicals following this lecture
You are collaborating with colleagues abroad who
send you a gff file with the genes of
their genome annotation as well as a fasta file
with the corresponding genome sequence.
a) How do you check the gff file for errors ?
Which checks can you think of ?
b) Outline the structure of (i.e. write the
pseudocode for) a program which checks the gff
file for errors.
6.) You are given a gff file with an annotation
predicted by a gene prediction program.
a) Which information do you require to evaluate
the performance of the gene prediction program ?
b) Outline the structure of a program which
evaluates the performance of a gene prediction
program by comparing the predicted genes
(contained in gff format in file 1) to the known
genes (contained in gff format in file 2) (see
example 4).

21
Answers to exercises

1.) look at gff lines with features CDS and
start codon in example 3
- CDS with exon_number 1 is the initial i.e.
5'-most CDS of the gene as it starts
with a start codon
- the initial CDS has length 700 649 1
52 17 3 1
gt the next CDS with exon_number 2 starts
with codon position 1
gt the next CDS has frame 2
- the second CDS has length 500 351 1
150 50 3
gt the next CDS with exon_number 3 start
with the same codon position
gt the next CDS has frame 2
2.) ( tp fp) is the number of predicted
features
( tp fn) is the number of annotated
features
3.) A gff entry to a gene only tells you if the
protein coding part of the gene is
complete. If the gff entry comprises start and
stop codon of the gene, its protein
coding part is complete. A gff entry does not
show if the information
on the untranslated exons is complete.

22
Answers to exercises

4.) The values for sensitivity (( tp) / ( tp
fn)) and specificity (( tp) / ( tp fp))
lie between 0 and 1. The sensitivity is 1
only if ( fn) 0 and the specificity is 1 only
if ( fp) 0.

23
Answer to exercise 5

Note This exercise is about checking the
annotation given in gff format, NOT the gff
format itself !
a) checking the annotation in the gff file is
best done if the corresponding DNA
sequences are available as this allows
more checks to be performed,
so for the practical you can assume that
you are given a gff and the corresponding
fasta file containing the DNA sequences
- possible checks of the annotation are
-Is the start codon correct (if it exists) ?
- Is the stop codon correct (if it
exists) ?
- Are there no in-frame stop codons
within the CDS ?
- Do the splice sites look fine ?
- For complete genes Is the sum of CDS
lengths a multiple of 3 ?

24
Answer to exercise 5 (cont'd)

b) For the program which checks the annotation
you may assume the following which
you do not have to check
- sequences names in the fasta file are
unique
- use of gff format is correc
You may assume in your program, but should
check the following
- DNA sequences consist of A,C,G,T letters
only
- all genes are complete, ie comprise a
start and stop codon
- splice sites are either GTAG
(consensus) or GCAG
- there is exactly one gene associated
with each fasta file sequence
Some things to keep in mind
- genes can lie on the forward or the
reverse - strand
- the DNA sequences in the fasta file are
the strand sequences
- the coordinates in the fasta and the gff
file are absolute coordinates, but in your
program you may prefer to make some
calculatations in relative coordinates (ie the
first sequence position being 1 and the
last being length_of_sequence

25
Pseudocode (outline of the program)

1.) read all of the fasta file and get all DNA
sequences and headers
2.) for each entry in the fasta file
a) check fasta entry
i) length of DNA sequence equals
length indicated in header ?
If not, report error and go to next
sequence ( rerrgonext)
ii) DNA sequence consists of A, C, G,
T letter only ? If not, rerrgonext.
b) read gff lines for that sequence name
i) check gff lines exist if not,
rerrgonext
ii) check there is exactly one gene
associated with fasta entry if not, rerrgonext
Iii) check if gene is complete if not,
rerrgonext
iv) check if sum of CDS lengths
multiple of 3 if not, rerrgonext
v) check if start codon correct if
not, report error
vi) check if stop codon correct if
not, report error
vii) check that there are no in-frame
stop codons if there are any, report error
viii) if relevant, check if splice
sites are ok if not, report error

26
Info on input files and functions

All files can be found on my web-page URL HERE
There are two sets of gff-fasta file pairs,
the input files to your check program -
ce.gff and ce.fasta a set of c. elegans genes
whose annotation is correct - mh.gff and
mh.fasta a set of mouse and human genes whose
annotation is partly corrupt (aim compile a
list of errors) There is a file with
predefined functions and hashes that you may use
in your program, if you like -
functions.pl
27
Remark about the fasta header lines

Header format (non standard)
gtseqname start_position-sto
p_position strand
Example of a fasta entry
gtMM.U35323.Lmp2.20 47366-53419 reverse
catccatgcaggaagaaaaaaaaaaaaccacatacaatgaaagtaaacaa
atctataaaa
ttaaaaagaagtcaacccacagtgaccgctcatatctgagggcttttctc
ccagggtgct
ccgctctctttctctggagaaagtgtttggcatgggaataaagaacaaga
aggtgttcaa
tttaatgtcgaacttcacagcttccattttaaatctcagttcttactgtc
tagaagttca
ctgagtctcgcagactttgaatgtggttccctttgtgtggatgctttttt
atttatttgc
....
acgtggaggg
So, this DNA sequence which is by default the
sequence on the forward strand starts with
position 47366 and ends at position 53419, ie has
length 53419 - 47366 1 6054.
The reverse in the header line indicates that
the corresponding gene in the gff file lies on
the reverse - strand.

28
Answer to exercise 5b

As always, there are many ways of writing a
program which makes the required checks
One solution to exercise 5b which uses the
function in functions.pl can be found on
URL HERE
see
check_gff_and_fasta.pl
On this web page, you can also find the two
output files that this program produces on the ce
and the mh set of genes so that you know which
errors your program should have found.