Title: Computational Biology: Genome annotation formats
1Computational BiologyGenome annotation formats
- October 2004
- Ian Holmes
- Department of Bioengineering
- University of California, Berkeley
- From an original lecture by Irmtraud Meyer
2Overview
- What is genome annotation?
- In which format can a genome annotation be saved
to files? - Definition of the gff genome annotation format
- Other genome annotation formats
- Application evaluating the performance of a
gene prediction program - Exercises
3What is genome annotation ?
- genome annotation is the localisation of
functional elements in a genomic sequence - For example the location of
- protein coding genes
- tRNA and other RNA genes
- promoters
- ...
4Example 1 protein coding genes
Unannotated DNA
5'
3'
Annotated DNA
Legend
Exon (protein coding)
Intron
Intergenic sequence
5Formats for saving annotations
Example 1 DNA with protein coding genes
- Motivation
- To save information on a gene, a format should be
able to record - the location of the gene in the genome
- the position of its exon-intron boundaries
- the strand of DNA on which the gene lies
- the source of annotation
- the completeness of the gene structure
-
6The GFF format
- GFF Genefinding File Format
- a format used to save gene structures
- idea divide gene into its constituents
- Exon transcribed sections of a gene
- CDS translated sections of a gene
- Start_Codon
- Stop_Codon
7The GFF format
Gene structure
Splicing pattern
Exons
1
2
3
5
4
CDS
Start_Codon
Stop_Codon
- The information on each Exon, CDS, Start_Codon
and Stop_Codon corresponds to one line within
the gff file
8The GFF format
- Format of each gff-line
- name source feature start end score strand
frame group - where
- name the name of sequence (string)
- source the name of the source of annotation
(string) - feature feature type Exon, CDS,
Start_Codon, Stop_Codon (string) - start start position of feature (integer)
- end end position of feature (integer)
- score score (rational number) associated with
feature, set to . if score not used - strand strand on which feature lies, possible
values are or - - frame 0, 1 or 2 for CDS, Start_Codon and
Stop_Codon, . for Exon
9The GFF format remarks
- the fields in a gff line are tab delimited
- start lt end (important to keep in mind when
dealing with genes on the reverse strand !) - the start and end positions are the corresponding
positions on the strand - definition of frame for CDS, Start_Codon and
Stop_Codon features - 0 first nucleotide in feature has codon
position 0 - 1 first nucleotide in feature has codon
position 2 - 2 first nucleotide in feature has codon
position 1 - gt note that the frame of a CDS is NOT its
length modulo 3 and that the frame of a
Start_Codon and Stop_Codon always has to be 0
(Why ?) - Exons do not have a frame, use . as the value
of their frame - if there is no score associated with a feature,
use .
1
0
2
0
2
1
2
1
0
10The GFF format more remarks
- the terminal CDS does not comprise the positions
of the Stop_Codon as the Stop_Codon is not
translated - the initial CDS does comprise the positions of
the Start_Codon as it is translated - the order of lines in a gff file is irrelevant
although it makes sense to group them by genes
11The GFF format
- Example 2
-
- A valid description of this gene in gff format
is for example - Chr1 src Exon 150 200 . . gene_id 1
transcript_id 1 exon_number 1 - Chr1 src Exon 300 401 . . gene_id 1
transcript_id 1 exon_number 2 - Chr1 src CDS 380 401 . 0 gene_id 1
transcript_id 1 exon_number 2 - Chr1 src Exon 501 650 . . gene_id 1
transcript_id 1 exon_number 3 - Chr1 src CDS 501 650 . 2 gene_id 1
transcript_id 1 exon_number 3 - Chr1 src Exon 700 800 . . gene_id 1
transcript_id 1 exon_number 4 - Chr1 src CDS 700 707 . 2 gene_id 1
transcript_id 1 exon_number 4 - Chr1 src Exon 900 1000 . . gene_id 1
transcript_id 1 exon_number 5 - Chr1 src Start_Codon 380 382 . 0 gene_id
1 transcript_id 1 exon_number 2 - Chr1 src Stop_Codon 708 709 . 0 gene_id 1
transcript_id 1 exon_number 4
Towards larger numbers
extra information in the "group" field
12The GFF format
- Example 3 a gene on the reverse strand
-
- The valid description of this gene in gff format
is for example -
- Chr22 src Exon 649 700 . - . gene_id 1
transcript_id 1 exon_number 1 - Chr22 src CDS 649 700 . - 0 gene_id 1
transcript_id 1 exon_number 1 - Chr22 src Exon 351 500 . - . gene_id 1
transcript_id 1 exon_number 2 - Chr22 src CDS 351 500 . - 2 gene_id 1
transcript_id 1 exon_number 2 - Chr22 src Exon 150 250 . - . gene_id 1
transcript_id 1 exon_number 3 - Chr22 src CDS 153 250 . - 2 gene_id 1
transcript_id 1 exon_number 3 - Chr22 src Start_Codon 698 700 . - 0 gene_id
1 transcript_id 1 exon_number 1 - Chr22 src Stop_Codon 150 152 . - 0 gene_id
1 transcript_id 1 exon_number 3
Towards larger numbers
13Other genome annotation formats
- DAS XML version of GFF
- uses tags to delimit fields, not whitespace
- a lirrle more structured
- GAME Genome Annotation Markup Elements
- The format definition can be found at
http//www.bioxml.org/Projects/game
14Uses of a genome annotation format
- exchanging annotation information
- checking an annotation
- comparing differrent annotations
- visualising an annotation, see for example
www.ensembl.org
15Evaluating the performance of a gene prediction
program
Motivation need a measure to evaluate the
quality of a gene prediction and to compare the
quality of different gene prediction or gene
annotation methods Idea compare a set
of known genes (annotation) to a set predicted
genes (prediction) by comparing them on three
different levels - nucleotide level fine
scale compare nucleotides - intermediate
level medium scale compare entire CDS, start
and stop codons - gene level coarse
scale compare entire gene structures
16Evaluation on different levels
Annotation
Prediction
Tp
Tp(overlapping)
Fn
Fp
- Evaluation on intermediate level, for example
for the CDS label
Annotation
Prediction
Tp
Tp(overlapping)
Fn
Fp
- Definitions
- Tp true positive, Tn true negative, Fp
false positive, Fn false negative
17Evaluation on different levels (cont'd)
- Evaluation on nucleotide level, for example for
the CDS label
Annotation
Prediction
Fp
Tp
Tn
Fn
18Measures of performance
- For a given entity and label one can compute
- sensitivity ( tp) / ( tp fn)
- the fraction of annotated entities which were
correctly predicted - specificity ( tp) / ( tp fp)
- the fraction of predicted entities which are
correct - missing ( fn) / ( tp fn)
- the fraction of annotated entities which are
missing in the prediction - overlapping_1 ( tp(overlapping)) / ( tp
fn) - the fraction of annotated entities which are
overlapped by a predicted entity - overlapping_2 ( tp(overlapping)) / ( tp
fp) - the fraction of predicted entities which are
overlapping an annotated entity - wrong ( fp) / ( tp fp)
- the fraction of predicted entities which do not
overlap any annotated entity - Where a label is for example CDS or
Start_Codon and an entity can be either a
nucleotide, a CDS, Exon, Start_Codon, Stop_Codon
or an entire gene
19Exercises
- 1.) Check that you can reproduce the frames of
the CDS lines in example 3 knowing the
positions of the CDSs, the start codon and
the stop codon. - 2.) What do the terms ( tp fp) and ( tp
fn) stand for ? - 3.) Looking at a gff entry of a gene, can you
deduce if the annotation of the gene is
complete ? - 4.) In which interval of numbers do the values
of sensitivity and specificity fall ?
20Exercises
- 5.) This exercise prepares you for the
practicals following this lecture
You are collaborating with colleagues abroad who
send you a gff file with the genes of
their genome annotation as well as a fasta file
with the corresponding genome sequence. - a) How do you check the gff file for errors ?
Which checks can you think of ? - b) Outline the structure of (i.e. write the
pseudocode for) a program which checks the gff
file for errors. - 6.) You are given a gff file with an annotation
predicted by a gene prediction program. - a) Which information do you require to evaluate
the performance of the gene prediction program ? - b) Outline the structure of a program which
evaluates the performance of a gene prediction
program by comparing the predicted genes
(contained in gff format in file 1) to the known
genes (contained in gff format in file 2) (see
example 4).
21Answers to exercises
- 1.) look at gff lines with features CDS and
start codon in example 3 - - CDS with exon_number 1 is the initial i.e.
5'-most CDS of the gene as it starts
with a start codon - - the initial CDS has length 700 649 1
52 17 3 1 - gt the next CDS with exon_number 2 starts
with codon position 1 - gt the next CDS has frame 2
- - the second CDS has length 500 351 1
150 50 3 - gt the next CDS with exon_number 3 start
with the same codon position - gt the next CDS has frame 2
- 2.) ( tp fp) is the number of predicted
features - ( tp fn) is the number of annotated
features - 3.) A gff entry to a gene only tells you if the
protein coding part of the gene is
complete. If the gff entry comprises start and
stop codon of the gene, its protein
coding part is complete. A gff entry does not
show if the information - on the untranslated exons is complete.
22Answers to exercises
- 4.) The values for sensitivity (( tp) / ( tp
fn)) and specificity (( tp) / ( tp fp)) - lie between 0 and 1. The sensitivity is 1
only if ( fn) 0 and the specificity is 1 only
if ( fp) 0.
23Answer to exercise 5
- Note This exercise is about checking the
annotation given in gff format, NOT the gff - format itself !
- a) checking the annotation in the gff file is
best done if the corresponding DNA
sequences are available as this allows
more checks to be performed, - so for the practical you can assume that
you are given a gff and the corresponding - fasta file containing the DNA sequences
- - possible checks of the annotation are
- -Is the start codon correct (if it exists) ?
- - Is the stop codon correct (if it
exists) ? - - Are there no in-frame stop codons
within the CDS ? - - Do the splice sites look fine ?
- - For complete genes Is the sum of CDS
lengths a multiple of 3 ? -
24Answer to exercise 5 (cont'd)
- b) For the program which checks the annotation
you may assume the following which - you do not have to check
- - sequences names in the fasta file are
unique - - use of gff format is correc
- You may assume in your program, but should
check the following - - DNA sequences consist of A,C,G,T letters
only - - all genes are complete, ie comprise a
start and stop codon - - splice sites are either GTAG
(consensus) or GCAG - - there is exactly one gene associated
with each fasta file sequence - Some things to keep in mind
- - genes can lie on the forward or the
reverse - strand - - the DNA sequences in the fasta file are
the strand sequences - - the coordinates in the fasta and the gff
file are absolute coordinates, but in your
program you may prefer to make some
calculatations in relative coordinates (ie the - first sequence position being 1 and the
last being length_of_sequence
25Pseudocode (outline of the program)
- 1.) read all of the fasta file and get all DNA
sequences and headers - 2.) for each entry in the fasta file
- a) check fasta entry
- i) length of DNA sequence equals
length indicated in header ? - If not, report error and go to next
sequence ( rerrgonext) - ii) DNA sequence consists of A, C, G,
T letter only ? If not, rerrgonext. - b) read gff lines for that sequence name
- i) check gff lines exist if not,
rerrgonext - ii) check there is exactly one gene
associated with fasta entry if not, rerrgonext - Iii) check if gene is complete if not,
rerrgonext - iv) check if sum of CDS lengths
multiple of 3 if not, rerrgonext - v) check if start codon correct if
not, report error - vi) check if stop codon correct if
not, report error - vii) check that there are no in-frame
stop codons if there are any, report error - viii) if relevant, check if splice
sites are ok if not, report error
26Info on input files and functions
All files can be found on my web-page URL HERE
There are two sets of gff-fasta file pairs,
the input files to your check program -
ce.gff and ce.fasta a set of c. elegans genes
whose annotation is correct - mh.gff and
mh.fasta a set of mouse and human genes whose
annotation is partly corrupt (aim compile a
list of errors) There is a file with
predefined functions and hashes that you may use
in your program, if you like -
functions.pl
27Remark about the fasta header lines
- Header format (non standard)
- gtseqname start_position-sto
p_position strand - Example of a fasta entry
- gtMM.U35323.Lmp2.20 47366-53419 reverse
- catccatgcaggaagaaaaaaaaaaaaccacatacaatgaaagtaaacaa
atctataaaa - ttaaaaagaagtcaacccacagtgaccgctcatatctgagggcttttctc
ccagggtgct - ccgctctctttctctggagaaagtgtttggcatgggaataaagaacaaga
aggtgttcaa - tttaatgtcgaacttcacagcttccattttaaatctcagttcttactgtc
tagaagttca - ctgagtctcgcagactttgaatgtggttccctttgtgtggatgctttttt
atttatttgc - ....
- acgtggaggg
- So, this DNA sequence which is by default the
sequence on the forward strand starts with
position 47366 and ends at position 53419, ie has
length 53419 - 47366 1 6054. - The reverse in the header line indicates that
the corresponding gene in the gff file lies on
the reverse - strand. -
-
28 Answer to exercise 5b
- As always, there are many ways of writing a
program which makes the required checks - One solution to exercise 5b which uses the
function in functions.pl can be found on - URL HERE
- see
- check_gff_and_fasta.pl
-
- On this web page, you can also find the two
output files that this program produces on the ce
and the mh set of genes so that you know which
errors your program should have found.