Title: Computational Biology: Genome annotation formats


1
Computational Biology: Genome annotation formats
  • October 2004
  • Ian Holmes
  • Department of Bioengineering
  • University of California, Berkeley
  • From an original lecture by Irmtraud Meyer

2
Overview
  • What is genome annotation?
  • In which format can a genome annotation be saved
    to files?
  • Definition of the gff genome annotation format
  • Other genome annotation formats
  • Application: evaluating the performance of a
    gene prediction program
  • Exercises

3
What is genome annotation?
  • Genome annotation is the localisation of
    functional elements in a genomic sequence
  • For example, the location of:
    • protein coding genes
    • tRNA and other RNA genes
    • promoters
    • ...

4
Example 1: protein coding genes

[Figure: an unannotated DNA sequence (5' to 3') and the same sequence annotated. Legend: exon (protein coding), intron, intergenic sequence.]

5
Formats for saving annotations
Example 1: DNA with protein coding genes
  • Motivation
  • To save information on a gene, a format should be
    able to record:
    • the location of the gene in the genome
    • the position of its exon-intron boundaries
    • the strand of DNA on which the gene lies
    • the source of annotation
    • the completeness of the gene structure

6
The GFF format
  • GFF: Genefinding File Format
  • a format used to save gene structures
  • idea: divide a gene into its constituents
    • Exon: transcribed sections of a gene
    • CDS: translated sections of a gene
    • Start_Codon
    • Stop_Codon

7
The GFF format

[Figure: gene structure and splicing pattern of a gene with five exons, with its CDS, Start_Codon and Stop_Codon features marked.]
  • The information on each Exon, CDS, Start_Codon
    and Stop_Codon corresponds to one line within
    the gff file

8
The GFF format
  • Format of each gff line:
  • name source feature start end score strand
    frame group
  • where
    • name: the name of the sequence (string)
    • source: the name of the source of annotation (string)
    • feature: feature type, one of Exon, CDS, Start_Codon, Stop_Codon (string)
    • start: start position of feature (integer)
    • end: end position of feature (integer)
    • score: score (rational number) associated with feature, set to . if score not used
    • strand: strand on which feature lies, possible values are + or -
    • frame: 0, 1 or 2 for CDS, Start_Codon and Stop_Codon, . for Exon
    • group: extra information on the feature, such as gene_id, transcript_id and exon_number (string)
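
The nine fields listed above map naturally onto a small record type. The following is only a sketch in Python (the practical's own helper scripts are Perl), assuming strictly tab-delimited lines; the names GffFeature and parse_gff_line are invented for illustration and are not part of any GFF library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GffFeature:
    name: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the file has "."
    strand: str             # "+" or "-"
    frame: Optional[int]    # None when the file has "."
    group: str


def parse_gff_line(line: str) -> GffFeature:
    """Split one tab-delimited gff line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    name, source, feature, start, end, score, strand, frame, group = fields
    if strand not in ("+", "-"):
        raise ValueError(f"bad strand: {strand!r}")
    return GffFeature(
        name=name,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=None if frame == "." else int(frame),
        group=group,
    )


example = "Chr22\tsrc\tCDS\t351\t500\t.\t-\t2\tgene_id 1 transcript_id 1 exon_number 2"
feat = parse_gff_line(example)
print(feat.feature, feat.start, feat.end, feat.strand, feat.frame)
```

Mapping "." to None keeps "no score" and "no frame" distinct from a genuine value of 0.
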

9
The GFF format: remarks
  • the fields in a gff line are tab delimited
  • start < end (important to keep in mind when
    dealing with genes on the reverse strand!)
  • the start and end positions are the corresponding
    positions on the + strand
  • definition of frame for CDS, Start_Codon and
    Stop_Codon features:
    • 0: first nucleotide in feature has codon position 0
    • 1: first nucleotide in feature has codon position 2
    • 2: first nucleotide in feature has codon position 1
  • => note that the frame of a CDS is NOT its
    length modulo 3 and that the frame of a
    Start_Codon and Stop_Codon always has to be 0
    (Why?)
  • Exons do not have a frame, use . as the value
    of their frame
  • if there is no score associated with a feature,
    use .
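
The frame rule above can be turned into a short calculation: the frame of a CDS segment is the number of bases that must be skipped to reach the first base of the next complete codon, which depends on how many coding bases precede it. A minimal Python sketch, assuming a complete gene whose initial CDS begins with the start codon (so its frame is 0); the function name cds_frames is hypothetical.

```python
def cds_frames(cds_segments, strand):
    """Return [((start, end), frame), ...] in transcription order.

    cds_segments: (start, end) pairs, 1-based inclusive, on + strand coordinates.
    Assumes the first CDS in transcription order begins with the start codon.
    """
    # Transcription runs towards larger coordinates on "+", smaller on "-".
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    result = []
    coding_bases_so_far = 0
    for start, end in ordered:
        # Bases to skip to reach the first base of the next complete codon.
        frame = (3 - coding_bases_so_far % 3) % 3
        result.append(((start, end), frame))
        coding_bases_so_far += end - start + 1
    return result


# CDS segments of the reverse-strand gene used as Example 3 in this lecture
print(cds_frames([(153, 250), (351, 500), (649, 700)], "-"))
# → [((649, 700), 0), ((351, 500), 2), ((153, 250), 2)]
```

The second segment gets frame 2 because 52 coding bases precede it, leaving one base of an unfinished codon; this is why the frame is not simply a length modulo 3.
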

10
The GFF format: more remarks
  • the terminal CDS does not comprise the positions
    of the Stop_Codon as the Stop_Codon is not
    translated
  • the initial CDS does comprise the positions of
    the Start_Codon as it is translated
  • the order of lines in a gff file is irrelevant
    although it makes sense to group them by genes

11
The GFF format
  • Example 2
  • A valid description of this gene in gff format
    is for example:
  • Chr1 src Exon 150 200 . + . gene_id 1 transcript_id 1 exon_number 1
  • Chr1 src Exon 300 401 . + . gene_id 1 transcript_id 1 exon_number 2
  • Chr1 src CDS 380 401 . + 0 gene_id 1 transcript_id 1 exon_number 2
  • Chr1 src Exon 501 650 . + . gene_id 1 transcript_id 1 exon_number 3
  • Chr1 src CDS 501 650 . + 2 gene_id 1 transcript_id 1 exon_number 3
  • Chr1 src Exon 700 800 . + . gene_id 1 transcript_id 1 exon_number 4
  • Chr1 src CDS 700 707 . + 2 gene_id 1 transcript_id 1 exon_number 4
  • Chr1 src Exon 900 1000 . + . gene_id 1 transcript_id 1 exon_number 5
  • Chr1 src Start_Codon 380 382 . + 0 gene_id 1 transcript_id 1 exon_number 2
  • Chr1 src Stop_Codon 708 710 . + 0 gene_id 1 transcript_id 1 exon_number 4

[Figure: gene structure for this example, with an arrow labelled "Towards larger numbers" and the note: extra information in the "group" field.]

12
The GFF format
  • Example 3: a gene on the reverse strand
  • A valid description of this gene in gff format
    is for example:
  • Chr22 src Exon 649 700 . - . gene_id 1 transcript_id 1 exon_number 1
  • Chr22 src CDS 649 700 . - 0 gene_id 1 transcript_id 1 exon_number 1
  • Chr22 src Exon 351 500 . - . gene_id 1 transcript_id 1 exon_number 2
  • Chr22 src CDS 351 500 . - 2 gene_id 1 transcript_id 1 exon_number 2
  • Chr22 src Exon 150 250 . - . gene_id 1 transcript_id 1 exon_number 3
  • Chr22 src CDS 153 250 . - 2 gene_id 1 transcript_id 1 exon_number 3
  • Chr22 src Start_Codon 698 700 . - 0 gene_id 1 transcript_id 1 exon_number 1
  • Chr22 src Stop_Codon 150 152 . - 0 gene_id 1 transcript_id 1 exon_number 3

[Figure: gene structure for this example, with an arrow labelled "Towards larger numbers".]

13
Other genome annotation formats
  • DAS: XML version of GFF
    • uses tags to delimit fields, not whitespace
    • a little more structured
  • GAME: Genome Annotation Markup Elements
    • The format definition can be found at
      http://www.bioxml.org/Projects/game

14
Uses of a genome annotation format
  • exchanging annotation information
  • checking an annotation
  • comparing different annotations
  • visualising an annotation, see for example
    www.ensembl.org

15
Evaluating the performance of a gene prediction
program
  • Motivation: need a measure to evaluate the
    quality of a gene prediction and to compare the
    quality of different gene prediction or gene
    annotation methods
  • Idea: compare a set of known genes (annotation)
    to a set of predicted genes (prediction) by
    comparing them on three different levels:
    • nucleotide level (fine scale): compare nucleotides
    • intermediate level (medium scale): compare entire CDS, start and stop codons
    • gene level (coarse scale): compare entire gene structures

16
Evaluation on different levels
  • Evaluation on gene level

[Figure: annotated and predicted genes side by side, with cases labelled Tp, Tp(overlapping), Fn and Fp.]

  • Evaluation on intermediate level, for example
    for the CDS label

[Figure: annotated and predicted CDS features side by side, with cases labelled Tp, Tp(overlapping), Fn and Fp.]

  • Definitions:
    • Tp: true positive
    • Tn: true negative
    • Fp: false positive
    • Fn: false negative

17
Evaluation on different levels (cont'd)
  • Evaluation on nucleotide level, for example for
    the CDS label

[Figure: annotation and prediction aligned along the sequence, with nucleotides labelled Fp, Tp, Tn and Fn.]

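
At the nucleotide level, the four counts for a label such as CDS follow directly from set operations on the annotated and predicted positions. A small Python sketch under that reading; the function names and the toy coordinates are made up for illustration.

```python
def positions(segments):
    """Expand 1-based inclusive (start, end) segments into a set of positions."""
    covered = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_counts(annotated, predicted, seq_length):
    """Count tp, fp, fn and tn nucleotides for one label."""
    ann = positions(annotated)
    pred = positions(predicted)
    tp = len(ann & pred)   # labelled in both annotation and prediction
    fp = len(pred - ann)   # labelled only in the prediction
    fn = len(ann - pred)   # labelled only in the annotation
    tn = seq_length - tp - fp - fn  # labelled in neither
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


counts = nucleotide_counts([(11, 20)], [(16, 25)], seq_length=30)
print(counts)
# → {'tp': 5, 'fp': 5, 'fn': 5, 'tn': 15}
```

Expanding segments into explicit position sets is simple but memory hungry; for whole chromosomes an interval-based comparison would be a better fit.
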
18
Measures of performance
  • For a given entity and label one can compute:
  • sensitivity = tp / (tp + fn)
    • the fraction of annotated entities which were
      correctly predicted
  • specificity = tp / (tp + fp)
    • the fraction of predicted entities which are
      correct
  • missing = fn / (tp + fn)
    • the fraction of annotated entities which are
      missing in the prediction
  • overlapping_1 = tp(overlapping) / (tp + fn)
    • the fraction of annotated entities which are
      overlapped by a predicted entity
  • overlapping_2 = tp(overlapping) / (tp + fp)
    • the fraction of predicted entities which are
      overlapping an annotated entity
  • wrong = fp / (tp + fp)
    • the fraction of predicted entities which do not
      overlap any annotated entity
  • Where a label is for example CDS or
    Start_Codon and an entity can be either a
    nucleotide, a CDS, Exon, Start_Codon, Stop_Codon
    or an entire gene
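
Once the counts are known, the measures above are simple ratios. The sketch below, in Python, follows the definitions on this slide literally; the function name performance, the handling of empty denominators (returning None) and the example counts are assumptions made for illustration.

```python
def performance(tp, fp, fn, tp_overlapping=0):
    """Compute the slide's measures from counts for one entity type and label."""
    annotated = tp + fn  # number of annotated entities
    predicted = tp + fp  # number of predicted entities

    def ratio(numerator, denominator):
        # Undefined when nothing was annotated or nothing was predicted.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, annotated),
        "specificity": ratio(tp, predicted),
        "missing": ratio(fn, annotated),
        "overlapping_1": ratio(tp_overlapping, annotated),
        "overlapping_2": ratio(tp_overlapping, predicted),
        "wrong": ratio(fp, predicted),
    }


measures = performance(tp=8, fp=2, fn=2, tp_overlapping=1)
for name, value in measures.items():
    print(f"{name}: {value}")
```

Note that specificity as defined here, tp / (tp + fp), never uses tn, so it can be computed for entities such as whole genes where true negatives are not well defined.
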

19
Exercises
  • 1.) Check that you can reproduce the frames of
    the CDS lines in example 3 knowing the
    positions of the CDSs, the start codon and
    the stop codon.
  • 2.) What do the terms (tp + fp) and (tp + fn)
    stand for?
  • 3.) Looking at a gff entry of a gene, can you
    deduce if the annotation of the gene is
    complete ?
  • 4.) In which interval of numbers do the values
    of sensitivity and specificity fall ?

20
Exercises
  • 5.) This exercise prepares you for the
    practicals following this lecture
    You are collaborating with colleagues abroad who
    send you a gff file with the genes of
    their genome annotation as well as a fasta file
    with the corresponding genome sequence.
  • a) How do you check the gff file for errors ?
    Which checks can you think of ?
  • b) Outline the structure of (i.e. write the
    pseudocode for) a program which checks the gff
    file for errors.
  • 6.) You are given a gff file with an annotation
    predicted by a gene prediction program.
  • a) Which information do you require to evaluate
    the performance of the gene prediction program ?
  • b) Outline the structure of a program which
    evaluates the performance of a gene prediction
    program by comparing the predicted genes
    (contained in gff format in file 1) to the known
    genes (contained in gff format in file 2) (see
    example 4).

21
Answers to exercises
  • 1.) look at gff lines with features CDS and
    start codon in example 3
    • CDS with exon_number 1 is the initial, i.e.
      5'-most, CDS of the gene as it starts with a
      start codon
    • the initial CDS has length
      700 - 649 + 1 = 52 = 17 * 3 + 1
      => the next CDS, with exon_number 2, starts
      with codon position 1
      => the next CDS has frame 2
    • the second CDS has length
      500 - 351 + 1 = 150 = 50 * 3
      => the next CDS, with exon_number 3, starts
      with the same codon position
      => the next CDS has frame 2
  • 2.) (tp + fp) is the number of predicted
    features; (tp + fn) is the number of annotated
    features
  • 3.) A gff entry for a gene only tells you if the
    protein coding part of the gene is complete. If
    the gff entry comprises the start and stop codon
    of the gene, its protein coding part is
    complete. A gff entry does not show if the
    information on the untranslated exons is
    complete.

22
Answers to exercises
  • 4.) The values for sensitivity, tp / (tp + fn),
    and specificity, tp / (tp + fp), lie between 0
    and 1. The sensitivity is 1 only if fn = 0 and
    the specificity is 1 only if fp = 0.

23
Answer to exercise 5
  • Note: This exercise is about checking the
    annotation given in gff format, NOT the gff
    format itself!
  • a) Checking the annotation in the gff file is
    best done if the corresponding DNA sequences
    are available, as this allows more checks to be
    performed, so for the practical you can assume
    that you are given a gff file and the
    corresponding fasta file containing the DNA
    sequences
  • Possible checks of the annotation are:
    • Is the start codon correct (if it exists)?
    • Is the stop codon correct (if it exists)?
    • Are there no in-frame stop codons within the
      CDS?
    • Do the splice sites look fine?
    • For complete genes: is the sum of CDS
      lengths a multiple of 3?
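
Several of these checks need only the spliced coding sequence and the three bases that follow it. The Python sketch below assumes the standard genetic code (ATG as the only start codon; TAA, TAG and TGA as stop codons) and, as in the gff convention of this lecture, a CDS that includes the start codon but not the stop codon. The splice site check is left out, and the function name check_coding_sequence is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_coding_sequence(cds, stop_codon):
    """Return a list of error messages for one complete gene.

    cds: spliced CDS read 5'->3' on the gene's own strand, stop codon excluded.
    stop_codon: the three bases directly following the CDS.
    """
    cds = cds.upper()
    errors = []
    if len(cds) % 3 != 0:
        errors.append("sum of CDS lengths is not a multiple of 3")
    if not cds.startswith("ATG"):
        errors.append("start codon is not ATG")
    if stop_codon.upper() not in STOP_CODONS:
        errors.append("stop codon is not TAA, TAG or TGA")
    # Walk the CDS codon by codon, ignoring an incomplete trailing codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        errors.append("in-frame stop codon within the CDS")
    return errors


print(check_coding_sequence("ATGGCCAAA", "TAA"))  # no errors expected
print(check_coding_sequence("ATGTAAGCC", "TGA"))  # in-frame stop codon
```

Returning every failed check, rather than stopping at the first, makes it easier to compile a full list of errors for a corrupt annotation.
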

24
Answer to exercise 5 (cont'd)
  • b) For the program which checks the annotation
    you may assume the following, which you do not
    have to check:
    • sequence names in the fasta file are unique
    • use of gff format is correct
  • You may assume in your program, but should
    check, the following:
    • DNA sequences consist of A, C, G, T letters
      only
    • all genes are complete, i.e. comprise a
      start and stop codon
    • splice sites are either GT-AG (consensus)
      or GC-AG
    • there is exactly one gene associated with
      each fasta file sequence
  • Some things to keep in mind:
    • genes can lie on the forward (+) or the
      reverse (-) strand
    • the DNA sequences in the fasta file are the
      + strand sequences
    • the coordinates in the fasta and the gff
      file are absolute coordinates, but in your
      program you may prefer to make some
      calculations in relative coordinates (i.e.
      the first sequence position being 1 and the
      last being length_of_sequence)
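
The two points about strands and coordinates can be combined into one helper that cuts a feature out of the + strand sequence and, for reverse-strand genes, returns its reverse complement. A Python sketch under the assumption that the sequence contains only A, C, G and T; the function names and the toy sequence are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(plus_strand_seq, seq_start, start, end, strand):
    """Return the feature's bases read 5'->3' on its own strand.

    seq_start: absolute coordinate of the first base of plus_strand_seq.
    start, end: absolute, 1-based inclusive coordinates from the gff line.
    """
    rel_start = start - seq_start + 1  # relative coordinate, first base is 1
    rel_end = end - seq_start + 1
    # Python slices are 0-based and exclude the end index.
    piece = plus_strand_seq[rel_start - 1:rel_end]
    return reverse_complement(piece) if strand == "-" else piece


seq = "AACATGG"  # toy sequence covering absolute positions 101 to 107
print(extract_feature(seq, 101, 104, 106, "+"))  # → ATG
print(extract_feature(seq, 101, 101, 103, "-"))  # → GTT
```

For a multi-exon gene on the reverse strand, the pieces would additionally have to be joined in order of decreasing coordinates before checking codons.
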

25
Pseudocode (outline of the program)
  • 1.) read all of the fasta file and get all DNA
    sequences and headers
  • 2.) for each entry in the fasta file:
    • a) check fasta entry
      • i) length of DNA sequence equals length
        indicated in header? If not, report error
        and go to next sequence (= rerrgonext)
      • ii) DNA sequence consists of A, C, G, T
        letters only? If not, rerrgonext
    • b) read gff lines for that sequence name
      • i) check gff lines exist; if not,
        rerrgonext
      • ii) check there is exactly one gene
        associated with fasta entry; if not,
        rerrgonext
      • iii) check if gene is complete; if not,
        rerrgonext
      • iv) check if sum of CDS lengths is a
        multiple of 3; if not, rerrgonext
      • v) check if start codon correct; if not,
        report error
      • vi) check if stop codon correct; if not,
        report error
      • vii) check that there are no in-frame
        stop codons; if there are any, report error
      • viii) if relevant, check if splice sites
        are ok; if not, report error
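
The outline distinguishes checks whose failure ends work on an entry ("rerrgonext") from checks that only report an error and carry on. One possible way to express that control flow is sketched below in Python; it is not the reference solution from the practical. The entry dictionary layout, the check list and the error messages are all hypothetical, and only a few of the checks are shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def run_checks(entry, checks):
    """Run checks in order and return the error messages for one entry."""
    errors = []
    for message, passed, fatal in checks:
        if not passed(entry):
            errors.append(message)
            if fatal:
                break  # "rerrgonext": skip the remaining checks for this entry
    return errors


# (error message, predicate that returns True when the check passes, fatal?)
CHECKS = [
    ("length of sequence differs from header",
     lambda e: len(e["seq"]) == e["declared_length"], True),
    ("sequence contains letters other than A, C, G, T",
     lambda e: set(e["seq"].upper()) <= set("ACGT"), True),
    ("sum of CDS lengths is not a multiple of 3",
     lambda e: sum(e["cds_lengths"]) % 3 == 0, True),
    ("start codon incorrect",
     lambda e: e["start_codon"].upper() == "ATG", False),
    ("stop codon incorrect",
     lambda e: e["stop_codon"].upper() in STOP_CODONS, False),
]

good = {"seq": "ATGAAATAA", "declared_length": 9, "cds_lengths": [6],
        "start_codon": "ATG", "stop_codon": "TAA"}
bad_codons = dict(good, start_codon="CTG", stop_codon="TTT")
bad_length = dict(good, declared_length=12, start_codon="CTG")

for entry in (good, bad_codons, bad_length):
    print(run_checks(entry, CHECKS))
```

Keeping the checks in an ordered table makes the fatal or non-fatal decision explicit for each step and lets new checks be added without touching the loop.
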

26
Info on input files and functions

  • All files can be found on my web-page: URL HERE
  • There are two sets of gff-fasta file pairs, the
    input files to your check program:
    • ce.gff and ce.fasta: a set of C. elegans
      genes whose annotation is correct
    • mh.gff and mh.fasta: a set of mouse and
      human genes whose annotation is partly
      corrupt (aim: compile a list of errors)
  • There is a file with predefined functions and
    hashes that you may use in your program, if you
    like: functions.pl

27
Remark about the fasta header lines
  • Header format (non standard):
    >seqname start_position-stop_position strand
  • Example of a fasta entry:
    >MM.U35323.Lmp2.20 47366-53419 reverse
    catccatgcaggaagaaaaaaaaaaaaccacatacaatgaaagtaaacaaatctataaaa
    ttaaaaagaagtcaacccacagtgaccgctcatatctgagggcttttctcccagggtgct
    ccgctctctttctctggagaaagtgtttggcatgggaataaagaacaagaaggtgttcaa
    tttaatgtcgaacttcacagcttccattttaaatctcagttcttactgtctagaagttca
    ctgagtctcgcagactttgaatgtggttccctttgtgtggatgcttttttatttatttgc
    ....
    acgtggaggg
  • So, this DNA sequence, which is by default the
    sequence on the forward (+) strand, starts with
    position 47366 and ends at position 53419, i.e.
    has length 53419 - 47366 + 1 = 6054.
  • The "reverse" in the header line indicates that
    the corresponding gene in the gff file lies on
    the reverse (-) strand.
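
Because this header format is non standard, a check program has to parse it itself. A minimal Python sketch using a regular expression; it assumes headers start with ">" and that the strand word is either "reverse" (as in the example) or "forward" (an assumption, since only "reverse" is shown here). The name parse_header is hypothetical.

```python
import re

# >seqname start_position-stop_position strand
HEADER = re.compile(r"^>(\S+)\s+(\d+)-(\d+)\s+(forward|reverse)\s*$")


def parse_header(line):
    """Parse one fasta header line of the practical's non-standard format."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"unrecognised fasta header: {line!r}")
    name, start, end, strand_word = match.groups()
    start, end = int(start), int(end)
    return {
        "name": name,
        "start": start,
        "end": end,
        "length": end - start + 1,  # expected length of the sequence
        "strand": "-" if strand_word == "reverse" else "+",
    }


header = parse_header(">MM.U35323.Lmp2.20 47366-53419 reverse")
print(header["name"], header["length"], header["strand"])
# → MM.U35323.Lmp2.20 6054 -
```

The computed length can then be compared with the actual length of the sequence that follows the header.
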

28
Answer to exercise 5b
  • As always, there are many ways of writing a
    program which makes the required checks
  • One solution to exercise 5b, which uses the
    functions in functions.pl, can be found on
    URL HERE; see check_gff_and_fasta.pl
  • On this web page, you can also find the two
    output files that this program produces on the ce
    and the mh set of genes so that you know which
    errors your program should have found.