Sharing Genomic Data and Annotations using GFF3 format

Sharing Genomic Data and Annotations using GFF3 format. Dina Sulakhe and Natalia Maltsev, Bioinformatics Group, MCS, Argonne National Laboratory, Computation Institute.


Transcript and Presenter's Notes

Title: Sharing Genomic Data and Annotations using GFF3 format

Sharing Genomic Data and Annotations using GFF3
Dina Sulakhe and Natalia Maltsev
Bioinformatics Group, MCS, Argonne National Laboratory
Computation Institute, University of Chicago

What are we going to talk about?
  • GFF3 overview
  • GFF3 standards:
      • For sharing annotations
      • For cross-referencing the data
  • Extending GFF3:
      • Adding annotations from public databases
      • Adding users' annotations
  • Sharing and exchanging annotations using:
      • GFF3 genomes repository at Argonne
      • Downloads
      • Web-services

GFF3 overview (Lincoln Stein, 2004)
  • A tab-delimited flat file representation of
    genomic features
  • GFF3 format:
      • provides a mechanism for representing
        hierarchical grouping of genomic features
      • separates the ideas of group membership and
        feature name/id
      • enforces the use of controlled vocabularies by
        imposing constraints on the definitions of
        genomic features
      • allows a single feature (e.g. an exon) to belong
        to more than one group at a time
      • provides an explicit convention for pairwise
        alignments
      • provides an explicit convention for features that
        occupy disjoint regions
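
As a rough illustration of the points above, the sketch below parses a few GFF3 lines and prints an exon that lists two parents. The sample records, the sequence ID `ctg123`, and the helper name `parse_gff3_line` are made up for this example; only the nine-column layout and the `ID`/`Parent` attributes come from the GFF3 specification.

```python
from urllib.parse import unquote

# Column names of a GFF3 record, in order (nine tab-separated fields).
COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attrs = {}
    for pair in record["attributes"].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values (e.g. several parents) are comma-separated.
        attrs[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attrs
    return record


# Invented sample data: one gene, two transcripts, one shared exon.
SAMPLE = [
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene1;Name=EDEN",
    "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA2;Parent=gene1",
    "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2",
]

records = [parse_gff3_line(line) for line in SAMPLE]
exon = records[-1]
print(exon["type"], exon["attributes"]["ID"], exon["attributes"]["Parent"])
```

The exon carries its own `ID` and, separately, a `Parent` list with both transcripts, which is how group membership and feature identity are kept apart.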

An Example
  • PUMA2 (http//) is an
    Interactive Integrated Environment for
    High-throughput Genetic Sequence analysis and
    Metabolic reconstructions of public genomes with
    a Grid-based computational backend
  • GNARE is PUMA2 for analysis of user-submitted
  • http//
  • PUMA2 contains:
      • Integrated information from over 25 genomic,
        metabolic, structural and taxonomic databases
        (RefSeq, UniProt, iProClass, PDB, KEGG, EMP, CATH,
        NCBI Taxonomy, Phenotypes, etc.)
      • Pre-computed analysis of publicly available
        completely and almost completely sequenced
        genomes (517 bacterial, 41 archaeal, 24
        eukaryotic, 638 mitochondrial and 2127 viral
        genomes) in the interactive PUMA2 framework
      • Automated metabolic reconstructions for 300
        completely sequenced organisms
      • GNARE User Models: a framework for analysis of
        genomes provided by users (Shewanella federation,
        Apicomplexa genomes, strains of B. anthracis,
        Yersinia, Staphylococcus, Haemophilus, etc.)
      • A suite of unique tools for evolutionary analysis
        of enzymes and metabolic networks (Chisel,
        PhyloBlocks, etc.) developed by our group
      • PUMA2 satellite databases: Pathos (GLRCE
        biodefence), TarGet (MCSG structural biology),
        Sentra (prokaryotic signal transduction),
        SubUnit, Physiological Profiles, MetaGenomes
        (PNNL Hanford Site), etc.

GFF3 genomes repository at Argonne ftp//ftp.mcs.a
  • All completely sequenced genomes from RefSeq are
    converted into GFF3 format.
  • GFF3 files for 8419 bacterial, eukaryotic,
    mitochondrial, viral, etc. genomes can be
    downloaded from:
      • ftp//
  • The file names correspond to the NCBI-RefSeq
    accession numbers, e.g.:
      • ftp//
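
The naming convention above suggests a simple lookup by accession. The sketch below builds a file location from a RefSeq accession; the base URL is a placeholder, and the `.gff3` extension, the flat directory layout, and the function name `gff3_location` are assumptions, since the transcript does not preserve the actual FTP path.

```python
import posixpath


def gff3_location(base_url, accession, extension=".gff3"):
    """Join a repository base URL and a RefSeq accession into a file location.

    The extension and the flat directory layout are assumptions made
    for illustration, not the documented layout of the repository.
    """
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    return posixpath.join(base_url.rstrip("/"), accession + extension)


# Placeholder base URL; substitute the real repository address.
print(gff3_location("ftp://example.org/gff3", "NC_000913"))
```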

Future Plans: Annotations
  • In 2007 we will supplement Genome GFF3
    annotations for public genomes with:
      • additional annotations from public databases
        (e.g. NCBI, UniProt, Integr8, GenomeNet, etc.)
      • annotations from our analysis tools (e.g. Chisel
        and PUMA2_FP) and other analysis tools
  • Supplement the GFF3 files for RefSeq genomes with
    annotations provided by users via the GNARE system
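
One way to picture this kind of supplementing is to merge extra attribute values into existing records, keyed by feature ID. The sketch below assumes the features are already parsed into dictionaries; the feature IDs, the `Note` text, and the function name `supplement` are invented for illustration and do not describe the PUMA2 or GNARE implementation.

```python
def supplement(features, extra):
    """Add attribute values from `extra` to features with a matching ID.

    features: list of dicts, each with an "attributes" dict of key -> list of values
    extra:    dict mapping feature ID -> dict of key -> list of values
    Returns the number of features that had a matching entry in `extra`.
    """
    matched = 0
    for feature in features:
        ids = feature["attributes"].get("ID", [])
        feature_id = ids[0] if ids else None
        additions = extra.get(feature_id)
        if not additions:
            continue
        for key, values in additions.items():
            current = feature["attributes"].setdefault(key, [])
            for value in values:
                # Skip values already present so repeated merges add nothing new.
                if value not in current:
                    current.append(value)
        matched += 1
    return matched


# Invented sample data.
features = [
    {"type": "gene", "attributes": {"ID": ["gene1"], "Name": ["EDEN"]}},
    {"type": "gene", "attributes": {"ID": ["gene2"]}},
]
extra = {"gene1": {"Note": ["hypothetical user comment"]}}

print(supplement(features, extra), features[0]["attributes"])
```

Keeping the original attributes and only appending new values means the RefSeq-derived record stays intact while user or tool annotations accumulate alongside it.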

Future Plans: Sharing Annotations and
Cross-referencing the data
  • GFF3 format can be used to share annotations and
    cross-references by different annotation centers
  • We plan to build services (Web-services and
    Web-interfaces) to allow users to:
      • Submit and share their annotations via the PUMA2
        GFF3 converter
      • Extract public annotations from PUMA2 integrated
        database as well as user-submitted annotations in
        GFF3 format
  • Support customization of the GFF3 format (e.g.
    include only the fields of interest to a user,
    provide information from a particular resource)
  • Cross-references to various databases (e.g.
    NCBI-RefSeq, PIR, SwissProt, UniProt, and others)
    will be included as feature data in the GFF3
  • Explore the use of ontologies for extension of
    the GFF3 format (we need your advice!)
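
For the cross-referencing and customization ideas above, the GFF3 specification reserves a `Dbxref` attribute whose values take the form `DBTAG:ID`. The sketch below serializes an attribute dictionary into the ninth column and can keep only selected keys; the accession values are placeholders and the function name `format_attributes` is hypothetical, not part of any PUMA2 service.

```python
from urllib.parse import quote


def format_attributes(attrs, keep=None):
    """Serialize key -> list-of-values into a GFF3 attributes column.

    keep: optional collection of attribute keys to retain; others are dropped.
    """
    parts = []
    for key, values in attrs.items():
        if keep is not None and key not in keep:
            continue
        # Percent-encode characters such as ';', '=', ',' and '&',
        # which have a reserved meaning inside column 9.
        encoded = [quote(value, safe=" :/") for value in values]
        parts.append(key + "=" + ",".join(encoded))
    return ";".join(parts)


# Placeholder identifiers, not real database accessions.
attrs = {
    "ID": ["gene1"],
    "Note": ["kinase; putative"],
    "Dbxref": ["RefSeq:XX_000000", "UniProtKB:X00000"],
}

print(format_attributes(attrs))
print(format_attributes(attrs, keep={"ID", "Dbxref"}))
```

The `keep` argument is one simple way to express "only the fields of interest to a user"; a real service would likely need a richer selection mechanism.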

Future Plans: Data Distribution (GFF3 genomes
repository at Argonne)
  • All the feature data collected and computed by
    the PUMA2 project for publicly available genomes
    will be distributed in the GFF3 format.
  • We will distribute the data through:
      • Web-services
      • Web interface (http)
      • FTP downloads

  • Our Team and
  • Globus: Ian Foster, Mike Wilde, Nika Nefedova,
    Jens Voeckler
  • Condor: Zach Miller, Miron Livny
  • OSG, TeraGrid
  • MCS: Rick Stevens, systems, Susan Coghlan, and a
    lot of others.