Title: Bioinformatics for Genome data
1Bioinformatics for Genome data
- Genome databases / information systems
  - FlyBase, http://flybase.bio.indiana.edu/ -- genome infosystem of Drosophila fruitfly
  - euGenes, http://eugenes.org/ -- infosystem for 8 important eukaryotes with 180,000 genes
- Bio-info archiving and distribution
  - IUBio Archive, http://iubio.bio.indiana.edu/ -- public molecular biology data / software archive
  - Bio-Mirrors, http://www.bio-mirror.net/ -- sequence and related biology databanks
Don Gilbert, gilbertd_at_bio.indiana.edu
2What is bioinformatics?
- Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
- There are three important sub-disciplines within bioinformatics:
  - development of new algorithms and statistics with which to assess relationships among members of large data sets
  - analysis and interpretation of various types of data, including nucleotide sequences and proteins, gene and genome features and functions, expression in cells and through development, and other biology data
  - development and implementation of tools that enable efficient access and management of different types of information
- http://www.ncbi.nlm.nih.gov/Education/index.html, w/ dgg edits
3BioData
- BioData: size, contents, dispersion, uses
- Genome data
  - very important, highly complex, harder to find, long lived
  - Literature (abstracted and curated), sequence and feature analyses, maps, controlled vocabulary/ontologies, people, biologics, contacts, etc.
- BioData access
  - Need to find and use best data
  - New data kinds and sources: bio-information is very fluid
  - Need current data: update monthly, weekly, daily
  - Distributed widely in world among 1000s of national, regional centers and labs
4Bio Databanks, EBI, Sept. 2002
5Constellation of Bio-Data (SRS - Lion Bioscience)
6Genome Data Objects
Drosophila genome, FlyBase, Sept. 2002
8 eukaryote genomes, euGenes, July 2002
7Genome Databases
- Drosophila: FlyBase, http://flybase.net/ (Indiana Univ.)
- C. elegans: WormBase, http://www.wormbase.org/
- Mouse: MGD, http://www.informatics.jax.org/
- Saccharomyces: SGD, http://genome-www.stanford.edu/Saccharomyces/
- Human: LocusLink, http://www.ncbi.nlm.nih.gov/LocusLink/
- Human: GeneCards, http://bioinfo.weizmann.ac.il/cards/
- Various eukaryotes: Ensembl, http://www.ensembl.org/
- Various eukaryotes: euGenes, http://eugenes.org/ (Indiana Univ.)
- Many new organism genome systems for Daphnia, insects, vertebrates, others with complete genome data
8FlyBase and euGenes
9FlyBase.net
- Distributed project (4 sites, 6 PIs, 15 curators, 15 informaticians), 10 years old
- Multiple databases: project data flow and exchange critical
- Curated and computed data, from expt. literature, genome sequence
- Integrated database modules (for generic use w/ GMOD)
  - Genetics, Sequences, Maps, Expression
  - Controlled vocabularies, Ontologies
  - Computational analyses
  - Organism, taxonomy, phylogenetic/comparative
  - Publications, General
10euGenes.org
- Automated genome summaries for Human, Fruitfly, Mouse, Mosquito, Arabidopsis, C. elegans, Saccharomyces, Zebrafish
- 3 year, computational DB project, 1 part-time informatician (dgg ?)
- genome maps, sequences, gene reports, external database links
- cross-species comparisons: similar genes, genome features, gene function
12Anatomy of genome database info system
13Anatomy of genome database
- Data components
  - biosequences, literature, external data, expression info, pathways, maps, anatomy, populations, species, ecology, stocks, people
  - Metadata about primary data (ID, dates, sources, evidence)
- Architecture
  - Relational database for management
  - Search and retrieval software for flat file data
  - Backend (database, analyses pipelines), Frontend (public views and access: web, ftp), Middleware ('glue' back and front)
  - Flexible data schema: changes common
  - Performance constraints
  - Internet-shared, standards-based, open-source preferred
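The split between primary data and metadata about it (ID, dates, sources, evidence) can be pictured with a small sketch. This is only an illustration: the class names, field names, and the `needs_review` helper are assumptions made for this example, not the FlyBase or euGenes schema, and the record ID is a made-up placeholder.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Metadata:
    """Metadata about a primary data item: ID, dates, sources, evidence."""
    record_id: str
    created: date
    updated: date
    sources: list = field(default_factory=list)
    evidence: list = field(default_factory=list)


@dataclass
class GeneRecord:
    """A primary data object (here a gene) paired with its metadata."""
    symbol: str
    name: str
    map_position: str
    meta: Metadata


def needs_review(record: GeneRecord) -> bool:
    """Flag records that list no source or no supporting evidence."""
    return not record.meta.sources or not record.meta.evidence


# Illustrative values only; "EXAMPLE:0001" is a hypothetical identifier.
example = GeneRecord(
    symbol="Act5C",
    name="Actin 5C",
    map_position="5C7",
    meta=Metadata(
        record_id="EXAMPLE:0001",
        created=date(2002, 5, 31),
        updated=date(2002, 5, 31),
        sources=["literature"],
        evidence=[],
    ),
)
print(needs_review(example))
```

Keeping metadata in its own structure lets the same bookkeeping be attached to sequences, maps, or stocks without changing those data types.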
14Anatomy of genome database, cont.
- Analysis software
  - Project uses: sequence analyses, external database comparisons
  - Pipeline for automated analyses, rerun as needed
  - Public uses (e.g. BLAST search)
- Editing / data management interface
  - Interactive document editing
  - Batch data updates
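One way to read "pipeline for automated analyses, rerun as needed" is a runner that repeats a step only when its input has changed. The sketch below assumes that interpretation; the `Pipeline` class and the toy `gc_content` analysis are hypothetical and stand in for real tools such as BLAST.

```python
import hashlib


class Pipeline:
    """Runs named analysis steps, re-running a step only when its input changed."""

    def __init__(self):
        self._steps = {}   # step name -> analysis function
        self._cache = {}   # step name -> (input digest, result)
        self.runs = 0      # count of real (non-cached) executions

    def add_step(self, name, func):
        self._steps[name] = func

    def run(self, name, data):
        digest = hashlib.sha256(data.encode()).hexdigest()
        cached = self._cache.get(name)
        if cached is not None and cached[0] == digest:
            return cached[1]          # input unchanged: reuse stored result
        result = self._steps[name](data)
        self._cache[name] = (digest, result)
        self.runs += 1
        return result


def gc_content(seq):
    """Toy analysis: fraction of G and C bases, formatted to two decimals."""
    seq = seq.upper()
    return f"{(seq.count('G') + seq.count('C')) / len(seq):.2f}"


pipeline = Pipeline()
pipeline.add_step("gc", gc_content)
first = pipeline.run("gc", "ATGC")
again = pipeline.run("gc", "ATGC")    # same input, served from cache
changed = pipeline.run("gc", "GGCC")  # new input, step is rerun
```

A production pipeline would also track tool versions and reference database releases, since either can make stored results stale.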
15Anatomy of genome database, cont.
- Publication interface
  - Detailed biological object views (sequences, genes, etc.)
  - Queries: simple and frequent, ad-hoc and general
  - Graphic viewers
- Data exchange
  - Data definitions, schema (XML)
  - Controlled vocabularies of science terms, ontologies
  - Minimal information for collaboration, sharing
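As a rough illustration of XML-based data exchange, the sketch below dumps a gene record to XML and loads it back using Python's standard library. The element and attribute names (`gene`, `name`, `map`, `terms`, `term`) are assumptions for this example and do not come from any published FlyBase or GMOD schema.

```python
import xml.etree.ElementTree as ET


def gene_to_xml(gene):
    """Serialize a gene dictionary to an XML string."""
    root = ET.Element("gene", attrib={"symbol": gene["symbol"]})
    ET.SubElement(root, "name").text = gene["name"]
    ET.SubElement(root, "map").text = gene["map"]
    terms = ET.SubElement(root, "terms")
    for term in gene["terms"]:
        ET.SubElement(terms, "term").text = term
    return ET.tostring(root, encoding="unicode")


def gene_from_xml(text):
    """Parse the XML produced by gene_to_xml back into a dictionary."""
    root = ET.fromstring(text)
    return {
        "symbol": root.get("symbol"),
        "name": root.findtext("name"),
        "map": root.findtext("map"),
        "terms": [t.text for t in root.iter("term")],
    }


# "example term" is a placeholder, not a real controlled-vocabulary entry.
gene = {"symbol": "Act42A", "name": "Actin 42A", "map": "42A2",
        "terms": ["example term"]}
xml_text = gene_to_xml(gene)
round_trip = gene_from_xml(xml_text)
```

The round trip is the point of the exercise: two collaborating sites can share data only if both agree on the element names and on the vocabulary used inside them.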
16Compute parts of system
- Web server (Apache) and modules
- FTP server for bulk data exchange
- Relational DBMS: PostgreSQL.org, MySQL.com, Oracle, ...
- Analysis programs: BLAST, various bioinformatics tools
- Perl, Java middleware for data access, analysis, search and report
- Limited, secure access for project data management
- Public access for released data (web, ftp)
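The separation between project data management and public access to released data can be sketched with a relational view. SQLite is used here only as a self-contained stand-in for PostgreSQL, MySQL, or Oracle; the table layout, the `released` flag, and the `public_gene` view are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    symbol   TEXT PRIMARY KEY,
    name     TEXT,
    map      TEXT,
    released INTEGER NOT NULL DEFAULT 0   -- 1 once curators release the record
);
-- Public queries go through this view and never see unreleased rows.
CREATE VIEW public_gene AS
    SELECT symbol, name, map FROM gene WHERE released = 1;
""")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        ("18w", "18 wheeler", "56F11", 1),
        ("Act5C", "Actin 5C", "5C7", 1),
        ("draft1", "unreleased example", None, 0),  # hypothetical draft record
    ],
)
conn.commit()


def public_symbols(connection):
    """Return the gene symbols visible through the public view."""
    rows = connection.execute("SELECT symbol FROM public_gene ORDER BY symbol")
    return [row[0] for row in rows]


print(public_symbols(conn))
```

On a server DBMS the same idea is usually enforced with separate accounts: curators get write access to the base tables, while the web frontend connects with a read-only account.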
17FlyBase/euGenes Query System
18FlyBase Query Results
FlyBase Genes query results
Query: ( libsFBgn PFgn-allwing or libs-synwing ) and libs-orgDmel
No. matches: 1437
Bookmark: FBquery ( libsFBgn PFgn-allwing libs-synwing ) libs-orgDmel

No.  Symbol  Name        Map    Alleles  Stocks  Refs  DNA  Date
1    18w     18 wheeler  56F11  16       2       56    13   31 May 02
2    2R-F    -           -      2        1       3     -    31 May 02
...
19   Act42A  Actin 42A   42A2   2        -       73    23   31 May 02
20   Act5C   Actin 5C    5C7    14       1       129   43   31 May 02

Page and Sort results
Batch Download: Fetch items x All Items; Format: Spreadsheet; Report content: Summary; Report only: Select fields (Field list)
Refine query or find items in related data
Refine query: ( libsFBgn PFgn-allwing or libs-synwing ) and libs-orgDmel and other fields matches ..
Search Genes, retrieve Related Data Classes (alleles, aberrations, transcripts, insertions, sequences)
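The query above appears to combine fielded terms with boolean operators, roughly "(all fields contain wing OR synonym contains wing) AND organism is Dmel". That reading is an assumption, since the query syntax is garbled in this transcript. Under that assumption, a flat-file search engine can answer it with an inverted index and set operations, sketched below with made-up records and hypothetical field names `all`, `syn`, and `org`.

```python
from collections import defaultdict

# Made-up records for illustration; not FlyBase data.
records = {
    "g1": {"all": "wing development gene", "syn": "wg", "org": "Dmel"},
    "g2": {"all": "eye pigment gene", "syn": "wing", "org": "Dmel"},
    "g3": {"all": "wing pattern gene", "syn": "wp", "org": "Other"},
}


def build_index(recs):
    """Map each (field, token) pair to the set of record IDs containing it."""
    index = defaultdict(set)
    for rec_id, fields in recs.items():
        for field_name, text in fields.items():
            for token in text.lower().split():
                index[(field_name, token)].add(rec_id)
    return index


def lookup(index, field_name, term):
    """Return the IDs of records whose field contains the term."""
    return index.get((field_name, term.lower()), set())


index = build_index(records)

# OR becomes set union, AND becomes set intersection.
hits = (lookup(index, "all", "wing") | lookup(index, "syn", "wing")) \
    & lookup(index, "org", "Dmel")
print(sorted(hits))
```

Because each term lookup is a dictionary access, query cost depends mostly on the size of the matching sets rather than on the size of the whole databank.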
19 Efficiency of SRS versus RDB
Figure: Drosophila genome annotations, SRS versus Gadfly DB relational database. Web search time (shorter is better), two computers (O, F).
20Current System Dataflow
Diagram components: Master Genes File (rw); People DB; Stock Center DB; Image Curation; Supplemental nightly update; Mol5 (rw); Public (ro) /SRS; Web Browser; Dumper; Gene Loader; Gene -Seen applet; GenBank; Sequence Analysis Pipeline; gadfly (rw); Apollo; gadfly scripts; gadfly x xml
From Stan Letovsky, FlyBase
21Ultimate System Dataflow?
Diagram components: Master Genes File (rw); People DB; Stock Center DB; Image Curation; Public (ro) /SRS; IDB proto (rw); Web Browser; XML Dumper; XML Loader; Error cleanup; XML?; Gene -Seen applet; Apollo; GenBank; Sequence Analysis Pipeline
From Stan Letovsky, FlyBase
22Single DB vs. Federated Info. S/R
23GMOD - Generic Model Organism Database Construction Set, http://www.gmod.org/
- Database schemas
- Literature curation tools
- Gene ontology management tools
- Visualization tools
- Data processing pipelines
24From Shawn Hoon, Fugu Informatics Group
25From Shawn Hoon, Fugu Informatics Group
26From Shawn Hoon, Fugu Informatics Group
27From Shawn Hoon, Fugu Informatics Group
28Bio-Grids
29Bio-Grids - what are they?
- transparent use of available workstations, commodity grid resources (commercial, academic)
- find biodata, computing resources easily and automatically via directories
- personal/project resources and peer-peer sharing
- less reliance, less cost for centralized services or building local IT centers
- Power grid: plug in your toaster, ignore the power sources and grid. Bio grid: plug in workstation, ignore where data and compute power comes from -- eventually!
30BioGrid Schematic
- Grid-aware client software
- Data and software resource directories
- Grid of processing computers
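The three parts of the schematic can be sketched as a client that asks a directory for a resource and does not care where it lives. Everything here is hypothetical: the `Resource` and `Directory` classes, the node names, and the least-loaded selection rule are assumptions for illustration, not part of any real grid toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str        # hypothetical host or archive name
    kind: str        # "data" or "compute"
    provides: str    # databank or program offered, e.g. "blast"
    load: float      # 0.0 (idle) to 1.0 (fully busy)


class Directory:
    """Resource directory that grid-aware clients query instead of fixed hosts."""

    def __init__(self):
        self._resources = []

    def register(self, resource):
        self._resources.append(resource)

    def find(self, kind, provides):
        """Return the least-loaded matching resource, or None if none exists."""
        matches = [r for r in self._resources
                   if r.kind == kind and r.provides == provides]
        return min(matches, key=lambda r: r.load) if matches else None


directory = Directory()
directory.register(Resource("node-a", "compute", "blast", load=0.9))
directory.register(Resource("node-b", "compute", "blast", load=0.2))
directory.register(Resource("mirror-1", "data", "sequence-databank", load=0.5))

chosen = directory.find("compute", "blast")
print(chosen.name if chosen else "no resource available")
```

The client code never names a specific machine, which is the "plug in and ignore the source" idea from the power grid analogy.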
31From Shawn Hoon, Fugu Informatics Group