Title: Bioinformatics: building bridges to biology information
1. Bioinformatics: building bridges to biology information
- Biology Information Access
- Genome Information Systems
- Bio-Grids and Bio-Directories
Don Gilbert, gilbertd_at_bio.indiana.edu, November 2002
2. Biology information access projects
- Bio-info archiving and distribution
  - IUBio Archive, http://iubio.bio.indiana.edu/ -- public molecular biology data / software archive
  - Bio-Mirrors, http://www.bio-mirror.net/ -- Sequence and related biology databanks
- Genome information systems
  - FlyBase, http://flybase.bio.indiana.edu/ -- genome infosystem of Drosophila fruitfly
  - euGenes, http://eugenes.org/ -- infosystem for 8 important eukaryotes with 180,000 genes
- Bioinformatics services
  - http://sunflower.bio.indiana.edu/bioweb/ -- molecular biology program use via web
- Bio-Data Grids
  - http://iubio.bio.indiana.edu/grid/ -- experimental distributed computing
4. BioData
- BioData: size, contents, dispersion, uses
  - Genome data
  - very important, highly complex, harder to find, long lived
  - Literature (abstracted and curated), sequence and feature analyses, maps, controlled vocabularies/ontologies, people, biologics, contacts, etc.
- BioData access
  - Need to find and use best data
  - New data kinds and sources: bio-information is very fluid
  - Need current data: update monthly, weekly, daily
  - Distributed widely in world among 1000s of national, regional centers and labs
5. Bio Databanks, EBI, Sept. 2002
6. Constellation of Bio-Data (SRS - Lion Bioscience)
8. FlyBase and euGenes
9. Genome Databases
- Drosophila: FlyBase, http://flybase.net/ (Indiana Univ.)
- C. elegans: WormBase, http://www.wormbase.org/
- Mouse: MGD, http://www.informatics.jax.org/
- Saccharomyces: SGD, http://genome-www.stanford.edu/Saccharomyces/
- Human: LocusLink, http://www.ncbi.nlm.nih.gov/LocusLink/
- Human: GeneCards, http://bioinfo.weizmann.ac.il/cards/
- Various eukaryotes: Ensembl, http://www.ensembl.org/
- Various eukaryotes: euGenes, http://eugenes.org/ (Indiana Univ.)
- Many new organism genome systems for Daphnia, insects, vertebrates, others with complete genome data
10. FlyBase.net
- Distributed project (4 sites, 6 PIs, 15 curators, 15 informaticians), 10 years old
- Multiple databases: project data flow and exchange critical
- Curated and computed data, from experimental literature, genome sequence
- Integrated database modules (for generic use w/ GMOD)
  - Genetics, Sequences, Maps, Expression
  - Controlled vocabularies and Ontologies
  - Computational analyses
  - Organism, taxonomy, phylogenetic/comparative
  - Publications, General
11. euGenes.org
- Automated genome summaries for Human, Fruitfly, Mouse, Mosquito, Arabidopsis, C. elegans, Saccharomyces, Zebrafish
- 3 year, computational DB project, 1 part-time informatician (dgg ?)
- genome maps, sequences, gene reports, external database links
- cross-species comparisons: similar genes, genome features, gene function
13. Genome Data Objects
Drosophila genome, FlyBase, Sept. 2002
8 eukaryote genomes, euGenes, July 2002
14. Genome attributes in euGenes, July 2002
Gene counts are as extracted from genome project sources.
These differ from true gene numbers because of orphan gene records, prediction artifacts, unmerged predicted/experimental records, and unfinished sequencing gaps.
15. Anatomy of genome database info system
16. FlyBase/euGenes Query System
17. FlyBase Query Results
FlyBase Genes query results
Query: ( libsFBgn PFgn-allwing or libs-synwing ) and libs-orgDmel
No. matches: 1437

 #   Symbol  Name        Map    Alleles  Stocks  Refs  DNA  Date
 1   18w     18 wheeler  56F11  16       2       56    13   31 May 02
 2   2R-F    -           -      2        1       3     -    31 May 02
 ...
 19  Act42A  Actin 42A   42A2   2        -       73    23   31 May 02
 20  Act5C   Actin 5C    5C7    14       1       129   43   31 May 02

Options shown on the results page:
- Bookmark the query
- Page and Sort results
- Batch Download: fetch all items; format: spreadsheet; report content: summary, or selected fields from a field list
- Refine query, or find items in related data
- Search Genes, retrieve Related Data Classes (alleles, aberrations, transcripts, insertions, sequences)
18. GMOD - Generic genome database tools
- Generic Model Organism Database Construction Set, http://www.gmod.org/
  - Database schemas
  - Literature curation tools
  - Gene ontology management tools
  - Visualization tools
  - Data processing pipelines
19. Bio-grids - what might they be?
- transparent use of available workstations and commodity grid resources (commercial, academic)
- find biodata, computing resources easily and automatically via directories
- personal/project resources and peer-peer sharing
- less reliance, less cost for centralized services or building local IT centers
- Power grid: plug in your toaster, ignore the power sources and grid. Bio grid: plug in workstation, ignore where data and compute power comes from -- eventually!
20. EU DataGrid Interfaces, from Bob Jones, CERN
Computing Elements
Mass Storage Systems: HPSS, Castor
21. BioGrid Schematic
- Grid-aware client software
- Data and software resource directories
- Grid of processing computers
22. Directories of Genome Data
- For genome data, "broad and shallow" directories federate the "narrow and deep" databases
- BioData access tools
  - SRS (Sequence Retrieval System), Entrez, AceDB
  - RDBMS, Ensembl, IBM DiscoveryLink, BioSQL, BioDAS
- Directory services and data tools: LDAP, Web Services
  - LDAP: mature, efficient for high volumes, allows federated queries over distributed directories, and works well for SRS databanks and genome annotations
  - Web Services: new, simple and complex, for XML messages over the Web; has wide industry support, but its many standards are in flux
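A minimal sketch of the federated-directory idea on this slide, assuming two hypothetical sites that each publish a small directory of databank entries. The attribute names (db, kind, release), the dn strings, and the site names are invented for illustration and are not taken from the SRS-LDAP software. Plain Python lists stand in for real LDAP servers, and `*` wildcards only loosely imitate LDAP substring filters.

```python
from fnmatch import fnmatchcase

# Hypothetical "broad and shallow" directory entries, one list per site.
# All attribute names and values are illustrative assumptions.
SITE_A = [
    {"dn": "db=swissprot,o=site-a", "db": "swissprot", "kind": "protein", "release": "2002-09"},
    {"dn": "db=embl,o=site-a", "db": "embl", "kind": "nucleotide", "release": "2002-09"},
]
SITE_B = [
    {"dn": "db=flybase,o=site-b", "db": "flybase", "kind": "genome", "release": "2002-09"},
    {"dn": "db=embl,o=site-b", "db": "embl", "kind": "nucleotide", "release": "2002-06"},
]


def matches(entry, criteria):
    """True if the entry has every requested attribute and each value
    matches its pattern ('*' acts as a wildcard)."""
    return all(
        attr in entry and fnmatchcase(str(entry[attr]), pattern)
        for attr, pattern in criteria.items()
    )


def federated_search(directories, **criteria):
    """Run the same shallow query against every directory and merge the hits."""
    hits = []
    for directory in directories:
        hits.extend(entry for entry in directory if matches(entry, criteria))
    return hits


def newest(hits):
    """Pick the hit with the most recent release (YYYY-MM strings sort by date)."""
    return max(hits, key=lambda entry: entry["release"])
```

Under these assumptions, `federated_search([SITE_A, SITE_B], db="embl")` returns the EMBL entries from both sites, and `newest(...)` selects the copy with the later release, which is the kind of "find and use best data" step a grid client would automate before fetching from the deep database itself.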
23. Bio-directories Technology
- Technology for finding bio-data
  - Current: Web pages, Web indexers (Google), FTP servers
  - Sometimes: CORBA, Java RMI
  - Usable: Lightweight Directories - LDAP
  - Developing: Web Services (XML on Web: SOAP, WSDL, UDDI, ...)
  - Related: BioDAS, BioMoby, Life Sciences ID (LSID)
24. SRS - LDAP, Web-XML gateways
- Sequence Retrieval System (SRS) knows millions of bio-objects: good start for bio-directories
- OpenLDAP server combined with SRS6
- WebService SOAP server/client with SRS6
- SRS-LDAP-SOAP software is available at http://iubio.bio.indiana.edu/biogrid/directories/
- Compare LDAP, SOAP, Wgetz, FTP for Grid uses
  - LDAP is 5x faster than SOAP, Wgetz
25. BioDirectory tests for Grid
26. Using Bio Directories
- Simple client software
- Automated use
- People use
- Discovery
- Search by many criteria
- Retrieve bulk subsets
27. Genome Feature Directory
- Genome objects fit in directories; search and retrieve as annotated sequence (like BioDAS)
- Same directory methods work for objects, databases, software and other resources
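One way to picture genome objects in a directory, assuming a hypothetical naming scheme (name, chr, org) and made-up features and coordinates; nothing here reflects the actual FlyBase, euGenes, or BioDAS schemas. Each feature gets a directory-style name, and a range search returns the features overlapping a region, which is roughly what retrieving an annotated sequence segment involves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    org: str    # organism code, e.g. "Dmel"
    chrom: str  # chromosome arm
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    kind: str   # e.g. "gene", "insertion"
    name: str

    @property
    def dn(self):
        """Directory-style name; this layout is an assumption for illustration."""
        return f"name={self.name},chr={self.chrom},org={self.org}"


# Made-up features and coordinates, used only to exercise the search.
FEATURES = [
    Feature("Dmel", "2R", 100, 500, "gene", "geneA"),
    Feature("Dmel", "2R", 450, 900, "gene", "geneB"),
    Feature("Dmel", "2R", 480, 490, "insertion", "insA"),
    Feature("Dmel", "3L", 100, 500, "gene", "geneC"),
]


def search(features, org, chrom, start, end, kind=None):
    """Return features overlapping [start, end] on one chromosome,
    optionally restricted to one kind, sorted by start position."""
    hits = [
        f for f in features
        if f.org == org and f.chrom == chrom
        and f.start <= end and f.end >= start
        and (kind is None or f.kind == kind)
    ]
    return sorted(hits, key=lambda f: f.start)
```

With this layout, `search(FEATURES, "Dmel", "2R", 470, 600)` returns every feature touching that region, and adding `kind="gene"` narrows the result to genes. The same record-plus-name pattern could describe databases or software, in line with the second bullet on this slide.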
28. BioGrid Runner
29. Wrap up
- Future of Bio-data distribution
  - Computationally find and use dispersed, complex data
- Best methods for Bio-data Grids
  - High volume and complex data
  - Efficient selection and transport to grid computers
  - LDAP works well; Web-XML is usable
- Community needs and uses
  - Shared data descriptions, schema, ontologies (Semantic web)
  - Simple, practical, flexible grid methods; use existing dbs
  - Use common developing standards
30. Thanks to the work, help and support of these folks and others
- IUBio Archive -- Danfeng Yao, Paul Poole
- Bio-Mirrors -- Yoshihiro Ugawa, Tin Tan Wee, Markus Buchhorn, Akira Mizushima, Juncai MA, and others
- FlyBase -- Victor Strelets, Gary Grumbling, Nihar Sheth, Manish Anand, Edwin Wang, Thom Kaufman, Kathy Matthews, and a crowd at other sites
- euGenes -- original idea of and work with Bill Gelbart and others
- Bioinformatics web -- Sue Olson
Eugenes fulgens (Magnificent Hummingbird, Costa Rica)