Title: Overview of the Encyclopedia of Life (EOL) Project
1 Overview of the Encyclopedia of Life (EOL) Project
2 Background
- Biology has become a data-driven science
- We have the blueprint (genomes) of over 800 organisms
- This number will increase rapidly, to the point in 5-10 years where your blueprint becomes a tool in your medical diagnosis
- First we must understand the buildings (proteins) that control life's processes
- EOL strives to be the 21st-century Britannica that everyone will turn to
3 EOL Project Description
- The Encyclopedia of Life is a joint development of the San Diego Supercomputer Center (SDSC) and scientists and biological resources worldwide
- EOL involves SDSC staff from HPC, DAKS, Grids and clusters, and visualization
- EOL has three parts
  - 1. Putative functional and 3-D structure assignment, through the largest computation ever attempted
  - 2. True API-level integration with key biological resources
  - 3. A focus for future collaborative developments via the EOL Notebook
4 Types of Questions to be Addressed by EOL
- If a knockout gene in Arabidopsis leads to an average phenotypic response of 10% increased growth, will the same likely happen in rice?
- Is protein X found in anthrax?
- Is protein X a drug target, that is, does it exist predominantly in pathogenic bacteria or is it found in eukaryotes also?
- Has caspase-1, a protein involved in cell death and aging, been identified in any plants? If so, in what species, and do the proposed protein structures look similar?
- Give me all available information on caspase-1
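The drug-target question above can be read as a set comparison over the species in which a protein is found. The following is a minimal sketch under stated assumptions: the function name, the species labels, and the 0.8 cutoff for "predominantly" are all hypothetical and are not part of EOL.

```python
def drug_target_candidate(found_in, pathogens, eukaryotes,
                          min_pathogen_fraction=0.8):
    """Return True if a protein occurs mostly in pathogens and in no eukaryote.

    found_in   -- species in which the protein was identified
    pathogens  -- species classed as pathogenic bacteria
    eukaryotes -- species classed as eukaryotes
    The 0.8 threshold is an illustrative assumption, not an EOL rule.
    """
    found = set(found_in)
    if not found:
        return False  # no occurrences, nothing to target
    in_eukaryotes = found & set(eukaryotes)
    pathogen_fraction = len(found & set(pathogens)) / len(found)
    return not in_eukaryotes and pathogen_fraction >= min_pathogen_fraction


# Toy labels only; these are placeholders, not real species data.
PATHOGENS = {"pathogen_a", "pathogen_b", "pathogen_c"}
EUKARYOTES = {"eukaryote_a", "eukaryote_b"}

only_in_pathogens = drug_target_candidate(
    ["pathogen_a", "pathogen_b"], PATHOGENS, EUKARYOTES)
also_in_eukaryote = drug_target_candidate(
    ["pathogen_a", "eukaryote_a"], PATHOGENS, EUKARYOTES)
```

In a real deployment the three sets would come from queries against the integrated databases rather than from hard-coded literals.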
5 EOL Basic Topology
Genomic Data
Putative Functional and 3D Assignment
Integration with Other Resources
Public and Private Databases To Serve Thousands Worldwide
6 TeraGrid
Sequence data from genomic sequencing projects
Ported applications
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Application Server
Normalized DB2 schema
Web/SOAP Server
Some Technical Detail Mapped to the Topology
Retrieve Web pages
Invoke SOAP methods
7 http://arabidopsis.sdsc.edu: One Plant Genome Processed as a Prototype
8 Current Genomic Pipeline
Arabidopsis protein sequences
sequence info
structure info
Prediction of: signal peptides (SignalP, PSORT); transmembrane (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG)
NR, PFAM
SCOP, PDB
Building FOLDLIB: PDB chains; SCOP domains; PDP domains; CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
Create PSI-BLAST profiles for protein sequences
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences without A-prediction
Structural assignment of domains by 123D on FOLDLIB
Only sequences without A-prediction
Functional assignment by PFAM, NR, PSIPred assignments
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
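The ordering of the two structural-assignment steps can be sketched as a cascade. This is a minimal sketch, assuming that "Only sequences without A-prediction" means a later method sees only the sequences that earlier steps left unassigned; the slides do not spell this out. The assigner functions are hypothetical stand-ins and do not call PSI-BLAST or 123D.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: an assigner takes a protein sequence and returns a
# fold label, or None when it cannot make an assignment.
Assigner = Callable[[str], Optional[str]]


def run_cascade(sequences: Dict[str, str],
                steps: List[Tuple[str, Assigner]]) -> Dict[str, Tuple[str, str]]:
    """Apply assigners in order; each step only sees unassigned sequences."""
    assigned: Dict[str, Tuple[str, str]] = {}
    for step_name, assigner in steps:
        for seq_id, seq in sequences.items():
            if seq_id in assigned:
                continue  # an earlier step already produced an assignment
            fold = assigner(seq)
            if fold is not None:
                assigned[seq_id] = (step_name, fold)
    return assigned


# Toy stand-ins for illustration only (NOT real PSI-BLAST or 123D logic).
def fast_profile_search(seq: str) -> Optional[str]:
    return "fold-A" if seq.startswith("M") else None


def slow_threading(seq: str) -> Optional[str]:
    return "fold-B" if len(seq) >= 5 else None


demo = {"p1": "MKTAY", "p2": "GSHMLE", "p3": "AC"}
result = run_cascade(demo, [("psi-blast", fast_profile_search),
                            ("123d", slow_threading)])
```

Running the cheaper method first keeps the more expensive one off sequences that already have an assignment, which matters once the pipeline is scaled to many genomes.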
9 Scale of Multi-genome Analysis
800 genomes @ 10k-20k per genome: ~10^7 ORFs
Genomes Protein sequences
sequence info
structure info
Prediction of: signal peptides (SignalP, PSORT); transmembrane (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG)
NR, PFAM
SCOP, PDB
4 CPU years
10^4 entries
Building FOLDLIB: PDB chains; SCOP domains; PDP domains; CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
228 CPU years
Create PSI-BLAST profiles for protein sequences
3 CPU years
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences without A-prediction
9 CPU years
Structural assignment of domains by 123D on FOLDLIB
Only sequences without A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
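The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (800 genomes, 10k-20k ORFs per genome, and the six CPU-year figures in slide order); the processor count passed to `wall_clock_days` is a hypothetical example, and perfect parallel scaling is an assumption.

```python
# Figures quoted on the slide
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)      # "10k-20k per" genome
CPU_YEARS = [4, 228, 3, 9, 252, 3]      # per-step costs, in slide order

orf_low = GENOMES * ORFS_PER_GENOME[0]  # 8,000,000
orf_high = GENOMES * ORFS_PER_GENOME[1] # 16,000,000
total_cpu_years = sum(CPU_YEARS)        # 499


def wall_clock_days(cpu_years: float, processors: int) -> float:
    """Ideal wall-clock days, assuming perfect scaling across processors."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return cpu_years * 365.0 / processors


# 1000 processors is an illustrative value, not a TeraGrid specification.
days_on_1000 = wall_clock_days(total_cpu_years, 1000)
print(f"ORFs: {orf_low:,} to {orf_high:,}")
print(f"Total: {total_cpu_years} CPU years, "
      f"about {days_on_1000:.0f} days on 1000 processors")
```

The ORF range brackets the order-of-magnitude estimate of ten million, and the two large steps (228 and 252 CPU years) account for almost all of the total.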
10 TeraGrid application
- Technical aspects
- Excellent charter application for the TeraGrid project!
- Good demonstration of producing practical output from TeraGrid computing: scientific papers and an extensive web site and services will be produced
- Software pipeline now a proven technique and a sure bet
- Can be implemented in the fastest possible time: project already initialized
11 EOL Data Services
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Publish Web Services API
Application server
SOAP/Web Server
UDDI directory
Web pages served via JSP
EOL Notebook
Data incorporated into third party web pages
Automated data downloads to mirrors and researchers
Encyclopedia of Life
WWW
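To make "Publish Web Services API" concrete, here is a minimal sketch of how a client might assemble a SOAP 1.1 request body. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:eol`, the method name `getStructureAssignments`, and the parameter name `proteinName` are hypothetical, since the slides do not list the actual EOL methods.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
EOL_NS = "urn:example:eol"  # hypothetical service namespace (assumption)


def build_request(method: str, params: dict) -> bytes:
    """Build a SOAP 1.1 envelope that calls `method` with simple parameters."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{EOL_NS}}}{method}")
    for name, value in params.items():
        # Each parameter becomes a child element of the method element.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical method and parameter names, for illustration only.
request = build_request("getStructureAssignments",
                        {"proteinName": "caspase-1"})
```

The resulting bytes would be sent as the body of an HTTP POST to the service endpoint; the transport step is left out because the endpoint is not given in the slides.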
12 Basic Web Interface
Encyclopedia of Life
Microsoft Windows: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Apple Macintosh: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Linux: Netscape 4.7/6.1, Mozilla v1.0, Opera
Win-CE and pen-based devices: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
13 Local Data Mirrors
Mirror Manager
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SDSC
SOAP Server
Request for bulk data streams
Data Management Layer
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
BLAST server
SOAP Server
Web Interface
14 Local Data Mirrors
- Support for server platforms, i.e.
  - SPARC Solaris
  - IRIX
  - Linux
- Based on MySQL and Apache because of availability
- Automated mirror registration and listing
- User-friendly admin for mirror maintenance
- Means of metering data usage per species data stream to generate revenue from industry
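The metering idea in the last bullet can be sketched as a counter keyed by mirror and species. This is a minimal sketch under stated assumptions: the class name, the method names, and the choice of bytes as the metered unit are hypothetical, and the slides do not describe how EOL would actually meter usage.

```python
from collections import defaultdict


class UsageMeter:
    """Accumulate bytes served per (mirror, species) data stream."""

    def __init__(self):
        # (mirror_id, species) -> total bytes served
        self._bytes = defaultdict(int)

    def record(self, mirror_id: str, species: str, n_bytes: int) -> None:
        if n_bytes < 0:
            raise ValueError("n_bytes must be non-negative")
        self._bytes[(mirror_id, species)] += n_bytes

    def per_species(self, mirror_id: str) -> dict:
        """Bytes served to one mirror, broken down by species stream."""
        return {sp: n for (m, sp), n in self._bytes.items() if m == mirror_id}

    def total(self, mirror_id: str) -> int:
        return sum(self.per_species(mirror_id).values())


# Hypothetical mirror identifiers and transfer sizes.
meter = UsageMeter()
meter.record("mirror-1", "arabidopsis", 1_000)
meter.record("mirror-1", "arabidopsis", 500)
meter.record("mirror-1", "rice", 200)
meter.record("mirror-2", "rice", 50)
```

Keeping the species in the key is what allows usage to be reported per data stream rather than only per mirror.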
15 EOL Notebook
EOL DataMart
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SOAP Server
Encyclopedia of Life
EOL SOAP Queries
Invoke
Virtual community messaging
XML/RDF store
Metadata sharing
BLAST Data
Keyword data
Stored queries
Scheduler
Annotations
BLAST
Keyword queries
Session info
16 EOL Notebook
- Provides a consistent, advanced, cross-platform GUI to view data returned from queries to the EOL database via Web Services
- Provides persistence of both queries and returned data via a local XML database
- Provides a mechanism to enable unattended, scheduled, periodic queries
- Provides a means to annotate data and results and share those with others, in effect a scientific Napster
- Provides a means to create virtual communities
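Two of the Notebook features above, persistence of queries in a local XML store and scheduled periodic queries, can be sketched together. This is a minimal sketch under stated assumptions: the XML layout, the field names, and the hour-based scheduling are hypothetical and are not taken from the EOL Notebook design.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class StoredQuery:
    name: str
    keyword: str
    period_hours: int   # how often the query should be re-run
    last_run_hour: int  # hours since an arbitrary epoch


def to_xml(queries: List[StoredQuery]) -> str:
    """Serialize stored queries to a small XML document."""
    root = ET.Element("queries")
    for q in queries:
        ET.SubElement(root, "query", {
            "name": q.name,
            "keyword": q.keyword,
            "period_hours": str(q.period_hours),
            "last_run_hour": str(q.last_run_hour),
        })
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> List[StoredQuery]:
    """Load stored queries back from the XML produced by to_xml."""
    return [StoredQuery(e.get("name"), e.get("keyword"),
                        int(e.get("period_hours")), int(e.get("last_run_hour")))
            for e in ET.fromstring(text).findall("query")]


def due(queries: List[StoredQuery], now_hour: int) -> List[StoredQuery]:
    """Return the queries whose period has elapsed since their last run."""
    return [q for q in queries if now_hour - q.last_run_hour >= q.period_hours]


# Hypothetical stored queries.
saved = [StoredQuery("daily-caspase", "caspase-1", 24, 0),
         StoredQuery("weekly-anthrax", "anthrax", 168, 0)]
restored = from_xml(to_xml(saved))
```

A scheduler would call `due` periodically, run each returned query through the Web Services interface, and update `last_run_hour` before saving the store again.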
17 Summary
- 1. EOL is a large-scale data analysis project, one of the largest biological computations attempted, whose results will be eagerly awaited by an enormous number of biologists
- 2. Core scientific analysis techniques well-proven in the existing Arabidopsis project
- 3. It's a perfect choice as a charter application for the TeraGrid
  - Very large-scale computation
  - Pipeline-type computations well suited to the Grid platform
  - High visibility and very practical use of TeraGrid results
  - TeraGrid name will become associated with high-quality data analysis