Overview of the Encyclopedia of Life (EOL) Project

Provided by: greg238

Transcript and Presenter's Notes

Title: Overview of the Encyclopedia of Life EOL Project

1
Overview of the Encyclopedia of Life (EOL) Project
2
Background

Biology has become a data-driven science
We have the blueprint (genomes) of over 800
organisms
This number will increase rapidly, to the point
where in 5-10 years your blueprint becomes a tool
in your medical diagnosis
First we must understand the buildings
(proteins) that control life's processes
EOL strives to be the 21st century Britannica
that everyone will turn to

3
EOL Project Description

The Encyclopedia of Life is a joint development
of the San Diego Supercomputer Center (SDSC)
and scientists and biological resources
worldwide
EOL involves SDSC staff from HPC, DAKS, Grids
and clusters, and visualization
EOL has three parts:
1. Putative functional and 3-D structure
assignment through the largest computation
ever attempted
2. True API-level integration with key biological
resources
3. A focus for future collaborative developments
via the EOL Notebook

4
Type of Questions to be Addressed by EOL

If a knockout gene in Arabidopsis leads to an
average phenotypic response of 10% increased
growth, will the same likely happen in rice?
Is protein X found in anthrax?
Is protein X a drug target, that is, does it
exist predominantly in pathogenic bacteria or is
it found in eukaryotes also?
Has caspase-1, a protein involved in cell death
and aging, been identified in any plants? If so,
in what species, and do the proposed protein
structures look similar?
Give me all available information on caspase-1

5
EOL Basic Topology
Genomic Data
Putative Functional and 3D Assignment
Integration with Other Resources
Public and Private Databases To Serve Thousands
Worldwide
6
Some Technical Detail Mapped to the Topology
TeraGrid
Sequence data from genomic sequencing projects
Ported applications
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Application Server
Normalized DB2 schema
Web/SOAP Server
Retrieve Web pages
Invoke SOAP methods
7
http://arabidopsis.sdsc.edu: One Plant Genome
Processed as a Prototype
8
Current Genomic Pipeline
Arabidopsis Protein sequences
sequence info
structure info
Prediction of signal peptides (SignalP,
PSORT), transmembrane (TMHMM, PSORT), coiled
coils (COILS), low complexity regions (SEG)
NR, PFAM
SCOP, PDB
Building FOLDLIB: PDB chains, SCOP domains, PDP
domains, CE matches PDB vs. SCOP; 90% sequence
non-identical; minimum size 25 aa; coverage (90%,
gaps <30, ends <30)

Create PSI-BLAST profiles for Protein sequences
Structural assignment of domains by PSI-BLAST on
FOLDLIB
Only sequences w/out A-prediction
Structural assignment of domains by 123D on
FOLDLIB
Only sequences w/out A-prediction
Functional assignment by PFAM, NR, PSIPred
assignments
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
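
The staged flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the EOL pipeline code: both stage functions are hypothetical stand-ins that return canned answers, and reading "only sequences w/out A-prediction" as "only sequences that an earlier stage did not already assign" is an assumption.

```python
from typing import Callable, Dict, List, Optional

# A stage takes a protein sequence and returns a fold assignment, or None
# when it cannot make a call. Both stages below are hypothetical stand-ins;
# the real pipeline runs PSI-BLAST and 123D against FOLDLIB.
Stage = Callable[[str], Optional[str]]


def fake_psi_blast(seq: str) -> Optional[str]:
    # Stand-in rule: pretend only sequences starting with "M" get a hit.
    return "fold-from-psiblast" if seq.startswith("M") else None


def fake_123d(seq: str) -> Optional[str]:
    # Stand-in rule: pretend threading succeeds for 5 or more residues.
    return "fold-from-123d" if len(seq) >= 5 else None


def assign_structures(seqs: List[str], stages: List[Stage]) -> Dict[str, Optional[str]]:
    """Run stages in order; later stages see only still-unassigned sequences."""
    assigned: Dict[str, Optional[str]] = {s: None for s in seqs}
    for stage in stages:
        for seq in seqs:
            if assigned[seq] is None:  # skip sequences already assigned
                assigned[seq] = stage(seq)
    return assigned


results = assign_structures(["MKTAY", "GSHMLE", "AC"], [fake_psi_blast, fake_123d])
for seq, fold in results.items():
    print(seq, fold)
```

Ordering the stages this way keeps the more expensive method away from sequences that a cheaper one has already handled, which is one plausible reason for the "only sequences w/out" restriction.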
9
Scale of Multi-genome Analysis
800 genomes @ 10k-20k ORFs per genome = ~10^7 ORFs
Genomes Protein sequences
sequence info
structure info
Prediction of signal peptides (SignalP,
PSORT), transmembrane (TMHMM, PSORT), coiled
coils (COILS), low complexity regions (SEG)
NR, PFAM
SCOP, PDB
4 CPU years
10^4 entries
Building FOLDLIB: PDB chains, SCOP domains, PDP
domains, CE matches PDB vs. SCOP; 90% sequence
non-identical; minimum size 25 aa; coverage (90%,
gaps <30, ends <30)

228 CPU years
Create PSI-BLAST profiles for Protein sequences
3 CPU years
Structural assignment of domains by PSI-BLAST on
FOLDLIB
Only sequences w/out A-prediction
9 CPU years
Structural assignment of domains by 123D on
FOLDLIB
Only sequences w/out A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred
assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
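
The figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers shown above. The processor count in the usage line is a hypothetical value chosen for illustration, and the elapsed-time estimate assumes perfect scaling, which real runs will not achieve.

```python
# Figures taken from the slide. Per-stage labels are left out because the
# flattened transcript does not say which number belongs to which step.
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)
STAGE_CPU_YEARS = [4, 228, 3, 9, 252, 3]

# Range of total ORFs across all genomes.
low, high = (GENOMES * n for n in ORFS_PER_GENOME)

# Total compute across all pipeline stages listed on the slide.
total_cpu_years = sum(STAGE_CPU_YEARS)


def wall_clock_days(cpu_years: float, processors: int) -> float:
    """Ideal elapsed days, assuming perfect scaling (an optimistic assumption)."""
    return cpu_years * 365.0 / processors


print(f"ORFs: {low:,} to {high:,}")
print(f"Total: {total_cpu_years} CPU years")
# 1000 processors is a hypothetical figure, not one from the presentation.
print(f"Ideal elapsed time on 1000 processors: {wall_clock_days(total_cpu_years, 1000):.0f} days")
```

The ORF range of 8 to 16 million is consistent with the order-of-magnitude figure of 10^7, and the stage costs sum to 499 CPU years.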
10
TeraGrid application

Technical aspects
Excellent charter application for the TeraGrid
project!
Good demonstration of producing practical output
from TeraGrid computing: scientific papers and
an extensive web site and services will be
produced
Software pipeline is now a proven technique and a
sure bet
Can be implemented in the fastest possible time:
project already initialized

11
EOL Data Services
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Publish Web Services API
Application server
SOAP/Web Server
UDDI directory
Web pages served via JSP
EOL Notebook
Data incorporated into third party web pages
Automated data downloads to mirrors and
researchers
Encyclopedia of Life
WWW
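
To make the "Publish Web Services API" box more concrete, here is a minimal sketch of what a client request to a SOAP server could look like. The operation name `getProteinInfo`, the parameter `proteinName`, and the namespace `urn:example:eol` are all hypothetical; the presentation does not specify the EOL API. Only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
EOL_NS = "urn:example:eol"  # hypothetical service namespace, for illustration only


def build_request(operation: str, params: dict) -> bytes:
    """Serialize a SOAP 1.1 request envelope for one operation call."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{EOL_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{EOL_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical call matching the "all available information on caspase-1" question.
request = build_request("getProteinInfo", {"proteinName": "caspase-1"})
print(request.decode("utf-8"))
```

A real client would POST these bytes to the service endpoint and would normally generate the request from the service's published description rather than by hand.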
12
Basic Web Interface
Encyclopedia of Life
Microsoft Windows: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Apple Macintosh: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Linux: Netscape 4.7/6.1, Mozilla v1.0, Opera
Win-CE and pen-based devices: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
13
Local Data Mirrors
Mirror Manager
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SDSC
SOAP Server
Request for bulk data streams
Data Management Layer
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
BLAST server
SOAP Server
Web Interface
14
Local Data Mirrors

Support for server platforms, i.e.
Sparc Solaris
IRIX
Linux
Based on MySQL and Apache because of availability
Automated mirror registration and listing
User-friendly admin for mirror maintenance
Means of metering data usage per species data
stream to generate revenue from industry
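
Metering per species data stream could be as simple as a running byte count keyed by mirror and species. The sketch below is one possible shape, under stated assumptions: the class name `UsageMeter`, the mirror names, and the byte counts are all made up for illustration and do not come from the EOL design.

```python
from collections import defaultdict
from typing import Dict, Tuple


class UsageMeter:
    """Hypothetical per-mirror, per-species byte counter."""

    def __init__(self) -> None:
        self._bytes: Dict[Tuple[str, str], int] = defaultdict(int)

    def record(self, mirror: str, species: str, n_bytes: int) -> None:
        """Add one transfer to the running total for this mirror and species."""
        if n_bytes < 0:
            raise ValueError("n_bytes must be non-negative")
        self._bytes[(mirror, species)] += n_bytes

    def total_for_mirror(self, mirror: str) -> int:
        """Bytes delivered to one mirror across all species streams."""
        return sum(n for (m, _), n in self._bytes.items() if m == mirror)

    def total_for_species(self, species: str) -> int:
        """Bytes delivered for one species stream across all mirrors."""
        return sum(n for (_, s), n in self._bytes.items() if s == species)


meter = UsageMeter()
meter.record("mirror-a", "arabidopsis", 1_000)
meter.record("mirror-a", "rice", 500)
meter.record("mirror-b", "arabidopsis", 250)
print(meter.total_for_mirror("mirror-a"))
print(meter.total_for_species("arabidopsis"))
```

Keeping the key as a (mirror, species) pair lets the same counts answer both billing questions (per mirror) and popularity questions (per species).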

15
EOL Notebook
EOL DataMart
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SOAP Server
Encyclopedia of Life
EOL SOAP Queries
Invoke
Virtual community messaging
XML/RDF store
Metadata sharing
BLAST Data
Keyword data
Stored queries
Scheduler
Annotations
BLAST
Keyword queries
Session info
16
EOL Notebook

Provides a consistent, advanced, cross-platform
GUI to view data returned from queries to the EOL
database via Web Services
Provides persistence of both queries and returned
data via a local XML database
Provides a mechanism to enable unattended,
scheduled, periodic queries
Provides a means to annotate data and results and
share those with others, in effect a scientific
Napster
Provides a means to create virtual communities
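
Two of the Notebook features above, local XML persistence and scheduled periodic queries, can be sketched together. The element and attribute names (`notebook`, `storedQuery`, `everyHours`, `lastRun`), the query strings, and the result text are placeholders invented for this example; the presentation does not describe the Notebook's actual storage format.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
from typing import List


def save_query(store: ET.Element, name: str, query: str, result: str,
               every_hours: int, last_run: datetime) -> ET.Element:
    """Persist a query, its last result, and its schedule in the XML store."""
    entry = ET.SubElement(store, "storedQuery", name=name,
                          everyHours=str(every_hours),
                          lastRun=last_run.isoformat())
    ET.SubElement(entry, "query").text = query
    ET.SubElement(entry, "result").text = result
    return entry


def due_queries(store: ET.Element, now: datetime) -> List[str]:
    """Return names of stored queries whose interval has elapsed."""
    due = []
    for entry in store.findall("storedQuery"):
        last_run = datetime.fromisoformat(entry.get("lastRun"))
        interval = timedelta(hours=int(entry.get("everyHours")))
        if now - last_run >= interval:
            due.append(entry.get("name"))
    return due


store = ET.Element("notebook")
started = datetime(2002, 1, 1, 8, 0)  # arbitrary example timestamp
save_query(store, "caspase-in-plants", "keyword: caspase-1", "placeholder result", 24, started)
save_query(store, "anthrax-check", "keyword: protein X", "placeholder result", 168, started)

# Serialize and reload to show that the schedule survives a round trip.
xml_text = ET.tostring(store, encoding="unicode")
reloaded = ET.fromstring(xml_text)
due = due_queries(reloaded, datetime(2002, 1, 2, 9, 0))
print(due)
```

An unattended scheduler would call `due_queries` periodically, rerun each named query through the SOAP interface, and update the stored result and `lastRun` value.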

17
Summary

1. EOL is a large-scale data analysis project,
one of the largest biological computations
attempted, whose results will be eagerly awaited
by an enormous number of biologists
2. Core scientific analysis techniques are
well proven in the existing Arabidopsis project
3. It's a perfect choice as a charter application
for the TeraGrid
Very large scale computation
Pipeline-type computations well suited to the
Grid platform
High visibility and very practical use of
TeraGrid results
TeraGrid name will become associated with high
quality data analysis