Overview of the Encyclopedia of Life (EOL) Project

Provided by: greg238
1
Overview of the Encyclopedia of Life (EOL) Project
2
Background
  • Biology has become a data-driven science
  • We have the blueprints (genomes) of over 800
    organisms
  • This number will increase rapidly, to the point
    where in 5-10 years your blueprint becomes a tool
    in your medical diagnosis
  • First we must understand the buildings
    (proteins) that control life's processes
  • EOL strives to be the 21st-century Britannica
    that everyone will turn to

3
EOL Project Description
  • The Encyclopedia of Life is a joint development
    of the San Diego Supercomputer Center (SDSC)
    and scientists and biological resources
    worldwide
  • EOL involves SDSC staff from HPC, DAKS, Grids
    and Clusters, and Visualization
  • EOL has three parts:
  • 1. Putative functional and 3-D structure
    assignment through the largest computation
    ever attempted
  • 2. True API-level integration with key biological
    resources
  • 3. A focus for future collaborative developments
    via the EOL Notebook

4
Type of Questions to be Addressed by EOL
  • If a knockout gene in Arabidopsis leads to an
    average phenotypic response of 10% increased
    growth, will the same likely happen in rice?
  • Is protein X found in anthrax?
  • Is protein X a drug target, that is, does it
    exist predominantly in pathogenic bacteria or is
    it found in eukaryotes also?
  • Has caspase-1, a protein involved in cell death
    and aging, been identified in any plants? If so,
    in what species, and do the proposed protein
    structures look similar?
  • Give me all available information on caspase-1

5
EOL Basic Topology
Genomic Data
Putative Functional and 3D Assignment
Integration with Other Resources
Public and Private Databases To Serve Thousands Worldwide
6
Some Technical Detail Mapped to the Topology
TeraGrid
Sequence data from genomic sequencing projects
Ported applications
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Application Server
Normalized DB2 schema
Web/SOAP Server
Retrieve Web pages
Invoke SOAP methods
7
http://arabidopsis.sdsc.edu: One Plant Genome Processed as a Prototype
8
Current Genomic Pipeline
Arabidopsis protein sequences
Sequence info: NR, PFAM
Structure info: SCOP, PDB
Prediction of signal peptides (SignalP, PSORT),
transmembrane regions (TMHMM, PSORT), coiled
coils (COILS) and low-complexity regions (SEG)
Building FOLDLIB: PDB chains, SCOP domains, PDP
domains, CE matches PDB vs. SCOP; 90% sequence
non-identical; minimum size 25 aa; coverage (90%,
gaps <30, ends <30)

Create PSI-BLAST profiles for protein sequences
Structural assignment of domains by PSI-BLAST on
FOLDLIB
Only sequences w/out A-prediction
Structural assignment of domains by 123D on
FOLDLIB
Only sequences w/out A-prediction
Functional assignment by PFAM, NR, PSIPred
assignments
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
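
The cascade on this slide, in which 123D is run only on sequences that PSI-BLAST has not already assigned with confidence, can be sketched roughly as follows. This is an illustration under stated assumptions, not the EOL code: the `Assignment` record, the toy predictor functions and the reading of "A-prediction" as a top reliability class "A" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Assignment:
    seq_id: str
    fold: str
    method: str
    reliability: str  # assumed scale: "A" is the most reliable class


# A predictor takes (seq_id, sequence) and returns an Assignment or None.
Predictor = Callable[[str, str], Optional[Assignment]]


def run_cascade(sequences: Dict[str, str], stages: Iterable[Predictor]) -> List[Assignment]:
    """Run each stage only on sequences that still lack an "A" assignment."""
    best: Dict[str, Assignment] = {}
    pending = dict(sequences)
    for predict in stages:
        for seq_id, seq in list(pending.items()):
            hit = predict(seq_id, seq)
            if hit is None:
                continue
            # Keep the first hit, or replace a weaker one with an "A" hit.
            if seq_id not in best or hit.reliability == "A":
                best[seq_id] = hit
            if hit.reliability == "A":
                del pending[seq_id]  # later stages skip this sequence
    return [best[k] for k in sorted(best)]


# Toy stand-ins for PSI-BLAST and 123D searches against FOLDLIB (hypothetical).
def toy_psiblast(seq_id: str, seq: str) -> Optional[Assignment]:
    if seq.startswith("MK"):
        return Assignment(seq_id, "fold-1", "PSI-BLAST", "A")
    return None


def toy_123d(seq_id: str, seq: str) -> Optional[Assignment]:
    return Assignment(seq_id, "fold-2", "123D", "B")


results = run_cascade({"p1": "MKTAYI", "p2": "GSHMLE"}, [toy_psiblast, toy_123d])
for r in results:
    print(r.seq_id, r.method, r.reliability)
```

Ordering the stages from cheapest to most expensive means the costlier method only sees the sequences the earlier one could not settle.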
9
Scale of Multi-genome Analysis
800 genomes @ 10k-20k ORFs per genome = ~10^7 ORFs
Genomes: protein sequences
Sequence info: NR, PFAM
Structure info: SCOP, PDB
Prediction of signal peptides (SignalP, PSORT),
transmembrane regions (TMHMM, PSORT), coiled
coils (COILS) and low-complexity regions (SEG)
4 CPU years
10^4 entries
Building FOLDLIB: PDB chains, SCOP domains, PDP
domains, CE matches PDB vs. SCOP; 90% sequence
non-identical; minimum size 25 aa; coverage (90%,
gaps <30, ends <30)

228 CPU years
Create PSI-BLAST profiles for protein sequences
3 CPU years
Structural assignment of domains by PSI-BLAST on
FOLDLIB
Only sequences w/out A-prediction
9 CPU years
Structural assignment of domains by 123D on
FOLDLIB
Only sequences w/out A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred
assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
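
The scale figures on this slide can be checked with a few lines of arithmetic. The genome count, ORF range and the six CPU-year labels are taken from the slide; the 1,000-CPU allocation and the perfect-scaling assumption at the end are hypothetical and only there to give a feel for wall-clock time.

```python
# Figures copied from the slide.
genomes = 800
orfs_per_genome = (10_000, 20_000)   # "10k-20k" per genome
cpu_years = [4, 228, 3, 9, 252, 3]   # the six CPU-year labels, in slide order

# Total ORFs across all genomes: low and high estimates.
orf_range = tuple(genomes * n for n in orfs_per_genome)
total_cpu_years = sum(cpu_years)

print(orf_range)        # 8-16 million, i.e. on the order of 10^7 ORFs
print(total_cpu_years)  # sum of the per-step estimates

# Assumption for illustration only: perfect scaling on 1,000 CPUs.
cpus = 1000
wall_clock_days = round(total_cpu_years / cpus * 365, 1)
print(wall_clock_days)
```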
10
TeraGrid application
  • Technical aspects
  • Excellent charter application for the TeraGrid
    project!
  • Good demonstration of producing practical output
    from TeraGrid computing: scientific papers and
    an extensive web site and services will be
    produced
  • Software pipeline is now a proven technique and a
    sure bet
  • Can be implemented in the fastest possible time:
    project already initialized

11
EOL Data Services
Load/update scripts
MySQL DataMart(s)
Data warehouse
Pipeline data
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
Publish Web Services API
Application server
SOAP/Web Server
UDDI directory
Web pages served via JSP
EOL Notebook
Data incorporated into third party web pages
Automated data downloads to mirrors and researchers
Encyclopedia of Life
WWW
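
To make the "Invoke SOAP methods" path more concrete, here is a minimal sketch of how a client might build a SOAP 1.1 request for one of the published web services. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:eol`, the method name `getStructureAssignments` and its parameters are hypothetical, since the slides do not list the actual API.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
EOL_NS = "urn:example:eol"  # hypothetical service namespace


def build_request(method: str, params: dict) -> bytes:
    """Build a SOAP 1.1 request envelope for one method call."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{EOL_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{EOL_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical call: method and parameter names are placeholders.
request = build_request(
    "getStructureAssignments",
    {"proteinName": "caspase-1", "method": "PSI-BLAST"},
)
print(request.decode("utf-8"))
```

The resulting bytes would be sent as the body of an HTTP POST to the SOAP server; the transport step is left out because the endpoint is not given in the slides.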
12
Basic Web Interface
Encyclopedia of Life
Microsoft Windows: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Apple Macintosh: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
Linux: Netscape 4.7/6.1, Mozilla v1.0, Opera
Win-CE and pen-based devices: MS Internet Explorer, Netscape 4.7/6.1, Mozilla v1.0, Opera
13
Local Data Mirrors
Mirror Manager
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SDSC
SOAP Server
Request for bulk data streams
Data Management Layer
MySQL DataMart(s)
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
BLAST server
SOAP Server
Web Interface
14
Local Data Mirrors
  • Support for server platforms, i.e.
  • Sparc Solaris
  • IRIX
  • Linux
  • Based on MySQL and Apache because of availability
  • Automated mirror registration and listing
  • User-friendly admin for mirror maintenance
  • Means of metering data usage per species data
    stream to generate revenue from industry
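
Metering usage per species data stream could be as simple as a counter keyed by customer and species. The sketch below is one possible shape for it, not a description of the EOL mirror software; the `UsageMeter` class, the customer name and the byte counts are made up for the example.

```python
from collections import defaultdict


class UsageMeter:
    """Accumulate bytes served per (customer, species data stream)."""

    def __init__(self) -> None:
        self._bytes = defaultdict(int)

    def record(self, customer: str, species: str, n_bytes: int) -> None:
        """Add one transfer to the running total for this customer and species."""
        self._bytes[(customer, species)] += n_bytes

    def totals_by_species(self, customer: str) -> dict:
        """Return {species: total bytes} for one customer, sorted by species."""
        return {sp: n for (c, sp), n in sorted(self._bytes.items()) if c == customer}


# Placeholder transfers for a hypothetical industry customer.
meter = UsageMeter()
meter.record("example-company", "Arabidopsis thaliana", 5_000_000)
meter.record("example-company", "Oryza sativa", 2_000_000)
meter.record("example-company", "Arabidopsis thaliana", 1_000_000)
usage = meter.totals_by_species("example-company")
print(usage)
```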

15
EOL Notebook
EOL DataMart
Structure assignment by 123D
Domain location prediction
Structure assignment by PSI-BLAST
SOAP Server
Encyclopedia of Life
EOL SOAP Queries
Invoke
Virtual community messaging
XML/RDF store
Metadata sharing
BLAST Data
Keyword data
Stored queries
Scheduler
Annotations
BLAST
Keyword queries
Session info
16
EOL Notebook
  • Provides a consistent, advanced, cross-platform
    GUI to view data returned from queries to the EOL
    database via Web Services
  • Provides persistence of both queries and returned
    data via a local XML database
  • Provides a mechanism to enable unattended,
    scheduled, periodic queries
  • Provides a means to annotate data and results and
    share those with others, in effect a scientific
    Napster
  • Provides a means to create virtual communities
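
The persistence bullet, storing both queries and their returned data in a local XML store, might look something like the sketch below. The element names (`notebook`, `storedQuery`, `results`, `row`), the file layout and the sample row are assumptions for illustration; the slides do not describe the Notebook's actual XML schema.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_query(path, name, query_text, rows):
    """Append one stored query and its returned rows to a local XML file."""
    if os.path.exists(path):
        tree = ET.parse(path)
        root = tree.getroot()
    else:
        root = ET.Element("notebook")
        tree = ET.ElementTree(root)
    entry = ET.SubElement(root, "storedQuery", {"name": name})
    ET.SubElement(entry, "text").text = query_text
    results = ET.SubElement(entry, "results")
    for row in rows:
        # Each returned row is stored as attributes on a <row> element.
        ET.SubElement(results, "row", {k: str(v) for k, v in row.items()})
    tree.write(path, encoding="utf-8", xml_declaration=True)


def load_queries(path):
    """Return {query name: list of row dicts} from the local XML file."""
    root = ET.parse(path).getroot()
    return {
        q.get("name"): [dict(r.attrib) for r in q.find("results")]
        for q in root.findall("storedQuery")
    }


# Placeholder data: the row content is invented for the example.
path = os.path.join(tempfile.mkdtemp(), "notebook.xml")
save_query(path, "caspase-in-plants", "keyword: caspase-1",
           [{"id": "example-1", "species": "placeholder species"}])
stored = load_queries(path)
print(stored)
```

A scheduler for the unattended periodic queries could then re-run each stored query text and append the new rows to the same file.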

17
Summary
  • 1. EOL is a large-scale data analysis project,
    one of the largest biological computations
    attempted, whose results will be eagerly awaited
    by an enormous number of biologists
  • 2. Core scientific analysis techniques are
    well proven in the existing Arabidopsis project
  • 3. It is a perfect choice as a charter application
    for the TeraGrid:
  • Very large-scale computation
  • Pipeline-type computations well suited to the
    Grid platform
  • High visibility and very practical use of
    TeraGrid results
  • TeraGrid name will become associated with high
    quality data analysis