Title: Bioinformatics applications on the EGEE Grid
1Bioinformatics applications on the EGEE Grid
- Brendan Hamill
- Edinburgh Centre for Bioinformatics
2Contents
- History of the EGEE Grid
- Overview of the main grid services
- Training activity in EGEE
- Bioinformatics applications
3EGEE international e-infrastructure
- Objectives of programme
- Build, deploy and operate a consistent, robust,
large scale production grid service that links
with and builds on national, regional and
international initiatives
- Improve and maintain the middleware in order to
deliver a reliable service to users
- Attract new users from research and industry and
ensure training and support for them
4History of EGEE
5History of EGEE (2)
- European DataGrid (EDG) Project
- Ended March 2004
- EGEE phase 1
- April 2004-March 2006
- EGEE-II
- April 2006-March 2008
- Part of the EU Sixth Framework Programme (FP6)
- Budget > 50M
- > 1200 individuals in 91 partner organisations
6CERN Large Hadron Collider
7Large Hadron Collider (2)
8Large Hadron Collider (3)
9Large Hadron Collider (4)
10LHC pre-accelerators and detectors
12CMS Detector
13Other subject areas in EGEE
Astrophysics
Bioinformatics
Computational Chemistry
14Applications on EGEE
- More than 25 applications from 9 domains
- Astrophysics
- MAGIC, Planck
- Computational Chemistry
- Earth Sciences
- Earth Observation, Solid Earth Physics,
Hydrology, Climate
- Financial Simulation
- E-GRID
- Fusion
- Geophysics
- EGEODE
- High Energy Physics
- 4 LHC experiments (ALICE, ATLAS, CMS, LHCb)
- BaBar, CDF, DØ, ZEUS
- Multimedia
- Life Sciences
- Bioinformatics (Drug Discovery, GPS@,
Xmipp_MLrefine, etc.)
- Medical imaging (GATE, CDSS, gPTM3D, SiMRI 3D,
etc.)
15Distribution of CPU time by disciplines and dates
16EGEE-II Expertise Resources
- More than 90 partners
- 32 countries
- 12 federations
- Major and national Grid projects in Europe,
USA, Asia
- 27 countries through related projects
- BalticGrid
- SEE-GRID
- EUMedGrid
- EUChinaGrid
- EELA
17Collaborating projects
18Regional distribution
19EGEE-II Activities
- Service activities - establishing operations
- Grid Operations: Geneva
- Security: Lyon
- Testing: Geneva
- Network activities - supporting VOs
- Project management: Geneva
- Training: Edinburgh
- Applications Support: Paris
- External projects: Athens
- Joint Research Activities - e.g. hardening
middleware
- Middleware development: Bologna
20Related projects: infrastructure, education,
application
21Grid services
- How can EGEE middleware support collaboration and
resource sharing within and between many diverse
VOs ?
22Grid Middleware
- When using a Grid you
- Login with digital credentials (Authentication)
- Use rights given to you (Authorisation)
- Run jobs
- Manage files: create them, read/write, list
directories
- Services are linked by the Internet
- Middleware
- Many admin domains
- When using a PC or workstation you
- Login with a username and password
(Authentication)
- Use rights given to you (Authorisation)
- Run jobs
- Manage files: create them, read/write, list
directories
- Components are linked by a bus
- Operating system
- One admin domain
23Typical current grid
- Grid middleware runs on each shared resource
- Data storage
- (Usually) batch queues on pools of processors
- Users join VOs
- Virtual organisation negotiates with sites to
agree access to resources
- Distributed services (both people and middleware)
enable the grid, allow single sign-on
24Authorisation, Authentication (AA)
Users in many locations and organisations
Grid Security Infrastructure
Resources in many locations and organisations
System software
Operating system
Local scheduler
File system
Hardware
Computing clusters,
Network resources
Data storage
25Basic job submission
Users
- Tools that
- copy files to and between CEs and data storage
- Submit job to a CE
- Monitor job
- Get output
How do I run a job on a compute element (CE) ?
(CE batch queue)
Resources
Compute elements
Data storage
Network resources
26Information service (IS)
Users
- Information service
- Resources send updates to IS
- Grid services query IS before running jobs
How do I know which CE could run my job? Which is
free?
Resources
Compute elements
Data storage
Network resources
27File management
Users
Storage, Transfer, Replica management
We've terabytes of data in files.
My data are in files, and I've terabytes
Our data are in files, and I've terabytes
- EGEE data primarily file-based
- services for databases used by some VOs
Resources
Compute elements
Data storage
Network resources
28Security, Authentication and Authorisation
29Authentication and Authorisation
- Authentication - communication of identity
- Basis for
- Message integrity - so tampering is recognised
- Message confidentiality, if needed - so only
sender and receiver can understand the message
- Non-repudiation: knowing who did what, when -
can't deny it
- Authorisation - once identity is known, what can
a user do?
- Delegation - A allows service B to act on behalf
of A
- Based on X.509 certificates
30http://compchem.unipg.it
31Current production middleware
Replica Catalogue
User interface
Information Service
Resource Broker
Author. Authen.
Input sandbox Broker Info
Output sandbox
Logging Book-keeping
Computing Element
Job Status
32User Interface node
- The user's interface to the Grid
- Command-line interface to
- Create/Manage proxy certificates
- Job operations
- To submit a job
- Monitor its status
- Retrieve output
- Data operations
- Upload file to SE
- Create replica
- Discover replicas
- Other grid services
33User Interface node
- Also C and Java APIs
- To run a job, the user creates a JDL (Job
Description Language) file
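As an illustration of what such a file contains, the sketch below builds a minimal JDL job description from Python. It assumes the commonly used JDL attributes (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox). The helper `make_jdl`, the script name and the file names are hypothetical and do not come from the slides.

```python
def make_jdl(executable, arguments, input_files, output_files):
    """Return the text of a minimal JDL job description (illustrative only)."""
    def quoted_list(items):
        # JDL lists are written as {"a", "b"}
        return "{" + ", ".join(f'"{item}"' for item in items) + "}"

    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {quoted_list(input_files)};",
        f"OutputSandbox = {quoted_list(['std.out', 'std.err'] + list(output_files))};",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: a wrapper script plus one input file, one result file back.
jdl_text = make_jdl(
    "run_blast.sh",
    "query.fasta",
    ["run_blast.sh", "query.fasta"],
    ["hits.txt"],
)
print(jdl_text)
```

The resulting text would be saved to a file on the User Interface node and handed to the job submission command. The input sandbox travels with the job and the output sandbox is what is retrieved afterwards.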
34Querying job status
Possible Job States
36Live Real Time Monitor Site
- http://gridportal.hep.ph.ic.ac.uk/rtm/applet.html
37Overall load
- 19.6 million jobs run in 1st year of EGEE-II
- 56000 per day sustained average
- Peak of 98000
- Non-LHC: 13500/day
- Level of total in EGEE in 2005
- 8400 CPU-years delivered in 1 year
- 1/3 of total available, sustained over the year
- Peak of 50% of available in Feb 07
- 1/3 of total was non-LHC in Dec 06
38EGEE Training Site
http://www.egee.nesc.ac.uk
39NA3 Activity Partners in EGEE-II
EGEE Training Team, University of Warsaw
40Training effort in EGEE-II
- 30 partners
- 31 FTEs, 135 individuals
- 5% of project budget (2.4M)
- e-Learning digital library
- Training infrastructure
- dedicated sub-grid of training clusters
(Catania, Karlsruhe, Edinburgh, Budapest,
Warsaw, Athens, Prague, Bratislava)
41EGEE-II Training Events
42EGEE Training Site
http://www.egee.nesc.ac.uk
43Digital Library
http://egee.lib.ed.ac.uk/
44EGEE Digital Library statistics
- 73 articles
- 13 courses
- 316 events
- 53 modules
- 3926 presentations
- 70 tutorials
- 97 videos
- 27 ETF Exemplars
The EGEE Digital Library contains over 4000
learning resources derived from EGEE events
45UIG pages
- http://www.egee.nesc.ac.uk/uig
46UIG pages
57Link to EGEE 07 presentations
- EGEE 07 Conference Agenda Page
- http://indico.cern.ch/conferenceDisplay.py?confId=18714
58Example Applications
- WISDOM Project
- BioinfoGRID
- Health e-Child
- MATLAB in Grids
59WISDOM
- WISDOM stands for World-wide In Silico Docking On
Malaria
- Goal: find new drugs for neglected and emerging
diseases
- Neglected diseases lack R&D
- Emerging diseases require very rapid response
time
- Method: grid-enabled virtual docking
- Cheaper than in vitro tests
- Faster than in vitro tests
60In-Silico Drug Discovery
- WISDOM Project (World-wide In Silico Docking On
Malaria)
- About 80 CPU years to produce 1 TB of data
61First Target: Malaria
- 300 million people worldwide are affected
- 1-1.5 million people die every year
- Widely spread
- Caused by protozoan parasites of the genus
Plasmodium
Life cycle
62Role of Plasmepsins
- Plasmepsins are involved in hemoglobin
degradation during the parasite's life cycle.
- Present in the 4 species of Plasmodium causing
the disease in humans
- Sequence homology between the plasmepsins is high
(65-70%)
- X-ray crystallography data available
[Diagram: hemoglobin degradation pathway. Labels: hemoglobin; plasmepsins (I, II, IV, and HAP); small peptides; heme; falcipain and plasmepsin; oxidation; smaller peptides; hematin; polymerization; aminopeptidases; hemozoin (malarial pigment); amino acids]
63Second Target: Avian Flu
- Profiling Inhibitors of Influenza H5N1
- docking of 300,000 compounds studied
- 8 different target structures of Influenza A
neuraminidases
- 2000 CPUs were used over 4 weeks (> 100 CPU-years)
- > 60,000 output files with a data volume of 600
Gigabytes
64Biological objectives
- Malaria: find active molecules
- on a known mutated protein (DHFR)
- on new targets
- Plasmepsins
- GST
- Tubulin
- Avian Flu
- Study the impact of point mutations of the N1
enzyme
- Tamiflu active on N1
- Find new molecules active on N1
N1
H5
65A first step towards in silico drug discovery
virtual screening
- In silico virtual screening
- Starting from millions of compounds, select a
handful of compounds for in vitro testing
- Very computationally intensive but potentially
much cheaper than in vitro testing
- Where to find CPUs to make it time effective?
66Grid-enabled virtual docking
Millions of potential drugs to test
against interesting proteins!
High Throughput Screening 1-10/compound, several
hours
67Statistics of deployment
- First Data Challenge: July 1st - August 15th 2005
- Target: malaria
- 80 CPU years
- 1 TB of data produced
- 1700 CPUs used in parallel
- 1st large scale docking deployment world-wide on
an e-infrastructure
- Second Data Challenge: April 15th - June 30th
2006
- Target: avian flu
- 100 CPU years
- 800 GB of data produced
- 1700 CPUs used in parallel
- Collaboration initiated on March 1st; deployment
preparation achieved in 45 days
- Third Data Challenge: October 1st - 15th December
2006
- Target: malaria
- 400 CPU years
- 1.6 TB of data produced
- Up to 5000 CPUs used in parallel
68Status of in vitro tests
- Avian Flu
- Initial number of compounds: 300,000
- 123 compounds bought and tested out of the 2250
selected
- 7 out of 123, approximately 6%, are active
- Usual average success rate for in vitro tests:
0.1%
- Factor 60 increase, to be confirmed on more
compounds
- Tests under way at Chonnam National University
(ROK)
- Malaria
- Initial number of compounds: 500,000 (WISDOM-I)
- Selection of 30 molecules in 2 steps
- 1000 molecules selected on docking score
- Selection of 30 molecules through molecular
dynamics
- Tests under way at Chonnam National University
(ROK)
- First results are very encouraging
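The "approximately 6" and "factor 60" figures for avian flu follow directly from the counts on this slide. The short check below uses only those numbers; the variable names are illustrative.

```python
# Figures taken from the avian flu bullets on this slide.
tested = 123          # compounds bought and tested in vitro
active = 7            # compounds found active
usual_rate = 0.001    # 0.1% usual in vitro success rate

hit_rate = active / tested
enrichment = hit_rate / usual_rate

print(f"hit rate:   {hit_rate:.1%}")
print(f"enrichment: about {enrichment:.0f}x the usual rate")
```

The exact ratio is a little under 60. The slide appears to round the hit rate to 6% before dividing by 0.1%.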
69http://www.bioinfogrid.eu
70Biological Databases Use case
71Biological Database in GRID
- The following biological databases are currently
available in Grid
- InterPro databases
- PROSITE Patterns (Hofmann, K. et al. 1999),
- PROSITE profile (Hofmann, K. et al. 1999),
- PRINTS (Attwood, T. K. et al. 2000),
- Pfam (Bateman, A. et al. 2000),
- PRODOM (Corpet, F. et al. 1999),
- SMART (Schultz, J. et al. 2000),
- TIGRFAMs (Haft, D.H. et al. 2001),
- PIRSF,
- PANTHER,
- SUPERFAMILY
72Biological Databases
- BLAST databases
- nr (NCBI),
- nt (NCBI),
- pdbaa (NCBI),
- UCSC_human_chrs (UCSC),
- human_genomic (NCBI),
- refseq_protein (NCBI),
- refseq_rna (NCBI),
- refseq_genomic (NCBI),
- ecoli (NCBI),
- yeast (NCBI),
- uniprot (UNIPROT),
- est_human (NCBI),
- est_mouse (NCBI)
73Functional Analogous Finder
- Goal
- Compare gene products according to their
description
- AND NOT
- according to their sequence similarity.
As description we use the standardised
terminology of the Gene Ontology (GO).
Data source: the Gene Ontology Database (GODB)
is a repository of the GO and the associations
between the terms and the gene products
(GOA). Currently there are 2M gene products
described by 21000 terms, producing 9M
associations.
74Functional Analogous Finder
- Approach
- A selection of about 1M well annotated gene
products is involved in the search.
- A simple chi-square application compares the
common and non-common terms between two compared
gene products.
- Problem
- A comparison of one gene product against the
whole 1M gene products occupies 1 CPU for 30 min.
on average
- The whole "every gene product against each other"
search would occupy 1 CPU for more than 50 years.
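The "more than 50 years" estimate can be reproduced from the two figures given in the Problem bullets. This is a back-of-the-envelope check, assuming 365-day years and the rounded count of 1M gene products.

```python
gene_products = 1_000_000
minutes_per_comparison = 30   # one gene product vs. all others, on one CPU

total_cpu_hours = gene_products * minutes_per_comparison / 60
total_cpu_years = total_cpu_hours / (24 * 365)

print(f"{total_cpu_hours:,.0f} CPU-hours = {total_cpu_years:.1f} CPU-years")
```

That comes to roughly 57 years on a single CPU.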
75Functional Analogous Finder
- Solution
- Split the search into a number of small jobs and
distribute them together with the DB on as many
free WN as possible.
- The job submission is made by a script running as
a daemon
- The script submits 80 jobs every 30 minutes
- It is possible to run more instances of the
submission daemon in order to increase the total
number of jobs submitted in one hour
- The multi-process submission improves the speed of
submission
- The submission uses 3 RBs in a round-robin
scheme, so that no single RB is overloaded and
the failure of a single RB cannot stop the
submission of jobs
- Periodically retrieve the OutputSandbox of the
jobs
- Monitor the status of the production by simply
querying the monitoring DB
- The user can know the number of processed/running
genes
- The number of running jobs
- The location of each job
- Debug possible errors in running jobs
- The software to submit jobs is installed on 2
different machines so that a single hardware
failure cannot stop the submission
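The round-robin use of several RBs described above could look roughly like the sketch below. It is not the project's actual script. The `submit` callable stands in for the real submission command, and the broker names and the simulated failure are hypothetical.

```python
from itertools import cycle


def submit_batch(jobs, brokers, submit):
    """Submit each job to the next broker in turn, skipping brokers that fail.

    `submit(broker, job)` is a caller-supplied function that returns a job id
    or raises RuntimeError; it stands in for the real submission command.
    """
    rotation = cycle(brokers)
    placed = {}
    for job in jobs:
        for _ in range(len(brokers)):   # try each broker at most once per job
            broker = next(rotation)
            try:
                placed[job] = (broker, submit(broker, job))
                break
            except RuntimeError:
                continue                # this RB failed; move on to the next
    return placed


def fake_submit(broker, job):
    """Hypothetical stand-in: pretend the second broker is down."""
    if broker == "rb2":
        raise RuntimeError("rb2 unreachable")
    return f"{broker}:{job}"


placed = submit_batch([f"job{i}" for i in range(6)], ["rb1", "rb2", "rb3"], fake_submit)
for job, (broker, job_id) in placed.items():
    print(job, "->", broker)
```

With one broker unreachable, every job still lands on one of the other two, which is the behaviour the slide asks for. A job that fails on all brokers is simply left out of the result, so a later pass can retry it.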
76Functional Analogous Finder: the job submission
[Diagram: UI, RB, central DB, Farm1, Farm2, SE1, SE2]
- A simple monitoring system, based on a central
DB, makes it possible to know in real time the
status of each job and make some post-mortem
analysis.
- Status of the single operation made by the
running script
- Location of the jobs
- A series of scripts runs periodically on the UI
to submit and control the jobs
- The script submits 80 jobs every 30 minutes
- The central DB acts as a task queue for
automatic job submission
77Functional Analogous Finder: actions performed
when the job reaches the WN
[Diagram: UI, RB, DB, Farm1, Farm2, SE1, SE2]
- Reads from a DB the n genes to compare (for
10-hour jobs), chosen among the genes not yet
completed or those that have been running for
more than 48 hours
- Downloads the input files (from one of three
available SEs)
- Decompresses them
- Installs the Perl libraries
- Starts the Perl script and the comparison
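The first step, picking genes from the central DB, amounts to a small task-queue rule: take work that has not been started, plus work that has been "running" for so long that its job is presumed lost. The sketch below is one reading of that rule. The in-memory `tasks` mapping, the status names and the value of `GENES_PER_JOB` are assumptions (20 is simply 10 hours divided by the 30 minutes per gene quoted earlier); the slides do not show the real DB schema.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=48)   # from the slide: running for more than 48 hours
GENES_PER_JOB = 20                  # assumption: 10-hour job / 30 min per gene


def pick_genes(tasks, now, n=GENES_PER_JOB):
    """Return up to n gene ids that are pending, or running but stale.

    `tasks` maps gene id -> (status, started), with status one of
    "pending", "running", "done"; this schema is hypothetical.
    """
    chosen = []
    for gene, (status, started) in tasks.items():
        stale = status == "running" and now - started > STALE_AFTER
        if status == "pending" or stale:
            chosen.append(gene)
        if len(chosen) == n:
            break
    return chosen


now = datetime(2007, 6, 1, 12, 0)
tasks = {
    "g1": ("done", now - timedelta(hours=60)),
    "g2": ("running", now - timedelta(hours=50)),   # stale, picked again
    "g3": ("running", now - timedelta(hours=5)),    # still fresh, skipped
    "g4": ("pending", None),
}
print(pick_genes(tasks, now))   # ['g2', 'g4']
```

Re-issuing stale genes means a job that died silently on a worker node does not leave a permanent hole in the results, at the cost of occasionally computing a gene twice.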
78Functional Analogous Finder Results
- All 1M gene products processed in less than one
month
- Different Farms used: 64
- Different Hosts used: 2446
- Total submitted jobs: 95041
- Total started jobs: 66313
- Total successful jobs (from application point
of view): 42992
- Total failed jobs (for input staging problem):
3209
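The job counts above can be turned into rates to see where jobs were lost. The snippet uses only the four totals from this slide; the variable names are illustrative.

```python
submitted = 95041
started = 66313
successful = 42992
failed_staging = 3209

started_rate = started / submitted
success_rate = successful / started
staging_failure_rate = failed_staging / started

print(f"started / submitted:        {started_rate:.1%}")
print(f"successful / started:       {success_rate:.1%}")
print(f"staging failures / started: {staging_failure_rate:.1%}")
```

About 70% of submitted jobs started, about 65% of started jobs succeeded from the application's point of view, and input staging accounts for roughly 5% of started jobs. The slide does not break down the remaining failures.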
79The Health-e-Child Project Platform: A gLite
Adoption Case Study
2007-10-01, EGEE Business Track, Budapest, Hungary
David Manset - dmanset@maat-g.com
maat Gknowledge (MAAT), http://www.maat-g.com
80Project Objectives
- Establish Horizontal and Vertical integration of
data, information and knowledge for Paediatrics
- Develop a grid-based biomedical information
platform, supported by sophisticated and robust
search, optimisation, and matching techniques for
heterogeneous information
- Build enabling tools and services that improve
the quality of care and reduce its cost by
increasing efficiency
- Integrated disease models exploiting all
available information levels
- Database-guided decision support systems
- Large-scale, cross-modality information fusion
and data mining for knowledge discovery
- A Knowledge Repository for Paediatrics
81Distributed Computing with MATLAB in Grids
- Silvina Grad-Freilich
- Manager, Parallel Computing Technical Marketing
- sgradfre@mathworks.com
http://indico.cern.ch/materialDisplay.py?contribId=283&sessionId=25&materialId=slides&confId=18714
82Licensing for Third-Party and Global Use
University A
HPC Center
83- License Management within Grid Framework
- Some of the issues to resolve
- Third-party licensing
- Global licensing
- Commercial vs. academic use
- Policy on license management within the EGEE
framework
84Pilot: EGEE & The MathWorks
Integrate distributed computing tools with EGEE
middleware
- Step 1: Research need and pre-setup
- Survey EGEE virtual organizations on MATLAB use
(EGEE)
- Identify sites to be used in test (EGEE)
- Provide trial licenses (MathWorks)
- Step 2: Technical feasibility study
- Integrate with local resource manager (EGEE)
- Integrate with local resource manager through
Workload Management System (MathWorks & EGEE)
- Step 3: Define licensing model
- Create model for Grid deployment within the EGEE
framework (MathWorks, with much appreciated EGEE
support!)
85Further Information
- EGEE Public Portal
- http://www.eu-egee.org
- EGEE Training Site
- http://www.egee.nesc.ac.uk
- EGEE Digital Library
- http://egee.lib.ed.ac.uk
- EGEE User Information Group
- http://www.egee.nesc.ac.uk/uig
863rd EGEE User Forum
87EGEE08
- The EGEE08 Conference will take place in