Title: EMISPHER 3rd Euro-MEDiterranean Conference
1 EMISPHER 3rd Euro-MEDiterranean Conference
- Dimitrios Vogiatzis
- University of Cyprus
- 26/06/2004
2 Summary of talk
- Brief presentation of the bioGRID project
- Focus on the computational GRID
- Present the PROVA language (distributed computations)
- Early evaluation results
3 Scientific Objectives of the bioGRID project
- BioGrid is a trial IST (FP5) project, 2 years (Sep-02 to Sep-04), with the following objectives
- Development and integration of grid technologies so that researchers obtain an efficient information output
- Three tools to be integrated
- PSIMAP, protein interaction discovery and visualisation
- Space Explorer, gene and protein visualisation
- Classification Server, text data mining
- Tools access information resources
- Databases (protein, gene expression)
- Unstructured data (PubMed abstracts)
- Software tools (TOPS, for protein structural comparison)
4 The Grid component of the project
- Develop a grid technology
- for integration of protein interaction space, gene expression and document space
- Speed up calculations
- UCY team: manager George Papadopoulos; members Dimitrios Vogiatzis, Aristos Stavrou
- Site: www.bio-grid.net
5 Prova language
- In the project, the language is used as a rules-based backbone for distributed running of PSIMAP
- Prova is derived from Mandarax, a Java-based inference system
- Prova extends Mandarax by providing a proper language syntax,
- native syntax integration with Java,
- agent messaging and reaction rules.
6 Design Goals of Prova
- Marry the benefits of declarative and object-oriented programming
- Combine the syntaxes of Prolog and Java, the ultimate logic and object-oriented languages
- Expose logic and agent behaviour as rules
- Access data sources via wrappers written in Java or command-line shells like Perl
- Make all Java API from available packages directly accessible from rules
- Run within the Java runtime
- Enable rapid prototyping of applications
- Offer a rule-based platform for distributed agent programming
7 Prova is useful for data integration tasks when the following is important
- Location transparency (local, remote, mirrors)
- Format transparency (database, RDF, XML, HTML, flat files, computation resource)
- Resilience to change (databases and web sites change often)
- Use of open and open source technologies
- Understandability and modifiability by a non-IT specialist
- Economical knowledge representation
- Extensibility with additional functionality
8 Examples, using PROVA I
Given the open database connection DB and a unique protein identifier from PDB, test whether the provided domains with IDs PXA and PXB interact (have at least 4 atoms within 5 angstroms).

scop_dom2dom(DB, PDB_ID, PXA, PXB) :-
    access_data(pdb, PDB_ID, Protein),
    access_dom_atoms(DB, Protein, PXA, DomainA),
    access_dom_atoms(DB, Protein, PXB, DomainB),
    DomainA.interacts(DomainB).
9 Examples, using PROVA II
Opening a database:

location(database, scop, "jdbc:mysql://comas.soi.city.ac.uk", u, p).
location(database, scop, "jdbc:mysql://localhost", u, p).
dbopen(scop, DB).

Querying a database:

sql_select(DB, From, N1, V1, ..., Nk, Vk)
10 Agent based Prova I
- Messages via
- Java Message Service (JMS), a message oriented middleware platform; Joram implementation
- A sends a message to B. Once B goes online, the messages will be delivered
- JADE-HTTP, minimal configuration requirements compared to JMS, hence suited to ad-hoc networks
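The store-and-forward behaviour described for JMS (a message sent to an offline agent is held and delivered once that agent comes online) can be illustrated with a toy in-memory model. This is only a sketch: it is not the JMS or Joram API, and the names `Broker`, `send` and `go_online` are hypothetical.

```python
from collections import defaultdict, deque


class Broker:
    """Toy store-and-forward broker (hypothetical; not the JMS/Joram API)."""

    def __init__(self):
        self.online = {}                    # agent name -> list of delivered messages
        self.pending = defaultdict(deque)   # agent name -> messages held while offline

    def send(self, sender, recipient, body):
        message = (sender, body)
        if recipient in self.online:
            self.online[recipient].append(message)    # recipient online: deliver now
        else:
            self.pending[recipient].append(message)   # recipient offline: hold it

    def go_online(self, agent):
        inbox = self.online.setdefault(agent, [])
        # Flush everything queued while the agent was offline, in arrival order.
        while self.pending[agent]:
            inbox.append(self.pending[agent].popleft())
        return inbox


broker = Broker()
broker.send("A", "B", "ready")      # B is offline: the message is stored
inbox_b = broker.go_online("B")     # B connects: the stored message is delivered
broker.send("A", "B", "task-1")     # B is online: delivered at once
```

The sender never needs to know whether the recipient is currently reachable, which is the property the slide attributes to the message oriented middleware.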
11 Features of Prova-AA
- Agent communication
- sendMsg
- sendMsg(XID,Protocol,Agent,Performative,[Predicate|Args]|Context)
- rcvMsg rules
- rcvMsg(XID,Protocol,From,queryref,[X|Xs]|Context)
12 PROVA architecture
13 Distributed PSIMAP
- PSIMAP: discover possible protein interactions
- Database contains 6120 multidomain proteins
- PROVA 1.3, 1.4; PSIMAP prepackaged with Prova
14 PSIMAP
- PSIMAP is the first complete protein structural domain interaction map
- It shows what kinds of protein domains are found to be interacting structurally
- PSIMAP has specific shapes reflecting the types of protein domains and their interaction partners
15 On protein interactions
- Protein interactions provide an important context for the understanding of function
- Get multidomain proteins (3000-30000 residues); 12000 proteins and growing
- Determine interaction
- Check all residue pairs of any domain
- Possible interaction
- if the number of residue pairs within a threshold (5 Angstroms) is > 5 pairs
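The pairwise check above can be sketched as a brute-force count of residue pairs within the distance threshold. This is a minimal illustration under stated assumptions, not PSIMAP's implementation: each residue is represented by a single 3D coordinate, the function names are hypothetical, and the cut-off is a parameter because the slides state it in slightly different ways (at least 4 atoms on one slide, 5 pairs here). Whether the comparison is strict is likewise an assumption.

```python
import math


def count_close_pairs(domain_a, domain_b, threshold=5.0):
    """Count residue pairs (one residue from each domain) within `threshold` angstroms."""
    return sum(
        1
        for a in domain_a
        for b in domain_b
        if math.dist(a, b) <= threshold
    )


def domains_interact(domain_a, domain_b, threshold=5.0, min_pairs=5):
    """Assumed rule: two domains interact if at least `min_pairs` pairs are close."""
    return count_close_pairs(domain_a, domain_b, threshold) >= min_pairs


# Toy coordinates in angstroms (made up for illustration).
domain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_b = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0), (50.0, 50.0, 50.0)]
far_domain = [(100.0, 100.0, 100.0)]

close_pairs = count_close_pairs(domain_a, domain_b)
```

This version compares every residue of one domain with every residue of the other, so its cost grows with the product of the two domain sizes.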
16 Protein interaction, computation
17 Algorithm
- Interaction(Superfamily1, Superfamily2) if
- PDB(Protein),
- Domain(Protein,Domain1),
- Domain(Protein,Domain2),
- SCOP Superfamily(Domain1, Superfamily1),
- SCOP Superfamily(Domain2, Superfamily2),
- InteractionDD(Domain1,Domain2, 5 Ang, 5 Residues)
- Complexity is O(n log(n))
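Read as a join, the rule above says: for every protein, take pairs of its domains, test the pair for interaction, and record the pair of SCOP superfamilies the two domains belong to. The sketch below follows that reading with plain dictionaries. The data layout, the function name and the choice to consider only unordered pairs of distinct domains are assumptions, and the domain-level test is passed in as a callable standing in for InteractionDD.

```python
from itertools import combinations


def superfamily_interactions(proteins, superfamily_of, interacts):
    """Derive interacting superfamily pairs.

    proteins:       {protein_id: [domain_id, ...]}
    superfamily_of: {domain_id: superfamily_id}
    interacts:      callable(domain1, domain2) -> bool, stand-in for InteractionDD
    """
    found = set()
    for domains in proteins.values():
        for d1, d2 in combinations(domains, 2):
            if interacts(d1, d2):
                # Store each pair in sorted order so (A, B) and (B, A) coincide.
                pair = tuple(sorted((superfamily_of[d1], superfamily_of[d2])))
                found.add(pair)
    return found


# Toy data, made up for illustration.
proteins = {"p1": ["d1", "d2", "d3"], "p2": ["d4", "d5"]}
superfamily_of = {"d1": "sfA", "d2": "sfB", "d3": "sfC", "d4": "sfA", "d5": "sfB"}
contacts = {("d1", "d2"), ("d4", "d5")}


def toy_interacts(a, b):
    return (a, b) in contacts or (b, a) in contacts


result = superfamily_interactions(proteins, superfamily_of, toy_interacts)
```

Two different proteins contribute the same superfamily pair here, and the set keeps it once.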
18 Purpose of Experiments
- Discover the significance of network delays
- Manager, worker-plus in remote locations
- Discover the significance of management delays
- Manager, worker-plus situation
- Come up with the speed-up
19 Manager, worker-plus local set-up
[Diagram labels: Manager; three Worker-plus nodes; local copy of protein structs]
- Processing: manager, workers
- Management: manager
20 Distributed set-up
[Diagram labels: sites RUG, UCY and CITY; one Manager; two Routers; seven Worker-plus nodes; three local copies of protein structs]
21 Manager
location(database,scop,"jdbc:mysql://xxx.xxx").

% Open the database, split the work into 1500 tasks and announce readiness.
start_psimap() :-
    dbopen(scop,DB),
    Psimap=psimap.Psimap(DB),
    List=Psimap.divideSuperTaskList(1500),
    assert(processor(Psimap)),
    assert(tasks(List)),
    attach_routers(),
    iam(Me),
    sendMsg(XID,self,Me,tell,ready()).

attach_routers() :-
    router(Router),
    sendMsg(XID,jade,Router,tell,attach()).
attach_routers().

% Whoever reports ready receives the next task from the list.
rcvMsg(XID,Protocol,From,tell,ready()) :-
    tasks(List),
    Task=List.removeFirst(),
    execute_task(XID,Protocol,From,Task).

% Task executed by the manager itself.
execute_task(XID,self,Me,Task) :-
    println("About to execute ",Task),
    !,
    processor(Psimap),
    spawn(Psimap,executeTask,Task),
    rcvMsg(XID1,self,Me,return,complete(Psimap,executeTask,Task)),
    Result=Psimap.getTaskResult(),
    sendMsg(XID,self,Me,reply,worker(Result,Task)),
    sendMsg(XID,self,Me,tell,ready()).

% Task submitted to a remote agent.
execute_task(XID,jade,From,Task) :-
    sendMsg(XID,jade,From,submit,worker(Result,Task)).

rcvMsg(XID,Protocol,From,reply,worker(Result,Task)) :-
    store_result(From,Result,Task).
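The manager rules follow a pull pattern: an agent reports ready, receives the next task, and its reply is stored before it asks again. The Python sketch below imitates that cycle with threads and a shared queue inside one process. It is an analogy under stated assumptions, not a translation of the Prova agents: there is no messaging layer, and `run_manager`, `worker_loop` and `execute` are hypothetical names.

```python
import queue
import threading


def run_manager(tasks, worker_names, execute):
    """Hand tasks to whichever worker is free and collect {task: (worker, result)}."""
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)

    results = {}
    lock = threading.Lock()

    def worker_loop(name):
        while True:
            try:
                task = todo.get_nowait()    # "ready": ask for the next task
            except queue.Empty:
                return                      # task list exhausted
            result = execute(task)          # do the work
            with lock:                      # "reply": the result is stored
                results[task] = (name, result)

    threads = [threading.Thread(target=worker_loop, args=(name,)) for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_manager(range(10), ["w1", "w2", "w3"], lambda task: task * task)
```

Because workers pull tasks rather than being assigned a fixed share, a faster node simply ends up processing more of them.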
22 Worker
super("prova@SUPERNODE").
location(database,scop,"jdbc:mysql://localhost","guest","guestdb").

start_worker() :-
    dbopen(scop,DB),
    Psimap=psimap.Psimap(DB),
    assert(processor(Psimap)),
    super(Super),
    sendMsg(XID,jade,Super,tell,ready()).

worker(Result,Task) :-
    println("About to execute ",Task),
    iam(Me),
    processor(Psimap),
    spawn(Psimap,executeTask,Task),
    rcvMsg(XID,self,Me,return,complete(Psimap,executeTask,Task)),
    Result=Psimap.getTaskResult(),
    println(Result).

% Reaction rule to submit
rcvMsg(XID,Protocol,From,submit,XXs) :-
    derive(X,XXs),
    sendMsg(XID,Protocol,From,reply,XXs),
    sendMsg(XID,Protocol,From,tell,ready()).
23 Router
location(database,scop,"jdbc:mysql://lh","u","p").

start_router() :-
    dbopen(scop,DB),
    Psimap=psimap.Psimap(DB),
    assert(processor(Psimap)),
    Workers=java.util.LinkedList(),
    assert(workers(Workers)).

% The manager attaches: remember it and report readiness.
rcvMsg(XID,Protocol,Super,tell,attach()) :-
    assert(super(Super)),
    workers(Workers),
    element(Worker,Workers),
    sendMsg(XID,Protocol,Super,tell,ready()).
rcvMsg(XID,Protocol,Super,tell,attach()) :-
    sendMsg(XID,Protocol,Super,tell,ready()).

% A worker reports ready: queue it and pass the readiness on to the manager.
rcvMsg(XID,Protocol,From,tell,ready()) :-
    println(From," is ready."),
    workers(Workers),
    Workers.addLast(From),
    super(Super),
    sendMsg(XID,Protocol,Super,tell,ready()).

% A task arrives: forward it to the first queued worker, else run it locally.
rcvMsg(XID,Protocol,From,submit,XXs) :-
    workers(Workers),
    Worker=Workers.removeFirst(),
    !,
    sendMsg(XID,Protocol,Worker,forward,submit(From,XXs)).
rcvMsg(XID,Protocol,From,submit,XXs) :-
    derive(X,XXs),
    sendMsg(XID,Protocol,From,reply,XXs),
    sendMsg(XID,Protocol,From,tell,ready()).

rcvMsg(XID,Protocol,Router,forward,submit(Super,XXs)) :-
    derive(X,XXs),
    sendMsg(XID,Protocol,Super,reply,XXs),
    sendMsg(XID,Protocol,Router,tell,ready()).

worker(Result,Task) :-
    println("About to execute ",Task),
    iam(Me),
    processor(Psimap),
    spawn(Psimap,executeTask,Task),
    rcvMsg(XID1,self,Me,return,complete(Psimap,executeTask,Task)),
    Result=Psimap.getTaskResult().
24 Evaluation methods
- Find the processing power of each node
- Expressed in proteins/minute
- Caution: proteins are of varying size
- Method: System.currentTimeMillis(); output all results in a file
- Find the processing power of a manager/worker set-up locally
- Processing power of manager/workers over two sites
- Processing power of a single manager accessing a remote database
- Available processing power
- RUG: 18 nodes (Linux)
- UCY: 7 nodes (5 Linux, 2 PCs)
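The measurement method (timestamps taken around the processing of a batch of proteins) amounts to dividing the number of proteins by the elapsed wall-clock minutes. A minimal Python sketch of that calculation follows. The function name and the injectable `clock` parameter are assumptions made so the example can be checked without waiting; the talk itself used System.currentTimeMillis() in Java.

```python
import time


def proteins_per_minute(process, proteins, clock=time.time):
    """Process every protein and return the observed throughput in proteins/minute."""
    start = clock()
    for protein in proteins:
        process(protein)
    elapsed_minutes = (clock() - start) / 60.0
    if elapsed_minutes <= 0:
        return float("inf")   # too fast to measure with this clock
    return len(proteins) / elapsed_minutes


# Fake clock for illustration: the batch of 10 proteins "takes" 120 seconds.
ticks = iter([0.0, 120.0])
rate = proteins_per_minute(lambda protein: None, ["p"] * 10, clock=lambda: next(ticks))
```

Since proteins vary in size, a rate obtained this way depends on which proteins were in the batch, so comparisons between nodes are only meaningful on the same input.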
25 1st series of experiments
- Evaluate the speed-up locally on UCY
- 6120 proteins
- 1500 tasks
- Prova 1.3
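For a sense of task granularity: 6120 proteins over 1500 tasks is about 4 proteins per task. The slides do not show how divideSuperTaskList partitions the work, so the sketch below simply assumes an even split into contiguous chunks; `split_evenly` is a hypothetical helper, not part of PSIMAP.

```python
def split_evenly(items, n_tasks):
    """Split `items` into `n_tasks` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_tasks)
    tasks = []
    start = 0
    for index in range(n_tasks):
        size = base + (1 if index < extra else 0)   # the first `extra` chunks get one more
        tasks.append(items[start:start + size])
        start += size
    return tasks


tasks = split_evenly(list(range(6120)), 1500)
sizes = [len(task) for task in tasks]
```

Under this assumption, 120 tasks hold 5 proteins and the remaining 1380 hold 4.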
26 2nd series of experiments I
- Evaluate the processing power of each node operating alone
- Expressed in proteins/second
27 2nd series of experiments II
- The different processing power of the nodes should be taken into account
- Processing the initial proteins is fast, then slows down
28 3rd series of experiments
- SET-UP
- One node @ UCY (cs1005)
- Worker plus
- Local to UCY database
- 4.37 prots/min
- One node @ RUG (lilith)
- Manager
- Local to RUG database
- RESULTS
- Node @ UCY
- 4.37 prots/min (1.18 times slowdown)
- Node @ RUG (lilith)
- 6.64 prots/min (1.05 times slowdown)
- Overall speed-up
- Just 9% slower than adding the processing power of both machines
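The overall figure can be checked from the per-node numbers. The calculation below assumes that each slowdown factor is the ratio of a node's stand-alone rate to its rate in the two-site run, which is one reading of the slide and not stated explicitly. The function name is hypothetical.

```python
def combined_efficiency(measured_rates, slowdowns):
    """Ratio of the combined measured rate to the sum of the stand-alone rates.

    Assumes stand-alone rate = measured rate * slowdown factor for each node.
    """
    standalone_rates = [rate * factor for rate, factor in zip(measured_rates, slowdowns)]
    return sum(measured_rates) / sum(standalone_rates)


# Rates in proteins/minute and slowdown factors taken from the results above.
efficiency = combined_efficiency([4.37, 6.64], [1.18, 1.05])
loss_percent = (1.0 - efficiency) * 100.0   # roughly 9.2
```

Under that assumption the two nodes together deliver about 11.0 proteins/minute against about 12.1 stand-alone, which matches the loss quoted for the overall speed-up.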
29 Further Steps
- Preset 1500 tasks, is it optimal?
- Dependent on available nodes
- Few sites/many nodes per site
- Try to integrate more nodes in the processing
- Few nodes/many sites
- SETI-like (currently not feasible)
- Expected speed-up, close to optimal?
- Evaluate that
- We collect the results in a huge file (>100 MB); not all results are necessary
30 Thanks. Questions?