Title: OGSA-DQP: current status and future direction
1. OGSA-DQP: current status and future direction
- Steven Lynden
- Department of Computer Science
- University of Manchester
2. Overview
- OGSA-DQP is a service-based distributed query processor
- The first release was in September 2003
- The current (2nd) release is based on Globus Toolkit 3.2 and OGSA-DAI release 4.0
- A 3rd release, based on OGSA-DAI release 7, is out soon
- DQP is a myGrid component
3. People, places
- University of Manchester
  - Tasos Gounaris, Steven Lynden, Alvaro Fernandes, Rizos Sakellariou, Norman Paton
- University of Newcastle
  - Arijit Mukherjee, Jim Smith, Paul Watson
- OGSA-DAI
4. Outline
- Distributed query processing, OGSA-DQP aims and objectives
- Background
- Current release: OGSA-DQP v2.0
- OGSA-DQP and myGrid
- Imminent release: OGSA-DQP v3.0
- Future work
5. Distributed Query Processing
- The user submits a single query referencing data stored at multiple sites.
- The author of the query need not be aware of how or where the data is stored.
- Given two DBMSs and one analysis tool (e.g., a Web Service):
  - goTerm: a GO (Gene Ontology) database running as a remote MySQL DB, which has a service interface
  - protein: a GIMS Genome Warehouse running as a remote DB, which has a service interface
  - Blast: sequence alignment scoring
- We want to obtain alignment scores for a sequence against proteins of a certain kind.
Example OGSA-DQP query written in OQL:
  select p.proteinId, Blast(p.sequence)
  from protein p, goTerm t
  where t.termId = 'GO:0005942' and p.proteinId = t.proteinId
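To make the intent of the example query concrete, here is a minimal sketch of what it computes, run against small in-memory lists instead of remote databases. The rows, the `blast` placeholder and the `run_query` helper are all assumptions made for illustration; the real Blast service returns alignment scores, not sequence lengths.

```python
# Hypothetical stand-ins for the two remote databases.
protein = [
    {"proteinId": "P1", "sequence": "MKTAYIAK"},
    {"proteinId": "P2", "sequence": "MLLAV"},
    {"proteinId": "P3", "sequence": "MSTNPKPQR"},
]
go_term = [
    {"termId": "GO:0005942", "proteinId": "P1"},
    {"termId": "GO:0005942", "proteinId": "P3"},
    {"termId": "GO:0008150", "proteinId": "P2"},
]


def blast(sequence):
    """Placeholder for the remote Blast service (returns the length, not a real score)."""
    return len(sequence)


def run_query(protein_rows, go_rows, term_id):
    """select p.proteinId, Blast(p.sequence) ... as a plain nested-loop join."""
    return [
        (p["proteinId"], blast(p["sequence"]))
        for p in protein_rows
        for t in go_rows
        if t["termId"] == term_id and p["proteinId"] == t["proteinId"]
    ]


results = run_query(protein, go_term, "GO:0005942")
```

With this toy data only P1 and P3 carry the requested GO term, so only their sequences are passed to the analysis function.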
6. Aims and objectives
- To process distributed queries
- To benefit from homogeneous access to heterogeneous data sources (OGSA-DAI)
- To benefit from Grid abstractions for on-demand allocation of resources for required tasks
- To provide transparent distribution and parallelism
- To orchestrate the composition of data retrieval and analysis services
7. On the Grid
- The Grid is middleware for discovering and accessing computational resources.
- The Open Grid Services Architecture (OGSA) recasts the Grid as a collection of Web Services.
- Grid Services combine:
  - Web Services for service description and invocation.
  - Grid middleware for computational resource description and utilisation.
- Databases are made available through integration with other Grid services and the provision of standard interfaces.
8. Mediator / wrapper middleware
- One approach to data integration is to use mediator/wrapper-based middleware.
- The wrappers reconcile differences and impose a global schema.
- Mediators only need to interact with the wrappers.
[Diagram: Query and Results pass through the mediator, which sits above the wrappers; each wrapper fronts a DBMS and its data.]
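The mediator/wrapper idea can be sketched in a few lines. This is an illustration only, not how OGSA-DAI or OGSA-DQP are implemented: the `Wrapper` and `Mediator` classes, the local field names and the sample rows are all hypothetical.

```python
class Wrapper:
    """Hides one source's local field names behind the global schema."""

    def __init__(self, rows, field_map):
        self.rows = rows
        self.field_map = field_map  # local field name -> global field name

    def scan(self):
        for row in self.rows:
            yield {self.field_map[key]: value for key, value in row.items()}


class Mediator:
    """Talks only to wrappers, never to the underlying sources."""

    def __init__(self, wrappers):
        self.wrappers = wrappers  # global table name -> Wrapper

    def scan(self, table):
        return list(self.wrappers[table].scan())


# Two sources that name their fields differently (invented data).
source_a = [{"prot_id": "P1", "seq": "MKT"}]
source_b = [{"ID": "P1", "GO_TERM": "GO:0005942"}]

mediator = Mediator({
    "protein": Wrapper(source_a, {"prot_id": "proteinId", "seq": "sequence"}),
    "goTerm": Wrapper(source_b, {"ID": "proteinId", "GO_TERM": "termId"}),
})
```

A client of `mediator` sees `proteinId` in both tables, regardless of what each source calls that field locally.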
9. OGSA-DQP adopts mediator/wrapper architecture
- OGSA-DQP can be seen as a mediator over OGSA-DAI wrappers.
- OGSA-DQP is a Grid service that uses Grid infrastructure to execute queries, which are compiled, scheduled and executed in parallel.
- In addition to OGSA-DAI-wrapped DBMSs, Web Services may be invoked to analyse data.
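[Diagram: the same layout as the mediator/wrapper picture, with OGSA-DQP in the mediator role and OGSA-DAI services as the wrappers over each DBMS and its data.]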
10. OGSA-DQP is service based
- Service based in two senses:
  - It supports queries over data storage and analysis resources made available as services.
  - The execution of distributed query plans is factored out to services.
- There are two types of DQP services:
  - Coordinators interact with clients, and compile and schedule query executions.
  - Evaluators execute the query (retrieve data from sources, carry out joins, etc.).
11. [Diagram: the client sends a resource list and a query to the Coordinator and receives results. The Coordinator gets schemas from OGSA-DAI services (over DBMS data) and WSDL from Web Services to form a global schema, carries out parsing, single-node optimisation, partitioning and scheduling, and sends query partitions to the evaluators.]
12. Query compilation
- Polar is used to compile and partition query execution plans.
- Queries must be written in OQL (Object Query Language).
- Consider the example query:
  select p.proteinId, Blast(p.sequence)
  from protein p, goTerm t
  where t.termId = 'GO:0005942' and p.proteinId = t.proteinId
- goTerm: Gene Ontology database
- protein: GIMS Genome Warehouse
- Blast: sequence alignment scoring (Web Service)
13. Logical optimisation
- Plan is expressed as a logical algebra
- Multiple equivalent plans are generated
  select p.proteinId, Blast(p.sequence)
  from protein p, goTerm t
  where t.termId = 'GO:0005942' and p.proteinId = t.proteinId
Logical plan:
reduce
  op_call (Blast)
    join (proteinId)
      reduce
        scan termId = 'GO:0005942' (goTerm)
      reduce
        scan (protein)
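As a rough illustration of a logical plan and of generating an equivalent one, the sketch below models the plan as a tree and produces a second plan by swapping join inputs (join commutativity). The `Op` class and both helpers are hypothetical; the actual compiler applies its own, richer set of rewrites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One logical operator: a name, an optional argument, and child operators."""
    name: str
    arg: str = ""
    children: tuple = ()


def render(op, depth=0):
    """Return the plan as indented lines, one operator per line."""
    label = f"{op.name} ({op.arg})" if op.arg else op.name
    lines = ["  " * depth + label]
    for child in op.children:
        lines.extend(render(child, depth + 1))
    return lines


def commute_joins(op):
    """Return an equivalent plan with the inputs of every join swapped."""
    children = tuple(commute_joins(child) for child in op.children)
    if op.name == "join":
        children = children[::-1]
    return Op(op.name, op.arg, children)


plan = Op("reduce", children=(
    Op("op_call", "Blast", (
        Op("join", "proteinId", (
            Op("reduce", children=(Op("scan", "goTerm, termId = 'GO:0005942'"),)),
            Op("reduce", children=(Op("scan", "protein"),)),
        )),
    )),
))
alternative = commute_joins(plan)
```

Calling `print("\n".join(render(alternative)))` shows the same operators with the two join inputs in the opposite order.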
14. Physical optimisation
- Plan is expressed as a physical algebra
- Plan is chosen by cost-ranking of equivalent plans
Physical plan:
reduce
  op_call (Blast)
    hash_join (proteinId)
      reduce
        table_scan termId = 'GO:0005942' (goTerm)
      reduce
        table_scan (protein)
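Cost-ranking can be pictured with a toy cost model. Everything below is assumed for illustration: the row counts, the cost weights and the three candidate plans are invented and do not reflect the optimiser's real cost functions.

```python
# Invented cardinalities: rows left after filtering goTerm, and rows in protein.
ROWS = {"goTerm_filtered": 50, "protein": 10_000}


def hash_join_cost(build_rows, probe_rows):
    # Assumed weights: building the hash table costs 2 per row, probing costs 1 per row.
    return 2 * build_rows + probe_rows


def nested_loop_cost(outer_rows, inner_rows):
    # Every outer row is compared with every inner row.
    return outer_rows * inner_rows


candidates = {
    "hash_join, build on goTerm": hash_join_cost(ROWS["goTerm_filtered"], ROWS["protein"]),
    "hash_join, build on protein": hash_join_cost(ROWS["protein"], ROWS["goTerm_filtered"]),
    "nested_loop_join": nested_loop_cost(ROWS["goTerm_filtered"], ROWS["protein"]),
}

# Rank the equivalent plans by estimated cost and keep the cheapest.
ranked = sorted(candidates.items(), key=lambda item: item[1])
best_plan, best_cost = ranked[0]
```

Under these made-up numbers, building the hash table on the small filtered goTerm input is the cheapest option.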
15. Query partitioning
- Plan is transformed into a parallel algebra (physical operators + data exchange)
- Exchange operators are placed where data exchange must take place
Parallel plan:
reduce
  op_call (Blast)
    exchange
      hash_join (proteinId)
        exchange
          reduce
            table_scan termId = 'GO:0005942' (goTerm)
        exchange
          reduce
            table_scan (protein)
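One way to read the parallel plan is that each exchange marks a boundary between partitions. The sketch below cuts a plan at its exchange operators; the tuple-based plan encoding and the `partition` helper are assumptions for illustration, and the sketch ignores the producer and consumer halves that a real exchange operator would need.

```python
def partition(plan):
    """Split a plan, given as (name, [children]) tuples, at its exchange operators."""
    partitions = []

    def walk(node, current):
        name, children = node
        if name == "exchange":
            # Whatever lies below an exchange starts a new partition.
            for child in children:
                fresh = []
                partitions.append(fresh)
                walk(child, fresh)
        else:
            current.append(name)
            for child in children:
                walk(child, current)

    root = []
    partitions.append(root)
    walk(plan, root)
    return partitions


plan = ("reduce", [("op_call(Blast)", [("exchange", [("hash_join(proteinId)", [
    ("exchange", [("reduce", [("table_scan(goTerm)", [])])]),
    ("exchange", [("reduce", [("table_scan(protein)", [])])]),
])])])])

parts = partition(plan)
```

For the example plan this yields four partitions: the Blast call with the final reduce, the hash join, and one partition per table scan.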
16. Query scheduling
- Allocate operators to evaluator nodes
- A heuristic algorithm is used, based on available resources, memory use and network costs
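The slide does not spell out the heuristic, so the following is only a generic greedy sketch of the kind of decision involved, not the OGSA-DQP scheduler itself. The evaluator names, memory figures and network costs are invented.

```python
def schedule(partitions, evaluators):
    """Greedy placement: largest partitions first, cheapest network cost wins."""
    free = {e["name"]: e["memory_mb"] for e in evaluators}
    assignment = {}
    for part in sorted(partitions, key=lambda p: -p["memory_mb"]):
        fits = [e for e in evaluators if free[e["name"]] >= part["memory_mb"]]
        if not fits:
            raise RuntimeError(f"no evaluator can host {part['name']}")
        # Prefer low network cost to the partition's data source, then most free memory.
        best = min(
            fits,
            key=lambda e: (e["net_cost"].get(part["source"], 0), -free[e["name"]]),
        )
        assignment[part["name"]] = best["name"]
        free[best["name"]] -= part["memory_mb"]
    return assignment


evaluators = [
    {"name": "eval-A", "memory_mb": 512, "net_cost": {"goTerm": 1, "protein": 5}},
    {"name": "eval-B", "memory_mb": 512, "net_cost": {"goTerm": 5, "protein": 1}},
]
partitions = [
    {"name": "scan-goTerm", "memory_mb": 64, "source": "goTerm"},
    {"name": "scan-protein", "memory_mb": 128, "source": "protein"},
    {"name": "hash-join", "memory_mb": 256, "source": None},
]

assignment = schedule(partitions, evaluators)
```

Each scan ends up on the evaluator that is cheapest to reach from its data source, and the join goes wherever memory allows.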
17. OGSA-DQP version 2.0
- Version 2.0 is the current release
- Based on OGSI (Globus Toolkit 3) and OGSA-DAI release 4.0
- Coordinators and evaluators are Grid services
  - Coordinators are mapped to Grid Distributed Query Services (GDQS), which extend OGSA-DAI data services
  - Evaluators are mapped to Grid Query Evaluation Services (GQES)
- Factories are used to create services
  - GDQSF: GDQS Factory
  - GQESF: GQES Factory
18. OGSA-DQP version 2.0
[Diagram: version 2.0 architecture, with the following annotations]
- The client submits a resource list and an XML perform document containing an OQL query statement activity parameterised by the query; results are returned to the client.
- The GDQS factory creates a GDQS instance (extends the OGSA-DAI GDS), which exists for the lifetime of a query session (terminated by the client).
- GQES factories create GQES instances, which exist for the lifetime of a query.
- Factories are how OGSA-DQP version 2.0 handles concurrency.
- Data is reached through OGSA-DAI services over DBMSs; Web Services may also be invoked.
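The factory approach to concurrency can be sketched as follows: every query gets its own freshly created instance, so concurrent queries never share state. The class and method names here are hypothetical and are not the OGSI or OGSA-DQP interfaces.

```python
import itertools


class EvaluatorInstance:
    """Hypothetical per-query evaluator holding state for exactly one query."""

    def __init__(self, instance_id, query):
        self.instance_id = instance_id
        self.query = query


class EvaluatorFactory:
    """Hypothetical factory that creates and destroys per-query instances."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.live = {}  # instance id -> instance

    def create_instance(self, query):
        instance = EvaluatorInstance(next(self._ids), query)
        self.live[instance.instance_id] = instance
        return instance

    def destroy_instance(self, instance_id):
        del self.live[instance_id]


factory = EvaluatorFactory()
first = factory.create_instance("query A")
second = factory.create_instance("query B")
factory.destroy_instance(first.instance_id)  # the first query has finished
```

The cost of this design is that clients must create and destroy instances explicitly.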
19. DQP and myGrid
- myGrid middleware offers support for creating and managing the information from in silico experiments
- Aims to avoid the need for repetitive manual use of bioinformatics tools
- Programmatic access is provided by Web Services
- Automation requires a representation of the process, achieved using workflows
- Workflows represent a procedure as a set of processes and relationships between processes
20. DQP and myGrid
- The Taverna workbench is a GUI-based application used to construct workflows
- A workflow enactment engine executes the workflow
- Data and metadata are stored using the myGrid Information Repository (mIR)
- Much more, including semantic services, event notification, etc.
21. DQP and myGrid
- OGSA-DQP has been developed as a myGrid component
- Complex queries have potentially high response times, which DQP can address via parallelisation
- A Web Service wrapper was created to allow OGSA-DQP to be invoked from within myGrid
- A use case involving DQP has been developed by the ISPIDER project (http://www.ispider.man.ac.uk)
22. ISPIDER use case
- Involves genome-focused protein identification
- PepMapper is a web service that uses mass spectrometry data produced by the digestion of a protein to match against proteins in a sequence database
- The biologist may know in advance a set of proteins that are relevant, e.g. proteins belonging to a particular family or domain
- By searching over a smaller protein set, the identification experiment is more efficient
- The implementation uses OGSA-DQP to reduce the size of the input data set
23. Workflow
[Workflow diagram, with the following steps]
- Workflow inputs: error, spots, OQL query
- OQL query: select p.Name, p.Seq from p in db_proteinSequences where p.OS = 'HomoSapiens'
- perform: DQP web service (wraps DQP; queries IPI, the International Protein Index)
- save spots, save
- convert: XML format -> FASTA format
- identify: PepMapper web service
- Workflow output: out
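The convert step can be illustrated with a small sketch. The XML layout used here (`results`, `row`, `Name`, `Seq` elements) is an assumption, since the slides do not show the actual result format, and the identifiers and sequences are invented.

```python
import xml.etree.ElementTree as ET


def xml_to_fasta(xml_text):
    """Turn <row><Name>..</Name><Seq>..</Seq></row> elements into FASTA records."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root.iter("row"):
        lines.append(">" + row.findtext("Name"))  # FASTA header line
        lines.append(row.findtext("Seq"))         # sequence line
    return "".join(line + "\n" for line in lines)


# Hypothetical query result with made-up identifiers and sequences.
sample = """<results>
  <row><Name>IPI00000001</Name><Seq>MKTAYIAK</Seq></row>
  <row><Name>IPI00000002</Name><Seq>MSTNPKPQR</Seq></row>
</results>"""

fasta = xml_to_fasta(sample)
```

The resulting text holds one header line starting with `>` and one sequence line per protein, which is the shape a sequence-matching tool would typically expect.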
24. DQP version 2.0 summary
- A WS wrapper is required for use with myGrid
- The WS wrapper was not included in a DQP 2.0 release; it can be obtained from the myGrid CVS
- The DQP WS wrapper must contact the GDQS factory, create instances and destroy instances
- Based on OGSI (Globus Toolkit 3.2) and OGSA-DAI release 4.0
- OGSA-DAI 7.0 is out soon, so a new DQP release is being developed
25. OGSA-DQP version 3.0
- Released around OGSA-DAI version 7.0
- Changes were required to shield DQP from the multiple platforms (WSRF and WS-I)
- Instead of services which extend OGSA-DAI, coordinators are now based on activities which can be installed on either type of OGSA-DAI service
- Evaluators do all their work behind the scenes; they are implemented as WS-I web services
26. Data services and data service resources
- OGSA-DAI supports interaction with data resources via perform documents
- A data service exposes a number of data service resources and is a point of contact for clients
- The data resource implementation contains code that can be used by activities to access a physical data resource
- In the case of DQP, the data resource implementation provides access to the distributed query processor
[Diagram: one data service exposes n data service resources; each data service resource is associated with one data resource and one data service resource configuration.]
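27. [Diagram: version 3.0 architecture, with the following annotations]
- The client sends perform documents (carrying the resource list and the query) to an OGSA-DAI data service (WSRF or WS-I) and receives results.
- The coordinator lives inside the OGSA-DAI data service: a DQP factory activity works with the DQP factory data service resource to create a DQP instance data service resource, which is then queried through an OQL query statement activity.
- The coordinator obtains schemas from OGSA-DAI services (over DBMS data) and WSDL from Web Services, and assigns query partitions to the evaluators (GQES).
- Coordinators and evaluators can handle concurrency.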
28. OGSA-DQP version 3.0 summary
- DQP is shielded from the WSRF/WS-I duality of OGSA-DAI
- There is now more flexibility when invoking DQP
- Coordinators and evaluators now support concurrency
- There are other performance enhancements
29. Future work
- DQP can currently query relational data sources only
- Web services can be invoked, but only simple parameter types are supported
- DQP will be extended to provide support for querying XML data sources
- Research on adaptivity (updating the query execution plan while the query is being executed) is ongoing
- Dynasoar project at Newcastle University
30. Conclusions
- OGSA-DQP is able to query distributed databases and parallelise the execution of the query
- Databases must be relational and wrapped by OGSA-DAI
- It can also invoke Web Services to analyse results
- It provides declarative support for data management and service orchestration