OGSA-DQP: current status and future direction

1
OGSA-DQP current status and future direction
  • Steven Lynden
  • Department of Computer Science
  • University of Manchester

2
Overview
  • OGSA-DQP is a service based distributed query
    processor
  • First release was in September 2003
  • Current (2nd) release is based on the Globus
    toolkit 3.2 and OGSA-DAI release 4.0
  • A 3rd release is out soon based on OGSA-DAI
    release 7
  • DQP is a myGrid component

3
People, places
  • University of Manchester
  • Tasos Gounaris, Steven Lynden, Alvaro Fernandes,
    Rizos Sakellariou, Norman Paton
  • University of Newcastle
  • Arijit Mukherjee, Jim Smith, Paul Watson
  • OGSA-DAI

4
Outline
  • Distributed query processing, OGSA-DQP aims and
    objectives
  • Background
  • Current release OGSA-DQP v 2.0
  • OGSA-DQP and MyGrid
  • Imminent release OGSA-DQP v 3.0
  • Future work

5
Distributed Query Processing
  • The user submits a single query referencing data
    stored at multiple sites.
  • The author of the query need not be aware of
    how/where data is stored.
  • Given two DBMSs and one analysis tool (e.g., a
    WS):
  • goTerm, held in a GO (Gene Ontology) database running as a
    remote MySQL DB, which has a service interface
  • protein, held in a GIMS Genome Warehouse running as a
    remote DB, which has a service interface
  • Blast (sequence alignment scoring)
  • We want to obtain alignment scores for a sequence
    against proteins of a certain kind

Example OGSA-DQP query written in OQL:
select p.proteinId, Blast(p.sequence)
from protein p, goTerm t
where t.termId = 'GO:0005942' and
p.proteinId = t.proteinId
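
To make the intent of the example query concrete, here is a minimal Python sketch of what it computes. Everything in it is an assumption for illustration: the rows are made-up in-memory stand-ins for the two remote databases, and `blast` is a placeholder that returns the sequence length rather than a real alignment score.

```python
# Made-up rows standing in for the remote goTerm and protein collections.
go_terms = [
    {"termId": "GO:0005942", "proteinId": "P1"},
    {"termId": "GO:0005942", "proteinId": "P3"},
    {"termId": "GO:0000001", "proteinId": "P2"},
]
proteins = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "MAAT"},
    {"proteinId": "P3", "sequence": "MLLSQ"},
]


def blast(sequence):
    """Placeholder for the remote Blast service (hypothetical): sequence length."""
    return len(sequence)


def run_query(proteins, go_terms, term_id):
    """select p.proteinId, Blast(p.sequence) for proteins joined to the given term."""
    return [
        (p["proteinId"], blast(p["sequence"]))
        for p in proteins
        for t in go_terms
        if t["termId"] == term_id and p["proteinId"] == t["proteinId"]
    ]


result = run_query(proteins, go_terms, "GO:0005942")
print(result)
```

In a real deployment the two collections live at different sites and Blast is a remote service, which is exactly the distribution the query author does not have to think about.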
6
Aims and objectives
  • To process distributed queries
  • To benefit from homogeneous access to
    heterogeneous data sources (OGSA-DAI)
  • To benefit from Grid abstractions for on-demand
    allocation of resources for required tasks
  • To provide transparent distribution and
    parallelism
  • To orchestrate the composition of data retrieval
    and analysis services

7
On the Grid
  • The Grid is a middleware for discovering and
    accessing computational resources.
  • The Open Grid Services Architecture (OGSA)
    recasts the Grid as a collection of Web Services.
  • Grid Services combine
  • Web Services for service description and
    invocation.
  • Grid middleware for computational resource
    description and utilisation.
  • Databases are made available through integration
    with other Grid services, and provision of
    standard interfaces.

8
Mediator / wrapper middleware
  • One approach to data integration is to use
    mediator/wrapper based middleware.
  • The wrappers reconcile differences and impose a
    global schema.
  • Mediators only need to interact with the
    wrappers

[Diagram labels: Query, Results, mediator, wrapper (x2), DBMS, data]
9
OGSA-DQP adopts mediator/wrapper architecture
  • OGSA-DQP can be seen as a mediator over OGSA-DAI
    wrappers
  • OGSA-DQP is a Grid service which utilises Grid
    infrastructure to execute queries which are
    compiled, scheduled and executed in parallel
  • In addition to OGSA-DAI wrapped DBMSs, Web
    Services may be invoked to analyse data

[Diagram labels: Query, Results, OGSA-DQP, OGSA-DAI (x2), DBMS, data]
10
OGSA-DQP is service based
  • Service based in two senses
  • Supports queries over data storage and analysis
    resources made available as services
  • The execution of distributed query plans is
    factored out to services
  • There are two types of DQP services
  • Coordinators interact with clients, compile and
    schedule query executions
  • Evaluators execute the query (retrieve data from
    sources, carry out joins etc.)

11
[Diagram labels: Client (query, results, resource list); Coordinator (parsing, single-node optimisation, partitioning, scheduling; get schema, schema, wsdl, Global schema); evaluators (x3), each receiving a query partition; OGSA-DAI, Web Service, DBMS, data]
12
Query compilation
  • Polar is used to compile and partition query
    execution plans.
  • Queries must be written in OQL (Object Query
    Language)
  • Consider the example query:
  • select p.proteinId, Blast(p.sequence)
  • from protein p, goTerm t
  • where t.termId = 'GO:0005942' and
  • p.proteinId = t.proteinId
  • goTerm: Gene Ontology database
  • protein: GIMS Genome Warehouse
  • Blast: sequence alignment scoring (Web Service)

13
Logical optimisation
  • Plan is expressed as a logical algebra
  • Multiple equivalent plans are generated
  • select p.proteinId, Blast(p.sequence)
  • from protein p, goTerm t
  • where t.termId = 'GO:0005942' and
  • p.proteinId = t.proteinId

reduce
  op_call (Blast)
    join (proteinId)
      reduce
        scan termId = 'GO:0005942' (goTerm)
      reduce
        scan (protein)
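
The idea that several logical plans are equivalent can be illustrated with a small Python sketch. It is not the Polar optimiser: the rows are made up, and `join` and `select` are hypothetical helpers. It shows that filtering goTerm before the join gives the same answer as filtering after it, while producing a smaller intermediate result.

```python
# Made-up rows standing in for the goTerm and protein collections.
go_terms = [
    {"termId": "GO:0005942", "proteinId": "P1"},
    {"termId": "GO:0005942", "proteinId": "P3"},
    {"termId": "GO:0000001", "proteinId": "P2"},
]
proteins = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "MAAT"},
    {"proteinId": "P3", "sequence": "MLLSQ"},
]


def join(left, right, key):
    """Nested-loop equi-join; each matching pair of rows is merged into one dict."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def is_wanted(row):
    return row["termId"] == "GO:0005942"


# Plan A: join everything, then apply the selection.
unfiltered_join = join(proteins, go_terms, "proteinId")
plan_a = select(unfiltered_join, is_wanted)

# Plan B: push the selection down onto goTerm, then join.
filtered_terms = select(go_terms, is_wanted)
plan_b = join(proteins, filtered_terms, "proteinId")

print(plan_a == plan_b, len(unfiltered_join), len(filtered_terms))
```

Both plans return the same rows, so an optimiser is free to pick whichever it estimates to be cheaper.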
14
Physical optimisation
  • Plan is expressed as a physical algebra
  • Plan is chosen by cost-ranking of equivalent plans

reduce
  op_call (Blast)
    hash_join (proteinId)
      reduce
        table_scan termId = 'GO:0005942' (goTerm)
      reduce
        table_scan (protein)
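
For readers unfamiliar with the hash_join physical operator, the following is a generic textbook hash join in Python, not OGSA-DQP's implementation. The function name, row layout and sample rows are assumptions made for the example.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Equi-join two lists of dict rows on `key` using an in-memory hash table."""
    table = defaultdict(list)
    for row in build_rows:  # build phase: index the (ideally smaller) input
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:  # probe phase: look up each row of the other input
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Made-up rows: goTerm already filtered on termId, protein scanned in full.
filtered_terms = [
    {"termId": "GO:0005942", "proteinId": "P1"},
    {"termId": "GO:0005942", "proteinId": "P3"},
]
proteins = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "MAAT"},
    {"proteinId": "P3", "sequence": "MLLSQ"},
]

joined = hash_join(filtered_terms, proteins, "proteinId")
print(joined)
```

Building the table on the filtered goTerm rows keeps the hash table small, which is the usual reason to choose the smaller input as the build side.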
15
Query partitioning
  • Plan is transformed into a parallel algebra
    (physical operators + data exchange)
  • Exchange operators are placed where data exchange
    must take place

reduce
  op_call (Blast)
    exchange
      hash_join (proteinId)
        exchange
          reduce
            table_scan termId = 'GO:0005942' (goTerm)
        exchange
          reduce
            table_scan (protein)
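
The placement of exchange operators can be sketched in Python as a tree rewrite. This is an illustration only: it assumes each operator has already been tagged with a hypothetical `partition` label saying which operators run together, and how OGSA-DQP actually decides those boundaries is not covered here. The `Op` class and the label values are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    partition: str  # hypothetical label: operators sharing it run together
    children: list = field(default_factory=list)


def add_exchanges(op):
    """Return a copy of the plan with an exchange on every partition boundary."""
    new_children = []
    for child in op.children:
        child = add_exchanges(child)
        if child.partition != op.partition:
            # Data must move between partitions here, so wrap the child.
            child = Op("exchange", child.partition, [child])
        new_children.append(child)
    return Op(op.name, op.partition, new_children)


def names(op):
    """Pre-order list of operator names, for inspection."""
    out = [op.name]
    for child in op.children:
        out.extend(names(child))
    return out


physical = Op("reduce", "p3", [
    Op("op_call", "p3", [
        Op("hash_join", "p2", [
            Op("reduce", "p0", [Op("table_scan", "p0")]),
            Op("reduce", "p1", [Op("table_scan", "p1")]),
        ]),
    ]),
])

parallel = add_exchanges(physical)
print(names(parallel))
```

With these labels the rewrite adds three exchanges: one above the join and one above each scan branch.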
16
Query scheduling
  • Allocate operators to evaluator nodes
  • A heuristic algorithm is used based on available
    resources, memory use and network costs
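
The slide does not spell out the scheduling heuristic, so the Python sketch below is only a generic greedy allocator in the same spirit, not OGSA-DQP's algorithm. The partition names, memory figures, evaluator names and network costs are all invented for the example.

```python
def schedule(partitions, evaluators, net_cost):
    """Greedily assign each partition to the cheapest evaluator with enough free memory."""
    free = dict(evaluators)  # evaluator name -> remaining memory (hypothetical units)
    assignment = {}
    # Place the most memory-hungry partitions first.
    for part in sorted(partitions, key=lambda p: p["memory"], reverse=True):
        candidates = [e for e in free if free[e] >= part["memory"]]
        if not candidates:
            raise RuntimeError(f"no evaluator can hold partition {part['name']}")
        # Prefer the lowest network cost from the partition's data source;
        # break ties by evaluator name so the result is deterministic.
        best = min(candidates, key=lambda e: (net_cost[(part["source"], e)], e))
        assignment[part["name"]] = best
        free[best] -= part["memory"]
    return assignment


partitions = [
    {"name": "scan_goTerm", "source": "go_db", "memory": 10},
    {"name": "scan_protein", "source": "gims_db", "memory": 20},
    {"name": "join_and_blast", "source": "gims_db", "memory": 60},
]
evaluators = {"eval_a": 64, "eval_b": 64}
net_cost = {
    ("go_db", "eval_a"): 1, ("go_db", "eval_b"): 5,
    ("gims_db", "eval_a"): 4, ("gims_db", "eval_b"): 2,
}

allocation = schedule(partitions, evaluators, net_cost)
print(allocation)
```

Here the large join partition takes the evaluator closest to its data, and the two scans fall back to the other evaluator because the first no longer has enough free memory.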

17
OGSA-DQP version 2.0
  • Version 2.0 is the current release
  • Based on OGSI (Globus toolkit 3) and OGSA-DAI
    release 4.0
  • Coordinators/evaluators are Grid services
  • Coordinators are mapped to Grid Distributed Query
    Services (GDQS), which extend OGSA-DAI data
    services
  • Evaluators are mapped to Grid Query Evaluation
    Services (GQES)
  • Factories are used to create services
  • GDQSF: GDQS Factory
  • GQESF: GQES Factory

18
OGSA-DQP version 2.0
  • The client submits an XML perform document containing an OQL
    query statement activity parameterised by the query, and
    receives the results
  • GDQS (extends OGSA-DAI GDS): imports schemas and compiles the
    query; exists for the lifetime of a query session (terminated
    by the client)
  • GQES: execute the query; exist for the lifetime of a query
  • Factories are how OGSA-DQP version 2.0 handles concurrency:
    the GDQS factory creates the GDQS instance and the GQES
    factories create GQES instances

[Diagram labels: Client, resource list, results, GDQS factory, GDQS, GQES factory (x3), GQES (x3), OGSA-DAI, Web Service, DBMS, data]

19
DQP and myGrid
  • myGrid middleware offers support for creating and
    managing the information from in silico
    experiments
  • Aims to avoid the need for repetitive manual use
    of bioinformatics tools
  • Programmatic access provided by Web Services
  • Automation requires a representation of the
    process achieved using workflows
  • Workflows represent a procedure as a set of
    processes and relationships between processes

20
DQP and myGrid
  • Taverna workbench is a GUI based application used
    to construct workflows
  • Workflow enactment engine executes the workflow
  • Data and metadata are stored using the myGrid
    Information Repository (mIR)
  • Much more including semantic services, event
    notification etc.

21
DQP and myGrid
  • OGSA-DQP has been developed as a myGrid component
  • Complex queries have potentially high response
    times, which DQP can address via parallelisation
  • A Web Service wrapper was created to allow
    OGSA-DQP to be invoked from within myGrid
  • A use-case involving DQP has been developed by
    the ISPIDER (http://www.ispider.man.ac.uk)
    project

22
ISPIDER use case
  • Involves genome-focused protein identification
  • PepMapper is a web service that takes mass
    spectrometry data produced by the digestion of a
    protein and matches it against the proteins in a
    sequence database
  • The biologist may know in advance a set of
    proteins that are relevant, e.g. proteins
    belonging to a particular family or domain
  • By searching over a smaller protein set, the
    identification experiment is more efficient
  • Implementation uses OGSA-DQP to reduce the size
    of the input data set

23
Workflow
OQL query used in the workflow:
select p.Name, p.Seq from p in db_proteinSequences
where p.OS = 'HomoSapiens'
  • workflow inputs: error, spots, OQL query
  • perform: DQP web service, which wraps DQP and queries IPI
    (International Protein Index)
  • save spots, save
  • convert: xml format -> fasta format
  • identify: PepMapper web service
  • out: workflow output
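
The convert step of the workflow turns query results into FASTA before they reach PepMapper. A minimal Python sketch of the FASTA side of that step follows. It assumes the results have already been parsed into (name, sequence) pairs, since the XML result format is not shown here, and the function name and sample records are invented.

```python
def to_fasta(records, width=60):
    """Render (name, sequence) pairs as FASTA text, wrapping sequences at `width`."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")  # header line for this record
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "".join(line + "\n" for line in lines)


# Made-up records standing in for the p.Name, p.Seq pairs returned by the query.
records = [("protA", "MKTAYIAKQR"), ("protB", "MSDNE")]
fasta = to_fasta(records, width=4)
print(fasta, end="")
```

The small `width` is only there to make the line wrapping visible in the example; 60 or 80 characters per line is more typical.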
24
DQP version 2.0 summary
  • WS wrapper required to use with myGrid
  • The WS wrapper was not included in a DQP 2.0
    release; it can be obtained from the myGrid CVS
  • The DQP WS wrapper must contact the GDQS factory,
    create instances and destroy instances
  • Based on OGSI (Globus toolkit 3.2) and OGSA-DAI
    release 4.0
  • OGSA-DAI 7.0 is out soon, so a new DQP release is
    being developed

25
OGSA-DQP version 3.0
  • Released around OGSA-DAI version 7.0
  • Changes required to shield DQP from the multiple
    platforms (WSRF and WS-I)
  • Instead of services which extend OGSA-DAI,
    coordinators are now based on activities which can
    be installed on either type of OGSA-DAI service
  • Evaluators do all their work behind the scenes;
    they are implemented as WS-I web services

26
Data services and data service resources
  • OGSA-DAI supports interaction with data resources
    via perform documents
  • A data service exposes a number of data service
    resources and is a point of contact for clients
  • The data resource implementation contains code
    that can be used by activities to access a
    physical data resource
  • In the case of DQP the data resource
    implementation provides access to the distributed
    query processor

[Diagram labels: Data service, Data service resource, Data resource, Data service resource configuration; multiplicities n, 1, 1, 1]
27
  • The client sends a perform document to an OGSA-DAI data
    service (WSRF or WS-I) and receives the results
  • Activities: DQP factory activity, OQL query statement
    activity
  • Coordinator: the DQP factory data service resource creates
    the DQP instance data service resource, which imports
    schemas and compiles the query
  • Evaluators (GQES): are assigned query partitions and execute
    the query
  • Coordinators and evaluators can handle concurrency

[Diagram labels: Client, resource list, query, Results, schema, wsdl, OGSA-DAI, Web Service, GQES (x3), DBMS, data]

28
OGSA-DQP version 3.0 summary
  • DQP is shielded from the WSRF/WS-I duality of
    OGSA-DAI
  • More flexibility when invoking DQP now
  • Coordinators and evaluators now support
    concurrency
  • There are other performance enhancements

29
Future work
  • DQP can currently query relational data sources
    only
  • Web services can be invoked, but only simple
    parameter types are supported
  • DQP will be extended to provide support for
    querying XML data sources
  • Research on adaptivity (updating the query
    execution plan while the query is being executed)
    is ongoing
  • Dynasoar project at Newcastle University

30
Conclusions
  • OGSA-DQP is able to query distributed databases
    and parallelise the execution of the query
  • Databases must be relational, and wrapped by
    OGSA-DAI
  • Can also invoke Web Services to analyse results
  • Provides declarative support for data management
    and service orchestration