Title: NeSC Data Projects and Initiatives
1NeSC Data Projects and Initiatives
- Dr. Dave Berry
- Research Manager
2Contents
- The Data Deluge
- Web Services
- The DAI vision
- The OGSA-DAI Project and GGF
- The OGSA-DAI Software
- Edikt
- Other relevant projects in the UK
3Acknowledgements
- This talk includes material prepared by
- The OGSA-DAI project
- The e-Diamond project
- The BRIDGES project
- The GGF OGSA Working Group
- and others
4The Data Deluge
- Entering an age of data
- CERN LHC will generate 1 GB/s, 10 PB/year
- VLBA (NRAO) generates 1 GB/s today
- Pixar generates 100 TB per movie
- Data stored in many different ways
- Relational databases
- XML databases
- Flat files
- Need ways to facilitate
- Data discovery
- Data access
- Data integration
Mont Blanc (4810 m)
Downtown Geneva
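The LHC figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and a 365-day year. Reading the gap between the two figures as a duty cycle is an inference made here, not something stated in the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, leap years ignored
GB_PER_PB = 1_000_000               # decimal units assumed


def petabytes_per_year(rate_gb_per_s: float, duty_cycle: float = 1.0) -> float:
    """Yearly volume in PB for a source running at rate_gb_per_s."""
    return rate_gb_per_s * SECONDS_PER_YEAR * duty_cycle / GB_PER_PB


def implied_duty_cycle(rate_gb_per_s: float, yearly_pb: float) -> float:
    """Fraction of the year the source would need to run to reach yearly_pb."""
    return yearly_pb / petabytes_per_year(rate_gb_per_s)


# 1 GB/s sustained all year is about 31.5 PB, so 10 PB/year corresponds
# to the detector recording for roughly a third of the year.
continuous_pb = petabytes_per_year(1.0)
duty = implied_duty_cycle(1.0, 10.0)
```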
5Astronomical Databases
Data and images courtesy Alex Szalay, Johns Hopkins
- Number and sizes of data sets as of mid-2002, grouped by wavelength
- 12-waveband coverage of large areas of the sky
- Total about 200 TB of data
- Doubling every 12 months
- Largest catalogues near 1 billion objects
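A doubling time of 12 months implies exponential growth from the mid-2002 total of 200 TB. The sketch below simply extrapolates those two numbers. It assumes the doubling rate stays constant, which the slide does not claim beyond the date shown.

```python
import math

BASE_TB = 200.0         # total as of mid-2002, from the slide
DOUBLING_MONTHS = 12.0  # doubling time, from the slide


def projected_tb(months_since_mid_2002: float) -> float:
    """Projected total in TB, assuming the doubling time stays constant."""
    return BASE_TB * 2 ** (months_since_mid_2002 / DOUBLING_MONTHS)


def months_until(target_tb: float) -> float:
    """Months after mid-2002 at which the projection reaches target_tb."""
    return DOUBLING_MONTHS * math.log2(target_tb / BASE_TB)


two_years_on = projected_tb(24)       # two doublings
months_to_petabyte = months_until(1000.0)
```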
6Bioinformatics Databases
PDB Content Growth
- Bibliographic (MedLine, ...)
- Amino Acid Seq (SWISS-PROT, ...)
- 3D Molecular Structure (PDB, ...)
- Nucleotide Seq (GenBank, EMBL, ...)
- Biochemical Pathways (KEGG, WIT)
- Molecular Classifications (SCOP, CATH, ...)
- Motif Libraries (PROSITE, Blocks, ...)
7Web Services
- Using the protocols and ideas that have made the web a success for humans
- And applying them to distributed programming
- HTTP
- Single networking port
- Autonomy and failure handling
- Open standards
- Tools and platforms
- Apache Axis
- WebSphere, .NET, Oracle Application Server, Sun ONE, ...
8From Browsing to Programming
              Browsing the web          Programming the web
Readers       People                    Software
Discovery     Google, Altavista, ...    UDDI, ...
Description   N/A                       WSDL
Operations    Get, post, ...            Service-specific
Protocol      HTTP                      SOAP over HTTP
Format        HTML, XHTML               XML Schema
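To make the "SOAP over HTTP" row concrete, here is a minimal sketch that builds a SOAP 1.1 request and the HTTP headers that would accompany it. Nothing is sent over the network. The service namespace `urn:example:catalogue`, the operation `findDataSets` and its `waveband` parameter are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:catalogue"                   # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> str:
    """Wrap a service-specific operation and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def http_headers(operation: str, payload: str) -> dict:
    """Headers for an HTTP POST carrying the envelope (SOAP 1.1 style)."""
    return {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f'"{SERVICE_NS}#{operation}"',
        "Content-Length": str(len(payload.encode("utf-8"))),
    }


request_xml = build_soap_request("findDataSets", {"waveband": "infrared"})
headers = http_headers("findDataSets", request_xml)
```

The operation name lives inside the message body rather than in the HTTP verb, which is the sense in which operations become "service-specific" when programming the web.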
9A Perspective on WS Specifications
10Open Grid Services Architecture
Access resource
Manage resource
Share resource
Continuous Availability
Applications on demand
Resources on demand
Secure and universal access
Global Accessibility
Business integration
Vast resource scalability
Web Services
Grid Protocols
The architecture of the Global Grid Forum
11GGF11 OGSA specification informational document
Cataloging
Provisioning
VO Mgmt
Integration
Policy Mgmt
Access
Context Services
Information Services
Data Services
Event Mgmt
Troubleshooting
Discovery
Logging
Infrastructure Services
Execution Mgmt Services
WSRF
WSN
WSDM
Job Mgmt
Execution Planning
Workflow Mgmt
Workload Mgmt
Application Mgmt
Naming
Resource Mgmt Services
Self Mgmt Services
Provisioning
Deployment
Configuration
Reservation
Security Services
Heterogeneity Mgmt
Authentication
Optimization
Authorization
Service Level Attainment
Integrity
QoS Mgmt
Boundary Traversal
12Data Access and Integration
- Web Services for querying and integrating structured data resources
- The foundation framework for
- Building tailored DAI applications
- Higher-level services
- Replication: data located in multiple locations
- Federation: composition of multiple sources
- Provenance: how was data generated?
13The OGSA-DAI Project
- Funded by the Grid Core Programme
- OGSA-DAI
- £3 million, 18 months, from Feb 2002
- Three major releases, three interim releases
- DAIT (DAI-Two)
- Keeps the OGSA-DAI brand name
- £1.5 million, 24 months, from Oct 2003
- Four major releases
14DAI in GGF and OGSA
- Data Access and Integration Services WG
- Strong involvement from OGSA-DAI members
- Standardise the interfaces: WS-DAI
- OGSA-DAI is a reference implementation
- Experience informing specification work
- OGSA WG Data Design Team
- Designing the data-oriented aspects of OGSA
- Created after GGF10 (March 2004)
- Led by NeSC
15OGSA Design Teams
Data Service design team
Information Service design team
EMS design team
Naming design team
OGSA-WG
Self Mgmt design team
Resource Mgmt design team
Security Service design team
Core (roadmap) design team
16Data Services design team
- Informal domain expert groups within OGSA
- May include co-chairs of other WG/RGs
- Output is included in OGSA specification
DAIS-WG
OGSA Data Service Design team
GSM-WG
GFS-WG
OGSA-WG
Telecons, F2F meetings
Info-D WG
ADF, OREP,
17OGSA v2 Document Deliverables
Root Documents
Glossary
Use case doc
Architecture v2
Design team Documents
Service descriptions
Scenarios
Working Group Specifications
GGF Recommendation documents
18How OGSA-DAI works
19OGSA-DAI compared to JDBC
- Language independence at the client end
- Platform independence
- Do not have to worry about connection technology, drivers, etc
- Can handle XML resources
- Can embed additional functionality at the service end
- Transformations
- Third party delivery
- Avoiding unnecessary data movement
- Provision of Metadata is powerful
- Usefulness of the Registry for service discovery
- Dynamic service binding process
20Future DAI Services
1a. Request to Registry for sources of data about x and y
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each with a set of queries, to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
Diagram components: Analyst, Client, Registry, Factory (Data Access and Integration master), several GDS and GDTS instances, XML database, relational database, data resources Sx and Sy
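The registry, factory and service interaction on this slide can be sketched in a few lines. This is an in-process illustration only: the class and method names are hypothetical rather than the OGSA-DAI API, an in-memory SQLite database stands in for data resource Sx, and the table rows are made-up sample data.

```python
import sqlite3


class GridDataService:
    """Stand-in for a GDS bound to one data resource."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, queries):
        # Steps 3a and 3c: a sequence of queries goes in,
        # a sequence of result sets comes back.
        return [self._connection.execute(q).fetchall() for q in queries]


class Factory:
    """Creates a GDS for a named resource and hands it back (steps 2a to 2c)."""

    def __init__(self, resources):
        self._resources = resources  # resource name -> database connection

    def create_service(self, resource_name):
        return GridDataService(self._resources[resource_name])


class Registry:
    """Maps a topic to the factories that can serve it (steps 1a and 1b)."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories.setdefault(topic, []).append(factory)

    def find_factories(self, topic):
        return list(self._factories.get(topic, []))


def demo():
    db = sqlite3.connect(":memory:")  # plays the role of resource Sx
    db.execute("CREATE TABLE gene (name TEXT, organism TEXT)")
    db.executemany("INSERT INTO gene VALUES (?, ?)",
                   [("Ace", "rat"), ("Ren", "mouse")])  # illustrative rows

    registry = Registry()
    registry.register("genes", Factory({"Sx": db}))

    factory = registry.find_factories("genes")[0]  # 1a, 1b
    gds = factory.create_service("Sx")             # 2a, 2b, 2c
    return gds.perform([                           # 3a, 3c
        "SELECT name FROM gene ORDER BY name",
        "SELECT COUNT(*) FROM gene",
    ])


results = demo()
```

In the architecture on the slide each of these hops is a web service call and the handles are service references. The sketch keeps only the order of the steps.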
21Activities are the drivers
- Express a task to be performed by a GDS
- Three broad classes of activities
- Statement
- Transformations
- Delivery
- Extensible
- Easy to add new functionality
- Does not require modification to the service interface
- Extensions operate within the OGSA-DAI framework
- Functionality
- Implemented at the service
- Works where the data is (no need to move data back)
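The extensibility point on this slide is that new activities plug in behind a fixed service operation. The sketch below shows that pattern with a name-to-function registry. The activity names and the `perform` signature are hypothetical, not taken from OGSA-DAI.

```python
import gzip
from typing import Callable, Dict

# Registry of activities available at the service, keyed by name.
ACTIVITIES: Dict[str, Callable[[bytes], bytes]] = {}


def activity(name: str):
    """Decorator that registers a function as a named activity."""
    def register(func):
        ACTIVITIES[name] = func
        return func
    return register


@activity("upperCase")
def upper_case(data: bytes) -> bytes:
    return data.upper()


@activity("gzipCompression")
def gzip_compression(data: bytes) -> bytes:
    return gzip.compress(data)


def perform(activity_name: str, data: bytes) -> bytes:
    """The single, fixed service operation: dispatch by activity name."""
    return ACTIVITIES[activity_name](data)


# An extension added later: perform() itself is not modified.
@activity("reverse")
def reverse(data: bytes) -> bytes:
    return data[::-1]
```

Because clients only ever call `perform`, adding `reverse` changes what the service can do without changing its interface.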
22OGSA-DAI Deck
23Building Applications
- Activities are grouped together
- Perform document
- Data can flow between activities
- Optimisation
- Avoids multiple message exchanges
- Can deliver to other GDSs
- Prerequisite for data integration
- Base middleware for projects requiring data access
- Some capability for data integration
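The idea of grouping activities into one request, with data flowing between them, can be sketched as follows. The request here is a plain Python list, and the activity names and stream names are hypothetical. Real OGSA-DAI perform documents are XML with their own schema, which is not reproduced here.

```python
import csv
import io
import sqlite3


def sql_query(ctx, streams, spec):
    """Statement activity: run a query and publish the rows on a named stream."""
    streams[spec["output"]] = ctx["db"].execute(spec["sql"]).fetchall()


def to_csv(ctx, streams, spec):
    """Transformation activity: turn rows from one stream into CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(streams[spec["input"]])
    streams[spec["output"]] = buffer.getvalue()


def deliver_to_response(ctx, streams, spec):
    """Delivery activity: place a stream in the response to the client."""
    ctx["response"][spec["input"]] = streams[spec["input"]]


HANDLERS = {
    "sqlQuery": sql_query,
    "toCsv": to_csv,
    "deliverToResponse": deliver_to_response,
}


def perform(db, document):
    """Run every activity in the request in order, all at the service."""
    ctx = {"db": db, "response": {}}
    streams = {}  # named intermediate results that never leave the service
    for spec in document:
        HANDLERS[spec["activity"]](ctx, streams, spec)
    return ctx["response"]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER, label TEXT)")
db.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "alpha"), (2, "beta")])

REQUEST = [
    {"activity": "sqlQuery", "sql": "SELECT id, label FROM sample ORDER BY id",
     "output": "rows"},
    {"activity": "toCsv", "input": "rows", "output": "csv"},
    {"activity": "deliverToResponse", "input": "csv"},
]
response = perform(db, REQUEST)
```

One request replaces three round trips, and only the final CSV text is returned. The intermediate `rows` stream stays where the data is.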
24Release 4, April 2004
- Provides Data Access components, an extensible framework for building applications and some integration components
- Built on top of Globus Toolkit 3.2
- Supports relational, XML and some files
- MySQL, Oracle, DB2, SQL Server, Postgres, Xindice, CSV
- Supports various delivery options
- SOAP, FTP, GridFTP, HTTP, files, email, inter-service
- Supports various transforms
- XSLT, ZIP, GZip
- Supports message level security using X.509 certificates
- Client Toolkit library for application developers
- GUI data browser (contributed by FirstDIG project)
- Separate Distributed Query Processing components
- Comprehensive documentation and tutorials in XHTML format
25Downloads by Release
2746 downloads (4.7 downloads a day)
26Downloads by country
792 registered users at 23/8/04
27Release 5, October 2004
- Re-engineered interface-independent core OGSA-DAI functionality.
- Improved dependability and security integration.
- New file data resources representing flat files queried using full text searches (e.g. EMBL format).
- Installation and Configuration Wizard, including all-in-one installer.
- Improved Data Browser which allows XPath querying.
- Set of standard benchmarks.
- JSP Quick View interface.
- Support for other databases (e.g. Access, eXist, HSQL).
28Release 6, April 2006
- Data Integration applications supporting identified scenarios
- OGSA-DQP as an integrated part of release
- Fully compliant JDBC Driver for OGSA-DAI
- Support for WS-Security implementations
- Support for stored procedures on all supported databases
- Improved support for different database specific SQL types
- SQL translation between vendor dialects for subset of queries
- Support for XQuery data resources
- We expect to comply with a version of the emerging DAIS specification at this release.
29Who is Using OGSA-DAI?
N2Grid (http://www.cs.univie.ac.at/institute/index.html?project-8080)
Bridges (http://www.brc.dcs.gla.ac.uk/projects/bridges/)
BioSimGrid (http://www.biosimgrid.org/)
INWA (http://www.epcc.ed.ac.uk/projects/inwa/)
BioGrid (http://www.biogrid.jp/)
AstroGrid (http://www.astrogrid.org/)
eDiaMoND (http://www.ediamond.ox.ac.uk/)
OGSA-DAI (http://www.ogsadai.org.uk)
GEON (http://www.geongrid.org/)
myGrid (http://www.mygrid.org.uk/)
MCS (http://www.isi.edu/deelman/MCS/)
ODD-Genes (http://www.epcc.ed.ac.uk/oddgenes/)
OGSA-WebDB (http://www.gtrc.aist.go.jp/dbgrid/)
GridMiner (http://www.gridminer.org/)
FirstDig (http://www.epcc.ed.ac.uk/firstdig/)
GeneGrid (http://www.qub.ac.uk/escience/projects.phpgenegrid)
IU RGRBench (http://www.cs.indiana.edu/plale/projects/RGR/OGSA-DAI.html)
30Project classification
31Edikt
Requirements analysis
Technology matchmaking
Edikt project
Gap filling
Rigorous engineering
- The team: 8 professional software engineers, support staff, project manager, commercialisation manager, architect, and SAB
- SHEFC funded research and development grant
- 3 years' funding, May 2002 to 2005
- A further 3 years' funding upon successful project and review
32ELDAS Data Access Service
Grid User1
Grid User2
Java Framework
Another (partial) implementation of the GGF
WS-DAI specifications
ELDAS
EJB - DAS
DB2 DB
MySQL DB
Xindice DB
Oracle 9i DB
- Implemented using Enterprise Java Beans
- Data Access Components interface to distinct DBMSs
- Accessible as a grid data service or a web data service
33BinX accessing legacy binary data
simulations
- The Problem
- Many binary data files
- Applications must know the data format
- Binary data formats are machine-specific
Binary Data File
Binary Data File
Binary Data File
- The Solution
- Write a stand-aside format description in XML
- Provide a library to
- Interpret the description
- Provide file access across different machines
- Build higher-level services
BinX Library
e-Science Application
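The stand-aside description idea can be illustrated with a small sketch: an XML document describes the record layout and byte order, and a library routine uses it to decode the binary file on any machine. The XML vocabulary below (`dataset`, `field`, `byteOrder`) is invented for this example and is not the BinX schema.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description format, not the real BinX language.
DESCRIPTION = """
<dataset byteOrder="bigEndian">
  <field name="step" type="int32"/>
  <field name="temperature" type="float64"/>
</dataset>
"""

TYPE_CODES = {"int32": "i", "float32": "f", "float64": "d"}
BYTE_ORDERS = {"bigEndian": ">", "littleEndian": "<"}


def compile_description(xml_text: str):
    """Turn the XML description into field names and a struct layout."""
    root = ET.fromstring(xml_text)
    layout = BYTE_ORDERS[root.get("byteOrder")]  # explicit order, no padding
    names = []
    for field in root.findall("field"):
        layout += TYPE_CODES[field.get("type")]
        names.append(field.get("name"))
    return names, struct.Struct(layout)


def read_records(data: bytes, xml_text: str):
    """Decode every fixed-size record in data according to the description."""
    names, layout = compile_description(xml_text)
    return [dict(zip(names, values)) for values in layout.iter_unpack(data)]


# Two records as a big-endian machine would have written them.
raw = struct.pack(">id", 1, 20.5) + struct.pack(">id", 2, 21.0)
records = read_records(raw, DESCRIPTION)
```

The application never hard-codes the layout. Changing the description file is enough to read a file written with a different byte order or field set.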
34Mammography
A prototype of a national database of
mammographic images in support of the UK breast
screening programme
Temporal mammography
Computer Aided Detection
Standard Mammo Format
Mammograms have different appearances, depending
on image settings and acquisition systems
3D View
36The BRIDGES Project
- Biomedical Research Informatics Delivered by Grid Enabled Services
- NeSC (Edinburgh and Glasgow) and IBM
- www.brc.dcs.gla.ac.uk/projects/bridges
- Supporting project for CFG project
- Generating data on hypertension
- Rat, Mouse, Human genome databases
- Variety of tools used
- BLAST, BLAT, Gene Prediction, visualisation, ...
- Variety of data sources and formats
- Microarray data, genome DBs, project partner research data, medical records, ...
- Aim is integrated infrastructure supporting
- Data federation
- Security
37BRIDGES
VO Authorisation
38INWA Project
- Innovation Node Western Australia
- Informing Business and Regional Policy: Grid-enabled fusion of global data and local knowledge
- Involved 10 partners (6 UK, 4 Australia)
- Aim
- Data mine commercially sensitive data
- Security an absolute MUST
- Employ Grid technologies
- Need access to data and computational resources
- OGSA-DAI
- Access data resources
- SunDCG's TOG (Transfer-queue Over Globus)
- Handles job submission to analyse microarray data
39INWA
40Further Information on OGSA-DAI
- The OGSA-DAI Project Site
- http://www.ogsadai.org.uk
- The DAIS-WG site
- http://cs.man.ac.uk/grid-db
- OGSA-DAI Users Mailing list
- users@ogsadai.org.uk
- General discussion on grid DAI matters
- Formal support for OGSA-DAI releases
- http://www.ogsadai.org.uk/support
- support@ogsadai.org.uk
- OGSA-DAI training courses