Title: 5th Feb 03
1 The Challenge of Data Integration Data Grid Discovery?
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk
22nd January 2003
2 Overview
- Essentials of e-Science
- Collaboration
- Resource Sharing
- Data Sharing
- Mutual Dependence
- Essentials of the Grid
- Distributed Virtual Machine?
- Essentials of Data Sharing
- Database Research did it?
- New Challenges
- Data Access Integration Building Bricks
- Band Wagon v Research Opportunity
- Thresholds, Visions and Questions
3 Essentials of e-Science
What happened?
4 UK e-Science Programme (1) 2001 - 2003
DG Research Councils
Grid TAG
Total US 200 M
E-Science Steering Committee
Director
Directors Awareness and Co-ordination Role
Directors Management Role
Generic Challenges EPSRC (15m), DTI (15m)
Over 3 years
Academic Application Support Programme Research
Councils (74m), DTI (5m) PPARC (26m) BBSRC
(8m) MRC (8m) NERC (7m) ESRC (3m) EPSRC
(17m) CLRC (5m)
80m Collaborative projects
Plus > 90M for HPCx, 6 years
Industrial Collaboration (40m)
5 UK e-Science
150 Million e-Science, 55 Million HPCx
From presentation by Tony Hey
6 UK e-Science Investment
National e-Science Centre
Edinburgh
Glasgow
Newcastle
Belfast
- Projects
- > 60 started
- > 30 proposed
- EU Projects
Manchester
Daresbury Lab
Cambridge
Oxford
Hinxton
RAL
Cardiff
London
Southampton
7 UK e-Science Programme (2) 2003 - 2005
DG Research Councils
Grid TAG
Total > 150 M
E-Science Steering Committee
Director
Directors Awareness and Co-ordination Role
Directors Management Role
Generic Challenges EPSRC (15m), DTI (15m)
Over 2 years
Academic Application Support Programme Research
Councils (74m), DTI (5m) PPARC (26m) BBSRC
(8m) MRC (8m) NERC (7m) ESRC (3m) EPSRC
(17m) CLRC (5m)
80m Collaborative projects
Industrial Collaboration (40m)
8 Essentials of e-Science
Why it's Happening
9 Collaboration Growing
What's New? Scale At a Distance Instantaneous
Dynamic
- Hard Problems, Multi-disciplinary, Expense
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Requires Trust
Scientists have done this for Centuries
10 Collaboration Growing
Text, digital media, structured, organised
curated data, annotation, computable models,
visualisation, shared instruments, shared
systems, shared administration,
- Data, Policy Digital Infrastructure Key
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Changing the ways Science is done
Nationally Internationally Distributed,
Routine, Daily, Automated,
That Requires very Significant Investment in
Digital Systems and their Support
11 Collaboration Growing
- Digital Communication, Metadata,
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Digital networks, digital work-places, digital
instruments,
Metadata, ontologies, standards, shared curated
data, shared codes,
Common platforms, shared software, shared
training,
Authentication, Authorisation, Accounting,
Provenance, Policies,
Shared Provision of Platform,
The Grid SHOULD make this much easier
by providing a common, supported high-level of
Software and Organisational infrastructure
12 Interdependence
- Science has relied on experiment and theory
- Simulation, Data Mining, Analysis
Plus Moore's Law ...
Experiment - Italy 1,500 AD
Computing Science at the Party
13 Interdependence
14 Database Growth
PDB protein structures
15 The Grid
What's happening?
16 Globus Toolkit History
Does not include downloads from NMI, UK eScience, EU Datagrid, IBM, Platform, etc.
GT 2.0 Released
Physiology of the Grid Paper Released
Anatomy of the Grid Paper Released
Significant Commercial Interest in Grids
The Grid: Blueprint for a New Computing Infrastructure published
NSF European Commission Initiate Many New Grid
Projects
Early Application Successes Reported
GT 1.0.0 Released
NASA begins funding Grid work, DOE adds support
17 Encompassing Vision
18 People & Industry
- Global Grid Forum
- GGF2 260 Jul 01
- GGF3 220 Oct 01
- GGF4 400 Feb 02
- GGF5 900 Jul 02
- GGF6 450 Oct 02
- GGF7 > 1000 Mar 03
- UK All Hands
- AHM02 350 Sep 02
- GlobusWorld
- 1 450 Jan 03
- IBM This week
- IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS
- Targets
- Financial, Life Sciences
- Automotive Aerospace
- Governments
- Partners
- Platform, DataSynapse
- Avaki, Entropia
- United Devices
- IBM last 20 months
- Leaders of OGSI
- Development teams
- Grid Jamboree
- GGF
This is a Global Phenomenon
19 The Grid
What is it?
20 High-Altitude Views
- A Rallying Cry
- Meeting a Hard Challenge requires Many Minds
- Operating Maintaining Infrastructure requires
Many Hands Many Companies
- Another Stab at Distributed Computing
- Hard Challenge, Intellectually and Practically Important
- Dependable Ubiquity over Heterogeneity and Fallibility
All Views Significant
- An Ambitious Virtual Machine
- Consistent large scale computational environments
- A Global Operating System
- Collective Resources, Common Management
21 An Architectural View
Application
Application Platform Developers
Common Application Platform for Group of
Applications
Grid Plumbing Security Infrastructure
Operations Teams
Providers
22 Open Grid Services Infrastructure
- Confluence of Web Services Grid
- Consistent Interface Description
- Based on WSDL 1.2 proposal
- Extend Properties
- Separate Binding from Interface
- Function Composition Inheritance
- Exploit WS Investment
- Grid Features
- Security
- Life-Time Management
- Service (state) Information via Data Elements
- Discovery
- Grouping
- Notification
- OGSI Version 1 Proposal at GGF7 (March 03)
Open, Strongly Led Design Process
Multiple Development Efforts
Open Source Alpha Release Jan 03
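A minimal sketch of three of the Grid Features listed above (Life-Time Management, Service Information via Data Elements, Notification), written as a single-process toy model. The class and method names are illustrative assumptions, not the operation names defined by the OGSI proposal, and time is passed in explicitly so the behaviour is deterministic.

```python
class GridServiceInstance:
    """Toy model of a stateful service instance (hypothetical names throughout)."""

    def __init__(self, now, lifetime):
        # Life-time management: the instance expires unless clients extend it.
        self.termination_time = now + lifetime
        self._service_data = {}   # named data elements describing service state
        self._subscribers = {}    # data element name -> list of callbacks

    def request_termination_after(self, now, lifetime):
        """Keep-alive: push the termination time later; never shortens it."""
        self.termination_time = max(self.termination_time, now + lifetime)
        return self.termination_time

    def is_alive(self, now):
        return now < self.termination_time

    def subscribe(self, name, callback):
        """Notification: register interest in changes to one data element."""
        self._subscribers.setdefault(name, []).append(callback)

    def set_service_data(self, name, value):
        """Update a data element and notify its subscribers."""
        self._service_data[name] = value
        for callback in self._subscribers.get(name, []):
            callback(name, value)

    def find_service_data(self, name):
        """Discovery of state: look up one data element by name."""
        return self._service_data.get(name)


events = []
service = GridServiceInstance(now=0, lifetime=60)
service.subscribe("status", lambda name, value: events.append((name, value)))
service.set_service_data("status", "running")
service.request_termination_after(now=50, lifetime=60)  # extends expiry to t=110
```

The design point is soft state: a client that disappears simply stops extending the lifetime, so the instance is reclaimed without explicit clean-up.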
23 Open Grid Services Architecture
- Ubiquitous Building Blocks
- Using OGSI Platform
- Open Extensible
- Encourage Refactoring Experiments
- Initially
- The Globus 2 model
- Except State Information now distributed
- Example New Features
- Global Name Mapping Service
- Replication and Caching Service
- Data Access Integration
- Metering, Logging, Authorisation, Charging,
Many Open Issues
24 Grid Challenge
- Balancing Direct Access to the Platforms with Abstraction Virtualisation
- Developers often have exploitable application knowledge
- Automation necessary and helpful
- Interface matching, operation validation,
- Optimisation at many scales
- There isn't enough effort to develop Languages Abstractions
Needs CS Research!
25 Data Sharing
What's needed?
26 Data Integration
Repeat until Hypothesis Tested
Scientist with Idea
Repeat for next Hypothesis
27 Wellcome Trust Cardiovascular Functional Genomics
28 OGSA-DAI Partners
IBM USA
EPCC NeSC
Glasgow
Newcastle
Belfast
Manchester
Daresbury Lab
Cambridge
Oxford
EPCC & NeSC
IBM UK
IBM USA
Manchester e-SC
Newcastle e-SC
Oracle
Hinxton
RAL
Cardiff
London
IBM Hursley
Southampton
3 million, 18 months, started February 2002
29 OGSA-DAI: Data Access and Integration for the New Grid
30 DAI Key Services
- GridDataService (GDS): Access to data and DB operations
- GridDataServiceFactory (GDSF): Makes GDS and GDSF
- GridDataServiceRegistry (GDSR): Discovery of GDS(F) and Data
- GridDataTranslationService (GDTS): Translates or Transforms Data
- GridDataTransportDepot (GDTD): Data transport with persistence
Integrated Structured Data Transport
Relational and XML models supported
Role-based Authorisation
Binary structured files (later)
31 DAI Architecture
32
1a. Request to Registry for sources of data about x
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
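The Registry, Factory and GDS interaction on this slide can be sketched in a single process. This is a toy model under stated assumptions, not the OGSA-DAI API: the method names, the in-memory SQLite database and the "proteins" topic are all hypothetical, and real handles would be network references rather than Python objects.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Toy GDS: manages access to one database and returns results as XML."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        cursor = self._connection.execute(sql, params)  # 3b: GDS talks to the DB
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_el, name).text = str(value)
        return ET.tostring(root, encoding="unicode")    # 3c: results as XML


class GridDataServiceFactory:
    """Toy GDSF: creates a GDS per request (2b) and hands it back (2c)."""

    def __init__(self, open_connection):
        self._open_connection = open_connection

    def create_service(self):
        return GridDataService(self._open_connection())


class GridDataServiceRegistry:
    """Toy GDSR: maps a topic to the factories that can serve it."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories.setdefault(topic, []).append(factory)

    def find_factories(self, topic):
        return list(self._factories.get(topic, []))     # 1b: factory handles


def open_protein_db():
    # Hypothetical data source, invented for this illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT, name TEXT)")
    conn.execute("INSERT INTO protein VALUES ('P1', 'myoglobin')")
    return conn


registry = GridDataServiceRegistry()
registry.register("proteins", GridDataServiceFactory(open_protein_db))

factory = registry.find_factories("proteins")[0]        # 1a, 1b
gds = factory.create_service()                          # 2a, 2b, 2c
xml_result = gds.perform(                               # 3a, 3b, 3c
    "SELECT id, name FROM protein WHERE id = ?", ("P1",)
)
print(xml_result)
```

The client never opens the database itself; it only ever holds the handles returned in steps 1b and 2c.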
33
1a. Request to Registry for sources of data about x and y
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
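One way to picture "access and integration" across two databases is a join performed outside either DBMS. The sketch below is an assumption-laden illustration: the two in-memory SQLite sources, their schemas and the shared protein_id key are hypothetical, and the slide does not say how the integration is actually carried out.

```python
import sqlite3


def make_db(schema, insert, rows):
    """Build a small in-memory database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn


# Hypothetical source "x": which gene codes for which protein.
source_x = make_db(
    "CREATE TABLE gene (gene_id TEXT, protein_id TEXT)",
    "INSERT INTO gene VALUES (?, ?)",
    [("G1", "P1"), ("G2", "P2")],
)

# Hypothetical source "y": protein names, held by a different provider.
source_y = make_db(
    "CREATE TABLE protein (protein_id TEXT, name TEXT)",
    "INSERT INTO protein VALUES (?, ?)",
    [("P1", "myoglobin"), ("P3", "insulin")],
)


def integrate(conn_x, conn_y):
    """Hash join on protein_id, done by the integrating service itself."""
    names = dict(conn_y.execute("SELECT protein_id, name FROM protein"))
    joined = []
    for gene_id, protein_id in conn_x.execute(
        "SELECT gene_id, protein_id FROM gene ORDER BY gene_id"
    ):
        if protein_id in names:  # keep only keys present in both sources
            joined.append((gene_id, protein_id, names[protein_id]))
    return joined


integrated = integrate(source_x, source_y)
```

Where the join runs (at the client, at one source, or at a third site) is exactly the kind of optimisation question the later slides raise.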
34 Biomedical (or ANY) Data
- Opportunities
- Global Production of Published Data
- Volume? Diversity?
- Combination -> Analysis -> Discovery
- Challenges
- Data Huggers
- Meagre metadata
- Ease of Use
- Automated, optimised integration
- Traceability, Dependability
- Opportunities
- Specialised Indexing
- Structurally varied replication
- Consistent Structured Universe of Discourse
- Data Computation Integration
- Challenges
- Approximate Matching
- Multi-scale optimisation
- Bad habits / industrial structures
- Safety and Multi-scale optimisation
35 Data Integration Challenges
- High-Level Languages
- Describing the Data Extraction Recipes
- Describing the Sources Components
- Metadata that drives automation validation
- Mobility
- Code Data
- Integrating Existing DB technology
- Moving the DBMS to the Grid context
- New Optimisation Challenges
- Data Computation Storage Movement
- Shared Distributed Annotation Systems
- How to Reference
- Provenance Acknowledgement
36 What Do We Do Now?
Address the Challenges?
37 Challenges
- A Programming Development Model
- Dependability at this Scale
- Foundations for Trust
- Raising the Level of Automation
- Supporting New Forms of
- Collaboration
- Data
Opportunities for All
38 Questions & Answers