Title: High Performance Data Management
1 High Performance Data Management
- Omer Rana
- Cardiff University and
- Welsh E-Science Centre
- o.f.rana_at_cs.cf.ac.uk
2 From Michael Lesk
Burned manuscript from British Library
3 From Michael Lesk
4 From Michael Lesk
George Washington manuscript
5 From Michael Lesk
Keio University - Gutenberg Bible
6 From Michael Lesk
International Dunhuang project cave painting
7 From Michael Lesk
Seismic activity IRIS data base
8 From Michael Lesk
9 From Michael Lesk
Alcohol dehydrogenase
Molecule of the month, Protein Data Bank
10 From Michael Lesk
CalFlora database rhododendron
11 From Michael Lesk
Sky Sloan Digital Sky Survey
12 From Michael Lesk
MRI scan, UCLA
13 From Michael Lesk
East coast dolphin database, Eckerd College
(Kelly Debure)
14 From multimap.com
15 Synthetic Aperture Radar
Uses:
- Vegetation type, extent and deforestation
- Soil erosion
- Soil moisture content
- Archaeological investigations
- Ocean dynamics, wave and surface wind speeds and direction
- Volcanic and tectonic activity
Roy Williams (Caltech)
16 Applications: Detection and monitoring of oil spills
Roy Williams (Caltech)
17 Archaeology
SAR sees new archaeology beneath forest cover
Angkor Wat, Cambodia
- Vast complex of more than 60 temples
- Spiritual center for the Khmer people
- 9th century
Roy Williams (Caltech)
18 From Michael Lesk
Human motion digital library, Jezekiel Ben-Arie
19 From Michael Lesk
CT scans of crocodile skulls, Tim Rowe (U Texas)
20 From Michael Lesk
Beauvais cathedral, Peter Allen, Columbia
21 From Michael Lesk
Forma Urbis Romae, Marc Levoy, Stanford
22 From Michael Lesk
Image searching: Jitendra Malik, David Forsyth
23 From Michael Lesk
3-D search: Tom Funkhouser, Princeton
25 Drivers
- Survey of needs - led by some questions
- Grid based infrastructure has become an important environment -- how do we utilise this?
- What are specific manufacturers offering?
- BlueArc's FPGA based storage
- What is available now - and what is (ideally) needed at all layers of the data management hierarchy?
- Currently available tools
- Application demands
26 Server Storage Capacity Requirements
28 Technologies
- Three core technologies
- Object based representation
- XML for metadata representation and management
- Object distribution and Peer-2-Peer systems (JXTA, Gnutella, Freenet)
- Related areas
- Digital Libraries
- Persistent Archives
- Data Mining environments
29 Wider Perspective: Data Management
- Data Preparation, Format, Fusion: Data Interoperability and Integration (Quality) Problem (Meta(n) Data)
- Data Storage: I/O Problem (hierarchic)
- Data Mining and Knowledge Discovery: Intelligence Problem (Neural nets, Genetic Algorithms, Rules)
- Query Estimation and Optimisation: estimate cost of query based on data gathered from user
- Data Exploration and Visualisation: Navigation and Interaction Problem
- Role of Standards
30 Services
- Requirement for
- Local services
- directly supported on resource or via a proxy
- Must be implemented on all Grid-enabled resources
- Resource dependent
- Global services
- Shared across resources
- May be subscribed to
- May be integrated with other such services
31 Global Services
- Data Access
- Primitive operations (read, write) supported with location management
- Service should provide hooks to such locally supported functions
- Storage management to support multiple access (e.g. maintaining a container in cache if multiple accesses are being requested)
- Location Transparency
- Support data set discovery
- Keep location of data source independent of access method
- Support for a global namespace
- Support logical view of data set -- and maintain independence from physical location of data source
- Support logical factoring of data sets into containers - and subsequently, federations
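The location-transparency points above can be illustrated with a small catalogue that maps logical names to physical replicas. This is only a sketch under stated assumptions: the class names, host names and paths are hypothetical and do not come from any system named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a data set (all fields are illustrative)."""
    host: str
    path: str
    protocol: str  # e.g. "file" or "gridftp"


@dataclass
class LogicalCatalogue:
    """Maps logical names to replicas and groups names into containers."""
    replicas: dict = field(default_factory=dict)    # logical name -> [Replica]
    containers: dict = field(default_factory=dict)  # container -> {logical names}

    def register(self, logical_name, replica, container=None):
        self.replicas.setdefault(logical_name, []).append(replica)
        if container is not None:
            self.containers.setdefault(container, set()).add(logical_name)

    def resolve(self, logical_name):
        """Return every known physical location for a logical name."""
        return list(self.replicas.get(logical_name, []))

    def discover(self, container):
        """Data set discovery: sorted logical names held in a container."""
        return sorted(self.containers.get(container, set()))


catalogue = LogicalCatalogue()
catalogue.register(
    "/sky/survey/field-001",
    Replica("store-a.example.org", "/disk1/f001.fits", "file"),
    container="sky-survey",
)
catalogue.register(
    "/sky/survey/field-001",
    Replica("store-b.example.org", "/tape/f001.fits", "gridftp"),
    container="sky-survey",
)
catalogue.register(
    "/sky/survey/field-002",
    Replica("store-a.example.org", "/disk1/f002.fits", "file"),
    container="sky-survey",
)

# The client asks for a logical name and never sees hosts or paths directly.
locations = catalogue.resolve("/sky/survey/field-001")
```

The client code depends only on the logical name, so a replica can move or be added without changing how the data set is addressed.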
32 Global Services 2
- Security
- System/User access without account on remote source
- Access control shared across collections/aggregations
- Ownership by collection objects, rather than individual users
- Access control catalogues to separate logical access control mechanisms, independent of any particular resource
- Persistence and Replication
- Access via unique identifier independent of physical location of data set/collection (not possible with URIs or PURLs)
- Must be automatically supported via an event service
- Requires a global namespace
- Error Handling
- Distinguish between errors from individual resources and those from data collections/aggregations
- Error handling supported via global namespace and event service
- Error reporting to resource and client requesting service
- Management Services
33 Global Services 3
- Additional management services provided globally can include
- Check-pointing and state management service
- Data migration service to facilitate logical collections
- Container management
- Support for collection aggregation
- Transformation between data supported through languages such as XSLT
- XML useful for encoding data
- Must be able to cover both ASCII and Binary data (BinX (NeSC), VOTable (Astronomy))
34 XPath and XSLT
- XSL
- Extensible Stylesheet Language
- XSLT
- Extensible Stylesheet Language Transformations
- Output (HTML, LaTeX, Excel, ...)
35 Data Storage and Access
- Policy (division between services)
- automated (resource supported) vs. user implemented.
- Must be exposed to the user
- Operations (minimal set supported)
- Access operations (read, write)
- Discovery support (address lookup, access properties)
- Exception handling
- State Management (transaction support)
- State recording within resource or via a checkpoint service
- Mechanism
- Actual implementation of local services
- Direct access to such mechanisms via resource metadata (by global services)
- Ability to support multiple mechanisms in same resource
- Structure
- physical organisation of resource or its contents (disks, number of heads, access/transfer rates etc)
- Support for external manipulation
36 Minimum Unit of transfer
- Unit of transfer based on access patterns and data structures
- Array based access in Scientific Computing
- Random access in Business Computing
- Support for block, cyclic, or irregular array strides (e.g. from Fortran programs)
- Resource type also determines minimum unit
- File system (NFS or AFS)
- HPSS or DPSS or network caches
- Structured databases (Objectivity, Oracle)
- Minimum unit of transfer must be made explicit
- Support for collections/aggregations of minimal units
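To make the block and cyclic access patterns concrete, here is a minimal sketch of which array indices each of several readers would request. The function names are hypothetical, and plain Python lists stand in for the Fortran arrays mentioned on the slide.

```python
def block_indices(n, parts, p):
    """Indices of an n-element array owned by part p under a block distribution."""
    size = -(-n // parts)  # ceiling division: elements per block
    return list(range(p * size, min((p + 1) * size, n)))


def cyclic_indices(n, parts, p):
    """Indices owned by part p under a cyclic (round-robin) distribution."""
    return list(range(p, n, parts))


def strided_read(data, start, stop, stride):
    """Read every stride-th element, e.g. one column of a row-major matrix."""
    return data[start:stop:stride]


# A 3 x 4 matrix stored row by row as 12 consecutive values.
matrix = list(range(12))
column_1 = strided_read(matrix, 1, len(matrix), 4)

first_block = block_indices(10, 3, 0)
last_block = block_indices(10, 3, 2)
second_cyclic = cyclic_indices(10, 3, 1)
```

A block reader asks for one contiguous range, while a cyclic or column reader touches many small, widely spaced pieces, which is why the slide asks for the minimum unit of transfer to be made explicit.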
37 Data Formats
- Need for common data formats to support cross-domain analysis and data sharing
- Automated annotation
- of experimental results for analysis
- of stages of analysis for management tools
- Support for data fusion and quality management
- However,
- Unlikely to happen
- Unlikely to be ratified through standards
- Unlikely to be accepted by everyone
- Compromise?
- Define points of sharing rather than actual data formats
- Support ease of exchange between formats, rather than agree on specific formats
- Tried before: Ontolingua (Stanford), DAML (DARPA, Maryland)
- Can we have a Grid Ontology for Data Management? (OWL, OiL)
38 Metadata Management
- Separate Content from Structure
- Re-purpose data (supports sharing) - via catalogues
- Used for data integration
- DataCutter Project
- Could represent
- Scheme for locating data
- Properties of data resource externally visible
- security privileges (access rights)
- Content types and structure (relationships)
- Content Structure (Relationships) to support semantic interoperability
- semantic/functional
- spatial/structural
- temporal/procedural
39 Data Processing
- Processing Characteristics
- Well defined work flow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive
40 Analysis and Interpretation
- Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill down capability
41 Analysis and Interpretation
- Conclusions/Inferences
- Descriptions
- Trends
- Correlations
- Relationships
- Analysis and Interpretation Characteristics
- Highly dynamic work flow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation
42 Metadata Requirements
- Technical Metadata
- Direct referencing - Physical location and data schema/structure
- Data currency/status - version, time stamping
- Accreditation/Access permissions - Ownership (Dublin Core)
- Query time/Governance - data volume, no. of records, access paths
- Contextual Metadata
- Logical referencing - physical data semantic/syntactic ontologies
- Lexical translation - Thesaurus, ontological mapping
- Named derivations (summarisations)
- Scope of Requirements
- All science communities
- Related to provenance
43 Metadata Requirements
- Data Versioning
- Distinguish latest/agreed version of data
- Maintain history record of change
- Synchronise and mirror replicated data
- Distinguish shared personal interpretations and/or annotations
- Provenance
- Record of data processing: calibration, filtering, transformation
- Record of workflow: methods, standards and protocols
- Reasoning: evidential justification for inferences and conclusions
- Scope of Requirements
- All science communities
- Includes Technical and Contextual Metadata
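The versioning and provenance requirements above could be captured by a record like the following. It is a minimal sketch with hypothetical class and field names, not a schema from any standard or project mentioned in these slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessingStep:
    """One provenance entry: what was done, how, and why."""
    operation: str           # e.g. "calibration", "filtering", "transformation"
    method: str              # workflow method, standard or protocol used
    justification: str = ""  # evidential reasoning behind the step


@dataclass
class DataVersion:
    number: int
    steps: tuple          # full processing history up to this version
    agreed: bool = False  # True once the community accepts this version


@dataclass
class VersionedDataSet:
    name: str
    versions: list = field(default_factory=list)

    def derive(self, step, agreed=False):
        """Create a new version whose history extends the previous one."""
        previous = self.versions[-1].steps if self.versions else ()
        version = DataVersion(len(self.versions) + 1, previous + (step,), agreed)
        self.versions.append(version)
        return version

    def latest_agreed(self):
        """Most recent version flagged as agreed, or None if there is none."""
        for version in reversed(self.versions):
            if version.agreed:
                return version
        return None


dataset = VersionedDataSet("seismic-run-7")  # hypothetical data set name
dataset.derive(ProcessingStep("calibration", "instrument response removal"), agreed=True)
dataset.derive(ProcessingStep("filtering", "band-pass filter", "personal interpretation"))

history = [step.operation for step in dataset.versions[-1].steps]
agreed_version = dataset.latest_agreed()
```

Each version carries the whole chain of steps that produced it, so a personal interpretation (version 2 here) stays distinguishable from the latest agreed version (version 1).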
44 Provenance
- When you see some data on the Web, do you know
- where it came from?
- why it is there?
- This information (provenance) is typically lost in the process of copying/transcribing/transforming databases
- Loss of provenance is an acute problem in some scientific databases
- Especially relevant when combining data from multiple sources
- The Web has lots of stuff (50 TB or so). Digital libraries are introducing new kinds of collections, and often try to maintain metadata control, but often still have little control over the ultimate content.
45 Standards at different levels
- IEEE Open Storage Systems Interconnection (OSSI)
- derived from IEEE Mass Storage Reference Model (1980s)
- Low level detail of storage systems: media, drive technology, structure of media
- Meant to facilitate mechanism sharing between storage resources
- ISO Open Archival Information System (OAIS)
- Support for managing persistent archives
- Support for data submission, recording, ingestion, encapsulation of data objects with attributes and data export
- Involves the human in the loop (this is seen as an important part)
- ISO 13250 Topic Maps
- Standard notation for structure of information sources (topics) and relationships between topics
- Interrelated docs supporting this representation: Topic Maps
- Can relate topics via associations, groups, similarities
46 A Common Infrastructure
[Diagram labels: Application; Data Cutter; Local Services; Collect./Aggreg.; Storage Resource Broker; Global Services; Globus (Legion); P2P Infrastructure; Client/Server Infras.; Resources]
47 Data Cutter
- Provides support for managing distributed, archived data sets
- Extends the Active Data Repository to
- support range queries
- user defined filters on data
- data processing at source
- Uses the abstraction of a filter to support distributed processing and management of data via common services
- Also supports data division based on range queries - uses a hierarchical indexing scheme
- Primary aim is to support
- distributed data sets organised as collections
- where data sets can be generated by distributed platforms
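The combination of range queries and user defined filters can be sketched with Python generators. This does not use DataCutter's real API: the function names and the chunk layout are assumptions for illustration, and a linear scan stands in for the hierarchical index.

```python
def range_query(chunks, lo, hi):
    """Yield chunks whose interval [start, end] overlaps the query [lo, hi].

    A linear scan is used here; a hierarchical index would prune the
    search instead of visiting every chunk.
    """
    for chunk in chunks:
        if chunk["start"] <= hi and chunk["end"] >= lo:
            yield chunk


def clip_filter(stream, lo, hi):
    """User defined filter: keep only the values inside the query range."""
    for chunk in stream:
        yield [value for value in chunk["values"] if lo <= value <= hi]


def sum_filter(stream):
    """User defined filter: reduce each chunk near the source to one number."""
    for values in stream:
        yield sum(values)


# Hypothetical archived data set, split into chunks with known bounds.
chunks = [
    {"start": 0, "end": 9, "values": [1, 5, 9]},
    {"start": 10, "end": 19, "values": [12, 15]},
    {"start": 20, "end": 29, "values": [21, 28]},
]

# Filters are chained; only the small per-chunk sums reach the client.
pipeline = sum_filter(clip_filter(range_query(chunks, 8, 15), 8, 15))
results = list(pipeline)
```

Because every stage is a generator, chunks flow through one at a time, and the reduction happens before any data would need to cross the network.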
48 From Reagan Moore
SDSC Storage Resource Broker Meta-data Catalog Common APIs
[Diagram labels: Application; Linux I/O; OAI; Access APIs; DLL / Python; Java, NT Browsers; GridFTP; Consistency Management / Authorization-Authentication; Prime Server; Logical Name Space; Latency Management; Data Transport; Metadata Transport; Storage Abstraction; Catalog Abstraction; Databases: DB2, Oracle, Sybase; Servers; HRM]
49 Spitfire (CERN)
Uniform access to RDBMS
- Aimed at being a middleware for access to relational databases
- Three SOAP services
- Base service for standard operations,
- Admin service for administrative access,
- Info service for information on the database and its tables
50 Grid Database Service Specification Principles
From Amy Krause et al. (GGF6)
- Provide service-based access to existing database systems in the context of Grid Computing.
- Be orthogonal to the Grid authentication and authorization mechanisms
- Accommodate diverse database paradigms within a consistent framework.
- Accommodate diverse database metadata
- Support higher-level information-integration and federation services.
- Adopt the document approach to service description
- Defined semi-formally (WSDL and English text)
51 Grid Database Service Port Types
- Mandatory
- GridDataService.
- GridService.
- Optional
- GridDataTransport.
- NotificationSource.
[Diagram labels: Grid Data Service; GridService Port; Find Service Data; Client (Application or Federator); GridDataService Port; Query Specification; Transport Specification; GridDataTransport Port]
52 Grid Data Service Initialization
[Diagram labels: Grid Service Registry; Grid Data Service Factory; Create Grid Data Service; Find Factory; Client (Application or Federator); Grid Data Service; Database; Data Request]
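The initialization sequence in this diagram (find a factory in the registry, have the factory create a data service, then send a data request) can be mimicked in-process. This is a sketch under strong assumptions: the real specification describes WSDL-defined services, whereas the classes, method names and SQLite database here are hypothetical stand-ins.

```python
import sqlite3


class GridDataService:
    """Stand-in for a data service bound to a single database."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, query, parameters=()):
        """Data request: run a query and return all rows."""
        return self._connection.execute(query, parameters).fetchall()


class GridDataServiceFactory:
    """Creates data service instances on request."""

    def __init__(self, connect):
        self._connect = connect  # callable that returns a database connection

    def create_service(self):
        return GridDataService(self._connect())


class GridServiceRegistry:
    """Lets clients find a factory by name."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def find_factory(self, name):
        return self._factories[name]


def open_demo_database():
    """Hypothetical database: an in-memory SQLite table of observations."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE observations (id INTEGER, band TEXT)")
    connection.executemany(
        "INSERT INTO observations VALUES (?, ?)",
        [(1, "radio"), (2, "infrared")],
    )
    return connection


registry = GridServiceRegistry()
registry.register("observations", GridDataServiceFactory(open_demo_database))

# Client side: find factory, create service, send data request.
factory = registry.find_factory("observations")
service = factory.create_service()
rows = service.perform("SELECT id, band FROM observations ORDER BY id")
```

The client only ever holds the registry name and the service handle; which database sits behind the service is the factory's concern.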
53 Montage - Custom Image Mosaics
- http://montage.ipac.caltech.edu
- User specified size, WCS projection, coordinates, spatial sampling, rotation
- Rectification of backgrounds in images
- Supports drizzle algorithm
- Delivery
- Semi-annual deliveries from Feb 2003
- Final Delivery Jan 2005
- Available for download
- Science Drivers
- Science Grade Images
- Analyze diverse images as if part of same multi-wavelength image (radio, infrared, optical etc)
54 Montage
- Combine data from different astronomers (using
different instruments)
57 http://www.neurogrid.net/
58 http://www.neurogrid.net/
59 OceanStore (Berkeley)
[Diagram labels: data source; data plane; Web content server; network plane]
60 OceanStore: Goal and Challenges
Provide content distribution to clients with good Quality of Service (QoS) while retaining efficient and balanced resource consumption of the underlying infrastructure
- Dynamic choice of number and location of replicas
- Clients' QoS constraints
- Servers' capacity constraints
- Efficient update dissemination
- Delay
- Bandwidth consumption
- Scalability: millions of objects, clients and servers
- No global network topology knowledge
61 Dynamic Replica Placement: naïve
[Diagram labels: data plane; s; c; network plane; Tapestry mesh]
62 Dynamic Replica Placement: naïve
[Diagram labels: data plane; parent candidate; s; proxy; c; network plane; Tapestry mesh; Tapestry overlay path]
63 Dynamic Replica Placement: smart
[Diagram labels: data plane; client child; s; parent; proxy; sibling; c; server child; network plane]
64 Dynamic Replica Placement: smart
- Aggressive search
- Lazy placement
[Diagram labels: data plane; parent candidates; client child; s; parent; proxy; sibling; c; server child; network plane]
65 HiveCache from MojoNation
- Utilises free disk space to undertake backup -- builds a dynamic RAID network
- Disk space monitored to determine how much to allocate (via a local agent)
- File broken into pieces along with error correction information and encryption
- Data highly replicated
- Little unique info in a company PC
- common apps (Word, Excel etc)
- operating system files, utilities
- Personal data, however, is not very large
- Uses SHA1 hash algorithm to map each file fragment
- acts as a unique key to locate the data fragment
- Request to retrieve a file is mapped to machines -- which retrieve file in parts in parallel
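The idea of using a SHA1 hash as the key for each file fragment can be sketched as follows. This is not HiveCache's implementation: the fragment size, the modulo rule for choosing a peer, and the dictionaries standing in for peer disks are all assumptions, and error correction, encryption and parallel retrieval are left out.

```python
import hashlib


def split_into_fragments(data, size):
    """Break a byte string into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fragment_key(piece):
    """SHA-1 digest of the fragment, used as its lookup key."""
    return hashlib.sha1(piece).hexdigest()


def choose_peer(key, peers):
    """Hypothetical placement rule: key value modulo the number of peers."""
    return peers[int(key, 16) % len(peers)]


def store(data, size, peers):
    """Store every fragment on a peer; return the ordered list of keys."""
    manifest = []
    for piece in split_into_fragments(data, size):
        key = fragment_key(piece)
        choose_peer(key, peers)[key] = piece
        manifest.append(key)
    return manifest


def retrieve(manifest, peers):
    """Fetch fragments by key, check each against its hash, and reassemble."""
    parts = []
    for key in manifest:
        piece = choose_peer(key, peers)[key]
        if fragment_key(piece) != key:
            raise ValueError("fragment does not match its key")
        parts.append(piece)
    return b"".join(parts)


peers = [{}, {}, {}]  # each dict stands in for one machine's spare disk space
original = b"quarterly report " * 10
manifest = store(original, 16, peers)
restored = retrieve(manifest, peers)
```

Because the key is derived from the content, identical fragments (common application or operating system files, say) map to the same key and need to be stored only once.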
66 Peer-Oriented systems
From Wolfgang Hoschek
67 Final Thoughts
- Aim to identify themes of interest
- for Grid environments
- Identify Local and Global services
- Some of these are available in existing software
- some must be implemented by a user
- Standards play an important role
- Development of Common Open Grid Data Services for Science