High Performance Data Management

Title: High Performance Data Management

1
High Performance Data Management

Omer Rana
Cardiff University and
Welsh E-Science Centre
o.f.rana_at_cs.cf.ac.uk

2
From Michael Lesk
Burned manuscript from British Library
3
From Michael Lesk
4
From Michael Lesk
George Washington manuscript
5
From Michael Lesk
Keio University - Gutenberg Bible
6
From Michael Lesk
International Dunhuang project cave painting
7
From Michael Lesk
Seismic activity IRIS data base
8
From Michael Lesk
9
From Michael Lesk
Alcohol dehydrogenase
Molecule of the month, Protein Data Bank
10
From Michael Lesk
CalFlora database rhododendron
11
From Michael Lesk
Sky Sloan Digital Sky Survey
12
From Michael Lesk
MRI scan, UCLA
13
From Michael Lesk
East coast dolphin database, Eckerd College
(Kelly Debure)
14
From multimap.com
15
Synthetic Aperture Radar
Uses: vegetation type, extent and deforestation;
soil erosion; soil moisture content;
archaeological investigations; ocean dynamics,
wave and surface wind speeds and direction;
volcanic and tectonic activity
Roy Williams (Caltech)
16
Applications: detection and monitoring of oil
spills
Roy Williams (Caltech)
17
Archaeology
SAR sees new archaeology beneath forest cover
Angkor Wat, Cambodia: vast complex of more than
60 temples; spiritual center for the Khmer
people; 9th century
Roy Williams (Caltech)
18
From Michael Lesk
Human motion digital library, Jezekiel Ben-Arie
19
From Michael Lesk
CT scans of crocodile skulls, Tim Rowe (U Texas)
20
From Michael Lesk
Beauvais cathedral, Peter Allen, Columbia
21
From Michael Lesk
Forma Urbis Romae, Marc Levoy, Stanford
22
From Michael Lesk
Image searching Jitendra Malik, David Forsyth
23
From Michael Lesk
3-D search Tom Funkhouser, Princeton
24
(No Transcript)
25
Drivers

Survey of needs - led by some questions
Grid based infrastructure has become an important
environment -- how do we utilise this?
What are specific manufacturers offering?
BlueArc's FPGA-based storage
What is available now - and what is (ideally)
needed at all layers of data management
hierarchy?
Currently available tools
Application demands

26
Server Storage Capacity Requirements
27
(No Transcript)
28
Technologies

Three core technologies
Object based representation
XML for metadata representation and management
Object distribution and Peer-2-Peer systems
(JXTA, Gnutella, Freenet)
Related areas
Digital Libraries
Persistent Archives
Data Mining environments

29
Wider Perspective Data Management

Data Preparation, Format, Fusion: Data
Interoperability and Integration (Quality)
Problem (Meta(n) Data)
Data Storage: I/O Problem (hierarchic)
Data Mining and Knowledge Discovery:
Intelligence Problem (Neural nets, Genetic
Algorithms, Rules)
Query Estimation and Optimisation: estimate cost
of query based on data gathered from user
Data Exploration and Visualisation: Navigation
and Interaction Problem
Role of Standards

30
Services

Requirement for
Local services
directly supported on resource or via a proxy
Must be implemented on all Grid-enabled resources
Resource dependent
Global services
Shared across resources
May be subscribed to
May be integrated with other such services

31
Global Services

Data Access
Primitive operations (read, write) supported with
location management
Service should provide hooks to such locally
supported functions
Storage management to support multiple access (eg
maintaining a container in cache if multiple
accesses are being requested)
Location Transparency
Support data set discovery
Keep location of data source independent of
access method
Support for a global namespace
Support logical view of data set -- and maintain
independence from physical location of data
source
Support logical factoring of data sets into
containers - and subsequently, federations
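
The location-transparency points above can be sketched as a small catalogue that maps logical names to physical replicas. This is only an illustration: the class name, method names and example URLs are hypothetical and not taken from any Grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalogue:
    """Maps logical data set names to physical locations (all names hypothetical)."""
    _replicas: dict = field(default_factory=dict)    # logical name -> list of URLs
    _containers: dict = field(default_factory=dict)  # container -> set of logical names

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record one more physical copy of a logical data set."""
        self._replicas.setdefault(logical_name, []).append(physical_url)

    def locate(self, logical_name: str) -> list:
        """Discovery: clients use only the logical name; URLs may change underneath."""
        return list(self._replicas.get(logical_name, []))

    def add_to_container(self, container: str, logical_name: str) -> None:
        """Logical factoring of data sets into a container."""
        self._containers.setdefault(container, set()).add(logical_name)

    def container_members(self, container: str) -> list:
        return sorted(self._containers.get(container, set()))


catalogue = ReplicaCatalogue()
catalogue.register("sky/survey/field-001", "gridftp://site-a.example.org/data/f001.fits")
catalogue.register("sky/survey/field-001", "file:///archive/tape7/f001.fits")
catalogue.add_to_container("sky/survey", "sky/survey/field-001")
```

Because the access method (here the URL scheme) lives only in the catalogue entry, a replica can move without any change to the name a client uses.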

32
Global Services 2

Security
System/User access without account on remote
source
Access control shared across collections/aggregations
Ownership by collection objects, rather than
individual users
Access control catalogues to separate logical
access control mechanisms, independent of any
particular resource
Persistence and Replication
Access via unique identifier independent of
physical location of data set/collection (not
possible with URIs or PURLs)
Must be automatically supported via an event
service
Requires a global namespace
Error Handling
Distinguish between errors from individual
resources and those from data collections/aggregations
Error handling supported via global namespace and
event service
Error reporting to resource and client requesting
service
Management Services

33
Global Services 3

Additional management services provided globally
can include
Check-pointing and state management service
Data migration service to facilitate logical
collections
Container management
Support for collection aggregation
Transformation between data supported through
languages such as XSLT
XML useful for encoding data
Must be able to cover both ASCII and Binary data
(BinX (NeSC), VOTable (Astronomy))

34
XPath and XSLT

XSL
Extensible Stylesheet Language
XSLT
Extensible Stylesheet Language Transformations

Output (HTML, Latex, Excel, ...)
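
A rough sketch of the idea of selecting XML nodes by path and emitting another format. Python's standard library has no XSLT processor, so this swaps in ElementTree's limited XPath subset and builds the HTML by hand; the element names (catalogue, dataset, title, format) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; the schema is an assumption.
SOURCE = """<catalogue>
  <dataset id="d1"><title>Sky survey field</title><format>binary</format></dataset>
  <dataset id="d2"><title>Seismic trace</title><format>ascii</format></dataset>
</catalogue>"""


def to_html(xml_text: str) -> str:
    """Turn each <dataset> record into one HTML table row."""
    root = ET.fromstring(xml_text)
    table = ET.Element("table")
    # "./dataset" is an XPath-style location path selecting each child record.
    for ds in root.findall("./dataset"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = ds.get("id")
        ET.SubElement(row, "td").text = ds.findtext("title")
    return ET.tostring(table, encoding="unicode")


html = to_html(SOURCE)

# A predicate selects only the records whose <format> child is "ascii".
ascii_ids = [d.get("id") for d in ET.fromstring(SOURCE).findall("./dataset[format='ascii']")]
```

A real XSLT stylesheet would express the same mapping declaratively as templates matched against such paths.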
35
Data Storage and Access

Policy (division between services)
automated (resource supported) vs. user
implemented.
Must be exposed to the user
Operations (minimal set supported)
Access operations (read, write)
Discovery support (address lookup, access
properties)
Exception handling
State Management (transaction support)
State recording within resource or via a
checkpoint service
Mechanism
Actual implementation of local services
Direct access to such mechanisms via resource
metadata (by global services)
Ability to support multiple mechanisms in same
resource
Structure
physical organisation of resource or its contents
(disks, number of heads, access/transfer rates
etc)
Support for external manipulation

36
Minimum Unit of transfer

Unit of transfer based on access patterns and
data structures
Array based access in Scientific Computing
Random access in Business Computing
Support for block, cyclic, or irregular array
strides (eg from Fortran programs)
Resource type also determines minimum unit
File system (NFS or AFS)
HPSS or DPSS or network caches
Structured databases (Objectivity, Oracle)
Minimum unit of transfer must be made explicit
Support for collections/aggregations of minimal
units
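
To make the block and cyclic access patterns concrete, here is a minimal sketch that lists which array elements one process would touch and coalesces them into contiguous (start, length) runs. The function names are hypothetical; treating each run as the smallest transferable unit is an assumption made for illustration.

```python
def block_indices(n: int, parts: int, rank: int) -> list:
    """Contiguous block owned by `rank` when n elements are split into `parts` blocks."""
    size = -(-n // parts)  # ceiling division
    return list(range(rank * size, min(n, (rank + 1) * size)))


def cyclic_indices(n: int, parts: int, rank: int) -> list:
    """Every `parts`-th element starting at `rank` (a cyclic stride)."""
    return list(range(rank, n, parts))


def strided_units(indices: list) -> list:
    """Coalesce sorted element indices into (start, length) contiguous runs."""
    runs = []
    for i in indices:
        if runs and runs[-1][0] + runs[-1][1] == i:
            # Index continues the previous run, so extend it by one element.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((i, 1))
    return runs


block_runs = strided_units(block_indices(10, 3, 0))
cyclic_runs = strided_units(cyclic_indices(10, 3, 1))
```

With ten elements over three processes, the block pattern gives one run of four elements while the cyclic pattern gives three single-element runs, which is why the minimum unit has to be stated explicitly.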

37
Data Formats

Need for common data formats to support
cross-domain analysis and data sharing
Automated annotation
of experimental results for analysis
of stages of analysis for management tools
Support for data fusion and quality management
However,
Unlikely to happen
Unlikely to be ratified through standards
Unlikely to be accepted by everyone
Compromise?
Define points of sharing rather than actual
data formats
Support ease of exchange between formats, rather
than agree on specific formats
Tried before: Ontolingua (Stanford), DAML (DARPA,
Maryland)
Can we have a Grid Ontology for Data Management?
(OWL, OIL)

38
Metadata Management

Separate Content from Structure
Re-purpose data (supports sharing) - via
catalogues
Used for data integration
DataCutter Project
Could represent
Scheme for locating data
Properties of data resource externally visible
security privileges (access rights)
Content types and structure (relationships)
Content Structure (Relationships) to support
semantic interoperability
semantic/functional
spatial/structural
temporal/procedural

39
Data Processing

Processing Characteristics
Well defined work flow
Correction, calibration, transformation, filtering,
merging
Relatively static reference data
Stable processing functions (audited changes)
Periodic reprocessing from archive

40
Analysis and Interpretation

Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal
filtering and summarisation
- Retain drill down capability

41
Analysis and Interpretation

Conclusions/Inferences
Descriptions
Trends
Correlations
Relationships

Analysis and Interpretation Characteristics
Highly dynamic work flow
Multiple data types
Volatile data
Annotations, inferences, conclusions
Evidential reasoning
Shared multiple versions of truth
Periodic version consolidation

42
Metadata Requirements

Technical Metadata
Direct referencing - Physical location and data
schema/structure
Data currency/status: version, time stamping
Accreditation/Access permissions - Ownership
(Dublin Core)
Query time/Governance - data volume, no. of
records, access paths
Contextual Metadata
Logical referencing physical data
semantic/syntactic ontologies
Lexical translation: thesaurus, ontological
mapping
Named derivations (summarisations)
Scope of Requirements
All science communities
Related to provenance

43
Metadata Requirements

Data Versioning
Distinguish latest/agreed version of data
Maintain history record of change
Synchronise and mirror replicated data
Distinguish shared personal interpretations
and/or annotations
Provenance
Record of data processing: calibration,
filtering, transformation
Record of workflow: methods, standards and
protocols
Reasoning: evidential justification for
inferences and conclusions
Scope of Requirements
All science communities
Includes Technical and Contextual Metadata
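
One way to picture the provenance requirement is a data set that carries its processing history along with its values. A minimal sketch follows; the class names, fields and the "instrument-A" source label are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceStep:
    operation: str   # e.g. "calibration", "filtering", "transformation"
    source: str      # where the original data came from
    parameters: str  # free-text record of the settings used


class TrackedData:
    """Values plus an append-only history of how they were produced."""

    def __init__(self, values, source, history=None):
        self.values = list(values)
        self.source = source
        self.history = history or [ProvenanceStep("ingest", source, "")]

    def apply(self, operation, func, parameters=""):
        """Return a new derived data set; the original and its history stay untouched."""
        step = ProvenanceStep(operation, self.source, parameters)
        return TrackedData([func(v) for v in self.values], self.source,
                           self.history + [step])


raw = TrackedData([1, 2, 3], "instrument-A")
calibrated = raw.apply("calibration", lambda v: v * 2, "gain=2")
```

Each derived version keeps the full chain back to ingestion, so the latest version can be distinguished from earlier ones by comparing histories.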

44
Provenance

When you see some data on the Web, do you know
where it came from?
why it is there?
This information (provenance) is typically lost
in the process of copying/transcribing/transforming
databases
Loss of provenance is an acute problem in some
scientific databases
Especially relevant when combining data from
multiple sources
The Web has lots of stuff (50 TB or so). Digital
libraries are introducing new kinds of collections
and often try to maintain metadata control, but
often still have little control over the ultimate
content.

45
Standards at different levels

IEEE Open Storage Systems Interconnection (OSSI)
derived from IEEE Mass Storage Reference Model
(1980s)
Low level detail of storage systems: media, drive
technology, structure of media
Meant to facilitate mechanism sharing between
storage resources
ISO Open Archival Information System (OAIS)
Support for managing persistent archives
Support for data submission, recording,
ingestion, encapsulation of data objects with
attributes and data export
Involves the human in the loop (this is seen as
an important part)
ISO 13250 Topic Maps
Standard notation for structure of information
sources (topics) and relationships between topics
Interrelated docs supporting this representation
Topic Maps
Can relate topics via associations, groups,
similarities

46
A Common Infrastructure
Application
Data Cutter
Local Services
Collect./ Aggreg.
Storage Resource Broker
Global Services
Globus (Legion)
P2P Infrastructure
Client/Server Infras.
Resources
47
Data Cutter

Provides support for managing distributed,
archived data sets
Extends the Active Data Repository to
support range queries
user defined filters on data
data processing at source
Uses the abstraction of a filter to support
distributed processing and management of data via
common services
Also supports data division based on range
queries
uses a hierarchical indexing scheme
Primary aim is to support
distributed data sets organised as collections
where data sets can be generated by distributed
platforms
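
The filter abstraction and range queries described above can be illustrated with a small generator pipeline. This is a sketch of the general idea only, not DataCutter's actual API: the chunk layout, function names and the linear scan (in place of the hierarchical index) are all assumptions.

```python
# A chunk is ((lo, hi), values): a half-open index range and the data it holds.

def range_query(chunks, lo, hi):
    """Yield only the chunks whose index range overlaps [lo, hi)."""
    for (c_lo, c_hi), values in chunks:
        if c_lo < hi and lo < c_hi:
            yield (c_lo, c_hi), values


def make_filter(func):
    """Wrap a per-chunk function as a stream-to-stream filter."""
    def run(stream):
        for bounds, values in stream:
            yield bounds, func(values)
    return run


def pipeline(stream, *filters):
    """Chain filters so each consumes the previous filter's output stream."""
    for f in filters:
        stream = f(stream)
    return stream


chunks = [
    ((0, 4), [1.0, 2.0, 3.0, 4.0]),
    ((4, 8), [5.0, 6.0, 7.0, 8.0]),
    ((8, 12), [9.0, 10.0, 11.0, 12.0]),
]
threshold = make_filter(lambda vs: [v for v in vs if v > 6.0])  # user-defined filter
scale = make_filter(lambda vs: [v * 10 for v in vs])
result = list(pipeline(range_query(chunks, 4, 12), threshold, scale))
```

Because the filters are lazy, a threshold like this could run next to the data source and ship only the surviving values onward.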

48
From Reagan Moore
SDSC Storage Resource Broker Meta-data
Catalog Common APIs
Application
Linux I/O
OAI
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
49
Spitfire (CERN)
Uniform access to RDBMS

Intended as middleware for access to
relational databases
Three SOAP services
Base service for standard operations,
Admin service for administrative access,
Info service for information on the database and
its tables

50
Grid Database Service Specification Principles
From Amy Krause et al. (GGF6)

Provide service-based access to existing database
systems in the context of Grid Computing.
Be orthogonal to the Grid authentication and
authorization mechanisms
Accommodate diverse database paradigms within a
consistent framework.
Accommodate diverse database metadata
Support higher-level information-integration and
federation services.
Adopt the document approach to service
description
Defined semi-formally (WSDL and English text)

51
Grid Database Service Port Types
Grid Data Service
GridService Port

Mandatory
GridDataService.
GridService.
Optional
GridDataTransport.
NotificationSource.

Find Service Data

Client (Application or Federator)
GridDataService Port
Query Specification

Transport Specification
GridDataTransport Port

52
Grid Data Service Initialization
Grid Service Registry
Grid Data Service Factory
Create Grid Data Service
Create Grid Data Service
Find Factory

Client (Application or Federator)

Grid Data Service
Database
Data Request
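
The initialization sequence on this slide (find factory in registry, create service, send data request) can be sketched with plain Python objects. None of this is the specification's interface: the class and method names are hypothetical, and an in-memory SQLite database stands in for the real database behind the service.

```python
import sqlite3


class GridDataService:
    """Service instance bound to one database connection."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, query, params=()):
        """Handle a data request and return the result rows."""
        return self._connection.execute(query, params).fetchall()


class GridDataServiceFactory:
    """Creates a fresh service instance per client request."""

    def __init__(self, database_path):
        self._database_path = database_path

    def create_service(self):
        return GridDataService(sqlite3.connect(self._database_path))


class Registry:
    """Lets clients look up a factory by name."""

    def __init__(self):
        self._factories = {}

    def publish(self, name, factory):
        self._factories[name] = factory

    def find_factory(self, name):
        return self._factories[name]


registry = Registry()
registry.publish("observations", GridDataServiceFactory(":memory:"))

# Client side: find factory, create service, then issue data requests.
service = registry.find_factory("observations").create_service()
service.perform("CREATE TABLE obs (id INTEGER, band TEXT)")
service.perform("INSERT INTO obs VALUES (?, ?)", (1, "infrared"))
rows = service.perform("SELECT id, band FROM obs")
```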

53

Montage - Custom Image Mosaics
http://montage.ipac.caltech.edu
User specified size, WCS projection,
coordinates, spatial sampling, rotation
Rectification of backgrounds in images
Supports drizzle algorithm

Delivery
Semi-annual deliveries from Feb 2003
Final Delivery Jan 2005
Available for download

Science Drivers
Science Grade Images
Analyze diverse images as if part of same
multi-wavelength image (radio, infrared,
optical etc)

54
Montage

Combine data from different astronomers (using
different instruments)

55
(No Transcript)
56
(No Transcript)
57
http://www.neurogrid.net/
58
http://www.neurogrid.net/
59
OceanStore (Berkeley)
data source
data plane
Web content server
network plane
60
OceanStore Goal and Challenges
Provide content distribution to clients with good
Quality of Service (QoS) while retaining
efficient and balanced resource consumption of
the underlying infrastructure

Dynamic choice of number and location of replicas
Clients' QoS constraints
Servers' capacity constraints
Efficient update dissemination
Delay
Bandwidth consumption
Scalability: millions of objects, clients and
servers
No global network topology knowledge

61
Dynamic Replica Placement: naïve
data plane
s
c
network plane
Tapestry mesh
62
Dynamic Replica Placement: naïve
data plane
parent candidate
s
proxy
c
network plane
Tapestry mesh
Tapestry overlay path
63
Dynamic Replica Placement: smart
data plane
client child
s
parent
proxy
sibling
c
server child
network plane
64
Dynamic Replica Placement: smart

Aggressive search
Lazy placement

Greedy load distribution

data plane
parent candidates
client child
s
parent
proxy
sibling
c
server child
network plane
65
HiveCache from MojoNation

Utilises free disk space to undertake backup --
builds a dynamic RAID network
Disk space monitored to determine how much to
allocate (via a local agent)
File broken into pieces along with error
correction information and encryption
Data highly replicated
Little unique info in a company PC
common apps (Word, Excel, etc.)
operating system files, utilities
Personal data, however, is not very large
Uses SHA-1 hash algorithm to map each file
fragment
the hash acts as a unique key to locate the data fragment
Request to retrieve a file is mapped to machines
-- which retrieve file in parts in parallel
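
A minimal sketch of content-addressed fragments as described above: each piece of a file is keyed by its SHA-1 digest and the key decides which machine holds it. Error correction and encryption are left out, and the fragment size, machine names and modulo placement rule are assumptions for illustration.

```python
import hashlib


def fragment(data: bytes, size: int) -> list:
    """Break a file's bytes into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fragment_key(piece: bytes) -> str:
    """SHA-1 digest of the piece, used as its unique lookup key."""
    return hashlib.sha1(piece).hexdigest()


def place(pieces, machines):
    """Return (key -> machine, key -> piece); identical pieces share one key."""
    location, store = {}, {}
    for piece in pieces:
        key = fragment_key(piece)
        # Assumed placement rule: pick a machine from the numeric value of the key.
        location[key] = machines[int(key, 16) % len(machines)]
        store[key] = piece
    return location, store


def retrieve(keys, store):
    """Reassemble a file from its ordered list of fragment keys."""
    return b"".join(store[k] for k in keys)


data = b"AAAABBBBAAAA"
pieces = fragment(data, 4)
keys = [fragment_key(p) for p in pieces]
location, store = place(pieces, ["pc-1", "pc-2", "pc-3"])
restored = retrieve(keys, store)
```

The first and third pieces are identical, so only two distinct fragments are stored, which reflects the observation that most data on company PCs is duplicated.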

66
Peer-Oriented systems
From Wolfgang Hoschek
67
Final Thoughts

Aim to identify themes of interest
for Grid environments
Identify Local and Global services
Some of these are available in existing software
- some must be implemented by a user
Standards play an important role
Development of Common Open Grid Data Services for
Science
