Title: Knowledge-Based Grids
1Knowledge-Based Grids
- Reagan W. Moore
- San Diego Supercomputer Center
- moore_at_sdsc.edu
- http://www.npaci.edu/DICE
2Data Intensive Computing Environment
- Staff
- Reagan Moore
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Students - GSRA
- Martin Kuhl
- Liying Sui
- Yang Yu
- Valter Crescenzi
- Students - Undergrad Interns
- Peter Shin
- Roman Olshanowsky
- Shabbar Tambawala
- Pratik Mukhopadhyay
- /- NN
3Abstract
Grids are middleware infrastructure that ties
together distributed storage systems and
execution platforms. Data grids organize digital
objects into collections, and identify remote
data objects by specifying a set of attributes.
Knowledge-based grids add a second level of
indirection for data discovery, by providing a
concept space. It is then possible to map from
scientific domain concepts, to the attributes
used within a collection, to the digital objects
residing in an archival storage system. These
concepts will be illustrated based on
infrastructure under development at the San
Diego Supercomputer Center.
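The two levels of indirection in the abstract can be sketched as two lookups followed by an archive lookup. This is only an illustrative in-memory model: the concept name, attribute names, object identifiers, and storage paths are all hypothetical and are not taken from any SDSC system.

```python
# All names and values below are hypothetical, chosen only for illustration.

# Level 1: concept space. A domain concept maps to collection attributes.
CONCEPT_SPACE = {
    "near-infrared sky image": {"survey": "2MASS", "type": "image"},
}

# Level 2: collection. Attribute records identify digital objects.
COLLECTION = [
    {"object_id": "img-0001", "survey": "2MASS", "type": "image"},
    {"object_id": "img-0002", "survey": "2MASS", "type": "image"},
    {"object_id": "cat-0001", "survey": "2MASS", "type": "catalog"},
]

# Archive: object identifiers map to storage locations.
ARCHIVE = {
    "img-0001": "archive:/sky/img-0001",
    "img-0002": "archive:/sky/img-0002",
    "cat-0001": "archive:/sky/cat-0001",
}


def resolve(concept):
    """Map a concept to attributes, then to objects, then to storage locations."""
    wanted = CONCEPT_SPACE[concept]
    object_ids = [
        record["object_id"]
        for record in COLLECTION
        if all(record.get(name) == value for name, value in wanted.items())
    ]
    return [ARCHIVE[object_id] for object_id in object_ids]


locations = resolve("near-infrared sky image")
print(locations)
```

The caller names only the concept; attribute names and storage locations stay hidden behind the two mappings.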
4Topics
- Three communities
- Digital Library, Grid Forum, Persistent Archive
- Evolving standards
- Data, Information, Knowledge
- Application requirements
- Science, education
- Common infrastructure
- Collection management systems
- SDSC Storage Resource Broker (SRB)
5Merging Infrastructures
- Digital Libraries
- Collection management
- Services
- Grids
- Interoperability
- Latency management
- Persistent Archives
- Archivable forms
- Technology evolution
6Application Communities
- Digital Libraries
- NSF National SMET Education Digital Library
- NSF Digital Library Initiative, Phase II - Interlib CDL
- NSF NPACI Digital Sky Project
- NLM Digital Embryo Project
- Data Grids
- DOE ASCI Data Visualization Corridor
- DOE Particle Physics Data Grid
- NASA Information Power Grid
- NSF Grid Physics Network
- Persistent Archives
- Knowledge-based persistent archive
7Differentiating between Data, Information, and
Knowledge
- Data - bit stream
- Information - any tagged data element
- Knowledge - any relationship between information elements
- Data are digital representations of reality
- Concepts are tags that describe reality
- Knowledge is the relationship between concepts
8Examples of Knowledge
- Relationship types
- Logical / Semantic - digital library crosswalks
- Temporal / Procedural - workflow management
- Spatial / Structural - GIS
- Functional / Algorithmic - feature analysis
- Relationships can be quantified by rules
- Ingestion - how to tag attributes
- Collection instantiation - how to build a database
- Knowledge mining - how to identify a feature
9Digital Library Architecture
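Diagram: three levels (Knowledge, Information, Data), each with Ingest, Manage, and Access functions.
- Knowledge - Ingest: Concepts, Relationships; Manage: Knowledge Repository for Relationships; Access: Knowledge or Topic-Based Query
- Information - Ingest: Semantics, Attributes; Manage: Information Repository for Attributes; Access: Attribute-based Query
- Data - Ingest: Fields, Containers, Folders; Manage: Storage (Replicas, Persistent IDs); Access: Feature-based Query
- Diagram labels: Standards, Infrastructure, Languages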
10Standards
- Data
- HDF - describe data structure
- SRB containers - aggregate data
- Unix directory system - folders for organizing data
- Information
- XML - tag information
- DTD - semi-structured organization of information
- Knowledge
- ISO 13250 topic map - type relationships
- RDF - manipulate relationships
11Management
- Manage storage systems instead of storage devices
- Characterize a storage system by its interaction mechanisms
- Manage information repositories instead of implementing a database
- Characterize an information repository structure
- Manage knowledge repositories instead of implementing an inference engine
- ISO 13250 Topic Maps to describe knowledge
12Grid Architecture
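Diagram: three levels (Knowledge, Information, Data), each with Ingest, Manage, and Access functions, annotated with the standards and systems used in grids.
- Knowledge - Ingest: Concepts, Relationships; Manage: Knowledge Repository for Relationships; Access: Knowledge or Topic-based Query; Rules - KQL
- Between Knowledge and Information - XTM DTD (Topic Maps / Model-based Access)
- Information - Ingest: Semantics, Attributes; Manage: Information Repository; Access: Attribute-based Query
- Between Information and Data - XML DTD; EMCAT / MIX (Data Handling System - Storage Resource Broker)
- Data - Ingest: Fields, Containers, Folders; Manage: Storage (Replicas, Persistent IDs); Access: Feature-based Query; MCAT/HDF
- Diagram label: Grids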
13Data Handling Infrastructure
- Model-Based Knowledge Management
- Rule-based ontology mapping, conceptual-level mediation - CMIX
- Data Grid
- Data federation across multiple libraries - MIX
- Digital Library
- Interoperable services for information discovery and presentation - SDLIP
- Data Collection
- Information Management - MCAT
- Data Handling
- Systems for data retrieval from storage systems - SRB
- Archival storage
- Storage of data collections for the life of the republic - HPSS
14Digital Library - Scenario 1
- Data Assimilation Office
- Data sets are stored on Unitree at NASA Goddard
- Computation is done at NASA Ames
- Re-analyze 10 years of data with a new data assimilation code
- Can the analysis be automated?
15Collection-based Data Access
- Automate access through creation of a logical collection that spans multiple sites
- Data sets managed by attributes
- Data set discovery through use of attributes
- Access across administration domains
- SRB API
- Connection to remote resource
- clConnect call
- Unix file system semantics
- srbObjCreate / srbObjClose / srbObjRead / srbObjWrite
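Attribute-based discovery over a logical collection that spans sites might look like the sketch below. It is a plain in-memory model and does not call the SRB client library; the signatures of clConnect and the srbObj* calls are not reproduced here. The class names, site names, and year values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    logical_name: str  # name seen by the user, independent of location
    site: str          # where the physical copy lives (hypothetical label)
    attributes: dict = field(default_factory=dict)


class LogicalCollection:
    """One catalog over data sets held at several sites."""

    def __init__(self):
        self._entries = []

    def register(self, data_set):
        self._entries.append(data_set)

    def discover(self, predicate):
        """Return the data sets whose attributes satisfy the predicate."""
        return [d for d in self._entries if predicate(d.attributes)]


collection = LogicalCollection()
for year in range(1988, 2000):  # hypothetical years of observations
    site = "site-a" if year < 1994 else "site-b"
    collection.register(DataSet(f"obs-{year}", site, {"year": year}))

# Select ten years of data by attribute, without naming any site.
decade = collection.discover(lambda attrs: 1990 <= attrs["year"] <= 1999)
print(len(decade), sorted({d.site for d in decade}))
```

An automated re-analysis could then loop over the returned data sets and open each through the data handling system, whichever site holds it.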
16Collection Attributes
- SRB location attributes
- Storage location (IP address), access protocol, local file name
- Unix file attributes
- Owner, creation date, size, access control list
- Dublin Core attributes
- Provenance information
- Domain specific attributes
- Attributes that describe the physical properties
of the digital object
17SDSC Storage Resource Broker - Digital Library
Application
Resource
User
MCAT
Dublin Core
Application Meta-data
19Data Grid - Scenario 2
- NASA Information Power Grid
- Demonstrate distributed data analysis using multiple NASA resources while accessing data objects stored at multiple sites
- HPSS at SDSC
- File systems at Caltech
- NASA FTP sites
- Support access to legacy systems
20SDSC Storage Resource Broker - Grid Middleware
Client Library
SRB Server
Local SRB Server
21Digital Library - Scenario 3
- California Digital Library
- Provide persistent identifiers (site and access protocol independent)
- Provide support for copies of the data objects
- Provide archival backup for the data collections
- Manage persistence across technology evolution
22Logical Resource Naming
- Global digital object namespace (location, protocol independence)
- Logical resource namespace
- Create a logical resource name that groups multiple physical resources
- Writing to the logical resource name writes to all of the associated physical resources
- Completion on write to k of n resources, k < n
- Latency management
- Access copy stored on lowest latency storage system
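The logical resource behavior above can be sketched as follows, assuming a synchronous in-memory model. The class names, resource names, and latency figures are hypothetical; a real system would issue the writes over the network and might finish the remaining copies after reporting completion.

```python
class PhysicalResource:
    def __init__(self, name, latency_ms, available=True):
        self.name = name
        self.latency_ms = latency_ms  # hypothetical access latency
        self.available = available
        self.store = {}

    def write(self, key, data):
        if not self.available:
            raise OSError(f"{self.name} is unavailable")
        self.store[key] = data


class LogicalResource:
    """One logical name that groups n physical resources."""

    def __init__(self, resources, k):
        if not 0 < k <= len(resources):
            raise ValueError("k must be between 1 and n")
        self.resources = list(resources)
        self.k = k

    def write(self, key, data):
        """Attempt every resource; succeed once at least k copies exist."""
        successes = 0
        for resource in self.resources:
            try:
                resource.write(key, data)
                successes += 1
            except OSError:
                pass  # tolerate failures as long as k copies are made
        if successes < self.k:
            raise OSError(f"only {successes} of {self.k} required copies written")
        return successes

    def read(self, key):
        """Read from the lowest-latency available resource holding a copy."""
        holders = [r for r in self.resources if r.available and key in r.store]
        if not holders:
            raise KeyError(key)
        best = min(holders, key=lambda r: r.latency_ms)
        return best.name, best.store[key]


tape = PhysicalResource("tape-archive", latency_ms=500)
remote = PhysicalResource("remote-disk", latency_ms=50, available=False)
cache = PhysicalResource("disk-cache", latency_ms=5)

logical = LogicalResource([tape, remote, cache], k=2)
copies = logical.write("image-42", b"pixels")
source, payload = logical.read("image-42")
print(copies, source)
```

Here n = 3 and k = 2, so the write completes even though one resource is down, and the read is served from the copy with the lowest latency.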
23Digital Library - Scenario 4
- NSF NPACI - Digital Sky Project
- Support formation of a 5-million image, 10-TB image collection for 2MASS
- Store images in an archive
- Sort images into containers based on spatial location rather than temporal order as seen by the telescope
24Containers
- Containers are used to aggregate data sets
- Minimizes number of files seen by archives
- Improves latency of access for related files
- Containers have a maximum size
- When writing into a container, a new container is automatically started when the initial container is full
- Metadata catalog manages mapping from object to container
- Containers are cached on disk after retrieval from archive
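A minimal sketch of container behavior, assuming that "full" means the next object would not fit in the current container. The class and method names are hypothetical, the catalog is a plain dictionary standing in for the metadata catalog, and disk caching after retrieval is not modeled.

```python
class ContainerStore:
    def __init__(self, max_size):
        self.max_size = max_size        # maximum bytes per container
        self.containers = [bytearray()]
        self.catalog = {}               # object name -> (container, offset, length)

    def write(self, name, data):
        if len(data) > self.max_size:
            raise ValueError("object larger than a container")
        current = self.containers[-1]
        if len(current) + len(data) > self.max_size:
            # Current container is full: start a new one automatically.
            current = bytearray()
            self.containers.append(current)
        offset = len(current)
        current.extend(data)
        self.catalog[name] = (len(self.containers) - 1, offset, len(data))

    def read(self, name):
        index, offset, length = self.catalog[name]
        return bytes(self.containers[index][offset:offset + length])


store = ContainerStore(max_size=10)
store.write("a", b"123456")
store.write("b", b"abcdef")  # does not fit with "a", so a second container starts
store.write("c", b"xyz")     # fits after "b" in the second container
print(len(store.containers), store.catalog["c"], store.read("c"))
```

The archive sees two files instead of three, and related objects written together land in the same container.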
26Data Grid - Scenario 5
- DOE - ASCI Data Visualization Corridor
- Provide interactive visualization of terabyte-sized data objects, retrieved from remote archive
- Subset data objects at the storage system
- Latency management mechanism
- Page data as needed from remote source into rendering system
27SDSC Storage Resource Broker Meta-data Catalog
28Data Grid - Scenario 6
- DOE - Particle Physics Data Grid
- Support replicas of data objects to minimize access latency
- Support access to local resource managers
- Stage command - to force prefetch
- Status command - to track progress of resource manager
- Support interoperation between grids
29Particle Physics Data Grid - Replication System
30Interoperation between Grids
Application
Application
SRB Client Library
Metadata Catalog
Replica Catalog
Metadata Catalog
SRB Server
Storage System
Globus
SDSC Storage Resource Broker
31Grid Latency Management
- Minimize the number of times the wide-area network latency is incurred
- Caching on disk
- Replication within a grid
- Containers for aggregating data
- Remote I/O proxies for aggregating I/O commands
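The effect of aggregating I/O commands can be illustrated by counting wide-area round trips. This is a toy model: the RemoteStore class, its read_many method, and the object names are hypothetical and do not correspond to any real SRB interface.

```python
class RemoteStore:
    """Stand-in for a remote storage system that counts round trips."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.round_trips = 0

    def read(self, name):
        self.round_trips += 1  # one wide-area round trip per command
        return self.objects[name]

    def read_many(self, names):
        self.round_trips += 1  # one round trip for the whole batch
        return [self.objects[name] for name in names]


data = {"a": b"1", "b": b"2", "c": b"3"}

one_by_one = RemoteStore(data)
results_single = [one_by_one.read(name) for name in ("a", "b", "c")]

batched = RemoteStore(data)
results_batched = batched.read_many(["a", "b", "c"])

print(one_by_one.round_trips, batched.round_trips)
```

Both paths return the same data, but the batched path incurs the network latency once instead of once per object.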
32Data Grid - Scenario 7
- NSF NPACI - Human Brain database federation
- Provide rule-based access to multiple data collections
- Support access based on domain concepts, rather than collection attributes
- Define relationship of retrieved images
35Digital Library - Scenario 8
- National SMET Education Digital Library
- Support for curricula
- A curriculum is an instance of a concept space
- Support for images used in curricula
- Map from curriculum concepts to associated images managed in collections
- Support interoperability between multiple existing collections
36User Interfaces
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery Data delivery Discovery Global
Ids Security Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
37Data Grid - Scenario 9
- NSF Grid Physics Network
- Develop a Virtual Data Grid
- Provide access to derived data products
- Retrieve if already created
- Create if unavailable
- Collection-based execution environment
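The "retrieve if already created, create if unavailable" rule for derived data products can be sketched as a catalog keyed by the transformation and its inputs. The class name, the calibrate function, and the input values are hypothetical; inputs are assumed to be hashable.

```python
class VirtualDataCatalog:
    def __init__(self):
        self._products = {}       # (transformation, inputs) -> derived product
        self.derivations_run = 0  # how many products were actually computed

    def get(self, transformation, *inputs):
        key = (transformation, inputs)
        if key not in self._products:
            # Product is unavailable: create it and record it.
            self._products[key] = transformation(*inputs)
            self.derivations_run += 1
        # Product exists: retrieve it without recomputing.
        return self._products[key]


def calibrate(raw_value, gain):
    """Hypothetical derivation step."""
    return raw_value * gain


catalog = VirtualDataCatalog()
first = catalog.get(calibrate, 21, 2)
second = catalog.get(calibrate, 21, 2)  # served from the catalog
third = catalog.get(calibrate, 10, 2)   # new inputs, so a new derivation
print(first, second, third, catalog.derivations_run)
```

Recording the transformation together with its inputs is what lets the catalog decide whether a request is a retrieval or a new derivation.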
38GriPhyN
- Collection-based execution environment
- Save context associated with the execution of an application
- Application input - parameters, files
- Execution environment - resources, operating systems
- Output results - files, visualizations
- Knowledge - (semantic, procedural, structural, algorithmic relationships)
39Collection Attributes
- Capturing process attributes
- Fixed information
- Host machine environments
- Application
- Input parameters
- Default values
- Data Structures
- Run-time environment
- Host machines used
- Application execution instance
- Input parameters
- Data Structures
- Data set usage
- Data access patterns
40Persistent Archive - Scenario 10
- Collection-based Persistent Archives
- Use a collection to provide the context for describing archived digital objects
- Requires the ability to migrate a collection to new technology
- Rule-based Persistent Archives
- Define the rules that support tagging of the information content of a collection
- Express the implied knowledge inherent within a collection
42Ingestion Processes for Collection Creation
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection
43Closure
- Digital libraries are starting to recognize the need to federate existing repositories
- Grid interoperability mechanisms
- Grids are starting to use collections to manage execution context
- Information management mechanisms
- Persistent Archives are using Grid interoperability mechanisms to manage evolution, and Digital Library information management to instantiate collections
- A common architecture is emerging
44Further Information
http://www.npaci.edu/DICE