1
Knowledge-Based Grids
  • Reagan W. Moore
  • San Diego Supercomputer Center
  • moore@sdsc.edu
  • http://www.npaci.edu/DICE

2
Data Intensive Computing Environment
  • Staff
  • Reagan Moore
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Students - GSRA
  • Martin Kuhl
  • Liying Sui
  • Yang Yu
  • Valter Crescenzi
  • Students - Undergrad Interns
  • Peter Shin
  • Roman Olshanowsky
  • Shabbar Tambawala
  • Pratik Mukhopadhyay
  • /- NN

3
Abstract
Grids are middleware infrastructure that ties
together distributed storage systems and
execution platforms. Data grids organize digital
objects into collections, and identify remote
data objects by specifying a set of attributes.
Knowledge-based grids add a second level of
indirection for data discovery, by providing a
concept space. It is then possible to map from
scientific domain concepts, to the attributes
used within a collection, to the digital objects
residing in an archival storage system. These
concepts will be illustrated based on
infrastructure under development at the San
Diego Supercomputer Center.
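
The two levels of indirection described in the abstract can be sketched in a few lines of Python. This is only an illustration: the concept name, attribute names, and object identifiers are hypothetical, and nothing here is the SRB or MCAT interface.

```python
# Hypothetical names throughout; this is not the SRB or MCAT API.

# Level 1: the concept space maps a domain concept to attribute constraints.
CONCEPT_SPACE = {
    "infrared sky survey": {"survey": "2MASS", "band": "K"},
}

# Level 2: the collection catalog maps attribute values to stored objects.
CATALOG = [
    {"object": "archive/img001", "survey": "2MASS", "band": "K"},
    {"object": "archive/img002", "survey": "2MASS", "band": "J"},
    {"object": "archive/img003", "survey": "other", "band": "K"},
]


def find_by_attributes(constraints):
    """Return objects whose catalog entry matches every attribute constraint."""
    return [
        entry["object"]
        for entry in CATALOG
        if all(entry.get(key) == value for key, value in constraints.items())
    ]


def find_by_concept(concept):
    """Resolve a concept to attributes, then attributes to objects."""
    constraints = CONCEPT_SPACE.get(concept)
    if constraints is None:
        return []
    return find_by_attributes(constraints)
```

A user who knows only the domain concept calls find_by_concept and never needs to know which attributes the collection happens to use.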
4
Topics
  • Three communities
  • Digital Library, Grid Forum, Persistent Archive
  • Evolving standards
  • Data, Information, Knowledge
  • Application requirements
  • Science, education
  • Common infrastructure
  • Collection management systems
  • SDSC Storage Resource Broker (SRB)

5
Merging Infrastructures
  • Digital Libraries
  • Collection management
  • Services
  • Grids
  • Interoperability
  • Latency management
  • Persistent Archives
  • Archivable forms
  • Technology evolution

6
Application Communities
  • Digital Libraries
  • NSF National SMET Education Digital Library
  • NSF Digital Library Initiative, Phase II -
    Interlib CDL
  • NSF NPACI Digital Sky Project
  • NLM Digital Embryo Project
  • Data Grids
  • DOE ASCI Data Visualization Corridor
  • DOE Particle Physics Data Grid
  • NASA Information Power Grid
  • NSF Grid Physics Network
  • Persistent Archives
  • Knowledge-based persistent archive

7
Differentiating between Data, Information, and
Knowledge
  • Data - bit stream
  • Information - any tagged data element
  • Knowledge - any relationship between information
    elements
  • Data are digital representations of reality
  • Concepts are tags that describe reality
  • Knowledge is the relationship between concepts

8
Examples of Knowledge
  • Relationship types
  • Logical / Semantic - digital library crosswalks
  • Temporal / Procedural - workflow management
  • Spatial / Structural - GIS
  • Functional / Algorithmic - feature analysis
  • Relationships can be quantified by rules
  • Ingestion - how to tag attributes
  • Collection instantiation - how to build a
    database
  • Knowledge mining - how to identify a feature

9
Digital Library Architecture
  • Functions: Ingest, Manage, Access
  • Knowledge: Concepts / Relationships; Knowledge
    Repository for Relationships; Knowledge or
    Topic-Based Query
  • Information: Semantics / Attributes; Information
    Repository for Attributes; Attribute-based Query
  • Data: Fields, Containers, Folders; Storage
    (Replicas, Persistent IDs); Feature-based Query
  • Diagram labels: Standards, Infrastructure,
    Languages

10
Standards
  • Data
  • HDF - describe data structure
  • SRB containers - aggregate data
  • Unix directory system - folders for organizing
    data
  • Information
  • XML - tag information
  • DTD - semi-structured organization of information
  • Knowledge
  • ISO 13250 topic map - type relationships
  • RDF - manipulate relationships
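
As a rough illustration of the information and knowledge levels, the sketch below tags two values with XML using Python's standard library and records one relationship between the tags as a subject-predicate-object triple, in the spirit of RDF. The element names, values, and predicate are made up for the example.

```python
import xml.etree.ElementTree as ET

# Information: data values become information once they carry a tag.
# Element names and values are hypothetical.
record = ET.Element("observation")
ET.SubElement(record, "survey").text = "2MASS"
ET.SubElement(record, "band").text = "K"
xml_text = ET.tostring(record, encoding="unicode")

# Knowledge: a typed relationship between two information elements,
# written as a (subject, predicate, object) triple.
TRIPLES = [("band", "is_property_of", "survey")]


def related(tag, predicate):
    """Return every tag linked to `tag` by the given relationship type."""
    return [obj for subj, pred, obj in TRIPLES if subj == tag and pred == predicate]
```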

11
Management
  • Manage storage systems instead of storage devices
  • Characterize a storage system by its interaction
    mechanisms
  • Manage information repositories instead of
    implementing a database
  • Characterize an information repository structure
  • Manage knowledge repositories instead of
    implementing an inference engine
  • ISO 13250 Topic Maps to describe knowledge

12
Grid Architecture
  • Functions: Ingest, Manage, Access
  • Knowledge: Concepts / Relationships; Knowledge
    Repository for Relationships; Knowledge or
    Topic-based Query; Rules - KQL
  • Information: Semantics / Attributes; Information
    Repository; Attribute-based Query
  • Data: Fields, Containers, Folders; Storage
    (Replicas, Persistent IDs); Feature-based Query
  • Other diagram labels: XTM DTD (Topic Maps /
    Model-based Access); XML DTD; EMCAT / MIX (Data
    Handling System - Storage Resource Broker);
    MCAT/HDF; Grids

13
Data Handling Infrastructure
  • Model-Based Knowledge Management
  • Rule-based ontology mapping, conceptual-level
    mediation - CMIX
  • Data Grid
  • Data federation across multiple libraries - MIX
  • Digital Library
  • Interoperable services for information discovery
    and presentation - SDLIP
  • Data Collection
  • Information Management - MCAT
  • Data Handling
  • Systems for data retrieval from storage systems -
    SRB
  • Archival storage
  • Storage of data collections for the life of the
    republic - HPSS

14
Digital Library - Scenario I
  • Data Assimilation Office
  • Data sets are stored on Unitree at NASA Goddard
  • Computation is done at NASA Ames
  • Re-analyze 10 years of data with a new data
    assimilation code
  • Can the analysis be automated?

15
Collection-based Data Access
  • Automate access through creation of a logical
    collection that spans multiple sites
  • Data sets managed by attributes
  • Data set discovery through use of attributes
  • Access across administration domains
  • SRB API - Connection to remote resource
  • clConnect call
  • Unix file system semantics
  • srbObjCreate / srbObjClose / srbObjRead /
    srbObjWrite
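
The idea of a logical collection that hides the storage site behind Unix-like calls can be sketched with an in-memory stand-in. The class and method names below are hypothetical and do not reproduce the SRB C API named on the slide; the site names are used only as labels.

```python
import io


class LogicalCollection:
    """Toy in-memory stand-in for a collection that spans several sites."""

    def __init__(self):
        self._catalog = {}   # logical name -> (site, attributes)
        self._storage = {}   # (site, logical name) -> bytes

    def create(self, name, site, **attributes):
        """Register a data set under a logical name with its attributes."""
        self._catalog[name] = (site, attributes)
        self._storage[(site, name)] = b""

    def write(self, name, payload):
        """Append bytes; the caller never names the storage site."""
        site, _ = self._catalog[name]
        self._storage[(site, name)] += payload

    def open(self, name):
        """Return a file-like object for reading the data set."""
        site, _ = self._catalog[name]
        return io.BytesIO(self._storage[(site, name)])

    def discover(self, **constraints):
        """Find data sets by attribute values instead of by path."""
        return sorted(
            name
            for name, (_, attrs) in self._catalog.items()
            if all(attrs.get(k) == v for k, v in constraints.items())
        )
```

An analysis loop can then iterate over discover(...) results and read each data set the same way, whichever site holds it.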

16
Collection Attributes
  • SRB location attributes
  • Storage location (IP address), access protocol,
    local file name
  • Unix file attributes
  • Owner, creation date, size, access control list
  • Dublin core attributes
  • Provenance information
  • Domain specific attributes
  • Attributes that describe the physical properties
    of the digital object

17
SDSC Storage Resource Broker - Digital Library
Application
Resource
User
MCAT
Dublin Core
Application Meta-data
18
(No Transcript)
19
Data Grid - Scenario 2
  • NASA Information Power Grid
  • Demonstrate distributed data analysis using
    multiple NASA resources while accessing data
    objects stored at multiple sites
  • HPSS at SDSC
  • File systems at Caltech
  • NASA FTP sites
  • Support access to legacy systems

20
SDSC Storage Resource Broker - Grid Middleware
Client Library
SRB Server
Local SRB Server
21
Digital Library - Scenario 3
  • California Digital Library
  • Provide persistent identifiers (site and access
    protocol independent)
  • Provide support for copies of the data objects
  • Provide archival backup for the data collections
  • Manage persistence across technology evolution

22
Logical Resource Naming
  • Global digital object namespace (location,
    protocol independence)
  • Logical resource namespace
  • Create a logical resource name that groups
    multiple physical resources
  • Writing to the logical resource name writes to
    all of the associated physical resources
  • Completion on write to k of n resources,
    k < n
  • Latency management
  • Access copy stored on lowest latency storage
    system
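
A minimal sketch of logical resource naming, assuming hypothetical class names and latency figures: a write fans out to all physical resources and is reported complete once k copies exist, and a read is served from the available copy with the lowest latency.

```python
class PhysicalResource:
    """One storage system; names and latencies are illustrative only."""

    def __init__(self, name, latency_ms, available=True):
        self.name = name
        self.latency_ms = latency_ms
        self.available = available
        self.objects = {}

    def write(self, key, payload):
        if not self.available:
            raise OSError(f"{self.name} is unavailable")
        self.objects[key] = payload


class LogicalResource:
    """Groups physical resources under one logical name."""

    def __init__(self, resources, k):
        # The slide's case is k < n; k == n is also accepted here.
        if not 0 < k <= len(resources):
            raise ValueError("k must be between 1 and the number of resources")
        self.resources = resources
        self.k = k

    def write(self, key, payload):
        """Try every member; succeed if at least k copies were written."""
        written = []
        for resource in self.resources:
            try:
                resource.write(key, payload)
                written.append(resource.name)
            except OSError:
                continue
        if len(written) < self.k:
            raise OSError(f"only {len(written)} of {self.k} required copies written")
        return written

    def read(self, key):
        """Read from the available copy on the lowest-latency resource."""
        holders = [r for r in self.resources if r.available and key in r.objects]
        if not holders:
            raise KeyError(key)
        best = min(holders, key=lambda r: r.latency_ms)
        return best.name, best.objects[key]
```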

23
Digital Library - Scenario 4
  • NSF NPACI - Digital Sky Project
  • Support formation of a 5-million image, 10-TB
    image collection for 2MASS
  • Store images in an archive
  • Sort images into containers based on spatial
    location rather than temporal order as seen by
    the telescope

24
Containers
  • Containers are used to aggregate data sets
  • Minimizes number of files seen by archives
  • Improves latency of access for related files
  • Containers have a maximum size
  • When writing into a container, a new container is
    automatically started when the current container
    is full
  • Metadata catalog manages mapping from object to
    container
  • Containers are cached on disk after retrieval
    from archive
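
The container behavior listed above can be sketched as follows. The class is hypothetical and keeps everything in memory; it shows only the size limit, the automatic rollover to a new container, and the catalog that maps each object to its container.

```python
class ContainerStore:
    """Aggregates small objects into fixed-capacity containers."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.containers = [bytearray()]
        self.catalog = {}  # object name -> (container index, offset, length)

    def put(self, name, payload):
        if len(payload) > self.max_size:
            raise ValueError("object is larger than a container")
        current = self.containers[-1]
        # Start a new container when the current one cannot hold the object.
        if len(current) + len(payload) > self.max_size:
            current = bytearray()
            self.containers.append(current)
        self.catalog[name] = (len(self.containers) - 1, len(current), len(payload))
        current.extend(payload)

    def get(self, name):
        """Look up the container in the catalog, then slice out the object."""
        index, offset, length = self.catalog[name]
        return bytes(self.containers[index][offset:offset + length])
```

The archive sees one file per container rather than one per object, and objects written together end up in the same container.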

25
(No Transcript)
26
Data Grid - Scenario 5
  • DOE - ASCI Data Visualization Corridor
  • Provide interactive visualization of
    terabyte-sized data objects, retrieved from
    remote archive
  • Subset data objects at the storage system
  • Latency management mechanism
  • Page data as needed from remote source into
    rendering system

27
SDSC Storage Resource Broker Meta-data Catalog
28
Data Grid - Scenario 6
  • DOE - Particle Physics Data Grid
  • Support replicas of data objects to minimize
    access latency
  • Support access to local resource managers
  • Stage command - to force prefetch
  • Status command - to track progress of resource
    manager
  • Support interoperation between grids

29
Particle Physics Data Grid - Replication System
30
Interoperation between Grids
Application
Application
SRB Client Library
Metadata Catalog
Replica Catalog
Metadata Catalog
SRB Server
Storage System
Globus
SDSC Storage Resource Broker
31
Grid Latency Management
  • Minimize the number of times the wide-area
    network latency is incurred
  • Caching on disk
  • Replication within a grid
  • Containers for aggregating data
  • Remote I/O proxies for aggregating I/O commands
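
To make the goal concrete, the sketch below counts wide-area round trips for a client that caches objects locally and batches several requests into one call. Both classes are hypothetical; the counter stands in for real network latency.

```python
class RemoteArchive:
    """Stand-in for a remote store; each call costs one WAN round trip."""

    def __init__(self, objects):
        self.objects = objects
        self.round_trips = 0

    def fetch(self, name):
        self.round_trips += 1
        return self.objects[name]

    def fetch_many(self, names):
        """Aggregate several requests into a single round trip."""
        self.round_trips += 1
        return {name: self.objects[name] for name in names}


class CachingClient:
    """Keeps retrieved objects locally so repeat reads stay off the WAN."""

    def __init__(self, archive):
        self.archive = archive
        self.cache = {}

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.archive.fetch(name)
        return self.cache[name]

    def prefetch(self, names):
        """Fetch everything not yet cached in one aggregated request."""
        missing = [name for name in names if name not in self.cache]
        if missing:
            self.cache.update(self.archive.fetch_many(missing))
```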

32
Data Grid - Scenario 7
  • NSF NPACI - Human Brain database federation
  • Provide rule-based access to multiple data
    collections
  • Support access based on domain concepts, rather
    than collection attributes
  • Define relationship of retrieved images

33
(No Transcript)
34
(No Transcript)
35
Digital Library - Scenario 8
  • National SMET Education Digital Library
  • Support for curricula
  • A curriculum is an instance of a concept space
  • Support for images used in curricula
  • Map from curriculum concepts to associated images
    managed in collections
  • Support interoperability between multiple
    existing collections

36
User Interfaces
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery, Data delivery, Discovery,
Global Ids, Security, Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
37
Data Grid - Scenario 9
  • NSF Grid Physics Network
  • Develop a Virtual Data Grid
  • Provide access to derived data products
  • Retrieve if already created
  • Create if unavailable
  • Collection-based execution environment
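
The "retrieve if already created, create if unavailable" rule resembles memoization keyed on the transformation and its inputs. The sketch below assumes hashable inputs and uses a hypothetical catalog class and a toy transformation; it is not GriPhyN software.

```python
class VirtualDataCatalog:
    """Returns a stored derived product, or derives and stores it."""

    def __init__(self):
        self._products = {}
        self.derivations_run = 0

    def get(self, transformation, *inputs):
        # Inputs must be hashable so they can form part of the lookup key.
        key = (transformation.__name__, inputs)
        if key not in self._products:
            self._products[key] = transformation(*inputs)
            self.derivations_run += 1
        return self._products[key]


def calibrate(raw_value, gain):
    """Toy transformation standing in for an expensive analysis step."""
    return raw_value * gain
```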

38
GriPhyN
  • Collection-based execution environment
  • Save context associated with the execution of an
    application
  • Application input - parameters, files
  • Execution environment - resources, operating
    systems
  • Output results - files, visualizations
  • Knowledge - (semantic, procedural, structural,
    algorithmic relationships)
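
One possible shape for a saved execution context is a record that is filled in as the application runs. The field names below are assumptions chosen to mirror the bullets above; the host and operating system are read from Python's standard platform module.

```python
import platform
from dataclasses import dataclass, field


@dataclass
class ExecutionContext:
    """Context saved alongside the results of one application run."""

    application: str
    parameters: dict
    input_files: list
    host: str = field(default_factory=platform.node)
    operating_system: str = field(default_factory=platform.system)
    output_files: list = field(default_factory=list)
    result: object = None


def run_with_context(application, func, parameters, input_files=()):
    """Run func with the given parameters and return the recorded context."""
    context = ExecutionContext(application, dict(parameters), list(input_files))
    context.result = func(**parameters)
    return context


def scale(value, factor):
    """Toy application used to exercise the recorder."""
    return value * factor
```

Storing such records in a collection lets a later user find a run by its parameters or environment, just as data sets are found by their attributes.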

39
Collection Attributes
  • Capturing process attributes
  • Fixed information
  • Host machine environments
  • Application
  • Input parameters
  • Default values
  • Data Structures
  • Run-time environment
  • Host machines used
  • Application execution instance
  • Input parameters
  • Data Structures
  • Data set usage
  • Data access patterns

40
Persistent Archive - Scenario 10
  • Collection-based Persistent Archives
  • Use a collection to provide the context for
    describing archived digital objects
  • Requires the ability to migrate a collection to
    new technology
  • Rule-based Persistent Archives
  • Define the rules that support tagging of the
    information content of a collection
  • Express the implied knowledge inherent within a
    collection
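
A rule-based tagging step might look like the sketch below, where each rule pairs an attribute name with a pattern that extracts its value. The rules and the sample document layout are hypothetical; the point is that the rules are plain data that can be archived with the collection and reapplied after a technology migration.

```python
import re

# Each rule: (attribute name, pattern whose first group is the value).
RULES = [
    ("sender", re.compile(r"^From:\s*(.+)$", re.MULTILINE)),
    ("date", re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE)),
]


def tag(document, rules=RULES):
    """Apply every rule and return the attributes that were found."""
    attributes = {}
    for name, pattern in rules:
        match = pattern.search(document)
        if match:
            attributes[name] = match.group(1).strip()
    return attributes
```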

41
(No Transcript)
42
Ingestion Processes for Collection Creation
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection
43
Closure
  • Digital libraries are starting to recognize the
    need to federate existing repositories
  • Grid interoperability mechanisms
  • Grids are starting to use collections to manage
    execution context
  • Information management mechanisms
  • Persistent Archives are using Grid
    interoperability mechanisms to manage evolution,
    and Digital Library information management to
    instantiate collections
  • A common architecture is emerging

44
Further Information
http://www.npaci.edu/DICE