Title: Digital Libraries, Data Grids, and Persistent Archives
1Digital Libraries, Data Grids, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2Data and Knowledge Systems Group
- Staff
- Reagan Moore
- Ilkai Altintas
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- XuFei Qian
- Roman Olshanowsky
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Graduate Students
- A. Bagchi
- S. Bansal
- A. Behere
- R. Bharath
- S. Bharath
- M. Kulrul
- L. Sui
- Undergraduate Interns
- N. Cotofana
- M. Shumaker
- J. Trang
- L. Yin
3Topics
- Application of
- Data management systems
- Information management systems
- Knowledge management systems
- to
- Distributed data collections
- Digital libraries
- Data Grids
- Persistent Archives
- by
- Defining levels of abstraction
4Information Management Projects
- Digital Libraries
- CDL - AMICO
- DARPA/USPTO - patent digital library
- NLM Visible Embryo digital library - GMU
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- NSF NPACI Digital Sky - Caltech 2MASS sky survey
- NSF NSDL - UCAR / Columbia / Cornell / UCSB
- Data Grid Environments
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NASA Information Power Grid - NASA Ames
- NIH Biomedical Informatics Research Network
- NSF Grid Physics Network - U Florida
- NSF National Virtual Observatory - Johns Hopkins University / Caltech
- NSF Southern California Earthquake Center - ISI
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Archivist workbench
5Managing Distributed Storage
- Separate the organization of digital objects from their physical storage
- Logical Name Space to manage attributes about the digital objects
- Data handling system to manage interactions with remote storage systems
- Create storage abstraction layer
- Storage Resource Broker (SRB) provides data management system
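The separation of logical organization from physical storage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SRB implementation: the `LogicalNameSpace` and `Replica` classes, their methods, and every resource name and path below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a digital object (hypothetical structure)."""
    resource: str   # name of a storage system, e.g. an archive or a disk cache
    path: str       # location inside that storage system


@dataclass
class LogicalNameSpace:
    """Maps logical names to physical replicas, independent of where they live."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        # A logical name may point at several physical copies.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name):
        # Callers see only the logical name; the physical location is looked up here.
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_replica):
        # Moving data to new storage changes the mapping, not the logical name.
        kept = [r for r in self.entries.get(logical_name, [])
                if r.resource != old_resource]
        self.entries[logical_name] = kept + [new_replica]


ns = LogicalNameSpace()
ns.register("/collection/sky/image001", Replica("tape-archive", "/hpss/a/0001"))
ns.register("/collection/sky/image001", Replica("disk-cache", "/cache/img1"))
ns.migrate("/collection/sky/image001", "tape-archive",
           Replica("new-archive", "/vol7/0001"))
```

After the migration the logical name is unchanged, while `ns.resolve(...)` reports the new physical locations. That is the property the abstraction layer is meant to provide.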
6Information Management - Logical Name Space
- Set of attributes to describe digital entities that are registered into the logical name space
- SRB metadata - Unix file system semantics
- Provenance metadata - Dublin Core
- Resource metadata - User access control lists
- Discipline metadata - User defined attributes
- Each digital entity may have unique attributes
7Information Management
- Abstraction layer for interacting with information repositories
- Manage the schema and physical table structures of a database
- Extensible schema
- User defined attributes
- Extensible Metadata CATalog (EMCAT) manages collections
- mySRB.html interface supports dynamic collection creation
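One common way to get an extensible schema with user defined attributes is an entity-attribute-value layout, sketched here with Python's standard `sqlite3` module. This is an assumption made for illustration; the slides do not say how EMCAT lays out its tables, and the table names, function names, and sample attributes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE attribute (
        entity_id INTEGER REFERENCES entity(id),
        name TEXT,
        value TEXT
    );
""")


def register(logical_name):
    """Register a digital entity and return its catalog id."""
    cur = conn.execute("INSERT INTO entity (logical_name) VALUES (?)",
                       (logical_name,))
    return cur.lastrowid


def add_attribute(entity_id, name, value):
    """Attach any attribute to an entity; no table change is needed."""
    conn.execute("INSERT INTO attribute VALUES (?, ?, ?)",
                 (entity_id, name, value))


def find(name, value):
    """Attribute-based discovery: logical names whose attribute matches."""
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [r[0] for r in rows]


img = register("/survey/image001")
doc = register("/survey/notes.txt")
add_attribute(img, "dc:creator", "survey team")
add_attribute(doc, "dc:creator", "survey team")
add_attribute(img, "filter_band", "J")   # user defined attribute, added on the fly
```

Here `find("filter_band", "J")` locates the image even though `filter_band` was never declared in advance. The trade-off of this layout is that all values are stored as text and typed queries need extra care.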
8Knowledge Management - Discovery across Collections
- Characterization of relationships between attributes
- Semantic / logical - cross-walks
- Procedural / temporal - records management
- Structural / spatial - GIS
- Abstraction layer for knowledge repositories
- Mapping from collection attributes to discipline concepts
- Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
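A semantic cross-walk, the simplest of the relationships listed above, can be sketched as a mapping from each collection's attribute names to shared discipline concepts. The collection names, attribute names, and records below are hypothetical. Rule-based inference as done by model-based mediation is well beyond this sketch.

```python
# Hypothetical cross-walk: each collection's attribute names map to shared concepts.
CROSSWALK = {
    "collection_a": {"author": "creator", "made": "date"},
    "collection_b": {"dc:creator": "creator", "dc:date": "date"},
}


def to_concepts(collection, record):
    """Rename a record's attributes to the shared concept names."""
    mapping = CROSSWALK[collection]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def search(records, concept, value):
    """Query across collections using concept names instead of local attribute names."""
    hits = []
    for collection, record in records:
        if to_concepts(collection, record).get(concept) == value:
            hits.append(collection)
    return hits


records = [
    ("collection_a", {"author": "smith", "made": "2002"}),
    ("collection_b", {"dc:creator": "smith", "dc:date": "2001"}),
    ("collection_b", {"dc:creator": "jones", "dc:date": "2002"}),
]
```

A call such as `search(records, "creator", "smith")` finds matches in both collections, although neither collection uses the attribute name `creator` locally.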
9Presentation of Digital Objects
Application
Operating System
Storage System
Display System
Digital Object
10Technology Management
Application
Wrap Application
Operating System
Storage System
Display System
Digital Object
11Technology Management
Application
Add Operating System Call
Operating System
Storage System
Display System
Digital Object
12Technology Management
Application
Add Operating System Call
Operating System
Add Operating System Call
Storage System
Display System
Digital Object
13Technology Management
Application
Add Operating System Call
Operating System
Wrap Storage System
Wrap Display System
Storage System
Display System
Digital Object
14Technology Management
Application
Operating System
Storage System
Display System
Migrate Encoding Format
Digital Object
15Specifying levels of Abstraction
- Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
- Can we abstract
- Digital object
- Storage
16Technology Management
Application
Operating System
Storage System Abstraction
Display System Abstraction
Storage System
Display System
Digital Object Abstraction
Digital Object
17Types of Digital Entity Abstractions
- Logical representation
- What does the digital entity represent?
- What is the associated meaning?
- Physical representation
- What is the physical structure of the digital entity?
18Levels of Abstraction for Bits
- Abstraction for Digital Entity
- Logical: I-nodes
- Physical: Track / Sector
- Digital Entity: Bit Stream
- Abstraction for Repository
- Logical: File Name
- Physical: File System (NFS/AFS/NTFS)
- Repository: Disk
19Levels of Abstraction for Data
- Abstraction for Digital Entity
- Logical: Data Model (units, semantics)
- Physical: Encoding Format (syntax, structure)
- Digital Entity: Files
- Abstraction for Repository
- Logical: Name Space
- Physical: Data Handling System - SRB/MCAT
- Repository: File System, Archive
20Levels of Abstraction for Information
- Abstraction for Digital Entity
- Logical: Collection Schema
- Physical: XML Syntax
- Digital Entity: Metadata Attributes
- Abstraction for Repository
- Logical: Database Schema
- Physical: EMCAT/CWM
- Repository: Database
21Levels of Abstraction for Knowledge
- Abstraction for Digital Entity
- Logical: Relationship Schema
- Physical: ER/UML/XMI/RDF syntax
- Digital Entity: Concept Space (ontology instance)
- Abstraction for Repository
- Logical: Knowledge Repository Schema
- Physical: Model-based Mediation System
- Repository: Knowledge Repository
22Information Management Projects
- Digital Libraries
- CDL - AMICO
- DARPA/USPTO - patent digital library
- NLM Visible Embryo digital library - GMU
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- NSF NPACI Digital Sky - Caltech 2MASS sky survey
- NSF NSDL - UCAR / Columbia / Cornell / UCSB
- Data Grids
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NASA Information Power Grid - NASA Ames
- NIH Biomedical Informatics Research Network
- NSF Grid Physics Network - U Florida
- NSF National Virtual Observatory - Johns Hopkins University / Caltech
- NSF Southern California Earthquake Center - ISI
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Archivist workbench
23Evolution of Data Management
- Collection-managed data
- Use database to organize attributes about data objects
- Separate information management from data storage
- Support APIs for information discovery, data access
- Diagram labels: Database A, Storage, Storage Resource Broker
- Integration accomplished through a data handling system which characterizes the storage systems
24SDSC Storage Resource Broker Meta-data Catalog
25Evolution of Data Management
- Distributed Data Collection
- Same name space
- Same schema
- Separate administration domains
- Heterogeneous database instances
- Diagram labels: Database A, Database B, Storage Resource Broker
- Integration requires the ability to characterize both the schemas and the table structures of each information repository
26Distributed Data Collection
- Logical organization of distributed digital objects into a collection
- Access through federated servers
- Collection-owned data implies the server at each storage repository runs under a collection user-ID
- Collection attributes define a global namespace
- Self-consistent attribute update on all data accesses
- Support for multiple access APIs
- Extensible support for access to any type of storage system (archive, file system, database)
- Extensible collection attributes
27Interoperability across Data and Information Repositories
- Define a representation for storage that is independent of the implementation of the storage system
- Unix file system semantics - Open/Close/Read/Write/Seek
- Define a representation of a collection that is independent of the choice of database
- schema, table structures
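A storage representation built on Unix file semantics can be sketched as a small driver interface: every backend hands out a handle that supports read, write, seek, and close, so client code never depends on the storage implementation. The class names `StorageDriver`, `MemoryDriver`, and `LocalFileDriver` are hypothetical, and the in-memory driver merely stands in for a remote archive.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical driver interface: one subclass per kind of storage system."""

    @abstractmethod
    def open(self, name):
        """Return a binary handle supporting read/write/seek/close."""


class _MemoryHandle(io.BytesIO):
    def __init__(self, store, name, initial=b""):
        super().__init__(initial)
        self._store, self._name = store, name

    def close(self):
        # Persist the bytes back to the "storage system" before closing.
        if not self.closed:
            self._store[self._name] = self.getvalue()
        super().close()


class MemoryDriver(StorageDriver):
    """Stand-in for a remote archive; keeps objects in a dict."""

    def __init__(self):
        self.objects = {}

    def open(self, name):
        return _MemoryHandle(self.objects, name, self.objects.get(name, b""))


class LocalFileDriver(StorageDriver):
    """Plain files under a root directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name):
        path = os.path.join(self.root, name)
        return open(path, "r+b" if os.path.exists(path) else "w+b")


def copy_object(src, dst, name):
    """Client code uses only open/read/write/seek/close, whatever the backend."""
    with src.open(name) as fin, dst.open(name) as fout:
        fin.seek(0)
        fout.write(fin.read())


mem = MemoryDriver()
with mem.open("obj") as h:
    h.write(b"bits")

with tempfile.TemporaryDirectory() as root:
    disk = LocalFileDriver(root)
    copy_object(mem, disk, "obj")
    with disk.open("obj") as h:
        copied = h.read()
```

Adding support for another kind of storage system then means writing one more driver; `copy_object` and any other client code stay unchanged.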
28Visible Embryo Project
Figure: network diagram of the Visible Embryo Project. Labels on the diagram: AFIP Collab WS, Image Generation, OHSU, Eolas, GST, ATD Net, NIC, UIC Startap, ASX200, BEN, MSWS, NT WS, Oakland, HSCC, WRL, 100 Gbit, Vegas, JHU, Los Angeles, GMU, DC POP, SDSC Archive, and several Disk Cache nodes. Link labels: DS3, OC-3, VBNS OC-12, Abilene OC-3.
29Data Grids
- Data Grid - linking multiple data collections
- Separate name spaces
- Separate schema
- Separate administration domains
- Heterogeneous database instances
- Diagram labels: Database A, Database B, Data grid
- The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
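As a rough sketch of a data grid acting as a collection of collections, the grid below keeps its own name space on top of two collections with separate name spaces, and caches fetched objects as one simple way to hide latency. The `Collection` and `DataGrid` classes and all names are hypothetical; semantic mediation between the collections is not modeled.

```python
class Collection:
    """A data collection with its own name space (hypothetical stand-in)."""

    def __init__(self, objects):
        self.objects = objects
        self.fetches = 0          # counts simulated remote accesses

    def get(self, local_name):
        self.fetches += 1
        return self.objects[local_name]


class DataGrid:
    """Links collections under a grid-level name space, with a cache."""

    def __init__(self):
        self.names = {}   # grid name -> (collection, local name)
        self.cache = {}   # grid name -> bytes already fetched

    def link(self, grid_name, collection, local_name):
        self.names[grid_name] = (collection, local_name)

    def get(self, grid_name):
        # Fetch from the owning collection only on a cache miss.
        if grid_name not in self.cache:
            collection, local_name = self.names[grid_name]
            self.cache[grid_name] = collection.get(local_name)
        return self.cache[grid_name]


a = Collection({"img/001": b"A-bits"})
b = Collection({"OBJ_7": b"B-bits"})
grid = DataGrid()
grid.link("/grid/sky/001", a, "img/001")
grid.link("/grid/sky/002", b, "OBJ_7")
first = grid.get("/grid/sky/002")
second = grid.get("/grid/sky/002")   # served from the cache
```

The two collections keep their own naming conventions (`img/001` versus `OBJ_7`), and the second request never reaches collection `b`.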
30National Virtual Observatory Data Grid
Figure: layered architecture of the National Virtual Observatory data grid, with seven numbered layers. Layer labels, top to bottom: Portals and Workbenches; Knowledge Resource Management; Standard APIs and Protocols; Grid Security, Caching, Replication, Backup, Scheduling; Standard Metadata format, Data model, Wire format; Catalog/Image Specific Access; Compute Resources, Catalogs, Data Archives, Derived Collections. Component labels: Bulk Data Analysis, Catalog Analysis, Metadata View, Data View, Concept space, Information Discovery, Metadata delivery, Data Discovery, Data Delivery, Catalog Mediator, Data mediator.
31Federated Digital Libraries
- Virtual Data Grid - linking multiple data collections
- Ability to execute processes to recreate derived data
- Diagram labels: Database A Services, Database B Services, Virtual Data Grid
- The virtual data grid integrates data grid and digital library technology to manage processes
32User Interfaces
NSDL
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery Data delivery Query Global
Ids Security Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
33Persistent Archive
- Persistent archive
- Describe archived data as collections
- Describe processes used to create collections
- Manage evolution of technology
- Diagram labels: Database A (today), Database A (tomorrow), Virtual Data Grid
- The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology
34Persistent Archives
- Storage system abstraction
- Logical name space and data manipulations
- Information repository abstraction
- Logical schema and physical table structure
- Knowledge repository abstraction
- Topic maps and inference rules
- Digital object abstraction
- Data model and encoding format
35Persistent Collection
- Define context for archiving data - annotate information content
- Create archivable form - standard encoding format
- Archive information content along with data
- Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
- Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information
- Differentiate between inherent knowledge and anomalies / artifacts
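The closure and completeness tests can be read as two simple checks, sketched below under strong simplifying assumptions: "discoverable" objects are modeled as explicit references between objects, and relationships are modeled as tuples of attribute names. The function names and sample data are hypothetical.

```python
def closure_violations(members, references):
    """Return references that lead from a member to something outside the collection."""
    return sorted(
        (src, dst)
        for src, targets in references.items()
        if src in members
        for dst in targets
        if dst not in members
    )


def incomplete_relationships(relationships, attributes):
    """Return relationships that mention an attribute the annotations never produced."""
    return [r for r in relationships if not set(r) <= attributes]


members = {"letter-01", "letter-02", "index"}
references = {"index": ["letter-01", "letter-02", "letter-03"]}

attributes = {"author", "date", "recipient"}
relationships = [("author", "recipient"), ("date", "location")]
```

In this sample the index refers to `letter-03`, which is not a member, so the collection is not closed. The relationship involving `location` cannot be expressed with the annotated attributes, so the collection is not complete either.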
36Self-Instantiating Archive
- Archive the processes that are used to control the ingestion process
- Conversion to archivable form
- Annotation of information content
- When accessing the collection, retrieve the processes and the original digital objects
- Apply the processing steps to re-create the information content
- Query the result to discover desired digital objects
- A self-instantiating archive is a virtual data grid
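The access path of a self-instantiating archive can be sketched as follows: the archive holds the original objects plus the ordered list of ingestion steps, and the collection is rebuilt by re-applying those steps before it is queried. The step functions, the `archive` layout, and the sample memos are hypothetical. A real archive would also have to preserve the code of the processes, not just their names.

```python
def to_archivable_form(obj):
    # Conversion to a standard encoding: here, normalized text.
    return {"name": obj["name"], "text": obj["raw"].strip().lower()}


def annotate(obj):
    # Annotation of information content: derive an attribute from the data.
    return {**obj, "word_count": len(obj["text"].split())}


# Registry of archived processes, looked up by name when the archive is accessed.
PROCESSES = {"to_archivable_form": to_archivable_form, "annotate": annotate}

archive = {
    "objects": [
        {"name": "memo-1", "raw": "  Budget Approved  "},
        {"name": "memo-2", "raw": "Meeting moved to next week"},
    ],
    "ingestion_steps": ["to_archivable_form", "annotate"],
}


def instantiate(archive):
    """Re-create the collection by re-applying the archived ingestion steps."""
    collection = []
    for obj in archive["objects"]:
        for step in archive["ingestion_steps"]:
            obj = PROCESSES[step](obj)
        collection.append(obj)
    return collection


def query(collection, predicate):
    """Discover objects in the re-created collection."""
    return [obj["name"] for obj in collection if predicate(obj)]


collection = instantiate(archive)
long_memos = query(collection, lambda o: o["word_count"] > 2)
```

The original objects in `archive["objects"]` are never modified; the derived attributes exist only in the re-created collection.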
37ERA Concept model
38Data Management Systems
- Distributed data collections
- Single name space
- Distributed data storage systems
- Data Grid - integration of multiple data collections
- Each collection has a separate name space
- Infrastructure that interconnects the collections can use its own name space, containers, replication
- Virtual Data Grids - federation of digital libraries
- In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
- Persistent archive
- In addition, manage evolution of technology components
39Differentiating between Data, Information, and Knowledge
- Data
- Digital object
- Objects are streams of bits
- Information
- Any tagged data, which is treated as an attribute
- Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
- Knowledge
- Relationships between attributes
- Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
40Knowledge Management
- Must manage semantic relationships between the multiple name spaces - Data Grid
- Must manage procedural relationships between digital library services - Federated digital library
- Must manage structural relationships between different archivable forms (encoding formats) - Persistent archive
41Types of Knowledge Relationships
- Logical / semantic
- Digital Library cross-walks
- Temporal / procedural
- Workflow systems
- Spatial / structural
- GIS systems
- Functional / algorithmic
- Scientific feature analysis
42Knowledge Based Data Grids
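Figure: matrix of components for knowledge based data grids. Column labels: Ingest Services, Management, Access Services. Row labels: Knowledge, Information, Data.
- Knowledge row labels: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
- Information row labels: Attributes, Semantics; Information Repository; Attribute-based Query
- Data row labels: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query
- Interface labels between rows: XTM DTD; Rules - KQL; (Model-based Access); XML DTD; SDLIP; (Data Handling System - SRB); MCAT/HDF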
43Further Information
http://www.npaci.edu/DICE