Title: Digital Libraries, Data Grids, and Persistent Archives
1Digital Libraries, Data Grids, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2Data and Knowledge Systems Group
- Staff
- Reagan Moore
- Ilkai Altintas
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- XuFei Qian
- Roman Olshanowsky
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Graduate Students
- A. Bagchi
- S. Bansal
- A. Behere
- R. Bharath
- S. Bharath
- M. Kulrul
- L. Sui
- Undergraduate Interns
- N. Cotofana
- M. Shumaker
- J. Trang
- L. Yin
3Accessing Data
- How do you access storage systems at remote sites in someone else's administration domain?
- How do you organize distributed data into a cohesive collection with global, persistent identifiers?
4Topics
- Application of
- Data management systems
- Information management systems
- Knowledge management systems
- to
- Distributed data collections
- Digital libraries
- Data Grids
- Persistent Archives
- by
- Defining levels of abstraction
5Information Management Projects
- Digital Libraries
- CDL - AMICO
- DARPA/USPTO - patent digital library
- NLM Visible Embryo digital library - GMU
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- NSF NPACI Digital Sky - Caltech 2MASS sky survey
- NSF NSDL - UCAR / Columbia / Cornell / UCSB
- Data Grid Environments
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NASA Information Power Grid - NASA Ames
- NIH Biomedical Informatics Research Network
- NSF Grid Physics Network - U Florida
- NSF National Virtual Observatory - Johns Hopkins University / Caltech
- NSF Southern California Earthquake Center - ISI
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Archivist workbench
6Managing Persistence
- What are digital objects?
- Digital objects require infrastructure support
- Application / Operating system / storage system / display system
- Where do you manage the definition of the digital object?
- Emulation - manage at the application level
- Migration - manage at the digital object format level
7Presentation of Digital Objects
Application
Operating System
Storage System
Display System
Digital Object
8Technology Management
Application
Wrap Application
Operating System
Storage System
Display System
Digital Object
9Technology Management
Application
Add Operating System Call
Operating System
Storage System
Display System
Digital Object
10Technology Management
Application
Add Operating System Call
Operating System
Add Operating System Call
Storage System
Display System
Digital Object
11Technology Management
Application
Add Operating System Call
Operating System
Wrap Storage System
Wrap Display System
Storage System
Display System
Digital Object
12Technology Management
Application
Operating System
Storage System
Display System
Migrate Encoding Format
Digital Object
13Specifying levels of Abstraction
- Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
- Can we abstract
- Digital object
- Storage
14Technology Management
Application
Operating System
Storage System Abstraction
Display System Abstraction
Storage System
Display System
Digital Object Abstraction
Digital Object
15Types of Digital Entity Abstractions
- Logical representation
- What does the digital entity represent?
- What is the associated meaning?
- Physical representation
- What is the physical structure of the digital
entity?
16Levels of Abstraction for Bits
Abstraction for Digital Entity
- Logical: I-nodes
- Physical: Track / Sector
- Digital Entity: Bit Stream
Abstraction for Repository
- Logical: File Name
- Physical: File System (NFS/AFS/NTFS)
- Repository: Disk
17Managing Distributed Storage
- Separate the organization of digital objects from their physical storage
- Logical Name Space to manage attributes about the digital objects
- Data handling system to manage interactions with remote storage systems
- Create storage abstraction layer
- Storage Resource Broker (SRB) provides data management system
18Levels of Abstraction for Data
Abstraction for Digital Entity
- Logical: Data Model (units, semantics)
- Physical: Encoding Format (syntax, structure)
- Digital Entity: Files
Abstraction for Repository
- Logical: Name Space
- Physical: Data Handling System - SRB/MCAT
- Repository: File System, Archive
19Visible Embryo Project
[Figure: network diagram. Recoverable labels: sites AFIP Collab WS, OHSU, Eolas, UIC Startap, Oakland, Los Angeles, JHU, GMU, DC POP, SDSC Archive; disk caches at several sites; image generation; links ATD Net, VBNS OC-12, Abilene OC-3, OC-3, DS3]
20Disaster Response
- Support replicas - provide multiple copies of a data set stored at multiple sites, but accessed by the same logical file name
- On access, map from logical file name to the physical file name. If the file is not accessible, automatically fail over to a replica.
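The failover behavior described above can be sketched as follows. This is a minimal illustration, not the SRB implementation: the `ReplicaCatalog` class, the `opener` callback, and the site and path names are all hypothetical.

```python
class ReplicaCatalog:
    """Maps one logical file name to an ordered list of physical replicas."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, physical_path):
        # Several physical copies may share the same logical name.
        self._replicas.setdefault(logical_name, []).append(physical_path)

    def open(self, logical_name, opener):
        """Try each replica in order and return the first one that opens."""
        errors = []
        for path in self._replicas.get(logical_name, []):
            try:
                return opener(path)
            except OSError as exc:
                errors.append((path, str(exc)))  # remember why, then fail over
        raise FileNotFoundError(f"no accessible replica for {logical_name!r}: {errors}")


def make_opener(available):
    """Hypothetical stand-in for remote access: only paths in `available` respond."""
    def opener(path):
        if path not in available:
            raise OSError(f"{path} unreachable")
        return available[path]
    return opener


catalog = ReplicaCatalog()
catalog.register("/collection/slice42", "site-a:/archive/slice42")  # primary copy
catalog.register("/collection/slice42", "site-b:/cache/slice42")    # replica

# Only the replica at site-b is reachable in this example.
online = {"site-b:/cache/slice42": b"image bytes"}
data = catalog.open("/collection/slice42", make_opener(online))
```

The caller only ever uses the logical name; which physical copy answered is hidden behind the catalog.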
21SDSC Storage Resource Broker Meta-data Catalog
Application
Linux I/O
DLL / Python
Java, NT Browsers
Prolog Predicate
Web
Clients
Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
22Information Management - Logical Name Space
- Set of attributes to describe digital entities that are registered into the logical name space
- SRB metadata - Unix file system semantics
- Provenance metadata - Dublin Core
- Resource metadata - User access control lists
- Discipline metadata - User defined attributes
- Each digital entity may have unique attributes
23Information Management
- Abstraction layer for interacting with information repositories
- Manage the schema and physical table structures of a database
- Extensible schema
- User defined attributes
- Extensible Metadata CATalog (EMCAT) manages collections
- mySRB.html interface supports dynamic collection creation
24Levels of Abstraction for Information
Abstraction for Digital Entity
- Logical: Collection Schema
- Physical: XML Syntax
- Digital Entity: Metadata Attributes
Abstraction for Repository
- Logical: Database Schema
- Physical: EMCAT/CWM
- Repository: Database
25National Virtual Observatory Data Grid
[Figure: layered architecture diagram, layers numbered 1-7. Recoverable labels: Portals and Workbenches; Knowledge Resource Management; Bulk Data Analysis, Metadata View, Data View, Catalog Analysis; Standard APIs and Protocols; Concept space; Grid Security, Caching, Replication, Backup, Scheduling; Information Discovery, Metadata delivery, Data Discovery, Data Delivery; Standard Metadata format, Data model, Wire format; Catalog Mediator, Data mediator; Catalog/Image Specific Access; Compute Resources, Catalogs, Data Archives, Derived Collections]
26Knowledge Management - Discovery across Collections
- Characterization of relationships between attributes
- Semantic / logical - cross-walks
- Procedural / temporal - records management
- Structural / spatial - GIS
- Abstraction layer for knowledge repositories
- Mapping from collection attributes to discipline concepts
- Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
27Levels of Abstraction for Knowledge
Abstraction for Digital Entity
- Logical: Relationship Schema
- Physical: ER/UML/XMI/RDF syntax
- Digital Entity: Concept Space (ontology instance)
Abstraction for Repository
- Logical: Knowledge Repository Schema
- Physical: Model-based Mediation System
- Repository: Knowledge Repository
28Information Management Projects
- Digital Libraries
- CDL - AMICO
- DARPA/USPTO - patent digital library
- NLM Visible Embryo digital library - GMU
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- NSF NPACI Digital Sky - Caltech 2MASS sky survey
- NSF NSDL - UCAR / Columbia / Cornell / UCSB
- Data Grids
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NASA Information Power Grid - NASA Ames
- NIH Biomedical Informatics Research Network
- NSF Grid Physics Network - U Florida
- NSF National Virtual Observatory - Johns Hopkins University / Caltech
- NSF Southern California Earthquake Center - ISI
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Archivist workbench
29Evolution of Data Management
Collection-managed data
- Use database to organize attributes about data objects
- Separate information management from data storage
- Support APIs for information discovery, data access
Database A
Storage
Storage Resource Broker
Integration accomplished through a data handling
system which characterizes the storage systems
30Evolution of Data Management
Distributed Data Collection
- Same name space
- Same schema
- Separate administration domains
- Heterogeneous database instances
Database A
Database B
Storage Resource Broker
Integration requires the ability to characterize
both the schemas and the table structures of
each information repository
31SDSC Storage Resource Broker Meta-data Catalog
32Distributed Data Collection
- Logical organization of distributed digital objects into a collection
- Access through federated servers
- Collection-owned data, which implies the server at each storage repository runs under a collection user-ID
- Collection attributes define a global namespace
- Self-consistent attribute update on all data accesses
- Support for multiple access APIs
- Extensible support for access to any type of storage system (archive, file system, database)
- Extensible collection attributes
33Interoperability across Data and Information Repositories
- Define a representation for storage that is independent of the implementation of the storage system
- Unix file system semantics - Open/Close/Read/Write/Seek
- Define a representation of a collection that is independent of the choice of database
- Schema, table structures
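A storage representation that is independent of the underlying system can be sketched as a small driver interface. This is an illustrative sketch only: `StorageDriver`, `LocalFileDriver`, and `MemoryDriver` are hypothetical names, and the in-memory driver merely stands in for an archive or database backend.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical driver interface: every backend answers the same open call."""

    @abstractmethod
    def open(self, name, mode):
        """Return a binary handle supporting read/write/seek/close."""


class LocalFileDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def open(self, name, mode):
        return open(os.path.join(self.root, name), mode + "b")


class MemoryDriver(StorageDriver):
    """Stand-in for a non-file backend such as an archive or database."""

    def __init__(self):
        self.blobs = {}

    def open(self, name, mode):
        driver = self

        class _Handle(io.BytesIO):
            def close(self):
                driver.blobs[name] = self.getvalue()  # persist contents on close
                super().close()

        initial = b"" if mode == "w" else self.blobs[name]
        return _Handle(initial)


def roundtrip(driver, name, payload):
    """Client code that uses only open/write/close/seek/read."""
    handle = driver.open(name, "w")
    handle.write(payload)
    handle.close()
    handle = driver.open(name, "r")
    handle.seek(6)
    tail = handle.read()
    handle.close()
    return tail


with tempfile.TemporaryDirectory() as root:
    local_tail = roundtrip(LocalFileDriver(root), "obj.dat", b"hello world")
memory_tail = roundtrip(MemoryDriver(), "obj.dat", b"hello world")
```

The `roundtrip` function never learns which kind of storage it is talking to, which is the point of the abstraction.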
34Particle Physics Data Grid - Replication System
35Containers
- Containers are used to aggregate data sets
- Minimizes number of files seen by archives
- Improves latency of access for related files
- Containers have a maximum size
- When writing into a container, a new container is automatically started when the current container is full
- Metadata catalog manages mapping from object to container
- Containers are cached on disk after retrieval from archive
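The container behavior above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `ContainerStore` is a hypothetical name, containers are in-memory byte arrays, an object larger than the maximum size is given a container of its own, and the disk cache is not modeled.

```python
class ContainerStore:
    """Aggregates small objects into size-limited containers."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.containers = [bytearray()]
        self.catalog = {}  # object name -> (container index, offset, length)

    def write(self, name, data):
        current = self.containers[-1]
        # Start a new container when this object would overflow the current one.
        if current and len(current) + len(data) > self.max_size:
            self.containers.append(bytearray())
            current = self.containers[-1]
        self.catalog[name] = (len(self.containers) - 1, len(current), len(data))
        current.extend(data)

    def read(self, name):
        # The catalog maps the object back to its container and byte range.
        index, offset, length = self.catalog[name]
        return bytes(self.containers[index][offset:offset + length])


store = ContainerStore(max_size=10)
store.write("a", b"aaaaaa")
store.write("b", b"bbbbbb")  # does not fit next to "a", so a new container starts
store.write("c", b"ccc")     # fits in the second container after "b"
```

The archive sees two containers instead of three files, and related objects ("b" and "c") arrive together when their container is retrieved.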
36Data Grids
Data Grid - linking multiple data collections
- Separate name spaces
- Separate schema
- Separate administration domains
- Heterogeneous database instances
Database A
Database B
Data grid
The data grid is itself a collection that
provides mechanisms to hide latency and manage
semantics
37Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
- Ability to execute processes to recreate derived data
Database A Services
Database B Services
Virtual Data Grid
The virtual data grid integrates data grid and
digital library technology to manage processes
38User Interfaces
NSDL
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery Data delivery Query Global
Ids Security Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
39Persistent Archive
Persistent archive
- Describe archived data as collections
- Describe processes used to create collections
- Manage evolution of technology
Database A (today)
Database A (tomorrow)
Virtual Data Grid
The persistent archive is itself a virtual data
grid that provides mechanisms to manage
migration to new technology
40Persistent Archives
- Storage system abstraction
- Logical name space and data manipulations
- Information repository abstraction
- Logical schema and physical table structure
- Knowledge repository abstraction
- Topic maps and inference rules
- Digital object abstraction
- Data model and encoding format
41Interoperation between Grids
Application
Application
SRB Client Library
Metadata Catalog
Replica Catalog
Metadata Catalog
SRB Server
Storage System
Globus
SDSC Storage Resource Broker
42Persistent Collection
- Define context for archiving data - annotate information content
- Create archivable form - standard encoding format
- Archive information content along with data
- Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
- Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information
- Differentiate between inherent knowledge and anomalies / artifacts
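One possible reading of the closure and completeness tests is sketched below. The interpretation is an assumption: closure is checked as "no member links to a non-member", and completeness as "every relationship uses only attributes that annotation produced". The function names and sample identifiers are hypothetical.

```python
def closure_violations(members, links):
    """Objects discoverable from members (via links) that are not members."""
    missing = set()
    for obj in members:
        for target in links.get(obj, ()):
            if target not in members:
                missing.add(target)
    return sorted(missing)


def completeness_gaps(relationships, attributes):
    """Relationships that mention an attribute the annotation did not generate."""
    return [rel for rel in relationships if not set(rel) <= attributes]


# Hypothetical collection: three records, one of which points outside.
members = {"memo-1", "memo-2", "reply-1"}
links = {"memo-1": ["reply-1"], "reply-1": ["memo-1", "attachment-9"]}
dangling = closure_violations(members, links)

# Hypothetical attributes generated from annotation, and relationships to check.
attributes = {"author", "date", "subject"}
relationships = [("author", "date"), ("subject", "recipient")]
gaps = completeness_gaps(relationships, attributes)
```

An empty result from both checks would indicate that the collection is closed and complete under this reading.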
43Self-Instantiating Archive
- Archive the processes that are used to control the ingestion process
- Conversion to archivable form
- Annotation of information content
- When accessing the collection, retrieve the processes and the original digital objects
- Apply the processing steps to re-create the information content
- Query the result to discover desired digital objects
- A self-instantiating archive is a virtual data grid
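The retrieve, re-apply, and query cycle can be sketched as below. This is a toy illustration: the step names, the `STEPS` registry, and the sample objects are hypothetical, and a real archive would have to preserve a description of each process itself rather than only a name that refers to live code.

```python
import json

# Hypothetical registry of ingestion steps, keyed by the name stored in the archive.
STEPS = {
    "to_archivable_form": lambda obj: {**obj, "body": obj["body"].strip().lower()},
    "annotate": lambda obj: {**obj, "words": len(obj["body"].split())},
}


def archive(objects, step_names):
    """Store the original objects together with the processes applied at ingestion."""
    return json.dumps({"processes": step_names, "objects": objects})


def instantiate(archived):
    """Retrieve processes and originals, then re-create the information content."""
    record = json.loads(archived)
    collection = []
    for obj in record["objects"]:
        for step in record["processes"]:
            obj = STEPS[step](obj)
        collection.append(obj)
    return collection


stored = archive(
    [{"id": "a", "body": "  Persistent Archives  "}, {"id": "b", "body": "SRB"}],
    ["to_archivable_form", "annotate"],
)
collection = instantiate(stored)

# Query the re-created attributes to discover the desired objects.
matches = [obj["id"] for obj in collection if obj["words"] > 1]
```

Only the originals and the process list are stored; the attributes used by the query are regenerated on access.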
44ERA Concept model
45Ingestion Processes for Collection Creation
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection
46Data Management Systems
- Distributed data collections
- Single name space
- Distributed data storage systems
- Data Grid - integration of multiple data collections
- Each collection has a separate name space
- Infrastructure that interconnects the collections can use its own name space, containers, replication
- Virtual Data Grids - federation of digital libraries
- In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
- Persistent archive
- In addition, manage evolution of technology components
47Differentiating between Data, Information, and Knowledge
- Data
- Digital object
- Objects are streams of bits
- Information
- Any tagged data, which is treated as an attribute
- Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
- Knowledge
- Relationships between attributes
- Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
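The three levels can be made concrete with a small sketch. The type names (`Kind`, `Relationship`) and the sample attributes are hypothetical and chosen only to mirror the definitions above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The four relationship types named on the slide."""
    PROCEDURAL = "procedural/temporal"
    STRUCTURAL = "structural/spatial"
    LOGICAL = "logical/semantic"
    FUNCTIONAL = "functional"


@dataclass(frozen=True)
class Relationship:
    subject: str   # attribute name
    kind: Kind
    target: str    # attribute name


# Data: a digital object is a stream of bits.
data = b"\x00\x01\x02\x03"

# Information: tagged data treated as attributes of the object.
information = {"format": "raw", "length": len(data), "created": "step-1"}

# Knowledge: relationships between attributes.
knowledge = [
    Relationship("length", Kind.STRUCTURAL, "format"),
    Relationship("created", Kind.PROCEDURAL, "format"),
]


def related_to(attribute):
    """Attributes that have a recorded relationship to the given attribute."""
    return sorted(rel.subject for rel in knowledge if rel.target == attribute)


related = related_to("format")
```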
48Knowledge Management
- Must manage semantic relationships between the multiple name spaces
- Data Grid
- Must manage procedural relationships between digital library services
- Federated digital library
- Must manage structural relationships between different archivable forms - encoding formats
- Persistent archive
50Integrating Across Data Sources
- The idea
- Different models capture correlated but distinct aspects of biological reality
- How can we express and evaluate queries that compute data across models?
- Our approach
- For each source create a knowledge-base of the anatomy of the observations
- Attach the data of each source to the respective knowledge-base
- Bridge the sources by a simple ontological mapping
- Compute the query across bridged sources
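The bridging step can be sketched with plain dictionaries. This is a deliberately reduced illustration: the per-source knowledge-bases are collapsed into lookup tables, and the source names, local terms, concepts, and values are all invented placeholders.

```python
# Hypothetical sources: each keys its measurements by its own local term.
sources = {
    "lab_a": {"region_x": {"measure_a": 1}},
    "lab_b": {"rx": {"measure_b": 2}, "ry": {"measure_b": 3}},
}

# Simple ontological mapping: (source, local term) -> shared concept.
bridge = {
    ("lab_a", "region_x"): "Region X",
    ("lab_b", "rx"): "Region X",
    ("lab_b", "ry"): "Region Y",
}


def query(concept):
    """Collect the data every source attaches to the shared concept."""
    merged = {}
    for (source, term), shared in bridge.items():
        if shared == concept:
            for name, value in sources[source][term].items():
                merged[f"{source}.{name}"] = value
    return merged


result = query("Region X")
```

Neither source needs to know the other's vocabulary; the query is phrased against the shared concept and the mapping does the translation.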
52Types of Knowledge Relationships
- Logical / semantic - Domain maps
- Digital Library cross-walks
- Temporal / procedural - Process maps
- Workflow systems
- Spatial / structural - Spatial maps
- GIS systems
- Functional / algorithmic - Systemic maps
- Scientific feature analysis
53Knowledge Based Data Grids
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
XTM DTD
Rules - KQL
(Model-based Access)
XML DTD
Information Repository
Attribute-based Query
Attributes Semantics
SDLIP
Information
(Data Handling System - SRB)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
54mySRB
- Web-based Access to the SRB
- Secure HTTP
- Uses Cookies for Session Control
- Self Registration of Users Supported
- Currently limited to SDSC users
- Self Registration of Resources (soon)
- Access to Both Data and Metadata
55mySRB Features
- Data File Management
- Collection Creation and Management
- Collection of Varied Objects
- Files, SQL Objects, Databases, URLs, directories, archives, ...
- Metadata Handling
- Browsing and Querying Interface
- Access Control
- Version Control (soon)
- Support proxy (remote) operations
56Data Management
- Browse in Hierarchical Collections
- Registration of
- (remote) Legacy Files and Directories
- Registration of SQL Objects and URLs
- Data Movement Operations
- Ingest, Re-Ingest, Delete, Unlink
- Replicate, Copy, Move, S-Link
- Access Control Operations
- Read, Write, Own, Curate, Annotate, ...
- Ticket-based Access
- Version Control Operations (soon)
- Read Lock, Write Lock, Unlock
- Check In and Check Out
57Types of Metadata
- System-level Metadata
- Size, resource, owner, date, access control, ...
- User-defined Metadata
- for data collections
- <name, value, unit> triples
- No limit on the number of metadata triples
- Support for Collection-level schemas
- Comments, default values, drop-down lists
- Support for Standardized Schemas
- (e.g. Dublin Core)
- Annotations
- Supports textual annotations
- Annotator, date, context also registered
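The metadata types listed above can be modeled roughly as follows. This is a sketch, not the MCAT schema: `Triple`, `Entity`, `apply_schema`, and all attribute names and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Triple(NamedTuple):
    """User-defined metadata as a <name, value, unit> triple."""
    name: str
    value: str
    unit: str = ""


@dataclass
class Entity:
    logical_name: str
    system: dict                                     # size, owner, date, ...
    user: list = field(default_factory=list)         # any number of triples
    annotations: list = field(default_factory=list)  # text plus who and when

    def describe(self, name, value, unit=""):
        self.user.append(Triple(name, value, unit))

    def annotate(self, annotator, date, text):
        self.annotations.append({"annotator": annotator, "date": date, "text": text})

    def values(self, name):
        return [(t.value, t.unit) for t in self.user if t.name == name]


def apply_schema(entity, schema):
    """Fill in collection-level default values for attributes the entity lacks."""
    present = {t.name for t in entity.user}
    for name, (default, unit) in schema.items():
        if name not in present:
            entity.describe(name, default, unit)


entity = Entity("/collection/tile1.fits", {"size": 2048, "owner": "curator"})
entity.describe("exposure", "7.8", "s")
entity.annotate("reviewer", "2002-01-01", "checked calibration")

# Hypothetical collection-level schema with default values.
apply_schema(entity, {"exposure": ("1.0", "s"), "band": ("J", "")})
```

The explicit value for `exposure` is kept, while `band` is filled in from the collection-level default.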
58Metadata Management
- Insert, Update and Delete of Metadata
- Access Control for Metadata (soon in mySRB)
- Querying across system-level, user-defined metadata and annotations
- Query under collections and across collections
- Browsing on user-defined metadata (partially developed)
- Metadata supported for legacy files and directories
- Extract Metadata (using proxy operations)
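Querying under one collection versus across all collections can be sketched with an in-memory SQLite table. The table layout, collection paths, and attribute values are hypothetical and are not the MCAT schema; note also that `%` or `_` in a collection name would act as a wildcard in this simple `LIKE` pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (collection TEXT, object TEXT, name TEXT, value TEXT, unit TEXT)"
)
rows = [
    ("/home/sky/survey1", "tile1.fits", "exposure", "7.8", "s"),
    ("/home/sky/survey1", "tile2.fits", "exposure", "1.3", "s"),
    ("/home/sky/survey2", "plate9.fits", "exposure", "7.8", "s"),
    ("/home/embryo", "slice4.tif", "stage", "12", ""),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)", rows)


def find(name, value, under="/"):
    """Return logical paths whose metadata matches, restricted to a collection subtree."""
    base = under.rstrip("/")
    cursor = conn.execute(
        "SELECT collection || '/' || object FROM metadata "
        "WHERE name = ? AND value = ? AND (collection = ? OR collection LIKE ?) "
        "ORDER BY 1",
        (name, value, base, base + "/%"),
    )
    return [row[0] for row in cursor]


across_all = find("exposure", "7.8")                         # across collections
under_one = find("exposure", "7.8", "/home/sky/survey1")     # under one collection
```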
59Further Information
http://www.npaci.edu/DICE