Digital Libraries, Data Grids, and Persistent Archives - PowerPoint PPT Presentation

1
Digital Libraries, Data Grids, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/
2
Data and Knowledge Systems Group
  • Staff
  • Reagan Moore
  • Ilkai Altintas
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • XuFei Qian
  • Roman Olshanowsky
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Graduate Students
  • A. Bagchi
  • S. Bansal
  • A. Behere
  • R. Bharath
  • S. Bharath
  • M. Kulrul
  • L. Sui
  • Undergraduate Interns
  • N. Cotofana
  • M. Shumaker
  • J. Trang
  • L. Yin
  • +/- NN

3
Accessing Data
  • How do you access storage systems at remote sites
    in someone else's administration domain?
  • How do you organize distributed data into a
    cohesive collection with global, persistent
    identifiers?

4
Topics
  • Application of
  • Data management systems
  • Information management systems
  • Knowledge management systems
  • to
  • Distributed data collections
  • Digital libraries
  • Data Grids
  • Persistent Archives
  • by
  • Defining levels of abstraction

5
Information Management Projects
  • Digital Libraries
  • CDL - AMICO
  • DARPA/USPTO - patent digital library
  • NLM Visible Embryo digital library - GMU
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • NSF NPACI Digital Sky - Caltech 2MASS sky survey
  • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grid Environments
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NASA Information Power Grid - NASA Ames
  • NIH Biomedical Informatics Research Network
  • NSF Grid Physics Network - U Florida
  • NSF National Virtual Observatory - Johns Hopkins
    University / Caltech
  • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Archivist workbench

6
Managing Persistence
  • What are digital objects?
  • Digital objects require infrastructure support
  • Application / Operating system / storage system /
    display system
  • Where do you manage the definition of the digital
    object?
  • Emulation - manage at the application level
  • Migration - manage at the digital object format
    level

7
Presentation of Digital Objects
Application
Operating System
Storage System
Display System
Digital Object
8
Technology Management
Application
Wrap Application
Operating System
Storage System
Display System
Digital Object
9
Technology Management
Application
Add Operating System Call
Operating System
Storage System
Display System
Digital Object
10
Technology Management
Application
Add Operating System Call
Operating System
Add Operating System Call
Storage System
Display System
Digital Object
11
Technology Management
Application
Add Operating System Call
Operating System
Wrap Storage System
Wrap Display System
Storage System
Display System
Digital Object
12
Technology Management
Application
Operating System
Storage System
Display System
Migrate Encoding Format
Digital Object
13
Specifying Levels of Abstraction
  • Technology management becomes simpler if the
    persistent archive infrastructure operates on
    abstractions, rather than an explicit physical
    implementation of a resource
  • Can we abstract
  • Digital object
  • Storage

14
Technology Management
Application
Operating System
Storage System Abstraction
Display System Abstraction
Storage System
Display System
Digital Object Abstraction
Digital Object
15
Types of Digital Entity Abstractions
  • Logical representation
  • What does the digital entity represent?
  • What is the associated meaning?
  • Physical representation
  • What is the physical structure of the digital
    entity?

16
Levels of Abstraction for Bits
Abstraction for Digital Entity
  • Logical: I-nodes
  • Physical: Track / Sector
Digital Entity
  • Bit Stream
Abstraction for Repository
  • Logical: File Name
  • Physical: File System (NFS/AFS/NTFS)
Repository
  • Disk
17
Managing Distributed Storage
  • Separate the organization of digital objects from
    their physical storage
  • Logical Name Space to manage attributes about the
    digital objects
  • Data handling system to manage interactions with
    remote storage systems
  • Create storage abstraction layer
  • Storage Resource Broker (SRB) provides data
    management system

18
Levels of Abstraction for Data
Abstraction for Digital Entity
  • Logical: Data Model (units, semantics)
  • Physical: Encoding Format (syntax, structure)
Digital Entity
  • Files
Abstraction for Repository
  • Logical: Name Space
  • Physical: Data Handling System - SRB/MCAT
Repository
  • File System, Archive
19
Visible Embryo Project
Disk Cache
AFIP Collab WS
Image Generation
OHSU
Eolas
GST
ATD Net
NIC
Disk Cache
UIC Startap
ASX200
BEN
MSWS
NT WS
MSWS
NT WS
Oakland
HSCC
WRL
100 Gbit
Vegas
OC-3
JHU
Disk Cache
DS3
Los Angeles
VBNS OC-12
GMU
Abilene OC-3
Disk Cache
DC POP
OC-3
Abilene OC-3
SDSC
Archive
20
Disaster Response
  • Support replicas - provide multiple copies of a
    data set stored at multiple sites, but accessed
    by the same logical file name
  • On access, map from logical file name to the
    physical file name. If the file is not
    accessible, automatically fail over to a replica.
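
The failover behavior described above can be sketched as a lookup followed by a loop over replicas. This is only an illustration: the catalog layout, the site names, and the function names are assumptions, not the SRB interface.

```python
class ReplicaUnavailable(Exception):
    """Raised when a physical copy cannot be read."""


# Hypothetical replica catalog: logical file name -> ordered physical locations.
REPLICA_CATALOG = {
    "/collection/survey/image001": [
        "siteA:/archive/img001.dat",
        "siteB:/cache/img001.dat",
    ],
}

# Hypothetical stand-in for remote storage; siteA is "down" because it has no entry.
PHYSICAL_STORE = {
    "siteB:/cache/img001.dat": b"pixels",
}


def read_physical(physical_name):
    """Read one physical copy, or raise if that copy is not accessible."""
    try:
        return PHYSICAL_STORE[physical_name]
    except KeyError:
        raise ReplicaUnavailable(physical_name)


def read_logical(logical_name):
    """Map the logical name to its replicas and return the first readable copy."""
    failures = []
    for physical_name in REPLICA_CATALOG[logical_name]:
        try:
            return read_physical(physical_name)
        except ReplicaUnavailable:
            failures.append(physical_name)  # fail over to the next replica
    raise ReplicaUnavailable(f"no accessible replica for {logical_name}: {failures}")


data = read_logical("/collection/survey/image001")
```

The caller only ever sees the logical name; which site served the bytes is a detail of the catalog.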

21
SDSC Storage Resource Broker Meta-data Catalog
Application
Linux I/O
DLL / Python
Java, NT Browsers
Prolog Predicate
Web
Clients

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
22
Information Management - Logical Name Space
  • Set of attributes to describe digital entities
    that are registered into the logical name space
  • SRB metadata - Unix file system semantics
  • Provenance metadata - Dublin Core
  • Resource metadata - User access control lists
  • Discipline metadata - User defined attributes
  • Each digital entity may have unique attributes

23
Information Management
  • Abstraction layer for interacting with
    information repositories
  • Manage the schema and physical table structures
    of a database
  • Extensible schema
  • User defined attributes
  • Extensible Metadata CATalog (EMCAT) manages
    collections
  • mySRB.html interface supports dynamic collection
    creation

24
Levels of Abstraction for Information
Abstraction for Digital Entity
  • Logical: Collection Schema
  • Physical: XML Syntax
Digital Entity
  • Metadata Attributes
Abstraction for Repository
  • Logical: Database Schema
  • Physical: EMCAT/CWM
Repository
  • Database
25
National Virtual Observatory Data Grid
1. Portals and Workbenches
2.Knowledge Resource Management
Bulk Data Analysis
Metadata View
Data View
Catalog Analysis
3.
Standard APIs and Protocols
Concept space
4. Grid Security Caching Replication Backup Scheduling
Information Discovery
Metadata delivery
Data Discovery
Data Delivery
5.
Standard Metadata format, Data model, Wire format
Catalog Mediator
6.
Data mediator
Catalog/Image Specific Access
Compute Resources
Catalogs
Data Archives
Derived Collections
7.
26
Knowledge Management - Discovery across
Collections
  • Characterization of relationships between
    attributes
  • Semantic / logical - cross-walks
  • Procedural / temporal - records management
  • Structural / spatial - GIS
  • Abstraction layer for knowledge repositories
  • Mapping from collection attributes to discipline
    concepts
  • Model-based Mediation supports mapping from
    knowledge relationships to rule-based inference
    engines
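
A semantic cross-walk of the kind listed above can be pictured as a renaming table between two attribute vocabularies. In this sketch the local attribute names are hypothetical; `title`, `creator`, and `date` are Dublin Core element names.

```python
# Hypothetical cross-walk: local collection attribute -> Dublin Core element.
CROSSWALK = {
    "object_name": "title",
    "principal_investigator": "creator",
    "observation_date": "date",
}


def to_dublin_core(record):
    """Rename attributes that have a cross-walk entry; collect the rest separately."""
    mapped, unmapped = {}, {}
    for attribute, value in record.items():
        if attribute in CROSSWALK:
            mapped[CROSSWALK[attribute]] = value
        else:
            unmapped[attribute] = value
    return mapped, unmapped


mapped, unmapped = to_dublin_core(
    {"object_name": "M31 mosaic", "observation_date": "1999-10-02", "band": "J"}
)
```

Attributes without a mapping are returned rather than dropped, so a curator can decide whether the cross-walk needs to be extended.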

27
Levels of Abstraction for Knowledge
Abstraction for Digital Entity
  • Logical: Relationship Schema
  • Physical: ER/UML/XMI/RDF syntax
Digital Entity
  • Concept Space (ontology instance)
Abstraction for Repository
  • Logical: Knowledge Repository Schema
  • Physical: Model-based Mediation System
Repository
  • Knowledge Repository
28
Information Management Projects
  • Digital Libraries
  • CDL - AMICO
  • DARPA/USPTO - patent digital library
  • NLM Visible Embryo digital library - GMU
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • NSF NPACI Digital Sky - Caltech 2MASS sky survey
  • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grids
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NASA Information Power Grid - NASA Ames
  • NIH Biomedical Informatics Research Network
  • NSF Grid Physics Network - U Florida
  • NSF National Virtual Observatory - Johns Hopkins
    University / Caltech
  • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Archivist workbench

29
Evolution of Data Management
Collection - managed data
  • Use database to organize attributes about data
    objects
  • Separate information management from data
    storage
  • Support APIs for information discovery, data
    access
Database A
Storage
Storage Resource Broker
Integration accomplished through a data handling
system which characterizes the storage systems
30
Evolution of Data Management
Distributed Data Collection
  • Same name space
  • Same schema
  • Separate administration domains
  • Heterogeneous database instances
Database A
Database B
Storage Resource Broker
Integration requires the ability to characterize
both the schemas and the table structures of
each information repository
31
SDSC Storage Resource Broker Meta-data Catalog
32
Distributed Data Collection
  • Logical organization of distributed digital
    objects into a collection
  • Access through federated servers
  • Collection-owned data, implies the server at each
    storage repository runs under a collection
    user-ID
  • Collection attributes define a global namespace
  • Self-consistent attribute update on all data
    accesses
  • Support for multiple access APIs
  • Extensible support for access to any type of
    storage system (archive, file system, database)
  • Extensible collection attributes

33
Interoperability across Data and Information
Repositories
  • Define a representation for storage that is
    independent of the implementation of the storage
    system
  • Unix file system semantics -
    Open/Close/Read/Write/Seek
  • Define a representation of a collection that is
    independent of the choice of database
  • schema, table structures
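
One way to picture a storage representation that is independent of the storage system is an interface with the five Unix-style operations, implemented once per backend. The class and method names below are illustrative assumptions, not the SRB driver API.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical interface: the same five operations for every storage system."""

    @abstractmethod
    def open(self, name): ...

    @abstractmethod
    def close(self, handle): ...

    @abstractmethod
    def read(self, handle, size): ...

    @abstractmethod
    def write(self, handle, data): ...

    @abstractmethod
    def seek(self, handle, offset): ...


class MemoryDriver(StorageDriver):
    """Toy backend that keeps objects in a dict; an archive or database driver
    would implement the same methods against its own storage."""

    def __init__(self):
        self._objects = {}   # name -> bytearray
        self._handles = {}   # handle -> [name, position]
        self._next_handle = 0

    def open(self, name):
        self._objects.setdefault(name, bytearray())
        handle = self._next_handle
        self._next_handle += 1
        self._handles[handle] = [name, 0]
        return handle

    def close(self, handle):
        del self._handles[handle]

    def seek(self, handle, offset):
        self._handles[handle][1] = offset

    def read(self, handle, size):
        name, pos = self._handles[handle]
        chunk = bytes(self._objects[name][pos:pos + size])
        self._handles[handle][1] = pos + len(chunk)
        return chunk

    def write(self, handle, data):
        name, pos = self._handles[handle]
        self._objects[name][pos:pos + len(data)] = data
        self._handles[handle][1] = pos + len(data)
        return len(data)


driver = MemoryDriver()
h = driver.open("demo")
driver.write(h, b"hello world")
driver.seek(h, 6)
tail = driver.read(h, 5)
driver.close(h)
```

Client code written against `StorageDriver` does not change when a different backend is substituted.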

34
Particle Physics Data Grid - Replication System
35
Containers
  • Containers are used to aggregate data sets
  • Minimizes number of files seen by archives
  • Improves latency of access for related files
  • Containers have a maximum size
  • When writing into a container, a new container
    is automatically started when the initial
    container is full
  • Metadata catalog manages mapping from object to
    container
  • Containers are cached on disk after retrieval
    from archive
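
The container behavior above (maximum size, automatic rollover, catalog mapping from object to container) can be sketched in a few lines. The class name, the catalog tuple layout, and the size limit are assumptions made for illustration.

```python
class ContainerStore:
    """Toy container manager; max_size and all names are illustrative assumptions."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.containers = [bytearray()]   # each container stands for one archive file
        self.catalog = {}                 # object name -> (container index, offset, length)

    def put(self, name, data):
        current = self.containers[-1]
        # Start a new container when the object would overflow a non-empty one.
        # An object larger than max_size still gets a container of its own.
        if current and len(current) + len(data) > self.max_size:
            self.containers.append(bytearray())
            current = self.containers[-1]
        self.catalog[name] = (len(self.containers) - 1, len(current), len(data))
        current.extend(data)

    def get(self, name):
        index, offset, length = self.catalog[name]
        return bytes(self.containers[index][offset:offset + length])


store = ContainerStore(max_size=10)
store.put("a", b"12345")
store.put("b", b"6789")
store.put("c", b"abc")   # does not fit in the first container
```

The archive sees two files instead of three objects, and related objects written together come back with a single retrieval.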

36
Data Grids
Data Grid - linking multiple data collections
  • Separate name spaces
  • Separate schema
  • Separate administration domains
  • Heterogeneous database instances
Database A
Database B
Data grid
The data grid is itself a collection that
provides mechanisms to hide latency and manage
semantics
37
Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
  • Ability to execute processes to recreate derived
    data
Database A Services
Database B Services
Virtual Data Grid
The virtual data grid integrates data grid and
digital library technology to manage processes
38
User Interfaces
NSDL
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery Data delivery Query Global
Ids Security Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
39
Persistent Archive
Persistent archive
  • Describe archived data as collections
  • Describe processes used to create collections
  • Manage evolution of technology
Database A (today)
Database A (tomorrow)
Virtual Data Grid
The persistent archive is itself a virtual data
grid that provides mechanisms to manage
migration to new technology
40
Persistent Archives
  • Storage system abstraction
  • Logical name space and data manipulations
  • Information repository abstraction
  • Logical schema and physical table structure
  • Knowledge repository abstraction
  • Topic maps and inference rules
  • Digital object abstraction
  • Data model and encoding format

41
Interoperation between Grids
Application
Application
SRB Client Library
Metadata Catalog
Replica Catalog
Metadata Catalog
SRB Server
Storage System
Globus
SDSC Storage Resource Broker
42
Persistent Collection
  • Define context for archiving data - annotate
    information content
  • Create archivable form - standard encoding format
  • Archive information content along with data
  • Test closure of the collection - all digital
    objects that can be discovered in the collection
    are members of the collection
  • Test completeness of the collection - inherent
    relationships within the collection can be cast
    in terms of attributes generated from the
    annotated information.
  • Differentiate between inherent knowledge and
    anomalies / artifacts
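
The closure test above can be read as a set comparison: everything an index can discover must also be a member. The index layout and the object identifiers in this sketch are hypothetical.

```python
def closure_violations(members, index):
    """Return object ids that the attribute index can discover but that are not members."""
    discoverable = set()
    for object_ids in index.values():
        discoverable.update(object_ids)
    return discoverable - set(members)


members = {"doc1", "doc2"}
index = {
    ("author", "Smith"): {"doc1"},
    ("year", "1999"): {"doc2", "doc3"},   # doc3 is referenced but was never ingested
}
missing = closure_violations(members, index)
```

An empty result means the collection is closed under discovery; a non-empty result lists the objects that must be ingested or removed from the index.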

43
Self-Instantiating Archive
  • Archive the processes that are used to control
    the ingestion process
  • Conversion to archivable form
  • Annotation of information content
  • When accessing the collection, retrieve the
    processes and the original digital objects
  • Apply the processing steps to re-create the
    information content
  • Query the result to discover desired digital
    objects
  • A self-instantiating archive is a virtual data
    grid
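
The steps above (archive the processes, retrieve them with the originals, re-apply them, then query) can be sketched as follows. The step registry, the record format, and all names are assumptions chosen to keep the example small.

```python
import json

# Hypothetical registry of processing steps; the archive stores step names, not code.
STEPS = {
    "decode_utf8": lambda raw: raw.decode("utf-8"),
    "tag_fields": lambda text: dict(pair.split("=", 1) for pair in text.split(";")),
}


def archive(objects, step_names):
    """Store the original objects together with the recipe used at ingestion."""
    return {"recipe": json.dumps(step_names), "objects": dict(objects)}


def instantiate(archived):
    """Retrieve recipe and originals, then re-apply the steps to rebuild the attributes."""
    step_names = json.loads(archived["recipe"])
    collection = {}
    for name, raw in archived["objects"].items():
        value = raw
        for step in step_names:
            value = STEPS[step](value)
        collection[name] = value
    return collection


def query(collection, attribute, wanted):
    """Discover objects whose re-created attribute matches the wanted value."""
    return sorted(n for n, attrs in collection.items() if attrs.get(attribute) == wanted)


stored = archive(
    {"rec1": b"agency=A;year=2001", "rec2": b"agency=B;year=2001"},
    ["decode_utf8", "tag_fields"],
)
hits = query(instantiate(stored), "agency", "A")
```

Only the raw objects and the recipe are kept; the queryable attributes are derived again on each instantiation, which is what lets the recipe be re-targeted at newer technology.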

44
ERA Concept model
45
Ingestion Processes for Collection Creation
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection
46
Data Management Systems
  • Distributed data collections
  • Single name space
  • Distributed data storage systems
  • Data Grid - integration of multiple data
    collections
  • Each collection has a separate name space
  • Infrastructure that interconnects the collections
    can use its own name space, containers,
    replication
  • Virtual Data Grids - federation of digital
    libraries
  • In addition, support interoperability between
    services for manipulation, presentation,
    discovery of digital objects
  • Persistent archive
  • In addition, manage evolution of technology
    components

47
Differentiating between Data, Information, and
Knowledge
  • Data
  • Digital object
  • Objects are streams of bits
  • Information
  • Any tagged data, which is treated as an
    attribute.
  • Attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal,
    structural/spatial, logical/semantic, functional
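
The three levels above can be mirrored in a small data structure: bits for data, tagged attributes for information, and a typed relationship between attributes for knowledge. The class names and the example attributes are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    bits: bytes                                      # data: a stream of bits
    attributes: dict = field(default_factory=dict)   # information: tagged data


@dataclass(frozen=True)
class Relationship:
    kind: str      # e.g. "procedural/temporal", "structural/spatial"
    subject: str   # name of the first attribute
    target: str    # name of the second attribute


def holds(relationship, digital_object):
    """Check one temporal relationship: subject must not be later than target."""
    first = digital_object.attributes[relationship.subject]
    second = digital_object.attributes[relationship.target]
    return first <= second   # ISO 8601 date strings sort chronologically


obj = DigitalObject(
    bits=b"\x00\x01",
    attributes={"created": "2001-05-01", "modified": "2001-06-01"},
)
rule = Relationship(kind="procedural/temporal", subject="created", target="modified")
```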

48
Knowledge Management
  • Must manage semantic relationships between the
    multiple name spaces
  • Data Grid
  • Must manage procedural relationships between
    digital library services
  • Federated digital library
  • Must manage structural relationships between
    different archivable forms - encoding formats
  • Persistent archive

50
Integrating Across Data Sources
  • The idea
  • Different models capture correlated but distinct
    aspects of biological reality
  • How can we express and evaluate queries that
    compute data across models?
  • Our approach
  • For each source create a knowledge-base of the
    anatomy of the observations
  • Attach the data of each source to the respective
    knowledge-base
  • Bridge the sources by a simple ontological
    mapping
  • Compute the query across bridged sources
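
The approach above can be sketched with two toy sources and a mapping table. The source contents, the term names, and the shared concept identifier are invented for illustration and do not come from the project's data.

```python
# Hypothetical sources: each attaches its data to its own anatomical vocabulary.
SOURCE_A = {"purkinje cell": {"protein_level": 0.8}}
SOURCE_B = {"PC": {"spine_density": 12.5}}

# Simple ontological mapping: (source, source-specific term) -> shared concept.
BRIDGE = {
    ("A", "purkinje cell"): "Purkinje_cell",
    ("B", "PC"): "Purkinje_cell",
}


def bridged_view(sources):
    """Merge observations from every source under the shared concept they map to."""
    view = {}
    for source_id, records in sources.items():
        for term, observations in records.items():
            concept = BRIDGE.get((source_id, term))
            if concept is not None:
                view.setdefault(concept, {}).update(observations)
    return view


view = bridged_view({"A": SOURCE_A, "B": SOURCE_B})
result = view["Purkinje_cell"]
```

A query phrased against the shared concept now returns measurements that neither source holds on its own.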

52
Types of Knowledge Relationships
  • Logical / semantic Domain maps
  • Digital Library cross-walks
  • Temporal / procedural Process maps
  • Workflow systems
  • Spatial / structural Spatial maps
  • GIS systems
  • Functional / algorithmic Systemic maps
  • Scientific feature analysis

53
Knowledge Based Data Grids
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
XTM DTD
Rules - KQL
(Model-based Access)
XML DTD
Information Repository
Attribute- based Query
Attributes Semantics
SDLIP
Information
(Data Handling System - SRB)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
54
mySRB
  • Web-based Access to the SRB
  • Secure HTTP
  • Uses Cookies for Session Control
  • Self Registration of Users Supported
  • Currently limited to SDSC users
  • Self Registration of Resources (soon)
  • Access to Both Data and Metadata

55
mySRB Features
  • Data File Management
  • Collection Creation and Management
  • Collection of Varied Objects
  • Files, SQL Objects, Databases, URLs, directories,
    archives,
  • Metadata Handling
  • Browsing & Querying Interface
  • Access Control
  • Version Control (soon)
  • Support proxy (remote) operations

56
Data Management
  • Browse in Hierarchical Collections
  • Registration of
  • (remote) Legacy Files & Directories
  • Registration of SQL Objects & URLs
  • Data Movement Operations
  • Ingest & Re-Ingest, Delete, Unlink
  • Replicate, Copy, Move, S-Link
  • Access Control Operations
  • Read, Write, Own, Curate, Annotate,
  • Ticket-based Access
  • Version Control Operations (soon)
  • Read Lock, Write Lock, Unlock
  • Check In & Check Out

57
Types of Metadata
  • System-level Metadata
  • Size, resource, owner, date, access control,
  • User-defined Metadata
  • for data collections
  • <name, value, unit> triples
  • No limit on the number of metadata triples
  • Support for Collection-level schemas
  • Comments, default values, drop-down lists
  • Support for Standardized Schemas
  • (e.g., Dublin Core)
  • Annotations
  • Supports textual annotations
  • Annotator, date, context also registered
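
User-defined metadata as name, value, unit triples can be sketched with a small in-memory catalog. The class, its methods, and the example attributes are assumptions for illustration and are not the MCAT schema.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["name", "value", "unit"])


class MetadataStore:
    """Toy catalog: any number of (name, value, unit) triples per object."""

    def __init__(self):
        self._triples = {}   # object name -> list of Triple

    def insert(self, obj, name, value, unit=""):
        self._triples.setdefault(obj, []).append(Triple(name, value, unit))

    def delete(self, obj, name):
        self._triples[obj] = [t for t in self._triples.get(obj, []) if t.name != name]

    def query(self, name, predicate):
        """Return objects having a triple with this name whose value satisfies predicate."""
        return sorted(
            obj
            for obj, triples in self._triples.items()
            if any(t.name == name and predicate(t.value) for t in triples)
        )


store = MetadataStore()
store.insert("image001", "exposure", 30.0, "s")
store.insert("image002", "exposure", 120.0, "s")
store.insert("image002", "filter", "J")
long_exposures = store.query("exposure", lambda v: v > 60)
```

Because each object carries its own list, two objects in the same collection need not share any attribute names.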

58
Metadata Management
  • Insert, Update and Delete of Metadata
  • Access Control for Metadata (soon in mySRB)
  • Querying across system-level, user-defined
    metadata and annotations
  • Query under collections & across collections
  • Browsing on user-defined metadata (partially
    developed)
  • Metadata supported for legacy files & directories
  • Extract Metadata (using proxy operations)

59
Further Information
http://www.npaci.edu/DICE