Title: Exploring Scalable Storage: DSpace + SRB
1. Exploring Scalable Storage: DSpace + SRB
- Chris Frymann
- University of California San Diego Libraries
- DSpace User Group Meeting
- March 10, 2004
2. Extending DSpace Storage Capabilities
- Much of the value and success of the Web is a result of its enormous size, which has been achieved through a distributed storage model.
- The current DSpace model assumes local storage.
- What if DSpace collections could be of virtually unlimited size, stored, replicated, and accessible via federated grid technologies?
3. This Presentation Will
- Report on a proposed project to extend DSpace data management capabilities through integration with the San Diego Supercomputer Center's Storage Resource Broker (SRB)
- Provide an overview of SRB
- Review how both DSpace and SRB users will benefit
4. NARA Proposal Participants
- San Diego Supercomputer Center (SDSC)
  - Member of the National Partnership for Advanced Computational Infrastructure (NPACI), an NSF-sponsored program
- MIT Libraries
- UC San Diego Libraries (UCSD)
- Hewlett-Packard Laboratories (HP)
- National Archives and Records Administration (NARA)
5. Proposal Views DSpace As
- A simple, user-friendly front end providing:
  - Digital content ingestion
  - Search and discovery
  - Content management
  - Dissemination services
  - Preservation
6. What is SRB?
- Storage Resource Broker
- Data management infrastructure
- Developed at San Diego Supercomputer Center
- Utilizes data grid and federation technologies
7. Levels of Possible Integration
- Replace DSpace file system calls with SRB access calls
- Utilize SRB metadata management capabilities
- Provide support for federation of DSpace systems
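The first integration level, replacing DSpace file-system calls with SRB access calls, amounts to routing bitstream I/O through a storage interface with interchangeable backends. A minimal sketch; the interface and class names are invented for illustration and are not DSpace's actual storage API, and the SRB backend is stubbed in memory rather than using a real SRB client:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage abstraction: DSpace code would call store/retrieve
// and never know whether bytes land on a local disk or an SRB collection.
interface BitstreamStore {
    void store(String id, byte[] data) throws IOException;
    byte[] retrieve(String id) throws IOException;
}

// Current DSpace model: local file system.
class LocalStore implements BitstreamStore {
    private final Path root;
    LocalStore(Path root) { this.root = root; }
    public void store(String id, byte[] data) throws IOException {
        Files.write(root.resolve(id), data);
    }
    public byte[] retrieve(String id) throws IOException {
        return Files.readAllBytes(root.resolve(id));
    }
}

// Proposed model: same interface, backed by SRB calls. Shown here as an
// in-memory stand-in; a real implementation would use an SRB client library.
class SrbStore implements BitstreamStore {
    private final Map<String, byte[]> grid = new HashMap<>();
    public void store(String id, byte[] data) { grid.put(id, data.clone()); }
    public byte[] retrieve(String id) { return grid.get(id).clone(); }
}
```

DSpace code written against `BitstreamStore` would not need to change when the backend switches from local disk to the grid.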
8. Storage Resource Broker: Technical Overview
9. Definition of SRB
- Middleware that allows other applications to treat a diverse collection of physical storage devices as a single logical resource
- A distributed file system (data grid) based on a client-server architecture
10. What SRB Does
- Replicates, syncs, archives, and connects heterogeneous resources in a logical manner using abstraction mechanisms
- Provides a way to access files and computers based on their attributes rather than just their names or physical locations
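Attribute-based access can be sketched as a query against a metadata catalog: clients ask for files by descriptive attributes rather than physical paths. The class names and attribute values below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Each logical file carries a bag of attributes in the metadata catalog.
class CatalogEntry {
    final String logicalName;
    final Map<String, String> attrs;
    CatalogEntry(String logicalName, Map<String, String> attrs) {
        this.logicalName = logicalName;
        this.attrs = attrs;
    }
}

class AttributeCatalog {
    private final List<CatalogEntry> entries = new ArrayList<>();
    void add(CatalogEntry e) { entries.add(e); }

    // Return every logical name whose attributes satisfy the condition,
    // regardless of where the bytes physically live.
    List<String> query(String key, String value) {
        return entries.stream()
                .filter(e -> value.equals(e.attrs.get(key)))
                .map(e -> e.logicalName)
                .collect(Collectors.toList());
    }
}
```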
11. A Sample Storage Environment
12. Commodity IDE Disk Drive
- 200 GB, $200
13. Grid Brick
- 2 Terabytes
14. Grid Brick Details
- Hardware components:
  - Intel Celeron 1.7 GHz CPU
  - SuperMicro P4SGA PCI local-bus ATX mainboard
  - 1 GB memory (266 MHz DDR DRAM)
  - 3Ware Escalade 7500-12 port PCI-bus IDE RAID
  - 10 Western Digital Caviar 200-GB IDE disk drives
  - 3Com Etherlink 3C996B-T PCI-bus 1000Base-T
  - Redstone RMC-4F2-7 4U ten-bay ATX chassis
  - Linux operating system
- Cost is $2,200 per TByte plus tax
- Gig-E network switch costs $500 per brick
- Effective cost is about $2,700 per TByte
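The cost figures above can be checked with simple arithmetic: $2,200 per TByte of hardware plus a $500 switch amortized over a 2-TByte brick gives $2,450 per TByte before tax, and adding sales tax (rate assumed below) lands near the quoted $2,700. A quick sketch:

```java
// Grid Brick cost arithmetic from the slide (2004 dollars).
class GridBrickCost {
    // $/TByte = hardware cost per TByte + switch cost spread over the brick's capacity.
    static double perTByte(double hardwarePerTb, double switchPerBrick, double brickTBytes) {
        return hardwarePerTb + switchPerBrick / brickTBytes;
    }

    public static void main(String[] args) {
        double beforeTax = perTByte(2200.0, 500.0, 2.0); // = 2450.0
        double withTax = beforeTax * 1.0775;             // sales tax rate assumed for illustration
        System.out.printf("$%.0f/TByte before tax, ~$%.0f/TByte with tax%n", beforeTax, withTax);
    }
}
```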
15. Rack of Grid Bricks: 12 TB
16. Data Grid: 50 TB
- Room of racks
17. Grid Bricks at SDSC
- Used to implement picking environments for 10-TB collections
  - Web-based access
  - Web services (WSDL/SOAP) for data subsetting
- Implemented 15 TBs of storage
  - Astronomy sky surveys, NARA prototype persistent archive, NSDL web crawls
- Must still apply Linux security patches to each Grid Brick
18. SDSC Production Data Grid
- SDSC Storage Resource Broker
- Federated client-server system, managing:
  - Over 90 TBs of data at SDSC
  - Over 16 million files
- Manages data collections stored in:
  - Archives (HPSS, UniTree, ADSM, DMF)
  - Hierarchical resource managers
  - Tapes, tape robots
  - File systems (Unix, Linux, Mac OS X, Windows)
  - FTP sites
  - Databases (Oracle, DB2, Postgres, SQLServer, Sybase, Informix)
  - Virtual Object Ring Buffers
19. SRB Collections at SDSC
20. Data Grids
- Distributed data sources
- Inter-realm authentication and authorization
- Heterogeneity
- Storage repository abstraction
- Scalability
- Differentiation between context and content management
- Preservation
- Support for automated processing (migration, archival processes)
21. Data Grid Components
- Federated client-server architecture
  - Servers can talk to each other independently of the client
- Infrastructure-independent naming
  - Logical names for users, resources, files, applications
- Collective ownership of data
  - Collection-owned data, with infrastructure-independent access control lists
- Context management
  - Record state information in a metadata catalog from data grid services such as replication
- Abstractions for dealing with heterogeneity
22. Data Grid Abstractions
- Logical name space for files
  - Global persistent identifier
- Storage repository abstraction
  - Standard operations supported on storage systems
- Information repository abstraction
  - Standard operations to manage collections in databases
- Access abstraction
  - Standard interface to support alternate APIs
- Latency management mechanisms
  - Aggregation, parallel I/O, replication, caching
- Security interoperability
  - GSSAPI, inter-realm authentication, collection-based authorization
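The logical name space and replication abstractions above reduce to one mapping: a global persistent identifier resolves to one or more physical replicas, and the grid returns any reachable copy. A toy sketch; the names, hosts, and paths are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Maps an infrastructure-independent logical name to its physical replicas.
class LogicalNameSpace {
    private final Map<String, List<String>> replicas = new HashMap<>();

    // Register one physical location (host:path) for a logical name.
    void register(String logicalName, String physicalLocation) {
        replicas.computeIfAbsent(logicalName, k -> new ArrayList<>())
                .add(physicalLocation);
    }

    // Resolve to the first replica whose host is currently reachable;
    // callers never see physical paths unless resolution succeeds.
    Optional<String> resolve(String logicalName, Set<String> liveHosts) {
        return replicas.getOrDefault(logicalName, List.of()).stream()
                .filter(loc -> liveHosts.contains(loc.split(":")[0]))
                .findFirst();
    }
}
```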
23. Data Grid Federation
- Data grids provide the ability to name, organize, and manage data on distributed storage resources.
- Federation provides a way to name, organize, and manage data across multiple data grids.
24. Distributed Data Grids: 200 TB
- SRB federation
25. SRB APIs
26. SDSC Storage Resource Broker and Metadata Catalog
[Architecture diagram: applications reach the SRB through access APIs (Linux I/O, OAI, WSDL, DLL/Python, Java, NT browsers, GridFTP); a federation layer provides consistency management and authorization/authentication; the SRB server implements the logical name space, latency management, and data and metadata transport over storage and catalog abstractions; drivers connect to databases (DB2, Oracle, Sybase, SQLServer) and hierarchical resource managers (HRM).]
27. Federated SRB Server Model
[Diagram: peer-to-peer brokering. A read application presents a logical name or attribute condition to an SRB server; the servers and spawned SRB agents consult the MCAT metadata catalog for (1) logical-to-physical mapping, (2) identification of replicas, and (3) access and audit control, then provide parallel data access to the physical replicas R1 and R2.]
28. Peer-to-Peer Federation
- Occasional Interchange: for specified users
- Replicated Catalogs: entire state information replication
- Resource Interaction: data replication
- Replicated Data Zones: no user interactions between zones
- Master-Slave Zones: slaves replicate data from the master zone
- Snow-Flake Zones: hierarchy of data replication zones
- User/Data Replica Zones: user access from remote to host zone
- Nomadic Zones: synchronize local zone to parent zone
- Free-Floating myZone: synchronize without a parent zone
- Archival Backup Zone: synchronize to an archive
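Many of the zone modes above are variations on one primitive: synchronizing catalog state from one zone to another. A minimal Master-Slave sketch; the types are invented for illustration and this is not the actual SRB zone protocol:

```java
import java.util.HashMap;
import java.util.Map;

// A zone's catalog: logical name -> checksum (a stand-in for full state).
class Zone {
    final String name;
    final Map<String, String> catalog = new HashMap<>();
    Zone(String name) { this.name = name; }
}

class Federation {
    // Master-Slave mode: the slave pulls every entry it is missing or that
    // has changed from the master; nothing flows back. Returns the number
    // of entries copied.
    static int syncMasterToSlave(Zone master, Zone slave) {
        int copied = 0;
        for (Map.Entry<String, String> e : master.catalog.entrySet()) {
            if (!e.getValue().equals(slave.catalog.get(e.getKey()))) {
                slave.catalog.put(e.getKey(), e.getValue());
                copied++;
            }
        }
        return copied;
    }
}
```

Several other modes mostly change the direction this primitive runs: for instance, a Nomadic Zone synchronizes a local zone up to its parent, and an Archival Backup Zone synchronizes toward an archive.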
29. Principal Peer-to-Peer Federation Approaches
30. SRB Availability
- SRB source distributed to academic and research institutions
- Commercial use access through the UCSD Technology Transfer Office
  - William Decker, WJDecker@ucsd.edu
- Commercial version from
  - http://www.nirvanastorage.com
31. SRB Info Resources
- SRB homepage
  - http://www.npaci.edu/DICE/SRB/
- Grid Port Toolkit
  - https://gridport.npaci.edu/
- Data Intensive Computing Environment (DICE)
  - http://www.npaci.edu/DICE
- mySRB, a web-based browser and query tool for the Storage Resource Broker
  - http://www.npaci.edu/dice/srb/mySRB/mySRB.html
32. NARA Proposal: Two Goals
- Demonstrate the possibility of federating two different preservation architectures
  - Support exchange of documents between both systems
- Demonstrate that DSpace/SRB integration leads to improved life-cycle support
33. NARA Proposal: Plan of Work
- Evaluation of life-cycle management
- Use SRB as filestore for DSpace bitstreams
- Identify METS storage profile
- Enable exchange of data and metadata between
independent DSpace and SRB systems
34. Schedule and Deliverables
- Year 1: Develop research prototype
  - Develop functional requirements
  - Specify standard interfaces and METS profiles
  - Prototype implementation of specified design
  - Ingest data, evaluate functionality and performance
35. Schedule and Deliverables (cont.)
- Year 2:
  - Federation with additional systems, possibly:
    - CDL
    - OCLC
    - Fedora
  - Scalability testing
  - Ingestion of more content types
36. UCSD Libraries to Provide Test Collections
- Slide collection
  - Over 200,000 digitized slide images
  - Over 1 million files (counting derivatives)
  - Approximately 5 Terabytes
- Moving image collection
  - California movie newsreel footage
  - Size of collection to be determined
37. Testing Will Explore
- Management of terabyte-scale collections
- Handling of compound documents
- Automating aspects of archival workflow
- Integration of METS
- Feasibility of deriving descriptive metadata from the material during ingest
- Automated verification and validation checking
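The automated verification step above typically includes fixity checking: recompute each bitstream's checksum and compare it with the value recorded at ingest. A minimal sketch using the JDK's MessageDigest; MD5 is chosen here only for illustration:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Fixity {
    // Hex-encoded MD5 of a bitstream, suitable for comparison against the
    // checksum recorded in the metadata catalog at ingest time.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    // True if the bitstream still matches its recorded checksum.
    static boolean verify(byte[] data, String recordedChecksum) {
        return md5Hex(data).equalsIgnoreCase(recordedChecksum);
    }
}
```

Running such a check periodically over every replica is one way the "automated verification and validation" deliverable could be realized.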
38. Hewlett-Packard Labs
- Will provide a second storage utility, which will be used to test:
  - Federation
  - Zone-level access and management controls
  - Data validation and authenticity
39. Data Model
- Paired content and metadata files
- Metadata encoded in standard METS profiles
- Stand-alone METS files used to describe arbitrary levels of aggregation of lower-level objects
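The paired-file model above can be sketched as a naming convention that keeps each content file next to a METS file describing it. The `.mets.xml` suffix is an assumed convention for illustration, not one specified by the proposal:

```java
// One logical object in the paired content/metadata data model.
class PairedObject {
    final String contentPath;   // e.g. a digitized slide image
    final String metadataPath;  // sibling METS file describing the content

    PairedObject(String contentPath) {
        this.contentPath = contentPath;
        this.metadataPath = metsPathFor(contentPath);
    }

    // Hypothetical convention: strip the file extension, append ".mets.xml".
    static String metsPathFor(String contentPath) {
        int dot = contentPath.lastIndexOf('.');
        String base = dot > contentPath.lastIndexOf('/')
                ? contentPath.substring(0, dot)
                : contentPath;
        return base + ".mets.xml";
    }
}
```

Higher levels of aggregation would then be stand-alone METS files whose structural map points at these lower-level pairs.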
40. (No transcript; image-only slide)
41. Integration of DSpace/SRB Enables Multiple Modes of Control
- From DSpace
- Via SRB APIs
- Specification of Federated SRB Zone configuration
and interoperability
42. Conclusion: Expected Results / Benefits
- DSpace users:
  - Federated collection management through distributed grid technology
  - Exchange of METS-encoded collections
- SRB users:
  - User-friendly ingest mechanism
  - Extended life-cycle management
  - Exchange of METS-encoded collections
43. Q &amp; A
- Presentation will be available at
  - http://libnet.ucsd.edu/dspace/user_group_2004.03.ppt