Title: The SMB Archive System: Data Backup Across the Web
1. The SMB Archive System: Data Backup Across the Web
- Kenneth R. Sharp
- Stanford Synchrotron Radiation Laboratory
2. Why a high-capacity, long-term data archive is needed
- Need a replacement for tapes
  - Tapes age and medium formats change rapidly.
  - Storage capacity and reliability of tapes are limited.
  - Much manual book-keeping is needed to keep track of data stored on tapes.
- Need to support large-area CCD detectors
  - Three Q315 detectors will be generating 20-80 MB files at a much increased rate when the SPEAR3 upgrade is complete.
  - RAID data storage at SSRL will be 24 TB in 2004; all that data must be backed up somehow!
  - Need to archive data as rapidly as it is collected.
- Need to support high-throughput structural biology
  - Automated beam lines will generate huge amounts of data.
  - Large numbers of samples and targets require that metadata be stored and tracked systematically.
  - Data must be archived automatically and be easy to retrieve.
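For a rough sense of scale, the figures on this slide imply how many detector images a full 24 TB store would hold. The sketch below is only an estimate and assumes decimal units (1 TB = 1,000,000 MB), which the slide does not specify; the function name is illustrative.

```python
# Rough file-count estimate from the figures on this slide.
# Assumption: decimal units (1 TB = 1_000_000 MB); the slide does not say.
MB_PER_TB = 1_000_000


def files_to_fill(capacity_tb, file_size_mb):
    """Number of whole files of the given size that fit in the capacity."""
    return (capacity_tb * MB_PER_TB) // file_size_mb


largest_files = files_to_fill(24, 80)   # every image is 80 MB
smallest_files = files_to_fill(24, 20)  # every image is 20 MB
print(largest_files, smallest_files)    # 300000 1200000
```

Even at the largest file size, the store holds hundreds of thousands of files, which is why manual book-keeping does not scale.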
3. SMB Archive Uses NPACI Resources at SDSC
- National Partnership for Advanced Computational Infrastructure (NPACI)
  - Mission: advance science by creating national computational infrastructure, the Grid.
  - Maintains resources at San Diego Supercomputer Center (SDSC), including HPSS and SRB.
- High Performance Storage System (HPSS)
  - Centralized long-term data storage system at SDSC.
  - Stores over 344 TB of data in 18 million files (Jan 2002).
  - Capacity: 2000 GBytes of disk and 6000 TBytes of tape storage.
- Storage Resource Broker (SRB)
  - Client-server middleware that provides a uniform interface for accessing heterogeneous resources over the network.
  - Presents data in hierarchical folders with data and access controls.
  - May be used to store and retrieve data on the HPSS at SDSC.
  - Powerful metadata querying system allows data sets to be accessed based on their attributes.
  - Data sets can be replicated over multiple resources.
  - Organizations may install and maintain their own SRB servers. We use the SRB installation at SDSC.
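The idea of accessing data sets by their attributes, rather than by path, can be illustrated with a toy in-memory catalog. This is not the SRB or MCAT API; the function, the record names, and the attribute keys are all hypothetical and serve only to show the kind of query involved.

```python
# Toy illustration of attribute-based lookup. This is NOT the SRB/MCAT API;
# every name and record below is hypothetical.
def query(datasets, **wanted):
    """Return names of data sets whose attributes match every key=value pair."""
    return [
        name
        for name, attrs in datasets.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    ]


catalog = {
    "run_001": {"beamline": "A", "detector": "Q315"},
    "run_002": {"beamline": "B", "detector": "Q315"},
    "run_003": {"beamline": "A", "detector": "other"},
}

print(query(catalog, beamline="A", detector="Q315"))  # ['run_001']
```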
4. Organizations Using SRB
- Digital Libraries
  - UCB, Umich, UCSB, Stanford, CDL
  - NSF NSDL - UCAR / DLESE
- NASA Information Power Grid
- Astronomy
  - National Virtual Observatory
  - 2MASS Project (2 Micron All Sky Survey)
- Particle Physics
  - Particle Physics Data Grid (DOE)
  - GriPhyN
- Medicine
  - Digital Embryo (NLM)
- Earth Systems Sciences
  - ESIPS
  - LTER
- Persistent Archives
  - NARA
  - LOC
- Neuro Science and Molecular Science
5. InQ: SRB client for Microsoft Windows
- SRB client applications
  - Users must be able to upload data, download data, and view the data in the archive.
  - Users perform these functions via SRB client applications.
  - Available clients: command-line programs (S Commands), InQ, MySRB.
  - Tools for custom clients: SRB C library, Java API.
- InQ for Microsoft Windows
  - InQ is the easiest-to-use client provided by NPACI.
  - Individual files or entire folders may be uploaded or downloaded.
  - Files in the archive may be browsed either by directory structure or by data attributes.
- Limitations of InQ
  - Runs only on Microsoft Windows platforms.
  - Windows is not the major platform used at synchrotron light sources or in crystallography research labs.
  - No batch job capability for long archive jobs.
  - Exposes confusing SRB features and terminology (resources, containers, collections, etc.).
6. MySRB: web browser-based SRB client
- MySRB
  - MySRB is a powerful web-based SRB client which can be run from standard web browsers.
  - Files in the archive may be browsed either by directory structure or by data attributes.
- Limitations of MySRB
  - No way to upload or download more than one file at a time.
  - The otherwise rich functionality and powerful features are confusing to users.
- The bottom line
  - Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.
  - Our users need a customized interface with simplified functionality.
  - Additional infrastructure had to be designed and implemented in order to make the SRB a viable storage system for crystallographic data.
  - A browser-based user interface is ideal.
7. The SMB Archive interface for using the SRB
- Convenient web browser interface
  - Users may define archive jobs over the web from anywhere in the world using any common type of computer.
  - Users need only log in with their SMB Unix account name and password.
- Simple archive job definition
  - Users may rapidly browse their /home and /data directories at SSRL.
  - Directory contents are listed in the browser window.
  - Directories may be navigated by clicking on directory names.
  - Files to be uploaded may be filtered according to a list of wildcards.
  - Subdirectories may be archived recursively.
  - The only SRB-related information required is the name of the new data collection to create.
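The wildcard filtering and recursive option in a job definition can be sketched with the Python standard library. This is a minimal illustration, not the Archive System's own code (which is written in Java); the function name `select_files` and its parameters are assumptions made for the example.

```python
import fnmatch
import os


def select_files(root, patterns, recursive=True):
    """Return sorted paths under root whose file names match any wildcard.

    Hypothetical helper: illustrates wildcard filtering with an optional
    recursive descent, as offered when defining an archive job.
    """
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                selected.append(os.path.join(dirpath, name))
        if not recursive:
            dirnames.clear()  # stop os.walk from descending into subdirectories
    return sorted(selected)
```

A call such as `select_files("/data/username", ["*.img", "*.log"])` would list matching files in that directory and everything below it; passing `recursive=False` restricts the result to the top-level directory. The path and patterns here are placeholders.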
8. Monitoring archive jobs and downloading data
- Similar interface for data download
  - Users browse their archived data sets in exactly the same fashion.
  - Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).
  - Another option is to download selected files in one or more tar files directly to any computer on the Internet.
- Batch operation
  - Archive job runs in background once definition is confirmed.
  - Browser does not hang during archival.
  - New jobs may be started while previously defined jobs are in progress.
  - Automatically restarts jobs if HPSS is unavailable.
  - A job status page indicates definitions and status of all running jobs.
  - User may abort running jobs.
  - E-mail is sent to the user when a job is started and again when it is completed.
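The automatic restart behavior can be sketched as a simple retry loop. This is an assumption-laden illustration, not the Archive System's implementation: `StorageUnavailable`, `run_job`, and the `transfer` callback are hypothetical names, and the sketch assumes a restarted job resumes at the first file not yet transferred. Background execution, the status page, and e-mail notification are left out.

```python
import time


class StorageUnavailable(Exception):
    """Hypothetical error raised when the remote store cannot be reached."""


def run_job(files, transfer, max_restarts=3, wait_seconds=0.0):
    """Transfer each file once, restarting after outages; return a status dict.

    Assumption: a restarted job resumes at the first untransferred file.
    """
    done = []
    restarts = 0
    while len(done) < len(files):
        try:
            for path in files[len(done):]:
                transfer(path)
                done.append(path)
        except StorageUnavailable:
            if restarts == max_restarts:
                return {"status": "failed", "done": done, "restarts": restarts}
            restarts += 1
            time.sleep(wait_seconds)  # wait before trying again
    return {"status": "completed", "done": done, "restarts": restarts}
```

The returned dictionary stands in for the information a job status page would show: which files are finished, how many restarts occurred, and whether the job completed.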
9. Archive System Infrastructure
- But first, a word about SRB accounts
  - An SRB account (independent of the SSRL Unix account) is required to archive data.
  - Your SRB account permits you to upload/download any data using SRB clients.
  - Handy web page on our site to create an SRB account: https://smb.slac.stanford.edu/secure/collaboratory/archive_system/SRBAccountForm.html
- Archive System infrastructure: the Archive System uses the following software elements
  - Apache Web Server (v1.3.27)
  - Apache Tomcat Servlet Container (v4.1.24)
  - Java 2 Runtime (v1.4.1)
  - SMB Authentication Gateway Server
  - SMB Impersonation Server
  - SRB JARGON Java API (v1.1)
  - Archive System Servlets (for Upload, Download, and Job Maintenance)
  - Archive System Background Applications
- All Archive System applications and servlets are written in Java.
  - Archive System front-end is made up of Java servlets.
  - Archive System back-end is made up of Java applications.
  - All infrastructure elements are either available for free or are home-grown.
10. Significant infrastructure is required to provide this simple interface, but the payoff is huge.
- Authentication Gateway Server
  - Java servlet that provides a common authentication protocol for all web-based and stand-alone applications.
  - Used to authenticate archive system users.
  - All web-based software developed at SSRL is being updated to use this single authentication server.
  - Support for the authentication server has already been integrated into Blu-Ice/DCS.
  - Allows users to navigate seamlessly between applications without authenticating multiple times.
  - Will eventually allow access to beamline systems to be controlled automatically based on the beam schedule.
  - Access to other resources (computing, data directories, etc.) remains available 24/7.
- Impersonation Server
  - Unix daemon that can run any non-interactive program on behalf of any Unix user.
  - Enables web applications to run background jobs for a user with the actual rights of the Unix user account.
  - Accepts commands via the HTTP protocol.
  - Verifies authentication information with the Authentication Server.
  - Used by the archive system to list directories in the web browser and run background archive jobs as the user.
  - Will allow further analyses to be automatically initiated by the beam line control system.
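The Impersonation Server's request flow (receive a command, verify the session with the Authentication Server, then run the program as the user) can be sketched as below. The slides do not describe the message format, so the request fields, status codes, and the two callbacks are hypothetical; the actual privilege switch and HTTP handling are represented only by stand-in functions.

```python
# Sketch of the request flow only; field names, codes, and callbacks are
# hypothetical and do not describe the real server's protocol.
def handle_command(request, verify_session, run_as_user):
    """Check the caller's session, then run the command as that Unix user."""
    user = request["user"]
    if not verify_session(user, request["session_id"]):
        return {"code": 403, "body": "authentication failed"}
    output = run_as_user(user, request["command"])
    return {"code": 200, "body": output}


# Stand-ins for the Authentication Server check and the privileged runner.
sessions = {"alice": "abc123"}


def verify(user, session_id):
    return sessions.get(user) == session_id


def fake_run(user, command):
    return f"{user} ran {' '.join(command)}"


ok = handle_command(
    {"user": "alice", "session_id": "abc123", "command": ["ls", "/data/alice"]},
    verify,
    fake_run,
)
print(ok)  # {'code': 200, 'body': 'alice ran ls /data/alice'}
```

Keeping the session check in front of every command is what lets a web application list directories or start background jobs with only the user's own rights.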
11. Archive System Web Architecture
[Figure: architecture diagram. Recoverable labels: SMB, SDSC, Archive Servlets (Tomcat)]
12. Archive Projects for the next year
- Optimize data transfer rates between SSRL and SDSC.
- Provide stand-alone application for users wishing to download datasets directly from the SRB.
- Implement other functions available in inQ and MySRB for manipulating existing collections (replicate, delete, etc.).
- Provide option for automatic data upload from Blu-Ice.
- Provide link from Blu-Ice to automatically start browser and load Archive page without the user having to log in again. (New Authentication Server makes this possible.)
- Provide additional options for using SRB Metadata Catalog (MCAT) to describe, index, and retrieve data files.
The Collaboratory for Macromolecular
Crystallography is supported by the NIH, NCRR as
a supplement to the SSRL Synchrotron Radiation
Structural Biology Resource (P41-RR-01209). The
SSRL Structural Molecular Biology program is
funded by DOE BER, NIH NCRR, and NIH NIGMS.