The SMB Archive System: Data Backup Across the Web

1
The SMB Archive System: Data Backup Across the Web
  • Kenneth R. Sharp
  • Stanford Synchrotron Radiation Laboratory

2
Why a high-capacity, long-term data archive is
needed
  • Need a replacement for tapes
  • Tapes age and medium formats change rapidly.
  • Storage capacity and reliability of tapes
    limited.
  • Much manual book-keeping is needed to keep track
    of data stored on tapes.
  • Need to support large-area CCD detectors
  • Three Q315 detectors will be generating 20-80 MB
    files at a much increased rate when the SPEAR3
    upgrade is complete.
  • RAID data storage at SSRL will be 24 TB in
    2004--all that data must be backed up somehow!
  • Need to archive data as rapidly as it is
    collected.
  • Need to support high-throughput structural
    biology
  • Automated beam lines will generate huge amounts
    of data.
  • Large numbers of samples and targets require that
    metadata be stored and tracked systematically.
  • Data must be archived automatically and be easy
    to retrieve.

3
SMB Archive Uses NPACI Resources at SDSC
  • National Partnership for Advanced Computational
    Infrastructure (NPACI)
  • Mission: advance science by creating a national
    computational infrastructure, the Grid.
  • Maintains resources at San Diego Supercomputer
    Center (SDSC) including HPSS, SRB.
  • High Performance Storage System (HPSS)
  • Centralized long term data storage system at
    SDSC.
  • Stores over 344 TB of data in 18 million files.
    (Jan 2002)
  • Capacity: 2000 GBytes disk, 6000 TBytes tape
    storage.
  • Storage Resource Broker (SRB)
  • Client-server middleware provides uniform
    interface for accessing heterogeneous resources
    over the network.
  • Presents data in hierarchical folders w/data and
    access controls.
  • May be used to store and retrieve data on the
    HPSS at SDSC.
  • Powerful metadata querying system allows data
    sets to be accessed based on their attributes.
  • Data sets can be replicated over multiple
    resources.
  • Organizations may install and maintain their own
    SRB Servers. We use the SRB installation at SDSC.
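
The two ways of reaching data described above (by position in a folder hierarchy, or by attribute) can be illustrated with a small in-memory model. This is a minimal sketch, not the SRB API: the class `DataSetCatalog`, its methods, and any attribute names passed to it are hypothetical and exist only to show the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a catalog of archived data sets. */
class DataSetCatalog {

    /** One archived object: a hierarchical path plus free-form attributes. */
    static final class Entry {
        final String path;
        final Map<String, String> attributes;

        Entry(String path, Map<String, String> attributes) {
            this.path = path;
            this.attributes = new HashMap<>(attributes);
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers a data set under a path with its descriptive attributes. */
    void add(String path, Map<String, String> attributes) {
        entries.add(new Entry(path, attributes));
    }

    /** Browse by directory structure: every path below the given collection. */
    List<String> listCollection(String collection) {
        String prefix = collection.endsWith("/") ? collection : collection + "/";
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.path.startsWith(prefix)) {
                result.add(e.path);
            }
        }
        return result;
    }

    /** Browse by attribute: every path whose named attribute equals the value. */
    List<String> findByAttribute(String name, String value) {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (value.equals(e.attributes.get(name))) {
                result.add(e.path);
            }
        }
        return result;
    }
}
```

A real metadata catalog would answer such queries from a database rather than a linear scan; the sketch only shows that the same entries can be found through either view.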

4
Organizations Using SRB
  • Digital Libraries
  • UCB, Umich, UCSB, Stanford, CDL
  • NSF NSDL - UCAR / DLESE
  • NASA Information Power Grid
  • Astronomy
  • National Virtual Observatory
  • 2MASS Project (2 Micron All Sky Survey)
  • Particle Physics
  • Particle Physics Data Grid (DOE)
  • GriPhyN
  • Medicine
  • Digital Embryo (NLM)
  • Earth Systems Sciences
  • ESIPS
  • LTER
  • Persistent Archives
  • NARA
  • LOC
  • Neuro Science, Molecular Science

5
InQ SRB client for Microsoft Windows
  • SRB client applications
  • Users must be able to upload data, download data,
    and view the data in the archive.
  • Users perform these functions via SRB client
    applications.
  • Available clients: command-line programs (S
    Commands), InQ, MySRB.
  • Tools for custom clients: SRB C library, Java
    API.
  • InQ for Microsoft Windows
  • InQ is the easiest-to-use client provided by
    NPACI.
  • Individual files or entire folders may be
    uploaded or downloaded.
  • Files in the archive may be browsed either by
    directory structure or by data attributes.
  • Limitations of InQ
  • Runs only on Microsoft Windows platforms.
  • Windows is not the major platform used at
    synchrotron light sources or in crystallography
    research labs.
  • No batch job capability for long archive jobs.
  • Exposes confusing SRB features and terminology
    (resources, containers, collections, etc).

6
MySRB web browser-based SRB client
  • MySRB
  • MySRB is a powerful web-based SRB client which
    can be run from standard web browsers.
  • Files in the archive may be browsed either by
    directory structure or by data attributes.
  • Limitations of MySRB
  • No way to upload or download more than one file
    at a time.
  • The otherwise rich functionality and powerful
    features are confusing to users.
  • The bottom line
  • Capabilities of HPSS and SRB far exceed the
    perceived needs of our beam line users.
  • Our users need a customized interface with
    simplified functionality.
  • Additional infrastructure had to be designed and
    implemented in order to make the SRB a viable
    storage system for crystallographic data.
  • A browser-based user interface is ideal.

7
The SMB Archive interface for using the SRB
  • Convenient web browser interface
  • Users may define archive jobs over the web from
    anywhere in the world using any common type of
    computer.
  • Users need only log in with their SMB Unix
    account name and password.
  • Simple archive job definition
  • Users may rapidly browse their /home and /data
    directories at SSRL.
  • Directory contents are listed in the browser
    window.
  • Directories may be navigated by clicking on
    directory names.
  • Files to be uploaded may be filtered according to
    a list of wildcards.
  • Subdirectories may be archived recursively.
  • The only SRB related information required is the
    name of the new data collection to create.

8
Monitoring archive jobs and downloading data
  • Similar interface for data download
  • Users browse their archived data sets in exactly
    the same fashion.
  • Data may be downloaded from the archive to a
    directory at SSRL (analogous to an upload job).
  • Another option is to download selected files in
    one or more tar files directly to any computer on
    the Internet.
  • Batch operation
  • Archive job runs in background once definition is
    confirmed.
  • Browser does not hang during archival.
  • New jobs may be started while previously defined
    jobs are in progress.
  • Automatically restarts jobs if HPSS is
    unavailable.
  • A job status page indicates definitions and
    status of all running jobs.
  • User may abort running jobs.
  • E-mail is sent to the user when a job is started
    and again when it is completed.

9
Archive System Infrastructure
  • But first, a word about SRB accounts
  • An SRB account (independent of the SSRL Unix
    Account) is required to archive data.
  • Your SRB account permits you to upload/download
    any data using SRB clients.
  • Handy web page on our site to create an SRB
    account:
    https://smb.slac.stanford.edu/secure/collaboratory/archive_system/SRBAccountForm.html
  • Archive System Infrastructure: the Archive
    System uses the following software elements
  • Apache Web Server (v1.3.27)
  • Apache Tomcat Servlet Container (v4.1.24)
  • Java 2 Runtime (v1.4.1)
  • SMB Authentication Gateway Server
  • SMB Impersonation Server
  • SRB JARGON Java API (v1.1)
  • Archive System Servlets (for Upload, Download,
    and Job Maintenance)
  • Archive System Background Applications
  • All Archive System applications and servlets are
    written in Java.
  • Archive System front-end is made up of Java
    servlets.
  • Archive System back-end is made up of Java
    applications.
  • All infrastructure elements are either available
    for free or are home-grown.

10
Significant infrastructure is required to
provide this simple interface--but the payoff
is huge.
  • Authentication Gateway Server
  • Java servlet that provides a common
    authentication protocol for all web-based and
    stand-alone applications.
  • Used to authenticate archive system users.
  • All web-based software developed at SSRL is being
    updated to use this single authentication server.
  • Support for the authentication server has already
    been integrated into Blu-Ice/DCS.
  • Allows users to navigate seamlessly between
    applications without authenticating multiple
    times.
  • Will eventually allow access to beamline systems
    to be controlled automatically based on the beam
    schedule.
  • Access to other resources (computing, data
    directories, etc.) remains available 24/7.
  • Impersonation Server
  • Unix daemon that can run any non-interactive
    program on behalf of any Unix user.
  • Enables web applications to run background jobs
    for a user with the actual rights of the Unix
    user account.
  • Accepts commands via the HTTP protocol.
  • Verifies authentication information with the
    Authentication Server.
  • Used by the archive system to list directories in
    the web browser and run background archive jobs
    as the user.
  • Will allow further analyses to be automatically
    initiated by the beam line control system.

11
Archive System Web Architecture
[Architecture diagram not reproduced; surviving labels: SMB, SDSC, Archive Servlets (Tomcat)]

12
Archive Projects for the next year
  • Optimize data transfer rates between SSRL and
    SDSC.
  • Provide stand-alone application for users wishing
    to download datasets directly from the SRB.
  • Implement other functions available in inQ and
    MySRB for manipulating existing collections
    (replicate, delete, etc.)
  • Provide option for automatic data upload from
    Blu-Ice.
  • Provide link from Blu-Ice to automatically start
    browser and load Archive page w/o user having to
    log in again. (New Authentication Server makes
    this possible.)
  • Provide additional options for using SRB Metadata
    Catalog (MCAT) to describe, index, and retrieve
    data files.

The Collaboratory for Macromolecular
Crystallography is supported by the NIH, NCRR as
a supplement to the SSRL Synchrotron Radiation
Structural Biology Resource (P41-RR-01209). The
SSRL Structural Molecular Biology program is
funded by DOE BER, NIH NCRR, and NIH NIGMS.