A Web-Based Data Grid

Provided by: chip161
Learn more at: https://www.jlab.org

1
A Web-Based Data Grid
  • Chip Watson, Ian Bird, Jie Chen,
  • Ying Chen, Bryan Hess, Andy Kowalski
  • Thomas Jefferson National Accelerator Facility

2
Outline
  1. Overview of a prototype JLAB data grid
    architecture
  2. Status of the development
  3. Expected future milestones
  4. Lessons learned so far

3
JLAB Prototype Architecture Summary
  • The prototype data grid consists of
  • Web services for information management and
    control
  • File daemons (like ftpd) for bulk data transfer
  • Back-end services used by the web services
  • Communication w/ web services is via HTTP and XML
  • (HTTPS w/ X.509 certificate for privileged
    operations)
  • Communication w/ file daemons is via a
    daemon-specific protocol
  • Communication w/ back-end services is
    site-specific
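
As a rough illustration of the HTTP-plus-XML exchange on this slide, the sketch below parses a directory listing of the kind a web service might return to a client. The element and attribute names (`directory`, `file`, `name`, `size`) and the class name `ListingParser` are assumptions made for illustration; the slides do not give the actual XML schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical client-side helper; the XML layout is assumed, not the JLAB schema.
class ListingParser {
    /** Returns the "name" attribute of every <file> element in the XML reply. */
    static List<String> fileNames(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList files = doc.getElementsByTagName("file");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < files.getLength(); i++) {
                names.add(((Element) files.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse reply", e);
        }
    }

    public static void main(String[] args) {
        // In a real client this string would be the body of an HTTP response.
        String reply = "<directory name=\"/demo\">"
                + "<file name=\"run1.dat\" size=\"1024\"/>"
                + "<file name=\"run2.dat\" size=\"2048\"/>"
                + "</directory>";
        System.out.println(fileNames(reply));
    }
}
```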

4
In picture form
[Figure: architecture diagram. Labeled elements: Client Program, Agent, Data Grid Server, File Server, Replica Catalog, File Host, R C Host]

5
Web Services
  • Replica Catalog
  • Holds global file namespace
  • May itself be replicated for redundancy or
    performance
  • References (for given file) data grid nodes (but
    not physical path)
  • Data Grid Server (aka Replica Host)
  • Holds and serves files
  • May be a disk cache; may include tertiary
    storage
  • Translates global name to URL for retrieval (if
    cache resident) (pull by client)
  • Accepts new files (push by client)
  • Supports queuing of file transfer requests
    between nodes (3rd party)
  • Supports policy based file movement
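
The name-to-URL translation described for the Data Grid Server could look roughly like the sketch below: a URL is returned only when the file is cache resident. The class name `NameTranslator`, the example host, and the path layout are all hypothetical; the slides do not describe how the mapping is stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; host name and URL layout are illustrative assumptions.
class NameTranslator {
    private final String urlPrefix;
    private final Map<String, Boolean> cacheResident = new HashMap<>();

    NameTranslator(String urlPrefix) {
        this.urlPrefix = urlPrefix;
    }

    /** Records whether a global file name currently has a copy in the disk cache. */
    void setResident(String globalName, boolean resident) {
        cacheResident.put(globalName, resident);
    }

    /** Returns a retrieval URL only when the file is cache resident; otherwise empty. */
    Optional<String> toUrl(String globalName) {
        if (Boolean.TRUE.equals(cacheResident.get(globalName))) {
            return Optional.of(urlPrefix + globalName);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        NameTranslator t = new NameTranslator("http://node.example.org/cache");
        t.setResident("/lqcd/run1.dat", true);
        System.out.println(t.toUrl("/lqcd/run1.dat"));
        System.out.println(t.toUrl("/lqcd/not-cached.dat"));
    }
}
```

An empty result is where a real server would presumably report the file as offline or trigger staging from tertiary storage.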

6
Replica Catalog Components
  • Relational database
  • Global directory name, file name, owner, size,
    etc
  • Set of Data Grid Nodes holding copies of the
    file, and last reported state of that replica
    copy (online, offline)
  • XML servlet
  • Directory level services per invocation,
    returning rich info from the database as an XML
    document
  • Catalog updates
  • HTTP servlet
  • Applies style sheet(s) to the XML document,
    allows easy browsing and simple interactions with
    just a simple web browser
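
The step in which the HTTP servlet applies a style sheet to the XML document can be sketched with the standard `javax.xml.transform` API. The style sheet and the XML element names here are invented for illustration; the prototype's actual style sheets are not shown in the slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: turns an assumed <directory>/<file> listing into an HTML list.
class StyleSheetDemo {
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/directory\">"
        + "<ul><xsl:for-each select=\"file\">"
        + "<li><xsl:value-of select=\"@name\"/></li>"
        + "</xsl:for-each></ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the style sheet above to the XML text and returns the HTML result. */
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(
                "<directory name=\"/demo\"><file name=\"run1.dat\"/></directory>"));
    }
}
```

Keeping the XML servlet as the single source of data and layering HTML on top with a style sheet means programs and browsers can share one back end.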

7
Current Status of Replica Catalog
  • A prototype exists with the following functionality
  • Database populated with ALL files from the
    Jefferson Lab silo (no owner, group, file size
    info loaded for now)
  • XML servlet for browsing
  • HTTP servlet for browsing
  • http://129.57.41.138/servlet/dg.HttpReplicaCatalog?dname=/
  • Missing functionality in this prototype
  • Authentication
  • Easy, already done for another (batch system)
    prototype
  • Edit catalog
  • In principle easy, just need to finalize
    scenarios
  • Extensible file properties
  • Moderately easy, just need to add a name-value
    table to db and expand the XML document for a
    single file to include this info

8
Status (cont.)
  • Observations
  • Web browsing into directories w/ thousands of
    files is slow (produces an ENORMOUS web
    page), but works
  • Plan to segment, with Next Page link
  • Probably need to allow client to specify number
    of files to retrieve, and offset for next
    retrieval
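
The planned segmentation, with a client-specified number of files and an offset for the next retrieval, might be sketched as follows. The class and method names are hypothetical, and a real implementation would more likely push the limit and offset into the database query than slice an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of count/offset paging over a directory listing.
class DirectoryPager {
    /** Returns at most count entries starting at offset; empty if offset is past the end. */
    static List<String> page(List<String> entries, int offset, int count) {
        if (offset < 0 || count < 0) {
            throw new IllegalArgumentException("offset and count must be non-negative");
        }
        int from = Math.min(offset, entries.size());
        int to = (int) Math.min((long) from + count, entries.size());
        return new ArrayList<>(entries.subList(from, to));
    }

    /** Offset to put in the "Next Page" link, or -1 when there is no further page. */
    static int nextOffset(int total, int offset, int count) {
        if (count <= 0) {
            return -1;
        }
        long next = (long) offset + count;
        return next < total ? (int) next : -1;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(files, 2, 2));
        System.out.println(nextOffset(files.size(), 2, 2));
    }
}
```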

9
Data Grid Node Components
  • XML (and HTTP) servlets
  • File Catalog Servlet (Replica Host)
  • Translates file I/O requests to specific URL
    (including protocol negotiation or selection)
  • Provides offline / online status of file
  • Transfer Request Servlet
  • Queues file transfer requests, reports status
  • Edits transfer policy for specified directory
  • Disk Cache Manager Servlet
  • Edits policy of disk cache manager
  • File Server(s)
  • ftp, bbftp, gridftp, ...
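
One simple way to do the protocol selection mentioned for the File Catalog Servlet is to take the first protocol in the client's preference order that the node's file servers also support. This is only an assumed strategy; the slides do not say how negotiation is actually performed, and `ProtocolSelector` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of first-match protocol selection.
class ProtocolSelector {
    /** Picks the first protocol in the client's preference order that the server offers. */
    static Optional<String> select(List<String> clientPreference, List<String> serverSupported) {
        for (String protocol : clientPreference) {
            if (serverSupported.contains(protocol)) {
                return Optional.of(protocol);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> client = Arrays.asList("gridftp", "bbftp", "ftp");
        List<String> server = Arrays.asList("ftp", "bbftp");
        System.out.println(select(client, server));
    }
}
```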

10
Data Grid Server Components (Implementation)
  • Disk Cache Manager (back end service)
  • Java application
  • Manages disk pool -- NFS mounted read-only to
    local users
  • SQL database to track cached files, pending
    transfers
  • Migrates files to / from tape (if requested and
    if it has a reference to a Tape Manager)
  • Interacts with a Disk Policy Agent (planned)
  • Tape Manager (back end service)
  • Separate Java application & db (running on a
    different host)
  • Stages files to or from silo (has own small disk
    cache)
  • NFS exports stub file system

11
Data Grid Node Components (Implementation)
  • Disk Policy Agent (back end service)
  • Runs in the Disk Cache Manager's VM
  • Keeps replica catalog up to date
  • Advises cache manager as to which files to delete
    (deleting last globally disk resident copy is
    expensive)
  • Propagates transfer policy from Replica Catalog
  • Grid Transfer Agent (back end service)
  • Operates on queued transfer requests
  • Uses remote File Servers (e.g. is or spawns an
    xxftp client)
  • Runs (probably) in the Disk Cache Manager's VM
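
The advice the Disk Policy Agent gives to the cache manager could be as simple as ordering deletion candidates so that files with other online copies go first and the last disk-resident copy goes last. The sketch below assumes the agent has a per-file count of online replicas taken from the replica catalog; that data shape and the name `DeletionAdvisor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: prefer deleting files that still have another online copy.
class DeletionAdvisor {
    /**
     * Reorders candidates so files with more than one online copy come first.
     * onlineCopies maps a global file name to the number of nodes reporting it
     * online, this node included. Unknown files are treated as a last copy.
     */
    static List<String> advise(List<String> candidates, Map<String, Integer> onlineCopies) {
        List<String> replicatedElsewhere = new ArrayList<>();
        List<String> lastCopy = new ArrayList<>();
        for (String file : candidates) {
            if (onlineCopies.getOrDefault(file, 1) > 1) {
                replicatedElsewhere.add(file);
            } else {
                lastCopy.add(file);
            }
        }
        replicatedElsewhere.addAll(lastCopy);
        return replicatedElsewhere;
    }

    public static void main(String[] args) {
        Map<String, Integer> copies = new HashMap<>();
        copies.put("a", 1);
        copies.put("b", 3);
        System.out.println(advise(Arrays.asList("a", "b"), copies));
    }
}
```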

12
Current Status of Data Grid Node
  • Data Grid Servlets
  • Translation from global name to URL is hard coded
  • Supports browsing of disk cache
  • Newest prototype allows browsing of unmanaged
    node-local file system, including /home, /data,
    ..., and the copying of files within a single data
    node (adding authentication soon)
  • File Servers
  • bbftp in production use at Jlab; waiting for
    gridFTP

13
Back End Status
  • Disk Cache Manager
  • Simple LRU policy (pluggable), no user quotas
  • No use of policy agent yet (to sync with replica
    catalog)
  • Automatic migration of specified files to tape
    guaranteed before deletion
  • Only 1 node operating in this mode (variant of
    other disk cache managers at Jlab)
  • Tape Manager
  • Fully operational, in production use at Jlab
  • File Transfer Agent
  • Just starting development
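
A simple LRU policy combined with the migrate-before-delete guarantee might look like the sketch below. It is not the JLAB implementation: the class name `LruDiskCache` is invented, and for brevity every file is assumed to need a tape copy before it can be evicted, whereas the slide limits this to specified files.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of LRU eviction that never deletes a file lacking a tape copy.
class LruDiskCache {
    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Long> sizes = new LinkedHashMap<>(16, 0.75f, true);
    private final Set<String> onTape = new HashSet<>();

    LruDiskCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Records that the file has been migrated to tape and may now be evicted. */
    void markOnTape(String name) {
        onTape.add(name);
    }

    /** Marks the file as recently used (a get on an access-ordered map reorders it). */
    void touch(String name) {
        sizes.get(name);
    }

    /**
     * Adds a file, then evicts least recently used files that are on tape until
     * the cache fits. Returns the evicted names. The cache can stay over capacity
     * if no remaining file is eligible for eviction.
     */
    List<String> add(String name, long size) {
        Long previous = sizes.put(name, size);
        usedBytes += size - (previous == null ? 0 : previous);
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = sizes.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getKey().equals(name) || !onTape.contains(entry.getKey())) {
                continue; // keep the new file and anything not yet on tape
            }
            usedBytes -= entry.getValue();
            evicted.add(entry.getKey());
            it.remove();
        }
        return evicted;
    }

    public static void main(String[] args) {
        LruDiskCache cache = new LruDiskCache(100);
        cache.add("a", 40);
        cache.add("b", 40);
        cache.markOnTape("a");
        cache.markOnTape("b");
        cache.touch("a");
        System.out.println(cache.add("c", 40));
    }
}
```

A pluggable policy would replace the fixed LRU iteration order with a strategy object; that is left out here to keep the sketch short.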

14
Status Summary
  • Missing Functionality
  • A lot!
  • Transfer queuing
  • Advanced reservation quotas
  • Policy based operations
  • Automatic updates of replica catalog
  • All of these are planned or in progress

15
Data Grid Applications: File Manager
  • File Manager Design
  • Uses Replica Catalog (XML)
  • Uses Data Grid Node (XML)
  • GUI to browse files
  • GUI to copy files (and view queues)
  • Status
  • XML communications and file GUI done
  • 3rd party transfer operations awaiting additional
    functionality in the data grid node
  • Currently an application, but plan to make it
    into an applet

16
Deployment / Development
  • 2Q 01
  • 2 data grid servers running at Jlab & MIT for
    LQCD
  • grid browsing (replica catalog and data grid
    server)
  • retrieve file: http, bbftp, gridftp
  • Command line utility and web interface to
    publish a file (insert into grid node from
    co-located machine / local file system)
  • 3Q 01
  • 2nd grid running between Jlab & FSU for CLAS
    (Hall D prototype)
  • push file into a data grid server from offsite
  • 3rd party file transfers on demand (queued)
  • 1Q 02
  • Policy based file migration
  • Asynchronous event notification (HTTP based)