Modern Data Management Overview - PowerPoint PPT Presentation

1
Modern Data Management Overview
Storage Resource Broker
Reagan W. Moore moore_at_sdsc.edu http://www.sdsc.edu/srb
2
Topics
  • Data management evolution
  • Shared collections
  • Digital Libraries
  • Persistent Archives
  • Building shared collections
  • Project level / National level / International
  • Demonstration of shared collections
  • Access to collections at SDSC

3
Types of Data Management
  • File system (AFS)
  • Provides caching at remote sites, uses single
    authentication system
  • Backup system (Veritas)
  • Provides time-based copies of data, tools for
    re-loading backups
  • Database system (Oracle 10g IFS)
  • Can link metadata to files on an Internet File
    System
  • Archive system (HPSS)
  • Manages data stored on tape, supports parallel
    I/O streams
  • Persistent object environment (Avaki)
  • Provides vaults for storing objects
  • Globus toolkit
  • Provides differentiated services for building a
    data grid

4
Data Management Environments
  • Data grids
  • Manage shared collections
  • Digital libraries
  • Provide discovery, browsing, presentation
    services on top of collections
  • Persistent archives
  • Manage technology evolution while the
    authenticity and integrity of the assembled
    collection is preserved
  • Real-time sensor networks
  • Manage access to real-time data streams from
    thousands of sensors

5
Generic Infrastructure
  • Can a single system provide all of the features
    needed to implement each type of data management
    system, while supporting access across
    administrative domains and managing data stored
    in multiple types of storage systems?
  • Answer is data grid technology

6
Types of Data Management
  • File system (AFS)
  • Data grid manages replication, parallel I/O,
    containers
  • Backup system (Veritas)
  • Data grid supports replicas, versions, and
    snapshots of files and containers
  • Database system (Oracle 10g IFS)
  • Data grid virtualizes catalogs - schema
    extension, bulk metadata load
  • Archive system (HPSS)
  • Data grid integrates access across archives and
    file systems
  • Persistent object environment (Avaki)
  • Data grid manages user-defined metadata and
    collection hierarchy
  • Globus toolkit - set of differentiated services
  • Data grid manages consistent state information

7
Shared Collections
  • Purpose of SRB data grid is to enable the
    creation of a collection that is shared between
    academic institutions
  • Register digital entity into the shared
    collection
  • Assign owner, access controls
  • Assign descriptive, provenance metadata
  • Manage state information
  • Audit trails, versions, replicas, backups, locks
  • Size, checksum, validation date, synchronization
    date, ...
  • Manage interactions with storage systems
  • Unix file systems, Windows file systems, tape
    archives, ...
  • Manage interactions with preferred access
    mechanisms
  • Web browser, Java, WSDL, C library, ...
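
The registration steps listed above (assign an owner and access controls, attach metadata, record state information) can be sketched in a few lines of Python. This is only an illustration and not the SRB API: the `DigitalEntity` record, the `register` function, the example path, and the choice of SHA-256 for the checksum are all assumptions made for the sketch.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DigitalEntity:
    """Hypothetical catalog record for one registered digital entity."""
    logical_name: str
    owner: str
    size: int
    checksum: str
    acl: dict = field(default_factory=dict)          # user -> "read" or "write"
    metadata: dict = field(default_factory=dict)     # descriptive / provenance
    audit_trail: list = field(default_factory=list)  # (timestamp, user, action)


def register(collection: dict, logical_name: str, data: bytes, owner: str,
             metadata: dict) -> DigitalEntity:
    """Register data under a logical name and record its state information."""
    entity = DigitalEntity(
        logical_name=logical_name,
        owner=owner,
        size=len(data),
        checksum=hashlib.sha256(data).hexdigest(),  # assumed checksum algorithm
        acl={owner: "write"},
        metadata=dict(metadata),
    )
    entity.audit_trail.append(
        (datetime.now(timezone.utc).isoformat(), owner, "register"))
    collection[logical_name] = entity
    return entity


# Illustrative usage with made-up names.
collection = {}
entity = register(collection, "/home/demo/seismogram.dat", b"example bytes",
                  owner="alice", metadata={"project": "demo"})
print(entity.size, entity.acl, len(entity.audit_trail))
```

Keeping size and checksum in the catalog record is what would let a later validation pass detect a corrupted copy without consulting the original source.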

8
Federated Server Architecture
[Diagram labels: Read Application; Logical Name Or Attribute Condition; Peer-to-peer Brokering; SRB server (x2); SRB agent (x2); Server(s) Spawning; Parallel Data Access; Data Access; replicas R1 and R2; steps numbered 1 to 6]
MCAT
  • Logical-to-physical mapping
  • Identification of replicas
  • Access and audit control
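
The three catalog roles named on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be illustrated with a toy catalog. The `Catalog` class below is a hypothetical stand-in and does not reflect the real MCAT schema or the SRB server protocol; resource names and paths are invented.

```python
class Catalog:
    """Toy stand-in for a metadata catalog; not the real MCAT schema."""

    def __init__(self):
        self.replicas = {}  # logical name -> list of (resource, physical path)
        self.acl = {}       # logical name -> set of users allowed to read
        self.audit = []     # (user, logical name, outcome)

    def add_replica(self, logical_name, resource, physical_path):
        self.replicas.setdefault(logical_name, []).append(
            (resource, physical_path))

    def grant_read(self, logical_name, user):
        self.acl.setdefault(logical_name, set()).add(user)

    def resolve(self, user, logical_name, online):
        """Return the first replica on an online resource, auditing the request."""
        if user not in self.acl.get(logical_name, set()):
            self.audit.append((user, logical_name, "denied"))
            raise PermissionError(f"{user} may not read {logical_name}")
        for resource, path in self.replicas.get(logical_name, []):
            if resource in online:
                self.audit.append(
                    (user, logical_name, f"served from {resource}"))
                return resource, path
        self.audit.append((user, logical_name, "no replica available"))
        raise LookupError(f"no online replica for {logical_name}")


# Illustrative usage: R1 is offline, so the request falls through to R2.
mcat = Catalog()
mcat.add_replica("/demo/file.dat", "R1", "/vault1/a/0001")
mcat.add_replica("/demo/file.dat", "R2", "/tape/b/7731")
mcat.grant_read("/demo/file.dat", "alice")
location = mcat.resolve("alice", "/demo/file.dat", online={"R2"})
print(location)
```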
9
Generic Infrastructure
  • Digital libraries now build upon data grids to
    manage distributed collections
  • DSpace digital library - MIT and Hewlett-Packard
  • Fedora digital library - Cornell University and
    University of Virginia
  • Persistent archives build upon data grids to
    manage technology evolution
  • NARA research prototype persistent archive
  • California Digital Library - Digital Preservation
    Repository
  • NSF National Science Digital Library persistent
    archive

10
Southern California Earthquake Center
  • Intuitive User Interface
  • Pull-Down Query Menus
  • Graphical Selection of Source Model
  • Clickable LA Basin Map (Olsen)
  • Seismogram/History extraction (Olsen)
  • Access SCEC Digital Library
  • Data stored in a data grid
  • Annotated by modelers
  • Standard naming convention
  • Automated extraction of selected data and
    metadata
  • Management of visualizations

SCEC Digital Library
11
TeraShake Data Handling
  • Simulate a magnitude 7.7 earthquake on the San
    Andreas fault
  • 50 Terabytes in a simulation
  • Move 10 Terabytes per day
  • Post-Processing of wave field
  • Movies of seismic wave propagation
  • Seismogram formatting for interactive on-line
    analysis
  • Velocity magnitude
  • Displacement vector field
  • Cumulative peak maps
  • Statistics used in visualizations
  • Register derived data products into SCEC digital
    library
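
The two volume figures on this slide imply a sustained transfer rate that is easy to work out. The sketch below assumes decimal terabytes (10^12 bytes) and a transfer spread evenly over 24 hours; neither assumption is stated on the slide.

```python
TB = 10**12  # decimal terabyte (assumption; the slide does not specify)

simulation_bytes = 50 * TB  # output of one simulation
daily_bytes = 10 * TB       # data moved per day

days_to_move = simulation_bytes / daily_bytes
bytes_per_second = daily_bytes / (24 * 60 * 60)
megabytes_per_second = bytes_per_second / 10**6
megabits_per_second = bytes_per_second * 8 / 10**6

print(f"{days_to_move:.0f} days, "
      f"{megabytes_per_second:.1f} MB/s, "
      f"{megabits_per_second:.0f} Mbit/s")
# prints: 5 days, 115.7 MB/s, 926 Mbit/s
```

Under these assumptions, moving 10 TB per day needs close to a full gigabit per second of sustained throughput, and one simulation's output takes about five days to move.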

12
ROADNet Sensor Network Data Integration
[Figure: real-time sensor streams. Recoverable labels: Humidity and Wind Speed (Climate, Ecological, Wireless, Oceanography); Seismic (Geophysics); events marked Rain start and Fire start]
Frank Vernon - UCSD/SIO
13
Chile June 13, 2005
Mw 7.9
Frank Vernon - UCSD/SIO
14
National Science Digital Library
  • URLs for educational material for all grade
    levels registered into repository at Cornell
  • SDSC crawls the URLs, registers the web pages
    into an SRB data grid, builds a persistent archive
  • 750,000 URLs
  • 13 million web pages
  • About 3 TBs of data

17
Worldwide University Network Data Grid
  • SDSC
  • Manchester
  • Southampton
  • White Rose
  • NCSA
  • U. Bergen
  • A functioning, general purpose international Data
    Grid for academic collaborations

Manchester-SDSC mirror
18
KEK Data Grid
  • Japan
  • Taiwan
  • South Korea
  • Australia
  • Poland
  • US
  • A functioning, general purpose international Data
    Grid for high-energy physics

Manchester-SDSC mirror
19
BaBar High-energy Physics
  • Stanford Linear Accelerator
  • Lyon, France
  • Rome, Italy
  • San Diego
  • RAL, UK
  • A functioning international Data Grid for
    high-energy physics

Manchester-SDSC mirror
Moved over 100 TBs of data
20
Astronomy Data Grid
  • Chile
  • Tucson, Arizona
  • NCSA, Illinois
  • A functioning international Data Grid for
    Astronomy

Manchester-SDSC mirror
Moved over 400,000 images
21
International Institutions (2005)
23
Storage Resource Broker 3.3.1
[Architecture diagram labels]
  • Application
  • OAI, WSDL, (WSRF)
  • HTTP, DSpace, OpenDAP, GridFTP
  • DLL / Python, Perl, Windows
  • Linux I/O C
  • NT Browser, Kepler Actors
  • Federation Management
  • Consistency Metadata Management / Authorization,
    Authentication, Audit
  • Logical Name Space
  • Latency Management
  • Data Transport
  • Metadata Transport
  • Storage Repository Abstraction
  • Database Abstraction
  • Databases - DB2, Oracle, Sybase, Postgres, MySQL,
    Informix
  • ORB
24
SRB Objectives
  • Automate all aspects of data discovery, access,
    management, analysis, preservation
  • Security paramount
  • Distributed data
  • Provide distributed data support for
  • Data sharing - data grids
  • Data publication - digital libraries
  • Data preservation - persistent archives
  • Data collections - Real time sensor data

25
SRB Developers
  • Reagan Moore - PI
  • Michael Wan - SRB Architect
  • Arcot Rajasekar - SRB Manager
  • Wayne Schroeder - SRB Productization
  • Charlie Cowart - inQ
  • Lucas Gilbert - Jargon
  • Bing Zhu - Perl, Python, Windows
  • Antoine de Torcy - mySRB web browser
  • Sheau-Yen Chen - SRB Administration
  • George Kremenek - SRB Collections
  • Arun Jagatheesan - Matrix workflow
  • Marcio Faerman - SCEC Application
  • Sifang Lu - ROADNet Application
  • Richard Marciano - SALT persistent archives
  • 75 FTE-years of support
  • About 300,000 lines of C

26
History
  • 1995 - DARPA Massive Data Analysis Systems
  • 1997 - DARPA/USPTO Distributed Object Computation
    Testbed
  • 1998 - NSF National Partnership for Advanced
    Computational Infrastructure
  • 1998 - DOE Accelerated Strategic Computing
    Initiative data grid
  • 1999 - NARA persistent archive
  • 2000 - NASA Information Power Grid
  • 2001 - NLM Digital Embryo digital library
  • 2001 - DOE Particle Physics data grid
  • 2001 - NSF Grid Physics Network data grid
  • 2001 - NSF National Virtual Observatory data grid
  • 2002 - NSF National Science Digital Library
    persistent archive
  • 2003 - NSF Southern California Earthquake Center
    digital library
  • 2003 - NIH Biomedical Informatics Research
    Network data grid
  • 2003 - NSF Real-time Observatories, Applications,
    and Data management Network
  • 2004 - NSF ITR, Constraint based data systems
  • 2005 - LC Digital Preservation Lifecycle
    Management
  • 2005 - LC National Digital Information
    Infrastructure and Preservation program

27
Development
  • SRB 1.1.8 - December 15, 2000
  • Basic distributed data management system
  • Metadata Catalog
  • SRB 2.0 - February 18, 2003
  • Parallel I/O support
  • Bulk operations
  • SRB 3.0 - August 30, 2003
  • Federation of data grids
  • SRB 3.3.1 - April 6, 2005
  • Feature requests (extensible schema)

28
SRB Latency Management
  • Remote proxies, staging
  • Data aggregation, containers
  • Prefetch
  • Caching, client-initiated I/O
  • Streaming, parallel I/O
  • Replication, server-initiated I/O
[Diagram: Source - Network - Destination]
29
Latency Management - Bulk Operations
  • Bulk register
  • Create a logical name for a file
  • Load context (metadata)
  • Bulk load
  • Create a copy of the file on a data grid storage
    repository
  • Bulk unload
  • Provide containers to hold small files and
    pointers to each file location
  • Bulk delete
  • Trash can
  • Sticky bits for access control, ...
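
The container idea mentioned above (hold many small files together and keep a pointer to each file's location) can be sketched as a single byte blob plus an offset index. This is a minimal illustration under assumed names (`pack_container`, `read_from_container`); it is not the SRB container format.

```python
import io


def pack_container(files: dict) -> tuple:
    """Concatenate small files into one blob, indexing each by (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # pointer to this file's bytes
        blob.write(data)
    return blob.getvalue(), index


def read_from_container(container: bytes, index: dict, name: str) -> bytes:
    """Extract one file from the container using its recorded pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


# Illustrative usage with made-up file names.
files = {"a.txt": b"alpha", "b.txt": b"be", "c.txt": b""}
container, index = pack_container(files)
print(index["b.txt"], read_from_container(container, index, "b.txt"))
```

Moving one container instead of many small files means one storage operation and one network transfer, which is where the latency saving would come from.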

30
Logical Name Spaces
Data Access Methods (C library, Unix, Web Browser)
  • Storage Repository
  • Storage location
  • User name
  • File name
  • File context (creation date, ...)
  • Access constraints

Data access directly between application and
storage repository using names required by the
local repository
31
Logical Name Spaces
Data Access Methods (C library, Unix, Web Browser)
Data Collection
  • Storage Repository
  • Storage location
  • User name
  • File name
  • File context (creation date, ...)
  • Access constraints
  • Data Grid
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints

Data is organized as a shared collection
32
Federation Between Data Grids
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection B
Data Collection A
  • Data Grid
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
  • Data Grid
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints

Access controls and consistency constraints on
cross registration of digital entities
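
Cross registration between two federated data grids can be modelled as one grid holding a local logical name that points at an entity owned by the other grid, subject to the owning grid's access controls. The `DataGrid` class, its `shareable` flag, and the example paths are hypothetical and only illustrate the idea; they are not SRB federation calls.

```python
class DataGrid:
    """Hypothetical model of one independently administered data grid."""

    def __init__(self, name):
        self.name = name
        self.files = {}         # logical file name -> physical location
        self.shareable = set()  # logical names other grids may cross-register
        self.remote = {}        # local logical name -> (grid, remote name)

    def put(self, logical_name, location, shareable=False):
        self.files[logical_name] = location
        if shareable:
            self.shareable.add(logical_name)

    def cross_register(self, other, remote_name, local_name):
        """Register another grid's entity here, if that grid permits it."""
        if remote_name not in other.shareable:
            raise PermissionError(
                f"{other.name} does not share {remote_name}")
        self.remote[local_name] = (other, remote_name)

    def locate(self, logical_name):
        """Resolve a logical name to (owning grid, physical location)."""
        if logical_name in self.files:
            return self.name, self.files[logical_name]
        other, remote_name = self.remote[logical_name]
        return other.locate(remote_name)


# Illustrative usage with made-up names.
grid_a = DataGrid("A")
grid_a.put("/A/run1.dat", "archive://a/run1", shareable=True)
grid_a.put("/A/private.dat", "disk://a/private")

grid_b = DataGrid("B")
grid_b.cross_register(grid_a, "/A/run1.dat", "/B/shared/run1.dat")
print(grid_b.locate("/B/shared/run1.dat"))
```

In this sketch the data never leaves grid A; grid B only gains a logical name for it, so each grid keeps control of its own name spaces.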
33
Types of Risk
  • Media failure
  • Replicate data onto multiple media
  • Vendor specific systemic errors
  • Replicate data onto multiple vendor products
  • Operational error
  • Replicate data onto a second administrative
    domain
  • Natural disaster
  • Replicate data to a geographically remote site
  • Malicious user
  • Replicate data to a deep archive
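
The risk list above pairs each failure mode with a replication rule, so a replica set can be checked against it mechanically. The sketch below is one possible encoding: the `Replica` fields, the reading of "multiple" as "at least two distinct values", and all example values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical description of where one copy of the data lives."""
    media: str
    vendor: str
    admin_domain: str
    site: str
    deep_archive: bool = False


def unmitigated_risks(replicas):
    """Return the risks from the slide that the replica set does not cover."""
    checks = {
        "media failure": len({r.media for r in replicas}) >= 2,
        "vendor specific systemic errors":
            len({r.vendor for r in replicas}) >= 2,
        "operational error": len({r.admin_domain for r in replicas}) >= 2,
        "natural disaster": len({r.site for r in replicas}) >= 2,
        "malicious user": any(r.deep_archive for r in replicas),
    }
    return [risk for risk, covered in checks.items() if not covered]


# Illustrative usage with made-up values.
primary = Replica("disk", "vendor-1", "ops-1", "site-1")
secondary = Replica("tape", "vendor-2", "ops-2", "site-2")
deep = Replica("tape", "vendor-2", "ops-3", "site-3", deep_archive=True)

print(unmitigated_risks([primary]))
print(unmitigated_risks([primary, secondary, deep]))
```

A single copy fails every check, while the three-copy example passes all five.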

34
How Many Replicas
  • Three sites minimize risk
  • Primary site
  • Supports interactive user access to data
  • Secondary site
  • Supports interactive user access when first site
    is down
  • Provides 2nd media copy, located at a remote
    site, uses different vendor product, independent
    administrative procedures
  • Deep archive
  • Provides 3rd media copy, staging environment for
    data ingestion, no user access

35
State of the Art Technology
  • Grid - workflow virtualization
  • Support execution of jobs (processes) across
    multiple compute servers
  • Data grid - data virtualization
  • Manage a shared collection that is distributed
    across multiple storage servers
  • Semantic grid - information virtualization
  • Create a common understanding of information
    (metadata) across multiple collections.

36
For More Information
  • Reagan W. Moore
  • San Diego Supercomputer Center
  • moore_at_sdsc.edu
  • http://www.sdsc.edu/srb/