Title: Data Intensive Computing at SDSC
1Data Intensive Computing at SDSC
- Chaitan Baru
- Senior Principal Scientist
- Data Intensive Computing Group
- San Diego Supercomputer Center
2NPACI Data management issues
- Managing a small number of large data sets
- e.g. high resolution X-ray images in neuroscience
- Managing a large number of small data sets
- e.g. digital sky surveys, large document collections
- Querying metadata to identify data sets
- move from file-oriented view to database-oriented view
- Accessing large data collections distributed across a network (including parallel I/O)
- Querying mediated views over distributed, heterogeneous information sources
3Collections with large number of small data sets
- Digital Sky surveys
- About 2 billion objects corresponding to light sources in the sky
- Each is about 1K in size
- Patent documents
- About 2 million documents
- Each is about 75K in size
- NARA document collections
- HTML pages, word processing documents, email messages
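A quick estimate shows the scale of these collections. The sketch below multiplies the approximate object counts and sizes given on this slide. It assumes "K" means 1,000 bytes, which is an assumption made only for this estimate; the function and variable names are illustrative.

```python
# Rough size estimates for the collections listed above.
# Assumption: "K" is taken as 1,000 bytes. Counts and sizes are the
# approximate figures from the slide, not measured values.

KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000


def collection_bytes(num_objects: int, object_bytes: int) -> int:
    """Total bytes for a collection of equally sized objects."""
    return num_objects * object_bytes


sky_survey = collection_bytes(2_000_000_000, 1 * KB)  # 2 billion x 1K
patents = collection_bytes(2_000_000, 75 * KB)        # 2 million x 75K

print(f"Sky survey: about {sky_survey / TB:.0f} TB")
print(f"Patents:    about {patents / GB:.0f} GB")
```

Under these assumptions the sky survey comes to roughly 2 TB spread over billions of tiny objects, so the cost is dominated by the number of objects rather than by their total volume.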
4DICE Technologies
- Persistent archives
- HPSS
- DB2/HPSS (Oracle/HPSS)
- Digital library
- SRB/MCAT
- InterLib
- IBM DL
- Information mediation
- The MIX project
- GIS data sources
- Multimedia sources
5DICE Applications
- Molecular biology - PDB
- Neuroscience - brain mapping images
- Social science - census data sets
- Digital library - ELIB, ADL, Infobus, CDL
- NARA - Electronic records management
- ASCI - Data visualization corridor
- NASA - Information Power Grid (IPG)
- GDE/Marconi - Image libraries
- CDL - AMICO image collection
- USPTO - Dist. Object Computation Testbed
6Managing very large data sets
The IBM High Performance Storage System (HPSS)
- Runs on 14-node IBM RS/6000 SP, including 8 four-way SMP nodes (Silver nodes)
- 1TB SSA disk, 3 StorageTek silos with 360TB capacity
- HiPPI connected devices, parallel I/O
- Over 70 (multithreaded) server processes
- Multiple classes of service for managing file storage: <2MB, 2-200MB, 200MB-6GB, 6GB-100TB
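The size-based classes of service can be pictured as a simple lookup on file size. This is a minimal sketch, not HPSS code: the class names are hypothetical, binary units are assumed, and each upper bound is treated as exclusive because the slide does not say which class a boundary size belongs to.

```python
# Sketch of choosing a class of service from a file's size.
# Assumptions: class names are hypothetical, sizes use binary units,
# and each upper bound is exclusive.

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# (exclusive upper bound in bytes, hypothetical class name)
CLASSES_OF_SERVICE = [
    (2 * MB, "small"),         # < 2MB
    (200 * MB, "medium"),      # 2-200MB
    (6 * GB, "large"),         # 200MB-6GB
    (100 * TB, "very_large"),  # 6GB-100TB
]


def class_of_service(size_bytes: int) -> str:
    """Return the first class whose upper bound exceeds the file size."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    for upper_bound, name in CLASSES_OF_SERVICE:
        if size_bytes < upper_bound:
            return name
    raise ValueError("file is larger than the largest class of service")


print(class_of_service(500 * 1024))  # a 500 KB file
print(class_of_service(3 * GB))      # a 3 GB file
```

Keeping the bounds in an ordered table means a change to the size ranges touches only the data, not the selection logic.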
7HPSS Archival Storage System
Diagram of the HPSS configuration. Labeled components:
- Silver nodes serving as tape / disk movers (DCE / FTP / HSI, log client), with SSA RAID
- One Silver node running the storage / purge, bitfile / migration, and nameservice / PVL servers and the log daemon
- RS6000 tape mover with PVR (9490)
- 9490 robots with eight, seven, and four tape drives, plus 3490 tape
- High Performance Gateway Node
- High node and Wide node disk movers with HiPPI drivers
- MaxStrat RAID
- TrailBlazer3 switch and HiPPI switch
- Disk capacities shown: 54 GB, 108 GB, 160 GB, 830 GB
8Managing large number of small data sets
The Integrated DB2/HPSS System
- Create Tablespace HPSS-SPACE Managed By Database Using FILE (HPSS <hpss-filename> <size> DISKBUF <path> <size>)
- Create Table SAMPLE-TABLE (C1 int, C2 float, C3 char, C4 CLOB, C5 BLOB) In REGULAR-SPACE
- Diagram labels: database table with columns C1 to C5, DB2, DB2 disk buffer, HPSS, HPSS disk cache
9Other DBMS/archival storage integration efforts
- Oracle / AMASS
- Oracle uses AMASS as a file server
- Objectivity / HPSS
- Being developed by Stanford SLAC
- Implements an OO staging system between Objectivity and HPSS
10Metadata-based access to data sets
The SDSC Storage Resource Broker (SRB)
Application (SRB client)
SRB Middleware
MCAT
SRB Servers
DB2, Oracle, Illustra, ObjectStore
HPSS, UniTree
UNIX, ftp
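The idea of metadata-based access can be sketched as a catalog that maps descriptive attributes to the storage resource and location of each data set, so a client asks for data by what it is rather than where it lives. This is not the SRB or MCAT API: every class, method, attribute, path, and entry below is hypothetical and made up for illustration.

```python
# Sketch of a metadata catalog in front of heterogeneous storage.
# Assumption: all names and entries here are hypothetical; this is
# not the SRB/MCAT interface.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    logical_name: str
    resource: str        # kind of storage system, e.g. "HPSS" or "UNIX"
    physical_path: str   # location inside that resource
    attributes: dict = field(default_factory=dict)


class MetadataCatalog:
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def find(self, **wanted) -> list:
        """Return entries whose attributes match every requested value."""
        return [
            entry for entry in self._entries
            if all(entry.attributes.get(k) == v for k, v in wanted.items())
        ]


catalog = MetadataCatalog()
catalog.register(CatalogEntry(
    "brain-scan-001", "HPSS", "/archive/scan001.img",
    {"discipline": "neuroscience", "kind": "image"}))
catalog.register(CatalogEntry(
    "census-1990", "UNIX", "/data/census/1990.dat",
    {"discipline": "social science", "kind": "table"}))

matches = catalog.find(discipline="neuroscience")
for entry in matches:
    print(entry.logical_name, "->", entry.resource, entry.physical_path)
```

The client never names a storage system in its query; the catalog lookup supplies the resource and path that a server would then use to fetch the data.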
11Querying heterogeneous information sources
- User interfaces send a query to the mediator and receive results
- Mediator (with views), with a local data repository, sends query fragments to the wrappers
- Wrappers convert the incoming query and the outgoing data
- Sources behind the wrappers: SQL database, spreadsheet, HTML and other files
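The mediator and wrapper roles on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class names, the single `min_year` query parameter, the table layout, and all rows are made up, and an in-memory SQLite database and a CSV string stand in for the SQL database and the spreadsheet.

```python
# Sketch of a mediator that sends one query to several wrappers and
# merges the answers. Assumption: all names and data are hypothetical.
import csv
import io
import sqlite3


class SqlWrapper:
    """Translates the request into SQL for a relational source."""

    def __init__(self, connection):
        self.connection = connection

    def fetch(self, min_year):
        rows = self.connection.execute(
            "SELECT title, year FROM papers WHERE year >= ?", (min_year,))
        return [{"title": t, "year": y, "source": "sql"} for t, y in rows]


class SpreadsheetWrapper:
    """Filters spreadsheet rows exported as CSV text."""

    def __init__(self, csv_text):
        self.csv_text = csv_text

    def fetch(self, min_year):
        reader = csv.DictReader(io.StringIO(self.csv_text))
        return [
            {"title": r["title"], "year": int(r["year"]), "source": "sheet"}
            for r in reader if int(r["year"]) >= min_year
        ]


class Mediator:
    """Presents one view over all wrapped sources."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, min_year):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.fetch(min_year))
        return sorted(results, key=lambda r: (r["year"], r["title"]))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO papers VALUES (?, ?)",
                 [("Archive design", 1997), ("Sky survey index", 1999)])
sheet = "title,year\nCensus tables,1998\nOld report,1990\n"

mediator = Mediator([SqlWrapper(conn), SpreadsheetWrapper(sheet)])
results = mediator.query(min_year=1997)
for row in results:
    print(row)
```

Each wrapper hides its source's native access method behind the same `fetch` call, which is what lets the mediator treat a database and a spreadsheet alike.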
12The MIX Project: Mediation of Information using XML
- TEAM
- UCSD CSE: Yannis Papakonstantinou, Pavel Velikhov, Victor Vianu
- SDSC DICE: Chaitan Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano
13MIX Components
- Wrapper tool-kit
- model information in a resource using XML DTD (or XML schema), including a mapping of source data to DTD
- provide mapping from XML query language to source query language / operations
- Mediator tool-kit
- allows definition of views across multiple resources
- views are expressed in a declarative query language
- provides a query engine (for composing results)
14MIX components...
- XML Matching And Structuring (XMAS) query language
- operates on a given set of XML documents to produce a new XML document, using XMAS algebra
- DOM-VXD: DOM Virtual XML Document
- a lazy implementation of DOM. Supports browsing/navigation of XML documents with a server-side, compute-as-you-go model
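The compute-as-you-go idea behind a lazy DOM can be sketched with a node that fetches its children only when a client first navigates into it. This is not the DOM-VXD implementation: the `LazyNode` class, the loader functions, and the sample data are hypothetical.

```python
# Sketch of lazy navigation: children are computed on first access.
# Assumption: class, loaders, and data are hypothetical, not DOM-VXD.


class LazyNode:
    def __init__(self, tag, load_children=None, text=""):
        self.tag = tag
        self.text = text
        self._load_children = load_children  # callable returning child nodes
        self._children = None                # nothing materialized yet

    @property
    def children(self):
        if self._children is None:
            loader = self._load_children
            self._children = list(loader()) if loader else []
        return self._children


fetch_log = []  # records which parts of the document were computed


def load_fields(index):
    fetch_log.append(f"posting-{index}")
    return [LazyNode("subject", text=f"Subject {index}")]


def load_postings():
    fetch_log.append("postings")
    return [
        LazyNode("posting", load_children=lambda i=i: load_fields(i))
        for i in range(3)
    ]


root = LazyNode("collection", load_children=load_postings)
second = root.children[1]            # computes the list of postings
subject = second.children[0].text    # computes only posting 1's fields
print(subject, fetch_log)
```

Only the posting the client actually opened was expanded; the other two remain unevaluated, which is the property that matters when the result set is large.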
15MIX components...
- Blended Browsing and Querying (BBQ) interface
- supports navigation and querying of XML documents
- generates XMAS queries on mediator views
- generates XMAS queries modified by DOM-VXD operations to incrementally evaluate the result set, to support navigation of XML documents
16Details of the MIX Scenario
- BBQ interfaces (View 1, View 2) send XMAS queries to the mediator and receive XML data
- Mediator: XMAS query engine with a local data repository; sends XMAS query fragments to the wrappers and receives XML data
- Wrappers convert the XMAS query to the local query language, e.g. SQL, and data in native format to XML
- Sources behind the wrappers: SQL database, spreadsheet, HTML files
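The two conversions a wrapper performs for a SQL source, query fragment to SQL and native rows to XML, can be sketched as follows. The fragment here is a plain Python dictionary standing in for an XMAS query fragment, whose real syntax is not shown on the slide; the table, column names, element names, and rows are all hypothetical.

```python
# Sketch of a SQL wrapper: fragment -> SQL, then rows -> XML.
# Assumption: the dictionary fragment is a stand-in for an XMAS query
# fragment, and all table, element, and data values are made up.
import sqlite3
import xml.etree.ElementTree as ET

ALLOWED_COLUMNS = ("name", "state", "population")


def fragment_to_sql(fragment):
    """Translate a simplified query fragment into SQL and parameters."""
    for column in fragment["where"]:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")
    sql = "SELECT name, state, population FROM counties"
    conditions = [f"{column} = ?" for column in fragment["where"]]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    sql += " ORDER BY name"
    return sql, list(fragment["where"].values())


def rows_to_xml(rows):
    """Wrap native result rows in XML elements."""
    root = ET.Element("counties")
    for name, state, population in rows:
        county = ET.SubElement(root, "county", {"state": state})
        ET.SubElement(county, "name").text = name
        ET.SubElement(county, "population").text = str(population)
    return root


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counties (name TEXT, state TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO counties VALUES (?, ?, ?)",
    [("Alpha", "CA", 100), ("Beta", "CA", 200), ("Gamma", "PA", 300)])

sql, params = fragment_to_sql({"where": {"state": "CA"}})
xml_text = ET.tostring(rows_to_xml(conn.execute(sql, params)),
                       encoding="unicode")
print(sql)
print(xml_text)
```

Column names are checked against a fixed list and values are passed as parameters, so the translated query never splices untrusted text into the SQL string.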
17The NARA Project
- Electronic Records Management and Persistent Archives
- Archive a variety of data
- Census data (Tiger files)
- E-mail (Usenet newsgroups used as proxy)
- Congressional voting records (ASCII, HTML files)
- Vietnam war casualty reports
- Miscellaneous word processing documents
- USPTO
- ...
18The NARA Usenet Collection
- 1 million Usenet postings
- Data archived in its original form
- Designed an XML DTD based on Usenet standard and analysis of 1 million documents
- Documents have headers with
- 6 required keyword fields (e.g. From, Date, Subject)
- 13 optional keyword fields (e.g. Followup-To, Keywords)
- Rest are unrecognized keyword fields (e.g. Abuse-Reports-To). Found about 2200 in 1 million messages.
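The three-way classification of header fields can be sketched as below. The field lists are the 6 required and 13 optional headers of RFC 1036; treating that RFC as the "Usenet standard" meant here is an assumption, although the counts match the slide. The XML element and attribute names are hypothetical and are not the DTD designed in the project, and the sample posting is made up.

```python
# Sketch: classify Usenet header fields and emit them as XML.
# Assumptions: field lists follow RFC 1036; the element and attribute
# names are hypothetical, not the project's DTD.
import xml.etree.ElementTree as ET
from email.parser import HeaderParser

REQUIRED = {"From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"}
OPTIONAL = {"Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization"}

# Header names are compared without regard to case.
REQUIRED_LC = {name.lower() for name in REQUIRED}
OPTIONAL_LC = {name.lower() for name in OPTIONAL}


def field_kind(name):
    key = name.lower()
    if key in REQUIRED_LC:
        return "required"
    if key in OPTIONAL_LC:
        return "optional"
    return "unrecognized"


def posting_to_xml(raw_posting):
    """Parse the header block and tag each field with its kind."""
    headers = HeaderParser().parsestr(raw_posting, headersonly=True)
    posting = ET.Element("posting")
    for name, value in headers.items():
        header = ET.SubElement(
            posting, "header", {"name": name, "kind": field_kind(name)})
        header.text = value
    return posting


raw = (
    "From: user@example.org\n"
    "Date: Thu, 28 Jan 1999 12:00:00 GMT\n"
    "Subject: Test posting\n"
    "Keywords: xml, archive\n"
    "Abuse-Reports-To: abuse@example.org\n"
    "\n"
    "Body text.\n"
)
kinds = [h.get("kind") for h in posting_to_xml(raw)]
print(kinds)
```

Unrecognized fields are kept and labeled rather than dropped, in line with archiving the data in its original form.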
19Ingestion and Retrieval of Data Collections
Extract metadata (SGML/XML)
Data Collection
HPSS
Extract metadata
Query Interface
DBMS
20On-going work
- Wrappers for GIS sources
- modeling GIS information sources with XML DTDs
- mapping from query language (XMAS) to operations supported by GIS
- returning output from GIS in the form of XML documents
- Mediation of GIS sources
- dealing with sources with differing capabilities, e.g. coverage, resolution, themes/layers
- techniques for specifying source capabilities and using that information in query processing
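One way to picture source capabilities in query processing is a declared description per source that the mediator checks before sending a request. This is a minimal sketch, not the technique under development in the project: the record layout, the matching rule, and the three sample sources are hypothetical.

```python
# Sketch: choose GIS sources whose declared capabilities fit a request.
# Assumption: the capability record and all sources are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapability:
    name: str
    coverage: tuple       # (min_lon, min_lat, max_lon, max_lat)
    resolution_m: float   # finest ground resolution in meters
    layers: frozenset     # themes/layers the source can return


def covers(coverage, region):
    """True when the coverage box fully contains the requested region."""
    return (coverage[0] <= region[0] and coverage[1] <= region[1]
            and coverage[2] >= region[2] and coverage[3] >= region[3])


def usable_sources(sources, region, layer, max_resolution_m):
    """Names of sources that cover the region, offer the layer, and are
    at least as fine as the requested resolution."""
    return [
        s.name for s in sources
        if covers(s.coverage, region)
        and layer in s.layers
        and s.resolution_m <= max_resolution_m
    ]


sources = [
    SourceCapability("state_roads", (-125, 32, -114, 42), 30.0,
                     frozenset({"roads", "rivers"})),
    SourceCapability("county_parcels", (-118, 32, -116, 34), 1.0,
                     frozenset({"parcels", "roads"})),
    SourceCapability("national_landcover", (-125, 24, -66, 50), 1000.0,
                     frozenset({"landcover"})),
]
region = (-117.5, 32.5, -116.5, 33.5)

coarse_ok = usable_sources(sources, region, "roads", 50.0)
fine_only = usable_sources(sources, region, "roads", 10.0)
print(coarse_ok, fine_only)
```

Filtering on declared capabilities first keeps the mediator from sending query fragments to sources that cannot answer them.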
21Announcements
- MIX Demo at the ACM SIGMOD '99 Intl. Conf. on Database Systems
- May 31 - June 3, 1999, Philadelphia, PA
- Birds of a Feather session on Data Modeling with XML
- 12-1:30 PM on Thursday 1/28 in Room 362, SDSC