Title: Parallel and Grid I/O Infrastructure
1. Parallel and Grid I/O Infrastructure
- Rob Ross, Argonne National Lab
- Parallel Disk Access and Grid I/O (P4)
- SDM All Hands Meeting
- March 26, 2002
2. Participants
- Argonne National Laboratory
- Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan
- Northwestern University
- Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li
- Collaborators
- Lawrence Livermore National Laboratory
- Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow
- Application groups
3. Focus Areas in Project
- Parallel I/O on clusters
- Parallel Virtual File System (PVFS)
- MPI-IO hints
- ROMIO MPI-IO implementation
- Grid I/O
- Linking PVFS and ROMIO with Grid I/O components
- Application interfaces
- NetCDF and HDF5
- Everything is interconnected!
- Wei-keng Liao will drill down into specific tasks
4. Parallel Virtual File System
- Lead developer: R. Ross (ANL)
- R. Latham (ANL), developer
- A. Ching, K. Coloma (NWU), collaborators
- Open source, scalable parallel file system
- Project began in the mid-90s at Clemson University
- Now a collaboration between Clemson and ANL
- Successes
- In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, ...)
- 100 unique downloads/month
- 160 users on mailing list, 90 on developers list
- Multiple gigabyte/second performance shown
5. Keeping PVFS Relevant: PVFS2
- Scaling to thousands of clients and hundreds of servers requires some design changes
- Distributed metadata
- New storage formats
- Improved fault tolerance
- New technology, new features
- High-performance networking (e.g. InfiniBand, VIA)
- Application metadata
- New design and implementation warranted (PVFS2)
6. PVFS1, PVFS2, and SDM
- Maintaining PVFS1 as a resource to the community
- Providing support, bug fixes
- Encouraging use by application groups
- Adding functionality to improve performance (e.g. tiled display)
- Implementing next-generation parallel file system
- Basic infrastructure for future PFS work
- New physical distributions (e.g. chunking)
- Application metadata storage
- Ensuring that a working parallel file system will continue to be available on clusters as they scale
7. Data Staging for Tiled Display
- Contact: Joe Insley (ANL)
- Commodity components
- Projectors, PCs
- Provide very high resolution visualization
- Staging application preprocesses frames into a tile stream for each visualization node
- Uses MPI-IO to access data from PVFS file system
- Streams of tiles are merged into movie files on visualization nodes
- End goal is to display frames directly from PVFS
- Enhancing PVFS and ROMIO to improve performance
8. Example Tile Layout
- 3x2 display, 6 readers
- Frame size is 2532x1408 pixels
- Tile size is 1024x768 pixels (overlapped)
- Movies broken into frames, with each frame stored in its own file in PVFS
- Readers pull data from PVFS and send to display
9. Tested Access Patterns
- Subtile
- Each reader grabs a piece of a tile
- Small noncontiguous accesses
- Lots of accesses for a frame
- Tile
- Each reader grabs a whole tile
- Larger noncontiguous accesses
- Six accesses for a frame
- Reading individual pieces is simply too slow
10. Noncontiguous Access in ROMIO
- ROMIO performs data sieving to cut down the number of I/O operations
- Uses large reads which grab multiple noncontiguous pieces
- Example: reading tile 1
11. Noncontiguous Access in PVFS
- ROMIO data sieving
- Works for all file systems (just uses contiguous read)
- Reads extra data (three times the desired amount)
- Noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU)
- Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU)
- PVFS and ROMIO support implemented (see the sketch below)
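The C sketch below is illustrative only (not the tiled display reader itself): it shows how an MPI-IO client can describe one tile of a frame as an MPI subarray type and set a file view, which is exactly the kind of noncontiguous request that data sieving or the PVFS list I/O primitive ends up servicing. The frame and tile dimensions come from the example layout above; the file name, the 3-byte RGB pixel size, and the fixed tile origin are assumptions.

    /* Illustrative sketch: one MPI process reads one 1024x768 tile out of a
     * 2532x1408-pixel frame file using an MPI-IO file view, so ROMIO sees the
     * noncontiguous request as a single read. Assumes 3 bytes (RGB) per pixel
     * and a fixed tile origin; error handling is omitted. */
    #include <mpi.h>

    #define FRAME_W 2532
    #define FRAME_H 1408
    #define TILE_W  1024
    #define TILE_H  768

    int main(int argc, char **argv)
    {
        MPI_Datatype pixel, tile;
        MPI_File fh;
        static unsigned char buf[TILE_H][TILE_W][3];

        MPI_Init(&argc, &argv);

        /* 3-byte RGB pixel as the element type */
        MPI_Type_contiguous(3, MPI_BYTE, &pixel);
        MPI_Type_commit(&pixel);

        /* tile = TILE_H x TILE_W subarray of the FRAME_H x FRAME_W frame;
           starts[] would normally be computed from the reader's tile position */
        int sizes[2]    = { FRAME_H, FRAME_W };
        int subsizes[2] = { TILE_H, TILE_W };
        int starts[2]   = { 0, 0 };
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, pixel, &tile);
        MPI_Type_commit(&tile);

        MPI_File_open(MPI_COMM_WORLD, "frame0000.raw",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, pixel, tile, "native", MPI_INFO_NULL);

        /* Collective read of the whole tile; ROMIO may service this with data
           sieving or, on PVFS, with the noncontiguous (list I/O) primitive. */
        MPI_File_read_all(fh, buf, TILE_H * TILE_W, pixel, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&tile);
        MPI_Type_free(&pixel);
        MPI_Finalize();
        return 0;
    }
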
12. Metadata in File Systems
- Associative arrays of information related to a file
- Seen in other file systems (MacOS, BeOS, ReiserFS)
- Some potential uses
- Ancillary data (from applications)
- Derived values
- Thumbnail images
- Execution parameters
- I/O library metadata
- Block layout information
- Attributes on variables
- Attributes of dataset as a whole
- Headers
- Keeps header out of data stream
- Eliminates need for alignment in libraries
13. Metadata and PVFS2 Status
- Prototype metadata storage for PVFS2 implemented
- R. Ross (ANL)
- Uses Berkeley DB for storage of keyword/value pairs (see the sketch below)
- Need to investigate how to interface to MPI-IO
- Other components of PVFS2 coming along
- Networking in testing (P. Carns, Clemson)
- Client side API under development (Clemson)
- PVFS2 beta early fourth quarter?
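As a rough illustration of the keyword/value storage mentioned above (not the PVFS2 prototype code), the sketch below stores a single attribute in a Berkeley DB table. The function name meta_put and the database file argument are hypothetical, and the call sequence assumes the Berkeley DB 4.x C API.

    /* Illustrative sketch: store one keyword/value metadata attribute in a
     * Berkeley DB b-tree. meta_put() and the database file are hypothetical. */
    #include <string.h>
    #include <db.h>

    int meta_put(const char *dbfile, const char *keyword,
                 const void *val, size_t len)
    {
        DB *dbp;
        DBT key, data;
        int ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0)
            return ret;
        if ((ret = dbp->open(dbp, NULL, dbfile, NULL,
                             DB_BTREE, DB_CREATE, 0644)) != 0) {
            dbp->close(dbp, 0);
            return ret;
        }

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *) keyword;
        key.size  = strlen(keyword) + 1;
        data.data = (void *) val;
        data.size = len;

        ret = dbp->put(dbp, NULL, &key, &data, 0);   /* keyword -> value */
        dbp->close(dbp, 0);
        return ret;
    }
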
14. ROMIO MPI-IO Implementation
- Written by R. Thakur (ANL)
- R. Ross and R. Latham (ANL), developers
- K. Coloma (NWU), collaborator
- Implementation of MPI-2 I/O specification
- Operates on a wide variety of platforms
- Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
- Successes
- Adopted by industry (e.g. Compaq, HP, SGI)
- Used at ASCI sites (e.g. LANL Blue Mountain)
15. ROMIO Current Directions
- Support for PVFS noncontiguous requests
- K. Coloma (NWU)
- Hints: key to efficient use of HW/SW components
- Collective I/O
- Aggregation (synergy)
- Performance portability
- Controlling ROMIO Optimizations
- Access patterns
- Grid I/O
- Scalability
- Parallel I/O benchmarking
16. ROMIO Aggregation Hints
- Part of ASCI Software Pathforward project
- Contact: Gary Grider (LANL)
- Implementation by R. Ross, R. Latham (ANL)
- Hints control which processes do I/O in collectives (see the sketch below)
- Examples
- All processes on same node as attached storage
- One process per host
- Additionally limit the number of processes that open the file
- Good for systems without a shared FS (e.g. O2K clusters)
- More scalable
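A minimal sketch of what the aggregation hints look like from the application side, assuming ROMIO's documented collective-buffering hint names (cb_nodes, cb_config_list); exact names and accepted values can vary by ROMIO version, and the output file name is a placeholder.

    /* Illustrative sketch: restrict collective I/O to one aggregator per host
     * and cap the number of aggregators via ROMIO's collective-buffering hints. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "4");          /* at most 4 aggregators */
        MPI_Info_set(info, "cb_config_list", "*:1");  /* one aggregator per host */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective writes here are funneled through the aggregators ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }
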
17. Aggregation Example
- Cluster of SMPs
- Only one SMP box has connection to disks
- Data is aggregated to processes on single box
- Processes on that box perform I/O on behalf of the others
18. Optimization Hints
- MPI-IO calls should be chosen to best describe the I/O taking place
- Use of file views
- Collective calls for inherently collective operations
- Unfortunately, sometimes choosing the right calls can result in lower performance
- Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls
- Avoid the misapplication of optimizations (aggregation, data sieving)
19. Optimization Problems
- ROMIO checks for applicability of the two-phase optimization when collective I/O is used
- With the tiled display application using subtile access, this optimization is never used
- Checking for applicability requires communication between processes
- Results in a 33% drop in throughput (on test system)
- A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application (see the sketch below)
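A sketch of that hint in use, assuming ROMIO's romio_cb_read/romio_cb_write hint names for enabling or disabling collective buffering; the helper function and file path are illustrative only.

    /* Illustrative sketch: open a file with ROMIO's collective-buffering
     * optimization disabled so the two-phase applicability check (and its
     * communication) is skipped on every collective operation. */
    #include <mpi.h>

    static MPI_File open_without_two_phase(const char *path)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_read", "disable");   /* no two-phase on reads */
        MPI_Info_set(info, "romio_cb_write", "disable");  /* no two-phase on writes */

        MPI_File_open(MPI_COMM_WORLD, (char *) path,
                      MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }
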
20. Access Pattern Hints
- Collaboration between ANL and LLNL (and growing)
- Examining how access pattern information can be passed to the MPI-IO interface, and through to the underlying file system
- Used as input to optimizations in MPI-IO layer
- Used as input to optimizations in FS layer as well
- Prefetching
- Caching
- Writeback
21. Status of Hints
- Aggregation control finished
- Optimization hints
- Collectives, data sieving read finished
- Data sieving write control in progress
- PVFS noncontiguous I/O control in progress
- Access pattern hints
- Exchanging log files, formats
- Getting up to speed on respective tools
22. Parallel I/O Benchmarking
- No common parallel I/O benchmarks
- New effort (consortium) to
- Define some terminology
- Define test methodology
- Collect tests
- Goal: provide a meaningful test suite with consistent measurement techniques
- Interested parties at numerous sites (and growing)
- LLNL, Sandia, UIUC, ANL, UCAR, Clemson
- In infancy
23. Grid I/O
- Looking at ways to connect our I/O work with components and APIs used in the Grid
- New ways of getting data in and out of PVFS
- Using MPI-IO to access data in the Grid
- Alternative mechanisms for transporting data across the Grid (synergy)
- Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications)
- Facilitate moving between Grid and cluster worlds
24. Local Access to GridFTP Data
- Grid I/O contact: B. Allcock (ANL)
- GridFTP striped server provides a high-throughput mechanism for moving data across the Grid
- Relies on a proprietary storage format on striped servers
- Must manage metadata on stripe location
- Data stored on the servers must be read back from the servers
- No alternative or more direct way to access local data
- Next version assumes a shared file system underneath
25. GridFTP Striped Servers
- Remote applications connect to multiple striped servers to quickly transfer data over the Grid
- Multiple TCP streams better utilize the WAN
- Local processes would need to use the same mechanism to get to data on the striped servers
26. PVFS under GridFTP
- With PVFS underneath, GridFTP servers would store data on PVFS I/O servers
- Stripe information stored on the PVFS metadata server
27. Local Data Access
- Application tasks that are part of a local parallel job could access data directly off the PVFS file system
- Output from the application could be retrieved remotely via GridFTP
28. MPI-IO Access to GridFTP
- Applications such as the tiled display reader desire remote access to GridFTP data
- Access through MPI-IO would allow this with no code changes
- ROMIO's ADIO interface provides the infrastructure necessary to do this
- MPI-IO hints provide a means for specifying number of stripes, transfer sizes, etc. (see the sketch below)
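The sketch below only conveys the idea from the application side; the hint keys (num_stripes, transfer_block_size), the helper function, and the URL-style path are hypothetical placeholders, not an existing ROMIO GridFTP interface.

    /* Purely illustrative: an application opening remote GridFTP-resident data
     * through MPI-IO and passing transfer parameters as hints. Hint keys and
     * the URL-style path are hypothetical placeholders. */
    #include <mpi.h>

    static MPI_File open_gridftp_file(const char *url)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "num_stripes", "4");                 /* hypothetical hint */
        MPI_Info_set(info, "transfer_block_size", "1048576");   /* hypothetical hint */

        MPI_File_open(MPI_COMM_WORLD, (char *) url, MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }
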
29. WAN File Transfer Mechanism
- B. Gropp (ANL), P. Dickens (IIT)
- Applications
- PPM and COMMAS (Paul Woodward, UMN)
- Alternative mechanism for moving data across the Grid using UDP
- Focuses on requirements for file movement (see the sketch below)
- All data must arrive at the destination
- Ordering doesn't matter
- Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
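The sketch below is not the WAN FT implementation; it only illustrates the bookkeeping those requirements imply: blocks may arrive in any order, a bitmap records what has arrived, and anything still missing is re-requested without stalling the rest of the transfer. The block count and the printf stand-in for a retransmission request are arbitrary.

    /* Illustrative bookkeeping only: blocks of a file may arrive in any order
     * over UDP; a bitmap tracks which have arrived, and missing blocks are
     * listed for retransmission. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 1024

    static unsigned char received[NBLOCKS];   /* 1 if block has arrived */

    /* Record an arriving block; duplicates are harmless. */
    static void block_arrived(int blockno)
    {
        if (blockno >= 0 && blockno < NBLOCKS)
            received[blockno] = 1;
    }

    /* After a pass over the file, ask the sender for whatever is missing. */
    static int request_missing(void)
    {
        int i, missing = 0;
        for (i = 0; i < NBLOCKS; i++) {
            if (!received[i]) {
                printf("retransmit block %d\n", i);  /* stand-in for a NACK */
                missing++;
            }
        }
        return missing;
    }

    int main(void)
    {
        memset(received, 0, sizeof(received));
        /* ... receive loop would call block_arrived() per UDP datagram ... */
        block_arrived(0);
        block_arrived(2);
        request_missing();   /* block 1 and the rest would be re-requested */
        return 0;
    }
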
30. WAN File Transfer Performance
- Comparing TCP utilization to WAN FT technique
- See 10-12% utilization with a single TCP stream (8 streams to approach max. utilization)
- With WAN FT, obtain near 90% utilization and more uniform performance
31. Grid I/O Status
- Planning with Grid I/O group
- Matching up components
- Identifying useful hints
- Globus FTP client library is available
- 2nd generation striped server being implemented
- XIO interface prototyped
- Hooks for alternative local file systems
- Obvious match for PVFS under GridFTP
32. NetCDF
- Applications in climate and fusion
- PCM
- John Drake (ORNL)
- Weather Research and Forecast Model (WRF)
- John Michalakes (NCAR)
- Center for Extended Magnetohydrodynamic Modeling
- Steve Jardin (PPPL)
- Plasma Microturbulence Project
- Bill Nevins (LLNL)
- Maintained by Unidata Program Center
- API and file format for storing multidimensional datasets and associated metadata (in a single file)
33. NetCDF Interface
- Strong points
- It's a standard!
- I/O routines allow for subarray and strided access with single calls
- Access is clearly split into two modes (see the sketch below)
- Defining the datasets (define mode)
- Accessing and/or modifying the datasets (data mode)
- Weakness: no parallel writes, limited parallel read capability
- This forces applications to ship data to a single node for writing, severely limiting usability in I/O-intensive applications
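For readers unfamiliar with the interface, a minimal serial example of the two modes and a single-call subarray write follows; the file, dimension, and variable names are made up, and error checking is omitted.

    /* Minimal serial netCDF sketch: define mode creates dimensions and a
     * variable, then data mode writes a subarray with a single call. */
    #include <netcdf.h>

    int main(void)
    {
        int ncid, dimids[2], varid;
        size_t start[2] = { 0, 0 };     /* subarray origin */
        size_t count[2] = { 2, 3 };     /* subarray shape  */
        float data[2][3] = { {1, 2, 3}, {4, 5, 6} };

        nc_create("example.nc", NC_CLOBBER, &ncid);        /* enter define mode */
        nc_def_dim(ncid, "y", 4, &dimids[0]);
        nc_def_dim(ncid, "x", 6, &dimids[1]);
        nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
        nc_enddef(ncid);                                   /* switch to data mode */

        /* One call writes the whole 2x3 subarray at offset (0,0). */
        nc_put_vara_float(ncid, varid, start, count, &data[0][0]);

        nc_close(ncid);
        return 0;
    }
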
34. Parallel NetCDF
- Rich I/O routines and explicit define/data modes provide a good foundation
- Existing applications are already describing noncontiguous regions
- Modes allow for a synchronization point when file layout changes
- Missing
- Semantics for parallel access
- Collective routines
- Option for using MPI datatypes
- Implement in terms of MPI-IO operations
- Retain file format for interoperability
35. Parallel NetCDF Status
- Design document created
- B. Gropp, R. Ross, and R. Thakur (ANL)
- Prototype in progress
- J. Li (NWU)
- Focus is on write functions first
- Biggest bottleneck for checkpointing applications
- Read functions follow
- Investigate alternative file formats in future
- Address differences in access modes between writing and reading
36. FLASH Astrophysics Code
- Developed at the ASCI Center at the University of Chicago
- Contact: Mike Zingale
- Adaptive mesh refinement (AMR) code for simulating astrophysical thermonuclear flashes
- Written in Fortran 90; uses MPI for communication, HDF5 for checkpointing and visualization data
- Scales to thousands of processors, runs for weeks, needs to checkpoint
- At the time, I/O was a bottleneck (½ of runtime on 1024 processors)
37. HDF5 Overhead Analysis
- Instrumented FLASH I/O to log calls to H5Dwrite
- [Figure: time in H5Dwrite vs. time in the underlying MPI_File_write_at calls]
38. HDF5 Hyperslab Operations
- White region is hyperslab gather (from memory)
- Cyan is scatter (to file)
39. Hand-Coded Packing
- Packing time is in the black regions between bars
- Nearly an order of magnitude improvement (see the sketch below)
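To make the two paths concrete, the sketch below (illustrative only, not the FLASH I/O code) writes the interior of a guard-cell-padded block either by selecting a memory hyperslab and letting H5Dwrite gather it, or by hand-packing the interior into a contiguous buffer first; the block sizes, file name, and dataset name are assumptions, and the 5-argument H5Dcreate call follows the HDF5 1.6-era API.

    /* Illustrative sketch: write the 8x8 interior of a 12x12 block either
     * (a) via a memory hyperslab that HDF5 gathers itself, or (b) by
     * hand-packing into a contiguous buffer and writing from that. */
    #include <hdf5.h>

    #define NB 12   /* block size including guard cells */
    #define NI 8    /* interior size */
    #define NG 2    /* guard cells on each side */

    int main(void)
    {
        double block[NB][NB], packed[NI][NI];
        hsize_t mdims[2] = { NB, NB }, fdims[2] = { NI, NI };
        hsize_t start[2] = { NG, NG }, count[2] = { NI, NI };
        hid_t file, mspace, pspace, fspace, dset;
        int i, j;

        for (i = 0; i < NB; i++)
            for (j = 0; j < NB; j++)
                block[i][j] = i * NB + j;

        file   = H5Fcreate("flash_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        fspace = H5Screate_simple(2, fdims, NULL);
        dset   = H5Dcreate(file, "interior", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT);

        /* (a) hyperslab path: HDF5 gathers the noncontiguous interior itself */
        mspace = H5Screate_simple(2, mdims, NULL);
        H5Sselect_hyperslab(mspace, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, block);

        /* (b) hand-packed path: copy the interior into a contiguous buffer,
           then hand HDF5 an already-contiguous write */
        for (i = 0; i < NI; i++)
            for (j = 0; j < NI; j++)
                packed[i][j] = block[i + NG][j + NG];
        pspace = H5Screate_simple(2, fdims, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, pspace, fspace, H5P_DEFAULT, packed);

        H5Sclose(mspace);
        H5Sclose(pspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }
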
40. Wrap Up
- Progress being made on multiple fronts
- ANL/NWU collaboration is strong
- Collaborations with other groups maturing
- Balance of immediate payoff and medium-term infrastructure improvements
- Providing expertise to application groups
- Adding functionality targeted at specific applications
- Building core infrastructure to scale, ensure availability
- Synergy with other projects
- On to Wei-keng!