Title: SciDAC SDM Center All Hands Meeting, October 5-7, 2005
1 Parallel I/O Middleware Optimizations and Future Directions
Northwestern University PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham
- SciDAC SDM Center All Hands Meeting, October 5-7, 2005
2 Outline
- Progress and accomplishments (Wei-keng Liao)
- Parallel netCDF
- Client-side file caching in MPI-IO
- Data-type I/O for non-contiguous file access in PVFS
- Future research directions (Alok Choudhary)
- I/O middleware
- Autonomic and active storage systems
3 Parallel netCDF
- NetCDF defines
- A set of APIs for file access
- A machine-independent file format
- Parallel netCDF work
- New APIs for parallel access
- Maintaining the same file format
- Tasks
- Built on top of MPI for portability and high performance
- Support for C and Fortran interfaces
- Support for external data representations
4 PnetCDF Current Status
- Version 1.0.0 was released on July 27, 2005
- Supported platforms
- Linux clusters, IBM SP, SGI Origin, Cray X, NEC SX
- Two sets of parallel APIs are completed (see the sketch below)
- High-level APIs (mimicking the serial netCDF APIs)
- Flexible APIs (extended to utilize MPI derived datatypes)
- Fully supported in both C and Fortran
- Support for large files (> 4 GB)
- Test suites
- Self-test codes ported from the Unidata netCDF package to validate against single-process results
- Parallel test codes for both sets of APIs
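A minimal sketch of the high-level API in use is shown below; the file name, dimension sizes, and block decomposition are illustrative and not taken from the slides. The flexible-API counterpart, ncmpi_put_vara_all(), additionally accepts an MPI derived datatype describing the memory buffer.

/* Sketch: collective write of a 2-D variable with the PnetCDF
 * high-level API.  Each process writes its own block of rows. */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid[2], varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create the file collectively; CDF-2 format allows > 4 GB files. */
    ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "y", nprocs * 10, &dimid[0]);
    ncmpi_def_dim(ncid, "x", 20, &dimid[1]);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 2, dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each process writes 10 rows x 20 columns collectively. */
    MPI_Offset start[2] = { rank * 10, 0 };
    MPI_Offset count[2] = { 10, 20 };
    float *buf = malloc(10 * 20 * sizeof(float));
    for (int i = 0; i < 10 * 20; i++) buf[i] = (float)rank;
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}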
5 Illustrative PnetCDF Users
- FLASH: astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
- ACTM: atmospheric chemical transport model, LLNL
- WRF-ROMS: regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
- ASPECT: data understanding infrastructure, ORNL
- pVTK: parallel visualization toolkit, ORNL
- PETSc: portable, extensible toolkit for scientific computation, ANL
- PRISM: PRogram for Integrated Earth System Modeling, with users from CC Research Laboratories, NEC Europe Ltd.
- ESMF: Earth System Modeling Framework, National Center for Atmospheric Research
- More
6 PnetCDF Future Work
- Non-blocking I/O APIs
- Performance improvement for data type conversion
- Type conversion while packing non-contiguous buffers
- Extending PnetCDF for newer applications, e.g., data analysis and mining
- Collaboration with application users
7 File Caching in MPI-IO
[Figure: the I/O software stack (Applications, Parallel netCDF, MPI-IO, PVFS, storage devices); this work addresses the MPI-IO layer]
8 File Caching for Parallel Apps
- Why file caching?
- Improves performance for repeated file access
- Enables a write-behind strategy
- Accumulates multiple small writes to better utilize network bandwidth
- May balance the workload for irregular I/O patterns
- Useful for checkpointing
- Enables data pre-fetching
- Useful for read-only applications (parallel data mining, visualization)
- Why not just use traditional caching strategies?
- Each client performs caching independently → cache incoherence
- I/O servers are in charge of cache coherence control → potential I/O serialization
- Inadequate for parallel environments where application clients frequently read/write shared files
9 Caching Sub-system in MPI-IO
- Application-aware file caching
- A user-level implementation in the MPI-IO library
- MPI communicators define the subsets of processes operating on a shared file
- Processes cooperate with each other to perform caching
- Data cached in one client can be directly accessed by another
- Moves cache coherence control from servers to clients
- Distributed coherence control (less overhead)
- Supports both collective and independent I/O
10 Design
- Cache metadata
- File-block-based granularity
- Cyclically stored across all processes
- Global cache pool
- Comprises the local memory of all processes
- A single copy of file data avoids coherence issues
- Two implementations (see the sketch below)
- Using an I/O thread (POSIX threads)
- Using the MPI remote-memory-access (RMA) facility
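To illustrate the RMA-based variant, here is a minimal sketch, not the actual implementation: each process contributes local memory to the global cache pool by exposing it as an MPI window, and a peer whose metadata lookup locates a needed file block remotely fetches the page directly with MPI_Get. The page size, page index, and owner rank are made-up values.

/* Sketch: fetching a remotely cached file block with MPI-2 RMA. */
#include <mpi.h>

#define PAGE_SIZE 65536
#define NPAGES    16

int main(int argc, char **argv)
{
    int rank;
    char *cache, page[PAGE_SIZE];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Local cache pages contributed to the global (distributed) pool. */
    MPI_Alloc_mem((MPI_Aint)NPAGES * PAGE_SIZE, MPI_INFO_NULL, &cache);
    MPI_Win_create(cache, (MPI_Aint)NPAGES * PAGE_SIZE, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Suppose the cyclically distributed metadata says the wanted block
     * is cached in page 3 of process "owner" (illustrative values).
     * Fetch it directly from that process, bypassing the file system;
     * coherence control is the caching layer's job, not shown here. */
    int owner = 0;
    if (rank != owner) {
        MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
        MPI_Get(page, PAGE_SIZE, MPI_BYTE, owner,
                (MPI_Aint)3 * PAGE_SIZE, PAGE_SIZE, MPI_BYTE, win);
        MPI_Win_unlock(owner, win);   /* data is valid after the unlock */
    }

    MPI_Win_free(&win);
    MPI_Free_mem(cache);
    MPI_Finalize();
    return 0;
}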
11 Example Read Operation
12 Future Work
- Data pre-fetching
- Instructional (through MPI info) and non-instructional (based on sequential access)
- Collective write-behind for data check-pointing
- Stand-alone distributed lock sub-system
- Using the MPI-2 remote-memory access facility
- Design of new MPI file hints for caching (see the sketch below)
- Application I/O pattern study
- Structured/unstructured AMR
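To show how such hints would reach the library, here is a minimal sketch of passing caching hints through an MPI_Info object at open time. The hint names below are hypothetical placeholders for hints yet to be designed, not existing keys; unrecognized hints are simply ignored by an MPI-IO implementation.

/* Sketch: attaching (hypothetical) caching hints to a file open. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "client_cache", "enable");        /* hypothetical key */
    MPI_Info_set(info, "client_cache_size", "67108864"); /* hypothetical key */

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes would go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}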
13 Data-type I/O in PVFS
[Figure: the same I/O software stack (Applications, Parallel netCDF, MPI-IO, PVFS, storage devices); this work addresses the PVFS layer]
14 Non-contiguous I/O
- Four types
- Contiguous both in memory and file
- Contiguous in memory, non-contiguous in file
- Non-contiguous in memory, contiguous in file
- Non-contiguous both in memory and file
- Each segment is an I/O request described by an (offset, length) pair (see the sketch below)
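As a concrete example of the second case (contiguous in memory, non-contiguous in file), the sketch below uses an MPI derived datatype as the file view so that each process writes a column block of a 2-D array; the file name and array sizes are illustrative.

/* Sketch: non-contiguous file access described by a subarray datatype. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Global 8 x (2*nprocs) int array, stored row-major in the file.
     * Each process owns a 2-column block, non-contiguous in the file. */
    int gsizes[2] = { 8, 2 * nprocs };
    int lsizes[2] = { 8, 2 };
    int starts[2] = { 0, 2 * rank };

    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    int *buf = malloc(8 * 2 * sizeof(int));   /* contiguous in memory */
    for (int i = 0; i < 16; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, 16, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}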
15 Implementations
- POSIX I/O
- One call per (offset, length)
- Generates a large number of I/O requests
- Data sieving (sketch below)
- A single (offset, length) covering multiple segments
- Accesses unused data and introduces consistency-control overhead
- List I/O
- A single call handles multiple non-contiguous accesses
- Passes multiple (offset, length) pairs across the network
[Figure: with POSIX I/O the application process issues one I/O request per segment to the client-side file system, while with list I/O a single list I/O request carries all the segments from the client-side file system across the network to the server-side file system]
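For the data-sieving strategy above, a minimal POSIX sketch follows; the file name, segment list, and sizes are invented for illustration. It reads one contiguous region covering all segments (including the unused holes the slide mentions) and then unpacks only the wanted pieces.

/* Sketch of data sieving: one large pread() instead of one call per
 * (offset, length) segment. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    struct { off_t offset; size_t length; } seg[3] = {
        { 0, 256 }, { 1024, 256 }, { 4096, 256 }   /* non-contiguous in file */
    };
    int fd = open("data.bin", O_RDONLY);           /* hypothetical file */
    if (fd < 0) return 1;

    /* One request covering the full extent of all segments. */
    off_t  start  = seg[0].offset;
    size_t extent = seg[2].offset + seg[2].length - start;
    char  *sieve  = malloc(extent);
    pread(fd, sieve, extent, start);               /* reads the unused holes too */

    /* Unpack only the wanted segments into the user buffer. */
    char  *user = malloc(3 * 256);
    size_t pos  = 0;
    for (int i = 0; i < 3; i++) {
        memcpy(user + pos, sieve + (seg[i].offset - start), seg[i].length);
        pos += seg[i].length;
    }

    free(sieve);
    free(user);
    close(fd);
    return 0;
}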
16 Data-type I/O
- A single request all the way to the servers
- Abandons the offset-length pair representation
- Borrows the MPI datatype concept to describe non-contiguous access patterns
- New file system data types
- New file system interfaces
- An implementation in PVFS
- On both the client and server sides
[Figure: the application process passes a datatype I/O request to the PVFS client, which forwards a single request across the network to the PVFS server]
17 Summary of Accomplishments
- High-level I/O
- Parallel netCDF
- Low-level I/O
- MPI-IO file caching
- Parallel file system
- Data-type I/O in PVFS
18 Future Research
19 Typical Components in I/O Systems
- Based on many current apps
- High level
- E.g., netCDF, HDF, ABC
- Applications use these
- Mid level
- E.g., MPI-IO
- Performance experience
- Low level
- E.g., file systems
- Critical for the performance of the layers above
- More access information is lost as more components are used
[Figure: compute nodes run applications over a client-side file system and connect across the network to multiple I/O servers; end-to-end performance is critical]
21 Decouple What from How and Be Proactive
[Figure: contrast of the current situation and the goal. Currently the user is burdened by ineffective interfaces and non-communicating layers. The goal is an I/O software optimization layer between the applications (App1 through App4) and the datasets, file systems, data management, and hierarchical storage below it, one that understands access patterns and provides caching, load balancing, collective I/O, data reorganization, fault tolerance, and QoS (speed, bandwidth, latency) across regular/irregular, small/large, local/remote, and streaming workloads]
22 Component Design for I/O
- Application-aware
- Captures applications' file access information
- Relationships between files, objects, and users
- Environment-aware
- Network (reliability, security), storage devices (active disks)
- Context-aware
- Binding data attributes to files, indexing for fast search
- High-performance I/O needs support from
- Languages and compilers
- I/O libraries
- File systems
- Storage devices
23 Component Interface Design
- Informative
- Should deliver access/storage information top-down and bottom-up
- Flexibility
- Should describe arbitrary data distributions in memory buffers, files, and storage devices
- Functionality
- Asynchronous operations, read-ahead, write-behind, replication
- Provides the ability for additional innovation
- Object-based I/O
- For hardware control (I/O co-processors, active disks, object-based file systems, etc.)
24 Future Work in MPI-IO
- Investigate interface extensions
- Client-side caching sub-system
- Implementations of various I/O strategies: buffering, pre-fetching, replication, migration
- Adaptive caching mechanisms and algorithms for optimizing different access patterns
- Distributed mutual-exclusion locking sub-system (see the sketch below)
- Shared resources, such as files and memory
- Pipelined locking (overlap lock waiting time with I/O)
- Work with HDF5 and parallel netCDF
- Design I/O strategies for metadata and data
- Metadata: small, overlapping, repeated, strong consistency requirements
- Array data: large, less frequently updated
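As background for the RMA-based locking work, the sketch below shows only the MPI-2 passive-target primitives such a sub-system could build on: an exclusive lock epoch around an atomic update of a counter exposed by rank 0. It is not the group's lock design, and the window layout and counter are illustrative.

/* Sketch: mutual exclusion on a window target via MPI-2 passive-target RMA. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long counter = 0, one = 1, result = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Rank 0 exposes the shared counter; the other ranks expose nothing. */
    MPI_Win_create(rank == 0 ? &counter : NULL,
                   rank == 0 ? sizeof(long) : 0,
                   sizeof(long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* The exclusive lock serializes concurrent epochs on the target, and
     * MPI_Accumulate applies the update atomically. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Accumulate(&one, 1, MPI_LONG, 0, 0, 1, MPI_LONG, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        /* Read the counter back through the window for a consistent view. */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(&result, 1, MPI_LONG, 0, 0, 1, MPI_LONG, win);
        MPI_Win_unlock(0, win);
        printf("counter = %ld (expected %d)\n", result, nprocs);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}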
25 Future Work in Parallel File Systems
- File caching (focused on parallel apps)
- File versioning
- An alternative to file locking
- Reliability and availability aspects
- Guarantees atomicity in the presence of client or I/O system failures
- Can enable efficient RAID-type schemes in parallel file systems (because of atomicity)
- Dynamic rebalancing of I/O
- File list locks
- Locks on multiple regions in a single request
26 Active Storage System (reconfigurable system)
[Figure: testbed of an ML310 host and four ML310 boards (ML310-board1 through ML310-board4) connected through a switch, with the host attached to the external network]
- Xilinx XC2VP30, Virtex-II Pro family
- 30,816 logic cells (3,424 CLBs)
- 2 PPC405 embedded cores
- 2,448 Kb of BRAM (136 18-Kb blocks)
- 136 dedicated 18x18 multiplier blocks
- Software
- Data mining
- Encryption
- Functions and runtime libraries
- Linux micro-kernel
27 MineBench: a data mining benchmark suite