1 Profile-Guided I/O Partitioning
- Yijian Wang
- David Kaeli
- Electrical and Computer Engineering Department
- Northeastern University
- {yiwang, kaeli}@ece.neu.edu
2 Outline
- Introduction
- Related work
- Profile-guided I/O partitioning
- Benchmarks
- Experimental results
- Conclusions and future work
3 Introduction
- The I/O bottleneck
- The growing gap between the speed of processors and I/O devices
- Some applications access disks very frequently
- I/O intensive applications
- Multimedia applications
- Database applications
- Parallel scientific applications
4 Related work
- Fast disks
- FC-connected SCSI disks
- Smart caching I/O controller (EMC, IO Integrity)
- Parallel I/O
- Parallel disks (e.g., RAID)
- Parallel file systems (NFS, PIOF, HPS, etc.)
- Runtime parallel systems (MPI-IO, ROMIO, ADIO)
- Compiler technology
- Loop tiling, compiler-directed collective I/O
- To achieve high performance, I/O should be parallelized at multiple levels (application, file system, disks)
5 I/O Partitioning
- Our target applications are parallel scientific codes running on Beowulf clusters
- I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning)
- Ideally, every process accesses only files on its local disk, though this is typically not possible due to data sharing (a minimal sketch of this pattern follows this list)
- How can we recognize the access patterns?
- dynamically (profiling)
- statically (compiler)
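To make the ideal case concrete, here is a minimal mpi4py sketch, not the authors' code, in which each rank reads its own partition file; the path layout is an assumption:

    # Hypothetical: each MPI rank reads its own partition file,
    # ideally placed on the node's local disk.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # One partition file per process; the path is illustrative.
    path = "/local/scratch/part.%d" % rank

    fh = MPI.File.Open(MPI.COMM_SELF, path, MPI.MODE_RDONLY)
    buf = bytearray(fh.Get_size())
    fh.Read_at(0, buf)   # one contiguous read of the whole local partition
    fh.Close()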
6 Profile generation
1. Run the application
2. Capture I/O traces
3. Apply our partitioning algorithm
4. Rerun the tuned application
7 I/O traces and partitioning
- For every process and every contiguous file access, we capture the following I/O profile information (see the sketch after this list)
- Process ID
- File ID
- Address
- Chunk size
- I/O operation (read/write)
- Timestamp
- Generate a partition for every process
- Optimal partitioning is NP-complete, so we use a greedy heuristic (next slide)
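A minimal sketch of capturing these records, assuming an mpi4py application; the TracedFile wrapper and in-memory log are illustrative, not the authors' instrumentation:

    import time
    from mpi4py import MPI

    trace = []  # in-memory log; a real tool would persist this per process

    class TracedFile:
        """Wraps an MPI.File handle and logs one record per contiguous access."""
        def __init__(self, comm, path, amode):
            self.fh = MPI.File.Open(comm, path, amode)
            self.rank = comm.Get_rank()
            self.path = path

        def read_at(self, offset, buf):
            # (process ID, file ID, address, chunk size, op, timestamp)
            trace.append((self.rank, self.path, offset, len(buf), "R", time.time()))
            self.fh.Read_at(offset, buf)

        def write_at(self, offset, buf):
            trace.append((self.rank, self.path, offset, len(buf), "W", time.time()))
            self.fh.Write_at(offset, buf)

        def close(self):
            self.fh.Close()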
8 Our Greedy Algorithm

For each MPI-IO process:
    create a file partition
For each contiguous data chunk:
    identify the process that most frequently accesses this chunk
    assign the chunk to that process's partition
For each partition:
    reorder data in the partition based on first access to each chunk
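A runnable Python sketch of this greedy pass, assuming trace records shaped like the tuples logged above; the chunk key and names are illustrative:

    from collections import Counter, defaultdict

    def partition(trace):
        """Greedy pass: assign each chunk to the rank that touches it most
        often, then order each partition by first-access time."""
        counts = defaultdict(Counter)   # chunk -> Counter of accessing ranks
        first_seen = {}                 # chunk -> earliest access timestamp
        for rank, fid, offset, size, op, ts in trace:
            chunk = (fid, offset, size)  # illustrative chunk key
            counts[chunk][rank] += 1
            if chunk not in first_seen or ts < first_seen[chunk]:
                first_seen[chunk] = ts

        parts = defaultdict(list)       # rank -> list of assigned chunks
        for chunk, by_rank in counts.items():
            winner, _ = by_rank.most_common(1)[0]
            parts[winner].append(chunk)

        # Reorder each partition by first access so reruns stay sequential.
        for chunks in parts.values():
            chunks.sort(key=lambda c: first_seen[c])
        return dict(parts)

Running partition(trace) on the log from the tracing shim yields one chunk list per rank, ordered by first access; each list can then be materialized as that process's partition file.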
9 Benchmarks
- NAS Parallel Benchmarks (NPB 2.4)/BT
- Computational fluid dynamics
- Generates a 1.6 GB file dynamically and then reads it
- Writes/reads sequentially in chunks of 2040 bytes
- SPEChpc96/seismic
- Seismic processing
- Generates a 1.5 GB file dynamically and then reads it back
- Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
- mpi-tile-io
- From the Parallel I/O Benchmarking Consortium
- Tiled access to a two-dimensional matrix (1 GB), with overlap
- Writes/reads sequential chunks of 32 KB, with 2 KB of overlap
- All applications use MPI and MPI-IO for computation, communication, and I/O
10-15 Experimental results (figures only; no transcript available)
16 Conclusions and future work
- We obtain scalable speedups due to
- creating parallel I/O channels
- reducing disk seek time
- reducing communication overhead
- I/O access patterns are generally independent of data values for the applications studied
- Future work: investigating static (compile-time) approaches to I/O partitioning
17 Northeastern University Computer Architecture Research Group
http://www.ece.neu.edu/groups/nucar
- This project is supported by the NSF-funded Center for Subsurface Sensing and Imaging Systems (CenSSIS)