Title: Architectural and Design Issues in the General Parallel File System

1. Architectural and Design Issues in the General Parallel File System
May 2005
Benny Mandler - mandler_at_il.ibm.com
2. Agenda
- What is GPFS?
  - A file system for high performance computing
  - General architecture
- How does GPFS meet its challenges - architectural issues
  - Performance
  - Scalability
  - High availability
  - Concurrency control
3. What is Parallel I/O?
- Multiple processes (possibly on multiple nodes) participate in the I/O - application-level parallelism
- The file is stored on multiple disks by a parallel file system
- Additional interfaces for I/O (can impact portability)
4. What is Parallel I/O? (Cont.)
- Parallel I/O should safely support
  - Application-level I/O parallelism across multiple computational nodes
  - Physical parallelism over multiple disks and servers
  - Parallelism in file system overhead operations
- A parallel file system must support
  - Parallel I/O
  - A consistent global name space across all nodes of the cluster
    - Including maintaining a consistent view of the same file across all nodes
  - A programming model allowing programs to access file data
    - Distributed over multiple nodes
    - From multiple tasks running on multiple nodes
  - Physical distribution of data across disks and network entities
    - Eliminates bottlenecks both at the disk interface and in the network, providing more effective bandwidth to the I/O resources
5. GPFS vs. local and distributed file systems on the SP
- Native AIX file system (JFS)
  - No file sharing - an application can only access files on its own node
  - Applications must do their own data partitioning
- DCE Distributed File System
  - Application nodes (DCE clients) share files on a server node
  - The switch is used as a fast LAN
  - Coarse-grained (file- or segment-level) parallelism
  - The server node is a performance and capacity bottleneck
- GPFS parallel file system
  - GPFS file systems are striped across multiple disks on multiple storage nodes
  - Independent GPFS instances run on each application node
  - GPFS instances use storage nodes as "block servers" - all instances can access all disks
6. Scalable Parallel Computing
- RS/6000 SP scalable parallel computer
  - Hundreds of nodes connected by a high-speed switch
  - N-way SMP
  - >1 TB disk per node
  - Hundreds of MB/s full duplex per switch port
- Scalable parallel computing enables I/O-intensive applications
  - Deep computing - simulation, seismic analysis, data mining
  - Server consolidation - aggregating file and web servers on a centrally managed machine
  - Streaming video and audio for multimedia presentation
  - Scalable object store for large digital libraries, web servers, databases, ...
7. GPFS History
- Shark video server
  - Video streaming from a single RS/6000
  - Complete system, including file system, network driver, control server
  - Large data blocks, admission control, deadline scheduling
  - Bell Atlantic video-on-demand trial (1993-94)
- Tiger Shark multimedia file system
  - Multimedia file system for the RS/6000 SP
  - Data striped across multiple disks, accessible from all nodes
  - Hong Kong and Tokyo video trials, Austin video server products
- GPFS parallel file system
  - General-purpose file system for commercial and technical computing on RS/6000 SP, AIX and Linux clusters
  - Recovery, online system management, byte-range locking, fast prefetch, parallel allocation, scalable directory, small-block random access, ...
  - Released as a product: 1.1 - 05/98, 1.2 - 12/98, 1.3 - 04/00, ...
8. What is GPFS? IBM's shared-disk, parallel file system for AIX and Linux clusters
- Cluster - 512 nodes today, fast reliable communication, common admin domain
- Shared disk - all data and metadata on disk accessible from any node through a disk I/O interface (i.e., "any-to-any" connectivity)
- Parallel - data and metadata flow from all of the nodes to all of the disks in parallel
- RAS - reliability, accessibility, serviceability
9. GPFS addresses SP I/O requirements: High performance - multiple GB/s to/from a single file
- Concurrent reads and writes, parallel data access - within a file and across files
- Byte-range locking
- Fully parallel access to both file data and metadata
- Client caching enabled by distributed locking
- Wide striping
- Large data blocks
- Prefetch, write-behind
- Access pattern optimizations
- Distributed management functions
- Multi-pathing
10. GPFS addresses SP I/O requirements (Cont.)
- Scalability in many respects
  - Scales up to 512 nodes (N-way SMP)
  - Storage nodes
  - File system nodes
  - Disks - 100s of TB
  - Adapters
- High availability
  - Fault tolerance via logging, replication, RAID support
  - Survives node and disk failures
- Uniform access via shared disks - single-image file system
- High capacity - multiple TB per file system, 100s of GB per file
- Standards compliant (X/Open 4.0 "POSIX") with minor exceptions and extensions
11. GPFS comes in different flavors
Dedicated storage nodes
- Advantages
  - Separates the storage I/O service from application jobs
  - Well suited to synchronous applications
  - Can utilize extra switch bandwidth
- Disadvantages
  - Performance gated by the adapters in the servers
Storage service on the compute nodes
- Advantages
  - Performance scales with the number of servers
  - Uses unused compute cycles if available
  - Can utilize extra switch bandwidth
- Disadvantages
  - Cycle stealing from the compute nodes
Storage Area Network
- Advantages
  - Simpler I/O model - a storage I/O operation does not require an associated network I/O operation
  - Can be used instead of a switch when a switch is not otherwise needed
- Disadvantages
  - Cost/complexity when building large SANs
12. Agenda
- What is GPFS?
  - A file system for high performance computing
  - General architecture
- How does GPFS meet its challenges - architectural issues
  - Performance
  - Scalability
  - High availability
  - Concurrency control
13. Shared Disks - Virtual Shared Disk architecture
- A file system consists of one or more shared disks
  - An individual disk can contain data, metadata, or both
  - Each disk is assigned to a failure group
  - Data and metadata are striped to balance load and maximize parallelism
- Recoverable Virtual Shared Disk for accessing disk storage
  - Disks are physically attached to SP nodes
  - VSD allows access to disks over the SP switch
  - The VSD client looks like a disk device driver on the client node
  - The VSD server executes I/O requests on the storage node
  - VSD supports JBOD or RAID volumes, fencing, and multi-pathing (where the physical hardware permits)
- GPFS only assumes a conventional block I/O interface (sketched below)
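Because GPFS relies only on a conventional block I/O interface, it can sit on top of VSD, a SAN, or any other shared-disk transport. The sketch below shows roughly what such an interface looks like; the class and method names are illustrative stand-ins, not the actual VSD or GPFS programming interface.

```python
# Minimal sketch of the "conventional block I/O interface" GPFS assumes of a
# shared disk. Names and the sector size are illustrative, not the VSD API.
from abc import ABC, abstractmethod


class BlockDevice(ABC):
    """A flat array of fixed-size sectors, addressable from any node."""

    sector_size = 512  # bytes; assumed value for illustration

    @abstractmethod
    def read(self, sector: int, count: int) -> bytes:
        """Read `count` sectors starting at `sector`."""

    @abstractmethod
    def write(self, sector: int, count: int, data: bytes) -> None:
        """Write `count` sectors starting at `sector`."""

    @abstractmethod
    def fence(self, node_id: int) -> None:
        """Block further I/O from a (failed) node - used during recovery."""
```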
14. GPFS Architecture Overview
- Implications of the shared-disk model
  - All data and metadata reside on globally accessible disks (VSD)
  - All access to permanent data goes through the disk I/O interface
  - Distributed protocols (e.g., distributed locking) coordinate disk access from multiple nodes
  - Fine-grained locking allows parallel access by multiple clients
  - Logging and shadowing restore consistency after node failures
- Implications of large scale
  - Supports up to 4096 disks of up to 1 TB each (4 petabytes)
  - The largest system in production is 75 TB
  - Failure detection and recovery protocols handle node failures
  - Replication and/or RAID protect against disk / storage node failure
  - On-line dynamic reconfiguration (add, delete, or replace disks and nodes; rebalance the file system)
15. GPFS Architecture - Special Node Roles
- Three types of nodes: file system, storage, and manager
- File system nodes
  - Run user programs; read/write data to/from storage nodes
  - Implement the virtual file system interface
  - Cooperate with manager nodes to perform metadata operations
- Manager nodes
  - Global lock manager
  - File system configuration: recovery, adding disks, ...
  - Disk space allocation manager
  - Quota manager
  - File metadata manager - maintains file metadata integrity
- Storage nodes
  - Implement the block I/O interface
  - Provide shared access to file system and manager nodes
  - Interact with manager nodes for recovery (e.g., fencing)
  - Data and metadata are striped across multiple disks on multiple storage nodes
16. GPFS Software Structure
17. Disk Data Structures: Files
- A large block size allows efficient use of disk bandwidth
- Fragments reduce the space overhead for small files
- No designated "mirror", no fixed placement function
  - Flexible replication (e.g., replicate only metadata, or only important files)
  - Dynamic reconfiguration - data can migrate block by block
- Multi-level indirect blocks (see the sketch below)
  - Each disk address is a list of pointers to replicas
  - Each pointer is a disk id plus a sector number
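The last two bullets describe how a replicated block is addressed. The sketch below mirrors that description in code; the class and field names are hypothetical, chosen only to illustrate the structure.

```python
# Sketch of the replicated disk-address structure described above; names and
# values are illustrative, not GPFS's on-disk format.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DiskPointer:
    disk_id: int   # which shared disk holds this copy
    sector: int    # starting sector of the block on that disk


@dataclass
class DiskAddress:
    """One logical file block: a list of pointers, one per replica."""
    replicas: List[DiskPointer]


# A doubly replicated block, with copies on two disks in different failure groups.
block_7 = DiskAddress(replicas=[DiskPointer(disk_id=3, sector=40960),
                                DiskPointer(disk_id=11, sector=81920)])
```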
18. Agenda
- What is GPFS?
  - A file system for high performance computing
  - General architecture
- How does GPFS meet its challenges - architectural issues
  - Performance
  - Scalability
  - High availability
  - Concurrency control
19. Large File Block Size
- Conventional file systems store data in small blocks to pack data more densely
- GPFS uses large blocks (256 KB default) to optimize disk transfer speed (see the worked example below)
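A rough worked example of the trade-off: with a large block, the fixed positioning cost of a disk access is amortized over a much larger transfer, while fragments keep small files from wasting most of a block. The disk parameters below are assumptions for illustration, not measured GPFS numbers.

```python
# Back-of-the-envelope numbers behind the large-block choice. Seek time and
# transfer rate are illustrative assumptions for a disk of that era.
BLOCK = 256 * 1024          # default GPFS block size (bytes)
FRAGMENT = BLOCK // 32      # fragments are a fraction (here 1/32) of a block
SEEK_MS = 8.0               # assumed average seek + rotational delay
XFER_MB_S = 20.0            # assumed sustained media transfer rate


def effective_bandwidth(io_size: int) -> float:
    """MB/s achieved when each I/O pays one positioning delay."""
    xfer_ms = io_size / (XFER_MB_S * 1e6) * 1e3
    return io_size / ((SEEK_MS + xfer_ms) / 1e3) / 1e6


print(f"4 KB I/O:   {effective_bandwidth(4 * 1024):5.1f} MB/s")
print(f"256 KB I/O: {effective_bandwidth(BLOCK):5.1f} MB/s")
print(f"A 1 KB file in an {FRAGMENT // 1024} KB fragment wastes "
      f"{FRAGMENT // 1024 - 1} KB, not {BLOCK // 1024 - 1} KB")
```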
20. Parallelism and consistency
- Distributed locking - acquire the appropriate lock for every operation - used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node - used for file metadata
- Distributed locking plus centralized hints - used for space allocation
- Central coordinator - used for configuration changes
- I/O slowdowns show up as additional I/O activity rather than token-server overload
21. Parallel File Access From Multiple Nodes
- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g., a file) or a subset of an object (e.g., a byte range)
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when the token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations (see the sketch below)
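The core of byte-range locking is deciding when two token requests actually conflict. The sketch below shows a simplified version of that check, assuming only shared-read and exclusive-write modes; it is a didactic reduction, not the GPFS token protocol itself.

```python
# Sketch of byte-range token conflict detection: non-overlapping ranges can be
# granted to different nodes, while overlapping writes force a token revoke.
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeToken:
    node: int
    start: int   # first byte covered
    end: int     # one past the last byte covered
    mode: str    # "ro" (shared read) or "rw" (exclusive write)


def conflicts(held: RangeToken, wanted: RangeToken) -> bool:
    if held.node == wanted.node:
        return False                              # same node: handled locally
    if held.end <= wanted.start or wanted.end <= held.start:
        return False                              # disjoint byte ranges
    return held.mode == "rw" or wanted.mode == "rw"


# Two nodes writing disjoint halves of a file proceed without token traffic
# after the first grant; an overlapping write triggers a token steal.
a = RangeToken(node=1, start=0, end=1 << 20, mode="rw")
b = RangeToken(node=2, start=1 << 20, end=2 << 20, mode="rw")
assert not conflicts(a, b)
```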
22. I/O throughput scaling - nodes and disks
- 32-node SP, 480 disks, 2 I/O servers
- Single file - n large contiguous sections
- Writes - update in place
23. Deep Prefetch for High Throughput
- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal parallelism (see the sketch below)
- Prefetch algorithms now recognize strided and reverse-sequential access
- Accepts hints
- Write-behind policy
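One way to picture the prefetch decision: read far enough ahead that disk transfer time is hidden behind the application's think time, bounded by available cache space and the number of disks that can work in parallel. The function below is an illustrative formula under those assumptions, not GPFS's actual heuristic.

```python
# Sketch of the deep-prefetch idea: choose how many blocks to keep in flight so
# that disk reads overlap application "think time". Parameters are illustrative.
def prefetch_depth(think_time_ms: float, block_read_ms: float,
                   cache_blocks_free: int, max_parallel_disks: int) -> int:
    # Blocks consumed per block-read time; keep at least one block in flight.
    needed = max(1, round(block_read_ms / max(think_time_ms, 0.1)))
    return min(needed, cache_blocks_free, max_parallel_disks)


# A fast consumer (1 ms of work per 256 KB block, 12 ms per disk read)
# wants ~12 blocks in flight, limited by cache space and disk count.
print(prefetch_depth(think_time_ms=1.0, block_read_ms=12.0,
                     cache_blocks_free=64, max_parallel_disks=16))  # -> 12
```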
24. GPFS Throughput Scaling for Non-cached Files
- Hardware: POWER2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks
  - Micro Channel limits node throughput to 50 MB/s
  - System throughput limited by the available storage nodes
25. Disk Data Structures: Allocation map
- Each segment contains bits representing blocks on all disks
- Each segment is a separately lockable unit
  - Minimizes contention for the allocation map when writing files on multiple nodes
- The allocation manager service provides hints about which segments to try (see the sketch below)
- The inode allocation map looks similar
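A sketch of a segmented allocation map follows: each segment covers a slice of every disk and carries its own lock, so nodes steered to different segments by the allocation manager's hints rarely contend. The structure and method names are illustrative only.

```python
# Sketch of a segmented allocation map: each segment is separately lockable and
# tracks blocks on every disk, so striping stays possible within one segment.
from threading import Lock


class AllocationSegment:
    def __init__(self, n_disks: int, blocks_per_disk: int):
        self.lock = Lock()
        # free[d][b] is True if block b of this segment's slice of disk d is free
        self.free = [[True] * blocks_per_disk for _ in range(n_disks)]

    def allocate(self, preferred_disk: int):
        """Allocate one block, preferring `preferred_disk` to continue a stripe."""
        with self.lock:
            n = len(self.free)
            for i in range(n):
                d = (preferred_disk + i) % n
                for b, is_free in enumerate(self.free[d]):
                    if is_free:
                        self.free[d][b] = False
                        return (d, b)
        return None   # segment exhausted; the allocation manager hints another one


segments = [AllocationSegment(n_disks=8, blocks_per_disk=1024) for _ in range(32)]
print(segments[5].allocate(preferred_disk=3))   # -> (3, 0)
```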
26. Allocation Manager
[Diagram: the allocation manager server tracks free space per segment; clients send "update free" messages for the segments they allocate from.]
- A deleted file's blocks are function-shipped to the current segment owners
27. Allocation manager and metanode evaluation
- Write-in-place vs. new file creation
- Create throughput scales nearly linearly with the number of nodes
- Creating a single file from multiple nodes is as fast as each node creating a different file
28. HSM - Space Management
[Diagram: ADSM server with its database; data is migrated to and recalled from the ADSM store.]
- Migrates inactive data
- Transparent recall
- Cost / disk-full reduction
- Policy managed
- Integrated with backup
29. Data Management API for GPFS
- Fully compliant with the XDSM standard
- Innovative enhancements entailed by the multi-node model
- Intended for HSM applications such as HPSS, ADSM, etc.
- Principles of operation (see the sketch below)
  - Back end: GPFS file operations generate events that are monitored by a data management application
  - Front end: the data management application initiates invisible migration of file data between GPFS and the HSM
  - High throughput using multiple sessions and parallel movers
  - Resilient to failures, and provides transparent recovery
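A highly simplified sketch of the back-end/front-end interaction: the file system posts an event when a migrated file is accessed, and the data management application recalls the data before allowing the I/O to continue. All names here are hypothetical stand-ins, not the XDSM/DMAPI interface itself.

```python
# Toy event loop illustrating transparent recall; the session is modeled as a
# plain queue and the HSM store as a dictionary. Not the XDSM/DMAPI API.
from dataclasses import dataclass
from queue import Queue


@dataclass
class ManagedFile:
    name: str
    is_migrated: bool = True
    data: bytes = b""


@dataclass
class Event:
    kind: str               # e.g. "read"
    file: ManagedFile
    resumed: bool = False


def hsm_event_loop(events: Queue, hsm_store: dict) -> None:
    """Consume pending events, recalling migrated data before resuming I/O."""
    while not events.empty():
        ev = events.get()
        if ev.kind == "read" and ev.file.is_migrated:
            ev.file.data = hsm_store[ev.file.name]   # transparent recall
            ev.file.is_migrated = False
        ev.resumed = True                            # let the original I/O continue


store = {"results.dat": b"archived bytes"}
q: Queue = Queue()
ev = Event("read", ManagedFile("results.dat"))
q.put(ev)
hsm_event_loop(q, store)
print(ev.file.data)   # b'archived bytes'
```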
30. High Availability - Logging and Recovery
- Problem: detect and fix file system inconsistencies after a failure of one or more nodes
- All updates that may leave inconsistencies if uncompleted are logged
  - Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
  - Redo log: replaying all log records at recovery time restores file system consistency (see the sketch below)
- Logged updates
  - I/O to replicated data
  - Directory operations (create, delete, move, ...)
  - Allocation map changes
- Other techniques
  - Ordered writes
  - Shadowing
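The write-ahead rule and redo recovery can be illustrated in a few lines. In the sketch below, "forcing" the log and "writing" metadata are just list and dictionary updates; the point is only the ordering and the replay, not GPFS's log format.

```python
# Sketch of write-ahead logging: the log record describing a metadata update is
# forced out before the dirty metadata itself, so replaying the log after a
# crash restores consistency. Storage here is purely in-memory for illustration.
class Journal:
    def __init__(self):
        self.log = []          # stable log (forced before metadata writes)
        self.metadata = {}     # "on-disk" metadata

    def update(self, key, value):
        self.log.append(("set", key, value))   # 1. force the log record
        self.metadata[key] = value             # 2. then write dirty metadata

    def recover(self, surviving_metadata):
        """Redo: replay every logged update over whatever reached the disk."""
        self.metadata = dict(surviving_metadata)
        for op, key, value in self.log:
            if op == "set":
                self.metadata[key] = value
        return self.metadata


j = Journal()
j.update("dir:/a", "inode 17")
# Even if the crash hit before the metadata write, recovery repairs it,
# because the log record was forced out first.
print(j.recover(surviving_metadata={}))   # {'dir:/a': 'inode 17'}
```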
31. Node Failure Recovery
- Application node failure
  - The force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
  - All potential inconsistencies are protected by a token and are logged
  - The file system manager runs log recovery on behalf of the failed node
  - After log recovery, tokens held by the failed node are released
  - These actions restore metadata being updated by the failed node to a consistent state and release resources held by the failed node
- File system manager failure
  - A new node is appointed to take over
  - The new file system manager restores volatile state by querying other nodes
  - The new file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)
- Storage node failure
  - Dual-attached disk: use the alternate path (VSD)
  - Single-attached disk: treat as a disk failure
32. Handling Disk Failures
- When a disk failure is detected
  - The node that detects the failure informs the file system manager
  - The file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)
- While a disk is down
  - Read one / write all available copies (see the sketch below)
  - A "missing update" bit is set in the inode of modified files
- When/if the disk recovers
  - The file system manager searches the inode file for missing-update bits
  - All data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, under the normal locking protocol)
  - Until missing-update recovery is complete, data on the recovering disk is treated as write-only
- Unrecoverable disk failure
  - The failed disk is deleted from the configuration or replaced by a new one
  - New replicas are created on the replacement or on other disks
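A sketch of the read-one / write-all-available rule with a missing-update flag follows. The replica layout and method names are illustrative; real GPFS tracks this state per inode and per disk address rather than per block object.

```python
# Sketch of "read one / write all available copies" plus a missing-update flag
# used while a replica's disk is down. Purely illustrative data layout.
class ReplicatedBlock:
    def __init__(self, disks):
        self.copies = {d: None for d in disks}   # disk id -> block contents
        self.missing_update = False

    def write(self, data, down_disks):
        for disk in self.copies:
            if disk in down_disks:
                self.missing_update = True        # remember this copy is stale
            else:
                self.copies[disk] = data          # write all available copies

    def read(self, down_disks):
        for disk, data in self.copies.items():
            if disk not in down_disks and data is not None:
                return data                       # any up-to-date copy will do
        raise IOError("no available replica")

    def resync(self, recovered_disk):
        """Copy data back to a recovering disk, then clear the flag."""
        good = self.read(down_disks={recovered_disk})
        self.copies[recovered_disk] = good
        self.missing_update = False


blk = ReplicatedBlock(disks=[3, 11])
blk.write(b"v2", down_disks={11})    # disk 11 is down: flag a missing update
blk.resync(recovered_disk=11)        # later: missing-update recovery
```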
33. Concurrency Control: High-level Metadata
- Managed by central coordinators
- Configuration manager
  - Elected through Group Services
  - Longest-surviving node
  - Appoints a manager for each GPFS file system as it is mounted
- File system manager
  - Handles all changes to the file system configuration, e.g.,
    - Adding/deleting disks (including allocation map initialization)
  - The only node that reads/writes configuration data (superblock)
  - Initiates and coordinates data migration (rebalance)
  - Creates and assigns log files
  - Token manager, etc.
  - Appointed by the configuration manager
- Token manager coordinates distributed locking (next slide)
- Others: quota manager, allocation manager, ACLs, extended attributes
34. Concurrency Control: Fine-grain (Meta)data
- Token-based distributed lock manager (see the sketch below)
- The first lock request for an object requires a message to the token manager to obtain a token
  - Subsequent lock requests can be granted locally
- Data can be cached as long as a token is held
- When a conflicting lock request is issued from another node, the token is revoked ("token steal")
  - Force-on-steal policy: modified data are written to disk when the token is revoked
- Whole-file locking for less frequent operations (e.g., create, truncate, ...); finer-grain locking for read/write
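The token life cycle described above - the first request goes to the token manager, later requests are granted locally, and a conflict triggers a steal that forces dirty cached data to disk - is sketched below. It is a didactic simplification in which the "disk" is a dictionary, not GPFS's token manager.

```python
# Sketch of token acquire/steal with the force-on-steal policy: on a revoke,
# the current holder flushes its cached dirty data before giving up the token.
class TokenManager:
    def __init__(self):
        self.holder = {}                 # object id -> node currently holding its token

    def acquire(self, obj, node, nodes):
        current = self.holder.get(obj)
        if current is not None and current != node:
            nodes[current].revoke(obj)   # token steal
        self.holder[obj] = node


class Node:
    def __init__(self, name, disk):
        self.name, self.disk = name, disk
        self.cache = {}                  # obj -> locally cached (possibly dirty) data

    def write(self, obj, data, tm, nodes):
        tm.acquire(obj, self.name, nodes)
        self.cache[obj] = data           # cached for as long as the token is held

    def revoke(self, obj):
        if obj in self.cache:            # force-on-steal: flush before releasing
            self.disk[obj] = self.cache.pop(obj)


disk = {}
tm = TokenManager()
nodes = {"n1": Node("n1", disk), "n2": Node("n2", disk)}
nodes["n1"].write("fileA", b"from n1", tm, nodes)
nodes["n2"].write("fileA", b"from n2", tm, nodes)   # steals the token; n1 flushes first
print(disk["fileA"])                                 # b'from n1' reached disk
```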
35. Parallel System Administration
- Data redistribution
  - Disk addition/deletion/replacement
  - Replication/striping due to disk failures
36. Cache Management
- Balance dynamically according to usage patterns
- Avoid fragmentation - internal and external
- Unified steal
- Periodic rebalancing
37. Epilogue
- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk, up to 512 nodes with 140 TB of disk in 2 file systems
- IP-rich - 20 patents filed
- State of the art
  - TeraSort
    - World record of 17 minutes
    - Using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
    - Total of 6 TB disk space
- References
  - GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
  - FAST 2002: http://www.usenix.org/publications/library/proceedings/fast02/schmuck.html
  - TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
  - Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html