Title: NRL File System Progress Presentation
1 NRL File System: Progress Presentation
- NRL seminar
- 2002. 2. 7.
- CA Lab., CS, KAIST
2 Introduction
- Research topics
- Caching
- Replacement
- Consistency
- Cooperative caching
- Prefetching
- I/O load balancing
- Parallel-I/O
- Replication / migration
- Fault-tolerance
- Availability
- Reliability
- Hardware/protocol utilization
3 Cluster File Systems (1/5)
- PVFS (Parallel Virtual File System)
- Provide a cluster-wide consistent name space
- Parallel-I/O (striping)
- User Interfaces
- PVFS-native API
- Linux VFS support
- ROMIO MPI-IO
- Independent of the local file system
- ext2, ext3, ...
- Software/hardware RAID
- No caching
- No support for fault-tolerance
- Cannot be used as the root file system
4 Cluster File Systems (2/5)
- GFS (Global File System)
- Optimized for Storage Area Network
- Symmetric Shared Disk File System
- Scalability
- Availability
- Load balancing
- Fault tolerance
- Remove single points of failure
- Journal recovery
- OpenSource SSI project
5 Cluster File Systems (3/5)
- NFS
- Very common file system
- A large body of research on it
- Not suitable for cluster architecture
- Cache consistency
- No support for parallel-I/O
- Diskless-NFS
- NFS as root file system
6 Cluster File Systems (4/5)
- TH-CluFS
- Parallel-I/O is not suitable for web servers
- Balanced I/O for web server cluster
- File migration
- File-level migration (not volume-level)
- Static file allocation / dynamic file migration
- Unique cache
- Client cache / disk cache
- Cooperative cache
7 Cluster File Systems (5/5)
- Trend
- High-speed interconnection network
- Myrinet, Fast/Gigabit Ethernet, SAN
- High performance achieved in the physical layer
- File system functionality
- Bring high performance to the user (application) layer
- Cooperative caching
- Consistency
- Scalability
- Balanced I/O
- Parallel-I/O
- Replication / migration
8 Our Goal
- Goal: an efficient file system for HPC
- Not application-specific
- Can be used for general-purpose cluster
- Parallel-I/O + efficient cache management
- Cache consistency
- Cooperative caching
- Prefetching
- Less-important issues
- Availability
- Fault tolerance
[Diagram: client nodes and server nodes connected through the network.]
9 Idea (1/2)
- PVFS + cache management
- PVFS is close to our design
- Parallel I/O
- Linux
- Cooperative caching
- Consistency models / granularity
- Scalability
- Distributed cache manager
- Fewer accesses to the cache manager
- I/O server cache utilization
- Replacement policy
- Unified cache
10 Idea (2/2)
- HPC Application support
- Prefetching
- Pattern analysis
- File prefetching
- Block prefetching
- Efficient parallel I/O algorithm
- File block distribution
- Load balancing
- Replication / migration
- Other issues
- MPI-I/O
- Fault-tolerance
11 NRL File System: PVFS Source Code Analysis
- PVFS overview
- Metadata manager
- Kernel module
- I/O server
12 PVFS (1/4)
- Goals of PVFS
- Provide high-speed access to file data for parallel applications
- Provide a cluster-wide consistent name space
- Enable user-controlled striping of data across disks on different I/O nodes
- Allow existing binaries to operate on PVFS files without the need for recompiling
13 PVFS (2/4)
- Components
- Metadata manager (mgr)
- I/O server (iod)
- Client
- Kernel module
- User-level daemon
- API
[Diagram: the client contacts the metadata manager for <metadata access> and the I/O servers for <data access>.]
14 PVFS (3/4)
[Diagram: file /pvfs/foo striped across the I/O nodes; the layout is described by base (the first iod used), pcount = 3 (number of iods), and ssize (stripe size).]
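To make the three parameters concrete, here is a minimal sketch of the round-robin mapping they imply; the helper below is illustrative only and is not taken from the PVFS sources.

    #include <stdint.h>

    /* Illustrative mapping: stripes are assigned round-robin over pcount
     * iods, starting at iod 'base', out of nr_iods iods in the file system. */
    struct stripe_loc {
        int     iod;            /* which I/O daemon holds this byte       */
        int64_t local_off;      /* offset inside that daemon's local file */
    };

    static struct stripe_loc locate(int64_t file_off, int base, int pcount,
                                    int32_t ssize, int nr_iods)
    {
        int64_t stripe = file_off / ssize;                    /* stripe index */
        struct stripe_loc loc;
        loc.iod       = (base + (int)(stripe % pcount)) % nr_iods;
        loc.local_off = (stripe / pcount) * ssize + file_off % ssize;
        return loc;
    }

With base = 0, pcount = 3, and ssize = 64 KB, for example, the first 64 KB of /pvfs/foo land on iod 0, the next 64 KB on iod 1, then iod 2, and then back to iod 0.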
15 PVFS (4/4)
- APIs
- UNIX I/O
- PVFS native API
- MPI-IO
[Diagram: two ways for an application to reach PVFS.
(a) Trapping system calls: PVFS syscall wrappers are interposed in front of the C library using the Linux LD_PRELOAD environment variable and route PVFS calls to the PVFS I/O library.
(b) Kernel module: PVFS is registered under the VFS alongside ext2, adding new file-system support via a loadable module without recompiling the kernel.]
16 Metadata Manager (1/8)
17 Metadata Manager (2/8)
[Diagram: request flow between client, mgr, and iod. The client builds an mreq (plus optional data) and sends it via send_req() / send_mreq_saddr() using bsendv() or bsend(); the mgr's request handler (which varies between handlers) exchanges ireq/iack messages with the iods, then returns an mack to the client with bsend(), which the client receives with brecv_timeout().]
18 Metadata Manager (3/8)
- Caching in mgr (1/2)
- Mgr keeps all information about mounted PVFSes and open files
- File system information (struct fsinfo, fslist.h)
- Open file information (struct finfo, flist.h)
- Both structures are based on the linked list in llist.c,h
- Mgr maintains this information through the active_p pointer, whose type is fsinfo
[Diagram: fslist is a linked list (llist next/item pointers) of fsinfo entries, with active_p pointing at the current one. Each fsinfo holds fs_ino (ino_t), nr_iods (int), iod (iod_info *, the list of I/O daemons), and fl_p (flist_p), the linked list of open files. Each finfo on that list holds unlink (int), f_ino (ino_t), p_stat (pvfs_filestat), cap (int), cnt (int), grp_cap (int), fname (char *), and socks (dyn_fdset *).]
19 Metadata Manager (4/8)
- Caching in mgr (2/2)
- Description of each member variable (flist.c,h, fslist.c,h)
- fsinfo
- fs_ino (ino_t): inode number of the root directory for this filesystem
- nr_iods (int): number of iods for this filesystem
- fl_p (flist_p): list of open files for this filesystem
- iod (iod_info *): list of iod addresses
- finfo
- unlink (int): fd of the metadata file, or -1
- f_ino (ino_t): inode number of the metadata file
- p_stat (pvfs_filestat): PVFS metadata for the file
- cap (int): max. capability assigned thus far(?)
- cnt (int): count of the number of times the file is open
- grp_cap (int): capability number for the group
- fname (char *): file name for performing operations on metadata
- socks (dyn_fdset *): used to keep track of which FDs have opened the file
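A simplified C sketch of the two structures, assembled from the member lists above; the authoritative definitions are in fslist.h and flist.h and also carry the generic llist next/item links, which are omitted here, and the list typedefs below are placeholders.

    #include <stdint.h>
    #include <sys/types.h>                      /* ino_t */

    /* Placeholder declarations for types defined in other PVFS headers. */
    typedef struct iod_info  iod_info;          /* address of one I/O daemon  */
    typedef struct dyn_fdset dyn_fdset;         /* dynamic fd set             */
    struct pvfs_filestat {                      /* striping info (slide 20)   */
        int32_t base, pcount, ssize, soff, bsize;
    };

    struct finfo {                              /* one entry per open file     */
        int                   unlink;           /* fd of metadata file, or -1  */
        ino_t                 f_ino;            /* inode of the metadata file  */
        struct pvfs_filestat  p_stat;           /* PVFS metadata for the file  */
        int                   cap;              /* max. capability so far      */
        int                   cnt;              /* times the file is open      */
        int                   grp_cap;          /* capability number for group */
        char                 *fname;            /* name used for metadata ops  */
        dyn_fdset            *socks;            /* FDs that opened the file    */
    };
    typedef struct finfo *flist_p;              /* simplified: really an llist */

    struct fsinfo {                             /* one entry per mounted PVFS  */
        ino_t      fs_ino;                      /* inode of the root directory */
        int        nr_iods;                     /* number of iods for this fs  */
        flist_p    fl_p;                        /* linked list of open files   */
        iod_info  *iod;                         /* list of iod addresses       */
    };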
20 Metadata Manager (5/8)
- Crux Data Structure (1/3)
- Types of metadata (include/meta.h)
- dmeta: directory metadata structure
- fmeta: file metadata structure
- pvfs_filestat: PVFS metadata for a file
- dmeta: stored as .pvfsdir (text format)
- fs_ino (int64): fs root directory inode number
- dr_uid (uid_t): directory owner UID number
- dr_gid (gid_t): directory owner GID number
- dr_mode (mode_t): directory mode
- port (int16): port of the metadata server
- host (char *): host name of the metadata server
- rd_path (char *): root directory of metadata(?)
- sd_path (char *): root directory of metadata(?)
- fname (char *): directory name
- fmeta: stored under the same name as the file (binary format)
- f_ino (ino_t): root dir inode
- ustat (stat)
- p_stat (pvfs_filestat): see below
- fsize (int64_t)
- mgr (sockaddr)
- pvfs_filestat (the p_stat member):
- base (int32_t): first iod to be used
- pcount (int32_t): number of iods for the file
- ssize (int32_t): stripe size
- soff (int32_t): NOT USED
- bsize (int32_t): NOT USED
21 Metadata Manager (6/8)
- Crux Data Structure (2/3)
- dmeta (.pvfsdir) for each directory (text format)
- fs_ino (int64)
- dr_uid (uid_t)
- dr_gid (gid_t)
- dr_mode (mode_t)
- port (int16): port of the metadata server
- host (char *): host name of the metadata server
- rd_path (char *): root directory of metadata
- fname (char *)
22 Metadata Manager (7/8)
- Crux Data Structure (3/3)
- fmeta for each file
- Stored in the dmeta directory, under the same name as the file
- Binary data format (112 bytes)
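A rough C sketch of the on-disk file metadata, following the field lists on the previous slides; the authoritative layout is in include/meta.h, so the field order and exact sizes here are approximations only.

    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/socket.h>

    struct pvfs_filestat {              /* striping description (p_stat)   */
        int32_t base;                   /* first iod to be used            */
        int32_t pcount;                 /* number of iods for the file     */
        int32_t ssize;                  /* stripe size                     */
        int32_t soff;                   /* not used                        */
        int32_t bsize;                  /* not used                        */
    };

    struct fmeta {                      /* one binary record per file      */
        ino_t                 f_ino;    /* inode number (slides 19-20)     */
        struct stat           ustat;    /* ordinary stat information       */
        struct pvfs_filestat  p_stat;   /* striping parameters             */
        int64_t               fsize;    /* file size                       */
        struct sockaddr       mgr;      /* address of the metadata manager */
    };

    /* dmeta, by contrast, is kept as plain text in each directory's
     * .pvfsdir file: fs_ino, dr_uid, dr_gid, dr_mode, port, host,
     * rd_path, fname.                                                     */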
23 Metadata Manager (8/8)
- Request Handler
- Common parameters to each request handler
- int sock: connected socket descriptor
- mreq_p req_p: request packet from the client
- void *data_p: data packet from the client
- mack_p ack_p: ack which will be sent to the client
- Request handler hierarchy
- md_xxx() series for miscellaneous operations, in meta/md_xxx.c
- meta_xxx() series for raw metadata operations, in meta/metaio.c
- do_xxx() series, in mgr.c
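The common handler shape can be summarized as below; the typedef and the example handler are illustrative only (the real request/ack structures come from the PVFS protocol headers, and the real handlers are the do_xxx() functions in mgr.c).

    #include <errno.h>

    struct mreq;                                /* request packet from the client */
    struct mack;                                /* ack packet sent back to client */
    typedef struct mreq *mreq_p;
    typedef struct mack *mack_p;

    /* Every request handler takes the same four parameters. */
    typedef int (*mgr_handler_t)(int sock, mreq_p req_p,
                                 void *data_p, mack_p ack_p);

    /* Hypothetical do_xxx()-style handler: it would fill in ack_p (not
     * shown) and return 0 or -errno, delegating raw metadata file I/O to
     * the meta_xxx() and md_xxx() layers. */
    static int do_example(int sock, mreq_p req_p, void *data_p, mack_p ack_p)
    {
        (void)sock; (void)req_p; (void)data_p; (void)ack_p;
        return -ENOSYS;
    }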
24 Kernel Module (1/5)
[Diagram: app -> VFS (kernel level) -> /dev/pvfsd -> pvfsd (user level) -> PVFS I/O servers. File and directory operations on a mounted PVFS volume are routed by the VFS to the PVFS operations registered at mount time (mount.pvfs.c); the kernel module hands each operation to pvfsd through /dev/pvfsd and passes the result back to the caller.]
25 Kernel module (2/5)
- Key data structures
- ll_pvfs.h
- Almost all of the key structures are in this file
- struct pvfs_upcall
- type
- sequence number
- Parameters for each operation
- struct pvfs_downcall
- type
- sequence number
- return value
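A stripped-down sketch of the two message types, based only on the fields listed above; the real definitions in ll_pvfs.h additionally carry per-operation parameter unions, and the field names below are guesses.

    #include <stdint.h>

    struct pvfs_upcall {                /* kernel module -> pvfsd request     */
        int32_t       type;             /* which operation is requested       */
        uint64_t      seq;              /* sequence number, matches the reply */
        unsigned char params[64];       /* placeholder for per-op parameters  */
    };

    struct pvfs_downcall {              /* pvfsd -> kernel module reply       */
        int32_t  type;                  /* echoes the request type            */
        uint64_t seq;                   /* same sequence number as the upcall */
        int64_t  retval;                /* return value of the operation      */
    };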
26 Kernel module (3/5)
- VFS
- VFS operation
- dir.c, file.c, inode.c
- Build the request and enqueue it on the queue served by pvfsdev
- If the caller needs to wait, a wait flag is set
- Mount
- mount.pvfs.c
- When PVFS is mounted, the VFS operations are connected to the PVFS directory, file, and inode operations
27 Kernel module (4/5)
- pvfsdev (pvfsdev.h, pvfsdev.c)
- kernel module
- Read
- pvfsdev gets the request from the queue (struct pvfs_upcall)
- Write
- pvfsdev returns the result of the operation (struct pvfs_downcall)
- Provides synchronous operations
28 Kernel module (5/5)
- pvfsd (pvfsd.h, pvfsd.c)
- read the request from pvfsdev
- parse the request
- send it to the I/O server
- receive the return value
- write the return value to pvfsdev
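Putting the five steps together, the daemon's main loop looks roughly like the sketch below; it reuses the pvfs_upcall/pvfs_downcall sketch from slide 25, and the forwarding of the request to the mgr and I/O servers is elided.

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch only: request forwarding and error handling are omitted. */
    static void pvfsd_loop(void)
    {
        int dev = open("/dev/pvfsd", O_RDWR);
        struct pvfs_upcall   up;
        struct pvfs_downcall down;

        for (;;) {
            /* 1. read the next request queued by the kernel module */
            if (read(dev, &up, sizeof(up)) != (ssize_t)sizeof(up))
                continue;
            /* 2.-4. parse it, send it to the I/O servers, and collect
             *       the return value (not shown here)                 */
            down.type   = up.type;
            down.seq    = up.seq;
            down.retval = 0;                 /* result of the remote operation */
            /* 5. write the return value back through /dev/pvfsd */
            write(dev, &down, sizeof(down));
        }
    }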
29 I/O Server (1/3)
- main()
- Make a listening socket
- Loop forever:
- Wait for some sockets to become ready (the listening socket and others)
- For all ready sockets:
- If a socket is ready but not registered in the job list (which means it is a new connection), call new_request(ready socket)
- If a socket is ready and registered in the job list, there are read or write jobs to be done: call do_job(socket, jinfo pointer)
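In code form the loop is roughly the following select() skeleton; new_request() and do_job() are the iod entry points named above, while jinfo/jlist_find() stand in for the job-list bookkeeping.

    #include <stddef.h>
    #include <sys/select.h>

    typedef struct jinfo *jinfo_p;          /* per-socket job state (sketch)     */
    jinfo_p jlist_find(int sock);           /* stand-in for the job-list lookup  */
    int new_request(int sock);              /* handle a request on a new socket  */
    int do_job(int sock, jinfo_p j_p);      /* continue a pending read/write job */

    static void iod_loop(fd_set *watched)
    {
        for (;;) {
            fd_set ready = *watched;
            /* wait for the listening socket or any client socket to be ready */
            if (select(FD_SETSIZE, &ready, NULL, NULL, NULL) <= 0)
                continue;
            for (int s = 0; s < FD_SETSIZE; s++) {
                if (!FD_ISSET(s, &ready))
                    continue;
                jinfo_p j_p = jlist_find(s);
                if (j_p == NULL)
                    new_request(s);          /* not in the job list: new connection */
                else
                    do_job(s, j_p);          /* pending read/write job              */
            }
        }
    }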
30 I/O Server (2/3)
- new_request(int socket)
- Handles receiving new requests and calling the appropriate handler function
- Calls do_rw_req(), which builds jobs for read/write requests
- Returns 0 on success, -errno on failure
31 I/O Server (3/3)
- do_job(int sock, jinfo_p j_p)
- Calls do_accesses()
- do_access()
- Performs the operations necessary to complete one access
- Pseudo code:
- If it is a read operation: mmap the file, then call nbsend() to the client
- If it is a write operation: call do_write()
- do_write()
- Calls nbrecv()
- Then, calls write() to disk
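The read path above can be pictured with the following simplified stand-in: the requested region is mmap()ed and pushed to the client socket. The real code uses the iod's non-blocking nbsend()/nbrecv() helpers and handles partial transfers; here a plain send() is used and the offset is assumed page-aligned.

    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Simplified read path: map the file region and send it to the client. */
    static int serve_read(int sock, int fd, off_t off, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
        if (buf == MAP_FAILED)
            return -1;
        ssize_t sent = send(sock, buf, len, 0);     /* iod uses nbsend() here */
        munmap(buf, len);
        return (sent == (ssize_t)len) ? 0 : -1;
    }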
32 NRL File System: Current Status and Discussion
33 Design
- PVFS + cooperative caching
- Distributed cache manager
- Kernel-level
- iod cache utilization?
- Unified cache?
[Diagram: pvfsd clients (each with a cache) and the mgr connected through the network to the iods (each with a cache).]
34 Status
- Source code analysis
- Almost done
- Cache manager in kernel-level
- PVFS as root file system
- We have problems
35 Problems (1/3)
- Parallel-I/O + caching: is caching effective?
- PVFS applications: large-volume access
- Spatial/temporal locality
- Is locality handled in the page cache? The file cache?
- A result from cooperative caching research
- Lower-level cache: poor hit ratio
- Too large a data set: capacity misses
- MPI-IO: application-level cache
- Data sieving / collective I/O
- Worse hit ratio at the file system cache
- We aren't sure whether caching is effective or not
36 Problems (2/3)
- Parallel-I/O + caching: is parallel-I/O effective?
- Who needs caching?
- Web server cluster
- General purpose file system
- Problems
- Small files: no need for striping
- Metadata manager / parallel-I/O overhead
- We need more investigation
- Benchmarks
- Disk access behavior: caching and prefetching algorithms
- Most file system benchmarks are not concerned with caching
- Real applications
- Related works
37 Problems (3/3)
- File system for SSI
- Root file system
- Survey on diskless NFS
- Consistency
- Cache consistency: what model does SSI want?
- File consistency: none
- File migration
- Parallel-I/O does not need file migration
- PVFS is not fault-tolerant
- RAID
- The metadata structure would have to be modified; is that sound?
- Software RAID: not an attractive topic
- High availability
- PVFS2: distributed metadata manager
38 Solution
- Parallel-I/O + caching: let's try!
- General-purpose file system
- Idea (sketched below)
- No parallel-I/O for small files
- Small-file caching only, with cache replication / migration
- Larger striping size
- How to reduce overhead?
- No caching for large-volume access
- Rely on the application cache
- Caching based on access frequency
- Looking for benchmarks / real applications
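As a purely hypothetical illustration of the idea above, a per-file policy could be derived from the file size; the threshold below is invented and only marks the intent (small files are cached and replicated/migrated but not striped, large files are striped but bypass the file-system cache).

    #include <stdbool.h>
    #include <sys/types.h>

    #define SMALL_FILE_LIMIT (64 * 1024)        /* hypothetical threshold */

    struct io_policy {
        bool stripe;        /* use parallel-I/O (striping) for this file? */
        bool cache;         /* keep this file in the cooperative cache?   */
    };

    static struct io_policy choose_policy(off_t file_size)
    {
        struct io_policy p;
        p.stripe = (file_size >= SMALL_FILE_LIMIT);   /* no striping for small files   */
        p.cache  = (file_size <  SMALL_FILE_LIMIT);   /* no caching for large accesses */
        return p;
    }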
39 PVFS2
- Next Generation PVFS
- CLUSTER01 extended abstract
- Will be released in summer 2002
- PVFS2 features
- Requests based on MPI Datatypes
- Programmable distributions
- Support for data replication
- Modular metadata subsystem
- Distributed metadata mgr
- Allows arbitrary metadata attributes
- Multiple network transport methods
- VIA, GM
- VFS-like system interface