NRL File System Progress Presentation
1
NRL File System Progress Presentation
  • NRL seminar
  • 2002. 2. 7.
  • CA Lab., CS, KAIST

2
Introduction
  • Research topics
  • Caching
  • Replacement
  • Consistency
  • Cooperative caching
  • Prefetching
  • I/O load balancing
  • Parallel-I/O
  • Replication / migration
  • Fault-tolerance
  • Availability
  • Reliability
  • Hardware/protocol utilization

3
Cluster File Systems (1/5)
  • PVFS (Parallel Virtual File System)
  • Provide a cluster-wide consistent name space
  • Parallel-I/O (striping)
  • User Interfaces
  • PVFS-native API
  • Linux VFS support
  • ROMIO MPI-IO
  • Independent of the local file system
  • ext2, ext3, ...
  • Software/hardware RAID
  • No caching
  • No support for fault-tolerance
  • Can't be used as the root file system

4
Cluster File Systems (2/5)
  • GFS (Global File System)
  • Optimized for Storage Area Network
  • Symmetric Shared Disk File System
  • Scalability
  • Availability
  • Load balancing
  • Fault tolerance
  • Remove single-point of failure
  • Journal recovery
  • OpenSource SSI project

5
Cluster File Systems (3/5)
  • NFS
  • Very common file system
  • Numerous research efforts on it
  • Not suitable for cluster architecture
  • Cache consistency
  • No support for parallel-I/O
  • Diskless-NFS
  • NFS as root file system

6
Cluster File Systems (4/5)
  • TH-CluFS
  • Parallel-I/O is not suitable for web server
  • Balanced I/O for web server cluster
  • File migration
  • File-level migration (not volume-level)
  • Static file allocation / dynamic file migration
  • Unique cache
  • Client cache + disk cache
  • Cooperative cache

7
Cluster File Systems (5/5)
  • Trend
  • High-speed interconnection network
  • Myrinet, Fast/Gigabit Ethernet, SAN
  • High performance achieved in the physical layer
  • File system functionality
  • Bring high-performance to user (application)
    layer
  • Cooperative caching
  • Consistency
  • Scalability
  • Balanced I/O
  • Parallel-I/O
  • Replication / migration

8
Our Goal
  • Goal: an efficient file system for HPC
  • Not application-specific
  • Can be used for general-purpose cluster
  • Parallel-I/O + efficient cache management
  • Cache consistency
  • Cooperative caching
  • Prefetching
  • Less-important issues
  • Availability
  • Fault tolerance

[Diagram: four clients and four servers connected through a network]
9
Idea (1/2)
  • PVFS + cache management
  • PVFS is close to our design
  • Parallel I/O
  • Linux
  • Cooperative caching
  • Consistency models / granularity
  • Scalability
  • Distributed cache manager
  • Fewer accesses to the cache manager
  • I/O server cache utilization
  • Replacement policy
  • Unified cache

10
Idea (2/2)
  • HPC Application support
  • Prefetching
  • Pattern analysis
  • File prefetching
  • Block prefetching
  • Efficient parallel I/O algorithm
  • File block distribution
  • Load balancing
  • Replication / migration
  • Other issues
  • MPI-I/O
  • Fault-tolerance

11
NRL File System: PVFS Source Code Analysis
  • PVFS overview
  • Metadata manager
  • Kernel module
  • I/O server

12
PVFS (1/4)
  • Goals of PVFS
  • Provide high-speed access to file data for
    parallel applications
  • Provide a cluster-wide consistent name space
  • Enable user-controlled striping of data across
    disks on different I/O nodes
  • Allow existing binaries to operate on PVFS files
    without the need for recompiling.

13
PVFS (2/4)
  • Components
  • Metadata manager (mgr)
  • I/O server (iod)
  • Client
  • Kernel module
  • User-level daemon
  • API

[Diagram: the client contacts mgr for <metadata access> and the iods for <data access>]
14
PVFS (3/4)
  • File striping

[Diagram: file /pvfs/foo striped round-robin across the I/O nodes, with parameters pcount = 3, base, and ssize; see the mapping sketch below]
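
As a rough illustration of the striping parameters in the figure (not PVFS's actual code), the following C sketch maps a byte offset of /pvfs/foo to an I/O node and a local offset, assuming simple round-robin striping; the helper names are hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Simplified striping parameters as shown in the figure:
 *   base   - first iod used for this file
 *   pcount - number of iods the file is striped across
 *   ssize  - stripe size in bytes
 */
struct stripe_params {
    int32_t base;
    int32_t pcount;
    int32_t ssize;
};

/* Map a file offset to (iod index, offset within that iod's local file). */
static void map_offset(const struct stripe_params *sp, int64_t off,
                       int32_t nr_iods, int32_t *iod, int64_t *local_off)
{
    int64_t stripe = off / sp->ssize;               /* which stripe unit    */
    *iod = (sp->base + stripe % sp->pcount) % nr_iods;
    *local_off = (stripe / sp->pcount) * sp->ssize  /* full units on iod    */
               + off % sp->ssize;                   /* remainder in unit    */
}

int main(void)
{
    struct stripe_params sp = { .base = 0, .pcount = 3, .ssize = 65536 };
    int32_t iod; int64_t local;
    map_offset(&sp, 200000, 8, &iod, &local);
    printf("offset 200000 -> iod %d, local offset %lld\n", iod, (long long)local);
    return 0;
}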
15
PVFS (4/4)
  • APIs
  • UNIX I/O
  • PVFS native API
  • MPI-IO

[Diagram: two paths from an application to PVFS:
(a) Trapping system calls: C library calls are intercepted by the PVFS syscall wrappers (via the Linux LD_PRELOAD environment variable) and routed to the PVFS I/O library.
(b) Kernel module: PVFS is added as a new file system under the VFS (alongside ext2) via a loadable module, without recompiling the kernel.]
16
Metadata Manager (1/8)
  • main()

17
Metadata Manager (2/8)
  • Flow diagram

[Flow diagram: the client builds an mreq (plus optional data) and sends it with send_req() / send_mreq_saddr(), using bsendv() or bsend(), then waits with brecv_timeout(); mgr dispatches to a request handler (details vary between handlers), exchanges ireq/iack messages with the iods where needed, and finally returns an mack to the client with bsend().]
18
Metadata Manager (3/8)
  • Caching in mgr (1/2)
  • mgr keeps all information about mounted PVFS file
    systems and open files
  • File system information (struct fsinfo, fslist.h)
  • Open file information (struct finfo, flist.h)
  • Both structures are based on the linked list in llist.c,h
  • mgr maintains this information through the active_p
    pointer, whose type is fsinfo

[Diagram: fslist is a linked list (llist nodes with next (llist_p) and item (void *) fields) of fsinfo entries: fs_ino (ino_t), nr_iods (int), fl_p (flist_p), iod (iod_info *); active_p points into this list. Each fsinfo's fl_p heads another linked list (flist) of finfo entries for the open files: unlink (int), f_ino (ino_t), p_stat (pvfs_filestat), cap (int), cnt (int), grp_cap (int), fname (char *), socks (dyn_fdset *).]
19
Metadata Manager (4/8)
  • Caching in mgr (2/2)
  • Description of each member variable (flist.c,h and
    fslist.c,h); see the C sketch below

  fsinfo
  • fs_ino (ino_t)           inode number of the root directory for this filesystem
  • nr_iods (int)            number of iods for this filesystem
  • fl_p (flist_p)           list of open files for this filesystem
  • iod (iod_info *)         list of iod addresses

  finfo
  • unlink (int)             fd of the metadata file, or -1
  • f_ino (ino_t)            inode number of the metadata file
  • p_stat (pvfs_filestat)   PVFS metadata for the file
  • cap (int)                max. capability assigned thus far(?)
  • cnt (int)                count of how many times the file is open
  • grp_cap (int)            capability number for the group
  • fname (char *)           file name for performing operations on metadata
  • socks (dyn_fdset *)      keeps track of which FDs have opened the file
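
To make the lists above concrete, here is a simplified C sketch of the two caching structures and the generic llist node, using only the field names from these slides; the real definitions in fslist.h, flist.h, and llist.h differ in detail.

#include <stdint.h>
#include <sys/types.h>

/* Generic singly linked list node, as in llist.c,h (simplified). */
typedef struct llist {
    void *item;                /* points at an fsinfo or finfo entry     */
    struct llist *next;
} llist, *llist_p;
typedef llist_p flist_p;       /* the open-file list uses the same nodes */

/* Striping metadata for one file (pvfs_filestat, include/meta.h). */
struct pvfs_filestat {
    int32_t base, pcount, ssize, soff, bsize;
};

struct iod_info;               /* address of one I/O daemon              */
struct dyn_fdset;              /* dynamic fd set                         */

/* One open file, kept on the per-filesystem flist (struct finfo, flist.h). */
typedef struct finfo {
    int unlink;                     /* fd of the metadata file, or -1        */
    ino_t f_ino;                    /* inode number of the metadata file     */
    struct pvfs_filestat p_stat;    /* PVFS metadata for the file            */
    int cap;                        /* max. capability assigned thus far     */
    int cnt;                        /* how many times the file is open       */
    int grp_cap;                    /* capability number for the group       */
    char *fname;                    /* name used for metadata operations     */
    struct dyn_fdset *socks;        /* which FDs have opened the file        */
} finfo;

/* One mounted PVFS, kept on fslist (struct fsinfo, fslist.h);
 * mgr's active_p points at the currently active entry. */
typedef struct fsinfo {
    ino_t fs_ino;                   /* inode number of the root directory    */
    int nr_iods;                    /* number of iods for this filesystem    */
    flist_p fl_p;                   /* list of open files (finfo entries)    */
    struct iod_info *iod;           /* list of iod addresses                 */
} fsinfo;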
20
Metadata Manager (5/8)
  • Crux Data Structure (1/3)
  • Types of metadata (include/meta.h)
  • dmeta: directory metadata structure
  • fmeta: file metadata structure
  • pvfs_filestat: PVFS metadata for a file

  dmeta (stored as .pvfsdir, text format)
  • fs_ino (int64)           fs root directory inode number
  • dr_uid (uid_t)           directory owner UID number
  • dr_gid (gid_t)           directory owner GID number
  • dr_mode (mode_t)         directory mode
  • port (int16)             port of the metadata server
  • host (char *)            host name of the metadata server
  • rd_path (char *)         root directory of metadata
  • sd_path (char *)         root directory of metadata
  • fname (char *)           directory name

  fmeta (stored in binary under the same name as the file)
  • f_ino (ino_t)            root dir inode
  • ustat (stat)
  • mgr (sockaddr)
  • fsize (int64_t)
  • p_stat (pvfs_filestat)
    • base (int32_t)         first iod to be used
    • pcount (int32_t)       # of iods for the file
    • ssize (int32_t)        stripe size
    • soff (int32_t)         NOT USED
    • bsize (int32_t)        NOT USED
21
Metadata Manager (6/8)
  • Crux Data Structure (2/3)
  • dmeta (.pvfsdir) for each directory (text format)

  [.pvfsdir file kept in the root directory of metadata, containing:
  fs_ino (int64), dr_uid (uid_t), dr_gid (gid_t), dr_mode (mode_t),
  port (int16) - port of the metadata server,
  host (char *) - host name of the metadata server,
  rd_path (char *) - root directory of metadata,
  fname (char *)]
22
Metadata Manager (7/8)
  • Crux Data Structure (3/3)
  • fmeta for each file
  • Stored in the dmeta directory under the same name as
    the file
  • Binary data format (112 bytes); see the sketch below
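
A simplified C sketch of the on-disk metadata structures assembled from the field lists above; the actual definitions live in include/meta.h, and layout/padding will differ (the real binary fmeta occupies 112 bytes).

#include <stdint.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Striping metadata for a file (pvfs_filestat). */
struct pvfs_filestat {
    int32_t base;     /* first iod to be used        */
    int32_t pcount;   /* # of iods for the file      */
    int32_t ssize;    /* stripe size                 */
    int32_t soff;     /* NOT USED                    */
    int32_t bsize;    /* NOT USED                    */
};

/* Per-file metadata (fmeta), stored in binary under the file's own name. */
struct fmeta {
    ino_t f_ino;                  /* inode number                     */
    struct stat ustat;            /* UNIX stat information            */
    struct sockaddr mgr;          /* address of the metadata manager  */
    int64_t fsize;                /* file size                        */
    struct pvfs_filestat p_stat;  /* striping parameters              */
};

/* Per-directory metadata (dmeta), stored as text in .pvfsdir. */
struct dmeta {
    int64_t fs_ino;    /* fs root directory inode number    */
    uid_t dr_uid;      /* directory owner UID               */
    gid_t dr_gid;      /* directory owner GID               */
    mode_t dr_mode;    /* directory mode                    */
    int16_t port;      /* port of the metadata server       */
    char *host;        /* host name of the metadata server  */
    char *rd_path;     /* root directory of metadata        */
    char *sd_path;     /* root directory of metadata        */
    char *fname;       /* directory name                    */
};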

23
Metadata Manager (8/8)
  • Request Handler
  • Common parameters to each request handler (see the
    sketch below)
  • int sock: connected socket descriptor
  • mreq_p req_p: request packet from the client
  • void *data_p: data packet from the client
  • mack_p ack_p: ack that will be sent to the client
  • Request handler hierarchy

  md_xxx() series, for miscellaneous operations, in meta/md_xxx.c
  meta_xxx() series, for raw metadata operations, in meta/metaio.c
  do_xxx() series, in mgr.c
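
A minimal sketch of how such handlers can share a signature and be dispatched by request type; the mreq/mack layouts, handler names, and dispatch table here are all hypothetical stand-ins for the real definitions in mgr.c and the meta headers.

#include <stddef.h>

/* Hypothetical request/ack layouts; the real mreq/mack come from PVFS. */
typedef struct mreq { int type; size_t dsize; /* per-request fields */ } *mreq_p;
typedef struct mack { int status; size_t dsize; /* per-reply fields */ } *mack_p;

/* Common signature of the request handlers:
 *   sock   - connected socket descriptor
 *   req_p  - request packet from the client
 *   data_p - optional data packet that followed the request
 *   ack_p  - ack that will be sent back to the client          */
typedef int (*reqhandler_t)(int sock, mreq_p req_p, void *data_p, mack_p ack_p);

/* Hypothetical handlers; the real do_xxx() handlers in mgr.c call down
 * into the meta_xxx() and md_xxx() series for the actual metadata work. */
static int do_open_sketch(int sock, mreq_p req, void *data, mack_p ack)
{ (void)sock; (void)req; (void)data; ack->status = 0; return 0; }
static int do_close_sketch(int sock, mreq_p req, void *data, mack_p ack)
{ (void)sock; (void)req; (void)data; ack->status = 0; return 0; }

/* Dispatch table indexed by request type. */
static reqhandler_t handlers[] = { do_open_sketch, do_close_sketch };

static int dispatch(int sock, mreq_p req, void *data, mack_p ack)
{
    if (req->type < 0 ||
        (size_t)req->type >= sizeof(handlers) / sizeof(handlers[0]))
        return -1;
    return handlers[req->type](sock, req, data, ack);
}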
24
Kernel Module (1/5)
  • Work flow

[Diagram: app → VFS (kernel level) → /dev/pvfsd → pvfsd (user level) → PVFS I/O servers]
  • For file and directory operations on a mounted PVFS, the
    VFS routes the operation to the PVFS operations (mount.pvfs.c)
  • The kernel module hands each operation to pvfsd through
    pvfsdev and waits for the result
  • pvfsd forwards the operation to the PVFS I/O servers and
    returns the result through pvfsdev
25
Kernel module (2/5)
  • Key data structures
  • ll_pvfs.h
  • almost all of the key structures are in this file
    (sketched below)
  • struct pvfs_upcall
  • type
  • sequence number
  • parameters for each operation
  • struct pvfs_downcall
  • type
  • sequence number
  • return value
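
A rough sketch of the two message types named above, with only the fields from this slide; the real structures in ll_pvfs.h carry a full per-operation parameter union and are larger. The operation codes and parameter fields shown are illustrative.

#include <stddef.h>
#include <sys/types.h>

/* Illustrative operation codes; ll_pvfs.h defines the real set. */
enum pvfs_op { PVFS_OP_GETATTR, PVFS_OP_READ, PVFS_OP_WRITE /* ... */ };

/* Request passed up from the kernel module to pvfsd (struct pvfs_upcall). */
struct pvfs_upcall {
    enum pvfs_op type;        /* requested operation                      */
    unsigned long seq;        /* sequence number, matched by the downcall */
    union {                   /* parameters for each operation (sketch)   */
        struct { ino_t ino; } getattr;
        struct { ino_t ino; off_t off; size_t len; } rw;
    } u;
};

/* Reply written back by pvfsd to the kernel module (struct pvfs_downcall). */
struct pvfs_downcall {
    enum pvfs_op type;        /* echoes the upcall type            */
    unsigned long seq;        /* echoes the upcall sequence number */
    long error;               /* return value of the operation     */
};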

26
Kernel module (3/5)
  • VFS
  • VFS operation
  • dir.c, file.c, inode.c
  • Build the request and enqueue it on the queue
    (served by pvfsdev)
  • If the caller needs to wait, a wait flag is set
  • Mount
  • mount.pvfs.c
  • When PVFS is mounted, VFS operations are connected to
    the PVFS directory, file, and inode operations

27
Kernel module (4/5)
  • pvfsdev(pvfsdev.h, pvfsdev.c)
  • kernel module
  • Read
  • A read on the device hands pvfsd the next request from
    the queue (struct pvfs_upcall)
  • Write
  • A write on the device returns the result of the operation
    to the kernel (struct pvfs_downcall)
  • Provides synchronous operations

28
Kernel module (5/5)
  • pvfsd (pvfsd.h, pvfsd.c); its main loop (sketched below):
  • read the request from pvfsdev
  • parse the request
  • send it to the I/O server
  • receive the return value
  • write the return value to pvfsdev
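
A minimal user-level sketch of that loop, using stand-ins for the upcall/downcall structures; the real pvfsd.c also manages its connections to mgr and the iods, which serve_request() only hints at here.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Minimal stand-ins for the structures sketched for ll_pvfs.h. */
struct pvfs_upcall   { int type; unsigned long seq; /* params omitted */ };
struct pvfs_downcall { int type; unsigned long seq; long error; };

/* Hypothetical stand-in for the real work: parse the request and talk
 * to the metadata manager / I/O servers over the network. */
static long serve_request(const struct pvfs_upcall *up) { (void)up; return 0; }

int main(void)
{
    int dev = open("/dev/pvfsd", O_RDWR);
    if (dev < 0) { perror("open /dev/pvfsd"); return 1; }

    for (;;) {
        struct pvfs_upcall up;
        struct pvfs_downcall down;

        /* 1. read the next request from pvfsdev */
        if (read(dev, &up, sizeof(up)) != (ssize_t)sizeof(up))
            break;

        /* 2-4. parse it, send it to the I/O server, receive the result */
        memset(&down, 0, sizeof(down));
        down.type  = up.type;
        down.seq   = up.seq;
        down.error = serve_request(&up);

        /* 5. write the return value back to pvfsdev */
        if (write(dev, &down, sizeof(down)) != (ssize_t)sizeof(down))
            break;
    }
    close(dev);
    return 0;
}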

29
I/O Server (1/3)
  • main() (sketched below)
  • Make a listening socket
  • for (;;)
  • Wait until some sockets are ready (the listening socket
    and others)
  • For all ready sockets:
  • If a socket is ready but not registered in the job
    list (meaning it is a new connection),
    call new_request(ready socket)
  • If a socket is ready and registered in the job list,
    there are read or write jobs to be done, so
    call do_job(socket, jinfo pointer)
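
A skeleton of that loop using select(); the job list, port number, and the new_request()/do_job() stubs are placeholders for the real iod structures.

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

static int in_job_list[FD_SETSIZE];                            /* placeholder job list */
static void new_request(int sock) { in_job_list[sock] = 1; }   /* stub */
static void do_job(int sock)      { in_job_list[sock] = 0; }   /* stub */

int main(void)
{
    /* Make a listening socket. */
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7000);                     /* illustrative port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (lsock < 0 || bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 16) < 0) { perror("listen"); return 1; }

    fd_set open_fds;
    FD_ZERO(&open_fds);
    FD_SET(lsock, &open_fds);
    int maxfd = lsock;

    for (;;) {
        fd_set ready = open_fds;
        /* Wait until some sockets are ready (listening and others). */
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
            break;

        for (int fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == lsock) {                       /* accept new connection */
                int c = accept(lsock, NULL, NULL);
                if (c >= 0 && c < FD_SETSIZE) {
                    FD_SET(c, &open_fds);
                    if (c > maxfd) maxfd = c;
                }
            } else if (!in_job_list[fd]) {           /* not in the job list   */
                new_request(fd);                     /* -> a new request      */
            } else {                                 /* pending read/write    */
                do_job(fd);
            }
        }
    }
    return 0;
}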

30
I/O Server (2/3)
  • new_request(int socket)
  • Handles receiving new requests and calling the
    appropriate handler function
  • Calls do_rw_req(), which builds jobs for
    read/write requests
  • Returns 0 on success, -errno on failure.

31
I/O Server (3/3)
  • do_job(int sock, jinfo_p j_p)
  • Calls do_accesses()
  • do_access()
  • performs necessary operations to complete one
    access
  • Pseudo code (sketched below)
  • If it is a read operation:
  • mmap the file
  • call nbsend() to the client
  • If it is a write operation, call do_write()
  • do_write()
  • Calls nbrecv()
  • Then, calls write() to disk
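
A hedged sketch of those two paths: mmap plus a socket send for reads, and a receive plus write() for writes; the real iod uses non-blocking nbsend()/nbrecv() and its job/access bookkeeping, and handles mmap alignment, which this sketch glosses over.

#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read path: map the requested region of the iod's local file and send
 * the mapped bytes to the client (the real code uses nbsend()).
 * Note: off must be page-aligned for mmap(); the real code handles this. */
static int serve_read(int sock, int fd, off_t off, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
    if (p == MAP_FAILED)
        return -1;
    ssize_t sent = send(sock, p, len, 0);
    munmap(p, len);
    return sent == (ssize_t)len ? 0 : -1;
}

/* Write path, as in do_write(): receive the data from the client (the
 * real code uses nbrecv()) and then write it to disk. */
static int serve_write(int sock, int fd, off_t off, size_t len, char *buf)
{
    ssize_t got = recv(sock, buf, len, MSG_WAITALL);
    if (got != (ssize_t)len)
        return -1;
    return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}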

32
NRL File System: Current Status and Discussion
  • System Design

33
Design
  • PVFS + cooperative caching
  • Distributed cache manager
  • Kernel-level
  • iod cache utilization?
  • Unified cache?

[Diagram: client nodes running pvfsd, each with a cache, plus the mgr, connected through the network to the iods, each with its own cache]
34
Status
  • Source code analysis
  • Almost done
  • Cache manager in kernel-level
  • PVFS as root file system
  • We have problems

35
Problems (1/3)
  • Parallel-I/O + caching: is caching effective?
  • PVFS applications: large-volume access
  • Spatial/temporal locality
  • Locality handled in the page cache? File cache?
  • A result from cooperative caching research:
  • Lower-level cache: poor hit ratio
  • Too large a data set: capacity misses
  • MPI-IO: application-level cache
  • Data sieving / collective I/O
  • Worse hit ratio at the file-system cache
  • We aren't sure whether caching is effective or not

36
Problems (2/3)
  • Parallel-I/O + caching: is parallel-I/O
    effective?
  • Who needs caching?
  • Web server cluster
  • General purpose file system
  • Problems
  • Small files: no need for striping
  • Metadata manager / parallel-I/O overhead
  • We need more investigation
  • Benchmarks
  • Disk access behavior drives the caching and prefetching
    algorithms
  • Most file-system benchmarks are not concerned with
    caching
  • Real applications
  • Related works

37
Problems (3/3)
  • File system for SSI
  • Root file system
  • Survey on diskless NFS
  • Consistency
  • Cache consistency: which model does SSI want?
  • File consistency: none
  • File migration
  • Parallel-I/O does not need file migration
  • PVFS is not fault-tolerant
  • RAID
  • The metadata structure would have to be modified; is
    that sound?
  • Software RAID: not an attractive topic
  • High availability
  • PVFS2: distributed metadata manager

38
Solution
  • Parallel-I/O + caching: let's try!
  • General purpose file system
  • Idea (see the sketch after this list)
  • No parallel-I/O for small files
  • Small files: caching only, with cache replication /
    migration
  • Larger stripe size
  • How to reduce overhead?
  • No caching for large-volume accesses
  • Rely on the application cache
  • Caching based on access frequency
  • Looking for benchmarks / real applications
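
Purely as an illustration of these ideas (not an existing PVFS mechanism), a cache-admission check could combine a small-file threshold with an access-frequency count; every name and threshold below is made up.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative thresholds; nothing here comes from PVFS. */
#define SMALL_FILE_MAX   (64 * 1024)    /* files below this are "small"        */
#define LARGE_ACCESS_MIN (1024 * 1024)  /* accesses above this bypass caching  */
#define HOT_THRESHOLD    4              /* accesses before a file is cached    */

struct file_hint {
    size_t file_size;       /* total file size                   */
    unsigned access_count;  /* how often this file has been read */
};

/* Small files: cache (and possibly replicate / migrate), never stripe.
 * Large-volume accesses: skip the file-system cache and rely on the
 * application-level cache (e.g. MPI-IO collective I/O).
 * Everything else: cache only once it is accessed frequently enough. */
static bool should_cache(const struct file_hint *h, size_t access_len)
{
    if (h->file_size <= SMALL_FILE_MAX)
        return true;
    if (access_len >= LARGE_ACCESS_MIN)
        return false;
    return h->access_count >= HOT_THRESHOLD;
}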

39
PVFS2
  • Next Generation PVFS
  • CLUSTER01 extended abstract
  • Will be released in summer 2002
  • PVFS2 features
  • Requests based on MPI Datatypes
  • Programmable distributions
  • Support for data replication
  • Modular metadata subsystem
  • Distributed metadata mgr
  • Allows arbitrary metadata attributes
  • Multiple network transport methods
  • VIA, GM
  • VFS-like system interface