Title: Chapter 8: Distributed File Systems
1. Chapter 8: Distributed File Systems
- Speaker: ??? M9129001
- 5/2/2003
2. Outline
- Introduction
- File service architecture
- Sun Network File System (NFS)
- Andrew File System (AFS)
- Recent advances
- Summary
3. File System
- A file system
  - Is responsible for the organization, storage, retrieval, naming, sharing and protection of files
  - Is designed to store and manage a large number of files, with facilities for creating, naming and deleting files
  - Stores programs and data and makes them available as needed
4. Layered File System
5. Persistence
- Probably one of the most important services provided by a file system is persistence
- Files exist after the program, and even the computer, has terminated
- Files typically do not go away
  - They are persistent and exist between sessions
- In conventional systems, files are the only persistent objects
6. Distributed Objects
- Using the OO paradigm, it is easy to build distributed systems
  - Place objects on different machines
- Systems have been developed that allow this
  - Java RMI
  - CORBA ORB
- Having a persistent object store would be useful
  - Java RMI activation daemon
  - Certain ORB implementations
7. Properties of Storage Systems
Types of consistency between copies: 1 = strict one-copy consistency; ~ = approximate consistency; X = no automatic consistency
Figure 8.1

Storage system                        Sharing  Persistence  Distributed      Consistency  Example
                                                            cache/replicas   maintenance
Main memory                           no       no           no               1            RAM
File system                           no       yes          no               1            UNIX file system
Distributed file system               yes      yes          yes              ~            Sun NFS
Web                                   yes      yes          yes              X            Web server
Distributed shared memory             yes      no           yes              ~            Ivy (Ch. 16)
Remote objects (RMI/ORB)              yes      no           no               1            CORBA
Persistent object store               yes      yes          no               1            CORBA Persistent Object Service
Persistent distributed object store   yes      yes          yes              ~            PerDiS, Khazana
8. File Model
- Files contain both data and attributes
- Data is a sequence of bytes accessible by read/write operations
- Attributes consist of a collection of information about the file
9. Common File Attributes
10. UNIX file system operations
Example A: Write a simple C program to copy a file using the UNIX file system operations shown in Figure 8.4.
copyfile(char *oldfile, char *newfile)
{
  <you write this part, using open(), creat(), read(), write()>
}
Note: remember that read() returns 0 when you attempt to read beyond the end of the file.
11. File system modules
12. File Service
- A file service allows for the storage and access of files on a network
  - Remote file access is identical to local file access
  - Convenient for users who use different workstations
- Other services can be easily implemented
- Makes management and deployment easier and more economical
- File systems were the first distributed systems to be developed
- Defines the service, not the implementation
13. File Server
- A process that runs on some machine and helps to implement the file service
- A system may have more than one file server
14. File Service Models
[Figure: two models of file service. In the upload/download model the client fetches the whole file (e.g. ReadMe.txt) from the server, works on it locally, and uploads it back. In the remote access model the file stays on the server and the client sends individual operations to it.]
15. File Service Requirements
- Transparency
  - Access
    - Programs are unaware of the fact that files are distributed
  - Location
    - Programs see a uniform file name space. They do not know, or care, where the files are physically located
  - Mobility
    - Programs do not need to be informed when files move (provided the name of the file remains unchanged)
16. Transparency Revisited
- Location Transparency
  - Path name gives no hint where the file is physically located
  - \\redshirt\ptt\dos\filesystem.ppt
    - File is on redshirt, but where is redshirt?
17. File Service Requirements (cont.)
- Transparency
  - Performance
    - Satisfactory performance across a specified range of system loads
  - Scaling
    - Service can be expanded to meet additional loads
18. File Service Requirements (cont.)
- Concurrent File Updates
  - Changes to a file by one program should not interfere with the operation of other clients simultaneously accessing the same file
  - File-level or record-level locking
  - Other forms of concurrency control to minimize contention
19. File Locking
- A lock cannot be granted if another process already has a lock on the file (or block)
- The client's request is put at the end of a FIFO queue
- As soon as the lock is released, the server grants the next lock to the client at the head of the queue
[Figure: three clients and a file server with the file on disk. Clients 1 and 2 hold copies of the file; Client 3's lock request (Request 3) is sent to the end of the server's FIFO queue behind Request 2, and when the current holder returns the lock the server grants it to the client at the head of the queue.]
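The grant-and-release logic described on this slide can be sketched in C. This is an illustrative, in-memory sketch only: the struct and function names are invented, and a real server would keep one such queue per file or per locked block.

```c
#include <assert.h>

#define MAX_WAITERS 16

/* Hypothetical server-side state: one lock with a FIFO queue of waiters. */
struct lock_queue {
    int held;                 /* 1 if some client currently holds the lock */
    int holder;               /* client id of the current holder */
    int waiters[MAX_WAITERS]; /* FIFO queue of waiting client ids */
    int head, tail, count;
};

/* Returns 1 if the lock was granted immediately, 0 if the request
 * was sent to the end of the queue. */
int lock_request(struct lock_queue *q, int client)
{
    if (!q->held) {
        q->held = 1;
        q->holder = client;
        return 1;
    }
    q->waiters[q->tail] = client;            /* sent to end of queue */
    q->tail = (q->tail + 1) % MAX_WAITERS;
    q->count++;
    return 0;
}

/* The holder returns the lock; grant it to the client at the head of
 * the queue, if any.  Returns the new holder's id, or -1 if none. */
int lock_release(struct lock_queue *q)
{
    if (q->count == 0) {
        q->held = 0;
        return -1;
    }
    q->holder = q->waiters[q->head];
    q->head = (q->head + 1) % MAX_WAITERS;
    q->count--;
    return q->holder;
}
```

Because the queue is strictly FIFO, waiters are granted the lock in arrival order, which matches the behaviour in the figure.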
20. File Service Requirements (cont.)
- File Replication
  - A file may be represented by several copies of its contents at different locations
  - Load-sharing between servers makes the service more scalable
  - Local access has better response (lower latency)
  - Fault tolerance
  - Full replication is difficult to implement
  - Caching (of all or part of a file) gives most of the benefits (except fault tolerance)
21. File Service Requirements (cont.)
- Hardware and Software Heterogeneity
  - Service can be accessed by clients running on (almost) any OS or hardware platform
  - Design must be compatible with the file systems of different OSes
  - Service interfaces must be open - precise specifications of APIs are published
22. File Service Requirements (cont.)
- Fault Tolerance
  - The service can continue to operate in the face of client and server failures
- Consistency
  - UNIX one-copy update semantics
- Security
  - Based on identity of the user making the request
  - Identities of remote users must be authenticated
  - Privacy requires secure communication
- Efficiency
  - Should offer facilities that are of at least the same power as those found in conventional systems
23. File Sharing Semantics
- When more than one user shares a file, it is necessary to define the semantics of reading and writing
- For single-processor systems
  - The system enforces an absolute time ordering on all operations and always returns the result of the last operation
  - Referred to as UNIX semantics
24. UNIX Semantics
[Figure: the original file contains "ab". Process PID0 writes "c", and a subsequent read by PID1 gets "abc" - every completed write is immediately visible to all processes.]
25. Distributed
[Figure: the file "ab" is copied to two clients. At Client 1, PID0 reads "ab" and writes "c" to its local copy; at Client 2, PID1 still reads "ab" from its own copy - the update is not visible until the copies are reconciled.]
26. Summary
Method             Comment
UNIX semantics     Every operation on a file is instantly visible to all processes
Session semantics  No changes are visible to other processes until the file is closed
Immutable files    No updates are possible; simplifies sharing and replication
Transactions       All changes have the all-or-nothing property
27. Caching
- Attempt to hold what is needed by the process in high-speed storage
- Parameters
  - What unit does the cache manage?
    - Files, blocks, ...
  - What do you do when the cache fills up?
    - Replacement policy
28. Cache Consistency
- The real problem with caching in distributed file systems is cache consistency
- If two processes are caching the same file, how do the local copies find out about changes made to the file?
- When they close their files, who wins the race?
- Client caching needs to be thought out carefully
29. Cache Strategies
Method               Comment
Write-through        Changes to a file are sent to the server immediately; works, but does not reduce write traffic
Delayed write        Changes are sent to the server periodically; better performance but possibly ambiguous semantics
Write on close       Changes are written when the file is closed; matches session semantics
Centralized control  File server keeps track of who has which file open and for what purpose; UNIX semantics, but not robust, and scales poorly
30. Replication
- Multiple copies of files are maintained
  - Increase reliability by having several copies of a file
  - Allow file access to occur even if a server is down
  - Load-balancing across servers
- Replication transparency
  - To what extent is the user aware that some files are replicated?
31. Types of Replication
[Figure: three schemes, each involving a client C and servers S0, S1, S2. In explicit file replication the client writes to each replica itself. In lazy file replication the client writes to one server now, and that server propagates the update to the others later. In group replication the client's write is delivered to the whole server group at once.]
32. Update Protocols
- Okay, so now we have replicas; how do we update them?
- Primary Copy Replication
  - Change is sent to the primary
  - Primary sends changes to the secondary servers
- Voting
  - Avoids the primary-copy weakness that if the primary is down, you are dead
  - Client must receive permission from multiple servers before making an update
33. File Service Architecture
Figure 8.5
34. System Modules
- Flat File Service
  - Implements operations on the contents of files
  - UFIDs are used to refer to files (think i-node)
- Directory Service
  - Provides a mapping between text names and UFIDs
  - Note that the name space is not necessarily flat
  - Might be a client of the flat file service if it requires persistent storage
- Client Module
  - Provides client access to the system
35. Server operations for the model file service
Figures 8.6 and 8.7
- Flat file service
  - Read(FileId, i, n) -> Data
  - Write(FileId, i, Data)
  - Create() -> FileId
  - Delete(FileId)
  - GetAttributes(FileId) -> Attr
  - SetAttributes(FileId, Attr)
- Directory service
  - Lookup(Dir, Name) -> FileId
  - AddName(Dir, Name, File)
  - UnName(Dir, Name)
  - GetNames(Dir, Pattern) -> NameSeq
Pathname lookup: Pathnames such as '/usr/bin/tar' are resolved by iterative calls to Lookup(), one call for each component of the path, starting with the ID of the root directory '/', which is known in every client.
Example B: Show how each file operation of the program that you wrote in Example A would be executed using the operations of the Model File Service in Figures 8.6 and 8.7.
FileId: A unique identifier for files anywhere in the network. Similar to the remote object references described in Section 4.3.3.
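The step-by-step resolution of '/usr/bin/tar' can be sketched in C. The hard-wired directory table and the helper names below are invented stand-ins for the model directory service; in a real system each Lookup() call would be an RPC to the directory server.

```c
#include <assert.h>
#include <string.h>

typedef int FileId;            /* stand-in for a UFID */
#define ROOT_UFID 0            /* root UFID known to every client */
#define NOT_FOUND (-1)

/* Tiny hard-wired directory contents standing in for the service:
 * the entries needed to resolve '/usr/bin/tar'. */
struct entry { FileId dir; const char *name; FileId file; };
static struct entry table[] = {
    { ROOT_UFID, "usr", 1 },
    { 1,         "bin", 2 },
    { 2,         "tar", 3 },
};

/* Model directory service operation: Lookup(Dir, Name) -> FileId. */
static FileId Lookup(FileId dir, const char *name)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].dir == dir && strcmp(table[i].name, name) == 0)
            return table[i].file;
    return NOT_FOUND;
}

/* Resolve an absolute pathname by one Lookup() per path component. */
FileId resolve(const char *path)
{
    char buf[256];
    char *component;
    FileId cur = ROOT_UFID;

    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (component = strtok(buf, "/");
         component != NULL && cur != NOT_FOUND;
         component = strtok(NULL, "/"))
        cur = Lookup(cur, component);   /* one call per component */
    return cur;
}
```

Note how a long pathname costs one server interaction per component; this is exactly the lookup() traffic later blamed for about half of all NFS server calls.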
36. Example B solution
Show how each file operation of the program that you wrote in Example A would be executed using the operations of the Model File Service in Figures 8.6 and 8.7.
int n = 1;
if ((fdold = open(oldfile, READ)) >= 0) {
  fdnew = creat(newfile, FILEMODE);
  while (n > 0) {
    n = read(fdold, buf, BUFSIZE);
    if (write(fdnew, buf, n) < 0) break;
  }
  close(fdold); close(fdnew);
}
37. File Group
- A collection of files that can be located on any server, or moved between servers, while maintaining the same names
  - Similar to a UNIX filesystem
  - Helps with distributing the load of file serving between several servers
- File groups have identifiers which are unique throughout the system (and hence, for an open system, they must be globally unique)
  - Used to refer to file groups and files
38. NFS
- NFS was originally designed and implemented by Sun Microsystems
- Three interesting aspects
  - Architecture
  - Protocol (RFC 1094)
  - Implementation
- Sun's RPC system was developed for use in NFS
  - Can be configured to use UDP or TCP
39. Overview
- The basic idea is to allow an arbitrary collection of clients and servers to share a common file system
- An NFS server exports one of its directories
- Clients access exported directories by mounting them
- To programs running on the client, there is almost no difference between local and remote files
40. NFS architecture
Figure 8.8
[Figure: a client computer and a server computer. On the client, application programs call into the virtual file system, which routes each request either to the local UNIX file system or to the NFS client module; the NFS client communicates over the network with the NFS server module, which sits below the server's virtual file system alongside the server's own UNIX file system.]
41. Remote Procedure Call Model
1. Client stub is called; argument marshalling is performed
2. Local kernel is called with the network message
3. Network message is transferred to the remote host
4. Server stub is given the message; arguments are unmarshalled/converted
5. Remote procedure is executed with the arguments
6. Procedure return values are given back to the server stub
7. Server stub converts/marshals the values; the remote kernel is called
8. Network message is transferred back to the local host
9. Client stub receives the message from its kernel
10. Return values are given to the client from the stub
[Figure: the client process (client routine and client stub) and the server process (server routine and server stub) are connected through their local kernels' network routines; the numbered steps trace one round trip behind the local-procedure-call illusion.]
42. Virtual File System
- Part of the Unix kernel
- Makes access to local and remote files transparent
- Translates between Unix file identifiers and NFS file handles
- Keeps track of the file systems that are currently available, both locally and remotely
- NFS file handles
  - File system ID
  - I-node number
  - I-node generation number
- File systems are mounted
  - The VFS keeps a structure for each mounted file system
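As a rough illustration, the three file-handle components listed above might be laid out as below. This is an assumed layout for illustration only: real NFS file handles are opaque byte strings whose internal format is chosen by the server, and the stale-handle check is likewise a sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative contents of an NFS-style file handle (assumed layout). */
struct nfs_fh {
    uint32_t fsid;        /* file system identifier */
    uint32_t inode;       /* i-node number within that file system */
    uint32_t generation;  /* i-node generation number: incremented when
                             the i-node is reused for a new file, so the
                             server can reject stale handles */
};

/* Server-side sketch: a handle is stale if the i-node has been reused
 * (its generation number changed) since the handle was issued. */
int fh_is_stale(const struct nfs_fh *fh, uint32_t current_generation)
{
    return fh->generation != current_generation;
}
```

The generation number is what lets a stateless server detect that a client is presenting a handle for a file that has since been deleted and its i-node recycled.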
43. Virtual File System (cont.)
44. NFS architecture: does the implementation have to be in the system kernel?
- No
  - There are examples of NFS clients and servers that run at application level, as libraries or processes (e.g. early Windows and MacOS implementations, current PocketPC, etc.)
- But, for a Unix implementation there are advantages
  - Binary code compatible - no need to recompile applications
  - Standard system calls that access remote files can be routed through the NFS client module by the kernel
  - Shared cache of recently-used blocks at the client
  - Kernel-level server can access i-nodes and file blocks directly
    - But a privileged (root) application program could do almost the same
  - Security of the encryption key used for authentication
45. NFS server operations (simplified)
Figure 8.9
- read(fh, offset, count) -> attr, data
- write(fh, offset, count, data) -> attr
- create(dirfh, name, attr) -> newfh, attr
- remove(dirfh, name) -> status
- getattr(fh) -> attr
- setattr(fh, attr) -> attr
- lookup(dirfh, name) -> fh, attr
- rename(dirfh, name, todirfh, toname)
- link(newdirfh, newname, dirfh, name)
- readdir(dirfh, cookie, count) -> entries
- symlink(newdirfh, newname, string) -> status
- readlink(fh) -> string
- mkdir(dirfh, name, attr) -> newfh, attr
- rmdir(dirfh, name) -> status
- statfs(fh) -> fsstats
46. NFS Client Module
- Part of the Unix kernel
- Allows user programs to access files via UNIX system calls without recompilation or reloading
- One module serves all user-level processes
- A shared cache holds recently used blocks
- The encryption key for authentication of user IDs is kept in the kernel
47. NFS access control and authentication
- Stateless server, so the user's identity and access rights must be checked by the server on each request
  - In the local file system they are checked only on open()
- Every client request is accompanied by the userID and groupID
  - Not shown in Figure 8.9 because they are inserted by the RPC system
- Server is exposed to impostor attacks unless the userID and groupID are protected by encryption
- Kerberos has been integrated with NFS to provide a stronger and more comprehensive security solution
  - Kerberos is described in Chapter 7. Integration of NFS with Kerberos is covered later in this chapter.
48. Access Control
- NFS servers are stateless
  - The user's identity must be verified for each request
- The UNIX UID and GID of the user are used for authentication purposes
  - Does this scare you?
- Kerberized NFS
49. Mount service
- Mount operation
  - mount(remotehost, remotedirectory, localdirectory)
- Server maintains a table of clients who have mounted filesystems at that server
- Each client maintains a table of mounted file systems holding <IP address, port number, file handle>
- Hard versus soft mounts
50. File Mounting
- File mounting protocol
  - A client sends a path name to a server and can request permission to mount the directory
  - If the request is legal, the server returns a file handle to the client
  - The handle identifies the file system type, the disk, the i-node number of the directory, and security information
  - Subsequent calls to read/write in that directory use the file handle
51. Mount service
[Figure: the client issues the system call mount("Server 1", "/nfs/users", "/usr/staff"). The client's virtual file system routes local requests to its UNIX file system and remote requests through its NFS client to the NFS server on Server 1; the server's exported sub-tree (containing the students' directories, alongside entries such as /etc, /usr, vmunix and exports under the two roots) then appears within the client's name space.]
52. Local and Remote Access
Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
53. NFS path translation
- Pathnames are translated in a step-by-step procedure by the client
- The file handle obtained in one step is used as a parameter in the next lookup
54. Automounting
- Allows a number of remote directories to be associated with a local directory
- Nothing is mounted until a client tries to access a remote directory
- Advantages
  - Don't need to do any work if the files are not accessed
  - Some fault tolerance is provided
55. Server caching
- Read-ahead fetches the pages following those that have been recently read
- Delayed-write doesn't write out disk blocks until the cache buffer is needed for something else
  - The UNIX sync flushes altered pages to disk every 30 seconds
- The NFS commit operation forces the blocks of a file to be written when in delayed-write mode
- NFS also offers write-through caching: a block is written to disk before the reply is sent back to the client
- What problems occur with delayed-write?
- What problems occur with write-through?
56. Client caching (reads)
- Client caching can result in inconsistent files. Why?
- NFS uses timestamp-based validation of cache blocks
  - Tc is the time the block was last validated
  - Tm is the time the block was last modified at the server
  - t is the freshness interval (set adaptively for individual files: 3 to 30 secs)
  - T is the current time
  - A cached block is valid if (T - Tc < t) or (Tm_client = Tm_server)
- A validation check is made by the client on each access
- When a new value of Tm is received for a file, it is applied to all blocks
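The validity condition above can be written directly as a C predicate. This is an illustrative sketch: the function name and the timestamp type are invented, and in a real client the second comparison would normally involve a getattr() call to fetch the server's Tm.

```c
#include <assert.h>

typedef long nfs_time_t;   /* stand-in for an NFS timestamp */

/* Sketch of the NFS client-cache validity check:
 * a block is used without contacting the server while it is fresh
 * (T - Tc < t); otherwise the modification times must agree. */
int block_is_valid(nfs_time_t T,          /* current time            */
                   nfs_time_t Tc,         /* time of last validation */
                   nfs_time_t t,          /* freshness interval      */
                   nfs_time_t Tm_client,  /* last modified, client   */
                   nfs_time_t Tm_server)  /* last modified, server   */
{
    if (T - Tc < t)
        return 1;                   /* still fresh: no server contact */
    return Tm_client == Tm_server;  /* revalidate against the server  */
}
```

The first branch is what saves traffic: within the freshness interval the client answers reads from its cache with no message to the server at all.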
57. Client caching (writes)
- Modified pages are marked as dirty and flushed at the next sync
- Bio-daemons (block input-output) perform read-ahead and delayed-write
  - Notified when the client reads a block, so they can fetch the following blocks
  - Notified when the client fills a block, so they can write it out
58. Other NFS optimizations
- Sun RPC runs over UDP by default (can use TCP if required)
- Uses the UNIX BSD Fast File System with 8-kbyte blocks
- reads() and writes() can be of any size (negotiated between client and server)
- The guaranteed freshness interval t is set adaptively for individual files to reduce the getattr() calls needed to update Tm
- File attribute information (including Tm) is piggybacked in replies to all file requests
59. Kerberized NFS
- The Kerberos protocol is too costly to apply on each file access request
- Kerberos is used in the mount service
  - To authenticate the user's identity
  - The user's UserID and GroupID are stored at the server with the client's IP address
- For each file request
  - The UserID and GroupID sent must match those stored at the server
  - IP addresses must also match
- This approach has some problems
  - Can't accommodate multiple users sharing the same client computer
  - All remote filestores must be mounted each time a user logs in
60. NFS performance
- Early measurements (1987) established that
  - write() operations are responsible for only 5% of server calls in typical UNIX environments
    - Hence write-through at the server is acceptable
  - lookup() accounts for 50% of operations, due to the step-by-step pathname resolution necessitated by the naming and mounting semantics
- More recent measurements (1993) show high performance
  - See www.spec.org for more recent measurements
- Provides a good solution for many environments, including
  - Large networks of UNIX and PC clients
  - Multiple web server installations sharing a single file store
61. NFS Summary
- An excellent example of a simple, robust, high-performance distributed service
- Achievement of transparencies (see Section 1.4.7):
  - Access: Excellent; the API is the UNIX system call interface for both local and remote files
  - Location: Not guaranteed but normally achieved; naming of filesystems is controlled by client mount operations, but transparency can be ensured by an appropriate system configuration
  - Concurrency: Limited but adequate for most purposes; when read-write files are shared concurrently between clients, consistency is not perfect
  - Replication: Limited to read-only file systems; for writable files, the SUN Network Information Service (NIS) runs over NFS and is used to replicate essential system files, see Chapter 14
cont'd
62. NFS summary
- Achievement of transparencies (continued):
  - Failure: Limited but effective; service is suspended if a server fails. Recovery from failures is aided by the simple stateless design
  - Mobility: Hardly achieved; relocation of files is not possible. Relocation of filesystems is possible, but requires updates to client configurations
  - Performance: Good; multiprocessor servers achieve very high performance, but for a single filesystem it's not possible to go beyond the throughput of a multiprocessor server
  - Scaling: Good; filesystems (file groups) may be subdivided and allocated to separate servers. Ultimately, the performance limit is determined by the load on the server holding the most heavily-used filesystem (file group)
63. Andrew File System (AFS)
64. Abstract
- Distributed file systems such as the AT&T RFS system provide the same consistency semantics as a single-machine file system, often at great cost to performance
- Distributed file systems such as Sun NFS provide good performance, but with extremely weak consistency guarantees
65. Abstract
- The designers of AFS believed that a compromise could be achieved between the two extremes
- AFS attempts to provide useful file system consistency guarantees along with good performance
66. Design Goals
- Performance
  - Minimize server load
  - Minimize network communications
- Consistency Guarantees
  - After a file system call completes, the resulting file system state is immediately visible everywhere on the network (with one exception, discussed later)
- Scalability
  - Provide the appearance of a single, unified file system to approximately 5000 client nodes connected on a single LAN
  - A single file server should provide service to about 50 clients
- UNIX Support
  - AFS is intended primarily for use by Unix workstations
67. Influential Observations
- Design goals were based in part on the following observations
  - Files are small; most are less than 10 kB
  - Read operations on files are much more common than writes (6x more common)
  - Sequential access is common; random access is rare
  - Most files are read and written by only one user. Usually only one user modifies a shared file
  - Files are referenced in bursts. If a file has been referenced recently, there is a high probability that it will be referenced again in the near future
68. Implementation Overview
69. Venus / Vice: Client / Server
- Venus
  - A user-level UNIX process that runs in each client computer and corresponds to the client module in the previous abstract model
- Vice
  - A user-level UNIX process that runs in each server machine
- Threads
  - Both Venus and Vice make use of a non-pre-emptive threads package to enable concurrent processing
70. File Name Space
- Local files are used only for
  - Temporary files (/tmp) and
  - Processes that are essential for workstation startup
- Other standard UNIX files
  - Such as those normally found in /bin, /lib
  - Implemented as symbolic links from the local space to the shared space
- Users' directories
  - In the shared space
  - Allows them to access their files from any workstation
71. File Name Space
72. System Call Interception
- User programs use conventional UNIX pathnames to refer to files, but AFS uses fids (file identifiers) in the communication between Venus and Vice
- Local file system calls
  - Handled as normal BSD Unix file system calls
- Remote file system calls
  - In situations requiring non-local file operations, the BSD Unix kernel has been modified to convert conventional UNIX pathnames to fids and forward the fids to Venus
  - Venus then communicates directly with Vice using fids
  - The BSD Unix kernel below Vice was also modified so that Vice can perform file operations in terms of fids instead of the conventional UNIX file descriptors
- Fid calculation on the client side minimizes server workload
73. System Call Interception
74. File Identifiers (fids)
- Each file and directory in the shared file space is identified by a unique, 96-bit fid
- Example: whatfid . ./ProblemFile
  - .             1.1969567049.855.16922
  - ./ProblemFile 1.1969567049.5880.16923
  - (www.transarc.com)
- The File IDentifier, or fid, is composed of four numbers
  - The first number is the cell number; "1" corresponds to the local cell
  - The second number is the volume id number. Use "vos listvldb 1969567049" to find the corresponding volume name
  - The third number is the vnode
  - The fourth number is the uniquifier
- AFS uses the third and fourth numbers to track the file's location in the cache
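For illustration, the 96-bit portion of a fid can be modelled as three 32-bit fields. This is an assumed layout: the cell number shown first in the printed form identifies the cell and is resolved separately, and the field and function names below are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the 96-bit AFS fid described above
 * (volume + vnode + uniquifier = 3 x 32 bits). */
struct afs_fid {
    uint32_t volume;      /* which volume holds the file             */
    uint32_t vnode;       /* index of the file within the volume     */
    uint32_t uniquifier;  /* distinguishes reuses of the same vnode  */
};

/* Two fids name the same file only if all three fields match;
 * the uniquifier prevents a recycled vnode from aliasing an old fid. */
int afs_fid_equal(struct afs_fid a, struct afs_fid b)
{
    return a.volume == b.volume
        && a.vnode == b.vnode
        && a.uniquifier == b.uniquifier;
}
```

Comparing whole fids rather than vnode numbers alone is what lets clients and the cache detect that a vnode has been reassigned to a new file.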
75. File System Call Implementation
76. Cache Consistency
- Callback Promise
  - A token issued by the Vice server that is the custodian of the file, guaranteeing that it will notify the Venus process when any other client modifies the file
  - Stored with the cached files on the client workstation disks
  - Two states
    - Valid
    - Cancelled
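A minimal sketch of the two callback states and the transition that the server's notification triggers; the type and function names here are invented for illustration.

```c
#include <assert.h>

/* The two states of a callback promise, as listed above. */
enum callback_state { CALLBACK_VALID, CALLBACK_CANCELLED };

/* Hypothetical per-file client-side cache record. */
struct cached_file {
    enum callback_state cb;
    /* ... fid, cached data, etc. ... */
};

/* Invoked when Vice notifies Venus that another client modified
 * the file: the promise is cancelled ("callback break"). */
void callback_break(struct cached_file *f)
{
    f->cb = CALLBACK_CANCELLED;
}

/* Before an open(): a valid callback means the cached copy can be
 * used without contacting the server at all. */
int can_use_cached_copy(const struct cached_file *f)
{
    return f->cb == CALLBACK_VALID;
}
```

The benefit described on the next slide falls out of this: while the state stays valid, opens generate no client-server traffic.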
77. Cache Consistency (cont.)
- Callback Benefits
  - Results in communication between the client and server only when the file has been updated
  - The client does not need to inform the server that it wants to open a file (likely for reading) if there is a valid copy on the client machine (the callback status is set to valid)
78. Cache Consistency (cont.)
- Callback Drawbacks
  - The mechanism used in AFS-2 and later versions requires Vice servers to maintain some state on behalf of their Venus clients
  - If clients on different workstations open, write, and close the same file concurrently, all but the update resulting from the last close will be silently lost (no error report is given). Clients must implement concurrency control independently if they require it
79. AFS Links
- http://www.transarc.ibm.com/Support/afs/
80. Recent advances in file services
- NFS enhancements
  - WebNFS - NFS server implements a web-like service on a well-known port. Requests use a 'public file handle' and a pathname-capable variant of lookup(). Enables applications to access NFS servers directly, e.g. to read a portion of a large file
  - One-copy update semantics (Spritely NFS, NQNFS) - include an open() operation and maintain tables of open files at servers, which are used to prevent multiple writers and to generate callbacks to clients notifying them of updates. Performance was improved by a reduction in getattr() traffic
- Improvements in disk storage organization
  - RAID - improves performance and reliability by striping data redundantly across several disk drives
  - Log-structured file storage - updated pages are stored contiguously in memory and committed to disk in large contiguous blocks (~1 Mbyte). File maps are modified whenever an update occurs. Garbage collection is used to recover disk space
81. New design approaches
- Distribute file data across several servers
  - Exploits high-speed networks (ATM, Gigabit Ethernet)
  - Layered approach; the lowest level is like a 'distributed virtual disk'
  - Achieves scalability even for a single heavily-used file
- 'Serverless' architecture
  - Exploits processing and disk resources in all available network nodes
  - Service is distributed at the level of individual files
- Examples
  - xFS (Section 8.5): experimental implementation demonstrated a substantial performance gain over NFS and AFS
  - Frangipani (Section 8.5): performance similar to local UNIX file access
  - Tiger Video File System (see Chapter 15)
  - Peer-to-peer systems: Napster, OceanStore (UCB), Farsite (MSR), Publius (AT&T research) - see the web for documentation on these very recent systems
82. New design approaches
- Replicated read-write files
  - High availability
  - Disconnected working
    - Re-integration after disconnection is a major problem if conflicting updates have occurred
  - Examples
    - Bayou system (Section 14.4.2)
    - Coda system (Section 14.4.3)
83. Summary
- Sun NFS is an excellent example of a distributed service designed to meet many important design requirements
- Effective client caching can produce file service performance equal to or better than local file systems
- Consistency versus update semantics versus fault tolerance remains an issue
- Superior scalability can be achieved with whole-file serving (Andrew FS) or the distributed virtual disk approach
- Future requirements
  - Support for mobile users, disconnected operation, automatic re-integration (cf. Coda file system, Chapter 14)
  - Support for data streaming and quality of service (cf. Tiger file system, Chapter 15)