Title: File Systems Concepts and Distributed File Systems
1. File Systems Concepts and Distributed File Systems
2. Structure of Windows 2000 (simplified)
3. Windows 2000 Architecture
- HAL (Hardware Abstraction Layer)
- HAL has calls to associate interrupt service procedures with interrupts and set their priorities, but does little else in this area
- Kernel
- Provides a higher-level abstraction of the hardware
- Provides complete mechanisms for doing context switches
- Includes code for thread scheduling
- Provides low-level support for two classes of objects (control objects and dispatcher objects)
4. Windows 2000 NTFS
- Hierarchical file system (similar to UNIX)
- Hard and symbolic links are supported
- Highly complex and sophisticated FS
- Different from MS-DOS FS
5. Fundamental Concepts
- An NTFS file is not just a linear sequence of bytes
- A file consists of multiple streams
- Multiple attributes, each represented by a stream of bytes
- Each stream has its own name, e.g. foo:stream1 (see the sketch below)
- The idea of multiple streams in a file comes from the Apple Macintosh
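To make the stream idea concrete, here is a minimal Win32 sketch (the file and stream names are made up for illustration): on NTFS, the "file:stream" syntax names an alternate data stream of an ordinary file.

    /* Hedged sketch: write to the named stream "stream1" of file "foo". */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("foo:stream1", GENERIC_WRITE,
                               0,             /* no sharing         */
                               NULL,          /* default security   */
                               CREATE_ALWAYS, /* create or truncate */
                               0, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }
        DWORD n;
        WriteFile(h, "stream data", 11, &n, NULL);
        CloseHandle(h);
        return 0;
    }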
6. Principal Win32 APIs for File I/O
7. Open File Issues
- To create and open a file, one must use CreateFile
- There is no FileOpen API
- The CreateFile system call has 7 parameters
- A pointer to the name of the file to create or open
- Flags telling whether the file can be read, written, or both
- Flags telling whether multiple processes can open the file at once
- A pointer to the security descriptor
- Flags telling what to do if the file exists/does not exist
- Flags dealing with attributes such as archiving and compression
- A handle of a file whose attributes should be cloned for the new file
- Example (expanded into a full program below)
- InHandle = CreateFile("data", GENERIC_READ, 0, NULL, OPEN_EXISTING, 0, NULL)
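The slide's one-line example, expanded into a compilable sketch (the file name "data" comes from the example; error handling and the read are added here for completeness):

    /* Open the existing file "data" read-only, read a buffer, close. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE InHandle = CreateFileA("data", GENERIC_READ, 0, NULL,
                                      OPEN_EXISTING, 0, NULL);
        if (InHandle == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }
        char buf[4096];
        DWORD got;
        if (ReadFile(InHandle, buf, sizeof buf, &got, NULL))
            printf("read %lu bytes\n", got);
        CloseHandle(InHandle);
        return 0;
    }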
8. NTFS Structure
- NTFS is divided into volumes
- Each NTFS volume includes files, directories, and other data structures
- Each volume is organized as a linear sequence of blocks (block size between 512 bytes and 4 KB)
- The main data structure in each volume is the Master File Table (MFT)
- A bitmap keeps track of which MFT entries are free
9. MFT for NTFS Organization
- The boot block has information about the first block of the MFT
10. MFT Record (Example for a 3-Run, 9-Block File)
11. UNIX I/O System in BSD
12. UNIX File System
- A UNIX file is a sequence of 0 or more bytes containing arbitrary information
- The meaning of the bits is entirely up to the file's owner
- File names
- up to 255 characters (BSD, System V, ...)
- base name + extension
13. Important Directories in Most UNIX Systems
14. Locking Property
- Multiple processes using the same file at the same time may lead to race conditions
- Solution 1: program the application with critical regions
- Solution 2: POSIX provides a flexible and fine-grained mechanism for processes to lock as little as possible (a sketch follows below)
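A minimal sketch of the POSIX mechanism mentioned in Solution 2, using fcntl() record locks (the file name and byte range are illustrative): only the bytes actually being updated are locked.

    /* Lock just bytes 100..199 of a shared file, update, unlock. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {0};
        fl.l_type   = F_WRLCK;   /* exclusive (write) lock      */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 100;       /* lock as little as possible: */
        fl.l_len    = 100;       /* just bytes 100..199         */
        if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("lock"); return 1; }

        /* ... critical region: update bytes 100..199 ... */

        fl.l_type = F_UNLCK;     /* release the same region     */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }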
15. Overlap of Locked Regions
A file with one lock
Addition of a second lock
Third lock
16. File System Calls
17. Directory System Calls
18. Implementation of the UNIX FS
- An i-node carries the metadata/attributes for exactly one file and is 64 bytes long
- i-node table
- a kernel structure that holds the i-nodes of all currently open files and directories
- Open file operation: the system reads the directory, comparing names until it finds the right one
- If the file is present, the system extracts the i-node number and uses it as an index into the i-node table
- An i-node table entry holds the device the file is on, the i-node number, the file mode, the file size, etc. (see the sketch below)
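A sketch of the kind of record such a table entry might hold (the field names are illustrative, not taken from any particular kernel):

    #include <sys/types.h>

    struct incore_inode {
        dev_t  dev;        /* device the file is on          */
        ino_t  inum;       /* i-node number on that device   */
        mode_t mode;       /* file type and permission bits  */
        off_t  size;       /* file size in bytes             */
        int    ref_count;  /* number of opens referencing it */
        /* ... block pointers, timestamps, owner, ... */
    };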
19. UNIX/Linux Issues
- Links
- support for hard links and soft/symbolic links
- UNIX also supports character and block special files
- Examples
- /dev/tty reads from the keyboard
- /dev/hd1 reads and writes a raw disk partition without regard to the file system
- Raw block devices are used for paging and swapping
- 4.3 BSD also supported symbolic links, which are files containing the path name of another file or directory. Soft (symbolic) links, unlike hard links, can point to directories and can cross file-system boundaries
20. Relation between fd, open file table, and i-node table
21. Layout of the Linux Ext2 File System
- When a file grows, Ext2 tries to put it into the same block group as its directory
- Linux allows 2-GB files instead of the 64 MB of UNIX Version 7
22. Linux /proc File System
- For every process in the system, a directory is created in /proc
- The name of the directory is the process's PID
- Many Linux extensions relate to other files and directories located in /proc
- These files contain a wide variety of information about the CPU, disk partitions, devices, interrupt vectors, kernel counters, etc. (see the sketch below)
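A small sketch of reading per-process information from /proc (the choice of /proc/1/status is just an example):

    #include <stdio.h>

    int main(void)
    {
        char path[64], line[256];
        int pid = 1;                         /* e.g. init/systemd */
        snprintf(path, sizeof path, "/proc/%d/status", pid);

        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);             /* name, state, memory, ... */
        fclose(f);
        return 0;
    }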
23. Distributed File Systems
- File service vs. file server
- The file service is the specification.
- A file server is a process running on a machine to implement the file service for (some) files on that machine.
- A normal distributed system would have one file service but perhaps many file servers.
- If we have very different kinds of file systems, we might not be able to have a single file service, as perhaps some functions are not available.
24. Distributed File Systems
- File Server Design
- File
- Sequence of bytes
- Unix
- MS-DOS
- Windows
- Sequence of Records
- Mainframes
- Keys
- We do not cover these file systems. They are
often discussed in database courses.
25. Distributed File Systems
- File attributes
- rwx and perhaps a (append)
- This is really a subset of what is called an ACL (access control list) or capability.
- You get ACLs and capabilities by reading the columns and rows of the access matrix.
- owner, group, various dates, size
- dump, auto-compress, immutable
26. Distributed File Systems
- Upload/download vs. remote access
- Upload/download means the only file services supplied are read-file and write-file.
- All modifications are done on a local copy of the file.
- Conceptually simple at first glance.
- Whole-file transfers are efficient (assuming you are going to access most of the file) when compared to multiple small accesses.
- Not an efficient use of bandwidth if you access only a small part of a large file.
- Requires storage on the client.
27. Distributed File Systems
- What about concurrent updates?
- What if one client reads and "forgets" to write for a long time and then writes back the "new" version, overwriting newer changes from others?
- Remote access means direct individual reads and writes to the remote copy of the file.
- The file stays on the server.
- Issue of (client) buffering
- Good to reduce the number of remote accesses.
- But what about semantics when a write occurs?
28. Distributed File Systems
- Note that metadata is written even for a read (e.g. the access time), so if you want faithful semantics, every client read must modify metadata on the server, or all requests for metadata (e.g. ls or dir commands) must go to the server.
- Cache consistency question
- Directories
- Mapping from names to files/directories.
- Contains rules for names of files and (sub)directories.
- Hierarchy, i.e. a tree
- (hard) links
29. Distributed File Systems
- With hard links the file system becomes a directed acyclic graph instead of a simple tree.
- Symbolic links
- Symbolic, not symmetric. Indeed asymmetric.
- Consider
- cd
- mkdir dir1
- touch dir1/file1
- ln -s dir1/file1 file2
30. Distributed File Systems
- file2 has a new i-node; it is a new type of file called a symlink, and its "contents" are the name of the file dir1/file1
- When accessed, file2 returns the contents of file1, but it is not equal to file1.
- If file1 is deleted, file2 "exists" but is invalid.
- If a new file1 is created, file2 now points to it.
- Symbolic links can point to directories as well.
- With symbolic links pointing to directories, the file system becomes a general graph, i.e. directed cycles are permitted. (A C version of the shell example follows below.)
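The same asymmetry in C, mirroring the shell commands above (error handling is mostly omitted for brevity):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        mkdir("dir1", 0755);
        close(open("dir1/file1", O_CREAT | O_WRONLY, 0644)); /* touch */
        symlink("dir1/file1", "file2");      /* ln -s dir1/file1 file2 */

        char target[256];
        ssize_t n = readlink("file2", target, sizeof target - 1);
        if (n >= 0) {
            target[n] = '\0';
            printf("file2 -> %s\n", target); /* the symlink's "contents" */
        }

        unlink("dir1/file1");                /* file2 now dangles, so */
        if (open("file2", O_RDONLY) < 0)     /* following it fails    */
            perror("open file2");
        return 0;
    }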
31. Distributed File Systems
- Imagine hard links pointing to directories (Unix does not permit this).
- cd
- mkdir B; mkdir C
- mkdir B/D; mkdir B/E
- ln B B/D/oh-my
- Now you have a loop with honest-looking links.
- Normally you can't remove a directory (i.e. unlink it from its parent) unless it is empty.
- But when you can have multiple hard links to a directory, you should permit removing (i.e. unlinking) one even if the directory is not empty.
32. Distributed File Systems
- So in the above example you could unlink B from A.
- Now you have garbage (unreachable, i.e. unnamable) directories B, D, and E.
- For a centralized system you need a conventional garbage collector.
- For a distributed system you need a distributed garbage collector, which is much harder.
- Transparency
- Location transparency
- The path name (i.e. the full name of the file) does not say where the file is located.
33. Distributed File Systems
- Location independence
- The path name is independent of the server. Hence you can move a file from server to server without changing its name.
- Have a namespace of files and then have some (dynamically) assigned to certain servers. This namespace would be the same on all machines in the system.
- Root transparency
- (a made-up name)
- / is the same on all systems
- This would ruin some conventions like /tmp
34. Distributed File Systems
- Examples
- Machine + path naming
- /machine/path
- machine:path
- Mounting a remote file system onto the local hierarchy
- When done intelligently we get location transparency
- A single namespace looking the same on all machines
35. Distributed File Systems
- Two-level naming
- We said above that a directory is a mapping from names to files (and subdirectories).
- More formally, the directory maps the user name /home/me/class-notes.html to the OS name for that file, 143428 (the Unix i-node number).
- These two names are sometimes called the symbolic and binary names.
- For some systems the binary names are available (see the sketch below).
36. Distributed File Systems
- The binary name could contain the server name, so that one could directly reference files on other file systems/machines
- Unix doesn't do this
- We could have symbolic links contain the server name
- Unix doesn't do this either
- VMS did something like this. A symbolic name was something like nodename::filename
- Could have the name lookup yield multiple binary names.
37. Distributed File Systems
- Redundant storage of files for availability
- Naturally must worry about updates
- When are they visible?
- Concurrent updates?
- Whenever you hear of a system that keeps multiple copies of something, an immediate question should be "are these immutable?". If the answer is no, the next question is "what are the update semantics?"
- Sharing semantics
- Unix semantics: a read returns the value stored by the last write.
38. Distributed File Systems
- Unix probably doesn't quite do this.
- If a write is large (several blocks), seeks are done for each block
- During a seek, the process sleeps (in the kernel)
- Another process can be writing a range of blocks that intersects the blocks of the first write.
- The result could be (depending on disk scheduling) that the file does not reflect a single last write.
- Perhaps Unix semantics means: a read returns the value stored by the last write, providing one exists.
- Perhaps Unix semantics means: a write syscall should be thought of as a sequence of write-block syscalls, and similarly for reads. A read-block syscall returns the value of the last write-block syscall for that block.
39. Distributed File Systems
- It is easy to get these same semantics for systems with file servers, given
- no client-side copies (i.e. no upload/download)
- no client-side caching
- Session semantics
- Changes to an open file are visible only to the process (machine???) that issued the open. When the file is closed the changes become visible to all.
- If you are using client caching you cannot flush dirty blocks until close. What if you run out of buffer space?
40. Distributed File Systems
- Messes up file-pointer semantics
- The file pointer is shared across fork, so all the children of a parent share it.
- But if the children run on another machine with session semantics, the file pointer can't be shared, since the other machine does not see the effect of the writes done by the parent.
- Immutable files
- Then there is "no problem"
- Fine if you don't want to change anything
41. Distributed File Systems
- Can have "version numbers"
- Usually the old version becomes inaccessible (at least under the current name)
- With version numbers, if you use the name without a number you get the highest-numbered version, so you would have what the book says.
- But really you do have the old (full) name accessible
- VMS definitely did this
- Note that directories are still mutable
- Otherwise no create-file would be possible
42. Distributed File Systems
- Distributed File System Implementation
- File usage characteristics
- Measured under Unix at a university
- Not obvious that the same results would hold in a different environment
- Findings
- 1. Most files are small (< 10K)
- 2. Reading dominates writing
- 3. Sequential accesses dominate
- 4. Most files have a short lifetime
43. Distributed File Systems
- 5. Sharing is unusual
- 6. Most processes use few files
- 7. File classes with different properties exist
- Some conclusions
- 1 suggests whole-file transfer may be worthwhile (except for really big files).
- 2 and 5 suggest client caching and dealing with multiple writers somehow, even if the latter is slow (since it is infrequent).
- 4 suggests doing creates on the client
44. Distributed File Systems
- Not so clear. Possibly the short-lifetime files are temporaries that are created in /tmp or /usr/tmp or /somethingorother/tmp. These would not be on the server anyway.
- 7 suggests having multiple mechanisms for the several classes.
- Implementation choices
- Servers and clients together?
- Common with Unix+NFS: any machine can be a server and/or a client
45. Distributed File Systems
- Separate modules: servers for files and directories are user programs, so one can configure some machines to offer the services and others not to
- Fundamentally different: either the hardware or the software is fundamentally different for clients and servers.
- In Unix some server code is in the kernel, but other code is a user program (run as root) called nfsd
- File and directory servers together?
46. Distributed File Systems
- If yes, less communication
- If no, more modular, "cleaner"
- Looking up a/b/c when a, a/b, and a/b/c are on different servers
- The natural solution is for server-a to return the name of server-a/b
- Then the client contacts server-a/b, gets the name of server-a/b/c, etc.
- Alternatively, server-a forwards the request to server-a/b, which forwards it to server-a/b/c.
- The natural method takes 6 communications (3 RPCs)
47. Distributed File Systems
- The alternative takes 4 communications but is not RPC
- Name caching
- The translation from a/b/c to the i-node (i.e. from symbolic to binary name) is expensive even for centralized systems.
- It is called namei in Unix and was once measured to be a significant percentage of all kernel activity.
- Later Unices added "namei caching"
- Potentially an even greater time saver for distributed systems, since communication is expensive (see the sketch below).
- Must worry about obsolete entries.
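A self-contained sketch of component-at-a-time lookup with a client-side name cache (every type and function here is a made-up stand-in, not a real RPC API; strtok_r is POSIX):

    #include <stdio.h>
    #include <string.h>

    typedef struct { int server_id; int handle; } binding;

    /* Stand-in for a real RPC: pretend each component lands on the
       next server and yields some handle. */
    static binding rpc_lookup(binding dir, const char *comp)
    {
        printf("RPC to server %d: lookup \"%s\"\n", dir.server_id, comp);
        binding b = { dir.server_id + 1, dir.handle * 31 + comp[0] };
        return b;
    }

    static char    cached_path[128];   /* one-entry name cache */
    static binding cached_b;
    static int     cache_valid;

    static binding lookup(const char *path)
    {
        if (cache_valid && strcmp(cached_path, path) == 0)
            return cached_b;                  /* cache hit: no RPCs */

        binding cur = { 0, 1 };               /* start at the root  */
        char buf[128], *save, *comp;
        snprintf(buf, sizeof buf, "%s", path);
        for (comp = strtok_r(buf, "/", &save); comp;
             comp = strtok_r(NULL, "/", &save))
            cur = rpc_lookup(cur, comp);      /* one RPC per component */

        snprintf(cached_path, sizeof cached_path, "%s", path);
        cached_b = cur;
        cache_valid = 1;
        return cur;
    }

    int main(void)
    {
        lookup("a/b/c");   /* 3 RPCs, i.e. 6 messages */
        lookup("a/b/c");   /* 0 RPCs: name-cache hit  */
        return 0;
    }

A real cache would also time out or invalidate entries, which is exactly the "obsolete entries" worry above.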
48. Distributed File Systems
- Stateless vs. stateful
- Should the server keep information between requests from a user, i.e. should the server maintain state?
- What state?
- Recall that open returns an integer called a file descriptor that is subsequently used in read/write.
- With a stateless server, each read/write request must be self-contained, i.e. it cannot refer to the file descriptor (a sketch of such a request follows).
- Why?
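A sketch of what "self-contained" means for a read request (the field names and sizes are illustrative, not any real protocol):

    struct read_request {
        unsigned char file_handle[32]; /* names the file all by itself  */
        unsigned long offset;          /* no server-side file pointer,  */
        unsigned long count;           /* so the client supplies offset */
    };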
49. Distributed File Systems
- Advantages of stateless
- Fault tolerant: no state to be lost in a crash
- No open/close needed (saves messages)
- Saves the space used for tables (state requires storage)
- No limit on the number of open files (no tables to fill up)
- No problem if a client crashes (no state to be confused by)
- Advantages of stateful
- Shorter read/write requests (a descriptor is shorter than a name)
50. Distributed File Systems
- Better performance
- Since we keep track of which files are open, we know to keep those i-nodes in memory
- But stateless could keep a memory cache of i-nodes as well (evicting via LRU instead of on close; not as good)
- Blocks can be read in advance (read ahead)
- Of course stateless can read ahead too.
- The difference is that with stateful we can better decide when accesses are sequential.
- Idempotency is easier (keep sequence numbers)
- File locking is possible (the lock is state)
- Stateless can write a lock file by convention.
- Stateless can call a lock server.
51. Caching
- There are four places to store a file supplied by a file server (these are not mutually exclusive)
- The server's disk
- essentially always done
- The server's main memory
- normally done
- Standard buffer cache
- Clear performance gain
- Little if any semantics problems
52. Caching
- The client's main memory
- Considerable performance gain
- Considerable semantic considerations
- The one we will study
- The client's disk
- Not so common now
- Unit of caching
- File vs. block
- Tradeoff of fewer accesses vs. storage efficiency
53. Caching
- What eviction algorithm?
- Exact LRU is feasible because we can afford the time to do it (via linked lists), since the access rate is low.
- Where in the client's memory to put the cache?
- In the user's process
- The cache will die with the process
- No cache reuse among distinct processes
- Not done for a normal OS
- A big deal in databases
- Cache management is a well-studied DB problem
54. Caching
- In the kernel (i.e. the client's kernel)
- A system call is required even for a cache hit
- Quite common
- In another process
- "Cleaner" than in the kernel
- Easier to debug
- Slower
- Might get paged out by the kernel!
- Cache consistency
- The big question
55. Caching
- Write-through
- All writes are sent to the server (as well as to the client cache)
- Hence it does not lower traffic for writes
- Does not by itself fix values in other caches
- We need to invalidate or update other caches
- Can have the client cache check with the server whenever supplying a block, to ensure that the block is not obsolete
- Hence we still need to reach the server for all accesses, but at least the reads that hit in the cache only need to send a tiny message (a timestamp, not data), as in the sketch below.
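A self-contained sketch of a write-through client cache with the validity check just described (the server_* functions simulate the remote server with in-memory arrays; everything here is illustrative):

    #include <string.h>

    #define BLOCK  512
    #define NBLK   16
    #define NCACHE 8

    /* --- simulated server --- */
    static char disk[NBLK][BLOCK];
    static long stamp[NBLK];            /* last-modified counter */
    static void server_write(long b, const char *d)
                             { memcpy(disk[b], d, BLOCK); stamp[b]++; }
    static long server_stamp(long b) { return stamp[b]; }
    static void server_read(long b, char *d) { memcpy(d, disk[b], BLOCK); }

    /* --- client cache --- */
    struct cblock { int valid; long blockno; long stamp; char data[BLOCK]; };
    static struct cblock cache[NCACHE];

    void cached_write(long b, const char *data)
    {
        struct cblock *c = &cache[b % NCACHE];
        memcpy(c->data, data, BLOCK);           /* update local copy */
        server_write(b, data);                  /* and write through */
        c->valid = 1; c->blockno = b; c->stamp = server_stamp(b);
    }

    void cached_read(long b, char *data)
    {
        struct cblock *c = &cache[b % NCACHE];
        if (c->valid && c->blockno == b &&
            c->stamp == server_stamp(b)) {      /* tiny check: a stamp, */
            memcpy(data, c->data, BLOCK);       /* not the data         */
            return;
        }
        server_read(b, data);                   /* miss or obsolete */
        c->valid = 1; c->blockno = b; c->stamp = server_stamp(b);
        memcpy(c->data, data, BLOCK);
    }

    int main(void)
    {
        char buf[BLOCK] = "hello";
        cached_write(3, buf);
        cached_read(3, buf);   /* hit: only the stamp crosses the wire */
        return 0;
    }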
56. Caching
- Delayed write
- Wait a while (30 seconds is used in some NFS implementations) and then send a bulk write message.
- This is more efficient than a bunch of small write messages.
- If a file is deleted quickly, you might never write it.
- The semantics are now time-dependent (and ugly).
57. Caching
- Write on close
- Session semantics
- Fewer messages, since there are more writes than closes.
- Not beautiful (think of two files simultaneously opened).
- Not much worse than normal (uniprocessor) semantics. The difference is that it appears to be much more likely to hit the bad case.
- Delayed write on close
- Combines the advantages and disadvantages of delayed write and write on close.
58. Caching
- Doing it "right"
- Multiprocessor caching (of central memory) is well studied and many solutions are known.
- Cache consistency (a.k.a. cache coherence)
- The book mentions a centralized solution.
- Others are possible, but none are cheap.
- An interesting thought: IPC is more expensive than a cache invalidate, but disk I/O is much rarer than memory references. Might this balance out, and might one of the cache consistency algorithms perform OK for managing distributed disk caches?
- If so, why is it not used?
- Perhaps NFS is good enough and there is not enough reason to change (NFS predates the cache-coherence work).
59. Replication
- Some issues are similar to (client) caching.
- Why?
- Because whenever you have multiple copies of anything, bells should ring:
- Are they immutable?
- What is the update policy?
- How do you keep the copies consistent?
- Purposes of replication
- Reliability
- A "backup" is available if data is corrupted on one server.
60. Replication
- Availability
- Only need to reach any one of the servers to access the file (at least for queries).
- Not the same as reliability
- Performance
- Each server handles less than the full load (for a query-only system, much less).
- Can use the closest server, lowering network delays.
- Not important for a distributed system on one physical network.
- Very important for web mirror sites.
61. Replication
- Transparency
- If we can't tell that files are replicated, we say the system has replication transparency
- Creation can be completely opaque
- i.e. fully manual
- users use copy commands
- if the directory supports multiple binary names for a single symbolic name, use this when making copies
- presumably subsequent opens will try the binary names in order (so they are not opaque)
62. Replication
- Creation can use lazy replication.
- The user creates the original
- The system later makes copies
- Subsequent opens can be (re)directed at any copy
- Creation can use group communication.
- The user directs requests at a group.
- Hence creation happens to all copies in the group at once.
63. Replication
- Update protocols
- Primary copy
- All updates are done to the primary copy.
- This server writes the update to stable storage and then updates all the other (secondary) copies.
- After a crash, the server looks at stable storage and sees if there are any updates to complete.
- Reads are done from any copy.
- This is good for reads (read any one copy).
- Writes are not so good.
- Can't write if the primary copy is unavailable.
64. Replication
- Semantics
- The update can take a long time (some of the secondaries can be down)
- While the update is in progress, reads are concurrent with it. That is, readers might get the old or the new value depending on which copy they read.
- Voting
- All copies are equal (symmetric)
- To write, you must write at least WQ of the copies (a write quorum). Set the version number of all these copies to 1 + the max of the current version numbers.
- To read, you must read at least RQ copies and use the value with the highest version.
65. Replication
- Require WQ + RQ > the number of copies
- Hence any write quorum and any read quorum intersect.
- Hence the highest version number in any read quorum is the highest version number there is.
- Hence you always read the current version
- Consider the extremes (WQ = N with RQ = 1, and WQ = 1 with RQ = N)
- To write, you must first read all the copies in your WQ to get the version number.
- Must prevent races
- Let N = 2, WQ = 2, RQ = 1. Both copies (A and B) have version number 10.
66. Replication
- Two updates start: U1 wants to write 1234, U2 wants to write 6789.
- Both read the version numbers and add 1 (getting 11).
- U1 writes A and U2 writes B at roughly the same time.
- Later U1 writes B and U2 writes A.
- Now both are at version 11, but A = 6789 and B = 1234.
- Voting with ghosts
- Often reads dominate writes, so we choose RQ = 1 (or at least RQ very small, so WQ very large). A self-contained voting sketch follows below.
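Before turning to ghosts, here is a self-contained sketch of the basic voting scheme described above (copy selection is fixed rather than arbitrary just to keep it short; a real system would pick any WQ or RQ copies, and would also have to prevent the race just shown):

    #include <stdio.h>

    #define N 5
    static int value[N], version[N];

    static void quorum_write(int WQ, int v)     /* write copies 0..WQ-1 */
    {
        int maxver = 0;
        for (int i = 0; i < WQ; i++)            /* first read versions  */
            if (version[i] > maxver) maxver = version[i];
        for (int i = 0; i < WQ; i++) {
            value[i]   = v;                     /* install new value at */
            version[i] = maxver + 1;            /* 1 + max version seen */
        }
    }

    static int quorum_read(int RQ)              /* read copies N-RQ..N-1 */
    {
        int best = 0, bestver = -1;
        for (int i = N - RQ; i < N; i++)
            if (version[i] > bestver) { bestver = version[i]; best = value[i]; }
        return best;                            /* highest-version value */
    }

    int main(void)
    {
        int WQ = 3, RQ = 3;                     /* WQ + RQ = 6 > N = 5 */
        quorum_write(WQ, 1234);
        printf("read: %d\n", quorum_read(RQ));  /* quorums overlap, so */
        return 0;                               /* this prints 1234    */
    }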
67. Replication
- This makes it hard to write. E.g. RQ = 1 means WQ = N, and hence you can't update if any machine is down.
- When one detects that a server is down, a ghost is created.
- A ghost cannot participate in a read quorum, but can in a write quorum
- A write quorum must have at least one non-ghost
- The ghost throws away the value written to it
- A ghost always has version 0
- When the crashed server reboots, it accesses a read quorum to update its value
68. NFS
- NFS: Sun Microsystems's Network File System.
- "Industry standard", the dominant system.
- Machines can be (and often are) both clients and servers.
- The basic idea is that servers export directories and clients mount them.
- When a server exports a directory, the sub-tree rooted there is exported.
- In Unix, exporting is specified in /etc/exports (an example follows below)
69. NFS
- In Unix, mounting is specified in /etc/fstab
- fstab = file system table
- In Unix without NFS, what you mount are file systems.
- Two protocols
- 1. Mounting
- The client sends the server a message containing the pathname (on the server) of the directory it wishes to mount.
- The server returns a handle for the directory
- Subsequent read/write calls use the handle
- The handle has data giving the disk, the i-node, et al.
- The handle is not an index into a table of actively exported directories. Why not?
70. NFS
- Because the table would be state, and NFS is stateless. Mounting can be done at any time; it is often done at client boot time.
- 2. File and directory access
- Most Unix system calls are supported
- Open/close are not supported
- NFS is stateless
- It does have lookup, which returns a file handle. But this handle is not an index into a table. Instead it contains the data needed.
- As indicated previously, the stateless nature of NFS makes Unix locking semantics hard to achieve.
71. NFS
- Authentication
- The client gives the rwx bits to the server.
- How does the server know the client is the machine it claims to be?
- Various cryptographic keys.
- This and other data are stored in NIS (Network Information Service), a.k.a. the yellow pages
- NIS is replicated
- Update the master copy
- The master updates the slaves
- There is a window of inconsistency
72. NFS
- Implementation
- The client's system call layer processes I/O system calls and calls the virtual file system (VFS) layer.
- The VFS has a v-node (virtual i-node) for each open file
- For local files, the v-node points to an i-node in the local OS
- For remote files, the v-node points to an r-node (remote i-node) in the NFS client code.
- Blow by blow
- Mount(remote directory, local directory)
- First the mount program goes to work
- It contacts the server and obtains a handle for the remote directory.
73. NFS
- It makes a mount system call, passing the handle
- Now the kernel takes over
- It makes a v-node for the remote directory
- It asks the client code to construct an r-node
- It has the v-node point to the r-node
- Open system call
- While parsing the name of the file, the kernel (VFS layer) hits the local directory on which the remote one is mounted (this part is similar to ordinary mounts of local file systems).
- The kernel gets the v-node of the remote directory (just as it would get an i-node when processing local files)
74. NFS
- The kernel asks the client code to open the file (given the r-node)
- The client code calls the server code to look up the remaining portion of the filename
- The server does this and returns a handle (but does not keep a record of it). Presumably the server, via the VFS and local OS, does an open, and this data is part of the handle. So the handle gives enough information for the server code to determine the v-node on the server machine.
75. NFS
- When the client gets a handle for the remote file, it makes an r-node for it. This is returned to the VFS layer, which makes a v-node for the newly opened remote file. This v-node points to the r-node. The latter contains the handle information.
- The kernel returns a file descriptor, which points to the v-node.
- Read/write
- The VFS finds the v-node from the file descriptor it is given.
- It realizes the file is remote and asks the client code to do the read/write on the given r-node (pointed to by the v-node).
76. NFS
- The client code gets the handle from its r-node table and contacts the server code.
- The server verifies that the handle is valid (perhaps using authentication) and determines the v-node.
- The VFS (on the server) is called with the v-node, and the read/write is performed by the local (on-server) OS.
- Read-ahead is implemented, but as stated before it is primitive (always read ahead).
- Caching
- Servers cache, but that is not a big deal
- Clients cache
77. NFS
- There are potential problems of course, so:
- Discard cached entries after some seconds
- On open, the server is contacted to see when the file was last modified. If it is newer than the cached version, the cached version is discarded.
- After some seconds, all dirty cache blocks are flushed back to the server.
- All these Band-Aids still do not give proper semantics (or even Unix semantics).
78. NFS
- Lessons learned (from AFS, not covered here, but they apply in some generality)
- Workstations, i.e. clients, have cycles to burn
- So do as much as possible on the client
- Cache whenever possible
- Exploit usage properties
- Several classes of files (e.g. temporary)
- Trades off simplicity for efficiency
- Minimize system-wide knowledge and change
- Helps scalability
- Favors hierarchies
79. NFS
- Trust the fewest possible entities
- Try not to depend on the "kindness of strangers"
- Batch work where possible
80. End of Lecture