Title: Distributed File Systems (DFS)
Distributed File Systems (DFS)
- Problem: facilitate access to remote data
  - Uniform access to data from multiple, network-connected nodes
  - Aggregate the storage offered by multiple nodes
- The DFS is in charge of
  - Organization
  - Retrieval
  - Storage sharing
  - Naming
  - Protection
Distributed File System Goals
- Access transparency
  - Clients are unaware that files are remote
- Location transparency
  - Consistent name space (local and remote)
- Concurrency transparency
  - Modifications are coherent
- Failure transparency
  - Clients and client programs should operate correctly after a server failure
  - One client's failure should not impact the others
- Heterogeneity
  - File service should be provided across different hardware and software platforms
Distributed File System Goals (continued)
- Scalability
  - Scale from a few machines to many (tens of thousands?)
- Replication transparency
  - Clients are unaware of data replication
  - Coherence is maintained
- Migration transparency
  - Files should be able to move around without the clients' knowledge
- Fine-grained distribution of data
  - Locate objects near the processes that use them
A few terms
- File service
  - Specification of what the file system offers to clients
- File
  - Name, data, attributes
- Immutable file
  - Cannot be changed once created
  - Easier to cache and replicate
- Protection
  - Capabilities
  - Access control lists
File service types
- Upload/download model
  - Read file: copy the file from server to client
  - Write file: copy the file from client to server
  - Advantage
    - Simple
  - Problems
    - Wasteful: what if the client needs only a small piece?
    - Problematic: what if the client doesn't have enough space?
    - Consistency: what if others need to modify the same file?
File service types
- Remote access model
  - The file service provides a functional interface
    - create, delete, read bytes, write bytes, etc.
  - Advantages
    - The client gets only what's needed
    - The server can manage a coherent view of the file system
  - Problems
    - Possible server and network congestion
    - Servers are accessed for the duration of the file access
    - The same data may be requested repeatedly
  - (A sketch contrasting the two models follows this list.)
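To make the contrast concrete, here is a minimal C sketch of what the two client-side interfaces might look like. All names and signatures are hypothetical illustrations, not the API of any particular system.

    #include <stddef.h>

    /* Upload/download model: whole-file transfer in each direction. */
    int dfs_download(const char *remote_path, const char *local_path);
    int dfs_upload(const char *local_path, const char *remote_path);

    /* Remote access model: per-operation calls on the server, so the
       client can create or delete files and touch only the bytes it
       needs instead of shipping whole files. */
    typedef int dfs_handle;
    dfs_handle dfs_create(const char *path);
    int dfs_delete(const char *path);
    int dfs_read(dfs_handle h, long offset, void *buf, size_t len);
    int dfs_write(dfs_handle h, long offset, const void *buf, size_t len);

Note how every remote-access call carries an offset and a length: that is what lets the server return only the requested piece, at the price of being contacted for the entire duration of the access.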
File service types
- Data caching model
  - File access becomes local file access; the client caches a local copy
  - Advantage: reduces communication overhead
  - Problem: data consistency
File-Accessing Granularity

Transfer level | Merits | Problems
File | Simple, less communication overhead, immune to server crashes | Client must have large storage space
Block | Less storage space needed at the client | More network traffic/overhead
Byte | Flexibility is maximized | Difficult cache management for variable-length data
Record | Handles structured and indexed files | More network traffic; more overhead to reconstruct a file
File-Sharing Semantics
- Define when modifications of file data made by one user become observable by other users
- Sequential semantics (Unix)
- Session semantics
- Immutable shared-files semantics
- Transaction-like semantics
Sequential Semantics (Unix Semantics)
- A read returns the result of the last write
- Easily achieved if
  - There is only one server
  - Clients do not cache data
- BUT
  - Performance problems if there is no cache
  - We can use write-through caching and deal with obsolete data
    - Must notify clients holding copies
    - Requires extra state, generates extra traffic
Session Semantics
- Relax the rules
  - Changes to an open file are initially visible only to the process (or machine) that modified it
  - The last process to modify the file wins
Session Semantics
(Figure: a timeline of clients A, B, and C opening, appending to, and closing the same file on the server. Each client's appends remain private to its own session until Close(file); the last session to close determines the contents, m, that later opens observe.)
Other solutions
- Make files immutable
  - Aids in replication
  - Does not help with detecting modification
- Or...
- Use atomic transactions
  - Each file access is an atomic transaction
  - If multiple transactions start concurrently, the resulting modifications are serialized
File-Sharing Semantics: Immutable Shared-Files Semantics
(Figure: clients A and B each start a tentative copy based on version 1.0 of a file on the server. One client commits first, creating version 1.1; the other client's commit then raises a version conflict. The outcome depends on the file system: abort is simplest (client A can later decide to overwrite by pointing the corresponding directory entry at its tentative copy), or the two copies can be merged into version 1.2, or the conflict can simply be ignored.)
File usage patterns
- We can't have the best of all worlds
- Where to compromise?
  - Semantics vs. efficiency
  - Efficiency: client performance, network traffic, server load
  - Modified semantics break transparency, reduce functionality, etc.
- To help decide: understand how files are used
  - 1981 study by Satyanarayanan
File usage patterns
- Most files are <10 KB
  - (2005: the average size of the 385,341 files on a typical Mac was 197 KB)
  - (files accessed within 30 days: 147,398 files, average size 56.95 KB)
  - Feasible to transfer entire files (simpler)
  - Still have to support long files
- Most files have short lifetimes
  - Perhaps keep them local
- Few files are shared
  - An overstated problem
  - Session semantics will cause no problem most of the time
Design issues

Namespace: Location transparency
- Is the name of the server known to the client?
  - //server1/dir/file
  - The server can move without the client caring
    - ...if the name stays the same
    - If the file moves to server2, we have problems!
- Location independence
  - Files can be moved without changing the pathname
  - //archive/paul
Namespace: Where do you find the remote files?
- Should all machines have the exact same view of the directory hierarchy?
  - e.g., a global root directory?
    - //server/path
  - or forced remote directories?
    - /remote/server/path
- or...
- Should each machine have its own hierarchy, with remote resources located as needed?
  - /usr/local/games
Access: How do you access files?
- Requirement: access remote files as if they were local
  - The remote FS name space should be syntactically consistent with the local name space
- Option 1: redefine the way all files are named and provide a syntax for specifying remote files
  - e.g., //server/dir/file
  - Can cause legacy applications to fail
- Option 2: use a file-system mounting mechanism
  - Overlay portions of another FS name space over the local name space
Name resolution: how to handle ".."
- Parse
  - (a) one component at a time
  - (b) the entire path at once
- (b) is more efficient, but
  - offers less flexibility (e.g., naming as indirection)
- Perhaps use (a) and cache bindings to increase performance (see the sketch below)
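A minimal C sketch of option (a) with a binding cache. lookup_component() stands in for the round trip to whichever server owns the current directory; it, the node_id type, and the fixed-size cache are all hypothetical simplifications (no eviction, no overflow checks on long paths).

    #include <string.h>

    typedef int node_id;                 /* opaque directory/file id */
    extern node_id lookup_component(node_id dir, const char *name);

    #define CACHE_SIZE 64
    static struct { char prefix[256]; node_id node; } cache[CACHE_SIZE];
    static int cache_used;

    node_id resolve(const char *path, node_id root) {
        char prefix[256] = "";
        char copy[256];
        node_id cur = root;
        strncpy(copy, path, sizeof copy - 1);
        copy[sizeof copy - 1] = '\0';
        for (char *comp = strtok(copy, "/"); comp; comp = strtok(NULL, "/")) {
            strcat(prefix, "/");
            strcat(prefix, comp);
            int hit = 0;
            for (int i = 0; i < cache_used; i++)   /* check binding cache */
                if (strcmp(cache[i].prefix, prefix) == 0) {
                    cur = cache[i].node;
                    hit = 1;
                    break;
                }
            if (!hit) {
                cur = lookup_component(cur, comp); /* one round trip */
                if (cache_used < CACHE_SIZE) {     /* remember the binding */
                    strcpy(cache[cache_used].prefix, prefix);
                    cache[cache_used].node = cur;
                    cache_used++;
                }
            }
        }
        return cur;
    }

Repeated resolutions of the same prefix then skip the per-component round trips, which is how (a) can approach (b)'s efficiency while keeping per-component indirection.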
Stateful or stateless design?
- Stateful: the server maintains client-specific state
  - Shorter requests
  - Better performance in processing requests
  - Cache coherence is possible
  - The server can know who's accessing what
  - File locking is possible
Stateful or stateless design?
- Stateless: the server maintains no information on client accesses
  - Each request must identify the file and offsets (see the request-format sketch below)
  - The server can crash and recover
    - No state to lose
  - The client can crash and recover
  - No open/close calls needed
    - They only establish state
  - No server space used for state
    - No need to worry about supporting many clients (with low activity)
  - Problems with consistency
    - e.g., if the file is deleted on the server
  - File locking is not possible
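The difference shows up directly in the request formats. A sketch in C; the field names are illustrative (loosely NFS-flavored), not a real wire format:

    #include <stdint.h>

    /* Stateless design: every read carries everything the server needs. */
    struct stateless_read_req {
        uint8_t  fhandle[32];  /* self-describing file id, valid across server crashes */
        uint64_t offset;       /* the client tracks its own file position */
        uint32_t count;
    };

    /* Stateful design: the request is shorter because the server
       remembers which file is open and the current offset. */
    struct stateful_read_req {
        int32_t  open_id;      /* index into per-client state on the server */
        uint32_t count;        /* offset is implicit */
    };

If the server crashes, every pending stateless_read_req can simply be retried, while the open_id table behind stateful_read_req is lost and must be rebuilt.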
Caching
- Goal: hide latency to improve performance for repeated accesses
- Four places to keep the data
  - The server's disk
  - The server's buffer cache
  - The client's buffer cache
  - The client's disk
  - (the last two introduce cache-consistency problems!)
Approaches to caching
- Write-through
  - What if another client reads its own cached copy?
  - Consistency
    - All accesses will require checking with the server
    - Or the server maintains state and sends invalidations
  - Performance overheads
- Delayed writes (see the sketch below)
  - Write data can be buffered locally; overwriting cached data produces no additional overhead
  - Decide when to perform the writes (when the cache is full or periodically, and on close)
  - One bulk write is more efficient than lots of little writes
  - Problem: the semantics become ambiguous
Approaches to caching
- Write on close
  - Admit that we have session semantics
- Centralized control
  - Keep track of who has what open on each node
  - A stateful file system with signaling traffic
Striping

Cluster Architecture
(Figure: a cluster node with two processors, memory, two NICs, and a local disk, attached to the interconnect.)
- Each node has its own (small) disk
  - Used to store (i.e., copy) the executables and some data
- For many applications there needs to be a globally visible file system
  - Large shared input/output data files are too big for the local disks
Distributed File System?
- Question: how do we make files visible across a set of machines?
- Answer: use a distributed file system
  - Dedicate one of the nodes to be the server
  - Attach several (large) disks to it
  - e.g., NFS
Distributed File System?
- Question: how do we make files visible across a set of machines?
- Answer: use a distributed file system
  - Use a NAS (Network-Attached Storage)
    - Does the NFS thing in hardware
Distributed File System?
- Advantages
  - Simple and well understood
- Disadvantages
  - The file server can be a bottleneck
    - Especially for a cluster that runs many scientific applications at once
  - The intended usage is that a single process reads/writes a file at a time
    - But parallel applications would most likely prefer doing concurrent reads and concurrent writes
  - Often not built for top performance (NFS)
Parallel File System
- Improves on the drawbacks of distributed file systems
- Multiple disks
  - Each disk has its own I/O channel
  - Disks can be used simultaneously
- I/O is parallel at both ends
  - Multiple processes writing/reading
  - Multiple disks writing/reading
  - Not necessarily in matching numbers
Parallel File System
(Figure: compute nodes connected through the interconnect to I/O nodes with attached disks.)

Parallel File System
(Figure: the same architecture with the I/O nodes' disks placed on a Storage Area Network.)
Parallel File Systems
- A number of commercial parallel file systems exist
  - e.g., IBM's GPFS
- They use disk striping
  - Stripe factor: the number of disks
  - Stripe depth: the size of each block
(Figure: a file's consecutive blocks distributed round-robin across the disks.)
Striping
- Multiple physical disks + separate I/O channels + striping = parallel access to a single file
- Typically implements some form of RAID to combine striping with fault tolerance
  - e.g., RAID 5
- The file system needs to figure out where the blocks are located (see the sketch below)
  - Each I/O node maintains some directory
  - There is a global name service
- Concurrent writes: locking of blocks, not files!
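As a sketch of the block-location computation: with stripe factor N and stripe depth D, block k of a file sits on disk k mod N, and each disk stores every N-th block of the file. The numbers below are illustrative only.

    #include <stdio.h>

    /* Map a byte offset in a striped file to (disk, offset within that
       file's data on the disk). Real systems add directories and RAID
       parity on top of this arithmetic. */
    void locate(long file_offset, int stripe_factor, long stripe_depth,
                int *disk, long *offset_on_disk) {
        long block = file_offset / stripe_depth;
        *disk = (int)(block % stripe_factor);
        /* each disk holds every stripe_factor-th block of the file */
        *offset_on_disk = (block / stripe_factor) * stripe_depth
                          + file_offset % stripe_depth;
    }

    int main(void) {
        int disk;
        long off;
        locate(1000000, 4, 65536, &disk, &off);  /* 4 disks, 64 KB blocks */
        printf("disk %d, offset %ld\n", disk, off);
        return 0;
    }

Because consecutive blocks land on different disks, a single large request keeps all N disks busy at once, which is where the parallel-access speedup comes from.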
Application view: Parallel Applications and I/O
- Option 1: a single node does all the I/O
  - Amdahl's law (where B is the proportion of the program that is sequential) says that if your data is large, forget parallel speedup
- Option 2: before the application runs, split the input data and store it on the nodes' local disks, then gather the output at the end
  - Cumbersome
  - Storage may not be sufficient anyway
- Option 3: do parallel I/O with a parallel file system
  - Allows accessing non-contiguous pieces of data in parallel
    - e.g., interleaved pieces of a matrix for a cyclic data distribution
  - But the UNIX API is not convenient for writing parallel applications that access a parallel file system
    - No complex access patterns
    - No collective I/O
    - Different APIs make code non-portable
  - Solution: use MPI I/O (part of MPI-2)
Simple Example
(Figure: a file divided into equal contiguous chunks, one per process P0 through P4.)

    /* rank, nprocs, bufsize, nints, filesize, and buf declared elsewhere */
    MPI_File fh;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = filesize / nprocs;
    nints = bufsize / sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "/pfs/data",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);  /* each rank reads its own chunk */
    MPI_File_read(fh, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);
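MPI I/O also provides collective versions of these calls. For example, replacing the independent read above with its collective variant lets the MPI library merge the processes' requests into fewer, larger, better-aligned accesses:

    MPI_File_read_all(fh, buf, nints, MPI_INT, &status);  /* collective read */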
Striping Summary (from an application developer's viewpoint)
- If your application is stuck doing I/O for most of its time
  - Buy I/O hardware; do not use NFS but rather some parallel file system
  - Write code using MPI I/O
  - All processes should do the same amount of I/O
  - Make I/O requests as large as possible at a time to benefit from striping
- The performance benefits compared to the naive solution can be orders of magnitude
- Other striping solutions
  - Striping FTP server
Next
- Case study: FreeLoader
- Case study on data access patterns: small worlds and the data-sharing graph
Next classes
- Volunteers: discussion leader for Thursday
- Tuesday: DFS
  - Scale and Performance in a Distributed File System, J. H. Howard et al., ACM Transactions on Computer Systems, Feb. 1988, Vol. 6, No. 1, pp. 51-81
  - The Google File System, Ghemawat et al., SOSP 2003
- Thursday: Data replication
  - Efficient Replica Maintenance for Distributed Storage Systems, Byung-Gon Chun et al., NSDI 2006
  - Drafting Behind Akamai (Travelocity-Based Detouring), Ao-Jan Su et al., SIGCOMM 2006