1
CS519 Fall 2003
  • Distributed File Systems
  • Lecturer: Ricardo Bianchini

2
File Service
  • Implemented by a user/kernel process called file
    server
  • A system may have one or several file servers
    running at the same time
  • Two models for file service
  • upload/download: files move between server and clients; few operations (read file, write file); simple; requires storage at the client; good if the whole file is accessed
  • remote access: files stay at the server; rich interface with many operations; less space needed at the client; efficient for small accesses

3
Directory Service
  • Provides naming usually within a hierarchical
    file system
  • Clients can have the same view (global root
    directory) or different views of the file system
    (remote mounting)
  • Location transparency: the location of the file does not appear in the name of the file
  • e.g., /server1/dir1/file specifies the server but not where the server is located -> the server can move the file in the network without changing the path
  • Location independence: a single name space that looks the same on all machines; files can be moved between servers without changing their names -> difficult

4
Two-Level Naming
  • Symbolic name (external), e.g. prog.c; binary name (internal), e.g. a local i-node number as in Unix
  • Directories provide the translation from symbolic to binary names
  • Binary name formats
  • i-node: no cross references among servers
  • (server, i-node): a directory in one server can refer to a file on a different server
  • Capability: specifies the address of the server, the file number, access permissions, etc.
  • Set of binary names: the binary names refer to the original file and all of its backups

5
File Sharing Semantics
  • UNIX semantics: total ordering of R/W events
  • easy to achieve in a non-distributed system
  • in a distributed system with one server and multiple clients with no caching at the client, total ordering is also easily achieved, since reads and writes are immediately performed at the server
  • Session semantics: writes are guaranteed to become visible only when the file is closed
  • allows caching at the client with lazy updating -> better performance
  • if two or more clients simultaneously write one file, one of them (the last one to close, or a non-deterministically chosen one) replaces the other

6
File Sharing Semantics (cont'd)
  • Immutable files: only create and read-file operations (no writes)
  • writing a file means creating a new one and entering it into the directory, replacing the previous one with the same name; these are atomic operations
  • collision in writing: the last copy wins, or the outcome is non-deterministic
  • what happens if the old copy is being read?
  • Transaction semantics: mutual exclusion on file accesses; either all file operations are completed or none is. Good for banking systems

7
File System Properties
  • Observed in a study by Satyanarayanan (1981)
  • most files are small (< 10 KB)
  • reading is much more frequent than writing
  • most R/W accesses are sequential (random access is rare)
  • most files have a short lifetime -> create the file on the client
  • file sharing is unusual -> caching at the client
  • the average process uses only a few files

8
Server System Structure
  • File service and directory service: combined or not?
  • Cache directory hints at the client to accelerate path-name lookup; the directory and hints must be kept coherent
  • State information about clients at the server
  • stateless server: no client information is kept between requests
  • stateful server: the server maintains state information about clients between requests

9
Stateless vs. Stateful
10
Caching
  • Three possible places: the server's memory, the client's disk, the client's memory
  • Caching in the server's memory avoids disk access but still requires network access
  • Caching on the client's disk (if available): tradeoff between disk access and remote memory access
  • Caching at the client in main memory
  • inside each process's address space: no sharing at the client
  • in the kernel: kernel involvement on hits
  • in a separate user-level cache manager: flexible and efficient if paging can be controlled from user level
  • Server-side caching eliminates the coherence problem. Client-side cache coherence? Next

11
Client Cache Coherence in DFS
  • How to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients
  • Write-through: writes are sent to the server as soon as they are performed at the client -> high traffic; requires cache managers to check (the modification time) with the server before cached content can be provided to any client
  • Delayed write: coalesces multiple writes; better performance but ambiguous semantics
  • Write-on-close: implements session semantics
  • Central control: the file server keeps a directory of open/cached files at clients and sends invalidations -> Unix semantics, but problems with robustness and scalability; also a problem with invalidation messages, because clients did not solicit them
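
To make the write-through case above concrete, here is a minimal Python sketch of a client cache that validates its copy against the server's modification time before serving it. The FileServer and ClientCache classes and their methods are illustrative assumptions, not the API of any real DFS.

```python
class FileServer:
    """Holds the authoritative copy of each file plus a modification counter."""
    def __init__(self):
        self.files = {}        # name -> (data, mtime)
        self._clock = 0

    def write(self, name, data):
        self._clock += 1
        self.files[name] = (data, self._clock)

    def read(self, name):
        return self.files[name]           # (data, mtime)

    def mtime(self, name):
        return self.files[name][1]


class ClientCache:
    """Write-through cache: writes go to the server immediately;
    cached reads are validated against the server's modification time."""
    def __init__(self, server):
        self.server = server
        self.cache = {}        # name -> (data, mtime seen when cached)

    def write(self, name, data):
        self.server.write(name, data)               # write-through
        self.cache[name] = self.server.read(name)

    def read(self, name):
        if name in self.cache:
            data, cached_mtime = self.cache[name]
            if self.server.mtime(name) == cached_mtime:
                return data                          # hit, still valid
        self.cache[name] = self.server.read(name)    # miss or stale: refetch
        return self.cache[name][0]


server = FileServer()
a, b = ClientCache(server), ClientCache(server)
a.write("f", b"v1")
print(b.read("f"))     # b'v1'
a.write("f", b"v2")
print(b.read("f"))     # b'v2' -- b's stale entry is detected and refreshed
```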

12
File Replication
  • Multiple copies are maintained, each copy on a
    separate file server - multiple reasons
  • Increase reliability: the file is accessible even if a server is down
  • Improve scalability: reduce contention by splitting the workload over multiple servers
  • Replication transparency
  • explicit file replication: the programmer controls replication
  • lazy file replication: copies are made by the server in the background
  • group communication: all copies are made at the same time, in the foreground
  • How should replicas be modified? Next

13
Modifying Replicas Voting Protocol
  • Updating all replicas using a coordinator works but is not robust (if the coordinator is down, no updates can be performed) => Voting: updates (and reads) can be performed if some specified number of servers agree
  • Voting protocol
  • A version number (incremented at each write) is associated with each file
  • To perform a read, a client has to assemble a read quorum of Nr servers; similarly, a write quorum of Nw servers for a write
  • If Nr + Nw > N, then any read quorum will contain at least one copy with the most recent version
  • For reading, the client contacts Nr active servers and chooses the copy with the largest version number
  • For writing, the client contacts Nw active servers asking them to write. The write succeeds only if they all say yes
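
A minimal Python sketch of the quorum rule above, assuming Nr + Nw > N (replica objects and in-process "contacting" stand in for real servers and RPCs):

```python
import random


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def quorum_read(replicas, nr):
    """Contact any Nr servers and return the value with the largest version."""
    quorum = random.sample(replicas, nr)
    newest = max(quorum, key=lambda r: r.version)
    return newest.value, newest.version


def quorum_write(replicas, nw, value):
    """Contact any Nw servers; the write succeeds only if all of them accept."""
    quorum = random.sample(replicas, nw)
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:                      # all Nw servers say yes
        r.version, r.value = new_version, value
    return True


N, NR, NW = 5, 2, 4                       # Nr + Nw = 6 > N = 5
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, NW, "v1")
# Because Nr + Nw > N, every read quorum overlaps the write quorum,
# so the read always sees the most recent version.
print(quorum_read(replicas, NR))          # ('v1', 1)
```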

14
Modifying Replicas Voting Protocol
  • Nr is usually small (reads are frequent), but Nw is usually close to N (we want to make sure all replicas are updated). Problem: achieving a write quorum in the presence of server failures
  • Voting with ghosts allows a write quorum to be established when several servers are down, by temporarily creating dummy (ghost) servers (at least one server in the quorum must be real)
  • Ghost servers are not permitted in a read quorum (they don't have any files)
  • When a server comes back up, it must first restore its copy by obtaining a read quorum

15
Network File System (NFSv3)
  • A stateless DFS from Sun; the only state is the map of handles to files
  • An NFS server exports directories
  • Clients access exported directories by mounting them
  • Because NFS is stateless, OPEN and CLOSE RPCs are not provided by the server (they are implemented at the client); clients need to block on close until all dirty data are stored on disk at the server
  • NFS provides file locking (through a separate network lock manager protocol), but UNIX semantics is not achieved due to client caching
  • dirty cache blocks are sent to the server in chunks, every 30 sec or at close
  • a timer is associated with each cache block at the client (3 sec for data blocks, 30 sec for directory blocks). When the timer expires, the entry is discarded (if clean, of course)
  • when a file is opened, the last modification time at the server is checked

16
Recent Research in DFS
  • Petal and Frangipani (DEC SRC): a 2-layer DFS system
  • xFS (Berkeley): a serverless network file system

17
Petal Distributed Virtual Disks
  • A distributed storage system that provides a
    virtual disk abstraction separate from the
    physical resource
  • The virtual disk is globally accessible to all
    Petal clients on the network
  • Virtual disks are implemented on a cluster of
    servers that cooperate to manage a pool of
    physical disks
  • Advantages
  • recover from any single failure
  • transparent reconfiguration and expandability
  • load and capacity balancing
  • low-level service (lower than a DFS) that handles
    distribution problems

18
Petal
19
Virtual to Physical Translation
  • <virtual disk, virtual offset> -> <server, physical disk, physical offset>
  • Three data structures: the virtual disk directory, the global map, and the physical map
  • The virtual disk directory and global map are globally replicated and kept consistent
  • The physical map is local to each server
  • One level of indirection (virtual disk to global map) is necessary to allow transparent reconfiguration. We'll discuss reconfiguration soon

20
Virtual to Physical Translation (contd)
  • The virtual disk directory translates the virtual
    disk identifier into a global map identifier
  • The global map determines the server responsible
    for translating the given offset (a virtual disk
    may be spread over multiple physical disks). The
    global map also specifies the redundancy scheme
    for the virtual disk
  • The physical map at a specific server translates
    the global map identifier and the offset to a
    physical disk and an offset within that disk.
    The physical map is similar to a page table
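
The following Python sketch walks through the three-level translation just described. The data structure contents and the round-robin striping policy are assumptions made for illustration only; Petal's real maps are richer than this.

```python
BLOCK = 64 * 1024                                  # assumed block size

# virtual disk directory: virtual disk id -> global map id
vdir = {"vdisk0": "gmap0"}

# global map: global map id -> servers the virtual disk is spread over
gmap = {"gmap0": ["server0", "server1", "server2"]}

# per-server physical map: (global map id, virtual block) -> (disk, offset)
pmap = {
    "server0": {("gmap0", 0): ("disk0", 0)},
    "server1": {("gmap0", 1): ("disk1", 0)},
    "server2": {("gmap0", 2): ("disk0", 128 * 1024)},
}


def translate(vdisk, voffset):
    """<virtual disk, virtual offset> -> <server, physical disk, physical offset>."""
    gid = vdir[vdisk]                              # level 1: virtual disk directory
    servers = gmap[gid]
    vblock = voffset // BLOCK
    server = servers[vblock % len(servers)]        # level 2: global map picks the server
    disk, base = pmap[server][(gid, vblock)]       # level 3: that server's physical map
    return server, disk, base + voffset % BLOCK


print(translate("vdisk0", 2 * BLOCK + 100))        # ('server2', 'disk0', 131172)
```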

21
Support for Backup
  • Petal simplifies a client's backup procedure by providing a snapshot mechanism
  • Petal generates snapshots of virtual disks using copy-on-write. Creating a snapshot requires pausing the client's application to guarantee consistency
  • A snapshot is a virtual disk that cannot be modified
  • Snapshots require a modification to the translation scheme. The virtual disk directory translates a virtual disk id into a pair <global map id, epoch number>, where the epoch number is incremented at each snapshot
  • At each snapshot, a new tuple with a new epoch number is created in the virtual disk directory. The snapshot keeps the old epoch number
  • All accesses to the virtual disk are made using the new epoch number, so that any write to the original disk creates new entries under the new epoch rather than overwriting the blocks that belong to the snapshot
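
A small Python sketch of copy-on-write snapshots keyed by epoch number, in the spirit of the scheme above (the VirtualDisk class and its keying by (epoch, block) are illustrative assumptions, not Petal's actual layout):

```python
class VirtualDisk:
    def __init__(self):
        self.epoch = 0
        # blocks keyed by (epoch, block): writes go to the current epoch,
        # reads fall back to older epochs if the block was never rewritten
        self.store = {}
        self.snapshots = {}              # snapshot name -> frozen epoch

    def write(self, block, data):
        self.store[(self.epoch, block)] = data

    def read(self, block, epoch=None):
        e = self.epoch if epoch is None else epoch
        while e >= 0:
            if (e, block) in self.store:
                return self.store[(e, block)]
            e -= 1                       # fall back to an older epoch
        return None

    def snapshot(self, name):
        # the snapshot keeps the old epoch; new writes use a new epoch,
        # so they never overwrite blocks belonging to the snapshot
        self.snapshots[name] = self.epoch
        self.epoch += 1


vd = VirtualDisk()
vd.write(0, b"old")
vd.snapshot("backup1")
vd.write(0, b"new")
print(vd.read(0))                                   # b'new'
print(vd.read(0, epoch=vd.snapshots["backup1"]))    # b'old'
```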

22
Virtual Disk Reconfiguration
  • Needed when a new server is added or the
    redundancy scheme is changed
  • Steps to perform it all at once (not incrementally) and in the absence of any other activity
  • create a new global map with the desired redundancy scheme and server mapping
  • change all virtual disk directories to point to the new global map
  • redistribute data to the servers according to the translation specified in the new global map
  • The challenge is to perform it incrementally and
    concurrently with normal client requests

23
Incremental Reconfiguration
  • First two steps as before; step 3 is done in the background, starting with the translations in the most recent epoch that have not yet been moved
  • The old global map is used for read translations that are not found in the new global map
  • A write request only accesses the new global map, to avoid consistency problems
  • Limitation: the mapping of the entire virtual disk must be changed before any data is moved -> lots of new-global-map misses on reads -> high traffic. Solution: relocate only a portion of the virtual disk at a time. Read requests for the portion of the virtual disk being relocated cause misses, but requests to other areas do not

24
Redundancy with Chained Data Placement
  • Petal uses chained-declustering data placement
  • two copies of each data block are stored on
    neighboring servers
  • every pair of neighboring servers has data blocks
    in common
  • if server 1 fails, servers 0 and 2 will share server 1's read load (not server 3)

  server 0   server 1   server 2   server 3
  d0         d1         d2         d3        (primary copies)
  d3         d0         d1         d2        (secondary copies)
  d4         d5         d6         d7        (primary copies)
  d7         d4         d5         d6        (secondary copies)
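
A tiny Python sketch of the placement rule behind the 4-server layout above: block d_i's primary copy lives on server i mod N, and its secondary copy on the next server in the chain.

```python
N = 4                                  # number of servers in the chain


def placement(block):
    primary = block % N
    secondary = (primary + 1) % N      # the neighboring server holds the copy
    return primary, secondary


for d in range(8):
    p, s = placement(d)
    print(f"d{d}: primary=server {p}, secondary=server {s}")
# If server p fails, reads for its primary blocks go to server (p+1) % N,
# while its neighbors' offloading spreads the extra load along the chain.
```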
25
Chained Data Placement (contd)
  • In case of failure, each server can offload some of its original read load to the next/previous server. Offloading can be cascaded across servers to uniformly balance the load
  • Advantage: with simple mirrored redundancy, the failure of a server would result in a 100% load increase on another server
  • Disadvantage: less reliable than simple mirroring - if a server fails, the failure of either one of its two neighboring servers will result in data becoming unavailable
  • In Petal, one copy is called the primary, the other the secondary
  • Read requests can be serviced by either of the two servers, while write requests must always try the primary first to prevent deadlock (blocks are locked before reading or writing, and writes require access to both servers)

26
Read Request
  • The Petal client tries primary or secondary
    server depending on which one has the shorter
    queue length. (Each client maintains a small
    amount of high-level mapping information that is
    used to route requests to the most appropriate
    servers. If a request is sent to an
    inappropriate server, the server returns an error
    code, causing the client to update its hints and
    retry the request)
  • The server that receives the request attempts to
    read the requested data
  • If not successful, the client tries the other
    server
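
A minimal Python sketch of this client-side routing decision: pick the replica with the shorter queue, and retry on the other copy if the first one cannot serve the request. The Server class and its fields are made up for illustration.

```python
class Server:
    def __init__(self, name, blocks, queue=0):
        self.name, self.blocks, self.queue = name, blocks, queue

    def read(self, block):
        return self.blocks.get(block)      # None if this server cannot serve it


def read_block(block, primary, secondary):
    # prefer the replica with the shorter queue (the client's hint)
    first, second = sorted((primary, secondary), key=lambda s: s.queue)
    data = first.read(block)
    if data is not None:
        return first.name, data
    return second.name, second.read(block)  # retry on the other copy


p = Server("primary", {7: b"payload"}, queue=5)
s = Server("secondary", {7: b"payload"}, queue=1)
print(read_block(7, p, s))                  # ('secondary', b'payload')
```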

27
Write Request
  • The Petal client tries the primary server first
  • The primary server marks data busy and sends the
    request to its local copy and the secondary copy
  • When both complete, the busy bit is cleared and
    the operation is acknowledged to the client
  • If not successful, the client tries the secondary
    server
  • If the secondary server detects that the primary
    server is down, it marks the data element as
    stale on stable storage before writing to its
    local disk
  • When the primary server comes up, the primary
    server has to bring all data marked stale
    up-to-date during recovery
  • Similar if secondary server is down
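
A hedged Python sketch of the two-copy write path described above: the primary marks the element busy, updates its local copy and the secondary, and clears the busy bit when both complete; if one replica is unreachable, the surviving one marks the data stale so it can be brought up to date during recovery. Class names, the busy/stale sets, and the simple "up" flag standing in for failure detection are all illustrative assumptions.

```python
class ReplicaServer:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}            # block -> value
        self.busy = set()         # blocks with a write in progress
        self.stale = set()        # blocks the other replica missed

    def local_write(self, block, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[block] = value


def client_write(block, value, primary, secondary):
    if primary.up:
        # normal path: the primary coordinates the two-copy update
        primary.busy.add(block)
        primary.local_write(block, value)
        try:
            secondary.local_write(block, value)
        except ConnectionError:
            primary.stale.add(block)          # secondary missed this update
        primary.busy.discard(block)
    else:
        # primary is down: the secondary marks the data stale before
        # writing, so the primary can catch up during recovery
        secondary.stale.add(block)
        secondary.local_write(block, value)


p, s = ReplicaServer("primary"), ReplicaServer("secondary")
client_write(0, b"v1", p, s)              # both copies updated
p.up = False
client_write(0, b"v2", p, s)              # only the secondary, marked stale
print(s.data[0], s.stale)                 # b'v2' {0}
```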

28
Petal Prototype
29
Petal Performance - Latency
Single client generates requests to random disk
offsets
30
Petal Performance - Throughput
Each of 4 clients makes random requests to a single virtual disk. Failed configuration: one of the 4 servers has crashed
31
Petal Performance - Scalability
32
Frangipani
  • Petal provides a disk interface -> we need a file system
  • Frangipani is a file system designed to take full advantage of Petal
  • Frangipani's main characteristics
  • All users are given a consistent view of the same set of files
  • Servers can be added without changing the configuration of existing servers or interrupting their operation
  • Tolerates and recovers from machine, network, and disk failures
  • Very simple internally: a set of cooperating machines that use a common store and synchronize access to that store with locks

33
Frangipani
  • Petal takes much of the complexity out of
    Frangipani
  • Petal provides highly available storage that can
    scale in throughput and capacity
  • However, Frangipani improves on Petal, since
  • Petal has no provision for sharing the storage
    among multiple clients
  • Applications use a file-based interface rather
    than the disk-like interface provided by Petal
  • Problems with Frangipani on top of Petal
  • Some logging occurs twice (once in Frangipani and
    once in Petal)
  • Cannot use disk location in placing data, because Petal virtualizes the disks
  • Frangipani locks entire files and directories as
    opposed to individual blocks

34
Frangipani Structure
35
Frangipani Disk Layout
  • A Frangipani file system uses only one Petal virtual disk
  • Petal provides 2^64 bytes of virtual disk space
  • Real disk space is committed only when actually used (written)
  • Frangipani breaks the disk into regions
  • 1st region (1 TB): stores configuration parameters and housekeeping info
  • 2nd region (1 TB): stores logs; each Frangipani server uses a portion of this region for its log. There can be up to 256 logs
  • 3rd region (3 TB): holds allocation bitmaps, describing which blocks in the remaining regions are free. Each server locks a different portion
  • 4th region (1 TB): holds i-nodes
  • 5th region (128 TB): holds small data blocks (4 KB each)
  • The remainder of the Petal disk holds large data blocks (1 TB each)

36
Frangipani File Structure
  • The first 16 blocks (64 KB) of a file are stored in small blocks
  • If the file becomes larger, the rest is stored in a single 1 TB large block
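
As a concrete illustration of this layout, the Python sketch below maps a file offset to either one of the sixteen 4 KB small blocks or the single large block. It is pure arithmetic, not Frangipani's actual on-disk data structures.

```python
SMALL_BLOCK = 4 * 1024              # 4 KB small blocks
NUM_SMALL = 16                      # first 64 KB of a file
SMALL_AREA = SMALL_BLOCK * NUM_SMALL


def locate(offset):
    """Return (area, block index within that area, offset within the block/area)."""
    if offset < SMALL_AREA:
        return ("small", offset // SMALL_BLOCK, offset % SMALL_BLOCK)
    return ("large", 0, offset - SMALL_AREA)


print(locate(10_000))       # ('small', 2, 1808)
print(locate(1_000_000))    # ('large', 0, 934464)
```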

37
Frangipani Dealing with Failures
  • Write-ahead redo logging of metadata; user data is not logged
  • Each Frangipani server has its own private log
  • Only after a log record is written to Petal does
    the server modify the actual metadata in its
    permanent locations
  • If a server crashes, the system detects the
    failure and another server uses the log to
    recover
  • Because the log is on Petal, any server can get
    to it.

38
Frangipani Synchronization Coherence
  • Frangipani has a lock for each log segment,
    allocation bitmap segment, and each file
  • Multiple-reader/single-writer locks. In case of
    conflicting requests, the owner of the lock is
    asked to release or downgrade it to remove the
    conflict
  • A read lock allows a server to read data from
    disk and cache it. If server is asked to release
    its read lock, it must invalidate the cache entry
    before complying
  • A write lock allows a server to read or write
    data and cache it. If a server is asked to
    release its write lock, it must write dirty data
    to disk and invalidate the cache entry before
    complying. If a server is asked to downgrade the
    lock, it must write dirty data to disk before
    complying
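
The Python sketch below captures the cache actions tied to these lock transitions: releasing a read lock invalidates the cached copy; releasing a write lock writes dirty data back and invalidates; downgrading a write lock to a read lock only writes dirty data back. The class and method names are illustrative, not Frangipani's internals.

```python
class CachedObject:
    def __init__(self):
        self.lock = None          # None, "read", or "write"
        self.cached = None        # cached copy of the data
        self.dirty = False

    def flush(self):
        if self.dirty:
            print("writing dirty data back to shared storage")
            self.dirty = False

    def release(self):
        if self.lock == "write":
            self.flush()          # dirty data must be made visible first
        self.cached = None        # invalidate the cache entry
        self.lock = None

    def downgrade(self):
        assert self.lock == "write"
        self.flush()              # dirty data written; cache kept for reads
        self.lock = "read"


obj = CachedObject()
obj.lock, obj.cached, obj.dirty = "write", b"contents", True
obj.downgrade()                   # prints the write-back message
print(obj.lock, obj.cached)       # read b'contents'
```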

39
Frangipani Lock Service
  • Fully distributed lock service for fault
    tolerance and scalability
  • How to release locks owned by a failed Frangipani
    server?
  • The failure of a server is discovered when its
    lease expires. A lease is obtained by the
    server when it first contacts the lock service.
    All locks acquired are associated with the lease.
    Each lease has an expiration time (30 seconds)
    after its creation or last renewal. A server
    must renew its lease before it expires
  • When a server fails, the locks that it owns
    cannot be released until its log is processed and
    any pending updates are written to Petal

40
Frangipani Performance
41
Frangipani Performance
42
Frangipani Scalability
43
Frangipani Scalability
44
Frangipani Scalability
45
xFS (Context Motivation)
  • A server-less network file system that works over a cluster of cooperating workstations
  • Moving away from a central FS is motivated by three factors
  • hardware opportunity: fast switched LANs provide aggregate bandwidth that scales with the number of machines in the network
  • user demand is increasing, e.g., for multimedia
  • limitations of the central FS approach
  • limited scalability
  • expensive
  • replication for availability increases complexity and operation latency

46
xFS (Contribution Limitations)
  • A well-engineered approach that takes advantage of several research ideas: RAID, LFS, cooperative caching
  • A truly distributed network file system (no central bottleneck)
  • control processing is distributed across the system at a per-file granularity
  • storage is distributed using software RAID and log-based network striping (Zebra)
  • cooperative caching uses portions of client memory as a large, global file cache
  • Limitation: requires machines to trust each other

47
RAID in xFS
  • RAID partitions a stripe of data into N-1 data
    blocks and a parity block (the exclusive-OR of
    the bits of data blocks)
  • Data and parity blocks are stored on different
    storage servers
  • Provides both high bandwidth and fault tolerance
  • Traditional RAID drawbacks
  • multiple accesses for small writes
  • hardware RAID is expensive (special hardware to compute parity)
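
A short Python sketch of parity as the exclusive-OR of the data blocks in a stripe, and of reconstructing a lost block from the survivors plus parity. This is a generic illustration of the RAID idea, not xFS's actual stripe format.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)


data = [b"AAAA", b"BBBB", b"CCCC"]        # N-1 = 3 data blocks in the stripe
parity = xor_blocks(data)                 # the parity block

# the server holding data[1] fails; rebuild it from the rest plus parity
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])                 # True
```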

48
LFS in xFS
  • High-performance writes: writes are buffered in memory and written to disk in large, contiguous, fixed-size groups called log segments
  • Writes are always appended to the log
  • An imap is used to locate i-nodes; it is kept in memory and periodically checkpointed to disk
  • Simple recovery procedure: get the last checkpoint, then roll forward reading the later segments in the log and updating the imap and i-nodes
  • Free disk space is managed through a log cleaner that coalesces old, partially empty segments into a smaller number of full segments -> the cleaning overhead can sometimes be large

49
Zebra
  • Combines LFS and RAID: LFS's large writes make writes to the network RAID efficient
  • Implements RAID in software
  • Writes are coalesced into a private per-client log
  • Log-based striping
  • a log segment is split into log fragments, which are striped over the storage servers
  • parity fragment computation is local (no network access)
  • Deltas stored in the log encapsulate modifications to file system state that must be performed atomically; they are used for recovery

50
Metadata and Data Distribution
  • A centralized FS stores all data blocks on its
    local disks
  • manages location of metadata
  • maintains a central cache of data blocks in its
    memory
  • manages cache consistency metadata that lists
    which clients in the system are caching each
    block (not NFS)

51
xFS Metadata and Data Distribution
  • Stores data on storage servers
  • Splits metadata management among multiple
    managers that can dynamically alter the mapping
    from a file to its manager
  • Uses cooperative caching that forwards data among
    client caches under the control of the managers
  • The key design challenge: how to locate data and metadata in such a completely distributed system

52
xFS Data Structures
53
Manager Map
  • Allows clients to determine which manager to contact for a file
  • The manager map is globally replicated (it is small)
  • Two translations are necessary to allow manager remapping
  • external file name -> file index number (directory)
  • index number -> manager (manager map)
  • The manager map can also be used for coarse-grained workload balancing among managers
  • The file manager controls disk location metadata (imap and i-node) and cache consistency state (the list of clients caching each block, or which client has ownership for writing)
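
A minimal Python sketch of the two translations above: a directory lookup maps an external file name to an index number, and the globally replicated manager map maps the index number to a manager. The directory contents and the simple modulo mapping are assumptions for illustration; xFS's real manager map is a table indexed by bits of the index number.

```python
directory = {"/home/u/prog.c": 4711}            # file name -> index number

NUM_MANAGERS = 4
manager_map = [f"manager{i}" for i in range(NUM_MANAGERS)]  # replicated at all clients


def manager_for(path):
    index = directory[path]                      # translation 1: directory
    return manager_map[index % NUM_MANAGERS]     # translation 2: manager map


print(manager_for("/home/u/prog.c"))             # manager3
```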

54
Read Operation
55
Write Operation
  • Clients buffer writes in their local memory until they are committed to a stripe group of storage servers
  • Since xFS uses LFS, a write changes the disk address of the modified block
  • After a client commits a segment to a storage server, it notifies the modified blocks' managers so that they update their index nodes and imaps
  • Index nodes and data blocks do not have to be committed simultaneously, because in Zebra the client's log includes a delta that allows reconstruction of the manager's data structures in the event of a crash

56
Cache Consistency
  • Per-block rather than per-file
  • Ownership-based, similar to a DSM scheme
  • To modify a block, a client must get ownership from the manager
  • The manager invalidates any other cached copies
    of the block, then gives write permission
    (ownership) to the client
  • Ownership can be revoked by the manager
  • Manager keeps the list of clients caching each
    block
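
A small Python sketch of this per-block, ownership-based protocol: the manager tracks which clients cache a block, invalidates the other copies before handing out write ownership, and can revoke ownership later. The Manager class and the print-based "invalidation messages" are illustrative stand-ins for real manager state and RPCs.

```python
class Manager:
    def __init__(self):
        self.cachers = {}         # block -> set of clients caching it
        self.owner = {}           # block -> client holding write ownership

    def register_read(self, block, client):
        self.cachers.setdefault(block, set()).add(client)

    def request_ownership(self, block, client):
        # invalidate every other cached copy before granting write access
        for other in self.cachers.get(block, set()) - {client}:
            print(f"invalidate {block} at {other}")
        self.cachers[block] = {client}
        self.owner[block] = client

    def revoke_ownership(self, block):
        self.owner.pop(block, None)


m = Manager()
m.register_read("b0", "client1")
m.register_read("b0", "client2")
m.request_ownership("b0", "client1")   # prints: invalidate b0 at client2
print(m.owner["b0"])                   # client1
```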

57
Log cleaner in xFS
  • Distributed
  • Relies on utilization status, which is also distributed: it is maintained by the client that wrote each segment
  • A leader in each stripe group initiates cleaning and decides which cleaners should clean the stripe group's segments
  • Each cleaner receives a subset of segments to clean
  • Cleaners use optimistic concurrency control to resolve conflicts between cleaner updates and normal writes
  • In case of a conflict (because a client is writing a block as it is being cleaned), the manager ensures that the client's update takes precedence over the cleaner's update

58
xFS Performance
59
xFS Performance