Title: Chapter 6 DFS Design and Implementation
2Outline
- Characteristics of a DFS
- DFS Design and Implementation
- Transaction and Concurrency Control
- Data and File Replication
3Overview
- A file system is responsible for the naming, creation, deletion, retrieval, modification, and protection of files
- DFS (distributed file system): a file system consisting of physically dispersed storage sites but providing a traditional centralized file system view for users
- Transparency
- Directory service (name service)
- Performance and availability → caching and replication → cache coherence and replica management
- Access control and protection
- The unique problems in DFS are due to the need for sharing and replication of files
4Characteristics of A DFS (I)
- Dispersed clients
  - Login transparency: uniform login procedure, uniform view of FS
  - Access transparency: uniform mechanism to access local and remote files
- Dispersed files
  - Location transparency: file names need not contain information about the physical locations of files
  - Location independence (migration transparency): files can be moved from one physical location to another without changing their names
5Characteristics of A DFS (II)
- Multiplicity of users
  - Concurrency transparency: an update to a file by one process should not interfere with other processes sharing this file
  - Support concurrency transparency at the transaction level
  - Interleaved file accesses by several applications
- Multiplicity of files
  - Replication transparency: clients are not aware that more than one copy of the file exists
  - Perform atomic updates on the replicas
- Fault tolerance, scalability, heterogeneity
6DFS Design and Implementation
7File Concept
- The OS abstracts from the physical storage devices to define a logical storage unit: the file
- Types
  - Data: numeric, alphabetic, alphanumeric, binary
  - Program: source and object form
8Logical components of a file
- File name: symbolic name
  - When accessing a file, its symbolic name is mapped to a unique file id (ufid or file handle) that can locate the physical file
  - Mapping is the primary function of the directory service
- File attributes: next slide
- Data units
  - Flat structure of a stream of bytes or a sequence of blocks
  - Hierarchical structure of indexed records
9File Attributes
- File handle: unique ID of the file
- Name: only information kept in human-readable form
- Type: needed for systems that support different types
- Location: pointer to file location on device
- Size: current file size (and the maximum allowable size)
- Protection: controls who can read, write, execute
- Time, date, and user identification: data for protection, security, and usage monitoring
- Information about files is kept in the directory structure, which is maintained on the physical storage device
10Access Methods
- Sequential access: information is processed in order
  - read next
  - write next (append to the end of the file)
  - reset to the beginning of file
  - skip forward or backward n records
- Direct access: a file is made up of fixed-length logical blocks or records
  - read n
  - write n
  - position to n
  - read next
  - write next
  - rewrite n
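The direct access operations listed above can be sketched over an in-memory byte stream. This is only an illustration: the class name `DirectFile`, the record size of 8 bytes, and the zero-byte padding are assumptions, not part of the slides.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record length, in bytes


class DirectFile:
    """Direct access over fixed-length records stored in a seekable stream."""

    def __init__(self, stream):
        self.stream = stream

    def position(self, n):
        # "position to n": move the file pointer to the start of record n.
        self.stream.seek(n * RECORD_SIZE)

    def read_next(self):
        # "read next": read one record at the current pointer.
        return self.stream.read(RECORD_SIZE)

    def write_next(self, record):
        # "write next": pad or truncate to the fixed record length.
        self.stream.write(record.ljust(RECORD_SIZE, b"\0")[:RECORD_SIZE])

    def read(self, n):
        # "read n" is position-to-n followed by read-next.
        self.position(n)
        return self.read_next()

    def write(self, n, record):
        # "write n" / "rewrite n" is position-to-n followed by write-next.
        self.position(n)
        self.write_next(record)


f = DirectFile(io.BytesIO())
for i, rec in enumerate([b"aaaa", b"bbbb", b"cccc"]):
    f.write(i, rec)
second = f.read(1)       # jump straight to record 1
third = f.read_next()    # the pointer now sits on record 2
```

Because every record has the same length, locating record n is a single multiplication, which is what distinguishes direct access from sequential access.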
11Access Methods (Cont.)
- Indexed sequential access
  - Data units are addressed directly by using an index (key) associated with each data block
  - Requires the maintenance of a search index on the file, which must be searched to locate a block address for each access
  - Usually used only by large file systems in mainframe computers
- Indexed sequential access method (ISAM)
  - A two-level scheme to reduce the size of the search index
  - Combines the direct and sequential access methods
12Major Components in A File System
A file system organizes and provides access and
protection services for a collection of files
13Directory Structure
- Access to a file must first use a directory service to locate the file.
- A collection of nodes containing information about all files.
- Both the directory structure and the files reside on disk.
[Figure: a directory whose entries refer to the files F1, F2, F3, F4, ..., Fn]
14Information in a Directory
- Name
- Type: file, directory, symbolic link, special file
- Address: device blocks to store a file
- Current length
- Maximum length
- Date last accessed (for archival)
- Date last updated (for dump)
- Owner ID
- Protection information
15Operations Performed on Directory
- Search for a file
- Create a file
- Delete a file
- List a directory
- Rename a file
- Traverse the file system
Some kind of name service
16Tree-Structured Directories Hierarchical
Structure of A File System
Subdirectory is just a special type of file
17Authorization Service
- File access must be regulated to ensure security
- File owner/creator should be able to control
- what can be done
- by whom
- Types of access
- Read
- Write
- Execute
- Append
- Delete
- List
18File Service File Operations
- Create
  - Allocate space
  - Make an entry in the directory
- Write
  - Search the directory
  - The write takes place at the location of the write pointer
- Read
  - Search the directory
  - The read takes place at the location of the read pointer
- Reposition within file: file seek
  - Set the current file pointer to a given value
- Delete
  - Search the directory
  - Release all file space
- Truncate
  - Reset the file to length zero
- Open(Fi)
  - Search the directory structure
  - Move the content of the directory entry to memory
- Close(Fi)
  - Move the content in memory to the directory structure on disk
- Get/set file attributes
19System Service
- Directory, authorization, and file services are user interfaces to a file system (FS)
- System services are an FS's interface to the hardware and are transparent to users of the FS
  - Mapping of logical to physical block addresses
  - Interfacing to services at the device level for file space allocation/de-allocation
  - Actual read/write file operations
  - Caching for performance enhancement
  - Replicating for reliability improvement
20DFS Architecture NFS Example
21File Mounting
- A useful concept for constructing a large file system from various file servers and storage devices
- Attach a remote named file system to the client's file system hierarchy at the position pointed to by a path name (mounting point)
  - A mounting point is usually a leaf of the directory tree that contains only an empty subdirectory
  - mount claven.lib.nctu.edu.tw/OS /chow/book
- Once files are mounted, they are accessed by using the concatenated logical path names without referencing either the remote hosts or local devices
  - Location transparency
  - The linked information (mount table) is kept until the files are unmounted
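As a rough illustration of how a mount table supports location transparency, the sketch below maps a concatenated logical path to a (host, remote path) pair. The table layout and the function name `resolve` are assumptions made for this example; only the host name and the two paths come from the mount command on this slide.

```python
import posixpath

# Hypothetical mount table: local mounting point -> (remote host, exported path).
MOUNT_TABLE = {
    "/chow/book": ("claven.lib.nctu.edu.tw", "/OS"),
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return (host, path) for a logical path; host is None for local files."""
    path = posixpath.normpath(path)
    # Try the longest mounting point first so that nested mounts resolve correctly.
    for mount_point in sorted(mount_table, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            host, export = mount_table[mount_point]
            remainder = path[len(mount_point):]
            return host, posixpath.normpath(export + remainder)
    return None, path


remote = resolve("/chow/book/DSM")
local = resolve("/chow/paper")
```

The client only ever names `/chow/book/DSM`; the host and the exported directory are looked up in the table, so they never appear in the path the user types.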
22File Mounting Example
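[Figure: the local client's tree (root, chow, with subdirectories paper and book) and the remote server's tree (root, with exported directory OS containing DFS and DSM). The server exports OS and the client mounts it at book, so the remote file /OS/DSM is reached through the local book directory.]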
23File Mounting (Cont.)
- Different clients may perceive a different FS view
  - To achieve a global FS view, the SA enforces mounting rules
- Export: a file server restricts/allows the mounting of all or parts of its file system to a predefined set of hosts
  - The information is kept in the server's export file
- File system mounting
  - Explicit mounting: clients make explicit mounting system calls whenever one is desired
  - Boot mounting: a set of file servers is prescribed and all mountings are performed at the client's boot time
  - Auto-mounting: mounting of the servers is implicitly done on demand when a file is first opened by a client
24Location Transparency
No global naming
25A Simple Automounter for NFS
26Server Registration
- The mounting protocol is not transparent: it requires knowledge of the location of file servers
- When multiple file servers can provide the same file service, the location information becomes irrelevant to the clients
- Server registration → name/address resolution
  - File servers register their services with a registration service, and clients consult the registration server before mounting
  - Clients broadcast mounting requests, and file servers respond to clients' requests
27Stateful and Stateless File Servers
- Stateless file server: when a client sends a request to a server, the server carries out the request, sends the reply, and then removes from its internal tables all information about the request
  - Between requests, no client-specific information is kept on the server
  - Each request must be self-contained: full file name and offset
- Stateful file server: file servers maintain state information about clients between requests
- State information may be kept in servers or clients
  - Opened files and their clients
  - File descriptors and file handles
  - Current file position pointers
  - Mounting information
  - Lock status
  - Session keys
  - Cache or buffer
Session: a connection for a sequence of requests and responses between a client and the file server
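The contrast between the two server styles can be sketched with two toy in-memory servers. The class and method names here are hypothetical and the files are plain byte strings; the point is only where the file position pointer lives.

```python
class StatelessServer:
    """Keeps no per-client state: each request carries file name and offset."""

    def __init__(self, files):
        self.files = files  # file name -> bytes

    def read(self, name, offset, count):
        # Self-contained request: repeating it returns the same bytes.
        return self.files[name][offset:offset + count]


class StatefulServer:
    """Keeps an open-file table with a current position pointer per handle."""

    def __init__(self, files):
        self.files = files
        self.open_table = {}  # handle -> [file name, current position]
        self.next_handle = 0

    def open(self, name):
        handle = self.next_handle
        self.next_handle += 1
        self.open_table[handle] = [name, 0]
        return handle

    def read(self, handle, count):
        name, pos = self.open_table[handle]
        data = self.files[name][pos:pos + count]
        # The server, not the client, remembers where the next read starts.
        self.open_table[handle][1] = pos + len(data)
        return data


files = {"f": b"abcdef"}
stateless = StatelessServer(files)
stateful = StatefulServer(files)
h = stateful.open("f")
```

Repeating `stateless.read("f", 0, 3)` is harmless, whereas repeating `stateful.read(h, 3)` advances the pointer, so a retransmitted request would return different data.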
28A Comparison between Stateless and Stateful
Servers
29Issues of A Stateless File Server
- Idempotency requirement
  - Is it practical to structure all file accesses as idempotent operations?
- File locking mechanism
  - Should the locking mechanism be integrated into the transaction service?
- Session key management
  - Can a one-time session key be used for each file access?
- Cache consistency
  - Is the file server responsible for controlling cache consistency among clients?
  - What sharing semantics are to be supported?
30File Sharing
- Overlapping access: multiple copies of the same file
  - Space multiplexing of the file
  - Cache or replication
  - Coherency control: managing accesses to the replicas to provide a coherent view of the shared file
  - Desirable to guarantee the atomicity of updates (to all copies)
- Interleaving access: multiple granularities of data access operations
  - Time multiplexing of the file
  - Simple read/write, transaction, session
  - Concurrency control: how to prevent one execution sequence from interfering with the others when they are interleaved, and how to avoid inconsistent or erroneous results
31Space Multiplexing
- Remote access: no file data is kept in the client machine. Each access request is transmitted directly to the remote file server through the underlying network.
- Cache access: a small part of the file data is maintained in a local cache. A write operation or cache miss results in a remote access and an update of the cache.
- Download/upload access: the entire file is downloaded for local accesses. A remote access or upload is performed when updating the remote file.
32Remote Access VS Download/Upload Access
Remote Access
Download/Upload Access
33Four Places to Caching
- Client's disk (optional)
- Server's disk
- Client's main memory
- Server's main memory
34Coherency of Replicated Data
- Four interpretations
  - All replicas are identical at all times
    - Impossible in distributed systems
  - Replicas are perceived as identical only at some points in time
    - How to determine the good synchronization points?
  - Users always read the most recent data in the replicas
    - How to define most recent?
    - Based on the completion times of write operations (the effect of a write operation has been reflected in all copies)
  - Write operations are always performed immediately and their results are propagated in a best-effort fashion
    - Coarse attempt to approximate the third definition
35Time Multiplexing
- Simple RW: each read/write operation is an independent request/response access to the file server
- Transaction RW: a sequence of read and write operations is treated as a fundamental unit of file access (to the same file)
  - ACID properties
- Session RW: a sequence of transaction and simple RW operations
36Space and Time Concurrencies of File Access
37Semantics of File Sharing
- On a single processor, when a read follows a write, the value returned by the read is the value just written (Unix semantics).
- In a distributed system with caching, obsolete values may be returned.
The solution to coherency and concurrency control problems depends on the semantics of sharing required by applications
38Semantics of File Sharing (Cont.)
39Version Control
- Version control under immutable files
  - Implemented as a function of the directory service
  - Each file is attached with a version number
  - An open to a file always returns the current version
  - Subsequent read/write operations to the opened file are made only to the local working copy
  - When the file is closed, the local modified version (tentative version) is presented to the version control service
  - If the tentative version is based on the current version, the update is committed and the tentative version becomes the current version with a new version number
  - What if the tentative version is based on an older version?
40Version Control (Cont.)
- Action to be taken if based on an older version
  - Ignore conflict: a new version is created regardless of what has happened (equivalent to session semantics)
  - Resolve version conflict: the modified data in the tentative version are disjoint from those in the new current version
    - Merge the updates in the tentative version with the current version to yield a new version that combines all updates
  - Resolve serializability conflict: the modified data in the tentative version were already modified by the new current version
    - Abort the tentative version and roll back the execution of the client with the new current version as its working version
    - The concurrent updates are serialized in some arbitrary order
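The commit, merge, and abort cases can be sketched with a small version control service. Everything named here (`VersionedFile`, the block-level change map, the returned strings) is an assumption for illustration; the "ignore conflict" policy is not modelled.

```python
class VersionedFile:
    """Toy version control service for one file made of numbered blocks."""

    def __init__(self, blocks):
        self.version = 0
        self.blocks = dict(blocks)
        self.history = []  # (version, set of blocks that version modified)

    def open(self):
        # An open always returns the current version and a working copy.
        return self.version, dict(self.blocks)

    def close(self, base_version, changes):
        """changes: block id -> new value written in the working copy."""
        changed_since = set()
        for version, blocks in self.history:
            if version > base_version:
                changed_since |= blocks
        if changed_since & set(changes):
            return "abort"  # serializability conflict: same data modified
        based_on_current = base_version == self.version
        self.blocks.update(changes)
        self.version += 1
        self.history.append((self.version, set(changes)))
        # Disjoint changes on an older base are merged into a new version.
        return "committed" if based_on_current else "merged"


f = VersionedFile({0: "a", 1: "b"})
base, _copy = f.open()
first = f.close(base, {0: "A"})   # based on the current version
second = f.close(base, {1: "B"})  # older base, disjoint blocks
third = f.close(base, {0: "Z"})   # older base, overlapping block
```

A client whose close returns "abort" would reopen the file and redo its work on the new current version, which is the rollback described above.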
41Transaction and Concurrency Control
- Apply the idea of transaction to distributed file
system management
42The Transaction Model
- Transaction: a fundamental unit of interaction between processes (all-or-nothing)
  - Updating a master tape
  - Withdrawing money from one account and depositing it in another
43The Transaction Model (Cont.)
- Examples of primitives for transactions
- Support by underlying distributed OS or language
runtime system
44The Transaction Model (Cont.)
- Transaction to reserve three flights commits
- Transaction aborts when third flight is
unavailable
45The ACID Properties
- Atomicity (indivisibility)
  - Either all of the operations in a transaction are performed or none of them are, in spite of failures
- Consistency (serializability) (does not violate system invariants)
  - The execution of interleaved transactions is equivalent to a serial execution of the transactions in some order
- Isolation (no interference between transactions)
  - Partial results of an incomplete transaction are not visible to others before the transaction is successfully committed
- Durability
  - The system guarantees that the results of a committed transaction will be made permanent even if a failure occurs after the commitment
46Nested and Distributed Transactions
Durability applies only to the top transaction
47Implementation Private Workspace
During transactions, reads/writes go to a private workspace
Performance issue
- The file index and disk blocks for a three-block file
- The situation after a transaction has modified block 0 and appended block 3
- After committing
48Implementation Write-ahead Log
- Files are actually modified in place, but before any block is changed, a record is written to a log telling which transaction is making the change, which file and block is being changed, and what the old and new values are
- Only after the log has been written successfully is the change made to the file
- The write-ahead log can be used for undo (rollback) and redo
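A minimal in-memory sketch of the write-ahead rule follows. The class `LoggedFile` and its record layout are assumptions for this example, and the log lives in a Python list rather than on stable storage, so it shows the ordering and the undo/redo passes only.

```python
class LoggedFile:
    """In-place updates guarded by a write-ahead log (single file, in memory)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.log = []  # records: (transaction id, block index, old value, new value)

    def write(self, txn, index, value):
        # Write the log record first, and only then change the block in place.
        self.log.append((txn, index, self.blocks[index], value))
        self.blocks[index] = value

    def undo(self, txn):
        # Roll back: restore old values, newest record first.
        for t, index, old, _new in reversed(self.log):
            if t == txn:
                self.blocks[index] = old

    def redo(self, txn):
        # Replay: reapply new values, oldest record first.
        for t, index, _old, new in self.log:
            if t == txn:
                self.blocks[index] = new


f = LoggedFile(["a", "b", "c"])
f.write("t1", 0, "x")
f.write("t1", 0, "y")
f.write("t1", 2, "z")
after_writes = list(f.blocks)
f.undo("t1")
after_undo = list(f.blocks)
f.redo("t1")
after_redo = list(f.blocks)
```

Undo walks the log backwards so that a block written twice ends at its original value; redo walks it forwards so that the last write wins.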
49Transaction Processing System
Execution Phase and Commit Phase
(Transaction ID private workspace)
Satisfy the ACID property
50Execution Phase and Commit Phase
Failures and recovery actions for the 2PC protocol
51Transaction Processing System (Cont.)
- Layered approach for transaction management
- Data (object) manager: performs actual read/write on data
  - Knows nothing about transactions
  - Atomic update; cache and replica management
  - Interface to the FS
- Scheduler: responsible for properly controlling concurrency
  - Determines which transaction is allowed to pass read/write to the DM and at which time
  - Concurrency control protocol (serializable)
  - Enforces isolation and consistency: avoids conflicts
- Transaction manager: guarantees atomicity of transactions
  - All-or-none: two-phase commit
  - Maintains a write-ahead log and a private workspace for each transaction
52Distributed Transaction Processing System
- General organization of managers for handling
distributed transactions.
53Distributed Transaction Processing System (Cont.)
Another view of the previous slide
- A transaction may invoke operations on remote objects (files)
  - The machine that initiates a transaction → coordinator
  - The machine on which a remote object is located → participant
  - Two-phase commit between coordinator and participant
54Distributed Transaction Processing System (Cont.)
- A transaction manager serves as the coordinator, with the remote transaction managers being the participants in the two-phase commit protocol
- Apply the two-phase commit protocol to the atomic update of replicated objects
  - The object (data) manager where an update is requested initiates the two-phase commit protocol in conjunction with other object managers that are holding a replica of the object
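The coordinator/participant exchange can be sketched as follows. This is a failure-free outline under stated assumptions: `Participant`, `prepare`, and `two_phase_commit` are hypothetical names, calls are local rather than messages, and timeouts, logging, and recovery are left out.

```python
class Participant:
    """A participant votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and become ready) or vote no.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    decision = all(votes)
    for p in participants:                       # phase 2: announce decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


group_ok = [Participant("tm1"), Participant("tm2")]
group_bad = [Participant("tm1"), Participant("tm2", can_commit=False)]
ok = two_phase_commit(group_ok)
bad = two_phase_commit(group_bad)
```

The same shape applies to replicated objects: the object manager that receives the update plays the coordinator and the managers holding the other replicas play the participants.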
55Concurrency Control Schedulers Responsibility
- The whole idea behind concurrency control
  - Properly schedule conflicting operations to ensure consistency
  - Conflicting operations: operations on the same data item where at least one of them is a write operation
    - Need to acquire/release locks before/after using the data items
    - Read-write conflict and write-write conflict
- How to discover and handle inconsistency
  - Prevent inconsistency: two-phase locking
    - All access requests are constrained to a certain format such that interference among conflicts can be prevented
    - The TM transforms clients' transactions into the restrictive form
    - Use locks → the scheduler assumes the lock management functions
56Concurrency Control (Cont.)
- How to discover and handle inconsistency (Cont.)
  - Avoid inconsistency: timestamp ordering
    - Each individual access operation is checked by the scheduler, which decides whether the operation should be accepted, tentatively accepted, or rejected to avoid conflicts
    - Schedulers perform the scheduling of operations based on timestamp ordering
  - Validate consistency: optimistic concurrency control protocols
    - Conflicts are completely ignored during the execution phase of a transaction. The consistency is validated at the end of the execution phase. Only transactions that can be globally validated are allowed to commit. Schedulers are validation managers.
57Serializability
- Concurrency control is based on the concept of serializability
- Schedule: the execution order of operations
  - Legal schedule: a schedule that observes the internal ordering of operations for each transaction and in which no transactions hold conflicting locks simultaneously (for locking algorithms)
    - Not all legal schedules yield consistent results or even complete
  - Serial schedule: a special legal schedule that is formed by a strict sequential execution of the transactions in some order
    - Each transaction satisfies ACID
    - Ensures the consistency requirement
- A schedule is serializable if the result of its execution is equivalent to that of a serial schedule (this is what we want)
58Serializability Example
Already committed
(1,3) and (2,4) have write-write conflicts
59Interleaving Schedules
60Serializability (Cont.)
- Updates are made permanent only if the execution of the transactions satisfies the serializability requirement and is successfully committed
- Sufficient condition for serializability
  - If the interleaved execution of transactions is to be equivalent to a serial execution in some order, then all conflicting operations in the interleaved serializable schedule must also be executed in the same order at all object sites
- Chapter 12 presents the serialization graph model to address general serialization problems
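One way to test the sufficient condition is to record, for every pair of conflicting operations, which transaction went first, and then look for a consistent serial order. The sketch below does this with a precedence graph and Python's standard `graphlib` module (Python 3.9 or later). The schedule encoding and the function name `conflict_order` are assumptions, and the two sample schedules are illustrative rather than taken from the tables in the text.

```python
from graphlib import CycleError, TopologicalSorter


def conflict_order(schedule):
    """schedule: list of (transaction, 'r' or 'w', object) in execution order.

    Returns a serial order that agrees with every conflict, or None if the
    conflicting operations are not executed in the same order everywhere.
    """
    preds = {}  # transaction -> set of transactions that must precede it
    for txn, _op, _obj in schedule:
        preds.setdefault(txn, set())
    for i, (ti, op_i, obj_i) in enumerate(schedule):
        for tj, op_j, obj_j in schedule[i + 1:]:
            # Two operations conflict if they touch the same object, belong
            # to different transactions, and at least one is a write.
            if ti != tj and obj_i == obj_j and "w" in (op_i, op_j):
                preds[tj].add(ti)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None


same_order = [("t1", "w", "C"), ("t2", "w", "C"), ("t1", "w", "D"), ("t2", "w", "D")]
mixed_order = [("t1", "w", "C"), ("t2", "w", "C"), ("t2", "w", "D"), ("t1", "w", "D")]
serial = conflict_order(same_order)
none = conflict_order(mixed_order)
```

In the first schedule t1 precedes t2 on both objects, so it is equivalent to running t1 then t2; in the second the two objects disagree about the order, so no equivalent serial schedule exists.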
61Two-Phase Locking
- Using the locking approach, all shared objects in a well-formed transaction must be locked before they can be accessed and must be released before the end of the transaction
- Two-phase locking
  - A new lock cannot be acquired after the first release of a lock
  - Phase 1: growing phase of locking the objects
  - Phase 2: shrinking phase of releasing the objects
- Extreme two-phase locking
  - Get all locks at the beginning of the transaction and release all locks at the same time at the end of the transaction
- Example: Table 6.1
  - 1, 2 are feasible
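The two-phase rule itself is easy to check mechanically. In this sketch a transaction is reduced to its sequence of lock and unlock actions; the function name `is_two_phase` and the tuple encoding are assumptions made for illustration.

```python
def is_two_phase(ops):
    """ops: list of ('lock' | 'unlock', object) for one transaction, in order.

    Returns True if no lock is acquired after the first release.
    """
    shrinking = False
    for action, _obj in ops:
        if action == "unlock":
            shrinking = True       # the growing phase is over
        elif action == "lock" and shrinking:
            return False           # a new lock after a release breaks 2PL
    return True


two_phase = [("lock", "C"), ("lock", "D"), ("unlock", "C"), ("unlock", "D")]
not_two_phase = [("lock", "C"), ("unlock", "C"), ("lock", "D"), ("unlock", "D")]
```

Here `is_two_phase(two_phase)` holds, while `not_two_phase` releases C before locking D and is rejected.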
62Two-Phase Locking (Cont.)
Deadlock may happen (e.g., reversing operations 3 and 4 in t2 of Table 6.1)
The scheduler is responsible for granting and releasing locks in such a way that only valid schedules result (solving operation conflicts)
63Two-Phase Locking (Cont.)
- Strict two-phase locking.
Deadlock may happen
64Two-Phase Locking (Cont.)
- Strict 2PL: only release locks at commit/abort
  - Sacrifices some concurrency but is easy to implement
- Non-strict 2PL: difficult to implement
  - The TM does not know when the last lock has been requested
  - May cause rolling aborts
    - Transaction T1 updates X, then releases X
    - Transaction T2 reads X → the new X value is read
    - T1 aborts → T2 must abort as well
    - T2 has a commit dependence on T1
    - Commit of T2 must be delayed until the commit of T1
65Two-Phase Locking (Cont.)
- Two-phase locking and strict two-phase locking → deadlock
- Two-phase locking in a distributed system
  - Centralized 2PL
    - A single site is responsible for granting/releasing locks
  - Primary 2PL
    - Each data item is assigned a primary copy. The scheduler on the copy's machine is responsible for granting/releasing locks
  - Distributed 2PL
    - Data may be replicated across multiple machines. The scheduler on each machine is responsible for granting/releasing locks and makes sure the operation is forwarded to the local data manager
66Pessimistic Timestamp Ordering
- Basic idea
  - The OM follows transaction timestamp order to perform operations
  - When an operation on a shared object is invoked, the OM records the timestamp of the invoking transaction
  - When a transaction invokes a conflicting operation on the object
    - The transaction has a larger timestamp than the one recorded by the object → proceed (and record the new timestamp)
    - Otherwise → abort
- No deadlock
- Cascading aborts (schedule 5)
  - Tentative write before commit for ensuring isolation
67Pessimistic Timestamp Ordering (Cont.)
- Timestamp ordering with tentative writes: SCH
- Each object is associated with
  - RD: transaction commitment time for the last read
  - WR: transaction commitment time for the last write
  - A list of tentative times (Ts) for the pending transactions with a write operation to the object
  - Tmin: the minimum of Ts
68Pessimistic Timestamp Ordering (Cont.)
- Concurrency control using timestamps.
Wait until T3 commit/abort
And Restart
Different from Figure 6.7
69Pessimistic Timestamp Ordering (Cont.)
Execution phase: enforce or resolve consistency
Commit phase: enforce atomicity
Execution Phase
Commit Phase
70Pessimistic Timestamp Ordering (Cont.)
- Read (with transaction timestamp T)
  - T < WR → abort (to maintain increasing timestamp order)
  - WR < T < Tmin → allowed to proceed (before any pending write)
    - The read result is put into the TM's work space and returned to the client
  - Tmin < T → put in the tentative list and wait for the preceding writes to finish (commit or abort) (there are already tentative writes)
- Write
  - T > RD and T > WR → put into the tentative list
    - Inform the TM of the success or failure of the tentative write operation
  - Otherwise → abort
Enforce or resolve consistency in the execution phase
71Pessimistic Timestamp Ordering (Cont.)
- Abort
  - Read → simply discard the waiting read
  - Write → remove from the tentative list
    - If a waiting read reaches the head of the list → perform the read
- Commit → the successful completion of the atomic commit in the TM
  - Transaction waiting to read → never happens (blocked)
  - Transaction with only completed read operations → update the object's RD (the larger of the transaction timestamp and the object's current RD)
  - Transaction with tentative write
    - SCH aborts all pending transactions (both waiting reads and tentative writes) ahead of the committed transaction
    - Make the update permanent
    - Remove the write from the tentative list (may allow a waiting read to proceed)
    - Replicas exist → call the replication manager
Enforce atomicity in abort/commit
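The read, write, and commit rules for one object can be sketched as below. Assumptions are flagged in the comments: timestamps are unique numbers, a transaction does not read its own tentative write, waiting reads are only reported (not queued), and the names `TimestampedObject`, `commit_read`, and `commit_write` are invented for this example.

```python
class TimestampedObject:
    """Scheduler bookkeeping for one object: RD, WR, and the tentative list."""

    def __init__(self):
        self.rd = 0          # RD: commitment time of the last read
        self.wr = 0          # WR: commitment time of the last write
        self.tentative = []  # Ts: timestamps of pending tentative writes

    @property
    def tmin(self):
        # Tmin: the minimum of Ts (infinity when nothing is pending).
        return min(self.tentative) if self.tentative else float("inf")

    def read(self, t):
        if t < self.wr:
            return "abort"    # T < WR
        if t < self.tmin:
            return "proceed"  # WR < T < Tmin
        return "wait"         # Tmin < T: wait for the preceding writes

    def write(self, t):
        if t > self.rd and t > self.wr:
            self.tentative.append(t)
            return "tentative"
        return "abort"

    def commit_read(self, t):
        self.rd = max(self.rd, t)

    def commit_write(self, t):
        # Pending writes ahead of (older than) the committed one are aborted.
        aborted = [x for x in self.tentative if x < t]
        self.tentative = [x for x in self.tentative if x > t]
        self.wr = t
        return aborted


obj = TimestampedObject()
w1 = obj.write(1)              # t1 writes tentatively
w2 = obj.write(2)              # t2 writes tentatively
r3 = obj.read(3)               # t3 must wait behind the pending writes
aborted = obj.commit_write(2)  # t2 commits first, so t1 has to abort
```

After the commit, a late read with timestamp 1 is rejected because WR is now 2, while the read with timestamp 3 may proceed.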
72Pessimistic Timestamp Ordering (Cont.)
Allow more transactions to proceed freely, but
with more aborts, since conflicts sometimes
occur and need to be resolved
Waiting reads and tentative writes abort! (to
maintain the consistency)
Tmin
Commit for a transaction with tentative write
commit
73Example
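After operation 2 in t1 and operation 4 in t2, there may be more unrelated operations
Sched 1: 1, 2, 3, 4
[Figure: object state as operations 1/2 and then 3/4 execute: RD = WR = t0; RD = WR = t0, Tmin = t1; RD = WR = t0, Tmin = t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!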
74Example (Cont.)
Sched 3: 3, 1, 4, 2
[Figure: object state as operations 3/4 and 1/2 execute: RD = WR = t0; RD = WR = t0, Tmin = t2; RD = WR = t0, Tmin = t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!
75Example (Cont.)
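Sched 5: 1, 3, 4, 2
[Figure: state of object C under operations 1 and 3: RD = WR = t0; RD = WR = t0, Tmin = t1; RD = WR = t0, Tmin = t1, t2]
[Figure: state of object D under operations 4 and 2: RD = WR = t0; RD = WR = t0, Tmin = t2; RD = WR = t0, Tmin = t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!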
76Example (Cont.)
t1
t2
t3
Waiting read and pending write for t1
Schedule 3
XX1
X0
RD, WR
Tmint1
t2
RD, WR
Tmint1
t2
RD, WR
Tmint1
t2
t3
Order of commit
XX2XX3
RD, WR
Tmint1
t2
t3
Waiting read and pending write for t2 and t3
77Example (Cont.)
Schedule 4 x0 (t2), x0 (t3), xx3 commit,
x0 (t1)
Waiting read and pending write for t3 (xx3)
t2 aborts and restarts
t3 commits
RD, WR
Tmint2
t3
RDWRt3
X0
RDWRt3
t1
t1 aborts and restarts
78Optimistic Timestamp Ordering
- Optimistic timestamp ordering
  - Just go ahead and do whatever you want to without paying attention to what anybody else is doing
  - The system keeps track of which data items have been read and written
  - At the point of committing
    - Check all other transactions to see if any of its items have been changed since the transaction started (validation)
    - Yes → abort
    - No → commit
- Transactions use a private workspace to store shadow copies of data
- Deadlock free and maximum parallelism
- If a transaction fails → restart and run again (not good for heavy load)
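The execute-then-validate idea can be sketched for a single site. This is a simplification under stated assumptions: a logical counter stands in for real timestamps, validation compares each touched object's last write time with the transaction's start time, and no two-phase protocol between sites is involved. All names (`OptimisticStore`, `begin`, `commit`) are hypothetical.

```python
class OptimisticStore:
    """Single-site optimistic concurrency control with private workspaces."""

    def __init__(self):
        self.clock = 0  # logical clock used for all timestamps
        self.data = {}
        self.wr = {}    # object -> commit time of its last write (version)

    def tick(self):
        self.clock += 1
        return self.clock

    def begin(self):
        # Each transaction gets a start timestamp and a private workspace.
        return {"ts": self.tick(), "reads": set(), "workspace": {}}

    def read(self, txn, key):
        txn["reads"].add(key)
        if key in txn["workspace"]:
            return txn["workspace"][key]
        return self.data.get(key)

    def write(self, txn, key, value):
        txn["workspace"][key] = value  # shadow copy only; nothing is shared yet

    def commit(self, txn):
        touched = txn["reads"] | set(txn["workspace"])
        # Validation: abort if any touched item changed since the start.
        if any(self.wr.get(k, 0) > txn["ts"] for k in touched):
            return False
        commit_time = self.tick()
        for key, value in txn["workspace"].items():
            self.data[key] = value      # update phase: make changes permanent
            self.wr[key] = commit_time
        return True


store = OptimisticStore()
a = store.begin()
b = store.begin()
store.write(a, "x", 1)
store.write(b, "x", 2)
a_ok = store.commit(a)  # nothing changed since a started
b_ok = store.commit(b)  # x was rewritten after b started
```

No transaction ever waits, so there is no deadlock, but `b` has to restart and rerun, which is why the approach suits light contention better than heavy load.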
79Optimistic Timestamp Ordering (Cont.)
- A transaction consists of three phases
  - Execution phase
    - Just go ahead without paying attention to other transactions
    - Needs private work space for shadow copies of shared objects
  - Validation phase
    - Use a two-phase commit protocol to globally validate
    - Once a transaction is validated, it is guaranteed to be committed
    - All commitments must follow the order of validation time
    - Must be atomic
  - Update phase
    - Make changes permanent in the persistent memory
80Optimistic Timestamp Ordering (Cont.)
- Each transaction ti
  - TSi: timestamp of the start time of its execution phase
  - TVi: timestamp of the start time of its validation phase
  - Ri: the set of data objects read by ti (read set)
  - Wi: the set of data objects written by ti (write set)
- Each object Oj
  - RDj: commitment time for the last read operation
  - WRj: commitment time for the last write operation
    - Also called the version number of Oj
- The transactions are to be serialized w.r.t. the timestamps TV of the validated transactions
81Execution Phase
- Begins at a TM when it receives a begin transaction from the client
- Private work space is created and maintained by the TM
  - Shadow copies of object version numbers
  - Similar to the session semantics of files
- Abort: delete the transaction and its work space
- End transaction
  - Request for commit
  - Move to the validation phase
82Validation Phase
- The TM validates mutual consistency between the requested transaction and other distributed transactions to ensure serializability
- Initiate the two-phase validation protocol, as coordinator
  - Ri, Wi, and TVi are sent to all participating TMs for validation
  - A participant can respond with positive validation to more than one request
- Each TM has knowledge of all outstanding transactions tk at its local site
- Validation of mutual consistency between ti and tk
  - TVi must be greater than TVk and tk must be completed before ti
  - If both are validated → check Wk of tk for conflict
- An accepted validation carries the current version number of the shared remote object. It is compared with TVi for work-space consistency. All WRs must be smaller than TVi
83Validation Phase (Cont.)
All commitments follow the order of validation time
The update of the commitment must also be atomic
Two-phase commit for validation
84Optimistic Timestamp Ordering (Cont.)
85Optimistic Timestamp Ordering (Cont.)
[Figure: timelines of Ti (execution, validation at TVi) and Tk (execution, validation at TVk, update), with two cases marked: (1) violates the ordering of validation time; (2) accept, already serialized]
86Update Phase
- The transaction moves to the update phase once it has gathered an accepted validation from all participating TMs and the state of the work space is consistent
- An accepted validation is equivalent to a tentative pre-write in the timestamp ordering approach
- The update phase is similar to the commit phase in timestamp ordering
  - Except that a tentative write can be aborted, while a validation cannot be denied once it is given
  - Updates must be committed in TV order for the validated transactions
87Data and File Replication
88Overview
- Advantages of data and file replication
  - Parallelism transparency: higher performance (concurrent access to replicas)
  - Failure transparency: higher availability (redundant replicas)
- Necessity
  - Replication transparency: not aware of the existence of replicas
  - Concurrency transparency: no interference among sharing clients
  - Atomic update: updates to all replicas must be atomic
- One-copy serializability
  - atomic transaction
  - atomic update of replicas
89Overview (Cont.)
- Atomic multicast: multicast messages are reliably delivered to all non-faulty group members and the order of message delivery must obey a total ordering
- Atomic transaction: operations in every transaction are performed on an all-or-none basis and conflicting operations among concurrent transactions are executed in the same order (serialized)
- Atomic update: updates are propagated to all replicated objects and are serialized
90Overview (Cont.)
- Similarity between atomic multicast, transaction, update
  - An atomic multicast is a special transaction where every message representing an operation conflicts with every other
    - Order of message delivery
  - Atomic update is very much like a transaction where every update is a conflicting operation
- Atomic update is less stringent in its consistency requirement
  - Failures of replicas may be allowed, as long as at least one copy is available
  - A client may not be interested in the global coherency of the replicas, as long as it can read the most recently written data
91Architecture for Management of Replicas
92Options for Read/Write
- READ
  - Read-one-primary: read from a primary RM (consistency)
  - Read-one: read from any RM (concurrency)
  - Read-quorum: read from a quorum of RMs (currency)
- WRITE
  - Write-one-primary: write to one primary replica
    - The primary RM propagates the updates to all other RMs
  - Write-all: atomic updates to all RMs (subsequent writes must wait)
  - Write-all-available: atomic updates to all available (non-faulty) RMs
    - Failure recovery
  - Write-quorum: atomic updates to a quorum of RMs
  - Write-gossip: updates go to any RM and are lazily propagated to others
93One-Copy Serializability
- The execution of transactions on replicated objects is equivalent to the execution of the same transactions on non-replicated objects
- Read-one-primary/write-one-primary: no replication issue
  - Serialized by primary RM
  - Secondary RMs only for redundancy
- Read-one/write-all
  - Consistency: two-phase locking or timestamp ordering protocols
  - Read/write operations are sub-transactions
- Read-one/write-all-available
  - Failure may cause problems with one-copy serializability (Chap. 12)
- Read-quorum/write-quorum
  - Conflicts can be preserved if the read set of replicas of one transaction overlaps with the write set of another transaction
94One-Copy Serializability (Cont.)
Serial schedule is the only correct execution
Either t1 reads X written by t2, or t2 reads Y
written by t1
Neither t1 nor t2 sees the object written by the other
Failures and recoveries must also be serialized w.r.t. transactions (a failure should appear before the start of a transaction; otherwise, abort)
The failure of Yd in t1 should force t2 to abort
and rollback
95Quorum Voting
Conflict: two-phase locking or timestamp ordering
Witnesses carry only the necessary information (file version and ID)
96Quorum Voting (Cont.)
- Three examples of the voting algorithm
- A correct choice of read and write set
- A choice that may lead to write-write conflicts
- A correct choice, known as ROWA (read one, write
all)
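The three cases above can be checked mechanically. The sketch below assumes the usual quorum-voting conditions, which the slide does not spell out: with V total votes, a read quorum R and a write quorum W, R + W > V guarantees that every read quorum overlaps every write quorum, and 2W > V guarantees that any two write quorums overlap. The function names and the 12-replica configurations are hypothetical, chosen only for illustration.

```python
def check_quorum(total_votes: int, read_quorum: int, write_quorum: int) -> dict:
    """Evaluate the two usual quorum-overlap conditions (assumed, not from the slide)."""
    return {
        # Every read quorum shares at least one replica with every write quorum.
        "read_write_overlap": read_quorum + write_quorum > total_votes,
        # Any two write quorums share a replica, so two writes cannot both succeed unseen.
        "write_write_overlap": 2 * write_quorum > total_votes,
    }


def is_correct_choice(total_votes: int, read_quorum: int, write_quorum: int) -> bool:
    """A choice is correct only if both overlap conditions hold."""
    return all(check_quorum(total_votes, read_quorum, write_quorum).values())


# Hypothetical configurations: 12 replicas, one vote each.
choices = {"balanced": (3, 10), "conflicting": (7, 6), "ROWA": (1, 12)}
for name, (r, w) in choices.items():
    print(name, check_quorum(12, r, w))
```

In this sketch ROWA is simply the extreme point where the read quorum is one vote and the write quorum is all votes.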
97Gossip Update Propagation
- If updates are less frequent than reads and ordering of updates can be relaxed, updates can be propagated lazily among replicas
- Read-one/write-gossip
- Support high availability in an environment where failures of replicas are likely and reliable multicast of updates is impractical
- Idea
- Both read and update operations are directed to any RM
- RMs bring their data up to date by exchanging gossip information
98A Gossip Architecture
- Basic gossip protocol
- Read/overwrite: no definitive group
- Causal order gossip protocol
- Read/modify-update: definitive group
99Basic Gossip Protocol
- Timestamps: scalar or vector
- TSf of FSA: timestamp of the last successful access operation
- TSi of RM i: timestamp of the last update of the data object
- Read
- TSf < TSi → RM has more recent data
- Value returned
- Assign TSi to TSf
- TSf > TSi → RM has out-of-date data
- Wait or contact other RMs
- Update: FSA increments TSf
- TSf > TSi → execute update
- Assign TSf to TSi
- TSf < TSi → comes too late
- Overwrite?
- Read → overwrite
- Gossip: RMj → RMi
- Accept if TSj > TSi
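The rules above can be sketched with scalar timestamps as follows. This is a minimal illustration, not a definitive implementation: the class and method names are hypothetical, a read with TSf equal to TSi is treated as up to date (an assumption, since the slide only covers the strict cases), and a late update is simply rejected rather than overwritten.

```python
from dataclasses import dataclass


@dataclass
class FileServiceAgent:
    ts: int = 0  # TSf: timestamp of the last successful access


@dataclass
class ReplicaManager:
    ts: int = 0          # TSi: timestamp of the last update applied at this RM
    value: object = None

    def read(self, fsa: FileServiceAgent):
        # Assumption: TSf == TSi also counts as up to date.
        if fsa.ts <= self.ts:
            fsa.ts = self.ts              # assign TSi to TSf
            return True, self.value
        return False, None                # RM is out of date: wait or contact another RM

    def update(self, fsa: FileServiceAgent, value) -> bool:
        fsa.ts += 1                       # FSA increments TSf
        if fsa.ts > self.ts:
            self.value, self.ts = value, fsa.ts   # execute update, assign TSf to TSi
            return True
        return False                      # update comes too late (rejected in this sketch)

    def gossip_from(self, other: "ReplicaManager") -> bool:
        # Accept gossip from RMj only if TSj > TSi.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts
            return True
        return False


fsa1, fsa2 = FileServiceAgent(), FileServiceAgent()
rm1, rm2 = ReplicaManager(), ReplicaManager()

w1 = rm1.update(fsa1, "a")        # write at RM1
w2 = rm2.update(fsa1, "b")        # write at RM2
g_rejected = rm2.gossip_from(rm1) # RM1 is older than RM2
g_accepted = rm1.gossip_from(rm2) # RM2 is newer than RM1
r1 = rm2.read(fsa2)               # FSA 2 reads from RM2
w3 = rm2.update(fsa2, "c")        # FSA 2 writes at RM2
r2 = rm1.read(fsa2)               # FSA 2 now knows more than RM1
```

The last read fails because FSA 2 has already seen a newer update than RM1 holds, which is exactly the case where the client waits or contacts another RM.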
100Basic Gossip Update Example
1. Write OK: FSA 1 has TSf1 = 0 and sets TSf1 = 1; RM1 has TS1 = 0 and sets TS1 = 1
2. Write OK: FSA 1 has TSf1 = 1 and sets TSf1 = 2; RM2 has TS2 = 0 and sets TS2 = 2
3. Gossip reject
4. Gossip accept: RM1 has TS1 = 1 and sets TS1 = 2
5. Read OK: FSA 2 has TSf2 = 0; RM2 has TS2 = 2
6. Write OK: FSA 2 has TSf2 = 2 and sets TSf2 = 3; RM2 sets TS2 = 3
7. Read reject: FSA 2 has TSf2 = 3; RM1 has TS1 = 2
101Causal Order Gossip
- Read/modify-update with definitive group
- Ex. multiplied by 2, incremented by 1 → order is important
- Vector timestamps maintained at each RM
- V (VAL): timestamp of the current value of the object
- R (WORK): knowledge of update requests in the RM group
- How much work is still to be done
- Obtained through gossip by merging (pairwise maximum) Rs
- Update log: u, r, other information
- u (DEP): timestamp issued by the FSA for an update operation
- r (ts): identifier of the update operation
- r for RMi is obtained by taking the corresponding u and replacing the ith component of u with the ith component of R
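The vector-timestamp arithmetic described above can be sketched as three small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the RM is assumed to increment its own component of R when it receives an update request, before deriving r.

```python
def merge(a: tuple, b: tuple) -> tuple:
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a: tuple, b: tuple) -> bool:
    """Partial order on vector timestamps: a <= b in every component."""
    return all(x <= y for x, y in zip(a, b))


def record_update(R: tuple, u: tuple, i: int):
    """Return (new_R, r) for an update with FSA timestamp u arriving at RM i.

    Assumption: RM i counts the request by incrementing R[i] first.
    r is u with its ith component replaced by the ith component of R.
    """
    new_R = list(R)
    new_R[i] += 1
    r = list(u)
    r[i] = new_R[i]
    return tuple(new_R), tuple(r)


# An update with u = (0, 0, 0) arriving at the first RM of a group of three.
R1, r1 = record_update((0, 0, 0), (0, 0, 0), 0)
# A later update with u = (1, 0, 0) arriving at the second RM.
R2, r2 = record_update((0, 0, 0), (1, 0, 0), 1)
```

Because r differs from u only in the receiving RM's own component, it still carries the dependencies the FSA had seen while uniquely identifying the update within the group.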
102Causally-Consistent Lazy Replication
103Processing Read Operations
Wait for RM to become up to date (i.e. if DEP(R) ≤ VAL(i))
- VAL: V in textbook
- WORK: R (how much more work)
FSA
104Processing Write Operations
5. Stable (if many, execute in causal order)
Reject if DEP < VAL
Merge VAL and ts
- VAL: V in textbook
- WORK: R (how much more work)
105Gossip
- A gossip message from RMj to RMi carries RMj's vector timestamp Rj and log Lj
- Rj is merged with Ri
- Li is joined with Lj except for those update records with r ≤ Vi
- These have already been accounted for by RMi
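The gossip step can be sketched together with the stability check that follows it. This is a minimal sketch with hypothetical names, assuming an update is stable once its u is covered by V, that the log is kept after execution, and that each RM increments its own component of R on receiving an update. The usage lines run a two-update scenario in a group of three RMs.

```python
def merge(a, b):
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """a <= b in every component."""
    return all(x <= y for x, y in zip(a, b))


class ReplicaManager:
    def __init__(self, index, size):
        self.i = index
        self.V = (0,) * size      # VAL: timestamp of the current value
        self.R = (0,) * size      # WORK: known update requests
        self.log = {}             # r -> (u, operation)
        self.executed = set()     # identifiers r already applied here
        self.value = 0

    def update(self, u, operation):
        R = list(self.R)
        R[self.i] += 1            # assumption: count the request locally
        self.R = tuple(R)
        r = list(u)
        r[self.i] = self.R[self.i]
        r = tuple(r)
        self.log[r] = (u, operation)
        self.apply_stable()
        return r

    def receive_gossip(self, Rj, Lj):
        self.R = merge(self.R, Rj)
        for r, record in Lj.items():
            if not leq(r, self.V):            # skip records already accounted for
                self.log.setdefault(r, record)
        self.apply_stable()

    def apply_stable(self):
        progress = True
        while progress:                       # repeat until no update becomes stable
            progress = False
            for r, (u, operation) in list(self.log.items()):
                if r not in self.executed and leq(u, self.V):
                    self.value = operation(self.value)
                    self.V = merge(self.V, r)
                    self.executed.add(r)
                    progress = True


rm1, rm2 = ReplicaManager(0, 3), ReplicaManager(1, 3)
rm1.update((0, 0, 0), lambda x: x + 1)   # executed at once: u <= V
rm2.update((1, 0, 0), lambda x: x * 2)   # held: depends on an update RM2 has not seen
held_value = rm2.value
rm2.receive_gossip(rm1.R, rm1.log)       # RM2 catches up, then applies both in causal order
```

After the gossip, RM2 applies the increment before the doubling, so the held update never runs against a stale value.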
106Example
107Example (Cont.)
Write: FSA 1 sends u = 000
RM1: V = 000, R = 000, update log: empty
RM1: V = 000, R = 100, update log: u1 = 000, r1 = 100
Because u ≤ V, the operation is executed; V is advanced by merging with r
RM1: V = 100, R = 100, update log: u1 = 000, r1 = 100
The information has to propagate to other RMs
108Example (Cont.)
Read: FSA 2 reads from RM1 with TSf = 000
RM1: V = 100, R = 100, update log: u = 000, r = 100
TSf < V → value at RM1 returned → TSf is updated to 100 (merge)
Write: FSA 2 sends u = 100
RM2: V = 000, R = 000, update log: empty
RM2: V = 000, R = 010, update log: u2 = 100, r2 = 110
u2 > V → RM2 does not have the most up-to-date value
109Example (Cont.)
RM1: V = 100, R = 100, update log: u1 = 000, r1 = 100
RM2: V = 000, R = 010, update log: u2 = 100, r2 = 110
Gossip
RM2: V = 000, R = 110, update log: u2 = 100, r2 = 110; u1 = 000, r1 = 100
Execute u1 (u1 ≤ V) (stable)
RM2: V = 100, R = 110, update log: u2 = 100, r2 = 110; u1 = 000, r1 = 100
Execute u2 (u2 ≤ V) (stable)
RM2: V = 110, R = 110, update log: u2 = 100, r2 = 110; u1 = 000, r1 = 100
110Example (Cont.)
Read: FSA 3 reads from RM1 with TSf = 000
RM1: V = 100, R = 100, update log: u = 000, r = 100
TSf < V → value at RM1 returned → TSf is updated to 100 (merge)
Write: FSA 3 sends u = 100
RM3: V = 000, R = 000, update log: empty
RM3: V = 000, R = 001, update log: u3 = 100, r3 = 101
u3 > V → RM3 does not have the most up-to-date value
111Example (Cont.)
RM2: V = 110, R = 110, update log: u2 = 100, r2 = 110; u1 = 000, r1 = 100
RM3: V = 000, R = 001, update log: u3 = 100, r3 = 101
Gossip
RM3: V = 000, R = 111, update log: u3 = 100, r3 = 101; u2 = 100, r2 = 110; u1 = 000, r1 = 100
Execute u1 (u1 ≤ V)
RM3: V = 100, R = 111, update log: u3 = 100, r3 = 101; u2 = 100, r2 = 110; u1 = 000, r1 = 100
Execute u2 (u2 ≤ V)
RM3: V = 110, R = 111, update log: u3 = 100, r3 = 101; u2 = 100, r2 = 110; u1 = 000, r1 = 100
112Example (Cont.)
Execute u3 (u3 < V)
RM3: V = 111, R = 111, update log: u3 = 100, r3 = 101; u2 = 100, r2 = 110; u1 = 000, r1 = 100
More issues: garbage collection of the logs and optimization of message count (Chapter 12)
113Cache-Coherence Protocol
- Cache: a special case of replication
- Controlled by clients (instead of servers)
- Coherence detection strategy: when during a transaction the detection is done
- Every access (operation)
- Let the transaction proceed while verification is taking place
- If the assumption later proves to be false → abort
- Verify only when the transaction is committed
114Cache-Coherence Protocol (Cont.)
- Coherence enforcement strategy: how caches are kept consistent with the copies stored at servers
- Write-invalidate: server sends invalidation to all caches whenever data is modified
- Write-update: server propagates the update
- What happens when a process modifies cached data
- Write-through: immediate write back to the server
- Write-back: updates to the cache can be batched and written back to the server periodically
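The difference between write-through and write-back can be sketched with a toy client cache. The classes below are hypothetical and ignore invalidation, concurrency, and failures; they only show how many server writes each policy produces for the same sequence of client writes.

```python
class Server:
    def __init__(self):
        self.store = {}
        self.write_count = 0      # number of write requests received

    def write(self, key, value):
        self.store[key] = value
        self.write_count += 1


class ClientCache:
    def __init__(self, server, write_through):
        self.server = server
        self.write_through = write_through
        self.data = {}
        self.dirty = set()        # keys modified locally but not yet on the server

    def write(self, key, value):
        self.data[key] = value
        if self.write_through:
            self.server.write(key, value)   # immediate write to the server
        else:
            self.dirty.add(key)             # defer until the next flush

    def flush(self):
        """Write back all batched updates (write-back policy only)."""
        for key in sorted(self.dirty):
            self.server.write(key, self.data[key])
        self.dirty.clear()


through_server, back_server = Server(), Server()
through = ClientCache(through_server, write_through=True)
back = ClientCache(back_server, write_through=False)

for version in range(3):
    through.write("x", version)
    back.write("x", version)

before_flush = back_server.store.get("x")   # server has not seen the write-back updates yet
back.flush()
```

Write-back saves server traffic when the same data is overwritten repeatedly, at the cost of a window in which the server copy is stale.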