Chapter 6 DFS Design and Implementation

1
Chapter 6 DFS Design and Implementation
2
Outline

Characteristics of a DFS
DFS Design and Implementation
Transaction and Concurrency Control
Data and File Replication

3
Overview

A file system is responsible for the naming,
creation, deletion, retrieval, modification, and
protection of files
DFS (Distributed file system): a file system
consisting of physically dispersed storage sites
but providing a traditional centralized file
system view for users
Transparency
Directory service (name service)
Performance and availability → caching and
replication → cache coherence and replica
management
Access control and protection
The unique problems in DFS are due to the need
for sharing and replication of files

4
Characteristics of A DFS (I)

Dispersed clients
Login transparency: uniform login procedure,
uniform view of FS
Access transparency: uniform mechanism to access
local and remote files
Dispersed files
Location transparency: file names need not
contain information about the physical locations
of files
Location independence (migration transparency):
files can be moved from one physical location to
another without changing their names

5
Characteristics of A DFS (II)

Multiplicity of users
Concurrency transparency: an update to a file by a
process should not interfere with other processes
sharing this file
Support concurrency transparency at the
transaction level
Interleaved file accesses by several applications
Multiplicity of files
Replication transparency: clients are not aware
that more than one copy of the file exists
Perform atomic updates on the replicas
Fault tolerance, scalability, heterogeneity

6
DFS Design and Implementation
7
File Concept

OS abstracts from the physical storage devices to
define a logical storage unit: the file
Types
Data: numeric, alphabetic, alphanumeric, binary
Program: source and object form

8
Logical components of a file

File name: symbolic name
When accessing a file, its symbolic name is
mapped to a unique file id (ufid or file handle)
that can locate the physical file
Mapping is the primary function of the directory
service
File attributes: next slide
Data units
Flat structure: a stream of bytes or a sequence
of blocks
Hierarchical structure: indexed records

9
File Attributes

File handle: unique ID of the file
Name: the only information kept in human-readable
form
Type: needed for systems that support different
types
Location: pointer to file location on device
Size: current file size (and the maximum
allowable size)
Protection: controls who can read, write,
execute
Time, date, and user identification: data for
protection, security, and usage monitoring
Information about files is kept in the directory
structure, which is maintained on the physical
storage device.

10
Access Methods

Sequential access: information is processed in
order
read next
write next (append to the end of the file)
reset to the beginning of file
skip forward or backward n records
Direct access: a file is made up of fixed-length
logical blocks or records
read n
write n
position to n
read next
write next
rewrite n

11
Access Methods (Cont.)

Indexed sequential access
Data units are addressed directly by using an
index (key) associated with each data block
Requires the maintenance of a search index on
the file, which must be searched to locate a
block address for each access
Usually used only by large file systems in
mainframe computers
Indexed sequential access method (ISAM)
A two-level scheme to reduce the size of the
search index
Combine the direct and sequential access methods

12
Major Components in A File System
A file system organizes and provides access and
protection services for a collection of files
13
Directory Structure

Access to a file must first use a directory
service to locate the file.
A collection of nodes containing information
about all files.
Both the directory structure and the files reside
on disk.

Directory
F 1
F 2
F 3
F 4
F n
Files
14
Information in a Directory

Name
Type: file, directory, symbolic link, special
file
Address: device blocks to store a file
Current length
Maximum length
Date last accessed (for archival)
Date last updated (for dump)
Owner ID
Protection information

15
Operations Performed on Directory

Search for a file
Create a file
Delete a file
List a directory
Rename a file
Traverse the file system

Some kind of name service
16
Tree-Structured Directories Hierarchical
Structure of A File System
Subdirectory is just a special type of file
17
Authorization Service

File access must be regulated to ensure security
File owner/creator should be able to control
what can be done
by whom
Types of access
Read
Write
Execute
Append
Delete
List

18
File Service File Operations

Create
Allocate space
Make an entry in the directory
Write
Search the directory
Write is to take place at the location of the
write pointer
Read
Search the directory
Read is to take place at the location of the read
pointer
Reposition within file file seek
Set the current file pointer to a given value

Delete
Search the directory
Release all file space
Truncate
Reset the file to length zero
Open(Fi)
Search the directory structure
Move the content of the directory entry to memory
Close(Fi)
move the content in memory to directory structure
on disk
Get/set file attributes

19
System Service

Directory, authorization, and file services are
user interfaces to a file system (FS)
System services are an FS's interface to the
hardware and are transparent to users of the FS
Mapping of logical to physical block addresses
Interfacing to services at the device level for
file space allocation/de-allocation
Actual read/write file operations
Caching for performance enhancement
Replicating for reliability improvement

20
DFS Architecture NFS Example
21
File Mounting

A useful concept for constructing a large file
system from various file servers and storage
devices
Attach a remote named file system to the client's
file system hierarchy at the position pointed to
by a path name (mounting point)
A mounting point is usually a leaf of the
directory tree that contains only an empty
subdirectory
mount claven.lib.nctu.edu.tw/OS /chow/book
Once files are mounted, they are accessed by
using the concatenated logical path names without
referencing either the remote hosts or local
devices
Location transparency
The linked information (mount table) is kept
until they are unmounted
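
A minimal sketch of how a client might use its mount table to turn a concatenated logical path name into a (server, remote path) pair. The table entry mirrors the mount example on this slide; the names MOUNT_TABLE and resolve are hypothetical and not part of NFS or any real API.

```python
# Hypothetical mount table: local mount point -> (server, exported path).
# The single entry mirrors the slide's example mount.
MOUNT_TABLE = {
    "/chow/book": ("claven.lib.nctu.edu.tw", "/OS"),
}


def resolve(path):
    """Return (server, remote_path) for a mounted path, or (None, path) if local."""
    # Try the longest mount point first so nested mounts resolve correctly.
    for mount_point in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            server, exported = MOUNT_TABLE[mount_point]
            return server, exported + path[len(mount_point):]
    return None, path
```

With this table, resolve("/chow/book/DSM") maps to /OS/DSM on the remote server, while the user only ever names /chow/book/DSM.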

22
File Mounting Example
root
root
Export
chow
OS
Mount
paper
book
DFS
DSM
/OS/DSM
Local Client
Remote Server
23
File Mounting (Cont.)

Different clients may perceive a different FS
view
To achieve a global FS view, the SA enforces
mounting rules
Export: a file server restricts/allows the
mounting of all or parts of its file system to a
predefined set of hosts
The information is kept in the server's export
file
File system mounting
Explicit mounting: clients make explicit mounting
system calls whenever one is desired
Boot mounting: a set of file servers is
prescribed and all mountings are performed at the
client's boot time
Auto-mounting: mounting of the servers is
implicitly done on demand when a file is first
opened by a client

24
Location Transparency
No global naming
25
A Simple Automounter for NFS
26
Server Registration

The mounting protocol is not transparent: it
requires knowledge of the location of file servers
When multiple file servers can provide the same
file service, the location information becomes
irrelevant to the clients
Server registration → name/address resolution
File servers register their services with a
registration service, and clients consult with
the registration server before mounting
Clients broadcast mounting requests, and file
servers respond to clients' requests

27
Stateful and Stateless File Servers

Stateless file server: when a client sends a
request to a server, the server carries out the
request, sends the reply, and then removes from
its internal tables all information about the
request
Between requests, no client-specific information
is kept on the server
Each request must be self-contained: full file
name and offset
Stateful file server: file servers maintain
state information about clients between requests
State information may be kept in servers or
clients
Opened files and their clients
File descriptors and file handles
Current file position pointers
Mounting information
Lock status
Session keys
Cache or buffer

Session: a connection for a sequence of requests
and responses between a client and the file server
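
A small sketch contrasting the two server styles, assuming in-memory files held as bytes. The class and method names are illustrative only. The stateless read carries the full file name and offset in every request, so repeating it is harmless; the stateful read depends on a position pointer kept on the server.

```python
class StatelessServer:
    """Keeps no per-client state; every request names the file and offset."""

    def __init__(self, files):
        self.files = files  # file name -> bytes

    def read(self, name, offset, count):
        return self.files[name][offset:offset + count]


class StatefulServer:
    """Remembers open files and a current position pointer per handle."""

    def __init__(self, files):
        self.files = files
        self.open_table = {}  # handle -> [file name, position]
        self.next_handle = 0

    def open(self, name):
        handle = self.next_handle
        self.next_handle += 1
        self.open_table[handle] = [name, 0]
        return handle

    def read(self, handle, count):
        name, pos = self.open_table[handle]
        data = self.files[name][pos:pos + count]
        self.open_table[handle][1] = pos + len(data)  # server-side state changes
        return data
```

If a reply is lost and the client retries, the stateless read returns the same bytes again, while the stateful read would skip ahead unless duplicates are detected.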
28
A Comparison between Stateless and Stateful
Servers
29
Issues of A Stateless File Server

Idempotency requirement
Is it practical to structure all file accesses as
idempotent operations?
File locking mechanism
Should locking mechanism be integrated into the
transaction service?
Session key management
Can one-time session key be used for each file
access?
Cache consistency
Is the file server responsible for controlling
cache consistency among clients?
What sharing semantics are to be supported?

30
File Sharing

Overlapping access: multiple copies of the same
file
Space multiplexing of the file
Cache or replication
Coherency control: managing accesses to the
replicas, to provide a coherent view of the
shared file
Desirable to guarantee the atomicity of updates
(to all copies)
Interleaving access: multiple granularities of
data access operations
Time multiplexing of the file
Simple read/write, Transaction, Session
Concurrency control: how to prevent one execution
sequence from interfering with the others when
they are interleaved and how to avoid
inconsistent or erroneous results

31
Space Multiplexing

Remote access: no file data is kept in the client
machine. Each access request is transmitted
directly to the remote file server through the
underlying network.
Cache access: a small part of the file data is
maintained in a local cache. A write operation or
cache miss results in a remote access and an update
of the cache
Download/upload access: the entire file is
downloaded for local accesses. A remote access or
upload is performed when updating the remote file

32
Remote Access VS Download/Upload Access
Remote Access
Download/Upload Access
33
Four Places for Caching
Client's disk (optional)
Server's disk
Client's main memory
Server's main memory
Client
Server
34
Coherency of Replicated Data

Four interpretations
All replicas are identical at all times
Impossible in distributed systems
Replicas are perceived as identical only at some
points in time
How to determine the good synchronization points?
Users always read the most recent data in the
replicas
How to define most recent?
Based on the completion times of write
operations (the effect of a write operation has
been reflected in all copies)
Write operations are always performed
immediately and their results are propagated in
a best-effort fashion
Coarse attempt to approximate the third definition

35
Time Multiplexing

Simple RW: each read/write operation is an
independent request/response access to the file
server
Transaction RW: a sequence of read and write
operations is treated as a fundamental unit of
file access (to the same file)
ACID properties
Session RW: a sequence of transaction and simple
RW operations

36
Space and Time Concurrencies of File Access
37
Semantics of File Sharing

On a single processor, when a read follows a
write, the value returned by the read is the
value just written (Unix Semantics).
In a distributed system with caching, obsolete
values may be returned.

Solution to coherency and concurrency control
problems depends on the semantics of sharing
required by applications
38
Semantics of File Sharing (Cont.)
39
Version Control

Version control under immutable files
Implemented as a function of the directory
service
Each file is attached with a version number
An open to a file always returns the current
version
Subsequent read/write operations to the opened
file are made only to the local working copy
When the file is closed, the local modified
version (tentative version) is presented to the
version control service
If the tentative version is based on the current
version, the update is committed and the
tentative version becomes the current version
with a new version number
What if the tentative version is based on an
older version?

40
Version Control (Cont.)

Action to be taken if based on an older version
Ignore conflict: a new version is created
regardless of what has happened (equivalent to
session semantics)
Resolve version conflict: the modified data in
the tentative version are disjoint from those in
the new current version
Merge the updates in the tentative version with
the current version to yield a new version
that combines all updates
Resolve serializability conflict: the modified
data in the tentative version were already
modified by the new current version
Abort the tentative version and roll back the
execution of the client with the new current
version as its working version
The concurrent updates are serialized in some
arbitrary order
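
A toy sketch of the close-time check described on these two slides, assuming a file is modelled as a dict from block name to data. The VersionControl class and its method names are hypothetical. It commits when the tentative version is based on the current one, merges disjoint updates (version conflict), and aborts on overlapping updates (serializability conflict). Deleted blocks are not handled.

```python
class VersionControl:
    """Toy version-control service for one file (dict of block -> data)."""

    def __init__(self, content):
        self.version = 1
        self.content = dict(content)
        self.history = {1: dict(content)}  # version number -> snapshot

    def open(self):
        # An open always returns the current version and a working copy.
        return self.version, dict(self.content)

    def close(self, base_version, tentative):
        base = self.history[base_version]
        mine = {k for k in tentative if tentative[k] != base.get(k)}
        if base_version != self.version:
            theirs = {k for k in self.content if self.content[k] != base.get(k)}
            if mine & theirs:
                return "abort"  # serializability conflict: redo on current version
            merged = dict(self.content)
            merged.update({k: tentative[k] for k in mine})
            tentative = merged  # version conflict: disjoint updates are merged
        self.version += 1
        self.content = dict(tentative)
        self.history[self.version] = dict(tentative)
        return "commit"
```

The "ignore conflict" policy would simply skip the comparison and always install the tentative version.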

41
Transaction and Concurrency Control

Apply the idea of transaction to distributed file
system management

42
The Transaction Model

Transaction a fundamental unit of interaction
between processes (all-or-nothing)
Updating a master tape
Withdraw money from one account and deposit it in
another

43
The Transaction Model (Cont.)

Examples of primitives for transactions
Support by underlying distributed OS or language
runtime system

44
The Transaction Model (Cont.)

Transaction to reserve three flights commits
Transaction aborts when third flight is
unavailable

45
The ACID Properties

Atomicity (Indivisibility)
Either all of the operations in a transaction are
performed or none of them are, in spite of
failures
Consistency (Serializability) (not violate system
invariants)
The execution of interleaved transaction is
equivalent to a serial execution of the
transactions in some order
Isolation (no interference between transactions)
Partial results of an incomplete transaction are
not visible to others before the transaction is
successfully committed
Durability
The system guarantees that the results of a
committed transaction will be made permanent even
if a failure occurs after the commitment

46
Nested and Distributed Transactions
Durability applies only to the top-level transaction
47
Implementation Private Workspace
During transactions, reads/writes go to a private
workspace
Performance issue

The file index and disk blocks for a three-block
file
The situation after a transaction has modified
block 0 and appended block 3
After committing

48
Implementation Write-ahead Log

Files are actually modified in place, but before
any block is changed, a record is written to a
log telling which transaction is making the
change, which file and block is being changed,
and what the old and new values are
Only after the log has been written successfully
is the change made to the file
Write-ahead log can be used for undo (rollback)
and redo
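
A minimal in-memory sketch of the write-ahead rule above: the log record (transaction, file, block, old value, new value) is appended before the block is changed in place. The WriteAheadLog class is hypothetical, and a real system would force the log to stable storage before touching the file.

```python
class WriteAheadLog:
    """In-memory sketch; a real log would be forced to stable storage."""

    def __init__(self, blocks):
        self.blocks = blocks  # (file, block number) -> value, modified in place
        self.log = []         # records: (txn, file, block, old, new)

    def write(self, txn, file, block, new):
        old = self.blocks.get((file, block))
        self.log.append((txn, file, block, old, new))  # log first...
        self.blocks[(file, block)] = new               # ...then change the file

    def undo(self, txn):
        """Roll back txn by restoring old values in reverse log order."""
        for t, file, block, old, _new in reversed(self.log):
            if t == txn:
                if old is None:
                    self.blocks.pop((file, block), None)  # block did not exist
                else:
                    self.blocks[(file, block)] = old

    def redo(self, txn):
        """Reapply txn's new values in log order, for example after a crash."""
        for t, file, block, _old, new in self.log:
            if t == txn:
                self.blocks[(file, block)] = new
```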

49
Transaction Processing System
Execution Phase and Commit Phase
(Transaction ID private workspace)
Satisfy the ACID property
50
Execution Phase and Commit Phase
Failures and recovery actions for the 2PC protocol
51
Transaction Processing System (Cont.)

Layered approach for transaction management
Data (object) manager perform actual read/write
on data
Know nothing about transaction
Atomic update. Cache and replica management
Interface to the FS
Scheduler responsible for properly controlling
concurrency
Determine which transaction is allowed to pass
read/write to DM and at which time
Concurrent control protocol (serializable)
Enforce isolation and consistency avoid
conflicts
Transaction Manager guarantee atomicity of
transactions
All-or-none two-phase commit
Maintain write-ahead log and private workspace
for each transaction

52
Distributed Transaction Processing System

General organization of managers for handling
distributed transactions.

53
Distributed Transaction Processing System (Cont.)
Another view of the previous slide

A transaction may invoke operations on remote
objects (files)
The machine that initiates a transaction → coordinator
The machine on which a remote object is located →
participant
Two-phase commit between coordinator and
participant

54
Distributed Transaction Processing System (Cont.)

A transaction manager serves as the coordinator,
with the remote transaction managers being the
participants in the two-phase commit protocol
Apply two-phase commit protocol to atomic update
of replicated objects
The object (data) manager where an update is
requested initiates the two-phase commit protocol
in conjunction with other object managers that
are holding a replica of the object
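
A bare-bones sketch of the two-phase commit exchange between a coordinator and its participants. The Participant class and two_phase_commit function are hypothetical names; timeouts, logging, and failure recovery are left out.

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. A real participant would log its vote first.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # collect every vote
    decision = "committed" if all(votes) else "aborted"
    for p in participants:
        p.finish(decision)
    return decision
```

The same exchange applies when the participants are object managers holding replicas: one negative vote aborts the update everywhere.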

55
Concurrency Control: Scheduler's Responsibility

The whole idea behind concurrency control
Properly schedule conflicting operations to
ensure consistency
Conflicting operations: operations on the same data
item where at least one of them is a write
operation
Need to acquire/release locks before/after using
the data items
Read-write conflict and Write-write conflict
How to discover and handle inconsistency
Prevent inconsistency: two-phase locking
All access requests are constrained in a certain
format such that interference among conflicts can
be prevented
TM transforms clients' transactions into the
restrictive form
Use locks → scheduler assumes the lock management
functions
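
The definition of conflicting operations on this slide can be written as a one-line predicate. This is only an illustration; the tuple layout (transaction, kind, item) is an assumption made for the sketch.

```python
def conflicts(op1, op2):
    """Each op is (transaction, kind, item) with kind 'r' or 'w'."""
    (t1, kind1, item1), (t2, kind2, item2) = op1, op2
    # Different transactions, same data item, at least one write.
    return t1 != t2 and item1 == item2 and "w" in (kind1, kind2)
```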

56
Concurrency Control (Cont.)

How to discover and handle inconsistency (Cont.)
Avoid inconsistency: timestamp ordering
Each individual access operation is checked by
the scheduler, which decides whether the
operation should be accepted, tentatively
accepted, or rejected to avoid conflicts
Schedulers perform the scheduling of operations
based on timestamp ordering
Validate consistency: optimistic concurrency
control protocols
Conflicts are completely ignored during the
execution phase of a transaction. The consistency
is validated at the end of the execution phase.
Only transactions that can be globally validated
are allowed to commit. Schedulers are validation
managers.

57
Serializability

Concurrency control is based on the concept of
serializability
Schedule: the execution order of operations
Legal schedule: a schedule that observes the
internal ordering of operations for each
transaction and in which no transactions hold
conflicting locks simultaneously (for locking
algorithms)
Not all legal schedules yield consistent results
or even complete
Serial schedule: a special legal schedule that is
formed by a strict sequential execution of the
transactions in some order
Each transaction satisfies ACID
Ensure the consistency requirement
A schedule is serializable if the result of its
execution is equivalent to that of a serial
schedule (This is what we want)

58
Serializability Example
Already committed
(1,3) and (2,4) have write-write conflicts
59
Interleaving Schedules
60
Serializability (Cont.)

Updates are made permanent only if the execution
of the transactions satisfies the serializability
requirement and is successfully committed
Sufficient condition for serializability
If the interleaved execution of transactions is
to be equivalent to a serial execution in some
order, then all conflicting operations in the
interleaved serializable schedule must also be
executed in the same order at all object sites
Chapter 12 presents the serialization graph model
to address general serialization problems

61
Two-Phase Locking

Using locking approach, all shared objects in a
well-formed transaction must be locked before
they can be accessed and must be released before
the end of transaction
Two-phase locking
A new lock cannot be acquired after the first
release of a lock
Phase 1: growing phase of locking the objects
Phase 2: shrinking phase of releasing the objects
Extreme two-phase locking
Get all locks at the beginning of the transaction
and release all locks at the same time as the end
of the transaction
Example Table 6.1
1, 2 are feasible
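
A small sketch of the two-phase rule for a single transaction: once any lock has been released, no new lock may be acquired. The TwoPhaseTransaction class is hypothetical and only checks the growing/shrinking discipline; it does not arbitrate conflicts between transactions, which is the scheduler's job.

```python
class TwoPhaseTransaction:
    """Tracks one transaction's locks and rejects a lock after the first release."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, obj):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after a release")
        self.held.add(obj)  # growing phase

    def unlock(self, obj):
        self.held.discard(obj)
        self.shrinking = True  # shrinking phase begins with the first release

    def release_all(self):
        # Extreme or strict style: all locks are released together at the end.
        self.held.clear()
        self.shrinking = True
```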

62
Two-Phase Locking (Cont.)
Deadlock may happen (e.g., reverse operations 3 and 4
in t2 of Table 6.1)
Scheduler is responsible for granting and
releasing locks in such a way that only valid
schedules result (solving operation conflicts)
63
Two-Phase Locking (Cont.)

Strict two-phase locking.

Deadlock may happen
64
Two-Phase Locking (Cont.)

Strict 2PL: only release locks at commit/abort
Sacrifice some concurrency but easy to implement
Non-strict 2PL: difficult to implement
TM does not know when the last lock has been
requested
May cause rolling aborts
Transaction T1 updates X, then releases X
Transaction T2 reads X → the new X value is read
T1 aborts → T2 must abort as well
T2 has a commit dependence on T1
Commit of T2 must be delayed until the commit of
T1

65
Two-Phase Locking (Cont.)

Two-phase locking and strict two-phase locking
→ deadlock
Two-phase locking in a distributed system
Centralized 2PL
A single site is responsible for granting/releasing
locks
Primary 2PL
Each data item is assigned a primary copy. The
scheduler on the copy's machine is responsible
for granting/releasing locks
Distributed 2PL
Data may be replicated across multiple machines.
The scheduler on each machine is responsible for
granting/releasing locks and makes sure the
operation is forwarded to the local data manager

66
Pessimistic Timestamp Ordering

Basic idea
OM follows transaction timestamp order to perform
operations
When an operation on a shared object is invoked,
OM records the timestamp of the invoking
transaction
When a transaction invokes a conflicting
operation on the object
The transaction has a larger timestamp than the
one recorded by the object → proceed (and record
the new timestamp)
Otherwise → abort
No deadlock
Cascade aborts (schedule 5)
Tentative write before commit for ensuring
isolation

67
Pessimistic Timestamp Ordering (Cont.)

Timestamp ordering with tentative writes SCH
Each object is associated with
RD: transaction commitment time for the last
read
WR: transaction commitment time for the last
write
A list of tentative times (Ts) for the pending
transactions with a write operation to the object
Tmin: the minimum of Ts

68
Pessimistic Timestamp Ordering (Cont.)

Concurrency control using timestamps.

Wait until T3 commit/abort
And Restart
Different from Figure 6.7
69
Pessimistic Timestamp Ordering (Cont.)
Execution phase: enforce or resolve
consistency. Commit phase: enforce atomicity
Execution Phase
Commit Phase
70
Pessimistic Timestamp Ordering (Cont.)

Read (with transaction timestamp T)
T < WR → abort (to maintain increasing timestamp
order)
WR < T < Tmin → allow to proceed (before any
pending write)
Read result is put into TM's work space and
returned to the client
Tmin < T → put in the tentative list and wait for
the preceding writes to finish (commit or abort)
(already have tentative writes)
Write
T > RD and T > WR → put into the tentative list
Inform TM of the success or failure of the
tentative write operation
Otherwise → abort

Enforce or resolve consistency in execution phase
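
The read and write rules on this slide can be sketched as a per-object scheduler check. The TimestampedObject class is hypothetical, timestamps are plain integers, and the treatment of equal timestamps is an assumption, since the slide only states the strict inequalities.

```python
class TimestampedObject:
    def __init__(self):
        self.rd = 0          # RD: timestamp of the last committed read
        self.wr = 0          # WR: timestamp of the last committed write
        self.tentative = []  # timestamps of pending tentative writes

    def read(self, t):
        if t < self.wr:
            return "abort"    # would break increasing timestamp order
        if not self.tentative or t < min(self.tentative):
            return "proceed"  # before any pending write (T < Tmin)
        return "wait"         # must wait for preceding tentative writes

    def write(self, t):
        if t > self.rd and t > self.wr:
            self.tentative.append(t)  # tentative write, made permanent at commit
            return "tentative"
        return "abort"
```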
71
Pessimistic Timestamp Ordering (Cont.)

Abort
Read → simply discard the waiting read
Write → remove from the tentative list
If a waiting read reaches the head of the list →
perform read
Commit → the successful completion of the atomic
commit in TM
Transaction waiting to read → never happens
(blocked)
Transaction with only completed read operations →
update the object's RD (the larger of the
transaction timestamp and the object's current
RD)
Transaction with tentative write
SCH aborts all pending transactions (both waiting
reads and tentative writes) ahead of the
committed transaction
Make the update permanent
Remove write from tentative list (may allow a
waiting read to proceed)
Replicas exist → call replication manager

Enforce atomicity in abort/commit
72
Pessimistic Timestamp Ordering (Cont.)
Allow more transactions to proceed freely, but
with more aborts, since conflicts sometimes
occur and need to be resolved
Waiting reads and tentative writes abort! (to
maintain the consistency)
Tmin
Commit for a transaction with tentative write
commit
73
Example
After operation 2 in t1 and operation 4 in t2, there
may be more unrelated operations
Schedule 1: 1, 2, 3, 4
[Figure: RD, WR, and Tmin after operations 1/2 and 3/4. Labels: RD = WR = t0; Tmin = t1; tentative list t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!
74
Example (Cont.)
Schedule 3: 3, 1, 4, 2
[Figure: RD, WR, and Tmin after operations 3/4 and 1/2. Labels: RD = WR = t0; Tmin = t2; Tmin = t1 with tentative list t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!
75
Example (Cont.)
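Schedule 5: 1, 3, 4, 2
[Figure: RD, WR, and Tmin for object C (operations 1, 3) and object D (operations 4, 2). Labels for C: RD = WR = t0; Tmin = t1; tentative list t1, t2. Labels for D: RD = WR = t0; Tmin = t2; Tmin = t1 with tentative list t1, t2; RD = WR = t1]
If t2 commits first → t1 has to abort and restart!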
76
Example (Cont.)
t1
t2
t3
Waiting read and pending write for t1
Schedule 3
XX1
X0
RD, WR
Tmint1
t2
RD, WR
Tmint1
t2
RD, WR
Tmint1
t2
t3
Order of commit
XX2XX3
RD, WR
Tmint1
t2
t3
Waiting read and pending write for t2 and t3
77
Example (Cont.)
Schedule 4 x0 (t2), x0 (t3), xx3 commit,
x0 (t1)
Waiting read and pending write for t3 (xx3)
t2 aborts and restart
t3 commit
RD, WR
Tmint2
t3
RDWRt3
X0
RDWRt3
t1
t1 aborts and restarts
78
Optimistic Timestamp Ordering

Optimistic timestamp ordering
Just go ahead and do whatever you want to without
paying attention to what anybody else is doing
System keeps track of which data items have been
read and written
At the point of committing
Check all other transactions to see if any of its
items have been changed since the transaction
started (Validation)
Yes → abort
No → commit
Transaction uses private workspace to store
shadow copies of data
Deadlock free and maximum parallelism
If a transaction fails → restart and run again
(not good for heavy load)
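
A simplified single-site sketch of the optimistic scheme: each transaction works on shadow copies and, at commit, validates that no item it read has changed. The classes are hypothetical, and per-item version counters stand in for the timestamps used on the following slides. Validation here is against the version seen at first read rather than at transaction start, which is a simplification.

```python
class OptimisticStore:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}  # bumped on every committed write


class OptimisticTransaction:
    def __init__(self, store):
        self.store = store
        self.seen = {}       # item -> version number at first read
        self.workspace = {}  # private shadow copies
        self.dirty = set()   # items written by this transaction

    def read(self, key):
        if key not in self.workspace:
            self.seen[key] = self.store.version[key]
            self.workspace[key] = self.store.data[key]
        return self.workspace[key]

    def write(self, key, value):
        self.workspace[key] = value
        self.dirty.add(key)

    def commit(self):
        # Validation: abort if any item read has changed since it was read.
        for key, seen in self.seen.items():
            if self.store.version[key] != seen:
                return False
        # Update: make the shadow copies permanent.
        for key in self.dirty:
            self.store.data[key] = self.workspace[key]
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

A failed commit leaves the store untouched, so the client can simply restart the transaction.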

79
Optimistic Timestamp Ordering (Cont.)

A transaction consists of three phases
Execution phase
Just go ahead without paying attention to other
transactions
Need private work space for shadow copies of
shared objects
Validation phase
Use a two-phase commit protocol to globally
validate
Once a transaction is validated, it is guaranteed
to be committed
All commitments must follow the order of
validation time
Must be atomic
Update phase
Make changes permanent in the persistent memory

80
Optimistic Timestamp Ordering (Cont.)

Each transaction ti
TSi: timestamp of the start time of its execution
phase
TVi: timestamp of the start time of its
validation phase
Ri: the set of data objects read by ti (read
set)
Wi: the set of data objects written by ti (write
set)
Each object Oj
RDj: commitment time for the last read operation
WRj: commitment time for the last write operation
Also called the version number of Oj
The transactions are to be serialized w.r.t. the
timestamp TVs of the validated transactions

81
Execution Phase

Begins at a TM when it receives a begin transaction
from the client
Private work space is created and maintained by
TM
Shadow copies of object version numbers
Similar to the session semantics of files
Abort: delete the transaction and its work space
End transaction
Request for commit
Move to the validation phase

82
Validation Phase

TM validates mutual consistency between the
requested transaction and other distributed
transactions to ensure serializability
Initiate two-phase validation protocol, as
coordinator
Ri, Wi, and TVi are sent to all participating TMs
for validation
A participant can respond with positive
validation to more than one request
Each TM has knowledge of all outstanding
transactions tk at its local site
Validation of mutual consistency between ti and
tk
TVi must be greater than TVk and tk must be
completed before ti
if both are validated → check Wk of tk for
conflict
An accepted validation carries the current
version number of the shared remote object. It is
compared with TVi for work-space consistency. All
WRs must be smaller than TVi

83
Validation Phase (Cont.)
All commitments follow the order of validation
time. The update of the commitment must also be
atomic
Two-phase commit for validation
84
Optimistic Timestamp Ordering (Cont.)
85
Optimistic Timestamp Ordering (Cont.)
(1) Violate ordering of validation time
TVi
(2) Accept!! Serialized Already!
Ti
Execution
Validation
TVk
Tk
Execution
Validation
Update
86
Update Phase

The transaction moves to the update phase once it
has gathered an accepted validation from all
participating TMs and the state of work space is
consistent
An accepted validation is equivalent to a
tentative pre-write in the timestamp ordering
approach
The update phase is similar to the commit phase
in timestamp ordering
Except that a tentative write can be aborted, while
a validation cannot be denied once it is given
Updates must be committed in the TV order for
those validated transactions

87
Data and File Replication
88
Overview

Advantages of data and file replication
Parallelism transparency: higher performance
(concurrent access to replicas)
Failure transparency: higher availability
(redundant replicas)
Necessity
Replication transparency: not aware of the
existence of replicas
Concurrency transparency: no interference among
sharing clients
Atomic update: updates to all replicas must be
atomic
One-copy serializability
atomic transaction
atomic update of replicas

89
Overview (Cont.)

Atomic multicast: multicast messages are reliably
delivered to all non-faulty group members and the
order of message delivery must obey a total
ordering
Atomic transaction: operations in every
transaction are performed on an all-or-none basis
and conflicting operations among concurrent
transactions are executed in the same order
(serialized)
Atomic update: updates are propagated to all
replicated objects and are serialized

90
Overview (Cont.)

Similarity between atomic multicast, transaction,
update
An atomic multicast is a special transaction
where every message representing an operation
conflicts with every other
Order of message delivery
Atomic update is very much like a transaction
where every update is a conflicting operation
Atomic update is less stringent in consistency
requirement
Failures of replicas may be allowed, as long as
at least one copy is available
A client may not be interested in the global
coherency of the replicas, as long as it can read
the most recently written data

91
Architecture for Management of Replicas
92
Options for Read/Write

READ
Read-one-primary: read from a primary RM
(consistency)
Read-one: read from any RM (concurrency)
Read-quorum: read from a quorum of RMs (currency)
WRITE
Write-one-primary: write to one primary replica
Primary RM propagates the updates to all other
RMs
Write-all: atomic updates to all RMs (subsequent
writes must wait)
Write-all-available: atomic updates to all
available (non-faulty) RMs
Failure recovery
Write-quorum: atomic updates to a quorum of RMs
Write-gossip: updates go to any RM and are lazily
propagated to others

93
One-Copy Serializability

The execution of transactions on replicated
objects is equivalent to the execution of the
same transactions on non-replicated objects
Read-one-primary/write-one-primary: no
replication issue
Serialized by primary RM
Secondary RMs only for redundancy
Read-one/write-all
Consistency: two-phase locking or timestamp
ordering protocols
Read/write operations are sub-transactions
Read-one/write-all-available
Failure may cause problems with one-copy
serializability (Chap. 12)
Read-quorum/write-quorum
Conflicts can be preserved if the read set of
replicas of one transaction overlaps with the
write set of another transaction

94
One-Copy Serializability (Cont.)
Serial schedule is the only correct execution
Either t1 reads X written by t2, or t2 reads Y
written by t1
Neither t1 nor t2 sees the object written by the
other
Failures and recoveries must also be serialized
w.r.t. transactions (a failure should appear before
the start of a transaction; otherwise, abort)
The failure of Yd in t1 should force t2 to abort
and rollback
95
Quorum Voting
Conflict: two-phase locking or timestamp ordering
Witnesses carry only the necessary information
(file version and ID)
96
Quorum Voting (Cont.)

Three examples of the voting algorithm
A correct choice of read and write set
A choice that may lead to write-write conflicts
A correct choice, known as ROWA (read one, write
all)
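
The overlap conditions behind these three cases can be checked mechanically. The sketch below assumes the usual voting rules (read quorum plus write quorum exceeds the total votes, and two write quorums always overlap). The figure's actual numbers are not recoverable from this transcript, so the 12-replica configurations and the names `check_quorum` and `is_valid` are illustrative assumptions.

```python
def check_quorum(total_votes, r, w):
    """Check a read quorum r and write quorum w against total_votes.

    Assumed rules (standard quorum voting, not taken from the slide figure):
      r + w > total_votes  -> every read set overlaps every write set
      2 * w > total_votes  -> any two write sets overlap
    """
    return {
        "read_write_overlap": r + w > total_votes,
        "write_write_overlap": 2 * w > total_votes,
    }


def is_valid(total_votes, r, w):
    """A choice is safe only if both overlap conditions hold."""
    result = check_quorum(total_votes, r, w)
    return result["read_write_overlap"] and result["write_write_overlap"]


# Hypothetical configurations: 12 replicas, one vote each.
N = 12
correct_choice = is_valid(N, r=3, w=10)        # both conditions hold
conflicting_choice = check_quorum(N, r=7, w=6)  # two disjoint write sets of 6 exist
rowa = is_valid(N, r=1, w=N)                   # read one, write all
```

With one vote per replica, ROWA is simply the extreme point of the same rule: the write quorum is every replica, so a read quorum of one always overlaps it.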

97
Gossip Update Propagation

If updates are less frequent than reads and
ordering of updates can be relaxed, updates can
be propagated lazily among replicas
Read-one/write-gossip
Support high availability in an environment where
failures of replicas are likely and reliable
multicast of updates is impractical
Idea
Both read and update operations are directed to
any RM
RMs bring their data up to date by exchanging
gossip information

98
A Gossip Architecture

Basic gossip protocol
Read/overwrite: no definitive group
Causal order gossip protocol
Read/modify-update: definitive group

99
Basic Gossip Protocol

Timestamps: scalar or vector
TSf of FSA: timestamp of the last successful
access operation
TSi of RM i: timestamp of the last update of the
data object
Read
TSf < TSi -> RM has more recent data
Value returned
Assign TSi to TSf
TSf > TSi -> RM has out-of-date data
Wait or contact other RMs

Update: FSA increments TSf
TSf > TSi -> execute update
Assign TSf to TSi
TSf < TSi -> comes too late
Overwrite?
Read -> overwrite
Gossip: RMj -> RMi
Accept if TSj > TSi
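
A minimal sketch of these rules with scalar timestamps may help. The class names `FSA` and `RM` and the method names are hypothetical. The slide leaves the equal-timestamp case open, so the sketch assumes a read succeeds when TSf equals TSi and an update is rejected unless TSf is strictly greater than TSi.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    """File service agent: remembers the timestamp of its last successful access."""
    ts: int = 0


@dataclass
class RM:
    """Replica manager: holds one data object and the timestamp of its last update."""
    ts: int = 0
    value: object = None

    def read(self, fsa):
        # Assumption: TSf == TSi is treated like TSf < TSi (replica is recent enough).
        if fsa.ts <= self.ts:
            fsa.ts = self.ts          # assign TSi to TSf
            return True, self.value
        return False, None            # replica is out of date: wait or try another RM

    def update(self, fsa, value):
        fsa.ts += 1                   # FSA increments TSf before the update
        if fsa.ts > self.ts:
            self.value, self.ts = value, fsa.ts   # assign TSf to TSi
            return True
        return False                  # update comes too late

    def gossip_from(self, other):
        # Gossip RMj -> RMi: accept only if TSj > TSi.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts
            return True
        return False
```

Replaying the sequence of the Basic Gossip Update Example slide with this sketch (two writes by FSA 1, gossip in both directions, then a read, a write, and a read by FSA 2) gives the same accept and reject outcomes as the slide, under the assumptions stated above.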

100
Basic Gossip Update Example
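[Figure: message sequence among FSA 1, FSA 2, RM1, and RM2; recoverable step labels follow]
1. FSA 1 (TSf1 = 0, set TSf1 = 1) writes to RM1 (TS1 = 0, set TS1 = 1): write OK
2. FSA 1 (TSf1 = 1, set TSf1 = 2) writes to RM2 (TS2 = 0, set TS2 = 2): write OK
3. Gossip: reject
4. Gossip: accept (RM1: TS1 = 1, set TS1 = 2)
5. FSA 2 (TSf2 = 0) reads from RM2 (TS2 = 2): read OK
6. FSA 2 (TSf2 = 2, set TSf2 = 3) writes to RM2 (set TS2 = 3): write OK
7. FSA 2 (TSf2 = 3) reads from RM1 (TS1 = 2): read reject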
101
Causal Order Gossip

Read/modify-update with definitive group
Ex.: multiplied by 2, incremented by 1 -> order is
important
Vector timestamps maintained at each RM
V (VAL): timestamp of the current value of the
object
R (WORK): knowledge of update requests in the RM
group
How much work is still to be done
Obtained through gossip by merging (pairwise
maximum) Rs
Update log record: u, r, other information
u (DEP): timestamp issued by the FSA for an
update operation
r (ts): identifier of the update operation
r for RMi is obtained by taking the corresponding
u and replacing the ith component of u with the
ith component of R
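
The vector operations named above are small enough to sketch directly. The function names are hypothetical, indices are 0-based (so RM2 is index 1), and the step that increments the RM's own component of R on receiving an update is an assumption inferred from the worked example later in the deck.

```python
def merge(a, b):
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """True if vector timestamp a is component-wise <= b."""
    return all(x <= y for x, y in zip(a, b))


def make_r(u, R, i):
    """Build the update identifier r: copy u, then replace component i with R[i]."""
    r = list(u)
    r[i] = R[i]
    return tuple(r)


def accept_update(R, u, i):
    """RM i receives an update with dependency timestamp u.

    Assumption: the RM first increments its own component of R,
    then derives r from u and the new R. Returns (new_R, r).
    """
    new_R = list(R)
    new_R[i] += 1
    new_R = tuple(new_R)
    return new_R, make_r(u, new_R, i)
```

For instance, RM2 with R = 000 receiving an update with u = 100 ends up with R = 010 and r = 110, which matches the values shown in the example slides.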

102
Causally-Consistent Lazy Replication
103
Processing Read Operations
Wait for RM to become up to date (i.e., if DEP(R) ? VAL(i))

VAL = V in textbook
WORK = R (how much more work)

FSA

DEP = u in textbook

104
Processing Write Operations
5. Stable. (If many, execute in causal order)
Reject if DEP < VAL
Merge VAL and ts

ts = r in textbook
DEP = u

VAL = V in textbook
WORK = R (how much more work)

105
Gossip

A gossip message from RMj to RMi carries RMj's
vector timestamp Rj and log Lj
Rj is merged with Ri
Li is joined with Lj, except for those update
records with r <= Vi
These have already been accounted for by RMi
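
The gossip step can be sketched as follows. This is an illustration under stated assumptions, not the textbook's algorithm: log records are reduced to (u, r) pairs, an update is treated as stable when u <= V, and a record is treated as already applied when r <= V. The function name `receive_gossip` is hypothetical.

```python
def merge(a, b):
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """True if vector timestamp a is component-wise <= b."""
    return all(x <= y for x, y in zip(a, b))


def receive_gossip(V, R, log, Rj, Lj):
    """RMi processes a gossip message (Rj, Lj) from RMj.

    V, R   -- RMi's value and work timestamps
    log    -- RMi's update log as a list of (u, r) pairs
    Returns (V, R, log, executed), where executed lists the r of each
    update applied, in the order it was applied.
    """
    R = merge(R, Rj)                       # Rj is merged with Ri
    log = list(log)
    for record in Lj:                      # join logs, skipping records with r <= Vi
        _, r = record
        if not leq(r, V) and record not in log:
            log.append(record)

    executed = []
    progress = True
    while progress:                        # repeat until no further update is stable
        progress = False
        for u, r in log:
            if leq(u, V) and not leq(r, V):    # stable and not yet applied (assumed test)
                V = merge(V, r)                # merge VAL and ts
                executed.append(r)
                progress = True
    return V, R, log, executed


# RM2 holds u2 = 100, r2 = 110 and receives gossip from RM1 (values from the example slides).
V2, R2, log2, done = receive_gossip(
    V=(0, 0, 0), R=(0, 1, 0), log=[((1, 0, 0), (1, 1, 0))],
    Rj=(1, 0, 0), Lj=[((0, 0, 0), (1, 0, 0))],
)
```

In this run u1 is applied first and u2 second, leaving RM2 with V = 110 and R = 110. When several logged updates are concurrent, the sketch may apply them in a different order than a particular figure shows; any order that respects u <= V is causal.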

106
Example
107
Example (Cont.)
Write
FSA 1: u = 000
RM1 (before): V = 000, R = 000, Update Log: (empty)
RM1 (update logged): V = 000, R = 100, Update Log: (u1 = 000, r1 = 100)
Because u = V, the operation is executed; V is advanced by merging with r
RM1 (after): V = 100, R = 100, Update Log: (u1 = 000, r1 = 100)
The information still has to propagate to the other RMs
108
Example (Cont.)
Read
FSA 2 reads from RM1: TSf = 000
RM1: V = 100, R = 100, Update Log: (u = 000, r = 100)
TSf < V -> value at RM1 returned -> TSf is updated to 100 (merge)
Write
FSA 2: u = 100
RM2 (before): V = 000, R = 000, Update Log: (empty)
RM2 (after): V = 000, R = 010, Update Log: (u2 = 100, r2 = 110)
u2 > V -> RM2 does not yet have the most up-to-date value
109
Example (Cont.)
RM1: V = 100, R = 100, Update Log: (u1 = 000, r1 = 100)
RM2: V = 000, R = 010, Update Log: (u2 = 100, r2 = 110)
Gossip from RM1 to RM2
RM2: V = 000, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u1 (u1 = V) (stable)
RM2: V = 100, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u2 (u2 = V) (stable)
RM2: V = 110, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
110
Example (Cont.)
Read
FSA 3 reads from RM1: TSf = 000
RM1: V = 100, R = 100, Update Log: (u = 000, r = 100)
TSf < V -> value at RM1 returned -> TSf is updated to 100 (merge)
Write
FSA 3: u = 100
RM3 (before): V = 000, R = 000, Update Log: (empty)
RM3 (after): V = 000, R = 001, Update Log: (u3 = 100, r3 = 101)
u3 > V -> RM3 does not yet have the most up-to-date value
111
Example (Cont.)
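RM2: V = 110, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
RM3: V = 000, R = 001, Update Log: (u3 = 100, r3 = 101)
Gossip from RM2 to RM3
RM3: V = 000, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u1 (u1 = V)
RM3: V = 100, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u2 (u2 = V)
RM3: V = 110, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)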
112
Example (Cont.)
Execute u3 (u3 < V)
RM3: V = 111, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
More issues: garbage collection of the logs and
optimization of message count (Chapter 12)
113
Cache-Coherence Protocol

Cache: a special case of replication,
controlled by clients (instead of servers)
Coherence detection strategy: when, during a
transaction, the detection is done
On every access (operation)
Let the transaction proceed while verification is
taking place
If the assumption later proves to be false -> abort
Verify only when the transaction commits

114
Cache-Coherence Protocol (Cont.)

Coherence enforcement strategy: how caches are
kept consistent with the copies stored at servers
Write-invalidate: server sends an invalidation to
all caches whenever data is modified
Write-update: server propagates the update
What happens when a process modifies cached data
Write-through: immediate write back to the server
Write-back: updates to the cache can be batched
and written back to the server periodically
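
One combination of these choices, write-through on the client side with write-invalidate on the server side, can be sketched in a few lines. The classes `Server` and `ClientCache` are hypothetical and live in a single process; a real DFS would send these calls over the network and would have to handle failures.

```python
class Server:
    """Holds the authoritative copies and knows which caches exist."""

    def __init__(self):
        self.data = {}
        self.caches = []

    def register(self, cache):
        self.caches.append(cache)

    def write(self, key, value, writer=None):
        self.data[key] = value
        # Write-invalidate: tell every other cache to drop its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(key)


class ClientCache:
    """Client-controlled cache in front of one server."""

    def __init__(self, server):
        self.server = server
        self.lines = {}
        server.register(self)

    def read(self, key):
        # Miss: fetch from the server and keep a local copy.
        if key not in self.lines:
            self.lines[key] = self.server.data[key]
        return self.lines[key]

    def write(self, key, value):
        self.lines[key] = value
        # Write-through: push the new value to the server immediately.
        self.server.write(key, value, writer=self)

    def invalidate(self, key):
        self.lines.pop(key, None)
```

Swapping the body of `ClientCache.write` for a batched flush would turn this into write-back, and replacing `invalidate` with a call that installs the new value would turn it into write-update.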
