Title: CS 603 Review
1. CS 603 Review
2. Seminar Announcements
- Saurabh Bagchi, Hierarchical Error Detection in a Distributed Software Implemented Fault Tolerance (SIFT) Environment
- April 25, 10:30-11:30, MSEE 239
- Fabian E. Bustamante, The Active Streams Approach to Adaptive Distributed Systems
- April 29, 10:30-11:30, CS 101
3. Review
- Why do we want distributed systems?
- Scaling
- Heterogeneity
- Geographic Distribution
- What is a distributed system?
- Transparency vs. Exposing Distribution
- Hardware Basics
- Communication Mechanisms
4. Basic Software Concepts
- Hiding vs. Exposing
- Distribution: Distributed OS
- Location, but not distribution: Middleware
- None: Network OS
- Concurrency Primitives
- Semaphores
- Monitors
- Distributed System Models
- Client-Server
- Multi-Tier
- Peer to Peer
5. Communication Mechanisms
- Shared Memory
- Enforcement of single-system view
- Delayed consistency: δ-Common Storage
- Message Passing
- Reliability and its limits
- Stream-oriented Communications
- Remote Procedure Call
- Remote Method Invocation
6. RPC Mechanisms
- DCE
- Language / Platform Independent
- Implementation Issues
- Data Conversion
- Underlying Mechanisms
- Fault Tolerance Approaches
- Java RMI
- SOAP
- Interoperable
- Language independent
- Transport independent (anything that moves XML)
7. Naming Requirements
- Disambiguate only
- Access resource given the name
- Build a name to find a resource
- Do humans need to use the name?
- Static/Dynamic Resource
- Performance Requirements
8. Registry Example: X.500
- Goal: Global white pages
- Lookup anyone, anywhere
- Developed by Telecommunications Industry
- ISO standard directory for OSI networks
- Idea: Distributed Directory
- Application uses Directory User Agent to access a Directory Access Point
- Basis for LDAP, ActiveDirectory
9. Directory Information Base (X.501)
- Tree structure
- Root is entire directory
- Levels are groups
- Country
- Organization
- Individual
- Entry structure
- Unique name
- Built from tree
- Attributes: Type/value pairs
- Schema enforces type rules
- Alias entries
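As a rough illustration of a unique name built from the tree, the sketch below joins the type/value pairs found along a path from the root into one name. The `build_name` helper, the comma-separated output, and the `C`/`O`/`CN` attribute types and values are assumptions made for the example, not the X.501 encoding.

```python
def build_name(path):
    """Join (type, value) pairs from root to entry into one unique name."""
    return ", ".join(f"{attr_type}={value}" for attr_type, value in path)


# Hypothetical path: country level, then organization, then individual.
path = [("C", "US"), ("O", "Purdue University"), ("CN", "Chris Clifton")]
name = build_name(path)
print(name)
```

Because every entry's name includes the names of its ancestors, two entries can share a local value (for example the same CN) and still be distinct.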
10. X.500
- Directory Entry
- Organization level: CN=Purdue University, L=West Lafayette
- Person level: CN=Chris Clifton, SN=Clifton, TITLE=Associate Professor
- Directory Operations
- Query, Modify
- Authorization / Access control
- To directory
- Directory as mechanism to implement for others
11. X.500 Distributed Directory
- Directory System Agent
- Referrals
- Replication
- Cache vs. Shadow copy
- Access control
- Modifications at Master only
- Consistency
- Each entry must be internally consistent
- DSA giving copy must identify as copy
12. Clock Synchronization
- Definition: All nodes agree on time
- What do we mean by time?
- What do we mean by agree?
- Lamport Definition: Events
- Events partially ordered
- Clock counts the order
13. Event-based definition (Lamport 78)
- Define partial order of events
- $A \rightarrow B$ (A happened before B): smallest relation such that
- If A and B are in the same process and A occurs first, $A \rightarrow B$
- If A is the sending of a message and B is the receipt of that message, $A \rightarrow B$
- If $A \rightarrow B$ and $B \rightarrow C$, then $A \rightarrow C$
- Clock: $C(x)$ is the time at which x occurs
- $C(x) = C_i(x)$ where x is running on node i
- Clocks are correct if $\forall a, b:\ a \rightarrow b \Rightarrow C(a) < C(b)$
14. Lamport Clock Implementation
- Node i increments $C_i$ between any two successive events
- If event a is the sending of a message m from i to j,
- m contains timestamp $T_m = C_i(a)$
- Upon receiving m, set $C_j$ to a value $\geq$ its current value and $> T_m$
- Can now define a total ordering: $a \Rightarrow b$ iff
- $C_i(a) < C_j(b)$, or
- $C_i(a) = C_j(b)$ and $P_i < P_j$
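The rules above can be sketched in a few lines of Python. This is a minimal in-memory illustration; the `LamportClock` class and the `total_order_key` helper are names invented for the example, and real nodes would exchange the timestamp inside a network message.

```python
class LamportClock:
    """Logical clock for one node (hypothetical helper class)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return its timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the message carries Tm = Ci(a)."""
        return self.tick()

    def receive(self, tm):
        """Move Cj past both its current value and the timestamp Tm."""
        self.time = max(self.time, tm) + 1
        return self.time


def total_order_key(timestamp, node_id):
    """Sort key: compare timestamps first, break ties by node id."""
    return (timestamp, node_id)


p1, p2 = LamportClock(1), LamportClock(2)
a = p1.tick()        # local event on node 1
tm = p1.send()       # node 1 sends message m
b = p2.receive(tm)   # node 2 receives m
c = p2.tick()        # later local event on node 2
```

Sorting events by `total_order_key` gives one total order that is consistent with the happened-before relation, since a receive always gets a larger timestamp than the matching send.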
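15. What if we want wall clock time?
- $C_i$ must run at the correct rate
- $\exists \kappa \ll 1$ such that $|dC_i(t)/dt - 1| < \kappa$
- Synchronized
- $\exists$ small $\epsilon$ such that $\forall i, j:\ |C_i(t) - C_j(t)| < \epsilon$
- Assume transmission time is between $\mu$ and $\mu + \xi$
- Algorithm: Upon receiving message m, set $C_j(t) = \max(C_j(t), T_m + \mu)$
- Theorem: Assume every $\tau$ seconds a message with unpredictable delay less than $\xi$ is sent over every arc. Then $\forall t \gtrsim t_0 + \tau d$, $\epsilon \approx d(2\kappa\tau + \xi)$, where d is the diameter of the network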
16. Clock Synchronization: Limits
- Best Possible: Delay Uncertainty
- Actually $\epsilon(1 - 1/n)$
- Synchronization with Faults
- Faulty clock
- Communication Failure
- Malicious processor
- Worst case: Can only synchronize if < 1/3 of processors are faulty
- Better if clocks can be authenticated
17. Process Synchronization
- Problem: Shared Resources
- Model as sequential or parallel process
- Assumes global state!
- Alternative: Mutual Exclusion when Needed
- Coordinator approach
- Token Passing
- Timestamp
18. Mutual Exclusion
- Requirements
- Does it guarantee mutual exclusion?
- Does it prevent starvation?
- Is it fair?
- Does it scale?
- Does it handle failures?
19. Mutual Exclusion: Colored Ticket Algorithm
- Goals
- Decentralized
- Fair
- Fault tolerant
- Space Efficient
- Idea: Numbered Tickets
- Next number gets resource
- Problem: Unbounded Space
- Solution: Reissue blocks
20. Multi-Resource Mutual Exclusion
- New Problem: Deadlock
- Processes using all resources
- Each needs additional resource to proceed
- Dining Philosophers Problem
- Coordinated vs. truly distributed solutions
- Problems with deterministic solutions
- Probabilistic solution: Lehmann-Rabin
- Starvation / fairness properties
21. Distributed Transactions
- ACID properties
- Issues
- Commit Protocols
- Fault Tolerance
- Why is this enough?
- Failure Models and Limitations
- Mechanisms
- Two-phase commit
- Three-phase commit
22. Two-Phase Commit (Lamport 76, Gray 79)
- Central coordinator initiates protocol
- Phase 1
- Coordinator asks if participants can commit
- Participants respond yes/no
- Phase 2
- If all votes yes, coordinator sends Commit
- Participants respond when done
- Blocks on failure
- Participants must replace coordinator
- If participant and coordinator fail, wait for recovery
- While blocked, transaction must remain Isolated
- Prevents other transactions from completing
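The failure-free message flow of the two phases can be sketched as follows. This is a minimal single-process illustration: `two_phase_commit` and the vote callbacks are hypothetical names, and timeouts, logging and the blocking behavior described above are deliberately left out. Aborting when any vote is no is the standard complement of the commit rule.

```python
def two_phase_commit(participants):
    """Run one failure-free round of 2PC over in-memory participants.

    `participants` maps a name to a vote function returning True (yes)
    or False (no). Returns the decision and what each participant is told.
    """
    # Phase 1: the coordinator asks every participant whether it can commit.
    votes = {name: vote() for name, vote in participants.items()}

    # Phase 2: commit only if all votes are yes, otherwise abort.
    decision = "commit" if all(votes.values()) else "abort"
    outcome = {name: decision for name in participants}
    return decision, outcome


all_yes = two_phase_commit({"p1": lambda: True, "p2": lambda: True})
one_no = two_phase_commit({"p1": lambda: True, "p2": lambda: False})
```

The blocking problem shows up exactly where this sketch is silent: a participant that has voted yes and then hears nothing cannot safely pick either outcome on its own.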
23. Transaction Model
- Transaction Model
- Global Transaction State
- Reachable State Graph
- Local states potentially concurrent if a reachable global state contains both local states
- Concurrency set C(s) is all states potentially concurrent with s
- Sender set S(s) = {local states t : t sends m and s can receive m}
- Failure Model
- Site failure assumed when expected message not received in time
- Independent Recovery
24. Problems with 2-PC
- Blocking on failure
- 3-PC as solution
- Theorems on recovery limits
- Independent recovery: No two-site failure
- Non-independent recovery
- Anything short of total failure okay
- Recovery protocol for total failure
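25. 3PC assuming timeout on receipt of message
[Figure: coordinator and participant state diagrams. Transitions are labeled received message / sent message.]
- Coordinator (states q1, w1, p1)
- q1: xact request / start xact, move to w1
- w1: no / abort
- w1: yes / pre-commit, move to p1
- p1: ack / commit
- Participant (states q2, w2, p2)
- q2: start xact / no
- q2: start xact / yes, move to w2
- w2: abort / -
- w2: pre-commit / ack, move to p2
- p2: commit / -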
26. Termination Protocol
- If participant times out in w2 or p2
- Elect new Coordinator
- If coordinator alive, would have committed/aborted
- New coordinator requests state of all processes. Termination rules:
- If any aborted, broadcast abort
- If any committed, broadcast commit
- If all w2, broadcast abort
- If any p2, send pre-commit and enter state p1
- Complete failure protocol
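The four termination rules can be written as a single decision function. This is a sketch under the assumption that the new coordinator has collected one state string per live process; the function name and the string labels are invented for the example, and states outside the listed rules are rejected rather than guessed at.

```python
def termination_decision(states):
    """Apply the termination rules to the states reported by live processes.

    `states` is a non-empty list of strings drawn from
    "aborted", "committed", "w2" and "p2".
    """
    if not states:
        raise ValueError("no process reported a state")
    if "aborted" in states:
        return "broadcast abort"
    if "committed" in states:
        return "broadcast commit"
    if all(s == "w2" for s in states):
        return "broadcast abort"
    if "p2" in states:
        return "send pre-commit and enter p1"
    raise ValueError("unexpected states: %r" % (states,))
```

The rules are checked in the order listed, so a mix of w2 and p2 falls through to the pre-commit case: some process already knows that every vote was yes.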
27. Data Replication
- Fault Tolerance
- Hot backup
- Catastrophic failure
- Performance
- Parallelism
- Decreased reliance on network
- Correctness criterion: Replication invisible
- One-copy serializability (1SR)
28. Data Replication: How?
- Goal: Ensure one-copy serializability
- Write-all solution: All copies identical
- Write goes to every site
- Read from any site
- Standard single-copy concurrency control
- Guarantees 1SR
- Single-copy concurrency control gives serializable execution
- Equivalent to serial execution where all writes happen in one transaction
29. Write All Approach
[Figure: a Writer and a Reader accessing replicated copies of a data item; the copies hold the values 5 and 3.]
30. Problem: Site Failure
- Failure causes write to block
- Must maintain locks
- Clogs up entire system
- Is this fault tolerance?
- What about write all available?
- T0: w0[xA] w0[xB] w0[yC] c0
- B fails
- T1: r1[yC] w1[xA] c1
- B recovers
- T2: r2[xB] w2[yC] c2
- What is the serial equivalent order?
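One way to answer the question is to try every serial order. The sketch below replays the example history on separate physical copies, then replays each serial order of T0, T1, T2 on a single logical copy, and compares who reads from whom. The helper `reads_from` and the tuple encoding of operations are assumptions for the example, and only reads-from relationships are compared, which is enough to rule out equivalence here.

```python
from itertools import permutations

# Physical history from the example: (transaction, op, item, copy).
history = [
    ("T0", "w", "x", "A"), ("T0", "w", "x", "B"), ("T0", "w", "y", "C"),
    # B fails here, so T1 writes only the available copy of x.
    ("T1", "r", "y", "C"), ("T1", "w", "x", "A"),
    # B recovers and T2 reads the copy of x that missed T1's write.
    ("T2", "r", "x", "B"), ("T2", "w", "y", "C"),
]


def reads_from(ops, key):
    """Map (reader, item) to the transaction whose write it read."""
    last_writer, result = {}, {}
    for txn, op, item, copy in ops:
        slot = key(item, copy)
        if op == "w":
            last_writer[slot] = txn
        else:
            result[(txn, item)] = last_writer.get(slot)
    return result


# Replicated execution: each physical copy has its own last writer.
replicated = reads_from(history, key=lambda item, copy: (item, copy))

# One-copy serial executions: run whole transactions in some order and
# treat all copies of an item as a single logical item.
equivalent_orders = []
for order in permutations(["T0", "T1", "T2"]):
    serial = [op for t in order for op in history if op[0] == t]
    if reads_from(serial, key=lambda item, copy: item) == replicated:
        equivalent_orders.append(order)

print(equivalent_orders)
```

No order survives: T1 read y before T2 wrote it, so T1 must precede T2, while T2 read the x written by T0 rather than T1, so T2 must precede T1.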
31. Write All Available Fails: Even if no recovery!
32. Solutions
- Validate availability on commit
- Check if any failed writes now available
- Check that all sites read or written still available
- Enforces serializability for site failures
- Doesn't work with communication failures!
33. Formalisms for Relaxed Consistency
- Goal: Relaxed consistency constraints
- Meet application needs
- Outperform true transparent replication
- How do we ensure constraints meet needs?
- Formalisms to describe application needs
- Methods to prove constraints adequate
34. Quasi-Copies (Alonso, Barbará, Garcia-Molina 90)
- Data Caching
- Each site keeps copy of data likely to be used locally
- Propagation cost of writes high
- User-Defined Cache
- Controlled Divergence
- Weak consistency constraints
- Bounds on the differences between copies
- User defines constraints
35. Assumptions
- Read-only copies
- Updates sent to master copy
- E.g., ORACLE Materialized View
- User Specified Coherency
- Strict limits
- Hints
- Example: Stock Purchase
- Place order based on delayed price
- Limit order to ensure price paid okay
36. Selection Conditions
- Identification clause
- Select/Project Query
- Modifier Clause
- Add / drop from cache
- Compulsory or advisory cache
- Static / Dynamic: As new objects meet the identification clause, are they cached?
- Triggering delay on dynamic
37. Coherency Conditions
- Default (always enforced): Value was true once
- Delay W(x, α): Max time lag
- Version V(x): Number of updates
- Periodic P(x): Time for refresh
- Arithmetic A(x): Bounded Difference
- Combine conditions with logical operators
- Multi-object conditions
- Consistency conditions on a group
- Order of application in a group
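The delay, version and arithmetic conditions can be read as simple predicates over a cached copy and the master. The sketch below is one possible reading, not the formal definitions from the paper: the `Copy` record, the function names and the use of the refresh time as a stand-in for time lag are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    value: float        # cached value
    version: int        # master version the copy was taken from
    fetched_at: float   # time (seconds) the copy was last refreshed


def delay_ok(copy, now, max_lag):
    """Delay condition: the copy is at most max_lag seconds old."""
    return now - copy.fetched_at <= max_lag


def version_ok(copy, master_version, max_updates):
    """Version condition: at most max_updates updates were missed."""
    return master_version - copy.version <= max_updates


def arithmetic_ok(copy, master_value, max_diff):
    """Arithmetic condition: values differ by at most max_diff."""
    return abs(master_value - copy.value) <= max_diff


# Hypothetical cached stock quote checked against the master.
quote = Copy(value=100.0, version=7, fetched_at=0.0)
fresh_enough = delay_ok(quote, now=30.0, max_lag=60.0)
few_updates = version_ok(quote, master_version=9, max_updates=2)
close_enough = arithmetic_ok(quote, master_value=103.5, max_diff=2.0)
usable = fresh_enough and close_enough   # conditions combined with "and"
```

In the stock example the combined condition fails because the price has drifted too far, even though the copy is recent.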
38. CS 603 Review
39. Remote Operation Mechanisms
- Client-Server Model
- Remote Procedure Call
- Problem: Remote Site must already know what we want to do!
- Process consists of
- Code
- Resources (files, devices, etc.)
- Execution (data, stack, registers, etc.)
- Fork copies everything
- Is this needed?
- Solution: Copy part of the process
40. So where are we?
- Models for Remote Processing
- Server: Request documented service
- RPC: Request execution of existing procedure
- What if operation we want isn't offered remotely?
- Solution: Agents / Code Migration
41. Types of Code Migration
[Figure from Andrew Tanenbaum, Distributed Operating Systems, 1995.]
42. Types of Code Migration
- Weak mobility: Copy only code
- Program starts from initial state
- Example: Java applets
- Strong mobility: Copy code and execution
- Resume execution where it stopped
- But doesn't necessarily have same resources (less than fork)
- Example: D'Agents (later), cluster computing (Condor, LSF)
43. Types of Code Migration
- Sender Initiated
- Receiver Initiated
- Examples
- Java Applets
- Receiver Initiated
- Cluster computing
- Sender Initiated?
- Central manager initiated?
44. Types of Code Migration
- Where executed?
- In target process
- In new process
- Strong Mobility: Move vs. Copy
- Migrate process: Ceases at originating site
- Clone process: Two copies in parallel
45. Resource Binding
Columns give the resource-to-machine binding; rows give the process-to-resource binding.

| Process-to-resource binding | Unattached     | Fastened         | Fixed            |
|-----------------------------|----------------|------------------|------------------|
| By identifier               | Move           | Global Reference | Global Reference |
| By value                    | Copy Value     | Global Reference | Global Reference |
| By type                     | Rebind Locally | Rebind Locally   | Rebind Locally   |
46. The Hard Part: Resources
- Migrated process still needs resources
- Options to Connect to a Resource (Fuggetta et al., 1998)
- Binding by identifier (e.g., URL)
- Attach to the same resource
- Binding by value (e.g., standard libraries)
- Bind to equivalent resource
- Binding by type (e.g., local printer)
- Bind to resource with same function
47. The Hard Part: Resources
- Alternative: Move the Resource
- Unattached resources (e.g., data files)
- Relatively easy to move
- Fastened resource (e.g., database)
- Expensive to move
- Fixed resource (e.g., communications endpoint)
- Can't be moved
48. DCOM: What is it?
- Start with COM: Component Object Model
- Language-independent object interface
- Add interprocess communication
49. DCOM: Distributed COM
- Looks like COM to the client
- Built on DCE RPC
- Extends to support full COM functionality
50. DCOM Architecture
51. Locating Objects: Activation
- CoCreateInstance(Ex)(<CLSID>)
- Interface pointer to uninitialized instance
- Same as COM
- CoGetInstanceFromFile, FromStorage
- Create new instance
- CoGetClassObject(<CLSID>)
- Factory object that creates objects of <CLSID>
- CoGetClassObjectFromURL
- Downloads necessary code from URL and instantiates
- Can take server name as parameter
- Or default to server specified in DCOM configuration on client machine
- HKEY_CLASSES_ROOT\APPID\<appid-guid>
- "RemoteServerName"="<DNS name>"
- Also store information in ActiveDirectory
52. DCOM vs. CORBA
- CORBA
- Single interface name
- Multiple inheritance
- Dynamic Invocation Interface
- C++-style Exception Handling
- Explicit and Implicit reference counts
- Implemented by ORB with replaceable services
- DCOM
- Distinction between Class and Instance Identifier
- Implement multiple interfaces
- Type libraries for on-demand marshaling
- 32 Bit Error Code
- Explicit reference count only
- Implemented by many independent services
53. What is .NET?
- Language for distributed computation
- C#, VB.NET, JScript
- Protocols
- SOAP, HTTP
- Run-time environment
- Common Language Runtime (CLR)
- ActiveDirectory
- Web Servers (ASP.NET)
54. COM/DCOM → .NET
- DCOM
- IDL
- Name, Monikers
- Registry / ActiveDirectory
- C++, Visual Basic
- DCE RPC
- DCOM Network protocol (based on DCE standards)
- .NET
- Web Services Description Language (WSDL)
- DISCO (URI grammar)
- Universal Description Discovery and Integration (UDDI)
- C#, VB.NET
- SOAP
- HTTP (presumed ubiquitous), SMTP (!?)
55. How .NET works
- Query UDDI directory to get service location
- Query service to get WSDL (interface specification)
- Build call (XML) based on WSDL spec.
- Make call using SOAP
- Parse XML results based on WSDL spec.
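The build and parse steps can be sketched with the Python standard library. This covers only the XML handling: the UDDI lookup, the WSDL retrieval and the HTTP transport are omitted, and the service namespace `urn:example:quotes`, the `GetQuote` operation and its `symbol` parameter are made up for the example. The envelope namespace is the SOAP 1.1 one.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_call(service_ns, operation, params):
    """Build a SOAP 1.1 request envelope for one operation."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_element(xml_text, service_ns, tag):
    """Return the text of the first matching element, or None."""
    root = ET.fromstring(xml_text)
    node = root.find(f".//{{{service_ns}}}{tag}")
    return None if node is None else node.text


# Hypothetical service namespace, operation and parameter.
request = build_call("urn:example:quotes", "GetQuote", {"symbol": "XYZ"})
symbol = parse_element(request, "urn:example:quotes", "symbol")
```

In a real client the element names and types would come from the WSDL rather than being hard-coded, which is what lets the caller and the service be written in different languages.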
56. Jini: Java Middleware
- Tools to construct federation
- Multiple devices, each with Java Virtual Machine
- Multiple services
- Uses (doesn't replace) Java RMI
- Adds infrastructure to support distribution
- Registration
- Lookup
- Security
57. Service
- Basic unit of Jini system
- Members provide services
- Federate to share access to services
- Services combined to accomplish tasks
- Communicate using service protocol
- Initial set defined
- Add more on the fly
58. Infrastructure: Key Components
- RMI
- Basic communication model
- Distributed Security System
- Integrated with RMI
- Extends JVM security model
- Discovery/join protocol
- How to register and advertise services
- Lookup services
- Returns object implementing service (really a local proxy)
59. Programming Model
- Lookup
- Leasing
- Extends Java reference with notion of time
- Events
- Extends JavaBeans event model
- Adds third-party transfer, delivery and timeliness guarantees, possibility of delay
- Transaction Interfaces
60. Jini Component Categories
- Infrastructure: Base features
- Programming Model: How you use them
- Services: What you build
- Java / Jini Comparison
61. Failure Models
- Failure: System doesn't give desired behavior
- Component-level failure (can compensate)
- System-level failure (incorrect result)
- Fault: Cause of failure (component-level)
- Transient: Not repeatable
- Intermittent: Repeats, but (apparently) independent of system operations
- Permanent: Exists until component repaired
- Failure Model: How the system behaves when it doesn't behave properly
62. Failure Model (Flaviu Cristian, 1991)
- Dependency
- Proper operation of Database depends on proper operation of processor, disk
- Failure Classification
- Type of response to failure
- Failure semantics
- State of system after given class of failure
- Failure masking
- High-level operation succeeds even if it depends on failed services
63. Failure Classification
- Correct
- In response to inputs, behaves in a manner consistent with the service specification
- Omission Failure
- Doesn't respond to input
- Crash: After first omission failure, subsequent requests result in omission failure
- Timing failure (early, late)
- Correct response, but outside required time window
- Response failure
- Value: Wrong output for inputs
- State Transition: Server ends in wrong state
64. Crash Failure Types (based on recovery behavior)
- Amnesia
- Server recovers to predefined state independent of operations before crash
- Partial amnesia
- Some part of state is as before crash, rest to predefined state
- Pause
- Recovers to state before omission failure
- Halting
- Never restarts
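65. Failure Semantics
[Figure: client u sends service request sr over link l to server r and receives response f(sr); a resent request sr' receives response f(sr').]
- Max delay on link: d; max service time: p
- Should get response in 2d + p
- Assume omission failure only
- If no response in 2d + p, resend request
- What if performance failure possible?
- Must distinguish between response to sr and response to sr'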
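The timeout arithmetic and the resend rule can be sketched as follows. The `send` callback is a hypothetical transport hook, and the cap of three attempts is an assumption added so the loop terminates.

```python
def response_deadline(d, p):
    """Worst-case wait for a reply: request delay + service time + reply delay."""
    return 2 * d + p


def call_with_resend(send, d, p, max_attempts=3):
    """Resend the request until a reply arrives within 2d + p.

    `send(attempt, timeout)` is a hypothetical transport hook that returns
    the reply, or None when nothing arrives before the timeout.
    """
    timeout = response_deadline(d, p)
    for attempt in range(1, max_attempts + 1):
        reply = send(attempt, timeout)
        if reply is not None:
            # Return the attempt number alongside the reply.
            return attempt, reply
    return None


def lossy_link(attempt, timeout):
    """Test double: the first request is lost, the second is answered."""
    return None if attempt == 1 else "f(sr)"


deadline = response_deadline(d=2, p=5)
result = call_with_resend(lossy_link, d=2, p=5)
```

Under omission failures alone, silence after 2d + p means the request or reply was lost. If late replies are possible, each request would also need an identifier so that a delayed reply to the first attempt is not mistaken for the reply to the second.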
66. Failure Semantics
- Specification for service must include
- Failure-free (normal) semantics
- Failure semantics (likely failure behaviors)
- Multiple semantics
- Combine to give (weaker) semantics
- Arbitrary failure semantics: Weakest possible
- Choice of failure semantics
- Is class of failure likely?
- Probability of type of failure
- What is the cost of failure?
- Catastrophic?
67. Failure Masking
- Hierarchical failure masking
- Dependency: Higher level gets (at best) failure semantics of lower level
- Can compensate for lower level failure to improve this
- Group Failure Masking
- Redundant servers
- Allows failure semantics of group to be higher than individuals
- k-fault tolerant
- Group can mask k concurrent group member failures from client
68. Fault Tolerance
- A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which
- At any configuration where P holds, T also holds (i.e., $P \Rightarrow T$)
- Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F)
- Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds
- If a program A tolerates faults from a fault class F for invariant P, we say that A is F-tolerant for P.
69. Forms of Fault Tolerance

|          | Live       | Not live  |
|----------|------------|-----------|
| Safe     | Masking    | Fail safe |
| Not safe | Nonmasking | None      |

- For each entry, determine
- F: Fault class handled
- T: Set of states that can be reached
70. Reliable Multicast
- Classes
- Sender-initiated: Acknowledge all packets
- Scales poorly in normal operation
- Receiver-initiated: Request missing packets
- Sender doesn't need receiver list
- Scales poorly on failure (cascading failure?)
- Tree-based, Ring-based protocols
71. Tree-based Protocols
- Organize multicast group into tree
- Children acknowledge to parent
- Parent acknowledges when all children have acknowledged
- Advantages
- Sender doesn't need to know full group
- Solves unbounded memory
- Scalable
- Disadvantages
- Rate paced by slowest acknowledgement path in tree
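The acknowledgement rule is naturally recursive, as the sketch below shows. The tree encoding (a dict from node to children) and the function name are assumptions for the example, and the sketch additionally assumes that a node must hold the packet itself before it acknowledges.

```python
def subtree_acked(node, children, received):
    """Return True when `node` can acknowledge to its parent.

    `children` maps a node to its child nodes; `received` is the set of
    nodes that hold the packet. A node acknowledges only after it has the
    packet and every child has acknowledged.
    """
    return node in received and all(
        subtree_acked(child, children, received)
        for child in children.get(node, [])
    )


# Hypothetical multicast tree: root -> a, b and a -> c, d.
tree = {"root": ["a", "b"], "a": ["c", "d"]}
received = {"root", "a", "b", "c"}           # d is still missing the packet
root_done = subtree_acked("root", tree, received)
b_done = subtree_acked("b", tree, received)
```

Each node tracks only its own children, which is why the sender does not need the full group. The recursion also makes the disadvantage visible: the root's acknowledgement waits on the slowest path.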
72. Ring-based Protocols
- Idea: Token site responsible for retransmit
- Sender multicasts
- Token site multicasts ACK
- Receivers request retransmit from token site if ACK doesn't match what they have
- Can only accept token if you've received everything acknowledged
- Keep packets since last time you had token
- Advantages
- Space
- Low load on sender
73. Disaster Recovery
- Problem: complete failure at single site
- Must have multiple sites
- Thus a distributed problem
- Two examples
- Distributed Storage: Palladio
- Think wide-area RAID
- Distributed Transactions: Epoch algorithm
74. Epoch Algorithm (Garcia-Molina, Polyzois, and Hagmann 1990)
- 1-Safe backup
- No performance penalty
- Multiple transaction streams
- Use distribution to improve performance
- Multiple Logs
- Avoid single bottleneck
75. Algorithm Overview
- Idea: Transactions that can be committed together grouped into epochs
- Primaries write marker in log
- Must agree when safe to write marker
- Keep track of current epoch number
- Master broadcasts when to end epoch
- Backups commit epoch when all backups have received marker
76. Correctness Criteria
- Atomicity: If any writes of a transaction appear at backup, all must appear
- If $\exists\, W(T_x, d)$ at backup, then $\forall\, W(T_x, d)$: $W(T_x, d)$ exists at backup
- Consistency: If $T_i \rightarrow T_j$ at primary, then
- Local: $T_j$ installed at backup $\Rightarrow$ $T_i$ installed at backup
- Mutual: If $W(T_i, d)$ and $W(T_j, d)$, then $W(T_i, d) \rightarrow W(T_j, d)$
- Minimum Divergence: If $T_j$ is at the backup and does not depend on a missing transaction, then it should be installed at the backup
77. Single-Mark Algorithm
- Problem: Is it locally safe to mark when broadcast received?
- Might be in the middle of a transaction
- Solution: Share epoch at commit
- Prepare to commit includes local epoch number
- If received number greater than local, end epoch
- At Backup: When all sites have epoch marker $\bigcirc_n$, commit transactions where
- $C(T_i) \rightarrow \bigcirc_n$
- $P(T_i) \rightarrow \bigcirc_n$, local site is not coordinator, and coordinator has $C(T_i) \rightarrow \bigcirc_n$
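One reading of the backup commit rule is sketched below, taking C(Ti) and P(Ti) to be the commit and prepare log records of Ti and the arrow to mean "appears earlier in the log than the epoch marker". That reading, the log encoding and every name in the code are assumptions for the example; the caller is assumed to have already checked that all sites hold marker n.

```python
def can_commit_at_backup(txn, site, n, logs, coordinator):
    """Decide whether `site` may commit `txn` once every site has marker n.

    `logs[site]` is that site's log as a list of entries: ("P", txn) for
    prepare, ("C", txn) for commit and ("MARK", n) for the epoch marker.
    `coordinator[txn]` names the site that coordinated txn.
    """
    marker = ("MARK", n)

    def before_marker(entry, at_site):
        log = logs[at_site]
        return (
            entry in log
            and marker in log
            and log.index(entry) < log.index(marker)
        )

    # Rule 1: the commit record precedes the marker in the local log.
    if before_marker(("C", txn), site):
        return True
    # Rule 2: prepared before the marker here, and the coordinator (a
    # different site) committed before its own marker.
    coord = coordinator[txn]
    return (
        before_marker(("P", txn), site)
        and coord != site
        and before_marker(("C", txn), coord)
    )


# Hypothetical logs for three sites and two transactions.
logs = {
    "s1": [("P", "T1"), ("C", "T1"), ("MARK", 1)],
    "s2": [("P", "T1"), ("MARK", 1), ("C", "T1")],
    "s3": [("MARK", 1), ("P", "T2"), ("C", "T2")],
}
coordinator = {"T1": "s1", "T2": "s3"}
```

Site s2 illustrates the second rule: its own commit record for T1 landed after the marker, but T1 still belongs to epoch 1 because the coordinator s1 committed it before the marker.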
78. Test Basics
- Mechanics: Open book/notes
- No electronic aids
- Two questions
- Each multi-part
- Will include scoring suggestions
- Underlying question: Do you understand the material?
- No need to regurgitate best in literature answer
- Reasonable self-designed solution fine
- Key: Do you really understand your answer?
- Can you build CORRECT distributed systems?