Fault Tolerant SDSM - PowerPoint PPT Presentation

Provided by: camarsK

Transcript and Presenter's Notes

Title: Fault Tolerant SDSM


1
Fault Tolerant SDSM
  • Checkpointing
  • Logging
  • Data replication
  • 2001.12.27
  • NRL. Soyeon Park

2
Contents
  • Background
  • History
  • Coordinated checkpointing (1990-94)
  • Communication-induced checkpointing (1988-93)
  • Shared-read logging (1993)
  • Reduced overhead logging LRC (1995)
  • Coherence-centric logging HLRC (1999)
  • VMMC GeNIMA data replication (2001)
  • Issues
  • Schedule

3
Background (1/2)
  • Checkpointing
  • Saves processor state, shared pages, page table,
    DSM protocol state and log
  • When to checkpoint?
  • Logging
  • Saves information about DSM to volatile memory
  • Which information?
  • tolerant to multiple-node failures
  • receiver-based logging
  • When to flush the log to stable storage?
  • tolerant to single-node failures
  • sender-based logging
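
The checkpoint contents and the volatile log listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the mechanism of any particular SDSM: `NodeState`, `StableStorage`, and the three helper functions are hypothetical names, and "stable storage" is simulated by an in-memory object that is assumed to survive a crash.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical per-node state; the fields mirror the slide's list."""
    processor_state: dict = field(default_factory=dict)
    shared_pages: dict = field(default_factory=dict)   # page id -> bytes
    page_table: dict = field(default_factory=dict)     # page id -> access mode
    protocol_state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)            # volatile log


class StableStorage:
    """Stands in for a disk: assumed to survive a node crash."""
    def __init__(self):
        self.checkpoint = None
        self.log = []


def take_checkpoint(state, storage):
    # Save the whole node state, including the current volatile log.
    storage.checkpoint = copy.deepcopy(state)
    # Entries saved with the checkpoint no longer need to be kept separately.
    state.log.clear()
    storage.log.clear()


def flush_log(state, storage):
    # Move volatile log entries to stable storage.
    storage.log.extend(state.log)
    state.log.clear()


def recover(storage):
    # Restart from the last checkpoint and return the flushed entries to replay.
    return copy.deepcopy(storage.checkpoint), list(storage.log)
```

In this sketch, entries still in the volatile log at the time of a crash are lost, which is why the question of when to flush matters.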

4
Background (2/2)
  • Coordinated (consistent) checkpointing
  • Problems
  • high overhead of synchronization
  • not appropriate for large cluster
  • Schemes
  • reducing synchronization overhead (using barriers)
  • reducing the number of nodes that must synchronize
    (dependent processes only)
  • Independent checkpointing
  • Problems
  • domino effect, multiple checkpointing
  • Schemes
  • communication-induced checkpointing
  • logging inter-process communication
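
The domino effect named above can be illustrated with a small rollback-propagation sketch. It assumes a simplified model that is not taken from the slides: every event carries a global integer timestamp, each process has a checkpoint at time 0, and a message is an orphan when its receive is kept but its send is rolled back. The function name `recovery_line` and the data layout are hypothetical.

```python
def recovery_line(checkpoints, messages):
    """Find a consistent set of checkpoints by rollback propagation.

    checkpoints: {process: sorted list of checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    """
    # Start from the most recent checkpoint of every process.
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            # Orphan message: the send is undone but the receive is kept.
            if t_send > line[sender] and t_recv <= line[receiver]:
                # Roll the receiver back to its latest checkpoint before the receive.
                line[receiver] = max(
                    t for t in checkpoints[receiver] if t < t_recv
                )
                changed = True
    return line


# Independent checkpoints interleaved with messages in both directions.
cps = {0: [0, 30, 70], 1: [0, 50, 90]}
msgs = [(0, 80, 1, 85), (1, 60, 0, 65), (0, 40, 1, 45), (1, 10, 0, 20)]
print(recovery_line(cps, msgs))
```

In this example each rollback orphans another message, so both processes end up back at their initial checkpoints. Logging the messages, or forcing a checkpoint at communication time, is what breaks this chain.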

5
History
  • Communication-induced checkpointing (1988-93)
  • Coordinated checkpointing (1990-94)
  • Shared-read logging (1993)
  • TreadMarks two checkpointing (1996)
  • Reduced overhead logging LRC (1995)
  • Coherence-centric logging HLRC (1999)
  • Logging for ADSM (1999)
  • Lazy Log Trimming (LLT) and Checkpoint Garbage Collection (CGC) (2000)
  • Cashmere peer logging (1996)
  • VMMC GeNIMA data replication (2001)
6
Checkpointing
  • Checkpointing without logging
  • Coordinated checkpointing
  • Communication-induced checkpointing
  • For SC
  • For RC

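  • flush modified pages to disk
  • process checkpointing

[Figure: timelines for the SC and RC cases, with reads r(x) and r(y), a write w(x), remote requests (req), a synchronization point (sync), and checkpoints (cp)]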
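
One way to read the SC and RC cases is sketched below. It is a toy model under an assumption that the slide does not spell out: under SC a node checkpoints before it hands a modified page to another node, while under RC a single checkpoint at the synchronization point covers all earlier writes. The class `CicNode` and its methods are hypothetical.

```python
class CicNode:
    """Toy node for communication-induced checkpointing (hypothetical names)."""

    def __init__(self, model):
        assert model in ("SC", "RC")
        self.model = model
        self.pages = {}        # page id -> value
        self.dirty = set()     # pages modified since the last checkpoint
        self.disk = {}         # stands in for stable storage
        self.checkpoints = 0

    def write(self, page, value):
        self.pages[page] = value
        self.dirty.add(page)

    def _checkpoint(self):
        # Flush modified pages to disk, then record a process checkpoint.
        for page in self.dirty:
            self.disk[page] = self.pages[page]
        self.dirty.clear()
        self.checkpoints += 1

    def serve_request(self, page):
        # SC (assumed rule): checkpoint before exposing a modified page.
        if self.model == "SC" and page in self.dirty:
            self._checkpoint()
        return self.pages.get(page)

    def sync(self):
        # RC (assumed rule): one checkpoint at the synchronization point.
        if self.model == "RC" and self.dirty:
            self._checkpoint()
```

With two writes to x, each followed by a remote request, the SC node in this model checkpoints twice. An RC node that performs the same writes and then synchronizes checkpoints once.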
7
Logging Schemes (1/3)
  • Logging schemes
  • Message passing systems
  • a program explicitly specifies the receipt of a
    message
  • logging the received messages
  • checkpointing for non-deterministic events
  • DSM systems
  • communication is inherently non-deterministic
  • Shared-read logging SC (1993)
  • periodic checkpointing
  • logs pages at reads

[Figure: timeline with reads r(x) and r(y), a write w(x), remote requests (req), a checkpoint (cp), and two log flushes]
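
Logging pages at reads can be sketched as follows. This is a toy model with hypothetical names (`ReadLoggingNode`, `replay_reads`). It assumes that the volatile log is flushed whenever the node serves a remote request, and that recovery re-executes the program while taking page contents from the log instead of from the network.

```python
class ReadLoggingNode:
    """Toy node that logs the contents of every page it reads."""

    def __init__(self, fetch):
        self.fetch = fetch         # callable: page id -> contents from the owner
        self.volatile_log = []     # (page id, contents) recorded at reads
        self.stable_log = []       # assumed to survive a crash in this model

    def read(self, page):
        contents = self.fetch(page)
        self.volatile_log.append((page, contents))
        return contents

    def serve_request(self, local_pages, page):
        # Assumed rule: flush before another node can observe this node's state.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()
        return local_pages[page]


def replay_reads(stable_log):
    """Return a fetch function that replays logged pages in their original order."""
    pending = list(stable_log)

    def fetch(page):
        logged_page, contents = pending.pop(0)
        assert logged_page == page, "replay diverged from the logged execution"
        return contents

    return fetch
```

During recovery, a node built with `ReadLoggingNode(replay_reads(stable_log))` sees the same page contents as in the failed run, even if the owners have since changed the pages.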
8
Logging Schemes (2/3)
  • Reduced overhead logging (1995)
  • For SC
  • periodic checkpointing
  • logs a count and fetched pages

[Figure: SC timeline with labels Init count, count, req, r(x), r(y), w(x), cp, and log flush]

  • For LRC
  • periodic checkpointing
  • logs write-notices and diffs

[Figure: LRC timeline with labels lock req, lockWN, diff req, diff, log, and flush]
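
The diffs logged in the LRC case are commonly produced by comparing a page with a pristine copy (a twin) made before the first write. A minimal sketch of that twin/diff encoding follows. The function names and the run-based diff format are assumptions chosen for illustration, not the encoding of any specific system.

```python
def make_diff(twin, page):
    """Return the runs where page differs from twin as [(offset, bytes)]."""
    assert len(twin) == len(page)
    diff, start = [], None
    for i, (old, new) in enumerate(zip(twin, page)):
        if old != new and start is None:
            start = i                                   # a modified run begins
        elif old == new and start is not None:
            diff.append((start, bytes(page[start:i])))  # the run ends
            start = None
    if start is not None:
        diff.append((start, bytes(page[start:])))       # run reaches the page end
    return diff


def apply_diff(page, diff):
    """Apply a diff to a copy of page and return the updated contents."""
    updated = bytearray(page)
    for offset, data in diff:
        updated[offset:offset + len(data)] = data
    return bytes(updated)


twin = b"aaaaaaaa"
page = b"aabbaaac"
log = [("diff", make_diff(twin, page))]   # a log entry holding the diff
print(log)
```

Logging a diff instead of a whole page keeps the log proportional to the amount of modified data, which is the reason this scheme logs write-notices and diffs instead of full pages.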
9
Logging Schemes (3/3)
  • Coherence-Centric Logging (CCL) (1999)
  • For HLRC

[Figure: message exchanges among lock owner, requestor, and home, shown once for <Message Logging> and once for <CCL>; labels include lockWN, get page, page, diff, log, log diff, log src, and flush]

  • small-sized log
  • recovery complexity
  • simple
  • logging overhead

10
Data Replication VMMC GeNIMA (2001)
  • Characteristics
  • Guarantees that no global data is lost
  • Maintains replicas using the remote deposit operation
  • Does not use stable storage
  • Does not separate nodes into application and backup systems
  • Eliminates extensive logs
  • Short independent checkpoints at release
  • Primary and secondary home
  • Page
  • consistent at the end of release
  • two-phase diff propagation
  • Lock
  • homes also record the second-to-last node

11
Issues
  • Log-based approaches
  • Latency for flushing logs to disk
  • especially when it is on the critical path
  • Additional techniques for log trimming / garbage
    collection
  • Data replication
  • No logging, no latency
  • Overhead for maintaining two homes
  • memory overhead
  • latency for consistency between them
  • Tolerance
  • Single node failure vs. multiple node failure
  • Cost
  • Failure-free overhead vs. recovery time

12
Schedule
[Figure: schedule chart spanning months 11, 12, 1, 2, 3, 4, 5, 6; items: zero-copy, analysis writing, interrupt, DSM study, F-DSM survey, design, KDSM ?? ??, coding, ..., any ideas?]