A Recovery-Friendly, Self-Managing Session State Store

Benjamin Ling, Emre Kiciman, Armando Fox, {bling,emrek,fox}_at_cs.stanford.edu

1
A Recovery-Friendly, Self-Managing Session State Store
  • Benjamin Ling, Emre Kiciman, Armando Fox
  • {bling,emrek,fox}_at_cs.stanford.edu

2
Outline
  • Motivation: What is Session State?
  • SSM
  • Architecture
  • Algorithm
  • Backpressure and Admission Control
  • SSM + Pinpoint
  • Self-recovering, self-monitoring
  • Benchmarks
  • Next steps: Sun Reference AppServer integration
  • Conclusion

3
Proliferation of J2EE and Web Services
  • J2EE embraced as industry standard
  • Framework
  • Simplifies development
  • Allows for portability of services
  • Standardized interfaces
  • However, difficulties remain

4
The Pain: Administration and Maintenance
  • Administration is difficult and costly
  • Database admins cost $200K/yr a head
  • Development efficiency negatively impacted
  • Failure/Recovery is costly
  • Recovery slow, especially site outages
  • Data loss on crashes
  • Users adversely affected

5
Not All State is Created Equal
  • Various types of state in J2EE
  • User profile state
  • Persistent shared state
  • Transaction history state
  • But usually stored in the same place
  • Stored in DB or FS
  • Focus on particular class
  • Exploit its properties
  • Simplify Administration and Maintenance

6
Example of Session State
7
Properties of Session State
  • Subcategory of session state
  • Single-user, serial access, semi-persistent data
  • Examples: temporary application data, application workflow
  • Example of usage (e.g. J2EE)

8
Goal
  • Build a session state store that is
  • Failure-friendly
  • Does not lose data on crash
  • Degrades gracefully
  • Recovery-friendly
  • Recovers fast
  • Self-Managing

10
Session State Manager (SSM)
Redundant, in-memory hash table distributed across nodes
[Figure: node with RAM and Network Interface]
  • Algorithm: Redundancy similar to quorums
  • Write to many random nodes, wait for few (avoid performance coupling)
  • Read one
11
Write example: Write to Many, Wait for Few
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
[Figure: Browser and Bricks 1-5]
15
Write example: Write to Many, Wait for Few
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
Cookie holds metadata
[Figure: Browser and Bricks 1-5; the Browser's cookie is labeled 1, 4]
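The write path on these slides can be sketched as follows. This is a minimal illustration, not the SSM implementation: the `Brick` class, the `write_many_wait_few` function, and the cookie layout are hypothetical names chosen for the example, and bricks are simulated as in-process objects rather than networked nodes. A real stub would also bound the wait with a timeout, which is omitted here.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


class Brick:
    """Hypothetical in-memory brick: one hash table per node."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}

    def write(self, key, value):
        self.table[key] = value
        return self.brick_id  # the return value acts as the acknowledgement


def write_many_wait_few(bricks, key, value, w=4, wq=2, rng=random):
    """Send the write to w random bricks; return once wq have acknowledged."""
    chosen = rng.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.write, key, value) for b in chosen]
    acked = []
    for fut in as_completed(futures):
        try:
            acked.append(fut.result())
        except Exception:
            continue  # a failed brick simply does not count toward wq
        if len(acked) == wq:
            break  # do not wait for the slower bricks
    pool.shutdown(wait=False)  # stragglers finish in the background
    if len(acked) < wq:
        raise RuntimeError("fewer than wq bricks acknowledged the write")
    # Metadata the stub would place in the browser cookie (assumed layout).
    return {"key": key, "bricks": sorted(acked)}


bricks = [Brick(i) for i in range(1, 6)]
cookie = write_many_wait_few(bricks, "session-42", {"cart": ["book"]})
print(cookie)
```

Returning after the first WQ acknowledgements is what keeps one slow brick from delaying the request.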
16
Read example
Try to read from Bricks 1, 4
[Figure: Browser with cookie (1, 4) and Bricks 1-5]
17
Read example
[Figure: Browser with cookie (1, 4) and Bricks 1-5]
18
Read example
Brick 1 crashes
[Figure: Browser and Bricks 1-5]
19
Read example
[Figure: Browser and Bricks 2-5; Brick 1 is gone]
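The read sequence above, including the crash of Brick 1, can be sketched like this. It is an illustration under assumptions: `Brick`, `BrickDown`, `read_one`, and the cookie layout are hypothetical, and a crashed brick is modeled by a simple flag instead of a network timeout.

```python
class BrickDown(Exception):
    """Raised by the simulated brick when it has crashed."""


class Brick:
    """Hypothetical in-memory brick with a flag that simulates a crash."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise BrickDown(self.brick_id)
        return self.table[key]


def read_one(bricks_by_id, cookie):
    """Try the bricks named in the cookie in turn; the first answer wins."""
    for brick_id in cookie["bricks"]:
        try:
            return bricks_by_id[brick_id].read(cookie["key"])
        except (BrickDown, KeyError):
            continue  # crashed or restarted-empty brick: try the next copy
    raise LookupError("no brick named in the cookie could serve the read")


bricks = {i: Brick(i) for i in range(1, 6)}
cookie = {"key": "session-42", "bricks": [1, 4]}
for i in cookie["bricks"]:
    bricks[i].table["session-42"] = "cart=book"

bricks[1].alive = False          # Brick 1 crashes
value = read_one(bricks, cookie)  # served by Brick 4
print(value)
```

Because the cookie names every brick that acknowledged the write, the reader needs no directory lookup to find the surviving copy.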
20
SSM Failure and Recovery
  • Failure of single node
  • No data loss; WQ-1 copies remain
  • State is available for R/W during failure
  • Recovery
  • Restart: no recovery
  • No special case recovery code
  • State is available for R/W during brick restart
  • Session state is self-recovering
  • Users' access patterns cause data to be rewritten

21
Backpressure and Admission Control
[Figure: Bricks 1-5; labels: Heavy flow to Brick 3, Drop Requests]
22
Backpressure and Admission Control
[Figure: Bricks 1-5; labels: Drop Requests, Reduce Sending, Reject requests]
24
Recovery Philosophy
[Figure: axes RECOVERY COST (Cheap, Expensive) and DETECTION ACCURACY (Accurate, Lax, Aggressive)]
25
Failure detection and Recovery
SSM Failure masked
Instant recovery
26
False Positives
Normal Operation
False positive triggered
Instant recovery
27
Statistical Monitoring
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
[Figure: Bricks 1-5]
28
Statistical Monitoring
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
[Figure: Bricks 1-5; label: REBOOT]
29
Statistical Monitoring
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
[Figure: Bricks 1-5]
30
SSM Monitoring
  • N replicated bricks handle read/write requests
  • Cannot do structural anomaly detection!
  • Alternative features (performance, mem usage, etc.)
  • Activity statistics: How often did a brick do something?
  • Msgs received/sec, dropped/sec, etc.
  • Same across all peers, assuming balanced workload
  • Use anomalies as likely failures
  • State statistics: Current state of system
  • Memory usage, queue length, etc.
  • Similar pattern across peers, but may not be in phase
  • Look for patterns in time-series; differences in patterns indicate failure at a node.
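One way to act on the activity statistics is to compare each brick with its peers. The slides do not say which test SSM uses, so the sketch below substitutes a simple median-absolute-deviation check; the function name `anomalous_bricks`, the threshold, and the sample numbers are all assumptions made for illustration.

```python
from statistics import median


def anomalous_bricks(stats, threshold=3.0):
    """Flag bricks whose statistic lies far from the peer median.

    stats maps brick id -> one activity statistic (e.g. dropped msgs/sec).
    Distance is measured in units of the median absolute deviation (MAD).
    """
    values = list(stats.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    scale = mad if mad > 0 else 1e-9  # avoid dividing by zero when peers agree
    return sorted(b for b, v in stats.items()
                  if abs(v - center) / scale > threshold)


# Made-up sample: dropped messages per second on five bricks.
dropped_per_sec = {1: 2, 2: 3, 3: 40, 4: 2, 5: 3}
suspects = anomalous_bricks(dropped_per_sec)
print(suspects)
```

The check relies on the slide's assumption of a balanced workload: peers are each other's baseline, so no absolute threshold per statistic is required.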

31
Surprising Patterns in Time-Series
  • 1. Discretize time-series into string. [Keogh]
  • 0.2, 0.3, 0.4, 0.6, 0.8, 0.2 -> aaabba
  • 2. Calculate the frequencies of short substrings in the string.
  • aa occurs twice; ab, bb, ba occur once.
  • 3. Compare frequencies to normal; look for substrings that occur much less or much more than normal.
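The three steps can be sketched in a few lines. This is a simplified illustration, not the cited algorithm: the two-symbol alphabet with a cut at 0.5 is inferred from the slide's example, and the helper names, the add-one smoothing, and the ratio of 3 are assumptions.

```python
from collections import Counter


def discretize(series, threshold=0.5):
    """Step 1: map each sample to 'a' (below threshold) or 'b' (otherwise)."""
    return "".join("a" if x < threshold else "b" for x in series)


def substring_counts(symbols, n=2):
    """Step 2: count every length-n substring of the symbol string."""
    return Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))


def surprising(observed, normal, ratio=3.0):
    """Step 3: substrings much more or much less frequent than normal."""
    flagged = []
    for sub in sorted(set(observed) | set(normal)):
        o = observed.get(sub, 0) + 1  # add-one smoothing avoids division by zero
        e = normal.get(sub, 0) + 1
        if o / e >= ratio or e / o >= ratio:
            flagged.append(sub)
    return flagged


symbols = discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2])
normal = substring_counts(symbols)
# Made-up faulty series: the statistic stays high the whole time.
observed = substring_counts(discretize([0.9] * 6))
print(symbols, dict(normal), surprising(observed, normal))
```

Counting substrings rather than comparing samples point by point is what lets the check tolerate peers whose patterns are similar but out of phase.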

33
Microbenchmarks
  • UC Berkeley Millennium Cluster
  • Six bricks running
  • Candidate Write Set: 3, Write quota: 2
  • Candidate Read Set: 2
  • State Size: 8K

34
Induced Fault
35
Memory fault
36
Network Fault: 70% packet loss
37
Performance Fault
38
Macrobenchmark
  • TellMe's Email-By-Phone Application
  • Session state stored in memory
  • Email header information
  • Index information
  • Alter application to store session state using:
  • Disk
  • SSM

39
Macrobenchmark
Throughput preserved compared to disk
25% throughput degradation compared to in-memory
40
Future Work
  • Integrate with Sun's reference Application Server
  • Enterprise benchmarks
  • Statistical Anomaly Detection
  • Too many magic numbers
  • Integrated ROC-J2EE application server

41
Conclusion
  • SSM: A Recovery-Friendly, Self-Managing Session State Store
  • Benjamin Ling, bling_at_cs.stanford.edu
  • http://swig.stanford.edu/

42
Existing solutions
  • File System and Databases
  • Poor failure behavior
  • Lose data (FS)
  • Slow recovery (Both)
  • Difficult to administer (DB)
  • Difficult to tune (both)
  • In-memory replication using primary/secondary
  • Performance coupling
  • Poor failover (uneven load balancing)

43
Other implementation details
  • Garbage collection
  • Generational hash table
  • Hash table of hash tables
  • Each hash table has an associated time range
  • When time has passed, GC that table
  • No reference counting, scanning, etc.
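A generational hash table along these lines might look like the sketch below. The class name, the parameters, and the choice to bucket entries by write time are assumptions; the slides only state that each inner table covers a time range and is dropped whole once that range has passed.

```python
import time


class GenerationalTable:
    """Hash table of hash tables; each inner table covers one time range."""

    def __init__(self, generation_seconds=60.0, lifetime_generations=2,
                 clock=time.monotonic):
        self.generation_seconds = generation_seconds
        self.lifetime = lifetime_generations
        self.clock = clock          # injectable so the example can fake time
        self.generations = {}       # generation index -> {key: value}

    def _current(self):
        return int(self.clock() // self.generation_seconds)

    def put(self, key, value):
        # Assumption: an entry lives in the generation in which it was written.
        self.generations.setdefault(self._current(), {})[key] = value

    def get(self, key):
        for gen in sorted(self.generations, reverse=True):  # newest first
            if key in self.generations[gen]:
                return self.generations[gen][key]
        raise KeyError(key)

    def collect(self):
        """Drop whole expired generations: no per-entry scan or ref counts."""
        cutoff = self._current() - self.lifetime
        for gen in [g for g in self.generations if g <= cutoff]:
            del self.generations[gen]


now = [0.0]
table = GenerationalTable(clock=lambda: now[0])
table.put("k1", "old")   # written in generation 0
now[0] = 70.0
table.put("k2", "new")   # written in generation 1
now[0] = 130.0           # generation 2: generation 0 has expired
table.collect()
print(sorted(table.generations))
```

Garbage collection is a dictionary deletion per expired generation, so its cost does not depend on how many sessions that generation held.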

44
SSM Self-Managing
  • Adaptive
  • Stub maintains count of maximum allowable in-flight requests to each brick
  • Additive increase on successful request
  • Multiplicative decrease on timeout
  • Stubs discover capacity of each brick
  • -> Self-Tuning
  • Admission control
  • Stubs say no if insufficient bricks
  • Propagate backpressure from bricks to clients
  • Turn users away under overload
  • -> Self-Protecting
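The stub behavior listed on this slide can be sketched as an additive-increase, multiplicative-decrease window per brick plus an admission check. `BrickWindow`, `admit_write`, and every constant below are hypothetical; the slides give the policy but not the parameters.

```python
class BrickWindow:
    """Per-brick cap on in-flight requests, adjusted AIMD-style by the stub."""

    def __init__(self, initial=4.0, increase=1.0, decrease=0.5, minimum=1.0):
        self.limit = initial
        self.increase = increase
        self.decrease = decrease
        self.minimum = minimum
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= int(self.limit):
            return False  # brick is at its discovered capacity
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

    def on_success(self):
        self.release()
        self.limit += self.increase  # additive increase

    def on_timeout(self):
        self.release()
        self.limit = max(self.minimum, self.limit * self.decrease)  # multiplicative decrease


def admit_write(windows, w):
    """Reserve a slot on w bricks, or return None to turn the request away."""
    reserved = []
    for brick_id, window in windows.items():
        if len(reserved) == w:
            break
        if window.try_acquire():
            reserved.append(brick_id)
    if len(reserved) < w:
        for brick_id in reserved:
            windows[brick_id].release()  # undo the partial reservation
        return None
    return reserved


window = BrickWindow(initial=4.0)
window.try_acquire()
window.on_success()   # limit grows to 5.0
window.try_acquire()
window.on_timeout()   # limit shrinks to 2.5

windows = {i: BrickWindow(initial=1.0) for i in (1, 2, 3)}
first = admit_write(windows, w=2)    # reserves two bricks
second = admit_write(windows, w=2)   # only one brick has room: rejected
print(window.limit, first, second)
```

Rejecting at the stub when too few bricks have spare capacity is how overload on the bricks turns into users being turned away rather than queues growing.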