1
Optimized State Replication for Highly Available
Services
  • Group 05gr1084b
  • Erling V. Matthiesen, Jakob K. Larsen, Flemming
    Olufsen
  • June 2005

2
Motivation
  • VoIP, Video on demand, MMORPG.
  • Network centric and session based.
  • Requires high capacity and stateful servers.
  • Requires high dependability.
  • Server pool.
  • Several servers used in parallel.
  • Enables dynamic failover.
  • Requires state sharing between servers.
  • Scalability issues.
  • State sharing introduces additional overhead.

3
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

4
Background Knowledge
  • Example application.
  • SIP (Session Initiation Protocol)
  • Uses servers for initiating and maintaining
    communication between two entities
  • Example: a proxy server that maintains the state
    of ongoing transactions.

5
Background Knowledge
  • RSerPool
  • Provides platform for highly available services.
  • Architecture
  • Name server
  • Pool User
  • Pool Element

6
Background Knowledge
  • State sharing approaches
  • All-to-All.
  • Major overhead from state updates in large server
    pools.
  • Hierarchical
  • Slow propagation of state updates.
  • Flat subset structure
  • Fast propagation, small number of state updates.

7
Problem Statement
  • How to deliver dependable, consistent and
    stateful services with large server pools in a
    scalable manner?

8
Problem Statement
  • Evaluation parameters
  • Availability
  • Response time
  • Robustness
  • Reliability
  • Scalability
  • Consistency

9
Related Work
  • Massively Replicating Services in Wide Area
    Internetworks by Katia Obraczka.
  • Divides a server pool into a logical flooding
    topology. This solution minimizes the load on the
    network, but all updates still reach all members
    of the pool.
  • Fault tolerant platforms for IP-based Session
    Control Systems by Marjan Bozinovski
  • Analyzes several highly available, fault-tolerant
    platforms. Optimizes control algorithms and
    selection policies and integrates these into
    RSerPool.

10
General Solution
  • Dependability issue solution
  • Using a server pool with state sharing.
  • Scalability issue solution
  • Dividing the pool into subsets.
  • Goal
  • Find an optimum set of subsets spanning the
    whole server pool, called a partition.
  • Reduce inconsistency within a subset.

11
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

12
System Model
  • Entities
  • Name server
  • Manages servers
  • Manages subsets (Extension)
  • Selects servers for clients
  • Server selection policy (SSP) (Modified)
  • Calculates partitions (Extension)
  • Pool Element
  • Provides the service to the clients
  • Sends list of failover candidates to PU (Modified)
  • Replicates its state onto subset members
    (Extension)
  • Sends communication cost values to NS (Extension)

13
System Model
  • Entities
  • Pool User
  • Uses the service that the servers provide
  • Uses name servers as gateway to service
  • Fails over according to list of failover
    candidates
  • Subset (Extension)
  • Group of servers
  • States are replicated within the same subset
  • Partition (Extension)
  • A group of subsets spanning the whole server pool

14
System Model
15
System Model
  • Communication cost
  • The cost between servers is represented by a cost
    matrix.
  • The cost between two servers is represented by a
    single value (0-255).
  • E.g. delay, packet loss, etc.
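
As an illustration only, here is a minimal Python sketch (not the project's Matlab/C code) of how such a symmetric cost matrix could be built and queried; the uniform random costs stand in for measured delay or packet loss.

```python
import numpy as np

def random_cost_matrix(n, seed=0):
    """Symmetric n x n matrix of pairwise costs in the range 0-255.

    The uniform random values are purely illustrative; in the system model
    they would be derived from measured delay, packet loss, etc.
    """
    rng = np.random.default_rng(seed)
    upper = rng.integers(0, 256, size=(n, n))
    cost = np.triu(upper, k=1)      # keep the strict upper triangle
    return cost + cost.T            # mirror it; the diagonal stays 0

cost = random_cost_matrix(8)
print(cost[2, 5])                   # cost between server 2 and server 5
```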

16
System Model
  • Considerations
  • High availability is needed.
  • High consistency within subsets.
  • Assumptions
  • Only PE to PE cost is considered.
  • N is a multiple of k.
  • NS is stable.
  • NS communication is stable.
  • Problems
  • Dividing pool into subsets.
  • Server selection policy.
  • Measure cost of communication.
  • Triggering events for reconfiguring.

17
System Model
  • Management

[Message sequence diagram: NAME_RESOLUTION /
NAME_RESOLUTION_RESPONSE(1,2,3), SIP REGISTER,
BUSINESS_CARD(2,3,1), SESSION traffic and SUMS state
updates exchanged between the PU, the NS and PEs 1-3.]
18
Methods
  • Methods used for analysis
  • Traffic modelling by example
  • Estimation of subset size
  • Birth death chains
  • Matrix exponential distribution
  • Availability graphs
  • Quality of partition

19
Methods
  • Traffic modelling by example

20
Methods
  • Estimation of subset size
  • Subset availability
  • MTTF = 120 h
  • MTTR = 4 h
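
The availability target can be reproduced with a simple back-of-the-envelope model. The sketch below (Python, ours) assumes independent server failures and reads "subset available" as "at least one subset member up"; both are assumptions, not quotes from the slides.

```python
def server_availability(mttf_h, mttr_h):
    """Steady-state availability of a single server."""
    return mttf_h / (mttf_h + mttr_h)

def subset_availability(mttf_h, mttr_h, k):
    """Availability of a k-server subset, assuming independent failures:
    the service is up as long as at least one subset member is up."""
    a = server_availability(mttf_h, mttr_h)
    return 1.0 - (1.0 - a) ** k

# MTTF = 120 h and MTTR = 4 h as above; subset size 4 as in the conclusion.
print(subset_availability(120, 4, 4))   # ~0.999999, i.e. beyond 99.999%
```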

21
Methods
  • Birth death chains
  • Used to estimate the rate of server failures
    within a pool of servers.
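
The slide only names the method; as a rough illustration (in Python rather than the project's Matlab, and assuming each server fails at rate 1/MTTF and each failed server is repaired independently at rate 1/MTTR), the steady state of the birth-death chain over the number of failed servers gives the expected failure rate of the pool:

```python
import numpy as np

def birth_death_steady_state(n_servers, mttf_h, mttr_h):
    """Steady-state distribution over the number of failed servers."""
    lam, mu = 1.0 / mttf_h, 1.0 / mttr_h
    pi = [1.0]
    for i in range(n_servers):     # detailed balance gives pi[i+1] from pi[i]
        pi.append(pi[-1] * (n_servers - i) * lam / ((i + 1) * mu))
    pi = np.array(pi)
    return pi / pi.sum()

def pool_failure_rate(n_servers, mttf_h, mttr_h):
    """Expected rate (failures per hour) of server failures in the pool."""
    lam = 1.0 / mttf_h
    pi = birth_death_steady_state(n_servers, mttf_h, mttr_h)
    return sum(p * (n_servers - i) * lam for i, p in enumerate(pi))

print(pool_failure_rate(40, 120, 4))   # 40 servers, MTTF = 120 h, MTTR = 4 h
```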

22
Methods
  • Matrix exponential distributions
  • For 4 servers the MTTCF is 9.341 x 10^5 hours, for
    MTTF = 120 h and MTTR = 4 h
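
The MTTCF figure can be checked numerically as the mean time to absorption of the same birth-death chain, with "all subset members down" as the absorbing state. The sketch below is our check, again assuming independent failures and repairs:

```python
def mttcf(k, mttf_h, mttr_h):
    """Mean time until all k servers of a subset are down simultaneously."""
    lam, mu = 1.0 / mttf_h, 1.0 / mttr_h
    total, h = 0.0, 0.0
    for i in range(k):        # h = expected time to go from i to i+1 failures
        fail_rate = (k - i) * lam
        repair_rate = i * mu
        h = (1.0 + repair_rate * h) / fail_rate
        total += h
    return total

print(mttcf(4, 120, 4))       # ~934,000 hours, in line with 9.341 x 10^5
```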

23
Methods
  • Determination of subset topology
  • Availability graphs A(service) = 98.3%

24
Methods
  • Availability graphs (cont)
  • A(service) = 99.89%
  • Higher availability if subset is spread on
    several network devices

25
Methods
  • Quality of subsets
  • The mean cost between any server pair.
  • Quality of a partition
  • The sum of the subset qualities.
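
In code form, the quality metric could look like the sketch below (Python, ours; random_cost_matrix comes from the cost-matrix sketch earlier, and lower quality values are better):

```python
import itertools
import numpy as np

def subset_quality(cost, subset):
    """Mean cost over all pairs of servers within one subset."""
    pairs = list(itertools.combinations(subset, 2))
    return float(np.mean([cost[a, b] for a, b in pairs]))

def partition_quality(cost, partition):
    """Quality of a partition: the sum of its subset qualities."""
    return sum(subset_quality(cost, s) for s in partition)

# Example: six servers split into three subsets of two.
cost = random_cost_matrix(6)
print(partition_quality(cost, [(0, 1), (2, 3), (4, 5)]))
```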

26
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

27
Algorithms
  • Optimum Division (OD)
  • Go through all legal solutions, choose the best
    set of subsets based on quality metric.
  • Complexity very high.
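
A brute-force sketch of OD (ours, in Python), using the model assumption that N is a multiple of k and the partition_quality function from the quality sketch:

```python
import itertools

def partitions_into_k(servers, k):
    """Yield every way to split `servers` into subsets of exactly k members."""
    if not servers:
        yield []
        return
    first, rest = servers[0], servers[1:]
    for partners in itertools.combinations(rest, k - 1):
        remaining = [s for s in rest if s not in partners]
        for tail in partitions_into_k(remaining, k):
            yield [(first,) + partners] + tail

def optimum_division(cost, n, k):
    """Exhaustive search for the partition with the lowest total quality."""
    return min(partitions_into_k(list(range(n)), k),
               key=lambda p: partition_quality(cost, p))
```

Already for N = 8 and k = 4 there are 35 candidate partitions, and the count grows combinatorially with N, which is why OD is only practical for small pools.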

28
Algorithms
  • Simple division (SD)
  • Put servers into subsets sequentially
  • Very simple division, highly dependent upon
    server order.
  • Used as a baseline for comparison: optimal with
    respect to speed.
  • Complexity O(N)
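
A minimal sketch of SD (ours, in Python); it ignores the cost matrix entirely:

```python
def simple_division(n, k):
    """SD: assign servers to subsets purely in pool order, O(N).

    The result depends entirely on that ordering; no cost data is used.
    """
    servers = list(range(n))
    return [tuple(servers[i:i + k]) for i in range(0, n, k)]

print(simple_division(8, 4))   # [(0, 1, 2, 3), (4, 5, 6, 7)]
```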

29
Algorithms
  • Simplified One Pass Compare and Swap (SOPCS)
  • Put servers into the current subset sequentially.
  • For each candidate, the cost of communicating
    with the first server in the subset is compared.
  • Swaps to improve subset quality.
  • Tradeoff between scalability and quality of the
    partition.
  • Complexity O(N^2)
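
A possible Python reading of SOPCS (ours; the exact comparison and swap rule is an interpretation of the slide, not the project's code):

```python
def sopcs(cost, n, k):
    """Simplified One Pass Compare and Swap -- interpreted sketch.

    Servers are consumed in pool order; before each addition, the candidate
    is swapped with the remaining server that is cheapest to reach from the
    first member of the current subset. Roughly O(N^2) comparisons in total.
    """
    remaining = list(range(n))
    partition = []
    while remaining:
        subset = [remaining.pop(0)]          # first server anchors the subset
        while len(subset) < k:
            best = min(range(len(remaining)),
                       key=lambda j: cost[subset[0], remaining[j]])
            subset.append(remaining.pop(best))
        partition.append(tuple(subset))
    return partition
```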

30
Algorithms - SOPCS
[Diagram: SOPCS partitioning example with nine servers, labelled 1-9.]
31
Algorithms
  • One Pass Compare and Swap (OPCS)
  • Put servers into the current subset sequentially.
  • Swap servers in the current subset with servers
    that enhance the quality of the subset.
  • Comparisons are made between all combinations of
    servers.
  • Tradeoff between scalability and quality of the
    partition.
  • Complexity O(N^2)
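
A corresponding sketch of OPCS (again our interpretation): each candidate is compared against every server already in the current subset rather than only the first one.

```python
def opcs(cost, n, k):
    """One Pass Compare and Swap -- interpreted sketch.

    The remaining server with the lowest total cost to all current subset
    members is pulled in; with k fixed this stays around O(N^2) comparisons.
    """
    remaining = list(range(n))
    partition = []
    while remaining:
        subset = [remaining.pop(0)]
        while len(subset) < k:
            best = min(range(len(remaining)),
                       key=lambda j: sum(cost[m, remaining[j]] for m in subset))
            subset.append(remaining.pop(best))
        partition.append(tuple(subset))
    return partition
```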

32
Algorithms - OPCS
[Diagram: OPCS partitioning example with nine servers, labelled 1-9.]
33
Algorithms - LRTD
  • Limited Recursive Tree Division
  • Builds three trees of candidate partitions.
  • Servers put into a subset are chosen with the
    SOPCS strategy.
  • When a subset is full, the tree branches on the
    next three unused servers.
  • Complexity O(a^N)
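
One possible reading of LRTD, sketched in Python (ours; it reuses partition_quality from the quality sketch, and the branching and fill rules are an interpretation of the slide):

```python
def lrtd(cost, n, k, branch=3):
    """Limited Recursive Tree Division -- interpreted sketch.

    Whenever a subset is completed, the search branches on the next `branch`
    unused servers as the seed of the following subset; each subset is filled
    with the SOPCS rule (cheapest cost to the seed). The best complete
    partition found anywhere in the tree is returned.
    """
    best = {"partition": None, "quality": float("inf")}

    def fill_subset(seed, pool):
        subset, rest = [seed], list(pool)
        while len(subset) < k:
            j = min(range(len(rest)), key=lambda i: cost[subset[0], rest[i]])
            subset.append(rest.pop(j))
        return tuple(subset), rest

    def grow(partition, remaining):
        if not remaining:
            q = partition_quality(cost, partition)
            if q < best["quality"]:
                best.update(partition=partition, quality=q)
            return
        for seed in remaining[:branch]:        # branch on up to three seeds
            subset, left = fill_subset(seed, [s for s in remaining if s != seed])
            grow(partition + [subset], left)

    grow([], list(range(n)))
    return best["partition"]
```

With a branching factor of three the tree has on the order of 3^(N/k) leaves, so the work still grows exponentially in N, consistent with the exponential complexity on the slide, while staying far cheaper than full enumeration.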

34
Algorithms
  • Six servers, two in each subset
[Tree diagram: the three candidate-partition trees for servers 1-6, with one
level per subset (1st, 2nd and 3rd subset).]
35
Simulation - Matlab
  • Algorithms evaluated in Matlab
  • Pool sizes 8 to 40 servers
  • Steps of 4 servers
  • Runtime
  • cputime() Matlab function used
  • Quality
  • Topology sensitivity
  • 100 randomly generated topologies per step
  • Reordering sensitivity
  • 1 random topology, labels permuted 100 times per
    step
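
Roughly, the evaluation loop could look like this in Python (the project used Matlab's cputime(); time.process_time(), the random-topology generator and the fixed subset size of 4 are stand-in assumptions; sopcs and partition_quality come from the earlier sketches):

```python
import time
import numpy as np

def random_topology(n, rng):
    """Random symmetric cost matrix (values 0-255) as a stand-in topology."""
    c = np.triu(rng.integers(0, 256, size=(n, n)), k=1)
    return c + c.T

rng = np.random.default_rng(42)
for n in range(8, 41, 4):                   # pool sizes 8..40, steps of 4
    # Topology sensitivity: 100 freshly generated topologies per pool size.
    runtimes, qualities = [], []
    for _ in range(100):
        cost = random_topology(n, rng)
        t0 = time.process_time()            # analogue of Matlab's cputime()
        partition = sopcs(cost, n, 4)       # algorithm under test
        runtimes.append(time.process_time() - t0)
        qualities.append(partition_quality(cost, partition))
    print(n, np.mean(runtimes), np.mean(qualities))

    # Reordering sensitivity: one topology, labels permuted 100 times.
    cost = random_topology(n, rng)
    for _ in range(100):
        perm = rng.permutation(n)
        sopcs(cost[np.ix_(perm, perm)], n, 4)
```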

36
Results
  • Mean runtime

37
Results
  • Quality (topology sensitivity)

38
Simulation - C
  • Discrete event simulator
  • Entities
  • Pool user
  • Pool element
  • Name server
  • Events
  • State update
  • Request/response
  • Measures inconsistency
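
As a flavour of how such a simulator ticks, here is a toy Python event loop (ours; the extra event type, the fixed intervals and the inconsistency measure are illustrative assumptions, not the C implementation):

```python
import heapq

def simulate(end_time, update_interval, request_interval, repl_delay=0.05):
    """Toy discrete-event loop: inconsistency is counted as the fraction of
    requests arriving before the latest state update has been replicated."""
    events = [(0.0, "state_update"), (0.0, "request")]
    heapq.heapify(events)
    last_sent = last_applied = 0.0
    inconsistent = total = 0
    while events:
        t, kind = heapq.heappop(events)
        if t > end_time:
            break
        if kind == "state_update":
            last_sent = t
            heapq.heappush(events, (t + repl_delay, "update_applied"))
            heapq.heappush(events, (t + update_interval, "state_update"))
        elif kind == "update_applied":
            last_applied = max(last_applied, t - repl_delay)
        elif kind == "request":
            total += 1
            if last_applied < last_sent:    # replica lags behind newest state
                inconsistent += 1
            heapq.heappush(events, (t + request_interval, "request"))
    return inconsistent / max(total, 1)

print(simulate(end_time=100.0, update_interval=1.0, request_interval=0.3))
```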

39
Simulation
  • Mobility Model 180x180

40
Simulation
  • Quality of SOPCS

41
Simulation
  • Inconsistency with SOPCS

42
Evaluation
  • Run time
  • Small difference between SD, OPCS and SOPCS
  • LRTD slower, steeper incline
  • OD extremely slow in larger pools
  • Quality
  • SOPCS, OPCS no significant difference
  • SD the worst quality
  • LRTD best quality among the heuristics, close to
    OD for small pools
  • OD best quality overall (only feasible for small
    pools)

43
Evaluation
  • Inconsistency
  • Lower when reconfiguring
  • Relation between quality and inconsistency

44
Conclusion
  • Solution properties
  • High availability (99.999%)
  • Subset size four, MTTCF of 9.341 x 10^5 hours
  • Scalability
  • Pool capacity increased
  • State update overhead reduced
  • Inconsistency reduced

45
Future work
  • Inconsistency prediction
  • Based on quality
  • Improve client perception
  • Include client-server communication
  • Improve simulation
  • Failure model
  • Determination of beneficial cost parameters