1
Optimized State Replication for Highly Available
Services
  • Group 05gr1084b
  • Erling V. Matthiesen, Jakob K. Larsen, Flemming
    Olufsen
  • June 2005

2
Motivation
  • VoIP, Video on demand, MMORPG.
  • Network centric and session based.
  • Requires high capacity and stateful servers.
  • Requires high dependability.
  • Server pool.
  • Several servers used in parallel.
  • Enables dynamic failover.
  • Requires state sharing between servers.
  • Scalability issues.
  • State sharing introduces additional overhead.

3
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

4
Background Knowledge
  • Example application.
  • SIP (Session Initiation Protocol)
  • Uses servers for initiating and maintaining
    communication between two entities
  • Example: a proxy server that maintains the state
    of ongoing transactions.

5
Background Knowledge
  • RSerPool
  • Provides platform for highly available services.
  • Architecture
  • Name server
  • Pool User
  • Pool Element

6
Background Knowledge
  • State sharing approaches
  • All-to-All.
  • Major overhead from state updates in large server
    pools.
  • Hierarchical
  • Slow propagation of state updates.
  • Flat subset structure
  • Fast propagation, small number of state updates.

7
Problem Statement
  • How to deliver dependable, consistent and
    stateful services with large server pools in a
    scalable manner?

8
Problem Statement
  • Evaluation parameters
  • Availability
  • Response time
  • Robustness
  • Reliability
  • Scalability
  • Consistency

9
Related Work
  • Massively Replicating Services in Wide Area
    Internetworks by Katia Obraczka.
  • Divides a server pool into a logical flooding
    topology. This solution minimizes the load on the
    network, but all updates still reach all members
    of the pool.
  • Fault tolerant platforms for IP-based Session
    Control Systems by Marjan Bozinovski
  • Analyzes several highly available, fault-tolerant
    platforms. Optimizes control algorithms and
    selection policies and integrates these into
    RSerPool.

10
General Solution
  • Dependability issue solution
  • Using a server pool with state sharing.
  • Scalability issue solution
  • Dividing the pool into subsets.
  • Goal
  • Find an optimum set of subsets spanning the
    whole server pool, called a partition.
  • Reduce inconsistency within a subset.

11
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

12
System Model
  • Entities
  • Name server
  • Manages servers
  • Manages subsets (Extension)
  • Selects servers for clients
  • Server selection policy (SSP) (Modified)
  • Calculates partitions (Extension)
  • Pool Element
  • Provides the service to the clients
  • Sends list of failover candidates to PU (Modified)
  • Replicates its state onto subset members
    (Extension)
  • Sends communication cost values to NS (Extension)

13
System Model
  • Entities
  • Pool User
  • Uses the service that the servers provide
  • Uses name servers as gateway to service
  • Fails over according to list of failover
    candidates
  • Subset (Extension)
  • Group of servers
  • States are replicated within the same subset
  • Partition (Extension)
  • A group of subsets spanning the whole server pool

14
System Model
15
System Model
  • Communication cost
  • The cost between servers is represented by a cost
    matrix.
  • The cost between two servers is represented by a
    single value (0-255).
  • E.g. delay, packet loss, etc.
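
As an illustration only, here is a minimal Python sketch (not the project's Matlab/C code) of how such a symmetric cost matrix could be built and queried; the uniform random costs stand in for measured delay or packet loss.

```python
import numpy as np

def random_cost_matrix(n, seed=0):
    """Symmetric n x n matrix of pairwise costs in the range 0-255.

    The uniform random values are purely illustrative; in the system model
    they would be derived from measured delay, packet loss, etc.
    """
    rng = np.random.default_rng(seed)
    upper = rng.integers(0, 256, size=(n, n))
    cost = np.triu(upper, k=1)      # keep the strict upper triangle
    return cost + cost.T            # mirror it; the diagonal stays 0

cost = random_cost_matrix(8)
print(cost[2, 5])                   # cost between server 2 and server 5
```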

16
System Model
  • Considerations
  • High availability is needed.
  • High consistency within subsets.
  • Assumptions
  • Only PE to PE cost is considered.
  • N is a multiple of k.
  • NS is stable.
  • NS communication is stable.
  • Problems
  • Dividing pool into subsets.
  • Server selection policy.
  • Measure cost of communication.
  • Triggering events for reconfiguring.

17
System Model
  • Management

[Message sequence diagram: NAME_RESOLUTION /
NAME_RESOLUTION_RESPONSE(1,2,3), SIP REGISTER,
BUSINESS_CARD(2,3,1), SESSION traffic and SUMS state
updates exchanged between the PU, the NS and PEs 1-3.]
18
Methods
  • Methods used for analysis
  • Traffic modelling by example
  • Estimation of subset size
  • Birth death chains
  • Matrix exponential distribution
  • Availability graphs
  • Quality of partition

19
Methods
  • Traffic modelling by example

20
Methods
  • Estimation of subset size
  • Subset availability
  • MTTF = 120 h
  • MTTR = 4 h
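
The availability target can be reproduced with a simple back-of-the-envelope model. The sketch below (Python, ours) assumes independent server failures and reads "subset available" as "at least one subset member up"; both are assumptions, not quotes from the slides.

```python
def server_availability(mttf_h, mttr_h):
    """Steady-state availability of a single server."""
    return mttf_h / (mttf_h + mttr_h)

def subset_availability(mttf_h, mttr_h, k):
    """Availability of a k-server subset, assuming independent failures:
    the service is up as long as at least one subset member is up."""
    a = server_availability(mttf_h, mttr_h)
    return 1.0 - (1.0 - a) ** k

# MTTF = 120 h and MTTR = 4 h as above; subset size 4 as in the conclusion.
print(subset_availability(120, 4, 4))   # ~0.999999, i.e. beyond 99.999%
```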

21
Methods
  • Birth death chains
  • Used to estimate the rate of server failures
    within a pool of servers.
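
The slide only names the method; as a rough illustration (in Python rather than the project's Matlab, and assuming each server fails at rate 1/MTTF and each failed server is repaired independently at rate 1/MTTR), the steady state of the birth-death chain over the number of failed servers gives the expected failure rate of the pool:

```python
import numpy as np

def birth_death_steady_state(n_servers, mttf_h, mttr_h):
    """Steady-state distribution over the number of failed servers."""
    lam, mu = 1.0 / mttf_h, 1.0 / mttr_h
    pi = [1.0]
    for i in range(n_servers):     # detailed balance gives pi[i+1] from pi[i]
        pi.append(pi[-1] * (n_servers - i) * lam / ((i + 1) * mu))
    pi = np.array(pi)
    return pi / pi.sum()

def pool_failure_rate(n_servers, mttf_h, mttr_h):
    """Expected rate (failures per hour) of server failures in the pool."""
    lam = 1.0 / mttf_h
    pi = birth_death_steady_state(n_servers, mttf_h, mttr_h)
    return sum(p * (n_servers - i) * lam for i, p in enumerate(pi))

print(pool_failure_rate(40, 120, 4))   # 40 servers, MTTF = 120 h, MTTR = 4 h
```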

22
Methods
  • Matrix exponential distributions
  • For 4 servers the MTTCF is 9.341 x 10^5 hours, for
    MTTF = 120 h and MTTR = 4 h
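
The MTTCF figure can be checked numerically as the mean time to absorption of the same birth-death chain, with "all subset members down" as the absorbing state. The sketch below is our check, again assuming independent failures and repairs:

```python
def mttcf(k, mttf_h, mttr_h):
    """Mean time until all k servers of a subset are down simultaneously."""
    lam, mu = 1.0 / mttf_h, 1.0 / mttr_h
    total, h = 0.0, 0.0
    for i in range(k):        # h = expected time to go from i to i+1 failures
        fail_rate = (k - i) * lam
        repair_rate = i * mu
        h = (1.0 + repair_rate * h) / fail_rate
        total += h
    return total

print(mttcf(4, 120, 4))       # ~934,000 hours, in line with 9.341 x 10^5
```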

23
Methods
  • Determination of subset topology
  • Availability graphs A(service) = 98.3%

24
Methods
  • Availability graphs (cont)
  • A(service) = 99.89%
  • Higher availability if subset is spread on
    several network devices

25
Methods
  • Quality of subsets
  • The mean cost between any server pair.
  • Quality of a partition
  • The sum of the subset qualities.
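
In code form, the quality metric could look like the sketch below (Python, ours; random_cost_matrix comes from the cost-matrix sketch earlier, and lower quality values are better):

```python
import itertools
import numpy as np

def subset_quality(cost, subset):
    """Mean cost over all pairs of servers within one subset."""
    pairs = list(itertools.combinations(subset, 2))
    return float(np.mean([cost[a, b] for a, b in pairs]))

def partition_quality(cost, partition):
    """Quality of a partition: the sum of its subset qualities."""
    return sum(subset_quality(cost, s) for s in partition)

# Example: six servers split into three subsets of two.
cost = random_cost_matrix(6)
print(partition_quality(cost, [(0, 1), (2, 3), (4, 5)]))
```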

26
TOC/outline
  • Background Knowledge (Erling)
  • Problem Statement
  • Related Work
  • General Solution
  • System Model (Jakob)
  • Methods
  • Algorithms (Flemming)
  • Simulation
  • Results
  • Evaluation
  • Future Work

27
Algorithms
  • Optimum Division (OD)
  • Go through all legal solutions, choose the best
    set of subsets based on quality metric.
  • Complexity very high.
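
A brute-force sketch of OD (ours, in Python), using the model assumption that N is a multiple of k and the partition_quality function from the quality sketch:

```python
import itertools

def partitions_into_k(servers, k):
    """Yield every way to split `servers` into subsets of exactly k members."""
    if not servers:
        yield []
        return
    first, rest = servers[0], servers[1:]
    for partners in itertools.combinations(rest, k - 1):
        remaining = [s for s in rest if s not in partners]
        for tail in partitions_into_k(remaining, k):
            yield [(first,) + partners] + tail

def optimum_division(cost, n, k):
    """Exhaustive search for the partition with the lowest total quality."""
    return min(partitions_into_k(list(range(n)), k),
               key=lambda p: partition_quality(cost, p))
```

Already for N = 8 and k = 4 there are 35 candidate partitions, and the count grows combinatorially with N, which is why OD is only practical for small pools.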

28
Algorithms
  • Simple division (SD)
  • Put servers into subsets sequentially
  • Very simple division, highly dependent upon
    server order.
  • Used as a baseline for comparison: optimal with
    respect to speed.
  • Complexity O(N)
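
A minimal sketch of SD (ours, in Python); it ignores the cost matrix entirely:

```python
def simple_division(n, k):
    """SD: assign servers to subsets purely in pool order, O(N).

    The result depends entirely on that ordering; no cost data is used.
    """
    servers = list(range(n))
    return [tuple(servers[i:i + k]) for i in range(0, n, k)]

print(simple_division(8, 4))   # [(0, 1, 2, 3), (4, 5, 6, 7)]
```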

29
Algorithms
  • Simplified One Pass Compare and Swap (SOPCS)
  • Put servers into the current subset sequentially.
  • For each candidate, the cost of communicating
    with the first server in the subset is compared.
  • Swaps to improve subset quality.
  • Tradeoff between scalability and quality of the
    partition.
  • Complexity O(N^2)
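
A possible Python reading of SOPCS (ours; the exact comparison and swap rule is an interpretation of the slide, not the project's code):

```python
def sopcs(cost, n, k):
    """Simplified One Pass Compare and Swap -- interpreted sketch.

    Servers are consumed in pool order; before each addition, the candidate
    is swapped with the remaining server that is cheapest to reach from the
    first member of the current subset. Roughly O(N^2) comparisons in total.
    """
    remaining = list(range(n))
    partition = []
    while remaining:
        subset = [remaining.pop(0)]          # first server anchors the subset
        while len(subset) < k:
            best = min(range(len(remaining)),
                       key=lambda j: cost[subset[0], remaining[j]])
            subset.append(remaining.pop(best))
        partition.append(tuple(subset))
    return partition
```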

30
Algorithms - SOPCS
[Diagram: SOPCS partitioning example with nine servers, labelled 1-9.]
31
Algorithms
  • One Pass Compare and Swap (OPCS)
  • Put servers into the current subset sequentially.
  • Swap servers in the current subset with servers
    that enhance the quality of the subset.
  • Comparisons are made between all combinations of
    servers.
  • Tradeoff between scalability and quality of the
    partition.
  • Complexity O(N^2)
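
A corresponding sketch of OPCS (again our interpretation): each candidate is compared against every server already in the current subset rather than only the first one.

```python
def opcs(cost, n, k):
    """One Pass Compare and Swap -- interpreted sketch.

    The remaining server with the lowest total cost to all current subset
    members is pulled in; with k fixed this stays around O(N^2) comparisons.
    """
    remaining = list(range(n))
    partition = []
    while remaining:
        subset = [remaining.pop(0)]
        while len(subset) < k:
            best = min(range(len(remaining)),
                       key=lambda j: sum(cost[m, remaining[j]] for m in subset))
            subset.append(remaining.pop(best))
        partition.append(tuple(subset))
    return partition
```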

32
Algorithms - OPCS
[Diagram: OPCS partitioning example with nine servers, labelled 1-9.]
33
Algorithms - LRTD
  • Limited Recursive Tree Division
  • Builds three trees of candidate partitions.
  • Servers put into a subset are chosen with the
    SOPCS strategy.
  • When a subset is full, the tree branches on the
    next three unused servers.
  • Complexity O(a^N)
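
One possible reading of LRTD, sketched in Python (ours; it reuses partition_quality from the quality sketch, and the branching and fill rules are an interpretation of the slide):

```python
def lrtd(cost, n, k, branch=3):
    """Limited Recursive Tree Division -- interpreted sketch.

    Whenever a subset is completed, the search branches on the next `branch`
    unused servers as the seed of the following subset; each subset is filled
    with the SOPCS rule (cheapest cost to the seed). The best complete
    partition found anywhere in the tree is returned.
    """
    best = {"partition": None, "quality": float("inf")}

    def fill_subset(seed, pool):
        subset, rest = [seed], list(pool)
        while len(subset) < k:
            j = min(range(len(rest)), key=lambda i: cost[subset[0], rest[i]])
            subset.append(rest.pop(j))
        return tuple(subset), rest

    def grow(partition, remaining):
        if not remaining:
            q = partition_quality(cost, partition)
            if q < best["quality"]:
                best.update(partition=partition, quality=q)
            return
        for seed in remaining[:branch]:        # branch on up to three seeds
            subset, left = fill_subset(seed, [s for s in remaining if s != seed])
            grow(partition + [subset], left)

    grow([], list(range(n)))
    return best["partition"]
```

With a branching factor of three the tree has on the order of 3^(N/k) leaves, so the work still grows exponentially in N, consistent with the exponential complexity on the slide, while staying far cheaper than full enumeration.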

34
Algorithms
  • Six servers, two in each subset
[Tree diagram: the three candidate-partition trees for servers 1-6, with one
level per subset (1st, 2nd and 3rd subset).]
35
Simulation - Matlab
  • Algorithms evaluated in Matlab
  • Pool sizes 8 to 40 servers
  • Steps of 4 servers
  • Runtime
  • cputime() Matlab function used
  • Quality
  • Topology sensitivity
  • 100 randomly generated topologies per step
  • Reordering sensitivity
  • 1 random topology, labels permuted 100 times per
    step
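
Roughly, the evaluation loop could look like this in Python (the project used Matlab's cputime(); time.process_time(), the random-topology generator and the fixed subset size of 4 are stand-in assumptions; sopcs and partition_quality come from the earlier sketches):

```python
import time
import numpy as np

def random_topology(n, rng):
    """Random symmetric cost matrix (values 0-255) as a stand-in topology."""
    c = np.triu(rng.integers(0, 256, size=(n, n)), k=1)
    return c + c.T

rng = np.random.default_rng(42)
for n in range(8, 41, 4):                   # pool sizes 8..40, steps of 4
    # Topology sensitivity: 100 freshly generated topologies per pool size.
    runtimes, qualities = [], []
    for _ in range(100):
        cost = random_topology(n, rng)
        t0 = time.process_time()            # analogue of Matlab's cputime()
        partition = sopcs(cost, n, 4)       # algorithm under test
        runtimes.append(time.process_time() - t0)
        qualities.append(partition_quality(cost, partition))
    print(n, np.mean(runtimes), np.mean(qualities))

    # Reordering sensitivity: one topology, labels permuted 100 times.
    cost = random_topology(n, rng)
    for _ in range(100):
        perm = rng.permutation(n)
        sopcs(cost[np.ix_(perm, perm)], n, 4)
```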

36
Results
  • Mean runtime

37
Results
  • Quality (topology sensitivity)

38
Simulation - C
  • Discrete event simulator
  • Entities
  • Pool user
  • Pool element
  • Name server
  • Events
  • State update
  • Request/response
  • Measures inconsistency
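
As a flavour of how such a simulator ticks, here is a toy Python event loop (ours; the extra event type, the fixed intervals and the inconsistency measure are illustrative assumptions, not the C implementation):

```python
import heapq

def simulate(end_time, update_interval, request_interval, repl_delay=0.05):
    """Toy discrete-event loop: inconsistency is counted as the fraction of
    requests arriving before the latest state update has been replicated."""
    events = [(0.0, "state_update"), (0.0, "request")]
    heapq.heapify(events)
    last_sent = last_applied = 0.0
    inconsistent = total = 0
    while events:
        t, kind = heapq.heappop(events)
        if t > end_time:
            break
        if kind == "state_update":
            last_sent = t
            heapq.heappush(events, (t + repl_delay, "update_applied"))
            heapq.heappush(events, (t + update_interval, "state_update"))
        elif kind == "update_applied":
            last_applied = max(last_applied, t - repl_delay)
        elif kind == "request":
            total += 1
            if last_applied < last_sent:    # replica lags behind newest state
                inconsistent += 1
            heapq.heappush(events, (t + request_interval, "request"))
    return inconsistent / max(total, 1)

print(simulate(end_time=100.0, update_interval=1.0, request_interval=0.3))
```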

39
Simulation
  • Mobility Model 180x180

40
Simulation
  • Quality of SOPCS

41
Simulation
  • Inconsistency with SOPCS

42
Evaluation
  • Run time
  • Small difference between SD, OPCS and SOPCS
  • LRTD slower, steeper incline
  • OD extremely slow in larger pools
  • Quality
  • SOPCS, OPCS no significant difference
  • SD the worst quality
  • LRTD best quality among the heuristics, close to
    OD for small pools
  • OD best quality overall (only feasible for small
    pools)

43
Evaluation
  • Inconsistency
  • Lower when reconfiguring
  • Relation between quality and inconsistency

44
Conclusion
  • Solution properties
  • High availability (99.999%)
  • Subset size four, MTTCF of 9.341 x 10^5 hours
  • Scalability
  • Pool capacity increased
  • State update overhead reduced
  • Inconsistency reduced

45
Future work
  • Inconsistency prediction
  • Based on quality
  • Improve client perception
  • Include client-server communication
  • Improve simulation
  • Failure model
  • Determination of beneficial cost parameters