Philip Bianco

Slides: 30

Provided by: ece9

Learn more at: https://course.ece.cmu.edu

Transcript and Presenter's Notes

1
Team 5 Virtual Online Blackjack 17-654
Analysis of Software Artifacts 18-841
Dependability Analysis of Middleware

Philip Bianco
John Robert
Vorachat Tamarree
Lutz Wrage
Gene Wilson

2
Team Members
3
Virtual Online Blackjack

An interactive client server application that
allows multiple players to play blackjack in a
virtual casino.
The server performs all the functions of a dealer
in a Las Vegas casino including
Selling chips to the players
Taking bets
Dealing the initial hand to each player
Presenting options to the player (Hit or Stay)
Officiating the game
This is an interesting application because
Multiple server side elements (Casino Floor,
Bank, Table)
Clear fault tolerance and performance
requirements
Completely Java solution
The application uses the Sun Microsystems IDL ORB
for the following reasons
The price was certainly well within our budget
The team wanted to play with CORBA

4
Baseline Architecture
5
Fault Tolerance Goals

Fault tolerant goals
Client automatically connects to a backup casino
server.
New backup is automatically started.
Minimize transient state and store all state on
the database.
Replicated Components
Casino Server (IDL interfaces for Casino Floor,
Bank and Table)
All state is stored in a magnificently designed
database.
Installed MS SQL Server on a PC in the Cave
Shared server with at least one other team
Sacred functions
Database
Naming Service
Players (clients)
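
The first goal above (the client automatically connecting to a backup server) can be illustrated with a minimal Java sketch. It is not the team's code: the Casino interface, ServerDownException, and FailoverProxy are hypothetical names, and ServerDownException merely stands in for whatever communication failure the ORB would raise (for example CORBA's COMM_FAILURE). The sketch assumes, as the slide states, that all state lives in the database, so any replica can serve the retried call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class FailoverSketch {
    /** Hypothetical stand-in for a remote Casino interface; not the team's IDL. */
    interface Casino {
        int buyChips(String player, int amount);
    }

    /** Stand-in for a communication failure such as CORBA's COMM_FAILURE. */
    static class ServerDownException extends RuntimeException {
        ServerDownException(String message) { super(message); }
    }

    /** Client-side proxy that retries a call on the next replica when one fails. */
    static class FailoverProxy implements Casino {
        private final List<Casino> replicas;
        private int current = 0;

        FailoverProxy(List<Casino> replicas) {
            this.replicas = new ArrayList<>(replicas);
        }

        @Override
        public int buyChips(String player, int amount) {
            return invoke(c -> c.buyChips(player, amount));
        }

        /** Tries each known replica at most once, starting with the current one. */
        private <T> T invoke(Function<Casino, T> call) {
            ServerDownException last = null;
            for (int attempt = 0; attempt < replicas.size(); attempt++) {
                try {
                    return call.apply(replicas.get(current));
                } catch (ServerDownException e) {
                    last = e;
                    current = (current + 1) % replicas.size(); // fail over
                }
            }
            throw new IllegalStateException("no live replicas", last);
        }

        int currentIndex() { return current; }
    }

    public static void main(String[] args) {
        Casino dead = (p, a) -> { throw new ServerDownException("primary down"); };
        Casino backup = (p, a) -> a; // pretend the chips were sold
        FailoverProxy proxy = new FailoverProxy(Arrays.asList(dead, backup));
        System.out.println("chips bought: " + proxy.buyChips("alice", 100));
    }
}
```

Keeping the retry logic in one invoke method means the application code only ever sees the Casino interface, which is one way to isolate fault handling from game logic.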

6
Fault Tolerant Elements

Replication Manager
Pings (1 per sec) all Casino servers to detect
service faults
Automatically starts new servers to maintain the
number of Casino servers (2)
User interface enables injecting faults (killing
a server)
Very configurable using configuration files
Proxy classes for all communication
Isolates most fault tolerance functions from the
application
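
A minimal Java sketch of the monitoring loop described above: ping every Casino server once per second, drop the ones that do not answer, and start replacements until two are running. The class and method names (ReplicationManagerSketch, Replica, FakeServer, checkOnce) are assumptions made for illustration, and FakeServer only simulates a server process so that a fault can be injected by flipping a flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class ReplicationManagerSketch {
    /** Hypothetical handle to a Casino server replica. */
    interface Replica {
        boolean ping(); // true if the server answered
    }

    /** Simulated server; setting alive to false plays the role of fault injection. */
    static class FakeServer implements Replica {
        volatile boolean alive = true;
        @Override public boolean ping() { return alive; }
    }

    private final List<Replica> replicas = new ArrayList<>();
    private final Supplier<Replica> launcher; // starts a new server
    private final int targetCount;

    ReplicationManagerSketch(int targetCount, Supplier<Replica> launcher) {
        this.targetCount = targetCount;
        this.launcher = launcher;
    }

    /** One monitoring pass; returns how many servers were started. */
    synchronized int checkOnce() {
        replicas.removeIf(r -> !isAlive(r));
        int started = 0;
        while (replicas.size() < targetCount) {
            replicas.add(launcher.get());
            started++;
        }
        return started;
    }

    /** A ping that throws is treated the same as a ping that fails. */
    private static boolean isAlive(Replica r) {
        try {
            return r.ping();
        } catch (RuntimeException e) {
            return false;
        }
    }

    synchronized int liveCount() { return replicas.size(); }

    public static void main(String[] args) throws InterruptedException {
        List<FakeServer> launched = new CopyOnWriteArrayList<>();
        ReplicationManagerSketch manager = new ReplicationManagerSketch(2, () -> {
            FakeServer s = new FakeServer();
            launched.add(s);
            return s;
        });
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(manager::checkOnce, 0, 1, TimeUnit.SECONDS); // 1 ping pass per second
        Thread.sleep(1500);
        launched.get(0).alive = false; // inject a fault
        Thread.sleep(1500);
        timer.shutdownNow();
        System.out.println("servers launched: " + launched.size()
                + ", live replicas: " + manager.liveCount());
    }
}
```

With a one-second ping interval, a crashed server can go unnoticed for up to roughly one second in this sketch, which is one reason a client-side proxy would still need to handle call failures itself.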

7
FT-Baseline Architecture

8
FT-Baseline Architecture

9
FT-Baseline Architecture

10
Mechanisms for Fail-Over

How do you accomplish fail-over?
How do you detect a fault?
Which exceptions do you handle (mention the
names)?
What do you do, upon catching one of these
exceptions?
When do you obtain the names of the server
references? What if you run out of live
references?

11
Local method call

Local Methods Fault Free Standard Garbage
Collection

12
Local Method call with failover to remote server

13
Fault Free Remote method call with standard GC

14
Remote Method call with failover to local server

15
Remote with server running IGC Fault Free

16
Remote with client and server running IGC

17
Timing of failover and activation time during
failover

18
Active Replication Timing Data
19
Fail-Over Measurements

Show your graphs, one by one, over the next few
slides
Place one graph per slide
Select at most one graph (out of the entire set
that you have)
Pick the most interesting graph
Showing at least 15-20 fail-overs
Showing the spike of the Naming Service (or
Replication Manager) communication

20
RT-FT-Baseline Architecture

Should describe your strategy for reducing the
fail-over time, in the interests of obtaining
real-time bounded behavior under faults

21
Bounded Real-Time Fail-Over Measurements

Show your RT-FT-baseline graphs
Select at most one graph (out of the entire set
that you have)
Pick the most interesting graph
Showing at least 15-20 fail-overs
Showing the spike of the Naming Service (or
Replication Manager) communication being
mitigated
Include on the slide the percentage by which
you've reduced the spike
Tell us what the bounds for the fail-over now are

22
RT-FT-Performance Strategy

Used active replication strategy to address
performance
The proxy classes handle everything.
During player startup the AR proxies get
references to all running replicas.
Each method call is sent to all replicas. The
sequence numbers for one call are identical
across replicas.
What mechanisms did you need in addition to what
your system has?
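
The active replication strategy on this slide can be sketched in Java as follows. This is an illustration under assumptions, not the team's proxy: TableReplica, ActiveProxy, and the hit method are hypothetical names, the replicas are called one after another rather than concurrently, and the first successful reply is returned. The one detail taken directly from the slide is that a single call carries the same sequence number to every replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ActiveReplicationSketch {
    /** Hypothetical replica interface; every call carries its sequence number. */
    interface TableReplica {
        String hit(long seq, String player);
    }

    /** Proxy that sends each call to all replicas it knows about. */
    static class ActiveProxy {
        private final List<TableReplica> replicas;
        private long nextSeq = 1;

        ActiveProxy(List<TableReplica> replicas) {
            // References to all running replicas are gathered once, at startup.
            this.replicas = new ArrayList<>(replicas);
        }

        String hit(String player) {
            long seq = nextSeq++; // same number for every replica on this call
            String result = null;
            boolean answered = false;
            Iterator<TableReplica> it = replicas.iterator();
            while (it.hasNext()) {
                TableReplica replica = it.next();
                try {
                    String reply = replica.hit(seq, player);
                    if (!answered) { // keep the first successful reply
                        result = reply;
                        answered = true;
                    }
                } catch (RuntimeException e) {
                    it.remove(); // stop calling a replica that failed
                }
            }
            if (!answered) {
                throw new IllegalStateException("all replicas failed");
            }
            return result;
        }

        int liveReplicas() { return replicas.size(); }
    }

    public static void main(String[] args) {
        TableReplica healthy = (seq, player) -> "card for " + player + " (call " + seq + ")";
        TableReplica crashed = (seq, player) -> { throw new RuntimeException("replica down"); };
        ActiveProxy proxy = new ActiveProxy(Arrays.asList(crashed, healthy));
        System.out.println(proxy.hit("alice"));
        System.out.println("live replicas: " + proxy.liveReplicas());
    }
}
```

Because a surviving replica has already processed the call, the client sees no retry delay when one replica fails. One plausible use of the shared sequence number is to let replicas or the database recognize that two requests are the same logical call.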

23
Performance Measurements

Show your performance graphs for active
replication or load balancing
Select at most one graph (out of the entire set
that you have)
Pick the most interesting graph
For load balancing, show the system performance
under several clients (try to scale up to more
than 20 clients)
For active replication, show what the fail-over
times vs. run-time performance trade-offs are, as
compared to cold passive replication

24
Other Features

List other features that you used
Used CVS throughout the project
Moved to ant (from make files) after baseline
Some use of a scripting language for automated
player (clients)
Gene's tool to find current usage of cluster
machines
Explored garbage collection
Incremental GC
Turned off
How to design for testability (fault injection)
This is where you get to show off about how
you've gone the extra mile in this project!
Performance at different times of the day
Extensive use of configuration files enabled
greater flexibility for implementation and
testing
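
As an illustration of configuration-driven testing, the sketch below loads settings with java.util.Properties. The slides do not show the team's actual file format or key names, so the keys casino.replicas, manager.pingIntervalMillis, and jvm.gc are invented for this example. The defaults of 2 replicas and a 1000 ms ping interval mirror the figures given on the Fault Tolerant Elements slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConfigSketch {
    final int replicaCount;
    final long pingIntervalMillis;
    final String gcMode;

    ConfigSketch(Properties p) {
        // Key names are hypothetical; a missing key falls back to its default.
        replicaCount = Integer.parseInt(p.getProperty("casino.replicas", "2"));
        pingIntervalMillis = Long.parseLong(p.getProperty("manager.pingIntervalMillis", "1000"));
        gcMode = p.getProperty("jvm.gc", "standard");
    }

    /** Parses configuration text in the standard .properties format. */
    static ConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ConfigSketch(p);
    }

    public static void main(String[] args) {
        ConfigSketch c = parse("casino.replicas=3\njvm.gc=incremental\n");
        System.out.println(c.replicaCount + " replicas, ping every "
                + c.pingIntervalMillis + " ms, gc=" + c.gcMode);
    }
}
```

Switching between test setups, such as standard versus incremental garbage collection, then needs only a different file rather than a rebuild.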

25
Insights from Measurements

What insights did you gain from the three sets of
measurements, and from analyzing the data?
Java garbage collection is the dominant factor
for performance
Time is double for remote clients.
Changing garbage collection impacts the
performance by ???
Replication tradeoffs
In the tested configuration, network latency
impact on performance is negligible compared to
the database access time.
How did you use each set of insights in the next
phases of the project?

26
Open Issues

List any issues that you still need to resolve,
and that you might want to see discussed openly
Profiler testing with Java?
Impact of other JVMs?
If you had the time, what are the 2-3 additional
features that you would have liked to have
implemented for your system?
Improve the user interface
Examine impact of security requirements on FT, RT
and performance
Analyzing why the performance edges up over time

27
Conclusions

Lessons Learned
Technical Lessons
Fault tolerance rapidly increases complexity of
the system
Active replication is non-trivial.
In Java applications garbage collection is the
largest performance bottleneck.
Name server lookups contributed a very minor
amount of delay to failover recovery compared to
state recovery from the database.
Configuration Issues for Remote Development
We did most of our development independently and
remotely.
Linux, CVS, ssh and AFS all made this much
easier.
Simple scripts can make life much easier
(clusterload, project.env).
E-mail, instant messaging, and shared file system
space help communication.
Separate databases for each developer.
Design still needs face-to-face meetings to be
effective.

28
Conclusions

Accomplishments
Met objective to create a distributed middleware
application and demonstrated improvements at each
milestone
Our fault-tolerant design worked very well: fast,
scalable, robust.
Replication Manager with fault injection,
powerful user interface.
Player manager could start and control many
clients on multiple hosts at once.
Active Replication.
Name caching.
Automatic player script with Expect.
Found the sources of the worst performance
bottlenecks.
1.3 babies

29
Conclusions

What would you do differently, if you could start
the project from scratch now?
Focus on state management earlier in the
development.
Design for active replication much earlier.
Using the clients as callback servers is a very
bad idea; it makes active replication and/or load
balancing hard to impossible.
Have root access to the development machines.
Better configuration management, keeping better
track of milestone versions.
Better test plans.
A little more structure to our team
Better meeting scheduling.
Agendas for meetings.
Minutes for meetings.
Maybe a team leader, possibly as a rotating
assignment (e.g., 3 weeks per person)