Philip Bianco - PowerPoint PPT Presentation


Title: Philip Bianco


1
Team 5: Virtual Online Blackjack
17-654 Analysis of Software Artifacts
18-841 Dependability Analysis of Middleware
  • Philip Bianco
  • John Robert
  • Vorachat Tamarree
  • Lutz Wrage
  • Gene Wilson

2
Team Members
3
Virtual Online Blackjack
  • An interactive client-server application that
    allows multiple players to play blackjack in a
    virtual casino.
  • The server performs all the functions of a dealer
    in a Las Vegas casino, including:
    • Selling chips to the players
    • Taking bets
    • Dealing the initial hand to each player
    • Presenting options to the player (Hit or Stay)
    • Officiating the game
  • This is an interesting application because:
    • Multiple server-side elements (Casino Floor,
      Bank, Table)
    • Clear fault tolerance and performance
      requirements
    • Pure Java solution
  • The application uses the Sun Microsystems IDL ORB
    for the following reasons:
    • The price was certainly well within our budget
    • The team wanted to play with CORBA

4
Baseline Architecture
5
Fault Tolerance Goals
  • Fault-tolerance goals
    • Client automatically connects to a backup casino
      server.
    • New backup is automatically started.
    • Minimize transient state and store all state in
      the database.
  • Replicated components
    • Casino Server (IDL interfaces for Casino Floor,
      Bank and Table)
  • All state is stored in a magnificently designed
    database.
    • Installed MS SQL Server on a PC in the Cave
    • Shared server with at least one other team
  • Sacred functions
    • Database
    • Naming Service
    • Players (clients)

6
Fault Tolerant Elements
  • Replication Manager
    • Pings all Casino servers (once per second) to
      detect service faults
    • Automatically starts new servers to maintain the
      number of Casino servers (2)
    • User interface enables injecting faults (killing
      a server)
    • Very configurable using configuration files
  • Proxy classes for all communication
    • Isolate most fault tolerance functions from the
      application
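
The Replication Manager behavior listed above (ping every Casino server, restart replacements to keep two running, inject faults by killing a server) can be sketched roughly as follows. This is a minimal in-memory illustration, not the team's implementation: the names CasinoServer, ReplicationManagerSketch, FakeCasinoServer and checkOnce are all hypothetical, and a real manager would ping remote CORBA objects and launch server processes instead of creating local objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a remote Casino server reference. */
interface CasinoServer {
    boolean ping(); // returns false (or throws) when the server is down
}

/** Sketch of a replication manager that keeps a target number of servers alive. */
class ReplicationManagerSketch {
    private final List<CasinoServer> servers = new ArrayList<>();
    private final Supplier<CasinoServer> launcher; // assumed way to start a new server
    private final int targetCount;

    ReplicationManagerSketch(int targetCount, Supplier<CasinoServer> launcher) {
        this.targetCount = targetCount;
        this.launcher = launcher;
        checkOnce(); // starts the initial replicas
    }

    /** One ping round: drop dead servers, start replacements, return how many were started. */
    synchronized int checkOnce() {
        servers.removeIf(s -> !isAlive(s));
        int started = 0;
        while (servers.size() < targetCount) {
            servers.add(launcher.get());
            started++;
        }
        return started;
    }

    private static boolean isAlive(CasinoServer s) {
        try {
            return s.ping();
        } catch (RuntimeException e) {
            return false; // a failed ping counts as a service fault
        }
    }

    synchronized int liveCount() { return servers.size(); }

    synchronized List<CasinoServer> servers() { return new ArrayList<>(servers); }
}

/** In-memory fake with a kill switch, used here for fault injection. */
class FakeCasinoServer implements CasinoServer {
    private volatile boolean alive = true;
    void kill() { alive = false; }
    public boolean ping() { return alive; }
}
```

To match the one-ping-per-second rate from the slide, checkOnce could be run from a ScheduledExecutorService with a one-second period. The target count (2 here) and the ping period are the kind of values the slide says were read from configuration files.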

7
FT-Baseline Architecture

8
FT-Baseline Architecture

9
FT-Baseline Architecture

10
Mechanisms for Fail-Over
  • How do you accomplish fail-over?
  • How do you detect a fault?
  • Which exceptions do you handle (mention the
    names)?
  • What do you do, upon catching one of these
    exceptions?
  • When do you obtain the names of the server
    references? What if you run out of live
    references?
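
One possible set of answers to the questions above, sketched in Java: the proxy obtains all server references once at startup, treats a communication exception as the fault signal, switches to the next reference when it catches one, and goes back to the naming lookup once if it runs out. Everything here is an assumption for illustration. ServerUnavailableException stands in for whatever exceptions the ORB actually raises, and the nameLookup supplier stands in for a Naming Service query; the slides do not say which exceptions or lookup policy the team used.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical stand-in for a communication failure reported by the ORB. */
class ServerUnavailableException extends RuntimeException {
    ServerUnavailableException(String message) { super(message); }
}

/** Fail-over proxy sketch: use the first live reference, move on when it fails. */
class FailoverProxy<S> {
    private final Supplier<List<S>> nameLookup; // assumed naming-service query
    private final Deque<S> references = new ArrayDeque<>();

    FailoverProxy(Supplier<List<S>> nameLookup) {
        this.nameLookup = nameLookup;
        references.addAll(nameLookup.get()); // references are fetched once, at startup
    }

    /** Runs the call on the current server, failing over until one succeeds. */
    <R> R invoke(Function<S, R> call) {
        boolean refreshed = false;
        while (true) {
            if (references.isEmpty()) {
                if (refreshed) {
                    throw new ServerUnavailableException("no live servers");
                }
                references.addAll(nameLookup.get()); // out of references: look up again, once
                refreshed = true;
                continue;
            }
            try {
                return call.apply(references.peekFirst());
            } catch (ServerUnavailableException e) {
                references.removeFirst(); // fault detected: discard this reference
            }
        }
    }
}
```

Because the application only ever calls invoke, the retry logic stays out of the game code, which is the isolation the proxy classes are meant to provide.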

11
Local method call
  • Local methods, fault-free, standard garbage
    collection

12
Local Method call with failover to remote server

13
Fault Free Remote method call with standard GC

14
Remote Method call with failover to local server

15
Fault-free remote method call with server running incremental GC (IGC)

16
Remote method call with client and server running IGC

17
Timing of failover and activation time during
failover

18
Active Replication Timing Data
19
Fail-Over Measurements
  • Show your graphs, one by one, over the next few
    slides
  • Place one graph per slide
  • Select at most one graph (out of the entire set
    that you have)
  • Pick the most interesting graph
  • Showing at least 15-20 fail-overs
  • Showing the spike of the Naming Service (or
    Replication Manager) communication

20
RT-FT-Baseline Architecture
  • Should describe your strategy for reducing the
    fail-over time, in the interests of obtaining
    real-time bounded behavior under faults

21
Bounded Real-Time Fail-Over Measurements
  • Show your RT-FT-baseline graphs
  • Select at most one graph (out of the entire set
    that you have)
  • Pick the most interesting graph
  • Showing at least 15-20 fail-overs
  • Showing the spike of the Naming Service (or
    Replication Manager) communication being
    mitigated
  • Include on the slide the percentage by which
    you've reduced the spike
  • Tell us what the bounds for the fail-over now are

22
RT-FT-Performance Strategy
  • Used active replication strategy to address
    performance
  • The proxy classes handle everything.
  • During player startup the AR proxies get
    references to all running replicas.
  • Each method call is sent to all replicas. The
    sequence numbers for one call are identical
    across replicas.
  • What mechanisms did you need in addition to what
    your system has?
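
The active replication scheme described above can be sketched as a proxy that holds references to all replicas, assigns one sequence number per logical call, and sends that same number to every replica. This is an illustrative sketch only: Replica, ActiveReplicationProxy and RecordingReplica are hypothetical names, the replicas are called one after another rather than in parallel, and the rule of returning the first successful reply is an assumption the slides do not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical replica interface: every request carries a sequence number. */
interface Replica {
    String handle(long sequenceNumber, String request);
}

/** Active replication sketch: each call goes to all replicas with the same sequence number. */
class ActiveReplicationProxy {
    private final List<Replica> replicas;
    private final AtomicLong nextSequence = new AtomicLong(1);

    ActiveReplicationProxy(List<Replica> replicas) {
        this.replicas = new ArrayList<>(replicas); // references gathered at startup
    }

    String call(String request) {
        long seq = nextSequence.getAndIncrement(); // one number per logical call
        String firstReply = null;
        for (Replica replica : replicas) {
            try {
                String reply = replica.handle(seq, request);
                if (firstReply == null) {
                    firstReply = reply; // keep the first successful answer
                }
            } catch (RuntimeException e) {
                // A failed replica is skipped; the others still receive the call.
            }
        }
        if (firstReply == null) {
            throw new IllegalStateException("all replicas failed");
        }
        return firstReply;
    }
}

/** In-memory replica that records the sequence numbers it has seen. */
class RecordingReplica implements Replica {
    final String name;
    final List<Long> seen = new ArrayList<>();
    boolean down = false; // fault-injection switch

    RecordingReplica(String name) { this.name = name; }

    public String handle(long sequenceNumber, String request) {
        if (down) {
            throw new IllegalStateException(name + " is down");
        }
        seen.add(sequenceNumber);
        return name + ":" + request;
    }
}
```

Identical sequence numbers let each replica recognize that two messages belong to the same logical call, which is what keeps the replicas' state in step when one of them fails mid-game.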

23
Performance Measurements
  • Show your performance graphs for active
    replication or load balancing
  • Select at most one graph (out of the entire set
    that you have)
  • Pick the most interesting graph
  • For load balancing, show the system performance
    under several clients (try to scale up to more
    than 20 clients)
  • For active replication, show what the fail-over
    times vs. run-time performance trade-offs are, as
    compared to cold passive replication

24
Other Features
  • List other features that you used
  • Used CVS throughout the project
  • Moved to ant (from make files) after baseline
  • Some use of a scripting language for automated
    player (clients)
  • Gene's tool to find current usage of cluster
    machines
  • Explored garbage collection
  • Incremental GC
  • Turned off
  • How to design for testability (fault injection)
  • This is where you get to show off about how
    you've gone the extra mile in this project!
  • Performance at different times of the day
  • Extensive use of configuration files enabled
    greater flexibility for implementation and
    testing

25
Insights from Measurements
  • What insights did you gain from the three sets of
    measurements, and from analyzing the data?
  • Java garbage collection is the dominant factor
    for performance
  • Time is double for remote clients.
  • Changing garbage collection impacts the
    performance by ???
  • Replication tradeoffs
  • In the tested configuration, the impact of network
    latency on performance is negligible compared to
    the database access time.
  • How did you use each set of insights in the next
    phases of the project?

26
Open Issues
  • List any issues that you still need to resolve,
    and that you might want to see discussed openly
  • Profiler testing with Java?
  • Impact of other JVMs?
  • If you had the time, what are the 2-3 additional
    features that you would have liked to have
    implemented for your system?
  • Improve the user interface
  • Examine impact of security requirements on FT, RT
    and performance
  • Analyzing why the performance edges up over time

27
Conclusions
  • Lessons Learned
  • Technical Lessons
  • Fault tolerance rapidly increases complexity of
    the system
  • Active replication is nontrivial.
  • In Java applications, garbage collection is the
    largest performance bottleneck.
  • Name server lookups contributed a very minor
    amount of delay to failover recovery compared to
    state recovery from the database.
  • Configuration Issues for Remote Development
  • We did most of our development independently and
    remotely.
  • Linux, CVS, ssh and AFS all made this much
    easier.
  • Simple scripts can make life much easier:
    clusterload, project.env.
  • E-mail, instant messaging, shared file system
    space help communication.
  • Separate databases for each developer.
  • Design still needs face-to-face meetings to be
    effective.

28
Conclusions
  • Accomplishments
  • Met objective to create a distributed middleware
    application and demonstrated improvements at each
    milestone
  • Our fault-tolerant design worked very well: fast,
    scalable, robust.
  • Replication Manager with fault injection,
    powerful user interface.
  • Player manager could start and control many
    clients on multiple hosts at once.
  • Active Replication.
  • Name caching.
  • Automatic player script with Expect.
  • Found the sources of the worst performance
    bottlenecks.
  • 1.3 babies

29
Conclusions
  • What would you do differently, if you could start
    the project from scratch now?
  • Focus on state management earlier in the
    development.
  • Design for active replication much earlier.
  • Using the clients as callback servers is a very
    bad idea; it makes active replication and/or load
    balancing hard to impossible.
  • Have root access to the development machines.
  • Better configuration management, keeping better
    track of milestone versions.
  • Better test plans.
  • A little more structure to our team
    • Better meeting scheduling.
    • Agendas for meetings.
    • Minutes for meetings.
    • Maybe a team leader, possibly as a rotating
      assignment (e.g., 3 weeks per person)