
1
TESTING THE LIMITS OF A TRANSACTIONAL NETWORKED
SERVICE
BY BOWEI DU
3
INTRODUCTION
  • One of the defining characteristics of a cloud
    service is scale, and with scale comes the
    question of performance and cost. How efficient
    are the software systems that we run? How many
    computing resources are required to meet our
    current demands, and how much more will be
    required in the future?
  • At Instart Logic, we have created a system
    called Lava that enables us to measure and test
    the scalability limits of our systems. Lava is
    focused on transactional networked services:
    systems that serve independent requests sent over
    a network from a large number of clients.
    Examples include HTTP frontends, data caches and
    API endpoints.
  • Performance measurement is a deep topic with many
    facets. Lava seeks to solve a specific slice of
    the performance measurement problem: how can we
    quickly find the maximum load a service can
    handle? While today there are many open source
    tools for stress testing, we found most of them
    to be too inflexible and slow to use for this
    purpose. This poses a problem as we have a large
    space of experimental parameters to explore
    during system stress tests.
  • Lava decomposes this problem into two pieces:
  • a set of extensible protocol-specific agents that
    generate a controllable amount of load on the
    system under test
  • a control function that uses feedback from
    metrics generated by the stress test to find
    system limits.
  • While the ideas used in the Lava system are not
    novel, we feel that the particular combination of
    features used will be interesting to a broader
    audience.

4
BACKGROUND
The most important metrics for our Lava use cases
are throughput and latency. Throughput is the
rate at which requests can be processed, and
latency is the time from the start of a request
to the reception of its response.
5
  • Figure 1 is a graph of the typical response time
    behavior with respect to increasing request
    volume. Service response time is stable under
    increasing request rate until we reach a
    saturation point at which the service cannot
    keep up with the request ingress rate. Beyond the
    saturation point, internal queues overflow and
    service response times degrade past acceptable
    thresholds.
  • It is important to know what the saturation point
    is for each of our services. In development, we
    use the results obtained from Lava to find
    performance regressions and guide our performance
    improvement efforts. In production, we use these
    results for capacity planning, as services need
    to be provisioned with enough headroom to absorb
    service failures and request spikes.
  • There are many existing performance frameworks
    for network protocols such as HTTP, the main ones
    being Tsung, Apache Bench, Siege and JMeter. We
    encountered the following issues with these
    frameworks:
  • First, many of the available frameworks run a
    fixed workload without any feedback mechanism for load
    control. Our stress runs can be sensitive around
    the saturation point and slightly too much load
    can cause high variance in the output, leading to
    unstable results.
  • Lack of feedback also meant that finding the
    saturation point required many runs of the stress
    tools probing at different load levels. Even with
    a guided binary search, this proved to be too
    slow to be viable for exploring large sets of
    experimental parameters.
  • Finally, while this is not fundamental, we found
    that the Lava system was simple enough that
    implementing these mechanisms in our own
    framework did not incur undue engineering cost.

6
DESIGN
7
  • The Lava system (Figure 2) consists of two main
    components:
  • A set of agents running on worker threads that
    generate application-specific loads. For example,
    in a stress test of an HTTP frontend, each state
    machine executes a sequence of HTTP
    request/response interactions. For saturation
    point measurement, each agent generates a
    constant number of requests per second for easy
    load control.
  • A control function component that receives
    real-time metrics aggregated from the state
    machines and adjusts the parameters of the stress
    run. The control function manages the number of
    state machines that are active and the state of
    the Lava system overall.
  • Each Lava run consists of three phases: ramp-up,
    search and measurement. During the ramp-up phase,
    the Lava control function steadily increases the
    number of active agents until a metric threshold
    has been exceeded. The ramp-up phase is not
    strictly necessary; however, we have found it
    useful to distinguish for debugging purposes.
    Lava then transitions to the search phase, in
    which the number of agents is varied up and down
    around the saturation point, to find the maximum
    load possible that still meets the threshold.
    When the search phase has stabilized, Lava
    transitions to the measurement phase, in which
    the number of agents is held constant for a
    configurable time period. During the measurement
    phase, all metrics should be stable. If high
    variance occurs, it is an indication that either
    something is wrong with the system under test or
    with the test setup itself. Figure 3 shows the
    agent count and metric graphs for each of the
    phases.
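
To make the phase transitions concrete, the control
loop can be sketched as a small state machine. This
is a hypothetical illustration, not the actual Lava
code; Phase, Run::tick and all the helper names in
it are assumed.

enum class Phase { RAMP_UP, SEARCH, MEASUREMENT, DONE };

void Run::tick() {
  switch (phase_) {
    case Phase::RAMP_UP:
      // Steadily add agents until a metric threshold is exceeded.
      add_agents(ramp_step_);
      if (threshold_exceeded()) phase_ = Phase::SEARCH;
      break;
    case Phase::SEARCH:
      // Vary the agent count around the saturation point until the
      // load stabilizes at the maximum that still meets the threshold.
      adjust_agents(controller_signal());
      if (stabilized()) phase_ = Phase::MEASUREMENT;
      break;
    case Phase::MEASUREMENT:
      // Hold the agent count constant for a configurable period;
      // high variance here points at the system under test or the
      // test setup itself.
      if (elapsed() >= measurement_period_) phase_ = Phase::DONE;
      break;
    case Phase::DONE:
      break;
  }
}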

8
Each agent in Lava simulates a constant-rate
workload from a client. By increasing or
decreasing the number of active agents, Lava can
adjust the amount of load placed on the system
under test. Each agent has (modulo code
transformations to facilitate non-blocking I/O)
the following inner loop:
void Agent::run() {
  while (true) {
    Operation* op = create_next_operation();
    op->run();
    sleep(1 / rate);
  }
}
Agents can be implemented as extensions in C or
via the Lua scripting language. In addition to
system limit exploration, we have also
implemented agents that replay request traces
taken from production.
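
As a sketch of this extension point, a trace-replay
agent could override the operation factory used by
the inner loop above. This is hypothetical code:
TraceReplayAgent, Request and HttpOperation are
assumed names, and we assume create_next_operation()
is a virtual method of Agent.

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical trace-replay agent; not the production implementation.
class TraceReplayAgent : public Agent {
 public:
  explicit TraceReplayAgent(std::vector<Request> trace)
      : trace_(std::move(trace)) {}

 protected:
  // Replay the recorded production requests in order, wrapping
  // around when the end of the trace is reached.
  Operation* create_next_operation() override {
    const Request& req = trace_[next_++ % trace_.size()];
    return new HttpOperation(req);
  }

 private:
  std::vector<Request> trace_;
  std::size_t next_ = 0;
};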
METRICS AND CONTROL FUNCTIONS
We track an extensible set of metrics from the
active agents and feed them to a control function
that determines how to adjust the load. Metrics
are tracked by each agent and aggregated by the
central control function component.
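
A minimal sketch of what this aggregation could look
like, with assumed names (only p95_latency is
suggested by the controller code that follows):

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: per-agent latency samples merged into the
// central aggregate that the control function consumes.
class Metrics {
 public:
  void record_latency(double seconds) { samples_.push_back(seconds); }

  // Fold one agent's window of samples into the central aggregate.
  void merge(const Metrics& other) {
    samples_.insert(samples_.end(),
                    other.samples_.begin(), other.samples_.end());
  }

  double p95_latency() const { return percentile(0.95); }
  double p99_latency() const { return percentile(0.99); }

 private:
  // Percentile over the current window of samples.
  double percentile(double p) const {
    if (samples_.empty()) return 0.0;
    std::vector<double> sorted(samples_);
    std::sort(sorted.begin(), sorted.end());
    std::size_t idx = static_cast<std::size_t>(p * (sorted.size() - 1));
    return sorted[idx];
  }

  std::vector<double> samples_;
};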
9
class Controller {
 public:
  enum Signal { STABLE, DECREASE, INCREASE };
  virtual Signal update(const Metrics* metrics) = 0;
};
For most applications, we have found that a
simple linear controller tracking a moving window
of 95th/99th percentile operation latency
suffices:
Controller::Signal LinearController::update(const Metrics* metrics) {
  double delta = metrics->p95_latency() - limit;
  if (delta > epsilon) return DECREASE;
  if (delta < -epsilon) return INCREASE;
  return STABLE;
}
More sophisticated control functions with faster
convergence are possible, but we have not yet
explored them.
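
For illustration, the controller's signal can drive
the active agent count with a fixed linear step on
each control tick. A sketch with assumed names
(Run::apply, active_agents_, step_):

// Hypothetical: apply one control tick's signal to the agent count.
void Run::apply(Controller::Signal signal) {
  switch (signal) {
    case Controller::INCREASE: active_agents_ += step_; break;
    case Controller::DECREASE: active_agents_ -= step_; break;
    case Controller::STABLE:   break;  // hold the current load level
  }
}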
10
EXAMPLE
11
Figure 4 shows a sample result from a Lava run
testing an HTTP-based service. For this run, we
used the linear control function with a 2
millisecond threshold on the 95th percentile
latency. The top graph shows the throughput we
are getting from the system. The middle graph
shows the sliding window metrics we are
measuring. Note that the metrics can vary due to
inherent system variability and randomness. The
bottom graph shows the number of agents that are
active through the run. From the agent graph, we
can see Lava transition through the ramp-up,
search and measurement phases.
CONCLUSION
Lava is currently being used to stress test all
major systems at Instart Logic, replacing all
third-party stress test frameworks. Adoption of Lava
has reduced the length of time taken for a single
stress test experiment by an order of magnitude.
For example, our HTTP-based stress tests using
Tsung and binary search took around twenty
minutes to converge. A similar run using Lava can
converge in under five minutes.  We are in the
process of open-sourcing our Lava software as we
feel the feedback-control-based stress test
framework is widely applicable and useful.  To
read additional technical content from the
Instart Logic engineering team, visit
our technology blog.