Title: Fast Recovery + Statistical Anomaly Detection = Self-*
1. Fast Recovery + Statistical Anomaly Detection = Self-*
- RADS/KATZ CATS Panel
- June 2004 ROC Retreat
2. Outline
- Motivation and approach: complex systems of black boxes
- Measurements that respect black boxes
- Box-level micro-recovery cheap enough to survive false positives
- Differences from related efforts
- Early case studies
- Research agenda
3. Complex Systems of Black Boxes
- "...our ability to analyze and predict the performance of the enormously complex software systems that lie at the core of our economy is painfully inadequate." (Choudhury & Weikum, 2000 PITAC Report)
- Build model of acceptable operating envelope by measurement and analysis
  - Control theory, statistical correlation, anomaly detection...
- Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep system in its acceptable operating envelope
  - Increase the size of the DB connection pool (Hellerstein et al.)
  - Reallocate one or more whole machines (Lassettre et al.)
  - Rejuvenate/reboot one or more machines (Trivedi, Fox, others)
  - Shoot one of the blocked txns (everyone)
  - Induce memory pressure on other apps (Waldspurger et al.)
4. Differences from some existing problems
- Intrusion detection (Hofmeyr et al. 98, others)
  - Detections must be actionable in a way that is likely to improve the system (sacrificing availability for safety is unacceptable)
- Bug finding via anomaly detection (Engler, others)
  - Human-level monitoring/verification of detections is not feasible, due to the number of observations and the short timescales for reaction
  - Can separate recovery from diagnosis/repair (don't always need to know root cause to recover)
- Modeling/predicting SLO violations (Hellerstein, Goldszmidt, others)
  - Labeled training set not necessarily available
5. Many other examples, but the point is...
Statistical techniques identify interesting features and relationships from large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives.
Make micro-recovery so inexpensive that occasional false positives don't matter.
- Granularity of black box should match granularity of available external control mechanisms
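To make the tradeoff concrete, here is a minimal Python sketch of how much time false alarms cost under an expensive versus a cheap recovery action. The function name and every number below are illustrative assumptions, not measurements from the talk.

```python
def false_positive_overhead(false_alarms_per_day: float, recovery_seconds: float) -> float:
    """Fraction of a day spent on recoveries that were triggered by false alarms."""
    seconds_per_day = 24 * 60 * 60
    return false_alarms_per_day * recovery_seconds / seconds_per_day


# Hypothetical: an aggressive detector that raises 20 false alarms per day.
full_restart = false_positive_overhead(20, recovery_seconds=10.0)
micro_recovery = false_positive_overhead(20, recovery_seconds=0.5)

print(f"full restart:   {full_restart:.4%} of the day")
print(f"micro-recovery: {micro_recovery:.4%} of the day")
```

With the same detector, the overhead scales linearly with recovery time, which is why a cheaper recovery action lets the detection threshold be set more aggressively.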
6. Micro-recovery to survive false positives
- Goal: provide recovery management invariants
  - Salubrious: returns some part of system to known state
    - Reclaim resources (memory, DB conns, sockets, DHCP lease...)
    - Throw away corrupt transient state
    - Possibly set up to retry operation, if appropriate
  - Safe: affects only performance, not correctness
  - Non-disruptive: performance impact is small
  - Predictable: impact and time-to-complete are stable
Observe, Analyze, Act: not recovery, but continuous adaptation
7. Crash-Only Building Blocks
| Subsystem | Control point | How realized | Statistical monitoring |
|---|---|---|---|
| SSM (diskless session state store), NSDI 04 | Whole-node fast reboot (doesn't preserve state) | Quorum-like redundancy; relaxed consistency; repair cost spread over many operations | Time series of state metrics (Tarzan) |
| DStore (persistent hashtable), in preparation | Whole-node reboot (preserves state) | Quorum-like redundancy; relaxed consistency; repair cost spread over many operations | Time series of state metrics (Tarzan) |
| JAGR (J2EE application server), AMS 2003, in prep. | Microreboots of EJBs | Modify appserver to undeploy/redeploy EJBs and stall pending reqs | Anomalous code paths and component interactions (probabilistic context-free grammar) |
- Control points are safe, predictable, non-disruptive
- Crash-only design: shutdown = crash, recover = restart
- Makes state-management subsystems as easy to manage as stateless Web servers
8. Example: Managing DStore and SSM
- Rebooting is the only control mechanism
  - Has predictable effect and takes predictable time, regardless of what the process is doing
  - Like kill -9, turning off a VM, or pulling the power cord
  - Intuition: the infrastructure supporting the power switch is simpler than the applications using it
- Due to slight overprovisioning inherent in replication, rebooting can have minimal effect on throughput and latency
  - Relaxed consistency guarantees allow this to work
- Activity and state statistics collected per brick every second; any deviation => reboot brick
- Makes it as easy as managing a stateless server farm
- Backpressure at many design points prevents saturation
9. Design Lessons Learned So Far
- A spectrum of cleaning operations (Eric Anderson, HP Labs)
  - Consequence: as t → ∞, all problems will converge to repair of corrupted persistent data
- Trade unnecessary consistency for faster recovery
  - Spread recovery actions out incrementally/lazily (read repair) rather than doing it all at once (log replay)
  - Gives predictable return-to-service time and acceptable variation in performance after recovery
  - Keeps data available for reads and writes throughout recovery
- Use single-phase ops to avoid coupling/locking and the issues they raise, and justify the cost in consistency
- It's OK to say no (backpressure)
- Several places our design got it wrong in SSM
  - But even those mistakes could have been worked around by guard timers
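The read-repair idea above (fix stale replicas lazily, one key at a time, instead of replaying a log at restart) can be sketched as follows. This is a minimal illustration, not the DStore implementation: the integer version tags, the plain dicts standing in for bricks, and reading every replica rather than a quorum are all simplifying assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Each replica maps key -> (version, value). Plain dicts stand in for bricks.
Replica = Dict[str, Tuple[int, str]]


def read_with_repair(replicas: List[Replica], key: str) -> Optional[str]:
    """Return the newest value for key and lazily repair stale replicas."""
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda versioned: versioned[0])
    for r in replicas:
        # Repair cost is paid per read, and only for the key being read.
        if r.get(key, (-1, ""))[0] < newest[0]:
            r[key] = newest
    return newest[1]


# Hypothetical state: one replica is stale, one just rebooted and is empty.
replicas = [{"cart": (2, "book,pen")}, {"cart": (1, "book")}, {}]
value = read_with_repair(replicas, "cart")
```

Because a rebooted replica can serve requests immediately and catches up as keys are touched, return-to-service time does not depend on how much data needs repair.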
10. Potential Limitations and Challenges
- Hard failures
- Configuration failures
  - Although similar approach has been used to troubleshoot those
- Corruption of persistent state
  - Data structure repair work (Rinard et al.) may be combinable with automatic inference (Lam et al.)
- Challenges
  - Stability and the autopilot problem
  - The base-rate fallacy
  - Multilevel learning
  - Online implementations of SLT techniques
  - Nonintrusive data collection and storage
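As a reminder of why the base-rate fallacy matters for anomaly detection, here is a worked Bayes' rule example. The numbers are hypothetical and chosen only for illustration: a detector that catches 99% of faults, raises a false alarm on 1% of healthy observations, and watches a system where 0.1% of observations are actually faulty.

```latex
P(\text{fault} \mid \text{alarm})
  = \frac{P(\text{alarm} \mid \text{fault})\,P(\text{fault})}
         {P(\text{alarm} \mid \text{fault})\,P(\text{fault})
          + P(\text{alarm} \mid \neg\text{fault})\,P(\neg\text{fault})}
  = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999}
  \approx 0.09
```

Under these assumed rates, roughly 9 in 10 alarms are false even though the detector looks accurate, so the action taken on each alarm has to be cheap.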
11. An Architecture for Observe, Analyze, Act
- Separates systems concerns from algorithm development
- Programmable network elements provide extension of approach to other layers
- Consistent with technology trends
  - Explicit parallelism in CPU usage
  - Lots of disk storage with limited bandwidth
12. Conclusion
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
- "...Ultimately, these aspects of autonomic systems will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance." (The Vision of Autonomic Computing)
13. Breakout sessions?
- James H: Reserve some resources to deal with problems (by filtering or pre-reservation)
- Joe H: How black is the black box? What gray box prior knowledge can you exploit (so you don't ignore the obvious)?
- Joe H: Human role - can make statements about how system should act, so doesn't have to be completely hands-off training. Similarly, during training, human can give feedback about what anomalies are actually relevant (labeling).
- Lakshmi: What kinds of apps is this intended to apply to? Where do ROC-like and OASIS-like apps differ?
- Mary Baker: People can learn to game the system -> randomness can be your friend. If behaviors have small number of modes, just have to look for behaviors in the valleys
14. Breakouts
- 19 - golden nuggets to guide architecture, e.g., persistent identifiers for path-based analysis... what else?
- 8 - act: what safe, fast, predictable behaviors of the system should we expose (other than, e.g., rebooting)? Esp. those that contribute to security as well as dependability?
- 11 - architectures for different types of stateful systems: what kinds of persistent/semi-persistent state need to be factored out of apps, and how to store it, interfaces, etc.
- 20 - Given your goal of generic techniques for distributed systems, how will you know when you've succeeded/how do you validate the techniques? (What are the proof points you can hand to others to convince them you've succeeded, including but not limited to metrics?)
Aaron/Dave: Metrics. How do you know you're observing the right things? What benchmarks will be needed?
15. Open Mic
- James Hamilton - The Security Economy
16. Conclusion
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
- Toward new science in autonomic computing
- "...Ultimately, these aspects of autonomic systems will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance." (The Vision of Autonomic Computing)
17. Autonomic Technology Trends
- CPU speed increases slowing down, need more explicit parallelism
  - Use extra CPU to collect and locally analyze data; exploit temporal locality
- Disk space is free (though bandwidth and disaster-recovery aren't)
  - Can keep history of parallel as well as historical models for regression analysis, trending, etc.
- VMs being used as unit of software distribution
  - Fault isolation
  - Opportunity for nonintrusive observation
  - Action that is independent of the hosted app
18. Data collection and monitoring
- Component frameworks allow for non-intrusive data collection without modifying the applications
  - Inter-EJB calls through runtime-managed level of indirection
  - Slightly coarser grain of analysis: restrictions on legal paths make it more likely we can spot anomalies
- Aspect-oriented programming allows further monitoring without perturbing application logic
- Virtual machine monitors provide additional observation points
  - Already used by ASPs, for load balancing, app migration, etc.
  - Transparent to applications and hosted OSs
  - Likely to become the unit of software distribution (intra- and inter-cluster)
19. Optimizing for Specialized State Types
- Two single-key (Berkeley DB) get/set state stores
  - Used for user session state, application workflow state, persistent user profiles, merchandise catalogs, ...
- Replication to a set of N bricks provides durability
  - Write to subset, wait for subset, remember subset
  - DStore: state persists forever as long as ⌈N/2⌉ bricks survive
  - SSM: if client loses cookie, state is lost; otherwise, persists for time t with probability p, where t, p = F(N, node MTBF)
- Recovery = restart, takes seconds or less
  - Efficacy doesn't depend on whether replica is behaving correctly
  - SSM: node state not preserved (in-memory only)
  - DStore: node state preserved, read-repair fixes
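The "write to subset, wait for subset, remember subset" pattern can be sketched as below. This is a simplified illustration under stated assumptions, not the SSM or DStore code: the `Brick` class, the `send_to` and `wait_for` parameters, and the synchronous acknowledgements are all hypothetical stand-ins.

```python
import random
from typing import Dict, List, Optional


class Brick:
    """In-memory stand-in for one storage node (hypothetical)."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> bool:
        if self.up:
            self.data[key] = value
        return self.up  # acknowledge only if the write landed


def write(bricks: List[Brick], key: str, value: str, send_to: int,
          wait_for: int, rng: random.Random) -> Optional[List[str]]:
    """Write to a random subset, wait for enough acks, return the write set."""
    targets = rng.sample(bricks, send_to)                 # write to subset
    acked = [b.name for b in targets if b.put(key, value)]
    if len(acked) < wait_for:                             # wait for subset
        return None                                       # caller may retry or refuse
    return acked[:wait_for]                               # remember subset (e.g. in a cookie)


def read(bricks: List[Brick], key: str, write_set: List[str]) -> Optional[str]:
    """Any single surviving member of the write set can serve the read."""
    by_name = {b.name: b for b in bricks}
    for name in write_set:
        brick = by_name[name]
        if brick.up and key in brick.data:
            return brick.data[key]
    return None


bricks = [Brick(f"b{i}") for i in range(5)]
bricks[0].up = False  # one brick is already down
write_set = write(bricks, "session42", "cart=3", send_to=4, wait_for=2,
                  rng=random.Random(0))
```

Because the request depends only on getting enough acknowledgements from some subset, a slow or rebooting brick never blocks a write, and the remembered write set tells the reader where to look.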
20. Detection and recovery in SSM
- 9 state statistics collected once per second from each brick
- Tarzan time series analysis: keep N-length time series, discretize each data point
  - Count relative frequencies of all substrings of length k or shorter
  - Compare against peer bricks; reboot if at least 6 stats anomalous
  - Works for aperiodic or irregular-period signals
- Remember! We are not SLT/ML researchers!
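The steps above (discretize, count substring frequencies, compare against peers, reboot when enough statistics look anomalous) can be sketched roughly as follows. This is a simplified stand-in rather than the Tarzan algorithm itself: equal-width binning, the mean-absolute-gap score, and the `threshold` parameter are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def discretize(series: Sequence[float], low: float, high: float, bins: int = 4) -> str:
    """Map each data point to a letter by equal-width binning over [low, high]."""
    width = (high - low) / bins
    letters = []
    for x in series:
        idx = min(bins - 1, max(0, int((x - low) / width)))
        letters.append(chr(ord("a") + idx))
    return "".join(letters)


def substring_freqs(symbols: str, k: int) -> Dict[str, float]:
    """Relative frequency of every substring of length 1..k."""
    freqs: Dict[str, float] = {}
    for n in range(1, k + 1):
        counts = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
        total = sum(counts.values())
        for sub, c in counts.items():
            freqs[sub] = c / total
    return freqs


def anomaly_score(brick: str, peers: List[str], k: int = 2) -> float:
    """Mean absolute gap between one brick's frequencies and its peers' average.

    Assumes at least one peer and a non-empty symbol string.
    """
    mine = substring_freqs(brick, k)
    theirs = [substring_freqs(p, k) for p in peers]
    keys = set(mine).union(*theirs)
    gaps = []
    for key in keys:
        peer_avg = sum(t.get(key, 0.0) for t in theirs) / len(theirs)
        gaps.append(abs(mine.get(key, 0.0) - peer_avg))
    return sum(gaps) / len(gaps)


def should_reboot(scores: Sequence[float], threshold: float, min_anomalous: int = 6) -> bool:
    """Reboot when at least min_anomalous of the per-statistic scores exceed threshold."""
    return sum(s > threshold for s in scores) >= min_anomalous


# Hypothetical readings of one statistic on a healthy brick and a stuck one.
healthy = discretize([1, 2, 1, 2, 1, 2, 1, 2], low=0, high=4)
stuck = discretize([1, 2, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9], low=0, high=4)
score_healthy = anomaly_score(healthy, [healthy, healthy])
score_stuck = anomaly_score(stuck, [healthy, healthy])
```

Comparing against peers rather than against a fixed baseline means no labeled training data is needed; the cost is the assumption that most bricks behave correctly most of the time.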
21. Detection and recovery in DStore
- Metrics and algorithm comparable to those used in SSM
- We inject fail-stutter behavior by increasing request latency
- Bottom case: more aggressive detection also results in 2 unnecessary reboots
  - But they don't matter much
- Currently some voodoo constants for thresholds in both SSM and DStore
  - Trade-off of fast detection vs. false positives
22. What faults does this handle?
- Substantially all non-Byzantine faults we injected
  - Node crash, hang/timeout/freeze
  - Fail-stutter: network loss (drop up to 70% of packets randomly)
  - Periodic slowdown (e.g., from garbage collection)
  - Persistent slowdown (one node lags the others)
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- All anomalies can be safely coerced to crash faults
  - If that turned out to be the wrong thing, it didn't cost you much to try it
  - Human notified after threshold number of restarts
- These systems are always recovering
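The policy of restarting on any anomaly but notifying a human after a threshold number of restarts could look like the sketch below. The sliding time window, the class name, and the returned action strings are assumptions; the slides state only that a threshold exists.

```python
from collections import deque
from typing import Deque


class RestartBudget:
    """Restart freely, but escalate to a human after too many restarts in a window."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.restarts: Deque[float] = deque()

    def on_anomaly(self, now: float) -> str:
        # Forget restarts that have aged out of the sliding window.
        while self.restarts and now - self.restarts[0] > self.window_seconds:
            self.restarts.popleft()
        if len(self.restarts) >= self.max_restarts:
            return "notify-human"
        self.restarts.append(now)
        return "restart"


# Hypothetical policy: at most 3 restarts per 60 seconds for one brick.
budget = RestartBudget(max_restarts=3, window_seconds=60)
actions = [budget.on_anomaly(t) for t in (0, 10, 20, 30)]
```

A budget like this keeps cheap restarts as the default response while catching the case where restarting is not fixing anything, such as a hard failure.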
23. Path-based analysis and Microreboots
- Pinpoint captures execution paths through EJBs as dynamic call trees (intra-method calls hidden)
- Build probabilistic context-free grammar from these
- Detect trees that correspond to very low probability parses
- Respond by micro-rebooting (uRB) suspected-faulty EJBs
  - uRB takes 100s of msecs, vs. whole-app restart (8-10 sec)
- Component interaction analysis currently finds 55-75% of failures
- Path shape analysis detects >90% of failures but correctly localizes fewer
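A rough sketch of the path-shape idea: treat each call-tree node as a grammar production (parent component -> ordered children), estimate production probabilities from observed trees, and score a new tree by the product of its productions. This is an illustrative simplification, not Pinpoint's implementation, and the component names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# A call tree is (component_name, (child_tree, ...)).
Tree = Tuple[str, tuple]
Rule = Tuple[str, Tuple[str, ...]]


def rules(tree: Tree) -> Iterator[Rule]:
    """Yield one production (parent -> ordered child names) per tree node."""
    name, children = tree
    yield (name, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)


def learn_pcfg(trees: Iterable[Tree]) -> Dict[Rule, float]:
    """Estimate P(children | parent) from observed call trees."""
    rule_counts = Counter(r for t in trees for r in rules(t))
    parent_counts: Counter = Counter()
    for (parent, _), c in rule_counts.items():
        parent_counts[parent] += c
    return {r: c / parent_counts[r[0]] for r, c in rule_counts.items()}


def tree_probability(tree: Tree, pcfg: Dict[Rule, float]) -> float:
    """Product of production probabilities; unseen productions count as 0."""
    p = 1.0
    for r in rules(tree):
        p *= pcfg.get(r, 0.0)
    return p


# Hypothetical traffic: 99 normal requests, 1 where CartBean never reaches the DB.
normal: Tree = ("Servlet", (("CartBean", (("DB", ()),)),))
rare: Tree = ("Servlet", (("CartBean", ()),))
pcfg = learn_pcfg([normal] * 99 + [rare])
p_normal = tree_probability(normal, pcfg)
p_rare = tree_probability(rare, pcfg)
```

Trees whose probability falls below a chosen cutoff would be flagged, and the components appearing in their unlikely productions are candidates for a microreboot.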
24. Crash-Only Design Lessons from SSM
- Eliminate coupling
  - No dependence on any specific brick, just on a subset of minimum size, even at the granularity of individual requests
  - Not even across phases of an operation: single-phase nonblocking ops only => predictable amount of work/request
  - Use randomness to avoid deterministic worst cases and hotspots
  - We initially violated this guideline by using an off-the-shelf JMS implementation that was centralized
- Make parts interchangeable
  - Any replica in a write-set is as good as any other
  - Unlike erasure coding, only need 1 replica to survive
  - Cost is higher storage overhead, but we're willing to pay that to get the self-* properties
25. Enterprise Service Workloads
| Observation | Consequence |
|---|---|
| Internet service workloads consist of large numbers of independent users | Large number of independent samples gives basis for success of statistical techniques |
| Even a flaky service is doing mostly the right thing most of the time | Steady-state behavior can be extracted from normal operation |
| Heavy traffic volume means most of the service is exercised in a relatively short time | Baseline model can be learned rapidly and updated in place periodically |
3. We can continuously extract models from the
production system orthogonally to the application
26. Building models through measurement
- Finding bugs using distributed assertion sampling (Liblit et al., 2003)
  - Instrument source code with assertions on pairs of variables (features)
  - Use sampling so that any given run of program exercises only a few assertions (to limit performance impact)
  - Use classification algorithm to identify which features are most predictive of faults (observed program crashes)
- Goal: bug finding
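The sampling step can be illustrated with a small sketch in which each assertion site is checked only on a random fraction of executions. The per-check coin flip, the class name, and the site label are assumptions for illustration and are not taken from the cited work; the classification step is omitted.

```python
import random
from collections import Counter


class SampledAssertions:
    """Check each assertion site only on a random fraction of executions."""

    def __init__(self, rate: float, rng: random.Random) -> None:
        self.rate = rate
        self.rng = rng
        self.observed: Counter = Counter()  # site -> times actually checked
        self.violated: Counter = Counter()  # site -> times the assertion failed

    def check(self, site: str, holds: bool) -> None:
        if self.rng.random() >= self.rate:
            return  # skipped: this execution pays almost nothing
        self.observed[site] += 1
        if not holds:
            self.violated[site] += 1


# Hypothetical run: one assertion on a pair of variables, sampled at 1%.
sampler = SampledAssertions(rate=0.01, rng=random.Random(1))
for x, y in zip(range(10_000), range(1, 10_001)):
    sampler.check("x <= y", x <= y)
```

Each run reports only sparse counts, so the per-run overhead stays low; aggregating `observed` and `violated` across many runs supplies the features that a classifier can then correlate with crashes.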
27. JAGR: JBoss with Micro-reboots
- Performability of RUBiS (goodput/sec vs. time)
- Vanilla JBoss w/manual restarting of app-server, vs. JAGR w/automatic recovery and micro-rebooting
- JAGR/RUBiS does 78% better than JBoss/RUBiS
- Maintains 20 req/sec, even in the face of faults
- Lower steady-state after recovery in first graph: class reloading, recompiling, etc., which is not necessary with micro-reboots
- Also used to fix memory leaks without rebooting whole appserver
28. Fast Recovery + Statistical Anomaly Detection = Self-*
- Armando Fox and Emre Kiciman, Stanford University; Michael Jordan, Randy Katz, David Patterson, Ion Stoica, University of California, Berkeley
- SoS Workshop, Bertinoro, Italy