Title: Enabling Grids for E-sciencE
Early failure detection: a method and some applications
Cécile Germain-Renaud and Ales Krenek
Purpose and Method
- A generic mechanism for autonomic detection of EGEE failures involving abrupt changes in the behavior of relevant quantities available from existing monitors
- Complex distributions preclude naive thresholding
- Page-Hinkley statistics
- Minimizes the expected time before a change detection for a fixed false positive rate
- Raises an alarm on a threshold test
- Standalone use at the CE level, or by the end user, is possible
- Best usage: as WMS input for temporary blacklisting
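
The Page-Hinkley test named above can be sketched as follows. This is a minimal illustration of the textbook form of the test for an upward change in the mean, not the authors' code. The class and function names are hypothetical. The parameter `delta` is assumed to play the role of the smooth-change tolerance, and `threshold` is the alarm level.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean (illustrative)."""
    delta: float          # tolerated smooth change (assumed meaning)
    threshold: float      # alarm level for the test statistic
    n: int = 0            # number of samples seen so far
    mean: float = 0.0     # running mean of the samples
    cum: float = 0.0      # cumulative deviation m_t
    cum_min: float = 0.0  # running minimum M_t of the cumulative deviation

    def update(self, x: float) -> bool:
        """Feed one sample; return True if the alarm condition holds."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(series, delta, threshold):
    """Return the index of the first alarm in the series, or None."""
    detector = PageHinkley(delta, threshold)
    for t, x in enumerate(series):
        if detector.update(x):
            return t
    return None


if __name__ == "__main__":
    # Synthetic rate series: a stable level followed by an abrupt jump.
    rates = [10.0] * 50 + [30.0] * 50
    print(first_alarm(rates, delta=1.0, threshold=50.0))
```

The detector keeps only four numbers of state per monitored quantity, which is what would make standalone use at a CE cheap. A larger `threshold` delays the alarm but makes false alarms less likely.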
Using arrival and service rates
- Average number of jobs in/out per reasonable unit of time
- Immediate computation from local monitors (e.g. Torque logs), from the Real Time Monitor, from the Job Provenance, from LB
- Misbehaviour: unusually high arrival and service rates
- Aggregated at the CE and correlated (arrival and service): suspicion of a blackhole
- Aggregated per VO: suspicion of a software bug
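
The rate computation described above can be sketched as a simple binning of job events. The sketch assumes that job records have already been extracted from a monitor as hypothetical `(arrival_time, completion_time)` pairs in seconds, with `None` for jobs that have not completed. Parsing Torque logs or querying the Real Time Monitor, Job Provenance, or LB is not shown, and the function name `job_rates` is illustrative.

```python
def job_rates(jobs, bin_seconds, start, end):
    """Count arrivals and completions per time bin.

    jobs: iterable of (arrival_time, completion_time) in seconds;
          completion_time may be None for jobs still running.
    Returns two lists of counts, one entry per bin between start and end.
    """
    n_bins = int((end - start) // bin_seconds) + 1
    arrivals = [0] * n_bins
    services = [0] * n_bins
    for arrived, completed in jobs:
        for t, counts in ((arrived, arrivals), (completed, services)):
            # Ignore missing timestamps and events outside the window.
            if t is not None and start <= t <= end:
                counts[int((t - start) // bin_seconds)] += 1
    return arrivals, services


if __name__ == "__main__":
    # Four made-up jobs, binned per minute over three minutes.
    jobs = [(0, 30), (10, 70), (65, None), (130, 140)]
    arrivals, services = job_rates(jobs, bin_seconds=60, start=0, end=179)
    print(arrivals, services)
```

Running the binning once per CE and once per VO gives the two aggregations listed above. Each resulting series can then be fed to a change detector.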
Data from LAL
[Figure: data from LAL; plot annotations: 33 minutes, 28 minutes]
From examples to operational
Tuning the threshold and the smooth-change tolerance parameter d requires examples of normal and pathological behaviors.
d controls the risk of missing an event (false negative) and the timeliness of the alarm.
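
One way to use labelled examples for tuning is to score candidate parameter pairs against them. The sketch below is only an illustration under stated assumptions: it uses a compact textbook Page-Hinkley test for an upward mean change, treats `delta` as the tolerance d, and assumes each pathological example comes with a known change index. All names and data are hypothetical.

```python
def ph_first_alarm(series, delta, threshold):
    """Index of the first Page-Hinkley alarm (upward change), or None."""
    mean = cum = cum_min = 0.0
    for n, x in enumerate(series, start=1):
        mean += (x - mean) / n
        cum += x - mean - delta
        cum_min = min(cum_min, cum)
        if cum - cum_min > threshold:
            return n - 1
    return None


def evaluate(normal, pathological, delta, threshold):
    """Score one (delta, threshold) pair on labelled example series.

    normal: list of series known to contain no failure.
    pathological: list of (series, change_index) pairs.
    """
    false_pos = sum(
        ph_first_alarm(s, delta, threshold) is not None for s in normal
    )
    missed, early, delays = 0, 0, []
    for series, change in pathological:
        t = ph_first_alarm(series, delta, threshold)
        if t is None:
            missed += 1                # false negative
        elif t < change:
            early += 1                 # alarm raised before the real change
        else:
            delays.append(t - change)  # detection delay in samples
    return {"false_positives": false_pos, "missed": missed,
            "early": early, "delays": delays}


if __name__ == "__main__":
    normal = [[10.0] * 100]
    pathological = [([10.0] * 50 + [30.0] * 50, 50)]
    for delta in (0.5, 1.0, 5.0):
        for threshold in (20.0, 50.0, 200.0):
            print(delta, threshold,
                  evaluate(normal, pathological, delta, threshold))
```

Scanning a small grid like this shows the trade-off between missed events, false alarms, and detection delay on the chosen examples.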
Data from the GridPP Real Time Monitor
The code for running the test on archived data, including both the extraction of the quantities of interest and the test itself, will be released through the Grid Observatory.
EGEE-II INFSO-RI-031688
www.lri.fr/Demain