D0 reprocessing on OSG facilities. Issue report.

Provided by: andrewba6

Learn more at: https://pingprod.fnal.gov

Transcript and Presenter's Notes

1
D0 reprocessing on OSG facilities. Issue report.
Heuristic method of problem analysis

A. Baranovski 02/26/07

2
Overview

An approach to generalizing issue detection and
classification in a distributed environment.
The test data was collected from the D0 data
reprocessing effort on OSG facilities using
SAMGrid.
Actual issue reports

The next slide shows the SAMGrid high-level data
processing workflow.
Each step has distinct failure patterns that vary
in time and place of occurrence.

4
SAMGrid processing steps

Submit jobs
Assign job (to cluster worker)
Get pilot executable
Get D0 runtime (and unpack)
Get raw file
Run
Store output
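
The ordered steps above can be written down as a small enumeration, which makes "which step did the job fail at" a simple lookup. This is only an illustrative sketch: the names `Step` and `first_failed_step` are hypothetical and are not part of SAMGrid.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """Processing steps in the order listed on the slide (names are illustrative)."""
    SUBMIT = 1
    ASSIGN = 2
    GET_PILOT = 3
    GET_RUNTIME = 4
    GET_RAW_FILE = 5
    RUN = 6
    STORE_OUTPUT = 7


def first_failed_step(last_completed: Optional[Step]) -> Optional[Step]:
    """Return the step a job failed at, or None if every step completed.

    `last_completed` is None when not even the submission succeeded.
    """
    if last_completed is None:
        return Step.SUBMIT
    if last_completed is Step.STORE_OUTPUT:
        return None
    # The failed step is the one right after the last completed step.
    return Step(last_completed + 1)
```

For example, a job whose last completed step is `Step.ASSIGN` is reported as having failed at `Step.GET_PILOT`.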

5
Problem analysis

In a distributed environment, prompt problem
detection is the major consideration to ensure
stable output
Tens of thousands of log files
We can't look into each and every one of them
(and neither can you)!
Looking at individual problem reports does not
help reconstruct the overall picture.

6
Problem analysis

Ideally, we need a tool that would display
success/failure statistics for every
step/substep of the processing.
In the past, farmers did that job by hand
Now that we have drawn the line between resource
providers and resource users (us), more
automation is needed
Unfortunately, we have not had enough focus on
that (and some things are even out of our
control)
Even if we had time, the tool would be
constrained to a predetermined set of monitoring
parameters.
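
A minimal sketch of what such per-step statistics could look like, assuming each job is summarized by the name of the step it failed at (or `None` for success). The step names and the function `step_statistics` are hypothetical; no such tool is described in the slides.

```python
from collections import Counter

# Step names are illustrative labels for the SAMGrid processing steps.
STEPS = ["submit", "assign", "get_pilot", "get_runtime",
         "get_raw_file", "run", "store_output"]


def step_statistics(failed_steps):
    """Tally attempts and failures per step.

    `failed_steps` holds one entry per job: the name of the step the job
    failed at, or None if the job completed every step.
    Returns {step: (attempts, failures)}.
    """
    attempts = Counter()
    failures = Counter()
    for failed in failed_steps:
        for step in STEPS:
            attempts[step] += 1
            if step == failed:
                failures[step] += 1
                break  # later steps were never attempted by this job
    return {step: (attempts[step], failures[step]) for step in STEPS}
```

Note that this inherits the limitation named above: it can only report on the steps that were decided in advance.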

7
Plots

I've been trying to put together several plots on
efficiency and generalized success rates
Plots did a terrific job aggravating upper
management
They have been of little help in actually
addressing the underlying issues
Way too generic

8
What about log files?

Log files are a popular means of dumping
information that may be relevant for future
problem diagnostics
Free-form text that represents the evolution of
the workflow context over time
Verbosity of the log files typically sets the
depth of the context changes
Problem log files can hardly be formalized for
machine processing.

9
Log size

Log files do have one feature in common: their
size.
Overly verbose SAMGrid log files can be broken
into size categories that correlate with distinct
workflow outcomes.
Log file size categories relate to outcomes that
do not change from job to job
All jobs are essentially the same (production!),
with little variation in the parameters that
would impact log size
Log file size is very easy to observe
We want to look at representatives and make a
report
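
The idea of grouping logs by size and looking at one representative per group can be sketched as below. The 5 KB bin width and the function names are assumptions made for illustration; the slides do not say how the categories were actually derived.

```python
import os
from collections import defaultdict


def group_logs_by_size(log_dir, bin_kb=5):
    """Group the files in `log_dir` into size bins of `bin_kb` kilobytes.

    Returns {bin_lower_bound_kb: [file paths]}. The bin width is an
    assumption, not a value taken from the presentation.
    """
    bins = defaultdict(list)
    for entry in os.scandir(log_dir):
        if entry.is_file():
            size_kb = entry.stat().st_size // 1024
            bins[(size_kb // bin_kb) * bin_kb].append(entry.path)
    return dict(bins)


def representatives(bins):
    """Pick one file per size bin (the first by name) to inspect by hand."""
    return {lower: sorted(paths)[0] for lower, paths in bins.items()}
```

Reading one representative per bin replaces reading tens of thousands of logs, on the assumption that files of similar size share an outcome.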

10
Examples

0K: job was never assigned
5K: core dumped at the worker node
10-20K: job failed to get pilot tarball
75-100K: failure to download D0 RTE/establish
connection to SAM, timeout for raw files from SAM
100-120K: timeout/failure staging raw files from
SAM storage
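
The size ranges above can be turned into a lookup table. The sketch below is illustrative only: the rounding to whole kilobytes, the inclusive bounds, and the choice to resolve the shared 100K boundary in favor of the first listed range are all assumptions, and `classify` is a hypothetical name.

```python
# (low_kb, high_kb, outcome) taken from the examples on the slide.
SIZE_CATEGORIES = [
    (0, 0, "job was never assigned"),
    (5, 5, "core dumped at the worker node"),
    (10, 20, "job failed to get pilot tarball"),
    (75, 100, "failure to download D0 RTE/establish connection to SAM, "
              "timeout for raw files from SAM"),
    (100, 120, "timeout/failure staging raw files from SAM storage"),
]


def classify(size_bytes):
    """Map a log file size to the outcome listed for its size range.

    Returns None for sizes outside every listed range; such logs are
    candidates for a closer look.
    """
    size_kb = round(size_bytes / 1024)  # assumption: sizes are quoted in rounded KB
    for low, high, outcome in SIZE_CATEGORIES:
        if low <= size_kb <= high:
            return outcome  # first match wins, including at the 100K boundary
    return None
```

A log of about 15 KB is classified as a pilot tarball failure, while a 50 KB log matches nothing and is returned as `None`.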

11
Method overall

The method gives some assurance that individual
problem reports are representative.
Can be helpful to detect atypical failures
Indirect
Several different failure scenarios will manifest
in similar output.
However

12
Log file size for the past 20 days
Misconfig./incompat. cluster node
No assign/no std output
OSG middleware failure
SAMGrid forwarding node crash
RTE/raw/generic SAM problems
Please use zoom on the next slide
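
A view of log sizes over a recent time window can be tabulated as sketched below. This is not how the plot on the slide was produced; the input format (pairs of date and size in bytes), the 5 KB bin width, and the function name `daily_size_counts` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def daily_size_counts(records, today, days=20, bin_kb=5):
    """Count log files per day and per size bin over the last `days` days.

    `records` is an iterable of (date, size_bytes) pairs (assumed format).
    Returns {date: Counter({bin_lower_bound_kb: count})}.
    """
    start = today - timedelta(days=days - 1)
    table = defaultdict(Counter)
    for day, size_bytes in records:
        if start <= day <= today:
            size_kb = size_bytes // 1024
            table[day][(size_kb // bin_kb) * bin_kb] += 1
    return dict(table)
```

A sudden rise in one size bin on a given day then points at the outcome associated with that bin.
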

13
(No Transcript)

14
Summary