D0 reprocessing on OSG facilities. Issue report.

1
D0 reprocessing on OSG facilities. Issue report.
Heuristic method of problem analysis
  • A. Baranovski 02/26/07

2
Overview
  • An approach to generalizing issue detection and
    classification in a distributed environment.
  • The test data were collected from the D0 data
    reprocessing effort on OSG facilities using
    SAMGrid.
  • Actual issue reports

3
  • The next slide shows the SAMGrid high-level data
    processing workflow.
  • Each step has distinct failure patterns that vary
    in time and place of occurrence.

4
SAMGrid processing steps
  • Submit jobs
  • Assign job (to a cluster worker)
  • Get pilot executable
  • Get D0 run-time environment (and unpack)
  • Get raw file
  • Run
  • Store output
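
The ordered steps above can be sketched as an enumeration, so that a job's failure can be attributed to the first step it did not complete. This is only an illustration: the names `Step` and `failed_step` are hypothetical and are not part of SAMGrid.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """Processing steps in the order listed on the slide (names are assumed)."""
    SUBMIT = 1
    ASSIGN = 2
    GET_PILOT = 3
    GET_RUNTIME = 4
    GET_RAW_FILE = 5
    RUN = 6
    STORE_OUTPUT = 7


def failed_step(last_completed: Optional[Step]) -> Optional[Step]:
    """Return the step a job failed at, given the last step it completed.

    Passing None means no step completed. A return value of None means
    the job completed every step.
    """
    if last_completed is None:
        return Step.SUBMIT
    if last_completed == Step.STORE_OUTPUT:
        return None
    return Step(last_completed + 1)
```

For example, a job whose last completed step is `Step.ASSIGN` is reported as failing at `Step.GET_PILOT`.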

5
Problem analysis
  • In a distributed environment, prompt problem
    detection is a major consideration for ensuring
    stable output
  • Tens of thousands of log files
  • We can't look into each and every one of them
    (and neither can you)!
  • Looking at individual problem reports does not
    help reconstruct the overall picture.

6
Problem analysis
  • Ideally, we need a tool that would display
    success/failure statistics for every
    step/substep of the processing.
  • In the past, farmers were doing that job by hand
  • Now that we have drawn the line between resource
    providers and resource users (us), more
    automation is needed
  • Unfortunately, we have not had enough focus on
    that (and some things are even out of our
    control)
  • Even if we had time, the tool would be
    constrained to a predetermined set of monitoring
    parameters.
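
A minimal sketch of the kind of per-step success/failure table described above. It assumes each job outcome is already available as a `(step_name, succeeded)` pair; that record format and the function names are assumptions made for illustration, not an existing SAMGrid interface.

```python
from collections import defaultdict


def step_statistics(records):
    """Tally successes and failures per processing step.

    records: iterable of (step_name, succeeded) pairs (assumed format).
    Returns a dict mapping step_name -> {"ok": n, "failed": m}.
    """
    stats = defaultdict(lambda: {"ok": 0, "failed": 0})
    for step, succeeded in records:
        stats[step]["ok" if succeeded else "failed"] += 1
    return dict(stats)


def format_report(stats):
    """Render one text line per step with counts and success rate."""
    lines = []
    for step, counts in stats.items():
        total = counts["ok"] + counts["failed"]
        rate = 100.0 * counts["ok"] / total
        lines.append(
            f"{step:<14} ok={counts['ok']:<6} "
            f"failed={counts['failed']:<6} success={rate:.0f}%"
        )
    return "\n".join(lines)
```

As the last bullet notes, a table like this only covers the steps someone decided to record in advance.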

7
Plots
  • I've been trying to put together several plots on
    efficiency and generalized success rates
  • Plots did a terrific job aggravating upper
    management
  • They have been of little help in actually
    addressing the underlying issues
  • Way too generic

8
What about log files?
  • Log files are a popular means of dumping
    information that may be relevant for future
    problem diagnostics
  • Free-form text that represents the evolution of
    the workflow context in time
  • Verbosity of the log files typically sets the
    depth of the context changes
  • Problem: log files can hardly be formalized for
    machine processing.

9
Log size
  • Log files do have one feature in common: their
    size.
  • Overly verbose SAMGrid log files can be broken
    into size categories that correlate with distinct
    workflow outcomes.
  • Log file size categories relate to outcomes that
    do not change from job to job
  • All jobs are essentially the same (production!)
  • Little variation in parameters that would affect
    log size
  • Log file size is very easy to observe
  • We want to look at representatives and make a
    report
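
The idea of grouping logs by size and inspecting one representative per group can be sketched as follows. The 5 KB bucket width and all function names are assumptions chosen for illustration; in practice the width would be tuned to the size clusters actually observed.

```python
import os
from collections import defaultdict

BUCKET_KB = 5  # assumed bucket width, not a value from the talk


def bucket_of(size_bytes, width_kb=BUCKET_KB):
    """Return the lower edge (in KB) of the size bucket for a log file."""
    return (size_bytes // 1024 // width_kb) * width_kb


def log_sizes(directory):
    """Map each regular file in `directory` to its size in bytes."""
    return {
        entry.path: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def representatives(sizes):
    """Group files by size bucket.

    sizes: dict mapping file path -> size in bytes.
    Returns {bucket_kb: (file_count, example_path)}, where the example is
    simply the alphabetically first path in the bucket.
    """
    groups = defaultdict(list)
    for path, size in sizes.items():
        groups[bucket_of(size)].append(path)
    return {
        bucket: (len(paths), sorted(paths)[0])
        for bucket, paths in sorted(groups.items())
    }
```

Calling `representatives(log_sizes("/path/to/logs"))` (the path is a placeholder) yields one file to read per size category, together with how many jobs that file stands for.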

10
Examples
  • 0K: job has never been assigned.
  • 5K: core dumped at the worker node
  • 10-20K: job failed to get pilot tarball
  • 75-100K: failure to download D0 RTE/establish
    connection to SAM, timeout for raw files from SAM
  • 100-120K: timeout/failure staging raw files (SAM
    storage)
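
The examples above amount to a lookup table from log size to likely outcome. A minimal sketch, with loudly stated assumptions: the slide gives only approximate sizes, so the exact boundaries below ("0K" as under 1 KB, "5K" as 4 to 6 KB, half-open ranges) are guesses for illustration, and sizes outside every range are reported as unclassified.

```python
# (low_kb, high_kb, outcome); ranges are half-open [low, high).
# Boundaries are assumptions: the talk lists only approximate sizes.
SIZE_CATEGORIES = [
    (0, 1, "job never assigned"),
    (4, 6, "core dumped at the worker node"),
    (10, 20, "failed to get pilot tarball"),
    (75, 100, "failed to download D0 RTE / connect to SAM / raw-file timeout"),
    (100, 120, "timeout or failure staging raw files"),
]


def classify(size_bytes):
    """Map a log file size in bytes to the outcome suggested by its size."""
    size_kb = size_bytes / 1024
    for low, high, outcome in SIZE_CATEGORIES:
        if low <= size_kb < high:
            return outcome
    # Sizes that match no known category are candidates for a closer look.
    return "unclassified"
```

A log that lands in "unclassified" does not match any known pattern and is worth reading by hand.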

11
Method overall
  • The method gives some assurance that individual
    problem reports are representative.
  • Can help detect atypical failures
  • Indirect
  • Several different failure scenarios will manifest
    in similar output.
  • However

12
Log file size for the past 20 days
  • Plot labels:
  • Misconfig./incompat. Cluster node
  • No Assign/no std output OSG middleware failure
  • SAMGrid forwarding node crash
  • RTE/Raw/ Generic SAM problems
  • Please use zoom on the next slide
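
The data behind a "log file size over the past 20 days" plot could be gathered roughly as below. This is a sketch under assumptions: the `(timestamp, size)` record format, the 5 KB bucket width, and the function name are hypothetical, and the actual plot in the talk may have been produced differently.

```python
from collections import Counter
from datetime import timedelta


def daily_size_counts(records, now, days=20, width_kb=5):
    """Count log files per (day, size bucket) within a recent time window.

    records: iterable of (datetime, size_bytes) pairs (assumed format).
    now: datetime marking the end of the window.
    Returns a Counter keyed by (date, bucket_lower_edge_kb).
    """
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for when, size in records:
        if cutoff <= when <= now:
            bucket = (size // 1024 // width_kb) * width_kb
            counts[(when.date(), bucket)] += 1
    return counts
```

Each key of the result is one cell of the plot: a day on the time axis and a size bucket on the other axis.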

13
(No Transcript)

14
Summary
  • 50% of failures (red) are related to issues with
    OSG middleware
  • Need to work with individual sites (time
    consuming)
  • Jobs indeed stay in the idle queue > 5 days
    (unlikely)
  • 50% (blue) are related to SAMGrid
  • Forwarding node crashes
  • Failure to talk to SAM due to jobs overstaying in
    idle queues
  • Storage node failures (including SPRACE dCache)