LHCb experience with GGUS and neighbors - PowerPoint PPT Presentation

Slides: 17
Provided by: NIC8207
1
LHCb experience with GGUS and neighbors
2
Outlook
  • History of the LHCb problem management system
  • GGUS experience
  • The current use of GGUS
  • Beyond the current system
  • Improvements/wishes for GGUS and SFT

3
History breakdown
  • For more than a year, GGUS has been one of the
    main channels for reporting site-related problems
    (and others): the 1 FTE dedicated to chasing up
    problems has always gone through it.
  • For the last three months, everybody in LHCb has
    gone through GGUS to submit tickets to ROCs and
    other User Support Units.
  • LHCb has always been one of the main consumers
    of GGUS, both in the pioneering era and in the
    production regime.
  • GGUS is part of the LHCb mechanism for handling
    all collaboration problems; the system is under
    continuous refinement and improvement toward a
    fully automatic problem management system.

4
The pre-history
  • An experiment-specific contact person at each
    (LHCb) T1 site was mandated to chase up
    site-specific problems. Problems were circulated
    on a (general-purpose) experiment mailing list,
    so that the relevant people were informed and the
    problem was correctly tackled.
  • Quick reaction once the contact person realized
    the problem fell within his competence
  • Efficient on-site resolution of problems (a
    problem, once addressed, was solved)
  • Mailboxes flooded with messages about site-related
    problems (not necessarily interesting to all)
  • Lack of more detailed information (such as
    LCG-specific details, e.g. the CondorG logs on
    the RB)
  • No traceability of problems, no political power
    to push site admins, no statistics or history
    easily available

5
  • Acron jobs on LHCb UIs analyze the last day's jobs
    across the Grid; if a site has more failures than
    a given threshold, they alert a restricted pool
    of people by e-mail to investigate.
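
The check described above can be sketched as follows. This is a minimal illustration, not the actual LHCb script: the job record layout (site, finish time, success flag), the default threshold of 10 and the alert wording are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def sites_over_threshold(jobs, now, threshold=10, window=timedelta(days=1)):
    """Return {site: failure_count} for sites exceeding the threshold.

    `jobs` is an iterable of (site, finished_at, succeeded) tuples.
    This record layout is hypothetical, chosen only for the sketch.
    """
    cutoff = now - window
    # Count only failed jobs that finished inside the time window.
    failures = Counter(
        site
        for site, finished_at, succeeded in jobs
        if not succeeded and cutoff <= finished_at <= now
    )
    return {site: n for site, n in failures.items() if n > threshold}


def format_alert(site, n_failures, threshold):
    """Build the one-line alert text sent to the pool of experts."""
    return (
        f"[LHCb alarm] {site}: {n_failures} failed jobs "
        f"in the last day (threshold {threshold})"
    )
```

A scheduled job would call `sites_over_threshold` once per run and mail one `format_alert` line per offending site to the restricted pool of people.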

6
  • During LHCb hard production (DC04, its 2004-2005
    re-run, physics productions, occasional
    productions of other flavors, stripping and so
    forth) we (the unlucky pool of experts) eventually
    received up to 30 such messages per day, each one
    needing to be correctly forwarded to the relevant
    people
  • What are those unlucky experts supposed to do?
  • Analyze the problem
  • Integrate further information from the LCG side
  • Retrieve the logs of the central services
    (RB(s), LFC, FTS)
  • Send e-mail to the relevant people with
    well-compiled information and hints for fixing it
  • Interact with the site managers to provide
    further (required) information
  • Continuously look for the best definition of the
    LHCb policies and procedures for dealing with
    unresponsive sites
  • Example (not so far from reality):
  • The problem is spotted and reported to the site
    admin directly.
  • The site doesn't react as expected.

7
Before GGUS: the history (direct mail
notification to the site)
  • Home-cooked CIC/ROC operational infrastructure,
    without the political power needed to push sites
  • Extremely time-consuming for the poor LHCb experts
  • Difficult to keep history and statistics about
    problems (so far) unless further development is
    foreseen
  • Traceability of the problems
  • Pursuing a problem is extremely difficult given
    the lack of procedures and manpower
  • Efficient, because you live the problem with the
    site managers and interactively help them fix it
  • Personal enrichment: knowledge of the plethora of
    site-related problems (see the Maradona hints
    collected during months of site debugging
    activity)
  • http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output2e2e2e

8
GGUS vs LHCb: a useful interaction
  • During this year we gave the GGUS team many
    suggestions for improving the system in ways
    tailored to LHCb needs
  • Notification of updated tickets with a clear
    description in the subject of the ticket itself,
    for easy and immediate association with the
    problem handled. Supporters are no longer obliged
    to open the mail from GGUS.
  • Possibility of browsing tickets by the range of
    dates in which they were submitted or by the VO
    the submitter belongs to.
  • Bugs: carbon-copy recipients not notified by the
    GGUS system
  • Notified developers about a spammer attack via
    GGUS (Nov. 2005)
  • Maximum attachment size and mail body length
    that prevented submission of tickets via mail.
  • Problem submitting tickets via a particular
    browser (Mozilla)
  • ...and many others that we didn't archive in our
    historical records, only in private conversations
    with Flavia and Co.!

9
The present.
  • Graciani's Robot is currently the main provider
    of alarms; we simply shifted the way of reporting
    problems to sites and/or Central Services
    providers (LFC/FTS/DPM, CASTOR and so forth) from
    mail to GGUS
  • It's still a big effort for just a few (2) people
  • More information still needs to be collected
    manually

10
An intermediate solution: a cgi-bin application
  • Everybody can easily follow the status of a
    problem at a given site; all LHCb production
    managers can react to new problems
  • Everybody can automatically fetch, and provide
    through GGUS, useful information for debugging
    the site problem.
  • The history of the problems experienced at a
    given site is immediately available.
  • Good incentive for chasing up a problem:
    high-impact visualization using different colors
    and an increasing number of rows for each day
    spent fixing the problem.
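
The "one more row per day" visualization can be illustrated with a short sketch. The color scale and the row layout below are assumptions made for the example; the slides do not say which colors or thresholds the real cgi-bin application uses.

```python
from datetime import date


def severity_colour(day_number):
    """Map the age of a problem (in days) to a display color.

    The scale is hypothetical, chosen only for this sketch.
    """
    if day_number <= 1:
        return "green"
    if day_number <= 3:
        return "orange"
    return "red"


def status_rows(site, opened, today):
    """Return one (site, day_number, color) row per day the problem is open."""
    days_open = (today - opened).days + 1  # the opening day counts as day 1
    return [
        (site, day, severity_colour(day))
        for day in range(1, days_open + 1)
    ]
```

A problem opened three days ago therefore occupies four rows, the last of them in the most alarming color, which is what makes a long-standing problem hard to ignore on the page.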

11
The mechanism
(Soon the new agent director will do this task.)
12
The web interface
13
What next?
  • Reduce the load of this kind of site-debugging
    work by
  • making the mechanism fully automatic (creating
    tickets via mail from the same process that
    handles problems)
  • leaving the rest of the work to the LCG/gLite
    operational infrastructure.
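
Creating a ticket by mail from the process that detects the problem could look roughly like this. It is a sketch using only the Python standard library; the subject format and the addresses are hypothetical, since the slides do not give the mail format that GGUS expects.

```python
from email.message import EmailMessage


def build_ticket_mail(site, summary, details, sender, ticket_address):
    """Compose a ticket-submission mail for one site problem.

    `ticket_address` and the "[LHCb] site: summary" subject format are
    placeholders for this sketch, not the real GGUS mail interface.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ticket_address
    msg["Subject"] = f"[LHCb] {site}: {summary}"
    msg.set_content(details)
    return msg
```

The message returned here could then be handed to `smtplib.SMTP(...).send_message(msg)`; building and sending are kept separate so the composition step can be tested without a mail server.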

14
And SFT/FCR?
  • LHCb, first among the HEP experiments, built its
    own experiment-specific test for the Site
    Functional Test (SFT).
  • The SFT framework is currently used for
    installing/checking application software.
  • LHCb has never used FCR for automatic exclusion
    of faulty sites:
  • To maximize the computing power from the Grid
  • To avoid a merely temporary problem caught by SFT
    turning into a loss of valuable resources
  • SFT will soon be considered yet another source
    of alarms (as the old Graciani's Robot is).
  • An operator (not necessarily from LHCb) will take
    care of all problems discovered by SFT by
    submitting GGUS tickets accordingly
  • The big advantage in this case is that no real
    LCG/gLite jobs are classified as failed in the
    LHCb statistics.

15
(No Transcript)
16
The future (LHCb Suggestions/Improvements/Wishes)
  • GGUS should not just dispatch the problem to the
    right unit but also maintain know-how that helps
    the unit fix the problem (a knowledge DB that
    keeps the strategies adopted for fixing analogous
    past problems)
  • Give a well-defined group of production managers
    the possibility (on demand) of handling/updating
    GGUS tickets (VOMS in GGUS?)
  • LHCb SFT tests should not only be used for
    installing software (as now) but also be
    integrated with GGUS, so that tickets get
    submitted in a way completely transparent to
    LHCb, ensuring (enforcing) the readiness of the
    sites
  • The dream is to have a fully automatic system
    (GGUS integrated) where human intervention (on
    the LHCb side) is minimized and where the
    failures of the past serve as a good lesson for
    the future.