USCMS Integration Grid Testbed

12.13.2002, Chicago. Troubleshooting and Fault Tolerance in Grid Environments.

Transcript
1
USCMS Integration Grid Testbed
Deployment, Operation and Troubleshooting
  • Anzar Afaq
  • Fermilab

2
Introduction
  • Test, Integration, and Production Grid testbeds.
  • Tier-I/Tier-II resources combined into a common Grid effort.
  • About 230 CPUs (750 MHz equivalent, RedHat Linux 6.1).
  • An additional 80 CPUs at 2.4 GHz running RedHat Linux 7.X.
  • About 5 TB of local disk space, plus Enstore mass storage at FNAL
    accessed via dCache.
  • Sites: Caltech, Fermilab, U Florida, UC San Diego (with UW Madison
    support).
  • CERN/LCG (with 80 CPUs) is joining.
  • Dedicated resources.
  • Middleware: Globus and Condor (VDT 1.1.3 Server and Client).
  • Running two assignments:
    • Produce 1 million eGamma Bigjets events by Christmas 2002, all
      steps (720K done).
    • Produce 500K additional events, cmsim step only (50K done!).

3
Results (so far)
  • Time to process 1 event: 500 sec @ 750 MHz.
  • Speedup: average factor of 100 during the current run.
  • Resources: approximately 230 CPUs @ 750 MHz equivalent.
  • Sustained efficiency: about 43.5%.
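The efficiency figure follows directly from the speedup and the CPU count; a minimal sketch of the arithmetic (the variable names are ours, not from the slides):

```python
# Back-of-envelope check of the quoted numbers; constants taken from
# the results slide, names are illustrative.

CPUS = 230            # ~230 CPUs @ 750 MHz equivalent
SEC_PER_EVENT = 500   # time to process one event on one such CPU
SPEEDUP = 100         # average speedup observed during the run

# Sustained efficiency: achieved speedup over the ideal (one busy job
# per CPU at all times).
efficiency = SPEEDUP / CPUS                        # ~ 0.435, i.e. 43.5%

# Aggregate throughput implied by the speedup, in events per day.
events_per_day = SPEEDUP * 86400 / SEC_PER_EVENT   # 17280 events/day

print(f"efficiency = {efficiency:.1%}")
print(f"throughput = {events_per_day:.0f} events/day")
```

At 17,280 events/day sustained, the remaining ~280K events of the first assignment would take roughly two and a half weeks, which is consistent with the Christmas 2002 target.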
4
Success counts..
5
Limitations..
  • Breakdowns: scalability?
  • Unforeseen issues (not any more!):
    • Larger submissions (30)
    • Parallel g-u-c (FTSH)
    • NFS timeouts
    • Larger farms
    • Disk management
    • Garbage collection of failing jobs
    • Large numbers of jobs (kills the system, e.g. the YP server)
    • Configuration issues
  • Mainly bottlenecks (NFS, sockets, disk).
  • No dead ends: workarounds exist.
    • Throttling, code adjustments, tune-ups
    • Cleanup and start over
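The throttling workaround for the parallel transfers can be sketched as a bounded worker pool with per-file retries. This is an illustration only, not the actual FTSH setup: the `transfer` function is a stub, and a real version would invoke globus-url-copy via a subprocess.

```python
# Sketch: throttled parallel transfers with retry, in the spirit of the
# parallel g-u-c (globus-url-copy) runs described above. All names here
# are illustrative, not from the original tooling.
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4   # throttle to avoid NFS/socket bottlenecks
MAX_RETRIES = 3

def transfer(src, dst):
    """Stub for one file transfer; a real version would shell out to
    globus-url-copy and return whether it succeeded."""
    return True

def transfer_with_retry(src, dst):
    """Retry a transfer a few times; transient errors (e.g. NFS
    timeouts) are swallowed and the copy is attempted again."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            if transfer(src, dst):
                return (src, "ok", attempt)
        except OSError:
            pass  # transient failure; fall through and retry
    return (src, "failed", MAX_RETRIES)

def run_transfers(pairs):
    # The pool size caps concurrency, which is the "throttling" part.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(lambda p: transfer_with_retry(*p), pairs))

results = run_transfers([("site/a.root", "fnal/a.root"),
                         ("site/b.root", "fnal/b.root")])
```

Capping `max_workers` is the key design choice: it trades peak transfer rate for stability of the shared NFS and socket resources that the slide lists as bottlenecks.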

6
Troubleshooting
  • Several middleware and application software issues/bugs found,
    fixed, or worked around
    (RefDB, MCRunJob, MOP, Condor_G, applications).
  • Condor_G/gahp_server failure at higher numbers of jobs in the queue
    (major issue):
    • gahp_server stops communicating with sites.
    • gahp_server spawns a large number of threads.
    • Orphaned jobs (CPU-time wastage/garbage).
    • Held jobs (need re-submission).
    • Significantly hampers efficiency: 35-40% of all failures.
    • The Condor team is working on the problem.
    • Workaround: kill and restart Condor_G? This is where fault
      tolerance is needed.
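The "kill and restart Condor_G" workaround amounts to a watchdog: probe the service, restart it when it stops responding, and escalate after too many restarts. A minimal sketch, with stub health-check and restart functions standing in for a real gahp_server probe and service restart (none of these names come from Condor itself):

```python
# Sketch: restart-on-hang watchdog for an unresponsive service, modeled
# on the manual kill-and-restart workaround for Condor_G/gahp_server.
def make_watchdog(is_responsive, restart, max_restarts=5):
    """Return a function that runs one watchdog pass.

    is_responsive: callable returning True if the service answers.
    restart: callable that kills and restarts the service.
    """
    state = {"restarts": 0}

    def check_once():
        if is_responsive():
            return "ok"
        if state["restarts"] >= max_restarts:
            return "gave-up"       # stop thrashing; page a human
        state["restarts"] += 1
        restart()                  # the kill-and-restart workaround
        return "restarted"

    return check_once

# Simulated service that hangs until it is restarted once.
health = {"alive": False}
watch = make_watchdog(lambda: health["alive"],
                      lambda: health.update(alive=True))
first, second = watch(), watch()   # "restarted", then "ok"
```

Bounding the restart count matters: an unconditional restart loop would mask a persistent failure forever, which is exactly the kind of silent efficiency loss the slide complains about.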

7
Effort
  • Deployment and configuration are still not very easy.
  • Misleading errors; experts required.
  • End-to-end job tracking is tough (a tool is needed?).
  • Have to look into several log files and queues to establish the
    relationships.
  • Limitations lead to disaster (experts must find out the reason for
    each failure).
  • Jobs need babysitting (reducing with time):
    • Have to keep an eye on various sites.
    • Monitoring and alarming help.
  • Orphaned jobs:
    • Identified only by their "strange" behavior.
    • Need access to the remote site for (job, disk) cleanup.
  • Missing: a scheduler at the Grid level.
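Part of the babysitting is scanning queue listings for jobs stuck in the held state so they can be re-submitted. A sketch of that scan, assuming a condor_q-style listing where the sixth column is the job status and 'H' means held (the sample text and field positions are illustrative, not an exact condor_q format):

```python
# Sketch: pick out held jobs for re-submission from a scheduler queue
# listing. SAMPLE mimics a condor_q-style table; the column layout is
# an assumption for illustration.

SAMPLE = """\
123.0  anzar  12/13 10:02  0+02:14:33  R  cmsim_run.sh
124.0  anzar  12/13 10:02  0+00:00:00  H  cmsim_run.sh
125.0  anzar  12/13 10:03  0+01:59:10  R  cmsim_run.sh
126.0  anzar  12/13 10:05  0+00:00:00  H  cmsim_run.sh
"""

def held_jobs(queue_text):
    """Return IDs of jobs whose status column reads 'H' (held)."""
    held = []
    for line in queue_text.splitlines():
        fields = line.split()
        # fields: id, owner, date, time, run-time, status, command
        if len(fields) >= 6 and fields[5] == "H":
            held.append(fields[0])
    return held

to_resubmit = held_jobs(SAMPLE)   # ['124.0', '126.0']
```

In practice the IDs collected this way would be fed back to the submission machinery for release or re-submission; orphaned jobs are harder, since (as the slide notes) they show up only through strange behavior, not a queue flag.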

8
Conclusions / Plans
  • Happy with the results?
  • USCMS production Grid in January 2003:
    • Improved middleware, bugs fixed.
    • Grid-level scheduling (several options).
  • Future development needs understood:
    • Changes to MCRunJob: better tracking.
    • Changes to MOP (the "Show Master"): a more generalized interface,
      multi-site job submission.
    • BOSS in the picture.
  • Better organization and closer cooperation: shifts, policies, bug
    reporting.