1
Long-Term Massive Production Runs on Alliance
Resources: Experience
  • Vladimir Litvin, Harvey Newman,
  • Sergei Shevchenko
  • Caltech CMS

2
Introduction
  • The Caltech Higgs diphoton decay channel study is
    in its second stage, in which 10M background
    events with full detector simulation are being
    simulated, reconstructed, and analysed.
  • 4.5M events have been produced since April 2003
  • Physics results have been presented at the Les
    Houches conference and reported at CMS weeks and
    in CMS notes.
  • A 100M-event run on the TeraGrid Alliance
    facility is planned for 2004-2005.

3
Summary
4
Analysis Chain
  • Production analysis chain
  • - cmsim125 and ORCAv6 (with ObjyDB) have been
    used
  • - The FORTRAN part can run anywhere
  • - The ORCA C++ part ran only on the Caltech
    pTier2, due to ObjyDB RH6.x restrictions

5

Hierarchy of problems
  • There is a hierarchy of problems which might
    appear at any level
  • We will concentrate mainly on the lowest,
    infrastructure level, including clusters,
    network, mass storage system, and social issues
  • Even at this level there are a lot of open issues
  • 3-tier architecture

6
NCSA(I)
  • NCSA Platinum Cluster Technical Summary
  • IBM eServer x330 thin server (dual-processor)
  • ECC SDRAM 1.5GB (compute nodes)
  • Access nodes: 4 (8 CPUs)
  • Compute nodes: 484 (968 CPUs)
  • Storage nodes: 32 (64 CPUs)
  • Intel PIII 1GHz, 256kB full-speed L2 cache (peak
    performance 1 Gflop)
  • Network Interconnect:
  • Access nodes: Gigabit Ethernet
  • Compute nodes: Myrinet 2000
  • Disks: local 10GB per node and 4 NFS-mounted file
    systems of 650GB each

7
NCSA Statistics
  • NCSA per day SU usage
  • NCSA total usage

8
NCSA Statistics
  • NCSA fraction of idle (Q) and running (R) time
    per day

9
NCSA Statistics
  • NCSA fraction of idle (Q) and running (R) time vs
    number of jobs completed per day

10
NCSA (II)
  • Multiple runs should be submitted in one PBS
    job. Due to NCSA computing policy, PBS cannot
    allocate a single CPU; it always allocates a
    whole NODE. The same will hold on the future
    TeraGrid
  • Two tasks of two different users could both ask
    for large amounts of RAM on the same node
  • With a one-CPU allocation, the second CPU is
    sitting idle anyway.
  • The smaller the number of allocated nodes, the
    lower the priority of this PBS job
  • A large number of nodes allocated per one PBS
    job is not good from the Caltech HPSS usage point
    of view
  • The "fair" maui policy is even more unfair

11
NCSA (III)
  • Job submission
    #!/bin/csh
    #PBS -q standard
    #PBS -N cmsim
    #PBS -l nodes=4:ppn=2:prod
    #PBS -l walltime=12:00:00
    set PBSHOST = `hostname`
    foreach node (`cat $PBS_NODEFILE`)
      if ( $node == $PBSHOST ) then
        $GEN_EXE >& out.file &
      else
        ( ssh -a -x -q $node "$GEN_EXE >& out.file" ) &
      endif
    end
    wait
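  • For reference, such a script would typically be
    submitted and checked with the standard PBS
    commands (the script name here is illustrative):
    qsub cmsim.csh
    qstat -u $USER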

12
NCSA (IV)
  • Walltime limit exceeded problem
  • Huge LAN traffic from other computing nodes in
    the same segment
  • Job was started incorrectly
  • Random order of running jobs
  • If 100 identical jobs were started, the first
    running job might be any of them, so it is hard
    to predict which chunk of events is already done
  • PBS MAXJOB limit was set to 50, and jobs will be
    killed by PBS when the number of jobs exceeds
    this limit
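  • One way to soften the walltime problem (a minimal
    csh sketch, not the production script; the chunk
    names and the 11.5 h margin are assumptions) is
    to stop launching new event chunks once the
    elapsed time approaches the requested walltime:
    #!/bin/csh
    # stop launching new chunks when the 12 h walltime is nearly used up
    set LIMIT = 41400              # 11.5 h in seconds, leaving a margin
    set START = `date +%s`
    foreach chunk (chunk01 chunk02 chunk03)
      set NOW = `date +%s`
      @ USED = $NOW - $START
      if ( $USED > $LIMIT ) then
        echo "walltime nearly exhausted, skipping remaining chunks"
        break
      endif
      $GEN_EXE $chunk >& $chunk.log
    end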

13
NCSA (V)
  • Blocked Jobs
  • maui can start a job and, if it fails, the job
    will be blocked and sit in the queue without any
    notification. Unblocking must be done by hand, by
    asking the support team to do so. There isn't any
    way to be notified by maui/PBS.
  • Accounting Problem
  • PBS accounting cannot correctly calculate the CPU
    time used when ssh was used to start runs on the
    other allocated nodes
  • NCSA's own Sybase-based accounting system is wrong
    as well (it overestimates the CPU usage)
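  • Since maui/PBS give no notification, a periodic
    cron check is one workaround (a minimal sketch,
    assuming the " Q " state column of qstat -u
    output and plain mail for delivery):
    #!/bin/csh
    # warn if any of our jobs are still sitting in the queued (Q) state,
    # since a blocked job looks exactly like a queued one
    set QUEUED = `qstat -u $USER | grep -c " Q "`
    if ( $QUEUED > 0 ) then
      qstat -u $USER | grep " Q " | mail -s "PBS jobs possibly blocked" $USER
    endif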

14
NCSA (VI)
  • NCSA's custom-made usage utility overestimates
    the used CPU hours
  • The corrected PBS accounting is too crude for the
    estimation as well

15
Caltech pTier2 (I)
  • Server node
  • Dell PowerEdge 4400, 2GB SDRAM
  • 3 RAID arrays, 3TB in total
  • Computing nodes
  • 20 rack-mounted dual-CPU Intel PIII 800MHz nodes
  • 512 MB SDRAM at 133 MHz
  • 10GB local disk

16
Caltech pTier2 (II)
  • Caltech pTier2 layout

17
Caltech pTier2 Statistics
  • Caltech per day SU usage
  • Caltech total SU usage

18
Caltech pTier2 Ganglia monitor
  • Screenshot

19
Caltech pTier2 Statistics
  • Caltech pTier2 fraction of idle (Q) and running
    (R) time per day

20
Caltech pTier2 Statistics
  • Caltech pTier2 fraction of idle (Q) and running
    (R) time vs total number of jobs

21
Caltech pTier2 (III)
  • NFS-related problem: high CPU load average
    without any real work to do.
  • Typical diagnostics in /var/log/messages
  • Jul 24 06:27:38 t007 kernel: nfs: server
    tier2 not responding, still trying
  • Jul 24 06:27:38 t007 last message repeated 4
    times
  • Jul 24 06:29:30 t007 kernel: nfs: task 3281
    can't get a request slot
  • Jul 24 06:42:52 t007 automount[5189]:
    expired /data/raid1
  • Jul 24 06:44:06 t007 kernel: nfs: task 3283
    can't get a request slot
  • It looks like the NFS server cannot send data on
    request, due to overload or other reasons, once
    the number of pending requests from one client
    starts to exceed a certain threshold
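  • A client-side mitigation worth trying (an
    assumption, not something that was actually
    deployed) is to relax the NFS timeout behaviour
    on the compute nodes, e.g. raising timeo and
    retrans for the mount from the tier2 server (the
    options would normally go into the automounter
    map or fstab):
    mount -o timeo=600,retrans=5 tier2:/data/raid1 /data/raid1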

22
Caltech pTier2 (IV)
  • Good node (Ganglia)
  • Bad node (Ganglia)

23
Caltech pTier2 (V)
  • Good node (Ganglia)
  • Bad node (Ganglia)

24
Condor (UW-Madison)
  • Condor flock of chaotically distributed nodes
  • Intel/Linux 614 nodes
  • Intel/WinNT50 113
  • SUN4u/Solaris28 105
  • SUN4x/Solaris28 6
  • We are using it for the FORTRAN part only

25
Condor (UW-Madison) Statistics
  • Condor per day SU usage
  • Condor total usage

26
Condor (UW-Madison)
  • NFS problem: an evicted job cannot open an
    existing file after starting on a new node
  • Failure rate depends on the type of job and varied
    from
  • The reason is unknown
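  • One possible way around the NFS dependence (a
    sketch only; the submit commands are Condor's
    file-transfer mechanism in recent versions, and
    the input file name is illustrative) is to let
    Condor ship the input files with the job, so a
    job restarted on a new node does not need the
    NFS mount:
    # cmsim.sub (illustrative submit description)
    universe                = vanilla
    executable              = cmsim125
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = cards.dat
    queue
    # submit with: condor_submit cmsim.sub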

27
Caltech HPSS
  • Hardware
  • 5 SP2 four-processor Silver nodes
  • 1 SP2 eight-processor High node
  • IBM 3494 Robotic tape library with
  • 6 IBM Magstar 3590 drives (10GB per tape, 9MB/s)
  • 2300 tape slots (23 TB capacity)
  • StorageTek 4410 Robotic tape silo with
  • 4 STK Redwood drives (50GB per tape, 11MB/s)
  • 6000 tape slots (300TB capacity)

28
Caltech HPSS
  • Cannot keep more than 200-300 connections open at
    the same time. 400 connections killed the whole
    system
  • The future TeraGrid Caltech unit will have 100TB
    of disk under PVFS
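  • Until that limit is raised, the client side can
    throttle itself; a minimal csh sketch (the
    copy_to_hpss helper and the batch size are
    hypothetical) that keeps the number of
    simultaneous transfers well below 200-300:
    #!/bin/csh
    # launch HPSS transfers in bounded batches instead of all at once
    set BATCH = 50
    set n = 0
    foreach f (`cat filelist`)
      ( copy_to_hpss $f ) &        # copy_to_hpss is a hypothetical helper
      @ n = $n + 1
      if ( $n >= $BATCH ) then
        wait                       # let the whole batch finish first
        set n = 0
      endif
    end
    wait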

29
Conclusion (I)
  • In addition to Application and Grid Middleware
    problems, we have a set of infrastructure
    technical problems, policies, and social open
    issues which can prevent us from fully utilizing
    future grid capacity
  • Technical issues are closely connected with
    Computing Center policies.
  • PBS has a set of limitations: the MAXJOB limit,
    the finest granularity being the node rather than
    the CPU, and the wrong sequence of running jobs

30
Conclusion (II)
  • Lack of a good monitoring and accounting system.
    This is the basis of any future activity of
    resource brokers, and it is practically impossible
    to have a robust broker without detailed
    statistical information
  • A reliable Mass Storage System is the critical
    issue for LHC data handling. The current limitation
    on open connections at the Caltech HPSS raises
    questions about MSS reliability. Further
    investigations with HPSS and PVFS are essential