Summary of Category 3 HENP Computing Systems and Infrastructure

1
Summary of Category 3 HENP Computing Systems and Infrastructure
  • Ian Fisk and Michael Ernst
  • CHEP 2003
  • March 28, 2003

2
Introduction
  • We tried to break the week into themes
  • We discussed fabrics and architectures on Monday
  • Heard general talks about building and securing
    large multi-purpose facilities
  • As well as updates from a number of HENP
    computing efforts
  • We discussed emerging hardware and software
    technology on Tuesday
  • Review of the most recent PASTA report and an
    update of commodity disk storage work
  • Software for flexible clusters (MOSIX); advanced
    storage and data serving (CASTOR, Enstore, dCache,
    Data Farm, and ROOT I/O)
  • We discussed Grid and other services on Thursday
  • Grid interfaces and storage management over the
    grid
  • Monitoring services
  • It was a full week with a lot to discuss.
    Special thanks to all those who presented.
  • There is no way to cover very much of what was
    presented in a thirty-minute talk.

3
General Observations
  • Grid functionality is coming quickly
  • Basic underlying concepts of distributed,
    parasitic, and multi-purpose computing are
    already being deployed in running experiments
  • Early implementation of interfaces for grid
    services to fabrics
  • I would expect by the time the LHC experiments
    have real data that the tools and techniques will
    have been well broken-in by experiments running
    today
  • Shift to commodity equipment accelerated since
    the last CHEP
  • I would argue that the shift is nearly complete
  • At least two large computing centers admitted to
    having nothing in their work rooms but Linux
    systems and a few Suns to debug software
  • This has resulted in the development of tools to
    help handle this complicated component
    environment
  • With notable exceptions, high energy physics
    computing efforts do not work well together
  • The individual experiments often have subtly
    different requirements, which results in
    completely independent development efforts

4
Distributed Computing
  • Example from CDF: the Central Analysis Facility
    is very well used
  • Future (very near future) plan is to deploy
    satellite analysis farms to increase the
    computing resources

5
Distributed Computing
  • Peter Elmer presented how the BaBar experiment
    has been able to take advantage of distributed
    computing resources for primary event
    reconstruction
  • By splitting their prompt calibration and event
    reconstruction, they now take advantage of 5
    reconstruction farms at SLAC and 4 in Padova

6
Parasitic Computing
  • Bill Lee presented the CLuED0 work of the D0
    experiment
  • CLuED0 is a cluster of D0 desktop machines which,
    along with some custom management software,
    provides D0 with 50% of their analysis CPU cycles
    parasitically
  • Heterogeneous system with distributed support
  • The US LHC experiments submitted a proposal on
    Monday which, among many other topics, discussed
    the use of economic theories to optimize resource
    allocations.
  • Techniques already used in D0

7
Multipurpose Computing
  • Fundamental to a grid-connected facility is the
    ability to support multiple experiments at a
    minimum, and ideally multiple disciplines
  • The people responsible for computing systems have
    been thinking about how to make this possible,
    because so many regional computing centers have
    to support multiple experiments and user
    communities.
  • John Gordon gave an interesting talk on whether
    it was possible to build a multipurpose center
  • John identified 6 categories of problems and
    discussed possible solutions
  • Software levels
  • Experts
  • Local rules
  • Security
  • Firewalls
  • The accelerator centres

8
Early Interfacing of Grid Services to Fabrics
  • Alex Sim gave a talk on Storage Resource Manager
    (SRM) functionality
  • Manage space
  • Negotiate and assign space to users, manage
    lifetime of spaces
  • Manage files on behalf of a user
  • Pin files in storage until they are released,
    manage lifetime of files
  • Manage action when pins expire (depends on file
    types)
  • Manage file sharing
  • Policies on what should reside on a storage
    resource at any one time
  • Policies on what to evict when space is needed
  • Get files from remote locations when necessary
  • Purpose is to simplify the client's task (a
    client-side sketch follows this list)
  • Manage multi-file requests
  • A brokering function: queue file requests,
    pre-stage when possible
  • Provide grid access to/from mass storage systems
  • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE
    (Jlab), Castor (CERN), MSS (NCAR),
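The list above is an interface description rather than an implementation. Purely as an illustration, here is a minimal Python sketch of what an SRM-style client-side interface might look like; the class and method names are hypothetical and are not the actual SRM API.

```python
# Hypothetical sketch of an SRM-style client-side interface.  Class and
# method names are illustrative and do not follow the real SRM specification.
import time
from dataclasses import dataclass


@dataclass
class Pin:
    """A pinned file: kept staged in storage until released or expired."""
    surl: str           # site URL of the file in the storage system
    expires_at: float   # time after which the pin lapses and the file may be evicted


class StorageResourceManagerSketch:
    """Toy model of the management functions listed above."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.pins: dict[str, Pin] = {}

    def reserve_space(self, nbytes: int, lifetime_s: int) -> bool:
        """Negotiate and assign space to a user for a limited lifetime."""
        # A real SRM would apply per-user and per-experiment policies here.
        return nbytes <= self.capacity

    def pin_file(self, surl: str, lifetime_s: int = 3600) -> Pin:
        """Pin a file until it is released or its lifetime expires.
        A real implementation would pre-stage the file from tape or fetch
        it from a remote location before returning."""
        pin = Pin(surl, time.time() + lifetime_s)
        self.pins[surl] = pin
        return pin

    def release(self, surl: str) -> None:
        """Release a pin; the file becomes a candidate for eviction."""
        self.pins.pop(surl, None)


# Usage: a multi-file request that a brokering function could queue and pre-stage.
srm = StorageResourceManagerSketch(capacity_bytes=10 * 2**40)
for surl in ("srm://site.example/dst/run1.root", "srm://site.example/dst/run2.root"):
    srm.pin_file(surl, lifetime_s=7200)
```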

9
Early Implementation
  • The functionality of SRM is impressive and leads
    to interesting analysis scenarios
  • Equally interesting is the number of places that
    are prepared to interface their storage to the
    WAN using SRM
  • Robust file replication between BNL and LBNL

10
Shift to commodity equipment
11
Benefits and Complications
  • The benefit is very substantial computing
    resources at a reasonable hardware cost.
  • The complication is the scale and complexity of
    the commodity computing cluster
  • A reasonably big computing cluster today might be
    1000 systems
  • With all the possible hardware problems
    associated with 1000 systems bought from the
    lowest bidder
  • Considerable amount of deployment, integration,
    and development effort is needed to create tools
    that allow a shelf or rack of Linux boxes to
    behave like a computing resource
  • Configuration Tools
  • Monitoring Tools
  • Tools for systems control
  • Scheduling Tools
  • Security Techniques

12
Configuration Tools
  • We heard an interesting talk from Thorsten
    Kleinwort on installing and running systems at
    CERN
  • Systems are installed with kickstart and RPMs
  • CERN and several other centers are deploying the
    configuration tools from EDG WP4
  • Pan and CDB (Configuration Data Base) for
    describing hosts
  • Pan is a very flexible language for describing
    host configuration information
  • Expressed in templates (ASCII)
  • Allows includes (inheritance)
  • Pan is compiled into XML, stored inside CDB
  • The XML is downloaded and the information is
    provided by CCConfig, the high-level API (see the
    sketch after this list)
  • It is complicated even to track what it is you
    have.
  • We had an interesting presentation from Jens
    Kreutzkamp from DESY about how they track their
    IT assets.
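As a rough illustration of the flow described above (Pan templates compiled into XML, stored in CDB, and read back through a high-level API), here is a minimal Python sketch that parses a downloaded host profile. The XML layout, element names, and example values are invented for the sketch and are not the real Pan/CDB schema or the CCConfig API.

```python
# Minimal sketch of reading a compiled host profile, assuming an invented
# XML layout; the real Pan/CDB schema and the CCConfig API differ.
import xml.etree.ElementTree as ET

PROFILE = """
<profile host="lxbatch001">
  <cluster>lxbatch</cluster>
  <software>
    <rpm name="openssh" version="3.5p1"/>
    <rpm name="castor-client" version="1.7.1"/>
  </software>
</profile>
"""


def host_config(xml_text: str) -> dict:
    """Turn a downloaded XML profile into a simple lookup structure."""
    root = ET.fromstring(xml_text)
    return {
        "host": root.get("host"),
        "cluster": root.findtext("cluster"),
        "rpms": {r.get("name"): r.get("version") for r in root.iter("rpm")},
    }


cfg = host_config(PROFILE)
print(cfg["host"], cfg["rpms"]["openssh"])   # lxbatch001 3.5p1
```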

13
Monitoring Tools
  • Systems are complicated, consisting of many
    components; this has led to the development of
    lots of monitoring tools
  • Tools like NGOP, which Tanya Levshina presented,
    are very functional, complete, and scalable,
    though complicated to extend

14
Monitoring Tools (cont.)
  • On the opposite end were examples of extremely
    lightweight monitoring packages for BaBar,
    presented by Matthias Wittgen
  • Monitors CPU and network usage as well as packets
    sent to disk and the number of processes
  • Writes it to a central server where it is kept in
    a flat file (a minimal sketch of the idea follows
    this list)
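A minimal sketch of this lightweight approach might look like the following: sample a few cheap metrics on each node and append one line per sample to a flat file held centrally. The metric set, record format, and file path are illustrative assumptions, not the BaBar package itself.

```python
# Minimal sketch of a lightweight node monitor in the spirit described above;
# metrics, paths, and the flat-file format are illustrative assumptions.
import os
import socket
import time


def sample() -> dict:
    """Collect a few cheap metrics from /proc (Linux only)."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    nprocs = sum(1 for d in os.listdir("/proc") if d.isdigit())
    return {"host": socket.gethostname(),
            "time": int(time.time()),
            "load1": load1,
            "nprocs": nprocs}


def append_record(record: dict, path: str = "/var/monitor/nodes.dat") -> None:
    """Append one whitespace-separated record to the central flat file
    (assumes the path is on a shared filesystem or forwarded to the server)."""
    line = "{host} {time} {load1:.2f} {nprocs}\n".format(**record)
    with open(path, "a") as f:
        f.write(line)


if __name__ == "__main__":
    append_record(sample())
```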

15
Tools for system control
  • Andras Horvath presented a technique for secure
    system control and reset access for a reasonable
    cost
  • This solution doesn't scale to 6000 boxes
  • The system Andras is implementing consists of
    serial connections for console access and relays
    attached to the reset switch on the motherboard
    for resets (a sketch follows this list)
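As a rough sketch of this kind of setup, the following Python fragment reads a node's serial console and pulses a relay wired to its reset switch. It assumes the pyserial library; the device names and the relay controller's "ON n"/"OFF n" command set are invented for the example and do not describe the system that was presented.

```python
# Sketch of remote console capture and reset, assuming pyserial and an
# invented ASCII protocol for the relay controller; device names are examples.
import time
import serial  # pyserial


def capture_console(port: str = "/dev/ttyS0", seconds: float = 5.0) -> bytes:
    """Read whatever the node prints on its serial console for a short while."""
    with serial.Serial(port, baudrate=9600, timeout=1) as console:
        deadline = time.time() + seconds
        data = b""
        while time.time() < deadline:
            data += console.read(256)
        return data


def pulse_reset(relay_port: str = "/dev/ttyS1", channel: int = 3) -> None:
    """Close and reopen the relay wired to the node's reset switch.
    'ON n' / 'OFF n' is a made-up controller command set."""
    with serial.Serial(relay_port, baudrate=9600, timeout=1) as relay:
        relay.write(f"ON {channel}\r\n".encode())
        time.sleep(0.5)                      # hold reset briefly
        relay.write(f"OFF {channel}\r\n".encode())
```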

16
Security Techniques
  • Number of systems in these large commodity
    clusters makes for interesting security work
  • Doubly so when worrying about making grid
    interfaces
  • The work to secure the BNL facility was presented
  • Work prioritizing their assets and forming
    responses for security breaches

17
Field doesn't cooperate well
  • This is not necessarily a problem, nor is it a
    criticism, simply an observation
  • One doesn't see a lot of common detector-building
    projects, so maybe it isn't surprising that there
    aren't a lot of common computing development
    efforts
  • I noticed during the week that there is a lot of
    duplication of effort, even between experiments
    that are geographically close
  • We have forums for exchange like HEPIX and the
    Large Cluster Workshop meetings
  • Even with these, we don't seem to do much
    development in common
  • There are notable exceptions
  • Alan Silverman presented the work to write a
    guide to building and operating a large cluster
  • Their noble if somewhat ambitious goal is to
    produce the definitive guide to building and
    running a cluster: how to choose, acquire, test,
    and operate the hardware; software installation
    and upgrade tools; performance management,
    logging, accounting, alarms, security, etc.

18
Grid Projects
  • The grid projects are another area in which the
    field is working effectively together
  • A number of sites indicated the desire to use
    common tools developed by EDG Work Package 4
  • Good buy-in from fabric managers about the use of
    SRM
  • Software deployment through the VDT

19
Conclusions
  • It was a long and interesting week
  • Apologies for not being able to summarize
    everything
  • We had very interesting discussions and
    presentations yesterday about how to interface
    the fabrics and the grid services
  • I also didn't get a chance to cover some of the
    hardware and software R&D results
  • I encourage people to look at the web page.
    Almost all the talks were posted.