Title: FermiGrid/CDF/D0/OSG
1 FermiGrid/CDF/D0/OSG
2 Global Collaboration With Grids
- Ziggy wants his humans home by the end of the day for food and attention
- Follow Ziggy through National, Campus, and Community grids to see how it happens
3 What is DØ?
- The DØ experiment consists of a worldwide collaboration of scientists conducting research on the fundamental nature of matter.
- 500 scientists and engineers
- 60 institutions
- 15 countries
- The research is focused on precise studies of interactions of protons and antiprotons at the highest available energies.
4 DØ Detector
- The detector is designed to stop as many as possible of the subatomic particles created from the energy released by colliding proton/antiproton beams.
- The intersection region where the matter-antimatter annihilation takes place is close to the geometric center of the detector.
- The beam collision area is surrounded by tracking chambers in a strong magnetic field parallel to the direction of the beam(s).
- Outside the tracking chamber are the pre-shower detectors and the calorimeter.
5 What is reprocessing?
- Periodically an experiment will reprocess data taken previously, due to improvements in understanding the detector:
- calorimeter recalibration
- improvements in the algorithms used in the analysis
- The reprocessing effort pushes the limits of software and infrastructure to get the most physics out of the data collected by the DØ detector.
(Image: a new layer of the silicon detector of the DZERO detector.)
6 Case for using OSG resources
- Goal: reprocess 500 M Run II events with the newly calibrated detector and improved reconstruction software by the end of March 2007, when the data have to be ready for physics analysis.
- Input: 90 TB of detector data, 250 TB in executables.
- Output: 60 TB of data, in 500 CPU-years.
- Estimated resources: about 1500-2000 CPUs for a period of about 4 months (a quick check of this estimate follows after this list).
- Problem: DØ did not have enough dedicated resources to complete the task in the target 3 months.
- Solution: use SAM-Grid/OSG interoperability to allow SAM-Grid jobs to be executed on OSG clusters.
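A back-of-the-envelope check (a sketch added here, not from the original slides) shows how the 1500-2000 CPU figure follows from the 500 CPU-year budget and the roughly four-month window:

```python
# Back-of-the-envelope check of the slide's estimate (illustrative only).
cpu_years_needed = 500            # "500 CPU-years" of processing
campaign_months = 4               # target production window
campaign_years = campaign_months / 12.0

cpus_required = cpu_years_needed / campaign_years
print(f"CPUs needed to finish in {campaign_months} months: {cpus_required:.0f}")
# -> 1500, consistent with the quoted 1500-2000 CPUs once job failures,
#    scheduling gaps and opportunistic preemption are allowed for.
```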
7 OSG Usage Model
- Opportunistic usage model
- Agreed to share computing cycles with OSG users
- The exact amount of resources at any time cannot be guaranteed

OSG Clusters (CPUs):
  Brazil                  230
  CC-IN2P3 Lyon           500
  LOUISIANA LTU-CCT       250 (128)
  UCSD                    300 (70)
  PURDUE-ITaP             600 (?)
  Oklahoma University     200
  Indiana University      250
  NERSC LBL               250
  University of Nebraska  256
  CMS FNAL 2              250
8 SAM-Grid
- SAM-Grid is an infrastructure that understands DØ processing needs and maps them onto available resources (OSG); a toy sketch of this mapping follows the list below
- Implements job-to-resource mappings, for both computing and storage
- Uses SAM (Sequential Access via Metadata)
- Automated management of storage elements
- Metadata cataloguing
- Job submission and job status tracking
- Progress monitoring
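The "maps them onto available resources" step can be pictured with a minimal sketch. This is illustrative Python, not SAM-Grid code; the cluster names are taken from the usage-model table, but the slot counts and VO fields are invented for the example:

```python
# Illustrative only: a toy version of the job-to-resource mapping idea.
clusters = [
    {"name": "CC-IN2P3 Lyon", "free_slots": 120, "vos": {"dzero"}},          # invented numbers
    {"name": "UCSD",          "free_slots": 0,   "vos": {"dzero", "cms"}},
    {"name": "PURDUE-ITaP",   "free_slots": 300, "vos": {"cms"}},
]

job = {"vo": "dzero", "slots_wanted": 50}

def match(job, clusters):
    """Return clusters that support the job's VO and have enough free slots."""
    return [c["name"] for c in clusters
            if job["vo"] in c["vos"] and c["free_slots"] >= job["slots_wanted"]]

print(match(job, clusters))   # ['CC-IN2P3 Lyon']
```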
9 SAM-Grid Architecture
10 Challenge: Certification
- Compare production at a new site with standard production at the DØ farm: if the OSG cluster output and the reference output are the same, the site is certified.
- Note: problems were experienced during the certification on a virtual OS; the default random seed in Python was set to the same value on all machines.
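The seeding issue noted above is easy to avoid; a minimal sketch (illustrative, not the DØ production code) that derives a distinct seed per worker:

```python
import os
import random
import socket
import time

# Illustrative only: if every (cloned) virtual machine starts with the same
# default seed, all jobs draw identical "random" sequences.  Deriving the seed
# from host name, process id and time makes each worker distinct.
seed = hash((socket.gethostname(), os.getpid(), time.time())) & 0xFFFFFFFF
random.seed(seed)
print(socket.gethostname(), os.getpid(), seed, random.random())
```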
11 Challenge: Data Accessibility Test
- 10000 seconds to transfer the data (30 streams): not acceptable
- 2000 seconds to transfer the data (30 streams): acceptable
12 Challenge: Troubleshooting
(Charts: OSG-related problems before the intervention of the Troubleshooting Team, 03/27/2007; most jobs succeed afterwards, 04/17/2007.)
- The OSG Troubleshooting Team was instrumental to the success of the project.
13 Reprocessing Summary
- "This was the first major production of real high energy physics data (as opposed to simulations) ever run on OSG resources," said Brad Abbott, head of the DØ computing group.
- On OSG, DØ sustained execution of over 1000 simultaneous jobs, and overall moved over 70 terabytes of data.
- Reprocessing was completed in June. Towards the end of the production run the throughput on OSG was more than 5 million events per day, two to three times more than originally planned.
- In addition to the reprocessing effort, OSG provided 300,000 CPU hours to DØ for one of the most precise measurements to date of the top quark mass, and to achieve this result in time for the spring physics conferences.
14 Reprocessing over time
15 D0 Discovery: Single Top Production
- The top quark was discovered in 1995 at the Tevatron using the pair production mode
- The prediction of single top quark production has recently been confirmed by the D0 data
- Important measurement of the t-b coupling
- Similar final state to the WH → lν bb search
- Therefore also a key milestone in the Higgs search
16 Conclusion
- Successful and pioneering effort in data-intensive production in an opportunistic environment
- Challenges in support, coordination of resource usage, and reservation of the shared resources
- An iterative approach to enabling new resources helped make the computing problem more manageable
17 The Collider Detector at Fermilab (CDF)
(Detector diagram labels: muon detector, central hadronic calorimeter, central outer tracker (COT).)
18 A Mountain of Data
- 5.8 x 10^9 events
- 804 TB raw data
- 2.4 PB total data
- At least 2x more data coming before the end of the run.
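For scale, the numbers above imply an average raw event size of roughly 140 kB; a one-line check (a sketch, using decimal terabytes):

```python
# Rough arithmetic from the numbers above (illustrative only).
events = 5.8e9
raw_bytes = 804e12                     # 804 TB of raw data
print(f"~{raw_bytes / events / 1e3:.0f} kB of raw data per event")   # ~139 kB
```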
19 Computing Model
- Each event is independent: one job can fail and the others will continue
- No inter-process communication
- Mostly integer computing
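Because each event is independent and there is no inter-process communication, the workload is embarrassingly parallel; a minimal sketch of that model (illustrative Python, not CDF's actual reconstruction code):

```python
from multiprocessing import Pool

# Illustrative only: events are independent, so they can be farmed out with no
# inter-process communication, and one failure does not stop the others.
def reconstruct(event_id):
    try:
        # placeholder for the per-event reconstruction (mostly integer work)
        return (event_id, "ok")
    except Exception as err:
        return (event_id, f"failed: {err}")

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        results = pool.map(reconstruct, range(1000))
    print(sum(1 for _, status in results if status == "ok"), "events processed")
```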
20 The Computing Problem: WW candidate event
- Reconstruction/analysis: connecting the dots on 3-D spiral tracks
- Correlate with calorimeter energy
- Find missing energy (large red arrow)
- Combinatoric fitting to see what is consistent with a W particle.
21 CAF Software
- Front end: submission, authentication and monitoring software
- Users submit, debug, and monitor from the desktop
- Works with various batch systems
- CDF began with dedicated facilities at Fermilab and remote institutions
- Monitoring page at http://cdfcaf.fnal.gov/
22 Why the Open Science Grid?
- The majority of the CPU load is simulation
- Requires 10 GHz-sec per event
- Some analyses need > 1 billion simulated events (see the rough scale estimate after this list)
- Increasing data volumes mean that demand for computing is growing faster than dedicated resources at FNAL and elsewhere.
- Simulation is relatively easy to set up on remote sites
- CDF member institutions that previously had dedicated CDF facilities are now using grid interfaces
- Strategy:
- Data analysis mostly close to home (FermiGrid CAF)
- Monte Carlo simulations spread across the OSG (NAMCAF).
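The scale of the simulation load explains why dedicated resources are not enough. A rough estimate (a sketch; the 3 GHz core speed is an assumption, not from the slides):

```python
# Rough scale estimate for the simulation load (illustrative only).
ghz_sec_per_event = 10        # quoted above
events_needed = 1e9           # "> 1 billion simulated events"
core_speed_ghz = 3.0          # ASSUMED typical core speed, not from the slides

cpu_hours = ghz_sec_per_event * events_needed / core_speed_ghz / 3600
print(f"~{cpu_hours / 1e6:.1f} million CPU-hours "
      f"(~{cpu_hours / 24 / 365:.0f} CPU-years)")
```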
23 Condor Glide-ins
- Submit a pilot job to a number of remote sites
- The pilot job calls the home server to get a work unit
- Integrity of the job and executable is checked with MD5 checksums (see the sketch below)
- To CDF users it looks like a local batch pool
- Glidekeeper daemons monitor remote sites and submit enough jobs in advance to use the available slots.
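The integrity check mentioned above can be sketched in a few lines (illustrative Python, not the actual CAF code; the checksum value and file name are hypothetical):

```python
import hashlib

# Illustrative only: verify a fetched work unit against the MD5 checksum
# published by the submission server before running it.
def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "0123456789abcdef0123456789abcdef"   # hypothetical value from the server
if md5sum("work_unit.tar.gz") != expected:      # hypothetical file name
    raise RuntimeError("work unit failed its integrity check; refusing to run")
```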
24 GlideCAF Overview
(Diagram: the GlideCAF portal, with its main schedd and batch queue, submitter daemon, glidekeeper daemon, glide-in schedd, collector, negotiator, and monitoring daemons, connecting to grid pools through Globus.)
25 NAMCAF: CDF Computing on the Open Science Grid
- North American CAF: a single submission point for all OSG sites
- CDF user interface, uses OSG tools underneath
- No CDF-specific hardware or software at OSG sites
- Accesses OSG sites at MIT, Fermilab, UCSD, Florida, Chicago
- OSG sites at Purdue, Toronto, Wisconsin, McGill to be added
- Provides up to 1000 job slots already
- Similar entry points to European sites (LCGCAF) and Taiwan/Japan sites (PACCAF)
26 CDF OSG Usage
(Chart: OSG usage by DØ and CDF.)
27 Auxiliary Tools: gLExec
- All glidein jobs on the grid appear to come from the same user.
- gLExec uses Globus callouts to contact the site authentication infrastructure
- EGEE: LCAS/LCMAPS; OSG: GUMS/SAZ
- Each individual user job authenticates to the site at the start of the job
- Gives the site independent control over whom it takes glideins from.
28 W Boson Mass Measurement
(Plot: W mass measurements, including the LEP experiments at CERN.)
- The CDF Run 2 result is the most precise single measurement of the W mass (used about a million CPU hours for the mass fitting).
29 What is FermiGrid?
- FermiGrid is:
- The Fermilab campus Grid and Grid portal.
- The site Globus gateway: accepts jobs from sources external to Fermilab and forwards them onto internal clusters.
- A set of common services to support the campus Grid and interface to the Open Science Grid (OSG) / LHC Computing Grid (LCG): VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia accounting, etc.
- A forum for promoting stakeholder interoperability and resource sharing within Fermilab: CMS, CDF, D0, KTeV, MiniBooNE, MINOS, MIPP, etc.
- The Open Science Grid portal to Fermilab compute and storage services.
- FermiGrid web site and additional documentation: http://fermigrid.fnal.gov/
- Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.
30 Jobmanager-cemon MatchMaking Service
- What is it?
- FermiGrid has a matchmaking service deployed on the central gatekeeper (fermigrid1.fnal.gov). This service matches incoming jobs against the various resources available at the point in time that the job is submitted.
- How can users make use of the MatchMaking Service?
- Users begin by submitting jobs to the fermigrid1 central gatekeeper through jobmanager-cemon.
- By default, the value of the "requirements" attribute is set such that a user's job will be matched against clusters which support the user's VO (Virtual Organization) and have at least one free slot available at the time the job is submitted to fermigrid1.
- However, users can add further conditions to this "requirements" attribute, using the attribute named "gluerequirements" in the Condor submit file.
- These additional conditions should be specified in terms of Glue Schema attributes (see the sketch below).
- More information: http://fermigrid.fnal.gov/matchmaking.html
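A minimal submit-file sketch along the lines described above. The executable and file names and the memory condition are invented for illustration, and the exact way the gluerequirements attribute is passed (shown here as a '+' custom attribute) and the grid universe syntax may differ depending on the Condor-G version in use; see the matchmaking page linked above for the authoritative recipe.

```
# Hypothetical Condor-G submit file sketch for the jobmanager-cemon matchmaker.
universe        = grid
grid_resource   = gt2 fermigrid1.fnal.gov/jobmanager-cemon
executable      = my_analysis.sh
output          = job.out
error           = job.err
log             = job.log

# Extra matchmaking condition in Glue Schema terms, on top of the default
# "supports my VO and has a free slot" requirements.
+gluerequirements = "GlueHostMainMemoryRAMSize >= 2000"

queue
```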
31 FermiGrid - Current Architecture
(Diagram: the site-wide gateway, with the VOMS server, GUMS server (kept in periodic synchronization with VOMS), SAZ server, and BlueArc storage; the interior clusters (CMS WC1, CDF OSG1, CDF OSG2, D0 CAB1, D0 CAB2, GP Farm) send ClassAds via CEMon to the site-wide gateway, which faces the exterior.)
32 SAZ - Animation
(Animation: the gatekeeper sends the user's DN, VO, Role, and CA to the SAZ server for an authorization decision.)
33 FermiGrid - Current Performance
- VOMS
- Current record: 1700 voms-proxy-inits/day.
- Not a driver for FermiGrid-HA.
- GUMS
- Current record: > 1M mapping requests/day
- Maximum system load < 3 at a CPU utilization of 130% (max 200%)
- SAZ
- Current record: > 129K authorization decisions/day.
- Maximum system load < 5.
34 BlueArc/dCache
- Open Science Grid has two storage methods:
- NFS-mounted OSG_DATA, implemented with a BlueArc NFS filer (see the sketch below)
- SRM/dCache
- Volatile area, 7 TB, for any grid user
- Large areas backed up on tape for Fermilab experiments
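From a job's point of view, the NFS-mounted area is simply a path advertised in the OSG_DATA environment variable; a minimal sketch (illustrative Python; the directory layout and output file name are hypothetical):

```python
import os
import shutil

# Illustrative only: a grid job staging an output file into the site's
# NFS-mounted OSG_DATA area.
osg_data = os.environ.get("OSG_DATA")
if osg_data is None:
    raise RuntimeError("this site does not advertise OSG_DATA")

dest = os.path.join(osg_data, "myvo", "results")   # hypothetical layout
os.makedirs(dest, exist_ok=True)
shutil.copy("histograms.root", dest)               # hypothetical output file
```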
35 FermiGrid-HA - Component Design
(Diagram: active-active VOMS, GUMS, and SAZ services behind an active/standby LVS pair, with active-active MySQL servers kept in sync by replication and watched by heartbeat.)
36 FermiGrid-HA - Actual Component Deployment
(Diagram: two physical hosts, fermigrid5 and fermigrid6, each running LVS in Xen Domain 0 (active on one, standby on the other) and four Xen VMs: VOMS on fg5x1/fg6x1 (Xen VM 1), GUMS on fg5x2/fg6x2 (Xen VM 2), SAZ on fg5x3/fg6x3 (Xen VM 3), and MySQL on fg5x4/fg6x4 (Xen VM 4), all active on both hosts.)
37 Supported by the Department of Energy Office of Science SciDAC-2 program, through the High Energy Physics, Nuclear Physics, and Advanced Software and Computing Research programs, and by the National Science Foundation Mathematical and Physical Sciences, Office of CyberInfrastructure, and Office of International Science and Engineering Directorates.
38 Open Science Grid
- The Vision
- Transform compute- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations at all scales
- Submit Local, Run Global
39 Open Science Grid
- Science community infrastructure (e.g. ATLAS, CMS, LIGO, ...)
- CS/IT campus grids (e.g. DOSAR, FermiGrid, GLOW, GPN, GROW)
- These need to be harmonized into a well-integrated whole
40 Open Science Grid International Partners
- EGEE, TeraGrid, NorduGrid, NYSGrid, GROW, GLOW, APAC, DiSUN, FermiGrid, LCG, TIGRE, ASGC, NWICG
- An international science community: common goals, shared data, collaborative work
41 Open Science Grid
42 Open Science Grid: Rosetta, a non-physics experiment
- "For each protein we design, we consume about 3,000 CPU hours across 10,000 jobs," says Kuhlman. "Adding in the structure and atom design process, we've consumed about 100,000 CPU hours in total so far."
43 Open Science Grid: CHARMM
- CHARMM: CHemistry at HARvard Macromolecular Mechanics
- "I'm running many different simulations to determine how much water exists inside proteins and whether these water molecules can influence the proteins," Damjanovic says.
44 Open Science Grid: How it all comes together
- Virtual Organization Management services (VOMS) allow registration, administration and control of the members of the group.
- Resources trust and authorize VOs, not individual users.
- The OSG infrastructure provides the fabric for job submission and scheduling, resource discovery, security, monitoring, ...
(Diagram layers: VO Middleware & Applications; VO Management Service; OSG Infrastructure; Resources that Trust the VO.)
45 Open Science Grid Software Stack
(Layered stack, top to bottom:)
- User science codes and interfaces
- VO middleware / applications: HEP (data and workflow management, etc.), Biology (portals, databases, etc.), Astrophysics (data replication, etc.)
- OSG Release Cache: OSG-specific configurations, utilities, etc.
- Virtual Data Toolkit (VDT): core technologies and software needed by stakeholders; many components shared with EGEE
- Infrastructure: core Grid technology distributions (Condor, Globus, MyProxy), shared with TeraGrid and others
- Resource: existing operating systems, batch systems and utilities.
46 Open Science Grid Security
- Operational security is a priority
- Incident response
- Signed agreements, template policies
- Auditing, assessment and training
- Symmetry of Sites and VOs
- VO and Site are two faces of a coin: we believe in symmetry
- VO and Site each have responsibilities
- Trust relationships
- A Site trusts the VOs that use it.
- A VO trusts the Sites it runs on.
- VOs trust their users.
47 Open Science Grid: Come Join OSG!
- How to become an OSG citizen:
- Join the OSGEDU VO
- Run small applications after learning how to use OSG from schools
- Be part of the Engagement program and the Engage VO
- Support within the Facility to bring applications to production on the distributed infrastructure
- Be a standalone VO and a member of the Consortium
- Ongoing use of OSG; participate in one or more activity groups.