Title: FermiGrid/CDF/D0/OSG
1 FermiGrid/CDF/D0/OSG
2 Global Collaboration With Grids
- Ziggy wants his humans home by the end of the day for food and attention
- Follow Ziggy through National, Campus, and Community grids to see how it happens
3 What is DØ?
- The DØ experiment consists of a worldwide collaboration of scientists conducting research on the fundamental nature of matter.
- 500 scientists and engineers
- 60 institutions
- 15 countries
- The research is focused on precise studies of interactions of protons and antiprotons at the highest available energies.
4 DØ Detector
- The detector is designed to stop as many as possible of the subatomic particles created from the energy released by colliding proton/antiproton beams.
- The intersection region where the matter-antimatter annihilation takes place is close to the geometric center of the detector.
- The beam collision area is surrounded by tracking chambers in a strong magnetic field parallel to the direction of the beam(s).
- Outside the tracking chamber are the pre-shower detectors and the calorimeter.
5 What is reprocessing?
- Periodically an experiment will reprocess data taken previously, due to improvements in understanding the detector:
- calorimeter recalibration
- improvements in the algorithms used in the analysis
- The reprocessing effort pushes the limits of software and infrastructure to get the most physics out of the data collected by the DØ detector.
(Image: a new layer of the silicon detector of the DZERO detector.)
6 Case for using OSG resources
- Goal: reprocess 500 M Run II events with the newly calibrated detector and improved reconstruction software by the end of March 2007, when the data have to be ready for physics analysis.
- Input: 90 TB of detector data, 250 TB in executables.
- Output: 60 TB of data, in 500 CPU-years.
- Estimated resources: about 1500-2000 CPUs for a period of about 4 months (a quick check of this estimate follows after this list).
- Problem: DØ did not have enough dedicated resources to complete the task in the target 3 months.
- Solution: use SAM-Grid/OSG interoperability to allow SAM-Grid jobs to be executed on OSG clusters.
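A back-of-the-envelope check (a sketch added here, not from the original slides) shows how the 1500-2000 CPU figure follows from the 500 CPU-year budget and the roughly four-month window:

```python
# Back-of-the-envelope check of the slide's estimate (illustrative only).
cpu_years_needed = 500            # "500 CPU-years" of processing
campaign_months = 4               # target production window
campaign_years = campaign_months / 12.0

cpus_required = cpu_years_needed / campaign_years
print(f"CPUs needed to finish in {campaign_months} months: {cpus_required:.0f}")
# -> 1500, consistent with the quoted 1500-2000 CPUs once job failures,
#    scheduling gaps and opportunistic preemption are allowed for.
```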
7 OSG Usage Model
- Opportunistic usage model
- Agreed to share computing cycles with OSG users
- The exact amount of resources at any time cannot be guaranteed

OSG Clusters (CPUs):
  Brazil                  230
  CC-IN2P3 Lyon           500
  LOUISIANA LTU-CCT       250 (128)
  UCSD                    300 (70)
  PURDUE-ITaP             600 (?)
  Oklahoma University     200
  Indiana University      250
  NERSC LBL               250
  University of Nebraska  256
  CMS FNAL 2              250
8 SAM-Grid
- SAM-Grid is an infrastructure that understands DØ processing needs and maps them onto available resources (OSG); a toy sketch of this mapping follows the list below
- Implements job-to-resource mappings, for both computing and storage
- Uses SAM (Sequential Access via Metadata)
- Automated management of storage elements
- Metadata cataloguing
- Job submission and job status tracking
- Progress monitoring
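The "maps them onto available resources" step can be pictured with a minimal sketch. This is illustrative Python, not SAM-Grid code; the cluster names are taken from the usage-model table, but the slot counts and VO fields are invented for the example:

```python
# Illustrative only: a toy version of the job-to-resource mapping idea.
clusters = [
    {"name": "CC-IN2P3 Lyon", "free_slots": 120, "vos": {"dzero"}},          # invented numbers
    {"name": "UCSD",          "free_slots": 0,   "vos": {"dzero", "cms"}},
    {"name": "PURDUE-ITaP",   "free_slots": 300, "vos": {"cms"}},
]

job = {"vo": "dzero", "slots_wanted": 50}

def match(job, clusters):
    """Return clusters that support the job's VO and have enough free slots."""
    return [c["name"] for c in clusters
            if job["vo"] in c["vos"] and c["free_slots"] >= job["slots_wanted"]]

print(match(job, clusters))   # ['CC-IN2P3 Lyon']
```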
9 SAM-Grid Architecture
10 Challenge: Certification
- Compare production at a new site with standard production at the DØ farm: if the OSG cluster output and the reference output are the same, the site is certified.
- Note: problems were experienced during the certification on a virtual OS; the default random seed in Python was set to the same value on all machines.
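The seeding issue noted above is easy to avoid; a minimal sketch (illustrative, not the DØ production code) that derives a distinct seed per worker:

```python
import os
import random
import socket
import time

# Illustrative only: if every (cloned) virtual machine starts with the same
# default seed, all jobs draw identical "random" sequences.  Deriving the seed
# from host name, process id and time makes each worker distinct.
seed = hash((socket.gethostname(), os.getpid(), time.time())) & 0xFFFFFFFF
random.seed(seed)
print(socket.gethostname(), os.getpid(), seed, random.random())
```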
11 Challenge: Data Accessibility Test
- 10000 seconds to transfer the data (30 streams): not acceptable
- 2000 seconds to transfer the data (30 streams): acceptable
12 Challenge: Troubleshooting
(Charts: OSG-related problems before the intervention of the Troubleshooting Team, 03/27/2007; most jobs succeed afterwards, 04/17/2007.)
- The OSG Troubleshooting Team was instrumental to the success of the project.
13 Reprocessing Summary
- "This was the first major production of real high energy physics data (as opposed to simulations) ever run on OSG resources," said Brad Abbott, head of the DØ computing group.
- On OSG, DØ sustained execution of over 1000 simultaneous jobs, and overall moved over 70 terabytes of data.
- Reprocessing was completed in June. Towards the end of the production run the throughput on OSG was more than 5 million events per day, two to three times more than originally planned.
- In addition to the reprocessing effort, OSG provided 300,000 CPU hours to DØ for one of the most precise measurements to date of the top quark mass, and to achieve this result in time for the spring physics conferences.
14 Reprocessing over time
15 D0 Discovery: Single Top Production
- The top quark was discovered in 1995 at the Tevatron using the pair production mode
- The prediction of single top quark production has recently been confirmed by the D0 data
- Important measurement of the t-b coupling
- Similar final state to the WH → lν bb search
- Therefore also a key milestone in the Higgs search
16 Conclusion
- Successful and pioneering effort in data-intensive production in an opportunistic environment
- Challenges in support, coordination of resource usage, and reservation of the shared resources
- An iterative approach to enabling new resources helped make the computing problem more manageable
17 The Collider Detector at Fermilab (CDF)
(Detector diagram labels: muon detector, central hadronic calorimeter, central outer tracker (COT).)
18 A Mountain of Data
- 5.8 x 10^9 events
- 804 TB raw data
- 2.4 PB total data
- At least 2x more data coming before the end of the run.
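For scale, the numbers above imply an average raw event size of roughly 140 kB; a one-line check (a sketch, using decimal terabytes):

```python
# Rough arithmetic from the numbers above (illustrative only).
events = 5.8e9
raw_bytes = 804e12                     # 804 TB of raw data
print(f"~{raw_bytes / events / 1e3:.0f} kB of raw data per event")   # ~139 kB
```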
19 Computing Model
- Each event is independent: one job can fail and the others will continue
- No inter-process communication
- Mostly integer computing
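Because each event is independent and there is no inter-process communication, the workload is embarrassingly parallel; a minimal sketch of that model (illustrative Python, not CDF's actual reconstruction code):

```python
from multiprocessing import Pool

# Illustrative only: events are independent, so they can be farmed out with no
# inter-process communication, and one failure does not stop the others.
def reconstruct(event_id):
    try:
        # placeholder for the per-event reconstruction (mostly integer work)
        return (event_id, "ok")
    except Exception as err:
        return (event_id, f"failed: {err}")

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        results = pool.map(reconstruct, range(1000))
    print(sum(1 for _, status in results if status == "ok"), "events processed")
```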
20 The Computing Problem: WW candidate event
- Reconstruction/analysis: connecting the dots on 3-D spiral tracks
- Correlate with calorimeter energy
- Find missing energy (large red arrow)
- Combinatoric fitting to see what is consistent with a W particle.
21 CAF Software
- Front end: submission, authentication and monitoring software
- Users submit, debug, and monitor from the desktop
- Works with various batch systems
- CDF began with dedicated facilities at Fermilab and remote institutions
- Monitoring page at http://cdfcaf.fnal.gov/
22 Why the Open Science Grid?
- The majority of the CPU load is simulation
- Requires 10 GHz-sec per event
- Some analyses need > 1 billion simulated events (see the rough scale estimate after this list)
- Increasing data volumes mean that demand for computing is growing faster than dedicated resources at FNAL and elsewhere.
- Simulation is relatively easy to set up on remote sites
- CDF member institutions that previously had dedicated CDF facilities are now using grid interfaces
- Strategy:
- Data analysis mostly close to home (FermiGrid CAF)
- Monte Carlo simulations spread across the OSG (NAMCAF).
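The scale of the simulation load explains why dedicated resources are not enough. A rough estimate (a sketch; the 3 GHz core speed is an assumption, not from the slides):

```python
# Rough scale estimate for the simulation load (illustrative only).
ghz_sec_per_event = 10        # quoted above
events_needed = 1e9           # "> 1 billion simulated events"
core_speed_ghz = 3.0          # ASSUMED typical core speed, not from the slides

cpu_hours = ghz_sec_per_event * events_needed / core_speed_ghz / 3600
print(f"~{cpu_hours / 1e6:.1f} million CPU-hours "
      f"(~{cpu_hours / 24 / 365:.0f} CPU-years)")
```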
23 Condor Glide-ins
- Submit a pilot job to a number of remote sites
- The pilot job calls the home server to get a work unit
- Integrity of the job and executable is checked with MD5 checksums (see the sketch below)
- To CDF users it looks like a local batch pool
- Glidekeeper daemons monitor remote sites and submit enough jobs in advance to use the available slots.
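The integrity check mentioned above can be sketched in a few lines (illustrative Python, not the actual CAF code; the checksum value and file name are hypothetical):

```python
import hashlib

# Illustrative only: verify a fetched work unit against the MD5 checksum
# published by the submission server before running it.
def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "0123456789abcdef0123456789abcdef"   # hypothetical value from the server
if md5sum("work_unit.tar.gz") != expected:      # hypothetical file name
    raise RuntimeError("work unit failed its integrity check; refusing to run")
```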
24 GlideCAF Overview
(Diagram: the GlideCAF portal, with its main schedd and batch queue, submitter daemon, glidekeeper daemon, glide-in schedd, collector, negotiator, and monitoring daemons, connecting to grid pools through Globus.)
25 NAMCAF: CDF Computing on the Open Science Grid
- North American CAF: a single submission point for all OSG sites
- CDF user interface, uses OSG tools underneath
- No CDF-specific hardware or software at OSG sites
- Accesses OSG sites at MIT, Fermilab, UCSD, Florida, Chicago
- OSG sites at Purdue, Toronto, Wisconsin, McGill to be added
- Provides up to 1000 job slots already
- Similar entry points to European sites (LCGCAF) and Taiwan/Japan sites (PACCAF)
26 CDF OSG Usage
(Chart: OSG usage by DØ and CDF.)
27 Auxiliary Tools: gLExec
- All glidein jobs on the grid appear to come from the same user.
- gLExec uses Globus callouts to contact the site authentication infrastructure
- EGEE: LCAS/LCMAPS; OSG: GUMS/SAZ
- Each individual user job authenticates to the site at the start of the job
- Gives the site independent control over whom it takes glideins from.
28 W Boson Mass Measurement
(Plot: W mass measurements, including the LEP experiments at CERN.)
- The CDF Run 2 result is the most precise single measurement of the W mass (used about a million CPU hours for the mass fitting).
29 What is FermiGrid?
- FermiGrid is:
- The Fermilab campus Grid and Grid portal.
- The site Globus gateway: accepts jobs from sources external to Fermilab and forwards them onto internal clusters.
- A set of common services to support the campus Grid and interface to the Open Science Grid (OSG) / LHC Computing Grid (LCG): VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia accounting, etc.
- A forum for promoting stakeholder interoperability and resource sharing within Fermilab: CMS, CDF, D0, KTeV, MiniBooNE, MINOS, MIPP, etc.
- The Open Science Grid portal to Fermilab compute and storage services.
- FermiGrid web site and additional documentation: http://fermigrid.fnal.gov/
- Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.
30 Jobmanager-cemon MatchMaking Service
- What is it?
- FermiGrid has a matchmaking service deployed on the central gatekeeper (fermigrid1.fnal.gov). This service matches incoming jobs against the various resources available at the point in time that the job is submitted.
- How can users make use of the MatchMaking Service?
- Users begin by submitting jobs to the fermigrid1 central gatekeeper through jobmanager-cemon.
- By default, the value of the "requirements" attribute is set such that a user's job will be matched against clusters which support the user's VO (Virtual Organization) and have at least one free slot available at the time the job is submitted to fermigrid1.
- However, users can add further conditions to this "requirements" attribute, using the attribute named "gluerequirements" in the Condor submit file.
- These additional conditions should be specified in terms of Glue Schema attributes (see the sketch below).
- More information: http://fermigrid.fnal.gov/matchmaking.html
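A minimal submit-file sketch along the lines described above. The executable and file names and the memory condition are invented for illustration, and the exact way the gluerequirements attribute is passed (shown here as a '+' custom attribute) and the grid universe syntax may differ depending on the Condor-G version in use; see the matchmaking page linked above for the authoritative recipe.

```
# Hypothetical Condor-G submit file sketch for the jobmanager-cemon matchmaker.
universe        = grid
grid_resource   = gt2 fermigrid1.fnal.gov/jobmanager-cemon
executable      = my_analysis.sh
output          = job.out
error           = job.err
log             = job.log

# Extra matchmaking condition in Glue Schema terms, on top of the default
# "supports my VO and has a free slot" requirements.
+gluerequirements = "GlueHostMainMemoryRAMSize >= 2000"

queue
```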
31 FermiGrid - Current Architecture
(Diagram: the site-wide gateway, with the VOMS server, GUMS server (kept in periodic synchronization with VOMS), SAZ server, and BlueArc storage; the interior clusters (CMS WC1, CDF OSG1, CDF OSG2, D0 CAB1, D0 CAB2, GP Farm) send ClassAds via CEMon to the site-wide gateway, which faces the exterior.)
32 SAZ - Animation
(Animation: the gatekeeper sends the user's DN, VO, Role, and CA to the SAZ server for an authorization decision.)
33 FermiGrid - Current Performance
- VOMS
- Current record: 1700 voms-proxy-inits/day.
- Not a driver for FermiGrid-HA.
- GUMS
- Current record: > 1M mapping requests/day
- Maximum system load < 3 at a CPU utilization of 130% (max 200%)
- SAZ
- Current record: > 129K authorization decisions/day.
- Maximum system load < 5.
34 BlueArc/dCache
- Open Science Grid has two storage methods:
- NFS-mounted OSG_DATA, implemented with a BlueArc NFS filer (see the sketch below)
- SRM/dCache
- Volatile area, 7 TB, for any grid user
- Large areas backed up on tape for Fermilab experiments
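From a job's point of view, the NFS-mounted area is simply a path advertised in the OSG_DATA environment variable; a minimal sketch (illustrative Python; the directory layout and output file name are hypothetical):

```python
import os
import shutil

# Illustrative only: a grid job staging an output file into the site's
# NFS-mounted OSG_DATA area.
osg_data = os.environ.get("OSG_DATA")
if osg_data is None:
    raise RuntimeError("this site does not advertise OSG_DATA")

dest = os.path.join(osg_data, "myvo", "results")   # hypothetical layout
os.makedirs(dest, exist_ok=True)
shutil.copy("histograms.root", dest)               # hypothetical output file
```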
35 FermiGrid-HA - Component Design
(Diagram: active-active VOMS, GUMS, and SAZ services behind an active/standby LVS pair, with active-active MySQL servers kept in sync by replication and watched by heartbeat.)
36 FermiGrid-HA - Actual Component Deployment
(Diagram: two physical hosts, fermigrid5 and fermigrid6, each running LVS in Xen Domain 0 (active on one, standby on the other) and four Xen VMs: VOMS on fg5x1/fg6x1 (Xen VM 1), GUMS on fg5x2/fg6x2 (Xen VM 2), SAZ on fg5x3/fg6x3 (Xen VM 3), and MySQL on fg5x4/fg6x4 (Xen VM 4), all active on both hosts.)
37 Supported by the Department of Energy Office of Science SciDAC-2 program, through the High Energy Physics, Nuclear Physics, and Advanced Software and Computing Research programs, and by the National Science Foundation Mathematical and Physical Sciences, Office of CyberInfrastructure, and Office of International Science and Engineering Directorates.
38 Open Science Grid
- The Vision
- Transform compute- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations at all scales
- Submit Local, Run Global
39 Open Science Grid
- Science community infrastructure (e.g. ATLAS, CMS, LIGO, ...)
- CS/IT campus grids (e.g. DOSAR, FermiGrid, GLOW, GPN, GROW)
- These need to be harmonized into a well-integrated whole
40 Open Science Grid International Partners
- EGEE, TeraGrid, NorduGrid, NYSGrid, GROW, GLOW, APAC, DiSUN, FermiGrid, LCG, TIGRE, ASGC, NWICG
- An international science community: common goals, shared data, collaborative work
41 Open Science Grid
42 Open Science Grid: Rosetta, a non-physics experiment
- "For each protein we design, we consume about 3,000 CPU hours across 10,000 jobs," says Kuhlman. "Adding in the structure and atom design process, we've consumed about 100,000 CPU hours in total so far."
43 Open Science Grid: CHARMM
- CHARMM: CHemistry at HARvard Macromolecular Mechanics
- "I'm running many different simulations to determine how much water exists inside proteins and whether these water molecules can influence the proteins," Damjanovic says.
44 Open Science Grid: How it all comes together
- Virtual Organization Management services (VOMS) allow registration, administration and control of the members of the group.
- Resources trust and authorize VOs, not individual users.
- The OSG infrastructure provides the fabric for job submission and scheduling, resource discovery, security, monitoring, ...
(Diagram layers: VO Middleware & Applications; VO Management Service; OSG Infrastructure; Resources that Trust the VO.)
45 Open Science Grid Software Stack
(Layered stack, top to bottom:)
- User science codes and interfaces
- VO middleware / applications: HEP (data and workflow management, etc.), Biology (portals, databases, etc.), Astrophysics (data replication, etc.)
- OSG Release Cache: OSG-specific configurations, utilities, etc.
- Virtual Data Toolkit (VDT): core technologies and software needed by stakeholders; many components shared with EGEE
- Infrastructure: core Grid technology distributions (Condor, Globus, MyProxy), shared with TeraGrid and others
- Resource: existing operating systems, batch systems and utilities.
46 Open Science Grid Security
- Operational security is a priority
- Incident response
- Signed agreements, template policies
- Auditing, assessment and training
- Symmetry of Sites and VOs
- VO and Site are two faces of a coin: we believe in symmetry
- VO and Site each have responsibilities
- Trust relationships
- A Site trusts the VOs that use it.
- A VO trusts the Sites it runs on.
- VOs trust their users.
47 Open Science Grid: Come Join OSG!
- How to become an OSG citizen:
- Join the OSGEDU VO
- Run small applications after learning how to use OSG from schools
- Be part of the Engagement program and the Engage VO
- Support within the Facility to bring applications to production on the distributed infrastructure
- Be a standalone VO and a member of the Consortium
- Ongoing use of OSG; participate in one or more activity groups.