1
FermiGrid/CDF/D0/OSG
2
Global Collaboration With Grids
  • Ziggy wants his humans home by the end of the day
    for food and attention
  • Follow Ziggy through National, Campus, and
    Community grids to see how it happens

3
What is DØ?
  • The DØ experiment consists of a worldwide
    collaboration of scientists conducting research
    on the fundamental nature of matter.
  • 500 scientists and engineers
  • 60 institutions
  • 15 countries
  • The research is focused on precise studies of
    interactions of protons and antiprotons at the
    highest available energies.

4
DØ Detector
  • The detector is designed to stop as many as
    possible of the subatomic particles created from
    energy released by colliding proton/antiproton
    beams.
  • The intersection region where the
    matter-antimatter annihilation takes place is
    close to the geometric center of the detector.
  • The beam collision area is surrounded by tracking
    chambers in a strong magnetic field parallel to
    the direction of the beam(s).
  • Outside the tracking chamber are the pre-shower
    detectors and the calorimeter.

5
What is reprocessing?
  • Periodically an experiment will reprocess data
    taken previously because of improvements in the
    understanding of the detector, for example:
  • recalibration of the calorimeter
  • improvements in the algorithms used in the
    analysis
  • The reprocessing effort pushes the limits of
    software and infrastructure to get the most
    physics out of the data collected by the DØ
    detector

A new layer of the silicon detector of the DØ
detector
6
Case for using OSG resources
  • Goal: reprocess ~500 M Run II events with the
    newly calibrated detector and improved
    reconstruction software by the end of March 2007,
    when the data have to be ready for physics
    analysis
  • Input: 90 TB of detector data, 250 TB in
    executables
  • Output: 60 TB of data in 500 CPU-years
  • Estimated resources: about 1500-2000 CPUs for a
    period of about 4 months (rough check below)
  • Problem: DØ did not have enough dedicated
    resources to complete the task in the target 3
    months
  • Solution: use SAM-Grid/OSG interoperability to
    allow SAM-Grid jobs to be executed on OSG
    clusters.
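As a rough plausibility check of the numbers above (an illustrative sketch, not part of the original slides), 500 CPU-years spread over 1500-2000 CPUs corresponds to roughly 3-4 months of wall-clock running and about 30 CPU-seconds per event:

```python
# Back-of-the-envelope check of the reprocessing estimate
# (all figures are taken from the slide above).

EVENTS = 500e6                      # ~500 M Run II events to reprocess
CPU_YEARS = 500                     # quoted total CPU time
SECONDS_PER_YEAR = 365 * 24 * 3600

cpu_seconds = CPU_YEARS * SECONDS_PER_YEAR
print(f"CPU time per event: {cpu_seconds / EVENTS:.1f} s")    # ~31.5 s/event

# Wall-clock duration at perfect efficiency for the quoted CPU counts
for cpus in (1500, 2000):
    months = CPU_YEARS / cpus * 12
    print(f"{cpus} CPUs -> ~{months:.1f} months of running")
```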

7
OSG Usage Model
  • Opportunistic usage model
  • Agreed to share computing cycles with OSG users
  • The exact amount of resources available at any
    time cannot be guaranteed

OSG Cluster              CPUs
Brazil                   230
CC-IN2P3 Lyon            500
LOUISIANA LTU-CCT        250 (128)
UCSD                     300 (70)
PURDUE-ITaP              600 (?)
Oklahoma University      200
Indiana University       250
NERSC LBL                250
University of Nebraska   256
CMS FNAL                 2,250
8
SAM-Grid
  • SAM-Grid is an infrastructure that understands DØ
    processing needs and maps them onto available
    resources (OSG)
  • Implements job-to-resource mappings
  • Both computing and storage
  • Uses SAM (Sequential Access via Metadata)
  • Automated management of storage elements
  • Metadata cataloguing
  • Job submission and job status tracking
  • Progress monitoring

9
SAM-Grid Architecture
10
Challenge: Certification
Compare production at a new site with standard
production at the DØ farm
[Diagram: output from the OSG cluster is compared
with the reference output from the DØ farm; if they
are the same, the site is certified]
Note: we experienced problems during certification
on virtualized OS installations; the default random
seed in Python was set to the same value on all
machines.
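The seeding pitfall noted above is easy to avoid. Below is a minimal sketch (not the actual DØ code) that mixes OS entropy with job-specific identifiers, so that identically cloned virtual machines do not all draw the same "random" sequence:

```python
import os
import random
import socket
import time

def seed_for_job(job_id):
    """Seed Python's RNG from OS entropy plus job-specific identifiers.

    If every worker starts from the same default state (e.g. cloned VMs),
    all jobs can produce identical "random" sequences; mixing in entropy,
    hostname, PID, time, and the job number avoids that.
    """
    entropy = int.from_bytes(os.urandom(8), "big")
    seed = hash((entropy, socket.gethostname(), os.getpid(), time.time(), job_id))
    random.seed(seed)
    return seed

if __name__ == "__main__":
    s = seed_for_job(job_id=42)   # 42 stands in for the real job number
    print(f"seed={s}, first draw={random.random():.6f}")
```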
11
Challenge: Data Accessibility Test
[Plots: time to transfer the test data with 30
streams — about 2,000 s (acceptable) vs. about
10,000 s (not acceptable)]
12
Challenge: Troubleshooting
[Plots: most jobs succeed (04/17/2007) vs.
OSG-related problems before the intervention of the
Troubleshooting Team (03/27/2007)]

The OSG Troubleshooting team was instrumental to
the success of the project.
13
Reprocessing Summary
  • "This was the first major production of real
    high-energy physics data (as opposed to
    simulations) ever run on OSG resources," said
    Brad Abbott, head of the DØ computing group.
  • On OSG, DØ sustained execution of over 1000
    simultaneous jobs and overall moved over 70
    terabytes of data.
  • Reprocessing was completed in June. Towards the
    end of the production run the throughput on OSG
    was more than 5 million events per day, two to
    three times more than originally planned.
  • In addition to the reprocessing effort, OSG
    provided 300,000 CPU-hours to DØ for one of the
    most precise measurements to date of the top
    quark mass, helping to achieve this result in
    time for the spring physics conferences

14
Reprocessing over time
15
DØ Discovery: Single Top Production
  • Top quark discovered in 1995 at the Tevatron
    using the pair-production mode
  • The prediction of single top quark production has
    recently been confirmed by the DØ data
  • Important measurement of the t-b coupling
  • Similar final state to the WH → lν bb search
  • Therefore also a key milestone in the Higgs search

16
Conclusion
  • Successful and pioneering effort in data
    intensive production in an opportunistic
    environment
  • Challenges in support, coordination of resource
    usage, and reservation of the shared resources
  • An iterative approach to enabling new resources
    helped make the computing problem more manageable

17
The Collider Detector at Fermilab (CDF)
[Detector diagram labels: muon detector, central
hadronic calorimeter, central outer tracker (COT)]
18
A Mountain of Data
5.8 × 10^9 events; 804 TB raw data; 2.4 PB total
data. At least 2× more data coming before the end
of the run.
19
Computing Model
Each event is independent: one job can fail and the
others will continue. No inter-process
communication. Mostly integer computing.
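A minimal sketch of this computing model (illustrative only, not CDF code): events are processed independently, a failure in one work unit does not affect the others, and the workers never communicate.

```python
from multiprocessing import Pool

def reconstruct(event_id):
    """Stand-in for per-event reconstruction; every event is independent."""
    try:
        if event_id % 1000 == 999:        # simulate an occasional bad event
            raise RuntimeError("bad event")
        return event_id, "ok"
    except Exception:
        return event_id, "failed"         # one failure does not stop the rest

if __name__ == "__main__":
    events = range(5000)
    with Pool(processes=8) as pool:       # workers never talk to each other
        results = pool.map(reconstruct, events)
    failed = sum(1 for _, status in results if status == "failed")
    print(f"processed {len(results)} events, {failed} failed")
```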
20
The Computing Problem: WW candidate event
Reconstruction/analysis: connecting the dots on
3-D spiral tracks; correlating with calorimeter
energy; finding missing energy (large red arrow);
combinatoric fitting to see what is consistent
with a W particle.
21
CAF Software
  • Front-end submission, authentication, and
    monitoring software
  • Users submit, debug, and monitor from the desktop
  • Works with various batch systems
  • CDF began with dedicated facilities at Fermilab
    and remote institutions
  • Monitoring page at http://cdfcaf.fnal.gov/

22
Why the Open Science Grid?
  • The majority of the CPU load is simulation
  • Requires ~10 GHz-sec per event
  • Some analyses need > 1 billion simulated events
    (rough CPU estimate below)
  • Increasing data volumes mean that demand for
    computing is growing faster than dedicated
    resources at FNAL and elsewhere
  • Simulation is relatively easy to set up on remote
    sites
  • CDF member institutions that previously had
    dedicated CDF facilities now use grid
    interfaces
  • Strategy:
  • Data analysis mostly close to home (FermiGrid
    CAF)
  • Monte Carlo simulations spread across the OSG
    (NAMCAF).
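A rough arithmetic sketch of why dedicated resources are not enough (the per-event cost and event count come from the slide; the 3 GHz core speed is an assumption):

```python
# Back-of-the-envelope simulation cost (illustrative).
GHZ_SEC_PER_EVENT = 10          # ~10 GHz-sec per simulated event (from the slide)
EVENTS = 1e9                    # some analyses need > 1 billion simulated events
CORE_GHZ = 3.0                  # assumed clock speed of a typical core

cpu_seconds = EVENTS * GHZ_SEC_PER_EVENT / CORE_GHZ
cpu_years = cpu_seconds / (365 * 24 * 3600)
print(f"~{cpu_years:.0f} CPU-years on {CORE_GHZ} GHz cores")   # roughly 100 CPU-years
```

A single billion-event sample would tie up on the order of a hundred CPUs for a year, which is why the simulation load is spread opportunistically across the OSG.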

23
Condor Glide-ins
  • Submit pilot jobs to a number of remote sites
  • The pilot job calls the home server to get a work
    unit
  • Integrity of the job and executable is checked
    with MD5 checksums (sketch below)
  • To CDF users it looks like a local batch pool
  • Glidekeeper daemons monitor remote sites and
    submit enough jobs in advance to use the
    available slots.
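A minimal sketch of the MD5 integrity check mentioned above (the checksum idea is from the slide; the file names and expected digest are assumptions):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file in streaming fashion."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_md5):
    """Check a downloaded work unit or executable against its published checksum."""
    return md5sum(path) == expected_md5

# Hypothetical usage with a made-up tarball name and checksum:
# ok = verify("cdf_work_unit.tgz", "9e107d9d372bb6826bd81d3542a419d6")
```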

24
GlideCAF overview
[Diagram: the GlideCAF portal hosts the submitter,
glidekeeper, and monitoring daemons together with
the main and glide-in schedds, collector, and
negotiator; glide-ins are sent via Globus to the
batch queues of remote grid pools]
25
NAMCAF: CDF Computing on the Open Science Grid
  • North American CAF: single submission point for
    all OSG sites
  • CDF user interface, uses OSG tools underneath
  • No CDF-specific hardware or software at OSG sites
  • Accesses OSG sites at MIT, Fermilab, UCSD,
    Florida, and Chicago
  • OSG sites at Purdue, Toronto, Wisconsin, and
    McGill to be added

Already provides up to 1000 job slots. Similar
entry points exist for European sites (LCGCAF) and
for the Taiwan and Japan sites (PACCAF).
26
CDF OSG Usage

27
Auxiliary tools: gLExec
  • All glide-in jobs on the grid appear to come from
    the same user
  • gLExec uses Globus callouts to contact the site
    authorization infrastructure
  • EGEE: LCAS/LCMAPS; OSG: GUMS/SAZ
  • Each individual user job authenticates to the
    site at the start of the job (sketch below)
  • Gives the site independent control over whose
    glide-ins it accepts.
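For illustration, a sketch of what per-job identity switching with gLExec can look like from the pilot's side. The glexec path and the GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY environment variable names follow common gLExec deployments but should be treated as assumptions here; the site documentation is authoritative.

```python
import os
import subprocess

def run_under_user_identity(user_proxy, command):
    """Have a glide-in pilot run a user payload under that user's own identity.

    gLExec consults the site authorization services (GUMS/SAZ on OSG,
    LCAS/LCMAPS on EGEE) and, if the user is authorized, executes the
    payload under the mapped local account.
    """
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = user_proxy    # user's delegated proxy (assumed variable name)
    env["GLEXEC_SOURCE_PROXY"] = user_proxy   # assumed variable name
    return subprocess.call(["/opt/glite/sbin/glexec"] + command, env=env)

# Hypothetical usage:
# rc = run_under_user_identity("/tmp/x509up_u1234", ["./cdf_user_job.sh"])
```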

28
W boson mass measurement
[Plot: comparison with the LEP experiments at CERN]
The CDF Run 2 result is the most precise single
measurement of the W mass (it used about a million
CPU-hours for the mass fitting)
29
What is FermiGrid?
  • FermiGrid is:
  • The Fermilab campus grid and grid portal
  • The site Globus gateway
  • Accepts jobs from external (to Fermilab) sources
    and forwards the jobs onto internal clusters
  • A set of common services to support the campus
    grid and interface to the Open Science Grid (OSG)
    / LHC Computing Grid (LCG)
  • VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia
    accounting, etc.
  • A forum for promoting stakeholder
    interoperability and resource sharing within
    Fermilab
  • CMS, CDF, D0
  • KTeV, MiniBooNE, MINOS, MIPP, etc.
  • The Open Science Grid portal to Fermilab compute
    and storage services
  • FermiGrid web site and additional documentation:
  • http://fermigrid.fnal.gov/
  • Work supported by the U.S. Department of Energy
    under contract No. DE-AC02-07CH11359.

30
Jobmanager-cemon MatchMaking Service
  • What is it?
  • FermiGrid has a matchmaking service deployed on
    the central gatekeeper (fermigrid1.fnal.gov).
    This service is used to match incoming jobs
    against the various resources available at the
    point in time that the job is submitted.
  • How can users make use of the MatchMaking
    Service?
  • Users begin by submitting jobs to the fermigrid1
    central gatekeeper through jobmanager-cemon.
  • By default, the value of the "requirements"
    attribute is set such that a user's job will be
    matched against clusters which support the user's
    VO (Virtual Organization) and have at least one
    free slot available at the time when the job is
    submitted to fermigrid1.
  • However, users can add additional conditions to
    this "requirements" attribute, using the
    attribute named "gluerequirements" in the Condor
    submit file (see the sketch after this list).
  • These additional conditions should be specified
    in terms of Glue Schema attributes.
  • More information:
  • http://fermigrid.fnal.gov/matchmaking.html
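A hedged sketch of what such a submission might look like. The grid_resource string, file names, and the exact spelling of the custom attribute are illustrative assumptions based on the description above; http://fermigrid.fnal.gov/matchmaking.html is authoritative.

```python
import subprocess
import textwrap

# Write a Condor-G submit file targeting the fermigrid1 gatekeeper via
# jobmanager-cemon, with an extra matchmaking condition expressed in
# Glue Schema attributes, then submit it.
submit_file = textwrap.dedent("""\
    universe      = grid
    grid_resource = gt2 fermigrid1.fnal.gov/jobmanager-cemon
    executable    = my_job.sh
    output        = job.out
    error         = job.err
    log           = job.log
    # extra condition in Glue Schema attributes (exact syntax may differ):
    +gluerequirements = (GlueHostMainMemoryRAMSize >= 2000)
    queue
""")

with open("fermigrid_job.sub", "w") as f:
    f.write(submit_file)

subprocess.run(["condor_submit", "fermigrid_job.sub"], check=True)
```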

31
FermiGrid - Current Architecture
[Diagram: the site-wide gateway, with the VOMS and
GUMS servers kept in step by periodic
synchronization and the SAZ server, sits on the
exterior; internal clusters (CMS WC1, CDF OSG1,
CDF OSG2, D0 CAB1, D0 CAB2, GP Farm) and BlueArc
storage sit on the interior; clusters send ClassAds
via CEMon to the site-wide gateway]
32
SAZ - Animation
[Animation: the gatekeeper presents the DN, VO,
Role, and CA of an incoming job to SAZ for an
authorization decision]
33
FermiGrid - Current Performance
  • VOMS
  • Current record: 1700 voms-proxy-inits/day
  • Not a driver for FermiGrid-HA
  • GUMS
  • Current record: > 1 M mapping requests/day
  • Maximum system load < 3 at a CPU utilization of
    130% (max 200%)
  • SAZ
  • Current record: > 129 K authorization
    decisions/day
  • Maximum system load < 5.

34
BlueArc/dCache
  • The Open Science Grid has two storage methods
    (usage sketch below):
  • NFS-mounted OSG_DATA
  • Implemented with a BlueArc NFS filer
  • SRM/dCache
  • A volatile area, 7 TB, for any grid user
  • Large areas backed up on tape for Fermilab
    experiments
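A small sketch of how a grid job might use the two storage areas (the $OSG_DATA environment variable is a standard OSG convention; the SRM endpoint below is a placeholder, not a real FermiGrid address, and the exact srmcp URL form depends on the client version):

```python
import os
import shutil
import subprocess

def stage_out(local_file, use_srm=False):
    """Copy a job's output either to the NFS-mounted OSG_DATA area
    (BlueArc on FermiGrid) or via SRM into a dCache area."""
    name = os.path.basename(local_file)
    if use_srm:
        # placeholder SRM endpoint for illustration only
        dest = "srm://dcache.example.fnal.gov:8443/pnfs/volatile/" + name
        subprocess.run(["srmcp", "file:///" + os.path.abspath(local_file), dest], check=True)
    else:
        data_dir = os.environ["OSG_DATA"]    # shared NFS area advertised by the site
        shutil.copy(local_file, os.path.join(data_dir, name))

# Hypothetical usage:
# stage_out("results.root")                    # small output -> OSG_DATA
# stage_out("big_output.root", use_srm=True)   # large output -> dCache volatile area
```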

35
FermiGrid-HA - Component Design
[Diagram: VOMS, GUMS, and SAZ each run as
active-active pairs behind an active/standby LVS
pair with heartbeat; the two active MySQL databases
behind GUMS and SAZ are kept consistent via
replication]
36
FermiGrid-HA - Actual Component Deployment
[Diagram: two physical hosts, fermigrid5 and
fermigrid6, each run Xen Domain 0 with LVS (active
on one host, standby on the other) plus four Xen
VMs hosting the active services: VOMS on
fg5x1/fg6x1, GUMS on fg5x2/fg6x2, SAZ on
fg5x3/fg6x3, and MySQL on fg5x4/fg6x4]
37

Supported by the Department of Energy Office of
Science SciDAC-2 program from the High Energy
Physics, Nuclear Physics and Advanced Software
and Computing Research programs, and the
National Science Foundation Math and Physical
Sciences, Office of CyberInfrastructure and
Office of International Science and Engineering
Directorates.
38
Open Science Grid
  • The Vision:
  • Transform compute- and data-intensive science
    through a cross-domain, self-managed, national,
    distributed cyber-infrastructure that brings
    together campus and community infrastructure and
    facilitates the needs of Virtual Organizations
    at all scales
  • Submit Local, Run Global

39
Open Science Grid
Science community infrastructure (e.g. ATLAS, CMS,
LIGO, ...) and CS/IT campus grids (e.g. DOSAR,
FermiGrid, GLOW, GPN, GROW) need to be harmonized
into a well-integrated whole
40
Open Science Grid International Partners: EGEE,
TeraGrid, NorduGrid, NYSGrid, GROW, GLOW, APAC,
DiSUN, FermiGrid, LCG, TIGRE, ASGC, NWICG
An international science community: common goals,
shared data, collaborative work
41
Open Science Grid
42
Open Science Grid: Rosetta, a non-physics
experiment
"For each protein we design, we consume about
3,000 CPU hours across 10,000 jobs," says
Kuhlman. "Adding in the structure and atom design
process, we've consumed about 100,000 CPU hours in
total so far."
43
Open Science Grid: CHARMM
CHARMM (Chemistry at HARvard Macromolecular
Mechanics): "I'm running many different simulations
to determine how much water exists inside proteins
and whether these water molecules can influence the
proteins," Damjanovic says.
44
Open Science Grid: How it all comes together
Virtual Organization Management services (VOMS)
allow registration, administration, and control of
the members of the group. Resources trust and
authorize VOs, not individual users. The OSG
infrastructure provides the fabric for job
submission and scheduling, resource discovery,
security, monitoring, etc.
[Diagram layers: VO middleware and applications;
VO management service; OSG infrastructure;
resources that trust the VO]
45
Open Science Grid Software Stack
  • User science codes and interfaces
  • VO middleware and applications
  • HEP: data and workflow management, etc.
  • Biology: portals, databases, etc.
  • Astrophysics: data replication, etc.
  • OSG Release Cache: OSG-specific configurations,
    utilities, etc.
  • Virtual Data Toolkit (VDT): core technologies and
    software needed by stakeholders, many components
    shared with EGEE
  • Infrastructure: core grid technology
    distributions (Condor, Globus, MyProxy), shared
    with TeraGrid and others
  • Resource: existing operating systems, batch
    systems, and utilities
46
Open Science Grid: Security
  • Operational security is a priority
  • Incident response
  • Signed agreements, template policies
  • Auditing, assessment, and training
  • Symmetry of Sites and VOs
  • VO and Site: two faces of a coin; we believe
    in symmetry
  • VO and Site each have responsibilities
  • Trust relationships
  • A Site trusts the VOs that use it
  • A VO trusts the Sites it runs on
  • VOs trust their users

47
Open Science Grid: Come Join OSG!
  • How to become an OSG citizen:
  • Join the OSGEDU VO
  • Run small applications after learning how to use
    OSG at the schools
  • Be part of the Engagement program and the Engage
    VO
  • Support within the Facility to bring applications
    to production on the distributed infrastructure
  • Be a standalone VO and a member of the
    Consortium
  • Ongoing use of OSG and participation in one or
    more activity groups.