Title: Purdue Campus Grid
1. Purdue Campus Grid
- Preston Smith
- psmith@purdue.edu
- Condor Week 2006
- April 24, 2006
2. Overview
- RCAC
  - Community Clusters
- Grids at Purdue
  - Campus
  - Regional
    - NWICG
  - National
    - OSG
      - CMS Tier-2
      - NanoHUB
    - TeraGrid
- Future Work
3. Purdue's RCAC
- Rosen Center for Advanced Computing
  - Division of Information Technology at Purdue (ITaP)
- Wide variety of systems: shared memory and clusters
  - 352-CPU IBM SP
  - Five 24-processor Sun F6800s, two 56-processor Sun E10Ks
  - Five Linux clusters
4. Linux clusters in RCAC
- Recycled clusters
  - Systems retired from student labs
- Nearly 1000 nodes of single-CPU PIII and P4, and 2-CPU Athlon MP and EM64T Xeon machines, for general use by Purdue researchers
5. Community Clusters
- Federate resources at a low level
  - Separate researchers buy sets of nodes to federate into larger clusters
  - Enables larger clusters than a scientist could support on his own
- Leverage central staff and infrastructure
  - No need to sacrifice a grad student to be a sysadmin!
6. Community Clusters
- Macbeth
  - 126 nodes, dual Opteron (1 Tflops)
  - 1.8 GHz
  - 4-16 GB RAM
  - InfiniBand, with GigE for IP traffic
  - 7 owners (ME, Biology, HEP Theory)
- Lear
  - 512 nodes, dual 64-bit Xeon (6.4 Tflops)
  - 3.2 GHz
  - 4 GB and 6 GB RAM
  - GigE
  - 6 owners (EE x2, CMS, Provost, VPR, TeraGrid)
- Hamlet
  - 308 nodes, dual Xeon (3.6 Tflops)
  - 3.06 GHz to 3.2 GHz
  - 2 GB and 4 GB RAM
  - GigE, InfiniBand
  - 5 owners (EAS, BIO x2, CMS, EE)
7. Community Clusters
- Primarily scheduled with PBS
  - Contributing researchers are assigned a queue that can run as many slots as they have contributed (a hypothetical queue definition is sketched after this list).
- Condor co-schedules alongside PBS
  - When PBS is not running a job, a node is fair game for Condor!
  - But Condor work is subject to preemption if PBS assigns work to the node.
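- As a hedged illustration of the queue-per-owner model (not RCAC's actual configuration), a contributing group's queue under PBS/Torque could be defined roughly as follows; the queue name, group, and node count are hypothetical:

  # hypothetical 64-node allocation for a contributing group
  qmgr -c "create queue biogroup queue_type=execution"
  qmgr -c "set queue biogroup resources_max.nodect = 64"
  qmgr -c "set queue biogroup acl_group_enable = true"
  qmgr -c "set queue biogroup acl_groups = biogroup"
  qmgr -c "set queue biogroup enabled = true"
  qmgr -c "set queue biogroup started = true"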
8. Condor on Community Clusters
- All in all, Condor joins together 4 clusters (2,500 CPUs) within RCAC.
9. Grids at Purdue - Campus
- The instructional computing group manages a 1300-node Windows Condor pool to support instruction.
  - Mostly used by computer graphics classes for rendering animations: Maya, etc.
- Work in progress to connect the Windows pool with the RCAC pools.
10. Grids at Purdue - Campus
- Condor pools around campus (a flocking configuration sketch follows this list)
  - Physics department: 100 nodes, flocked
  - Envision Center: 48 nodes, flocked
- Potential collaborations
  - Libraries: 200 nodes on Windows terminals
  - Colleges of Engineering: 400 nodes in an existing pool
  - Or any department interested in sharing cycles!
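- Flocking between departmental pools and RCAC is plain Condor configuration; a minimal sketch, with hypothetical host names, might look like this:

  # condor_config on a departmental submit machine: send overflow jobs to RCAC
  FLOCK_TO = condor.rcac.purdue.edu
  # condor_config on the RCAC central manager: accept flocked jobs from those pools
  FLOCK_FROM = condor.physics.purdue.edu, condor.envision.purdue.edu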
11. Grids at Purdue - Regional
- Northwest Indiana Computational Grid (NWICG)
  - Purdue West Lafayette
  - Purdue Calumet
  - Notre Dame
  - Argonne National Laboratory
- Condor pools are available to NWICG today.
- Partnership with OSG?
12. Open Science Grid
- Purdue is active in the Open Science Grid
  - CMS Tier-2 Center
  - NanoHUB
  - OSG/TeraGrid interoperability
- Campus Condor pools are accessible to OSG
- Condor is used for access to extra, non-dedicated cycles for CMS and is becoming the preferred interface for non-CMS VOs.
13. CMS Tier-2 - Condor
- MC production from UW-HEP ran this spring on the RCAC Condor pools.
  - Processed roughly 23% of the entire production.
  - High rates of preemption, but that's expected!
- 2006 will see the addition of dedicated Condor worker nodes to the Tier-2, in addition to the PBS clusters.
- Condor running on resilient dCache nodes.
14. NanoHUB
- [Diagram: nanoHUB VO architecture - science gateway, research apps, grid middleware, VM workspaces, campus grids (Purdue, GLOW), virtual backends (virtual cluster with VIOLIN), capacity and capability computing]
15. TeraGrid
- TeraGrid Resource Provider
- Resources offered to TeraGrid:
  - Lear cluster
  - Condor pools
  - Data collections
16. TeraGrid
- Two current projects are active in the Condor pools via TeraGrid allocations (a Condor-G submit sketch follows this list):
  - Database of Hypothetical Zeolite Structures
  - CDF Electroweak MC Simulation
    - Condor-G glide-in
    - Great exercise in OSG/TG interoperability
- Identifying other potential users
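- A minimal Condor-G submit description for this kind of work might look like the sketch below; the gatekeeper host and executable names are hypothetical, not the projects' actual setup:

  universe      = grid
  grid_resource = gt2 tg-gatekeeper.rcac.purdue.edu/jobmanager-pbs
  executable    = zeolite_search
  arguments     = structure_0001.in
  output        = job_0001.out
  error         = job_0001.err
  log           = job_0001.log
  queue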
17. TeraGrid
- TeraDRE: distributed rendering on the TeraGrid
  - Globus, Condor, and IBRIX FusionFS enable Purdue's TeraGrid site to serve as a render farm (a sample render submission is sketched after this list)
  - Maya and other renderers available
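- A frame-per-job render run can be expressed as an ordinary Condor submit file along these lines; the Maya install path, scene file, and frame count are hypothetical, and TeraDRE's real front end is a portal rather than hand-written submit files:

  universe   = vanilla
  executable = /usr/aw/maya/bin/Render
  arguments  = -s $(Process) -e $(Process) scene.mb
  transfer_input_files = scene.mb
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  output = frame_$(Process).out
  error  = frame_$(Process).err
  log    = render.log
  queue 100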
18. Grid Interoperability
- [Diagram: grid interoperability on the Lear cluster]
19. Grid Interoperability
- Tier-2 to Tier-2 connectivity via a dedicated TeraGrid WAN (UCSD -> Purdue)
- Aggregating resources at a low level makes interoperability easier!
- OSG stack available to TG users and vice versa (illustrated below)
- Bouncer: a Globus job forwarder
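- From a user's point of view, this cross-availability means standard Globus clients work against either stack; a hedged example with a hypothetical gatekeeper name:

  # obtain a Globus proxy, then run a trivial job through the PBS jobmanager
  grid-proxy-init
  globus-job-run tg-gatekeeper.rcac.purdue.edu/jobmanager-pbs /bin/hostname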
20. Future of Condor at Purdue
- Add resources
  - Continue growth around campus
    - RCAC
    - Other departments
- Add Condor capabilities to resources
  - TeraGrid data portal adding on-demand processing with Condor now
- Federation
  - Aggregate Condor pools with other institutions?
21. Condor at Purdue
22. PBS/Condor Interaction
- PBS prologue: prevent new Condor jobs and push any existing ones off

  /opt/condor/bin/condor_config_val -rset -startd \
      PBSRunning=True > /dev/null
  /opt/condor/sbin/condor_reconfig -startd > /dev/null
  if ( condor_status -claimed -direct `hostname` 2>/dev/null | \
       grep -q Machines ); then
      condor_vacate > /dev/null
      sleep 5
  fi
23. PBS/Condor Interaction
- PBS epilogue: allow Condor jobs on the node again

  /opt/condor/bin/condor_config_val -rset -startd \
      PBSRunning=False > /dev/null
  /opt/condor/sbin/condor_reconfig -startd > /dev/null

- Condor START expression in condor_config.local (a user-side submit sketch follows)

  PBSRunning = False
  # Only start jobs if PBS is not currently running a job
  PURDUE_RCAC_START_NOPBS = ( $(PBSRunning) == False )
  START = ($(START)) && ($(PURDUE_RCAC_START_NOPBS))
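- Since PBS can preempt Condor work at any time, jobs sent to these nodes should be safe to kill and re-run; a minimal, hypothetical vanilla-universe submit file for such work:

  universe   = vanilla
  executable = simulate
  arguments  = input_$(Process).dat
  # community-cluster nodes vacate Condor jobs whenever PBS claims them,
  # so each task must be idempotent or checkpoint itself
  output = sim_$(Process).out
  error  = sim_$(Process).err
  log    = sim.log
  queue 50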