Title: Cluster Resources Training
1. Cluster Resources Training
- February 2006
2. Presentation Protocols
- For problems or questions, send email to training@clusterresources.com
- We will pause for questions at the end of each section
- Please remain on mute except for questions
- Please do not put your call on hold (the entire group will hear your music)
- Please be considerate of the other people attending the conference
- You can also submit questions during the training to the AOL Instant Messenger screen name "CRI Web Training"
3. Session 2
- 6. Reporting and Monitoring
- 7. Grids
- 8. Utility Computing
- 9. TORQUE
- 10. Future
4. Section 6: Accounting and Statistics
- Job and System Statistics
- Event Log
- Fairshare Stats
- Client Statistic Reports
- Realtime and Historical Charts with Moab Cluster Manager
- Native Resource Manager
- GMetrics
- GEvents
5. Accounting Overview
- Job and Reservation Accounting
- Resource Accounting
- Credential Accounting

# moab.cfg
USERCFG[DEFAULT] ENABLEPROFILING=TRUE
6. Job and System Statistics
- Determining cumulative cluster performance over a fixed timeframe
- Graphing changes in cluster utilization and responsiveness over time
- Identifying which compute resources are most heavily used
- Charting resource usage distribution among users, groups, projects, and classes
- Determining allocated resources, responsiveness, and/or failure conditions for jobs completed in the past
- Providing real-time statistics updates to external accounting systems (a command-line sketch follows)
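A quick way to review these cumulative statistics from the command line is Moab's showstats client. The flags below are common ones; check the documentation of your installed version for the full set.

# summarize overall scheduler and cluster statistics
> showstats

# per-user and per-group usage summaries
> showstats -u
> showstats -g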
7. Event Log
- Reports trace, state, and utilization records at events:
  - Scheduler start, stop, and failure
  - Job create, start, end, cancel, migrate, failure
  - Reservation create, start, stop, failure
- Configurable with RECORDEVENTLIST (see the sketch below)
- Can be exported to external systems
- http://clusterresources.com/moabdocs/a.fparameters.shtml#recordeventlist
- http://clusterresources.com/moabdocs/14.2logging.shtml#logevent
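A minimal sketch of enabling event records in moab.cfg; the event-type names shown are illustrative and should be checked against the RECORDEVENTLIST documentation linked above.

# moab.cfg (illustrative)
RECORDEVENTLIST JOBSTART,JOBCOMPLETE,JOBCANCEL,SCHEDSTART,SCHEDSTOP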
8. Fairshare Stats
- Provides credential-based usage distributions over time
- mdiag -f
- Maintained for all credentials
- Stored in stats/FS.<EPOCHTIME>
- Shows detailed time-distributed usage by fairshare metric (a configuration sketch follows)
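For context, a minimal fairshare configuration sketch in moab.cfg; the parameter values here are illustrative defaults, not recommendations.

# moab.cfg (illustrative)
FSPOLICY      DEDICATEDPS     # charge by dedicated processor-seconds
FSDEPTH       7               # number of fairshare windows to consider
FSINTERVAL    24:00:00        # length of each fairshare window
FSDECAY       0.80            # decay factor applied to older windows
USERCFG[DEFAULT] FSTARGET=10.0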
9. Client Statistic Reports
- In-memory reports available for nodes and credentials
- Node categorization allows fine-grained, localized usage tracking
10. Realtime and Historical Charts with Moab Cluster Manager
- Reports on nodes and all credentials
- Allows arbitrary querying of historical timeframes with arbitrary correlations
11. Service Monitoring and Management
12. Real-Time Performance Accounting Analysis
13. Search for Specific Jobs, Track Job Status
14. View and Manage Fairshare Settings
15. Manage Priority Across the Largest Set of Options
16. View the Cluster by Virtually Any Attribute
17. Improving Resource Monitoring and Reporting
- Native Resource Manager Interface
- Generic Features/Consumable Resources
- Generic Metrics
- Generic Events
18. Native Resource Manager Interface
- Everything you've ever wanted to do with Moab -- an interface that allows sites to replace or augment their existing resource managers with information from sources such as:
  - Arbitrary scripts
  - Ganglia
  - FlexLM
  - MySQL
- Example usage follows on the next slide
- http://clusterresources.com/moabdocs/13.5nativerm.shtml
- http://clusterresources.com/moabdocs/13.7licensemanagement.shtml
19. Native Resource Manager Example

# moab.cfg
# interface w/ TORQUE
RMCFG[torque]  TYPE=PBS

# interface w/ FlexLM
RMCFG[flexLM]  TYPE=NATIVE RTYPE=license
RMCFG[flexLM]  CLUSTERQUERYURL=exec://$HOME/tools/license.mon.flexlm.pl

# integrate local node health check script data
RMCFG[local]   TYPE=NATIVE
RMCFG[local]   CLUSTERQUERYURL=file:///opt/moab/localtools/healthcheck.dat
20. Generic Features / Consumable Resources
- Node Features
  - Opaque string tags associated with compute resources
  - Can be requested by jobs, reservations, etc.
- Generic Consumable Resources
  - Can specify any arbitrary consumable resource
  - Can be requested by jobs, reservations, etc. (a submission sketch follows)
  - Reserved by advance reservations

# moab.cfg
NODECFG[node1]   FEATURES=fast,bigmem
NODECFG[node2]   FEATURES=bigmem
NODECFG[DEFAULT] GRES=matlab:4

- http://clusterresources.com/moabdocs/12.2nodeattributes.shtml#features
- http://clusterresources.com/moabdocs/12.4consumablegres.shtml
- http://clusterresources.com/moabdocs/9.2accounting.shtml#gevents
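As a hedged illustration of the job-side request: with TORQUE/PBS as the resource manager, Moab's resource-manager-extension syntax can carry a generic-resource request that matches the GRES name configured above. Exact syntax (and count notation) varies by version, so treat this as a sketch.

# request the matlab generic resource via the RM extension
> qsub -l nodes=1,walltime=1:00:00 -W x=GRES:matlab job.sh

# or via msub
> msub -l nodes=1,walltime=1:00:00,gres=matlab job.sh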
21. Generic Metrics
- Moab allows organizations to enable generic performance metrics. These metrics allow decisions to be made and reports to be generated based on site-specific environmental factors. This increases Moab's awareness of what is occurring within a given cluster environment and allows arbitrary information to be associated with resources and workload within the cluster. Uses of these metrics are widespread and can cover anything from tracking node temperature, to memory faults, to application effectiveness.
- Execute triggers when specified thresholds are reached
- Modify node allocation affinity for specific jobs
- Initiate automated notifications when thresholds are reached
- Display current, average, maximum, and minimum metric values in reports and charts within Moab Cluster Manager
- http://clusterresources.com/moabdocs/9.2accounting.shtml#gmetrics
22. Generic Metric Example

# moab.cfg
RMCFG[native] TYPE=NATIVE
RMCFG[native] CLUSTERQUERYURL=file://$HOME/tools/temp.txt
NODECFG[n01]  TRIGGER=atype=exec,action="/bin/drain.pl $OID",etype=threshold,threshold=GMetric[temp]>150

# example temp.txt (temperature output)
node001 GMETRIC[temp]=113
node002 GMETRIC[temp]=107
node003 GMETRIC[temp]=83
node004 GMETRIC[temp]=85
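The temp.txt file above has to be produced by something site-specific. A minimal shell sketch that could generate it is shown below; the node list and the temperature-collection command are entirely hypothetical and would be replaced by local tooling (ipmitool, lm-sensors, Ganglia, etc.).

#!/bin/sh
# Hypothetical sketch: collect a per-node temperature and write the
# cluster-query file that Moab reads (path matches the config above).
OUT=$HOME/tools/temp.txt
: > "$OUT"                                   # truncate the previous report
for NODE in node001 node002 node003 node004; do
  # placeholder collection command; adapt to local monitoring tools
  TEMP=$(ssh "$NODE" "cat /tmp/node_temp 2>/dev/null" 2>/dev/null)
  [ -n "$TEMP" ] && echo "$NODE GMETRIC[temp]=$TEMP" >> "$OUT"
done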
23. Generic Events
- Generic Events
  - Can report arbitrary events and failures
  - Can associate human-readable messages with an event
  - Events viewable via Moab clients

# moab.cfg
RMCFG[native] TYPE=NATIVE
RMCFG[native] CLUSTERQUERYURL=file://$HOME/tools/healthcheck.txt
GEVENTCFG[diskfull]   ACTION=notify,exec:/opt/moab/nodepurge.sh
GEVENTCFG[power]      ACTION=avoid,record,notify
GEVENTCFG[cpufailure] ACTION=reserve,disable,record,notify

# example healthcheck.txt
node017 GEVENT[cpufailure]='CPU2 Down'
node135 GEVENT[diskfull]='/var/tmp Full'
node139 GEVENT[diskfull]='/home Full'
node407 GEVENT[power]='Transient Power Supply Failure'

- http://clusterresources.com/moabdocs/9.2accounting.shtml#gmetrics
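As with the metric file, healthcheck.txt would be produced by a site-specific probe. A minimal hypothetical sketch for a single host's disk check (thresholds and filesystems are assumptions):

#!/bin/sh
# Hypothetical sketch: report a GEVENT when a filesystem crosses 95% usage.
OUT=$HOME/tools/healthcheck.txt
: > "$OUT"
for FS in /var/tmp /home; do
  PCT=$(df -P "$FS" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$PCT" -ge 95 ]; then
    echo "$(hostname) GEVENT[diskfull]='$FS Full'" >> "$OUT"
  fi
done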
25. Section 7: Peer to Peer (Grids)
- Cluster Stack / Framework
- Moab P2P Grid
- Peer Configuration
- Resource Control Overview
- Data Management
- Security
http://clusterresources.com/moabdocs/17.0peertopeer.shtml
26. Section 7 (cont.)
- Peer Flow
- Resource Affinity
- Management
- Grids and Globus GRAM
- Grid Troubleshooting
- Data Staging
- Information Services
27. Cluster Stack / Framework

[Diagram: cluster stack layers -- a Grid Workload Manager (scheduler, policy manager, integration platform) above a Cluster Workload Manager (scheduler, policy manager, integration platform), sitting over the Resource Manager; serial, parallel, and message-passing applications; portal, GUI, and CLI access points for users and admins; security; the operating system; and the hardware (cluster or SMP).]
28Grid Types
Local Area Grid (LAG)
Wide Area Grid (WAG)
A Local Area Grid uses one instance of Moab
within an environment that shares a user and data
space across multiple clusters, that may or may
not have multiple hardware types, operating
systems and compute resource managers (e.g.
LoadLeveler, TORQUE, LSF, PBS Pro, etc.)
A Wide Area Grid uses multiple Moab instances
working together within an environment that can
have multiple user and data spaces across
multiple clusters, that may or may not have
multiple hardware types, operating systems and
compute resource managers (e.g. LoadLeveler,
TORQUE, LSF, PBS Pro, etc.). Wide Area Grid
management rules can be centralized, locally
controlled or mixed.
[Diagram: a Local Area Grid shows a single Moab (Master) over Clusters A, B, and C with a shared user space and shared data space; a Wide Area Grid shows separate Moab instances on Clusters A, B, and C with multiple user and data spaces.]

Grid Management Scenarios

[Diagram: three management models -- Centralized Management (a Moab grid head node holding all grid rules over per-cluster Moab instances), Centralized + Local Management (a grid head node with shared grid rules plus local grid rules at each cluster), and Local Management / Peer to Peer (per-cluster Moab instances on Clusters A, B, and C, each with its own local grid rules).]
29Grid Benefits
- Scalability
- Resource Access
- Load-Balancing
- Single System Image (SSI)
- High Availability
30. Drawbacks of a Layered Approach
- Stability
  - Additional failure layer
  - Centralized grid management (single point of failure)
- Optimization
  - Limited local information and control
- Admin Experience
  - Additional tool to learn/configure
  - Policy duplication and conflicts
  - Additional tool to manage/troubleshoot
- User Experience
  - Additional submission language/environment
  - Additional tool to track and manage workload
- http://clusterresources.com/moabdocs/17.12p2pgrid.shtml
31Moab P2P Approach
- Little to no user training
- Little to no admin training
- Single Policy set
- Transparent Grid
http://clusterresources.com/moabdocs/17.0peertopeer.shtml
32Integrated Moab P2P/Grid Capabilities
- Distributed Resource Management
- Distributed Job Management
- Grid Information Management
- Resource and Job Views
- Credential Management and Mapping
- Distributed Accounting
- Data Management
33. Grid Relationship Combinations

Moab is able to facilitate virtually any grid relationship:
1. Join local area grids into wide area grids
2. Join wide area grids to other wide area grids (whether they are managed centrally, locally/peer-to-peer, or mixed)
3. Share resources in one direction, for use with hosting centers or to bill out resources to other sites
4. Build multiple levels of grid relationships (e.g. conglomerates within conglomerates within conglomerates)
[Diagram: the four relationship types annotated on one example topology -- local area grids with a shared user and data space (Clusters A, B, C) joined into wide area grids; wide area grids joined to other wide area grids through a Moab grid head node with shared grid rules across multiple user and data spaces; one-directional sharing with a hosting site; and nested grid relationships spanning Clusters D through H, each running Moab with local grid rules.]
34. Basic P2P Example

# moab.cfg for Cluster A
SCHEDCFG[ClusterA]
RMCFG[ClusterB]         TYPE=MOAB SERVER=node03:41000
RMCFG[ClusterB.INBOUND] FLAGS=CLIENT CLIENT=ClusterB

# moab-private.cfg for Cluster A
CLIENTCFG[RM:ClusterB]  KEY=fetwl02 AUTH=admin1

# moab.cfg for Cluster B
SCHEDCFG[ClusterB]
RMCFG[ClusterA]         TYPE=MOAB SERVER=node01:41000
RMCFG[ClusterA.INBOUND] FLAGS=CLIENT CLIENT=ClusterA

# moab-private.cfg for Cluster B
CLIENTCFG[RM:ClusterA]  KEY=fetwl02 AUTH=admin1
35Peer Configuration
- Resource Reporting
- Credential Config
- Data Config
- Usage Limits
- Bi-Directional Job Flow
# moab.cfg (server 1)
SCHEDCFG[server1]   SERVER=server1.omc.com:42005 MODE=NORMAL
RMCFG[server2-out]  TYPE=MOAB SERVER=server2.omc.com:42005 CLIENT=server2
RMCFG[server2-in]   FLAGS=client CLIENT=server2

# moab-private.cfg (server 1)
CLIENTCFG[server2]  KEY=443db-writ4
36. Jobs
- Submitting Jobs to the Grid
  - msub (see the sketch below)
  - Takes the resource manager's submission language and translates it to msub
- Viewing Node and Job Information
  - Each destination Moab server reports all compute nodes it finds back to the source Moab server
  - Nodes show as local nodes, each within a partition associated with the resource manager reporting them
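A minimal msub submission sketch; the resource requests are standard msub options, and the job script name is hypothetical.

# submit a 2-node, 10-minute job to the grid through the local Moab peer
> msub -l nodes=2,walltime=00:10:00 myjob.sh

# or pipe a command script on stdin
> echo "sleep 60" | msub -l nodes=1,walltime=00:02:00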
37. Resource Control Overview
- Full resource information
  - Nodes appear with complete remote hostnames and full attribute information
- Remapped resource information
  - Nodes appear with remapped local hostnames and full attribute information
- Grid mode
  - Information regarding nodes reported from a remote peer is aggregated and transformed into one or more SMP-like large pseudo-nodes
38. Controlling Resource Information
- Direct
  - Nodes are reported to remote clusters exactly as they appear in the local cluster
- Mapped
  - Nodes are reported as individual nodes, but node names are mapped to a unique name when imported into the remote cluster
- Grid
  - Node information is aggregated into a single large SMP-like pseudo-node before it is reported to the remote cluster
39Grid Sandbox
- Constrains external resource access and limits
which resources are reported to other peers
# moab.cfg
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03
SRCFG[sandbox1] CLUSTERLIST=ALL FLAGS=ALLOWGRID
40Access Controls
- Granting Access to Local Jobs
- Peer Access Control
# moab.cfg
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node04,node05,node06
SRCFG[sandbox2] FLAGS=ALLOWGRID QOSLIST=high GROUPLIST=engineer

# moab.cfg (Cluster 1)
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03,node04,node05
SRCFG[sandbox1] FLAGS=ALLOWGRID CLUSTERLIST=ClusterB
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node6 FLAGS=ALLOWGRID
SRCFG[sandbox2] CLUSTERLIST=ClusterB,ClusterC,ClusterD USERLIST=ALL
41. Controlling Peer Workload Information
- Local workload exporting
  - Helps simplify administration of different clusters by centralizing monitoring and management of jobs at one peer, without forcing each peer to the type SLAVE

# moab.cfg (ClusterB - destination peer)
RMCFG[ClusterA] FLAGS=CLIENT,LOCALWORKLOADEXPORT   # source peer
42Data Management Configuration
- Global file systems
- Replicated data servers
- Need based direct input
- Output data migration
# moab.cfg (NFS data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.nfs.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.nfs.pl

# moab.cfg (SCP data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.scp.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.scp.pl
43. Security
- Secret-key based security is enabled via the moab-private.cfg file
- Globus credential-based server authentication (4.2.4)
44Credential Management
- Peer Credential Mapping
- Source and Destination Side Credential Mapping
# moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1]     OMAP=file:///opt/moab/omap.dat

# /opt/moab/omap.dat (source object map file)
user:joe,jsmith
user:steve,sjohnson
group:test,staff
class:batch,serial
user:*,grid
45. Credential Management (cont.)
- Preventing User Space Collisions

# moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1]     OMAP=file:///opt/moab/omap.dat FLAGS=client

# /opt/moab/omap.dat (source object map file)
user:*,c1_*
group:*,grid
account:*,temp_*

- Interfacing with Globus GRAM

# moab.cfg
SCHEDCFG[c1] SERVER=head.c1.hpc.org
RMCFG[c2]    SERVER=head.c2.hpc.org TYPE=moab JOBSTAGEMETHOD=globus
46. Credential Management (cont.)
- Limiting Access To Peers
- Limiting Access From Peers

# moab.cfg
SCHEDCFG[c1] SERVER=c1.hpc.org
# only allow staff, or members of the research and demo accounts, to use remote resources on c2
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo

# moab.cfg
SCHEDCFG[c1] SERVER=c1.hpc.org FLAGS=client
# only allow jobs from remote cluster c1 with group credential staff, or account research or demo, to use local resources
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo
47. Utilizing Multiple Resource Managers
- Migrate jobs between resource managers
- Aggregate information into a cohesive node view

# moab.cfg
RESOURCELIST    node01,node02 ...
RMCFG[base]     TYPE=PBS
RMCFG[network]  TYPE=NATIVE:AGFULL
RMCFG[network]  CLUSTERQUERYURL=/tmp/network.sh
RMCFG[fs]       TYPE=NATIVE:AGFULL
RMCFG[fs]       CLUSTERQUERYURL=/tmp/fs.sh

# sample network script
#!/bin/sh
_RX=`/sbin/ifconfig eth0 | grep "RX by" | cut -d: -f2 | cut -d' ' -f1`
_TX=`/sbin/ifconfig eth0 | grep "TX by" | cut -d: -f3 | cut -d' ' -f1`
echo `hostname` NETUSAGE=`echo "$_RX + $_TX" | bc`
48. P2P Resource Affinity
- Certain compute architectures are able to execute certain compute jobs more effectively than others
- From a given location, staging jobs to various clusters may require more expensive allocations, more data and network resources, and more use of system services
- Certain compute resources are owned by external organizations and should be utilized sparingly
- Moab allows the use of peer resource affinity to guide jobs to the clusters that make the best fit according to a number of criteria
49. Management and Troubleshooting
- Peer Management Overview
  - Use 'mdiag -R' to view interface health and performance/usage statistics
  - Use 'mrmctl' to enable/disable peer interfaces
  - Use 'mrmctl -m' to dynamically modify/configure peer interfaces
- Peer Troubleshooting Overview
  - Use 'mdiag -R' to diagnose general RM interfaces
  - Use 'mdiag -S' to diagnose general scheduler health
  - Use 'mdiag -R <RMID> --flags=submit-check' to diagnose peer-to-peer job migration
50. Sovereignty: Local vs. Centralized Management Policies

The local admin can apply policies to manage:
1. Local user access to local cluster resources
2. Local user access to grid resources
3. Outside grid user access to local cluster resources (general or specific policies)

The grid administration body can apply policies to manage:
1. General grid policies (sharing, priority, limits, etc.)

Users can submit to either:
- The local cluster
- Specified cluster(s) in the grid
- Generically to the grid

[Diagram: each admin manages their own cluster; the local admin manages Local Cluster A resources, a portion of which is allocated to the grid; the grid administration body manages the grid-allocated resources; local users and outside grid users submit against the corresponding portions.]
51. Grids and Globus
- Globus authentication credentials are used to determine trust between Moab peers and/or grid users
- Trusted peers/users may use the Globus GRAM service to submit grid workload
- Trusted peers/users may use the Globus GridFTP service to stage data
52Advanced P2P Example
53Data Staging
- Data Staging
- Data Staging Models
- Interface Scripts for a Storage Resource Manager
54. Data Staging
- Manages intra-cluster and inter-cluster job data staging requirements so as to minimize resource inefficiencies and maximize system utilization
- Prevents the loss of compute resources due to data blocking and can significantly improve cluster performance
55. Data Management: Increasing Efficiency

Data staging levels of efficiency and control:
0. No data staging.
1. Non-Verified Data Staging is the traditional use of data staging, where CPU requests and data staging requests are not coordinated, leaving the CPU request to cause blocking on the compute node when the data is not available to process.
2. Verified Data Staging adds the intelligence to have the workload manager verify that the data has arrived at the needed location prior to launching the job, in order to avoid workload blocking.
3. Prioritized Data Staging uses the capabilities of Verified Data Staging, but adds the ability to intercept data staging requests and submit them in an order of priority that matches that of the corresponding jobs.
4. Fully Scheduled Data Staging uses all of the capabilities of Prioritized Data Staging, but adds the ability to estimate staging periods, thus allowing workload to be scheduled more intelligently around data staging conditions. Unlike the others, which apply only to external storage, this capability can be applied to both external and internal storage scenarios.

[Diagram: a ladder from level 0 (No Data Staging, traditional) through Non-Verified, Verified, and Prioritized Data Staging up to level 4 (Fully Scheduled, optimized data staging).]
56. Optimized Data Staging
- Automatically pre-stages input data and stages back output data with event policies
- Coordinates data stage time with compute resource allocation
- Uses GASS, GridFTP, and scp for data management
- Reserves network resources to guarantee data staging and inter-process communication

[Diagram: with the traditional, inefficient method, the CPU reservation spans prestage, processing, and stage-back, so compute resources are wasted/blocked during data staging; with optimized data staging, the CPU reservation covers only processing, and compute resources are available to other workload during data staging.]
57. Efficiencies from Optimized Data Staging

[Diagram: timeline comparison starting from the same processor start time -- with the traditional, inefficient method each reservation holds the CPUs through prestage, processing, and stage-back; with intelligent, event-based data staging the prestage and stage-back phases are driven by events outside the CPU reservations, completing 7.5 jobs in the same window with efficient use of both CPU and network.]
58Data Staging Models
- Verified Data Staging
- Prioritized Data Staging
- Fully-Scheduled Data Staging
- Data Staging to Allocated Nodes
Attribute        Description
TYPE             must be NATIVE in all cases
RESOURCETYPE     must be set to STORAGE in all cases
SYSTEMQUERYURL   specifies the method of determining file attributes such as size, ownership, etc.
CLUSTERQUERYURL  specifies the method of determining current and configured storage manager resources such as available disk space, etc.
SYSTEMMODIFYURL  specifies the method of initiating file creation, file deletion, and data migration
59. Verified Data Staging

Verified Data Staging (start the job only after the file is verified to be in the right location). Used to prevent job blocking caused by jobs whose data has not finished staging, when all data staging is controlled via external data managers and no methods exist to control what is staged or in what order.

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs are communicated to a data manager mechanism (HSM manager, staging tool, script, command, etc.). Job consideration requests are sent to Moab in order to decide how and when to run.
2. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is there yet.
3. The data manager moves the data to the desired location when it is able.
4. Moab verifies that the file is there, then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes.
Drawbacks: no job-centric prioritization takes place in the order in which data gets staged.

[Diagram: job submission flows to Moab on Cluster A (local grid rules); the data manager and storage system sit alongside, with the numbered steps annotated.]
60. Prioritized Data Staging

Prioritized Data Staging (priority ordering of data staging). Used when Moab intercepts data staging requests and submits them through a data manager in priority order.

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates priority, reservations, and other factors, and then submits data staging requests to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the best order to match established policies.
3. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is there yet.
4. The data manager moves the data to the desired location when it is able.
5. Moab verifies that the file is there, then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests.
Drawbacks: prioritization is only softly provided; insufficient information for informed CPU reservations to take place.

[Diagram: job submission flows to Moab on Cluster A; Moab submits "priority jobs first" to the data manager, which stages into the storage system, with the numbered steps annotated.]
61. Fully Scheduled Data Staging: External Storage

Fully Scheduled Data Staging (priority ordering of data staging plus data-staging-centric scheduling). Used when Moab intercepts data staging requests to manage staging order and reserves CPU and other resources based on estimates of data staging periods.

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates data size and network speeds to estimate data staging duration, then uses this estimate to manage submission of data staging requests and reservations of CPUs and other resources.
3. Moab evaluates priority, reservations, and other factors, and then submits data staging requests to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the best order to match established policies.
4. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is there yet.
5. The data manager moves the data to the desired location when it is able.
6. Moab verifies that the file is there, then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently schedules resources based on data staging information.
Drawbacks: prioritization is only softly provided.

[Diagram: job submission flows to Moab on Cluster A; Moab estimates staging time, submits "priority jobs first" to the data manager, which stages into the external storage system, with the numbered steps annotated.]
62. Fully Scheduled Data Staging: Local Storage

The flow is the same as Fully Scheduled Data Staging with external storage (previous slide), except that the storage is on the local compute nodes rather than an external storage system.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently reserves resources based on data staging information.
Drawbacks: prioritization is only softly provided.

[Diagram: job submission flows to Moab on Cluster A; Moab submits "priority jobs first" to the data manager, which stages data directly to storage on the local compute nodes, with the numbered steps annotated.]
63. Data Staging Diagnostics
- checkjob
  - Stage type: input or output
  - File name: reports destination file only
  - Status: pending, active, or complete
  - File size: size of file to transfer
  - Data transferred: for active transfers, reports the number of bytes already transferred
- checknode
  - Active and max storage manager data staging operations
  - Dedicated and max storage manager disk usage
  - File name: reports destination file only
  - Status: pending, active, or complete
  - File size: size of file to transfer
  - Data transferred: for active transfers, reports the number of bytes already transferred
64. Interface Scripts for a Storage Resource Manager
- Moab's data staging capabilities can utilize up to three different native resource manager interfaces:
  - Cluster Query Interface
  - System Query Interface
  - System Modify Interface
65. Prioritized Data Staging Example

# moab.cfg
RMCFG[data] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec:///opt/moab/tools/dstage.systemquery.pl
RMCFG[data] CLUSTERQUERYURL=exec:///opt/moab/tools/dstage.clusterquery.pl
RMCFG[data] SYSTEMMODIFYURL=exec:///opt/moab/tools/dstage.systemmodify.pl
66. Information Services
- Monitoring performance statistics of multiple independent clusters
- Detecting and diagnosing failures from geographically distributed clusters
- Tracking cluster, storage, network, service, and application resources
- Generating load-balancing and resource state information for users and middleware services
67. Cluster Resources Training
- We will reconvene at 1:00 pm EST
69. Section 8: Utility Computing
- Utility Computing Overview
- Configuration
- Resource Monitoring
- Virtual Private Clusters
- Resource Access
- Accounting
- Setting Up a Test Center
70. The Utility Computing Vision
- Customer Point of View
  - Creating a compute resource which dynamically grows and shrinks with load on demand (with or without local resources)
  - Creating a compute resource which can acquire specialized resources as needed
  - Creating a compute resource which automatically replaces failed resources, whether they be compute nodes, network, storage, or other components
- Provider Point of View
  - Creating a compute resource which dynamically customizes itself to user needs
  - Creating a compute resource which can guarantee service levels
  - Creating a compute resource with tight integration and transparent usage
- http://www.clusterresources.com/products/mwm/moabdocs/19.0utilitycomputing.shtml
71. What is Utility Computing?
- Allows an organization to provide custom-tailored resources or services to customers
- A hosting center requires one or more of the following:
  - Secure remote access
  - Guaranteed resource availability at a fixed time or series of times
  - Integrated auditing/accounting/billing services
  - Tiered service level (QoS/SLA) based resource access
  - Dynamic compute node provisioning
  - Full environment management over compute, network, storage, and application/service based resources
  - Intelligent workload optimization
  - High availability, failure recovery, and automated re-allocation
- http://www.clusterresources.com/products/mwm/moabdocs/19.0utilitycomputing.shtml
72Utility Computing
- Moab enables true utility computing by allowing
compute resources to be reserved, allocated, and
dynamically provisioned to meet the needs of
internal or external workload.
73. Usage Models
- Manual
  - As easy as going to a web site, specifying what is needed, selecting one of the available options, and logging in when the virtual cluster is activated
- Automatic
  - The user simply submits jobs to the local cluster and is never aware a hosting center exists
74Creating A Utility Computing Hosting Center
- Define Hosting Center Objectives
- Determine Customer Environment Needs
- Determine Resource Integration Methodology
- Determine Customer Service Agreement Needs
- Identify Resource Monitoring Requirements
- Identify Resource Provisioning Requirements
- Identify Complete Virtual Cluster Packages
75Initial Configuration
- Enable Resource Monitoring
- Enable Resource Provisioning
- Identify Initial Virtual Cluster Packages
76Advanced Configuration
- Identify Complete Virtual Cluster Packages
- Provide User Interface
- Enable Customer Registration
- Enable Self-Service Web Site
- Enable Email Notifications/Alerts
- Enable Service Policies
- Automate Customer Management
77. Resource Monitoring
- You need to configure Moab to be aware of what resources are available (a script sketch follows the sample output)

Sample output:
node001 STATE=Idle CPROC=2 CMEM=512
node002 STATE=Idle CPROC=2 CMEM=512
node010 STATE=Down CPROC=2 CMEM=1024
node011 STATE=Idle CPROC=2 CMEM=1024
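Output like this typically comes from a native resource manager cluster-query script. A minimal hypothetical sketch that emits the format shown above (node list, probe method, and attribute values are assumptions):

#!/bin/sh
# Hypothetical cluster-query sketch: report state, processor count, and
# memory (MB) per node in the format Moab's native interface expects.
for NODE in node001 node002 node010 node011; do
  if ping -c 1 -w 2 "$NODE" > /dev/null 2>&1; then
    STATE=Idle
  else
    STATE=Down
  fi
  echo "$NODE STATE=$STATE CPROC=2 CMEM=512"
done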
78. Provisioning Resources and Managing Dynamic Security
- Moab allocates the compute nodes before the specified timeframe, and also allocates the resources required to provision and customize the nodes
- Provisioned, customized, and secured
- To properly schedule a request, Moab makes certain that all needed resources are available at the appropriate times, whether used by the requester directly or indirectly, or used by the system to create, customize, or manage the resource
- Virtual private cluster
79. Virtual Private Cluster
- Configuring Virtual Cluster Profiles
  - VCPROFILE
  - Sample attributes: DESCRIPTION, NODESETLIST
- Requesting Resources
  - mshow -a
  - Can specify XML format
- Creating VPCs
  - -c vpc argument of mschedctl

# VPC creation
> mshow -a -p pkgA -w minprocs=4,duration=100000

Partition  Tasks  Nodes  Duration  StartOffset  StartDate
---------  -----  -----  --------  -----------  ---------
ALL            4      4    100000       000000  13:28:09_04/27  TID=4 ReqID=0
ALL            4      4    100000       100000  17:14:48_04/28  TID=5 ReqID=0
ALL            4      4    100000       200000  21:01:27_04/29  TID=6 ReqID=0

> mschedctl -c vpc -a resources=5 -a package=pkgA
vpc.721

- http://clusterresources.com/moabdocs/commands/mshowa.shtml
80. VPCs (cont.)
- Listing VPCs
  - mschedctl -l vpc
- Modifying VPCs
  - mschedctl -m vpc:<VPCID>
- Destroying VPCs
  - mschedctl -d vpc:<VPCID>
81. Service Level Agreements and QoS Guarantees
- Guaranteed level of resource delivery per unit time
- Dedicated resources during specified timeframes
- Guaranteed resource availability response time
82. Tight Integration with Customer Resources
- Integration between customer and utility computing batch system tools
- Customization of utility computing resources to provide a similar batch environment
  - Queues, node features, policies, node ownership, etc.
- Customization of utility computing resources to provide a similar execution environment
  - Operating system, applications, directory structure, environment variables, etc.
- Creation of compatible user and group credentials
- Automated job migration to the utility computing environment
- Automated data migration between the customer and the utility computing hosting center
- Automated export of utility computing job and resource status information
83. Enabling Manual Resource Access Requests
- Customers must register with the hosting service
  - Accomplished directly in Moab, or Moab can extract it from another database
- Provide a user interface, which:
  - may provide detailed information regarding resources which can be made available
  - may provide per-customer views of available resources, where each customer only sees resources available to him or to his class of service
  - may provide no general information regarding resource availability and instead only reply to explicit requests for resource availability
84. Automating Resource Access Requests
- Load-Based Allocation
  - Adjust available resources based on the nature and quantity of queued workload
- Time-Based Allocation
  - Specification of the period and timeframe for allocation of resources
- Failure-Based Allocation
  - A trigger launches provisioning of resources to handle the workload
85. Standby Resources
- Allows an organization to reflect its ability to dynamically allocate and provision utility resources to its end users, indicating both which resources can be allocated and how quickly they can be available
- Assists in better planning
- Provides a consistent picture of resource availability
86Accounting, Costing, and Automated Billing
- Automating Accounting
- Automating Billing
- Monitoring Customer Usage
- Evaluating Center Effectiveness
87Setting Up a Test Center
89. Section 9: TORQUE
- Installation
- Configuration
- Job Administration
- Cluster Administration
- Troubleshooting and Diagnostics
- Upgrading TORQUE Versions
- Integrating with Moab
90. Resource Manager Responsibilities
- Role of a Resource Manager
  - Provides a job queuing facility
  - Monitors resource configuration, utilization, and health
  - Provides remote job execution and job management facilities
  - Reports information to the cluster scheduler
  - Receives direction from the cluster scheduler
  - Handles user client requests
91. TORQUE Basics
- pbs_server
  - Manages the queue
  - Collects information from MOM daemons
  - Routes job management requests to the MOMs
  - Reports to the scheduler
  - Supports user client commands
- pbs_mom
  - Locally monitors individual compute hosts
  - Reports to the server
  - Performs low-level job management functions as directed
  - Coordinates activities across resources allocated to a parallel job
- http://clusterresources.com/torquedocs20
92. Installation
- Extract and build the distribution on the machine that will act as the TORQUE server

> tar -xzvf torqueXXX.tar.gz
> cd torqueXXX
> ./configure
> make
> make install

http://www.clusterresources.com/wiki/doku.php?id=torque1.1_installation
93. Compute Node Installation
- Create the self-extracting, distributable packages with 'make packages'
- Use the parallel shell command from your cluster management suite to copy and execute the package on all nodes (see the sketch below)
- Run pbs_mom on all compute nodes
- http://clusterresources.com/torquedocs20/a.ltorquequickstart.shtml
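A minimal sketch of that distribution step, assuming pdsh/pdcp as the parallel shell and a hypothetical node range; the generated package name follows the torque-package-mom-linux-<arch>.sh pattern and will vary by platform.

# build the self-extracting packages on the server
> make packages

# copy and install the MOM package on all compute nodes (pdsh tools and node range are assumptions)
> pdcp -w node[001-064] torque-package-mom-linux-x86_64.sh /tmp
> pdsh -w node[001-064] /tmp/torque-package-mom-linux-x86_64.sh --install

# start the MOM daemon everywhere
> pdsh -w node[001-064] /usr/local/sbin/pbs_mom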
94. Configuration
- Run torque.setup <USER>, or pbs_server -t create
- Compute nodes
  - The MOM must be configured to trust the pbs_server daemon (see the sketch below)
  - (TORQUECFG)/mom_priv/config
    - $pbsserver parameter
  - (TORQUECFG)/server_name
    - server hostname
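A minimal sketch of the two files, with a hypothetical head-node hostname:

# (TORQUECFG)/mom_priv/config -- tell the MOM which host to trust and report to
$pbsserver   headnode.cluster.org
$logevent    255

# (TORQUECFG)/server_name -- hostname used by client commands and the MOM
headnode.cluster.org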
95. Configuration (cont.)
- On the TORQUE server, append the list of newly configured compute nodes to (TORQUECFG)/server_priv/nodes

# server_priv/nodes
computenode001.cluster.org
computenode002.cluster.org
computenode003.cluster.org
96. Testing the Installation
- Shut down the server
  - qterm
- Start the server
  - pbs_server
- Verify all queues are properly configured
  - qstat -q
- View additional server configuration
  - qmgr -c 'p s'
- Verify all nodes are correctly reporting
  - pbsnodes -a
- Submit a basic job
  - echo "sleep 30" | qsub
- Verify the job displays
  - qstat
97Running Jobs in TORQUE
- Start the default TORQUE scheduler
- FIFO
- Or
- Integrate with an external scheduler
- Moab Workload Manager
- Maui Scheduler
98. Advanced Configuration
- Customizing the install
  - Most recommended configure options are selected by default; the few exceptions include --with-scp and possibly --enable-syslog
- Configuring job submission hosts
  - Use acl_hosts
  - Use torque.cfg (submithosts, allowcomputehosts) -- see the sketch below
- Configuring TORQUE on a multi-homed server
- Specifying non-root administrators

> qmgr
Qmgr: set server managers += josh@*.fsc.com
Qmgr: set server operators += josh@*.fsc.com
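A hedged sketch of the torque.cfg entries mentioned above. The parameter names follow the slide; hostnames are hypothetical, and both names and accepted values should be verified against your TORQUE version's torque.cfg documentation before use.

# torque.cfg (illustrative; verify parameter names for your version)
SUBMITHOSTS        login1.cluster.org login2.cluster.org
ALLOWCOMPUTEHOSTS  true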
99. Job Submission
- qsub
- Batch and interactive
- Requesting resources
- Examples
  - To ask for 2 processors on each of four nodes:
    - qsub -l nodes=4:ppn=2
  - The following job will wait until node01 is free with 200 MB of available memory:
    - qsub -l nodes=node01,mem=200mb /home/user/script.sh
- Directives can be embedded into the job script (example on next page)
- http://clusterresources.com/torquedocs20/commands/qsub.shtml
100. Example Job Script

#!/bin/sh
#PBS -N ds14FeedbackDefaults
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=2,walltime=240:00:00
#PBS -M user@mydomain.com
#PBS -m ae
source ~/.bashrc
cat $PBS_NODEFILE
echo $PBS_O_JOBID
101. Monitoring Jobs

> qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807             scatter          user01           12:56:34 R batch
102. Canceling Jobs

> qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807             scatter          user01           12:56:34 R batch
...
> qdel -m "hey! Stop abusing the NFS servers" 4807
103. Job Preemption
- OS-level preemption
  - Supports job cancel, requeue, suspend/resume, and system-initiated application-level checkpointing
  - Supports custom suspend/resume and checkpoint signals via external scheduler configuration
- Using a custom checkpoint script
  - Can be used to specify a particular action to take when checkpointing a job
104. Adding/Removing Nodes
- Dynamic configuration with qmgr
- Or manually edit the nodes file
  - $TORQUE_HOME/server_priv/nodes
  - Restart the pbs_server daemon after the change

> qmgr -c "create node node003"
105. Setting Node Properties
- Node property attributes
  - Can apply multiple properties per node
  - Properties are opaque
  - Each property can be applied to multiple nodes
  - Properties cannot be consumed
- Dynamically with qmgr, or manually edit the nodes file
  - $TORQUE_HOME/server_priv/nodes
  - FORMAT: <NODEID> <PROPERTY>
  - Restart pbs_server after the change

> qmgr -c "set node node001 properties = bigmem"
> qmgr -c "set node node001 properties += dualcore"
106. Node States
- States
  - down (down)
  - offline (drained)
  - job-exclusive (busy)
  - free (idle/running)
- Changing node state
  - Offline: pbsnodes -o <nodename>
  - Online: pbsnodes -c <nodename>
- Viewing nodes in a particular state
  - pbsnodes -l
107. Queue Configuration
- Configure queue attributes (see the qmgr sketch below)
  - e.g. enabled, max_queuable, kill_delay
- http://clusterresources.com/torquedocs20/4.1queueconfig.shtml
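A minimal sketch of creating and enabling an execution queue with qmgr; the queue name and limit values are illustrative.

> qmgr -c "create queue batch queue_type=execution"
> qmgr -c "set queue batch enabled = true"
> qmgr -c "set queue batch started = true"
> qmgr -c "set queue batch max_queuable = 200"
> qmgr -c "set queue batch kill_delay = 30"
> qmgr -c "set server default_queue = batch"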
108. Configuring Data Management
- For shared file systems, use usecp (see the sketch below)
- For distributed file systems, use scp
- See section 6 of the online TORQUE documentation for details on configuration and troubleshooting
- http://clusterresources.com/torquedocs20/6.1scpsetup.shtml
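A minimal $usecp sketch for MOMs whose /home is NFS-mounted from the head node; the hostname is hypothetical.

# mom_priv/config
# map paths under headnode.cluster.org:/home to the locally mounted /home,
# so file staging uses a local copy instead of scp/rcp
$usecp  headnode.cluster.org:/home  /home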
109. Monitoring Resources
- TORQUE reports a number of attributes, broken into three major categories (see the MOM config sketch below):
- Configuration
  - Includes both detected hardware configuration and specified batch attributes
  - Can report static generic resources via specification in the MOM config file
- Utilization
  - Includes information regarding the amount of node resources currently available (in use), as well as information about who or what is consuming it
  - Can report dynamic generic resources via specification of a monitor script in the MOM config file
- State
  - Includes administrative status, general node health information, and general usage status
- http://clusterresources.com/torquedocs20/a.cmomconfig.shtml
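A hedged sketch of the MOM config entries these bullets refer to: in TORQUE's MOM config an unrecognized name is treated as a static resource value, and prefixing a shell command with '!' makes it a dynamic resource. The resource names and script path here are hypothetical.

# mom_priv/config (illustrative)
# static generic resource: this node has 4 matlab tokens
matlab  4

# dynamic generic resource: value is taken from the script's output each poll
scratchspace  !/opt/tools/report_scratch_mb.sh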
110. Accounting Records
- TORQUE maintains accounting records for batch jobs in the directory $TORQUE_HOME/server_priv/accounting/<TIMESTAMP>

Record Marker  Record Type  Description
D              delete       job has been deleted
E              exit         job has exited (either successfully or unsuccessfully)
Q              queue        job has been submitted/queued
S              start        an attempt to start the job has been made (if the job fails to properly start, it may have multiple job start records)
111. Troubleshooting
- TORQUE log files
  - <TORQUE_HOME_DIR>/server_logs/
  - pbs_mom logs: <TORQUE_HOME_DIR>/mom_logs/
  - Use $loglevel (MOM), or qmgr -c 'set server log_level = <X>' (server)
- momctl
  - momctl -d 3 -h <HOSTNAME>
- tracejob
  - tracejob -n <DAYS> <JOBID>
- External scheduler diagnostics
112. Compute Node Health Check
- Configured via the pbs_mom config file using the parameters (see the config sketch after the example):
  - node_check_script
  - node_check_interval
- Example health check script:

#!/bin/sh
/bin/mount | grep global
if [ $? != "0" ]
then
    echo "ERROR cannot locate filesystem global"
fi
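And a minimal sketch of wiring the script into the MOM config; the script path and interval value are illustrative.

# mom_priv/config (illustrative)
$node_check_script    /opt/tools/nodecheck.sh
$node_check_interval  5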
113. Considerations Before Upgrading
- If upgrading from OpenPBS, PBS Pro, or TORQUE 1.0.3 or earlier, queued jobs, whether active or idle, will be lost. In such situations, job queues should be completely drained of all jobs.
- If not using the pbs_mom -r or -p flag, running jobs may be lost. In such cases, running jobs should be allowed to complete or should be requeued before upgrading TORQUE.
- pbs_mom and pbs_server daemons of differing versions may be run together. However, not all combinations have been tested and unexpected failures may occur.
114. Upgrade Steps
- Build the new release (do not install) -- see the TORQUE Quick Start Guide
- Stop all TORQUE daemons -- see qterm and momctl -s
- Install the new TORQUE -- use make install
- Start all TORQUE daemons -- see sections 7 and 8 of the TORQUE Quick Start Guide
115. Integrating with Moab
- Auto version detection
- Auto import of TORQUE config
- Auto configuration of the interface
- It just works (see the one-line sketch below)
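For reference, the Moab side of the integration is typically just the single resource manager definition shown earlier in the Native Resource Manager example:

# moab.cfg
RMCFG[torque] TYPE=PBS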
117. Section 10: A Look To The Future
- Hosting Centers
- Virtual Private Clusters
- Hierarchical Grids
- Moab Learning (Job Templates, Feedback Loops)
- Dynamic Job Enhancements
- Service Management
- MPI Jobs
118Future cont.
- RM Virtualization (Job Translation/Migration)
- More Integrated MCM Wizards
- Enhanced Information Service
- Input and Output
- Grid based Web Services (SOAP)
- Continued Scaling to 100K jobs/100K nodes
- Single-Point Workload Management