Title: Experiment and Analysis towards effective cluster management system
1Experiment and Analysis towards effective cluster
management system
- Chokchai Box Leangsuksun (Louisiana Tech U)
- Tirumala Rao Balumuri
- Jie XU
- Stephen Scott (Oak Ridge National Lab)
- Richard Libby (Intel)
2Introduction
- Management poses several challenges in cluster environments: Performance, Reliability, Availability, Serviceability
- Typical two-phase process: monitor and control
- Existing open source monitoring and management standards lead the next phase of cluster management
3Related research
- Ganglia and Clumon are widely used cluster monitoring tools with similar architectures and slightly different functionality and characteristics
Clumon Architecture
1-level Ganglia Architecture
4Related research
- Other monitoring frameworks such as dproc, supermon, and bigbrother share similar critical issues in cluster monitoring: scalability and lack of management capability
- Ganglia proposed the N-level architecture to meet the challenges of massive clusters and grids while maintaining low overhead
5Related research
- IPMI: Intelligent Platform Management Interface
- IPMI defines the interface used to communicate sensor values and control hardware components, e.g. power up/down, reset, reading temperatures
- IPMI provides a vendor-independent interface with monitoring and management capabilities for cluster environments
- Open source projects: openIPMI, freeIPMI, dcpclient, etc.
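To make the "communicate sensor values" point concrete, here is a minimal sketch of turning IPMI sensor output into structured readings. It assumes the pipe-separated table printed by the `ipmitool sensor` command (name, reading, unit, status, then thresholds). The `SensorReading` type, the `parse_sensor_table` helper, and the sample text are illustrative only and are not taken from the experiments.

```python
from typing import List, NamedTuple, Optional


class SensorReading(NamedTuple):
    name: str
    value: Optional[float]  # None when the reading is not numeric (e.g. "na")
    unit: str
    status: str


def parse_sensor_table(text: str) -> List[SensorReading]:
    """Parse pipe-separated sensor rows; rows with fewer than 4 fields are skipped."""
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        name, raw, unit, status = fields[:4]
        try:
            value = float(raw)
        except ValueError:
            value = None  # "na" or a discrete (hex) state
        readings.append(SensorReading(name, value, unit, status))
    return readings


# Made-up sample in the assumed layout; not measured data.
SAMPLE = """\
CPU1 Temp     | 42.000   | degrees C | ok | na | na       | na | 85.000 | 90.000 | na
Fan1          | 5400.000 | RPM       | ok | na | 1000.000 | na | na     | na     | na
Baseboard 12V | na       | Volts     | na | na | na       | na | na     | na     | na
"""

if __name__ == "__main__":
    for reading in parse_sensor_table(SAMPLE):
        print(reading)
```

A monitoring daemon could publish the numeric values as additional metrics, which is the kind of hardware enhancement described for Ganglia and Clumon.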
6Implementation and Experimental
- Studied existing monitoring tools such as Ganglia and Clumon, and the IPMI management framework
- Prototyped an experimental management tool by enhancing the existing monitoring tools, Ganglia and Clumon, with hardware metrics (IPMI)
- Benchmarked our prototypes and analyzed their scalability against our requirements
7Implementation and Experimental
- Studied how enhancing Ganglia and Clumon with hardware capability influences cluster monitoring characteristics such as scalability, fault tolerance, and resource usage
- Considered the level of monitoring detail required
8Implementation and Experimental
9Experimental results and Analysis
- 9 IPMI-enabled Intel dual 1.2 GHz Xeon server systems with 512 MB memory and a 100 Mbit/s Ethernet port
- The cluster was built with OSCAR 3.0 and Red Hat Linux 9.0
- Resource overhead comparisons were made between the enhanced (IPMI-enabled) Ganglia and Clumon
10Screen samples of experiments
11Experimental results and Analysis
In the Ganglia environment, CPU usage increased at a rate of 0.026 for each node added to the cluster.
In the Clumon environment, CPU usage increased at a rate of 0.03 for each node.
12Experimental results and Analysis
Fig 4.3. IPMI-enabled Ganglia and Clumon network traffic comparison
- In the Ganglia environment, network traffic increases at a rate of 3.2 for every node added to the cluster
- In the Clumon environment, network traffic increases at a rate of 4 per node
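The per-node rates reported for CPU usage and network traffic can be read as slopes of a linear overhead model. The sketch below only multiplies those slopes by a node count. The slides do not state the units, so the results are in whatever units the measurements used, and extrapolating beyond the 9-node testbed is an assumption rather than a measured result. The `RATES` table and `projected_increase` helper are hypothetical names.

```python
# Per-node growth rates quoted on the slides (units not stated there).
RATES = {
    "ganglia": {"cpu": 0.026, "network": 3.2},
    "clumon": {"cpu": 0.03, "network": 4.0},
}


def projected_increase(tool: str, metric: str, added_nodes: int) -> float:
    """Linear projection: increase = rate * number of nodes added."""
    if added_nodes < 0:
        raise ValueError("added_nodes must be non-negative")
    return RATES[tool][metric] * added_nodes


if __name__ == "__main__":
    for tool in RATES:
        for metric in ("cpu", "network"):
            print(tool, metric, projected_increase(tool, metric, 9))
```

Under this model the gap between the two tools grows with cluster size: the difference is 0.004 per node for CPU usage and 0.8 per node for network traffic.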
13Experimental results and Analysis
- Studied a set of other IPMI management capabilities in our effort to integrate management into the monitoring tools
- Measured a set of management operations: power on/off, reboot, sensor query, ID on/off, and SEL clear
Time taken to issue IPMI commands to a remote node, collected with dpccli
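A timing measurement of this kind can be sketched as a small harness that runs a command several times and reports the median wall-clock time. The slides collected their numbers with dpccli, whose syntax is not shown, so this sketch substitutes `ipmitool` subcommands over the LAN interface as an assumption. The host, user, and password arguments are placeholders, and `ipmitool_command` and `time_command` are hypothetical helper names.

```python
import statistics
import subprocess
import sys
import time
from typing import Dict, List, Sequence

# ipmitool subcommands corresponding to the operations listed above.
OPERATIONS: Dict[str, List[str]] = {
    "power on": ["chassis", "power", "on"],
    "power off": ["chassis", "power", "off"],
    "reboot": ["chassis", "power", "reset"],
    "sensor query": ["sensor"],
    "id on": ["chassis", "identify", "15"],  # blink the ID LED for 15 seconds
    "id off": ["chassis", "identify", "0"],  # 0 turns the ID LED off
    "sel clear": ["sel", "clear"],
}


def ipmitool_command(host: str, user: str, password: str, operation: str) -> List[str]:
    """Build the argument list for one remote operation (placeholder credentials)."""
    base = ["ipmitool", "-I", "lan", "-H", host, "-U", user, "-P", password]
    return base + OPERATIONS[operation]


def time_command(argv: Sequence[str], repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` runs of argv."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Harmless stand-in command so the harness runs without IPMI hardware.
    print(time_command([sys.executable, "-c", "pass"]))
```

Against real hardware one would pass `ipmitool_command(...)` to `time_command`. The median is used here so that a single slow run does not dominate the result.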
14Experimental results and Analysis
Results obtained from dpccli
15Survey and experiments of IPMI tools
16Experimental results and Analysis
- IPMItool response time was close to OpenIPMI's, and the two have similar features
- Both responded faster than dpccli
- The FreeIPMI ipmipower utility is much faster than all of these tools, but in our tests it provided a poor authentication layer
17Effective management Hardware perspective
- Explored adding IPMI capabilities to the cluster framework
- IPMI PEF (Platform Event Filtering) support reduces the load of analyzing the large number of events around the cluster
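The idea behind event filtering can be illustrated with a software analogue: each event is compared against a table of filters, and only a match triggers an action, so most events never need central analysis. This is a simplified sketch and not the filter table format defined by the IPMI specification. The `Event` and `Filter` fields, the severity scale, and the first-match rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    node: str
    sensor_type: str  # e.g. "temperature", "fan"
    severity: int     # assumed scale: 0 = info, 1 = warning, 2 = critical


@dataclass(frozen=True)
class Filter:
    sensor_type: str
    min_severity: int
    action: str       # e.g. "alert", "power_off"


def first_action(event: Event, filters: List[Filter]) -> Optional[str]:
    """Return the action of the first filter that matches, or None if none match."""
    for f in filters:
        if f.sensor_type == event.sensor_type and event.severity >= f.min_severity:
            return f.action
    return None


# Illustrative policy: stricter filters are listed first.
FILTERS = [
    Filter("temperature", 2, "power_off"),
    Filter("temperature", 1, "alert"),
    Filter("fan", 1, "alert"),
]

if __name__ == "__main__":
    print(first_action(Event("node3", "temperature", 2), FILTERS))
```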
18Effective management Hardware perspective
Our observations
- The IPMI control capabilities can be used to tune the sensor, power, and other IPMI component behavior of each cluster node
- The experimental results provided insight into some techniques for ensuring effective control
- Hardware events gathered at the cluster nodes can be correlated to predict imminent failures
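One simple way to correlate hardware events per node is to flag a node once it logs a burst of events inside a short time window. The sketch below shows that heuristic only; the slides do not describe the actual correlation technique, so the `nodes_at_risk` function, the window length, and the threshold are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


def nodes_at_risk(events: List[Tuple[float, str]],
                  window: float, threshold: int) -> List[str]:
    """Flag nodes that log at least `threshold` events within `window` seconds.

    `events` is a list of (timestamp_seconds, node) pairs sorted by timestamp.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    flagged: List[str] = []
    for ts, node in events:
        times = recent[node]
        times.append(ts)
        # Drop events that have fallen out of the sliding window.
        while ts - times[0] > window:
            times.popleft()
        if len(times) >= threshold and node not in flagged:
            flagged.append(node)
    return flagged


if __name__ == "__main__":
    # Made-up event log for illustration.
    log = [(0, "n1"), (10, "n2"), (20, "n1"), (30, "n1"), (500, "n2")]
    print(nodes_at_risk(log, window=60, threshold=3))
```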
19Summary: existing monitoring tools
- Mature open source cluster monitoring tools exist
- Current tools are not well integrated for complete RAS management
- They only present monitoring information, with no interpretation
- They do not assure quick detection of abnormalities
- They provide no means for management (monitor only)
20Intelligent Cluster Monitoring
- Monitoring Control Channel (Management Channel) → improves manageability
- Local Management Agents
- Central Manager
- Distributed control, centralized intelligence management → better fault handling
21Central Manager Function Unit
22Monitoring Agents
23Cluster Management Protocol
- SNMP → network resource management
- CMIP → cluster resource management
- Basic Commands
- Get Request
- Get Response
- Set Request
- Exec Request
- Alert Response
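The basic commands can be pictured as message types exchanged between the central manager and a local management agent. The slides only name the commands, so everything else in this sketch is an assumption: the `Message` fields, the `Agent` class, and the choice to answer Set and Exec requests with a Get Response.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Command(Enum):
    GET_REQUEST = "Get Request"
    GET_RESPONSE = "Get Response"
    SET_REQUEST = "Set Request"
    EXEC_REQUEST = "Exec Request"
    ALERT_RESPONSE = "Alert Response"


@dataclass
class Message:
    command: Command
    node: str
    key: str
    value: Any = None


class Agent:
    """Hypothetical local management agent with named values and actions."""

    def __init__(self, node: str, actions: Dict[str, Callable[[], Any]]):
        self.node = node
        self.values: Dict[str, Any] = {}
        self.actions = actions

    def handle(self, msg: Message) -> Message:
        """Answer a request from the central manager."""
        if msg.command is Command.GET_REQUEST:
            result = self.values.get(msg.key)
        elif msg.command is Command.SET_REQUEST:
            self.values[msg.key] = msg.value
            result = msg.value
        elif msg.command is Command.EXEC_REQUEST:
            result = self.actions[msg.key]()
        else:
            raise ValueError(f"agent cannot handle {msg.command.value}")
        return Message(Command.GET_RESPONSE, self.node, msg.key, result)

    def alert(self, key: str, value: Any) -> Message:
        """Build an unsolicited alert for the central manager."""
        return Message(Command.ALERT_RESPONSE, self.node, key, value)


if __name__ == "__main__":
    agent = Agent("node1", {"reboot": lambda: "reboot issued"})
    agent.handle(Message(Command.SET_REQUEST, "node1", "temp_limit", 80))
    print(agent.handle(Message(Command.GET_REQUEST, "node1", "temp_limit")))
```

Keeping the command set this small mirrors the Get/Set style of SNMP, with Exec and Alert added for control actions and agent-initiated notifications.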
24Conclusion and Future work
- We conducted our research from two directions: the hardware aspect and the software aspect
- Investigated how a popular hardware management platform like IPMI can be incorporated into existing cluster monitoring tools to provide valuable hardware information
- Proposed an intelligent management framework
- Event-based correlation of hardware events
- Policy-based hardware monitoring and notification
- Studying the deviation patterns from the regular pattern and cross-correlation
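As a rough illustration of detecting deviation from a regular pattern, the sketch below flags samples that lie more than k standard deviations from the mean of a baseline series. This is a generic technique chosen for illustration, not the method proposed in the slides, and it does not cover cross-correlation. The function name, the default k = 3, and the numbers are made up.

```python
import statistics
from typing import List, Sequence


def deviating_indices(baseline: Sequence[float],
                      samples: Sequence[float], k: float = 3.0) -> List[int]:
    """Return indices of samples more than k baseline standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # needs at least two baseline points
    if stdev == 0:
        # A flat baseline: any different value counts as a deviation.
        return [i for i, x in enumerate(samples) if x != mean]
    return [i for i, x in enumerate(samples) if abs(x - mean) > k * stdev]


if __name__ == "__main__":
    # Made-up temperature readings for illustration.
    regular = [40, 41, 39, 40, 42, 38]
    new_readings = [40, 41, 55, 39]
    print(deviating_indices(regular, new_readings))
```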