1
The ORNL Cluster Computing Experience
Stephen L. Scott
Oak Ridge National Laboratory
Computer Science and Mathematics Division
Network and Cluster Computing Group
December 6, 2004 - RAMS Workshop, Oak Ridge, TN
scottsl@ornl.gov
www.csm.ornl.gov/sscott
2
OSCAR
3
What is OSCAR?
Open Source Cluster Application Resources
  • OSCAR Framework (cluster installation, configuration, and management)
    - Remote installation facility
    - Small set of core components
    - Modular package and test facility
    - Package repositories
  • Use best known methods
  • Leverage existing technology where possible
  • Wizard-based cluster software installation (see the sketch after this list)
    - Operating system
    - Cluster environment
    - Administration
    - Operation
  • Automatically configures cluster components
  • Increases consistency among cluster builds
  • Reduces time to build / install a cluster
  • Reduces need for expertise
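A minimal sketch of how an OSCAR install is typically kicked off on the head node. The tarball name, version, and the eth1 interface are assumptions made for illustration; exact file names and wizard steps vary by OSCAR release.

    # Unpack the OSCAR distribution on the head node (version is an assumption)
    tar -xzf oscar-4.0.tar.gz
    cd oscar-4.0
    # Start the installation wizard, naming the NIC that faces the cluster nodes
    ./install_cluster eth1
    # The wizard then walks through its numbered steps: select and configure
    # packages, install server packages, build the client image, define clients,
    # set up networking, complete the setup, and run the test suite.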

4
OSCAR Components
  • Administration/Configuration
    - SIS, C3, OPIUM, Kernel-Picker, NTPconfig, cluster services (dhcp, nfs, ...)
    - Security: Pfilter, OpenSSH
  • HPC Services/Tools
    - Parallel libraries: MPICH, LAM/MPI, PVM
    - Torque, Maui, OpenPBS
    - HDF5
    - Ganglia, Clumon, and other monitoring systems
    - Other 3rd party OSCAR packages
  • Core Infrastructure/Management
    - System Installation Suite (SIS), Cluster Command and Control (C3), Env-Switcher
    - OSCAR DAtabase (ODA), OSCAR Package Downloader (OPD)

5
Open Source Community Development Effort
  • Open Cluster Group (OCG)
    - Informal group formed to make cluster computing more practical for HPC research and development
    - Membership is open, directed by a steering committee
  • OCG working groups
    - OSCAR (core group)
    - Thin-OSCAR (Diskless Beowulf)
    - HA-OSCAR (High Availability)
    - SSS-OSCAR (Scalable Systems Software)
    - SSI-OSCAR (Single System Image)
    - BIO-OSCAR (Bioinformatics cluster system)

6
OSCAR Core Partners
  • Indiana University
  • NCSA
  • Oak Ridge National Laboratory
  • Université de Sherbrooke
  • Louisiana Tech Univ.
  • Dell
  • IBM
  • Intel
  • Bald Guy Software
  • RevolutionLinux

(Core partners as of November 2004)
7
eXtreme TORC powered by OSCAR
  • 65 Pentium IV machines
  • Peak performance: 129.7 GFLOPS
  • RAM: 50.152 GB
  • Disk capacity: 2.68 TB
  • Dual interconnects: Gigabit Ethernet and Fast Ethernet
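For scale, these figures work out to roughly 129.7 / 65 ≈ 2.0 GFLOPS of peak performance and 50.152 / 65 ≈ 0.77 GB of RAM per node.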

8
(No Transcript)
9
HA-OSCAR
RAS management for HPC clusters
  • The first known field-grade open source HA Beowulf cluster release
  • Self-configuring multi-head Beowulf system
  • Combines HA and HPC clustering techniques to enable critical HPC infrastructure
  • Active/hot standby head node
  • Self-healing with 3-5 second automatic failover time (a generic illustration follows)
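What follows is only a generic illustration of an active/standby failover loop, not HA-OSCAR's actual mechanism; the head-node name, service address, interface, and thresholds are assumptions made for the sketch.

    # Standby-side monitor: take over the service IP after repeated missed heartbeats
    FAILS=0
    while true; do
        if ping -c 1 -W 1 head1 > /dev/null 2>&1; then   # head1 = assumed primary name
            FAILS=0
        else
            FAILS=$((FAILS + 1))
        fi
        if [ "$FAILS" -ge 3 ]; then
            # Claim the cluster's service address (address and device are assumptions)
            ip addr add 192.168.0.10/24 dev eth0
            break
        fi
        sleep 1
    done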

10
(No Transcript)
11
Scalable Systems Software
12
Scalable Systems Software
Participating organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, NCSA, PSC, SDSC, IBM, Cray, Intel, SGI
Problem
  • Computer centers use an incompatible, ad hoc set of system tools
  • Present tools are not designed to scale to
    multi-Teraflop systems

Goals
  • Collectively (with industry) define standard
    interfaces between systems components for
    interoperability
  • Create scalable, standardized management tools
    for efficiently running our large computing
    centers

Impact
  • Reduced facility management costs.
  • More effective use of machines by scientific
    applications.

To learn more, visit www.scidac.org/ScalableSystems
13
SSS-OSCAR
  • Scalable Systems Software

Leverage the OSCAR framework to package and distribute the Scalable Systems Software (SSS) suite. sss-oscar is a release of OSCAR containing all SSS software in a single downloadable bundle.

  • SSS project developing standard interfaces for scalable tools
    - Improve interoperability
    - Improve long-term usability and manageability
    - Reduce costs for supercomputing centers
  • Map out functional areas
    - Schedulers, job managers
    - System monitors
    - Accounting and user management
    - Checkpoint/restart
    - Build and configuration systems
  • Standardize the system interfaces
    - Open forum of universities, labs, and industry reps
    - Define component interfaces in XML
    - Develop communication infrastructure
14
OSCAR-ized SSS Components
  • Bamboo: Queue/Job Manager
  • BLCR: Berkeley Checkpoint/Restart
  • Gold: Accounting and Allocation Management System
  • LAM/MPI (w/ BLCR): Checkpoint/Restart-enabled MPI
  • MAUI-SSS: Job Scheduler
  • SSSLib: SSS Communication library
    - Includes SD, EM, PM, BCM, NSM, NWI
  • Warehouse: Distributed System Monitor
  • MPD2: MPI Process Manager

15
Cluster Power Tools
16
C3 Power Tools
  • Command-line interface for cluster system administration and parallel user tools
  • Parallel execution: cexec
    - Execute across a single cluster or multiple clusters at the same time (see the contrast sketched after this list)
  • Scatter/gather operations: cpush / cget
    - Distribute or fetch files for all node(s)/cluster(s)
  • Used throughout OSCAR and as the underlying mechanism for tools like OPIUM's useradd enhancements
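To make the parallel-execution point concrete, here is a sketch contrasting a traditional one-node-at-a-time loop with the equivalent C3 command shown later in the deck; the node names in the loop are assumptions.

    # Without C3: visit each node serially over ssh (node names are assumptions)
    for n in node1 node2 node3 node4; do
        ssh "$n" hostname
    done
    # With C3: one command, executed in parallel across the default cluster
    cexec hostname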

17
C3 Building Blocks
  • System administration
    - cpushimage - push a system image across the cluster
    - cshutdown - remote shutdown to reboot or halt the cluster
  • User and system tools
    - cpush - push a single file or directory
    - crm - delete a single file or directory
    - cget - retrieve files from each node
    - ckill - kill a process on each node
    - cexec - execute an arbitrary command on each node
    - cexecs - serial mode, useful for debugging
    - clist - list each cluster available and its type
    - cname - returns a node name from a given node position
    - cnum - returns a node position from a given node name
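A few illustrative invocations of the tools above, written in the same node-range style as the examples on the next slide; the exact argument forms shown here are assumptions, so consult each command's man page before use.

    # Run a command serially, one node at a time, which is handy for debugging
    cexecs 1-3 df -h
    # Terminate a process by name on a range of nodes (process name is an assumption)
    ckill 1-5 a.out
    # List the clusters defined in the local C3 configuration
    clist
    # Map between a node position and its name (argument forms are assumptions)
    cname 2
    cnum node3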

18
C3 Power Tools
  • Example: run hostname on all nodes of the default cluster
    - cexec hostname
  • Example: push an RPM to /tmp on the first 3 nodes
    - cpush 1-3 helloworld-1.0.i386.rpm /tmp
  • Example: get a file from node 1 and nodes 3-6
    - cget 1,3-6 /tmp/results.dat /tmp
  • The destination can be left off with cget; it will use the same location as the source (see below)
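Following the last bullet, the same fetch can be issued without an explicit destination, in which case cget writes each node's copy back to the source path:

    # Same fetch as above; the destination defaults to the source location
    cget 1,3-6 /tmp/results.dat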

19
Motivation for Success!
20
RAMS Summer 2004

21
Preparation for Success!
  • Personality / Attitude
    - Adventurous
    - Self-starter
    - Self-learner
    - Dedication
    - Willing to work long hours
    - Able to manage time
    - Willing to fail
  • Work experience
    - Responsible
    - Mature personal and professional behavior
  • Academic
    - Minimum of sophomore standing
    - CS major
    - Above-average GPA
    - Extremely high faculty recommendations
    - Good communication skills
    - Two or more programming languages
    - Data structures
    - Software engineering