Questions from Yesterday
Transcript and Presenter's Notes

1
Questions from Yesterday?
2
Rocks Concepts
3
Software Installation
[Diagram: a software repository (the collection of all possible software packages, AKA the distribution, delivered as RPMs) plus installation instructions (a Kickstart file of descriptive information to configure a node) are combined to build appliances: compute node, IO server, web server.]
4
Software Repository
(Same diagram as the previous slide, with the software repository highlighted.)
5
Installation Instructions
(Same diagram, with the installation instructions highlighted.)
6
NPACI Rocks
  • Software Repository
  • Red Hat derived distribution
  • Managed with rocks-dist
  • Installation Instructions
  • Based on Kickstart
  • Variables in SQL
  • Functional decomposition into XML files (a sketch follows this list)
  • 100 nodes
  • 1 graph
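
As a rough illustration of the functional decomposition idea, a node file pairs a package list with a post-install script in a small XML file, and a separate graph file wires node files together. The element names and file names below are from memory and are meant only as a sketch, not as the exact Rocks schema.

  <!-- ssh.xml: a hypothetical node file -->
  <kickstart>
    <description>Configure the SSH service</description>
    <package>openssh</package>
    <package>openssh-server</package>
    <post>
      <!-- shell commands appended to the generated kickstart %post section -->
      /sbin/chkconfig sshd on
    </post>
  </kickstart>

  <!-- fragment of the graph file: compute nodes inherit the ssh configuration -->
  <graph>
    <edge from="compute" to="ssh"/>
  </graph>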

7
Related Work
8
Real World Computing Partnership (RWCP)
  • Research group started in 1992, and based in
    Tokyo.
  • SCore software
  • Semi-automated node integration using Red Hat
  • Job launcher similar to UCB's REXEC
  • MPC++, multi-threaded C++ using templates
  • PM, wire protocol for Myrinet
  • CD-ROMs were available at SC2001

9
Scyld Beowulf
  • Single System Image
  • Global process ID
  • Not a global file system
  • Heavy OS modifications to support BProc
  • Patches kernel
  • Patches libraries (libc)
  • Current release is based on RedHat 6.2
  • Jobs start on the frontend and are pushed to
    compute nodes
  • Hooks remain on the frontend
  • Does this scale to 1000 nodes?

10
Scalable Cluster Environment (SCE)
  • Developed at Kasetsart University in Thailand
  • SCE is a software suite that includes
  • Tools to install, manage, and monitor compute
    nodes
  • Diskless (SSI)
  • Diskfull (RedHat)
  • A batch scheduler to address the difficulties in
    deploying and maintaining clusters
  • VRML based monitoring tools
  • User installs frontend with RedHat and adds SCE
    packages.
  • Rocks and SCE are starting to work together
  • Rocks is good at low level cluster software
  • SCE is good at high level cluster software

11
Open Cluster Group (OSCAR)
  • OSCAR is a collection of clustering best
    practices
  • PBS/Maui
  • OpenSSH
  • In the form of tarballs
  • Frontend is manually installed and OSCAR is added
  • Installing Compute Nodes
  • Linux Utility for cluster Install (LUI, from IBM)
  • Distribution-neutral OS installer
  • Same functionality as Red Hat's installer
  • Only supports Red Hat
  • System Imager (VA/Linux)
  • Disk System Imaging
  • A combination of both is used to manage your cluster

12
Extreme Linux
  • Started in early 1998 at the "Extreme Linux
    Workshop".
  • Red Hat and NASA CESDIS jointly released a CD
    containing a distribution to help build
    Beowulf-class clusters.
  • This was really a collection of now-standard
    cluster tools like MPI and PVM.
  • Development halted after the release.

13
System Imager
  • From VA/Linux (used to sell clusters)
  • System imaging installation tools
  • Manages the files on a compute node
  • Better than managing the disk blocks
  • Use
  • Install a system manually
  • Appoint the node as the golden master
  • Clone the golden master onto other nodes
  • Problems
  • Doesn't support heterogeneous hardware
  • No method for managing the software on the
    golden master

14
Cfengine
  • Policy-based configuration management tool for
    UNIX or NT hosts
  • Flat ASCII (looks like a Makefile; a sketch follows this list)
  • Supports macros and conditionals
  • Popular to manage desktops
  • Patching services
  • Verifying the files on the OS
  • Auditing user changes to the OS
  • Nodes pull their Cfengine file and run every
    night
  • System changes on the fly
  • One bad change kills everyone (in the middle of
    the night)
  • Can help you make changes to a running cluster
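
For a flavor of the Makefile-like format, here is a small Cfengine 2 style policy sketch. The section names and option syntax are from memory and may not match your Cfengine version exactly; treat it as an illustration only.

  control:
     actionsequence = ( files tidy )

  files:
     # enforce ownership and permissions on a system file
     /etc/motd mode=644 owner=root group=root action=fixall

  tidy:
     # remove week-old files from /tmp on every run
     /tmp pattern=* age=7 recurse=inf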

15
Kickstart
  • RedHat
  • Automates installation
  • Used to install desktops
  • Foundation of Rocks
  • Description based installation
  • Flat ASCII file
  • No conditionals or macros
  • Set of packages and shell scripts that run to
    install a node (a minimal example follows)
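
For reference, a minimal kickstart file looks roughly like the sketch below. The exact directive set varies by Red Hat release, and the disk layout, password, and package names here are placeholders.

  # minimal kickstart sketch (values are placeholders)
  install
  text
  lang en_US
  keyboard us
  timezone --utc America/Los_Angeles
  rootpw changeme
  clearpart --all
  part / --size 4096 --grow
  part swap --size 512

  %packages
  @ Base
  openssh-server

  %post
  # arbitrary shell commands run after the packages are installed
  echo "kickstarted on `date`" > /etc/motd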

16
LCFG
  • Edinburgh University
  • Anderson and Scobie
  • Description based installation
  • Flat ASCII file
  • Conditionals, macros, and statements
  • Full blown (proprietary) language to describe a
    node
  • Compose description file out of components
  • Using file inclusion
  • Not a graph as in Rocks
  • Does not use kickstart
  • Must replicate the work of RedHat
  • Very interesting group
  • Design goals very close to Rocks
  • Implementation is also similar

17
Everyone is building clusters
  • Currently too many cluster distributions
  • Replicated effort
  • Big problems remain unsolved
  • Global storage
  • Job launching and control
  • System Monitoring
  • Users do not care
  • We need portability between these efforts
  • Same PBS script on any cluster
  • Multiple efforts to standardize clusters
  • SciDAC - DOE sponsored
  • Linux-HA - Open effort for HA clusters
  • GGF - Cluster standards WG (forming)

18
Trouble Shooting
19
Meteor Cluster at SDSC
  • Rocks v2.2
  • 2 Frontends
  • 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800, 933, IA-64
  • SCSI, IDA
  • IBM
  • 733, 1000
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

20
Pick good HW components
  • Rack-mount gear is easy to deploy and maintain
  • DIY is fun and cheap, but time consuming
  • Buy gear engineered for thermal management
  • White boxes often run hot

21
Software Infrastructure
  • Refrain from special configuration
  • Open-source software moves fast; if you customize
    lots of modules, you'll have to remember how to
    reconfigure them when you install upgrades
  • Also, configuration formats sometimes change, so
    you may need to learn a new configuration format

22
Software Infrastructure
  • Leverage others' work
  • Before inventing, investigate
  • Once you write it, you have to maintain it

23
Minimize Cables
  • Before adding any network to your nodes, be sure
    you really want it
  • Do you really want to debug a 1024-node serial
    console network?
  • Do you really want to debug a 1024-node
    keyboard/video/mouse network?

24
Cluster bring up
  • Trunk all cables (power, Ethernet, Myrinet,
    etc)
  • Neatness counts
  • Will help when components break

[Diagram: a rack of eight chassis with cables trunked together.]
25
Cluster bring up
  • Minimize cables that cross racks
  • With an ethernet switch in each rack, all compute
    node ethernet cables are contained within the
    rack
  • Just one uplink ethernet cable exits the rack

26
Cluster bring up
  • Deploy nodes in groups of 8
  • Good match for switches
  • Switches come in sizes 8, 16, 24, 32, etc.

27
Cluster bring up
  • Thoroughly test nodes before putting cluster into
    production
  • Test in groups of 8
  • Isolate all problems
  • Then test entire system

28
Cluster Debugging
29
Myrinet Network
[Diagram: four nodes on the Myrinet network: compute-0-10 (gmID 1), compute-1-15 (gmID 2), compute-0-16 (gmID 3), compute-1-16 (gmID 4).]
30
Myrinet Debugging
  • When running an MPI job and you see an error
    message like "MPI id 0, gmID 2 can't find MPI id
    1, gmID 4":
  • Determine suspect node from error message

31
Myrinet Debugging
  • First, run a diagnostic test
  • We run High-performance Linpack over Myrinet
  • Linpack stresses the CPUs and sends MPI-based
    messages over Myrinet

32
Myrinet Debugging
  • Run gm_board_info

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 89
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 86
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 84
   4  00:60:dd:7f:80:ea  compute-1-16  b8 b5 88
33
Myrinet Debugging
  • Compare with gm_board_info output from a node
    that is operational
  • If the gmID-to-gmName mapping differs, then one of
    the nodes is bad

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 8a
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 87
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 85
   4  00:60:dd:7f:80:ea  compute-1-16  b8 b5 89

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 89
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 86
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 84
   4  00:60:dd:7f:80:ea  compute-1-5   b8 b5 88
34
Myrinet Debugging
  • To double check, run gm_board_info on one more
    node
  • This is "The Byzantine Generals Problem"
  • We need to develop a program that automatically
    does this (a rough sketch follows)
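
One possible starting point is the sketch below: gather each node's gmID-to-gmName map and flag the node whose map disagrees with the majority. It assumes password-less ssh to every compute node, gm_board_info in the default PATH, a hypothetical nodes.txt listing the node names, and that the gmID and gmName columns are fields 1 and 3 of the route table; adjust the awk fields to your gm_board_info output.

  # rough sketch only; see the assumptions above
  for node in $(cat nodes.txt); do
      sum=$(ssh $node gm_board_info | awk '/^ *[0-9]+ / {print $1, $3}' | md5sum)
      echo "$sum  $node"
  done | sort
  # nodes whose checksum differs from the majority are the suspects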

35
After You Find the Dead Node
  • Replace cable Easy
  • Run diagnostic
  • Try a different port on the switch
  • Run diagnostic
  • Replace Myrinet card
  • Run diagnostic
  • Call Myricom Hard

36
PBS/Maui debugging
  • For large clusters (greater than 64), try
    increasing timeout values
  • In the file /usr/spool/maui/maui.cfg (an
    illustrative snippet follows this list)
  • RMPOLLINTERVAL
  • RMTIMEOUT[0]
  • Try restarting the services
  • /etc/rc.d/init.d/pbs-server restart
  • /etc/rc.d/init.d/maui restart
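
An illustrative maui.cfg fragment is below. The parameter names are the ones from the slide; the values are placeholders, and the exact value syntax depends on your Maui version, so check its documentation.

  # illustrative values only
  RMPOLLINTERVAL  00:02:00
  RMTIMEOUT[0]    90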

37
Node Debugging
38
UNIX is a prerequisite
  • A single unix machine is complex
  • A cluster is more complex than the sum of its
    parts
  • Understand how a single machine works before
    building a cluster
  • Build your own Linux desktop machine to get
    started

39
Laws of Electronics
  • If something fails try
  • Plugging it in
  • Turning it on
  • Cycling the power
  • This will fix most of your problems
  • We make mistakes like this all the time
  • E.g.
  • Forgetting to connect the floppy drive to the
    motherboard
  • Forgetting to plug in the main power supply

40
BIOS
  • Functions
  • Power on Self Test
  • OS Boot
  • Windows
  • LILO
  • Linux does not use BIOS
  • Bad settings are fatal

41
New Machine
  • Reset the BIOS to factory defaults
  • Jumper setting
  • Software options
  • Verify machine can boot without keyboard
  • Verify virus protection is disabled
  • Boot order
  • CDROM
  • Floppy
  • Hard Disk
  • Network (PXE)

42
Bugs
  • Cannot find kickstart file
  • Frontend
  • Compute node

43
Frontend
  • File format
  • Make sure ks.cfg was saved as a Unix file on the
    floppy
  • Red Hat's kickstart demands that ks.cfg is a
    Unix-based text file
  • The best way is to save the file on a Unix system
  • Floppy Media
  • Save kickstart file to another floppy
  • Don't recycle AOL disks
  • Floppy Drive
  • Replace the drive and try again
  • This happened to us (took hours to debug)

44
Compute Node
  • Suspect crashed frontend services? (example
    restart commands follow this list)
  • Restart the DHCP server
  • Restart the Apache (web) server
  • Restart the database (MySQL) server
  • Check network connection
  • Private side network is eth0
  • Public side network is eth1
  • Verify CAT5e cables
  • Look at the link lights
  • If this is the first time the node is powered up,
    check that insert-ethers is running on the
    frontend
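
On a Red Hat based frontend of this era, restarting those services looks roughly like the commands below; the init script names (dhcpd, httpd, mysqld) may differ slightly depending on the release and how the database was packaged.

  /etc/rc.d/init.d/dhcpd restart
  /etc/rc.d/init.d/httpd restart
  /etc/rc.d/init.d/mysqld restart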

45
Understand your network
  • This is the key component
  • When it fails a cluster becomes a bunch of PCs

46
OSI Network Model
  • International Organization for Standardization -
    Open Systems Interconnection
  • 7 network layers
  • In every network text book
  • This is not the Internet

47
Internet (TCP/IP)
  • Simplifies OSI model
  • OSI is theory
  • TCP/IP is practice
  • 4 network layers
  • Link
  • Network
  • Transport
  • Application

48
Link (Ethernet)
  • MAC Addresses
  • 6 bytes
  • E.g. 00:10:b5:55:16:b5
  • ARP
  • IP -> MAC lookup

49
Network (IP)
  • Infrastructure
  • Message Reassembly
  • Inspected by firewall rules

50
UDP Transport
  • Properties
  • Unreliable
  • Connectionless
  • Datagram (Packets)
  • Clients
  • NFS
  • Ganglia
  • DHCP
  • Syslog

51
TCP Transport
  • Properties
  • Reliable
  • Connection Oriented
  • Byte stream
  • Clients
  • SSH
  • HTTP
  • Kickstart file requests
  • RPM downloads

52
tcpdump
  • Poor man's network analyzer
  • Pro
  • Lowest level debugging
  • Extremely verbose
  • De facto standard
  • Con
  • Lowest level debugging
  • Extremely verbose
  • Learn to love it
  • Great tool for debugging clusters

53
Example
  • tcpdump -i eth0

07:00:02.737039 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
07:00:02.942326 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
07:00:03.122296 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 41414[|domain]
07:00:03.126042 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 41414[|domain]
07:00:03.143136 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 41126[|domain]
07:00:03.146833 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 41126[|domain]
07:00:03.153560 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 34968[|domain]
07:00:03.157088 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 34968[|domain]

54
Network Verification
  • Install Frontend
  • Boot compute node
  • Run tcpdump on the frontend for eth0 (a DHCP-only
    filter is shown after this list)
  • Network correct
  • Will see DHCP request from compute nodes
  • Will see DHCP response from frontend
  • Network incorrect
  • Will see nothing
  • Or will see public-side network traffic
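
To cut through unrelated traffic while verifying the network, you can restrict tcpdump to DHCP/BOOTP packets only:

  # show only DHCP/BOOTP traffic on the private interface
  tcpdump -i eth0 port bootps or port bootpc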

55
Syslog
  • Standard UNIX application event logger
  • Multiple logging facilities
  • USER
  • Kernel
  • LOCAL.0-9 - for user applications
  • Logging to /var/log/ on each node
  • Compute nodes also log to the frontend (in
    /var/log/)

56
Higher level than tcpdump
  • tail -f /var/log/messages
  • On the frontend
  • Will show all events for the entire cluster
  • Use in combination with tcpdump

57
Application Optimization
58
Job launching
  • With many users, use a queuing system
  • Or else competing jobs will slow (or halt!) other
    jobs

59
Data storage
  • Use many file servers
  • E.g., NAS, NFS
  • But this has its problems too
  • Want to distribute load
  • But, there are challenges

60
Challenges with File Servers
Access
  • Gigabit connect to file server
  • Access data at 90 MB/s
  • 1 TB / server
  • 11 seconds to read 1 GB
  • 3 hours to scan all data
  • Worse if compute nodes have 100 Mb/s connections
  • Access data at 9 MB/s -> 110 seconds to read 1 GB

61
Buy a good compiler
  • Your code will compile
  • G77 doesn't support F90 constructs
  • Can't promote real values to double precision
  • Your code will execute faster
  • At spec.org (CPU benchmark), not one entry used
    gcc or g77
  • Searched over 1000 entries
  • Your code will work
  • In our experience, G77 has been known to give
    incorrect results

62
Source: www.polyhedron.com/complnx.html
63
Buy a good compiler
  • They're cheap
  • Intel Fortran compiler
  • USD 600 first year
  • USD 250 / year
  • Portland Group Fortran compiler
  • USD 500 first year
  • USD 125 / year

64
Know your compiler flags
  • Also from spec.org

65
Data locality
  • Write code that is cache-aware and memory-aware
    (a blocked-loop sketch follows this list)
  • Be flexible to changing cache and memory sizes
  • And they don't just get larger!
  • L1 data cache decreased from 16 KB in the Pentium
    III to 8 KB in the Pentium 4
  • Deployed main memory increases by a factor of 10
    every 4 years
  • So plan on 78% more memory / node / year
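
As a concrete illustration of cache-aware code, the sketch below transposes a matrix in small blocks so each tile stays resident in cache; the block size BS is illustrative and should be tuned to the target cache sizes.

  /* blocked matrix transpose: a minimal cache-blocking sketch */
  #include <stdio.h>

  #define N  1024
  #define BS 32                       /* illustrative block size */

  static double a[N][N], b[N][N];

  int main(void)
  {
      int i, j, ii, jj;

      for (ii = 0; ii < N; ii += BS)
          for (jj = 0; jj < N; jj += BS)
              for (i = ii; i < ii + BS; i++)
                  for (j = jj; j < jj + BS; j++)
                      b[j][i] = a[i][j];   /* each BS x BS tile stays in cache */

      printf("b[0][0] = %f\n", b[0][0]);
      return 0;
  }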

66
Data Locality
67
Memory is cheap and fast
  • Buy lots of it!
  • USD0.20 / megabyte

68
Software Development Trends
  • In the vector days, programmers focused on
    enhancing code within loops
  • E.g., 64 element, 64-bit vector registers

69
Software Development Trends
  • Now, in MPPs the bottleneck is inter-node
    communication
  • Focus on minimizing communication

70
Back to the Future?
  • Commodity processors are incorporating more
    vector registers and instructions
  • Pentium 4: 8 128-bit registers
  • Playstation 2: 32 128-bit registers
  • Nvidia GeForce 3: 41 128-bit registers

71
Back to the Future?
  • Compiler writers (and programmers) are writing
    vector-aware programs
  • E.g., taking advantage of SSE, AltiVec, 3DNow!

72
Enhancing code
  • Common programmer method
  • Insert timers (a simple example follows this list)
  • Run code
  • Find a bottleneck
  • Goto 2 until fatigued
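
In the spirit of "insert timers", the sketch below wraps a region of interest with a simple wall-clock timer based on gettimeofday():

  #include <stdio.h>
  #include <sys/time.h>

  /* wall-clock time in seconds */
  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      long i;
      volatile double s = 0.0;
      double t0 = now();

      for (i = 0; i < 100000000L; i++)   /* the section being timed */
          s += 1.0 / (i + 1);

      printf("loop took %.3f s (s = %g)\n", now() - t0, s);
      return 0;
  }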

73
Enhancing code
  • For 32-bit addressable processors (e.g., x86),
    write values that will be reused to memory
    (rather than disk)
  • Since a single process can only address 2 GB,
    create a data server process

[Diagram: a compute node running both a compute process and a data server process.]
74
Program Profiling
  • gprof
  • Good for serial programs
  • Performance API (PAPI)
  • Developed at UTK
  • Same people who developed Linpack, ATLAS, PVM,
    NetSolve, etc.
  • Provides an API to access performance counters on
    microprocessors (a minimal example follows this
    list)
  • Supported CPUs: x86, IA-64, IBM Power series,
    Alpha, Cray
  • Supported software: Linux, AIX, Tru64, Unicos
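
For a feel of the API, the sketch below uses PAPI's high-level counter interface to measure cycles and instructions around a loop. Which preset events are available depends on the CPU, and the calls may differ across PAPI versions, so treat this as an illustration.

  #include <stdio.h>
  #include <papi.h>

  int main(void)
  {
      int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
      long long counts[2];
      double s = 0.0;
      long i;

      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
          return 1;                             /* PAPI failed to initialize */
      if (PAPI_start_counters(events, 2) != PAPI_OK)
          return 1;                             /* events not available here */

      for (i = 1; i <= 10000000L; i++)          /* region of interest */
          s += 1.0 / i;

      PAPI_stop_counters(counts, 2);
      printf("cycles = %lld, instructions = %lld (s = %g)\n",
             counts[0], counts[1], s);
      return 0;
  }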

75
PAPI
  • Programs have been developed using PAPI
  • Perfometer, DEEP/MPI,

76
DEEP/MPI
  • Program analysis and debugging tools for MPI
    programs written in Fortran or C
  • USD 700 - 950

77
DEEP/MPI
78
DEEP/MPI
79
PAPI Example
  • Application developer at SDSC used PAPI on two
    separate hardware implementations for IA-64 to
    examine performance differences
  • Embedded PAPI function calls in test program

80
PAPI
  • DEEP/MPI is just one of a half dozen applications
    based on PAPI
  • To us it looks like one of the more interesting
    ones
  • To explore PAPI further
  • www.utk.edu/papi

81
Contact Info
82
How to Get More Info
  • People
  • Philip: phil@sdsc.edu
  • Mason: mjk@sdsc.edu
  • Greg: bruno@sdsc.edu
  • Rocks web site
  • http://rocks.npaci.edu
  • Discussion list
  • Send email to npaci-rocks-discussion@npaci.edu

83
Final Thoughts
84
Driving in Singapore
  • Fluid lanes
  • Watch out for the speed cameras
  • 24 points = suspended license
  • Can you believe how much the COE is?

85
Red Man, Stand
86
My Cell Phone Before
87
(No Transcript)