Title: Questions from Yesterday
1Questions from Yesterday?
2Rocks Concepts
3Software Installation
- Software Repository: collection of all possible software packages (AKA Distribution), delivered as RPMs
- Installation Instructions: descriptive information to configure a node, in a kickstart file
- Appliances: Compute Node, IO Server, Web Server
4Software Repository
5Installation Instructions
6NPACI Rocks
- Software Repository
- Red Hat derived distribution
- Managed with rocks-dist
- Installation Instructions
- Based on Kickstart
- Variables in SQL
- Functional decomposition into XML files
- 100 nodes
- 1 graph
7Related Work
8Real World Computing Partnership (RWCP)
- Research group started in 1992, based in Tokyo
- SCore software
- Semi-automated node integration using RedHat
- Job launcher similar to UC Berkeley's REXEC
- MPC++, multi-threaded C++ using templates
- PM, wire protocol for Myrinet
- CDROMs were available at SC2001
9Scyld Beowulf
- Single System Image
- Global process ID
- Not a global file system
- Heavy OS modifications to support BProc
- Patches kernel
- Patches libraries (libc)
- Current release is based on RedHat 6.2
- Jobs start on the frontend and are pushed to compute nodes
- Hooks remain on the frontend
- Does this scale to 1000 nodes?
10Scalable Cluster Environment (SCE)
- Developed at Kasetsart University in Thailand
- SCE is a software suite that includes
- Tools to install, manage, and monitor compute nodes
- Diskless (SSI)
- Diskfull (RedHat)
- A batch scheduler to address the difficulties in deploying and maintaining clusters
- VRML based monitoring tools
- User installs frontend with RedHat and adds SCE packages
- Rocks and SCE are starting to work together
- Rocks is good at low level cluster software
- SCE is good at high level cluster software
11Open Cluster Group (OSCAR)
- OSCAR is a collection of clustering best practices
- PBS/Maui
- OpenSSH
- In the form of tar balls
- Frontend is manually installed and OSCAR is added
- Installing Compute Nodes
- Linux Utility for cluster Install (IBM)
- Distribution neutral OS installer
- Same functionality as RedHat's installer
- Only supports Red Hat
- System Imager (VA/Linux)
- Disk System Imaging
- A combination of both is used to manage your cluster
12Extreme Linux
- Started in early 1998 at the Extreme Linux Workshop
- Red Hat and NASA CESDIS jointly released a CD containing a distribution to help build Beowulf-class clusters
- This was really a collection of now-standard cluster tools like MPI and PVM
- Development halted after the release
13System Imager
- From VA/Linux (used to sell clusters)
- System imaging installation tools
- Manages the files on a compute node
- Better than managing the disk blocks
- Use
- Install a system manually
- Appoint the node as the golden master
- Clone the golden master onto other nodes
- Problems
- Doesn't support heterogeneous hardware
- No method for managing the software on the golden master
14Cfengine
- Policy-based configuration management tool for UNIX or NT hosts
- Flat ASCII (looks like a Makefile)
- Supports macros and conditionals
- Popular to manage desktops
- Patching services
- Verifying the files on the OS
- Auditing user changes to the OS
- Nodes pull their Cfengine file and run every night
- System changes on the fly
- One bad change kills everyone (in the middle of the night)
- Can help you make changes to a running cluster
15Kickstart
- RedHat
- Automates installation
- Used to install desktops
- Foundation of Rocks
- Description based installation
- Flat ASCII file
- No conditionals or macros
- Set of packages and shell scripts that run to install a node
16LCFG
- Edinburgh University
- Anderson and Scobie
- Description based installation
- Flat ASCII file
- Conditionals, macros, and statements
- Full blown (proprietary) language to describe a node
- Compose description file out of components
- Using file inclusion
- Not a graph as in Rocks
- Do not use kickstart
- Must replicate the work of RedHat
- Very interesting group
- Design goals very close to Rocks
- Implementation is also similar
17Everyone is building clusters
- Currently too many cluster distributions
- Replicated effort
- Big problems remain unsolved
- Global storage
- Job launching and control
- System Monitoring
- Users do not care
- We need portability between these efforts
- Same PBS script on any cluster
- Multiple efforts to standardize clusters
- SciDAC - DOE sponsored
- Linux HA - Open effort for HA cluster
- GGF - Cluster standards WG (forming)
18Troubleshooting
19Meteor Cluster at SDSC
- Rocks v2.2
- 2 Frontends
- 4 NFS Servers
- 100 nodes
- Compaq
- 800, 933, IA-64
- SCSI, IDA
- IBM
- 733, 1000
- SCSI
- 50 GB RAM
- Ethernet
- For management
- Myrinet 2000
20Pick good HW components
- Rack-mount gear is easy to deploy and maintain
- DIY is fun and cheap, but is time consuming
- Buy gear engineered for thermal management
- White boxes often run hot
21Software Infrastructure
- Refrain from special configuration
- Open-source software moves fast; if you customize lots of modules, you'll have to remember how to reconfigure them when you install upgrades
- Also, sometimes the configuration format changes and you need to learn the new format
22Software Infrastructure
- Leverage others' work
- Before inventing, investigate
- Once you write it, you have to maintain it
23Minimize Cables
- Before adding any network to your nodes, be sure you really want it
- Do you really want to debug a 1024-node serial console network?
- Do you really want to debug a 1024-node keyboard/video/mouse network?
24Cluster bring up
- Trunk all cables (power, Ethernet, Myrinet, etc.)
- Neatness counts
- Will help when components break
[Diagram: rack of stacked chassis]
25Cluster bring up
- Minimize cables that cross racks
- With an ethernet switch in each rack, all compute node ethernet cables are contained within the rack
- Just one uplink ethernet cable exits the rack
26Cluster bring up
- Deploy nodes in groups of 8
- Good match for switches
- Switches come in sizes 8, 16, 24, 32, etc.
27Cluster bring up
- Thoroughly test nodes before putting the cluster into production
- Test in groups of 8
- Isolate all problems
- Then test entire system
28Cluster Debugging
29Myrinet Network
[Diagram: Myrinet network - compute-0-10 gmId 1, compute-1-15 gmId 2, compute-0-16 gmId 3, compute-1-16 gmId 4]
30Myrinet Debugging
- When running an MPI job and you see an error message like "MPI id 0, gmID 2 can't find MPI id 1, gmID 4"
- Determine the suspect node from the error message
31Myrinet Debugging
- First, run a diagnostic test
- We run High-performance Linpack over Myrinet
- Linpack stresses the CPUs and sends MPI-based
messages over Myrinet
32Myrinet Debugging
Route table for this node follows
The mapper 48-bit ID was 0060dd7f9b1d
gmID  MAC Address    gmName        Route
----  -------------  ------------  ---------
   1  0060dd7f9ad4   compute-0-10  b8 b9 89
   2  0060dd7f9ad1   compute-1-15  b8 bf 86
   3  0060dd7f9b15   compute-0-16  b8 81 84
   4  0060dd7f80ea   compute-1-16  b8 b5 88
33Myrinet Debugging
- Compare gm_board_info output from a node that is operational
- If the gmID-to-gmName mapping differs between the two nodes, then one of the nodes is bad
Route table for this node follows
The mapper 48-bit ID was 0060dd7f9b1d
gmID  MAC Address    gmName        Route
----  -------------  ------------  ---------
   1  0060dd7f9ad4   compute-0-10  b8 b9 8a
   2  0060dd7f9ad1   compute-1-15  b8 bf 87
   3  0060dd7f9b15   compute-0-16  b8 81 85
   4  0060dd7f80ea   compute-1-16  b8 b5 89

Route table for this node follows
The mapper 48-bit ID was 0060dd7f9b1d
gmID  MAC Address    gmName        Route
----  -------------  ------------  ---------
   1  0060dd7f9ad4   compute-0-10  b8 b9 89
   2  0060dd7f9ad1   compute-1-15  b8 bf 86
   3  0060dd7f9b15   compute-0-16  b8 81 84
   4  0060dd7f80ea   compute-1-5   b8 b5 88
34Myrinet Debugging
- To double check, run gm_board_info on one more node
- This is the Byzantine Generals Problem
- We need to develop a program that automatically does this (a sketch follows)
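A small script along these lines can do the cross-check automatically. This is only a sketch, not part of Rocks: it assumes gm_board_info output in the format shown on the previous slides, and the ssh invocation and binary path are placeholders you would adjust for your site.

#!/usr/bin/env python
# Sketch: compare the gmID -> gmName mappings reported by several nodes and
# flag any gmID the nodes disagree about.
import re
import subprocess
import sys
from collections import defaultdict

# Matches route-table rows like: "   1  0060dd7f9ad4   compute-0-10  b8 b9 89"
ROW = re.compile(r"^\s*(\d+)\s+([0-9a-f]{12})\s+(\S+)", re.IGNORECASE)

def parse_route_table(text):
    """Return {gmID: gmName} parsed from gm_board_info output."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            table[int(m.group(1))] = m.group(3)
    return table

def collect(node):
    """Fetch gm_board_info output from a node (binary path is an assumption)."""
    out = subprocess.run(["ssh", node, "/usr/sbin/gm_board_info"],
                         capture_output=True, text=True, check=True)
    return parse_route_table(out.stdout)

def report_disagreements(tables):
    """tables: {node: {gmID: gmName}}. Print gmIDs whose name differs by node."""
    votes = defaultdict(set)
    for node, table in tables.items():
        for gmid, name in table.items():
            votes[gmid].add(name)
    for gmid, names in sorted(votes.items()):
        if len(names) > 1:
            print("gmID %d is suspect: nodes disagree on its name: %s"
                  % (gmid, ", ".join(sorted(names))))

if __name__ == "__main__":
    nodes = sys.argv[1:]   # e.g. compute-0-10 compute-1-15 compute-0-16
    report_disagreements({n: collect(n) for n in nodes})

Run it with a handful of node names; any gmID whose gmName differs between nodes points at the suspect node.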
35After You Find the Dead Node
- Replace cable (easy)
- Run diagnostic
- Try a different port on the switch
- Run diagnostic
- Replace Myrinet card
- Run diagnostic
- Call Myricom (hard)
36PBS/Maui debugging
- For large clusters (greater than 64 nodes), try increasing timeout values
- In the file /usr/spool/maui/maui.cfg
- RMPOLLINTERVAL
- RMTIMEOUT0
- Try restarting the services
- /etc/rc.d/init.d/pbs-server restart
- /etc/rc.d/init.d/maui restart
37Node Debugging
38UNIX is a prerequisite
- A single unix machine is complex
- A cluster is more complex than the sum of its parts
- Understand how a single machine works before building a cluster
- Build your own Linux desktop machine to get started
39Laws of Electronics
- If something fails, try
- Plugging it in
- Turning it on
- Cycling the power
- This will fix most of your problems
- We make mistakes like this all the time
- E.g.
- Forgetting to connect the floppy drive to the motherboard
- Forgetting to plug in the main power supply
40BIOS
- Functions
- Power on Self Test
- OS Boot
- Windows
- LILO
- Linux does not use BIOS
- Bad settings are fatal
41New Machine
- Reset the BIOS to factory defaults
- Jumper setting
- Software options
- Verify machine can boot without keyboard
- Verify virus protection is disabled
- Boot order
- CDROM
- Floppy
- Hard Disk
- Network (PXE)
42Bugs
- Cannot find kickstart file
- Frontend
- compute
43Frontend
- File format
- Make sure ks.cfg was saved as a Unix file on the floppy
- RedHat's kickstart demands that ks.cfg is a Unix-based text file
- Best way is to save the file on a Unix system
- Floppy Media
- Save the kickstart file to another floppy
- Don't recycle AOL disks
- Floppy Drive
- Replace the drive and try again
- This happened to us (took hours to debug)
44Compute Node
- Suspected crashed frontend services
- Restart the DHCP Server
- Restart the Apache (web) Server
- Restart the Database (mysql) Server
- Check network connection
- Private side network is eth0
- Public side network is eth1
- Verify CAT5e cables
- Look at the link lights
- If this is the first time the node is powered up, check that insert-ethers is running on the frontend
45Understand your network
- This is the key component
- When it fails, a cluster becomes a bunch of PCs
46OSI Network Model
- International Organization for Standardization (ISO) Open Systems Interconnection
- 7 network layers
- In every network text book
- This is not the Internet
47Internet (TCP/IP)
- Simplifies OSI model
- OSI is theory
- TCP/IP is practice
- 4 network layers
- Link
- Network
- Transport
- Application
48Link (Ethernet)
- MAC Addresses
- 6 bytes
- E.g. 00:10:b5:55:16:b5
- ARP
- IP -&gt; MAC address lookup
49Network (IP)
- Infrastructure
- Message Reassembly
- Inspected by firewall rules
50UDP Transport
- Properties
- Unreliable
- Connectionless
- Datagram (Packets)
- Clients
- NFS
- Ganglia
- DHCP
- Syslog
51TCP Transport
- Properties
- Reliable
- Connection Oriented
- Byte stream
- Clients
- SSH
- HTTP
- Kickstart file requests
- RPM downloads
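To make the two transports concrete, here is a toy Python fragment (not from the slides). Addresses and ports are arbitrary; nothing needs to be listening for the UDP send to go out, while the TCP connect only works if a server has the port open.

#!/usr/bin/env python
# UDP is connectionless datagrams, TCP is a connection-oriented byte stream.
import socket

# UDP: no connection; each sendto() is an independent datagram that may be lost.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello via UDP", ("127.0.0.1", 9999))
udp.close()

# TCP: connect() performs a handshake; sendall() writes into a reliable byte stream.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("127.0.0.1", 9998))   # fails unless something is listening
tcp.sendall(b"hello via TCP")
tcp.close()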
52tcpdump
- Poor man's network analyzer
- Pro
- Lowest level debugging
- Extremely verbose
- De facto standard
- Con
- Lowest level debugging
- Extremely verbose
- Learn to love it
- Great tool for debugging clusters
53Example
- tcpdump -i eth0
- 07:00:02.737039 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
- 07:00:02.942326 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
- 07:00:03.122296 165.0.168.192.in-addr.arpa.49206 &gt; 1.0.168.192.in-addr.arpa.domain: 41414
- 07:00:03.126042 1.0.168.192.in-addr.arpa.domain &gt; 165.0.168.192.in-addr.arpa.49206: 41414
- 07:00:03.143136 165.0.168.192.in-addr.arpa.49206 &gt; 1.0.168.192.in-addr.arpa.domain: 41126
- 07:00:03.146833 1.0.168.192.in-addr.arpa.domain &gt; 165.0.168.192.in-addr.arpa.49206: 41126
- 07:00:03.153560 165.0.168.192.in-addr.arpa.49206 &gt; 1.0.168.192.in-addr.arpa.domain: 34968
- 07:00:03.157088 1.0.168.192.in-addr.arpa.domain &gt; 165.0.168.192.in-addr.arpa.49206: 34968
54Network Verification
- Install Frontend
- Boot compute node
- Run tcpdump on the frontend for eth0
- Network correct
- Will see DHCP request from compute nodes
- Will see DHCP response from frontend
- Network incorrect
- Will see nothing
- Or will see public-side network traffic
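One way to script the verification above is to wrap tcpdump and watch only DHCP traffic. This is a sketch that assumes tcpdump is installed, the script runs as root, and eth0 is the private-side interface as in the Rocks convention.

#!/usr/bin/env python
# Watch eth0 on the frontend for DHCP traffic from booting compute nodes.
import subprocess

# bootps/bootpc (UDP 67/68) is the DHCP traffic a booting compute node generates.
cmd = ["tcpdump", "-i", "eth0", "-n", "-l", "udp", "port", "67", "or", "port", "68"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
try:
    for line in proc.stdout:
        # A correct private network shows request/reply pairs here; silence
        # (or only public-side traffic) means the cabling or interface
        # assignment is wrong.
        print(line.rstrip())
finally:
    proc.terminate()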
55Syslog
- Standard UNIX application event logger
- Multiple logging facilities
- USER
- Kernel
- LOCAL0-7 - for user applications
- Logging goes to /var/log/ on each node
- Compute node logs are also forwarded to the frontend (in /var/log/)
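For completeness, this is roughly what logging through one of the LOCAL facilities looks like from a Python application (an illustration, not something the slides provide). It assumes a syslog daemon listening on /dev/log; the application name is made up.

#!/usr/bin/env python
# Send application events to syslog on LOCAL0 so they land with everything else.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(
    address="/dev/log",
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0)  # user application facility

log = logging.getLogger("myclusterapp")   # hypothetical application name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("node integration step finished")   # shows up via syslog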
56Higher level than tcpdump
- tail -f /var/log/messages
- On the frontend
- Will show all events for the entire cluster
- Use in combination with tcpdump
57Application Optimization
58Job launching
- With many users, use a queuing system
- Or else competing jobs will slow (or halt!) other jobs
59Data storage
- Use many file servers
- E.g., NAS, NFS
- But this has its problems too
- Want to distribute load
- But, there are challenges
60Challenges with File Servers
- Gigabit connection to the file server
- Access data at 90 MB/s
- 1 TB / server
- 11 seconds to read 1 GB
- 3 hours to scan all data
- Worse if compute nodes have 100 Mb connections
- Access data at 9 MB/s -&gt; 110 seconds to read 1 GB
[Diagram: compute nodes access data on the file server]
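The numbers above follow from simple arithmetic; a quick check (assuming decimal units, 1 GB = 1000 MB, 1 TB = 1000 GB):

#!/usr/bin/env python
rate_mb_s = 90.0                     # gigabit link, ~90 MB/s of real throughput
print(1_000 / rate_mb_s)             # ~11 s to read 1 GB
print(1_000_000 / rate_mb_s / 3600)  # ~3.1 hours to scan 1 TB
print(1_000 / 9.0)                   # ~110 s to read 1 GB over 100 Mb ethernet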
61Buy a good compiler
- Your code will compile
- g77 doesn't support F90 constructs
- Can't promote real values to double precision
- Your code will execute faster
- At spec.org (CPU benchmark), not one entry used gcc or g77
- Searched over 1000 entries
- Your code will work
- In our experience, g77 has been known to give incorrect results
62Source: www.polyhedron.com/complnx.html
63Buy a good compiler
- They're cheap
- Intel Fortran compiler
- USD600 first year
- USD250 / year
- Portland Group Fortran compiler
- USD500 first year
- USD125 / year
64Know your compiler flags
65Data locality
- Write code that is cache-aware and memory-aware
- Be flexible to changing sizes
- And not just larger!
- L1 cache decreased from 16 KB in the Pentium III to 8 KB in the Pentium 4
- Deployed main memory increases by a factor of 10 every 4 years
- So plan on roughly 78% more memory / node / year (10^(1/4) is about 1.78)
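A toy illustration of cache-aware versus cache-hostile access (not from the slides; the array size is arbitrary): summing a large NumPy array along the order it is laid out in memory is much faster than striding across it.

#!/usr/bin/env python
import time
import numpy as np

a = np.zeros((4096, 4096))          # C order: rows are contiguous in memory

def time_sum(order):
    t0 = time.perf_counter()
    total = 0.0
    if order == "row":              # walk memory sequentially (cache friendly)
        for i in range(a.shape[0]):
            total += a[i, :].sum()
    else:                           # stride across rows (cache hostile)
        for j in range(a.shape[1]):
            total += a[:, j].sum()
    return time.perf_counter() - t0

print("row-major traversal:    %.3f s" % time_sum("row"))
print("column-major traversal: %.3f s" % time_sum("col"))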
66Data Locality
67Memory is cheap and fast
- Buy lots of it!
- USD0.20 / megabyte
68Software Development Trends
- In the vector days, programmers focused on enhancing code within loops
- E.g., 64 element, 64-bit vector registers
69Software Development Trends
- Now, in MPPs the bottleneck is inter-node communication
- Focus on minimizing communication
70Back to the Future?
- Commodity processors are incorporating more vector registers and instructions
- Pentium 4: 8 128-bit registers
- Playstation 2: 32 128-bit registers
- Nvidia GeForce 3: 41 128-bit registers
71Back to the Future?
- Compiler writers (and programmers) are writing vector-aware programs
- E.g., taking advantage of SSE, Altivec, 3DNow!
72Enhancing code
- Common programmer method
1. Insert timers (see the sketch below)
2. Run code
3. Find a bottleneck
4. Goto 2 until fatigued
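A minimal version of that loop in Python (an illustration only; the function names are placeholders): accumulate wall-clock time per labelled section, run, and read off the biggest contributor.

#!/usr/bin/env python
import time
from collections import defaultdict

timings = defaultdict(float)

def timed(name):
    """Decorator that accumulates wall-clock time per labelled section."""
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[name] += time.perf_counter() - t0
        return inner
    return wrap

@timed("setup")
def setup():
    time.sleep(0.1)        # stand-in for real work

@timed("solve")
def solve():
    time.sleep(0.5)        # the bottleneck in this toy example

if __name__ == "__main__":
    setup()
    solve()
    for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print("%-8s %.3f s" % (name, secs))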
73Enhancing code
- For 32-bit addressable processors (e.g., x86), write values that will be reused to memory (rather than disk)
- Since one process can only address 2 GB, create a data server process on the node (sketched below)
[Diagram: compute node running a compute process and a data server process]
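A sketch of the data-server idea in Python (illustrative only; the port, the pickled protocol, and the stored data are all made up): a second process on the node keeps reusable values in its own address space, and the compute process fetches them over a local socket instead of rereading them from disk.

#!/usr/bin/env python
import pickle
import socket
import struct
import time
from multiprocessing import Process

HOST, PORT = "127.0.0.1", 5555      # local to the compute node (arbitrary port)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def send_obj(sock, obj):
    data = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_obj(sock):
    (n,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, n))

def data_server():
    """Answers one request: the client sends a key, the server returns the value."""
    store = {"coefficients": list(range(1000))}   # stand-in for data worth reusing
    with socket.socket() as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            send_obj(conn, store.get(recv_obj(conn)))

def compute_process():
    with socket.socket() as c:
        c.connect((HOST, PORT))
        send_obj(c, "coefficients")
        print("got", len(recv_obj(c)), "values from the data server")

if __name__ == "__main__":
    server = Process(target=data_server)
    server.start()
    time.sleep(0.5)                 # crude wait for the server to start listening
    compute_process()
    server.join()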
74Program Profiling
- gprof
- Good for serial programs
- Performance API (PAPI)
- Developed at UTK
- Same people who developed Linpack, ATLAS, PVM, NetSolve, etc.
- Provides an API to access performance counters on microprocessors
- Supported CPUs: x86, IA-64, IBM Power series, Alpha, Cray
- Supported software: Linux, AIX, Tru64, Unicos
75PAPI
- Programs have been developed using PAPI
- E.g., Perfometer and DEEP/MPI
76DEEP/MPI
- Program analysis and debugging tools for MPI programs written in Fortran or C
- USD 700 - 950
77DEEP/MPI
78DEEP/MPI
79PAPI Example
- Application developer at SDSC used PAPI on two separate hardware implementations of IA-64 to examine performance differences
- Embedded PAPI function calls in the test program
80PAPI
- DEEP/MPI is just one of a half dozen applications based on PAPI
- To us it looks like one of the more interesting ones
- To explore PAPI further
- www.utk.edu/papi
81Contact Info
82How to Get More Info
- People
- Philip phil_at_sdsc.edu
- Mason mjk_at_sdsc.edu
- Greg bruno_at_sdsc.edu
- Rocks web site
- http://rocks.npaci.edu
- Discussion list
- Send email to npaci-rocks-discussion_at_npaci.edu
83Final Thoughts
84Driving in Singapore
- Fluid lanes
- Watch out for the speed cameras
- 24 points = suspended license
- Can you believe how much the COE is?
85Red Man, Stand
86My Cell Phone Before