Questions from Yesterday
Transcript and Presenter's Notes

1
Questions from Yesterday?
2
Rocks Concepts
3
Software Installation
[Diagram: a software repository (the collection of all possible software packages, AKA the distribution, delivered as RPMs) plus installation instructions (a Kickstart file of descriptive information to configure a node) are combined to build appliances: compute node, IO server, web server.]
4
Software Repository
(Same diagram as the previous slide, with the software repository highlighted.)
5
Installation Instructions
(Same diagram, with the installation instructions highlighted.)
6
NPACI Rocks
  • Software Repository
  • Red Hat derived distribution
  • Managed with rocks-dist
  • Installation Instructions
  • Based on Kickstart
  • Variables in SQL
  • Functional decomposition into XML files (a sketch follows this list)
  • 100 nodes
  • 1 graph
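
As a rough illustration of the functional decomposition idea, a node file pairs a package list with a post-install script in a small XML file, and a separate graph file wires node files together. The element names and file names below are from memory and are meant only as a sketch, not as the exact Rocks schema.

  <!-- ssh.xml: a hypothetical node file -->
  <kickstart>
    <description>Configure the SSH service</description>
    <package>openssh</package>
    <package>openssh-server</package>
    <post>
      <!-- shell commands appended to the generated kickstart %post section -->
      /sbin/chkconfig sshd on
    </post>
  </kickstart>

  <!-- fragment of the graph file: compute nodes inherit the ssh configuration -->
  <graph>
    <edge from="compute" to="ssh"/>
  </graph>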

7
Related Work
8
Real World Computing Partnership (RWCP)
  • Research group started in 1992, and based in
    Tokyo.
  • SCore software
  • Semi-automated node integration using Red Hat
  • Job launcher similar to UCB's REXEC
  • MPC++, multi-threaded C++ using templates
  • PM, wire protocol for Myrinet
  • CD-ROMs were available at SC2001

9
Scyld Beowulf
  • Single System Image
  • Global process ID
  • Not a global file system
  • Heavy OS modifications to support BProc
  • Patches kernel
  • Patches libraries (libc)
  • Current release is based on RedHat 6.2
  • Jobs start on the frontend and are pushed to
    compute nodes
  • Hooks remain on the frontend
  • Does this scale to 1000 nodes?

10
Scalable Cluster Environment (SCE)
  • Developed at Kasetsart University in Thailand
  • SCE is a software suite that includes
  • Tools to install, manage, and monitor compute
    nodes
  • Diskless (SSI)
  • Diskfull (RedHat)
  • A batch scheduler to address the difficulties in
    deploying and maintaining clusters
  • VRML based monitoring tools
  • User installs frontend with RedHat and adds SCE
    packages.
  • Rocks and SCE are starting to work together
  • Rocks is good at low level cluster software
  • SCE is good at high level cluster software

11
Open Cluster Group (OSCAR)
  • OSCAR is a collection of clustering best
    practices
  • PBS/Maui
  • OpenSSH
  • In the form of tarballs
  • Frontend is manually installed and OSCAR is added
  • Installing Compute Nodes
  • Linux Utility for cluster Install (LUI, from IBM)
  • Distribution-neutral OS installer
  • Same functionality as Red Hat's installer
  • Only supports Red Hat
  • System Imager (VA/Linux)
  • Disk System Imaging
  • A combination of both is used to manage your cluster

12
Extreme Linux
  • Started in early 1998 at the "Extreme Linux
    Workshop".
  • Red Hat and NASA CESDIS jointly released a CD
    containing a distribution to help build
    Beowulf-class clusters.
  • This was really a collection of now-standard
    cluster tools like MPI and PVM.
  • Development halted after the release.

13
System Imager
  • From VA/Linux (used to sell clusters)
  • System imaging installation tools
  • Manages the files on a compute node
  • Better than managing the disk blocks
  • Use
  • Install a system manually
  • Appoint the node as the golden master
  • Clone the golden master onto other nodes
  • Problems
  • Doesn't support heterogeneous hardware
  • No method for managing the software on the
    golden master

14
Cfengine
  • Policy-based configuration management tool for
    UNIX or NT hosts
  • Flat ASCII (looks like a Makefile; a sketch follows this list)
  • Supports macros and conditionals
  • Popular to manage desktops
  • Patching services
  • Verifying the files on the OS
  • Auditing user changes to the OS
  • Nodes pull their Cfengine file and run every
    night
  • System changes on the fly
  • One bad change kills everyone (in the middle of
    the night)
  • Can help you make changes to a running cluster
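
For a flavor of the Makefile-like format, here is a small Cfengine 2 style policy sketch. The section names and option syntax are from memory and may not match your Cfengine version exactly; treat it as an illustration only.

  control:
     actionsequence = ( files tidy )

  files:
     # enforce ownership and permissions on a system file
     /etc/motd mode=644 owner=root group=root action=fixall

  tidy:
     # remove week-old files from /tmp on every run
     /tmp pattern=* age=7 recurse=inf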

15
Kickstart
  • RedHat
  • Automates installation
  • Used to install desktops
  • Foundation of Rocks
  • Description based installation
  • Flat ASCII file
  • No conditionals or macros
  • Set of packages and shell scripts that run to
    install a node (a minimal example follows)
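
For reference, a minimal kickstart file looks roughly like the sketch below. The exact directive set varies by Red Hat release, and the disk layout, password, and package names here are placeholders.

  # minimal kickstart sketch (values are placeholders)
  install
  text
  lang en_US
  keyboard us
  timezone --utc America/Los_Angeles
  rootpw changeme
  clearpart --all
  part / --size 4096 --grow
  part swap --size 512

  %packages
  @ Base
  openssh-server

  %post
  # arbitrary shell commands run after the packages are installed
  echo "kickstarted on `date`" > /etc/motd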

16
LCFG
  • Edinburgh University
  • Anderson and Scobie
  • Description based installation
  • Flat ASCII file
  • Conditionals, macros, and statements
  • Full blown (proprietary) language to describe a
    node
  • Compose description file out of components
  • Using file inclusion
  • Not a graph as in Rocks
  • Does not use kickstart
  • Must replicate the work of RedHat
  • Very interesting group
  • Design goals very close to Rocks
  • Implementation is also similar

17
Everyone is building clusters
  • Currently too many cluster distributions
  • Replicated effort
  • Big problems remain unsolved
  • Global storage
  • Job launching and control
  • System Monitoring
  • Users do not care
  • We need portability between these efforts
  • Same PBS script on any cluster
  • Multiple efforts to standardize clusters
  • SciDAC - DOE sponsored
  • Linux-HA - Open effort for HA clusters
  • GGF - Cluster standards WG (forming)

18
Trouble Shooting
19
Meteor Cluster at SDSC
  • Rocks v2.2
  • 2 Frontends
  • 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800, 933, IA-64
  • SCSI, IDA
  • IBM
  • 733, 1000
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

20
Pick good HW components
  • Rack-mount gear is easy to deploy and maintain
  • DIY is fun and cheap, but time consuming
  • Buy gear engineered for thermal management
  • White boxes often run hot

21
Software Infrastructure
  • Refrain from special configuration
  • Open-source software moves fast; if you customize
    lots of modules, you'll have to remember how to
    reconfigure them when you install upgrades
  • Also, configuration formats sometimes change, so
    you may need to learn a new configuration format

22
Software Infrastructure
  • Leverage others' work
  • Before inventing, investigate
  • Once you write it, you have to maintain it

23
Minimize Cables
  • Before adding any network to your nodes, be sure
    you really want it
  • Do you really want to debug a 1024-node serial
    console network?
  • Do you really want to debug a 1024-node
    keyboard/video/mouse network?

24
Cluster bring up
  • Trunk all cables (power, Ethernet, Myrinet,
    etc)
  • Neatness counts
  • Will help when components break

[Diagram: a rack of eight chassis with cables trunked together.]
25
Cluster bring up
  • Minimize cables that cross racks
  • With an ethernet switch in each rack, all compute
    node ethernet cables are contained within the
    rack
  • Just one uplink ethernet cable exits the rack

26
Cluster bring up
  • Deploy nodes in groups of 8
  • Good match for switches
  • Switches come in sizes 8, 16, 24, 32, etc.

27
Cluster bring up
  • Thoroughly test nodes before putting cluster into
    production
  • Test in groups of 8
  • Isolate all problems
  • Then test entire system

28
Cluster Debugging
29
Myrinet Network
[Diagram: four nodes on the Myrinet network: compute-0-10 (gmID 1), compute-1-15 (gmID 2), compute-0-16 (gmID 3), compute-1-16 (gmID 4).]
30
Myrinet Debugging
  • When running an MPI job and you see an error
    message like "MPI id 0, gmID 2 can't find MPI id
    1, gmID 4":
  • Determine suspect node from error message

31
Myrinet Debugging
  • First, run a diagnostic test
  • We run High-performance Linpack over Myrinet
  • Linpack stresses the CPUs and sends MPI-based
    messages over Myrinet

32
Myrinet Debugging
  • Run gm_board_info

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 89
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 86
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 84
   4  00:60:dd:7f:80:ea  compute-1-16  b8 b5 88
33
Myrinet Debugging
  • Compare with gm_board_info output from a node
    that is operational
  • If the gmID-to-gmName mapping differs, then one of
    the nodes is bad

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 8a
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 87
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 85
   4  00:60:dd:7f:80:ea  compute-1-16  b8 b5 89

Route table for this node follows
The mapper 48-bit ID was 00:60:dd:7f:9b:1d

gmID  MAC Address        gmName        Route
----  -----------------  ------------  ---------
   1  00:60:dd:7f:9a:d4  compute-0-10  b8 b9 89
   2  00:60:dd:7f:9a:d1  compute-1-15  b8 bf 86
   3  00:60:dd:7f:9b:15  compute-0-16  b8 81 84
   4  00:60:dd:7f:80:ea  compute-1-5   b8 b5 88
34
Myrinet Debugging
  • To double check, run gm_board_info on one more
    node
  • This is "The Byzantine Generals Problem"
  • We need to develop a program that automatically
    does this (a rough sketch follows)
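
One possible starting point is the sketch below: gather each node's gmID-to-gmName map and flag the node whose map disagrees with the majority. It assumes password-less ssh to every compute node, gm_board_info in the default PATH, a hypothetical nodes.txt listing the node names, and that the gmID and gmName columns are fields 1 and 3 of the route table; adjust the awk fields to your gm_board_info output.

  # rough sketch only; see the assumptions above
  for node in $(cat nodes.txt); do
      sum=$(ssh $node gm_board_info | awk '/^ *[0-9]+ / {print $1, $3}' | md5sum)
      echo "$sum  $node"
  done | sort
  # nodes whose checksum differs from the majority are the suspects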

35
After You Find the Dead Node
  • Replace cable Easy
  • Run diagnostic
  • Try a different port on the switch
  • Run diagnostic
  • Replace Myrinet card
  • Run diagnostic
  • Call Myricom Hard

36
PBS/Maui debugging
  • For large clusters (greater than 64), try
    increasing timeout values
  • In the file /usr/spool/maui/maui.cfg (an
    illustrative snippet follows this list)
  • RMPOLLINTERVAL
  • RMTIMEOUT[0]
  • Try restarting the services
  • /etc/rc.d/init.d/pbs-server restart
  • /etc/rc.d/init.d/maui restart
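
An illustrative maui.cfg fragment is below. The parameter names are the ones from the slide; the values are placeholders, and the exact value syntax depends on your Maui version, so check its documentation.

  # illustrative values only
  RMPOLLINTERVAL  00:02:00
  RMTIMEOUT[0]    90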

37
Node Debugging
38
UNIX is a prerequisite
  • A single unix machine is complex
  • A cluster is more complex than the sum of its
    parts
  • Understand how a single machine works before
    building a cluster
  • Build your own Linux desktop machine to get
    started

39
Laws of Electronics
  • If something fails try
  • Plugging it in
  • Turning it on
  • Cycling the power
  • This will fix most of your problems
  • We make mistakes like this all the time
  • E.g.
  • Forgetting to connect the floppy drive to the
    motherboard
  • Forgetting to plug in the main power supply

40
BIOS
  • Functions
  • Power on Self Test
  • OS Boot
  • Windows
  • LILO
  • Linux does not use BIOS
  • Bad settings are fatal

41
New Machine
  • Reset the BIOS to factory defaults
  • Jumper setting
  • Software options
  • Verify machine can boot without keyboard
  • Verify virus protection is disabled
  • Boot order
  • CDROM
  • Floppy
  • Hard Disk
  • Network (PXE)

42
Bugs
  • Cannot find kickstart file
  • Frontend
  • Compute node

43
Frontend
  • File format
  • Make sure ks.cfg was saved as a Unix file on the
    floppy
  • Red Hat's kickstart demands that ks.cfg is a
    Unix-based text file
  • The best way is to save the file on a Unix system
  • Floppy Media
  • Save kickstart file to another floppy
  • Don't recycle AOL disks
  • Floppy Drive
  • Replace the drive and try again
  • This happened to us (took hours to debug)

44
Compute Node
  • Suspect crashed frontend services? (example
    restart commands follow this list)
  • Restart the DHCP server
  • Restart the Apache (web) server
  • Restart the database (MySQL) server
  • Check network connection
  • Private side network is eth0
  • Public side network is eth1
  • Verify CAT5e cables
  • Look at the link lights
  • If this is the first time the node is powered up,
    check that insert-ethers is running on the
    frontend
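
On a Red Hat based frontend of this era, restarting those services looks roughly like the commands below; the init script names (dhcpd, httpd, mysqld) may differ slightly depending on the release and how the database was packaged.

  /etc/rc.d/init.d/dhcpd restart
  /etc/rc.d/init.d/httpd restart
  /etc/rc.d/init.d/mysqld restart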

45
Understand your network
  • This is the key component
  • When it fails a cluster becomes a bunch of PCs

46
OSI Network Model
  • International Organization for Standardization -
    Open Systems Interconnection
  • 7 network layers
  • In every network text book
  • This is not the Internet

47
Internet (TCP/IP)
  • Simplifies OSI model
  • OSI is theory
  • TCP/IP is practice
  • 4 network layers
  • Link
  • Network
  • Transport
  • Application

48
Link (Ethernet)
  • MAC Addresses
  • 6 bytes
  • E.g. 00:10:b5:55:16:b5
  • ARP
  • IP -> MAC lookup

49
Network (IP)
  • Infrastructure
  • Message Reassembly
  • Inspected by firewall rules

50
UDP Transport
  • Properties
  • Unreliable
  • Connectionless
  • Datagram (Packets)
  • Clients
  • NFS
  • Ganglia
  • DHCP
  • Syslog

51
TCP Transport
  • Properties
  • Reliable
  • Connection Oriented
  • Byte stream
  • Clients
  • SSH
  • HTTP
  • Kickstart file requests
  • RPM downloads

52
tcpdump
  • Poor man's network analyzer
  • Pro
  • Lowest level debugging
  • Extremely verbose
  • De facto standard
  • Con
  • Lowest level debugging
  • Extremely verbose
  • Learn to love it
  • Great tool for debugging clusters

53
Example
  • tcpdump -i eth0

07:00:02.737039 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
07:00:02.942326 arp who-has 2.31.117.203.in-addr.arpa tell 31.31.117.203.in-addr.arpa
07:00:03.122296 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 41414[|domain]
07:00:03.126042 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 41414[|domain]
07:00:03.143136 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 41126[|domain]
07:00:03.146833 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 41126[|domain]
07:00:03.153560 165.0.168.192.in-addr.arpa.49206 > 1.0.168.192.in-addr.arpa.domain: 34968[|domain]
07:00:03.157088 1.0.168.192.in-addr.arpa.domain > 165.0.168.192.in-addr.arpa.49206: 34968[|domain]

54
Network Verification
  • Install Frontend
  • Boot compute node
  • Run tcpdump on the frontend for eth0 (a DHCP-only
    filter is shown after this list)
  • Network correct
  • Will see DHCP request from compute nodes
  • Will see DHCP response from frontend
  • Network incorrect
  • Will see nothing
  • Or will see public-side network traffic
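
To cut through unrelated traffic while verifying the network, you can restrict tcpdump to DHCP/BOOTP packets only:

  # show only DHCP/BOOTP traffic on the private interface
  tcpdump -i eth0 port bootps or port bootpc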

55
Syslog
  • Standard UNIX application event logger
  • Multiple logging facilities
  • USER
  • Kernel
  • LOCAL.0-9 - for user applications
  • Logging to /var/log/ on each node
  • Compute nodes also log to the frontend (in
    /var/log/)

56
Higher level than tcpdump
  • tail -f /var/log/messages
  • On the frontend
  • Will show all events for the entire cluster
  • Use in combination with tcpdump

57
Application Optimization
58
Job launching
  • With many users, use a queuing system
  • Or else competing jobs will slow (or halt!) other
    jobs

59
Data storage
  • Use many file servers
  • E.g., NAS, NFS
  • But this has its problems too
  • Want to distribute load
  • But, there are challenges

60
Challenges with File Servers
Access
  • Gigabit connect to file server
  • Access data at 90 MB/s
  • 1 TB / server
  • 11 seconds to read 1 GB
  • 3 hours to scan all data
  • Worse if compute nodes have 100 Mb/s connections
  • Access data at 9 MB/s -> 110 seconds to read 1 GB

61
Buy a good compiler
  • Your code will compile
  • G77 doesn't support F90 constructs
  • Can't promote real values to double precision
  • Your code will execute faster
  • At spec.org (CPU benchmark), not one entry used
    gcc or g77
  • Searched over 1000 entries
  • Your code will work
  • In our experience, G77 has been known to give
    incorrect results

62
Source: www.polyhedron.com/complnx.html
63
Buy a good compiler
  • They're cheap
  • Intel Fortran compiler
  • USD 600 first year
  • USD 250 / year
  • Portland Group Fortran compiler
  • USD 500 first year
  • USD 125 / year

64
Know your compiler flags
  • Also from spec.org

65
Data locality
  • Write code that is cache-aware and memory-aware
    (a blocked-loop sketch follows this list)
  • Be flexible to changing cache and memory sizes
  • And they don't just get larger!
  • L1 data cache decreased from 16 KB in the Pentium
    III to 8 KB in the Pentium 4
  • Deployed main memory increases by a factor of 10
    every 4 years
  • So plan on 78% more memory / node / year
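
As a concrete illustration of cache-aware code, the sketch below transposes a matrix in small blocks so each tile stays resident in cache; the block size BS is illustrative and should be tuned to the target cache sizes.

  /* blocked matrix transpose: a minimal cache-blocking sketch */
  #include <stdio.h>

  #define N  1024
  #define BS 32                       /* illustrative block size */

  static double a[N][N], b[N][N];

  int main(void)
  {
      int i, j, ii, jj;

      for (ii = 0; ii < N; ii += BS)
          for (jj = 0; jj < N; jj += BS)
              for (i = ii; i < ii + BS; i++)
                  for (j = jj; j < jj + BS; j++)
                      b[j][i] = a[i][j];   /* each BS x BS tile stays in cache */

      printf("b[0][0] = %f\n", b[0][0]);
      return 0;
  }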

66
Data Locality
67
Memory is cheap and fast
  • Buy lots of it!
  • USD0.20 / megabyte

68
Software Development Trends
  • In the vector days, programmers focused on
    enhancing code within loops
  • E.g., 64 element, 64-bit vector registers

69
Software Development Trends
  • Now, in MPPs the bottleneck is inter-node
    communication
  • Focus on minimizing communication

70
Back to the Future?
  • Commodity processors are incorporating more
    vector registers and instructions
  • Pentium 4: 8 128-bit registers
  • Playstation 2: 32 128-bit registers
  • Nvidia GeForce 3: 41 128-bit registers

71
Back to the Future?
  • Compiler writers (and programmers) are writing
    vector-aware programs
  • E.g., taking advantage of SSE, AltiVec, 3DNow!

72
Enhancing code
  • Common programmer method
  • Insert timers (a simple example follows this list)
  • Run code
  • Find a bottleneck
  • Goto 2 until fatigued
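
In the spirit of "insert timers", the sketch below wraps a region of interest with a simple wall-clock timer based on gettimeofday():

  #include <stdio.h>
  #include <sys/time.h>

  /* wall-clock time in seconds */
  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      long i;
      volatile double s = 0.0;
      double t0 = now();

      for (i = 0; i < 100000000L; i++)   /* the section being timed */
          s += 1.0 / (i + 1);

      printf("loop took %.3f s (s = %g)\n", now() - t0, s);
      return 0;
  }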

73
Enhancing code
  • For 32-bit addressable processors (e.g., x86),
    write values that will be reused to memory
    (rather than disk)
  • Since a single process can only address 2 GB,
    create a data server process

[Diagram: a compute node running both a compute process and a data server process.]
74
Program Profiling
  • gprof
  • Good for serial programs
  • Performance API (PAPI)
  • Developed at UTK
  • Same people who developed Linpack, ATLAS, PVM,
    NetSolve, etc.
  • Provides an API to access performance counters on
    microprocessors (a minimal example follows this
    list)
  • Supported CPUs: x86, IA-64, IBM Power series,
    Alpha, Cray
  • Supported software: Linux, AIX, Tru64, Unicos
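
For a feel of the API, the sketch below uses PAPI's high-level counter interface to measure cycles and instructions around a loop. Which preset events are available depends on the CPU, and the calls may differ across PAPI versions, so treat this as an illustration.

  #include <stdio.h>
  #include <papi.h>

  int main(void)
  {
      int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
      long long counts[2];
      double s = 0.0;
      long i;

      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
          return 1;                             /* PAPI failed to initialize */
      if (PAPI_start_counters(events, 2) != PAPI_OK)
          return 1;                             /* events not available here */

      for (i = 1; i <= 10000000L; i++)          /* region of interest */
          s += 1.0 / i;

      PAPI_stop_counters(counts, 2);
      printf("cycles = %lld, instructions = %lld (s = %g)\n",
             counts[0], counts[1], s);
      return 0;
  }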

75
PAPI
  • Programs have been developed using PAPI
  • Perfometer, DEEP/MPI,

76
DEEP/MPI
  • Program analysis and debugging tools for MPI
    programs written in Fortran or C
  • USD 700 - 950

77
DEEP/MPI
78
DEEP/MPI
79
PAPI Example
  • Application developer at SDSC used PAPI on two
    separate hardware implementations for IA-64 to
    examine performance differences
  • Embedded PAPI function calls in test program

80
PAPI
  • DEEP/MPI is just one of a half dozen applications
    based on PAPI
  • To us it looks like one of the more interesting
    ones
  • To explore PAPI further
  • www.utk.edu/papi

81
Contact Info
82
How to Get More Info
  • People
  • Philip: phil@sdsc.edu
  • Mason: mjk@sdsc.edu
  • Greg: bruno@sdsc.edu
  • Rocks web site
  • http://rocks.npaci.edu
  • Discussion list
  • Send email to npaci-rocks-discussion@npaci.edu

83
Final Thoughts
84
Driving in Singapore
  • Fluid lanes
  • Watch out for the speed cameras
  • 24 points = suspended license
  • Can you believe how much the COE is?

85
Red Man, Stand
86
My Cell Phone Before
87
(No Transcript)