1
Linux Cluster Production Readiness
  • Egan Ford
  • IBM
  • egan@us.ibm.com
  • egan@sense.net

2
Agenda
  • Production Readiness
  • Diagnostics
  • Benchmarks
  • STAB
  • Case Study
  • SCAB

3
What is Production Readiness?
  • Production readiness is a series of tests to help
    determine if a system is ready for use.
  • Production readiness falls into two categories:
  • diagnostic
  • benchmark
  • The purpose is to confirm that all hardware is
    good and identical (per class).
  • The search for consistency and predictability.

4
What are diagnostics?
  • Diagnostic tests are usually pass/fail and
    include but are not limited to:
  • simple version checks
  • OS, BIOS versions
  • inventory checks
  • Memory, CPU, etc
  • configuration checks
  • Is HT off?
  • vendor supplied diagnostics
  • DOS on a CD

5
Why benchmark?
  • Diagnostics are usually pass/fail.
  • Thresholds may be undocumented.
  • "Why?" is difficult to answer.
  • Diagnostics may be incomplete.
  • They may not test all subsystems.
  • Other issues with diagnostics:
  • False positives.
  • Inconsistent from vendor to vendor.
  • Do no real work, cannot check for accuracy.
  • Usually hardware based.
  • What about software?
  • What about the user environment?

6
Why benchmark?
  • Benchmarks can be checked for accuracy.
  • Benchmarks can stress all used subsystems.
  • Benchmarks can stress all used software.
  • Benchmarks can be measured and you can determine
    the thresholds.

7
Benchmark or diagnostics?
  • Do both.
  • All diagnostics should pass first.
  • Benchmarks will be inconsistent if diagnostics
    fail.

8
WARNING!
  • The following slides will contain the word
    statistics.
  • Statistics cannot prove anything.
  • Exercise common sense.

9
A few words on statistics
  • Statistics increases human knowledge through the
    use of empirical data.
  • There are three kinds of lies: lies, damned lies,
    and statistics. -- Benjamin Disraeli
    (1804-1881)
  • There are three kinds of lies: lies, damned lies,
    and linpack.

10
What is STAB?
  • STatistical Analysis of Benchmarks
  • A systematic way of running a series of
    increasingly complex benchmarks to find avoidable
    inconsistencies.
  • Avoidable inconsistencies may lead to performance
    problems.
  • GOAL: consistent, repeatable, accurate results.

11
What is STAB?
  • Each benchmark is run one or more times per node,
    then the best representative from each node
    (ignored for multi-node tests) is grouped together
    and analyzed as a single population.  The results
    themselves are not as interesting as the shape of
    the distribution of the results.  Empirical
    evidence for all the benchmarks in the STAB HOWTO
    suggests that they should all form a normal
    distribution.
  • A normal distribution is the classic bell curve
    that appears so frequently in statistics.  It is
    the sum of smaller, independent (may be
    unobservable), identically-distributed variables
    or random events.

12
Uniform Distribution
  • Plot below is of 20000 random dice.

13
Normal Distribution
  • Sum of 5 dice thrown 10000 times.
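  • A minimal sketch (in Python, not part of the STAB
    tools) of how these two example distributions can
    be generated:

    # Sketch: a uniform distribution from single dice and an
    # approximately normal distribution from sums of 5 dice.
    import random
    from collections import Counter

    random.seed(0)
    uniform = [random.randint(1, 6) for _ in range(20000)]   # 20000 random dice
    sums = [sum(random.randint(1, 6) for _ in range(5))
            for _ in range(10000)]                           # 5 dice thrown 10000 times

    print(sorted(Counter(uniform).items()))   # roughly flat counts
    print(sorted(Counter(sums).items()))      # bell-shaped counts centered near 17.5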

14
Normal Distribution
  • Benchmarks also have many small independent (may
    be unobservable) identically-distributed
    variables that may affect performance, e.g.
  • Competing processes
  • Context switching
  • Hardware interrupts
  • Software interrupts
  • Memory management
  • Process/Thread scheduling
  • Cosmic rays
  • The above may be unavoidable, but is in part the
    source of a normal distribution.

15
Non-normal Distribution
  • Benchmarks may also have non-identically-distributed
    observable variables that may affect
    performance, e.g.
  • Memory configuration
  • BIOS Version
  • Processor speed
  • Operating system
  • Kernel type (e.g. NUMA vs SMP vs UNI)
  • Kernel version
  • Bad memory (e.g. excessive ECCs)
  • Chipset revisions
  • Hyper-Threading or SMT
  • Non-uniform competing processes (e.g. httpd
    running on some nodes, but not others)
  • Shared library versions
  • Bad cables
  • Bad administrators
  • Users
  • The above are avoidable, and finding them is the
    purpose of the STAB HOWTO.  Avoidable
    inconsistencies may lead to multimodal or
    non-normal distributions.

16
STAB Toolkit
  • The STAB Tools are a collection of scripts to
    help run selected benchmarks and to analyze their
    results.
  • Some of the tools are specific to a particular
    benchmark.
  • Others are general and operate on the data
    collected by the specific tools.
  • Benchmark-specific tools consist of benchmark
    launch scripts, accuracy validation scripts,
    miscellaneous utilities, and analysis scripts that
    collect the data, report some basic descriptive
    statistics, and create input files to be used
    with the general STAB tools for additional analysis.

17
STAB Toolkit
  • With a goal of consistent, repeatable, accurate
    results it is best to start with as few variables
    as possible.  Start with single node benchmarks,
    e.g., STREAM.  If all machines have similar
    STREAM results, then memory can be ruled out as a
    factor with other benchmark anomalies.  Next,
    work your way up to processor and disk
    benchmarks, then two node (parallel) benchmarks,
    then multi-node (parallel) benchmarks.  After
    each more complicated benchmark, check for
    consistent, repeatable, accurate results before
    continuing.

18
The STAB Benchmarks
  • Single Node (serial) Benchmarks
  • STREAM (memory MB/s)
  • NPB Serial (uni-processor FLOP/s and memory)
  • NPB OpenMP (multi-processor FLOP/s and memory)
  • HPL MPI Shared Memory (multi-processor FLOP/s and
    memory)
  • IOzone (disk MB/s, memory, and processor)
  • Parallel Benchmarks (for MPI systems only)
  • Ping-Pong (interconnect µsec and MB/s)
  • NAS Parallel (multi-node FLOP/s, memory, and
    interconnect)
  • HPL Parallel (multi-node FLOP/s, memory, and
    interconnect)

19
Getting STAB
  • http://sense.net/egan/bench
  • bench.tgz
  • Code with source (all script)
  • bench-oss.tgz
  • OSS code (e.g. Gnuplot)
  • bench-examples.tgz
  • 1GB of collected data (all text, 186000 files)
  • stab.pdf (currently 150 pages)
  • Documentation (WIP, check back before 11/30/2005)

20
Install STAB
  • Extract bench.tgz into your home directory:
    cd
    tar zxvf bench.tgz
    tar zxvf bench-oss.tgz
    tar zxvf bench-examples.tgz
  • Add the STAB tools to PATH:
    export PATH=~/bench/bin:$PATH
  • Append to .bashrc:
    export PATH=~/bench/bin:$PATH

21
Install STAB
  • STAB requires Gnuplot 4 and it must be built a
    specific way:
    cd ~/bench/src
    tar zxvf gnuplot-4.0.0.tar.gz
    cd gnuplot-4.0.0
    ./configure --prefix=$HOME/bench --enable-thin-splines
    make
    make install

22
STAB Benchmark Tools
  • Each benchmark supported in this document
    contains an anal (short for analysis) script. 
    This script is usually run from a output
    directory, e.g.cd /bench/benchmark/output../a
    nalbenchmark nodes low high
    mean median std devbt.A.i686
    4 615.77 632.08 2.65
    627.85 632.02 8.06cg.A.i686 4
    159.78 225.08 40.87 191.05
    193.16 26.86ep.A.i686 4 11.51
    11.53 0.17 11.52 11.52
    0.01ft.A.i686 4 448.05 448.90
    0.19 448.63 448.81 0.39lu.A.i686
    4 430.60 436.59 1.39
    433.87 434.72 2.51mg.A.i686 4
    468.12 472.54 0.94 470.86
    472.12 2.00sp.A.i686 4 449.01
    449.87 0.19 449.58 449.72
    0.39
  • The anal scripts produce statistics about the
    results to help find anomalies.  The theory is
    that if you have identical nodes then you should
    be able to obtain identical results (not always
    true).  The anal scripts will also produce plot.
    files for use by dplot to graphically represent
    the distribution of the results, and by cplot to
    plot 2D correlations.
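  • The descriptive statistics above can be
    approximated from a plot file with a few lines of
    Python.  This is only a sketch of the arithmetic,
    not the actual anal script, and it assumes the
    percentage column is the spread (high-low)/low*100
    discussed on the following slides:

    # Sketch (not the anal script): basic descriptive statistics
    # from a plot file with one result per line, value in column 1.
    import statistics, sys

    values = [float(line.split()[0]) for line in open(sys.argv[1]) if line.strip()]
    low, high = min(values), max(values)
    spread = (high - low) / low * 100   # assumed definition of the % column
    print(f"nodes={len(values)} low={low:.2f} high={high:.2f} %={spread:.2f} "
          f"mean={statistics.mean(values):.2f} median={statistics.median(values):.2f} "
          f"stddev={statistics.stdev(values):.2f}")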

23
Rant: % vs. normal distribution
  • % is good?
  • % variability can tell you something about the
    data with respect to itself without knowing
    anything about the data.
  • It is non-dimensional with a range (usually
    0-100) that has meaning to anyone.
  • IOW, management understands percentages.
  • % is not good?
  • It minimizes the amount of useful empirical data.
  • It hides the truth.

24
% is not good, exhibit A
  • Clearly this is a normal distribution, but the %
    variability is 500.  This is an extreme case
    where all the possible values exist for a
    predetermined range.

25
% is not good, exhibit B
  • Low % variability can hide a skewed distribution.
    The % variability is low, only 1.27.  But the
    distribution is clearly skewed to the right.

26
% is not good, exhibit C
  • A 5.74% variability hides a bimodal
    distribution.  Bimodal distributions are clear
    indicators that there is an observable difference
    between two different sets of nodes.

27
STAB General Analysis Tools
  • dplot is for plotting distributions.
  • All the graphical output used as illustrations in
    this document up to this point was created with
    dplot.
  • dplot provides a number of options for binning
    the data and analyzing the distribution.
  • cplot is for correlating the results between two
    different sets of results.
  • E.g., does poor memory performance correlate to
    poor application performance?
  • danal's output is very similar to that of the
    custom anal scripts provided with each benchmark,
    but danal has additional output options.
  • You can safely discard any anal screen output
    because it can be recreated with danal and the
    resulting plot.benchmark file.
  • Each script will require one or more
    plot.benchmark files.
  • dplot and danal are less strict and will work
    with any file of numbers as long as the numbers
    are in the first column; subsequent columns are
    ignored.
  • cplot, however, requires the 2nd column; it is
    impossible to correlate two sets of results
    without an index.

28
dplot
  • The first argument to dplot must be the number of
    bins, auto, or whole.  auto (or a) will use the
    square root of the number of results to determine
    the bin sizes and is usually the best place to
    start.  whole (or w) should only be used if your
    results are whole numbers and if the data
    contains all possible values between low and
    high.  This is only useful for creating plots
    like the dice examples at the beginning of this
    document.
  • The second argument is the plotfile.  The
    plotfile must contain one value per line in the
    first column; subsequent columns are ignored.
    The order of the data is unimportant.
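  • A minimal sketch (Python, not the dplot script) of
    the binning rule described above, assuming "auto"
    means roughly sqrt(N) equal-width bins:

    # Sketch: bin the first column of a plot file into sqrt(N) bins
    # and print a crude text histogram.
    import math, sys

    values = sorted(float(line.split()[0]) for line in open(sys.argv[1]) if line.strip())
    bins = max(1, int(round(math.sqrt(len(values)))))   # the assumed "auto" rule
    low, high = values[0], values[-1]
    width = (high - low) / bins or 1.0

    counts = [0] * bins
    for v in values:
        counts[min(int((v - low) / width), bins - 1)] += 1   # clamp the max into the last bin

    for i, c in enumerate(counts):
        print(f"{low + i * width:10.2f} {'-' * c}")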

29
dplot a numbers.1000
30
dplot a numbers.1000 -n
31
dplot 19 numbers.1000 -n
32
dplot a plot.c.ppc64 -bi
33
dplot a plot.c.ppc64 -bi -std
34
dplot a plot.c.ppc64 text
(text-mode output: an ASCII-art histogram of plot.c.ppc64, bin counts from 0
to 108 over the value range 2023 to 2156, normalized frequencies 0.00 to 0.22)
35
GUI vs Text
36
dplot a plot.c_omp.ppc64 -n -chi
37
chi-squared and scale
38
Abusing chi-squared
findn plot.c_omp.ppc64
X2 26.75, scale 0.43, bins 21, normal distribution probability 14.30
X2 13.29, scale 0.25, bins 12, normal distribution probability 27.50
X2 24.34, scale 0.45, bins 22, normal distribution probability 27.70
X2 22.04, scale 0.41, bins 20, normal distribution probability 28.20
X2 4.65, scale 0.12, bins 6, normal distribution probability 46.00
X2 8.68, scale 0.21, bins 10, normal distribution probability 46.70
X2 16.79, scale 0.37, bins 18, normal distribution probability 46.90
X2 12.52, scale 0.29, bins 14, normal distribution probability 48.50
X2 16.77, scale 0.39, bins 19, normal distribution probability 53.90
X2 8.55, scale 0.23, bins 11, normal distribution probability 57.50
X2 12.33, scale 0.31, bins 15, normal distribution probability 58.00
X2 13.25, scale 0.33, bins 16, normal distribution probability 58.30
X2 2.84, scale 0.1, bins 5, normal distribution probability 58.40
X2 10.22, scale 0.27, bins 13, normal distribution probability 59.70
X2 6.27, scale 0.19, bins 9, normal distribution probability 61.70
X2 1.36, scale 0.08, bins 4, normal distribution probability 71.60
X2 11.28, scale 0.35, bins 17, normal distribution probability 79.20
X2 3.36, scale 0.17, bins 8, normal distribution probability 85.00
X2 2.27, scale 0.14, bins 7, normal distribution probability 89.30
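A minimal sketch (Python, not the findn script) of the idea behind the sweep
above: bin the results, compare the observed counts against a fitted normal
distribution, and report chi-squared for each bin count.  Converting
chi-squared to the probability column would additionally require the
chi-squared CDF, which is not in the Python standard library:

    # Sketch: chi-squared of binned results vs. a fitted normal,
    # swept over bin counts (plot file format as described earlier).
    import statistics, sys

    values = sorted(float(line.split()[0]) for line in open(sys.argv[1]) if line.strip())
    n = len(values)
    dist = statistics.NormalDist(statistics.mean(values), statistics.stdev(values))
    low, high = values[0], values[-1]

    for bins in range(4, 23):
        width = (high - low) / bins
        chi2 = 0.0
        for i in range(bins):
            left, right = low + i * width, low + (i + 1) * width
            observed = sum(1 for v in values
                           if left <= v < right or (i == bins - 1 and v == high))
            expected = n * (dist.cdf(right) - dist.cdf(left))
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        print(f"bins={bins:2d}  X2={chi2:6.2f}")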
39
Abusing chi-squared
40
cplot
  • cplot or correlation plot is a perl front-end to
    Gnuplot to graphically represent the correlation
    between any two sets of indexed numbers.
  • Correlation measures the relationship between two
    sets of results, e.g. processor performance and
    memory throughput.
  • Correlations are often expressed as a correlation
    coefficient: a numerical value with a range from
    -1 to 1.
  • A positive correlation would indicate that if one
    set of results increased, the other set would
    increase, e.g. better memory throughput increases
    processor performance.
  • A negative correlation would indicate that if
    one set of results increases, the other set would
    decrease, e.g. better processor performance
    decreases latency.
  • A correlation of zero would indicate that there
    is no relationship at all, IOW, they are
    independent.
  • Any two sets of results with a non-zero
    correlation are considered dependent; however, a
    check should be performed to determine if a
    dependent set of results is statistically
    significant.
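  • A minimal sketch (Python 3.10+, not the cplot
    script) of the correlation coefficient described
    above, joining two plot files on the index in
    their second column:

    # Sketch: Pearson's r between two result sets indexed by node.
    import statistics, sys

    def load(path):
        results = {}
        for line in open(path):
            cols = line.split()
            if len(cols) >= 2:
                results[cols[1]] = float(cols[0])   # value in column 1, index in column 2
        return results

    a, b = load(sys.argv[1]), load(sys.argv[2])
    common = sorted(set(a) & set(b))
    x, y = [a[k] for k in common], [b[k] for k in common]
    print(f"{len(common)} common indexes, r = {statistics.correlation(x, y):.2f}")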

41
cplot
  • A strong correlation between two sets of results
    should produce more questions, not quick answers.
  • It is possible for two unrelated results to have
    a strong correlation because they share something
    in common.
  • E.g.  You can show a positive correlation between
    the sales of skis and snowboards.  It is unlikely
    that increased ski sales increased snowboard
    sales; the most likely cause is an increase in
    the snow depth (or a decrease in temperature) at
    your local resort, i.e., something they have in
    common.  The correlation is valid, but it does
    not prove the cause of the correlation.

42
cplot plot.c.ppc64 plot.cg.B.ppc64
43
cplot plot.c.ppc64 plot.mg.B.ppc64
44
Correlation of temperature to memory performance
45
Correlation of 100 random numbers
46
Statistical Significance
47
Statistical Significance
48
Case Study
  • 484 JS20 blades
  • dual PPC970
  • 2GB RAM
  • Myrinet D
  • Full Bisection Switch
  • Cisco GigE
  • 14:1 oversubscribed

49
Diagnostics
  • Vendor supplied (passed)
  • BIOS versions (failed)
  • Inventory
  • Number of CPUs (passed)
  • Total Memory (failed)
  • OS/Kernel Versions (passed)

50
BIOS Versions (failed)
  • All nodes but node443 have BIOS dated 10/21/04.
    node443 is dated 09/02/2004.
  • Inconsistent BIOS versions can affect
    performance.

    Command output:
    # rinv compute all | tee /tmp/foo
    # cat /tmp/foo | grep BIOS | awk '{print $4}' | sort | uniq
    09/02/2004
    10/21/2004
    # cat /tmp/foo | grep BIOS | grep 09/02/2004
    node433: VPD BIOS 09/02/2004

51
Memory quantity (failed)
  • All nodes except node224 have 2GB RAM.

    Command output:
    # psh compute free | grep Mem | awk '{print $3}' | sort | uniq
    1460116
    1977204
    1977208
    # psh compute free | grep Mem | grep 1460116
    node224: Mem: 1460116 ...

52
STREAM
  • The STREAM benchmark is a simple synthetic
    benchmark program that measures sustainable
    memory bandwidth (in MB/s) and the corresponding
    computation rate for simple vector kernels.
  • STREAM C, FORTRAN, and C OMP are run 10 times on
    each node, then the best result from each node is
    used to compare consistency across nodes. Each
    result is also tested for accuracy.
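  • A minimal sketch (Python, with a hypothetical
    input format) of the "best result per node"
    selection described above:

    # Sketch: keep the best (highest MB/s) of the repeated runs per node.
    # Assumes lines of the form "node012 3 2069.02" (node, run, result) on stdin.
    import sys
    from collections import defaultdict

    best = defaultdict(float)
    for line in sys.stdin:
        node, _run, result = line.split()
        best[node] = max(best[node], float(result))

    for node, result in sorted(best.items()):
        print(f"{result:.2f} {node}")   # value first, node as the index column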

53
STREAM validation results
  • node483 failed OMP test 3 of 10 for accuracy.
    Try replacing memory, processors, and then the
    system board, in that order.

    Command output:
    # cd ~/bench/stream/output.raw
    # ../checkresults
    checking stream_c_omp.ppc64.node483.3...failed

54
STREAM consistency results
# cd ~/bench/stream/output
# ../anal
stream results
benchmark    nodes  low      high     %     mean     median   std dev
c.ppc64      484    2031.43  2147.98  5.74  2077.03  2069.02  23.20
c_omp.ppc64  484    1993.49  2124.24  6.56  2050.00  2050.51  22.86
f.ppc64      484    2007.16  2092.68  4.26  2039.20  2034.63  17.87
55
NAS Serial
  • The NAS Parallel Benchmarks (NPB) are a small set
    of programs designed to help evaluate the
    performance of parallel supercomputers. The
    benchmarks, which are derived from computational
    fluid dynamics (CFD) applications, consist of
    five kernels and three pseudo-applications.
  • The NAS Serial Benchmarks are the same as the NAS
    Parallel Benchmarks except that MPI calls have
    been taken out and they run on one processor.
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are
    run 5 times on each node, then the best result
    from each node is used to compare consistency
    across nodes. Each result is also tested for
    accuracy.

56
NAS Serial validation results
  • node483 failed a number of tests. Try replacing
    memory, processors, and then the system board, in
    that order.

    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-SER/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking cg.B.ppc64.node483.4...failed
    checking ep.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking lu.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed

57
NAS Serial consistency results
# cd ~/bench/NPB3.2/NPB3.2-SER/output
# ../anal
NPB Serial
benchmark   nodes  low      high     %      mean     median   std dev
bt.B.ppc64  484    1077.69  1099.28  2.00   1087.60  1087.67  4.67
cg.B.ppc64  484    40.93    45.30    10.68  41.94    41.38    1.31
ep.B.ppc64  484    9.88     10.07    1.92   9.96     9.96     0.04
ft.B.ppc64  484    480.87   503.33   4.67   487.07   486.23   3.71
lu.B.ppc64  484    516.88   579.25   12.07  543.08   542.88   12.46
mg.B.ppc64  484    618.16   654.23   5.84   638.31   638.85   6.76
sp.B.ppc64  484    530.48   556.67   4.94   541.01   540.77   3.99
58
(No Transcript)
59
How does memory correlate to performance?
60
(No Transcript)
61
Statistically significant?
  • Command output:

    # findc plot.* | grep plot.c.ppc64
    0.13   0.13  00  plot.bt.B.ppc64 plot.c.ppc64
    0.62   0.62  00  plot.c.ppc64 plot.c_omp.ppc64
    0.93   0.93  00  plot.c.ppc64 plot.cg.B.ppc64
    0.19   0.19  00  plot.c.ppc64 plot.ep.B.ppc64
    0.89   0.89  00  plot.c.ppc64 plot.f.ppc64
    0.17   0.17  00  plot.c.ppc64 plot.ft.B.ppc64
    0.11   0.11  02  plot.c.ppc64 plot.lu.B.ppc64
    0.50  -0.50  00  plot.c.ppc64 plot.mg.B.ppc64
    0.05  -0.05  27  plot.c.ppc64 plot.sp.B.ppc64
62
NAS OMP
  • The NAS OpenMP Benchmarks are the same as the NAS
    Parallel Benchmarks except that the MPI calls
    have been replaced with OpenMP calls to run on
    multiple processors on a shared memory system
    (SMP).
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are
    run 5 times on each node, then the best result
    from each node is used to compare consistency
    across nodes. Each result is also tested for
    accuracy.

63
NAS OMP validation results
  • node483 failed a number of tests. Try replacing
    memory, processors, and then the system board, in
    that order.

    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-OMP/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking ft.B.ppc64.node483.5...failed
    checking lu.B.ppc64.node483.1...failed
    checking lu.B.ppc64.node483.3...failed
    checking lu.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.2...failed
    checking mg.B.ppc64.node483.3...failed
    checking mg.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.5...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed
64
NAS OMP consistency results
# cd ~/bench/NPB3.2/NPB3.2-OMP/output
# ../anal
NPB OpenMP
benchmark   nodes  low      high     %      mean     median   std dev
bt.B.ppc64  484    1850.99  1898.65  2.57   1871.41  1870.45  9.25
cg.B.ppc64  484    67.31    73.30    8.90   68.96    68.44    1.49
ep.B.ppc64  484    19.69    20.36    3.40   19.88    19.88    0.09
ft.B.ppc64  484    593.39   615.77   3.77   604.74   604.61   4.06
lu.B.ppc64  484    739.30   820.71   11.01  773.09   772.05   16.76
mg.B.ppc64  484    751.40   819.38   9.05   792.03   797.10   15.26
sp.B.ppc64  484    722.73   824.39   14.07  745.99   747.33   8.51
65
(No Transcript)
66
How does memory correlate to performance?
67
(No Transcript)
68
Statistically significant?
  • Command output:

    # findc plot.* | grep plot.f.ppc64
    0.37   0.37  00  plot.bt.B.ppc64 plot.f.ppc64
    0.89   0.89  00  plot.c.ppc64 plot.f.ppc64
    0.64   0.64  00  plot.c_omp.ppc64 plot.f.ppc64
    0.77   0.77  00  plot.cg.B.ppc64 plot.f.ppc64
    0.07  -0.07  12  plot.ep.B.ppc64 plot.f.ppc64
    0.20  -0.20  00  plot.f.ppc64 plot.ft.B.ppc64
    0.29  -0.29  00  plot.f.ppc64 plot.lu.B.ppc64
    0.81  -0.81  00  plot.f.ppc64 plot.mg.B.ppc64
    0.65  -0.65  00  plot.f.ppc64 plot.sp.B.ppc64
    # findc plot.* | grep plot.c_omp.ppc64
    0.29   0.29  00  plot.bt.B.ppc64 plot.c_omp.ppc64
    0.62   0.62  00  plot.c.ppc64 plot.c_omp.ppc64
    0.54   0.54  00  plot.c_omp.ppc64 plot.cg.B.ppc64
    0.03  -0.03  51  plot.c_omp.ppc64 plot.ep.B.ppc64
    0.64   0.64  00  plot.c_omp.ppc64 plot.f.ppc64
    0.06  -0.06  19  plot.c_omp.ppc64 plot.ft.B.ppc64
    0.20  -0.20  00  plot.c_omp.ppc64 plot.lu.B.ppc64
    0.56  -0.56  00  plot.c_omp.ppc64 plot.mg.B.ppc64
    0.44  -0.44  00  plot.c_omp.ppc64 plot.sp.B.ppc64
69
HPL
  • HPL is a software package that solves a (random)
    dense linear system in double precision (64 bits)
    arithmetic on distributed-memory computers. It
    can thus be regarded as a portable as well as
    freely available implementation of the High
    Performance Computing Linpack Benchmark.
  • xhpl is run 10 times on each node, then the best
    result from each node is used to compare
    consistency across nodes. Each result is also
    tested for accuracy.
  • NOTE: nodes 215 and 224 were excluded from this
    test. node215 would not boot up. node224 only
    had 1.5GB of RAM. This test used 1.8GB of RAM.

70
HPL validation test
  • node483 failed to pass any test. Try replacing
    memory, processors, and then the system board, in
    that order.
  • Command output:

    # cd ~/bench/hpl/output.raw.single
    # ../checkresults
    checking xhpl.ppc64.node483.1...failed
    checking xhpl.ppc64.node483.10...failed
    checking xhpl.ppc64.node483.2...failed
    checking xhpl.ppc64.node483.3...failed
    checking xhpl.ppc64.node483.4...failed
    checking xhpl.ppc64.node483.5...failed
    checking xhpl.ppc64.node483.6...failed
    checking xhpl.ppc64.node483.7...failed
    checking xhpl.ppc64.node483.8...failed
    checking xhpl.ppc64.node483.9...failed
71
HPL consistency and correlation
# cd ~/bench/hpl/output
# ../anal
HPL results
benchmark   nodes  low    high   %     mean   median  std dev
xhpl.ppc64  482    11.62  12.04  3.61  11.89  11.89   0.08
72
Ping-Pong
  • Ping-Pong is a simple benchmark that measures
    latency and bandwidth for different message
    sizes.
  • Ping-Pong benchmarks should be run for each
    network (e.g. Myrinet and GigE).  First run the
    serial Ping-Pongs and then the parallel
    Ping-Pongs.  The purpose of the serial benchmarks
    is to find any single node or set of nodes that
    is not performing as well as the other nodes. The
    purpose of the parallel benchmarks is to help
    calculate bisectional bandwidth and test that
    system wide MPI jobs can be run.
  • There are four patterns, 3 deterministic and 1
    random. The purpose for all four is to help
    isolate poor performing nodes and possibly poor
    performing routes or trunks (e.g. bad uplink
    cable). 
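  • A minimal sketch (Python) of how the four pairing
    patterns could be generated from a node list; the
    exact rules are an assumption based on the pair
    names in the output on the following slides:

    # Sketch: sort, cut, fold, and shuffle pair patterns (assumed rules).
    import random

    nodes = [f"node{i:03d}" for i in range(1, 485)]
    half = len(nodes) // 2

    sort_pairs = list(zip(nodes[0::2], nodes[1::2]))              # adjacent: 001-002, 003-004, ...
    cut_pairs = list(zip(nodes[:half], nodes[half:]))             # first half with second half
    fold_pairs = list(zip(nodes[:half], reversed(nodes[half:])))  # first half with reversed second half
    shuffled = nodes[:]
    random.shuffle(shuffled)
    shuffle_pairs = list(zip(shuffled[0::2], shuffled[1::2]))     # random pairs

    print(sort_pairs[:2], cut_pairs[:2], fold_pairs[:2], shuffle_pairs[:2])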

73
Ping-Pong
  • Sorted

74
Ping-Pong
  • Cut

75
Ping-Pong
  • Fold

76
Myrinet consistency check
# cd ~/bench/PMB2.2.1/output.gm
# ../anal spp sort bw
spp sort bw results
bytes    pairs  low    high    %       mean    median  std dev
1        242    0.08   0.11    37.50   0.11    0.11    0.00
...
4194304  242    87.62  234.93  168.12  232.49  233.43  9.38

# ../anal spp cut bw
...
4194304  242    87.13  234.99  169.70  232.16  233.15  9.40

# ../anal spp fold bw
...
4194304  242    87.17  235.04  169.63  232.13  233.16  9.39

# ../anal spp shuffle bw
...
4194304  242    87.61  234.77  167.97  232.14  232.70  9.36

For the 4194304 results the mean and median are very close together and also
close to the high, indicating that only one or a few nodes have poor
performance.
77
Myrinet consistency
# head -5 plot.spp.*.bw.4194304

==> plot.spp.cut.bw.4194304 <==
87.13  node164-node406
230.95 node107-node349
231.36 node147-node389
231.41 node091-node333
231.43 node045-node287

==> plot.spp.fold.bw.4194304 <==
87.17  node079-node406
227.58 node214-node271
229.34 node010-node475
231.40 node091-node394
231.48 node177-node308

==> plot.spp.shuffle.bw.4194304 <==
87.61  node024-node406
231.47 node091-node166
231.51 node227-node003
231.55 node110-node293
231.57 node013-node231

==> plot.spp.sort.bw.4194304 <==
87.62  node405-node406
228.64 node039-node040
231.64 node231-node232
231.66 node091-node092
231.66 node481-node482
78
Bisectional Bandwidth
ppp cut bw results
bytes    pairs  low    high    %       mean    median  std dev
4194304  242    60.28  233.44  287.26  138.94  137.92  36.87

Demonstrated BW: 242 x 138.94 = 33623.48 MB/s = 32.8 GB/s (262.4 Gb/s)
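The "Demonstrated BW" line above is just the pair count multiplied by the mean
per-pair bandwidth at the largest message size; a minimal sketch (Python) of
the arithmetic:

    # Sketch: demonstrated bisection bandwidth from pairs and mean MB/s.
    pairs, mean_mb_s = 242, 138.94
    total_mb_s = pairs * mean_mb_s
    gb_s = round(total_mb_s / 1024, 1)
    print(f"{total_mb_s:.2f} MB/s = {gb_s} GB/s ({gb_s * 8:.1f} Gb/s)")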
79
IP consistency check
# cd ~/bench/PMB2.2.1/output.ip
# ../anal spp sort bw
spp sort bw results
bytes    pairs  low    high    %       mean   median  std dev
1        241    0.01   0.01    0.00    0.01   0.01    0.00
...
4194304  241    60.76  101.76  67.48   99.91  100.26  3.53

# ../anal spp cut bw
...
4194304  241    45.54  89.88   97.36   86.96  88.60   6.58

# ../anal spp fold bw
...
4194304  241    50.91  100.60  97.60   87.33  88.48   6.30

# ../anal spp shuffle bw
...
4194304  241    49.31  100.71  104.24  87.26  88.53   6.72
80
IP consistency check
  • The sorted pair output will be the easiest to
    analyze for problems since each pair is restricted
    to a single switch within each BladeCenter. The
    other tests will run across the network and may
    have higher variability.
  • Running the following command reveals that the
    pairs at the top of the list performed poorly:

    # head -5 plot.spp.sort.bw.4194304
    ==> plot.spp.sort.bw.4194304 <==
    60.76 node025-node026
    68.97 node023-node024
    79.97 node325-node326
    98.83 node067-node068
    98.85 node071-node072
    98.94 node337-node338
    98.98 node175-node176
    99.02 node031-node032
    99.11 node401-node402
    99.16 node085-node086

  • This may or may not be a problem. The uplink
    performance will be less than 60MB/s/node because
    a BC can at best provide an average of 35MB/s per
    blade (with a 4-cable trunk). Many Myrinet-based
    clusters only use GigE for management and NFS,
    both of which have greater bottlenecks elsewhere.
  • You may want to check the switch logs and
    consider reseating the switches and blades.

81
IP consistency check
Running the following command reveals that there may be an uplink problem with
nodes in BC 2, i.e. node015-node028.

# head -20 plot.spp.cut.bw.4194304 plot.spp.fold.bw.4194304 plot.spp.shuffle.bw.4194304

==> plot.spp.cut.bw.4194304 <==
45.54 node025-node268
50.47 node026-node269
54.85 node024-node267
56.27 node002-node245
57.08 node022-node265
58.50 node023-node266
62.74 node020-node263
69.37 node016-node259
69.48 node015-node258
69.56 node021-node264
69.73 node018-node261
71.06 node028-node271
71.42 node019-node262
71.45 node042-node285
72.06 node027-node270
72.31 node017-node260
84.69 node224-node465
86.40 node225-node466
87.10 node001-node244
87.54 node084-node327
82
IP consistency check
==> plot.spp.fold.bw.4194304 <==
50.91 node026-node459
51.72 node023-node462
55.32 node002-node483
58.39 node025-node460
60.24 node024-node461
65.66 node018-node467
68.09 node022-node463
68.28 node020-node465
69.96 node021-node464
70.23 node015-node470
70.27 node016-node469
70.61 node019-node466
71.12 node027-node458
71.50 node017-node468
74.35 node028-node457
84.75 node235-node252
85.02 node236-node251
85.79 node237-node250
85.94 node238-node249
87.19 node118-node367
83
IP consistency check
==> plot.spp.shuffle.bw.4194304 <==
49.31 node001-node126
49.46 node029-node026
51.25 node024-node063
56.34 node274-node025
58.14 node023-node100
68.00 node019-node248
68.67 node443-node015
68.88 node018-node228
69.29 node020-node091
69.38 node028-node240
70.68 node022-node102
70.80 node027-node106
71.63 node021-node423
71.96 node291-node017
72.52 node460-node411
72.66 node016-node040
78.61 node031-node011
83.85 node041-node050
84.82 node407-node393
85.08 node420-node399

The cut, fold, and shuffle tests run from BC to BC, and the nodes in BC 2
repeatedly show up. Consider checking the uplink cables, ports, and the BC
switch.
84
Bisectional Bandwidth
ppp cut bw results
bytes    pairs  low   high   %       mean  median  std dev
4194304  241    6.18  17.36  180.91  7.95  7.28    1.82

Demonstrated BW: 241 x 7.95 = 1915.95 MB/s = 1.87 GB/s (14.96 Gb/s)
85
NAS MPI (8 node, 2ppn)
  • The NAS Parallel Benchmarks (NPB) are a small set
    of programs designed to help evaluate the
    performance of parallel supercomputers. The
    benchmarks, which are derived from computational
    fluid dynamics (CFD) applications, consist of
    five kernels and three pseudo-applications.
  • bt.B, cg.B, ep.B, ft.B, is.B, lu.B, mg.B, and
    sp.B are run 10 times on each set of 8 unique
    nodes using 2 different node set methods: sorted
    and shuffle (a sketch of both methods follows
    this list).
  • Sorted. Sets of 8 nodes are selected from a
    sorted list and assigned adjacently, e.g.
    node001-node008, node009-node016, etc.; this is
    used to find consistency within the same set of
    nodes.
  • Shuffle. Sets of 8 nodes are selected from a
    shuffled list. Nodes are reshuffled between
    runs.
  • Both sorted and shuffle sets are run in parallel,
    i.e. all the sorted sets of 8 are run at the same
    time, then all the shuffle sets are run at the
    same time.
  • NOTE: node215 and node446 were not included in
    the shuffle and sorted tests. node215 failed to
    boot, node446 failed to start up Myrinet.
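  • A minimal sketch (Python) of how the sorted and
    shuffled 8-node sets could be built; the exact
    selection rules are an assumption:

    # Sketch: adjacent (sorted) sets of 8 vs. reshuffled sets of 8.
    import random

    nodes = [f"node{i:03d}" for i in range(1, 485) if i not in (215, 446)]  # excluded nodes

    sorted_sets = [nodes[i:i + 8] for i in range(0, len(nodes), 8)]   # node001-node008, ...

    shuffled = nodes[:]
    random.shuffle(shuffled)                                          # reshuffled between runs
    shuffle_sets = [shuffled[i:i + 8] for i in range(0, len(shuffled), 8)]

    print(sorted_sets[0], shuffle_sets[0])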

86
NAS MPI verification
Verification command output:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle

This command will find the failed results and place the names of the result
files into the file ../failed:

# ../checkresults > ../failed

This command will find the common nodes in all failed results in the file
../failed and sort them by number of occurrences (occurrences are counted by
processor, not node):

# xcommon ../failed | tail
node395 12
node440 12
node056 12
node464 12
node043 12
node429 14
node297 14
node391 20
node174 22
node483 96
87
NAS MPI Consistency check
  • Consistency check command output:

    # cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle
    # ../analm
    NPB MPI
    benchmark  runs  low      high      %       mean      median    std dev
    bt.B.16    600   9089.46  10415.15  14.58   10204.94  10217.94  143.14
    cg.B.16    600   1095.60  1685.61   53.85   1570.48   1575.38   57.70
    ep.B.16    600   155.81   160.64    3.10    158.48    158.37    0.59
    ft.B.16    600   2102.39  3232.49   53.75   3052.71   3066.45   130.37
    is.B.16    600   87.06    185.29    112.83  155.97    154.39    12.94
    lu.B.16    600   5069.36  5892.62   16.24   5529.00   5531.17   111.84
    mg.B.16    600   3265.89  3898.99   19.39   3737.80   3739.77   74.91
    sp.B.16    600   2156.46  2404.05   11.48   2340.00   2340.05   26.89
88
NAS MPI Consistency
The leading cause of variability for a stable system is switch contention.
The only way to determine what is normal is to run the same set of benchmarks
multiple times on an isolated set of stable nodes (nodes that passed single
node tests) with the rest of the switch not in use. I did not have time to run
a series of serial (one set at a time) parallel tests, but this is close:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.sort
# ../analm (nr l node001-node080)
NPB MPI
benchmark  runs  low       high      %      mean      median    std dev
bt.B.16    100   10025.30  10266.00  2.40   10129.42  10120.54  44.30
cg.B.16    100   1678.27   1787.76   6.52   1714.04   1712.43   15.39
ep.B.16    100   150.45    160.02    6.36   158.49    158.38    1.03
ft.B.16    100   3248.41   3694.40   13.73  3563.50   3575.43   81.22
is.B.16    100   159.31    168.14    5.54   163.91    164.22    1.98
lu.B.16    100   5156.19   5522.79   7.11   5346.95   5350.06   87.51
mg.B.16    100   3491.76   3685.78   5.56   3613.65   3614.44   37.25
sp.B.16    100   2259.08   2308.16   2.17   2289.66   2290.30   9.55

The above results are from the first 80 nodes run sorted. Each set of 8 nodes
was isolated to a single Myrinet line card, reducing switch contention
(however, each 2 sets of nodes did share a single line card). Also, to avoid
possible variability because of memory performance, I limited the report to
the first 80 nodes.
89
NAS MPI Distribution
90
(No Transcript)
91
NAS MPI Correlation BT BW vs. Perf
92
NAS MPI Distribution w/o node406
93
(No Transcript)
94
NAS MPI Correlation BT STREAM vs. Perf
95
NAS MPI Correlation BT STREAM vs. Perf
# CPLOTOPTS="-dy ," findc plot.* | grep plot.c.ppc64
0.09  -0.09  05   plot.c.ppc64 plot.cg.B.16
0.00   0.00  100  plot.c.ppc64 plot.ep.B.16
0.14  -0.14  00   plot.c.ppc64 plot.ft.B.16
0.22  -0.22  00   plot.c.ppc64 plot.is.B.16
0.21  -0.21  00   plot.c.ppc64 plot.lu.B.16
0.41  -0.41  00   plot.c.ppc64 plot.mg.B.16
0.42  -0.42  00   plot.c.ppc64 plot.sp.B.16
96
HPL MPI
  • HPL is a software package that solves a (random)
    dense linear system in double precision (64 bits)
    arithmetic on distributed-memory computers. It
    can thus be regarded as a portable as well as
    freely available implementation of the High
    Performance Computing Linpack Benchmark.
  • xhpl is run 10 times (15 times for sorted) on
    each set of 8 unique nodes using 2 different node
    set methods: sorted and shuffle.
  • Sorted. Sets of 8 nodes are selected from a
    sorted list and assigned adjacently, e.g.
    node001-node008, node009-node016, etc.; this is
    used to find consistency within the same set of
    nodes.
  • Shuffle. Sets of 8 nodes are selected from a
    shuffled list. Nodes are reshuffled between
    runs.
  • Both sorted and shuffle sets are run in parallel,
    i.e. all the sorted sets of 8 are run at the same
    time, then all the shuffle sets are run at the
    same time.

97
HPL MPI verification
# cd ~/bench/hpl/output.raw.shuffle

This command will find the failed results and place the names of the result
files into the file ../failed:

# ../checkresults > ../failed

This command will find the common nodes in all failed results in the file
../failed and sort them by number of occurrences (occurrences are counted by
processor, not node):

# xcommon ../failed | tail
node073 2
node121 2
node090 2
node406 2
node308 2
node276 2
node103 2
node199 2
node435 4
node483 20
98
HPL MPI consistency
# cd ~/bench/hpl/output.raw.shuffle
# ../analm
HPL results
benchmark      runs  low    high   %      mean   median  std dev
xhpl.16.15000  600   51.14  60.66  18.62  59.31  59.48   1.00
xhpl.16.30000  600   69.34  78.48  13.18  77.16  77.35   1.08
99
HPL MPI correlations
100
Summary
  • node483 has accuracy issues.
  • node406 has weak Myrinet performance.
  • BC2 has a switch or uplink issue.
  • nodes 1-84 have a different memory configuration
    that does correlate with application performance.
  • Applications at large scale may experience no
    performance anomalies.

101
What is SCAB?
  • SCalability Analysis of Benchmarks
  • The purpose of the SCAB HOWTO is to verify that
    the cluster you just built actually can do work
    at scale.  This can be accomplished by running a
    few industry accepted benchmarks.
  • The STAB/SCAB toolkit provides tools to plot
    scalability for visual analysis.
  • The STAB HOWTO should be completed first to rule
    out any inconsistencies that may appear as
    scaling issues.

102
The Benchmarks
  • PMB (Pallas MPI Benchmark)
  • NPB (NAS Parallel Benchmark)
  • HPL (High Performance Linpack)

103
PMB
  • The Pallas MPI Benchmark (PMB) provides a concise
    set of benchmarks targeted at measuring the most
    important MPI functions.
  • NOTE: Pallas has been acquired by Intel.  Intel
    has released the IMB (Intel MPI Benchmark).  The
    IMB is a minor update of the PMB.  The IMB was
    not used because it failed to execute properly
    for all MPI implementations that I tested.
  • IMPORTANT: Consistent PMB Ping-Pong results should
    be achieved before running this benchmark (STAB
    Lab).  Unresolved inconsistencies in the
    interconnect may appear as scaling issues.
  • The main purpose of this test is as a diagnostic
    to answer the following questions:
  • Are my MPI implementation's basic functions
    complete?
  • Does my MPI implementation scale?

104
PMB
  • Example plot from larger BC cluster.
  • Very impressive.  For the Sendrecv benchmark this
    cluster scales from 2 nodes to 240!  Could this
    be a non-blocking GigE configuration?  Another
    benchmark can help answer that question.

105
PMB
  • Example plot from larger BC cluster.
  • Quite revealing.  The sorted benchmark has the 4M
    message size performing at 115MB/s for all node
    counts, but shuffled it falls gradually to 10MB/s
    as the number of nodes increases.  Why?

106
PMB
  • This cluster is partitioned into 14
    nodes/BladeCenter chassis.  Each chassis has a
    GigE switch with only 4 uplinks; 3 of the 4
    uplinks are bonded together to form a single
    3Gbit uplink to a stacked SMC GigE core switch.
    Assuming no blocking within the core switch, this
    solution blocks at 14:3 (see the sketch after
    this list).
  • The Sendrecv benchmark is based on MPI_Sendrecv;
    the processes form a periodic communication
    chain. Each process sends to the right and
    receives from the left neighbor in the chain.
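  • A minimal sketch (Python) of the blocking
    arithmetic above, using the values from this
    slide:

    # Sketch: nodes per chassis vs. bonded uplinks gives the blocking ratio.
    nodes_per_chassis = 14
    bonded_uplinks = 3          # 3 of the 4 GigE uplinks bonded into one 3Gbit trunk
    print(f"blocking ratio {nodes_per_chassis}:{bonded_uplinks} "
          f"(about {nodes_per_chassis / bonded_uplinks:.1f} nodes per 1Gbit uplink)")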

107
PMB
  • Based on the previous illustration it is easy to
    see why the sorted list performed so well.  Most
    of the traffic was isolated to good performing
    local switches and the jump from chassis to
    chassis through the SMC core switch only requires
    the bandwidth of a single link (1Gb full duplex).
  • The shuffled list has small odds that its left
    neighbor (receive from) and its right neighbor
    (send to) will be on the same switch.  This was
    illustrated in the second plot.
  • Moral of the story:
  • Don't trust interconnect vendors that do not
    provide the node list.
  • Ask for sorted and shuffled benchmarks.

108
PMB Myrinet GM
109
PMB Myrinet GM
110
PMB Myrinet MX
111
PMB Myrinet MX
112
PMB IB
113
PMB IB
114
Questions w/ Answers
  • Egan Ford, IBM
  • egan@us.ibm.com
  • egan@sense.net