Title: Computing Environment
1 Computing Environment
- The computing environment is rapidly evolving - you need to know not only the methods, but also:
- How and when to apply them,
- Which computers to use,
- What type of code to write,
- What kind of CPU time and memory your jobs will need,
- What tools (e.g., visualization software) to use to analyze the output data.
- In short, how to take maximum advantage of, and make the most effective use of, the available computing resources.
2 Definitions: Clock Cycles, Clock Speed
- A computer chip operates at discrete intervals called clock cycles. Clock speed is often measured in nanoseconds (ns) or megahertz (MHz).
- 1800 MHz = 1.8 GHz (the fastest Pentium as of this writing) -> clock period of roughly 0.56 ns (= 1 / 1.8 GHz)
- 100 MHz (Cray J90 vector processor) -> 10 ns
- It may take several clock cycles to do one multiplication.
- Memory access also takes time, not just computation.
- MHz is not the only measure of CPU speed; different CPUs of the same MHz often differ in speed.
3 Definitions: FLOPS
- Floating-point Operations Per Second
- Megaflops = million FLOPS
- Gigaflops = billion FLOPS
- Teraflops = trillion FLOPS
- A good measure of code performance - typically one add counts as one flop, and one multiplication also counts as one flop (see the timing sketch below).
- Cray J90 peak speed: 200 Mflops; most codes achieve only about 1/3 of peak.
- Cray T90 peak: 3.2 Gflops
- NEC SX-5: 8 Gflops per CPU
- Fastest workstation-class processor as of today (Alpha EV68): 2 Gflops
- See http://www.specbench.org for the latest benchmarks of processors on real-world problems. Specbench numbers are relative.
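As an illustration (not from the original slides), a rough sustained Mflops figure can be estimated by timing a simple loop; the array size and the use of the CPU_TIME intrinsic below are assumptions made only for this sketch.

! Rough Mflops estimate (illustrative sketch only; array size is arbitrary).
! Each iteration does one multiply and one add, i.e. 2 floating-point operations.
program mflops_sketch
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  real :: t1, t2, mflops
  integer :: i

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0
  c = 3.0

  call cpu_time(t1)
  do i = 1, n
     a(i) = a(i) + b(i) * c(i)   ! 2 flops per iteration
  end do
  call cpu_time(t2)

  mflops = 2.0 * real(n) / (t2 - t1) / 1.0e6
  print *, 'approximate Mflops:', mflops
end program mflops_sketch

For a real measurement the timed loop would be repeated many times, since a single pass may be shorter than the timer resolution.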
4 MIPS
- Million Instructions Per Second - also a measure of computer speed, used mostly in the old days when computer architectures were relatively simple.
5 Bandwidth
- The speed at which data flow across a network or wire
- 56K modem: 56 kbits/sec
- T1 link: 1.544 Mbits/sec
- T3 link: 45 Mbits/sec
- FDDI: 100 Mbits/sec
- Fibre Channel: 800 Mbits/sec
- 100BaseT (Fast) Ethernet: 100 Mbits/sec
- Gigabit Ethernet: 1000 Mbits/sec
- Brain system: 3 Gbits/sec
- 1 byte = 8 bits
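- A worked example (the file size is an arbitrary assumption): moving a 100-MB file means transferring 100 x 8 = 800 Mbits. Over a 56 kbit/s modem this takes about 800 / 0.056 ≈ 14,300 seconds (roughly 4 hours); over Gigabit Ethernet the same file takes about 800 / 1000 = 0.8 seconds, ignoring protocol overhead.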
6 Hardware Evolution
- Mainframe computers
- Supercomputers
- Workstations
- Microcomputers / Personal Computers
- Desktop Supercomputers
- Workstation Super Clusters
- Handheld, palmtop, calculators, et al.
7 Types of Processors
- Scalar (serial): one operation per clock cycle
- Vector: multiple operations per clock cycle, typically achieved at the loop level where the instructions are the same or similar for each loop index
- Superscalar (most of today's microprocessors): several instructions per clock cycle
8 Types of Computer Systems
- Single-processor scalar (e.g., ENIAC, IBM 704, traditional IBM PC and Mac)
- Single-processor vector (e.g., CDC 7600, Cray-1)
- Multi-processor vector (e.g., Cray XMP, Cray C90, Cray J90, NEC SX-5)
- Single-processor superscalar (e.g., Sun SPARC workstations)
- Multi-processor scalar (e.g., multi-processor Pentium PC)
- Multi-processor superscalar (e.g., DEC Alpha-based Cray T3E, RS/6000-based IBM SP-2, SGI Origin 2000)
- Clusters of the above (e.g., Linux clusters; the Earth Simulator, a cluster of multiple vector-processor nodes)
9 Memory Architectures
- Shared Memory Parallel (SMP) Systems
  - Memory can be accessed and addressed uniformly by all processors
  - Fast/expensive CPUs, memory, and networks
  - Easy to use
  - Difficult to scale to many (> 32) processors
- Distributed Memory Parallel (DMP) Systems
  - Each processor has its own memory; other processors can access it only via network communications
  - Often built from off-the-shelf components, therefore low cost
  - Harder to use; explicit user specification of communications is often needed
  - A single CPU is slow, so not suitable for inherently serial codes
  - Highly scalable - the largest current system has nearly 10K processors
10 Memory Architectures
- Multi-level memory (cache and main memory) architectures
- Cache: fast and expensive memory
- Typical L1 cache size in current-day microprocessors: 32 KB
- L2 cache size: 256 KB to 8 MB
- Main memory: a few MB to many GB
- Try to reuse the contents of cache as much as possible before they are replaced by new data or instructions (see the sketch below).
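As an illustration of cache reuse (not from the slides; the array size and block size are arbitrary assumptions), the sketch below processes a large 2-D array in blocks small enough to stay in cache, so each block is reused before it is evicted. The example performs a blocked transpose.

! Cache-blocking sketch (illustrative only; N and NB are arbitrary choices).
! Each NB x NB block of A and B is small enough to stay in cache while it
! is being used, so cache lines are reused before they are evicted.
program cache_blocking_sketch
  implicit none
  integer, parameter :: n = 2048, nb = 64
  real, allocatable :: a(:,:), b(:,:)
  integer :: i, j, ib, jb

  allocate(a(n,n), b(n,n))
  a = 1.0

  do jb = 1, n, nb                       ! loop over blocks
     do ib = 1, n, nb
        do j = jb, min(jb+nb-1, n)       ! loops within one block
           do i = ib, min(ib+nb-1, n)
              b(j,i) = a(i,j)            ! blocked transpose
           end do
        end do
     end do
  end do

  print *, b(1,1)
end program cache_blocking_sketch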
11 Vector Processing
- The most powerful CPUs or processors (e.g., Cray T90 and NEC SX-5) are vector processors, which can perform operations on a stream of data in a pipelined fashion.
- A vector here is defined as an ordered list of scalar values. For example, an array stored in memory is a vector.
- Vector systems have machine instructions (vector instructions) that fetch a vector of values from memory, operate on them, and store them back to memory.
- Basically, vector processing is a version of the Single Instruction Multiple Data (SIMD) parallel processing technique.
- Scalar processing, on the other hand, requires one instruction to act on each data value.
12 Vector Processing - Example
- DO I = 1, N
    A(I) = B(I) + C(I)
  ENDDO
- If the above code is vectorized, the following processes will take place:
  - A vector of values in B(I) will be fetched from memory.
  - A vector of values in C(I) will be fetched from memory.
  - A vector add instruction will operate on pairs of B(I) and C(I) values.
  - After a short start-up time, a stream of A(I) values will be stored back to memory, one value every clock cycle.
- If the code is not vectorized, the following scalar processes will take place:
  - (1) B(1) will be fetched from memory.
  - (2) C(1) will be fetched from memory.
  - (3) A scalar add instruction will operate on B(1) and C(1).
  - (4) A(1) will be stored back to memory.
  - Steps (1) to (4) will be repeated N times.
13 Vector Processing
- Vector processing allows a vector of values to be fed continuously to the vector processor. If the value of N is large enough to make the start-up time negligible in comparison, the vector processor can, on average, produce close to one result per clock cycle.
- If the same code is not vectorized (using the J90 as an example), then for every iteration I, e.g. I = 1, one clock cycle each is needed to fetch B(1) and C(1), about 4 clock cycles are needed to complete a floating-point add, and another clock cycle is needed to store the value A(1). Thus a minimum of 6 clock cycles is needed to produce one result (complete one iteration). We can therefore say that vectorizing this example gives a speedup of about 6 times.
- Vector processors can often chain operations such as add and multiplication together, so that both operations can be done in one clock cycle (see the loop below). This further increases the processing speed. It usually helps to have long statements inside vector loops.
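An illustrative loop of the kind that benefits from chaining (the arrays are invented for this sketch): the multiply and the add can be chained, so that once the pipelines are full roughly one A(I) result is produced per clock cycle.

      DO I = 1, N
         A(I) = B(I) * C(I) + D(I)   ! multiply and add can be chained
      ENDDO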
14 Vectorization for Vector Computers
- Characteristics of vectorizable code
  - Vectorization can only be done within a DO loop, and it must be the innermost DO loop.
  - There need to be enough iterations in the DO loop to offset the start-up time overhead.
  - Try to put more work into a vectorizable statement (by having more operations) to provide more opportunities for concurrent operation (however, the compiler may not vectorize a loop if it is too complicated).
- Vectorization inhibitors
  - Recursive data dependencies are among the most 'destructive' vectorization inhibitors, e.g., A(I) = A(I-1) + B(I)
  - Subroutine calls
  - References to external functions
  - Input/output statements
  - Assigned GOTO statements
  - Certain nested IF blocks and backward transfers within loops
- Inhibitors such as subroutine or function calls inside a loop can be removed by expanding the function or inlining the subroutine at the point of reference (see the sketch below).
- Vectorization directives: compiler directives can be manually inserted into the code to force or prevent vectorization of specific loops.
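A small sketch of inhibitor removal (the function F and its body are invented for illustration): in the first loop, the call to an external function prevents vectorization; in the second, the body of F has been inlined by hand and the loop can vectorize. The CDIR$ IVDEP line shows Cray-style directive syntax (it tells the compiler to ignore assumed vector dependencies); the directives actually available differ from compiler to compiler.

! Not vectorizable: the external function call inhibits vectorization
      DO I = 1, N
         A(I) = F(B(I), C(I))
      ENDDO

! Vectorizable: F (assumed here to compute b*b + c) has been inlined by hand
CDIR$ IVDEP
      DO I = 1, N
         A(I) = B(I) * B(I) + C(I)
      ENDDO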
15 Parallel Processing
- Parallel processing means doing multiple jobs/tasks simultaneously. Vectorization is a type of parallel processing within a processor.
- Code parallelization usually means parallel processing across many processors, either within a single compute node or across many nodes.
- One can build a parallel processing system by networking a bunch of PCs together, e.g., the Beowulf Linux cluster.
- Amdahl's Law (1967): Speedup(N) = 1 / (a + (1 - a)/N), where a is the fraction of the run time spent in the serial portion of the task and N is the number of processors. When N approaches infinity, speedup -> 1/a.
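- A worked example (the numbers are chosen only for illustration): if a = 0.05, i.e. 5% of the work is serial, then on N = 32 processors the speedup is 1 / (0.05 + 0.95/32) ≈ 12.5, and no matter how many processors are added the speedup can never exceed 1/a = 20.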
16 Issues with Parallel Computing
- Load balance / synchronization
  - Try to give an equal amount of work to each processor
  - Try to give processors that finish first more work to do (load rebalancing)
  - The goal is to keep all processors as busy as possible
- Communication / locality
  - Inter-processor communication is typically the biggest overhead on MPP platforms, because the network is slow relative to the CPU speed
  - Try to keep data access local
  - E.g., a 2nd-order finite difference requires data at 3 points, while a 4th-order finite difference requires data at 5 points (see the stencil sketch below)
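A sketch of the two stencils (the coefficients are the standard central-difference weights for a second derivative; the array names and loop bounds are invented): the wider 4th-order stencil needs two neighbours on each side, so a domain-decomposed parallel code must exchange a wider strip of boundary data with its neighbouring processors.

! 2nd-order centered difference: needs data at 3 points (I-1, I, I+1)
      DO I = 2, N-1
         D2F(I) = (F(I+1) - 2.0*F(I) + F(I-1)) / (DX*DX)
      ENDDO

! 4th-order centered difference: needs data at 5 points (I-2, ..., I+2)
      DO I = 3, N-2
         D2F(I) = (-F(I+2) + 16.0*F(I+1) - 30.0*F(I)
     &             + 16.0*F(I-1) - F(I-2)) / (12.0*DX*DX)
      ENDDO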
17 A Few Simple Rules for Writing Efficient Code
- Use multiplies instead of divides whenever possible (see the sketch after this list).
- Make the innermost loop the longest.
  - Slower loop:
      DO 10 i = 1, 1000
         DO 10 j = 1, 10
            a(i,j) = ...
   10 CONTINUE
  - Faster loop:
      DO 10 j = 1, 10
         DO 10 i = 1, 1000
            a(i,j) = ...
   10 CONTINUE
- For a short loop like DO I = 1, 3, write out the associated expressions explicitly, since the start-up cost may be very high.
- Avoid complicated logic (IFs) inside DO loops.
- Avoid subroutine and function calls inside long DO loops.
- Vectorizable code typically also runs faster on RISC-based superscalar processors.
- KISS - Keep It Simple, Stupid - principle.
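Two small sketches of the first and third rules above (the variable names are invented for illustration): a division by a loop-invariant value replaced by multiplication with its reciprocal, and a short DO I = 1, 3 loop written out explicitly.

! Multiply instead of divide: compute the reciprocal once, outside the loop
      ODX = 1.0 / DX
      DO I = 1, N-1
         DUDX(I) = (U(I+1) - U(I)) * ODX   ! instead of ... / DX
      ENDDO

! Short loop written out explicitly instead of DO I = 1, 3
      S(1) = A(1) * B(1)
      S(2) = A(2) * B(2)
      S(3) = A(3) * B(3)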
18 Transition in Computing Architectures at NCAR SCD
This chart depicts major NCAR SCD computers from the 1960s onward, along with the sustained gigaflops (billions of floating-point calculations per second) attained by the SCD machines from 1986 to the end of fiscal year 1999. Arrows at right denote the machines that will be operating at the start of FY00. The division is aiming to bring its collective computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by FY03. (Source: http://www.ucar.edu/staffnotes/9909/IBMSP.html)