Title: Computing Environment
1 Computing Environment
- The computing environment is rapidly evolving - you need to know not only the methods, but also:
- How and when to apply them,
- Which computers to use,
- What type of code to write,
- What kind of CPU time and memory your jobs will need,
- What tools (e.g., visualization software) to use to analyze the output data.
- In short, how to take maximum advantage of, and make the most effective use of, the available computing resources.
2 Definitions: Clock Cycles, Clock Speed
- A computer chip operates at discrete intervals called clock cycles. Clock speed is often measured in nanoseconds (ns) or megahertz (MHz).
- 1800 MHz = 1.8 GHz (the fastest Pentium as of this writing) -> clock period of roughly 0.56 ns (= 1 / 1.8 GHz)
- 100 MHz (Cray J90 vector processor) -> 10 ns
- It may take several clock cycles to do one multiplication.
- Memory access also takes time, not just computation.
- MHz is not the only measure of CPU speed; different CPUs of the same MHz often differ in speed.
3 Definitions: FLOPS
- Floating-point Operations Per Second
- Megaflops = million FLOPS
- Gigaflops = billion FLOPS
- Teraflops = trillion FLOPS
- A good measure of code performance - typically one add counts as one flop, and one multiplication also counts as one flop (see the timing sketch below).
- Cray J90 peak speed: 200 Mflops; most codes achieve only about 1/3 of peak.
- Cray T90 peak: 3.2 Gflops
- NEC SX-5: 8 Gflops per CPU
- Fastest workstation-class processor as of today (Alpha EV68): 2 Gflops
- See http://www.specbench.org for the latest benchmarks of processors on real-world problems. Specbench numbers are relative.
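As an illustration (not from the original slides), a rough sustained Mflops figure can be estimated by timing a simple loop; the array size and the use of the CPU_TIME intrinsic below are assumptions made only for this sketch.

! Rough Mflops estimate (illustrative sketch only; array size is arbitrary).
! Each iteration does one multiply and one add, i.e. 2 floating-point operations.
program mflops_sketch
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  real :: t1, t2, mflops
  integer :: i

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0
  c = 3.0

  call cpu_time(t1)
  do i = 1, n
     a(i) = a(i) + b(i) * c(i)   ! 2 flops per iteration
  end do
  call cpu_time(t2)

  mflops = 2.0 * real(n) / (t2 - t1) / 1.0e6
  print *, 'approximate Mflops:', mflops
end program mflops_sketch

For a real measurement the timed loop would be repeated many times, since a single pass may be shorter than the timer resolution.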
4 MIPS
- Million Instructions Per Second - also a measure of computer speed, used mostly in the old days when computer architectures were relatively simple.
5 Bandwidth
- The speed at which data flow across a network or wire
- 56K modem: 56 kbits/sec
- T1 link: 1.544 Mbits/sec
- T3 link: 45 Mbits/sec
- FDDI: 100 Mbits/sec
- Fibre Channel: 800 Mbits/sec
- 100BaseT (Fast) Ethernet: 100 Mbits/sec
- Gigabit Ethernet: 1000 Mbits/sec
- Brain system: 3 Gbits/sec
- 1 byte = 8 bits
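- A worked example (the file size is an arbitrary assumption): moving a 100-MB file means transferring 100 x 8 = 800 Mbits. Over a 56 kbit/s modem this takes about 800 / 0.056 ≈ 14,300 seconds (roughly 4 hours); over Gigabit Ethernet the same file takes about 800 / 1000 = 0.8 seconds, ignoring protocol overhead.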
6 Hardware Evolution
- Mainframe computers
- Supercomputers
- Workstations
- Microcomputers / Personal Computers
- Desktop Supercomputers
- Workstation Super Clusters
- Handheld, palmtop, calculators, et al.
7 Types of Processors
- Scalar (serial): one operation per clock cycle
- Vector: multiple operations per clock cycle, typically achieved at the loop level where the instructions are the same or similar for each loop index
- Superscalar (most of today's microprocessors): several instructions per clock cycle
8 Types of Computer Systems
- Single-processor scalar (e.g., ENIAC, IBM 704, traditional IBM PC and Mac)
- Single-processor vector (e.g., CDC 7600, Cray-1)
- Multi-processor vector (e.g., Cray XMP, Cray C90, Cray J90, NEC SX-5)
- Single-processor superscalar (e.g., Sun SPARC workstations)
- Multi-processor scalar (e.g., multi-processor Pentium PC)
- Multi-processor superscalar (e.g., DEC Alpha-based Cray T3E, RS/6000-based IBM SP-2, SGI Origin 2000)
- Clusters of the above (e.g., Linux clusters; the Earth Simulator, a cluster of multiple vector-processor nodes)
9 Memory Architectures
- Shared Memory Parallel (SMP) Systems
  - Memory can be accessed and addressed uniformly by all processors
  - Fast/expensive CPUs, memory, and networks
  - Easy to use
  - Difficult to scale to many (> 32) processors
- Distributed Memory Parallel (DMP) Systems
  - Each processor has its own memory; other processors can access it only via network communications
  - Often built from off-the-shelf components, therefore low cost
  - Harder to use; explicit user specification of communications is often needed
  - A single CPU is slow, so not suitable for inherently serial codes
  - Highly scalable - the largest current system has nearly 10K processors
10 Memory Architectures
- Multi-level memory (cache and main memory) architectures
- Cache: fast and expensive memory
- Typical L1 cache size in current-day microprocessors: 32 KB
- L2 cache size: 256 KB to 8 MB
- Main memory: a few MB to many GB
- Try to reuse the contents of cache as much as possible before they are replaced by new data or instructions (see the sketch below).
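As an illustration of cache reuse (not from the slides; the array size and block size are arbitrary assumptions), the sketch below processes a large 2-D array in blocks small enough to stay in cache, so each block is reused before it is evicted. The example performs a blocked transpose.

! Cache-blocking sketch (illustrative only; N and NB are arbitrary choices).
! Each NB x NB block of A and B is small enough to stay in cache while it
! is being used, so cache lines are reused before they are evicted.
program cache_blocking_sketch
  implicit none
  integer, parameter :: n = 2048, nb = 64
  real, allocatable :: a(:,:), b(:,:)
  integer :: i, j, ib, jb

  allocate(a(n,n), b(n,n))
  a = 1.0

  do jb = 1, n, nb                       ! loop over blocks
     do ib = 1, n, nb
        do j = jb, min(jb+nb-1, n)       ! loops within one block
           do i = ib, min(ib+nb-1, n)
              b(j,i) = a(i,j)            ! blocked transpose
           end do
        end do
     end do
  end do

  print *, b(1,1)
end program cache_blocking_sketch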
11 Vector Processing
- The most powerful CPUs or processors (e.g., Cray T90 and NEC SX-5) are vector processors, which can perform operations on a stream of data in a pipelined fashion.
- A vector here is defined as an ordered list of scalar values. For example, an array stored in memory is a vector.
- Vector systems have machine instructions (vector instructions) that fetch a vector of values from memory, operate on them, and store them back to memory.
- Basically, vector processing is a version of the Single Instruction Multiple Data (SIMD) parallel processing technique.
- Scalar processing, on the other hand, requires one instruction to act on each data value.
12 Vector Processing - Example
- DO I = 1, N
    A(I) = B(I) + C(I)
  ENDDO
- If the above code is vectorized, the following processes will take place:
  - A vector of values in B(I) will be fetched from memory.
  - A vector of values in C(I) will be fetched from memory.
  - A vector add instruction will operate on pairs of B(I) and C(I) values.
  - After a short start-up time, a stream of A(I) values will be stored back to memory, one value every clock cycle.
- If the code is not vectorized, the following scalar processes will take place:
  - (1) B(1) will be fetched from memory.
  - (2) C(1) will be fetched from memory.
  - (3) A scalar add instruction will operate on B(1) and C(1).
  - (4) A(1) will be stored back to memory.
  - Steps (1) to (4) will be repeated N times.
13 Vector Processing
- Vector processing allows a vector of values to be fed continuously to the vector processor. If the value of N is large enough to make the start-up time negligible in comparison, the vector processor can, on average, produce close to one result per clock cycle.
- If the same code is not vectorized (using the J90 as an example), then for every iteration I, e.g. I = 1, one clock cycle each is needed to fetch B(1) and C(1), about 4 clock cycles are needed to complete a floating-point add, and another clock cycle is needed to store the value A(1). Thus a minimum of 6 clock cycles is needed to produce one result (complete one iteration). We can therefore say that vectorizing this example gives a speedup of about 6 times.
- Vector processors can often chain operations such as add and multiplication together, so that both operations can be done in one clock cycle (see the loop below). This further increases the processing speed. It usually helps to have long statements inside vector loops.
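An illustrative loop of the kind that benefits from chaining (the arrays are invented for this sketch): the multiply and the add can be chained, so that once the pipelines are full roughly one A(I) result is produced per clock cycle.

      DO I = 1, N
         A(I) = B(I) * C(I) + D(I)   ! multiply and add can be chained
      ENDDO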
14 Vectorization for Vector Computers
- Characteristics of vectorizable code
  - Vectorization can only be done within a DO loop, and it must be the innermost DO loop.
  - There need to be enough iterations in the DO loop to offset the start-up time overhead.
  - Try to put more work into a vectorizable statement (by having more operations) to provide more opportunities for concurrent operation (however, the compiler may not vectorize a loop if it is too complicated).
- Vectorization inhibitors
  - Recursive data dependencies are among the most 'destructive' vectorization inhibitors, e.g., A(I) = A(I-1) + B(I)
  - Subroutine calls
  - References to external functions
  - Input/output statements
  - Assigned GOTO statements
  - Certain nested IF blocks and backward transfers within loops
- Inhibitors such as subroutine or function calls inside a loop can be removed by expanding the function or inlining the subroutine at the point of reference (see the sketch below).
- Vectorization directives: compiler directives can be manually inserted into the code to force or prevent vectorization of specific loops.
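A small sketch of inhibitor removal (the function F and its body are invented for illustration): in the first loop, the call to an external function prevents vectorization; in the second, the body of F has been inlined by hand and the loop can vectorize. The CDIR$ IVDEP line shows Cray-style directive syntax (it tells the compiler to ignore assumed vector dependencies); the directives actually available differ from compiler to compiler.

! Not vectorizable: the external function call inhibits vectorization
      DO I = 1, N
         A(I) = F(B(I), C(I))
      ENDDO

! Vectorizable: F (assumed here to compute b*b + c) has been inlined by hand
CDIR$ IVDEP
      DO I = 1, N
         A(I) = B(I) * B(I) + C(I)
      ENDDO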
15 Parallel Processing
- Parallel processing means doing multiple jobs/tasks simultaneously. Vectorization is a type of parallel processing within a processor.
- Code parallelization usually means parallel processing across many processors, either within a single compute node or across many nodes.
- One can build a parallel processing system by networking a bunch of PCs together, e.g., the Beowulf Linux cluster.
- Amdahl's Law (1967): Speedup(N) = 1 / (a + (1 - a)/N), where a is the fraction of the run time spent in the serial portion of the task and N is the number of processors. When N approaches infinity, speedup -> 1/a.
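- A worked example (the numbers are chosen only for illustration): if a = 0.05, i.e. 5% of the work is serial, then on N = 32 processors the speedup is 1 / (0.05 + 0.95/32) ≈ 12.5, and no matter how many processors are added the speedup can never exceed 1/a = 20.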
16 Issues with Parallel Computing
- Load balance / synchronization
  - Try to give an equal amount of work to each processor
  - Try to give processors that finish first more work to do (load rebalancing)
  - The goal is to keep all processors as busy as possible
- Communication / locality
  - Inter-processor communication is typically the biggest overhead on MPP platforms, because the network is slow relative to the CPU speed
  - Try to keep data access local
  - E.g., a 2nd-order finite difference requires data at 3 points, while a 4th-order finite difference requires data at 5 points (see the stencil sketch below)
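A sketch of the two stencils (the coefficients are the standard central-difference weights for a second derivative; the array names and loop bounds are invented): the wider 4th-order stencil needs two neighbours on each side, so a domain-decomposed parallel code must exchange a wider strip of boundary data with its neighbouring processors.

! 2nd-order centered difference: needs data at 3 points (I-1, I, I+1)
      DO I = 2, N-1
         D2F(I) = (F(I+1) - 2.0*F(I) + F(I-1)) / (DX*DX)
      ENDDO

! 4th-order centered difference: needs data at 5 points (I-2, ..., I+2)
      DO I = 3, N-2
         D2F(I) = (-F(I+2) + 16.0*F(I+1) - 30.0*F(I)
     &             + 16.0*F(I-1) - F(I-2)) / (12.0*DX*DX)
      ENDDO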
17 A Few Simple Rules for Writing Efficient Code
- Use multiplies instead of divides whenever possible (see the sketch after this list).
- Make the innermost loop the longest.
  - Slower loop:
      DO 10 i = 1, 1000
         DO 10 j = 1, 10
            a(i,j) = ...
   10 CONTINUE
  - Faster loop:
      DO 10 j = 1, 10
         DO 10 i = 1, 1000
            a(i,j) = ...
   10 CONTINUE
- For a short loop like DO I = 1, 3, write out the associated expressions explicitly, since the start-up cost may be very high.
- Avoid complicated logic (IFs) inside DO loops.
- Avoid subroutine and function calls inside long DO loops.
- Vectorizable code typically also runs faster on RISC-based superscalar processors.
- KISS - Keep It Simple, Stupid - principle.
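Two small sketches of the first and third rules above (the variable names are invented for illustration): a division by a loop-invariant value replaced by multiplication with its reciprocal, and a short DO I = 1, 3 loop written out explicitly.

! Multiply instead of divide: compute the reciprocal once, outside the loop
      ODX = 1.0 / DX
      DO I = 1, N-1
         DUDX(I) = (U(I+1) - U(I)) * ODX   ! instead of ... / DX
      ENDDO

! Short loop written out explicitly instead of DO I = 1, 3
      S(1) = A(1) * B(1)
      S(2) = A(2) * B(2)
      S(3) = A(3) * B(3)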
18 Transition in Computing Architectures at NCAR SCD
This chart depicts major NCAR SCD computers from the 1960s onward, along with the sustained gigaflops (billions of floating-point calculations per second) attained by the SCD machines from 1986 to the end of fiscal year 1999. Arrows at right denote the machines that will be operating at the start of FY00. The division is aiming to bring its collective computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by FY03. (Source: http://www.ucar.edu/staffnotes/9909/IBMSP.html)