Title: Computing Environment

1
Computing Environment
  • The computing environment is rapidly evolving - you
    need to know not only the methods, but also
  • How and when to apply them,
  • Which computers to use,
  • What type of code to write,
  • How much CPU time and memory your jobs will
    need,
  • What tools (e.g., visualization software) to use
    to analyze the output data.
  • In short, how to take maximum advantage of, and
    make the most effective use of, the available
    computing resources.

2
Definitions: Clock Cycles, Clock Speed
  • A computer chip operates at discrete intervals
    called clock cycles, often measured in nanoseconds
    (ns, the cycle time) or megahertz (MHz, the
    frequency).
  • 1800 MHz = 1.8 GHz (the fastest Pentium as of
    today) -> a clock period of about 0.56 ns
  • 100 MHz (Cray J90 vector processor) -> 10 ns
  • One multiplication may take several clock cycles
  • Memory access also takes time, not just
    computation
  • MHz is not the only measure of CPU speed;
    different CPUs running at the same MHz often differ
    in speed.
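  • As a quick check of the numbers above: cycle time
    in ns = 1000 / clock frequency in MHz, so 1800 MHz
    gives 1000/1800, about 0.56 ns, while 100 MHz gives
    exactly 10 ns.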

3
Definitions: FLOPS
  • FLOPS = Floating-point Operations Per Second
  • Megaflops = million FLOPS
  • Gigaflops = billion FLOPS
  • Teraflops = trillion FLOPS
  • A good measure of code performance; typically
    one add counts as one flop, and one multiplication
    also counts as one flop
  • Cray J90 peak speed = 200 Mflops; most codes
    achieve only about 1/3 of peak
  • Cray T90 peak = 3.2 Gflops
  • NEC SX-5 = 8 Gflops per CPU
  • Fastest workstation-class processor as of today
    (Alpha EV68) = 2 Gflops
  • See http://www.specbench.org for the latest
    benchmarks of processors on real-world problems
    (Specbench numbers are relative).
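  • One can estimate the Mflops a code achieves by
    timing a loop with a known operation count. A
    minimal Fortran sketch (the array size and the use
    of the standard CPU_TIME routine are illustrative
    choices, not from the slides):

      program measure_flops
        implicit none
        integer, parameter :: n = 10000000
        real, allocatable :: a(:), b(:), c(:)
        real :: t1, t2
        allocate(a(n), b(n), c(n))
        b = 1.0; c = 2.0
        call cpu_time(t1)
        a = b*c + b          ! one multiply + one add per element = 2*n flops
        call cpu_time(t2)
        print *, 'Approximate Mflops =', 2.0*n/(t2 - t1)/1.0e6
        print *, a(n)        ! use the result so it is not optimized away
      end program measure_flops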

4
MIPS
  • Million Instructions Per Second; also a measure
    of computer speed, used mostly in the old days when
    computer architectures were relatively simple

5
Bandwidth
  • The speed at which data flows across a network or
    wire
  • 56K modem = 56 kilobits / sec
  • T1 link = 1.544 Mbits / sec
  • T3 link = 45 Mbits / sec
  • FDDI = 100 Mbits / sec
  • Fibre Channel = 800 Mbits / sec
  • 100BaseT (Fast) Ethernet = 100 Mbits / sec
  • Gigabit Ethernet = 1000 Mbits / sec
  • Brain system = 3 Gbits / sec
  • 1 byte = 8 bits
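  • As a worked example: since 1 byte = 8 bits, moving
    a 100-megabyte file (an illustrative size) over 100
    Mbits/sec Fast Ethernet takes at least
    100 x 8 / 100 = 8 seconds, and roughly 1 second
    over Gigabit Ethernet, ignoring protocol overhead.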

6
Hardware Evolution
  • Mainframe computers
  • Supercomputers
  • Workstations
  • Microcomputers / Personal Computers
  • Desktop Supercomputers
  • Workstation Super Clusters
  • Handhelds, palmtops, calculators, etc.

7
Types of Processors
  • Scalar (serial) - one operation per clock cycle
  • Vector - multiple operations per clock cycle,
    typically achieved at the loop level, where the
    instructions are the same or similar for each loop
    index
  • Superscalar (most of today's microprocessors) -
    several instructions per clock cycle

8
Types of Computer Systems
  • Single processor scalar (e.g., ENIAC, IBM 704,
    traditional IBM PC and Mac)
  • Single processor vector (CDC 7600, Cray-1)
  • Multi-processor vector (e.g., Cray X-MP, Cray C90,
    Cray J90, NEC SX-5)
  • Single processor superscalar (Sun SPARC
    workstations)
  • Multi-processor scalar (e.g., multi-processor
    Pentium PC)
  • Multi-processor superscalar (e.g., DEC Alpha
    based Cray T3E, RS/6000 based IBM SP-2, SGI
    Origin 2000)
  • Clusters of the above (e.g., Linux clusters; the
    Earth Simulator, a cluster of multiple vector
    processor nodes)

9
Memory Architectures
  • Shared Memory Parallel (SMP) Systems
  • Memory can be accessed and addressed uniformly by
    all processors
  • Fast/expensive CPUs, memory, and networks
  • Easy to use
  • Difficult to scale to many (> 32) processors
  • Distributed Memory Parallel (DMP) Systems
  • Each processor has its own memory; others can
    access its memory only via network communications
  • Often built from off-the-shelf components,
    therefore low cost
  • Hard to use; explicit user specification of
    communications is often needed (see the sketch
    below)
  • A single CPU is slow; not suitable for inherently
    serial codes
  • High scalability - the largest current system has
    nearly 10K processors
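  • To see what "explicit user specification of
    communications" means on a DMP system, here is a
    minimal message-passing sketch in Fortran. MPI is
    not named on this slide; it is assumed here as the
    standard message-passing library for such systems:

      program dmp_example
        implicit none
        include 'mpif.h'
        integer :: ierr, rank, val
        integer :: status(MPI_STATUS_SIZE)
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        if (rank == 0) then
          val = 42
          ! processor 0 must explicitly send its data ...
          call MPI_SEND(val, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
        else if (rank == 1) then
          ! ... and processor 1 must explicitly receive it
          call MPI_RECV(val, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, &
                        status, ierr)
          print *, 'rank 1 received', val
        end if
        call MPI_FINALIZE(ierr)
      end program dmp_example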

10
Memory Architectures
  • Multi-level memory (cache plus main memory)
    architectures
  • Cache = fast and expensive memory
  • Typical L1 cache size in current-day
    microprocessors: 32 KB
  • L2 size: 256 KB to 8 MB
  • Main memory: a few MB to many GB
  • Try to reuse the contents of the cache as much as
    possible before they are replaced by new data or
    instructions (see the blocking sketch below)
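  • One common way to achieve such reuse is loop
    blocking (tiling). A minimal Fortran sketch of a
    blocked array transpose; the array and block sizes
    are illustrative assumptions to be tuned for each
    machine:

      program blocked_transpose
        implicit none
        integer, parameter :: n = 512, blk = 64   ! blk is a tuning assumption
        real :: a(n,n), b(n,n)
        integer :: i, j, ib, jb
        call random_number(b)
        do jb = 1, n, blk
          do ib = 1, n, blk
            ! each blk x blk tile of b stays in cache while it is reused
            do j = jb, min(jb+blk-1, n)
              do i = ib, min(ib+blk-1, n)
                a(i,j) = b(j,i)
              end do
            end do
          end do
        end do
        print *, a(1,1), a(n,n)   ! use the result
      end program blocked_transpose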

11
Vector Processing
  • The most powerful CPUs (e.g., those of the Cray
    T90 and NEC SX-5) are vector processors that can
    perform operations on a stream of data in a
    pipelined fashion.
  • A vector here is defined as an ordered list of
    scalar values. For example, an array stored in
    memory is a vector.
  • Vector systems have machine instructions (vector
    instructions) that fetch a vector of values from
    memory, operate on them and store them back to
    memory.
  • Basically, vector processing is a version of the
    Single Instruction Multiple Data (SIMD) parallel
    processing technique.
  • On the other hand, scalar processing requires one
    instruction to act on each data value.

12
Vector Processing - Example
  • DO I = 1, N
      A(I) = B(I) + C(I)
    ENDDO
  • If the above code is vectorized, the following
    processes will take place:
  • A vector of values in B(I) will be fetched from
    memory.
  • A vector of values in C(I) will be fetched from
    memory.
  • A vector add instruction will operate on pairs of
    B(I) and C(I) values.
  • After a short start-up time, a stream of A(I)
    values will be stored back to memory, one value
    every clock cycle.
  • If the code is not vectorized, the following
    scalar processes will take place:
  • (1) B(1) will be fetched from memory.
  • (2) C(1) will be fetched from memory.
  • (3) A scalar add instruction will operate on B(1)
    and C(1).
  • (4) A(1) will be stored back to memory.
  • Steps (1) to (4) will be repeated N times.

13
Vector Processing
  • Vector processing allows a vector of values to be
    fed continuously to the vector processor. If the
    value of N is large enough to make the start-up
    time negligible in comparison, on the average the
    vector processor is capable of producing close to
    one result per clock cycle.
  • If the same code is not vectorized (using the J90
    as an example), then for each iteration, e.g.
    I = 1, one clock cycle each is needed to fetch B(1)
    and C(1), about 4 clock cycles are needed to
    complete a floating-point add operation, and
    another clock cycle is needed to store the value
    A(1). Thus a minimum of 6 clock cycles is needed to
    produce one result (complete one iteration). We can
    say that there is a speed-up of about 6 times for
    this example if the code is vectorized.
  • Vector processors can often chain operations such
    as add and multiplication together, so that both
    operations can be done in one clock cycle. This
    further increases the processing speed. It
    usually helps to have long statements inside
    vector loops (see the sketch below).
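  • As an illustration of chaining, a loop of the
    following form lets the multiply and add pipelines
    work simultaneously (the array names are
    illustrative):

      ! With chaining, each product B(I)*C(I) is fed
      ! directly into the add pipeline, so the loop can
      ! still approach one A(I) result per clock cycle.
      DO I = 1, N
        A(I) = B(I)*C(I) + D(I)
      ENDDO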

14
Vectorization for Vector Computers
  • Characteristics of Vectorizable Code
  • Vectorization can only be done within a DO loop,
    and it must be the innermost DO loop.
  • There need to be sufficient iterations in the DO
    loop to offset the start-up time overhead.
  • Try to put more work into a vectorizable
    statement (by having more operations) to provide
    more opportunities for concurrent operation
    (However, the compiler may not vectorize a loop
    if it is too complicated).
  • Vectorization Inhibitors
  • Recursive data dependencies are one of the most
    'destructive' vectorization inhibitors, e.g.,
    A(I) = A(I-1) + B(I) (see the sketch after this
    list)
  • Subroutine calls
  • References to external functions
  • Input/output statements
  • Assigned GOTO statements
  • Certain nested IF blocks and backward transfers
    within loops
  • Inhibitors such as subroutine or function calls
    inside a loop can be removed by expanding the
    function or inlining the subroutine at the point of
    reference.
  • Vectorization directives - compiler directives can
    be manually inserted into the code to force or
    prevent vectorization of specific loops
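  • A minimal sketch contrasting the recursive
    dependency above with a loop the compiler can
    vectorize:

      ! Not vectorizable: each iteration needs the
      ! result of the previous one.
      DO I = 2, N
        A(I) = A(I-1) + B(I)
      ENDDO

      ! Vectorizable: all iterations are independent.
      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO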

15
Parallel Processing
  • Parallel processing means doing multiple
    jobs/tasks simultaneously. Vectorization is a
    type of parallel processing within a processor.
  • Code parallelization usually means parallel
    processing across many processors, whether within
    a single compute node or across many nodes.
  • One can build a parallel processing system by
    networking a bunch of PCs together, e.g., the
    Beowulf Linux cluster.
  • Amdahl's Law (1967):
    speedup = 1 / ( a + (1 - a)/N )
    where a is the fraction of the total time needed
    for the serial portion of the task and N is the
    number of processors. When N approaches infinity,
    speedup approaches 1/a.
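  • As a worked example, if the serial fraction is
    a = 0.05, then with N = 1000 processors the speedup
    is 1 / (0.05 + 0.95/1000), or about 19.6, and it
    can never exceed 1/0.05 = 20 no matter how many
    processors are added.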

16
Issues with Parallel Computing
  • Load-balance / Synchronization
  • Try to give equal amount of workload to each
    processor
  • Try to give processors that finish first more
    work to do (load rebalance)
  • The goal is to keep all processors as busy as
    possible
  • Communication / Locality
  • Inter-processor communications are typically the
    biggest overhead on MPP platforms, because the
    network is slow relative to CPU speed
  • Try to keep data access local
  • E.g., a 2nd-order finite difference uses data at 3
    points (i-1, i, i+1), while a 4th-order finite
    difference uses data at 5 points (i-2 through
    i+2), so higher-order stencils require more
    neighbor data to be communicated (see the sketch
    below)
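  • A sketch of the locality point in Fortran: when
    the domain is split among processors, interior
    points can be updated from local data, but points
    at the ends of each subdomain need halo values from
    neighboring processors (one halo point per side for
    the 2nd-order case, two for 4th-order):

      ! 2nd-order centered difference on a local subdomain
      DO I = 2, N-1
        DFDX(I) = ( F(I+1) - F(I-1) ) / (2.0*DX)
      ENDDO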
17
A Few Simple Rules for Writing Efficient Code
  • Use multiplies instead of divides whenever
    possible (see the sketch after this list)
  • Make the innermost loop the longest
  • Slower loops:
      DO 100 i = 1, 1000
        DO 10 j = 1, 10
          a(i,j) = ...
10      CONTINUE
100   CONTINUE
  • Faster loops:
      DO 100 j = 1, 10
        DO 10 i = 1, 1000
          a(i,j) = ...
10      CONTINUE
100   CONTINUE
  • For a short loop like DO I = 1, 3, write out the
    associated expressions explicitly, since the
    start-up cost may be very high
  • Avoid complicated logic (IFs) inside DO loops
  • Avoid subroutine and function calls inside long
    DO loops
  • Vectorizable code typically also runs faster on
    RISC-based superscalar processors
  • KISS principle - Keep It Simple, Stupid
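  • A minimal sketch of the multiply-for-divide and
    short-loop rules above (the variable names are
    illustrative):

      ! Multiply instead of divide: compute the
      ! reciprocal once instead of dividing N times.
      rdx = 1.0 / dx
      DO I = 1, N
        A(I) = B(I) * rdx    ! instead of A(I) = B(I) / dx
      ENDDO

      ! A short DO I = 1, 3 loop written out explicitly
      ! to avoid loop start-up overhead.
      S(1) = X(1)*Y(1)
      S(2) = X(2)*Y(2)
      S(3) = X(3)*Y(3)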

18
Transition in Computing Architectures at NCAR SCD
This chart depicts major NCAR SCD computers from
the 1960s onward, along with the sustained
gigaflops (billions of floating-point
calculations per second) attained by the SCD
machines from 1986 to the end of fiscal year
1999. Arrows at right denote the machines that
will be operating at the start of FY00. The
division is aiming to bring its collective
computing power to 100 Gflops by the end of FY00,
200 Gflops in FY01, and 1 teraflop by FY03. (Source:
http://www.ucar.edu/staffnotes/9909/IBMSP.html)