High Performance Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: High Performance Computing


1
High Performance Computing
  • R&D
  • rnd@ciit.net.pk
  • http://rnd.ciit.net.pk

2
Objectives
  • At the end of this talk you will:
  • Understand HPC concepts
  • Be able to describe various HPC paradigms
  • Distinguish between HPC programming technologies
  • Understand the physical cluster architecture
  • Understand how applications are built
  • Be informed about the CIIT Cluster resources.

3
Outline
  • Introduction
  • Why we need powerful computers
  • Why powerful computers are parallel
  • Parallel computers, yesterday and today
  • Issues in parallel performance
  • What are Clusters
  • Parallel applications programming APIs
  • CIIT Computation resources
  • Application Areas
  • Q/A

4
Why do we need powerful computers?
5
Simulation: The Third Pillar of Science
  • Traditional scientific and engineering paradigm:
  • Do theory or paper design.
  • Perform experiments or build the system.
  • Limitations:
  • Too difficult -- build large wind tunnels.
  • Too expensive -- build a throw-away passenger jet.
  • Too slow -- wait for climate or galactic evolution.
  • Too dangerous -- weapons, drug design, climate experiments.
  • Computational science paradigm:
  • Use high performance computer systems to simulate the phenomenon.
  • Based on known physical laws and efficient numerical methods.

6
Some Challenging Computations
  • Science
  • Global climate modeling
  • Astrophysical modeling
  • Biology: genomics, protein folding, drug design
  • Computational Chemistry
  • Computational Material Sciences and Nanosciences
  • Engineering
  • Crash simulation
  • Semiconductor design
  • Earthquake and structural modeling
  • Computational fluid dynamics (airplane design)
  • Combustion (engine design)
  • Business
  • Financial and economic modeling
  • Transaction processing, web services and search
    engines
  • Defense
  • Nuclear weapons -- test by simulation
  • Cryptography

7
Units of Measure in HPC
  • High Performance Computing (HPC) units are:
  • Flop/s: floating point operations per second
  • Typical sizes are millions, billions, trillions
  • Mega: Mflop/s = 10^6 flop/sec   Mbyte = 10^6 bytes
  • Giga: Gflop/s = 10^9 flop/sec   Gbyte = 10^9 bytes
  • Tera: Tflop/s = 10^12 flop/sec  Tbyte = 10^12 bytes
  • Peta: Pflop/s = 10^15 flop/sec  Pbyte = 10^15 bytes
  • Exa:  Eflop/s = 10^18 flop/sec  Ebyte = 10^18 bytes

8
Global Climate Modeling Problem
  • Problem is to compute:
  • f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity
  • Approach:
  • Discretize the domain, e.g., a measurement point every 10 km
  • Devise an algorithm to predict weather at time t+1 given t (a toy sketch follows below)
  • Uses:
  • Predict major events, e.g., El Nino
  • Use in setting air emissions standards

Source: http://www.epm.ornl.gov/chammp/chammp.html
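To make the discretization concrete, here is a toy C sketch (not part of the original deck) of one "predict t+1 from t" step on a small latitude/longitude grid. The grid size, the single state variable, and the neighbour-averaging update are illustrative stand-ins for the real discretized physics.

    #include <stdio.h>

    #define NLAT 64    /* toy resolution; a real model uses far more points */
    #define NLON 128

    /* One illustrative "time step": each interior cell is replaced by the
       average of its four neighbours.  A real model would apply the
       discretized physical equations here instead. */
    static void step(double t[NLAT][NLON], double tnew[NLAT][NLON])
    {
        for (int i = 1; i < NLAT - 1; i++)
            for (int j = 1; j < NLON - 1; j++)
                tnew[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] +
                                     t[i][j-1] + t[i][j+1]);
    }

    int main(void)
    {
        static double t[NLAT][NLON], tnew[NLAT][NLON];  /* zero-initialized */
        t[NLAT/2][NLON/2] = 1.0;     /* one warm cell as initial condition  */
        step(t, tnew);               /* state at time t+1 from state at t   */
        printf("neighbour after one step: %f\n", tnew[NLAT/2 + 1][NLON/2]);
        return 0;
    }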
9
Global Climate Modeling Computation
  • One piece is modeling the fluid flow in the atmosphere
  • Solve the Navier-Stokes problem
  • Roughly 100 flops per grid point with a 1-minute timestep
  • Computational requirements (the arithmetic is sketched below):
  • To match real time, need 5x10^11 flops in 60 seconds = 8 Gflop/s
  • Weather prediction (7 days in 24 hours) -> 56 Gflop/s
  • Climate prediction (50 years in 30 days) -> 4.8 Tflop/s
  • To use in policy negotiations (50 years in 12 hours) -> 288 Tflop/s
  • To double the grid resolution, computation is at least 8x
  • State of the art models require integration of atmosphere, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more
  • Current models are coarser than this
  • http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
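The flop-rate figures above can be reproduced with a few lines of arithmetic. The sketch below is not from the original deck; it assumes the slide's rounded 8 Gflop/s real-time figure and a 360-day model year, which is what makes the 56, 4.8 and 288 figures come out exactly.

    #include <stdio.h>

    int main(void)
    {
        /* ~5e11 flops per simulated minute -> ~8 Gflop/s to keep up with
           real time (the slide rounds 8.3 down to 8). */
        double realtime      = 8.0;    /* Gflop/s, rounded as on the slide */
        double days_per_year = 360.0;  /* assumed 360-day model year       */

        double weather = realtime * 7.0;                          /* 7 days in 24 h   */
        double climate = realtime * 50.0 * days_per_year / 30.0;  /* 50 yr in 30 days */
        double policy  = realtime * 50.0 * days_per_year / 0.5;   /* 50 yr in 12 h    */

        printf("weather prediction : %4.0f Gflop/s\n", weather);        /*  56 */
        printf("climate prediction : %4.1f Tflop/s\n", climate / 1e3);  /* 4.8 */
        printf("policy negotiation : %4.0f Tflop/s\n", policy / 1e3);   /* 288 */
        return 0;
    }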

10
Why are powerful computers parallel?
11
Tunnel Vision by Experts
  • "I think there is a world market for maybe five computers."
  • Thomas Watson, chairman of IBM, 1943.
  • "There is no reason for any individual to have a computer in their home."
  • Ken Olson, president and founder of Digital Equipment Corporation, 1977.
  • "640K of memory ought to be enough for anybody."
  • Bill Gates, chairman of Microsoft, 1981.

Slide source: Warfield et al.
12
Technology Trends: Microprocessor Capacity
Moore's Law
Moore's Law: transistors/chip doubles every 1.5 years
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
13
How fast can a serial computer be?
1 Tflop/s, 1 TB sequential machine
r = 0.3 mm
  • Consider the 1 Tflop/s sequential machine (the arithmetic is sketched below):
  • data must travel some distance, r, to get from memory to CPU
  • to get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s
  • so r < c/10^12 = 0.3 mm
  • Now put 1 TB of storage in a 0.3 mm x 0.3 mm area:
  • each word occupies about 3 Angstroms^2, the size of a small atom
  • (1 Angstrom = 10^-7 mm)
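A quick check of these numbers (a sketch added here, not from the original deck), assuming one-way travel from memory to CPU once per cycle and the 10^12 words spread over an r x r square (link with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double c      = 3.0e8;    /* speed of light, m/s                     */
        double cycles = 1.0e12;   /* one word fetched per cycle at 1 Tflop/s */

        double r = c / cycles;    /* farthest a signal can travel per cycle  */
        printf("r < %.1e m = %.1f mm\n", r, r * 1e3);             /* 0.3 mm */

        /* Pack 10^12 words into an r x r square of silicon. */
        double side = sqrt((r * r) / 1.0e12);                     /* side per word */
        printf("each word gets ~%.0f Angstroms on a side\n", side / 1.0e-10); /* ~3 */
        return 0;
    }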

14
Automatic Parallelism in Modern Machines
  • Bit level parallelism
  • within floating point operations, etc.
  • Instruction level parallelism
  • multiple instructions execute per clock cycle
  • Memory system parallelism
  • overlap of memory operations with computation
  • OS parallelism
  • multiple jobs run in parallel on commodity SMPs

There are limits to all of these -- for very high performance, the user must identify, schedule and coordinate parallel tasks.
15
Number of transistors per processor chip
16
Number of transistors per processor chip
Instruction-Level Parallelism
Thread-Level Parallelism?
Bit-Level Parallelism
17
Parallel computers, yesterday and today
18
Various Competing Computer Architectures
  • Vector Computers (VC) -- proprietary systems:
  • provided the breakthrough needed for the emergence of computational science, but they were only a partial answer.
  • Massively Parallel Processors (MPP) -- proprietary systems:
  • high cost and a low performance/price ratio.
  • Symmetric Multiprocessors (SMP):
  • suffer from scalability limits
  • Clusters -- gaining popularity:
  • High Performance Computing -- commodity supercomputing
  • High Availability Computing -- mission-critical applications

19
High Performance Computing
  • Models
  • Shared Memory
  • Distributed Memory

20
Machine Architectures: Shared Memory

[Figure: CPU1, CPU2, ..., CPUN connected through an interconnection NETWORK to a shared MEMORY]

FEATURES: 1) All CPUs share memory  2) CPUs access memory using the interconnection network
21
Machine Architectures: Distributed Memory

[Figure: nodes, each with its own CPU and local memory, connected by a network]

FEATURES: 1) Each node has its own local memory  2) Nodes share data by passing data over the network
22
Issues in parallel performance
23
Locality and Parallelism
Conventional Storage Hierarchy

[Figure: each processor has its own Proc -> Cache -> L2 Cache -> L3 Cache -> Memory path, with potential interconnects between the processors' memories]

  • Large memories are slow, fast memories are small
  • Storage hierarchies are large and fast on average
  • Parallel processors, collectively, have large, fast caches
  • the slow accesses to remote data are what we call communication
  • Algorithms should do most work on local data

24
Finding Enough Parallelism: Amdahl's Law
  • Amdahl's Law

How many processors can we really use? Let's say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel.
25
Finding Enough Parallelism: Amdahl's Law
  • Amdahl's Law
  • If we run this on a parallel machine with five processors:
  • Our code now takes about 60s. We have sped it up by about 40%. Let's say we use a thousand processors:
  • We have now sped our code up by about a factor of two.

26
Finding Enough Parallelism: Amdahl's Law
  • Suppose only part of an application seems parallel
  • Amdahl's law:
  • Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • Let P = number of processors

Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)
           <= 1/s
  • Even if the parallel part speeds up perfectly, the sequential part limits overall performance (a small numeric sketch follows below).
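A minimal sketch (added here, not part of the deck) that evaluates this bound for the legacy-code example of the previous slides, assuming s = 0.5:

    #include <stdio.h>

    /* Amdahl's law: Speedup(P) = 1 / (s + (1 - s) / P), bounded above by 1/s. */
    static double amdahl(double s, double p)
    {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void)
    {
        double s = 0.5;   /* half of the legacy code remains sequential */

        printf("P = 5    : speedup %.2f\n", amdahl(s, 5.0));    /* ~1.67 */
        printf("P = 1000 : speedup %.2f\n", amdahl(s, 1000.0)); /* ~2.00 */
        printf("P -> inf : speedup %.2f\n", 1.0 / s);           /*  2.00 */
        return 0;
    }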

27
Finding Enough Parallelism: Amdahl's Law
  • Amdahl's Law
  • Speedup and efficiency:
  • Speedup: S_N = Ts / Tp
  • Efficiency: E_N = S_N / N
  • If the best known serial algorithm takes 8 seconds, i.e. Ts = 8, while a parallel algorithm takes 2 seconds using 5 processors, then:
  • S_N = Ts / Tp = 8 / 2 = 4 and
  • E_N = S_N / N = 4 / 5 = 0.8 = 80%
  • i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors, giving an 80% efficiency (computed in the sketch below).
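The same 8 s / 2 s / 5-processor example, worked as a tiny C sketch (not from the original deck):

    #include <stdio.h>

    int main(void)
    {
        double ts = 8.0, tp = 2.0;   /* best serial time and parallel time (s) */
        int    n  = 5;               /* number of processors                   */

        double speedup    = ts / tp;       /* S_N = Ts / Tp  = 4   */
        double efficiency = speedup / n;   /* E_N = S_N / N  = 0.8 */

        printf("speedup = %.1f, efficiency = %.0f%%\n", speedup, efficiency * 100.0);
        return 0;
    }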

28
Load Imbalance
  • Load imbalance is the time that some processors
    in the system are idle due to
  • insufficient parallelism (during that phase)
  • unequal size tasks
  • Examples of the latter
  • adapting to interesting parts of a domain
  • tree-structured computations
  • fundamentally unstructured problems
  • Algorithm needs to balance the load

29
Some of the Fastest Super Computers
30
Parallel Computing Today
IBM BlueGene @ 280 TFlops
31
Parallel Computing Today
Mini BlueGene @ 91 TFlops
32
Parallel Computing Today
ASC Purple @ 63 TFlops
33
Parallel Computing Today
Columbia SGI Altix @ 51 TFlops
34
Parallel Computing Today
Earth Simulator @ 35 TFlops
35
  • Parallel Computing Today

36
Parallel Computing @ home
Small class Beowulf cluster
37
Clusters
The Modern Choice
38
What is a cluster?
  • A cluster is a type of parallel or distributed
    processing system, which consists of a collection
    of interconnected stand-alone computers
    cooperatively working together as a single,
    integrated computing resource.
  • A typical cluster:
  • Network: faster, closer connection than a typical network (LAN)
  • Low latency communication protocols
  • Looser connection than SMP

39
Cluster Architecture
40
Backbone/Communication Topology
41
Token-Ring/Ethernet with Workstations
42
Complete Connectivity
43
Star Topology
44
Binary Tree
45
INTEL Paragon (2-D Mesh)
46
The Need for Alternative Supercomputing Resources
  • Cannot afford to buy "Big Iron" machines:
  • due to their high cost and short life span
  • cut-down of funding
  • they don't fit well into today's funding model
  • Paradox: the time required to develop a parallel application for solving a grand challenge application (GCA) is equal to the
  • half-life of parallel supercomputers.

47
Clusters are the best alternative!
  • Supercomputing-class commodity components are available
  • They fit very well with today's/future funding model
  • Can leverage future technological advances:
  • VLSI, CPUs, networks, disks, memory, caches, OS, programming tools, applications, ...

48
Best of both Worlds!
  • High Performance Computing
  • parallel computers/supercomputer-class
    workstation cluster
  • dependable parallel computers
  • High Availability Computing
  • mission-critical systems
  • fault-tolerant computing

49
So What's So Different about Clusters?
  • Commodity Parts?
  • Communications Packaging?
  • Incremental Scalability?
  • Independent Failure?
  • Intelligent Network Interfaces?
  • Complete System on every node
  • virtual memory
  • scheduler
  • files
  • Nodes can be used individually or combined...

50
1984 Computer Food Chain
Mainframe
PC
Workstation
Mini Computer
Vector Supercomputer
51
Original Food Chain
Mainframe
Vector Supercomputer
Mini Computer
Workstation
PC
Before
52
Computer Food Chain (Now and Future)
53
Why Clusters Now? (Beyond Technology and Cost)
  • Building block is big enough:
  • complete computers (HW & SW) shipped in millions: killer micro, killer RAM, killer disks, killer OS, killer networks, killer apps
  • Workstation performance is doubling every 18 months
  • Networks are faster:
  • Higher link bandwidth (vs. 10 Mbit Ethernet)
  • Switch-based networks coming (ATM)
  • Interfaces simple and fast (Active Messages)
  • Demise of mainframes, supercomputers, MPPs

54
Architectural Drivers
  • Node architecture dominates performance:
  • processor, cache, bus, and memory
  • design and engineering => performance
  • Greatest demand for performance is on large systems:
  • must track the leading edge of technology without lag
  • MPP network technology => mainstream:
  • system area networks
  • System on every node is a powerful enabler:
  • very high speed I/O, virtual memory, scheduling, ...
55
...Architectural Drivers
  • Clusters can be grown: incremental scalability (up, down, and across)
  • Individual node performance can be improved by adding additional resources (new memory blocks/disks)
  • New nodes can be added or nodes can be removed
  • Clusters of Clusters and Metacomputing
  • Complete software tools:
  • Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.
  • Wide class of applications:
  • Sequential and grand challenge parallel applications
56
Top500 SuperComputers Statistics
57
Top500 SuperComputers List: Manufacturers
58
Top500 SuperComputers List: Continents
59
Top500 SuperComputers List: Countries/Performance
60
Top500 SuperComputers List: Asian Countries/Systems
61
Top500 SuperComputers List: Customer Segments/Performance
62
Top500 SuperComputers List: Architecture/Performance
63
Top500 SuperComputers List: Interconnect/Performance
64
Top500 SuperComputers List: Operating Systems/Systems
65
How do I write parallel apps?
66
Available APIs
67
What is OpenMP ?
  • A standard developed under the review of many major software and hardware developers, government, and academia
  • Facilitates simple development of programs to take advantage of SMP architectures
  • SMP: symmetric multi-processing; access time to memory is approx. equal for all processors (usually 2-16 processors)
  • Shared memory: memory local to all processors in an SMP domain
  • Distributed memory: remote memory access (non-local memory), NUMA (clusters, grids)

68
What is OpenMP ?
  • The OpenMP API is comprised of:
  • Compiler directives
  • Library routines
  • Environment variables
  • OpenMP language support:
  • Fortran, C, C++
  • Compilers supporting OpenMP:
  • Intel Compilers, Portland Group (PGI), IBM, Compaq
  • Omni, OdinMP can be used with gcc (a minimal example follows below)
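A minimal OpenMP sketch in C (added for illustration, not from the original deck); it assumes an OpenMP-aware compiler such as one of those listed above, and uses a parallel-for with a reduction to sum a series:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive splits the loop across the threads; the reduction
           clause combines the per-thread partial sums at the join. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

The number of threads can be set at run time, e.g. through the OMP_NUM_THREADS environment variable mentioned in the API list above.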

69
OpenMP (behind the scenes)
  • Thread communication is through shared variables (shared memory)
  • Threads can be carried through from one parallel region to the next
  • Important: need to amortize thread fork cost and minimize thread joins
  • Number of threads can be dynamically altered during runtime
  • Support for nested parallelism exists in some compilers

70
What is OpenMosix ?
  • An OpenSource enhancement to the Linux kernel
  • Provides adaptive (on-line) load-balancing
    between the machines.
  • Uses pre-emptive process migration to assign and
    reassign the processes among the nodes to take
    the best advantage of the available resources

71
OpenMosix architecture (1/5)
  • Network transparency:
  • The interactive user and the application-level programs are provided with a virtual machine that looks like a single MP machine
  • Preemptive process migration:
  • Any user's process, transparently and at any time, can migrate to any available node.
  • The migrating process is divided into two contexts:
  • system context (deputy), which may not be migrated from the home workstation
  • user context (remote), which can be migrated to a diskless node

72
OpenMosix architecture (2/5)
  • Preemptive process migration

master node
diskless node
73
OpenMosix architecture (3/5)
  • Dynamic load balancing:
  • Initiates process migrations in order to balance the load of the farm
  • Responds to variations in the load of the nodes, runtime characteristics of the processes, number of nodes and their speeds
  • Makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load

74
OpenMosix architecture (4/5)
  • Memory sharing:
  • Places the maximal number of processes in the farm's main memory, even if this implies an uneven load distribution among the nodes
  • Delays the swapping out of pages as much as possible
  • The decision of which process to migrate and where to migrate it is based on knowledge of the amount of free memory in other nodes
  • Efficient kernel communication:
  • specifically developed to reduce the overhead of internal kernel communications (e.g. between a process and its home site while it is executing on a remote site)
  • fast and reliable protocol with low startup latency and high throughput

75
OpenMosix architecture (5/5)
  • Probabilistic information dissemination algorithms:
  • provide each node with sufficient knowledge about available resources in other nodes, without polling
  • each node measures the amount of its available resources
  • each node receives the resource indices that every node sends at regular intervals to a randomly chosen subset of nodes
  • the use of a randomly chosen subset of nodes supports dynamic configuration and overcomes partial node failures
  • Decentralized control and autonomy:
  • each node makes its own control decisions independently and there is no master-slave relationship between nodes
  • each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption

76
Openmosix Conclusions (1/2)
  • Noticeable features of OpenMosix are:
  • load balancing
  • process migration algorithms
  • This is most useful in time-sharing, multi-user environments, where users do not have the means (and usually are not interested) to monitor the status (e.g. the load) of the nodes
  • A parallel application can be executed by forking many processes, just like on an SMP, where OpenMosix continuously attempts to optimize the resource allocation

77
Openmosix Conclusions (2/2)
  • Building up farms with the OpenMosix + ClusterNFS approach requires no more than 2 hours
  • With this approach, management of a farm = management of a single server

78
Message Passing Interface (MPI)
  • Available for numerous h/w and OS platforms
  • Most popular
  • Well supported
  • Supports C/C++, Fortran 77/90/95
  • Processes coordinate their activities by passing and receiving messages
  • Extensive API of wrapper functions available
  • Gives substantial control over the application's design/architecture
79
MPI continued
  • Basic Calls:
  • #include "mpi.h" -- provides basic MPI definitions and types
  • MPI_Init -- starts MPI
  • MPI_Finalize -- exits MPI
  • MPI_Comm_rank( MPI_COMM_WORLD, &rank )
  • MPI_Comm_size( MPI_COMM_WORLD, &size )
  • MPI_Send()
  • MPI_Recv()
80
MPI continued
  • Basic Program Structure:
  • #include "mpi.h"
  • int main(int argc, char **argv)
  • {
  •   /* No MPI functions before this */
  •   MPI_Init(&argc, &argv);   /* allows the system to do special setup */
  •   . . .
  •   MPI_Finalize();           /* frees memory used by MPI */
  •   /* No MPI function called after this */
  • }  /* main */

81
MPI Continued
  • Sample Hello World in MPI (a point-to-point send/receive sketch follows below):
  • #include "mpi.h"
  • #include <stdio.h>
  • int main( int argc, char **argv )
  • {
  •   MPI_Init( &argc, &argv );
  •   printf( "Hello world\n" );
  •   MPI_Finalize();
  •   return 0;
  • }
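The hello-world above never exchanges data. A minimal point-to-point sketch using the MPI_Send/MPI_Recv calls listed earlier (added for illustration, not from the original deck; it assumes the job is started with at least two processes) could look like this:

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                              /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("rank 0 sent %d\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Compile with mpicc and run with at least two processes, e.g. mpirun -np 2 ./a.out.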

82
Parallel Virtual Machine (PVM)
  • Runs on every UNIX and WinNT/Win95
  • Runs over most physical networks (Ethernet, FDDI, Myrinet, ATM, shared memory)
  • A heterogeneous collection of machines can be assembled and used as a Super Computer
  • Programming is completely portable
  • Supports C/C++, Fortran 77/90/95
  • The underlying machine and network are transparent to the programmer/user
  • Each user has his/her own private VM

83
PVM Continued
  • Basic Calls:
  • pvm_spawn:
  • num = pvm_spawn(child, arguments, flag, where, howmany, tids)
  • Send (one receiver):
  • info = pvm_send(tid, tag)
  • Receiving:
  • bufid = pvm_recv(tid, tag)
  • Broadcast (multiple receivers):
  • info = pvm_mcast(tids, n, tag),
  • info = pvm_bcast(group_name, tag)

84
PVM Continued
  • Sample Hello World:
  • #include <stdio.h>
  • #include <pvm3.h>
  • int main()
  • {
  •   int cc, tid;
  •   char buf[100];
  •   printf("i'm t%x\n", pvm_mytid());
  •   /* spawn 1 copy of hello_other on any machine */
  •   cc = pvm_spawn("hello_other", (char**)0, PvmTaskDefault, "", 1, &tid);
  •   if (cc == 1) {
  •     cc = pvm_recv(-1, -1);   /* receive a message from any source */
  •     /* get info about the sender */
  •     pvm_bufinfo(cc, (int*)0, (int*)0, &tid);
  •     pvm_upkstr(buf);
  •     printf("from t%x: %s\n", tid, buf);
  •   } else
  •     printf("can't start hello_other\n");
  •   pvm_exit();
  •   return 0;
  • }

85
CIIT Computational Cluster
86
CIIT Computational Cluster
87
..CIIT Computational Cluster
  • Cluster Specs:
  • Master Node:
  • Dual P-II 500 MHz, 512 MB, 20 GB Ultra Wide SCSI
  • Compute Node x 32:
  • P-II 333 MHz, 96 MB, 4 GB Ultra Wide SCSI
  • Ethernet 10/100 interconnect
  • 9-12 GFlops

88
..CIIT Computational Cluster
  • Available Libraries/APIs:
  • LAM/MPI, MPICH
  • PVM
  • BLAS (Basic Linear Algebra Subprograms)
  • LAPACK (Linear Algebra PACKage)
  • ScaLAPACK (Scalable Linear Algebra PACKage)
  • BLACS (Basic Linear Algebra Communication Subprograms)
  • Intel MKL (Intel's Math Kernel Library)

89
..CIIT Computational Cluster
  • Available programming languages/compilers:
  • C/C++
  • FORTRAN 77
  • GNU, Intel, Lahey, Fujitsu
  • Misc:
  • OpenPBS/Torque

90
..CIIT Computational Cluster
  • Topology

91
..CIIT Computational Cluster
  • How do I access it?
  • Use any ssh client:
  • OpenSSH or PuTTY (for Windows)
  • Connect from:
  • Faculty: IP 172.16.4.19
  • Student labs: 172.16.0.45

92
..CIIT Computational Cluster
93
..CIIT Computational Cluster
94
..CIIT Computational Cluster
  • Code compilation:
  • mpicc, mpif77, gcc, g77
  • E.g. to compile hello.c:
  • mpicc -o <binary output> <source file.c>
  • Or:
  • mpicc -o hello hello.c
  • mpif77 -o hello hello.f
  • gcc -I /opt/pvm3/include myprogram.c -L /opt/pvm3/lib/LINUX/ -lpvm3 -o myprogramexe

95
..CIIT Computational Cluster
  • Resource Availability

96
..CIIT Computational Cluster
97
..CIIT Computational Cluster
  • Submitting your jobs:
  • # start of script
  • #PBS -N HelloJob
  • #PBS -q workq
  • #PBS -l nodes=32
  • echo "start it"
  • echo "HOME=$HOME"
  • lamboot $PBS_NODEFILE
  • echo "- LAM is ready"
  • cd $PBS_O_WORKDIR
  • mpirun C hello
  • lamhalt $PBS_NODEFILE
  • echo "done"
  • # end of script
  • user$ qsub myjobscript

98
..CIIT Computational Cluster
99
..CIIT Computational Cluster
  • Jobs Status

100
..CIIT Computational Cluster
101
..CIIT Computational Cluster
102
Applications of Parallel Computing revisited
  • Science
  • Global climate modeling
  • Astrophysical modeling
  • Biology: genomics, protein folding, drug design
  • Computational Chemistry
  • Computational Material Sciences and Nano Sciences
  • Engineering
  • Crash simulation
  • Semiconductor design
  • Earthquake and structural modeling
  • Computational fluid dynamics (airplane design)
  • Combustion (engine design)

103
Applications of Parallel Computing revisited
  • Business
  • Financial and economic modeling
  • Transaction processing, web services and search
    engines
  • Defense
  • Nuclear weapons
  • Cryptography

104
PDC Hot Topics for E-commerce
Applications of Parallel Computing revisited
  • Cluster-based web servers, search engines, portals
  • Scheduling and Single System Image
  • Heterogeneous Computing
  • Reliability, High Availability and Data Recovery
  • Parallel databases and high-performance, reliable mass storage systems
  • CyberGuard! Data mining for detection of cyber attacks, frauds, etc., and online control
  • Data mining for identifying sales patterns and automatically tuning the portal for special sessions/festival sales
  • eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on
  • Data/site replication and caching techniques
  • Compute Power Market
  • Infowares (yahoo.com, AOL.com)
  • ASPs (application service providers)
  • . . .

105
Q/A
  • References:
  • www.tldp.com
  • www-unix.mcs.anl.gov/mpi/
  • www.netlib.org/pvm3/
  • www.beowulf.org
  • www.putty.nl/
  • www.top500.org
  • OpenMP: www.openmp.org
  • Introduction to OpenMP: http://www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
  • http://www.openmosix.org

End of Day 6
106
Backup Slides
107
..Introduction of PDC
  • Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • Let P = number of processors
  • Amdahl's Law:

Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)
           <= 1/s
  • Even if the parallel part speeds up perfectly, the sequential part limits overall performance.

108
Backup Slides
109
Introduction of PDC
  • Sequential Hardware:
  • Turing Machine view:
  • Tape + TM state
  • Sequential change of state and tape position
  • Von Neumann view:
  • Program counter + registers = thread/process
  • Sequential change of machine state
  • Sequence is the essence of computation

110
High Resolution Climate Modeling on NERSC-3, P. Duffy et al., LLNL
111
A 1000 Year Climate Simulation
  • Demonstration of the Community Climate System Model (CCSM2)
  • A 1000-year simulation shows long-term, stable representation of the earth's climate.
  • 760,000 processor hours used
  • Temperature change shown
  • Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Laboratory; et al.
  • http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf

112
Climate Modeling on the Earth Simulator System
  • Development of the ES started in 1997 with the goal of enabling a comprehensive understanding of global environmental changes such as global warming.
  • Construction was completed in February 2002 and practical operation started March 1, 2002.
  • 35.86 Tflops (87.5% of peak performance) on the Linpack benchmark.
  • 26.58 Tflops on a global atmospheric circulation code.

113
Fully Connected: CM-2
114
Scaling microprocessors
  • What happens when the feature size shrinks by a factor of x?
  • Clock rate goes up by x
  • actually a little less
  • Transistors per unit area go up by x^2
  • Die size also tends to increase
  • typically another factor of x
  • Raw computing power of the chip goes up by ~x^4 !
  • of which x^3 is devoted either to parallelism or locality (see the sketch below)
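For illustration (not from the original deck), plugging a hypothetical shrink factor x = 2 into these rules of thumb:

    #include <stdio.h>

    int main(void)
    {
        double x = 2.0;   /* hypothetical feature-size shrink factor */

        printf("clock rate        : ~%gx (a little less in practice)\n", x);
        printf("transistors/area  : %gx\n", x * x);
        printf("die area          : ~%gx\n", x);
        printf("raw compute power : %gx\n", x * x * x * x);
        printf("  of which for parallelism/locality: %gx\n", x * x * x);
        return 0;
    }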