Prof. Sin-Min Lee - PowerPoint PPT Presentation

1
Parallel Computers
CS147 Lecture 17
  • Prof. Sin-Min Lee
  • Department of Computer Science

2
Uniprocessor Systems
  • Improve performance
  • Allowing multiple, simultaneous memory accesses
  • - requires multiple address, data, and control
    buses (one set for each simultaneous memory
    access)
  • - the memory chip has to be able to handle
    multiple transfers simultaneously

3
Uniprocessor Systems
  • Multiport Memory
  • Has two sets of address, data, and control pins
    to allow simultaneous data transfers to occur
  • CPU and DMA controller can transfer data
    concurrently
  • A system with more than one CPU could handle
    simultaneous requests from two different
    processors

4
Uniprocessor Systems
  • Multiport Memory (cont.)
  • Can
  • - Handle two requests to read data from the same
    location at the same time
  • Cannot
  • - Process two simultaneous requests to write data
    to the same memory location
  • - Process simultaneous requests to read from and
    write to the same memory location
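The can/cannot rules above can be sketched as a small conflict check. This is only an illustration of the rules on the slide, not a model of any particular memory chip; the `Request` type and the `can_serve_together` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One memory request arriving on a port (hypothetical model)."""
    address: int
    is_write: bool


def can_serve_together(a: Request, b: Request) -> bool:
    """Return True if a two-port memory can serve both requests at once."""
    if a.address != b.address:
        return True  # different locations never conflict
    # Same location: only read + read is allowed.
    return not a.is_write and not b.is_write


print(can_serve_together(Request(0x10, False), Request(0x10, False)))  # True
print(can_serve_together(Request(0x10, True), Request(0x10, False)))   # False
```

When the check fails, one of the two requests would have to wait for the other; how that arbitration is done is outside the scope of this sketch.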

5
Multiprocessors
(Diagram labels: Bus, CPU, Memory, I/O Port, Device)
6
Multiprocessors
  • Systems designed to have 2 to 8 CPUs
  • The CPUs all share the other parts of the
    computer
  • Memory
  • Disk
  • System Bus
  • etc
  • CPUs communicate via Memory and the System Bus

7
Multiprocessors
  • Each CPU shares memory, disks, etc
  • Cheaper than clusters
  • Lower performance than clusters
  • Often used for
  • Small Servers
  • High-end Workstations

8
Multiprocessors
  • OS automatically shares work among available CPUs
  • On a workstation
  • One CPU can be running an engineering design
    program
  • Another CPU can be doing complex graphics
    formatting

9
Applications of Parallel Computers
  • Traditionally: government labs, numerically
    intensive applications
  • Research institutions
  • Recent growth in industrial applications
  • 236 of the top 500
  • Financial analysis, drug design and analysis, oil
    exploration, aerospace and automotive

10
Multiprocessor Systems: Flynn's Classification
  • Single instruction multiple data (SIMD)

(Diagram labels: Main Memory, Control Unit, Communications Network, three Processor/Memory pairs)
  • Executes a single instruction on multiple data
    values simultaneously using many processors
  • Since only one instruction is processed at any
    given time, it is not necessary for each
    processor to fetch and decode the instruction
  • This task is handled by a single control unit
    that sends the control signals to each processor.
  • Example: Array processor

11
Why Multiprocessors?
  • Microprocessors as the fastest CPUs
  • Collecting several is much easier than
    redesigning one
  • Complexity of current microprocessors
  • Do we have enough ideas to sustain 1.5X/yr?
  • Can we deliver such complexity on schedule?
  • Slow (but steady) improvement in parallel
    software (scientific apps, databases, OS)
  • Emergence of embedded and server markets driving
    microprocessors in addition to desktops
  • Embedded: functional parallelism,
    producer/consumer model
  • Server: figure of merit is tasks per hour vs.
    latency

12
Parallel Processing Intro
  • Long-term goal of the field: scale number of
    processors to size of budget, desired performance
  • Machines today: Sun Enterprise 10000 (8/00)
  • 64 400 MHz UltraSPARC II CPUs, 64 GB SDRAM
    memory, 868 18 GB disks, tape
  • $4,720,800 total
  • 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet
    16% ($10,800 per processor or 0.2% per
    processor)
  • Minimal E10K (1 CPU, 1 GB DRAM, 0 disks, tape):
    $286,700
  • $10,800 (4%) per CPU, plus $39,600 board/4 CPUs
    (8%/CPU)
  • Machines today: Dell Workstation 220 (2/01)
  • 866 MHz Intel Pentium III (in minitower)
  • 0.125 GB RDRAM memory, one 10 GB disk, 12X CD,
    17-inch monitor, nVIDIA GeForce 2 GTS, 32 MB DDR
    graphics card, 1 yr service
  • $1,600; for extra processor, add $350 (20%)

13
Major MIMD Styles
  • Centralized shared memory ("Uniform Memory
    Access" time or "Shared Memory Processor")
  • Decentralized memory (memory module with CPU)
  • Get more memory bandwidth, lower memory latency
  • Drawback: longer communication latency
  • Drawback: software model more complex

14
Organization of Multiprocessor Systems
  • Three different ways to organize/classify
    systems
  • Flynn's Classification
  • System Topologies
  • MIMD System Architectures

15
Multiprocessor Systems: Flynn's Classification
  • Flynn's Classification
  • Based on the flow of instructions and data
    processing
  • A computer is classified by
  • - whether it processes a single instruction at a
    time or multiple instructions simultaneously
  • - whether it operates on one or multiple data
    sets

16
Multiprocessor Systems: Flynn's Classification
  • Four Categories of Flynn's Classification
  • SISD: Single instruction single data
  • SIMD: Single instruction multiple data
  • MISD: Multiple instruction single data
  • MIMD: Multiple instruction multiple data
  • The MISD classification is not practical to
    implement.
  • In fact, no significant MISD computers have ever
    been built.
  • It is included only for completeness.

17
From the beginning of time, computer scientists
have been challenging computers with larger and
larger problems. Eventually, computer processors
were combined in parallel to work on the same
task together. This is parallel processing.
Types of Parallel Processing
  • SISD: Single Instruction stream, Single Data
    stream
  • MISD: Multiple Instruction stream, Single Data
    stream
  • SIMD: Single Instruction stream, Multiple Data
    stream
  • MIMD: Multiple Instruction stream, Multiple Data
    stream
18
SISD
One piece of data is sent to one processor.
Ex: To multiply one hundred numbers by the number
three, each number would be sent to the processor
and multiplied in turn until all one hundred
results were calculated.
19
MISD
One piece of data is broken up and sent to many
processors.
(Diagram labels: Data, Search, four CPUs)
Ex: A database is broken up into sections of
records and sent to several different processors,
each of which searches its section for a specific
key.
20
SIMD
Multiple processors execute the same instruction
on separate data.
Ex: A SIMD machine with 100 processors could
multiply 100 numbers, each by the number three,
at the same time.
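The multiply-by-three example can be sketched by counting steps. This is a plain-Python simulation under the assumption that one "step" is one instruction issue; the function names are hypothetical, and real SIMD hardware performs the lane operations in parallel rather than in a loop.

```python
def sisd_multiply(values, factor):
    """One processor: handle one value per step."""
    results, steps = [], 0
    for v in values:
        results.append(v * factor)
        steps += 1
    return results, steps


def simd_multiply(values, factor, num_processors):
    """One broadcast instruction covers num_processors values per step."""
    results, steps = [], 0
    for start in range(0, len(values), num_processors):
        lane_data = values[start:start + num_processors]
        # Conceptually, every lane multiplies its own value at the same time.
        results.extend(v * factor for v in lane_data)
        steps += 1
    return results, steps


numbers = list(range(1, 101))
print(sisd_multiply(numbers, 3)[1])       # 100 steps
print(simd_multiply(numbers, 3, 100)[1])  # 1 step
```

Both functions return the same results; only the number of steps differs.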
21
MIMD
Multiple processors execute different instructions
on separate data.
(Diagram labels: four CPUs, each with its own Data, performing Multiply, Search, Add, and Subtract)
This is the most complex form of parallel
processing. It is used on complex simulations
like modeling the growth of cities.
22
The Granddaddy of Parallel Processing
MIMD
23
MIMD computers usually have a different program
running on every processor. This makes for a
very complex programming environment.
What's doing what, when?
Which processor? Doing which task? At what time?
24
Memory latency: the time between issuing a memory
fetch and receiving the response.
Simply put, if execution proceeds before the
memory request has been answered, unexpected
results will occur. What values are being used?
Not the ones requested!
25
A similar problem can occur with instruction
execution itself.
Synchronization: the need to enforce the ordering
of instruction executions according to their data
dependencies.
Example: instruction b must occur before
instruction a.
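One way to picture this ordering requirement is with two threads and an event flag. The sketch below is an assumption-laden illustration: `instruction_a`, `instruction_b`, and the shared dictionary are hypothetical stand-ins, and Python threads substitute for separate processors.

```python
import threading

shared = {}
b_done = threading.Event()
log = []


def instruction_b():
    shared["x"] = 21   # b produces the value that a depends on
    log.append("b")
    b_done.set()       # signal that the value is ready


def instruction_a():
    b_done.wait()      # a blocks until b has finished
    shared["y"] = shared["x"] * 2
    log.append("a")


# Start a first on purpose: the event still forces b to complete before a.
ta = threading.Thread(target=instruction_a)
tb = threading.Thread(target=instruction_b)
ta.start()
tb.start()
ta.join()
tb.join()
print(log, shared["y"])  # ['b', 'a'] 42
```

Without the `wait()` call, `instruction_a` could read `shared["x"]` before it exists, which is the kind of unexpected result the slide warns about.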
26
Despite potential problems, MIMD can prove larger
than life.
MIMD Successes
IBM's Deep Blue computer beats a professional
chess player.
Some may not consider this to be a fair example,
because Deep Blue was built to beat Kasparov
alone. It knew his play style so it could
counter his projected moves. Still, Deep Blue's
win marked a major victory for computing.
27
IBM's latest, a supercomputer that models nuclear
explosions.
IBM Poughkeepsie built the world's fastest
supercomputer for the U.S. Department of Energy.
Its job was to model nuclear explosions.
28
MIMD: it's the most complex, fastest, and most
flexible parallel paradigm. It has beaten a
world-class chess player at his own game. It
models things that few people understand. It is
parallel processing at its finest.
29
Multiprocessor Systems: Flynn's Classification
  • Single instruction single data (SISD)
  • Consists of a single CPU executing individual
    instructions on individual data values

30
Multiprocessor Systems: Flynn's Classification
  • Multiple instruction multiple data (MIMD)
  • Executes different instructions simultaneously
  • Each processor must include its own control unit
  • The processors can be assigned to parts of the
    same task or to completely separate tasks
  • Example: Multiprocessors, multicomputers

31
Popular Flynn Categories
  • SISD (Single Instruction Single Data)
  • - Uniprocessors
  • MISD (Multiple Instruction Single Data)
  • - ??? multiple processors on a single data stream
  • SIMD (Single Instruction Multiple Data)
  • - Examples: Illiac-IV, CM-2
  • - Simple programming model
  • - Low overhead
  • - Flexibility
  • - All custom integrated circuits
  • - (Phrase reused by Intel marketing for media
    instructions, similar to vector instructions)
  • MIMD (Multiple Instruction Multiple Data)
  • - Examples: Sun Enterprise 5000, Cray T3D, SGI
    Origin
  • - Flexible
  • - Use off-the-shelf micros
  • MIMD is the current winner: concentrate on major
    design emphasis <= 128 processor MIMD machines

32
Multiprocessor Systems
  • System Topologies
  • The topology of a multiprocessor system refers to
    the pattern of connections between its processors
  • Quantified by standard metrics
  • Diameter: the maximum distance between two
    processors in the computer system
  • Bandwidth: the capacity of a communications link
    multiplied by the number of such links in the
    system (best case)
  • Bisection bandwidth: the total bandwidth of the
    links connecting the two halves of the system
    when the processors are split so that the number
    of links between the two halves is minimized
    (worst case)
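The diameter and total-bandwidth metrics can be computed directly from a description of the links. The sketch below assumes a topology is given as adjacency lists and that distance is measured in hops; the helper names and the link capacity of 100 (in arbitrary units) are made up for illustration.

```python
from collections import deque


def ring(n):
    """Adjacency lists for a ring of n processors (n >= 3)."""
    return {p: [(p - 1) % n, (p + 1) % n] for p in range(n)}


def diameter(adj):
    """Largest shortest-path hop count over all pairs (BFS from every node)."""
    worst = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def total_bandwidth(adj, link_capacity):
    """Capacity of one link times the number of links (best case)."""
    num_links = sum(len(neighbors) for neighbors in adj.values()) // 2
    return num_links * link_capacity


six = ring(6)
print(diameter(six))              # 3
print(total_bandwidth(six, 100))  # 600
```

Bisection bandwidth is left out because finding the minimizing split is a harder search problem in general.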

33
Multiprocessor Systems: System Topologies
  • Six Categories of System Topologies
  • Shared bus
  • Ring
  • Tree
  • Mesh
  • Hypercube
  • Completely Connected

35
Multiprocessor Systems: System Topologies
  • Shared bus
  • The simplest topology
  • Processors communicate with each other
    exclusively via this bus
  • Can handle only one data transmission at a time
  • Can be easily expanded by connecting additional
    processors to the shared bus, along with the
    necessary bus arbitration circuitry

(Diagram labels: three processors (P), three memories (M), Shared Bus, Global Memory)
37
Multiprocessor Systems: System Topologies
  • Ring
  • Uses direct dedicated connections between
    processors
  • Allows all communication links to be active
    simultaneously
  • A piece of data may have to travel through
    several processors to reach its final destination
  • All processors must have two communication links

(Diagram: six processors (P) connected in a ring)
38
Multiprocessor Systems: System Topologies
  • Tree topology
  • Uses direct connections between processors
  • Each processor has up to three connections (one
    to its parent and two to its children)
  • Its primary advantage is its relatively low
    diameter
  • Example: DADO Computer

(Diagram: seven processors (P) arranged as a tree)
42
Multiprocessor Systems: System Topologies
  • Mesh topology
  • Every processor connects to the processors above,
    below, left, and right
  • Left to right and top to bottom wraparound
    connections may or may not be present

(Diagram: nine processors (P) arranged as a mesh)
45
Multiprocessor Systems: System Topologies
  • Hypercube
  • Multidimensional mesh
  • Has n processors, each with log2 n connections
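A common way to see why each processor has log2 n connections is to label the processors with binary numbers: two processors are linked when their labels differ in exactly one bit. The sketch below assumes that labeling; the function names are hypothetical.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors differ from node in exactly one bit of its binary label."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hops(a, b):
    """Shortest path length is the number of differing bits."""
    return bin(a ^ b).count("1")


d = 3  # 2**3 = 8 processors, each with 3 connections
print(hypercube_neighbors(0b000, d))  # [1, 2, 4]
print(hops(0b000, 0b111))             # 3
```

Under this labeling the diameter equals the number of dimensions, since the farthest pair differs in every bit.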

48
Multiprocessor Systems: System Topologies
  • Completely Connected
  • Every processor has n-1 connections, one to each
    of the other processors
  • The complexity of the processors increases as
    the system grows
  • Offers maximum communication capabilities
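To see how quickly the wiring grows, the link count of a completely connected system, n(n-1)/2, can be compared with a ring, which needs only n links. This is a small arithmetic sketch; the function names are hypothetical.

```python
def links_complete(n):
    """Each of n processors links to the other n-1; each link is shared by two."""
    return n * (n - 1) // 2


def links_ring(n):
    """A ring of n processors (n >= 3) uses exactly n links."""
    return n


for n in (4, 16, 64):
    print(n, links_ring(n), links_complete(n))
# 4 4 6
# 16 16 120
# 64 64 2016
```

Each processor in the completely connected case also needs n-1 ports, which is the per-processor complexity the slide refers to.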

49
Architecture Details
  • Computers → MPPs

(Diagram labels: World's simplest computer (processor/memory); Standard computer (add cache, disk); Network)
50
A Supercomputer at $5.2 million
Virginia Tech's 1,100-node Mac G5 supercomputer
51
The Virginia Polytechnic Institute and State
University has built a supercomputer composed of
a cluster of 1,100 dual-processor Macintosh G5
computers. Based on preliminary benchmarks, Big
Mac is capable of 8.1 teraflops. The Mac
supercomputer is still being fine-tuned, and the
full extent of its computing power will not be
known until November. But the 8.1 teraflops
figure would make the Big Mac the world's fourth
fastest supercomputer.
52
Big Mac's cost relative to similar machines is as
noteworthy as its performance. The Apple
supercomputer was constructed for just over US$5
million, and the cluster was assembled in about
four weeks. In contrast, the world's leading
supercomputers cost well over $100 million to
build and require several years to construct. The
Earth Simulator, which clocked in at 38.5
teraflops in 2002, reportedly cost up to $250
million.
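The cost comparison can be made concrete as cost per teraflop. The sketch below reads the amounts as US dollars and uses the figures quoted in these slides: 5.2 million and a preliminary 8.1 teraflops for Big Mac, and the upper estimate of 250 million with 38.5 teraflops for the Earth Simulator. Treat the result as rough, since both inputs are approximate.

```python
def cost_per_teraflop(cost_millions, teraflops):
    """Cost in millions of dollars for each teraflop of measured performance."""
    return cost_millions / teraflops


big_mac = cost_per_teraflop(5.2, 8.1)
earth_simulator = cost_per_teraflop(250, 38.5)

print(round(big_mac, 2))          # 0.64
print(round(earth_simulator, 2))  # 6.49
```

On these numbers the cluster comes out roughly ten times cheaper per teraflop.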
53
October 28, 2003
Time: 7:30pm - 9:00pm
Location: Santa Clara Ballroom
Srinidhi Varadarajan, Ph.D.
Dr. Srinidhi Varadarajan is an Assistant
Professor of Computer Science at Virginia Tech.
He was honored with the NSF Career Award in 2002
for "Weaving a Code Tapestry: A Compiler Directed
Framework for Scalable Network Emulation." He has
focused his research on building a distributed
network emulation system that can scale to
emulate hundreds of thousands of virtual nodes.
54
Parallel Computers
  • Two common types
  • Cluster
  • Multi-Processor

55
Cluster Computers
56
Clusters on the Rise
Using clusters of small machines to build a
supercomputer is not a new concept. Another of
the world's top machines, housed at the Lawrence
Livermore National Laboratory, was constructed
from 2,304 Xeon processors. The machine was built
by Utah-based Linux Networx. Clustering
technology has meant that traditional big-iron
leaders like Cray (Nasdaq: CRAY) and IBM have new
competition from makers of smaller machines. Dell
(Nasdaq: DELL), among other companies, has sold
high-powered computing clusters to research
institutions.
57
Cluster Computers
  • Each computer in a cluster is a complete computer
    by itself
  • CPU
  • Memory
  • Disk
  • etc
  • Computers communicate with each other via some
    interconnection bus

58
Cluster Computers
  • Typically used where one computer does not have
    enough capacity to do the expected work
  • Large Servers
  • Cheaper than building one GIANT computer

59
Although not new, supercomputing clustering
technology is still impressive. It works by
farming out chunks of data to individual
machines, and it works better for some types of
computing problems than others. For example, a
cluster would not be ideal to compete against
IBM's Deep Blue supercomputer in a chess match:
in this case, all the data must be available to
one processor at the same moment; the machine
operates much in the same way as the human brain
handles tasks. However, a cluster would be ideal
for the processing of seismic data for oil
exploration, because that computing job can be
divided into many smaller tasks.
60
Cluster Computers
  • Need to break up work among the computers in the
    cluster
  • Example: Microsoft.com search engine
  • 6 computers running SQL Server
  • Each has a copy of the MS Knowledge Base
  • Search requests come to one front-end computer
  • It sends each request to one of the 6
  • It attempts to keep all 6 busy
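The front-end behavior described above can be sketched as a dispatcher that always picks the least-busy server. This is an assumed policy chosen for illustration; the slide does not say how the real system chooses, and the `Dispatcher` class is hypothetical.

```python
class Dispatcher:
    """Front-end machine that forwards each request to the least-busy server."""

    def __init__(self, num_servers=6):
        self.active = [0] * num_servers  # requests in flight per server

    def dispatch(self):
        """Pick a server for a new request and record it as busy."""
        server = self.active.index(min(self.active))  # ties go to lowest index
        self.active[server] += 1
        return server

    def finish(self, server):
        """Record that a server has completed one request."""
        self.active[server] -= 1


front = Dispatcher()
print([front.dispatch() for _ in range(8)])  # [0, 1, 2, 3, 4, 5, 0, 1]
```

Because every server holds a full copy of the data, any request can go to any server, which is what makes this simple policy workable.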

61
The Virginia Tech Mac supercomputer should be
fully functional and in use by January 2004. It
will be used for research into nanoscale
electronics, quantum chemistry, computational
chemistry, aerodynamics, molecular statics,
computational acoustics and the molecular
modeling of proteins.
62
Specialized Processors
  • Vector Processors
  • Massively Parallel Computers

63
Vector Processors
for (I = 0; I < n; I++) array1[I] = array2[I] + array3[I];
This is an array (vector) operation
64
Vector Processors
  • Special instructions to operate on vectors
    (arrays)
  • A vector instruction specifies
  • - Starting addresses of all 3 arrays
  • - Loop count
  • Saves for-loop overhead
  • Can access memory more efficiently
  • Also known as SIMD computers
  • (Single Instruction Multiple Data)
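A vector instruction of this kind can be modeled as a single call that takes three starting addresses and a loop count. The sketch below uses a flat Python list as memory and assumes the operation is element-wise addition; `vector_add` and the addresses are hypothetical.

```python
def vector_add(memory, dest, src1, src2, count):
    """Model one vector instruction: three starting addresses and a loop count."""
    memory[dest:dest + count] = [
        a + b
        for a, b in zip(memory[src1:src1 + count], memory[src2:src2 + count])
    ]


# Flat memory: first source at address 0, second at 4, result at 8.
memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
vector_add(memory, dest=8, src1=0, src2=4, count=4)
print(memory[8:12])  # [11, 22, 33, 44]
```

The program issues one instruction instead of running loop control for every element, which is the overhead saving the slide mentions.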

65
Vector Processors
  • Until the 1990s, the world's fastest
    supercomputers were implemented as vector
    processors
  • Now, vector processors are typically special
    peripheral devices that can be installed on a
    regular computer

66
Massively Parallel Computers
  • IBM ASCI Purple
  • Cluster of 196 computers
  • Each computer has
  • 64 CPUs
  • 256 Gigabytes of RAM
  • 10,000 GB of Disk

67
Massively Parallel Computer
  • How will ASCI Purple be used?
  • Simulation of molecular dynamics
  • Research into repairing damaged DNA
  • Analysis of seismic waves
  • Earthquake research
  • Simulation of star evolution
  • Simulation of Weapons of Mass Destruction

68
According to the article, the supercomputer,
powered by 2,200 IBM G5 processors, has been
initially rated at 7.41 trillion operations per
second. The final number could be much higher,
according to school officials, but if not, it
would rank as the fourth-fastest supercomputing
cluster in the world.
Other machines for comparison:
  • Japan's US$250M Earth Simulator, which is
    currently the world's fastest computer
  • Lawrence Livermore's US$10-15M cluster system,
    which is made up of 2,304 Intel Xeon processors
  • IBM's "Pacific Blue", recently installed at the
    Lawrence Livermore Laboratories for $94 million
69
"We are demonstrating that you can build a very
high performance machine for a fifth to a tenth
of the cost of what supercomputers now cost,"
said Hassan Aref, the dean of the School of
Engineering at Virginia Tech in Blacksburg.
In 1998, a group called distributed.net linked
thousands of computers of all kinds around the
world via the Internet, and cracked a 56-bit
DES-II code in 40 days. It had previously been
thought that such heavyweight ciphers would take
hundreds of years to crack even on fast
computers. One version of the distributed.net
program ran as a screen saver that kicked in, and
began cracking code, whenever the machine was
idle for more than a few minutes.
Distributed.net bills itself as the "Fastest
Computer on Earth", even though its hardware
bill is effectively zero.
70
The idea is straightforward. You set up an
arbitrary number of PCs, network them, typically
using Fast Ethernet, and then send them problems
that can be divided up among the machines'
processors. One machine acts as a server that
syncs up all the rest, called clients. Beowulf
specifies software such as the Message Passing
Interface, running under the Linux operating
system, that allows the machines to communicate
while working on the problem. And since Linux,
brainchild of computer science student Linus
Torvalds, is free, it keeps the cost down.
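The server/client division of work can be sketched on a single machine. In the sketch below a thread pool stands in for the networked client PCs and no real message-passing library is used; the functions `split`, `client_work`, and `server` are hypothetical, and the sum of squares is just a placeholder for a divisible problem.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_clients):
    """Cut the problem into at most num_clients nearly equal chunks."""
    size = max(1, -(-len(data) // num_clients))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def client_work(chunk):
    """What each client computes on its own chunk (here: a sum of squares)."""
    return sum(x * x for x in chunk)


def server(data, num_clients=4):
    """Hand out chunks, wait for every client, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        partials = list(pool.map(client_work, split(data, num_clients)))
    return sum(partials)


print(server(list(range(1000))))  # 332833500
```

On a real cluster the chunks and partial results would travel over the network as messages, but the split, compute, and combine structure is the same.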
71
Modeling the trajectories of tens of millions
of charged particles, each interacting with the
others through electromagnetic forces, requires
heavy-duty number crunching. To harness
supercomputing power at a desktop price, UCLA's
Dr. Viktor K. Decyk and his colleagues have
created their own super-fast, parallel processing
supercomputer using a cluster of Power
Macintosh computers.
72

SYDNEY - 22 January 2001
73
"World's fastest" Macintosh cluster
Tuesday, May 15, 2001 at 8:45am
Researchers at the Grupo de Lasers e Plasmas
(GoLP) in Portugal have created what they bill as
the world's fastest Macintosh-based cluster.
Consisting of 16 dual-processor Power Mac
G4/450s, the cluster delivers more than 50
GigaFlops of peak power and took just one day to
set up.
74
Apple Computer purchased a big Cray supercomputer
in the mid-1980s. In fact, Steve Jobs was Cray's
first and only walk-in customer. He arrived
unannounced (so the story goes) at Cray
headquarters in Mendota Heights, Minnesota and
asked to speak to someone about buying a Cray.
They nearly threw him out. It's only slightly
less eccentric than someone walking into NASA
Johnson Space Center and inquiring how to
purchase a shuttle orbiter. Later, Cray president
John Rollwagen phoned Seymour Cray and told him that
Apple had just purchased a Cray that would be
used in designing the next Macintosh. Seymour
thought for a bit, and replied that that seemed
reasonable, since he was using a Macintosh to
design the next Cray!
75
Parallel Computer Architectures
  • MPP: Massively Parallel Processors
  • Top of the Top500 list consists mostly of MPPs,
    but clusters are rising

2002
  • Clusters are there!
  • Earth Simulator: old-old style making news again
  • ASCI machines: big companies, special purpose
  • Beowulf clusters: popping up everywhere
  • Software
  • Embarrassingly parallel or sacrifice a grad
    student
  • MATLABp (our little homegrown project)

2003
77
Performance Trends
78
Extrapolations
79
Beowulf Clusters
80
Current Beowulfs (2)