Title: Vector computers
1. Vector computers
2. Supercomputer
- Definition of a supercomputer
  - Fastest machine in the world at a given task
  - Any machine costing $30 million
  - A device to turn a compute-bound problem into an I/O-bound problem
  - Any machine designed by Seymour Cray
- In the 70s and 80s, a supercomputer was essentially a vector machine
3. First Vector Computers / Processors
- CDC STAR-100, TI ASC (1972)
  - Memory-memory vector processors
  - High start-up overhead
  - Relatively slow scalar units (underestimation of Amdahl's Law)
- Cray-1 (1976)
  - Vector-register vector processor (lower start-up overhead, reduced bandwidth requirements)
  - Fastest scalar processor in the world at that time
  - Vector chaining support
4. Vector Computers: Memory-memory vector computers
- CDC CYBER 205 (1981)
  - Memory-memory architecture
  - Four lanes with multiple functional units
  - Wide load-store pipeline
  - Support for non-unit-stride memory accesses and sparse vectors
- ETA-10 (CDC, late 80s)
  - 10 processors
  - Each supporting the memory-memory architecture
  - Last significant memory-memory design
5. Vector Computers: Vector-register vector processors
[Figure. Source: K. Asanovic, "Vector Processors", Appendix G in Computer Architecture: A Quantitative Approach.]
7. Vector Computers: Memory-memory vs vector-register
- Memory-memory vector computers
  - Operands fetched directly from main memory
  - Results written directly to memory
- Vector-register vector computers
  - Vector elements are read from memory into a vector register by a LOAD VECTOR operation
  - All arithmetic and logic operations are register-register operations
  - Results of vector operations are put into vector registers and may be stored back in memory by a STORE VECTOR operation
8. Vector Computers: Memory-memory vs vector-register
- Memory-memory architecture
  - Requires greater memory bandwidth
  - Prevents easy reuse of intermediate results
  - Makes it difficult to overlap multiple vector operations
  - Start-up time is significantly increased by the cost of memory accesses
  - Becomes more efficient for very long vectors
- Vector-register architecture
  - Free of the disadvantages of memory-memory machines
  - Experience has shown that shorter vectors are more commonly used
9. Vector computers: Memory bandwidth and latency
- Memory access latency adds to the start-up cost of fetching a vector from memory
- Sustaining sufficient bandwidth requires a special memory organization into multiple memory banks
- Additional problems arise when the memory is accessed in an irregular pattern (very typical of various matrix-based computations)
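The effect of the access pattern on a banked memory can be illustrated with a small timing model. This is only a sketch under stated assumptions: low-order interleaving (word address modulo the number of banks), at most one access issued per cycle, and a bank that stays busy for a fixed number of cycles after each access. The defaults of 16 banks and a 4-cycle busy time are the Cray-1 figures quoted elsewhere in these slides; the function names are hypothetical, and access latency is ignored.

```python
def bank_of(address, num_banks=16):
    """Low-order interleaving: consecutive words map to consecutive banks."""
    return address % num_banks


def cycles_for_strided_access(n, stride, num_banks=16, busy_time=4):
    """Cycles needed to issue n word accesses with the given stride.

    One access is attempted per cycle; the issue stalls while the
    target bank is still busy with an earlier access.
    """
    free_at = [0] * num_banks          # cycle at which each bank is free again
    cycle = 0
    for i in range(n):
        bank = bank_of(i * stride, num_banks)
        cycle = max(cycle, free_at[bank])   # stall until the bank is free
        free_at[bank] = cycle + busy_time
        cycle += 1                          # at most one access per cycle
    return cycle


unit = cycles_for_strided_access(64, stride=1)       # every bank in turn
half = cycles_for_strided_access(64, stride=8)       # only two banks used
worst = cycles_for_strided_access(64, stride=16)     # a single bank used
```

Under these assumptions a 64-element access issues in 64 cycles at unit stride, 126 cycles at stride 8, and 253 cycles at stride 16, where every access hits the same bank.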
10. Vector Computers: Simplified general structure of a vector-register vector computer
[Block diagram. Components: external memory, main memory, instruction processor, scalar processor, and a vector processor consisting of the vector transfer control and address generator, vector registers (local memory), vector operation control, and pipelined functional units. Labeled paths: instructions, scalar instructions, vector instructions, data (vectors), data (scalars), address parameters, functions, status.]
11. Vector Computers: Cray-1
- Main features of a classical vector-register vector computer
  - Load/store architecture
  - Vector registers
  - Vector instructions
  - Hardwired control
  - Highly pipelined functional units
  - Interleaved memory system (16 banks, 4-cycle busy time, 12-cycle latency)
  - No data caches
  - No virtual memory
12. Basic Cray-1 architecture
[Figure: block diagram of the Cray-1.]
13. Vector computers: Vector instructions
- $a_i = f_1(b_i)$
  - sine, cosine, square root, ...
- $s = f_2(A)$, with a scalar result $s$
  - sum, maximum, ...
- $a_i = f_3(b_i, c_i)$
  - add, subtract, ...
- $a_i = f_4(s, c_i)$, with a scalar operand $s$
  - multiply vector by scalar, ...
- It is possible to combine the above operations
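The four classes of vector instruction listed above can be mimicked with plain Python lists. This is only an illustration of the data flow, not of any real instruction set; the sample vectors `b`, `c` and the scalar `s` are made up for the example.

```python
import math

b = [1.0, 4.0, 9.0, 16.0]
c = [1.0, 2.0, 3.0, 4.0]
s = 2.0

# Vector -> vector, one operand: a_i = f1(b_i)
roots = [math.sqrt(x) for x in b]

# Vector -> scalar (reduction): s = f2(A)
total = sum(b)
largest = max(b)

# Vector, vector -> vector: a_i = f3(b_i, c_i)
sums = [x + y for x, y in zip(b, c)]

# Scalar, vector -> vector: a_i = f4(s, c_i)
scaled = [s * y for y in c]

# Combination of the above: a_i = s * b_i + c_i
combined = [s * x + y for x, y in zip(b, c)]
```

Each list comprehension stands for what a single vector instruction would do for all elements at once.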
14. Vector computers: Vector instruction set advantages
- Compact
  - One short instruction encodes N operations (may be equivalent to an entire loop)
- Expressive
  - Each instruction tells the hardware that these N operations
    - are independent
    - use the same functional unit
    - access disjoint registers
    - access registers in the same pattern as previous instructions
    - access a contiguous block of memory (unit-stride load/store)
    - access memory in a known pattern (strided load/store)
- Scalable
  - The same object code can be run on more parallel pipelines or lanes
15. Vector computers: Stripmining
[Figure: theoretical throughput as a function of vector length.]
What happens when the vector length exceeds the size of the vector registers?
16. Vector computers: Stripmining
[Figure: performance of the Spert-II system on a dot product with unit-stride operands (32 vector registers). Source: K. Asanovic, "Vector Microprocessors".]
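Strip-mining handles vectors longer than a vector register by processing them in register-sized pieces. The sketch below assumes a maximum vector length `MVL` of 64 elements (an illustrative value) and uses a hypothetical `vector_add` function as a stand-in for a single vector instruction. Here the short leftover strip is processed last; real compilers often process it first.

```python
MVL = 64  # assumed maximum vector length (elements per vector register)


def vector_add(x, y):
    """Stand-in for one vector instruction: at most MVL elements at a time."""
    assert len(x) == len(y) <= MVL
    return [a + b for a, b in zip(x, y)]


def stripmined_add(x, y):
    """Add two vectors of any length as a sequence of strips of <= MVL elements."""
    n = len(x)
    result = []
    strips = 0
    for start in range(0, n, MVL):
        vl = min(MVL, n - start)            # vector length of this strip
        result.extend(vector_add(x[start:start + vl], y[start:start + vl]))
        strips += 1
    return result, strips


x = list(range(200))
y = list(range(200))
z, strips = stripmined_add(x, y)   # 200 elements -> strips of 64, 64, 64, 8
```

Every strip pays the start-up cost of its vector instructions again, which is one reason throughput does not grow smoothly with vector length.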
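17. Vector computers: Vector chaining
[Figure: chained computation of a*x + y. Elements of x enter the multiply pipeline together with the scalar a; the products a*x_i are passed directly into the add pipeline, where they are combined with the elements of y while later elements of x are still being multiplied.]
Vector chaining almost doubled the performance of the Cray-1, from 80 Mflops to 153 Mflops.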
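Why chaining can come close to doubling throughput is visible in a simple cycle-count model. The sketch assumes that a vector instruction on n elements takes its pipeline start-up time plus n cycles, and that with chaining the dependent add starts as soon as the first product leaves the multiplier. The start-up values of 7 and 6 cycles and the function names are illustrative assumptions.

```python
def unchained_cycles(n, startup_mul, startup_add):
    """The add waits until the whole multiply has finished."""
    return (startup_mul + n) + (startup_add + n)


def chained_cycles(n, startup_mul, startup_add):
    """The add consumes each product as soon as it leaves the multiplier."""
    return startup_mul + startup_add + n


n = 64
slow = unchained_cycles(n, startup_mul=7, startup_add=6)
fast = chained_cycles(n, startup_mul=7, startup_add=6)
speedup = slow / fast
```

With these numbers the unchained sequence takes 141 cycles and the chained one 77, a speed-up of roughly 1.8.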
18. Vector computers: Scatter and gather
- Sometimes only certain elements of a vector are needed in a computation
- If the elements to be used are in a regularly spaced pattern, the spacing between the elements to be gathered is called the stride
- Example
  - Elements extracted: $x_1, x_5, x_9, x_{13}, \ldots, x_{4\lfloor (n-1)/4 \rfloor + 1}$
  - from a vector $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, \ldots, x_n$
  - with a stride equal to 4
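A strided access such as the one in this example corresponds to a slice with a step in Python. The short sketch below uses a made-up vector of n = 14 labels and a hypothetical helper `strided_indices`; it also checks the closed form for the last selected index, 4*floor((n-1)/4) + 1.

```python
def strided_indices(n, stride=4):
    """1-based indices 1, 1 + stride, 1 + 2*stride, ... that do not exceed n."""
    return list(range(1, n + 1, stride))


n = 14
x = [f"x{i}" for i in range(1, n + 1)]   # x1, x2, ..., x14

picked = x[0::4]                          # every 4th element, starting at x1
last_index = 4 * ((n - 1) // 4) + 1       # closed form for the last index
```

For n = 14 the selected elements are x1, x5, x9, x13, and the closed form gives 13 as the last index.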
19. Vector computers: Scatter and gather
- Scatter and gather operations may also be used with irregularly spaced data
- Example: gather operation
  - Index vector: 1, 3, 4, 7
  - Source vector: a1, a2, a3, a4, a5, a6, a7, a8
  - Result: a1, a3, a4, a7
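Gather with an index vector, and its inverse scatter, can be sketched in a few lines. The functions below are illustrative, not a real machine's instructions; they use 1-based indices to match the example.

```python
def gather(source, index):
    """Collect source[i] for each 1-based index i into a dense vector."""
    return [source[i - 1] for i in index]


def scatter(values, index, target):
    """Write values[k] to position index[k] (1-based) of a copy of target."""
    out = list(target)
    for v, i in zip(values, index):
        out[i - 1] = v
    return out


a = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
index = [1, 3, 4, 7]

dense = gather(a, index)                             # a1, a3, a4, a7
updated = scatter(["b1", "b3", "b4", "b7"], index, a)
```

A typical use is sparse computation: gather the needed elements into a dense vector register, operate on them, then scatter the results back.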
20. Vector computers: Compress and expand
- Scatter and gather operations may also be used with irregularly spaced data
- Example: compress operation
  - Mask vector: 1, 0, 1, 1, 0, 0, 1, 0
  - Source vector: a1, a2, a3, a4, a5, a6, a7, a8
  - Result: a1, a3, a4, a7
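Compress packs the elements whose mask bit is 1 into consecutive positions; expand is the reverse. The sketch below uses hypothetical helper functions, and `expand` takes a `fill` vector that supplies the values for positions whose mask bit is 0 (one possible convention, assumed here).

```python
def compress(source, mask):
    """Keep the elements whose mask bit is 1, packed contiguously."""
    return [v for v, m in zip(source, mask) if m]


def expand(packed, mask, fill):
    """Place packed elements at positions with mask bit 1; use fill elsewhere."""
    it = iter(packed)
    return [next(it) if m else f for m, f in zip(mask, fill)]


a = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
mask = [1, 0, 1, 1, 0, 0, 1, 0]

packed = compress(a, mask)                 # a1, a3, a4, a7
restored = expand(packed, mask, a)         # original vector again
```

Unlike gather, compress needs only one bit per element to describe the selection.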
21. Vector computers: Vector conditional execution
- Vectorization of a loop with conditional code
    for (i = 0; i < N; i++)
        if (A[i] > 0) A[i] = B[i];
        else          A[i] = C[i];
- Use of a vector mask register (1 bit per element)
    lv    vA, rA        # Load A vector
    mgtz  m0, vA        # Set bits in mask register m0 where A > 0
    lv.m  vA, rB, m0    # Load B vector into A under mask
    fnot  m1, m0        # Invert mask register
    lv.m  vA, rC, m1    # Load C vector into A under mask
    sv    vA, rA        # Store A back to memory (no mask)
22. Vector computers: Vector conditional execution
    lv    vA, rA
    mgtz  m0, vA
    lv.m  vA, rB, m0
    fnot  m1, m0
    lv.m  vA, rC, m1
    sv    vA, rA

    Element    1   2   3   4   5   6   7   8
    Source A   5   0   1   0   0   2   3   4
    m0         1   0   1   0   0   1   1   1
    B          B1  B2  B3  B4  B5  B6  B7  B8
    m1         0   1   0   1   1   0   0   0
    C          C1  C2  C3  C4  C5  C6  C7  C8
    Result A   B1  C2  B3  C4  C5  B6  B7  B8
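The masked instruction sequence (lv, mgtz, lv.m, fnot, lv.m, sv) can be traced step by step in Python. This is a sketch of the semantics only: `masked_load` is a hypothetical stand-in for `lv.m`, and the values of A are sample data with symbolic labels for B and C.

```python
def masked_load(dest, src, mask):
    """lv.m: overwrite dest[i] with src[i] only where mask[i] is 1."""
    return [s if m else d for d, s, m in zip(dest, src, mask)]


def vector_select(A, B, C):
    """A[i] = B[i] if A[i] > 0 else C[i], computed with two masked loads."""
    vA = list(A)                          # lv   vA, rA
    m0 = [int(a > 0) for a in vA]         # mgtz m0, vA
    vA = masked_load(vA, B, m0)           # lv.m vA, rB, m0
    m1 = [1 - m for m in m0]              # fnot m1, m0
    vA = masked_load(vA, C, m1)           # lv.m vA, rC, m1
    return vA, m0, m1                     # sv   vA, rA would store vA


A = [5, 0, 1, 0, 0, 2, 3, 4]
B = [f"B{i}" for i in range(1, 9)]
C = [f"C{i}" for i in range(1, 9)]
result, m0, m1 = vector_select(A, B, C)
```

Both branches are executed for the full vector; the masks only decide which element positions each masked load is allowed to overwrite.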
23. Vector computers: Programming vector computers
- Assembly language programming
- Libraries
- Data-parallel languages
  - Support for data-parallel operations as an inherent part of the language (intrinsic operators and functions)
  - Fortran 90, High Performance Fortran
- Vectorizing compilers
  - Extensive loop dependence analysis
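Why a vectorizing compiler needs dependence analysis can be seen on a loop with a loop-carried dependence, where each iteration reads the value written by the previous one. The sketch below compares the sequential loop with a naive "vector" evaluation that reads all operands before writing any result, as a single vector add would. The function names and data are illustrative.

```python
def sequential(a, b):
    """for i in 1..n-1: a[i] = a[i-1] + b[i], executed in program order."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def naive_vector(a, b):
    """Same statement, but every right-hand side uses the OLD values of a."""
    return [a[0]] + [a[i - 1] + b[i] for i in range(1, len(a))]


a = [1, 0, 0, 0]
b = [0, 1, 1, 1]

expected = sequential(a, b)      # running sum: each element depends on the last
wrong = naive_vector(a, b)       # dependence ignored, so later elements differ
```

Here the two results differ, so the compiler must either prove that no such dependence exists or leave the loop scalar (or transform it).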
24. Vector computers: Vector processing applications
- Problems that can be efficiently formulated in terms of vectors
  - Long-range weather forecasting
  - Petroleum exploration
  - Seismic data analysis
  - Medical diagnosis
  - Aerodynamics and space flight simulations
  - Artificial intelligence and expert systems
  - Mapping the human genome
  - Image processing