Title: Examples of One-Dimensional Systolic Arrays
1Examples of One-Dimensional Systolic Arrays
2Motivation Introduction
- We need high-performance, special-purpose computer systems to meet the needs of specific applications.
- The imbalance between I/O and computation is a notable problem.
- The concept of systolic architecture can map high-level computation into hardware structures.
- A systolic system works like an automobile assembly line.
- A systolic system is easy to implement because of its regularity, and it is easy to reconfigure.
- Systolic architecture can result in cost-effective, high-performance, special-purpose systems for a wide range of problems.
3Pipelined Computations
4Pipelined Computations
- A pipelined program is divided into a series of tasks that have to be completed one after the other.
- Each task is executed by a separate pipeline stage.
- Data is streamed from stage to stage to form the computation.
5Pipelined Computations
- Computation consists of data streaming through pipeline stages.
- Execution time = time to fill the pipeline (P-1) + time to run in steady state (N-P+1) + time to empty the pipeline (P-1):
  $T = (P-1) + (N-P+1) + (P-1) = N + P - 1$
- P = number of processors, N = number of data items (assume P < N).
This slide must be explained in all detail. It is
very important
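The timing formula above can be checked with a small simulation. This is a minimal sketch, assuming one item enters per clock tick and every stage takes exactly one tick; the function name `pipeline_steps` is hypothetical and not part of the slides.

```python
def pipeline_steps(n_items, n_stages):
    """Simulate a linear pipeline; return (total ticks, ticks with all stages busy)."""
    stages = [None] * n_stages      # item currently held by each stage
    next_item = 0                   # index of the next item to feed in
    finished = 0                    # items that have passed the last stage
    steps = 0
    full_ticks = 0
    while finished < n_items:
        # One clock tick: every item advances one stage, a new item enters stage 0.
        incoming = next_item if next_item < n_items else None
        if incoming is not None:
            next_item += 1
        stages = [incoming] + stages[:-1]
        steps += 1
        if all(s is not None for s in stages):
            full_ticks += 1         # steady state: every stage is working
        if stages[-1] is not None:  # the last stage processed an item this tick
            finished += 1
    return steps, full_ticks


if __name__ == "__main__":
    N, P = 10, 4
    total, steady = pipeline_steps(N, P)
    print(total, steady)  # total matches N + P - 1, steady matches N - P + 1
```

With N = 10 and P = 4 the simulation takes 13 ticks, of which 7 are steady-state ticks; the remaining 3 + 3 ticks are the fill and empty phases.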
6Pipelined Example Sieve of Eratosthenes
- The goal is to take a list of integers greater than 1 and produce a list of primes.
- E.g., for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7.
- A pipelined approach:
  - Processor P_i divides each input by the i-th prime.
  - If the input is divisible (and not equal to the divisor), it is marked (with a negative sign) and forwarded.
  - If the input is not divisible, it is forwarded.
  - The last processor only forwards unmarked (positive) data (primes).
7Sieve of Eratosthenes Pseudo-Code
- Code for the last processor:
  - x = recv(data, P_(i-1))
  - If x > 0 then send(x, OUTPUT)
- Code for processor P_i (and prime p_i):
  - x = recv(data, P_(i-1))
  - If (x > 0) then
    - If (p_i divides x and p_i ≠ x) then send(-x, P_(i+1))
    - If (p_i does not divide x or p_i = x) then send(x, P_(i+1))
  - Else
    - send(x, P_(i+1))
Processor P_i divides each input by the i-th prime.
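The pseudo-code above can be tried out with a short sketch in which each processor is modeled as a Python generator. It models only the data flow (marking and forwarding), not the clocked timing; the names `stage`, `last_stage`, and `sieve_pipeline` are hypothetical.

```python
def stage(p, upstream):
    """Processor for prime p: mark proper multiples of p with a negative sign."""
    for x in upstream:
        if x > 0 and x % p == 0 and x != p:
            yield -x          # divisible and not the prime itself: mark it
        else:
            yield x           # already marked, or not divisible: forward as is


def last_stage(upstream):
    """Last processor: forward only unmarked (positive) values."""
    for x in upstream:
        if x > 0:
            yield x


def sieve_pipeline(numbers, primes):
    """Chain one stage per prime, then the filtering last stage."""
    stream = iter(numbers)
    for p in primes:
        stream = stage(p, stream)
    return list(last_stage(stream))


if __name__ == "__main__":
    print(sieve_pipeline(range(2, 11), [2, 3]))  # the slide's example input
```

For the input 2 3 4 5 6 7 8 9 10, two stages (for the primes 2 and 3) are enough to reproduce the output 2 3 5 7 given earlier.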
8Programming Issues
- The algorithm will take N+P-1 steps to run, where N is the number of data items and P is the number of processors.
- Can also consider just the odd numbers, or do some initial part separately.
- In the given implementation, the processors must store all primes that will appear in the sequence.
  - Not a scalable approach.
  - Can fix this by having each processor do the job of multiple primes, i.e., mapping several logical processors in the pipeline to each physical processor.
  - What is the impact of this on performance?
[Figure: each processor does the job of three primes.]
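The mapping of several logical processors onto one physical processor can be sketched as follows. This is an illustration under assumptions, not the slides' implementation: the names `physical_stage` and `blocked_sieve` and the parameter `primes_per_processor` are hypothetical, and the sketch models only the data flow.

```python
def physical_stage(prime_block, upstream):
    """One physical processor doing the job of several primes."""
    for x in upstream:
        # Mark x if any prime in this block divides it (and x is not that prime).
        if x > 0 and any(x % p == 0 and x != p for p in prime_block):
            yield -x
        else:
            yield x


def blocked_sieve(numbers, primes, primes_per_processor=3):
    """Return (primes found, number of physical processors used)."""
    k = primes_per_processor
    blocks = [primes[i:i + k] for i in range(0, len(primes), k)]
    stream = iter(numbers)
    for block in blocks:
        stream = physical_stage(block, stream)
    return [x for x in stream if x > 0], len(blocks)


if __name__ == "__main__":
    found, processors = blocked_sieve(range(2, 51), [2, 3, 5, 7])
    print(processors, found)
```

In this sketch the pipeline has fewer stages, but each physical stage performs up to three divisions per item, so each step is likely to take longer. That trade-off is one way to start answering the performance question above.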
9Processors for such operation
- In a pipelined algorithm, data moves through the processors in lockstep.
- The design attempts to balance the work so that there is no bottleneck at any processor.
- In the mid-80s, processors were developed to support this kind of parallel pipelined computation in hardware.
- Two commercial products from Intel:
  - Warp (1D array)
  - iWarp (components for 2D array)
- Warp and iWarp were meant to operate synchronously; the Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously,
  - i.e., the arrival of data would signal that it was time to execute.
10Systolic Arrays
11Example 1 pipelined polynomial evaluation
12Example 1 pipelined polynomial evaluation
- Polynomial Evaluation is done by using a Linear
array with 2D.
- Expression:
  - $Y = ((((a_n x + a_{n-1})x + a_{n-2})x + a_{n-3})x + \dots + a_1)x + a_0$
- Function of PEs in pairs:
  - 1. Multiply input by x.
  - 2. Pass result to right.
  - 3. Add a_j to result from left.
  - 4. Pass result to right.
13Example 1 polynomial evaluation
$Y = ((((a_n x + a_{n-1})x + a_{n-2})x + a_{n-3})x + \dots + a_1)x + a_0$
[Figure: linear array of alternating multiplying processors and adding processors. X is broadcast to the multiplying processors; a_n enters at the left and the coefficients a_{n-1}, a_{n-2}, ..., a_0 enter the adding processors.]
- Using a systolic array for polynomial evaluation.
- This pipelined array can produce the polynomial value for a new X on every cycle, once the 2n stages have been filled.
- In another variant, you can also calculate various polynomials on the same X.
- This is an example of a deeply pipelined computation.
- The pipeline has 2n stages.
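The multiply/add pipeline for polynomial evaluation can be simulated cycle by cycle. This is a minimal sketch under one stated assumption: each X value travels through the stages together with its partial result, so that a new X can enter on every cycle. The function name `systolic_poly_eval` is hypothetical.

```python
def systolic_poly_eval(coeffs, xs):
    """Evaluate a polynomial on each x with a 2n-stage multiply/add pipeline.

    coeffs = [a_n, ..., a_1, a_0]; returns (values, cycles used).
    """
    n = len(coeffs) - 1
    if n < 1:
        raise ValueError("need a polynomial of degree at least 1")
    n_stages = 2 * n
    regs = [None] * n_stages          # output register of each stage: (x, acc)
    feed = list(xs)
    results = []
    cycles = 0
    while len(results) < len(xs):
        # A new x enters with the accumulator initialized to a_n.
        incoming = (feed.pop(0), coeffs[0]) if feed else None
        new_regs = [None] * n_stages
        for s in range(n_stages):
            item = incoming if s == 0 else regs[s - 1]
            if item is None:
                continue
            x, acc = item
            if s % 2 == 0:
                acc = acc * x                      # multiplying processor
            else:
                acc = acc + coeffs[(s + 1) // 2]   # adding processor
            new_regs[s] = (x, acc)
        regs = new_regs
        cycles += 1
        if regs[-1] is not None:
            results.append(regs[-1][1])
    return results, cycles


if __name__ == "__main__":
    # 2x^2 + 3x + 1 evaluated at x = 0, 1, 2, 3
    print(systolic_poly_eval([2, 3, 1], [0, 1, 2, 3]))
```

For a degree-2 polynomial the first result appears after 2n = 4 cycles and one more result follows on every cycle, so four X values take 7 cycles in this model.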
14For you to think about
- Pipelined Graph Coloring
- Pipelined Satisfiability
- Pipelined sorting/absorbing
- Pipelined decision function like Petrick Function.
- Pipelined multiplication.
- Pipelined calculation of (A B) (C D) on
vectors A, B, C, D.
15Example 2Matrix Vector Multiplication
16Example 2Matrix Vector Multiplication
- There are many ways to solve matrix problems using systolic arrays. Some of the methods are:
  - Triangular array performing Gaussian elimination with neighbor pivoting.
  - Triangular array performing orthogonal triangularization.
- Simple matrix multiplication methods are shown in the next slides.
17Example 2Matrix Vector Multiplication
- Matrix-vector multiplication
- Each cell's function is:
  - 1. To multiply the top and bottom inputs.
  - 2. Add the left input to the product just obtained.
  - 3. Output the final result to the right.
- Each cell consists of an adder and a few registers (Booth algorithm for multiplication).
- Or, a cell can include a hardware multiplier.
18Matrix Multiplication
Example 2Matrix Vector Multiplication
- At time t0 the array receives 1, a, p, q, and r (the other inputs are all zero).
- At time t1, the array receives m, d, b, p, q, and r, etc.
- The results emerge after 5 steps.
19
- Explain how to multiply the first row of the matrix by the vector,
- how data are shifted from left to right in the architecture.
To visualize how it works, it is good to do a snapshot animation.
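The cell function described for matrix-vector multiplication (multiply the top and bottom inputs, add the left input, pass the result to the right) can be simulated step by step. This is a sketch under assumptions: the vector elements stay fixed at the bottom inputs, the matrix rows enter from the top skewed by one step per cell, and the optional left inputs `b` are added to the result. The function name `systolic_matvec` and the per-step snapshots are hypothetical additions.

```python
def systolic_matvec(A, x, b=None):
    """Compute y = A x + b on a linear array of len(x) cells.

    Returns (y, snapshots), where snapshots[t] holds the cell outputs after step t.
    """
    n_rows, n = len(A), len(x)
    if b is None:
        b = [0] * n_rows
    right = [0] * n                   # value each cell passed to the right last step
    y, snapshots = [], []
    for t in range(n_rows + n - 1):
        new_right = [0] * n
        for j in range(n):
            i = t - j                 # row handled by cell j at step t (skewed input)
            valid = 0 <= i < n_rows
            top = A[i][j] if valid else 0
            if j == 0:
                left = b[i] if valid else 0
            else:
                left = right[j - 1]   # partial sum arriving from the left neighbor
            new_right[j] = left + top * x[j]
        right = new_right
        snapshots.append(list(right))
        if t >= n - 1:
            y.append(right[-1])       # a finished result leaves the last cell
    return y, snapshots


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    y, snaps = systolic_matvec(A, [1, 0, 2])
    for t, s in enumerate(snaps):
        print("t%d" % t, s)           # one snapshot per step
    print(y)
```

For a 3x3 matrix the loop runs 3 + 3 - 1 = 5 steps, which agrees with the statement that the results emerge after 5 steps. The printed snapshots show the partial sum of the first row moving from left to right.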
20Systolic Algorithms and Architectures
21Systolic Algorithms
- Systolic arrays were built to support systolic algorithms, a hot area of research in the early 80s.
- Systolic algorithms used pipelining through various kinds of arrays to accomplish computational goals.
- Some of the data streaming and applications were very creative and quite complex.
- CMU was a hotbed of systolic algorithm and array research (especially H.T. Kung and his group).
22Systolic Arrays from Intel
- Warp and iWarp were examples of systolic arrays.
- Systolic means regular and rhythmic:
  - data was supposed to move through pipelined computational units in a regular and rhythmic fashion.
- Systolic arrays were meant to be special-purpose processors or co-processors.
- They were very fine-grained:
  - Processors, usually called cells, implement a limited and very simple computation.
  - Communication is very fast; granularity is meant to be around one operation per communication!
23Systolic Processors, versus Cellular Automata
versus Regular Networks of Automata
[Figure: a systolic processor drawn as a row of four Data Path Blocks; a cellular automaton drawn as a row of four Control Blocks.]
These slides are for the one-dimensional case only.
24Systolic Processors, versus Cellular Automata
versus Regular Networks of Automata
[Figure: a regular network of automata drawn as a row of four Control Blocks and a row of four Data Path Blocks.]