Parallelization of ADALINE Networks

Transcript and Presenter's Notes
1
Parallelization of ADALINE Networks
  • Wook-Jin Chung
  • May 24th, 2007

2
What is an ADALINE Network?
  • Stands for Adaptive Linear Element
  • A network (relationship) of linear systems
    tuned to yield desired results.
  • Yj = Σn (Wnj · Xn) + b (see the sketch below)
  • W = weight, b = bias
  • Weights describe a network and their values are
    adaptively adjusted (learning)
  • Used in neural nets for
  • Adaptive filtering
  • Pattern recognition
  • Learn more at
  • Widrow, B. and Winter, R. "Neural Nets for Adaptive
    Filtering and Adaptive Pattern Recognition."
    Computer, vol. 21, no. 3, Mar. 1988, pp. 25-39.
  • http://en.wikipedia.org/wiki/Artificial_neural_network#Choosing_a_cost_function
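
As a concrete illustration of Yj = Σn (Wnj · Xn) + b, here is a minimal C
sketch of a single ADALINE unit. The function name, the integer types, and
the hard threshold to ±1 are illustrative assumptions, not code from the
presentation.

  /* Minimal sketch (assumptions noted above): one ADALINE unit computing
   * the weighted sum of its inputs plus a bias, thresholded to -1 or +1. */
  int adaline_output(const int *x, const int *w, int b, int n_inputs)
  {
      long sum = b;                       /* start from the bias           */
      for (int i = 0; i < n_inputs; i++)
          sum += (long)w[i] * x[i];       /* weighted sum over all inputs  */
      return (sum >= 0) ? 1 : -1;         /* hard threshold to +1 / -1     */
  }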

3
ADALINE Pattern Recognition
(Diagram: N x M one-stage network; input nodes X1..Xn feed
output nodes Y1..Ym through weighted links W11..Wnm; one
output node reads TRUE, the rest FALSE.)
  • N x M 1-stage network (M = number of patterns to
    identify)
  • Each input node (Xi) reads in some portion of the
    input pattern
  • Deterministic function renders it either -1 or 1
  • Every link has a weight (Wnm)
  • Each output node (Yj) sums all weighted inputs
  • Only one node results in TRUE to identify pattern
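
A sketch of how such an N x M one-stage network could identify a pattern
follows. Choosing the output node with the largest weighted sum as the
single TRUE node is an assumption; the slides do not spell out the exact
decision rule.

  /* Sketch only: forward pass of an N x M one-stage ADALINE network.
   * Weights are stored row-major as w[m * n_inputs + n]; the single TRUE
   * output is taken to be the node with the largest weighted sum. */
  int recognize(const int *x, const int *w, const int *bias,
                int n_inputs, int n_outputs)
  {
      int best = 0;
      long best_sum = 0;
      for (int m = 0; m < n_outputs; m++) {
          long sum = bias[m];
          for (int n = 0; n < n_inputs; n++)
              sum += (long)w[m * n_inputs + n] * x[n];  /* Yj sums all weighted inputs */
          if (m == 0 || sum > best_sum) {
              best_sum = sum;                           /* remember strongest output   */
              best = m;
          }
      }
      return best;                                      /* index of the TRUE node      */
  }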

4
Using ADALINE Networks
  • Initialize
  • Assign random weights to all links
  • Training
  • Feed in known inputs in random sequence
  • Simulate the network
  • Compute the error between the expected and the actual
    output (Error Function)
  • Adjust weights (Learning Function; see the LMS sketch
    at the end of this slide)
  • Repeat until total error < ε
  • Thinking
  • Simulate the network
  • Network will respond to any input
  • Does not guarantee a correct solution even for
    trained inputs

(Flow diagram: Initialize → Training → Thinking)
More information on these equations can be found at
http://en.wikipedia.org/wiki/Artificial_neural_network#Choosing_a_cost_function
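
The Learning Function is not spelled out on the slides; the sketch below
uses the standard Widrow-Hoff (LMS) rule that ADALINE networks are usually
trained with. The learning rate of 1 and the integer types are assumptions
made purely for illustration.

  /* Sketch of the Widrow-Hoff (LMS) update for one output node:
   * w_i += eta * (desired - actual) * x_i, with eta = 1 here (assumption). */
  void adjust_weights(int *w, int *b, const int *x, int n_inputs,
                      int desired, int actual)
  {
      int err = desired - actual;          /* error term (d - y)             */
      for (int i = 0; i < n_inputs; i++)
          w[i] += err * x[i];              /* move each weight toward target */
      *b += err;                           /* bias updated the same way      */
  }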
5
Sample Output
(Image: f4.bmp, sample recognition output)
  • Acceptable errors (?)
  • There is some reason for each wrong recognition

6
Training Code
  // Essentially this simulates the neural net with a known input and a
  // corresponding output. The number of loop iterations is undetermined.
  do {
      n = rand();                   // pick a random pattern for training
      SetInput(Pattern[n]);         // sets the input layer's output (1 or -1 on the links)
      PropagateNet();               // sum of (link value * link weight)
      GetOutput();                  // interpret the output of the net: recognized pattern or invalid output
      e = ComputeError(Answer[n]);  // compute the error of the network based on the known answers and the net's results
      if (e > epsilon)              // if error is greater than margin,
          AdjustWeights();          // adjust the link weights
  } while (e > epsilon);
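
ComputeError() is not shown on the slides; one plausible definition, used
here only as an illustration, is the sum of squared differences between
the net's outputs and the known answer.

  /* Sketch only: a sum-of-squared-differences error over all output nodes.
   * The actual error function used in the presentation is not shown. */
  long compute_error(const int *y, const int *answer, int n_outputs)
  {
      long e = 0;
      for (int j = 0; j < n_outputs; j++) {
          long d = (long)answer[j] - y[j];   /* per-node difference      */
          e += d * d;                        /* accumulate squared error */
      }
      return e;
  }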

7
Training Code Parallelization
  • Net Divide: since each iteration depends on the
    previous one, we must parallelize within the network
    (divide nodes between cores)

  do {
      n = rand();                   // one thread only, of course!
      BARRIER();                    // cannot change input before all done
      SetInput(Pattern[n]);         // N/P
      BARRIER();                    // cannot start until all input ready
      PropagateNet();               // (M/P)*N + N
      GetOutput();                  // M/P
      e = ComputeError(Answer[n]);  // M/P + M/P + M/P
      if (e > epsilon)
          AdjustWeights();          // (M/P)*N*3 → heaviest computation
  } while (e > epsilon);
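
One way to realize the Net Divide structure is with POSIX threads and a
pthread barrier, as in the sketch below. The thread count P, the slicing
of nodes, and the stubbed-out per-phase work are assumptions rather than
the code actually run on Niagara.

  /* Sketch only: the barrier skeleton of Net Divide with POSIX threads.
   * Each of P threads would own N/P input nodes and M/P output nodes;
   * the per-phase work is left as comments. */
  #include <pthread.h>
  #include <stdio.h>

  #define P 8                               /* worker threads (assumption)         */

  static pthread_barrier_t bar;

  static void *train_worker(void *arg)
  {
      int id = (int)(long)arg;
      pthread_barrier_wait(&bar);           /* cannot change input before all done */
      /* SetInput(): write this thread's N/P input-node outputs                    */
      pthread_barrier_wait(&bar);           /* cannot start until all input ready  */
      /* PropagateNet()/GetOutput()/ComputeError()/AdjustWeights() on M/P nodes    */
      printf("thread %d finished one training step\n", id);
      return NULL;
  }

  int main(void)
  {
      pthread_t t[P];
      pthread_barrier_init(&bar, NULL, P);
      for (long i = 0; i < P; i++)
          pthread_create(&t[i], NULL, train_worker, (void *)i);
      for (int i = 0; i < P; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&bar);
      return 0;
  }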

8
Training Code Analysis (Net Divide)
  • Barrier() cost ≈ 2(2o + 2L) + g (see the spin-barrier
    sketch at the end of this slide)
  • L = latency to L2 cache (Niagara)
  • o = spin-loop overhead
  • g = negligible
  • P scales only up to M
  • Amdahl's Law: the two barriers are costly, but the
    significant amount of computation enlarges the
    parallelizable region
  • Communication / computation overlap is not possible
  • All-to-all communication after SetInput(Pattern[n])
  • Only need 4N bytes (or 2N bytes if using short)
  • Good spatial and temporal locality
  • Gather operation during ComputeError()
  • Need 4P bytes
  • Weights remain in cache after being adjusted and
    are independent for each output node
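
The 2(2o + 2L) + g model describes a centralized spin barrier. The sketch
below shows one such sense-reversing barrier using C11 atomics, purely to
illustrate where the latency L and the spin-loop overhead o come from; it
is not the barrier code used in the experiments.

  /* Sketch only: a centralized sense-reversing spin barrier. Each thread
   * keeps its own local_sense, initialized to 0. */
  #include <stdatomic.h>

  #define P 8                                    /* participating threads (assumption) */

  static atomic_int count = P;
  static atomic_int sense = 0;

  void spin_barrier(int *local_sense)
  {
      *local_sense = !*local_sense;              /* flip this thread's sense           */
      if (atomic_fetch_sub(&count, 1) == 1) {    /* last arriver resets and releases   */
          atomic_store(&count, P);
          atomic_store(&sense, *local_sense);    /* release write: takes ~L to reach the others */
      } else {
          while (atomic_load(&sense) != *local_sense)
              ;                                  /* spin-loop overhead (o)             */
      }
  }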

9
Thinking Code Parallelization
  • Loop Divide: since each iteration is independent
  • Net Divide is also possible

  for (many, many loops) {
      n = rand();
      BARRIER();                    // only needed for Net Divide
      SetInput(Pattern[n]);         // N/P
      BARRIER();                    // only needed for Net Divide
      PropagateNet();               // (M/P)*N + N
      GetOutput();                  // M/P
      e = ComputeError(Answer[n]);
      if (e > epsilon)
          AdjustWeights();
  }
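
For Loop Divide, a short OpenMP sketch shows the idea: independent Thinking
iterations are simply split across threads with no barriers. The schedule,
the pattern selection, and the stubbed simulation routine are assumptions,
not the presenter's code.

  /* Sketch only: Loop Divide for the Thinking phase using OpenMP.
   * think_one_input() stands in for SetInput + PropagateNet + GetOutput. */
  #include <omp.h>

  #define TOTAL_RUNS 1648000              /* Thinking runs (from the setup slide)    */
  #define NUM_PATTERNS 16                 /* trained patterns (from the setup slide) */

  static void think_one_input(int n)
  {
      (void)n;                            /* placeholder for the real simulation     */
  }

  void thinking_loop_divide(void)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < TOTAL_RUNS; i++)
          think_one_input(i % NUM_PATTERNS);   /* each thread runs its own iterations, no barriers */
  }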

10
Thinking Code Analysis
  • Net Divide
  • All characteristics of the training code remain
  • But computation has decreased significantly, so less
    speedup is expected
  • Loop Divide
  • No barriers
  • Temporal locality in link values, as one thread sets
    all outputs of the input layer → 4N bytes used M times
  • Temporal locality in weights → 4M bytes used M times
  • Sharing cores has positive interference
  • Sky's the limit!

11
Experimental Setup
  • System
  • Niagara processor (8 cores, 32 threads)
  • Code
  • Converted floating-point operations to integer (see
    the fixed-point sketch at the end of this slide)
  • 77 x 16 neural network (N x M)
  • Training with 16 patterns (each pattern tile is
    7x11)
  • 1,648,000 runs of Thinking
  • Varied Latency
  • Since barriers are the predominant overhead
  • CMP: Niagara default
  • SMP: 2x barrier overhead
  • Cluster: 5x barrier overhead
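
The floating-point-to-integer conversion mentioned above could be done in
several ways; a fixed-point representation such as Q16.16 is one common
option, shown here only as an assumption since the presentation does not
say which scheme was used.

  /* Sketch only: Q16.16 fixed-point arithmetic as one way to replace
   * floating-point weight math with integer operations. */
  #include <stdint.h>

  #define FP_SHIFT 16                               /* fractional bits          */

  typedef int32_t fixed;

  static inline fixed fp_from_int(int x)  { return (fixed)(x * (1 << FP_SHIFT)); }
  static inline fixed fp_mul(fixed a, fixed b)
  {
      return (fixed)(((int64_t)a * b) >> FP_SHIFT); /* widen, multiply, rescale */
  }

  /* Example: accumulate w * x with inputs x in {-1, +1}. */
  static fixed weighted_sum(const fixed *w, const int *x, int n_inputs)
  {
      fixed sum = 0;
      for (int i = 0; i < n_inputs; i++)
          sum += fp_mul(w[i], fp_from_int(x[i]));
      return sum;
  }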

12
Experimental Results
13
Experimental Results
14
Experimental Results
15
Insights
  • Synchronization (barrier) is the predominant
    overhead for parallelization
  • The more computation-intensive the training is, the
    more speedup can be expected
  • Actual simulation of the net (Thinking) is not
    computation-intensive
  • Multi-stage networks should behave similarly, as they
    are just multiple 1-stage networks concatenated
  • Of course, with a larger memory footprint

16
Hardware Implications
  • As P grows to 100s or 1000s, having a low latency
    (L) is critical
  • Cluster or SMP → impractical or impossible
  • CMP
  • Reasonable up to a certain number of cores
    (communicating via cache)
  • A dedicated wire for barrier synchronization is bound
    to help
  • On-chip bandwidth between cores is enough
  • Computation is not as intensive and interference is
    positive → share cores
  • With multi-stage networks the memory footprint grows,
    so sharing is even more beneficial
  • A Niagara-like processor with abundant FPUs and
    hardware thread space is the ideal environment →
    Niagara II specs are tempting!!

17
High-Level Language Implications
  • Intuitive representation of layers, nodes, and links
  • Automatic parallelization is very plausible
  • Easy and effective representation of the barrier
  • Automatic barrier insertion at each stage
  • Suggestions for the degree of core sharing depending
    on computational intensiveness
  • Easy switch between latency (Net Divide) and
    throughput (Loop Divide) mode

18
Conclusion
  • Neural nets are useful in computer vision, AI,
    etc.
  • It makes sense to accelerate neural nets with
    multiple cores
  • Communication bandwidth is not the bottleneck
  • CMPs with FPUs, hardware threads, and core sharing
    are ideal
  • Dedicated synchronization wires will help greatly

Neural net sample code: http://www.ip-atlas.com/pub/nap/nn-src/