Parallelization of ADALINE Networks

Transcript and Presenter's Notes
1
Parallelization of ADALINE Networks
  • Wook-Jin Chung
  • May 24th, 2007

2
What is an ADALINE Network?
  • Stands for Adaptive Linear Element
  • A network (relationship) of linear systems
    tuned to yield desired results.
  • Yj = Σn (Wnj · Xn) + b (see the sketch below)
  • W = weight, b = bias
  • Weights describe a network and their values are
    adaptively adjusted (learning)
  • Used in neural nets for
  • Adaptive filtering
  • Pattern recognition
  • Learn more at
  • Widrow, B. and Winter, R. "Neural Nets for Adaptive
    Filtering and Adaptive Pattern Recognition."
    Computer, vol. 21, no. 3, Mar. 1988, pp. 25-39.
  • http://en.wikipedia.org/wiki/Artificial_neural_network#Choosing_a_cost_function
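
As a concrete illustration of Yj = Σn (Wnj · Xn) + b, here is a minimal C
sketch of a single ADALINE unit. The function name, the integer types, and
the hard threshold to ±1 are illustrative assumptions, not code from the
presentation.

  /* Minimal sketch (assumptions noted above): one ADALINE unit computing
   * the weighted sum of its inputs plus a bias, thresholded to -1 or +1. */
  int adaline_output(const int *x, const int *w, int b, int n_inputs)
  {
      long sum = b;                       /* start from the bias           */
      for (int i = 0; i < n_inputs; i++)
          sum += (long)w[i] * x[i];       /* weighted sum over all inputs  */
      return (sum >= 0) ? 1 : -1;         /* hard threshold to +1 / -1     */
  }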

3
ADALINE Pattern Recognition
(Diagram: N x M one-stage network; input nodes X1..Xn feed
output nodes Y1..Ym through weighted links W11..Wnm; one
output node reads TRUE, the rest FALSE.)
  • N x M 1-stage network (M = number of patterns to
    identify)
  • Each input node (Xi) reads in some portion of the
    input pattern
  • Deterministic function renders it either -1 or 1
  • Every link has a weight (Wnm)
  • Each output node (Yj) sums all weighted inputs
  • Only one node results in TRUE to identify pattern
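
A sketch of how such an N x M one-stage network could identify a pattern
follows. Choosing the output node with the largest weighted sum as the
single TRUE node is an assumption; the slides do not spell out the exact
decision rule.

  /* Sketch only: forward pass of an N x M one-stage ADALINE network.
   * Weights are stored row-major as w[m * n_inputs + n]; the single TRUE
   * output is taken to be the node with the largest weighted sum. */
  int recognize(const int *x, const int *w, const int *bias,
                int n_inputs, int n_outputs)
  {
      int best = 0;
      long best_sum = 0;
      for (int m = 0; m < n_outputs; m++) {
          long sum = bias[m];
          for (int n = 0; n < n_inputs; n++)
              sum += (long)w[m * n_inputs + n] * x[n];  /* Yj sums all weighted inputs */
          if (m == 0 || sum > best_sum) {
              best_sum = sum;                           /* remember strongest output   */
              best = m;
          }
      }
      return best;                                      /* index of the TRUE node      */
  }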

4
Using ADALINE Networks
  • Initialize
  • Assign random weights to all links
  • Training
  • Feed in known inputs in random sequence
  • Simulate the network
  • Compute the error between the expected and the actual
    output (Error Function)
  • Adjust weights (Learning Function; see the LMS sketch
    at the end of this slide)
  • Repeat until total error < ε
  • Thinking
  • Simulate the network
  • Network will respond to any input
  • Does not guarantee a correct solution even for
    trained inputs

(Flow diagram: Initialize → Training → Thinking)
More information on these equations can be found at
http://en.wikipedia.org/wiki/Artificial_neural_network#Choosing_a_cost_function
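
The Learning Function is not spelled out on the slides; the sketch below
uses the standard Widrow-Hoff (LMS) rule that ADALINE networks are usually
trained with. The learning rate of 1 and the integer types are assumptions
made purely for illustration.

  /* Sketch of the Widrow-Hoff (LMS) update for one output node:
   * w_i += eta * (desired - actual) * x_i, with eta = 1 here (assumption). */
  void adjust_weights(int *w, int *b, const int *x, int n_inputs,
                      int desired, int actual)
  {
      int err = desired - actual;          /* error term (d - y)             */
      for (int i = 0; i < n_inputs; i++)
          w[i] += err * x[i];              /* move each weight toward target */
      *b += err;                           /* bias updated the same way      */
  }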
5
Sample Output
(Image: f4.bmp, sample recognition output)
  • Acceptable errors (?)
  • There is some reason for each wrong recognition

6
Training Code
  // Essentially this simulates the neural net with a known input and a
  // corresponding output. The number of loop iterations is undetermined.
  do {
      n = rand();                   // pick a random pattern for training
      SetInput(Pattern[n]);         // sets the input layer's output (1 or -1 on the links)
      PropagateNet();               // sum of (link value * link weight)
      GetOutput();                  // interpret the output of the net: recognized pattern or invalid output
      e = ComputeError(Answer[n]);  // compute the error of the network based on the known answers and the net's results
      if (e > epsilon)              // if error is greater than margin,
          AdjustWeights();          // adjust the link weights
  } while (e > epsilon);
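
ComputeError() is not shown on the slides; one plausible definition, used
here only as an illustration, is the sum of squared differences between
the net's outputs and the known answer.

  /* Sketch only: a sum-of-squared-differences error over all output nodes.
   * The actual error function used in the presentation is not shown. */
  long compute_error(const int *y, const int *answer, int n_outputs)
  {
      long e = 0;
      for (int j = 0; j < n_outputs; j++) {
          long d = (long)answer[j] - y[j];   /* per-node difference      */
          e += d * d;                        /* accumulate squared error */
      }
      return e;
  }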

7
Training Code Parallelization
  • Net Divide: since each iteration depends on the
    previous one, we must parallelize within the network
    (divide nodes between cores)

  do {
      n = rand();                   // one thread only, of course!
      BARRIER();                    // cannot change input before all done
      SetInput(Pattern[n]);         // N/P
      BARRIER();                    // cannot start until all input ready
      PropagateNet();               // (M/P)*N + N
      GetOutput();                  // M/P
      e = ComputeError(Answer[n]);  // M/P + M/P + M/P
      if (e > epsilon)
          AdjustWeights();          // (M/P)*N*3 → heaviest computation
  } while (e > epsilon);
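
One way to realize the Net Divide structure is with POSIX threads and a
pthread barrier, as in the sketch below. The thread count P, the slicing
of nodes, and the stubbed-out per-phase work are assumptions rather than
the code actually run on Niagara.

  /* Sketch only: the barrier skeleton of Net Divide with POSIX threads.
   * Each of P threads would own N/P input nodes and M/P output nodes;
   * the per-phase work is left as comments. */
  #include <pthread.h>
  #include <stdio.h>

  #define P 8                               /* worker threads (assumption)         */

  static pthread_barrier_t bar;

  static void *train_worker(void *arg)
  {
      int id = (int)(long)arg;
      pthread_barrier_wait(&bar);           /* cannot change input before all done */
      /* SetInput(): write this thread's N/P input-node outputs                    */
      pthread_barrier_wait(&bar);           /* cannot start until all input ready  */
      /* PropagateNet()/GetOutput()/ComputeError()/AdjustWeights() on M/P nodes    */
      printf("thread %d finished one training step\n", id);
      return NULL;
  }

  int main(void)
  {
      pthread_t t[P];
      pthread_barrier_init(&bar, NULL, P);
      for (long i = 0; i < P; i++)
          pthread_create(&t[i], NULL, train_worker, (void *)i);
      for (int i = 0; i < P; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&bar);
      return 0;
  }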

8
Training Code Analysis (Net Divide)
  • Barrier() cost ≈ 2(2o + 2L) + g (see the spin-barrier
    sketch at the end of this slide)
  • L = latency to L2 cache (Niagara)
  • o = spin-loop overhead
  • g = negligible
  • P scales only up to M
  • Amdahl's Law: the two barriers are costly, but the
    significant amount of computation enlarges the
    parallelizable region
  • Communication / computation overlap is not possible
  • All-to-all communication after SetInput(Pattern[n])
  • Only need 4N bytes (or 2N bytes if using short)
  • Good spatial and temporal locality
  • Gather operation during ComputeError()
  • Need 4P bytes
  • Weights remain in cache after being adjusted and
    are independent for each output node
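
The 2(2o + 2L) + g model describes a centralized spin barrier. The sketch
below shows one such sense-reversing barrier using C11 atomics, purely to
illustrate where the latency L and the spin-loop overhead o come from; it
is not the barrier code used in the experiments.

  /* Sketch only: a centralized sense-reversing spin barrier. Each thread
   * keeps its own local_sense, initialized to 0. */
  #include <stdatomic.h>

  #define P 8                                    /* participating threads (assumption) */

  static atomic_int count = P;
  static atomic_int sense = 0;

  void spin_barrier(int *local_sense)
  {
      *local_sense = !*local_sense;              /* flip this thread's sense           */
      if (atomic_fetch_sub(&count, 1) == 1) {    /* last arriver resets and releases   */
          atomic_store(&count, P);
          atomic_store(&sense, *local_sense);    /* release write: takes ~L to reach the others */
      } else {
          while (atomic_load(&sense) != *local_sense)
              ;                                  /* spin-loop overhead (o)             */
      }
  }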

9
Thinking Code Parallelization
  • Loop Divide: since each iteration is independent
  • Net Divide is also possible

  for (many, many loops) {
      n = rand();
      BARRIER();                    // only needed for Net Divide
      SetInput(Pattern[n]);         // N/P
      BARRIER();                    // only needed for Net Divide
      PropagateNet();               // (M/P)*N + N
      GetOutput();                  // M/P
      e = ComputeError(Answer[n]);
      if (e > epsilon)
          AdjustWeights();
  }
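
For Loop Divide, a short OpenMP sketch shows the idea: independent Thinking
iterations are simply split across threads with no barriers. The schedule,
the pattern selection, and the stubbed simulation routine are assumptions,
not the presenter's code.

  /* Sketch only: Loop Divide for the Thinking phase using OpenMP.
   * think_one_input() stands in for SetInput + PropagateNet + GetOutput. */
  #include <omp.h>

  #define TOTAL_RUNS 1648000              /* Thinking runs (from the setup slide)    */
  #define NUM_PATTERNS 16                 /* trained patterns (from the setup slide) */

  static void think_one_input(int n)
  {
      (void)n;                            /* placeholder for the real simulation     */
  }

  void thinking_loop_divide(void)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < TOTAL_RUNS; i++)
          think_one_input(i % NUM_PATTERNS);   /* each thread runs its own iterations, no barriers */
  }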

10
Thinking Code Analysis
  • Net Divide
  • All characteristics of the training code remain
  • But computation has decreased significantly, so less
    speedup is expected
  • Loop Divide
  • No barriers
  • Temporal locality in link values, as one thread sets
    all outputs of the input layer → 4N bytes used M times
  • Temporal locality in weights → 4M bytes used M times
  • Sharing cores has positive interference
  • Sky's the limit!

11
Experimental Setup
  • System
  • Niagara processor (8 cores, 32 threads)
  • Code
  • Converted floating-point operations to integer (see
    the fixed-point sketch at the end of this slide)
  • 77 x 16 neural network (N x M)
  • Training with 16 patterns (each pattern tile is
    7x11)
  • 1,648,000 runs of Thinking
  • Varied Latency
  • Since barriers are the predominant overhead
  • CMP: Niagara default
  • SMP: 2x barrier overhead
  • Cluster: 5x barrier overhead
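
The floating-point-to-integer conversion mentioned above could be done in
several ways; a fixed-point representation such as Q16.16 is one common
option, shown here only as an assumption since the presentation does not
say which scheme was used.

  /* Sketch only: Q16.16 fixed-point arithmetic as one way to replace
   * floating-point weight math with integer operations. */
  #include <stdint.h>

  #define FP_SHIFT 16                               /* fractional bits          */

  typedef int32_t fixed;

  static inline fixed fp_from_int(int x)  { return (fixed)(x * (1 << FP_SHIFT)); }
  static inline fixed fp_mul(fixed a, fixed b)
  {
      return (fixed)(((int64_t)a * b) >> FP_SHIFT); /* widen, multiply, rescale */
  }

  /* Example: accumulate w * x with inputs x in {-1, +1}. */
  static fixed weighted_sum(const fixed *w, const int *x, int n_inputs)
  {
      fixed sum = 0;
      for (int i = 0; i < n_inputs; i++)
          sum += fp_mul(w[i], fp_from_int(x[i]));
      return sum;
  }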

12
Experimental Results
13
Experimental Results
14
Experimental Results
15
Insights
  • Synchronization (barrier) is the predominant
    overhead for parallelization
  • The more computation-intensive the training is, the
    more speedup can be expected
  • Actual simulation of the net (Thinking) is not
    computation-intensive
  • Multi-stage networks should behave similarly, as they
    are just multiple 1-stage networks concatenated
  • Of course, with a larger memory footprint

16
Hardware Implications
  • As P grows to 100s or 1000s, having a low latency
    (L) is critical
  • Cluster or SMP → impractical or impossible
  • CMP
  • Reasonable up to a certain number of cores
    (communicating via cache)
  • A dedicated wire for barrier synchronization is bound
    to help
  • On-chip bandwidth between cores is enough
  • Computation is not as intensive and interference is
    positive → share cores
  • With multi-stage networks the memory footprint grows,
    so sharing is even more beneficial
  • A Niagara-like processor with abundant FPUs and
    hardware thread space is the ideal environment →
    Niagara II specs are tempting!!

17
High-Level Language Implications
  • Intuitive representation of layers, nodes, and links
  • Automatic parallelization is very plausible
  • Easy and effective representation of the barrier
  • Automatic barrier insertion at each stage
  • Suggestions for the degree of core sharing depending
    on computational intensiveness
  • Easy switch between latency (Net Divide) and
    throughput (Loop Divide) mode

18
Conclusion
  • Neural nets are useful in computer vision, AI,
    etc.
  • It makes sense to accelerate neural nets with
    multiple cores
  • Communication bandwidth is not the bottleneck
  • CMPs with FPUs, hardware threads, and core sharing
    are ideal
  • Dedicated synchronization wires will help greatly

Neural net sample code: http://www.ip-atlas.com/pub/nap/nn-src/