Title: A NoC mapped Generic, Reconfigurable, Fault Tolerant Neural Network Processor
1. A NoC mapped Generic, Reconfigurable, Fault Tolerant Neural Network Processor
- T. Theocharides, G. Link, J.M. Kim
- Embedded and Mobile Computing Center
- Department of Computer Science and Engineering
- The Pennsylvania State University
- December 11th, 2003
2. Outline
- Introduction: Short NN background
- Motivation
- Why hardware?
- Existing hardware work
- Network Perspective
- Application Perspective
- Initial Proposed System Architecture
- Experimental Platform and Results
- Synthesis/Hardware metrics
- Comparative Performance
- Fault Tolerant System Architecture
- Conclusion / Future work
3. Artificial Neural Networks (ANNs)
- ANN is an information processing paradigm
- Inspired by the way biological nervous systems, such as the human brain, process information
- Novel structure of an information processing system
- It is composed of a large number of highly interconnected processing elements called neurons
- Neurons work in unison to solve specific problems
- ANNs, like people, learn by example
- ANNs are used in digital signal processing, pattern recognition, data classification, control systems, and many more applications
4. ANNs
- ANNs are configured for a specific application through a learning process
- Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons
- The same holds for ANNs
- ANNs receive a set of data they have already been trained to process
- They output the desired result of the computation performed
- The output depends on the application and the training they are exposed to
5. Artificial vs. Biological Neural Networks
- The human brain functions by learning, adapting, and then solving problems
- Humans learn either from errors or from directions given
- The human brain consists of billions of neurons
- Neurons are connected to each other in a very dense manner
- Artificial Neural Networks attempt to emulate a small part of the brain (impossible to emulate billions!)
- They consist of artificial neurons
- Artificial Neurons trained for problem solving
- Layers of neurons make up a neural network
6. The Neuron
Human Neuron vs. Artificial Neuron
- Neurons receive a set of inputs, X0-Xn
- Each input is associated with a pre-determined weight, W0-Wn
- Neurons accumulate the sum of the products of each input and its weight
- The accumulated sum, minus a preset threshold, is passed through an activation function (sketched below)
- Activation functions include various common functions, depending on the application
- The final output is propagated to other neurons
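As a minimal sketch (the names are illustrative, not taken from the hardware design), the computation the bullets describe is:

    def neuron_output(inputs, weights, threshold, activation):
        # Accumulate the sum of input*weight products, subtract the preset
        # threshold, and pass the result through the activation function.
        acc = sum(x * w for x, w in zip(inputs, weights))
        return activation(acc - threshold)

    # Example with a simple step activation:
    y = neuron_output([1.0, 0.5], [0.3, -0.2], 0.1, lambda s: 1 if s >= 0 else 0)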
7. Layers and Connections
- Neurons are grouped and connected to each other via preconfigured layers
- The layer configuration is determined by the application
- Computation sets occur within layers; results propagate to other layers (see the propagation sketch below)
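A purely illustrative model of this layer-by-layer propagation, assuming each layer is represented as a list of (weights, threshold) pairs, one per neuron:

    def forward(layers, inputs, activation):
        values = inputs
        for layer in layers:  # computation happens within a layer...
            values = [activation(sum(x * w for x, w in zip(values, weights)) - threshold)
                      for weights, threshold in layer]  # ...then results feed the next layer
        return values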
8. Neural Networks: Topologies
9. Motivation: Why Hardware?
- Huge application range
- Significant research both in software and hardware
- Software implementations have been researched and optimized heavily, yet still lack speed
- Not applicable to real-time computation
- → Hardware implementation for real-time systems
10. Issues and Complications
- Huge connection problem
- Human neurons have approx. 10,000 inputs
- ANNs with 100 neurons?
- Neurons require a MAC operation
- MAC units are expensive in terms of area/power
- Layer configuration changes with the application
- Weight storage/update
- Activation function implementation: how?
- Accuracy/area/speed tradeoffs
- Only one advantage: human neurons operate in the millisecond range, an eternity for today's technology
- Use this time advantage to perform more operations using the same hardware
11. Ideal Hardware
- Reconfigurable Layers
- Multiple weight support
- Multiple Activation Function Support
- Interconnect problem addressed
- Do more with less (area that is)
- If possible, training on chip
12. Existing Work
- Neurochips vs. Neurocomputers
- Neurochips (taking over)
- Analog implementations
- Efficient: emulate the human brain as closely as possible
- Noise and size issues
- Limited number of neurons on each chip
- CMOS digital implementations
- Successful in implementing small networks
- Limited number of neurons due to area constraints
- Connections between neurons are a huge problem
- No reconfigurable systems!
- Neurocomputers (too costly)
- Multiple boards
- Use a multiboard platform, adding neurons to the network, connected via traditional bus architectures
- Bulky, impractical in today's world
- Generic processor add-ons
- PC cards / co-processors optimized for neural operations
- Require a host system
13. Existing Work: Architectures
- Systolic arrays: not reconfigurable [Lindsey '98]
- Bit-slice architectures limit parallelism [Lindsey '98]
- RBF (Radial Basis Function) networks: allow masking of MAC operations with variations in the activation function
- Limited to RBF topologies only
- Straightforward traditional busses
- Limited number of neurons per chip: interconnection issues
- MAC hardware size issues: precision vs. accuracy
- Hardware development slowed by all of the issues mentioned
- Need for something new
14. Our Proposed Solution?
- Use Networks on Chip!
- Novel architecture
- Partially solves the interconnect problem
- Allows for reconfigurability
- Virtualization (mapping of multiple logical units onto the same hardware units)
- Things you have heard many times
15. Projected Achievements
- Alleviate interconnect issues → NoC
- Provide high precision and accuracy → 32x32-bit multiplier
- Allow for design-time expandability
- Runtime reconfigurability (network topology adjusted by sending new data) → multiple target applications
- Virtualization (limited physical hardware used to implement multiple logical neurons)
- Dynamic adjustment of weights and activation functions
- Optimized for the Multilayer Perceptron (MLP) topology
- The majority of topologies used today are MLP (~80%)
16. Implementation Details
- Work done in two parts
- Basic implementation
- Involves most of the work
- Implement a functional system to extract performance metrics
- Work submitted to DAC '04
- Low-power / fault-tolerant implementation
- Explore architectural modifications to make the system consume less power and perform reliably
17. Basic Implementation
- Utilizes the NoC architecture
- Neurons → PEs
- Idea:
- Cluster 4 neurons per routing node
- Share one activation function unit among all 4 clustered neurons
- Each neuron performs a MAC operation
- Aggregate each neuron's output, pass it through the activation function, and route it back to the network (sketched below)
- Less hardware (shared activation function units)
- Less network traffic
- Ingress, Egress, Neuron, and Aggregator/Routing nodes
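A hypothetical software model of one cluster, under one reading of the bullets above (four neuron PEs each produce an accumulated sum, and the aggregator applies the single shared activation function unit to each sum in turn):

    def cluster_step(per_neuron_inputs, per_neuron_weights, thresholds, shared_activation):
        # Each of the four neuron PEs performs its MAC operation.
        sums = [sum(x * w for x, w in zip(xs, ws)) - th
                for xs, ws, th in zip(per_neuron_inputs, per_neuron_weights, thresholds)]
        # The aggregator applies the shared activation function; results are
        # then routed back onto the network as packets.
        return [shared_activation(s) for s in sums]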
18. Ingress / Egress Nodes
- Ingress Nodes
- Tiny (compared to the rest of the system) Nodes
- Input pads
- Route packets
- Buffering capacity sufficient to hold an entire computation set
- Receive data from the outside world and direct it to the network
- Egress Nodes
- Also tiny
- Output pads
- Route packets
- Output computation results to outside world.
19. Neuron PEs
- Perform the MAC operation only
- 3 on-chip memories in each PE:
- 3 depths of virtualization
- LUTs with weight values for 3 virtual neurons (see the sketch below)
- Memories support weights for up to 8 layers
- High-precision 32-bit 2-stage multiplier
- 10-stage pipeline
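An illustrative model of the virtualization described above: one physical MAC datapath time-shared among 3 virtual neurons, each backed by its own per-layer weight memory (the memory organization shown here is an assumption, not the actual RTL):

    def pe_mac(inputs, weight_mems, virtual_neuron, layer):
        weights = weight_mems[virtual_neuron][layer]        # per-layer weight LUT
        # MAC only: the activation function is applied at the aggregator node.
        return sum(x * w for x, w in zip(inputs, weights))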
20. Aggregator Nodes
- Two operations:
- Activation function unit
- Route packets
- A RAM LUT stores the activation function values (see the sketch below)
- NoC routing algorithm
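A sketch of a RAM-LUT activation function; the table size, input range, and the choice of a sigmoid are assumptions for illustration only:

    import math

    LUT_SIZE, IN_MIN, IN_MAX = 1024, -8.0, 8.0
    SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(IN_MIN + i * (IN_MAX - IN_MIN) / (LUT_SIZE - 1))))
                   for i in range(LUT_SIZE)]

    def lut_activation(x):
        # Clamp the input, scale it to a table index, and read the stored value.
        x = min(max(x, IN_MIN), IN_MAX)
        i = int((x - IN_MIN) / (IN_MAX - IN_MIN) * (LUT_SIZE - 1))
        return SIGMOID_LUT[i]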
21. The Big Picture
- Training is done off-chip
- Weight values and activation functions are initialized during configuration
- After the network is configured, it operates as trained
- Data enters and leaves the network through the ingress/egress nodes
- Each aggregator-neuron cluster receives packets
- 4 data words + a header with control/routing information
- Each neuron operates independently of the others, but within the same layer
- Completed neuron results are used by the aggregator for the activation function
- Results are sent back to the network
22. Experimental Platform
- 9 aggregators / 36 neuron PEs
- 3 ingress + 3 egress nodes
- 32-bit weights/inputs (data words)
- 4 words + 32-bit header = 160-bit packet
- 2 flits, 80 bits each (see the packet sketch below)
- 3 x 2KB weight memories per neuron, one for each virtual neuron
- Up to 512 weights per input-layer neuron
- Up to 512 network inputs per set
- 4KB RAM for activation functions
- Allows strong representation of most activation functions; sufficient storage for the Gaussian function (95% accuracy)
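A sketch of the 160-bit packet described above; the field order is an assumption, only the sizes come from the slide (a 32-bit header plus four 32-bit data words, sent as two 80-bit flits):

    def make_packet(header, words):
        assert len(words) == 4 and all(0 <= w < 2**32 for w in words) and 0 <= header < 2**32
        packet = header
        for w in words:
            packet = (packet << 32) | w                    # 160-bit packet value
        return packet >> 80, packet & ((1 << 80) - 1)      # two 80-bit flits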
23. Synthesis Results
24. Results
- Some clarifications first
- No standard metric for hardware ANNs
- Performance cannot be measured by the clock rate alone
- Instead, Connections Per Second (CPS) (a rough CPS model is sketched after this list)
- Rate of MAC operations per second
- Precision
- Topology
- Number of Neurons
- Number of Synapses per neuron
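A back-of-the-envelope CPS model; the clock rate and utilization used in the example are placeholders, not measured results from our design:

    def connections_per_second(mac_units, clock_hz, macs_per_cycle=1, utilization=1.0):
        # CPS is simply the aggregate rate of MAC (connection) operations.
        return mac_units * macs_per_cycle * clock_hz * utilization

    # e.g. 36 neuron PEs, one MAC per cycle each, at an assumed 500 MHz:
    print(connections_per_second(36, 500e6))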
25. Tests and Methodology
- No other reconfigurable hardware exists to compare against
- Standard topology used to test our architecture against others
- MLP applications selected
- 5 widely used applications
- Both with existing software and hardware implementations
- Simulation results for both software (Matlab) and hardware (custom ASIC implementation of the same area/technology, with a traditional bus for connections)
- Connections Per Second
- Accuracy of solution (w.r.t. double-precision software)
- Latency
- CPS comparison with commercial hardware
26. Results
Architectural Comparison vs. Existing Commercial
Hardware
27. Results
Network Performance
28. Results
Accuracy - Latency
29. Results
CPS Comparison vs. Existing Commercial Hardware
30. Preliminary Conclusions
- Architecture appears superior in almost every aspect compared to existing architectures
- Reconfigurable, expandable, better resource utilization
- Excellent comparative performance
- Large silicon area (multipliers/RAM)
- High power (multipliers/RAM)
31. Fault Tolerance, Low Power?
- Fault-tolerance optimization
- Low power architectural optimization
32. Fault Tolerance
- Plenty of fault-tolerant neural network implementations have been proposed
- 2 main ideas
- Neuron duplication
- Duplicate all or some critical neurons
- Most work done in software
- Effective, but in hardware it carries a heavy area cost
- Training optimization
- Neural networks are fault tolerant by default, i.e. errors in inputs should be easily detected given correct training
- Error impact is much larger on the weight values than on the inputs
- Modified training algorithms can detect erroneous input patterns and discard them
- Not applicable in our case, since training is done off-chip
33. Fault Tolerance in Our Architecture
- Errors in incoming data → network errors
- Already analyzed by Greg
- Identified possible error areas within PEs
- Weight RAMs
- Activation function RAMs
- Datapath components (less likely, however)
- Weight RAM impact
- Important, as errors in maxima/minima weight values will affect the output
- Fault-tolerance training optimization is ruined
- Activation function impact
- If the error deviates a lot from the original value, then very important
34. Weight RAM Protection
- Weight RAM is subject to transient/soft/noise errors
- Standard SECDED hardware can be used, at an area/delay penalty
- Penalty estimate (Synopsys DesignWare):
- Area: approx. 3% increase per neuron PE
- Pipeline latency: 2 stages (decoder/encoder)
- Alternate algorithm for error detection
- Does not impact performance
- Minimal area penalty
- Higher power consumption?
35. Proposed (Alternate) Algorithm
- Alternate algorithm for error detection (a software sketch follows this list)
- When weight values are initially inserted into the RAM, use the corresponding accumulator (already in place) and accumulate their sum
- The accumulator is not used during configuration, hence no performance penalty
- When inputs arrive, each weight has to be read
- Use the existing subtractor to subtract each weight from the accumulated sum
- The subtractor is not used until the accumulated sum of products has been completed
- In the end the checksum has to be zero
- Penalty → power consumption
- Also, most of the time the weight RAM is not full
- Use the empty slots to replicate weight values
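A software model of the checksum scheme above; the 32-bit wrap-around width is an assumption that mirrors the 32-bit datapath:

    MASK = (1 << 32) - 1

    def checksum_on_load(weights):
        s = 0
        for w in weights:            # configuration time: accumulate the sum
            s = (s + w) & MASK       # (the accumulator is otherwise idle here)
        return s

    def checksum_on_read(weights_read, s):
        for w in weights_read:       # operation time: subtract each weight as it is read
            s = (s - w) & MASK       # (the subtractor is otherwise idle until the end)
        return s == 0                # a non-zero residue flags a corrupted weight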
36. Activation Function Verification
- Algorithm (sketched below):
- The activation function is continuous
- Values next to each other in memory are very close to each other
- Determine a small step size between consecutive values as a threshold
- Upon retrieving a value, retrieve the neighboring values as well
- Compare the values
- If the difference is larger than the threshold, possible error in the read value: raise a flag
- If it is smaller than the threshold, even if the value is erroneous, it will not impact performance
[Figure: activation function samples marking a detected error value, an error value with small impact, and a correct value]
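A sketch of the neighbor-comparison check (names and the flag convention are illustrative):

    def read_activation_checked(lut, index, step_threshold):
        value = lut[index]
        neighbors = [lut[i] for i in (index - 1, index + 1) if 0 <= i < len(lut)]
        # If the read value differs from a neighbor by more than the step
        # threshold, it is suspect; otherwise any residual error is small.
        error_flag = any(abs(value - n) > step_threshold for n in neighbors)
        return value, error_flag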
37. Neuron Replication: Which Neurons?
- Neuron importance in the computation is affected by 3 factors:
- Number of inputs: inversely proportional
- Number of outputs: directly proportional
- Weight range and variation (weight deviation): directly proportional
- Hidden-layer neurons are the most important
- Input-layer neurons have many inputs
- Output-layer neurons have less weight deviation
- Evaluate neuron impact during training and topology mapping, and then do any of the following (a scoring sketch follows this list):
- If resources are available, replicate high-impact neurons (virtualization)
- Use the error detection schemes mentioned above on high-impact neurons
- Higher power consumption, but a more reliable system
38. Reliability vs. Power vs. Performance
39. Lowering the Power
- Multipliers take the biggest share of the power consumption pie
- Use a sleep vector (Synopsys DesignWare Library): offers leakage savings of up to 15%
- Turn off unused neuron PEs
- Power savings are huge: up to 40%, depending on the application
- On-chip memory and datapath component power consumption optimization
- Vdd / clock gating
- Other known memory power-reduction techniques
- Work in progress
- Dynamic power consumption: hard
- → unpredictable inputs and input rate
40. Future Work
- Power Optimization
- New Fault Tolerant Algorithm explorations
- Collaboration with NoC Fault Tolerant Algorithm
for overall system reliability
41. Conclusions
- Reconfigurable, expandable, reliable neural network architecture
- NoC implementation
- → utilizes almost all advantages of the NoC architecture
- High comparative performance
- Low power is still a work in progress
- QUESTIONS?
42. References
- C. Lindsey and T. Lindblad, "Review of Neural Network Hardware: A User's Perspective", IEEE Third Workshop on Neural Networks: From Biology to High Energy Physics, Marciana Marina, Isola d'Elba, Italy, Sept. 26-30, 1994.
- M. Glesner and W. Pöchmüller, "Neurocomputers: An Overview of Neural Networks in VLSI", Chapman & Hall, London, 1994.
- J. N. H. Heemskerk, "Overview of Neural Hardware", in Neurocomputers for Brain-Style Processing: Design, Implementation and Application, PhD Thesis, Unit of Experimental and Theoretical Psychology, Leiden University, The Netherlands, 1995.
- T. Axelrod, E. Vittoz, I. Saxena et al., "Neural Network Hardware Implementations", Handbook of Neural Computation, IOP Publishing Ltd and Oxford University Press, London, 1997.
- A. K. Jain, J. Mao, K. M. Mohiuddin, "Artificial Neural Networks: A Tutorial", IEEE Computer, Volume 29, pp. 31-44, March 1996.
- L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, Volume 35, pp. 70-78, January 2002.
- W. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", in Proceedings of the 38th Design Automation Conference, June 2001.
- S. Kumar, A. Jantsch, M. Millberg, J. Öberg, J. P. Soininen, M. Forsell, K. Tiensyrjä and A. Hemani, "A Network on Chip Architecture and Design Methodology", in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, April 2002.
- T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin and W. Wolf, "Embedded Hardware Face Detection", in Proceedings of the VLSI Design Conference, Mumbai, India, January 2004.
- B. Kröse and P. van der Smagt, "An Introduction to Neural Networks", 8th Edition, Internal Report, The University of Amsterdam, The Netherlands, November 1996.
- D. Whelihan and H. Schmit, "NOCSim Simulator", http://www.ece.cmu.edu/ece725/DOCUMENTS/NOCsim_users_guide.doc, September 2003.