Title: A NoC mapped Generic, Reconfigurable, Fault Tolerant Neural Network Processor
1. A NoC mapped Generic, Reconfigurable, Fault Tolerant Neural Network Processor
- T. Theocharides, G. Link, J.M. Kim
- Embedded and Mobile Computing Center
- Department of Computer Science and Engineering
- The Pennsylvania State University
- December 11th, 2003
2. Outline
- Introduction: Short NN background
- Motivation
- Why hardware?
- Existing hardware work
- Network Perspective
- Application Perspective
- Initial Proposed System Architecture
- Experimental Platform and Results
- Synthesis/Hardware metrics
- Comparative Performance
- Fault Tolerant System Architecture
- Conclusion / Future work
3. Artificial Neural Networks (ANNs)
- ANN is an information processing paradigm
- Inspired by the way biological nervous systems, such as the human brain, process information
- Novel structure of an information processing system
- It is composed of a large number of highly interconnected processing elements called neurons
- Neurons work in unison to solve specific problems
- ANNs, like people, learn by example
- ANNs are used in digital signal processing, pattern recognition, data classification, control systems, and many more applications
4. ANNs
- ANNs are configured for a specific application through a learning process
- Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons
- The same holds for ANNs
- ANNs receive a set of data they have already been trained to process
- They output the desired result of the computation performed
- The output depends on the application and the training they are exposed to
5. Artificial vs. Biological Neural Networks
- The human brain functions by learning, adapting, and then solving problems
- Humans learn either from errors or from directions given
- The human brain consists of billions of neurons
- Neurons are connected to each other in a very dense manner
- Artificial Neural Networks attempt to emulate a small part of the brain (impossible to emulate billions!)
- They consist of artificial neurons
- Artificial Neurons trained for problem solving
- Layers of neurons make up a neural network
6. The Neuron
Human Neuron vs. Artificial Neuron
- Neurons receive a set of inputs, X0-Xn
- Each input is associated with a pre-determined weight, W0-Wn
- Neurons accumulate the sum of the products of each input and its weight
- The accumulated sum, minus a preset threshold, is passed through an activation function (sketched below)
- Activation functions include various common functions, depending on the application
- The final output is propagated to other neurons
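As a minimal sketch (the names are illustrative, not taken from the hardware design), the computation the bullets describe is:

    def neuron_output(inputs, weights, threshold, activation):
        # Accumulate the sum of input*weight products, subtract the preset
        # threshold, and pass the result through the activation function.
        acc = sum(x * w for x, w in zip(inputs, weights))
        return activation(acc - threshold)

    # Example with a simple step activation:
    y = neuron_output([1.0, 0.5], [0.3, -0.2], 0.1, lambda s: 1 if s >= 0 else 0)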
7. Layers and Connections
- Neurons are grouped and connected to each other via preconfigured layers
- The layer configuration is determined by the application
- Computation sets occur within layers; results propagate to other layers (see the propagation sketch below)
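A purely illustrative model of this layer-by-layer propagation, assuming each layer is represented as a list of (weights, threshold) pairs, one per neuron:

    def forward(layers, inputs, activation):
        values = inputs
        for layer in layers:  # computation happens within a layer...
            values = [activation(sum(x * w for x, w in zip(values, weights)) - threshold)
                      for weights, threshold in layer]  # ...then results feed the next layer
        return values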
8. Neural Networks: Topologies
9. Motivation: Why Hardware?
- Huge application range
- Significant research both in software and hardware
- Software implementations have been researched and optimized heavily, yet still lack speed
- Not applicable to real-time computation
- → Hardware implementation for real-time systems
10. Issues and Complications
- Huge connection problem
- Human neurons have approx. 10,000 inputs
- ANNs with 100 neurons?
- Neurons require a MAC operation
- MAC units are expensive in terms of area/power
- Layer configuration changes with the application
- Weight storage/update
- Activation function implementation: how?
- Accuracy/area/speed tradeoffs
- Only one advantage: human neurons operate in the millisecond range, an eternity for today's technology
- Use this time advantage to perform more operations using the same hardware
11. Ideal Hardware
- Reconfigurable Layers
- Multiple weight support
- Multiple Activation Function Support
- Interconnect problem addressed
- Do more with less (area that is)
- If possible, training on chip
12. Existing Work
- Neurochips vs. Neurocomputers
- Neurochips (taking over)
- Analog implementations
- Efficient: emulate the human brain as closely as possible
- Noise and size issues
- Limited number of neurons on each chip
- CMOS digital implementations
- Successful in implementing small networks
- Limited number of neurons due to area constraints
- Connections between neurons are a huge problem
- No reconfigurable systems!
- Neurocomputers (too costly)
- Multiple boards
- Use a multiboard platform, adding neurons to the network, connected via traditional bus architectures
- Bulky, impractical in today's world
- Generic processor add-ons
- PC cards / co-processors optimized for neural operations
- Require a host system
13. Existing Work: Architectures
- Systolic arrays: not reconfigurable [Lindsey '98]
- Bit-slice architectures limit parallelism [Lindsey '98]
- RBF (Radial Basis Function) networks: allow masking of MAC operations with variations in the activation function
- Limited to RBF topologies only
- Straightforward traditional busses
- Limited number of neurons per chip: interconnection issues
- MAC hardware size issues: precision vs. accuracy
- Hardware development slowed by all of the issues mentioned
- Need for something new
14. Our Proposed Solution?
- Use Networks on Chip!
- Novel architecture
- Partially solves the interconnect problem
- Allows for reconfigurability
- Virtualization (mapping of multiple logical units onto the same hardware units)
- Things you have heard many times
15. Projected Achievements
- Alleviate interconnect issues → NoC
- Provide high precision and accuracy → 32x32-bit multiplier
- Allow for design-time expandability
- Runtime reconfigurability (network topology adjusted by sending new data) → multiple target applications
- Virtualization (limited physical hardware used to implement multiple logical neurons)
- Dynamic adjustment of weights and activation functions
- Optimized for the Multilayer Perceptron (MLP) topology
- The majority of topologies used today are MLP (~80%)
16. Implementation Details
- Work done in two parts
- Basic implementation
- Involves most of the work
- Implement a functional system to extract performance metrics
- Work submitted to DAC '04
- Low-power / fault-tolerant implementation
- Explore architectural modifications to make the system consume less power and perform reliably
17. Basic Implementation
- Utilizes the NoC architecture
- Neurons → PEs
- Idea:
- Cluster 4 neurons per routing node
- Share one activation function unit among all 4 clustered neurons
- Each neuron performs a MAC operation
- Aggregate each neuron's output, pass it through the activation function, and route it back to the network (sketched below)
- Less hardware (shared activation function units)
- Less network traffic
- Ingress, Egress, Neuron, and Aggregator/Routing nodes
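A hypothetical software model of one cluster, under one reading of the bullets above (four neuron PEs each produce an accumulated sum, and the aggregator applies the single shared activation function unit to each sum in turn):

    def cluster_step(per_neuron_inputs, per_neuron_weights, thresholds, shared_activation):
        # Each of the four neuron PEs performs its MAC operation.
        sums = [sum(x * w for x, w in zip(xs, ws)) - th
                for xs, ws, th in zip(per_neuron_inputs, per_neuron_weights, thresholds)]
        # The aggregator applies the shared activation function; results are
        # then routed back onto the network as packets.
        return [shared_activation(s) for s in sums]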
18. Ingress / Egress Nodes
- Ingress Nodes
- Tiny (compared to the rest of the system) Nodes
- Input pads
- Route packets
- Buffering capacity sufficient to hold an entire computation set
- Receive data from the outside world and direct it to the network
- Egress Nodes
- Also tiny
- Output pads
- Route packets
- Output computation results to outside world.
19. Neuron PEs
- Perform the MAC operation only
- 3 on-chip memories in each PE:
- 3 depths of virtualization
- LUTs with weight values for 3 virtual neurons (see the sketch below)
- Memories support weights for up to 8 layers
- High-precision 32-bit 2-stage multiplier
- 10-stage pipeline
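An illustrative model of the virtualization described above: one physical MAC datapath time-shared among 3 virtual neurons, each backed by its own per-layer weight memory (the memory organization shown here is an assumption, not the actual RTL):

    def pe_mac(inputs, weight_mems, virtual_neuron, layer):
        weights = weight_mems[virtual_neuron][layer]        # per-layer weight LUT
        # MAC only: the activation function is applied at the aggregator node.
        return sum(x * w for x, w in zip(inputs, weights))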
20. Aggregator Nodes
- Two operations:
- Activation function unit
- Route packets
- A RAM LUT stores the activation function values (see the sketch below)
- NoC routing algorithm
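A sketch of a RAM-LUT activation function; the table size, input range, and the choice of a sigmoid are assumptions for illustration only:

    import math

    LUT_SIZE, IN_MIN, IN_MAX = 1024, -8.0, 8.0
    SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(IN_MIN + i * (IN_MAX - IN_MIN) / (LUT_SIZE - 1))))
                   for i in range(LUT_SIZE)]

    def lut_activation(x):
        # Clamp the input, scale it to a table index, and read the stored value.
        x = min(max(x, IN_MIN), IN_MAX)
        i = int((x - IN_MIN) / (IN_MAX - IN_MIN) * (LUT_SIZE - 1))
        return SIGMOID_LUT[i]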
21. The Big Picture
- Training is done off-chip
- Weight values and activation functions are initialized during configuration
- After the network is configured, it operates as trained
- Data enters and leaves the network through the ingress/egress nodes
- Each aggregator-neuron cluster receives packets
- 4 data words + a header with control/routing information
- Each neuron operates independently of the others, but within the same layer
- Completed neuron results are used by the aggregator for the activation function
- Results are sent back to the network
22. Experimental Platform
- 9 aggregators / 36 neuron PEs
- 3 ingress + 3 egress nodes
- 32-bit weights/inputs (data words)
- 4 words + 32-bit header = 160-bit packet
- 2 flits, 80 bits each (see the packet sketch below)
- 3 x 2KB weight memories per neuron, one for each virtual neuron
- Up to 512 weights per input-layer neuron
- Up to 512 network inputs per set
- 4KB RAM for activation functions
- Allows strong representation of most activation functions; sufficient storage for the Gaussian function (95% accuracy)
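A sketch of the 160-bit packet described above; the field order is an assumption, only the sizes come from the slide (a 32-bit header plus four 32-bit data words, sent as two 80-bit flits):

    def make_packet(header, words):
        assert len(words) == 4 and all(0 <= w < 2**32 for w in words) and 0 <= header < 2**32
        packet = header
        for w in words:
            packet = (packet << 32) | w                    # 160-bit packet value
        return packet >> 80, packet & ((1 << 80) - 1)      # two 80-bit flits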
23. Synthesis Results
24. Results
- Some clarifications first
- No standard metric for hardware ANNs
- Performance cannot be measured by the clock rate alone
- Instead, Connections Per Second (CPS) (a rough CPS model is sketched after this list)
- Rate of MAC operations per second
- Precision
- Topology
- Number of Neurons
- Number of Synapses per neuron
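A back-of-the-envelope CPS model; the clock rate and utilization used in the example are placeholders, not measured results from our design:

    def connections_per_second(mac_units, clock_hz, macs_per_cycle=1, utilization=1.0):
        # CPS is simply the aggregate rate of MAC (connection) operations.
        return mac_units * macs_per_cycle * clock_hz * utilization

    # e.g. 36 neuron PEs, one MAC per cycle each, at an assumed 500 MHz:
    print(connections_per_second(36, 500e6))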
25. Tests and Methodology
- No other reconfigurable hardware exists to compare against
- Standard topology used to test our architecture against others
- MLP applications selected
- 5 widely used applications
- Both with existing software and hardware implementations
- Simulation results for both software (Matlab) and hardware (custom ASIC implementation of the same area/technology, with a traditional bus for connections)
- Connections Per Second
- Accuracy of solution (w.r.t. double-precision software)
- Latency
- CPS comparison with commercial hardware
26. Results
Architectural Comparison vs. Existing Commercial
Hardware
27. Results
Network Performance
28. Results
Accuracy - Latency
29. Results
CPS Comparison vs. Existing Commercial Hardware
30. Preliminary Conclusions
- Architecture appears superior in almost every aspect compared to existing architectures
- Reconfigurable, expandable, better resource utilization
- Excellent comparative performance
- Large silicon area (multipliers/RAM)
- High power (multipliers/RAM)
31. Fault Tolerance, Low Power?
- Fault-tolerance optimization
- Low power architectural optimization
32. Fault Tolerance
- Plenty of fault-tolerant neural network implementations have been proposed
- 2 main ideas
- Neuron duplication
- Duplicate all or some critical neurons
- Most work done in software
- Effective, but in hardware it carries a heavy area cost
- Training optimization
- Neural networks are fault tolerant by default, i.e. errors in inputs should be easily detected given correct training
- Error impact is much larger on the weight values than on the inputs
- Modified training algorithms can detect erroneous input patterns and discard them
- Not applicable in our case, since training is done off-chip
33. Fault Tolerance in Our Architecture
- Errors in incoming data → network errors
- Already analyzed by Greg
- Identified possible error areas within PEs
- Weight RAMs
- Activation function RAMs
- Datapath components (less likely, however)
- Weight RAM impact
- Important, as errors in maxima/minima weight values will affect the output
- Fault-tolerance training optimization is ruined
- Activation function impact
- If the error deviates a lot from the original value, then very important
34. Weight RAM Protection
- Weight RAM is subject to transient/soft/noise errors
- Standard SECDED hardware can be used, at an area/delay penalty
- Penalty estimate (Synopsys DesignWare):
- Area: approx. 3% increase per neuron PE
- Pipeline latency: 2 stages (decoder/encoder)
- Alternate algorithm for error detection
- Does not impact performance
- Minimal area penalty
- Higher power consumption?
35. Proposed (Alternate) Algorithm
- Alternate algorithm for error detection (a software sketch follows this list)
- When weight values are initially inserted into the RAM, use the corresponding accumulator (already in place) and accumulate their sum
- The accumulator is not used during configuration, hence no performance penalty
- When inputs arrive, each weight has to be read
- Use the existing subtractor to subtract each weight from the accumulated sum
- The subtractor is not used until the accumulated sum of products has been completed
- In the end the checksum has to be zero
- Penalty → power consumption
- Also, most of the time the weight RAM is not full
- Use the empty slots to replicate weight values
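A software model of the checksum scheme above; the 32-bit wrap-around width is an assumption that mirrors the 32-bit datapath:

    MASK = (1 << 32) - 1

    def checksum_on_load(weights):
        s = 0
        for w in weights:            # configuration time: accumulate the sum
            s = (s + w) & MASK       # (the accumulator is otherwise idle here)
        return s

    def checksum_on_read(weights_read, s):
        for w in weights_read:       # operation time: subtract each weight as it is read
            s = (s - w) & MASK       # (the subtractor is otherwise idle until the end)
        return s == 0                # a non-zero residue flags a corrupted weight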
36. Activation Function Verification
- Algorithm (sketched below):
- The activation function is continuous
- Values next to each other in memory are very close to each other
- Determine a small step size between consecutive values as a threshold
- Upon retrieving a value, retrieve the neighboring values as well
- Compare the values
- If the difference is larger than the threshold, possible error in the read value: raise a flag
- If it is smaller than the threshold, even if the value is erroneous, it will not impact performance
[Figure: activation function samples marking a detected error value, an error value with small impact, and a correct value]
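A sketch of the neighbor-comparison check (names and the flag convention are illustrative):

    def read_activation_checked(lut, index, step_threshold):
        value = lut[index]
        neighbors = [lut[i] for i in (index - 1, index + 1) if 0 <= i < len(lut)]
        # If the read value differs from a neighbor by more than the step
        # threshold, it is suspect; otherwise any residual error is small.
        error_flag = any(abs(value - n) > step_threshold for n in neighbors)
        return value, error_flag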
37. Neuron Replication: Which Neurons?
- Neuron importance in the computation is affected by 3 factors:
- Number of inputs: inversely proportional
- Number of outputs: directly proportional
- Weight range and variation (weight deviation): directly proportional
- Hidden-layer neurons are the most important
- Input-layer neurons have many inputs
- Output-layer neurons have less weight deviation
- Evaluate neuron impact during training and topology mapping, and then do any of the following (a scoring sketch follows this list):
- If resources are available, replicate high-impact neurons (virtualization)
- Use the error detection schemes mentioned above on high-impact neurons
- Higher power consumption, but a more reliable system
38. Reliability vs. Power vs. Performance
39. Lowering the Power
- Multipliers take the biggest share of the power consumption pie
- Use a sleep vector (Synopsys DesignWare Library): offers leakage savings of up to 15%
- Turn off unused neuron PEs
- Power savings are huge: up to 40%, depending on the application
- On-chip memory and datapath component power consumption optimization
- Vdd / clock gating
- Other known memory power-reduction techniques
- Work in progress
- Dynamic power consumption: hard
- → unpredictable inputs and input rate
40. Future Work
- Power Optimization
- New Fault Tolerant Algorithm explorations
- Collaboration with NoC Fault Tolerant Algorithm
for overall system reliability
41. Conclusions
- Reconfigurable, expandable, reliable neural network architecture
- NoC implementation
- → utilizes almost all advantages of the NoC architecture
- High comparative performance
- Low power is still a work in progress
- QUESTIONS?
42. References
- C. Lindsey and T. Lindblad, "Review of Neural Network Hardware: A User's Perspective", IEEE Third Workshop on Neural Networks: From Biology to High Energy Physics, Marciana Marina, Isola d'Elba, Italy, Sept. 26-30, 1994.
- M. Glesner and W. Pöchmüller, "Neurocomputers: An Overview of Neural Networks in VLSI", Chapman & Hall, London, 1994.
- J. N. H. Heemskerk, "Overview of Neural Hardware", in Neurocomputers for Brain-Style Processing: Design, Implementation and Application, PhD Thesis, Unit of Experimental and Theoretical Psychology, Leiden University, The Netherlands, 1995.
- T. Axelrod, E. Vittoz, I. Saxena et al., "Neural Network Hardware Implementations", Handbook of Neural Computation, IOP Publishing Ltd and Oxford University Press, London, 1997.
- A. K. Jain, J. Mao, K. M. Mohiuddin, "Artificial Neural Networks: A Tutorial", IEEE Computer, Volume 29, pp. 31-44, March 1996.
- L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, Volume 35, pp. 70-78, January 2002.
- W. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", in Proceedings of the 38th Design Automation Conference, June 2001.
- S. Kumar, A. Jantsch, M. Millberg, J. Öberg, J. P. Soininen, M. Forsell, K. Tiensyrjä and A. Hemani, "A Network on Chip Architecture and Design Methodology", in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, April 2002.
- T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin and W. Wolf, "Embedded Hardware Face Detection", in Proceedings of the VLSI Design Conference, Mumbai, India, January 2004.
- B. Kröse and P. van der Smagt, "An Introduction to Neural Networks", 8th Edition, Internal Report, The University of Amsterdam, The Netherlands, November 1996.
- D. Whelihan and H. Schmit, "NOCSim Simulator", http://www.ece.cmu.edu/ece725/DOCUMENTS/NOCsim_users_guide.doc, September 2003.