A NoC-Mapped Generic, Reconfigurable, Fault-Tolerant Neural Network Processor

1
A NoC-Mapped Generic, Reconfigurable, Fault-Tolerant
Neural Network Processor
  • T. Theocharides, G. Link, J.M. Kim
  • Embedded and Mobile Computing Center
  • Department of Computer Science and Engineering
  • The Pennsylvania State University
  • December 11th, 2003

2
Outline
  • Introduction: short NN background
  • Motivation
  • Why hardware?
  • Existing hardware work
  • Network Perspective - Application Perspective
  • Initial Proposed System Architecture
  • Experimental Platform and Results
  • Synthesis/Hardware metrics
  • Comparative Performance
  • Fault Tolerant System Architecture
  • Conclusion / Future work

3
Artificial Neural Networks (ANNs)
  • An ANN is an information processing paradigm
  • Inspired by the way biological nervous systems,
    such as the human brain, process information
  • A novel structure for an information processing
    system
  • Composed of a large number of highly
    interconnected processing elements called neurons
  • Neurons work in unison to solve specific problems
  • ANNs, like people, learn by example
  • ANNs are used in digital signal processing,
    pattern recognition, data classification, control
    systems, and many more applications

4
ANNs
  • ANNs are configured for a specific application
    through a learning process
  • Learning in biological systems involves
    adjustments to the synaptic connections that
    exist between the neurons
  • The same holds for ANNs
  • ANNs receive a set of data they have already been
    trained to process
  • They output the desired result of the computation
    performed
  • The output depends on the application and the
    training they were exposed to

5
Artificial vs. Biological Neural Networks
  • The human brain functions by learning, adapting,
    and then solving problems
  • Humans learn either from errors or from
    directions given
  • The human brain consists of billions of neurons
  • Neurons are connected to each other in a very
    dense manner
  • Artificial neural networks attempt to emulate a
    small part of the brain (impossible to emulate
    billions!)
  • They consist of artificial neurons
  • Artificial neurons are trained for problem solving
  • Layers of neurons make up a neural network

6
The Neuron
Human Neuron vs. Artificial Neuron
  • Neurons receive a set of inputs, X0-Xn
  • Each input is associated with a pre-determined
    weight, W0-Wn
  • Neurons accumulate the sum of the products of the
    inputs and weights
  • The accumulated sum, minus a preset threshold, is
    passed through an activation function (see the
    sketch below)
  • Activation functions include various common
    functions, depending on the application
  • The final output is propagated to other neurons
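
A minimal behavioral sketch of this accumulate-and-activate step, in Python (the sigmoid activation and the example values are illustrative assumptions, not taken from the slides):

    import math

    def activation(v):
        # Sigmoid used as a common example; the slides note that the
        # actual function depends on the application.
        return 1.0 / (1.0 + math.exp(-v))

    def neuron_output(inputs, weights, threshold):
        # Accumulate the sum of input-weight products, subtract the
        # preset threshold, then apply the activation function.
        acc = sum(x * w for x, w in zip(inputs, weights))
        return activation(acc - threshold)

    print(neuron_output([0.5, 0.2, 0.9], [0.4, 0.1, 0.7], 0.3))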

7
Layers - Connections
  • Neurons are grouped and connected to each other
    via preconfigured layers
  • The layer configuration is determined by the
    application
  • Computation sets occur within layers; results
    propagate to other layers

8
Neural Networks - Topologies
9
Motivation: Why hardware?
  • Huge application range
  • Significant research both in software and
    hardware
  • Software implementations have been researched and
    optimized heavily, yet still lack speed
  • Not applicable to real-time computation
  • → Hardware implementation for real-time systems

10
Issues / Complications
  • Huge connection problem
  • Human neurons have approx. 10,000 inputs
  • ANNs with 100 neurons?
  • Neurons require a MAC (multiply-accumulate)
    operation
  • MAC units are expensive in terms of area/power
  • Layer configuration changes with the application
  • Weight storage/update
  • Activation function implementation: how?
  • Accuracy/area/speed tradeoffs
  • Only one advantage: human neurons operate in the
    millisecond range, an eternity for today's
    technology
  • Use this time advantage to perform more
    operations using the same hardware

11
Ideal Hardware
  • Reconfigurable Layers
  • Multiple weight support
  • Multiple Activation Function Support
  • Interconnect problem addressed
  • Do more with less (area, that is)
  • If possible, training on chip

12
Existing Work
  • Neurochips vs. Neurocomputers
  • Neurochips (taking over)
  • Analog implementations
  • Efficient; emulate the human brain as closely as
    possible
  • Noise and size issues
  • Limited number of neurons on each chip
  • CMOS digital implementations
  • Successful in implementing small networks
  • Limited number of neurons due to area constraints
  • Connections between neurons are a huge problem
  • No reconfigurable systems!
  • Neurocomputers (too costly)
  • Multiple boards
  • Use a multiboard platform, adding neurons to the
    network, connected via traditional bus
    architectures
  • Bulky, impractical in today's world
  • Generic processor add-ons
  • PC cards / co-processors optimized for neural
    operations
  • Require a host system

13
Existing Work: Architectures
  • Systolic arrays: not reconfigurable [Lindsey, 98]
  • Bit-slice architectures: limit parallelism
    [Lindsey, 98]
  • RBF (Radial Basis Function) networks: allow
    masking of MAC operations with variations in the
    activation function
  • Limited to RBF topologies only
  • Straightforward traditional busses
  • Limited number of neurons per chip due to
    interconnection issues
  • MAC hardware size issues: precision vs. accuracy
  • Hardware development slowed by all the issues
    mentioned
  • Need for something new

14
Our Proposed Solution?
  • Use Networks-on-Chip!
  • Novel architecture
  • Partially solves the interconnect problem
  • Allows for reconfigurability
  • Virtualization (mapping of multiple logical units
    onto the same hardware units)
  • Things you have heard many times

15
Projected Achievements
  • Alleviate interconnect issues → NoC
  • Provide high precision and accuracy → 32×32-bit
    multiplier
  • Allow for design-time expandability
  • Runtime reconfigurability (network topology
    adjusted by sending new data) → multiple target
    applications
  • Virtualization (limited physical hardware used to
    implement multiple logical neurons)
  • Dynamic adjustment of weights and activation
    functions
  • Optimized for the Multilayer Perceptron (MLP)
    topology
  • The majority of topologies used today are MLP
    (80%)

16
Implementation Details
  • Work done in two parts
  • Basic implementation
  • Involves most of the work
  • Implement a functional system to extract
    performance metrics
  • Work submitted to DAC '04
  • Low-power / fault-tolerant implementation
  • Explore architectural modifications to make the
    system consume less power and perform reliably

17
Basic Implementation
  • Utilizes the NoC architecture
  • Neurons → PEs
  • Idea:
  • Cluster 4 neurons per routing node
  • Share an activation function among all 4
    clustered neurons
  • Each neuron performs a MAC operation
  • Aggregate each neuron's output, pass it through
    the activation function, and route it back to the
    network (see the sketch below)
  • Less hardware (shared activation function units)
  • Less network traffic
  • Ingress, egress, neuron, and aggregator/routing
    nodes
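
A behavioral sketch of the 4-neurons-per-aggregator clustering (an assumed Python model of the dataflow, not the actual RTL): each neuron PE performs only the MAC, and the shared aggregator applies the activation function before routing the result back.

    def mac(inputs, weights):
        # Each neuron PE performs only the multiply-accumulate.
        return sum(x * w for x, w in zip(inputs, weights))

    def aggregator(cluster_inputs, cluster_weights, activation):
        # One shared activation-function unit serves all 4 clustered
        # neurons; results are then routed back to the network.
        return [activation(mac(x, w))
                for x, w in zip(cluster_inputs, cluster_weights)]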

18
Ingress / Egress Nodes
  • Ingress nodes
  • Tiny nodes (compared to the rest of the system)
  • Input pads
  • Route packets
  • Buffering capacity is enough to hold an entire
    computation set
  • Receive data from the outside world and direct it
    into the network
  • Egress nodes
  • Also tiny
  • Output pads
  • Route packets
  • Output computation results to the outside world

19
Neuron PEs
  • Perform the MAC operation only
  • 3 on-chip memories each
  • 3 depths of virtualization
  • LUTs with weight values for 3 virtual neurons
    (addressing sketched below)
  • Memories support weights for up to 8 layers
  • High-precision 32-bit, 2-stage multiplier
  • 10-stage pipeline
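
A hypothetical picture of how a virtualized PE could index its weight memories (the address layout below is an assumption for illustration; the slides do not specify it):

    # One physical MAC serves 3 logical ("virtual") neurons, each with
    # its own 2 KB weight memory covering up to 8 layers.
    WORDS_PER_MEMORY = 2048 // 4              # 2 KB of 32-bit weights = 512
    WORDS_PER_LAYER = WORDS_PER_MEMORY // 8   # assumed even split: 64/layer

    def weight_address(virtual_neuron, layer, synapse):
        assert 0 <= virtual_neuron < 3 and 0 <= layer < 8
        assert 0 <= synapse < WORDS_PER_LAYER
        # Select one of the 3 memories, then index by layer and synapse.
        return virtual_neuron, layer * WORDS_PER_LAYER + synapse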

20
Aggregator Nodes
  • Two operations:
  • Activation function unit
  • Packet routing
  • A RAM LUT stores values for the activation
    function (see the LUT sketch below)
  • NoC routing algorithm
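
A minimal model of a RAM-LUT activation unit (table size, input range, and the sigmoid are assumptions for illustration; the experimental platform slide later specifies a 4 KB activation RAM):

    import math

    TABLE_SIZE = 1024        # assumed number of LUT entries
    LO, HI = -8.0, 8.0       # assumed input range for the sigmoid

    # Precompute the function at configuration time; at runtime the
    # aggregator only performs a RAM read.
    TABLE = [1.0 / (1.0 + math.exp(-(LO + i * (HI - LO) / (TABLE_SIZE - 1))))
             for i in range(TABLE_SIZE)]

    def lut_activation(v):
        v = min(max(v, LO), HI)   # clamp to the table's input range
        i = round((v - LO) * (TABLE_SIZE - 1) / (HI - LO))
        return TABLE[i]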

21
The Big Picture
  • Training is done off-chip
  • Weight values and activation functions are
    initialized during configuration
  • Once configured, the network operates as trained
  • Data moves in and out of the network through the
    ingress/egress nodes
  • Each aggregator-neuron cluster receives packets
  • 4 data words + a header with control/routing
    information
  • Each neuron operates independently of the others,
    but within the same layer
  • Completed neuron results are used by the
    aggregator for the activation function
  • The result goes back to the network

22
Experimental Platform
  • 9 aggregators / 36 neuron PEs
  • 3 ingress / 3 egress nodes
  • 32-bit weights/inputs (data words)
  • 4 words + 32-bit header = 160-bit packet
  • 2 flits, 80 bits each (see the arithmetic below)
  • 3 × 2 KB weight memories per neuron PE, one per
    virtual neuron
  • Up to 512 weights per input-layer neuron
  • Up to 512 network inputs per set
  • 4 KB RAM for activation functions
  • Allows strong representation for most activation
    functions, with sufficient storage for the
    Gaussian function (95% accuracy)

23
Synthesis Results
24
Results
  • Some clarifications first
  • There is no standard metric for hardware ANNs
  • Performance cannot be measured by the clock rate
  • Instead: Connections Per Second (CPS)
  • The rate of MAC operations per second (see the
    worked example below)
  • Other reported parameters: precision, topology,
    number of neurons, number of synapses per neuron
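
A worked example of the CPS metric, sized to the experimental platform (the update rate is an illustrative assumption, not a measured result):

    # CPS = connections (MAC operations) evaluated per second.
    neurons = 36                 # platform size from the previous slide
    synapses_per_neuron = 512    # maximum weights per input-layer neuron
    updates_per_second = 1e6     # assumed network update rate

    cps = neurons * synapses_per_neuron * updates_per_second
    print(f"{cps:.2e} CPS")      # 1.84e+10 connections per second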

25
Tests - Methodology
  • No other reconfigurable hardware exists to
    compare against directly
  • A standard topology is needed to test our
    architecture against others
  • MLP applications selected
  • 5 widely used applications
  • All with existing software and hardware
    implementations
  • Simulation results for both software (Matlab) and
    hardware (custom ASIC implementation of the same
    area/technology, with a traditional bus for
    connections)
  • Metrics: Connections Per Second
  • Accuracy of solution (w.r.t. double-precision
    software)
  • Latency
  • CPS comparison against commercial hardware

26
Results
Architectural Comparison vs. Existing Commercial
Hardware
27
Results
Network Performance
28
Results
Accuracy - Latency
29
Results
CPS Comparison vs. Existing Commercial Hardware
30
Preliminary Conclusions
  • The architecture appears superior in almost
    every aspect to existing architectures
  • Reconfigurable, expandable, better resource
    utilization
  • Excellent comparative performance
  • Large silicon area (multipliers/RAM)
  • High power (multipliers/RAM)

31
Fault Tolerance, Low Power?
  • Fault-tolerance optimization
  • Low-power architectural optimization

32
Fault Tolerance
  • Plenty of fault-tolerant neural network
    implementations have been proposed
  • 2 main ideas:
  • Neuron duplication
  • Duplicate all, or some critical, neurons
  • Most work done in software
  • Effective, but in hardware it carries a heavy
    area cost
  • Training optimization
  • Neural networks are fault tolerant by default,
    i.e., with correct training, errors in the inputs
    should be easily detected
  • Error impact is much larger on the weight values
    than on the inputs
  • Modified training algorithms can detect erroneous
    input patterns and discard them
  • Not applicable in our case, since training is
    done off-chip

33
Fault Tolerance in our architecture
  • Errors in incoming data → network errors
  • Already analyzed by Greg
  • Identified possible error areas within the PEs:
  • Weight RAMs
  • Activation function RAMs
  • Datapath components (less likely, however)
  • Weight RAM impact
  • Important, as errors in maxima/minima weight
    values will affect the output
  • The fault-tolerance training optimization is
    ruined
  • Activation function impact
  • If the error deviates a lot from the original
    value, it is very important

34
Weight RAM Protection
  • The weight RAM is subject to
    transient/soft/noise errors
  • Standard SECDED hardware can be used, at an
    area/delay penalty
  • Penalty estimate (Synopsys DesignWare):
  • Area: approx. 3% increase per neuron PE
  • Pipeline latency: 2 extra stages
    (decoder/encoder)
  • Alternate algorithm for error detection:
  • Does not impact performance
  • Minimal area penalty
  • High power consumption?

35
Proposed (Alternate) Algorithm
  • Alternate algorithm for error detection
    (sketched below)
  • When weight values are initially inserted into
    the RAM, use the corresponding accumulator
    (already in place) to accumulate their sum
  • The accumulator is not used during configuration,
    hence no performance penalty
  • When inputs arrive, each weight has to be read
  • Use the existing subtractor to subtract each read
    weight from the accumulated sum
  • The subtractor is not used until the accumulated
    sum of products has been completed
  • In the end, the checksum has to be zero
  • Penalty → higher power consumption
  • Also, most of the time the weight RAM is not full
  • Use the empty slots to replicate weight values
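
A behavioral sketch of this checksum scheme (a Python model of the reuse of the existing accumulator/subtractor; the class and method names are illustrative):

    class WeightChecksum:
        def __init__(self):
            self.residue = 0

        def load_weight(self, w):
            # Configuration phase: the accumulator is otherwise idle,
            # so use it to sum the weights as they are written.
            self.residue += w

        def read_weight(self, w):
            # Computation phase: the subtractor is otherwise idle until
            # the sum of products completes; subtract each weight read.
            self.residue -= w

        def ok(self):
            # After every weight has been read once, a nonzero residue
            # indicates a corrupted weight value.
            return self.residue == 0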

36
Activation Function Verification
  • Algorithm:
  • The activation function is continuous
  • Values next to each other in memory are very
    close to each other
  • Determine a small step size between consecutive
    values as a threshold
  • Upon retrieving a value, retrieve its neighboring
    values as well
  • Compare the values (see the sketch after the
    figure)
  • If the difference is larger than the threshold, a
    possible error in the read value → raise a flag
  • If it is smaller than the threshold, the value,
    even if erroneous, will not impact the result

[Figure: activation-function curve marking a detected error value, an error value with small impact, and a correct value]
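
A sketch of the neighbor-comparison check (the threshold value is an assumption; the slides say only that it should reflect the function's continuity):

    THRESHOLD = 0.05   # assumed maximum step between adjacent LUT entries

    def checked_read(table, i):
        v = table[i]
        neighbors = [table[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(table)]
        if any(abs(v - n) > THRESHOLD for n in neighbors):
            # Discontinuity exceeds the continuity threshold: flag a
            # possible soft error in the value just read.
            print(f"possible activation-RAM error at entry {i}")
        return v
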
37
Neuron Replication: Which Neurons?
  • A neuron's importance in the computation is
    affected by 3 factors:
  • Number of inputs: inversely proportional
  • Number of outputs: directly proportional
  • Weight range and variation (weight deviation):
    directly proportional
  • Hidden-layer neurons are the most important
  • Input-layer neurons have many inputs
  • Output-layer neurons have less weight deviation
  • Evaluate neuron impact during training and
    topology mapping, and then do any of the
    following (scoring sketched below):
  • If resources are available, replicate high-impact
    neurons (virtualization)
  • Use the error detection schemes mentioned above
    on high-impact neurons
  • Higher power consumption, but a more reliable
    system
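
One way the three factors could be folded into a single score at mapping time (the formula is an assumption; the slides give only the proportionalities):

    import statistics

    def neuron_importance(n_inputs, n_outputs, weights):
        # Directly proportional to fan-out and to weight deviation,
        # inversely proportional to fan-in.
        deviation = statistics.pstdev(weights)
        return n_outputs * deviation / n_inputs

    # Rank neurons by score, then replicate or protect the top ones.
    scores = {1: neuron_importance(4, 3, [0.9, -0.8, 0.1, 0.5]),
              2: neuron_importance(8, 1, [0.2, 0.2, 0.1, 0.2])}
    print(max(scores, key=scores.get))   # highest-impact neuron id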

38
Reliability vs. Power vs. Performance
39
Lowering the Power
  • Multipliers take the biggest share of the power
    consumption pie
  • Use a sleep vector (Synopsys DesignWare library)
  • Offers leakage savings of up to 15%
  • Turn off unused neuron PEs
  • Power savings are huge: up to 40%, depending on
    the application
  • On-chip memory and datapath component power
    consumption optimization:
  • Vdd / clock gating
  • Other known memory power reduction techniques
  • Work in progress
  • Dynamic power consumption: hard
  • → unpredictable inputs and input rate

40
Future Work
  • Power optimization
  • New fault-tolerant algorithm explorations
  • Combination with the NoC fault-tolerant algorithm
    for overall system reliability

41
Conclusions
  • A reconfigurable, expandable, reliable neural
    network architecture
  • NoC implementation
  • → utilizes almost all advantages of the NoC
    architecture
  • High comparative performance
  • Low power: still a work in progress
  • QUESTIONS?

42
References
  • C. Lindsey and T. Lindblad, "Review of Neural
    Network Hardware: A User's Perspective," IEEE
    Third Workshop on Neural Networks: From Biology
    to High Energy Physics, Marciana Marina, Isola
    d'Elba, Italy, Sept. 26-30, 1994.
  • M. Glesner and W. Pöchmüller, Neurocomputers: An
    Overview of Neural Networks in VLSI, Chapman &
    Hall, London, 1994.
  • J. N. H. Heemskerk, "Overview of Neural
    Hardware," in Neurocomputers for Brain-Style
    Processing: Design, Implementation and
    Application, PhD Thesis, Unit of Experimental and
    Theoretical Psychology, Leiden University, The
    Netherlands, 1995.
  • T. Axelrod, E. Vittoz, I. Saxena et al., "Neural
    Network Hardware Implementations," Handbook of
    Neural Computation, IOP Publishing Ltd and Oxford
    University Press, London, 1997.
  • A. K. Jain, J. Mao, and K. M. Mohiuddin,
    "Artificial Neural Networks: A Tutorial," IEEE
    Computer, Vol. 29, pp. 31-44, March 1996.
  • L. Benini and G. De Micheli, "Networks on Chips:
    A New SoC Paradigm," IEEE Computer, Vol. 35, pp.
    70-78, January 2002.
  • W. Dally and B. Towles, "Route Packets, Not
    Wires: On-Chip Interconnection Networks," in
    Proceedings of the 38th Design Automation
    Conference, June 2001.
  • S. Kumar, A. Jantsch, M. Millberg, J. Berg, J. P.
    Soininen, M. Forsell, K. Tiensyrjä, and A.
    Hemani, "A Network on Chip Architecture and
    Design Methodology," in Proceedings of the IEEE
    Computer Society Annual Symposium on VLSI, April
    2002.
  • T. Theocharides, G. Link, N. Vijaykrishnan, M. J.
    Irwin, and W. Wolf, "Embedded Hardware Face
    Detection," in Proceedings of the VLSI Design
    Conference, Mumbai, India, January 2004.
  • B. Krose and P. Van Der Smagt, "An Introduction
    to Neural Networks," 8th Edition, Internal
    Report, The University of Amsterdam, The
    Netherlands, November 1996.
  • D. Whelihan and H. Schmit, "NOCSim Simulator,"
    http://www.ece.cmu.edu/ece725/DOCUMENTS/NOCsim_users_guide.doc,
    September 2003.