HIGH PERFORMANCE MULTILAYER PERCEPTRON ON A CUSTOM COMPUTING MACHINE

1
HIGH PERFORMANCE MULTILAYER PERCEPTRON ON A
CUSTOM COMPUTING MACHINE
  • Authors: Nalini K. Ratha, Anil K. Jain
  • H. GÜL ÇALIKLI
  • 2002700743

2
INTRODUCTION
  • Artificial Neural Networks (ANNs) attempt to
    mimic biological neural networks.
  • One of the main features of biological neural
    networks is the massively parallel
    interconnections among the neurons.

3
INTRODUCTION
  • Computational model of a biological neural
    network
  • simple operations such as
  • inner product computation
  • thresholding
  • design parameters
  • 1. network topology
  • i. number of layers
  • ii. number of nodes in a layer
  • 2. connection weights
  • 3. property at a node, e.g., type of non-linearity
    to be used.

4
INTRODUCTION
Schematic of a Perceptron (figure)
d-dimensional input vector X = (x1, x2, ..., xd)
weight vector W = (w1, w2, w3, ..., wd)
Inner product of input vector X and weight vector W: Σ wi · xi
The inner product passes through a Non-Linearity to give the output y
Output y is used to determine the category of the input
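
The perceptron in the schematic can be sketched in a few lines of Python. This is only an illustration: the slide says a non-linearity follows the inner product but does not say which one, so the hard threshold at 0 and the names `perceptron` and `threshold` are assumptions.

```python
def perceptron(x, w, threshold=0.0):
    """Return 1 if the inner product of x and w exceeds threshold, else 0.

    The hard threshold is an assumed non-linearity, not taken from the slides.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same dimension d")
    # Inner product of the input vector X and the weight vector W.
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > threshold else 0


x = [1.0, 0.5, -0.25]   # hypothetical 3-dimensional input
w = [0.2, -0.4, 0.8]    # hypothetical weights
y = perceptron(x, w)
print(y)
```

The output y is the category assigned to the input; with more than two categories, one such unit per class would be needed.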
5
INTRODUCTION
  • Multilayer Perceptrons (MLPs)
  • one of the most popular neural network models for
    solving pattern classification and image
    classification problems
  • consists of several layers of perceptrons
  • nodes in the i-th layer are connected to nodes in
    the (i+1)-th layer through suitable weights
  • no interconnection among the nodes in a layer.

6
INTRODUCTION: Multilayer Perceptrons
(Figure: a multilayer perceptron shown alongside a biological neuron)
7
INTRODUCTION
  • Training of an MLP
  • 1. Feedforward Stage
  • training patterns with known class labels are
    presented at the input layer
  • at the start, the weight matrix is randomly
    initialised
  • the output is computed at the output node.
  • 2. Weight Update Stage
  • weights are updated in a backward fashion
    starting with the output layer
  • weights are changed in proportion to the error
    between the desired output and the actual output.
  • 3. Repeat 1 and 2 until the network converges
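
The two training stages above can be sketched as standard backpropagation on a tiny network. The sketch makes several assumptions that are not on the slides: a tanh non-linearity in both layers, no bias terms, a learning rate of 0.1, a 3-4-2 topology, and the helper names `forward`, `update` and `squared_error`.

```python
import math
import random


def forward(x, W1, W2):
    """Feedforward stage: return hidden activations and output activations."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    return h, y


def update(x, target, W1, W2, lr=0.1):
    """Weight update stage: one backpropagation step on one pattern (in place)."""
    h, y = forward(x, W1, W2)
    # Output-layer deltas: error times the derivative of tanh.
    d_out = [(t - yk) * (1.0 - yk * yk) for t, yk in zip(target, y)]
    # Hidden-layer deltas, propagated backward through W2 before W2 changes.
    d_hid = [(1.0 - hj * hj) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += lr * d_out[k] * h[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += lr * d_hid[j] * x[i]


def squared_error(x, target, W1, W2):
    """Sum of squared differences between desired and actual outputs."""
    _, y = forward(x, W1, W2)
    return sum((t - yk) ** 2 for t, yk in zip(target, y))


random.seed(0)  # weights are randomly initialised at the start
W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
x, target = [0.5, -1.0, 0.25], [0.8, -0.8]  # one hypothetical training pattern

before = squared_error(x, target, W1, W2)
for _ in range(50):  # repeat stages 1 and 2
    update(x, target, W1, W2)
after = squared_error(x, target, W1, W2)
print(before, after)
```

A real training run would loop over many labelled patterns and stop on a convergence test; a single pattern is used here only to keep the sketch short.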

8
INTRODUCTION
A Multilayer Perceptron (figure)
The feature vector (x1, x2, x3, ..., xd) enters the input layer,
which is followed by a hidden layer and an output layer.
9
INTRODUCTION
  • For an n-node MLP, O(n^2) interconnections are
    needed.
  • THUS
  • mapping an MLP onto a parallel processor is a real
    challenge.
  • on a uniprocessor, the whole operation proceeds
    sequentially one node at a time. (no complex
    communications involved)
  • HOWEVER
  • for a high performance implementation, efficient
    communication capability must be supported.

10
INTRODUCTION
  • Typical Pattern Recognition and Computer Vision
    Applications
  • applications have > 100 input nodes
  • classification process involving complex decision
    boundaries demands a large number of hidden nodes

11
INTRODUCTION
  • Real Time Computer Vision Applications
  • The network training can be carried out offline.
  • Recall phase: high input/output bandwidth is
    required along with fast classification (recall)
    speeds.

12
INTRODUCTION
  • For a three layer network (excluding the input
    layer), let
  • m = number of input nodes
  • n1 = number of nodes in the first hidden layer
  • n2 = number of nodes in the second hidden layer
  • k = number of output nodes (classes)
  • Nm = the number of multiplications
  • Na = the number of additions

Nonlinearity not included:
Nm = (m × n1) + (n1 × n2) + (n2 × k)
Na = Nm − (n1 + n2 + k)
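
As a quick check of these counts, here is a small Python helper. It follows the reading that every node with p inputs needs p multiplications and p − 1 additions, so the additions are the multiplications minus the number of nodes. The function name and the example layer sizes are hypothetical.

```python
def operation_counts(m, n1, n2, k):
    """Return (Nm, Na) for a three layer MLP, nonlinearity not included."""
    # One multiplication per connection between consecutive layers.
    nm = m * n1 + n1 * n2 + n2 * k
    # Each of the n1 + n2 + k nodes saves one addition on its inner product.
    na = nm - (n1 + n2 + k)
    return nm, na


# Hypothetical sizes: 100 inputs, hidden layers of 50 and 20 nodes, 4 classes.
nm, na = operation_counts(m=100, n1=50, n2=20, k=4)
print(nm, na)
```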
13
INTRODUCTION
  • Example- a Practical Vision System
  • Process a 1024 x 1024 image in real time
  • 30 frames to be processed per second
  • 30 × 1024 × 1024 ≈ 30 × 10^6 input patterns/sec
  • THUS, a real time neural network classifier is
    expected to perform billions of operations per
    second.
  • Connection weights are floating point numbers →
    floating point multiplications and additions
  • Result: throughputs of this kind are difficult to
    achieve with today's most powerful uniprocessors.
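
The arithmetic behind "billions of operations per second" can be sketched as follows. One input pattern per pixel is taken from the slide; the layer sizes passed to `operations_per_pattern` are hypothetical and serve only to show the order of magnitude.

```python
FRAMES_PER_SECOND = 30
IMAGE_WIDTH = 1024
IMAGE_HEIGHT = 1024

# One input pattern per pixel, 30 frames per second.
patterns_per_second = FRAMES_PER_SECOND * IMAGE_WIDTH * IMAGE_HEIGHT


def operations_per_pattern(m, n1, n2, k):
    """Multiplications plus additions for one pattern (nonlinearity excluded)."""
    nm = m * n1 + n1 * n2 + n2 * k
    na = nm - (n1 + n2 + k)
    return nm + na


# Hypothetical layer sizes, not taken from the slides.
ops_per_second = patterns_per_second * operations_per_pattern(m=20, n1=20, n2=10, k=3)
print(patterns_per_second)
print(f"{ops_per_second:.2e}")
```

Even with such a small network, the product lands well above 10^9 operations per second.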

14
INTRODUCTION
  • Parallel Architectures for ANNs
  • Types of Parallelism Available in a MLP
  • 1. Training session parallelism
  • 2. Training example parallelism
  • 3. Layer and forward/backward parallelism
  • 4. Node parallelism
  • 5. Weight parallelism
  • 6. Bit parallelism

easily mapped onto a parallel architecture
15
INTRODUCTION
  • Parallel Architectures for ANNs (cont'd)
  • Complexities involved
  • 1. computational complexity
  • 2. communication complexity → inner product
    computation involves a large number of
    communication steps.
  • THUS, special purpose neurocomputers have been
    built using
  • 1. commercially available special purpose VLSIs
  • 2. special purpose VLSIs → provide the best
    performance

16
INTRODUCTION
  • Parallel Architectures for ANNs (cont'd)
  • Dynamically changing architecture
  • number of nodes
  • number of layers from application to application
  • Expensive to design a VLSI architecture for
    individual applications
  • Typically, architectures with a fixed number of
    nodes and layers are fabricated.

17
INTRODUCTION
  • Parallel Architectures for ANNs (cont'd)
  • Special Purpose ANN Implementations in the
    Literature
  • Ghosh and Hwang
  • investigate architectural requirements for
    simulating ANNs using massively parallel
    multiprocessors
  • propose a model for mapping neural networks onto
    message passing multicomputers.

18
INTRODUCTION
  • Liu
  • presents an efficient implementation of the
    backpropagation algorithm on the CM-5 that avoids
    explicit message passing
  • results of the CM-5 implementation are compared with
    those of the Cray-2, Cray X-MP, and Cray Y-MP
  • Chinn
  • describes a systolic algorithm for ANNs on the
    MasPar-1 using a 2-D systolic array-based design

19
INTRODUCTION
  • Onuki
  • presents a parallel implementation using a set of
    sixteen standard 24 bit DSPs connected in a
    hypercube
  • Kirsanov
  • discusses a new architecture for ANNs using
    Transputers
  • Muller
  • presents a special purpose parallel computer
    using a large number of Motorola floating point
    processors for ANN implementation.

20
INTRODUCTION
  • Parallel Architectures for ANNs (cont'd)
  • Special Purpose VLSI chips designed and fabricated
    for ANN implementations
  • Hammerstrom
  • a high performance and low cost ANN with
  • 64 processing nodes per chip
  • hardware based multiply and accumulate operators
  • Barber
  • used a binary tree adder following parallel
    multipliers in the SPIN-L architecture

21
INTRODUCTION
  • Shinokawa
  • describes a fast ANN with a billion connections
    per second using ASIC VLSI chips
  • Viredaz
  • describes the MANTRA-I neurocomputer using 2x2
    systolic PE blocks
  • Kotilainen
  • proposed a tree of connection units with
    processing units at the leaf nodes for mapping
    many common ANNs.

22
INTRODUCTION
  • Asanovic
  • proposed a VLIW of 128 bit instruction width
    and a 7 stage pipelined processor with 8
    processors per chip.
  • Ramacher
  • describes the architecture of SYNAPSE
  • SYNAPSE: a systolic neural signal processor using
    a 2D array of systolic elements
  • Mueller and Hammerstrom
  • describe the design and implementation of CNAPS

23
INTRODUCTION
  • CNAPS
  • a gate array implementation of ANNs
  • a single CNAPS chip
  • consists of 64 processing nodes
  • each node connected in a SIMD fashion
  • using broadcast interconnect.
  • each processor has
  • 4K bytes of local memory
  • a multiplier
  • ALU
  • dual internal buses

24
INTRODUCTION
  • Cox
  • describes the implementation of GANGLION
  • GANGLION
  • a single neuron caters to a fixed neural
    architecture of
  • 12 input nodes
  • 14 hidden nodes
  • 4 output nodes
  • 8x8 multipliers have been built using CLBs
  • a lookup table is used for the activation
    function.

25
INTRODUCTION
  • Stochastic Neural Architectures
  • There is no need for a time-consuming and area
    costly floating point multiplier.
  • Suitable for VLSI implementations
  • Examples
  • Armstrong and Thomas
  • proposed a variation of ANNs called Adaptive
    Logic Networks (ALNs)
  • ALNs are similar to ANNs
  • costly multiplications replaced by logical AND
    operations
  • additions replaced by logical OR operations

26
INTRODUCTION
  • Masa et al.
  • Describe an ANN
  • with a single output
  • six hidden layers
  • seventy inputs
  • that can operate at a 50 MHz input rate

27
CUSTOM COMPUTING MACHINES
  • Uniprocessor
  • instruction set available to a programmer is
    fixed
  • an algorithm is coded using a sequence of
    instructions
  • processor can serve many applications by simply
    reordering the sequence of instructions
  • Application Specific Integrated Circuits (ASICs)
  • used for a specific application
  • provide higher performance compared to the
    general purpose uniprocessor

28
CUSTOM COMPUTING MACHINES
  • Custom Computing Machine (CCM)
  • a user can customize the architecture and
    instructions for a given application →
    programming at a gate level
  • by programming at a gate level, high performance
    can be achieved.
  • using a CCM, a designer can tune and match the
    architectural requirements of the problem

29
CUSTOM COMPUTING MACHINES
  • A CCM can overcome the limitations of ASICs
  • Limitations of ASICs
  • fast but costly
  • nonreconfigurable
  • time consuming
  • Advantages of CCMs
  • cheap
  • CCMs use Field Programmable Gate Arrays (FPGAs)
    as compute elements
  • FPGAs are off-the-shelf components, thus
    relatively cheap.

30
CUSTOM COMPUTING MACHINES
  • Advantages of CCMs (cont'd)
  • reconfigurable: since FPGAs are reconfigurable,
    CCMs are easily reprogrammed.
  • time saving
  • CCMs do not need to be fabricated with every new
    application since they are often employed for
    fast prototyping
  • THUS they save a considerable amount of time in
    design and implementation of algorithms.

31
SPLASH 2 ARCHITECTURE and PROGRAMMING FLOW
  • Splash 2 is one of the leading FPGA based custom
    computing machines, designed and developed by the
    Supercomputing Research Center

32
SYSTEM LEVEL VIEW of the SPLASH 2 ARCHITECTURE
(Figure labels)
  • Interface board: 1. connects Splash 2 to the host;
    2. extends the address and data buses. The Sun host
    can read/write to memories and memory mapped control
    registers of Splash 2 via these buses.
  • Splash 2 Processing Board: PE X0 controls the data
    flow into the processor board; PEs (X1-X16).
  • Processing Element (PE): each PE has 512 KB of
    memory. The host can read/write this memory.
33
SPLASH 2 ARCHITECTURE and PROGRAMMING FLOW
Processing Element in Splash 2 (figure)
Signals shown: SBus Read, SBus Write, SBus Address,
SBus Data, Address, Data, Processor inhibit; links to
left neighbor, to right neighbor, and to crossbar.
The individual memory available with each PE makes it
convenient to store temporary results and tables.
34
SPLASH 2 ARCHITECTURE and PROGRAMMING FLOW
Programming Flow for Splash 2 (figure)
  • To program Splash 2, we need to program
  • each of the PEs
  • the crossbar
  • the host interface
  • Flow: VHDL source → Simulation → Logic Synthesis
    (gate level description) → Partition, place and
    route (logic placement) → Timing of logic → Splash 2
  • Logic designed using VHDL is verified.
  • Main concern: achieve the best placement of logic
    in an FPGA in order to minimize timing delay.
  • If the logic circuit cannot be mapped to the CLBs
    and flip flops which are available internal to an
    FPGA, then the designer needs to revise the logic in
    the VHDL code and the process is repeated.
  • Once the logic is mapped to CLBs, the timing for
    the entire digital logic is obtained.
  • If the timing obtained is not acceptable, then the
    design process is repeated.
35
SPLASH 2 ARCHITECTURE and PROGRAMMING FLOW
Steps in Software Development on Splash 2 (flowchart)
Flowchart boxes: Design Entry (VHDL); Simulation;
Functional Verification; Verified Design; Partition,
place and route; Delay Analysis; Synthesis; Generate
Control Bits; Debugging; Host interface improvement;
Integration; Host-Splash 2 executable code
36
MAPPING an MLP on SPLASH 2
  • In implementing a neural network classifier on
    Splash-2, the building block is the perceptron
    implementation.
  • To map an MLP onto Splash-2, two physical PEs serve
    as one neuron.
  • the i-th PE handles the inner product phase Σ wij·xi
  • the (i+1)-th PE computes the nonlinear function
    tanh(βx) with β = 0.25
  • where i is odd and (i+1) is even

37
MAPPING an MLP on SPLASH 2
  • Assume perceptrons have been trained →
    connection weights are fixed.
  • Thus, an efficient way of handling the
    multiplication is to employ a Look-up Table. Since
    each PE has a large external memory (512 KB), the
    lookup table can be stored there.
  • A pattern vector component xi is presented at
    every clock cycle
  • 1. Inner Product Calculation
  • The i-th (odd) PE looks up the multiplication table
    to obtain the weighted product

38
MAPPING an MLP on SPLASH 2
  • The sum Σ wij·xi is computed using an accumulator.
  • After all the components of a pattern vector have
    been examined, we have computed the inner
    product.
  • 2. Application of nonlinear function to the inner
    product
  • The nonlinearity is again stored as a lookup
    table in the second PE.
  • On receiving the inner product result from the
    first PE, the second uses the result as the
    address to the non-linearity look-up table and
    produces the output.
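
The two-PE neuron described above can be simulated in software as a rough sketch. The 12-bit word width and the tanh slope of 0.25 are taken from the slides; the fixed-point scaling (`FRAC` = 8 fractional bits), the saturation of the accumulator to 12 bits, the floating point tanh table and all function names are assumptions made for illustration only.

```python
import math

BITS = 12                 # word width given on the slides
SIZE = 1 << BITS          # 4096 table entries per segment
FRAC = 8                  # ASSUMPTION: 8 fractional bits (not stated in the source)
BETA = 0.25               # tanh slope given on the slides


def encode(value):
    """Quantize a real value to a 12-bit two's complement code (saturating)."""
    q = max(-(SIZE // 2), min(SIZE // 2 - 1, round(value * (1 << FRAC))))
    return q & (SIZE - 1)


def decode(code):
    """Signed integer held by a 12-bit two's complement code."""
    return code - SIZE if code >= SIZE // 2 else code


def build_tables(weights):
    """First-PE product segments (one per weight) and second-PE tanh table."""
    products = [[round(w * decode(c)) for c in range(SIZE)] for w in weights]
    tanh_lut = [math.tanh(BETA * decode(c) / (1 << FRAC)) for c in range(SIZE)]
    return products, tanh_lut


def neuron(pattern, products, tanh_lut):
    """One component per clock: look up the product, accumulate, apply tanh."""
    acc = 0
    for j, x in enumerate(pattern):
        acc += products[j][encode(x)]       # first PE: table lookup, no multiplier
    acc = max(-(SIZE // 2), min(SIZE // 2 - 1, acc))  # ASSUMPTION: saturate to 12 bits
    return tanh_lut[acc & (SIZE - 1)]       # second PE: result used as table address


weights = [0.5, -1.25, 0.75]    # hypothetical trained weights
pattern = [1.0, 0.5, -2.0]      # hypothetical pattern vector
products, tanh_lut = build_tables(weights)
y_lut = neuron(pattern, products, tanh_lut)
y_ref = math.tanh(BETA * sum(w * x for w, x in zip(weights, pattern)))
print(y_lut, y_ref)
```

Because the weights are fixed after training, every product w·x can be computed once when the table is built, so the inner loop needs only a memory read and an addition per clock.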

39
MAPPING an MLP on SPLASH 2
  • 3. Thus the output of a neuron is obtained
  • The output is written back to the external memory
    of the second PE starting from a prespecified
    location.
  • 4. After sending all the pattern vectors, the
    host can read back the memory contents.
  • A layer in the neural network is simply a
    collection of neurons working synchronously on
    the input.
  • On Splash-2 this can be achieved by broadcasting
    the input to as many physical PEs as desired. The
    output of a neuron is written into a specified
    segment of external memory and read back by the
    host.

40
MAPPING an MLP on SPLASH 2
  • For every layer in the MLP, stages 1-4 are repeated
    until the output layer is reached.
  • NOTE: For every layer, there is a different
    look-up table.
  • Look-up Table Organization
  • There are m multiplications to be performed per
    node corresponding to the m-dimensional weight
    vector.
  • Look-up Table is divided into m segments.

41
MAPPING an MLP on SPLASH 2
  • Look-up Table Organization (cont'd)
  • A counter is incremented at every clock which
    forms the higher order (block) address for the
    lookup table.
  • Pattern vector component
  • forms the lower order (offset) address bits.
  • Note: The offset can also be negative, corresponding
    to a negative input to the look-up table.
  • Splash-2 has an 18 bit address bus for the external
    memory.
  • Higher order 6 bits for the block address
  • Lower order 12 bits for the offset address within
    the block.
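
The address split above (6 block bits plus 12 offset bits on the 18 bit bus) can be sketched as a bit-packing function. Using the two's complement bit pattern of a negative component directly as the offset is an assumption, as is the function name `lut_address`.

```python
BLOCK_BITS = 6     # higher order bits: segment selected by the clock counter
OFFSET_BITS = 12   # lower order bits: pattern vector component


def lut_address(counter, component):
    """Form the 18-bit external memory address for one table lookup."""
    block = counter & ((1 << BLOCK_BITS) - 1)
    # ASSUMPTION: a negative component is used as its 12-bit two's complement code.
    offset = component & ((1 << OFFSET_BITS) - 1)
    return (block << OFFSET_BITS) | offset


print(lut_address(0, 0))
print(lut_address(2, 5))
print(lut_address(63, -1))
```

With 6 block bits, at most 64 segments of 4096 entries fit in the address space, which bounds the number of weights m that one table can serve.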

42
MAPPING an MLP on SPLASH 2
  • Look-up Table Organization (cont'd)
  • The numbers are represented in 12 bit 2's
    complement representation. Hence
  • the resolution of this representation is eleven
    bits.
  • The accumulator within the PE is 16 bits wide.
  • After accumulation, the accumulator result is
    scaled down to 12 bits.
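
A small sketch of the number ranges involved. The slides do not say how the 16 bit accumulator is scaled down to 12 bits; dropping the four least significant bits with an arithmetic shift is an assumption, and both function names are hypothetical.

```python
def twos_complement_range(bits):
    """Smallest and largest integers representable in 2's complement."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def scale_accumulator(acc16):
    """Scale a 16-bit accumulator value to 12 bits.

    ASSUMPTION: arithmetic right shift by 4 (drop the 4 low-order bits).
    """
    return acc16 >> 4


print(twos_complement_range(12))   # one sign bit plus eleven magnitude bits
print(twos_complement_range(16))
print(scale_accumulator(32767), scale_accumulator(-32768))
```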

43
MAPPING an MLP on SPLASH 2
(Figure: Lookup Table Organization)
44
PERFORMANCE EVALUATION
  • The requirements for mapping an MLP and completing
    a classification process
  • in terms of PEs required
  • number of PEs required is equal to twice the
    number of nodes in each layer.
  • number of clock cycles required = m × K × l
  • where
  • m = number of input layer nodes
  • K = number of patterns
  • l = number of layers

45
PERFORMANCE EVALUATION
  • in the implementation by the authors of the paper
  • m = 20
  • K = 1024 × 1024 ≈ 10^6 (total number of pixels in
    the input image)
  • l = 2
  • THUS
  • no. of clock cycles = 20 × 2 × 10^6 = 40 million
  • with a clock rate of 22 MHz, the time taken for 40
    million clock ticks ≈ 1.81 secs.
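
The timing estimate can be reproduced with a short calculation. The function name is hypothetical; m, K and l are the three factors from the slide, and the second call uses the exact pixel count of a 1024 × 1024 image instead of the rounded 10^6.

```python
def classification_time(m, K, l, clock_hz):
    """Return (clock cycles, seconds) for m * K * l cycles at the given clock."""
    cycles = m * K * l
    return cycles, cycles / clock_hz


# Rounded pattern count, as used on the slide.
cycles_rounded, seconds_rounded = classification_time(20, 10**6, 2, 22e6)
# Exact pixel count of a 1024 x 1024 image.
cycles_exact, seconds_exact = classification_time(20, 1024 * 1024, 2, 22e6)

print(cycles_rounded, round(seconds_rounded, 2))
print(cycles_exact, round(seconds_exact, 2))
```

The exact pixel count gives slightly more cycles and therefore a slightly longer time than the rounded figure.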

46
PERFORMANCE EVALUATION
  • When the number of PEs required is larger than
    the available PEs
  • either more processor boards need to be added
  • or PEs need to be time shared.
  • NOTE
  • neuron outputs are produced independent of
    other neurons
  • algorithm waits till the computations in each
    layer are completed.

47
PERFORMANCE EVALUATION
  • An MLP has communication complexity of O(n^2) where
    n is the number of nodes.
  • As n grows, it will be difficult to get good
    timing performance from a single processor
    system.
  • with a large number of processor boards, the
    single input data bus of 36 bits can cater to
    multiple input patterns.
  • Note
  • In a multiboard system, all boards receive the
    same input.
  • This parallelism can give rise to more data
    streaming into the system,
  • thus the number of clock cycles is reduced.

48
PERFORMANCE EVALUATION
SCALABILITY
  • only a single layer is considered
  • network size is represented by the number of nodes
    in that layer
  • multilayered networks are considered to be linearly
    scalable in the Splash 2 architecture
  • performance measure is processing time, as measured
    by the number of clock cycles for Splash 2 with a
    22 MHz clock.
(Figure: time (log scale) versus size of network, with
curves for sparc20 and splash)
49
PERFORMANCE EVALUATION
  • Speed Evaluation
  • 20 input nodes implemented on a 2-board system
  • 176 million connections per second (MCPS) is
    achieved per layer by running the Splash clock at
    22 MHz.
  • A 6-board system can deliver more than a billion
    connections per second.
  • Comparable to the performance of many high level
    VLSI-based systems such as SYNAPSE and CNAPS, which
    perform in the range of 5 GCPS.

50
PERFORMANCE EVALUATION
  • Network-based Image Segmentation
  • Image Segmentation: the process of partitioning
    an image into mutually exclusive connected image
    regions.
  • In an automated image document understanding
    system, page layout segmentation plays an
    important role for segmenting text, graphics and
    background areas.
  • Jain and Karu proposed an algorithm to learn
    texture discrimination masks needed for image
    segmentation.

51
PERFORMANCE EVALUATION
  • Network-based Image Segmentation (cont'd)
  • The page segmentation algorithm proposed by Jain
    and Karu has three stages of computation
  • 1. feature extraction
  • Based on 20 masks
  • 2. classification
  • A multistage feedforward neural network with
  • 20 input nodes
  • 20 hidden nodes
  • 3 output nodes.
  • 3. postprocessing
  • involves removing small noisy regions and placing
    rectangular blocks around homogeneous identical
    regions.

52
PERFORMANCE EVALUATION
  • Network-based Image Segmentation (cont'd)
  • Schematic of the Page Segmentation Algorithm

53
PERFORMANCE EVALUATION
Page Segmentation (figure panels)
  • input gray level image
  • result of segmentation algorithm
  • result after postprocessing
54
CONCLUSIONS
  • A novel scheme of mapping MLPs onto a Custom
    Computing Machine has been presented.
  • The scheme is scalable in terms of number of
    nodes and the number of layers in the MLP and
    provides near-ASIC level speed.
  • The reconfigurability of CCMs has been exploited to
    map several layers of an MLP onto the same
    hardware.
  • The performance gains achieved using this mapping
    have been demonstrated on a network-based image
    segmentation.