Neural Networks - PowerPoint PPT Presentation

Slides: 78
Provided by: edus1246
1
Neural Networks
2
WHY ARTIFICIAL NEURAL NETWORKS?
  • Characteristics of the human brain that are not
    present in von Neumann or modern parallel
    computers include
  • massive parallelism,
  • distributed representation and computation,
  • learning ability,
  • generalization ability,
  • adaptivity,
  • inherent contextual information processing,
  • fault tolerance, and
  • low energy consumption.
  • It is hoped that devices based on biological
    neural networks will possess some of these
    desirable characteristics.

3
(No Transcript)
4
ANNs
  • Inspired by biological neural networks, ANNs are
    massively parallel computing systems consisting
    of an extremely large number of simple processors
    with many interconnections.
  • ANN models attempt to use some organizational
    principles believed to be used in the human
    brain.

5
Brief historical review
  • ANN research has experienced three periods of
    extensive activity.
  • The first peak, in the 1940s, was due to
    McCulloch and Pitts' pioneering work.
  • The second occurred in the 1960s with
    Rosenblatt's perceptron convergence theorem and
    Minsky and Papert's work showing the limitations
    of a simple perceptron. Minsky and Papert's
    results dampened the enthusiasm of most
    researchers, and the resulting lull lasted
    almost 20 years.
  • Since the early 1980s, ANNs have received
    considerable renewed interest. The major
    developments include:
  • Hopfield's energy approach in 1982, and
  • the back-propagation learning algorithm for
    multilayer perceptrons (multilayer feed-forward
    networks), first proposed by Werbos and then
    popularized by Rumelhart et al. in 1986.

6
Biological neural networks
  • A neuron (or nerve cell) is a special biological
    cell that processes information. It is composed
    of a cell body, or soma, and two types of
    out-reaching tree-like branches: the axon and the
    dendrites.

7
Biological neural networks (cont.)
  • A neuron receives signals (impulses) from other
    neurons through its dendrites (receivers) and
    transmits signals generated by its cell body
    along the axon (transmitter), which eventually
    branches into strands and sub strands.
  • At the terminals of these strands are the
    synapses.
  • A synapse is an elementary structure and
    functional unit between two neurons (an axon
    strand of one neuron and a dendrite of another).

8
Biological neural networks (cont.)
  • The human brain contains about $10^{11}$ neurons,
    which is approximately the number of stars in the
    Milky Way.
  • Neurons are massively connected, much more
    complex and dense than telephone networks.
  • Each neuron is connected to $10^{3}$ to $10^{4}$
    other neurons.
  • In total, the human brain contains approximately
    $10^{14}$ to $10^{15}$ interconnections.

9
Biological neural networks (cont.)
  • Complex perceptual decisions such as face
    recognition are typically made by humans within a
    few hundred milliseconds.
  • These decisions are made by a network of neurons
    whose operational speed is only a few
    milliseconds. This implies that:
  • the computations cannot take more than about 100
    serial stages;
  • the brain runs parallel programs that are about
    100 steps long for such perceptual tasks. This is
    known as the hundred-step rule.

10
Computational models of neurons
  • This mathematical neuron computes a weighted sum
    of its n input signals $x_j$, $j = 1, 2, \ldots, n$.
  • It generates an output of 1 if this sum exceeds a
    certain threshold U. Otherwise, an output of 0
    results.

11
  • Mathematically,
    $y = \theta\left(\sum_{j=1}^{n} w_j x_j - U\right)$
  • $\theta(\cdot)$ is the unit step function.
  • $w_j$ is the synapse weight associated with the
    jth input.
  • For simplicity of notation, we often consider the
    threshold U as another weight $w_0 = -U$ attached
    to the neuron with a constant input $x_0 = 1$.
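A minimal Python sketch of this threshold neuron, written both with an explicit threshold and with the threshold folded in as a bias weight. The function names and the example weights (1, 1) with threshold 1.5 are illustrative assumptions, not taken from the slides.

```python
def step(v):
    """Unit step function: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0


def threshold_neuron(x, w, u):
    """Output 1 if the weighted sum of the inputs exceeds the threshold u."""
    v = sum(wj * xj for wj, xj in zip(w, x)) - u
    return step(v)


def threshold_neuron_bias(x, w, u):
    """Same neuron, with the threshold treated as w0 = -u on a constant input x0 = 1."""
    w_aug = [-u] + list(w)
    x_aug = [1] + list(x)
    return step(sum(wj * xj for wj, xj in zip(w_aug, x_aug)))


# Assumed example: weights (1, 1) and threshold 1.5 make the neuron behave like AND.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, threshold_neuron(x, (1, 1), 1.5), threshold_neuron_bias(x, (1, 1), 1.5))
```

Both forms give the same output for every input, which is why the bias-weight notation is convenient.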

12
Activation Functions

13
The Sigmoid
  • The standard sigmoid function is the logistic
    function, defined by
    $g(x) = \frac{1}{1 + e^{-\beta x}}$

where $\beta$ is the slope parameter.
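A small Python sketch of the logistic function with a slope parameter. The name `logistic` and the sample slope values are assumptions chosen for illustration; a larger slope makes the curve approach the unit step function.

```python
import math


def logistic(x, beta=1.0):
    """Logistic sigmoid 1 / (1 + exp(-beta * x)) with slope parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * x))


# Evaluate the function at a few points for several (assumed) slope values.
for beta in (0.5, 1.0, 5.0):
    print(beta, [round(logistic(x, beta), 3) for x in (-2.0, 0.0, 2.0)])
```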
14
Network architectures
  • ANNs can be viewed as weighted directed graphs in
    which artificial neurons are nodes and directed
    edges (with weights) are connections between
    neuron outputs and neuron inputs.
  • Based on the connection pattern, ANNs can be
    grouped into two categories:
  • feed-forward networks, in which graphs have no
    loops, and
  • recurrent (or feedback) networks, in which loops
    occur because of feedback connections.

15
Network architectures

Different connectivities yield different network
behaviors.
16
Network architectures
  • Feed-forward networks are
  • static, that is, they produce only one set of
    output values rather than a sequence of values
    from a given input.
  • memory-less in the sense that their response to
    an input is independent of the previous network
    state.
  • Recurrent, or feedback, networks are
  • dynamic systems.
  • When a new input pattern is presented, the neuron
    outputs are computed. Because of the feedback
    paths, the inputs to each neuron are then
    modified, which leads the network to enter a new
    state.
  • Different network architectures require
    appropriate learning algorithms.

17
Learning
  • A learning process in the ANN context can be
    viewed as the problem of updating network
    architecture and connection weights so that a
    network can efficiently perform a specific task.
  • The network usually must learn the connection
    weights from available training patterns.
  • Performance is improved over time by iteratively
    updating the weights in the network.

18
Learning
  • ANNs' ability to automatically learn from
    examples makes them attractive and exciting.
  • ANNs appear to learn underlying rules (like
    input-output relationships) from the given
    collection of representative examples.
  • This is one of the major advantages of neural
    networks over traditional expert systems.

19
Learning algorithm
  • To understand or design a learning process, you
    must have:
  • A learning paradigm: a model of the environment
    in which a neural network operates, i.e., you
    must know what information is available to the
    network.
  • Learning rules: you must understand how network
    weights are updated, i.e., which learning rules
    govern the updating process.
  • A learning algorithm refers to a procedure in
    which learning rules are used for adjusting the
    weights.

20
Learning paradigms
  • Supervised learning: The network is provided with
    a correct answer (output) for every input pattern
    - learning with a teacher.
  • Weights are determined to allow the network to
    produce answers as close as possible to the known
    correct answers.
  • Reinforcement learning: a variant of
    supervised learning in which the network is
    provided with only a critique on the correctness
    of network outputs, not the correct answers
    themselves.
  • Unsupervised learning: The network explores the
    underlying structure in the data, or correlations
    between patterns in the data, and organizes
    patterns into categories from these correlations
    - learning without a teacher.
  • Hybrid learning: Part of the weights are usually
    determined through supervised learning, while the
    others are obtained through unsupervised learning
    - combines supervised and unsupervised learning.

21
Learning theory
  • Learning theory must address three fundamental
    and practical issues associated with learning
    from samples: capacity, sample complexity, and
    computational complexity.
  • Capacity: how many patterns can be stored, and
    what functions and decision boundaries a network
    can form.
  • Sample complexity determines the number of
    training patterns needed to train the network to
    guarantee a valid generalization.
  • Too few patterns may cause over-fitting
    (wherein the network performs well on the
    training data set, but poorly on independent test
    patterns drawn from the same distribution as the
    training patterns).
  • Computational complexity refers to the time
    required for a learning algorithm to estimate a
    solution from training patterns.
  • Many existing learning algorithms have high
    computational complexity.

22
Learning rules
  • There are four basic types of learning rules:
    error-correction, Boltzmann, Hebbian, and
    competitive learning.
  • ERROR-CORRECTION RULES: During the learning
    process, the actual output y generated by the
    network may not equal the desired output d.
  • The basic principle of error-correction learning
    rules is to use the error signal (d - y) to modify
    the connection weights to gradually reduce this
    error.
  • The perceptron learning rule is based on this
    error-correction principle.
  • A perceptron consists of a single neuron with
    adjustable weights $w_j$, $j = 1, 2, \ldots, n$, and
    threshold U (threshold function).

23
ERROR-CORRECTION RULES
  • Given an input vector $x = (x_1, x_2, \ldots, x_n)^t$,
    the net input to the neuron is
    $v = \sum_{j=1}^{n} w_j x_j - U$
  • The output y of the perceptron is 1 if $v > 0$,
    and 0 otherwise.
  • In a two-class classification problem, the
    perceptron assigns an input pattern to one class
    if $y = 1$, and to the other class if $y = 0$.
  • The linear equation $\sum_{j=1}^{n} w_j x_j - U = 0$
    defines the decision boundary (a hyperplane in the
    n-dimensional input space) that halves the space.

24
Perceptron learning algorithm
  • Randomly initialize the weights $w_1, w_2, \ldots, w_n$
    and the threshold U.
  • Present an input vector $x = (x_1, x_2, \ldots, x_n)^t$
    and evaluate the output of the neuron.
  • Update the weights according to
  • $w_j(t+1) = w_j(t) + \eta\,(d - y)\,x_j$
  • where d is the desired output, t is the
    iteration number, and $\eta$ is the gain (step
    size), with $0.0 < \eta < 1.0$.
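The three steps above can be sketched in Python as follows. This is a minimal illustration, assuming the threshold is handled as a bias weight on a constant input of 1. The AND training set, the gain of 0.5, the initial weight range, and the function names are all assumptions made for the example.

```python
import random


def predict(w, x):
    """Perceptron output: 1 if the net input is positive, else 0 (w[0] acts as -U)."""
    xa = [1] + list(x)
    return 1 if sum(wj * xj for wj, xj in zip(w, xa)) > 0 else 0


def train_perceptron(patterns, eta=0.5, max_epochs=100, seed=0):
    """Apply w_j <- w_j + eta * (d - y) * x_j until no pattern is misclassified."""
    rng = random.Random(seed)
    n = len(patterns[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]  # random initial weights
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            y = predict(w, x)
            if y != d:  # learning occurs only when the perceptron makes an error
                errors += 1
                xa = [1] + list(x)
                w = [wj + eta * (d - y) * xj for wj, xj in zip(w, xa)]
        if errors == 0:
            break
    return w


# Assumed training set: the linearly separable AND function.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print([predict(w, x) for x, _ in AND])
```

Because AND is linearly separable, the loop stops once every pattern is classified correctly.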

25
Perceptron learning algorithm
  • Note that learning occurs only when the
    perceptron makes an error.
  • The perceptron convergence theorem: Rosenblatt
    proved that when training patterns are drawn from
    two linearly separable classes, the perceptron
    learning procedure converges after a finite
    number of iterations.
  • In practice, you do not know whether the patterns
    are linearly separable.
  • Many variations of this learning algorithm have
    been proposed in the literature.
  • Other activation functions that lead to different
    learning characteristics can also be used.
  • The back-propagation learning algorithm is based
    on the error-correction principle.

26
Perceptrons and Boolean Functions
  • If inputs are all 0s and 1s and outputs are all
    0s and 1s:
  • Can learn the function $x_1 \wedge x_2$ (AND).
  • Can learn the function $x_1 \vee x_2$ (OR).

27
Perceptrons and Boolean Functions
  • What about the exclusive or function?
  • $f(x_1, x_2) = x_1 \oplus x_2
    = (x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$

28
XOR problem
  • Desired: make an ANN that produces Y = X1 xor X2
    on inputs X1 and X2.
  • Problem: there is no single line that can cut the
    X1-X2 space into two proper regions. Therefore, a
    single-layer neural net cannot be used.
  • Solution: use a multilayer network.
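One way to see that a multilayer network solves the problem is to wire one by hand. The sketch below uses two hidden threshold units (one computing OR, one computing AND) and an output unit that fires when OR is on and AND is off. The weights and thresholds are assumptions picked for illustration, not learned values.

```python
def step(v):
    """Unit step function: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0


def xor_net(x1, x2):
    """Two-layer threshold network computing X1 xor X2 with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: OR is on and AND is off


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```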

29
HEBBIAN RULE
  • The oldest learning rule is Hebb's postulate of
    learning. Hebb based it on the following
    observation from neurobiological experiments:
  • If neurons on both sides of a synapse are
    activated synchronously and repeatedly, the
    synapse's strength is selectively increased.
  • Mathematically, the Hebbian rule can be described
    as
    $w_{ij}(t+1) = w_{ij}(t) + \eta\, y_j(t)\, x_i(t)$
  • where $x_i$ and $y_j$ are the output values of
    neurons i and j, respectively, which are connected
    by the synapse $w_{ij}$, and $\eta$ is the learning
    rate. Note that $x_i$ is the input to the synapse.
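A minimal sketch of one Hebbian update, assuming the common form in which each weight grows by the learning rate times the product of the activities on both sides of the synapse. The function name, the nested-list weight layout, and the sample activities are illustrative assumptions.

```python
def hebbian_update(w, x, y, eta=0.1):
    """Return new weights; w[i][j] connects presynaptic unit i to postsynaptic unit j.

    Each weight changes by eta * y[j] * x[i], so the update for a synapse
    depends only on the two neurons it connects.
    """
    return [[w[i][j] + eta * y[j] * x[i] for j in range(len(y))]
            for i in range(len(x))]


# Assumed activities: presynaptic unit 0 is active, both postsynaptic units are active.
w0 = [[0.0, 0.0], [0.0, 0.0]]
w1 = hebbian_update(w0, x=[1, 0], y=[1, 1])
print(w1)
```

Only the synapses leaving the active presynaptic unit are strengthened; the others stay unchanged.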

30
HEBBIAN RULE
  • An important property of this rule is that
  • learning is done locally, i.e., the change in
    synapse weight depends only on the activities of
    the two neurons connected by it.
  • This significantly simplifies the complexity of
    the learning circuit in a VLSI implementation.

31
HEBBIAN RULE
  • A single neuron trained using the Hebbian rule
    exhibits an orientation selectivity.
  • The points depicted are drawn from a
    two-dimensional Gaussian distribution and used
    for training a neuron.
  • The weight vector of the neuron is initialized to
    w0.
  • As the learning proceeds, the weight vector
    moves progressively closer to the direction w of
    maximal variance in the data.
  • w is the eigenvector of the covariance matrix of
    the data corresponding to the largest eigenvalue.

32
BOLTZMANN LEARNING
  • The Boltzmann machine was named by its inventors
    in honour of the 19th-century scientist Ludwig
    Boltzmann.
  • Boltzmann machines are symmetric recurrent
    networks consisting of binary units (+1 for on
    and -1 for off).
  • Symmetric means that the weight on the connection
    from unit i to unit j is equal to the weight on
    the connection from unit j to unit i.
  • A subset of the neurons, called visible, interact
    with the environment; the rest, called hidden, do
    not.
  • Each neuron is a stochastic unit that generates
    an output (or state) according to the Boltzmann
    distribution of statistical mechanics.

33
  • Boltzmann machines operate in two modes:
  • Clamped: visible neurons are clamped onto
    specific states determined by the environment,
    and
  • Free-running: both visible and hidden neurons are
    allowed to operate freely. The hidden neurons
    always operate freely.
  • K is the number of visible neurons
  • L is the number of hidden neurons.

34
BOLTZMANN LEARNING
  • Boltzmann learning is a stochastic learning rule
    derived from information-theoretic and
    thermodynamic principles.
  • The objective of Boltzmann learning is to adjust
    the connection weights so that the states of
    visible units satisfy a particular desired
    probability distribution.
  • According to the Boltzmann learning rule, the
    change in the connection weight $w_{ij}$ is given by
    $\Delta w_{ij} = \eta\,(\bar{\rho}_{ij} - \rho_{ij})$
  • where $\eta$ is the learning rate, and
    $\bar{\rho}_{ij}$ and $\rho_{ij}$ are the
    correlations between the states of units i and j
    when the network operates in the clamped mode and
    free-running mode, respectively.
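A rough Python sketch of this weight update, assuming the correlations are estimated as the average product of the two unit states over sampled network states. The sampling itself (simulated annealing) is not shown; the sample lists below are made-up placeholders and the function names are assumptions.

```python
def correlations(samples):
    """Average product s_i * s_j over a list of state vectors with entries -1 or +1."""
    n = len(samples[0])
    return [[sum(s[i] * s[j] for s in samples) / len(samples) for j in range(n)]
            for i in range(n)]


def boltzmann_update(w, clamped_samples, free_samples, eta=0.1):
    """Change each weight by eta * (clamped correlation - free-running correlation)."""
    rho_clamped = correlations(clamped_samples)
    rho_free = correlations(free_samples)
    n = len(w)
    # The diagonal is kept at zero, assuming units have no self-feedback.
    return [[(w[i][j] + eta * (rho_clamped[i][j] - rho_free[i][j])) if i != j else 0.0
             for j in range(n)] for i in range(n)]


# Placeholder samples for a two-unit network (not produced by a real simulation).
clamped = [[1, 1], [-1, -1]]   # the units always agree when clamped
free = [[1, -1], [1, 1]]       # the units agree only half the time when running freely
w_new = boltzmann_update([[0.0, 0.0], [0.0, 0.0]], clamped, free)
print(w_new)
```

The update strengthens the connection because the units agree more often in the clamped mode than in the free-running mode, and the result stays symmetric.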

35
Summary of the Boltzmann Machine Learning
Procedure
  • 1. Initialization: set weights to random numbers
    in [-1, 1].
  • 2. Clamping Phase: Present the net with the
    mapping it is supposed to learn by clamping input
    and output units to patterns. For each pattern,
    perform simulated annealing on the hidden units
    at a sequence T0, T1, ..., Tfinal of
    temperatures. At the final temperature, collect
    statistics to estimate the clamped-mode
    correlations between unit states.

36
Summary of the Boltzmann Machine Learning
Procedure
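  • 3. Free-Running Phase: Repeat the calculations
    performed in step 2, but this time clamp only the
    input units. Hence, at the final temperature,
    estimate the free-running correlations between
    unit states.
  • 4. Updating of Weights: update them using the
    learning rule
    $\Delta w_{ji} = \eta\,(\bar{\rho}_{ji} - \rho_{ji})$
    where $\eta$ is a learning rate parameter, and
    $\bar{\rho}_{ji}$ and $\rho_{ji}$ are the clamped
    and free-running correlations, respectively.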

37
Summary of the Boltzmann Machine Learning
Procedure
  • 5. Iterate until Convergence: Iterate steps 2 to
    4 until the learning procedure converges, with no
    more changes taking place in the synaptic weights
    $w_{ji}$ for all j, i.

38
(No Transcript)
39
Alternative Boltzmann Architecture
  • Alternatively, the visible units may be viewed as
    divided into input and output units.
  • In this case the Boltzmann machine performs
    association under the supervision of a teacher,
    with the input units receiving information from
    the environment, and the output units reporting
    the outcome for that input pattern.

40
Boltzmann vs Hopfield
  • Similarities
  • 1. Processing units have binary states (±1)
  • 2. Connections between units are symmetric
  • 3. Units are picked at random and one at a time
    for updating
  • 4. Units have no self-feedback.
  • Differences
  • 1. Boltzmann machine permits the use of hidden
    neurons.
  • 2. Boltzmann machine uses stochastic neurons with
    a probabilistic firing mechanism, whereas the
    standard Hopfield net uses neurons based on the
    McCulloch-Pitts model with a deterministic firing
    mechanism.
  • 3. Boltzmann machine may also be trained by a
    probabilistic form of supervision.

41
COMPETITIVE LEARNING RULES
  • In competitive learning, output units compete
    among themselves for activation. As a result, only
    one output unit is active at any given time. This
    phenomenon is known as winner-take-all.
  • Competitive learning has been found to exist in
    biological neural networks.
  • Competitive learning often clusters or
    categorizes the input data. Similar patterns are
    grouped by the network and represented by a
    single unit. This grouping is done automatically
    based on data correlations.

42
COMPETITIVE LEARNING RULES
  • The simplest competitive learning network
    consists of a single layer of output units.
  • Each output unit i in the network connects to all
    the input units ($x_j$'s) via weights $w_{ij}$,
    $j = 1, 2, \ldots, n$.
  • Each output unit also connects to all other
    output units via inhibitory weights but has a
    self-feedback with an excitatory weight.

43
COMPETITIVE LEARNING RULES
  • A simple competitive learning rule can be stated
    as
    $\Delta w_{ij} = \eta\,(x_j^{\mu} - w_{i^*j})$ if $i = i^*$,
    and $\Delta w_{ij} = 0$ if $i \neq i^*$,
    where $i^*$ is the winner unit and $x^{\mu}$ is the
    current input pattern.
  • Note that only the weights of the winner unit get
    updated.
  • The effect of this learning rule is to move the
    stored pattern in the winner unit (weights) a
    little bit closer to the input pattern.
  • Assume that all input vectors have been
    normalized to have unit length.
  • The weight vectors of the three units are
    randomly initialized. Their initial and final
    positions on the sphere after competitive
    learning are marked as Xs.
  • Each of the three natural groups (clusters) of
    patterns has been discovered by an output unit
    whose weight vector points to the center of
    gravity of the discovered group.
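A minimal sketch of one winner-take-all update, assuming unit-length inputs and picking the winner as the unit whose weight vector has the largest inner product with the input. The function names, the learning rate of 0.1, and the two starting weight vectors are assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def winner(weights, x):
    """Index of the output unit whose weight vector best matches the input."""
    return max(range(len(weights)),
               key=lambda i: sum(w * a for w, a in zip(weights[i], x)))


def competitive_step(weights, x, eta=0.1):
    """Move only the winner's weight vector a little closer to the input pattern."""
    i_star = winner(weights, x)
    weights[i_star] = [w + eta * (a - w) for w, a in zip(weights[i_star], x)]
    return i_star


# Assumed starting weights for two output units and one normalized input.
weights = [[1.0, 0.0], [0.0, 1.0]]
x = normalize([3.0, 4.0])
print(competitive_step(weights, x), weights)
```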

44
COMPETITIVE LEARNING RULES
  • You can see from the competitive learning rule
    that the network will not stop learning (updating
    weights) unless the learning rate $\eta$ is 0.
  • A particular input pattern can fire different
    output units at different iterations during
    learning.
  • The system is said to be stable if no pattern in
    the training data changes its category after a
    finite number of learning iterations.
  • One way to achieve stability is to force the
    learning rate to decrease gradually towards 0 as
    the learning process proceeds. However, this
    artificial freezing of learning causes another
    problem: a loss of plasticity, which is the
    ability of the network to adapt to new data. This
    is known as Grossberg's stability-plasticity
    dilemma in competitive learning.

45
COMPETITIVE LEARNING RULES
  • The most well-known example of competitive
    learning is vector quantization for data
    compression.
  • It has been widely used in speech and image
    processing for efficient storage, transmission,
    and modeling.
  • Its goal is to represent a set or distribution of
    input vectors with a relatively small number of
    prototype vectors (weight vectors), or a
    codebook. Once a codebook has been constructed
    and agreed upon by both the transmitter and the
    receiver, you need only transmit or store the
    index of the corresponding prototype to the input
    vector.
  • Given an input vector, its corresponding
    prototype can be found by searching for the
    nearest prototype in the codebook.
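The nearest-prototype search can be sketched in a few lines of Python. The codebook below is a made-up example, and the names `encode` and `decode` are assumptions; squared Euclidean distance is assumed as the measure of nearness.

```python
def encode(codebook, x):
    """Return the index of the prototype nearest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - c) ** 2 for a, c in zip(x, codebook[k])))


def decode(codebook, index):
    """Recover the prototype vector that stands in for the original input."""
    return codebook[index]


# Assumed codebook shared by transmitter and receiver.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
index = encode(codebook, (0.9, 1.2))   # only this index is stored or transmitted
print(index, decode(codebook, index))
```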

46
Well known learning algorithms
47
Well known learning algorithms
48
SUMMARY
  • Learning rules based on error-correction can be
    used for training feed-forward networks
  • Hebbian learning rules have been used for all
    types of network architectures.
  • Each learning algorithm is designed for training
    a specific architecture.
  • When we discuss a learning algorithm, a
    particular network architecture association is
    implied.
  • Each algorithm can perform only a few tasks well.
  • Other algorithms include Adaline, Madaline,
    linear discriminant analysis, Sammon's
    projection, and principal component analysis.

49
Multilayer Networks
  • The class of functions representable by
    perceptrons is limited

This is a nonlinear function of a linear
combination of nonlinear functions of linear
combinations of inputs.
50
A 1-HIDDEN LAYER NET
NINPUTS = 2
NHIDDEN = 3
[Figure: network diagram with inputs x1, x2 and
weight labels w11, w12, w21, w22, w31, w32 and
w1, w2, w3]
51
OTHER NEURAL NETS

52
Multilayer perceptron
  • The most popular class of multilayer feed-forward
    networks is multilayer perceptrons
  • Each computational unit employs either the
    thresholding function or the sigmoid function.
  • Multilayer perceptrons can form arbitrarily
    complex decision boundaries and represent any
    Boolean function.
  • The development of the back-propagation learning
    algorithm for determining weights in a multilayer
    perceptron has made these networks the most
    popular among researchers and users of neural
    networks.

53
Multilayer perceptron
  • We denote $w_{ij}^{(l)}$ as the weight on the
    connection from the ith unit in layer (l-1) to
    the jth unit in layer l.
  • Let $(x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots,
    (x^{(p)}, d^{(p)})$ be a set of p training patterns
    (input-output pairs),
  • where $x^{(i)} \in R^n$ is the input vector in the
    n-dimensional pattern space, and
  • $d^{(i)} \in [0, 1]^m$, an m-dimensional hypercube.
  • For classification purposes, m is the number of
    classes. The squared error cost function most
    frequently used in the ANN literature is defined
    as
    $E = \frac{1}{2} \sum_{i=1}^{p} \| y^{(i)} - d^{(i)} \|^2$
    where $y^{(i)}$ is the network output for input
    $x^{(i)}$.
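As a small numeric illustration, the squared-error cost can be computed as half the sum of squared differences between network outputs and desired outputs over all patterns. The function name and the sample outputs and targets are assumptions made up for the example.

```python
def squared_error(outputs, targets):
    """Half the sum, over all patterns, of the squared distance between output and target."""
    return 0.5 * sum(
        sum((y - d) ** 2 for y, d in zip(y_vec, d_vec))
        for y_vec, d_vec in zip(outputs, targets)
    )


# Assumed example: two patterns, two classes (m = 2).
outputs = [[0.8, 0.2], [0.1, 0.9]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(squared_error(outputs, targets))
```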

54
Back-propagation
  • The back-propagation algorithm is a
    gradient-descent method to minimize the
    squared-error cost function E.

55
GRADIENT DESCENT
  • Suppose we have a scalar function $f(w): R \to R$.
  • We want to find a local minimum.
  • Assume our current weight is w.
  • GRADIENT DESCENT RULE:
    $w \leftarrow w - \eta\, \frac{\partial f}{\partial w}(w)$
  • $\eta$ is called the LEARNING RATE. It is a small
    positive number, e.g. $\eta = 0.05$.

56
Gradient Descent in m Dimensions
  • Given

points in direction of steepest ascent.
GRADIENT DESCENT RULE Equivalently
.where wj is the jth weight just like a linear
feedback system
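A short Python sketch of the gradient descent rule in m dimensions. The quadratic test function, its hand-derived gradient, the starting point, and the step count are assumptions chosen so that the minimum is known to be at (3, -1).

```python
def gradient_descent(grad, w, eta=0.05, steps=200):
    """Repeatedly apply w_j <- w_j - eta * (partial derivative of f w.r.t. w_j)."""
    for _ in range(steps):
        g = grad(w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w


def grad_f(w):
    """Gradient of the assumed function f(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_min = gradient_descent(grad_f, [0.0, 0.0])
print(w_min)
```

With a learning rate of 0.05 each step shrinks the distance to the minimum by a constant factor, so the iterates settle close to (3, -1).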
57
A RULE KNOWN BY MANY NAMES
The Widrow Hoff rule
The LMS Rule
The delta rule
The adaline rule
Classical conditioning
58
Back-propagation algorithm
  • 1. Initialize the weights to small random
    values.
  • 2. Randomly choose an input pattern $x^{(\mu)}$.
  • 3. Propagate the signal forward through the
    network.
  • 4. Compute $\delta_i^L$ in the output layer
    ($o_i = y_i^L$):
  • $\delta_i^L = g'(h_i^L)\,[d_i^{\mu} - y_i^L]$,
  • where $h_i^l$ represents the net input to the ith
    unit in the lth layer, and $g'$ is the derivative
    of the activation function g.
  • 5. Compute the deltas for the preceding layers by
    propagating the errors backwards:
  • $\delta_i^l = g'(h_i^l) \sum_j w_{ij}^{l+1}\, \delta_j^{l+1}$,
  • for $l = (L-1), \ldots, 1$.
  • 6. Update weights using
  • $\Delta w_{ji}^l = \eta\, \delta_i^l\, y_j^{l-1}$
  • 7. Go to step 2 and repeat for the next pattern
    until the error in the output layer is acceptably
    low, or a prespecified number of iterations is
    reached.
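The steps above can be sketched for a network with one hidden layer and a single logistic output. This is an illustrative sketch, not a tuned implementation: the hidden-layer size, learning rate, step count, initial weight range, and the XOR training set are all assumptions, and training may stall for some random seeds.

```python
import math
import random


def g(h):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, W1, W2):
    """Step 3: forward pass. Index 0 of each weight row is the bias weight."""
    hidden = [g(row[0] + sum(w * xi for w, xi in zip(row[1:], x))) for row in W1]
    out = g(W2[0] + sum(w * hj for w, hj in zip(W2[1:], hidden)))
    return hidden, out


def backprop_step(x, d, W1, W2, eta=0.5):
    """Steps 3 to 6 for one pattern; W1 and W2 are updated in place."""
    hidden, out = forward(x, W1, W2)
    # Step 4: output delta. For the logistic function, g'(h) = y * (1 - y).
    delta_out = out * (1.0 - out) * (d - out)
    # Step 5: hidden deltas, using the output weights before they are changed.
    delta_hidden = [hj * (1.0 - hj) * W2[j + 1] * delta_out
                    for j, hj in enumerate(hidden)]
    # Step 6: each weight changes by eta * delta * (input feeding that weight).
    W2[0] += eta * delta_out
    for j, hj in enumerate(hidden):
        W2[j + 1] += eta * delta_out * hj
    for j, row in enumerate(W1):
        row[0] += eta * delta_hidden[j]
        for i, xi in enumerate(x):
            row[i + 1] += eta * delta_hidden[j] * xi


def train(patterns, n_hidden=3, eta=0.5, steps=20000, seed=0):
    """Steps 1, 2 and 7: random small weights, then repeated random patterns."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(steps):
        x, d = rng.choice(patterns)
        backprop_step(x, d, W1, W2, eta)
    return W1, W2


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
W1, W2 = train(XOR)
for x, d in XOR:
    print(x, d, round(forward(x, W1, W2)[1], 3))
```

A fixed number of steps is used here for simplicity; stopping when the output error is acceptably low, as in step 7, would need an extra error check inside the loop.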

59
Backpropagation algorithm (instance-based)
  • 1 Randomize the weights $w_s$ to small random
    values (both positive and negative) to ensure
    that the network is not saturated by large values
    of weights.
  • 2 Select an instance t, that is, the vector
    $x_i(t)$, $i = 1, \ldots, N_{inp}$ (a pair of input
    and output patterns), from the training set.
  • 3 Apply the network input vector to the network
    input.
  • 4 Calculate the network output vector
    $z_k(t)$, $k = 1, \ldots, N_{out}$.
  • 5 Calculate the errors for each of the outputs k,
    $k = 1, \ldots, N_{out}$: the difference between the
    desired output and the network output
  • (for simplicity we will denote it as simply E).
  • 6 Calculate the necessary updates for weights
    $\Delta w_s$ in a way that minimizes this error
    (discussed below).
  • 7 Adjust the weights of the network by
    $\Delta w_s$.
  • 8 Repeat steps 2 to 7 for each instance (pair of
    input-output vectors) in the training set until
    the error for the entire system (error E defined
    above or the error on cross-validation set) is
    acceptably low, or the pre-defined number of
    iterations is reached.

60
Backpropagation algorithm
  • Often it is reasonable not to update the weights
    immediately after processing each instance, but
    to accumulate (sum up) the necessary changes
    across a subset of training instances (called an
    epoch) and only then update the weights. This
    allows for faster convergence (Smith 1993).
  • An epoch can be a part of the training set or the
    whole training set. After the whole training set
    is processed (this sequence of steps is called an
    iteration),
  • the whole process is repeated in an iterative
    fashion until the total error is acceptably low.
  • The number of such iterations may sometimes be as
    high as several thousand.

61
Backpropagation algorithm (epoch-based, with
cumulative updates)
  • 1 to 6 as above.
  • 7 Add the calculated weight updates $\Delta w_s$
    to the accumulated total updates $\Delta W_s$.
  • 8 Repeat steps 2 to 7 for several instances
    comprising an epoch.
  • 9 Adjust the weights $w_s$ of the network by the
    updates $\Delta W_s$.
  • 10 Repeat steps 2 to 9 until all instances in the
    training set are processed. This constitutes one
    iteration.
  • 11 Repeat the iteration of steps 2 to 10 until
    the error for the entire system (error E defined
    above or the error on cross-validation set) is
    acceptably low, or the pre-defined number of
    iterations is reached.
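The accumulate-then-apply idea can be sketched on a single linear unit trained with the error-correction rule, which keeps the example short. A full multilayer network would accumulate its back-propagated updates in the same way. The data set, epoch size, learning rate, and function name are assumptions made for illustration.

```python
def train_cumulative(patterns, w, eta=0.1, epoch_size=2, iterations=500):
    """Linear unit y = w . x; updates are summed over an epoch, then applied once."""
    for _ in range(iterations):                      # one iteration = whole training set
        for start in range(0, len(patterns), epoch_size):
            epoch = patterns[start:start + epoch_size]
            total = [0.0] * len(w)                   # accumulated total updates
            for x, d in epoch:
                y = sum(wj * xj for wj, xj in zip(w, x))
                for j, xj in enumerate(x):
                    total[j] += eta * (d - y) * xj   # update for this instance
            w = [wj + tj for wj, tj in zip(w, total)]  # adjust weights once per epoch
    return w


# Assumed data generated from d = 2 * x1 - 1; the first input is a constant 1 (bias).
data = [((1, 0.0), -1.0), ((1, 1.0), 1.0), ((1, 2.0), 3.0), ((1, 3.0), 5.0)]
w = train_cumulative(data, [0.0, 0.0])
print(w)
```

Setting `epoch_size` to 1 gives the instance-based variant, and setting it to the size of the training set gives one update per iteration.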

62

63
Backpropagation
  • In a Single-layer network,
  • Each neuron adjusts its weights according to
    what output was expected of it, and the output it
    gave. This can be mathematically expressed by the
    Perceptron Delta Rule
  • Where w is the array of weights,
  • x is the array of inputs.

64
The Sigmoid (logistic) function
  • One of the more popular activation functions
    used with back-propagation nets is the sigmoid
    (logistic) function.

65
The perceptron learning rule
  • Where w is the array of weights,
  • x is the array of inputs, and $\eta$ is defined as
    the learning rate.
  • $y_i$ and $d_i$ are the actual and desired outputs,
    respectively.
  • Calculating the deltas for the output layer as

66
Calculate delta for the hidden layers
  • We have to know the effect on the output of the
    neuron if a weight is to change.
  • Therefore, we need to know the derivative of the
    error with respect to that weight.
  • It has been proven that for neuron q in hidden
    layer p, delta is

Each delta value for a hidden layer requires that
the delta values for the layer after it be
calculated first.
67
Backpropagation example
NINPUTS 2
NHIDDEN 2

1
W1(0,1)
1
W1(0,2)
W2(0,1)
W1(1,1)
W2(0,1)
W2(1,1)
x1
W1(1,2)
W1(2,1)
x2
W2(2,1)
W1(2,2)
68
Back propagation algorithm
  • 1- Initialize the weights to small random
    values
  • Layer 1
  • Layer 2

69
  • 2- Randomly choose an input pattern $x^{(\mu)}$
  • 3- Propagate the signal forward through the
    network

Layer 1: $X2(i) = \sum_{k=0,1,2} W1(k,i)\, X(k)$
70
  • $Out(x) = g\left(\sum_{k=0,1,2} W1(k,i)\, X(k)\right)$
  • $X2(i) = g\left(\sum_{k=0,1,2} W1(k,i)\, X(k)\right)$
  • $g(x) = 1/(1 + e^{-x})$

71
  • 4- Compute $\delta_i^L$ in the output layer
    ($o_i = y_i^L$):
  • $\delta_i^L = g'(h_i^L)\,[d_i^{\mu} - y_i^L]$,
  • where $h_i^l$ represents the net input to the ith
    unit in the lth layer, and $g'$ is the derivative
    of the activation function g.
  • $\delta3(1) = x3(1)\,(1 - x3(1))\,(d - x3(1))$

72
  • 5- Compute the deltas for the preceding layers by
    propagating the errors backwards:
  • $\delta_i^l = g'(h_i^l) \sum_j w_{ij}^{l+1}\, \delta_j^{l+1}$,
  • for $l = (L-1), \ldots, 1$.

73
  • 6- Update weights using
  • $\Delta w_{ji}^l = \eta\, \delta_i^l\, y_j^{l-1}$
  • Taking $\eta$ as 0.05:
  • $\Delta w2(0,1) = \eta\, x1(0)\, \delta2(1)$

74
  • 7- Go to step 2 and repeat for the next pattern
    until the error in the output layer is acceptably
    low, or a prespecified number of iterations is
    reached.

75
  • Run the entire process again on the next set of
    training data.
  • Slowly, as the training data is fed in and the
    network is retrained a few thousand times, the
    weights can settle to stable values.

76
APPLICATIONS
  • To successfully work with real-world problems,
    you must deal with numerous design issues,
    including network model, network size, activation
    function, learning parameters, and number of
    training samples.
  • Pattern classification
  • Clustering
  • Function approximation
  • Prediction
  • Optimization
  • Content addressable memory
  • Control

77
Reference