Activation Functions (Presentation Transcript)
1
Activation Functions
2
Log Sigmoid
3
Local Gradient
4
Local Gradient
  • Thus the weight change is
  • Where is the local gradient at its maximum, and where is it at its minimum? (see the sketch below)
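To see where that derivative is largest and smallest, here is a minimal Matlab sketch assuming the standard logistic function y = 1/(1 + exp(-v)), whose derivative is y(1 - y); any slope parameter shown on the slide itself is not reproduced here.

    % Log-sigmoid output and its derivative (the local-gradient factor)
    v  = -6:0.01:6;               % net inputs
    y  = 1 ./ (1 + exp(-v));      % logistic output
    dy = y .* (1 - y);            % derivative of the logistic function
    plot(v, y, v, dy); legend('y', 'dy/dv');
    % dy peaks at 0.25 when v = 0 (y = 0.5) and approaches 0 as y
    % saturates toward 0 or 1, so saturated neurons learn very slowly.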

5
Hyperbolic Tangent
6
Hyperbolic Tangent
7
Local Gradient
  • Thus the weight change is
  • Where is the local gradient at its maximum, and where is it at its minimum? (see the sketch below)
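The same check for the hyperbolic tangent, written in the general form y = a·tanh(b·v) so that the derivative is (b/a)(a - y)(a + y); the values of a and b here are only placeholders (see the later slides on the activation function):

    % Tanh output and its derivative (local-gradient factor)
    a = 1.7159; b = 2/3;               % assumed constants (see later slides)
    v  = -6:0.01:6;
    y  = a * tanh(b * v);
    dy = (b/a) * (a - y) .* (a + y);   % equals a*b*(1 - tanh(b*v).^2)
    plot(v, y, v, dy); legend('y', 'dy/dv');
    % The gradient is maximal at v = 0 (y = 0) and vanishes as y -> +/-a.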

8
Momentum
  • As the learning rate increases, the network trains faster, but it may go unstable
  • Oftentimes a momentum term is used to keep training fast while avoiding instability

9
Momentum
  • We want the learning rate to be small for accuracy but large for speed of convergence; too large, however, and training becomes unstable.
  • How can this be accomplished?

10
Momentum
  • What does the momentum term do?
  • When the gradient has a constant sign from
    iteration to iteration, the momentum term gets
    larger
  • When the gradient has opposite signs from
    iteration to iteration, the momentum term gets
    smaller
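A minimal sketch of a gradient-descent update with momentum; eta (learning rate), alpha (momentum constant) and grad_E are my own placeholder names, since the slide's exact update rule is on the image:

    % Weight update with a momentum term (scalar weight, for illustration)
    eta   = 0.1;                          % learning rate
    alpha = 0.9;                          % momentum constant, 0 <= alpha < 1
    w = 0.5; dw_prev = 0;
    for n = 1:100
        g  = grad_E(w);                   % hypothetical gradient dE/dw
        dw = alpha * dw_prev - eta * g;   % momentum reinforces a steady gradient
        w  = w + dw;                      % and damps sign-alternating gradients
        dw_prev = dw;
    end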

11
Heuristic Improvements
  • Stopping Criteria
  • How do we know when we have reached the correct
    answer?
  • A necessary condition for a minimum of the error is that the gradient = 0, i.e. w(n+1) = w(n)
  • Note this condition is not sufficient since we
    may have a local minimum

12
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged
    when the Euclidean norm of the gradient vector is
    sufficiently small
  • This is problematic since one must compute the
    gradient vector, and it will be slow

13
Heuristic Improvements
  • Stopping Criteria Example: gradient vector norm
  • Let's say we have a logistic function for the 5 neurons in the output layer, with a = 1. (I'm using 5 for this example.)
  • For each neuron I compute outi (1 - outi) ki
  • Then if ... (see the sketch below)
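The test being set up here can be sketched as follows; grad is a hypothetical vector collecting all the partial derivatives, and the threshold is an arbitrary tolerance:

    % Stop when the Euclidean norm of the gradient vector is sufficiently small
    threshold = 1e-4;               % arbitrary tolerance
    if norm(grad) < threshold       % grad holds dE/dw for every weight
        disp('BP converged (gradient-norm criterion)');
    end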

14
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged
    when the absolute rate of change of the average
    squared error per epoch is sufficiently small
    (typically 0.1 to 1 percent per epoch)
  • May result in premature stopping

15
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged when its
    generalization performance stops improving
  • Generalization performance is found by testing
    the network on a representative set of data not
    used to train on

16
Heuristic Improvements
  • On-line, stochastic, and sequential mode training are all synonymous for the author
  • On-line: input a value once and train on it (update)
  • Stochastic: randomly select from the pool of training samples and update (train on it)

17
Heuristic Improvements
  • In Matlab we have trainb, trainr, and trains
  • trainb
  • trains a network with weight and bias learning
    rules with batch updates. The weights and biases
    are updated at the end of an entire pass through
    the input data. Inputs are presented in random
    order
  • trainr
  • trains a network with weight and bias learning
    rules with incremental updates after each
    presentation of an input. Inputs are presented in
    random order.

18
Heuristic Improvements
  • In Matlab we have trainb, trainr, and trains
  • trains
  • Trains a network with weight and bias learning
    rules with incremental updates after each
    presentation of an input. Inputs are presented in
    sequential order.

19
Heuristic Improvements
  • Batch Mode
  • With each input, accumulate the update values for each weight; after all inputs (an epoch) have been presented, update the weights (see the sketch below)
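The two update schedules can be sketched like this; update_for, the weight vector w, and the data arrays are hypothetical stand-ins rather than toolbox calls:

    % Incremental (on-line / sequential) mode: update after every example
    for n = 1:N
        w = w + update_for(w, x(:,n), t(:,n));   % hypothetical per-example delta
    end

    % Batch mode: accumulate over the whole epoch, then update once
    dw = 0;
    for n = 1:N
        dw = dw + update_for(w, x(:,n), t(:,n));
    end
    w = w + dw;                                  % one update per epoch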

20
Summary of BP Steps
  • Initialize
  • With no prior information, pick weights from a uniform distribution with mean 0 and range ±1; this causes the weights to fall in the linear region of the sigmoid (see the sketch below)
  • Presentation of training examples
  • Randomly pick values, or use sequential mode (easiest)
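A sketch of that initialization for one layer (uniform on [-1, 1], so mean 0 and range ±1); the layer sizes are made up for illustration:

    % Initialize weights and biases uniformly in [-1, 1] so that net inputs
    % start out in the roughly linear central region of the sigmoid
    fanin = 4;  nh = 10;                 % example layer sizes
    W = 2 * rand(nh, fanin) - 1;         % uniform on [-1, 1], mean 0
    b = 2 * rand(nh, 1) - 1;             % biases treated the same way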

21
Summary of BP Steps
  • Forward Computation
  • Present input vector and compute output
  • Backward Computation
  • Compute the δ's (local gradients) as

22
Adjustment
23
Heuristic Improvements
  • Maximize the information content of samples
  • Examples should contain maximum information, i.e. they either
  • Result in the largest training error, or are
  • Radically different from previous examples
  • Generally this is simulated by presenting training samples randomly
  • Randomize the order for each epoch

24
Heuristic Improvements
  • Activation Function
  • Generally learns faster if it is antisymmetric
  • Not true of logsigmoid, but true of Tanh

25
Heuristic Improvements
  • Activation Function (contd)
  • The next figure is the log sigmoid which does not
    meet the criterion
  • The figure after that is the Tanh which is
    antisymmetric

26
(No Transcript)
27
(No Transcript)
28
Heuristic Improvements
  • Activation Function (contd)
  • For the Tanh function, empirical studies have
    shown the following values for a and b to be
    appropriate
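The transcript omits the values shown on the slide; the constants usually quoted for φ(v) = a·tanh(b·v) (e.g. in Haykin's text) are a = 1.7159 and b = 2/3, which make φ(1) = 1 and φ(-1) = -1. A quick check under that assumption:

    a = 1.7159; b = 2/3;            % commonly cited values (assumed here)
    phi = @(v) a * tanh(b * v);
    [phi(1), phi(-1)]               % returns approximately [1, -1]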

29
Heuristic Improvements
  • Activation Function (contd)
  • Note that

30
Heuristic Improvements
  • Target (output) values
  • Choose within the range of possible output values
  • Really, the targets should be some small value ε less than the maximum neuron output
  • For the tanh, with a = 1.7159, choose ε = 0.7159, and then the targets can be ±1 (see the tanh slide)

31
Heuristic Improvements
  • Input range problems
  • If dealing with a person's height (meters) and weight (lbs), then the weight will overpower the height
  • Also, it is not good if one range of values is -/+ and another is -/- or +/+

32
Heuristic Improvements
  • Preprocess the inputs so that each has
  • An average of 0, or
  • An average that is small compared to its standard deviation
  • Consider the case of all positive values:
  • All weights must change in the same direction, and this gives a zig-zag traversal of the error surface, which can be very slow (a preprocessing sketch follows below)
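A minimal preprocessing sketch that removes each input's mean and, optionally, scales it to unit standard deviation; the example matrix X (one sample per row) is made up:

    % Zero-mean (and unit-variance) preprocessing of the input variables
    X  = [1.80 150; 1.65 120; 1.90 200];      % e.g. height (m) and weight (lbs)
    mu = mean(X);                             % column means
    sd = std(X);                              % column standard deviations
    Xc = X - repmat(mu, size(X,1), 1);        % centered: each column now averages 0
    Xn = Xc ./ repmat(sd, size(X,1), 1);      % scaled: comparable ranges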

33
Heuristic Improvements
  • If possible, input values should be uncorrelated
  • Can be done using principal components analysis

34
Principal Component Analysis
35
Topics covered
  • Standard Deviation
  • Variance
  • Covariance
  • Correlation
  • Eigenvectors
  • Eigenvalues
  • PCA
  • Application of PCA - Eigenfaces

36
Standard Deviation
  • Statistics: analyzing data sets in terms of the relationships between the individual points
  • Standard deviation is a measure of the spread of the data, e.g. for the data sets 0 8 12 20 and 8 9 11 12
  • Calculation: the average distance from the mean of the data set to a point
  •   s = sqrt( Σ i=1..n (Xi - X̄)² / (n - 1) )
  • Denominator of n - 1 for a sample and n for the entire population

37
Standard Deviation
  • For example:
  • 0 8 12 20 has s = 8.33
  •   sqrt( ((0-10)² + (8-10)² + (12-10)² + (20-10)²) / 3 ) = 8.33
  • 8 9 11 12 has s = 1.83
  • 10 10 10 10 has s = 0
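These values can be checked directly in Matlab (std uses the n - 1 denominator by default):

    >> std([0 8 12 20])       % returns 8.3267
    >> std([8 9 11 12])       % returns 1.8257
    >> std([10 10 10 10])     % returns 0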

38
Variance
  • Another measure of the spread of the data in a
    data set
  • Calculation
  •   s² = Σ i=1..n (Xi - X̄)² / (n - 1)
  • Why have both variance and SD to calculate the
    spread of data?
  • Variance is claimed to be the original statistical measure of the spread of data. However, its unit would be expressed as a square, e.g. cm², which is unrealistic for expressing heights or other measures. Hence SD, as the square root of variance, was born.

39
Covariance
  • Variance measure of the deviation from the mean
    for points in one dimension e.g. heights
  • Covariance is a measure of how much each of the
    dimensions vary from the mean with respect to
    each other.
  • Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g. number of hours studied vs. marks obtained.
  • The covariance between one dimension and itself
    is the variance

40
Covariance
  • variance(X) = Σ i=1..n (Xi - X̄)(Xi - X̄) / (n - 1)
  • covariance(X,Y) = Σ i=1..n (Xi - X̄)(Yi - Ȳ) / (n - 1)
  • So, if you had a 3-dimensional data set (x,y,z),
    then you could measure the covariance between the
    x and y dimensions, the y and z dimensions, and
    the x and z dimensions. Measuring the covariance
    between x and x , or y and y , or z and z would
    give you the variance of the x , y and z
    dimensions respectively.

41
Variance Covariance - Matlab
  • >> x = [0 8 12 20; 8 9 11 12; 10 10 10 10]
  • >> var(x)
  • ans =
  •     28    1    1   28      (note: 28 is the variance of the first column, [0 8 10])
  • >> C = cov(x)
  • C =
  •     28    5   -5  -28
  •      5    1   -1   -5
  •     -5   -1    1    5
  •    -28   -5    5   28

42
Variance Covariance - Matlab
  • >> C(2,3)
  • ans =
  •     -1

43
Covariance
  • What is the interpretation of covariance
    calculations?
  • e.g. a 2-dimensional data set:
  • x = number of hours studied for a subject
  • y = marks obtained in that subject
  • Suppose the covariance value is 104.53
  • What does this value mean?

44
Covariance
  • Exact value is not as important as its sign.
  • A positive value of covariance indicates both
    dimensions increase or decrease together e.g. as
    the number of hours studied increases, the marks
    in that subject increase.
  • A negative value indicates while one increases
    the other decreases, or vice-versa e.g. active
    social life at RIT vs performance in CS dept.
  • If covariance is zero the two dimensions are
    independent of each other e.g. heights of
    students vs the marks obtained in a subject

45
Covariance
  • Why bother with calculating covariance when we
    could just plot the 2 values to see their
    relationship?
  • Covariance calculations are used to find
    relationships between dimensions in high
    dimensional data sets (usually greater than 3)
    where visualization is difficult.

46
Covariance Matrix
  • Representing Covariance between dimensions as a
    matrix e.g. for 3 dimensions
  •         cov(x,x)  cov(x,y)  cov(x,z)
  •   C =   cov(y,x)  cov(y,y)  cov(y,z)
  •         cov(z,x)  cov(z,y)  cov(z,z)
  • The diagonal holds the variances of x, y and z
  • cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal
  • n-dimensional data will result in an n x n covariance matrix

47
Correlation
  • A positive correlation means that as X increases, so does Y, and vice versa
  • Of the following plots, which has the highest correlation?

48
(No Transcript)
49
Transformation matrices
  • Consider
  •   [2 3; 2 1] x [3; 2] = [12; 8] = 4 x [3; 2]
  • The square transformation matrix transforms (3,2) from its original location. Now if we were to take a multiple of (3,2):
  •   2 x [3; 2] = [6; 4]
  •   [2 3; 2 1] x [6; 4] = [24; 16] = 4 x [6; 4]
50
Transformation matrices
  • Scale vector (3,2) by a value 2 to get (6,4)
  • Multiply it by the square transformation matrix
  • We see the result is still 4 times the (scaled) vector.
  • WHY?
  • A vector consists of both length and direction. Scaling a vector only changes its length, not its direction. This is an important observation in the transformation of matrices, leading to the formation of eigenvectors and eigenvalues.
  • Irrespective of how much we scale (3,2) by, the transformed result is always 4 times the scaled vector.

51
eigenvalue problem
  • The eigenvalue problem is any problem having the following form:
  •   A · v = λ · v
  •   A: n x n matrix
  •   v: n x 1 non-zero vector
  •   λ: scalar
  • Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

52
eigenvalue problem
  •   [2 3; 2 1] x [3; 2] = [12; 8] = 4 x [3; 2]
  •   A · v = λ · v
  • Therefore, (3,2) is an eigenvector of the square matrix A, and 4 is an eigenvalue of A
  • Given a matrix A, how can we calculate the eigenvectors and eigenvalues of A?
53
Calculating eigenvectors eigenvalues
  • Given A · v = λ · v
  •   A · v - λ · I · v = 0
  •   (A - λ · I) · v = 0
  • Finding the roots of |A - λ · I| = 0 will give the eigenvalues, and for each of these eigenvalues there will be an eigenvector
  • Example:

54
Calculating eigenvectors eigenvalues
  • If A = [0 1; -2 -3]
  • Then A - λ·I = [0 1; -2 -3] - λ·[1 0; 0 1] = [-λ 1; -2 -3-λ]
  • |A - λ·I| = -λ(-3-λ) + 2 = λ² + 3λ + 2 = 0
  • This gives us 2 eigenvalues:
  •   λ1 = -1 and λ2 = -2

55
Calculating eigenvectors eigenvalues
  • For λ1 the eigenvector is:
  •   (A - λ1·I) · v1 = 0
  •   [1 1; -2 -2] · [v11; v12] = [0; 0]
  •   v11 + v12 = 0  and  -2·v11 - 2·v12 = 0
  •   so v11 = -v12
  • Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign

56
Calculating eigenvectors eigenvalues
  • Therefore eigenvector v1 is
  •   v1 = k1 · [1; -1]
  • where k1 is some constant. Similarly we find eigenvector v2:
  •   v2 = k2 · [1; -2]
  • And the eigenvalues are λ1 = -1 and λ2 = -2

57
Properties of eigenvectors and eigenvalues
  • Note that irrespective of how much we scale (3,2) by, the transformed result is always 4 times the scaled vector.
  • Eigenvectors can only be found for square
    matrices and not every square matrix has
    eigenvectors.
  • Given an n x n matrix, we can find n eigenvectors

58
Properties of eigenvectors and eigenvalues
  • All eigenvectors of a symmetric matrix (such as a covariance matrix) are perpendicular to each other, no matter how many dimensions we have
  • In practice eigenvectors are normalized to have unit length. Since the length of an eigenvector does not affect our calculations, we prefer to keep them standard by scaling them to have a length of 1, e.g.
  • For eigenvector (3,2):
  •   length = sqrt(3² + 2²) = sqrt(13)
  •   normalized: [3; 2] / sqrt(13) = [3/sqrt(13); 2/sqrt(13)]

59
Matlab
  • >> A = [0 1; 2 3]
  • A =
  •      0     1
  •      2     3
  • >> [v,d] = eig(A)
  • v =
  •   -0.8719   -0.2703
  •    0.4896   -0.9628
  • d =
  •   -0.5616         0
  •         0    3.5616

>> help eig
[V,D] = EIG(X) produces a diagonal matrix D of eigenvalues and a full matrix V whose columns are the corresponding eigenvectors so that X*V = V*D.
60
PCA
  • principal components analysis (PCA) is a
    technique that can be used to simplify a dataset
  • It is a linear transformation that chooses a new
    coordinate system for the data set such that
  • greatest variance by any projection of the data
    set comes to lie on the first axis (then called
    the first principal component), the second
    greatest variance on the second axis, and so on.
  • PCA can be used for reducing dimensionality by
    eliminating the later principal components.

61
PCA
  • By finding the eigenvalues and eigenvectors of
    the covariance matrix, we find that the
    eigenvectors with the largest eigenvalues
    correspond to the dimensions that have the
    strongest correlation in the dataset.
  • This is the principal component.
  • PCA is a useful statistical technique that has
    found application in
  • fields such as face recognition and image
    compression
  • finding patterns in data of high dimension.
  • Reducing dimensionality of data

62
PCA process STEP 1
  • Subtract the mean from each of the data dimensions: all the x values have x̄ (the mean of the x values) subtracted, and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.
  • Subtracting the mean makes variance and
    covariance calculation easier by simplifying
    their equations. The variance and co-variance
    values are not affected by the mean value.

63
PCA process STEP 1
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • DATA                 ZERO MEAN DATA
  •   x     y              x       y
  •   2.5   2.4            .69     .49
  •   0.5   0.7          -1.31   -1.21
  •   2.2   2.9            .39     .99
  •   1.9   2.2            .09     .29
  •   3.1   3.0           1.29    1.09
  •   2.3   2.7            .49     .79
  •   2     1.6            .19    -.31
  •   1     1.1           -.81    -.81
  •   1.5   1.6           -.31    -.31
  •   1.1   0.9           -.71   -1.01

64
PCA process STEP 1
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
65
PCA process STEP 2
  • Calculate the covariance matrix
  • cov = [ .616555556   .615444444
  •         .615444444   .716555556 ]
  • since the non-diagonal elements in this
    covariance matrix are positive, we should expect
    that both the x and y variable increase together.

66
PCA process STEP 3
  • Calculate the eigenvectors and eigenvalues of the
    covariance matrix
  • eigenvalues = [ .0490833989
  •                 1.28402771 ]
  • eigenvectors = [ .735178656  -.677873399
  •                 -.677873399  -.735178656 ]
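Steps 1 to 3 can be reproduced in Matlab on the tutorial's data set; note that eig may return an eigenvector with all its signs flipped, which describes the same direction:

    % PCA steps 1-3 on the example data from slide 63
    data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0;
            2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    D = data - repmat(mean(data), size(data,1), 1);  % step 1: subtract the mean
    C = cov(D);                                      % step 2: covariance matrix
    [V, E] = eig(C);                                 % step 3: eigenvectors/eigenvalues
    % diag(E) gives 0.0490834 and 1.2840277; the column of V paired with the
    % larger eigenvalue is the first principal component.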

67
PCA process STEP 3
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • eigenvectors are plotted as diagonal dotted lines
    on the plot.
  • Note they are perpendicular to each other.
  • Note one of the eigenvectors goes through the
    middle of the points, like drawing a line of best
    fit.
  • The second eigenvector gives us the other, less
    important, pattern in the data, that all the
    points follow the main line, but are off to the
    side of the main line by some amount.

68
PCA process STEP 4
  • Reduce dimensionality and form feature vector
  • The eigenvector with the highest eigenvalue is the principal component of the data set.
  • In our example, the eigenvector with the largest
    eigenvalue was the one that pointed down the
    middle of the data.
  • Once eigenvectors are found from the covariance
    matrix, the next step is to order them by
    eigenvalue, highest to lowest. This gives you the
    components in order of significance.

69
PCA process STEP 4
  • Now, if you like, you can decide to ignore the
    components of lesser significance.
  • You do lose some information, but if the eigenvalues are small, you don't lose much
  • n dimensions in your data
  • calculate n eigenvectors and eigenvalues
  • choose only the first p eigenvectors
  • final data set has only p dimensions.

70
PCA process STEP 4
  • Feature Vector
  • FeatureVector = (eig1 eig2 eig3 ... eign)
  • We can either form a feature vector with both of the eigenvectors:
  •   [ -.677873399    .735178656
  •     -.735178656   -.677873399 ]
  • or, we can choose to leave out the smaller, less significant component and only have a single column:
  •   [ -.677873399
  •     -.735178656 ]

71
PCA process STEP 5
  • Deriving the new data:
  • FinalData = RowFeatureVector x RowZeroMeanData
  • RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
  • RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension.

72
PCA process STEP 5
  • FinalData is the final data set, with data items
    in columns, and dimensions along rows.
  • What will this give us? It will give us the
    original data solely in terms of the vectors we
    chose.
  • We have changed our data from being in terms of
    the axes x and y , and now they are in terms of
    our 2 eigenvectors.
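Continuing the Matlab sketch from step 3 (V, E and D as computed there), steps 4 and 5 order the eigenvectors and project the data; both components are kept here:

    % Steps 4-5: order eigenvectors by eigenvalue and project the data
    [evals, order]   = sort(diag(E), 'descend');   % most significant first
    RowFeatureVector = V(:, order)';               % eigenvectors as rows
    RowZeroMeanData  = D';                         % one data item per column
    FinalData = RowFeatureVector * RowZeroMeanData;
    % FinalData(1,:) reproduces the first-component values on slide 73
    % (possibly with every sign flipped, depending on eig's sign convention).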

73
PCA process STEP 5
  • FinalData transpose dimensions along columns
  • x y
  • -.827970186 .175115307
  • 1.77758033 -.142857227
  • -.992197494 -.384374989
  • -.274210416 -.130417207
  • -1.67580142 .209498461
  • -.912949103 -.175282444
  • .0991094375 .349824698
  • 1.14457216 -.0464172582
  • .438046137 -.0177646297
  • 1.22382056 .162675287

74
PCA process STEP 5
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
75
Reconstruction of original Data
  • If we reduced the dimensionality, obviously, when
    reconstructing the data we would lose those
    dimensions we chose to discard. In our example
    let us assume that we considered only the x
    dimension

76
Reconstruction of original Data
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • x
  • -.827970186
  • 1.77758033
  • -.992197494
  • -.274210416
  • -1.67580142
  • -.912949103
  • .0991094375
  • 1.14457216
  • .438046137
  • 1.22382056
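A sketch of the reconstruction when only this first component is kept (variable names continue from the earlier sketches); because the feature vector is orthonormal, its transpose undoes the projection, and adding the mean back gives the lossy approximation of the original data:

    % Approximate reconstruction from the first principal component only
    reduced = FinalData(1, :);                         % the kept component
    backProjected = RowFeatureVector(1, :)' * reduced; % back to the 2-D space
    approx = backProjected' + repmat(mean(data), size(data,1), 1);
    % approx lies exactly on the principal-component line through the data.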

77
Matlab PCA
  • Matlab has a function called princomp(x)
  • [COEFF,SCORE] = princomp(X) performs principal components analysis on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance. princomp centers X by subtracting off column means.

78
Matlab PCA
  • [COEFF,SCORE] = princomp(X) returns SCORE, the principal component scores; that is, the representation of X in the principal component space. Rows of SCORE correspond to observations, columns to components.

79
Matlab PCA
  • >> D       (the zero-mean data from slide 63)
  • D =
  • 0.6900 0.4900
  • -1.3100 -1.2100
  • 0.3900 0.9900
  • 0.0900 0.2900
  • 1.2900 1.0900
  • 0.4900 0.7900
  • 0.1900 -0.3100
  • -0.8100 -0.8100
  • -0.3100 -0.3100
  • -0.7100 -1.0100

80
Matlab PCA
  • >> [A,B] = princomp(D)
  • A =
  •   -0.6779    0.7352      (see slide 66; note that these columns are ordered by highest eigenvalue, unlike slide 66)
  •   -0.7352   -0.6779
  • B =
  •   -0.8280    0.1751      (see slide 73)
  • 1.7776 -0.1429
  • -0.9922 -0.3844
  • -0.2742 -0.1304
  • -1.6758 0.2095
  • -0.9129 -0.1753
  • 0.0991 0.3498
  • 1.1446 -0.0464
  • 0.4380 -0.0178
  • 1.2238 0.1627

81
MATLAB DEMO
82
PCA applications -Eigenfaces
  • Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces
  • Eigenfaces are the standardized "face ingredients" derived from the statistical analysis of many pictures of human faces
  • A human face may be considered to be a
    combination of these standard faces

83
PCA applications -Eigenfaces
  • To generate a set of eigenfaces
  • Large set of digitized images of human faces is
    taken under the same lighting conditions.
  • The images are normalized to line up the eyes and
    mouths.
  • The eigenvectors of the covariance matrix of the
    statistical distribution of face image vectors
    are then extracted.
  • These eigenvectors are called eigenfaces.

84
PCA applications -Eigenfaces
  • the principal eigenface looks like a bland
    androgynous average human face

http://en.wikipedia.org/wiki/Image:Eigenfaces.png
85
Eigenfaces Face Recognition
  • When properly weighted, eigenfaces can be summed
    together to create an approximate gray-scale
    rendering of a human face.
  • Remarkably few eigenvector terms are needed to
    give a fair likeness of most people's faces
  • Hence eigenfaces provide a means of applying data
    compression to faces for identification purposes.

86
Expert Object Recognition in Video (Matt McEuen)
87
EOR
  • Principal Component Analysis (PCA)
  • Based on covariance
  • Visual memory reconstruction
  • Images of cats and dogs are aligned so that the
    eyes are in the same position in every image

88
EOR
89
Back to Heuristic Improvements
90
Back to Heuristic Improvements
91
Heuristic Improvements
  • Initialization
  • Large initial weights will saturate neurons
  • All 0s for weights is also potentially bad
  • From the text, it can be shown (approximately,
    under certain conditions) that a good choice for
    weights is to select them randomly from a uniform
    distribution with

92
Number of Hidden Layers
  • Three layers suffice to implement any function
    with properly chosen transfer functions
  • Additional layers can help
  • It is easier for a four-layer net to learn
    translations than for a three-layer net.
  • Each layer can learn an invariance - maybe

93
Feature Detection
  • Hidden neurons play the role of feature detectors
  • Tend to transform the input vector space into a
    hidden or feature space
  • Each hidden neuron's output is then a measure of how well that feature is present in the current input

94
Generalization
  • Generalization is the term used to describe how
    well a NN correctly classifies a set of data that
    was not used as the training set.
  • One generally has 3 sets of data
  • Training
  • Validate (on-going generalization test)
  • Testing (this is the true error rate)

95
Generalization
  • A network generalizes well when, for data it was not trained on, it produces correct (or nearly correct) outputs.
  • It can overfit or overtrain
  • Generally we want to select the smoothest/simplest mapping in the absence of prior knowledge (Matlab demo: nnd11gn)

96
Generalization
  • Influenced by four factors
  • Size of training set
  • How representative training set is of data
  • Neural network architecture
  • Physical complexity of problem at hand
  • Often the NN configuration or the training set is fixed, so we have only the other factors to work with

97
Generalization Over training
  • Over training
  • Overtraining occurs when the network has been
    trained to only minimize the error
  • The next slide shows a network in which a trigonometric function is being approximated
  • 1-3-1 denotes one input, 3 hidden layer neurons, and one output layer neuron
  • The fit is perfect at 4, but at 8 the error is lower yet the fit is poorer

98
(No Transcript)
99
Generalization Complexity of Network
  • In the next slide, as the number of hidden layer
    neurons goes from 1 to 5, the network does better.

100
(No Transcript)
101
Approximations of Functions
  • In general, for good generalization, the number of training samples N should be larger than the ratio of the total number of free parameters (weights) in the network to the mean-square value of the estimation error
  • Normally we want the simplest NN we can get

102
Generalization
  • A commonly used value is
  •   N = O(W/ε), where
  •   O(·) is like Big-O notation
  •   W = total number of weights
  •   ε = fraction of classification errors permitted on test data
  •   N = number of training samples
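For example, under this rule of thumb a network with W = 100 weights and a permitted test error of ε = 0.1 (10%) calls for on the order of N = 100 / 0.1 = 1000 training samples; tightening the error to ε = 0.05 roughly doubles that to about 2000 samples.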

103
Approximations of Functions
  • NN acts as a non-linear mapping from input to
    output space
  • Everywhere differentiable if all transfer
    functions are differentiable
  • What is the minimum number of hidden layers in a multilayer perceptron with an I/O mapping that provides an approximate mapping of any continuous mapping?

104
Approximations of Functions
  • Part of universal approximation theorem
  • This theorem states (in essence) that a NN with
    bounded, nonconstant, monotone increasing
    continuous transfer functions and one hidden
    layer can approximate any function
  • Says nothing about optimum in terms of learning
    time, ease of implementation, or generalization

105
Practical Considerations
  • For high dimensional spaces, it is often better
    to have 2-layer networks so that neurons in
    layers do not interact so much.

106
Cross Validation
  • Randomly divide data into training and testing
    sets
  • Further randomly divide training set into
    estimation and validation subsets.
  • Use validation set to test accuracy of model, and
    then test set for actual accuracy value.

107
Cross Validation leave-one-out
  • Train on everything but one and then test on it
  • Repeat for all partitions
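A minimal sketch of leave-one-out cross-validation; train_net and test_err are hypothetical stand-ins for whatever training and evaluation routines are being used:

    % Leave-one-out cross-validation: N partitions, each of size 1
    N   = size(data, 1);
    err = zeros(N, 1);
    for i = 1:N
        test  = data(i, :);                  % the single held-out example
        train = data([1:i-1, i+1:N], :);     % everything else
        net    = train_net(train);           % hypothetical training routine
        err(i) = test_err(net, test);        % hypothetical error measure
    end
    loo_estimate = mean(err);                % average over all N partitions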