Title: Activation Functions
1 Activation Functions
2 Log Sigmoid
3 Local Gradient
4 Local Gradient
- Thus the weight change is
- Where is the local gradient at its maximum, and where is it at its minimum?
5 Hyperbolic Tangent
6 Hyperbolic Tangent
7 Local Gradient
- Thus the weight change is
- Where is the local gradient at its maximum, and where is it at its minimum?
8 Momentum
- As the learning rate increases the network trains faster, but may go unstable
- Often a momentum term is used to make training fast while avoiding the instability of training too fast
9 Momentum
- We want the learning rate small for accuracy but large for speed of convergence; too large, however, and training becomes unstable
- How can this be accomplished?
10 Momentum
- What does the momentum term do? (see the update formula below)
- When the gradient has a constant sign from iteration to iteration, the momentum term gets larger
- When the gradient has opposite signs from iteration to iteration, the momentum term gets smaller
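The slide's own equation image is not reproduced in this transcript; a standard textbook form of the weight update with momentum (the generalized delta rule) is:

    Δw_ji(n) = α · Δw_ji(n-1) + η · δ_j(n) · y_i(n)

where η is the learning rate, α (0 ≤ α < 1) is the momentum constant, δ_j is the local gradient of neuron j, and y_i is the input from neuron i. When the gradient term η·δ_j·y_i keeps the same sign across iterations the accumulated change grows; when its sign alternates, successive contributions cancel and the change shrinks, which is the behaviour described above.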
11 Heuristic Improvements
- Stopping Criteria
- How do we know when we have reached the correct answer?
- A necessary condition for minimum error is that the gradient = 0, i.e. w(n+1) = w(n)
- Note this condition is not sufficient, since we may be at a local minimum
12 Heuristic Improvements
- Stopping Criteria
- The BP algorithm is considered to have converged when the Euclidean norm of the gradient vector is sufficiently small
- This is problematic since one must compute the gradient vector, and it will be slow
13 Heuristic Improvements
- Stopping Criteria: example using the gradient vector norm
- Let's say we have a logistic function for the 5 neurons in the output layer, with a = 1 (I'm using 5 for this example)
- For each neuron i, I compute out_i (1 - out_i) times that neuron's error term
- Then check whether the norm of these values is sufficiently small (see the sketch below)
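A minimal MATLAB sketch of this check, assuming the per-neuron quantity is the logistic derivative out_i(1-out_i) times the output error (the slide's exact expression is partly garbled in this transcript); the outputs, errors, and threshold are made up:

out = [0.91 0.12 0.78 0.05 0.63];    % example outputs of the 5 logistic neurons (made up)
err = [0.02 -0.01 0.03 0.00 -0.02];  % example output errors (made up)
localGrad = out .* (1 - out) .* err; % out_i*(1-out_i) is the logistic derivative for a = 1
if norm(localGrad) < 1e-3            % converged when the Euclidean norm is small enough
    disp('gradient-norm stopping criterion met')
end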
14 Heuristic Improvements
- Stopping Criteria
- The BP algorithm is considered to have converged when the absolute rate of change of the average squared error per epoch is sufficiently small (typically 0.1 to 1 percent of the error)
- May result in premature stopping
15 Heuristic Improvements
- Stopping Criteria
- The BP algorithm is considered to have converged when its generalization performance stops improving
- Generalization performance is found by testing the network on a representative set of data not used to train on
16 Heuristic Improvements
- On-line, stochastic, and sequential mode training are all synonymous for the author
- On-line: input a value once and train on it (update)
- Stochastic: randomly select from the pool of training samples and update (train on it)
17 Heuristic Improvements
- In Matlab we have trainb, trainr, and trains
- trainb
- trains a network with weight and bias learning rules with batch updates. The weights and biases are updated at the end of an entire pass through the input data. Inputs are presented in random order.
- trainr
- trains a network with weight and bias learning rules with incremental updates after each presentation of an input. Inputs are presented in random order.
18 Heuristic Improvements
- In Matlab we have trainb, trainr, and trains
- trains
- trains a network with weight and bias learning rules with incremental updates after each presentation of an input. Inputs are presented in sequential order.
19 Heuristic Improvements
- Batch Mode
- With each input, accumulate update values for each weight; after all inputs (an epoch) are received, update the weights
20 Summary of BP Steps
- Initialize
- With no prior information, pick weights from a uniform distribution with mean 0 and range +/- 1; this causes weights to fall in the linear region of the sigmoid (see the sketch below)
- Presentation of training examples
- Randomly pick values, sequential mode (easiest)
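A minimal MATLAB sketch of this initialization; the layer sizes are hypothetical, not from the slides:

nInputs = 4;                          % example sizes (assumed)
nHidden = 6;
W = 2 * rand(nHidden, nInputs) - 1;   % uniform on [-1, 1]: mean 0, range +/- 1
% Small, zero-mean weights keep the induced local fields near 0, i.e. in the
% roughly linear region of the sigmoid, as the slide recommends.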
21 Summary of BP Steps
- Forward Computation
- Present the input vector and compute the output
- Backward Computation
- Compute the δs (local gradients) as
22 Adjustment
23 Heuristic Improvements
- Maximize information content of samples
- Examples should contain maximum information
- Results in the largest training error
- Radically different from previous examples
- Generally this is simulated by presenting training samples randomly
- Randomize for each epoch
24 Heuristic Improvements
- Activation Function
- Generally learns faster if it is antisymmetric
- Not true of the log sigmoid, but true of tanh
25 Heuristic Improvements
- Activation Function (cont'd)
- The next figure is the log sigmoid, which does not meet the criterion
- The figure after that is the tanh, which is antisymmetric
26 (figure: the log sigmoid function)
27 (figure: the tanh function)
28 Heuristic Improvements
- Activation Function (cont'd)
- For the tanh function, empirical studies have shown the following values for a and b to be appropriate (see the sketch below)
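The slide's values are not reproduced in this transcript; a = 1.7159 is confirmed on slide 30, and b = 2/3 is the companion value commonly cited with it (an assumption here). A one-line MATLAB sketch of the resulting activation:

phi = @(v) 1.7159 * tanh((2/3) * v);   % a = 1.7159 (slide 30), b = 2/3 (assumed)
phi(1)    % 1.0000
phi(-1)   % -1.0000

With these constants phi(+/-1) = +/-1, which fits the target-value choice on slide 30 and is likely what the "Note that" on slide 29 refers to.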
29 Heuristic Improvements
- Activation Function (cont'd)
- Note that
30 Heuristic Improvements
- Target (output) values
- Choose within the range of possible output values
- Really they should be some small value ε less than the maximum neuron output
- For the tanh, with a = 1.7159, choose ε = 0.7159, and then the targets can be +/- 1 (see the tanh slide)
31 Heuristic Improvements
- Input range problems
- If dealing with a person's height (meters) and weight (lbs), then the weight will overpower the height
- Also, it is not good if one input ranges over both positive and negative values while another is only negative or only positive
32 Heuristic Improvements
- Preprocess the inputs so that each has
- An average of 0, or
- Its average is small compared to its standard deviation
- Consider the case of all positive values
- All weights must change in the same direction, and this gives a zig-zag traversal of the error surface, which can be very slow (a normalization sketch follows below)
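A minimal MATLAB sketch of such preprocessing, standardizing each input column to zero mean (and, additionally, unit standard deviation so that height in meters and weight in lbs end up on comparable scales); X is a hypothetical samples-by-inputs matrix with made-up values:

X = [1.75 160; 1.62 130; 1.90 210; 1.68 145];   % made-up heights (m) and weights (lbs)
Xn = (X - repmat(mean(X), size(X,1), 1)) ...
     ./ repmat(std(X), size(X,1), 1);           % each column: mean 0, std 1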
33 Heuristic Improvements
- If possible, input values should be uncorrelated
- Can be done using principal components analysis
34 Principal Component Analysis
35 Topics covered
- Standard Deviation
- Variance
- Covariance
- Correlation
- Eigenvectors
- Eigenvalues
- PCA
- Application of PCA - Eigenfaces
36 Standard Deviation
- Statistics: analyzing data sets in terms of the relationships between the individual points
- Standard deviation is a measure of the spread of the data
- Example data sets: 0 8 12 20 and 8 9 11 12
- Calculation: the average distance from the mean of the data set to a point
- s = sqrt( Σ_{i=1..n} (X_i - mean(X))² / (n - 1) )
- Denominator of n-1 for a sample and n for the entire population
37 Standard Deviation
- For example (a quick MATLAB check follows below)
- 0 8 12 20 has s = 8.32
- sqrt(((0-10)² + (8-10)² + (12-10)² + (20-10)²)/3) = 8.32
- 8 9 11 12 has s = 1.82
- 10 10 10 10 has s = 0
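These values can be checked directly in MATLAB, whose std uses the same n-1 (sample) denominator:

std([0 8 12 20])     % 8.3267
std([8 9 11 12])     % 1.8257
std([10 10 10 10])   % 0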
38 Variance
- Another measure of the spread of the data in a data set
- Calculation:
- s² = Σ_{i=1..n} (X_i - mean(X))² / (n - 1)
- Why have both variance and SD to calculate the spread of data?
- Variance is claimed to be the original statistical measure of the spread of data. However, its unit is a square, e.g. cm², which is unrealistic for expressing heights or other measures. Hence SD, as the square root of variance, was born.
39 Covariance
- Variance: a measure of the deviation from the mean for points in one dimension, e.g. heights
- Covariance is a measure of how much each of the dimensions varies from the mean with respect to the others
- Covariance is measured between 2 dimensions to see if there is a relationship between them, e.g. number of hours studied vs. marks obtained
- The covariance between one dimension and itself is the variance
40 Covariance
- variance(X) = Σ_{i=1..n} (X_i - mean(X)) (X_i - mean(X)) / (n - 1)
- covariance(X,Y) = Σ_{i=1..n} (X_i - mean(X)) (Y_i - mean(Y)) / (n - 1)
- So, if you had a 3-dimensional data set (x,y,z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.
41 Variance & Covariance - Matlab
>> x = [0 8 12 20; 8 9 11 12; 10 10 10 10]
>> var(x)
ans =
    28     1     1    28
(note: 28 is var([0 8 10]))
>> C = cov(x)
C =
    28     5    -5   -28
     5     1    -1    -5
    -5    -1     1     5
   -28    -5     5    28
42 Variance & Covariance - Matlab
43 Covariance
- What is the interpretation of covariance calculations?
- e.g. a 2-dimensional data set
- x = number of hours studied for a subject
- y = marks obtained in that subject
- covariance value is, say, 104.53
- what does this value mean?
44 Covariance
- The exact value is not as important as its sign.
- A positive value of covariance indicates that both dimensions increase or decrease together, e.g. as the number of hours studied increases, the marks in that subject increase.
- A negative value indicates that while one increases the other decreases, or vice-versa, e.g. active social life at RIT vs performance in the CS dept.
- If covariance is zero the two dimensions are independent of each other, e.g. heights of students vs the marks obtained in a subject
45 Covariance
- Why bother with calculating covariance when we could just plot the 2 values to see their relationship?
- Covariance calculations are used to find relationships between dimensions in high dimensional data sets (usually greater than 3) where visualization is difficult.
46 Covariance Matrix
- Representing covariance between dimensions as a matrix, e.g. for 3 dimensions:
      C = [ cov(x,x) cov(x,y) cov(x,z)
            cov(y,x) cov(y,y) cov(y,z)
            cov(z,x) cov(z,y) cov(z,z) ]
- The diagonal holds the variances of x, y and z
- cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal
- N-dimensional data results in an n x n covariance matrix
47 Correlation
- A positive correlation means that as x increases, so does y, and vice versa
- Of the following plots, which has the highest correlation?
48 (figure: scatter plots for the correlation question)
49 Transformation matrices
- Consider
- [2 3; 2 1] x [3; 2] = [12; 8] = 4 x [3; 2]
- The square transformation matrix transforms (3,2) from its original location. Now if we were to take a multiple of (3,2):
- 2 x [3; 2] = [6; 4]
- [2 3; 2 1] x [6; 4] = [24; 16] = 4 x [6; 4]
50 Transformation matrices
- Scale the vector (3,2) by a value 2 to get (6,4)
- Multiply by the square transformation matrix
- We see the result is still a multiple of 4. WHY?
- A vector consists of both length and direction. Scaling a vector only changes its length and not its direction. This is an important observation in the transformation of matrices, leading to the formation of eigenvectors and eigenvalues.
- Irrespective of how much we scale (3,2) by, the solution is always a multiple of 4 (i.e. 4 times the scaled vector).
51 eigenvalue problem
- The eigenvalue problem is any problem having the following form:
- A · v = λ · v
- A: n x n matrix
- v: n x 1 non-zero vector
- λ: scalar
- Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.
52 eigenvalue problem
- [2 3; 2 1] x [3; 2] = [12; 8] = 4 x [3; 2]
- A · v = λ · v
- Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A
- Given matrix A, how can we calculate the eigenvectors and eigenvalues for A? (a quick MATLAB check of this example follows below)
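A quick MATLAB check of the example above, using the same eig built-in that appears later on slide 59:

A = [2 3; 2 1];
v = [3; 2];
A * v            % [12; 8], which is 4*v, so v is an eigenvector with eigenvalue 4
[V, D] = eig(A)  % the eigenvalues of A are 4 and -1; one column of V is parallel to [3; 2]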
53 Calculating eigenvectors & eigenvalues
- Given A · v = λ · v
- A · v - λ · I · v = 0
- (A - λ · I) · v = 0
- Finding the roots of det(A - λ · I) = 0 gives the eigenvalues, and for each of these eigenvalues there is an eigenvector
- Example:
54 Calculating eigenvectors & eigenvalues
- If A = [0 1; -2 -3]
- Then A - λ · I = [0 1; -2 -3] - λ [1 0; 0 1]
- det([-λ 1; -2 -3-λ]) = λ² + 3λ + 2 = 0
- This gives us 2 eigenvalues:
- λ1 = -1 and λ2 = -2
55 Calculating eigenvectors & eigenvalues
- For λ1 the eigenvector is
- (A - λ1 · I) · v1 = 0
- [1 1; -2 -2] [v11; v12] = [0; 0]
- v11 + v12 = 0 and -2·v11 - 2·v12 = 0
- v11 = -v12
- Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign
56 Calculating eigenvectors & eigenvalues
- Therefore eigenvector v1 is
- v1 = k1 [1; -1]
- where k1 is some constant. Similarly we find eigenvector v2:
- v2 = k2 [1; -2]
- And the eigenvalues are λ1 = -1 and λ2 = -2
57 Properties of eigenvectors and eigenvalues
- Note that irrespective of how much we scale (3,2) by, the solution is always a multiple of 4.
- Eigenvectors can only be found for square matrices, and not every square matrix has eigenvectors.
- Given an n x n matrix, we can find n eigenvectors
58 Properties of eigenvectors and eigenvalues
- All eigenvectors of a (symmetric) matrix are perpendicular to each other, no matter how many dimensions we have
- In practice eigenvectors are normalized to have unit length. Since the length of an eigenvector does not affect our calculations, we prefer to keep them standard by scaling them to have a length of 1, e.g.
- For eigenvector (3,2):
- length = (3² + 2²)^(1/2) = (13)^(1/2)
- normalized: [3; 2] / (13)^(1/2) = [3/(13)^(1/2); 2/(13)^(1/2)]
(see the MATLAB one-liner below)
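In MATLAB the same normalization is one line (norm gives the Euclidean length):

v = [3; 2];
v_unit = v / norm(v)   % [0.8321; 0.5547], i.e. [3; 2] / sqrt(13)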
59 Matlab
>> A = [0 1; 2 3]
A =
     0     1
     2     3
>> [v,d] = eig(A)
v =
   -0.8719   -0.2703
    0.4896   -0.9628
d =
   -0.5616         0
         0    3.5616
>> help eig
[V,D] = EIG(X) produces a diagonal matrix D of eigenvalues and a full matrix V whose columns are the corresponding eigenvectors, so that X*V = V*D.
60 PCA
- Principal components analysis (PCA) is a technique that can be used to simplify a dataset
- It is a linear transformation that chooses a new coordinate system for the data set such that
- the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
- PCA can be used for reducing dimensionality by eliminating the later principal components.
61 PCA
- By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset.
- This is the principal component.
- PCA is a useful statistical technique that has found application in
- fields such as face recognition and image compression
- finding patterns in data of high dimension
- reducing dimensionality of data
62 PCA process STEP 1
- Subtract the mean from each of the data dimensions: all the x values have mean(x) subtracted and all the y values have mean(y) subtracted from them. This produces a data set whose mean is zero.
- Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations. The variance and covariance values are not affected by the mean value.
63 PCA process STEP 1
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
- DATA (x, y):
  2.5 2.4
  0.5 0.7
  2.2 2.9
  1.9 2.2
  3.1 3.0
  2.3 2.7
  2.0 1.6
  1.0 1.1
  1.5 1.6
  1.1 0.9
- ZERO-MEAN DATA (x, y):
   0.69  0.49
  -1.31 -1.21
   0.39  0.99
   0.09  0.29
   1.29  1.09
   0.49  0.79
   0.19 -0.31
  -0.81 -0.81
  -0.31 -0.31
  -0.71 -1.01
(a MATLAB check of this step follows below)
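A minimal MATLAB check of step 1 using the data above; repmat is used so the mean row can be subtracted from every row:

D = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
     2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
mu = mean(D);                        % [1.81 1.91]
Dz = D - repmat(mu, size(D,1), 1);   % zero-mean data, matches the table above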
64 PCA process STEP 1 (figure: plot of the original and zero-mean data)
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
65 PCA process STEP 2
- Calculate the covariance matrix
- cov = [ .616555556  .615444444
          .615444444  .716555556 ]
- Since the non-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together. (A MATLAB check follows below.)
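Continuing the MATLAB sketch (cov subtracts the column means itself, so cov(D) and cov(Dz) give the same matrix):

C = cov(Dz)
% C =
%     0.6166    0.6154
%     0.6154    0.7166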
66 PCA process STEP 3
- Calculate the eigenvectors and eigenvalues of the covariance matrix
- eigenvalues =   .0490833989
                 1.28402771
- eigenvectors =  .735178656  -.677873399
                 -.677873399  -.735178656
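In the MATLAB sketch this is one call to eig; for a symmetric matrix it returns unit-length eigenvectors, with the eigenvalues here in ascending order and possibly opposite signs to the slide (the sign of an eigenvector is arbitrary):

[V, E] = eig(C)
% diag(E) is approximately [0.0491; 1.2840]; the columns of V match the
% eigenvectors above up to sign.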
67 PCA process STEP 3
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
- The eigenvectors are plotted as diagonal dotted lines on the plot.
- Note they are perpendicular to each other.
- Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
- The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line but are off to the side of it by some amount.
68 PCA process STEP 4
- Reduce dimensionality and form the feature vector
- The eigenvector with the highest eigenvalue is the principal component of the data set.
- In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
- Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.
69 PCA process STEP 4
- Now, if you like, you can decide to ignore the components of lesser significance.
- You do lose some information, but if the eigenvalues are small, you don't lose much
- n dimensions in your data
- calculate n eigenvectors and eigenvalues
- choose only the first p eigenvectors
- final data set has only p dimensions
70 PCA process STEP 4
- Feature Vector
- FeatureVector = (eig1 eig2 eig3 ... eign)
- We can either form a feature vector with both of the eigenvectors:
   -.677873399   .735178656
   -.735178656  -.677873399
- or we can choose to leave out the smaller, less significant component and only have a single column:
   -.677873399
   -.735178656
(see the ordering sketch below)
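The ordering and reduction step in the MATLAB sketch, sorting the columns of V by decreasing eigenvalue and optionally keeping only the first p of them:

[evals, idx]  = sort(diag(E), 'descend');
FeatureVector = V(:, idx);            % eigenvectors as columns, most significant first
ReducedVector = FeatureVector(:, 1);  % keep only the principal component (p = 1)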
71 PCA process STEP 5
- Deriving the new data
- FinalData = RowFeatureVector x RowZeroMeanData
- RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
- RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension (see the sketch below)
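The same step in the MATLAB sketch, using the variables from the previous snippets:

RowFeatureVector = FeatureVector';   % eigenvectors in rows, most significant first
RowZeroMeanData  = Dz';              % one data item per column
FinalData = RowFeatureVector * RowZeroMeanData;
FinalData'                           % matches the table on slide 73, up to sign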
72 PCA process STEP 5
- FinalData is the final data set, with data items in columns and dimensions along rows.
- What will this give us? It will give us the original data solely in terms of the vectors we chose.
- We have changed our data from being in terms of the axes x and y, and now it is in terms of our 2 eigenvectors.
73 PCA process STEP 5
- FinalData transposed: dimensions along columns
       x               y
   -.827970186     .175115307
   1.77758033     -.142857227
   -.992197494    -.384374989
   -.274210416    -.130417207
  -1.67580142      .209498461
   -.912949103    -.175282444
    .0991094375    .349824698
   1.14457216     -.0464172582
    .438046137    -.0177646297
   1.22382056      .162675287
74 PCA process STEP 5 (figure: data re-plotted along the eigenvector axes)
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
75 Reconstruction of original Data
- If we reduced the dimensionality, then obviously, when reconstructing the data, we lose the dimensions we chose to discard. In our example, let us assume that we considered only the x dimension of the transformed data. (A reconstruction sketch follows below.)
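A reconstruction sketch in MATLAB, continuing the earlier snippets: project onto the kept eigenvector, map back, and add the mean again (because the eigenvector has unit length, its transpose undoes the projection on its own span):

ReducedData = ReducedVector' * Dz';                  % 1 x 10: principal component only
DzApprox    = (ReducedVector * ReducedData)';        % back to 10 x 2, still mean-removed
DApprox     = DzApprox + repmat(mu, size(D,1), 1);   % approximate original data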
76 Reconstruction of original Data
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
- x (kept component only):
   -.827970186
   1.77758033
   -.992197494
   -.274210416
  -1.67580142
   -.912949103
    .0991094375
   1.14457216
    .438046137
   1.22382056
77 Matlab PCA
- Matlab has a function called princomp(x)
- [COEFF,SCORE] = princomp(x) performs principal components analysis on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance. princomp centers X by subtracting off column means.
78 Matlab PCA
- [COEFF,SCORE] = princomp(X) returns SCORE, the principal component scores; that is, the representation of X in the principal component space. Rows of SCORE correspond to observations, columns to components.
79 Matlab PCA
>> D   % the zero-mean data from slide 63
D =
    0.6900    0.4900
   -1.3100   -1.2100
    0.3900    0.9900
    0.0900    0.2900
    1.2900    1.0900
    0.4900    0.7900
    0.1900   -0.3100
   -0.8100   -0.8100
   -0.3100   -0.3100
   -0.7100   -1.0100
80 Matlab PCA
>> [A,B] = princomp(D)
A =
   -0.6779    0.7352    (see slide 66; note that these columns are ordered by highest eigenvalue, unlike slide 66)
   -0.7352   -0.6779
B =
   -0.8280    0.1751    (see slide 73)
    1.7776   -0.1429
   -0.9922   -0.3844
   -0.2742   -0.1304
   -1.6758    0.2095
   -0.9129   -0.1753
    0.0991    0.3498
    1.1446   -0.0464
    0.4380   -0.0178
    1.2238    0.1627
81 MATLAB DEMO
82 PCA applications - Eigenfaces
- Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces
- Eigenfaces are the standardized "face ingredients" derived from the statistical analysis of many pictures of human faces
- A human face may be considered to be a combination of these standard faces
83 PCA applications - Eigenfaces
- To generate a set of eigenfaces:
- A large set of digitized images of human faces is taken under the same lighting conditions.
- The images are normalized to line up the eyes and mouths.
- The eigenvectors of the covariance matrix of the statistical distribution of face image vectors are then extracted.
- These eigenvectors are called eigenfaces.
84 PCA applications - Eigenfaces
- The principal eigenface looks like a bland androgynous average human face
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
85 Eigenfaces - Face Recognition
- When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.
- Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces
- Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
86 Expert Object Recognition in Video (Matt McEuen)
87 EOR
- Principal Component Analysis (PCA)
- Based on covariance
- Visual memory reconstruction
- Images of cats and dogs are aligned so that the eyes are in the same position in every image
88 EOR
89 Back to Heuristic Improvements
90 Back to Heuristic Improvements
91 Heuristic Improvements
- Initialization
- Large initial weights will saturate neurons
- All 0s for weights is also potentially bad
- From the text, it can be shown (approximately, under certain conditions) that a good choice for weights is to select them randomly from a uniform distribution with (a sketch of one such rule follows below)
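The slide's distribution parameters are not reproduced in this transcript. One commonly quoted version of this rule (e.g. in Haykin's text) draws zero-mean weights whose variance is the reciprocal of the number of connections feeding the neuron; a hedged MATLAB sketch with made-up layer sizes:

fanIn    = 8;                               % connections into each neuron (example value)
nNeurons = 4;
r = sqrt(3 / fanIn);                        % uniform on [-r, r] has variance 1/fanIn
W = (2 * rand(nNeurons, fanIn) - 1) * r;    % zero-mean, small initial weights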
92 Number of Hidden Layers
- Three layers suffice to implement any function with properly chosen transfer functions
- Additional layers can help
- It is easier for a four-layer net to learn translations than for a three-layer net.
- Each layer can learn an invariance - maybe
93 Feature Detection
- Hidden neurons play the role of feature detectors
- They tend to transform the input vector space into a hidden or feature space
- Each hidden neuron's output is then a measure of how well that feature is present in the current input
94 Generalization
- Generalization is the term used to describe how well a NN correctly classifies a set of data that was not used as the training set.
- One generally has 3 sets of data:
- Training
- Validation (on-going generalization test)
- Testing (this is the true error rate)
95 Generalization
- A network generalizes well when, for data it was not trained on, it produces correct (or near correct) outputs.
- Can overfit or overtrain
- Generally we want to select the smoothest/simplest mapping of the function in the absence of prior knowledge
- demo > nnd11gn
96 Generalization
- Influenced by four factors:
- Size of the training set
- How representative the training set is of the data
- Neural network architecture
- Physical complexity of the problem at hand
- Often the NN configuration or the training set is fixed, so we have only the other two to work with
97 Generalization - Overtraining
- Overtraining occurs when the network has been trained only to minimize the error
- The next slide shows a network in which a trigonometric function is being approximated
- 1-3-1 denotes one input, 3 hidden layer neurons, one output layer neuron
- The fit is perfect at 4, but at 8 the error is lower yet the fit is poorer
98 (figure: 1-3-1 network fits illustrating overtraining)
99 Generalization - Complexity of Network
- In the next slide, as the number of hidden layer neurons goes from 1 to 5, the network does better.
100 (figure: fits as the number of hidden neurons goes from 1 to 5)
101 Approximations of Functions
- In general, for good generalization, the number of training samples N should be larger than the ratio of the total number of free parameters (weights) in the network to the mean-square value of the estimation error
- Normally we want the simplest NN we can get
102 Generalization
- A commonly used value is
- N = O(W/ε), where
- O() is like Big-O
- W = total number of weights
- ε = fraction of classification errors permitted on test data
- N = number of training samples
(a worked example follows below)
103 Approximations of Functions
- A NN acts as a non-linear mapping from input to output space
- It is everywhere differentiable if all transfer functions are differentiable
- What is the minimum number of hidden layers in a multilayer perceptron whose I/O mapping provides an approximation of any continuous mapping?
104 Approximations of Functions
- Part of the universal approximation theorem
- This theorem states (in essence) that a NN with bounded, nonconstant, monotone increasing continuous transfer functions and one hidden layer can approximate any function
- It says nothing about the optimum in terms of learning time, ease of implementation, or generalization
105 Practical Considerations
- For high dimensional spaces, it is often better to have 2-layer networks so that neurons in the layers do not interact so much.
106 Cross Validation
- Randomly divide the data into training and testing sets
- Further randomly divide the training set into estimation and validation subsets
- Use the validation set to test the accuracy of the model, and then the test set for the actual accuracy value
107 Cross Validation - leave-one-out
- Train on everything but one sample and then test on it
- Repeat for all partitions (a sketch follows below)
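A minimal MATLAB sketch of leave-one-out estimation. Since the slides' own network-training code is not shown, a simple 1-nearest-neighbour classifier stands in for the model, and the data and labels are made up:

X = rand(20, 3);                 % 20 samples, 3 features (made-up data)
y = randi([0 1], 20, 1);         % made-up class labels
n = size(X, 1);
correct = 0;
for i = 1:n
    trainIdx = setdiff(1:n, i);                              % hold out sample i
    diffs = X(trainIdx, :) - repmat(X(i, :), n - 1, 1);
    [dmin, nearest] = min(sum(diffs.^2, 2));                 % nearest training sample
    correct = correct + (y(trainIdx(nearest)) == y(i));      % test on the held-out sample
end
looAccuracy = correct / n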