Title: Unsupervised Learning
1. Unsupervised Learning
- i.e., learning even when there is no right answer
2. Unsupervised Learning
- Subject: What does unsupervised learning learn?
- Unsupervised learning allegedly involves no target values.
- In fact, for most varieties of unsupervised learning, the targets are the same as the inputs (Sarle, 1994).
- In other words, unsupervised learning usually performs the same task as an auto-associative network, compressing the information from the inputs (Deco and Obradovic, 1996).
3. Examples of unsupervised learning
- Cluster analysis
  - Identifying the relational structure of the world.
  - Accomplished via competitive learning.
- Correlational analysis
  - Identifying the correlations among features.
  - Accomplished via Hebbian learning.
4. Cluster analysis
- Competitive learning
  - Unsupervised competitive learning is used in a wide variety of fields under a wide variety of names, the most common of which is "cluster analysis".
  - Analogous to k-means clustering
  - But competitive learning is iterative and can be non-linear.
  - The goal in these cases is to classify the input vectors into groups, thus providing a summary of the inputs (an efficient categorization).
5. Competitive learning
- Create a number of arbitrarily defined cluster vectors (the k clusters of k-means clustering)
- Change the location of these vectors to match the distribution of the input vectors
6. Moving cluster vectors to cluster centers
7. Graphical depiction
From Dr. Nigel Allinson's web page
8. Computational details
- Simple competitive learning
  - Single layer of output units, each fully connected to the input units.
  - The number of output units is determined by the designer and is critically important to network behavior.
  - Activity of each output unit is determined in the traditional way: output = W · input.
  - The activity of the output units is a measure of the similarity between their weight vectors and the input vector.
  - Similarity is traditionally measured using the dot product (a small sketch follows this slide).
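- As a concrete illustration of the activation step, here is a minimal NumPy sketch (the names W, x and the sizes are illustrative; the weight vectors are normalized so the dot product tracks similarity rather than magnitude):

```python
import numpy as np

# Minimal sketch of the activation step in simple competitive learning.
# W holds one weight vector per output unit (rows); x is one input pattern.
rng = np.random.default_rng(0)
n_inputs, n_outputs = 4, 3

W = rng.normal(size=(n_outputs, n_inputs))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # normalized weight vectors
x = rng.normal(size=n_inputs)

activations = W @ x                   # output = W . input (dot-product similarity)
winner = int(np.argmax(activations))  # most active unit = most similar weight vector
print(activations, winner)
```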
9. How is learning accomplished?
- Winner-take-all (WTA)
  - The output unit with the highest activation level changes its connections to the input layer.
  - Learning consists of making the weight vector more similar to the input vector: the difference (w_i − input) should decrease.
  - Update rule: Δw_i = η (input − w_i), where w_i is the weight vector for the winner (see the sketch below).
- Geometric analogy
  - Draw on board
  - Finds the middle of clusters in the input
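- A sketch of the winner-take-all update above, continuing the NumPy setup from the previous snippet (eta stands for the learning rate η):

```python
import numpy as np

def wta_update(W, x, eta=0.1):
    """Winner-take-all step: only the winning unit's weight vector moves,
    by a fraction eta of its distance to the current input,
    i.e. delta w_winner = eta * (input - w_winner)."""
    winner = int(np.argmax(W @ x))      # unit whose weight vector best matches x
    W[winner] += eta * (x - W[winner])  # pull the winner toward the input
    return W, winner
```

- Iterated over many inputs, each weight vector drifts toward the mean of the inputs it wins, i.e. the middle of a cluster.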
10. Number of clusters (i.e., number of output units)
- The number of units determines how finely the input representations are divided.
- In k-means cluster analysis, k specifies the number of clusters.
- Dead units
  - Some units may be perennial losers; how can this be rectified?
  - Initialize the weight vectors to samples from the input itself, to ensure they are where the action is.
  - Update the weights of losers as well as winners, but have losers change their weights much more slowly ("leaky learning"; a sketch follows this slide).
  - Dead units then gravitate toward where the action is.
- But in some cases we may want dead units
  - Reserving dead units as resources helps prepare the network for handling retroactive interference.
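- A hedged sketch of the leaky-learning variant just described; the two learning rates are illustrative:

```python
import numpy as np

def leaky_update(W, x, eta_win=0.1, eta_lose=0.001):
    """Leaky learning: the winner moves quickly toward the input, while every
    other unit creeps toward it slowly, so perennial losers drift toward
    regions of the input space where the action is."""
    winner = int(np.argmax(W @ x))
    for i in range(W.shape[0]):
        eta = eta_win if i == winner else eta_lose
        W[i] += eta * (x - W[i])
    return W
```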
11. Variations on simple competitive learning
- Vector quantization
  - Used for data compression, for both storage and transmission of speech and image data.
  - Idea: Categorize the input vectors into M classes (for M output units) and then represent each vector by the category in which it falls (see the sketch below).
  - This basic technique uses standard competitive learning.
- Learning vector quantization (Kohonen, 1989), aka LVQ
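- A sketch of how a trained codebook would be used for compression; W is the matrix of learned weight (codebook) vectors, and the function names are illustrative:

```python
import numpy as np

def vq_encode(W, X):
    """Replace each row of X by the index of its nearest codebook vector;
    only that index needs to be stored or transmitted."""
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def vq_decode(W, codes):
    """Reconstruct each input as its codebook vector."""
    return W[codes]
```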
12. Supervised version of vector quantization
- We have a body of labeled sample data (labeled with its correct, predefined class).
- When the winner's class matches the input's class, the learning rule is the standard competitive learning rule; when it does not match, the weight vector is moved in the OPPOSITE direction, away from the input vector (sketched below).
- Minimizes the number of misclassifications
- LVQ2 (Kohonen, 1989)
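- A minimal sketch of the LVQ1 update, assuming each output unit has been assigned a class label (unit_labels); the winner moves toward inputs of its own class and away from inputs of other classes:

```python
import numpy as np

def lvq1_update(W, unit_labels, x, label, eta=0.05):
    """LVQ1: find the nearest weight vector; pull it toward x if its class
    label matches x's label, push it away otherwise."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    sign = 1.0 if unit_labels[winner] == label else -1.0
    W[winner] += sign * eta * (x - W[winner])
    return W
```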
13. Multi-layer networks
- Can do hierarchical clustering
- Results of the first level of cluster analysis feed into the next layer.
- Each layer should extract higher orders of clustering information.
14. Kohonen's self-organizing (feature) map
- The idea behind these networks is to preserve the topology (spatial arrangement) of the input vectors.
- Each output unit is no longer independent of the others.
- Neighboring output units should represent similar input vectors.
- Δw_i = η f(i, i*) (input − w_i)
  - where f is a neighborhood function and i* is the winning unit (a code sketch follows this slide)
  - f is 1 when i = i* and falls off toward 0.0 as the distance between i's and i*'s weight vectors increases.
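- A sketch of one SOM update, assuming the common formulation in which the neighborhood distance is measured between unit positions on the output grid (grid_pos), with a Gaussian as the neighborhood function f:

```python
import numpy as np

def som_update(W, grid_pos, x, eta=0.1, sigma=1.0):
    """One Kohonen SOM step: every unit moves toward the input, scaled by a
    Gaussian neighborhood f(i, i*) centred on the winning unit i*,
    i.e. delta w_i = eta * f(i, i*) * (x - w_i)."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # i*
    d = np.linalg.norm(grid_pos - grid_pos[winner], axis=1)  # distance to i* on the grid
    f = np.exp(-d**2 / (2 * sigma**2))                       # 1 at i*, falls toward 0
    W += eta * f[:, None] * (x - W)
    return W
```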
16. Typical neighborhood function
- Low values of σ produce a small neighborhood; high values produce a large neighborhood (a common functional form is given below).
- σ usually starts out high (a wide neighborhood) and decreases during training to optimize learning.
- The result can be thought of as an elastic net that covers the input space.
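- The function itself does not survive in this transcript; a common choice (an assumption here, consistent with the role of σ above) is a Gaussian over the distance d(i, i*) between unit i and the winner i*:

```latex
f(i, i^{*}) = \exp\!\left(-\frac{d(i, i^{*})^{2}}{2\sigma^{2}}\right)
```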
17. Uses of Kohonen nets
- These types of nets are most often used as
accounts of the development of topological maps
in the brain.
18. Uses of competitive learning for cluster analysis
- Obviously, classification
- Used to model unsupervised learning in people.
  - Unsupervised learning is most commonly studied as a function of development.
- Preprocessing of inputs before supervised learning (hybrid learning)
  - Removes redundancy in inputs
  - Produces outputs that are less correlated (even orthogonal), thus reducing future interference.
  - Makes for improved efficiency in learning.
19. Finding correlational structure
- Hebbian learning
  - Hebbian learning is the other most common variety of unsupervised learning.
  - The goal is to identify the regularities or correlations in the input patterns, i.e., to identify redundancy (a standard form of the rule is given below).
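- For reference, a standard formulation of a linear Hebbian unit (an assumption here, but consistent with the rules on the following slides): the output is y = wᵀx, the weights grow in proportion to the input-output product, and on average the update is driven by the input correlation matrix C, which is why Hebbian learning picks up correlational structure.

```latex
y = \mathbf{w}^{\top}\mathbf{x}, \qquad
\Delta\mathbf{w} = \eta\, y\, \mathbf{x}, \qquad
\langle \Delta\mathbf{w} \rangle = \eta\, C\, \mathbf{w}, \quad
C = \langle \mathbf{x}\mathbf{x}^{\top} \rangle
```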
20. Applications
- Some possible applications of a system that could detect correlations/redundancies:
  - Familiarity: a single continuous-valued output could tell us how similar a new pattern is to typical or average patterns seen in the past. The network learns what is typical.
  - Principal component analysis: extends the familiarity case to several units that together produce a multi-component measure of similarity to past patterns.
  - Encoding: reduction of a pattern to a less-redundant code. Helpful if information must be transmitted over a limited-capacity channel.
21. Error function
- Hebbian learning minimizes the same error function as an auto-associative network with a linear hidden layer (written out below).
- It is therefore a form of dimensionality reduction.
- This error function is minimized by identifying the leading principal components.
- There are variations of Hebbian learning that explicitly produce the principal components.
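- Under the usual linear auto-associator reading (an assumption here), the error function in question is the squared reconstruction error over patterns p, with the rows of W being the hidden units' weight vectors; for M hidden units it is minimized when those rows span the space of the leading M principal components.

```latex
E = \sum_{p} \left\lVert \mathbf{x}_{p} - W^{\top} W\, \mathbf{x}_{p} \right\rVert^{2}
```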
22. Computational details
- 1-unit case (first principal component)
  - Oja's rule (its standard form is given below)
  - This is just the auto-association delta rule.
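- With y = wᵀx, the standard form of Oja's rule (the slide's own equation did not survive conversion) is given below; it has the shape of the delta rule for reconstructing x from the single linear unit (target x, prediction y·w), and w converges to the first principal component of the inputs.

```latex
y = \mathbf{w}^{\top}\mathbf{x}, \qquad
\Delta\mathbf{w} = \eta\, y\,(\mathbf{x} - y\,\mathbf{w})
```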
23. M-unit case (other principal components)
- Sanger's learning rule
- Oja's M-unit rule
- (Standard forms of both rules follow.)
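- For output unit i and input j, with y_i = Σ_j w_ij x_j, the standard forms (the slides' own equations did not survive conversion) are given below; the only difference is the upper limit of the sum, which is the point made on the next slide.

```latex
\text{Sanger:}\quad
\Delta w_{ij} = \eta\, y_{i}\Big(x_{j} - \sum_{k \le i} y_{k} w_{kj}\Big)
\qquad
\text{Oja (M-unit):}\quad
\Delta w_{ij} = \eta\, y_{i}\Big(x_{j} - \sum_{k=1}^{M} y_{k} w_{kj}\Big)
```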
24. Sanger's vs. Oja's rule
- The rules differ only in the limits for the summation.
- For Sanger's rule, the weight vectors converge on the M principal components, in order.
- For Oja's M-unit rule, the weight vectors span the same subspace as those components but don't find the components themselves (a runnable sketch of Sanger's rule follows this slide).
- Sanger's is more useful for practical applications, but Oja's is more likely to be used by real brains.
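- A runnable sketch of Sanger's rule (the generalized Hebbian algorithm) on zero-mean data; after enough passes, the rows of W approximate the leading principal components, in order. The data and parameter values are illustrative:

```python
import numpy as np

def sanger_update(W, x, eta=0.01):
    """One Sanger's-rule step: output unit i learns from the residual of x
    after subtracting the reconstructions of units 1..i (triangular sum)."""
    y = W @ x
    for i in range(W.shape[0]):
        residual = x - W[: i + 1].T @ y[: i + 1]
        W[i] += eta * y[i] * residual
    return W

# Illustrative use on correlated, zero-mean data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # correlated inputs
X -= X.mean(axis=0)
W = rng.normal(scale=0.1, size=(2, 5))   # 2 output units -> first 2 components
for _ in range(20):
    for x in X:
        W = sanger_update(W, x)
```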
25. Competition between networks, not just units
- Jacobs, Jordan, Nowlan & Hinton
- For discussion on Thursday.
- Also, we'll discuss Sarle through the unsupervised learning section on p. 7.
- Discussion of hybrid networks and the following sections next week.