1
Lecture 37 of 42
Unsupervised Learning: AutoClass, SOM, EM, and Hierarchical Mixtures of Experts
Thursday, 19 April 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2007/CIS732
Readings: "Bagging, Boosting, and C4.5", Quinlan; Section 5, "MLC++ Utilities 2.0", Kohavi and Sommerfield
2
Mixture Models: Review
  • Intuitive Idea
  • Integrate knowledge from multiple experts (or data from multiple sensors)
  • Collection of inducers organized into a committee machine (e.g., modular ANN)
  • Dynamic structure: takes the input signal into account
  • References
  • Bishop, 1995 (Sections 2.7, 9.7)
  • Haykin, 1999 (Section 7.6)
  • Problem Definition
  • Given: collection of inducers (experts) L, data set D
  • Perform supervised learning using inducers and self-organization of experts
  • Return: committee machine with trained gating network (combiner inducer)
  • Solution Approach
  • Let combiner inducer be a generalized linear model (e.g., threshold gate)
  • Activation functions: linear combination, vote, smoothed vote (softmax)

3
Mixture Models: Procedure
  • Algorithm Combiner-Mixture-Model (D, L, Activation, k)
  • m ← D.size
  • FOR j ← 1 TO k DO // initialization
  • wj ← 1
  • UNTIL the termination condition is met, DO
  • FOR j ← 1 TO k DO
  • Pj ← Lj.Update-Inducer (D) // single training step for Lj
  • FOR i ← 1 TO m DO
  • Sumi ← 0
  • FOR j ← 1 TO k DO Sumi += Pj(Di)
  • Neti ← Compute-Activation (Sumi) // compute gj ≡ Netij
  • FOR j ← 1 TO k DO wj ← Update-Weights (wj, Neti, Di)
  • RETURN (Make-Predictor (P, w))
  • Update-Weights: Single Training Step for Mixing Coefficients (see the Python sketch below)
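The pseudocode above can be read as the following NumPy sketch. It is a minimal illustration, assuming k experts with a scikit-learn-style partial_fit / predict_proba interface and integer labels 0..C-1; the helper names (combiner_mixture_model, softmax) and the simple responsibility-based weight update are illustrative assumptions, not the exact Update-Weights rule from the lecture.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax (the slide's 'smoothed vote')."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def combiner_mixture_model(X, y, experts, n_iter=10, lr=0.1):
    """Train k experts plus one mixing coefficient per expert.

    experts: list of objects with partial_fit(X, y, classes=...) and
    predict_proba(X) -- an assumed, scikit-learn-like interface.
    Returns (predict_proba, w): the Make-Predictor closure and mixing weights.
    """
    k = len(experts)
    classes = np.unique(y)                   # assumes integer labels 0..C-1
    w = np.ones(k) / k                       # w_j <- 1 (then normalized)
    for _ in range(n_iter):                  # UNTIL termination condition
        for ex in experts:                   # single training step per expert
            ex.partial_fit(X, y, classes=classes)
        # P[j, i] = probability expert j assigns to the true label of example i
        P = np.stack([ex.predict_proba(X)[np.arange(len(y)), y] for ex in experts])
        # gating responsibilities g_j for each example (Compute-Activation)
        g = softmax(np.log(w + 1e-12)[:, None] + np.log(P + 1e-12), axis=0)
        # Update-Weights: move mixing coefficients toward the mean responsibility
        w = (1 - lr) * w + lr * g.mean(axis=1)
        w = w / w.sum()

    def predict_proba(X_new):
        # Make-Predictor: weighted combination of the experts' outputs
        return sum(wj * ex.predict_proba(X_new) for wj, ex in zip(w, experts))

    return predict_proba, w
```

With, for example, a few SGD-style classifiers trained as experts, the returned closure plays the role of Make-Predictor (P, w).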

4
Mixture Models: Properties
5
Generalized Linear Models (GLIMs)
  • Recall: Perceptron (Linear Threshold Gate) Model
  • Generalization of LTG Model [McCullagh and Nelder, 1989]
  • Model parameters: connection weights, as for LTG
  • Representational power depends on transfer (activation) function
  • Activation Function
  • Type of mixture model depends (in part) on this definition
  • e.g., o(x) could be softmax (x · w) [Bridle, 1990]; see the gate sketch below
  • NB: softmax is computed across j = 1, 2, …, k (cf. hard max)
  • Defines (multinomial) pdf over experts [Jordan and Jacobs, 1995]
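As a concrete reading of the softmax activation above, here is a minimal sketch of a GLIM gate that maps an input x to a multinomial distribution over k experts; the names glim_gate and V are illustrative, not from the slides.

```python
import numpy as np

def glim_gate(x, V):
    """Softmax gate: a GLIM mapping input x to a multinomial pdf over k experts.

    x : (d,) input vector; V : (k, d) gate weight matrix (one weight row per expert).
    Returns g with g[j] = exp(v_j . x) / sum_l exp(v_l . x).
    """
    z = V @ x
    z = z - z.max()              # numerical stability; does not change the softmax
    g = np.exp(z)
    return g / g.sum()

# Example: 3 experts, 2-dimensional input
rng = np.random.default_rng(0)
g = glim_gate(np.array([1.0, -0.5]), rng.normal(size=(3, 2)))
print(g, g.sum())                # a proper probability distribution over the 3 experts
```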

6
Hierarchical Mixture of Experts (HME): Idea
  • Hierarchical Model
  • Compare: stacked generalization network
  • Difference: trained in multiple passes
  • Dynamic Network of GLIMs

All examples x and targets y = c(x) identical
7
Hierarchical Mixture of Experts (HME): Procedure
  • Algorithm Combiner-HME (D, L, Activation, Level, k, Classes)
  • m ← D.size
  • FOR j ← 1 TO k DO wj ← 1 // initialization
  • UNTIL the termination condition is met DO
  • IF Level > 1 THEN
  • FOR j ← 1 TO k DO
  • Pj ← Combiner-HME (D, Lj, Activation, Level - 1, k, Classes)
  • ELSE
  • FOR j ← 1 TO k DO Pj ← Lj.Update-Inducer (D)
  • FOR i ← 1 TO m DO
  • Sumi ← 0
  • FOR j ← 1 TO k DO
  • Sumi += Pj(Di)
  • Neti ← Compute-Activation (Sumi)
  • FOR l ← 1 TO Classes DO wl ← Update-Weights (wl, Neti, Di)
  • RETURN (Make-Predictor (P, w)) (see the recursive Python sketch below)
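A minimal recursive sketch of the procedure above, with a softmax GLIM gate at each internal node and scikit-learn-style experts at the leaves. The class name HMENode and the simple responsibility-driven gate update are illustrative assumptions; the exact Update-Weights rule (and the EM/gradient training of [Jordan and Jacobs, 1995]) is only approximated here.

```python
import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

class HMENode:
    """One gating node: a softmax GLIM over its children (leaf experts or sub-HMEs)."""
    def __init__(self, children, n_features, rng):
        self.children = children
        self.V = rng.normal(scale=0.1, size=(len(children), n_features))  # gate weights

    def gate(self, X):
        return softmax(X @ self.V.T, axis=1)             # (n, k) responsibilities g_j(x_i)

    def predict_proba(self, X):
        g = self.gate(X)                                  # mixture of the children's outputs
        return sum(g[:, [j]] * c.predict_proba(X) for j, c in enumerate(self.children))

    def fit_step(self, X, y, classes, lr=0.5):
        """One pass of the UNTIL loop: recurse (Level > 1) or update leaf experts
        (Level = 1), then nudge the gate toward whichever child explains each
        example best -- a crude surrogate for the lecture's Update-Weights step."""
        for c in self.children:
            if isinstance(c, HMENode):
                c.fit_step(X, y, classes, lr)
            else:
                c.partial_fit(X, y, classes=classes)      # assumed sklearn-style leaf expert
        # posterior responsibility of child j for example i: g_j(x_i) * P_j(y_i | x_i)
        P = np.stack([c.predict_proba(X)[np.arange(len(y)), y]
                      for c in self.children], axis=1)
        h = softmax(np.log(self.gate(X) + 1e-12) + np.log(P + 1e-12), axis=1)
        # move the gate's outputs toward the responsibilities
        self.V += lr * (h - self.gate(X)).T @ X / len(y)
```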

8
Hierarchical Mixture of Experts (HME): Properties
  • Advantages
  • Benefits of ME: base case is single level of expert and gating networks
  • More combiner inducers → more capability to decompose complex problems
  • Views of HME
  • Expresses divide-and-conquer strategy
  • Problem is distributed across subtrees on the fly by combiner inducers
  • Duality: data fusion ↔ problem redistribution
  • Recursive decomposition until good fit is found to local structure of D
  • Implements soft decision tree
  • Mixture of experts: 1-level decision tree (decision stump)
  • Information preservation compared to traditional (hard) decision tree
  • Dynamics of HME improves on greedy (high-commitment) strategy of decision tree induction

9
EM Algorithm: Example 1
  • Experiment
  • Two coins: P(Head on Coin 1) = p, P(Head on Coin 2) = q
  • Experimenter first selects a coin: P(Coin 1) = α
  • Chosen coin tossed 3 times (per experimental run)
  • Observe: D = (1 H H T), (1 H T T), (2 T H T)
  • Want to predict: α, p, q
  • How to model the problem?
  • Simple Bayesian network
  • Now, can find most likely values of parameters α, p, q given data D
  • Parameter Estimation
  • Fully observable case: easy to estimate p, q, and α
  • Suppose k heads are observed out of n coin flips
  • Maximum likelihood estimate vML for Flipi: p = k / n
  • Partially observable case
  • Don't know which coin the experimenter chose
  • Observe: D = (H H T), (H T T), (T H T) ≡ (? H H T), (? H T T), (? T H T)

P(Coin = 1) = α
P(Flipi = 1 | Coin = 1) = p, P(Flipi = 1 | Coin = 2) = q
10
EM Algorithm: Example 2
  • Problem
  • When we knew Coin 1 or Coin 2, there was no problem
  • No known analytical solution to the partially observable problem
  • i.e., not known how to compute estimates of p, q, and α to get vML
  • Moreover, not known what the computational complexity is
  • Solution Approach: Iterative Parameter Estimation
  • Given: a guess of P(Coin = 1 | x), P(Coin = 2 | x)
  • Generate fictional data points, weighted according to this probability
  • P(Coin = 1 | x) = P(x | Coin = 1) P(Coin = 1) / P(x), based on our guess of α, p, q
  • Expectation step (the E in EM)
  • Now, can find most likely values of parameters α, p, q given fictional data
  • Use gradient descent to update our guess of α, p, q
  • Maximization step (the M in EM)
  • Repeat until termination condition met (e.g., stopping criterion on validation set)
  • EM Converges to Local Maxima of the Likelihood Function P(D | θ) (see the worked sketch below)
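The E and M steps above can be made concrete for the two-coin experiment of Example 1. The sketch below uses the standard closed-form M step for a Bernoulli mixture rather than the gradient update mentioned on the slide; the initial guesses and iteration count are arbitrary.

```python
import numpy as np

# Observed runs: number of heads in each 3-flip run, coin identity hidden.
# These counts correspond to (H H T), (H T T), (T H T) from the slide.
heads = np.array([2, 1, 1])
n_flips = np.array([3, 3, 3])

# Initial guesses for alpha = P(Coin 1), p = P(H | Coin 1), q = P(H | Coin 2)
alpha, p, q = 0.6, 0.7, 0.4

def binom_lik(theta, h, n):
    """Likelihood of h heads in n flips for head-probability theta (constant factor omitted)."""
    return theta**h * (1 - theta)**(n - h)

for step in range(100):
    # E step: responsibility r_i = P(Coin 1 | run i) under the current alpha, p, q
    l1 = alpha * binom_lik(p, heads, n_flips)
    l2 = (1 - alpha) * binom_lik(q, heads, n_flips)
    r = l1 / (l1 + l2)
    # M step: re-estimate parameters from the fractionally-labeled ("fictional") data
    alpha = r.mean()
    p = (r * heads).sum() / (r * n_flips).sum()
    q = ((1 - r) * heads).sum() / ((1 - r) * n_flips).sum()

print(alpha, p, q)   # a local maximum of the likelihood P(D | alpha, p, q)
```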

11
EM Algorithm: Example 3
12
EM for Unsupervised Learning
  • Unsupervised Learning Problem
  • Objective: estimate a probability distribution with unobserved variables
  • Use EM to estimate mixture policy (more on this later; see 6.12, Mitchell)
  • Pattern Recognition Examples
  • Human-computer intelligent interaction (HCII)
  • Detecting facial features in emotion recognition
  • Gesture recognition in virtual environments
  • Computational medicine [Frey, 1998]
  • Determining morphology (shapes) of bacteria, viruses in microscopy
  • Identifying cell structures (e.g., nucleus) and shapes in microscopy
  • Other image processing
  • Many other examples (audio, speech, signal processing; motor control; etc.)
  • Inference Examples
  • Plan recognition: mapping from (observed) actions to agents' (hidden) plans
  • Hidden changes in context: e.g., aviation; computer security; MUDs

13
Unsupervised Learning: AutoClass 1
14
Unsupervised Learning: AutoClass 2
  • AutoClass Algorithm [Cheeseman et al, 1988]
  • Based on maximizing P(x | θj, yj, J)
  • θj: class (cluster) parameters (e.g., mean and variance)
  • yj: synthetic classes (can estimate marginal P(yj) any time)
  • Apply Bayes's Theorem, use numerical BOC estimation techniques (cf. Gibbs)
  • Search objectives
  • Find best J (ideally, integrate out θj, yj; really, start with big J, decrease; see the sketch below)
  • Find θj, yj: use MAP estimation, then integrate in the neighborhood of yMAP
  • EM: Find MAP Estimate for P(x | θj, yj, J) by Iterative Refinement
  • Advantages over Symbolic (Non-Numerical) Methods
  • Returns probability distribution over class membership
  • More robust than best yj
  • Compare: fuzzy set membership (similar but probabilistically motivated)
  • Can deal with continuous as well as discrete data
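AutoClass itself is the NASA Lisp/C system listed on the next slide; as a rough, hedged stand-in for its "start with big J, let unneeded classes vanish" search over the number of classes, scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior behaves similarly on continuous data. The dataset below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic 2-D data drawn from three clusters (stand-in for a real dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

# Start with a deliberately large J; the Dirichlet-process prior drives the
# weights of unneeded classes toward zero, echoing AutoClass's search for J.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(model.weights_, 3))        # most of the 10 classes get ~0 weight
membership = model.predict_proba(X[:5])   # soft class membership, not a hard best-y assignment
print(np.round(membership, 2))
```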

15
Unsupervised Learning: AutoClass 3
  • AutoClass Resources
  • Beginning tutorial (AutoClass II): Cheeseman et al, 4.2.2, Buchanan and Wilkins
  • Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/
  • Applications
  • Knowledge discovery in databases (KDD) and data mining
  • Infrared astronomical satellite (IRAS) spectral atlas (sky survey)
  • Molecular biology: pre-clustering DNA acceptor, donor sites (mouse, human)
  • LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)
  • Positive findings: see book chapter by Cheeseman and Stutz, online
  • Other typical applications: see KD Nuggets (http://www.kdnuggets.com)
  • Implementations
  • Obtaining source code from project page
  • AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
  • AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
  • These and others at http://www.recursive-partitioning.com/cluster.html

16
Unsupervised Learning: Competitive Learning for Feature Discovery
  • Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
  • Global organization from local, competitive weight update
  • Basic principle expressed by Von der Malsburg
  • Guiding examples from (neuro)biology: lateral inhibition
  • Previous work: Hebb, 1949; Rosenblatt, 1959; Von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982
  • A Procedural Framework for Unsupervised Connectionist Learning
  • Start with identical (neural) processing units, with random initial parameters
  • Set limit on activation strength of each unit
  • Allow units to compete for the right to respond to a set of inputs (see the winner-take-all sketch below)
  • Feature Discovery
  • Identifying (or constructing) new features relevant to supervised learning
  • Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR), optical character recognition (OCR)
  • Competitive learning: transform X into X′; train units in X′ closest to x
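A minimal winner-take-all sketch of the procedural framework above (identical units, random initialization, competition for the right to respond); the unit count, learning rate, and epoch count are illustrative choices, not from the slides.

```python
import numpy as np

def competitive_learning(X, n_units=4, lr=0.1, n_epochs=20, seed=0):
    """Plain winner-take-all competitive learning: each unit's weight vector
    drifts toward the inputs it wins, yielding prototype/feature detectors."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))   # identical units, random initial parameters
    for _ in range(n_epochs):
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))  # unit closest to x wins
            W[winner] += lr * (x - W[winner])                  # only the winner updates
    return W

# Each row of W converges toward a cluster prototype (a discovered "feature").
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in ([0, 0], [2, 2])])
prototypes = competitive_learning(X, n_units=2)
print(prototypes)
```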

17
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 1
  • Another Clustering Algorithm
  • aka Self-Organizing Feature Map (SOFM)
  • Given: vectors of attribute values (x1, x2, …, xn)
  • Returns: vectors of attribute values (x1′, x2′, …, xk′)
  • Typically, n >> k (n is high, k = 1, 2, or 3; hence dimensionality reducing)
  • Output vectors x′, the projections of input points x; also get P(xj′ | xi)
  • Mapping from x to x′ is topology preserving
  • Topology Preserving Networks
  • Intuitive idea: similar input vectors will map to similar clusters
  • Recall: informal definition of cluster (isolated set of mutually similar entities)
  • Restatement: clusters of X (high-D) will still be clusters of X′ (low-D)
  • Representation of Node Clusters
  • Group of neighboring artificial neural network units (neighborhood of nodes)
  • SOMs: combine ideas of topology-preserving networks, unsupervised learning (see the training sketch below)
  • Implementation: http://www.cis.hut.fi/nnrc/ and MATLAB NN Toolkit
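A minimal sketch of Kohonen's SOM training loop, assuming a 2-D sheet of units with a Gaussian neighborhood and a decaying learning rate; the grid size and decay schedules are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

def train_som(X, grid_w=8, grid_h=8, n_iter=2000, lr0=0.5, seed=0):
    """Minimal Kohonen SOM: a grid_h x grid_w sheet of units, each with a weight
    vector in input space. The winner and its grid neighbors move toward each
    input, so nearby units come to represent nearby inputs (topology preserving)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(grid_h * grid_w, d))
    # fixed 2-D grid coordinates of each unit (the low-dimensional output space)
    coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)], dtype=float)
    sigma0 = max(grid_w, grid_h) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # decaying learning rate
        sigma = sigma0 * np.exp(-3.0 * frac)    # shrinking neighborhood radius
        x = X[rng.integers(n)]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))          # best-matching unit
        grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)  # distance on the grid
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))            # Gaussian neighborhood
        W += lr * h[:, None] * (x - W)
    return W, coords

# Project 3-D inputs onto the 2-D sheet: each input maps to its BMU's grid position.
X = np.random.default_rng(1).random((500, 3))
W, coords = train_som(X)
bmus = np.argmin(np.linalg.norm(W[None, :, :] - X[:, None, :], axis=2), axis=1)
low_d = coords[bmus]        # the topology-preserving 2-D projection x'
print(low_d[:5])
```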

18
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 2
19
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) 3
20
Unsupervised Learning: SOM and Other Projections for Clustering
Cluster Formation and Segmentation Algorithm (Sketch)
21
Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)
  • Intuitive Idea
  • Q: Why are dimensionality-reducing transforms good for supervised learning?
  • A: There may be many attributes with undesirable properties, e.g.,
  • Irrelevance: xi has little discriminatory power over c(x) = yi
  • Sparseness of information: feature of interest spread out over many xi's (e.g., text document categorization, where xi is a word position)
  • We want to increase the information density by "squeezing X down"
  • Principal Components Analysis (PCA)
  • Combining redundant variables into a single variable (aka component, or factor); see the sketch below
  • Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., "like fishing?", "time spent boating")
  • Factor Analysis (FA)
  • General term for a class of algorithms that includes PCA
  • Tutorial: http://www.statsoft.com/textbook/stfacan.html
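A minimal PCA sketch via the SVD of the centered data, echoing the "combine correlated variables into one component" idea above; the correlated fishing/boating variables are a made-up illustration of the Nielsen/Gallup example, not real survey data.

```python
import numpy as np

def pca(X, k):
    """Project n x d data onto its top-k principal components via SVD.
    Returns the k-dimensional scores, the components, and the variance
    fraction each component explains (no whitening or missing-data handling)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                     # directions of maximal variance
    scores = Xc @ components.T              # the new, lower-dimensional attributes x'
    explained = (S**2)[:k] / (S**2).sum()   # fraction of variance each component keeps
    return scores, components, explained

# Two correlated survey-style variables plus noise collapse onto ~1 component.
rng = np.random.default_rng(0)
fishing = rng.normal(size=200)
boating = 0.9 * fishing + 0.1 * rng.normal(size=200)   # strongly correlated answers
noise = rng.normal(size=200)
X = np.column_stack([fishing, boating, noise])
scores, comps, explained = pca(X, k=2)
print(np.round(explained, 2))   # the first component captures most of the variance
```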

22
Clustering Methods: Design Choices
23
Clustering Applications
Information Retrieval: Text Document Categorization
24
Unsupervised Learning and Constructive Induction
  • Unsupervised Learning in Support of Supervised Learning
  • Given: D ≡ labeled vectors (x, y)
  • Return: D′ ≡ transformed training examples (x′, y′)
  • Solution approach: constructive induction
  • Feature construction: generic term
  • Cluster definition
  • Feature Construction: Front End
  • Synthesizing new attributes
  • Logical: x1 ∧ ¬x2; arithmetic: x1 + x5 / x2
  • Other synthetic attributes: f(x1, x2, …, xn), etc.
  • Dimensionality-reducing projection, feature extraction
  • Subset selection: finding relevant attributes for a given target y
  • Partitioning: finding relevant attributes for given targets y1, y2, …, yp
  • Cluster Definition: Back End
  • Form, segment, and label clusters to get intermediate targets y′
  • Change of representation: find an (x′, y′) that is good for learning target y

x ≡ (x1, …, xp)
25
Clustering: Relation to Constructive Induction
  • Clustering versus Cluster Definition
  • Clustering: 3-step process
  • Cluster definition: back end for feature construction
  • Clustering: 3-Step Process (see the sketch below)
  • Form
  • (x1′, …, xk′) in terms of (x1, …, xn)
  • NB: typically part of construction step; sometimes integrates both
  • Segment
  • (y1, …, yJ) in terms of (x1′, …, xk′)
  • NB: number of clusters J not necessarily same as number of dimensions k
  • Label
  • Assign names (discrete/symbolic labels (v1, …, vJ)) to (y1, …, yJ)
  • Important in document categorization (e.g., clustering text for info retrieval)
  • Hierarchical Clustering: Applying Clustering Recursively
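A minimal form/segment/label sketch using PCA for the Form step and k-means for the Segment step; the synthetic data and the placeholder cluster names are illustrative assumptions (in document categorization the labels would be derived from cluster contents).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in data: 300 examples, 10 correlated attributes (x1 ... xn)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(300, 10))

# Form: construct (x1', ..., xk') from (x1, ..., xn) via a dimensionality-reducing projection
X_prime = PCA(n_components=2).fit_transform(X)

# Segment: group the projected points into J clusters (J need not equal k)
J = 3
segmenter = KMeans(n_clusters=J, n_init=10, random_state=0).fit(X_prime)
y_prime = segmenter.labels_

# Label: attach symbolic names (v1, ..., vJ) to the J clusters
names = {j: f"cluster-{j}" for j in range(J)}
labels = [names[j] for j in y_prime]
print(labels[:10])
```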

26
Training Methods for Hierarchical Mixture of Experts (HME)
27
Terminology
  • Committee Machines, aka Combiners
  • Static Structures
  • Ensemble averaging
  • Single-pass, separately trained inducers, common input
  • Individual outputs combined to get scalar output (e.g., linear combination)
  • Boosting the margin: separately trained inducers, different input distributions
  • Filtering: feed examples to trained inducers (weak classifiers), pass on to next classifier iff conflict encountered (consensus model)
  • Resampling: aka subsampling (Si of fixed size m resampled from D)
  • Reweighting: fixed-size Si containing weighted examples for inducer
  • Dynamic Structures
  • Mixture of experts: training in combiner inducer (aka gating network)
  • Hierarchical mixtures of experts: hierarchy of inducers, combiners
  • Mixture Model, aka Mixture of Experts (ME)
  • Expert (classification), gating (combiner) inducers (modules, networks)
  • Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels

28
Summary Points
  • Committee Machines, aka Combiners
  • Static Structures (Single-Pass)
  • Ensemble averaging
  • For improving weak (especially unstable) classifiers
  • e.g., weighted majority, bagging, stacking
  • Boosting the margin
  • Improve performance of any inducer: weight examples to emphasize errors
  • Variants: filtering (aka consensus), resampling (aka subsampling), reweighting
  • Dynamic Structures (Multi-Pass)
  • Mixture of experts: training in combiner inducer (aka gating network)
  • Hierarchical mixtures of experts: hierarchy of inducers, combiners
  • Mixture Model (aka Mixture of Experts)
  • Estimation of mixture coefficients (i.e., weights)
  • Hierarchical Mixtures of Experts (HME): multiple combiner (gating) levels
  • Next Week: Intro to GAs, GP (9.1-9.4, Mitchell; 1, 6.1-6.5, Goldberg)