1
Multiclassifier Systems: Back to the Future
  • Joydeep Ghosh
  • The University of Texas at Austin

2
Agenda
  • MCS at crossroads
  • Even part of SAS, ... but what's next?
  • Historical Tidbits
  • Selected (old) highlights
  • Themes worth re-visiting
  • Broadening the scope
  • Combining multiple clusterings
  • Knowledge transfer/reuse
  • Exploiting output space
  • Limits to performance, confidences, added
    classes,
  • Modular approaches revisited

3
Combining Votes/Ranks
  • Roots in French revolution?
  • Jean-Charles de Borda, 1781 (see the Borda count
    sketch after this list)
  • Condorcet's rule, 1785
  • Duncan Black (1958): Condorcet winner if one
    exists, else Borda
  • Condorcet's Jury Theorem, 1785
  • Social choice functions or group consensus
    functions
  • Arrow's impossibility theorem (1963)
  • Even 3 classes can be problematic
  • But these settings did not have a true class
  • Linear opinion pool: Laplace
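As a concrete illustration of rank combining, a minimal Borda-count sketch (my own example; the function and data names are assumed, not from the talk):

def borda_combine(rankings):
    """rankings: list of orderings of class labels, best first."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            # a class in position p receives n - 1 - p points
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    # the consensus winner is the class with the highest total score
    return max(scores, key=scores.get)

# three classifiers rank classes A, B, C; B wins the Borda count
print(borda_combine([["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]))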

4
Multi-class Winner-Take All
  • Selfridge's PANDEMONIUM (1958)
  • Ensembles of specialized demons
  • Hierarchy: data, computational and cognitive
    demons
  • Decision: pick the demon that shouts the loudest
  • Hill climbing: re-constituting useless demons,
    ...
  • Nilsson's Committee Machine (1965)
  • Pick max of C linear discriminant functions
  • g_i(X) = W_i^T X + w_i0
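A minimal sketch of the committee-machine decision rule, picking the largest of C linear discriminants (names and toy numbers are my own, not from the talk):

import numpy as np

def wta_predict(x, W, w0):
    """W: (C, d) weights, w0: (C,) biases; returns the winning class index."""
    scores = W @ x + w0            # g_i(X) = W_i^T X + w_i0 for each class i
    return int(np.argmax(scores))  # the demon that shouts the loudest

# toy example: 3 classes, 2 features
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, 0.0])
print(wta_predict(np.array([0.2, 0.9]), W, w0))  # -> 1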

5
Hybrid PR in the 70s and 80s
  • Theory: "No single model exists for all pattern
    recognition problems and no single technique is
    applicable to all problems. Rather what we have
    is a bag of tools and a bag of problems" (Kanal,
    1974)
  • Practice: multimodal inputs, multiple
    representations, ...
  • Syntactic, structural and statistical approaches
    (Bunke 86)
  • Multistage models
  • Progressively identify/reject subsets of classes
  • Invoke k-NN if the linear classifier is ambiguous
  • Combining multiple output types: 0/1, [0,1], ...
  • Designs were typically application-specific

6
Combining in Other Areas
  • PR/Vision: data/sensor/decision fusion
  • AI: evidence combination (Barnett 81)
  • Econometrics: combining estimators (Granger 89)
  • Engineering: non-linear control
  • Statistics: model-mix methods, Bayesian Model
    Averaging, ...
  • Software diversity
  • ...

7
Mid 90s: Competing vs. Cooperating Models
[Diagram: data, knowledge and sensors feed n classification/regression models (Model 1 ... Model n); a combiner fuses their outputs into the final decision, with confidence/ROC information and feedback paths]
  • Less diversity nowadays??

8
Motivation for Modular Networks
  • (Sharkey 97)
  • More interpretable localized models (Divide and
    conquer)
  • Incorporate prior knowledge
  • Better modeling of inverse problems,
    discontinuous maps, switched time series, ..
  • Future (localized) modifications
  • Neurobiological plausibility
  • Varieties
  • Cooperative, successive, supervisory,..
  • Automatic or explicit decomposition
  • Progress in MCS
  • Local selection (Woods et al 97)
  • Dynamic classifier selection (Giacinto & Roli, 00)

9
DARPA Sonar Transients Classification Program
(1989-)
  • J. Ghosh, S. Beck and L. Deuser, IEEE Journal of
    Oceanic Engineering, Vol. 17, No. 4, October 1992,
    pp. 351-363.

[Diagram: pre-processed data from the observed phenomenon yields multiple feature sets (FFT, Gabor wavelets, ..., Feature Set M), each feeding classifiers (MLP, RBF, ..., Classifier N), whose outputs are combined by average/median/...]
10
Ensembles: Insights and Lessons (Ho, MCS 2001)
  • Additional Observations
  • Coverage Optimization
  • Bagging/arcing/.. Most popular in machine
    learning and neural network communities!
  • sweet spot in training data set size
  • Decision Optimization
  • Usually simple averaging adequate (Kittler et al,
    96, 98); see the sketch after this list
  • Highly correlated outputs
  • Diversity from feature and classifier choices
    more effective than diversity from
    samples/training
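A minimal sketch of the average (sum) rule over classifier posteriors (toy numbers of my own, not from the cited papers):

import numpy as np

# posteriors[m] holds P(class | x) from classifier m for one test point
posteriors = np.array([[0.6, 0.3, 0.1],
                       [0.5, 0.4, 0.1],
                       [0.2, 0.7, 0.1]])
mean_posterior = posteriors.mean(axis=0)   # simple averaging
decision = int(np.argmax(mean_posterior))  # class with the highest averaged posterior
print(mean_posterior, decision)            # -> [0.433 0.467 0.1], class 1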

11
Cluster Ensembles
  • Given a set of provisional partitionings, we want
    to aggregate them into a single consensus
    partitioning, even without access to original
    features .

[Diagram: each clusterer produces individual cluster labels, which a consensus function combines into consensus labels]
12
History: Consensus classification
  • Barthélemy & Leclerc 1986, Neumann & Norton
    1986
  • Classifications included partitions,
    dendrograms, n-trees, ..
  • Basic assumption is that a partial ordering
    exists which induces a lattice
  • Strict consensus used today in phylogenetic tree
    estimation

13
Cluster Ensemble Problem
  • Let there be r clusterings λ(q), q = 1..r, with
    k(q) clusters each
  • What is the integrated clustering λ that
    optimally summarizes the r given clusterings
    using k clusters?

Much more difficult than classification ensembles: cluster labels are purely symbolic, so there is no direct correspondence between clusters from different clusterings
14
Application Scenarios
  • Improve quality and robustness
  • Reduce variance
  • Good results on a wide range of data using a
    diverse portfolio of algorithms
  • Knowledge reuse
  • Consolidate legacy clusterings where original
    object descriptions are no longer available
  • Distributed Clustering (one clusterer per node)
  • Only some features available per clusterer
  • Only some objects available per clusterer
  • Hybrids

15
Average Norm. Mutual Info. (ANMI)
  • Normalized mutual information between clusterings
    a, b (see the sketch after this list)
  • Other normalizations, e.g. using geometric mean,
    possible
  • Proposed: optimal consensus clustering with k
    clusters maximizes the ANMI
  • Empirical validation
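A minimal sketch of NMI and ANMI (my own illustration; it uses the geometric-mean normalization sqrt(H(a)H(b)) and assumes 0-based integer labels, so it may differ in detail from the slide's exact formula). The proposed consensus clustering is the k-cluster labeling that maximizes the ANMI against the r given clusterings.

import numpy as np

def nmi(a, b):
    """Mutual information of labelings a, b, normalized by sqrt(H(a) * H(b))."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1.0 / n                     # joint label distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)  # marginals
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def anmi(candidate, clusterings):
    """Average NMI of a candidate consensus labeling against the r given clusterings."""
    return float(np.mean([nmi(candidate, lam) for lam in clusterings]))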

16
Designing a Consensus Function
  • Direct optimization impractical
  • Three efficient heuristics
  • Cluster-based Similarity Partitioning Alg. (CSPA):
    O(n² k r); see the sketch after this list
  • HyperGraph Partitioning Alg. (HGPA): O(n k r)
  • Meta-Clustering Alg. (MCLA): O(n k² r²)
  • All 3 exploit a hypergraph representation of the
    sets of cluster labels (input to consensus
    function)
  • See AAAI 2002 paper for details.
  • Supra-consensus function performs all three and
    picks the one with highest ANMI (fully
    unsupervised)
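A minimal sketch of the CSPA idea (my own stand-in: the original partitions the induced similarity graph with METIS, here sklearn's precomputed-affinity spectral clustering stands in for that step, and 0-based integer labels are assumed):

import numpy as np
from sklearn.cluster import SpectralClustering

def cspa(clusterings, k):
    """clusterings: (r, n) array of cluster labels; returns n consensus labels."""
    clusterings = np.asarray(clusterings)
    r, n = clusterings.shape
    # S[i, j] = fraction of the r clusterings that put objects i and j together
    S = np.zeros((n, n))
    for labels in clusterings:
        S += (labels[:, None] == labels[None, :])
    S /= r
    # re-partition the objects using the co-association similarity
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(S)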

17
Hypergraph Representation
  • One hyperedge per cluster
  • Example (a toy incidence matrix is sketched below)
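A toy illustration of the representation (my own example, not the slide's): each cluster becomes one binary hyperedge over the objects.

import numpy as np

labelings = np.array([[0, 0, 1, 1],    # clusterer 1: {x1, x2}, {x3, x4}
                      [0, 1, 1, 1]])   # clusterer 2: {x1}, {x2, x3, x4}
hyperedges = []
for labels in labelings:
    for c in np.unique(labels):
        hyperedges.append((labels == c).astype(int))  # membership indicator
H = np.column_stack(hyperedges)
print(H)  # rows = objects, columns = hyperedges from all clusterings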

18
Applications and Experiments
  • Data-sets
  • 2-dimensional bi-modal Gaussian simulated
    data (k=2, d=2, n=1000)
  • 5 Gaussians in 8 dimensions (k=5, d=8, n=1000)
  • Pen digit data (k=3, d=4, n=7494)
  • Yahoo news web-document data (k=40, d=2903,
    n=2340)
  • Application setups
  • Robust Consensus Clustering (RCC)
  • Feature Distributed Clustering (FDC)

19
Robust Consensus Clustering (RCC)
  • Goal: Create an auto-focus clusterer that works
    for a wide variety of data-sets
  • Diverse portfolio of 10 approaches
  • SOM, HGP
  • GP (Eucl, Corr, Cosi, XJac)
  • KM (Eucl, Corr, Cosi, XJac)
  • Each approach is run on the same subsample of the
    data and the 10 clusterings combined using our
    supra-consensus function
  • Evaluation: increase in NMI of the supra-consensus
    results over random clustering

20
Robustness Summary
  • Avg. quality versus ensemble quality
  • For several sample sizes n (50, 100, 200, 400, 800)
  • 10-fold exp.
  • 1 standard deviation bars

21
Feature-Distributed Clustering (FDC)
  • Federated cluster analysis with partial feature
    views
  • Experimental scenario
  • Portfolio of r clusterers receiving random subset
    of features for all objects
  • Approach
  • identical individual clustering algorithm (graph
    partitioning) and same k
  • Use supra-consensus function for combining
  • Evaluation
  • NMI of consensus with category labels

22
FDC Example
  • Data: 5 Gaussians in 8 dimensions
  • Experiment: 5 clusterings in 2-dimensional
    subspaces
  • Result: avg. individual NMI 0.70, best individual
    0.77, ensemble 0.99

23
Experimental Results: FDC
  • Reference clustering and consensus clustering
  • Ensemble always equal or better than individual
  • More than double the avg. individual quality in
    YAHOO!

24
Remarks
  • Cluster ensembles
  • Improve quality and robustness
  • Enable knowledge reuse
  • Work with distributed data
  • Are as yet largely unexplored
  • Future work
  • Soft input/output clusterings
  • What if (some) features are known?
  • Bioinformatics
  • Papers, data, demos and code at http://strehl.com/

25
Solving Related Classification Problems
  • Real-world problems are often not isolated
  • History
  • Compound decision theory (Abend, 68)
  • 90s: Life-long learning, learning to learn, ...
    (Pratt, Thrun, ...)

26
Knowledge transfer or reuse
  • Leveraging a set of previously existing solutions
    for (possibly) related problems
  • Scarce new data → leverage prior knowledge

[Diagram: existing SUPPORT classifiers (e.g., Orange vs. Apple), defined over features such as size, color and shape, provide knowledge transfer to the new TARGET task (Grapefruit vs. Pear)]
27
Supra-Classifier Architecture
  • K. Bollacker and J. Ghosh, "Knowledge reuse in
    multiclassifier systems", Pattern Recognition
    Letters, 18(11-13), Nov 1997, pp. 1385-1390.

28
Output Space Decomposition
  • History
  • Pandemonium, committee machine
  • 1 class vs. all others
  • Pairwise classification (how to combine?)
  • Limited
  • Application specific solutions (80s)
  • Error correcting output coding (Dietterich &
    Bakiri, 95); see the sketch after this list
  • +ve: # of meta-classifiers can be less; can
    tailor features
  • -ve: groupings may be forced
  • Desired: a general framework for natural grouping
    of classes
  • Hierarchical with variable resolution
  • Custom features
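A minimal ECOC sketch (my own illustration; the code matrix and the logistic-regression base learner are arbitrary choices, not from the talk): each column defines one binary meta-class problem, and a test point is decoded to the class with the nearest code word.

import numpy as np
from sklearn.linear_model import LogisticRegression

code = np.array([[0, 0, 1, 1, 0],   # code word for class 0
                 [0, 1, 0, 1, 1],   # class 1
                 [1, 0, 0, 0, 1],   # class 2
                 [1, 1, 1, 0, 0]])  # class 3

def ecoc_fit(X, y):
    # one binary classifier per code-matrix column (meta-class split)
    return [LogisticRegression().fit(X, code[y, j]) for j in range(code.shape[1])]

def ecoc_predict(models, X):
    bits = np.column_stack([m.predict(X) for m in models])
    # decode: class whose code word is closest in Hamming distance
    dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)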

29
Hierarchical Grouping of Classes
  • Top down: solve 3 coupled problems
  • Group classes into two meta-classes
  • Design a feature extractor tailored for the 2
    meta-classes (e.g. Fisher)
  • design the 2-metaclass classifier (Bayesian)
  • Solution using Deterministic Annealing
  • Softly associate each class with both partitions
  • Compute/update the most discriminating features
  • Update associations
  • For hard associations, also lower the temperature
  • Recurse
  • Fast convergence, computation at macro-level

30
Binary Hierarchical Classifier
  • Building the tree
  • Bottom-Up
  • Top-Down
  • Hard and soft variants
  • Provides valuable domain knowledge
  • Simplified feature extraction at each stage

31
The Future: Scaling to Large, Non-Stationary Datasets
  • Can build a knowledge base of
  • discriminating features
  • Typical class pairings
  • More amenable to changing mix of classes,
    changing class statistics
  • Integrate with semi-supervised learning methods

32
Re-visiting Mixtures of Experts (MoEs)
[Diagram: experts 1..K each map the input x to an output y_i(x); a gating network produces weights g_i(x); the combined output is Y(x) = Σ_i g_i(x) y_i(x). A forward-pass sketch follows the bullet below.]
  • Hierarchical versions possible
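A minimal forward-pass sketch of the combination Y(x) = Σ_i g_i(x) y_i(x) with softmax gating (my own illustration with linear experts; names and shapes are assumed):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_output(x, expert_W, expert_b, gate_W):
    # each expert i produces y_i(x); linear experts for illustration
    expert_outputs = np.array([W @ x + b for W, b in zip(expert_W, expert_b)])
    g = softmax(gate_W @ x)        # gating network: g_i(x), sums to 1
    return g @ expert_outputs      # Y(x) = sum_i g_i(x) * y_i(x)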

33
Beyond Mixtures of Experts
  • Problems with soft-max based gating network
  • Alternative: use normalized Gaussians
  • Structurally adaptive: add/delete experts
  • On-line learning versions
  • Hard vs. soft switching, error bars, etc.
  • Piaget's assimilation and accommodation
  • V. Ramamurti and J. Ghosh, "Structurally Adaptive
    Modular Networks for Non-Stationary
    Environments", IEEE Trans. Neural Networks,
    10(1), Jan 1999, pp. 152-60.

34
Generalizing MoE models
  • Mixtures of X
  • X = HMMs, factor models, trees, principal
    components
  • State dependent gating networks
  • Sequence classification
  • Mixture of Kalman Filters
  • Outperformed NASA's Magill filter bank!
  • W. S. Chaer, R. H. Bishop and J. Ghosh,
    "Hierarchical Adaptive Kalman Filtering for
    Interplanetary Orbit Determination", IEEE Trans.
    on Aerospace and Electronic Systems, 34(3), Aug
    1998, pp. 883-896. 

35
Some Directions for MCS
  • Extend to Multi-learner systems
  • Develop a Meta-theory based on data properties
  • - for Classification
  • Catering to changing statistics and changing
    questions (concept drift)
  • Maintaining explainability (cf. Breiman's
    constant)
  • Classification of sequences
  • Online evidence accumulation
  • distributed data mining and scalability issues
  • Active learning
  • Implications for feature selection
  • Computational aspects

36
Acknowledgements
  • Completed PhD theses
  • Alexander Strehl, (cluster ensembles), May 02
  • Shailesh Kumar, Modular Learning Through Output
    Space Transformations, 2000
  • Viswanath Ramamurti, Modular Networks, 1997  
  • Kurt D. Bollacker, A Supra-Classifier Framework
    for Knowledge Reuse, 1998  
  • Kagan Tumer, math analysis of ensembles, 1996
  • Ismail Taha, symbolic connectionist, 1997
  • Papers at http://www.lans.ece.utexas.edu