1
Multiclassifier Systems: Back to the Future
  • Joydeep Ghosh
  • The University of Texas at Austin

2
Agenda
  • MCS at crossroads
  • Even part of SAS, ... but what's next?
  • Historical Tidbits
  • Selected (old) highlights
  • Themes worth re-visiting
  • Broadening the scope
  • Combining multiple clusterings
  • Knowledge transfer/reuse
  • Exploiting output space
  • Limits to performance, confidences, added
    classes,
  • Modular approaches revisited

3
Combining Votes/Ranks
  • Roots in French revolution?
  • Jean-Charles de Borda, 1781 (see the Borda count
    sketch after this list)
  • Condorcet's rule, 1785
  • Duncan Black (1958): Condorcet winner if one
    exists, else Borda
  • Condorcet's Jury Theorem, 1785
  • Social choice functions or group consensus
    functions
  • Arrow's impossibility theorem (1963)
  • Even 3 classes can be problematic
  • But these settings did not have a true class
  • Linear opinion pool: Laplace
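As a concrete illustration of rank combining, a minimal Borda-count sketch (my own example; the function and data names are assumed, not from the talk):

def borda_combine(rankings):
    """rankings: list of orderings of class labels, best first."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            # a class in position p receives n - 1 - p points
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    # the consensus winner is the class with the highest total score
    return max(scores, key=scores.get)

# three classifiers rank classes A, B, C; B wins the Borda count
print(borda_combine([["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]))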

4
Multi-class Winner-Take All
  • Selfridge's PANDEMONIUM (1958)
  • Ensembles of specialized demons
  • Hierarchy: data, computational and cognitive
    demons
  • Decision: pick the demon that shouts the loudest
  • Hill climbing: re-constituting useless demons,
    ...
  • Nilsson's Committee Machine (1965)
  • Pick max of C linear discriminant functions
  • g_i(X) = W_i^T X + w_i0
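A minimal sketch of the committee-machine decision rule, picking the largest of C linear discriminants (names and toy numbers are my own, not from the talk):

import numpy as np

def wta_predict(x, W, w0):
    """W: (C, d) weights, w0: (C,) biases; returns the winning class index."""
    scores = W @ x + w0            # g_i(X) = W_i^T X + w_i0 for each class i
    return int(np.argmax(scores))  # the demon that shouts the loudest

# toy example: 3 classes, 2 features
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, 0.0])
print(wta_predict(np.array([0.2, 0.9]), W, w0))  # -> 1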

5
Hybrid PR in the 70s and 80s
  • Theory: "No single model exists for all pattern
    recognition problems and no single technique is
    applicable to all problems. Rather what we have
    is a bag of tools and a bag of problems" (Kanal,
    1974)
  • Practice: multimodal inputs, multiple
    representations, ...
  • Syntactic, structural and statistical approaches
    (Bunke 86)
  • Multistage models
  • Progressively identify/reject subsets of classes
  • Invoke k-NN if the linear classifier is ambiguous
  • Combining multiple output types: 0/1, [0,1], ...
  • Designs were typically application-specific

6
Combining in Other Areas
  • PR/Vision: data/sensor/decision fusion
  • AI: evidence combination (Barnett 81)
  • Econometrics: combining estimators (Granger 89)
  • Engineering: non-linear control
  • Statistics: model-mix methods, Bayesian Model
    Averaging, ...
  • Software diversity
  • ...

7
Mid 90s: Competing vs. Cooperating Models
[Diagram: data, knowledge and sensors feed n classification/regression models (Model 1 ... Model n); a combiner fuses their outputs into the final decision, with confidence/ROC information and feedback paths]
  • Less diversity nowadays??

8
Motivation for Modular Networks
  • (Sharkey 97)
  • More interpretable localized models (Divide and
    conquer)
  • Incorporate prior knowledge
  • Better modeling of inverse problems,
    discontinuous maps, switched time series, ..
  • Future (localized) modifications
  • Neurobiological plausibility
  • Varieties
  • Cooperative, successive, supervisory,..
  • Automatic or explicit decomposition
  • Progress in MCS
  • Local selection (Woods et al 97)
  • Dynamic classifier selection (Giacinto & Roli, 00)

9
DARPA Sonar Transients Classification Program
(1989-)
  • J. Ghosh, S. Beck and L. Deuser, IEEE Journal of
    Oceanic Engineering, Vol. 17, No. 4, October 1992,
    pp. 351-363.

[Diagram: pre-processed data from the observed phenomenon yields multiple feature sets (FFT, Gabor wavelets, ..., Feature Set M), each feeding classifiers (MLP, RBF, ..., Classifier N), whose outputs are combined by average/median/...]
10
Ensembles: Insights and Lessons (Ho, MCS 2001)
  • Additional Observations
  • Coverage Optimization
  • Bagging/arcing/.. Most popular in machine
    learning and neural network communities!
  • sweet spot in training data set size
  • Decision Optimization
  • Usually simple averaging adequate (Kittler et al,
    96, 98); see the sketch after this list
  • Highly correlated outputs
  • Diversity from feature and classifier choices
    more effective than diversity from
    samples/training
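A minimal sketch of the average (sum) rule over classifier posteriors (toy numbers of my own, not from the cited papers):

import numpy as np

# posteriors[m] holds P(class | x) from classifier m for one test point
posteriors = np.array([[0.6, 0.3, 0.1],
                       [0.5, 0.4, 0.1],
                       [0.2, 0.7, 0.1]])
mean_posterior = posteriors.mean(axis=0)   # simple averaging
decision = int(np.argmax(mean_posterior))  # class with the highest averaged posterior
print(mean_posterior, decision)            # -> [0.433 0.467 0.1], class 1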

11
Cluster Ensembles
  • Given a set of provisional partitionings, we want
    to aggregate them into a single consensus
    partitioning, even without access to original
    features .

[Diagram: each clusterer produces individual cluster labels, which a consensus function combines into consensus labels]
12
History: Consensus classification
  • Barthélemy & Leclerc 1986, Neumann & Norton
    1986
  • Classifications included partitions,
    dendrograms, n-trees, ..
  • Basic assumption is that a partial ordering
    exists which induces a lattice
  • Strict consensus used today in phylogenetic tree
    estimation

13
Cluster Ensemble Problem
  • Let there be r clusterings λ(q), q = 1..r, with
    k(q) clusters each
  • What is the integrated clustering λ that
    optimally summarizes the r given clusterings
    using k clusters?

Much more difficult than classification ensembles: cluster labels are purely symbolic, so there is no direct correspondence between clusters from different clusterings
14
Application Scenarios
  • Improve quality and robustness
  • Reduce variance
  • Good results on a wide range of data using a
    diverse portfolio of algorithms
  • Knowledge reuse
  • Consolidate legacy clusterings where original
    object descriptions are no longer available
  • Distributed Clustering (one clusterer per node)
  • Only some features available per clusterer
  • Only some objects available per clusterer
  • Hybrids

15
Average Norm. Mutual Info. (ANMI)
  • Normalized mutual information between clusterings
    a, b (see the sketch after this list)
  • Other normalizations, e.g. using geometric mean,
    possible
  • Proposed: optimal consensus clustering with k
    clusters maximizes the ANMI
  • Empirical validation
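A minimal sketch of NMI and ANMI (my own illustration; it uses the geometric-mean normalization sqrt(H(a)H(b)) and assumes 0-based integer labels, so it may differ in detail from the slide's exact formula). The proposed consensus clustering is the k-cluster labeling that maximizes the ANMI against the r given clusterings.

import numpy as np

def nmi(a, b):
    """Mutual information of labelings a, b, normalized by sqrt(H(a) * H(b))."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1.0 / n                     # joint label distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)  # marginals
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def anmi(candidate, clusterings):
    """Average NMI of a candidate consensus labeling against the r given clusterings."""
    return float(np.mean([nmi(candidate, lam) for lam in clusterings]))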

16
Designing a Consensus Function
  • Direct optimization impractical
  • Three efficient heuristics
  • Cluster-based Similarity Partitioning Alg. (CSPA):
    O(n² k r); see the sketch after this list
  • HyperGraph Partitioning Alg. (HGPA): O(n k r)
  • Meta-Clustering Alg. (MCLA): O(n k² r²)
  • All 3 exploit a hypergraph representation of the
    sets of cluster labels (input to consensus
    function)
  • See AAAI 2002 paper for details.
  • Supra-consensus function performs all three and
    picks the one with highest ANMI (fully
    unsupervised)
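A minimal sketch of the CSPA idea (my own stand-in: the original partitions the induced similarity graph with METIS, here sklearn's precomputed-affinity spectral clustering stands in for that step, and 0-based integer labels are assumed):

import numpy as np
from sklearn.cluster import SpectralClustering

def cspa(clusterings, k):
    """clusterings: (r, n) array of cluster labels; returns n consensus labels."""
    clusterings = np.asarray(clusterings)
    r, n = clusterings.shape
    # S[i, j] = fraction of the r clusterings that put objects i and j together
    S = np.zeros((n, n))
    for labels in clusterings:
        S += (labels[:, None] == labels[None, :])
    S /= r
    # re-partition the objects using the co-association similarity
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(S)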

17
Hypergraph Representation
  • One hyperedge per cluster
  • Example (a toy incidence matrix is sketched below)
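A toy illustration of the representation (my own example, not the slide's): each cluster becomes one binary hyperedge over the objects.

import numpy as np

labelings = np.array([[0, 0, 1, 1],    # clusterer 1: {x1, x2}, {x3, x4}
                      [0, 1, 1, 1]])   # clusterer 2: {x1}, {x2, x3, x4}
hyperedges = []
for labels in labelings:
    for c in np.unique(labels):
        hyperedges.append((labels == c).astype(int))  # membership indicator
H = np.column_stack(hyperedges)
print(H)  # rows = objects, columns = hyperedges from all clusterings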

18
Applications and Experiments
  • Data-sets
  • 2-dimensional bi-modal Gaussian simulated
    data (k=2, d=2, n=1000)
  • 5 Gaussians in 8 dimensions (k=5, d=8, n=1000)
  • Pen digit data (k=3, d=4, n=7494)
  • Yahoo news web-document data (k=40, d=2903,
    n=2340)
  • Application setups
  • Robust Consensus Clustering (RCC)
  • Feature Distributed Clustering (FDC)

19
Robust Consensus Clustering (RCC)
  • Goal: Create an auto-focus clusterer that works
    for a wide variety of data-sets
  • Diverse portfolio of 10 approaches
  • SOM, HGP
  • GP (Eucl, Corr, Cosi, XJac)
  • KM (Eucl, Corr, Cosi, XJac)
  • Each approach is run on the same subsample of the
    data and the 10 clusterings combined using our
    supra-consensus function
  • Evaluation: increase in NMI of the supra-consensus
    results over random clustering

20
Robustness Summary
  • Avg. quality versus ensemble quality
  • For several sample sizes n (50, 100, 200, 400, 800)
  • 10-fold exp.
  • 1 standard deviation bars

21
Feature-Distributed Clustering (FDC)
  • Federated cluster analysis with partial feature
    views
  • Experimental scenario
  • Portfolio of r clusterers receiving random subset
    of features for all objects
  • Approach
  • identical individual clustering algorithm (graph
    partitioning) and same k
  • Use supra-consensus function for combining
  • Evaluation
  • NMI of consensus with category labels

22
FDC Example
  • Data: 5 Gaussians in 8 dimensions
  • Experiment: 5 clusterings in 2-dimensional
    subspaces
  • Result: avg. individual NMI 0.70, best individual
    0.77, ensemble 0.99

23
Experimental Results: FDC
  • Reference clustering and consensus clustering
  • Ensemble always equal or better than individual
  • More than double the avg. individual quality in
    YAHOO!

24
Remarks
  • Cluster ensembles
  • Improve quality and robustness
  • Enable knowledge reuse
  • Work with distributed data
  • Are as yet largely unexplored
  • Future work
  • Soft input/output clusterings
  • What if (some) features are known?
  • Bioinformatics
  • Papers, data, demos and code at http://strehl.com/

25
Solving Related Classification Problems
  • Real-world problems are often not isolated
  • History
  • Compound decision theory (Abend, 68)
  • 90s: Life-long learning, learning to learn, ...
    (Pratt, Thrun, ...)

26
Knowledge transfer or reuse
  • Leveraging a set of previously existing solutions
    for (possibly) related problems
  • Scarce new data → leverage prior knowledge

[Diagram: existing SUPPORT classifiers (e.g., Orange vs. Apple), defined over features such as size, color and shape, provide knowledge transfer to the new TARGET task (Grapefruit vs. Pear)]
27
Supra-Classifier Architecture
  • K. Bollacker and J. Ghosh, "Knowledge reuse in
    multiclassifier systems", Pattern Recognition
    Letters, 18(11-13), Nov 1997, pp. 1385-1390.

28
Output Space Decomposition
  • History
  • Pandemonium, committee machine
  • 1 class vs. all others
  • Pairwise classification (how to combine?)
  • Limited
  • Application specific solutions (80s)
  • Error correcting output coding (Dietterich &
    Bakiri, 95); see the sketch after this list
  • +ve: # of meta-classifiers can be less; can
    tailor features
  • -ve: groupings may be forced
  • Desired: a general framework for natural grouping
    of classes
  • Hierarchical with variable resolution
  • Custom features
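A minimal ECOC sketch (my own illustration; the code matrix and the logistic-regression base learner are arbitrary choices, not from the talk): each column defines one binary meta-class problem, and a test point is decoded to the class with the nearest code word.

import numpy as np
from sklearn.linear_model import LogisticRegression

code = np.array([[0, 0, 1, 1, 0],   # code word for class 0
                 [0, 1, 0, 1, 1],   # class 1
                 [1, 0, 0, 0, 1],   # class 2
                 [1, 1, 1, 0, 0]])  # class 3

def ecoc_fit(X, y):
    # one binary classifier per code-matrix column (meta-class split)
    return [LogisticRegression().fit(X, code[y, j]) for j in range(code.shape[1])]

def ecoc_predict(models, X):
    bits = np.column_stack([m.predict(X) for m in models])
    # decode: class whose code word is closest in Hamming distance
    dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)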

29
Hierarchical Grouping of Classes
  • Top down: solve 3 coupled problems
  • Group classes into two meta-classes
  • Design a feature extractor tailored for the 2
    meta-classes (e.g. Fisher)
  • design the 2-metaclass classifier (Bayesian)
  • Solution using Deterministic Annealing
  • Softly associate each class with both partitions
  • Compute/update the most discriminating features
  • Update associations
  • For hard associations, also lower the temperature
  • Recurse
  • Fast convergence, computation at macro-level

30
Binary Hierarchical Classifier
  • Building the tree
  • Bottom-Up
  • Top-Down
  • Hard and soft variants
  • Provides valuable domain knowledge
  • Simplified feature extraction at each stage

31
The Future: Scaling to Large, Non-Stationary Datasets
  • Can build a knowledge base of
  • discriminating features
  • Typical class pairings
  • More amenable to changing mix of classes,
    changing class statistics
  • Integrate with semi-supervised learning methods

32
Re-visiting Mixtures of Experts (MoEs)
[Diagram: experts 1..K each map the input x to an output y_i(x); a gating network produces weights g_i(x); the combined output is Y(x) = Σ_i g_i(x) y_i(x). A forward-pass sketch follows the bullet below.]
  • Hierarchical versions possible
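A minimal forward-pass sketch of the combination Y(x) = Σ_i g_i(x) y_i(x) with softmax gating (my own illustration with linear experts; names and shapes are assumed):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_output(x, expert_W, expert_b, gate_W):
    # each expert i produces y_i(x); linear experts for illustration
    expert_outputs = np.array([W @ x + b for W, b in zip(expert_W, expert_b)])
    g = softmax(gate_W @ x)        # gating network: g_i(x), sums to 1
    return g @ expert_outputs      # Y(x) = sum_i g_i(x) * y_i(x)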

33
Beyond Mixtures of Experts
  • Problems with soft-max based gating network
  • Alternative: use normalized Gaussians
  • Structurally adaptive: add/delete experts
  • On-line learning versions
  • Hard vs. soft switching, error bars, etc.
  • Piaget's assimilation and accommodation
  • V. Ramamurti and J. Ghosh, "Structurally Adaptive
    Modular Networks for Non-Stationary
    Environments", IEEE Trans. Neural Networks,
    10(1), Jan 1999, pp. 152-60.

34
Generalizing MoE models
  • Mixtures of X
  • X = HMMs, factor models, trees, principal
    components
  • State dependent gating networks
  • Sequence classification
  • Mixture of Kalman Filters
  • Outperformed NASA's Magill filter bank!
  • W. S. Chaer, R. H. Bishop and J. Ghosh,
    "Hierarchical Adaptive Kalman Filtering for
    Interplanetary Orbit Determination", IEEE Trans.
    on Aerospace and Electronic Systems, 34(3), Aug
    1998, pp. 883-896. 

35
Some Directions for MCS
  • Extend to Multi-learner systems
  • Develop a Meta-theory based on data properties
  • - for Classification
  • Catering to changing statistics and changing
    questions (concept drift)
  • Maintaining explainability (cf. Breiman's
    constant)
  • Classification of sequences
  • Online evidence accumulation
  • distributed data mining and scalability issues
  • Active learning
  • Implications for feature selection
  • Computational aspects

36
Acknowledgements
  • Completed PhD theses
  • Alexander Strehl, (cluster ensembles), May 02
  • Shailesh Kumar, Modular Learning Through Output
    Space Transformations, 2000
  • Viswanath Ramamurti, Modular Networks, 1997  
  • Kurt D. Bollacker, A Supra-Classifier Framework
    for Knowledge Reuse, 1998  
  • Kagan Tumer, math analysis of ensembles, 1996
  • Ismail Taha, symbolic connectionist, 1997
  • Papers at http://www.lans.ece.utexas.edu