Title: Part II: Practical Implementations.
Slide 1: Part II: Practical Implementations
Slide 2: Modeling the Classes
Stochastic Discrimination
Slide 3: Algorithm for Training an SD Classifier
- Generate a projectable weak model
- Evaluate the model w.r.t. the training set; check enrichment
- Check uniformity w.r.t. the existing collection
- Add the model to the discriminant (see the sketch below)
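A minimal sketch of this loop in Python follows. It is an illustration under assumptions, not the tutorial's own code: weak models are taken to be random axis-aligned boxes in a 2D feature space, and the names (train_sd, min_enrichment) and thresholds are invented for the example.

import numpy as np

rng = np.random.default_rng(0)

def random_box(lo=0.0, hi=20.0):
    # Step 1: generate a projectable weak model (a random box keeps
    # neighboring points together, unlike an arbitrary point set).
    x0, x1 = np.sort(rng.uniform(lo, hi, 2))
    y0, y1 = np.sort(rng.uniform(lo, hi, 2))
    return x0, x1, y0, y1

def covers(model, pts):
    x0, x1, y0, y1 = model
    return ((pts[:, 0] >= x0) & (pts[:, 0] <= x1) &
            (pts[:, 1] >= y0) & (pts[:, 1] <= y1))

def train_sd(X1, X2, n_models=500, min_enrichment=0.05):
    collection = []
    counts = np.zeros(len(X1))          # class-1 coverage so far
    while len(collection) < n_models:
        m = random_box()
        in1, in2 = covers(m, X1), covers(m, X2)
        # Step 2: enrichment check against the training set.
        if in1.mean() - in2.mean() < min_enrichment:
            continue
        # Step 3: uniformity check against the existing collection:
        # the model must reach at least one under-covered point.
        if collection and not in1[counts <= counts.mean()].any():
            continue
        # Step 4: add the model to the discriminant.
        counts += in1
        collection.append(m)
    return collection

# Example use with two synthetic Gaussian classes.
models = train_sd(rng.normal(6, 2, (100, 2)), rng.normal(14, 2, (100, 2)))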
Slide 4: Dealing with Data Geometry (SD in Practice)
Slide 5: 2D Example
- Adapted from Kleinberg, PAMI, May 2000
Slide 6: An r = 1/2 random subset of the feature space, i.e., one that covers half of all the points
Slide 7: Watch how many such subsets cover a particular point, say, (2,17)
Slides 8-9: [Figure: a sequence of r = 1/2 random subsets, each marked "In" or "Out" for the tracked point, with the running fraction of covering models after each draw, e.g. 1/2 = 0.5, 2/3 ≈ 0.67, 3/4 = 0.75, 4/5 = 0.8, 5/6 ≈ 0.83, ..., 8/12 ≈ 0.67]
Slide 10: Fraction of r = 1/2 random subsets covering the point (2,17) as more such subsets are generated
Slide 11: Fractions of r = 1/2 random subsets covering several selected points as more such subsets are generated
Slide 12: Distribution of model coverage for all points in space, with 100 models
Slide 13: Distribution of model coverage for all points in space, with 200 models
Slide 14: Distribution of model coverage for all points in space, with 300 models
Slide 15: Distribution of model coverage for all points in space, with 400 models
Slide 16: Distribution of model coverage for all points in space, with 500 models
Slide 17: Distribution of model coverage for all points in space, with 1000 models
Slide 18: Distribution of model coverage for all points in space, with 2000 models
Slide 19: Distribution of model coverage for all points in space, with 5000 models
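The behavior plotted on these slides is easy to reproduce. The following simulation sketch is my own construction (the slides only show the plots): the space is a 20x20 grid, each weak model is a random subset covering exactly half of the points, and we print the coverage fraction of (2,17) together with its spread over all points.

import numpy as np

rng = np.random.default_rng(1)
pts = [(x, y) for x in range(20) for y in range(20)]
n = len(pts)
target = pts.index((2, 17))

counts = np.zeros(n)
for t in range(1, 5001):
    member = rng.permutation(n) < n // 2   # an r = 1/2 random subset
    counts += member
    if t in (100, 500, 1000, 5000):
        frac = counts / t
        print(f"{t:5d} models: coverage of (2,17) = {frac[target]:.3f}, "
              f"std over all points = {frac.std():.3f}")
# The coverage fraction approaches 0.5 and the spread shrinks: every
# point ends up in about half of the models as more models are drawn.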
Slide 20: Introducing Enrichment
- For any discrimination to happen, the models must have some difference in coverage for the different classes.
Slide 21: Enforcing enrichment (adding in a bias) requires each subset to cover more points of one class than of the other (see the sketch below)
[Figure: the class distribution and a biased (enriched) weak model]
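Here is a toy demonstration of the biasing step, assuming Gaussian classes and half-plane weak models (both my assumptions, chosen for brevity): candidate subsets are simply rejected until they cover a larger fraction of class 1 than of class 2.

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(5.0, 2.0, (200, 2))     # class 1 training points
X2 = rng.normal(12.0, 2.0, (200, 2))    # class 2 training points

def enriched_model():
    # Rejection sampling: keep only half-planes biased toward class 1.
    while True:
        w, b = rng.normal(size=2), rng.uniform(-20.0, 20.0)
        if (X1 @ w + b > 0).mean() > (X2 @ w + b > 0).mean():
            return w, b

models = [enriched_model() for _ in range(500)]
X = np.vstack([X1, X2])
# Y = fraction of enriched models covering each point.
Y = np.mean([(X @ w + b > 0) for w, b in models], axis=0)
print("mean Y, class 1:", Y[:200].mean())   # systematically higher
print("mean Y, class 2:", Y[200:].mean())   # systematically lower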
Slide 22: Distribution of model coverage for points in each class, with 100 enriched weak models
Slide 23: Distribution of model coverage for points in each class, with 200 enriched weak models
Slide 24: Distribution of model coverage for points in each class, with 300 enriched weak models
Slide 25: Distribution of model coverage for points in each class, with 400 enriched weak models
Slide 26: Distribution of model coverage for points in each class, with 500 enriched weak models
Slide 27: Distribution of model coverage for points in each class, with 1000 enriched weak models
Slide 28: Distribution of model coverage for points in each class, with 2000 enriched weak models
Slide 29: Distribution of model coverage for points in each class, with 5000 enriched weak models
Slide 30: Error rate decreases as the number of models increases
- Decision rule: if Y < 0.5 then class 2, else class 1 (see the sketch below)
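Continuing the enrichment sketch above (it reuses the coverage fractions Y computed there), the slide's decision rule is a single comparison:

import numpy as np

# Assign class 2 where Y < 0.5, class 1 otherwise, then measure the
# training error (labels follow the stacking order of X above).
labels = np.array([1] * 200 + [2] * 200)
predicted = np.where(Y < 0.5, 2, 1)
print("training error:", (predicted != labels).mean())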
Slide 31: Sparse Training Data
- Incomplete knowledge about class distributions
[Figure: training set vs. test set]
Slide 32: Distribution of model coverage for points in each class, with 100 enriched weak models (training and test sets)
Slide 33: Distribution of model coverage for points in each class, with 200 enriched weak models (training and test sets)
Slide 34: Distribution of model coverage for points in each class, with 300 enriched weak models (training and test sets)
Slide 35: Distribution of model coverage for points in each class, with 400 enriched weak models (training and test sets)
Slide 36: Distribution of model coverage for points in each class, with 500 enriched weak models (training and test sets)
Slide 37: Distribution of model coverage for points in each class, with 1000 enriched weak models (training and test sets)
Slide 38: Distribution of model coverage for points in each class, with 2000 enriched weak models (training and test sets)
Slide 39: Distribution of model coverage for points in each class, with 5000 enriched weak models (training and test sets). No discrimination on the test set!
Slide 40: Models of this type, when enriched for the training set, are not necessarily enriched for the test set
[Figure: training set vs. test set for a random model with 50% coverage of the space]
Slide 41: Introducing Projectability
- Maintain local continuity of class interpretations.
- Neighboring points of the same class should share similar model coverage.
Slide 42: Allow some local continuity in model membership, so that the interpretation of a training point can generalize to its immediate neighborhood (see the sketch below)
[Figure: the class distribution and a projectable model]
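To make the effect concrete, this small experiment (my own construction, not from the slides) contrasts a non-projectable model, which assigns membership to each grid point independently, with a projectable box-shaped model, by measuring how often a point and its neighbor receive the same membership:

import numpy as np

rng = np.random.default_rng(3)
# 20x20 grid, ordered so index i and i+20 are horizontal neighbors.
grid = np.array([(x, y) for x in range(20) for y in range(20)], float)

def pointwise_model():
    # Membership decided per point: no local continuity at all.
    return rng.random(len(grid)) < 0.5

def box_model():
    # Membership decided by a region: neighbors usually agree.
    x0, x1 = np.sort(rng.uniform(0, 20, 2))
    y0, y1 = np.sort(rng.uniform(0, 20, 2))
    return ((grid[:, 0] >= x0) & (grid[:, 0] <= x1) &
            (grid[:, 1] >= y0) & (grid[:, 1] <= y1))

def neighbor_agreement(make_model, trials=200):
    return np.mean([np.mean(m[:-20] == m[20:])
                    for m in (make_model() for _ in range(trials))])

print("point-set models:", neighbor_agreement(pointwise_model))  # about 0.5
print("box models:      ", neighbor_agreement(box_model))        # near 1.0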
Slide 43: Distribution of model coverage for points in each class, with 100 enriched, projectable weak models (training and test sets)
Slide 44: Distribution of model coverage for points in each class, with 300 enriched, projectable weak models (training and test sets)
Slide 45: Distribution of model coverage for points in each class, with 400 enriched, projectable weak models (training and test sets)
Slide 46: Distribution of model coverage for points in each class, with 500 enriched, projectable weak models (training and test sets)
Slide 47: Distribution of model coverage for points in each class, with 1000 enriched, projectable weak models (training and test sets)
Slide 48: Distribution of model coverage for points in each class, with 2000 enriched, projectable weak models (training and test sets)
Slide 49: Distribution of model coverage for points in each class, with 5000 enriched, projectable weak models (training and test sets)
Slide 50: Promoting Uniformity
- All points in the same class should have an equal likelihood of being covered by a model of each particular rating.
- Retain models that cover the points that are covered less by the current collection (see the sketch below).
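One simple acceptance rule that implements this idea is sketched below. It is an illustrative heuristic, not the exact algorithm from the tutorial: a candidate model is kept only if the points it covers are, on average, covered less often by the current collection than the class as a whole.

import numpy as np

rng = np.random.default_rng(4)
X1 = rng.uniform(0, 20, (300, 2))   # class-1 training points
counts = np.zeros(len(X1))          # coverage so far, per point
collection = []

while len(collection) < 1000:
    x0, x1 = np.sort(rng.uniform(0, 20, 2))
    y0, y1 = np.sort(rng.uniform(0, 20, 2))
    inside = ((X1[:, 0] >= x0) & (X1[:, 0] <= x1) &
              (X1[:, 1] >= y0) & (X1[:, 1] <= y1))
    if not inside.any():
        continue
    # Keep the model only if it targets under-covered points.
    if counts[inside].mean() <= counts.mean():
        counts += inside
        collection.append((x0, x1, y0, y1))

# The relative spread of per-point coverage stays small: all class-1
# points end up covered by roughly the same fraction of models.
print("coverage spread (std / mean):", counts.std() / counts.mean())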
Slide 51: Distribution of model coverage for points in each class, with 100 enriched, projectable, uniform weak models (training and test sets)
Slide 52: Distribution of model coverage for points in each class, with 1000 enriched, projectable, uniform weak models (training and test sets)
Slide 53: Distribution of model coverage for points in each class, with 5000 enriched, projectable, uniform weak models (training and test sets)
Slide 54: Distribution of model coverage for points in each class, with 10000 enriched, projectable, uniform weak models (training and test sets)
Slide 55: Distribution of model coverage for points in each class, with 50000 enriched, projectable, uniform weak models (training and test sets)
Slide 56: The 3 Necessary Conditions
- Enrichment → Discriminating Power
- Uniformity → Complementary Information
- Projectability → Generalization Power
Slide 57: Extensions and Comparisons
Slide 58: Alternative Discriminants
- Berlind, 1994
- Different discriminants for N-class problems
- Additional condition on symmetry
- Approximate uniformity
- Hierarchy of indiscernibility
Slide 59: Estimates of Classification Accuracies
- Chen, 1997
- Statistical estimates of classification accuracy under weaker conditions:
- Approximate uniformity
- Approximate indiscernibility
Slide 60: Multi-class Problems
- For n classes, define n discriminants Y_i, one for each class i vs. the others
- Classify an unknown point to the class i for which the computed Y_i is the largest (see the sketch below)
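As a sketch, the one-vs-rest rule on this slide reduces to an argmax over per-class coverage fractions. The names here (classify, covers) are placeholders of mine; models[i] is assumed to hold the weak models enriched for class i.

import numpy as np

def classify(q, models, covers):
    # models: one list of weak models per class, each enriched for
    # "class i vs. the rest"; covers(m, q) tests whether m contains q.
    Y = [np.mean([covers(m, q) for m in ms]) for ms in models]
    return int(np.argmax(Y))   # the class i with the largest Y_i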
Slide 61: Ho & Kleinberg, ICPR 1996
Slide 65: Open Problems
- Algorithms for uniformity enforcement
- Deterministic methods?
- Desirable form of weak models
- Fewer, more sophisticated classifiers?
- Other ways to address the 3-way trade-off
- Enrichment / Uniformity / Projectability
Slide 66: Random Decision Forest
- Ho, 1995, 1998
- A structured way to create models: fully split a tree, use the leaves as models
- Perfect enrichment and uniformity on the training set
- Promote projectability by subspace projection (see the sketch below)
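A compact re-implementation of the idea, under my own simplifications (scikit-learn's stock tree learner as the base model, binary 0/1 labels, plurality vote): each tree is fully split on a random subspace of the features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceForest:
    def __init__(self, n_trees=100, subspace_dim=None, seed=0):
        self.n_trees = n_trees
        self.subspace_dim = subspace_dim
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        d = self.subspace_dim or max(1, X.shape[1] // 2)
        self.trees = []
        for _ in range(self.n_trees):
            feats = self.rng.choice(X.shape[1], d, replace=False)
            # A fully split tree: its leaves are pure training regions,
            # i.e., perfectly enriched and uniform on the training set;
            # the random subspace projection promotes projectability.
            tree = DecisionTreeClassifier().fit(X[:, feats], y)
            self.trees.append((feats, tree))
        return self

    def predict(self, X):
        votes = np.array([t.predict(X[:, f]) for f, t in self.trees])
        # Plurality vote, assuming 0/1 labels for brevity.
        return (votes.mean(axis=0) > 0.5).astype(int)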
Slide 67: Compact Distribution Maps
- Ho & Baird, 1993, 1997
- Another structured way to create models
- Start with projectable models by coarse quantization of the feature value range
- Seek enrichment and uniformity
Slide 68: SD & Other Ensemble Methods
- Ensemble learning via boosting: a sequential way to promote uniformity of ensemble element coverage
- XCS (a genetic algorithm): a way to create, filter, and use stochastic models that are regions in feature space
Slide 69: XCS Classifier System
- Wilson, 1995
- Recent focus of the GA community
- Good performance
- Reinforcement Learning + Genetic Algorithms
- Model: a set of rules, e.g.
if (shape = square and number > 10) then class = red
if (shape = circle and number < 5) then class = yellow
[Diagram: input → set of rules → class; the rule set is updated by reinforcement learning from the environment's reward and searched by genetic algorithms]
Slide 70: Multiple Classifier Systems: Examples in Word Image Recognition
Slide 71: Complementary Strengths of Classifiers
[Table: rank of the true class out of a lexicon of 1091 words, by 10 classifiers, for 20 word images]
- The case for classifier combination:
- decision fusion
- mixture of experts
- committee decision making
Slide 72: Classifier Combination Methods
- Decision Optimization: find consensus among a given set of classifiers
- Coverage Optimization: create a set of classifiers that work best with a given decision combination function
Slide 73: Decision Optimization
- Develop classifiers with expert knowledge
- Try to make the best use of their decisions via majority/plurality vote, sum/product rule, probabilistic methods, Bayesian methods, or rank/confidence score combination (see the sketch below)
- The joint capability of the classifiers sets an intrinsic limit on the combined accuracy
- There is no way to handle the blind spots
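Two of the fixed combination functions named above, sketched for classifiers that each output a score per class (the input layout is my assumption):

import numpy as np

def sum_rule(scores):
    # scores: (n_classifiers, n_classes); add the scores, take argmax.
    return int(np.argmax(scores.sum(axis=0)))

def plurality_vote(scores):
    # Each classifier votes for its top class; the most-voted class wins.
    votes = np.argmax(scores, axis=1)
    return int(np.bincount(votes).argmax())

scores = np.array([[0.6, 0.4],
                   [0.2, 0.8],
                   [0.8, 0.2]])
print(sum_rule(scores), plurality_vote(scores))   # both pick class 0 here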
Slide 74: Difficulties in Decision Optimization
- Reliability versus overall accuracy
- Fixed or trainable combination function?
- Simple models or combinatorial estimates?
- How to model complementary behavior
Slide 75: Coverage Optimization
- Fix a decision combination function
- Generate classifiers automatically and systematically via:
- training-set sub-sampling (stacking, bagging, boosting; see the sketch below)
- subspace projection (RSM)
- superclass/subclass decomposition (ECOC)
- random perturbation of training processes, noise injection
- Need enough classifiers to cover all blind spots (how many are enough?)
- What else is critical?
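For concreteness, here is a minimal bagging sketch under a fixed plurality vote, with scikit-learn's decision tree standing in for an arbitrary base learner (my choice of learner and of 0/1 labels, not the slide's):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_classifiers=50, seed=0):
    # Coverage optimization by training-set sub-sampling: each classifier
    # sees a different bootstrap sample of the same training set.
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), len(X))
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def vote(ensemble, X):
    # The fixed decision combination function: plurality vote
    # (0/1 labels assumed for brevity).
    preds = np.array([c.predict(X) for c in ensemble])
    return (preds.mean(axis=0) > 0.5).astype(int)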
Slide 76: Difficulties in Coverage Optimization
- What kind of differences to introduce:
- Subsamples? Subspaces? Super/subclasses?
- Training parameters?
- Model geometry?
- The 3-way trade-off: discrimination / diversity / generalization
- Effects of the form of the component classifiers
Slide 77: Dilemmas and Paradoxes in Classifier Combination
- Weaken individuals for a stronger whole?
- Sacrifice known samples for unseen cases?
- Seek agreements or differences?
Slide 78: Stochastic Discrimination
- A mathematical theory that relates several key concepts in pattern recognition:
- Discriminative power ↔ enrichment
- Complementary information ↔ uniformity
- Generalization power ↔ projectability
- It offers a way to describe the complementary behavior of classifiers
- It offers guidelines for designing multiple classifier systems (classifier ensembles)