Title: Boosting and predictive modeling
1 Boosting and predictive modeling
- Yoav Freund
- Columbia University
2 What is data mining?
- Lots of data - complex models
- Classifying customers using transaction logs.
- Classifying events in high-energy physics experiments.
- Object detection in computer vision.
- Predicting gene regulatory networks.
- Predicting stock prices and portfolio management.
3 Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, 2001
- The data modeling culture (generative modeling)
- Assume a stochastic model (5-50 parameters).
- Estimate model parameters.
- Interpret model and make predictions.
- Estimated population: 98% of statisticians.
- The algorithmic modeling culture (predictive modeling)
- Assume the relationship between predictor variables and response variables has a functional form (10^2 - 10^6 parameters).
- Search (efficiently) for the best prediction function.
- Make predictions.
- Interpretation / causation: mostly an afterthought.
- Estimated population: 2% of statisticians (many in other fields).
4 Toy Example
- Computer receives telephone call
- Measures Pitch of voice
- Decides gender of caller
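The two cultures can be sketched on this toy problem. Below is a minimal comparison, assuming synthetic pitch data (the class means, spreads, and sample sizes are made-up illustrative values, not measurements from the talk):

```python
import random, statistics

random.seed(0)
# Synthetic voice-pitch samples in Hz (illustrative numbers only).
male = [random.gauss(120, 20) for _ in range(500)]
female = [random.gauss(210, 25) for _ in range(500)]

# Generative approach: fit a Gaussian per class, classify by higher likelihood.
mu_m, sd_m = statistics.mean(male), statistics.stdev(male)
mu_f, sd_f = statistics.mean(female), statistics.stdev(female)

def generative_classify(pitch):
    pm = statistics.NormalDist(mu_m, sd_m).pdf(pitch)
    pf = statistics.NormalDist(mu_f, sd_f).pdf(pitch)
    return "male" if pm > pf else "female"

# Discriminative approach: search directly for the threshold with lowest error,
# with no model of how the pitches are distributed.
def error_at(theta):
    errs = sum(p >= theta for p in male) + sum(p < theta for p in female)
    return errs / (len(male) + len(female))

best_theta = min(range(50, 300), key=error_at)
print(generative_classify(100), generative_classify(230), best_theta)
```

On well-behaved data like this the two approaches pick nearly the same decision boundary; the slides that follow show where they diverge.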
5 Generative modeling
(Figure: class distributions over voice pitch)
6 Discriminative approach
(Figure: decision threshold over voice pitch)
7 Ill-behaved data
(Figure: overlapping distributions over voice pitch)
8 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
9 Plan of talk
- Boosting: combining weak classifiers.
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
10 Batch learning for binary classification
11 A weighted training set
12 A weak learner
(Diagram: weighted training set → weak learner → weak rule h)
13 The boosting process
14 AdaBoost
Freund & Schapire, 1997
15 Main property of AdaBoost
- If the advantages of the weak rules over random guessing are g1, g2, ..., gT, then the training error of the final rule is at most exp(-2(g1^2 + g2^2 + ... + gT^2)).
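The AdaBoost loop and its training-error bound can be checked on a tiny example. The 1-D data set and the pool of threshold "stumps" below are illustrative; the update rule and the bound are those of Freund & Schapire:

```python
import math

# Tiny 1-D toy set: the positives form the interval [3, 5].
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [-1, -1, 1, 1, 1, -1, -1, -1]
n = len(X)

def stumps():
    # Pool of weak rules: "x < t -> s, else -s" over all mid-point thresholds.
    for t in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]:
        for s in (1, -1):
            yield lambda x, t=t, s=s: s if x < t else -s

w = [1.0 / n] * n        # start with uniform example weights
F = [0.0] * n            # combined score of the boosted rule on each example
bound = 1.0              # running product of Z_t = 2*sqrt(eps_t*(1-eps_t))

for _ in range(3):       # three rounds suffice on this toy set
    def weighted_error(h):
        return sum(wi for wi, x, y in zip(w, X, Y) if h(x) != y)
    h = min(stumps(), key=weighted_error)       # best weak rule this round
    eps = weighted_error(h)
    alpha = 0.5 * math.log((1 - eps) / eps)     # weight of the weak rule
    bound *= 2 * math.sqrt(eps * (1 - eps))     # training-error bound so far
    # Reweight: mistakes go up, correct examples go down, then renormalize.
    w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, X, Y)]
    z = sum(w)
    w = [wi / z for wi in w]
    F = [f + alpha * h(x) for f, x in zip(F, X)]

train_err = sum(f * y <= 0 for f, y in zip(F, Y)) / n
print(train_err)   # 0.0 -- the combined rule fits the interval exactly
```

No single stump can label an interval correctly, but three boosted stumps do, and the training error always stays at or below the running product `bound`.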
16 Boosting block diagram
17 AdaBoost as gradient descent
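The gradient-descent view can be verified numerically: AdaBoost's step size is the exact line-search minimizer of the exponential loss along the direction of the chosen weak rule. The weights and agreement pattern below are arbitrary illustrative values:

```python
import math

# AdaBoost's alpha = (1/2) ln((1 - eps)/eps) minimizes the exponential loss
#     Z(a) = sum_i w_i * exp(-a * y_i * h(x_i))
# where eps is the weighted error of weak rule h.
w = [0.4, 0.3, 0.2, 0.1]
agree = [1, 1, -1, 1]     # y_i * h(x_i): +1 where h is right on example i
eps = sum(wi for wi, ai in zip(w, agree) if ai == -1)   # weighted error: 0.2

def Z(a):
    return sum(wi * math.exp(-a * ai) for wi, ai in zip(w, agree))

alpha = 0.5 * math.log((1 - eps) / eps)
# Z at alpha is below Z at any nearby step size:
print(all(Z(alpha) <= Z(alpha + d) for d in (-0.1, -0.01, 0.01, 0.1)))   # True
```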
18 Plan of talk
- Boosting
- Alternating Decision Trees: a hybrid of boosting and decision trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
19 Decision Trees
(Figure: decision tree with splits X > 3 and Y > 5; leaves predict +1 or -1)
20 A decision tree as a sum of weak rules.
(Figure: the same tree rewritten as a sum of weak rules with real-valued weights such as -0.2, 0.1, 0.2, and -0.3)
21 An alternating decision tree
Freund & Mason, 1999
(Figure: alternating decision tree with prediction-node values such as -0.2 and 0.7)
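The defining operation of an alternating decision tree is scoring: F(x) is the root prediction plus the prediction values along every path consistent with x. A minimal evaluator, with a hypothetical two-node tree (the predicates and values are made up, not the tree from the slide):

```python
# Minimal ADT evaluator.  Each decision node holds a predicate, the two
# prediction values beneath it, and a reachability test for nested nodes.
def evaluate_adt(x, root_prediction, decision_nodes):
    score = root_prediction
    for predicate, yes_value, no_value, reachable in decision_nodes:
        if reachable(x):                      # path consistent with x?
            score += yes_value if predicate(x) else no_value
    return score

# Hypothetical tree: two decision nodes off the root, both always reachable.
nodes = [
    (lambda x: x["a"] > 3, 0.4, -0.3, lambda x: True),
    (lambda x: x["b"] > 5, 0.2, -0.1, lambda x: True),
]
score = evaluate_adt({"a": 4, "b": 2}, 0.5, nodes)
print(score)   # 0.5 + 0.4 - 0.1, about 0.8; sign(F) is the class, |F| the confidence
```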
22 Example: Medical Diagnostics
- Cleve dataset from the UC Irvine database.
- Heart-disease diagnostics (+1 = healthy, -1 = sick).
- 13 features from tests (real-valued and discrete).
- 303 instances.
23 AD-tree for heart-disease diagnostics
(score > 0: healthy; score < 0: sick)
24 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
25 The AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon & Cortes, 2000
- Distinguish business/residence customers from call-detail information (time of day, length of call, ...).
- 230M telephone numbers; label unknown for 30%.
- 260M calls / day.
- Required computer resources:
- Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
- Significant: calculating the classification for 70M customers.
- Negligible: learning (2 hours on 10K training examples on an off-line computer).
26 AD-tree for buisosity
27 AD-tree (detail)
28 Quantifiable results
- At 94% accuracy, coverage increased from 44% to 56%.
- Saved AT&T $15M in operations costs and missed opportunities in the year 2000.
29 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
30 The database bottleneck
- Physical limit: a disk seek takes 0.01 sec.
- The same time reads/writes 10^5 bytes.
- The same time performs 10^7 CPU operations.
- Commercial DBMSs are optimized for varying queries and transactions.
- Statistical analysis requires evaluation of fixed queries on massive data streams.
- Keeping disk I/O sequential is key.
- Data compression improves I/O speed but restricts random access.
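The slide's numbers make the sequential-vs-random gap concrete. A back-of-the-envelope calculation, where the per-record size is an assumption (the seek time, sequential rate, and call volume come from the slides):

```python
# One seek = 0.01 s; the same 0.01 s reads 1e5 sequential bytes (slide figures).
SEEK_S = 0.01
SEQ_BYTES_PER_S = 1e5 / 0.01       # 10^7 bytes/s sequential

records = 260e6                     # calls/day, from the slide
record_bytes = 100                  # assumed size of one call-detail record

random_access_s = records * SEEK_S                       # one seek per record
sequential_s = records * record_bytes / SEQ_BYTES_PER_S  # one linear scan

# Random access takes about 30 days; a sequential scan well under an hour.
print(random_access_s / 86400, sequential_s / 3600)
```

The three-orders-of-magnitude ratio is why systems like Hancock organize the computation around sequential passes over the call stream.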
31 CS theory regarding very large data-sets
- Massive datasets: you pay 1 per disk block you read/write, e per CPU operation; internal memory can store N disk blocks.
- Example problem: given a stream of line segments (in the plane), identify all segment pairs that intersect.
- Vitter, Motwani, Indyk, ...
- Property testing: you can only look at a small fraction of the data.
- Example problem: decide whether a given graph is bipartite by testing only a small fraction of the edges.
- Rubinfeld, Ron, Sudan, Goldreich, Goldwasser, ...
32 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
33 A very curious phenomenon
Boosting decision trees
Using < 10,000 training examples we fit > 2,000,000 parameters
34 Large margins
Thesis: large margins => reliable predictions.
Very similar to SVMs.
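The quantity behind this thesis is the normalized margin of the boosted vote. A minimal sketch; the vote weights below are made-up illustrative numbers:

```python
# Normalized margin of a boosted classifier on example (x, y):
#     margin = y * sum_t(alpha_t * h_t(x)) / sum_t(|alpha_t|),
# a number in [-1, 1].  Positive means correct; near 1 means a near-unanimous
# weighted vote, which is what the margin bounds tie to reliability.
def margin(y, votes):
    # votes: list of (alpha_t, h_t(x)) pairs with h_t(x) in {-1, +1}
    total = sum(a * h for a, h in votes)
    return y * total / sum(abs(a) for a, _ in votes)

votes = [(0.8, 1), (0.5, 1), (0.3, -1)]
print(margin(+1, votes))   # (0.8 + 0.5 - 0.3) / 1.6, about 0.625
```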
35 Experimental Evidence
36 Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998
H: a set of binary functions with VC-dimension d. For every weighted vote f over H and every margin threshold q > 0, with high probability over a training set of size m:
P[y f(x) <= 0] <= P_S[margin(x, y) <= q] + O~(sqrt(d / (m q^2)))
No dependence on the number of combined functions!
37 Idea of proof
38 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
39 A motivating example
(Figure: examples whose labels are uncertain, marked "?")
40 The algorithm
Freund, Mansour & Schapire, Annals of Statistics, August 2004
41 Suggested tuning
(Slide shows the suggested parameter tuning and the bound it yields.)
42 Confidence-rating block diagram
43 Summary of confidence-rated classifiers
- A frequentist explanation for the benefits of model averaging.
- Separates inherent uncertainty from uncertainty due to the finite training set.
- Computational hardness: unknown other than in a few special cases.
- Margins from boosting or SVMs can be used as an approximation.
- Many practical applications!
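Using the margin as the approximation suggested above gives a predictor that abstains in a dead zone around zero. A minimal sketch; the dead-zone width and the scores are illustrative, not the tuning from the paper:

```python
# Confidence-rated prediction via a margin dead zone: commit to a label only
# when the score is outside [-delta, +delta]; otherwise abstain.
def predict_with_abstain(score, delta=0.2):
    if score > delta:
        return +1
    if score < -delta:
        return -1
    return 0   # abstain: defer the uncertain region to another test or a human

scores = [0.9, 0.05, -0.4, -0.1, 0.3]
decisions = [predict_with_abstain(s) for s in scores]
coverage = sum(d != 0 for d in decisions) / len(decisions)
print(decisions, coverage)   # [1, 0, -1, 0, 1] 0.6
```

Widening delta raises accuracy on the covered examples while lowering coverage; that trade-off is the practical payoff in the applications that follow.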
44 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
45 Face detection: using confidence to save time
Viola & Jones, 1999
- Paul Viola and Mike Jones developed a face detector that works in real time (15 frames per second).
46 Image Features
Rectangle filters, similar to Haar wavelets.
Papageorgiou et al.
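What makes rectangle filters fast is the integral image: any rectangle sum costs four table lookups, regardless of size. A sketch on a toy 4x4 image (the image values and the feature layout are illustrative):

```python
# Integral image: ii[r][c] holds the sum of all pixels above and to the
# left of (r, c), with a zero-padded first row and column.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]

H, W = len(img), len(img[0])
ii = [[0] * (W + 1) for _ in range(H + 1)]
for r in range(H):
    for c in range(W):
        ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]

def rect_sum(top, left, h, w):
    # Sum over img[top:top+h][left:left+w] using only four lookups.
    return (ii[top + h][left + w] - ii[top][left + w]
            - ii[top + h][left] + ii[top][left])

# Two-rectangle feature: left half of the window minus right half.
feature = rect_sum(0, 0, 4, 2) - rect_sum(0, 2, 4, 2)
print(rect_sum(0, 0, 4, 4), feature)   # 136 -16
```

Building `ii` is one pass over the image; after that, every rectangle feature at every scale is constant-time.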
47 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set, with 1 false positive in 14,084. Not quite competitive...
(Figure: ROC curve for the 200-feature classifier)
48 Employing a cascade to minimize average detection time
The accurate detector combines 6000 simple features using AdaBoost.
In most boxes, only 8-9 features are calculated.
(Figure: early cascade stages use features 1-3, then features 4-10)
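The cascade idea can be sketched in a few lines: cheap early stages reject most windows, so the full 6000-feature test runs only on the rare survivors. The stage functions and thresholds below are illustrative stand-ins, not the learned Viola-Jones stages:

```python
# Cascade sketch: each stage either rejects the window (done, cheap) or
# passes it on; only windows surviving every stage are reported as faces.
def cascade(window, stages):
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early; later stages never run
    return True                   # survived the full cascade

# Toy "windows" are just scores in [0, 1]; higher means more face-like.
stages = [
    (lambda w: w, 0.2),           # stage 1: a few features' worth of work
    (lambda w: w, 0.5),           # stage 2: somewhat more work
    (lambda w: w, 0.9),           # final stage: the full expensive test
]
windows = [0.1, 0.3, 0.6, 0.95]
print([cascade(w, stages) for w in windows])   # [False, False, False, True]
```

Because background windows vastly outnumber faces, the expected cost per window is dominated by the first stage or two, which is what makes 15 frames per second feasible.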
49 Using confidence to avoid labeling
Levin, Viola & Freund, 2003
50 Image 1
51 Image 1 - difference from time average
52 Image 2
53 Image 2 - difference from time average
54 Co-training
Blum & Mitchell, 1998
(Diagram: a partially trained B/W-based classifier and a partially trained difference-based classifier label raw B/W and difference images of highways for each other.)
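The co-training loop itself is simple: each view's classifier hands its confident labels to the other. The learner, the 1-D "views", and the confidence threshold below are illustrative stand-ins, not the highway-image detectors from the slides:

```python
# Co-training sketch (Blum & Mitchell): two classifiers trained on two
# different views of the data label confident unlabeled examples for each other.
def co_train(labeled, unlabeled, fit, rounds, conf):
    # labeled: [((view_a, view_b), y)];  unlabeled: [(view_a, view_b)]
    pool = list(unlabeled)
    for _ in range(rounds):
        fa = fit([(a, y) for (a, b), y in labeled])   # view-A classifier
        fb = fit([(b, y) for (a, b), y in labeled])   # view-B classifier
        remaining = []
        for a, b in pool:
            ya, ca = fa(a)
            yb, cb = fb(b)
            if ca >= conf:
                labeled.append(((a, b), ya))   # A is confident: label for B
            elif cb >= conf:
                labeled.append(((a, b), yb))   # B is confident: label for A
            else:
                remaining.append((a, b))       # still too uncertain
        pool = remaining
    return labeled

def make_fit(pairs):
    # Toy learner: threshold halfway between the class means of a 1-D view;
    # confidence grows with distance from the threshold.
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == -1]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: (1 if x > t else -1, min(1.0, abs(x - t)))

labeled = [((0.0, 0.1), -1), ((1.0, 0.9), 1)]
unlabeled = [(0.9, 0.8), (0.1, 0.2)]
result = co_train(labeled, unlabeled, make_fit, rounds=2, conf=0.3)
print(len(result))   # 4: both unlabeled pairs acquired labels
```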
55 (No transcript)
56 Co-Training Results
57 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
58 Gene Regulation
- Regulatory proteins bind to the non-coding regulatory sequence of a gene to control the rate of transcription.
59 From mRNA to Protein
(Figure; label: nucleus wall)
60 Protein Transcription Factors
61 Genome-wide Expression Data
- Microarrays measure mRNA transcript expression levels for all of the 6000 yeast genes at once.
- Very noisy data.
- A rough time slice over all compartments of many cells.
- Protein expression is not observed.
62 Partial Parts List for Yeast
- Many known and putative:
- Transcription factors.
- Signaling molecules that activate transcription factors.
- Known and putative binding-site motifs.
- In yeast, the regulatory sequence is the 500 bp upstream region.
63 GeneClass Problem Formulation
M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie. Predicting Genetic Regulatory Response Using Classification. ISMB 2004.
- Predict target-gene regulatory response from regulator activity and binding-site data.
64 Role of quantization
By quantizing expression into three classes we reduce noise but maintain most of the signal.
Weighting +1/-1 examples linearly with expression level performs slightly better.
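The quantization step amounts to thresholding the log expression ratio into down / baseline / up. A minimal sketch; the cutoff value 0.5 is an assumption for illustration, not the paper's exact setting:

```python
# Quantize a log expression ratio into {-1, 0, +1}.
def quantize(log_ratio, cutoff=0.5):
    if log_ratio >= cutoff:
        return 1       # up-regulated
    if log_ratio <= -cutoff:
        return -1      # down-regulated
    return 0           # baseline: near-zero changes are treated as noise

print([quantize(v) for v in [1.2, 0.1, -0.7, 0.5, -0.2]])  # [1, 0, -1, 1, 0]
```

The middle band is what absorbs the measurement noise; only the clearly up- or down-regulated examples drive the classifier.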
65 Problem setup
- Data point: target gene x microarray.
- Input features:
- Parent state: {-1, 0, +1}.
- Motif presence: {0, 1}.
- Predicted output:
- Target gene: {-1, +1}.
66 Boosting with Alternating Decision Trees (ADTs)
- Use boosting to build a single ADT, a margin-based generalization of a decision tree.
Splitter node: Is motif MIG1 present AND parent XBP1 up?
Prediction node: F(x) is given by the sum of the prediction nodes along all paths consistent with x.
67 Statistical Validation
- 10-fold cross-validation experiments, 50,000 (gene, microarray) training examples.
- Significant correlation between prediction score and true log expression ratio on held-out data.
- Prediction accuracy on +1/-1 labels: 88.5%.
68 Biological interpretation: from correlation to causation
- Good prediction only implies correlation.
- To infer causation we need to integrate additional knowledge.
- Comparative case studies: train on similar conditions (stresses), test on related experiments.
- Extract significant features from the learned model:
- Iteration score (IS): the boosting iteration at which a feature first appears; identifies significant motifs and motif-parent pairs.
- Abundance score (AS): the number of nodes in the ADT containing a feature; identifies important regulators.
- In silico knock-outs: remove a significant regulator and retrain.
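Both scores read directly off the learned tree once it is flattened into node records. A minimal sketch; the node list below is made up for illustration, not the learned GeneClass tree:

```python
# An ADT flattened into (boosting_iteration, feature) node records.
nodes = [
    (1, "USV1"), (2, "STRE"), (3, "USV1"), (5, "PPT1"),
    (7, "USV1"), (8, "STRE"), (9, "GAC1"),
]

def iteration_score(feature):
    # IS: earliest boosting round using the feature (lower = stronger signal).
    return min(it for it, f in nodes if f == feature)

def abundance_score(feature):
    # AS: number of tree nodes using the feature (higher = more important).
    return sum(1 for _, f in nodes if f == feature)

print(iteration_score("USV1"), abundance_score("USV1"))   # 1 3
```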
69 Case Study: Heat Shock and Osmolarity
- Training set: heat shock, osmolarity, amino-acid starvation.
- Test set: stationary phase, simultaneous heat shock + osmolarity.
- Results:
- Test error 9.3%.
- Supports the Gasch hypothesis: the heat-shock and osmolarity pathways are independent and additive.
- High-scoring parents (AS): USV1 (stationary phase and heat shock), PPT1 (osmolarity response), GAC1 (response to heat).
70 Case Study: Heat Shock and Osmolarity
- Results:
- High-scoring binding sites (IS):
- MSN2/MSN4 STRE element.
- Heat-shock related: HSF1 and RAP1 binding sites.
- Osmolarity/glycerol pathways: CAT8, MIG1, GCN4.
- Amino-acid starvation: GCN4, CHA4, MET31.
- High-scoring motif-parent pair (IS):
- TPK1-STRE pair (a kinase that regulates MSN2 via cellular localization): an indirect effect.
(Figure legend: direct binding / indirect effect / co-occurrence)
71 Case Study: In silico knockout
- Training and test sets: same as the heat shock and osmolarity case study.
- Knockout: remove USV1 from the regulator list and retrain.
- Results:
- Test error 12% (up from 9.3%).
- Identify putative downstream targets of USV1: target genes that change from a correct to an incorrect label.
- GO annotation analysis reveals putative functions: nucleoside transport, cell-wall organization and biogenesis, heat-shock protein activity.
- Putative functions match those identified in the wet-lab USV1 knockout (Segal et al., 2003).
72 Conclusions: Gene Regulation
- A new predictive model for the study of gene regulation.
- The first gene-regulation model to make quantitative predictions.
- Uses actual expression levels; no clustering.
- Strong prediction accuracy on held-out experiments.
- Interpretable hypotheses: significant regulators, binding motifs, regulator-motif pairs.
- A new methodology for biological analysis: comparative training/test studies, in silico knockouts.
73 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
74 Summary
- Moving from density estimation to classification can make hard problems tractable.
- Boosting is an efficient and flexible method for constructing complex and accurate classifiers.
- I/O is the main bottleneck to data-mining; sampling, data localization, and parallelization help.
- Correlation -> causation is still a hard problem; it requires domain-specific expertise and integration of data sources.
75 Future work
- New applications:
- Bio-informatics.
- Vision / speech and signal processing.
- Information retrieval and information extraction.
- Theory:
- Improving the robustness of learning algorithms.
- Utilization of unlabeled examples in confidence-rated classification.
- Sequential experimental design.
- Relationships between learning algorithms and stochastic differential equations.
76 Extra
77 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- High-energy physics.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
78 Analysis for the MiniBooNE experiment
- Goal: to test for neutrino mass by searching for neutrino oscillations.
- Important because it may lead us to physics beyond the Standard Model.
- The BooNE project began in 1997.
- The first beam-induced neutrino events were detected in September 2002.
MiniBooNE detector (Fermilab)
79 MiniBooNE Classification Task
Ion Stancu, UC Riverside
80 (No transcript)
81 Results
82 Using confidence to reduce labeling
(Diagram: unlabeled data feeding a partially trained classifier)
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
83 Discriminative approach
(Figure: voice pitch)
84 Results from Yotam Abramson.
85 (No transcript)