Title: Boosting and predictive modeling
1 Boosting and predictive modeling
- Yoav Freund
- Columbia University
2 What is data mining?
- Lots of data - complex models
- Classifying customers using transaction logs.
- Classifying events in high-energy physics experiments.
- Object detection in computer vision.
- Predicting gene regulatory networks.
- Predicting stock prices and portfolio management.
3 Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, 2001
- The data modeling culture (generative modeling)
- Assume a stochastic model (5-50 parameters).
- Estimate model parameters.
- Interpret model and make predictions.
- Estimated population: 98% of statisticians.
- The algorithmic modeling culture (predictive modeling)
- Assume the relationship between predictor variables and response variables has a functional form (10^2 - 10^6 parameters).
- Search (efficiently) for the best prediction function.
- Make predictions.
- Interpretation / causation: mostly an afterthought.
- Estimated population: 2% of statisticians (many in other fields).
4 Toy Example
- Computer receives telephone call
- Measures Pitch of voice
- Decides gender of caller
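The two cultures can be sketched on this toy problem. Below is a minimal comparison, assuming synthetic pitch data (the class means, spreads, and sample sizes are made-up illustrative values, not measurements from the talk):

```python
import random, statistics

random.seed(0)
# Synthetic voice-pitch samples in Hz (illustrative numbers only).
male = [random.gauss(120, 20) for _ in range(500)]
female = [random.gauss(210, 25) for _ in range(500)]

# Generative approach: fit a Gaussian per class, classify by higher likelihood.
mu_m, sd_m = statistics.mean(male), statistics.stdev(male)
mu_f, sd_f = statistics.mean(female), statistics.stdev(female)

def generative_classify(pitch):
    pm = statistics.NormalDist(mu_m, sd_m).pdf(pitch)
    pf = statistics.NormalDist(mu_f, sd_f).pdf(pitch)
    return "male" if pm > pf else "female"

# Discriminative approach: search directly for the threshold with lowest error,
# with no model of how the pitches are distributed.
def error_at(theta):
    errs = sum(p >= theta for p in male) + sum(p < theta for p in female)
    return errs / (len(male) + len(female))

best_theta = min(range(50, 300), key=error_at)
print(generative_classify(100), generative_classify(230), best_theta)
```

On well-behaved data like this the two approaches pick nearly the same decision boundary; the slides that follow show where they diverge.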
5 Generative modeling
(Figure: class distributions over voice pitch)
6 Discriminative approach
(Figure: decision threshold over voice pitch)
7 Ill-behaved data
(Figure: overlapping distributions over voice pitch)
8 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
9 Plan of talk
- Boosting: combining weak classifiers.
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
10 Batch learning for binary classification
11 A weighted training set
12 A weak learner
(Diagram: weighted training set → weak learner → weak rule h)
13 The boosting process
14 AdaBoost
Freund & Schapire, 1997
15 Main property of AdaBoost
- If the advantages of the weak rules over random guessing are g1, g2, ..., gT, then the training error of the final rule is at most exp(-2(g1^2 + g2^2 + ... + gT^2)).
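The AdaBoost loop and its training-error bound can be checked on a tiny example. The 1-D data set and the pool of threshold "stumps" below are illustrative; the update rule and the bound are those of Freund & Schapire:

```python
import math

# Tiny 1-D toy set: the positives form the interval [3, 5].
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [-1, -1, 1, 1, 1, -1, -1, -1]
n = len(X)

def stumps():
    # Pool of weak rules: "x < t -> s, else -s" over all mid-point thresholds.
    for t in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]:
        for s in (1, -1):
            yield lambda x, t=t, s=s: s if x < t else -s

w = [1.0 / n] * n        # start with uniform example weights
F = [0.0] * n            # combined score of the boosted rule on each example
bound = 1.0              # running product of Z_t = 2*sqrt(eps_t*(1-eps_t))

for _ in range(3):       # three rounds suffice on this toy set
    def weighted_error(h):
        return sum(wi for wi, x, y in zip(w, X, Y) if h(x) != y)
    h = min(stumps(), key=weighted_error)       # best weak rule this round
    eps = weighted_error(h)
    alpha = 0.5 * math.log((1 - eps) / eps)     # weight of the weak rule
    bound *= 2 * math.sqrt(eps * (1 - eps))     # training-error bound so far
    # Reweight: mistakes go up, correct examples go down, then renormalize.
    w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, X, Y)]
    z = sum(w)
    w = [wi / z for wi in w]
    F = [f + alpha * h(x) for f, x in zip(F, X)]

train_err = sum(f * y <= 0 for f, y in zip(F, Y)) / n
print(train_err)   # 0.0 -- the combined rule fits the interval exactly
```

No single stump can label an interval correctly, but three boosted stumps do, and the training error always stays at or below the running product `bound`.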
16 Boosting block diagram
17 AdaBoost as gradient descent
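The gradient-descent view can be verified numerically: AdaBoost's step size is the exact line-search minimizer of the exponential loss along the direction of the chosen weak rule. The weights and agreement pattern below are arbitrary illustrative values:

```python
import math

# AdaBoost's alpha = (1/2) ln((1 - eps)/eps) minimizes the exponential loss
#     Z(a) = sum_i w_i * exp(-a * y_i * h(x_i))
# where eps is the weighted error of weak rule h.
w = [0.4, 0.3, 0.2, 0.1]
agree = [1, 1, -1, 1]     # y_i * h(x_i): +1 where h is right on example i
eps = sum(wi for wi, ai in zip(w, agree) if ai == -1)   # weighted error: 0.2

def Z(a):
    return sum(wi * math.exp(-a * ai) for wi, ai in zip(w, agree))

alpha = 0.5 * math.log((1 - eps) / eps)
# Z at alpha is below Z at any nearby step size:
print(all(Z(alpha) <= Z(alpha + d) for d in (-0.1, -0.01, 0.01, 0.1)))   # True
```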
18 Plan of talk
- Boosting
- Alternating Decision Trees: a hybrid of boosting and decision trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
19 Decision Trees
(Figure: decision tree with splits X > 3 and Y > 5; leaves predict +1 or -1)
20 A decision tree as a sum of weak rules.
(Figure: the same tree rewritten as a sum of weak rules with real-valued weights such as -0.2, 0.1, 0.2, and -0.3)
21 An alternating decision tree
Freund & Mason, 1999
(Figure: alternating decision tree with prediction-node values such as -0.2 and 0.7)
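The defining operation of an alternating decision tree is scoring: F(x) is the root prediction plus the prediction values along every path consistent with x. A minimal evaluator, with a hypothetical two-node tree (the predicates and values are made up, not the tree from the slide):

```python
# Minimal ADT evaluator.  Each decision node holds a predicate, the two
# prediction values beneath it, and a reachability test for nested nodes.
def evaluate_adt(x, root_prediction, decision_nodes):
    score = root_prediction
    for predicate, yes_value, no_value, reachable in decision_nodes:
        if reachable(x):                      # path consistent with x?
            score += yes_value if predicate(x) else no_value
    return score

# Hypothetical tree: two decision nodes off the root, both always reachable.
nodes = [
    (lambda x: x["a"] > 3, 0.4, -0.3, lambda x: True),
    (lambda x: x["b"] > 5, 0.2, -0.1, lambda x: True),
]
score = evaluate_adt({"a": 4, "b": 2}, 0.5, nodes)
print(score)   # 0.5 + 0.4 - 0.1, about 0.8; sign(F) is the class, |F| the confidence
```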
22 Example: Medical Diagnostics
- Cleve dataset from the UC Irvine database.
- Heart-disease diagnostics (+1 = healthy, -1 = sick).
- 13 features from tests (real-valued and discrete).
- 303 instances.
23 AD-tree for heart-disease diagnostics
(score > 0: healthy; score < 0: sick)
24 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
25 The AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon & Cortes, 2000
- Distinguish business/residence customers from call-detail information (time of day, length of call, ...).
- 230M telephone numbers; label unknown for 30%.
- 260M calls / day.
- Required computer resources:
- Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
- Significant: calculating the classification for 70M customers.
- Negligible: learning (2 hours on 10K training examples on an off-line computer).
26 AD-tree for buisosity
27 AD-tree (detail)
28 Quantifiable results
- At 94% accuracy, coverage increased from 44% to 56%.
- Saved AT&T $15M in operations costs and missed opportunities in the year 2000.
29 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
30 The database bottleneck
- Physical limit: a disk seek takes 0.01 sec.
- The same time reads/writes 10^5 bytes.
- The same time performs 10^7 CPU operations.
- Commercial DBMSs are optimized for varying queries and transactions.
- Statistical analysis requires evaluation of fixed queries on massive data streams.
- Keeping disk I/O sequential is key.
- Data compression improves I/O speed but restricts random access.
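The slide's numbers make the sequential-vs-random gap concrete. A back-of-the-envelope calculation, where the per-record size is an assumption (the seek time, sequential rate, and call volume come from the slides):

```python
# One seek = 0.01 s; the same 0.01 s reads 1e5 sequential bytes (slide figures).
SEEK_S = 0.01
SEQ_BYTES_PER_S = 1e5 / 0.01       # 10^7 bytes/s sequential

records = 260e6                     # calls/day, from the slide
record_bytes = 100                  # assumed size of one call-detail record

random_access_s = records * SEEK_S                       # one seek per record
sequential_s = records * record_bytes / SEQ_BYTES_PER_S  # one linear scan

# Random access takes about 30 days; a sequential scan well under an hour.
print(random_access_s / 86400, sequential_s / 3600)
```

The three-orders-of-magnitude ratio is why systems like Hancock organize the computation around sequential passes over the call stream.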
31 CS theory regarding very large data-sets
- Massive datasets: you pay 1 per disk block you read/write, e per CPU operation; internal memory can store N disk blocks.
- Example problem: given a stream of line segments (in the plane), identify all segment pairs that intersect.
- Vitter, Motwani, Indyk, ...
- Property testing: you can only look at a small fraction of the data.
- Example problem: decide whether a given graph is bipartite by testing only a small fraction of the edges.
- Rubinfeld, Ron, Sudan, Goldreich, Goldwasser, ...
32 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
33 A very curious phenomenon
Boosting decision trees
Using < 10,000 training examples we fit > 2,000,000 parameters
34 Large margins
Thesis: large margins => reliable predictions.
Very similar to SVMs.
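The quantity behind this thesis is the normalized margin of the boosted vote. A minimal sketch; the vote weights below are made-up illustrative numbers:

```python
# Normalized margin of a boosted classifier on example (x, y):
#     margin = y * sum_t(alpha_t * h_t(x)) / sum_t(|alpha_t|),
# a number in [-1, 1].  Positive means correct; near 1 means a near-unanimous
# weighted vote, which is what the margin bounds tie to reliability.
def margin(y, votes):
    # votes: list of (alpha_t, h_t(x)) pairs with h_t(x) in {-1, +1}
    total = sum(a * h for a, h in votes)
    return y * total / sum(abs(a) for a, _ in votes)

votes = [(0.8, 1), (0.5, 1), (0.3, -1)]
print(margin(+1, votes))   # (0.8 + 0.5 - 0.3) / 1.6, about 0.625
```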
35 Experimental Evidence
36 Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998
H: a set of binary functions with VC-dimension d. For every weighted vote f over H and every margin threshold q > 0, with high probability over a training set of size m:
P[y f(x) <= 0] <= P_S[margin(x, y) <= q] + O~(sqrt(d / (m q^2)))
No dependence on the number of combined functions!
37 Idea of proof
38 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
39 A motivating example
(Figure: examples whose labels are uncertain, marked "?")
40 The algorithm
Freund, Mansour & Schapire, Annals of Statistics, August 2004
41 Suggested tuning
(Slide shows the suggested parameter tuning and the bound it yields.)
42 Confidence-rating block diagram
43 Summary of confidence-rated classifiers
- A frequentist explanation for the benefits of model averaging.
- Separates inherent uncertainty from uncertainty due to the finite training set.
- Computational hardness: unknown other than in a few special cases.
- Margins from boosting or SVMs can be used as an approximation.
- Many practical applications!
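Using the margin as the approximation suggested above gives a predictor that abstains in a dead zone around zero. A minimal sketch; the dead-zone width and the scores are illustrative, not the tuning from the paper:

```python
# Confidence-rated prediction via a margin dead zone: commit to a label only
# when the score is outside [-delta, +delta]; otherwise abstain.
def predict_with_abstain(score, delta=0.2):
    if score > delta:
        return +1
    if score < -delta:
        return -1
    return 0   # abstain: defer the uncertain region to another test or a human

scores = [0.9, 0.05, -0.4, -0.1, 0.3]
decisions = [predict_with_abstain(s) for s in scores]
coverage = sum(d != 0 for d in decisions) / len(decisions)
print(decisions, coverage)   # [1, 0, -1, 0, 1] 0.6
```

Widening delta raises accuracy on the covered examples while lowering coverage; that trade-off is the practical payoff in the applications that follow.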
44 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
45 Face detection: using confidence to save time
Viola & Jones, 1999
- Paul Viola and Mike Jones developed a face detector that works in real time (15 frames per second).
46 Image Features
Rectangle filters, similar to Haar wavelets.
Papageorgiou et al.
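What makes rectangle filters fast is the integral image: any rectangle sum costs four table lookups, regardless of size. A sketch on a toy 4x4 image (the image values and the feature layout are illustrative):

```python
# Integral image: ii[r][c] holds the sum of all pixels above and to the
# left of (r, c), with a zero-padded first row and column.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]

H, W = len(img), len(img[0])
ii = [[0] * (W + 1) for _ in range(H + 1)]
for r in range(H):
    for c in range(W):
        ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]

def rect_sum(top, left, h, w):
    # Sum over img[top:top+h][left:left+w] using only four lookups.
    return (ii[top + h][left + w] - ii[top][left + w]
            - ii[top + h][left] + ii[top][left])

# Two-rectangle feature: left half of the window minus right half.
feature = rect_sum(0, 0, 4, 2) - rect_sum(0, 2, 4, 2)
print(rect_sum(0, 0, 4, 4), feature)   # 136 -16
```

Building `ii` is one pass over the image; after that, every rectangle feature at every scale is constant-time.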
47 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set, with 1 false positive in 14,084. Not quite competitive...
(Figure: ROC curve for the 200-feature classifier)
48 Employing a cascade to minimize average detection time
The accurate detector combines 6000 simple features using AdaBoost.
In most boxes, only 8-9 features are calculated.
(Figure: early cascade stages use features 1-3, then features 4-10)
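The cascade idea can be sketched in a few lines: cheap early stages reject most windows, so the full 6000-feature test runs only on the rare survivors. The stage functions and thresholds below are illustrative stand-ins, not the learned Viola-Jones stages:

```python
# Cascade sketch: each stage either rejects the window (done, cheap) or
# passes it on; only windows surviving every stage are reported as faces.
def cascade(window, stages):
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early; later stages never run
    return True                   # survived the full cascade

# Toy "windows" are just scores in [0, 1]; higher means more face-like.
stages = [
    (lambda w: w, 0.2),           # stage 1: a few features' worth of work
    (lambda w: w, 0.5),           # stage 2: somewhat more work
    (lambda w: w, 0.9),           # final stage: the full expensive test
]
windows = [0.1, 0.3, 0.6, 0.95]
print([cascade(w, stages) for w in windows])   # [False, False, False, True]
```

Because background windows vastly outnumber faces, the expected cost per window is dominated by the first stage or two, which is what makes 15 frames per second feasible.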
49 Using confidence to avoid labeling
Levin, Viola & Freund, 2003
50 Image 1
51 Image 1 - difference from time average
52 Image 2
53 Image 2 - difference from time average
54 Co-training
Blum & Mitchell, 1998
(Diagram: a partially trained B/W-based classifier and a partially trained difference-based classifier label raw B/W and difference images of highways for each other.)
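The co-training loop itself is simple: each view's classifier hands its confident labels to the other. The learner, the 1-D "views", and the confidence threshold below are illustrative stand-ins, not the highway-image detectors from the slides:

```python
# Co-training sketch (Blum & Mitchell): two classifiers trained on two
# different views of the data label confident unlabeled examples for each other.
def co_train(labeled, unlabeled, fit, rounds, conf):
    # labeled: [((view_a, view_b), y)];  unlabeled: [(view_a, view_b)]
    pool = list(unlabeled)
    for _ in range(rounds):
        fa = fit([(a, y) for (a, b), y in labeled])   # view-A classifier
        fb = fit([(b, y) for (a, b), y in labeled])   # view-B classifier
        remaining = []
        for a, b in pool:
            ya, ca = fa(a)
            yb, cb = fb(b)
            if ca >= conf:
                labeled.append(((a, b), ya))   # A is confident: label for B
            elif cb >= conf:
                labeled.append(((a, b), yb))   # B is confident: label for A
            else:
                remaining.append((a, b))       # still too uncertain
        pool = remaining
    return labeled

def make_fit(pairs):
    # Toy learner: threshold halfway between the class means of a 1-D view;
    # confidence grows with distance from the threshold.
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == -1]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: (1 if x > t else -1, min(1.0, abs(x - t)))

labeled = [((0.0, 0.1), -1), ((1.0, 0.9), 1)]
unlabeled = [(0.9, 0.8), (0.1, 0.2)]
result = co_train(labeled, unlabeled, make_fit, rounds=2, conf=0.3)
print(len(result))   # 4: both unlabeled pairs acquired labels
```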
55 (No transcript)
56 Co-Training Results
57 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
58 Gene Regulation
- Regulatory proteins bind to the non-coding regulatory sequence of a gene to control the rate of transcription.
59 From mRNA to Protein
(Figure; label: nucleus wall)
60 Protein Transcription Factors
61 Genome-wide Expression Data
- Microarrays measure mRNA transcript expression levels for all of the 6000 yeast genes at once.
- Very noisy data.
- A rough time slice over all compartments of many cells.
- Protein expression is not observed.
62 Partial Parts List for Yeast
- Many known and putative:
- Transcription factors.
- Signaling molecules that activate transcription factors.
- Known and putative binding-site motifs.
- In yeast, the regulatory sequence is the 500 bp upstream region.
63 GeneClass Problem Formulation
M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie. Predicting Genetic Regulatory Response Using Classification. ISMB 2004.
- Predict target-gene regulatory response from regulator activity and binding-site data.
64 Role of quantization
By quantizing expression into three classes we reduce noise but maintain most of the signal.
Weighting +1/-1 examples linearly with expression level performs slightly better.
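The quantization step amounts to thresholding the log expression ratio into down / baseline / up. A minimal sketch; the cutoff value 0.5 is an assumption for illustration, not the paper's exact setting:

```python
# Quantize a log expression ratio into {-1, 0, +1}.
def quantize(log_ratio, cutoff=0.5):
    if log_ratio >= cutoff:
        return 1       # up-regulated
    if log_ratio <= -cutoff:
        return -1      # down-regulated
    return 0           # baseline: near-zero changes are treated as noise

print([quantize(v) for v in [1.2, 0.1, -0.7, 0.5, -0.2]])  # [1, 0, -1, 1, 0]
```

The middle band is what absorbs the measurement noise; only the clearly up- or down-regulated examples drive the classifier.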
65 Problem setup
- Data point: target gene x microarray.
- Input features:
- Parent state: {-1, 0, +1}.
- Motif presence: {0, 1}.
- Predicted output:
- Target gene: {-1, +1}.
66 Boosting with Alternating Decision Trees (ADTs)
- Use boosting to build a single ADT, a margin-based generalization of a decision tree.
Splitter node: Is motif MIG1 present AND parent XBP1 up?
Prediction node: F(x) is given by the sum of the prediction nodes along all paths consistent with x.
67 Statistical Validation
- 10-fold cross-validation experiments, 50,000 (gene, microarray) training examples.
- Significant correlation between prediction score and true log expression ratio on held-out data.
- Prediction accuracy on +1/-1 labels: 88.5%.
68 Biological interpretation: from correlation to causation
- Good prediction only implies correlation.
- To infer causation we need to integrate additional knowledge.
- Comparative case studies: train on similar conditions (stresses), test on related experiments.
- Extract significant features from the learned model:
- Iteration score (IS): the boosting iteration at which a feature first appears; identifies significant motifs and motif-parent pairs.
- Abundance score (AS): the number of nodes in the ADT containing a feature; identifies important regulators.
- In silico knock-outs: remove a significant regulator and retrain.
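Both scores read directly off the learned tree once it is flattened into node records. A minimal sketch; the node list below is made up for illustration, not the learned GeneClass tree:

```python
# An ADT flattened into (boosting_iteration, feature) node records.
nodes = [
    (1, "USV1"), (2, "STRE"), (3, "USV1"), (5, "PPT1"),
    (7, "USV1"), (8, "STRE"), (9, "GAC1"),
]

def iteration_score(feature):
    # IS: earliest boosting round using the feature (lower = stronger signal).
    return min(it for it, f in nodes if f == feature)

def abundance_score(feature):
    # AS: number of tree nodes using the feature (higher = more important).
    return sum(1 for _, f in nodes if f == feature)

print(iteration_score("USV1"), abundance_score("USV1"))   # 1 3
```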
69 Case Study: Heat Shock and Osmolarity
- Training set: heat shock, osmolarity, amino-acid starvation.
- Test set: stationary phase, simultaneous heat shock + osmolarity.
- Results:
- Test error 9.3%.
- Supports the Gasch hypothesis: the heat-shock and osmolarity pathways are independent and additive.
- High-scoring parents (AS): USV1 (stationary phase and heat shock), PPT1 (osmolarity response), GAC1 (response to heat).
70 Case Study: Heat Shock and Osmolarity
- Results:
- High-scoring binding sites (IS):
- MSN2/MSN4 STRE element.
- Heat-shock related: HSF1 and RAP1 binding sites.
- Osmolarity/glycerol pathways: CAT8, MIG1, GCN4.
- Amino-acid starvation: GCN4, CHA4, MET31.
- High-scoring motif-parent pair (IS):
- TPK1-STRE pair (a kinase that regulates MSN2 via cellular localization): an indirect effect.
(Figure legend: direct binding / indirect effect / co-occurrence)
71 Case Study: In silico knockout
- Training and test sets: same as the heat shock and osmolarity case study.
- Knockout: remove USV1 from the regulator list and retrain.
- Results:
- Test error 12% (up from 9.3%).
- Identify putative downstream targets of USV1: target genes that change from a correct to an incorrect label.
- GO annotation analysis reveals putative functions: nucleoside transport, cell-wall organization and biogenesis, heat-shock protein activity.
- Putative functions match those identified in the wet-lab USV1 knockout (Segal et al., 2003).
72 Conclusions: Gene Regulation
- A new predictive model for the study of gene regulation.
- The first gene-regulation model to make quantitative predictions.
- Uses actual expression levels; no clustering.
- Strong prediction accuracy on held-out experiments.
- Interpretable hypotheses: significant regulators, binding motifs, regulator-motif pairs.
- A new methodology for biological analysis: comparative training/test studies, in silico knockouts.
73 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
74 Summary
- Moving from density estimation to classification can make hard problems tractable.
- Boosting is an efficient and flexible method for constructing complex and accurate classifiers.
- I/O is the main bottleneck to data-mining; sampling, data localization, and parallelization help.
- Correlation -> causation is still a hard problem; it requires domain-specific expertise and integration of data sources.
75 Future work
- New applications:
- Bio-informatics.
- Vision / speech and signal processing.
- Information retrieval and information extraction.
- Theory:
- Improving the robustness of learning algorithms.
- Utilization of unlabeled examples in confidence-rated classification.
- Sequential experimental design.
- Relationships between learning algorithms and stochastic differential equations.
76 Extra
77 Plan of talk
- Boosting
- Alternating Decision Trees
- Data-mining AT&T transaction logs.
- The I/O bottleneck in data-mining.
- High-energy physics.
- Resistance of boosting to over-fitting.
- Confidence-rated prediction.
- Confidence-rating for object recognition.
- Gene regulation modeling.
- Summary
78 Analysis for the MiniBooNE experiment
- Goal: to test for neutrino mass by searching for neutrino oscillations.
- Important because it may lead us to physics beyond the Standard Model.
- The BooNE project began in 1997.
- The first beam-induced neutrino events were detected in September 2002.
MiniBooNE detector (Fermilab)
79 MiniBooNE Classification Task
Ion Stancu, UC Riverside
80 (No transcript)
81 Results
82 Using confidence to reduce labeling
(Diagram: unlabeled data feeding a partially trained classifier)
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
83 Discriminative approach
(Figure: voice pitch)
84 Results from Yotam Abramson.
85 (No transcript)