Transcript and Presenter's Notes

Title: Regression


1
Regression
  • dr. János Abonyi
  • University of Veszprem
  • abonyij_at_fmt.vein.hu
  • www.fmt.vein.hu/softcomp/dw

2
Increasing potential to support business decisions
End User
Making Decisions
Business Analyst
Data Presentation
Visualization Techniques
Data Mining
Data Analyst
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
DBA
Data Sources
Paper, Files, Information Providers, Database
Systems, OLTP
3
Linear Regression Models
Here the Xs might be
  • Raw predictor variables (continuous or coded-categorical)
  • Transformed predictors (X4 = log X3)
  • Basis expansions (X4 = X3², X5 = X3³, etc.)
  • Interactions (X4 = X2 · X3)

Popular choice for estimation is least squares
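As an illustrative sketch (not part of the original slides), the least-squares fit for such a design matrix can be computed directly; the data, variable names, and coefficients below are hypothetical.

```python
import numpy as np

# Hypothetical data: three raw predictors
rng = np.random.default_rng(0)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))

# Design matrix with an intercept, a transformed predictor,
# a basis expansion and an interaction, mirroring the list above
X = np.column_stack([
    np.ones(n),                  # intercept
    x1, x2, x3,                  # raw predictors
    np.log(np.abs(x3) + 1e-6),   # transformed predictor (log |X3|)
    x3 ** 2,                     # basis expansion
    x2 * x3,                     # interaction
])
y = 1.0 + 2.0 * x1 - 0.5 * x2 * x3 + rng.normal(scale=0.1, size=n)

# Least squares: beta_hat = (X'X)^{-1} X'y, computed stably via lstsq
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
```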
4
Least Squares

Least-squares estimate: β̂ = (XᵀX)⁻¹XᵀY. Fitted values: Ŷ = Xβ̂ = HY, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
Often assume that the Ys are independent and
normally distributed, leading to various
classical statistical tests and confidence
intervals
5
Evaluating the Model
  • Variation Measures
  • Coeff. of Determination
  • Standard Error of Estimate
  • Test Coefficients for Significance

Ŷi = b0 + b1 Xi (fitted regression line)
6
Variation Measures
Total Sum of Squares: SST = Σ (Yi − Ȳ)²
Unexplained Sum of Squares: SSE = Σ (Yi − Ŷi)²
Explained Sum of Squares: SSR = Σ (Ŷi − Ȳ)²
where Ŷi = b0 + b1 Xi is the fitted regression line, and SST = SSR + SSE.
[Figure: scatter of Y against X with the fitted line, showing SSE, SSR and SST for a single observation (Xi, Yi)]
7
Coefficient of Determination
  • Proportion of Variation Explained by the
    Relationship Between X and Y


0 ≤ r² ≤ 1

r² = SSR / SST = Explained Variation / Total Variation
Ability of the equation to fit the data.
Keep in mind that R² (and t-stats) represent correlation, not causation.
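A minimal sketch of the r² computation from the variation measures above; the data and the simple linear fit are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: r^2 = SSR / SST = 1 - SSE / SST."""
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) sum of squares
    return 1.0 - sse / sst

# Hypothetical simple linear regression y_hat = b0 + b1 * x
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=50)
b1, b0 = np.polyfit(x, y, deg=1)        # slope first, then intercept
print(r_squared(y, b0 + b1 * x))
```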
8
Evaluating the Fit of A Regression Line
  • Adjusted R²
  • R² will tend to be higher the fewer the data points one is trying to fit a regression line to
  • The fit of a regression line (measured by R²) will always improve with more explanatory variables
  • Adjusted R² accounts for different sample sizes and different numbers of explanatory variables: adjusted R² = 1 − (1 − R²)(N − 1)/(N − K)
  • N = number of observations, K = number of coefficients to be estimated (including the constant)

9
Tutorial
10
Too Many Predictors?
When there are lots of Xs, we get models with high variance, and prediction suffers. Three solutions:
  • Subset selection
  • Shrinkage/Ridge Regression
  • Derived Inputs


All-subsets, leaps-and-bounds, stepwise, AIC,
BIC, etc.
11
Ridge Regression
Minimize Σi (yi − β0 − Σj xij βj)² subject to Σj βj² ≤ s
Equivalently, minimize Σi (yi − β0 − Σj xij βj)² + λ Σj βj²
This leads to β̂ridge = (XᵀX + λI)⁻¹ XᵀY. Choose λ by cross-validation.
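A minimal sketch of the ridge estimate, with λ chosen over a grid using a single hold-out split as a stand-in for cross-validation; the data and the grid of λ values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: beta = (X'X + lam*I)^{-1} X'y (predictors assumed centered/scaled)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data with many predictors
rng = np.random.default_rng(2)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Pick lambda from a grid using a hold-out split (a stand-in for cross-validation)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
train, val = slice(0, 40), slice(40, 60)
errors = [np.mean((y[val] - X[val] @ ridge_fit(X[train], y[train], lam)) ** 2)
          for lam in lambdas]
print("chosen lambda:", lambdas[int(np.argmin(errors))])
```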
12
df(λ) = tr[X(XᵀX + λI)⁻¹Xᵀ] is the effective number of Xs (effective degrees of freedom), which shrinks as λ grows.
13
The Lasso
Minimize Σi (yi − β0 − Σj xij βj)² subject to Σj |βj| ≤ s
A quadratic programming algorithm is needed to solve for the parameter estimates.
With a general penalty Σj |βj|^q: q = 0 gives variable selection, q = 1 the lasso, q = 2 ridge. Learn q?
14
(No Transcript)
15
Dummy variables
  • Dummy variables and interaction terms
  • suppose you think men buy more pizzas than women, for any given level of advertising
  • want different intercepts for women (a1) and men (a1 + a2)

[Figure: pizzas per month vs. advertising, with parallel lines for men and women; a1 is the women's intercept and a2 is the intercept differential for men]
If we add a male dummy, its coefficient represents the intercept differential: Q = a1 + a2·(Male) + b·(Advertising). a1 is the intercept for women, since when Male = 0 the second term vanishes.
16
Dummy II.
  • suppose you believe that in the absence of
    advertising men and women buy the same number of
    pizzas, but women respond more to advertising

[Figure: pizzas per month vs. advertising; the two lines share an intercept, with slope b2 for women steeper than slope b1 for men]
We want to interact advertising with the female dummy: women: Q = a + b2·ADV; men: Q = a + b1·ADV; all: Q = a + b1·ADV + c·(ADV × Female). c measures the difference in slope coefficients between females and males.
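A minimal sketch of fitting the interaction specification above by least squares; the pizza data, the female dummy, and the coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical data: advertising level, a female dummy, and pizzas per month
rng = np.random.default_rng(3)
n = 200
adv = rng.uniform(0, 10, size=n)
female = rng.integers(0, 2, size=n)
pizzas = 5.0 + 0.6 * adv + 0.4 * adv * female + rng.normal(scale=1.0, size=n)

# Common intercept, different slopes: Q = a + b1*ADV + c*(ADV x Female)
X = np.column_stack([np.ones(n), adv, adv * female])
a, b1, c = np.linalg.lstsq(X, pizzas, rcond=None)[0]
print(f"intercept a={a:.2f}, men's slope b1={b1:.2f}, slope differential c={c:.2f}")
```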
17
variance of error term is not constant
  • violates one of the basic assumptions of
    regression analysis, that residuals have constant
    variance


[Scatter plot of Y against X: the spread of the points around the regression line grows with X, illustrating non-constant error variance]
18
  • what to do?
  • heteroscedasticity could be caused by the wrong functional form, so moving to a nonlinear equation may help

[Two Y-vs-X panels illustrating alternative nonlinear functional forms]
19
Polynomial model family
  • Linear in w ⇒ reduces to the linear regression case, but with more variables.
  • Number of terms grows as D^M (for D inputs and polynomial degree M)
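The polynomial family is linear in w once the inputs are expanded into monomial terms, so it can be fitted exactly like a linear regression. A minimal sketch of such an expansion (the helper function is hypothetical, not from the slides):

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree):
    """Expand D input columns into all monomials up to the given degree (bias included)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for m in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), m):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Hypothetical example: D = 3 inputs, degree M = 2 -> 10 terms; the count grows rapidly with D and M
X = np.random.default_rng(4).normal(size=(5, 3))
print(polynomial_features(X, degree=2).shape)
```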

20
Example Polynomial model
21
Generalized linear model
Linear in w ⇒ reduces to the linear regression case, but with more variables. Requires a good guess on the basis functions hk(x).
22
Example Generalized linear model
23
Basis Expansions for Linear Models
Here the hm's might be
  • hm(X) = Xm, m = 1, ..., p (recovers the original model)
  • hm(X) = Xj² or hm(X) = Xj · Xk
  • hm(X) = I(Lm ≤ Xk < Um)

24
knots
25
Regression Splines
Bottom left panel uses piecewise linear fits that are constrained to be continuous at the knots.
Number of parameters = (3 regions) × (2 params per region) − (2 knots × 1 constraint per knot) = 4
26
cubic spline
27
Cubic Spline
Continuous first and second derivatives.
Number of parameters = (3 regions) × (4 params per region) − (2 knots × 3 constraints per knot) = 6
Knot discontinuity is essentially invisible to the human eye.
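A minimal sketch of a cubic spline fit using the truncated power basis with two knots, which gives the 6 parameters counted above; the data and knot locations are hypothetical.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3 and (x - k)_+^3 for each knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    for k in knots:
        cols.append(np.clip(x - k, 0.0, None) ** 3)
    return np.column_stack(cols)

# Hypothetical data; 2 knots -> 4 + 2 = 6 parameters
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=80)
B = cubic_spline_basis(x, knots=[0.33, 0.66])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ coef
```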
28
Image Source ww.physiol.ucl.ac.uk/fedwards/
ca120neuron.jpg
Introduction to Artificial Neural Network Models
29
Definition
Neural Network: a broad class of models that mimic functioning inside the human brain
  • There are various classes of NN models.
  • They differ from each other depending on
  • Problem type: Prediction, Classification, Clustering
  • Structure of the model
  • Model-building algorithm

For this discussion we are going to focus
on Feed-forward Back-propagation Neural
Network (used for Prediction and Classification
problems)
30
A bit of biology . . .
The most important functional unit in the human brain is a class of cells called the NEURON
Hippocampal Neurons Source heart.cbl.utoronto.ca
/ berj/projects.html
Schematic
  • Dendrites: receive information
  • Cell Body: processes information
  • Axon: carries processed information to other neurons
  • Synapse: junction between an axon end and the dendrites of other neurons

31
An Artificial Neuron
[Diagram: inputs X1, ..., Xp flow through weighted connections w1, ..., wp (the "dendrites") into the cell body, which sends the output along the axon]
Total input: I = w1X1 + w2X2 + w3X3 + ... + wpXp
Output: V = f(I)
  • Receives inputs X1, X2, ..., Xp from other neurons or the environment
  • Inputs are fed in through connections with weights
  • Total input = weighted sum of inputs from all sources
  • The transfer function (activation function) converts the input to the output
  • The output goes to other neurons or the environment
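A minimal sketch of a single artificial neuron computing a weighted sum of its inputs and passing it through a transfer function; the inputs, weights, and the choice of a logistic transfer function are hypothetical.

```python
import numpy as np

def neuron_output(x, w, transfer=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """Single artificial neuron: weighted sum of inputs passed through a transfer function."""
    total_input = np.dot(w, x)      # I = w1*X1 + w2*X2 + ... + wp*Xp
    return transfer(total_input)    # V = f(I)

# Hypothetical inputs and weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(x, w))
```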

32
Transfer Functions
There are various choices for Transfer /
Activation functions
Logistic:  f(x) = e^x / (1 + e^x)   (output between 0 and 1)
Threshold: f(x) = 0 if x < 0, 1 if x ≥ 0
Tanh:      f(x) = (e^x − e^−x) / (e^x + e^−x)
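A minimal sketch of the three transfer functions listed above, written out explicitly (in practice np.tanh and a numerically stable logistic would be used):

```python
import numpy as np

def logistic(x):
    return np.exp(x) / (1.0 + np.exp(x))       # squashes input into (0, 1)

def threshold(x):
    return np.where(x < 0.0, 0.0, 1.0)         # hard step at 0

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # same as np.tanh(x)

x = np.linspace(-3, 3, 7)
print(logistic(x), threshold(x), tanh(x), sep="\n")
```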
33
ANN Feed-forward Network
A collection of neurons form a Layer
Input Layer - Each neuron gets ONLY one
input, directly from outside
Hidden Layer - Connects Input and Output
layers
Output Layer - Output of each neuron
directly goes to outside
34
ANN Feed-forward Network
Number of hidden layers can be
None
One
More
35
ANN Feed-forward Network
Couple of things to note
  • Within a layer neurons are NOT connected to
    each other.
  • Neuron in one layer is connected to neurons ONLY
    in the NEXT layer. (Feed-forward)
  • Jumping over a layer is NOT allowed

36
One particular ANN model
What do we mean by A particular Model ?
Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)
For an ANN
Algebraic form of f(.) is too complicated to
write down.
  • However, it is characterized by
  • Number of input neurons
  • Number of hidden layers
  • Number of neurons in each hidden layer
  • Number of output neurons
  • WEIGHTS for all the connections

Fitting an ANN model = specifying values for all those parameters
37
One particular Model an Example
Model: Y = f(X1, X2, X3)
Input: X1, X2, X3
Output: Y
Parameters (example): Input Neurons = 3, Hidden Layers = 1, Hidden Layer Size = 3, Output Neurons = 1, Weights = specified
38
Prediction using a particular ANN Model
Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)
[Network diagram with the numeric labels -0.2, 0.6, -0.1, 0.1, 0.7, 0.5, 0.1 and -0.2 attached to the network's connections; feeding the inputs forward gives a predicted output of 0.478]
Suppose actual Y = 2. Then prediction error = (2 − 0.478) = 1.522
39
Building ANN Model
How to build the Model ?
Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)
Input Neurons = Inputs = 3    Output Neurons = Outputs = 1
The architecture is now defined. How do we get the weights?
Given the architecture, there are 8 weights to decide: W = (W1, W2, ..., W8)
Training data: (Yi, X1i, X2i, ..., Xpi), i = 1, 2, ..., n. Given a particular choice of W, we get predicted Ys (V1, V2, ..., Vn). They are functions of W. Choose W such that the overall prediction error E is minimized:
E = Σ (Yi − Vi)²
40
Training the Model
How to train the Model ?
E = Σ (Yi − Vi)²
41
Back Propagation
A bit more detail on Back Propagation:
Each weight "shares the blame" for the prediction error with the other weights. The Back Propagation algorithm decides how to distribute the blame among all the weights and adjusts the weights accordingly. A small portion of the blame leads to a small adjustment; a large portion of the blame leads to a large adjustment.
E = Σ (Yi − Vi)²
42
Weight adjustment during Back Propagation
Weight adjustment formula in Back Propagation
Vi, the prediction for the i-th observation, is a function of the network weight vector W = (W1, W2, ...). Hence E, the total prediction error, is also a function of W:
E(W) = Σ [Yi − Vi(W)]²
Gradient Descent Method: for every individual weight Wi, the update formula looks like
Wnew = Wold − η (∂E/∂W)|Wold
η = Learning Parameter (between 0 and 1)
Another slight variation is also used sometimes:
W(t+1) = W(t) − η (∂E/∂W)|W(t) + α (W(t) − W(t−1))
α = Momentum (between 0 and 1)
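A minimal sketch of this gradient-descent update with the optional momentum term; the error function (a simple quadratic) and the parameter values are hypothetical.

```python
import numpy as np

def update_weights(w, grad_E, eta, w_prev=None, alpha=0.0):
    """One update: w_new = w_old - eta * dE/dw, plus an optional momentum term."""
    step = -eta * grad_E(w)
    if w_prev is not None:
        step += alpha * (w - w_prev)   # momentum: reuse part of the previous move
    return w + step

# Hypothetical error surface E(w) = sum(w^2), so dE/dw = 2w
grad_E = lambda w: 2.0 * w
w_prev, w = None, np.array([1.0, -2.0])
for _ in range(50):
    w, w_prev = update_weights(w, grad_E, eta=0.1, w_prev=w_prev, alpha=0.5), w
print(w)   # approaches the minimum at 0
```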
43
Geometric interpretation of the Weight adjustment
Consider a very simple network with 2 inputs and 1 output, and no hidden layer. There are only two weights whose values need to be specified.
E(w1, w2) = Σ [Yi − Vi(w1, w2)]²
  • A pair ( w1, w2 ) is a point on 2-D plane.
  • For any such point we can get a value of E.
  • Plot E vs ( w1, w2 ) - a 3-D surface - Error
    Surface
  • Aim is to identify that pair for which E is
    minimum
  • That means identify the pair for which the
    height of the error surface is minimum.
  • Gradient Descent Algorithm
  • Start with a random point ( w1, w2 )
  • Move to a better point ( w1, w2 ) where the
    height of error surface is lower.
  • Keep moving till you reach ( w1, w2 ), where
    the error is minimum.

44
Crawling the Error Surface
45
Training Algorithm
Decide the Network architecture ( Hidden
layers, Neurons in each Hidden Layer)
Decide the Learning parameter and Momentum
Initialize the Network with random weights
Feed the i-th observation forward through the Net
Compute the prediction error on the i-th observation
Back propagate the error and adjust the weights
E = Σ (Yi − Vi)²
Check for Convergence
46
Convergence Criterion
When to stop training the Network ?
Ideally, when we reach the global minimum of the error surface
We don't
How do we know we have reached there?
  • Suggestion
  • Stop if the decrease in total prediction error
    (since last cycle) is small.
  • Stop if the overall changes in the weights (since
    last cycle) are small.

Drawback: the error keeps on decreasing, so we get a very good fit to the training data. BUT the network thus obtained has poor generalizing power on unseen data. This phenomenon is also known as over-fitting of the training data: the network is said to "memorize" the training data, so that when an X from the training set is given, the network faithfully produces the corresponding Y. However, for Xs which the network didn't see before, it predicts poorly.
47
Convergence Criterion
Modified Suggestion Partition the training
data into Training set and Validation set Use
Training set - build the model Validation
set - test the performance of the model on unseen
data
Typically, as we run more and more training cycles, the error on the Training set keeps decreasing, while the error on the Validation set first decreases and then increases.
Stop training when the error on Validation set
starts increasing
48
Choice of Training Parameters
The Learning Parameter and Momentum need to be supplied by the user from outside. Both should be between 0 and 1. What should the optimal values of these training parameters be? There is no clear consensus on any fixed strategy. However, the effects of wrongly specifying them are well studied.
Learning Parameter. Too big: large leaps in weight space, with a risk of missing the global minimum. Too small: takes a long time to converge to the global minimum, and once stuck in a local minimum it is difficult to get out of it.
Suggestion: trial and error. Try various choices of Learning Parameter and Momentum and see which choice leads to the minimum prediction error.
49
Wrap Up
  • Artificial Neural Network (ANN): a class of models inspired by biological neurons
  • Used for various modeling problems: prediction, classification, clustering, ...
  • One particular subclass of ANNs: feed-forward back-propagation networks
  • Organized in layers: input, hidden, output
  • Each layer is a collection of a number of artificial neurons
  • Neurons in one layer are connected to neurons in the next layer
  • Connections have weights
  • Fitting an ANN model means finding the values of these weights.
  • Given a training data set, the weights are found by the feed-forward back-propagation algorithm, which is a form of the Gradient Descent Method, a popular technique for function minimization.
  • Network architecture as well as the training
    parameters are decided upon by trial and error.
    Try various choices and pick the one that gives
    lowest prediction error.

50
Instance Based Learning
  • Key idea: just store all training examples ⟨xi, f(xi)⟩
  • Nearest neighbor
  • Given query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn)
  • K-nearest neighbor
  • Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued)
  • Take the mean of the f values of the k nearest neighbors (if real-valued): f(xq) = Σ(i=1..k) f(xi) / k
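A minimal sketch of the k-nearest-neighbor estimate for a real-valued target; the training data and the query point are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """k-NN estimate: mean of the target values of the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances to the query
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    return y_train[nearest].mean()                     # real-valued target: take the mean

# Hypothetical training data
rng = np.random.default_rng(6)
X_train = rng.uniform(-1, 1, size=(50, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] ** 2
print(knn_predict(X_train, y_train, x_query=np.array([0.1, 0.2]), k=3))
```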

51
Voronoi Diagram
query point qf
nearest neighbor qi
52
3-Nearest Neighbors
query point qf
3 nearest neighbors
2x,1o
53
7-Nearest Neighbors
query point qf
7 nearest neighbors
3x,4o
54
Nearest Neighbor (continuous)
1-nearest neighbor
55
Nearest Neighbor (continuous)
3-nearest neighbor
56
Nearest Neighbor (continuous)
5-nearest neighbor
57
When to Consider Nearest Neighbors
  • Instances map to points in R^N
  • Less than 20 attributes per instance
  • Lots of training data
  • Advantages
  • Training is very fast
  • Learn complex target functions
  • Do not lose information
  • Disadvantages
  • Slow at query time
  • Easily fooled by irrelevant attributes

58
Locally Weighted Regression
  • Give more weight to neighbors closer to the query
    point
  • The kernel function is the function of distance used to determine the weight of each training example; in other words, the kernel function is the function K such that wi = K(d(xi, xq))

59
Kernel Functions
60
Distance Weighted k-NN
  • Give more weight to neighbors closer to the query
    point
  • f(xq) = Σ(i=1..k) wi f(xi) / Σ(i=1..k) wi
  • where wi = K(d(xq, xi))
  • and d(xq, xi) is the distance between xq and xi
  • Instead of only the k nearest neighbors, use all training examples (Shepard's method)
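A minimal sketch of the distance-weighted k-NN estimate, using the inverse-squared-distance kernel shown on the next slide; the data and query point are hypothetical.

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_query, k=5,
                          kernel=lambda d: 1.0 / (d ** 2 + 1e-12)):
    """Distance-weighted k-NN: f(xq) = sum(wi * f(xi)) / sum(wi), with wi = K(d(xq, xi))."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = kernel(dists[nearest])                 # closer neighbors get larger weights
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Hypothetical data
rng = np.random.default_rng(7)
X_train = rng.uniform(-1, 1, size=(50, 2))
y_train = X_train[:, 0] ** 2 + X_train[:, 1]
print(distance_weighted_knn(X_train, y_train, x_query=np.array([0.0, 0.5])))
```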

61
Distance Weighted NN
K(d(xq, xi)) = 1 / d(xq, xi)²
62
Distance Weighted NN
K(d(xq, xi)) = 1 / (d0 + d(xq, xi))²
63
Distance Weighted NN
K(d(xq, xi)) = exp(−(d(xq, xi) / σ0)²)
64
Curse of Dimensionality
  • Curse of dimensionality nearest neighbor is
    easily misled when instance space is
    high-dimensional
  • One approach
  • Stretch the j-th axis by weight zj, where z1, ..., zn are chosen to minimize prediction error
  • Use cross-validation to automatically choose the weights z1, ..., zn
  • Note: setting zj to zero eliminates this dimension altogether (feature subset selection)

65
Distance Weighted Average
  • Weighting the data
  • f(xq) = Σi f(xi) K(d(xi, xq)) / Σi K(d(xi, xq))
  • The relevance of a data point (xi, f(xi)) is measured by the distance d(xi, xq) between the query xq and the input vector xi
  • Weighting the error criterion
  • E(xq) = Σi (f(xq) − f(xi))² K(d(xi, xq))
  • The best estimate f(xq) will minimize the cost E(q), therefore ∂E(q)/∂f(xq) = 0

66
Locally Weighted Regression
  • Local
  • the function is approximated based only on data
    near the query point.
  • Weighted
  • the contribution of each training example is
    weighted by its distance from the query point.
  • Regression
  • approximating a real-valued function

67
A Local Approximation
  • Method 1: minimize the squared error over the k nearest neighbors
  • Method 2: minimize the squared error over the entire set D, with weights
  • Method 3: combine 1 and 2

68
Local Linear Models
  • Estimate the parameters β such that they locally (near the query point xq) match the training data, either by
  • weighting the data:
  • wi = K(d(xi, xq))^(1/2), and transforming
  • zi = wi xi
  • vi = wi yi
  • or by weighting the error criterion:
  • E = Σ(i=1..N) (xiᵀβ − yi)² K(d(xi, xq))
  • still linear in β, with the LSQ solution
  • β = ((WX)ᵀ WX)⁻¹ (WX)ᵀ W F(X)
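A minimal sketch of a local linear model that weights the data by the square root of a Gaussian kernel and then solves an ordinary least-squares problem, following the weighting-the-data route above; the data, bandwidth, and query point are hypothetical.

```python
import numpy as np

def locally_weighted_regression(X, y, x_query, bandwidth=0.3):
    """Fit a linear model with kernel weights around the query point, then predict there."""
    Xb = np.column_stack([np.ones(len(X)), X])            # add an intercept column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    d = np.linalg.norm(X - x_query, axis=1)               # distances to the query point
    w = np.exp(-(d / bandwidth) ** 2)                     # kernel weights K(d(xi, xq))
    sqrt_w = np.sqrt(w)[:, None]                          # weight the data by K^(1/2)
    beta, *_ = np.linalg.lstsq(sqrt_w * Xb, sqrt_w[:, 0] * y, rcond=None)
    return xq @ beta

# Hypothetical 1-D example
rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=80)
print(locally_weighted_regression(X, y, x_query=np.array([0.5])))
```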

69
Design Issues in Local Regression
  • Local model order (constant, linear, quadratic)
  • Distance function d
  • feature scaling: d(x, q) = (Σ(j=1..d) mj (xj − qj)²)^(1/2)
  • irrelevant dimensions: mj = 0
  • kernel function K
  • smoothing parameter (bandwidth) h in K(d(x, q)/h)
  • h constant: global bandwidth
  • h = distance to the k-th nearest neighbor point
  • h = h(q): depending on the query point
  • h = hi: depending on the stored data points

70
Remarks on Locally Weighted Regression
  • In most cases, the target function is
    approximated by a constant, linear, or quadratic
    function
  • More complex functional forms are not used
    because
  • The cost of fitting more complex functions for
    each query instance is high.
  • These simple approximations model the target
    functions quite well over a sufficiently small
    subregion of the instance space.

71
RBF Networks
[Network diagram: d input nodes x1, ..., xd feed H hidden-layer radial basis functions yj (with spread constant σ), which feed c output nodes z1, ..., zc through linear activation functions; Wji are the input-to-hidden weights, Wkj the hidden-to-output weights, and netk the net input to output unit k (i = 1, ..., d; j = 1, ..., H; k = 1, ..., c)]
72
RBFN Principle of Operation
Using Gaussian radial basis functions
Using sigmoidal radial basis functions
73
Radial Basis Function Network
f(x) = w0 + Σ(u=1..k) wu Ku(d(xu, x))
  • where xu is an instance from X,
  • Ku(d(xu, x)) is a kernel function
  • One common choice for Ku(d(xu, x)) is the Gaussian Ku(d(xu, x)) = exp(−d²(xu, x) / (2σu²))

74
Training Radial Basis Function Networks
  • Q-1: What xu to use for each kernel function Ku(d(xu, x))?
  • Scatter them uniformly throughout instance space
  • Or use training instances (reflects the instance distribution)
  • Q-2: How to train the weights (assume here Gaussian Ku)?
  • First choose the variance (and perhaps the mean) for each Ku (e.g. use EM)
  • Then hold Ku fixed, and train the linear output layer
  • Efficient methods to fit linear functions
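A minimal sketch of this two-step recipe: pick centers from the training instances, hold the Gaussian kernels fixed, and fit the linear output layer by least squares; the data, the number of centers, and the kernel width are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Gaussian RBF activations exp(-||x - c||^2 / (2*sigma^2)), plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack([np.ones(len(X)), np.exp(-d2 / (2.0 * sigma ** 2))])

# Hypothetical data
rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sin(2 * X[:, 0]) * X[:, 1] + rng.normal(scale=0.05, size=100)

centers = X[rng.choice(len(X), size=10, replace=False)]  # Q-1: use training instances
Phi = rbf_design(X, centers, sigma=0.5)                  # Q-2: hold the Gaussian kernels fixed
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # ...and train the linear output layer
print(np.mean((y - Phi @ w) ** 2))                       # training mean squared error
```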

75
Training Radial Basis Function Networks
  • Training
  • construct kernel function
  • adjust weights
  • RBF networks provide a global approximation to
    the target function, represented by a linear
    combination of many local kernel functions.

76
Local Linear Models
77
Linear Local Model Example
78
Linear Local Model Example
79
Tree-Based Methods
  • Overview
  • Principle behind: divide and conquer
  • Variance will be increased
  • Finesse the curse of dimensionality at the price of mis-specifying the model
  • Partition the feature space into a set of rectangles
  • For simplicity, use recursive binary partitions
  • Fit a simple model (e.g. a constant) for each rectangle
  • Classification and Regression Trees (CART)
  • Regression Trees
  • Classification Trees
  • Hierarchical Mixtures of Experts (HME)

80
CART
  • An example (in regression case)

81
Regression Trees
  • Partition the space into M regions R1, R2, ..., RM and fit f(x) = Σ(m=1..M) cm I(x ∈ Rm), where cm is typically the average of the yi in region Rm.

82
How CART Sees An Elephant
It was six men of Indostan To learning much
inclined, Who went to see the Elephant (Though
all of them were blind), That each by
observation Might satisfy his mind . -- The
Blind Men and the Elephant by John Godfrey Saxe
(1816-1887)
83
Regression Trees Grow the Tree
  • The best partition minimizes the sum of squared errors
  • Finding the global minimum is computationally infeasible
  • Greedy algorithm: at each level, choose the splitting variable j and split value s that most reduce the squared error (a sketch follows below)
  • The greedy algorithm makes the tree unstable
  • Errors made at the upper levels are propagated to the lower levels
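A minimal sketch of one greedy CART split: scan every variable j and candidate split value s, fit a constant (the mean) in each half, and keep the pair with the smallest summed squared error; the data are hypothetical.

```python
import numpy as np

def best_split(X, y):
    """Greedy regression-tree split: return (j, s, error) minimizing the summed squared error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Hypothetical data: the response depends mainly on the first feature
rng = np.random.default_rng(10)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] > 0.6, 2.0, 0.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # should pick j = 0 with s near 0.6
```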

84
Local Linear Model Tree (LOLIMOT)
  • incremental tree construction algorithm
  • partitions input space by axis-orthogonal splits
  • adds one local linear model per iteration
  • start with an initial model (e.g. single LLM)
  • identify LLM with worst model error Ei
  • check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension
  • find best (smallest error) out of possible
    divisions
  • add new validity function and LLM
  • repeat from step 2 until the termination criterion is met

85
LOLIMOT
Initial global linear model
Split along x1 or x2
Pick split that minimizes model error (residual)
86
LOLIMOT Example
87
LOLIMOT Example
88
Regression Tree: how large should we grow the tree?
  • Trade-off between accuracy and generalization
  • A very large tree overfits
  • A small tree might not capture the structure
  • Strategies
  • 1: split only when we can decrease the error (short-sighted, e.g. XOR)
  • 2: cost-complexity pruning (preferred)

89
Regression Tree - Pruning
  • Cost-complexity pruning
  • Pruning: collapsing some internal nodes
  • Cost complexity = training error + α × (number of terminal nodes)
  • Choose the best α by weakest-link pruning
  • Each time, collapse the internal node that adds the smallest error
  • Choose the best tree from this sequence by cross-validation

90
Discussions on Trees
  • Linear Combination Splits
  • Split the node based on a linear combination of the inputs, Σj aj Xj ≤ s
  • Improves predictive power
  • Hurts interpretability