Transcript and Presenter's Notes

Title: Regression


1
Regression
  • dr. János Abonyi
  • University of Veszprem
  • abonyij_at_fmt.vein.hu
  • www.fmt.vein.hu/softcomp/dw

2
Increasing potential to support business decisions
End User
Making Decisions
Business Analyst
Data Presentation
Visualization Techniques
Data Mining
Data Analyst
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
DBA
Data Sources
Paper, Files, Information Providers, Database
Systems, OLTP
3
Linear Regression Models
Here the Xs might be
  • Raw predictor variables (continuous or coded-categorical)
  • Transformed predictors (X4 = log X3)
  • Basis expansions (X4 = X3², X5 = X3³, etc.)
  • Interactions (X4 = X2 · X3)

Popular choice for estimation is least squares
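As an illustrative sketch (not part of the original slides), the least-squares fit for such a design matrix can be computed directly; the data, variable names, and coefficients below are hypothetical.

```python
import numpy as np

# Hypothetical data: three raw predictors
rng = np.random.default_rng(0)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))

# Design matrix with an intercept, a transformed predictor,
# a basis expansion and an interaction, mirroring the list above
X = np.column_stack([
    np.ones(n),                  # intercept
    x1, x2, x3,                  # raw predictors
    np.log(np.abs(x3) + 1e-6),   # transformed predictor (log |X3|)
    x3 ** 2,                     # basis expansion
    x2 * x3,                     # interaction
])
y = 1.0 + 2.0 * x1 - 0.5 * x2 * x3 + rng.normal(scale=0.1, size=n)

# Least squares: beta_hat = (X'X)^{-1} X'y, computed stably via lstsq
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
```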
4
Least Squares

Least-squares estimate: β̂ = (XᵀX)⁻¹XᵀY. Fitted values: Ŷ = Xβ̂ = HY, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
Often assume that the Ys are independent and
normally distributed, leading to various
classical statistical tests and confidence
intervals
5
Evaluating the Model
  • Variation Measures
  • Coeff. of Determination
  • Standard Error of Estimate
  • Test Coefficients for Significance

Ŷi = b0 + b1 Xi (fitted regression line)
6
Variation Measures
Total Sum of Squares: SST = Σ (Yi − Ȳ)²
Unexplained Sum of Squares: SSE = Σ (Yi − Ŷi)²
Explained Sum of Squares: SSR = Σ (Ŷi − Ȳ)²
where Ŷi = b0 + b1 Xi is the fitted regression line, and SST = SSR + SSE.
[Figure: scatter of Y against X with the fitted line, showing SSE, SSR and SST for a single observation (Xi, Yi)]
7
Coefficient of Determination
  • Proportion of Variation Explained by the
    Relationship Between X and Y


0 ≤ r² ≤ 1

r² = SSR / SST = Explained Variation / Total Variation
Ability of the equation to fit the data.
Keep in mind that R² (and t-stats) represent correlation, not causation.
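A minimal sketch of the r² computation from the variation measures above; the data and the simple linear fit are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: r^2 = SSR / SST = 1 - SSE / SST."""
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) sum of squares
    return 1.0 - sse / sst

# Hypothetical simple linear regression y_hat = b0 + b1 * x
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=50)
b1, b0 = np.polyfit(x, y, deg=1)        # slope first, then intercept
print(r_squared(y, b0 + b1 * x))
```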
8
Evaluating the Fit of A Regression Line
  • Adjusted R²
  • R² will tend to be higher the fewer the data points one is trying to fit a regression line to
  • The fit of a regression line (measured by R²) will always improve with more explanatory variables
  • Adjusted R² accounts for different sample sizes and different numbers of explanatory variables: adjusted R² = 1 − (1 − R²)(N − 1)/(N − K)
  • N = number of observations, K = number of coefficients to be estimated (including the constant)

9
Tutorial
10
Too Many Predictors?
When there are lots of Xs, we get models with high variance, and prediction suffers. Three solutions:
  • Subset selection
  • Shrinkage/Ridge Regression
  • Derived Inputs


All-subsets, leaps-and-bounds, stepwise, AIC,
BIC, etc.
11
Ridge Regression
Minimize Σi (yi − β0 − Σj xij βj)² subject to Σj βj² ≤ s
Equivalently, minimize Σi (yi − β0 − Σj xij βj)² + λ Σj βj²
This leads to β̂ridge = (XᵀX + λI)⁻¹ XᵀY. Choose λ by cross-validation.
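A minimal sketch of the ridge estimate, with λ chosen over a grid using a single hold-out split as a stand-in for cross-validation; the data and the grid of λ values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: beta = (X'X + lam*I)^{-1} X'y (predictors assumed centered/scaled)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data with many predictors
rng = np.random.default_rng(2)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Pick lambda from a grid using a hold-out split (a stand-in for cross-validation)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
train, val = slice(0, 40), slice(40, 60)
errors = [np.mean((y[val] - X[val] @ ridge_fit(X[train], y[train], lam)) ** 2)
          for lam in lambdas]
print("chosen lambda:", lambdas[int(np.argmin(errors))])
```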
12
df(λ) = tr[X(XᵀX + λI)⁻¹Xᵀ] is the effective number of Xs (effective degrees of freedom), which shrinks as λ grows.
13
The Lasso
Minimize Σi (yi − β0 − Σj xij βj)² subject to Σj |βj| ≤ s
A quadratic programming algorithm is needed to solve for the parameter estimates.
With a general penalty Σj |βj|^q: q = 0 gives variable selection, q = 1 the lasso, q = 2 ridge. Learn q?
14
(No Transcript)
15
Dummy variables
  • Dummy variables and interaction terms
  • suppose you think men buy more pizzas than women, for any given level of advertising
  • want different intercepts for women (a1) and men (a1 + a2)

[Figure: pizzas per month vs. advertising, with parallel lines for men and women; a1 is the women's intercept and a2 is the intercept differential for men]
If we add a male dummy, its coefficient represents the intercept differential: Q = a1 + a2·(Male) + b·(Advertising). a1 is the intercept for women, since when Male = 0 the second term vanishes.
16
Dummy II.
  • suppose you believe that in the absence of
    advertising men and women buy the same number of
    pizzas, but women respond more to advertising

[Figure: pizzas per month vs. advertising; the two lines share an intercept, with slope b2 for women steeper than slope b1 for men]
We want to interact advertising with the female dummy: women: Q = a + b2·ADV; men: Q = a + b1·ADV; all: Q = a + b1·ADV + c·(ADV × Female). c measures the difference in slope coefficients between females and males.
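A minimal sketch of fitting the interaction specification above by least squares; the pizza data, the female dummy, and the coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical data: advertising level, a female dummy, and pizzas per month
rng = np.random.default_rng(3)
n = 200
adv = rng.uniform(0, 10, size=n)
female = rng.integers(0, 2, size=n)
pizzas = 5.0 + 0.6 * adv + 0.4 * adv * female + rng.normal(scale=1.0, size=n)

# Common intercept, different slopes: Q = a + b1*ADV + c*(ADV x Female)
X = np.column_stack([np.ones(n), adv, adv * female])
a, b1, c = np.linalg.lstsq(X, pizzas, rcond=None)[0]
print(f"intercept a={a:.2f}, men's slope b1={b1:.2f}, slope differential c={c:.2f}")
```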
17
variance of error term is not constant
  • violates one of the basic assumptions of
    regression analysis, that residuals have constant
    variance


[Scatter plot of Y against X: the spread of the points around the regression line grows with X, illustrating non-constant error variance]
18
  • what to do?
  • heteroscedasticity could be caused by the wrong functional form, so moving to a nonlinear equation may help

[Two Y-vs-X panels illustrating alternative nonlinear functional forms]
19
Polynomial model family
  • Linear in w ⇒ reduces to the linear regression case, but with more variables.
  • Number of terms grows as D^M (for D inputs and polynomial degree M)
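The polynomial family is linear in w once the inputs are expanded into monomial terms, so it can be fitted exactly like a linear regression. A minimal sketch of such an expansion (the helper function is hypothetical, not from the slides):

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree):
    """Expand D input columns into all monomials up to the given degree (bias included)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for m in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), m):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Hypothetical example: D = 3 inputs, degree M = 2 -> 10 terms; the count grows rapidly with D and M
X = np.random.default_rng(4).normal(size=(5, 3))
print(polynomial_features(X, degree=2).shape)
```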

20
Example Polynomial model
21
Generalized linear model
Linear in w ⇒ reduces to the linear regression case, but with more variables. Requires a good guess on the basis functions hk(x).
22
Example Generalized linear model
23
Basis Expansions for Linear Models
Here the hm's might be
  • hm(X) = Xm, m = 1, ..., p (recovers the original model)
  • hm(X) = Xj² or hm(X) = Xj · Xk
  • hm(X) = I(Lm ≤ Xk < Um)

24
knots
25
Regression Splines
Bottom left panel uses piecewise linear fits that are constrained to be continuous at the knots.
Number of parameters = (3 regions) × (2 params per region) − (2 knots × 1 constraint per knot) = 4
26
cubic spline
27
Cubic Spline
Continuous first and second derivatives.
Number of parameters = (3 regions) × (4 params per region) − (2 knots × 3 constraints per knot) = 6
Knot discontinuity is essentially invisible to the human eye.
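A minimal sketch of a cubic spline fit using the truncated power basis with two knots, which gives the 6 parameters counted above; the data and knot locations are hypothetical.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3 and (x - k)_+^3 for each knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    for k in knots:
        cols.append(np.clip(x - k, 0.0, None) ** 3)
    return np.column_stack(cols)

# Hypothetical data; 2 knots -> 4 + 2 = 6 parameters
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=80)
B = cubic_spline_basis(x, knots=[0.33, 0.66])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ coef
```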
28
Image Source ww.physiol.ucl.ac.uk/fedwards/
ca120neuron.jpg
Introduction to Artificial Neural Network Models
29
Definition
Neural Network: a broad class of models that mimic functioning inside the human brain
  • There are various classes of NN models.
  • They differ from each other depending on
  • Problem type: Prediction, Classification, Clustering
  • Structure of the model
  • Model-building algorithm

For this discussion we are going to focus
on Feed-forward Back-propagation Neural
Network (used for Prediction and Classification
problems)
30
A bit of biology . . .
The most important functional unit in the human brain is a class of cells called the NEURON
Hippocampal Neurons Source heart.cbl.utoronto.ca
/ berj/projects.html
Schematic
  • Dendrites: receive information
  • Cell Body: processes information
  • Axon: carries processed information to other neurons
  • Synapse: junction between an axon end and the dendrites of other neurons

31
An Artificial Neuron
[Diagram: inputs X1, ..., Xp flow through weighted connections w1, ..., wp (the "dendrites") into the cell body, which sends the output along the axon]
Total input: I = w1X1 + w2X2 + w3X3 + ... + wpXp
Output: V = f(I)
  • Receives inputs X1, X2, ..., Xp from other neurons or the environment
  • Inputs are fed in through connections with weights
  • Total input = weighted sum of inputs from all sources
  • The transfer function (activation function) converts the input to the output
  • The output goes to other neurons or the environment
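A minimal sketch of a single artificial neuron computing a weighted sum of its inputs and passing it through a transfer function; the inputs, weights, and the choice of a logistic transfer function are hypothetical.

```python
import numpy as np

def neuron_output(x, w, transfer=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """Single artificial neuron: weighted sum of inputs passed through a transfer function."""
    total_input = np.dot(w, x)      # I = w1*X1 + w2*X2 + ... + wp*Xp
    return transfer(total_input)    # V = f(I)

# Hypothetical inputs and weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(x, w))
```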

32
Transfer Functions
There are various choices for Transfer /
Activation functions
Logistic:  f(x) = e^x / (1 + e^x)   (output between 0 and 1)
Threshold: f(x) = 0 if x < 0, 1 if x ≥ 0
Tanh:      f(x) = (e^x − e^−x) / (e^x + e^−x)
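A minimal sketch of the three transfer functions listed above, written out explicitly (in practice np.tanh and a numerically stable logistic would be used):

```python
import numpy as np

def logistic(x):
    return np.exp(x) / (1.0 + np.exp(x))       # squashes input into (0, 1)

def threshold(x):
    return np.where(x < 0.0, 0.0, 1.0)         # hard step at 0

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # same as np.tanh(x)

x = np.linspace(-3, 3, 7)
print(logistic(x), threshold(x), tanh(x), sep="\n")
```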
33
ANN Feed-forward Network
A collection of neurons form a Layer
Input Layer - Each neuron gets ONLY one
input, directly from outside
Hidden Layer - Connects Input and Output
layers
Output Layer - Output of each neuron
directly goes to outside
34
ANN Feed-forward Network
Number of hidden layers can be
None
One
More
35
ANN Feed-forward Network
Couple of things to note
  • Within a layer neurons are NOT connected to
    each other.
  • Neuron in one layer is connected to neurons ONLY
    in the NEXT layer. (Feed-forward)
  • Jumping over a layer is NOT allowed

36
One particular ANN model
What do we mean by A particular Model ?
Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)
For an ANN
Algebraic form of f(.) is too complicated to
write down.
  • However, it is characterized by
  • Number of input neurons
  • Number of hidden layers
  • Number of neurons in each hidden layer
  • Number of output neurons
  • WEIGHTS for all the connections

Fitting an ANN model = specifying values for all those parameters
37
One particular Model an Example
Model: Y = f(X1, X2, X3)
Input: X1, X2, X3
Output: Y
Parameters (example): Input Neurons = 3, Hidden Layers = 1, Hidden Layer Size = 3, Output Neurons = 1, Weights = specified
38
Prediction using a particular ANN Model
Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)
[Network diagram with the numeric labels -0.2, 0.6, -0.1, 0.1, 0.7, 0.5, 0.1 and -0.2 attached to the network's connections; feeding the inputs forward gives a predicted output of 0.478]
Suppose actual Y = 2. Then prediction error = (2 − 0.478) = 1.522
39
Building ANN Model
How to build the Model ?
Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)
Input Neurons = Inputs = 3    Output Neurons = Outputs = 1
The architecture is now defined. How do we get the weights?
Given the architecture, there are 8 weights to decide: W = (W1, W2, ..., W8)
Training data: (Yi, X1i, X2i, ..., Xpi), i = 1, 2, ..., n. Given a particular choice of W, we get predicted Ys (V1, V2, ..., Vn). They are functions of W. Choose W such that the overall prediction error E is minimized:
E = Σ (Yi − Vi)²
40
Training the Model
How to train the Model ?
E = Σ (Yi − Vi)²
41
Back Propagation
A bit more detail on Back Propagation:
Each weight "shares the blame" for the prediction error with the other weights. The Back Propagation algorithm decides how to distribute the blame among all the weights and adjusts the weights accordingly. A small portion of the blame leads to a small adjustment; a large portion of the blame leads to a large adjustment.
E = Σ (Yi − Vi)²
42
Weight adjustment during Back Propagation
Weight adjustment formula in Back Propagation
Vi, the prediction for the i-th observation, is a function of the network weight vector W = (W1, W2, ...). Hence E, the total prediction error, is also a function of W:
E(W) = Σ [Yi − Vi(W)]²
Gradient Descent Method: for every individual weight Wi, the update formula looks like
Wnew = Wold − η (∂E/∂W)|Wold
η = Learning Parameter (between 0 and 1)
Another slight variation is also used sometimes:
W(t+1) = W(t) − η (∂E/∂W)|W(t) + α (W(t) − W(t−1))
α = Momentum (between 0 and 1)
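A minimal sketch of this gradient-descent update with the optional momentum term; the error function (a simple quadratic) and the parameter values are hypothetical.

```python
import numpy as np

def update_weights(w, grad_E, eta, w_prev=None, alpha=0.0):
    """One update: w_new = w_old - eta * dE/dw, plus an optional momentum term."""
    step = -eta * grad_E(w)
    if w_prev is not None:
        step += alpha * (w - w_prev)   # momentum: reuse part of the previous move
    return w + step

# Hypothetical error surface E(w) = sum(w^2), so dE/dw = 2w
grad_E = lambda w: 2.0 * w
w_prev, w = None, np.array([1.0, -2.0])
for _ in range(50):
    w, w_prev = update_weights(w, grad_E, eta=0.1, w_prev=w_prev, alpha=0.5), w
print(w)   # approaches the minimum at 0
```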
43
Geometric interpretation of the Weight adjustment
Consider a very simple network with 2 inputs and 1 output, and no hidden layer. There are only two weights whose values need to be specified.
E(w1, w2) = Σ [Yi − Vi(w1, w2)]²
  • A pair ( w1, w2 ) is a point on 2-D plane.
  • For any such point we can get a value of E.
  • Plot E vs ( w1, w2 ) - a 3-D surface - Error
    Surface
  • Aim is to identify that pair for which E is
    minimum
  • That means identify the pair for which the
    height of the error surface is minimum.
  • Gradient Descent Algorithm
  • Start with a random point ( w1, w2 )
  • Move to a better point ( w1, w2 ) where the
    height of error surface is lower.
  • Keep moving till you reach ( w1, w2 ), where
    the error is minimum.

44
Crawling the Error Surface
45
Training Algorithm
Decide the Network architecture ( Hidden
layers, Neurons in each Hidden Layer)
Decide the Learning parameter and Momentum
Initialize the Network with random weights
Feed the i-th observation forward through the Net
Compute the prediction error on the i-th observation
Back propagate the error and adjust the weights
E = Σ (Yi − Vi)²
Check for Convergence
46
Convergence Criterion
When to stop training the Network ?
Ideally, when we reach the global minimum of the error surface
We don't
How do we know we have reached there?
  • Suggestion
  • Stop if the decrease in total prediction error
    (since last cycle) is small.
  • Stop if the overall changes in the weights (since
    last cycle) are small.

Drawback: the error keeps on decreasing, so we get a very good fit to the training data. BUT the network thus obtained has poor generalizing power on unseen data. This phenomenon is also known as over-fitting of the training data: the network is said to "memorize" the training data, so that when an X from the training set is given, the network faithfully produces the corresponding Y. However, for Xs which the network didn't see before, it predicts poorly.
47
Convergence Criterion
Modified Suggestion Partition the training
data into Training set and Validation set Use
Training set - build the model Validation
set - test the performance of the model on unseen
data
Typically, as we run more and more training cycles, the error on the Training set keeps decreasing, while the error on the Validation set first decreases and then increases.
Stop training when the error on Validation set
starts increasing
48
Choice of Training Parameters
The Learning Parameter and Momentum need to be supplied by the user from outside. Both should be between 0 and 1. What should the optimal values of these training parameters be? There is no clear consensus on any fixed strategy. However, the effects of wrongly specifying them are well studied.
Learning Parameter. Too big: large leaps in weight space, with a risk of missing the global minimum. Too small: takes a long time to converge to the global minimum, and once stuck in a local minimum it is difficult to get out of it.
Suggestion: trial and error. Try various choices of Learning Parameter and Momentum and see which choice leads to the minimum prediction error.
49
Wrap Up
  • Artificial Neural Network (ANN): a class of models inspired by biological neurons
  • Used for various modeling problems: prediction, classification, clustering, ...
  • One particular subclass of ANNs: feed-forward back-propagation networks
  • Organized in layers: input, hidden, output
  • Each layer is a collection of a number of artificial neurons
  • Neurons in one layer are connected to neurons in the next layer
  • Connections have weights
  • Fitting an ANN model means finding the values of these weights.
  • Given a training data set, the weights are found by the feed-forward back-propagation algorithm, which is a form of the Gradient Descent Method, a popular technique for function minimization.
  • Network architecture as well as the training
    parameters are decided upon by trial and error.
    Try various choices and pick the one that gives
    lowest prediction error.

50
Instance Based Learning
  • Key idea: just store all training examples ⟨xi, f(xi)⟩
  • Nearest neighbor
  • Given query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn)
  • K-nearest neighbor
  • Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued)
  • Take the mean of the f values of the k nearest neighbors (if real-valued): f(xq) = Σ(i=1..k) f(xi) / k
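A minimal sketch of the k-nearest-neighbor estimate for a real-valued target; the training data and the query point are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """k-NN estimate: mean of the target values of the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances to the query
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    return y_train[nearest].mean()                     # real-valued target: take the mean

# Hypothetical training data
rng = np.random.default_rng(6)
X_train = rng.uniform(-1, 1, size=(50, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] ** 2
print(knn_predict(X_train, y_train, x_query=np.array([0.1, 0.2]), k=3))
```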

51
Voronoi Diagram
query point qf
nearest neighbor qi
52
3-Nearest Neighbors
query point qf
3 nearest neighbors
2x,1o
53
7-Nearest Neighbors
query point qf
7 nearest neighbors
3x,4o
54
Nearest Neighbor (continuous)
1-nearest neighbor
55
Nearest Neighbor (continuous)
3-nearest neighbor
56
Nearest Neighbor (continuous)
5-nearest neighbor
57
When to Consider Nearest Neighbors
  • Instances map to points in R^N
  • Less than 20 attributes per instance
  • Lots of training data
  • Advantages
  • Training is very fast
  • Learn complex target functions
  • Do not lose information
  • Disadvantages
  • Slow at query time
  • Easily fooled by irrelevant attributes

58
Locally Weighted Regression
  • Give more weight to neighbors closer to the query
    point
  • The kernel function is the function of distance used to determine the weight of each training example; in other words, the kernel function is the function K such that wi = K(d(xi, xq))

59
Kernel Functions
60
Distance Weighted k-NN
  • Give more weight to neighbors closer to the query
    point
  • f(xq) = Σ(i=1..k) wi f(xi) / Σ(i=1..k) wi
  • where wi = K(d(xq, xi))
  • and d(xq, xi) is the distance between xq and xi
  • Instead of only the k nearest neighbors, use all training examples (Shepard's method)
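A minimal sketch of the distance-weighted k-NN estimate, using the inverse-squared-distance kernel shown on the next slide; the data and query point are hypothetical.

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_query, k=5,
                          kernel=lambda d: 1.0 / (d ** 2 + 1e-12)):
    """Distance-weighted k-NN: f(xq) = sum(wi * f(xi)) / sum(wi), with wi = K(d(xq, xi))."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = kernel(dists[nearest])                 # closer neighbors get larger weights
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Hypothetical data
rng = np.random.default_rng(7)
X_train = rng.uniform(-1, 1, size=(50, 2))
y_train = X_train[:, 0] ** 2 + X_train[:, 1]
print(distance_weighted_knn(X_train, y_train, x_query=np.array([0.0, 0.5])))
```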

61
Distance Weighted NN
K(d(xq, xi)) = 1 / d(xq, xi)²
62
Distance Weighted NN
K(d(xq, xi)) = 1 / (d0 + d(xq, xi))²
63
Distance Weighted NN
K(d(xq, xi)) = exp(−(d(xq, xi) / σ0)²)
64
Curse of Dimensionality
  • Curse of dimensionality nearest neighbor is
    easily misled when instance space is
    high-dimensional
  • One approach
  • Stretch the j-th axis by weight zj, where z1, ..., zn are chosen to minimize prediction error
  • Use cross-validation to automatically choose the weights z1, ..., zn
  • Note: setting zj to zero eliminates this dimension altogether (feature subset selection)

65
Distance Weighted Average
  • Weighting the data
  • f(xq) = Σi f(xi) K(d(xi, xq)) / Σi K(d(xi, xq))
  • The relevance of a data point (xi, f(xi)) is measured by the distance d(xi, xq) between the query xq and the input vector xi
  • Weighting the error criterion
  • E(xq) = Σi (f(xq) − f(xi))² K(d(xi, xq))
  • The best estimate f(xq) will minimize the cost E(q), therefore ∂E(q)/∂f(xq) = 0

66
Locally Weighted Regression
  • Local
  • the function is approximated based only on data
    near the query point.
  • Weighted
  • the contribution of each training example is
    weighted by its distance from the query point.
  • Regression
  • approximating a real-valued function

67
A Local Approximation
  • Method 1: minimize the squared error over the k nearest neighbors
  • Method 2: minimize the squared error over the entire set D, with weights
  • Method 3: combine 1 and 2

68
Local Linear Models
  • Estimate the parameters β such that they locally (near the query point xq) match the training data, either by
  • weighting the data:
  • wi = K(d(xi, xq))^(1/2), and transforming
  • zi = wi xi
  • vi = wi yi
  • or by weighting the error criterion:
  • E = Σ(i=1..N) (xiᵀβ − yi)² K(d(xi, xq))
  • still linear in β, with the LSQ solution
  • β = ((WX)ᵀ WX)⁻¹ (WX)ᵀ W F(X)
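A minimal sketch of a local linear model that weights the data by the square root of a Gaussian kernel and then solves an ordinary least-squares problem, following the weighting-the-data route above; the data, bandwidth, and query point are hypothetical.

```python
import numpy as np

def locally_weighted_regression(X, y, x_query, bandwidth=0.3):
    """Fit a linear model with kernel weights around the query point, then predict there."""
    Xb = np.column_stack([np.ones(len(X)), X])            # add an intercept column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    d = np.linalg.norm(X - x_query, axis=1)               # distances to the query point
    w = np.exp(-(d / bandwidth) ** 2)                     # kernel weights K(d(xi, xq))
    sqrt_w = np.sqrt(w)[:, None]                          # weight the data by K^(1/2)
    beta, *_ = np.linalg.lstsq(sqrt_w * Xb, sqrt_w[:, 0] * y, rcond=None)
    return xq @ beta

# Hypothetical 1-D example
rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=80)
print(locally_weighted_regression(X, y, x_query=np.array([0.5])))
```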

69
Design Issues in Local Regression
  • Local model order (constant, linear, quadratic)
  • Distance function d
  • feature scaling: d(x, q) = (Σ(j=1..d) mj (xj − qj)²)^(1/2)
  • irrelevant dimensions: mj = 0
  • kernel function K
  • smoothing parameter (bandwidth) h in K(d(x, q)/h)
  • h constant: global bandwidth
  • h = distance to the k-th nearest neighbor point
  • h = h(q): depending on the query point
  • h = hi: depending on the stored data points

70
Remarks on Locally Weighted Regression
  • In most cases, the target function is
    approximated by a constant, linear, or quadratic
    function
  • More complex functional forms are not used
    because
  • The cost of fitting more complex functions for
    each query instance is high.
  • These simple approximations model the target
    functions quite well over a sufficiently small
    subregion of the instance space.

71
RBF Networks
[Network diagram: d input nodes x1, ..., xd feed H hidden-layer radial basis functions yj (with spread constant σ), which feed c output nodes z1, ..., zc through linear activation functions; Wji are the input-to-hidden weights, Wkj the hidden-to-output weights, and netk the net input to output unit k (i = 1, ..., d; j = 1, ..., H; k = 1, ..., c)]
72
RBFN Principle of Operation
Using Gaussian radial basis functions
Using sigmoidal radial basis functions
73
Radial Basis Function Network
f(x) = w0 + Σ(u=1..k) wu Ku(d(xu, x))
  • where xu is an instance from X,
  • Ku(d(xu, x)) is a kernel function
  • One common choice for Ku(d(xu, x)) is the Gaussian Ku(d(xu, x)) = exp(−d²(xu, x) / (2σu²))

74
Training Radial Basis Function Networks
  • Q-1: What xu to use for each kernel function Ku(d(xu, x))?
  • Scatter them uniformly throughout instance space
  • Or use training instances (reflects the instance distribution)
  • Q-2: How to train the weights (assume here Gaussian Ku)?
  • First choose the variance (and perhaps the mean) for each Ku (e.g. use EM)
  • Then hold Ku fixed, and train the linear output layer
  • Efficient methods to fit linear functions
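A minimal sketch of this two-step recipe: pick centers from the training instances, hold the Gaussian kernels fixed, and fit the linear output layer by least squares; the data, the number of centers, and the kernel width are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Gaussian RBF activations exp(-||x - c||^2 / (2*sigma^2)), plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack([np.ones(len(X)), np.exp(-d2 / (2.0 * sigma ** 2))])

# Hypothetical data
rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sin(2 * X[:, 0]) * X[:, 1] + rng.normal(scale=0.05, size=100)

centers = X[rng.choice(len(X), size=10, replace=False)]  # Q-1: use training instances
Phi = rbf_design(X, centers, sigma=0.5)                  # Q-2: hold the Gaussian kernels fixed
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # ...and train the linear output layer
print(np.mean((y - Phi @ w) ** 2))                       # training mean squared error
```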

75
Training Radial Basis Function Networks
  • Training
  • construct kernel function
  • adjust weights
  • RBF networks provide a global approximation to
    the target function, represented by a linear
    combination of many local kernel functions.

76
Local Linear Models
77
Linear Local Model Example
78
Linear Local Model Example
79
Tree-Based Methods
  • Overview
  • Principle behind: divide and conquer
  • Variance will be increased
  • Finesse the curse of dimensionality at the price of mis-specifying the model
  • Partition the feature space into a set of rectangles
  • For simplicity, use recursive binary partitions
  • Fit a simple model (e.g. a constant) for each rectangle
  • Classification and Regression Trees (CART)
  • Regression Trees
  • Classification Trees
  • Hierarchical Mixtures of Experts (HME)

80
CART
  • An example (in regression case)

81
Regression Trees
  • Partition the space into M regions R1, R2, ..., RM and fit f(x) = Σ(m=1..M) cm I(x ∈ Rm), where cm is typically the average of the yi in region Rm.

82
How CART Sees An Elephant
It was six men of Indostan To learning much
inclined, Who went to see the Elephant (Though
all of them were blind), That each by
observation Might satisfy his mind . -- The
Blind Men and the Elephant by John Godfrey Saxe
(1816-1887)
83
Regression Trees Grow the Tree
  • The best partition minimizes the sum of squared errors
  • Finding the global minimum is computationally infeasible
  • Greedy algorithm: at each level, choose the splitting variable j and split value s that most reduce the squared error (a sketch follows below)
  • The greedy algorithm makes the tree unstable
  • Errors made at the upper levels are propagated to the lower levels
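A minimal sketch of one greedy CART split: scan every variable j and candidate split value s, fit a constant (the mean) in each half, and keep the pair with the smallest summed squared error; the data are hypothetical.

```python
import numpy as np

def best_split(X, y):
    """Greedy regression-tree split: return (j, s, error) minimizing the summed squared error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Hypothetical data: the response depends mainly on the first feature
rng = np.random.default_rng(10)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] > 0.6, 2.0, 0.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # should pick j = 0 with s near 0.6
```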

84
Local Linear Model Tree (LOLIMOT)
  • incremental tree construction algorithm
  • partitions input space by axis-orthogonal splits
  • adds one local linear model per iteration
  • start with an initial model (e.g. single LLM)
  • identify LLM with worst model error Ei
  • check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension
  • find best (smallest error) out of possible
    divisions
  • add new validity function and LLM
  • repeat from step 2 until the termination criterion is met

85
LOLIMOT
Initial global linear model
Split along x1 or x2
Pick split that minimizes model error (residual)
86
LOLIMOT Example
87
LOLIMOT Example
88
Regression Tree: how large should we grow the tree?
  • Trade-off between accuracy and generalization
  • A very large tree overfits
  • A small tree might not capture the structure
  • Strategies
  • 1: split only when we can decrease the error (short-sighted, e.g. XOR)
  • 2: cost-complexity pruning (preferred)

89
Regression Tree - Pruning
  • Cost-complexity pruning
  • Pruning: collapsing some internal nodes
  • Cost complexity = training error + α × (number of terminal nodes)
  • Choose the best α by weakest-link pruning
  • Each time, collapse the internal node that adds the smallest error
  • Choose the best tree from this sequence by cross-validation

90
Discussions on Trees
  • Linear Combination Splits
  • Split the node based on a linear combination of the inputs, Σj aj Xj ≤ s
  • Improves predictive power
  • Hurts interpretability