Lecture 11. Bayesian Regression with conjugate and non-conjugate priors

1
Lecture 11. Bayesian Regression with conjugate
and non-conjugate priors
  • Set-up of the Bayesian Regression Model
  • The standard improper non-informative prior
  • Conjugate Prior Analysis

2
A really, really fast review of multiple
regression
  • Many studies concern the relationship between two
    or more observable quantities.
  • How do changes in quantity y (the dependent
    variable) vary as a function of another quantity
    x (the independent variable)?
  • Regression models allow us to examine the
    conditional distribution of y given x,
    parameterized as p(y | β, x), when the n observations
    (yi, xi) are exchangeable.
  • The normal linear model occurs when the
    distribution of y given x is normal with a mean
    equal to a linear function of X:
    E(yi | β, X) = β1X1i + ... + βkXki for i = 1, ..., n,
    where X1 is a vector of ones.
  • The ordinary linear regression model occurs when
    the variance of y given X and β is assumed to be
    constant over all observations.
  • In other words, we have an ordinary linear
    regression model when
  • yi ~ N(β1X1i + ... + βkXki, σ²) for i = 1, ..., n
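
To make this concrete, here is a minimal numpy sketch that simulates data
from an ordinary linear regression model with X1 as a column of ones; the
variable names and the particular values of β and σ are illustrative, not
taken from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 50                                   # number of observations
    beta = np.array([1.0, 2.5])              # illustrative coefficients; beta[0] is the intercept
    sigma = 0.5                              # illustrative error standard deviation

    X = np.column_stack([np.ones(n),           # X1: vector of ones
                         rng.normal(size=n)])  # X2: a single covariate
    y = X @ beta + rng.normal(0.0, sigma, size=n)   # yi ~ N(beta1*X1i + beta2*X2i, sigma^2)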

3
More review of regression
  • If yi ~ N(β1X1i + ... + βkXki, σ²), then it is well
    known that the ordinary least squares estimates
    and the maximum likelihood estimates of the
    parameters β1, ..., βk are equivalent.
  • If β = (β1, ..., βk)ᵀ, then the frequentist
    estimate of β is
  • β̂ = (XᵀX)⁻¹XᵀY
  • We know by the Gauss-Markov theorem that this
    estimate is BLUE.
  • The frequentist estimate of σ² is
    s² = (Y - Xβ̂)ᵀ(Y - Xβ̂)/(n - k)
  • The uncertainty about the quantities β is
    summarized by the regression coefficients'
    standard errors, which are the square roots of the
    diagonal elements of the matrix Var(β̂) = s²(XᵀX)⁻¹
  • Finally, we know that if Vi is the ith diagonal
    element of Var(β̂), then (β̂i - 0)/Vi^(1/2) ~ t(n-k)
  • This statistic forms the basis for our hypothesis
    tests.
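
A minimal numpy/scipy sketch of these frequentist quantities (the function
and variable names are illustrative):

    import numpy as np
    from scipy import stats

    def ols_summary(X, y):
        """Classical OLS/ML estimates, standard errors, and t statistics."""
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y            # (X'X)^-1 X'Y
        resid = y - X @ beta_hat
        s2 = resid @ resid / (n - k)            # (Y - X beta_hat)'(Y - X beta_hat)/(n - k)
        var_beta = s2 * XtX_inv                 # Var(beta_hat) = s^2 (X'X)^-1
        se = np.sqrt(np.diag(var_beta))         # standard errors
        t_stats = beta_hat / se                 # (beta_hat_i - 0)/V_i^(1/2) ~ t(n - k)
        p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
        return beta_hat, s2, se, t_stats, p_values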

4
The Bayesian Setup
  • For the normal linear model, we have
  • yi ~ N(μi, σ²) for i = 1, ..., n
  • where μi is just shorthand for the expression
  • μi = β0 + β1X1i + ... + βkXki
  • The object of statistical inference is the
    posterior distribution of the parameters β0, ..., βk
    and σ².
  • By Bayes' Rule, we know that this is simply
  • p(β0, ..., βk, σ² | Y, X) ∝ p(β0, ..., βk, σ²) × Πi p(yi | μi, σ²)

5
Bayesian regression with standard non-informative
priors
  • By Bayes' Rule, the posterior distribution is
  • p(β0, ..., βk, σ² | Y, X) ∝ p(β0, ..., βk, σ²) × Πi p(yi | μi, σ²)
  • To make inferences about the regression
    coefficients, we obviously need to choose a prior
    distribution for (β, σ²).
  • The standard non-informative prior distribution
    is uniform on (β, log σ²), which is equivalent to
  • p(β, σ²) ∝ σ⁻²
  • This prior is a good choice for statistical
    models when you have a lot of data points and
    only a few parameters.
  • Why? Because if you have a lot of data and
    few parameters, then the likelihood function is
    very sharply peaked, which means that the
    likelihood (the data) will dominate posterior
    inferences. With small sample sizes or a lot of
    parameters, prior distributions or hierarchical
    models become more important for analysis.

6
Bayesian regression with non-informative priors
  • If Y | β, σ², X ~ N(Xβ, σ²I) and p(β, σ²) ∝ σ⁻²,
    then it follows that the conditional posterior
    distribution p(β | σ², data) can be written
  • p(β | σ², data) = N_multivariate(β̂, σ²(XᵀX)⁻¹),
    where β̂ = (XᵀX)⁻¹XᵀY
  • This follows by completing the square, like
    we have seen over and over again when dealing
    with the normal distribution.
  • The posterior distribution of σ² can be written
    as
  • p(σ² | data) = Scaled Inv-χ²(n - k, s²), where
    s² = (Y - Xβ̂)ᵀ(Y - Xβ̂)/(n - k)
  • To make inferences about the marginal posterior
    of β, we can either
  • 1) take repeated samples from the posterior of
    σ² and then from β | σ², like we did when we studied
    the normal model with unknown mean and variance
    (see the sketch below), or
  • 2) integrate out σ² from the conditional
    posterior for β and find that the marginal
    distribution of β is a multivariate t
    distribution
  • p(β | data) = multivariate t(n-k)(β̂, s²(XᵀX)⁻¹)
  • Notice the close correspondence with the classical
    results. The key difference is the
    interpretation of the standard errors.
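
The direct-simulation option 1) is easy to sketch in numpy. The following is
a minimal illustration under the non-informative prior; the function and
variable names are chosen for this example and are not from the lecture.

    import numpy as np

    def posterior_draws_flat_prior(X, y, n_draws=5000, seed=1):
        """Draws from p(beta, sigma^2 | data) under p(beta, sigma^2) proportional to 1/sigma^2."""
        rng = np.random.default_rng(seed)
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        resid = y - X @ beta_hat
        s2 = resid @ resid / (n - k)
        # 1) sigma^2 | data ~ Scaled-Inv-chi^2(n - k, s^2)
        sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
        # 2) beta | sigma^2, data ~ N(beta_hat, sigma^2 (X'X)^-1)
        betas = np.array([rng.multivariate_normal(beta_hat, v * XtX_inv) for v in sigma2])
        return betas, sigma2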

7
Congdon Example 4.1
  • In Example 4.1, Congdon considers a peculiar
    example concerning a method for weighing small
    objects, where the method has an error that is
    assumed to be normal with mean zero.
  • The dependent variable is the recorded weight and
    the independent variables are dummy variables for
    the presence or absence of object A and object B:
    X1 = 1 if A is present and 0 otherwise, X2 = 1 if B
    is present and 0 otherwise. Both objects may be
    weighed together.
  • Thus, yi ~ N(β1X1i + β2X2i, σ²) (note: no
    intercept).
  • β1 = weight of object A and β2 = weight of object
    B.
  • Congdon adopts the reference prior p(β1, β2, σ²)
    ∝ σ⁻²

8
WinBugs Implementation of Example 4.1
    model {
      for (i in 1:18) {                    # this set of statements generates the sample variance s2
        yhat[i] <- b[1]*X[i,1] + b[2]*X[i,2]
        e[i]    <- y[i] - yhat[i]
        e2[i]   <- pow(e[i], 2)
      }
      s2 <- sum(e2[]) / 16                 # this equals sum(e^2) / (n - k)
      for (i in 1:2) {                     # mean of regression coefficients
        b[i]  <- inprod(XTX.INV[i,], Xy[])
        Xy[i] <- inprod(X[,i], y[])
        for (j in 1:2) {
          Prec[i,j] <- XTX[i,j] / s2
          Disp[i,j] <- s2 * XTX.INV[i,j]
          XTX[i,j]  <- inprod(X[,i], X[,j])
        }
      }
      XTX.INV[1:2,1:2] <- inverse(XTX[,])
    }
  • Note: WinBUGS now has a different command for
    matrix inversion than the one used by Congdon.

9
Bayesian Model Evaluation
  • Model evaluation proceeds just like it would in
    the traditional OLS framework. Models are largely
    evaluated based on their fitted (predicted)
    values.
  • In this respect, Bayesian model evaluation is
    easy.
  • The fitted value for an observation i is a random
    draw from a t distribution with mean Xiβ̂ and
    variance s²(1 + Xi(XᵀX)⁻¹Xiᵀ) on n - k degrees of
    freedom (see the sketch below). In most cases, you
    can simply use Xiβ̂ as the fitted value and ignore
    this additional uncertainty.
  • Other diagnostics that you should use are plots
    of errors versus covariates, histograms of the
    error term, etc., to make sure that the
    assumptions of the normal linear model hold.
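
One way to generate fitted values that carry the full posterior uncertainty
is to push posterior draws of (β, σ²) through the model. A minimal numpy
sketch, reusing the hypothetical posterior_draws_flat_prior function from
the earlier slide:

    import numpy as np

    def posterior_fitted_values(X, betas, sigma2, seed=2):
        """Replicated data y_rep ~ N(X beta, sigma^2) for each posterior draw.

        betas  : (n_draws, k) array of posterior draws of beta
        sigma2 : (n_draws,) array of posterior draws of sigma^2
        """
        rng = np.random.default_rng(seed)
        mu = betas @ X.T                              # (n_draws, n) fitted means X_i beta
        y_rep = mu + rng.normal(size=mu.shape) * np.sqrt(sigma2)[:, None]
        return mu.mean(axis=0), y_rep                 # point fitted values, replicated data

Comparing y_rep with the observed y (for example, via the residual plots and
histograms mentioned above) is one way to check the assumptions of the
normal linear model.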

10
Comment on the regression setup
  • There are two subtle points regarding the
    Bayesian regression setup.
  • First, a full Bayesian model includes a
    distribution for the independent variable X,
    p(X | ψ). Therefore, we have a joint likelihood
    p(X, Y | β, σ², ψ) and a joint prior p(β, σ², ψ). The
    fundamental assumption of the normal linear model
    is that the parameters (β, σ²) of p(y | X, β, σ²)
    and the parameter ψ of p(X | ψ) are independent
    in their prior distributions, such that the
    posterior distribution factors into
    p(β, σ², ψ | X, Y) = p(β, σ² | X, Y) p(ψ | X, Y)
  • As a result, p(β, σ² | X, Y) ∝ p(β, σ²) p(Y | β, σ², X)
  • Second, when we set up our probability model, we
    are implicitly conditioning on a model, call it
    H, which represents our beliefs about the
    data-generating process. Thus,
  • p(β, σ² | X, Y, H) ∝ p(β, σ² | H) p(Y | β, σ², X, H)
  • It is important to keep in mind that our
    inferences are dependent on H, and this is
    equally true for the frequentist perspective,
    where results can be dependent on the choice of
    likelihood function, covariates, etc.

11
Conjugate priors and the normal linear model
  • Suppose that instead of an improper prior, we
    decide to use the conjugate prior.
  • For the normal regression model, the conjugate
    prior distribution for (β0, ..., βk, σ²) is the
    normal-inverse-gamma distribution.
  • We've seen this distribution before when we
    studied the normal model with unknown mean and
    variance. We know that this distribution can be
    factored such that
  • p(β0, ..., βk, σ²) = p(β0, ..., βk | σ²) p(σ²)
  • p(β0, ..., βk | σ²) = N_multivariate(β_prior, Σ_prior),
  • where Σ_prior is the prior covariance matrix for
    the β's,
  • and p(σ²) = Inverse-Gamma(a_prior, b_prior).
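
Written out, and under the common convention in which the conditional prior
covariance is scaled by σ² (an assumption of this sketch; the slide does not
make the scaling explicit), the conjugate prior and the resulting posterior
update can be stated in LaTeX notation as

    \beta \mid \sigma^2 \sim \mathrm{N}\left(\beta_{\mathrm{prior}},\ \sigma^2 \Sigma_{\mathrm{prior}}\right),
    \qquad
    \sigma^2 \sim \mathrm{Inverse\text{-}Gamma}\left(a_{\mathrm{prior}},\ b_{\mathrm{prior}}\right)

    \Sigma_{\mathrm{post}} = \left(\Sigma_{\mathrm{prior}}^{-1} + X^{\mathsf T} X\right)^{-1},
    \qquad
    \beta_{\mathrm{post}} = \Sigma_{\mathrm{post}}\left(\Sigma_{\mathrm{prior}}^{-1}\beta_{\mathrm{prior}} + X^{\mathsf T} X\,\hat{\beta}\right)

    a_{\mathrm{post}} = a_{\mathrm{prior}} + \tfrac{n}{2},
    \qquad
    b_{\mathrm{post}} = b_{\mathrm{prior}} + \tfrac{1}{2}\left(Y^{\mathsf T} Y
        + \beta_{\mathrm{prior}}^{\mathsf T}\Sigma_{\mathrm{prior}}^{-1}\beta_{\mathrm{prior}}
        - \beta_{\mathrm{post}}^{\mathsf T}\Sigma_{\mathrm{post}}^{-1}\beta_{\mathrm{post}}\right)

where β̂ = (XᵀX)⁻¹XᵀY is the OLS estimate.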

12
Conjugate priors and the normal linear model
  • If we use a conjugate prior, then the posterior
    distribution will have the same form as the
    prior. Thus, the posterior distribution will also
    be normal-inverse-gamma. If we integrate out σ²,
    the marginal for β will be a multivariate t
    distribution.

- Notice that the posterior mean of the coefficients is
  essentially a weighted average of the prior coefficients
  described by β_prior and the standard OLS estimates β̂.
- The weights are provided by the conditional prior
  precision Σ⁻¹ and the data XᵀX. This should make clear
  that as we increase our prior precision (decrease our
  prior variance) for β, we place greater posterior weight
  on our prior beliefs relative to the data.
- Note: Zellner (1971) treats β_prior and the conditional
  prior variance Σ in the following way: suppose you have
  two data sets (Y1, X1) and (Y2, X2). He sets β_prior
  equal to the posterior mean from a regression analysis
  of (Y1, X1) with the improper prior 1/σ², and sets the
  conditional prior precision Σ⁻¹ equal to X1ᵀX1.
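
A small numpy sketch of this weighted-average structure, under the σ²-scaled
conjugate prior assumed in the sketch above (all names are illustrative):

    import numpy as np

    def conjugate_posterior(X, y, beta_prior, Sigma_prior, a_prior, b_prior):
        """Normal-inverse-gamma update, assuming beta | sigma^2 ~ N(beta_prior, sigma^2 Sigma_prior)."""
        n, k = X.shape
        XtX = X.T @ X
        beta_hat = np.linalg.solve(XtX, X.T @ y)         # standard OLS estimate
        prior_prec = np.linalg.inv(Sigma_prior)          # conditional prior precision
        Sigma_post = np.linalg.inv(prior_prec + XtX)
        # posterior mean: precision-weighted average of beta_prior and beta_hat
        beta_post = Sigma_post @ (prior_prec @ beta_prior + XtX @ beta_hat)
        a_post = a_prior + n / 2.0
        b_post = b_prior + 0.5 * (y @ y
                                  + beta_prior @ prior_prec @ beta_prior
                                  - beta_post @ np.linalg.inv(Sigma_post) @ beta_post)
        return beta_post, Sigma_post, a_post, b_post

As the prior precision grows (smaller prior variance), beta_post moves toward
beta_prior; as XᵀX grows (more data), it moves toward the OLS estimate beta_hat.
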
13
Conjugate priors and the normal linear model
  • To summarize our uncertainty about the
    coefficients, we use the posterior
    variance-covariance matrix for β.

The posterior standard deviations can be taken
from the square roots of the diagonal elements of
this matrix.
The main object of interest here is the
expression for the posterior scale parameter of
the inverse-gamma distribution for σ².
The second term is the maximum likelihood
estimate of the variance. The third term states
that our variance estimate will be greater if
our prior values for the regression coefficients
differ from their posterior values, especially if
we indicate a great deal of confidence in our
prior beliefs by assigning small variances in the
matrix Σ. The fourth term states that our
variance estimate for the regression
coefficients will be greater if the standard OLS
estimates differ from the posterior values,
especially if XᵀX is large.
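
One standard way to write these two quantities, consistent with the
term-by-term description above (a sketch under the σ²-scaled conjugate
parameterization and the β_post, Σ_post, a_post, b_post notation used in the
earlier sketches, not necessarily the lecture's exact notation), is

    \operatorname{Var}(\beta \mid Y, X) \;=\; \frac{b_{\mathrm{post}}}{a_{\mathrm{post}} - 1}\,\Sigma_{\mathrm{post}},
    \qquad a_{\mathrm{post}} > 1

    b_{\mathrm{post}} \;=\; b_{\mathrm{prior}}
      \;+\; \tfrac{1}{2}\,(Y - X\hat{\beta})^{\mathsf T}(Y - X\hat{\beta})
      \;+\; \tfrac{1}{2}\,(\beta_{\mathrm{prior}} - \beta_{\mathrm{post}})^{\mathsf T}\Sigma_{\mathrm{prior}}^{-1}(\beta_{\mathrm{prior}} - \beta_{\mathrm{post}})
      \;+\; \tfrac{1}{2}\,(\hat{\beta} - \beta_{\mathrm{post}})^{\mathsf T}X^{\mathsf T}X\,(\hat{\beta} - \beta_{\mathrm{post}})

The posterior standard deviations are then the square roots of the diagonal
elements of Var(β | Y, X).
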
A WinBUGS implementation would proceed in a manner
akin to the earlier example.