Understanding Bayesian Multiple Regression
Dear All, I'm new to Bayesian statistics, and I'm still trying to understand some of the assumptions and differences between this method and frequentist statistics. Basically, I want to understand the association between x1 and y, while also accounting for the variance of additional variables (x2 and x3). Long post coming up:
I'm currently running into something I don't understand regarding Bayesian multiple linear regression versus ordinary multiple linear regression. What I'm really interested in is the effect of one continuous variable, let's call it x1, on an outcome variable, y. I'm also interested in "controlling for" additional regressors, x2 and x3, by including them in the model. In other words, I want to know about the effect of x1 on y while accounting for the additional effects of x2 and x3, or y ~ x1 + x2 + x3 (with x1 being the main factor that I'm interested in).

In an ordinary multiple linear regression, I can look at the t score of x1 to discern whether this variable has a statistically significant effect on y (all this is relatively straightforward, right?). We could also calculate the same effect a different way: residualize x1 on x2 and x3 (x1 ~ x2 + x3), residualize y on x2 and x3 (y ~ x2 + x3), and then correlate the x1 residuals with the y residuals. We're still looking at the relationship between x1 and y while accounting for the variance attributed to x2 and x3, and we get the same resulting t-value and the same amount of variance accounted for (this equivalence is the Frisch-Waugh-Lovell theorem). In other words, it's the same calculation, just done two different ways, as the sketch below shows.
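To make that concrete, here is a minimal sketch in Python with statsmodels. The data are simulated and every coefficient value is made up; it just demonstrates the equivalence:

```python
# One-step vs. two-step (residualized) multiple regression on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x2, x3 = rng.normal(size=(2, n))
x1 = 0.5 * x2 + rng.normal(size=n)                  # x1 correlates with x2
y = -5.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(size=n)

# Full model: y ~ x1 + x2 + x3
X_full = sm.add_constant(np.column_stack([x1, x2, x3]))
fit_full = sm.OLS(y, X_full).fit()

# Two-step version: residualize x1 and y on (x2, x3), then regress
Z = sm.add_constant(np.column_stack([x2, x3]))
r_x1 = sm.OLS(x1, Z).fit().resid
r_y = sm.OLS(y, Z).fit().resid
fit_res = sm.OLS(r_y, sm.add_constant(r_x1)).fit()

# Identical coefficients; the t-values agree up to a small
# degrees-of-freedom correction in the standard error.
print(fit_full.params[1], fit_res.params[1])
print(fit_full.tvalues[1], fit_res.tvalues[1])
```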
Doing a similar analysis in a Bayesian framework (I'm using JASP), this looks a little different. In this regression I ultimately end up comparing two models: a null model that includes the regressors of no interest, and a model that also includes the factor I'm interested in. As I understand it, my null model is y ~ x2 + x3 and my alternative model is y ~ x1 + x2 + x3, which lets me compare the two and determine the relative effect of x1, giving me a Bayes factor for the inclusion of x1 plus mean values for the regression coefficients of all the factors included.

But what if I want to do a comparison similar to the residual approach in the frequentist regression? Here I take the same residuals from before, the x1 residuals (controlling for x2 + x3) and the y residuals (controlling for x2 + x3), and run a Bayesian linear regression of y residuals ~ x1 residuals. However, the resulting output from this model is not the same as what I get using the Bayesian multiple regression: the mean coefficients are different, as are the Bayes factors. The coefficients are close but not identical (e.g., -5.4 vs. -5.6), and the Bayes factors are dissimilar. Ostensibly these should be accounting for the same amount of variance, so I don't understand why the results diverge (at least in a frequentist sense, these should be identical calculations). To be clear, I'm not asking why the Bayesian and frequentist stats diverge; I'm asking why the two Bayesian analyses (one using all the variables, and one using the residuals) would give different outcomes. Why are they so different?
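For what it's worth, here is how I would line up the two model comparisons in code. This uses the crude BIC approximation to the Bayes factor, BF10 ≈ exp((BIC_null - BIC_alt) / 2), which is not the JZS-style prior that JASP uses by default, so it only sketches the structure of the comparison, not JASP's actual numbers:

```python
# The two model comparisons, using a BIC approximation to the Bayes factor.
import numpy as np
import statsmodels.api as sm

# Same simulated data as in the first sketch
rng = np.random.default_rng(0)
n = 200
x2, x3 = rng.normal(size=(2, n))
x1 = 0.5 * x2 + rng.normal(size=n)
y = -5.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(size=n)

Z = sm.add_constant(np.column_stack([x2, x3]))          # nuisance design
X_full = sm.add_constant(np.column_stack([x1, x2, x3]))
r_x1 = sm.OLS(x1, Z).fit().resid
r_y = sm.OLS(y, Z).fit().resid

# Comparison 1: y ~ x2 + x3  versus  y ~ x1 + x2 + x3
bic_null = sm.OLS(y, Z).fit().bic
bic_full = sm.OLS(y, X_full).fit().bic
bf_full = np.exp((bic_null - bic_full) / 2)

# Comparison 2: intercept-only versus r_y ~ r_x1, on the residuals
bic_res_null = sm.OLS(r_y, np.ones(n)).fit().bic
bic_res_alt = sm.OLS(r_y, sm.add_constant(r_x1)).fit().bic
bf_res = np.exp((bic_res_null - bic_res_alt) / 2)

print(bf_full, bf_res)   # essentially equal under this approximation
```

Interestingly, under this crude approximation the two Bayes factors coincide: the residual sums of squares match by the same frequentist equivalence, and both comparisons add exactly one parameter. That makes me suspect the divergence I see in JASP comes from the prior structure rather than from the likelihoods.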
My understanding is that it's okay that these results are different, but I don't actually know why. I realize it's weird to mix methods in this way, but this is more so for my understanding of how the variance of the additional variables is being accounted for. My inclination is that the difference is due to the model-selection process associated with the Bayesian analysis, but I'm not sure. I still have a beginner's understanding of Bayesian stats and am trying to understand how this works. Can someone help by explaining why these results diverge?
Comments
Dear rpizzie,
Thanks for your thoughtful question. I think there is a (Bayesian) issue with the two-step method, where you first compute the residuals and then introduce x1. First, what you ought to have is a distribution for each residual -- there is uncertainty in the regression coefficients for x2 and x3, and this ought to propagate to the residuals. Of course you could ignore this uncertainty and use the posterior mean, but then the correspondence between the two analysis methods would break down (to examine this, you could use a very, very large N, so that the posterior uncertainty becomes negligible). I am also a little worried about correlations between the betas not being taken into account in the two-step method, but I could be wrong there.
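To illustrate the first point, here is a minimal sketch in Python. It reuses the simulated data from the sketches above and, purely for simplicity, assumes a flat prior on the nuisance coefficients with a plug-in estimate of sigma^2, so the posterior over the coefficients is analytic:

```python
# Posterior uncertainty in the nuisance coefficients propagates to the
# residuals: every observation gets a distribution of residuals.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x2, x3 = rng.normal(size=(2, n))
x1 = 0.5 * x2 + rng.normal(size=n)
y = -5.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(size=n)

Z = np.column_stack([np.ones(n), x2, x3])   # design of the nuisance model

# Flat-prior posterior for the coefficients, with plug-in sigma^2
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta_hat
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)

draws = rng.multivariate_normal(beta_hat, cov, size=5000)  # (5000, 3)
resid_draws = y - draws @ Z.T                              # (5000, n)

# Spread of each observation's residual across posterior draws;
# the two-step analysis collapses this to a single number.
print(resid_draws.std(axis=0)[:5])
```

Each residual comes with a whole posterior distribution; using only the posterior-mean residuals, as the two-step method implicitly does, throws that uncertainty away.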
Cheers,
E.J.
Thanks, E.J., this is very helpful--I appreciate your response. This makes sense to me: the residual model doesn't adequately account for the uncertainty introduced by x2 and x3. Thank you!