Backsliding from Bayes
For the past few years I've been moving toward incorporating Bayesian statistical analysis into my research and teaching. However, the more I learn about the standard implementations of Bayesian analysis, the more of a let-down I feel--somewhat like a religious convert on the verge of backsliding into a life of sin.
To me, the main attraction of Bayesian analysis is that, in principle, it allows us to answer the kind of statistical question that we would really like to answer. Specifically, it allows us to answer the question, “Given the data, what’s the likelihood that the alternative hypothesis is true (i.e., that the effect size is non-zero) relative to the likelihood that the null hypothesis is true?” That would be the posterior odds. But in JASP and R, the emphasis is on outputting the Bayes factor, which is not about the likelihoods of hypotheses, but about the likelihoods of the data: the likelihood of the data under the presumption of a true alternative hypothesis, relative to the likelihood of the data under the presumption of a true null hypothesis. We could, in principle, recover those posterior odds (which I see as the ultimate promise of the Bayesian approach) by multiplying the Bayes factor by the prior odds. But the sticking point is that the prior probabilities for the alternative hypothesis and for the null hypothesis are ill-defined. The priors are well-defined for the range of possible values of the population parameter in question (e.g., the difference between two population means). But since that range can be infinite, it is not so easy to precisely specify the likelihood that the population parameter has some non-zero value.
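In symbols, the relationship I have in mind is just the odds form of Bayes' rule:

$$ \underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{\text{Bayes factor}} \;\times\; \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}} $$

The software reports the middle term, whereas the left-hand term is what I set out to obtain.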
So the end result is that, at least in JASP and in every other implementation I’m aware of, we aren’t given the two posterior probabilities—of the alternative and of the null hypothesis. Instead, we’re given the continuous distribution of posterior probabilities, under the presumption of a somewhat arbitrary (though not unreasonable) distribution of prior probabilities. I’m therefore unable to draw the kind of statistical conclusion I initially set out to draw, which is: “Given the data, the alternative hypothesis is ___ times as likely to be true as is the null hypothesis.” I feel that if I can’t do that, I might as well slide back to the easy and familiar sin of null-hypothesis testing.
Richard Anderson
Bowling Green State University
randers@bgsu.edu
R
Comments
Hi Richard,
I think you might have mixed up the prior distribution and the prior odds.
Say you want to test if there is or isn't a difference between two groups. For this you would construct two models:
For each model we have a prior distribution for the difference - for model 0 (which in our case is a point-null model) our prior is that the difference is exactly 0 (100% of the prior distribution is on this one point). However, for model A, we might specify that the difference is between 1.2 and 5.6 (a bounded uniform distribution), or any other shape that represents our prior belief about what a plausible difference would be if there were one.
Note that these priors do not specify the relative plausibility of these models - just what the plausible values for the difference would be if each model were correct.
The relative plausibility of the models is specified with the prior odds -- the relative probability you give the models before observing the data. If the odds of model A over model 0 are > 1, this means you have a prior belief that there is a non-zero difference.
Thus, you can get many different prior combinations. You can say: a priori, it is more probable that the null is correct, but if it is not, the difference is likely to be huge; or: a priori, it is more probable that the null is wrong, and that the difference is likely to be quite small; etc.
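To make this concrete, here is a minimal sketch in base R. The numbers and the normal likelihood are purely illustrative assumptions of this sketch, not what JASP or BayesFactor actually use:

```r
# Two models for a group difference 'delta'. For simplicity (an assumption of
# this sketch), treat the observed difference dbar as normally distributed
# with a known standard error.
dbar <- 2.3   # hypothetical observed difference
se   <- 1.0   # hypothetical standard error

# Model 0 (point null): all prior mass on delta = 0
marg_lik_0 <- dnorm(dbar, mean = 0, sd = se)

# Model A: bounded uniform prior on delta between 1.2 and 5.6 (as above)
marg_lik_A <- integrate(
  function(delta) dnorm(dbar, mean = delta, sd = se) * dunif(delta, 1.2, 5.6),
  lower = 1.2, upper = 5.6
)$value

# Bayes factor for model A over model 0: the ratio of marginal likelihoods
bf_A0 <- marg_lik_A / marg_lik_0
```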
After observing the data, the prior distribution of each model will be updated, resulting in a posterior distribution for each model: for model 0 it will be unchanged (note: because it is a point null in our example), and for model A the relative plausibility of the possible values (as specified by the prior distribution) will change according to the observed data.
The Bayes factor is affected by the prior distributions of each model (it takes into account the specified relative plausibility of the difference in each model), but the prior odds have no effect on the BF.
The posterior odds, however, take everything into account: they are the result of multiplying the prior odds by the BF (which is affected by the prior distribution), and represent the relative probability of the models after observing data.
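Continuing the sketch above, the arithmetic is just:

```r
# Posterior odds = prior odds x Bayes factor
prior_odds_A0     <- 1                      # e.g., 50:50 before seeing data
posterior_odds_A0 <- prior_odds_A0 * bf_A0  # relative probability of model A
                                            # over model 0, given the data
```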
Hope this helps!
You might also find this paper relevant - Parameter estimation and Bayes factors.
Hi Richard,
So far, as you indicate, JASP has focused mostly on the *evidence*, that is, the relative predictive performance of two competing hypotheses (aka the Bayes factor). The reason that we have not put in prior probabilities for the hypotheses is that these are often highly dependent on the researcher's opinion. So JASP will inform you about the degree to which the data necessitate a change in beliefs, but JASP does not stipulate the end result of your belief. Of course, this is easy to do by hand (just multiply the BF by the prior odds to get the posterior odds).
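For instance, with the BayesFactor package in R (a sketch with illustrative data; the prior odds are the researcher's own judgment, not the software's):

```r
library(BayesFactor)

set.seed(1)
g1 <- rnorm(30, mean = 0.4)   # illustrative data, group 1
g2 <- rnorm(30, mean = 0.0)   # illustrative data, group 2

bf   <- ttestBF(x = g1, y = g2)   # default Cauchy prior on the effect size
bf10 <- extractBF(bf)$bf          # the Bayes factor, H1 over H0

prior_odds     <- 1                  # 50:50, say -- the researcher's call
posterior_odds <- bf10 * prior_odds  # the end result of your belief
```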
However, I am gradually becoming convinced that there is added value to specifying the prior odds (or at least offering the opportunity to play around with that). So in our upcoming version (a week or so away, I promise! :-)) we have an AB test (comparison between two proportions) in which the prior model probabilities *can* be specified. Over time, we plan to adjust our other tests in the same way.
As far as backsliding to p is concerned, I don't think we ought to let the perfect be the enemy of the good. BFs give you the *evidence*, so they inform you about the degree to which the data warrant a change of belief; this is, imo, often exactly what researchers wish to state when they present their experiments: the data are evidence for a proposition when these data bring about a change in belief.
P-values only focus on H0 (and ignore predictions under H1), violate the likelihood principle, do not condition on the data that were observed, do not relate to belief or change in belief, etc etc. When you have a leaky umbrella this does not mean you can just as well jump in a river. :-)
Cheers,
E.J.
Dear MSB,
Thank you. I think you are right and that I was quite wrong. But what this has done is help me better conceptualize and describe what may now be stronger reasons to backslide. I think that my misconception stemmed from what might be seen as a set of compromises that characterize Bayesian statistical analysis in software such as JASP and R's BayesFactor package.
A straightforward way to conceptualize the null hypothesis, H, would be as a point on a flat distribution of possible effect sizes in a population. An alternative hypothesis could be defined as a point located someplace other than at zero on that same distribution. Or perhaps the alternative hypothesis could be defined as ~H (i.e., "Not H;" "Not null"), consisting of an infinite set of points that exclude the zero point. Initially, there would be uncertainty about which of the two hypotheses is the true hypothesis. However, the Bayes factor represents a compromise. Rather than being points on an infinite distribution, each hypothesis is a model of second-order uncertainty. The null hypothesis is a particular "model" of the second-order uncertainty, such that there is zero second-order uncertainty about the population effect size: it is certainly zero and is represented as a distribution with zero variance, centered on zero. In contrast, the alternative hypothesis is an alternate model of the second-order uncertainty: in this model, there is tremendous uncertainty about the location of the population effect size, which is represented by a distribution with a very large variance, BUT STILL CENTERED AT ZERO, WITH ZERO STILL BEING SUBJECTIVELY MORE LIKELY THAN ANY OTHER VALUE. Thus, while the two hypotheses may have 50:50 odds of being true, each of the two models--the two distributions of second-order uncertainty--is represented by its own distribution. This recasting of the hypotheses from being two points (or two sets of points) to being two models of the degree of second-order uncertainty is quite substantial in my view. I would call it a substantial compromise.
A second compromise, it seems to me, is that we naturally would like to apply Bayes' theorem to update our belief in the relative likelihoods of two hypotheses, "H" (the null hypothesis) and "I" (the alternative hypothesis, specified as, say, a Cauchy distribution with a location of 0 and a scale of .707). But the Bayes factor does not support updating with respect to H and I. Instead, it involves updating with respect to H and J, where J is not the same hypothesis as I. Granted, J is conceptually similar to I in that both bear the label "alternative hypothesis," but they are two distinct hypotheses with distinct mathematical descriptions. Hypothesis I changes into hypothesis J because of Bayesian updating at a different conceptual level than the level characterized as 50:50 uncertainty (for example) about the relative likelihoods of H and I. So this seems like a second very substantial compromise (made necessary by the first compromise).
Thus, I still feel a sense of having backslid.
I do have in mind a way to reduce the degree of compromise, by reconceptualizing the Bayesian estimation approach as subsuming a particular approach to Bayesian hypothesis testing (not the Bayes factor), though it might not be mathematically tractable.
R
But the Bayes factor does not support updating with respect to H and I. Instead, it involves updating with respect to H and J, where J is not the same hypothesis as I.
The BF is based on model I completely - it simply incorporates information regarding the data in the same process that also produces the posterior distribution. In other words, you can have completely different BFs and identical posterior distributions, because your priors and data were different.
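In symbols (with delta the effect size and pi(delta) model I's prior distribution), the BF uses the whole of model I through its marginal likelihood, while the posterior distribution of delta is a separate, within-model quantity:

$$ BF_{I0} = \frac{\int p(D \mid \delta)\, \pi(\delta)\, d\delta}{p(D \mid \delta = 0)}, \qquad \pi(\delta \mid D) = \frac{p(D \mid \delta)\, \pi(\delta)}{\int p(D \mid \delta)\, \pi(\delta)\, d\delta} $$

Updating pi(delta) into pi(delta | D) does not replace model I with some new hypothesis J; both quantities are computed under the very same model I.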
However, that "posterior" distribution does not equal:
50:50 × ( P(D | "a zero-variance distribution located at ES = 0.0") / P(D | "a Cauchy distribution located at ES = 0.0 with scale = .707") )
That's because the approach doesn't permit defining the alternative hypothesis as, specifically, the degree of second-order uncertainty around a population effect size, but requires instead that the null hypothesis be defined less specifically, as a process. Correct?
R
The posterior distribution(s) does not equal what you wrote, but the posterior odds do.
Not sure what you mean by:
That's because the approach doesn't permit defining the alternative hypothesis as, specifically, the degree of second-order uncertainty around a population effect size, but requires instead that the null hypothesis be defined less specifically, as a process. Correct?
Both the null and the alternative can be defined however you want - they can both be points, they can both have some continuous distribution, or a truncated distribution of any shape...
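For instance, the BayesFactor package lets you truncate the alternative's prior to an interval (a sketch with illustrative data):

```r
library(BayesFactor)

set.seed(1)
x <- rnorm(40, mean = 0.3)   # illustrative one-sample data

# Truncate the alternative's Cauchy prior to an interval: this returns two
# BFs against the point null, one for delta inside [-0.1, 0.1], one outside.
bf <- ttestBF(x, nullInterval = c(-0.1, 0.1))
bf

bf[2] / bf[1]   # compare "outside the interval" directly against "inside"
```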
BayesFactor (and thus also JASP) uses the priors suggested by Rouder et al. (2012).

I'd like to add that, although it is customary to center the prior distribution around the test value, this is by no means required. In JASP, the t-tests, binomial, multinomial (next release), and AB test (next release) all allow the location of the prior distribution to be away from zero. This is trickier for relatively complicated analyses like ANOVA and regression, but it is not a principled limitation. See for instance https://arxiv.org/abs/1704.02479
Cheers,
E.J.
OK. Here's a slide that I could use for teaching: I've annotated the JASP output to indicate how its various components are to be interpreted. I would be grateful for any feedback concerning errors in the annotation.
R
These look really good! Is it okay if I use them for class?
Just one note: on the left panel, the gray points aren't exactly the point null; they represent the density of the null value under the alternative's prior and posterior distributions. You can read more about these and their relationship to the Bayes factor in EJ's Savage-Dickey ratio paper.
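(The Savage-Dickey result, for a point null nested under the alternative, is that

$$ BF_{01} = \frac{p(\delta = 0 \mid D, H_1)}{p(\delta = 0 \mid H_1)} $$

i.e., the ratio of those two gray points: the posterior and prior density at delta = 0 under the alternative.)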
Yes, you are welcome to use it. But I guess it would be more accurate to use the version below, which says that the null hypothesis is "not shown."
R
Looks very nice! I love the idea of the annotations in general. Will pass this on to the lab as a recommended way to include explanations in tutorial papers. A few suggestions for improvement:
Cheers,
E.J.
OK. This is good. I'm now feeling more positive about increasing the emphasis I place on Bayesian statistical analysis in my classes and my research.
Regarding my annotations:
(1) Perhaps I can say "This distribution is NOT NEEDED for hypothesis testing" (rather than "IRRELEVANT")? That way I can keep things simple.
(2) Is your preference for "This distribution quantifies the relative plausibility of values for effect size in the population" just a preference? I ask because, as long as it is not wrong, I prefer something like "Prior to knowing the data, this is a distribution of the SUBJECTIVE likelihoods of the possible values of the population effect size."
(3) Would the addition of "assuming the effect exists" or "under H1" amount to mixing up two different approaches? With estimation, there isn't a pair of hypotheses--H0 and H1. At best, according to https://www.bayesianspectacles.org/bayes-factors-for-those-who-hate-bayes-factors/ , estimation involves an infinite number of hypotheses. So given that the right side of the slide is supposed to be about estimation, wouldn't it be more correct to leave out "assuming the effect exists" or "under H1"?
R
wrt 2: "likelihood" usually refers to something proportional to p(y|theta); so it has a technical connotation that does not match this particular context.
wrt 3: Correct, with estimation there isn't a pair of hypotheses -- there is only the model that assumes the effect exists, and no special attention is being paid to the hypothesis that the effect may be absent. But this is important to stress, because if H0 did play a role, the posterior distribution for delta would be a mixture between a point mass at zero and a bell curve under H1.
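In symbols, had H0 been in play, the model-averaged posterior would be the spike-and-slab mixture

$$ p(\delta \mid D) = P(H_0 \mid D) \cdot \big(\text{point mass at } \delta = 0\big) \;+\; P(H_1 \mid D) \cdot p(\delta \mid D, H_1) $$

rather than the single smooth curve shown on the estimation side.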
Cheers,
E.J.
OK. My slide is getting really close. I hope it is permitted to speak of a "prior probability" distribution (even though the probabilities don't sum to 1.0), such that:
"Prior to knowing the data, this is a distribution of the subjective probabilities of the possible values of the population effect-size."
R
FYI: My finalized slide.
R
Looks good. I'd be consistent and use "Bayes factors" instead of "Bayes Factors".
E.J.