When the BF and the classical results diverge

Mila_Marinova · September 2020

Hello JASP-ers,

I sometimes get questions such as "Why are the classical results not significant, while the BF shows evidence for X?" For example: "Results showed no main effect of type of activity, F(1,177) = 2.60, p = 0.11, ηp2 = 0.01, and no main effect of age group, F(2,177) = 1.33, p = 0.27, ηp2 = 0.02. Bayesian ANOVA, however, showed evidence for the presence of a main effect of type of activity, BFIncl = 126.73, and a main effect of age, BFIncl = 96.70".

To be honest, I never have a good explanation for why such differences occur. Can you recommend some reading (a paper, blog post, etc.) where I can get more info on the matter?

Thanks in advance,

Mila

EJ · October 2020

Hi Mila,

Good timing, we have started a project to get to the bottom of this. The discrepancy seems to center exclusively on repeated measures ANOVA. Do you have an example data set you could share?

Cheers,

E.J.

Mila_Marinova · October 2020

Hello E.J.,

Thanks for the response! To be honest, I rarely get discrepancies, but sometimes it happens and indeed exclusively with ANOVAs. Whenever the example data set from above is available publicly, I will let you know!

Best

Mila

AceOfBayes · November 2020

Dear JASP experts,

I have also been wondering what it means when inferences from NHST and Bayesian analysis differ. I do find this to be the case for repeated-measures ANOVA, particularly for higher-order interactions. (But also Wilcoxon signed-rank tests yield different results.)

I have a question in particular regarding the differences between classical and Bayesian results for higher-order interactions in rm-ANOVA. Is it possible that the Bayesian model space considered in JASP for this type of analysis is somewhat biased against higher (i.e. 3rd-order in my case) interactions due to the way JASP reduces the number of models to be tested?

I assume that JASP follows the same reasoning described in Rouder et al. (2017, http://dx.doi.org/10.1037/met0000057), which deems implausible all models including an interaction in the absence of the corresponding main effects. This has the convenient property of avoiding a combinatorial explosion of models to consider. Furthermore the paper suggests a procedure according to which only the effects of the "winning" model have any chance of being assigned BF10s larger than 1:

BF of the model including the effect in question divided by the highest-ranking model without that effect. For effects not included in the model, BF10 is the BF of the highest-ranking model including the effect in question divided by the BF of the winning model (which does not have that effect). (I hope I understood this correctly.)

Now for 3-way ANOVAs this means that there is only a single model for which the 3rd-order interaction could be supported by the data. This is only the case if the data also support all main effects and 2nd-order interactions! Furthermore this very specific model including the 3rd-order interaction plus all its additional assumptions that come with it (because of how the model space had been constrained) is still outnumbered by all remaining models that do not assume a 3rd-order interaction. This makes me wonder if the definition of the model space may inadvertently bias the analysis against detecting higher-order interactions and that this effect may be more severe the higher the order of the interaction.
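The counting argument above can be checked with a small sketch. It assumes the strict marginality rule described in this post (an interaction is only allowed when all of its lower-order terms are present) and ignores any subject or random-effect terms that a repeated-measures model would also carry. The factor names and helper functions are hypothetical and are not taken from JASP.

```python
from itertools import combinations


def all_terms(factors):
    """Return every main effect and interaction as a frozenset of factor names."""
    return [
        frozenset(combo)
        for order in range(1, len(factors) + 1)
        for combo in combinations(factors, order)
    ]


def respects_marginality(model):
    """True if every lower-order constituent of each term is also in the model."""
    for term in model:
        for order in range(1, len(term)):
            for sub in combinations(sorted(term), order):
                if frozenset(sub) not in model:
                    return False
    return True


def enumerate_models(factors):
    """List all sets of terms (including the empty null model) that respect marginality."""
    terms = all_terms(factors)
    candidates = (
        frozenset(subset)
        for size in range(len(terms) + 1)
        for subset in combinations(terms, size)
    )
    return [m for m in candidates if respects_marginality(m)]


factors = ("A", "B", "C")  # hypothetical factor names
models = enumerate_models(factors)
three_way = frozenset(factors)
with_three_way = [m for m in models if three_way in m]

print(len(models))          # size of the constrained model space, null model included
print(len(with_three_way))  # how many of those models contain A:B:C
```

Under these assumptions, three factors give 19 candidate models, and exactly one of them contains the three-way interaction.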

Classical rm-ANOVA does not suffer this problem because it only computes the F for the model including the effect against the model without the effect. I have the feeling that this could explain some of the discrepancies I am observing between these two types of analyses. Could you comment on this?

Is there a way in JASP to change what models are considered in this analysis? One could, as a compromise, for instance exclude only models where there is an interaction without any main effects of the corresponding factors. So an interaction AB would be plausible if there is a main effect for A or B (or both).

Moreover, I do not follow the explanation from that paper why models with interaction are implausible unless all main effects are present as well. Referring to the paper mentioned above, the function describing ice-cream price offered depending on fat and sugar content may support this claim in this particular case, but there is now reason to believe that it could look completely different for (in fact the vast majority of) other cases (Fig. 6B). What is shown in Fig. 6A is in fact an example of a double dissociation, which is the much sought-after type of evidence in a lot of neuroscience research to establish structure-function relationships. Besides, in practice it seems too strict to require that for the existence of a j-th order interaction it must be the case that all i-th order interactions (for all i < j) and all main effects exist as well.

For now I feel a little hesitant to trust the Bayesian type of analysis.

I believe I have data that I could share if it helps.

I'd be highly grateful for your advice.

Thanks & best,

Michael

EJ · November 2020

Hi Michael,

Thanks for your question. Some quick thoughts:

1) "BF of the model including the effect in question divided by the highest-ranking model without that effect." I do not like this procedure because it cherry-picks the highest-ranking model from a larger set, an action that should incur a penalty.
2) If you want to test a three-way interaction I would compare the full model against the model that misses that interaction but does involve the other terms. Ergo, this comparison would assess whether adding the interaction is supported by the data.
3) The principle of marginality (i.e., including the main effects for models that feature an interaction) is also motivated by the idea that a transformation of the data in the interaction-only model will suddenly make the main effect reappear.
4) There are some discrepancies for interaction in RM ANOVA. We are currently writing a discussion paper on this very topic. As soon as it is done we will post it on PsyArXiv and tweet about it. This paper may well solve your issues.

Cheers,

E.J.

JohnnyB · November 2020

Hi Michael,

To add to EJ's response (we are working on the mentioned project together), there are several Bayes factors to consider:

The direct Bayes factor which compares 1 model against 1 model (this is what you get in the main table).
The inclusion Bayes factor, which compares 1 or more models against 1 or more models (in order to get this table you tick the "effects" tickbox).

In the latter case, to respect the principle of marginality, JASP lets you choose which models to consider via the radio buttons underneath the "effects" tickbox. Selecting "matched" will only consider models that respect marginality.

The inclusion Bayes factor is (in my opinion) a double-edged sword. It uses one of the cool features of Bayesian inference, model averaging, to tell a different story than you usually get in ANOVA. But of course this story is very dependent on which models are under consideration (something that is not always made explicit).
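To make the "1 or more models against 1 or more models" idea concrete, here is a minimal sketch of an inclusion Bayes factor computed across all models in a candidate set. It uses the usual definition, the change from prior to posterior inclusion odds, and assumes every model's BF10 is expressed against the same null model. The function name, the uniform prior default, and the BF10 values are hypothetical. This is not JASP's code, and it does not implement the "matched" option.

```python
def inclusion_bf(bf10, effect, prior=None):
    """Inclusion Bayes factor for `effect` across all models in `bf10`.

    bf10  : dict mapping each model (a frozenset of term names) to its
            Bayes factor against the common null model.
    prior : optional dict of prior model probabilities; uniform if omitted.
    """
    models = list(bf10)
    if prior is None:
        prior = {m: 1.0 / len(models) for m in models}

    # Unnormalised posterior model probabilities: prior times marginal likelihood ratio.
    post = {m: prior[m] * bf10[m] for m in models}

    with_effect = [m for m in models if effect in m]
    without_effect = [m for m in models if effect not in m]
    if not with_effect or not without_effect:
        raise ValueError("need at least one model with and one without the effect")

    posterior_odds = sum(post[m] for m in with_effect) / sum(post[m] for m in without_effect)
    prior_odds = sum(prior[m] for m in with_effect) / sum(prior[m] for m in without_effect)
    return posterior_odds / prior_odds


# Hypothetical BF10 values for a two-factor design (not taken from any data set).
example_bf10 = {
    frozenset(): 1.0,                    # null model
    frozenset({"A"}): 4.0,
    frozenset({"B"}): 2.0,
    frozenset({"A", "B"}): 8.0,
    frozenset({"A", "B", "A:B"}): 2.0,   # full model
}

bf_incl_interaction = inclusion_bf(example_bf10, "A:B")
bf_incl_a = inclusion_bf(example_bf10, "A")
print(bf_incl_interaction, bf_incl_a)
```

Changing the set of models passed in changes the result, which is the dependence on the model space described above.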

If you want the analysis that corresponds to what a p-value does, you take the approach mentioned by EJ and simply compare the full model to the full model without the one effect you want to test (however, this can also violate the principle of marginality).
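This direct comparison can be sketched as a ratio of two Bayes factors. It relies on the fact that when both models are compared against the same null model, dividing their BF10 values gives the Bayes factor between the two models. The function name and the numbers are hypothetical.

```python
def bf_between(bf10_numerator, bf10_denominator):
    """Bayes factor for one model over another, given each model's BF10 against the same null."""
    if bf10_numerator <= 0 or bf10_denominator <= 0:
        raise ValueError("Bayes factors must be positive")
    return bf10_numerator / bf10_denominator


# Hypothetical BF10 values, standing in for two rows of a model comparison table.
bf10_full = 120.0            # full model, including the interaction of interest
bf10_no_interaction = 480.0  # same model with only that interaction removed

bf_interaction = bf_between(bf10_full, bf10_no_interaction)
print(bf_interaction)  # values below 1 favour the model without the interaction
```

In this made-up case the ratio is 0.25, so the data would be four times more likely under the model without the interaction.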

That said, we have found that there is a discrepancy between the Bayesian and frequentist RM ANOVA when 2 or more factors are in the model. It seems that the frequentist analysis considers random effects for participants by default (i.e., the manipulation having a different effect for different subjects). Since the Bayesian analysis does not do this, this can lead to diverging results, so this might be the culprit in your scenario as well.

Kind regards,

Johnny

AceOfBayes · December 2020

Hi EJ and Johnny,

Thanks very much for your time!

1) "I do not like this procedure because it cherry-picks the highest-ranking model from a larger set, an action that should incur a penalty." / "The inclusion Bayes factor, which compares 1 or more models against 1 or more models (in order to get this table you tick the "effects" tickbox)"

I do not have a strong opinion on whether that is a good way to perform inference in Bayesian rm-ANOVA. I just chose that method because it was proposed in a paper from methods people who probably know what they are doing ;)

Yes, the inclusion BF was mentioned in the appendix of that paper as well. Frankly, this model averaging method seemed somehow a little more intuitive to me. The fact that it depends on what models are being considered also applies to the "cherry-picking" method (EJ), I would suspect. A comparison based on some ground truth simulations would be nice to have here...

2) Yes, that makes a lot of sense to me. This should be the direct Bayesian counterpart of the classical F-test for model comparison in ANOVA. It seems so logical to me now that it makes me wonder why this isn't the default suggested test in Bayesian rm-ANOVA?

In my case, however, this still leads to discrepant results with BF10 << 1 (I'm taking the ratio of the values in the BF10 column for the two models; the model without the interaction is ranked higher), although the F-test of this 3rd-order interaction turns out significant (although not "very" significant, just under .05). Hmmm... this is strange.

3) I am not sure I understand this. If I am looking at the untransformed data, I will not see the main effects and it will look like the data did not support the respective model. Perhaps I need a little more background on the marginality principle.

4) "We are currently writing a discussion paper on this very topic" / "It seems that the frequentist analysis considers random effects for participants by default (i.e., the manipulation having a different effect for different subjects)"

Interesting! This would be very good to know. It would be extremely helpful to have this for Bayesian rm-ANOVA built-in in JASP as well. So would you then recommend for now not to use Bayesian rm-ANOVA when dealing with 2 or more factors?

Thanks again for taking your time to answer my questions on the message board while writing papers that could solve my problem :) I'm looking forward to it.

Cheers,

Michael

EJ · December 2020

Hi Michael,

We are wrapping up the paper now, and I suspect it is more efficient if we send it along when it is done. It is surprising what complications lurk beneath the surface of a relatively straightforward model.

Cheers,

E.J.

Rik · November 2021

Hello,

I was wondering if the above-mentioned paper has already been published, as I am observing similar divergent results myself.

Apologies if this has already been posted elsewhere; I couldn't find the paper myself nor posts about it on this forum, but might have looked in the wrong places.

Best wishes,

Rik

JohnnyB · November 2021

Hi Rik,

The paper is online - https://psyarxiv.com/y65h8/

We are now setting up a special issue with responses to this paper, and will then also publish a collaborative guidelines paper where we come back to the questions posed in the paper. If you have any additional questions please let me know!

Cheers,

Johnny

Rik · November 2021

Thanks a lot Johnny!
