# Bayes factor (inclusion BF) contradicts p value

I'm running Repeated Measures ANOVAs on this data file: https://osf.io/3gynm/ (results_aggregated_exp2_index_vs_thumb.txt)

I include the factors "Type" (probe vs. irrelevant) and "Hand-position" (index vs. thumb) for the "duration RT mean" variable: "duration_RT_mean_probe_0", "duration_RT_mean_irrelevant_0", "duration_RT_mean_probe_1", "duration_RT_mean_irrelevant_1" (where _0 is index and _1 is thumb). See:

What's weird is that for the interaction I find p = .004:

But a BF = 0.182 (inclusion BF based on matched model - when based on all models, it's even 0.077):

So the p-value supports a difference quite strongly (p = .004), while the BF substantially supports excluding the interaction (1 / 0.182 = 5.49). (There is a similar contradiction for the Type main effect too, though there the p-value is less clear-cut.)

Any ideas what to make of this? Perhaps any references about interpreting such a case?

## Comments

I've understood this analysis quite differently from you, so maybe someone can help both of us.

My interpretation of your data would be as follows: when testing Type*Handpos against the null model, p = .004 says there is only a 0.4% chance of observing data at least as extreme as yours if the null model is true. So you reject the null model, i.e., you reject that there is no effect in your data, and assume that there is an effect. To investigate which effect best accounts for your data, you run the Bayesian analysis. I always run it set to BF01 with the best model at the top, which makes it easy to select the model that best explains the data.

Based on the analysis of effects you have provided, I would conclude that only handpos should go in the model, and that type as well as type*handpos should be excluded. The model containing handpos on its own has a much higher BFincl than the other two. I hope I've got that right...

Interesting case. Could you also provide the regular table with all the models separately? Maybe also a plot of the results?

Sometimes such discrepancies are due to model misspecification, for instance heteroscedasticity etc.

Cheers,

E.J.

@eniseg2: Thanks for the reply, but I'm not comparing against the null: these are inclusion Bayes factors, which weigh the evidence for each effect averaged over all the candidate models. I'm no expert, but all in all I'm pretty sure they should generally give results that correspond to those from a regular ANOVA.
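(For other readers: an inclusion BF is the ratio of posterior to prior inclusion odds for an effect, pooled over the whole model space. A minimal Python sketch of the arithmetic; the model probabilities below are entirely made up for illustration, not taken from this dataset:)

```python
# Inclusion BF = posterior inclusion odds / prior inclusion odds,
# pooled over all candidate models.
# All probabilities below are HYPOTHETICAL, purely for illustration.
posterior = {
    frozenset(): 0.05,                                     # null model
    frozenset({"type"}): 0.10,
    frozenset({"handpos"}): 0.50,
    frozenset({"type", "handpos"}): 0.30,
    frozenset({"type", "handpos", "type:handpos"}): 0.05,
}
PRIOR = 1 / len(posterior)  # uniform prior over the five models

def inclusion_bf(effect):
    """Change from prior to posterior odds that `effect` is in the model."""
    post_in = sum(p for m, p in posterior.items() if effect in m)
    pri_in = PRIOR * sum(1 for m in posterior if effect in m)
    return (post_in / (1 - post_in)) / (pri_in / (1 - pri_in))

print(inclusion_bf("type:handpos"))  # < 1: evidence against inclusion
print(inclusion_bf("handpos"))       # > 1: evidence for inclusion
```

JASP's "across matched models" option instead compares only models that differ by exactly that effect (e.g., Type + Handpos + Type:Handpos vs. Type + Handpos for the interaction), which is why the two inclusion BFs quoted in the opening post differ.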

@EJ:

Sure, here is the full table:

And here is a plot with 95% CI error bars:

(I know the probe vs. irrelevant difference looks tiny, but the correlation is super high [r(70) = .98], hence the significant difference.)
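To make that point concrete: with nearly perfectly correlated conditions, the SD of the difference scores collapses, so even a mean difference of about 1.4 units yields a large t. A Python sketch (the thread's analyses are in R, but the arithmetic is the same) plugging in the summary statistics reported further down in the thread:

```python
import math

# Reconstruct the paired t from summary statistics reported in this
# thread (index hand, probe vs. irrelevant): n = 116, means 80.78 vs.
# 82.16, SDs 21.25 vs. 21.37, correlation r = .979.
n = 116
m1, sd1 = 80.78, 21.25   # probe
m2, sd2 = 82.16, 21.37   # irrelevant
r = 0.979

# SD of the difference scores; the -2*r*sd1*sd2 term is what the high
# correlation shrinks, making a tiny mean difference detectable
sd_diff = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
t = (m1 - m2) / (sd_diff / math.sqrt(n))
print(sd_diff, t)
```

With these rounded inputs, sd_diff comes out near 4.4 and t near -3.4, close to the reported t(115) = -3.37.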

Density plots:

Q-Q plot:

There are no between-subject factors, and the sphericity looks fine:

Again, the entire dataset is available at https://osf.io/3gynm/

Btw, I actually ran these tests in R first, and only checked in JASP because of the strange results - but it's all the same. That's just to say that it's not something JASP-specific.

Hmm I don't quite get this then. The discrepancy between the analyses is really large. I will ask some other people to look at this as well. Also, you could t-test just the probe vs irrelevant difference for the index case -- I assume your p-value will be even more significant, and the BF will be highly in favor of the alternative. Such a result would make the discrepancy even more mysterious.

E.J.

Update: if you present the R code, Richard can look with the BayesFactor package at what you did!

Cheers,

E.J.

Yes, the probe-irrelevant difference in case of index:

t(115) = –3.37, p = .001, d = –0.31, 95% CI [–0.50, –0.13], BF10 = 20.22.

Correlation: r(114) = .979, 95% CI [.969, .985], p < .001.

Descriptives: M±SD = 80.78±21.25 vs. 82.16±21.37

In case of thumbs:

t(115) = 0.29, p = .772, d = 0.03, 95% CI [–0.16, 0.21], BF01 = 9.31 (i.e., plain BF10 = 0.1074).

Correlation: r(114) = .984, 95% CI [.977, .989], p < .001.

Descriptives: M±SD = 93.05±22.28 vs. 92.95±22.30
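For completeness, default one-sample/paired BFs like the two above can be approximated from just t and N via the JZS integral of Rouder et al. (2009). A Python sketch (the thread's analyses are in R/JASP; the Cauchy prior scale r below is an assumption, so the exact numbers may differ somewhat from those reported):

```python
import numpy as np
from scipy import integrate, special

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """JZS Bayes factor BF10 for a one-sample / paired t test
    (Rouder et al., 2009), Cauchy prior with scale r on effect size."""
    nu = n - 1
    # marginal likelihood under H0 (constants shared with H1 cancel)
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # under H1, integrate over g ~ inverse-gamma(1/2, r^2/2)
    def integrand(g):
        shrink = (1 + n * g) ** (-0.5)
        like = (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
        prior = ((r**2 / 2) ** 0.5 / special.gamma(0.5)
                 * g ** (-1.5) * np.exp(-r**2 / (2 * g)))
        return shrink * like * prior

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

print(jzs_bf10(3.37, 116))  # index case: well into the evidence-for range
print(jzs_bf10(0.29, 116))  # thumb case: near 0.1, i.e. evidence against
```

Depending on the prior scale, the index case lands in the strong-evidence region (same qualitative conclusion as the reported BF10 = 20.22), and the thumb case near 0.1, consistent with the reported BF01 = 9.31.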

Here are the R codes along with a simplified dataset (with the relevant data only):

@gaspar We've encountered the same issue with our data. Have you figured out how to interpret your findings?

Nope, I wouldn't know how to. I'm still hoping @EJ or @richarddmorey might be able to explain it.

(Fortunately this is not so important for me in the present data; it's just an additional and fairly unimportant exploratory analysis in our paper. Plus, we originally reported it without BFs; I only checked this when it was already under review. Still, it would be good to at least mention and clarify this in a footnote before publishing.)