# Overwhelmed by how to report results

Hi,

The more I read in the forum, the more confused I get. With a simple design, results are still manageable. For t tests, one can report BF10 or BF01, done. For a simple 2x2, one can describe the model that provides the best evidence (maybe it's for a main effect, maybe it's for an interaction, maybe both are good); one can report the best model or several. Alternatively, if the BF10 is arbitrary, one can look at BF01 and report that instead, done. (Maybe I remember wrong, but in the workshop I believe it was said that model comparison is better than just looking at effects and reporting BFincl? Though for simple designs it wouldn't make much difference, right?)

For a more complicated design, which results in e.g. 166 models, it becomes trickier.

From the workshop, I remembered to look at the models first and compare them to the null. That might very straightforwardly identify that the model including a main effect is best and the others aren't even comparable. Wonderful, I'll report the BF10 for that model; maybe it is even consistent with my p value, winning.

But with 166 models, I might have a lot of them providing extreme evidence. So I can then compare to the best model, and see whether they really are as good as the best model against the null. But maybe they are, so then I need to understand what drives these models (many containing interactions).

So from the forum, it appears I should go to the output for effects. Here, I can look at effects across all models; hopefully that identifies some main effects. If so, wonderful, I'll report the BFincl for whatever effects win. I might or might not have a strong BFincl for any main effect, but importantly I have evidence for an interaction. This is not yet real evidence for the interaction, though, because in this analysis the models that contain an interaction also contain the main effects. So to evaluate the interaction, I need to look at the output of effects across matched models (or Baws), which strips away the effect from the models.

Thus, this analysis won't ever provide strong BFincl for main effects (since they were stripped, correct), but instead allows me to understand whether the interaction exists and wasn't just driven by a main effect, correct?

So having identified maybe some main effects, some interactions, do I then stick to the BFincl, or should I return to the model comparison?

I guess my confusion comes from my various data sets:

1) While for one main effect I get a BF10=80, when I look at BF01 instead, various models (incl interactions) provide extreme BF01. When I look at analysis of effects across all models, that main effect has a BFincl=13. When I look at analysis of effects across matched models, that main effect has a BFincl=90, while the interactions have weak BFincl. So the interactions in BF01 were driven by the main effect. But how is my BFincl for the main effect now 90 if in this Baws analysis the models are stripped of the effect? So I'd say the strongest evidence is for that main effect, reporting that model's BF10, but do I also report the BFincl, and which one of the two?

2) On a different data set, again 166 models, my BF10 are weak, but the BF01 for main effects = 10 and various other models are extreme. Here, analysis of effects across all models shows only an effect of the four-way interaction, with BFincl = 743. So again, to evaluate the interaction I look at effects across matched models, and here, beautifully, that interaction turns into a BFincl = 1.743E+8. So though I actually have a strong BF01 (though only = 10) for the null regarding main effects, there is extreme BFincl for that interaction. So should I just report the evidence for the interaction? I feel like I'm then not telling the whole story here.

3) In another data set, again 166 models, I have a wonderful BF10 for the model incl factor1 + factor2 + factor1*factor2. In effects across all models, each of the BFincl tells the same story: both factors = 2.8E+13 and their interaction = 2E+14. When I strip away the effect across matched models, the interaction comes out nicely with BFincl = 9.3E+15. So this all makes sense to me: each factor has strong evidence, and their interaction is there too.

But again, do I report the model, the BFincl across models or across matched models?

Sorry about the essay, I just really want to get this right.

MANY MANY THANKS!

Best,

Clarisse

## Comments

Dear Clarisse,

OK, let's tackle these one at a time:

Yes.

BF10 is just 1/BF01, so they provide exactly the same information. If BF10 is 0.1, say, it feels awkward to say "the data are 0.1 times more likely under H1 than under H0". It is then easier to report BF01 = 1/.1 = 10, and say "the data are 10 times more likely under H0 than under H1".
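As a small illustration of this reciprocal relationship, here is a minimal Python sketch. The helper name `report_bf` is hypothetical (it is not a JASP function); it simply reports the Bayes factor in whichever direction is at least 1.

```python
def report_bf(bf10: float) -> str:
    """Report a Bayes factor in the direction that is >= 1."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 >= 1:
        # Evidence favors H1: report BF10 as is.
        return (f"BF10 = {bf10:.2f}: the data are {bf10:.2f} times "
                "more likely under H1 than under H0")
    # Evidence favors H0: flip to BF01 = 1 / BF10.
    bf01 = 1.0 / bf10
    return (f"BF01 = {bf01:.2f}: the data are {bf01:.2f} times "
            "more likely under H0 than under H1")


print(report_bf(0.1))
print(report_bf(80.0))
```

Both directions carry the same information; the choice only affects how easily the sentence reads.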

Right. I like the analysis of effects, but Richard doesn't. It's one of those things where you can have different opinions. But for simple designs there is not much benefit of averaging across models (which is what you do with BFincl) because there are only a few models to begin with.

Yes.

But with 166 models, I might have a lot of them providing extreme evidence. So I can then compare to the best model, and see whether they really are as good as the best model against the null. But maybe they are, so then I need to understand what drives these models (many containing interactions).

Yes.

Yes.

The models are set up so this is always the case. You are not allowed to define a model with interactions but without the constituent main effects.

Thus, this analysis won't ever provide strong BFincl for main effects (since they were stripped, correct), but instead allows me to understand whether the interaction exists and wasn't just driven by a main effect, correct?

Nope. The matched models analysis compares models with and without the effect of interest, but excluding higher-order interactions. So if you want to assess the evidence for main effect A, the matched model compares the null model to "A only", and "B only" to "A+B", but does not involve the "A+B+A*B" model.
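To make the difference between the two analyses of effects concrete, here is a rough Python sketch for a two-factor design. Everything in it is an assumption for illustration: the BF10 values are made up, the model prior is taken to be uniform, and the function names are hypothetical. It is not JASP's actual implementation, only the logic described above.

```python
# Hypothetical BF10 values (each model vs. the null); illustrative only.
# A model is a set of terms; "A:B" denotes the A x B interaction.
BF10 = {
    frozenset(): 1.0,                      # null model
    frozenset({"A"}): 80.0,
    frozenset({"B"}): 0.5,
    frozenset({"A", "B"}): 40.0,
    frozenset({"A", "B", "A:B"}): 4.0,
}


def posterior_probs(bf10, prior):
    """Posterior model probabilities from BF10 values and a model prior."""
    weights = {m: bf10[m] * prior[m] for m in bf10}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def inclusion_bf(with_effect, without_effect, prior, post):
    """Posterior inclusion odds divided by prior inclusion odds."""
    prior_odds = (sum(prior[m] for m in with_effect)
                  / sum(prior[m] for m in without_effect))
    post_odds = (sum(post[m] for m in with_effect)
                 / sum(post[m] for m in without_effect))
    return post_odds / prior_odds


def bf_incl_all(effect, bf10):
    """BFincl across all models: every model with the effect vs. every model without."""
    prior = {m: 1 / len(bf10) for m in bf10}  # assumption: uniform model prior
    post = posterior_probs(bf10, prior)
    with_effect = [m for m in bf10 if effect in m]
    without_effect = [m for m in bf10 if effect not in m]
    return inclusion_bf(with_effect, without_effect, prior, post)


def bf_incl_matched(effect, bf10):
    """BFincl across matched models: skip models with higher-order
    interactions of the effect, and pair each remaining model with the
    same model minus the effect."""
    prior = {m: 1 / len(bf10) for m in bf10}  # assumption: uniform model prior
    post = posterior_probs(bf10, prior)
    parts = set(effect.split(":"))

    def has_higher_order(model):
        # True if the model has an interaction that strictly contains the effect.
        return any(parts < set(term.split(":")) for term in model)

    with_effect = [m for m in bf10 if effect in m and not has_higher_order(m)]
    without_effect = [m - {effect} for m in with_effect]
    return inclusion_bf(with_effect, without_effect, prior, post)


for effect in ("A", "B", "A:B"):
    print(f"{effect}: all models = {bf_incl_all(effect, BF10):.2f}, "
          f"matched = {bf_incl_matched(effect, BF10):.2f}")
```

With these made-up numbers, the matched BFincl for A is 80 (it compares "A" and "A+B" against the null and "B"), whereas across all models it is about 55, because the poorly supported "A+B+A*B" model also counts as a model that includes A.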

With 166 models I would go for the BFincl.

1) While for one main effect I get a BF10=80, when I look at BF01 instead, various models (incl interactions) provide extreme BF01.

BF10 = 1/BF01, see above.

The matched model analysis will exclude the models that have the interaction. If these models do not receive support then removing them from consideration (as the matched model analysis does) makes the main effect of interest look better. I'd report both the "regular" analysis of effects as well as the matched model analysis.

BF10 = 1/BF01, see above.

I would report both analyses. In general, I recommend producing an annotated JASP file, uploading it to the OSF, and linking to it in your manuscript.

But again, do I report the model, the BFincl across models or across matched models?

It is rarely the case that a single analysis answers all questions. Reporting a series of analyses provides a more complete picture.

Cheers,

E.J.

THANK YOU SO MUCH ! ! ! I think I am finally getting somewhere

One more question (sorry!):

What if I have an analysis where the 'strongest evidence' is the BF01=36.8 for the model including all factors and interactions, but nothing else? How does one evaluate interactions, especially with a strong BF01? The BFincl (across all models) is below 1 for the interactions, but those would still include models for main effects, so of course it would be low? So then I don't have strong evidence against the null, but also not really data supporting the null, correct?

I need a little more context, perhaps a concrete example?

Of course, thank you! So here I have the BF01 first against the null, with the effects analysis incl all models, and then I have the BF01 against the best model and effects across matched models.

I'm lost when there is neither a strong BF10 nor BF01 for any main effects, but either BF10 or BF01 (as here) for models including interactions. But the BFincl is not clarifying these results...

So the best model has Position and irrelValue. The second best has only Position, and the third best has only irrelValue. The first model with an interaction enters at place 5, and is a factor of 3.293 worse than the best model.

This is echoed by the analysis of effects, where only Position and irrelValue get BF_inclusion values higher than 1 (for the matched models at least). But the preference is not strong.

In fact, the null model comes in at place 6, a factor of 3.568 worse than the best model. This does not provide strong evidence for the inclusion of any predictors, but there is some evidence for including "Position" (albeit weak).

Cheers,

E.J.

So if there is no strong evidence for inclusion of any predictors, then the null model is still the better one?

What confuses me is this: when comparing against the null, the last row of the table shows the model including all terms and interactions at BF01 = 43.15, but in the comparison against the best model, that model is 153.95 times worse than the best model, Position + irrelValue. Is the "best" model defined as the best model when comparing all models but not when comparing to the null, because there this model only had a BF01 = .28?

I thought I first compare models against the null, to establish whether any models provide evidence for or against the null model. And then, when various models provide decent evidence against the null, I compare them with each other to see whether they are equally good?

I'm generally unsure what to do with vague BF results like these, especially in the context of significant p values...

So when you look at BF01 and the "H0" is the null model, values higher than 1 are evidence for the null. So the BF01 = 43.15 is evidence for the null model that has no predictors. But the null model is not the best model; the best model is the one with Position and irrelValue. Compared to the best model, the most complex model is 153.95 times worse, as you mention. So we have:

A = p(data | null) / p(data | most complex) = 43.15

B = p(data | best) / p(data | most complex) = 153.95

So if we compute C = A/B, the most complex model drops out, and we are left with

C = A/B = 43.15/153.95 = 0.28 = p(data | null) / p(data | best) = BF_null vs best
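The same arithmetic as a short Python check, using only the two Bayes factors quoted above (the variable names are just labels chosen for this sketch):

```python
# Both Bayes factors share the same denominator model (the most complex one),
# so dividing them cancels that model out.
bf_null_vs_complex = 43.15    # A = p(data | null) / p(data | most complex)
bf_best_vs_complex = 153.95   # B = p(data | best) / p(data | most complex)

# C = A / B = p(data | null) / p(data | best)
bf_null_vs_best = bf_null_vs_complex / bf_best_vs_complex

# Flipping the ratio gives the evidence for the best model over the null.
bf_best_vs_null = 1.0 / bf_null_vs_best

print(f"BF null vs best = {bf_null_vs_best:.2f}")   # 0.28
print(f"BF best vs null = {bf_best_vs_null:.2f}")   # 3.57
```

The flipped value, about 3.57, matches the earlier statement that the null model is a factor of 3.568 worse than the best model.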

Cheers,

E.J.

Thanks, I thought this all finally made sense, but with another data set I'm confused again!

Here, I have extreme evidence in various models against the null. So I compared to the best model (relevant value), and still all the models are worse only by a factor of < 1. But what surprised me here is that the null model, and the models previously providing little evidence against the null, also have factors much smaller than 1 against the best model. Shouldn't these factors be much higher, i.e. shouldn't these models be much worse than the best model?

Analysis of effects highlights only relevant value, but the models with extreme evidence suggest that the other factors play a role, as well as their interactions, no?

I just realized that I'm still on the same question: the best model is not necessarily the best model against the null (though it can also be the "best" model against the null). So what defines the best model in the "comparison against the best model"? The best model of the data?

Sorry, I realized what I was getting wrong (or I prefer to blame it on my shingles-affected brain). I need to look at BF01 when comparing to the best model; then I know the factor by which any other model is worse.

Thank you again so much for all your help!!