# Interpretation help (RM ANOVA)

Hi,

Unfortunately, I have a problem that I cannot handle. I found a significant interaction effect (p < .02) in a classical repeated measures ANOVA, but the Bayesian analysis gave only an anecdotal BFinclusion of ~1.6 for the interaction. My problem is how to handle this in a paper. If I published the results without the BFs everything would be fine, but once I include the BFs I have to find the right words. A reviewer could say that my study was not designed well enough to find the effect (or that the results are "worthless" because I cannot say whether H0 or H1 is supported) and reject the paper.

Another problem in this context is that I cannot explain what the prior in the repeated measures ANOVA really means (r scale fixed effects: 0.5; r scale random effects: 1; r scale covariates: 0.354).

Thanks for any advice or help!

Best,

Markus

## Comments

Hi Markus,

Well, I would just be transparent. Sometimes you do get these conflicts and, in my opinion, they urge caution. If you had a specific contrast in mind then you ought to test that (I believe Richard has a blog post showing how this can be done; we are working to implement something like that into JASP but haven't done so yet). With respect to the prior scales, the settings are explained in the relevant papers. I think you may not have covariates and random effects, so then the only thing to explain is the r scale for the fixed effects. This is based on the width of a multivariate Cauchy. It has been chosen so that the results are consistent with a t-test in case of two conditions in a between-subjects design. I would not attempt to explain this but just mention they are the default settings.
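For readers wondering where the three numbers come from: they appear to match the named default scales documented for the R package BayesFactor, on which JASP's Bayesian ANOVA is built. A minimal sketch, with variable names chosen purely for illustration:

```r
# Prior scales reported in the question, next to the named presets documented
# for the BayesFactor package (variable names here are illustrative only).
r_fixed      <- 1 / 2        # "medium" preset for fixed effects in anovaBF
r_random     <- 1            # "nuisance" preset for random effects
r_covariates <- sqrt(2) / 4  # "medium" preset for continuous covariates

round(c(fixed = r_fixed, random = r_random, covariates = r_covariates), 3)

# With two between-subjects groups, the ANOVA scale of 1/2 corresponds to the
# default t-test scale of sqrt(2)/2; the two parameterizations differ by sqrt(2).
r_ttest <- r_fixed * sqrt(2)
```

This is why the default ANOVA and the default t-test are expected to agree in the two-group case.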

Cheers,

E.J.

Hi E.J.

Thank you very much for your response. I posted my questions a second time because I thought that nobody would recognize it as a separate post.

Regarding my problem above, is it allowed to base my interpretation on the BFInclusion score? In other words, can I report the BF that compares the interaction model against the two-main-effects model and the BF that compares the interaction model against all candidate models, but focus on the BFinclusion score?

Thank you very much!

Best,

Markus

Hi Markus,

When you have few models, I am in favor of including the entire tables, perhaps as a supplement.

Cheers,

E.J.

Hi E.J.,

Regarding my post above: when my interaction model has an anecdotal BF10 but my BFInclusion is moderate, which of the two should I give more weight in my interpretation? For me it makes a difference whether I say the interaction effect gets weak or (at least) moderate support.

Thank you very much!

Best,

Markus

The reason for the increased support in the inclusion method may be that some models (like the null model, or the model with only one factor) perform very poorly. I am not so sure that this effect is of interest to you.
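To make this concrete, here is a rough sketch of how an inclusion Bayes factor across all models can be computed by hand. The BF10 values are made up for illustration and are not taken from this thread; equal prior model probabilities are assumed.

```r
# Hypothetical BF10 values (each model vs. the null) for a 2 x 2 design;
# these numbers are made up for illustration, not taken from this thread.
bf10 <- c(null = 1, a = 50, b = 0.4, "a+b" = 30, "a+b+a:b" = 45)

prior <- rep(1 / length(bf10), length(bf10))   # equal prior model probabilities
post  <- bf10 * prior / sum(bf10 * prior)      # posterior model probabilities

has_int <- grepl("a:b", names(bf10))           # models containing the interaction

prior_odds <- sum(prior[has_int]) / sum(prior[!has_int])
post_odds  <- sum(post[has_int])  / sum(post[!has_int])
bf_incl    <- post_odds / prior_odds           # inclusion BF across all models

bf_direct  <- unname(bf10["a+b+a:b"] / bf10["a+b"])  # full vs. main effects only
```

In this made-up case the inclusion BF exceeds the direct comparison only because the null model and the `b` model do badly, which pulls down the average of the models without the interaction.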

Cheers

E.J.

Hi EJ,

I have a similar issue to Markus. My Bayesian ANOVA is also not as convinced of the existence of an effect as the classical ANOVA. However, in my case the difference is rather large. The classical ANOVA (df=19) yields F=9.00 and p = .007, whereas the BF for this effect is 0.6, which even provides anecdotal evidence for the null (the JASP output and a figure of the means incl. within-subject 95% CI are attached!). From looking at single-subject data, I can say that the effect is indeed small (~10 ms) but rather consistent across subjects. Only one subject shows the opposite effect, but three times as strong as everyone else. However, this alone is no reason for exclusion, because the overall performance of that subject is still within 2 SD of the sample mean.

I was wondering whether it is possible for these two analyses to diverge so strongly, or whether it is more likely that an error happened somewhere along the road. And if it really is possible, do you know what the reasons could be, also given my data in particular? I understand that Bayesian stats tend in general to be a little more conservative than classical ones, but why exactly is that?

This experiment is the third in a series of very similar ones, and the effects so far were always rather strong and consistent between Bayesian and classical approach. So, I was also wondering whether it is possible in JASP or R to provide the outcome of earlier ones as priors in later analyses? In another discussion, I read that simply multiplying the BF doesn't work. Is there a way?

Finally, what is your recommendation for how to tackle the issue? Just being transparent, along the lines of "classical ANOVA finds an effect, however this is not supported by Bayes", or would you take more measures? I was also running a t-test between the two conditions where I expected the effect to originate from and found moderate support for my hypothesis.

Your opinion is very much appreciated!

Thanks,

eduard

Hi Eduard,

I assume the interest is in the interaction? In general BFs are less enthusiastic because they look at both sides of the coin --H0 and H1-- instead of just focusing on H0. Indeed, multiplying BFs is not allowed, as it uses the prior again. So the correct approach, as you suggest, is to compute BFs using the updated distributions. This is not yet possible in JASP.
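The point about multiplication can be written out. As a sketch in notation introduced here (not used elsewhere in the thread), let $y_1$ and $y_2$ be the data from two successive experiments:

$$
\mathrm{BF}_{10}(y_1, y_2)
= \frac{p(y_1, y_2 \mid H_1)}{p(y_1, y_2 \mid H_0)}
= \frac{p(y_1 \mid H_1)}{p(y_1 \mid H_0)} \cdot \frac{p(y_2 \mid y_1, H_1)}{p(y_2 \mid y_1, H_0)}
= \mathrm{BF}_{10}(y_1) \cdot \mathrm{BF}_{10}(y_2 \mid y_1)
$$

The second factor is computed with the posterior from the first experiment as the prior. It generally differs from $\mathrm{BF}_{10}(y_2)$ obtained under the original default prior, which is why multiplying two default Bayes factors uses the prior twice.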

Being transparent is always good. However, perhaps you can achieve more informative results by not just testing an interaction "in general" but opting for a more informative contrast. I believe Richard has a blog post on that. In addition, sometimes we see big differences between the two paradigms when particular assumptions are not met (outliers, heterogeneity of variances, etc.). So you could check that too. Maybe Richard would like to weigh in as well.

Cheers,

E.J.

Hi EJ,

Thanks for your reply.

Indeed the interaction is what matters most.

Do you happen to mean this blog post? In a 2x2 design, wouldn't this boil down to a simple t-test?

Is it possible directly in R with the BayesFactor package?

Just for the sake of clarity: if some assumptions were not met, would this mostly concern the outcome of the classical ANOVA?

Thanks again,

Eduard

Hi Eduard,

I'm not sure which tests would be most affected by a violation of assumptions. It feels a little like comparing apples and oranges, but perhaps it can be done. Yes, I meant that blog post -- or the next one, http://bayesfactor.blogspot.nl/2015/01/multiple-comparisons-with-bayesfactor-2.html. What I'm saying is that your interaction can be specified more exactly as a specific ordering of means (equality and inequality constraints).

E.J.

This looks interesting. I'll give it a try.

And one last thing. If this more specific analysis does not turn out to support our hypothesis either, how much of a problem would it be to try to publish the data nevertheless? (Of course, this is a highly subjective question. I just wondered how it might appear to reviewers.)

In any case, thanks for your support. Very helpful.

Eduard

I don't think it's a problem at all. Did you see this paper by Etz and Lakens about not every study needing to provide picture-perfect results? Besides, I think you should only be applauded for being transparent. And my guess is that this will happen.

Cheers,

E.J.

Hi EJ,

I ran the analysis that you suggested (specifying the interaction as a specific ordering of means), which seems to work. So that is good. However, on the way I bumped into a couple of things, that I'm not quite sure whether I understand.

Mostly, I'm not sure which value to choose for the `whichModels` argument in the BayesFactor analysis. I tried "top", "bottom" and the default value, and I think I understand what they mean conceptually. The problem is that I don't know how to extract the BF for each effect (M1, M2, IE) if I follow the standard procedure that Richard describes in his blog entry (the default, `withmain`), which yields only the BF for each model compared to the null model. If I use `bottom`, however, I do get an adjusted BF for each effect, but the BFs are no longer comparable to the JASP output. I suppose the reason is that in each case I compare the factors to a different null model.

Therefore, my questions: how do I extract the BF for each effect from the BFs given by an analysis of this form?

`bf <- anovaBF(DV ~ color_IV1 * IV2 + subj, data = data_df, whichRandom = "subj", whichModels = "withmain")`

And what do I have to keep in mind regarding the interpretation of the BFs when choosing a different value for `whichModels`, along the lines of:

`bf <- anovaBF(DV ~ color_IV1 * IV2 + subj, data = data_df, whichRandom = "subj", whichModels = "bottom")`

I hope I have formulated my problems clearly enough. If not, I will gladly rephrase or give more detail.

Thanks,

Eduard

Hi Eduard,

This is really a question for Richard, who is in charge of the "BayesFactor" component of this Forum. I'll specifically draw his attention to your question.

Cheers,

E.J.

Thanks!

Hi Eduard,

Simply use the function "as.vector" to extract the Bayes factors from a Bayes factor object.

The only differences between the whichModels specifications are which models are tested, and to which models they are compared:

"all" gives all combinations of effects, including those with interactions but not the constituent main effects. So, for a two-way anova, you'll get all of these compared to the null model:

a

b

a + b

a + b + a:b

a + a:b

b + a:b

a:b

"withmain" gives all models except those in which an interaction appears without its constituent main effects, all compared to the null model.

a

b

a + b

a + b + a:b

"top" gives

a + a:b

b + a:b

a + b

all compared to the "full" model a + b + a:b; that is, each effect is "taken away" from the full model and tested.

"bottom" gives

a

b

a:b

all compared to the null model with no effects.

I would not recommend using "top" or "bottom" in everyday research. They are mostly added for the convenience of people generating subsets of models. The problem with them can best be seen in the regression context. Imagine two covariates that are highly correlated with each other, but are also correlated with the DV. Testing with "top" will lead you to the conclusion that neither covariate is needed, because they share a great deal of variance. Testing with "bottom" would lead you to the conclusion that both are needed. What you should do is compare a, b, and a+b to one another (in the ANOVA context you'd have a:b in there too). Then you'd see that a and b alone are good, but a+b and the null are bad. That is, you need one of the two covariates but not both. You can only see this by looking at the constellation of model comparisons.
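The regression scenario can be mimicked with simulated data. The sketch below is only an illustration under stated assumptions: the data are simulated, and it swaps in the BIC approximation to the Bayes factor (base R) rather than the default priors of the BayesFactor package, so the numbers will not match what `regressionBF` would give.

```r
set.seed(123)
n <- 200
z <- rnorm(n)                 # shared component
a <- z + rnorm(n, sd = 0.1)   # two highly correlated covariates
b <- z + rnorm(n, sd = 0.1)
y <- z + rnorm(n)             # DV related to what a and b share

fits <- list(null = lm(y ~ 1), a = lm(y ~ a), b = lm(y ~ b), ab = lm(y ~ a + b))
bic  <- sapply(fits, BIC)

# BIC approximation: BF for model i over model j is exp((BIC_j - BIC_i) / 2)
bf <- function(i, j) unname(exp((bic[j] - bic[i]) / 2))

bottom <- c(a = bf("a", "null"), b = bf("b", "null"))   # each covariate vs. null
top    <- c(a = bf("ab", "b"),   b = bf("ab", "a"))     # full vs. full minus one
all_vs_null <- sapply(names(fits), bf, j = "null")      # the whole constellation
```

In a setup like this, the "bottom" comparisons favour each covariate strongly, while the "top" comparisons would typically show little gain from adding the second covariate to the first. Comparing the entries of `all_vs_null` with one another is the constellation view described above.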

The idea that you can get a separate Bayes factor for each effect -- as opposed to comparing models -- is flawed. You'll fall into the same traps that people fall into with p values (e.g., problems with multicollinearity). My recommendation is to stick with the model comparison ("withmain" or "all").
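As a minimal sketch of that workflow, the example below uses the `puzzles` data set that ships with BayesFactor (a 2 x 2 within-subject design) in place of the data discussed in this thread. The model indices are an assumption about the order in which the models are listed and should be checked against the printed object.

```r
library(BayesFactor)

data(puzzles)  # example 2 x 2 within-subject data shipped with BayesFactor

bf <- anovaBF(RT ~ shape * color + ID, data = puzzles,
              whichRandom = "ID", whichModels = "withmain", progress = FALSE)

bfs <- as.vector(bf)   # one Bayes factor per model, each against RT ~ ID

# Evidence for the interaction: full model vs. the two-main-effects model.
# The indices assume the usual model order; check print(bf) before relying on them.
bf_interaction <- bf[4] / bf[3]
```

Dividing one element of the object by another gives the Bayes factor between those two models, so any pair in the table can be compared directly without a separate per-effect test.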

Hi Richard,

Thanks for your input. It really cleared things up.

But once you're already in the discussion, would you mind helping me out on the initial discussion here?

In brief, I have three experiments, each with a 2x2 repeated measures design (basically replications of each other). In the first and second experiment, classical and Bayesian ANOVA agree that the best model is the full model (M1, M2, IE). In the third experiment, however, there is no evidence for the interaction according to the BF, even though the classical ANOVA finds a clearly significant effect (p = .007).

As I have rather specific predictions with respect to the interaction, EJ suggested that I check whether I can find evidence for it with a more specific prediction by using order restrictions (according to this blog post of yours).

The example you give is for a univariate ANOVA with 3 levels. So firstly, I was wondering whether this method is also applicable for a 2x2 design? Secondly, if it is possible to use it, what would be the number of possible orderings? My first idea was 24 because I have 4 cells and each could be different from each other. However, I don't want to artificially blow up the number of possible orderings as this would have a huge effect on the end result.

For sake of completeness, here is the code that I used:

Thanks for your help

Eduard

Hi @richarddmorey,

Not sure whether you are just busy or whether you haven't seen this post yet. In the latter case, I hope this post reaches you. Otherwise, sorry for spamming. It's just that I'm a little excited about that analysis. Once I know whether my procedure makes sense, we can submit the manuscript.

Thanks,

Eduard

Dear all,

I will revive this thread because my question is very related. I have a 2 x 2 repeated-measures design and a frequentist approach indicates that I have two main effects with very large effect sizes (and high significance) and an interaction of medium-to-large effect size (and high significance). However, the Bayesian approach indicates that - compared to the full model - there is anecdotal evidence for a model with only two main effects, but no interaction.

The reason I want to revive this thread is that I tried to find out why and when these diverging findings can occur. Wikipedia suggests that this is Lindley's paradox, which can occur under specific circumstances: https://en.wikipedia.org/wiki/Lindley's_paradox

This becomes clear with an example: in my particular case, the 95% confidence interval of the difference between conditions ranges from 5 ms to 20 ms. Thus the actual difference in cell means might be very, very small. While the frequentist says, "The difference is clearly different from zero", the Bayesian says, "Well, the difference is so small, it might as well be zero". This is especially the case because there is no good alternative (H1), meaning that the actual difference is probably close to zero.

That being said, in my particular case, the actual effect is expected to be that small. The interaction represents the modulation of a main effect which itself has a mean difference of 35 ms, so naturally the modulation of this effect is even smaller. So from a theoretical perspective, I am totally okay with finding an interaction that has a corresponding confidence interval of 5 to 20 ms. I have the feeling that the Bayesian approach misrepresents this case, because small but consistent findings seem to be taken as evidence for the null rather than for the alternative hypothesis.

Would you agree with my reasoning? I am curious what the experts in the field think about this issue.

I would like to add to my previous post:

What has been bothering me all day is that I get completely different results if I use a different program to obtain the Bayes factor for that particular interaction.

I used the calculator MorePower 6.0: https://www.ncbi.nlm.nih.gov/pubmed/22437511

MorePower bases the calculation of the Bayes Factor on the BIC, which itself is computed based on the sums-of-squares of a conventional ANOVA. The method is explained in Masson (2011): https://link.springer.com/article/10.3758/s13428-010-0049-5
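For orientation, the BIC-based calculation can be sketched in a few lines of base R. This follows the formulas as given in Masson (2011); the function and argument names are made up here, and the inputs are assumptions for illustration (F = 9 on 1 and 19 df as quoted earlier in the thread, with n = 20 taken as the number of independent observations). It is not meant to reproduce MorePower's internals.

```r
# BIC approximation to the Bayes factor from ANOVA sums of squares,
# following the formulas in Masson (2011). Names are illustrative only.
bic_bf10 <- function(ss_effect, ss_error, n, k_diff = 1) {
  sse1 <- ss_error               # unexplained variation under H1
  sse0 <- ss_error + ss_effect   # unexplained variation under H0
  delta_bic <- n * log(sse1 / sse0) + k_diff * log(n)
  bf01 <- exp(delta_bic / 2)
  1 / bf01
}

# Only the ratio of sums of squares matters, and with 1 numerator df
# ss_effect / ss_error = F / df_error. Assumed inputs: F = 9, df_error = 19, n = 20.
bf10 <- bic_bf10(ss_effect = 9, ss_error = 19, n = 20)
```

The BIC approximation corresponds to a different implicit prior than the Cauchy-based default in JASP, and the result also depends on what is counted as `n`, so the two numbers need not agree.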

Unfortunately, I lack the expertise to understand the difference between the computation of the Bayes factor in JASP and the computation of the Bayes factor in MorePower. However, the difference couldn't be more drastic:

While MorePower tells me that there is "decisive evidence" for H1 (which is in line with my frequentist findings), JASP reports "anecdotal evidence" for H0.

Any opinions or insights why the two computations of the Bayes Factor lead to such drastic differences would be of immense help. Thank you!

Hi Tcarsten,

I have not tried MorePower 6.0, but I will say that for RM designs, there is a more recent paper by Masson that seems relevant: https://web.uvic.ca/psyc/masson/NM16.pdf

In general, I think the problem with the model is twofold. First, you probably have a specific form of the interaction in mind -- the model penalizes the interaction for its complexity (it spreads its predictions thinly across the data space). We are working on contrasts that will allow you to specify more informative models. Second, yes, you might expect small effects and that should change the prior. In general, whenever you expect really small effects I recommend that you use your substantive knowledge and also shift the location of the prior. This cannot be done in the current version of JASP, but for the t-test we will have it implemented in the upcoming version. For a rationale see https://arxiv.org/abs/1704.02479
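How much the prior width matters can be explored outside JASP. The sketch below covers only the simpler one-sample (or paired) t-test case and only the scale of the prior, not a shift in its location. It uses the integral form of the JZS Bayes factor from Rouder et al. (2009); the function name and the inputs (t = 3, n = 20) are assumptions chosen for illustration.

```r
# One-sample / paired JZS Bayes factor as a function of the Cauchy prior scale r,
# using the integral representation from Rouder et al. (2009).
jzs_bf10 <- function(t, n, r = sqrt(2) / 2) {
  v <- n - 1
  null_lik <- (1 + t^2 / v)^(-(v + 1) / 2)
  integrand <- function(g) {
    (1 + n * g)^(-1 / 2) *
      (1 + t^2 / ((1 + n * g) * v))^(-(v + 1) / 2) *
      r / sqrt(2 * pi) * g^(-3 / 2) * exp(-r^2 / (2 * g))
  }
  alt_lik <- integrate(integrand, lower = 0, upper = Inf)$value
  alt_lik / null_lik
}

# Assumed inputs for illustration: t = 3 with n = 20 subjects
scales <- c(0.2, 0.5, sqrt(2) / 2, 1)
bf_by_scale <- sapply(scales, function(r) jzs_bf10(t = 3, n = 20, r = r))
```

Printing `bf_by_scale` next to `scales` shows how the evidence for the same t value changes with the assumed effect-size scale.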

Cheers,

E.J.