In Lakens/TOSTER: Two One-Sided Tests (TOST) Equivalence Testing

For an open access tutorial paper explaining how to set equivalence bounds, and how to perform and report equivalence tests, see Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259-269. https://doi.org/10.1177/2515245918770963

Warning: many of the TOSTER functions within this particular vignette are outdated. Where this is the case, we have included the updated code as well. Please see the other vignettes for details on the new functions. The "Introduction to t_TOST" vignette, in particular, will be helpful. This vignette, and the functions used, will continue to be in the package for continuity purposes.

Equivalence Testing with TOSTER

Scientists should be able to provide support for the absence of a meaningful effect. Currently researchers often incorrectly conclude an effect is absent based a non-significant result. It is statistically impossible to support the hypothesis that a true effect size is exactly zero. What is possible in a Frequentist hypothesis testing framework is to statistically reject effects large enough to be deemed worthwhile. When researchers want to argue for the absence of an effect that is large enough to be worthwhile to examine, they can test for equivalence.

A widely recommended approach within a Frequentist framework is to test for equivalence using the TOST procedure (Schuirmann, 1987), because of its simplicity and widespread use in other scientific disciplines. The goal in the TOST approach is to specify a lower and upper bound, such that results falling within this range are deemed equivalent to the absence of an effect that is worthwhile to examine (e.g., $\Delta$L = -0.3 to $\Delta$U = 0.3, where $\Delta$ is a difference that can be defined by either standardized differences such as Cohen's d, or raw differences such as 0.3 scale point on a 5-point scale). In the TOST procedure the null hypothesis is the presence of a true effect of $\Delta$L or $\Delta$U, and the alternative hypothesis is an effect that falls within the equivalence bounds, or the absence of an effect that is worthwhile to examine. The observed data is compared against $\Delta$L and $\Delta$U in two one-sided tests. If the p-value for both tests indicates the observed data is surprising, assuming $\Delta$L or $\Delta$U are true, we can follow a Neyman-Pearson approach to statistical inferences and reject effect sizes larger than the equivalence bounds. When making such a statement, we will not be wrong more often, in the long run, than our Type 1 error rate (e.g., 5%).

In this document I will present a number of practical examples for equivalence tests for independent t-tests, dependent t-tests, one-sample t-tests, correlations, and meta-analyses.

Equivalence Test for Meta-analysis

Hyde, Lindberg, Linn, Ellis, and Williams (2008) report that effect sizes for gender differences in mathematics tests across the 7 million students in the US represent trivial differences, where a trivial difference is specified as an effect size smaller then d = 0.1. The present a table with Cohen's d and se is reproduced below:

|Grades|d + se| |---|---| |Grade 2 |0.06 +/- 0.003| |Grade 3 |0.04 +/- 0.002| |Grade 4 |-0.01 +/- 0.002| |Grade 5 |-0.01 +/- 0.002| |Grade 6 |-0.01 +/- 0.002| |Grade 7 |-0.02 +/- 0.002| |Grade 8 |-0.02 +/- 0.002| |Grade 9 |-0.01 +/- 0.003| |Grade 10 |0.04 +/- 0.003| |Grade 11 |0.06 +/- 0.003|

library("TOSTER")
TOSTmeta(ES = 0.06, se = 0.003, low_eqbound_d=-0.1, high_eqbound_d=0.1, alpha=0.05)

We see that indeed, all effect size estimates are measured with such high precision, we can conclude that they fall within the equivalence bound of d = -0.1 and d = 0.1. However, note that all of the effects are also statistically significant - so, the the effects are statistically different from zero, and practically equivalent.

Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A. B., & Williams, C. C. (2008). Gender similarities characterize math performance. Science, 321(5888), 494-495.

Independent Groups Equivalence Test

Independent Groups Student's Equivalence Test

Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those exposed to control (d = 0.81, 95% CI: [0.19, 1.45]). A replication by Moery & Calin-Jageman (2016, Study 2) did not observe a significant effect (Control: n = 95, M = 5.25, SD = 0.91, Organic Food: n = 89, M = 5.22, SD = 0.83). Following Simonsohn's (2015) recommendation the equivalence bound was set to the effect size the original study had 33% power to detect (with n = 21 in each condition, this means the equivalence bound is d = 0.48, which equals a difference of 0.429 on a 7-point scale given the sample sizes and a pooled standard deviation of 0.894). Using a TOST equivalence test with alpha = 0.05, assuming equal variances, and equivalence bounds of d = -0.48 and d = 0.48 is significant, t(182) = -3.03, p = 0.001. We can reject effects more extreme than d = 0.48 (or a raw difference of 0.429 scalepoints).

# OLD CODE
#TOSTtwo(m1=5.25,m2=5.22,sd1=0.95,sd2=0.83,n1=95,n2=89,low_eqbound_d=-0.48, high_eqbound=0.48, alpha = 0.05, var.equal=TRUE)
TOSTtwo.raw(m1=5.25,m2=5.22,sd1=0.95,sd2=0.83,n1=95,n2=89,low_eqbound=-0.429, high_eqbound=0.429, alpha = 0.05, var.equal=TRUE)

# NEW CODE
tsum_TOST(m1=5.25,m2=5.22,sd1=0.95,sd2=0.83,n1=95,n2=89,eqb=0.429, alpha = 0.05, var.equal=TRUE)

Thanks to Brent Donnelan for pointing out the SD in the control condition in the replication was 0.91, not 0.95 as in an earlier version of this vignette, and the raw equivalence bound is 0.429, not 0.384 as in an earlier version. I mistook the n (95) and the SD, and originally calculated raw bounds for 0.43, not 0.48 by mistake.

*Moery, E., & Calin-Jageman, R. J. (2016). Direct and Conceptual Replications of Eskine (2013): Organic Food Exposure Has Little to No Effect on Moral Judgments and Prosocial Behavior. Social Psychological and Personality Science, 7(4), 312-319. https://doi.org/10.1177/1948550616639649 *

Independent Groups Welch's Equivalence Test

Deary, Thorpe, Wilson, Starr, and Whally (2003) report the IQ scores of 79,376 children in Scotland (39,343 girls and 40,033 boys). The IQ score for girls (M = 100.64, SD = 14.1) girls and boys (100.48, SD = 14.9) was non-significant. With such a huge sample, we can examine whether the data is equivalent using very small equivalence bounds, such as -0.05 and 0.05. Because sample sizes are unequal, and variances are unequal, Welch's t-test is used, and we can conclude IQ scores are indeed equivalent, based on equivalence bounds of -0.05 and 0.05.

# OLD CODE
TOSTtwo(m1=100.64,m2=100.48,sd1=14.1,sd2=14.9,n1=39343,n2=40033,low_eqbound_d=-0.05, high_eqbound_d=0.05, alpha = 0.05, var.equal=FALSE)

# NEW CODE
tsum_TOST(m1=100.64,m2=100.48,sd1=14.1,sd2=14.9,n1=39343,n2=40033,eqb=0.05, alpha = 0.05, var.equal=FALSE,
          eqbound_type = "SMD")

Deary, I. J., Thorpe, G., Wilson, V., Starr, J. M., & Whalley, L. J. (2003). Population sex differences in IQ at age 11: The Scottish mental survey 1932. Intelligence, 31(6), 533-542.

One-Sample Equivalence Test

Lakens, Semin, & Foroni (2012) examined whether the color of Chinese ideographs (white vs. black) would bias whether participants judged that the ideograph correctly translated positive, neutral, or negative words. In Study 4 the prediction was that participants would judge negative words to be correctly translated by black ideographs above guessing average, but no effects were predicted for translations in the other 5 between subject conditions in the 2 (ideograph color: white vs. black) x 3 (word meaning: positive, neutral, negative) design. The table below is a summary of the data in all 6 conditions:

|Color |Valence |M |% |SD |t |df |p |d| |---|---|---|---|---|---|---|---|---| |White |Positive |6.05 |50.42 |1.50 |0.15 |20 |.89 |.03| |White |Negative |6.68 |55.67 |1.70 |1.75 |18 |.10 |.40| |White |Neutral |5.95 |49.58 |1.40 |0.87 |19 |.87 |.04| |Black |Positive |6.45 |53.75 |1.91 |1.06 |19 |.30 |.24| |Black |Negative |6.95 |57.92 |1.15 |3.71 |19 |.00 |.80| |Black |Neutral |5.71 |47.58 |1.79 |0.73 |20 |.47 |.16|

With 19 to 21 participants in each between subject condition, the study had 80% power to detect equivalence with equivalence bounds of -0.68 to d = 0.68.

# OLD CODE
powerTOSTone(alpha=0.05, statistical_power=0.8, low_eqbound_d=-0.68, high_eqbound_d=0.68)

# NEW CODE -- note there is a minor discrepancy
# because the new function uses a different solution for power
power_t_TOST(type = "one.sample",eqb = 0.68,
             power = 0.8,alpha=.05)

When we perform 5 equivalence tests using equivalence bounds of -0.68 and 0.68, using a Holm-Bonferroni sequential procedure to control error rates, we can conclude statistical equivalence for the data in row 1, 3, and 6, but the conclusion is undetermined for tests 2 and 4, where the data is neither significantly different from guessing average, nor statistically equivalent. For example, performing the test for the data in row 6:

# OLD CODE
TOSTone(m=5.71,mu=6,sd=1.79,n=20,low_eqbound_d=-0.68, high_eqbound_d=0.68, alpha=0.05)

# NEW CODE
tsum_TOST(m1=5.71-6,sd1=1.79,n1=20,eqb=0.68, eqbound_type = "SMD")

It is clear I should have been more tentative in my conclusions in this study. Not only can I not conclude equivalence in some of the conditions, the equivalence bound I had 80% power to detect is very large, meaning the possibility that there are theoretically interesting but smaller effects remains.

Lakens, D., Semin, G. R., & Foroni, F. (2012). But for the bad, there would not be good: Grounding valence in brightness through shared relational structures. Journal of Experimental Psychology: General, 141(3), 584-594. DOI: 10.1037/a0026468

Equivalence test for correlations

Olson, Fazio, and Hermann (2007) reported correlations between implicit and explicit measures of self-esteem, such as the IAT, Rosenberg's self-esteem scale, a feeling thermometer, and trait ratings. In Study 1 71 participants completed the self-esteem measures. Because no equivalence bounds are mentioned, we can see which equivalence bounds the researchers would have 80% power to detect.

# OLD
powerTOSTr(alpha=0.05, statistical_power=0.8, 
           low_eqbound_r=-0.24, high_eqbound_r=0.24)

# NEW
power_z_cor(alpha=0.05, 
            power=0.8, 
            rho = 0,
            null=0.24,
            alternative = "equ")

With 71 pairs of observations between measures, the researchers have 80% power for equivalence bounds of r = -0.24 and r = 0.24.

The correlations observed by Olson et al (2007), Study 1, are presented in the table below (significant correlations are flagged by an asteriks).

|Measure |IAT |Rosenberg|Feeling thermometer| Trait ratings| |---|---|---|---|---| |IAT |- |-.12 |-.09 |-.06| |Rosenberg| | - |.62| .09| |Feeling thermometer| | | - |.29| |Trait ratings| | | |-|

We can test each correlation, for example the correlation between the IAT and the Rosenberg self-esteem scale of -0.12, given 71 participants, and using equivalence bounds of r_U = 0.24 and r_L = -0.24.

# OLD CODE
TOSTr(n=71, r=-0.12, low_eqbound_r=-0.24, high_eqbound_r=0.24, alpha=0.05)

# NEW CODE
corsum_test(n=71, r=-0.12, null=0.24, alpha=0.05,
            alternative = "equivalence")

We see that none of the correlations can be declared to be statistically equivalent. Instead, the tests yield undetermined outcomes: the correlations are not significantly different from 0, nor statistically equivalent.

Olson, M. A., Fazio, R. H., & Hermann, A. D. (2007). Reporting tendencies underlie discrepancies between implicit and explicit measures of self-esteem. Psychological Science, 18(4), 287-291.

Lakens/TOSTER documentation built on June 9, 2025, 8:57 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com