In Jaeoc/lmhyp: Informed Hypothesis Testing for Regression

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

lmhyp

This package provides an easy way to test hypotheses about coefficients in multiple regression. Simply fit a linear model in R with lm, specify your hypotheses and test them with test_hyp. The package enables formal testing of hypotheses using a default Bayesian prior and is particularly useful for testing multiple contradicting hypotheses. A tutorial paper for the package is available at https://doi.org/10.3758/s13428-018-01196-9.

Basic example

library(lmhyp)

###Fit a linear model with lm
dt <- as.data.frame(scale(mtcars[, c(1, 3:4, 6)]))
fit <- lm(mpg ~ disp + hp + wt, data = dt)

###Define hypotheses and test them
hyp <- "wt > disp; wt < disp; wt = disp"
test_hyp(fit, hyp)

Installation

You can install lmhyp from github with:

# install.packages("devtools")
devtools::install_github("Jaeoc/lmhyp")

Usage

This function is based on a Bayes factor method by Mulder (2014). In essence, it uses a number of observations equal to the number of predictors in the linear model to construct a minimally informative prior, and the remainder of the observations are then used to test the hypotheses.

The function test_hyp takes as primary input a linear model object and a string vector specifying hypotheses. Simply specify the linear model as you normally would in R using lm. If hypotheses involve multiple variables it is typically preferable to standardize the variables first to facilitate interpretation.

fit <- lm(mpg ~ disp + hp + wt, data = dt)

Hypotheses are then specified using the variable names of the fitted object. Spaces do not matter. For example, if we think that both wt and disp have negative associations with the DV, this can be specified as:

Hyp <- "wt < 0, disp < 0"

When comparing several variables with the same value or variable they can be grouped with parentheses. The following specification is equivalent to the above:

Hyp <- "(wt, disp) < 0"

Parentheses become more important when defining hypotheses that involve more variables. For example, a hypothesis might be that the third variable hp also has a negative association with the DV, but that wt has a stronger negative association than either of the other variables. In that case it becomes necessary to use parentheses because the function does not permit repeating variables when specifying a hypothesis. This hypothesis would be then be specified as follows:

H1 <- "wt < (disp, hp) < 0"

If we have several contradicting hypotheses, as in the basic example, we can compare these directly by including them all in the same string vector separated with semicolons. For example, an alternative hypothesis might be that wt has a negative association, but that disp and hp have no association with the DV:

H2 <- "wt < (disp, hp) = 0"

Since H1 and H2 make predictions regarding the same variables we can test them directly against each other. Note the semicolons that separate each hypothesis:

H1v2 <- "wt < (disp, hp) < 0; wt < (disp, hp) = 0"

Testing the hypotheses is then as simple as inputting the lm object and hypothesis object into the test_hyp function:

result <- test_hyp(fit, H1v2)

When specified hypotheses are overlapping as in this case (in both hypotheses wt < 0) the function takes slightly longer to finish since it runs 1e4 Monte Carlo iterations to check the extent of the overlap.

The function gives as primary output which hypotheses were specified and the posterior probability of each hypothesis. If the input hypotheses are not exhaustive (i.e., do not cover the full parameter space), the posterior probability of the complement is also output. The complement is the hypothesis that neither of the input hypotheses is true. In the current case we get the following output:

result

We see that there is strong evidence in the data for H1 and very weak evidence for H2 and the complement Hc. Exactly how the hypotheses compare to each other we can find out by printing the Bayes factor matrix:

result$BF_matrix

By definition, the Bayes Factor is interpreted as the likelihood of the data under a hypothesis compared to another. However, because the prior by default is the same for all hypotheses we can directly interpret the BFs as the likelihood of a hypothesis compared to another. Thus, given the data, H1 is r round(result$BF_matrix[1, 2], 0) times as likely as H2 ($B_{12}$ = r round(result$BF_matrix[1, 2], 2)) and r round(result$BF_matrix[1, 3], 0) times as likely as the complement ($B_{1c}$ = r round(result$BF_matrix[1, 3], 2)).

It is also possible to specify one's own prior probabilites as a vector of numbers. These must then be specified for all hypotheses, including the complement if one exists. The function will throw an error if too few or too many probabilites have been specified. They can be specified both as probabilites, e.g, c(0.5, 0.2, 0.3) or as relative weights, e.g., c(5, 2, 3). In the latter case the function will normalize the input values. So for our example we might get:

test_hyp(fit, H1v2, c(5, 2, 3))

As can be seen the posterior probabilites are now different from when we used the default equal prior probabilites.

An alternative to specifying explicit hypotheses is to specify the option "exploratory". This will print the posterior probabilites for the hypotheses "X < 0; X = 0; X > 0" for all variables, including the intercept, and assumes equal prior probabilities.

test_hyp(fit, "exploratory")

Also here it would be possible to ask for the BF_matrix for each variable if we had saved the test as an object.

Finally, the function provides some supplementary output intended to provide a deeper understanding of the underlying method and primary output. Calling BF_computation illustrates the computation of the Bayes factor of each hypothesis against the unconstrained hypothesis. In the output c(E) is the prior density, c(I|E) the prior probability, c the product of these two, and columns prefixed with "f" the equivalent for the posterior. B(t,u) is the Bayes factor of hypothesis t against the unconstrained (u) and PP(t) is the posterior probability of t.

round(result$BF_computation,2)

Note that I round the output here for more convenient presentation. Thus, the Bayes factor against the unconstrained hypothesis is calculated as f(I|E) / c(I|E) = B(t, u) and the posterior probability, for say $H_1$, as PP(1) = B(1, u) / [B(1,u) + B(2,u) + B(c, u)] = 0.99.

The second supplementary output we can ask for is 90% credibility intervals around the Bayes factors against the constrained hypothesis. In most cases our method uses an analytic solution, but under some circumstances (when the matrix with the coefficients of the order constraints is not of full row rank) it uses a numerical solution with the number of draws equal to the input value mcrep (by default 1e6). This is the case for the current example and we might then print these uncertainty intervals by calling BFu_CI:

round(result$BFu_CI,2)

Here the Bayes factor and its lower and upper bound is presented in all cases where this is relevant.

References

Mulder, J. (2014). Prior adjusted default Bayes factors for testing (in) equality constrained hypotheses. Computational Statistics & Data Analysis, 71, 448-463.

J. Mulder, A. Olsson-Collentine (2019). Simple Bayesian Testing of Scientific Expectations in Linear Regression Models. Behavior Research Methods, https://doi.org/10.3758/s13428-018-01196-9.

Jaeoc/lmhyp documentation built on April 3, 2020, 11:50 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com