MSE_Test.default: Comparing Test MSEs for Full and Reduced Models
In tim-coleman/RFtest: Scalable and Efficient Hypothesis Tests for Random Forests


View source: R/MSE_Test_File.R

Implements a permutation test that swaps trees between two forests: one built with `var` left intact, and one built with `var` replaced by a row-wise permuted copy of itself.
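The idea of permuting trees between two forests can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the per-tree predictions are simulated rather than produced by fitted models, and the names `pred_full`, `pred_reduced`, and `ensemble_mse` are hypothetical.

```r
# Sketch only: a tree-swapping permutation test on simulated per-tree predictions.
set.seed(42)
n_test <- 50    # number of test points
n_tree <- 100   # trees per forest
B      <- 500   # number of tree permutations

y_test <- rnorm(n_test)

# Hypothetical per-tree predictions (rows = test points, columns = trees).
# The "full" forest tracks y_test; the "reduced" forest recovers only half the signal.
pred_full    <- y_test + matrix(rnorm(n_test * n_tree, sd = 0.5), n_test, n_tree)
pred_reduced <- 0.5 * y_test + matrix(rnorm(n_test * n_tree, sd = 0.5), n_test, n_tree)

# MSE of an ensemble: average the trees' predictions, then compare with y.
ensemble_mse <- function(preds, y) mean((rowMeans(preds) - y)^2)

# Observed statistic: reduced-model MSE minus full-model MSE.
obs_diff <- ensemble_mse(pred_reduced, y_test) - ensemble_mse(pred_full, y_test)

# Permutation distribution: pool all trees, reassign them at random to two
# forests of the original size, and recompute the MSE difference each time.
pooled <- cbind(pred_full, pred_reduced)
perm_diffs <- replicate(B, {
  idx      <- sample(ncol(pooled))
  forest_a <- pooled[, idx[seq_len(n_tree)], drop = FALSE]
  forest_b <- pooled[, idx[-seq_len(n_tree)], drop = FALSE]
  ensemble_mse(forest_b, y_test) - ensemble_mse(forest_a, y_test)
})

# One-sided p-value with the usual +1 correction (an assumed convention).
p_value <- (1 + sum(perm_diffs >= obs_diff)) / (B + 1)
p_value
```

If the variable carries no signal, the two forests are exchangeable and the observed difference should look like a typical draw from `perm_diffs`; a large observed difference relative to that distribution is evidence that the variable matters.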

MSE_Test(X, y, X.test = FALSE, y.test = FALSE, var,
  NTest = nrow(X.test), B = 1000, NTree = 500, p = 1/2,
  base.learner = "rpart", mtry = ncol(X), importance = T,
  alpha = if (base.learner == "lm") 1,
  glm_cv = if (base.learner == "lm") "external" else "none",
  lambda = if (glm_cv == "none" & base.learner == "lm") 1 else NULL,
  ranger = F)

MSE_Test(formula, data, ...)

`X`	Data frame of covariates - the training data.
`y`	Response vector. Currently only numeric responses (regression) are supported.
`X.test`	Covariates of the test set with which the MSE is calculated.
`y.test`	Responses in the test set with which the MSE is calculated.
`base.learner`	One of `"rpart"`, `"ctree"`, `"rtree"`, or `"lm"`. Base model to be used in the bagging.
`NTree`	Number of base learners.
`mtry`	`"mtry"` parameter associated with random forest models.
`var`	Variable of interest. Should correspond to the name of a variable in both `X` and `X.test`.
`NTest`	If `X.test` and `y.test` are not specified, this many test points are drawn at random from `X, y` to serve as a test set.
`B`	Number of permutations to use in the test. Note: this is the number of times the trees are permuted between forests to generate the permutation distribution, not the number of times each feature is permuted.
`p`	Fractional exponent of sample size, i.e. k = n^p observations are drawn.
`importance`	Logical. Should the standardized score of the test statistic (its "importance") be returned?
`form`	A `"formula"` object - no need to provide this by default.
`alpha`	Mixing parameter if `base.learner = "lm"` is chosen; controls the balance between the LASSO and Ridge penalties.
`glm_cv`	Should internal cross validation be performed on each Elastic Net model?
`lambda`	Regularization parameter if `base.learner = "lm"` is chosen.
`ranger`	If `base.learner = "rtree"` or `base.learner = "ctree"`, should the models be `ranger` objects, rather than `randomForest` objects (if rtree is chosen) or `cforest` objects (if ctree is chosen)?
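To make the role of `p` concrete, the following sketch computes the subsample size k = n^p for the sample size used in the examples. The rounding rule (`floor`) and sampling without replacement are assumptions made for illustration; the package may handle both differently.

```r
# Sketch only: subsample size implied by the fractional exponent p.
set.seed(1)
n <- 1250

# Default p = 1/2: k = floor(sqrt(1250)) = 35
k_default <- floor(n^(1/2))

# p = 0.85, as used in the examples: a much larger subsample
k_large <- floor(n^0.85)

# Draw the row indices for one base learner (without replacement: an assumption)
idx <- sample(n, k_large)

c(k_default = k_default, k_large = k_large)
```

Larger values of `p` give each base learner more data, at the cost of making the base learners more similar to one another.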

An object of the S4 class `MSE_Test`, with the following components:

`var`	Variable whose importance was tested, a name of a column in `X`.
`originalStat`	A named vector of two quantities, `Original MSE`, which corresponds to the MSE of the full model and `Permuted MSE` which corresponds to MSE of the reduced model.
`PermDiffs`	A vector of the differences in permuted MSEs - these make up the permutation distribution.
`Importance`	A scalar of the SD Importance Z-score.
`Pvalue`	The p-value for the hypothesis tested.
`test_pts`	The test data frame.
`weak_learner`	The base models used in the ensemble.
`model_original`	The full model ensemble - list of base learners, like in `bag.s`.
`model_permuted`	The reduced model ensemble - list of base learners, like in `bag.s`.
`test_stat`	Which test statistic is used. Will always be `"MSE"` for this function.
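Because the return value is an S4 object, its components are read with `@` rather than `$`. The sketch below uses a hypothetical stand-in class, `MockMSETest`, that mirrors a few of the component names listed above so the snippet runs without the package. The formulas used to fill `Importance` and `Pvalue` are assumptions for illustration, not a statement of how the package computes them.

```r
# Sketch only: reading components from an S4 object with `@`.
library(methods)

# Hypothetical stand-in class mirroring some of the documented component names.
setClass("MockMSETest",
         slots = c(var = "character", originalStat = "numeric",
                   PermDiffs = "numeric", Importance = "numeric",
                   Pvalue = "numeric"))

set.seed(1)
perm_diffs <- rnorm(1000, sd = 0.01)   # simulated permutation distribution
orig <- c("Original MSE" = 0.10, "Permuted MSE" = 0.15)
obs  <- unname(orig["Permuted MSE"] - orig["Original MSE"])

res <- new("MockMSETest",
           var          = "X3",
           originalStat = orig,
           PermDiffs    = perm_diffs,
           # Assumed definitions: z-score and one-sided permutation p-value
           Importance   = (obs - mean(perm_diffs)) / sd(perm_diffs),
           Pvalue       = (1 + sum(perm_diffs >= obs)) / (length(perm_diffs) + 1))

res@Pvalue                          # p-value
res@originalStat["Permuted MSE"]    # MSE of the reduced model
slotNames(res)                      # names of all components
```

With a real result, the same pattern would apply, for example `M_test@Pvalue` or `M_test@PermDiffs`.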

Tim Coleman

library(dplyr)   # provides %>%, mutate() and select(), used below
library(RFtest)

N <- 1250
Nvar <- 10
N_test <- 150
name_vec <- paste("X", 1:(2*Nvar), sep = "")

# training data:
X <- data.frame(replicate(Nvar, runif(N)),
                replicate(Nvar, cut(runif(N), 3,
                                      labels = as.character(1:3)))) %>%
  mutate(Y = 5*(X3) + .5*X2^2 + ifelse(X6 > 10*X1*X8*X9, 1, 0) +  rnorm(N, sd = .05))
names(X) <- c(name_vec, "Y")

# some testing data:
X.t1 <- data.frame(replicate(Nvar, runif(N_test)),
                   replicate(Nvar, cut(runif(N_test), 3,
                                       labels = as.character(1:3)))) %>%
  mutate(Y = 5*(X3) + .5*X2^2 + ifelse(X6 > 10*X1*X8*X9, 1, 0) +  rnorm(N_test, sd = .05))
names(X.t1) <- c(name_vec, "Y")

# Not specifying test points:
M_no_test <- MSE_Test(X = X %>% dplyr::select(-Y), y = X$Y,
                      base.learner = "lm", NTest = 100, NTree = 150, B = 1000,
                      var = "X3", p = .85, glm_cv = TRUE)

summary(M_no_test)

# Specifying test points:
M_test <- MSE_Test(X = X %>% dplyr::select(-Y), y = X$Y,
                   X.test = X.t1 %>% dplyr::select(-Y), y.test = X.t1$Y,
                   base.learner = "ctree", NTree = 250, B = 1000,
                   var = "X2", p = .85)
summary(M_test)