RPtest: Goodness of fit tests for potentially high-dimensional linear...

Description Usage Arguments Details Value References See Also Examples

View source: R/RPtest.R

Description

Can test for the significance of (potentially large) groups of predictors and the presence of nonlinearity or heteroscedasticity in the context of both low and high-dimensional linear models. Outputs a p-value. Also allows for the calibration of arbitrary goodness of fit tests via specification of RPfunction.

Usage

1
2
3
4
5
RPtest(x, y, resid_type = c("Lasso", "OLS"), test = c("nonlin", "group",
  "hetero"), x_alt, RPfunction = NULL, B = 49L, rand_gen = rnorm,
  noise_matrix = NULL, mc.cores = 1L, nfolds = 5L, nperms = 2L,
  beta_est = NULL, resid_only = FALSE, output_all = FALSE,
  verbose = FALSE)

Arguments

x

Input matrix with nobs rows, each an observation vector.

y

Response vector.

resid_type

Type of residuals used for the test (see details below). Use Lasso when the null model is high-dimensional; otherwise use OLS.

test

Type of departure from the linear model to test for (see details below). Ignored if RPfunction is given.

x_alt

If test is group, this gives the set of variables whose significance we wish to ascertain, after controlling for those in x. If RPfunction is given, it is the input matrix passed to the function RPfunction.

RPfunction

A residual prediction (RP) function that must permit calling as RPfunction(x_alt, resid) where resid is a numeric vector with nobs components. The output must be either a single number or a numeric vector (in the latter case RPfunction would encode a number of RP functions).

B

The number of bootstrap samples to use - note the p-value produced will always be at least 1/B.

rand_gen

A function to generate the simulated errors up to an unknown scale factor. It must permit calling as rand_gen(nobs*B). Determines the form of errors in the null model. The default rnorm equates to a null of a (sparse) Gaussian linear model. Setting rand_gen=NULL resamples residuals to generate simulated errors and approximates a null of i.i.d. errors with unknown distribution.

noise_matrix

An optional matrix whose columns are the simulated errors to use. Note that B and rand_gen will be ignored if this is supplied.

mc.cores

The number of cores to use. Will always be 1 in Windows.

nfolds

Number of folds to use when performing cross-validation to obtain beta_est, the initial estimate of the vector of regression coefficients, via Lasso estimation.

nperms

Number of permutations of the data for which nfolds cross-validation is to be performed. Thus in total prediction errors on nfolds*nperms folds are averaged over.

beta_est

An optional user-supplied estimate.

resid_only

If TRUE only outputs the residuals without applying an RP function.

output_all

In addition to the p-value, gives further output (see Value below).

verbose

Whether to print addition information.

Details

The function works by first computing residuals from a regression of y on x. Next B sets of errors generated through rand_gen are added to a signal derived from beta_est and aritificial residuals are computed. The option resid_only=TRUE then outputs these residuals along with the original residuals, scaled to have l_2-norm squared equal to nobs. The residuals in question are OLS residuals when resid_type=OLS (case a - for use when the null hypothesis is low-dimensional so the number of columns of x is smaller than nobs-1), and square-root / scaled Lasso residuals otherwise (case b). The options for test then apply different functions to the residuals as described below.

nonlin

In case (a), the test statistic is the RSS (residual sum of squares) of a randomForest fit from regressing the residuals on to x; case (b) is similar but the OOB error is used and the regression is carried out on the equicorrelation set rather than all of x.

group

x_alt is first residualised with respect to x by (a) OLS or (b) sparse_proj. Then the RSS from Lasso fits from regressions of the residuals on to x_alt are used.

hetero

Uses the RSS from Lasso fits from regressions of the squared residuals to the equicorrelation set (b) or all of x (a).

Value

When resid_only=FALSE and output_all=FALSE, the output is a single p-value. Otherwise, a list with some of the following components is returned (resid_only=FALSE causes the last two components to be omitted):

p-value

p-value

beta_est

estimated vector of regression coefficients beta_est

sigma_est

set to 1 when resid_type=OLS; otherwise the normalised root-RSS derived from beta_est used in generated the simulated errors

resid

scaled residuals

resid_sim

simulated scaled residuals

test

the test statistic(s) - may be a vector if multiple RP functions are being used such as when test=group

test_sim

a list of simulated test statistics

References

Shah, R. D., Buhlmann, P. (2016) Goodness of fit tests for high-dimensional linear models http://arxiv.org/abs/1511.03334

See Also

RPtest_single and sqrt_lasso

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
# Testing for nonlinearity
set.seed(1)
x <- scale(matrix(runif(100*200), 100, 200))
y <- x[, 1] + x[, 1]^4 + rnorm(nrow(x))
out <- RPtest(x, y, test="nonlin", B=9L, nperms=2, resid_type = "Lasso")

# Testing significance of a group
y <- x[, 1:5] %*% rep(1, 5) + x[, 151] + rnorm(nrow(x))
(out <- RPtest(x[, 1:150], y, test="group", x_alt = x[, 151:200], B=9L, nperms=1))

# Testing for heteroscedasticity
x <- scale(matrix(runif(250*100), 250, 100))
hetero_sig <- x[, 1] + x[, 2]
var_vec <- hetero_sig - min(hetero_sig) + 0.01
var_vec <- var_vec / mean(var_vec)
sd_vec <- sqrt(var_vec)
y <- x[, 1:5] %*% rep(1, 5) + sd_vec*rnorm(nrow(x))
(out <- RPtest(x, y, test="hetero", B=9L, nperms=1))

Example output

[1] 0.1
[1] 0.1

RPtests documentation built on May 29, 2017, 9:06 a.m.