prototest.multivariate: Perform Prototype or F tests for Significance of Groups of...
In prototest: Inference on Prototypes from Clusters of Features



Perform prototype or F tests for the significance of groups of predictors in the multivariate model. The available tests are the exact and approximate likelihood ratio prototype tests (ELR, ALR), the F test, and the marginal screening prototype test (MS). Each can be run as a selective or a non-selective test. Selective tests can use either a non-sampling or a hit-and-run reference distribution.


prototest.multivariate(x, y, groups, test.group, type = c("ELR", "ALR", "F", "MS"), 
selected.col = NULL, lambda, mu = NULL, sigma = 1, 
hr.iter = 50000, hr.burn.in = 5000, verbose = FALSE, tol = 10^-8)

`x`	input matrix of dimension n-by-p, where p is the number of predictors over all predictor groups of interest. Will be mean centered and standardised before tests are performed.
`y`	response variable. Vector of length n, assumed to be quantitative.
`groups`	group membership of the columns of `x`. Vector of length p, with each element containing the group label of the corresponding column in `x`.
`test.group`	group label for which we test nullity. Should be one of the values seen in `groups`. See Details for further explanation.
`type`	type of test to be performed. Only one test can be selected at a time. Options are the exact and approximate likelihood ratio prototype tests of Reid et al. (2015) (ELR, ALR), the F test, and the marginal screening prototype test of Reid and Tibshirani (2015) (MS). Default is ELR.
`selected.col`	preselected columns selected by the user. Vector of indices in the set {1, 2, ... p}. Used in conjunction with `groups` to ascertain for which groups the user has specified selected columns. Should it find any selected columns within a group, no further action is taken to select columns. Should no columns within a group be specified, columns are selected using either lasso or the marginal screening procedure, depending on the test. If all groups have prespecified columns, a non-selective test is performed, using the classical distributional assumptions (exact and/or asymptotic) for the test in question. If any selection is performed, selective tests are performed. Default is `NULL`, requiring the selection of columns in all the groups.
`lambda`	regularisation parameter for the lasso fit. Same for each group. Must be supplied when at least one group has unspecified columns in `selected.col`. Will be supplied to `glmnet`. This is the unstandardised version, equivalent to `lambda`/`n` supplied to `glmnet`.
`mu`	mean parameter for the response. See Details below. If supplied, it is first subtracted from the response to yield a zero-mean (at the population level) vector for which we proceed with testing. If `NULL` (the default), this parameter is treated as nuisance parameter and accounted for as such in testing.
`sigma`	error standard deviation for the response. See Details below. If not supplied, it is assumed to be 1. Required for computation of some of the test statistics.
`hr.iter`	number of hit-and-run samples required in the reference distribution of a selective test. Applies only if `selected.col` is `NULL`. Default is 50000. Since dependent samples are generated, large values are required to generate good reference distributions. If set to 0, the function tries to apply a non-sampling selective test (provided `selected.col` is `NULL`), if possible. If a non-sampling test is not possible, the function exits with a message.
`hr.burn.in`	number of burn-in hit-and-run samples. These are generated first so as to make subsequent hit-and-run realisations less dependent on the observed response. Samples are then discarded and do not inform the null reference distribution.
`verbose`	should progress be printed?
`tol`	convergence threshold for iterative optimisation procedures.
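
The rule that ties `groups` to `selected.col` can be sketched in base R. The helper `groups_needing_selection` below is hypothetical and not part of prototest; it only mirrors the documented behaviour that a group with no preselected column triggers automatic selection, and hence a selective test.

```r
# Hypothetical helper (not part of prototest): list the groups that have no
# user-specified column in selected.col and so would need automatic selection.
groups_needing_selection <- function(groups, selected.col = NULL) {
  covered <- unique(groups[selected.col])  # groups with at least one preselected column
  setdiff(unique(groups), covered)
}

groups <- rep(1:4, each = 20)              # 80 columns in 4 equally sized groups

# Columns given only for group 2: groups 1, 3 and 4 still need selection,
# so a selective test would be performed.
partial <- groups_needing_selection(groups, selected.col = 21:25)

# One column given in every group: nothing is left to select,
# so a non-selective test would be performed.
full <- groups_needing_selection(groups, selected.col = c(1, 21, 41, 61))

print(partial)
print(length(full) == 0)
```

With `selected.col = NULL` the helper returns every group label, which corresponds to the default case where columns are selected in all groups.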

The model underpinning each of the tests is

$$y = \mu + \sum_{k=1}^{K} \theta_k \hat{y}_k + \epsilon$$

where $\epsilon$ is Gaussian with mean 0 and variance $\sigma^2$ and $K$ is the number of predictor groups. $\hat{y}_k$ depends on the particular test considered.

In particular, for the ELR, ALR and F tests, we have $\hat{y}_k = P_{M_k}(y - \mu)$, where $P_{M_k} = X_{M_k} X_{M_k}^{\dagger}$. $X_M$ is the input matrix reduced to the columns with indices in the set $M$, and $M_k$ is the set of indices selected from considering group $k$ of predictors in isolation. This set is either provided by the user (via `selected.col`) or selected automatically (if `selected.col` is `NULL`). If the former, a non-selective test is performed; if the latter, a selective test is performed, with the restrictions $Ay \le b$, as set out in Lee et al (2015) and stacked as in Reid and Tibshirani (2015).
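
As a rough illustration of the projection prototypes, the following base R sketch computes the projection of the centred response onto the selected columns of each group. The simulated data, the index sets in `M`, and the known value of `mu` are assumptions made for the example; this is not the package's internal code.

```r
set.seed(1)
n <- 50; p <- 6
X <- scale(matrix(rnorm(n * p), ncol = p))  # centred and standardised columns
y <- 3 + 0.5 * X[, 1] + rnorm(n)
mu <- 3                                     # assumed known, purely for illustration

# Hypothetical selected index sets M_k for K = 2 groups (columns 1-3 and 4-6)
M <- list(c(1, 2), 4)

# Prototype for group k: projection of (y - mu) onto the span of the columns in M_k.
# qr.fitted() applies the projection without forming the projection matrix.
prototypes <- sapply(M, function(idx) {
  qr.fitted(qr(X[, idx, drop = FALSE]), y - mu)
})

# Cross-check group 1 against the explicit projection matrix. For a matrix with
# full column rank, the pseudo-inverse equals (X'X)^{-1} X'.
X1 <- X[, M[[1]], drop = FALSE]
P1 <- X1 %*% solve(crossprod(X1), t(X1))
max_diff <- max(abs(P1 %*% (y - mu) - prototypes[, 1]))
print(max_diff)
```

Each column of `prototypes` is one group's prototype, so the matrix has n rows and K columns.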

For the marginal screening prototype (MS) test, $\hat{y}_k = x_{j^*}$, where $x_j$ is the $j$th column of `x` and $j^*$ is the index in $C_k$ of the column with maximal marginal correlation with the response. Here $C_k$ is the set of indices in the overall predictor set corresponding to predictors in the $k$th group.
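
The marginal screening step can be pictured with a few lines of base R. The helper `ms_prototype_index` is hypothetical, and taking the absolute value of the correlation is an assumption of this sketch; the text above only says "maximal marginal correlation".

```r
set.seed(2)
n <- 100; p <- 8
x <- scale(matrix(rnorm(n * p), ncol = p))
y <- 2 * x[, 6] + rnorm(n)                 # column 6 carries the signal
groups <- rep(1:2, each = 4)               # columns 1-4 in group 1, 5-8 in group 2

# Hypothetical helper: index of the column in group k with the largest
# absolute marginal correlation with y (absolute value is an assumption).
ms_prototype_index <- function(x, y, groups, k) {
  C_k <- which(groups == k)                # indices of the columns in group k
  C_k[which.max(abs(cor(x[, C_k, drop = FALSE], y)))]
}

j_star <- sapply(1:2, function(k) ms_prototype_index(x, y, groups, k))
print(j_star)

# The MS prototype for group 2 is then simply that column of x
y_hat_2 <- x[, j_star[2]]
```

The returned indices refer to columns of the full matrix `x`, which matches how `selected.col` is indexed.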

All tests test the null hypothesis $H_0: \theta_{k^*} = 0$, where $k^*$ is supplied by the user via `test.group`. Details of each are described in Reid et al (2015).

A list with the following four components:

`ts`	The value of the test statistic on the observed data.
`p.val`	Valid p-value of the test.
`selected.col`	Vector with columns selected for prototype formation in the test. If initially `NULL`, this will now contain indices of columns selected by the automatic column selection procedures of the test.
`y.hr`	Matrix with hit-and-run replications of the response. If sampled selective test was not performed, this will be `NULL`.
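
To give a feel for what hit-and-run replications such as those in `y.hr` look like, here is a generic sketch of a hit-and-run sampler for a Gaussian vector with mean zero and standard deviation `sigma`, restricted to a region of the form Az <= b. The function `hit_and_run` and the toy constraint are assumptions for illustration. The package's own sampler may condition on further quantities, so this is not its implementation.

```r
# Generic hit-and-run sketch (hypothetical, not the prototest sampler).
# z0 must satisfy A %*% z0 <= b. Returns an iter-by-n matrix of samples.
hit_and_run <- function(z0, A, b, sigma = 1, iter = 2000, burn.in = 200) {
  n <- length(z0)
  z <- z0
  out <- matrix(NA_real_, nrow = iter, ncol = n)
  for (i in seq_len(burn.in + iter)) {
    d <- rnorm(n)
    d <- d / sqrt(sum(d^2))                # uniformly distributed unit direction
    Ad <- drop(A %*% d)
    slack <- drop(b - A %*% z)
    ratio <- slack / Ad
    lo <- max(c(-Inf, ratio[Ad < 0]))      # feasible step sizes t lie in [lo, hi]
    hi <- min(c(Inf, ratio[Ad > 0]))
    m <- -sum(z * d)                       # along the line, t ~ N(m, sigma^2)
    u <- runif(1, pnorm((lo - m) / sigma), pnorm((hi - m) / sigma))
    z <- z + (m + sigma * qnorm(u)) * d    # truncated normal step via inverse CDF
    if (i > burn.in) out[i - burn.in, ] <- z   # burn-in draws are discarded
  }
  out
}

# Toy constraint: both coordinates non-negative, written as -z <= 0
set.seed(3)
A <- -diag(2)
b <- c(0, 0)
samples <- hit_and_run(c(1, 1), A, b, iter = 500, burn.in = 50)
print(dim(samples))
print(min(samples))
```

Successive rows are dependent, which is why the documentation recommends large values of `hr.iter` and a burn-in period.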

Stephen Reid

Reid, S. and Tibshirani, R. (2015) Sparse regression and marginal testing using cluster prototypes. Biostatistics. doi: 10.1093/biostatistics/kxv049. Preprint available online: http://arxiv.org/pdf/1503.00334v2.pdf.
Reid, S., Taylor, J. and Tibshirani, R. (2015) A general framework for estimation and inference from clusters of features. Available online: http://arxiv.org/abs/1511.07839.

prototest.univariate

library(prototest)

### generate data
set.seed (12345)
n = 100
p = 80

X = matrix (rnorm(n*p, 0, 1), ncol=p)


beta = rep(0, p)
beta[1:3] = 0.1 # three signal variables: number 1, 2, 3
signal = apply(X, 1, function(row){sum(beta*row)}) # linear predictor: one value per row of X
intercept = 3

y = intercept + signal + rnorm (n, 0, 1)

### treat all columns as if in same group and test for signal

# non-selective ELR test with nuisance intercept
elr = prototest.univariate (X, y, "ELR", selected.col=1:5)
# selective F test with nuisance intercept; non-sampling
f.test = prototest.univariate (X, y, "F", lambda=0.01, hr.iter=0) 
print (elr)
print (f.test)

### assume variables occur in 4 equally sized groups
num.groups = 4
groups = rep (1:num.groups, each=p/num.groups)

# selective ALR test -- select columns 21-25 in 2nd group; test for signal in 1st; hit-and-run
alr = prototest.multivariate(X, y, groups, 1, "ALR", 21:25, lambda=0.005, hr.iter=20000)
# non-selective MS test -- specify first column in each group; test for signal in 1st
ms = prototest.multivariate(X, y, groups, 1, "MS", c(1,21,41,61)) 
print (alr)
print (ms)

Loading required package: intervals
Loading required package: MASS
Loading required package: glmnet
Loading required package: Matrix

Attaching package: 'Matrix'

The following object is masked from 'package:intervals':

    expand

Loading required package: foreach
Loaded glmnet 2.0-16

     ts  p.val
1 0.084 0.7722
    ts p.val
1 3.51 0.694
     ts  p.val
1 4.147 0.0756
     ts  p.val
1 1.596 0.1106