Hypothesis Testing for Covariate or Group Effect in Nonparametric Regression

Description

Tests the significance of a covariate or a group of covariates in a nonparametric regression. The test applies a high-dimensional one-way ANOVA to the residuals from a local polynomial fit on the remaining covariates.

Usage

npmodelcheck(X, Y, ind_test, p = 7, degree.pol = 0, kernel.type =
 "epanech", bandwidth = "CV", gridsize = 30, dim.red = c(1, 10))

Arguments

X

matrix of observations, with rows corresponding to data points and columns corresponding to covariates.

Y

vector of observed responses.

ind_test

index or vector with indices of covariates to be tested.

p

size of the window W_i. See Details.

degree.pol

degree of the polynomial to be used in the local fit.

kernel.type

kernel type; options are "box", "trun.normal", "gaussian", "epanech",
"biweight", "triweight" and "triangular". "trun.normal" is a Gaussian kernel truncated between -3 and 3.

bandwidth

a single bandwidth, a vector or matrix of bandwidths, or a character string naming a selection method for the local polynomial fit. A vector of bandwidths must have one entry for each covariate of X_-(ind_test), that is, for each covariate not being tested. If "CV", leave-one-out cross-validation with the minimum-MSE criterion selects a single bandwidth that is used for all dimensions of X_-(ind_test); if "GCV", generalized cross-validation (GCV) selects a single bandwidth that is used for all dimensions of X_-(ind_test); if "CV2", leave-one-out cross-validation is performed for each covariate of X_-(ind_test); and if "GCV2", GCV is performed for each covariate of X_-(ind_test). A matrix of bandwidths (not to be confused with a bandwidth matrix H) can also be given, where each row is a vector with one entry per column of X_-(ind_test), representing a bandwidth that changes with the location of estimation for multidimensional X. See localpoly.reg.
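The "CV" option can be pictured with a small base R sketch of leave-one-out cross-validation for one covariate. It is not the package's implementation: the Gaussian kernel, the local constant fit, the candidate grid from 0.2 to 2, and the names loo_mse, grid and h_cv are all assumptions made for illustration.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x^3 + rnorm(n)

# Leave-one-out MSE of a local constant (kernel-weighted average) fit
# with a Gaussian kernel and bandwidth h (illustrative choices)
loo_mse <- function(h) {
  W <- dnorm(outer(x, x, "-") / h)  # n x n matrix of kernel weights
  diag(W) <- 0                      # exclude observation i from its own fit
  fit <- as.vector(W %*% y) / rowSums(W)
  mean((y - fit)^2)
}

grid <- seq(0.2, 2, length.out = 30)  # 30 candidate bandwidths (assumed range)
cv   <- sapply(grid, loo_mse)
h_cv <- grid[which.min(cv)]           # bandwidth with the smallest leave-one-out MSE
```

The number of candidates on the grid plays the role that gridsize plays in the search.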

gridsize

number of candidate bandwidths to be searched in cross-validation. If set to 0, gridsize is taken to be 5+as.integer(100/d^3). It is ignored if cross-validation is not performed.
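To see how quickly the automatic grid shrinks, the formula can be evaluated for a few values of d. The sketch below assumes that d is the number of covariates entering the local polynomial fit, which this page does not state explicitly; the name gs is hypothetical.

```r
# Evaluate 5 + as.integer(100 / d^3) for d = 1, ..., 5
# (d is assumed to be the number of covariates in the local fit)
d  <- 1:5
gs <- data.frame(d = d, gridsize = 5 + as.integer(100 / d^3))
print(gs)
```

Under this reading, one covariate yields 105 candidate bandwidths while five covariates yield only 5.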

dim.red

vector whose first element selects the dimension reduction method, 1 for Sliced Inverse Regression (SIR) or 2 for Supervised Principal Components (SPC), and whose second element gives the number of slices (if SIR) or the number of principal components (if SPC). If 0, no dimension reduction is performed. See Details.

Details

To test the significance of a single covariate, say X_j, assume that its observations X_ij, i = 1,...,n, define the factor levels of a one-way ANOVA. To construct the ANOVA, each of these factor levels is augmented by including residuals from nearby covariate values. Specifically, cell i is augmented by the residuals corresponding to the observations X_kj for k in W_i, where W_i is the set of indices that defines the neighborhood of X_ij and has size p. These residuals are obtained from a local polynomial fit on the remaining covariates X_-(j). The test for the significance of X_j is then the test for no factor effects in this high-dimensional one-way ANOVA. See References for further details.
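The cell construction can be sketched in base R. This is only an illustration under stated assumptions: the residuals are random stand-ins rather than output of a local polynomial fit, p is taken to be odd, windows are built from the ordering of X_j, and windows near the two ends are shifted inward to keep size p. The names xj, res, cells, MST and MSE are hypothetical, and the last two lines compute only the classical between-cell and within-cell mean squares, not the package's test statistic or its null distribution.

```r
set.seed(1)
n <- 20
p <- 5                    # window size (assumed odd)
xj  <- rnorm(n)           # covariate under test
res <- rnorm(n)           # stand-in for residuals from the fit on the other covariates

ord  <- order(xj)         # observation indices sorted by X_j
half <- (p - 1) / 2
# For sorted position k, use p consecutive sorted positions centred at k,
# shifted inward at the two ends so that every window has size p
start <- pmin(pmax(seq_len(n) - half, 1), n - p + 1)
cells <- t(sapply(start, function(s) res[ord[s:(s + p - 1)]]))  # n x p

# Classical one-way ANOVA ingredients on the augmented cells
cell_means <- rowMeans(cells)
MST <- p * sum((cell_means - mean(cells))^2) / (n - 1)
MSE <- sum((cells - cell_means)^2) / (n * (p - 1))
```

Each row of cells is one augmented factor level: the residual of the observation itself plus those of its p - 1 nearest neighbors in the ordering of X_j.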

When testing the significance of a group of covariates, the window W_i is defined using the first supervised principal component (SPC) of the covariates in that group, and the local polynomial fit uses the remaining covariates X_-(ind_test).

Dimension reduction (SIR or SPC) is applied to the remaining covariates X_-(ind_test), which are used in the local polynomial fit. This reduction moderates the effect of the curse of dimensionality when fitting a nonparametric regression on several covariates. For SPC, the supervision works as follows: only covariates with a p-value below 0.3 (from the univariate "npmodelcheck" test with Y) can be selected to compose the principal components. If no covariate has a p-value below 0.3, the most significant covariate is the only component. For SIR, the size of the effective dimension reduction space is selected automatically through sequential testing (see References for details).
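The SPC screening rule can be sketched as follows. The p-values here are made up for illustration rather than computed by npmodelcheck, and the use of prcomp with centering and scaling, the choice of two components, and the names pvals, keep and comps are assumptions, not a description of the package internals.

```r
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 4), n, 4)
pvals <- c(0.02, 0.45, 0.10, 0.80)  # hypothetical univariate p-values, one per covariate
n_comp <- 2                         # requested number of principal components

# Keep covariates with p-value below 0.3; if none qualifies,
# fall back to the single most significant covariate
keep <- which(pvals < 0.3)
if (length(keep) == 0) keep <- which.min(pvals)

if (length(keep) == 1) {
  comps <- X[, keep, drop = FALSE]  # a single covariate is the only component
} else {
  pc <- prcomp(X[, keep, drop = FALSE], center = TRUE, scale. = TRUE)
  comps <- pc$x[, seq_len(min(n_comp, length(keep))), drop = FALSE]
}
```

With these p-values, covariates 1 and 3 pass the screen and comps holds their first two principal components.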

Value

bandwidth

bandwidth used for the local polynomial fit

predicted

vector of predicted values from the local polynomial fit on the remaining covariates

p-value

p-value of the test

Author(s)

Adriano Zanin Zambom <adriano.zambom@gmail.com>

References

Zambom, A. Z. and Akritas, M. G. (2012a). Nonparametric Model Checking and Variable Selection. Statistica Sinica, 24, 1837.

Zambom, A. Z. and Akritas, M. G. (2012b). Significance Testing and Group Variable Selection. Journal of Multivariate Analysis, 133, 51.

Li, K. C. (1991). Sliced Inverse Regression for Dimension Reduction. Journal of the American Statistical Association, 86, 316-327.

Bair E., Hastie T., Paul D. and Tibshirani R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119-137.

See Also

localpoly.reg, npvarselec

Examples

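# Simulate 100 observations of 5 independent standard normal covariates
X = matrix(rnorm(100 * 5), 100, 5)

# Only the third covariate affects the response
Y = X[,3]^3 + rnorm(100)

# Test covariate 2 alone: GCV bandwidth, no dimension reduction
npmodelcheck(X, Y, 2, p = 9, degree.pol = 0, kernel.type = "trun.normal", 
bandwidth = "GCV",  dim.red = 0)

# Test covariate 3 alone: CV bandwidth, SPC with 2 principal components
npmodelcheck(X, Y, 3, p = 7, degree.pol = 0, kernel.type = "trun.normal", 
bandwidth = "CV",  dim.red = c(2,2))

# Test the group of covariates 1 and 2: SIR with 10 slices
npmodelcheck(X, Y, c(1,2), p = 11, degree.pol = 0, kernel.type = "box", 
bandwidth = "CV",  dim.red = c(1,10))

# Test the group of covariates 3 and 4: SIR with 20 slices
npmodelcheck(X, Y, c(3,4), p = 5, degree.pol = 0, kernel.type = "box", 
bandwidth = "CV",  dim.red = c(1,20))

# Single covariate and a response generated independently of it,
# with a local linear fit (degree.pol = 1)
npmodelcheck(rnorm(100), rnorm(100), 1, p = 5, degree.pol = 1, 
kernel.type = "box", bandwidth = "CV",  dim.red = c(1,20))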