fit.pfc: Fit a high-dimensional principal fitted components model... In abundant: High-Dimensional Principal Fitted Components and Abundant Regression

Description

Let (x_1, y_1), …, (x_n, y_n) denote the n measurements of the predictor and response, where x_i\in R^p and y_i \in R. The model assumes that these measurements are a realization of n independent copies of the random vector (X,Y)', where

X = μ_X + Γ β\{f(Y) - μ_f\}+ ε,

μ_X\in R^p; Γ\in R^{p\times d} with rank d; β \in R^{d\times r} with rank d; f: R \rightarrow R^r is a known vector valued function; μ_f = E\{f(Y)\}; ε \sim N_p(0, Δ); and Y is independent of ε. The central subspace is Δ^{-1} {\rm span}(Γ).

This function computes estimates of these model parameters by imposing constraints for identifiability. The mean parameters μ_X and μ_f are estimated with \bar x = n^{-1}∑_{i=1}^n x_i and \bar f = n^{-1} ∑_{i=1}^n f(y_i). Let \widehatΦ = n^{-1}∑_{i=1}^{n} \{f(y_i) - \bar f\}\{f(y_i) - \bar f\}', which we require to be positive definite. Given a user-specified weight matrix \widehat W, let

(\widehatΓ, \widehatβ) = \arg\min_{G\in R^{p\times d}, B \in R^{d\times r}} ∑_{i=1}^n [x_i - \bar x - GB\{f(y_i) - \bar f\}]'\widehat W [x_i - \bar x - GB\{f(y_i) - \bar f\}],

subject to the constraints that G'\widehat W G is diagonal and B \widehatΦ B' = I. The sufficient reduction estimate \widehat R: R^p \rightarrow R^d is defined by

\widehat R(x) = (\widehatΓ'\widehat W \widehatΓ)^{-1} \widehatΓ' \widehat W(x - \bar x).

Usage

 1 2 3 fit.pfc(X, y, r=4, d=NULL, F.user=NULL, weight.type=c("sample", "diag", "L1"), lam.vec=NULL, kfold=5, silent=TRUE, qrtol=1e-10, cov.tol=1e-4, cov.maxit=1e3, NPERM=1e3, level=0.01) 

Arguments

 X The predictor matrix with n rows and p columns. The ith row is x_i defined above. y The vector of measured responses with n entries. The ith entry is y_i defined above. r When polynomial basis functions are used (which is the case when F.user=NULL), r is the polynomial order, i.e, f(y) = (y, y^2, …, y^r)'. The default is r=4. This argument is not used when F.user is specified. d The dimension of the central subspace defined above. This must be specified by the user when weight.type="L1". If unspecified by the user this function will use the sequential permutation testing procedure, described in Section 8.2 of Cook, Forzani, and Rothman (2012), to select d. F.user A matrix with n rows and r columns, where the ith row is f(y_i) defined above. This argument is optional, and will typically be used when polynomial basis functions are not desired. weight.type The type of weight matrix estimate \widehat W to use. Let \widehatΔ be the observed residual sample covariance matrix for the multivariate regression of X on f(Y) with n-r-1 scaling. There are three options for \widehat W: weight.type="sample" uses a Moore-Penrose generalized inverse of \widehatΔ for \widehat W, when p ≤q n-r-1 this becomes the inverse of \widehatΔ; weight.type="diag" uses the inverse of the diagonal matrix with the same diagonal as \widehatΔ for \widehat W; weight.type="L1" uses the L1-penalized inverse of \widehatΔ described in equation (5.4) of Cook, Forzani, and Rothman (2012). In this case, lam.vec and d must be specified by the user. The QUIC algorithm of Hsieh et al. (2011) is used through the R package QUIC. lam.vec A vector of candidate tuning parameter values to use when weight.type="L1". If this vector has more than one entry, then kfold cross validation will be performed to select the optimal tuning parameter value. kfold The number of folds to use in cross-validation to select the optimal tuning parameter when weight.type="L1". Only used if lam.vec has more than one entry. silent Logical. When silent=FALSE, progress updates are printed. qrtol The tolerance for calls to qr.solve(). cov.tol The convergence tolerance for the QUIC algorithm used when weight.type="L1". cov.maxit The maximum number of iterations allowed for the QUIC algorithm used when weight.type="L1". NPERM The number of permutations to used in the sequential permutation testing procedure to select d. Only used when d is unspecified. level The significance level to use to terminate the sequential permutation testing procedure to select d.

Value

A list with

 Gamhat this is \widehatΓ described above. bhat this is \widehatβ described above. Rmat this is \widehat W\widehatΓ(\widehatΓ'\widehat W \widehatΓ)^{-1}. What this is \widehat W described above. d this is d described above. r this is r described above. GWG this is \widehatΓ'\widehat W \widehatΓ fc a matrix with n rows and r columns where the ith row is f(y_i) - \bar f. Xc a matrix with n rows and p columns where the ith row is x_i - \bar x. y the vector of n response measurements. mx this is \bar x described above. mf this is \bar f described above. best.lam this is selected tuning parameter value used when weight.type="L1", will be NULL otherwise. lam.vec this is the vector of candidate tuning parameter values used when weight.type="L1", will be NULL otherwise. err.vec this is the vector of validation errors from cross validation, one error for each entry in lam.vec. Will be NULL unless weight.type="L1" and lam.vec has more than one entry. test.info a dataframe that summarizes the results from the sequential testing procedure. Will be NULL unless d is unspecified.

References

Cook, R. D., Forzani, L., and Rothman, A. J. (2012). Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. Annals of Statistics 40(1), 353-384.

Cho-Jui Hsieh, Matyas A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar (2011). Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems 24, 2011, p. 2330–2338.

pred.response

Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 set.seed(1) n=20 p=30 d=2 y=sqrt(12)*runif(n) Gam=matrix(rnorm(p*d), nrow=p, ncol=d) beta=diag(2) E=matrix(0.5*rnorm(n*p), nrow=n, ncol=p) V=matrix(c(1, sqrt(12), sqrt(12), 12.8), nrow=2, ncol=2) tmp=eigen(V, symmetric=TRUE) V.msqrt=tcrossprod(tmp$vec*rep(tmp$val^(-0.5), each=2), tmp$vec) Fyc=cbind(y-sqrt(3),y^2-4)%*%V.msqrt X=0+Fyc%*%t(beta)%*%t(Gam) + E fit=fit.pfc(X=X, y=y, r=3, weight.type="sample") ## display hypothesis testing information for selecting d fit$test.info ## make a response versus fitted values plot plot(pred.response(fit), y) 

Example output

Loading required package: QUIC
d0 test.statistic pvalue
1  0    0.109842773  0.000
2  1    0.001617754  0.003
3  2    0.001565799  0.406


abundant documentation built on May 2, 2019, 2:08 a.m.