# fit.pfc: Fit a high-dimensional principal fitted components model... In abundant: High-Dimensional Principal Fitted Components and Abundant Regression

## Description

Let (x_1, y_1), …, (x_n, y_n) denote the n measurements of the predictor and response, where x_i\in R^p and y_i \in R. The model assumes that these measurements are a realization of n independent copies of the random vector (X,Y)', where

X = μ_X + Γ β\{f(Y) - μ_f\}+ ε,

μ_X\in R^p; Γ\in R^{p\times d} with rank d; β \in R^{d\times r} with rank d; f: R \rightarrow R^r is a known vector valued function; μ_f = E\{f(Y)\}; ε \sim N_p(0, Δ); and Y is independent of ε. The central subspace is Δ^{-1} {\rm span}(Γ).

This function computes estimates of these model parameters by imposing constraints for identifiability. The mean parameters μ_X and μ_f are estimated with \bar x = n^{-1}∑_{i=1}^n x_i and \bar f = n^{-1} ∑_{i=1}^n f(y_i). Let \widehatΦ = n^{-1}∑_{i=1}^{n} \{f(y_i) - \bar f\}\{f(y_i) - \bar f\}', which we require to be positive definite. Given a user-specified weight matrix \widehat W, let

(\widehatΓ, \widehatβ) = \arg\min_{G\in R^{p\times d}, B \in R^{d\times r}} ∑_{i=1}^n [x_i - \bar x - GB\{f(y_i) - \bar f\}]'\widehat W [x_i - \bar x - GB\{f(y_i) - \bar f\}],

subject to the constraints that G'\widehat W G is diagonal and B \widehatΦ B' = I. The sufficient reduction estimate \widehat R: R^p \rightarrow R^d is defined by

\widehat R(x) = (\widehatΓ'\widehat W \widehatΓ)^{-1} \widehatΓ' \widehat W(x - \bar x).

## Usage

 1 2 3 fit.pfc(X, y, r=4, d=NULL, F.user=NULL, weight.type=c("sample", "diag", "L1"), lam.vec=NULL, kfold=5, silent=TRUE, qrtol=1e-10, cov.tol=1e-4, cov.maxit=1e3, NPERM=1e3, level=0.01) 

## Arguments

 X The predictor matrix with n rows and p columns. The ith row is x_i defined above. y The vector of measured responses with n entries. The ith entry is y_i defined above. r When polynomial basis functions are used (which is the case when F.user=NULL), r is the polynomial order, i.e, f(y) = (y, y^2, …, y^r)'. The default is r=4. This argument is not used when F.user is specified. d The dimension of the central subspace defined above. This must be specified by the user when weight.type="L1". If unspecified by the user this function will use the sequential permutation testing procedure, described in Section 8.2 of Cook, Forzani, and Rothman (2012), to select d. F.user A matrix with n rows and r columns, where the ith row is f(y_i) defined above. This argument is optional, and will typically be used when polynomial basis functions are not desired. weight.type The type of weight matrix estimate \widehat W to use. Let \widehatΔ be the observed residual sample covariance matrix for the multivariate regression of X on f(Y) with n-r-1 scaling. There are three options for \widehat W: weight.type="sample" uses a Moore-Penrose generalized inverse of \widehatΔ for \widehat W, when p ≤q n-r-1 this becomes the inverse of \widehatΔ; weight.type="diag" uses the inverse of the diagonal matrix with the same diagonal as \widehatΔ for \widehat W; weight.type="L1" uses the L1-penalized inverse of \widehatΔ described in equation (5.4) of Cook, Forzani, and Rothman (2012). In this case, lam.vec and d must be specified by the user. The glasso algorithm of Friedman et al. (2008) is used through the R package glasso. lam.vec A vector of candidate tuning parameter values to use when weight.type="L1". If this vector has more than one entry, then kfold cross validation will be performed to select the optimal tuning parameter value. kfold The number of folds to use in cross-validation to select the optimal tuning parameter when weight.type="L1". Only used if lam.vec has more than one entry. silent Logical. When silent=FALSE, progress updates are printed. qrtol The tolerance for calls to qr.solve(). cov.tol The convergence tolerance for the QUIC algorithm used when weight.type="L1". cov.maxit The maximum number of iterations allowed for the QUIC algorithm used when weight.type="L1". NPERM The number of permutations to used in the sequential permutation testing procedure to select d. Only used when d is unspecified. level The significance level to use to terminate the sequential permutation testing procedure to select d.

## Value

A list with

 Gamhat this is \widehatΓ described above. bhat this is \widehatβ described above. Rmat this is \widehat W\widehatΓ(\widehatΓ'\widehat W \widehatΓ)^{-1}. What this is \widehat W described above. d this is d described above. r this is r described above. GWG this is \widehatΓ'\widehat W \widehatΓ fc a matrix with n rows and r columns where the ith row is f(y_i) - \bar f. Xc a matrix with n rows and p columns where the ith row is x_i - \bar x. y the vector of n response measurements. mx this is \bar x described above. mf this is \bar f described above. best.lam this is selected tuning parameter value used when weight.type="L1", will be NULL otherwise. lam.vec this is the vector of candidate tuning parameter values used when weight.type="L1", will be NULL otherwise. err.vec this is the vector of validation errors from cross validation, one error for each entry in lam.vec. Will be NULL unless weight.type="L1" and lam.vec has more than one entry. test.info a dataframe that summarizes the results from the sequential testing procedure. Will be NULL unless d is unspecified.

## References

Cook, R. D., Forzani, L., and Rothman, A. J. (2012). Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. Annals of Statistics 40(1), 353-384.

Friedman, J., Hastie, T., and Tibshirani R. (2008). Sparse inverse covariance estimation with the lasso. Biostatistics 9(3), 432-441.

pred.response

## Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 set.seed(1) n=20 p=30 d=2 y=sqrt(12)*runif(n) Gam=matrix(rnorm(p*d), nrow=p, ncol=d) beta=diag(2) E=matrix(0.5*rnorm(n*p), nrow=n, ncol=p) V=matrix(c(1, sqrt(12), sqrt(12), 12.8), nrow=2, ncol=2) tmp=eigen(V, symmetric=TRUE) V.msqrt=tcrossprod(tmp$vec*rep(tmp$val^(-0.5), each=2), tmp$vec) Fyc=cbind(y-sqrt(3),y^2-4)%*%V.msqrt X=0+Fyc%*%t(beta)%*%t(Gam) + E fit=fit.pfc(X=X, y=y, r=3, weight.type="sample") ## display hypothesis testing information for selecting d fit$test.info ## make a response versus fitted values plot plot(pred.response(fit), y) 

### Example output Loading required package: QUIC
d0 test.statistic pvalue
1  0    0.109842773  0.000
2  1    0.001617754  0.003
3  2    0.001565799  0.406


abundant documentation built on Jan. 4, 2022, 5:08 p.m.