knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The KPR
package provides estimation and inference methods for kernel-penalized regression models, designed for doubly-structured high-dimensional data. Kernel-penalized regression is an extension of
ridge regression that allows for the inclusion of external sample-wise and
parameter-wise structure encoded with positive definite similarity matrices
(kernels). For example, a researcher analyzing human microbiome data
may want a regression model that incorporates UniFrac distances between samples (and/or
phylogenetic distances between microbes). KPR
provides both
estimation and inference for individual variables in a penalized
regression model that accounts for this type of prior structure among
samples and variables. For further reading on the theory and application of kernel-penalized regression models, see Randolph et al. (2018).
Given a data matrix $X \in \mathbb{R}^{n \times p}$, a continuous outcome $Y \in \mathbb{R}^n$, and sample- and variable-wise similarity kernels $H \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{p \times p}$, the kernel-penalized regression model seeks a vector $\hat{\beta} \in \mathbb{R}^p$ such that
$$\hat{\beta} = \mathop{\mathrm{arg\,min}}_{\beta} \left\{\Vert Y - X\beta \Vert^2_{H} + \lambda\Vert \beta \Vert^2_{Q^{-1}}\right\}$$
where $||v||^2_A = v^TAv$ for a positive definite matrix $A$. The tuning parameter $\lambda > 0$ determines the strength of regularization.
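Although the package handles estimation internally, the minimizer above has a closed form analogous to generalized ridge regression, $\hat{\beta} = (X^T H X + \lambda Q^{-1})^{-1} X^T H Y$. The sketch below computes it directly on simulated data with identity kernels and a hand-picked $\lambda$, purely for intuition; it is not how the package selects $\lambda$.
# For intuition only: the generalized-ridge closed form on simulated data,
# with identity kernels and a hand-picked lambda (not the REML-selected value).
set.seed(1)
n <- 30; p <- 10
X.sim <- matrix(rnorm(n * p), n, p)
Y.sim <- rnorm(n)
H.sim <- diag(n) # identity sample-wise kernel
Q.sim <- diag(p) # identity variable-wise kernel (reduces to ordinary ridge)
lambda <- 1
beta.hat <- solve(t(X.sim) %*% H.sim %*% X.sim + lambda * solve(Q.sim),
                  t(X.sim) %*% H.sim %*% Y.sim)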
The KPR
package also provides support for adaptive kernel-penalized regression models, which can include multiple sample- and variable-wise kernels in the fitting procedure. Given data $X \in \mathbb{R}^{n \times p}$, outcome $Y \in \mathbb{R}^n$, sample-wise kernels $H_1, H_2, \ldots, H_h \in \mathbb{R}^{n \times n}$, and variable-wise kernels $Q_1, Q_2, \ldots, Q_q \in \mathbb{R}^{p \times p}$, adaptive KPR seeks $\hat{\beta} \in \mathbb{R}^p$ with
$$\hat{\beta} = \mathop{\mathrm{arg\,min}}_{\beta} \left\{\Vert Y - X\beta \Vert^2_{\sum\sigma_iH_i} + \lambda\Vert \beta \Vert^2_{\sum\alpha_jQ_j^{-1}}\right\}$$
where $\sum\sigma_i = \sum\alpha_j = 1$. The parameters $\sigma_i$ and $\alpha_j$ represent numerical weights assigned to each kernel. This provides a useful interpretation: a weight close to one indicates that the corresponding kernel is highly informative for model estimation, and a weight near zero suggests that the kernel could be left out of the model altogether. KPR
estimates the optimal set of weights by maximum likelihood.
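Concretely, the combined sample-wise kernel in the objective above is just the weighted sum $\sum\sigma_i H_i$. A small illustration with made-up weights (not values estimated by KPR) is:
# Illustration only: a convex combination of two sample-wise kernels with
# hypothetical weights (in practice KPR estimates these by maximum likelihood).
n <- 20
H1 <- diag(n)                                        # identity kernel
H2 <- tcrossprod(matrix(rnorm(n * 5), n, 5)) + diag(n) # another positive definite kernel
sigma <- c(0.7, 0.3)                                 # hypothetical weights summing to one
H.combined <- sigma[1] * H1 + sigma[2] * H2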
Currently, the only way to install the package is from GitHub.
install.packages("devtools")
devtools::install_github("pknight24/KPR")
To demonstrate the KPR model, we use data from the study by Yatsunenko et al. (2012), which is included in the package. The
yatsunenko
object is a list containing raw microbe abundance data,
patristic distances between genera, and the country of origin and age of each subject. We will fit
a kernel-penalized regression model with the microbial abundances as
the penalized variables, and subject age as the outcome.
In order to incorporate the patristic distance matrix into the regression
model, we first need to convert it to a similarity kernel using a
function provided by the KPR
package.
library(KPR)
data(yatsunenko)
age <- yatsunenko$age
counts <- yatsunenko$raw.counts
patristic <- yatsunenko$patristic
The generateSimilarityKernel
function computes Gower's centered
similarity kernel $K$ from any distance matrix $D$,
specifically
$$K = -\frac{1}{2}JD^{(2)}J$$
where $J$ is a centering matrix, and $D^{(2)}$ denotes a matrix of squared distances.
Additionally, if the resulting similarity kernel has any negative or very small eigenvalues, generateSimilarityKernel
replaces them with half of the smallest sufficiently large positive eigenvalue. This forces the kernel to behave as a positive definite matrix.
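For readers who want to see the computation spelled out, here is a minimal sketch of the same idea (an illustrative helper, not the exact implementation used by the package):
# Minimal sketch of Gower's centered similarity kernel (illustrative helper,
# not the exact code used by generateSimilarityKernel).
gowerKernel <- function(D) {
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  K <- -0.5 * J %*% (D^2) %*% J        # K = -1/2 J D^(2) J
  eig <- eigen(K, symmetric = TRUE)
  floor.val <- min(eig$values[eig$values > 1e-10]) / 2 # half the smallest "large enough" eigenvalue
  vals <- pmax(eig$values, floor.val)  # clamp negative / near-zero eigenvalues
  eig$vectors %*% diag(vals) %*% t(eig$vectors)
}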
Q <- generateSimilarityKernel(patristic)
The KPR
package also provides a function (aitchisonVariation
) to generate Aitchison
variation matrices for compositional data, as used in Randolph et al. (2018).
Finally, we will take the centered log ratio of the count data to account for sample quality, as well as center the age vector. Kernel-penalized regression models are sensitive to scale, and thus the data should be processed with this in mind before beginning the estimation procedure.
counts.clr <- log(counts + 1) - apply(log(counts + 1), 1, mean)
X <- apply(counts.clr, 2, function(x) x - mean(x)) # column-center the centered log ratio counts
Y <- log(age) - mean(log(age))
Kernel-penalized regression models can be fit using the KPR
function.
kpr.out <- KPR(X = X, Y = Y, Q = Q)
The KPR
function returns an object of class KPR
, which includes
all of the data used to fit the model, a beta.hat
vector of estimated coefficients, and the optimal tuning parameter $\lambda$, which is computed with restricted maximum likelihood. We can also take advantage of the adaptive capabilities of the software by providing an additional identity-matrix kernel to the model. This ridge-like penalty can improve stability and accuracy when the other kernels are poorly conditioned or scientifically uninformative. To do so, we pass the two kernels to the KPR()
function as a list, shown below.
I <- diag(ncol(X))
kpr.out.2 <- KPR(X = X, Y = Y, Q = list(Q, I))
We can access the weights assigned to each kernel using the alpha
field.
kpr.out.2$alpha
The output of KPR
can be passed to the inference
function to detect statistically significant variables, testing the hypothesis
$$\textrm{H}_0: \beta_j = 0, \quad \textrm{H}_{\textrm{a}}: \beta_j \neq 0$$
for $j = 1, 2, \ldots, p$.
kpr.out <- inference(kpr.out)
This new kpr.out
object contains a vector p.values
of p-values generated with the GMD inference procedure (Wang).
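These p-values can be handled like any other vector of per-variable p-values; for example (a usage sketch, assuming kpr.out$p.values holds one p-value per column of X), they can be adjusted for multiple testing with base R:
# Usage sketch: Benjamini-Hochberg adjustment of the per-variable p-values
# (assumes kpr.out$p.values contains one p-value per column of X).
p.adj <- p.adjust(kpr.out$p.values, method = "BH")
which(p.adj < 0.05) # indices of variables significant at FDR 0.05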
The KPR package also provides a GMD.biplot
function (see Wang et al.) for a qualitative analysis of two-way structured data.
GMD.biplot(X, Y, Q = Q)