knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The KPR
package provides estimation and inference methods for kernel-penalized regression models, designed for doubly-structured high-dimensional data. Kernel-penalized regression is an extension of
ridge regression that allows for the inclusion of external sample-wise and
parameter-wise structure encoded with positive definite similarity matrices
(kernels). For example, a researcher analyzing human microbiome data
may want a regression model that incorporates UniFrac distances between samples (and/or
phylogenetic distances between microbes). KPR
provides both
estimation and inference for individual variables in a penalized
regression model that accounts for this type of prior structure among
samples and variables. For further reading on the theory and application of kernel-penalized regression models, see Randolph et al. (2018).
Given a data matrix $X \in \mathbb{R}^{n \times p}$, a continuous outcome $Y \in \mathbb{R}^n$, and sample- and variable-wise similarity kernels $H \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{p \times p}$, the kernel-penalized regression model seeks a vector $\hat{\beta} \in \mathbb{R}^p$ such that
$$\hat{\beta} = \mathop{\mathrm{arg\,min}}_{\beta} \left\{\Vert Y - X\beta \Vert^2_{H} + \lambda\Vert \beta \Vert^2_{Q^{-1}}\right\}$$
where $||v||^2_A = v^TAv$ for a positive definite matrix $A$. The tuning parameter $\lambda > 0$ determines the strength of regularization.
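Although the package handles estimation internally, the minimizer above has a closed form analogous to generalized ridge regression, $\hat{\beta} = (X^T H X + \lambda Q^{-1})^{-1} X^T H Y$. The sketch below computes it directly on simulated data with identity kernels and a hand-picked $\lambda$, purely for intuition; it is not how the package selects $\lambda$.
# For intuition only: the generalized-ridge closed form on simulated data,
# with identity kernels and a hand-picked lambda (not the REML-selected value).
set.seed(1)
n <- 30; p <- 10
X.sim <- matrix(rnorm(n * p), n, p)
Y.sim <- rnorm(n)
H.sim <- diag(n) # identity sample-wise kernel
Q.sim <- diag(p) # identity variable-wise kernel (reduces to ordinary ridge)
lambda <- 1
beta.hat <- solve(t(X.sim) %*% H.sim %*% X.sim + lambda * solve(Q.sim),
                  t(X.sim) %*% H.sim %*% Y.sim)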
The KPR
package also provides support for adaptive kernel-penalized regression models, which can include multiple sample- and variable-wise kernels in the fitting procedure. Given data $X \in \mathbb{R}^{n \times p}$, outcome $Y \in \mathbb{R}^n$, sample-wise kernels $H_1, H_2, \ldots, H_h \in \mathbb{R}^{n \times n}$, and variable-wise kernels $Q_1, Q_2, \ldots, Q_q \in \mathbb{R}^{p \times p}$, adaptive KPR seeks $\hat{\beta} \in \mathbb{R}^p$ with
$$\hat{\beta} = \mathop{\mathrm{arg\,min}}_{\beta} \left\{\Vert Y - X\beta \Vert^2_{\sum\sigma_iH_i} + \lambda\Vert \beta \Vert^2_{\sum\alpha_jQ_j^{-1}}\right\}$$
where $\sum\sigma_i = \sum\alpha_j = 1$. The parameters $\sigma_i$ and $\alpha_j$ represent numerical weights assigned to each kernel. This provides a useful interpretation: a weight close to one indicates that the corresponding kernel is highly informative for model estimation, and a weight near zero suggests that the kernel could be left out of the model altogether. KPR
estimates the optimal set of weights by maximum likelihood.
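Concretely, the combined sample-wise kernel in the objective above is just the weighted sum $\sum\sigma_i H_i$. A small illustration with made-up weights (not values estimated by KPR) is:
# Illustration only: a convex combination of two sample-wise kernels with
# hypothetical weights (in practice KPR estimates these by maximum likelihood).
n <- 20
H1 <- diag(n)                                        # identity kernel
H2 <- tcrossprod(matrix(rnorm(n * 5), n, 5)) + diag(n) # another positive definite kernel
sigma <- c(0.7, 0.3)                                 # hypothetical weights summing to one
H.combined <- sigma[1] * H1 + sigma[2] * H2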
Currently, the only way to install the package is from GitHub.
install.packages("devtools")
devtools::install_github("pknight24/KPR")
To demonstrate the KPR model, we use data from the study by Yatsunenko et al. (2012), which is included in the package. The
yatsunenko
object is a list containing raw microbe abundance data,
patristic distances between genera, and the country of origin and age of each subject. We will fit
a kernel-penalized regression model with the microbial abundances as
the penalized variables, and subject age as the outcome.
In order to incorporate the patristic distance matrix into the regression
model, we first need to convert it to a similarity kernel using a
function provided by the KPR
package.
library(KPR)
data(yatsunenko)
age <- yatsunenko$age
counts <- yatsunenko$raw.counts
patristic <- yatsunenko$patristic
The generateSimilarityKernel
function computes Gower's centered
similarity kernel $K$ from any distance matrix $D$,
specifically
$$K = -\frac{1}{2}JD^{(2)}J$$
where $J$ is a centering matrix, and $D^{(2)}$ denotes a matrix of squared distances.
Additionally, if the resulting similarity kernel has any negative or very small eigenvalues, generateSimilarityKernel
replaces them with half of the smallest sufficiently large positive eigenvalue. This forces the kernel to behave as a positive definite matrix.
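For readers who want to see the computation spelled out, here is a minimal sketch of the same idea (an illustrative helper, not the exact implementation used by the package):
# Minimal sketch of Gower's centered similarity kernel (illustrative helper,
# not the exact code used by generateSimilarityKernel).
gowerKernel <- function(D) {
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  K <- -0.5 * J %*% (D^2) %*% J        # K = -1/2 J D^(2) J
  eig <- eigen(K, symmetric = TRUE)
  floor.val <- min(eig$values[eig$values > 1e-10]) / 2 # half the smallest "large enough" eigenvalue
  vals <- pmax(eig$values, floor.val)  # clamp negative / near-zero eigenvalues
  eig$vectors %*% diag(vals) %*% t(eig$vectors)
}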
Q <- generateSimilarityKernel(patristic)
The KPR
package also provides a function (aitchisonVariation
) to generate Aitchison
variation matrices for compositional data, as used in Randolph et al. (2018).
Finally, we will take the centered log ratio of the count data to account for sample quality, as well as center the age vector. Kernel-penalized regression models are sensitive to scale, and thus the data should be processed with this in mind before beginning the estimation procedure.
counts.clr <- log(counts + 1) - apply(log(counts + 1), 1, mean)
X <- apply(counts.clr, 2, function(x) x - mean(x)) # column-center the centered log ratio counts
Y <- log(age) - mean(log(age))
Kernel-penalized regression models can be fit using the KPR
function.
kpr.out <- KPR(X = X, Y = Y, Q = Q)
The KPR
function returns an object of class KPR
, which includes
all of the data used to fit the model, a beta.hat
vector of estimated coefficients, and the optimal tuning parameter $\lambda$, which is computed with restricted maximum likelihood. We can also take advantage of the adaptive capabilities of the software by providing an additional identity-matrix kernel to the model. This ridge-like penalty can improve stability and accuracy when the other kernels are poorly conditioned or scientifically uninformative. To do so, we pass the two kernels to the KPR()
function as a list, shown below.
I <- diag(ncol(X))
kpr.out.2 <- KPR(X = X, Y = Y, Q = list(Q, I))
We can access the weights assigned to each kernel using the alpha
field.
kpr.out.2$alpha
The output of KPR
can be passed to the inference
function to detect statistically significant variables, testing the hypothesis
$$\textrm{H}_0: \beta_j = 0, \quad \textrm{H}_{\textrm{a}}: \beta_j \neq 0$$
for $j = 1, 2, \ldots, p$.
kpr.out <- inference(kpr.out)
This new kpr.out
object contains a vector p.values
of p-values generated with the GMD inference procedure (Wang).
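These p-values can be handled like any other vector of per-variable p-values; for example (a usage sketch, assuming kpr.out$p.values holds one p-value per column of X), they can be adjusted for multiple testing with base R:
# Usage sketch: Benjamini-Hochberg adjustment of the per-variable p-values
# (assumes kpr.out$p.values contains one p-value per column of X).
p.adj <- p.adjust(kpr.out$p.values, method = "BH")
which(p.adj < 0.05) # indices of variables significant at FDR 0.05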
The KPR package also provides a GMD.biplot
function (see Wang et al.) for a qualitative analysis of two-way structured data.
GMD.biplot(X, Y, Q = Q)