```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
tsmvr, or Truly Sparse Multivariate Regression, is an R package for solving sparse multivariate regression problems with error covariance estimation. The workhorse algorithm in tsmvr is adapted from the algorithm described by J. Chen and Q. Gu in their 2016 paper "High Dimensional Multivariate Regression and Precision Matrix Estimation via Nonconvex Optimization".
A multivariate regression problem is a regression problem with multiple responses. Formally,
$$Y = XB + E$$
Here, $X$ is the $n \times p$ design matrix of $n$ observations of $p$ features, $Y$ is the $n \times q$ matrix of $n$ observations of $q$ responses, $B$ is the $p \times q$ regression matrix, and $E$ is the error term. Given $X$ and $Y$, tsmvr solves this problem for $B$ under the constraint that $B$ is sparse and the condition that the errors may be correlated. Under the hood, the error correlations are encoded in the precision matrix $\mathbf{\Omega}$, which has its own sparsity constraint.
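For orientation, the underlying estimation problem has, up to constants, the form of a sparsity-constrained Gaussian negative log-likelihood. The display below is a sketch of that standard formulation; see Chen and Gu (2016) for the precise statement:

$$
\min_{B,\,\mathbf{\Omega}}\;\frac{1}{n}\,\operatorname{tr}\!\left[(Y - XB)\,\mathbf{\Omega}\,(Y - XB)^{\top}\right] - \log\det(\mathbf{\Omega})
\quad \text{subject to} \quad \lVert B \rVert_0 \le s_1,\;\; \lVert \mathbf{\Omega} \rVert_0 \le s_2
$$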
For a first example, define some problem parameters.
```r
n <- 1000        # number of observations
p <- 100         # number of predictors
q <- 10          # number of responses
sparsity <- 0.1  # sparsity of the true regression matrix
s1 <- round(p * q * sparsity * 1.1)  # fitted sparsity is set a little larger than the true sparsity
s2 <- 3 * q - 4  # constrains the precision matrix to roughly the number of nonzero entries of a tri-diagonal matrix
```
The following code generates a synthetic dataset with a true regression matrix of sparsity 0.1.
```r
set.seed(1729)
data <- tsmvr::make_data(
  n = n,
  p = p,
  q = q,
  b1 = sqrt(sparsity),
  b2 = sqrt(sparsity)
)[[1]]
```
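A quick sanity check on the returned object can be useful. Only the \code{X} and \code{Y} components are used in this vignette; \code{str} shows whatever else \code{make_data} returns:

```r
# Inspect the synthetic dataset.
str(data, max.level = 1)
dim(data$X)  # n x p = 1000 x 100
dim(data$Y)  # n x q = 1000 x 10
```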
Important: the data in the design matrix generated in the code chunk above have mean zero (and standard deviation one). tsmvr assumes all data have zero mean, so it is important to center your data before running the algorithm.
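For user-supplied data, base R's \code{scale} handles the centering. A minimal sketch, where \code{X_raw} and \code{Y_raw} are placeholder names standing in for your own matrices:

```r
# Center each column to mean zero (and scale to unit standard deviation).
# X_raw and Y_raw are placeholders for user-supplied data matrices.
X <- scale(X_raw, center = TRUE, scale = TRUE)
Y <- scale(Y_raw, center = TRUE, scale = TRUE)
```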
The function \code{tsmvr_solve} solves the regression problem using hard-thresholded block-wise alternating gradient descent with fixed learning rates.
```r
library(tsmvr)
gd_gd_solution <- tsmvr_solve(
  X = data$X,
  Y = data$Y,
  s1 = s1,
  s2 = s2,
  B_type = "gd",
  Omega_type = "gd",
  eta1 = 0.05,
  eta2 = 0.1,
  skip = 50,
  max_iter = 10
)
```
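To make the hard-thresholding idea concrete, the toy function below keeps the $s$ largest-magnitude entries of a matrix and zeroes the rest. This is an illustrative sketch of the projection step, not the package's internal routine:

```r
# Keep the s largest-magnitude entries of M; zero everything else.
# (Ties at the cutoff may keep slightly more than s entries.)
hard_threshold <- function(M, s) {
  cutoff <- sort(abs(M), decreasing = TRUE)[s]
  M * (abs(M) >= cutoff)
}
hard_threshold(matrix(c(3, -1, 0.5, 2), nrow = 2), s = 2)
```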
Sometimes the fixed learning rate method is cumbersome because the best learning rates must be found by trial and error. In that case, tsmvr can find the learning rates for the user with a generalized line-search procedure. The parameters \code{eta1} and \code{eta2} then become the initial learning rates for the line search. The cost of not having to choose the learning rates is that the algorithm runs more slowly.
```r
library(tsmvr)
ls_ls_solution <- tsmvr_solve(
  X = data$X,
  Y = data$Y,
  s1 = s1,
  s2 = s2,
  B_type = "ls",
  Omega_type = "ls",
  eta1 = 0.05,
  eta2 = 0.1,
  skip = 50
)
```
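For intuition, a backtracking line search works roughly as below: start from an initial step size and shrink it until the objective decreases. This is a generic sketch, not tsmvr's internal procedure:

```r
# Shrink the step size eta until f decreases along the descent direction.
backtrack <- function(f, x, grad, eta0 = 0.1, beta = 0.5, max_tries = 30) {
  eta <- eta0
  for (i in seq_len(max_tries)) {
    if (f(x - eta * grad) < f(x)) return(eta)
    eta <- eta * beta
  }
  eta
}
backtrack(function(x) x^2, x = 1, grad = 2)  # step size for a 1-D quadratic
```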
k-fold cross-validation may be performed using the function \code{tsmvr_cv}.
```r
set.seed(1)
validated <- tsmvr::tsmvr_cv(
  X = data$X,
  Y = data$Y,
  s1 = s1,
  s2 = s2,
  k = 3,
  B_type = "ls",
  Omega_type = "ls",
  eta1 = 0.05,
  eta2 = 0.2
)
```
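For readers new to k-fold cross-validation, the splitting logic typically looks like the sketch below. It is illustrative only and not necessarily how \code{tsmvr_cv} partitions the data internally:

```r
# Randomly assign each of the n observations to one of k folds,
# then hold out one fold at a time.
k <- 3
folds <- sample(rep(seq_len(k), length.out = n))
for (fold in seq_len(k)) {
  train <- folds != fold
  # fit on data$X[train, ] and data$Y[train, ];
  # evaluate the fit on the held-out rows data$X[!train, ]
}
```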
Similarly, replicated k-fold cross-validation may be performed using the function \code{tsmvr_replicate}. Be warned: the code chunk below will take some time to run.
```r
set.seed(3)
replicated <- tsmvr::tsmvr_replicate(
  X = data$X,
  Y = data$Y,
  s1 = s1,
  s2 = s2,
  k = 2,
  rep = 2,
  B_type = "ls",
  Omega_type = "ls",
  eta1 = 0.05,
  eta2 = 0.1
)
```
Finally, replicated k-fold cross-validation may be used to search a grid of \code{s1} and \code{s2} values for the pair that minimizes the cross-validation error. The code chunk below also takes some time to run.
```r
s1_grid <- c(80, 100, 120, 140)
s2_grid <- c(25, 26, 31, 35)
set.seed(5)
grid <- tsmvr_gridsearch(
  X = data$X,
  Y = data$Y,
  s1_grid = s1_grid,
  s2_grid = s2_grid,
  k = 2,
  reps = 3,
  B_type = "ls",
  Omega_type = "ls",
  eta1 = 0.1,
  eta2 = 0.2,
  quiet = FALSE
)
```
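Once replicated cross-validation errors are in hand for every (\code{s1}, \code{s2}) pair, model selection is an argmin over the grid. The sketch below uses a made-up error vector, since the exact accessors on the object returned by \code{tsmvr_gridsearch} may differ:

```r
# Enumerate the grid and pick the pair with the smallest CV error.
pairs <- expand.grid(s1 = s1_grid, s2 = s2_grid)
toy_errors <- runif(nrow(pairs))  # stand-in for the real replicated CV errors
pairs[which.min(toy_errors), ]    # the selected (s1, s2) pair
```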