Getting Started with DMRnet"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This document

The purpose of this vignette is to introduce readers to DMRnet package, bring them up to speed by providing a simple use case example and point interested readers towards more comprehensive informative material.

DMRnet package

DMRnet is an R package for regression and classification with a family of model selection algorithms. DMRnet package supports both continuous and categorical predictors. The overall number of regressors may exceed the number of observations.

The selected model consists of a subset of numerical regressors and partitions of levels of factors.

The available model selection algorithms are the following:

Algorithm-wise, the package user has the following choices:

As this vignette is introductory only, and the choice of an algorithm is a somewhat more advanced topic, from now on it is assumed that the user works with DMRnet algorithm only.

OK, so let's get started

To load the package into memory execute the following line:

library(DMRnet)

DMRnet package features two built-in data sets: Promoter and Miete [Fahrmeir et al., 2004].

We will work with Miete data set in this vignette. The Miete data consists of $n = 2053$ households interviewed for the Munich rent standard 2003. The response is continuous, it is monthly rent per square meter in Euros. There are 9 categorical and 2 continuous variables.

data("miete")
X <- miete[,-1]
y <- miete$rent
head(X)

Out of 9 categorical variables 7 have 2 levels each, and the two remaining (area and rooms) have 25 and 6 levels, respectively. This sums up to 45 levels. The way DMRnet handles levels results in 36 parameters for the categorical variables (the first level in each categorical variable is already considered included in the intercept). The 2 continuous variables give 2 additional parameters, the intercept is one more, so it totals in $p = 39$.

Estimating and inspecting a sequence of models

In contrast to explicit group lasso methods which need a design matrix with explicit groups, DMRnet package internally transforms an input matrix into a design matrix (e.g. by expanding factors into a set of one-hot encoded dummy variables). Thus, X needs no additional changes and already is all set to be fed into DMRnet:

models <- DMRnet(X, y, family="gaussian")
models

Here, family="gaussian" argument was used to indicate, that we are interested in a linear regression for modeling a continuous response variable. The family="gaussian" is the default and could have been omitted as well. In a printout above you can notice a bunch of other default values set for some other parameters (e.g. nlambda=100 requesting the cardinality of a net of lambda values used) in model estimation.

The last part of a printout above presents a sequence of models estimated from X. Notice the models have varying number of parameters (model dimension df ranges from 39 for a full model down to 1 for the 39-th model).

Let us plot the path for the coefficients for the 10 smallest models as a function of their model dimension df:

plot(models, xlim=c(1, 10), lwd=2)

Notice how you can also pass other parameters (xlim=c(1, 10), lwd=2) to change the default behavior of the plot() function.

To inspect the coefficients for a particular model in detail, one can use the coef() function. For the sake of the example, let's examine the last model from the plot, with df=10:

coef(models, df=10)

Notice how area-related coefficients are all equal, effectively merging all 12 listed area levels with a coefficient -55.36 and merging all 13 remaining area levels with a coefficient 0.0. The 13 remaining area levels merged with a coefficient 0.0 were unlisted in a printout above. The reason behind it is that only non-zero coefficients get listed in the coef() function.

Selecting best models

There are two basic methods for model selection supported in DMRnet package. One is based on a $K$-fold Cross Validation (CV), the other is based on Generalized Information Criterion (GIC) [Foster and George, 1994].

Model selection with GIC

A GIC-based model selection is performed directly on a precomputed sequence of models

gic.model <- gic.DMR(models)
plot(gic.model)

One can read a model dimension with

gic.model$df.min

One also has access to the best model coefficients with

coef(gic.model)

Model selection with $K$-fold Cross Validation

The default $K$ value is 10. Because the data needs to be partitioned into CV subsets which alternate as training and validating sets, the precomputed sequence of models in models variable cannot be used directly, as was the case with GIC-based model selection. Instead, to perform a 10-fold CV-based model selection execute

cv.model <- cv.DMRnet(X, y)
plot(cv.model)

As before, one can access the best model dimension with

cv.model$df.min

In a plot above there is also a blue dot indicating a df.1se model. This is the smallest model (respective to its dimension) having its error within the boundary of one standard deviation (the blue dotted line) from the best model. This model improves interpretability without erring much more than the best model.

cv.model$df.1se

Returning to the best model, df.min, its dimension is the same as when we used GIC directly. Let's check if the CV-selected model is the same as the best model selected with GIC:

coef(cv.model)==coef(gic.model)

Predicting response for new observations

Models selected with both model selection methods can be used to predict response variable values for new observations:

predict(gic.model, newx=head(X))
predict(cv.model, newx=head(X))

No wonder the corresponding predictions are all equal, since the selected models were the same.

In models selected with CV, one can switch between df.min and df.1se models with the use of md argument, as follows:

predict(cv.model, md="df.min", newx=head(X))  # the default, best model
predict(cv.model, md="df.1se",  newx=head(X)) # the alternative df.1se model

One can predict the response variable values for the whole sequence of models as well:

predict(models, newx=head(X))

Notice how the df12 model predictions overlap with the values we obtained for the df.min model, and df3 overlaps with the values we obtained for the df.1se model.

Classification with 2 classes

For a classification task with 2 classes, the non-default family="binomial" should be provided in a call to DMRnet and cv.DMRnet (but not to gic.DMR) and the response variable should be a factor with two classes, with the last level in alphabetical order interpreted as the target class. It is illustrated with the following code with a somewhat artificial example:

binomial_y <- factor(y > mean(y))     #changing Miete response var y into a binomial factor with 2 classes
binomial_models <- DMRnet(X, binomial_y, family="binomial")
gic.binomial_model <- gic.DMR(binomial_models)
gic.binomial_model$df.min

A corresponding predict call has a type parameter with the default value "link", which returns the linear predictors. To change that behavior one can substitute the default with type="response" to get the fitted probabilities or type="class" to get the class labels corresponding to the maximum probability. So to get actual values compatible with a binomial y, type="class" should be used:

predict(gic.binomial_model, newx=head(X), type="class")

Please note that 1 is the value of a target class in the predict output.

References

  1. Szymon Nowakowski, Piotr Pokarowski, Wojciech Rejchel and Agnieszka Sołtys, 2023. Improving Group Lasso for High-Dimensional Categorical Data. In: Computational Science – ICCS 2023. Lecture Notes in Computer Science, vol 14074, p. 455-470. Springer, Cham.
  2. Szymon Nowakowski, Piotr Pokarowski and Wojciech Rejchel. 2021. Group Lasso Merger for Sparse Prediction with High-Dimensional Categorical Data. arXiv [stat.ME]. https://doi.org/10.48550/arXiv.2112.11114
  3. Aleksandra Maj-Kańska, Piotr Pokarowski and Agnieszka Prochenka, 2015. Delete or merge regressors for linear model selection. Electronic Journal of Statistics 9(2): 1749-1778.
  4. Piotr Pokarowski and Jan Mielniczuk, 2015. Combined l1 and greedy l0 penalized least squares for linear model selection. Journal of Machine Learning Research 16(29): 961-992. https://www.jmlr.org/papers/volume16/pokarowski15a/pokarowski15a.pdf
  5. Piotr Pokarowski, Wojciech Rejchel, Agnieszka Sołtys, Michał Frej and Jan Mielniczuk, 2022. Improving Lasso for model selection and prediction. Scandinavian Journal of Statistics, 49(2): 831–863.
  6. Ludwig Fahrmeir, Rita Künstler, Iris Pigeot, Gerhard Tutz, 2004. Statistik: der Weg zur Datenanalyse. 5. Auflage, Berlin: Springer-Verlag.
  7. Dean P. Foster and Edward I. George, 1994. The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics 22 (4): 1947–75.


Try the DMRnet package in your browser

Any scripts or data that you put into this service are public.

DMRnet documentation built on Aug. 7, 2023, 5:11 p.m.