knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The purpose of this vignette is to introduce readers to the DMRnet package, bring them up to speed with a simple use case example, and point interested readers towards more comprehensive reference material.
DMRnet is an R package for regression and classification with a
family of model selection algorithms.
The package supports both continuous and categorical predictors.
The overall number of regressors may exceed the number of observations.
The selected model consists of a subset of numerical regressors and partitions of levels of factors.
The available model selection algorithms are the following:
- DMRnet is the default and the most comprehensive model selection algorithm in the package. It can be used in both $p<n$ and $p \geq n$ scenarios. It first executes a screening step in order to reduce the problem to $p<n$ and then uses DMR.
- GLAMER stands for Group LAsso MERger and is a new (simplified in relation to DMRnet) model selection algorithm that can be used in both $p<n$ and $p \geq n$ scenarios. In comparison to DMRnet, the step of reordering variables based on their t-statistics is skipped. This is the first partition selection algorithm in the literature for which partition selection consistency has been proven in a high-dimensional scenario [Nowakowski et al., 2021].
- PDMR stands for Plain DMR and is a threshold-parameter-agnostic variant of the GLAMER algorithm [Nowakowski et al., 2023]. It, too, can be used in both $p<n$ and $p \geq n$ scenarios.
- DMR [Maj-Kańska et al., 2015] is a model selection algorithm that can be used in the $p<n$ scenario only.
- SOSnet [Pokarowski and Mielniczuk, 2015, Pokarowski et al., 2022] is a model selection algorithm used in both $p<n$ and $p \geq n$ scenarios for continuous-only regressors (with no categorical variables and, consequently, no partition selection).

Algorithm-wise, the package user has the following choices:
- the DMRnet algorithm, by calling the DMRnet() or cv.DMRnet() functions;
- the GLAMER or PDMR algorithms, by calling the DMRnet() or cv.DMRnet() functions and passing the algorithm="glamer" or algorithm="PDMR" argument, respectively;
- the DMR algorithm, by calling the DMR() or cv.DMR() functions;
- the SOSnet algorithm, which is chosen automatically if the DMRnet() or cv.DMRnet() functions are called with an input consisting of continuous-only regressors.

As this vignette is introductory only, and the choice of an algorithm is a somewhat more advanced topic, from now on it is assumed that the user works with the DMRnet algorithm only.
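For reference only, the corresponding calls look as follows. This is a sketch, not run here; X and y denote a predictor matrix and a response vector, introduced in the next section.

models <- DMRnet(X, y)                        # DMRnet, the default algorithm
models <- DMRnet(X, y, algorithm = "glamer")  # GLAMER
models <- DMRnet(X, y, algorithm = "PDMR")    # PDMR
models <- DMR(X, y)                           # DMR (p < n scenario only)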
To load the package into memory, execute the following line:
library(DMRnet)
The DMRnet package features two built-in data sets: Promoter and Miete [Fahrmeir et al., 2004].
We will work with the Miete data set in this vignette. The Miete data consists of $n = 2053$ households interviewed for the Munich rent standard 2003. The response is continuous: the monthly rent per square meter, in Euros. There are 9 categorical and 2 continuous variables.
data("miete") X <- miete[,-1] y <- miete$rent head(X)
Out of the 9 categorical variables, 7 have 2 levels each, and the two remaining (area and rooms) have 25 and 6 levels, respectively. This sums up to 45 levels. The way DMRnet handles levels results in 36 parameters for the categorical variables (the first level of each categorical variable is already considered included in the intercept). The 2 continuous variables give 2 additional parameters, and the intercept is one more, for a total of $p = 39$.
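To sanity-check this count, here is a short sketch (it assumes the categorical columns of X are stored as factors):

sum(sapply(X[sapply(X, is.factor)], nlevels) - 1) # 45 levels minus 9 reference levels = 36
sum(sapply(X[sapply(X, is.factor)], nlevels) - 1) + sum(sapply(X, is.numeric)) + 1 # 36 + 2 + 1 = 39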
In contrast to explicit group lasso methods, which need a design matrix with explicit groups, the DMRnet package internally transforms an input matrix into a design matrix (e.g. by expanding factors into a set of one-hot encoded dummy variables). Thus, X needs no additional changes and is already all set to be fed into DMRnet.
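For intuition only, the snippet below sketches roughly what such an expansion of a single factor looks like, using base R's model.matrix() (DMRnet performs the equivalent step internally; you never need to call it yourself):

head(model.matrix(~ rooms, data = X)) # the 6-level rooms factor expands into an intercept plus 5 dummies

We can now estimate a sequence of models directly on X and y: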
models <- DMRnet(X, y, family="gaussian")
models
Here, family="gaussian" argument was used to indicate, that we are interested in a linear regression for modeling a continuous response variable. The family="gaussian" is the default and could have been omitted as well. In a printout above you can notice a bunch of other default values set for some other parameters (e.g. nlambda=100 requesting the cardinality of a net of lambda values used) in model estimation.
The last part of the printout above presents the sequence of models estimated from X. Notice the models have varying numbers of parameters (the model dimension df ranges from 39 for the full model down to 1 for the 39-th model).
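Any of these defaults can be overridden in the call; for example, a sketch requesting a coarser net of 50 lambda values:

models_coarse <- DMRnet(X, y, family="gaussian", nlambda=50)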
Let us plot the path for the coefficients for the 10 smallest models as a function of their model dimension df:
plot(models, xlim=c(1, 10), lwd=2)
Notice how you can also pass other parameters (xlim=c(1, 10), lwd=2) to change the default behavior of the plot() function.
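For instance, to see the full coefficient path rather than only the 10 smallest models, a sketch dropping the xlim restriction:

plot(models, lwd=2)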
To inspect the coefficients for a particular model in detail, one can use the coef() function. For the sake of the example, let's examine the last model from the plot, with df=10:
coef(models, df=10)
Notice how the area-related coefficients are all equal, effectively merging the 12 listed area levels with a coefficient of -55.36 and merging the 13 remaining area levels with a coefficient of 0.0. The 13 remaining area levels merged at 0.0 are absent from the printout above, because only non-zero coefficients get listed by the coef() function.
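To inspect a group of related coefficients programmatically, a sketch along these lines can help (it assumes, as in the printout above, that coefficient names begin with the factor name, e.g. "area"):

cf <- coef(models, df=10)
cf[grep("^area", names(cf))] # all non-zero area-related coefficients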
There are two basic methods of model selection supported in the DMRnet package. One is based on $K$-fold Cross Validation (CV), the other on the Generalized Information Criterion (GIC) [Foster and George, 1994].
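For orientation, GIC for a model $M$ of dimension $|M|$ typically takes the form

$$\mathrm{GIC}(M) = -2\,\ell(\hat{\beta}_M) + c\,|M|\log p,$$

where $\ell$ denotes the maximized log-likelihood and $c$ is a penalty constant; the selected model is the one minimizing the criterion. (This is the textbook form; the exact constant used by the package may differ.)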
A GIC-based model selection is performed directly on the precomputed sequence of models:
gic.model <- gic.DMR(models)
plot(gic.model)
One can read the dimension of the selected model with
gic.model$df.min
One also has access to the best model coefficients with
coef(gic.model)
The default $K$ value is 10. Because the data needs to be partitioned into CV subsets which alternate as training and validation sets, the precomputed sequence of models in the models variable cannot be used directly, as was the case with GIC-based model selection. Instead, to perform a 10-fold CV-based model selection, execute:
cv.model <- cv.DMRnet(X, y)
plot(cv.model)
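The number of folds can be changed as well; a sketch, assuming nfolds is the argument name (with 10 being the default):

cv.model.alt <- cv.DMRnet(X, y, nfolds=5)

Below we continue with the default 10-fold cv.model.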
As before, one can access the best model dimension with
cv.model$df.min
In the plot above there is also a blue dot indicating the df.1se model. This is the smallest model (with respect to its dimension)
whose error lies within one standard deviation (the blue dotted line) of the error of the best model.
This model improves interpretability without erring much more than the best model.
cv.model$df.1se
Returning to the best model, df.min, its dimension is the same as when we used GIC directly.
Let's check if the CV-selected model is the same as the best model selected with GIC:
coef(cv.model)==coef(gic.model)
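Equivalently, the comparison collapses into a single logical value:

all(coef(cv.model) == coef(gic.model))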
Models selected with both model selection methods can be used to predict response variable values for new observations:
predict(gic.model, newx=head(X))
predict(cv.model, newx=head(X))
No wonder the corresponding predictions are all equal, since the selected models were the same.
In models selected with CV, one can switch between df.min and df.1se models with the use of md argument, as follows:
predict(cv.model, md="df.min", newx=head(X)) # the default, best model predict(cv.model, md="df.1se", newx=head(X)) # the alternative df.1se model
One can predict the response variable values for the whole sequence of models as well:
predict(models, newx=head(X))
Notice how the df12 model predictions overlap with the values we obtained for the df.min model, and df3 overlaps with the values we obtained for the df.1se model.
For a classification task with 2 classes, the non-default family="binomial" should be provided in calls to DMRnet and cv.DMRnet (but not to gic.DMR), and the response variable should be a factor with two classes, with the last level in alphabetical order interpreted as the target class. This is illustrated by the following, somewhat artificial, example:
binomial_y <- factor(y > mean(y)) # changing the Miete response y into a binomial factor with 2 classes
binomial_models <- DMRnet(X, binomial_y, family="binomial")
gic.binomial_model <- gic.DMR(binomial_models)
gic.binomial_model$df.min
The corresponding predict call has a type parameter with the default value "link", which returns the linear predictors. To change that behavior, one can substitute the default with type="response" to get the fitted probabilities, or type="class" to get the class labels corresponding to the maximum probability. So, to get values compatible with the binomial y, type="class" should be used:
predict(gic.binomial_model, newx=head(X), type="class")
Please note that 1 denotes the target class in the predict output.
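For completeness, here is a sketch of the other type variants on the same data (the second line is a rough in-sample accuracy check only; TRUE is the target class, being last in alphabetical order):

predict(gic.binomial_model, newx=head(X), type="response") # fitted probabilities
mean((predict(gic.binomial_model, newx=X, type="class") == 1) == (binomial_y == "TRUE")) # rough training accuracy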