```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The purpose of this vignette is to introduce readers to the DMRnet package, bring them up to speed with a simple use case example, and point interested readers towards more comprehensive material.
DMRnet is an R package for regression and classification with a family of model selection algorithms. The package supports both continuous and categorical predictors, and the overall number of regressors may exceed the number of observations. The selected model consists of a subset of numerical regressors and partitions of levels of factors.
The available model selection algorithms are the following:

- DMRnet is the default and the most comprehensive model selection algorithm in the package. It can be used in both $p<n$ and $p \geq n$ scenarios: it first executes a screening step in order to reduce the problem to $p<n$ and then uses DMR.
- GLAMER stands for Group LAsso MERger. It is a new model selection algorithm (simplified in relation to DMRnet) that can be used in both $p<n$ and $p \geq n$ scenarios. In comparison to DMRnet, the step of reordering variables based on their t-statistics is skipped. It is the first partition selection algorithm in the literature for which partition selection consistency has been proven in the high-dimensional scenario [Nowakowski et al., 2021].
- PDMR stands for Plain DMR. It is a threshold-parameter-agnostic variant of the GLAMER algorithm [Nowakowski et al., 2023] and can likewise be used in both $p<n$ and $p \geq n$ scenarios.
- DMR [Maj-Kańska et al., 2015] is a model selection algorithm that can be used in the $p<n$ scenario only.
- SOSnet [Pokarowski and Mielniczuk, 2015, Pokarowski et al., 2022] is a model selection algorithm used in both $p<n$ and $p \geq n$ scenarios for continuous-only regressors (with no categorical variables and, consequently, no partition selection).

Algorithm-wise, the package user has the following choices:
- the DMRnet algorithm, by calling the DMRnet() or cv.DMRnet() functions;
- the GLAMER or PDMR algorithms, by calling the DMRnet() or cv.DMRnet() functions and passing the algorithm="glamer" or algorithm="PDMR" argument, respectively;
- the DMR algorithm, by calling the DMR() or cv.DMR() functions;
- the SOSnet algorithm, which is chosen automatically if the DMRnet() or cv.DMRnet() functions are called with an input consisting of continuous-only regressors.

As this vignette is introductory only, and the choice of an algorithm is a somewhat more advanced topic, from now on it is assumed that the user works with the DMRnet algorithm only.
To load the package into memory, execute the following line:

```r
library(DMRnet)
```
The DMRnet package features two built-in data sets: Promoter and Miete [Fahrmeir et al., 2004]. We will work with the Miete data set in this vignette. The Miete data consists of $n = 2053$ households interviewed for the Munich rent standard 2003. The response is continuous: the monthly rent per square meter in Euros. There are 9 categorical and 2 continuous variables.
```r
data("miete")
X <- miete[, -1]
y <- miete$rent
head(X)
```
Out of the 9 categorical variables, 7 have 2 levels each, and the two remaining (area and rooms) have 25 and 6 levels, respectively. This sums up to 45 levels. The way DMRnet handles levels results in 36 parameters for the categorical variables (the first level of each categorical variable is considered included in the intercept). The 2 continuous variables give 2 additional parameters, and the intercept is one more, so it totals $p = 39$.
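The parameter count above can be checked with a few lines of base R (no DMRnet needed). The "first level in the intercept" convention is the usual treatment coding, which base R's model.matrix() also uses:

```r
# Counting parameters for the Miete predictors (base R sketch)
n_levels <- c(rep(2, 7), 25, 6)   # levels of the 9 categorical variables
sum(n_levels - 1)                 # 36: first level absorbed by the intercept
sum(n_levels - 1) + 2 + 1         # 39: plus 2 continuous variables and the intercept

# Treatment coding illustrated on a 3-level factor:
f <- factor(c("a", "b", "c"))
colnames(model.matrix(~ f))       # "(Intercept)" "fb" "fc" -- 2 dummy columns
```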
In contrast to explicit group lasso methods, which need a design matrix with explicit groups, the DMRnet package internally transforms an input matrix into a design matrix (e.g. by expanding factors into a set of one-hot encoded dummy variables). Thus, X needs no additional changes and is all set to be fed into DMRnet:
```r
models <- DMRnet(X, y, family="gaussian")
models
```
Here, the family="gaussian" argument was used to indicate that we are interested in linear regression for modeling a continuous response variable. family="gaussian" is the default and could have been omitted. In the printout above you can also notice the default values of some other parameters used in model estimation (e.g. nlambda=100, requesting the number of lambda values in the net).
The last part of the printout above presents the sequence of models estimated from X. Notice that the models have varying numbers of parameters: the model dimension df ranges from 39 for the full model down to 1 for the 39th model.
Let us plot the path of the coefficients for the 10 smallest models as a function of their model dimension df:

```r
plot(models, xlim=c(1, 10), lwd=2)
```
Notice how you can pass other parameters (here xlim=c(1, 10) and lwd=2) to change the default behavior of the plot() function.
To inspect the coefficients of a particular model in detail, one can use the coef() function. For the sake of the example, let's examine the last model from the plot, with df=10:

```r
coef(models, df=10)
```
Notice how the area-related coefficients merge levels: the 12 listed area levels are all merged with a coefficient of -55.36, and the 13 remaining area levels are merged with a coefficient of 0.0. The latter do not appear in the printout above because coef() lists only non-zero coefficients.
There are two basic model selection methods supported in the DMRnet package. One is based on $K$-fold Cross Validation (CV), the other on the Generalized Information Criterion (GIC) [Foster and George, 1994].
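To build intuition for what an information criterion does, here is a generic base-R sketch: it trades goodness of fit against model dimension and picks the model minimizing the penalized score. The numbers and the BIC-like penalty are purely illustrative, not DMRnet's exact GIC formula:

```r
# Illustrative only: hypothetical residual sums of squares for 4 nested models
rss <- c(100, 60, 55, 54)
df  <- c(1, 2, 3, 4)              # model dimensions
n   <- 50                         # hypothetical sample size
# Gaussian log-likelihood term plus a BIC-like penalty log(n) * df
gic <- n * log(rss / n) + log(n) * df
which.min(gic)                    # the dimension balancing fit and complexity
```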
A GIC-based model selection is performed directly on the precomputed sequence of models:

```r
gic.model <- gic.DMR(models)
plot(gic.model)
```
One can read the selected model dimension with

```r
gic.model$df.min
```
One also has access to the best model coefficients with

```r
coef(gic.model)
```
The default $K$ value is 10. Because the data needs to be partitioned into CV folds that alternate as training and validation sets, the precomputed sequence of models in the models variable cannot be used directly, as was the case with GIC-based model selection. Instead, to perform a 10-fold CV-based model selection, execute

```r
cv.model <- cv.DMRnet(X, y)
plot(cv.model)
```
As before, one can access the best model dimension with

```r
cv.model$df.min
```
In the plot above there is also a blue dot indicating the df.1se model. This is the smallest model (with respect to its dimension) whose error is within one standard deviation (the blue dotted line) of the best model's error. This model improves interpretability without erring much more than the best model.

```r
cv.model$df.1se
```
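The one-standard-error rule behind df.1se can be sketched in a few lines of base R. The error values and standard error below are hypothetical, not actual DMRnet output; the package computes df.1se internally:

```r
# Hypothetical CV errors for models of dimension 1..6
cv_error <- c(9.0, 7.5, 6.9, 6.8, 6.85, 6.9)
se       <- 0.15                           # hypothetical standard error
best     <- which.min(cv_error)            # analogue of df.min
ok       <- which(cv_error <= cv_error[best] + se)
df_1se   <- min(ok)                        # smallest dimension within 1 SE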
Returning to the best model, df.min: its dimension is the same as when we used GIC directly. Let's check if the CV-selected model is the same as the best model selected with GIC:

```r
coef(cv.model) == coef(gic.model)
```
Models selected with both model selection methods can be used to predict response variable values for new observations:

```r
predict(gic.model, newx=head(X))
predict(cv.model, newx=head(X))
```
No wonder the corresponding predictions are all equal, since the selected models were the same.
In models selected with CV, one can switch between the df.min and df.1se models with the md argument, as follows:

```r
predict(cv.model, md="df.min", newx=head(X))  # the default, best model
predict(cv.model, md="df.1se", newx=head(X))  # the alternative df.1se model
```
One can predict the response variable values for the whole sequence of models as well:

```r
predict(models, newx=head(X))
```
Notice how the df12 model predictions overlap with the values we obtained for the df.min model, and the df3 predictions overlap with the values we obtained for the df.1se model.
For a classification task with 2 classes, the non-default family="binomial" should be provided in a call to DMRnet and cv.DMRnet (but not to gic.DMR), and the response variable should be a factor with two classes, with the last level in alphabetical order interpreted as the target class. This is illustrated with the following somewhat artificial example:

```r
binomial_y <- factor(y > mean(y))   # changing the Miete response y into a factor with 2 classes
binomial_models <- DMRnet(X, binomial_y, family="binomial")
gic.binomial_model <- gic.DMR(binomial_models)
gic.binomial_model$df.min
```
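The target-class convention can be verified with base R alone (no DMRnet needed): for a factor built from a logical vector, the levels are sorted alphabetically, so the last level is the target class.

```r
# Levels of a factor built from logicals are sorted alphabetically:
f <- factor(c(FALSE, TRUE, TRUE))
levels(f)   # "FALSE" "TRUE" -- the last level, "TRUE", is the target class
```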
A corresponding predict call has a type parameter with the default value "link", which returns the linear predictors. To change this behavior, one can substitute type="response" to get the fitted probabilities, or type="class" to get the class labels corresponding to the maximum probability. So, to get values compatible with the binomial y, type="class" should be used:

```r
predict(gic.binomial_model, newx=head(X), type="class")
```
Please note that 1 denotes the target class in the predict output.