This package implements a genetic algorithm which performs simultaneous variable selection and structure discovery in generalized additive models of the form: $$g(\mathbb{E}(y_i)) = \alpha + \sum_{j\in \mathcal{L}} \beta_j x_{ij} + \sum_{k \in \mathcal{N}} f_k(x_{ik}) + \varepsilon_i$$ For a given dependent variable and a set of explanatory variables, the genetic algorithm determines which regressors should be included linearly (the set $\mathcal{L}$), which should enter nonparametrically (the set $\mathcal{N}$), and which should be excluded from the regression equation altogether. The aim is to minimize the Bayesian Information Criterion (BIC) value of the model.
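For instance, with ten candidate regressors one possible outcome (purely illustrative, not taken from the package output) would be

$$g(\mathbb{E}(y_i)) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + f_4(x_{i4}) + \varepsilon_i,$$

i.e. $\mathcal{L} = \{1,2\}$, $\mathcal{N} = \{4\}$, and all remaining regressors are excluded.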
For more details on the motivation behind GAGAM and how it works, see the paper: Cus, Mark. 2020. "Simultaneous Variable Selection And Structure Discovery In Generalized Additive Models". https://github.com/markcus1/gagam/blob/master/GAGAMpaper.pdf.
A PDF manual for the `gagam` package is available here: https://github.com/markcus1/gagam/blob/master/gagam_package_manual.pdf.
`gagam` can be installed and loaded with:

```r
# library(devtools)
# install_github("markcus1/gagam")
library(gagam)
```
Suppose we have a dependent variable, $y$, ten explanatory variables, $x_1,x_2,\ldots,x_{10}$, and the true model contains only the first four. Moreover, $x_1$, $x_2$, $x_3$ have a linear effect on $y$ while $x_4$ has a nonlinear relationship with the dependent variable. We now demonstrate how to use `gagam()` to extract the true model.
Begin by generating the data:
```r
N <- 500 # number of observations
set.seed(125)
xdat <- matrix(rnorm(N*10, 0, 1), nrow = N, ncol = 10) # matrix of explanatory variables (each drawn from N(0,1))
ydat <- 4*xdat[,1] + 5*xdat[,2] + 6*xdat[,3] + (xdat[,4])^2 + 4 + rnorm(N, 0, 0.25) # generate y according to the true model
```
We can now run GAGAM using:
```r
example_gagam <- gagam(y = ydat, x = xdat, Kvar = 6, no_gen = 50, multicore = FALSE)
example_gagam
```
The output is a list. The first element contains the optimal model, as determined by GAGAM, as a fitted `mgcv::gam` object. The second and third elements show which variables the algorithm included in the model linearly and nonparametrically, respectively.
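As an illustrative sketch (assuming positional access to the list elements described above; the exact element names may differ across package versions), the output can be inspected like this:

```r
# Inspect the gagam() output described above (positional access assumed)
best_gam <- example_gagam[[1]]  # fitted mgcv::gam object for the selected model
example_gagam[[2]]              # variables included linearly
example_gagam[[3]]              # variables included nonparametrically
summary(best_gam)               # standard mgcv summary of the selected model
```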
We can see that GAGAM correctly extracted the true model: $x_1$, $x_2$, $x_3$ are included linearly, $x_4$ is included nonparametrically, and the other variables are excluded.
As the example above demonstrates, the main function implementing GAGAM is `gagam()`. This section explains the options available to the user when calling `gagam()`.
`gagam()` has only two required arguments: `y` and `x`.
`y` should contain the values of the dependent variable. It can take the form of a vector, matrix, data frame, or factor. `y` should always be univariate: even if a matrix or a data frame is used, it should have only one column.
`x` should contain the observations of all explanatory variables. It can be a matrix or a data frame.

Note that if the columns of `x` are named, then the explanatory variables inherit those names. Otherwise, the explanatory variables are automatically named `x1, x2, ..., xp`.
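As a hypothetical illustration (the data frame `xdf` and the names `var1`, ..., `var10` are made up for this sketch), naming the columns of `x` makes the selected terms easier to read:

```r
# Illustrative only: named columns of x are inherited by the selected terms
xdf <- as.data.frame(xdat)
names(xdf) <- paste0("var", 1:10)
# example_named <- gagam(y = ydat, x = xdf, Kvar = 6, no_gen = 50, multicore = FALSE)
```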
The user can also specify a number of algorithm parameters* (a combined example follows the list below):
`no_gen` sets the number of iterations (generations) of the algorithm. By default this is 100, which is usually sufficient. For smaller problems `no_gen` can be set to a lower number, e.g. 50.
`pop_size` is the population size. By default this is 500, but the user can increase it in steps of 500 (e.g. to 1000, 1500, 2000, ...). `pop_size` should always be a multiple of 500.
`Kvar` sets the maximum number of variables allowed in the model. In the example above we had 10 explanatory variables but chose only among models with six or fewer predictors. This argument is especially important when there are more explanatory variables than observations ($p>n$): `Kvar` should then be set to a number lower than $n$, otherwise the algorithm might arrive at an overdetermined model and return an error.
`Kint` extends the model from the introduction to also allow first-order interactions. The argument specifies the maximum number of interactions allowed in the model. By default this is set to 0, i.e. no interactions are allowed. `gagam` with `Kint` set to more than 0 is usually very (!) slow and is thus not recommended.
`p_m` is the mutation rate for the main effects (predictors).

`p_nonpar` is the mutation rate for the functional forms.

`p_int` is the mutation rate for the interactions.

`p_int_nonpar` is the mutation rate for the functional forms of interactions.
`multicore` is a logical which specifies whether multiple CPU cores should be used in fitting `gagam`. This is currently implemented using `mclapply` and so will not work on Windows. By default this is set to `TRUE`. Sometimes you might get the error "Error in mcfork: unable to fork, possible reason: Resource temporarily unavailable"; if this happens, try restarting the R session.
`cores` sets the number of CPU cores to use with `multicore`. By default `gagam` will use all available cores.
*Refer to the paper mentioned at the top for more details on these parameters.
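A combined sketch of the algorithm parameters above; the object name `example_params` and the specific values are illustrative, not recommendations, and the call is commented out because it can take a while to run:

```r
# example_params <- gagam(y = ydat, x = xdat,
#                         Kvar = 6,         # at most six variables in the model
#                         Kint = 0,         # no first-order interactions
#                         no_gen = 50,      # fewer generations for a small problem
#                         pop_size = 500,   # population size, a multiple of 500
#                         multicore = TRUE, # parallel fitting (not on Windows)
#                         cores = 4)        # number of cores to use
```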
`gagam` is essentially a wrapper around the `mgcv` package, and in particular the `mgcv::gam` function. Thus, the user can pass arguments to `gam()` which determine how the model should be specified and estimated (a combined sketch follows the arguments below).
`bs` is the spline basis used to estimate the nonparametric terms. By default, natural cubic splines are used, but other bases might be more appropriate in certain applications. See `mgcv::smooth.terms` for an overview of what is available.
`k` is the basis dimension, set to 10 by default. Currently, all nonparametric terms are estimated using the same basis and basis dimension. Also note that, due to the way generalized additive models are estimated, `Kvar` times `k` should be less than the number of observations, otherwise `mgcv::gam` might return the error "Model has more coefficients than observations".
`family` specifies the distribution of the dependent variable and the link function. `gaussian()` is the default; `binomial(link="logit")` might be used for classification. Any family supported by `mgcv::gam` is allowed.
`method` specifies the criterion for choosing the smoothing parameter. By default, restricted maximum likelihood (REML) is used. Any `method` supported by `mgcv::gam` is allowed.
`optimizer` specifies the optimization method used to optimize the criterion given by `method`. See `mgcv::gam` for more details.
`always_par` allows the user to force some terms to always enter the model parametrically. This is useful when there are noncontinuous predictors such as dummy variables. In the extreme, one might make all variables parametric and use GAGAM just for variable selection. `always_par` should be specified as a numeric vector whose elements give the column numbers in `x` of the variables to be estimated parametrically, e.g. `always_par=c(1,2,3)` ensures that the variables in the first three columns of `x` are always parametric.
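A sketch of passing `mgcv`-related arguments through `gagam()`; the particular choices below (the `"cr"` basis code, a logit link for a hypothetical binary outcome `y01`, REML, the forced parametric columns) are illustrative assumptions, not package defaults:

```r
# y01 <- as.numeric(ydat > median(ydat))   # hypothetical binary outcome
# example_mgcv <- gagam(y = y01, x = xdat, Kvar = 6, no_gen = 50,
#                       bs = "cr",                         # mgcv basis code
#                       k = 10,                            # basis dimension (Kvar*k < n)
#                       family = binomial(link = "logit"), # classification
#                       method = "REML",
#                       always_par = c(1, 2),              # force columns 1 and 2 to be parametric
#                       multicore = FALSE)
```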
As mentioned in the introduction and in the paper, GAGAM uses the BIC to judge the quality of the models. In small samples some noise variables can reduce the BIC and get included in the recommended model. To counter this, `gagam` implements three variable elimination methods (chosen using the argument `reduc`):
`reduc=c(1)` loops through the variables in the model returned by `gagam`, removing one at a time and noting the change in BIC. In the end, all variables whose removal increases the BIC by less than 7 are eliminated from the model.

`reduc=c(3)` is similar to `reduc=c(1)` but tries making nonparametric variables linear.

`reduc=c(2)` applies `reduc=c(1)` and then `reduc=c(3)`.
More than one reduction can be specified, e.g. `reduc=c(1,2,3)`. `gagam` will return the original recommended model along with the reduced models.
`reduc` should generally only be used when the number of explanatory variables is very large or the sample is small.
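A minimal sketch requesting all three reductions at once (commented out; `example_reduc` is a hypothetical object name):

```r
# example_reduc <- gagam(y = ydat, x = xdat, Kvar = 6, no_gen = 50,
#                        multicore = FALSE, reduc = c(1, 2, 3))
```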
Summary, plot, and predict methods are available for objects returned by `gagam`. These essentially just take the fitted `mgcv::gam` object contained in the `gagam` output and pass it to `summary.gam`/`plot.gam`/`predict.gam`. If one of the `reduc` options was used, one can specify which model to use (`NULL` for the original model, `reduc = 1` for the model reduced using `reduc=c(1)`, etc.). We also allow the user to pass through additional arguments for `plot.gam`/`predict.gam`.
```r
summary(example_gagam)
plot(example_gagam)
```
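A hedged sketch of the remaining methods, assuming that `reduc` is passed to the method as described above and that `newdata` is passed through to `predict.gam`, with columns named after the automatically generated regressor names `x1`, ..., `x10`:

```r
# summary(example_reduc, reduc = 1)   # summarize the model reduced with reduc=c(1)
# newx <- as.data.frame(matrix(rnorm(5 * 10), nrow = 5))
# names(newx) <- paste0("x", 1:10)    # match the automatic regressor names
# predict(example_gagam, newdata = newx)
```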
The following code shows how to apply GAGAM to the Boston Housing data set.
```r
library(mlbench)
data(BostonHousing)

y <- BostonHousing[,14]
x <- BostonHousing[,1:13]

# boston_gagam <- gagam(y, x, Kvar = 13, always_par = c(2,4,9))
# summary(boston_gagam)
```