This package implements a genetic algorithm which performs simulatenous variable selection and structure discovery in generalized additive models of the form:
For a given dependent variable and a set of explanatory variables, the
genetic algorithm determines which regressors should be included
linearly (set ), which nonparametrically (set
), and which should be excluded from the regression
equation. The aim is to minimize the Bayesian Information Criterion
value of the model.
For more details on the motivation behind GAGAM and how it works see the paper: Cus, Mark. 2020. “Simultaneous Variable Selection And Structure Discovery In Generalized Additive Models”. https://github.com/markcus1/gagam/blob/master/GAGAMpaper.pdf.
A pdf gagam
package manual is available here:
https://github.com/markcus1/gagam/blob/master/gagam_package_manual.pdf.
gagam
can be installed and loaded with:
#library(devtools)
#install_github("markcus1/gagam")
library(gagam)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
#> Loading required package: parallel
#> Loading required package: foreach
#> Loading required package: doParallel
#> Loading required package: iterators
#> Loading required package: doMC
Suppose we have a dependent variable, , ten explanatory variables,
, and the true model only contains the first
four. Moreover,
,
,
have a linear effect on
while
has a nonlinear relationship with the dependent variable. We now
demonstrate how to use
gagam()
to extract the true model.
Begin by generating the data:
N <- 500 #number of observations
set.seed <- 125
xdat <- matrix(rnorm(N*10,0,1),nrow=N,ncol=10) #matrix of explanatory variables (each drawn from N(0,1))
ydat <- 4*xdat[,1]+5*xdat[,2]+6*xdat[,3]+(xdat[,4])^2 + 4 + rnorm(N,0,0.25) #generate y according to the true model
We can now run GAGAM using:
example_gagam <- gagam(y=ydat,x=xdat,Kvar=6,no_gen=50,multicore=FALSE)
#> [1] "Generation: 1 of 50"
#> [1] "Generation: 2 of 50"
#> [1] "Generation: 3 of 50"
#> [1] "Generation: 4 of 50"
#> [1] "Generation: 5 of 50"
#> [1] "Generation: 6 of 50"
#> [1] "Generation: 7 of 50"
#> [1] "Generation: 8 of 50"
#> [1] "Generation: 9 of 50"
#> [1] "Generation: 10 of 50"
#> [1] "Generation: 11 of 50"
#> [1] "Generation: 12 of 50"
#> [1] "Generation: 13 of 50"
#> [1] "Generation: 14 of 50"
#> [1] "Generation: 15 of 50"
#> [1] "Generation: 16 of 50"
#> [1] "Generation: 17 of 50"
#> [1] "Generation: 18 of 50"
#> [1] "Generation: 19 of 50"
#> [1] "Generation: 20 of 50"
#> [1] "Generation: 21 of 50"
#> [1] "Generation: 22 of 50"
#> [1] "Generation: 23 of 50"
#> [1] "Generation: 24 of 50"
#> [1] "Generation: 25 of 50"
#> [1] "Generation: 26 of 50"
#> [1] "Generation: 27 of 50"
#> [1] "Generation: 28 of 50"
#> [1] "Generation: 29 of 50"
#> [1] "Generation: 30 of 50"
#> [1] "Generation: 31 of 50"
#> [1] "Generation: 32 of 50"
#> [1] "Generation: 33 of 50"
#> [1] "Generation: 34 of 50"
#> [1] "Generation: 35 of 50"
#> [1] "Generation: 36 of 50"
#> [1] "Generation: 37 of 50"
#> [1] "Generation: 38 of 50"
#> [1] "Generation: 39 of 50"
#> [1] "Generation: 40 of 50"
#> [1] "Generation: 41 of 50"
#> [1] "Generation: 42 of 50"
#> [1] "Generation: 43 of 50"
#> [1] "Generation: 44 of 50"
#> [1] "Generation: 45 of 50"
#> [1] "Generation: 46 of 50"
#> [1] "Generation: 47 of 50"
#> [1] "Generation: 48 of 50"
#> [1] "Generation: 49 of 50"
#> [1] "Generation: 50 of 50"
example_gagam
#> $best_gam
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> y ~ x3 + x1 + x2 + s(x4, bs = "cr", k = 10)
#>
#> Estimated degrees of freedom:
#> 8.1 total = 12.1
#>
#> REML score: 52.69499
#>
#> $linear_mains
#> [1] "x3" "x1" "x2"
#>
#> $nonparametric_mains
#> [1] "x4"
#>
#> attr(,"class")
#> [1] "gagam"
The output is a list. The first element contains the optimal model as
determined by GAGAM as a fitted mgcv::gam
object. The second and third
elements show which variables the algorithm included in the model
linearly and nonparametrically, respectively.
We can see that GAGAM correctly extracted the true model -
,
,
are included linearly,
is included
nonparametrically, and the other variables are excluded.
As the example above demonstrates, the main function implementing GAGAM
is gagam()
. This section explains the options available to the user
when implementing gagam()
.
gagam()
has only two required arguments: y
and x
.
y
should contain the values of the dependent variable. It can take the
form of a vector, matrix, data frame, or factor. y
should always be
univariate - even if a matrix or a data frame is used, it should always
have only one column.
x
should contain the observations of all explanatory variables. It can
be a matrix or a data frame.
Note that if the columns of x
are named then the explanatory variables
will inherit those names. Otherwise, the explanatory variables will
automatically be named x1,x2,...,xp
.
The user is allowed to specify a number of algorithm parameters*:
no_gen
sets the number of iterations (generations) of the algorithm.
By default, this is 100 and that is usually sufficient. For smaller
problems no_gen
might be set to a lower number, e.g. 50.
pop_size
is the population size. By default, this is set to 500 but
the user can increase it in steps of 500 (e.g. to 1000, 1500, 2000, …).
pop_size
should always be a multiple of 500.
Kvar
sets the maximum number of variables allowed in the model. In the
example above we had 10 explanatory variables but we decided that we
will only choose among models with six or fewer predictors. This
argument is especially important if we have more explanatory variables
than observations (). Then,
Kvar
should always be set to a
number lower than , otherwise the algorithm might arrive at an
overdetermined model and return an error.
Kint
extends the model from the introduction to also allow first-order
interactions. The argument specifies the maximum number of interactions
allowed in the model. By default this is set to 0, i.e. no interactions
allowed. gagam
with Kint
set to more than 0 is usually very (!)
slow and is thus not recommended.
p_m
is the mutation rate for the main effects (predictors).
p_nonpar
is the mutation rate for the functional forms.
p_int
is the mutation rate for the interactions.
p_int_nonpar
is the mutation rate for functional forms of
interactions.
multicore
is a logical which specifies whether multiple CPU cores
should be used in fitting gagam
. Currently implemented using
mclapply
and so will not work on Windows. By default this is set to
TRUE
. Sometimes you might get an error “Error in mcfork: unable to
fork, possible reason: Resource temporarily unavailable”. If this
happens, try restarting the R session.
cores
sets the number of CPU cores to use with multicore
. By default
gagam
will use all available cores.
*Refer to the paper mentioned at the top for more details on these parameters.
gagam
is essentially a wrapper for the mgcv
package and particularly
the mgcv::gam
function. Thus, the user can pass arguments to gam()
which determine how the model should be specified and estimated.
bs
is the spline basis used to estimate nonparametric terms. By
default, we use natural cubic splines but other bases might be more
appropriate in certain applications. See mgcv::smooth.terms
for an
overview of what is available.
k
is the basis dimension. This is set to 10 by default. Currently, all
nonparametric terms are estimated using the same basis and basis
dimension. Also, note that due to the way generalized additive models
are estimated, Kvar
times k
should be less than the number of
observations, otherwise mgcv::gam
might return the error “Model has
more coefficients than observations”.
family
specifies the distribution of the dependent variable and the
link function. Gaussian() is the default. Binomial(link=“logit”) might
be used for classification. Any family supported by mgcv::gam
is
allowed.
method
specifies the criterion for choosing the smoothing parameter.
By default, we use the restricted maximum likelihood (REML). Any
method
supported by mgcv::gam
is allowed.
optimizer
specifies the optimization method to optimize the criterion
given by method
. See mgcv::gam
for more details.
always_par
allows the user to force some terms to always enter the
model parametrically. This is useful if we have noncontinuous predictors
such as dummy variables. In the extreme, one might make all variables
parametric and use GAGAM just for variable selection. always_par
should be specified as a numeric vector where the elements give the
column numbers in x
of variables to be estimated parametrically, e.g.
always_par=c(1,2,3)
will ensure that the variables in the first three
columns of x
are always parametric.
As mentioned in the introduction and in the paper, GAGAM uses the BIC to
judge the quality of the models. In small samples some noise variables
can reduce the BIC and get included in the recommended model. To counter
this, gagam
implements three variable elimination methods (chosen
using the argument reduc
):
reduc=c(1)
loops through the variables in the model returned by
gagam
, removing one at a time and noting the changes in BIC. In the
end, all variables whose removal increases the BIC by less than 7 are
eliminated from the model.
reduc=c(3)
is similar to reduc=c(1)
but tries making nonparametric
variables linear.
reduc=c(2)
applies reduc=c(1)
and then reduc=c(3)
.
More than one reduction can be specified, e.g. reduc=c(1,2,3)
. gagam
will return the original recommended model and the reduced models.
reduc
should generally only be used when the number of explanatory
variables is very large and in small samples.
Summary, plot, and predict methods are available for objects returned by
gagam
. These essentially just take the fitted mgcv::gam
object
contained in the gagam
output and pass it to
summary.gam/plot.gam/predict.gam. If one of the reduc
options was
used, one can specify which model to use (NULL
for the original model,
reduc = 1
for the model reduced using reduc=c(1)
etc.). We also
allow the user to pass through additional arguments for
plot.gam
/predict.gam
.
summary(example_gagam)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> y ~ x3 + x1 + x2 + s(x4, bs = "cr", k = 10)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.98238 0.01136 438.5 <2e-16 ***
#> x3 5.99686 0.01095 547.6 <2e-16 ***
#> x1 4.01328 0.01122 357.7 <2e-16 ***
#> x2 5.00179 0.01137 439.9 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x4) 8.104 8.706 1410 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.999 Deviance explained = 99.9%
#> -REML = 52.695 Scale est. = 0.064403 n = 500
plot(example_gagam)
The following code shows how to apply GAGAM to the Boston Housing data set.
library(mlbench)
data(BostonHousing)
y <- BostonHousing[,14]
x <- BostonHousing[,1:13]
boston_gagam <- gagam(y,x,Kvar = 13,always_par = c(2,4,9))
#> [1] "Generation: 1 of 100"
#> [1] "Generation: 2 of 100"
#> [1] "Generation: 3 of 100"
#> [1] "Generation: 4 of 100"
#> [1] "Generation: 5 of 100"
#> [1] "Generation: 6 of 100"
#> [1] "Generation: 7 of 100"
#> [1] "Generation: 8 of 100"
#> [1] "Generation: 9 of 100"
#> [1] "Generation: 10 of 100"
#> [1] "Generation: 11 of 100"
#> [1] "Generation: 12 of 100"
#> [1] "Generation: 13 of 100"
#> [1] "Generation: 14 of 100"
#> [1] "Generation: 15 of 100"
#> [1] "Generation: 16 of 100"
#> [1] "Generation: 17 of 100"
#> [1] "Generation: 18 of 100"
#> [1] "Generation: 19 of 100"
#> [1] "Generation: 20 of 100"
#> [1] "Generation: 21 of 100"
#> [1] "Generation: 22 of 100"
#> [1] "Generation: 23 of 100"
#> [1] "Generation: 24 of 100"
#> [1] "Generation: 25 of 100"
#> [1] "Generation: 26 of 100"
#> [1] "Generation: 27 of 100"
#> [1] "Generation: 28 of 100"
#> [1] "Generation: 29 of 100"
#> [1] "Generation: 30 of 100"
#> [1] "Generation: 31 of 100"
#> [1] "Generation: 32 of 100"
#> [1] "Generation: 33 of 100"
#> [1] "Generation: 34 of 100"
#> [1] "Generation: 35 of 100"
#> [1] "Generation: 36 of 100"
#> [1] "Generation: 37 of 100"
#> [1] "Generation: 38 of 100"
#> [1] "Generation: 39 of 100"
#> [1] "Generation: 40 of 100"
#> [1] "Generation: 41 of 100"
#> [1] "Generation: 42 of 100"
#> [1] "Generation: 43 of 100"
#> [1] "Generation: 44 of 100"
#> [1] "Generation: 45 of 100"
#> [1] "Generation: 46 of 100"
#> [1] "Generation: 47 of 100"
#> [1] "Generation: 48 of 100"
#> [1] "Generation: 49 of 100"
#> [1] "Generation: 50 of 100"
#> [1] "Generation: 51 of 100"
#> [1] "Generation: 52 of 100"
#> [1] "Generation: 53 of 100"
#> [1] "Generation: 54 of 100"
#> [1] "Generation: 55 of 100"
#> [1] "Generation: 56 of 100"
#> [1] "Generation: 57 of 100"
#> [1] "Generation: 58 of 100"
#> [1] "Generation: 59 of 100"
#> [1] "Generation: 60 of 100"
#> [1] "Generation: 61 of 100"
#> [1] "Generation: 62 of 100"
#> [1] "Generation: 63 of 100"
#> [1] "Generation: 64 of 100"
#> [1] "Generation: 65 of 100"
#> [1] "Generation: 66 of 100"
#> [1] "Generation: 67 of 100"
#> [1] "Generation: 68 of 100"
#> [1] "Generation: 69 of 100"
#> [1] "Generation: 70 of 100"
#> [1] "Generation: 71 of 100"
#> [1] "Generation: 72 of 100"
#> [1] "Generation: 73 of 100"
#> [1] "Generation: 74 of 100"
#> [1] "Generation: 75 of 100"
#> [1] "Generation: 76 of 100"
#> [1] "Generation: 77 of 100"
#> [1] "Generation: 78 of 100"
#> [1] "Generation: 79 of 100"
#> [1] "Generation: 80 of 100"
#> [1] "Generation: 81 of 100"
#> [1] "Generation: 82 of 100"
#> [1] "Generation: 83 of 100"
#> [1] "Generation: 84 of 100"
#> [1] "Generation: 85 of 100"
#> [1] "Generation: 86 of 100"
#> [1] "Generation: 87 of 100"
#> [1] "Generation: 88 of 100"
#> [1] "Generation: 89 of 100"
#> [1] "Generation: 90 of 100"
#> [1] "Generation: 91 of 100"
#> [1] "Generation: 92 of 100"
#> [1] "Generation: 93 of 100"
#> [1] "Generation: 94 of 100"
#> [1] "Generation: 95 of 100"
#> [1] "Generation: 96 of 100"
#> [1] "Generation: 97 of 100"
#> [1] "Generation: 98 of 100"
#> [1] "Generation: 99 of 100"
#> [1] "Generation: 100 of 100"
summary(boston_gagam)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> y ~ rad + tax + nox + ptratio + s(dis, bs = "cr", k = 10) + s(lstat, bs = "cr", k = 10) + s(rm, bs = "cr", k = 10) + s(crim, bs = "cr", k = 10)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 47.434816 2.673583 17.742 < 2e-16 ***
#> rad 0.355996 0.058860 6.048 2.95e-09 ***
#> tax -0.011635 0.002609 -4.459 1.02e-05 ***
#> nox -18.552051 3.023186 -6.137 1.77e-09 ***
#> ptratio -0.718537 0.097222 -7.391 6.54e-13 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(dis) 7.714 8.381 19.61 < 2e-16 ***
#> s(lstat) 5.791 6.751 43.75 < 2e-16 ***
#> s(rm) 5.561 6.401 31.77 < 2e-16 ***
#> s(crim) 2.082 2.281 32.84 5.51e-12 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> R-sq.(adj) = 0.857 Deviance explained = 86.4%
#> -REML = 1379.7 Scale est. = 12.07 n = 506
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.