gamlssVGD: A Set of Functions for selecting Models using Validation or...
In gamlss: Generalized Additive Models for Location Scale and Shape

gamlssVGD

R Documentation

A Set of Functions for selecting Models using Validation or Test Data Sets and Cross Validation

Description

This is a set of function useful for selecting appropriate models.

The functions gamlssVGD, VGD, getTGD, TGD can be used when a subset of the data is used for validation or testing.

The function stepVGD() is a stepwise procedure for selecting an appropriate model for any of the parameters of the model minimising the test global deviance. The function stepVGDAll.A() can select a model using strategy A for all the parameters.

The functions gamlssCV, CV can be used for a k-fold cross validation.

Usage

gamlssVGD(formula = NULL, sigma.formula = ~1, nu.formula = ~1, 
          tau.formula = ~1, data = NULL, family = NO, 
          control = gamlss.control(trace = FALSE), 
          rand = NULL, newdata = NULL, ...)
          
VGD(object, ...)          

getTGD(object, newdata = NULL, ...)

TGD(object, ...)  

gamlssCV(formula = NULL, sigma.formula = ~1, nu.formula = ~1, 
         tau.formula = ~1, data = NULL, family = NO, 
         control = gamlss.control(trace = FALSE), 
         K.fold = 10, set.seed = 123, rand = NULL, 
         parallel = c("no", "multicore", "snow"), 
         ncpus = 1L, cl = NULL, ...)

CV(object, ...)

drop1TGD(object, scope, newdata, parameter = c("mu", "sigma", "nu", "tau"), 
         sorted = FALSE, trace = FALSE, 
         parallel = c("no", "multicore", "snow"), 
         ncpus = 1L, cl = NULL, ...)
         
add1TGD(object, scope, newdata, parameter = c("mu", "sigma", "nu", "tau"), 
        sorted = FALSE, trace = FALSE, 
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)

stepTGD(object, scope, newdata, 
        direction = c("both", "backward", "forward"),
        trace = TRUE, keep = NULL, steps = 1000, 
        parameter = c("mu", "sigma", "nu", "tau"), 
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)
        
stepTGDAll.A(object, scope = NULL, newdata = NULL, 
        steps = 1000, sigma.scope = NULL, nu.scope = NULL, 
        tau.scope = NULL, mu.try = TRUE, sigma.try = TRUE, 
        nu.try = TRUE, tau.try = TRUE,
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)

Arguments

`formula`	A `gamlss` `mu` formula.
`sigma.formula`	Formula for `sigma`.
`nu.formula`	Formula for `nu`.
`tau.formula`	Formula for `tau`.
`data`	The data frame required for the fit.
`family`	The `gamlss.family` distribution.
`control`	The control for fitting the gamlss model.
`rand`	For `gamlssVGD` a variable with values 1 (for fitting) and 2 (for predicting). For `gamlssCV` a variable with k values indicating the cross validation sets.
`newdata`	The new data set (validation or test) for prediction.
`object`	A relevant R object.
`scope`	defines the range of models examined in the stepwise selection similar to `stepGAIC()` where you can see examples
`sigma.scope`	defines the range of models examined in the stepwise selection for `sigma`
`nu.scope`	defines the range of models examined in the stepwise selection for `nu`
`tau.scope`	defines the range of models examined in the stepwise selection for `tau`
`mu.try`	whether should try fitting models for `mu`
`sigma.try`	whether should try fitting models for `sigma`
`nu.try`	whether should try fitting models for `nu`
`tau.try`	whether should try fitting models for `tau`
`parameter`	which distribution parameter is required, default `what="mu"`
`sorted`	should the results be sorted on the value of TGD
`trace`	f `TRUE` additional information may be given on the fits as they are tried.
`direction`	The mode of stepwise search, can be one of `both`, `backward`, or `forward`, with a default of `both`. If the scope argument is missing the default for direction is backward
`keep`	see `stepGAIC()` for explanation
`steps`	the maximum number of steps to be considered. The default is 1000.
`K.fold`	the number of subsets of the data used
`set.seed`	the seed to be used in creating `rand`
`parallel`	The type of parallel operation to be used (if any). If missing, the default is "no".
`ncpus`	integer: number of processes to be used in parallel operation: typically one would chose this to the number of available CPUs.
`cl`	An optional parallel or snow cluster for use if `parallel = "snow"`. If not supplied, a cluster on the local machine is created for the duration of the call.
`...`	further arguments to be pass in the gamlss fit

Details

The function gamlssVGD() fits a gamlss model to the training data set determined by the arguments rand or newdata. The results is a gamlssVGD objects which contains the gamlss fit to the training data plus three extra components: i) VGD the global deviance applied to the validation data sets. ii) predictError which is VGD divided with the number of observations in the validation data set and iii) residVal the residuals for the validation data set.

The function VGD() extract the validated global deviance from one or more fitted gamlssVGD objects and can be used foe model comparison.

The function getTGD() operates different from the function gamlssVGD(). It assumes that the users already have fitted models using gamlss() and now he/she wants to evaluate the global deviance at a new (validation or test) data set.

The function TGD() extract the validated/test global deviance from one or more fitted gamlssTGD objects and can be use to compare models.

The gamlssCV() performs a k-fold cross validation on a gamlss models.

The function CV() extract the cross validated global deviance from one or more fitted gamlssCV objects and can be use to compare models.

The functions add1TGD(), drop1TGD() and stepTGD behave similar to add1(), drop1() and stepGAIC() functions respectively but they used validation or test deviance as the selection criterion rather than the GAIC.

Value

A fitted models of a set of global deviances.

Author(s)

Mikis Stasinopoulos

References

Chambers, J. M. and Hastie, T. J. (1991). Statistical Models in S, Chapman and Hall, London.

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

(see also https://www.gamlss.com/).

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.

Examples

data(abdom)
# generate the random split of the data
rand <- sample(2, 610, replace=TRUE, prob=c(0.6,0.4))
# the proportions in the sample
table(rand)/610
olddata<-abdom[rand==1,] # training data
newdata<-abdom[rand==2,] # validation data
#------------------------------------------------------------------------------
#  gamlssVGD
#-------------------------------------------------------------------------------
# Using rand
v1 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=NO, 
              rand=rand)
v2 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=LO, 
              rand=rand)
v3 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=TF, 
              rand=rand)
VGD(v1,v2,v3)
#-------------------------------------------------------------------------------
## Not run: 
#-------------------------------------------------------------------------------
# using two data set
v11 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                  family=NO, newdata=newdata)
v12 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                 family=LO, newdata=newdata)
v13 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                 family=TF, newdata=newdata)
VGD(v11,v12,v13)
#-------------------------------------------------------------------------------
# function getTGD
#-------------------------------------------------------------------------------
# fit gamlss models first
g1 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=NO)
g2 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=LO)
g3 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=TF)
# and then use 
gg1 <-getTGD(g1, newdata=newdata)
gg2 <-getTGD(g2, newdata=newdata)
gg3 <-getTGD(g3, newdata=newdata)

TGD(gg1,gg2,gg3)
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# function gamlssCV
#-------------------------------------------------------------------------------
set.seed(123)
rand1 <- sample (10 , 610, replace=TRUE)
g1 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=NO, 
               rand=rand1)
g2 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=LO, 
               rand=rand1)
g3 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=TF, 
               rand=rand1)
CV(g1,g2,g3)
CV(g1)
# using parallel 
set.seed(123)
rand1 <- sample (10 , 610, replace=TRUE)
nC <- detectCores()

system.time(g21 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=NO, rand=rand1,parallel = "no", ncpus = nC ))

system.time(g22 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=LO, rand=rand1,parallel = "multicore", ncpus = nC ))

system.time(g23 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=TF, rand=rand1,parallel = "snow", ncpus = nC ))


CV(g21,g22,g23)
#-------------------------------------------------------------------------------
# functions add1TGD() drop1TGD() and stepTGD()
#-------------------------------------------------------------------------------
# the data
data(rent)
rand <- sample(2, dim(rent)[1], replace=TRUE, prob=c(0.6,0.4))
# the proportions in the sample
table(rand)/dim(rent)[1]
oldrent<-rent[rand==1,] # training set
newrent<-rent[rand==2,] # validation set

# null model
v0 <- gamlss(R~1, data=oldrent, family=GA)
# complete model
v1 <- gamlss(R~pb(Fl)+pb(A)+H+loc, sigma.fo=~pb(Fl)+pb(A)+H+loc, 
             data=oldrent, family=GA)

# drop1TGDP
system.time(v3<- drop1TGD(v1, newdata=newrent,  parallel="no"))
system.time(v4<- drop1TGD(v1, newdata=newrent,  parallel="multicore", 
                          ncpus=nC) )
system.time(v5<- drop1TGD(v1, newdata=newrent,  parallel="snow", ncpus=nC))
cbind(v3,v4,v5)

# add1TGDP
system.time(d3<- add1TGD(v0,scope=~pb(Fl)+pb(A)+H+loc, newdata=newrent,  
                       parallel="no"))
system.time(d4<- add1TGD(v0,scope=~pb(Fl)+pb(A)+H+loc, newdata=newrent,  
                        parallel="multicore", ncpus=nC) )
system.time(d5<- add1TGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,  
                        parallel="snow", ncpus=nC))

# stepTGD
system.time(d6<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent))
system.time(d7<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,
                         parallel="multicore", ncpus=nC))
system.time(d8<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,
                         parallel="snow", ncpus=nC))

## End(Not run)

gamlss documentation built on Aug. 21, 2025, 5:56 p.m.