plbpsm: Partial Linear Bivariate Penalized Spline Model


View source: R/plbpsm.R

Description

Fits a partial linear bivariate penalized spline model

Usage

plbpsm(formula, family = gaussian(), data = list(), ind_c = NULL,
  group = NULL, weights = NULL, na.action, offset = NULL,
  criterion = "GCV", method = "SCAD", control = list(), scale = 1,
  VS = FALSE, fit = TRUE, G = NULL, drop.unused.levels = TRUE,
  drop.intercept = NULL, ecdf = FALSE, backfitting = TRUE, ...)

Arguments

formula

a PLBPSM formula. This is exactly like a GLM formula except that univariate and bivariate smooth terms can be included on the right-hand side.

family

A family object specifying the distribution and link to use in fitting, etc. (see glm and family).

data

A data frame or list containing the model response variable and covariates required by the formula. By default the variables are taken from environment(formula).

ind_c

Default is 'NULL'; if supplied, the indices of the user-chosen variables to include in the parametric part.

group

A vector describing the grouping of the coefficients. For greatest efficiency and least ambiguity (see details), it is best if group is a factor or vector of consecutive integers, although unordered groups and character vectors are also allowed. If there are coefficients to be included in the model without being penalized, assign them to group 0 (or "0").
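As an illustration (a hypothetical coefficient layout, not taken from the package), a grouping for six linear coefficients might look like this:

```r
# First two coefficients form group 1, the next three form group 2,
# and the last coefficient is left unpenalized by assigning it to group 0.
group <- c(1, 1, 2, 2, 2, 0)
```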

weights

weights on the contribution of the data to the log likelihood. Note that a weight of 2, for example, is equivalent to having made exactly the same observation twice. If you want to reweight the contributions of each datum without changing the overall magnitude of the log likelihood, then you should normalize the weights (e.g. weights <- weights/mean(weights)).

na.action

a function which indicates what should happen when the data contain ‘NA’s. The default is set by the 'na.action' setting of 'options', and is 'na.fail' if that is unset. The 'factory-fresh' default is 'na.omit'.

offset

Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula (this conforms to the behaviour of lm and glm).

criterion

The criterion used to choose the penalty parameter lambda: "GCV" for the generalized cross-validation method and "CV" for cross-validation.

method

The variable selection method for linear covariates: "SCAD" applies the SCAD penalty to the coefficients of the linear covariates; "ALASSO" applies the adaptive LASSO penalty.

control

A list of fit control parameters to replace defaults returned by plbpsm.control. Any control parameters not supplied stay at their default values.

scale

The scale parameter in the exponential family.

VS

'TRUE' to use ALASSO/SCAD to select linear variables.

fit

If this argument is TRUE then plbpsm sets up the model and fits it; if it is FALSE then the model is set up and an object G containing what would be required to fit it is returned. See argument G.

G

Usually NULL, but may contain the object returned by a previous call to plbpsm with fit=FALSE, in which case all other arguments are ignored except for scale, control, method and fit.

drop.unused.levels

by default unused levels are dropped from factors before fitting. For some smooths involving factor variables you might want to turn this off. Only do so if you know what you are doing.

drop.intercept

Set to TRUE to force the model to omit the constant in the parametric model part, even with factor variables present. Can be a vector when formula is a list.

ecdf

Whether the empirical conditional density function is used.

backfitting

Whether spline-backfitted local polynomial estimation is applied.

...

further arguments for passing on e.g. to grplsfit.

Details

A generalized geoadditive model (GgAM) is a generalized linear model (GLM) in which the linear predictor is given by user-specified univariate smooth functions plus a bivariate smooth function of the location covariates. A simple example is:

log(E(y_i)) = u_1(z_{1i}) + u_2(z_{2i}) + f(x_{1i}, x_{2i})

where the (independent) response variables y_i follow a Poisson distribution, z_{1i} and z_{2i} are explanatory variables, and f is a bivariate smooth function of the location covariates x_{1i} and x_{2i}. The log is an example of a link function.

A partial linear bivariate penalized spline model additionally contains a conventional parametric component in the linear predictor. A simple example is:

log(E(y_i)) = z_{1i} + z_{2i} + u_1(x_{1i}) + u_2(x_{2i}) + b(s_{1i}, s_{2i})

where z_{1i} and z_{2i} enter linearly, u_1 and u_2 are univariate smooth functions of x_{1i} and x_{2i}, and b is the bivariate smooth function of the location covariates s_{1i} and s_{2i}. The log is again an example of a link function.
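As a sketch, such a model might be written in plbpsm's formula syntax roughly as follows (u() marks univariate smooth terms and b() the bivariate smooth over locations; V and Tr are an assumed triangulation of the spatial domain, as in the Examples below):

```r
# u() requests univariate smooths of x1 and x2; b() requests the bivariate
# penalized spline over the locations (s1, s2) on the triangulation (V, Tr),
# with degree d and smoothness r.
formula <- y ~ z1 + z2 + u(x1) + u(x2) + b(s1, s2, V = V, Tr = Tr, d = 2, r = 1)
```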

Before model fitting, plbpsm performs a model structure identification step that distinguishes the linear from the nonlinear continuous variables using the group adaptive LASSO.

We incorporate a variable selection mechanism into the partial linear spatial regression model (PLSM) and propose a doubly penalized least squares approach based on bivariate spline approximation over the spatial domain when the link is Gaussian.

plbpsm in GgAM chooses the smoothing penalty parameter by the generalized cross-validation (GCV) criterion or the cross-validation (CV) criterion. The formula for GCV is:

n D/(n - DoF)^2

where D is the deviance, n the number of data and DoF the effective degrees of freedom of the model. The formula for CV is:

(1/n) ∑_{i=1}^{n} (y_i - \hat{y}_i^{(-i)})^2

where \hat{y}_i^{(-i)} is the prediction of y_i computed with the fold containing the ith data point left out, and n is the number of data.
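The two criteria above can be sketched in plain R. This is an illustration only, not the package's internal code; dev, n, dof and the refitting helper fit_predict are assumed inputs:

```r
# GCV score: n * deviance / (n - effective degrees of freedom)^2.
gcv_score <- function(dev, n, dof) n * dev / (n - dof)^2

# K-fold CV score: mean squared leave-fold-out prediction error.
# 'fit_predict' is an assumed helper that refits the model without the
# held-out fold and returns predictions for the held-out observations.
cv_score <- function(y, fold_id, fit_predict) {
  sq_err <- 0
  for (k in unique(fold_id)) {
    held_out <- fold_id == k
    sq_err <- sq_err + sum((y[held_out] - fit_predict(held_out))^2)
  }
  sq_err / length(y)
}
```

The lambda minimizing the chosen score over a grid of candidate values is selected.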

In terms of shrinkage penalties, the adaptive LASSO and SCAD penalties can be used for variable selection under PLSM and GPLSM. Broadly, plbpsm works by first constructing basis functions and penalty coefficient matrices for the bivariate smooth term in the model formula, and obtaining a model matrix for the parametric part. The algorithm chooses the penalty parameter for the bivariate smooth by the GCV/CV criterion and selects significant linear covariates by the adaptive LASSO or SCAD method using the BIC criterion. A refit is then applied to the chosen model to obtain the final fit.

Details of the default underlying fitting methods are given in Yu et al. (2019), Wang et al.(2018), and Wang et al. (2019).

Value

If fit=FALSE the function returns a list G of items needed to fit a PLBPSM, but doesn't actually fit it. Otherwise the function returns an object of class "plbpsm" as described in plbpsmObject.

References

ADD MORE.

Examples

library(MASS)
library(grpreg)
library(mgcv)
library(Triangulation)
library(BPST)

#################### Horseshoe Domain Example ######################
###### VS in PLSM ######
data("eg1pop_dat")
eg1_V2 <- eg1pop_dat[['V2']]
eg1_T2 <- eg1pop_dat[['T2']]
eg1pop_rho03 <- eg1pop_dat[['rho03']]
n <- 100
Npop <- nrow(eg1pop_rho03)
# ind.pop <- (1:Npop)[!is.na(eg1pop_rho03[,1])]
ind.pop <- (1:Npop)
set.seed(1234)
sam.ind <- sort(sample(ind.pop,n,replace=FALSE))
sam <- eg1pop_rho03[sam.ind,]
lambda <- 10^(seq(-2,5,by=1))
data <- sam
data("horseshoe")
tri <- TriMesh(horseshoe,4)
V <- tri$V
Tr <- tri$Tr

formula <- Y~z1+z2+z3+z4+z5+z6+z7+z8+b(x1,x2,V=V,Tr=Tr,d=2,r=1)
# example 0:default lambda
res0 <- plbpsm(formula=formula,data=as.data.frame(data),VS=TRUE)
predict(res0,as.data.frame(data))
res0$fitted.values
summary(res0)

### Estimation ###
formula <- Y~z1+z2+z3+z4+z5+z6+z7+z8+b(x1,x2,V=eg1_V2,Tr=eg1_T2,lambda=lambda,d=2,r=1)
res <- plbpsm(formula=formula,data=as.data.frame(data),VS=TRUE)

# ind_c is provided
res4 <- plbpsm(formula=formula,data=as.data.frame(sam),ind_c=c(1,2,4),VS=TRUE)

###### GPLSM ######
data(eg1pop_bin_rho00)
formula <- Y~z1+z2+z3+b(x1,x2,V=eg1_V2,Tr=eg1_T2,lambda=lambda,d=2,r=1)
n <- 100
Npop <- nrow(eg1pop_bin_rho00)
# ind.pop <- (1:Npop)[!is.na(eg1pop_bin_rho00[,1])]
ind.pop <- (1:Npop)
sam.ind <- sort(sample(ind.pop,n,replace=FALSE))
sam <- eg1pop_bin_rho00[sam.ind,]
data <- sam
res_eg1_bin <- plbpsm(formula=formula,data=as.data.frame(data),family=binomial())

data(eg1pop_nb_rho00)
formula <- Y~z1+z2+z3+b(x1,x2,V=eg1_V2,Tr=eg1_T2,d=2,r=1)
n <- 100
Npop <- nrow(eg1pop_nb_rho00)
# ind.pop <- (1:Npop)[!is.na(eg1pop_nb_rho00[,1])]
ind.pop <- (1:Npop)
sam.ind <- sort(sample(ind.pop,n,replace=FALSE))
sam <- eg1pop_nb_rho00[sam.ind,]
res_nb <- plbpsm(formula=formula,data=as.data.frame(sam),family=negbin(c(2:8)))

### GgAM ####
# Poisson #
data(eg1pop_poi2)
n <- 1000
set.seed(2008)
Npop <- nrow(eg1pop_poi2)
# ind.pop <- (1:Npop)[!is.na(eg1pop_poi2[,1])]
ind.pop <- (1:Npop)
sam.ind <- sort(sample(ind.pop,n,replace=FALSE))
sam <- eg1pop_poi2[sam.ind,]
formula <- Y~u(z1)+u(z2)+u(z3)
res_eg1_poi_add0 <- plbpsm(formula=formula,data=as.data.frame(sam),family=quasipoisson(),
group=c(0,0,0))
summary(res_eg1_poi_add0)
formula <- Y~u(z1)+u(z2)+u(z3)+b(x1,x2,V=eg1_V2,Tr=eg1_T2,d=2,r=1)

### Estimation ###
res_eg1_poi_add <- plbpsm(formula=formula,data=as.data.frame(sam),family='poisson',group=c(0,0,0))
res_intercptonly <- plbpsm(formula=Y~1,data=as.data.frame(sam),family='poisson')
### Inference & Plotting ###
summary(res_eg1_poi_add)
plot(res_eg1_poi_add)

### GgAM Prediction ###
pred <- predict(res_eg1_poi_add,as.data.frame(eg1pop_poi2[1:200,]))
origy <- eg1pop_poi2[1:200,'Y']
origy[which(is.na(pred))] <- NA
mean((pred-origy)^2,na.rm=TRUE)
# 2.8527

### GGAM-SMILE: Nonlinear detection ###
data('dat_poi_ggams')
formula <- y~u(x1)+u(x2)+u(x3)+b(s1,s2,V=V,Tr=Tr,d=2,r=1)
res_poi <- plbpsm(formula=formula,data=as.data.frame(dat_poi_ggams),family = "poisson")
# compare to the following Known GGAM model #
formula <- y~x1+u(x2)+u(x3)+b(s1,s2,V=V,Tr=Tr,d=2,r=1)
res_poi <- plbpsm(formula=formula,data=as.data.frame(dat_poi_ggams),family = "poisson",group=c(0,0))
# end of horseshoe domain example

################### Rectangular Domain Example #######################
### VS in PLSM ###
data("eg2pop_dat")
eg2_V20 <- eg2pop_dat[['V20']]
eg2_T20 <- eg2pop_dat[['T20']]
eg2pop <- eg2pop_dat[['pop']]
n <- 100
Npop <- nrow(eg2pop)
# ind.pop <- (1:Npop)[!is.na(eg2pop[,1])]
ind.pop <- (1:Npop)
sam.ind <- sort(sample(ind.pop,n,replace=FALSE))
sam <- eg2pop[sam.ind,]
formula <- Y~z1+z2+z3+z4+z5+z6+z7+z8+b(x1,x2,V=eg2_V20,Tr=eg2_T20)
res0 <- plbpsm(formula=formula,data=as.data.frame(sam),VS=TRUE)

### More complicated
formula <- Y~z2+u(z1)+b(x1,x2,V=eg2_V20,Tr=eg2_T20)
res_ex <- plbpsm(formula=formula,data=as.data.frame(sam),group=c(0,0))

funstatpackages/GgAM documentation built on Nov. 4, 2019, 12:59 p.m.