see: Stagewise Estimating Equations Implementation

View source: R/see.R

Description

Function to perform SEE, a forward stagewise regression approach for model selection and dimension reduction using Generalized Estimating Equations.

Usage

see(y, ...)

## S3 method for class 'formula'
see(formula, data = list(), clusterID, waves = NULL,
  contrasts = NULL, subset, ...)

## Default S3 method:
see(y, x, waves = NULL, ...)

## S3 method for class 'fit'
see(y, x, family, clusterID, waves = NULL,
  corstr = "independence", alpha = NULL, intercept = TRUE, offset = 0,
  control = sgee.control(maxIt = 200, epsilon = 0.05, stoppingThreshold =
  min(length(y), ncol(x)) - intercept, undoThreshold = 0), standardize = TRUE,
  verbose = FALSE, ...)

Arguments

y

Vector of response measures that corresponds to the modeling family given in the 'family' parameter. y is assumed to be the same length as clusterID and to be organized into clusters as dictated by clusterID.

...

Not currently used

formula

Object of class 'formula'; a symbolic description of the model to be fitted

data

Optional data frame containing the variables in the model.

clusterID

Vector of integers that identifies the clusters of response measures in y. Data and clusterID are assumed to be 1) of equal length, 2) sorted so that observations of a cluster are in contiguous rows, and 3) organized so that clusterID is a vector of consecutive integers.
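The three assumptions on clusterID can be checked before fitting. The helper below is a minimal base R sketch; checkClusterID is a hypothetical name and not part of sgee, and "consecutive integers" is interpreted here as the labels 1, 2, ..., K in order of appearance.

```r
## Hypothetical helper (not part of sgee): checks the layout assumed for clusterID.
checkClusterID <- function(clusterID, n) {
  ## 1) one cluster label per observation
  sameLength <- length(clusterID) == n
  ## 2) each cluster occupies a single contiguous run of rows
  runs <- rle(clusterID)
  contiguous <- !anyDuplicated(runs$values)
  ## 3) labels are 1, 2, ..., K in order of appearance (assumed interpretation)
  consecutive <- identical(as.integer(runs$values), seq_along(runs$values))
  c(sameLength = sameLength, contiguous = contiguous, consecutive = consecutive)
}

## Relabel arbitrary, already contiguous IDs as consecutive integers
rawID <- c(17, 17, 17, 42, 42, 8, 8, 8)
cleanID <- match(rawID, unique(rawID))
checkClusterID(cleanID, n = length(rawID))
```

If the rows of a cluster are not contiguous, sorting the data by cluster before relabeling would be needed first.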

waves

An integer vector which identifies components in clusters. The length of waves should be the same as the number of observations. waves is automatically generated if none is supplied, but when using the subset parameter, the waves parameter must be provided by the user for proper calculation.
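When the rows within each cluster are already in time order, a waves vector can be built by numbering the observations within each cluster. The following is a base R sketch of one way to do this; it is not necessarily how see generates waves internally, and the keep vector is only an illustrative subset.

```r
clusterID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

## Number the observations 1, 2, ... within each cluster
waves <- ave(seq_along(clusterID), clusterID, FUN = seq_along)

## After subsetting, the explicit waves still record each observation's
## original position within its cluster (cluster 1 keeps waves 1 and 3)
keep <- c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
cbind(clusterID = clusterID[keep], waves = waves[keep])
```

Regenerating waves after subsetting would renumber cluster 1 as 1, 2 and lose the gap, which is why the full-data waves should be supplied alongside subset.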

contrasts

An optional list provided when using a formula, similar to contrasts from glm. See the contrasts.arg of model.matrix.default.

subset

An optional vector specifying a subset of observations to be used in the fitting process.

x

Design matrix of dimension length(y) x nvar, where nvar is the number of variables and each row represents an observation of the predictor variables.

family

Modeling family that describes the marginal distribution of the response. Assumed to be an object such as gaussian() or poisson().

corstr

A character string indicating the desired working correlation structure. The following are implemented: "independence" (default value), "exchangeable", and "ar1".

alpha

An initial guess for the correlation parameter value between -1 and 1. If left NULL (the default), the initial estimate is 0.

intercept

Binary value indicating whether an intercept term is to be included in the model for estimation. Default is to include an intercept.

offset

Vector of offset value(s) for the linear predictor. offset is assumed to be either of length one, or of the same length as y. Default is to have no offset.

control

A list of parameters used to control the path generation process; see sgee.control.

standardize

A logical parameter that indicates whether or not the covariates need to be standardized before fitting. If standardized before fitting, the unstandardized path is returned as the default, with a standardizedPath and standardizedX included separately. Default value is TRUE.
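The relationship between a standardized fit and the unstandardized coefficients can be illustrated in base R. This sketch assumes standardization means centering each covariate and dividing by its sample standard deviation, and it uses lm only as a stand-in for the fitting step; the exact scaling used inside see may differ.

```r
set.seed(1)
x <- cbind(a = rnorm(40, mean = 5, sd = 3), b = rnorm(40, mean = -2, sd = 0.5))
y <- 1 + 2 * x[, "a"] - 4 * x[, "b"] + rnorm(40)

## Center and scale each covariate (assumed form of standardization)
standardizedX <- scale(x)
means <- attr(standardizedX, "scaled:center")
sds <- attr(standardizedX, "scaled:scale")

## lm() is only a stand-in for the fitting step
standardizedBeta <- coef(lm(y ~ standardizedX))

## Map slopes back to the original scale, then adjust the intercept
beta <- standardizedBeta[-1] / sds
beta0 <- standardizedBeta[1] - sum(beta * means)

## The back-transformed values agree with a fit on the original covariates
rbind(backTransformed = c(beta0, beta), direct = coef(lm(y ~ x)))
```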

verbose

Logical parameter indicating whether output should be produced while see is running. Default value is FALSE.

Details

Function to implement SEE, a stagewise regression approach that is designed to perform model selection in the context of Generalized Estimating Equations. Given a response y and a design matrix x (excluding the intercept), SEE generates a path of stagewise regression estimates for each covariate based on the provided step size epsilon.
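The idea of a stagewise path can be sketched in a few lines of base R. This is a deliberately simplified illustration, not the implementation in sgee: it assumes a gaussian family, an independence working correlation, and centered y and x (so no intercept, scale, or correlation parameter is estimated). stagewisePath is a hypothetical name.

```r
## Simplified forward stagewise path (hypothetical helper, not part of sgee)
stagewisePath <- function(y, x, epsilon = 0.05, maxIt = 200) {
  beta <- rep(0, ncol(x))
  path <- matrix(0, nrow = maxIt, ncol = ncol(x))
  for (it in seq_len(maxIt)) {
    ## Estimating function evaluated at the current estimate
    u <- drop(crossprod(x, y - x %*% beta))
    ## Update only the covariate with the largest absolute value
    j <- which.max(abs(u))
    beta[j] <- beta[j] + epsilon * sign(u[j])
    ## Row 'it' of the path stores the estimate after iteration 'it'
    path[it, ] <- beta
  }
  path
}

set.seed(1)
x <- scale(matrix(rnorm(200 * 5), ncol = 5))
y <- drop(x %*% c(2, 0, 1, 0, 0)) + rnorm(200)
path <- stagewisePath(y - mean(y), x, epsilon = 0.05, maxIt = 100)
round(path[c(1, 10, 100), ], 2)
```

Each iteration moves exactly one coefficient by epsilon, so a smaller epsilon gives a finer path at the cost of more iterations.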

The resulting path can then be analyzed to determine an optimal model along the path of coefficient estimates. The summary.sgee function provides such functionality based on various possible metrics, primarily focused on the Mean Squared Error. Furthermore, the plot.sgee function can be used to examine the path of coefficient estimates versus the iteration number, or some desired penalty.

A stochastic version of this function can also be called. Using the auxiliary function sgee.control, the parameters stochastic, reSample, and withReplacement can be given to see to perform a subsampling step in the procedure, making the SEE implementation scalable for large data sets.
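A subsampling step for clustered data can be pictured as drawing whole clusters rather than individual rows. The base R sketch below illustrates that idea only; it assumes sampling at the cluster level, which may not match what see does internally, and sampleClusters and its arguments are hypothetical names.

```r
## Hypothetical helper (not part of sgee): row indices for a random subset of clusters
sampleClusters <- function(clusterID, proportion = 0.5, withReplacement = FALSE) {
  ids <- unique(clusterID)
  nKeep <- max(1, floor(proportion * length(ids)))
  chosen <- ids[sample.int(length(ids), size = nKeep, replace = withReplacement)]
  ## Keep whole clusters together; a cluster drawn twice contributes its rows twice
  unlist(lapply(chosen, function(id) which(clusterID == id)))
}

set.seed(1)
clusterID <- rep(1:50, each = 4)
rows <- sampleClusters(clusterID, proportion = 0.5)
length(rows)
```

Keeping clusters intact preserves the within-cluster correlation that the working correlation structure is meant to model.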

Value

Object of class sgee containing the path of coefficient estimates, the path of scale estimates, the path of correlation parameter estimates, the iteration at which SEE terminated, and initial regression values including x, y, family, clusterID, groupID, offset, epsilon, and numIt.

Author(s)

Gregory Vaughan

References

Vaughan, G., Aseltine, R., Chen, K., Yan, J. (2017). Stagewise Generalized Estimating Equations with Grouped Variables. Biometrics 73, 1332–1342. URL: http://dx.doi.org/10.1111/biom.12669, doi:10.1111/biom.12669.

Wolfson, J. (2011). EEBoost: A general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association 106, 296–305.

Tibshirani, R. J. (2015). A general framework for fast stagewise algorithms. Journal of Machine Learning Research 16, 2543–2588.

Examples

## Load the package that provides see, genData, and sgee.control
library(sgee)

#####################
## Generate test data
#####################

## Initialize covariate values
p <- 50 
beta <- c(rep(2,5),
          c(1, 0, 1.5, 0, .5),
          rep(0.5,5),
          rep(0,p-15))
groupSize <- 1
numGroups <- length(beta)/groupSize


generatedData <- genData(numClusters = 50,
                         clusterSize = 4,
                         clusterRho = 0.6,
                         clusterCorstr = "exchangeable",
                         yVariance = 1,
                         xVariance = 1,
                         numGroups = numGroups,
                         groupSize = groupSize,
                         groupRho = 0.3,
                         beta = beta,
                         family = gaussian(),
                         intercept = 1)



## Perform Fitting by providing formula and data
genDF <- data.frame(generatedData$y, generatedData$x)
names(genDF) <- c("Y", paste0("Cov", 1:p))
coefMat1 <- see(formula(genDF), data = genDF,
                 family = gaussian(),
                 waves = rep(1:4, 50), 
                 clusterID = generatedData$clusterID,
                 groupID = generatedData$groupID, 
                 corstr = "exchangeable",
                 control = sgee.control(maxIt = 50, epsilon = 0.5),
                 verbose = TRUE)

## Set the 'stochastic' parameter to 0.5 to use the stochastic
## stagewise approach, in which a subsample of 50% of the data is taken
## before the path is calculated.
## See sgee.control for more details about the parameters for the
## stochastic stagewise approach.

coefMat2 <- see(formula(genDF), data = genDF,
                 family = gaussian(),
                 waves = rep(1:4, 50), 
                 clusterID = generatedData$clusterID,
                 groupID = generatedData$groupID, 
                 corstr = "exchangeable",
                 control = sgee.control(maxIt = 50, epsilon = 0.5,
                                        stochastic = 0.5), 
                 verbose = FALSE)

par(mfrow = c(2,1))
plot(coefMat1)
plot(coefMat2)

sgee documentation built on May 1, 2019, 7:10 p.m.