split_data: Perform sample splitting

View source: R/base_samplesplits.R

split_data R Documentation

Perform sample splitting

Description

This is a convenience function that splits the input data into two independent sets (training and validation), optionally accounting for single-level clustering. The two sets can be used with qgcomp.partials to get "partial" positive/negative effect estimates from the original data, where sample splitting is necessary to get valid confidence intervals and p-values. Sample splitting is also useful for any sort of exploratory model selection: the training data are used to select the model and the validation data are used to generate the final estimates. This process should not be iterative: "checking" the results in the validation data and then re-fitting invalidates inference in the validation set. For example, you could use the training data to select non-linear terms for the model and then re-fit in the validation data to get unbiased estimates.

Usage

split_data(data, cluster = NULL, prop.train = 0.4)

Arguments

data

A data.frame for use in qgcomp fitting

cluster

NULL (default) or a character value naming the cluster identifier column in the data. When given, each cluster is assigned in full to either the training or the validation data. This prevents observations from a single cluster from appearing in both sets, which would reduce the effectiveness of sample splitting.

prop.train

Proportion of the original dataset (or of the clusters identified via the 'cluster' parameter) that is used in the training data (default = 0.4)
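The cluster-level behaviour described for the cluster and prop.train arguments can be illustrated with a short base R sketch. It is not the qgcomp implementation: the helper split_by_cluster_sketch, the toy data frame toy, and its id column are hypothetical names introduced here, and rounding the number of training clusters to the nearest integer is an assumption.

```r
# Hypothetical helper (NOT the qgcomp source): assigns whole clusters,
# rather than individual rows, to the training or validation set.
split_by_cluster_sketch <- function(data, cluster, prop.train = 0.4) {
  ids <- unique(data[[cluster]])
  # assumption: number of training clusters is rounded to the nearest integer
  n_train <- round(length(ids) * prop.train)
  train_ids <- ids[sample.int(length(ids), n_train)]
  trainidx <- which(data[[cluster]] %in% train_ids)
  valididx <- setdiff(seq_len(nrow(data)), trainidx)
  list(
    trainidx = trainidx,
    valididx = valididx,
    traindata = data[trainidx, , drop = FALSE],
    validdata = data[valididx, , drop = FALSE]
  )
}

set.seed(1)
# toy data: 10 clusters with 3 observations each
toy <- data.frame(id = rep(1:10, each = 3), x = rnorm(30))
spl_toy <- split_by_cluster_sketch(toy, cluster = "id", prop.train = 0.4)

# cluster ids that appear in both sets (there should be none)
shared <- intersect(spl_toy$traindata$id, spl_toy$validdata$id)
length(shared)
```

Because sampling happens at the cluster level in this sketch, prop.train controls the share of clusters, so the share of rows in the training data can differ from prop.train when cluster sizes are unequal.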

Value

A list of the following type: list( trainidx = trainidx, valididx = valididx, traindata = traindata, validdata = validdata )

For example, after calling spl = split_data(dat), spl$traindata contains a 40% sample of the original data and spl$validdata contains the remaining 60%. spl$trainidx and spl$valididx contain the integer row numbers (in the original data dat) of the training and validation samples, respectively.
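As a rough illustration of how the four list elements relate to each other, the following base R sketch builds a list of the same shape for a toy data frame without calling split_data. The name dat, the 452-row size, and the round-to-nearest rule are assumptions made for the example; the actual function may draw or order its indexes differently.

```r
set.seed(42)
# toy stand-in for the original data (hypothetical; 452 rows)
dat <- data.frame(row = seq_len(452), y = rnorm(452))

prop.train <- 0.4
# assumption: training size is rounded to the nearest integer
n_train <- round(nrow(dat) * prop.train)
trainidx <- sort(sample.int(nrow(dat), n_train))
valididx <- setdiff(seq_len(nrow(dat)), trainidx)

spl_sketch <- list(
  trainidx = trainidx,
  valididx = valididx,
  traindata = dat[trainidx, , drop = FALSE],
  validdata = dat[valididx, , drop = FALSE]
)

# the two index vectors are disjoint and together cover every row of dat
all_rows <- sort(c(spl_sketch$trainidx, spl_sketch$valididx))
all(all_rows == seq_len(nrow(dat)))

# the indexes recover the same rows that are stored in traindata
all(dat$row[spl_sketch$trainidx] == spl_sketch$traindata$row)

# 181 training rows and 271 validation rows under these assumptions
c(train = nrow(spl_sketch$traindata), valid = nrow(spl_sketch$validdata))
```

Keeping the index vectors alongside the data makes it possible to trace any row of either subset back to its position in the original data.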

Examples

library(qgcomp)
data(metals)
set.seed(1231124)
spl = split_data(metals)
Xnm <- c(
  'arsenic','barium','cadmium','calcium','chromium','copper',
  'iron','lead','magnesium','manganese','mercury','selenium','silver',
  'sodium','zinc'
)
dim(spl$traindata) # 181 observations = 40% of total
dim(spl$validdata) # 271 observations = 60% of total
splitres <- qgcomp.partials(fun="qgcomp.glm.noboot", f=y~., q=4, 
  traindata=spl$traindata,validdata=spl$validdata, expnms=Xnm)
splitres

# sample splitting can also be used to compare linear vs. non-linear fits
# (useful if you have enough data)
set.seed(1231)
spl = split_data(metals, prop.train=.5)
lin = qgcomp.glm.boot(f=y~., q=4, expnms=Xnm, B=5, data=spl$traindata)
nlin1 = qgcomp.glm.boot(f=y~. + I(manganese^2) + I(calcium^2), expnms=Xnm, deg=2, 
  q=4, B=5, data=spl$traindata)
nlin2 = qgcomp.glm.boot(f=y~. + I(arsenic^2) + I(cadmium^2), expnms=Xnm, deg=2, 
  q=4, B=5, data=spl$traindata)
AIC(lin);AIC(nlin1);AIC(nlin2)
# the linear model has the lowest training AIC, so base the final fit on it
# (the bootstrap is not needed for the linear fit)
qgcomp.glm.noboot(f=y~., q=4, expnms=Xnm, data=spl$validdata)

qgcomp documentation built on Aug. 10, 2023, 5:07 p.m.