selectX: A main function of the package for variable selection based...


Description

This function is a wrapper around the functions bestlinearX(), bestlogitX() and bestprobitX(), with an additional option to call getSamples() first for improved speed. Sampling itself takes time, so the total computational burden is a trade-off between the cost of getSamples() and the cost of the model optimization itself.

Usage

selectX(Y, X, model = "lm", returntype = "data", method = "opt.ic",
  KLIC = "AICc", crit.t = 1.64, crit.p = 0.05, test = "LR",
  share = 0.75, confidence.alternative = 0.85, max.iter = 50,
  tracelevel = 1, memorymanagement = TRUE)

Arguments

Y

A binary response variable.

X

A dataframe of multiple exogenous regressors.

model

Either "lm" for the linear probability model, "logit" for the logistic probability model, or "probit" for the probit model. The logit and probit models are solved using iteratively reweighted least squares, and optimization of the logit model is significantly faster than that of the probit model. Defaults to "lm".
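As a rough illustration of the three model types, the sketch below fits the same binary response with base R's lm() and glm(). It uses the built-in mtcars data and is not AutoGLM's internal code; the variable choice is an assumption made only for the example.

```r
# Binary response: 1 if the car has a manual transmission (built-in mtcars data).
dat <- data.frame(y = mtcars$am, mtcars[, c("wt", "hp")])

# Linear probability model, comparable in spirit to model = "lm".
fit_lm <- lm(y ~ wt + hp, data = dat)

# Logit and probit models; glm() fits both by iteratively reweighted least squares.
fit_logit  <- glm(y ~ wt + hp, data = dat, family = binomial(link = "logit"))
fit_probit <- glm(y ~ wt + hp, data = dat, family = binomial(link = "probit"))

# Compare the fitted probabilities of the three specifications.
head(cbind(lm = fitted(fit_lm),
           logit = fitted(fit_logit),
           probit = fitted(fit_probit)))
```

Note that the linear probability model can produce fitted values outside the unit interval, while the logit and probit fits cannot.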

returntype

Either "data" to return a dataset, or "colnames" to only return the column names of the variables that are used in the optimal model. Defaults to "data".

method

The optimization strategy. Either "opt.ic" to optimize using information criteria, "opt.t" for step-wise elimination of insignificant variables (statistically speaking not a sound procedure, but it will provide a parsimonious model that can be useful as a benchmark), or "opt.h" to optimize by classical hypothesis tests. Defaults to "opt.ic".
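The idea behind step-wise elimination by t-value can be sketched in base R as follows. backward_t() is a hypothetical helper written for this illustration, not a function of AutoGLM, and the package's own procedure may differ in its details. It assumes numeric, non-collinear regressors with syntactic column names.

```r
# Hypothetical helper, not part of AutoGLM: repeatedly drop the regressor with
# the smallest absolute t-value until every remaining one reaches crit_t.
backward_t <- function(y, X, crit_t = 1.64) {
  keep <- colnames(X)
  while (length(keep) > 0) {
    fit   <- lm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]))
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]  # skip the intercept
    tvals <- abs(coefs[, "t value"])
    if (min(tvals) >= crit_t) break  # all remaining regressors are significant
    keep <- setdiff(keep, rownames(coefs)[which.min(tvals)])
  }
  keep
}

# Usage on the built-in mtcars data: manual transmission as the binary response.
selected <- backward_t(mtcars$am, mtcars[, c("mpg", "wt", "hp", "qsec")])
selected
```

The result is a character vector of retained column names, which may be empty if no regressor reaches the threshold.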

KLIC

The information criterion used by "opt.ic", either "AIC" or "AICc". Defaults to "AICc".
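For reference, the usual small-sample correction is AICc = AIC + 2k(k + 1) / (n - k - 1), with k estimated parameters and n observations. The sketch below computes it in base R; aicc() is a hypothetical helper, and whether AutoGLM uses exactly this formula is an assumption, as the documentation does not state it.

```r
# Hypothetical helper: standard small-sample corrected AIC.
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # parameter count as used by AIC()
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

# Logit model on the built-in mtcars data (32 observations, 3 parameters).
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
c(AIC = AIC(fit), AICc = aicc(fit))
```

The correction term shrinks as n grows relative to k, so the two criteria agree in large samples.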

crit.t

The t-value indicating significance when using method "opt.t", defaults to 1.64.
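To relate the default threshold to a significance level, the following base R lines evaluate it under a normal approximation, which is an assumption that is reasonable in large samples.

```r
crit_t <- 1.64

# Two-sided tail probability implied by the threshold (normal approximation).
p_two_sided <- 2 * pnorm(-crit_t)
p_two_sided

# Exact normal quantile for a two-sided 10% level, for comparison.
qnorm(0.95)
```

Under this approximation the default corresponds to roughly a 10% two-sided significance level.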

crit.p

The p-value used by method "opt.h" in the hypothesis tests. Defaults to 0.05.

test

The hypothesis test used by "opt.h". Defaults to "LR" for the likelihood ratio test. Other options are "F", for an F test of the joint significance of the insignificant parameters, or "Chisq", for a Wald test against the chi-squared distribution. The recommended setting is "LR", as it is less dependent on correct estimation of the standard errors. Keep in mind that "Chisq" is an asymptotic test, and "F" is more appropriate for small samples. However, "Chisq" holds under milder conditions and should be used if no small-sample theory is available for the model.
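The difference between the likelihood ratio test and the Wald test can be sketched in base R on a pair of nested logit models. This is an illustration on the built-in mtcars data with an arbitrarily chosen restriction, not the code that "opt.h" runs.

```r
# Nested logit models on the built-in mtcars data.
full       <- glm(am ~ wt + hp, data = mtcars, family = binomial)
restricted <- glm(am ~ wt,      data = mtcars, family = binomial)

# Likelihood ratio test: twice the log-likelihood difference, chi-squared with
# as many degrees of freedom as there are restrictions.
lr_stat <- as.numeric(2 * (logLik(full) - logLik(restricted)))
lr_df   <- attr(logLik(full), "df") - attr(logLik(restricted), "df")
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Wald test of the same restriction (hp = 0) against the chi-squared
# distribution; it relies on the estimated covariance matrix.
wald_stat <- as.numeric(coef(full)["hp"]^2 / vcov(full)["hp", "hp"])
wald_p    <- pchisq(wald_stat, df = 1, lower.tail = FALSE)

c(LR = lr_p, Wald = wald_p)
```

The likelihood ratio statistic only needs the two maximized log-likelihoods, whereas the Wald statistic depends directly on the estimated standard error.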

share

A value between 0 and 1 specifying the share of the data that is passed on to the optimization strategies. Defaults to 0.75, to improve speed. Uses getSamples() to maintain the first and second moments of the data.
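What maintaining the first and second moments means can be checked by hand. The sketch below draws a plain random subsample with base R's sample(), which stands in for getSamples() only for the purpose of illustration, and compares means and standard deviations with the full data.

```r
set.seed(1)  # for reproducibility of the illustration
share <- 0.75
vars  <- c("mpg", "wt", "hp")

# Plain random sampling of row indices (an assumption; not getSamples()).
idx <- sample(nrow(mtcars), size = floor(share * nrow(mtcars)))
sub <- mtcars[idx, vars]

# Compare first and second moments of the full data and the subsample.
rbind(mean_full = colMeans(mtcars[, vars]), mean_sub = colMeans(sub))
rbind(sd_full = sapply(mtcars[, vars], sd), sd_sub = sapply(sub, sd))
```

A smaller share speeds up the optimization but makes it harder for the subsample to reproduce the moments of the full data.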

confidence.alternative

Passed on to getSamples(). Defaults to 0.85.

max.iter

Passed on to getSamples(). Defaults to 50.

tracelevel

The amount of information to be printed. Passed on to the underlying routines. Defaults to 1 for printing; set to 0 for no printing.

memorymanagement

TRUE/FALSE indicating whether garbage collection should be forced regularly when memory usage is high. Defaults to TRUE, the recommended setting for large datasets.

Value

Either a dataframe of exogenous variables, or a vector containing the column names of the optimal variables extracted from the supplied dataset.

Examples

# Load data.
data(ITdata)
data(corinetable)
# Grab a sample (optional).
sample <- ITdata[getSamples(ITdata, share = 0.05), ]
# Reclassify.
catITdata <- reclassify(sample, reclasstable = corinetable)
# Create a binary response dataset.
Y <- MLtoBinomData(catITdata[, 1], class = 1)
X <- catITdata[, -1]
# Return only the selected column names, using step-wise t-value elimination.
selectX(Y, X, model = "lm", returntype = "colnames", method = "opt.t")
# Return the selected data using the default settings.
bestX <- selectX(Y, X)
describe(bestX)

BPJandree/AutoGLM documentation built on May 5, 2019, 10:25 a.m.