allanVarSelect: Memory Unlimited Forward Stepwise Variable Selection for...
In allan: Automated Large Linear Analysis Node

View source: R/allanVarSelect.R

The function performs forward stepwise variable selection for linear models on datasets of any size, even those that do not fit into R memory. AIC, BIC, and MSE are the available selection criteria. Selection starts from a null model; at each step, the variable that minimizes the chosen criterion is added, until the specified number of variables has been entered into the model.
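The selection loop described above can be sketched in base R on a small in-memory dataset. This is only an illustration of forward stepwise selection scored by validation MSE, not the allan implementation: the simulated data and the names `forward_select` and `val_mse` are assumptions made for the example, and `lm` stands in for the chunked biglm fitting that the package uses.

```r
# Illustrative forward selection on simulated in-memory data (not allan's code).
set.seed(1)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 * dat$x1 - 3 * dat$x3 + rnorm(n)
trn <- dat[1:300, ]    # training rows
val <- dat[301:400, ]  # validation rows

# Mean squared error of a fitted model on held-out data
val_mse <- function(fit, newdata, response) {
  mean((newdata[[response]] - predict(fit, newdata = newdata))^2)
}

# Hypothetical helper: add one variable per step, choosing the candidate
# that gives the lowest validation MSE when added to the current model.
forward_select <- function(trn, val, response, candidates, num_steps) {
  selected <- character(0)
  mse <- numeric(0)
  for (step in seq_len(num_steps)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    scores <- vapply(remaining, function(v) {
      fit <- lm(reformulate(c(selected, v), response = response), data = trn)
      val_mse(fit, val, response)
    }, numeric(1))
    selected <- c(selected, names(scores)[which.min(scores)])
    mse <- c(mse, min(scores))
  }
  history <- data.frame(step = seq_along(selected), added = selected, mse = mse)
  list(selected = selected, history = history)
}

result <- forward_select(trn, val, "y", c("x1", "x2", "x3", "x4"), num_steps = 2)
print(result$history)
```

Here `num_steps` plays the role of NumOfSteps: the loop stops once that many variables have been added, regardless of whether the criterion is still improving.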

allanVarSelect(BaseModel, TrnDataSetFile, ValDataSetFile, ResponseCol = 1, NumOfSteps = 10, criteria = "AIC", currentchunksize = -1, silent = TRUE, MemoryAllowed = 0.5, TestedRows = 1000, AdjFactor = 0.095)

`BaseModel`	A biglm object whose formula specifies the full model, containing every variable being considered for selection, e.g. y ~ x1 + x2 + x3 + .... If the dataset cannot fit into R memory, create this biglm object by fitting the model on a small subset of the dataset. Note: Specify offsets with the offset option rather than including them in the model formula; otherwise an error may result.
`TrnDataSetFile`	The training dataset on which the BaseModel will be trained. There is no size limit.
`ValDataSetFile`	The validation dataset on which the BaseModel will be validated. AIC, BIC, and MSE are calculated from this dataset to select variables. There is no size limit.
`ResponseCol`	The column in the dataset that holds the response (y) variable. The training dataset, the validation dataset, and the smaller data chunk on which the passed biglm object was initially fit must all share the same format, i.e. the same variables and columns.
`NumOfSteps`	Number of variables to enter into the final fitted model.
`criteria`	Criterion for variable selection. One of "AIC", "BIC", or "MSE".
`currentchunksize`	See documentation for getbestchunksize.
`silent`	Boolean. Suppresses unnecessary output to screen if silent=TRUE.
`MemoryAllowed`	See function getbestchunksize for argument description.
`TestedRows`	See function getbestchunksize for argument description.
`AdjFactor`	See function getbestchunksize for argument description.
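For orientation, the three criteria can be computed from validation residuals as in the sketch below. It uses the common Gaussian forms of AIC and BIC (up to an additive constant); the exact formulas allan uses are not stated on this page, so treat the function `selection_criterion` and its arguments as assumptions for illustration only.

```r
# Hypothetical helper: score a model from its validation residuals.
# AIC and BIC use the usual Gaussian log-likelihood form, up to a constant;
# this may differ from what allanVarSelect computes internally.
selection_criterion <- function(residuals, num_params,
                                criteria = c("AIC", "BIC", "MSE")) {
  criteria <- match.arg(criteria)
  n <- length(residuals)
  rss <- sum(residuals^2)  # residual sum of squares
  switch(criteria,
    MSE = rss / n,
    AIC = n * log(rss / n) + 2 * num_params,
    BIC = n * log(rss / n) + log(n) * num_params
  )
}

# Example: four validation residuals from a model with two parameters
res <- c(1, -1, 2, -2)
mse_value <- selection_criterion(res, num_params = 2, criteria = "MSE")
aic_value <- selection_criterion(res, num_params = 2, criteria = "AIC")
bic_value <- selection_criterion(res, num_params = 2, criteria = "BIC")
```

MSE ignores model size, whereas AIC and BIC add a penalty per parameter, so they can rank candidate models differently when the candidates differ in the number of coefficients.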

Returns the final fitted biglm object containing the specified number of variables. The selection statistics are saved in the object under $SelectionSummary.

Offsets should be specified with the offset option and not placed in the model formula to avoid errors.

Alan Lee alanlee@stanfordalumni.org

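#Load the packages used below (biglm provides the biglm() function).
library(allan)
library(biglm)

#Get external data.  For your own data skip this next line and replace all
#instances of SampleData with "YourFile.csv".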
SampleData=system.file("extdata","SampleDataFile.csv", package = "allan")

#fit smaller data to biglm object
columnnames<-names(read.csv(SampleData, nrows=2,header=TRUE))
datafeed<-readinbigdata(SampleData,chunksize=1000,col.names=columnnames)
datafeed(TRUE)
firstchunk<-datafeed(FALSE)

#create a biglm model from the small chunk with all variables that will be considered
#for variable selection.
bigmodel <- biglm(PurePremium ~ cont1 + cont2 + cont3 + cont4 + cont5,data=firstchunk,weights=~cont0)

#now run variable selection
FinalModel<-allanVarSelect(bigmodel,SampleData,SampleData,NumOfSteps=2,criteria="MSE",silent=FALSE)
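The `datafeed(TRUE)` and `datafeed(FALSE)` calls in the example follow a reset-then-read pattern: the first rewinds the data source, and the second returns the next chunk of rows. The base R sketch below shows one way such a chunk reader could be written. It is not the implementation of `readinbigdata`; the name `make_chunk_reader`, the temporary file, and the choice to return NULL once the rows run out are all assumptions made for illustration.

```r
# Hypothetical helper, not part of allan: returns a function that serves
# a CSV file chunk by chunk, so the whole file never has to be in memory.
make_chunk_reader <- function(path, chunksize, col.names) {
  con <- NULL
  function(reset = FALSE) {
    if (reset) {
      # Rewind: reopen the file and skip the header line
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)
      return(invisible(NULL))
    }
    if (is.null(con)) stop("call the reader with reset = TRUE first")
    # Read the next block of rows from the current position
    chunk <- tryCatch(
      read.csv(con, nrows = chunksize, header = FALSE, col.names = col.names),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) {
      # No rows left: close the connection and signal the end with NULL
      close(con)
      con <<- NULL
      return(NULL)
    }
    chunk
  }
}

# Small demonstration file with five data rows
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(y = 1:5, x = (1:5)^2), tmp, row.names = FALSE)

feed <- make_chunk_reader(tmp, chunksize = 2, col.names = c("y", "x"))
feed(TRUE)  # rewind to the start of the data
chunk_rows <- integer(0)
while (!is.null(chunk <- feed(FALSE))) {
  chunk_rows <- c(chunk_rows, nrow(chunk))  # record the size of each chunk
}
print(chunk_rows)
```

Keeping the open connection inside a closure lets each call continue where the previous one stopped, which is what allows a model to be updated one chunk at a time.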