In crazybilly/mumodels: Build Annual Fund Models for Millikin

mumodels is designed to quickly build models and generate predictions, primarily for the Annual Fund. This package is not suited for general use: field names specific to our data are hard-coded.

The package provides three functions designed to build on one another.

buildAFpredictors() - builds a data frame of data to use for training, predictions, or both.
buildAFmodels() - builds a list of models. By default, builds lm, glm and rf models.
predictAF() - builds a data frame of predictions from a list of models (as generated by buildAFmodels()).

To get started, build some training data, make the models, then run predictions from them:

# build training data
FY14trainingdata  <- buildAFpredictors(   2014
                                        , trainingsource = 'C:/lastyearsdata.csv'
                                       )

# build models
AFmodels  <- buildAFmodels(FY14trainingdata)


# make predictions

FY15predictions  <- predictAF(  AFmodels
                              , source = hallptbl
                              , currentyear = 2015
                             )

Building training data with `buildAFpredictors()`

buildAFpredictors() provides an easy way to build a data frame for training models, making predictions, or both.

FY14trainingdata  <- buildAFpredictors(  2014
                                       , yeartype = 'fiscal'
                                       , trainingsource = 'C:/users/crazybilly/hallpAtFY14start.csv'
                                    )

buildAFpredictors() takes 4 arguments:

trainingyear - the year that you want data for, in 4 digit format, e.g. 2012.
yeartype - specifies whether the year set in trainingyear is a fiscal year or a calendar year. Choosing fiscal sets the start date of the year to 7/1.
trainingsource - where the function should look for the data. The default, hallptbl, tries to use the local warehouse database. Other database table connections (created via dplyr) are ok, too, as are csv files, as in the example above. Note that variable names for the final data are hard-coded, albeit with some measure of error checking for changing column names. If trainingsource doesn't have the proper column names, the function will throw an error.
primaryonly - whether or not the result should only include primary constituents.
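
To illustrate how yeartype changes the start of the year, here is a minimal sketch in base R. The helper year_start() is hypothetical and is not part of mumodels. It also assumes the common convention that fiscal year N begins on July 1 of year N - 1, which this README does not confirm.

```r
# Hypothetical helper for illustration only; not part of mumodels.
year_start <- function(year, yeartype = c("fiscal", "calendar")) {
  yeartype <- match.arg(yeartype)
  if (yeartype == "fiscal") {
    # ASSUMPTION: fiscal year N starts on July 1 of the previous calendar year
    as.Date(sprintf("%d-07-01", as.integer(year) - 1L))
  } else {
    # calendar years start on January 1 of the same year
    as.Date(sprintf("%d-01-01", as.integer(year)))
  }
}

fy_start <- year_start(2014, "fiscal")
cy_start <- year_start(2014, "calendar")
```

If your fiscal years are labeled by the year in which they begin, drop the `- 1L` adjustment.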

The result is a data frame suitable for doing predictions. The first column will always be pidm, and the last 3 columns are outcomes in 3 different formats:

outcome_totalg - a numeric variable of total dollars donated. This is giving only: pledges and memo credits are not included.
outcome_donorfactor - a factor variable indicating whether or not the person is a donor.
outcome_logg - a numeric variable which is log(outcome_totalg + 1)
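
The relationship between the three outcome columns can be sketched on a toy data frame. This is an illustration in base R, not the package's own code: the pidm values, the dollar amounts, and the factor labels "nondonor" and "donor" are made up for the example.

```r
# Toy data: three constituents with made-up giving totals
gifts <- data.frame(
  pidm           = c(101, 102, 103),
  outcome_totalg = c(0, 25, 1000)
)

# Donor flag as a factor (the label names are an assumption)
gifts$outcome_donorfactor <- factor(
  gifts$outcome_totalg > 0,
  levels = c(FALSE, TRUE),
  labels = c("nondonor", "donor")
)

# Log-transformed giving; adding 1 keeps non-donors at log(1) = 0
gifts$outcome_logg <- log(gifts$outcome_totalg + 1)
```

Adding 1 before taking the log avoids log(0) for constituents who gave nothing.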

If primaryonly is left at its default value of TRUE, the result includes only living, primary constituents. Deceased constituents are always filtered out, as are STUDs and organizations.

Building models with `buildAFmodels()`

buildAFmodels() makes it easy to build 3 quick models to predict annual giving.

AFmodels  <- buildAFmodels( FY14trainingdata )

Two arguments are available:

trainingdata - a data frame of training data. The data frames built by buildAFpredictors() work well here, but any data frame will work, provided it has the following columns:
outcome_totalg
outcome_donorfactor - used as outcome in glm and rf models
outcome_logg - used as outcome in lm model

The function assumes these columns exist in trainingdata and uses the appropriate outcome for each model. It also removes the pidm column, if it exists.

models - a vector of models to build. By default, the function builds lm, glm and rf models. To build fewer models, remove the names from the vector.

# only build the lm model
lm_only  <- buildAFmodels( FY14trainingdata, models = c('lm') )

# do not build the rf model
lm_and_glm_models <- buildAFmodels( FY14trainingdata, models = c('lm', 'glm') )

buildAFmodels() returns a list with each model as an item in the list.
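
As a rough illustration of the behavior described in this section, the sketch below fits an lm on outcome_logg and a glm on outcome_donorfactor after dropping pidm, then collects them in a list. It is not the package's implementation. The toy predictors (age, prior_gifts), the simulated giving, the factor labels, the list names, and the choice to exclude all three outcome columns from the predictors are assumptions. The rf model is omitted so the example needs only base R.

```r
set.seed(42)
n <- 200

# Simulated training data with hypothetical predictor columns
toy <- data.frame(
  pidm        = seq_len(n),
  age         = round(runif(n, 22, 80)),
  prior_gifts = rpois(n, 2)
)
gave <- rbinom(n, 1, plogis(-2 + 0.6 * toy$prior_gifts))
toy$outcome_totalg      <- gave * round(rexp(n, rate = 1 / 100))
toy$outcome_donorfactor <- factor(toy$outcome_totalg > 0,
                                  levels = c(FALSE, TRUE),
                                  labels = c("nondonor", "donor"))
toy$outcome_logg        <- log(toy$outcome_totalg + 1)

# Drop pidm and the outcome columns from the predictor set
outcomes   <- c("outcome_totalg", "outcome_donorfactor", "outcome_logg")
predictors <- setdiff(names(toy), c("pidm", outcomes))

# One model per outcome, returned together as a named list
toy_models <- list(
  lm  = lm(reformulate(predictors, response = "outcome_logg"), data = toy),
  glm = glm(reformulate(predictors, response = "outcome_donorfactor"),
            data = toy, family = binomial)
)
```

Individual models can then be pulled out of the list by name, e.g. `summary(toy_models$lm)`.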

Making predictions with `predictAF()`

Use predictAF() as the last step in the process by passing it the list of models you built with buildAFmodels() and a new/different source.

FY15predictions  <- predictAF(AFmodels, hallptbl, currentyear = 2015)

The function takes several arguments:

models - this is a list of models, preferably one built by buildAFmodels()
source - a source object for current data. This eventually gets passed to buildAFpredictors(), so database table connections and csvs are acceptable here.
buildsource - a logical value indicating whether the prediction data should be built from source. The default value of TRUE assumes you want to build the prediction data. If buildsource is FALSE, the object in source is passed directly to train(), which is useful if you want to run predictAF() on the same data multiple times without having to preprocess the data each time. Simply build the prediction data with buildAFpredictors(), assign the result, then pass it to predictAF().

# build current data for predictions
FY15data  <- buildAFpredictors(2015)

# predict based on the already processed data
FY15predictions  <- predictAF( AFmodels, source = FY15data, buildsource = FALSE )

currentyear - a 4 digit year which gets passed to buildAFpredictors(), assuming buildsource is TRUE.
yeartype - a choice between 'fiscal' or 'calendar', again passed to buildAFpredictors(), assuming buildsource is TRUE.
pidms - a vector of pidms on which to do predictions. If this is left NA, everyone in source is predicted.
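
The pidms behavior can be pictured with a small base R sketch. The helper filter_pidms() and the toy data frame are hypothetical and only illustrate the "NA means everyone" convention; they do not show how predictAF() is written.

```r
# Hypothetical helper for illustration only; not part of mumodels.
filter_pidms <- function(df, pidms = NA) {
  # A single NA means "no filter": keep every row
  if (length(pidms) == 1 && is.na(pidms)) {
    return(df)
  }
  # Otherwise keep only the requested constituents
  df[df$pidm %in% pidms, , drop = FALSE]
}

# Toy prediction data with made-up pidms
scored <- data.frame(pidm = c(101, 102, 103, 104), age = c(30, 45, 52, 67))

everyone <- filter_pidms(scored)
subset   <- filter_pidms(scored, pidms = c(102, 104))
```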

The result of predictAF() is a data frame with one row per constituent, a column of pidms and further columns of predictions, one for each model. With the defaults, then, the result is a 4 column data frame with columns of: