ml_diag: Machine Learning Diagnostics for Generalized Linear Models
In davidaarmstrong/mldiag: Machine Learning Model Diagnostics


View source: R/ml_diag.R

A decoupling shrinkage and selection (DSS) approach to model diagnostics.

ml_diag(
  mod,
  data,
  shrinkEngine = c("xgboost", "randomForest", "bartMachine"),
  shrinkEngine.args = list(xgboost = list(params = list(max_depth = 4, eta = 0.1)),
    randomForest = list(), bartMachine = list()),
  sampleProp = 0.5,
  retainMarginal = NULL,
  ...
)

`mod`	An object of class `lm` or a `glm` with `family=binomial`.
`data`	A data frame containing the data used to estimate `mod`.
`shrinkEngine`	The methods used in the shrinkage phase of the model.
`shrinkEngine.args`	Arguments to be passed down to the shrinkage engine.
`sampleProp`	Proportion of data (randomly sampled) to use in the analysis. The training and testing samples are returned by the function. Defaults to using 50% of the data.
`retainMarginal`	A vector of names of factors in the dataset where you want the marginal distribution to be respected in the training and testing samples. The random sampling is done within each combination of these values, so unless you have a lot of data, there should be relatively few of these.
`...`	Arguments to be passed down to the shrinkage engine.
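
The sampling described for `sampleProp` and `retainMarginal` can be illustrated with a minimal base-R sketch. This is not the package's internal code: the data frame `dat`, the factor names `g1` and `g2`, and the choice to treat `sampleProp` as the training share are all assumptions made for illustration. It only shows what drawing a fixed proportion within each combination of factor levels looks like.

```r
set.seed(42)

# Hypothetical data: two factors whose joint distribution should be preserved
dat <- data.frame(
  g1 = factor(rep(c("a", "b"), each = 50)),
  g2 = factor(rep(c("u", "v"), times = 50)),
  x  = rnorm(100)
)

sampleProp <- 0.5  # assumed here to be the share of rows drawn into training

# One stratum per observed combination of the retained factors
strata <- interaction(dat$g1, dat$g2, drop = TRUE)

# Draw the same proportion of row indices within every stratum
train_idx <- unlist(
  lapply(split(seq_len(nrow(dat)), strata), function(i) {
    i[sample.int(length(i), size = round(sampleProp * length(i)))]
  }),
  use.names = FALSE
)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Cell counts in the training sample mirror the full-data cells
table(train$g1, train$g2)
```

Because every combination of levels becomes its own stratum, each added factor multiplies the number of cells, which is why only a few factors should be listed unless the dataset is large.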

Model diagnostics are often based on model residuals. The `ml_diag` function uses a DSS approach to model diagnostics. Here, we use non-parametric machine learning tools (such as xgboost, randomForest, or bartMachine) to generate the best possible predictions from the included model covariates. These predictions serve as an adjusted dependent variable that we predict with the parametric model originally fit to the data. If the fit of this auxiliary model is good, then the original parametric model is well specified. If, however, the fit is poor, then there are important interactions and/or non-linearities that are not accounted for in the original parametric model. We then either jackknife out each variable or sequentially exclude variables in order of best model fit improvement to see which variables cause the problems.
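
The logic of the auxiliary regression can be sketched in base R. This is an illustration under stated assumptions, not the function's implementation: `loess` stands in for the shrinkage engine (the function itself uses xgboost, randomForest, or bartMachine), the simulated data are hypothetical, and no train/test split is performed.

```r
set.seed(1)

# Hypothetical data: the true relationship is quadratic in x1
n  <- 400
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)
y  <- x1^2 + x2 + rnorm(n, sd = 0.3)
dat <- data.frame(y, x1, x2)

# Step 1: flexible, non-parametric fit on the model covariates
# (loess is only a stand-in for the shrinkage engine)
flex <- loess(y ~ x1 + x2, data = dat, span = 0.5)
dat$yhat_flex <- predict(flex)

# Step 2: regress the flexible predictions on the parametric specification
aux <- lm(yhat_flex ~ x1 + x2, data = dat)
param_fit <- summary(aux)$r.squared

# For comparison: a specification that allows curvature in x1
aux2 <- lm(yhat_flex ~ poly(x1, 2) + x2, data = dat)
better_fit <- summary(aux2)$r.squared

c(linear = param_fit, quadratic = better_fit)
```

In this setup the linear specification reproduces the flexible predictions poorly, while the quadratic one reproduces them closely, which is the kind of gap the diagnostic is meant to expose.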

A list with the following elements:

`paramFit`	The r-squared for the shrinkage estimate regressed on the parametric model specification
`termFits1`	The r-squared for the shrinkage estimates regressed on the parametric model specification with each model term jackknifed out in turn.
`termFits2`	The r-squared for the shrinkage estimates regressed on the parametric model specification with the model terms removed sequentially (and cumulatively) based on lack of fit from the `termFits1` return.
`train.sample`	Observations in the training sample, after listwise deletion of missing values on the model variables.
`test.sample`	Observations in the testing sample, after listwise deletion of missing values on the model variables.
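
The mechanics behind a `termFits1`-style result, refitting the auxiliary model with each term left out in turn and recording the r-squared, might look like the following base-R sketch. The data frame `d`, the column `target` (standing in for the shrinkage predictions), and the object names are hypothetical; how the function computes and orders these fits internally is not shown in this page.

```r
set.seed(2)

# Hypothetical data; `target` plays the role of the shrinkage predictions
n <- 300
d <- data.frame(x1 = runif(n, -2, 2), x2 = rnorm(n), x3 = rnorm(n))
d$target <- d$x1^2 + d$x2 + 0.5 * d$x3

# Full parametric specification
full <- lm(target ~ x1 + x2 + x3, data = d)
full_fit <- summary(full)$r.squared
terms_in <- attr(terms(full), "term.labels")

# Refit with each term dropped in turn and keep the r-squared
term_fits <- sapply(terms_in, function(tm) {
  f <- reformulate(setdiff(terms_in, tm), response = "target")
  summary(lm(f, data = d))$r.squared
})

round(c(full = full_fit, term_fits), 3)
```

Each entry of `term_fits` is named after the term that was left out, so it can be compared directly with the fit of the full specification.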