ml_diag: Machine Learning Diagnostics for Generalized Linear Models


View source: R/ml_diag.R

Description

A decoupling shrinkage and selection (DSS) approach to model diagnostics.

Usage

ml_diag(
  mod,
  data,
  shrinkEngine = c("xgboost", "randomForest", "bartMachine"),
  shrinkEngine.args = list(xgboost = list(params = list(max_depth = 4, eta = 0.1)),
    randomForest = list(), bartMachine = list()),
  sampleProp = 0.5,
  retainMarginal = NULL,
  ...
)

Arguments

mod

An object of class lm or a glm with family=binomial.

data

A data frame containing the data used to estimate mod.

shrinkEngine

The methods used in the shrinkage phase of the model.

shrinkEngine.args

Arguments to be passed down to the shrinkage engine.

sampleProp

Proportion of the data (randomly sampled) to use in the analysis. The training and testing samples are returned by the function. Defaults to using 50% of the data.

retainMarginal

A vector of names of factors in the dataset whose marginal distributions should be respected in the training and testing samples. The random sampling is done within each combination of the levels of these factors, so unless you have a lot of data, there should be relatively few of them.

...

Arguments to be passed down to the shrinkage engine.
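
As an illustration of what sampleProp and retainMarginal describe, the following is a minimal base R sketch of drawing a 50% sample within each combination of two factors. It is not the package's internal code; the data frame dat, the factors g and h, and the object names strata, idx, train and test are all hypothetical.

```r
# Hypothetical data: two factors whose marginal distributions we want to keep.
set.seed(42)
dat <- data.frame(
  g = factor(rep(c("a", "b"), times = c(60, 40))),
  h = factor(rep(c("u", "v"), 50)),
  x = rnorm(100)
)

sample_prop <- 0.5  # plays the role of sampleProp

# One stratum per observed combination of the factor levels.
strata <- interaction(dat[c("g", "h")], drop = TRUE)

# Sample the requested proportion of row indices within each stratum.
idx <- unlist(
  lapply(split(seq_len(nrow(dat)), strata), function(i) {
    i[sample.int(length(i), size = round(sample_prop * length(i)))]
  }),
  use.names = FALSE
)

train <- dat[idx, ]
test  <- dat[-idx, ]

# The factor proportions in the two halves match those in the full data.
table(train$g)
table(test$g)
```

Because sampling happens separately inside every cell, each additional factor multiplies the number of cells, which is why only a few factors should be listed when the data set is small.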

Details

Model diagnostics are often based on model residuals. The ml_diag function instead uses a DSS approach to model diagnostics. Here, we use non-parametric machine learning tools (like xgboost, randomForest or bartMachine) to generate the best possible predictions from the included model covariates. These predictions serve as an adjusted dependent variable that we then predict with the parametric model originally fit to the data. If the fit of this auxiliary model is good, then the original parametric model is well specified. If, however, the fit is poor, then there are important interactions and/or non-linearities that are not accounted for in the original parametric model. We then either jackknife out each variable or sequentially exclude each variable in turn, based on the best improvement in model fit, to see which variables are the cause of the problems.
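
The logic described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code in R/ml_diag.R: stats::loess stands in for the shrinkage engine (the function itself uses xgboost, randomForest or bartMachine), the train/test split is omitted, and the simulated data and all object names (dat, flex, aux, param_fit, term_fits, good_fit) are hypothetical.

```r
# Simulated data in which y depends on x1 quadratically.
set.seed(1)
n <- 400
dat <- data.frame(x1 = runif(n, -2, 2), x2 = runif(n, -2, 2))
dat$y <- 1 + dat$x1^2 + 0.5 * dat$x2 + rnorm(n, sd = 0.3)

# Original parametric model: misspecified, because it is linear in x1.
mod <- lm(y ~ x1 + x2, data = dat)

# Shrinkage phase (stand-in): a flexible fit on the same covariates.
flex <- loess(y ~ x1 + x2, data = dat)
dat$yhat_flex <- fitted(flex)

# Selection phase: regress the flexible predictions on the parametric
# specification and record the r-squared of this auxiliary model.
aux <- lm(update(formula(mod), yhat_flex ~ .), data = dat)
param_fit <- summary(aux)$r.squared

# Drop each term in turn and recompute the auxiliary r-squared.
term_labels <- attr(terms(mod), "term.labels")
term_fits <- sapply(term_labels, function(tm) {
  f <- update(formula(aux), paste(". ~ . -", tm))
  summary(lm(f, data = dat))$r.squared
})

# For comparison: a specification that includes the quadratic term.
good_fit <- summary(lm(yhat_flex ~ I(x1^2) + x2, data = dat))$r.squared

param_fit  # low: the linear specification misses the curvature in x1
term_fits
good_fit   # high: the quadratic specification tracks the flexible fit
```

In this toy setting the low auxiliary r-squared for the linear specification, set against the high value once the quadratic term is added, is the kind of signal the diagnostic is meant to surface.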

Value

A list with the following elements:

paramFit

The r-squared for the shrinkage estimates regressed on the parametric model specification.

termFits1

The r-squared for the shrinkage estimates regressed on the parametric model specification with each model term jackknifed out in turn.

termFits2

The r-squared for the shrinkage estimates regressed on the parametric model specification with the model terms removed sequentially (and cumulatively) based on lack of fit from the termFits1 return.

train.sample

Observations in the training sample, after the data (restricted to the model variables) have been listwise deleted.

test.sample

Observations in the testing sample, after the data (restricted to the model variables) have been listwise deleted.


davidaarmstrong/mldiag documentation built on April 17, 2020, 12:04 a.m.