getLOCO: Compute variable importance using LOCO values

View source: R/SEMml.R

getLOCOR Documentation

Compute variable importance using LOCO values

Description

This function computes the contributions of each variable to individual predictions using LOCO (Leave Out COvariates) values.

Usage

getLOCO(
  object,
  newdata,
  newoutcome = NULL,
  thr = NULL,
  ncores = 2,
  verbose = FALSE,
  ...
)

Arguments

object

A model fitting object from SEMml(), or SEMrun() functions.

newdata

A matrix containing new data with rows corresponding to subjects, and columns to variables.

newoutcome

A new character vector (as.factor) of labels for a categorical output (target)(default = NULL).

thr

A numeric value [0-1] indicating the threshold to apply to the LOCO values to color the graph. If thr = NULL (default), the threshold is set to thr = 0.5*max(abs(LOCO values)).

ncores

number of cpu cores (default = 2)

verbose

A logical value. If FALSE (default), the processed graph will not be plotted to screen.

...

Currently ignored.

Details

LOCO (Verdinelli & Wasserman, 2024) is a model-agnostic method for assessing the importance of individual features (covariates) in a ML predictive model. The procedure is simple: (i) train a model on the full dataset (with all covariates) and (ii) for each covariate of interest: (a) remove (leave out) that covariate from the dataset; (b) retrain the model on the remaining features; (c) compare predictions between the full model and the reduced mode, and (d) evaluate the difference in performance (e.g., using MSE, etc.). LOCO is computationally expensive (requires retraining for each feature). The getLOCO() function uses a lowest computation cost procedure (see Delicando & Pena, 2023). The individual relevance of each variable is measured by comparing the predictions of the model in the test set with those obtained when the variable of interest is leave-out and substituted by its ghost variable in the test set. This ghost variable is defined as the linear prediction of the covariate by using the rest of the variables in the ML model. This method yields similar LOCO results but requires much less computing time.

Value

A list od three object: (i) est: a data.frame including the connections together with their LOCO values; (iii) gest: if the outcome vector is given, a data.frame of LOCO values per outcome levels; and (iii) dag: DAG with colored edges/nodes. If LOCO > thr, the edge is highlighted in red. If the outcome vector is given, nodes with absolute connection weights summed over the outcome levels, i.e. sum(LOCO[outcome levels]) > thr, will be highlighted in pink.

Author(s)

Mario Grassi mario.grassi@unipv.it

References

Verdinelli, I; Wasserman, L. Feature Importance: A Closer Look at Shapley Values and LOCO. Statist. Sci. 39 (4) 623 - 636, November 2024. https://doi.org/10.1214/24-STS937

Delicado, P.; Peña, D. Understanding complex predictive models with ghost variables. TEST 32, 107–145 (2023). https://doi.org/10.1007/s11749-022-00826-x

Examples



# load ALS data
ig<- alsData$graph
data<- alsData$exprs
data<- transformData(data)$data

#...with train-test (0.5-0.5) samples
set.seed(123)
train<- sample(1:nrow(data), 0.5*nrow(data))

rf0<- SEMml(ig, data[train, ], algo="rf")

res<- getLOCO(rf0, data[-train, ], thr=0.2, verbose=TRUE)
table(E(res$dag)$color)



SEMdeep documentation built on Nov. 10, 2025, 5:06 p.m.