| getLOCO | R Documentation |
This function computes the contributions of each variable to individual predictions using LOCO (Leave Out COvariates) values.
getLOCO(
object,
newdata,
newoutcome = NULL,
thr = NULL,
ncores = 2,
verbose = FALSE,
...
)
object |
A model fitting object from |
newdata |
A matrix containing new data with rows corresponding to subjects, and columns to variables. |
newoutcome |
A new character vector (as.factor) of labels for a categorical output (target) (default = NULL). |
thr |
A numeric value [0-1] indicating the threshold to apply to the LOCO values to color the graph. If thr = NULL (default), the threshold is set to thr = 0.5*max(abs(LOCO values)). |
ncores |
Number of CPU cores (default = 2). |
verbose |
A logical value. If FALSE (default), the processed graph will not be plotted to screen. |
... |
Currently ignored. |
LOCO (Verdinelli & Wasserman, 2024) is a model-agnostic method for assessing the importance of individual features (covariates) in a ML predictive model. The procedure is simple: (i) train a model on the full dataset (with all covariates) and (ii) for each covariate of interest: (a) remove (leave out) that covariate from the dataset; (b) retrain the model on the remaining features; (c) compare predictions between the full model and the reduced model; and (d) evaluate the difference in performance (e.g., using MSE). LOCO is computationally expensive because it requires retraining the model once for each feature.
The getLOCO() function uses a lower-cost procedure (see Delicado & Peña, 2023). The individual relevance of each variable is measured by comparing the predictions of the model in the test set with those obtained when the variable of interest is left out and replaced by its ghost variable in the test set. The ghost variable is defined as the linear prediction of the covariate from the rest of the variables in the ML model. This method yields results similar to LOCO but requires much less computing time, since the model is trained only once.
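The ghost-variable idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used inside getLOCO(): the data are simulated, a linear model stands in for the ML model, the helper name ghost_relevance() is hypothetical, and relevance is reported as an unnormalised mean squared difference between predictions.

```r
# Illustrative sketch only (base R); not the SEMdeep implementation.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # correlated with x1
x3 <- rnorm(n)                       # unrelated to the outcome
y  <- 2 * x1 + x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
covars <- c("x1", "x2", "x3")

# Train-test split (0.5-0.5); the model is fitted once, on the training half.
train <- sample(seq_len(n), n / 2)
test  <- dat[-train, ]
fit   <- lm(y ~ x1 + x2 + x3, data = dat[train, ])  # stand-in for the ML model
pred_full <- predict(fit, newdata = test)

# Hypothetical helper: relevance of one covariate via its ghost variable.
ghost_relevance <- function(var, fit, test, covars, pred_full) {
  others <- setdiff(covars, var)
  # Ghost variable: linear prediction of `var` from the other covariates,
  # estimated in the test set.
  ghost_fit <- lm(reformulate(others, response = var), data = test)
  test_ghost <- test
  test_ghost[[var]] <- fitted(ghost_fit)
  # Predict with the already-fitted model; no retraining is needed.
  pred_ghost <- predict(fit, newdata = test_ghost)
  mean((pred_full - pred_ghost)^2)
}

relevance <- sapply(covars, ghost_relevance, fit = fit, test = test,
                    covars = covars, pred_full = pred_full)
print(round(relevance, 3))
```

Because the fitted model is reused for every covariate, the cost per variable is one auxiliary linear regression plus one prediction pass, rather than a full retraining as in plain LOCO.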
A list of three objects: (i) est: a data.frame including the connections together with their LOCO values; (ii) gest: if the outcome vector is given, a data.frame of LOCO values per outcome level; and (iii) dag: a DAG with colored edges/nodes. If LOCO > thr, the edge is highlighted in red. If the outcome vector is given, nodes with absolute connection weights summed over the outcome levels, i.e. sum(LOCO[outcome levels]) > thr, will be highlighted in pink.
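As a rough illustration of the thresholding rule, the snippet below applies the default threshold, thr = 0.5*max(abs(LOCO values)), to a made-up vector of LOCO values. The values and the variable names loco and highlight are hypothetical and do not reflect the actual column layout of est.

```r
# Hypothetical LOCO values for four connections (made-up numbers).
loco <- c(0.05, 0.30, 0.12, 0.40)

# Default threshold used when thr = NULL: half of the largest absolute value.
thr <- 0.5 * max(abs(loco))

# Connections whose LOCO value exceeds the threshold would be highlighted.
highlight <- loco > thr
print(data.frame(loco, highlight))
```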
Mario Grassi mario.grassi@unipv.it
Verdinelli, I.; Wasserman, L. Feature importance: A closer look at Shapley values and LOCO. Statist. Sci. 39(4), 623–636 (2024). https://doi.org/10.1214/24-STS937
Delicado, P.; Peña, D. Understanding complex predictive models with ghost variables. TEST 32, 107–145 (2023). https://doi.org/10.1007/s11749-022-00826-x
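# Packages assumed to provide the objects used below
library(SEMdeep)   # SEMml(), getLOCO()
library(SEMgraph)  # alsData, transformData()
library(igraph)    # E()

# Load ALS data
ig <- alsData$graph
data <- alsData$exprs
data <- transformData(data)$data

# ...with train-test (0.5-0.5) samples
set.seed(123)
train <- sample(1:nrow(data), 0.5*nrow(data))

# Fit a random forest model on the training half
rf0 <- SEMml(ig, data[train, ], algo="rf")

# LOCO values on the test half; tabulate the edge colors of the returned DAG
res <- getLOCO(rf0, data[-train, ], thr=0.2, verbose=TRUE)
table(E(res$dag)$color)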