rf: Random forest models with Moran's I test of the residuals

Description

Fits a random forest model with ranger and extends it with spatial diagnostics: residual autocorrelation (Moran's I) at multiple distance thresholds, performance metrics (RMSE and NRMSE, via root_mean_squared_error()), and, if requested with scaled.importance = TRUE, variable importance scores computed on scaled data (via scale()).

Usage

rf(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

data

Data frame with a response variable and a set of predictors. Default: NULL

dependent.variable.name

Character string with the name of the response variable. Must be a column name in data. For binary response variables (0/1), case weights are automatically computed using case_weights() to balance classes. Default: NULL
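As a rough illustration of class balancing for a 0/1 response, the base R sketch below gives each case a weight inversely proportional to the size of its class, so that both classes carry the same total weight. It is not the code behind case_weights(); the helper name balanced_weights and the normalization to a sum of 1 are assumptions made for this example.

```r
# Hypothetical helper: inverse-frequency weights for a binary (0/1) response.
# Illustration only, not the implementation of case_weights().
balanced_weights <- function(y) {
  counts <- table(y)                            # number of cases per class
  w <- 1 / as.numeric(counts[as.character(y)])  # weight = 1 / class size
  w / sum(w)                                    # normalize so weights sum to 1
}

# Made-up unbalanced response: four zeros, two ones
y <- c(0, 0, 0, 0, 1, 1)
w <- balanced_weights(y)

# Total weight per class is the same for both classes
tapply(w, y, sum)
```

Cases in the rarer class receive larger individual weights, which is the general idea behind balancing classes before fitting.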

predictor.variable.names

Character vector with predictor variable names. All names must be columns in data. Alternatively, accepts the output of auto_cor() or auto_vif() for automated variable selection. Default: NULL

distance.matrix

Square matrix with pairwise distances between observations in data. Must have the same number of rows as data. If NULL, spatial autocorrelation of residuals is not computed. Default: NULL
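A square matrix of this kind can be built from coordinates with stats::dist(). The sketch below uses made-up coordinates, then applies the textbook Moran's I formula with inverse-distance weights to a toy vector standing in for model residuals. The morans_i helper and the weighting scheme are assumptions for illustration and may differ from what the package computes internally.

```r
# Made-up coordinates: a 3 x 3 grid
xy <- expand.grid(x = 1:3, y = 1:3)

# Square matrix of pairwise Euclidean distances, one row per observation
distance.matrix <- as.matrix(stats::dist(xy))
dim(distance.matrix)

# Hypothetical helper: textbook Moran's I for values x and weights w
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Inverse-distance weights with a zero diagonal (an assumed scheme)
w <- 1 / distance.matrix
diag(w) <- 0

# Toy "residuals" with a smooth spatial trend yield a positive Moran's I
res <- xy$x + xy$y
morans_i(res, w)
```

A clearly positive value indicates that nearby observations have similar residuals, which is the pattern the residual diagnostics are meant to detect.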

distance.thresholds

Numeric vector of distance thresholds for spatial autocorrelation analysis. For each threshold, distances below that value are set to zero when computing Moran's I. If NULL, defaults to seq(0, max(distance.matrix), length.out = 4). Default: NULL
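The default thresholds and the thresholding step described above can be sketched in base R as follows. The coordinates are made up, and apply_threshold is a hypothetical helper that only mirrors the description (distances below the threshold become zero); it is not a function of the package.

```r
# Toy distance matrix from made-up coordinates on a line
xy <- data.frame(x = c(0, 10, 40, 100), y = 0)
distance.matrix <- as.matrix(stats::dist(xy))

# Default stated above: four values from 0 to the maximum distance
distance.thresholds <- seq(0, max(distance.matrix), length.out = 4)
distance.thresholds

# Hypothetical helper: set distances below the threshold to zero
apply_threshold <- function(d, threshold) {
  d[d < threshold] <- 0
  d
}

# One thresholded copy of the distance matrix per threshold
thresholded <- lapply(distance.thresholds, apply_threshold, d = distance.matrix)

# Number of non-zero entries remaining at each threshold
sapply(thresholded, function(d) sum(d > 0))
```

As the threshold grows, fewer pairwise distances remain non-zero, so each threshold probes autocorrelation at a coarser neighborhood scale.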

xy

Data frame or matrix with two columns containing coordinates, named "x" and "y". Not used by this function but stored in the model for use by rf_evaluate() and rf_tuning(). Default: NULL

ranger.arguments

Named list with ranger arguments. Arguments for this function can also be passed here. The default importance method is 'permutation' instead of ranger's default 'none'. The x, y, and formula arguments are not supported. See ranger help for available arguments. Default: NULL

scaled.importance

Logical. If TRUE, variable importance is computed on data scaled with scale(), which makes importance scores comparable across models whose predictors are measured in different units. Default: FALSE
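To show what scaling does to predictors measured in different units, the sketch below applies base R's scale() to a made-up data frame. It only illustrates scale() itself; exactly which columns rf() scales when scaled.importance = TRUE is not shown here.

```r
# Made-up predictors measured in very different units
df <- data.frame(
  elevation_m = c(120, 950, 2300, 40, 1500),
  temperature_c = c(18.2, 11.5, 3.1, 20.4, 7.8)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
df.scaled <- as.data.frame(scale(df))

# Column means are (numerically) zero and standard deviations are one
round(colMeans(df.scaled), 10)
apply(df.scaled, 2, sd)
```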

seed

Random seed for reproducibility. Default: 1

verbose

If TRUE, display messages and plots during execution. Default: TRUE

n.cores

Number of cores for parallel execution. Default: parallel::detectCores() - 1
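parallel::detectCores() can return NA on some platforms, and subtracting one on a single-core machine gives zero. Whether rf() guards against these cases is not stated here, so a defensive caller might clamp the value before passing it, as in this sketch:

```r
# Number of cores reported by the system (may be NA if it cannot be determined)
available <- parallel::detectCores()

# Leave one core free, but never go below one
n.cores <- if (is.na(available)) 1L else max(1L, available - 1L)
n.cores
```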

cluster

Cluster object from parallel::makeCluster(). Not used by this function but stored in the model for use in downstream functions. Default: NULL

Details

See ranger documentation for additional details. The formula interface is supported via ranger.arguments, but variable interactions are not permitted. For feature engineering including interactions, see the_feature_engineer().

Value

A ranger model object with additional slots:

  • ranger.arguments: Arguments used to fit the model.

  • importance: List with global importance data frame (predictors ranked by importance), importance plot, and local importance scores (per-observation difference in accuracy between permuted and non-permuted predictors, based on out-of-bag data).

  • performance: Model performance metrics including R-squared (out-of-bag and standard), pseudo R-squared, RMSE, and NRMSE.

  • residuals: Model residuals with normality diagnostics (residuals_diagnostics()) and spatial autocorrelation (moran_multithreshold()).
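For orientation, the error metrics named in the performance slot can be reproduced by hand in base R. The values below are made up, and the choices made here (RMSE normalized by the interquartile range, pseudo R-squared as the squared Pearson correlation between observed and predicted values) are common conventions assumed for illustration; they may not match the package's definitions exactly.

```r
# Made-up observed and predicted values
observed  <- c(2.0, 3.5, 5.0, 7.5, 9.0)
predicted <- c(2.5, 3.0, 5.5, 7.0, 8.0)

# Root mean squared error
rmse <- sqrt(mean((observed - predicted)^2))

# Normalized RMSE: dividing by the interquartile range is one common choice
# (an assumption here; other normalizations exist)
nrmse <- rmse / stats::IQR(observed)

# Squared Pearson correlation between observed and predicted values,
# one common definition of a pseudo R-squared (assumption)
pseudo.r.squared <- stats::cor(observed, predicted)^2

c(rmse = rmse, nrmse = nrmse, pseudo.r.squared = pseudo.r.squared)
```

Normalizing the RMSE makes errors comparable across responses measured on different scales.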

See Also

Other main_models: rf_spatial()

Examples


#load the package so that the example data and rf() are available
library(spatialRF)

data(
  plants_df,
  plants_response,
  plants_predictors,
  plants_distance
)

m <- rf(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  ranger.arguments = list(
    num.trees = 50,
    min.node.size = 20
  ),
  verbose = FALSE,
  n.cores = 1
)

#class of the fitted model
class(m)

#variable importance
m$importance$per.variable
m$importance$per.variable.plot

#model performance
m$performance

#autocorrelation of residuals
m$residuals$autocorrelation$per.distance
m$residuals$autocorrelation$plot

#model predictions
m$predictions$values

#predictions for new data (using stats::predict)
y <- stats::predict(
  object = m,
  data = plants_df[1:5, ],
  type = "response"
)$predictions

#alternative: pass arguments via ranger.arguments list
args <- list(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  num.trees = 50,
  min.node.size = 20,
  num.threads = 1
)

m <- rf(
  ranger.arguments = args,
  verbose = FALSE
)


spatialRF documentation built on Dec. 20, 2025, 1:07 a.m.
