rf: Random forest models with Moran's I test of the residuals
In spatialRF: Easy Spatial Modeling with Random Forest

rf	R Documentation

Description

Fits a random forest model with ranger and extends it with additional diagnostics: residual spatial autocorrelation (Moran's I) at multiple distance thresholds, performance metrics (RMSE and NRMSE, via root_mean_squared_error()), and variable importance scores, optionally computed on scaled data (via scale(); see the scaled.importance argument).
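
As a rough illustration of the performance metrics named above, the following base R sketch computes RMSE and one possible NRMSE. The helpers `rmse_sketch()` and `nrmse_sketch()` and the toy vectors are hypothetical, and normalizing by the interquartile range of the observations is an assumption; root_mean_squared_error() may normalize differently.

```r
# Hypothetical helpers for illustration; not part of spatialRF.
rmse_sketch <- function(observed, predicted) {
  # Square root of the mean squared difference between observed and predicted
  sqrt(mean((observed - predicted)^2))
}

# Assumption: RMSE is normalized by the interquartile range of the observations.
nrmse_sketch <- function(observed, predicted) {
  rmse_sketch(observed, predicted) / stats::IQR(observed)
}

# Toy values, invented for this example
observed  <- c(10, 12, 15, 20, 30)
predicted <- c(11, 11, 16, 18, 33)

rmse_value  <- rmse_sketch(observed, predicted)
nrmse_value <- nrmse_sketch(observed, predicted)

rmse_value
nrmse_value
```

Dividing by a measure of spread in the response makes the error comparable across response variables measured in different units.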

Usage

rf(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

`data`	Data frame with a response variable and a set of predictors. Default: `NULL`
`dependent.variable.name`	Character string with the name of the response variable. Must be a column name in `data`. For binary response variables (0/1), case weights are automatically computed using `case_weights()` to balance classes. Default: `NULL`
`predictor.variable.names`	Character vector with predictor variable names. All names must be columns in `data`. Alternatively, accepts the output of `auto_cor()` or `auto_vif()` for automated variable selection. Default: `NULL`
`distance.matrix`	Square matrix with pairwise distances between observations in `data`. Must have the same number of rows as `data`. If `NULL`, spatial autocorrelation of residuals is not computed. Default: `NULL`
`distance.thresholds`	Numeric vector of distance thresholds for spatial autocorrelation analysis. For each threshold, distances below that value are set to zero when computing Moran's I. If `NULL`, defaults to `seq(0, max(distance.matrix), length.out = 4)`. Default: `NULL`
`xy`	Data frame or matrix with two columns containing coordinates, named "x" and "y". Not used by this function but stored in the model for use by `rf_evaluate()` and `rf_tuning()`. Default: `NULL`
`ranger.arguments`	Named list with ranger arguments. Arguments of `rf()` itself (for example `data` or `distance.matrix`) can also be passed in this list. The default importance method is 'permutation' instead of ranger's default 'none'. The `x`, `y`, and `formula` arguments are not supported. See the ranger help for available arguments. Default: `NULL`
`scaled.importance`	If `TRUE`, variable importance is computed on scaled data using `scale()`, making importance scores comparable across models with different predictor units. Default: `FALSE`
`seed`	Random seed for reproducibility. Default: `1`
`verbose`	If `TRUE`, display messages and plots during execution. Default: `TRUE`
`n.cores`	Number of cores for parallel execution. Default: `parallel::detectCores() - 1`
`cluster`	Cluster object from `parallel::makeCluster()`. Not used by this function but stored in the model for use in downstream functions. Default: `NULL`
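
To illustrate what scaling does to predictors measured in different units (the idea behind `scaled.importance`), the sketch below applies base R's `scale()` to a toy data frame invented for this example. It only shows the transformation; how `rf()` uses the scaled data internally is not reproduced here.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  elevation_m   = c(120, 850, 1430, 2210, 310),
  temperature_c = c(18.2, 14.1, 9.8, 4.3, 17.0)
)

# scale() centers each column on its mean and divides it by its
# standard deviation, so all columns end up on a common scale
toy_scaled <- as.data.frame(scale(toy))

# After scaling, column means are (numerically) zero and
# column standard deviations are one
scaled_means <- colMeans(toy_scaled)
scaled_sds   <- apply(toy_scaled, 2, sd)

round(scaled_means, 10)
scaled_sds
```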

Details

See ranger documentation for additional details. The formula interface is supported via ranger.arguments, but variable interactions are not permitted. For feature engineering including interactions, see the_feature_engineer().

Value

A ranger model object with additional slots:

ranger.arguments: Arguments used to fit the model.
importance: List with global importance data frame (predictors ranked by importance), importance plot, and local importance scores (per-observation difference in accuracy between permuted and non-permuted predictors, based on out-of-bag data).
performance: Model performance metrics including R-squared (out-of-bag and standard), pseudo R-squared, RMSE, and NRMSE.
residuals: Model residuals with normality diagnostics (residuals_diagnostics()) and spatial autocorrelation (moran_multithreshold()).
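
The residual autocorrelation slot reports Moran's I at several distance thresholds. The base R sketch below shows the general idea on toy data. The helper `moran_sketch()` is hypothetical, and its weighting scheme (inverse distances, pairs closer than the threshold treated as full neighbors with weight 1, row-standardized weights) is an assumption made for this example, not a description of how moran_multithreshold() is implemented.

```r
# Hypothetical helper; the weighting scheme is an assumption for illustration.
moran_sketch <- function(x, distance.matrix, distance.threshold = 0) {
  d <- distance.matrix
  d[d < distance.threshold] <- 0   # distances below the threshold become zero
  w <- 1 / d                       # inverse-distance weights
  w[is.infinite(w)] <- 1           # zeroed distances count as full neighbors
  diag(w) <- 0                     # an observation is not its own neighbor
  w <- w / rowSums(w)              # row-standardize the weights
  w[is.na(w)] <- 0                 # guard against rows without neighbors
  z <- x - mean(x)
  n <- length(x)
  # Moran's I: (n / sum of weights) * weighted cross-products / sum of squares
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy data: ten points on a line, with residuals that follow a spatial trend
coords <- 1:10
dm <- as.matrix(dist(coords))
res <- coords - mean(coords)

# Default thresholds as described for distance.thresholds = NULL
thresholds <- seq(0, max(dm), length.out = 4)

moran_by_threshold <- vapply(
  thresholds,
  function(th) moran_sketch(res, dm, th),
  numeric(1)
)

data.frame(distance.threshold = thresholds, moran.i = moran_by_threshold)
```

Values clearly above zero at short thresholds suggest that nearby residuals resemble each other, which is the pattern the spatial diagnostics are meant to detect.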

Examples


data(
  plants_df,
  plants_response,
  plants_predictors,
  plants_distance
)

m <- rf(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  ranger.arguments = list(
    num.trees = 50,
    min.node.size = 20
  ),
  verbose = FALSE,
  n.cores = 1
)

class(m)

#variable importance
m$importance$per.variable
m$importance$per.variable.plot

#model performance
m$performance

#autocorrelation of residuals
m$residuals$autocorrelation$per.distance
m$residuals$autocorrelation$plot

#model predictions
m$predictions$values

#predictions for new data (using stats::predict)
y <- stats::predict(
  object = m,
  data = plants_df[1:5, ],
  type = "response"
)$predictions

#alternative: pass arguments via ranger.arguments list
args <- list(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  num.trees = 50,
  min.node.size = 20,
  num.threads = 1
)

m <- rf(
  ranger.arguments = args,
  verbose = FALSE
)

spatialRF documentation built on Dec. 20, 2025, 1:07 a.m.