train_rf: Train a Random Forest

train_rf    R Documentation

Description

Train a random forest with ranger from a dataframe of writer profiles estimated with get_cluster_fill_rates. train_rf calculates the distance between all pairs of writer profiles using one or more distance measures. Currently, the available distance measures are absolute, Manhattan, Euclidean, maximum, and cosine.

Usage

train_rf(
  df,
  ntrees,
  distance_measures,
  output_dir = NULL,
  run_number = 1,
  downsample_diff_pairs = TRUE
)

Arguments

df

A dataframe of writer profiles created with get_cluster_fill_rates

ntrees

An integer number of decision trees to use

distance_measures

A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used.

output_dir

A path to a directory where the random forest will be saved.

run_number

An integer used both as the seed passed to set.seed and as a label to distinguish between different runs on the same input dataframe.

downsample_diff_pairs

Whether to downsample the different writer distances before training the random forest. If TRUE, the different writer distances are randomly sampled so that the number of different writer pairs equals the number of same writer pairs.
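The downsampling described for downsample_diff_pairs can be illustrated with a minimal base R sketch. The data frame, its column names (euc, match), and the values are hypothetical and chosen only for illustration; this is not necessarily how train_rf implements the step internally.

```r
set.seed(1)  # plays the role that run_number has as a seed

# Hypothetical table of pairwise distances; column names are illustrative
dists <- data.frame(
  euc = runif(12),
  match = c(rep("same", 3), rep("different", 9))
)

same_pairs <- dists[dists$match == "same", ]
diff_pairs <- dists[dists$match == "different", ]

# Randomly keep as many different writer rows as there are same writer rows
keep <- sample(nrow(diff_pairs), nrow(same_pairs))
diff_down <- diff_pairs[keep, ]

# Balanced training set: equal numbers of same and different writer pairs
balanced <- rbind(same_pairs, diff_down)
table(balanced$match)
```

Because the sample is random, changing the seed changes which different writer pairs are kept, which is one reason the same integer is also used to tell runs apart.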

Details

The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b) where subtraction is performed element-wise, then the absolute value of each element is returned. More specifically, element i of the vector is |a_i - b_i| for i=1,2,...,n.

The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is \sum_{i=1}^n |a_i - b_i|. In other words, it is the sum of the elements of the absolute distance vector.

The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is \sqrt{\sum_{i=1}^n (a_i - b_i)^2}. In other words, it is the square root of the sum of the squared elements of the absolute distance vector.

The maximum distance between two n-length vectors of cluster fill rates, a and b, is \max_{1 \leq i \leq n}{\{|a_i - b_i|\}}. In other words, it is the largest element of the absolute distance vector.

The cosine distance between two n-length vectors of cluster fill rates, a and b, is \sum_{i=1}^n (a_i - b_i)^2 / (\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2}).
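As a minimal sketch, the five measures can be computed in base R for a single pair of vectors. The vectors a and b below are hypothetical fill rates chosen for illustration, and the variable names are not part of the package. The cosine line follows the formula exactly as stated above, which may differ from definitions of cosine distance used elsewhere.

```r
# Two hypothetical vectors of cluster fill rates (illustrative values only)
a <- c(0.2, 0.5, 0.3)
b <- c(0.1, 0.7, 0.2)

# Absolute distance: a vector with elements |a_i - b_i|
abs_dist <- abs(a - b)

# Manhattan distance: sum of the elements of the absolute distance vector
man_dist <- sum(abs_dist)

# Euclidean distance: square root of the sum of squared differences
euc_dist <- sqrt(sum((a - b)^2))

# Maximum distance: largest element of the absolute distance vector
max_dist <- max(abs_dist)

# Cosine distance, following the formula as written in this section
cos_dist <- sum((a - b)^2) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```

Note that only the absolute distance is a vector; the other four measures each reduce the pair to a single number.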

Value

A random forest

Examples

rforest <- train_rf(
  df = train,
  ntrees = 200,
  distance_measures = c("euc"),
  run_number = 1,
  downsample_diff_pairs = TRUE
)

handwriterRF documentation built on April 4, 2025, 5:38 a.m.