train_rf | R Documentation |
Train a random forest with ranger from a dataframe of writer profiles
estimated with get_cluster_fill_rates. train_rf calculates the distance
between all pairs of writer profiles using one or more distance measures.
Currently, the available distance measures are absolute, Manhattan,
Euclidean, maximum, and cosine.
train_rf(
df,
ntrees,
distance_measures,
output_dir = NULL,
run_number = 1,
downsample_diff_pairs = TRUE
)
df |
A dataframe of writer profiles created with get_cluster_fill_rates |
ntrees |
An integer number of decision trees to use |
distance_measures |
A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used. |
output_dir |
A path to a directory where the random forest will be saved. |
run_number |
An integer used both as the seed passed to the set.seed function and to distinguish between different runs on the same input dataframe. |
downsample_diff_pairs |
Whether to downsample the different-writer distances before training the random forest. If TRUE, the different-writer distances are randomly sampled so that the number of different-writer pairs equals the number of same-writer pairs. |
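The downsampling described for downsample_diff_pairs can be pictured with a minimal base R sketch. This is not the package's implementation: the data frame `dists` and its columns `euc` and `match` are hypothetical, and the only details taken from this page are that run_number seeds set.seed and that the two classes end up the same size.

```r
# Hypothetical illustration only; not the code used inside train_rf.
run_number <- 1
set.seed(run_number)  # run_number doubles as the random seed

# Hypothetical pairwise distances: 3 same-writer and 9 different-writer pairs
dists <- data.frame(
  euc = runif(12),
  match = c(rep("same", 3), rep("different", 9))
)

same_rows <- which(dists$match == "same")
diff_rows <- which(dists$match == "different")

# Randomly keep as many different-writer pairs as there are same-writer pairs
keep_diff <- sample(diff_rows, size = length(same_rows))
balanced <- dists[c(same_rows, keep_diff), ]

# Count the pairs in each class after downsampling
print(table(balanced$match))
```

Balancing the classes this way keeps the far more numerous different-writer pairs from dominating training.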
The absolute distance between two n-length vectors of cluster fill rates, a
and b, is a vector of the same length as a and b. It can be calculated as
abs(a-b) where subtraction is performed element-wise, then the absolute
value of each element is returned. More specifically, element i of the vector
is |a_i - b_i| for i=1,2,...,n.
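As a small illustration of the absolute distance, the following base R sketch uses two made-up profiles. The vectors `a` and `b` are hypothetical values, not output of get_cluster_fill_rates.

```r
# Two hypothetical writer profiles (cluster fill rates), n = 4
a <- c(0.10, 0.40, 0.30, 0.20)
b <- c(0.25, 0.25, 0.40, 0.10)

# Element-wise subtraction, then the absolute value of each element
abs_dist <- abs(a - b)

# The result has one entry per cluster, i.e. the same length as a and b
print(abs_dist)
print(length(abs_dist) == length(a))
```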
The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is
\sum_{i=1}^n |a_i - b_i|. In other words, it is the sum of the elements of the
absolute distance vector.
The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is
\sqrt{\sum_{i=1}^n (a_i - b_i)^2}. In other words, it is the square root of the
sum of the squared elements of the absolute distance vector.
The maximum distance between two n-length vectors of cluster fill rates, a and b, is
\max_{1 \leq i \leq n}{\{|a_i - b_i|\}}. In other words, it is the largest
element of the absolute distance vector.
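The Manhattan, Euclidean, and maximum distances are each a single number that can be derived from the vector abs(a-b). A minimal base R sketch with hypothetical profiles `a` and `b` (not the package's internal code) follows.

```r
# Two hypothetical writer profiles (cluster fill rates), n = 4
a <- c(0.10, 0.40, 0.30, 0.20)
b <- c(0.25, 0.25, 0.40, 0.10)

# Absolute distance vector, |a_i - b_i| for each i
d <- abs(a - b)

man_dist <- sum(d)          # Manhattan: sum of the elements
euc_dist <- sqrt(sum(d^2))  # Euclidean: square root of the sum of squares
max_dist <- max(d)          # maximum: largest element

print(c(man = man_dist, euc = euc_dist, max = max_dist))
```

The same three values can be cross-checked with stats::dist(rbind(a, b), method = ...) using the methods "manhattan", "euclidean", and "maximum".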
The cosine distance between two n-length vectors of cluster fill rates, a and b, is
\sum_{i=1}^n (a_i - b_i)^2 / (\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2}).
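The cosine formula as written above can be transcribed into base R as follows. The helper name `cos_dist_as_documented` and the vectors `a` and `b` are hypothetical; the sketch follows the formula on this page term by term and is not taken from the package source.

```r
# Hypothetical helper: direct transcription of the documented formula
cos_dist_as_documented <- function(a, b) {
  stopifnot(length(a) == length(b))
  numerator <- sum((a - b)^2)                       # sum of squared differences
  denominator <- sqrt(sum(a^2)) * sqrt(sum(b^2))    # product of vector norms
  numerator / denominator
}

# Two hypothetical writer profiles (cluster fill rates), n = 4
a <- c(0.10, 0.40, 0.30, 0.20)
b <- c(0.25, 0.25, 0.40, 0.10)

cos_ab <- cos_dist_as_documented(a, b)
print(cos_ab)

# Identical profiles give a distance of zero
print(cos_dist_as_documented(a, a))
```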
A random forest
rforest <- train_rf(
df = train,
ntrees = 200,
distance_measures = c("euc"),
run_number = 1,
downsample_diff_pairs = TRUE
)