Creates a training-test split using the maximum dissimilarity method. Unlike a purely random subset, this approach selects rows of the data frame that are maximally dissimilar from one another, so the variability of the data set is preserved in the training data. The training data are therefore more likely to be representative of the whole data set, which reduces concerns about the impact of the training-test split on the final inferences. The function is nearly deterministic with regard to which observations are chosen, which also facilitates reproducibility.
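To make the idea concrete, here is a minimal sketch of a greedy maximum dissimilarity selection in base R. It is an illustration only, not the package's implementation: the helper name `max_dissim_sketch`, its arguments, and the choice of Euclidean distance on scaled numeric columns are all assumptions.

```r
# Hypothetical helper (not part of the package): greedy max-min selection.
# Assumes x holds only numeric columns with non-zero variance.
max_dissim_sketch <- function(x, n_select, seed_row = 1L) {
  # Pairwise Euclidean distances on centered and scaled columns (an assumption)
  d <- as.matrix(dist(scale(x)))
  selected <- seed_row
  while (length(selected) < n_select) {
    candidates <- setdiff(seq_len(nrow(d)), selected)
    # Distance from each candidate to its nearest already-selected row
    nearest <- apply(d[candidates, selected, drop = FALSE], 1, min)
    # Add the candidate that is farthest from the current selection
    selected <- c(selected, candidates[which.max(nearest)])
  }
  selected
}

idx <- max_dissim_sketch(iris[, 1:4], n_select = 10)
```

Given the same data and the same seed row, this sketch always returns the same rows, which illustrates why a dissimilarity-based split can be close to deterministic.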
trainSubset(data, p, y = NULL)
data: A data frame containing the full data set.
p: The target proportion of the data set to use for the training set. The size of the subset is rounded to the nearest integer. For example, setting p = 0.80 with a data frame of 233 rows results in around 186 observations in the training data. The final number may be slightly less than p * n due to rounding.
y: An optional character string giving the column name of the intended response variable. If supplied, observations whose response values lie near the median are chosen as the seed, to avoid biasing the sample toward values near only the upper or lower quantiles.
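The arithmetic behind `p` and the role of `y` can be sketched as follows. The seeding rule shown here (take the row whose response is closest to the median) is an assumption based on the description of `y`; the function's actual rule may differ, and the `mtcars` data set is used purely for illustration.

```r
# Training-set size for the example in the p argument
n <- 233
p <- 0.80
n_train <- round(p * n)   # 0.80 * 233 = 186.4, which rounds to 186

# Assumed seeding rule: start from the row whose response value
# is closest to the median of the response column.
y_vals <- mtcars$mpg
seed_row <- which.min(abs(y_vals - median(y_vals)))
```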
An integer vector giving the row indices chosen for the training data.
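Because the return value is a vector of row indices, it can be used directly to subset the data. The sketch below uses a made-up index vector in place of a real call, so the values in `train_rows` are placeholders rather than actual output.

```r
# Stand-in for the integer vector the function returns; these values are made up.
train_rows <- c(1L, 5L, 9L, 20L, 32L)

train <- mtcars[train_rows, ]    # rows selected for training
test  <- mtcars[-train_rows, ]   # all remaining rows form the test set
```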
Willett, P. 1999. "Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds," Journal of Computational Biology, 6, 447-457.