View source: R/2_5_textTrainN.R
textTrainN | R Documentation |
(experimental) Compute cross-validated correlations for different sample-sizes of a data set. The cross-validation process can be repeated several times to enhance the reliability of the evaluation.
textTrainN(
x,
y,
sample_percents = c(25, 50, 75, 100),
handle_word_embeddings = "individually",
n_cross_val = 1,
sampling_strategy = "subsets",
use_same_penalty_mixture = TRUE,
model = "regression",
penalty = 10^seq(-16, 16),
mixture = c(0),
seed = 2024,
...
)
x |
Word embeddings from textEmbed (or textEmbedLayerAggregation). If several word embedding are provided in a list they will be concatenated. |
y |
Numeric variable to predict. |
sample_percents |
(numeric) Numeric vector that specifies the percentages of the total number of data points to include in each sample (default = c(25,50,75,100), i.e., correlations are evaluated for 25 each new sample. |
handle_word_embeddings |
Determine whether to use a list of word embeddings or an individual word_embedding (default = "individually", also "concatenate"). If a list of word embeddings are provided, then they will be concatenated. |
n_cross_val |
(numeric) Value that determines the number of times to repeat the cross-validation (i.e., number of tests). (default = 1, i.e., cross-validation is only performed once). Warning: The training process gets proportionately slower to the number of cross-validations, resulting in a time complexity that increases with a factor of n (n cross-validations). |
sampling_strategy |
Sample a "random" sample for each subset from all data or sample a "subset" from the larger subsets (i.e., each subset contain the same data). |
use_same_penalty_mixture |
If TRUE it only searches the penalty and mixture search grid once, and then use the same thereafter; if FALSE, it searches the grid every time. |
model |
Type of model. Default is "regression"; see also "logistic" and "multinomial" for classification. |
penalty |
(numeric) Hyper parameter that is tuned (default = 10^seq(-16,16)). |
mixture |
A number between 0 and 1 (inclusive) that reflects the proportion of L1 regularization (i.e. lasso) in the model (for more information see the linear_reg-function in the parsnip-package). When mixture = 1, it is a pure lasso model while mixture = 0 indicates that ridge regression is being used (specific engines only). |
seed |
(numeric) Set different seed (default = 2024). |
... |
Additional parameters from textTrainRegression. |
A tibble containing correlations for each sample. If n_cross_val > 1, correlations for each new cross-validation, along with standard-deviation, mean and standard error of correlation is included in the tibble. The information in the tibble is visualised via the textTrainNPlot function.
See textTrainNPlot
.
# Compute correlations for 25%, 50%, 75% and 100% of the data in word_embeddings and perform
# cross-validation thrice.
## Not run:
tibble_to_plot <- textTrainN(
x = word_embeddings_4$texts$harmonytext,
y = Language_based_assessment_data_8$hilstotal,
sample_percents = c(25, 50, 75, 100),
n_cross_val = 3
)
# tibble_to_plot contains correlation-coefficients for each cross_validation and
# standard deviation and mean value for each sample. The tibble can be plotted
# using the testTrainNPlot function.
# Examine tibble
tibble_to_plot
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.