View source: R/4_4_textWordPrediction.R
| textWordPrediction | R Documentation |
For each unique word in 'words' the function:
Computes the **mean value of 'x'** (and optionally 'y') across all participants whose response contained that word.
Looks up the **decontextualised embedding** for that word from 'word_types_embeddings'.
Trains a **ridge regression** model: embedding -> mean x score. The out-of-sample predictions become the 'x_plotted' plotting coordinate, allowing generalisation to words unseen in training.
Optionally computes **permutation-based p-values** (see 'n_permutations') by shuffling 'x' labels and building a null distribution of prediction scores.
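The first three steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy responses and scores are invented, random vectors stand in for the decontextualised embeddings, and a closed-form ridge fit with leave-one-out prediction stands in for the textTrainRegression model that the function actually fits. The helper `ridge_fit` is hypothetical.

```r
# Toy data (invented for illustration): one response and one score per participant
responses <- c("calm happy", "happy peace", "stress tired", "calm peace", "tired")
x <- c(8, 9, 2, 7, 3)

# Step 1: mean of x across participants whose response contains each word
tokens <- lapply(strsplit(responses, "\\s+"), unique)
word_types <- sort(unique(unlist(tokens)))
word_mean_x <- sapply(word_types, function(w) {
  mean(x[sapply(tokens, function(tk) w %in% tk)])
})

# Step 2: stand-in embeddings (random 3-dimensional vectors, NOT real embeddings)
set.seed(1003)
emb <- matrix(rnorm(length(word_types) * 3), ncol = 3,
              dimnames = list(word_types, NULL))

# Step 3: closed-form ridge regression (hypothetical helper; intercept unpenalised)
ridge_fit <- function(X, y, lambda = 1) {
  Xc <- cbind(1, X)                             # add an intercept column
  penalty <- diag(c(0, rep(lambda, ncol(X))))   # no penalty on the intercept
  solve(crossprod(Xc) + penalty, crossprod(Xc, y))
}

# Out-of-sample prediction for each word: fit on all other words, predict the held-out one
x_plotted <- sapply(seq_along(word_types), function(i) {
  beta <- ridge_fit(emb[-i, , drop = FALSE], word_mean_x[-i])
  drop(c(1, emb[i, ]) %*% beta)
})
names(x_plotted) <- word_types

word_mean_x
x_plotted
```

Because each prediction comes from a model that never saw the held-out word, the predicted score reflects what the embedding space implies about the word rather than its own observed mean.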
The returned 'word_data' tibble has column names that match the expectations
of textProjectionPlot: 'x_plotted' (and 'y_plotted') for coordinates
and 'p_values_x' (and 'p_values_y') for significance.
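As a rough illustration of working with these columns, the sketch below filters a mock table for significant words and orders them by their plotted coordinate. The data frame and all values in it are invented; only the column names are taken from this page. A plain data frame is used so the snippet runs without extra packages.

```r
# Mock stand-in for result$word_data (values invented for illustration)
word_data <- data.frame(
  words             = c("calm", "happy", "tired", "okay"),
  n                 = c(12, 20, 9, 4),
  word_mean_value_x = c(7.5, 8.5, 2.5, 5.1),
  x_plotted         = c(7.1, 8.0, 3.0, 5.4),
  p_values_x        = c(0.012, 0.003, 0.020, 0.610),
  stringsAsFactors  = FALSE
)

# Keep words below a conventional 0.05 threshold, ordered along the x-axis
sig <- word_data[word_data$p_values_x < 0.05, ]
sig <- sig[order(sig$x_plotted), ]
sig$words
```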
textWordPrediction(
words,
word_types_embeddings = word_types_embeddings_df,
x,
y = NULL,
n_models = 25,
n_permutations = 10000,
seed = 1003,
case_insensitive = TRUE,
text_remove = "[()]",
...
)
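The defaults case_insensitive = TRUE and text_remove = "[()]" suggest a light clean-up of the responses before words are counted. A minimal sketch of such a step follows; `clean_words` is a hypothetical helper, not a function in the package, and treating case_insensitive = TRUE as lower-casing is an assumption.

```r
# Hypothetical helper, not part of the text package
clean_words <- function(words, case_insensitive = TRUE, text_remove = "[()]") {
  if (is.data.frame(words)) words <- words[[1]]   # accept a single-column tibble or data frame
  words <- gsub(text_remove, "", words)           # strip characters matching the regex
  if (case_insensitive) words <- tolower(words)   # assumption: TRUE means lower-casing
  words
}

clean_words(c("Happy (very)", "CALM"))
# returns "happy very" "calm"
```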
words |
Character vector **or** single-column tibble of free-text responses (one per participant). |
word_types_embeddings |
Word-type embeddings from |
x |
Numeric vector (or single-column tibble) of the outcome variable to project words onto the x-axis (e.g., a well-being scale score). |
y |
Optional numeric vector for a second outcome to
project onto the y-axis. Default NULL. |
n_models |
Number of null ridge regression models to fit,
each trained on a *different* permuted x vector.
Each null fit produces genuine cross-validated
out-of-sample null scores - one per word.
Determines p-value resolution: the minimum
non-trivial p-value step is approximately
|
n_permutations |
Number of bootstrap samples drawn from the
|
seed |
Integer seed for reproducibility. Default 1003. |
case_insensitive |
Logical. If |
text_remove |
Regex pattern for characters to strip before
processing (e.g., brackets). Default "[()]".
|
... |
Additional arguments forwarded to
|
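To make the interplay of n_models and n_permutations concrete, here is a generic sketch of a pooled null distribution. It is not the package's algorithm: the toy data are invented, an ordinary linear model via lm.fit stands in for ridge regression, and the pooling, the two-sided comparison around the mean, and the +1 correction are all assumptions chosen for illustration. The helper `loo_predict` is hypothetical.

```r
set.seed(1003)
n_models <- 5
n_permutations <- 1000

# Toy word-level data (invented): 2-dimensional stand-in embeddings and word means
n_words <- 40
emb <- matrix(rnorm(n_words * 2), ncol = 2)
word_mean_x <- 2 * emb[, 1] + rnorm(n_words, sd = 0.5)

# Hypothetical helper: leave-one-out predictions from a linear model
loo_predict <- function(y, X) {
  sapply(seq_along(y), function(i) {
    fit <- lm.fit(cbind(1, X[-i, , drop = FALSE]), y[-i])
    sum(c(1, X[i, ]) * fit$coefficients)
  })
}

observed <- loo_predict(word_mean_x, emb)

# n_models null fits, each trained on a differently shuffled outcome vector;
# every fit contributes one out-of-sample null score per word
null_pool <- unlist(lapply(seq_len(n_models), function(m) {
  loo_predict(sample(word_mean_x), emb)
}))

# Bootstrap n_permutations draws from the pooled null scores
null_draws <- sample(null_pool, n_permutations, replace = TRUE)

# Two-sided empirical p-value per word, measured from the mean outcome
centre <- mean(word_mean_x)
p_values <- sapply(observed, function(s) {
  (sum(abs(null_draws - centre) >= abs(s - centre)) + 1) / (n_permutations + 1)
})

summary(p_values)
```

In a scheme like this, the null fits are the expensive part, so a small number of them is resampled many times rather than refitting a model for every permutation.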
A named list:
The fitted textTrainRegression model for the x-axis.
(Only if 'y' is supplied) Fitted model for the y-axis.
word_data: A tibble with one row per unique word containing:
words, n (frequency), word_mean_value_x,
x_plotted (embedding-based prediction), p_values_x; plus
the y-equivalents when 'y' is provided.
The comment attribute on the output stores a human-readable description of all call parameters for reproducibility.
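Assuming this refers to the standard base R comment attribute, it can be read with `comment()`. The sketch below uses a mock list and an invented description string to show the mechanism; the actual text stored by textWordPrediction is not reproduced here.

```r
# Mock object standing in for the function's output (contents invented)
result_like <- list(
  word_data = data.frame(words = c("calm", "tired"), stringsAsFactors = FALSE)
)

# Attach and read back a description; base R does not print this attribute by default
comment(result_like) <- "textWordPrediction(n_models = 25, seed = 1003)"
comment(result_like)
```

Since `print()` does not show the comment attribute, it has to be requested explicitly when documenting an analysis.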
textProjection, textProjectionPlot,
textTrainRegression
## Not run:
library(text)
# --- Step 1: embed the text column (produces text-level + word-type embeddings)
embeddings <- textEmbed(Language_based_assessment_data_8["harmonywords"])
# --- Step 2: run textWordPrediction
result <- textWordPrediction(
words = Language_based_assessment_data_8$harmonywords,
word_types_embeddings = embeddings$word_types,
x = Language_based_assessment_data_8$hilstotal,
n_models = 5, # 5 null fits, each on a different permuted x
n_permutations = 10000, # 5 x 10 000 = 50 000 total null samples
seed = 1003
)
# --- Step 3: inspect word-level scores
result$word_data
# --- Step 4: pass directly to textProjectionPlot
textProjectionPlot(result)
## End(Not run)