textWordPrediction: Compute word-level prediction scores for plotting with...

View source: R/4_4_textWordPrediction.R

textWordPrediction    R Documentation

Compute word-level prediction scores for plotting with textProjectionPlot().

Description

For each unique word in 'words', the function:

  1. Computes the **mean value of 'x'** (and optionally 'y') across all participants whose response contained that word.

  2. Looks up the **decontextualised embedding** for that word from 'word_types_embeddings'.

  3. Trains a **ridge regression** model: embedding -> mean x score. The out-of-sample predictions become the 'x_plotted' plotting coordinate, allowing generalisation to words unseen in training.

  4. Optionally computes **permutation-based p-values** (see 'n_models' and 'n_permutations') by shuffling the 'x' values, refitting the model on each shuffled vector, and building a null distribution of prediction scores.
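The steps above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the toy responses, the random 3-dimensional stand-in embeddings, the helper `ridge_fit()`, the fixed penalty `lambda = 1`, and the leave-one-out scheme are all assumptions made for illustration. The real function delegates model fitting to textTrainRegression.

```r
# Toy inputs (hypothetical): one free-text response and one score per participant.
words <- c("happy calm", "calm tired", "happy joyful", "tired sad")
x     <- c(8, 5, 9, 2)

# Step 1: mean of x over the participants whose response contains each word.
tokens    <- lapply(strsplit(tolower(words), "\\s+"), unique)
word_long <- data.frame(
  word = unlist(tokens),
  x    = rep(x, times = lengths(tokens))
)
word_mean_x <- sapply(split(word_long$x, word_long$word), mean)

# Step 2 stand-in (assumption): random vectors instead of real word-type embeddings.
set.seed(1003)
emb <- matrix(rnorm(length(word_mean_x) * 3), ncol = 3,
              dimnames = list(names(word_mean_x), NULL))

# Step 3: closed-form ridge regression; the intercept is left unpenalised.
ridge_fit <- function(X, y, lambda) {
  X1 <- cbind(1, X)
  P  <- diag(c(0, rep(lambda, ncol(X))))
  solve(crossprod(X1) + P, crossprod(X1, y))
}

# Leave-one-out predictions: each word is scored by a model that never saw it.
y_vec     <- as.numeric(word_mean_x)
x_plotted <- vapply(seq_along(y_vec), function(i) {
  beta <- ridge_fit(emb[-i, , drop = FALSE], y_vec[-i], lambda = 1)
  drop(c(1, emb[i, ]) %*% beta)
}, numeric(1))
names(x_plotted) <- names(word_mean_x)

word_mean_x  # raw per-word means
x_plotted    # out-of-sample, embedding-based scores
```

Because the stand-in embeddings are random, the predicted scores here carry no meaning. The point is only the data flow from responses to per-word means to out-of-sample predictions.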

The returned 'word_data' tibble has column names that match the expectations of textProjectionPlot: 'x_plotted' (and 'y_plotted') for coordinates and 'p_values_x' (and 'p_values_y') for significance.

Usage

textWordPrediction(
  words,
  word_types_embeddings = word_types_embeddings_df,
  x,
  y = NULL,
  n_models = 25,
  n_permutations = 10000,
  seed = 1003,
  case_insensitive = TRUE,
  text_remove = "[()]",
  ...
)

Arguments

words

Character vector **or** single-column tibble of free-text responses (one per participant).

word_types_embeddings

Word-type embeddings from textEmbed - specifically the '$word_types' component. These are *decontextualised*: one fixed vector per unique word type.

x

Numeric vector (or single-column tibble) holding the outcome variable that words are projected onto along the x-axis (e.g., a well-being scale score). One value per participant.

y

Optional numeric vector for a second outcome to project onto the y-axis. Default NULL.

n_models

Number of null ridge regression models to fit, each trained on a *different* permuted x vector. Each null fit produces genuine cross-validated out-of-sample null scores - one per word. Determines p-value resolution: the minimum non-trivial p-value step is approximately 1/n_models (e.g., 0.04 with 25 models, which is just below alpha = 0.05). Default 25.
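The resolution claim can be checked with a small sketch. It assumes the p-value is a two-sided empirical proportion of null scores at least as extreme as the observed score. The exact formula used by the package is not stated here, and `null_scores` and `observed` are hypothetical values.

```r
# Hypothetical null scores for a single word: one score per null model.
set.seed(1003)
n_models    <- 25
null_scores <- rnorm(n_models)
observed    <- 2.1   # hypothetical observed prediction score for that word

# Assumed definition: share of null scores at least as extreme as the observed one.
p_value <- mean(abs(null_scores) >= abs(observed))

# Every attainable p-value is a multiple of 1 / n_models.
step <- 1 / n_models
step
```

With 25 null models the step is 0.04, so a word can only fall below alpha = 0.05 if at most one null score is as extreme as the observed one.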

n_permutations

Number of bootstrap samples drawn from the n_models null scores to smooth the null distribution. Does not require additional model fits. Set to 0 to skip p-values entirely. Default 10000.
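As a rough illustration of why this step needs no further model fits, the sketch below resamples a fixed set of null scores with replacement. The scores are simulated and the variable names are hypothetical. How the package combines the resampled values into a p-value is not shown.

```r
# Hypothetical null scores, one per fitted null model (n_models = 25).
set.seed(1003)
null_scores    <- rnorm(25)
n_permutations <- 10000

# Bootstrap: draw with replacement from the existing scores; nothing is refitted.
null_distribution <- sample(null_scores, size = n_permutations, replace = TRUE)

length(null_distribution)
```

Every resampled value is one of the original 25 scores, so the cost of this step is independent of the cost of fitting a model.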

seed

Integer seed for reproducibility. Default 1003.

case_insensitive

Logical. If TRUE (default), word matching ignores capitalisation.

text_remove

Regex pattern for characters to strip before processing (e.g., brackets). Default "[()]".
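The effect of 'text_remove' and 'case_insensitive' can be approximated in base R. This is a sketch with made-up responses. The order in which the package applies the two steps is an assumption.

```r
# Hypothetical raw responses.
responses <- c("Happy (mostly)", "CALM", "calm (tired)")

# text_remove = "[()]": strip opening and closing parentheses.
cleaned <- gsub("[()]", "", responses)

# case_insensitive = TRUE: fold case so "CALM" and "calm" count as the same word.
cleaned <- tolower(cleaned)

cleaned
# "happy mostly" "calm" "calm tired"
```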

...

Additional arguments forwarded to textTrainRegression.

Value

A named list:

model_x

The fitted textTrainRegression model for the x-axis.

model_y

(Only if 'y' is supplied) Fitted model for the y-axis.

word_data

A tibble with one row per unique word containing: words, n (frequency), word_mean_value_x, x_plotted (embedding-based prediction), p_values_x; plus the y-equivalents when 'y' is provided.

The comment attribute on the output stores a human-readable description of all call parameters for reproducibility.
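The attribute can be read with base R's comment(). The sketch below uses a stand-in list and an invented description string, since the exact wording stored by the function is not documented here.

```r
# Stand-in for a returned object (hypothetical contents).
result <- list(word_data = data.frame(words = "happy", n = 2L))

# Attach a description the way a comment attribute is set in base R.
comment(result) <- "textWordPrediction: n_models = 5, seed = 1003"

# The comment is not shown when the object is printed; read it explicitly.
comment(result)
```

On a real result, calling comment(result) should return the stored description of the call parameters.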

See Also

textProjection, textProjectionPlot, textTrainRegression

Examples

## Not run: 
library(text)

# --- Step 1: embed the text column (produces text-level + word-type embeddings)
embeddings <- textEmbed(Language_based_assessment_data_8["harmonywords"])

# --- Step 2: run textWordPrediction
result <- textWordPrediction(
  words                 = Language_based_assessment_data_8$harmonywords,
  word_types_embeddings = embeddings$word_types,
  x                     = Language_based_assessment_data_8$hilstotal,
  n_models              = 5,      # 5 null fits, each on a permuted x (p-value step ~ 1/5)
  n_permutations        = 10000,  # bootstrap draws from the null scores; no extra fits
  seed                  = 1003
)

# --- Step 3: inspect word-level scores
result$word_data

# --- Step 4: pass directly to textProjectionPlot
textProjectionPlot(result)

## End(Not run)


text documentation built on June 13, 2026, 5:06 p.m.