cnorm.cv: Cross-validation for Term Selection in cNORM

cnorm.cv

R Documentation

Cross-validation for Term Selection in cNORM

Description

Assists in determining the optimal number of terms for the regression model using repeated Monte Carlo cross-validation. Each repetition splits the data 80/20 into training and validation sets. The split is stratified by norm group, or drawn as a random sample when sliding window ranking is used.

Usage

cnorm.cv(
  data,
  formula = NULL,
  repetitions = 5,
  norms = TRUE,
  min = 1,
  max = 12,
  cv = "full",
  pCutoff = NULL,
  width = NA,
  raw = NULL,
  group = NULL,
  age = NULL,
  weights = NULL
)

Arguments

`data`	Data frame of the norm sample or a cnorm object. A data frame should already contain the ranking, the powers, and the interactions of L and A.
`formula`	Formula from an existing regression model; if given, the min and max arguments are ignored. When a cnorm object is passed, the formula is fetched automatically.
`repetitions`	Number of repetitions for cross-validation.
`norms`	If TRUE, computes norm score crossfit and R^2. Note: Computationally intensive.
`min`	Start with a minimum number of terms (default = 1).
`max`	Maximum number of terms in the model, up to (k + 1) * (t + 1) - 1, where k and t are the power degrees of location (L) and age (A).
`cv`	"full" (default) splits data into training/validation, then ranks. Otherwise, expects a pre-ranked dataset.
`pCutoff`	P-value cutoff for the stratification check against unbalanced sampling; a t-test is performed per group. The documented default is 0.2, chosen to minimize beta error, although the signature lists NULL.
`width`	If provided, ranking is done via 'rankBySlidingWindow'; otherwise, ranking is done by group.
`raw`	Name of the raw score variable.
`group`	Name of the grouping variable.
`age`	Name of the age variable.
`weights`	Name of the weighting parameter.
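
The upper bound on 'max' can be checked with a line of arithmetic. The following is a minimal sketch in base R; the helper name max_terms is hypothetical and not part of cNORM, and the values of k and t are chosen for illustration only.

```r
# Hypothetical helper (not part of cNORM): upper bound on the number of terms.
# k: power degree of location L, t: power degree of age A.
# All products L^i * A^j with i in 0..k and j in 0..t, minus the constant term.
max_terms <- function(k, t) (k + 1) * (t + 1) - 1

max_terms(k = 4, t = 4)  # 24
max_terms(k = 5, t = 3)  # 23
```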

Details

Successive models, with an increasing number of terms, are evaluated, and the RMSE for raw scores plotted. This encompasses the training, validation, and entire dataset. If 'norms' is set to TRUE (default), the function will also calculate the mean norm score reliability and crossfit measures. Note that due to the computational requirements of norm score calculations, execution can be slow, especially with numerous repetitions or terms.
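
The evaluation scheme described above can be illustrated in base R. This is a sketch under stated assumptions, not cNORM's internal code: the data are simulated, the polynomial degree of an ordinary lm fit stands in for the number of terms, and the helper names rmse and stratified_train_idx are hypothetical.

```r
set.seed(1)

# Simulated sample: 4 groups of 100 cases (illustrative only).
d <- data.frame(group = rep(1:4, each = 100))
d$x <- d$group + rnorm(nrow(d))
d$raw <- 10 + 3 * d$x - 0.2 * d$x^2 + rnorm(nrow(d), sd = 2)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Draw 80% of the row indices within each group (stratified split).
stratified_train_idx <- function(g, prop = 0.8) {
  unlist(lapply(split(seq_along(g), g), function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }), use.names = FALSE)
}

repetitions <- 3
degrees <- 1:4  # stand-in for an increasing number of terms
rows <- list()

for (r in seq_len(repetitions)) {
  # One fresh random split per repetition (Monte Carlo cross-validation).
  train_idx <- stratified_train_idx(d$group)
  train <- d[train_idx, ]
  val <- d[-train_idx, ]

  for (p in degrees) {
    fit <- lm(raw ~ poly(x, p, raw = TRUE), data = train)
    rows[[length(rows) + 1]] <- data.frame(
      rep = r,
      degree = p,
      rmse_train = rmse(train$raw, predict(fit, newdata = train)),
      rmse_val = rmse(val$raw, predict(fit, newdata = val))
    )
  }
}

res <- do.call(rbind, rows)

# Mean RMSE per model complexity, averaged over repetitions.
aggregate(cbind(rmse_train, rmse_val) ~ degree, data = res, FUN = mean)
```

Averaging over several random splits reduces the dependence of the RMSE estimate on any single partition, which is the motivation for the 'repetitions' argument.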

When 'cv' is set to "full" (default), the training and validation datasets are ranked separately, providing comprehensive cross-validation. For a more streamlined validation process focused only on modeling, a pre-ranked dataset can be used. The output comprises RMSE for raw score models, norm score R^2, delta R^2, crossfit, and the norm score SE according to Oosterhuis, van der Ark, & Sijtsma (2016).

This function is not yet prepared for the 'extensive' search strategy, introduced in version 3.3, but instead relies on the first model per number of terms, without consistency check.

For assessing overfitting:

CROSSFIT = R(Training; Model)^2 / R(Validation; Model)^2

A CROSSFIT > 1 suggests overfitting, < 1 suggests potential underfitting, and values around 1 are optimal, given a low raw score RMSE and high norm score validation R^2.
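
The ratio can be illustrated on simulated data. The sketch below is base R only and makes several assumptions: R^2 is taken as the squared correlation between observed and predicted values, the model is a deliberately flexible polynomial lm fit rather than a cNORM model, and the helper name r2 is hypothetical.

```r
set.seed(42)

# Simulated data with a linear signal plus noise (illustrative only).
n <- 500
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 + 1.5 * d$x + rnorm(n, sd = 3)

# Single 80/20 split into training and validation data.
train_idx <- sample.int(n, size = 0.8 * n)
train <- d[train_idx, ]
val <- d[-train_idx, ]

# Deliberately over-flexible model fitted on the training data only.
fit <- lm(y ~ poly(x, 6, raw = TRUE), data = train)

# R^2 as squared correlation between observed and predicted values.
r2 <- function(obs, pred) cor(obs, pred)^2
r2_train <- r2(train$y, predict(fit, newdata = train))
r2_val <- r2(val$y, predict(fit, newdata = val))

# Values above 1 mean the model fits the training data better than the validation data.
crossfit <- r2_train / r2_val
crossfit
```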

Suggestions for ideal model selection:

Visual inspection of percentiles with 'plotPercentiles' or 'plotPercentileSeries'.
Pair visual inspection with repeated cross-validation (e.g., 10 repetitions).
Aim for low raw score RMSE and high norm score R^2, avoiding terms with significant overfit (e.g., crossfit > 1.1).
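
As a rough illustration of the last suggestion, the following base R sketch filters a result table by crossfit and then picks the model with the lowest validation RMSE. The table is a toy example: its column names and all values are invented for illustration and do not reflect the actual structure or output of cnorm.cv.

```r
# Toy table; column names and values are invented, not cnorm.cv output.
cv <- data.frame(
  terms    = 1:6,
  rmse_val = c(4.1, 3.2, 2.9, 2.8, 2.8, 2.9),
  r2_val   = c(0.900, 0.950, 0.970, 0.975, 0.974, 0.970),
  crossfit = c(0.99, 1.00, 1.01, 1.02, 1.12, 1.20)
)

# Discard models with marked overfit, using the 1.1 threshold suggested above.
ok <- cv[cv$crossfit <= 1.1, ]

# Among the rest, prefer low validation RMSE, then high validation R^2.
best <- ok[order(ok$rmse_val, -ok$r2_val), ][1, ]
best$terms  # 4
```

Such a filter only narrows the candidates; the visual inspection of the percentile curves remains the deciding step.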

Value

Table with results per term number: RMSE for raw scores, R^2 for norm scores, and crossfit measure.

References

Oosterhuis, H. E. M., van der Ark, L. A., & Sijtsma, K. (2016). Sample Size Requirements for Traditional and Regression-Based Norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638

Examples

## Not run: 
library(cNORM)  # provides cnorm(), cnorm.cv() and the elfe sample data
# Example: Plot cross-validation RMSE by number of terms (up to 9) with three repetitions.
result <- cnorm(raw = elfe$raw, group = elfe$group)
cnorm.cv(result$data, min = 2, max = 9, repetitions = 3)

# Using a cnorm object examines the predefined formula.
cnorm.cv(result, repetitions = 1)

## End(Not run)

WLenhard/cNORM documentation built on June 10, 2025, 12:55 p.m.