cnorm.cv: Cross-validation for Term Selection in cNORM

View source: R/modelling.R

cnorm.cv    R Documentation


Description

Assists in determining the optimal number of terms for the regression model using repeated Monte Carlo cross-validation. Each repetition splits the data 80/20 into training and validation sets. The split is stratified by norm group, or drawn as a random sample when sliding window ranking is used.
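The stratified 80/20 split described above can be pictured with a few lines of base R. This is a minimal sketch on simulated data, not the code cnorm.cv runs internally; the data frame `d` and its columns are invented for illustration.

```r
# Sketch of a stratified 80/20 split in base R (not the cNORM internals).
set.seed(42)

# Hypothetical norm sample: 5 groups with 100 cases each
d <- data.frame(
  group = rep(1:5, each = 100),
  raw   = round(rnorm(500, mean = rep(seq(10, 30, by = 5), each = 100), sd = 4))
)

# Draw 80% of the row indices within every group
rows_by_group <- split(seq_len(nrow(d)), d$group)
train_idx <- unlist(lapply(rows_by_group, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}), use.names = FALSE)

train <- d[train_idx, ]
valid <- d[-train_idx, ]

table(train$group)  # 80 cases in each group
table(valid$group)  # 20 cases in each group
```

Sampling within each group keeps the group proportions identical in both parts, which a plain random split only achieves approximately.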

Usage

cnorm.cv(
  data,
  formula = NULL,
  repetitions = 5,
  norms = TRUE,
  min = 1,
  max = 12,
  cv = "full",
  pCutoff = NULL,
  width = NA,
  raw = NULL,
  group = NULL,
  age = NULL,
  weights = NULL
)

Arguments

data

Data frame containing the norm sample, or a cnorm object. A data frame should already include the ranking, the powers, and the interactions of L and A.

formula

Formula from an existing regression model. If supplied, the 'min' and 'max' arguments are ignored. When a cnorm object is passed, the formula is fetched from it automatically.

repetitions

Number of repetitions for cross-validation.

norms

If TRUE, computes norm score crossfit and R^2. Note: Computationally intensive.

min

Minimum number of terms to start with (default = 1).

max

Maximum number of terms in the model, up to (k + 1) * (t + 1) - 1, where k and t are the power degrees for location and age.

cv

"full" (default) splits data into training/validation, then ranks. Otherwise, expects a pre-ranked dataset.

pCutoff

Cutoff p value used to check the stratification for unbalanced data sampling. A t-test is performed per group. When left at NULL, the cutoff defaults to 0.2 to keep the beta error low.

width

If provided, ranking is done via 'rankBySlidingWindow'; otherwise, ranking is done by group.

raw

Name of the raw score variable.

group

Name of the grouping variable.

age

Name of the age variable.

weights

Name of the weighting parameter.
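The upper bound given for 'max', (k + 1) * (t + 1) - 1, can be checked by counting terms directly. The sketch below assumes that k and t denote the power degrees for location (L) and age (A), so that the full model contains every product L^i * A^j except the intercept. The helper `max_terms` is hypothetical and not part of cNORM.

```r
# Hypothetical helper: upper bound on the number of terms for given k and t
max_terms <- function(k, t) (k + 1) * (t + 1) - 1

# Cross-check by enumerating all exponent pairs (i, j) except (0, 0)
count_terms <- function(k, t) {
  exps <- expand.grid(L = 0:k, A = 0:t)
  sum(!(exps$L == 0 & exps$A == 0))
}

max_terms(k = 4, t = 4)    # 24
count_terms(k = 4, t = 4)  # 24
max_terms(k = 5, t = 3)    # 23
```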

Details

Successive models with an increasing number of terms are evaluated, and the RMSE for raw scores is plotted for the training, validation, and complete datasets. If 'norms' is set to TRUE (default), the function also calculates the mean norm score reliability and crossfit measures. Because norm score calculations are computationally demanding, execution can be slow, especially with many repetitions or terms.
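The general procedure, repeated random 80/20 splits with models of increasing complexity scored by RMSE, can be sketched in base R. The example is illustrative only: it uses simulated data, a plain (unstratified) split, and the polynomial degree of an `lm` fit as a stand-in for the number of terms. None of the names below belong to cNORM.

```r
# Sketch of repeated Monte Carlo cross-validation (not the cNORM internals).
set.seed(1)
n <- 300
x <- runif(n, 0, 1)
d <- data.frame(x = x, y = 2 + 3 * x - 4 * x^2 + rnorm(n, sd = 0.3))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

repetitions <- 5
max_terms   <- 6
train_err <- matrix(NA_real_, repetitions, max_terms)
valid_err <- matrix(NA_real_, repetitions, max_terms)

for (r in seq_len(repetitions)) {
  # New random 80/20 split in every repetition
  idx   <- sample.int(n, size = round(0.8 * n))
  train <- d[idx, ]
  valid <- d[-idx, ]
  for (k in seq_len(max_terms)) {
    fit <- lm(y ~ poly(x, k), data = train)
    train_err[r, k] <- rmse(train$y, fitted(fit))
    valid_err[r, k] <- rmse(valid$y, predict(fit, newdata = valid))
  }
}

# Average over repetitions: one row per model size
res <- data.frame(
  terms      = seq_len(max_terms),
  rmse_train = colMeans(train_err),
  rmse_valid = colMeans(valid_err)
)
print(res)

# Plot both curves when run interactively
if (interactive()) {
  matplot(res$terms, res[, c("rmse_train", "rmse_valid")], type = "b",
          pch = 1:2, xlab = "Number of terms", ylab = "RMSE")
}
```

The training RMSE can only fall as terms are added, so the validation curve is the one to watch: the point where it stops improving marks a reasonable model size.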

When 'cv' is set to "full" (default), the training and validation datasets are ranked separately, providing comprehensive cross-validation. For a more streamlined validation that covers only the modeling step, a pre-ranked dataset can be used instead. The output comprises the RMSE for the raw score models, norm score R^2, delta R^2, crossfit, and the norm score SE according to Oosterhuis, van der Ark, & Sijtsma (2016).

For assessing overfitting:

CROSSFIT = R(Training; Model)^2 / R(Validation; Model)^2

A CROSSFIT > 1 suggests overfitting, < 1 suggests potential underfitting, and values around 1 are optimal, given a low raw score RMSE and high norm score validation R^2.
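To make the ratio concrete, the following sketch computes a crossfit value for three models of different complexity. It assumes R^2 is taken as the squared correlation between observed and predicted values and uses simulated data with ordinary `lm` fits; cnorm.cv computes the measure on norm scores, which this example does not reproduce.

```r
# Sketch: crossfit = R^2 (training) / R^2 (validation) on simulated data.
set.seed(7)
n <- 400
x <- runif(n)
d <- data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.5))

idx   <- sample.int(n, size = round(0.8 * n))
train <- d[idx, ]
valid <- d[-idx, ]

# Assumption: R^2 as squared correlation of observed and predicted values
r2 <- function(obs, pred) cor(obs, pred)^2

crossfit <- function(degree) {
  fit      <- lm(y ~ poly(x, degree), data = train)
  r2_train <- r2(train$y, fitted(fit))
  r2_valid <- r2(valid$y, predict(fit, newdata = valid))
  c(degree = degree, r2_train = r2_train, r2_valid = r2_valid,
    crossfit = r2_train / r2_valid)
}

res <- as.data.frame(t(sapply(c(1, 3, 10), crossfit)))
print(res)
```

A ratio well above 1 means the model explains the training data noticeably better than unseen data, which is the pattern the overfitting criterion flags.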

Suggestions for ideal model selection:

  • Visual inspection of percentiles with 'plotPercentiles' or 'plotPercentileSeries'.

  • Pair visual inspection with repeated cross-validation (e.g., 10 repetitions).

  • Aim for low raw score RMSE and high norm score R^2, avoiding numbers of terms that show substantial overfit (e.g., crossfit > 1.1).
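The numeric part of these suggestions can be applied to a results table as a simple filter. The table below is invented purely for illustration, and its column names are assumptions rather than the names cnorm.cv returns.

```r
# Hypothetical cross-validation results (values made up for illustration)
cv_table <- data.frame(
  terms      = 1:6,
  rmse_valid = c(4.10, 3.20, 2.95, 2.90, 2.91, 2.97),
  r2_valid   = c(0.900, 0.950, 0.975, 0.978, 0.977, 0.972),
  crossfit   = c(0.99, 1.00, 1.01, 1.03, 1.12, 1.20)
)

# Drop model sizes with substantial overfit, then take the lowest validation RMSE
candidates <- cv_table[cv_table$crossfit <= 1.1, ]
best <- candidates[which.min(candidates$rmse_valid), ]
print(best)
```

Such a filter only narrows the choice. The selected model should still be checked visually with 'plotPercentiles' or 'plotPercentileSeries'.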

Value

Table with results per term number: RMSE for raw scores, R^2 for norm scores, and crossfit measure.

References

Oosterhuis, H. E. M., van der Ark, L. A., & Sijtsma, K. (2016). Sample Size Requirements for Traditional and Regression-Based Norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638

See Also

Other model: bestModel(), checkConsistency(), derive(), modelSummary(), print.cnorm(), printSubset(), rangeCheck(), regressionFunction(), summary.cnorm()

Examples

## Not run: 
# Example: Plot cross-validation RMSE by number of terms (up to 9) with three repetitions.
result <- cnorm(raw = elfe$raw, group = elfe$group)
cnorm.cv(result$data, min = 2, max = 9, repetitions = 3)

# Using a cnorm object examines the predefined formula.
cnorm.cv(result, repetitions = 1)

# For cross-validation without a cnorm model, rank data first and compute powers:
data <- rankByGroup(data = elfe, raw = "raw", group = "group")
data <- computePowers(data)
cnorm.cv(data)

# Specify formulas deliberately:
data <- rankByGroup(data = elfe, raw = "raw", group = "group")
data <- computePowers(data)
cnorm.cv(data, formula = formula(raw ~ L3 + L1A1 + L3A3 + L4 + L5))

## End(Not run)


WLenhard/cNORM documentation built on April 1, 2024, 5:41 p.m.