temporal_forest: Temporal Forest for Longitudinal Feature Selection

View source: R/temporal_forest.R

temporal_forest    R Documentation

Description

The main user-facing function for the TemporalForest package. It performs the complete three-stage algorithm to select a top set of features from high-dimensional longitudinal data.

Usage

temporal_forest(
  X = NULL,
  Y,
  id,
  time,
  dissimilarity_matrix = NULL,
  n_features_to_select = 10,
  min_module_size = 4,
  n_boot_screen = 50,
  keep_fraction_screen = 0.25,
  n_boot_select = 100,
  alpha_screen = 0.2,
  alpha_select = 0.05
)

Arguments

X

A list of numeric matrices, one for each time point. The rows of each matrix should be subjects and columns should be predictors. Required unless dissimilarity_matrix is provided.

Y

A numeric vector for the longitudinal outcome.

id

A vector of subject identifiers.

time

A vector of time point indicators.

dissimilarity_matrix

An optional pre-computed dissimilarity matrix (e.g., 1 - TOM). If provided, the network construction step (Stage 1) is skipped. The matrix must be square with predictor names as rownames and colnames. Defaults to NULL.

n_features_to_select

The number of top features to return in the final selection. This is passed to the number_selected_final argument of the internal function. Defaults to 10.

min_module_size

The minimum number of features in a module. Passed to the minClusterSize argument of the internal function. Defaults to 4.

n_boot_screen

The number of bootstrap repetitions for the initial screening stage within modules. Defaults to 50.

keep_fraction_screen

The proportion of features to keep from each module during the screening stage. Defaults to 0.25.

n_boot_select

The number of bootstrap repetitions for the final stability selection stage. Defaults to 100.

alpha_screen

The significance level for splitting in the screening stage trees. Defaults to 0.2.

alpha_select

The significance level for splitting in the selection stage trees. Defaults to 0.05.

Details

The function executes a three-stage process:

  1. Time-Aware Module Construction: Builds a consensus network across time points to identify modules of stably co-correlated features.

  2. Within-Module Screening: Uses bootstrapped mixed-effects model trees (glmertree) to screen for important predictors within each module.

  3. Stability Selection: Performs a final stability selection step on the surviving features to yield a reproducible final set.

Unbalanced Panels: The algorithm accepts unbalanced panel data (i.e., subjects with missing time points), provided the inputs are already aligned as described under Input contract. The consensus TOM is constructed from the time points available, and the mixed-effects models do not require a balanced design.

Outcome Family: The current version is designed for Gaussian (continuous) outcomes, as it relies on glmertree::lmertree. Support for other outcome families is not yet implemented.

Reproducibility (Determinism): For reproducible results, it is recommended to set a seed using set.seed() before running. The algorithm has both stochastic and deterministic components:

  • Stochastic (depends on set.seed()): The bootstrap resampling of subjects in both the screening and selection stages.

  • Deterministic (does not depend on set.seed()): The network construction process (correlation, adjacency, and TOM calculation).
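
The stochastic component can be illustrated with a small base-R sketch of subject-level bootstrap resampling. The data frame dat and the helper bootstrap_subjects are hypothetical and are not part of the package; the package's internal resampling may differ in detail. The sketch only shows why fixing the seed makes the resamples repeat.

```r
# Hypothetical long-format data: 4 subjects, 2 visits each
dat <- data.frame(
  id   = rep(1:4, each = 2),
  time = rep(1:2, times = 4),
  y    = c(0.5, 0.9, 1.1, 1.4, -0.2, 0.1, 2.0, 2.3)
)

# One bootstrap replicate: draw subjects with replacement and keep all
# rows of each drawn subject (illustrative helper, not a package function)
bootstrap_subjects <- function(dat) {
  ids <- unique(dat$id)
  drawn <- sample(ids, size = length(ids), replace = TRUE)
  do.call(rbind, lapply(seq_along(drawn), function(b) {
    rows <- dat[dat$id == drawn[b], , drop = FALSE]
    rows$boot_id <- b  # relabel so a subject drawn twice forms two clusters
    rows
  }))
}

set.seed(11)
boot_a <- bootstrap_subjects(dat)
set.seed(11)
boot_b <- bootstrap_subjects(dat)

# Same seed, same resample
print(identical(boot_a, boot_b))
```

Calling set.seed() once, immediately before temporal_forest(), plays the same role for the bootstrap steps in the screening and selection stages.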

Value

An object of class TemporalForest with:

  • top_features (character): the K selected features (K = n_features_to_select), in descending order of bootstrap selection frequency.

  • candidate_features (character): all features that survived within-module screening (Stage 2) and entered the final stability selection (Stage 3).

Input contract

  • X: list of numeric matrices, one per time point; columns (names and order) must be identical across all time points. The function does not reorder or reconcile columns.

  • Row order / binding rule: when rows from X are stacked internally, they are assumed to already be in subject-major × time-minor order in the user's data. The function does not re-order subjects or time.

  • Y, id, time: vectors of equal length. id and time may be integer/character/factor; time is coerced to a numeric sequence via as.numeric(as.factor(time)).

  • Missing values: this function does not perform NA filtering or imputation. Users should pre-clean the data before calling it (e.g., compute keep <- complete.cases(Y, id, time) and drop the flagged rows from every input).
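
The time coercion and the NA pre-cleaning step can be tried on a toy input. The sketch below uses hypothetical vectors (id, time, Y, visit) and base R only. It shows that as.numeric(as.factor(time)) keeps the ordering of time points but not their spacing, and that explicit factor levels control the ordering of character labels.

```r
# Hypothetical long-format vectors in subject-major, time-minor order
id   <- c("s1", "s1", "s1", "s2", "s2")
time <- c(0, 6, 12, 0, 12)            # months; s2 has no month-6 visit
Y    <- c(1.2, 1.9, NA, 0.7, 1.4)

# The coercion described above: sorted unique values become 1, 2, 3, ...
time_index <- as.numeric(as.factor(time))
print(time_index)  # 1 2 3 1 3 (the 0/6/12 spacing is not retained)

# For character labels, explicit levels fix the intended order;
# as.factor() leaves an existing factor unchanged
visit <- factor(
  c("baseline", "month6", "month12", "baseline", "month12"),
  levels = c("baseline", "month6", "month12")
)
visit_index <- as.numeric(as.factor(visit))

# Flag rows with no missing value in Y, id, or time
keep <- complete.cases(Y, id, time)
Y_clean    <- Y[keep]
id_clean   <- id[keep]
time_clean <- time[keep]
```

The matching rows of X need the same filtering so that all inputs stay aligned under the binding rule; how those rows map onto the per-time matrices depends on the user's data layout.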

Unbalanced panels

Missing time points per subject are allowed provided the user supplies X, Y, id, time that already align under the binding rule above. Stage 1 builds a TOM at the feature level for each available time-point matrix; the consensus TOM is the element-wise minimum across time points. Subject-level missingness at a given time does not prevent feature-wise similarity from being computed at other times. This function does not perform any subject-level alignment across time.
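
A minimal base-R sketch of the element-wise-minimum consensus follows. It assumes an unsigned adjacency |cor|^beta and the standard unsigned TOM formula; the soft power beta = 6, the helper tom_from_matrix, and the simulated matrices are illustrative assumptions, and the package's own Stage 1 may differ in these choices.

```r
set.seed(1)
p <- 6
feature_names <- paste0("V", seq_len(p))

# Hypothetical unbalanced panel: 30 subjects at time 1, 24 at time 2
X <- list(
  matrix(rnorm(30 * p), 30, p, dimnames = list(NULL, feature_names)),
  matrix(rnorm(24 * p), 24, p, dimnames = list(NULL, feature_names))
)

# Unsigned TOM for one time point (illustrative helper, not a package function)
tom_from_matrix <- function(x, beta = 6) {
  a <- abs(stats::cor(x))^beta          # soft-thresholded adjacency
  diag(a) <- 0
  shared <- a %*% a                     # sum over u of a_iu * a_uj
  k <- rowSums(a)                       # connectivity of each feature
  tom <- (shared + a) / (outer(k, k, pmin) + 1 - a)
  diag(tom) <- 1
  tom
}

# One feature-by-feature TOM per time point; subject counts may differ
tom_list <- lapply(X, tom_from_matrix)

# Consensus TOM: element-wise minimum across time points
consensus_tom <- Reduce(pmin, tom_list)

# Dissimilarity in the form expected by dissimilarity_matrix (1 - TOM)
dissimilarity <- 1 - consensus_tom
```

Because each TOM is p × p regardless of how many subjects were observed at that time point, the minimum is well defined even when the per-time matrices have different numbers of rows.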

Outcome family

The current version targets Gaussian outcomes via glmertree::lmertree. Other families (e.g., binomial/Poisson) are not supported in this version.

Stability selection and thresholds

Final selection is top-K by bootstrap selection frequency (K = n_features_to_select). No probability cutoff (e.g., pi_thr) is applied, and selection probabilities are not returned in the current API.
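
The top-K rule can be sketched in a few lines of base R. The list boot_selected is a hypothetical stand-in for the features chosen in each bootstrap replicate; the package does not expose such an object, and its internal tie handling is not documented here.

```r
# Hypothetical per-replicate selections from 5 bootstrap runs
boot_selected <- list(
  c("V1", "V2", "V3"),
  c("V1", "V2"),
  c("V1", "V3", "V7"),
  c("V1", "V2", "V3"),
  c("V2", "V1")
)
K <- 3  # plays the role of n_features_to_select

# Selection frequency: share of replicates in which each feature appears
counts <- table(unlist(boot_selected))
freq <- setNames(as.numeric(counts), names(counts)) / length(boot_selected)

# Rank by frequency and keep the K most frequent; no cutoff is applied
ranked <- sort(freq, decreasing = TRUE)
top_k <- names(ranked)[seq_len(K)]
print(top_k)  # "V1" "V2" "V3"
```

Under this rule a feature is returned whenever it ranks in the top K, however low its frequency is in absolute terms, which is the practical difference from a threshold-based rule.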

Reproducibility (determinism)

  • Stochastic (affected by set.seed()): bootstrap resampling and tree partitioning.

  • Deterministic: correlation/adjacency/TOM and consensus-TOM given fixed inputs.

Internal validation

An internal helper check_temporal_consistency is called automatically at the start (whenever dissimilarity_matrix is NULL). It throws an error if column names across time points are not identical (names and order).
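
Users who want to catch this condition before calling the function can run a comparable check themselves. The helper same_columns below is a hypothetical stand-in written for illustration; it is not the package's check_temporal_consistency and makes no claim about that helper's signature or error messages.

```r
# TRUE only if every matrix has the same column names in the same order
# (illustrative helper, not a package function)
same_columns <- function(X) {
  ref <- colnames(X[[1]])
  !is.null(ref) &&
    all(vapply(X, function(m) identical(colnames(m), ref), logical(1)))
}

cols <- c("V1", "V2", "V3")
X_ok <- list(
  matrix(0, 2, 3, dimnames = list(NULL, cols)),
  matrix(0, 4, 3, dimnames = list(NULL, cols))
)
X_bad <- list(
  matrix(0, 2, 3, dimnames = list(NULL, cols)),
  matrix(0, 4, 3, dimnames = list(NULL, rev(cols)))  # same names, wrong order
)

print(same_columns(X_ok))   # TRUE
print(same_columns(X_bad))  # FALSE
```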

Note

The current API does not expose selection probabilities, module labels, or a parameter snapshot; these may be added in a future version.

Author(s)

Sisi Shao, Jason H. Moore, Christina M. Ramirez

References

Shao, S., Moore, J.H., Ramirez, C.M. (2025). Network-Guided Temporal Forests for Feature Selection in High-Dimensional Longitudinal Data. Journal of Statistical Software.

See Also

select_soft_power, calculate_fs_metrics_cv, calculate_pred_metrics_cv, check_temporal_consistency

Examples


# Tiny demo: selects V1, V2, V3 quickly (skips Stage 1 via precomputed A)
set.seed(11)
n_subjects <- 60; n_timepoints <- 2; p <- 20

# X: one n_subjects x p predictor matrix per time point, identical column names
X <- replicate(n_timepoints, matrix(rnorm(n_subjects * p), n_subjects, p), simplify = FALSE)
colnames(X[[1]]) <- colnames(X[[2]]) <- paste0("V", 1:p)

# Stack the per-time matrices to simulate the outcome
X_long <- do.call(rbind, X)
id   <- rep(seq_len(n_subjects), each = n_timepoints)
time <- rep(seq_len(n_timepoints), times = n_subjects)

# Outcome: three true signals (V1, V2, V3), a subject-level random intercept, and noise
u <- rnorm(n_subjects, 0, 0.7)
eps <- rnorm(length(id), 0, 0.08)
Y <- 4*X_long[,"V1"] + 3.5*X_long[,"V2"] + 3.2*X_long[,"V3"] + rep(u, each = n_timepoints) + eps

# Precomputed dissimilarity (1 - |correlation|) with predictor names on both dimensions
A <- 1 - abs(stats::cor(X_long)); diag(A) <- 0
dimnames(A) <- list(colnames(X[[1]]), colnames(X[[1]]))

# Small bootstrap counts and lenient alphas keep the demo fast
fit <- temporal_forest(
  X, Y, id, time,
  dissimilarity_matrix = A,
  n_features_to_select = 3,
  n_boot_screen = 6, n_boot_select = 18,
  keep_fraction_screen = 1, min_module_size = 2,
  alpha_screen = 0.5, alpha_select = 0.6
)
print(fit$top_features)


TemporalForest documentation built on Dec. 23, 2025, 1:06 a.m.