Additional Predictor with Maximum Effect Size

library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)

Introduction

This vignette of package maxEff (Github, RPubs) documents three methods of selecting one from many numeric predictors for a regression model, to ensure that the additional predictor has the maximum effect size. The three methods are implemented in functions add_num(), add_dummy() and add_dummy_partition().

Note to Users

Examples in this vignette require that the search path has

library(maxEff)
library(groupedHyperframe)
library(survival)
library(rpart)

Users should remove parameter mc.cores = 1L from all examples and use the default option, which engages all CPU cores on the current host for macOS. The authors are forced to have mc.cores = 1L in this vignette in order to pass CRAN's submission check.

Terms and Abbreviations

c(
  '', 'Forward pipe operator', '`?base::pipeOp` introduced in `R` 4.1.0', 
  '`abs`', 'Absolute value', '`base::abs`',
  '`coxph`', 'Cox model', '`survival::coxph`',
  '`CRAN`, `R`', 'The Comprehensive R Archive Network', 'https://cran.r-project.org',
  '`factor`', 'Factor, or categorical variable', '`base::factor`',
  '`function`', '`R` function', '``base::`function` ``',
  '`groupedHyperframe`', 'Grouped hyper data frame', ' `groupedHyperframe::as.groupedHyperframe`',
  '`head`', 'First parts of an object', '`utils::head`; `utils:::head.default`',
  '`hypercolumns`, `hyperframe`', '(Hyper columns of) hyper data frame', '`spatstat.geom::hyperframe`',
  '`labels`', 'Labels from object', '`base::labels`; `maxEff::labels.node1`',
  '`levels`', 'Levels of a `factor`', '`base::levels`',
  '`listof`', 'List of objects', '`stats::listof`',
  '`logistic`', 'Logistic regression model', '`stats::glm(., family = binomial(\'logit\'))`',
  '`matrix`', 'Matrix', '`base::matrix`',
  '`partition`', 'Stratified partition', '`maxEff::statusPartition`, `caret::createDataPartition`', 
  '`PFS`', 'Progression/recurrence free survival', 'https://en.wikipedia.org/wiki/Progression-free_survival',
  '`predict`', 'Model prediction', '`stats::predict`; `maxEff::predict.add_num`; `maxEff::predict.add_dummy`',
  '`quantile`', 'Quantile', '`stats::quantile`',
  '`rpart`', 'Recursive partitioning and regression trees', '`rpart::rpart`',
  '`S3`, `generic`, `methods`', '`S3` object oriented system',  '`base::UseMethod`; `utils::methods`; `utils::getS3method`; https://adv-r.hadley.nz/s3.html',
  '`sort_by`', 'Sort an object by some criterion', '`base::sort_by`; `maxEff::sort_by.add_`',
  '`subset`', 'Subsets of object by conditions', '`base::subset`; `maxEff::subset.add_dummy`',
  '`Surv`', 'Survival object', '`survival::Surv`',
  '`update`', 'Update and re-fit a model call', '`stats::update`'
) |>
  matrix(nrow = 3L, dimnames = list(c('Term / Abbreviation', 'Description', 'Reference'), NULL)) |>
  t.default() |>
  as.data.frame.matrix() |> 
  kable()

Acknowledgement

This work is supported by NCI R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui) and R01CA253977 (H. Rui and I. Chervoneva).

Handy Tools

node1(): Dichotomization via First Node of Recursive Partitioning

Consider the following recursive partitioning and regression example,

data(cu.summary, package = 'rpart')
(r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L))

Function node1() creates a dichotomizing rule, i.e., a function, based on the first node of partitioning,

(foo = r |> node1())

The dichotomizing rule foo converts a numeric object to a logical object.

set.seed(125); rnorm(6, mean = 24.5) |> foo()

Developers may obtain the numeric cutoff value of foo, or a brief text of to describe foo, for further analysis.

foo |> get_cutoff()
foo |> labels()

statusPartition(): Stratified Partition by Survival Status

Function statusPartition() performs stratified partition based on the survival status of a Surv response, to avoid the situation that Cox model is degenerate if all subjects are censored. This is a deviation from the very popular function caret::createDataPartition(), which stratifies Surv response by the quantiles of the survival time.

y = (survival::capacitor) |> 
  with(expr = Surv(time, status))
set.seed(15); id = y |>
  statusPartition(times = 1L, p = .5)
table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status
set.seed(15); id0 = y |>
  caret::createDataPartition(times = 1L, p = .5)
table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status

Data Preparation

Data set Ki67 in package groupedHyperframe is a groupedHyperframe.

data(Ki67, package = 'groupedHyperframe')
Ki67
s = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))

Candidate of additional predictors are stored in a numeric-hypercolumn logKi67.quantile. Users are encouraged to learn more about groupedHyperframe class and function aggregate_quantile() from package groupedHyperframe vignette.

Partition into a training (80%) and test (20%) set.

set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int()
s0 = s[id, , drop = FALSE] # training set
s1 = s[-id, , drop = FALSE] # test set

Let's consider a starting model of endpoint PFS with predictor Tstage on the training set s0.

summary(m <- coxph(PFS ~ Tstage, data = s0))

Additional numeric Predictor with Maximum Effect Size

Function add_num() treats each additional predictor as a numeric variable, and updates the starting model with each additional predictor. Function add_num() returns an 'add_num' object, which is a listof objects with an internal class 'add_num_'.

The S3 generic sort_by() sorts the 'add_num' object by the absolute value of the regression coefficient (i.e., effect size) of the additioanal numeric predictor.

The S3 generic head() chooses the top n element from the object returned from the previous step.

set.seed(14837); m1 = m |>
  add_num(x = ~ logKi67.quantile, mc.cores = 1L) |>
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m1

The S3 generic predict() uses model m1 on the test set s1.

m1[1L] |> predict(newdata = s1)

Additional logical Predictor with Maximum Effect Size

add_dummy(): Naive Practice

Function add_dummy() partitions each additional numeric predictor into a logical variable using function node1(), and updates the starting model by adding in each of the dichotomized logical predictor. Function add_dummy() returns an 'add_dummy' object, which is a listof node1 objects.

The S3 generic subset() subsets the the 'add_dummy' object by the balance of partition of the additional predictor.

The S3 generic sort_by() sorts the 'add_dummy' object by the absolute value of regression coefficient (i.e., effect size) of the additional logical predictor.

The S3 generic head() chooses the top n element from the object returned from the previous step.

set.seed(14837); m2 = m |>
  add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m2

The S3 generic predict() uses model m2 on the test set s1.

m2[1L] |> predict(newdata = s1)

add_dummy_partition(): via Repeated Partitions

Function add_dummy_partition() partitions each additional numeric predictor into a logical variable in the following steps.

  1. Generate multiple, i.e., repeated, partitions.
  2. For each partition, create a dichotomizing rule (via function node1()) on the training set. Apply this dichotomizing rule on the test set and obtain the estimated regression coefficient (i.e., effect size) of the additional logical predictor.
  3. Among all partitions, select the one with median effect size of the additional logical predictor.

Function add_dummy_partition() also returns an 'add_dummy' object.

set.seed(14837); m3 = m |> 
  add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
m3
m3[1L] |> predict(newdata = s1)


Try the maxEff package in your browser

Any scripts or data that you put into this service are public.

maxEff documentation built on April 12, 2025, 2:11 a.m.