Memory-efficient Cox proportional hazards regression via streaming
Newton-Raphson. Peak RAM is O(p^2) in the number of covariates p and flat in the
number of rows n, so models can be fit on datasets that do not fit in memory.
Coefficients are identical to survival::coxph() with Efron tie correction.

Development version from GitHub:
# install.packages("remotes")
remotes::install_github("tommycarstensen/coxstream-r")
In-memory fit, with the same formula interface as coxph():
library(survival)
library(coxstream)
fit <- coxstream(Surv(time, status) ~ age + sex, data = lung)
coef(fit)
fit
Out-of-core fit, streaming a time-DESCENDING-sorted parquet file one row group
at a time (requires the optional arrow package):
fit <- coxstream_arrow(
  "events_sorted.parquet",
  x_cols = c("age", "sex"),
  time_col = "duration",
  event_col = "event"
)
coef(fit)
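
The input file has to be sorted by descending time before it is streamed. A minimal sketch of preparing such a file follows. The toy data, the temporary output path, and the row-group size are assumptions made for illustration. Only the column names come from the call above. Whether coxstream_arrow() expects any particular ordering within tied times is not stated here, so treat this as a starting point rather than the package's prescribed workflow.

```r
# Toy event table (made-up values); column names match the call above.
events <- data.frame(
  duration = c(5, 12, 3, 12, 8, 1),
  event    = c(1L, 0L, 1L, 1L, 0L, 1L),
  age      = c(61, 54, 70, 66, 49, 73),
  sex      = c(1L, 2L, 1L, 2L, 2L, 1L)
)

# Sort rows by descending time, as the streaming fit expects.
events_sorted <- events[order(events$duration, decreasing = TRUE), ]

# Write the sorted table to Parquet. chunk_size sets the number of rows per
# row group; 2 is only for illustration, real files would use far larger groups.
out_path <- file.path(tempdir(), "events_sorted.parquet")
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(events_sorted, out_path, chunk_size = 2)
}
```

The row-group size chosen at write time bounds how many rows a reader that works one row group at a time has to hold at once.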
The reader loads one row-group chunk at a time and frees it before the next, so peak RAM stays at O(batch_size * p + p^2), flat in n. Efron tie groups that span chunk boundaries are carried in running state, giving coefficients bit-identical to the in-memory fit.
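
The idea of carrying an open tie group across a chunk boundary can be sketched in plain R. This is an illustration under stated assumptions, not the package's code: the function names (new_state, close_group, update_state, stream_score) are hypothetical, chunks are passed as in-memory lists rather than read from Parquet, and only the Efron score vector is tracked to keep the example short.

```r
# Running state carried between chunks (score only, for brevity).
new_state <- function(p) {
  list(S0 = 0, S1 = numeric(p),   # risk-set sums over all rows seen so far
       D0 = 0, D1 = numeric(p),   # same sums over events in the open tie group
       d = 0, sx = numeric(p),    # event count and covariate sum of the open group
       t_open = NA_real_,         # time value of the open tie group
       score = numeric(p))
}

# Add the Efron contribution of the open tie group, then reset the group sums.
close_group <- function(st) {
  if (st$d > 0) {
    st$score <- st$score + st$sx
    for (l in seq_len(st$d) - 1) {
      f <- l / st$d
      st$score <- st$score - (st$S1 - f * st$D1) / (st$S0 - f * st$D0)
    }
  }
  p <- length(st$S1)
  st$D0 <- 0; st$D1 <- numeric(p); st$d <- 0; st$sx <- numeric(p)
  st
}

# Consume one chunk (rows in descending time). A tie group that is still open
# at the end of the chunk stays in `st` and is closed by a later chunk.
update_state <- function(st, time, status, X, beta) {
  for (i in seq_along(time)) {
    if (!is.na(st$t_open) && time[i] != st$t_open) st <- close_group(st)
    st$t_open <- time[i]
    x <- X[i, ]
    w <- exp(sum(x * beta))
    st$S0 <- st$S0 + w
    st$S1 <- st$S1 + w * x
    if (status[i] == 1) {
      st$D0 <- st$D0 + w; st$D1 <- st$D1 + w * x
      st$d <- st$d + 1;   st$sx <- st$sx + x
    }
  }
  st
}

stream_score <- function(chunks, beta) {
  st <- new_state(length(beta))
  for (ch in chunks) st <- update_state(st, ch$time, ch$status, ch$X, beta)
  close_group(st)$score   # close the last group after the final chunk
}

# Toy data: the two events tied at time 2 sit on either side of the chunk split.
time <- c(3, 2, 2, 1); status <- c(1, 1, 1, 0)
X <- matrix(c(1, 0, 1, 0), ncol = 1)
chunk <- function(idx) list(time = time[idx], status = status[idx],
                            X = X[idx, , drop = FALSE])
whole     <- stream_score(list(chunk(1:4)), beta = 0)
split_run <- stream_score(list(chunk(1:2), chunk(3:4)), beta = 0)
```

Because a group is closed only when a row with a different time arrives, the arithmetic runs in the same order however the rows are chunked, so whole and split_run agree exactly in this sketch.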
Each Newton-Raphson iteration makes a single descending-time pass to accumulate the Cox partial-likelihood score and Hessian. Only running sums of size O(p) and O(p^2) are held, never the full risk set, so memory does not grow with n. The accumulation kernel is implemented in C++ via Rcpp.
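
The pass described above can be sketched in base R as follows. This is a simplified illustration, not the package's C++ kernel: cox_pass and cox_newton are hypothetical names, the data are held in memory, and the Newton-Raphson loop runs a fixed number of iterations with no step halving or convergence check.

```r
# Score and information matrix of the Efron partial likelihood at `beta`,
# from one pass over rows sorted by descending time.
cox_pass <- function(time, status, X, beta) {
  p <- ncol(X); n <- length(time)
  S0 <- 0; S1 <- numeric(p); S2 <- matrix(0, p, p)  # sums over the risk set so far
  D0 <- 0; D1 <- numeric(p); D2 <- matrix(0, p, p)  # sums over events tied at the current time
  d <- 0; sx <- numeric(p)                          # tied event count and covariate sum
  score <- numeric(p); info <- matrix(0, p, p)
  for (i in seq_len(n)) {
    x <- X[i, ]
    w <- exp(sum(x * beta))
    S0 <- S0 + w; S1 <- S1 + w * x; S2 <- S2 + w * tcrossprod(x)
    if (status[i] == 1) {
      D0 <- D0 + w; D1 <- D1 + w * x; D2 <- D2 + w * tcrossprod(x)
      d <- d + 1; sx <- sx + x
    }
    # Last row of a tie group: add its Efron terms and reset the group sums.
    if (i == n || time[i + 1] != time[i]) {
      if (d > 0) {
        score <- score + sx
        for (l in seq_len(d) - 1) {
          f <- l / d
          m <- (S1 - f * D1) / (S0 - f * D0)   # weighted covariate mean
          score <- score - m
          info <- info + (S2 - f * D2) / (S0 - f * D0) - tcrossprod(m)
        }
      }
      D0 <- 0; D1 <- numeric(p); D2 <- matrix(0, p, p); d <- 0; sx <- numeric(p)
    }
  }
  list(score = score, info = info)
}

# Newton-Raphson: one pass per iteration, then a p x p linear solve.
cox_newton <- function(time, status, X, iters = 20) {
  beta <- numeric(ncol(X))
  for (k in seq_len(iters)) {
    g <- cox_pass(time, status, X, beta)
    beta <- beta + solve(g$info, g$score)
  }
  beta
}

# Toy data with two events tied at time 2.
time <- c(3, 2, 2, 1); status <- c(1, 1, 1, 0)
X <- matrix(c(1, 0, 1, 0), ncol = 1)
g0 <- cox_pass(time, status, X, beta = 0)
beta_hat <- cox_newton(time, status, X)
```

The loop keeps only S0, S1, S2 and the per-group sums, so its working state is O(p^2) regardless of the number of rows. For this toy data the score equation can be solved by hand, giving exp(beta) = 1/sqrt(6), which beta_hat approaches.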
coxstream_arrow()), testthat (tests)
License: MIT, except src/arrow_c_abi.h, which is vendored from Apache Arrow under
Apache-2.0; see inst/COPYRIGHTS.