knitr::opts_chunk$set( collapse = TRUE, message = FALSE, warning = FALSE, comment = "#>", fig.width = 8 ) invisible(suppressPackageStartupMessages(library(tidyverse)))
The {mxfda}
package contains tools for analyzing spatial single-cell data, including multiplex imaging and spatial transcriptomics, using methods from functional data analysis. Analyses for this package are executed and stored using an S4 object of class mxFDA
. This vignette outlines how to perform functional regression using spatial summary functions of multiplex image samples as model covariates. To set up an mxFDA
object from spatial single cell data, calculate spatial summary functions, and perform exploratory data analysis and visualization of these spatial summary functions, see vignette("mx_fda")
. To perform feature extraction using functional principal component analysis, see vignette("mx_fpca")
.
library(mxfda) library(tidyverse) library(ggpubr) library(broom)
Here we load data from the Ovarian cancer dataset where univariate nearest-neighbor G-functions for immune cells have already been estimated. See the vignette mxfda::mx_fda
for more details on extracting spatial summary functions.
data("ovarian_FDA")
We visualize these functions below.
plot(ovarian_FDA, y = "fundiff", what = "uni g") + geom_hline(yintercept = 0, color = "red", linetype = 2) + theme_minimal() + ggtitle("Nearest neighbor G-functions for immune cells")
The plot shows the G-function values as a function of radius (r) for multiple immune cell samples, where each black line is an individual sample. The red dashed line at g = 0 indicates a baseline where no spatial clustering or dispersion is observed. Deviations from this baseline highlight potential patterns of clustering or dispersion of immune cells at varying radii.
Single-cell imaging is common in cancer applications, where the analysis goal is to relate cell-level spatial information to patient-level time-to-event outcomes like overall survival or time-to-recurrence. Cox regression is a popular regression technique for assessing the association between covariates and a survival outcome. The standard Cox model is given by
$$\log h_i(t) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip},$$
where $h_i(t)$ is the hazard rate for subject $i$ and $h_0(t)$ is an unspecified baseline hazard. Each $Z_{ip}$ is a scalar covariate, and $P$ is the total number of scalar covariates. These are covariates like age, sex, treatment group, etc and can be contrasted with spatial summary functions which are known as functional covariates. The $\gamma_p$ are regression coefficients; these are interpreted as log hazard ratios.
For single cell spatial data, we have a spatial summary function for each subject (e.g. univariate Ripley's K), $X_i(r)$, that we want to add to the Cox regression model as a functional covariate. Below we provide three different strategies for incorporating $X_i(r)$ into a Cox regression model as a covariate.
In this approach, we first decompose the spatial summary functions using FPCA, then use subject-specific scores from FPCA as covariates in a Cox regression model. Conceptually, we are fitting the model given by
$$\log h_i(t) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \sum_{k = 1}^K \beta_k c_{ik},$$
where $c_{ik}$ is the $k$th score from functional principal components analysis for the $i$th subject. $K$ represents the number of principal components, usually selected to explain at least 95% of the variance in the spatial summary functions (see vignette("mx_fpca")
). The $\beta_k$ are regression coefficients that tell us about the association between overall survival and the $k$th functional principal component. Like the $\gamma_p$, the the $\beta_k$ are interpreted as log hazard ratios, and $e^{\beta_k}$ is a hazard ratio. This is the approach to Cox regression with spatial summary functions as covariates taken in @vu2023funspace.
The code below runs FPCA the G-functions from the ovarian cancer data and calculates principal components that explain up to 99% variance using the argument pve
. The argument lightweight = TRUE
returns only what is needed for downstream analysis. To return all items calculate by FPCA when using refund::fpca.face
, use lightweight = FALSE
.
ovarian_FDA <- run_fpca(ovarian_FDA, metric = "uni g", r = "r", value = "fundiff", lightweight = TRUE, pve = .99) ovarian_FDA
Six FPCs were returned. Below, we extract the fpca scores from the ovarian_FDA
object and visualize the relationship between the first two FPC scores and survival time.
Gdf_fpc = extract_fpca_scores(ovarian_FDA, 'uni g fpca') p1 = Gdf_fpc %>% mutate(event = factor(event, levels = 0:1, labels = c("censored", "died"))) %>% ggplot(aes(fpc1, survival_time, color = event)) + geom_point() + labs(y = "survival time (days)", title = "fpc1") + theme(legend.position = c(.5, .7)) p2 = Gdf_fpc %>% mutate(event = factor(event, levels = 0:1, labels = c("censored", "died"))) %>% ggplot(aes(fpc2, survival_time, color = event)) + geom_point() + labs(y = "survival time (days)", title = "fpc2") + theme(legend.position = "none") ggarrange(p1, p2, nrow = 1, ncol = 2)
The figure above shows scatter plots of survival time versus functional principal components for FPC 1 and FPC 2. The left panel shows the relationship between survival time (in days) and the first functional principal component (fpc1) score, while the right panel illustrates the relationship with the second component (fpc2). Data points are color-coded based on whether a given patient experienced the event, with red representing censoring and blue representing individuals who died. These plots highlight the variation in survival times as explained by the two principal components. In general, it appears that subjects with lower scores for FPC 1 tended to have longer survival times.
Next, we use these scores as covariates in a Cox regression model. We also control for age.
library(survival) phmod_fpc = coxph(Surv(survival_time, event) ~ fpc1 + fpc2 + fpc3 + fpc4 + age, data = Gdf_fpc)
The results from this model are printed below in a tidy format. Note that all FPCs except FPC 2 have a statistically significant relationship with overall survival. The hazard ratio above 1 for FPC 1 indicates that higher scores are associate with worse survival outcomes, which is consistent with the exploratory analysis in the plot above.
tidy(phmod_fpc, exp = TRUE, conf.int = TRUE) %>% mutate(p.value = format.pval(p.value, digits = 1)) %>% select(term, hazard_ratio = estimate, conf.low, conf.high, p = p.value) %>% knitr::kable(digits = 2)
Linear and additive functional Cox regression models were used in @vu2022spf to model the relationship between survival and spatial summary functions without having to first decompose the spatial summary functions using FPCA. Linear functional cox regression was first introduced by @gellar2015cox and takes the form:
$$\log h_i(t;Z_i, X_i) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \int_{r = 0}^R \beta(r)X_i(r)dr,$$
where $X_i(r)$ is the spatial summary function for subject $i$, and $\beta(r)$ is a coefficient function. For this model, the coefficient function is a log hazard ratio for the spatial summary function at radius $r$. In effect, this allows us to model the association between of spatial clustering and survival at many different radii simultaneously. Here $X_i(r)$ is considered a functional covariate, and the $Z_{ip}$ are scalar covariates.
The code below runs an LFCM for the G-functions from the ovarian cancer data with age as a scalar covariate. Setting the argument afcm = FALSE
runs an LFCM instead of an AFCM.
ovarian_FDA = run_fcm(ovarian_FDA, model_name = "fit_lfcm", formula = survival_time ~ age, event = "event", metric = "uni g", r = "r", value = "fundiff", afcm = FALSE)
Below we print the class and summary of the extracted model.
class(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_lfcm'))
summary(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_lfcm'))
Finally, we visualized the exponentiated estimated coefficient function, $e^{\hat{\beta}(r)}$, from this model. The code below extracts this quantity, then plots $e^{\hat{\beta}(r)}$ as a solid black line and 95% pointwise confidence intervals as dotted black lines. For the linear functional Cox model, $e^{\hat{\beta}(r)}$ is interpreted as a hazard ratio as a function of radius $r$. The red dotted line represents the hazard ratio of 1. Values above the red line indicate an increased hazard, while values below it suggest a decreased hazard. Values of the radius $r$ where the 95% confidence interval does not contain one are statistically significant.
lfcm_surface = extract_surface(ovarian_FDA, metric = "uni g", model = "fit_lfcm", analysis_vars = c("age")) plot(lfcm_surface) + ylim(0, 10)
Additive functional cox regression was first introduced by @cui2021additive and takes the form:
$$\log h_i(t;Z_i, X_i) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \int_{r = 0}^R F\left(r, X_i(r)\right)dr.$$
This model is slightly more flexible than the LFCM, in that it allows the relationship between $X_i(r)$ and survival to vary nonlinearly with $r$.
The code below runs an AFCM with age as a scalar covariate. Setting the argument afcm = TRUE
runs an AFCM.
ovarian_FDA <- run_fcm(ovarian_FDA, model_name = "fit_afcm", formula = survival_time ~ age, event = "event", metric = "uni g", r = "r", value = "fundiff", afcm = TRUE)
class(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_afcm'))
summary(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_afcm'))
The AFCM model produces a coefficient surface, $\hat{F}\left(r, X_i(r)\right)$, rather than a coefficient function, illustrating the relationship between G value and radius $r$ in a continuous gradient. The code below extracts this estimated surface, then on the left plots the surface, and on the right plots regions of the surface that are statistically significant at the 95% confidence level. The color scale represents the magnitude of the hazard ratio.
For example, at $r= 30$ and $G = 0.3$, we observe a statistically significant region with a color corresponding to a hazard ratio value between 1.5 and 2.0. This indicates that at a radius of 30mm, individuals with a G value of 0.3, experience an expected 50% to 100% increased hazard compared to the baseline hazard.
afcm_surface = extract_surface(ovarian_FDA, metric = "uni g", model = "fit_afcm", analysis_vars = c("age"), p = 0.05) plot(afcm_surface)
C-index is a good way to compare across models, especially when incorporating cross validation. Larger c-index is indicative of a better predictive model. Below, we calculate the c-index for each model.
fit_afcm = extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_afcm') fit_lfcm = extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_lfcm') c_index = c( phmod_fpc$concordance[["concordance"]], extract_c(fit_lfcm, Gdf_fpc$survival_time, Gdf_fpc$event), extract_c(fit_afcm, Gdf_fpc$survival_time, Gdf_fpc$event) ) tibble(model = c("fpc", "lfcm", "afcm"), c_index) %>% knitr::kable(digits = 2)
Models with continuous or binary scalar outcomes and functional covariates are also possible, and are referred to as scalar-on-function regression (sofr) models in the functional data analysis literature (@goldsmith2011penalized).
Below we fit a sofr model with age as the outcome, G-functions from the ovarian cancer data as the functional covariate, and no additional scalar covariates using the run_sofr()
function.
ovarian_FDA <- run_sofr(ovarian_FDA, model_name = "fit_sofr_age", formula = age ~ 1, metric = "uni g", r = "r", value = "fundiff")
This model also produces an estimated coefficient surface $\hat{\beta}(r)$, which can be visualized. The code below extracts this quantity, then plots $\hat{\beta}(r)$ as a solid black line and 95% pointwise confidence intervals as dotted black lines. Values of the radius $r$ where the 95% confidence interval does not contain zero are statistically significant. At each radius value $r$, for a continuous outcome $\hat{\beta}(r)$ can be interpreted as a regular linear regression coefficient.
model = extract_model(ovarian_FDA, 'uni g', type = 'sofr', model_name = 'fit_sofr_age') plot(model, ylab=expression(paste(beta(t))), xlab="t")
Below we fit a sofr model with stage as the outcome, G-functions from the ovarian cancer data as the functional covariate, and age as a scalar covariate. Stage is a binary outcome, and as a result this model can also be called functional logistic regression. When the outcome is binary, you need to specify the argument family = "binomial"
. In addition, the outcome should be coded as a numeric rather than a factor or character variable.
ovarian_FDA <- run_sofr(ovarian_FDA, model_name = "fit_sofr_stage", formula = stage ~ age, family = "binomial", metric = "uni g", r = "r", value = "fundiff")
When the outcome is binary $\hat{\beta}(r)$ is interpreted as a log odds ratio at each radius $r$.
model = extract_model(ovarian_FDA, 'uni g', type = 'sofr', model_name = 'fit_sofr_stage') plot(model, ylab=expression(paste(beta(t))), xlab="t")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.