    library(SomaDataIO)
    library(ggplot2)
    knitr::opts_chunk$set(
      collapse = TRUE,
      comment  = "#>",
      fig.path = "figures/lifting-"
    )

    calc_ccc <- function(x, y) {
      k   <- length(x)
      sdx <- sd(x)
      sdy <- sd(y)
      rho <- stats::cor(x, y, method = "pearson")
      v   <- sdx / sdy   # scale shift
      sx2 <- stats::var(x) * (k - 1) / k
      sy2 <- stats::var(y) * (k - 1) / k
      # location shift relative to scale
      u <- (mean(x) - mean(y)) / ((sx2 * sy2)^0.25)
      rho * ((v + 1 / v + u^2) / 2)^-1
    }

`SomaDataIO` contains functionality to bridge (aka "lift") between various SomaScan versions by linear transformations of RFU data. Lifting between various versions is essentially a calibration of the analytes/features in RFU space.
The SomaScan platform continually improves its technical processes between assay versions. The primary change of interest is content expansion, but other protocol changes may also be implemented, including changes to reagents, liquid handling equipment, and well volumes.
For any given analyte, these technical upgrades may result in minute measurement signal differences, requiring a calibration (aka "lifting" or "bridging") to bring RFU values into a comparable signal space. This is accomplished by applying an analyte-specific scalar, a linear transformation, to each analyte RFU measurement (column).
| Version | Commercial Name | Size |
|:------------- |:-------------------:| -------------:|
| V4 | 5k | 5284 |
| v4.1 | 7k | 7596 |
| v5.0 | 11k | 11083 |
There are 4 main requirements in order to reliably bridge across SomaScan signal space:
- `soma_adat` object attributes, where SomaScan signal information is stored, must be intact (see `is_intact_attr()`).
- [...] `5k`, `7k`, or `11k`. Older versions of SomaScan are not supported.

Lifting (aka "bridging") scalars are numeric values used to multiply a vector of RFU values to linearly transform them into another signal space.
Lifting scalars are generated from matched samples (n $>$ 1000) from a healthy, normal reference population that were run across assay versions. This experiment was run separately for serum and plasma. All SomaScan runs were first normalized as per the standard normalization procedure, and flagged samples were removed prior to further analysis.
For each analyte, the lifting scalar is computed as the ratio of population medians between assay versions. For example, the linear scalar for the $i^{th}$ analyte translating from `11k` $\rightarrow$ `7k` is defined as:

$$ R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}, $$

where $\hat\mu$ is the median signal for the $i^{th}$ analyte. Signals generated in `11k` space can be multiplied by this scale factor to translate into `7k` space.
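The ratio-of-medians definition can be illustrated with a minimal base R sketch. The matrices `ref_11k` and `ref_7k`, the analyte names `seq.A` and `seq.B`, and all simulated values are hypothetical stand-ins for matched reference measurements. This is not the reference data or the internal code used by `SomaDataIO`.

```r
set.seed(1)
n <- 50

# Hypothetical matched reference samples measured on two assay versions
# (rows = samples, columns = analytes); values are simulated for illustration
ref_11k <- cbind(seq.A = rlnorm(n, log(1000), 0.2),
                 seq.B = rlnorm(n, log(200), 0.2))
ref_7k  <- cbind(seq.A = ref_11k[, "seq.A"] * 0.9 + rnorm(n, sd = 20),
                 seq.B = ref_11k[, "seq.B"] * 1.1 + rnorm(n, sd = 5))

# R_i = median(7k) / median(11k): one scalar per analyte (column)
scalars <- apply(ref_7k, 2, median) / apply(ref_11k, 2, median)

# Multiply each column by its own scalar: 11k -> 7k space
lifted <- sweep(ref_11k, MARGIN = 2, STATS = scalars, FUN = "*")

# After lifting, per-analyte medians agree with the 7k reference medians
all.equal(apply(lifted, 2, median), apply(ref_7k, 2, median))
```

Because each scalar is a positive constant, lifting aligns the medians but leaves the relative ordering and spread of samples within an analyte unchanged.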
Below is a concordance plot of what this shift would look like for a single analyte on a simulated reference population. Please see the section below on Lin's CCC for its definition and interpretation.

    rfu  <- dplyr::filter(example_data, SampleType == "Sample")$seq.9016.12
    L    <- length(rfu)
    rfu2 <- rfu + withr::with_seed(123, rnorm(L, mean = 500, sd = sd(rfu) / 3))
    sf   <- median(rfu) / median(rfu2)

    pre <- data.frame(x = rfu, y = rfu2)
    pre$group <- sprintf("pre-lift (%0.3f)", calc_ccc(pre$x, pre$y))
    post <- data.frame(x = rfu, y = rfu2 * sf)
    post$group <- sprintf("post-lift (%0.3f)", calc_ccc(post$x, post$y))
    plot_df <- rbind(pre, post)
    plot_df$group <- factor(plot_df$group, levels = rev(sort(unique(plot_df$group))))
    lims <- range(plot_df[, -3L])

    plot_df |>
      ggplot(aes(x = x, y = y, colour = group)) +
      geom_point(alpha = 0.5, size = 3) +
      scale_x_log10(guide = "axis_logticks") +
      scale_y_log10(guide = "axis_logticks") +
      scale_colour_manual(name = "CCC", values = c("#00A499", "#24135F")) +
      expand_limits(x = lims, y = lims) +
      labs(x = "SomaScan 7k", y = "SomaScan 11k",
           title = sprintf("Lifting Concordance (Scalar = %0.3f)", sf)) +
      geom_abline(slope = 1, intercept = 0, color = "black")

Measurements generated from the matched samples used to calculate the lifting scalars were also used to calculate the post-hoc Lin's Concordance Correlation Coefficient (CCC) estimates of the SomaScan bridge.
Lin's CCC is calculated by computing the correlation between post-lift RFU values and the RFU values generated on the original SomaScan version, and is defined by:
$$ CCC = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}, $$
where $\rho$, $\hat\mu$, and $\hat\sigma$ are the Pearson correlation coefficient, and the estimated mean and standard deviation from assay version groups x and y respectively.
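As a sketch of why CCC penalizes departures from the diagonal, the definition above can be factored into Pearson's $\rho$ times a bias-correction term. The symbols $v$ (scale shift) and $u$ (location shift relative to scale) are introduced here for illustration. Dividing the numerator and denominator by $\hat\sigma_x\hat\sigma_y$ gives:

$$ CCC = \rho \cdot \frac{2}{v + 1/v + u^2}, \qquad v = \frac{\hat\sigma_x}{\hat\sigma_y}, \qquad u = \frac{\hat\mu_x - \hat\mu_y}{\sqrt{\hat\sigma_x\hat\sigma_y}}. $$

Since $v + 1/v \geq 2$ for any $v > 0$, the second factor is at most 1. It equals 1 only when the two versions agree in both scale ($v = 1$) and location ($u = 0$), in which case CCC reduces to $\rho$.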
Lin's CCC was chosen to evaluate lifting performance because it is characterized not only by correlation (Pearson's $\rho$), but also accounts for deviation from the $y = x$ unit line (diagonal). CCC range is in $[-1, 1]$ and can be viewed as an estimate of the confidence in the bridging transformation (in normal reference samples) across SomaScan versions. Examples of factors that could affect lifting CCC are:
The `getSomaScanLiftCCC()` function retrieves these values from an internal object for either `"serum"` or `"plasma"`.

    plasma <- getSomaScanLiftCCC("p")
    plasma

    serum <- getSomaScanLiftCCC("s")
    serum


    cdf_df <- data.frame(
      ccc    = c(plasma$plasma_11k_to_7k_ccc, serum$serum_11k_to_7k_ccc),
      matrix = rep(c("plasma", "serum"), each = nrow(plasma))
    )
    cdf_df <- cdf_df[!is.na(cdf_df$ccc), ]   # rm NAs; non-comparable analytes

    ggplot(cdf_df, aes(x = ccc, colour = matrix)) +
      stat_ecdf(linewidth = 0.75) +
      scale_colour_manual(name = "", values = c("#00A499", "#24135F")) +
      labs(title = "CDF of CCC Values", x = "Lin's CCC (11k -> 7k)", y = "P(X < x)") +
      coord_cartesian()

As shown in the distribution above, for the `11k` $\rightarrow$ `7k` lift, analytes with post-bridging CCC values above 0.75 (considered high quality) make up approximately 88% and 84% of the SomaScan menu for plasma and serum, respectively. In fact, characterizing CCC lifting quality into 3 categories (Low, Medium, High) yields the table below:

    fn <- function(x) {
      cdf <- stats::ecdf(x)
      data.frame(lo = cdf(0.5), med = cdf(0.75) - cdf(0.5), hi = 1 - cdf(0.75))
    }

    do.call(rbind, tapply(cdf_df$ccc, cdf_df$matrix, fn)) |>
      round(3L) |>
      set_rn(c("Plasma", "Serum")) |>
      rn2col("Matrix") |>
      knitr::kable(
        col.names = c("Matrix", "Low [0, 0.5)", "Medium [0.5, 0.75)", "High [0.75, 1]"),
        caption = "Table 1. The proportion of the SomaScan menu split into 3 categories by CCC."
      )

For any given bridge, there is a common, intersecting subset of analytes between SomaScan versions. Non-intersecting analytes will be either missing or added in the new signal space. As a result, bridging data across SomaScan may involve either skipping analytes (columns) or scaling by 1.0. `SomaDataIO` has internal checks that trigger warnings if these conditions are met.
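The idea of intersecting and non-intersecting analytes can be sketched with base R set operations. The analyte names and scalar values below are hypothetical, and this is only an illustration of the bookkeeping, not the internal logic of `lift_adat()`.

```r
# Hypothetical analyte (column) names; not real SomaScan content
analytes_data <- c("seq.1", "seq.2", "seq.3")            # analytes in the data
analytes_ref  <- c("seq.2", "seq.3", "seq.4", "seq.5")   # analytes with a scalar

common   <- intersect(analytes_data, analytes_ref)  # can be scaled
no_ref   <- setdiff(analytes_data, analytes_ref)    # in data, no scalar available
not_data <- setdiff(analytes_ref, analytes_data)    # scalar available, not in data

# Hypothetical lifting scalars for the reference analytes
scalars <- c(seq.2 = 0.95, seq.3 = 1.10, seq.4 = 1.02, seq.5 = 0.88)

# One option: default every analyte to 1.0 (no change),
# then fill in the scalars that exist for the common subset
use <- setNames(rep(1.0, length(analytes_data)), analytes_data)
use[common] <- scalars[common]
use
```

Matching scalars to columns by name rather than by position guards against silently applying a scalar to the wrong analyte when the two versions differ in content.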
There are two scenarios to consider:
- [...] `collapseAdats()`.
- [...] `11k` space with only 5284 analytes; see example below).

`5k` $\rightarrow$ `11k`
Since the `example_data` object was originally run on SomaScan `r getSignalSpace(example_data)`, this vignette will demonstrate the lifting/bridging process from a `5k` $\rightarrow$ `11k` signal space, the most recent SomaScan version.
1. Determine that the attributes are intact: `is_intact_attr(adat)`
2. Determine the study matrix: `attr(adat, "Header.Meta")$HEADER$StudyMatrix`
3. Determine whether the current signal space can be lifted: `getSignalSpace(adat)` and `checkSomaScanVersion(getSignalSpace(adat))`
4. Lift with `lift_adat()`: `lift_adat(adat, bridge = "<direction>")`

Current `bridge` options are: `r dQuote(eval(formals(lift_adat)$bridge))`.

    # determine intact attributes
    # must be TRUE
    is_intact_attr(example_data)


    # determine study matrix
    # must be Human Serum or EDTA-Plasma
    attr(example_data, "Header.Meta")$HEADER$StudyMatrix |> as.character()

Confirm that the matrix of the SomaScan run was `"EDTA Plasma"`:

    # determine if current space can be lifted
    # must be V4, v4.1, or v5.0
    from_space <- getSignalSpace(example_data)
    from_space

    # must be NULL
    is.null(checkSomaScanVersion(from_space))

Finally, invoke `lift_adat()` to perform the bridge/transformation:

    lift_11k <- lift_adat(example_data, bridge = "5k_to_11k")

    is_lifted(lift_11k)           # signal space was lifted
    is.soma_adat(lift_11k)        # preserves 'soma_adat' class
    getSignalSpace(lift_11k)      # current space
    getSomaScanVersion(lift_11k)  # original space

Lifting SomaScan involves a simple linear transformation of a numeric vector (of RFU values), thus in one sense it will always be "successful". However, users often wish to know if this was the correct course of action for their data.
From the concordance plot in Figure 1, we can see that the transformation is reducing the `11k` RFU brightness by ~`r round(100*(1 - sf))`% in accordance with the median signal difference that existed in the reference population (of healthy normals). Rare edge cases aside, this is usually the desired outcome. Otherwise, downstream analysis would be confounded by the uncorrected shift in SomaScan space, and would likely result in significant differences related to signal space rather than actual biology.
Users often ask if certain analytes should be removed based on a given CCC threshold prior to analysis. The issue of choosing an appropriate threshold aside, unless there is prior knowledge justifying removal, we do not recommend removing analytes based on CCC alone.
This advice stems from how the CCC values are initially calculated; i.e. from a healthy, normal reference population sampled across two versions of SomaScan. Recall that CCC is influenced by CV and thus signaling range. For example, if a given analyte is near its limit of detection in a healthy population, and therefore likely has a high(er) CV, i.e. low CCC, removing this analyte may not be the desired course of action in a disease population where that analyte could be signaling in the linear range.
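The dependence of CCC on signaling range can be illustrated with a small simulation. This is a sketch under stated assumptions: the helper `ccc()`, the noise level, and both scenarios are hypothetical and use simulated values, not SomaScan measurements.

```r
# Lin's CCC computed directly from its definition (population moments)
ccc <- function(x, y) {
  k   <- length(x)
  sx2 <- var(x) * (k - 1) / k
  sy2 <- var(y) * (k - 1) / k
  sxy <- cov(x, y) * (k - 1) / k          # equals rho * sigma_x * sigma_y
  2 * sxy / ((mean(x) - mean(y))^2 + sx2 + sy2)
}

set.seed(42)
n     <- 500
noise <- 50   # identical measurement noise (RFU) in both scenarios

# Scenario 1: little spread in the population relative to the noise
truth_narrow <- rnorm(n, mean = 3000, sd = 10)
# Scenario 2: the same analyte signaling over a wide range
truth_wide   <- rnorm(n, mean = 3000, sd = 500)

# Two independent noisy measurements of the same underlying signal
ccc_narrow <- ccc(truth_narrow + rnorm(n, sd = noise),
                  truth_narrow + rnorm(n, sd = noise))
ccc_wide   <- ccc(truth_wide + rnorm(n, sd = noise),
                  truth_wide + rnorm(n, sd = noise))

round(c(narrow = ccc_narrow, wide = ccc_wide), 3)
```

With the same measurement noise, the narrow-range scenario yields a much lower CCC than the wide-range one, even though there is no shift between the two simulated measurements. A low CCC estimated in a reference population therefore does not by itself imply a poor bridge in a population where the analyte spans a wider range.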
Therefore, we currently recommend careful evaluation on a case-by-case basis, using prior knowledge and orthogonal justification, before filtering analytes from discovery or exploratory analyses.
As always, if you have any bridging or lifting questions, we are here to help. Please reach out to us via: