knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.align = "center", out.width = '100%', cache = TRUE)
options(width = 75)
figpath <- "../inst/cdoc/RAPToR-refbuilding_figs/"

load("../inst/extdata/dsaeschimann2017.RData")
load("../inst/extdata/dshendriks2014.RData")

transp <- function(col, a = .5){
  colr <- col2rgb(col)
  return(rgb(colr[1,], colr[2,], colr[3,], a*255, maxColorValue = 255))
}
This vignette is specifically focused on building RAPToR references, and assumes knowledge of general use of the package (see the main RAPToR vignette).
Building references is a key aspect of RAPToR, as samples need an appropriate reference to be staged.
References are broadly reusable once built, so they are worth the trouble to set up.
This vignette will go through the general workflow of building a reference, from selecting an appropriate dataset to validating a model for interpolation.
Along the way, examples are given using C. elegans larval development time-series data published by @aeschimann2017lin41 and @hendriks2014extensive, stored in dsaeschimann2017 and dshendriks2014 respectively. Code to create these objects can be found at the end of this vignette.
Further examples are given at the end for broader coverage of reference-building needs.
We start from a gene expression profiling time series spanning the developmental stages of interest for your organism. Thankfully, such time-series experiments are increasingly numerous in the literature, and many model organisms will already have some kind of usable data. You may also already have, or generate, your own time series.
There are a few expression profiling databases to search and download data from. The best known are the Gene Expression Omnibus (GEO) and ArrayExpress.
Both of these databases have APIs to access their data directly from R (e.g. the GEOquery package, used in multiple data-loading scripts of the RAPToR vignettes).
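As a minimal sketch (the accession below is a placeholder, not one of the datasets used in this vignette), a GEO series matrix can be fetched directly from R with GEOquery:

library(GEOquery)
library(Biobase)
geo <- getGEO("GSE00000")  # placeholder accession; returns a list of ExpressionSet objects
g <- exprs(geo[[1]])       # expression matrix (probes/genes as rows, samples as columns)
p <- pData(geo[[1]])       # sample (phenotypic) data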
Several points of the experimental design should be kept in mind when selecting data for a reference.
Bioinformatics is a field plagued by diverse and fast-changing sets of IDs for genes, transcripts, etc.
When you build a reference, you should always convert it to IDs that are conventional and stable.
We like to use organism-specific IDs (e.g. WormBase gene IDs for C. elegans: WBGene00016153; FlyBase gene IDs for Drosophila: FBgn0010583).
The Ensembl BioMart, or its associated R package biomaRt, is a very useful resource to get gene, transcript, or probe ID lists for conversion.
Below is a code snippet to get gene IDs for Drosophila with biomaRt.
library(biomaRt)
# setup connection to ensembl
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
# get list of attributes
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id",
                                             "external_gene_name", "flybase_gene_id"),
                              mart = mart)
head(droso_genes)
#>   ensembl_gene_id ensembl_transcript_id external_gene_name flybase_gene_id
#> 1     FBgn0015380           FBtr0330209                drl     FBgn0015380
#> 2     FBgn0015380           FBtr0081195                drl     FBgn0015380
#> 3     FBgn0010356           FBtr0088240               Taf5     FBgn0010356
#> 4     FBgn0266023           FBtr0343232     lncRNA:CR44788     FBgn0266023
#> 5     FBgn0250732           FBtr0091512               gfzf     FBgn0250732
#> 6     FBgn0250732           FBtr0334671               gfzf     FBgn0250732
When multiple probe or transcript IDs match a single gene ID, we usually sum-aggregate counts and mean-aggregate other expression values (microarray intensities, or already-processed RNA-seq such as TPM).
This is handled by the format_ids() function.
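As a sketch (assuming a hypothetical transcript-level count matrix tx_counts with Ensembl transcript IDs as row names; argument names follow our reading of ?format_ids, so check them on your installed version), aggregation to gene level could look like:

library(RAPToR)
# conversion table: transcript IDs (matching rownames of tx_counts) to gene IDs
ids <- droso_genes[, c("ensembl_transcript_id", "flybase_gene_id")]
# sum-aggregate transcript-level counts to gene level
gene_counts <- format_ids(tx_counts, ids, from = 1, to = 2,
                          aggr.fun = sum) # use mean for microarray or TPM values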
It's common practice to normalize expression data (e.g. to account for technical bias). You may deal with many different profiling technologies when building references, and may even join different datasets together for a single reference.
To stay as consistent as possible, we apply quantile-normalization on all data regardless of source or type.
For this, we use the normalizeBetweenArrays() function from limma.
We also log-transform the data with $log(X + 1)$.
library(RAPToR)
library(limma)

dsaeschimann2017$g <- limma::normalizeBetweenArrays(dsaeschimann2017$g, method = "quantile")
dsaeschimann2017$g <- log1p(dsaeschimann2017$g)

dshendriks2014$g <- limma::normalizeBetweenArrays(dshendriks2014$g, method = "quantile")
dshendriks2014$g <- log1p(dshendriks2014$g)
It's good practice to explore the data a little before moving on.
We start with a quick look at the data: the expression matrix and associated sample (or phenotypic) data.
dsaeschimann2017$g[1:5, 1:3]
head(dsaeschimann2017$p, n = 5)
Outliers in time-series data can be revealed using sample-sample correlation heatmaps or boxplots; these also show the clear correlation between samples at similar stages of development.
cor_dsaeschimann2017 <- cor(dsaeschimann2017$g, method = "spearman")
ord <- order(dsaeschimann2017$p$age) # order by development

heatmap(cor_dsaeschimann2017[ord, ord], Colv = NA, Rowv = NA, scale = "none",
        keep.dendro = F, margins = c(1,1),
        # see Example 3 for transp() function.
        RowSideColors = transp(as.numeric(dsaeschimann2017$p$strain[ord])),
        labRow = "", labCol = "")

# add time labels
par(xpd = T)
mtext(text = paste0(unique(dsaeschimann2017$p$age), 'h'), side = 1, line = 4,
      at = seq(-.1, 1.05, l = 11))
boxplot(cor_dsaeschimann2017 ~ interaction(dsaeschimann2017$p$strain, dsaeschimann2017$p$age),
        col = transp(1:4), # see Example 3 for transp() function.
        xaxt = "n", at = seq(1, 44, l = 55)[c(T,T,T,T,F)],
        ylab = "Spearman correlation", xlab = "Chronological age (h)")

# add time labels
axis(side = 1, at = seq(2, 42, l = 11),
     labels = paste0(unique(dsaeschimann2017$p$age), 'h'))
legend(23, .87, fill = transp(1:4),
       legend = c("let-7", "lin-41", "let-7/lin-41", "N2"), bty = "n")
Everything looks good in this case.
Dimension reductions such as PCA or ICA (Principal or Independent Component Analysis) yield components that summarize the gene expression landscape and dynamics (@alter2000singular). Plotting these components with respect to time is a good way to grasp general dynamics in the data.
For PCA, we want to perform a non-scaled, centered PCA. Centering is done gene-wise, not sample-wise (hence the matrix transposition below).
pca_dsaeschimann2017 <- stats::prcomp(t(dsaeschimann2017$g), rank = 25, center = TRUE, scale = FALSE)
par(mfrow = c(2,4), pty = 's')

# for components 1:8
invisible(sapply(1:8, function(i){
  # plot points colored by strain
  plot(dsaeschimann2017$p$age, pca_dsaeschimann2017$x[,i], lwd = 2,
       col = dsaeschimann2017$p$strain,
       xlab = "Chronological age", ylab = "PC", main = paste0("PC", i))
  # connect the dots per strain
  sapply(levels(dsaeschimann2017$p$strain), function(l){
    s <- which(dsaeschimann2017$p$strain == l)
    points(dsaeschimann2017$p$age[s], pca_dsaeschimann2017$x[s,i],
           col = which(levels(dsaeschimann2017$p$strain) == l),
           type = 'l', lty = 2)
  })
  if(i == 1) # add legend on first panel
    legend("topleft", bty = 'n',
           legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
           pch = rep(1, 4), lty = rep(NA, 4), col = 1:4, lwd = 3)
}))
In this C. elegans larval development data, we see very consistent dynamics across different strains. Also, PC2 and PC3 capture an oscillatory dynamic which is characteristic of molting in C. elegans.
We can also plot a few random genes, to get a grasp on the noise in the data.
# select random genes
set.seed(10)
gtp <- sample(nrow(dsaeschimann2017$g), size = 4)

par(mfrow = c(1,4), pty = 's')
invisible(sapply(gtp, function(i){
  plot(dsaeschimann2017$p$age, dsaeschimann2017$g[i,], lwd = 2,
       col = dsaeschimann2017$p$strain,
       xlab = "Chronological age", ylab = "log(TPM+1)",
       main = rownames(dsaeschimann2017$g)[i])
  sapply(levels(dsaeschimann2017$p$strain), function(l){
    s <- which(dsaeschimann2017$p$strain == l)
    points(dsaeschimann2017$p$age[s], dsaeschimann2017$g[i,s],
           col = which(levels(dsaeschimann2017$p$strain) == l),
           type = 'l', lty = 2)
  })
  if(i == gtp[4])
    legend("topleft", bty = 'n',
           legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
           pch = rep(1, 4), lty = rep(NA, 4), col = 1:4, lwd = 3)
}))
Increasing the resolution of a profiling time series is a very unbalanced regression problem. We want to predict tens of thousands of dependent variables (genes) with a few independent variables (time, batch, ...).
In order to fit a large number of output variables (genes), we project the data into a linear, dimensionally-reduced space, interpolate at the component level, and then project the data back to the gene level. We do this with Principal Components or Independent Components (from Independent Component Analysis).
Both PCA and ICA perform the same type of linear transformation on the data (they just optimize different criteria). We get the following:
$$ X_{(m\times n)} = G_{(m\times c)}S^{T}_{(c\times n)} $$
with $X$ the matrix of $m$ genes by $n$ samples, $G$ the gene loadings ($m$ genes by $c$ components), and $S$ the sample scores ($n$ samples by $c$ components). When performing PCA (or ICA) on gene expression data, $S$ is what is usually plotted (e.g. PC1 vs. PC2) to see how samples are grouped in component space. It is what we plotted earlier in the section on observing data, for instance.
@alter2000singular demonstrated that the singular value decomposition of gene expression data yields "eigengenes", giving a global picture of the expression landscape and dynamics with a few components. Gene Expression Interpolation Models (GEIMs) use this property: we fit a model on the columns of $S$ (the eigengenes), predict in component space, and reconstruct the gene expression data by a matrix product with the gene loadings.
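For instance, the prcomp() call from the data exploration section above already gives this decomposition (a sketch; the reconstruction is only approximate because rank = 25 truncates the number of components):

G <- pca_dsaeschimann2017$rotation  # gene loadings (genes x components)
S <- pca_dsaeschimann2017$x         # sample scores (samples x components)
# reconstruct X (genes x samples) as G %*% t(S), adding back the gene-wise centers
X_rec <- G %*% t(S) + pca_dsaeschimann2017$center
max(abs(X_rec - dsaeschimann2017$g)) # residual error from the rank-25 truncation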
We propose two model types: Generalized Additive Models (GAMs, the default) and Generalized Linear Models (GLMs).
GAMs use gam() from the mgcv package, and GLMs use the glm() function of the stats core package.
A standard R formula is specified for the model.
This formula can include the tools and functions generally used with gam() or glm(), most notably the variety of polynomial or smoothing splines implemented through the s() function of mgcv for GAMs.
GLMs can also use splines from the splines core package, such as ns() for natural cubic splines.
Gene Expression Interpolation Models (GEIMs) are built with the ge_im() function, which outputs a geim object.
This function takes 3 key arguments:

- X: the gene expression matrix of your time series (genes as rows, samples as columns)
- p: a dataframe of phenotypic data, with samples as rows. This should include the age/time variable and any other covariates you want to include in the model (e.g. batch, strain)
- formula: the model formula. This should be a standard R formula using terms found in p, and must start with X ~

Another important argument is the number of components to interpolate on, nc.
For example, using the dsaeschimann2017 dataset, we could build the following model.

m_dsaeschimann2017 <- ge_im(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
                            formula = "X ~ s(age, bs = 'ts') + strain", nc = 32)
Additional parameters are detailed in the function documentation (?ge_im).
Note that a single model formula is specified and applied to all the components, but a model is fitted independently on each component.
Model predictions can be generated with the predict() function, as for any standard R model.
There are 5 types of GEIMs:

- GAM on PCA components (method="gam", dim_red="pca"), the default
- GLM on PCA components (method="glm", dim_red="pca")
- GAM on ICA components (method="gam", dim_red="ica")
- GLM on ICA components (method="glm", dim_red="ica")
- per-gene linear models with limma (method="limma")
)Our default GEIM is fitting GAMs on PCA components, which is a robust choice when applying a smoothing spline to the data.
When using GAMs, there can be no interaction between terms (by definition).
It is however possible to pass a by argument to the s() function, which essentially corresponds to separate model fits on each level of the specified grouping variable (and thus requires enough data per level).
PCA and ICA interpolation usually yield near-identical results (though ICA components tend to be more biologically meaningful or interpretable).
ICA tends to outperform PCA when the data is very noisy (by design, since ICA essentially performs signal extraction).
It is however slower, especially if nc is large.
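Non-default model types are selected through the method and dim_red arguments. As a sketch (the ns() degrees of freedom below are purely illustrative):

# GAM on ICA components
m_ica <- ge_im(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
               formula = "X ~ s(age, bs = 'ts') + strain",
               dim_red = "ica", nc = 32)

# GLM on PCA components, with a natural cubic spline from the splines package
library(splines)
m_glm <- ge_im(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
               formula = "X ~ ns(age, df = 5) + strain",
               method = "glm", dim_red = "pca", nc = 32)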
With the last option ("limma"), a model is fit per gene with the lmFit() function of the limma package.
This approach makes no attempt to reduce the dimensionality of the problem (the dim_red and nc arguments are ignored), which means there is no information loss or bias introduced by dimension reduction.
It is however very sensitive to noise.
The mperf() function computes multiple indices to evaluate model performance.
In the formulas below, $X$ corresponds to the input gene expression matrix ($m$ genes as rows, $n$ samples as columns) and $\hat{X}$ to the model predictions. $x_i$ corresponds to row $i$ of matrix $X$, and $x_i^{(j)}$ to sample $j$ of that row. This notation is derived from the general regression problem, where $X^T$ corresponds to the set of $m$ dependent variables to predict.
- aCC: average Correlation Coefficient.
$$ aCC = \frac{1}{m}\sum^{m}_{i=1}{CC_i} = \frac{1}{m}\sum^{m}_{i=1}{\cfrac{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)}}{\sqrt{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)^2}\sum^{n}_{j=1}{(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)^2}}}} $$

- aRE: average Relative Error.
$$ aRE = \frac{1}{m}\sum^{m}_{i=1}{RE_i} = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} \cfrac{| x_i^{(j)} - \hat{x}_i^{(j)} |}{x_i^{(j)}} $$

- MSE: Mean Squared Error.
$$ MSE = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)})^2 $$

- aRMSE: average Root Mean Squared Error.
$$ aRMSE = \frac{1}{m}\sum^{m}_{i=1}{RMSE_i} = \frac{1}{m} \sum^{m}_{i=1} \sqrt{\cfrac{\sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)})^2}{n}} $$
Note that these indices are computed and averaged with respect to variables (genes), not observations.
mperf() outputs either the overall (averaged) indices or per-gene values, controlled by the global parameter.
# global values
g_mp <- mperf(dsaeschimann2017$g, predict(m_dsaeschimann2017), is.t = T)
g_mp

# per-gene values
ng_mp <- mperf(dsaeschimann2017$g, predict(m_dsaeschimann2017), is.t = T, global = F)
ng_mp <- lapply(ng_mp, na.omit)         # remove NAs (e.g. 0-variance genes)
ng_mp$aRE <- ng_mp$aRE[ng_mp$aRE < Inf] # remove Inf values (division by 0)
Model performance is reflected by the distributions of prediction indices over all genes.
par(mfrow = c(2,2))
invisible(sapply(names(ng_mp), function(idx){
  rg <- range(na.omit(ng_mp[[idx]]))
  # estimate density curve
  d <- density(na.omit(ng_mp[[idx]]), from = rg[1], to = rg[2])
  plot(d, main = paste0(gsub("a", "", idx, fixed = T), " density (",
                        length(ng_mp[[idx]]), " genes)"),
       xlab = idx, lwd = 2)
  # display global value
  abline(v = g_mp[[idx]], lty = 2, lwd = 2, col = "firebrick")
  text(g_mp[[idx]], .9*max(d$y), pos = 4, labels = idx,
       font = 2, col = "firebrick")
}))
This data is quite clean, so model fits are very good across the board.
By default, the number of components to interpolate is set to the number of samples. However, we recommend setting a cutoff on explained variance of PCA (or ICA) components.
For example, on the dsaeschimann2017 dataset, we set the threshold at $99\%$:

nc <- sum(summary(pca_dsaeschimann2017)$importance[3,] < .99) + 1
nc
This threshold must be set in accordance with the noise in the data. For example, in very noisy data, would you consider that $99\%$ of the variance in the dataset corresponds to meaningful information?
You can also define nc by plotting your components and stopping once they no longer capture meaningful variation (dynamics) with respect to time/age. We define components with 'intelligible dynamics' with respect to time as those where a model fit explains $>0.5$ of the deviance (noisy components with no dynamics have poor fits).
In very noisy data, you may have to keep very few components (or even a single component) for the interpolation.
Choosing from different splines (and/or parameters) can be done with cross-validation (CV).
The ge_imCV() function takes X, p, and a formula_list to test, as well as other CV parameters (e.g. training set size).
The default training/validation split is cv.s = 0.8, so $80\%$ of the data is used to build the model and the remainder for validation.
When including (factor) covariates in the model, the training set is built such that all groups are proportionately represented (based on the terms of the first formula in the list).
The number of CV repeats is defined by cv.n.
The model type (GAM/GLM and PCA/ICA) is fixed for all formulas within one call of ge_imCV().
ge_imCV() returns model performance indices from mperf() (excluding aCC, to reduce computing time).
Indices are computed on the validation set (CV Error, cve) and on the training set (Model PerFormance, mpf).
Below, we choose between 4 available GAM smooth terms for the dsaeschimann2017 GEIM.

smooth_methods <- c("tp", "ts", "cr", "ps")
flist <- as.list(paste0("X ~ s(age, bs = \'", smooth_methods, "\') + strain"))
cat(unlist(flist), sep = '\n')

set.seed(2)
cv_dsaeschimann2017 <- ge_imCV(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
                               formula_list = flist,
                               cv.n = 10, nc = nc, nb.cores = 4)
plot(cv_dsaeschimann2017, names = paste0("bs = ", smooth_methods), outline = F, swarmargs = list(cex = .8), boxwex=.5)
From the plots above, we can see the different splines perform similarly; all could work.
We choose ts (thin-plate regression splines with shrinkage), as it minimizes CV error without much impact on model performance.
Extra spline parameters can also be specified in the model.
With s(), you can set the spline's basis dimension ($\simeq$ number of knots) with the k parameter. By default, the spline is a penalized spline, so it will not necessarily use k knots, but it will stay below that value.
Setting fx=TRUE forces a spline of basis dimension k. Note that this fits a spline of dimension k on all components, whereas the penalized spline adjusts to each component.
The s() or choose.k documentation gives further information.
In our experience, the parameter estimation done by gam() is usually sufficient and rarely requires tweaking.
Below are examples of spline parameter tweaking with the dsaeschimann2017 data.
ks <- c(4, 6, 8, 10)
flistk <- as.list(c(
  "X ~ s(age, bs = 'cr') + strain",
  paste0("X ~ s(age, bs = 'cr', k = ", ks, ") + strain"),
  paste0("X ~ s(age, bs = 'cr', k = ", ks, ", fx=TRUE) + strain")
))
cat(unlist(flistk), sep = '\n')

cv_dsaeschimann2017k <- ge_imCV(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
                                formula_list = flistk,
                                cv.n = 10, nc = nc, nb.cores = 4)
par(mar = c(7,4,3,1))
plot(cv_dsaeschimann2017k,
     names = c("na", paste0("k=", ks, rep(c("", ", fx=T"), each = 4))),
     outline = F,
     col = transp(c("royalblue", rep(c(1, "firebrick"), each = 4))),
     tcol = c("royalblue", rep(c(1, "firebrick"), each = 4)),
     names.arrange = 5, swarmargs = list(cex = .8))
This confirms the parameters estimated by GAM (in blue) are optimal.
A ref object can be built from a GEIM using make_ref(), specifying the interpolation resolution and relevant metadata (see ?make_ref).
On our dsaeschimann2017 example:
r_dsaeschimann2017 <- make_ref(
  m = m_dsaeschimann2017,                    # geim model
  n.inter = 100,                             # interpolation resolution
  t.unit = "h past egg-laying (25°C)",       # time unit
  cov.levels = list("strain" = "N2"),        # covariate levels to use for interpolation
  metadata = list("organism" = "C. elegans", # any metadata
                  "profiling" = "bulk RNAseq"))
Like any R model, GEIMs have a predict() method (called internally by make_ref()), which can be used to manually get predictions either in component space or at the gene level.
This can be useful for a deeper look at the model.
# first generate the new predictor data
n.inter <- 100 # nb of new timepoints
newdat <- data.frame(
  age = seq(min(dsaeschimann2017$p$age), max(dsaeschimann2017$p$age), l = n.inter),
  strain = rep("N2", n.inter) # we want to predict as N2
)
head(newdat)

# predict at gene level
pred_m_dsaeschimann2017 <- predict(m_dsaeschimann2017, newdata = newdat)

# predict at component level
pred_m_dsaeschimann2017_comp <- predict(m_dsaeschimann2017, newdata = newdat, as.c = TRUE)
After building a reference, we check interpolation results by:

- plotting model predictions against the components,
- checking that the interpolation translates well at the gene level,
- staging samples (the reference's own, and ideally an external time series) on the interpolated reference.
Checking predictions against components lets you immediately see if some dynamics are mishandled by the model, or if there is overfitting. Slight offsets are acceptable; no fit is perfect.
Plotting a model together with a reference object (or equivalent metadata) shows the component interpolation, with deviance explained (DE) and relative error (RE) for each component (this information is also returned by the plot function). DE can be used to define components with "intelligible" dynamics with respect to time, when $DE>0.5$. In noisy data, this distinction can be useful to remove components which reflect noise rather than meaningful developmental variation.
Predictions of the first few components of dsaeschimann2017 are plotted below.

par(mfrow = c(2,4))
fit_vals <- plot(m_dsaeschimann2017, r_dsaeschimann2017, ncs = 1:8,
                 col = dsaeschimann2017$p$strain, col.i = 'royalblue')
Of note, we are predicting model values as N2 (light blue). While all strains are shown on the plots, the model predictions correspond to the selected N2 strain.
The fit values are also returned by the plot function.
head(fit_vals)
You may notice that some noisy components get "flattened", with a null model fitted. These components can be left in or removed, as they generally have little to no impact on interpolation at the gene level (they represent a minuscule part of the total variance in the data); removing them can actually get rid of unwanted variation.
par(mfrow = c(1,3), pty = 's')
fit_vals <- plot(m_dsaeschimann2017, r_dsaeschimann2017, ncs = 11:13,
                 col = dsaeschimann2017$p$strain, col.i = 'royalblue',
                 l.pos = 'bottomleft')
The interpolation should translate well at the gene level, on the full expression matrix.
par(mfrow = c(1,4))
invisible(sapply(gtp, function(i){ # from the earlier random gene plots
  plot(dsaeschimann2017$p$age, dsaeschimann2017$g[i,], lwd = 2,
       col = dsaeschimann2017$p$strain,
       xlab = "age", ylab = "GExpr", main = rownames(dsaeschimann2017$g)[i])
  # connect the dots
  sapply(levels(dsaeschimann2017$p$strain), function(l){
    s <- which(dsaeschimann2017$p$strain == l)
    points(dsaeschimann2017$p$age[s], dsaeschimann2017$g[i,s],
           col = which(levels(dsaeschimann2017$p$strain) == l),
           type = 'l', lty = 2)
  })
  # add model prediction
  points(r_dsaeschimann2017$time, r_dsaeschimann2017$interpGE[i, ],
         col = "royalblue", type = 'l', lwd = 3)
  # legend
  if(i == gtp[4])
    legend("topleft", bty = 'n',
           legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
           pch = rep(1, 4), lty = rep(NA, 4), col = 1:4, lwd = 3)
}))
Staging the samples used to build the reference on their interpolated version is a good first test. Then, staging another time series from the literature on your reference is the best validation, if such data exists. This external validation confirms that the interpolated dynamics indeed correspond to developmental processes and not to a dataset-specific effect (which is unlikely, but not impossible).
We can use dshendriks2014 for external validation.

ae_test_dsaeschimann2017 <- ae(dsaeschimann2017$g, r_dsaeschimann2017)
ae_test_dshendriks2014 <- ae(dshendriks2014$g, r_dsaeschimann2017)
par(mfrow = c(1,2), pty = 's', mar = c(5,5,4,1))

rg <- range(c(ae_test_dsaeschimann2017$age.estimates[,1], dsaeschimann2017$p$age))
plot(ae_test_dsaeschimann2017$age.estimates[,1] ~ dsaeschimann2017$p$age,
     xlab = "Chronological age", ylab = "RAPToR age estimates (r_dsaeschimann2017)",
     xlim = rg, ylim = rg,
     main = "Chron. vs Estimated ages for dsaeschimann2017\n(on dsaeschimann2017 reference)",
     lwd = 2, col = factor(dsaeschimann2017$p$strain))
invisible(sapply(levels(factor(dsaeschimann2017$p$strain)), function(l){
  s <- dsaeschimann2017$p$strain == l
  points(ae_test_dsaeschimann2017$age.estimates[s,1] ~ dsaeschimann2017$p$age[s],
         type = 'l', lty = 2,
         col = which(l == levels(factor(dsaeschimann2017$p$strain))))
}))
abline(a = 0, b = 1, lty = 3, lwd = 2)
legend("bottomright", legend = c("let-7", "lin-41", "let-7/lin-41", "N2", "x = y"),
       lwd = 3, col = c(1:4, 1), bty = 'n',
       pch = c(1,1,1,1,NA), lty = c(rep(NA, 4), 3))

rg <- range(c(ae_test_dshendriks2014$age.estimates[,1], dshendriks2014$p$age))
plot(ae_test_dshendriks2014$age.estimates[,1] ~ dshendriks2014$p$age,
     xlab = "Chronological age", ylab = "RAPToR age estimates (r_dsaeschimann2017)",
     xlim = rg, ylim = rg,
     main = "Chron. vs Estimated ages for dshendriks2014\n(on dsaeschimann2017 reference)",
     lwd = 2)
points(ae_test_dshendriks2014$age.estimates[,1] ~ dshendriks2014$p$age,
       type = 'l', lty = 2)
abline(a = 0, b = 1, lty = 3, lwd = 2)
legend("bottomright", legend = "x = y", lwd = 3, col = 1, lty = 3, bty = 'n')
Unlike development, where gene expression changes are robust across individuals (and to an extent across species), expression changes along aging are much more subtle, more variable, and noisier across experiments. Individuals can age at different rates and in different ways, and the process of aging itself is still poorly understood. This makes it difficult for RAPToR to work with aging data "as is", and we recommend strengthening the aging signal by restricting genes to an informative set.
Empirically, we have found that genes with monotonic trends along aging make for a robust choice. Genes whose expression correlates with age above a given threshold can thus be selected to build a reference from a single component. We use this strategy in our article (@bulteau2022real), and replicate it in Example 4 (a sketch is given below).
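As a minimal sketch of this gene selection (ds here is a hypothetical aging time series with log-scale expression in ds$g and chronological ages in ds$p$age; the 0.7 cutoff is purely illustrative):

# keep genes whose expression correlates monotonically with age
age_cor <- apply(ds$g, 1, cor, y = ds$p$age, method = "spearman")
sel <- which(abs(age_cor) > 0.7)
# build the model on the selected genes, interpolating on a single component
m_aging <- ge_im(X = ds$g[sel, ], p = ds$p,
                 formula = "X ~ s(age, bs = 'ts')", nc = 1)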
Aging profiling data is particularly tricky to use for reference building because, on top of the many known factors that biologically influence aging, unknown differences in experimental procedures between labs can also influence aging, as evidenced by @lithgow2017long. They further show that the time zero can differ between labs (e.g. egg-laying, hatching, feeding), with some starting the count at "day 1" and others at "day 0". It is therefore likely that inferred age units will differ from the known chronological age of staged samples. Time series should however show a clear correlation between chronological and estimated age.
Here are a few examples of reference building and validation on different organisms.
options(width = 80)
sessionInfo()