knitr::opts_chunk$set( eval = FALSE, # for while editing vignette collapse = TRUE, comment = "#>" )
library(artemis)
(Adapted from manuscript Espe et al, in review)
The artemis
R package was created to aid in the design and analysis of
environmental DNA (eDNA) survey studies by offering a custom suite of
models for quantitative polymerase chain reaction (qPCR) data from extracted eDNA samples. Data from eDNA sampling surveys is often analyzed with occupancy models or GLMS, but there are
several characteristics of qPCR data in particular which made us feel that it would
benefit from a different modeling approach. Specifically, our approach to this data makes use of Bayesian truncated latent variable models written in Stan.
artemis
We created the artemis
R package to implement Bayesian censored
latent variable models, which mitigate the weaknesses of common
statistical analysis techniques applied to qPCR data. Additionally,
artemis
includes utilities to aid in the design and analysis of eDNA
survey studies, including simulation and power analysis functions.
At its core, artemis
is a specialized Generalized Linear
Model, where the predictors are assumed to additively affect the
response variable, in this case $ln[eDNA]$,
$$ ln[eDNA]{i} = X{i} \beta $$
where $\beta$ is a vector of effects on $ln[eDNA]{i}$, and $X{i}$
is a vector of predictors. Since artemis
directly models the
effect of the predictors on the latent (unobserved) variable, [eDNA],
it is unnecessary for the researcher to back-transform the data prior
to modeling. Internally, artemis
conducts this conversion using the
user-supplied values for the formula,
$$\hat{Cq_i} = \alpha_{std_curve} + \beta_{std_curve}* ln[eDNA]_i $$
Where $\alpha_{std_curve}$ and $\beta_{std_curve}$ are fixed values from setting the standard curve in the lab prior to qPCR.
Internally, the back-transformed $ln[eDNA]_i$ values are considered a sample with measurement error from the true $ln[eDNA]_i$ value ($\hat{ln[eDNA]_i}$) in the extract.
$$ ln[eDNA]i \sim Trunc. Normal(\hat{ln[eDNA]_i}, \sigma{Cq}, U) $$
Where the observed $ln[eDNA]_i$ values are censored at the predetermined threshold, $U$. This threshold is back-transformed from the threshold on Cq. Importantly, the $\hat{ln[eDNA]}$ values in the model are not censored, allowing the latent variable to reflect the "true" [eDNA] beyond the censorship point. The likelihood that a sampled $ln[eDNA]$ value will exceed the threshold is a function of the measurement error and the estimated latent $\hat{ln[eDNA]_i}$ value. We calculate this likelihood using the normal cumulative distribution function, $\Phi()$,
$$ Pr[ln[eDNA]_i > U ] = 1 - \Phi(\hat{ln[eDNA]_i} - \mu_i / \sigma)$$
Thus, the models in artemis
account for the data censoring process by
accounting for the probability that the observed value will exceed the
threshold. As detection limits vary with genetic assay, the upper
threshold on Cq in the model is adjustable by the user.
Lastly, there is an optional zero-inflated versions of both models,
eDNA_lm_zinf()
and eDNA_lmer_zinf()
. From multiple experiments, it
was observed there can be near-zero concentrations of eDNA even in
situations where higher concentrations were expected. This was
attributed to filter failures or other issues with sampling. To
account for this mechanism, the zero-inflated versions of the models
allow for zero [eDNA] concentrations from a secondary
mechanism. Currently, the functions do not support user-provided
predictors on the zero-inflated component, and just estimate a flat
probability of zero detections for all observations. However, users
can provide a prior for the expected probability of "true" zero
observations from a secondary mechanism.
This model formulation makes several assumptions:
$ln[eDNA]$ is assumed to be uniform within a sample.
$ln[eDNA]$ is sampled with normally distributed errors.
There are no false detections, i.e. the measurement error cannot result in a positive detection when eDNA is not present in the sample.
Importantly, this formulation produces estimates of the effect sizes which:
are modeled directly on $ln[eDNA]$ or copy number, rather than Cq, therefore are independent of the standard curve and can be compared between studies that use different standard curves.
account for the data censoring at the upper limit of qPCR cycles, which properly accounts for uncertainty and reduces bias in the estimates.
artemis
The artemis
installation guide can be found here.
Since artemis
was originally written to analyze aquatic eDNA samples, the examples and sample data included in the package (see ?eDNA_data
) reference
variables that are specific to aquatic eDNA survey data. These variables include volume
(the the number of mililiters of water that went through the filter in
each sample) and distance
(the distance in meters from the source of
eDNA). For example, eDNA_data
is from a "live car" experiment, where eDNA was sampled at known distances from a source of eDNA (fish in a net pen in unidirectional flow).
In the following examples, tech_rep
refers to qPCR replicates generated
from a single extracted sample or filter; rep
refers to a single extracted filter or sample.
artemis
FunctionalityThe functions included in artemis
can be grouped into a few categories; each category is either introduced in this vignette, or addressed in detail by its own vignette. Available vignettes are linked to individually below:
These categories are addressed in the sections that follow.
The artemis
package also includes methods for R's plot()
, summary()
, data.frame()
, and predict()
functions.
Often purpose of an eDNA sampling study is to inform a field sampling protocol. For these cases, we often want to know how likely we are to detect eDNA, given the way that we sampled. est_p_detect()
returns the probability of getting a detection across all planned samples and replicates, where n_rep
represents the product of the number of filters and technical replicates for each filter. For example, if you planned to take 2-5 filters at each variable level and then complete 6 extractions for each filter, that would be n_rep = 12:30
.
p_detect = est_p_detect(variable_levels = c(Intercept = 1, Distance = 100), betas = c(Intercept = -10.5, Distance = -0.03), ln_eDNA_sd = 1, std_curve_alpha = 21.2, std_curve_beta = -1.5, n_rep = 12:30) p_detect
We can also plot this,
plot(p_detect)
Alternatively, we can also use a fit model to estimate the p(detect) for a set of circumstances. Unlike the above, using a fit model allows us to estimate the probability of detection with the uncertainty in our effect size estimates taken into account.
model_fit = eDNA_lm(Cq ~ Distance_m + Volume_mL, data = eDNA_data, std_curve_alpha = 21.2, std_curve_beta = -1.5, seed = 1234, chains = 1) # Intercept is automatically supplied if in model_fit p_detect2 = est_p_detect(variable_levels = c(Distance_m = 100, Volume_mL = 50), model_fit = model_fit, n_rep = 1:12) print(p_detect2)
and we can plot the results,
plot(p_detect2, error_width = 0.1)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.