In valentingar/ConFluxPro: A toolkit for soil gas analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette shows how to estimate the uncertainty of the efflux and production estimates using bootstrap_error(). We will first generate a dataset of 'measurement uncertainty' in the input parameters and then use bootstrap_error() to estimate the resulting uncertainty in the models.

library(ConFluxPro)

Uncertainty of input variables

General workflow and uncertainty from `gasdata`

The most relevant source of uncertainty in FGM-based models is that of the input parameters. Especially parameters related to the estimation of the calculation of the diffusion coefficient DS are hard to measure and soil heterogeneity introduces more uncertainty. We'll use the example dataset to demonstrate this effect.

data("base_dat", package = "ConFluxPro")
mod_pf <- pro_flux(base_dat)

In this dataset, the molar fraction of CO~2~ is measured in three replicates at each depth within the soil. In a normal model, we use all replicates to fit a single profile. By randomly sampling from these replicates, we can estimate the uncertainty the concentration profiles measurement introduces into the model. This can be done in bootstrap_error()

set.seed(42) # to get exactly same result every time 
mod_pf_bs_gasdata <- 
  bootstrap_error(mod_pf, 
                  n_samples = 25, 
                  sample_from = "gasdata")

With n_samples = 25 we created 25 new models. In each of these, a random selection of all measured x_ppm values was created per profile. This is done by resampling the same number of observations at each depth while allowing replacing (meaning that a single measurement may be sampled multiple times!). If we extract the efflux, we now get a second column DELTA_efflux that gives the uncertainty. Notice, that the value of efflux has changed slightly from the original model. This is because it is now estimated as the mean value of all 25 models, while the standard deviation is the estimate of DELTA_efflux.

## after bootstrapping
mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  efflux()

## originial model
mod_pf %>%
  filter(prof_id == 1) %>%
  efflux()

Similarly, we can extract the production(), where respective columns have also been added.

mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  production()

Uncertainty from `soilphys`

However, not only the gas concentration data carries uncertainty. Indeed, many parameters contained within the soilphys dataset are subject to significant uncertainty due to sampling error and soil heterogeneity. In the present dataset this is not represented at all. So, let's pretend that we measured TPS in three replicates identified by replicate_id that have some spread around the mean. We will do this by randomly sampling from a normal distribution with a standard deviation of 0.01.

library(dplyr)
soilphys <- cfp_soilphys(base_dat)

set.seed(42)
soilphys_replicate_TPS <-
soilphys %>%
  # get base TPS info
  select(site, upper, lower, TPS) %>%
  distinct() %>%
  rename(TPS_mean = TPS) %>%
  # repeat each row 3 times 
  cross_join(data.frame(replicate_id = 1:3)) %>%
  # generate new, random TPS values
  mutate(TPS = rnorm(n(), TPS_mean, 0.01)) %>%
  # join with rest of dataset
  left_join(soilphys %>% select(!TPS),
            by = c("site", "upper", "lower"),
            relationship = "many-to-many") %>%
  # recaluclate DS and c_air
  complete_soilphys(DSD0_formula = "a*AFPS^b", quiet = TRUE) %>%
  cfp_soilphys(id_cols = c("site", "Date", "replicate_id"))

Now we can create a new dataset that includes these new values of TPS and create a cfp_pfmod model. Of course you could use real measurements instead and include the variability of other parameters (SWC, t, ...) as well. We don't call pro_flux(), because we don't want to calulcate the result from each replication individual.

replicate_dat <- 
  cfp_dat(cfp_gasdata(base_dat),
          soilphys_replicate_TPS,
          cfp_layers_map(base_dat))

mod_pf_replicate <-
  cfp_pfmod(replicate_dat)

Now we can run bootstrap_error() again. This time, we will tell the function to resample from the soilphys dataset when creating the bootstrapping runs. To do this, we need to tell it which id_cols define(s) the replications of the dataset. In our case, this is replicate_id. Here, only a single value for each profile and depth is sampled for each run. The new call looks like this:

set.seed(42)
mod_pf_bs_soilphys <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "soilphys",
                  rep_cols = "replicate_id")

If we take a look into the result of efflux() again, we can see that the uncertainty of the efflux estimate has increased compared to using the variability in gasdata. This is both because TPS, from which DS is derived, has a strong impact on the flux calculation and because the variability within the gasdata dataset is probably unrealistically low and more representative of an instrument than of a measurement error.

## boostrapping from soilphys
mod_pf_bs_soilphys %>%
  filter(prof_id < 10) %>%
  efflux()

## boostrapping from gasdata
mod_pf_bs_gasdata %>%
  filter(prof_id < 10) %>%
  efflux()

Finally, we can also use the variability from both datasets at the same time. This runs both sampling strategies independent from another.

set.seed(42)
mod_pf_bs_both <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "both",
                  rep_cols = "replicate_id")
mod_pf_bs_both %>%
  filter(prof_id < 10) %>%
  efflux()