knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

To prevent soil pollution in the Netherlands, national legislation is in place. In this legislation for many substances, like heavy metals, organic pollatants, etc., several reference values are defined, including the so called 'background values', or 'achtergrondwaarden' in Dutch. These background values are soil concentrations and used to assess the quality of the soils. From a policy point of view, soils with concentrations below the background value are considered clean.

The background values are broadly based on the 95-percentile of measured soil concentrations from a representative set of soil samples from a defined area. While nationwide background values exists, local authorities can determine their own background values, if needed. For example in the case of specific soils or circumstances.

The choice for the 95-percentile is based on the probability that if one takes a sample from a clean soil, that only 5%, i.e. 1 in 20 cases, should fail the test.

Conceptually, the calculation of a 95-percentile, a rank order statistic, is relatively simple. However the implementation can be cumbersome considering that datasets ussually have less than 100 samples, and do contain outlying and censored values.

The purpose of this package is to provide several methods to calculate a rank order statistic, often the 95 percentile, but other percentiles are also possible. Besides this 95-percentile, confidence intervals of this percentile are also calculated.

Limits of quantification (LOQ)

It is common that datasets with chemical soil analyses contain values below a limit of quantification (LOQ). This concentrations are usually indicated with a 'less than' sign, e.g. '<1 mg/kg'. Such values can not be considerd as numerical values, due to the '<', while numerical values are needed for the data analyses.

One way to cope with these values is to transform them to negative concentrations. In theory, a measured soil concentration can not be negative, so they don't exists in the database (and if they do, you should complain to the laboratory). The following code replaces these '<' signs:

library(achtergrondwaarden)
soildata <- c("<1","2","3")
x <- as.numeric(replacelt(soildata))
x

These negative values can then easily imputed to non-negative values using a impution factor. The following code replaces the value with half the LOQ value.

x.imp <- replaceNegative(x,replaceval=0.5)
x.imp

Other values which can be used are for example 0, or 0.7 in case of log-normal distributed values. The default value is to replace the LOQ value with NA.

Outliers

In case of a sample size of 50, including 5 values as outliers, the 95-percentile will be based on these outlying values. It is often unclear if an outlier is the result of a sampling or analytical artefact or a true outlier based on variance in soil concentrations. The choice of how to deal with outliers is therefore difficult.

In case one wants to remove the outliers from the data, several methods are common. The achtergrondwaarden package contains a function to remove outliers based on the distance of the median. The measure of this distance is the median absolute deviation (MAD).

The following code removes outliers which are larger than the median plus b=3 times the MAD

soildata <- c(1,2,3,2,1,10)
x <- rmoutlier(soildata,b=3)
x

Estimating distribution parameters

To estimate the 95-percentile and it's confidence intervals one can use a theoretical approach. Considering a dataset of n=100, the 95 percentile is equal to the 95th value of the ordered data. The theoretical confidence intervals can then be calculated as follows:

library(binom)
n <- 100
ci <- binom.confint(x=0.95*n,n,methods="exact")
ci

given the above example, the 95 percentile and confidence intervals of a dataset x can be estimated

x <- rnorm(n)
quantile(x,probs=c(ci$lower,.95,ci$upper))

Bootstrapping

Another way to estimate percitiles, like the 95-percentile, is using bootstrapping. The achtergrondwaarde library uses the boot package to perform the bootstrapping. One can estimate a 95-percentile using the following code:

data(meuse,package="sp")
p <- pbci(meuse$zinc,p=.95)
p

Distribution fitting

A third way to estimate percentiles and confidence intervals is by fitting a distribution model and then estimate the interval using the distribution model parameters. For this aproach the package fitdistrplus is used. The logfitci function from this package is a wrapper around the fitdist and bootdist functions from fitdistrplus, using a logarithmic model. To calculate the estimates the following code can be used:

p <- logfitci(x=meuse$zinc,p=.95)
p

Distribution fitting and LOQs

When the data contains many censored values, fitting the distribution can lean to a non-optimal fit. In the fitdistrplus package the fitdistcens function is available. This function fits a distribution based on data including censored value. The logfitcicens function of the achtergrondwaarde package is a wrapper around the fitdistcens and bootdistcens functions from fitdistrplus package.

If we assume that the limit of quantification of the zinc data lies at the 20th-percentile and replace the values below this percentile with the percentile itself (using negative numbers), othen one can estimate the 95-percentile and intervals as follows.

x <- meuse$zinc
loq <- quantile(x,p=.2)
x <- ifelse(x>loq,x,-1*loq)
p <- logfitcicens(x)
p$f
p$result

all together now

The qestimates function estimates quantiles based on bootstrapping and both fitting functions mentioned above. This qestimate function calls respectively the pbci, logfitci and logfitcicens functions and returns the result as a data.frame. The function tries hard to not throw an error when the estimation fails, in that case the estimates will be NA.

Environmental research datasets often contain many parameters. Using a for loop or apply function the data can be gathered in one single data.frame for further analyses. If parameters do not contain values or have a variance of zero, resulting in an error in underlying functions, this function returns NA values. So it can be part of an automated procedure.

On single call to qestimate looks like this:

p <- qestimates(meuse$zinc,p=.95)
p$result


rivm-syso/achtergrondwaarden documentation built on Jan. 22, 2020, 12:44 a.m.