knitr::opts_chunk$set(echo=TRUE,results='hide')
Many datasets include variables that are naturally bounded. For instance, precipitation and streamflow have a lower bound of zero; proportions are between zero and one; cloud cover is historically measured in okta, an integer between 0 and 8; etc.
Several options exist to estimate the distribution of such data. The first one is to transform them so that the transformed data become unbounded. For instance, a logarithm transformation is often used for strictly positive data; however this cannot be applied when data can take zero values (e.g. precipitation and streamflow). An alternative option is to leave the data as is but to use a bounded distribution that will respect the natural bounds of the data.
This package focuses on this second approach and provides tools to build bounded distribution by trimming standard distributions such as the normal, lognormal, beta, etc. Trimming can be made in two disctinct ways:
This vignette starts by describing the basic tools of the package disTRIMbution and then illustrates its application to a hydrologic dataset. The package is built on top of the distributions3 package, which will be systematically loaded along with disTRIMbution.
library(disTRIMbution)
Before moving to trimmed distributions, let's start by recalling how the package distributions3 can be used to first define a distribution object, and then to evaluate its pdf / cdf / quantile functions, or to generate data from it (for more details, please see the distributions3's website)
D <- Normal(mu=0.75,sigma=0.5) # defines a distribution object pdf(D,-0.1) # evaluates the pdf of a N(0.75;0.5) at -0.1 cdf(D,0.75) # evaluates the cdf of a N(0.75;0.5) at 0.75 quantile(D,0.99) # evaluates the 0.99-quantile from a N(0.75;0.5) set.seed(1);random(D,5) # generate 5 values from a N(0.75;0.5)
The easiest way to understand how trimming works is to compare these four basic functions (pdf
, cdf
, quantile
and random
) for the untrimmed distribution, and its rectified / truncated versions. The package disTRIMbution provides a function plot_trim
for this very purpose. Let's start by the untrimmed version (which is the default):
plot_trim(D) # Plot a N(0.75;0.5) distribution
Let's now trim this distribution by truncation between zero and one. The figure below shows that truncation simply 'removes' all values that are outside the bounds; also note that the pdf values are rescaled compared to the figure above, to makes sure that it still integrates to one :
plot_trim(D,trim=c(0,1),doTrunc=TRUE) # Plot a N(0.75;0.5) TRUNCATED between 0 and 1
Let's now trim the distribution by rectification between zero and one (doTrunc=FALSE, which is the default option). Instead of 'removing' all values that are outside the bounds, rectification rather 'stacks' them on the bounds, thus leading to non-zero probabilities for the random variable to be equal to the bounds. These probabilities correspond to the colored areas below the pdf, and can be computed using the cdf of the parent distribution as shown below the figure.
plot_trim(D,trim=c(0,1)) # Plot a N(0.75;0.5) RECTIFIED between 0 and 1
cdf(D,0) # left-side probability (of reaching the lower bound) 1-cdf(D,1) # right-side probability (of reaching the upper bound)
The package disTRIMbution provides the functions pdf_trim
, cdf_trim
, quantile_trim
and random_trim
to compute the numbers behind these plots, as illustrated below.
pdf(D,c(-0.1,0.75,2)) # uses the distributions3 function pdf_trim(D,c(-0.1,0.75,2)) # same result as above since no trimming by default pdf_trim(D,c(-0.1,0.75,2),trim=c(0,1)) # rectification pdf_trim(D,c(-0.1,0.75,2),trim=c(0,1),doTrunc=TRUE) # truncation
cdf(D,c(-0.1,0.75,2)) # uses the distributions3 function cdf_trim(D,c(-0.1,0.75,2)) # same result as above since no trimming by default cdf_trim(D,c(-0.1,0.75,2),trim=c(0,1)) # rectification cdf_trim(D,c(-0.1,0.75,2),trim=c(0,1),doTrunc=TRUE) # truncation
quantile(D,c(0.05,0.5,0.95)) # uses the distributions3 function quantile_trim(D,c(0.05,0.5,0.95)) # same result as above since no trimming by default quantile_trim(D,c(0.05,0.5,0.95),trim=c(0,1)) # rectification quantile_trim(D,c(0.05,0.5,0.95),trim=c(0,1),doTrunc=TRUE) # truncation
set.seed(1);random(D,5) # uses the distributions3 function set.seed(1);random_trim(D,5) # same result as above since no trimming by default set.seed(1);random_trim(D,5,trim=c(0,1)) # rectification set.seed(1);random_trim(D,5,trim=c(0,1),doTrunc=TRUE) # truncation
Finally, note that infinity values -Inf
and Inf
can be used to trim on one side only:
plot_trim(D,trim=c(0,Inf)) # Plot a N(0.75;0.5) rectified below 0
plot_trim(D,trim=c(-Inf,1)) # Plot a N(0.75;0.5) rectified above 1
The package disTRIMbution also provides tools for estimating trimmed distributions. This is illustrated here using a dataset describing low flows in the Cowhouse Creek river, Texas, USA. The data frame CowhouseCreek
contains 3 columns: the year in 1951-2019, the annual minimum streamflow (in $m^3.s^{-1}$) and the annual duration below a low-flow threshold of 0.01 $m^3.s^{-1}$ (in year).
CowhouseCreek[31:40,] # Show a 10-year extract
Both variables in this dataset are bounded. The annual minimum streamflow is bounded from below by zero, and this value is reached quite frequently since the river is intermittent. The annual low flow duration is necessarily between zero and one when expressed in year. The value zero is reached fairly regularly; the value one has never been reached yet, however several high values around 0.8 suggest that this may well end up happening. It therefore makes sense to consider a rectified distribution for both variables.
par(mfrow=c(1,2)) # plot both variables side by side plot(CowhouseCreek$year,CowhouseCreek$min,pch=19,xlab='Year',ylab='Annual min. [m3/s]') plot(CowhouseCreek$year,CowhouseCreek$duration,pch=19,xlab='Year',ylab='Low flow duration [year]',ylim=c(0,1))
We start with the variable 'duration' and use a normal distribution rectified between zero and one. The latter can be estimated (using the maximum likelihood approach) with the function estim_ML
as shown below.
# Start by specifying the assumed distribution with 'first guess' parameters (D0) D0 <- Normal(mu=0.05,sigma=0.1) # Estimate parameters and return the estimated distribution D D <- estim_ML(d=D0,y=CowhouseCreek$duration,trim=c(0,1)) D
The goodness-of-fit of the estimated distribution D to the data can be evaluated by comparing, for instance, the estimated pdf with the raw data. The function plot_trim_pdf
can be used to plot the estimated pdf. Since it returns a ggplot object, a layer representing the raw data can easily be added as shown below.
library(ggplot2) g <- plot_trim_pdf(D,trim=c(0,1)) # estimated pdf g+geom_dotplot(data=CowhouseCreek,aes(x=duration),binwidth = 0.01,dotsize=5,alpha=0.5,fill='blue') # add layer for observations
An alternative and arguably easier-to-interpret comparison can be made by using the cdf rather than the pdf. The resulting figure suggests that the fit seems acceptable.
g <- plot_trim_cdf(D,trim=c(0,1)) # estimated cdf g+stat_ecdf(data=CowhouseCreek,aes(duration),color='blue') # add layer for observations
Remind that the value 1 has never been observed so far, meaning that the streamflow never spent the whole year below the low flow threshold. But what is the probability of this happening? This can be evaluated by using the estimated distribution as shown below. The result suggests a probability of around 0.4% - rather low but not impossible!
1-cdf(D,1) # 'right-side probability' of reaching 1
We now turn our attention to the variable 'min', and we are going to try a non-normal distribution for this variable. The list of available distributions include all those available in the package distributions3 (see here) plus a few additional ones provided by disTRIMbution (type help(package="disTRIMbution")
). Here we use the distribution GEVmin
, which is a natural candidate for such annual minimum values (see here).
# Start by specifying the assumed distribution with 'first guess' parameters (D0) D0 <- GEVmin(loc=0.1,scale=0.1,shape=-0.5) # Estimate parameters and return the estimated distribution D D <- estim_ML(d=D0,y=CowhouseCreek$min,trim=c(0,Inf)) D
As previously, comparison between the estimated and empirical cdf suggests an acceptable fit.
g <- plot_trim_cdf(D,trim=c(0,Inf)) # estimated cdf g+stat_ecdf(data=CowhouseCreek,aes(min),color='blue') # add layer for observations
Finally, note that the function PBoot
allows quantifying the uncertainty in parameter estimates by means of a parametric bootstrap resampling approach.
# perform 100 bootstrap replicates (at least 1000 would be better) boot <- PBoot(d=D,n=length(CowhouseCreek$min),nsim=100,trim=c(0,Inf)) # scatterplot of estimated parameters plot(boot)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.