knitr::opts_chunk$set(echo = TRUE, cache = TRUE, tidy.opts = list(width.cutoff = 60), tidy = FALSE)
require(snowfall)
This document shows how to use the indicator pipeline to create biodiversity indicators such as those for DEFRA's Biodiversity Indicators in Your Pocket. The pipeline is shared in the form of an R package called 'BRCindicators' making it easy to share and maintain.
The functions in BRCindicators work with yearly estimates of species abundance or occurrence and aggregate them into an scaled indicator value with bootstrapped confidence intervals
This package has the ability to read in the output of occupancy models created in the R package sparta, a package for estimating species trends from occurrence data. This package can be installed from Github and details of how to use the package are given in the package vignette. There is no need to use sparta to create your yearly species estimates as BRCindicators can also work with other data.
To create an indicator we first need to have species trends, let's create some using the sparta R package.
If you already have yearly estimates of abundance or occurrence for your species you can skip this stage. Here we show how you can create these estimates from raw species observation data using sparta.
# Install BRCindicators from github library(devtools) #install_github('biologicalrecordscentre/sparta')
Let's assume you have some raw data already, we can under take occupancy modelling like this
# Create data n <- 8000 # size of dataset nyr <- 50 # number of years in data nSamples <- 200 # set number of dates nSites <- 100 # set number of sites set.seed(125) # set a random seed # Create somes dates first <- as.Date(strptime("1950/01/01", "%Y/%m/%d")) last <- as.Date(strptime(paste(1950+(nyr-1),"/12/31", sep=''), "%Y/%m/%d")) dt <- last-first rDates <- first + (runif(nSamples)*dt) # taxa are set semi-randomly taxa_probabilities <- seq(from = 0.1, to = 0.7, length.out = 26) taxa <- sample(letters, size = n, TRUE, prob = taxa_probabilities) # sites are visited semi-randomly site_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSites) site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE, prob = site_probabilities) # the date of visit is selected semi-randomly from those created earlier time_probabilities <- seq(from = 0.1, to = 0.7, length.out = nSamples) time_period <- sample(rDates, size = n, TRUE, prob = time_probabilities) myData <- data.frame(taxa, site, time_period)
# Load the sparta package library(sparta)
For demonstration purposes I have a faked dataset of 8000 species observations. In my dataset the species are named after the letters in the alphabet. Below I show how I can use the Bayesian occupancy models in sparta to create yearly estimates of occurrence. For more information please see the vignette for sparta
# Preview of my data head(myData) # First format our data formattedOccData <- formatOccData(taxa = myData$taxa, site = myData$site, survey = myData$time_period) # Here we are going to use the package snowfall to parallelise library(snowfall) # I have 4 cpus on my PC so I set cpus to 4 # when I initialise the cluster sfInit(parallel = TRUE, cpus = 4) # Export my data to the cluster sfExport('formattedOccData') # I create a function that takes a species name and runs my model occ_mod_function <- function(taxa_name){ library(sparta) # Note that this will write you results to your computer # the location is set to your user folder occ_out <- occDetFunc(taxa_name = as.character(taxa_name), n_iterations = 200, burnin = 15, occDetdata = formattedOccData$occDetdata, spp_vis = formattedOccData$spp_vis, write_results = TRUE, output_dir = '~/Testing_indicator_pipe', seed = 123) } # I then run this in parallel system.time({ para_out <- sfClusterApplyLB(unique(myData$taxa), occ_mod_function) }) # Stop the cluster sfStop() # We can see all the files this has created list.files('~/Testing_indicator_pipe')
Installing the package is easy and can be done in a couple of lines
library(devtools) install_github('biologicalrecordscentre/BRCindicators')
Now that we have some species trends data to work with (no doubt you already have your own) we can use the first function in BRCindicators. This function reads in all the output files from sparta (which are quite large and complex) and returns a simple summary table that we can use for calculating the indicator. If you have done your analysis without using sparta you can skip to the next step.
library(BRCindicators) # All we have to supply is the directory where out data is saved # You will note this is the 'output_dir' passed to sparta above. trends_summary <- summarise_occDet(input_dir = '~/Testing_indicator_pipe') # Lets see the summary head(trends_summary[,1:5])
Returned from this function is a summary of the data as a matrix. In each row we have the year, specified in the first column, and each subsequent column is a species. The values in the table are the mean of the posterior for the predicted proportion of sites occupied, a measure of occurrence.
Once we have species-year indicies we are in a position to proceed to calculating an indictor. To do this there are a number of mehods available, some of which are presented here in 'BRCindicators'
The geometric mean method is often used with data that do not have errors associated with them.
The first step is to re-scale the data so that the value for all species in the first year is the same. Once this is done we calculate the geometric mean across species for each year creating the indicator value. This function also accounts for species that have no data at the beginning of the dataset by entering them at the geometric mean for that year, this stops them dramatically changing the indicator value in the year they join the dataset. It also accounts for species that leave the dataset before the end by holding them at their last value. Finally limits to species values can be given, preventing extremely high or low values biasing the indicator.
The data I have generated in 'trends_summary' is very easy to work with but to show off what this function can do I'm going to mess it up a bit.
trends_summary[1:3, 'a'] <- NA trends_summary[1:5, 'b'] <- NA trends_summary[2:4, 'c'] <- 1000 trends_summary[45:50, 'd'] <- NA # Let's have a look at these changes head(trends_summary[,1:5]) tail(trends_summary[,1:5])
Now that I have 'messed up' the data a bit we have two species with data missing at the beginning and one species with data missing at the end. We also have one species with some very high values.
Now lets run this through the re-scaling function.
# Let's run this data through our scaling function (all defaults used) rescaled_trends <- rescale_species(Data = trends_summary) # Here's the result head(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')]) tail(rescaled_trends[,c('year', 'indicator', 'a', 'b', 'c', 'd')])
You can see that species 'a' and 'b' enter the dataset at the geometric mean (the indicator value), all species are indexed at 100 in the first year and the very high values in 'c' are capped at 10000 at the end 'd' has been held at it's end value.
The 'indicator' column that is returned here is our indicator, calculated as the geometric mean of all the species in the data set.
We can get confidence intervals for this indicator by bootstrapping across species. We have a function for that too!
# This function takes just the species columns scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] indicator_CIs <- bootstrap_indicator(Data = scaled_species) # Returned are the CIs for our indicator head(indicator_CIs)
It is sometimes desirable to create a smoothed indicator value from the raw values. This can be achieved by fitting a GAM (general additive model) to the indicator using a spline. This spline is a smoothed curve that goes through the raw values for the indicator and is fitted using the function 'gam' in the 'mgcv' R package.
# The smoothing function takes the indicator values smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) # In this example there is little support for a non-linear trend and # so the line almost linear plot(x = rescaled_trends[,'year'], y = rescaled_trends[,'indicator']) lines(x = rescaled_trends[,'year'], y = smoothed_indicator, col = 'red') # But if our indicator did support a non-linear trend it might look # like this eg_indicator <- jitter(sort(rnorm(50)), amount = 0.5) eg_smoothed <- GAM_smoothing(eg_indicator) plot(x = 1:50, y = eg_indicator) lines(x = 1:50, y = eg_smoothed, col = 'red')
Where there is little support for a non-linear trend a GAM smoothed line will tend towards linear. Where there is good support for a non-linear trend the smoothed line will become more 'bendy'.
We now have our indicator and the confidence intervals around it. The next step is to plot it. We have included a function that creates a simple plot using ggplot2, however you could easily create your own plots in R using the data.
# Plot our indicator. plot_indicator(indicator = rescaled_trends[,'indicator'], smoothed_line = smoothed_indicator, CIs = indicator_CIs)
In this plot you can see the high upper confidence interval in years 2-4, this is due to the artificially high values we gave to species 'c'.
The Bayesian Meta-Analysis method, or BMA, is suited to data with standard errors associated with them. As with other methods we require data from more than one species, across a number of years, with an error for each species-year estimate.
# Here is an example dataset for the BMA method data <- data.frame(species = rep(letters, each = 50), year = rep(1:50, length(letters)), index = runif(n = 50 * length(letters), min = 0, max = 1), se = runif(n = 50 * length(letters), min = 0.01, max = .1)) head(data)
It is important that your data is in the same format and that your columns are in the same order and have the same names. Remember you can use the function read.csv()
to read in the data from a .csv on your computer.
BMA is run using the function bma
, here we will use the default settings and then see what we can change.
bma_indicator <- bma(data)
The function returns a plot to your screen which is a diagnostic plot of the model. When the model has converged (i.e. reached a point where the three chains agree on the answer) the lines on the plots on the left will sit on top of one another and the plots on the right will have a nice bell shape. You can turn off this plot by setting plot
to FALSE
. By default the method runs the chains in series. Running them in parallel makes the models run faster (about half the time) but will slow down your computer more. We can change this with the parameter parallel
. The number of iterations the model runs is controlled by n.iter
and defaults to 10000. If you can it is better to run it for more iterations, though this will take longer. m.scale
gives the scale your data is on. It is very important that this is correct, choose from 'loge' (natural log, sometimes simply called 'log'), 'log10' (log to the base 10), or 'logit' (output from models of proportions or probabilities).
Let's implement a few of these changes
bma_indicator2 <- bma(data, parallel = TRUE, n.iter = 500, m.scale = 'log10')
Because we have reduced the number of interations the model no longer has a good convergence. The lines on the graphs on the left do not overlap and the graphs on the right are no longer a smooth bell shape.
The object that is returned is a data.frame with years as rows and columns giving the year value, index value and confidence intervals. You can write this to a csv using the function write.csv
.
head(bma_indicator)
We can use the plotting function in BRCindicators to plot the results of this analysis, which in this case are not all that interesting!
plot_indicator(indicator = bma_indicator[,'Index.M'], CIs = bma_indicator[,c(3,4)])
The multi-species indicator method was developed by Statistics Netherlands and the code is made available on their website. To find out more about the inner working of this method please read the detailed documentation on the authors website. Here is a simple example of how this method runs in BRCindicators
.
# Create some example data in the format required nyr = 20 species = rep(letters, each = nyr) year = rev(rep(1:nyr, length(letters))) # Create an index value that increases with time index = rep(seq(50, 100, length.out = nyr), length(letters)) # Add randomness to species index = index * runif(n = length(index), 0.7, 1.3) # Add correlated randomness across species, to years index = index * rep(runif(0.8, 1.2, n = nyr), length(letters)) se = runif(n = nyr * length(letters), min = 10, max = 20) data <- data.frame(species, year, index, se) # Our species are decreasing plot(data$year, data$index) # Species index values need to be 100 in the base year. Here I use # the first year as my base year and rescale to 100. The standard error # in the base year should be 0. min_year <- min(data$year) for(sp in unique(data$species)){ subset_data <- data[data$species == sp, ] multi_factor <- 100 / subset_data$index[subset_data$year == min_year] data$index[data$species == sp] <- data$index[data$species == sp] * multi_factor data$se[data$species == sp] <- data$se[data$species == sp] * multi_factor data$se[data$species == sp][1] <- 0 } # Our first year is now indexed at 100 plot(data$year, data$index) # Alternativly I could read in data from a csv # data <- read.csv('path/to/my/data.csv') # Run the MSI function msi_out <- msi(data) head(msi_out$CV) # I can capture the output figures too # pdf('test.pdf') # msi_out <- msi(data) # dev.off()
The code returns two plots to the console, the first plot shows the coefficient of variation (CV) for each of the species. Species with high values of CV may adversly effect the relaibility of the trend estimation. Use this graph to identify the CV values of the species and use the maxCV
parameter to set a threshold above which species will be excluded. The results of excluding species in this way can be tested by comparing trend plots. The CV values are hard to assign to species from this plot as the species are coded to numbers. To see the raw values look at the CV component of msi_out (i.e. msi_out$CV
). The second plot shows the smoothed trend and the MSI values. These two figures can be captured in the usual way in R by using pdf()
for example. In the example I create a dataset from random numbers but usually you would use read.csv()
to read in data from a local file.
Here is a second example which sets some additional parameters. The parameters for msi
get passed to msi_tool
so to see a list of all the parameters you can change look at the help documentation in msi_tool
usign ?msi_tool
at the R console. I cover most of hte important ones here.
msi_out <- msi(data, nsim = 500, # The number of Mote Carlo simulations SEbaseyear = 10, # The year to index on plotbaseyear = 15, # The year to set as 100 in plots index_smoot = 'INDEX', # plotbaseyear uses MSI not trend span = 0.7, # 'wigglyness' of line, between 0 and 1 lastyears = 5, # last X years of time series for short-term trends maxCV = 10, # maximum allowed Coefficient of Variation changepoint = 10, # compare trends before and after this year truncfac = 8, # max year-to-year index ratio TRUNC = 5, #set all indices below TRUNC to this plot = TRUE # should the plots be returned?) )
This set of parameters is unrealistic but shows the options available. Note that in the second graph the year 10 point now has a se = 0, year 15 MSI is set to 100, and the short term trend is reported for the last 5 years.
The analysis also returns data which provide more insights into the analysis and let you create your own plots if required.
# The returned object has 2 elements head(msi_out$results)
The first of the two elements (results
) returned gives all the data, and a little more, that is presented in the second figure.
# The returned object has 2 elements msi_out$trends # I could write this as a csv too # write.csv(msi_out$trends, file = 'path/to/my/output.csv')
The second element (trends
) returned give a summary of various trend assessments across the time series.
We have also added a plot method for the MSI output which provides a plot similar to that of the second figure we have seen already. Lets use this plot method to explore the effect of changing the span value in the analysis
for(i in c(0.3, 0.5, 0.7)){ # use a range of values for span msi_out <- msi(data, span = i, # span is set to i nsim = 200, plot = FALSE) print( # print makes the plot visible in the for loop plot(msi_out, title = paste('MSI - span =', i)) # plot ) }
As the value of span gets closer to 1 the trend line gets smoother.
The lambda indicator calculates an indicator using growth rates from one year to the next. Formulating the indicator in terms of growth rates has two distinct advantages over the conventional approach to constructing indicators. First, it means that the categorisation of species as ‘increasing’ or ‘decreasing’ can be made from the same set of data (the growth rates) as the construction of the indicator. Second, it provides an elegant solution to the problem of species that join the indicator after the first year (i.e. where the first year is unreliable): other indicators typically adopt a complicated rescaling approach to ensure that species entering the indicator after the first year do not bias the overall assessment. It also makes a simple and robust, though untestable, assumption about species that drop out of the indicator prior to the final year: specifically it assumes that their fluctuations are the same, in aggregate, as those of the species that remain in the indicator. For more details see http://webarchive.nationalarchives.gov.uk/20170302170037/http://jncc.defra.gov.uk/Docs/UKBI2015_TechBG_C4b-D1c_Bayesian_Final.docx
Very few species’ models produced reliable occupancy estimates for every year, so a majority of the time series contain missing values. This presents a problem for estimating growth rates for each species-year combination. Missing values of growth rate that would be equivalent to linear interpolation of the log odds between adjacent years with reliable estimates were therefore calculated. This indicator can therefore work with missing values.
Input data is on the occupancy scale, and is therefore bounded between 0 and 1.
# number of species nsp = 50 # number of years nyr = 40 #number of iterations iter = 500 # Build a random set of data myArray <- array(data = rnorm(n = nsp*nyr*iter, mean = 0.5, sd = 0.1), dim = c(nsp, nyr, iter), dimnames = list(paste0('SP',1:nsp), 1:nyr, 1:iter)) # Ensure values are bounded by 0 and 1 myArray[myArray > 1] <- 1 myArray[myArray < 0] <- 0 str(myArray)
lambda_indicator
takes in an array of data, a three-dimensional matrix. The dimensions of this array represent species, years, and iterations. Each row represents a species and each column a year. The third dimension of the array contains the iterations. Essentially each slice contains occupancy estimates for each species year combination for a single iteration and the overall array contains as many slices as there are iterations.
# Run the lambda_interpolation method on this data myIndicator <- lambda_indicator(myArray) # Plot the indicator plot_indicator(myIndicator$summary[,'indicator'], myIndicator$summary[,c('lower' ,'upper')])
There are a number of options available in the lambda_indicator
function
myIndicator <- lambda_indicator(myArray, index = 1, # Set the index value to 1 not 100 year_range = c(30,40), # Set year range threshold_yrs = 5) # set a threshold plot_indicator(myIndicator$summary[,'indicator'], myIndicator$summary[,c('lower' ,'upper')])
Note that there are a range of threshold functions that allow you to adjust which data points are used in the indicator. There are options to remove species year estimates based on their standard deviaction, Rhat value and based on the number of years a species is present in the dataset. Note that the Rhat threshold can only be used if you are using a directory path as you input rather than an array.
We have demonstrated how you might run the indicator functions one at a time, however in a 'pipeline' we want data to flow through seamlessly. Additionally there are a number of parameters in the functions that we have not shown you that you might find useful. Here is an example of how you can create your own pipeline function. Our function will wrap around the functions described above, setting the parameters to meet our needs. Once we have done this it will allow use to execute our pipeline in one line.
# I call my function 'run_pipeline' and the only arguement it # takes is the directory of sparta's output run_pipeline <- function(input_dir){ require(sparta) require(BRCindicators) # Create the trends summary trends_summary <- summarise_occDet(input_dir = input_dir) # Rescale the values and get the indicator values # Here I set the index to 1 and change the value limits rescaled_trends <- rescale_species(Data = trends_summary, index = 1, max = 100, min = 0.001) # Bootstrap the indicator to get CIs scaled_species <- rescaled_trends[,!colnames(rescaled_trends) %in% c('year', 'indicator')] # This time I set the iterations to twice the default and # use custom confidence intervals indicator_CIs <- bootstrap_indicator(Data = scaled_species, CI_limits = c(0.25, 0.75), iterations = 20000) # Get the smoothed indicator line smoothed_indicator <- GAM_smoothing(rescaled_trends[,'indicator']) # This time I specify the years and index value plot_indicator(indicator = rescaled_trends[,'indicator'], year = rescaled_trends[,'year'], index = 1, CIs = indicator_CIs, smoothed_line = smoothed_indicator) ## I'll return all my data return(cbind(smoothed_indicator, indicator_CIs, as.data.frame(trends_summary))) }
Once we have created this function we can run this pipeline on a directory in one line, or put it in a loop to run across many directories.
# Now we can run the pipeline in one line, like a boss indicator_data <- run_pipeline(input_dir = '~/Testing_indicator_pipe') head(indicator_data)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.