```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = TRUE
)
require(tidyverse)
require(sf)
#require(ggplot2)
require(SurveyPower)
```


Sampling strategies
============

To revisit or not to revisit
--------------

When surveying locations to identify time trends, a common and simple design is to pick a number of sites that you revisit over some period, visiting all sites at every survey time. This way, you can account for, and limit the effect of, consistent differences between sites. If, for example, one site shows consistently higher values than another, the two may still show the same trend, and revisiting the sites helps us account for these systematic differences. Figure \ref{revisitPlot} shows a hypothetical example of 5 kommuner (municipalities) that differ in their overall values but follow the same time trend. If the overall levels differ between sites but the sites follow the same overall time trend, it makes sense to revisit each location at every sampling time. Since the sites follow the same trend, you don't have to visit many locations to identify the underlying common trend.

```r
system.time(
  test5yearsTrend <- createOccNorm(map10km,
                                   intercept = 10,
                                   sigmaFylke = 0,
                                   sigmaKommune = 0.5,
                                   sigmaGrid = 0.5,
                                   nYears = 5,
                                   interceptTrend = -0.1,
                                   sigmaFylkeTrend = 0,
                                   sigmaKommuneTrend = 0,
                                   sortGrid = FALSE,
                                   sortFylke = FALSE,
                                   sortKommune = FALSE)
)
```

```r"} test5yearsTrend %>% filter(KOMMUNENUMMER < 110) %>% select(kommune, KOMMUNENUMMER, norm, year) %>% ggplot(.) + geom_smooth(aes(x = year, y = norm, color = kommune), stat = "summary", fun.y = "mean", lwd =2)

However, not all locations may follow the same time trend so closely. If the locations have varying trends, as in figure \ref{spreadPlot}, it becomes harder to identify a potential underlying, common trend. In that case, we need to visit more individual locations to be able to identify the underlying trend. This might mean that we cannot revisit every site every year within the constraints of the budget. We may then have to stagger our visits, so that we revisit each location every other year, or every fourth year, in order to cover more locations.


```r
varying5yearsTrend <- createOccNorm(map10km,
                                    intercept = 10,
                                    sigmaFylke = 0,
                                    sigmaKommune = 0.5,
                                    sigmaGrid = 0.5,
                                    nYears = 5,
                                    interceptTrend = -0.1,
                                    sigmaFylkeTrend = 0,
                                    sigmaKommuneTrend = 0.4,
                                    sortGrid = FALSE,
                                    sortFylke = FALSE,
                                    sortKommune = FALSE)
```

```r"} varying5yearsTrend %>% filter(KOMMUNENUMMER < 110) %>% select(kommune, KOMMUNENUMMER, norm, year) %>% ggplot(.) + geom_smooth(aes(x = year, y = norm, color = kommune), stat = "summary", fun.y = "mean", lwd =2)

This means we end up with a trade-off between the revisiting schedule and the number of locations to visit, given that we can only visit so many locations each year. Figures \ref{onePlot}, \ref{twoPlot}, \ref{threePlot}, and \ref{sixPlot} show the various ways we could allocate a yearly capacity of surveying 6 locations a year, over 6 years, while keeping the number of times each location is visited the same for all locations (a balanced data set).

```r"}
dat <- as.tibble(expand.grid("Location" = paste("Location ", 6:1), "year" = 1:6)) 
dat$visit <- 1
dat


ggplot(dat, aes(y=Location, x=year, fill=visit)) + 
  geom_tile(colour="white", 
            width=.9, height=.9) + theme_minimal() +
    scale_fill_gradientn(colors = "green") 

```r"} dat <- as.tibble(expand.grid("Location" = paste("Location ", 12:1), "year" = 1:6)) dat$visit <-0

dat$visit[dat$year == 1] <- rep(c(0, 1), 6) dat$visit[dat$year == 2] <- rep(c(1, 0), 6) dat$visit[dat$year == 3] <- rep(c(0, 1), 6) dat$visit[dat$year == 4] <- rep(c(1, 0), 6) dat$visit[dat$year == 5] <- rep(c(0, 1), 6) dat$visit[dat$year == 6] <- rep(c(1, 0), 6)

ggplot(dat, aes(y=Location, x=year, fill=visit)) + geom_tile(colour="white", width=.9, height=.9) + theme_minimal() + scale_fill_gradientn(colors = c("white", "green"))

```r"}
dat <- as.tibble(expand.grid("Location" = paste("Location ", 18:1), "year" = 1:6)) 
dat$visit <-0

dat$visit[dat$year == 1] <- rep(c(0, 0, 1), 6)
dat$visit[dat$year == 2] <- rep(c(0, 1, 0), 6)
dat$visit[dat$year == 3] <- rep(c(1, 0, 0), 6)
dat$visit[dat$year == 4] <- rep(c(0, 0, 1), 6)
dat$visit[dat$year == 5] <- rep(c(0, 1, 0), 6)
dat$visit[dat$year == 6] <- rep(c(1, 0, 0), 6)



ggplot(dat, aes(y=Location, x=year, fill=visit)) + 
  geom_tile(colour="white", 
            width=.9, height=.9) + theme_minimal() +
    scale_fill_gradientn(colors = c("white", "green")) 

```r"} dat <- as.tibble(expand.grid("Location" = paste("Location ", 36:1), "year" = 1:6)) dat$visit <-0

dat$visit[dat$year == 1] <- rep(c(0, 0, 0, 0, 0, 1), 6) dat$visit[dat$year == 2] <- rep(c(0, 0, 0, 0, 1, 0), 6) dat$visit[dat$year == 3] <- rep(c(0, 0, 0, 1, 0, 0), 6) dat$visit[dat$year == 4] <- rep(c(0, 0, 1, 0, 0, 0), 6) dat$visit[dat$year == 5] <- rep(c(0, 1, 0, 0, 0, 0), 6) dat$visit[dat$year == 6] <- rep(c(1, 0, 0, 0, 0, 0), 6)

ggplot(dat, aes(y=Location, x=year, fill=visit)) + geom_tile(colour="white", width=.9, height=.9) + theme_minimal() + scale_fill_gradientn(colors = c("white", "green"))

There are some rules for how such survey schemes can be set up in a balanced way. For simplicity, we limit our explorations to balanced cases.

We can define these parameters: 
$$
\begin{aligned}
a &: \text{sampling capacity each year} \\
T &: \text{total timespan of the survey} \\
t &: \text{resampling interval per locality} \\
l &: \text{number of localities} \\
s &: \text{total number of samples} \\
r &: \text{replicates (visits) per location}
\end{aligned}
$$

The number of localities we can survey is then $l = t * a$, which is maximised by setting $t = T$, so that each location is visited only once. The shortest resampling interval is $t = 1$, i.e. revisiting every location every year, which minimizes the number of locations ($l = a$) but at the same time maximizes the number of visits per location, since $r = T / t$. The total number of samples is $s = T * a$ in all cases. Lastly, a balanced survey scheme requires that the total survey length is evenly divisible by the resampling interval, i.e. that $T\bmod t=0$.
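
To make these relationships concrete, here is a minimal illustration in plain R (not using SurveyPower), listing the balanced schemes for the six-year, six-locations-per-year example above:

```r
# Minimal sketch of the relationships above (plain R, no SurveyPower):
# for a 6-year survey with a capacity of 6 locations per year, list the
# balanced schemes, i.e. the resampling intervals t that divide T evenly.
T_years <- 6   # total timespan of survey (T)
a       <- 6   # sampling capacity each year (a)

t_interval <- (1:T_years)[T_years %% (1:T_years) == 0]

data.frame(
  resampleInterval  = t_interval,
  nLocations        = t_interval * a,       # l = t * a
  visitsPerLocation = T_years / t_interval, # r = T / t
  totalSamples      = T_years * a           # s = T * a
)
```

This reproduces the four schemes shown in figures \ref{onePlot} to \ref{sixPlot}: 6, 12, 18, and 36 locations, visited 6, 3, 2, and 1 times, respectively.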

The possible survey schemes following from these relationships can be calculated with the custom function `sampleAlternatives`, which for convenience creates an object with its own plot method. Figure \ref{resampPlot} shows how the number of possible locations and the number of replicates (visits) per location depend on the yearly survey capacity and the time lag between visits to each location. As the figure shows, you can survey a staggering number of locations in 40 years, if you are willing to drastically lower the number of revisits. The trade-off between the number of locations and the number of visits per location is quite sharp, and it is difficult to achieve high numbers of both, even with a high yearly capacity.


```r"}
testSampleAlt <- sampleAlternatives(maxTime = 40,  
                        maxCapacity= 400, 
                        stepsCapacity = 100)

class(testSampleAlt)

plot(testSampleAlt,
     color = "resampleTime", 
     allTicks = F)

It is important to keep in mind the time span of the survey effort, within which we expect to evaluate the results. For most programs, it would be unreasonable to have to wait 40 years to evaluate possible time trends. Figures \ref{T9a3Plot} and \ref{T9a300Plot} show a more reasonable time span of 9 years, with low and high yearly capacities, respectively.

```r"} plot(sampleAlternatives(maxTime = 9, maxCapacity= 3, stepsCapacity = 1), color = "totSamples")

```r"}
plot(sampleAlternatives(maxTime = 9,  maxCapacity= 300, stepsCapacity = 60), color = "totSamples")

For an active real-world example, we can plot the situation for the bumblebee and butterfly survey. This survey is carried out each year in 18 locations in each of three regions, together comprising 54 locations. The survey has been active since 2009, and it thus exemplifies the results of potential schemes in the year 2021. Today, we revisit the same locations each year, resulting in 18 locations per region. But if we were to revisit each site only every third year, we could visit 54 locations per region, 4 times each, in the same time span. Or, for all regions combined, instead of visiting 54 locations 12 times, we could visit 162 locations 4 times each, figure \ref{T12a54Plot}.
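
The same relationships can be checked directly with a small illustrative calculation in plain R (not part of the survey itself):

```r
# Bumblebee/butterfly survey example: T = 12 years, a = 54 locations per year
# across the three regions. Compare revisiting every year (t = 1) with
# revisiting every third year (t = 3).
T_years    <- 12
a          <- 54
t_interval <- c(1, 3)

data.frame(
  resampleInterval  = t_interval,
  nLocations        = t_interval * a,       # 54 vs 162 locations
  visitsPerLocation = T_years / t_interval, # 12 vs 4 visits per location
  totalSamples      = T_years * a           # 648 samples in both cases
)
```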

```r"} plot(sampleAlternatives(maxTime = 12, maxCapacity= 54, stepsCapacity = 18), color = "resampleTime")

The optimal strategy will vary depending on the various sources of variation. High year-to-year variability increases the need for multiple samples per location, while high variability between sites increases the need to visit many sites. If we are interested in the absolute levels in each region, and these vary by site, we must visit more sites per region. If we are interested in the trends in each region, and these trends vary by site, we also need to visit more sites per region. It is not easy to work out the optimal strategy, or to calculate the statistical power of various survey schemes, analytically. We need to simulate various situations and test out different strategies.
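
One way to think about these sources of variation is an analysis model with a common trend plus site-level random intercepts and slopes. As a sketch (using `lme4`, which is an assumption on my part rather than the package's documented analysis routine), applied to the `varying5yearsTrend` data simulated above:

```r
library(lme4)

# Common time trend, with kommuner allowed to differ both in overall level
# (random intercept) and in trend (random slope).
trendFit <- lmer(norm ~ year + (1 + year | kommune),
                 data = varying5yearsTrend)

# The fixed effect of year estimates the underlying common trend, while the
# random-effect standard deviations reflect the between-kommune variation in
# level and in trend (cf. sigmaKommune and sigmaKommuneTrend above).
summary(trendFit)
```

The larger the random slope (trends varying by site), the more sites we need to pin down the common trend; the larger the within-site variation, the more revisits each site needs.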



Insect survey scheme - Power analysis
============
For the general insect survey, the workflow for a power analysis is to

1.  create a general framework that simulates insect communities throughout the country,
    a. this includes generating underlying distribution parameters, and
    b. drawing samples from these distributions
2.  use this method to generate many simulated insect communities under different scenarios
3.  sample these communities, mimicking various survey schemes
4.  calculate statistics on the power and general accuracy of the survey scheme.

We simulate the communities in a way that is as similar as possible to the models that would be used to analyze real data. There are two main reasons for this. First, it makes it possible to troubleshoot the process by simulating and analysing the data using the same basic model. Secondly, the parameter estimates are relevant to real-world scenarios, for instance making it possible to feed parameter estimates from future models fitted to real data back into the simulation routine, to update the power analysis.

Different types of variables are relevant to insect survey schemes. First, we may want to model occurrences of individual species, using presence/absence data. This would typically be done with logistic regression, using a binomial distribution. Secondly, we may want to record data such as the biomass of the whole sample or of a specific taxon, which would be normally distributed. Lastly, we may want to analyse count data, such as the number of individuals or the number of species in a sample. This is typically done via a Poisson distribution. In all three cases, the statistical models will try to estimate the parameters of these distributions.
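
As a small, self-contained illustration of the kinds of models this implies (the data frame and column names here are made up for the example and are not part of SurveyPower):

```r
# Toy data illustrating the three response types (all values made up):
set.seed(1)
toyData <- data.frame(
  year    = rep(1:5, each = 20),
  occ     = rbinom(100, size = 1, prob = 0.5),  # presence/absence of a species
  biomass = rnorm(100, mean = 10, sd = 2),      # biomass of the whole sample
  count   = rpois(100, lambda = 4)              # number of individuals/species
)

# The corresponding models that estimate the distribution parameters:
fitOcc   <- glm(occ ~ year, family = binomial, data = toyData)  # logistic regression
fitMass  <- lm(biomass ~ year, data = toyData)                  # normal distribution
fitCount <- glm(count ~ year, family = poisson, data = toyData) # Poisson regression
```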

Data simulation
-----------------------
The data generation is divided into two parts. First, we produce a set of distribution parameters for each potential location. Locations are aligned with any of the SSB standard grids, so we can generate data at 250x250 m, 500x500 m, 1x1 km, 5x5 km, or 10x10 km resolution. As already described, we can choose to produce occurrence probabilities for a binomial distribution, means for a normal distribution, or lambdas for a Poisson distribution. These are the underlying process parameters that we will try to estimate from the sampled data.

Second, we simulate observing actual measurements, i.e. drawing samples from these distributions, which constitutes the observation model. This then constitutes the observed data. Both steps will be repeated many times, and the resulting data sets will be analysed to empirically estimate the power of various sampling schemes.
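
To make the idea concrete, here is a deliberately simplified, stand-alone sketch of such a simulation loop (plain R, not the package's actual routines), estimating the power to detect a declining trend under a yearly-revisit scheme:

```r
# Simplified power-analysis loop (illustrative only, not SurveyPower code):
set.seed(42)
nSim   <- 200   # number of simulated data sets
nSites <- 18    # locations visited every year
nYears <- 6

detected <- replicate(nSim, {
  # 1. Process model: site-level intercepts and a common declining trend.
  siteInt   <- rnorm(nSites, mean = 10, sd = 0.5)
  trueTrend <- -0.1
  sim <- expand.grid(site = factor(1:nSites), year = 1:nYears)
  mu  <- siteInt[as.integer(sim$site)] + trueTrend * sim$year

  # 2. Observation model: draw the observed values.
  sim$norm <- rnorm(nrow(sim), mean = mu, sd = 0.5)

  # 3. Analyse the simulated data and test for a time trend.
  fit <- lm(norm ~ year + site, data = sim)
  summary(fit)$coefficients["year", "Pr(>|t|)"] < 0.05
})

mean(detected)  # empirical power: proportion of simulations detecting the trend
```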

I have created some routines to ease these repetitive steps.


Accessing the SSB grids
--------------
I have made a function that pulls the SSB grid from NINA's PostGIS database, for a range of potential scales.

```r
postgresConnect()
map10km <- getSSBGrid()
```

Note that the finer-scale maps are quite large. I have also included the 10 km grid in the package to make it self-contained. The map is of class `sf` and can be plotted with standard functions. When there are multiple value columns, we can select the one we'd like to plot.

```r
map10km
plot(map10km["kommune"], key.pos = NULL)
```

Add occurrence probabilities
--------------

Here we create occurrence probabilities for all grid cells in the map, where the baseline probability (intercept) of occurrence is 0.5. In addition, each fylke has a different probability, here varying randomly (normally) with a standard deviation of 0.4, and each kommune has a varying occurrence probability in a similar fashion, here with a standard deviation of 0.1. Lastly, there is "residual" random variation among all cells, here with a standard deviation of 0.05.

We can also initiate an overall time trend; here the background probability decreases by 0.05 (5%) each year. The resulting probabilities are added in a column called `prob`, which we can then plot.

```r
mapOcc5years <- createOccProb(map10km,
                              intercept = 0.5,
                              sigmaFylke = 0.4,
                              sigmaKommune = 0.1,
                              sigmaGrid = 0.05,
                              nYears = 5,
                              interceptTrend = -0.05,
                              sigmaFylkeTrend = 0,
                              sigmaKommuneTrend = 0,
                              sortGrid = FALSE,
                              sortFylke = FALSE,
                              sortKommune = FALSE)

class(mapOcc5years)
```

```r"} mapOcc5years %>% filter(year == 1) %>% select(prob) %>% plot()

The map also contains information on area type according to the `AR5` layer. This is used later on for simulating occurrences that vary by habitat type.
```r
mapOcc5years %>%
  filter(year == 5) %>%
  select(ARTYPE) %>%
  plot()
```

