Contents

  1. Testing for Preferential Sampling in Geostatistical Datasets click here
  2. Testing for Preferential Sampling in Continuous Spatio-temporal Datasets click here
  3. Testing for Preferential Sampling in Discrete Spatial Datasets click here
  4. Testing for Preferential Sampling in Discrete Spatio-temporal Datasets click here

Testing for Preferential Sampling in Geostatistical Datasets {#section1}

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For geostatistical data, this function takes as arguments:

In addition to the above, there is the additional following additional (optional) argument:

#* `covariate_grids` - this is a list of covariate rasters, grids or pixels. They must be defined over the study region of interest and be in the same coordinate reference system or units (e.g. km or m) as the coordinates in the positions objects. These can be of class `SpatialPixelsDataFrame`, `SpatialGridDataFrame` (`sp` package), `sf` (`sf` package), or `im` (`spatstat` package). Covariates can be added later in the `PSTestRun` function.
library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## generate point process data

# Define the intensity function
int_function <- function(x,y){100*((x+y)^2)}
int_function_neg <- function(x,y){100*(((x+y-2)^2))}

spat_dat <- rpoispp(lambda = int_function, win=owin(c(0,1),c(0,1)))

spat_dat$marks <- (rnorm(n=spat_dat$n ,mean=int_function(x=spat_dat$x, y=spat_dat$y), sd= 100) / 100 )

spat_dat$marks <- spat_dat$marks - min(spat_dat$marks)

st_dat <- rpoispp(int_function, win=owin(c(0,1),c(0,1)), nsim = 2)

# define SpatialPolygon for the owin object (tests maptools loaded)
window_poly <- as(owin(c(0,1),c(0,1)), 'SpatialPolygons')

# Define SpatialPixels across window
window_pixels <- makegrid(window_poly, n=10000)
window_pixels <- SpatialPixels(points = SpatialPoints(as.matrix(window_pixels) ))

#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called window_poly of class SpatialPolygons defining our study region. Suppose we also have a marked spatial point pattern of class ppp called spat_dat. The marks here correspond to noisy observations of some latent process which we aim to study. Unfortunately, however, it is feared that samples were preferentially sampled in regions of the study area where the process was expected to be highest. Thus we will show how one can test for this.

First, we plot the point pattern and the marks as a bubble plot:

class(spat_dat)
plot(spat_dat)
class(window_poly)

Here, we see an increased sampling intensity in the top right hand corner of the study region (as defined by the square polygon). The size of the marks, as indicated by the radius of the plotted bubbles, are also larger in this region. This is clear evidence of positive preferential sampling.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit. We specify that we want 10,000 points to be uniformly placed within the study region defined by window_poly. We will use these locations later for predicting our latent field/process and covariates over.

test_spat_dat <- PSTestInit(type='spatial', discrete = F,     
                            positions=spat_dat, poly=window_poly,   
                            n_prediction = 10000)
names(test_spat_dat)

The object test_spat_dat contains everything that is needed to fit the test. Of importance are the following two named objects within the test_spat_dat:

Once the desired covariates and latent effect(s) have been estimated or projected onto the 10,000 points, they can be appended to the prediction_grid as a SpatialPixelsDataFrame. This is a suitable file format for the test.

Returning to our example, for the time being we ignore the potential preferential sampling and fit a spatial regression model to the marks. We find a significant effect of a covariate called 'cov', and capture the residual spatial correlation with a Gaussian process. This Gaussian process is our estimate of the latent effect. Next, we obtain the covariate values and estimate the latent effect across the 10,000 points. We store these estimates in vectors cov and latent_effect. To convert cov and latent_effect to SpatialPixelsDataFrame objects, simply do the following:

n_pred <- length(test_spat_dat$prediction_df$x)
cov <- runif(n_pred)
latent_effect <- int_function(x=test_spat_dat$prediction_df$x,
                     y=test_spat_dat$prediction_df$y)
cov2 <- latent_effect + rnorm(n=n_pred, mean = 0, sd=100)
cov2 <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                               data=data.frame(cov2=cov2))
layout(mat=matrix(c(1,2),nrow=1,ncol = 2))
# cov1 and cov2 are numeric vectors of length 10,000
cov <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                               data=data.frame(cov=cov))
latent_effect <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                         data=data.frame(latent_effect=latent_effect))
plot(cov)
plot(latent_effect)
layout(mat=matrix(c(1),nrow=1,ncol = 1))

Conducting the test

Now we can run the test using the PSTestRun function. For geostatistical data, this takes arguments:

In addition to the above arguments, for geostatistical data there are additional (optional) arguments that can be provided:

Now we show how to fit the test. First, we fit the pointwise tests without covariate cov. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL to indicate that a homogeneous poisson process is desired for capturing our sampling process under the null.

spat_test_1 <- PSTestRun(test_spat_dat, formula = ~ 1, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = F)

spat_test_1

Here it is apparent that the pointwise tests reject the null hypothesis of no assocation between the latent effect and the intensity of the point pattern in favour of the alternative, that the intensity depends monotonically on the value of the latent effect. The test is able to reject the null at the 5% significance level with only 19 Monte Carlo simulations. As is the case with any analysis, problems of multiple testing should be considered. Note that the row of printed text log(lambda) provides quantities from the fitted model including point estimates, standard errors etc., for the value of the intercept in the log intensity.

Next, we repeat the test, but this time compute a simultaneous confidence band by adding the argument simultaneous = T. This test will no longer suffer from the problem of multiple testing.

spat_test_2 <- PSTestRun(test_spat_dat, formula = ~ 1, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_2$critical_deviance
spat_test_2$global_test_NN
#spat_test_2$pointwise_tests

Here it is shown in the plot that the observed rank correlation lies outside the Monte Carlo rank envelope (i.e. simultaneous confidence band) for all values of K nearest neighbours tested. Thus we reject the null hypothesis at the 5% level once again. The critical value of the test is returned as the named entry critical_deviance. If the absolute value of the observed rank correlation exceeds this value at any value of K, then the test is rejected. For convenience, the result of the test for each value of K is returned as the vector of logicals global_test_NN. A value of TRUE at any of the K values denotes the test is rejected.

Now we repeat the same test, but this time include the covariate cov. Remember, cov is the SpatialPixelsDataFrame object created earlier. To include cov in our null model, we create a named list containing all the covariates we desire to include in our model. Then we add this list as an argument to the function PSTestRun. Finally, we add the desired covariates, in their desired functional form to the right hand side of the formula argument. Note that higher order terms and interactions can be added as per usual.

covariates_list <- list(cov = cov)

spat_test_3 <- PSTestRun(test_spat_dat, formula = ~ cov, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         covariates = covariates_list,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_3$critical_deviance
spat_test_3$global_test_NN
#spat_test_3$pointwise_tests

Thus, once again the test is rejected at the 5% significance level. Thus, while a statistically significant effect of covariate cov was found (point estimate 0.738 (SE 0.31)), it fails to explain away the preferential sampling.

Finally, suppose after further research, a second covariate 'cov2' is found to be strongly associated with the marks. Furthermore there is a plausible reason for 'cov2' affecting the choice of sampling locations. The latent effect is once again estimated, this time after controlling for the effect of cov2. We now also include this as a covariate cov2 in the test.

latent_effect2 <- latent_effect
covariates_list2 <- list(cov = cov,
                        cov2=cov2)

spat_test_4 <- PSTestRun(test_spat_dat, formula = ~ cov + cov2, 
                         interaction = NULL, 
                         latent_effect = latent_effect2,
                         covariates = covariates_list2,
                         residual_tests=F, M=100, no_nn = 5,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T,
                         global_alpha = 0.05)

spat_test_4$critical_deviance
spat_test_4$global_test_NN
#spat_test_4$pointwise_tests

As before with cov, a statistically significant linear effect of cov2 is found on the log intensity of the point pattern. The simultaneous test at the 5% significance level rejects the null hypothesis for K > 3. Thus we find insufficient evidence that the covariate cov2 has sufficiently explained away the preferential sampling.

If we were to look at the results from the pointwise tests, the test would have been rejected at the 5% significance level for each value of K<5 tested. However, once again, the issue of multiple testing must be considered and thus the significance level of the pointwise tests are only valid if the value of K is chosen beforehand. A single pointwise test can have a higher power to detect PS than the simultaneous test if a 'good' value of K is chosen beforehand.

Testing for Preferential Sampling in Continuous-space Spatio-temporal Datasets {#section2}

Data Preparation

Next we give an example of how to test for preferential sampling in continuous spatio-temporal data. Following on from the previous example, suppose two follow-up studies were performed in the following two years. These had n_2 and n_3 locations chosen independently according to two different 'sampling protocols'. Suppose the datasets appear as objects of class ppp called spat_dat2 and spat_dat3.

spat_dat2 <- rpoispp(lambda = 100, win=owin(c(0,1),c(0,1)))

spat_dat2$marks <- (rnorm(n=spat_dat2$n ,mean=int_function(x=spat_dat2$x, y=spat_dat2$y), sd= 100) / 100 )

spat_dat2$marks <- spat_dat2$marks - min(spat_dat2$marks)

plot(spat_dat2)

spat_dat3 <- rmh.default(model = list(
                                      cif="hardcore",
                                      par=list(beta=1,
                                               hc=0.025),
                                      w=owin(c(0,1),c(0,1)),
                                      trend=int_function_neg))
  #rpoispp(lambda = int_function_neg, win=owin(c(0,1),c(0,1)))

spat_dat3$marks <- (rnorm(n=spat_dat3$n ,mean=int_function(x=spat_dat3$x, y=spat_dat3$y), sd= 100) / 100 )

spat_dat3$marks <- spat_dat3$marks - min(spat_dat3$marks)

plot(spat_dat3)

This time, the n_2 sampling locations of the first follow-up survey at which to record (noisy) values of the latent process under study appear to be more uniformly placed throughout the study region than before. Furthermore, no clear associations between the intensity of the observation locations and the recorded values (the marks) exist.

On the other hand, the n_3 sampling locations of the second follow-up survey at which to record (noisy) values of the latent process under study appear to be placed at a higher intensity in the bottom left corner of the study region. A, clear negative association is seen between the intensity of the observation locations and the recorded values (the marks). Thus it appears that negative preferential sampling may have occured.

The goal of this analysis is to see if the follow-up studies were preferentially sampled.

First, we must format the data into the correct form required for the PStestR package. This is achieved once again by the function PSTestInit. Along with the positions argument, a times argument is now also required for specifying which discrete time period each observation belongs to. Furthermore, the covariates and formula arguments may be unique for each time period, or fixed across all time periods.

For spatio-temporal data, the PSTestInit function is used similarly as before with the geostatistical data. The major differences here are with the following arguments:

#In addition to the above, the following optional argument also changes:

#* `covariate_grids` - this is now either a list of covariate rasters, grids or pixels as before, or a named list of lists. In the latter, each list must contain a set of covariate rasters, grids or pixels for a specific time period. The names of the list entries must correspond to the unique values in `times`.

Thus returning to our example, for the positions argument we create a data.frame with named elements x and y for both spat_dat2 and spat_dat3. Next, we merge the two data frames. Finally, to create times, we create a vector of length n_2 + n_3 with the first n_2 entries set to 2 and the remaining n_3 entries set equal to 3. Then, we plug these into the PSTestInit function.

positions_merged <- data.frame(x = c(spat_dat2$x, spat_dat3$x),
                               y = c(spat_dat2$y, spat_dat3$y))
times_vec <- c( rep(2, times = spat_dat2$n),
                rep(3, times = spat_dat3$n) )
test_st_dat <- PSTestInit(type='spacetime',
                              positions = positions_merged,
                              times = times_vec,
                              discrete = F,     
                              poly=window_poly,   
                              n_prediction = 10000) 
names(test_st_dat)

Now we are ready to conduct the test.

Conducting the test

Now we can run the test using the PSTestRun function. The major differences between spatio-temporal data and geostatistical data are the arguments:

As before, there are additional (optional) arguments that can be provided.

Now we show how to fit the tests. First, we fit the pointwise tests without any covariates. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL to indicate that a homogeneous poisson process for each time/study is desired for capturing our sampling processes under the null. The latent effect is assumed constant across time and is estimated separately using the data from both surveys.

latent_effect3 <- latent_effect
st_test <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = F, simultaneous = F)

st_test

Thus the pointwise tests reject the null hypothesis of no preferential sampling at the 5% signficance level for the second follow-up dataset for K = 3 and K = 10 and fail to reject the null hypothesis for any K for the first follow-up dataset. Next, we perform the simultaneous test, which computes the simultaneous confidence band over all values of K and over all times.

st_test2 <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = T, simultaneous = T)

names(st_test2)

st_test2$critical_deviance

The plot above is presented as follows. On the y-axis are the number of nearest neighbours K, over which the distances have been averaged over. On the x-axis is time, indicating the different follow-up studies in our example. The numbers in the white boxes are the observed rank correlations in the data for each value of K and time. Finally, the colour behind the white boxes denotes the value of the observed rank correlation, with no colour indicating that the observed rank correlation falls within the Monte Carlo approximated simultaneous confidence band.

Thus, for our example it is apparent that the null is rejected only for the second follow-up dataset, for K = 10. These are the only cells with colour behind them. Thus we fail to find evidence of preferential sampling in the first follow-up study and find evidence at the 5% significance level that PS did occur in the second follow-up study.

Things to consider when using the simultaneous tests

Each of the test statistics should be of similar variability, otherwise the studentized MAD envelope test or the directional quantile MAD envelope test should be used (see Mrkvicka et.al 2017). This is because the power to detect a departure from the null could be severely diminished if a specific test statistic is very highly variable relative to the others.

Although the PStestR package doesn't currently implement these modified envelope tests, by adding an argument return_rho_vals=T to PSTestRun, the Monte Carlo rho values will be returned, allowing such tests to be computed manually. The Monte Carlo rho values are stored in the entry Monte_Carlo_rho_values and the observed rho values in the data are stored in test_rho

manual_test <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = T, simultaneous = T,
                     return_rho_vals = T)

names(manual_test)

Testing for Preferential Sampling in Discrete Spatial Datasets {#section3}

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For discrete spatial data, this function takes as arguments:

In addition to the above, the following are additional (optional) arguments to call:

library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## load areal data
discrete_data <- PStestR:::sample_discrete_data#readRDS('../data/sample_discrete_data.rds')
list2env(discrete_data,.GlobalEnv)
#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called areal_polygons of class SpatialPolygons defining our population of areal units. Suppose we also have a vector of integers called PS_indices giving the indices of the chosen areal polygons. Finally, suppose we have an object called PS_data of class SpatialPolygonsDataFrame. This contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.

First, we plot selected areal units, and fill the colour based on the value of the response:

class(areal_polygons)
class(PS_data)
plot(areal_polygons)
sp::spplot(PS_data)

Here, we see a potential clustering of areal units in the South Western corner. The values of the latent field appear especially high here. Thus positive PS may be present.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit.

test_spat_dat <- PSTestInit(type='spatial', discrete = T,
                            areal_polygons = areal_polygons,
                            areal_poly_observations = PS_indices)
names(test_spat_dat)

The object test_spat_dat contains everything that is needed to fit the test. Of importance is the following named object within the test_spat_dat:

Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df object. This is a suitable file format for the test.

Returning to our example, for the time being we ignore the potential preferential sampling and fit a discrete spatial regression model. No covariate effect is found here. We append the estimates of the latent effect, stored in the object latent_effect, to the prediction_df object.

latent_effect <- complete_data$Y
test_spat_dat$prediction_df$latent_effect <- latent_effect

Conducting the test

Now we can run the test using the PSTestRun function. For spatial data, this takes arguments:

In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:

Now we show how to fit the test. First, we fit the pointwise tests. Here the formula is specified as R ~ 1 for the null model (assuming no PS). Note that R must always be set as the response variable.

spat_test_1 <- PSTestRun(test_spat_dat, formula = R ~ 1,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

spat_test_1

Here we see evidence that the data were indeed preferentially sampled.

spat_test_12 <- PSTestRun(test_spat_dat, formula = R ~ 1,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_12$global_test_NN
spat_test_12$pointwise_empirical_pvalues

Here, the simultaneous test also finds evidence of PS for all K > 1. This simultaneous test does not suffer the problems of multiple testing.

Now suppose that we find a covariate cov that we believe to be associated with both the sampling process, and with the latent effect. We can include this in the sampling model to see if this explains away the PS. First we need to append cov, of class vector, to the prediction_df object, and then update the formula.

cov <- latent_effect + rnorm(length(latent_effect), mean=0, sd=2)
test_spat_dat$prediction_df$cov <- cov

spat_test_2 <- PSTestRun(test_spat_dat, formula = R ~ cov,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

spat_test_2

Whilst the covariate cov was indeed found to be significantly associated with the binary selection process, the evidence of preferential sampling remains. Thus, we have failed to find a sufficient set of covariates to adequately explain away the PS we see in the data.

Testing for Preferential Sampling in Discrete Spatio-temporal Datasets {#section4}

For discrete spatio-temporal data, the process is very similar.

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For discrete spatiotemporal data, this function takes as arguments:

In addition to the above, the following are additional (optional) arguments to call:

library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## load areal data
discrete_st_data <- PStestR:::sample_st_discrete_data#readRDS('../data/sample_st_discrete_data.rds')
list2env(discrete_st_data,.GlobalEnv)
#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called areal_polygons of class SpatialPolygons defining our population of areal units. Suppose we also have a vector called PS_indices giving the indices of the chosen areal polygons. Suppose we also have a vector called times taking integer values. Each entry indicates the observed time step corresponding to each entry of PS_indices. Finally, suppose we have an object called PS_data of class list. Each element of the list is of class SpatialPolygonsDataFrame and contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. The length of the list PS_data equals the number of unique entries in times.

Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.

First, we plot selected areal units at each time step and fill the colour based on the value of the response. Note that 5 time steps were observed.

par(mfrow=c(3,2))
class(areal_polygons)
class(PS_data)
class(PS_data[[1]])
plot(areal_polygons)
sp::spplot(PS_data[[1]])
sp::spplot(PS_data[[2]])
sp::spplot(PS_data[[3]])
sp::spplot(PS_data[[4]])
sp::spplot(PS_data[[5]])
par(mfrow=c(1,1))

For the first four time steps, we see a potential clustering of areal units where the latent field takes large values. Thus positive PS may be present. In the final time step, no obvious clustering can be seen.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit.

test_st_dat <- PSTestInit(type='spacetime', discrete = T,
                            areal_polygons = areal_polygons,
                            areal_poly_observations = PS_indices,
                            times = times)
names(test_st_dat)

The object test_st_dat contains everything that is needed to fit the test. Of importance is the following named object within test_st_dat:

Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df object. This is a suitable file format for the test.

Returning to our example, for the time being we ignore the potential preferential sampling and fit a discrete spatial regression model. No covariate effect is found here. We append the estimates of the latent effect, stored in the object latent_effect, of class numeric, to the prediction_df object.

latent_effect <- rep(NA, times=length(test_st_dat$prediction_df$t))
latent_effect[test_st_dat$prediction_df$t==1] <- complete_data[[1]]$Y
latent_effect[test_st_dat$prediction_df$t==2] <- complete_data[[2]]$Y
latent_effect[test_st_dat$prediction_df$t==3] <- complete_data[[3]]$Y
latent_effect[test_st_dat$prediction_df$t==4] <- complete_data[[4]]$Y
latent_effect[test_st_dat$prediction_df$t==5] <- complete_data[[5]]$Y
test_st_dat$prediction_df$latent_effect <- latent_effect

Conducting the test

Now we can run the test using the PSTestRun function. For spatial data, this takes arguments:

In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:

Now we show how to fit the test. First, we fit the pointwise tests. Here the formula is specified as R ~ 1 for the null model (assuming no PS). Note that R must always be set as the response variable.

st_test_1 <- PSTestRun(test_st_dat, formula = R ~ 1,
                         covariates = test_st_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

st_test_1

Here we see evidence that the data were indeed preferentially sampled for the first 4 times. No evidence of PS is found for the final year.

Now suppose that we find a covariate cov that we believe to be associated with both the sampling process, and with the latent effect for the first year. Suppose we are unable to find covariates for any of the other years. We can include this in the sampling model to see if this explains away the PS. First we need to append cov to the prediction_df object, and then update the formula. The formula object should now be a named list of length 5. Each entry of the list specifies the null sampling model. The names of the entries corresponds to the value of the time step. The first formula will contain the variable cov, while the remaining 4 formulae will be intercept-only.

cov <- complete_data[[1]]$Y + rnorm(length(complete_data[[1]]$Y), mean=0, sd=0)
test_st_dat$prediction_df$cov <- NA
test_st_dat$prediction_df$cov[test_st_dat$prediction_df$t==1] <- cov

formula_st2 <- list('1' = R ~ cov,
                   '2' = R ~ 1, '3' = R ~ 1, '4' = R ~ 1, '5' = R ~ 1)

st_test_2 <- PSTestRun(test_st_dat, formula = formula_st2,
                         covariates = test_st_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

st_test_2

Here the covariate cov was indeed found to be significantly associated with the binary selection process for the first year. Furthermore, the evidence of preferential sampling has dissapeared with the inclusion of cov in the null sampling model. Thus, we have found a sufficient set of covariates to adequately explain away the PS we see in the data for the first time step. We could now repeat the process for the remaining 4 years.



joenomiddlename/PStestR documentation built on March 26, 2022, 8:17 a.m.