In joenomiddlename/PStestR: Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data

Testing for Preferential Sampling in Geostatistical Datasets click here
Testing for Preferential Sampling in Continuous Spatio-temporal Datasets click here
Testing for Preferential Sampling in Discrete Spatial Datasets click here
Testing for Preferential Sampling in Discrete Spatio-temporal Datasets click here

Testing for Preferential Sampling in Geostatistical Datasets {#section1}

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For geostatistical data, this function takes as arguments:

type - this takes a character string. For geostatistical data, simply set type = 'spatial'
discrete - this takes a logical value. Here, as we are in the geostatistical (i.e. point-referenced) setting we set discrete = F
positions - this takes either an object of class ppp or else it takes a data.frame with named values x and y. These give the coordinates of the observed points
poly - this takes a polygon-style object defining the boundaries of the study region. This can be of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package).

In addition to the above, there is the additional following additional (optional) argument:

n_prediction - this specifies the number of points to be generated within the study region. The coordinates of these points are returned for predicting the latent effect over, and for evaluating the covariates for later use with fitting the point processes.

#* `covariate_grids` - this is a list of covariate rasters, grids or pixels. They must be defined over the study region of interest and be in the same coordinate reference system or units (e.g. km or m) as the coordinates in the positions objects. These can be of class `SpatialPixelsDataFrame`, `SpatialGridDataFrame` (`sp` package), `sf` (`sf` package), or `im` (`spatstat` package). Covariates can be added later in the `PSTestRun` function.

library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## generate point process data

# Define the intensity function
int_function <- function(x,y){100*((x+y)^2)}
int_function_neg <- function(x,y){100*(((x+y-2)^2))}

spat_dat <- rpoispp(lambda = int_function, win=owin(c(0,1),c(0,1)))

spat_dat$marks <- (rnorm(n=spat_dat$n ,mean=int_function(x=spat_dat$x, y=spat_dat$y), sd= 100) / 100 )

spat_dat$marks <- spat_dat$marks - min(spat_dat$marks)

st_dat <- rpoispp(int_function, win=owin(c(0,1),c(0,1)), nsim = 2)

# define SpatialPolygon for the owin object (tests maptools loaded)
window_poly <- as(owin(c(0,1),c(0,1)), 'SpatialPolygons')

# Define SpatialPixels across window
window_pixels <- makegrid(window_poly, n=10000)
window_pixels <- SpatialPixels(points = SpatialPoints(as.matrix(window_pixels) ))

#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called window_poly of class SpatialPolygons defining our study region. Suppose we also have a marked spatial point pattern of class ppp called spat_dat. The marks here correspond to noisy observations of some latent process which we aim to study. Unfortunately, however, it is feared that samples were preferentially sampled in regions of the study area where the process was expected to be highest. Thus we will show how one can test for this.

First, we plot the point pattern and the marks as a bubble plot:

class(spat_dat)
plot(spat_dat)
class(window_poly)

Here, we see an increased sampling intensity in the top right hand corner of the study region (as defined by the square polygon). The size of the marks, as indicated by the radius of the plotted bubbles, are also larger in this region. This is clear evidence of positive preferential sampling.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit. We specify that we want 10,000 points to be uniformly placed within the study region defined by window_poly. We will use these locations later for predicting our latent field/process and covariates over.

test_spat_dat <- PSTestInit(type='spatial', discrete = F,     
                            positions=spat_dat, poly=window_poly,   
                            n_prediction = 10000)
names(test_spat_dat)

The object test_spat_dat contains everything that is needed to fit the test. Of importance are the following two named objects within the test_spat_dat:

prediction_df is a data.frame object containing the x, y coordinates of each of the 10,000 points. This can be especially useful as a prediction data frame for estimating the latent effect and covariates over
prediction_grid is a SpatialPixels object comprised of 10,000 pixels, with centroids equal to the coordinates found in prediction_df

Once the desired covariates and latent effect(s) have been estimated or projected onto the 10,000 points, they can be appended to the prediction_grid as a SpatialPixelsDataFrame. This is a suitable file format for the test.

Returning to our example, for the time being we ignore the potential preferential sampling and fit a spatial regression model to the marks. We find a significant effect of a covariate called 'cov', and capture the residual spatial correlation with a Gaussian process. This Gaussian process is our estimate of the latent effect. Next, we obtain the covariate values and estimate the latent effect across the 10,000 points. We store these estimates in vectors cov and latent_effect. To convert cov and latent_effect to SpatialPixelsDataFrame objects, simply do the following:

n_pred <- length(test_spat_dat$prediction_df$x)
cov <- runif(n_pred)
latent_effect <- int_function(x=test_spat_dat$prediction_df$x,
                     y=test_spat_dat$prediction_df$y)
cov2 <- latent_effect + rnorm(n=n_pred, mean = 0, sd=100)
cov2 <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                               data=data.frame(cov2=cov2))

layout(mat=matrix(c(1,2),nrow=1,ncol = 2))

# cov1 and cov2 are numeric vectors of length 10,000
cov <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                               data=data.frame(cov=cov))
latent_effect <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid,
                         data=data.frame(latent_effect=latent_effect))
plot(cov)
plot(latent_effect)

layout(mat=matrix(c(1),nrow=1,ncol = 1))

Conducting the test

Now we can run the test using the PSTestRun function. For geostatistical data, this takes arguments:

proc_data - this is the processed data, outputted from the PSTestInit function
covariates - A named list whose entries are images, functions, windows, tessellations or single numbers. Names of the list elements correspond to the name of the covariates
formula - this is the formula specifying the trend for the point process model. For the inhomogeneous Poisson process, this specifies the intensity function
interaction - this is an object of spatstat class interact describing the point process interaction structure. It can also be a function that makes such an object, or NULL indicating that a Poisson process (stationary or nonstationary) should be fitted. The help files in the spatstat package provide more information.
M - specifies the number of Monte Carlo samples to take for the test
no_nn - specifies the maximum number of nearest neighbours to take distances over
latent_effect - specifies the latent_effect. This can be any of the classes allowed for the covariates (including SpatialPixelsDataFrame)
simultaneous - a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.
global_alpha - when simultaneous=TRUE, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.

In addition to the above arguments, for geostatistical data there are additional (optional) arguments that can be provided:

residual_tests - takes a logical value indicating if a Monte Carlo test based on kernel smoothed estimates of the raw residuals should be computed. Note that this feature is experimental and should be used with care. When set to TRUE, the function will suffer performance drawbacks.
parallel - takes a logical value indicating whether or not the script should be run in parallel using the foreach and DoParallel packages.
ncores - if parallel=TRUE, this sets the number of cores to parallelize across.
return_plots - a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE.
PS - a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.
fix_n - a logical argument specifying if the Monte Carlo sampled point patterns should contain the same number of points as the observed point pattern.

Now we show how to fit the test. First, we fit the pointwise tests without covariate cov. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL to indicate that a homogeneous poisson process is desired for capturing our sampling process under the null.

spat_test_1 <- PSTestRun(test_spat_dat, formula = ~ 1, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = F)

spat_test_1

Here it is apparent that the pointwise tests reject the null hypothesis of no assocation between the latent effect and the intensity of the point pattern in favour of the alternative, that the intensity depends monotonically on the value of the latent effect. The test is able to reject the null at the 5% significance level with only 19 Monte Carlo simulations. As is the case with any analysis, problems of multiple testing should be considered. Note that the row of printed text log(lambda) provides quantities from the fitted model including point estimates, standard errors etc., for the value of the intercept in the log intensity.

Next, we repeat the test, but this time compute a simultaneous confidence band by adding the argument simultaneous = T. This test will no longer suffer from the problem of multiple testing.

spat_test_2 <- PSTestRun(test_spat_dat, formula = ~ 1, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_2$critical_deviance
spat_test_2$global_test_NN
#spat_test_2$pointwise_tests

Here it is shown in the plot that the observed rank correlation lies outside the Monte Carlo rank envelope (i.e. simultaneous confidence band) for all values of K nearest neighbours tested. Thus we reject the null hypothesis at the 5% level once again. The critical value of the test is returned as the named entry critical_deviance. If the absolute value of the observed rank correlation exceeds this value at any value of K, then the test is rejected. For convenience, the result of the test for each value of K is returned as the vector of logicals global_test_NN. A value of TRUE at any of the K values denotes the test is rejected.

Now we repeat the same test, but this time include the covariate cov. Remember, cov is the SpatialPixelsDataFrame object created earlier. To include cov in our null model, we create a named list containing all the covariates we desire to include in our model. Then we add this list as an argument to the function PSTestRun. Finally, we add the desired covariates, in their desired functional form to the right hand side of the formula argument. Note that higher order terms and interactions can be added as per usual.

covariates_list <- list(cov = cov)

spat_test_3 <- PSTestRun(test_spat_dat, formula = ~ cov, 
                         interaction = NULL, 
                         latent_effect = latent_effect,
                         covariates = covariates_list,
                         residual_tests=F, M=19, no_nn = 10,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_3$critical_deviance
spat_test_3$global_test_NN
#spat_test_3$pointwise_tests

Thus, once again the test is rejected at the 5% significance level. Thus, while a statistically significant effect of covariate cov was found (point estimate 0.738 (SE 0.31)), it fails to explain away the preferential sampling.

Finally, suppose after further research, a second covariate 'cov2' is found to be strongly associated with the marks. Furthermore there is a plausible reason for 'cov2' affecting the choice of sampling locations. The latent effect is once again estimated, this time after controlling for the effect of cov2. We now also include this as a covariate cov2 in the test.

latent_effect2 <- latent_effect

covariates_list2 <- list(cov = cov,
                        cov2=cov2)

spat_test_4 <- PSTestRun(test_spat_dat, formula = ~ cov + cov2, 
                         interaction = NULL, 
                         latent_effect = latent_effect2,
                         covariates = covariates_list2,
                         residual_tests=F, M=100, no_nn = 5,
                         parallel = T, ncores=1,
                         return_plots = T, simultaneous = T,
                         global_alpha = 0.05)

spat_test_4$critical_deviance
spat_test_4$global_test_NN
#spat_test_4$pointwise_tests

As before with cov, a statistically significant linear effect of cov2 is found on the log intensity of the point pattern. The simultaneous test at the 5% significance level rejects the null hypothesis for K > 3. Thus we find insufficient evidence that the covariate cov2 has sufficiently explained away the preferential sampling.

If we were to look at the results from the pointwise tests, the test would have been rejected at the 5% significance level for each value of K<5 tested. However, once again, the issue of multiple testing must be considered and thus the significance level of the pointwise tests are only valid if the value of K is chosen beforehand. A single pointwise test can have a higher power to detect PS than the simultaneous test if a 'good' value of K is chosen beforehand.

Testing for Preferential Sampling in Continuous-space Spatio-temporal Datasets {#section2}

Data Preparation

Next we give an example of how to test for preferential sampling in continuous spatio-temporal data. Following on from the previous example, suppose two follow-up studies were performed in the following two years. These had n_2 and n_3 locations chosen independently according to two different 'sampling protocols'. Suppose the datasets appear as objects of class ppp called spat_dat2 and spat_dat3.

spat_dat2 <- rpoispp(lambda = 100, win=owin(c(0,1),c(0,1)))

spat_dat2$marks <- (rnorm(n=spat_dat2$n ,mean=int_function(x=spat_dat2$x, y=spat_dat2$y), sd= 100) / 100 )

spat_dat2$marks <- spat_dat2$marks - min(spat_dat2$marks)

plot(spat_dat2)

spat_dat3 <- rmh.default(model = list(
                                      cif="hardcore",
                                      par=list(beta=1,
                                               hc=0.025),
                                      w=owin(c(0,1),c(0,1)),
                                      trend=int_function_neg))
  #rpoispp(lambda = int_function_neg, win=owin(c(0,1),c(0,1)))

spat_dat3$marks <- (rnorm(n=spat_dat3$n ,mean=int_function(x=spat_dat3$x, y=spat_dat3$y), sd= 100) / 100 )

spat_dat3$marks <- spat_dat3$marks - min(spat_dat3$marks)

plot(spat_dat3)

This time, the n_2 sampling locations of the first follow-up survey at which to record (noisy) values of the latent process under study appear to be more uniformly placed throughout the study region than before. Furthermore, no clear associations between the intensity of the observation locations and the recorded values (the marks) exist.

On the other hand, the n_3 sampling locations of the second follow-up survey at which to record (noisy) values of the latent process under study appear to be placed at a higher intensity in the bottom left corner of the study region. A, clear negative association is seen between the intensity of the observation locations and the recorded values (the marks). Thus it appears that negative preferential sampling may have occured.

The goal of this analysis is to see if the follow-up studies were preferentially sampled.

First, we must format the data into the correct form required for the PStestR package. This is achieved once again by the function PSTestInit. Along with the positions argument, a times argument is now also required for specifying which discrete time period each observation belongs to. Furthermore, the covariates and formula arguments may be unique for each time period, or fixed across all time periods.

For spatio-temporal data, the PSTestInit function is used similarly as before with the geostatistical data. The major differences here are with the following arguments:

type - For spatiotemporal data, simply set type = 'spacetime'
times - This is a vector of unique numerical entries, specifying the observation times.
positions - this takes either an object of class ppp containing all the points across all the times. Otherwise, it takes a data.frame with named values x and y, giving the coordinates of all the observed points, across all times. Thus the number of rows of positions must equal the length of times.
poly - this takes a polygon-style object defining the boundaries of the study region. This can be of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package).

#In addition to the above, the following optional argument also changes:

#* `covariate_grids` - this is now either a list of covariate rasters, grids or pixels as before, or a named list of lists. In the latter, each list must contain a set of covariate rasters, grids or pixels for a specific time period. The names of the list entries must correspond to the unique values in `times`.

Thus returning to our example, for the positions argument we create a data.frame with named elements x and y for both spat_dat2 and spat_dat3. Next, we merge the two data frames. Finally, to create times, we create a vector of length n_2 + n_3 with the first n_2 entries set to 2 and the remaining n_3 entries set equal to 3. Then, we plug these into the PSTestInit function.

positions_merged <- data.frame(x = c(spat_dat2$x, spat_dat3$x),
                               y = c(spat_dat2$y, spat_dat3$y))
times_vec <- c( rep(2, times = spat_dat2$n),
                rep(3, times = spat_dat3$n) )
test_st_dat <- PSTestInit(type='spacetime',
                              positions = positions_merged,
                              times = times_vec,
                              discrete = F,     
                              poly=window_poly,   
                              n_prediction = 10000) 
names(test_st_dat)

Now we are ready to conduct the test.

Conducting the test

Now we can run the test using the PSTestRun function. The major differences between spatio-temporal data and geostatistical data are the arguments:

formula - this is now either a formula specifying the same trend for each of the point process models across the times, or a named list, with entries equal to formulas. If a named list is chosen, names should match the unique entries in times. Each formula then specifies the trend in the fitted point process corresponding to that time.
covariates - this is now either a list of covariate rasters, grids, pixels, or functions of the coordinates as before, or a named list of lists. In the latter, each list must contain a set of covariate rasters, grids etc., for a specific time period. The names of the list entries must correspond to the unique values in times.
latent_effect - specifies the latent_effect estimated for each time period. As with covariates, either this can be a single allowed object (e.g. a SpatialPixelsDataFrame), or a named list of such entries. In the latter case, each list must contain a suitable data type containing the estimated latent effect for that specific time period
simultaneous - a logical stating whether the test should be computed with a simultaneous confidence band estimated. This simultaneous test is now computed over all K values and all times.

As before, there are additional (optional) arguments that can be provided.

Now we show how to fit the tests. First, we fit the pointwise tests without any covariates. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL to indicate that a homogeneous poisson process for each time/study is desired for capturing our sampling processes under the null. The latent effect is assumed constant across time and is estimated separately using the data from both surveys.

latent_effect3 <- latent_effect

st_test <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = F, simultaneous = F)

st_test

Thus the pointwise tests reject the null hypothesis of no preferential sampling at the 5% signficance level for the second follow-up dataset for K = 3 and K = 10 and fail to reject the null hypothesis for any K for the first follow-up dataset. Next, we perform the simultaneous test, which computes the simultaneous confidence band over all values of K and over all times.

st_test2 <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = T, simultaneous = T)

names(st_test2)

st_test2$critical_deviance

The plot above is presented as follows. On the y-axis are the number of nearest neighbours K, over which the distances have been averaged over. On the x-axis is time, indicating the different follow-up studies in our example. The numbers in the white boxes are the observed rank correlations in the data for each value of K and time. Finally, the colour behind the white boxes denotes the value of the observed rank correlation, with no colour indicating that the observed rank correlation falls within the Monte Carlo approximated simultaneous confidence band.

Thus, for our example it is apparent that the null is rejected only for the second follow-up dataset, for K = 10. These are the only cells with colour behind them. Thus we fail to find evidence of preferential sampling in the first follow-up study and find evidence at the 5% significance level that PS did occur in the second follow-up study.

Things to consider when using the simultaneous tests

Each of the test statistics should be of similar variability, otherwise the studentized MAD envelope test or the directional quantile MAD envelope test should be used (see Mrkvicka et.al 2017). This is because the power to detect a departure from the null could be severely diminished if a specific test statistic is very highly variable relative to the others.

Although the PStestR package doesn't currently implement these modified envelope tests, by adding an argument return_rho_vals=T to PSTestRun, the Monte Carlo rho values will be returned, allowing such tests to be computed manually. The Monte Carlo rho values are stored in the entry Monte_Carlo_rho_values and the observed rho values in the data are stored in test_rho

manual_test <- PSTestRun(test_st_dat, formula = ~ 1, 
                     interaction = NULL, 
                     latent_effect = latent_effect,
                     residual_tests=F, M=19, no_nn = 10,
                     parallel = T, ncores=1,
                     return_plots = T, simultaneous = T,
                     return_rho_vals = T)

names(manual_test)

Testing for Preferential Sampling in Discrete Spatial Datasets {#section3}

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For discrete spatial data, this function takes as arguments:

type - this takes a character string. For discrete spatial data, simply set type = 'spatial'
discrete - this takes a logical value. Here, as we are in the discrete (i.e. areal) setting we set discrete = T
areal_polygons - this takes an object of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package), giving the Polygons of the population of areal units
areal_poly_observations - this is a vector of indices denoting the polygons with complete data.

In addition to the above, the following are additional (optional) arguments to call:

discrete_locations provides the locations (e.g. centroids) of the population of areal units. This is an alternative specification which avoids the need to specify the polygons.

library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## load areal data
discrete_data <- PStestR:::sample_discrete_data#readRDS('../data/sample_discrete_data.rds')
list2env(discrete_data,.GlobalEnv)
#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called areal_polygons of class SpatialPolygons defining our population of areal units. Suppose we also have a vector of integers called PS_indices giving the indices of the chosen areal polygons. Finally, suppose we have an object called PS_data of class SpatialPolygonsDataFrame. This contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.

First, we plot selected areal units, and fill the colour based on the value of the response:

class(areal_polygons)
class(PS_data)
plot(areal_polygons)
sp::spplot(PS_data)

Here, we see a potential clustering of areal units in the South Western corner. The values of the latent field appear especially high here. Thus positive PS may be present.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit.

test_spat_dat <- PSTestInit(type='spatial', discrete = T,
                            areal_polygons = areal_polygons,
                            areal_poly_observations = PS_indices)
names(test_spat_dat)

The object test_spat_dat contains everything that is needed to fit the test. Of importance is the following named object within the test_spat_dat:

prediction_df is a data.frame object containing the x, y coordinates of each of the areal polygon centroids. If a different measure of areal unit center is desired then this can be specified with the argument discrete_locations instead (see above). We will need to append our estimates of the latent effect to this data frame.

Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df object. This is a suitable file format for the test.

Returning to our example, for the time being we ignore the potential preferential sampling and fit a discrete spatial regression model. No covariate effect is found here. We append the estimates of the latent effect, stored in the object latent_effect, to the prediction_df object.

latent_effect <- complete_data$Y

test_spat_dat$prediction_df$latent_effect <- latent_effect

Conducting the test

Now we can run the test using the PSTestRun function. For spatial data, this takes arguments:

proc_data - this is the processed data, outputted from the PSTestInit function
covariates - A data.frame object with number of rows equal to the number of areal units.
latent_effect contains the vector of predicted latent effect values of length equal to the number of areal polygons in the population
formula - this is the formula specifying the trend for the binary sampling model. R must be specified as the response variable (i.e. on the LHS of the ~).
M - specifies the number of Monte Carlo samples to take for the test
no_nn - specifies the maximum number of nearest neighbours to take distances over
simultaneous - a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.
global_alpha - when simultaneous=TRUE, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.

In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:

parallel - takes a logical value indicating whether or not the script should be run in parallel using the foreach and DoParallel packages.
ncores - if parallel=TRUE, this sets the number of cores to parallelize across.
return_plots - a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE.
PS - a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.
fix_n - a logical argument specifying if the Monte Carlo sampled binary indicators should sum to match the observed numbers of areal units with complete data.

Now we show how to fit the test. First, we fit the pointwise tests. Here the formula is specified as R ~ 1 for the null model (assuming no PS). Note that R must always be set as the response variable.

spat_test_1 <- PSTestRun(test_spat_dat, formula = R ~ 1,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

spat_test_1

Here we see evidence that the data were indeed preferentially sampled.

spat_test_12 <- PSTestRun(test_spat_dat, formula = R ~ 1,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = T, simultaneous = T)

spat_test_12$global_test_NN
spat_test_12$pointwise_empirical_pvalues

Here, the simultaneous test also finds evidence of PS for all K > 1. This simultaneous test does not suffer the problems of multiple testing.

Now suppose that we find a covariate cov that we believe to be associated with both the sampling process, and with the latent effect. We can include this in the sampling model to see if this explains away the PS. First we need to append cov, of class vector, to the prediction_df object, and then update the formula.

cov <- latent_effect + rnorm(length(latent_effect), mean=0, sd=2)

test_spat_dat$prediction_df$cov <- cov

spat_test_2 <- PSTestRun(test_spat_dat, formula = R ~ cov,
                         covariates = test_spat_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

spat_test_2

Whilst the covariate cov was indeed found to be significantly associated with the binary selection process, the evidence of preferential sampling remains. Thus, we have failed to find a sufficient set of covariates to adequately explain away the PS we see in the data.

Testing for Preferential Sampling in Discrete Spatio-temporal Datasets {#section4}

For discrete spatio-temporal data, the process is very similar.

Data Preparation

Setting up the data is made simple with the helper function PSTestInit. For discrete spatiotemporal data, this function takes as arguments:

type - this takes a character string. For discrete spatial data, simply set type = 'spacetime'
discrete - this takes a logical value. Here, as we are in the discrete (i.e. areal) setting we set discrete = T
areal_polygons - this takes an object of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package), giving the Polygons of the population of areal units
areal_poly_observations - this is a vector of indices denoting the polygons with complete data.
times - this is a vector of observation times

In addition to the above, the following are additional (optional) arguments to call:

discrete_locations provides the locations (e.g. centroids) of the population of areal units. This is an alternative specification which avoids the need to specify the polygons.

library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(spatstat)
library(doParallel)
library(foreach)
library(ggplot2)
library(mgcv)
library(PStestR)

# set seed
set.seed(12345)

## load areal data
discrete_st_data <- PStestR:::sample_st_discrete_data#readRDS('../data/sample_st_discrete_data.rds')
list2env(discrete_st_data,.GlobalEnv)
#source('PSTestInit')

Let's show an example of how to initialise data, suitable for use with PSTestInit. Suppose that we have an object called areal_polygons of class SpatialPolygons defining our population of areal units. Suppose we also have a vector called PS_indices giving the indices of the chosen areal polygons. Suppose we also have a vector called times taking integer values. Each entry indicates the observed time step corresponding to each entry of PS_indices. Finally, suppose we have an object called PS_data of class list. Each element of the list is of class SpatialPolygonsDataFrame and contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. The length of the list PS_data equals the number of unique entries in times.

Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.

First, we plot selected areal units at each time step and fill the colour based on the value of the response. Note that 5 time steps were observed.

par(mfrow=c(3,2))

class(areal_polygons)
class(PS_data)
class(PS_data[[1]])
plot(areal_polygons)
sp::spplot(PS_data[[1]])
sp::spplot(PS_data[[2]])
sp::spplot(PS_data[[3]])
sp::spplot(PS_data[[4]])
sp::spplot(PS_data[[5]])

par(mfrow=c(1,1))

For the first four time steps, we see a potential clustering of areal units where the latent field takes large values. Thus positive PS may be present. In the final time step, no obvious clustering can be seen.

Now we move onto creating an object suitable for the PStestR package by running PSTestInit.

test_st_dat <- PSTestInit(type='spacetime', discrete = T,
                            areal_polygons = areal_polygons,
                            areal_poly_observations = PS_indices,
                            times = times)
names(test_st_dat)

The object test_st_dat contains everything that is needed to fit the test. Of importance is the following named object within test_st_dat:

prediction_df is a data.frame object containing the x, y coordinates of each of the areal polygon centroids. If a different measure of areal unit center is desired then this can be specified with the argument discrete_locations instead (see above). We will need to append our estimates of the latent effect to this data frame.

Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df object. This is a suitable file format for the test.

latent_effect <- rep(NA, times=length(test_st_dat$prediction_df$t))
latent_effect[test_st_dat$prediction_df$t==1] <- complete_data[[1]]$Y
latent_effect[test_st_dat$prediction_df$t==2] <- complete_data[[2]]$Y
latent_effect[test_st_dat$prediction_df$t==3] <- complete_data[[3]]$Y
latent_effect[test_st_dat$prediction_df$t==4] <- complete_data[[4]]$Y
latent_effect[test_st_dat$prediction_df$t==5] <- complete_data[[5]]$Y

test_st_dat$prediction_df$latent_effect <- latent_effect

Conducting the test

Now we can run the test using the PSTestRun function. For spatial data, this takes arguments:

proc_data - this is the processed data, outputted from the PSTestInit function
covariates - A data.frame object with number of rows equal to the number of areal units multiplied by the number of unique entries in times.
latent_effect contains the vector of predicted latent effect values of length equal to the number of areal polygons in the population multiplied by the number of unique entries in times.
formula - this is the formula specifying the trend for the binary sampling model.
M - specifies the number of Monte Carlo samples to take for the test
no_nn - specifies the maximum number of nearest neighbours to take distances over
simultaneous - a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.
global_alpha - when simultaneous=TRUE, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.

In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:

parallel - takes a logical value indicating whether or not the script should be run in parallel using the foreach and DoParallel packages.
ncores - if parallel=TRUE, this sets the number of cores to parallelize across.
return_plots - a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE.
PS - a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.
fix_n - a logical argument specifying if the Monte Carlo sampled binary indicators should sum to match the observed numbers of areal units with complete data.

st_test_1 <- PSTestRun(test_st_dat, formula = R ~ 1,
                         covariates = test_st_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

st_test_1

Here we see evidence that the data were indeed preferentially sampled for the first 4 times. No evidence of PS is found for the final year.

Now suppose that we find a covariate cov that we believe to be associated with both the sampling process, and with the latent effect for the first year. Suppose we are unable to find covariates for any of the other years. We can include this in the sampling model to see if this explains away the PS. First we need to append cov to the prediction_df object, and then update the formula. The formula object should now be a named list of length 5. Each entry of the list specifies the null sampling model. The names of the entries corresponds to the value of the time step. The first formula will contain the variable cov, while the remaining 4 formulae will be intercept-only.

cov <- complete_data[[1]]$Y + rnorm(length(complete_data[[1]]$Y), mean=0, sd=0)

test_st_dat$prediction_df$cov <- NA
test_st_dat$prediction_df$cov[test_st_dat$prediction_df$t==1] <- cov

formula_st2 <- list('1' = R ~ cov,
                   '2' = R ~ 1, '3' = R ~ 1, '4' = R ~ 1, '5' = R ~ 1)

st_test_2 <- PSTestRun(test_st_dat, formula = formula_st2,
                         covariates = test_st_dat$prediction_df,
                         latent_effect = latent_effect,
                         interaction = NULL, M=19, no_nn = 10,
                         parallel = F, ncores=1,
                         return_plots = F)

st_test_2

Here the covariate cov was indeed found to be significantly associated with the binary selection process for the first year. Furthermore, the evidence of preferential sampling has dissapeared with the inclusion of cov in the null sampling model. Thus, we have found a sufficient set of covariates to adequately explain away the PS we see in the data for the first time step. We could now repeat the process for the remaining 4 years.

joenomiddlename/PStestR documentation built on March 26, 2022, 8:17 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

joenomiddlename/PStestR
Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data

In joenomiddlename/PStestR: Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data

Contents

Testing for Preferential Sampling in Geostatistical Datasets {#section1}

Data Preparation

Conducting the test

Testing for Preferential Sampling in Continuous-space Spatio-temporal Datasets {#section2}

Data Preparation

Conducting the test

Things to consider when using the simultaneous tests

Testing for Preferential Sampling in Discrete Spatial Datasets {#section3}

Data Preparation

Conducting the test

Testing for Preferential Sampling in Discrete Spatio-temporal Datasets {#section4}

Data Preparation

Conducting the test

R Package Documentation

Browse R Packages

We want your feedback!

joenomiddlename/PStestR Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data

In joenomiddlename/PStestR: Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data

Contents

Testing for Preferential Sampling in Geostatistical Datasets {#section1}

Data Preparation

Conducting the test

Testing for Preferential Sampling in Continuous-space Spatio-temporal Datasets {#section2}

Data Preparation

Conducting the test

Things to consider when using the simultaneous tests

Testing for Preferential Sampling in Discrete Spatial Datasets {#section3}

Data Preparation

Conducting the test

Testing for Preferential Sampling in Discrete Spatio-temporal Datasets {#section4}

Data Preparation

Conducting the test

R Package Documentation

Browse R Packages

We want your feedback!

joenomiddlename/PStestR
Monte Carlo Test for Preferential Sampling in Spatio-Temporal Data