Setting up the data is made simple with the helper function PSTestInit
. For geostatistical data, this function takes as arguments:
type
- this takes a character string. For geostatistical data, simply set type = 'spatial' discrete
- this takes a logical value. Here, as we are in the geostatistical (i.e. point-referenced) setting we set discrete = Fpositions
- this takes either an object of class ppp or else it takes a data.frame with named values x and y. These give the coordinates of the observed pointspoly
- this takes a polygon-style object defining the boundaries of the study region. This can be of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package).In addition to the above, there is the additional following additional (optional) argument:
n_prediction
- this specifies the number of points to be generated within the study region. The coordinates of these points are returned for predicting the latent effect over, and for evaluating the covariates for later use with fitting the point processes.#* `covariate_grids` - this is a list of covariate rasters, grids or pixels. They must be defined over the study region of interest and be in the same coordinate reference system or units (e.g. km or m) as the coordinates in the positions objects. These can be of class `SpatialPixelsDataFrame`, `SpatialGridDataFrame` (`sp` package), `sf` (`sf` package), or `im` (`spatstat` package). Covariates can be added later in the `PSTestRun` function.
library(sp) library(maptools) library(rgeos) library(rgdal) library(spatstat) library(doParallel) library(foreach) library(ggplot2) library(mgcv) library(PStestR) # set seed set.seed(12345) ## generate point process data # Define the intensity function int_function <- function(x,y){100*((x+y)^2)} int_function_neg <- function(x,y){100*(((x+y-2)^2))} spat_dat <- rpoispp(lambda = int_function, win=owin(c(0,1),c(0,1))) spat_dat$marks <- (rnorm(n=spat_dat$n ,mean=int_function(x=spat_dat$x, y=spat_dat$y), sd= 100) / 100 ) spat_dat$marks <- spat_dat$marks - min(spat_dat$marks) st_dat <- rpoispp(int_function, win=owin(c(0,1),c(0,1)), nsim = 2) # define SpatialPolygon for the owin object (tests maptools loaded) window_poly <- as(owin(c(0,1),c(0,1)), 'SpatialPolygons') # Define SpatialPixels across window window_pixels <- makegrid(window_poly, n=10000) window_pixels <- SpatialPixels(points = SpatialPoints(as.matrix(window_pixels) )) #source('PSTestInit')
Let's show an example of how to initialise data, suitable for use with PSTestInit
. Suppose that we have an object called window_poly
of class SpatialPolygons
defining our study region. Suppose we also have a marked spatial point pattern of class ppp
called spat_dat
. The marks here correspond to noisy observations of some latent process which we aim to study. Unfortunately, however, it is feared that samples were preferentially sampled in regions of the study area where the process was expected to be highest. Thus we will show how one can test for this.
First, we plot the point pattern and the marks as a bubble plot:
class(spat_dat) plot(spat_dat) class(window_poly)
Here, we see an increased sampling intensity in the top right hand corner of the study region (as defined by the square polygon). The size of the marks, as indicated by the radius of the plotted bubbles, are also larger in this region. This is clear evidence of positive preferential sampling.
Now we move onto creating an object suitable for the PStestR
package by running PSTestInit
. We specify that we want 10,000 points to be uniformly placed within the study region defined by window_poly. We will use these locations later for predicting our latent field/process and covariates over.
test_spat_dat <- PSTestInit(type='spatial', discrete = F, positions=spat_dat, poly=window_poly, n_prediction = 10000) names(test_spat_dat)
The object test_spat_dat
contains everything that is needed to fit the test. Of importance are the following two named objects within the test_spat_dat
:
prediction_df
is a data.frame object containing the x, y coordinates of each of the 10,000 points. This can be especially useful as a prediction data frame for estimating the latent effect and covariates overprediction_grid
is a SpatialPixels
object comprised of 10,000 pixels, with centroids equal to the coordinates found in prediction_df
Once the desired covariates and latent effect(s) have been estimated or projected onto the 10,000 points, they can be appended to the prediction_grid
as a SpatialPixelsDataFrame
. This is a suitable file format for the test.
Returning to our example, for the time being we ignore the potential preferential sampling and fit a spatial regression model to the marks. We find a significant effect of a covariate called 'cov', and capture the residual spatial correlation with a Gaussian process. This Gaussian process is our estimate of the latent effect. Next, we obtain the covariate values and estimate the latent effect across the 10,000 points. We store these estimates in vectors cov
and latent_effect
. To convert cov
and latent_effect
to SpatialPixelsDataFrame
objects, simply do the following:
n_pred <- length(test_spat_dat$prediction_df$x) cov <- runif(n_pred) latent_effect <- int_function(x=test_spat_dat$prediction_df$x, y=test_spat_dat$prediction_df$y) cov2 <- latent_effect + rnorm(n=n_pred, mean = 0, sd=100) cov2 <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid, data=data.frame(cov2=cov2))
layout(mat=matrix(c(1,2),nrow=1,ncol = 2))
# cov1 and cov2 are numeric vectors of length 10,000 cov <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid, data=data.frame(cov=cov)) latent_effect <- SpatialPixelsDataFrame(test_spat_dat$prediction_grid, data=data.frame(latent_effect=latent_effect)) plot(cov) plot(latent_effect)
layout(mat=matrix(c(1),nrow=1,ncol = 1))
Now we can run the test using the PSTestRun
function. For geostatistical data, this takes arguments:
proc_data
- this is the processed data, outputted from the PSTestInit
functioncovariates
- A named list whose entries are images, functions, windows, tessellations or single numbers. Names of the list elements correspond to the name of the covariatesformula
- this is the formula specifying the trend for the point process model. For the inhomogeneous Poisson process, this specifies the intensity functioninteraction
- this is an object of spatstat
class interact
describing the point process interaction structure. It can also be a function that makes such an object, or NULL
indicating that a Poisson process (stationary or nonstationary) should be fitted. The help files in the spatstat
package provide more information.M
- specifies the number of Monte Carlo samples to take for the testno_nn
- specifies the maximum number of nearest neighbours to take distances overlatent_effect
- specifies the latent_effect. This can be any of the classes allowed for the covariates (including SpatialPixelsDataFrame
)simultaneous
- a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE
, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE
, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.global_alpha
- when simultaneous=TRUE
, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.In addition to the above arguments, for geostatistical data there are additional (optional) arguments that can be provided:
residual_tests
- takes a logical value indicating if a Monte Carlo test based on kernel smoothed estimates of the raw residuals should be computed. Note that this feature is experimental and should be used with care. When set to TRUE
, the function will suffer performance drawbacks.parallel
- takes a logical value indicating whether or not the script should be run in parallel using the foreach
and DoParallel
packages. ncores
- if parallel=TRUE
, this sets the number of cores to parallelize across.return_plots
- a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE
. PS
- a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.fix_n
- a logical argument specifying if the Monte Carlo sampled point patterns should contain the same number of points as the observed point pattern.Now we show how to fit the test. First, we fit the pointwise tests without covariate cov
. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL
to indicate that a homogeneous poisson process is desired for capturing our sampling process under the null.
spat_test_1 <- PSTestRun(test_spat_dat, formula = ~ 1, interaction = NULL, latent_effect = latent_effect, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = F) spat_test_1
Here it is apparent that the pointwise tests reject the null hypothesis of no assocation between the latent effect and the intensity of the point pattern in favour of the alternative, that the intensity depends monotonically on the value of the latent effect. The test is able to reject the null at the 5% significance level with only 19 Monte Carlo simulations. As is the case with any analysis, problems of multiple testing should be considered. Note that the row of printed text log(lambda)
provides quantities from the fitted model including point estimates, standard errors etc., for the value of the intercept in the log intensity.
Next, we repeat the test, but this time compute a simultaneous confidence band by adding the argument simultaneous = T
. This test will no longer suffer from the problem of multiple testing.
spat_test_2 <- PSTestRun(test_spat_dat, formula = ~ 1, interaction = NULL, latent_effect = latent_effect, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = T, simultaneous = T) spat_test_2$critical_deviance spat_test_2$global_test_NN #spat_test_2$pointwise_tests
Here it is shown in the plot that the observed rank correlation lies outside the Monte Carlo rank envelope (i.e. simultaneous confidence band) for all values of K nearest neighbours tested. Thus we reject the null hypothesis at the 5% level once again. The critical value of the test is returned as the named entry critical_deviance
. If the absolute value of the observed rank correlation exceeds this value at any value of K, then the test is rejected. For convenience, the result of the test for each value of K is returned as the vector of logicals global_test_NN
. A value of TRUE
at any of the K values denotes the test is rejected.
Now we repeat the same test, but this time include the covariate cov
. Remember, cov
is the SpatialPixelsDataFrame
object created earlier. To include cov
in our null model, we create a named list containing all the covariates we desire to include in our model. Then we add this list as an argument to the function PSTestRun
. Finally, we add the desired covariates, in their desired functional form to the right hand side of the formula
argument. Note that higher order terms and interactions can be added as per usual.
covariates_list <- list(cov = cov) spat_test_3 <- PSTestRun(test_spat_dat, formula = ~ cov, interaction = NULL, latent_effect = latent_effect, covariates = covariates_list, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = T, simultaneous = T) spat_test_3$critical_deviance spat_test_3$global_test_NN #spat_test_3$pointwise_tests
Thus, once again the test is rejected at the 5% significance level. Thus, while a statistically significant effect of covariate cov
was found (point estimate 0.738 (SE 0.31)), it fails to explain away the preferential sampling.
Finally, suppose after further research, a second covariate 'cov2' is found to be strongly associated with the marks. Furthermore there is a plausible reason for 'cov2' affecting the choice of sampling locations. The latent effect is once again estimated, this time after controlling for the effect of cov2
. We now also include this as a covariate cov2
in the test.
latent_effect2 <- latent_effect
covariates_list2 <- list(cov = cov, cov2=cov2) spat_test_4 <- PSTestRun(test_spat_dat, formula = ~ cov + cov2, interaction = NULL, latent_effect = latent_effect2, covariates = covariates_list2, residual_tests=F, M=100, no_nn = 5, parallel = T, ncores=1, return_plots = T, simultaneous = T, global_alpha = 0.05) spat_test_4$critical_deviance spat_test_4$global_test_NN #spat_test_4$pointwise_tests
As before with cov
, a statistically significant linear effect of cov2
is found on the log intensity of the point pattern. The simultaneous test at the 5% significance level rejects the null hypothesis for K > 3. Thus we find insufficient evidence that the covariate cov2
has sufficiently explained away the preferential sampling.
If we were to look at the results from the pointwise tests, the test would have been rejected at the 5% significance level for each value of K<5 tested. However, once again, the issue of multiple testing must be considered and thus the significance level of the pointwise tests are only valid if the value of K is chosen beforehand. A single pointwise test can have a higher power to detect PS than the simultaneous test if a 'good' value of K is chosen beforehand.
Next we give an example of how to test for preferential sampling in continuous spatio-temporal data. Following on from the previous example, suppose two follow-up studies were performed in the following two years. These had n_2 and n_3 locations chosen independently according to two different 'sampling protocols'. Suppose the datasets appear as objects of class ppp
called spat_dat2
and spat_dat3
.
spat_dat2 <- rpoispp(lambda = 100, win=owin(c(0,1),c(0,1))) spat_dat2$marks <- (rnorm(n=spat_dat2$n ,mean=int_function(x=spat_dat2$x, y=spat_dat2$y), sd= 100) / 100 ) spat_dat2$marks <- spat_dat2$marks - min(spat_dat2$marks) plot(spat_dat2) spat_dat3 <- rmh.default(model = list( cif="hardcore", par=list(beta=1, hc=0.025), w=owin(c(0,1),c(0,1)), trend=int_function_neg)) #rpoispp(lambda = int_function_neg, win=owin(c(0,1),c(0,1))) spat_dat3$marks <- (rnorm(n=spat_dat3$n ,mean=int_function(x=spat_dat3$x, y=spat_dat3$y), sd= 100) / 100 ) spat_dat3$marks <- spat_dat3$marks - min(spat_dat3$marks) plot(spat_dat3)
This time, the n_2 sampling locations of the first follow-up survey at which to record (noisy) values of the latent process under study appear to be more uniformly placed throughout the study region than before. Furthermore, no clear associations between the intensity of the observation locations and the recorded values (the marks) exist.
On the other hand, the n_3 sampling locations of the second follow-up survey at which to record (noisy) values of the latent process under study appear to be placed at a higher intensity in the bottom left corner of the study region. A, clear negative association is seen between the intensity of the observation locations and the recorded values (the marks). Thus it appears that negative preferential sampling may have occured.
The goal of this analysis is to see if the follow-up studies were preferentially sampled.
First, we must format the data into the correct form required for the PStestR
package. This is achieved once again by the function PSTestInit
. Along with the positions
argument, a times
argument is now also required for specifying which discrete time period each observation belongs to. Furthermore, the covariates
and formula
arguments may be unique for each time period, or fixed across all time periods.
For spatio-temporal data, the PSTestInit
function is used similarly as before with the geostatistical data. The major differences here are with the following arguments:
type
- For spatiotemporal data, simply set type = 'spacetime' times
- This is a vector of unique numerical entries, specifying the observation times.positions
- this takes either an object of class ppp containing all the points across all the times. Otherwise, it takes a data.frame with named values x and y, giving the coordinates of all the observed points, across all times. Thus the number of rows of positions
must equal the length of times
.poly
- this takes a polygon-style object defining the boundaries of the study region. This can be of class owin
(from the spatstat
package), SpatialPolygons
(from the sp
package) or sf
(from the sf
package).#In addition to the above, the following optional argument also changes: #* `covariate_grids` - this is now either a list of covariate rasters, grids or pixels as before, or a named list of lists. In the latter, each list must contain a set of covariate rasters, grids or pixels for a specific time period. The names of the list entries must correspond to the unique values in `times`.
Thus returning to our example, for the positions
argument we create a data.frame with named elements x
and y
for both spat_dat2
and spat_dat3
. Next, we merge the two data frames. Finally, to create times
, we create a vector of length n_2 + n_3 with the first n_2 entries set to 2 and the remaining n_3 entries set equal to 3. Then, we plug these into the PSTestInit
function.
positions_merged <- data.frame(x = c(spat_dat2$x, spat_dat3$x), y = c(spat_dat2$y, spat_dat3$y)) times_vec <- c( rep(2, times = spat_dat2$n), rep(3, times = spat_dat3$n) ) test_st_dat <- PSTestInit(type='spacetime', positions = positions_merged, times = times_vec, discrete = F, poly=window_poly, n_prediction = 10000) names(test_st_dat)
Now we are ready to conduct the test.
Now we can run the test using the PSTestRun
function. The major differences between spatio-temporal data and geostatistical data are the arguments:
formula
- this is now either a formula specifying the same trend for each of the point process models across the times, or a named list, with entries equal to formulas. If a named list is chosen, names should match the unique entries in times
. Each formula then specifies the trend in the fitted point process corresponding to that time.covariates
- this is now either a list of covariate rasters, grids, pixels, or functions of the coordinates as before, or a named list of lists. In the latter, each list must contain a set of covariate rasters, grids etc., for a specific time period. The names of the list entries must correspond to the unique values in times
.latent_effect
- specifies the latent_effect estimated for each time period. As with covariates
, either this can be a single allowed object (e.g. a SpatialPixelsDataFrame
), or a named list of such entries. In the latter case, each list must contain a suitable data type containing the estimated latent effect for that specific time periodsimultaneous
- a logical stating whether the test should be computed with a simultaneous confidence band estimated. This simultaneous test is now computed over all K values and all times.As before, there are additional (optional) arguments that can be provided.
Now we show how to fit the tests. First, we fit the pointwise tests without any covariates. We only fit the nearest neighbour based test first. Here the formula is specified as ~ 1, and the interaction is set to NULL
to indicate that a homogeneous poisson process for each time/study is desired for capturing our sampling processes under the null. The latent effect is assumed constant across time and is estimated separately using the data from both surveys.
latent_effect3 <- latent_effect
st_test <- PSTestRun(test_st_dat, formula = ~ 1, interaction = NULL, latent_effect = latent_effect, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = F, simultaneous = F) st_test
Thus the pointwise tests reject the null hypothesis of no preferential sampling at the 5% signficance level for the second follow-up dataset for K = 3 and K = 10 and fail to reject the null hypothesis for any K for the first follow-up dataset. Next, we perform the simultaneous test, which computes the simultaneous confidence band over all values of K and over all times.
st_test2 <- PSTestRun(test_st_dat, formula = ~ 1, interaction = NULL, latent_effect = latent_effect, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = T, simultaneous = T) names(st_test2) st_test2$critical_deviance
The plot above is presented as follows. On the y-axis are the number of nearest neighbours K, over which the distances have been averaged over. On the x-axis is time, indicating the different follow-up studies in our example. The numbers in the white boxes are the observed rank correlations in the data for each value of K and time. Finally, the colour behind the white boxes denotes the value of the observed rank correlation, with no colour indicating that the observed rank correlation falls within the Monte Carlo approximated simultaneous confidence band.
Thus, for our example it is apparent that the null is rejected only for the second follow-up dataset, for K = 10. These are the only cells with colour behind them. Thus we fail to find evidence of preferential sampling in the first follow-up study and find evidence at the 5% significance level that PS did occur in the second follow-up study.
Each of the test statistics should be of similar variability, otherwise the studentized MAD envelope test or the directional quantile MAD envelope test should be used (see Mrkvicka et.al 2017). This is because the power to detect a departure from the null could be severely diminished if a specific test statistic is very highly variable relative to the others.
Although the PStestR package doesn't currently implement these modified envelope tests, by adding an argument return_rho_vals=T
to PSTestRun
, the Monte Carlo rho values will be returned, allowing such tests to be computed manually. The Monte Carlo rho values are stored in the entry Monte_Carlo_rho_values
and the observed rho values in the data are stored in test_rho
manual_test <- PSTestRun(test_st_dat, formula = ~ 1, interaction = NULL, latent_effect = latent_effect, residual_tests=F, M=19, no_nn = 10, parallel = T, ncores=1, return_plots = T, simultaneous = T, return_rho_vals = T) names(manual_test)
Setting up the data is made simple with the helper function PSTestInit
. For discrete spatial data, this function takes as arguments:
type
- this takes a character string. For discrete spatial data, simply set type = 'spatial' discrete
- this takes a logical value. Here, as we are in the discrete (i.e. areal) setting we set discrete = Tareal_polygons
- this takes an object of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package), giving the Polygons of the population of areal unitsareal_poly_observations
- this is a vector of indices denoting the polygons with complete data.In addition to the above, the following are additional (optional) arguments to call:
discrete_locations
provides the locations (e.g. centroids) of the population of areal units. This is an alternative specification which avoids the need to specify the polygons.library(sp) library(maptools) library(rgeos) library(rgdal) library(spatstat) library(doParallel) library(foreach) library(ggplot2) library(mgcv) library(PStestR) # set seed set.seed(12345) ## load areal data discrete_data <- PStestR:::sample_discrete_data#readRDS('../data/sample_discrete_data.rds') list2env(discrete_data,.GlobalEnv) #source('PSTestInit')
Let's show an example of how to initialise data, suitable for use with PSTestInit
. Suppose that we have an object called areal_polygons
of class SpatialPolygons
defining our population of areal units. Suppose we also have a vector of integers called PS_indices
giving the indices of the chosen areal polygons. Finally, suppose we have an object called PS_data
of class SpatialPolygonsDataFrame
. This contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.
First, we plot selected areal units, and fill the colour based on the value of the response:
class(areal_polygons) class(PS_data) plot(areal_polygons) sp::spplot(PS_data)
Here, we see a potential clustering of areal units in the South Western corner. The values of the latent field appear especially high here. Thus positive PS may be present.
Now we move onto creating an object suitable for the PStestR
package by running PSTestInit
.
test_spat_dat <- PSTestInit(type='spatial', discrete = T, areal_polygons = areal_polygons, areal_poly_observations = PS_indices) names(test_spat_dat)
The object test_spat_dat
contains everything that is needed to fit the test. Of importance is the following named object within the test_spat_dat
:
prediction_df
is a data.frame
object containing the x, y coordinates of each of the areal polygon centroids. If a different measure of areal unit center is desired then this can be specified with the argument discrete_locations
instead (see above). We will need to append our estimates of the latent effect to this data frame.Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df
object. This is a suitable file format for the test.
Returning to our example, for the time being we ignore the potential preferential sampling and fit a discrete spatial regression model. No covariate effect is found here. We append the estimates of the latent effect, stored in the object latent_effect
, to the prediction_df
object.
latent_effect <- complete_data$Y
test_spat_dat$prediction_df$latent_effect <- latent_effect
Now we can run the test using the PSTestRun
function. For spatial data, this takes arguments:
proc_data
- this is the processed data, outputted from the PSTestInit
functioncovariates
- A data.frame
object with number of rows equal to the number of areal units.latent_effect
contains the vector of predicted latent effect values of length equal to the number of areal polygons in the populationformula
- this is the formula specifying the trend for the binary sampling model. R must be specified as the response variable (i.e. on the LHS of the ~).M
- specifies the number of Monte Carlo samples to take for the testno_nn
- specifies the maximum number of nearest neighbours to take distances oversimultaneous
- a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE
, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE
, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.global_alpha
- when simultaneous=TRUE
, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:
parallel
- takes a logical value indicating whether or not the script should be run in parallel using the foreach
and DoParallel
packages. ncores
- if parallel=TRUE
, this sets the number of cores to parallelize across.return_plots
- a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE
. PS
- a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.fix_n
- a logical argument specifying if the Monte Carlo sampled binary indicators should sum to match the observed numbers of areal units with complete data.Now we show how to fit the test. First, we fit the pointwise tests. Here the formula is specified as R ~ 1
for the null model (assuming no PS). Note that R
must always be set as the response variable.
spat_test_1 <- PSTestRun(test_spat_dat, formula = R ~ 1, covariates = test_spat_dat$prediction_df, latent_effect = latent_effect, interaction = NULL, M=19, no_nn = 10, parallel = F, ncores=1, return_plots = F) spat_test_1
Here we see evidence that the data were indeed preferentially sampled.
spat_test_12 <- PSTestRun(test_spat_dat, formula = R ~ 1, covariates = test_spat_dat$prediction_df, latent_effect = latent_effect, interaction = NULL, M=19, no_nn = 10, parallel = F, ncores=1, return_plots = T, simultaneous = T) spat_test_12$global_test_NN spat_test_12$pointwise_empirical_pvalues
Here, the simultaneous test also finds evidence of PS for all K > 1. This simultaneous test does not suffer the problems of multiple testing.
Now suppose that we find a covariate cov
that we believe to be associated with both the sampling process, and with the latent effect. We can include this in the sampling model to see if this explains away the PS. First we need to append cov
, of class vector
, to the prediction_df
object, and then update the formula.
cov <- latent_effect + rnorm(length(latent_effect), mean=0, sd=2)
test_spat_dat$prediction_df$cov <- cov spat_test_2 <- PSTestRun(test_spat_dat, formula = R ~ cov, covariates = test_spat_dat$prediction_df, latent_effect = latent_effect, interaction = NULL, M=19, no_nn = 10, parallel = F, ncores=1, return_plots = F) spat_test_2
Whilst the covariate cov
was indeed found to be significantly associated with the binary selection process, the evidence of preferential sampling remains. Thus, we have failed to find a sufficient set of covariates to adequately explain away the PS we see in the data.
For discrete spatio-temporal data, the process is very similar.
Setting up the data is made simple with the helper function PSTestInit
. For discrete spatiotemporal data, this function takes as arguments:
type
- this takes a character string. For discrete spatial data, simply set type = 'spacetime' discrete
- this takes a logical value. Here, as we are in the discrete (i.e. areal) setting we set discrete = Tareal_polygons
- this takes an object of class owin (from the spatstat package), SpatialPolygons (from the sp package) or sf (from the sf package), giving the Polygons of the population of areal unitsareal_poly_observations
- this is a vector of indices denoting the polygons with complete data.times
- this is a vector of observation timesIn addition to the above, the following are additional (optional) arguments to call:
discrete_locations
provides the locations (e.g. centroids) of the population of areal units. This is an alternative specification which avoids the need to specify the polygons.library(sp) library(maptools) library(rgeos) library(rgdal) library(spatstat) library(doParallel) library(foreach) library(ggplot2) library(mgcv) library(PStestR) # set seed set.seed(12345) ## load areal data discrete_st_data <- PStestR:::sample_st_discrete_data#readRDS('../data/sample_st_discrete_data.rds') list2env(discrete_st_data,.GlobalEnv) #source('PSTestInit')
Let's show an example of how to initialise data, suitable for use with PSTestInit
. Suppose that we have an object called areal_polygons
of class SpatialPolygons
defining our population of areal units. Suppose we also have a vector called PS_indices
giving the indices of the chosen areal polygons. Suppose we also have a vector called times
taking integer values. Each entry indicates the observed time step corresponding to each entry of PS_indices
. Finally, suppose we have an object called PS_data
of class list
. Each element of the list is of class SpatialPolygonsDataFrame
and contains the selected polygons with values of the response attached as a dataframe. The responses are noisy observations of some latent process which we aim to study. The length of the list PS_data
equals the number of unique entries in times
.
Once again, it is feared that areal units were preferentially sampled where the process was expected to be highest. We will show how one can test for this.
First, we plot selected areal units at each time step and fill the colour based on the value of the response. Note that 5 time steps were observed.
par(mfrow=c(3,2))
class(areal_polygons) class(PS_data) class(PS_data[[1]]) plot(areal_polygons) sp::spplot(PS_data[[1]]) sp::spplot(PS_data[[2]]) sp::spplot(PS_data[[3]]) sp::spplot(PS_data[[4]]) sp::spplot(PS_data[[5]])
par(mfrow=c(1,1))
For the first four time steps, we see a potential clustering of areal units where the latent field takes large values. Thus positive PS may be present. In the final time step, no obvious clustering can be seen.
Now we move onto creating an object suitable for the PStestR
package by running PSTestInit
.
test_st_dat <- PSTestInit(type='spacetime', discrete = T, areal_polygons = areal_polygons, areal_poly_observations = PS_indices, times = times) names(test_st_dat)
The object test_st_dat
contains everything that is needed to fit the test. Of importance is the following named object within test_st_dat
:
prediction_df
is a data.frame
object containing the x, y coordinates of each of the areal polygon centroids. If a different measure of areal unit center is desired then this can be specified with the argument discrete_locations
instead (see above). We will need to append our estimates of the latent effect to this data frame.Once the desired covariates and latent effect(s) have been estimated or projected onto the centroids, they can be appended to the prediction_df
object. This is a suitable file format for the test.
Returning to our example, for the time being we ignore the potential preferential sampling and fit a discrete spatial regression model. No covariate effect is found here. We append the estimates of the latent effect, stored in the object latent_effect
, of class numeric
, to the prediction_df
object.
latent_effect <- rep(NA, times=length(test_st_dat$prediction_df$t)) latent_effect[test_st_dat$prediction_df$t==1] <- complete_data[[1]]$Y latent_effect[test_st_dat$prediction_df$t==2] <- complete_data[[2]]$Y latent_effect[test_st_dat$prediction_df$t==3] <- complete_data[[3]]$Y latent_effect[test_st_dat$prediction_df$t==4] <- complete_data[[4]]$Y latent_effect[test_st_dat$prediction_df$t==5] <- complete_data[[5]]$Y
test_st_dat$prediction_df$latent_effect <- latent_effect
Now we can run the test using the PSTestRun
function. For spatial data, this takes arguments:
proc_data
- this is the processed data, outputted from the PSTestInit
functioncovariates
- A data.frame
object with number of rows equal to the number of areal units multiplied by the number of unique entries in times
.latent_effect
contains the vector of predicted latent effect values of length equal to the number of areal polygons in the population multiplied by the number of unique entries in times
.formula
- this is the formula specifying the trend for the binary sampling model.M
- specifies the number of Monte Carlo samples to take for the testno_nn
- specifies the maximum number of nearest neighbours to take distances oversimultaneous
- a logical stating whether the test should be computed with a simultaneous confidence band estimated. If set equal to FALSE
, the pointwise tests are computed for each nearest neighbour number. If set equal to TRUE
, the test does not suffer from the multiple testing problem, as the test is no longer pointwise and is instead a rank envelope test.global_alpha
- when simultaneous=TRUE
, this specifies the significance level for computing the simultaneous confidence band for the rank envelope test. This is equal to the significance level of the simultaneous test.In addition to the above arguments, for discrete spatial data there are additional (optional) arguments that can be provided:
parallel
- takes a logical value indicating whether or not the script should be run in parallel using the foreach
and DoParallel
packages. ncores
- if parallel=TRUE
, this sets the number of cores to parallelize across.return_plots
- a logical argument specifying whether or not a plot is desired for the rank envelope test when simultaneous=TRUE
. PS
- a character string specifying 'either' (default), 'positive' or 'negative'. This specifies the direction of preferential sampling, with 'either' specifying a two-sided alternative hypothesis.fix_n
- a logical argument specifying if the Monte Carlo sampled binary indicators should sum to match the observed numbers of areal units with complete data.Now we show how to fit the test. First, we fit the pointwise tests. Here the formula is specified as R ~ 1
for the null model (assuming no PS). Note that R
must always be set as the response variable.
st_test_1 <- PSTestRun(test_st_dat, formula = R ~ 1, covariates = test_st_dat$prediction_df, latent_effect = latent_effect, interaction = NULL, M=19, no_nn = 10, parallel = F, ncores=1, return_plots = F) st_test_1
Here we see evidence that the data were indeed preferentially sampled for the first 4 times. No evidence of PS is found for the final year.
Now suppose that we find a covariate cov
that we believe to be associated with both the sampling process, and with the latent effect for the first year. Suppose we are unable to find covariates for any of the other years. We can include this in the sampling model to see if this explains away the PS. First we need to append cov
to the prediction_df
object, and then update the formula. The formula
object should now be a named list of length 5. Each entry of the list specifies the null sampling model. The names of the entries corresponds to the value of the time step. The first formula will contain the variable cov
, while the remaining 4 formulae will be intercept-only.
cov <- complete_data[[1]]$Y + rnorm(length(complete_data[[1]]$Y), mean=0, sd=0)
test_st_dat$prediction_df$cov <- NA test_st_dat$prediction_df$cov[test_st_dat$prediction_df$t==1] <- cov formula_st2 <- list('1' = R ~ cov, '2' = R ~ 1, '3' = R ~ 1, '4' = R ~ 1, '5' = R ~ 1) st_test_2 <- PSTestRun(test_st_dat, formula = formula_st2, covariates = test_st_dat$prediction_df, latent_effect = latent_effect, interaction = NULL, M=19, no_nn = 10, parallel = F, ncores=1, return_plots = F) st_test_2
Here the covariate cov
was indeed found to be significantly associated with the binary selection process for the first year. Furthermore, the evidence of preferential sampling has dissapeared with the inclusion of cov
in the null sampling model. Thus, we have found a sufficient set of covariates to adequately explain away the PS we see in the data for the first time step. We could now repeat the process for the remaining 4 years.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.