The steps of the Spatial Point Pattern Test (SPPT) are described here, and pseudocode can be found in several publications (linked to on the same website).

Nick Malleson has programmed a Java application for the SPPT, including a graphical user interface. Wouter Steenbeek independently programmed the sppt R function. This vignette compares the output of the R function and the Java application and explains their differences.

Toy example data

library("sppt")

Spatial objects areas.sp, points1.sp, and points2.sp can be used for a quick comparison of the R and Java implementations.

plot(areas.sp)                                              # areal units
points(points1.sp, col = "blue", pch = 19)                  # base points: blue circles
points(points2.sp, col = "red", pch = 15)                   # test points: red squares
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)  # area IDs at centroids

The areas' ID numbers are printed in black, placed at the area centroid. The base points are blue circles, and the test points are red squares. The question is: does the spatial pattern of the blue points differ from the spatial pattern of the red points, for these areas?

A first comparison

The Java application produces a shapefile with the following data.frame:

load("toy_java.rda")
toy_java@data

The original ordering of the area ID vector was:

areas.sp$ID

Apparently, the Java application reorders the ID numbers; it has also converted the ID from an integer to a factor.

Putting the Java output back in the original order:

toy_java_df <- toy_java@data[order(as.numeric(as.character(toy_java$ID))),]
toy_java_df

Let's now run the R sppt() function:

set.seed(39346) # set seed for reproducibility
toy_r <- sppt(points1.sp, points2.sp, areas.sp)
toy_r@data

We already see differences in the calculation of PctBsePts (PercentBasePoints) and PctTstPts (PercentTestPoints). The reason is a difference in how the two implementations handle points that lie outside the polygons.

As the figure above shows, there are 9 base points and 10 test points in total, but each points dataset contains one point that lies outside the polygon areas.
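
This can be verified directly. sp's spatial subsetting (the same idiom used later in this vignette) keeps only the points that fall inside the areas:

length(points1.sp)               # 9 base points in total
length(points1.sp[areas.sp, ])   # 8 fall inside an area
length(points2.sp)               # 10 test points in total
length(points2.sp[areas.sp, ])   # 9 fall inside an area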

In the R code, a point that lies outside the areas is ignored completely: it is not counted in the number of points per areal unit, nor is it included in the denominator when calculating the percentage of points per unit. In contrast, the Java application does include a point located outside the areal units in the denominator. The R behavior is correct, because we are interested in the increase or decrease of event occurrence in a particular set of areal units; once events outside the areal units under study are included, the percentages within the areal units no longer make much sense.
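
A minimal sketch of the two conventions, using hypothetical counts for a single area (suppose the area contains 2 of the 9 base points):

n_in_area <- 2   # hypothetical: base points inside this particular area
n_total   <- 9   # Java denominator: all base points, including the one outside the areas
n_inside  <- 8   # R denominator: only the base points that fall inside any area

100 * n_in_area / n_total    # Java convention: ~22.2%
100 * n_in_area / n_inside   # R convention: 25%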

In this toy example, the consequences aren't noteworthy. But let's simulate extra data points to produce a crime hot spot a bit further away, i.e. away from the areal units the researcher is interested in.

if(!require(scales)) install.packages("scales", repos = "https://cloud.r-project.org/")

plot(areas.sp)
points(points1.sp, col="blue", pch = 19)
points(points2.sp, col="red", pch = 15)
set.seed(345)
# simulate 50 extra base points clustered just outside the study area
points(x = jitter(rep(bbox(points1.sp)[3] + 150, times = 50), factor = 0.008),
       y = jitter(rep(bbox(points1.sp)[2] - 50, times = 50), factor = 0.008),
       col = scales::alpha("blue", 0.6), pch = 16)
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)

In this case, it doesn't make much sense to include these 50 points if we are interested in the change in crime occurrence within these 6 areal units. In the new base points data it would then seem that only 3.4% of crime happens in areal unit 1, while in the test points dataset this will have increased to 22.2%! Instead, it makes the most sense to analyze only the points that lie within the areal units.

In most cases, we expect researchers will have a contiguous set of areal units (e.g. neighborhoods that are all connected to each other), and the locations of (crime) events within those areal units. In that case, the Java app and R code will produce (almost) the same result (see below).

A comparison using the exact same data

With the previous caveat about the Java application in mind, let's select only the points located inside the areas and compare the R and Java output. Here are the toy example data again, but with two data points removed: one from the base data and one from the test data.

# remove points outside any areal unit
points1.sp.new <- points1.sp[areas.sp,]
points2.sp.new <- points2.sp[areas.sp,]
plot(areas.sp)
points(points1.sp.new, col="blue", pch = 19)
points(points2.sp.new, col="red", pch = 15)
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)

The Java application, run with 200 iterations, an 85% sample size, and a 95% confidence interval, prints the following messages:

Calculating percentage test points in each area for each run

Ranking percentages in ascending order and removing 5 outliers from top and bottom

Calculating S-index for each area

Found global S value: 0.8333333333333334

Outputting shapefile of areas: basic_test_new_output.shp

ALGORITHM HAS FINISHED

Have read in 8 base points, 9 test points (8 test point used) and 6 areas.

And a shapefile is produced with the following data.frame (with the ID number put back in the original order):

load("toy_java_new.rda")
toy_java_df <- toy_java_new@data[order(as.numeric(as.character(toy_java_new$ID))),]
toy_java_df

The 'global S value' reported by the Java application can be reproduced as one minus the mean of the absolute values of the per-area S-index:

1 - mean(abs(toy_java_df$SIndex))
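
Equivalently, because the per-area S-index only takes the values -1, 0, and 1, this is the proportion of areas with a stable S-index of 0:

mean(toy_java_df$SIndex == 0)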

The R sppt() function:

set.seed(39346) # set seed for reproducibility
toy_r_new <- sppt(points1.sp.new, points2.sp.new, areas.sp, nsamples=200, percpoints=85, conf_level=95)

The standard and robust global S-values are included in the resulting spatial object's data.frame:

toy_r_new@data

These S-values can also be accessed through the convenience function summary_sppt():

summary_sppt(toy_r_new)

We see that the sppt() R function can reproduce the Java output exactly.

Vancouver example data

Spatial objects vancouver_areas.sp, vancouver_points1.sp, and vancouver_points2.sp are used for a more in-depth comparison of the R and Java implementations.

The question is whether, and to what extent, the Vancouver base points pattern:

plot(vancouver_areas.sp)
points(vancouver_points1.sp, col="blue", pch = 19, cex=.2)

is different from the test points pattern:

plot(vancouver_areas.sp)
points(vancouver_points2.sp, col="red", pch = 15, cex=.2)

To make an honest comparison (see above), both SpatialPointsDataFrames were clipped to vancouver_areas.sp, and these new shapefiles were used in the Java application: it should now produce the same results as the R function.

# remove points outside any areal unit
vancouver_points1.sp <- vancouver_points1.sp[vancouver_areas.sp, ]
vancouver_points2.sp <- vancouver_points2.sp[vancouver_areas.sp, ]

Speed difference

The Java application (200 iterations, 85% sample size, 95% confidence interval) takes about 50 seconds on an Intel Core i7-3770 CPU @ 3.40 GHz with 32 GB RAM, running Windows 7 Professional.

The microbenchmark package is used to provide an estimate of the R function's processing time:

if(!require(microbenchmark)) install.packages("microbenchmark", repos = "https://cloud.r-project.org/")

microbenchmark::microbenchmark(sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp, nsamples=200, percpoints=85, conf_level=95), times = 50L, unit = "s")

The R function takes about 1.5 seconds, a massive speed increase compared to the Java application.

Preliminary checks

The sppt() function call is:

set.seed(39346) # set seed for reproducibility
vancouver_r <- sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp, nsamples=200, percpoints=85, conf_level=95)
vancouver_r_df <- vancouver_r@data

load("vancouver_java_new.rda")
vancouver_java_df <- vancouver_java_new@data
vancouver_java_df <- vancouver_java_df[order(as.numeric(as.character(vancouver_java_df$CTUID))),]

After running the Java application with the same parameters and reordering the generated shapefile, let's check whether the R output and Java output agree on the basic variables (the data.frames are called vancouver_r_df and vancouver_java_df, respectively):

# test if the ID numbers are in the same order:
table(vancouver_r_df$CTUID == vancouver_java_df$CTUID)
# test that vectors PctBsePts are the same:
table(vancouver_r_df$PctBsePts == vancouver_java_df$PctBsePts)
# test that vectors PctTstPts are the same:
table(vancouver_r_df$PctTstPts == vancouver_java_df$PctTstPts)

Comparing results

The Java application produced the following output on screen:

Calculating percentage test points in each area for each run

Ranking percentages in ascending order and removing 5 outliers from top and bottom

Calculating S-index for each area

Found global S value: 0.17592592592592593

Outputting shapefile of areas: vancouver_all_new_output.shp

ALGORITHM HAS FINISHED

Have read in 14584 base points, 14782 test points (12565 test point used) and 108 areas.

The S-values created by the sppt() function are:

summary_sppt(vancouver_r)

The standard global S-value is slightly different from the value reported by the Java application. As it refers to the proportion of areas that had stable proportions of crime, the difference in the standard global S-value must be caused by a difference in the underlying S-index (a value that can differ for each area, indicating stability or change). Are these vectors equal?

table(vancouver_r_df$SIndex == vancouver_java_df$SIndex)

Let's look at the areas where the S-index differs between the two implementations. First, the R output:

differences <- which(vancouver_r_df$SIndex != vancouver_java_df$SIndex)
vancouver_r_df[differences, c("CTUID", "PctBsePts","PctTstPts","ConfLowP","ConfUppP","SIndex")]

And the Java output:

vancouver_java_df[differences, c("CTUID", "PctBsePts","PctTstPts","ConfLowP","ConfUppP","SIndex")]

The confidence intervals in the R output are slightly different, leading to a different conclusion: the PctBsePts values fall just inside the interval between ConfLowP and ConfUppP in the R output (no significant change), whereas they fall just outside the corresponding interval in the Java output (significant change).
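
The per-area decision rule can be sketched as follows. This mirrors the logic just described, not the package source, and assumes a data.frame with the column names used above:

s_index_sketch <- function(df) {
  # stable (0) if the base percentage falls within the simulated confidence
  # interval of the test percentages; otherwise the sign of the change
  stable <- df$PctBsePts >= df$ConfLowP & df$PctBsePts <= df$ConfUppP
  ifelse(stable, 0, sign(df$PctTstPts - df$PctBsePts))
}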

Because the SPPT is a simulation-based test, the results can differ slightly every time. The R function is made reproducible with set.seed(), but the Java application is not. If this explains the difference, then we should be able to find the exact ConfLowP and ConfUppP values reported by the Java application in the R output, provided we replicate the sppt() function call often enough.

Let's try 10 replications of the sppt() function (note that the defaults of the sppt() function are 200 iterations, 85% sample size, 95% confidence interval, so we don't have to include these in the function call):

set.seed(85335)
reps.output <- replicate(10, sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp))
reps.output <- lapply(c("ConfLowP", "ConfUppP"), function(y) do.call(cbind, sapply(reps.output, function(x) x@data[y])))

reps.output is a list whose first element refers to ConfLowP, with 108 rows (one for each area) and 10 columns (one for each replication). The second list element is of equal size, but refers to ConfUppP. As an example, here are the 10 possible ConfLowP values calculated for the 52nd area:

as.numeric(reps.output[[1]][52,])
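
As a quick check on the dimensions described above (108 areas by 10 replications for both list elements):

sapply(reps.output, dim)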

So the question is: can the ConfLowP of the first area, as reported by the Java application, be found among the ten potential outcomes (for the first area) of the SPPT R function?

vancouver_java_df$ConfLowP[1] %in% reps.output[[1]][1,]

Considering all areas, how many times can the 'ConfLowP' and 'ConfUppP' calculated by the Java application be found in at least one of the replicated outcomes?

# ConfLowP
table(sapply(1:nrow(vancouver_java_df), function(x) {vancouver_java_df$ConfLowP[x] %in% reps.output[[1]][x,]} ))
# ConfUppP
table(sapply(1:nrow(vancouver_java_df), function(x) {vancouver_java_df$ConfUppP[x] %in% reps.output[[2]][x,]} ))

Ten replications are already enough to reproduce many of the Java application's ConfLowP and ConfUppP values exactly. Let's see the same output with 80 replications:

set.seed(85335)
reps.output <- replicate(80, sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp)) # this will take some time
reps.output <- lapply(c("ConfLowP", "ConfUppP"), function(y) do.call(cbind, sapply(reps.output, function(x) x@data[y])))
# ConfLowP
table(sapply(1:nrow(vancouver_java_df), function(x) {vancouver_java_df$ConfLowP[x] %in% reps.output[[1]][x,]} ))
# ConfUppP
table(sapply(1:nrow(vancouver_java_df), function(x) {vancouver_java_df$ConfUppP[x] %in% reps.output[[2]][x,]} ))

Now all of the Java application's output is reproduced exactly by the R sppt() function. We conclude that, just by chance, the R output of a single function call can differ from the Java output, but this is to be expected because the SPPT is a simulation-based test.


