The steps of the Spatial Point Pattern Test (SPPT) are described here, and pseudocode can be found in several publications (linked to on the same website).
Nick Malleson has programmed a Java application for the SPPT, including a graphical user interface. The sppt R function was programmed independently by Wouter Steenbeek. This vignette compares the output of the R function and the Java application, showing where they differ.
library("sppt")
Spatial objects areas.sp, points1.sp, and points2.sp can be used for a quick comparison of the R and Java implementations.
plot(areas.sp)
points(points1.sp, col="blue", pch = 19)
points(points2.sp, col="red", pch = 15)
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)
The areas' ID numbers are printed in black, placed at the area centroid. The base points are blue circles, and the test points are red squares. The question is: does the spatial pattern of the blue points differ from the spatial pattern of the red points, for these areas?
Using the Java application, a shapefile is produced with the following data.frame:
load("toy_java.rda")
toy_java@data
The original ordering of the area ID vector was:
areas.sp$ID
Apparently the Java application reorders the ID numbers. The Java application has also converted the ID number from an integer to a factor.
Putting the Java output back in the original order:
toy_java_df <- toy_java@data[order(as.numeric(as.character(toy_java$ID))), ]
toy_java_df
Let's now run the R sppt() function:
set.seed(39346)  # set seed for reproducibility
toy_r <- sppt(points1.sp, points2.sp, areas.sp)
toy_r@data
We see differences already in the calculation of PctBsePts (PercentBasePoints) and PctTstPts (PercentTestPoints). The reason is a difference in the way the methods handle points that lie outside the polygons.
As you can see in the figure above, there are 9 base points in total, and 10 test points in total. But in each points dataset, there is one point outside the polygon areas.
In the R code, a point that lies outside the areas is ignored completely. So it is not included when counting the number of points per areal unit, and it is also not included for the calculation of the percentage of points per unit (i.e. in the denominator). In contrast, the Java app includes a point that is located outside the areal units in the denominator. The R code is correct, because we are interested in the increase/decrease of event occurrence in a particular set of areal units. By including events outside the areal units under study, the percentages within the areal units don't make much sense anymore.
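The difference in denominators can be illustrated with a minimal sketch. The per-area count of 2 is made up for illustration (this is not the sppt internals); the totals of 9 base points and 1 outside point come from the toy data above:

```r
# Percentage of base points in a hypothetical area containing 2 of the points,
# given 9 base points in total of which 1 lies outside all areal units.
n_in_area <- 2
n_total   <- 9      # all base points (Java denominator)
n_inside  <- 9 - 1  # base points inside any areal unit (R denominator)

pct_r    <- 100 * n_in_area / n_inside  # R: outside point dropped  -> 25%
pct_java <- 100 * n_in_area / n_total   # Java: outside point kept -> ~22.2%
c(pct_r = pct_r, pct_java = pct_java)
```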
In this toy example, the consequences aren't noteworthy. But let's simulate extra data points to produce a crime hot spot a bit further away, i.e. away from the areal units the researcher is interested in.
if (!require(scales)) install.packages("scales", repos = "https://cloud.r-project.org/")
plot(areas.sp)
points(points1.sp, col="blue", pch = 19)
points(points2.sp, col="red", pch = 15)
set.seed(345)
points(x = jitter(rep(bbox(points1.sp)[3] + 150, times = 50), factor = 0.008),
       y = jitter(rep(bbox(points1.sp)[2] - 50, times = 50), factor = 0.008),
       col = scales::alpha("blue", 0.6), pch = 16)
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)
In this case, it doesn't make much sense to include these 50 points if we are interested in the change in crime occurrence within these 6 areal units. In the new basepoints data it would then seem that 3.4% of crime is happening in areal unit 1, and in the test points dataset this will have increased to 22.2%! Instead, it makes most sense to only analyze the points that lie within the areal units.
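The percentages above can be checked with quick arithmetic, assuming (for illustration) that areal unit 1 contains 2 base points and 2 test points; the denominators follow from the Java app's rule of counting all points, including the 50 simulated far-away base points:

```r
# Base percentage for unit 1: 2 points out of 9 original + 50 simulated = 59
round(100 * 2 / (9 + 50), 1)  # 3.4

# Test percentage for unit 1: 2 points out of 9 (10 test points, 1 outside
# all areas and therefore excluded from the "used" count)
round(100 * 2 / 9, 1)         # 22.2
```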
In most cases, we expect researchers will have a contiguous set of areal units (e.g. neighborhoods that are all connected to each other), and the locations of (crime) events within those areal units. In that case, the Java app and R code will produce (almost) the same result (see below).
With the previous caveat of the Java application in mind, let's select only points located inside the areas, and compare the R and Java output. Here are the toy example data again, but with two data points removed, one from the base data and one from the test data.
# remove points outside any areal unit
points1.sp.new <- points1.sp[areas.sp, ]
points2.sp.new <- points2.sp[areas.sp, ]
plot(areas.sp)
points(points1.sp.new, col="blue", pch = 19)
points(points2.sp.new, col="red", pch = 15)
text(coordinates(areas.sp), label = areas.sp$ID, font = 2)
The Java application, using 200 iterations, an 85% sample size, and a 95% confidence interval, prints the following output:
Calculating percentage test points in each area for each run
Ranking percentages in ascending order and removing 5 outliers from top and bottom
Calculating S-index for each area
Found global S value: 0.8333333333333334
Outputting shapefile of areas: basic_test_new_output.shp
ALGORITHM HAS FINISHED
Have read in 8 base points, 9 test points (8 test point used) and 6 areas.
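The interval construction described in the log lines above can be sketched as follows. The simulated percentages here are randomly generated for illustration; sppt's actual resampling of test points differs:

```r
# With 200 iterations and a 95% confidence interval, dropping the 5 highest
# and 5 lowest simulated percentages (2.5% in each tail) leaves the middle
# 190 values; their extremes form the simulated interval for one area.
set.seed(1)
sims <- sort(runif(200, min = 8, max = 12))  # 200 illustrative percentages
c(ConfLowP = sims[6], ConfUppP = sims[195])  # bounds of the middle 190 values
```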
And a shapefile is produced with the following data.frame (with the ID numbers put back in the original order):
load("toy_java_new.rda")
toy_java_df <- toy_java_new@data[order(as.numeric(as.character(toy_java_new$ID))), ]
toy_java_df
The S-Index 'global S value' can be reproduced by taking the mean of the absolute values of the S-index for each area:
1 - mean(abs(toy_java_df$SIndex))
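Since the global S is one minus the mean absolute per-area S-index, a value of 0.8333... across six areas implies exactly one changed area. A sketch with an illustrative SIndex vector (not the actual output; the changed area could equally have SIndex 1):

```r
# Five stable areas (SIndex 0) and one changed area (SIndex -1)
s_index <- c(0, 0, 0, 0, 0, -1)
1 - mean(abs(s_index))  # 0.8333333, matching the Java "global S value"
```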
The R sppt() function:
set.seed(39346)  # set seed for reproducibility
toy_r_new <- sppt(points1.sp.new, points2.sp.new, areas.sp,
                  nsamples = 200, percpoints = 85, conf_level = 95)
The standard and robust global S-values are included in the resulting spatial object's dataframe:
toy_r_new@data
These S-values can also be accessed through the convenience function summary_sppt():
summary_sppt(toy_r_new)
We see that the sppt() R function can reproduce the Java output exactly.
Spatial objects vancouver_areas.sp, vancouver_points1.sp, and vancouver_points2.sp are used for a more in-depth comparison of the R and Java implementations.
The question is whether, and to what extent, the Vancouver base points pattern:
plot(vancouver_areas.sp)
points(vancouver_points1.sp, col="blue", pch = 19, cex = .2)
is different from the test points pattern:
plot(vancouver_areas.sp)
points(vancouver_points2.sp, col="red", pch = 15, cex = .2)
To make an honest comparison (see above), both SpatialPointsDataFrames were clipped to vancouver_areas.sp, and these new shapefiles were used in the Java application: it should now produce the same results as the R function.
# remove points outside any areal unit
vancouver_points1.sp <- vancouver_points1.sp[vancouver_areas.sp, ]
vancouver_points2.sp <- vancouver_points2.sp[vancouver_areas.sp, ]
The Java application (200 iterations, 85% sample size, 95% confidence interval) takes about 50 seconds on an Intel Core i7-3770 CPU @ 3.40 GHz with 32 GB RAM, running Windows 7 Professional.
The microbenchmark package is used to provide an estimate of the R function's processing time:
if (!require(microbenchmark)) install.packages("microbenchmark", repos = "https://cloud.r-project.org/")
microbenchmark::microbenchmark(
  sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp,
       nsamples = 200, percpoints = 85, conf_level = 95),
  times = 50L, unit = "s"
)
The R function takes about 1.5 seconds, roughly a 30-fold speed-up over the Java application.
The sppt() function call is:
set.seed(39346)  # set seed for reproducibility
vancouver_r <- sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp,
                    nsamples = 200, percpoints = 85, conf_level = 95)
vancouver_r_df <- vancouver_r@data
load("vancouver_java_new.rda")
vancouver_java_df <- vancouver_java_new@data
vancouver_java_df <- vancouver_java_df[order(as.numeric(as.character(vancouver_java_df$CTUID))), ]
After running the Java application with the same parameters and reordering the generated shapefile, let's check that the R output and Java output are similar on basic variables (the data.frames are called vancouver_r_df and vancouver_java_df, respectively):
# test if the ID numbers are in the same order:
table(vancouver_r_df$CTUID == vancouver_java_df$CTUID)
# test that vectors PctBsePts are the same:
table(vancouver_r_df$PctBsePts == vancouver_java_df$PctBsePts)
# test that vectors PctTstPts are the same:
table(vancouver_r_df$PctTstPts == vancouver_java_df$PctTstPts)
The Java application produced the following output on screen:
Calculating percentage test points in each area for each run
Ranking percentages in ascending order and removing 5 outliers from top and bottom
Calculating S-index for each area
Found global S value: 0.17592592592592593
Outputting shapefile of areas: vancouver_all_new_output.shp
ALGORITHM HAS FINISHED
Have read in 14584 base points, 14782 test points (12565 test point used) and 108 areas.
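The reported global S of 0.17592592... equals 19/108. Given the relation global S = 1 - mean(abs(SIndex)) shown earlier and the 108 areas, this implies 89 areas changed (|SIndex| = 1) and 19 remained stable (SIndex = 0). The split is inferred arithmetic, not read from the output:

```r
# 89 of 108 areas with |SIndex| = 1, the remaining 19 with SIndex = 0
1 - 89 / 108  # 0.1759259..., i.e. 19/108, matching the Java global S
```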
The S-values created by the sppt() function are:
summary_sppt(vancouver_r)
The standard global S-value is slightly different from the value reported by the Java application. Because it refers to the proportion of areas that had stable proportions of crime, the difference in the standard global S-value must be caused by a difference in the underlying S-Index (a per-area value indicating stability or change). Are these vectors equal?
table(vancouver_r_df$SIndex == vancouver_java_df$SIndex)
Let's look at the output of the R function and Java application. First, the R output:
differences <- which(vancouver_r_df$SIndex != vancouver_java_df$SIndex)
vancouver_r_df[differences, c("CTUID", "PctBsePts", "PctTstPts", "ConfLowP", "ConfUppP", "SIndex")]
And the Java output:
vancouver_java_df[differences, c("CTUID", "PctBsePts","PctTstPts","ConfLowP","ConfUppP","SIndex")]
The confidence intervals in the R output are just slightly different, leading to different conclusions for these areas: the PctBsePts values fall within the interval between ConfLowP and ConfUppP in the R output (indicating stability), whereas they fall just outside the corresponding interval in the Java output (indicating change).
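The decision rule implied by the text can be sketched as follows (classify() is a hypothetical helper written for illustration, not part of the sppt package):

```r
# An area counts as stable when the base percentage falls inside the
# simulated confidence interval of the test percentage, and as changed
# when it falls outside.
classify <- function(pct_base, conf_low, conf_upp) {
  ifelse(pct_base >= conf_low & pct_base <= conf_upp, "stable", "change")
}
classify(10.0, 9.5, 12.0)   # base % inside the interval  -> "stable"
classify(10.0, 10.5, 12.0)  # base % below the interval   -> "change"
```

A tiny shift in the interval bounds, as between the R and Java runs above, can therefore flip an area from one conclusion to the other.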
Because the SPPT is a simulation-based test, the results can be slightly different every time. The R function is reproducible by using set.seed(), but the Java application is not. If this is a reasonable explanation for the difference, then we should be able to find the exact ConfLowP and ConfUppP values reported by the Java application in the R output, as long as we replicate the sppt() function often enough.
Let's try 10 replications of the sppt() function (note that the defaults of the sppt() function are 200 iterations, 85% sample size, and a 95% confidence interval, so we don't have to include these in the function call):
set.seed(85335)
reps.output <- replicate(10, sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp))
reps.output <- lapply(c("ConfLowP", "ConfUppP"),
                      function(y) do.call(cbind, sapply(reps.output, function(x) x@data[y])))
reps.output is a list whose first element refers to ConfLowP, consisting of 108 rows (one for each area) and 10 columns (one for each replication). The second list element is of equal size, but refers to ConfUppP. As an example, the following 10 possible ConfLowP values were calculated for the 52nd area:
as.numeric(reps.output[[1]][52, ])
So the question is: can the Java application's ConfLowP for the first area be found in one of the ten potential outcomes (for the first area) of the SPPT R function?
vancouver_java_df$ConfLowP[1] %in% reps.output[[1]][1, ]
Considering all areas, how many times can the 'ConfLowP' and 'ConfUppP' calculated by the Java application be found in at least one of the replicated outcomes?
# ConfLowP
table(sapply(1:nrow(vancouver_java_df), function(x) {
  vancouver_java_df$ConfLowP[x] %in% reps.output[[1]][x, ]
}))
# ConfUppP
table(sapply(1:nrow(vancouver_java_df), function(x) {
  vancouver_java_df$ConfUppP[x] %in% reps.output[[2]][x, ]
}))
Ten replications are enough to reproduce many of the ConfLowP and ConfUppP values of the Java application exactly. Let's run the same check with 80 replications:
set.seed(85335)
reps.output <- replicate(80, sppt(vancouver_points1.sp, vancouver_points2.sp, vancouver_areas.sp))  # this will take some time
reps.output <- lapply(c("ConfLowP", "ConfUppP"),
                      function(y) do.call(cbind, sapply(reps.output, function(x) x@data[y])))
# ConfLowP
table(sapply(1:nrow(vancouver_java_df), function(x) {
  vancouver_java_df$ConfLowP[x] %in% reps.output[[1]][x, ]
}))
# ConfUppP
table(sapply(1:nrow(vancouver_java_df), function(x) {
  vancouver_java_df$ConfUppP[x] %in% reps.output[[2]][x, ]
}))
Now all of the Java application's output is reproduced exactly by the R sppt() function. We conclude that, just by chance, the R output of a single function call can differ from the Java output, but this is to be expected because the SPPT is a simulation-based test.