set.seed(27)
library(silverblaze)
plasma <- c("#F0F921FF", "#FDC926FF", "#FA9E3BFF", "#ED7953FF", "#D8576BFF",
            "#BD3786FF", "#9C179EFF", "#7301A8FF", "#47039FFF", "#0D0887FF")

Introduction

The following tutorial introduces the general structure required to run silverblaze when analysing prevalence data. The silverblaze workflow is to create one fundamental object, referred to in these tutorials as p, that describes the entire project. The structure of p consists of:

  1. The data used to run the model
  2. The parameter settings for the model
  3. The output of running the model

Following this, the user will learn how to plot and diagnose the output of the model.

Simulating data

Before simulating any data, we must define the sentinel site locations at which the data will be collected.

sentinel_lon <- seq(-0.2, 0.0, l = 12)
sentinel_lat <- seq(51.45, 51.55, l = 12)
sentinel_grid <- expand.grid(sentinel_lon, sentinel_lat)
names(sentinel_grid) <- c("longitude", "latitude")

Here we've chosen a lattice configuration, but the user may define any set of longitude/latitude locations they like.
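For instance, a handful of arbitrarily placed sites would work just as well. A minimal sketch (random_sites is a hypothetical name, not used elsewhere in this tutorial), scattering 50 sites uniformly over the same bounding box:

# hypothetical alternative: 50 sites scattered at random over the same area
random_sites <- data.frame(longitude = runif(50, min = -0.2, max = 0.0),
                           latitude  = runif(50, min = 51.45, max = 51.55))

Proceeding with the lattice of sentinel sites, we can now generate our data from a given model using sim_data().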

mysim <- sim_data(sentinel_grid$longitude,
                  sentinel_grid$latitude,
                  K = 3,
                  sigma_model = "single",
                  sigma_mean = 1,
                  sigma_var = 0.5,
                  expected_popsize = 300, 
                  data_type = "prevalence",
                  test_rate = 20)

true_source <- mysim$record$true_source

These arguments describe the underlying behaviour of the simulated data: K sets the number of sources, sigma_model together with sigma_mean and sigma_var controls the dispersal parameter sigma around each source, expected_popsize sets the expected population size, data_type selects the prevalence (binomial) model, and test_rate governs the rate at which individuals are tested at each site.

Let's view our prevalence data.

head(mysim$data)

Here we see the general structure of the data required to run the binomial model: a collection of trial sites, the number of individuals tested at each location, and the number testing positive. This mysim$data object is the only piece of information the user needs in order to run the model; everything else in mysim is simply a by-product of simulating the data in the first place.
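When you come to analyse real data, you therefore only need to assemble a data frame in this same format. A hedged sketch: the column names below are an assumption based on the description above, so match them to whatever head(mysim$data) actually prints.

# hypothetical hand-built prevalence data
# (column names assumed -- compare with head(mysim$data))
my_data <- data.frame(longitude = c(-0.15, -0.10, -0.05),
                      latitude  = c(51.47, 51.50, 51.53),
                      tested    = c(100, 80, 120),
                      positive  = c(12, 3, 25))

Such a data frame would then be bound to a project exactly as shown in the next section.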

Creating a project

We initialise a blank project p using the rgeoprofile_project() function.

p <- rgeoprofile_project()
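The three components listed in the introduction (data, parameter settings, output) live inside this object. A quick, package-agnostic way to glance at its top-level structure (the exact element names are whatever the package defines):

str(p, max.level = 1)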

We then add in our simulated data using bind_data().

p <- bind_data(p,
               df = mysim$data,
               data_type = "prevalence")

Simple spatial prior

We wish to impose a spatial prior on our model. Here we implement a very simple one by instructing the model to restrict the area of interest to that containing our sentinel sites, plus some guard rail (the guard_rail argument). We can produce a uniform spatial prior with raster_grid().

uniform_prior <- raster_grid(range_lon = c(-0.2, 0), range_lat = c(51.45, 51.55),
                             cells_lon = 100, cells_lat = 100, guard_rail = 0.1)
spatialProject <- p
spatialProject <- new_set(project = spatialProject,
                          spatial_prior = uniform_prior)

plot_spatial <- plot_map()
plot_spatial <- overlay_spatial_prior(plot_spatial, spatialProject, col = "red", opacity = 0.2)
plot_spatial

Parameter sets

Next we define the priors on the variables we wish to estimate when running the model. We define these parameters via new_set(). Notice that the project p is one of the function's arguments, so that the parameter set we define is added to our project.

p <- new_set(project = p,
             spatial_prior = uniform_prior,
             sigma_model = "single",
             sigma_prior_mean = 1,
             sigma_prior_sd = 1,
             expected_popsize_prior_mean = 300,
             expected_popsize_prior_sd = 30,
             name = "Tutorial Parameters")
p

The project p is now taking shape, as we have defined both the data and the parameters for the model. Each time new_set() is run it adds a new set of parameters to p$parameters_sets and regards this new set as the active_set. To delete an old parameter set, we use delete_set(). The output object within p is empty right now, but will be populated once the model has run.
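For example, we could add a second, more diffuse parameter set and then delete it again. This block is illustrative only and is not part of the tutorial pipeline; in particular, the delete_set() arguments are an assumption, so check ?delete_set for the exact signature.

# illustrative only -- not run as part of this tutorial
p2 <- new_set(project = p,
              spatial_prior = uniform_prior,
              sigma_model = "single",
              sigma_prior_mean = 2,
              sigma_prior_sd = 2,
              expected_popsize_prior_mean = 300,
              expected_popsize_prior_sd = 30,
              name = "Diffuse parameters")

# remove the set we just added (argument names assumed; see ?delete_set)
p2 <- delete_set(p2, set = 2)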

Running the model

We are now ready to run the MCMC algorithm on our data. The arguments of run_mcmc() are similar to those of many other MCMC implementations. The algorithm starts in the burn-in phase, checking for convergence every converge_test iterations, and moves into the sampling phase once the conditions for convergence have been met or the maximum number of burn-in iterations has been reached. The argument K governs how many sources the model should search for. Although we simulated the data, and therefore know the true underlying number of sources, this will rarely be the case with a real-world data set; hence K can take a single value or a sequence of values. We choose 1:5, as our data was generated with three sources. The pb_markdown = TRUE argument prints a neater version of the console output below; you should remove this argument when running the code yourself.

p <- run_mcmc(project = p,
              K = 1:5,
              burnin = 1e4,
              samples = 5e4,
              converge_test = 5e3,
              auto_converge = TRUE,
              pb_markdown = TRUE)
# load a pre-run version of this project from the package so that the
# output shown here matches across tutorials
p <- rgeoprofile_file("tutorial2_project.rds")

Before plotting any geoprofiles, let's make sure the MCMC is indeed mixing well.

Diagnosing the model

First we assess the MCMC by producing diagnostic plots of the log-likelihood. This can be done for each value of K to ensure the algorithm is behaving properly.

plot_loglike_diagnostic(p, K = 3)
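Since we ran the model for K = 1 to 5, it is worth checking each value in turn; we wrap the call in print() in case the plotting function returns its plot rather than drawing it:

# check mixing for every value of K that was run
for (K_val in 1:5) {
  print(plot_loglike_diagnostic(p, K = K_val))
}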

Another useful metric for assessing the MCMC is the effective sample size (ESS). If we have 10,000 sampling iterations but an ESS of 80, then we really only have 80 independent samples from the posterior, and so we should run the chain for longer.

get_output(p, name = "ESS", K = 3, type = "summary")
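The idea behind ESS can be illustrated outside silverblaze: the coda package estimates the effective sample size of any vector of draws by discounting autocorrelation. A self-contained sketch:

library(coda)

# a strongly autocorrelated AR(1) chain of 10,000 draws
draws <- as.numeric(arima.sim(model = list(ar = 0.95), n = 1e4))

# nominally 10,000 samples, but the effective sample size is far smaller
effectiveSize(as.mcmc(draws))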

Results

Mapping

Now the model has run, we can start to visualise different aspects of the project. Plotting in silverblaze follows a layered structure similar to Leaflet and ggplot2. We start with a base layer

plot1 <- plot_map()

and overlay different parts of our project using the set of overlay functions. Let's start with the underlying source locations and our spatial prior.

plot2 <- overlay_spatial_prior(plot1, p, col = "red", opacity = 0.2)
plot2 <- overlay_sources(plot2, true_source$longitude, true_source$latitude)
plot2

This isn't too interesting yet, so let's add to this the trial site data.

plot3 <- overlay_trial_sites(plot2, 
                             project = p, 
                             fill_opacity = 0.5,
                             fill = TRUE, 
                             fill_colour = c(grey(0.7), "red"),
                             border = c(FALSE, TRUE), 
                             border_colour = "black", 
                             border_weight = 0.5)
plot3

We can see the proportions of positive and negative tests at each site; to see the exact values, try clicking on one of the pie charts. Finally, we add the geoprofile produced by running the MCMC.

plot4 <- overlay_geoprofile(plot3, 
                            project = p, 
                            threshold = 0.1, 
                            opacity = 0.8, 
                            K = 3, 
                            col = plasma, 
                            legend = TRUE)
plot4

Other parameter estimates

We can also plot the posterior 95% credible intervals of sigma. Under the "single" sigma model, a single value of sigma is shared across all sources.

plot_sigma(p, K = 3)

The same can be done for the expected population size.

plot_expected_popsize(p, K = 3)

The model-testing metric used consistently throughout the geographic profiling literature is the hitscore. This is calculated as the area searched before finding a source, divided by the total search area (where a search starts at the top of the geoprofile and works downwards). Hence a lower hitscore indicates a better-performing model. We use get_hitscores() to produce these.

hs <- get_hitscores(p, true_source$longitude, true_source$latitude)
hs
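To make this definition concrete, here is a toy version of the calculation on a bare surface of geoprofile values. This is an illustration of the idea only, not the package's internal code:

# toy hitscore: rank cells from highest to lowest geoprofile value, then
# report the fraction of the area searched before reaching the source cell
toy_hitscore <- function(surface, source_cell) {
  search_order <- order(surface, decreasing = TRUE)
  which(search_order == source_cell) / length(surface)
}

surface <- matrix(runif(100), nrow = 10)
toy_hitscore(surface, source_cell = which.max(surface))  # best case: 0.01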

Another common metric for this analysis is the Gini coefficient, derived from a Lorenz plot.

plot_lorenz(hs)

The plot_lorenz() function returns a graph describing the number of sources found as a function of area searched. The Gini coefficient is then calculated as the area under these piecewise curves.

gini(hs)
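As a conceptual sketch of that calculation (not silverblaze's implementation): sort the hitscores, treat the cumulative fraction of sources found as a step function of area searched, and take the area under that step function.

# toy Gini-style summary from a vector of hitscores -- conceptual only
toy_gini <- function(hitscores) {
  hs <- sort(hitscores)
  found <- seq_along(hs) / length(hs)   # fraction of sources found so far
  sum(diff(c(hs, 1)) * found)           # area under the step curve
}
toy_gini(c(0.05, 0.10, 0.30))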

Both the gini() and get_hitscores() functions refer to a "ringsearch" strategy. This compares the model against a naive strategy in which you search radially outwards from each positive sentinel site until sources are found, giving a non-trivial baseline against which to compare hitscores.
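The ringsearch baseline can be sketched in the same spirit: score each grid cell by the distance to its nearest positive site, then search cells in increasing order of that distance. Planar distances are a crude approximation over an area this small, and again this is illustrative rather than the package's own code:

# toy ringsearch ordering: search cells closest to any positive site first
ringsearch_order <- function(grid_lon, grid_lat, pos_lon, pos_lat) {
  nearest <- sapply(seq_along(grid_lon), function(i) {
    min(sqrt((grid_lon[i] - pos_lon)^2 + (grid_lat[i] - pos_lat)^2))
  })
  order(nearest)  # cell indices in ringsearch order
}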
