knitr::opts_chunk$set(fig.align = "center") knitr::opts_chunk$set(fig.height = 7, fig.width = 8, dpi = 90, out.width = '100%') knitr::opts_chunk$set(comment = "#>")
In this vignette we present additional features of the sfclust
package.
We begin by loading the required packages.
library(sfclust) library(stars) library(ggplot2) library(dplyr)
The simulated dataset used in this vignette, stgaus
, is included in our package. It is
a stars
object with one variable, y
, and two dimensions: geometry
and time
. The
dataset represents the number of cases in 100 regions, observed daily over 91 days,
starting in January 2024.
data("stgaus") stgaus
To gain some initial insights, we can aggregate the data weekly:
stweekly <- aggregate(stgaus, by = "week", FUN = mean) ggplot() + geom_stars(aes(fill = y), stweekly) + facet_wrap(~ time) + scale_fill_distiller(palette = "RdBu") + theme_bw()
We can also examine the trends for each region:
stgaus |> st_set_dimensions("geometry", values = 1:nrow(stgaus)) |> as_tibble() |> ggplot() + geom_line(aes(time, y, group = geometry, color = factor(geometry)), linewidth = 0.3) + theme_bw() + theme(legend.position = "none")
Some regions exhibit similar trends over time, but the overall patterns are more complex than polinomial functions. Our goal is to cluster these regions while accounting for spatial contiguity.
We assume that the response variable ( Y_{it} ) for region ( i ) at time ( t ) follows a normal distribution given the partition ( M ): $$ Y_{it} \mid \mu_{it}, M \stackrel{ind}{\sim} \text{Normal}(\mu_{it}, \sigma^2), $$ where the mean ( \mu_{it} ) is modeled based on the cluster assignment ( c_i ): $$ \mu_{it} = \alpha_{c_i} + f_{c_i,t}, $$ where ( \alpha_{c_i} ) is the intercept for cluster ( c_i ), and ( f_{c_i,t} ) is a latent random effect modeled as a random walk process. Specifically, we impose the condition: $$ f_{c_i,t} - f_{c_i,t-1} \sim N(0, \nu^{-1}), \quad \text{for } t = 2, \dots, n. $$
The prior for the hyperparameter ( \nu_c ) is defined as: $$ \log(\nu_c) \sim \text{Normal}(-2, 1). $$
sfclust
uses an undirected graph to represent connections between regions and proposes
spatial clusters using this graph through minimum spanning trees (MST). By default,
sfclust
accepts the argument graphdata
, which should include:
igraph
object representing spatial connections,membership
vector indicating the initial cluster assignments for each region.For simplicity, you can use the genclust
function to generate an initial random
partitioning with a specified number of clusters. In this example, we create a partition
with 20 clusters:
set.seed(123) initial_cluster <- genclust(st_geometry(stgaus), nclust = 20) names(initial_cluster)
Now, let's visualize how the regions were randomly clustered:
st_sf(st_geometry(stgaus), cluster = factor(initial_cluster$membership)) |> ggplot() + geom_sf(aes(fill = cluster)) + theme_bw()
sfclust
We will initiate the Bayesian clustering algorithm using our generated partition
(initial_cluster
). Additionally, we use the capabilites of INLA::inla
to define a
model that includes a random walk process in the linear predictor, and custom priors for
the scale hyperparameter. To begin, we will run the algorithm for only 50 iterations.
# result <- sfclust(stgaus, graphdata = initial_cluster, niter = 50, # formula = y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal", param = c(-2, 1)))), # path_save = file.path("inst", "vigdata", "full-gaussian-mcmc1.rds")) # system.time( # update(result, niter = 2000, path_save = file.path("inst", "vigdata", "full-gaussian-mcmc2.rds")) # ) # # user system elapsed # # 16010.960 11900.949 2794.096
# # Reduce size of object # result1 <- readRDS(file.path("inst", "vigdata", "full-gaussian-mcmc1.rds")) # result2 <- readRDS(file.path("inst", "vigdata", "full-gaussian-mcmc2.rds")) # pseudo_inla <- function(x) { # list( # summary.random = list(id = x$summary.random$id["mean"]), # summary.linear.predictor = x$summary.linear.predictor["mean"] # ) # } # result1$clust$models <- NULL # result2$clust$models <- lapply(result2$clust$models, pseudo_inla) # saveRDS(result1, file.path("inst", "vigdata", "gaussian-mcmc1.rds")) # saveRDS(result2, file.path("inst", "vigdata", "gaussian-mcmc2.rds"))
result <- sfclust(stgaus, graphdata = initial_cluster, niter = 50, formula = y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal", param = c(-2, 1))))) result
result <- readRDS(system.file("vigdata", "gaussian-mcmc1.rds", package = "sfclust")) result
The output indicates that after starting with 20 clusters, the algorithm created 11 new clusters (births), removed 6 clusters (deaths), changed the membership of 3 clusters, and modified the minimum spanning tree (MST) 3 times.
The plot
method allows us to select which graph to produce. For example, we can
visualize only the log marginal likelihood to diagnose convergence. With just 50
iterations, we can see that the log marginal likelihood has not yet achieved convergence.
plot(result, which = 2)
To achieve convergence, we can continue the sampling using the update
function and
specify the number of new iterations with the niter
argument:
result2 <- update(result, niter = 2000) result2
result2 <- readRDS(system.file("vigdata", "gaussian-mcmc2.rds", package = "sfclust")) result2
With 2000 additional iterations, there have been many clustering movements. Furthermore, when visualizing the results, we can observe that the log marginal likelihood has achieved convergence.
plot(result2, which = 2)
The final iteration indicates that the algorithm identified 10 clusters, with a log marginal likelihood of 12,708.35. The largest cluster consists of 27 regions, while the smallest contains only one region.
summary(result2, sort = TRUE)
Let's visualize the regions grouped by cluster.
stgaus$cluster <- fitted(result2, sort = TRUE)$cluster st_sf(st_geometry(stgaus), cluster = factor(stgaus$cluster)) |> ggplot() + geom_sf(aes(fill = cluster)) + theme_bw()
Let's visualize the original data grouped by cluster.
stgaus$cluster <- fitted(result2, sort = TRUE)$cluster stgaus |> st_set_dimensions("geometry", values = 1:nrow(stgaus)) |> as_tibble() |> ggplot() + geom_line(aes(time, y, group = geometry), linewidth = 0.3) + facet_wrap(~ cluster, ncol = 2) + theme_bw() + labs(title = "Risk per cluster", y = "Risk", x = "Time")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.