knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff's (1997) circular spatial scan by searching jointly over circular spatial zones $z$ and tree branches $g$, returning the (zone, branch) pair that maximises the likelihood ratio.
The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause-area combinations that pure spatial or pure tree scans would miss.
The package provides:
| Function | Purpose |
|:---------|:--------|
| treespatial_scan() | Tree-spatial scan (main entry point) |
| circular_scan() | Kulldorff's circular spatial scan (tree-free) |
| tree_scan() | Tree-based scan (space-free) |
| iterative_scan() | Sequential conditional scan with Holm–Bonferroni correction |
| filter_clusters() | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 |
| get_cluster_regions() | Tidy region-by-cluster table for plotting |
| aggregate_tree() | Roll counts up the tree |
All scans accept the Poisson and Binomial models of the original paper
and run the Monte Carlo step under OpenMP via the n_cores argument.
The package ships four example datasets:
| Dataset | Country | Domain | Regions | Tree |
|:--------|:--------|:-------|:--------|:-----|
| rj_mortality + rj_tree | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) |
| chicago_crimes + chicago_tree | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) |
| fl_deaths | USA | Public health | 65 counties | (built by user from raw data) |
| london_collisions + london_tree | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |
This vignette walks through rj_mortality (reproducing the published
analysis) and chicago_crimes (a different-domain illustration), then
shows how to use iterative_scan() for sequential cluster detection.
The Florida and London examples are shipped in inst/examples/ and
discussed in the companion paper.
This reproduces Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format.
library(treeSS) data(rj_mortality) data(rj_tree) str(rj_mortality, max = 1) #> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code, #> name, live_births, x, y, node_id, cases, ...
result_rj <- treespatial_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_pop_pct = 0.50, nsim = 999, seed = 2016, n_cores = 4L ) print(result_rj)
Running this produces the most likely cluster P209 with 18
municipalities, log-likelihood ratio $LR \approx 38.83$, $p < 0.001$,
27 observed deaths against 3.34 expected — closely matching the
published Table 7 ($LR = 38.48$, 18 municipalities, 27 deaths,
3.37 expected). The minor LR difference comes from a SIM database
update between the paper's analysis and the dataset shipped here;
the conclusions are identical.
To extract more than one cluster from a single scan, the paper's
Section 5.1.1 specifies a filtering rule: a candidate (zone, branch)
pair is retained only if its branch is disjoint from every previously
retained branch (no ancestor/descendant relation), or its zone is
disjoint from every previously retained zone (no region in common).
This is implemented in filter_clusters():
filter_clusters(result_rj)[1:5, c("node_id", "n_regions", "cases", "expected", "llr")]
For visualisation, get_cluster_regions() returns a tidy data.frame
ready to merge with a polygon layer:
# Most likely cluster only (single map) cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE) # Top-2 distinct clusters (faceted map) cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)
A complete plotting workflow with geobr and ggplot2:
library(ggplot2) library(geobr) library(sf) mun <- read_municipality(code_muni = "RJ", year = 2016) mun$code6 <- as.integer(substr(mun$code_muni, 1, 6)) cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE), unique(rj_mortality[, c("region_id", "ibge_code")]), by = "region_id") mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code") ggplot(mun_facet) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"), na.value = "gray95", na.translate = FALSE) + facet_wrap(~ panel, nrow = 1) + theme_minimal() + theme(legend.position = "none")
A complete worked example with detailed annotation is shipped at
system.file("examples", "example_brazil_rj.R", package = "treeSS").
The Chicago dataset illustrates the package on a non-medical domain with a much larger tree (2841 nodes covering crime type, description and location). It also demonstrates a recurring choice users face in surveillance applications: which population should be the denominator?
chicago_crimes ships with two denominator columns:
population — the total incidents in each community area. This
is a compositional denominator: it expresses each cell as a
share of the citywide incident count and answers the question
"which crime types over-occur in which areas, relative to the
citywide profile?"pop_residential — the residential population of each community
area (ACS 2020 5-year, aggregated to community areas by CMAP
Community Data Snapshots). This is the appropriate denominator
for an incidence-rate analysis (incidents per resident),
analogous to deaths per live birth in Example 1.The choice changes both the model and the interpretation. We use the incidence-rate denominator below.
data(chicago_crimes) data(chicago_tree) result_chi <- treespatial_scan( cases = chicago_crimes$cases, population = chicago_crimes$pop_residential, region_id = chicago_crimes$region_id, x = chicago_crimes$x, y = chicago_crimes$y, node_id = chicago_crimes$node_id, tree = chicago_tree, max_pop_pct = 0.25, nsim = 999, seed = 2023, n_cores = 4L ) print(result_chi)
The most likely cluster covers a contiguous group of community areas on Chicago's South Side (Englewood, West Englewood, Auburn Gresham, Roseland and neighbouring CCAs), driven by an over-occurrence of violent and property crimes against residents. The relative risk for this cluster is approximately 1.58 — a 58 % excess over the citywide rate.
chicago_crimes joins to the bundled chicago_map polygon layer via
the area_number column:
data(chicago_map) cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE), unique(chicago_crimes[, c("region_id", "area_number", "name")]), by = "region_id") chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number", all.x = TRUE) ggplot(chi) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95", name = "Cluster") + theme_minimal()
iterative_scan()filter_clusters() (used in Example 1) returns distinct clusters
extracted from a single run of the scan. An alternative — popular
in the spatial-epidemiology literature since Zhang, Assunção and
Kulldorff (2010) — is to run the scan, remove the cases assigned to
the detected cluster, and re-run on the residual data. Repeating this
yields a sequence of clusters whose subsequent components are
conditionally most likely, given that the prior clusters have been
explained away.
The function iterative_scan() implements this procedure. Note that
it is not part of Cançado et al. (2025); users seeking strict
fidelity to that paper should use filter_clusters(). The iterative
procedure is offered as a useful extension and is documented as such.
Because each iteration is a separate hypothesis test on data modified
by the previous iteration, the raw p-values overstate significance.
iterative_scan() collects the raw p-values from all iterations and
applies the Holm–Bonferroni correction at the end:
iter_rj <- iterative_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_iter = 5, alpha = 0.05, nsim = 999, seed = 2016, max_pop_pct = 0.50, n_cores = 4L ) print(iter_rj)
The returned clusters data.frame has one row per iteration with the
raw p-value, the Holm-adjusted p-value, and a significant flag:
iter_rj$clusters[, c("iteration", "node_id", "n_regions", "llr", "pvalue", "pvalue_adjusted", "significant")]
Mapping the iterative result uses the same S3 dispatch:
cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE), unique(rj_mortality[, c("region_id", "ibge_code")]), by = "region_id") mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code", all.x = TRUE) ggplot(mun_it) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0", "3" = "#55A868", "4" = "#8172B2", "5" = "#CCB974"), na.value = "gray95", na.translate = FALSE) + facet_wrap(~ panel) + theme_minimal() + theme(legend.position = "none")
When to use which:
filter_clusters() / get_cluster_regions(n_clusters = N)
when you want to follow Cançado et al. (2025) exactly.iterative_scan() when residual clustering is plausible and
you want a sequence of conditionally distinct clusters with
proper multiple-testing control.The Monte Carlo step (replicates of the null Poisson/Binomial counts
under proportional intensities) dominates the runtime. All scan
functions accept n_cores, which parallelises the Monte Carlo step
via OpenMP. On a 4-core laptop the Rio de Janeiro analysis runs in
under 20 seconds with nsim = 999; the Chicago analysis (much larger
tree) takes around two minutes.
For tasks where the case counts are concentrated near the leaves and
the tree is deep, aggregate_tree() can be used to compute the
internal-node sums once and reuse them across calls.
The inst/examples/ directory of the installed package contains
fully annotated worked examples for all four datasets:
list.files(system.file("examples", package = "treeSS")) #> [1] "example_brazil_rj.R" "example_chicago.R" #> [2] "example_florida.R" "example_london.R"
The Florida script demonstrates the full workflow from raw long
data — building the ICD-10 tree from scratch, downloading county
polygons via tigris, assembling parallel input vectors and
plotting with ggplot2. The London script illustrates the use of
a non-medical tree (light × road × junction) on traffic collisions,
with an interactive leaflet map.
These two examples are also the basis of the detailed use cases in the companion R Journal paper.
Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.
Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323–331.
Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.