Introduction to treeSS
In treeSS: Tree-Spatial Scan Statistic for Cluster Detection

knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = FALSE
)

Overview

The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff's (1997) circular spatial scan by searching jointly over circular spatial zones $z$ and tree branches $g$, returning the (zone, branch) pair that maximises the likelihood ratio.

The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause-area combinations that pure spatial or pure tree scans would miss.

The package provides:

| Function | Purpose | |:---------|:--------| | treespatial_scan() | Tree-spatial scan (main entry point) | | circular_scan() | Kulldorff's circular spatial scan (tree-free) | | tree_scan() | Tree-based scan (space-free) | | iterative_scan() | Sequential conditional scan with Holm–Bonferroni correction | | filter_clusters() | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 | | get_cluster_regions() | Tidy region-by-cluster table for plotting | | aggregate_tree() | Roll counts up the tree |

All scans accept the Poisson and Binomial models of the original paper and run the Monte Carlo step under OpenMP via the n_cores argument.

The package ships four example datasets:

| Dataset | Country | Domain | Regions | Tree | |:--------|:--------|:-------|:--------|:-----| | rj_mortality + rj_tree | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) | | chicago_crimes + chicago_tree | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) | | fl_deaths | USA | Public health | 65 counties | (built by user from raw data) | | london_collisions + london_tree | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |

This vignette walks through rj_mortality (reproducing the published analysis) and chicago_crimes (a different-domain illustration), then shows how to use iterative_scan() for sequential cluster detection. The Florida and London examples are shipped in inst/examples/ and discussed in the companion paper.

Example 1 — Infant mortality in Rio de Janeiro

This reproduces Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format.

Run the scan

library(treeSS)
data(rj_mortality)
data(rj_tree)

str(rj_mortality, max = 1)
#> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code,
#>   name, live_births, x, y, node_id, cases, ...

result_rj <- treespatial_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_rj)

Running this produces the most likely cluster P209 with 18 municipalities, log-likelihood ratio $LR \approx 38.83$, $p < 0.001$, 27 observed deaths against 3.34 expected — closely matching the published Table 7 ($LR = 38.48$, 18 municipalities, 27 deaths, 3.37 expected). The minor LR difference comes from a SIM database update between the paper's analysis and the dataset shipped here; the conclusions are identical.

Multiple distinct clusters (paper-faithful)

To extract more than one cluster from a single scan, the paper's Section 5.1.1 specifies a filtering rule: a candidate (zone, branch) pair is retained only if its branch is disjoint from every previously retained branch (no ancestor/descendant relation), or its zone is disjoint from every previously retained zone (no region in common). This is implemented in filter_clusters():

filter_clusters(result_rj)[1:5, c("node_id", "n_regions",
                                   "cases", "expected", "llr")]

For visualisation, get_cluster_regions() returns a tidy data.frame ready to merge with a polygon layer:

# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)

# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)

A complete plotting workflow with geobr and ggplot2:

library(ggplot2)
library(geobr)
library(sf)

mun <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))

cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE),
             unique(rj_mortality[, c("region_id", "ibge_code")]),
             by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")

ggplot(mun_facet) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")

A complete worked example with detailed annotation is shipped at system.file("examples", "example_brazil_rj.R", package = "treeSS").

Example 2 — Crime in Chicago

The Chicago dataset illustrates the package on a non-medical domain with a much larger tree (2841 nodes covering crime type, description and location). It also demonstrates a recurring choice users face in surveillance applications: which population should be the denominator?

chicago_crimes ships with two denominator columns:

population — the total incidents in each community area. This is a compositional denominator: it expresses each cell as a share of the citywide incident count and answers the question "which crime types over-occur in which areas, relative to the citywide profile?"
pop_residential — the residential population of each community area (ACS 2020 5-year, aggregated to community areas by CMAP Community Data Snapshots). This is the appropriate denominator for an incidence-rate analysis (incidents per resident), analogous to deaths per live birth in Example 1.

The choice changes both the model and the interpretation. We use the incidence-rate denominator below.

Run the scan

data(chicago_crimes)
data(chicago_tree)

result_chi <- treespatial_scan(
  cases       = chicago_crimes$cases,
  population  = chicago_crimes$pop_residential,
  region_id   = chicago_crimes$region_id,
  x           = chicago_crimes$x,
  y           = chicago_crimes$y,
  node_id     = chicago_crimes$node_id,
  tree        = chicago_tree,
  max_pop_pct = 0.25,
  nsim        = 999, seed = 2023,
  n_cores     = 4L
)
print(result_chi)

The most likely cluster covers a contiguous group of community areas on Chicago's South Side (Englewood, West Englewood, Auburn Gresham, Roseland and neighbouring CCAs), driven by an over-occurrence of violent and property crimes against residents. The relative risk for this cluster is approximately 1.58 — a 58 % excess over the citywide rate.

Visualise the most likely cluster

chicago_crimes joins to the bundled chicago_map polygon layer via the area_number column:

data(chicago_map)

cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE),
             unique(chicago_crimes[, c("region_id", "area_number", "name")]),
             by = "region_id")
chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number",
              all.x = TRUE)

ggplot(chi) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95",
                    name = "Cluster") +
  theme_minimal()

Multi-cluster detection: `iterative_scan()`

filter_clusters() (used in Example 1) returns distinct clusters extracted from a single run of the scan. An alternative — popular in the spatial-epidemiology literature since Zhang, Assunção and Kulldorff (2010) — is to run the scan, remove the cases assigned to the detected cluster, and re-run on the residual data. Repeating this yields a sequence of clusters whose subsequent components are conditionally most likely, given that the prior clusters have been explained away.

The function iterative_scan() implements this procedure. Note that it is not part of Cançado et al. (2025); users seeking strict fidelity to that paper should use filter_clusters(). The iterative procedure is offered as a useful extension and is documented as such.

Because each iteration is a separate hypothesis test on data modified by the previous iteration, the raw p-values overstate significance. iterative_scan() collects the raw p-values from all iterations and applies the Holm–Bonferroni correction at the end:

iter_rj <- iterative_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_iter    = 5, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.50, n_cores = 4L
)
print(iter_rj)

The returned clusters data.frame has one row per iteration with the raw p-value, the Holm-adjusted p-value, and a significant flag:

iter_rj$clusters[, c("iteration", "node_id", "n_regions",
                      "llr", "pvalue", "pvalue_adjusted",
                      "significant")]

Mapping the iterative result uses the same S3 dispatch:

cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE),
               unique(rj_mortality[, c("region_id", "ibge_code")]),
               by = "region_id")
mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code",
                all.x = TRUE)

ggplot(mun_it) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0",
                                "3" = "#55A868", "4" = "#8172B2",
                                "5" = "#CCB974"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel) +
  theme_minimal() + theme(legend.position = "none")

When to use which:

Use filter_clusters() / get_cluster_regions(n_clusters = N) when you want to follow Cançado et al. (2025) exactly.
Use iterative_scan() when residual clustering is plausible and you want a sequence of conditionally distinct clusters with proper multiple-testing control.

Computational notes

The Monte Carlo step (replicates of the null Poisson/Binomial counts under proportional intensities) dominates the runtime. All scan functions accept n_cores, which parallelises the Monte Carlo step via OpenMP. On a 4-core laptop the Rio de Janeiro analysis runs in under 20 seconds with nsim = 999; the Chicago analysis (much larger tree) takes around two minutes.

For tasks where the case counts are concentrated near the leaves and the tree is deep, aggregate_tree() can be used to compute the internal-node sums once and reuse them across calls.

Other examples

The inst/examples/ directory of the installed package contains fully annotated worked examples for all four datasets:

list.files(system.file("examples", package = "treeSS"))
#> [1] "example_brazil_rj.R"  "example_chicago.R"
#> [2] "example_florida.R"    "example_london.R"

The Florida script demonstrates the full workflow from raw long data — building the ICD-10 tree from scratch, downloading county polygons via tigris, assembling parallel input vectors and plotting with ggplot2. The London script illustrates the use of a non-medical tree (light × road × junction) on traffic collisions, with an interactive leaflet map.

These two examples are also the basis of the detailed use cases in the companion R Journal paper.

References

Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.

Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323–331.

Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Any scripts or data that you put into this service are public.

treeSS documentation built on May 16, 2026, 1:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

treeSS
Tree-Spatial Scan Statistic for Cluster Detection

Introduction to treeSS
In treeSS: Tree-Spatial Scan Statistic for Cluster Detection

Overview

Example 1 — Infant mortality in Rio de Janeiro

Run the scan

Multiple distinct clusters (paper-faithful)

Example 2 — Crime in Chicago

Run the scan

Visualise the most likely cluster

Multi-cluster detection: `iterative_scan()`

Computational notes

Other examples

References

Try the treeSS package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

treeSS Tree-Spatial Scan Statistic for Cluster Detection

Introduction to treeSS In treeSS: Tree-Spatial Scan Statistic for Cluster Detection

Overview

Example 1 — Infant mortality in Rio de Janeiro

Run the scan

Multiple distinct clusters (paper-faithful)

Example 2 — Crime in Chicago

Run the scan

Visualise the most likely cluster

Multi-cluster detection: iterative_scan()

Computational notes

Other examples

References

Try the treeSS package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

treeSS
Tree-Spatial Scan Statistic for Cluster Detection

Introduction to treeSS
In treeSS: Tree-Spatial Scan Statistic for Cluster Detection

Multi-cluster detection: `iterative_scan()`