scan.test: Compute the scan test

Description

This function computes the spatial scan test for Bernoulli and multinomial categorical spatial processes and detects spatial clusters.

Usage

scan.test(formula = NULL, data = NULL, fx = NULL, coor = NULL, case = NULL,
nv = NULL, nsim = NULL, distr = NULL, windows = "circular", listw = NULL,
alternative = "High", minsize = 1, control = list())

Arguments

formula

a symbolic description of the factor (optional).

data

an (optional) data frame or an sf object containing the variable to test.

fx

a factor (optional).

coor

(optional) coordinates of observations.

case

Only for the Bernoulli distribution. The level of the factor that identifies the cases; the remaining observations are treated as non-cases, and the test contrasts cases versus non-cases.

nv

Maximum window size; the default is nv = N/2. The algorithm scans for clusters of geographic size between 1 and the upper limit (nv) defined by the user.

nsim

Number of permutations.

distr

distribution of the spatial process: "bernoulli" for two levels or "multinomial" for three or more levels.

windows

a string to select the type of window: "circular" (default), "elliptic" or "flexible" (the latter requires listw).

listw

only for flexible windows. A neighbours list (an object of class listw, nb or knn from spdep) or an adjacency matrix.

alternative

Only for Bernoulli spatial process. A character string specifying the type of cluster, must be one of "High" (default), "Both" or "Low".

minsize

Minimum number of observations inside the Most Likely Cluster (MLC) and the secondary clusters.

control

List of additional control arguments.

Details

Two alternative sets of arguments can be included in this function to compute the scan test:

  • Option 1: A factor (fx) and coordinates (coor).

  • Option 2: An sf object (data) and the formula that specifies the factor. The function uses the coordinates of the centroids of the elements of the sf object.

Spatial scan statistics are widely used in epidemiology, criminology and ecology. Their purpose is to analyse the spatial distribution of points or geographical regions by testing the hypothesis of spatial randomness under different distributions (e.g. Bernoulli, Poisson or Normal). The scan.test function obtains the scan statistic for two distributions relevant to categorical variables: the Bernoulli and the multinomial.
The spatial scan statistic is based on the likelihood ratio test statistic and is formulated as follows:

Δ = { \max_{z \in Z,H_A} L(θ|z) \over \max_{z \in Z,H_0} L(θ|z)}

where Z represents the collection of scanning windows constructed on the study region, H_A is the alternative hypothesis, H_0 is the null hypothesis, and L(θ|z) is the likelihood function with parameter θ given window z. The null hypothesis says that there is no spatial clustering in the study region, and the alternative hypothesis is that there is a certain area with high (or low) rates of the outcome variable. The null and alternative hypotheses and the likelihood function may be expressed in different ways depending on the probability model under consideration.
When testing independence in a spatial process, the type of window is irrelevant under the null, but under the alternative elliptic windows can identify the cluster more precisely.

For large data sets (large N), windows = "elliptic" can be very slow.

Bernoulli version

When we have dichotomous outcome variables, such as cases and noncases of certain diseases, the Bernoulli model is used. The null hypothesis is written as

H_0 : p = q \ \ for \ \ all \ \ Z

and the alternative hypothesis is

H_A : p \neq q \ \ for \ \ some \ \ Z

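where p and q are the outcome probabilities (e.g., the probability of being a case) inside and outside scanning window z, respectively. Taking the maximum over all windows, the test statistic is:

Δ = \max_{z \in Z} { ({c_z \over n_z})^{c_z} (1 - {c_z \over n_z})^{n_z - c_z} ({C - c_z \over N - n_z})^{C - c_z} (1 - {C - c_z \over N - n_z})^{(N - n_z) - (C - c_z)} \over ({C \over N})^{C} (1 - {C \over N})^{N - C} }

where c_z and n_z are the numbers of cases and observations (cases and noncases) within z, respectively, and C and N are the total numbers of cases and observations in the whole study region, respectively.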
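
As a minimal sketch of how the Bernoulli statistic can be evaluated for a single window, the base R function below computes the log of the likelihood ratio from the four counts defined above (cases and observations inside the window, and the totals C and N). The function name bernoulli_llr and the helper xlogp are hypothetical and are not part of spqdep; the code assumes the usual Bernoulli likelihood with rates estimated by the observed proportions, and it may differ from the package internals.

```r
# Log-likelihood ratio of one candidate window under the Bernoulli model.
# cz, nz: cases and observations inside the window; C, N: study-wide totals.
# (Hypothetical helper for illustration, not a spqdep function.)
bernoulli_llr <- function(cz, nz, C, N) {
  # x * log(p), using the convention 0 * log(0) = 0
  xlogp <- function(x, p) ifelse(x == 0, 0, x * log(p))
  p  <- cz / nz               # case rate inside the window
  q  <- (C - cz) / (N - nz)   # case rate outside the window
  p0 <- C / N                 # common case rate under H_0
  loglik_alt <- xlogp(cz, p) + xlogp(nz - cz, 1 - p) +
    xlogp(C - cz, q) + xlogp((N - nz) - (C - cz), 1 - q)
  loglik_null <- xlogp(C, p0) + xlogp(N - C, 1 - p0)
  loglik_alt - loglik_null
}

# A window with the same case rate as the rest of the region gives 0;
# a window with an excess of cases gives a positive value.
llr_equal  <- bernoulli_llr(cz = 10, nz = 20, C = 50, N = 100)
llr_excess <- bernoulli_llr(cz = 15, nz = 20, C = 50, N = 100)
```

The scan statistic would then be the largest such value over all candidate windows. Restricting the search to windows with a higher (or lower) rate inside than outside is one plausible way to honour the alternative argument, but that is an assumption about the implementation.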

Multinomial version of the scan test

The multinomial version of the spatial scan statistic is useful to investigate clustering when a discrete spatial variable can take one and only one of k possible outcomes that lack intrinsic order information. If the region defined by the moving window is denoted by Z, the null hypothesis for the statistic can be stated as follows:

H_0: p_1 = q_1;p_2 = q_2;...;p_k = q_k

where p_j is the probability of being of event type j inside the window Z, and q_j is the probability of being of event type j outside the window. The alternative hypothesis is that, for at least one event type, the probability of being of that type differs inside and outside the window.

The statistic is built as a likelihood ratio, and takes the following form after transformation using the natural logarithm:

Δ = \max_Z \{∑_j \{ S_j^Z log({ S_j^Z \over S^Z }) + (S_j-S_j^Z) log({ {S_j-S_j^Z} \over {S-S^Z} })\}\}-∑_j S_j log({ S_j \over S })

where S is the total number of events in the study area and S_j is the total number of events of type j. The superscript Z denotes the same but for the sub-region defined by the moving window.
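
The expression inside the maximum can be sketched for a single window in base R as follows. The function name multinomial_llr is hypothetical (it is not exported by spqdep); s_in stands for the counts S_j^Z inside the window and s_all for the totals S_j, and terms with a zero count are taken to contribute zero.

```r
# Log-likelihood ratio of one candidate window under the multinomial model.
# s_in:  counts per category inside the window (S_j^Z)
# s_all: counts per category in the whole study area (S_j)
# (Hypothetical helper for illustration, not a spqdep function.)
multinomial_llr <- function(s_in, s_all) {
  # x * log(p), using the convention 0 * log(0) = 0
  xlogp <- function(x, p) ifelse(x == 0, 0, x * log(p))
  s_out <- s_all - s_in       # counts per category outside the window
  sum(xlogp(s_in, s_in / sum(s_in))) +
    sum(xlogp(s_out, s_out / sum(s_out))) -
    sum(xlogp(s_all, s_all / sum(s_all)))
}

# Same category mix inside and outside the window: the ratio is 0.
llr_same <- multinomial_llr(s_in = c(2, 2, 2), s_all = c(10, 10, 10))
# A window made up of a single category: the ratio is positive.
llr_conc <- multinomial_llr(s_in = c(6, 0, 0), s_all = c(10, 10, 10))
```

For a factor fx and a vector idx of observations in the window, the two arguments could be obtained as table(fx[idx]) and table(fx).
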
The theoretical distribution of the statistic under the null hypothesis is not known, and therefore significance is evaluated numerically by simulating neutral landscapes (obtained using a random spatial process) and contrasting the empirically calculated statistic against the frequency of values obtained from the neutral landscapes. The results of the likelihood ratio serve to identify the most likely cluster, which is followed by secondary clusters by the expedient of sorting them according to the magnitude of the test. As usual, significance is assigned by the analyst, and the cutoff value for significance reflects the confidence of the analyst, or tolerance for error.
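
A minimal sketch of the numerical significance step, assuming the common Monte Carlo convention in which the observed statistic is ranked among the simulated ones and counted once itself. The function name mc_pvalue is hypothetical, and the exact rule used inside scan.test is not stated in this page, so the formula below is an assumption.

```r
# Monte Carlo p-value: share of statistics (the observed one included)
# that are at least as large as the observed statistic.
# (Hypothetical helper; the (1 + count) / (nsim + 1) rule is an assumption.)
mc_pvalue <- function(observed, simulated) {
  (1 + sum(simulated >= observed)) / (length(simulated) + 1)
}

# Toy values standing in for the statistics of the permuted data sets
p_toy <- mc_pvalue(observed = 5, simulated = c(1, 2, 3, 6))
# With 99 simulations, none of which reaches the observed value
p_min <- mc_pvalue(observed = 100, simulated = 1:99)
```

Under this convention the smallest attainable p-value is 1 / (nsim + 1), which is one reason to raise nsim when a finer resolution is needed.
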
When implementing the statistic, the analyst must decide the shape of the window and the maximum number of cases that any given window can cover. Currently, analysis can be done using circular or elliptical windows.
Elliptical windows are more time consuming to evaluate but provide greater flexibility to contrast the distribution of events inside and outside the window, and are our selected shape in the analyses to follow. Furthermore, it is recommended that the maximum number of cases entering any given window does not exceed 50% of all available cases.
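
To make the idea of a circular window concrete, the sketch below builds nested windows in base R: each observation acts as a centre, and the window of size k holds the k observations closest to it (the centre included). The function name circular_windows is hypothetical, and this is an illustration of the general technique under Euclidean distance, not the construction used inside spqdep.

```r
# Nested circular windows from a two-column coordinate matrix.
# Row i of the result lists the nv observations nearest to centre i,
# nearest first, so the window of size k for centre i is nn[i, 1:k].
# (Hypothetical helper for illustration, not a spqdep function.)
circular_windows <- function(coor, nv) {
  d <- as.matrix(dist(coor))  # pairwise Euclidean distances
  do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
    order(d[i, ])[seq_len(nv)]
  }))
}

# Four points on a line; the last one is far from the others
coor <- cbind(c(0, 1, 2, 10), c(0, 0, 0, 0))
nn <- circular_windows(coor, nv = 2)
```

Capping nv at half of the observations follows the recommendation above that a window should not cover more than 50% of the cases.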

Value

An object of class htest and scantest with the following components:

method              The type of test applied.
fx                  Factor included as input to get the scan test.
MLC                 Observations included in the Most Likely Cluster (MLC).
statistic           Value of the scan test (maximum log-likelihood ratio).
N                   Total number of observations.
nn                  Windows used to get the cluster.
nv                  Maximum number of observations in the cluster.
data.name           A character string giving the name of the factor.
coor                Coordinates.
alternative         Only for the Bernoulli spatial process. A character string describing the alternative hypothesis selected by the user.
p.value             p-value of the scan test.
cases.expect        Expected cases in the MLC.
cases.observ        Observed cases in the MLC.
nsim                Number of permutations.
scan.mc             A (nsim x 1) vector with the log-likelihood ratio values under bootstrap permutation.
secondary.clusters  A list with the observations included in the secondary clusters.
loglik.second       A vector with the values of the secondary scan tests (maximum log-likelihood ratio).
p.value.secondary   A vector with the p-values of the secondary scan tests.
Alternative.MLC     A vector with the observations included in another cluster with the same log-likelihood ratio as the MLC.

Control arguments

seedinit Numerical value for the seed (only for boot version). Default value seedinit=123

Author(s)

Fernando López fernando.lopez@upct.es
Román Mínguez roman.minguez@uclm.es
Antonio Páez paezha@gmail.com
Manuel Ruiz manuel.ruiz@upct.es

References

  • Kulldorff M, Nagarwalla N. (1995). Spatial disease clusters: Detection and Inference. Statistics in Medicine. 14:799-810

  • Jung I, Kulldorff M, Richard OJ (2010). A spatial scan statistic for multinomial data. Statistics in Medicine. 29(18), 1910-1918

  • Páez, A., López-Hernández, F.A., Ortega-García, J.A., Ruiz, M. (2016). Clustering and co-occurrence of cancer types: A comparison of techniques with an application to pediatric cancer in Murcia, Spain. Spatial Analysis in Health Geography, 69-90.

  • Tango T., Takahashi K. (2005). A flexibly shaped spatial scan statistic for detecting clusters, International Journal of Health Geographics 4:11.

See Also

local.sp.runs.test, dgp.spq, Q.test

Examples


# Case 1: Scan test bernoulli
data(provinces_spain)
sf::sf_use_s2(FALSE)
provinces_spain$Male2Female <- factor(provinces_spain$Male2Female > 100)
levels(provinces_spain$Male2Female) = c("men","woman")
formula <- ~ Male2Female
scan <- scan.test(formula = formula, data = provinces_spain, case="men",
nsim = 99, distr = "bernoulli")
print(scan)
summary(scan)
plot(scan, sf = provinces_spain)

## Same test with "woman" as the case (nv left at its default, N/2)
scan <- scan.test(formula = formula, data = provinces_spain, case = "woman",
nsim = 99, distr = "bernoulli")
print(scan)
plot(scan, sf = provinces_spain)


## With elliptic windows
scan <- scan.test(formula = formula, data = provinces_spain, case = "men", nv = 25,
nsim = 99, distr = "bernoulli", windows ="elliptic")
print(scan)
scan <- scan.test(formula = formula, data = provinces_spain, case = "men", nv = 15,
nsim = 99, distr = "bernoulli", windows ="elliptic", alternative = "Low")
print(scan)
plot(scan, sf = provinces_spain)

# Case 2: scan test multinomial
data(provinces_spain)
provinces_spain$Older <- cut(provinces_spain$Older, breaks = c(-Inf,19,22.5,Inf))
levels(provinces_spain$Older) = c("low","middle","high")
formula <- ~ Older
scan <- scan.test(formula = formula, data = provinces_spain, nsim = 99, distr = "multinomial")
print(scan)
plot(scan, sf = provinces_spain)

# Case 3: scan test multinomial
data(FastFood.sf)
sf::sf_use_s2(FALSE)
formula <- ~ Type
scan <- scan.test(formula = formula, data = FastFood.sf, nsim = 99,
distr = "multinomial", windows="elliptic", nv = 50)
print(scan)
summary(scan)
plot(scan, sf = FastFood.sf)

# Case 4: DGP two categories
N <- 150
cx <- runif(N)
cy <- runif(N)
listw <- spdep::knearneigh(cbind(cx,cy), k = 10)
p <- c(1/2,1/2)
rho <- 0.5
fx <- dgp.spq(p = p, listw = listw, rho = rho)
scan <- scan.test(fx = fx, nsim = 99, case = "A", nv = 50, coor = cbind(cx,cy),
distr = "bernoulli",windows="circular")
print(scan)
plot(scan)

# Case 5: DGP three categories
N <- 200
cx <- runif(N)
cy <- runif(N)
listw <- spdep::knearneigh(cbind(cx,cy), k = 10)
p <- c(1/3,1/3,1/3)
rho <- 0.5
fx <- dgp.spq(p = p, listw = listw, rho = rho)
scan <- scan.test(fx = fx, nsim = 19, coor = cbind(cx,cy), nv = 30,
distr = "multinomial", windows = "elliptic")
print(scan)
plot(scan)

# Case 6: Flexible windows
data(provinces_spain)
sf::sf_use_s2(FALSE)
provinces_spain$Male2Female <- factor(provinces_spain$Male2Female > 100)
levels(provinces_spain$Male2Female) = c("men","woman")
formula <- ~ Male2Female
listw <- spdep::poly2nb(provinces_spain, queen = FALSE)
scan <- scan.test(formula = formula, data = provinces_spain, case="men", listw = listw, nv = 6,
                  nsim = 99, distr = "bernoulli", windows = "flexible")
print(scan)
summary(scan)
plot(scan, sf = provinces_spain)



spqdep documentation built on March 28, 2022, 5:06 p.m.