Q.test: A function to compute Q test for spatial qualitative data

Q.test R Documentation

Description

A function to compute Q test for spatial qualitative data.

Usage

Q.test(formula = NULL, data = NULL, na.action,
fx = NULL, coor = NULL, m = 3, r = 1, distr = "asymptotic",
control = list())

Arguments

formula

a symbolic description of the factor(s).

data

an (optional) data frame or an sf object with points/multipolygons geometry containing the variable(s) to be tested.

na.action

action to take with NA values.

fx

a factor or a matrix of factors in columns

coor

(optional) an N x 2 matrix of spatial coordinates (one row per observation, as in the Examples). Used when data is not a spatial object.

m

length of m-surrounding (default = 3).

r

only for the asymptotic distribution. Maximum overlap between any two m-surroundings (default = 1).

distr

character. Distribution type "asymptotic" (default) or "mc".

control

Optional argument. See the Control arguments section.

Details

The Q(m) statistic was introduced by Ruiz et al. (2010) as a tool to explore geographical co-location/co-occurrence of qualitative data. Consider a spatial variable X which is the result of a qualitative process with a set number of categorical outcomes a_j (j=1,...,k). The spatial variable is observed at a set of fixed locations indexed by their coordinates s_i (i=1,...,N), so that at each location s_i where an event is observed, X_i takes one of the possible values a_j.

Since the observations are georeferenced, a spatial embedding protocol can be devised to assess the spatial property of co-location. Let us define, for an observation at a specified location, say s_0, a surrounding of size m, called an m-surrounding.
The m-surrounding is the set of m-1 nearest neighbours from the perspective of location s_0. In the case of distance ties, a secondary criterion can be invoked based on direction.
Once an embedding protocol is adopted and the elements of the m-surrounding for location s_0 have been determined, a string can be obtained that collects the elements of the local neighborhood (the m-1 nearest neighbors) of the observation at s_0. The m-surrounding can then be represented in the following way:

X_m(s_0)=(X_{s_0},X_{s_1},...,X_{s_{m-1}})
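
The construction above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by Q.test: the helper m_surrounding() and the simulated variable x are hypothetical, and distance ties are broken by index order rather than by direction.

```r
# Minimal sketch: build m-surroundings from coordinates with base R only.
set.seed(1)
N <- 10
m <- 3
coor <- cbind(runif(N), runif(N))            # N x 2 matrix of coordinates
x <- sample(c("a", "b"), N, replace = TRUE)  # hypothetical categorical variable

d <- as.matrix(dist(coor))                   # pairwise Euclidean distances

# m_surrounding(): hypothetical helper. Returns the index of s_0 followed by
# the indices of its m - 1 nearest neighbours (ties broken by index order).
m_surrounding <- function(i, d, m) {
  order(d[i, ])[seq_len(m)]
}

ms <- t(sapply(seq_len(N), m_surrounding, d = d, m = m))  # N x m index matrix
xm <- matrix(x[ms], nrow = N)                # X_m(s_0) for every location s_0
```

Each row of xm holds the outcome at s_0 followed by the outcomes at its m-1 nearest neighbours.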

Since each observation X_s takes one of k possible values, and there are m observations in the m-surrounding, there are exactly k^m possible unique ways in which those values can co-locate. This is the number of permutations with replacement.
For instance, if k=2 (e.g. the possible outcomes are a_1=0 and a_2=1) and m=3, the following eight unique patterns of co-location are possible (the number of symbols is n_{σ}=8): (0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1), and (1,1,1). Each unique co-location type can be denoted in a convenient way by means of a symbol σ_i (i=1,2,...,k^m). It follows that each site can be uniquely associated with a specific symbol, in a process termed symbolization. In this way, we say that a location s is of type σ_i if and only if X_m(s)=σ_i.
Equivalent symbols (see Páez et al. 2012) can be obtained by counting the number of occurrences of each category within an m-surrounding. This surrenders some topological information (ordering within the m-surrounding is lost) in favor of a more compact set of symbols, since the number of combinations with replacement, C(m+k-1, m), is smaller than the number of permutations with replacement, k^m.
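
As an illustration of the two symbol sets for k=2 and m=3, the following base R sketch enumerates them explicitly. The variable names are hypothetical and the enumeration is not taken from the package.

```r
k <- 2
m <- 3
outcomes <- 0:(k - 1)

# Standard-permutation symbols: all ordered m-tuples
# (permutations with replacement).
perm <- expand.grid(rep(list(outcomes), m))
n_perm <- nrow(perm)            # k^m = 8

# Equivalent-combination symbols: number of occurrences of each outcome
# within the m-surrounding, ignoring order.
counts <- t(apply(perm, 1, function(s) tabulate(match(s, outcomes), nbins = k)))
comb <- unique(counts)
n_comb <- nrow(comb)            # choose(m + k - 1, m) = 4
```

Here the eight ordered patterns collapse to four count-based symbols: (3,0), (2,1), (1,2) and (0,3) occurrences of the outcomes 0 and 1.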

Definition of Q(m) statistic

Let \{X_s\}_{s \in R} be a discrete spatial process and m be a fixed embedding dimension. The statistic Q tests the null hypothesis:

H_0:\{X_s\}_{s \in R} is spatially independent, against any other alternative.

For a fixed m ≥ 2, the relative frequency of symbols can be used to define the symbolic entropy of the spatial process as the Shannon entropy of the distinct symbols:

h(m) = - ∑_j p_{σ_j}ln(p_{σ_j})

where

p_{σ_j}={ n_{σ_j} \over R}

where n_{σ_j} is simply the number of times that the symbol σ_j is observed and R is the number of symbolized locations. The entropy function is bounded: 0 < h(m) ≤ η.

The Q statistic is essentially a likelihood ratio test between the symbolic entropy of the observed pattern and the entropy of the system under the null hypothesis of a random spatial sequence:

Q(m)=2R(η-h(m))

with η = ln(k^m). The statistic is asymptotically χ^2 distributed with degrees of freedom equal to the number of symbols minus one.
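
The formulas above can be turned into a short base R sketch. The helper q_stat() is hypothetical and is not the spqdep implementation; it takes a vector of already symbolized locations, uses η = ln(k^m) exactly as written above, and assumes standard-permutation symbols, so that the degrees of freedom are k^m - 1.

```r
# q_stat(): hypothetical helper computing h(m), Q(m) and the asymptotic
# p-value from a vector of symbols (one symbol per symbolized location).
q_stat <- function(symbols, k, m) {
  R <- length(symbols)
  p <- as.numeric(table(symbols)) / R   # relative frequency of observed symbols
  h <- -sum(p * log(p))                 # symbolic entropy h(m)
  eta <- log(k^m)
  Q <- 2 * R * (eta - h)
  df <- k^m - 1                         # number of symbols minus one
  list(h = h, Q = Q, df = df,
       p.value = pchisq(Q, df = df, lower.tail = FALSE))
}

# Simulated example: symbols drawn under independence with equiprobable outcomes.
set.seed(123)
k <- 2
m <- 3
R <- 400
xm <- matrix(sample(0:(k - 1), R * m, replace = TRUE), ncol = m)
symbols <- apply(xm, 1, paste, collapse = "")
res <- q_stat(symbols, k, m)
```

Symbols that are never observed contribute nothing to the sum, which is why only the observed frequencies from table() enter the entropy.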

Value

A list of two objects of class htest. Each element of the list contains:

data.name   a character string giving the names of the data.
statistic   value of the Q test.
N           total number of observations.
R           total number of symbolized observations.
m           length of the m-surrounding.
r           degree of overlap.
df          degrees of freedom.
distr       type of distribution used to obtain the significance of the Q test.
type        type of symbols.

Control arguments

distance

character selecting the type of distance. For Cartesian coordinates only: one of "Euclidean" (default), "Hausdorff" or "Frechet". For geodetic coordinates, great circle distances are computed (see sf::st_distance()).

dtmaxabs

Deletes degenerate m-surroundings based on the absolute distance between observations.

dtmaxpc

A value between 0 and 1. Deletes degenerate m-surroundings based on relative distance: an m-surrounding is deleted when the maximum distance between its observations exceeds the given proportion of the maximum distance between any two observations.

dtmaxknn

An integer value 'k'. Deletes degenerate m-surroundings based on a nearest-neighbour criterion: an m-surrounding is deleted if one of its elements is not among the k nearest neighbours of its first element.

nsim

number of simulations used to obtain the Monte Carlo distribution. Default = 999.

seedinit

seed to select the initial element that starts the algorithm computing the m-surroundings, or to start the simulation.
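
To make the idea behind dtmaxpc concrete, here is a rough base R sketch of a relative-distance filter. It is an assumption about how such a criterion could work, not the package's code: the spread of an m-surrounding is taken to be the largest pairwise distance among its members, and all object names are hypothetical.

```r
set.seed(7)
N <- 50
m <- 4
dtmaxpc <- 0.3                       # proportion of the maximum distance
coor <- cbind(runif(N), runif(N))
d <- as.matrix(dist(coor))

# One m-surrounding per location: the location and its m - 1 nearest neighbours.
ms <- t(apply(d, 1, function(row) order(row)[seq_len(m)]))

# Largest pairwise distance inside each m-surrounding (assumed definition).
span <- apply(ms, 1, function(idx) max(d[idx, idx]))

# Keep only the surroundings that are compact enough.
keep <- span <= dtmaxpc * max(d)
ms_kept <- ms[keep, , drop = FALSE]
```

A smaller dtmaxpc removes more m-surroundings and therefore reduces the number of symbolized locations R.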

Standard-Permutation vs Equivalent-Combination Symbols

The symbolization protocol proposed by Ruiz et al. (2010) (call these Standard-Permutation Symbols) contains a large amount of topological information regarding the units of analysis, including proximity and direction. In this sense, the protocol is fairly general. On the other hand, it is easy to see that the combinatorial possibilities can very quickly become unmanageable. For a process with k = 3 outcomes and m = 5, the number of symbols becomes 3^5 = 243; for k = 6 and m = 4 it is 6^4 = 1,296. Depending on the number of observations N, the explosion in the number of symbols can very rapidly consume degrees of freedom for hypothesis testing, because as a rule of thumb it is recommended that the number of symbolized locations be at least five times the number of symbols used (i.e., R ≥ 5k^m), and R will usually be a fraction of N.

As an alternative, we propose a symbolization protocol that sacrifices some amount of topological detail for conciseness. The alternative is based on the standard scheme; however, instead of retaining proximity and direction relationships, it maintains only the total number of occurrences of each outcome in an m-surrounding. We call these Equivalent-Combination Symbols. Because order in the sequence is not considered in this protocol, instead of a permutation with repetition, the number of symbols reflects a combination with repetition.
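
The saving can be checked with a few lines of base R. The helper n_symbols() is hypothetical; it simply evaluates k^m and the number of combinations with repetition, choose(m + k - 1, m), for the two cases mentioned above, and applies the rule of thumb of five symbolized locations per symbol.

```r
# n_symbols(): hypothetical helper returning the size of both symbol sets.
n_symbols <- function(k, m) {
  c(permutation = k^m, combination = choose(m + k - 1, m))
}

cases <- rbind("k=3, m=5" = n_symbols(3, 5),
               "k=6, m=4" = n_symbols(6, 4))

# Rule of thumb: at least five symbolized locations per symbol.
min_R <- 5 * cases
print(cases)
print(min_R)
```

For k = 3 and m = 5 the symbol set shrinks from 243 to 21, and for k = 6 and m = 4 from 1,296 to 126, so the minimum recommended R drops from 1,215 to 105 and from 6,480 to 630 respectively.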

Selection of m-surrounding with Controlled Degree of Overlapping (r)

To select S locations for the analysis, coordinates are selected such that for any two coordinates s_i, s_j the number of overlapping nearest neighbours of s_i and s_j is at most r. The set S, which is a subset of all the observations N, is defined recursively as follows. First choose a location s_0 at random and fix an integer r with 0 ≤ r < m. The integer r is the degree of overlap, the maximum number of observations that contiguous m-surroundings are allowed to have in common.
Let \{s_1^0, s_2^0,...,s_{m-1}^0 \} be the set of nearest neighbours to s_0, where the s_i^0 are ordered by distance to s_0, or angle in the case of ties. Let us call s_1 = s_{m-r-1}^0 and define A_0 = \{s_0,s_1^0,...,s^0_{m-r-2} \}. Take the set of nearest neighbours to s_1, namely \{ s_1^1, s_2^1,...,s^1_{m-1} \}, in the set of locations S \setminus A_0 and define s_2=s^1_{m-r-1}. Now for i>1 we define s_i = s^{i-1}_{m-r-1}, where s^{i-1}_{m-r-1} is in the set of nearest neighbours to s_{i-1}, \{ s_1^{i-1},s_2^{i-1},...,s_{m-1}^{i-1} \}, of the set S \setminus \{ \cup_{j=0}^{i-1} A_j \}. Continue this process while there are locations to symbolize.
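
The recursion can be sketched in base R as follows. This is a simplified reading of the procedure, not the package's algorithm: the function select_surroundings() is hypothetical, it assumes 0 ≤ r ≤ m-2 so that the index m-r-1 is at least 1, it breaks distance ties by index order instead of angle, and it stops as soon as fewer than m-1 candidate neighbours remain.

```r
# select_surroundings(): hypothetical helper returning a matrix with one
# m-surrounding per row (centre first, then its m - 1 nearest neighbours).
select_surroundings <- function(coor, m, r, start = sample(nrow(coor), 1)) {
  stopifnot(r >= 0, r <= m - 2)          # assumption of this sketch
  N <- nrow(coor)
  d <- as.matrix(dist(coor))
  available <- rep(TRUE, N)              # locations not yet in any A_j
  centre <- start                        # s_0
  out <- list()
  repeat {
    cand <- setdiff(which(available), centre)
    if (length(cand) < m - 1) break      # not enough locations left
    nb <- cand[order(d[centre, cand])][seq_len(m - 1)]
    out[[length(out) + 1]] <- c(centre, nb)
    # A_i: the centre and its first m - r - 2 neighbours leave the pool
    available[c(centre, nb[seq_len(m - r - 2)])] <- FALSE
    centre <- nb[m - r - 1]              # s_{i+1} = s^i_{m-r-1}
  }
  do.call(rbind, out)
}

set.seed(42)
coor <- cbind(runif(60), runif(60))
S <- select_surroundings(coor, m = 4, r = 1, start = 1)
head(S)
```

In this sketch each selected centre is used only once, and the neighbour sets of any two selected m-surroundings share at most r locations, because only the last r neighbours of a surrounding remain available to later ones.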

Selection of m-surroundings for bootstrap distribution

Bootstrap-based testing can provide an advantage since overlap between m-surroundings is not a consideration, so the full sample can be used.
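
A permutation-based version of the test can be sketched in base R as below. It is a hedged illustration of the general idea (shuffle the observed categories over the locations and recompute Q each time), not the procedure implemented for distr = "mc"; the helper q_from_labels(), the simulated data and the choice of 199 replicates are all assumptions.

```r
set.seed(2022)
N <- 150
k <- 2
m <- 3
nsim <- 199
coor <- cbind(runif(N), runif(N))
x <- sample(seq_len(k), N, replace = TRUE)   # hypothetical independent data

# Use every location as the centre of an m-surrounding (full sample).
d <- as.matrix(dist(coor))
ms <- t(apply(d, 1, function(row) order(row)[seq_len(m)]))  # N x m indices

# q_from_labels(): hypothetical helper, Q(m) = 2R(eta - h(m)) for labels x.
q_from_labels <- function(x, ms, k) {
  m <- ncol(ms)
  R <- nrow(ms)
  symbols <- apply(matrix(x[ms], nrow = R), 1, paste, collapse = "-")
  p <- as.numeric(table(symbols)) / R
  2 * R * (log(k^m) + sum(p * log(p)))
}

q_obs <- q_from_labels(x, ms, k)
# Reference distribution: permute the labels over the fixed locations.
q_sim <- replicate(nsim, q_from_labels(sample(x), ms, k))
p_mc <- (1 + sum(q_sim >= q_obs)) / (nsim + 1)
```

Because the reference distribution is rebuilt from the same overlapping m-surroundings, the dependence induced by the overlap affects the observed and simulated statistics alike.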

Author(s)

Fernando López fernando.lopez@upct.es
Román Mínguez roman.minguez@uclm.es
Antonio Páez paezha@gmail.com
Manuel Ruiz manuel.ruiz@upct.es

References

  • Ruiz, M., López, F.A., and Páez, A. (2010). Testing for spatial association of qualitative data using symbolic dynamics. Journal of Geographical Systems, 12(3), 281-309.

  • López, F.A., and Páez, A. (2012). Distribution-free inference for Q(m) based on permutational bootstrapping: an application to the spatial co-location pattern of firms in Madrid. Estadística Española, 177, 135-156.

Examples


# Case 1: With coordinates
N <- 200
cx <- runif(N)
cy <- runif(N)
coor <- cbind(cx,cy)
p <- c(1/6,3/6,2/6)
rho <- 0.3
# Simulate a spatially dependent categorical process on a 4-nearest-neighbour graph
listw <- spdep::nb2listw(spdep::knn2nb(spdep::knearneigh(coor, k = 4)))
fx <- dgp.spq(listw = listw, p = p, rho = rho)
q.test <- Q.test(fx = fx, coor = coor, m = 3, r = 1)
summary(q.test)
plot(q.test)
print(q.test)

q.test.mc <- Q.test(fx = fx, coor = coor, m = 3, distr = "mc", control = list(nsim = 999))
summary(q.test.mc)
plot(q.test.mc)
print(q.test.mc)


# Case 2: With a sf object
data("FastFood.sf")
f1 <- ~ Type
q.test <- Q.test(formula = f1, data = FastFood.sf, m = c(3, 4),
r = c(1, 2, 3), control = list(distance ="Euclidean"))
summary(q.test)
plot(q.test)
print(q.test)

# Case 3: With a sf object with isolated areas
data("provinces_spain")
sf::sf_use_s2(FALSE)
provinces_spain$Male2Female <- factor(provinces_spain$Male2Female > 100)
levels(provinces_spain$Male2Female) <- c("men","woman")
provinces_spain$Older <- cut(provinces_spain$Older, breaks = c(-Inf,19,22.5,Inf))
levels(provinces_spain$Older) <- c("low","middle","high")
f1 <- ~ Older + Male2Female
q.test <- Q.test(formula = f1,
data = provinces_spain, m = 3, r = 1, control = list(seedinit = 1111))
summary(q.test)
print(q.test)
plot(q.test)
q.test.mc <- Q.test(formula = f1, data = provinces_spain, m = 4, r = 3, distr = "mc",
control = list(seedinit = 1111))
summary(q.test.mc)
print(q.test.mc)
plot(q.test.mc)

# Case 4: Examples with multipolygons
library(sf)
fname <- system.file("shape/nc.shp", package="sf")
nc <- sf::st_read(fname)
qb79 <- quantile(nc$BIR79)
nc$QBIR79 <- (nc$BIR79 > qb79[2]) + (nc$BIR79 > qb79[3]) +
(nc$BIR79 >= qb79[4]) + 1
nc$QBIR79 <- as.factor(nc$QBIR79)
plot(nc["QBIR79"], pal = c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D"),
     main = "BIR79 (Quartiles)")
sid79 <- quantile(nc$SID79)
nc$QSID79 <- (nc$SID79 > sid79[2]) + (nc$SID79 > sid79[3]) +
(nc$SID79 >= sid79[4]) + 1
nc$QSID79 <- as.factor(nc$QSID79)
plot(nc["QSID79"], pal = c("#FFFEDE","#FFDFA2", "#FFA93F", "#D5610D"),
     main = "SID79 (Quartiles)")
f1 <- ~ QSID79 + QBIR79
lq1nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
control = list(seedinit = 1111, dtmaxpc = 0.5, distance = "Euclidean") )
print(lq1nc)

lq2nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
control = list(dtmaxpc = 0.2) )
print(lq2nc)

lq3nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
control = list(dtmaxknn = 5) )
print(lq3nc)

# Case 5: Examples with points and matrix of variables
# One column per variable: the default column-major fill keeps each factor in its own column
fx <- matrix(c(nc$QBIR79, nc$QSID79), ncol = 2)
mctr <- suppressWarnings(sf::st_centroid(st_geometry(nc)))
mcoor <- st_coordinates(mctr)[,c("X","Y")]
q.test <- Q.test(fx = fx, coor = mcoor, m = 5, r = 2,
                 control = list(seedinit = 1111, dtmaxpc = 0.5))
print(q.test)
plot(q.test)



spqdep documentation built on March 28, 2022, 5:06 p.m.