Q.test | R Documentation |
A function to compute the Q test for spatial qualitative data.
Q.test(formula = NULL, data = NULL, na.action, fx = NULL, coor = NULL, m = 3, r = 1, distr = "asymptotic", control = list())
formula |
a symbolic description of the factor(s). |
data |
an (optional) data frame or a sf object with points/multipolygons geometry containing the variable(s) to be tested. |
na.action |
action with NA values |
fx |
a factor or a matrix of factors in columns |
coor |
(optional) a 2xN vector with spatial coordinates. Used when *data* is not a spatial object |
m |
length of m-surrounding (default = 3). |
r |
only for the asymptotic distribution. Maximum overlap between any two m-surroundings (default = 1). |
distr |
character. Distribution type "asymptotic" (default) or "mc". |
control |
Optional argument. See Control Argument section. |
The Q(m) statistic was introduced by Ruiz et al. (2010) as a tool to explore geographical
co-location/co-occurrence of qualitative data. Consider a spatial variable X which is the
result of a qualitative process with a set number of categorical outcomes a_j (j=1,...,k).
The spatial variable is observed at a set of fixed locations indexed by their coordinates
s_i (i=1,..., N), so that at each location s_i where an event is observed,
X_i takes one of the possible values a_j.
Since the observations are georeferenced, a spatial embedding protocol can be devised
to assess the spatial property of co-location. Let us define, for an observation at
a specified location, say s_0, a surrounding of size m, called an m-surrounding.
The m-surrounding is the set of m-1 nearest neighbours from the perspective
of location s_0. In the case of distance ties, a secondary criterion can be
invoked based on direction.
Once an embedding protocol is adopted and the elements of the m-surrounding
for location s_0 have been determined, a string can be obtained that collects
the elements of the local neighborhood (the m-1 nearest neighbors) of the observation
at s_0. The m-surrounding can then be represented in the following way:
X_m(s_0)=(X_{s_0},X_{s_1},...X_{s_{m-1}})
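The embedding step can be illustrated in base R. This is only a minimal sketch: `m_surroundings` is a hypothetical helper (not part of the documented API), it uses plain Euclidean distance, and it ignores the direction-based tie-breaking mentioned above.

```r
# Hypothetical helper: for every location, return the indices of the
# location itself followed by its m-1 nearest neighbours.
m_surroundings <- function(coor, m) {
  d <- as.matrix(dist(coor))  # pairwise Euclidean distances
  # order() puts the location itself first (distance 0), then its neighbours
  t(apply(d, 1, function(row) order(row)[seq_len(m)]))
}

# Five illustrative locations with no distance ties
coor <- cbind(x = c(0, 1, 0, 5, 6), y = c(0, 0, 2, 5, 5))
ms <- m_surroundings(coor, m = 3)
ms  # row i holds the indices (s_0, s_1, s_2) for location i
```

Each row of `ms` can then be used to index the observed factor and obtain the string X_m(s_0).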
Since each observation X_s takes one of k possible values, and there are m observations in
the m-surrounding, there are exactly k^m possible unique ways in which those values can
co-locate. This is the number of permutations with replacement.
For instance, if k=2
(e.g. the possible outcomes are a_1=0 and a_2=1) and m=3, the following eight unique
patterns of co-location are possible (the number of symbols is n_{σ}=8): (0,0,0), (1,0,0),
(0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1), and (1,1,1). Each unique co-location type
can be denoted in a convenient way by means of a symbol σ_i (i=1, 2,...,k^m). It follows
that each site can be uniquely associated with a specific symbol, in a process termed
symbolization. In this way, we say that a location s is of type σ_i if and only if X_m(s)=σ_i.
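The enumeration of symbols for k=2 and m=3 can be reproduced with a few lines of base R. This is an illustrative sketch only; the names `sigma` and `symbol_of` are hypothetical, and the ordering of symbols is simply the one produced by `expand.grid()`.

```r
k_levels <- c(0, 1)  # the k = 2 possible outcomes
m <- 3               # size of the m-surrounding

# All ordered patterns of length m (permutations with replacement)
symbols <- expand.grid(rep(list(k_levels), m))
n_sigma <- nrow(symbols)  # equals k^m

# A text label for each symbol sigma_i
sigma <- unname(apply(symbols, 1, paste, collapse = ","))

# Symbolization: map an observed m-surrounding to the index of its symbol
symbol_of <- function(x_m) match(paste(x_m, collapse = ","), sigma)
symbol_of(c(1, 0, 1))
```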
Equivalent symbols (see Páez, et al. 2012) can be obtained by counting the number of
occurrences of each category within an m-surrounding. This surrenders some
topological information (ordering within the m-surrounding is lost) in favor of a more
compact set of symbols, since the number of symbols is then the number of combinations with replacement.
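As a small illustration of equivalent symbols, the sketch below collapses an ordered m-surrounding into category counts. `equivalent_symbol` is a hypothetical helper written for this example; the count of combinations with replacement uses the standard formula choose(k + m - 1, m).

```r
# Hypothetical helper: keep only how often each category occurs
equivalent_symbol <- function(x_m, levels) {
  counts <- table(factor(x_m, levels = levels))
  paste(as.integer(counts), collapse = ",")
}

# Two different orderings collapse to the same equivalent symbol
e1 <- equivalent_symbol(c(1, 0, 1), levels = c(0, 1))
e2 <- equivalent_symbol(c(1, 1, 0), levels = c(0, 1))
identical(e1, e2)

# Number of equivalent symbols for k = 2 and m = 3
k <- 2
m <- 3
n_equiv <- choose(k + m - 1, m)
n_equiv
```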
Definition of Q(m) statistic
Let \{X_s\}_{s \in R} be a discrete spatial process and m be a fixed embedding
dimension. The statistic Q tests the null hypothesis:
H_0:\{X_s\}_{s \in R} is spatially independent, against any other alternative.
For a fixed m ≥ 2, the relative frequency of symbols can be used to define the symbolic
entropy of the spatial process as the Shannon entropy of the distinct symbols:
h(m) = - ∑_j p_{σ_j}ln(p_{σ_j})
where
p_{σ_j}={ n_{σ_j} \over R}
with n_{σ_j} being simply the
number of times that the symbol σ_j is observed and R the number of
symbolized locations.
The entropy function is bounded: 0 < h(m) ≤ η.
The Q statistic is essentially a likelihood ratio test between the symbolic entropy of the observed pattern and the entropy of the system under the null hypothesis of a random spatial sequence:
Q(m)=2R(η-h(m))
with η = ln(k^m). The statistic is asymptotically χ^2 distributed
with degrees of freedom equal to the number of symbols minus one.
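The formulas above translate directly into base R. The sketch below is an assumption-laden illustration, not the package's implementation: `q_stat` is a hypothetical function that takes one symbol label per symbolized location and applies h(m), η = ln(k^m) and Q(m) = 2R(η - h(m)) exactly as written, with k^m - 1 degrees of freedom.

```r
# Hypothetical helper: Q(m) from a vector of symbol labels
q_stat <- function(sym, k, m) {
  n_sigma <- table(sym)            # counts of each observed symbol
  R <- sum(n_sigma)                # number of symbolized locations
  p <- as.numeric(n_sigma) / R     # relative frequencies (all > 0)
  h <- -sum(p * log(p))            # symbolic entropy h(m)
  eta <- log(k^m)                  # upper bound of the entropy
  Q <- 2 * R * (eta - h)
  df <- k^m - 1
  list(h = h, Q = Q, df = df,
       p.value = pchisq(Q, df = df, lower.tail = FALSE))
}

# All 8 symbols equally frequent: entropy is maximal and Q is (numerically) zero
uniform <- q_stat(rep(1:8, each = 5), k = 2, m = 3)

# Only two symbols observed: low entropy, large Q
clustered <- q_stat(c(rep("000", 30), rep("111", 10)), k = 2, m = 3)
clustered$Q > uniform$Q
```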
A list of two objects of class htest. Each element of the list returns:
data.name | a character string giving the names of the data. |
statistic | Value of the Q test |
N | total number of observations. |
R | total number of symbolized observations. |
m | length of the m-surrounding. |
r | degree of overlapping. |
df | degrees of freedom. |
distr | type of distribution used to get the significance of the Q test. |
type | type of symbols. |
Control arguments
distance | character to select the type of distance. Default = "Euclidean". For Cartesian coordinates only: one of Euclidean, Hausdorff or Frechet; for geodetic coordinates, great circle distances are computed (see sf::st_distance()). |
dtmaxabs | Delete degenerate surroundings based on the absolute distance between observations. |
dtmaxpc | A value between 0 and 1. Delete degenerate surroundings based on distance: an m-surrounding is deleted when the maximum distance between its observations exceeds the given fraction of the maximum distance between any two observations. |
dtmaxknn | An integer value k. Delete degenerate surroundings based on a nearest-neighbour criterion: an m-surrounding is deleted if any of its elements is not among the k nearest neighbours of its first element. |
nsim | number of simulations used to obtain the Monte Carlo distribution. Default = 999. |
seedinit | seed to select the initial element used to start the algorithm that computes the m-surroundings, or to start the simulation. |
The symbolization protocol proposed by Ruiz et al. (2010), whose symbols we call
Standard-Permutation Symbols, contains a large amount of topological information
regarding the units of analysis, including proximity and direction.
In this sense, the protocol is fairly general. On the other hand, it is easy to see
that the combinatorial possibilities can very quickly become unmanageable.
For a process with k = 3 outcomes and m = 5, the number of symbols becomes
3^5 = 243; for k = 6 and m = 4 it is 6^4 = 1,296. Depending on the number
of observations N, the explosion in the number of symbols can very rapidly consume
degrees of freedom for hypothesis testing, because as a rule of thumb
it is recommended that the number of symbolized locations be at least five times
the number of symbols used (e.g., R ≥ 5k^m), and R will usually be a fraction of N.
As an alternative, we propose a symbolization protocol that sacrifices
some amount of topological detail for conciseness. The alternative is based
on the standard scheme; however, instead of retaining proximity and
direction relationships, it maintains only the total number of occurrences
of each outcome in an m-surrounding. We call these Equivalent-Combination Symbols.
Because order in the sequence is not considered in this protocol, instead of a
permutation with repetition, the number of symbols reflects a combination with
repetition.
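To make the trade-off concrete, the following sketch compares the number of symbols under both protocols and the corresponding rule-of-thumb minimum for R. `symbol_counts` is a hypothetical helper; applying the "five times the number of symbols" rule to the combination symbols is an assumption made here for illustration.

```r
# Hypothetical helper: number of symbols under each protocol
symbol_counts <- function(k, m) {
  c(permutation = k^m,                      # Standard-Permutation Symbols
    combination = choose(k + m - 1, m))     # Equivalent-Combination Symbols
}

s35 <- symbol_counts(k = 3, m = 5)
s64 <- symbol_counts(k = 6, m = 4)

# Rule-of-thumb minimum number of symbolized locations (5 per symbol)
rbind(`k=3, m=5` = 5 * s35, `k=6, m=4` = 5 * s64)
```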
To select S locations for the analysis, coordinates are selected such that
for any two coordinates s_i, s_j the number of overlapping nearest
neighbours of s_i and s_j is at most r. The set S, which is a subset of all the
N observations, is defined recursively as follows. First choose a location s_0 at random and fix an integer r
with 0 ≤ r < m. The integer r is the degree of overlap, the maximum number of observations that contiguous
m-surroundings are allowed to have in common.
Let \{s_1^0, s_2^0,...,s_{m-1}^0 \} be the set of nearest neighbours
to s_0, where the s_i^0 are ordered by distance to s_0, or by angle in the case of ties.
Let us call s_1 = s_{m-r-1}^0 and define A_0 = \{s_0,s_1^0,...,s^0_{m-r-2} \}. Take the set of
nearest neighbours to s_1, namely \{ s_1^1, s_2^1,...,s^1_{m-1} \}, in the
set of locations S \setminus A_0 and define s_2=s^1_{m-r-1}. Now for i>1 we define
s_i = s^{i-1}_{m-r-1}, where s^{i-1}_{m-r-1} is in the set of nearest neighbours to s_{i-1},
\{ s_1^{i-1},s_2^{i-1},...,s_{m-1}^{i-1} \}, of the set S \setminus \{ \cup_{j=0}^{i-1} A_j \}. Continue this process while there are locations to symbolize.
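A literal reading of this recursive definition can be sketched in base R. The function below is hypothetical and simplified: it takes a fixed starting index instead of a random s_0, uses Euclidean distance, ignores angle-based tie-breaking, and only handles r ≤ m - 2 (so that the index m - r - 1 is at least 1). It should not be taken as the package's algorithm.

```r
# Hypothetical helper: chain of m-surroundings built as described above
select_surroundings <- function(coor, m, r, start = 1) {
  stopifnot(m >= 2, r >= 0, r <= m - 2)
  d <- as.matrix(dist(coor))
  available <- rep(TRUE, nrow(coor))   # locations not yet in any A_j
  current <- start                     # s_i
  out <- list()
  repeat {
    cand <- which(available)
    cand <- cand[cand != current]
    if (length(cand) < m - 1) break    # not enough locations left
    # m-1 nearest neighbours of s_i among the remaining locations
    nn <- cand[order(d[current, cand])][seq_len(m - 1)]
    out[[length(out) + 1]] <- c(current, nn)
    # A_i = {s_i, s_1^i, ..., s_{m-r-2}^i} is removed from the pool
    available[c(current, nn[seq_len(m - r - 2)])] <- FALSE
    current <- nn[m - r - 1]           # s_{i+1}
  }
  do.call(rbind, out)
}

# Six illustrative locations on a line, with no distance ties
coor <- cbind(x = c(0, 1, 3, 6, 10, 15), y = 0)
res <- select_surroundings(coor, m = 3, r = 0)
res  # one m-surrounding per row
```

Counting the locations shared by consecutive rows of `res` is an easy way to inspect the overlap that a given r produces.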
Bootstrap-based testing can provide an advantage since overlapping between
m-surroundings is not a consideration, and the full sample can be used.
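The idea of simulation-based inference on the full sample can be sketched as follows. This is a generic permutation scheme assumed for illustration (categories are shuffled across locations and Q is recomputed each time); whether the package's "mc" option works exactly this way is not stated here. All names (`q_of`, `q_sim`, `p_value`) are hypothetical, and the data are simulated.

```r
set.seed(1)
N <- 60; m <- 3; k <- 2
coor <- cbind(runif(N), runif(N))              # simulated coordinates
x <- sample(c("a", "b"), N, replace = TRUE)    # simulated categories

# m-surrounding of every location (itself plus m-1 nearest neighbours)
d <- as.matrix(dist(coor))
nn <- t(apply(d, 1, function(row) order(row)[seq_len(m)]))

# Q(m) = 2R(eta - h(m)) computed over all N overlapping m-surroundings
q_of <- function(x) {
  sym <- apply(matrix(x[nn], ncol = m), 1, paste, collapse = "")
  p <- as.numeric(table(sym)) / length(sym)
  2 * length(sym) * (log(k^m) + sum(p * log(p)))
}

q_obs <- q_of(x)
nsim <- 199
q_sim <- replicate(nsim, q_of(sample(x)))      # shuffle categories over space
p_value <- (1 + sum(q_sim >= q_obs)) / (nsim + 1)
p_value
```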
Fernando López | fernando.lopez@upct.es |
Román Mínguez | roman.minguez@uclm.es |
Antonio Páez | paezha@gmail.com |
Manuel Ruiz | manuel.ruiz@upct.es |
Ruiz M, López FA, Páez A. (2010). Testing for spatial association of qualitative data using symbolic dynamics. Journal of Geographical Systems, 12(3), 281-309.
López FA, Páez A. (2012). Distribution-free inference for Q(m) based on permutational bootstrapping: an application to the spatial co-location pattern of firms in Madrid. Estadística Española, 177, 135-156.
library(spqdep)

# Case 1: With coordinates
N <- 200
cx <- runif(N)
cy <- runif(N)
coor <- cbind(cx, cy)
p <- c(1/6, 3/6, 2/6)
rho <- 0.3
listw <- spdep::nb2listw(spdep::knn2nb(spdep::knearneigh(cbind(cx, cy), k = 4)))
fx <- dgp.spq(list = listw, p = p, rho = rho)
q.test <- Q.test(fx = fx, coor = coor, m = 3, r = 1)
summary(q.test)
plot(q.test)
print(q.test)
q.test.mc <- Q.test(fx = fx, coor = coor, m = 3, distr = "mc",
                    control = list(nsim = 999))
summary(q.test.mc)
plot(q.test.mc)
print(q.test.mc)

# Case 2: With a sf object
data("FastFood.sf")
f1 <- ~ Type
q.test <- Q.test(formula = f1, data = FastFood.sf, m = c(3, 4), r = c(1, 2, 3),
                 control = list(distance = "Euclidean"))
summary(q.test)
plot(q.test)
print(q.test)

# Case 3: With a sf object with isolated areas
data("provinces_spain")
sf::sf_use_s2(FALSE)
provinces_spain$Male2Female <- factor(provinces_spain$Male2Female > 100)
levels(provinces_spain$Male2Female) <- c("men", "woman")
provinces_spain$Older <- cut(provinces_spain$Older, breaks = c(-Inf, 19, 22.5, Inf))
levels(provinces_spain$Older) <- c("low", "middle", "high")
f1 <- ~ Older + Male2Female
q.test <- Q.test(formula = f1, data = provinces_spain, m = 3, r = 1,
                 control = list(seedinit = 1111))
summary(q.test)
print(q.test)
plot(q.test)
q.test.mc <- Q.test(formula = f1, data = provinces_spain, m = 4, r = 3, distr = "mc",
                    control = list(seedinit = 1111))
summary(q.test.mc)
print(q.test.mc)
plot(q.test.mc)

# Case 4: Examples with multipolygons
library(sf)
fname <- system.file("shape/nc.shp", package = "sf")
nc <- sf::st_read(fname)
# Quartile classes of BIR79
qb79 <- quantile(nc$BIR79)
nc$QBIR79 <- (nc$BIR79 > qb79[2]) + (nc$BIR79 > qb79[3]) + (nc$BIR79 >= qb79[4]) + 1
nc$QBIR79 <- as.factor(nc$QBIR79)
plot(nc["QBIR79"], pal = c("#FFFEDE", "#FFDFA2", "#FFA93F", "#D5610D"),
     main = "BIR79 (Quartiles)")
# Quartile classes of SID79
sid79 <- quantile(nc$SID79)
nc$QSID79 <- (nc$SID79 > sid79[2]) + (nc$SID79 > sid79[3]) + (nc$SID79 >= sid79[4]) + 1
nc$QSID79 <- as.factor(nc$QSID79)
plot(nc["QSID79"], pal = c("#FFFEDE", "#FFDFA2", "#FFA93F", "#D5610D"),
     main = "SID79 (Quartiles)")
f1 <- ~ QSID79 + QBIR79
lq1nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
                control = list(seedinit = 1111, dtmaxpc = 0.5, distance = "Euclidean"))
print(lq1nc)
lq2nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
                control = list(dtmaxpc = 0.2))
print(lq2nc)
lq3nc <- Q.test(formula = f1, data = nc, m = 5, r = 2,
                control = list(dtmaxknn = 5))
print(lq3nc)

# Case 5: Examples with points and matrix of variables
# Column-major fill keeps one variable per column (QBIR79, then QSID79)
fx <- matrix(c(nc$QBIR79, nc$QSID79), ncol = 2)
mctr <- suppressWarnings(sf::st_centroid(st_geometry(nc)))
mcoor <- st_coordinates(mctr)[, c("X", "Y")]
q.test <- Q.test(fx = fx, coor = mcoor, m = 5, r = 2,
                 control = list(seedinit = 1111, dtmaxpc = 0.5))
print(q.test)
plot(q.test)