Introduction

This document demonstrates the use of the R package PenGE for estimating the significant interactions of a multivariate, piecewise-constant saturation Gibbs model, using group lasso for automatic variable selection.

Start by making sure the required packages are installed and loaded:

library(PenGE)
library(spatstat)

We will use some rudimentary parallelisation from the package parallel:

library(parallel)
MCORES <- pmax(1, detectCores() - 1) # use all but one core
# Windows doesn't support forking
if(.Platform$OS.type == "windows") MCORES <- 1

Data

Let's see if we can discover interactions in the 6-variate lansing data set.

data(lansing)
plot(split(lansing), cex=.5)

Fitting the model

The data is a $p=6$ variate point pattern.

Bare minimum

The bare minimum is to define the interaction ranges:

p <- 6 # we have six species
rv <- c(0.025, 0.05) # use these for everything
r1 <- rep( list(rv) , p) # intra-type interaction ranges
r2 <- rep( list(rv), p * (p-1)/2) # inter-type

Then we need to compute the logistic regression pseudo-likelihood model matrix.

Q <- make_Q_stepper_multi(x=lansing, ranges1 = r1, ranges2 = r2, border_r = 0.05)

Then we can estimate the parameters with group lasso (no cross validation):

fit <- fitGlbin_CV(Q)

The algorithm estimates the interactions for a range of penalty levels. Let's take a look at the estimated interaction matrix at the default penalty level (used when the parameter k is not set). The default is an ad-hoc rule of thumb: half-way between the maximum penalty and the AIC-wise optimum. The AIC itself should not be trusted here, since the true likelihood is not available.

M <- imatrix(fit, signed = TRUE)
image.imatrix(M, col = c("red","black","green","blue"), cex.axis = 0.7, zlim = c(-1,2))
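A specific penalty level can also be requested explicitly via the k argument of imatrix (the index below is arbitrary, for illustration only, and the result is not used further):

M_alt <- imatrix(fit, k = 20, signed = TRUE) # interaction matrix at the 20th penalty level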

The matrix M shows the estimated interactions between the types. Note, however, that the matrix can differ between levels of penalisation, and the rule of thumb above for selecting the penalty level is not well justified. For illustration, compute the number of non-zero interactions per penalty level:

# remove the intercepts which are not penalised, and divide by number of ranges per interaction
nnzero <- (colSums(fit$fullfit$beta != 0) - p )/ 2
plot(fit$lambda, nnzero, xlim = rev(range(fit$lambda)), main ="number of non-zero interactions")

As the penalty decreases (left to right in the figure), more and more coefficients are estimated to be non-zero (note the maximum count: 6 intra + 15 inter = 21 interactions). Choosing the right penalty level is therefore of great importance. We can use cross validation to choose it more objectively.
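As a quick sanity check of that maximum count, p types give p intra-type plus p*(p-1)/2 inter-type interactions:

p + p * (p - 1) / 2 # 6 + 15 = 21 possible interaction terms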

Cross validation

For cross validation, we split the pattern spatially. In order to have truly spatially independent datasets, we need to use eroded windows to remove overlaps between the neighbourhoods of points near the sub-window edges. This in turn leads to a loss of data: points in the erosion zone, while contributing to other points' neighbourhoods, will not be directly included in the likelihood.

So some loss of data is inevitable; let's accept a 50% loss. How many quadrats can we have, and how should we split? For rectangular windows we can do the following:

r_border <- 0.05 # define a border zone radius, should be the max. of the model's interaction radii.
win_side_len <- 1
nmax <- cv_border_solve_split(R_frac = r_border/win_side_len, acceptable_loss = 0.5)
nx <- round( sqrt(nmax) ) # nx times nx regular quadrature
quads <- split_window(lansing$window, nx = nx, ny = nx)

Check split:

plot(unmark(lansing), cex=.3, main = "CV quadrats")
for(q in quads) plot(erosion(w = q, r = r_border), add=T)

Not that many. We can increase nx but at the cost of introducing correlation in the CV estimates. It remains an open question how badly such correlation affects the estimation.
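As a rough back-of-the-envelope check (not necessarily the package's internal calculation), if every quadrat of an nx-by-nx split of the unit square is eroded by r_border on all sides, the lost fraction of data is approximately 1 - (1 - 2 * r_border * nx)^2:

# approximate data loss per choice of nx, assuming square quadrats eroded on all four sides
loss_for_nx <- function(nx, r = r_border) 1 - (1 - 2 * r * nx)^2
round(sapply(2:5, loss_for_nx), 2) # loss grows quickly as the split gets finer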

Now we need to fit the model for each quadrat subset. The current implementation simply uses mclapply from the parallel package for local parallelisation. This will take a moment:

cvfit <- fitGlbin_CV(Q, quads = quads, mc.cores = MCORES)

Next we need to compute the cross validation risk, i.e. a prediction error per penalty level. This is given by the average of the quadrats' squared-residual sums. Unlike the single data-minus-prediction residual of linear models, several different residuals are available for Gibbs models.

Rest <- residuals_cv(cvfit, x = lansing, mc.cores = MCORES)

The defaults should produce something sensible. Several numerical approximations are needed in the current implementation; one can, for example, set dum_grid_dimyx = c(30,30) to use quadrature integration instead of Monte Carlo integration to approximate the residual integrals.
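For example, a hypothetical call with that option (assuming it is passed directly to residuals_cv):

Rest_q <- residuals_cv(cvfit, x = lansing, mc.cores = MCORES, dum_grid_dimyx = c(30, 30))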

Let's illustrate the individual residual curves per CV fit using the inverse residuals:

inv <- Rest$by_resid$inverse
pen <- cvfit$lambda 

par(mfrow=c(1,2))

# The residual per penalty for each CV quadrat
plot(pen, inv[1,], "l", xlim = rev(range(pen)), ylim=range(inv), main="Inverse residual SS by quadrat")
for(i in 2:nx^2) lines(pen, inv[i,], col=i)
# Estimated Risk:
risk_inv <- Rest$ave$inverse
plot(pen, log(risk_inv), "l", xlim = rev(range(pen)), main = "Mean inverse residual SS")

The optimal interaction matrix in the sense of minimal inverse residuals:

kopt <- which.min(risk_inv)
Mcv <- imatrix(cvfit, k = kopt, signed = TRUE)
image.imatrix(Mcv, cols = c(2,1,3,4),  zlim = c(-1,2), cex.axis=.7)
print(Mcv)

The coding of the matrix when signed=TRUE distinguishes between negative and positive estimated interactions. If signed=FALSE is passed to the function imatrix, the resulting matrix is binary, with 1 if at least one range-scale coefficient is non-zero.

From the matrix we can see that most types are clustered (positive intra-interaction, diagonal elements), and that maple is negatively associated with blackoak and hickory, for example.
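If we want to pick out the negatively associated pairs programmatically, here is a small sketch assuming the signed matrix behaves like an ordinary matrix with negative interactions coded as -1:

# type pairs with an estimated negative interaction (assumed coding: -1)
neg_pairs <- which(Mcv == -1, arr.ind = TRUE)
print(neg_pairs)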

Further options

Setting the saturation levels

All interaction functions between point pairs are built from piecewise-constant Geyer saturation potentials. We can either give the saturations as single integers (e.g. sat1 = 3, which would set all intra-type saturations to 3), or as lists of matrices. The latter allows for asymmetric saturations, which might be desirable if the intensities are highly unbalanced. Note that the Strauss potential is the special case with infinite saturation.

By default, the model adapts the saturation levels automagically depending on the intensities. We can override this. Let's set all saturation levels to 1:

Q1 <- make_Q_stepper_multi(x=lansing, ranges1 = r1, ranges2 = r2, border_r = 0.05, sat1 = 1, sat2 = 1, auto_sat = FALSE)

sat1 refers to the intra-type components and sat2 to the inter-type components. The saturation levels for the inter-type interaction between types 1 and 2:

print(Q$parameters$sat2[[1]])
print( Q1$parameters$sat2[[1]] )

This matrix is $k\times 2$, where $k$ is the length of the corresponding range-vector. The two column vectors of the automatically adjusted model are not equal, as the point count of type 2 ("hickory") is much larger than that of type 1 ("blackoak"). We have two saturation levels, one per range interval. Recall the ranges:

Q$parameters$ranges2[[1]]

Note that the first range interval starts from 0.
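As noted earlier, asymmetric saturations can be supplied as lists of matrices. Here is a hypothetical sketch, assuming each inter-type component accepts a $k\times 2$ matrix like the one printed above (here $k=2$ ranges; the values are arbitrary):

# one k x 2 saturation matrix per inter-type pair, columns giving the two directions
sat2_asym <- rep(list(cbind(c(1, 2), c(3, 4))), p * (p - 1) / 2)
Qa <- make_Q_stepper_multi(x = lansing, ranges1 = r1, ranges2 = r2, border_r = 0.05,
                           sat1 = 1, sat2 = sat2_asym, auto_sat = FALSE)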

Additional details

Details per type pair after fitting the model:

potential(cvfit, i = 1, j = 3, k = kopt)

Covariates

We can include covariates in the model as spatstat im objects. For the lansing data no covariate data is available, but for illustration let's generate two covariate fields:

cov1 <- as.im(function(x,y) (x-.5)^2, W = lansing$window)
cov2 <- as.im(function(x,y) (y-.5)^2, W = lansing$window)
covs <- list(covx = cov1, covy = cov2)
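A quick visual check of the generated fields (using spatstat's plot method for im objects):

par(mfrow = c(1, 2))
plot(cov1, main = "covx")
plot(cov2, main = "covy")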

The covariates are passed in like this:

Qc <- make_Q_stepper_multi(x=lansing, ranges1 = r1, ranges2 = r2, border_r = 0.05, covariates = covs, penalise_covariates = FALSE)

Note that we chose not to penalise the covariates. Group lasso shrinks the estimates, and penalising the covariates might lead to problems in the residual estimation (i.e. prediction). This needs to be studied further.

Then we proceed as before, with the addition that the covariates need to be given explicitly to residuals_cv:

cvfit_c <- fitGlbin_CV(Qc, quads = quads, mc.cores = MCORES)
Rc <- residuals_cv(cvfit_c, x = lansing, mc.cores = MCORES, covariates = covs)

We can extract the covariate coefficients as follows:

kopt_c <- which.min(Rc$ave$inverse)
cov_est <- covariate_matrix(cvfit_c, k = kopt_c, pref = "cov")
print(cov_est, digits=2)

Confidence intervals for these estimates are not available. pref="cov" gives the prefix used for string matching the covariate coefficients (avoid covariate names containing _, intra, or inter).

Note

The estimation is based on an approximation of the model likelihood. The approximation relies on an auxiliary random mechanism (see the make_Q_stepper_multi parameters regarding dummy points). Therefore, there will be some variability between different fits of the model even if the data and the model are kept the same.

This means that the estimates are correct penalised maximum likelihood estimates only on average.

It is highly recommended to fit the model more than once and compare/average the outputs.
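A minimal sketch of such a stability check, refitting the non-CV model a few times on freshly built model matrices (assuming the default-penalty binary matrices returned by imatrix can be summed elementwise):

# refit a few times; the dummy point mechanism differs between runs
refits <- lapply(1:3, function(i) {
  Qi <- make_Q_stepper_multi(x = lansing, ranges1 = r1, ranges2 = r2, border_r = 0.05)
  fitGlbin_CV(Qi)
})
Ms <- lapply(refits, imatrix, signed = FALSE)
Reduce(`+`, Ms) / length(Ms) # selection frequency of each interaction across refits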


