bootstrap_MRF: Bootstrap observations to estimate MRF parameter coefficients
In nicholasjclark/MRFcov: Markov Random Fields with Additional Covariates

View source: R/bootstrap_MRF.R

bootstrap_MRF

R Documentation

Bootstrap observations to estimate MRF parameter coefficients

Description

This function runs MRFcov models multiple times to capture uncertainty in parameter estimates. The dataset is shuffled and missing values (if found) are imputed in each bootstrap iteration.

Usage

bootstrap_MRF(
  data,
  n_bootstraps,
  sample_seed,
  symmetrise,
  n_nodes,
  n_cores,
  n_covariates,
  family,
  sample_prop,
  spatial = FALSE,
  coords = NULL
)

Arguments

`data`	Dataframe. The input data, where the `n_nodes` left-most variables are the variables to be represented by nodes in the graph. Note that `NA`s are allowed for covariates. If present, these missing values will be imputed from the distribution `rnorm(mean = 0, sd = 1)`, which assumes that all covariates are scaled and centred (i.e. by using the function `scale` or similar).
`n_bootstraps`	Positive integer. Represents the total number of bootstrap samples to test. Default is `100`.
`sample_seed`	Numeric. Used as the seed value for generating bootstrap replicates, allowing users to generate replicated datasets on different systems. Default is a random seed
`symmetrise`	The method to use for symmetrising corresponding parameter estimates (which are taken from separate regressions). Options are `min` (take the coefficient with the smallest absolute value), `max` (take the coefficient with the largest absolute value) or `mean` (take the mean of the two coefficients). Default is `mean`
`n_nodes`	Positive integer. The index of the last column in `data` which is represented by a node in the final graph. Columns with index greater than `n_nodes` are taken as covariates. Default is the number of columns in `data`, corresponding to no additional covariates
`n_cores`	Integer. The number of cores to spread the job across using `makePSOCKcluster`. Default is 1 (no parallelisation)
`n_covariates`	Positive integer. The number of covariates in `data`, before cross-multiplication. Default is `NCOL(data) - n_nodes`
`family`	The response type. Responses can be quantitative continuous (`family = "gaussian"`), non-negative counts (`family = "poisson"`) or binomial 1s and 0s (`family = "binomial"`)
`sample_prop`	Positive probability value indicating the proportion of rows to sample from `data` in each bootstrap iteration. Default is no subsampling (`sample_prop == 1`)
`spatial`	Logical. If `TRUE`, spatial MRF / CRF models are bootstrapped using `MRFcov_spatial`. Note, GPS coordinates must be supplied as `coords` for spatial models to be run. Smoothed spatial splines will be included in each node-wise regression as covariates. This ensures resulting node interaction parameters are estimated after accounting for possible spatial autocorrelation. Note that interpretation of spatial autocorrelation is difficult, and so it is recommended to compare the predictive capacities of spatial and non-spatial CRFs through the `predict_MRF` function.
`coords`	A two-column `dataframe` (with `nrow(coords) == nrow(data)`) representing the spatial coordinates of each observation in `data`. Ideally, these coordinates will represent Latitude and Longitude GPS points for each observation.
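The three `symmetrise` options can be illustrated with a small base R sketch. The helper `symmetrise_coefs` and the example matrix are hypothetical and are not part of MRFcov; the package's internal implementation may differ. The sketch only shows how two separate estimates of the same pairwise coefficient, `coef[i, j]` and `coef[j, i]`, could be combined under each rule.

```r
# Hypothetical helper (not an MRFcov function): combine the two estimates
# of each pairwise coefficient, coef_mat[i, j] and coef_mat[j, i].
symmetrise_coefs <- function(coef_mat, method = c("mean", "min", "max")) {
  method <- match.arg(method)
  a <- coef_mat
  b <- t(coef_mat)
  switch(method,
    # average the two estimates
    mean = (a + b) / 2,
    # keep the estimate with the smaller absolute value
    # (ties with opposite signs are not handled in this sketch)
    min = ifelse(abs(a) <= abs(b), a, b),
    # keep the estimate with the larger absolute value
    max = ifelse(abs(a) >= abs(b), a, b)
  )
}

# Made-up asymmetric coefficients from three node-wise regressions
coefs <- matrix(c( 0.0, 0.4, -0.1,
                   0.2, 0.0,  0.3,
                  -0.5, 0.1,  0.0),
                nrow = 3, byrow = TRUE)

sym_mean <- symmetrise_coefs(coefs, "mean")
sym_min  <- symmetrise_coefs(coefs, "min")
sym_max  <- symmetrise_coefs(coefs, "max")
```

With `min`, the more conservative of the two estimates is kept, so weak or inconsistent interactions shrink towards zero; `max` keeps the stronger estimate.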

Details

MRFcov models are fit via cross-validation using cv.glmnet. For each model, the data are bootstrapped by shuffling row observations and fitting the model to a subset of observations, which accounts for uncertainty in parameter estimates. Parameter estimates from the set of bootstrapped models are summarised as means and confidence intervals (bounded by the 5 and 95 percent quantiles).
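The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's code: the data are simulated, an ordinary `glm` stands in for the cross-validated `cv.glmnet` fit, and all object names are hypothetical.

```r
set.seed(1)

# Simulated data (hypothetical): one binary node and one scaled covariate
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
dat <- data.frame(y = y, x = x)

n_bootstraps <- 50
sample_prop <- 0.7

boot_coefs <- vapply(seq_len(n_bootstraps), function(i) {
  # shuffle the rows and keep a subset of the observations
  rows <- sample(nrow(dat), size = floor(sample_prop * nrow(dat)))
  # plain glm used here as a stand-in for the regularised node-wise fit
  fit <- glm(y ~ x, family = binomial, data = dat[rows, ])
  coef(fit)[["x"]]
}, numeric(1))

# Summarise across bootstrap replicates
coef_mean    <- mean(boot_coefs)
coef_lower90 <- quantile(boot_coefs, probs = 0.05)
coef_upper90 <- quantile(boot_coefs, probs = 0.95)
```

An interval between `coef_lower90` and `coef_upper90` that excludes zero suggests the coefficient is consistently estimated with the same sign across replicates.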

Value

A list containing:

direct_coef_means: dataframe containing mean coefficient values taken from all bootstrapped models across the iterations
direct_coef_upper90 and direct_coef_lower90: dataframes containing coefficient 95 percent and 5 percent quantiles taken from all bootstrapped models across the iterations
indirect_coef_mean: list of symmetric matrices (one matrix for each covariate) containing mean effects of covariates on pairwise interactions
mean_key_coefs: list of matrices of length n_nodes containing mean covariate coefficient values and their relative importances (using the formula x^2 / sum(x^2)), taken from all bootstrapped models across iterations. Only coefficients with mean relative importances > 0.01 are returned. Note, relative importances are only useful if all covariates are on a similar scale.
mod_type: A character stating the type of model that was fit (used in other functions)
mod_family: A character stating the family of model that was fit (used in other functions)
poiss_sc_factors: A vector of the square-root mean scaling factors used to standardise poisson variables (only returned if family = "poisson")
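The relative importance formula used for mean_key_coefs, x^2 / sum(x^2), can be worked through by hand. The coefficient values, variable names and column names below are made up for illustration and are not taken from a fitted model or from the package's actual output layout.

```r
# Hypothetical mean coefficients for the predictors of a single node
coefs <- c(sp2 = 0.8, sp3 = -0.4, temp = 0.05, temp_sp2 = 0.2)

# Relative importance: squared coefficient over the sum of squared coefficients
rel_importance <- coefs^2 / sum(coefs^2)

key_coefs <- data.frame(
  variable = names(coefs),
  mean_coef = unname(coefs),
  rel_importance = unname(rel_importance)
)

# Keep only coefficients with relative importance above 0.01,
# ordered from most to least important
key_coefs <- key_coefs[key_coefs$rel_importance > 0.01, ]
key_coefs <- key_coefs[order(-key_coefs$rel_importance), ]
```

Because the formula squares raw coefficients, a covariate measured on a much larger or smaller scale than the others would distort the ranking, which is why scaled covariates are needed for the importances to be comparable.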

Examples


data("Bird.parasites")

# Perform 2 quick bootstrap replicates using 70% of observations
bootedCRF <- bootstrap_MRF(data = Bird.parasites,
                          n_nodes = 4,
                          family = 'binomial',
                          sample_prop = 0.7,
                          n_bootstraps = 2)


# Small example of using spatial coordinates for a spatial CRF
Latitude <- sample(seq(-19, -22, length.out = 100), nrow(Bird.parasites), TRUE)
Longitude <- sample(seq(120, 140, length.out = 100), nrow(Bird.parasites), TRUE)
coords <- data.frame(Latitude = Latitude, Longitude = Longitude)
bootedSpatial <- bootstrap_MRF(data = Bird.parasites, n_nodes = 4,
                             family = 'binomial',
                             spatial = TRUE,
                             coords = coords,
                             sample_prop = 0.5,
                             n_bootstraps = 2)

nicholasjclark/MRFcov documentation built on March 30, 2024, 10:31 p.m.