bootstrap_MRF: Bootstrap observations to estimate MRF parameter coefficients

View source: R/bootstrap_MRF.R

bootstrap_MRF    R Documentation

Bootstrap observations to estimate MRF parameter coefficients

Description

This function runs MRFcov models multiple times to capture uncertainty in parameter estimates. The dataset is shuffled and missing values (if found) are imputed in each bootstrap iteration.

Usage

bootstrap_MRF(
  data,
  n_bootstraps,
  sample_seed,
  symmetrise,
  n_nodes,
  n_cores,
  n_covariates,
  family,
  sample_prop,
  spatial = FALSE,
  coords = NULL
)

Arguments

data

Dataframe. The input data, where the n_nodes left-most variables are the variables to be represented by nodes in the graph. Note that NAs are allowed for covariates. If present, these missing values will be imputed from the distribution rnorm(mean = 0, sd = 1), which assumes that all covariates are scaled and centred (i.e. by using the function scale or similar).
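The imputation rule described for data can be sketched in base R. This is only an illustration of the stated behaviour (standard normal draws for missing covariate values after scaling), not the package's internal code; the toy data frame and its column names are hypothetical.

```r
# Hypothetical toy data: two node columns followed by two covariates
set.seed(1)
toy <- data.frame(
  sp1  = rbinom(10, 1, 0.5),
  sp2  = rbinom(10, 1, 0.5),
  temp = c(12.1, NA, 15.3, 14.0, NA, 13.2, 16.8, 12.9, 15.1, 14.4),
  rain = c(100, 120, NA, 90, 110, 95, 130, 105, 98, 115)
)
n_nodes  <- 2
cov_cols <- (n_nodes + 1):ncol(toy)

# Centre and scale each covariate; scale() ignores NAs when computing
# the mean and standard deviation
toy[cov_cols] <- lapply(toy[cov_cols], function(x) as.numeric(scale(x)))

# Replace each remaining NA with a draw from a standard normal,
# which matches the scale of the centred covariates
for (j in cov_cols) {
  missing <- is.na(toy[[j]])
  toy[[j]][missing] <- rnorm(sum(missing), mean = 0, sd = 1)
}
```

Scaling first matters: a draw from rnorm(mean = 0, sd = 1) is only a plausible stand-in for a covariate that itself has mean 0 and unit variance.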

n_bootstraps

Positive integer. Represents the total number of bootstrap samples to test. Default is 100.

sample_seed

Numeric. Used as the seed value for generating bootstrap replicates, allowing users to generate replicated datasets on different systems. Default is a random seed
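As a rough illustration of why a fixed seed gives reproducible replicates, the base R sketch below draws the same row indices twice. The value of sample_seed and the sample sizes are arbitrary assumptions; how bootstrap_MRF uses the seed internally is not shown here.

```r
sample_seed <- 2024  # arbitrary illustrative value

# Resetting the same seed reproduces the same random draw
set.seed(sample_seed)
first_draw <- sample(seq_len(20), 10)

set.seed(sample_seed)
second_draw <- sample(seq_len(20), 10)

identical(first_draw, second_draw)
```

Note that R changed the default algorithm behind sample() in version 3.6.0, so matching results across systems also requires compatible R versions.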

symmetrise

The method to use for symmetrising corresponding parameter estimates (which are taken from separate regressions). Options are min (take the coefficient with the smallest absolute value), max (take the coefficient with the largest absolute value) or mean (take the mean of the two coefficients). Default is mean
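The three symmetrising options can be sketched for a single pair of coefficients. The helper symmetrise_pair is hypothetical and only mirrors the description above; it is not a function exported by the package.

```r
# Hypothetical helper: combine the two estimates of one pairwise
# interaction, each taken from a separate node-wise regression
symmetrise_pair <- function(a, b, method = c("mean", "min", "max")) {
  method <- match.arg(method)
  switch(method,
    mean = (a + b) / 2,                      # average of the two
    min  = if (abs(a) <= abs(b)) a else b,   # smallest absolute value
    max  = if (abs(a) >= abs(b)) a else b    # largest absolute value
  )
}

symmetrise_pair(0.8, -0.2, "min")
symmetrise_pair(0.8, -0.2, "max")
symmetrise_pair(0.8, -0.2, "mean")
```

Because min and max compare absolute values, the returned coefficient keeps its original sign.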

n_nodes

Positive integer. The index of the last column in data which is represented by a node in the final graph. Columns with index greater than n_nodes are taken as covariates. Default is the number of columns in data, corresponding to no additional covariates

n_cores

Integer. The number of cores to spread the job across using makePSOCKcluster. Default is 1 (no parallelisation)
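For readers unfamiliar with makePSOCKcluster, the sketch below shows the general pattern of spreading independent replicates over worker processes with the parallel package. The per-replicate function is a placeholder, and the way bootstrap_MRF dispatches its own work is an assumption not confirmed by this page.

```r
library(parallel)

n_cores <- 2  # illustrative value

# Start worker processes, run one task per replicate, then shut down
cl <- makePSOCKcluster(n_cores)
replicate_results <- parLapply(cl, 1:4, function(i) {
  # Placeholder for fitting one bootstrap replicate
  i^2
})
stopCluster(cl)

unlist(replicate_results)
```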

n_covariates

Positive integer. The number of covariates in data, before cross-multiplication. Default is NCOL(data) - n_nodes
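The relationship between n_nodes, n_covariates and the column layout of data can be illustrated with a small, hypothetical data frame (the column names are invented for the example):

```r
# Hypothetical layout: three node columns, then one scaled covariate
toy <- data.frame(
  sp1  = c(1, 0, 1),
  sp2  = c(0, 0, 1),
  sp3  = c(1, 1, 0),
  temp = c(-0.5, 0.1, 0.4)
)

n_nodes      <- 3
n_covariates <- NCOL(toy) - n_nodes   # the documented default

# Columns up to n_nodes are nodes; everything to the right is a covariate
node_names      <- colnames(toy)[seq_len(n_nodes)]
covariate_names <- colnames(toy)[-seq_len(n_nodes)]
```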

family

The response type. Responses can be quantitative continuous (family = "gaussian"), non-negative counts (family = "poisson") or binomial 1s and 0s (family = "binomial")

sample_prop

Positive probability value indicating the proportion of rows to sample from data in each bootstrap iteration. Default is no subsampling (sample_prop == 1)
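A minimal sketch of what a sample_prop of 0.7 implies for one iteration is given below: shuffle the row indices, then keep the first 70 percent. The exact sampling scheme used inside the package is an assumption here, and n_obs is an illustrative value.

```r
set.seed(42)
n_obs       <- 200   # illustrative number of rows in data
sample_prop <- 0.7

# Shuffle row indices without replacement, then keep a proportion of them
shuffled <- sample(seq_len(n_obs), n_obs, replace = FALSE)
keep     <- shuffled[seq_len(round(n_obs * sample_prop))]

length(keep)
```

The retained indices would then be used to subset data before fitting the model for that iteration.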

spatial

Logical. If TRUE, spatial MRF / CRF models are bootstrapped using MRFcov_spatial. Note that GPS coordinates must be supplied as coords for spatial models to be run. Smoothed spatial splines will be included in each node-wise regression as covariates. This ensures resulting node interaction parameters are estimated after accounting for possible spatial autocorrelation. Interpretation of spatial autocorrelation is difficult, so it is recommended to compare the predictive capacities of spatial and non-spatial CRFs through the predict_MRF function.

coords

A two-column dataframe (with nrow(coords) == nrow(data)) representing the spatial coordinates of each observation in data. Ideally, these coordinates will represent Latitude and Longitude GPS points for each observation.

Details

MRFcov models are fit via cross-validation using cv.glmnet. For each model, the data are bootstrapped by shuffling row observations and fitting models to a subset of observations to account for uncertainty in parameter estimates. Parameter estimates from the set of bootstrapped models are summarised to present means and confidence intervals (bounded by the 5 percent and 95 percent quantiles, as returned in direct_coef_lower90 and direct_coef_upper90).
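The summary step can be sketched with simulated numbers standing in for fitted coefficients. The matrix boot_coefs and its parameter names are hypothetical; only the use of means and the 5 and 95 percent quantiles follows the description on this page.

```r
set.seed(123)

# Hypothetical stand-in for fitted coefficients:
# 50 bootstrap replicates (rows) of 3 parameters (columns)
boot_coefs <- matrix(
  rnorm(50 * 3, mean = rep(c(0.5, 0, -1), each = 50), sd = 0.2),
  nrow = 50,
  dimnames = list(NULL, c("b1", "b2", "b3"))
)

# Summarise each parameter across replicates
coef_means <- colMeans(boot_coefs)
coef_lower <- apply(boot_coefs, 2, quantile, probs = 0.05)
coef_upper <- apply(boot_coefs, 2, quantile, probs = 0.95)

rbind(lower = coef_lower, mean = coef_means, upper = coef_upper)
```

An interval that excludes zero for a parameter suggests that its sign is stable across bootstrap replicates.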

Value

A list containing:

  • direct_coef_means: dataframe containing mean coefficient values taken from all bootstrapped models across the iterations

  • direct_coef_upper90 and direct_coef_lower90: dataframes containing coefficient 95 percent and 5 percent quantiles taken from all bootstrapped models across the iterations

  • indirect_coef_mean: list of symmetric matrices (one matrix for each covariate) containing mean effects of covariates on pairwise interactions

  • mean_key_coefs: list of matrices of length n_nodes containing mean covariate coefficient values and their relative importances (using the formula x^2 / sum(x^2)) taken from all bootstrapped models across iterations. Only coefficients with mean relative importances >0.01 are returned. Note that relative importances are only useful if all covariates are on a similar scale.

  • mod_type: A character stating the type of model that was fit (used in other functions)

  • mod_family: A character stating the family of model that was fit (used in other functions)

  • poiss_sc_factors: A vector of the square-root mean scaling factors used to standardise poisson variables (only returned if family = "poisson")
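The relative importance formula used for mean_key_coefs, x^2 / sum(x^2), can be worked through for one node. The coefficient values and names below are made up for illustration.

```r
# Hypothetical mean coefficients for one node's regression
coefs <- c(temp = 0.9, rain = -0.3, sp2 = 0.05)

# Relative importance: squared coefficient over the sum of squares
rel_importance <- coefs^2 / sum(coefs^2)

# Keep only coefficients above the documented 0.01 threshold
key_coefs <- rel_importance[rel_importance > 0.01]
names(key_coefs)
```

Here the sum of squares is 0.9025, so sp2 has a relative importance of roughly 0.003 and is dropped, while temp and rain are kept. Squaring removes the sign, which is why the measure is only meaningful when covariates share a similar scale.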

See Also

MRFcov, MRFcov_spatial, cv.glmnet

Examples


data("Bird.parasites")

# Perform 2 quick bootstrap replicates using 70% of observations
bootedCRF <- bootstrap_MRF(data = Bird.parasites,
                          n_nodes = 4,
                          family = 'binomial',
                          sample_prop = 0.7,
                          n_bootstraps = 2)


# Small example of using spatial coordinates for a spatial CRF
Latitude <- sample(seq(120, 140, length.out = 100), nrow(Bird.parasites), TRUE)
Longitude <- sample(seq(-19, -22, length.out = 100), nrow(Bird.parasites), TRUE)
coords <- data.frame(Latitude = Latitude, Longitude = Longitude)
bootedSpatial <- bootstrap_MRF(data = Bird.parasites, n_nodes = 4,
                             family = 'binomial',
                             spatial = TRUE,
                             coords = coords,
                             sample_prop = 0.5,
                             n_bootstraps = 2)

nicholasjclark/MRFcov documentation built on March 30, 2024, 10:31 p.m.