gbm.auto: Automated Boosted Regression Tree modelling and mapping suite

gbm.auto    R Documentation

Automated Boosted Regression Tree modelling and mapping suite

Description

Automates delta log-normal boosted regression tree (BRT) abundance prediction. Loops through all combinations of the parameters provided (learning rate, tree complexity, bag fraction), chooses the best, then simplifies it. Generates line, dot and bar plots, and outputs these along with the predictions and a report of all variables used, statistics for tests, variable interactions, predictors used and dropped, etc. If selected, generates predicted abundance maps and Unrepresentativeness surfaces. See www.GitHub.com/SimonDedman/gbm.auto for issues, feedback, and development suggestions. See SimonDedman.com for links to the walkthrough paper, and to papers and the thesis published using this package.

Usage

gbm.auto(
  grids = NULL,
  samples,
  expvar,
  resvar,
  randomvar = FALSE,
  tc = c(2),
  lr = c(0.01, 0.005),
  bf = 0.5,
  offset = NULL,
  n.trees = 50,
  ZI = "CHECK",
  fam1 = c("bernoulli", "binomial", "poisson", "laplace", "gaussian"),
  fam2 = c("gaussian", "bernoulli", "binomial", "poisson", "laplace"),
  simp = TRUE,
  gridslat = 2,
  gridslon = 1,
  samplesGridsAreaScaleFactor = 1,
  multiplot = TRUE,
  cols = grey.colors(1, 1, 1),
  linesfiles = TRUE,
  smooth = FALSE,
  savedir = tempdir(),
  savegbm = TRUE,
  loadgbm = NULL,
  varint = TRUE,
  map = TRUE,
  shape = NULL,
  RSB = TRUE,
  BnW = TRUE,
  alerts = TRUE,
  pngtype = c("cairo-png", "quartz", "Xlib"),
  gaus = TRUE,
  MLEvaluate = TRUE,
  brv = NULL,
  grv = NULL,
  Bin_Preds = NULL,
  Gaus_Preds = NULL,
  ...
)

Arguments

grids

Explanatory data to predict to. Import with (e.g.) read.csv and specify object name. Defaults to NULL (won't predict to grids).

samples

Explanatory and response variables to predict from. Keep column names short (~17 characters max), with no odd characters, spaces, leading numerals or terminal periods. Spaces may be converted to periods in directory names; underscores won't be. Can be a subset of a large dataset.
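
The naming advice above can be checked before a run. The following is a minimal base R sketch; check_colnames is a hypothetical helper and not part of gbm.auto, and the 17-character limit is simply the approximate figure quoted above.

```r
# Hypothetical helper (not part of gbm.auto): return the column names that
# break the naming advice above.
check_colnames <- function(df, max_chars = 17) {
  nm <- names(df)
  bad <- nchar(nm) > max_chars |   # too long
    nm != make.names(nm) |         # spaces, odd characters, leading numerals
    grepl("\\.$", nm)              # terminal period
  nm[bad]
}

# Toy data frame with one good and three problematic names.
toy <- as.data.frame(matrix(0, nrow = 2, ncol = 4))
names(toy) <- c("Temp", "Sea Surface Temp", "2ndSurvey", "Depth.")
check_colnames(toy)
```

Any names returned can then be fixed with colnames(samples)[n] <- "BetterName".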

expvar

Vector of names or column numbers of explanatory variables in 'samples': c(1,3,6) or c("Temp","Sal"). No default.

resvar

Name or column number(s) of response variable in samples: 12, c(1,4), "Rockfish". No default. Column name is ideally species name.

randomvar

Add a random variable (uniform distribution, 0-1) to the expvars, to see whether other expvars perform better or worse than random.

tc

Tree complexity values to try. Can be a vector whose largest value is no larger than the number of explanatory variables, e.g. c(2,7), or a list of 2 single numbers or vectors, the first passed to the binary BRT and the second to the Gaussian, e.g. tc = list(c(2,6), 2) or list(6, c(2,6)).

lr

Learning rate values to try. Can be a vector, or a list of 2 single numbers or vectors, the first passed to the binary BRT and the second to the Gaussian, e.g. lr = list(c(0.01,0.02),0.0001) or list(0.01,c(0.001, 0.0005)).

bf

Bag fraction values to try. Can be a single number, vector or list, as for tc and lr. Defaults to 0.5.
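
Because every tc, lr and bf value is combined with every other, the number of model fits grows multiplicatively. The sketch below uses base R's expand.grid to count the combinations and shows how the list form splits between the two model parts. It only illustrates the arithmetic and is not gbm.auto's own loop; the object names are illustrative.

```r
# Candidate values, in the vector form described above.
tc <- c(2, 7)
lr <- c(0.01, 0.005)
bf <- 0.5

# One row per tc/lr/bf combination (base R; not gbm.auto's own loop).
combos <- expand.grid(tc = tc, lr = lr, bf = bf)
nrow(combos)  # 2 * 2 * 1 = 4 combinations

# List form: first element for the binary BRT, second for the Gaussian.
tc_list     <- list(c(2, 6), 2)
tc_binary   <- tc_list[[1]]
tc_gaussian <- tc_list[[2]]
```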

offset

Column number or quoted name in samples, containing offset values relating to the samples. A numeric vector of length equal to the number of cases. Similar to weighting, see https://towardsdatascience.com/offsetting-the-model-logic-to-implementation-7e333bc25798 .

n.trees

Number of initial trees to fit, as in gbm.step. Can be a single number or a list, but not a vector, i.e. list(fam1, fam2) with one value per family.

ZI

Are data zero-inflated? Choose one of TRUE, FALSE or "CHECK". TRUE: delta BRT, log-normalised Gaus, reverse log-norm and bias corrected. FALSE: do Gaussian only, no log-normalisation. "CHECK": tests the data for you. Default is "CHECK". TRUE and FALSE aren't in quotes, "CHECK" is.
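
To get a feel for what a zero-inflated response looks like, and for how a delta approach splits it into two parts, here is a small base R sketch on made-up data. The cut-off that "CHECK" applies, the exact log transform and the bias correction used internally are not stated here, so log1p and the object names below are assumptions for illustration only.

```r
# Toy response with many zeroes (hypothetical data).
catch <- c(0, 0, 0, 3, 0, 12, 0, 1, 0, 0)

# Proportion of zero records: a high value suggests zero inflation.
prop_zero <- mean(catch == 0)

# Part 1 (fam1): presence/absence of the response.
presence <- as.integer(catch > 0)

# Part 2 (fam2): logged positive values only.
# Assumption: log1p is used here purely for illustration.
log_pos <- log1p(catch[catch > 0])

# Reversing the log recovers the original positive values.
expm1(log_pos)
```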

fam1

Probability distribution family for 1st part of delta process, defaults to "bernoulli". Choose one.

fam2

Probability distribution family for 2nd part of delta process, defaults to "gaussian". Choose one.

simp

Try simplifying best BRTs?

gridslat

Column number for latitude in 'grids'.

gridslon

Column number for longitude in 'grids'.

samplesGridsAreaScaleFactor

Scale up or down factor so values in the predict-to pixels of 'grids' match the spatial scale sampled by rows in 'samples'. Default 1 means no change.

multiplot

Create matrix plot of all line files? Default TRUE. Turn off if a large number of explanatory variables causes an error due to margin size problems.

cols

Barplot colour vector. Assigned in order of explanatory variables. The default, grey.colors(1, 1, 1), is a single colour, white: white bars with black borders. A single colour is repeated for all bars.

linesfiles

Save individual line plots' data as csv's? Default TRUE.

smooth

Apply a smoother to the line plots? Default FALSE.

savedir

Directory to save outputs to. Defaults to a temporary directory; otherwise give the path of your chosen (e.g. current) directory explicitly, e.g. "/home/me/folder". Do not use getwd() here.

savegbm

Save gbm objects and make available in environment after running? Open with load("Bin_Best_Model"). Default TRUE.

loadgbm

Relative or (very much preferably) absolute location of the folder containing Bin_Best_Model and Gaus_Best_Model. If set, skips the BRT calculations and produces only the predicted maps and csvs. This avoids re-running the BRT models (the slow bit): run normally once with savegbm = TRUE, then multiple times with new grids and loadgbm to predict to multiple grids, e.g. different seasons, areas, etc. Character vector; default NULL; use "./" for the working directory.

varint

Calculate variable interactions? Default TRUE. Set to FALSE if you get the error "contrasts can be applied only to factors with 2 or more levels".

map

Save abundance map png files?

shape

Enter the full path to downloaded map e.g. coastline shapefile, possibly from gbm.basemap, typically Crop_Map.shp, including the .shp. Can also name an existing object in the environment, read in with sf::st_read. Default NULL, in which case bounds calculated by gbm.mapsf which then calls gbm.basemap to download and auto-generate the base map.

RSB

Run Unrepresentativeness surface builder? Default TRUE.

BnW

Repeat maps in black and white e.g. for print journals. Default TRUE.

alerts

Play sounds to mark progress steps. Default TRUE but running multiple small BRTs in a row (e.g. gbm.loop) can cause RStudio to crash.

pngtype

Filetype for png files, alternatively try "quartz" on Mac. Choose one.

gaus

Do family2 (typically Gaussian) runs as well as family1 (typically Bin)? Default TRUE.

MLEvaluate

Do machine learning evaluation metrics & plots? Default TRUE.

brv

Dummy param for package testing for CRAN, ignore.

grv

Dummy param for package testing for CRAN, ignore.

Bin_Preds

Dummy param for package testing for CRAN, ignore.

Gaus_Preds

Dummy param for package testing for CRAN, ignore.

...

Optional arguments passed on to gbm.step (dismo package), e.g. n.trees and max.trees, both of which can be given in list(1,2) format to pass separate values to fam1 and fam2; and to gbm.mapsf, e.g. colourscale, heatcolours, colournumber, and others.

Details

Errors and their origins:

  1. install ERROR: dependencies ‘rgdal’, ‘rgeos’ are not available for package ‘gbm.auto’. For Linux/*buntu systems, in terminal, type: 'sudo apt install libgeos-dev', 'sudo apt install libproj-dev', 'sudo apt install libgdal-dev'.

  2. Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables. Check your variable types are correct, e.g. numerics haven't been imported as factors because there's an errant first row of text information before the data. Remove NA rows from the response variable if present: convert blank cells to NA on import with read.csv(x, na.strings = "") then samples2 <- samples[!is.na(samples[, resvar_column_number]), ]

  3. At BF=0.5, if nrows <= 42, gbm.step will crash. Use gbm.bfcheck to determine optimal viable BF size.

  4. Maps/plots don't work/output. If on a Mac, try changing pngtype to "quartz".

  5. Error in while (delta.deviance > tolerance.test & n.fitted < max.trees): missing value where TRUE/FALSE needed. If running a zero-inflated delta model (bernoulli/bin & gaussian/gaus), data are expected to contain zeroes (lots of them in zero-inflated cases). Have you already filtered them out, i.e. are you only testing the positive cases? Or do you only have positive cases? If so, only run (e.g.) Gaussian: set ZI to FALSE.

  6. Error in round(gbm.object$cv.statistics$deviance.mean, 4) : non-numeric argument to mathematical function. LR or BF probably too low in earlier BRT (normally Gaus run with highest TC).

  7. Error in if (n.trees > x$n.trees) argument is of length zero. LR or BF probably too low in earlier BRT (normally Gaus run with highest TC).

  8. Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w): The dataset size is too small or subsampling rate is too large: nTrain*bag.fraction <= n.minobsinnode. LR or BF probably too low in earlier BRT (normally Gaus run with highest TC). It may be that you don't have enough positive samples to run BRT modelling. Run gbm.bfcheck to check recommended minimum BF size.

  9. Warning message: In cor(y_i, u_i) : the standard deviation is zero. LR or BF probably too low in earlier BRT (normally Gaus run with highest TC). It may be that you don't have enough positive samples to run BRT modelling. Run gbm.bfcheck to check recommended minimum BF size. Similarly: glm.fit: fitted probabilities numerically 0 or 1 occurred, and glm.fit: algorithm did not converge. Similarly: Error in if (get(paste0("Gaus_BRT", ".tc", j, ".lr", k, ".bf", l))$self.statistics$correlation[[1]]: argument is of length zero. See also: Error 15.

  10. Anomalous values can obfuscate clarity in line plots e.g. salinity range 32:35ppm but dataset has errant 0 value: plot axis will be 0:35, and 99.99% of the data will be in the tiny bit at the right. Clean your data beforehand.

  11. Error in plot.new() : figure margins too large: In RStudio, adjust plot pane (usually bottom right) to increase its size. Still fails? Set multiplot=FALSE.

  12. Error in dev.print(file = paste0("./", names(samples[i]), "/pred_dev_bin.jpeg"): can only print from a screen device. An earlier failed run (e.g. LR/BF too low) left a plotting device open. Close it with: 'dev.off()'.

  13. RStudio crashed: set alerts=F and pause cloud sync programs if outputting to a synced folder.

  14. Error in grDevices::dev.copy(device = function (filename = "Rplot%03d.jpeg", could not open file './resvar/pred_dev_bin.jpeg' (or similar). Your resvar column name contains an illegal character e.g. /&'_. Fix with colnames(samples)[n] <- "BetterName".

  15. Error in gbm.fit: Poisson requires the response to be a positive integer. If running Poisson distributions, ensure the response variables are positive integers, but if they are, try a smaller LR.

  16. If line plots of factor variables include empty columns, remove unused levels before the gbm.auto run with samples %<>% droplevels() (the %<>% operator is from magrittr) or samples <- droplevels(samples).

  17. Error in seq.default(from = min(x$var.levels[[i.var[i]]]), to = max(x$var.levels[[i.var[i]]]):'from' must be a finite number. If you logged any expvars with log() and they have zeroes in them, those zeroes became -Inf. Use log1p() instead.

  18. Error in loadNamespace...'dismo' 1.3-9 is being loaded, but >= 1.3.10 is required: first do remotes::install_github("rspatial/dismo") then library(dismo).

  19. Error in if (scope >= 160) res <- "c" : missing value where TRUE/FALSE needed. Check gridslat and gridslon are indexing the correct columns in grids.
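
Items 3 and 8 above both come down to how many rows remain after bagging. A rough sketch of that arithmetic follows, assuming the fit fails whenever rows * BF is at or below 21, the value implied by item 3 (0.5 * 42 = 21). The function bf_is_viable and the constant are illustrative only and are not part of gbm.auto; gbm.bfcheck remains the tool to use for real data.

```r
# Assumption: the fit fails when n_rows * bf <= 21, the value implied by
# item 3 (0.5 * 42 = 21). Hypothetical helper, not part of gbm.auto.
min_rows_times_bf <- 21

bf_is_viable <- function(n_rows, bf) {
  n_rows * bf > min_rows_times_bf
}

bf_is_viable(42, 0.5)   # at the limit described in item 3
bf_is_viable(43, 0.5)   # one more row
bf_is_viable(30, 0.75)  # fewer rows, larger bag fraction
```

For the second part of a delta model, the relevant row count would be the number of positive samples rather than the full dataset, which is why too few positives can trigger these errors.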

ALSO: check this section in the other functions run by gbm.auto e.g. gbm.mapsf, gbm.basemap. Use traceback() to find the source of errors.
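
The data-cleaning advice in items 2, 16 and 17 can be sketched in base R as follows. The data frame and its column names are made up for illustration; substitute your own samples object and response column.

```r
# Toy samples data (hypothetical): one NA in the response, one zero in Depth.
toy <- data.frame(
  Depth   = c(0, 10, 100, 50),
  Habitat = factor(c("sand", "mud", "sand", "rock")),
  Catch   = c(1, NA, 0, 4)
)

# Item 2: check column types, then drop rows with NA in the response.
sapply(toy, class)
toy <- toy[!is.na(toy$Catch), ]

# Item 16: drop factor levels no longer present ("mud" left with the NA row).
toy <- droplevels(toy)

# Item 17: log(0) is -Inf, whereas log1p(0) is 0, so log1p stays finite.
toy$LogDepth <- log1p(toy$Depth)
```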

I strongly recommend that you download papers 1 to 5 (or just the doctoral thesis) on http://www.simondedman.com, with emphasis on P4 (the guide) and P1 (statistical background). Elith et al 2008 (https://besjournals.onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2008.01390.x) is also strongly recommended. Just because you CAN try every conceivable combination of tc, lr and bf all at once doesn't mean you should. Try a range of lr in shrinking orders of magnitude from 0.1 to 0.000001 and find the best, THEN try tc c(2, n.expvars) and find the best, THEN try bf c(0.5, 0.75, 0.9), and then values in between if either 0.75 or 0.9 outperforms 0.5.
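
The staged search described above can be laid out as candidate vectors. This base R sketch only builds the candidates and compares the number of runs against the all-at-once approach; n_expvars = 6 is an arbitrary illustrative value and the object names are assumptions.

```r
# Stage 1: learning rates in shrinking orders of magnitude, 0.1 to 0.000001.
lr_candidates <- 10^(-(1:6))

# Stage 2: tree complexity, from 2 up to the number of explanatory variables.
n_expvars <- 6  # illustrative value
tc_candidates <- c(2, n_expvars)

# Stage 3: bag fractions.
bf_candidates <- c(0.5, 0.75, 0.9)

# Everything at once: every value combined with every other.
n_all_at_once <- nrow(expand.grid(lr = lr_candidates,
                                  tc = tc_candidates,
                                  bf = bf_candidates))

# Staged: one stage at a time, holding the other parameters fixed.
n_staged <- length(lr_candidates) + length(tc_candidates) +
  length(bf_candidates)

c(all_at_once = n_all_at_once, staged = n_staged)
```

Each stage's vector would be passed on its own as lr, tc or bf, with the best value from earlier stages held fixed.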

Value

Line, dot and bar plots, a report of all variables used, statistics for tests, variable interactions, predictors used and dropped, etc. If selected, generates predicted abundance maps, and Unrepresentativeness surface. Biggest Interactions in the report csv: see ?dismo::gbm.interactions .

Author(s)

Simon Dedman, simondedman@gmail.com

Examples


# Not run. Note: grids file was heavily cropped for CRAN upload so output map
# predictions only cover patchy chunks of the Irish Sea, not the whole area.
# Full versions of these files:
# https://drive.google.com/file/d/1WHYpftP3roozVKwi_R_IpW7tlZIhZA7r/view?usp=sharing
library(gbm.auto)
data(grids)
data(samples)
# Set your working directory
gbm.auto(grids = grids, samples = samples, expvar = c(4:8, 10), resvar = 11,
         tc = c(2, 7), lr = c(0.005, 0.001), ZI = TRUE, savegbm = FALSE)


SimonDedman/gbm.auto documentation built on Oct. 9, 2024, 8:57 p.m.