#' Run a randomization analysis for one or more biodiversity metrics
#'
#' The observed value of the biodiversity metric(s) will be calculated for the
#' input community data, then compared against a set of random communities.
#' Various statistics are calculated from the comparison (see **Value** below).
#'
#' The biodiversity metrics (`metrics`) available for analysis include:
#'
#' - `pd`: Phylogenetic diversity (Faith 1992)
#' - `rpd`: Relative phylogenetic diversity (Mishler et al 2014)
#' - `pe`: Phylogenetic endemism (Rosauer et al 2009)
#' - `rpe`: Relative phylogenetic endemism (Mishler et al 2014)
#'
#' (`pe` and `rpe` are needed for CANAPE with
#' \code{\link{cpr_classify_endem}()})
#'
#' The choice of a randomization algorithm (`null_model`) is not trivial, and
#' may strongly affect results. `cpr_rand_test()` uses null models provided by
#' `vegan`; for a complete list, see the help file of [vegan::commsim()] or run
#' [vegan::make.commsim()]. One frequently used null model is `swap` (Gotelli &
#' Entsminger 2003), which randomizes the community matrix while preserving
#' column and row sums (marginal sums). For a review of various null models, see
#' Strona et al. (2018); `swap` is an "FF" model in the sense of Strona et al.
#' (2018).
#'
#' Instead of using one of the predefined null models in [vegan::commsim()], it
#' is also possible to define a custom null model; see **Examples** in
#' \code{\link{cpr_rand_comm}()}
#'
#' Note that the pre-defined models in `vegan` include binary models (designed
#' for presence-absence data) and quantitative models (designed for abundance
#' data). Although the binary models will accept abundance data, they treat it
#' as binary and always return a binary (presence-absence) matrix. The PD and PE
#' calculations in `canaper` are not abundance-weighted, so they return the same
#' result regardless of whether the input is presence-absence or abundance. In
#' that sense, binary null models are appropriate for `cpr_rand_test()`. The
#' quantitative models could also be used for abundance data, but the output
#' will be treated as binary anyways when calculating PD and PE. The effects of
#' using binary vs. quantitative null models for `cpr_rand_test()` have not been
#' investigated.
#'
#' A minimum of 5 species and sites are required as input; fewer than that is
#' likely cause the some randomization algorithms (e.g., `swap`) to enter an
#' infinite loop. Besides, inferences on very small numbers of species and/or
#' sites is not recommended generally.
#'
#' The following rules apply to `comm` input:
#' - If dataframe or matrix, must include row names (site names) and column
#' names (species names).
#' - If tibble, a single column (default, `site`) must be included with site
#' names, and other columns must correspond to species names.
#' - Column names cannot start with a number and must be unique.
#' - Row names (site names) must be unique.
#' - Values (other than site names) should only include integers >= 0;
#' non-integer input will be converted to integer.
#'
#' The results are identical regardless of whether the input for `comm` is
#' abundance or presence-absence data (i.e., abundance weighting is not used).
#'
#' @srrstats {G2.0a, G2.1a, G2.3b} Documents expectations on lengths,
#' types of vector inputs, case-sensitivity
#' @srrstats {G2.7, UL1.0} accept dataframe, tibble, or matrix
#' @srrstats {UL4.3a} If `tbl_out` is `TRUE`, restricts number of rows/columns
#' printed to screen
#' @param comm Dataframe, tibble, or matrix; input community data with
#' sites (communities) as rows and species as columns. Either presence-absence
#' data (values only 0s or 1s) or abundance data (values >= 0) accepted, but
#' calculations do not use abundance-weighting, so results from abundance data
#' will be the same as if converted to presence-absence before analysis.
#' @param phy List of class `phylo`; input phylogeny.
#' @param null_model Character vector of length 1 or object of class `commsim`;
#' either the name of the model to use for generating random communities (null
#' model), or a custom null model. For full list of available predefined null
#' models, see the help file of [vegan::commsim()], or run
#' [vegan::make.commsim()]. An object of class `commsim` can be generated with
#' [vegan::commsim()] (see **Examples** in \code{\link{cpr_rand_comm}()}).
#' @param n_reps Numeric vector of length 1; number of random communities to
#' replicate.
#' @param n_iterations Numeric vector of length 1; number of iterations to use
#' for sequential null models; ignored for non-sequential models.
#' @param thin Numeric vector of length 1; thinning parameter used by some
#' null models in `vegan` (e.g., `quasiswap`); ignored for other models.
#' @param metrics Character vector; names of biodiversity metrics to calculate.
#' May include one or more of: `pd`, `rpd`, `pe`, `rpe` (case-sensitive).
#' @param site_col Character vector of length 1; name of column in `comm` that
#' contains the site names; only used if `comm` is a tibble (object of class
#' `tbl_df`).
#' @param tbl_out Logical vector of length 1; should the output be returned as
#' a tibble? If `FALSE`, will return a dataframe. Defaults to `TRUE` if `comm`
#' is a tibble.
#' @param quiet Logical vector of length 1; if `TRUE`, suppress all warnings and
#' messages that would be emitted by this function.
#'
#' @srrstats {G1.3} defines terminology (also in 'details')
#' @srrstats {UL3.4} output includes variances in random values
#' @return Dataframe. For each of the biodiversity metrics, the following 9
#' columns will be produced:
#' - `*_obs`: Observed value
#' - `*_obs_c_lower`: Count of times observed value was lower than random values
#' - `*_obs_c_upper`: Count of times observed value was higher than random
#' values
#' - `*_obs_p_lower`: Percentage of times observed value was lower than random
#' values
#' - `*_obs_p_upper`: Percentage of times observed value was higher than random
#' values
#' - `*_obs_q`: Count of the non-NA random values used for comparison
#' - `*_obs_z`: Standard effect size (z-score)
#' - `*_rand_mean`: Mean of the random values
#' - `*_rand_sd`: Standard deviation of the random values
#'
#' So if you included `pd` in `metrics`, the output columns would include
#' `pd_obs`, `pd_obs_c_lower`, etc...
#'
#' @srrstats {G1.0} Cites original refs:
#' @references Faith DP (1992) Conservation evaluation and phylogenetic
#' diversity. Biological Conservation, 61:1–10.
#' \doi{10.1016/0006-3207(92)91201-3}
#' @references Gotelli, N.J. and Entsminger, N.J. (2003). Swap algorithms in
#' null model analysis. Ecology 84, 532–535.
#' @references Mishler, B., Knerr, N., González-Orozco, C. et al. (2014)
#' Phylogenetic measures of biodiversity and neo- and paleo-endemism in
#' Australian Acacia. Nat Commun, 5: 4473. \doi{10.1038/ncomms5473}
#' @references Rosauer, D., Laffan, S.W., Crisp, M.D., Donnellan, S.C. and Cook,
#' L.G. (2009) Phylogenetic endemism: a new approach for identifying
#' geographical concentrations of evolutionary history. Molecular Ecology, 18:
#' 4061-4072. \doi{10.1111/j.1365-294X.2009.04311.x}
#' @references Strona, G., Ulrich, W. and Gotelli, N.J. (2018), Bi-dimensional
#' null model analysis of presence-absence binary matrices.
#' Ecology, 99: 103-115. \doi{10.1002/ecy.2043}
#'
#' @examples
#' \donttest{
#' set.seed(12345)
#' data(phylocom)
#' # Returns a dataframe by defualt
#' cpr_rand_test(
#' phylocom$comm, phylocom$phy,
#' null_model = "curveball", metrics = "pd", n_reps = 10
#' )
#'
#' # Tibbles may be preferable because of the large number of columns
#' cpr_rand_test(
#' phylocom$comm, phylocom$phy,
#' null_model = "curveball", tbl_out = TRUE, n_reps = 10
#' )
#' }
#' @srrstats {G1.4} uses roxygen
#'
#' @export
cpr_rand_test <- function(comm, phy, null_model,
n_reps = 100, n_iterations = 10000, thin = 1,
metrics = c("pd", "rpd", "pe", "rpe"),
site_col = "site", tbl_out = tibble::is_tibble(comm),
quiet = FALSE) {
# First check `quiet` flag
assertthat::assert_that(assertthat::is.flag(quiet))
# Capture state of n_reps and n_iterations for testing if different from
# default later
n_reps_input <- n_reps
n_iterations_input <- n_iterations
# Check input ----
# `null_model`, `n_reps`, `n_iterations`, `thin`, `metrics`, `tbl_out`
#' @srrstats {G2.0, G2.2, G2.1, G2.3, G2.3a, G2.4a, G2.6, G2.13, G2.14, G2.14a, G2.15, G2.16} # nolint
#' check input types and lengths, missingness, undefined values, values of
#' univariate char input, convert to integer before numeric comparisons
# - null_model
assertthat::assert_that(
assertthat::is.string(null_model) | inherits(null_model, "commsim"),
msg = "'null_model' must be a string (character vector of length 1) or an object of class 'commsim'" # nolint
)
assertthat::assert_that(assertthat::noNA(null_model))
# - n_reps
assertthat::assert_that(assertthat::is.number(n_reps))
assertthat::assert_that(assertthat::noNA(n_reps))
assertthat::assert_that(is.finite(n_reps))
n_reps <- as.integer(n_reps)
assertthat::assert_that(is.integer(n_reps))
assertthat::assert_that(n_reps > 0, msg = "'n_reps' must be > 0")
# - n_iterations
assertthat::assert_that(assertthat::noNA(n_iterations))
assertthat::assert_that(is.finite(n_iterations))
assertthat::assert_that(assertthat::is.number(n_iterations))
n_iterations <- as.integer(n_iterations)
assertthat::assert_that(is.integer(n_iterations))
assertthat::assert_that(
n_iterations > 0,
msg = "'n_iterations' must be > 0"
)
# - thin
assertthat::assert_that(assertthat::is.number(thin))
assertthat::assert_that(assertthat::noNA(thin))
assertthat::assert_that(is.finite(thin))
thin <- as.integer(thin)
assertthat::assert_that(is.integer(thin))
assertthat::assert_that(thin > 0, msg = "'thin' must be > 0")
# - metrics
assertthat::assert_that(is.character(metrics))
assertthat::assert_that(assertthat::noNA(metrics))
assertthat::assert_that(
isTRUE(all(metrics %in% c("pd", "rpd", "pe", "rpe"))),
msg = "'metrics' may only include 'pd', 'rpd', 'pe', or 'rpe'"
)
# - tbl_out
assertthat::assert_that(assertthat::is.flag(tbl_out))
# Check input: `comm` ----
#' @srrstats {UL1.1} assert that all input data is of the expected form
assertthat::assert_that(
inherits(comm, "data.frame") | inherits(comm, "matrix"),
msg = "'comm' must be of class 'data.frame' or 'matrix'"
)
#' @srrstats {G2.8} Convert tibble to dataframe
if (tibble::is_tibble(comm)) {
assertthat::assert_that(assertthat::is.string(site_col))
assertthat::assert_that(assertthat::noNA(site_col))
assertthat::assert_that(
isTRUE(site_col %in% colnames(comm)),
msg = "'site_col' must be one of the column names of 'comm'"
)
assertthat::assert_that(
isTRUE(all(assertr::is_uniq(comm[[site_col]]))),
msg = "Site names must all be unique"
)
assertthat::assert_that(
assertthat::noNA(comm[[site_col]]),
msg = "Site names must not include any missing data"
)
assertthat::assert_that(
!isTRUE(grepl(site_col, "_obs$|_rand_|_obs_")),
msg = "Name of 'site_col' cannot resemble names of column names in 'cpr_rand_test()' output" # nolint
)
comm <- tibble::column_to_rownames(comm, site_col)
}
#' @srrstats {G2.8} Convert matrix to dataframe
#' @srrstats {UL1.2} Check that column names are unchanged after conversion
if (inherits(comm, "matrix")) {
comm_df <- data.frame(comm)
assertthat::assert_that(
isTRUE(
all.equal(
colnames(comm),
colnames(comm_df)
)
),
msg = "Column names of 'comm' changed after conversion from matrix to dataframe. Do any column names start with a number or include duplicates? This is not allowed." # nolint
)
comm <- comm_df
}
#' @srrstats {UL1.2} Check for default-looking rownames
# Default rownames not allowed because phyloregion::dense2sparse() will
# convert them to NULL
assertthat::assert_that(
!identical(rownames(comm), as.character(seq(nrow(comm)))),
msg = "'comm' cannot have default row names (consecutive integers from 1 to the number of rows)" # nolint
)
assertthat::assert_that(
isTRUE(all(assertr::is_uniq(colnames(comm), allow.na = FALSE))),
msg = "'comm' must have unique column names"
)
assertthat::assert_that(
isTRUE(all(assertr::is_uniq(rownames(comm), allow.na = FALSE))),
msg = "'comm' must have unique rownames"
)
assertthat::assert_that(assertthat::noNA(colnames(comm)))
assertthat::assert_that(assertthat::noNA(rownames(comm)))
#' @srrstats {G2.15, UL1.1} don't assume non-missingness
assertthat::assert_that(
assertr::assert(
comm, assertr::not_na, dplyr::everything(),
success_fun = assertr::success_logical,
error_fun = assertr::error_logical
),
msg = "No missing values allowed in 'comm'"
)
#' @srrstats {G2.11, UL1.1} check for class attributes in dataframe
numeric_check <- NULL
for (i in seq_along(comm)) {
numeric_check[i] <- is.vector(comm[, i], mode = "numeric")
}
assertthat::assert_that(
isTRUE(all(numeric_check)),
msg = "All columns of 'comm' must be numeric"
)
#' @srrstats {G2.16, UL1.1} don't allow infinite values
assertthat::assert_that(
isTRUE(all(is.finite(unlist(comm)))),
msg = "No infinite values allowed in 'comm'"
)
#' @srrstats {G2.4a} Convert all values in comm to integer
comm <- dplyr::mutate(comm, dplyr::across(dplyr::everything(), as.integer))
#' @srrstats {UL1.4} Check that all values in comm are >= 0
assertthat::assert_that(
isTRUE(all(unlist(comm) >= 0)), # values have been converted to integer, so OK to use equalities # nolint
msg = "No negative values allowed in 'comm'"
)
#' @srrstats {UL1.4} Check for all 0s for a given site or species
assertthat::assert_that(
isTRUE(all(colSums(comm) > 0)),
msg = "Every species in 'comm' must occur in at least one site"
)
assertthat::assert_that(
isTRUE(all(rowSums(comm) > 0)),
msg = "Every site in 'comm' must have at least once species"
)
# Add warning for abundance data
if (isTRUE(any(unlist(comm) > 1)) && !quiet) {
warning(
"Abundance data detected. Results will be the same as if using presence/absence data (no abundance weighting is used).", # nolint
call. = FALSE,
immediate. = TRUE
)
}
#' @srrstats {UL1.4, UL2.0}
#' Warn if community matrix is >95% present or absent and defaults are
#' unchanged
percent_present <- (sum(unlist(comm) > 0)) / (length(unlist(comm)))
if (
percent_present > 0.95 &&
n_reps_input == 100 &&
n_iterations_input == 10000 &&
!quiet
) {
# nolint
warning(
"'comm' is > 95% presences (values > 1). Be sure that 'n_reps' and 'n_iterations' are sufficiently large to ensure adequate mixing of random communities", # nolint
immediate. = TRUE,
call. = FALSE
)
}
if (
percent_present < 0.05 &&
n_reps_input == 100 &&
n_iterations_input == 10000 && !quiet
) {
# nolint
warning(
"'comm' is > 95% absences (zeros). Be sure that 'n_reps' and 'n_iterations' are sufficiently large to ensure adequate mixing of random communities", # nolint
immediate. = TRUE,
call. = FALSE
)
}
# Check input: `phylo` ----
assertthat::assert_that(
is.list(phy) && inherits(phy, "phylo"),
msg = "'phy' must be a list of class 'phylo'"
)
# Check for phylo unique tip labels
assertthat::assert_that(
isTRUE(all(assertr::is_uniq(phy$tip.label, allow.na = FALSE))),
msg = "All tip labels in 'phy' must be unique"
)
#' @srrstats {UL1.4} Check for non-negative branch lengths
assertthat::assert_that(
isTRUE(all(phy$edge.length %>=% 0)), # nolint
msg = "'phy' may not have negative branchlengths"
)
assertthat::assert_that(is.numeric(phy$edge.length)) # nolint
assertthat::assert_that(assertthat::noNA(phy$edge.length)) # nolint
assertthat::assert_that(is.character(phy$tip.label)) # nolint
assertthat::assert_that(assertthat::noNA(phy$tip.label)) # nolint
assertthat::assert_that(
!any(purrr::map_lgl(phy$edge.length, is.infinite)), # nolint
msg = "'phy' may not have infinite branchlengths"
)
# Match input between `comm` and `phylo` ----
# Match tips of tree and column names of community data frame:
# Use only taxa that are in common between phylogeny and community
subsetted_data <- match_phylo_comm(phy = phy, comm = comm, quiet = quiet)
phy <- subsetted_data[["phy"]]
comm <- subsetted_data[["comm"]]
# Make sure at least one taxon matches between comm and phy
assertthat::assert_that(
isTRUE(
!is.null(phy) && !is.null(comm)
),
msg = "Tip names don't match between community and phylogeny"
)
assertthat::assert_that(
isTRUE(
all.equal(
sort(colnames(comm)),
sort(phy$tip.label)
)
),
msg = "Tip names don't match between community and phylogeny"
)
# Verify that at least 5 species and sites are present
assertthat::assert_that(
ape::Ntip(phy) > 4,
msg = "'phy' and 'comm' must share at least 5 species in common"
)
assertthat::assert_that(
nrow(comm) > 4,
msg = "'comm' must include at least 5 sites"
)
assertthat::assert_that(
ncol(comm) > 4,
msg = "'phy' and 'comm' must share at least 5 species in common"
)
assertthat::assert_that(assertthat::not_empty(comm))
# Prepare for calculations ----
# Make alternative tree with equal branch lengths
phy_alt <- phy
# convert **non-zero** branch lengths to same value (1)
non_zero_branches <- purrr::map_lgl(phy_alt$edge.length, ~ !isTRUE(all.equal(., 0))) # nolint
phy_alt$edge.length[non_zero_branches] <- rep(x = 1, times = length(phy_alt$edge.length[non_zero_branches])) # nolint
# rescale so total phy length is 1
phy_alt$edge.length <- phy_alt$edge.length / sum(phy_alt$edge.length) # nolint
# rescale original phy so total length is 1
phy$edge.length <- phy$edge.length / sum(phy$edge.length) # nolint
# Make sparse community df
comm_sparse <- phyloregion::dense2sparse(comm)
# Check phylogenies after modifications
#' @srrstats {G2.1} Check input types
assertthat::assert_that(
is.list(phy_alt) && inherits(phy_alt, "phylo"),
msg = "'phy_alt' must be a list of class 'phylo'"
)
assertthat::assert_that(
!isTRUE((all.equal(phy, phy_alt))),
msg = "'phy' and 'phy_alt' should not be identical"
)
assertthat::assert_that(
isTRUE(all.equal(sum(phy$edge.length), 1)), # nolint
msg = "phylogeny not rescaled to total length 1"
)
assertthat::assert_that(
isTRUE(all.equal(sum(phy_alt$edge.length), 1)), # nolint
msg = "alternative phylogeny not rescaled to total length 1"
)
# Calculate biodiversity metrics ----
## Loop over random communities
pb <- progressr::progressor(steps = n_reps) # Set up a progress bar
random_vals <-
future.apply::future_lapply(
1:n_reps,
function(x) {
pb()
calc_biodiv_random(
comm = comm, phy = phy, phy_alt = phy_alt,
null_model = null_model,
n_iterations = n_iterations,
thin = thin,
metrics = metrics
)
},
future.seed = TRUE
)
# Calculate biodiversity metrics for observed community
# - set up null vectors first
ses_pd <- NULL
ses_pd_alt <- NULL
ses_rpd <- NULL
ses_pe <- NULL
ses_pe_alt <- NULL
ses_rpe <- NULL
# - calculate selected metrics
if ("pd" %in% metrics) {
pd_obs <- phyloregion::PD(comm_sparse, phy)
ses_pd <- get_ses(random_vals, pd_obs, "pd")
}
if ("rpd" %in% metrics) {
if (!exists("pd_obs")) pd_obs <- phyloregion::PD(comm_sparse, phy)
pd_alt_obs <- phyloregion::PD(comm_sparse, phy_alt)
ses_pd_alt <- get_ses(random_vals, pd_alt_obs, "pd_alt")
rpd_obs <- pd_obs / pd_alt_obs
ses_rpd <- get_ses(random_vals, rpd_obs, "rpd")
}
if ("pe" %in% metrics) {
pe_obs <- phyloregion::phylo_endemism(comm_sparse, phy, weighted = TRUE)
ses_pe <- get_ses(random_vals, pe_obs, "pe")
}
if ("rpe" %in% metrics) {
if (!exists("pe_obs")) {
pe_obs <-
phyloregion::phylo_endemism(comm_sparse, phy, weighted = TRUE)
}
pe_alt_obs <- phyloregion::phylo_endemism(
comm_sparse, phy_alt,
weighted = TRUE
)
ses_pe_alt <- get_ses(random_vals, pe_alt_obs, "pe_alt")
rpe_obs <- pe_obs / pe_alt_obs
ses_rpe <- get_ses(random_vals, rpe_obs, "rpe")
}
## Combine results
# (don't use pipe, to maintain backwards compatibility with R < 4.1)
results_df <- dplyr::bind_cols(
ses_pd,
ses_pd_alt,
ses_rpd,
ses_pe,
ses_pe_alt,
ses_rpe
)
results_df <- dplyr::mutate(results_df, site = rownames(comm))
results_df <- tibble::column_to_rownames(results_df, "site")
# If tibble was input, return tibble starting with the "site" column
if (isTRUE(tbl_out)) {
results_tbl <- tibble::rownames_to_column(results_df, site_col)
results_tbl <- tibble::as_tibble(results_tbl)
results_tbl <- dplyr::select(
results_tbl, dplyr::all_of(site_col), dplyr::everything()
)
return(results_tbl)
} else {
return(results_df)
}
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.