phy_or_env_spec: phy_or_env_spec

View source: R/phy_or_env_spec.r

phy_or_env_spec    R Documentation

phy_or_env_spec

Description

Calculates species' specificities to a 1-dimensional variable (vector), a 2-dimensional variable (matrix), or a phylogeny. All variable input types are transformed into a matrix D, and specificity is calculated by comparing empirical Rao's Quadratic Entropy (RQE) to simulated RQE (the same calculation, but with permuted abundances). By default (denom_type = "index"), an index is calculated from the empirical and simulated values such that Spec = 0 indicates random assortment (the null hypothesis), and more negative values indicate stronger specificity.
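The comparison described above can be illustrated with a minimal base-R sketch. It assumes one common form of RQE (the sum over sample pairs of the product of their proportional weights and their distance); the helper name rqe and the toy data are hypothetical, and the package's own implementation may differ in detail.

```r
# Hypothetical helper: RQE for one species' weights w and a distance matrix D,
# summing p_i * p_j * d_ij over the lower triangle of D.
rqe <- function(w, D) {
  p <- w / sum(w)
  sum((outer(p, p) * D)[lower.tri(D)])
}

env <- c(1, 2, 3, 4, 10)   # toy environmental variable, one value per sample
D   <- as.matrix(dist(env)) # pairwise environmental distances
w   <- c(5, 4, 0, 0, 0)     # toy species found only in two similar samples

emp <- rqe(w, D)            # empirical RQE

set.seed(1)
sims <- replicate(200, rqe(sample(w), D))  # RQE under permuted abundances

# A specific species has empirical RQE below the bulk of the simulated values.
emp < mean(sims)
```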

Usage

phy_or_env_spec(
  abunds_mat,
  env = NULL,
  hosts = NULL,
  hosts_phylo = NULL,
  n_sim = 1000,
  p_adj = "fdr",
  seed = 1234567,
  tails = 1,
  n_cores = 2,
  verbose = TRUE,
  p_method = "raw",
  center = "mean",
  denom_type = "index_full",
  diagnostic = F,
  chunksize = 1000,
  ga_params = get_ga_defaults()
)

Arguments

abunds_mat

matrix or data frame of numeric values. Columns represent species, rows are samples. For columns where the value is nonzero for two or fewer data points, specificity cannot be calculated, and NAs will be returned. Negative values in abunds_mat are not allowed (REQUIRED).
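A quick pre-check of these two conditions can be sketched in base R. The matrix m below is made-up toy data, not part of the package.

```r
# Toy abundance matrix: rows are samples, columns are species.
m <- cbind(
  sp1 = c(0, 3, 0, 0, 1),
  sp2 = c(2, 1, 4, 0, 5),
  sp3 = c(0, 0, 0, 0, 7)
)

# Negative abundances are not allowed.
stopifnot(all(m >= 0))

# Species that are nonzero in two or fewer samples; per the text above,
# specificity cannot be calculated for these and NA would be returned.
too_sparse <- colSums(m > 0) <= 2
too_sparse
```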

env

numeric vector, dist, or square matrix. Environmental variable corresponding to abunds_mat, for example temperature or geographic distance. Not required for computing phylogenetic specificity. If a square matrix is provided, note that only the lower triangle will be used (default: NULL).
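As a rough illustration of how the three accepted input types could each be turned into a square distance matrix, here is a sketch in base R. The helper to_D is hypothetical, and the use of Euclidean distance for vectors is an assumption; it is not the package's internal code.

```r
# Hypothetical helper: coerce a vector, dist object, or square matrix
# into a symmetric distance matrix.
to_D <- function(env) {
  if (inherits(env, "dist")) {
    return(as.matrix(env))
  }
  if (is.matrix(env)) {
    stopifnot(nrow(env) == ncol(env))
    # Use only the lower triangle, then mirror it.
    L <- env
    L[upper.tri(L, diag = TRUE)] <- 0
    return(L + t(L))
  }
  # Plain numeric vector: pairwise Euclidean distances (an assumption).
  as.matrix(dist(env))
}

D_vec <- to_D(c(1, 3, 6))
D_mat <- to_D(matrix(1:9, nrow = 3))  # asymmetric input; upper triangle ignored
```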

hosts

character vector. Host identities corresponding to abunds_mat. Only required if calculating phylogenetic specificity (default: NULL).

hosts_phylo

phylo object. Tree containing all unique hosts as tips. Only required if calculating phylogenetic specificity (default: NULL).

n_sim

integer. Number of simulations of abunds_mat to do under the null hypothesis that host or environmental association is random. P-values will not be calculated if n_sim < 100 (default: 1000).

p_adj

string. Type of multiple hypothesis testing correction performed on P-values. Can take any valid method argument to p.adjust, including "none", "bonferroni", "holm", "fdr", and others (default: "fdr").
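The effect of the correction can be previewed with p.adjust directly. The P-values below are invented for illustration.

```r
# Made-up raw P-values for five species.
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5)

p_fdr  <- p.adjust(p_raw, method = "fdr")         # Benjamini-Hochberg
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # more conservative

# Side-by-side comparison of raw and adjusted values.
data.frame(p_raw, p_fdr, p_bonf)
```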

seed

integer. Seed to use so that results are repeatable. The same seed is used for each species in abunds_mat, so all species experience the same permutations. This can be disabled by setting seed=0, which makes permutation both non-deterministic (not repeatable) and different for each species (default: 1234567).
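The idea of a shared seed can be sketched as follows. The helper perm_indices is hypothetical and only illustrates the general pattern (reseed before permuting, or skip seeding when seed is 0); the package's internals may be organized differently.

```r
# Hypothetical helper: a permutation of n sample indices.
# A positive seed makes the permutation repeatable; seed = 0 leaves the
# random number generator state untouched.
perm_indices <- function(n, seed) {
  if (seed > 0) set.seed(seed)
  sample.int(n)
}

a <- perm_indices(10, seed = 1234567)
b <- perm_indices(10, seed = 1234567)
identical(a, b)  # same seed, same permutation
```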

tails

integer. 1 = 1-tailed, test for specificity only. 2 = 2-tailed. 3 = 1-tailed, test for cosmopolitanism only. 0 = no test, P=1.0 (default: 1).

n_cores

integer. Number of CPU cores to use for parallel operations. If set to 1, lapply will be used instead of mclapply. A warning will be shown if n_cores > 1 on Windows, which does not support forked parallelism (default: 2).
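The switch between lapply and mclapply might look roughly like the sketch below. The wrapper par_apply is hypothetical; in particular, silently falling back to lapply on Windows is an assumption of this sketch, whereas the text above only says that a warning is shown.

```r
# Hypothetical wrapper: serial lapply for n_cores = 1 (or on Windows),
# forked parallel::mclapply otherwise.
par_apply <- function(X, FUN, n_cores = 2) {
  if (n_cores > 1 && .Platform$OS.type != "windows") {
    parallel::mclapply(X, FUN, mc.cores = n_cores)
  } else {
    lapply(X, FUN)
  }
}

squares <- par_apply(1:4, function(i) i^2, n_cores = 1)
```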

verbose

logical. Should status messages be displayed? (default: TRUE).

p_method

string. "raw" for quantile method, or "gamma_fit" for calculating P by fitting a gamma distribution (default: "raw").
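One plausible reading of the quantile method is that P is the proportion of simulated RQE values at least as extreme as the empirical value. The sketch below assumes exactly that and covers only the two one-tailed cases; the function raw_p and the numbers are hypothetical, and the package may apply corrections not shown here.

```r
# Hypothetical helper: quantile-style P-value from simulated RQE values.
raw_p <- function(emp, sim, tails = 1) {
  if (tails == 1) {
    mean(sim <= emp)   # specificity: empirical RQE unusually low
  } else if (tails == 3) {
    mean(sim >= emp)   # cosmopolitanism: empirical RQE unusually high
  } else {
    stop("only tails = 1 and tails = 3 are sketched here")
  }
}

sim <- c(0.2, 0.5, 0.6, 0.7, 0.9)  # made-up simulated RQE values
p_specific <- raw_p(0.3, sim, tails = 1)
p_cosmo    <- raw_p(0.3, sim, tails = 3)
```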

center

string. Type of central tendency to use for simulated RQE values. Options are "mean", "median", and "mode". If "mode" is chosen, a reversible gamma distribution is fit and the mode is calculated using that distribution (default: "mean").

denom_type

string. Type of denominator (d) to use (default: "index"). Note that denominator type does NOT affect P-values.

"ses":

d for species s is calculated as the standard deviation of RQE values calculated from permuted species weights. This makes the output specificity a standardized effect size (SES). Unfortunately, this makes SES counterintuitively sensitive to occupancy, where species with high occupancy have more extreme SES than rare species, due to their more deterministic sim specificities. Included for comparative purposes, not suggested.

"raw":

d is 1 for all species, so output specificity has units of distance, i.e. the raw difference between empirical and simulated RQE. This means that results from different variables are not comparable, since this is not scale-invariant to env or hosts_phylo. It is not scale-invariant to the species weights in abunds_mat, either. Not sensitive to number of samples. Not suggested because the units are strange and results are not comparable between variables.

"index":

d is the center of simulated (permuted) RQE values for species that have stronger specificity than expected by chance, resulting in specificity values with range [-1, 0), with 0 as the null hypothesis. In this case, -1 indicates perfect specificity, where a species is associated with zero environmental variability. In the euclidean sense, this could be a species that is always found at the exact same elevation or the exact same pH. For species that have weaker specificity than expected by chance, d is x minus the center (see above) of simulated RQE values, where x is the maximum possible dissimilarity observable given species weights. x is estimated using a genetic algorithm. This d has other useful properties: scale invariance to env/hosts_phylo, insensitivity to the number of samples, insensitivity to occupancy, and strong sensitivity to specificity (default).

"sim_center":

d is always the center of simulated (permuted) RQE values. For species that have stronger specificity than expected by chance, this will return the same Spec values as "index". For species with weaker specificity than expected by chance, instead of values that range between 0 and 1, they will range between 0 and Inf. This is much faster than "index" because the genetic algorithm is not used. So if species with weaker specificity than expected by chance are not interesting to you, this may be a good option.
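The four denominator types can be compared side by side in a small sketch. It assumes that Spec is the difference between empirical RQE and the center of simulated RQE, divided by d, which is how the descriptions above read. The function spec_from_sims and its max_rqe argument are hypothetical; in the package, the maximum possible RQE (x above) is estimated with a genetic algorithm rather than supplied by the caller.

```r
# Hypothetical helper: Spec from empirical and simulated RQE values.
# max_rqe stands in for x, the maximum possible RQE given species weights.
spec_from_sims <- function(emp, sim, denom_type = "index", max_rqe = NULL) {
  ctr <- mean(sim)   # center of simulated RQE (mean, for simplicity)
  raw <- emp - ctr   # negative when specificity is stronger than chance
  d <- switch(denom_type,
    ses        = sd(sim),
    raw        = 1,
    sim_center = ctr,
    index      = if (raw < 0) {
      ctr
    } else {
      stopifnot(!is.null(max_rqe))
      max_rqe - ctr
    },
    stop("unknown denom_type")
  )
  raw / d
}

sim <- c(2, 4, 6)  # made-up simulated RQE values, center = 4
spec_from_sims(0, sim, "index")                # perfect specificity
spec_from_sims(7, sim, "index", max_rqe = 10)  # weaker than chance
spec_from_sims(7, sim, "sim_center")           # same species, no upper bound
```

With these toy numbers, "index" and "sim_center" agree whenever the empirical value is below the simulated center, and differ only for species with weaker specificity than expected by chance.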

diagnostic

logical. If TRUE, changes output to include different parts of Spec. This includes Pval, Spec, raw, denom, emp, and all sim values with column labels as simN where N is the number of sims (default: FALSE).

chunksize

integer. If greater than zero, computation of simulated RQE values will be done using chunked evaluation, which lowers memory use considerably for larger data sets. Can be disabled by setting to 0. Default value is 1000 species per chunk (default: 1000).
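Chunked evaluation generally means splitting the species (columns) into groups and processing one group at a time. A minimal sketch of the splitting step is shown below; chunk_indices is a hypothetical helper, not a function from the package.

```r
# Hypothetical helper: split 1..n_items into consecutive chunks.
# chunksize <= 0 disables chunking and returns a single chunk.
chunk_indices <- function(n_items, chunksize) {
  idx <- seq_len(n_items)
  if (chunksize <= 0 || chunksize >= n_items) {
    return(list(idx))
  }
  split(idx, ceiling(idx / chunksize))
}

chunks <- chunk_indices(2500, 1000)
lengths(chunks)  # sizes of the resulting chunks
```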

ga_params

list. Parameters for genetic algorithm that maximizes RQE. Only used with denom_type="index". Default is the output of get_ga_defaults(). If different parameters are desired, start with output of get_ga_defaults and modify accordingly.

Value

data.frame where each row is an input species. First column is P-value ($Pval), second column is specificity ($Spec).
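A typical next step is to pick out species with significant specificity. The data frame below is made up to mimic the two documented columns; it is not real output.

```r
# Made-up stand-in for phy_or_env_spec output (one row per species).
specs <- data.frame(
  Pval = c(0.001, 0.20, NA, 0.90),
  Spec = c(-0.80, -0.05, NA, 0.40),
  row.names = c("sp1", "sp2", "sp3", "sp4")
)

# which() drops the NA rows returned for species that were too sparse.
specific <- specs[which(specs$Pval < 0.05 & specs$Spec < 0), ]
specific
```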

Author(s)

John L. Darcy

References

  • Poulin et al. (2011) Host specificity in phylogenetic and geographic space. Trends Parasitol 27(8):355-361. doi: 10.1016/j.pt.2011.05.003

  • Rao CR (2010) Quadratic entropy and analysis of diversity. Sankhya 72:70-80. doi: 10.1007/s13171-010-0016-3

  • Rao CR (1982) Diversity and dissimilarity measurements: A unified approach. Theor Popul Biol 21:24-43.

Examples

# library(specificity)
# attach(endophyte)
# # only analyze species with occupancy >= 20
# m <- occ_threshold(prop_abund(otutable), 20)
# # create list to hold phy_or_env_spec outputs
# specs_list <- list()
#
# # phylogenetic specificity using endophyte data set
# specs_list$host <- phy_or_env_spec(
#     abunds_mat=m,
#     hosts=metadata$PlantGenus,
#     hosts_phylo=supertree,
#     n_sim=100, p_method="gamma_fit",
#     n_cores=4
# )
#
# # environmental specificity using elevation from endophyte data set:
# specs_list$elev <- phy_or_env_spec(
#     abunds_mat=m,
#     env=metadata$Elevation,
#     n_sim=100, p_method="gamma_fit",
#     n_cores=4
# )
#
# # geographic specificity using spatial data from endophyte data set:
# specs_list$geo <- phy_or_env_spec(
#     abunds_mat=m,
#     env=distcalc(metadata$Lat, metadata$Lon),
#     n_sim=100, p_method="gamma_fit",
#     n_cores=4
# )
#
# plot_specs_violin(specs_list, cols=c("forestgreen", "red", "black"))


darcyj/specificity documentation built on Aug. 1, 2023, 8 a.m.