In ajrominger/RarePlusComMinus: A package in support of the manuscript ""

knitr::opts_chunk$set(echo = FALSE, cache = FALSE, message = FALSE, warning = FALSE)

# function for cross-cache lazy loading
findChunk <- function(key, otherDoc) {
    f <- file.path(paste(otherDoc, 'cache', sep = '_'), 'latex')

    allChunk <- list.files(f, pattern = '.RData')

    thisChunk <- gsub('\\.RData', '', grep(paste0(key, '_'), allChunk, value = TRUE))
    thisChunk <- file.path(f, thisChunk)

    return(thisChunk)
}

# function for cross-doc figure referencing
refOtherFig <- function(x, otherDoc, prefix = '') {

}

# load packages
pkgs <- readLines('RarePlusComMinus_supp_cache/latex/__packages')
for(p in pkgs) library(p, character.only = TRUE)

# load plotting defaults
lazyLoad(findChunk('plot_defaults', 'RarePlusComMinus_supp'))
figW <- figH <- 3.5
knitr::opts_chunk$set(fig.width = figW, fig.height = figH, fig.align = 'center')

# load threading defaults
lazyLoad(findChunk('threading', 'RarePlusComMinus_supp'))

\linenumbers

Abstract

The prevalence of rare species in ecosystems begs the question of how they persist. In a recent paper, Calatayuda et al. (CEA) provided a new hypothesis that rare species, in contrast to common species, share unique microhabitats and/or preferentially engage in mutualistic interactions. CEA support this hypotheses by reconstructing association networks from spatially replicated abundance data finding that rare species are over-representing in positive association networks while common species are over-representing in negative association networks. However, the use of abundance and co-occurrence data to infer true species associations is difficult and often inaccurate. Here, I show that the finding of rare species being more represented in positive association networks can be explained by statistical artifacts in the inference of species associations from abundance data. I caution against the inference of ecological association networks from abundance data alone. \newline \newline

Why do rare species persist in ecosystems? Rare species seem to be at a disadvantage by pure probabilistic odds [@mcgill2005] and perhaps also from poorly adapted species-environment and species-species interactions [@hutchinson1961], though negative density-dependence may help buoy rare species [@leigh2004; @yenni2012]. The question of rarity and persistence thus remains unresolved. In a recent paper, @calatayud2019 (CEA) have contributed toward helping resolve this question. They compiled an impressive collection of datasets, across many taxa and environments, capturing spatially replicated species abundance measures. With these data they inferred species-species association networks. Such association networks are hypothesized to reflect both potential species-species interactions and/or shared environmental preferences, though there is debate about their accuracy and interpretation [@sander2017; @barner2018; @freilich2018; @carr2019; @rajala2019; @blanchet2020]. CEA found that rare species were statistically over-represented in positive-positive species association networks, while common species were statistically over-represented in negative-negative species association networks [@calatayud2019]. CEA interpreted this finding as possible evidence that the persistence of rare species may be aided by positive species interactions, such as mutualism or facilitation, or by shared use of similar microhabitats. However, this result could be compromised by the unreliability of inferring species associations from abundance data alone. Here, I show that the correlation between abundance and association type (positive or negative) as reported by CEA can be explained by statistical artifacts. These artifacts arise because of spatial clustering in intra-specific abundances. It would therefore not be supported to assign biological interpretations to correlations between association types and abundances until more data can be brought to bear on the subject.

When association networks are inferred from spatially replicated abundance data, species-species co-occurrences are quantified by a metric (e.g., CEA use Schoener similarity [@schoener1968]) and then a null model is used to assess whether these co-occurrence metrics deviate substantially enough from null expectations to suggest a non-random association, either in the positive or negative direction. However, seemingly non-random patterns in abundance can arise from many processes, including neutrality, that are not driven by species interactions or associations. As such, deviations of abundance patterns from null models might not, by itself, indicate true associations or interactions. One critical, and widely observed, property of species abundances is that they are not evenly distributed across species nor across space within a given species (often referred to as spatial clustering) [@mcgill2003; @engen2008; @zillio2010; @harte2011; @connolly2017]. Both ubiquitous patterns can be accounted for by purely probabilistic processes from neutral birth-death-immigration [@kendall1949; @hubbell2001] to mechanistically agnostic statistical-mechanical properties of large assemblages [@harte2011]. Importantly, the negative binomial probability distribution both accurately reflects many empirical measurements of spatial variation in intra-specific abundances [@harte2011; @connolly2017] and is independently derived by disparate null or neutral ecological theories [@kendall1949; @engen2008; @harte2011].

Thus, the simple observation of uneven or clustered intra-specific abundances does by itself indicate the influence of deterministic species associations. The data compiled by CEA [@calatayud2019] indeed confirm the ubiquity of uneven species abundances both at an intra-specific level across space (Supplementary Fig. \ref{fig:ssad}) at an inter-specific level (Supplementary Fig. \ref{fig:sadShape}). For consistency, I will refer to the spatial distribution of intra-specific abundances as the spatial species abundance distribution (SSAD) following Harte [@harte2011] and the inter-specific distribution of abundances as the species abundance distribution SAD. To reiterate for clarity, the SSAD is a measure of spatial variability in intra-specific abundances across space and is measured once for each species; the SAD is a measure of variability in abundance across species (i.e. inter-specific abundances).

Using simulation, I show that intra-specific spatial clustering of the SSAD alone is sufficient to reproduce the apparent correlation between abundance and association type reported by CEA. Spatial clustering is not, by itself, evidence that rare species positively associate with each other while common species negatively associate. Therefore, the observations reported by CEA do not tell us about species associations, but rather that the null models used do not preserve important aspects of the SSAD.

In Figure \ref{fig:plusMinus} I first reproduce key results from CEA's Figure 2(B-C). Then to evaluate whether these results can be produced simply from spatial clustering alone I simulate purely random data (with absolutely no association or interaction between species) that match the unevenness of abundances found in the observed data. These random data are simulated as follows:

1) The number of species $S$, number of sites $M$, and shape of the best fitting SAD are sampled (with replacement) from the observed data 2) $S$ species abundances $x_i \ldots x_S$ are sampled from the SAD 3) For each species $i$ with abundance $x_i$, within-species counts are distributed across the $M$ sites according to an SSAD that is either negative binomial (in the case of spatial clustering) or Poisson (in the case of spatial randomness) 4) The resulting simulated site by species matrix is fed through the same analytically pipeline (described in CEA) as the observed data to infer positive and negative associations.

All analyses are carried out in R [@rcore] and can be fully reproduced by installing the R package (https://github.com/ajrominger/RarePlusComMinus) accompanying this paper, as detailed in the supplement.

# load plotting funs
lazyLoad(findChunk('plusMinus_plotFuns', 'RarePlusComMinus_supp'))

# load all data to plot
lazyLoad(findChunk('obs_plusMinus_comp', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('sim_plusMinus_comp', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('sim_pois_plusMinus_comp', 'RarePlusComMinus_supp'))
# lazyLoad(findChunk('sim_poisUnif_plusMinus_comp', 'RarePlusComMinus_supp'))

```r'} wmmax <- ceiling(max(simPMData$pos.wm, simPMData$neg.wm, na.rm = TRUE) * 9) / 3 xless <- 0.48 yextra <- 0.55

foo <- split.screen(c(3, 1))

screen(1) fig2bc(commStats, breaksRho = seq(-2, 2, by = 1/5), breaksWM = seq(-wmmax, wmmax, by = 0.1/3), addxlab = FALSE, figLabs = c('A', 'B'), insetxprop = xless, insetyprop = yextra) mtext('Observed', side = 4, outer = TRUE)

screen(2) fig2bc(simPMData, breaksRho = seq(-2, 2, by = 1/5), breaksWM = seq(-wmmax, wmmax, by = 0.1/3), addxlab = FALSE, figLabs = c('C', 'D'), insetxprop = xless, insetyprop = yextra)

screen(3) fig2bc(simPMDataPois, breaksRho = seq(-2, 2, by = 1/5), breaksWM = seq(-wmmax, wmmax, by = 0.1/3), addxlab = TRUE, figLabs = c('E', 'F'), insetxprop = xless, insetyprop = yextra)

foo <- close.screen(all.screens = TRUE)

In the case of a Poisson SSAD the one parameter (the mean) is fully specified by the average site-level abundance of a given species. In the case of a negative binomial SSAD, the mean parameter is again specified by the site-level average, but the size or clustering parameter $k$ is not fully specified.  To capture the rough features of the data, I sample $k$ from a linear relationship (with noise) between the maximum likelihood estimates of $k$ and the relative abundance of each species (Supplementary Fig. \ref{fig:ssad}).

Figure \ref{fig:plusMinus} A--D shows that with a negative binomial SSAD, simulated data closely match observed findings: the correlation between abundance and species' network degree skews more negative in positive association networks (i.e. more rare species are more highly connected in positive association networks), and positive associations networks tend to contain more rare species than negative networks. This correspondence between real and simulated patterns largely disappears when we instead use a Poisson SSAD, highlighting the importance of spatial aggregation in driving the spurious results. 


My findings do not depend on simulating SAD and SSAD shapes from the data: in Supplementary Figure \ref{fig:simpleSim} I show that the spurious relationship between abundance and association type occurs even when simulating data from just one arbitrary SAD function with the one arbitrary spatially clustered SSAD for all species.  In this simulation, again, replacing the spatially clustered SSAD with a Poisson SSAD breaks the spurious connection between abundance and association type as in Figure \ref{fig:plusMinus} (E-F).

Why do negative binomial SSADs reproduce the results while Poisson SSADs fail to? The null model algorithm used here and in CEA fixes row and column marginals, thus the empirical shape of the SAD is preserved, and the *total* abundances (across all species) at each site are also preserved. However, the way a species' total abundance is allocated across sites by the null model has a potentially large combinatorial space to explore. In Figure \ref{fig:ssadPerm} I compare summary statistics of known SSADs to their permuted counterparts and find that the null model transforms negative binomial SSADs to a more Poisson shape, while leaving Poisson SSADs probabilistically unchanged. Specifically, when starting with a negative binomial SSAD, the null model inflates the number of sites individuals are allocated to (more similarly to a Poisson SSAD) and increases the inferred $k$ parameter, indicating less spatial clustering in the permuted matrices compared to their non-permuted, negative binomial starting points.

```r
lazyLoad(findChunk('ssad_perm_comp', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('ssad_perm_setup', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('sim_simple_setup', 'RarePlusComMinus_supp'))

```r"} nbCol <- hsv(0.53, 1, 0.8) poCol <- hsv(0.1, 1, 0.95)

layout(matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE), heights = c(1, 5))

lmar <- parArgs$mar lmar[c(1, 3)] <- 0 par(mar = lmar)

plot(1, type = 'n', axes = FALSE, xlab = '', ylab = '') legend(1, 1, legend = c('Negative binomial', 'Poisson'), pch = 21, pt.cex = 2, col = c(nbCol, poCol), pt.bg = colAlpha(c(nbCol, poCol), 0.4), bty = 'n', horiz = TRUE, xjust = 0.5)

par(parArgs) plot(simPoPerm$nocc, simPoPerm$null_nocc, pch = 21, col = poCol, bg = colAlpha(poCol, 0.1), xlim = c(1, nsite), ylim = c(1, nsite), xlab = 'True num. occupied sites', ylab = 'Permuted num. occupied sites')

points(simNBPerm$nocc, simNBPerm$null_nocc, pch = 21, col = nbCol, bg = colAlpha(nbCol, 0.1))

abline(0, 1, col = 'black', lwd = 2)

foo <- rbind(simNBPerm, simPoPerm) foo <- foo[is.finite(foo$size) & is.finite(foo$null_size), ]

par(parArgs) plot(simPoPerm$size, simPoPerm$null_size, log = 'xy', xaxt = 'n', yaxt = 'n', pch = 21, col = poCol, bg = colAlpha(poCol, 0.1), xlim = range(foo[, c('size', 'null_size')]), ylim = range(foo[, c('size', 'null_size')]), xlab = expression('True'~k), ylab = expression('Permuted'~k)) logAxis(1:2, expLab = TRUE)

points(simNBPerm$size, simNBPerm$null_size, pch = 21, col = nbCol, bg = colAlpha(nbCol, 0.1))

abline(0, 1, lwd = 1.5)

```r
lazyLoad(findChunk('mathmat_fun', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('mathEG_setup', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('mathEG_rare', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('mathEG_comm', 'RarePlusComMinus_supp'))
lazyLoad(findChunk('mathEG_commMax', 'RarePlusComMinus_supp'))

formatNice <- function(x) {
    o <- formatC(x, format = 'e', digits = 2)
    o <- gsub('e', ' \\\\times 10^{', o)
    o <- gsub('10\\^\\{-0', '10^\\{-', o)
    o <- gsub('10\\^\\{0', '10^\\{', o)

    return(paste0(o, '}'))
}

The negative binomial SSAD appears to be the key to producing presumably spurious relationships between abundance and positive or negative association networks. One might then expect that a null model which preserves the shape of the SSAD for each species would account for statistical artifacts deriving from the SSAD. CEA indeed explore such a null model algorithm (the "independent swap algorithm" [@ulrich2010; @picante]; null model III in the CEA supplement) and find it still supports their results. I similarly apply the independent swap algorithm to data simulated with a negative binomial SSAD and no real species associations or interactions. I find that the same spurious relationships between abundance and association type, even when using the independent swap algorithm (Supp. Fig. \ref{fig:indSwap}). This again confirms that such association networks and further biological interpretations of them cannot be drawn from abundance data alone.

At a mathematical level, clustered SSADs as compared to spatially even SSADs, increase the probability that rare species will appear positively associated with each other and common species will appear negatively associated. Consider, for example, two rare species: one with a single individual and the other with abundance r Nrare, distributed across r nsite sites. Their Schoener similarity is maximized when all individuals occur at the same site, such as this site by species matrix $$ X_{rare} = r mathmat(rarePair) $$ If we define $Q(x_i; \mu = r Nrare / nsite)$ as the probability of observing $x_i$ individuals in site $i$ given an SSAD with mean parameter $\mu$, then the probability of the above configuration is $P(X_{rare}) = Q(r Nrare; \mu = r Nrare / nsite) \left(Q(0; \mu = r Nrare / nsite)^{r nsite - 1}\right)$. Under a negative binomial SSAD with $k = r k$, $P(X_{rare}) = r formatNice(nbRareP)$ whereas under a Poisson SSAD $P(X_{rare}) = r formatNice(poRareP)$.

Conversely, for two common species, say each with abundance r Ncomm, an example configuration that minimizes their Schoener similarity would be $$ Y_{min} = r mathmat(commPair) $$ We calculate the probability of any such scenario where no abundances overlap as $P(Y_{min}) = r nsite - 1 \left(\left(Q(r Ncomm; \mu = r Ncomm / nsite) Q(0; \mu = r Ncomm / nsite)^{r nsite - 1}\right)^2\right)$. With a negative binomial SSAD with $k = r k$, $P(Y_{min}) = r formatNice(nbCommP)$ whereas with a Poisson SSAD $P(Y_{min}) = r formatNice(poCommP)$.

We contrast this with a configuration that would maximize the Schoener similarity between these two common species: $$ Y_{max} = r mathmat(commPairMax) $$ The probability of this configuration is $P(Y_{max}) = Q(r commPairMax[1, 1]; \mu = r Ncomm / nsite)^{r 2 * nsite}$. For the same negative binomial $P(Y_{max}) = r formatNice(nbCommPMax)$, and for the Poisson $P(Y_{max}) = r formatNice(poCommPMax)$.

Thus a spatially clustered SSAD, compared to a spatially even SSAD, gives more probability to configurations where rare species appear aggregated and common species appear over-dispersed. Because the null model algorithm permutes site by species matrices to resemble more Poisson-like SSADs this probabilistic difference between spatially clustered versus even SSADs accounts for the prevalence of rare species in positive association networks and common species in negative association networks.

# function to calculate percent of species in networks
percNet <- function(x, type) {
    round(mean(100 * x[, paste(type, 'v', sep = '.')] / x$all.v, na.rm = TRUE))
}

Caution should be used when inferring species association from abundance data. More fundamentally than the spurious correlation of abundance with association type, my analysis shows that statistically significant species associations are inferred from data simulated without any real species associations. In data simulated with a negative binomial SSAD, on average r percNet(simPMData, 'pos')% of species were placed in positive association networks and r percNet(simPMData, 'neg')% in negative association networks with a significance cutoff of $\alpha = 0.05$. With the Poisson SSAD these simulated numbers were r percNet(simPMDataPois, 'pos')% for positive networks and r percNet(simPMDataPois, 'neg')% for negative networks. For the observed data, on average r percNet(commStats, 'pos')% of species were placed in positive association networks and r percNet(commStats, 'neg')% in negative association networks.

It is becoming increasingly appreciated that abundance data alone are not sufficient to distinguish between different ecological processes [@mcgill2007; @morlon2009]. The question of why rare species persist is fascinating, and CEA should be commended for making a concerted effort to illuminate possible mechanisms underlying the phenomenon; however, to reach robust conclusions, other types of data, such as actual experimental measurement of shared environmental associations or species-species interaction strengths, are needed in addition to abundance data.

Data and Code Availability

All data and code needed to reproduce the results of this manuscript are available at https://github.com/ajrominger/RarePlusComMinus and a detailed description of the analytical approach is available in the supplement.

\clearpage