\chead{\ifthenelse{\value{page}=1}{}{\textit{Inferring species interactions}}} \renewcommand{\headrulewidth}{0pt}

\setlength{\parskip}{1pt}

Title: Inferring species interactions from co-occurrence data with Markov networks

Author: David J. Harris: Population Biology; 1 Shields Avenue, Davis CA, 95616

Abstract: Inferring species interactions from co-occurrence data is one of the most controversial tasks in community ecology. One difficulty is that a single pairwise interaction can ripple through an ecological network and produce surprising indirect consequences. For example, the negative correlation between two competing species can be reversed in the presence of a third species that outcompetes both of them. Here, I apply models from statistical physics, called Markov networks or Markov random fields, that can predict the direct and indirect consequences of any possible species interaction matrix. Interactions in these models can be estimated from observed co-occurrence rates via maximum likelihood, controlling for indirect effects. Using simulated landscapes with known interactions, I evaluated Markov networks and six existing approaches. Markov networks consistently outperformed the other methods, correctly isolating direct interactions between species pairs even when indirect interactions or abiotic factors largely overpowered them. Two computationally efficient approximations, which controlled for indirect effects with partial correlations or generalized linear models, also performed well. Null models showed no evidence of being able to control for indirect effects, and reliably yielded incorrect inferences when such effects were present.

Key words: Ecological interactions; Occurrence data; Species associations; Markov network; Markov random field; Ising model; Biogeography; Presence–absence matrix; Null model

\noindent\textbf{Introduction}

\setlength{\parindent}{2em}

\noindent To the extent that nontrophic species interactions (such as competition) affect community assembly, ecologists might expect to find signatures of these interactions in species composition data [@macarthur_population_1958; @diamond_island_1975]. Despite decades of work and several major controversies, however [@lewin_santa_1983; @strong_ecological_1984; @connor_checkered_2013], existing methods for detecting competition's effects on community structure are unreliable [@gotelli_empirical_2009]. In particular, species' effects on one another can become lost in a web of indirect effects. For example, the competitive interaction between the two shrub species in Figure 1A is obscured by their shared tendency to occur in unshaded areas (Figure 1B). While ecologists have long known that indirect effects can overwhelm direct ones [@levine_competitive_1976], most methods for drawing inferences from co-occurrence data do not control for these effects [e.g. @diamond_island_1975; @strong_ecological_1984; @gotelli_empirical_2009; @veech_probabilistic_2013; @pollock_understanding_2014]. As a result, ecologists do not have tools that allow them to isolate direct interactions from indirect effects.

While competition doesn't reliably reduce co-occurrence rates at the whole-landscape level (as most methods assume), it does still leave a signal in the data (Figure 1C). After controlling for the presence of the tree species (e.g. by splitting the data set into shaded and unshaded sites or by using a model that estimates conditional relationships amongs species), the two shrubs do have a negative association, with fewer co-occurrences in unshaded areas than would be expected if they occurred independently of one another.

Following @azaele_inferring_2010, this paper shows that Markov networks [undirected graphical models also known as Markov random fields; @murphy_machine_2012] can provide a framework for understanding the landscape-level consequences of pairwise species interactions, and for estimating them from observed presence-absence matrices. Markov networks have been used in diverse scientific fields for decades, from physics [where nearby particles interact magnetically; @cipra_introduction_1987] to spatial statistics [where adjacent grid cells have correlated values; @harris_contact_1974; @gelfand_modelling_2005]. While community ecologists explored some related approaches in the 1980's [@whittam_species_1981], they used severe approximations that led to unintelligible results [e.g. "probabilities" greater than one; @gilpin_factors_1982].

Below, I demonstrate Markov networks' ability to produce exact predictions about the direct and indirect consequences of an interaction matrix, and also to make inferences about the interaction matrix based on co-occurrence rates. Using simulated data sets where the "true" interactions are known, I compare this approach with several existing methods. Finally, I discuss opportunities for extending the approach presented here to other problems in community ecology, e.g. quantifying the overall effect of species interactions on occurrence rates [@roughgarden_competition_1983] and disentangling the effects of biotic versus abiotic interactions on species composition [@harris_estimating_2016].

\noindent\textbf{Methods}

\noindent \textbf{Markov networks.} Markov networks provide a framework for translating back and forth between the conditional (all-else-equal) relationships among species (Figure 1C) and the kinds of species assemblages that these relationships produce. Here, I show how a set of conditional relationships can determine species composition. Methods for estimating conditional relationships from data are discussed in the next section.

A Markov network defines the relative probability of observing a given vector of species-level presences (1s) and absences (0s), $\vec{y}$ at a site, as

\centering

$\displaystyle{p(\vec{y}; \alpha, \beta) \propto exp(\sum_{i}\alpha_i y_i + \sum_{}\beta_{ij}y_i y_j),}$

\raggedright \setlength{\parindent}{2em}

\noindent where the second sum is over all $\frac{1}{2}n(n-1)$ pairs of $n$ species. In this model, $\alpha_{i}$ is an intercept term determining the amount that the presence of species $i$ contributes to the log-probability of $\vec{y}$; it directly controls the prevalence of species $i$. Similarly, $\beta_{ij}$ is the amount that the co-occurrence of species $i$ and species $j$ contributes to the log-probability; it determines the conditional relationship between two species, i.e. the probability that they will be found together, after controlling for the other species in the network (Figure 2A, Figure 2B). For example, if $\beta_{ij} = +2$, then each species' odds of occurrence would be $e^2$ times higher when the other one is present (as compared with otherwise equivalent sites). The relative probability of a presence-absence vector increases when positively-associated species co-occur and decreases when negatively-associated species co-occur. As a result, the model tends---all else equal---to produce assemblages where many positively-associated species pairs co-occur and few negatively-associated pairs do (just as an ecologist might expect).

\noindent \textbf{Estimating $\alpha$ and $\beta$ coefficients from presence-absence data.} In the previous section, the values of $\alpha$ and $\beta$ were known and the goal was to make predictions about possible species assemblages. In practice, however, ecologists will often need to estimate the parameters from an observed co-occurrence matrix (i.e. from a set of independent $\vec{y}$ vectors indicating which species are present at each site on the landscape). When Equation 1 can be normalized (Figure 2B), one can find exact maximum likelihood estimates for $\alpha$ and $\beta$ by numerically optimizing $p(\vec{y}|\alpha, \beta)$. Fully-observed networks like the ones considered here have unimodal likelihood surfaces [@murphy_machine_2012], so optimizers will always find the global optimum.

When the number of species is larger than about 30, noralizing Equation 1 can become intractable, and researchers will either need to approximate it [@lee_learning_2012] or approximate its gradient [@harris_estimating_2016] if they want to fit a Markov network model. For the analyses presented here, where the number of species did not exceed 20, approximations were not necessary. Instead, I used the rosalia package [@harris_rosalia_2015] for the R programming language [@r_core_team_r_2015] to calculate $p(\vec{y}; \alpha, \beta)$ and its gradients exactly [@murphy_machine_2012]; the package passes these functions to the "BFGS" method in R's general-purpose optimizer, which finds values for $\alpha$ and $\beta$.

\noindent \textbf{Simulated landscapes.} I simulated several sets of landscapes using known parameters so that model estimates could be compared with "true" values. The first set of landscapes included the three competing species shown in Figure 1. For each of 1000 replicates, I generated a landscape with 100 sites by sampling from a probability distribution defined by the figure's interaction coefficients (Appendix 1). Each of the methods described below was then evaluated on its ability to correctly infer that the two shrub species competed with one another, despite their frequent co-occurrence in unshaded areas.

I then simulated landscapes with up to 20 interacting species at 25, 200, or 1600 sites using three increasingly complex models (50 replicates for each combination of size and model; Appendix 2). The simplest set of simulated landscapes were generated with Gibbs samples from Equation 1. For each replicate, I randomly drew the “true” $\beta$ coefficient magnitudes from an exponential distribution with rate 1 so that most species pairs interacted negligibly but a few interactions were strong enough to propagate through the newtwork. I randomly assigned 25% of the interactions to be positive; the remainder were negative.

The next set of landscapes provided a way to assess each method's ability to identify direct interactions in the presence of environmental heterogeneity. Here, the $\beta$ coefficients were calculated as above, but each species' $\alpha$ value depended linearly on two environmental factors, which were drawn from independent Gaussians for each site. Once the local $\alpha$ values were calculated, independent Gibbs samplers determined the species composition for each site based on Equation 1.

In the final set of landscapes, I simulated each species' abundance (instead of just presence/absence); furthermore, interactions between species occurred on a per-capita basis in these simulations (i.e. each species' effect on the others is proportional to its abundance). To prevent runaway mutualisms leading to infinite abundance, all interaction coefficients were negative in these simulations. Good performance on these landscapes would indicate some robustness to the mechanistic details of species interactions.

\noindent \textbf{Recovering species interactions from simulated data.} I compared seven techniques for determining the sign and strength of the associations between pairs of species from simulated data (Appendix 3). First, I used the rosalia package [@harris_rosalia_2015] to fit Markov network models, as described above. For the analyses with 20 species, a weakly-informative regularizer (equivalent to a logistic prior with location 0 and scale 2) ensured that the estimates were always finite (Appendix 3).

I also evaluated six alternative methods: five from the existing literature, plus a novel combination of two of these methods. The first alternative interaction metric was the sample correlation between species' presence-absence vectors, which summarizes their marginal association. Next, I used partial correlations, which summarize species' conditional relationships. This approach, which is closely related to linear regression, is common in molecular biology [@friedman_sparse_2008], but is rare in ecology (see @albrecht_spatial_2001 and @faisal_inferring_2010 for two exceptions). In the context of non-Gaussian data, the partial correlation (or partial covariance) can be thought of as a computationally efficient approximation to the full Markov network model [@loh_structure_2013]. Partial correlations are undefined for landscapes with perfectly-correlated species pairs, so I used the regularized estimate provided by the corpcor package’s pcor.shrink function with the default settings [@schafer_corpcor_2014].

The third alternative, generalized linear models (GLMs), also provide a computationally efficient approximation to the Markov network [@lee_learning_2012]. Following @faisal_inferring_2010, I fit regularized logistic regression models [@gelman_weakly_2008] for each species, using the other species as predictors. This produced two interaction estimates for each species pair (one for the effect of species $i$ on species $j$ and one for the reverse). These two estimates were very tightly correlated (mean Pearson correlation of 0.95, Appendix 3); their arithmetic mean provided a consensus estimate of the overall interaction.

The next method used the Pairs software described in @gotelli_empirical_2009. This program simulates new landscapes from a null model that retains the row and column sums of the original matrix [@strong_ecological_1984] and calculates $Z$-scores to summarize a species pair's deviation from this null.

The last two estimators used the latent correlation matrix estimated by the BayesComm package [@golding_bayescomm_2015] in order to evaluate the recent claim that the correlation coefficients estimated by "joint species distribution models" provide an accurate assessment of species’ pairwise interactions [@pollock_understanding_2014; see also @harris_generating_2015]. In addition to using the posterior mean correlation [@pollock_understanding_2014], I also used the posterior mean partial correlation, which should control better for indirect effects.

\noindent \textbf{Evaluating model performance.} For the simulated landscapes based on Figure 1, I assessed whether each method's test statistic indicated a positive or negative relationship between the two shrubs (Appendix 1). For the null model (Pairs), I calculated statistical significance using its $Z$-score. For the Markov network, I used the Hessian matrix to generate approximate confidence intervals.

For the larger landscapes, I evaluated the relationship between each method's estimates and the "true" interaction strengths. To ensure that the different test statistics (e.g. correlations versus $Z$ scores) were on a common scale, I rescaled them using linear regression through the origin. I then calculated the proportion of variance explained for different combinations of model type and landscape size (compared with a baseline model that assumed all interaction strengths to be zero).

For the null model and the Markov network, the probability of rejecting the null hypothesis of zero interaction was estimated across a range of "true" interaction strengths using a kernel smoother (Appendix 4). The probability of rejection when the "true" value of $\beta$ was zero was defined as the Type I error rate. Because the coefficients' interpretation is different for the abundance simulations than for the other two types, its error rates were not analyzed in this way.

\noindent \textbf{Results}

\noindent \textbf{Three species.} As shown in Figure 1, the marginal relationship between the two shrub species was positive---despite their competition for space at a mechanistic level---due to indirect effects of the dominant tree species. As a result, the correlation between these species was positive in 94% of replicates, and the randomization-based null model falsely reported positive associations 100% of the time. Worse, more than 98% of these false conclusions were statistically significant. The partial correlation and Markov network estimates, on the other hand, each correctly isolated the direct negative interaction between the shrubs from their positive indirect interaction 94% of the time (although the confidence intervals overlapped zero in most replicates).

\noindent \textbf{Twenty species.} In general, each model's performance was highest for large landscapes with simple assembly rules and no environmental heterogeneity (Figure 3). Despite some variability across contexts, the rank ordering across methods was very consistent. In particular, the four methods that controlled for indirect effects (the Markov network, the generalized linear models, and the two partial correlation-based methods) always matched or outperformed those that did not. The Markov network consistently performed best of all. As anticipated by @lee_learning_2012, generalized linear models closely approximated the Markov network estimates (Figure 4A), especially when the data sets were very large (Figure 3). As reviewed in @gotelli_empirical_2009, however, most analyses in this field of ecology involve fewer than 50 sites; in this context, the gap between the methods was larger. As shown in Appendix 4, the standard errors associated with the estimates in Figure 3 are small (less than 0.01), so the differences among methods should not be attributed to sampling error.

Of the methods that did not control for indirect effects, Figure 3 shows that simple correlation coefficients provided a more reliable indicator of species' true interaction strengths than either the joint species distribution model (BayesComm) or the null model (Pairs). 95% of the variance the Pairs test statistic was explained by correlation coefficients (controlling for landscape size; Figure 4B); much of the remaining variance is due to sampling error.

Finally, we can evaluate the models' inferential statistics (focusing on the first two simulation types, where the interaction coefficients are easiest to interpret). The Markov network's Type I error rate was 0.02 for simulations that matched the model's assumptions, and 0.14 for simulations that included environmental heterogeneity (see Appendix 4 for confidence interval coverage across a range of $\beta_{ij}$ values). In contrast, the null model's Type I error rates were 0.30 and 0.51, respectively---far higher than the nominal 0.05 rate. Figure 4C shows, across a range of true interaction strengths, the probability that the null model or the Markov network will predict the wrong sign of the interaction with 95% confidence. The null model makes such errors more than 8 times as often as the Markov network, even though it only rejects the null hypothesis twice as often overall (Appendix 4). The Markov network's errors were also more concentrated around 0, as it never misclassified strong interactions like the null model did (Figure 4C).

\noindent \textbf{Discussion}

\noindent The results presented above show that Markov networks can reliably recover species' pairwise interactions from species composition data, even for cases where environmental heterogeneity and indirect interactions cause ecologists' typical null modeling approaches to reliably fail. Partial correlations and generalized linear models can both provide computationally efficient approximations, but with somewhat lower accuracy [especially for typically-sized data sets with small numbers of sites; @gotelli_empirical_2009]. The difference in accuracy may be even larger for real data sets than for the simulated landscapes in Figure 3; linear approximations to the Markov network make larger errors when the interaction matrix is structured [e.g. due to guilds or trophic levels; @loh_structure_2013]. Similarly, the separate generalized linear models for each species can severely overfit in some cases [@lee_learning_2012]. The full Markov network should thus be preferred to the approximations when it is computationally tractable.

Compositional data only contains enough degrees of freedom to estimate one interaction per species pair [@schmidt_modeling_2012], so none of these methods can identify the exact nature of the pairwise interactions (e.g. which species in a positively-associated pair is facilitating the other). To estimate asymmetric interactions, such as commensalism or predation, ecologists could use time series, behavioral observations, manipulative experiments, or natural history. These other sources of information could also be used to augment the likelihood function with a more informative prior distribution, reducing ecologists' error and uncertainty relative to Figure 3's results.

Markov networks have enormous potential to improve our understanding of species interactions. In particular, they make many fewer errors than existing approaches, and can make precise statements about the conditions where indirect interactions will overwhelm direct ones. They also provide a simple answer to the question of how competition should affect a species' overall prevalence, which has important implications for community-level modeling [@strong_ecological_1984]. Specifically, Equation 1 can be used to calculate the expected prevalence of a species in the absence of biotic influences as $e^{\alpha_i}/(e^{0} + e^{\alpha_i})$. Competition's effect on prevalence can then be estimated by comparing this value with the observed prevalence (e.g. comparing Figure 2D with Figure 2C). This novel quantitative result undermines most of our null models, which unreasonably assume that prevalence would be the exactly same in the absence of competition as it is in the observed data [@roughgarden_competition_1983].

Markov networks---particularly the Ising model for binary networks---are very well understood, having been studied for nearly a century [@cipra_introduction_1987]. Tapping into this framework would thus allow ecologists to take advantage of into a vast set of existing discoveries and techniques for dealing with indirect effects, stability, and alternative stable states. Numerous extensions to the basic network are possible as well. For example, the states of the interaction network can be modeled as a function of the local abiotic environment [@lee_learning_2012; @harris_estimating_2016], which would lead to a better understanding of the interplay between biotic and abiotic effects on community structure. Alternatively, models could allow one species to alter the relationship between two other species [@tjelmeland_markov_1998; @whittam_species_1981; cf @bruno_inclusion_2003].

Finally, the results presented here have important implications for ecologists' continued use of null models for studying species interactions. When the non-null backdrop is not controlled for, Type I error rates can skyrocket, the apparent sign of the interaction can change, and null models can routinely produce misleading inferences (Figure 1, Figure 4C, @gotelli_empirical_2009). Null and neutral models can be useful for clarifying our thinking [@harris_occupancy_2011; @xiao_strong_2015], but deviations from a given null model must be interpreted with care [@roughgarden_competition_1983]. Even in small networks with three species, it may simply not be possible to implicate specific ecological processes like competition by rejecting a general-purpose null [@gotelli_empirical_2009], especially when the test statistic is effectively just a correlation coefficient (Figure 4B).

Controlling for indirect effects via simultaneous estimation of multiple ecological parameters seems like a much more promising approach: to the extent that the models' relative performance on real data sets is similar to the range of results shown in Figure 3, scientists in this field could often triple their explanatory power by switching from null models to Markov networks (or increase it nearly as much with linear or generalized linear approximations). Regardless of the methods ecologists ultimately choose, controlling for indirect effects could clearly improve our understanding of species' direct effects on one another and on community structure.

\noindent \textbf{Acknowledgements:} This work benefited greatly from discussions with A. Sih, M. L. Baskett, R. McElreath, R. J. Hijmans, A. C. Perry, and C. S. Tysor. Additionally, A. K. Barner, E. Baldridge, E. P. White, D. Li, D. L. Miller, N. Golding, N. J. Gotelli, C. F. Dormann, and two anonymous reviewers provided very helpful feedback on the text. This research was partially supported by a Graduate Research Fellowship from the US National Science Foundation and by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to E. P. White.

\setlength{\parindent}{0cm}

\noindent \textbf{References:}

\setlength{\parindent}{-1em} \setlength{\leftskip}{1em} \setlength{\parskip}{0pt}

Figure 1: A. A small network of three competing species. The tree (top) tends not to co-occur with either of the two shrub species, as indicated by the strongly negative coefficient linking them. The two shrub species also compete with one another, but more weakly (circled coefficient). B. In spite of the competitive interactions between the two shrub species, their shared tendency to occur in locations without trees makes their occurrence vectors positively correlated (circled). C. Controlling for trees with a conditional (all-else-equal) approach such as a partial correlation or a Markov network leads to correct identification of the negative shrub-shrub interaction (circled). See Appendix 1 and the results for "three species" for more details.

Figure 2: A. A small Markov network, defined by its $\alpha$ and $\beta$ values. The abiotic environment favors the occurrence of each species ($\alpha >0$), particularly species 2 ($\alpha_2 > \alpha_1$). The negative $\beta_{12}$ coefficient is consistent with competition between the two species. B. The coefficients determine the probabilities of all four possible presence-absence combinations for Species 1 and Species 2. $\alpha_1$ is added to the exponent whenever Species 1 is present ($y_1 = 1$), but not when it is absent ($y_1 = 0$). Similarly, the exponent includes $\alpha_2$ only when species $2$ is present ($y_2 = 1$), and includes $\beta_{12}$ only when both are present ($y_1y_2 = 1$). The normalizing constant $Z$, ensures that the four probabilities sum to 1. In this case, $Z$ is about 18.5. C. The expected frequencies of all possible co-occurrence patterns between the two species of interest, as calculated in the previous panel. D. Without competition (i.e. with $\beta_{12}=0$, each species would occur more often.

Figure 3: Proportion of variance in interaction coefficients explained by each method versus number of sampled locations across the three simulation types. For the null model (Pairs), two outliers with $|Z|>1000$ were manually adjusted to $|Z|=50$ to mitigate their detrimental influence on $R^2$ (Appendix 5).

Figure 4: A. The Markov network's estimated interaction coefficients were generally very similar to the GLM estimates. B. The null model's estimates typically matched the (negative) correlation coefficient, after controlling for landscape size. C. For any given interaction strength, the null model was much more likely to misclassify its sign with 95% confidence than the Markov network was. As with the other analyses based on inferential statistics, this panel only shows data from the first two simulation types.



davharris/rosalia documentation built on May 14, 2019, 9:29 p.m.