knitr::opts_chunk$set(echo = FALSE)
library(drake)
library(dplyr)
library(ggplot2)
library(scadsanalysis)

knitr::opts_chunk$set(fig.width=3, fig.height=2.5, warning = FALSE, message = FALSE) 

We are interested in how the self-similarity of the elements of the feasible set varies over gradients in S and N. The more self-similar the feasible set is, the less likely it is for us to obtain a SAD that differs from the majority of the feasible set by chance. The intuition from "common-sense" probability theory, and a common theme in statistical mechanics (Jaynes 1957) is that as the number of possible arrangements becomes large, the majority of possible arrangements should become very similar to each other in broad-scale characteristics. However, we don't know that this is the case for SADs, nor do we know the values for S and N where this increase in self-similarity begins to occur.

In the manuscript, we use a breadth index defined as ratio of the 95% interval of summary statistic values : the full range of summary statistic values. This can be calculated quickly and is easily interpretable. It also directly reflects the distributions to which we are comparing our observations. However, it is not a widely-used metric. Also, if the summary statistic values are idiosyncratic (skewness in particular can behave counterintuitively), it can reflect those idiosyncracies.

With more computing, we can explore how similar the elements of the feasible set are to each other by comparing them to each other directly. Then we can ask whether, as the feasible set gets large, the elements become more similar and converge on a dominant form.

Overview

We can describe how self-similar the elements of a feasible set are via numerous pairwise comparisons. Given a body of samples drawn from a feasible set, we draw two samples and compute some metric that describes how similar these two samples are to each other. We do this many times, making many pairwise comparisons, to generate a distribution of the self-similarity metric for that feasible set. We then compare how self-similar different feasible sets are by comparing the distributions of self-similarity metrics for the different feasible sets.

Here we demonstrate this process for an example feasible set, and then present results for feasible sets spanning the range of S and N present in our data.

Example

## Set up the cache and config
db <- DBI::dbConnect(RSQLite::SQLite(), here::here("analysis", "drake", "drake-cache-net.sqlite"))
cache <- storr::storr_dbi("datatable", "keystable", db)
cache$del(key = "lock", namespace = "session")
knitr::opts_chunk$set(fig.width=3, fig.height=2.5, warning = FALSE, message = FALSE) 


net_summary <- readd(all_di_summary, cache = cache)

net_summary <- net_summary %>%
  mutate(log_nparts = log(gmp:::as.double.bigz(nparts)))

example_fs <- readd(fs_s_7_n_71, cache = cache)

example_di <- readd(di_fs_s_7_n_71, cache =cache)

example_fs <- example_fs %>%
  left_join(example_di) %>%
  left_join(net_summary)

Here we have a bank of r length(unique(example_fs$sim)) samples from the feasible set for SADs with 7 species and 71 individuals. We draw two of these samples to compare.

ggplot(example_fs, aes(x = rank, y = abund, group = sim)) +
  geom_line(alpha = .1) +
  theme_bw()


set.seed(1977)



comparison <- fs_diff_sampler(example_fs)

sims <- c(comparison$sim1[1], comparison$sim2[1])


comparison_fs <- filter(example_fs, sim %in% sims) %>%
  mutate(sim = as.factor(sim))

ggplot(comparison_fs, aes(x = rank, y = abund, color = sim)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  scale_color_viridis_d(end = .7)

We have implemented 5 metrics of similarity for comparing samples:

comparison

We draw numerous pairs of SADs and compare them numerous times to get distributions for the self similarity metrics.

comparisons <- rep_diff_sampler(example_fs, ndraws = 100) 


ggplot(comparisons, aes(r2)) +
  geom_density() +
  theme_bw()


ggplot(comparisons, aes(r2_log)) +
  geom_density() +
  theme_bw()


ggplot(comparisons, aes(cd)) +
  geom_density() +
  theme_bw()


ggplot(comparisons, aes(prop_off)) +
  geom_density() +
  theme_bw()


ggplot(comparisons, aes(div)) +
  geom_density() +
  theme_bw()

We repeat this process for feasible sets for different values of S and N, and compare the distributions for the self-similarity metrics for the different feasible sets. Here we compare the distributions for the self-similarity metrics for our original feasible set (S = 7, N = 71, "small" in the figure) and for a much larger community (S = 44, N = 13360, "large").

example_fs2 <- readd(fs_s_44_n_13360, cache = cache)

example_di2 <- readd(di_fs_s_44_n_13360, cache =cache)

example_fs2 <- example_fs2 %>%
  left_join(example_di2) %>%
  left_join(net_summary)

comparisons2 <- rep_diff_sampler(example_fs2, ndraws = 100) 

both_comparisons <- bind_rows(comparisons, comparisons2) %>%
  mutate(community_size = ifelse(n0 > 13000, "large", "small")) 

gridExtra::grid.arrange(grobs = list(
  ggplot(both_comparisons, aes(1, community_size, color = community_size)) +
    geom_point(size =8) +
    geom_label(aes(label = community_size), nudge_x = .005, size = 8) +
    theme_void() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma") +
    xlim(.99999, 1.006) 

    ,

ggplot(both_comparisons, aes(r2, color = community_size)) +
  geom_density() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma"),

ggplot(both_comparisons, aes(r2_log, color = community_size)) +
  geom_density() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma"),

ggplot(both_comparisons, aes(cd, color = community_size)) +
  geom_density() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma"),


ggplot(both_comparisons, aes(prop_off, color = community_size)) +
  geom_density() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma"),


ggplot(both_comparisons, aes(div, color = community_size)) +
  geom_density() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_viridis_d(end = .8, option = "plasma")

), ncol = 3)

For each metric, the large community is more self-similar than the smaller. That is, it has consistently higher R2, r2_log, and coefficient of determination values, and lower prop_off and K-L divergence.

Across a range of S and N

Here we have drawn samples from a set of points in S and N space that spans the range present in our datasets. For each feasible set we make 200 comparisons of elements (although for small feasible sets, 200 is not necessarily possible). Here is how that set is distributed in S and N space, colored by the log() number of elements in the feasible set.

loadd(all_diffs, cache= cache)

all_diffs <- all_diffs %>%
  mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  filter(!is.na(sim1), !is.na(sim2))


ggplot(all_diffs, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c() +
  theme(legend.position = "top")

Heat maps

We can make heat maps of the density distribution of self-similarity metrics and look at how the densities shift over gradients in the size of the feasible set.

There are a couple of nuances to doing this:

R2

Higher R2 values indicate more similarity. R2 can be very low but the most meaningful variation is between 0 and 1.

loadd(all_diffs, cache= cache)

# For plotting remove r2 that are EXTREMELY LOW. 

all_diffs_r2 <- all_diffs %>%
    mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  mutate(s_and_n = paste0("s_", s0, "_n_", n0),
         low_r2 = r2 <= -1) %>%
  group_by(s_and_n) %>%
  mutate(prop_small_r2 = mean(low_r2)) %>%
  filter(r2 > -1) %>%
  group_by(s_and_n) %>%
  mutate(ncomparisons = dplyr::n(),
         comparison_index = dplyr::row_number()) %>%
  ungroup() %>%
  filter(ncomparisons > 50, comparison_index <= 50) %>%
  group_by(s0, n0) %>%
  mutate(comparisons_included = dplyr::n()) %>%
  ungroup()


ggplot(all_diffs_r2, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

# This plot demonstrates that removing comparisons for which r2 < -1 most impacts small FS. So if it distorts the results at all, it makes the small ones look *higher* than they really are.
# ggplot(all_diffs_r2, aes(log(s0), log(n0), color = prop_small_r2)) +
#   geom_point() +
#   theme_bw() +
#   scale_color_viridis_c() +
#   theme(legend.position = "top")

ggplot(all_diffs_r2, aes(r2, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Here is R2 for only relatively small feasible sets (those with fewer than $e^20$ elements in the feasible set). This lets us zoom in on the distributions where they start to broaden out.

ggplot(filter(all_diffs_r2, log_nparts < 20), aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

ggplot(filter(all_diffs_r2, log_nparts < 20), aes(r2, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

R2 on logged vectors

Higher R2 values indicate more similarity. R2 can be very low but the most meaningful variation is between 0 and 1.

loadd(all_diffs, cache= cache)

# For plotting remove r2 that are EXTREMELY LOW. 

all_diffs_r2_log <- all_diffs %>%
    mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  mutate(s_and_n = paste0("s_", s0, "_n_", n0),
         low_r2_log = r2_log <= -1) %>%
  group_by(s_and_n) %>%
  mutate(prop_small_r2_log = mean(low_r2_log)) %>%
  filter(r2_log > -1) %>%
  group_by(s_and_n) %>%
  mutate(ncomparisons = dplyr::n(),
         comparison_index = dplyr::row_number()) %>%
  ungroup() %>%
  filter(ncomparisons > 50, comparison_index <= 50) %>%
  group_by(s0, n0) %>%
  mutate(comparisons_included = dplyr::n()) %>%
  ungroup()


ggplot(all_diffs_r2_log, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

# This plot demonstrates that removing comparisons for which r2 < -1 most impacts small FS. So if it distorts the results at all, it makes the small ones look *higher* than they really are.
# ggplot(all_diffs_r2, aes(log(s0), log(n0), color = prop_small_r2)) +
#   geom_point() +
#   theme_bw() +
#   scale_color_viridis_c() +
#   theme(legend.position = "top")

ggplot(all_diffs_r2_log, aes(r2_log, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Here is R2 on logged vectors for only "small" feasible sets. This lets us zoom in on the distributions where they start to broaden out.

ggplot(filter(all_diffs_r2_log, log_nparts < 20), aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

ggplot(filter(all_diffs_r2_log, log_nparts < 20), aes(r2_log, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Coefficient of determination

Higher CD values indicate more similarity. It is bounded 0 to 1.

loadd(all_diffs, cache= cache)

# For plotting remove r2 that are EXTREMELY LOW. 

all_diffs_cd <- all_diffs %>%
    mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  mutate(s_and_n = paste0("s_", s0, "_n_", n0),
         low_cd = cd <= -1) %>%
  group_by(s_and_n) %>%
  mutate(prop_small_cd = mean(low_cd)) %>%
#  filter(cd > -1) %>%
  group_by(s_and_n) %>%
  mutate(ncomparisons = dplyr::n(),
         comparison_index = dplyr::row_number()) %>%
  ungroup() %>%
  filter(ncomparisons > 50, comparison_index <= 50) %>%
  group_by(s0, n0) %>%
  mutate(comparisons_included = dplyr::n()) %>%
  ungroup()


ggplot(all_diffs_cd, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

# This plot demonstrates that removing comparisons for which cd < -1 most impacts small FS. So if it distorts the results at all, it makes the small ones look *higher* than they really are.
# ggplot(all_diffs_cd, aes(log(s0), log(n0), color = prop_small_cd)) +
#   geom_point() +
#   theme_bw() +
#   scale_color_viridis_c() +
#   theme(legend.position = "top")

ggplot(all_diffs_cd, aes(cd, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Here is cd for only "small" feasible sets.

ggplot(filter(all_diffs_cd, log_nparts < 20), aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

ggplot(filter(all_diffs_cd, log_nparts < 20), aes(cd, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Proportion of individuals allocated to different species

Lower "prop off" values indicate more similarity. It is bounded 0 to 1.

loadd(all_diffs, cache= cache)

# For plotting remove r2 that are EXTREMELY LOW. 

all_diffs_prop_off <- all_diffs %>%
    mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  mutate(s_and_n = paste0("s_", s0, "_n_", n0),
         low_prop_off = prop_off <= -1) %>%
  group_by(s_and_n) %>%
  mutate(prop_small_prop_off = mean(low_prop_off)) %>%
#  filter(prop_off > -1) %>%
  group_by(s_and_n) %>%
  mutate(ncomparisons = dplyr::n(),
         comparison_index = dplyr::row_number()) %>%
  ungroup() %>%
  filter(ncomparisons > 50, comparison_index <= 50) %>%
  group_by(s0, n0) %>%
  mutate(comparisons_included = dplyr::n()) %>%
  ungroup()


ggplot(all_diffs_prop_off, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

# This plot demonstrates that removing comparisons for which prop_off < -1 most impacts small FS. So if it distorts the results at all, it makes the small ones look *higher* than they really are.
# ggplot(all_diffs_prop_off, aes(log(s0), log(n0), color = prop_small_prop_off)) +
#   geom_point() +
#   theme_bw() +
#   scale_color_viridis_c() +
#   theme(legend.position = "top")

ggplot(all_diffs_prop_off, aes(prop_off, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Here is prop_off for only "small" feasible sets.

ggplot(filter(all_diffs_prop_off, log_nparts < 20), aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

ggplot(filter(all_diffs_prop_off, log_nparts < 20), aes(prop_off, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

K-L divergence

Lower divergence values indicate more similarity. It is bounded 0 to 1.

loadd(all_diffs, cache= cache)

# For plotting remove r2 that are EXTREMELY LOW. 

all_diffs_div <- all_diffs %>%
    mutate(log_nparts = log(gmp:::as.double.bigz(nparts))) %>%
  mutate(s_and_n = paste0("s_", s0, "_n_", n0),
         low_div = div <= -1) %>%
  group_by(s_and_n) %>%
  mutate(prop_small_div = mean(low_div)) %>%
#  filter(div > -1) %>%
  group_by(s_and_n) %>%
  mutate(ncomparisons = dplyr::n(),
         comparison_index = dplyr::row_number()) %>%
  ungroup() %>%
  filter(ncomparisons > 50, comparison_index <= 50) %>%
  group_by(s0, n0) %>%
  mutate(comparisons_included = dplyr::n()) %>%
  ungroup()


ggplot(all_diffs_div, aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

# This plot demonstrates that removing comparisons for which div < -1 most impacts small FS. So if it distorts the results at all, it makes the small ones look *higher* than they really are.
# ggplot(all_diffs_div, aes(log(s0), log(n0), color = prop_small_div)) +
#   geom_point() +
#   theme_bw() +
#   scale_color_viridis_c() +
#   theme(legend.position = "top")

ggplot(all_diffs_div, aes(div, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

Here is K-L divergence for only "small" feasible sets.

ggplot(filter(all_diffs_div, log_nparts < 20), aes(log(s0), log(n0), color = log_nparts)) +
  geom_point() +
  theme_bw() +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma") +
  theme(legend.position = "top")

ggplot(filter(all_diffs_div, log_nparts < 20), aes(div, group = s_and_n, color = log_nparts)) +
  geom_density(alpha = .4) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_viridis_c(direction = -1, end = .8, option = "plasma")

In summary

Large feasible sets are consistently more self-similar than small ones, regardless of the metric of self-similarity used.

References

Jaynes, E.T. (1957). Information Theory and Statistical Mechanics. Phys. Rev., 106, 620–630.



diazrenata/scadsanalysis documentation built on May 14, 2021, 6:59 p.m.