In rscherrer/speciomer: Import Speciome Data into R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Use this package to import simulation data saved by the speciome program into R-friendly objects. Here we introduce the most important functions to give a gist of how to handle this type of data. The purpose of the package is mostly to import the binary data saved during the simulation into tibbles, which can then be handled using the tidyverse (or other tools) in R.

library(speciomer)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(stringr)

The inst/extdata folder contains example simulation data.

list.files("../inst/extdata")

Read a simulation

Let us focus on one particlar simulation:

root <- "../inst/extdata/sim-example"

Use the read_population function to read the data of a simulation into one tibble. We can use it, for example, to know whether speciation happened in a simulation by reading the degrees of ecological, spatial and reproductive isolation:

read_population(root, variables = c("EI", "SI", "RI"))

Now that the data is in an R-friendly tibble, we can proceed onto all kinds of analyses and plotting, for example:

read_population(root, variables = c("EI", "RI", "SI")) %>%
  pivot_longer(cols = c("EI", "RI", "SI")) %>%
  mutate(
    name = fct_recode(
      name,
      !!!c(Ecological = "EI", Reproductive = "RI", Spatial = "SI")
    )
  ) %>%
  ggplot(aes(x = time, y = value, color = name)) +
  geom_line() +
  theme_classic() +
  scale_color_manual(values = c("lightgreen", "lightblue", "coral")) +
  xlab("Time (generations)") +
  ylab("Level of isolation") +
  labs(color = "Type") +
  ggtitle("Various isolation metrics through time")

Note that once the data are in a tibble, data manipulation and plotting does not require any specific speciomer function and can be done using your favorite R tools (e.g. the tidyverse). For these reasons the previous plot will be the only example of down-the-line analyses in this vignette, as there are too many ways in which the data of the simulations could be processed and these analyses are typically out of the scope of speciomer. In other words, speciomer provides the tools to import the data into R, use your creativity and other state-of-the-art packages for data manipulation and plotting to do the rest!

Read individual data

Now, this all works fine if we want to read variables that have one value per time point. But what if we want to visualize, for example, the phenotypes of each individual in the population? Then we have one value per trait, per individual and per time point. For these individual-level data we use:

read_individuals(root, variables = "traits", ncols = 3)

Here the ncols argument tells the function that there are three values per individual, and so the content of individual_traits.dat must be split into three columns.

Note that we specified "traits" as a variable name. Internally, the function interprets that as "individual_traits", and so reads the file individual_traits.dat. We could have equally specified variables = "individual_traits" but this takes more space.

One example plot we may want to generate from such data could be:

read_individuals(root, variables = "traits", ncols = 3) %>%
  pivot_longer(cols = paste0("trait", 1:3)) %>%
  mutate(
    name = str_remove(name, "trait"),
    name = fct_recode(
      name, !!!c(Ecological = "1", Mating = "2", Neutral = "3")
    )
  ) %>%
  ggplot(aes(x = time, y = value)) +
  geom_bin2d() +
  facet_grid(. ~ name) +
  theme_classic() +
  theme(axis.text.x = element_text(hjust = 1, angle = 60)) +
  xlab("Time (generations)") +
  ylab("Trait value") +
  labs(fill = "Count") +
  scale_fill_continuous(type = "viridis")

It is possible to add more individual-level variables, such as which ecotype each individual belongs to:

read_individuals(
  root, variables = c("ecotypes", "traits"), ncols = c(1, 3)
)

Then, we have to provide one value of ncols for each data file we want to read. There is one ecotype per individual in individual_ecotype.dat and three trait values per individual in individual_trait.dat, so ncols = c(1, 3).

Read genome-wide data

speciome simulates evolution with explicit genomes, so what if we want to see, for example, Fst data for across all loci and through time? We use read_loci for that:

read_loci(root, variables = "Fst")

This will read the file locus_Fst.dat. Notice that the components of the genetic architecture are loaded and appended to the data, to help us remember e.f. which gene codes for what and with what intensity (these details are not variables but fixed parameters of the genetic architecture). We can load them separately in the architecture file, provided we saved it with the simulation:

read_architecture(root)[["nodes"]]

This tibble contains all kinds of information about the genetic architecture on a per-locus basis. The output of read_architecture is actually a list of two tibbles, one called "nodes" containing locus-wise information and the other called "edges" containing edge-wise information from the gene networks (see below).

These data can then be used to display genome scans of differentiation at various time points, for example:

read_loci(root, "Fst") %>%
  filter(time %in% c(0, 500, 1000)) %>%
  mutate(
    time = str_c("t = ", time),
    time = factor(time, levels = paste("t =", c(0, 500, 1000))), # reorder
    trait = fct_recode(
      as.character(trait), !!!c(Ecological = "1", Mating = "2", Neutral = "3")
    )
  ) %>%
  ggplot(aes(x = locus, y = Fst, xend = locus, color = trait)) +
  geom_segment(yend = 0) +
  theme_classic() +
  facet_grid(time ~ .) +
  scale_color_manual(values = c("forestgreen", "goldenrod", "darkgray")) +
  xlab("Locus") +
  ylab(parse(text = "F[ST]")) +
  labs(color = "Trait")

Read gene networks

Now, in speciome the genetic architecture does not only consist of locus-specific parameters but also of edge-specific parameters, across the edges of the three gene networks. You can access these with:

read_architecture(root)[["edges"]]

Here are recorded the details for each edge, identified by the indexes of the two partner loci (from and to).

It is also possible to read edge-wise data from a simulation. For example, to read, for each edge through time, the correlation in breeding value between the partner loci, use:

read_edges(root, variables = "corbreed")

Under the hood the functions read_loci, read_edges, read_individuals and read_population all call another function, read_speciome, to read the binary .dat files containing the simulation output. It is just that these wrappers around read_speciome organize the output tibble in a specific way to provide observations on a per-locus-per-time-point, per-edge-per-time-point, per-individual-per-time-point or only a per-time-point basis, respectively.

Read whole individual genomes

You may also want to read data recorded for each locus in each individual through time, e.g. whole individual genomes. Beware that these data can take up a huge amount of space (if there are many loci, many individuals and many time points).

root2 <- "../inst/extdata/sim-indiv-genomes/"
read_individual_genomes(root2, locus_architecture = TRUE)

The resulting tibble contains, for each locus in each individual, the allele at each haplotype, the time point when this data was recorded, the allele count (number of 1-alleles) and the genetic value of that locus in that individual. Here locus-specific parameters of the genetic architecture are appended (you can turn this off by setting locus_architecture = FALSE). Additional arguments allow to append more locus-wise variables, or individual-wise variables such as the ecotype of each individual (individual_variables = "ecotypes").