README.md

slendr: Population Genetic Simulations in R

CRAN-version CRAN-downloads R-CMD-check Binder Coverage status

Overview

slendr is a toolbox for running population genomic simulations entirely in R. Our original motivation for developing it was to provide a framework for simulating spatially-explicit genomic data on real geographic landscapes, however, it has grown to be much more than that since then: slendr can now simulate data from traditional, non-spatial demographic models using msprime as a simulation engine, and it even supports selection scenarios via user-defined SLiM estension snippets. In addition to model definition and simulation, slendr also provides a set of function for efficient analysis of tree-sequence genomic data, utilizing the tskit module for underlying computation.

This page briefly summarizes slendr's most important features. A more detailed description of the slendr architecture and an extensive set of practical code examples can be found in our paper in the PCI journal and on our website.

Citing slendr

The slendr paper is now published in the Peer Community Journal!

If you use slendr in your work, please cite it as:

Petr, Martin; Haller, Benjamin C.; Ralph, Peter L.; Racimo, Fernando. slendr: a framework for spatio-temporal population genomic simulations on geographic landscapes. Peer Community Journal, Volume 3 (2023), article no. e121. doi : 10.24072/pcjournal.354.

Citations help me justify further development and fixing bugs! Thank you! ❤️

Main features

Here is a brief summary of slendr's most important features. The R package allows you to:

Utilizing the flexibility of R with its wealth of libraries for statistics, geospatial analysis and graphics, and combining it with the power of population genetic simulation frameworks SLiM and msprime, the slendr R package makes it possible to write entire simulation and analytic pipelines without the need to leave the R environment.

Testing the R package in an online RStudio session

You can open an RStudio session and test examples from the vignettes directly in your web browser by clicking this button (no installation is needed!):

Binder

In case the RStudio instance appears to be starting very slowly, please be patient (Binder is a freely available service with limited computational resources provided by the community). If Binder crashes, try reloading the web page, which will restart the cloud session.

Once you get a browser-based RStudio session, you can navigate to the vignettes/ directory and test the examples on your own!

Installation

slendr is now available CRAN which means that you can install it simply by entering install.packages("slendr") into your R console.

If you would like to test the latest features of the software (perhaps because you need some bug fixes), you can install it with devtools::install_github("bodkan/slendr") (note that this requires the R package devtools).

⚠️⚠️⚠️

Note: the default slendr installation no longer has a hard dependency on geospatial R packages sf, stars, and rnaturalearth. If you indend to use slendr for spatial genomic simulations and data analysis, you have to run install.packages(c("sf", "stars", "rnaturalearth")) first.

⚠️⚠️⚠️

This software is under active development!

If you would like to stay updated:

  1. Click on the "Watch" button on the project's GitHub website.

  2. Follow me on social media where I will be posting progress updates (you can find links on my homepage).

  3. From time to time, take a look at the changelog where I will post updates on new features, breaking changes, etc.

Example

Here is a very brief demonstration of the kind of spatial model which slendr has been originally designed to simulate data from. Please note that although this is a spatially-explicit population genetic model, slendr has an extensive support for traditional, non-spatial simulations as well. Furthermore, this example only shows how to specify and simulate a model in R. It doesn't show how to analyse tree-sequence outputs and compute population genetic statistics from them (this important feature is demonstrated in this tutorial).

1. Setup the spatial context

Imagine that we wanted to simulate spatio-temporal genomic data from a toy model of the history of modern humans in West Eurasia after the Out of Africa migration.

First, we define the spatial context of the simulation. This will represent the "world" which will be occupied by populations in our model.

library(slendr)

# this sets up internal Python environment and needs to be ran only once!
# (do not put this in your R scripts, run this command in the R console
# after you (re-)install slendr)
setup_env()

# activate the internal Python environment needed for simulation and
# tree-sequence processing
init_env()

map <- world(
  xrange = c(-13, 70), # min-max longitude
  yrange = c(18, 65),  # min-max latitude
  crs = "EPSG:3035"    # coordinate reference system (CRS) for West Eurasia
)

We can visualize the defined world map using the function plot_map provided by the package.

plot_map(map)

Although in this example we use a real Earth landscape, the map can be completely abstract (either blank or with user-defined landscape features such as continents, islands, corridors and barriers).

2. Define broader geographic regions

In order to make the definitions of population ranges (below) easier, we can define smaller regions on the map using the function region.

Note that all coordinates in slendr are specified in the geographic coordinate system (i.e., degrees longitude and latitude), but are internally represented in a projected CRS (in our case, EPSG 3035 as specified above).

This makes it easier for us to define spatial features simply by reading the coordinates from a regular map but the internal projected CRS makes simulations more accurate (distances and shapes are not distorted because we can use a CRS tailored to the region of the world we are working with). The projected CRS takes care of the projection of the part of the world we're interested in from the three-dimensional Earth surface to a two-dimensional map.

africa <- region(
  "Africa", map,
  polygon = list(c(-18, 20), c(38, 20), c(30, 33),
                 c(20, 33), c(10, 38), c(-6, 35))
)
europe <- region(
  "Europe", map,
  polygon = list(
    c(-8, 35), c(-5, 36), c(10, 38), c(20, 35), c(25, 35),
    c(33, 45), c(20, 58), c(-5, 60), c(-15, 50)
  )
)
anatolia <- region(
  "Anatolia", map,
  polygon = list(c(28, 35), c(40, 35), c(42, 40),
                 c(30, 43), c(27, 40), c(25, 38))
)

Again, we can use the generic plot_map function to visualize these objects, making sure we specified them correctly:

plot_map(africa, europe, anatolia)

3. Define demographic history and population boundaries

The most important function in the slendr package is population(), which is used to define names, split times, sizes and spatial ranges of populations. Here, we specify times in years before the present, distances in kilometers. If this makes more sense for your models, times can also be given in a forward direction.

You will also note functions such as move() or expand_range() which are designed to take a slendr population object and schedule its spatial dynamics at appropriate times during the model simulation (which will happen at a later step).

Note that in order to make this example executable in a reasonable time on my extremely old laptop, I decreased the sizes of all populations to unrealistic levels. This will speed up the SLiM simulation at a later step.

afr <- population( # African ancestral population
  "AFR", time = 52000, N = 3000, map = map, polygon = africa
)

ooa <- population( # population of the first migrants out of Africa
  "OOA", parent = afr, time = 51000, N = 500, remove = 25000,
  center = c(33, 30), radius = 400e3
) %>%
  move(
    trajectory = list(c(40, 30), c(50, 30), c(60, 40)),
    start = 50000, end = 40000, snapshots = 20
  )

ehg <- population( # Eastern hunter-gatherers
  "EHG", parent = ooa, time = 28000, N = 1000, remove = 6000,
  polygon = list(
    c(26, 55), c(38, 53), c(48, 53), c(60, 53),
    c(60, 60), c(48, 63), c(38, 63), c(26, 60))
)

eur <- population( # European population
  name = "EUR", parent = ehg, time = 25000, N = 2000,
  polygon = europe
)

ana <- population( # Anatolian farmers
  name = "ANA", time = 28000, N = 3000, parent = ooa, remove = 4000,
  center = c(34, 38), radius = 500e3, polygon = anatolia
) %>%
  expand_range( # expand the range by 2.500 km
    by = 2500e3, start = 10000, end = 7000,
    polygon = join(europe, anatolia), snapshots = 20
  )

yam <- population( # Yamnaya steppe population
  name = "YAM", time = 7000, N = 500, parent = ehg, remove = 2500,
  polygon = list(c(26, 50), c(38, 49), c(48, 50),
                 c(48, 56), c(38, 59), c(26, 56))
) %>%
  move(trajectory = list(c(15, 50)), start = 5000, end = 3000, snapshots = 10)

We can use the function plot_map again to get a "compressed" overview of all spatio-temporal range dynamics encoded by the model so far (prior to the simulation itself).

plot_map(afr, ooa, ehg, eur, ana, yam)

4. Define gene-flow events

By default, populations in slendr do not mix even if they are overlapping. In order to schedule an gene-flow event between two populations, we can use the function gene_flow. If we want to specify multiple such events at once, we can collect these events in a simple R list:

gf <- list(
  gene_flow(from = ana, to = yam, rate = 0.5, start = 6500, end = 6400, overlap = FALSE),
  gene_flow(from = ana, to = eur, rate = 0.5, start = 8000, end = 6000),
  gene_flow(from = yam, to = eur, rate = 0.75, start = 4000, end = 3000)
)

5. Compile the model to a set of configuration files

Before we run the simulation, we compile all individual model components (population objects and gene-flow events) to a single R object, specifying some additional model parameters. Additionally, this performs internal consistency checks, making sure the model parameters (split times, gene flow times, etc.) make sense before the (potentially quite computationally costly) simulation is even run.

model <- compile_model(
  populations = list(afr, ooa, ehg, eur, ana, yam), # populations defined above
  gene_flow = gf,
  generation_time = 30,
  resolution = 10e3,                   # resolution in meters per pixel
  competition = 130e3, mating = 100e3, # spatial interaction parameters
  dispersal = 70e3,                    # how far can offspring end up from their parents
)

Compiled model is kept as an R object which can be passed to different functions. In this example, we will simulate data with the slim() engine (but we could also run the simulation through the coalescent engine via the msprime() function).

6. Visualize the model

The package provides an R shiny-based browser app explore_model() for checking the model dynamics interactively and visually. For more complex models, this is much better than static spatial plots such as the one we showed in step 2 above:

explore_model(model)

The function has two modes:

a) Plotting (and "playing") spatial map dynamics:

b) Displaying the demographic history graph (splits and gene-flow events) embedded in the specified model:

7. Run the model in SLiM

Finally, we can execute the compiled model in SLiM. Here we run the simulation in a batch mode, but we could also run it in SLiMgui by setting method = "gui". This would allow us to inspect the spatial simulation as it happens in real time.

The slim function generates a complete SLiM script tailored to run the spatial model we defined above. This saves you, the user, a tremendous amount of time, because you don't have to write new SLiM code every time you design a new demographic model. The output of the simulation run from any slendr model is always a tree sequence, here loaded into an object ts_slim.

ts_slim <- slim(model, sequence_length = 10e6, recombination_rate = 1e-8,
                method = "batch", random_seed = 314159)

As specified here, slendr's SLiM backend will simulate 10 Mb of sequence for each individual, and produce a tree sequence output from the simulation run which can be analysed by many built-in population genetic functions. By default, all individuals living at the end of the simulation are recorded as samples in the tree sequence. If a specific set of samples (ancient and modern) is needed, this can be defined accordingly using a dedicated function.

Note that although we defined a spatial model, we could have just as easily simulated standard, non-spatial data by running the same model through slendr's msprime() back end without the need to make any changes:

ts_msprime <- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8)

Here is a very quick overview of the SLiM simulation run summarised as a GIF animation. Again, please note that the simulation is extremely simplified. We only simulated a very small number of individuals in each population, and we also didn't specify any dispersal dynamics which is why the populations look so clumped.

animate_model(model = model, file = locations_file, steps = 50, width = 500, height = 300)
#> Error in eval(expr, envir, enclos): R: UnableToWritePixelCache `/var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//RtmpyLwlps/magick-UHJ_3VsqZdrkY5sTCDUuYTjjSKmibs4k': No space left on device @ error/cache.c/WritePixelCachePixels/6023

At this point, you could either compute some population genetic statistics of interest or perhaps analyse the spatial features of the genealogies simulated by your model.

Further information

The example above provides only a very brief and incomplete overview of the full functionality of the slendr package. There is much more to slendr than what we demonstrated here. For instance:



bodkan/slendr documentation built on Dec. 19, 2024, 11:41 p.m.