Introduction to predictsr

The predictsr package accesses the PREDICTS database (Hudson et al., 2014) from within R, conveniently as a dataframe. It uses the Natural History Museum Data Portal to download the latest versions of the PREDICTS database and some related metadata.

The PREDICTS database comprises over 4 million measurements of species at sites across the world. All major terrestrial plant, animal, and fungal groups are represented within the data. The most well-represented group is arthropods (mainly insects), covering about 2% of the species known to science.

There have been two public releases of PREDICTS data: the first in 2016 and the second in 2022, consisting of about 3 million and 1 million records, respectively. We include both, as a single dataset, in this package.

However, the PREDICTS database is under constant development and is continuously growing, with further releases planned.

To get started, let's load the database into R and poke around. To do so you will use the GetPredictsData function, which pulls the data in from the Data Portal. The default option is shown below, but you can pass either 2016, 2022, or both as the extracts to fetch. This call reads both extracts into a single dataframe:

predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
str(predicts)

Be patient: this may take a few seconds to run, as it downloads the data from the NHM Data Portal.

So we can see that there are over 4 million records in the combined PREDICTS extracts. Let's look at a set of summary statistics for the database:

if (nrow(predicts) > 0) {
  # Deduplicate to one row per taxon per study.
  taxa <- predicts[
    !duplicated(predicts[, c("Source_ID", "Study_name", "Taxon_name_entered")]),
  ]
  # Count unique species/infraspecies taxa, plus rows only identified to
  # coarser taxonomic ranks.
  species_counts <- length(
    unique(taxa[taxa$Rank %in% c("Species", "Infraspecies"), "Taxon"])
  ) + nrow(taxa[!taxa$Rank %in% c("Species", "Infraspecies"), ])

  print(glue::glue(
    "This database has {length(unique(predicts$SS))} studies across ",
    "{length(unique(predicts$SSBS))} sites, in ",
    "{length(unique(predicts$Country))} countries, and with ",
    "{species_counts} species."
  ))
} else {
  print("No data available - check the download!")
}

So over 30,000 sites, 101 countries, and 12,892 species in this dataframe! There are a couple of important columns to note. The SS column is what we use to identify studies when dealing with PREDICTS data; it is the concatenation of the Source_ID and Study_number columns. Another important identifier is SSBS, the concatenation of Source_ID, Study_number, Block, and Site_number, which is what we use to identify individual sites in the database.
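
As a quick sanity check, we can rebuild SS from its component columns and compare (a minimal sketch; the space separator is an assumption here, so consult the column descriptions below for the authoritative format):

if (nrow(predicts) > 0) {
  # Proportion of rows where our guessed concatenation matches SS exactly;
  # the separator used in paste() is an assumption.
  ss_guess <- paste(predicts$Source_ID, predicts$Study_number)
  print(mean(ss_guess == predicts$SS))
}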

Let's also check the ranges of sample collection in the database:

if (nrow(predicts) > 0) {
  print(glue::glue(
    "Earliest sample collection (midpoint): {min(predicts$Sample_midpoint)}, ",
    "latest sample collection (midpoint): {max(predicts$Sample_midpoint)}"
  ))
} else {
  print("No data available in the PREDICTS database - check the download!")
}
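
For a coarser view of sampling through time, you can tabulate collection midpoints by year (a sketch; it assumes Sample_midpoint parses with as.Date, which you should verify against the column's datatype):

if (nrow(predicts) > 0) {
  # Count samples per collection year; as.Date() is an assumption about the
  # storage format of Sample_midpoint.
  midpoints <- as.Date(predicts$Sample_midpoint)
  print(table(format(midpoints, "%Y")))
}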

Accessing site-level summaries

We also include access to the site-level summaries from the full release; to get these data, use the GetSitelevelSummaries function. The function call is very similar, here pulling in the summaries for the same extracts as above:

summaries <- predictsr::GetSitelevelSummaries(extract = c(2016, 2022))
str(summaries)

Investigating the summary data more closely, we see that a number of columns from the full database are missing from the summaries:

if (nrow(predicts) > 0 && nrow(summaries) > 0) {
  print(names(predicts)[!(names(predicts) %in% names(summaries))])
}

This is because all of the measurement-level data have been dropped from the dataframe. We could try to replicate the creation of the summaries dataframe with some {dplyr} operations, roughly as follows:

Note: the replication is not exact; there seem to be some differences in the summary statistics, and a couple of sites that don't match up (see the check after the code below).

if (nrow(summaries) > 0) {
  summaries_rep <- predicts |>
    dplyr::mutate(
      # Collapse measurement-level rows to site-level values.
      Higher_taxa = paste(sort(unique(Higher_taxon)), collapse = ","),
      N_samples = length(Measurement),
      Rescaled_sampling_effort = mean(Rescaled_sampling_effort),
      .by = SSBS
    ) |>
    # Keep only the columns present in the official summaries.
    dplyr::select(
      dplyr::all_of(names(summaries))
    ) |>
    dplyr::distinct() |>
    dplyr::arrange(SSBS)

  # Restrict the official summaries to the sites we replicated, then compare.
  summaries_copy <- summaries |>
    subset(SSBS %in% summaries_rep$SSBS) |>
    dplyr::arrange(SSBS)

  all.equal(summaries_copy, summaries_rep)
}
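
To see which sites fail to match up, one option is an anti-join on the SSBS identifier (a sketch using {dplyr}; it flags sites missing from either dataframe, but not rows whose values merely differ):

if (nrow(summaries) > 0) {
  # Sites in the official summaries but absent from our replication,
  # and vice versa.
  print(dplyr::anti_join(summaries, summaries_rep, by = "SSBS"))
  print(dplyr::anti_join(summaries_rep, summaries, by = "SSBS"))
}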

Accessing descriptions of the columns in PREDICTS

As we've seen already, there are 67 (!) columns in the PREDICTS database. Within the NHM Data Portal releases, we have included a description of the data used in each of these columns, which you can access via the GetColumnDescriptions function:

descriptions <- predictsr::GetColumnDescriptions()
str(descriptions)

So this includes the Column name, the resolution of the Column (Applies_to), whether it is in the PREDICTS extract (Diversity_extract), whether it is in the site-level summaries (Site_extract), the datatype of the Column (Type), whether it is guaranteed to be non-empty, any additional Notes, and any information on the range of values that the Column may be expected to take (Validation).
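
For example, you can pull out the full description for a single column:

if (nrow(descriptions) > 0) {
  # Show every descriptive field for the Measurement column.
  print(subset(descriptions, Column == "Measurement"))
}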

Glossary

To clarify some of the PREDICTS jargon, we include the following definitions, drawn from the column descriptions, for the dataframes we have worked with thus far:

if (nrow(descriptions) > 0) {
  # Select the identifier and measurement columns we have used so far.
  descriptions_sub <- subset(
    descriptions,
    Column %in% c(
      "Source_ID", "SS", "Block", "SSBS", "Diversity_metric_type", "Measurement"
    )
  )

  for (i in seq_along(descriptions_sub$Column)) {
    # Print each column as a markdown list item with its datatype.
    cat(
      paste0(
        "* `",
        descriptions_sub$Column[i],
        "` (`",
        descriptions_sub$Type[i],
        "`): "
      )
    )
    # Convert the markup in the Notes field into nested markdown list items.
    notes <- descriptions_sub$Notes[i] |>
      (\(s) gsub("\\*", "  -", s))() |>
      (\(s) gsub("\n\n", "  \n", s))() |>
      (\(s) gsub("\n", "  \n", s))()

    # Empty notes get a placeholder.
    if (notes == " ") {
      notes <- "As title."
    }

    cat(notes)
    cat("  \n")
  }
}

Notes: When we refer to a "study", we typically refer to it by the SS that identifies it. An "extract" simply refers to a PREDICTS database release, either from 2016 or 2022 (or both). For complete documentation, see the supplementary information in Hudson et al. (2017).

Notes

The NHM Data Portal API has no rate limits, so be considerate with your requests. Make sure you save the data somewhere, or use a tool like {targets} to save you from re-running workflows.
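
A minimal caching pattern (a sketch; the file name is arbitrary) is to save the dataframe to disk and only hit the API when no local copy exists:

# Hypothetical local cache: download once, then reuse the saved copy.
cache_file <- "predicts.rds"
if (file.exists(cache_file)) {
  predicts <- readRDS(cache_file)
} else {
  predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
  saveRDS(predicts, cache_file)
}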

References

Hudson, Lawrence N., et al. "The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts." Ecology and Evolution 4.24 (2014): 4701-4735.

Hudson, Lawrence N., et al. "The database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) project." Ecology and Evolution 7.1 (2017): 145-188.


