In dinilu/EPDr: Interact with the European Pollen Database (EPD)

knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 7)

In a previous tutorial, we have illustrated how to use the EPDr package to work with data of single entities (e.g., sediment cores) in the European Pollen Database (EPD). However and as in many other databases from international efforts, the spatial extent of the EPD makes it suitable for integrative studies where data from multiple entities can be analyzed together. EPDr does not provide specific objects to store such data, but this can be easily accomplished in R using list objects and functions in the lapply family. EPDr does, however, provide some functions to work on list of epd.entity.df objects. Here, we ilustrate how to combine EPDr with list objects and lapply functions, and how to use the EPDr functions that work on lists of objects.

Searching for entities

The very first thing is to load the library and stablish a valid connection to the database server. Remember that this need to be done for each session we start, and that it automatically closes the connection if we re-start R session and/or Rstudio.

epd.connection <- connect_to_epd(database = "epd",
                                 user = "epdr",
                                 password = "epdrpw")

epd.connection <- connect_to_epd(database = "epd",
                                 user = "epdr",
                                 password = "epdrpw", 
                                 host = "rabbot19.uco.es")

Then, you need to identify the ids for the entities of interest (i.e., e_). Here, the list_ functions, and specially list_e, are your friends. For instance, if you want to work in a specific country. First, you can check which countries are represented in the EPD.

list_countries(epd.connection)

Once you know the exact spelling of the country names of interest, you can pass them as an argument to the list_e function to get the list of all entities ids that fall into your study area.

e_ids <- list_e(epd.connection,
                country = c("Spain", "Portugal"))
e_ids <- e_ids$e_

Note that list_ functions return a data.frame (table) from where we only need the entities ids. This is the reason we subset in the second line to extract just the numbers in the e_ column.

Alternatively, you could filter entities by any other criteria accepted by list_e check ?list_e for further details. On the contrary, if you were interested to work with the whole EPD, you could just query the whole entities ids without passing any searching criteria to the function, which would return all entities.

# e_ids <- list_e(epd.connection)
# e_ids <- e_ids$e_

Note that we did not run this example, and just provide it to acknowledge the posibility.

Retrieving data for multiple entities

Now, you can use the lapply function combined with get_entity to receive a list of epd.entity objects for all the entities of interest.

epd_all <- lapply(e_ids, get_entity, epd.connection)
class(epd_all)
class(epd_all[[1]])

Now, we can use the entity_to_matrices function to calculate the epd.entity.df objects from the list of epd.entity objects.

epd_all <- lapply(epd_all, entity_to_matrices)
class(epd_all[[1]])

Standardizing data across multiple entities

Because each entity may have been analyzed by different researcher (with different taxonomical/morphological criteria, interests, etc.), these data need to be standardized before any analysis can be carried out.

Filtering and discarding entities

EPDr provides several functions that work on list of EPDr objects (epd.entity or epd.entity.df) that facilitate that standardization. For instance, you might decide to not use entities with some restrictions on their use. The remove_restricted function look into all the objects in a list and return only those that are unrestricted.

length(epd_all)
vapply(epd_all, check_restriction, FUN.VALUE=logical(1))
epd_all <- remove_restricted(epd_all)
length(epd_all)

Note that we have lost five entities, that have restricted the use of the data.

Usually, you will also remove all the entities that have no ages for the biological counts. You can use here the remove_wo_ages function.

vapply(epd_all, check_defaultchron, FUN.VALUE=logical(1))
epd_all <- remove_wo_ages(epd_all)
length(epd_all)

Note that we have removed another 22 entities for which there are no ages.

Now, you may need to know what are exactly the entities you have lost in these filters. You can use the extract_e and vapply functions to get a list of the entities id numbers.

vapply(epd_all, extract_e, FUN.VALUE=numeric(1))

Selecting chronologies and standardizing the taxonomy

Becasue, Giesecke et al. [-@giesecke_2013] provides updated chronologies for a lot of entities, we want to be sure to use those chronologies when they are available. To do so, we need to change the default chronology in each object to 9999 (specific number for giesecke) when available.

epd_all <- lapply(epd_all, giesecke_default_chron)

Next, let assume we are interested in pollen data. Here, we need to combine again filter_taxagroups function with lapply.

epd_all <- lapply(epd_all, filter_taxagroups,
                  c("HERB", "TRSH", "DWAR", "LIAN",
                    "HEMI", "UPHE", "INUN"))

Taxagroups represented here are all different type of pollen from plants of different life-forms: HERB for herbs, TRSH for trees and shrubs, DWARF for dwarf shrubs, LIAN for lianas, HEMI for hemiparasitic, and UPHE for upland herbs, and INUN for indeterminables and unknowns.

Now, we can change each taxa name in the objects for the accepted name according to the EPD, using the combination of the lapply and taxa_to_accepted functions. You could also modify the taxonomy to the higher taxonomical level using taxa_to_highertaxa.

epd.taxonomy <- get_taxonomy_epd(epd.connection)
epd_all <- lapply(epd_all, taxa_to_acceptedtaxa, epd.taxonomy)

Finally, we want to all entities reflecting the same taxa. unify_taxonomy work on list of epd.entity.df objects. It looks for the taxa in all the objects and increment all datasets to reflect the same taxa in all. Taxa present in one entity but not in others are included with value = 0.

epd_all <- unify_taxonomy(epd_all, epd.taxonomy)

Standardizing counts to percentages

The following would be to transform raw pollen counts into pollen percentages. Here, again, we combine the lapply function with the counts_to_percentages function.

epd_all <- lapply(epd_all, counts_to_percentages)

Standardizing time of counts data (interpolation or averaging)

Because each entity in our list will have samples at different ages, if we need to run an analysis with sites refering to the same time period, we need to standardize the ages across all epd.entity.df objects in the list. Here, you can use lapply combined with interpolate_counts or intervals_counts depending on the objectives of your study. Here, we will illustrate the use of interpolate_counts.

epd_all <- lapply(epd_all, interpolate_counts, seq(0, 22000, by = 1000))
epd_all[[2]]@commdf@counts[, 1:7]

Note that here we request to get a pollen concentration for every 500 years since 22000 to 0 calibrated years BP. For those entities with data partially covering this range, the data falling outside of the range are filled with NA to acknowledge that this periods are outside of the real data range.

Calculating data quality [@blois_2011]

Counting data have at least one source of uncertainty associated: the age estimated for each sample comes from an age-depth model fit using control data (e.g., radiocarbon data or any other geochronological information). In the case of interpolated data, the uncertainty have a second dimension, depending on how far (in time) we are interpolating countings from a real sample. [@blois_2011] proposed an index to summarize these two uncertainties. See ?blois_quality and Blois et al. [-@blois_2011] for details. EPDr provides a function to calculate data quality according to the Blois index (blois_quality) on epd.entity.df objects.

epd_all[[1]]@agesdf@dataquality
epd_all <- lapply(epd_all, blois_quality)
epd_all[[1]]@agesdf@dataquality

The basic interpretation of the index is that closer to 1 is high quality data, whereas close to 0 is poor quality data.

Plotting and preparing output data

When all standardizations have been accomplished, the data can be prepared for further analysis by preparing tables or can be plotted in the form of maps, for instance.

In the first version of the package, EPDr have a function called table_by_taxa_age that allow to search in a epd.entity.df object and extract the data for a particular taxa and sample (by the sample label).

epd_tables <- lapply(epd_all,
                     table_by_taxa_age,
                     c("Quercus"), c("1000", "2000"))
epd_tables[[1]]

Note that this function returns a list of tables with the entity number, coordinates for that entity (londd and latdd), count, sample_label and taxa. You can combine all those tables into a single one, by

epd_table <- do.call(rbind, epd_tables)
epd_table

This function is also used internally in the map_taxa_age function, which use the data prepared in this way to plot a map of the pollen counts for a particular taxa in a particular time period. In this case, the map_taxa_age was designed to work on list of epd.entity.df objects. Passing the list of objects, the desired taxa name, and the desired age label of interest will produce a ggplot map with pollen counts.

map_taxa_age(epd_all, "Pinus", "1000")

Note that the function automatically recognize that counts are in percentages, and properly acknoledge that in the legend. If the counts were not converted into percentages, the function will plot differently to accomodate this sort of data. Note also how entities without data for that particular time period are plotted with NA (small dark grey dots).

While other functions ignore multiple taxa names or entity ids, the map_taxa_age function accept multiple taxa names, but not sample labels. When multiple taxa names are provided, the plot represent the sum of all those taxa passed to the function. This can be usefull when taxonomy has not been standardized or we are interested in data at higher taxonomical levels than those recognized by the EPD taxonomy.

pinus <- c("Pinus", "Pinus cembra-type", "Pinus halepensis-type",
           "Pinus pinaster", "Pinus sylvestris-type")
map_taxa_age(epd_all, pinus, "1000")

Note how some entities show now a higher value when putting all Pinus species in the same map.

The function also allow for 'on the fly' conversion to presence absence maps, by using a threshold to transform pollen counts (or percentages) into presence-absence data.

map_taxa_age(epd_all, pinus, "1000", pres_abse = T, pollen_thres = 0)

Note how the legend of the map acknoledge the use and value of a threshold. Here, the default value is zero.

map_taxa_age(epd_all, pinus, "1000", pres_abse = T, pollen_thres = 1)

Note the differences in the map when changing the pollen threshold from the default (zero) to 1, and how this change is acknoledge in the legend.

The function allows for many other configuration on dots colors, sizes, etc. It also allow to zoom into particular areas using coordinates. For a full description on all these map tunning arguments check ?map_taxa_age.