
In this vignette we illustrate the use of the EPDr package, from installing the package to retrieving data from a local server of the European Pollen Database (EPD), and show how to use those data to perform some basic analyses and produce the most common plots in paleopalynology, such as palynograms or maps.

Installing the EPDr package

Users can choose to install the latest development version from GitHub or the latest released version from CRAN.

The development version from GitHub can be installed using the install_github function from the devtools package.

library(devtools)
install_github("dinilu/EPDr", force = TRUE)

Alternatively, the latest released version can be installed from CRAN using the usual install.packages function.

install.packages("EPDr")

Setting up the EPD database

Before moving on with this vignette, it is important that the user has access to a running PostgreSQL server with the EPD database, since this is where the data are going to be pulled from. Detailed, step-by-step instructions to set this up on a local host are provided in a dedicated vignette distributed with the EPDr package. If you have not done so yet, check that vignette with the following code.

vignette("EPD-PostgreSQL", package = "EPDr")

Loading the EPDr package

As with any other R package, EPDr is loaded into the working environment with the library function.

library(EPDr)

Connecting to the EPD database

The EPDr package uses the RPostgreSQL package to establish a connection between the R session and the EPD database in the PostgreSQL server. Although the connection can be made to an online server, we assume most users do not have the expertise to set one up, and are thus running the server and the R session on the same computer (localhost configuration).

To establish the connection we use the connect_to_epd function, which requires the following arguments:

database: the database name. No default database name is specified in the function.
user: the user name, i.e., a user with privileges to access the database. No default user name is specified in the function.
password: the user's password to access the database.
host: the address of the server. "localhost" is the default value, indicating that the server is running on the local computer.

If the function is called with no arguments, connect_to_epd(), it assumes the server is running on the local computer ("localhost") and interactively asks for the remaining arguments needed to connect to the database: database, user, and password.

epd.connection <- connect_to_epd(database = "epd",
                                 user = "epdr",
                                 password = "epdrpw")
epd.connection <- connect_to_epd(database = "epd",
                                 user = "epdr",
                                 password = "epdrpw", 
                                 host = "rabbot19.uco.es")

In case you have access to a remote EPD server, you only have to provide the online address of the server using the host argument.

epd.connection <- connect_to_epd(database = "epd",
                                 user = "epdr",
                                 password = "epdrpw",
                                 host = "http://remote.epd.server")

Now we are ready to check the connection with some of the functions in the RPostgreSQL and/or DBI packages. For instance, we can use the dbListTables function to get a list of all the tables in the database. The function should return the names of 45 tables.

library(DBI)
dbListTables(epd.connection)
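
We can also run an arbitrary SQL query through the same connection to check that the database responds; for instance, counting the rows in one of the tables (a minimal sketch; the table name entity is an assumption, check it against the dbListTables output):

# Count the entities registered in the database (the table name "entity"
# is an assumption; check it against the dbListTables output)
dbGetQuery(epd.connection, "SELECT count(*) FROM entity")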

Querying data from the database

The EPDr package provides several ways to retrieve data from the database: listing information from different fields in the database with list functions, or extracting specific data for a particular entity (pollen core, polster sample, etc.) with get functions.

Listing fields from the EPD (list functions)

For users unfamiliar with the database it might be useful to list elements in the database, for instance to know which countries are represented or which publications are associated with the entities in the database. To do so, the package provides a series of list functions that return specific tables from the database in which the relevant information can be consulted.

Some of these functions (list_countries and list_taxagroups) only accept one argument (connection), which must be a valid connection to the EPD database.

list_countries(epd.connection)
list_taxagroups(epd.connection)

However, other list functions accept several arguments to further filter the result of the query. For instance, list_regions accepts an extra argument (country) to limit the query to particular countries.

list_regions(epd.connection)
list_regions(epd.connection, country = "Spain")

Each function has its own relevant parameters. For instance, list_taxa accepts the argument group_id, which refers to the taxa groups from list_taxagroups.

list_taxa(epd.connection)
list_taxa(epd.connection, group_id = "HERB")

Other list functions accept more arguments. Sites, for instance, can be listed by specifying the country and region of interest, but also a set of four geographical coordinates (longitude and latitude) encompassing the area of interest (xmin, xmax, ymin, ymax).

list_sites(epd.connection)
list_sites(epd.connection, country = "Spain", region = "Andalucia")
list_sites(epd.connection, coords = c(-4, 10, 36, 40))

Biological counts and dating data in the EPD are referred to entities. This accounts for the fact that multiple samples can be taken at the same site: for instance, several cores drilled in the same lake, or a core and a surface sample (moss polster or pollen trap) collected in a peat bog.

Entities can thus be listed according to many different criteria: site name or site ID (site), geographical coordinates (coords), last name of the author (lastname), first name of the author (firstname), initials of the author (initials), publication number (publ), country (country) and region (region), and restrictions on the use of the data (restrictions). If multiple criteria are used, the user can choose a logical operator to control for an additive (AND) or alternative (OR) combination of the criteria.

list_e(epd.connection)
list_e(epd.connection, site = "Adange")
list_e(epd.connection, lastname = "Tzedakis")
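
Several criteria can also be combined in a single call; for instance, reusing the country and region values from the list_sites example above:

list_e(epd.connection, country = "Spain", region = "Andalucia")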

When using any of these list functions with arguments to filter the query, we can specify multiple values. For instance, we can pass a vector of names as the country argument to the list_e function to get information about all the entities in all those countries.

list_e(epd.connection, country = c("Spain", "Portugal", "France",
                                   "Switzerland", "Austria", "Italy",
                                   "Malta", "Algeria", "Tunisia",
                                   "Morocco", "Atlantic ocean",
                                   "Mediterranean Sea"))

Finally, the list_publ function allows searching for the publications we need to cite if we are going to use any of the data sets in the database. It can filter publications by several criteria, including the entity number (e_).

list_publ(epd.connection)
list_publ(epd.connection, e_ = 1)

Retrieving information for particular entities (get functions)

So far we have looked up metadata (details on the country, region, or site to which entities in the database belong), but most of the time we are interested in extracting the biological information of particular entities, along with the chronological information that allows us to date it. The EPDr package provides a set of get functions specifically designed for this task.

All these get functions have been designed to query data for a particular entity in the database. Hence, most of them (except get_taxonomy_epd and a few others) accept two arguments: the entity ID (e_) and a valid connection to the EPD database (connection). Note that entities must be referred to by their ID number and not by their sigle. If you only know the sigle of a particular entity, use the list_e function to look up the e_ number for that sigle, as in the sketch below.
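
A minimal sketch of looking up the e_ number of an entity from its site name (the column name e_ in the returned table is an assumption; inspect the output of list_e to confirm it):

# Find the entity IDs associated with the site "Adange"
adange.entities <- list_e(epd.connection, site = "Adange")
adange.entities$e_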

These functions are divided into two groups. The first group comprises a set of functions that return single data frames resembling the tables in the EPD, with the information restricted to the queried entity (e_). All these functions are named .get_ followed by a reference to the table they query. Because the standard user will rarely call these functions directly, we do not provide examples here; they work under the hood in the other get_ functions.

The second group of functions, named get_ followed by a short descriptor, retrieve related information for an entity from different tables (using the .get_ functions). Although most users will not call these functions either, it is good to become familiar with the data structures they return, because the same structures appear in other, more commonly used functions and the objects those return. These functions are get_ent, get_site, get_geochron, get_chron, and get_samples. Because they combine information from different tables, they return special S4 class objects designed for the package, with the different tables stored in different slots.

For instance, get_ent retrieves the information that describes the queried entity. It returns a specific object containing several data frames, each corresponding to one table in the EPD related to the description of the entity.

ent.1 <- get_ent(1, epd.connection)
class(ent.1)
slotNames(ent.1)

Similarly, get_site returns a new object that has several tables describing the site where the entity was sampled.

site.1 <- get_site(1, epd.connection)
class(site.1)
slotNames(site.1)

The get_geochron function returns all the geochronological data for a particular entity: a description of the geochronological samples (sample ID, depth, thickness, etc.) as well as the details of the analytical results from different techniques (radiocarbon, amino acid racemization, etc.).

geochron.1 <- get_geochron(1, epd.connection)
class(geochron.1)
slotNames(geochron.1)

In turn, get_chron retrieves all the data relative to the chronologies stored in the EPD for a particular entity.

chron.1 <- get_chron(1, epd.connection)
class(chron.1)
slotNames(chron.1)

Similarly, get_samples retrieves information from all tables about biological samples for the entity.

samples.1 <- get_samples(1, epd.connection)
class(samples.1)
slotNames(samples.1)

On top of these functions there is a special one (get_entity), which in turn retrieves all the available information for an entity from the different tables and get_ functions. Most users will only need get_entity in their workflow, because the epd.entity object it returns includes all the objects from the previous get_ functions.

epd.1 <- get_entity(1, epd.connection)
class(epd.1)
slotNames(epd.1)

As you can see in the previous code, epd.1 is an epd.entity object with 10 different slots. The first five slots contain some basic information: the entity number that was retrieved (e_), the post-bomb zone the entity belongs to (postbombzone), the number of chronologies available for the entity in the EPD (numberofchron), whether the entity has dating information in Giesecke et al. [-@giesecke_2013] (isingiesecke), and which one is the default chronology (defaultchron) used to look up ages in the EPD. The object then has five slots (entity, site, geochron, chron, and samples) that store the information from the previous get_ functions. Hence, epd.entity objects store all the available information for a particular entity in the EPD. This is the basic object in the package, and most methods work on this object or on the epd.entity.df object that expands it (we will see that later). Each slot can be inspected using the @ operator or the slot function.

epd.1@e_
slot(epd.1, "e_")
epd.1@postbombzone
epd.1@numberofchron
epd.1@isingiesecke
epd.1@defaultchron
slotNames(epd.1@entity)
slotNames(epd.1@site)
slotNames(epd.1@geochron)
slotNames(epd.1@chron)
slotNames(epd.1@samples)

Some of the relevant information can also be queried with specific functions from the EPDr package. In general, these functions are named check_ plus a descriptor of the information they check in the epd.entity object. For instance, check_restriction looks into the specific slots and tables that contain information on the use restrictions applying to the data set, and returns TRUE or FALSE accordingly. Similarly, check_defaultchron returns a logical (TRUE or FALSE) indicating whether the entity has a default chronology.

check_restriction(epd.1)
check_defaultchron(epd.1)

Working with data from the EPD

Once users have found the data set they want to work with and have downloaded the data for that particular entity, there are several things they can do: export the data for (re-)calibration in CLAM or BACON, apply several transformations and standardizations, and, finally, rearrange the data for further analysis or plotting.

Exporting data for CLAM or BACON

The EPDr package provides several functions to reformat and export data so that they comply with the format requirements of CLAM or BACON. These are the export_ functions, of which there are five: export_c14, export_agebasis, export_events, export_depths, and export_entity.

export_c14 exports data from the C14 and GEOCHRON tables, which store information on samples and radiocarbon dates. export_c14 requires a character string indicating the output format ("clam" or "bacon"), the C14 table (x), and the GEOCHRON table (y). Alternatively, it can be passed just an epd.entity object (with x and y missing), and the function will look up both tables in it automatically.

export_c14("clam", epd.1@geochron@c14, epd.1@geochron@geochron)
export_c14("bacon", epd.1@geochron@c14, epd.1@geochron@geochron)
export_c14("clam", epd.1)

Another function (export_agebasis) exports the AGEBASIS table, which stores the information used to calibrate the different chronologies. export_agebasis also takes as arguments a character string with the output format, and either an AGEBASIS table or an epd.entity object in which to look for it.

export_agebasis("clam", epd.1@chron@agebasis)
export_agebasis("bacon", epd.1)

The export_events function reformats and exports the data stored in the EVENTS and SYNEVENT tables, which refer to geological events registered in the entity (e.g., tephra layers). Similarly to export_c14, export_events requires two tables (SYNEVENT and EVENT) or an epd.entity object in which to look for them.

export_events("clam", epd.1@chron@synevent, epd.1@chron@event)
export_events("bacon", epd.1)

Because depths have the same output format in CLAM and BACON, export_depths does not require the output format to be specified; it only needs a PSAMPLES table, or an epd.entity object in which to look for it, to export the depths of the biological samples.

export_depths(epd.1@samples@psamples)
export_depths(epd.1)

Finally, export_entity is a wrapping function that sequentially calls all the previous export_ functions and combines the information to prepare a set of data ready for CLAM or BACON. Because some data might be duplicated or in conflict (especially between AGEBASIS and C14), export_entity has several arguments to control how to resolve these conflicts. If those arguments are not provided, the function runs interactively and requests input from the user to resolve the conflicts along the way. export_entity requires the output format and an epd.entity object in which to look for all the necessary data.

export_entity("clam", epd.1)

##  Chronology has coincident data with C14 data and, hence, the later will
##  be used
##  C14 data:
##  lab_ID  C14_age cal_age error   reserv. depth   thickn.
##  KIGI-350     910    NA   20 NA   83  5  
##  KIGI-349    2420    NA  200 NA  118  5  
##  KIGI-348    2900    NA  190 NA  140 10  
##  
##  Chronology data:
##  lab_ID  C14_age cal_age error   reserv. depth   thickn.
##  E1_CH1_S2    910    NA  1   NA   83 NA  
##  E1_CH1_S3   2420    NA  1   NA  118 NA  
##  E1_CH1_S4   2900    NA  1   NA  140 NA  
##  
##  Chronology has additional no-C14 data.
##  Chronology data:
##  lab_ID  C14_age cal_age error   reserv. depth   thickn.
##  E1_CH1_S1      0    NA  1   NA    2 NA  
##  E1_CH1_S5   4000    NA  1   NA  200 NA
##  
##  Incorporate these data to the chronology? (Yes: TRUE then Intro,
##  No: FALSE then Intro)TRUE
##  
##        lab_ID C14_age cal_age error reservoir depth thickness
##  11 E1_CH1_S1       0      NA     1        NA     2        NA
##  1   KIGI-350     910      NA    20        NA    83         5
##  2   KIGI-349    2420      NA   200        NA   118         5
##  3   KIGI-348    2900      NA   190        NA   140        10
##  5  E1_CH1_S5    4000      NA     1        NA   200        NA

Besides returning the data frame in the right output format, the function writes files to the working directory. It creates a different folder structure for CLAM and BACON files and then stores the files with the entity number (e_) and the appropriate extensions. If you are running the examples, just check your working directory (getwd()).
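
For instance, after running the export you can list the files created under your working directory (the exact folder names depend on the output format chosen):

# Inspect the files and folders written by export_entity
list.files(getwd(), recursive = TRUE)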

Standardizing data

Because the data stored in the EPD are in their original format with minimal standardization, the user will need to run several manipulation and standardization steps depending on the goal of the analysis. The EPDr package provides a set of functions to run some of these standardizations on top of epd.entity objects.

Reformat data into matrices (epd.entity.df)

As already explained, epd.entity objects store a copy of the EPD tables for a specific entity. Hence, the data format is the same as in the PostgreSQL database, which is very convenient for database management and optimization, but not so much for data visualization and interpretation. The entity_to_matrices function transforms several tables of the epd.entity object into matrices that are easier to interpret in normal use. For instance, ages are returned in the form of a samples x chronologies matrix, while counts of biological particles are transformed into samples x taxa matrices. In order to keep a backup copy of the original data alongside a copy of the data in matrix format, the function returns an object of class epd.entity.df. This object inherits all the properties (and data) of epd.entity objects and expands them with new slots: @countstype, @countsprocessing, @taxatype, @taxaprocessing, @samplesdf, @agesdf, @commdf, and @nopodf. This way, the new object always has a backup copy of the original data and a copy of the transformed data in a handier format.

epd.1 <- entity_to_matrices(epd.1)
slotNames(epd.1)

The first four slots track information on data modifications and standardizations applied by other functions (see below).

epd.1@countstype
epd.1@countsprocessing
epd.1@taxatype
epd.1@taxaprocessing

The last four slots store the data in matrix (or data frame) format. @samplesdf has information on the biological sample IDs and labels.

slotNames(epd.1@samplesdf)

@agesdf stores information on the depths, ages, and data quality of the samples.

slotNames(epd.1@agesdf)

@commdf stores taxa names, taxa IDs, accepted taxa IDs, higher taxa IDs, taxa groups, and the counts for each taxon.

slotNames(epd.1@commdf)

Finally, @nopodf stores the same information as @commdf but for non-biological particles in the samples (e.g., added particles, total sums, etc.).

slotNames(epd.1@nopodf)

Navigating taxonomy in the EPD

Using the get_taxonomy_epd function, the user can get the complete taxonomy table of the EPD. This table is useful for consultation, but it is also required by other functions that modify the taxonomy in certain tables of epd.entity.df objects.

epd.taxonomy <- get_taxonomy_epd(epd.connection)
epd.taxonomy

The previous table holds, for each variable in the EPD, several pieces of information useful for correcting, modifying, or standardizing the taxonomy of an entity, most importantly the columns accvar, mhvar, and groupid. accvar indicates the var_ number of the accepted taxon for that variable. This is relevant because the data stored in the EPD are those reported by the original authors; this column therefore allows unifying the taxonomy across entities from different authors, regions, and/or time periods, in which identification criteria may have differed. The taxa_to_acceptedtaxa function modifies all the information in the @commdf slot, changing each taxon to the accepted taxon according to the accvar column of the taxonomy table.

epd.1@commdf@taxanames
epd.1 <- taxa_to_acceptedtaxa(epd.1, epd.taxonomy)
epd.1@commdf@taxanames

Check how the two taxanames lists have changed.

Another column in the taxonomy table of the EPD (mhvar) holds the var_ number of the next higher taxonomic level for each variable, allowing each variable to be switched to the higher taxonomic variable (or name).

epd.1@commdf@taxanames
epd.1.ht <- taxa_to_highertaxa(epd.1, epd.taxonomy)
epd.1.ht@commdf@taxanames

Check how 'Castanea sativa' has changed to 'Castanea'.

These two functions account for the fact that the original data can include two or more variables (taxa) that, after standardizing the taxonomy, may belong to the same taxon. In this case the functions sum the counts of the respective variables and modify @commdf@counts in the object.

Note in the previous example how the number of taxa has been reduced from 47 to 44.

rowSums(epd.1@commdf@counts)
rowSums(epd.1.ht@commdf@counts)

This is the only way the EPD allows taxonomy to be upscaled. The implementation is, however, quite limited, because it does not allow standardizing the taxonomy at a specific taxonomic rank (e.g., genus level). Further developments of the package may explore how to combine entity data with taxonomic databases, like those mined by packages such as taxize.

Filtering taxa

Despite the database's name, entities in the EPD can contain data from different taxonomic groups, not only pollen or spores. Some entities contain data for algae, acritarchs, etc. (check list_taxagroups for the whole list). Hence, users interested in a particular biological group have to filter out the data from the other groups. This is accomplished with the filter_taxagroups function, which takes two arguments. The first must be an epd.entity.df object (or an epd.entity object, which is automatically transformed into an epd.entity.df) to be filtered. The second is a character vector with the IDs of the groups to be retained. For instance, DWAR, HERB, LIAN, TRSH, UPHE, and INUN are all pollen groups, so if you want to work with pollen from upland plants you can run the following code:

epd.1 <- filter_taxagroups(epd.1, c("DWAR", "HERB", "LIAN",
                                    "TRSH", "UPHE", "INUN"))
epd.1@commdf@taxanames
rowSums(epd.1@commdf@counts)

Note that we have filtered out the taxa outside these groups, including the pteridophyte spores: "Botrychium", "Cryptogramma crispa", "Lycopodiaceae", "Huperzia selago", "Dryopteris-type", "Selaginella", "Trilete spore(s)", "Pterocarya fraxinifolia", "Fagus".

While the filter_taxagroups function uses the groups defined in the EPD, the user may want to run a more personalized filtering. To do so, filter_taxa allows filtering taxa directly by their names. This function allows great versatility, but it also requires great caution, because you need to specify the names exactly as they are spelled in the EPD, and thus in the epd.entity.df object.

epd.1.ft <- filter_taxa(epd.1, c("Alnus", "Artemisia", "Betula",
                                 "Carpinus betulus", "Corylus"),
                        epd.taxonomy)
head(epd.1.ft@commdf@counts)

Caution is needed because, if you specify a taxon that is not in the epd.entity.df object, the function will create a new column with NA values. This is intentional, because when working with multiple entities you may still need to unify their taxonomy. However, if you misspell a name, the intended data will be dropped and a new column with the wrong name will be created.

epd.1.ft <- filter_taxa(epd.1, c("Aluns", "Artemisia", "Betula",
                                 "Carpinus betulus", "Carylus"),
                        epd.taxonomy)
head(epd.1.ft@commdf@counts)

Note how we misspelled "Alnus" and "Corylus": the intended data were dropped, and new columns with the misspelled names were created with NA values in them.

Calculating percentages

Once the data are reduced to the taxa of interest, the raw counts stored in the EPD can be transformed into percentages using the counts_to_percentages function. This function only requires one argument, the epd.entity.df object, whose counts will be transformed into percentages relative to the total sum of particles in each sample. It is thus important to first filter the taxa down to the groups of interest to get meaningful percentages.

The function also modifies the slot @countstype to reflect that the object now holds percentages rather than raw counts.

epd.1@countstype
epd.1 <- counts_to_percentages(epd.1)
epd.1@countstype
head(epd.1@commdf@counts)

Now the sum of the counts should be 100 for all samples.

rowSums(epd.1@commdf@counts)

Using updated ages from Giesecke et al. [-@giesecke_2013]

In a publication from 2013, Giesecke et al. [-@giesecke_2013] updated the chronologies (age-depth models) for more than 800 entities in the database. Although the database itself was not updated with these new chronologies, all the information was released in four tables in the Pangaea data repository, under a Creative Commons Attribution 3.0 Unported license. The EPDr package provides a copy of these tables and imports the data from some of them into the epd.entity and epd.entity.df objects. When Giesecke et al. [-@giesecke_2013] provide a chronology for an entity, the slot @isingiesecke is TRUE, and FALSE otherwise. However, the default chronology for each entity is recorded as in the original EPD. So, if you want to use the ages estimated by Giesecke et al. [-@giesecke_2013], you need to modify the slot @defaultchron. Most entities in the EPD have one or a few chronologies, so @defaultchron is usually a number between 1 and 3, unless the entity has no chronologies at all, in which case it is 0 or NA. To indicate that other functions should use the ages estimated by Giesecke et al. [-@giesecke_2013], @defaultchron has to be set to 9999.
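
In practice, this means the slot can be set by hand (a minimal sketch; the giesecke_default_chron function described next performs the necessary check for you):

# Manually point the default chronology to the Giesecke et al. ages
# (only meaningful if epd.1@isingiesecke is TRUE)
epd.1@defaultchron <- 9999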

We provide a function, giesecke_default_chron, that automatically checks whether the entity has updated ages in Giesecke et al. [-@giesecke_2013] and, if so, updates @defaultchron to 9999.

epd.1@defaultchron
epd.1@numberofchron
epd.1@isingiesecke
epd.1 <- giesecke_default_chron(epd.1)
epd.1@defaultchron

Note that @numberofchron reflects the total number of chronologies in the EPD, without including those in Giesecke et al. [-@giesecke_2013].

Interpolating count data to specific time periods

Each EPD entity has count data for samples at different depths, which, combined with the different stratigraphy of each site (e.g., differential sedimentation rates among lakes), means that the counts do not correspond to any particular, pre-established ages. Sometimes the user may need to estimate the count data for particular time periods (this can be important when comparing data from multiple entities and sites, which we will cover later). The interpolate_counts function allows interpolating the counts in an epd.entity.df object (if an epd.entity object is provided, it is first transformed into an epd.entity.df) to the particular time periods specified by the user. The interpolation can be accomplished with three different methods: "linear", "loess", and "smooth spline".

The function takes two mandatory arguments: x, an epd.entity.df object to be interpolated, and time, a numeric vector indicating the ages to which the counts are interpolated. The other arguments are:

chronology: the chronology used to look up the ages of the actual samples (if not provided, the function uses the default chronology specified in x).
method: the interpolation method to be used.
rep_negt: TRUE or FALSE, indicating whether negative interpolated values should be replaced by 0.
Further arguments (e.g., span, df, and ...) to control the loess and smooth spline interpolation methods.

In addition to calculating the interpolated counts, the function modifies several elements of the returned object, as we will see below.

epd.1.int <- interpolate_counts(epd.1,
                                c(0, 1000, 2000, 3000, 4000, 5000),
                                method = "linear")

In the first place, the counts in the new epd.entity.df object now reflect those estimated for the desired time periods.

epd.1@commdf@counts[, 1:5]
epd.1.int@commdf@counts[, 1:5]

Hence, the slot @countsprocessing is also modified to reflect the fact that data have been interpolated.

epd.1@countsprocessing
epd.1.int@countsprocessing

Because the new ages are based on one of the chronologies, the function also modifies the tables with the ages (@depthages) and depths (@depthcm) to report the new ages (a single column, with the chronology on which they were based) and the estimated depths.

epd.1@agesdf
epd.1.int@agesdf

Note that the argument chronology was not specified, so the default chronology ("giesecke") was used here.

Accordingly, the sample IDs and labels are also modified.

epd.1@samplesdf
epd.1.int@samplesdf

Note that the sample IDs (samples_) now start from 10001 onwards, instead of from 1 (IDs starting at 1 are reserved for original, not interpolated, data). This can be used as an extra warning flag.
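
The additional arguments mentioned earlier allow tuning the non-linear methods. For instance, a sketch of a loess interpolation with a custom span (assuming span is passed through to the loess smoother):

epd.1.loess <- interpolate_counts(epd.1,
                                  c(0, 1000, 2000, 3000, 4000, 5000),
                                  method = "loess",
                                  span = 0.75,
                                  rep_negt = TRUE)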

Averaging count data over specific time ranges

Another approach to standardizing counts through time is calculating average counts over particular time ranges (e.g., mean Cedrus pollen concentration between 1000 and 2000 cal BP). This can be carried out with the intervals_counts function. The function is quite similar to interpolate_counts, but it requires two arguments to specify the time ranges (or intervals): the first (tmin) is a numeric vector with the starting values of the intervals, and the second (tmax) is another numeric vector with the ending values. It also requires an epd.entity.df object (argument x) and a chronology number (argument chronology), and accepts a vector of labels for the new samples (argument newlabels). If chronology is not specified, the function uses the default chronology for the entity, as specified in x@defaultchron. If newlabels is not specified, the function creates new labels from the range definitions.

Similarly to interpolate_counts, the new epd.entity.df object returned by the function is modified in several ways. In the first place, the counts now reflect those estimated for the desired time ranges.

epd.1.ran <- intervals_counts(epd.1,
                              c(0, 1000, 2000, 3000, 4000, 5000),
                              c(999, 1999, 2999, 3999, 4999, 5999))
epd.1.ran@commdf@counts[, 1:5]

Compare these values with the sample percentages above, or with those interpolated for particular time periods.

The slot @countsprocessing is also modified to reflect the fact that the data have been averaged over time intervals.

epd.1.ran@countsprocessing

Because the new ages are based on one of the chronologies, the function also modifies the tables with the ages (@depthages) and depths (@depthcm) to report the new ages (a single column, with the chronology on which they were based) and the estimated depths.

epd.1.ran@agesdf

Note that the argument chronology was not specified, so the default chronology ("giesecke") was used here. Also note that the ages do not exactly match the interval midpoints; they represent the average age of the samples used to calculate the average counts for each time range. The same happens with the depths of the samples.

Accordingly, the sample IDs and labels are also modified.

epd.1.ran@samplesdf

Note that the sample IDs (samples_) now start from 20001 onwards, instead of from 1 (IDs starting at 1 are reserved for original data). This can be used as an extra warning flag. Note also the sample labels indicating the corresponding ranges.
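
Finally, custom labels can be supplied for the new interval samples through the newlabels argument (a sketch; the labels are illustrative):

epd.1.ran2 <- intervals_counts(epd.1,
                               tmin = c(0, 1000),
                               tmax = c(999, 1999),
                               newlabels = c("0-1ka", "1-2ka"))
epd.1.ran2@samplesdf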

Calculating data quality following Blois et al. [-@blois_2011]

Because the dating of count samples relies on age-depth models fitted with radiocarbon ages from dating samples, there is some uncertainty about the exact ages the counts belong to. Furthermore, if counts are interpolated or mean values are calculated over time intervals, the uncertainty of the counts increases because of the interpolation methods. Blois et al. [-@blois_2011] proposed an algorithm to calculate an uncertainty (or quality) index. This method basically measures the distance (in time) from each count to the closest control point in the chronology and from each count to the closest count sample. These two distances are standardized by certain thresholds, which can be modified. Finally, the complements of these standardized distances are averaged to calculate the final quality index.
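
The following minimal sketch (illustrative only, not the package's internal code) shows the logic of the index for a single sample:

# Standardize the time distances by their thresholds, take the complements,
# and average them to get the quality index (defaults from Blois et al.)
quality_index <- function(dist_sample, dist_control,
                          max_sample_dist = 2000, max_control_dist = 5000) {
  q_sample <- 1 - pmin(dist_sample / max_sample_dist, 1)
  q_control <- 1 - pmin(dist_control / max_control_dist, 1)
  (q_sample + q_control) / 2
}
quality_index(dist_sample = 500, dist_control = 1500)  # 0.725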

The blois_quality function implements this calculation for epd.entity.df objects. The function takes the following arguments:

x: an epd.entity.df object on which to make the calculations.
chronology: a number indicating the chronology to be used for the ages of the samples and the control points.
max_sample_dist: a number indicating the maximum distance to actual count samples, used to standardize the sample distances.
max_control_dist: a number indicating the maximum distance to geochronological control points, used to standardize the control distances.

If chronology is not specified, the function uses the default chronology from x. If max_sample_dist and max_control_dist are not specified, the default values are 2000 and 5000, respectively [@blois_2011].

epd.1 <- blois_quality(epd.1)
epd.1.int <- blois_quality(epd.1.int)
epd.1@agesdf@dataquality
epd.1.int@agesdf@dataquality
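
The thresholds can be tightened for a more demanding quality assessment (the values here are illustrative):

epd.1.strict <- blois_quality(epd.1, max_sample_dist = 1000,
                              max_control_dist = 3000)
epd.1.strict@agesdf@dataquality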

Extracting data in a table for specific taxa and age

Once the data are standardized in the desired manner, you can tabulate them for analysis or for export to other software. One of the functions that helps with this is table_by_taxa_age. The function requires the following arguments: x, an epd.entity.df object to extract the data from; taxa, a character vector with the taxa names of interest; and sample_label, indicating the samples to be exported.

table_by_taxa_age(epd.1, c("Pinus", "Quercus"), as.character(c(1:10)))
table_by_taxa_age(epd.1.int, "Quercus", c("1000", "2000", "3000"))
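
The resulting table can then be written to disk for use in other software (a sketch using base R; the file name is illustrative):

tab.1 <- table_by_taxa_age(epd.1, c("Pinus", "Quercus"),
                           as.character(1:10))
write.csv(tab.1, "epd1_pinus_quercus.csv", row.names = FALSE)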

Plotting pollen diagrams from data in epd.entity.df objects

One of the most useful tools in palynology is the pollen diagram, where particle concentrations are plotted against depth or age. We have implemented a function to plot pollen diagrams with ggplot, which allows for a great variety of customizations, semi-automatically using the information in the objects being plotted. Check ?plot_diagram.

plot_diagram(epd.1)
plot_diagram(epd.1.int)
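
Since the diagrams are built with ggplot, the returned object can presumably be stored and saved to disk with ggplot2::ggsave (an assumption about the return value; check ?plot_diagram):

library(ggplot2)
p.1 <- plot_diagram(epd.1)
ggsave("epd1_diagram.pdf", p.1, width = 30, height = 15, units = "cm")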

Note that the plots are shrunk to fit the tutorial width.

Disconnecting from the EPD server

Finally, although it is not mandatory, it is convenient to close the connection to the EPD server. That is accomplished with the disconnect_from_epd function, which only requires one argument: a valid connection object to the database.

disconnect_from_epd(epd.connection)

References


