# Introduction to 'traitdataform': Formatting and Harmonizing Ecological Trait-Data

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(traitdataform)

The package 'traitdataform' provides assistance for handling functional trait data and transferring them into the Ecological Trait-data Standard (ETS; Schneider et al. 2018, https://terminologies.gfbio.org/terms/ets/pages/, doi: 10.5281/zenodo.1485739).

There are two major use cases for the package:

• preparing your own trait datasets for upload into public databases, and
• harmonizing trait datasets from different sources by moulding them into a unified format.

The toolset of the package includes

1. transforming typical trait-data formats (e.g. species-trait-matrix or measurement-table data) into a unified long-table format and mapping column names into terms provided in the Ecological Trait-data Standard (ETS) (see section 1. Reading data),
2. mapping of trait concepts onto a user-provided trait list (i.e. a thesaurus of traits) or globally accessible URIs (see section 2. Standardize traits) and unifying units and factor levels,
3. mapping of species concepts onto globally accessible definitions via URIs (pointing to the GFBio taxonomic ontology server) (see section 3. Standardize taxa),
4. merging and handling compiled trait-data, while keeping track of the metadata for each original dataset (see section 4. Working with trait-datasets),
5. saving trait datasets into a desired format using templates (e.g. for project-specific databases or online repositories) (see section 5. Writing data).

This vignette contains step-by-step instructions for transferring your own data into a standardized trait-dataset for upload to public databases. See Schneider et al. (2019), 'Towards an Ecological Trait-data Standard', Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.13288, for a discussion of the rationale.

# 1. Reading data

## load data from source

The first step is to load your data into R. This can be your own data, read from file, or data published elsewhere, directly accessible via a URL. R knows many ways of getting your data into an R object.
In most cases you would read an object from a csv or txt file while maintaining the column headers.

carabids <- read.table("../../data/carabid traits final.txt", header = TRUE)

If reading files from a file repository, you can refer to the URL directly, e.g.

# pulling data from van der Plas F, van Klink R, Manning P, Olff H, Fischer M
# (2017) Sensitivity of functional diversity metrics to sampling intensity.
# Methods in Ecology and Evolution 8(9): 1072-1080.
# https://doi.org/10.1111/2041-210x.12728
carabids <- read.delim("https://datadryad.org/stash/downloads/file_stream/24267", stringsAsFactors = FALSE)

Most trait data are stored in one of the following two formats:

• species $\times$ trait matrix: a single account of a trait value for each species (in rows) for a couple of different traits (in columns). No replicates of species are reported. This is the most likely format for literature data, where aggregate measurements or facts for entire species have been collated into a single lookup table.
• observation wide table: in case of measured data, authors may report multiple raw measurements of different traits (in columns) taken from a single observation instance of a species, i.e. an individual (in rows). Repeated measures of the same trait might also be included as columns or pooled into average values. This is valuable for investigations of intra-specific variation, and also leaves space for filtering by co-factors or analyzing trait response along environmental gradients.

In both cases, additional information on the species or observation may be stored in further columns (e.g. the unit in which a value is reported, the literature source for this measurement or fact, or the date and geolocation of sampling), or in a separate data sheet linked via identifiers for trait, taxon, occurrence or sampling/measurement event.
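To make the two layouts concrete, here is a minimal base-R sketch with made-up values. The column names, the numbers and the helper `is_species_matrix()` are assumptions for illustration and are not part of 'traitdataform'; the helper only captures the idea that a species × trait matrix lists every taxon exactly once.

```r
# Toy species x trait matrix: one row per species, no replicates.
# All values below are invented for illustration.
species_matrix <- data.frame(
  species     = c("Carabus auratus", "Abax ovalis"),
  body_length = c(22.0, 13.5),
  eye_width   = c(1.1, 0.8),
  stringsAsFactors = FALSE
)

# Toy observation wide table: one row per measured individual,
# so species names may repeat.
observation_table <- data.frame(
  individual  = c("i1", "i2", "i3"),
  species     = c("Carabus auratus", "Carabus auratus", "Abax ovalis"),
  sex         = c("f", "m", "f"),
  body_length = c(22.4, 21.1, 13.5),
  stringsAsFactors = FALSE
)

# Hypothetical check: TRUE if no taxon occurs more than once.
is_species_matrix <- function(x, taxa) !anyDuplicated(x[[taxa]])

is_species_matrix(species_matrix, "species")
is_species_matrix(observation_table, "species")
```

The first call reports a species × trait matrix, the second an observation table, because "Carabus auratus" appears twice there.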
As the column names and the width of the table vary with the number of traits included, merging data from different sources requires user-defined mapping and manual harmonization of these structures.

A more effective format is the measurement long-table [@wickham14; @parr16; @kattge11a], where each row is reserved for a single measurement or fact of a specific trait. This allows repeated measurements on a single individual to be stored by linking the data from separate rows via a unique identifier for each individual (labelled occurrenceID). Multivariate trait measurements can also be recorded in this format by linking multiple rows via a unique measurement identifier. Long-table datasets offer multiple advantages for data manipulation (e.g. filtering, sub-setting and aggregating data), visualization (e.g. plotting measured values by factor variable or taxon) and statistical modelling (e.g. ANOVA for testing differences in trait value by sex) [@wickham14]. Each row of the dataset can therefore be interpreted as a statement of an 'entity x having a qualitative/quantitative feature y' [@garnier17; @schneider18]. As long-table formats draw from a defined set of columns, merging of datasets is much easier.

The function as.traitdata() provided in the package assists in transferring data into the measurement long-table format. For this function to work, it needs to know at least which columns of the original data contain trait values (parameter traits) and which column contains the taxonomic concept (parameter taxa).

dataset1 <- as.traitdata(carabids,
taxa = "name_correct",
traits = c("body_length", "antenna_length", "metafemur_length", "eyewidth_corr"),
units = "mm"
)

head(dataset1)

Note that in the output table the columns have been renamed according to the ETS. The essential columns are verbatimTraitName and verbatimTraitValue for the reported measurement or fact, as well as verbatimScientificName for the taxon concept.
The newly assigned column measurementID contains a running number for each individual trait measurement.

The function automatically interprets data as a species $\times$ traits matrix if the taxa column contains only unique entries and no duplicates. In case of multiple assignments to the same taxon, the function assumes an observation wide-table and creates a new column occurrenceID, which links measurements taken on the same individuals. Both occurrenceID and measurementID can be provided by the author using the parameters occurrences and measurements (each given as a column name or a vector).

if(!l10n_info()[["UTF-8"]]) {Sys.setlocale("LC_CTYPE", "en_US.UTF-8")}

encoding = "windows-1252"),
stringsAsFactors=TRUE)

# Data publication: M. Gossner, Martin; K. Simons, Nadja; Hoeck, Leonhard; W.
# Weisser, Wolfgang (2016): Morphometric measures of Heteroptera sampled in
# grasslands across three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1

dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID"
)

# show different trait measurements for same occurrence/individual
subset(dataset2, occurrenceID == "5" )


This allows the user to be explicit about the structure of the output data.
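The reshaping step can be pictured with a small base-R sketch. The helper `wide_to_long()` and the toy table are hypothetical; the sketch only borrows the ETS column names and is not how as.traitdata() is implemented. Dropping empty cells before numbering the measurements is a choice made for this sketch.

```r
# Hypothetical helper (not part of traitdataform): reshape a wide table
# into a measurement long table using ETS-style column names.
wide_to_long <- function(x, taxa, traits, occurrences = NULL) {
  n <- nrow(x)
  # Without explicit occurrence IDs, fall back to the row number.
  occ <- if (is.null(occurrences)) seq_len(n) else x[[occurrences]]
  long <- data.frame(
    verbatimScientificName = rep(x[[taxa]], times = length(traits)),
    occurrenceID           = rep(occ, times = length(traits)),
    verbatimTraitName      = rep(traits, each = n),
    verbatimTraitValue     = unlist(x[traits], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  # Drop empty cells, then give each remaining measurement a running number.
  long <- long[!is.na(long$verbatimTraitValue), ]
  long$measurementID <- seq_len(nrow(long))
  rownames(long) <- NULL
  long
}

# Toy observation table with invented values.
wide <- data.frame(
  ID          = c("a", "b"),
  SpeciesID   = c("Sp1", "Sp1"),
  Body_length = c(5.1, 4.8),
  Body_width  = c(2.0, NA),
  stringsAsFactors = FALSE
)

long <- wide_to_long(wide, taxa = "SpeciesID",
                     traits = c("Body_length", "Body_width"),
                     occurrences = "ID")

# All measurements taken on individual "a"
subset(long, occurrenceID == "a")
```

Because both rows share the same taxon, the occurrence identifier is what ties the two measurements of individual "a" together.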

## specify units

For a standardisation of quantitative trait data, the unit of measurement is essential. Often, this information is kept in the metadata descriptions. But for a standardised table containing measurements from different sources, this information should always accompany the measurement value. The ETS suggests the term verbatimTraitUnit to contain the original author's unit for each measurement in the data table.

The function as.traitdata() creates this column via its parameter units (see example above). The unit can be set for all traits at once (if all reported values share the same unit) or for each trait individually (if they use different measurement units or if the table comprises a mixture of quantitative and qualitative traits).
Accordingly, the parameter units takes a single character string, or a vector of character strings, containing valid entries as expected by the package 'units' [@pebesma16, https://github.com/r-quantities/units]. Examples are 'mm', 'm2' or 'm^2', 'm/s'.
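As a rough illustration of per-trait units, the following base-R sketch attaches a unit to every row of a long table by looking it up in a named vector. The trait names, the values and the use of NA for a trait without a unit are assumptions of this sketch; it does not use the 'units' package and does not show how as.traitdata() handles the parameter internally.

```r
# Hypothetical long table with two quantitative traits and one qualitative trait.
long <- data.frame(
  verbatimTraitName  = c("body_length", "body_mass", "wing_type"),
  verbatimTraitValue = c("12.5", "0.31", "macropterous"),
  stringsAsFactors = FALSE
)

# One unit per trait; NA marks a trait without a unit (assumed convention).
trait_units <- c(body_length = "mm", body_mass = "g", wing_type = NA)

# Look up the unit for every row by its trait name.
long$verbatimTraitUnit <- unname(trait_units[long$verbatimTraitName])
long
```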

The raw data might contain further information on the individuals or the trait measurement itself in further data columns that are valuable for later analysis. This can be for instance data about the sex or developmental stage of the individual, the sampling or preservation method of the specimen, or the conditions under which the measurement was taken.

The parameter keep allows you to specify which columns contain valuable information as a character vector. As a negative version of keep, specifying drop would allow you to name the columns that are not valuable, while all others will be kept. Not specifying keep or drop will result in dropping all columns except the core measurement and identifier columns.

dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID",
keep = c("Sex")
)



The three extensions of the ETS provide standard terms for this kind of information:

• The Taxon extension provides further terms for specifying the taxonomic resolution of the observation and to ensure the correct reference in case of synonyms and homonyms.
• The Measurement Or Fact extension provides terms to describe information at the level of single measurements or reported facts, such as the original literature reference for the reported value, the method of measurement or statistical method of aggregation. It provides important information that allows for the tracking of potential sources of noise or bias in measured data (e.g. variation in measurement method) or aggregated values (e.g. statistical method), as well as the source of reported facts (e.g. literature source or expert reference).
• The Occurrence extension contains vocabulary to describe information on the observation context of individual specimens, such as sex, life stage or age. This also includes the method of sampling and preservation, as well as the date and geographical location, which provide an important resource to analyze trait variation due to differences in space and time.

We highly recommend mapping the input columns into these standard terms by providing a named vector for keep that gives the target ETS terms as vector names.

dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID",
units = "mm",
keep = c(order = "Order", family = "Family",
sex = "Sex", lifeStage = "Wing_development",
basisOfRecordDescription = "Source",
verbatimLocality = "Center_Sampling_region",
references = "Voucher_ID" )
)



Note that entries without a name in the named vector keep their original column name. Note also that no checking for valid column names (as compared to the traitdata glossary) is performed at this stage. This is to ensure that the raw data table created by as.traitdata() can contain any columns that the author considers relevant. The keep parameter can also be used to rename columns into intuitive column names.
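The renaming rule for a named keep vector can be sketched in base R as follows. The helper `select_keep()` and the toy columns are hypothetical and only imitate the behaviour described above: named entries are renamed to the vector name, unnamed entries keep their original name.

```r
# Hypothetical helper: select the columns listed in `keep` and rename them
# to the vector names where given; unnamed entries keep their original name.
select_keep <- function(x, keep) {
  new_names <- names(keep)
  if (is.null(new_names)) new_names <- rep("", length(keep))
  unnamed <- new_names == ""
  new_names[unnamed] <- keep[unnamed]
  out <- x[, keep, drop = FALSE]
  names(out) <- new_names
  out
}

# Toy raw data with invented values.
raw <- data.frame(
  Sex              = c("f", "m"),
  Wing_development = c("macropterous", "brachypterous"),
  Voucher_ID       = c("V1", "V2"),
  stringsAsFactors = FALSE
)

kept <- select_keep(raw, c(sex = "Sex", lifeStage = "Wing_development", "Voucher_ID"))
names(kept)
```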

## derived trait-values

Many traits comprise compound measures of multiple traits, such as length-mass ratios or morphometric indices. Other traits must be refined in terms of factor levels, or reduced to binary trait values. Many of these tasks can be achieved on the matrix raw data using base functions like transform(), factor() or match() or the mutate() function provided by the package 'plyr' before conversion into the long-table format.
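A minimal sketch of this approach on wide data, using only base R. The column names follow the Heteroptera example, but the values and the derived column `fully_winged` are invented for illustration.

```r
# Toy wide table with invented values.
wide <- data.frame(
  SpeciesID        = c("Sp1", "Sp2"),
  Body_length      = c(6.0, 4.0),
  Body_width       = c(2.0, 1.0),
  Wing_development = c("macropterous", "brachypterous"),
  stringsAsFactors = FALSE
)

# Compound trait: a dimensionless length/width ratio.
wide <- transform(wide, Body_shape = Body_length / Body_width)

# Binary trait: reduce a categorical column to TRUE/FALSE.
wide$fully_winged <- wide$Wing_development == "macropterous"

wide[, c("SpeciesID", "Body_shape", "fully_winged")]
```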

However, if the data are converted to long-table format, these tasks may become tedious as they require splitting the data before the computation can be done. The function mutate.traitdata() performs these tasks (working as a wrapper to plyr::mutate()) while keeping an eye on the units.

dataset2 <- mutate.traitdata(dataset2,
Body_shape = Body_length/Body_width,
Body_volume = Body_length*Body_width*Body_height
)



# 3. Standardize taxa

For taxon name standardisation, the function standardize_taxa() makes use of fuzzy matching algorithms provided by the package 'taxize' by Scott Chamberlain to match the entries of column verbatimScientificName against the GBIF Backbone Taxonomy. The result is written into a new column scientificName. Additional columns comprise the order (for ambiguous names), the reported taxon rank, as well as a globally unique taxon ID which references the taxon to GBIF Backbone Taxonomy in a universal URI format.

If further layers of taxonomic information are desired as an output, the function takes the parameter return, which by default contains c("taxonID", "scientificName", "order", "taxonRank"). Other specifications can be added here.

Note that for this to work, verbatimScientificName must contain the full name of the species or higher taxon, without abbreviations (spaces or underscores are handled correctly). Note also that taxon name mapping requires an internet connection and might take some time, depending on the length of your species list.

dataset1Std <- standardize_taxa(dataset1)
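Since abbreviated names cannot be matched, it can be worth screening the name column before running the online lookup. The following base-R sketch is a hypothetical pre-check, not a feature of standardize_taxa(): it flags names whose genus is reduced to a single capital letter followed by a period. The example names and the regular expression are assumptions of this sketch.

```r
# Toy name list; the second entry uses an abbreviated genus.
names_raw <- c("Carabus_auratus", "C. auratus", "Abax ovalis")

# Replace underscores by spaces for easier reading.
names_clean <- gsub("_", " ", names_raw, fixed = TRUE)

# Abbreviated genus: one capital letter, a period, then whitespace.
abbreviated <- grepl("^[A-Z]\\.\\s", names_clean)

# Names that should be spelled out before standardization
names_clean[abbreviated]
```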


## Single-stroke standardization

The functions standardize_traits() and standardize_taxa() can be applied one after the other, in either order. The output of the first step can be piped into the second step.

To make things even simpler, the functions for format conversion and standardization come with a wrapper function, standardize(). If all necessary parameters for the intermediate steps are provided, a single call will do, taking all the optional parameters described above.

dataset1Std <- standardize(carabids,
thesaurus = thesaurus1,
taxa = "name_correct",
units = "mm",
keep = c(measurementDeterminedBy = "source_measurement")
)


As an alternative input pathway, all parameters to standardize() can be specified as attributes of the input object and will be found natively by the function. This allows for the specification of recipes for data integration for projects pulling data from multiple sources.
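The idea of reading parameters from object attributes can be sketched in a few lines of base R. The helper `get_param()` is hypothetical and only illustrates the lookup order assumed here (explicit argument first, attribute second); it is not the mechanism used inside standardize().

```r
# Hypothetical helper: use the value given in the call if there is one,
# otherwise fall back to an attribute of the same name on the data object.
get_param <- function(x, name, value = NULL) {
  if (!is.null(value)) value else attr(x, name, exact = TRUE)
}

# Toy raw data carrying its own conversion parameters.
raw <- data.frame(name_correct = c("Sp1", "Sp2"), body_length = c(10, 12))
attr(raw, "taxa")  <- "name_correct"
attr(raw, "units") <- "mm"

get_param(raw, "taxa")                 # taken from the attribute
get_param(raw, "units", value = "cm")  # explicit argument wins
```

Storing the parameters with the data means a recipe script only has to be written once per source.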

# 4. Working with trait-datasets

## combine multiple traitdata tables

After standardizing trait and taxon concepts into unified definitions and converting trait values into harmonized units, it is straightforward to combine multiple trait-datasets into one using rbind(). This can be applied before or after the standardisation process, depending on the use case. Use cases of merging data are:

• You collected data from different sources and want to harmonize taxon and trait names: bring the data into long-table format and merge them into one data object, then harmonize taxa and units following a uniform standard.
• No unified trait list or taxon reference exists for the heterogeneous data assembled from different sources (e.g. because they span many different taxa): apply standardization to different reference systems before merging the datasets.

The function call will append the data tables while merging the common columns and maintaining columns that are not present in all datasets (this might produce lots of NA). The column datasetID will be added to keep track of the origin of the data. By default this column will contain the object names of the original datasets, but it can be replaced by more meaningful IDs using the parameter datasetID.

newdata <- rbind(dataset1Std, dataset2Std,
datasetID = c("vanderplas15", "gossner15")
)


Note that the package provides a method for the base function rbind() that handles this merge. Documentation can be accessed via ?rbind.traitdata.
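What such a merge does to tables with differing columns can be sketched in base R. The helper `bind_datasets()` and the toy tables are hypothetical; the sketch is not the code of rbind.traitdata() and only reproduces the described outcome: shared columns are stacked, missing columns are filled with NA, and a datasetID column records the origin.

```r
# Hypothetical helper: row-bind data frames with differing columns,
# filling gaps with NA and recording the origin in `datasetID`.
bind_datasets <- function(..., datasetID) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  dfs <- Map(function(d, id) {
    # Add columns that this dataset lacks.
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d <- d[all_cols]      # same column order everywhere
    d$datasetID <- id
    d
  }, dfs, datasetID)
  out <- do.call(rbind, dfs)
  rownames(out) <- NULL
  out
}

# Toy long tables with invented values; only the first one records sex.
d1 <- data.frame(scientificName = "Sp1", traitValue = 10, sex = "f",
                 stringsAsFactors = FALSE)
d2 <- data.frame(scientificName = "Sp2", traitValue = 7,
                 stringsAsFactors = FALSE)

combined <- bind_datasets(d1, d2, datasetID = c("source_a", "source_b"))
combined
```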

The function handles dataset-level metadata as described in the section 'Metadata' of the Ecological Trait-data Standard (e.g. author or bibliographicCitation). It adds a column datasetID, as well as datasetName and author, if those are provided via the parameter metadata in the as.traitdata() call that creates the data. The function as.metadata() provides a standard structure for this purpose.

metadata1 <- as.metadata(
datasetName = "Carabid traits",
datasetID = "carabids",
bibliographicCitation =  bibentry(
bibtype = "Article",
title = "Sensitivity of functional diversity metrics to sampling intensity",
journal = "Methods in Ecology and Evolution",
author = c(as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
),
year = 2017,
doi = "10.1111/2041-210x.12728"
),
author = "Fons van der Plas"
)

dataset1 <- as.traitdata(carabids,
taxa = "name_correct",
thesaurus = thesaurus1,
units = "mm",
keep = c(datasetID = "source_measurement", measurementRemarks = "note"),
metadata = metadata1
)



Note the use of the bibentry() function to create a formal bibliographic entry. The metadata also affect how the dataset is printed in the R console. This makes it easier for data users to acknowledge authorship and ownership of the data, while also providing a machine-readable structure that can easily be accessed further down the line.

metadata2 <- as.metadata(
datasetName = "Heteroptera morphometry traits",
datasetID = "heteroptera",
bibliographicCitation =  bibentry(
bibtype = "Article",
title = "Morphometric measures of Heteroptera sampled in grasslands across three regions of Germany",
journal = "Ecology",
volume = 96,
issue = 4,
pages = 1154,
author = c(as.person("Martin M. Gossner , Nadja K. Simons, Leonhard Hoeck, Wolfgang W. Weisser")),
year = 2015,
doi = "10.1890/14-2159.1"
),
author = "Martin M. Gossner"
)

dataset2 <- as.traitdata(heteroptera_raw,
taxa = "SpeciesID",
thesaurus = thesaurus2,
units = "mm",
keep = c(sex = "Sex", references = "Source", lifeStage = "Wing_development"),
metadata = metadata2
)

database <- rbind(dataset1, dataset2,
datasetID = c("vanderplas17", "gossner15")
)



The detailed metadata information of both datasets (e.g. license and bibliographic citation) is stored in the attributes of the dataset and displayed when calling it in the R console. You can access the metadata via the attributes() function, for example:

attributes(dataset1)$metadata$bibliographicCitation
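The same access pattern can be tried without the package. In this sketch the metadata are attached as a plain list attribute, which is a simplifying assumption; the structure created by as.metadata() may differ. bibentry() and as.person() come from the 'utils' package that ships with R.

```r
# Formal bibliographic entry built with base R tools.
citation_info <- utils::bibentry(
  bibtype = "Article",
  title   = "Sensitivity of functional diversity metrics to sampling intensity",
  journal = "Methods in Ecology and Evolution",
  author  = utils::as.person(
    "Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer"),
  year    = 2017,
  doi     = "10.1111/2041-210x.12728"
)

# Toy dataset; the metadata list is a simplified stand-in (assumption).
toy <- data.frame(traitValue = c(1.2, 3.4))
attr(toy, "metadata") <- list(datasetName = "Carabid traits",
                              bibliographicCitation = citation_info)

# Machine-readable access to single fields
attributes(toy)$metadata$datasetName
attributes(toy)$metadata$bibliographicCitation$doi
```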


## writing data recipes

For projects compiling data from multiple sources, it is best practice to work from the original raw data, ideally by pulling them directly from their original repository, and to script all changes and standardisation procedures in R. If many changes to individual fields are necessary, you can refer to lookup tables to keep the script slim.

traitdataform allows you to script all parameters required for the standardization call into the attributes of the R object. A script for a single data source can then look like this:

carabids <- utils::read.delim(url("https://datadryad.org/stash/downloads/file_stream/24267",
encoding = "UTF-8")
)

attr(carabids, 'metadata') <- traitdataform:::as.metadata(
datasetName = "Carabid traits",
datasetID = "carabids",
bibliographicCitation =  utils::bibentry(
bibtype = "Article",
title = "Sensitivity of functional diversity metrics to sampling intensity",
journal = "Methods in Ecology and Evolution",
author = c(utils::as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
),
year = 2017,
doi = "10.1111/2041-210x.12728"
),
author = "Fons van der Plas"
)

attr(carabids, 'thesaurus') <-  traitdataform:::as.thesaurus(
body_length = traitdataform:::as.trait("body_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
antenna_length = traitdataform:::as.trait("antenna_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
metafemur_length = traitdataform:::as.trait("femur_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
eyewidth_corr = traitdataform:::as.trait("eye_diameter",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
)

attr(carabids, 'taxa') <- "name_correct"
attr(carabids, 'units') <- "mm"
attr(carabids, 'keep') <-  c(measurementDeterminedBy = "source_measurement", measurementRemarks = "note")


When thus specified, the data can be re-formatted simply by calling standardize(carabids).

# 5. Writing data

The final step in converting trait data into a standardised format, before uploading the file to a public file hosting service, is saving them in a file format that is internationalized, portable and long-term accessible. Internationalization refers to the file encoding ('UTF-8' should be used; 'ASCII' is possible for data with no special characters), the decimal delimiter (using '.' is highly recommended) and internationally accepted formatting standards for values such as dates (the international norm for date entries is ISO 8601, i.e. "YYYY-MM-DD"). Portability means that the file can be opened on all operating systems (the 'end of line' character is particularly important here) and does not rely on proprietary software (like MS Excel or database tools). Long-term accessibility is ensured by choosing a text-based file format (txt, csv or tsv) and by packaging the primary data with all necessary metadata.

The base R function write.table() gives full control over these parameters and should be used to export trait-data.

write.table(dataset1Std, file = "carabids_std.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\r", row.names = FALSE)
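A quick round trip is one way to check that an export keeps its values intact. The sketch below writes a toy table to a temporary file and reads it back; the column names, the values, the explicit fileEncoding argument and the use of the default line ending are assumptions of this sketch rather than recommendations taken from the package.

```r
# Toy export table with invented values; dates formatted as ISO 8601 strings.
out <- data.frame(
  scientificName = c("Carabus auratus", "Abax ovalis"),
  traitValue     = c(22.4, 13.5),
  eventDate      = format(as.Date(c("2017-06-01", "2017-06-15")), "%Y-%m-%d"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so that nothing in the working directory is touched.
path <- tempfile(fileext = ".csv")
write.table(out, file = path, sep = ",", dec = ".", quote = TRUE,
            row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm that values survive the round trip.
back <- read.csv(path, stringsAsFactors = FALSE)
back$eventDate
```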


Along with these primary data, you should make any ancillary data tables available, e.g. the metadata in a human-readable form, as well as the lookup tables of traits and taxa:

capture.output(attributes(dataset1Std)$metadata, file = "metadata.txt")
write.table(attributes(dataset1Std)$traits, file = "traits.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\r", row.names = FALSE)
write.table(attributes(dataset1Std)$taxonomy, file = "taxa.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\r", row.names = FALSE)


When publishing the trait data on file servers like Figshare or Zenodo, those files should be uploaded in a single file repository (e.g. in a zip file). R does the archiving for you using zip():

zip("carabids_std.zip", c("carabids_std.csv", "metadata.txt", "traits.csv", "taxa.csv") )


More advice on publishing trait data in a standardised way can be found in our 'Best practice examples for primary data publication' [@schneider18].

# References

traitdataform documentation built on March 23, 2021, 1:11 a.m.