***zoolog***: \n Zooarchaeological Analysis with Log-Ratios

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The package zoolog includes functions and reference data to generate and manipulate log-ratios (also known as log size index (LSI) values) from measurements obtained on zooarchaeological material. Log ratios are used to compare the relative (rather than the absolute) dimensions of animals from archaeological contexts [@meadow1999use]. Essentially, the method compares archaeological measurements to a standard, producing a value that indicates how much larger or smaller the archaeological specimen is compared to that standard. zoolog is also able to seamlessly integrate data and references with heterogeneous nomenclature, which is internally managed by a zoolog thesaurus.

The methods included in the package were first developed in the framework of the ERC-Starting Grant 716298 ZooMWest (PI S. Valenzuela-Lamas), and were first used in the paper [@trentacoste2018pre]. They are based on the techniques proposed by @simpson1941large and @simpson1960quantitative, which calculates log size index (LSI) values as: $$ \mbox{LSI} = \log_{10} x - \log_{10} x_{\text{ref}} = \log_{10}(x/x_\text{ref}), $$ where $x$ is the considered measure value and $x_\text{ref}$ is the corresponding reference value. Observe that LSI is defined using logarithms with base 10.

Several different sets of standard reference values are included in the package. These standards include several published and widely used biometric datasets (e.g. @davis1996measurements, @albarella2005neolithic) as well as other less known standards. These references, as well as the data example provided with the package, are based on the measures and measure abbreviations defined in @von1976guide and @davis1992rapid.

In general, zooarchaeological datasets are composed of skeletal remains representing many different anatomical body parts. In investigation of animal size, the analysis of measurements from a given anatomical element provides the best control for the variables affecting size and shape and, as such, it is the preferable option. Unfortunately, this approach is not always viable due to low sample sizes in some archaeological assemblages. This problem can be mitigated by calculating the LSI values for measurements with respect to a reference, which provides a means of aggregating biometric information from different body parts. The resulting log ratios can be compared and statistically analysed under reasonable conditions [@albarella2002size]. However, length and width measurements of different anatomical elements still should not be directly compared or aggregated for statistical analysis.

The package includes a zoolog thesaurus to facilitate its usage by research teams across the globe, and working in different languages and with different recording traditions. The thesaurus enables the zoolog package to recognises many different names for taxa and skeletal elements (e.g. "Bos taurus", "Bos", "BT", "bovino", "bota"). Consequently, there is no need to use a particular, standardised recording code for the names of different taxa or elements.

Acknowledgements

We are particularly grateful to Sabine Deschler-Erb and Barbara Stopp, from the University of Basel (Switzerland) for making the reference values of several specimens available through the ICAZ Roman Period Working Group, which have been included here with their permission. We also thank Francesca Slim and Dimitris Filioglou from the University of Groningen, Claudia Minniti from University of Salento, Sierra Harding and Nimrod Marom from the University of Haifa, and Carly Ameen from the University of Exeter for providing additional reference sets.

The thesaurus has benefited from the contributions from Moussab Albesso, Canan Çakirlar, Jwana Chahoud, Jacopo De Grossi Mazzorin, Sabine Deschler-Erb, Dimitrios Filioglou, Armelle Gardeisen, Sierra Harding, Pilar Iborra, Michael MacKinnon, Nimrod Marom, Claudia Minniti, Francesca Slim, Barbara Stopp, and Emmanuelle Vila.

We are grateful to all of them for their contributions, comments, and help. In addition, users are encouraged to contribute to the thesaurus and other references so that zoolog can be expanded and adapted to any database.

Installation

You can install the released version of zoolog from CRAN with:

install.packages("zoolog")

And the development version from GitHub with:

install.packages("devtools")
devtools::install_github("josempozo/zoolog@HEAD", build_vignettes = TRUE)

Reference standards {#sec:refStandards}

The package zoolog includes several osteometrical references. Currently, the references include reference values for the main domesticates and their agriotypes (Bos, Ovis, Capra, Sus), and red deer (Cervus elaphus). These are drawn from a variety of publications and resources (see below). In addition, the user can consider other references, or the provided references can be extended and updated integrating newer research data. Submission of extended/improved references is encouraged. Please, contact the maintainer through the provided email address to make the new reference fully accessible within the package.

The predefined reference sets included in zoolog are provided in the named list reference, currently comprising the following r length(zoolog::reference) sets:

library(zoolog)
str(reference, max.level = 1)

The reference set reference$Combi is the default reference for computing the log ratios, since it includes the most comprehensive reference for each species.

The package includes also a referencesDatabase collecting the taxon-specific reference standards from all the considered resources. Each reference set is composed of a different combination of taxon-specific standards selected from this referencesDatabase. The selection is defined by the data frame referenceSets:

options(knitr.kable.NA = "")
knitr::kable(referenceSets)

A detailed description of the reference data, including its structure, properties, and considered resources can be found in the ReferencesDatabase help page.

Thesaurus

A thesaurus set is defined in order to make the package compatible with the different recording conventions and languages used by authors of zooarchaeological datasets. This enables the function LogRatios to match values in the user's dataset with the corresponding ones in the reference standard, regardless of differences in nomenclature or naming conventions, as long as both terms are included in the relevant thesaurus. The thesaurus also allows the user to standardize the nomenclature of the dataset if desired.

The user can also use other thesaurus sets or modify the provided one. In this latter case, we encourage the user to contact the maintainer at the provided email address so that the additions can be incorporated into the new versions of the package.

Currently, the zoolog thesaurus set includes four thesauri:

identifierThesaurus : For the column names that identify the variables used in computing the log ratios. It includes the categories Taxon, Element, Measure, and Standard. Each category provides a series of equivalent names. For instance, Taxon includes the options r a <- zoologThesaurus$identifier$Taxon; paste0("*", paste(a[a != ""], collapse = "*, *"), "*"). This thesaurus is case, accent, and punctuation insensitive, so that, for instance, "Especie" is equivalent to "ESPECIE" or "Espècie".

taxonThesaurus : For the names of the different taxa when recording animal bones. The current categories are r paste(names(zoologThesaurus$taxon), collapse = "*, *"), each with different equivalent names. This thesaurus is case, accent, and punctuation insensitive, so that, for instance, "Bos" is equivalent to "bos", "Bos.", or "Bos\ ".

elementThesaurus : Names of anatomical elements when recording animal bones. It currently includes r ncol(zoologThesaurus$element) categories (r paste(names(zoologThesaurus$element)[1:3], collapse = "*, *"), ...), each with different equivalent names. This thesaurus is case, accent, and punctuation insensitive.

measureThesaurus : Names of the measurements. While the English abbreviations from @von1976guide and @davis1992rapid are widely used in published literature, this thesaurus enables other nomenclatures (e.g. original German abbreviations in @von1976guide) to be included. This thesaurus is case sensitive.

Functions

The full list of functions is available under the zoolog help page. We list them here sorted by their prominence for a typical user, and grouped by functionality:

LogRatios : It computes the log ratios of the measurements in a dataset relative to standard reference values. By default reference$Combi is used. The function includes the option 'joinCategories' allowing several taxa (typically Ovis, Capra, and unknown Ovis/Capra) to be considered together with the same reference taxon. : Note that without using 'joinCategories' any taxa not part of the selected reference set will be excluded. For instance, if using reference$NietoDavisAlbarella, log ratios for goats will not be calculated unless 'joinCategories' is set to indicate that the Ovis aries standard should also be applied to goats.

CondenseLogs : It condenses the calculated log ratio values into a reduced number of features by grouping several measure log ratios and selecting or calculating a representative feature value. By default the selected groups represent a single dimension, i.e. Length and Width. Only one feature is extracted per group. Currently, two methods are possible: "priority" (default) or "average". : This operation is motivated by two circumstances. First, not all measurements are available for every bone specimen, which obstructs their direct comparison and statistical analysis. Second, several measurements can be strongly correlated (e.g. SD and Bd both represent bone width). Thus, considering them as independent would produce an over-representation of bone remains with multiple measurements per axis. Condensing each group of measurements into a single feature (e.g. one measure per axis) alleviates both problems. : The default method ("priority"), selects the first available log ratio in each group. Besides, CondenseLogs employs the following by-default group and prioritization introduced in @trentacoste2018pre: Length considers in order of priority GL, GLl, GLm, and HTC. Width considers in order of priority BT, Bd, Bp, SD, Bfd, and Bfp. This order maximises the robustness and reliability of the measurements, as priority is given to the most abundant, more replicable, and less age dependent measurements. But users can set their own features with any group of measures and priorities. The method "average" extracts the mean per group, ignoring the non-available log ratios.

RemoveNACases : It removes the cases (table rows) for which all measurements of interest are non-available (NA). A particular list of measurement names can be explicitly provided or selected by a common initial pattern (e.g. prefix). The default setting removes the rows with no available log ratios to facilitate subsequent analysis of the data.

InCategory : It checks if an element belongs to a category according to a thesaurus. It is similar to base::is.element, returning a logical vector indicating if each element in a given vector is included in a given set. But InCategory checks for equality assuming the equivalencies defined in the given thesaurus. It is intended for the user to easily select a subset of data without having to standardize the analysed dataset.

Nomenclature standardization : This includes two functions enabling the user to map data with heterogeneous nomenclature into a standard one as defined in a thesaurus:

* `StandardizeNomenclature` standardizes a character vector according to 
  a given thesaurus.
* `StandardizeDataSet` standardizes column names and values of a data 
  frame according to a thesaurus set.

AssembleReference : It allows the user to build new references assembling the desired taxon-specific references included in the zoolog referencesDataSet or in any other provided by the user.

Thesaurus readers and writers : This includes functions to read and write a single thesaurus (ReadThesaurus and WriteThesaurus) and a thesaurus set (ReadThesaurusSet and WriteThesaurusSet).

Thesaurus management : This includes functions to modify and check thesauri:

* `NewThesaurus` generates an empty thesarus.
* `AddToThesaurus` adds new names and categories to an existing thesaurus.
* `RemoveRepeatedNames` cleans a thesarus from any repeated names on any 
  category.
* `ThesaurusAmbiguity` checks if there are names included in more than one
  category in a thesaurus.

Examples

The following examples are designed to be read and run sequentially. They represent a possible pipeline, meaningful for the processing and analysis of a dataset. Only occasionally, a small diversion is included to illustrate some alternatives.

Reading data and calculating log ratios

This example reads a dataset from a file in csv format and computes the log-ratios. Then, the cases with no available log-ratios are removed. Finally, the resulting dataset is saved in a file in csv format.

The first step is to set the local path to the folder where you have the dataset to be analysed (this is typically a comma-separated value (csv) file). Here the example dataset from @valenzuela2008alimentacio included in the package is used:

library(zoolog)
dataFile <- system.file("extdata", "dataValenzuelaLamas2008.csv.gz", 
                        package = "zoolog")
data = read.csv2(dataFile,
                 quote = "\"", header = TRUE, na.strings = "",
                 encoding = "UTF-8",
                 stringsAsFactors = TRUE)
knitr::kable(head(data)[, -c(6:20,32:64)])

To enhance the visibility, we have shown only the most relevant columns.

We now calculate the log-ratios using the function LogRatios. Only measurements that have an associated standard will be included in this calculation. The log values will appear as new columns with the prefix 'log' following the original columns with the raw measurements:

dataWithLog <- LogRatios(data)
knitr::kable(head(dataWithLog)[, -c(6:20,32:64)])

Dealing with lazy datasets

If we observe the example dataset more carefully, we can see that the measures recorded for the astragali presents a deviation from the measure definitions in @von1976guide and @davis1992rapid. To see this, we can select the cases where the element is an astragalus. The function InCategory allows us to select them with the help of the thesaurus without requiring to know the terms actually used:

AScases <- InCategory(dataWithLog$Os, "astragalus", zoologThesaurus$element)
knitr::kable(head(dataWithLog[AScases, -c(6:20,32:64)]))

According to the measure definitions, astragali should have no GL measurement, but GLl. However, in the example dataset the GLl measurements have been recorded merged in the GL column. This is a data-entry simplification that is used by some researchers. It is possible because GLl is only relevant for the astragalus, while GL is not applicable to it. Thus, there cannot be any ambiguity between both measures since they can be identified by the bone element. However, since the zoolog reference uses the proper measure name for each bone element (GLl for the astragalus), the reference measure has not been correctly identified. Consequently, the log ratio logGL has NA values and the column logGLl does not exists.

The optional parameter mergedMeasures facilitates the processing of this type of simplified datasets. For the example data, we can use

GLandGLl <- list(c("GL", "GLl"))
dataWithLog <- LogRatios(data, mergedMeasures = GLandGLl)
knitr::kable(head(dataWithLog[AScases, -c(6:20,32:64)]))

This option allows us to automatically select, for each bone element, the corresponding measure present in the reference. Observe that now the log ratios have been computed and assigned to the column logGL.

Using the same ovis reference for all caprines

We could be interested in obtaining the log ratios of all caprines, including Ovis aries, Capra hircus, and undetermined Ovis/Capra, with respect to the reference for Ovis aries. This can be set using the argument joinCategories.

caprineCategory <- list(ovar = c("sheep", "capra", "oc"))
dataWithLog <- LogRatios(data, joinCategories = caprineCategory, mergedMeasures = GLandGLl)
knitr::kable(head(dataWithLog)[, -c(6:20,32:64)])

Note that this option does not remove the distinction in the data between the different species, it just indicates that for these taxa the log ratios must be computed from the same reference ("ovar").

Pruning the data from cases with no available measure

The cases without log-ratios can be removed to facilitate subsequent analyses:

dataWithLogPruned=RemoveNACases(dataWithLog)
knitr::kable(head(dataWithLogPruned[, -c(6:20,32:64)]))

You may want to write the resulting file in the working directory (you need to set it first):

write.csv2(dataWithLogPruned, "myDataWithLogValues.csv", 
           quote=FALSE, row.names=FALSE, na="", 
           fileEncoding="UTF-8")

Condensing log values

After calculating log ratios using the LogRatios function, many rows in the resultant dataframe (dataWithLog in the example above) may contain multiple log values, i.e. you will have several log values associated with a particular archaeological specimen. When analysing log ratios, it is preferential to avoid over-representation of bones with a greater number of measurements and account for each specimen only once. The CondenseLogs function extracts one length and one width value from each row and places these in new Length and Width columns.The 'priority' method described in @trentacoste2018pre has been set as default. Nevertheless, other options (e.g. average of all width log values for a given specimen) can be chosen if preferred. In this case, the default option has been used:

dataWithSummary <- CondenseLogs(dataWithLogPruned)
knitr::kable(head(dataWithSummary)[, -c(6:20,32:64,72:86)])

Standardizing the dataset nomenclature

The integration of the thesaurus functionality facilitates the use of datasets with heterogeneous nomenclatures, without further preprocessing. An extensive catalogue of names for equivalent categories has been integrated in the provided thesaurus set zoologThesaurus. These equivalences are internally and silently managed without requiring any action from the user. However, it can be also interesting to explicitly standardize the data to make figures legible to a wider audience. This is especially useful when different nomenclature for the same concept is found in the same dataset, for instance "sheep" and "ovis" for the same taxon or "hum" and "HU" for the bone element.

If we standardize the studied data, we can see that zoologThesaurus will change "ovar" to "Ovis aries", "hum" to "humerus", and "Especie" to "Taxon", for instance.

dataStandardized <- StandardizeDataSet(dataWithSummary)
knitr::kable(head(dataStandardized)[, -c(6:20,32:64,72:86)])

Selecting only caprines

We may be interested in selecting all caprine elements. This can be done even without standardizing the data using the function InCategory:

dataOC <- subset(dataWithSummary, InCategory(Especie, 
                                             c("sheep", "capra", "oc"),
                                             zoologThesaurus$taxon))
knitr::kable(head(dataOC)[, -c(6:20,32:64)])

Observe that no standardization is performed in the output subset. To standardize the subset data, StandardizeDataSet can be applied either before or after the subsetting.

dataOCStandardized <- StandardizeDataSet(dataOC)
knitr::kable(head(dataOCStandardized)[, -c(6:20,32:64)])

Observe also that the distinction between Ovis aries, Capra hircus, and Ovis/Capra has not been removed from the data.

If we were interested only in one summary measure, Width or Length, we could retain the cases including this measure:

dataOCWithWidth <- RemoveNACases(dataOCStandardized, measureNames = "Width")
dataOCWithLength <- RemoveNACases(dataOCStandardized, measureNames = "Length")

which gives respectively r nrow(dataOCWithWidth) (=nrow(dataOCWithWidth)) and r nrow(dataOCWithLength) (=nrow(dataOCWithLength)) cases.

Different plots for data visualisation

Condensed log values can be visualised as histograms and box plots using ggplot [@wickham2011ggplot2]. Here we will look at some examples of plotting values from caprines.

Horizontal Boxplot with dots grouped by site

For the example plots we will use the package ggplot2.

library(ggplot2)

We can now create a boxplot for the widths:

ggplot(dataOCStandardized, aes(x = Site, y = Width)) +
  geom_boxplot(outlier.shape = NA, na.rm = TRUE) +
  geom_jitter(width = 0.2, height = 0, alpha = 1/2, color = 4, na.rm = TRUE) +
  theme_bw() +
  ggtitle("Caprine widths") +
  ylab("Width log-ratio") +
  coord_flip()

And another boxplot for the lengths:

ggplot(dataOCStandardized, aes(x = Site, y = Length)) +
  geom_boxplot(outlier.shape = NA, na.rm = TRUE) +
  geom_jitter(width = 0.2, height = 0, alpha = 1/2, color = 4, na.rm = TRUE) +
  theme_bw() +
  ggtitle("Caprine lengths") +
  ylab("Length log-ratio") +
  coord_flip()

Histograms grouped by site

We may choose to plot the width data as a histogram:

ggplot(dataOCStandardized, aes(Width)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  ggtitle("Caprine widths") +
  xlab("Width log-ratio") +
  facet_grid(Site ~.) +
  theme_bw() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10),
        axis.text = element_text(size = 10) ) +
  scale_y_continuous(breaks = c(0, 10, 20, 30))

Vertical boxplot with dots grouped by taxon and site

Here we reorder the factor levels of dataOCStandardized$Taxon to make the order of the boxplots more intuitive.

levels0 <- levels(dataOCStandardized$Taxon)
levels0
dataOCStandardized$Taxon <- factor(dataOCStandardized$Taxon, 
                                   levels = levels0[c(2,3,1)])
levels(dataOCStandardized$Taxon)

and assign specific colours for each category:

Ocolour <- c("#A2A475", "#D8B70A", "#81A88D")
ggplot(dataOCStandardized, aes(x=Site, y=Width)) + 
  geom_boxplot(aes(fill=Taxon), 
               notch = TRUE, alpha = 0, lwd = 0.377, outlier.alpha = 0,
               width = 0.5, na.rm = TRUE,
               position = position_dodge(0.75),
               show.legend = FALSE) + 
  geom_point(aes(colour = Taxon, shape = Taxon), 
             alpha = 0.7, size = 0.8, 
             position = position_jitterdodge(jitter.width = 0.3),
             na.rm = TRUE) +
  scale_colour_manual(values=Ocolour) +
  scale_shape_manual(values=c(15, 18, 16)) +
  theme_bw(base_size = 8) +
  ylab("LSI value") +
  ggtitle("Sheep/goat LSI width values") 

Histograms grouped by taxon and site

TaxonSiteWidthHist <- ggplot(dataOCStandardized, aes(Width, fill = Taxon)) + 
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") + 
  ggtitle("Sheep/goat Widths") + facet_grid(Site ~ Taxon)
TaxonSiteWidthHist
TaxonSiteWidthHist <- ggplot(dataOCStandardized, aes(Width, fill = Taxon)) + 
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") + 
  ggtitle("Sheep/goat Widths") + facet_grid(~Site)
TaxonSiteWidthHist

Statistical test

We may run a statistical test (here a Student t-test) to check whether the differences in log ratio length values between the sites "OLD" and "ALP" are statistically significant:

t.test(Length ~ Site, dataOCStandardized, 
       subset = Site %in% c("OLD", "ALP"),
       na.action = "na.omit")

Similarly, for the differences in log ratio width values:

t.test(Width ~ Site, dataOCStandardized, 
       subset = Site %in% c("OLD", "ALP"),
       na.action = "na.omit")

For testing all possible pairs of sites, the p-values must be adjusted for multiple comparisons:

library(stats)
pairwise.t.test(dataOCStandardized$Width, dataOCStandardized$Site, 
                pool.sd = FALSE)

How to cite the package zoolog

We have invested a lot of time and effort in creating zoolog. Please cite the package if you publish an analysis or results obtained using zoolog. For example, "Log ratios calculation and analysis was done using the package zoolog [@pozo2021zoolog] in R 4.0.3 [@RCoreTeam2020]."

To get the details of the most recent version of the package, you can use the R citation function: citation("zoolog").

When publishing it is also recommended that the references for standards are properly cited, as well as any details on selecting feature values. The details on the source of each of the reference standards are given in the help page referencesDatabase. For instance, if using the zoolog reference$NietoDavisAlbarella and the default selection method of features, a fair description may be: "Published references for cattle [@nieto2018element], sheep/goat [@davis1996measurements] and pigs [@albarella2005neolithic] were used as standards. One length and one width log ratio value from each specimen were included in the analysis, with values selected following the default zoolog 'priority' method [@trentacoste2018pre]: length values - GL, GLl, GLm, HTC; width values - BT, Bd, Bp, SD, Bfd, Bfp."

Bibliography



Try the zoolog package in your browser

Any scripts or data that you put into this service are public.

zoolog documentation built on Sept. 5, 2021, 5:37 p.m.