knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 3)
library(rdss)

Introduction

Scope and ressources

rdss is an R package, and mainly an R-shiny application, for making sex estimation in past populations somewhat easier. The global approach follows the philosophy of "diagnose sexuelle secondaire" (secondary sex estimation), described by @murail1999_NewApproachSexual.

rdss has been peer-reviewed and is extensively described in an article published in the International Journal of Osteoarchaeology [@santos2021_RdssPackageFacilitate]. This article can be seen as the official and the main documentation for the package. Also, a video tutorial for the R-shiny application is available on Vimeo.

The present vignette only gives some additional details for using rdss through R scripts rather than the graphical user interface, and it is assumed that the reader has already gone through the IJOA article or the video tutorial.

Graphical user interface vs. use in R scripts

rdss was primarily designed as an R-Shiny application. As such, the target audience mainly consists in researchers and students in archaeology or biological anthropology who are not R programmers. With rdss, all steps of secondary sex estimation can be made interactively through a graphical user interface: this is what the package was designed for.

Secondary sex estimation is just a combination of some cumbersome data manipulation steps (subsetting, filtering, discarding or imputing missing values, etc.) with usual and simple supervised learning algorithms (LDA, random forests, logistic regression, etc.). As such, all those operations can easily be made without rdss through R scripts... as long as the user is familiar with R programming!

As explicitly stated in the article, I feel that those users who are familiar with R programming won't need the graphical user interface, and probably won't need rdss at all, since performing secondary sex estimation with base R functions is not especially long nor difficult. However, for the sake of transparency and reproducibility of the analyses, starting with version 1.1.0, rdss exports its main functions, so that they can also be used through the command line. It will also make it easier to share your analyses with your colleagues or your supervisor, or to include them as supporting information in an article.

Note that, at least as a first step, I still recommend to stick with the graphical user interface. Secondary sex estimation will often require several empirical trials before finding the best settings, and the R-shiny application is ideal for that.

Using rdss through the R-shiny application

Simply run the following instructions in the R console:

library(rdss)
start_dss()

See the video tutorial or the seminal article [@santos2021_RdssPackageFacilitate] for full details about rdss and secondary sex estimation.

Using rdss from the command line

@santos2021_RdssPackageFacilitate presented a short toy example of secondary sex estimation of a given individual, based on real archaeological data extracted from the Goldman Data Set online [@auerbach_GoldmanOsteometricData]. The R code below will allow for the replication of the main results (figures and tables) presented in this article.

For the sake of code readability, we'll use the pipes provided by magrittr below:

library(dplyr)
library(magrittr)

Equivalent for first tab in UI: load data

Let's start with the CSV file used in the seminal article [@santos2021_RdssPackageFacilitate]. This data file must include both the individuals of known sex, and the target individuals. It can be imported with the usual read.csv() function.

Important note: please do not use the argument row.names when reading your data file. The individual IDs must be present in your data file, but at this stage, they must be read as an ordinary column, not as row names.

### Load your data file:
dat <- read.csv("https://gitlab.com/f-santos/rdss/-/raw/master/inst/poundbury_with_NA_reduced.csv",
                header = TRUE,
                sep = ",",
                dec = ".",
                stringsAsFactors = TRUE,
                na.strings = "nil",
                fileEncoding = "utf-8")

Now let's display a quick summary for this file:

summary(dat)

rdss offers a bunch of checks for possible formatting errors in your dataframe, which might prevent the subsequent analyses to run normally. As a consequence, a mandatory first step consists in running those checks on your dataframe with the function dss_check_data(). If some formatting errors are found, explicit error messages are displayed, with useful indications on how to solve them. Also, as a side effect, dss_check_data() returns your original dataframe, with possibly some slight changes in column names and factor levels for the sake of standardization (see its help page). Also, the first column is now converted as row names, if it is found to be convenient for that purpose.

### Perform some checks on the dataframe:
dat <- dss_check_data(dtf = dat,     # name of imported dataframe
                      sex = "Sex",   # name of column for sex info in dat
                      females = "F", # abbreviation for female indivs in dat
                      males = "M",   # abbreviation for male indivs in dat
                      tbd = "TBD")   # abbreviation for target indivs in dat
summary(dat)

Since no error was displayed here, you are now ready for the next step.

Equivalent for second tab in UI: select the target individual

You can first select (manually) your target individual for the subsequent steps. For instance, let's say that our target individual will be the individual whose name is Indet_1:

### Select the target individual:
target <- dat["Indet_1", ]
print(target)

The reference dataset is composed of the individual of known sex. Although this is not the case here, the target individual you selected may have missing values. The variables corresponding to those potentially missing values, by definition, are useless for all subsequent steps, and should be removed from the reference dataset.

### Define reference dataset:
ref <- subset(dat, Sex %in% c("F", "M")) %>%
  droplevels() %>%           # remove unused factor levels
  select_if(! is.na(target)) # keep only non-missing variables on target indiv
summary(ref)

Equivalent for third tab in UI: customize the reference sample

Visualization of missing data pattern

A map of missingness can easily be obtained for the reference sample:

### Missingness map for the original reference sample:
dss_plot_md_pattern(ref = ref,
                    target = target,
                    type = "map")     # also try with type = "pattern"

Note that this corresponds to Fig. 2 in @santos2021_RdssPackageFacilitate.

Data subsetting and filtering

Thanks to this map, one can remove manually and explicitly the variables and individuals having too many missing values. Another way is to use quality criteria, e.g.: 1. "Keep only those variables having less than 50% of missing values" (here, this will only remove LRML, which seems to be reasonable) 2. Then, in the remanining data, "keep only those individuals with less than 50% of missing values" (here, this will only remove two individuals).

This can be done by using, for instance:

### Keep only well-preserved individuals and variables:
ref %<>% remove_na(which = "var",
                   prop_min = 0.5) %>%
    remove_na(which = "ind",
              prop_min = 0.5)
summary(ref)

Note that this kind of filtering criteria are not commutative, generally speaking.

The filtered reference dataset has now a much better map of missingness, while still having enough individuals and variables:

### Missingness map for the filtered reference sample:
dss_plot_md_pattern(ref = ref,
                    target = target,
                    type = "map")     # also try with type = "pattern"

Note that this corresponds to Fig. 3 in @santos2021_RdssPackageFacilitate. Also, those missingness maps (when type = "map") are produced using the R function visdat::vis_miss(), which relies of ggplot2. Such maps can then be further customized using the usual ggplot2 syntax.

Equivalent for fourth tab in UI: perform sex estimation

Impute missing values in the reference dataset

First, the remaining missing values in the reference dataset must be imputed. By default, an EM-type algorithm by @josse2016_MissMDAPackageHandling is used.

Note: please do not erase the original reference dataset, since you will have to use it in subsequent steps. Create a new R object instead.

### Impute missing values in the reference dataset:
imputed <- dss_impute_missing(dtf = ref, method = "missMDA")
summary(imputed)

Principal component analysis on the reference dataset

A principal component analysis can then be easily computed and displayed using the wrapper dss_plot_pca(). Classical or robust confidence ellipses can be added on the plot.

### Perform a principal component analysis:
dss_plot_pca(ref = ref,
             imputed_ref = imputed,
             target = target,
             ellipses = "robust",
             labels = FALSE)

Note that this is exactly Fig. 4 in @santos2021_RdssPackageFacilitate.

Build a decision rule for sex estimation and assess its accuracy

Finally, the main function, dss_sex_estimation, builds a decision rule for sex estimation on the reference sample, evaluates this decision rule in leave-one-out cross-validation on this sample, and then applies this rule on the target individual.

### Perform sex estimation:
res <- dss_sex_estimation(ref = imputed,   # imputed reference dataset
                          target = target, # target individual
                          conf = 0.95,     # post. prob. threshold for sex estimation
                          method = "linda") # use robust LDA

The classification results can then be displayed individually:

### Confusion matrix in LOOCV on the reference dataset:
res$table_loocv
### Model coefficients:
res$details
### Results for the target individual:
res$res_dss

Note that those results correspond to Fig. 5 in @santos2021_RdssPackageFacilitate.

References



frederic-santos/rdss documentation built on March 25, 2023, 5:25 p.m.