knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(digits = 3)
library(rdss)
rdss
is an R package, and mainly an R-shiny application, for making sex estimation in past populations somewhat easier. The global approach follows the philosophy of "diagnose sexuelle secondaire" (secondary sex estimation), described by @murail1999_NewApproachSexual.
rdss
has been peer-reviewed and is extensively described in an article published in the International Journal of Osteoarchaeology [@santos2021_RdssPackageFacilitate]. This article can be seen as the official and the main documentation for the package. Also, a video tutorial for the R-shiny application is available on Vimeo.
The present vignette only gives some additional details for using rdss
through R scripts rather than the graphical user interface, and it is assumed that the reader has already gone through the IJOA article or the video tutorial.
rdss
was primarily designed as an R-Shiny application. As such, the target audience mainly consists in researchers and students in archaeology or biological anthropology who are not R programmers. With rdss
, all steps of secondary sex estimation can be made interactively through a graphical user interface: this is what the package was designed for.
Secondary sex estimation is just a combination of some cumbersome data manipulation steps (subsetting, filtering, discarding or imputing missing values, etc.) with usual and simple supervised learning algorithms (LDA, random forests, logistic regression, etc.). As such, all those operations can easily be made without rdss
through R scripts... as long as the user is familiar with R programming!
As explicitly stated in the article, I feel that those users who are familiar with R programming won't need the graphical user interface, and probably won't need rdss
at all, since performing secondary sex estimation with base R functions is not especially long nor difficult. However, for the sake of transparency and reproducibility of the analyses, starting with version 1.1.0, rdss
exports its main functions, so that they can also be used through the command line. It will also make it easier to share your analyses with your colleagues or your supervisor, or to include them as supporting information in an article.
Note that, at least as a first step, I still recommend to stick with the graphical user interface. Secondary sex estimation will often require several empirical trials before finding the best settings, and the R-shiny application is ideal for that.
rdss
through the R-shiny applicationSimply run the following instructions in the R console:
library(rdss) start_dss()
See the video tutorial or the seminal article [@santos2021_RdssPackageFacilitate] for full details about rdss
and secondary sex estimation.
rdss
from the command line@santos2021_RdssPackageFacilitate presented a short toy example of secondary sex estimation of a given individual, based on real archaeological data extracted from the Goldman Data Set online [@auerbach_GoldmanOsteometricData]. The R code below will allow for the replication of the main results (figures and tables) presented in this article.
For the sake of code readability, we'll use the pipes provided by magrittr
below:
library(dplyr) library(magrittr)
Let's start with the CSV file used in the seminal article [@santos2021_RdssPackageFacilitate]. This data file must include both the individuals of known sex, and the target individuals. It can be imported with the usual read.csv()
function.
Important note: please do not use the argument row.names
when reading your data file. The individual IDs must be present in your data file, but at this stage, they must be read as an ordinary column, not as row names.
### Load your data file: dat <- read.csv("https://gitlab.com/f-santos/rdss/-/raw/master/inst/poundbury_with_NA_reduced.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = TRUE, na.strings = "nil", fileEncoding = "utf-8")
Now let's display a quick summary for this file:
summary(dat)
rdss
offers a bunch of checks for possible formatting errors in your dataframe, which might prevent the subsequent analyses to run normally. As a consequence, a mandatory first step consists in running those checks on your dataframe with the function dss_check_data()
. If some formatting errors are found, explicit error messages are displayed, with useful indications on how to solve them. Also, as a side effect, dss_check_data()
returns your original dataframe, with possibly some slight changes in column names and factor levels for the sake of standardization (see its help page). Also, the first column is now converted as row names, if it is found to be convenient for that purpose.
### Perform some checks on the dataframe: dat <- dss_check_data(dtf = dat, # name of imported dataframe sex = "Sex", # name of column for sex info in dat females = "F", # abbreviation for female indivs in dat males = "M", # abbreviation for male indivs in dat tbd = "TBD") # abbreviation for target indivs in dat summary(dat)
Since no error was displayed here, you are now ready for the next step.
You can first select (manually) your target individual for the subsequent steps. For instance, let's say that our target individual will be the individual whose name is Indet_1
:
### Select the target individual: target <- dat["Indet_1", ] print(target)
The reference dataset is composed of the individual of known sex. Although this is not the case here, the target individual you selected may have missing values. The variables corresponding to those potentially missing values, by definition, are useless for all subsequent steps, and should be removed from the reference dataset.
### Define reference dataset: ref <- subset(dat, Sex %in% c("F", "M")) %>% droplevels() %>% # remove unused factor levels select_if(! is.na(target)) # keep only non-missing variables on target indiv summary(ref)
A map of missingness can easily be obtained for the reference sample:
### Missingness map for the original reference sample: dss_plot_md_pattern(ref = ref, target = target, type = "map") # also try with type = "pattern"
Note that this corresponds to Fig. 2 in @santos2021_RdssPackageFacilitate.
Thanks to this map, one can remove manually and explicitly the variables and individuals having too many missing values. Another way is to use quality criteria, e.g.:
1. "Keep only those variables having less than 50% of missing values" (here, this will only remove LRML
, which seems to be reasonable)
2. Then, in the remanining data, "keep only those individuals with less than 50% of missing values" (here, this will only remove two individuals).
This can be done by using, for instance:
### Keep only well-preserved individuals and variables: ref %<>% remove_na(which = "var", prop_min = 0.5) %>% remove_na(which = "ind", prop_min = 0.5) summary(ref)
Note that this kind of filtering criteria are not commutative, generally speaking.
The filtered reference dataset has now a much better map of missingness, while still having enough individuals and variables:
### Missingness map for the filtered reference sample: dss_plot_md_pattern(ref = ref, target = target, type = "map") # also try with type = "pattern"
Note that this corresponds to Fig. 3 in @santos2021_RdssPackageFacilitate. Also, those missingness maps (when type = "map"
) are produced using the R function visdat::vis_miss()
, which relies of ggplot2
. Such maps can then be further customized using the usual ggplot2
syntax.
First, the remaining missing values in the reference dataset must be imputed. By default, an EM-type algorithm by @josse2016_MissMDAPackageHandling is used.
Note: please do not erase the original reference dataset, since you will have to use it in subsequent steps. Create a new R object instead.
### Impute missing values in the reference dataset: imputed <- dss_impute_missing(dtf = ref, method = "missMDA") summary(imputed)
A principal component analysis can then be easily computed and displayed using the wrapper dss_plot_pca()
. Classical or robust confidence ellipses can be added on the plot.
### Perform a principal component analysis: dss_plot_pca(ref = ref, imputed_ref = imputed, target = target, ellipses = "robust", labels = FALSE)
Note that this is exactly Fig. 4 in @santos2021_RdssPackageFacilitate.
Finally, the main function, dss_sex_estimation
, builds a decision rule for sex estimation on the reference sample, evaluates this decision rule in leave-one-out cross-validation on this sample, and then applies this rule on the target individual.
### Perform sex estimation: res <- dss_sex_estimation(ref = imputed, # imputed reference dataset target = target, # target individual conf = 0.95, # post. prob. threshold for sex estimation method = "linda") # use robust LDA
The classification results can then be displayed individually:
### Confusion matrix in LOOCV on the reference dataset: res$table_loocv
### Model coefficients: res$details
### Results for the target individual: res$res_dss
Note that those results correspond to Fig. 5 in @santos2021_RdssPackageFacilitate.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.