dss_sex_estimation: Perform secondary sex estimation.

dss_sex_estimation    R Documentation

Description

Estimate the sex of a target individual using an (imputed, complete) reference dataset of individuals of known sex. This function is essentially a wrapper for various methods of supervised learning.

Usage

dss_sex_estimation(ref, target, conf = 0.95,
                   method = c("lda", "glmnet", "linda", "rf"),
                   lda_selvar = c("none", "backward", "forward"),
                   rf_ntrees = 200, rf_downsampling = FALSE,
                   glmnet_type = 0,
                   glmnet_measure = c("deviance", "class"),
                   linda_alpha = 0.9)

Arguments

ref

dataframe (previously imputed if necessary) of reference individuals. No missing values allowed.

target

1-row dataframe, target individual.

conf

numeric value in the interval [0.5, 1), i.e. 1 is excluded; confidence level for sex estimation (i.e., the posterior probability threshold).

method

character string; supervised learning method to be used for sex estimation. See Details below.

lda_selvar

character string. Only parsed if method = "lda". Method of variable selection to be used in LDA model.

rf_ntrees

numeric value. Only parsed if method = "rf". Number of trees to be used in random forest model.

rf_downsampling

boolean. Only parsed if method = "rf". Use a basic method of downsampling in case of unbalanced female/male classes in random forest model.

glmnet_type

numeric value. Only parsed if method = "glmnet". Passed to glmnet as the alpha argument. In particular, choose '0' for ridge regression and '1' for lasso regression.

glmnet_measure

character string. Only parsed if method = "glmnet". Passed to cv.glmnet as the 'type.measure' argument.

linda_alpha

numeric value. Only parsed if method = "linda". Passed to Linda as the 'alpha' argument.
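
As an illustration of how a posterior probability threshold such as conf can act as a decision rule, here is a minimal sketch in base R. The helper classify_with_threshold and the labels "F", "M" and "I" (indeterminate) are hypothetical and are not part of the package; the actual rule and labels used by dss_sex_estimation may differ.

```r
# Hypothetical helper, not part of rdss: turn the posterior probability of
# the female class into an estimate, or "I" (indeterminate) when neither
# class reaches the threshold. Labels are an assumption for illustration.
classify_with_threshold <- function(post_f, conf = 0.95) {
    stopifnot(is.numeric(post_f),
              all(post_f >= 0 & post_f <= 1),
              conf >= 0.5, conf < 1)
    post_m <- 1 - post_f  # two classes, so the posteriors sum to 1
    ifelse(post_f >= conf, "F",
           ifelse(post_m >= conf, "M", "I"))
}

# Three individuals with posterior probabilities of being female
estimates <- classify_with_threshold(c(0.97, 0.80, 0.02), conf = 0.95)
print(estimates)
```

Under this sketch, raising conf can only move individuals into the indeterminate category; it never changes an "F" into an "M" or the reverse.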

Details

The method argument selects one of four supervised learning methods: classical linear discriminant analysis ("lda"), performed with lda; robust discriminant analysis ("linda"), performed with Linda; random forests ("rf"), performed with randomForest; and penalized logistic regression ("glmnet"), performed with glmnet. See their respective help pages for more details.

Classification accuracy is automatically assessed using leave-one-out cross-validation (or the out-of-bag error for random forests). The confusion matrix that is returned therefore corresponds to cross-validated results.
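
To make the idea of a cross-validated confusion matrix concrete, the following sketch computes one by leave-one-out cross-validation with lda from the MASS package. It only illustrates the general technique: the data are a stand-in (two species from the built-in iris dataset play the role of the two sex classes), and the object name table_loocv is borrowed from the Value section for readability. It should not be read as the internal code of dss_sex_estimation.

```r
library(MASS)  # provides lda()

# Stand-in reference sample (an assumption, not an rdss dataset):
# keep two iris species so that the problem has two classes.
ref <- droplevels(subset(iris, Species != "setosa"))

# With CV = TRUE, lda() returns leave-one-out predicted classes and
# posterior probabilities instead of a fitted model object.
fit_cv <- lda(Species ~ ., data = ref, CV = TRUE)

# Cross-validated confusion matrix: rows are true classes,
# columns are leave-one-out predictions.
table_loocv <- table(true = ref$Species, predicted = fit_cv$class)
print(table_loocv)

# Overall cross-validated accuracy
accuracy <- sum(diag(table_loocv)) / sum(table_loocv)
print(accuracy)
```

Because each individual is predicted by a model fitted without it, the accuracy read from this table is less optimistic than one computed by re-predicting the training data.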

Value

A list of three components:

res_dss

A dataframe of results for the target individual, with all necessary details about the model used for sex estimation.

table_loocv

A confusion matrix obtained by leave-one-out cross-validation on the reference sample ref (or from the out-of-bag predictions for random forests).

details

Additional method-specific details, such as coefficient values or variable importance, depending on the value of the method argument.

Author(s)

Frédéric Santos.

See Also

lda, Linda, randomForest, glmnet, cv.glmnet


frederic-santos/rdss documentation built on March 25, 2023, 5:25 p.m.