linComb: Combine two diagnostic tests with several linear combination...

linComb    R Documentation

Combine two diagnostic tests with several linear combination methods.

Description

The linComb function calculates combination scores for two diagnostic tests using one of several linear combination methods and standardization options.

Usage

linComb(
  markers = NULL,
  status = NULL,
  event = NULL,
  method = c("scoring", "SL", "logistic", "minmax", "PT", "PCL", "minimax", "TS"),
  resample = c("none", "cv", "repeatedcv", "boot"),
  nfolds = 5,
  nrepeats = 3,
  niters = 10,
  standardize = c("none", "range", "zScore", "tScore", "mean", "deviance"),
  ndigits = 0,
  show.plot = TRUE,
  direction = c("auto", "<", ">"),
  conf.level = 0.95,
  cutoff.method = c("CB", "MCT", "MinValueSp", "MinValueSe", "ValueSp", "ValueSe",
    "MinValueSpSe", "MaxSp", "MaxSe", "MaxSpSe", "MaxProdSpSe", "ROC01", "SpEqualSe",
    "Youden", "MaxEfficiency", "Minimax", "MaxDOR", "MaxKappa", "MinValueNPV",
    "MinValuePPV", "ValueNPV", "ValuePPV", "MinValueNPVPPV", "PROC01", "NPVEqualPPV",
    "MaxNPVPPV", "MaxSumNPVPPV", "MaxProdNPVPPV", "ValueDLR.Negative",
    "ValueDLR.Positive", "MinPvalue", "ObservedPrev", "MeanPrev", "PrevalenceMatching"),
  ...
)

Arguments

markers

a numeric data frame that includes the results of two diagnostic tests

status

a factor vector that includes the actual disease status of the patients

event

a character string that indicates the level of status to be treated as the positive event

method

a character string specifying the method used for combining the markers.
Notation: Before describing the methods, we introduce the notation used below. Let D_i, i = 1, 2, \ldots, n_1, be the marker values of the i\text{th} individual in the diseased group, where D_i = (D_{i1}, D_{i2}), and let H_j, j = 1, 2, \ldots, n_2, be the marker values of the j\text{th} individual in the healthy group, where H_j = (H_{j1}, H_{j2}). Let x_{i1} = c(D_{i1}, H_{j1}) be the values of the first marker and x_{i2} = c(D_{i2}, H_{j2}) be the values of the second marker for the i\text{th} individual, i = 1, 2, \ldots, n. Let D_{i,min} = min(D_{i1}, D_{i2}), D_{i,max} = max(D_{i1}, D_{i2}), H_{j,min} = min(H_{j1}, H_{j2}), H_{j,max} = max(H_{j1}, H_{j2}), and let c_i be the resulting combination score for the i\text{th} individual.

The available methods are:

  • Logistic Regression (logistic): The combination score obtained by fitting a logistic regression model is as follows:

    c_i = \left(\frac{e^ {\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}}}\right)

    Here the combination score is the predicted probability that the fitted logistic regression model assigns to each observation in the dataset.

  • Scoring based on Logistic Regression (scoring): The combination score is obtained using the slope values of the relevant logistic regression model. The slope values are rounded to the number of digits supplied by the user (see ndigits).

    c_i = \beta_1 x_{i1} + \beta_2 x_{i2}

  • Pepe & Thompson’s method (PT): The Pepe and Thompson combination score, developed using their optimal linear combination technique, aims to maximize the Mann-Whitney statistic in the same way that the Min-max method does. Unlike the Min-max method, the Pepe and Thompson method takes into account all marker values instead of just the minimum and maximum values.

    maximize\; U(\alpha) = \left(\frac{1}{n_1 n_2}\right) \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I(D_{i1} + \alpha D_{i2} \geq H_{j1} + \alpha H_{j2})

    c_i = x_{i1} + \alpha x_{i2}

  • Pepe, Cai & Langton’s method (PCL): The Pepe, Cai and Langton combination score is obtained by using the AUC as the parameter of a logistic regression model.

    maximize\; U(\alpha) = \left(\frac{1}{n_1 n_2}\right) \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \left[ I(D_{i1} + \alpha D_{i2} > H_{j1} + \alpha H_{j2}) + \left(\frac{1}{2}\right) I(D_{i1} + \alpha D_{i2} = H_{j1} + \alpha H_{j2}) \right]

  • Min-Max method (minmax): This method linearly combines the minimum and maximum values of the markers by finding a parameter, \alpha, that maximizes the Mann-Whitney statistic, an empirical estimate of the ROC area.

    maximize\; U(\alpha) = \left(\frac{1}{n_1 n_2}\right) \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I(D_{i,max} + \alpha D_{i,min} > H_{j,max} + \alpha H_{j,min})

    c_i = x_{i,max} + \alpha x_{i,min}

    where x_{i,max} = max(x_{i1},x_{i2}) and x_{i,min} = min(x_{i1}, x_{i2})

  • Su & Liu’s method (SL): The Su and Liu combination score is computed from Fisher’s discriminant coefficients, which assume that the underlying data follow a multivariate normal distribution and that the covariance matrices of the classes are proportional. Assume that D \sim N(\mu_D, \Sigma_D) and H \sim N(\mu_H, \Sigma_H) are the multivariate normal distributions of the diseased and non-diseased groups, respectively. Fisher’s coefficients are then as follows:

    (\alpha, \beta) = (\Sigma_D + \Sigma_H)^{-1} \mu

    where \mu = \mu_D - \mu_H. The combination score in this case is:

    c_i = \alpha x_{i1} + \beta x_{i2}

  • Minimax approach (minimax): The combination score is obtained with the Minimax procedure; the t parameter is chosen as the value that gives the maximum AUC for the combination score. Suppose that D follows a multivariate normal distribution D \sim N(\mu_D, \Sigma_D), representing the diseased group, and H follows a multivariate normal distribution H \sim N(\mu_H, \Sigma_H), representing the non-diseased group. Then Fisher’s coefficients are as follows:

    (\alpha, \beta) = [t \Sigma_D + (1 - t) \Sigma_H]^{-1} (\mu_D - \mu_H)

    c_i = \alpha x_{i1} + \beta x_{i2}

  • Todor & Saplacan’s method (TS): The combination score is obtained by using the trigonometric functions of the \theta value that optimizes the corresponding AUC.

    c_i = \sin(\theta) x_{i1} + \cos(\theta) x_{i2}
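
The AUC-based and Fisher-type formulas above can be illustrated in base R. The following is a minimal sketch on simulated data, not the package's implementation: the helper mw_auc, the grid of alpha values on [-1, 1], and all variable names are assumptions made for illustration.

```r
# Illustrative sketch only: simulated markers, hypothetical helper names.
set.seed(1)
n1 <- 40
n2 <- 60
D <- cbind(rnorm(n1, 1.0), rnorm(n1, 0.8))  # diseased group, n1 x 2
H <- cbind(rnorm(n2, 0.0), rnorm(n2, 0.0))  # healthy group,  n2 x 2

# Mann-Whitney estimate of the AUC; ties contribute 1/2
mw_auc <- function(d, h) {
  mean(outer(d, h, ">") + 0.5 * outer(d, h, "=="))
}

# PT-style search: score = x1 + alpha * x2, alpha picked on a grid (assumed range)
alphas <- seq(-1, 1, by = 0.01)
u <- sapply(alphas, function(a) {
  mw_auc(D[, 1] + a * D[, 2], H[, 1] + a * H[, 2])
})
alpha_hat <- alphas[which.max(u)]

# Min-max-style score: max + alpha * min of the two markers, per individual
minmax_score <- function(x, a) pmax(x[, 1], x[, 2]) + a * pmin(x[, 1], x[, 2])

# SL-style coefficients: (Sigma_D + Sigma_H)^{-1} (mu_D - mu_H)
sl_coef <- solve(cov(D) + cov(H), colMeans(D) - colMeans(H))
sl_score <- rbind(D, H) %*% sl_coef  # c_i = alpha * x_i1 + beta * x_i2
```

A grid search is used here only because the Mann-Whitney statistic is a step function of alpha, so gradient-based optimizers are not a natural fit; the package may search differently.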

resample

a character string indicating the name of the resampling option. Bootstrapping, cross-validation and repeated cross-validation are available, controlled by the number of iterations, folds and repeats, respectively.

  • boot: Bootstrap resampling: resamples are drawn from the dataset with replacement, and models are trained and tested on these resamples to determine the best parameters for the given method and dataset.

  • cv: Cross-validation resampling: the dataset is divided, without replacement, into the given number of folds; in each iteration, one fold is selected as the test set, and the model is built using the remaining folds and tested on the test set. The corresponding AUC values and the parameters used for the combination are kept in a list. The best-performing model is selected, and the combination score is returned for the whole dataset.

  • repeatedcv: Repeated cross-validation: the cross-validation process is repeated, and the best-performing model from each repeat is stored in another list; the best-performing of these models is then applied to the entire dataset.
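
The cv option described above can be sketched in base R as follows. This is a hedged illustration on simulated data that uses a logistic regression combination; the helper auc, the data frame dat, and the fold assignment are assumptions, and the package's internal code may differ.

```r
# Illustrative sketch only: k-fold cross-validation with AUC-based selection.
set.seed(7)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 + 0.5 * dat$x2))

# Mann-Whitney AUC of a score for a 0/1 outcome (hypothetical helper)
auc <- function(score, y) {
  d <- score[y == 1]
  h <- score[y == 0]
  mean(outer(d, h, ">") + 0.5 * outer(d, h, "=="))
}

nfolds <- 5
# Assign each row to exactly one fold (partition without replacement)
fold <- sample(rep(seq_len(nfolds), length.out = n))

# Fit on all folds but one
fits <- lapply(seq_len(nfolds), function(k) {
  glm(y ~ x1 + x2, family = binomial(), data = dat[fold != k, ])
})

# Evaluate each fit on its held-out fold
test_auc <- sapply(seq_len(nfolds), function(k) {
  test <- dat[fold == k, ]
  auc(predict(fits[[k]], newdata = test, type = "response"), test$y)
})

# Keep the best-performing fit and score the whole dataset with it
best <- fits[[which.max(test_auc)]]
score <- predict(best, newdata = dat, type = "response")
```

Repeating the fold assignment and selection nrepeats times, and then keeping the best of the per-repeat winners, would give the repeatedcv variant under the same assumptions.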

nfolds

a numeric value that indicates the number of folds for cross-validation-based resampling methods (5, default)

nrepeats

a numeric value that indicates the number of repeats for "repeatedcv" option of resampling methods (3, default)

niters

a numeric value that indicates the number of bootstrapped resampling iterations (10, default)

standardize

a character string indicating the name of the standardization method. The default option is no standardization applied. Available options are:

  • Z-score (zScore): This method scales the data to have a mean of 0 and a standard deviation of 1. It subtracts the mean and divides by the standard deviation for each feature. Mathematically,

    Z-score = \frac{x - (\overline x)}{sd(x)}

    where x is the value of a marker, \overline{x} is the mean of the marker and sd(x) is the standard deviation of the marker.

  • T-score (tScore): T-score is commonly used in data analysis to transform raw scores into a standardized form. The standard formula for converting a raw score x into a T-score is:

    T-score = \Biggl(\frac{x - \overline{x}}{sd(x)} \times 10\Biggr) + 50

    where x is the value of a marker, \overline{x} is the mean of the marker and sd(x) is the standard deviation of the marker.

  • Range (a.k.a. min-max scaling) (range): This method transforms data to a specific range, between 0 and 1. The formula for this method is:

    Range = \frac{x - min(x)}{max(x) - min(x)}

  • Mean (mean): This method calculates the ratio of each data point to the mean value of the dataset, which shows the size of a single observation relative to the mean of the dataset.

    Mean = \frac{x}{\overline{x}}

    where x is the value of a marker and \overline{x} is the mean of the marker.

  • Deviance (deviance): This method, which allows for comparison of individual data points in relation to the overall spread of the data, calculates the ratio of each data point to the standard deviation of the dataset.

    Deviance = \frac{x}{sd(x)}

    where x is the value of a marker and sd(x) is the standard deviation of the marker.
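
The standardization formulas above translate directly into R. The following sketch applies them to a single marker vector; the function name standardize_marker is hypothetical and is not part of the package.

```r
# Illustrative sketch only: the five standardization formulas for one marker.
standardize_marker <- function(x, method = c("none", "range", "zScore",
                                             "tScore", "mean", "deviance")) {
  method <- match.arg(method)
  switch(method,
    none     = x,
    range    = (x - min(x)) / (max(x) - min(x)),
    zScore   = (x - mean(x)) / sd(x),
    tScore   = (x - mean(x)) / sd(x) * 10 + 50,
    mean     = x / mean(x),
    deviance = x / sd(x)
  )
}

x <- c(2, 4, 6, 8, 10)
standardize_marker(x, "range")   # values rescaled to [0, 1]
standardize_marker(x, "tScore")  # mean 50, standard deviation 10
```

In a resampling setting, one would normally compute the mean, standard deviation, minimum and maximum on the training data and reuse them on the test data; that step is omitted here for brevity.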

ndigits

an integer value that indicates the number of decimal places used for rounding in the Scoring method (0, default)
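
To illustrate how ndigits could interact with the scoring method, the sketch below fits a logistic regression with glm on simulated data, rounds the two slopes, and forms the score c_i = \beta_1 x_{i1} + \beta_2 x_{i2}. The data and variable names are assumptions, and the package's own code may differ.

```r
# Illustrative sketch only: rounded logistic slopes used as scoring weights.
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 + 0.7 * x2))

fit <- glm(y ~ x1 + x2, family = binomial())

# "logistic" style score: fitted probabilities
prob <- fitted(fit)

# "scoring" style score: slopes rounded to ndigits decimal places
ndigits <- 1
b <- round(coef(fit)[c("x1", "x2")], digits = ndigits)
score <- b[["x1"]] * x1 + b[["x2"]] * x2
```

Note that with ndigits = 0, any slope smaller than 0.5 in absolute value rounds to zero, which removes that marker from the score; rescaling the markers or increasing ndigits avoids this.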

show.plot

a logical. If TRUE, a ROC curve is plotted. Default is TRUE

direction

a character string that determines the direction in which the comparison is made, one of "auto", "<" or ">". ">": the predictor values for the control group are higher than the values of the case group (controls > cases). "<": the predictor values for the control group are lower than or equal to the values of the case group (controls < cases).
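
The effect of direction on the empirical AUC can be seen in a small base R sketch. The helper auc_dir and the toy values are hypothetical; the sketch only mirrors the two comparisons described above and is not the ROC code used by the package.

```r
# Illustrative sketch only: empirical AUC under the two comparison directions.
controls <- c(1.0, 1.5, 2.0, 2.5)
cases <- c(2.2, 3.0, 3.5)

auc_dir <- function(controls, cases, direction = c("<", ">")) {
  direction <- match.arg(direction)
  # "<" counts pairs with controls < cases; ">" counts pairs with controls > cases
  cmp <- if (direction == "<") {
    outer(cases, controls, ">")
  } else {
    outer(cases, controls, "<")
  }
  mean(cmp + 0.5 * outer(cases, controls, "=="))  # ties contribute 1/2
}

auc_lt <- auc_dir(controls, cases, "<")
auc_gt <- auc_dir(controls, cases, ">")
```

The two values sum to 1, so choosing the wrong direction for a marker turns a good AUC into its complement.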

conf.level

a numeric value that determines the confidence level of the confidence interval for the ROC curve (0.95, default).

cutoff.method

a character string that determines the cutoff method for the ROC curve.

...

further arguments. Currently has no effect on the results.

Value

A list of numeric linear combination scores calculated according to the given method and standardization option.

Author(s)

Serra Ilayda Yerlitas, Serra Bersan Gengec, Necla Kochan, Gozde Erturk Zararsiz, Selcuk Korkmaz, Gokmen Zararsiz

Examples

# call data
data(exampleData1)

# define the function parameters
markers <- exampleData1[, -1]
status <- factor(exampleData1$group, levels = c("not_needed", "needed"))
event <- "needed"

score1 <- linComb(
  markers = markers, status = status, event = event,
  method = "logistic", resample = "none", show.plot = TRUE,
  standardize = "none", direction = "<", cutoff.method = "Youden"
)

# call data
data(exampleData2)

# define the function parameters
markers <- exampleData2[, -c(1:3, 6:7)]
status <- factor(exampleData2$Group, levels = c("normals", "carriers"))
event <- "carriers"

score2 <- linComb(
  markers = markers, status = status, event = event,
  method = "PT", resample = "none", standardize = "none", direction = "<",
  cutoff.method = "Youden"
)

score3 <- linComb(
  markers = markers, status = status, event = event,
  method = "minmax", resample = "none", direction = "<",
  cutoff.method = "Youden"
)


dtComb documentation built on June 24, 2024, 5:15 p.m.