Nothing
#' Estimates relative transmission probabilities
#'
#' The function \code{nbProbabilities} uses naive Bayes and an interative estimation
#' procedure to estimate relative transmission probabilities
#'
#' This algorithm takes a dataset of ordered possible infector-infectee pairs in an
#' infectious disease outbreak or cluster and estimates the relative probability the cases are
#' linked by direct transmission using a classification technique called naive Bayes (NB).
#' NB is a simple machine learning algorithm that uses Bayes rule to estimate the
#' probability of an outcome in a prediction dataset given a set of covariates from
#' the observed frequencies in a training dataset.
#'
#' The input dataset - \code{orderedPair} - should represent ordered pairs of cases
#' (where the potential infector was observed before the infectee) and have
#' a unique identifier for each pair (\code{pairIDVar}) as well as the individual ids that are
#' included in the pair (\code{<indIDVar>.1} and \code{<indIDVar>.2}). If cases are concurrent
#' (meaning the order cannot determined) both orders can be included.
#'
#' A subset of pairs should also have pathogen WGS, contact investigation, or some other
#' 'gold standard' defined by \code{goldStdVar} which should be a logical vector with
#' \code{TRUE} indicating links, \code{FALSE} nonlinks, and \code{NA} if
#' the pair cannot be used to train (does not have the information or is indeterminate).
#' These pairs will be used to a training dataset of probable links and non/links.
#' The covariates can be any categorical variables and could represent
#' spatial, clinical, demographic, and temporal characteristics of the case pair.
#'
#' Because the outcomes in the training set represent probable and not certain
#' transmission events and a given case could have mulitple probable infectors,
#' the algorithm uses an iterative estimation procedure. This procedure randomly chooses one
#' link of all of the possible links to include in the training dataset \code{nReps}
#' times, and then uses \code{mxn} cross prediction to give all pairs a turn
#' in the prediction dataset.
#'
#' The output of this function is a list of two dataframes: one with the estimates of the
#' transmission probabilities (\code{probabilities}) and the other with the contribution of
#' the covariates to the probabilities in the form of odds ratios (\code{estimates}). The
#' 95% confidence intervals reported for these odds ratios use Rubin's Rules, a technique developed
#' for multiple imputation, to pool the error across all iterations.
#'
#'
#' @param orderedPair The name of the ordered pair-level dataset with the covariates.
#' @param indIDVar The name (in quotes) of the column with the individual ID.
#' (data frame \code{orderedPair} must have columns called \code{<indIDVar>.1} and \code{<indIDVar>.2}).
#' @param pairIDVar The name (in quotes) of the column with the unique pair ID variable.
#' @param goldStdVar The name (in quotes) of the column with a logical vector defining
#' training links/non-links
#' @param covariates A character vector containing the covariate column names (in quotes).
#' All covariates need to be categorical factor variables.
#' @param label An optional label string for the run.
#' @param l Laplace smoothing parameter that is added to each cell.
#' @param n The number of folds for nxm cross validation (should be at least 10).
#' @param m The number of times to create n folds in nxm cross validation.
#' @param nReps The number of times to randomly select the "true" infector (should be at least 10).
#' @param progressBar A logical indicating if a progress bar should be printed (default is TRUE).
#'
#' @return List containing two data frames:
#' \enumerate{
#' \item \code{probabilities} - a data frame of transmission probabilities. Column names:
#' \itemize{
#' \item \code{label} - the optional label of the run.
#' \item \code{<pairIDVar>} - the pair ID with the name specified.
#' \item \code{pAvg} - the mean transmission probability for the pair over all iterations.
#' \item \code{pSD} - the standard deviation of the transmission probability for the pair
#' over all iterations.
#' \item \code{pScaled} - the mean relative transmission probability for the pair over.
#' all iterations: pAvg scaled so that the probabilities for all infectors per infectee add to 1.
#' \item \code{pRank} - the rank of the probability of the the pair out of all pairs for that
#' infectee (in case of ties all values have the minimum rank of the group).
#' \item \code{nEstimates} - the number of probability estimates that contributed to pAvg. This
#' represents the number of prediction datasets this pair was included in over the \code{nxm}
#' cross prediction repeated \code{nReps} times.
#' }
#' \item \code{estimates} - a data frame with the contribution of covariates. Column names:
#' \itemize{
#' \item \code{label} - the optional label of the run
#' \item \code{level} - the covariate name and level
#' \item \code{nIter} - the number of iterations included in the estimates: \code{n*m*nReps}
#' \item \code{logorMean} - the mean value of the log odds ratio across iterations
#' \item \code{logorSE} - the standard error of the log odds ratio across iterations
#' \item \code{logorCILB} - the lower bound of the 95% confidence interval of the log odds ratio
#' across iterations
#' \item \code{logorCIUB} - the upper bound of the 95% confidence interval of the log odds ratio
#' across iterations
#' }
#' }
#'
#' @examples
#' ## Use the pairData dataset which represents a TB-like outbreak
#' # First create a dataset of ordered pairs
#' orderedPair <- pairData[pairData$infectionDiffY >= 0, ]
#'
#' ## Create a variable called snpClose that will define probable links
#' # (<3 SNPs) and nonlinks (>12 SNPs) all pairs with between 2-12 SNPs
#' # will not be used to train.
#' orderedPair$snpClose <- ifelse(orderedPair$snpDist < 3, TRUE,
#' ifelse(orderedPair$snpDist > 12, FALSE, NA))
#' table(orderedPair$snpClose)
#'
#' ## Running the algorithm
#' #NOTE should run with nReps > 1.
#' resGen <- nbProbabilities(orderedPair = orderedPair,
#' indIDVar = "individualID",
#' pairIDVar = "pairID",
#' goldStdVar = "snpClose",
#' covariates = c("Z1", "Z2", "Z3", "Z4", "timeCat"),
#' label = "SNPs", l = 1,
#' n = 10, m = 1, nReps = 1)
#'
#' ## Merging the probabilities back with the pair-level data
#' nbResults <- merge(resGen[[1]], orderedPair, by = "pairID", all = TRUE)
#'
#' @references
#' Barnard J. and Rubin D. Small-Sample Degrees of Freedom with Multiple Imputation
#' \emph{Biometrika}. 1999 Dec;86(4):948-55.
#'
#' @export
nbProbabilities <- function(orderedPair, indIDVar, pairIDVar,
goldStdVar, covariates, label = "",
l = 1, n = 10, m = 1, nReps = 10,
progressBar = TRUE){
orderedPair <- as.data.frame(orderedPair)
#Creating variables with the individual indID variable
indIDVar1 <- paste0(indIDVar, ".1")
indIDVar2 <- paste0(indIDVar, ".2")
#Checking that the named variables are in the data frame
if(!indIDVar1 %in% names(orderedPair)){
stop(paste0(indIDVar1, " is not in the data frame."))
}
if(!indIDVar2 %in% names(orderedPair)){
stop(paste0(indIDVar2, " is not in the data frame."))
}
if(!pairIDVar %in% names(orderedPair)){
stop(paste0(pairIDVar, " is not in the data frame."))
}
if(!goldStdVar %in% names(orderedPair)){
stop(paste0(goldStdVar, " is not in the data frame."))
}
#Checking that the covariates are in the data frame
covarTest <- covariates %in% names(orderedPair)
if(FALSE %in% covarTest){
stop("At least one of the covariates is not in the data frame.")
}
#Checking that all of the covariates are factors
covarDf <- orderedPair[, covariates]
notFactor <- names(covarDf)[!sapply(covarDf, is.factor)]
notFactorC <- paste0(notFactor, collapse = ", ")
if(length(notFactor) == 1){
stop(paste0(notFactorC, " is not a factor"))
}else if(length(notFactor > 1)){
stop(paste0(notFactorC, " are not factors"))
}
#### Setting up data frames ####
#Subsetting to only relevant columns to be more efficient
orderedPair <- orderedPair[, c(indIDVar1, indIDVar2, pairIDVar,
goldStdVar, covariates)]
#Creating an pairID that where order doesn't matter
orderedPair$pairID_uo <- ifelse(orderedPair[, indIDVar1] < orderedPair[, indIDVar2],
paste(orderedPair[, indIDVar1],
orderedPair[, indIDVar2], sep = "_"),
paste(orderedPair[, indIDVar2],
orderedPair[, indIDVar1], sep = "_"))
#Finding all pairs that can be included in the training dataset
#And subsetting into the potential links
posTrain <- orderedPair[!is.na(orderedPair[, goldStdVar]), ]
#### Iterative Estimation Procedure ####
#Initializing data frames to hold results and coefficients
rAll <- data.frame("p" = numeric(), pairIDVar = character(), stringsAsFactors = FALSE)
names(rAll) <- c("p", pairIDVar)
cAll <- data.frame("level" = character(), "odds" = numeric(), stringsAsFactors = FALSE)
if(progressBar == TRUE){
pb <- utils::txtProgressBar(min = 0, max = nReps, style = 3)
}
for (k in 1:nReps){
#Randomly choosing the "true" infector from all possible
#Calculating probabilities using mxn cross prediction
cvResults <- runCV(posTrain, orderedPair, indIDVar, pairIDVar,
goldStdVar, covariates, l, n, m)
rAll <- dplyr::bind_rows(rAll, cvResults$rFolds)
cAll <- dplyr::bind_rows(cAll, cvResults$cFolds)
if(progressBar == TRUE){
utils::setTxtProgressBar(pb, k)
}
}
#### Summarizing Probabilities Over iterations ####
#Averaging the probabilities over all the replicates
sumData1 <- dplyr::group_by(rAll, !!rlang::sym(pairIDVar))
sumData2 <- dplyr::summarize(sumData1,
pAvg = mean(!!rlang::sym("p"), na.rm = TRUE),
pSD = stats::sd(!!rlang::sym("p"), na.rm = TRUE),
nEstimates = sum(!is.na(!!rlang::sym("p"))),
label = dplyr::first(label),
.groups = 'drop')
sumData2 <- dplyr::ungroup(sumData2)
probs <- as.data.frame(dplyr::full_join(sumData2, orderedPair, by = pairIDVar),
stringsAsFactors = FALSE)
#Calculating the total of all probabilities per infectee
totalP <- stats::aggregate(probs$pAvg, by = list(probs[, indIDVar2]), sum, na.rm = TRUE)
names(totalP) <- c(indIDVar2, "pTotal")
#Calculating the scaled probabilities
probs2 <- merge(probs, totalP, by = indIDVar2)
probs2$pScaled <- ifelse(probs2$pTotal != 0, probs2$pAvg / probs2$pTotal, 0)
#Ranking the probabilities for each possible infector
#Ties are set to the minimum rank of that group
probs2 <- probs2[order(probs2[, indIDVar2], -probs2$pScaled), ]
probs2$pRank <- stats::ave(-probs2$pScaled, probs2[, indIDVar2],
FUN = function(x){
rank(x, ties.method = "min")
})
#Only keeping columns of interest
probs2 <- probs2[, c("label", pairIDVar, "pAvg", "pSD", "pScaled", "pRank", "nEstimates")]
#### Summarizing Measures of Effect Over iterations ####
#Averaging over the measures of effect
coeffL <- by(cAll,
INDICES = list(cAll$level),
FUN = function(x){
data.frame("level" = unique(x$level),
"rowOrder" = unique(x$rowOrder),
"logorMean" = mean(x$est, na.rm = TRUE),
"VW" = mean(x$se ^ 2, na.rm = TRUE),
"VB" = stats::var(x$est, na.rm = TRUE),
"nIter" = sum(!is.na(x$est)),
"label" = label, stringsAsFactors = FALSE)
})
coeff <- do.call(dplyr::bind_rows, coeffL)
#Reordering the rows
coeff <- coeff[order(coeff$rowOrder), ]
#Calculating the total variance using Rubin's rules
coeff$VT <- coeff$VW + coeff$VB + coeff$VB / coeff$nIter
coeff$logorSE <- sqrt(coeff$VT)
#coeff$df <- (coeff$nIter - 1) * (1 - coeff$VW / ((1 + 1/coeff$nIter) * coeff$VB)) ^ 2
coeff$logorCILB <- coeff$logorMean - 1.96 * coeff$logorSE
coeff$logorCIUB <- coeff$logorMean + 1.96 * coeff$logorSE
#Only keeping columns of interest
coeff2 <- coeff[, c("label", "level", "nIter", "logorMean", "logorSE",
"logorCILB", "logorCIUB")]
return(list("probabilities" = probs2, "estimates" = coeff2))
}
runCV <- function(posTrain, orderedPair, indIDVar, pairIDVar,
goldStdVar, covariates, l, n, m){
#Creating variables with the individual indID variable
indIDVar1 <- paste0(indIDVar, ".1")
indIDVar2 <- paste0(indIDVar, ".2")
if(length(unique(posTrain$pairID_uo)) != nrow(posTrain)){
#Finding the pairs that are concurrent (both orders included)
concurrentID <- posTrain[duplicated(posTrain$pairID_uo), "pairID_uo"]
concurrent <- posTrain[posTrain$pairID_uo %in% concurrentID, ]
#Randomly selecting order for concurrent pairs
concurrent2L <- linksL <- by(concurrent,
INDICES = list(concurrent$pairID_uo),
FUN = function(x){
x[sample(nrow(x), 1), ]
})
concurrent2 <- do.call(dplyr::bind_rows, concurrent2L)
#Combining choose for concurrent pairs with the rest of the pairs
posTrain2 <- dplyr::bind_rows(posTrain[!posTrain$pairID_uo %in% concurrentID, ],
concurrent2)
}else{
posTrain2 <- posTrain
}
#Subsetting to just the possible links
posLinks <- posTrain2[posTrain2[, goldStdVar] == TRUE, ]
#Choosing the true infector from all possibles (if multiple)
linksL <- by(posLinks,
INDICES = list(posLinks[, indIDVar2]),
FUN = function(x){
x[sample(nrow(x), 1), ]
})
links <- do.call(dplyr::bind_rows, linksL)
links$linked <- TRUE
links <- links[, c(pairIDVar, indIDVar2, "linked")]
#Combining the links with the non-links that do not share an infectee with the links
trainingFull <- dplyr::full_join(orderedPair, links, by = c(pairIDVar, indIDVar2))
trainingFull2 <- trainingFull[trainingFull[, pairIDVar] %in% links[, pairIDVar] |
(trainingFull[, pairIDVar] %in% posTrain2[, pairIDVar] &
!trainingFull[, indIDVar2] %in% links[, indIDVar2]), ]
trainingFull2[is.na(trainingFull2$linked), "linked"] <- FALSE
#Creating the cross-valindIDation folds for that part of the training dataset
cv_splits <- caret::createMultiFolds(trainingFull2$linked, k = n, times = m)
#Initializing data frames to hold results and coefficients
rFolds <- data.frame("p" = numeric(), pairIDVar = character(), stringsAsFactors = FALSE)
names(rFolds) <- c("p", pairIDVar)
cFolds <- data.frame("level" = character(), "est" = numeric(), stringsAsFactors = FALSE)
#Running the methods for all of the CV Folds
for (i in 1:length(cv_splits)){
#Finding training dataset
trainingPairID <- trainingFull2[, pairIDVar][cv_splits[[i]]]
trainingRaw <- trainingFull2[trainingFull2[, pairIDVar] %in% trainingPairID, ]
#Finding the infectee whose infector was found in the training dataset
foundInfector <- trainingRaw[trainingRaw$linked == TRUE, ]
#Finding all pairs that share an infectee with the pairs where the infector was found
shareInfectee <- orderedPair[!orderedPair[, pairIDVar] %in% foundInfector[, pairIDVar] &
orderedPair[, indIDVar2] %in% foundInfector[, indIDVar2], ]
shareInfectee$linked <- FALSE
#Creating the training datasets
#Setting probabilities to 1 for training links and 0 for training non-links that are
#defined as such by the gold standard (not just by sharing an infectee in the above df)
training <- dplyr::bind_rows(trainingRaw, shareInfectee)
training$p <- ifelse(training[, goldStdVar] == FALSE, 0,
ifelse(training[, "linked"] == TRUE, 1, NA))
#Creating the prediction dataset
prediction <- dplyr::full_join(orderedPair, links, by = c(pairIDVar, indIDVar2))
prediction <- prediction[!prediction[, pairIDVar] %in% training[, pairIDVar], ]
prediction[is.na(prediction$linked), "linked"] <- FALSE
#Calculating probabilities for one split
sim <- performNB(training, prediction, obsIDVar = pairIDVar,
goldStdVar = "linked", covariates, l)
#Combining the results from fold iteration with the previous folds
rFolds <- dplyr::bind_rows(rFolds, sim[[1]])
sim[[2]]$rowOrder <- 1:nrow(sim[[2]])
cFolds <- dplyr::bind_rows(cFolds, sim[[2]])
}
return(list("rFolds" = rFolds, "cFolds" = cFolds))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.