#' @export
#' @title Call markers based on an existing map
#' @description This function uses an existing genetic map to call genetic markers, including markers polymorphic on multiple chromosomes.
#' @param rawData Raw data for a genetic marker.
#' @param thresholdChromosomes The test-statistic threshold for declaring a marker to be polymorphic on a chromosome.
#' @param thresholdAlleleClusters The p-value threshold for declaring two underlying founder alleles to have different marker alleles. Multiple possible values should be input.
#' @param maxChromosomes The maximum number of chromosomes that a marker can be polymorphic on
#' @param existingImputations An object of class mpcrossMapped from the mpMap2 package, containing data about imputed underlying genotypes.
#' @param tDistributionPValue Paramater controlling the size of each detected cluster, ranging from 0 to 1. Small values result in small clusters, and large values result in large clusters.
#' @param useOnlyExtraImputationPoints Should we only use the non-marker positions to identify the correct locations?
#' @param ... Extra arguments. Only \code{existingLocalisationStatistics} is supported, mostly so the example can run quickly.
#' @return At the minimum, a list containing an entry \code{called} indicating whether the marker could be successfully called. If it could, other entries are returned.
#' \describe{
#' \item{overallAssignment}{Defines clusters within the data. }
#' \item{classificationsPerPosition}{Defines genotype calls per genetic location to which the marker was mapped. }
#' \item{clusterBoundaries}{Contours giving the boundaries of each cluster in \code{overallAssignment}.}
#' \item{preliminaryGroups}{The preliminary groups based on IBD imputations, which the final genotype calls are built from.}
#' \item{pValuesMatrices}{The matrices of p-values used to form a graph, and therefore identify founder alleles.}
#' }
#' @details
#' This function uses an existing genetic map to call a genetic marker. There are a number of advantages to this approach
#' \describe{
#' \item{1.}{It can correctly call markers which are polymorphic on multiple chromosomes, therefore converting one marker into two.}
#' \item{2.}{It avoids incorrectly calling markers polymorphic on multiple chromosomes. Incorrect calling can lead to supurious genetic interactions.}
#' \item{3.}{It can call markers that initially appear to be monomorphic in the population. }
#' \item{4.}{It can call additional marker alleles for markers that would otherwise be ignored. }
#' }
#'
#' Once a genetic map has been constructed, it should be used to impute underlying founder genotypes at an equally spaced grid of points using function \code{\link[mpMap2]{imputeFounders}}.
#' The steps in the algorithm are as follows:
#' \describe{
#' \item{1.}{Determine which chromosomes the marker is associated to, and where on those chromosomes. This is determined using function \code{\link[mpMap2]{addExtraMarkerFromRawCall}}, which is itself based on a manova model.
#' The marker is assumed associated to chromosomes for which the test statistic is greater than \code{thresholdChromosomes}.
#' An appropriate value for \code{thresholdChromosomes} can be determined by looking at the results of \code{\link[mpMap2]{addExtraMarkerFromRawCall}}, for a number of different markers.}
#' \item{2.}{Determine the distribution of marker alleles, at all the associated genetic locations.
#' This is done by taking the founders to be the vertices of a graph, and connecting founders which seem to part of the same marker allele.
#' The resulting graph should be a union of disjoint complete graphs (cliques). }
#' \item{3.}{We now have a preliminary assignment of marker alleles to lines, where the assignment may be of 1, 2, 3 or more \emph{different} marker alleles, depending on how many chromosomes the marker is associated with.
#' For example, if the marker is associated with two chromosomes, then there will be two marker alleles for each line.
#' For each unique combination of marker alleles, we take the lines which have that assignment of marker alleles, and fit a skew-t distribution. }
#' \item{4.}{For each fitted distribution, determine a confidence region using p-value \code{tDistributionPValue}. }
#' \item{5.}{Use these confidence regions to construct marker calls at each associated location.}
#' }
#' @examples
#' data(eightParentSubsetMap)
#' data(wsnp_Ku_rep_c103074_89904851)
#' data(callFromMapExampleLocalisationStatistics)
#' library(ggplot2)
#' library(gridExtra)
#' #We use an existing set of localisation statistics, to make the example faster
#' called <- callFromMap(rawData = as.matrix(wsnp_Ku_rep_c103074_89904851), existingImputations =
#' eightParentSubsetMap, useOnlyExtraImputationPoints = TRUE, tDistributionPValue = 0.8,
#' thresholdChromosomes = 80, existingLocalisationStatistics = existingLocalisationStatistics)
#' plotData <- wsnp_Ku_rep_c103074_89904851
#' plotData$genotype1B <- factor(called$classificationsPerPosition$Chr1BLoc31$finals)
#' plotData$imputed1B <- factor(imputationData(eightParentSubsetMap)[, "Chr1BLoc31"])
#' plotData$genotype1D <- factor(called$classificationsPerPosition$Chr1DLoc16$finals)
#' plotData$imputed1D <- factor(imputationData(eightParentSubsetMap)[, "Chr1DLoc16"])
#'
#' plotImputations1B <- ggplot(plotData, mapping = aes(x = theta, y = r, color = imputed1B)) +
#' geom_point() + theme_bw() + ggtitle("Imputed genotype, 1B") +
#' guides(color=guide_legend(title="IBD genotype"))
#'
#' called1B <- ggplot(plotData, mapping = aes(x = theta, y = r, color = genotype1B)) +
#' geom_point() + theme_bw() + ggtitle("Called genotype, 1B") +
#' guides(color=guide_legend(title="Called cluster")) + scale_color_manual(values =
#' c("black", RColorBrewer::brewer.pal(n = 4, name = "Set1")))
#'
#' plotImputations1D <- ggplot(plotData, mapping = aes(x = theta, y = r, color = imputed1D)) +
#' geom_point() + theme_bw() + ggtitle("Imputed genotype, 1D") +
#' guides(color=guide_legend(title="IBD genotype"))
#'
#' called1D <- ggplot(plotData, mapping = aes(x = theta, y = r, color = genotype1D)) +
#' geom_point() + theme_bw() + ggtitle("Called genotype, 1D") +
#' guides(color=guide_legend(title="Called cluster")) +
#' scale_color_manual(values = c("black",RColorBrewer::brewer.pal(n=3,name = "Set1")[1:2]))
#'
#' \donttest{grid.arrange(plotImputations1B, plotImputations1D, called1B, called1D)}
callFromMap <- function(rawData, thresholdChromosomes = 100, thresholdAlleleClusters = c(1e-10, 1e-20, 1e-30, 1e-40), maxChromosomes = 2, existingImputations, tDistributionPValue = 0.6, useOnlyExtraImputationPoints = TRUE, ...)
{
args <- list(...)
if("existingLocalisationStatistics" %in% names(args))
{
rawResult <- args$existingLocalisationStatistics
}
else
{
rawResult <- mpMap2::addExtraMarkerFromRawCall(mpcrossMapped = existingImputations, newMarker = rawData, useOnlyExtraImputationPoints = useOnlyExtraImputationPoints)
}
chromosomes <- names(existingImputations@map)
chromosomeScores <- sapply(chromosomes, function(x) max(rawResult@data[names(rawResult@map[[x]])]))
chromosomeScores <- sort(chromosomeScores, decreasing=TRUE)
if(sum(chromosomeScores > thresholdChromosomes) > maxChromosomes)
{
result <- list(called = FALSE)
attr(result, "failures") <- "Detected on too many chromosomes"
return(result)
}
if(sum(chromosomeScores > thresholdChromosomes) == 0)
{
result <- list(called = FALSE)
attr(result, "failures") <- "Detected on zero chromosomes"
return(result)
}
bestChromosomes <- names(chromosomeScores[chromosomeScores > thresholdChromosomes])
bestPositionsChromosomes <- sapply(bestChromosomes, function(x) names(which.max(rawResult@data[names(rawResult@map[[x]])])))
thresholdAlleleClusters <- sort(thresholdAlleleClusters, decreasing = FALSE)
internalResults <- vector(mode = "character", length = length(thresholdAlleleClusters))
for(index in 1:length(thresholdAlleleClusters))
{
thresholdAlleleCluster <- thresholdAlleleClusters[index]
result <- callFromMapInternal(bestPositionsChromosomes = bestPositionsChromosomes, rawData = rawData, thresholdAlleleCluster = thresholdAlleleCluster, existingImputations = existingImputations, tDistributionPValue = tDistributionPValue)
if(!is.null(result) && !is.character(result)) return(result)
if(is.character(result)) internalResults[index] <- result
}
result <- list(called = FALSE)
attr(result, "failures") <- internalResults
return(result)
}
callFromMapInternal <- function(bestPositionsChromosomes, rawData, thresholdAlleleCluster, existingImputations, tDistributionPValue)
{
nFounders <- nFounders(existingImputations)
dataPerPosition <- clusters <- list()
pValuesMatrices <- list()
for(position in bestPositionsChromosomes)
{
dataPerPosition[[position]] <- factor(existingImputations@geneticData[[1]]@imputed@data[,position])
results <- matrix(0, nrow = nFounders, ncol = nFounders)
for(i in 1:(nFounders-1))
{
for(j in (i+1):nFounders)
{
contrastRow <- rep(0, nFounders)
contrastRow[i] <- 1
contrastRow[j] <- -1
currentModelData <- dataPerPosition[[position]]
contrasts(currentModelData) <- cbind(condInterest = contrastRow)
model <- lm(rawData ~ currentModelData)
testResult <- car::linearHypothesis(model, "currentModelDatacondInterest")
#Taken from print.linearHypothesis.mlm
SSPE.qr <- qr(testResult$SSPE)
eigs <- Re(eigen(qr.coef(SSPE.qr, testResult$SSPH), symmetric = FALSE)$values)
pillai <- Pillai(eigs, testResult$df, testResult$df.residual)
ok <- (pillai[2] >= 0) & (pillai[3] > 0) & (pillai[4] > 0)
ok <- !is.na(ok) & ok
if(!ok)
{
return("Problem computing Pillai-Bartlett test statistic")
}
pvalue <- pf(pillai[2], pillai[3], pillai[4], lower.tail = FALSE)
results[j, i] <- results[i, j] <- pvalue
}
}
pValuesMatrices[[position]] <- results
adjacencyMatrix <- matrix(0, nrow = nFounders, ncol = nFounders)
adjacencyMatrix[results > thresholdAlleleCluster] <- 1
diag(adjacencyMatrix) <- 1
maxCliques <- igraph::max_cliques(igraph::graph_from_adjacency_matrix(adjacencyMatrix))
#The max cliques should actually partition the graph, so check that
maxCliqueVector <- do.call(c, maxCliques)
if(length(maxCliqueVector) != nFounders || length(unique(maxCliqueVector)) != nFounders) return("Overlapping cliques")
#If marker is monomorphic, return NULL
if(length(maxCliques) == 1) return("monoallelic")
clusters[[position]] <- maxCliques
}
classifyPosition <- function(position)
{
mapping <- vector(mode = "integer", length = nFounders)
for(clusterCounter in 1:length(clusters[[position]])) mapping[clusters[[position]][[clusterCounter]]] <- clusterCounter
return(mapping[as.integer(dataPerPosition[[position]])])
}
groups <- sapply(bestPositionsChromosomes, classifyPosition, simplify = FALSE)
combinedGroups <- do.call(interaction, groups)
nCombinedGroups <- length(levels(combinedGroups))
levels(combinedGroups) <- as.character(1:nCombinedGroups)
overallClusterAssignments <- vector(mode = "integer", length = nrow(rawData))
names(overallClusterAssignments) <- rownames(rawData)
overallClusterAssignments[] <- NA
clusterAssignmentsPerPosition <- list()
insideNumberOfGroups <- vector(mode = "integer", length = nrow(rawData))
clusterBoundaries <- list()
for(group in 1:nCombinedGroups)
{
groupData <- rawData[combinedGroups == group,]
groupData <- as.data.frame(groupData)
colnames(groupData) <- c("x", "y")
done <- FALSE
try(
{
setTimeLimit(120, 120, transient = TRUE)
model <- selm(cbind(x, y) ~ 1, family = "ST", data = groupData, fixed.param = list(alpha = 0))
distribution <- extractSECdistr(model, compNames = c("x", "y"))
plotRange <- cbind(range(groupData[,1]), range(groupData[, 2]))
pdf(NULL)
plotResults <- plot(distribution, col = 2, probs = tDistributionPValue, landmarks = "", range = plotRange)
dev.off()
done <- TRUE
}, silent = TRUE
)
if(!done)
{
try(
{
setTimeLimit(120, 120, transient = TRUE)
model <- selm(cbind(x, y) ~ 1, family = "ST", data = groupData)
distribution <- extractSECdistr(model, compNames = c("x", "y"))
plotRange <- cbind(range(groupData[,1]), range(groupData[, 2]))
pdf(NULL)
plotResults <- plot(distribution, col = 2, probs = tDistributionPValue, landmarks = "", range = plotRange)
dev.off()
done <- TRUE
}, silent = TRUE
)
}
if(!done)
{
setTimeLimit()
return("Error fitting t distribution")
}
clusterBoundaries[[group]] <- plotResults$plot$contourLines[[1]]
data <- cbind(plotResults$plot$contourLines[[1]]$x, plotResults$plot$contourLines[[1]]$y)
contourValue <- mean(sn::dmst(data, dp = plotResults$object@dp))
isInside <- sn::dmst(rawData, dp = plotResults$object@dp) > contourValue
#Not *exactly* sure why this occassionally comes out as NA. Numerical overflow for the occasional point?
isInside[is.na(isInside)] <- FALSE
overallClusterAssignments[isInside] <- group
insideNumberOfGroups[isInside] <- insideNumberOfGroups[isInside] + 1
}
#If it's in multiple groups, then mark the cluster assignment as NA.
overallClusterAssignments[insideNumberOfGroups > 1] <- NA
overallClusterAssignments <- factor(overallClusterAssignments, levels = 1:nCombinedGroups)
classificationsPerPosition <- list()
for(position in bestPositionsChromosomes)
{
mapping <- vector(mode = "integer", length = nFounders)
for(clusterCounter in 1:length(clusters[[position]])) mapping[clusters[[position]][[clusterCounter]]] <- clusterCounter
currentPositionClassificationTable <- mapping[dataPerPosition[[position]]]
conversionTable <- table(overallClusterAssignments, currentPositionClassificationTable)
#Switch how we assign clusters to marker alleles. Potentially the commented out version is better?
for(k in 1:ncol(conversionTable)) conversionTable[,k] <- conversionTable[,k] / sum(conversionTable[,k])
result <- vector(mode = "integer", length = nrow(rawData))
for(group in 1:nCombinedGroups)
{
result[overallClusterAssignments == group] <- which.max(conversionTable[group,])
}
names(result) <- rownames(rawData)
classificationsPerPosition[[position]] <- list(finals = result, founders = mapping)
}
setTimeLimit()
return(list(overallAssignment = as.integer(overallClusterAssignments), classificationsPerPosition = classificationsPerPosition, pValuesMatrices = pValuesMatrices, preliminaryGroups = combinedGroups, clusterBoundaries = clusterBoundaries, called = TRUE))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.