UniformClusters: Uniform Clusters

UniformClusters    R Documentation

Uniform Clusters

Description

This group of functions takes a COTAN object as input and handles the task of dividing the dataset into Uniform Clusters, that is, clusters with homogeneous gene expression. This condition is checked by calculating the GDI of the cluster and verifying that no more than a small fraction of the genes have a GDI level above the given GDIThreshold.

Usage

GDIPlot(
  objCOTAN,
  genes,
  condition = "",
  statType = "S",
  GDIThreshold = 1.43,
  GDIIn = NULL
)

cellsUniformClustering(
  objCOTAN,
  GDIThreshold = 1.43,
  ratioAboveThreshold = 0.01,
  cores = 1L,
  maxIterations = 25L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  initialClusters = NULL,
  initialResolution = 0.8,
  useDEA = TRUE,
  distance = NULL,
  hclustMethod = "ward.D2",
  saveObj = TRUE,
  outDir = "."
)

isClusterUniform(
  GDIThreshold,
  ratioAboveThreshold,
  ratioQuantile,
  fractionAbove,
  usedGDIThreshold,
  usedRatioAbove
)

checkClusterUniformity(
  objCOTAN,
  clusterName,
  cells,
  GDIThreshold = 1.43,
  ratioAboveThreshold = 0.01,
  cores = 1L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  saveObj = TRUE,
  outDir = "."
)

mergeUniformCellsClusters(
  objCOTAN,
  clusters = NULL,
  GDIThreshold = 1.43,
  ratioAboveThreshold = 0.01,
  batchSize = 0L,
  allCheckResults = data.frame(),
  cores = 1L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  useDEA = TRUE,
  distance = NULL,
  hclustMethod = "ward.D2",
  saveObj = TRUE,
  outDir = "."
)

Arguments

objCOTAN

a COTAN object

genes

a named list of genes to label. Each array will have a different color.

condition

a string corresponding to the condition/sample (it is used only for the title).

statType

type of statistic to be used. Default is "S": Pearson's chi-squared test statistics. "G" is G-test statistics

GDIThreshold

the threshold level that discriminates uniform clusters. It defaults to 1.43

GDIIn

when the GDI data frame was already calculated, it can be put here to speed up the process (default is NULL)

ratioAboveThreshold

the fraction of genes allowed to be above the GDIThreshold. It defaults to 1%

cores

number of cores to use. Default is 1.

maxIterations

max number of re-clustering iterations. It defaults to 25

optimizeForSpeed

Boolean; when TRUE, COTAN tries to use the torch library to run the matrix calculations. Otherwise, or when the library is not available, it will run the slower legacy code

deviceStr

When the torch library is used, this enforces which device runs the calculations. Possible values are "cpu" to use the system CPU, "cuda" to use the system GPUs, or something like "cuda:0" to restrict to a specific device

initialClusters

an existing clusterization to use as starting point: the clusters deemed uniform will be kept and the rest processed as normal

initialResolution

a number indicating how refined the clusters are before checking for uniformity. It defaults to 0.8, the same as Seurat::FindClusters()

useDEA

Boolean indicating whether to use the DEA to define the distance; alternatively it will use the average Zero-One counts, which is faster but less precise.

distance

type of distance to use. Default is "cosine" for DEA and "euclidean" for Zero-One. Can be chosen among those supported by parallelDist::parDist()

hclustMethod

It defaults to "ward.D2" but can be any of the methods defined by the stats::hclust() function.

saveObj

Boolean flag; when TRUE saves intermediate analyses and plots to file

outDir

an existing directory for the analysis output. The effective output will be placed in a sub-folder.

ratioQuantile

the GDI quantile corresponding to the usedRatioAbove

fractionAbove

the fraction of genes above the usedGDIThreshold

usedGDIThreshold

the threshold level actually used to calculate the fourth argument (fractionAbove)

usedRatioAbove

the fraction of genes actually used to calculate the third argument (ratioQuantile)

clusterName

the tag of the cluster

cells

the cells belonging to the cluster

clusters

The clusterization to merge. If not given the last available clusterization will be used, as it is probably the most significant!

batchSize

Number of pairs to test in a single round. If none of them succeeds, the merge stops. Defaults to 2 (#cl)^(2/3)

allCheckResults

An optional data.frame with the results of previous checks about the merging of clusters. Useful to restart the merging process after an interruption.
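As a rough illustration of the default given for batchSize, the sketch below evaluates 2 (#cl)^(2/3) for a few cluster counts. It assumes that #cl denotes the number of clusters in the clusterization. The helper name sketchDefaultBatchSize is hypothetical, and rounding to the nearest integer is an assumption of this sketch; COTAN may round differently.

```r
# Hypothetical helper: evaluates 2 * (#cl)^(2/3), taking #cl to be the
# number of clusters (an assumption made for this sketch)
sketchDefaultBatchSize <- function(numClusters) {
  # rounding to the nearest integer is an assumption, not documented behavior
  as.integer(round(2 * numClusters^(2 / 3)))
}

numClusters <- c(8L, 27L, 64L)
data.frame(
  numClusters = numClusters,
  # number of distinct cluster pairs that could in principle be tested
  allPairs = choose(numClusters, 2L),
  batchSize = sketchDefaultBatchSize(numClusters)
)
```

Under these assumptions the batch grows more slowly than the number of possible pairs, so each round would test only a subset of the candidate pairs.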

Details

GDIPlot() directly evaluates and plots the GDI for a sample.

cellsUniformClustering() finds a Uniform clusterization by means of the GDI. Once a preliminary clusterization is obtained from the Seurat-package methods, each cluster is checked for uniformity via the function checkClusterUniformity(). Once all clusters are checked, all cells from the non-uniform clusters are pooled together for another iteration of the entire process, until all clusters are deemed uniform. If only a few cells are left out (≤ 50), those are flagged as "-1" and the process is stopped.

isClusterUniform() takes the current thresholds and uses them to check whether the calculated cluster parameters are sufficient to determine whether the cluster is uniform and, if so, returns the corresponding answer.
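The kind of reasoning described for isClusterUniform() can be sketched in base R as follows. This is only an illustration and not COTAN's implementation: the function name sketchIsUniform is hypothetical, and the decision rules are an assumption based on the fact that the fraction of genes above a threshold can only shrink as the threshold grows, and that a quantile taken at a larger allowed ratio can only be lower.

```r
# Hypothetical sketch, not the package code: decide uniformity under the
# current thresholds from statistics computed with possibly different ones.
sketchIsUniform <- function(GDIThreshold, ratioAboveThreshold,
                            ratioQuantile, fractionAbove,
                            usedGDIThreshold, usedRatioAbove) {
  # Evidence from fractionAbove (fraction of genes above usedGDIThreshold):
  # a threshold at least as high cannot have more genes above it
  if (GDIThreshold >= usedGDIThreshold &&
      fractionAbove <= ratioAboveThreshold) {
    return(TRUE)
  }
  # a threshold at most as high cannot have fewer genes above it
  if (GDIThreshold <= usedGDIThreshold &&
      fractionAbove > ratioAboveThreshold) {
    return(FALSE)
  }
  # Evidence from ratioQuantile (GDI quantile matching usedRatioAbove):
  # allowing a larger ratio can only lower the relevant quantile
  if (ratioAboveThreshold >= usedRatioAbove &&
      ratioQuantile <= GDIThreshold) {
    return(TRUE)
  }
  # allowing a smaller ratio can only raise the relevant quantile
  if (ratioAboveThreshold <= usedRatioAbove &&
      ratioQuantile > GDIThreshold) {
    return(FALSE)
  }
  NA  # the stored statistics are not enough to decide
}

# A check done at a stricter threshold (1.40) settles a looser one (1.43)
sketchIsUniform(1.43, 0.01, ratioQuantile = 1.39, fractionAbove = 0.008,
                usedGDIThreshold = 1.40, usedRatioAbove = 0.01)

# Statistics taken at looser settings cannot settle a stricter check
sketchIsUniform(1.43, 0.01, ratioQuantile = 1.42, fractionAbove = 0.005,
                usedGDIThreshold = 1.46, usedRatioAbove = 0.02)
```

In this sketch the first call yields TRUE and the second yields NA, which mirrors the documented behavior of returning NA when the answer cannot be decided from the given information.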

checkClusterUniformity() takes a COTAN object and a cluster of cells and checks whether the latter is uniform by GDI. The function runs COTAN to check whether the GDI is lower than the given GDIThreshold (1.43) for all but at most ratioAboveThreshold (1%) of the genes. If the GDI turns out to be too high for too many genes, the cluster is deemed non-uniform.
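The uniformity criterion itself can be illustrated on a plain numeric vector of GDI values. The sketch below uses made-up numbers rather than output from COTAN, and the variable names are chosen only to echo the elements returned by checkClusterUniformity(); how the package computes the quantile internally is not stated here, so stats::quantile() with its default settings is an assumption.

```r
# Made-up GDI values for 1000 genes: 995 low ones and 5 high ones
gdi <- c(rep(1.30, 995L), rep(1.50, 5L))

GDIThreshold <- 1.43
ratioAboveThreshold <- 0.01

# fraction of genes whose GDI exceeds the threshold
fractionAbove <- mean(gdi > GDIThreshold)

# GDI quantile that leaves ratioAboveThreshold of the genes above it
# (default quantile type is an assumption of this sketch)
ratioQuantile <- quantile(gdi, probs = 1 - ratioAboveThreshold,
                          names = FALSE)

# uniform when no more than the allowed fraction of genes is above threshold
isUniform <- fractionAbove <= ratioAboveThreshold

list(isUniform = isUniform,
     fractionAbove = fractionAbove,
     ratioQuantile = ratioQuantile)
```

Here only 0.5% of the genes exceed the threshold, so this toy cluster would pass the 1% criterion.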

mergeUniformCellsClusters() takes a uniform clusterization and iteratively checks whether merging two nearby clusters would still form a uniform cluster. Multiple thresholds will be used, from 1.37 up to the given one, in order to prioritize merging the best fitting pairs.

This function uses the cosine distance to establish the nearest cluster pairs. It will use the checkClusterUniformity() function to check whether the merged clusters are uniform. The function stops once none of the pairs tested in a single batch can be merged.

Value

GDIPlot() returns a ggplot2 object with a point for each gene, where the ordinate is the GDI level and the abscissa is the average gene expression (log scaled). Also marked are the given threshold (in red) and the 50% and 75% quantiles (in blue).

cellsUniformClustering() returns a list with 2 elements:

  • "clusters" the newly found cluster labels array

  • "coex" the associated COEX data.frame

isClusterUniform() returns a single Boolean value when it is possible to decide the answer with the given information and NA otherwise

checkClusterUniformity() returns a list with:

  • "isUniform": a flag indicating whether the cluster is uniform

  • "fractionAbove": the fraction of genes with GDI above the threshold

  • "ratioQuantile": the GDI quantile associated with the given ratio

  • "size": the number of cells in the cluster

  • "GDIThreshold": the used GDI threshold

  • "ratioAboveThreshold": the used fraction of genes above threshold allowed in uniform clusters

mergeUniformCellsClusters() returns a list with:

  • "clusters" the merged cluster labels array

  • "coex" the associated COEX data.frame

Examples

data("test.dataset")

objCOTAN <- automaticCOTANObjectCreation(raw = test.dataset,
                                         GEO = "S",
                                         sequencingMethod = "10X",
                                         sampleCondition = "Test",
                                         cores = 6L,
                                         saveObj = FALSE)

groupMarkers <- list(G1 = c("g-000010", "g-000020", "g-000030"),
                     G2 = c("g-000300", "g-000330"),
                     G3 = c("g-000510", "g-000530", "g-000550",
                            "g-000570", "g-000590"))
gdiPlot <- GDIPlot(objCOTAN, genes = groupMarkers, condition = "test")
plot(gdiPlot)

## Here we override the default GDI threshold as a way to speed up
## calculations, as a higher threshold implies less stringent uniformity.
## In real applications it might be appropriate to change the threshold
## in cases of relatively low gene/cell numbers, or in cases when a
## rough clusterization is needed in the early stages of the analysis
##

splitList <- cellsUniformClustering(objCOTAN, cores = 6L,
                                    optimizeForSpeed = TRUE,
                                    deviceStr = "cuda",
                                    initialResolution = 0.8,
                                    GDIThreshold = 1.46, saveObj = FALSE)

clusters <- splitList[["clusters"]]

firstCluster <- getCells(objCOTAN)[clusters %in% clusters[[1L]]]
firstClusterIsUniform <-
  checkClusterUniformity(objCOTAN, GDIThreshold = 1.46,
                         ratioAboveThreshold = 0.01,
                         clusterName = clusters[[1L]], cells = firstCluster,
                         cores = 6L, optimizeForSpeed = TRUE,
                         deviceStr = "cuda", saveObj = FALSE)[["isUniform"]]

objCOTAN <- addClusterization(objCOTAN,
                              clName = "split",
                              clusters = clusters)

objCOTAN <- addClusterizationCoex(objCOTAN,
                                  clName = "split",
                                  coexDF = splitList[["coex"]])

identical(reorderClusterization(objCOTAN)[["clusters"]], clusters)

mergedList <- mergeUniformCellsClusters(objCOTAN,
                                        GDIThreshold = 1.43,
                                        ratioAboveThreshold = 0.02,
                                        batchSize = 2L,
                                        clusters = clusters,
                                        cores = 6L,
                                        optimizeForSpeed = TRUE,
                                        deviceStr = "cpu",
                                        distance = "cosine",
                                        hclustMethod = "ward.D2",
                                        saveObj = FALSE)

objCOTAN <- addClusterization(objCOTAN,
                              clName = "merged",
                              clusters = mergedList[["clusters"]],
                              coexDF = mergedList[["coex"]])

identical(reorderClusterization(objCOTAN), mergedList)


seriph78/COTAN documentation built on July 2, 2024, 9:27 a.m.