UniformClusters: Uniform Clusters
In seriph78/COTAN: COexpression Tables ANalysis

Description

This group of functions takes a COTAN object as input and handles the task of dividing the dataset into Uniform Clusters, that is, clusters with homogeneous gene expression. This condition is checked by calculating the GDI of the cluster and verifying that no more than a small fraction of the genes have a GDI level above the given GDIThreshold.
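The uniformity condition described above can be illustrated with a minimal base-R sketch. The helper `isUniformByGDI()` and the `maxRatioBeyond` argument (set here to 1%) are hypothetical names and values chosen for illustration; they are not part of the documented COTAN interface, and the GDI values are made up.

```r
# Hypothetical helper, NOT a COTAN function: a base-R sketch of the rule
# "no more than a small fraction of genes have GDI above the threshold".
# The 1% value for maxRatioBeyond is an assumption made for this example.
isUniformByGDI <- function(gdi, GDIThreshold = 1.43, maxRatioBeyond = 0.01) {
  stopifnot(is.numeric(gdi), length(gdi) > 0L)
  # fraction of genes whose GDI exceeds the threshold
  ratioBeyond <- mean(gdi > GDIThreshold)
  ratioBeyond <= maxRatioBeyond
}

# Toy GDI values (made up for illustration)
gdiUniform    <- c(rep(1.2, 995L), rep(1.6, 5L))    # 0.5% of genes above 1.43
gdiNonUniform <- c(rep(1.2, 900L), rep(1.6, 100L))  # 10% of genes above 1.43

isUniformByGDI(gdiUniform)
isUniformByGDI(gdiNonUniform)
```

In the package itself the rule is encoded by the checker objects; see UniformTranscriptCheckers for the actual criteria.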

Usage

GDIPlot(
  objCOTAN,
  genes,
  condition = "",
  statType = "S",
  GDIThreshold = 1.43,
  GDIIn = NULL
)

genesSelector(objCOTAN, genesSel, numGenes = 2000L)

cellsUniformClustering(
  objCOTAN,
  checker = NULL,
  GDIThreshold = NaN,
  initialResolution = 0.8,
  maxIterations = 25L,
  cores = 1L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  useDEA = TRUE,
  distance = NULL,
  genesSel = "HVG_Seurat",
  hclustMethod = "ward.D2",
  initialClusters = NULL,
  initialIteration = 1L,
  saveObj = TRUE,
  outDir = "."
)

checkClusterUniformity(
  objCOTAN,
  clusterName,
  cells,
  checker,
  cores = 1L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  saveObj = TRUE,
  outDir = "."
)

mergeUniformCellsClusters(
  objCOTAN,
  clusters = NULL,
  checkers = NULL,
  GDIThreshold = NaN,
  batchSize = 0L,
  cores = 1L,
  optimizeForSpeed = TRUE,
  deviceStr = "cuda",
  useDEA = TRUE,
  distance = NULL,
  hclustMethod = "ward.D2",
  allCheckResults = data.frame(),
  initialIteration = 1L,
  saveObj = TRUE,
  outDir = "."
)

Arguments

`objCOTAN`	a `COTAN` object
`genes`	a named `list` of genes to label. Each array will have a different color.
`condition`	a string corresponding to the condition/sample (it is used only for the title).
`statType`	type of statistic to be used. Default is "S": Pearson's chi-squared test statistic. "G" is the G-test statistic
`GDIThreshold`	legacy. The threshold level that is used in a SimpleGDIUniformityCheck. It defaults to `1.43`
`GDIIn`	when the `GDI` data frame was already calculated, it can be put here to speed up the process (default is `NULL`)
`genesSel`	Decides whether and how to perform the gene selection used for the clustering. It is a string indicating one of the following selection methods: `"HGDI"` picks the genes with the highest GDI; `"HVG_Seurat"` picks the genes with the highest variability via the Seurat package (the default method); `"HVG_Scanpy"` picks the genes with the highest variability according to the `Scanpy` package (using the Seurat implementation)
`numGenes`	The number of genes to return
`checker`	the object that defines the method and the threshold to discriminate whether a cluster is uniform transcript. See UniformTranscriptCheckers for more details
`initialResolution`	a number indicating how refined the clusters are before checking for uniformity. It defaults to `0.8`, the same as `Seurat::FindClusters()`
`maxIterations`	max number of re-clustering iterations. It defaults to `25`
`cores`	number of cores to use. Default is 1.
`optimizeForSpeed`	Boolean; when `TRUE` `COTAN` tries to use the `torch` library to run the matrix calculations. Otherwise, or when the library is not available, it will run the slower legacy code
`deviceStr`	When the `torch` library is used, enforces which device runs the calculations. Possible values are `"cpu"` to use the system CPU, `"cuda"` to use the system GPUs, or something like `"cuda:0"` to restrict to a specific device
`useDEA`	Boolean indicating whether to use the DEA to define the distance; alternatively it will use the average Zero-One counts, which is faster but less precise.
`distance`	type of distance to use. Default is `"cosine"` for DEA and `"euclidean"` for Zero-One. Can be chosen among those supported by `parallelDist::parDist()`
`hclustMethod`	It defaults to `"ward.D2"` but can be any of the methods defined by the `stats::hclust()` function.
`initialClusters`	an existing clusterization to use as starting point: the clusters deemed uniform will be kept and the remaining cells will be processed as normal
`initialIteration`	the number associated to the first iteration; it defaults to 1. Useful when restarting the procedure, to avoid overwriting intermediate data
`saveObj`	Boolean flag; when `TRUE` saves intermediate analyses and plots to file
`outDir`	an existing directory for the analysis output. The effective output will be placed in a sub-folder.
`clusterName`	the tag of the cluster
`cells`	the cells belonging to the cluster
`clusters`	The clusterization to merge. If not given the last available clusterization will be used, as it is probably the most significant!
`checkers`	a `list` of objects that defines the method and the increasing thresholds to discriminate whether to merge two clusters if deemed uniform transcript. See UniformTranscriptCheckers for more details
`batchSize`	Number of pairs to test in a single round. If none of them succeeds the merge stops. Defaults to `2 * (#cl)^(2/3)`, where `#cl` is the number of clusters
`allCheckResults`	An optional `data.frame` with the results of previous checks about the merging of clusters. Useful to restart the merging process after an interruption.
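To give a feel for the default `batchSize` formula, two times the number of clusters raised to the power 2/3, the base-R sketch below compares it with the total number of cluster pairs. The helper name `defaultBatchSize()` and the rounding to the nearest integer are assumptions made for this illustration; the package may round differently.

```r
# Hypothetical helper, NOT a COTAN function: evaluates 2 * (#cl)^(2/3).
# Rounding to the nearest integer is an assumption of this sketch.
defaultBatchSize <- function(numClusters) {
  stopifnot(is.numeric(numClusters), all(numClusters >= 2))
  as.integer(round(2 * numClusters^(2 / 3)))
}

numClusters <- c(8L, 27L, 64L)

batchSizes <- data.frame(
  numClusters = numClusters,
  batchSize   = defaultBatchSize(numClusters),
  # total number of distinct cluster pairs, for comparison
  allPairs    = choose(numClusters, 2L)
)

print(batchSizes)
```

The batch grows much more slowly than the number of possible pairs, so only the nearest pairs are tested in each round.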

Details

GDIPlot() directly evaluates and plots the GDI for a sample.

genesSelector() selects the most representative genes of the dataset.

cellsUniformClustering() finds a uniform clusterization by means of the GDI. Once a preliminary clusterization is obtained from the Seurat-package methods, each cluster is checked for uniformity via the function checkClusterUniformity(). Once all clusters are checked, all cells from the non-uniform clusters are pooled together for another iteration of the entire process, until all clusters are deemed uniform. If only a few cells are left out (at most 50), those are flagged as "-1" and the process is stopped.
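The control flow just described (cluster, check, pool the non-uniform cells, repeat) can be sketched in base R as follows. Everything here is hypothetical and only mirrors the description: `iterativeUniformSplit()`, `toyCluster()`, `toyIsUniform()` and the label scheme are illustrative stand-ins, not the package implementation, which relies on Seurat for the clustering and on checkClusterUniformity() for the check.

```r
# Hypothetical sketch of the iterative procedure; NOT the COTAN implementation.
# `clusterFun` and `isUniformFun` are stand-ins supplied by the caller.
iterativeUniformSplit <- function(cells, clusterFun, isUniformFun,
                                  maxIterations = 25L, maxLeftOut = 50L) {
  # every cell starts as unassigned ("-1")
  labels <- setNames(rep("-1", length(cells)), cells)
  pool <- cells
  for (iter in seq_len(maxIterations)) {
    # stop when only a few cells are left: they keep the "-1" flag
    if (length(pool) <= maxLeftOut) break
    tags <- clusterFun(pool)  # named vector: cell name -> cluster tag
    stillPooled <- character(0L)
    for (cl in unique(tags)) {
      members <- names(tags)[tags == cl]
      if (isUniformFun(members)) {
        # label scheme "<iteration>_<tag>" is an assumption of this sketch
        labels[members] <- paste0(iter, "_", cl)
      } else {
        stillPooled <- c(stillPooled, members)
      }
    }
    pool <- stillPooled  # non-uniform cells go through another iteration
  }
  labels
}

# Toy data: a hidden cell type stands in for the expression profile
cellType <- setNames(rep(c("A", "B", "C", "D"), times = c(120L, 60L, 15L, 5L)),
                     paste0("cell", seq_len(200L)))

# Toy clustering: separates the alphabetically first type from the rest
toyCluster <- function(pool) {
  types <- cellType[pool]
  setNames(ifelse(types == min(types), "1", "2"), pool)
}

# Toy uniformity check: a cluster is uniform when it holds a single type
toyIsUniform <- function(members) length(unique(cellType[members])) == 1L

labels <- iterativeUniformSplit(names(cellType), toyCluster, toyIsUniform)
table(labels)
```

In this toy run the last 20 cells never form a uniform cluster and fall under the leftover limit, so they keep the "-1" flag.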

checkClusterUniformity() takes a COTAN object and a cells' cluster and checks whether the latter is uniform by looking at the genes' GDI distribution. The function runs checkObjIsUniform() on the given input checker.

mergeUniformCellsClusters() takes in a uniform clusterization and iteratively checks whether merging two near clusters would still form a uniform cluster. Multiple thresholds will be used from 1.37 up to the given one in order to prioritize the merge of the best fitting pairs.

This function uses the cosine distance to establish the nearest cluster pairs. It will use the checkClusterUniformity() function to check whether the merged clusters are uniform. The function will stop once no tested pairs of clusters are mergeable after testing all pairs in a single batch.
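As an illustration of how the nearest cluster pairs might be ranked by cosine distance, here is a minimal base-R sketch on made-up cluster profiles. The helper `cosineDistMatrix()` and the `profiles` matrix are hypothetical; the package obtains its distances from the DEA or the Zero-One averages, with the distance types supported by `parallelDist::parDist()`.

```r
# Hypothetical helper, NOT a COTAN function: cosine distance between columns
cosineDistMatrix <- function(m) {
  norms <- sqrt(colSums(m^2))
  # cosine similarity = dot product divided by the product of the norms
  sim <- crossprod(m) / outer(norms, norms)
  1 - sim
}

# Made-up profiles: one column per cluster, one row per feature
profiles <- cbind(cl1 = c(1.0, 0.0, 0.0),
                  cl2 = c(0.9, 0.1, 0.0),
                  cl3 = c(0.0, 1.0, 0.0),
                  cl4 = c(0.0, 0.8, 0.6))

d <- cosineDistMatrix(profiles)

# all unordered cluster pairs, one per row
pairs <- t(combn(colnames(profiles), 2L))

# look up each pair's distance, then sort pairs from nearest to farthest
pairDist <- d[pairs]
ranked <- pairs[order(pairDist), , drop = FALSE]

# the first rows are the candidates that would be tested for merging first
head(ranked, 2L)
```

Only the top of this ranking, up to `batchSize` pairs, would be submitted to the uniformity check in each round.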

Value

GDIPlot() returns a ggplot2 object with a point for each gene, where the ordinates are the GDI levels and the abscissa is the average gene expression (log scaled). Also marked are the given threshold (in red) and the 50% and 75% quantiles (in blue).

genesSelector() returns an array with the genes' names

cellsUniformClustering() returns a list with 2 elements:

"clusters" the newly found cluster labels array
"coex" the associated COEX data.frame

checkClusterUniformity() returns a checker object of the same type as the input one, which contains both the threshold and the results of the check: see UniformTranscriptCheckers for more details

mergeUniformCellsClusters() returns a list with:

"clusters" the merged cluster labels array
"coex" the associated COEX data.frame

Examples

data("test.dataset")

objCOTAN <- automaticCOTANObjectCreation(raw = test.dataset,
                                         GEO = "S",
                                         sequencingMethod = "10X",
                                         sampleCondition = "Test",
                                         cores = 6L,
                                         saveObj = FALSE)

groupMarkers <- list(G1 = c("g-000010", "g-000020", "g-000030"),
                     G2 = c("g-000300", "g-000330"),
                     G3 = c("g-000510", "g-000530", "g-000550",
                            "g-000570", "g-000590"))

gdiPlot <- GDIPlot(objCOTAN, genes = groupMarkers, condition = "test")
plot(gdiPlot)

## Here we override the default checker as a way to reduce the number of
## clusters as higher thresholds imply less stringent uniformity checks
##
## In real applications it might be appropriate to do so in the cases when
## the wanted resolution is lower such as in the early stages of the analysis
##

checker <- new("AdvancedGDIUniformityCheck")
identical(checker@firstCheck@GDIThreshold, 1.297)

checker2 <- shiftCheckerThresholds(checker, 0.1)
identical(checker2@firstCheck@GDIThreshold, 1.397)

splitList <- cellsUniformClustering(objCOTAN, cores = 6L,
                                    optimizeForSpeed = TRUE,
                                    deviceStr = "cuda",
                                    initialResolution = 0.8,
                                    checker = checker2,
                                    saveObj = FALSE)

clusters <- splitList[["clusters"]]

firstCluster <- getCells(objCOTAN)[clusters %in% clusters[[1L]]]

checkerRes <-
  checkClusterUniformity(objCOTAN, checker = checker2,
                         clusterName = clusters[[1L]], cells = firstCluster,
                         cores = 6L, optimizeForSpeed = TRUE,
                         deviceStr = "cuda", saveObj = FALSE)

objCOTAN <- addClusterization(objCOTAN,
                              clName = "split",
                              clusters = clusters,
                              coexDF = splitList[["coex"]],
                              override = FALSE)

identical(reorderClusterization(objCOTAN)[["clusters"]], clusters)

## It is possible to pass a list of checkers to the merge function: each
## one is applied to the merged clusterization *resulting* from the
## previous checker. This ensures that the most similar clusters are
## merged first, improving the overall performance

mergedList <- mergeUniformCellsClusters(objCOTAN,
                                        checkers = c(checker, checker2),
                                        batchSize = 2L,
                                        clusters = clusters,
                                        cores = 6L,
                                        optimizeForSpeed = TRUE,
                                        deviceStr = "cpu",
                                        distance = "cosine",
                                        hclustMethod = "ward.D2",
                                        saveObj = FALSE)

objCOTAN <- addClusterization(objCOTAN,
                              clName = "merged",
                              clusters = mergedList[["clusters"]],
                              coexDF = mergedList[["coex"]],
                              override = TRUE)

identical(reorderClusterization(objCOTAN)[["clusters"]],
          mergedList[["clusters"]])

seriph78/COTAN documentation built on June 1, 2025, 4:57 p.m.