depeche: Perform optimization and penalized K-means clustering

View source: R/depeche.R

Description

This is the central function of the package. As input, only a dataset is required. It starts by performing optimizations and then performs clustering based on the values identified in the optimization step.

Usage

depeche(
  inDataFrame,
  samplingSubset = seq_len(nrow(inDataFrame)),
  dualDepecheSetup,
  penalties = 2^seq(0, 5, by = 0.5),
  sampleSize = "default",
  selectionSampleSize = "default",
  k = 30,
  minARIImprovement = 0.01,
  optimARI = 0.95,
  maxIter = 100,
  log2Off = FALSE,
  center = "default",
  nCores = "default",
  createOutput = TRUE
)

Arguments

inDataFrame

A dataframe or matrix with the data that will be used to create the clustering. Cytometry data should be transformed using a biexponential, arcsinh or similar transformation, and day-to-day normalization should be performed if not all data were acquired in the same run. Scaling and similar steps, on the other hand, are performed within the function.

samplingSubset

If the dataset is made up of an unequal number of cells from multiple individuals, it might be wise to pre-define a subset of the rows that includes equal or near-equal numbers of cells from each individual, to prevent a few outliers from dominating the analysis. That subset is specified here, as a vector of row numbers in inDataFrame.
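A minimal sketch of how such a balanced vector of row numbers could be built in base R. The data frame `exampleData` and its `donor` column are hypothetical and only stand in for a real dataset with one identifier per individual.

```r
# Hypothetical example data: 'donor' is an assumed column naming the individual
set.seed(1)
exampleData <- data.frame(
    donor = rep(c("A", "B", "C"), times = c(5000, 800, 1200)),
    marker1 = rnorm(7000),
    marker2 = rnorm(7000)
)

# Draw the same number of rows from each individual,
# limited by the individual with the fewest rows
nPerDonor <- min(table(exampleData$donor))
rowsByDonor <- split(seq_len(nrow(exampleData)), exampleData$donor)
balancedRows <- sort(unlist(lapply(rowsByDonor, function(rows) {
    rows[sample.int(length(rows), nPerDonor)]
}), use.names = FALSE))

# The row numbers could then be passed on, for example:
# depeche(exampleData[, c("marker1", "marker2")], samplingSubset = balancedRows)
```

Indexing with `sample.int()` rather than calling `sample()` on the row numbers directly avoids the special case where a group contains a single row.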

dualDepecheSetup

Optionally, a dataframe with two columns: the first specifying which step (1 or 2) the variable should be included in, the second specifying the column name for the variable in question. It is used if a two-step clustering should be performed, e.g. in the case where phenotypic clustering should be performed, followed by clustering on functional variables.

penalties

This argument decides whether a single penalty is used for clustering or whether multiple penalties are evaluated to identify the optimal one. It accepts a single value, a vector of values or, if dual clustering is performed, a list of two vectors. The default values are empirically defined and might not be optimal for a specific dataset, but the algorithm warns if the optimal value lies on the border of the range. Note that when the penalty is 0, there is no penalization, which means that the algorithm runs standard K-means clustering.
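For orientation, the default penalty grid from the Usage section can be evaluated directly. It contains 11 values spaced evenly on a log2 scale between 1 and 32. The wider grid at the end is a hypothetical illustration of how the range might be extended if the algorithm warns that the optimum lies on a border.

```r
# The default grid from the Usage section
defaultPenalties <- 2^seq(0, 5, by = 0.5)

length(defaultPenalties)    # 11 candidate penalties
range(defaultPenalties)     # 1 and 32
round(defaultPenalties, 2)
# 1.00 1.41 2.00 2.83 4.00 5.66 8.00 11.31 16.00 22.63 32.00

# Hypothetical wider grid, same log2 spacing, extended at both ends
widerPenalties <- 2^seq(-2, 7, by = 0.5)
```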

sampleSize

This controls how much of the dataset is used to run the penalty optimization. 'default' uses the full file for files with up to 10000 events. When the sampleSize argument is larger than 10000, 'default' generates a random subset of the same size for the selectionSampleSize as well. A user-specified number is also accepted.

selectionSampleSize

The size of the dataset used to find the optimal solution out of the many generated by the penalty optimization at each sample size. 'default' uses the full file for files with up to 10000 events. When the sampleSize argument is larger than 10000, 'default' generates a random subset of the same size for the selectionSampleSize as well. A user-specified number is also accepted.

k

Number of initial cluster centers. The higher the number, the greater the precision of the clustering, but the computing time also increases linearly with the number of starting points. Default is 30. If penalties=0, k-means clustering with k clusters will be performed.

minARIImprovement

This is the stop criterion for the penalty optimization algorithm: the more iterations that are run, the smaller the improvement in the adjusted Rand index (ARI) becomes, and this sets the threshold at which the inner iterations stop. Defaults to 0.01.

optimARI

Above this level of ARI, all solutions are considered equally valid, and the median solution is selected among them.
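The ARI used by minARIImprovement and optimARI measures the agreement between two clusterings of the same observations, corrected for chance. The following base R sketch computes it from a contingency table using the standard Hubert and Arabie formula. The function name `adjustedRandIndex` is introduced here for illustration only, and this is not necessarily how depeche computes the value internally.

```r
# Illustrative helper (not part of DepecheR): adjusted Rand index of two
# cluster assignment vectors of equal length
adjustedRandIndex <- function(x, y) {
    tab <- table(x, y)
    n <- sum(tab)
    sumComb <- sum(choose(tab, 2))            # pairs co-clustered in both
    sumRows <- sum(choose(rowSums(tab), 2))   # pairs co-clustered in x
    sumCols <- sum(choose(colSums(tab), 2))   # pairs co-clustered in y
    expected <- sumRows * sumCols / choose(n, 2)
    maxIndex <- (sumRows + sumCols) / 2
    (sumComb - expected) / (maxIndex - expected)
}

a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1)   # same partition as a, different labels
d <- c(1, 2, 1, 2, 1, 2)   # a partition that cuts across a

adjustedRandIndex(a, b)    # identical partitions give 1
adjustedRandIndex(a, d)    # disagreement gives a value below 1
```

The sketch does not handle the degenerate case where both vectors place all observations in a single cluster, for which the denominator is zero.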

maxIter

The maximal number of iterations that are performed in the penalty optimization.

log2Off

Whether the automatic detection of high kurtosis, and the log2 transformation that follows it, should be turned off.

center

If centering should be performed. Alternatives are 'default', 'mean', 'peak' and FALSE. 'peak' results in centering around the highest peak in the data, which is useful in most cytometry situations. 'mean' results in mean centering. 'default' gives different results depending on the data: datasets with 100+ variables are mean centered, and otherwise, peak centering is used. FALSE results in no centering, mainly for testing purposes.
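To illustrate the difference between the 'peak' and 'mean' alternatives, the sketch below centers one simulated variable both ways. The helper `peakLocation` is hypothetical and estimates the highest peak from a kernel density estimate. How depeche locates the peak internally is not described here, so this is an assumption made for illustration.

```r
# Hypothetical helper: location of the highest peak of a kernel density estimate
peakLocation <- function(x) {
    dens <- stats::density(x)
    dens$x[which.max(dens$y)]
}

set.seed(1)
# Simulated variable with a large population near 5 and a small one near 0
x <- c(rnorm(900, mean = 5, sd = 0.5), rnorm(100, mean = 0, sd = 0.5))

peakCentered <- x - peakLocation(x)   # analogous to center = "peak"
meanCentered <- x - mean(x)           # analogous to center = "mean"
```

With a dominant population and a smaller second one, as is common in cytometry data, the mean is pulled towards the minor population, whereas the peak stays on the dominant one.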

nCores

If multiCore is TRUE, then this sets the number of parallel processes. The default is currently 87.5 percent of the available cores, capped at 10 cores, as no speed increase is generally seen above 10 cores on normal computers.
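A rough sketch of what the default described above might amount to on a given machine. It assumes that the percentage refers to the number of cores reported by `parallel::detectCores()` and that the result is rounded down. Both are assumptions, not confirmed details of the implementation.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
availableCores <- detectCores()
if (is.na(availableCores)) {
    availableCores <- 1
}

# Assumed rule: 87.5 percent of the cores, rounded down, between 1 and 10
defaultCores <- max(1, min(floor(0.875 * availableCores), 10))
defaultCores
```

Under these assumptions, a machine with 8 cores would use 7 processes and a machine with 16 cores would be capped at 10.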

createOutput

For testing purposes. Defaults to TRUE. If FALSE, no plots are generated.

Value

A nested list with varying components depending on the setup above:

clusterVector

A vector with the same length as the number of rows in inDataFrame, in which the cluster identity of each observation is noted.

clusterCenters/log2ClusterCenters

A matrix containing information about where the centers are in all the variables that contributed to creating the cluster with the given penalty term. It is used by dAllocate. If a variable is penalized, its value will appear at the center of the data with the centering scheme used in the depeche run, to make dAllocate function runs possible. If the data was log2-transformed, the cluster centers will reflect the log2-transformed positions and the cluster center matrix will be named accordingly, not to introduce any unnecessary variables that were sparsed out for each cluster. 1 means that the variable was used, 0 that it was discarded.

penaltyOptList

A list of two dataframes:

penaltyOpt.df

A one row dataframe with the settings for the optimal penalty.

meanOptimDf

A dataframe with the information about the results with all tested penalty values.

If a dual setup is used, the result will be a nested list, where the first sublist contains the information above for the primary clustering and the following list components contain the results of all the secondary clusterings combined.

Examples

# Load some data
data(testData)

# First, just run with the standard settings
## Not run: 
testDataDepecheResult <- depeche(testData[, 2:15])

# Look at the result
str(testDataDepecheResult)

# Now, a dual depeche setup is used
testDataDepecheResultDual <- depeche(testData[, 2:15],
    dualDepecheSetup = data.frame(
        rep(1:2, each = 7),
        colnames(testData[, 2:15])
    ), penalties = c(64, 128), sampleSize = 500,
    selectionSampleSize = 500, maxIter = 20
)

# Look at the result
str(testDataDepecheResultDual)

## End(Not run)

DepecheR documentation built on Nov. 8, 2020, 5:44 p.m.