optimalFlowClassification: optimalFlowClassification

optimalFlowClassification    R Documentation

optimalFlowClassification

Description

Performs a supervised classification of input data when a database and a partition of the database are provided.

Usage

optimalFlowClassification(
  X,
  database,
  templates,
  consensus.method = "pooling",
  cov.estimation = "standard",
  alpha.cov = 0.85,
  initial.method = "supervized",
  max.clusters = NA,
  alpha.tclust = 0,
  restr.factor.tclust = 1000,
  classif.method = "qda",
  qda.bar = TRUE,
  cost.function = "points",
  cl.paral = 1,
  equal.weights.voting = TRUE,
  equal.weights.template = TRUE
)

Arguments

X

Data sample to be classified.

database

A list where each entry is a partition (clustering) represented as a data frame. All entries have the same dimensions, and the last variable holds the labels of the partition.
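As a rough illustration of this structure, the sketch below builds a toy database with base R only. The helper make.partition, the object toy.database and the column names are hypothetical and are not part of optimalFlow; real databases are normally produced with buildDatabase.

```r
# Hypothetical helper: one labelled partition with two numeric variables.
make.partition <- function(n, seed) {
  set.seed(seed)
  data.frame(
    marker1 = c(rnorm(n, 0), rnorm(n, 5)),
    marker2 = c(rnorm(n, 0), rnorm(n, 5)),
    # The last column holds the labels of the partition.
    label = rep(c("A", "B"), each = n)
  )
}

# A database is a list of such data frames, all with the same columns.
toy.database <- list(make.partition(50, 1), make.partition(50, 2))

# Sanity check: every entry has the same column names, with labels last.
same.columns <- all(sapply(
  toy.database,
  function(p) identical(names(p), names(toy.database[[1]]))
))
```

Each entry here has two clusters, but nothing in the structure itself requires every partition to contain the same labels.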

templates

List of the consensus clusterings for every group in the partition of the database, as obtained by optimalFlowTemplates.

consensus.method

The consensus.method value that was used in optimalFlowTemplates.

cov.estimation

How to estimate the covariance matrix of each cluster in a partition. "standard" uses cov(), while "robust" uses robustbase::covMcd.

alpha.cov

Only when cov.estimation = "robust". Sets the value of alpha in robustbase::covMcd.
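The two estimators named above can be compared on a toy partition. This is a minimal sketch, not optimalFlow's internal code: the data and the names partition, by.cluster, cov.standard and cov.robust are invented for illustration, and the robust branch only runs if robustbase is installed.

```r
# Toy labelled partition (hypothetical data, last column = labels).
set.seed(1)
partition <- data.frame(
  x = c(rnorm(40, 0), rnorm(40, 4)),
  y = c(rnorm(40, 0), rnorm(40, 4)),
  label = rep(c("A", "B"), each = 40)
)

# Split the numeric columns by label: one data frame per cluster.
by.cluster <- split(partition[, c("x", "y")], partition$label)

# "standard": the sample covariance of each cluster via cov().
cov.standard <- lapply(by.cluster, cov)

# "robust": minimum covariance determinant estimate, with alpha playing
# the role of alpha.cov. Skipped when robustbase is not available.
if (requireNamespace("robustbase", quietly = TRUE)) {
  cov.robust <- lapply(
    by.cluster,
    function(d) robustbase::covMcd(d, alpha = 0.85)$cov
  )
}
```

The robust estimate is less sensitive to outlying points in a cluster, at the price of extra computation.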

initial.method

Indicates how to obtain a partition of X. Takes values in c("supervized", "unsupervized"). "supervized" uses tclust initialized by the templates. "unsupervized" uses flowMeans.

max.clusters

The maximum number of clusters for flowMeans. Only when initial.method = "unsupervized".

alpha.tclust

Level of trimming allowed for tclust. Only when initial.method = "supervized".

restr.factor.tclust

Fixes the restr.fact parameter in tclust. Only when initial.method = "supervized".

classif.method

Indicates the type of supervised learning to perform. Takes values in c("matching", "qda", "random forest").

qda.bar

Only if classif.method = "qda". If TRUE, the appropriate consensus clustering (template, prototype) is used for learning. If FALSE, the closest partition in the appropriate group is used.
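To give a feel for what learning from a labelled partition with QDA involves, here is a generic sketch using MASS::qda on invented data. It does not reproduce optimalFlow's internal procedure; the objects training, new.cells, fit and predicted are hypothetical, and training stands in for whichever partition (template or closest cytometry) is selected for learning.

```r
# Hypothetical labelled partition used for learning.
set.seed(1)
training <- data.frame(
  x = c(rnorm(40, 0), rnorm(40, 4)),
  y = c(rnorm(40, 0), rnorm(40, 4)),
  label = factor(rep(c("A", "B"), each = 40))
)

# Hypothetical new cells to classify: one near each cluster centre.
new.cells <- data.frame(x = c(0.1, 3.9), y = c(-0.2, 4.2))

# Fit quadratic discriminant analysis: one mean and one covariance
# matrix per label, then assign each new cell to the most likely label.
fit <- MASS::qda(training[, c("x", "y")], grouping = training$label)
predicted <- predict(fit, new.cells)$class
```

Because QDA fits a separate covariance matrix for each label, every cluster in the learning partition needs enough points to estimate one.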

cost.function

Only if classif.method = "matching". Indicates the cost function, distance between clusters, to be used for label matching.

cl.paral

Number of cores to be used in parallel procedures.

equal.weights.voting

Only when classif.method = "qda" and qda.bar = FALSE, or when classif.method = "random forest". Indicates the weights structure when looking for the most similar partition in a group.

equal.weights.template

If TRUE, the weights assigned to the clusters in a partition are uniform (1/number of clusters). If FALSE, the weight of each cluster is its proportion of points relative to the total number of points in the partition.
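The two weighting schemes can be written out in a few lines of base R. This is a minimal sketch on an invented label vector; the names labels, uniform.weights and proportional.weights are hypothetical.

```r
# Hypothetical cluster labels of a partition with six points.
labels <- c("A", "A", "A", "B", "B", "C")
counts <- table(labels)

# Uniform weights: every cluster gets 1 / (number of clusters).
uniform.weights <- rep(1 / length(counts), length(counts))
names(uniform.weights) <- names(counts)

# Proportional weights: share of points falling in each cluster.
proportional.weights <- as.numeric(counts) / length(labels)
names(proportional.weights) <- names(counts)
```

Here the uniform weights are 1/3 for each of A, B and C, while the proportional weights are 1/2, 1/3 and 1/6.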

Value

A list formed by:

cluster

Labels assigned to the input data.

clusterings

A list that contains the initial unsupervized or semi-supervized clusterings of the cytometry of interest. It can have as many entries as there are templates in the semi-supervized case (initial.method = "supervized"), or only one entry when initial.method = "unsupervized". Each entry is a list whose most relevant element for the clusterings is cluster.

assigned.template.index

Label of the group whose template is closest to the data. When classical QDA or random forest is used for classification, there is a second entry indicating the index of the cytometry in the cluster used for learning.

cluster.vote

Only when classif.method = "matching" or when consensus.method in c("hierarchical", "k-barycenter"). Vote on the type of every label in the partition of the data. In essence, cluster + cluster.vote return a fuzzy clustering of the data of interest.

References

E del Barrio, H Inouzhe, JM Loubes, C Matran and A Mayo-Iscar. (2019) optimalFlow: Optimal-transport approach to flow cytometry gating and population matching. arXiv:1907.08006

Examples

# Build a simple database from a subset of the cytometries and cell types,
# for simplicity and better visualisation.
cell.types <- c('Monocytes', 'CD4+CD8-', 'Mature SIg Kappa', 'TCRgd-')
database <- buildDatabase(
  dataset_names = paste0('Cytometry', c(2:5, 7:9, 12:17, 19, 21)),
  population_ids = cell.types)
# To select the number of templates interactively from the hierarchical tree
# and produce a clustering, use instead:
# templates.optimalFlow <- optimalFlowTemplates(database = database)
templates.optimalFlow <- optimalFlowTemplates(database = database, templates.number = 5,
                                              cl.paral = 1)
# Rows of Cytometry1 that belong to the selected cell types.
selected.rows <- Cytometry1$`Population ID (name)` %in% cell.types
classification.optimalFlow <- optimalFlowClassification(
  Cytometry1[selected.rows, 1:10], database, templates.optimalFlow, cl.paral = 1)
# noise.types is not defined in this example and must be set before running
# this call (presumably the labels to be treated as noise).
scoreF1.optimalFlow <- optimalFlow::f1Score(
  classification.optimalFlow$cluster, Cytometry1[selected.rows, ], noise.types)



HristoInouzhe/optimalFlow documentation built on April 23, 2023, 5:45 p.m.