fix_missing_cell_labels: Replace NA cell column data values left after running...

View source: R/label_transfer.R

fix_missing_cell_labels    R Documentation

Replace NA cell column data values left after running transfer_cell_labels.


Try to replace NA values left in a query cell_data_set after running transfer_cell_labels.


fix_missing_cell_labels(
  cds,
  reduction_method = c("UMAP", "PCA", "LSI"),
  from_column_name,
  to_column_name = from_column_name,
  out_notna_models_dir = NULL,
  k = 10,
  nn_control = list(),
  top_frac_threshold = 0.5,
  top_next_ratio_threshold = 1.5,
  verbose = FALSE
)



cds: the cell_data_set upon which to perform this operation.


reduction_method: a string specifying the reduced dimension matrix to use for the label transfer. Supported values are "PCA", "LSI", and "UMAP". The default is "UMAP".


from_column_name: a string giving the name of the query cds column with NA values to fix.


to_column_name: a string giving the name of the query cds column where the fixed column data will be stored. The default is from_column_name.


out_notna_models_dir: a string with the name of the transform model directory in which to save the not-NA transform models, which include the nearest neighbor index. If NULL, the not-NA models are not saved. The default is NULL.


k: an integer giving the number of reference nearest neighbors to find. This value must be large enough to find meaningful column value fractions. See the top_frac_threshold parameter below for additional information. The default is 10.


nn_control: an optional list of parameters used to make the nearest neighbors index. See the set_nn_control help for additional details. The default metric is cosine for reduction_method PCA and LSI, and euclidean for reduction_method UMAP.
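As an illustration, a control list might look like the sketch below. The element names method, metric, and n_trees are assumed to be those described in the set_nn_control help, and the values are placeholders rather than recommendations; check both against that help page before use.

```r
# Hypothetical nn_control list. Element names are assumed to match those
# documented in the set_nn_control help; the values are illustrative only.
nn_control <- list(
  method = "annoy",   # type of nearest neighbor index to build
  metric = "cosine",  # distance metric, replacing the reduction-specific default
  n_trees = 50        # annoy index build parameter
)

# The list would then be passed through unchanged, for example (not run):
# cds2 <- fix_missing_cell_labels(cds2, "UMAP", "transfer_cell_type",
#                                 "fixed_cell_type", nn_control = nn_control)
```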


top_frac_threshold: a numeric value. The top fraction of reference values must be greater than top_frac_threshold for the value to be transferred to the query. The top fraction is the fraction of the k neighbors that have the most frequent value. The default is 0.5.


top_next_ratio_threshold: a numeric value giving the minimum ratio of the count of the most frequent reference value to the count of the second most frequent reference value required for transferring the reference value to the query. The default is 1.5.
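To make the two thresholds concrete, the following base R sketch works through the arithmetic for one cell. The neighbor labels are invented for illustration, and the code only mirrors the parameter descriptions above; it is not taken from the monocle3 source.

```r
# Hypothetical labels of the k = 10 nearest non-NA neighbors of one cell.
neighbor_labels <- c("neuron", "neuron", "neuron", "neuron", "neuron",
                     "muscle", "muscle", "muscle", "glia", "glia")

# Count each value and order the counts from most to least frequent.
counts <- sort(table(neighbor_labels), decreasing = TRUE)

top_frac <- counts[[1]] / length(neighbor_labels)  # 5 / 10 = 0.5
top_next_ratio <- counts[[1]] / counts[[2]]        # 5 / 3, about 1.67

top_frac > 0.5         # FALSE: 0.5 is not strictly greater than the threshold
top_next_ratio >= 1.5  # TRUE: the ratio test would accept "neuron"
```

With the default thresholds, this cell fails the fraction test but passes the ratio test, so under the stated rules "neuron" would be the value transferred.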


verbose: a boolean controlling verbose output. The default is FALSE.


fix_missing_cell_labels uses non-NA cell data values in the query cell_data_set to replace NAs in nearby cells. It partitions the cells into a set with NA and a set with non-NA column data values. It makes a nearest neighbor index using the cells with non-NA values. For each cell with NA, it finds the k nearest non-NA cells and tries to find an acceptable non-NA column data value as follows. If more than a top_frac_threshold fraction of those neighbors have the same value, it replaces the NA with that value. If not, it checks whether the ratio of the count of the most frequent value to the count of the second most frequent value is at least top_next_ratio_threshold, in which case it copies the most frequent value. Otherwise, it leaves the NA.
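The procedure described above can be sketched in base R as follows. This is a toy illustration, not the monocle3 implementation: the function name fix_labels_sketch and its coords and labels arguments are hypothetical, a brute-force euclidean search stands in for the nearest neighbor index, and the handling of the case with only one distinct neighbor value is an assumption.

```r
# Toy sketch of the procedure; not the monocle3 implementation.
# coords: numeric matrix of reduced-dimension coordinates, one row per cell.
# labels: character vector of cell labels, NA where a label is missing.
fix_labels_sketch <- function(coords, labels, k = 10,
                              top_frac_threshold = 0.5,
                              top_next_ratio_threshold = 1.5) {
  # Partition the cells into those with and without a label.
  has_label <- !is.na(labels)
  ref_coords <- coords[has_label, , drop = FALSE]
  ref_labels <- labels[has_label]
  k <- min(k, nrow(ref_coords))
  if (k == 0) return(labels)  # no labeled cells to borrow from

  fixed <- labels
  for (i in which(!has_label)) {
    # Brute-force euclidean distances to every labeled cell.
    d <- sqrt(colSums((t(ref_coords) - coords[i, ])^2))
    nn <- order(d)[seq_len(k)]
    counts <- sort(table(ref_labels[nn]), decreasing = TRUE)
    top_frac <- counts[[1]] / k
    # Assumption: with one distinct neighbor value there is no runner-up.
    ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    if (top_frac > top_frac_threshold || ratio >= top_next_ratio_threshold) {
      fixed[i] <- names(counts)[1]
    }
  }
  fixed
}

# Usage: the unlabeled sixth cell sits among the three "a" cells.
coords <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(0.5, 0.5))
labels <- c("a", "a", "a", "b", "b", NA)
fixed_labels <- fix_labels_sketch(coords, labels, k = 3)

# A cell midway between one "a" and one "b" fails both tests and stays NA.
tied_labels <- fix_labels_sketch(rbind(c(0, 0), c(1, 1), c(0.5, 0.5)),
                                 c("a", "b", NA), k = 2)
```

In the first call all three nearest neighbors are "a", so the NA is replaced. In the second call the top fraction is exactly 0.5 and the ratio is 1, so neither threshold is met and the NA is left in place.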


an updated cell_data_set object


  ## Not run: 
     expression_matrix <- readRDS(system.file('extdata',
     cell_metadata <- readRDS(system.file('extdata',
     gene_metadata <- readRDS(system.file('extdata',

     cds <- new_cell_data_set(expression_data=expression_matrix,
                              cell_metadata=cell_metadata,
                              gene_metadata=gene_metadata)

    ncell <- nrow(colData(cds))
    cell_sample <- sample(seq(ncell), 2 * ncell / 3)
    cell_set <- seq(ncell) %in% cell_sample
    cds1 <- cds[,cell_set]
    cds1 <- preprocess_cds(cds1)
    cds1 <- reduce_dimension(cds1, build_nn_index=TRUE)
    save_transform_models(cds1, 'tm')

    cds2 <- cds[,!cell_set]
    cds2 <- load_transform_models(cds2, 'tm')
    cds2 <- preprocess_transform(cds2, 'PCA')
    cds2 <- reduce_dimension_transform(cds2)
    cds2 <- transfer_cell_labels(cds2, 'UMAP', colData(cds1), 'cao_cell_type', 'transfer_cell_type')
    cds2 <- fix_missing_cell_labels(cds2, 'UMAP', 'transfer_cell_type', 'fixed_cell_type')
## End(Not run)
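To see how many labels were filled in, the original and fixed columns can be tabulated side by side. The sketch below uses a small invented data frame in place of the cell column data so that it runs without monocle3; with a real cell_data_set, the same calls should apply to as.data.frame(colData(cds2)).

```r
# Hypothetical stand-in for as.data.frame(colData(cds2)) after the example.
cell_data <- data.frame(
  transfer_cell_type = c("neuron", NA, "muscle", NA, NA),
  fixed_cell_type    = c("neuron", "neuron", "muscle", NA, "muscle"),
  stringsAsFactors = FALSE
)

# Count the NA values before and after fixing.
n_na_before <- sum(is.na(cell_data$transfer_cell_type))
n_na_after  <- sum(is.na(cell_data$fixed_cell_type))
n_fixed <- n_na_before - n_na_after

# Cross-tabulate the two columns, keeping NA as its own category.
table(before = cell_data$transfer_cell_type,
      after = cell_data$fixed_cell_type,
      useNA = "ifany")
```

Any cells still counted as NA in the fixed column are those whose neighbors met neither threshold.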

cole-trapnell-lab/monocle3 documentation built on May 24, 2022, 5:25 p.m.