tof_downsample_density: Downsample high-dimensional cytometry data by randomly...

View source: R/downsampling.R

tof_downsample_density R Documentation

Downsample high-dimensional cytometry data by randomly selecting a proportion of the cells in each group.

Description

This function reduces the number of cells in a 'tof_tbl' using the density-dependent downsampling algorithm described in Qiu et al. (2011).

Usage

tof_downsample_density(
  tof_tibble,
  group_cols = NULL,
  density_cols = where(tof_is_numeric),
  target_num_cells,
  target_prop_cells,
  target_percentile = 0.03,
  outlier_percentile = 0.01,
  distance_function = c("euclidean", "cosine", "l2", "ip"),
  density_estimation_method = c("mean_distance", "sum_distance", "spade"),
  ...
)

Arguments

tof_tibble

A 'tof_tbl' or a 'tibble'.

group_cols

Unquoted names of the columns in 'tof_tibble' that should be used to define groups within which the downsampling will be performed. Supports tidyselect helpers. Defaults to 'NULL' (no grouping).

density_cols

Unquoted names of the columns in 'tof_tibble' to use in the density estimation for each cell. Defaults to all numeric columns in 'tof_tibble'.

target_num_cells

An approximate constant number of cells that should be sampled from each group defined by 'group_cols'. Slightly more or fewer cells may be returned due to how the density calculation is performed.

target_prop_cells

An approximate proportion of cells (between 0 and 1) that should be sampled from each group defined by 'group_cols'. Slightly more or fewer cells may be returned due to how the density calculation is performed. Ignored if 'target_num_cells' is specified.

target_percentile

The local density percentile (i.e. a value between 0 and 1) to which the downsampling procedure should adjust all cells. In short, the algorithm will continue to remove cells from the input 'tof_tibble' until the local densities of all remaining cells are equal to 'target_percentile'. Lower values will result in more cells being removed. See Qiu et al. (2011) for details. Defaults to 0.03 (the 3rd percentile of local densities), as shown in Usage. Ignored if either 'target_num_cells' or 'target_prop_cells' is specified.
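To make the role of a target density concrete, the following base R sketch keeps sparse cells with certainty and keeps denser cells with probability equal to the target density divided by their own density, in the spirit of Qiu et al. (2011). It is an illustration under stated assumptions, not the package's implementation: the density values are simulated, and all variable names are hypothetical.

```r
set.seed(3)

# Hypothetical local density estimates for 1000 cells (simulated values)
local_density <- rexp(1000)

# Density value sitting at the chosen percentile of all local densities
target_percentile <- 0.03
target_density <- quantile(local_density, probs = target_percentile, names = FALSE)

# Assumed rule: cells at or below the target density are always kept;
# denser cells are kept with probability target_density / density
keep_probability <- pmin(1, target_density / local_density)

# Draw one uniform random number per cell to decide whether it is kept
keep <- runif(length(local_density)) < keep_probability
downsampled_density <- local_density[keep]
```

Under this rule, a lower target percentile shrinks every keep probability, which is why lower values remove more cells.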

outlier_percentile

The local density percentile (i.e. a value between 0 and 1) below which cells should be considered outliers (and discarded). Cells with a local density below 'outlier_percentile' will never be selected during the downsampling procedure. Defaults to 0.01 (cells below the 1st local density percentile will be removed).
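As a rough sketch of how a percentile cutoff of this kind behaves, the base R snippet below flags cells whose density falls below the 1st percentile. The density values are simulated and the variable names are hypothetical; how the package computes the threshold internally is not shown here.

```r
set.seed(2)

# Hypothetical local density estimates for 1000 cells (simulated values)
local_density <- rexp(1000)

# Density value at the outlier percentile
outlier_percentile <- 0.01
outlier_threshold <- quantile(local_density, probs = outlier_percentile, names = FALSE)

# Cells strictly below the threshold are treated as outliers and dropped
is_outlier <- local_density < outlier_threshold
kept_density <- local_density[!is_outlier]
```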

distance_function

A string indicating which distance function to use for the cell-to-cell distance calculations. Options are "euclidean" (the default), "cosine", "l2", and "ip", as listed in Usage.
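For readers unfamiliar with the two most common options, here is a small base R sketch of Euclidean and cosine distance between two cells. The marker values are made up, and defining cosine distance as one minus cosine similarity is an assumption about the usual convention rather than a statement about the package's internals.

```r
# Two toy cells described by three marker intensities (hypothetical values)
cell_a <- c(1, 2, 3)
cell_b <- c(2, 4, 6)

# Euclidean distance: straight-line distance between the two vectors
euclidean_distance <- sqrt(sum((cell_a - cell_b)^2))

# Cosine similarity: cosine of the angle between the two vectors
cosine_similarity <- sum(cell_a * cell_b) /
    (sqrt(sum(cell_a^2)) * sqrt(sum(cell_b^2)))

# Cosine distance (assumed convention): 1 minus the cosine similarity
cosine_distance <- 1 - cosine_similarity
```

Because 'cell_b' is a scaled copy of 'cell_a', the two cells are far apart in Euclidean terms but essentially identical in cosine terms, which is the practical difference between the two options.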

density_estimation_method

A string indicating which algorithm should be used to calculate the local density estimate for each cell. Options include k-nearest neighbor density estimation using the mean distance to a cell's k-nearest neighbors ("mean_distance"; the default), k-nearest neighbor density estimation using the summed distance to a cell's k nearest neighbors ("sum_distance") and counting the number of neighboring cells within a spherical radius around each cell as described in Qiu et al., 2011 ("spade"). While "spade" often produces the best results, it is slower than knn-density estimation methods.
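The three estimation strategies can be sketched in base R as follows. This is a brute-force illustration on simulated data, not the package's code: 'k', 'radius', and all variable names are hypothetical, and negating the neighbor distances so that larger values mean denser neighborhoods is an assumption made for readability.

```r
set.seed(1)

# Toy data: 50 cells measured on 2 markers (simulated, not real cytometry data)
cells <- matrix(rnorm(100), ncol = 2)

# All pairwise Euclidean distances between cells
distances <- as.matrix(dist(cells, method = "euclidean"))

k <- 5 # hypothetical number of nearest neighbors

# For each cell, sort its distances and keep the k nearest,
# skipping the first entry (the zero distance from the cell to itself)
knn_distances <- t(apply(distances, 1, function(d) sort(d)[2:(k + 1)]))

# kNN-style estimates: small neighbor distances mean a dense neighborhood,
# so the distances are negated (assumed convention) to act as densities
mean_distance_density <- -rowMeans(knn_distances)
sum_distance_density <- -rowSums(knn_distances)

# Radius-style estimate: count the other cells within a fixed radius
radius <- 0.5 # hypothetical radius
spade_density <- rowSums(distances <= radius) - 1
```

The two kNN estimates differ only by a factor of 'k', so they rank cells identically.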

...

Optional additional arguments to pass to tof_knn_density or tof_spade_density.

Value

A 'tof_tbl' with the same number of columns as the input 'tof_tibble', but fewer rows. The number of rows depends on whichever of 'target_num_cells', 'target_prop_cells', or 'target_percentile' is in effect; when 'target_percentile' is used, lower values select fewer cells.

See Also

Other downsampling functions: tof_downsample(), tof_downsample_constant(), tof_downsample_prop()

Examples

library(tidytof)

# Simulate 1000 cells measured on four markers
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

# Keep roughly half of the cells, using the "spade" density estimate
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_prop_cells = 0.5,
    density_estimation_method = "spade"
)

# Keep roughly 200 cells, using the "spade" density estimate
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_num_cells = 200L,
    density_estimation_method = "spade"
)

# Keep roughly 200 cells, using the k-nearest neighbor mean distance estimate
tof_downsample_density(
    tof_tibble = sim_data,
    density_cols = c(cd45, cd34, cd38),
    target_num_cells = 200L,
    density_estimation_method = "mean_distance"
)


keyes-timothy/tidytof documentation built on Aug. 28, 2024, 8:37 a.m.