prepareData: Prepare data

View source: R/1_prepareData.R

prepareDataR Documentation

Prepare data

Description

Prepare data into format for diffcyt pipeline

Usage

prepareData(
  d_input,
  experiment_info,
  marker_info,
  cols_to_include = NULL,
  subsampling = FALSE,
  n_sub = NULL,
  seed_sub = NULL
)

Arguments

d_input

Input data. Must be a flowSet or list of flowFrames, DataFrames, data.frames, or matrices as input (one flowFrame or list item per sample).

experiment_info

data.frame, DataFrame, or tbl_df of experiment information, for example sample IDs and group IDs. Must contain a column named sample_id.

marker_info

data.frame, DataFrame, or tbl_df of marker information for each column of data. This should contain columns named marker_name and marker_class. The columns contain: (i) marker names (and any other column names); and (ii) a factor indicating the marker class for each column (with entries "type", "state", or "none").

cols_to_include

Logical vector indicating which columns to include from the input data. Default = all columns.

subsampling

Whether to use random subsampling to select an equal number of cells from each sample. Default = FALSE.

n_sub

Number of cells to select from each sample by random subsampling, if subsampling = TRUE. Default = number of cells in smallest sample.

seed_sub

Random seed for subsampling. Set to an integer value to generate reproducible results. Default = NULL.

Details

Functions in the diffcyt analysis pipeline assume that input data is provided as a SummarizedExperiment object, which contains a single matrix of expression values, together with row and column meta-data.

This function accepts a flowSet or a list of flowFrames, data.frames, or matrices as input (i.e. one flowFrame or list item per sample). The function then concatenates the data tables into a single matrix of values, and adds row and column meta-data.

Row meta-data should be provided as a data frame named experiment_info, containing columns of relevant experiment information, such as sample IDs and group IDs (for each sample). This must contain at least a column named sample_id.

Column meta-data should be provided as a data frame named marker_info, containing the following columns of marker information. The column names must be as shown.

  • marker_name: protein marker names (and column names for any other columns)

  • marker_class: factor indicating the protein marker class for each column of data (usually, entries will be either "type", "state", or "none")

The split into 'cell type' and 'cell state' markers is crucial for the analysis. Cell type markers are used to define cell populations by clustering, and to test for differential abundance of cell populations; while cell state markers are used to test for differential states within cell populations.

The optional argument cols_to_include allows unnecessary columns (e.g. any columns not containing protein markers) to be discarded.

Optionally, random subsampling can be used to select an equal number of cells from each sample (subsampling = TRUE). This can be useful when there are large differences in total numbers of cells per sample, since it ensures that samples with relatively large numbers of cells do not dominate the clustering. However, subsampling should generally not be used when rare cell populations are of interest, due to the significant loss of information if cells from the rare population are discarded.

Value

d_se: Returns data as a SummarizedExperiment containing a single matrix of data (expression values) in the assays slot, together with row meta-data (experiment information) and column meta-data (marker information). The metadata slot also contains the experiment_info data frame, and a vector n_cells of the number of cells per sample; these can be accessed with metadata(d_se)$experiment_info and metadata(d_se)$n_cells.

Examples

# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, 
# see the package vignette.

# Function to create random data (one sample)
d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
  d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
  colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
  d
}

# Create random data (without differential signal)
set.seed(123)
d_input <- list(
  sample1 = d_random(), 
  sample2 = d_random(), 
  sample3 = d_random(), 
  sample4 = d_random()
)

experiment_info <- data.frame(
  sample_id = factor(paste0("sample", 1:4)), 
  group_id = factor(c("group1", "group1", "group2", "group2")), 
  stringsAsFactors = FALSE
)

marker_info <- data.frame(
  channel_name = paste0("channel", sprintf("%03d", 1:20)), 
  marker_name = paste0("marker", sprintf("%02d", 1:20)), 
  marker_class = factor(c(rep("type", 10), rep("state", 10)), 
                        levels = c("type", "state", "none")), 
  stringsAsFactors = FALSE
)

# Prepare data
d_se <- prepareData(d_input, experiment_info, marker_info)


lmweber/diffcyt documentation built on March 19, 2024, 5:24 a.m.