prepData: Data preparation

View source: R/prepData.R

prepData    R Documentation

Data preparation

Description

Data preparation

Usage

prepData(
  x,
  panel = NULL,
  md = NULL,
  features = NULL,
  transform = TRUE,
  cofactor = 5,
  panel_cols = list(channel = "fcs_colname", antigen = "antigen", class = "marker_class"),
  md_cols = list(file = "file_name", id = "sample_id", factors = c("condition",
    "patient_id")),
  by_time = TRUE,
  FACS = FALSE,
  fix_chs = c("common", "all"),
  ...
)

Arguments

x

a flowSet holding all samples or a path to a set of FCS files.

panel

a data.frame containing, for each channel, its column name in the input data, targeted protein marker, and (optionally) class ("type", "state", or "none"). If 'panel' is unspecified, it will be constructed from the first input sample via guessPanel.

md

a table with columns describing the experiment. An exemplary metadata table could look as follows:

  • file_name: the FCS file name

  • sample_id: a unique sample identifier

  • patient_id: the patient ID

  • condition: brief sample description (e.g. reference/stimulated, healthy/diseased)

If 'md' is unspecified, the flowFrame/Set identifier(s) will be used as sample IDs with no additional metadata factors.
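As a rough illustration, a metadata table of this shape could be put together by hand as sketched below. The file names, sample IDs, and factor levels are invented for the example; only the column names follow the defaults listed under md_cols.

```r
# Hypothetical metadata table; all values are invented for illustration.
md <- data.frame(
  file_name  = c("sample1_ref.fcs", "sample1_stim.fcs",
                 "sample2_ref.fcs", "sample2_stim.fcs"),
  sample_id  = c("Ref1", "Stim1", "Ref2", "Stim2"),
  condition  = c("Ref", "Stim", "Ref", "Stim"),
  patient_id = c("Patient1", "Patient1", "Patient2", "Patient2"),
  stringsAsFactors = FALSE
)

# One row per FCS file, and sample identifiers should be unique.
stopifnot(!anyDuplicated(md$file_name), !anyDuplicated(md$sample_id))
md
```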

features

a logical vector, numeric vector of column indices, or character vector of channel names. Specifies which columns to keep from the input data. Defaults to the channels listed in the input panel.

transform

logical. Specifies whether an arcsinh-transformation with the cofactor given by cofactor should be performed, in which case expression values (transformed counts) will be stored in assay(x, "exprs").

cofactor

numeric cofactor(s) to use for optional arcsinh-transformation when transform = TRUE; single value or a vector with channels as names.
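The arithmetic behind an arcsinh-transformation with a cofactor can be sketched in base R as follows. This is a minimal illustration of the formula asinh(x / cofactor) on an invented count matrix, assuming channels are stored in rows; it is not a copy of prepData's internal code.

```r
# Toy count matrix: rows = channels, columns = cells (values are invented).
counts <- matrix(
  c(0,  5,  50,
    0, 10, 100),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Ch1", "Ch2"), NULL)
)

# Single cofactor applied to every channel.
exprs_single <- asinh(counts / 5)

# Channel-specific cofactors: a vector named by channel,
# matched to the rows by name before dividing.
cf <- c(Ch2 = 10, Ch1 = 5)
exprs_by_ch <- asinh(sweep(counts, 1, cf[rownames(counts)], "/"))
```

Matching the cofactor vector by name rather than by position means its order need not agree with the channel order in the data.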

panel_cols

a named list specifying the panel column names that contain channel names, targeted protein markers, and (optionally) marker classes. When only some panel_cols deviate from the defaults, specifying only these is sufficient.

md_cols

a named list specifying the column names of md that contain the FCS file names, sample IDs, and factors of interest (batch, condition, treatment etc.). When only some md_cols deviate from the defaults, specifying only these is sufficient.
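One way to picture how a partially specified list can be completed with defaults is base R's modifyList(). The sketch below only illustrates the idea; whether prepData uses this function internally is an assumption, and the object names are hypothetical.

```r
# Defaults as listed under Usage.
md_cols_default <- list(
  file = "file_name", id = "sample_id",
  factors = c("condition", "patient_id")
)

# The user overrides only the 'factors' entry.
md_cols_user <- list(factors = c("treatment", "batch_id"))

# modifyList() keeps the defaults for entries the user did not supply.
md_cols <- modifyList(md_cols_default, md_cols_user)
str(md_cols)
```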

by_time

logical; should samples be ordered by acquisition time? Ignored if !is.null(md), in which case samples will be ordered as they are listed in md[[md_cols$file]] (see Details).

FACS

logical; is this FACS / flow cytometry data? By default, prepData moves non-mass channels to the output SCE's int_colData; FACS = TRUE ensures that all channels are kept as assay data. If FALSE, prepData will try to access the input flowFrame/Set's "$CYT" descriptor (keyword(., "$CYT")) to determine the data type; this may be inaccurate for some cytometer descriptors.

fix_chs

specifies the strategy to use in case of panel discrepancies. "common" will retain only channels present in all frames/FCS files; "all" will retain the union of channels across samples. In the latter case, a logical matrix with rows = channels and columns = samples will be stored under metadata slot chs_by_fcs, specifying which channels were/weren't (FALSE/TRUE) measured in which samples.
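The difference between the two strategies can be sketched with plain set operations. The channel names and file names below are hypothetical, and the logical matrix built here uses its own convention (TRUE = measured), which is not necessarily the one used for chs_by_fcs.

```r
# Hypothetical channel names for three FCS files.
chs <- list(
  a.fcs = c("Time", "CD3", "CD4"),
  b.fcs = c("Time", "CD3", "CD8"),
  c.fcs = c("Time", "CD3", "CD4", "CD8")
)

common  <- Reduce(intersect, chs)  # channels present in every file
all_chs <- Reduce(union, chs)      # channels present in at least one file

# Logical matrix with rows = channels and columns = samples.
# Here TRUE marks a channel that was measured in that sample; inspect the
# object returned by prepData for the convention used in chs_by_fcs.
measured <- vapply(chs, function(u) all_chs %in% u, logical(length(all_chs)))
rownames(measured) <- all_chs
measured
```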

...

additional arguments passed to read.FCS. E.g., channel_alias in case of panel discrepancies between frames/FCS files. By default, transformation = truncate_max_range = FALSE.

Details

By default, non-mass channels (e.g., time, event lengths) will be removed from the output SCE's assay data and instead stored in the object's internal cell metadata (int_colData) to ensure these data are not subject to transformations or other computations applied to the assay data.

For more than 1 sample, prepData will concatenate cells into a single SingleCellExperiment object. Note that cells will hereby be ordered by "Time", regardless of whether by_time = TRUE or FALSE. Instead, by_time determines the sample (not cell!) order; i.e., whether samples should be kept in their original order, or should be re-ordered according to their acquisition time stored in keyword(flowSet, "$BTIM").

When a metadata table is specified (i.e. !is.null(md)), argument by_time will be ignored and sample ordering is instead determined by md[[md_cols$file]].
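The two sample-ordering rules can be sketched in base R as follows. The file names and start times are invented, and the sketch assumes that acquisition times are "HH:MM:SS" strings recorded on the same day; it illustrates the ordering logic only, not prepData's implementation.

```r
# Hypothetical acquisition start times, as might be stored under "$BTIM".
btim <- c(s1.fcs = "14:05:10", s2.fcs = "09:30:00", s3.fcs = "11:45:30")

# Convert "HH:MM:SS" to seconds since midnight.
to_sec <- function(x) {
  vapply(strsplit(x, ":"),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

# No metadata table and by_time = TRUE: order samples by acquisition time.
by_time_order <- names(btim)[order(to_sec(btim))]

# Metadata table given: sample order follows its file column,
# whatever the acquisition times are.
md_files <- c("s3.fcs", "s1.fcs", "s2.fcs")
md_order <- names(btim)[match(md_files, names(btim))]
```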

Value

a SingleCellExperiment.

Author(s)

Helena L Crowell helena.crowell@uzh.ch

Examples

data(PBMC_fs, PBMC_panel, PBMC_md)
prepData(PBMC_fs, PBMC_panel, PBMC_md)

# channel-specific transformation
cf <- sample(seq_len(10)[-1], nrow(PBMC_panel), TRUE)
names(cf) <- PBMC_panel$fcs_colname
sce <- prepData(PBMC_fs, cofactor = cf)
int_metadata(sce)$cofactor

# input has different name for "condition"
md <- PBMC_md
m <- match("condition", names(md))
colnames(md)[m] <- "treatment"

# add additional factor variable batch ID
md$batch_id <- sample(c("A", "B"), nrow(md), TRUE)

# specify 'md_cols' that differ from defaults
factors <- list(factors = c("treatment", "batch_id"))
ei(prepData(PBMC_fs, PBMC_panel, md, md_cols = factors))

# without panel & metadata tables
# ('raw_data' is an example flowSet shipped with CATALYST)
data(raw_data)
sce <- prepData(raw_data)

# 'flowFrame' identifiers are used as sample IDs
levels(sce$sample_id)

# panel was guessed with 'guessPanel';
# non-mass channels are set to marker class "none"
rowData(sce)


HelenaLC/CATALYST documentation built on Nov. 30, 2024, 4:04 a.m.