RandomF_FCS: Random Forest classifier for supervised demarcation of groups...
In rprops/Phenoflow_package: Advanced analysis of microbial flow cytometry data

Random Forest classifier for supervised demarcation of groups using flow cytometry data.

RandomF_FCS(
  x,
  sample_info,
  sample_col = "name",
  target_label,
  downsample = 0,
  classification_type = "sample",
  param = c("FL1-H", "FL3-H", "FSC-H", "SSC-H"),
  p_train = 0.75,
  seed = 777,
  cleanFCS = FALSE,
  timesplit = 0.1,
  TimeChannel = "Time",
  plot_fig = FALSE,
  method = "rf"
)

`x`	flowSet object where the necessary metadata for classification is included in the phenoData.
`sample_info`	Data frame with the sample information necessary for the classification. It must contain a column (named by `sample_col`, "name" by default) whose values match the sample names of the FCS files stored in the flowSet.
`sample_col`	Column name of the sample names in sample_info. Defaults to "name".
`target_label`	Column name of the sample_info data frame that should be predicted based on the flow cytometry data.
`downsample`	Sample size to which each sample should be downsampled. With the default (0), samples are downsampled to the size of the sample with the lowest number of cells.
`classification_type`	Whether to perform sample-level or single-cell level classification. Defaults to "sample" (sample-level).
`param`	Parameters to base classification on.
`p_train`	Fraction of the data set that should be used for training the model. Defaults to 0.75.
`seed`	Random seed to be used during the analysis. Defaults to 777.
`cleanFCS`	Indicate whether outlier removal should be conducted prior to model estimation. Defaults to FALSE. Samples should preferably contain more than 500 cells. Denoising is based on the parameters specified in 'param'.
`timesplit`	Fraction of timestep used in flowAI for denoising. Please consult the 'flowAI::flow_auto_qc' function for more information.
`TimeChannel`	Name of time channel in the FCS files. This can differ between flow cytometers. Defaults to "Time". You can check this by: colnames(flowSet).
`plot_fig`	Should the confusion matrix and the overall performance statistics on the test data partition be visualized? Defaults to FALSE.
`method`	Method used by caret::train for learning. Defaults to "rf" (random forest).
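
The effect of `downsample`, `p_train` and `seed` can be illustrated on toy data. The sketch below is written in base R and is not the package's internal code: the data frame `cells` and the helper names `n_min`, `keep` and `train_idx` are hypothetical, and splitting per sample (a stratified split) is an assumption made for illustration.

```r
# Toy stand-in for pooled single-cell data: one row per cell plus a sample label.
# All object names below are hypothetical and not part of Phenoflow.
set.seed(777)  # plays the same role as the `seed` argument

cells <- data.frame(
  sample = rep(c("A", "B", "C"), times = c(500, 300, 800)),
  FL1 = rnorm(1600),
  FL3 = rnorm(1600)
)

# Downsample every sample to the size of the smallest one (the documented default)
n_min <- min(table(cells$sample))
keep <- unlist(lapply(split(seq_len(nrow(cells)), cells$sample),
                      function(idx) sample(idx, n_min)))
cells_ds <- cells[keep, ]

# Keep a fraction p_train of each sample for training; the rest is the test set
p_train <- 0.75
train_idx <- unlist(lapply(split(seq_len(nrow(cells_ds)), cells_ds$sample),
                           function(idx) sample(idx, floor(p_train * length(idx)))))
train <- cells_ds[train_idx, ]
test  <- cells_ds[-train_idx, ]
```

Fixing the seed makes both the downsampling and the partition reproducible, which is why the same `seed` should be reused when comparing models.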

# 1. Example with environmental data:

# Load raw data (imported using flowCore)
data(flowData)

# Format necessary metadata
metadata <- data.frame(names = flowCore::sampleNames(flowData),
  do.call(rbind, lapply(strsplit(flowCore::sampleNames(flowData), "_"), rbind)))
colnames(metadata) <- c("Sample_names", "Cycle_nr", "Location", "day",
  "timepoint", "Staining", "Reactor_phase", "replicate")

# Run Random Forest classifier to predict the Reactor phase based on the
# single-cell FCM data
model_rf <- RandomF_FCS(flowData, sample_info = metadata[1:10, ],
  sample_col = "Sample_names",
  target_label = "Reactor_phase",
  downsample = 10)

# Make a model prediction on new data and report the contingency table of predictions
model_pred <- RandomF_predict(x = model_rf[[1]], new_data = flowData[1], cleanFCS = FALSE)
print(model_pred)

# 2. Example with synthetic community data
# Load flow cytometry data of two strains, with 5,000 cells measured for each
data(flowData_ax)

# Quickly generate the necessary metadata
metadata_syn <- data.frame(name = flowCore::sampleNames(flowData_ax),
                       labels = flowCore::sampleNames(flowData_ax))

# Run Random forest model on 100 cells of each strain
model_rf_syn <-
  RandomF_FCS(
    flowData_ax,
    sample_info = metadata_syn,
    sample_col = "name",
    target_label = "labels",
    downsample = 100,
    plot_fig = TRUE
  )
                        
# Make predictions on each of the samples or on new data of the mixed communities
model_pred_syn <- RandomF_predict(x = model_rf_syn[[1]], new_data = flowData_ax, cleanFCS = FALSE)
print(model_pred_syn)
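
The contingency table and performance statistics mentioned above can be pictured with a small base R sketch. The labels below are made up for illustration and are not output of RandomF_FCS or RandomF_predict; the object names `truth`, `pred`, `conf_mat` and `accuracy` are hypothetical.

```r
# Hypothetical true and predicted labels for six cells (not Phenoflow output)
truth <- factor(c("strainA", "strainA", "strainA", "strainB", "strainB", "strainB"))
pred  <- factor(c("strainA", "strainA", "strainB", "strainB", "strainB", "strainB"),
                levels = levels(truth))

# Cross-tabulate the predictions against the known labels
conf_mat <- table(Predicted = pred, Actual = truth)
print(conf_mat)

# Overall accuracy: correctly classified cells divided by all cells
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

Off-diagonal cells of the table count misclassified cells, so a single strainA cell predicted as strainB shows up in the row "strainB", column "strainA".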