R/sc_feature_filter.R

Defines functions sc_feature_filter

Documented in sc_feature_filter

#' Filter scRNA-seq expression matrix to keep only highly informative features. Integrated pipeline.
#'
#' This pipeline function takes an expression matrix as input and
#' selects the features (genes, transcripts) whose estimated technical noise
#' level is lower than the biological variation in the data.
#' This is achieved by binning the data and calculating, for each bin,
#' the correlation with a reference set of highly expressed (lowest noise) genes
#' (see the vignette for details on the method).
#'
#' The function can optionally produce three plots if \code{print_plots} is \code{TRUE}.
#' It is recommended to open a graphical device (e.g. through \code{pdf} or \code{png}),
#' to call \code{sc_feature_filter}, and then to close the device with \code{dev.off}.
#'
#' @param sc_data A data frame, a matrix or a \code{SingleCellExperiment} object.
#' If a data frame or a matrix, it should contain the features (genes, transcripts) as rows
#' and the cells as columns.
#'
#' @param print_plots A boolean. Should the function produce three plots as a side effect?
#' Plots are the output of \code{\link{plot_mean_variance}}, \code{\link{plot_correlations_distributions}}
#' and \code{\link{plot_metric}}.
#'
#' @param max_zeros A number between 0 and 1. Maximum proportion of cells with 0 expression
#' for a feature to be kept.
#'
#' @param threshold A number higher than 1. The higher the threshold, the more stringent
#' the feature selection will be. See \code{\link{determine_bin_cutoff}}.
#'
#' @param top_window_size Size of the reference bin. See \code{\link{define_top_genes}}.
#'
#' @param other_window_size Size of the other bins of features. See \code{\link{bin_scdata}}.
#'
#' @param n_random Number of control windows generated by shuffling the top bin
#' of features.
#'
#' @param sce_assay If \code{sc_data} is a \code{SingleCellExperiment} object,
#' \code{sce_assay} should be one of \code{names(assays(<SingleCellExperiment>))}.
#'
#' @return A \code{matrix} or a \code{tibble}, depending on the type of \code{sc_data},
#' containing only the top expressed features.
#'
#' @examples
#' sc_feature_filter(scData_hESC)
#'
#' # with plots
#' \dontrun{
#' pdf("diagnostic.pdf")
#' sc_feature_filter(scData_hESC, print_plots = TRUE)
#' dev.off()
#' }
#'
#' @export
sc_feature_filter <- function(
    sc_data,
    print_plots = FALSE,
    max_zeros = 0.75,
    threshold = 2,
    top_window_size = 100,
    other_window_size = 1000,
    n_random = 3,
    sce_assay = NULL
){
    binned_data <- sc_data %>%
        calculate_cvs(max_zeros = max_zeros, sce_assay = sce_assay) %>%
        define_top_genes(window_size = top_window_size) %>%
        bin_scdata(window_size = other_window_size)

    cor_dis <- correlate_windows(binned_data, n_random = n_random)

    metrics <- get_mean_median(cor_dis)

    # Return a matrix unless the input was a data frame
    is_matrix <- !is.data.frame(sc_data)

    filtered_data <- filter_expression_table(
        binned_data,
        bin_cutoff = determine_bin_cutoff(metrics, threshold = threshold),
        as_matrix = is_matrix
    )

    if(print_plots) {
        print(
            plot_mean_variance(binned_data) + annotation_logticks(sides = "l")
        )
        print(
            plot_correlations_distributions(
                correlations_to_densities(cor_dis),
                metrics = metrics
            )
        )
        print(
            plot_metric(metrics, threshold = threshold)
        )
    }

    return(filtered_data)
}
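
The idea described in the documentation above (bin the features, then correlate each bin with a reference window of highly expressed features) can be illustrated with a small base R sketch. This is a toy example on simulated data and not the package's implementation: the simulated matrix, the helper `mean_abs_cor`, the single shuffled control window and the final `observed > threshold * control` rule are all assumptions made for illustration. The actual behaviour is defined by `correlate_windows`, `get_mean_median` and `determine_bin_cutoff`.

```r
# Toy illustration only: base R on simulated data, not the scFeatureFilter code.
set.seed(1)
n_genes <- 500
n_cells <- 50

# Hypothetical data: every gene follows one shared "biological" signal across
# cells, plus technical noise that grows as the mean expression level falls.
signal    <- rnorm(n_cells)
gene_mean <- rexp(n_genes, rate = 0.1) + 1
noise_sd  <- 10 / gene_mean
expr <- t(sapply(seq_len(n_genes), function(i) {
    gene_mean[i] + signal + rnorm(n_cells, sd = noise_sd[i])
}))

# Rank features by decreasing mean expression (rows = features, columns = cells).
expr <- expr[order(rowMeans(expr), decreasing = TRUE), ]

# Reference window of the most highly expressed features, then equal-sized bins.
top_window_size   <- 50
other_window_size <- 150
top    <- expr[seq_len(top_window_size), ]
rest   <- expr[-seq_len(top_window_size), ]
bin_id <- ceiling(seq_len(nrow(rest)) / other_window_size)

# Assumed summary metric: mean absolute Pearson correlation between every
# feature of a bin and every feature of the reference window.
mean_abs_cor <- function(bin, ref) mean(abs(cor(t(bin), t(ref))))

observed <- sapply(split(seq_len(nrow(rest)), bin_id), function(idx) {
    mean_abs_cor(rest[idx, , drop = FALSE], top)
})

# Control window: shuffle each reference feature across cells to break any
# real correlation, then compare it with the intact reference window.
shuffled_top <- t(apply(top, 1, sample))
control      <- mean_abs_cor(shuffled_top, top)

# Assumed decision rule: keep bins whose metric clearly exceeds the background.
threshold <- 2
keep_bin  <- observed > threshold * control
print(round(c(observed, control = control), 3))
print(keep_bin)
```

In this simulation the bins of highly expressed features correlate more strongly with the reference window than the bins of lowly expressed, noisier features, while the shuffled control gives an estimate of the correlation expected by chance.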

scFeatureFilter documentation built on Nov. 8, 2020, 7:49 p.m.