
Overview

This stage imports multiple SingleCellExperiment objects and performs integration using one or several methods.

Config file: config/integration/01_integration.yaml

HTML report target (in config/pipeline.yaml): DRAKE_TARGETS: ["report_integration"]

Example report (used config)

Structure

The steps below are performed for each integration method, plus separately for HVGs with removed cell cycle-related genes (if this removal was requested).

Config parameters

The config for this stage is stored in the config/integration/01_integration.yaml file. The directory containing this file is read from the SCDRAKE_INTEGRATION_CONFIG_DIR environment variable when {scdrake} is loaded or attached, and is saved as the scdrake_integration_config_dir option. This option is used as the default argument value in several {scdrake} functions.
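As a minimal sketch of how this lookup could be reproduced by hand in base R: the option and environment variable names are the ones given above, while the fallback to config/integration is an assumption based on the default path mentioned in the overview, not confirmed {scdrake} behaviour.

```r
# Resolve the integration config directory, preferring the option set by {scdrake}.
opt_dir <- getOption("scdrake_integration_config_dir")  # NULL if the option is not set
env_dir <- Sys.getenv("SCDRAKE_INTEGRATION_CONFIG_DIR", unset = "")

config_dir <- if (!is.null(opt_dir)) {
  opt_dir
} else if (nzchar(env_dir)) {
  env_dir
} else {
  "config/integration"  # assumed fallback, for illustration only
}

config_file <- file.path(config_dir, "01_integration.yaml")
print(config_file)
```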


Integration sources

INTEGRATION_SOURCES:
  - pbmc1k:
      path: "path/to/.drake or path/to/sce.Rds"
      path_type: "drake_cache"
      description: "10x Genomics PBMC 1k dataset"
      hvg_rm_cc_genes: True
      hvg_cc_genes_var_expl_threshold: 5
    pbmc3k:
      path: "path/to/.drake or path/to/sce.Rds"
      path_type: "drake_cache"
      description: "10x Genomics PBMC 3k dataset"
      hvg_rm_cc_genes: False
      hvg_cc_genes_var_expl_threshold: null

Type: list of named lists

This parameter specifies the datasets to be integrated. They are read from single-sample pipeline results (the sce_final_norm_clustering targets), taken either from the corresponding {drake} caches or from SCE objects saved in Rds format. The latter should correspond to sce_final_norm_clustering targets, but you can modify them as needed, e.g. their colData().
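The two kinds of sources could be read roughly as follows. This is a sketch only: load_single_sample() is a hypothetical helper rather than a {scdrake} function, and the "rds" value used in the usage lines is a placeholder, not a documented path_type option. Only "drake_cache" and the target name come from the text above.

```r
# Hypothetical helper, not part of {scdrake}.
load_single_sample <- function(path, path_type = "drake_cache") {
  if (identical(path_type, "drake_cache")) {
    # Read the target by name from the given {drake} cache directory.
    drake::readd("sce_final_norm_clustering", character_only = TRUE, path = path)
  } else {
    # Assumption: any other path_type points to an Rds file holding an SCE object.
    readRDS(path)
  }
}

# Usage with a stand-in object saved as Rds (a real input would be an SCE object).
tmp <- tempfile(fileext = ".Rds")
saveRDS(list(sample = "pbmc1k"), tmp)
obj <- load_single_sample(tmp, path_type = "rds")  # "rds" is a placeholder value
```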

The name of each nested list is the name of a single-sample, e.g. pbmc1k. Parameters for each single-sample are:

Currently, there are some limitations on which datasets can be integrated:

Selection of highly variable genes (HVGs)

This is done prior to integration, and is the same for all methods specified in INTEGRATION_METHODS (see below).

If a single-sample has hvg_rm_cc_genes set to True, cell cycle-related genes will be removed prior to HVG combination and selection.


HVG_COMBINATION_INT: "hvg_metric"

Type: character scalar ("hvg_metric" | "intersection" | "union" | "all")

How to combine HVGs from single-samples:
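As a rough illustration of the two set-based options, the sketch below combines toy HVG sets in base R. It assumes that "intersection" and "union" have their usual set meaning; the gene lists are arbitrary, and the "hvg_metric" and "all" options are not covered here.

```r
# Toy HVG sets per single-sample (arbitrary genes, for illustration only).
hvgs <- list(
  pbmc1k = c("CD3E", "MS4A1", "LYZ", "NKG7"),
  pbmc3k = c("CD3E", "LYZ", "GNLY")
)

hvg_intersection <- Reduce(intersect, hvgs)  # genes that are HVGs in every sample
hvg_union <- Reduce(union, hvgs)             # genes that are HVGs in at least one sample
```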


HVG_SELECTION_INT: "top"
HVG_SELECTION_VALUE_INT: 3000

Type: character scalar, integer scalar

HVG selection strategy. Only relevant when HVG_COMBINATION_INT is "hvg_metric", i.e. this selection will be applied to combined HVG metrics.

These parameters are similar to HVG_SELECTION and HVG_SELECTION_VALUE in the 02_norm_clustering stage of the single-sample pipeline (see vignette("stage_norm_clustering") for more details).
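A minimal sketch of what a "top" selection amounts to, assuming it keeps the N genes with the highest combined HVG metric. The metric values and gene names are invented, and n_top stands in for HVG_SELECTION_VALUE_INT.

```r
# Toy combined HVG metric per gene (values are arbitrary).
hvg_metric <- c(geneA = 0.2, geneB = 1.5, geneC = 0.9, geneD = 0.1)
n_top <- 2  # plays the role of HVG_SELECTION_VALUE_INT (3000 in the config above)

# Sort by metric, highest first, and keep the names of the first n_top genes.
top_hvgs <- head(names(sort(hvg_metric, decreasing = TRUE)), n_top)
```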

Integration methods

INTEGRATION_METHODS:
  - uncorrected:
      pca_selection_method: "forced"
      pca_forced_pcs: 15
      tsne_perp: 20
      tsne_max_iter: 1000
    rescaling:
      pca_selection_method: "forced"
      pca_forced_pcs: 15
      tsne_perp: 20
      tsne_max_iter: 1000
      integration_params:
        log.base: 2
        pseudo.count: 1
    regression:
      pca_selection_method: "corrected"
      pca_forced_pcs: 15
      tsne_perp: 20
      tsne_max_iter: 1000
      integration_params:
        d: 50
    mnn:
      pca_selection_method: "corrected"
      pca_forced_pcs: 15
      tsne_perp: 20
      tsne_max_iter: 1000
      integration_params:
        k: 20
        prop.k: null
        cos.norm: True
        ndist: 3
        d: 50
        merge.order: null
        auto.merge: True
    harmony:
      pca_selection_method: null
      pca_forced_pcs: null
      tsne_perp: 20
      tsne_max_iter: 1000
      integration_params:
        dims.use: 50
        theta: null
        lambda: null
        sigma: 0.1
        nclust: null
        tau: 0
        block.size: 0.05
        max.iter.harmony: 10
        max.iter.cluster: 20
        epsilon.cluster: 0.00001
        epsilon.harmony: 0.0001

Type: list of named lists

Each named list specifies parameters for one integration method. Those in the integration_params list are passed directly to the integration method, while the others are used before or after integration. The "uncorrected" method and at least one integration method must be set (by default, all methods are performed). Here are descriptions and possible parameters for each integration method (parameters shared by multiple methods are described only once):
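The requirement that "uncorrected" plus at least one other method is present can be checked with a few lines of base R. In this sketch, check_integration_methods() is a hypothetical helper, not a {scdrake} function, and the parsed shape of the config (a one-element list holding one named list per method) is an assumption about how the YAML above is read into R.

```r
# Assumed parsed shape: a one-element list holding one named list per method.
integration_methods <- list(list(
  uncorrected = list(pca_selection_method = "forced", pca_forced_pcs = 15),
  mnn = list(pca_selection_method = "corrected", integration_params = list(k = 20, d = 50))
))

# Hypothetical check, not a {scdrake} function.
check_integration_methods <- function(methods) {
  method_names <- unlist(lapply(methods, names))
  has_uncorrected <- "uncorrected" %in% method_names
  has_other <- length(setdiff(method_names, "uncorrected")) >= 1
  has_uncorrected && has_other
}

check_integration_methods(integration_methods)
```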

Integration diagnostics

Fast SNN clustering

Used to obtain basic integration diagnostics. See scran::buildSNNGraph() and bluster::makeSNNGraph() for more details.


INTEGRATION_SNN_K: 10

Type: integer scalar

The number of nearest neighbors to consider during graph construction.


INTEGRATION_SNN_TYPE: "rank"

Type: character scalar ("rank" | "number" | "jaccard")

The type of weighting scheme to use for shared neighbors.


INTEGRATION_SNN_CLUSTERING_METHOD: "walktrap"

Type: character scalar ("walktrap" | "louvain")

The type of SNN graph clustering algorithm. See igraph::cluster_walktrap() and igraph::cluster_louvain() for more details.
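The three parameters above map onto a graph-based clustering of the following kind. This is a sketch on random toy data using bluster::makeSNNGraph() and igraph directly; it requires the {bluster} and {igraph} packages and is not necessarily how {scdrake} calls them internally.

```r
set.seed(1)
# Toy input: 100 cells (rows) x 10 dimensions (columns), standing in for PCs.
pcs <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)

# Build the shared nearest neighbor graph with the defaults shown above
# (INTEGRATION_SNN_K = 10, INTEGRATION_SNN_TYPE = "rank").
snn_graph <- bluster::makeSNNGraph(pcs, k = 10, type = "rank")

# Cluster the graph; igraph::cluster_louvain(snn_graph) is the alternative.
clusters <- igraph::membership(igraph::cluster_walktrap(snn_graph))
table(clusters)
```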

Input files

SELECTED_MARKERS_FILE: null

Type: character scalar or null

Similar to the SELECTED_MARKERS_FILE parameter in vignette("stage_norm_clustering").


INTEGRATION_REPORT_RMD_FILE: "Rmd/integration/01_integration.Rmd"

Type: character scalar

A path to the R Markdown file used for the HTML report of this pipeline stage.

Output files

INTEGRATION_BASE_OUT_DIR: "01_integration"

Type: character scalar

A path to the base output directory for this stage. It will be created under the BASE_OUT_DIR specified in the 00_main.yaml config.


INTEGRATION_SELECTED_MARKERS_OUT_DIR: "selected_markers"
INTEGRATION_REPORT_HTML_FILE: "01_integration.html"

Type: character scalar

Names of files and directories created under INTEGRATION_BASE_OUT_DIR. Subdirectories are not allowed.
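To make the nesting concrete, the sketch below composes the resulting paths with file.path(). The value "output" for BASE_OUT_DIR is hypothetical; the other values are the defaults shown above.

```r
base_out_dir <- "output"  # hypothetical BASE_OUT_DIR value from 00_main.yaml
integration_base_out_dir <- "01_integration"

# Files and directories are placed directly under the stage's base output directory.
report_html <- file.path(base_out_dir, integration_base_out_dir, "01_integration.html")
markers_dir <- file.path(base_out_dir, integration_base_out_dir, "selected_markers")

print(report_html)
print(markers_dir)
```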

HTML output parameters

INTEGRATION_KNITR_MESSAGE: False
INTEGRATION_KNITR_WARNING: False
INTEGRATION_KNITR_ECHO: False

Type: logical scalar

These are passed to knitr::opts_chunk$set() and used for rendering the stage's HTML report.
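With the default values above, the effect is roughly that of the following call near the top of the report. How {scdrake} wires the config values into this call is not shown here, so treat this as an illustration of the knitr side only.

```r
# Suppress messages, warnings, and code echo in all subsequent chunks.
knitr::opts_chunk$set(message = FALSE, warning = FALSE, echo = FALSE)
```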

Outputs

Here you can find descriptions of the most important targets for this stage. For a full overview, inspect the source code of the get_integration_subplan() function.

HTML report target name: report_integration

SingleCellExperiment objects

sce_int_import: list of imported single-sample SCE objects, as defined in the INTEGRATION_SOURCES parameter. Because {drake} cannot watch for changes in a cache, this target is always rerun. Also, some constraints are checked (common normalization method and HVG metrics).


sce_int_raw_snn_clustering: sce_int_import with computed fast SNN clustering for each single-sample, used for integration diagnostics


sce_int_processed: sce_int_import subsetted to the common colData columns and common features (rows), together with their corresponding metadata (hvg_ids, hvg_metric_fit)


sce_int_multibatchnorm: sce_int_processed with SCE objects normalized for inter-batch sequencing depth (batchelor::multiBatchNorm())


sce_int_df: a tibble with integrated SCE objects. Each row corresponds to one method defined in the INTEGRATION_METHODS parameter, either with or without cell cycle-related genes removed from HVGs. See ?sce_int_df_fn for information about what is added or modified in metadata() of each integrated SCE object. The integrated assay is named "integrated", except for Harmony integration, which only computes batch-corrected reduced dimensions (available in reducedDim(sce, "harmony")).
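The accessors involved look like this on a toy SCE object. The object below is built by hand purely for illustration and carries both an "integrated" assay and a "harmony" reduced dimension, which a real integrated object from this pipeline would not have at the same time. How a single SCE object is extracted from a row of sce_int_df is not shown.

```r
library(SingleCellExperiment)

# Toy stand-in for an integrated SCE object: 5 genes x 4 cells.
sce <- SingleCellExperiment(
  assays = list(integrated = matrix(rnorm(20), nrow = 5, ncol = 4)),
  reducedDims = list(harmony = matrix(rnorm(8), nrow = 4, ncol = 2))
)

corrected_expr <- assay(sce, "integrated")  # batch-corrected expression values
harmony_dims <- reducedDim(sce, "harmony")  # batch-corrected reduced dimensions
```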


sce_int_pca_df: sce_int_df with computed PCA and selected number of PCs. Also includes PC selection statistics and plot


sce_int_clustering_df: sce_int_pca_df with computed SNN clustering used for integration diagnostics


sce_int_dimred_df: sce_int_pca_df with computed t-SNE and UMAP dimensionality reductions

Highly variable genes (HVGs) selection

hvg_int, hvg_int_with_cc: lists of HVGs and selection parameters. The latter is not NULL when any of the single-samples defined in the INTEGRATION_SOURCES parameter has hvg_rm_cc_genes set to True.


hvg_plots_int_df: a tibble with HVG diagnostic plots

Plots

sce_int_dimred_plots_df: a tibble with dimensionality reduction plots

Selected markers

selected_markers_int_plots_df: a tibble with plots of selected markers, similar to selected_markers_plots in the 02_norm_clustering stage of the single-sample pipeline

selected_markers_plots_files_out: make this target to export the plots



bioinfocz/scdrake documentation built on Jan. 29, 2024, 10:24 a.m.