
Overview

A stage for calculation, visualization and reporting of cell cluster markers ("global markers"). This stage is common to both single-sample and integration pipelines.

Config files: config/single_sample/cluster_markers.yaml, config/integration/cluster_markers.yaml

HTML report target (in config/pipeline.yaml): DRAKE_TARGETS: ["report_cluster_markers"]

Example report for PBMC 1k data (used config)

Example report for integrated data (used config)

Structure


Marker calculation and interpretation

Taken from OSCA, slightly modified:

To interpret our clustering results, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways.

Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster. These choices parametrize the theoretical differences between the various marker detection strategies.

Used statistical tests

For each cell grouping and gene, three distinct, adjustable statistical tests are computed through scran::findMarkers(): the t-test, the Wilcoxon test, and the binomial test.

These tests are performed in a pairwise fashion within each grouping, i.e. each level of the grouping is tested against every other level.

Example: for k-means clustering with $k = 3$, three pairwise tests of each type are computed (cluster 1 vs. 2, 1 vs. 3, and 2 vs. 3).
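The pairwise scheme can be enumerated with base R. This is only an illustrative sketch: the cluster labels are made up, and it does not call scran or {scdrake}.

```r
# Hypothetical cluster labels for k-means clustering with k = 3.
clusters <- c("1", "2", "3")

# Every unordered pair of clusters, one pair per row.
pairs <- t(utils::combn(clusters, 2))
colnames(pairs) <- c("first", "second")

print(pairs)
# The number of pairwise comparisons per test type is choose(k, 2).
n_tests <- nrow(pairs)
```

With three test types, this grouping therefore yields three comparisons per test type.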

The overall p-values, false discovery rates (FDR), and effect sizes (log2 fold change (LFC) for the t-test and binomial test, and area under the curve (AUC) for the Wilcoxon test) are then calculated for each level of the cell grouping from all of its pairwise tests (scran::combineMarkers()).

Narrowing down the number of cluster markers

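A more or less stringent approach can be used to narrow down the number of cluster markers, based on results from individual pairwise tests: a gene can be required to be differentially expressed against any, some, or all of the other clusters. This choice also affects the combined p-value and FDR.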

This can be controlled via the PVAL_TYPE parameter.
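To give a feel for how stringency changes the combined p-value, here is a base R sketch for a single gene. It assumes the combination rules described in the scran documentation (Simes' method for "any", the maximum p-value for "all"); the p-values are invented, and the "some" variant is left out.

```r
# Hypothetical p-values of one gene from its three pairwise comparisons.
pairwise_p <- c(0.001, 0.20, 0.04)

# Simes' method: minimum over the sorted p-values of p_(i) * n / i.
simes <- function(p) min(sort(p) * length(p) / seq_along(p))

# "any": the gene only needs to stand out in at least one comparison.
p_any <- simes(pairwise_p)
# "all": the gene must stand out in every comparison, so the largest
# pairwise p-value decides.
p_all <- max(pairwise_p)

print(c(any = p_any, all = p_all))
```

The "all" setting is the stricter one: a single weak comparison is enough to push the combined p-value up.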

Handling blocking factors

Blocking factors (e.g. batch effect, sex differences, cell cycle phases) can be handled by further nesting. For that, each pairwise test is performed separately in each level of the blocking factor. Then, p-values from individual levels' tests are combined, and the final combined p-values are obtained by the method of choice (see above).

This can be controlled via the BLOCK_COLUMN parameter.


Config parameters

Cluster markers config is stored in the config/single_sample/cluster_markers.yaml and config/integration/cluster_markers.yaml files (the location of this file is different for the single-sample and integration pipelines). As with the pipeline config, the directory containing this file is read from environment variables:

Options named in lowercase are set upon {scdrake} load or attach. The actual directory used then depends on whether you run run_single_sample_r() or run_integration_r().


Cluster markers

CLUSTER_MARKERS_SOURCES_DEFAULTS:
  COMMON_PARAMS:
    PLOT_DIMREDS: ["umap"]
    BLOCK_COLUMN: null
  PARAMS_T:
    LFC_DIRECTION: "up"
    LFC_TEST: 0
    PVAL_TYPE: "any"
    MIN_PROP: null
    STD_LFC: False
    TOP_N_HEATMAP: 10
    TOP_N_WT_HEATMAP: "top"
    TOP_N_PLOT: 5
    TOP_N_WT_PLOT: "top"
  PARAMS_WILCOX:
    LFC_DIRECTION: "up"
    LFC_TEST: 0
    PVAL_TYPE: "any"
    MIN_PROP: null
    STD_LFC: null
    TOP_N_HEATMAP: 10
    TOP_N_WT_HEATMAP: "top"
    TOP_N_PLOT: 5
    TOP_N_WT_PLOT: "top"
  PARAMS_BINOM:
    LFC_DIRECTION: "up"
    LFC_TEST: 0
    PVAL_TYPE: "any"
    MIN_PROP: null
    STD_LFC: null
    TOP_N_HEATMAP: 10
    TOP_N_WT_HEATMAP: "top"
    TOP_N_PLOT: 5
    TOP_N_WT_PLOT: "top"

Type: named list (dictionary) of named lists

Default parameters for computation and reporting of cluster markers. These can be overridden for each cluster markers source in CLUSTER_MARKERS_SOURCES (see below).


CLUSTER_MARKERS_SOURCES:
  - markers_cluster_graph_leiden_r0.4:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Cluster markers for Leiden clustering (r = 0.4)"
    markers_cluster_graph_leiden_r0.8:
      source_column: "cluster_graph_leiden_r0.8"
      description: "Cluster markers for Leiden clustering (r = 0.8)"

Type: list of named lists

This parameter specifies the cell groupings ("sources") in which markers are searched for, together with their test and output parameters. You can use any categorical column: results of cell clustering, phase (cell cycle), etc., or custom groupings defined by CELL_GROUPINGS in 02_norm_clustering.yaml and 02_int_clustering.yaml. See clustering targets for a list of clustering names.

Let's examine the first entry (markers_cluster_graph_leiden_r0.4):

In each entry you can override parameters in CLUSTER_MARKERS_SOURCES_DEFAULTS (by using lowercase names). For example, let's consider the following scenario:

CLUSTER_MARKERS_SOURCES:
  - markers_cluster_graph_leiden_r0.4:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Cluster markers for Leiden clustering (r = 0.4)"
      common_params:
        plot_dimreds: ["umap", "pca"]
      params_t:
        pval_type: "some"
        top_n_wt_heatmap: "fdr"
        top_n_wt_plot: "fdr"

Here, params_t and common_params will override the corresponding parameters in PARAMS_T and COMMON_PARAMS, respectively, in CLUSTER_MARKERS_SOURCES_DEFAULTS.

Watch out for the proper indentation. See the "Merging of nested named lists" section in vignette("scdrake_config").
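The effect of such an override can be mimicked in plain R with utils::modifyList(), which merges nested lists recursively. This is a sketch only, not the merging code {scdrake} actually uses; the upper_names() helper is hypothetical and the lists are trimmed-down copies of the config above.

```r
# Trimmed-down defaults, mirroring CLUSTER_MARKERS_SOURCES_DEFAULTS.
defaults <- list(
  COMMON_PARAMS = list(PLOT_DIMREDS = "umap"),
  PARAMS_T = list(PVAL_TYPE = "any", TOP_N_HEATMAP = 10, TOP_N_WT_HEATMAP = "top")
)

# Per-source overrides, written in lowercase as in the config entry.
overrides <- list(
  common_params = list(plot_dimreds = c("umap", "pca")),
  params_t = list(pval_type = "some", top_n_wt_heatmap = "fdr")
)

# Hypothetical helper: uppercase list names at every nesting level.
upper_names <- function(x) {
  if (!is.list(x)) return(x)
  names(x) <- toupper(names(x))
  lapply(x, upper_names)
}

# Overridden values replace the defaults; untouched defaults are kept.
merged <- utils::modifyList(defaults, upper_names(overrides))
str(merged)
```

Only the listed parameters change: TOP_N_HEATMAP keeps its default of 10, while PVAL_TYPE becomes "some". A mis-indented override would sit at the wrong nesting level and would not replace the intended default.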


MAKE_CLUSTER_MARKERS_PLOTS: True

Type: logical scalar

Set to False to skip generating marker plots. In that case, empty files will be created for compatibility reasons.


Input files

CLUSTER_MARKERS_TABLE_TEMPLATE_RMD_FILE: "Rmd/common/cluster_markers_table_template.Rmd"
CLUSTER_MARKERS_REPORT_RMD_FILE: "Rmd/common/cluster_markers.Rmd"

Type: character scalar

Paths to RMarkdown files used for HTML reports of this pipeline stage.


Output files

CLUSTER_MARKERS_BASE_OUT_DIR: "cluster_markers"

Type: character scalar

A path to the base output directory for this stage. It will be created under BASE_OUT_DIR specified in the 00_main.yaml config.


CLUSTER_MARKERS_REPORT_HTML_FILE: "cluster_markers.html"
CLUSTER_MARKERS_HEATMAPS_OUT_DIR: "cluster_markers_heatmaps"
CLUSTER_MARKERS_PLOTS_BASE_OUT_DIR: "cluster_markers_plots"
CLUSTER_MARKERS_DIMRED_PLOTS_BASE_OUT_DIR: "cluster_markers_dimred_plots"
CLUSTER_MARKERS_TABLES_OUT_DIR: "cluster_markers_tables"

Type: character scalar

Names of files and directories created under CLUSTER_MARKERS_BASE_OUT_DIR. Subdirectories are not allowed.

Outputs

Here you can find a description of the most important targets for this stage. For a full overview, however, you have to inspect the source code of the get_cluster_markers_subplan() function.

SingleCellExperiment objects

sce_dimred_cluster_markers, sce_final_cluster_markers: SCE objects with computed dimensionality reductions. Used for generation of marker plots and heatmaps

sce_cluster_markers: Final SCE object with cell clusterings from the 02_norm_clustering (single-sample pipeline) or 02_int_clustering stage (integration pipeline). Used to compute cluster markers. You can inspect the cell groupings which can be used for marker computation:

drake::loadd(sce_final_cluster_markers)
SingleCellExperiment::colData(sce_final_cluster_markers)

Tibbles with parameters

Usually, each row of these tibbles is passed to a function within the {drake} plan.

cluster_markers_params: all parameters for computation of cluster markers

The following tibbles are derived from cluster_markers_params: cluster_markers_test_params, cluster_markers_heatmap_params, cluster_markers_plot_params, cluster_markers_dimred_plot_params

Tibbles with cluster markers test results

cluster_markers_raw: this tibble has the same columns as cluster_markers_test_params, except that a list column markers (of <DataFrame>s) with test results is added, and each level of source_column is expanded to a separate row.

So for example, if a row has group_level = 1, its markers DataFrame contains test results of this level versus all other ones.

See scran::combineMarkers() for more details on the output format, and scran_markers() for a description of changes made to the output of the former function.


cluster_markers: same as cluster_markers_raw, but for Wilcoxon tests, the combined LFC is added to the DataFrames in the markers column


cluster_markers_processed: same as cluster_markers, but markers DataFrames don't contain nested DataFrames (lfc_* or auc_*) - those are replaced by combined effect sizes. This way you can inspect the effect sizes of comparisons with all other levels.

cluster_markers_out: holds the same data as cluster_markers_processed, but markers are coerced to dataframes and column names are normalized to snake_case


cluster_markers_for_tables: a summarized tibble glued from cluster_markers_out, cluster_markers_heatmaps_df, and cluster_markers_plots_top. Marker tables in markers column are in publish-ready forms for HTML output, e.g. there are <a> links to ENSEMBL or PDF files of marker plots.

Heatmaps

seu_for_cluster_markers_heatmaps: a Seurat object used to generate heatmaps through marker_heatmap() (a wrapper around Seurat::DoHeatmap()). The scale.data slot of the RNA assay holds UMI counts transformed to z-scores.


cluster_markers_heatmaps_df: a tibble derived from cluster_markers_heatmap_params, but enriched with cluster markers test results

Two heatmaps are made for each row in cluster_markers_heatmaps_df: one with log2 UMI counts, and a second one with the same values transformed to z-scores.
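The relation between the two heatmap inputs can be sketched in base R. This illustrates a per-gene z-score only; it is not the Seurat scaling code the pipeline uses. The count matrix and the pseudocount of 1 are assumptions made for the example.

```r
# Hypothetical UMI counts: genes in rows, cells in columns.
counts <- matrix(
  c(0, 1, 3, 7,
    2, 2, 6, 14),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)

# Log2 transform; the pseudocount of 1 is assumed here to avoid log2(0).
log_counts <- log2(counts + 1)

# scale() centers and scales columns, so transpose to standardize each gene
# across cells, then transpose back.
z_scores <- t(scale(t(log_counts)))

print(round(z_scores, 2))
```

After this transformation each gene has mean 0 and standard deviation 1 across cells, which makes genes with very different expression levels comparable within one heatmap.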

For storage reasons, heatmap objects are not saved as R objects, but only exported to PDF. However, you can obtain a tibble holding these objects with:

# Load both the Seurat object and the heatmap parameters from the drake cache.
drake::loadd(seu_for_cluster_markers_heatmaps, cluster_markers_heatmaps_df)
# Rebuild the heatmaps in memory instead of saving them to files.
heatmaps_tbl <- marker_heatmaps_wrapper(
  seu = seu_for_cluster_markers_heatmaps,
  params = cluster_markers_heatmaps_df,
  marker_type = "global",
  save = FALSE,
  return_type = "tibble"
)

In heatmaps_tbl, two list columns are appended to cluster_markers_heatmaps_df:

For more control, you can use the marker_heatmap() function, which marker_heatmaps_wrapper() wraps.

Marker plots

cluster_markers_plots_top: a tibble with markers for which plots will be made. See ?markers_plots_top for details.

As with heatmaps, plot objects ({ggplot2}, {patchwork}) are not saved, but the underlying function can be used:

drake::loadd(sce_dimred_cluster_markers, cluster_markers_plots_top)
markers_plots_files(sce_dimred_cluster_markers, cluster_markers_plots_top, save = FALSE, return_type = "tibble")

Similarly, the actual plotting function can be used; see ?marker_plot.

Dimensionality reduction plots

cluster_markers_dimred_plots: a tibble with dimensionality reduction plots for each source_column in the CLUSTER_MARKERS_SOURCES parameter and each of PLOT_DIMREDS in COMMON_PARAMS

Other targets

config_cluster_markers: a list holding parameters for this stage



bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.