``` {r, echo = FALSE} knitr::opts_chunk$set(eval = FALSE)

```r
## -- We will initialize a scdrake project in temporary directory, but the user will do it in home.
tmp_dir <- tempfile()
scdrake::init_project(
  tmp_dir, set_active_project = FALSE, set_wd = FALSE, ask = FALSE, download_example_data = TRUE
)

{scdrake} offers two pipelines - one for single-sample, and second one for integration of multiple samples (which were processed by the single-sample pipeline before). As for now, each pipeline consists of two subpipelines (referred to as stages), and two stages common to both single-sample and integration pipelines. Details, along with graphical structure, are available in vignette("pipeline_overview").

In this guide you will see how to quickly setup {scdrake} and run the first stage (01_input_qc) of its single-sample pipeline using the provided example data. Once you familiarize yourself with {scdrake} basics, we will guide you through the second stage of the single-sample pipeline (02_norm_clustering) that demonstrates cell clustering and annotation. Later you can continue with guide for the integration pipeline in vignette("scdrake_integration").

If possible, we will show how to follow the steps from within R and through the command line interface (CLI, see vignette("scdrake_cli") for a short overview).

We strongly recommend to use the {scdrake}'s Docker image as it is well tested. Even if you are familiar with Docker or Singularity, please, refer to vignette("scdrake_docker") as there are given some critical parameters.


The three steps

After installation, you basically need the three steps described below to run the {scdrake} pipeline.

Step 1: initialize a new project {.tabset}

{scdrake} is a project-based package, so the first step is to initialize a new project. Project simply means a directory in which the analysis of your data will take place. That also means:

Now we initialize a new {scdrake} project directory.

In R

First we load the {scdrake} package, and then call the init_project() function:

library(scdrake)
init_project("~/scdrake_projects/pbmc1k", download_example_data = TRUE)

On command line

mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
scdrake --download-example-data init-project

On command line (Docker)

We assume that you are running a detached container that has a shared directory mounted as /home/rstudio/scdrake_projects (as described in vignette("scdrake_docker")).

mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --download-example-data init-project

On command line (Singularity)

mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --download-example-data init-project

{-}

This will:

Important: whenever you will be running the {scdrake} pipeline, make sure your working directory is set to the project's root. Although you can specify the project directory in CLI using the -d / --dir parameter, we rather recommend to always be present in the project's root directory before you issue {scdrake} commands (this also conforms the project-based approach). You can notice in the examples that the -d parameter is never used.

Let's inspect files in the project directory:

{.tabset}

In R

fs::dir_tree(all = TRUE)

On command line

If you have the tree tool installed (sudo apt install tree to fix it):

tree

Otherwise:

Rscript -e 'fs::dir_tree("~/scdrake_projects/pbmc1k")'

On command line (Docker)

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  Rscript -e 'fs::dir_tree()'

On command line (Singularity)

singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    Rscript -e 'fs::dir_tree()'

{-}

fs::dir_tree(tmp_dir, all = TRUE)

The most important files and directories are:


Step 2: modify the configuration files

In this three step guide we are going to run the first stage (01_input_qc) of the single-sample pipeline. This stage imports scRNA-seq data, computes per-cell quality control metrics (total number of UMI, number of detected genes, percentage of expressed mitochondrial genes) and performs filtering, which can be either based on median absolute deviation (dataset-sensitive filtering, default) or custom thresholds (custom filtering). You can read more about this stage and its analysis steps, parameters and outputs in vignette("stage_input_qc").

The configuration files for all stages are stored in the config/ directory. At this time we only need to modify a single parameter in the config file for the 01_input_qc stage.

To do so, open config/single_sample/01_input_qc.yaml and set value of path inside INPUT_DATA to "example_data/pbmc1k" such that it points to the directory with PBMC 1k example dataset, which has been downloaded on project initialization:

INPUT_DATA:
  type: "cellranger"
  path: "example_data/pbmc1k"
  delimiter: ","
  target_name: "target_name"

Important: all paths in configs must be relative to project's root directory, or absolute (not recommended), unless otherwise stated.


Step 3: run the single-sample pipeline {.tabset}

Now we are ready to execute the single-sample pipeline. {drake} pipelines are composed of individual steps called targets. Each target is represented by R expression that returns its value (R object). When target is finished, its value is saved into cache directory and can be used by other targets or load by the user into the current R session.

{drake} allows to execute the pipeline such that specific targets are made. This is controlled by the DRAKE_TARGETS parameter in the config/pipeline.yaml file (see vignette("config_pipeline")).

The default value of DRAKE_TARGETS is report_input_qc, which represents the final target of the 01_input_qc stage - a HTML report.

All files from this stage will be saved under the output/single_sample directory. For the report, it is output/single_sample/01_input_qc/01_input_qc.html. The output directory is specified by the BASE_OUT_DIR parameter in config/single_sample/00_main.yaml.

The 00_main.yaml config stores parameters that are common to all stages of the single-sample or integration pipeline, e.g. titles or organism (see vignette("config_main")).

Now let's run the pipeline:

In R

run_single_sample_r()

On command line

scdrake --pipeline-type single_sample run

On command line (Docker)

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run

On command line (Singularity)

singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --pipeline-type single_sample run

{-}

Now you can inspect the HTML report located in output/single_sample/01_input_qc/01_input_qc.html. It gives you a summary view on the quality of the dataset and shows how each filtering type affects the number of cells. The report is mostly self-explanatory as it explains all analysis steps that took place and their visual outputs.


Recapitulation

In the three steps above you have seen the basic functionality of {scdrake}. We can briefly summarize it:

Each stage or general config has its own vignette describing analysis steps, parameters and outputs (targets - R objects, HTML reports):

You can navigate these vignettes in the top bar in the Articles drop-down menu.

For the integration pipeline you might be interested in its guide in vignette("scdrake_integration")

Note that the default config files are meant for the example PBMC data, and so before you analyse your own data, you should read the section below about important steps before you do so.


Modifying parameters and rerunning the pipeline

Let's practice a bit more with {scdrake} and modify a cell filtering parameter. This will also demonstrate the {drake}'s ability to skip finished targets.

Let's assume you decided to be more strict in dataset-sensitive cell filtering and want to use a MAD threshold of 2. At the same time, you want to save the report into a different file so it can be compared with the previous, more benevolent filtering. Now open config/single_sample/01_input_qc.yaml and change:

MAD_THRESHOLD: 2
INPUT_QC_BASE_OUT_DIR: "01_input_qc_strict"

After repeating the Step 3 above you can see the most time-consuming targets sce_raw and empty_droplets were skipped. Also, we simply output the HTML report for the new parameter into a different directory (output/single_sample/01_input_qc_strict/01_input_qc.html), and so we can easily compare it with the previous report. If you open the new report, you can see that more cells were filtered out.


Cell clustering and annotation

Now we will look at perhaps the most important outcomes of a scRNA-seq data analysis: cell clustering and cell type annotation. These particular procedures are implemented in the second stage 02_norm_clustering of the single-sample pipeline (see vignette("stage_norm_clustering")). Clustering and annotation is preceded by other necessary steps, most importantly: normalization, highly variable genes selection, PCA calculation, and dimensionality reduction using principal components (UMAP, t-SNE). A similar stage is also available in the integration pipeline: 02_int_clustering (see vignette("stage_int_clustering")).

Initially, we won't make any changes to the config/single_sample/02_norm_clustering.yaml config and keep its sensible defaults:

What we need in order to perform the 02_norm_clustering stage is just to change targets for {drake}: open config/pipeline.yaml and change DRAKE_TARGETS to ["report_norm_clustering", "report_norm_clustering_simple"]

These targets will make two reports for this stage. The first one includes technical details about important analysis steps, while the second one is simplified and limited to dimensionality reduction plots.

Now run the pipeline using the familiar command from the Step 3.

Then you can go through the report located in the output/single_sample/02_norm_clustering/02_norm_clustering.html file. The report provides all necessary information about the performed analysis steps, and graphical outputs, most importantly dimensionality reduction plots displaying cell-cluster membership.

Also, at the end of the report you can find results from the automatic cell type annotation. You can see that clusters represents the main immune cell types. If you are interested in markers of the predicted cell types, you can click on the Marker heatmaps PDF link.

Visualizing markers of interest

In {scdrake} there is a convenient way how to visualize expression of markers of interest in reduced dimensions. It simply uses a CSV file that defines groups of markers given by their symbol. This is an example of such CSV file (selected_markers.csv) that is bundled with {scdrake} and automatically copied to the root of a new project on its initialization:

Naive_CD4+_T,IL7R:CCR7
Memory_CD4+,IL7R:S100A4
CD14+_Mono,CD14:LYZ
B,MS4A1
CD8+_T,CD8A
FCGR3A+_Mono,FCGR3A:MS4A7
NK,GNLY:NKG7
DC,FCER1A:CST3
Platelet,PPBP

The first column specifies a group of markers (cell types in this case) and the second column lists the marker symbols separated by colon. Note that there is no header.

To enable usage of this file and subsequent generation of expression plots, in config/single_sample/02_norm_clustering.yaml simply change SELECTED_MARKERS_FILE to "selected_markers.csv"

Now you have two options how to tell {scdrake} to make the expression plots (PDFs) using the CSV file:

There are three PDFs with different reduced dimensions (PCA, t-SNE, UMAP) and each contains a matrix of expression plots of genes supplied in the CSV file.

Assigning labels to clusters

When identities of clusters are known, it is possible to transform cluster numbers into more informative cell type names. For that there is a handy parameter CELL_GROUPINGS in config/single_sample/02_norm_clustering.yaml.

Let's say you want to manually annotate some clusters in Leiden clustering with resolution 0.4 and 0.6. For that you can change the parameter as follows (please, do not search for any biological relevance here - the used annotation is solely purposed to demonstrate the usage of this parameter):

CELL_GROUPINGS:
  - cluster_graph_leiden_r0.4_annotated:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Graph-based clustering (Leiden alg., r = 0.4), annotated clusters"
      assignments:
        1: "memory CD4+"
        6: "B"
        7: "memory CD4+"
    cluster_graph_leiden_r0.6_annotated:
      source_column: "cluster_graph_leiden_r0.6"
      description: "Graph-based clustering (Leiden alg., r = 0.6), annotated clusters"
      assignments:
        2: "TNK"
        3: "CD4 naive"

There are two items in the CELL_GROUPINGS, and each of them is referencing Leiden clustering with different resolution. The first item/grouping has the following structure:

These new cell groupings can be then referenced in other parameters. For example, you can add them to the NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER list, which specifies variables to color by cells in dimensionality reduction plots:

NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
  - "phase": "Cell cycle phases"
    "doublet_score": "Doublet score"
    "total": "Total number of UMI"
    "detected": "Detected number of genes"
    "cluster_graph_leiden_r0.4_annotated": null
    "cluster_graph_leiden_r0.6_annotated": null

For the new groupings we are not specifying the plot titles - those are taken from CELL_GROUPINGS from the description item.

Again, there are two options how to generate the plots:

It is also possible to assign custom cell metadata from a CSV file - see the ADDITIONAL_CELL_DATA_FILE parameter in vignette("stage_norm_clustering") or "I want to manually annotate cells" section in vignette("scdrake_faq").

Note that CELL_GROUPINGS and ADDITIONAL_CELL_DATA_FILE are also available in the 02_int_clustering stage of the integration pipeline.

Calculation of marker genes

Besides the visualization of marker genes of interest, one can also calculate statistical tests to discover markers that drive the separation of cells into clusters. We won't go into details here; see vignette("stage_cluster_markers") that describes the cluster_markers stage, which is available for both single-sample and integration pipelines.

It is also possible to perform differential gene expression between clusters (or in general, groups of cells) through the contrasts stage (see vignette("stage_contrasts")).

Note that in these stages you can reference cell groupings that were defined in the CELL_GROUPINGS parameter.


Important: before you analyse your own data

The default config files are designed to work with the provided example data. The parameters mostly default to arguments in underlying functions in packages that are called by {scdrake} or are set according to common/best analysis practices (e.g. those described in OSCA). Anyway, it is worth saying that:

Below you can find important parameters which you should review before you run the pipeline on your data for the first time. All parameters are documented in vignettes for their respective stages or general configs.

Pipeline config (pipeline.yaml) {.unlisted}

-> vignette("config_pipeline")

Main config (00_main.yaml) {.unlisted}

-> vignette("config_main")

Single-sample / Input QC stage (01_input_qc.yaml) {.unlisted}

-> vignette("stage_input_qc")

Single-sample / Normalization and clustering stage (02_norm_clustering.yaml) {.unlisted}

-> vignette("stage_norm_clustering")

Integration / Integration stage (01_integration.yaml) {.unlisted}

-> vignette("stage_integration")

Integration / Post-integration clustering stage (02_int_clustering.yaml) {.unlisted}

-> vignette("stage_int_clustering")

Common / Cluster markers and contrasts stages (cluster_markers.yaml, contrasts.yaml) {.unlisted}

-> vignette("stage_cluster_markers"), vignette("stage_contrasts")


Updating the project files {.tabset}

When a new version of {scdrake} is released, you can update your project with (assuming your working directory is in a {scdrake} project's root):

In R

update_project()

On command line

scdrake update-project

On command line (Docker)

docker exec -it -u rstudio -w /path/to/scdrake/project <CONTAINER ID or NAME> \
  scdrake update-project

On command line (Singularity)

singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/project" \
    path/to/scdrake_image.sif \
    scdrake update-project

{-}

This will overwrite project files by the package-bundled ones:

By default, you will be asked if you want to continue, as you might lose your local modifications.


Going further

Now you should know the {scdrake} basics:

For the integration pipeline you might be interested in its guide in vignette("scdrake_integration").

For the full insight into {scdrake} you can also read the following vignettes:


## -- This is a cleanup after the vignette is built.
fs::dir_delete(tmp_dir)


bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.