```{r, echo = FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

***

In this guide you will see how to integrate two datasets. The prerequisites are:

- A project initialized within the quick start guide (`vignette("scdrake")`) that should live in the
  `~/scdrake_projects/pbmc1k` directory.
- You have successfully run the pipeline for the `report_norm_clustering` or `report_norm_clustering_simple` target(s).

> For Docker we assume that the container has a shared directory mounted as `/home/rstudio/scdrake_projects`,
as described in `vignette("scdrake_docker")`.
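
Before continuing, it can help to confirm that the first project is in place. The following is a minimal sketch, not part of `scdrake` itself: `check_project` is a hypothetical helper, and it assumes that the `drake` cache of a project lives in a `.drake` subdirectory (the `drake` default), which may differ if you changed the cache location.

```bash
# Sketch: check that a single-sample project directory is ready for integration.
# ASSUMPTION: the drake cache is stored in a ".drake" subdirectory of the project.
check_project() {
  project_dir="$1"
  if [ ! -d "$project_dir" ]; then
    echo "missing project directory: $project_dir"
    return 1
  fi
  if [ ! -d "$project_dir/.drake" ]; then
    echo "no drake cache found in: $project_dir"
    return 2
  fi
  echo "ok: $project_dir"
}

# Check the project created in the quick start guide.
check_project "$HOME/scdrake_projects/pbmc1k" || true
```

A missing cache usually means the single-sample pipeline has not been run in that project yet.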

***

The integration pipeline starts by importing `SingleCellExperiment` (SCE) objects from the `drake` caches of the underlying
single-sample analyses. These objects are the final ones from the `02_norm_clustering` stage, that is,
normalized, with known highly variable genes and clusters, and with computed reduced dimensions.

### Prepare the second sample - PBMC 3k {.tabset}

As a second sample for the integration pipeline we will use another dataset from 10x Genomics - PBMC 3k.
To stick to the project-based approach, we will initialize a new `scdrake` project:

#### In R

```r
init_project("~/scdrake_projects/pbmc3k")
```

If not done automatically, change your RStudio project or switch the current working directory to the project's root.

#### On command line

```bash
mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
scdrake init-project
```

#### On command line (Docker)

```bash
mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
  scdrake init-project
```

#### On command line (Singularity)

```bash
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc3k
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc3k" \
    path/to/scdrake_image.sif \
    scdrake init-project
```

### {-}

Now we will repeat the steps we have already done for the PBMC 1k sample. In `~/scdrake_projects/pbmc3k`:

The config modifications for the second sample are ready, so let's run the pipeline:

### {.tabset}

#### In R

```r
run_single_sample_r()
```

#### On command line

```bash
scdrake --pipeline-type single_sample run
```

#### On command line (Docker)

```bash
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run
```

#### On command line (Singularity)

```bash
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc3k" \
    path/to/scdrake_image.sif \
    scdrake --pipeline-type single_sample run
```

### {-}


### Running the integration pipeline {.tabset}

The configuration file for the integration pipeline is located in `config/integration/01_integration.yaml` (see `vignette("stage_integration")`). By default, four integration methods are enabled (you can disable them in the `INTEGRATION_METHODS` parameter), plus the `uncorrected` method. The latter is mandatory because it is used later in the `cluster_markers` and `contrasts` stages; `uncorrected` only performs batch-specific correction for sequencing depth via `batchelor::multiBatchNorm()`. Both `uncorrected` and at least one integration method must always be enabled.

First, as before for the individual samples, we will also initialize a new `scdrake` project for the integration analysis:

#### In R

```r
init_project("~/scdrake_projects/pbmc_integration")
```

#### On command line

```bash
mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
scdrake init-project
```

#### On command line (Docker)

```bash
mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
  scdrake init-project
```

#### On command line (Singularity)

```bash
mkdir -p home/${USER} scdrake_projects/pbmc_integration
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc_integration" \
    path/to/scdrake_image.sif \
    scdrake init-project
```

### {-}

Now we modify configs for the integration pipeline:

And let's run the pipeline.

### {.tabset}

#### In R

```r
run_integration_r()
```

#### On command line

```bash
scdrake --pipeline-type integration run
```

#### On command line (Docker)

```bash
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
  scdrake --pipeline-type integration run
```

### {-}

The output is saved in `output/integration`, as specified by `BASE_OUT_DIR` in `config/integration/00_main.yaml`. For the `01_integration` stage, you can find its final report in `output/integration/01_integration/01_integration.html`.
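
To check from the command line whether the report was produced, something like the following sketch could be used. `stage_report_path` is a hypothetical helper, not part of the `scdrake` CLI, and it assumes that other stages follow the same `<base output dir>/<stage>/<stage>.html` pattern as the `01_integration` report, which the text above confirms only for that stage.

```bash
# Sketch: build the expected report path for a stage and test whether it exists.
# ASSUMPTION: reports follow the <base_out_dir>/<stage>/<stage>.html pattern.
stage_report_path() {
  base_out_dir="$1"
  stage="$2"
  printf '%s/%s/%s.html\n' "$base_out_dir" "$stage" "$stage"
}

# Run this from the root of the integration project.
report="$(stage_report_path output/integration 01_integration)"
if [ -f "$report" ]; then
  echo "Report found: $report"
else
  echo "Report not found yet: $report"
fi
```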

You can try to load the target `sce_int_dimred_df` (a `tibble` object) containing integrated `SingleCellExperiment` objects with computed reduced dimensions:

```r
drake::loadd(sce_int_dimred_df)
```

### Post-integration clustering and cell annotation

The post-integration clustering stage (see `vignette("stage_int_clustering")`) basically replicates the clustering, cell annotation and visualization parts of the `02_norm_clustering` stage of the single-sample pipeline. It uses a `SingleCellExperiment` object from a selected integration method specified in the `INTEGRATION_FINAL_METHOD` parameter in `config/integration/02_int_clustering.yaml`.

You can also try to run the post-integration clustering stage by setting `DRAKE_TARGETS` to `["report_int_clustering"]`. By default, the result from the `mnn` (mutual nearest neighbors) integration method is used.
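
Before rerunning the pipeline, you may want to confirm what these two parameters are currently set to. The sketch below simply greps the config files; `show_param` is a hypothetical helper, and the location of `DRAKE_TARGETS` in `config/pipeline.yaml` is an assumption, so adjust the path if your project stores it elsewhere.

```bash
# Sketch: print the line(s) that set a given parameter in a YAML config file.
show_param() {
  param="$1"
  config_file="$2"
  grep -n "^[[:space:]]*${param}:" "$config_file"
}

# Run this from the root of the integration project.
# ASSUMPTION: DRAKE_TARGETS is defined in config/pipeline.yaml.
show_param DRAKE_TARGETS config/pipeline.yaml \
  || echo "DRAKE_TARGETS not found"
show_param INTEGRATION_FINAL_METHOD config/integration/02_int_clustering.yaml \
  || echo "INTEGRATION_FINAL_METHOD not found"
```

Only single-line settings are shown in full; a value spread over several YAML lines would show just its first line.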

### Cluster markers and contrasts stages

The usage of these stages is the same as in the single-sample pipeline.



bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.