``` {r, echo = FALSE} knitr::opts_chunk$set(eval = FALSE)
*** In this guide you will see how to integrate two datasets. The *prerequisity* here is: - A project initialized within the quick start guide (`vignette("scdrake")`) that should live in the `~/scdrake_projects/pbmc1k` directory. - You have successfully run the pipeline for the `report_norm_clustering` or `report_norm_clustering_simple` target(s). > For Docker we assume that the container has a shared directory mounted as `/home/rstudio/scdrake_projects`, as described in `vignette("scdrake_docker")`. *** The integration pipeline starts with import of `SingleCellExperiment` (SCE) objects from `drake` caches of underlying single-sample analyses. These objects are the final ones from the `02_norm_clustering` stage, that is, normalized, with known highly variable genes and clusters, and with computed reduced dimensions. ### Prepare the second sample - PBMC 3k {.tabset} As a second sample for the integration pipeline we will use another dataset from 10x Genomics - PBMC 3k. To stick to the project-based approach, we will initialize a new `scdrake` project: #### In R ```r init_project("~/scdrake_projects/pbmc3k")
If not done automatically, change your RStudio project or switch the current working directory to the project's root.
mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
scdrake init-project
mkdir ~/scdrake_projects/pbmc3k cd ~/scdrake_projects/pbmc3k docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \ scdrake init-project
mkdir -p ~/scdrake_singularity cd ~/scdrake_singularity mkdir -p home/${USER} scdrake_projects/pbmc3k singularity exec -e --no-home \ --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \ --pwd "/home/${USER}/scdrake_projects/pbmc3k" \ path/to/scdrake_image.sif \ scdrake init-project
Now we will repeat the steps we have already done for the PBMC 1k sample. In ~/scdrake_projects/pbmc3k
:
config/single_sample/01_input_qc.yaml
and set path
inside INPUT_DATA
to
"../pbmc1k/example_data/pbmc3k"
(the example data for PBMC 3k has been already downloaded when
you had initialized the project for PBMC 1k dataset).config/pipeline.yaml
and set DRAKE_TARGETS
to ["sce_final_norm_clustering"]
.The config modifications for the second sample are ready, so let's run the pipeline:
run_single_sample_r()
scdrake --pipeline-type single_sample run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \ scdrake --pipeline-type single_sample run
singularity exec -e --no-home \ --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \ --pwd "/home/${USER}/scdrake_projects/pbmc3k" \ path/to/scdrake_image.sif \ scdrake --pipeline-type single_sample run
The configuration file for the integration pipeline is located in config/integration/01_integration.yaml
(see vignette("stage_integration")
).
By default, four integration methods are enabled (you can disable them in the INTEGRATION_METHODS
parameter), plus
the uncorrected
method, which is mandatory as it is used later in the cluster_markers
and contrasts
stages
(uncorrected
just performs batch-specific correction for sequencing depth via batchelor::multiBatchNorm()
).
At least one integration method and uncorrected
must be always enabled.
First, as before for the individual samples, we will also initialize a new scdrake
project for the integration analysis:
init_project("~/scdrake_projects/pbmc_integration")
mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
scdrake init-project
mkdir ~/scdrake_projects/pbmc_integration cd ~/scdrake_projects/pbmc_integration docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \ scdrake init-project
mkdir -p home/${USER} scdrake_projects/pbmc_integration singularity exec -e --no-home \ --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \ --pwd "/home/${USER}/scdrake_projects/pbmc_integration" \ path/to/scdrake_image.sif \ scdrake init-project
Now we modify configs for the integration pipeline:
~/scdrake_projects/pbmc_integration
:config/integration/01_integration.yaml
: set cache_path
to ../pbmc1k/.drake
and
../pbmc3k/.drake
for pbmc1k
and pbmc3k
entries, respectively.config/pipeline.yaml
: set DRAKE_TARGETS
to ["report_integration"]
.
To save time, we only run the final target of the 01_integration
stage.And let's run the pipeline.
run_integration_r()
scdrake --pipeline-type integration run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \ scdrake --pipeline-type integration run
The output is saved in output/integration
, as specified by BASE_OUT_DIR
in
config/integration/00_main.yaml
. For 01_integration
stage, you can find its final report in
output/integration/01_integration/01_integration.html
.
You can try to load the target sce_int_dimred_df
(a tibble
object) containing integrated SingleCellExperiment
objects with computed reduced dimensions:
drake::loadd(sce_int_dimred_df)
The post-integration clustering stage (see vignette("stage_int_clustering")
) basically replicates the clustering,
cell annotation and visualization parts of the 02_norm_clustering
stage of the single-sample pipeline.
It uses a SingleCellExperiment
object from a selected integration method specified in the INTEGRATION_FINAL_METHOD
parameter in config/integration/02_int_clustering.yaml
.
You can also try to run the post-integration clustering stage by setting DRAKE_TARGETS
to ["report_int_clustering"]
.
By default, the result from the mnn
(mutual nearest neighbors) integration method is used.
The usage of these stages is the same as in the single-sample pipeline.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.