
Generation of comprehensive quality control metrics with SCTK

This pipeline will import data from single-cell preprocessing algorithms (e.g. CellRanger), generate various quality control metrics (e.g. doublet scores), and output results in standard data containers (e.g. SingleCellExperiment). Both the original droplet matrix and the filtered cell matrix will be processed. This pipeline is focused on single cell data generated from microfluidic devices (e.g. 10X).

Specifications

Installation

Docker users

If you are using the Docker image to run the pipeline, please skip this section and refer to the sections Running the pipeline and Docker and Singularity Images. The Docker image has all dependencies installed, so you can use it directly on any machine.

Install Python packages and dependencies

This pipeline depends on the Python packages described below. If these packages are not installed or are not at the proper version, please install them by running the following code:

pip3 install --user h5py==2.9.0
pip3 install --user Scrublet
pip3 install --user virtualenv
pip3 install --user scanpy

Install R packages and dependencies

The script will automatically try to install the "singleCellTK" package from Bioconductor if it is not available. However, the current code is only available on a development branch, which needs to be installed from GitHub:

library(devtools)
install_github("compbiomed/singleCellTK@devel")

Data formats

This pipeline can currently import data from the following tools: CellRangerV2, CellRangerV3, STARSolo, BUStools, SEQC, Optimus and DropEst, as well as data already stored as an RDS file containing a SingleCellExperiment (SceRDS), a count matrix (CountMatrix) or an AnnData object (AnnData).

For each tool, the pipeline expects the data in a certain format within a specific directory structure. For example, data generated with CellRanger V3 is expected to be under a sample folder (specified with the "-s" flag) and a base folder (specified with the "-b" flag). Some tools prepend the sample name onto the output files instead of making a separate subdirectory (e.g. BUStools and SEQC). The combination of --basePath ("-b") and --sample ("-s") should specify the location of the data files.

Arguments

Required arguments

The required arguments are as follows:

-b, --basePath (required). Base path for the output from the preprocessing algorithm

-P, --preproc (required). Algorithm used for preprocessing. One of 'CellRangerV2', 'CellRangerV3', 'BUStools', 'STARSolo', 'SEQC', 'Optimus', 'DropEst', 'SceRDS', 'CountMatrix' and 'AnnData'.

-s, --sample (required). Name of the sample. This will be prepended to the cell barcodes.

-o, --directory (required). Output directory. A new subdirectory will be created with the sample name. R, Python, and FlatFile directories will be created under this sample directory and will contain the data containers with QC metrics. Default ".". More information about the output directory structure is given in the "Understanding outputs" section below.

-F, --outputFormat (required). The output format of this QC pipeline. Currently, it supports R (RDS), FlatFile, Python (AnnData) and HTAN (manifest files that meet HTAN requirements).

-S, --splitSample (required). Save a SingleCellExperiment object for each sample. Default is TRUE. If FALSE, the data of all samples will be combined into one SingleCellExperiment object and output as a single object.

Optional arguments

The optional arguments are as follows. Their usage depends on the type of data and the desired behavior.


-g, --gmt (optional). GMT file containing gene sets for quality control.

-t, --delim (optional, required when -g is specified). Delimiter used in GMT file. Default "\t".

-G, --genome (optional). The name of genome reference. This is only required for CellRangerV2 data.

-y, --yaml (optional). YAML file used to specify parameters of QC functions called by singleCellTK QC pipeline. Please check "Specify parameters for QC algorithms" section for details.

-c, --cellData (optional). The full path of the RDS file or matrix file of the cell matrix. This is used only when --preproc is SceRDS or CountMatrix.

-r, --rawData (optional). The full path of the RDS file or matrix file of the droplet matrix. This is provided only when --preproc is SceRDS or CountMatrix.

-C, --cellPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellRangerV2 or CellRangerV3 (the files in the filtered_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If --basePath is NULL, --cellPath or --rawPath should be specified.

-R, --rawPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellRangerV2 or CellRangerV3 (the files in the raw_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If --basePath is NULL, --cellPath or --rawPath should be specified.

-d, --dataType. Type of data used as input. Default is Both, which means both the droplet and cell matrices are taken as input. If set to 'Droplet', only the droplet data will be processed. If set to 'Cell', only the cell data will be processed.

-D, --detectCells. Detect cells from the droplet matrix. Default is FALSE. This argument is only evaluated when -d is 'Droplet'. If set to TRUE, cells will be detected and the cell matrix will be subset from the droplet matrix. Quality control will also be performed on the detected cell matrix.

-m, --cellDetectMethod. Method used to detect cells from the droplet matrix. Default is 'EmptyDrops'. This argument is only evaluated when -D is TRUE. Other options are 'Knee' or 'Inflection'. More information is provided in the section below.

-n, --numCores. Number of cores used to run the pipeline. Default is 1. Parallel computing is enabled if -n is greater than 1.

-T, --parallelType. Type of backend used for parallel computing. Parallel computing in this pipeline is based on the "BiocParallel" package. Default is 'MulticoreParam'. It can be 'MulticoreParam' or 'SnowParam'. This argument is only evaluated when numCores > 1.

-i, --studyDesign. A txt file containing the description of the study design. Default is NULL. This will be shown at the beginning of the HTML reports of cell and droplet QC.

-L, --subTitle. The subtitle used in the cell and droplet QC HTML reports. Default is None. The subtitle can contain information about the sample, such as the sample name. If -S is set to TRUE, the number of subtitles should be the same as the number of samples. If -S is set to FALSE, the length of the subtitle should be one or NULL.

Running the pipeline

Methods to run the pipeline on a CellRangerV2 or CellRangerV3 data set

This pipeline supports different ways to import CellRangerV2/CellRangerV3 data for flexibility.

  1. If the cellranger data set is saved in the default cellranger output directory, you can load the data by running the following code:

For CellRangerV3:

Rscript SCTK_runQC.R \
-b /basepath \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

For CellRangerV2, the reference used by cellranger needs to be specified by -G/--genome:

Rscript SCTK_runQC.R \
-b /basepath \
-P CellRangerV2 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-G hg19 \
-F R,Python,FlatFile,HTAN

As shown in the Data formats section, -b specifies the base path, which is usually the folder where you ran 10x cellranger count. -s specifies the sample name, which has to be the same as the name of the sample folder under the base folder. The folder layout looks like the following:

BasePath
└── SampleName
    ├── outs
    |   ├── filtered_feature_bc_matrix
    |   |   ├── barcodes.tsv.gz
    |   |   ├── features.tsv.gz
    |   |   └── matrix.mtx.gz
    |   ├── raw_feature_bc_matrix
    |   |   ├── barcodes.tsv.gz
    |   |   ├── features.tsv.gz
    |   |   └── matrix.mtx.gz
        ...
  2. If the cellranger count output has been moved out of the default cellranger output directory, you can specify the input using -R and -C:
Rscript SCTK_runQC.R \
-P CellRangerV2 \
-C /path/to/cell/matrix \
-R /path/to/droplet/matrix \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

In this case, you must skip the -b argument, and you can also skip the -G argument for CellRangerV2 data.

Methods to run the pipeline on a data set stored in an RDS file or a matrix (txt or mtx)

  1. If your data is stored as a SingleCellExperiment object in an RDS file, singleCellTK also supports that type of input. To run quality control with an RDS file as input, run the following code:
Rscript SCTK_runQC.R \
-P SceRDS \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
  2. If your input is stored in a txt file as a matrix, with barcodes as column names and genes as row names, run the following code to start the quality control pipeline (a sketch of the expected matrix layout is shown after the command):
Rscript SCTK_runQC.R \
-P CountMatrix \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-r /path/to/matrix/file/droplet.txt \
-c /path/to/matrix/file/cell.txt
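
For illustration only, here is a minimal R sketch of the matrix layout that the CountMatrix option expects; the gene and barcode names are hypothetical:

# Hypothetical example: genes as row names, cell barcodes as column names
mat <- matrix(rpois(6, lambda = 5), nrow = 3,
              dimnames = list(c("GeneA", "GeneB", "GeneC"),    # genes as rownames
                              c("AAACCTG-1", "AAACGGG-1")))    # barcodes as colnames
write.table(mat, file = "cell.txt", sep = "\t", quote = FALSE)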

Methods to run the pipeline on data sets generated by other algorithms

If your data was preprocessed by another algorithm, you can load it following the directory structure described in the Data formats section. The general template is shown below:

Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

Methods to run the pipeline on only droplet count or cell count data

Users can choose to run the QC pipeline on only one of the droplet count or cell count matrices instead of running it on both. In this case, the pipeline will take a single input and perform quality control on it. An example is shown below:

Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-d Droplet \
-D TRUE \
-m EmptyDrops

The -d argument specifies which count matrix is used in the QC pipeline. The default is "Both", which means both droplet and cell count data are taken as input.

If -d argument is set as "Droplet", the QC pipeline will only take droplet count matrix as input and perform quality control. You can choose whether to detect cells from the droplet matrix by setting -D as TRUE. If yes, cell count matrix will be detected and the pipeline will also perform quality control on this matrix and output the result. You could further define the method used to detect cells from droplet matrix by setting -m argument. -m could be one of "EmptyDrops", "Knee" or "Inflection". "EmptyDrops" will keep cells that pass the "runEmptyDrops" function test. "Knee" and "Inflection" will keep cells that pass the knee or inflection point returned from "runBarcodeRankDrops" function.

If -d argument is set as "Cell", the QC pipeline will only take cell count matrix as input and perform quality control. A figure showing the analysis steps and outputs of different inputs is shown below:

Specify parameters for QC algorithms

Users can specify parameters for the QC algorithms in this pipeline with a YAML file (supplied with the -y/--yaml argument). The currently supported QC algorithms include doublet detection (bcds, cxds, cxds_bcds_hybrid, doubletFinder, doubletCells and scrublet), decontamination (decontX), empty droplet detection (emptyDrops) and barcode rank filtering (barcodeRanks).

An example QC parameters YAML file is shown below:

Params:   ### should not be omitted
  emptyDrops:           ### Name of the QC functions in the pipeline
    lower: 50           ### Parameter for this function
    niters: 5000        ### Parameter for this function
    testAmbient: True   ### Parameter for this function

  barcodeRanks: 
    lower: 50

The format of the YAML file can be found here. The parameters should be consistent with the parameters of the corresponding QC function in the singleCellTK package. Parameters that are not defined in this YAML file will use their default values.
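
To illustrate how such a file is interpreted, here is a minimal sketch using the yaml R package; the file name and the dropletSCE object are hypothetical, the mapping is performed internally by the pipeline, and runEmptyDrops is assumed to take the SingleCellExperiment as its inSCE argument:

library(yaml)
library(singleCellTK)
params <- read_yaml("qc_params.yaml")$Params
# Each top-level entry names a QC step; its fields become arguments of the
# corresponding singleCellTK function, e.g. for emptyDrops on a droplet SCE:
dropletSCE <- do.call(runEmptyDrops, c(list(inSCE = dropletSCE), params$emptyDrops))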

Analyzing gene sets

Quantifying the level of gene sets can be a useful quality control step. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.

Users can pass a GMT file to the pipeline with one row for each gene set. The first column should be the name of the gene set (e.g. mito). The second column for each gene set in the GMT file (i.e. the description) should contain the location of where to look for the matching IDs in the data. If set to 'rownames', then the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, then the gene set IDs will be matched to the IDs in that column of the feature table.
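
For illustration, a hypothetical two-row GMT file might look like the following (columns separated by tabs, matching the default -t "\t" delimiter); the gene identifiers and the feature-table column name are examples only:

mito	rownames	MT-ND1	MT-ND2	MT-CO1	MT-CO2
ribo	feature_name	RPL5	RPL10	RPS6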

Parallel computing

The SCTK QC pipeline supports parallel computing to speed up the analysis. Parallel computing is enabled by setting -n/--numCores to a value greater than 1; this argument sets the number of cores used by the pipeline.

The parallel computing backend is provided by the "BiocParallel" package. Users can select the type of parallel evaluation with the -T/--parallelType argument. The default is "MulticoreParam"; currently, "MulticoreParam" and "SnowParam" are supported for the -T argument. However, "MulticoreParam" is not supported on Windows, so Windows users should choose "SnowParam" as the parallel computing backend.
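
For reference, the two backends correspond to the following BiocParallel objects; this is a minimal sketch, with the worker count mirroring the -n argument:

library(BiocParallel)
# -T MulticoreParam: forked workers (Linux/macOS)
param <- MulticoreParam(workers = 4)
# -T SnowParam: socket-based workers (also works on Windows)
# param <- SnowParam(workers = 4)
bplapply(1:4, sqrt, BPPARAM = param)  # quick check that the backend runs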

Understanding outputs

The output directory is created under the path specified by the -o/--directory argument. Each sample is stored in a subdirectory (named by the -s/--sample argument) within this output directory. Within each sample directory, each output format is separated into its own subdirectory. The output file hierarchy is shown below:

(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
└── sample1 
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz 
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├──sample1_decontX_UMAP.txt.gz
    |       |   ├──sample1_scrublet_TSNE.txt.gz
    |       |   └──sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz 
    └── sample1_QCparameters.yaml
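
Once the pipeline finishes, the per-sample objects can be loaded back into R; a minimal sketch, assuming the directory layout above:

library(SingleCellExperiment)
# Load the QC'ed SingleCellExperiment objects written by the pipeline
cells    <- readRDS("sample1/R/sample1_Cells.rds")
droplets <- readRDS("sample1/R/sample1_Droplets.rds")
# The per-cell QC metrics are stored in the colData of each object
head(colData(cells))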

Docker and Singularity Images

singleCellTK is available for use with both Docker and Singularity. The container can be used as a portal to the singleCellTK command line interface. The singleCellTK Docker image is available from Docker Hub.

Docker

If you have not used Docker before, you can follow the instructions to install and set up Docker on Windows, Mac or Linux.

The Docker image can be obtained by running:

docker pull campbio/sctk_qc

Note that the transcriptome data and GMT file need to be accessible to the container via a mounted volume. In the example below, a volume is mounted to access the input and output directories using the -v argument. To learn more about mounted volumes, please check out this post.

The usage of each argument is the same as for the command line analysis. Here is example code to perform quality control of CellRangerV3 data with the singleCellTK Docker image:

docker run --rm -v /path/to/data:/SCTK_docker \
-it campbio/sctk_qc:1.7.6 \
-b /SCTK_docker/cellranger \
-P CellRangerV3 \
-s pbmc_100x100 \
-o /SCTK_docker/result/tenx_v3_pbmc \
-g /SCTK_docker/mitochondrial_human_symbol.gmt \
-S TRUE \
-F R,Python,FlatFile,HTAN

Singularity

The Singularity image can easily be built using Docker Hub as a source:

singularity pull docker://campbio/sctk_qc:1.7.6

The usage of the singleCellTK Singularity image is very similar to that of Docker. In Singularity 3.0+, the mounted volume is automatically overlaid.

It is recommended to reset the home directory when you run Singularity. By default, Singularity mounts the $HOME path of your machine, which might contain your personal R/Python library folders. If the mounted home is not reset, Singularity will try to use R/Python libraries that are not built into the Singularity image, which can cause conflicts. You can point to a "sanitized home", different from the $HOME path on your machine, using the -H/--home argument. In addition, you can use the --bind/-B argument to specify your own mount volume, which is the path that contains the data set and will be used to store the output of the QC pipeline. An example is shown below:

singularity run --home=/PathToSanitizedHome \
--bind /PathToData:/data sctk_qc_1.7.6.sif \
-P CellRangerV3 \
-s gencodev34_pbmc_1k_v3 \
-b /data/gencodev34_pbmc_1k_v3 \
-o /data/result/gencodev34_pbmc_1k_v3 \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-n 15 \
-T MulticoreParam

Also, you might want to specify the CPU architecture when running the Singularity image on the BU SCC using the '#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake' directive, because the Python packages are compiled with SIMD instructions that are only compatible with these CPU architectures. If you are running Singularity on another cluster, please ask IT support how to specify the CPU architecture when you run the Singularity image. An example is shown below:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -P camplab
#$ -pe omp 16
#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake
singularity run --home=/PathToSanitizedHome \
--bind /PathToData:/data sctk_qc_1.7.6.sif \
-P CellRangerV3 \
-s gencodev34_pbmc_1k_v3 \
-b /data/gencodev34_pbmc_1k_v3 \
-o /data/result/gencodev34_pbmc_1k_v3 \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-n 15 \
-T MulticoreParam

Documentation of tools that are currently available within the pipeline:

Empty droplet detection

Doublet detection

Ambient RNA detection


