exec/SCTK_runQC_Documentation.md
In singleCellTK: Comprehensive and Interactive Analysis of Single Cell RNA-Seq Data

Generation of comprehensive quality control metrics with SCTK

This pipeline will import data from single-cell preprocessing algorithms (e.g. CellRanger), generate various quality control metrics (e.g. doublet scores), and output results in standard data containers (e.g. SingleCellExperiment). Both the original droplet matrix and the filtered cell matrix will be processed. This pipeline is focused on single cell data generated from microfluidic devices (e.g. 10X).

The pipeline is currently written in the R language. Users will need to install R version 3.6.2 (or higher) in order to run all of the required packages.
For importing files from the HCA Optimus pipeline, the "scipy" module needs to be installed in the default version of Python on the system.

The script will automatically try to install the "singleCellTK" package from Bioconductor if not available. However, currently this code is only located on a development branch which needs to be installed from GitHub:

library(devtools)
install_github("compbiomed/singleCellTK@devel")

To run the pipeline script, users will need to download 'SCTK_runQC.R' and run the following code:

Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

or

SCTK_runQC.R \
-b /base/path/ \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

The second form works if the Rscript executable can be found in your environment with the command

/usr/bin/env Rscript --vanilla
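Whether Rscript can be found in the environment can be checked before calling the script directly. The snippet below is a minimal sketch; `have_cmd` is a hypothetical helper name used only for illustration and is not part of SCTK.

```shell
# Hypothetical helper: succeed if the given command can be resolved on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd Rscript; then
  echo "Rscript found: SCTK_runQC.R can be called directly"
else
  echo "Rscript not found on PATH: install R or adjust PATH first"
fi
```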

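This pipeline can currently import data from the following tools (the accepted names are listed under the -P/--preproc argument below): CellRanger (V2 and V3), BUStools, STARSolo, SEQC, Optimus and DropEst.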

For each tool, the pipeline expects the data in a certain format within a specific directory structure. For example, data generated with CellRanger V3 is expected to be under a sample folder (specified with the "-s" flag) and a base folder (specified with the "-b" flag). Some tools prepend the sample name onto the output files instead of making a separate subdirectory (e.g. BUStools and SEQC). The combination of --basePath ("-b") and --sample ("-s") should specify the location of the data files.
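As an illustration of how the base path and the sample name combine, the sketch below checks for the two CellRanger V3 matrix directories before the pipeline is started. The helper name is hypothetical, and the intermediate "outs" level is an assumption based on the standard Cell Ranger output layout; it is not stated in this document.

```shell
# Hypothetical helper: check that the default Cell Ranger V3 layout exists
# under <base>/<sample>. The "outs" level is an assumption taken from the
# standard Cell Ranger output directory.
check_cellranger_v3_layout() {
  base=$1
  sample=$2
  status=0
  for sub in raw_feature_bc_matrix filtered_feature_bc_matrix; do
    dir="$base/$sample/outs/$sub"
    if [ -d "$dir" ]; then
      echo "found:   $dir"
    else
      echo "missing: $dir"
      status=1
    fi
  done
  return "$status"
}

# Example call with the placeholder values used for -b and -s in this document.
check_cellranger_v3_layout /base/path SampleName || echo "layout incomplete"
```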

The required arguments are as follows:

-b, --basePath (required). Base path for the output from the preprocessing algorithm

-P, --preproc (required). Algorithm used for preprocessing. One of 'CellRangerV2', 'CellRangerV3', 'BUStools', 'STARSolo', 'SEQC', 'Optimus', 'DropEst', 'SceRDS', 'CountMatrix' and 'AnnData'.

-s, --sample (required). Name of the sample. This will be prepended to the cell barcodes.

-o, --directory (required). Output directory. A new subdirectory will be created with the name "sample". R, Python, and FlatFile directories will be created under the "sample" directory containing the data containers with QC metrics. Default ".". The output directory structure is explained in more detail in the "Understanding outputs" section below.

-F, --outputFormat (required). The output format of this QC pipeline. Currently, it supports R (RDS), FlatFile, Python (AnnData) and HTAN.

-S, --splitSample (required). Save a SingleCellExperiment object for each sample. Default is TRUE. If FALSE, the data of all samples will be combined into one SingleCellExperiment object and this object will be output.

The optional arguments are as follows. Their usage depends on the type of data and user-defined behaviour.

-g, --gmt (optional). GMT file containing gene sets for quality control.

-t, --delim (optional, required when -g is specified). Delimiter used in GMT file. Default "\t".

-G, --genome (optional). The name of genome reference. This is only required for CellRangerV2 data.

-y, --yaml (optional). YAML file used to specify parameters of QC functions called by singleCellTK QC pipeline. Please check "Specify parameters for QC algorithms" section for details.

-c, --cellData (optional). The full path of the RDS file or matrix file of the cell matrix. This is used only when --preproc is SceRDS or CountMatrix.

-r, --rawData (optional). The full path of the RDS file or matrix file of the droplet matrix. This is used only when --preproc is SceRDS or CountMatrix.

-C, --cellPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the filtered_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If 'basePath' is NULL, 'cellPath' or 'rawPath' should be specified.

-R, --rawPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the raw_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If 'basePath' is NULL, 'cellPath' or 'rawPath' should be specified.

-d, --dataType. Type of data used as input. Default is 'Both', which means that both the droplet and the cell matrix are taken as input. If set to 'Droplet', only droplet data will be processed. If set to 'Cell', only cell data will be processed.

-D, --detectCells. Detect cells from the droplet matrix. Default is FALSE. This argument is only evaluated when -d is 'Droplet'. If set to TRUE, cells will be detected and the cell matrix will be subset from the droplet matrix. Quality control will also be performed on the detected cell matrix.

-m, --cellDetectMethod. Method used to detect cells from the droplet matrix. Default is 'EmptyDrops'. This argument is only evaluated when -D is 'TRUE'. Other options are 'Knee' and 'Inflection'. More information is provided in the section below.

-n, --numCores. Number of cores used to run the pipeline. Default is 1. Parallel computing is enabled if -n is greater than 1.

-T, --parallelType. Type of parallel backend. Parallel computing in this pipeline depends on the "BiocParallel" package. Default is 'MulticoreParam'. It can be 'MulticoreParam' or 'SnowParam'. This argument is evaluated only when numCores > 1.
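The optional arguments can be combined with the required ones. The sketch below prints (without running) a droplet-only call that detects cells and uses several cores. The helper name is hypothetical, and the paths and sample name are the placeholders used elsewhere in this document.

```shell
# Hypothetical helper: print (not run) a droplet-only SCTK_runQC.R call that
# detects cells with the chosen method on the chosen number of cores.
print_droplet_qc_cmd() {
  method=$1   # 'EmptyDrops', 'Knee' or 'Inflection'
  cores=$2    # values greater than 1 enable parallel computing
  printf '%s ' Rscript SCTK_runQC.R \
    -b /base/path -P CellRangerV3 -s SampleName -o Output_Directory \
    -S TRUE -F R,Python,FlatFile,HTAN \
    -d Droplet -D TRUE -m "$method" \
    -n "$cores" -T MulticoreParam
  printf '\n'
}

# Dry run: inspect the command before executing it by hand.
print_droplet_qc_cmd EmptyDrops 4
```

Printing the command first makes it easy to confirm that -D is paired with -d Droplet and that -T is only meaningful when -n is greater than 1.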

This pipeline provides several ways to import CellrangerV2/CellrangerV3 data for flexibility.

If the cellranger data set is saved in the default cellranger output directory, you can load the data by running the following code:

For CellRangerV3:

Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

For CellRangerV2, the reference used by cellranger needs to be specified by -G/--genome:

Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV2 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-G hg19 \
-F R,Python,FlatFile,HTAN

If the cellranger count output has been moved out of the default cellranger output directory, you can specify the input using -R and -C:

Rscript SCTK_runQC.R \
-P CellRangerV2 \
-C /path/to/cell/matrix \
-R /path/to/droplet/matrix \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN

In this case, you must omit the -b argument, and you can also omit the -G argument for CellRangerV2 data.

If your data is stored as a SingleCellExperiment in an RDS file, singleCellTK also supports that type of input. To run quality control with an RDS file as input, run the following code:

Rscript SCTK_runQC.R \
-P SceRDS \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS

If your input is stored in txt file as a matrix, which has barcodes as colnames and genes as rownames, run the following code to start the quality control pipeline:

Rscript SCTK_runQC.R \
-P CountMatrix \
-s Samplename \
-o Output_Directory \
-S TRUE \
-r /path/to/matrix/file/droplet.txt \
-c /path/to/matrix/file/cell.txt

If your data is preprocessed by other algorithms, you can load the data as shown in the figure above. The basic template is shown below:

Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
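When several samples from the same preprocessing algorithm need to be processed, the template can be repeated once per sample. The sketch below only prints the commands; the helper name and the sample names are hypothetical.

```shell
# Hypothetical helper: print one SCTK_runQC.R call per sample for a given
# preprocessing algorithm. Nothing is executed.
print_batch_cmds() {
  algorithm=$1
  shift
  for sample in "$@"; do
    echo "Rscript SCTK_runQC.R -b /base/path -P $algorithm -s $sample -o Output_Directory -S TRUE -F R,Python,FlatFile,HTAN"
  done
}

# Example: two placeholder samples preprocessed with BUStools.
print_batch_cmds BUStools sampleA sampleB
```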

Users can specify parameters for the QC algorithms in this pipeline with a yaml file (supplied with the -y/--yamlFile argument). The currently supported QC algorithms include doublet detection (bcds, cxds, cxds_bcds_hybrid, doubletFinder, doubletCells and scrublet), decontamination (decontX), empty droplet detection (emptyDrops) and barcodeRankDrops (barcodeRanks). An example QC parameters yaml file is shown below:

Params:   ### should not be omitted
  bcds:
    ntop: 600

  cxds:
    ntop: 600

  cxds_bcds_hybrid:
    nTop: 600

  decontX:
    maxIter: 600

  emptyDrops:
    lower: 50
    niters: 5000
    testAmbient: True

  barcodeRanks:
    lower: 50

The format of the yaml file can be found here. The parameters should be consistent with the parameters of each QC function in the singleCellTK package. Parameters that are not defined in this yaml file will use their default values.
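A parameter file can also be written from the command line. The sketch below creates a reduced version of the example (only the emptyDrops and barcodeRanks entries) and prints the call that would pass it to the pipeline; the file name qc_params.yaml is a placeholder.

```shell
# Write a small parameter file; settings that are not listed keep their defaults.
# The file name qc_params.yaml is a placeholder chosen for this example.
cat > qc_params.yaml <<'EOF'
Params:
  emptyDrops:
    lower: 50
    niters: 5000
  barcodeRanks:
    lower: 50
EOF

# Dry run: show the call that would pass the file to the pipeline with -y.
echo "Rscript SCTK_runQC.R -b /base/path -P CellRangerV3 -s SampleName -o Output_Directory -S TRUE -F R,Python,FlatFile,HTAN -y qc_params.yaml"
```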

Quantifying the expression level of gene sets can be a useful quality control step. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.

Users can pass a GMT file to the pipeline with one row for each gene set. The first column should be the name of the gene set (e.g. mito). The second column for each gene set in the GMT file (i.e. the description) should contain the location of where to look for the matching IDs in the data. If set to 'rownames', then the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, then gene set IDs will be matched to the IDs in that column of the feature table.
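The layout described above can be sketched with a one-line GMT file. The file name is a placeholder and the gene list is a short, partial set of human mitochondrial gene symbols chosen for illustration; it assumes that the row IDs of the data matrix are gene symbols.

```shell
# Build a one-line GMT file: set name, match location, then the gene IDs,
# all separated by tabs. 'rownames' means the IDs are matched against the
# row IDs of the data matrix. File name and gene list are placeholders.
printf 'mito\trownames\tMT-ND1\tMT-ND2\tMT-CO1\tMT-CO2\tMT-ATP6\tMT-CYB\n' > mito_example.gmt

# Sanity check: print the set name, the match location and the number of genes.
awk -F '\t' '{ print $1, $2, NF - 2 }' mito_example.gmt
```

Such a file would then be passed with -g, together with -t if a delimiter other than the default tab is used.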

The output directory is created under the path specified by the -o/--directory argument. Each sample is stored in a subdirectory (named by the -s/--sample argument) within this output directory. Within each sample directory, each output format is placed in its own subdirectory. The output file hierarchy is shown below:

(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
└── sample1 
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz 
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├──sample1_decontX_UMAP.txt.gz
    |       |   ├──sample1_scrublet_TSNE.txt.gz
    |       |   └──sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz 
    └── sample1_QCparameters.yaml
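After a run, the hierarchy above can be used to confirm that the main output files were written. The sketch below lists any that are missing for one sample; the helper name is hypothetical and only the R and Python outputs and the parameter file are checked.

```shell
# Hypothetical helper: print the expected output files that are missing for
# one sample, following the hierarchy shown above (R and Python formats only).
missing_outputs() {
  outdir=$1
  sample=$2
  for f in \
    "R/${sample}_Droplets.rds" \
    "R/${sample}_Cells.rds" \
    "Python/Droplets/${sample}.h5ad" \
    "Python/Cells/${sample}.h5ad" \
    "${sample}_QCparameters.yaml"
  do
    [ -e "$outdir/$sample/$f" ] || echo "$outdir/$sample/$f"
  done
}

# Example call with placeholder values for -o and -s.
missing_outputs Output_Directory sample1
```

An empty result means all of the checked files are present.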

singleCellTK is available as both a Docker and a Singularity image. The container can be used as a portal to the singleCellTK command line interface. The work was done in R. The singleCellTK Docker image is available from Docker Hub.

If you have not used Docker before, you can follow the instructions to install and set up Docker on Windows, Mac or Linux.

The Docker image can be obtained by running:

docker pull campbio/sctk_qc

Note that the transcriptome data and GMT file need to be accessible to the container via a mounted volume. In the example below, a volume is mounted with the -v argument to give the container access to the input and output directories. To learn more about mounted volumes, please check out this post.

The usage of each argument is the same as when running the analysis from the command line. Here is an example that performs quality control of CellRangerV3 data with the singleCellTK Docker image:

docker run --rm -v /path/to/data:/SCTK_docker \
-it campbio/sctk_qc:1.7.5 \
-b /SCTK_docker/cellranger \
-P CellRangerV3 \
-s pbmc_100x100 \
-o /SCTK_docker/result/tenx_v3_pbmc \
-g /SCTK_docker/mitochondrial_human_symbol.gmt \
-S TRUE \
-F R,Python,FlatFile,HTAN

The Singularity image can easily be built using Docker Hub as a source:

singularity pull docker://campbio/sctk_qc:1.7.5

The usage of the singleCellTK Singularity image is very similar to that of Docker. In Singularity 3.0+, the mount volume is automatically overlaid. However, you can use the --bind/-B argument to specify your own mount volume. An example is shown below:

singularity run sctk_qc_1.7.5.sif \
-b ./cellranger \
-P CellRangerV3 \
-s pbmc_100x100 \
-o ./result/tenx_v3_pbmc \
-g ./mitochondrial_human_symbol.gmt

The code above assumes that the dataset is in your current directory, which is automatically mounted by Singularity. If you run the Singularity image on BU SCC, it is recommended to reset the home directory that is mounted. Otherwise, the container will load libraries from the SCC shared libraries, which might cause conflicts. You can point to a "sanitized home" using the -H/--home argument. You might also want to specify the CPU architecture when running the container on BU SCC using #$ -l cpu_arch=broadwell|haswell|skylake|cascadelake, because the Python packages are compiled with SIMD instructions that are only available on these CPU architectures.
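A call that sets a separate home directory and an explicit bind mount could look like the sketch below, which only prints the command. The helper name, both host paths and the /data mount point inside the container are assumptions made for illustration.

```shell
# Hypothetical helper: print (not run) a Singularity call that uses a
# "sanitized" home directory (-H) and binds a data directory to /data (-B).
# The /data mount point and both host paths are placeholders.
print_singularity_cmd() {
  clean_home=$1
  data_dir=$2
  printf '%s ' singularity run \
    -H "$clean_home" \
    -B "$data_dir:/data" \
    sctk_qc_1.7.5.sif \
    -b /data/cellranger -P CellRangerV3 -s pbmc_100x100 \
    -o /data/result/tenx_v3_pbmc
  printf '\n'
}

# Dry run with placeholder paths.
print_singularity_cmd /path/to/sanitized_home /path/to/data
```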

Empty droplet detection:

emptyDrops from the package DropletUtils

Doublet Detection

doubletCells from the package scran
cxds, bcds, and cxds_bcds_hybrid from the package scds
doubletFinder from the package DoubletFinder
Scrublet from the package scrublet

Ambient RNA detection

decontX from the package celda


singleCellTK documentation built on Nov. 8, 2020, 5:21 p.m.


Specifications

Installation

Running the pipeline

Data formats

Arguments

Required arguments

Optional arguments

Methods to run pipeline on CellrangerV2 or CellrangerV3 data set

Methods to run pipeline on data set stored in RDS file or txt file (matrix)

Methods to run pipeline on data set generated by other algorithms

Specify parameters for QC algorithms

Analyzing genes sets

Understanding outputs

Docker and Singularity Images

Docker

Singularity

Documentation of tools that are currently available within the pipeline:

Empty droplet detection:

Doublet Detection

Ambient RNA detection
