This pipeline will import data from single-cell preprocessing algorithms (e.g. CellRanger), generate various quality control metrics (e.g. doublet scores), and output results in standard data containers (e.g. SingleCellExperiment). Both the original droplet matrix and the filtered cell matrix will be processed. This pipeline is focused on single-cell data generated from microfluidic devices (e.g. 10X).
The script will automatically try to install the "singleCellTK" package from Bioconductor if it is not already installed. However, the pipeline code is currently available only on the development branch, which must be installed from GitHub:
library(devtools)
install_github("compbiomed/singleCellTK@devel")
To run the pipeline, users will need to download the 'SCTK_runQC.R' script and run the following command:
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
or
SCTK_runQC.R \
-b /base/path/ \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
if the Rscript executable can be found in your environment with the command
/usr/bin/env Rscript --vanilla
This pipeline can currently import data from the following tools: CellRanger (V2 or V3), BUStools, STARsolo, SEQC, Optimus, and DropEst.
For each tool, the pipeline expects the data in a certain format within a specific directory structure. For example, data generated with CellRanger V3 is expected to be in a sample folder (specified with the "-s" flag) inside a base folder (specified with the "-b" flag). Some tools prepend the sample name onto the output files instead of making a separate subdirectory (e.g. BUStools and SEQC). The combination of --basePath ("-b") and --sample ("-s") should specify the location of the data files.
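As a minimal sketch of how -b and -s combine, the helper below checks a CellRanger V3 sample before the pipeline is launched. It assumes the default Cell Ranger output layout, in which the matrices sit under <base>/<sample>/outs/. The function name and the check itself are hypothetical and not part of singleCellTK.

```shell
# Hypothetical pre-flight check (not part of singleCellTK).
# Assumption: default Cell Ranger V3 layout,
#   <base>/<sample>/outs/raw_feature_bc_matrix
#   <base>/<sample>/outs/filtered_feature_bc_matrix
check_cellranger_v3_layout() {
  # $1 = value passed to -b, $2 = value passed to -s
  crv3_status=0
  for crv3_sub in raw_feature_bc_matrix filtered_feature_bc_matrix; do
    crv3_dir="$1/$2/outs/$crv3_sub"
    if [ ! -d "$crv3_dir" ]; then
      echo "missing directory: $crv3_dir" >&2
      crv3_status=1
    fi
  done
  return "$crv3_status"
}
```

Calling check_cellranger_v3_layout /base/path SampleName returns a non-zero status and lists the missing directories on stderr. For tools that prepend the sample name to the file names, the check would need to be adapted.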
The required arguments are as follows:
-b, --basePath (required). Base path for the output from the preprocessing algorithm
-P, --preproc (required). Algorithm used for preprocessing. One of 'CellRangerV2', 'CellRangerV3', 'BUStools', 'STARSolo', 'SEQC', 'Optimus', 'DropEst', 'SceRDS', 'CountMatrix' and 'AnnData'.
-s, --sample (required). Name of the sample. This will be prepended to the cell barcodes.
-o, --directory (required). Output directory. A new subdirectory will be created with the name "sample". R, Python, and FlatFile directories will be created under the "sample" directory containing the data containers with QC metrics. Default ".". The output directory structure is explained in the "Understanding outputs" section below.
-F, --outputFormat (required). The output format of this QC pipeline. Currently, it supports R (RDS), FlatFile, Python (AnnData) and HTAN.
-S, --splitSample (required). Save a SingleCellExperiment object for each sample. Default is TRUE. If FALSE, the data of all samples will be combined into one SingleCellExperiment object and this object will be output.
The optional arguments are as follows. Their usage depends on the type of data and user-defined behaviour.
-g, --gmt (optional). GMT file containing gene sets for quality control.
-t, --delim (optional, required when -g is specified). Delimiter used in GMT file. Default "\t".
-G, --genome (optional). The name of the genome reference. This is only required for CellRangerV2 data.
-y, --yaml (optional). YAML file used to specify parameters of QC functions called by the singleCellTK QC pipeline. Please check the "Specify parameters for QC algorithms" section for details.
-c, --cellData (optional). The full path of the RDS file or Matrix file of the cell matrix. This is used only when --preproc is SceRDS or CountMatrix.
-r, --rawData (optional). The full path of the RDS file or Matrix file of the droplet matrix. This is used only when --preproc is SceRDS or CountMatrix.
-C, --cellPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the filtered_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If --basePath is NULL, 'cellPath' or 'rawPath' should be specified.
-R, --rawPath (optional). The directory containing the matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the raw_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If --basePath is NULL, 'cellPath' or 'rawPath' should be specified.
-d, --dataType. Type of data as input. Default is Both, which means taking both the droplet and cell matrix as input. If set to 'Droplet', it will only process droplet data. If set to 'Cell', it will only process cell data.
-D, --detectCells. Detect cells from the droplet matrix. Default is FALSE. This argument is only evaluated when -d is 'Droplet'. If set to TRUE, cells will be detected and the cell matrix will be subset from the droplet matrix. Quality control will also be performed on the detected cell matrix.
-m, --cellDetectMethod. Method used to detect cells from the droplet matrix. Default is 'EmptyDrops'. This argument is only evaluated when -D is 'TRUE'. Other options are 'Knee' or 'Inflection'. More information is provided in the section below.
-n, --numCores. Number of cores used to run the pipeline. Default is 1. Parallel computing is enabled if -n is greater than 1.
-T, --parallelType. Type of parallel backend. Parallel computing in this pipeline depends on the "BiocParallel" package. Default is 'MulticoreParam'. It can be 'MulticoreParam' or 'SnowParam'. This argument is evaluated only when numCores > 1.
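Several of the optional arguments only take effect in combination. The dry-run sketch below assembles, but does not execute, a droplet-only run with cell detection and parallel computing. It uses only the flags described above, and all paths and the sample name are placeholders.

```shell
# Dry run: build the command line as a string and print it instead of executing it.
# All paths and the sample name are placeholders.
base_path="/base/path"
sample="SampleName"
out_dir="Output_Directory"
num_cores=4

cmd="Rscript SCTK_runQC.R -b $base_path -P CellRangerV3 -s $sample -o $out_dir"
cmd="$cmd -S TRUE -F R,Python,FlatFile,HTAN"
# -d Droplet: process only the droplet matrix.
# -D TRUE is evaluated only because -d is 'Droplet'; -m only because -D is TRUE.
cmd="$cmd -d Droplet -D TRUE -m EmptyDrops"
# -T is evaluated only when the number of cores is greater than 1.
if [ "$num_cores" -gt 1 ]; then
  cmd="$cmd -n $num_cores -T MulticoreParam"
fi
echo "$cmd"
```

Printing the command first makes it easy to confirm which dependent options are present before starting a long run.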
This pipeline supports several ways to import CellrangerV2/CellrangerV3 data for flexibility.
For CellRangerV3:
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
For CellRangerV2, the reference used by cellranger needs to be specified by -G/--genome:
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV2 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-G hg19 \
-F R,Python,FlatFile,HTAN
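If the CellRanger data is stored in a different location, pass the directories containing the cell and droplet matrices directly with -C/--cellPath and -R/--rawPath:
Rscript SCTK_runQC.R \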
-P CellRangerV2 \
-C /path/to/cell/matrix \
-R /path/to/droplet/matrix \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
In this case, you must omit the -b argument, and you can also omit the -G argument for CellRangerV2 data.
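For data stored in RDS files, set -P to SceRDS and supply the files with -r/--rawData and -c/--cellData:
Rscript SCTK_runQC.R \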
-P SceRDS \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
For data stored as count matrices in text files, set -P to CountMatrix:
Rscript SCTK_runQC.R \
-P CountMatrix \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN \
-r /path/to/matrix/file/droplet.txt \
-c /path/to/matrix/file/cell.txt
If your data is preprocessed by other algorithms, you can load the data as shown in the figure above. The general template is shown below:
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F R,Python,FlatFile,HTAN
Users can specify parameters for the QC algorithms in this pipeline with a yaml file (supplied with the -y/--yamlFile argument). The currently supported QC algorithms include doublet detection (bcds, cxds, cxds_bcds_hybrid, doubletFinder, doubletCells and scrublet), decontamination (decontX), empty droplet detection (emptyDrops) and barcodeRankDrops (barcodeRanks). An example of a QC parameters yaml file is shown below:
Params: ### should not be omitted
  bcds:
    ntop: 600
  cxds:
    ntop: 600
  cxds_bcds_hybrid:
    nTop: 600
  decontX:
    maxIter: 600
  emptyDrops:
    lower: 50
    niters: 5000
    testAmbient: True
  barcodeRanks:
    lower: 50
The file must be valid YAML. The parameters should be consistent with the parameters of each QC function in the singleCellTK package. Parameters that are not defined in the yaml file will use their default values.
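Because undefined parameters fall back to their defaults, a parameter file only needs the entries you want to change. The sketch below writes a minimal file that overrides only the emptyDrops settings. It assumes the nested layout (function names indented under Params, parameters indented under each function), and the file name qc_params.yaml is a placeholder.

```shell
# Write a parameter file that overrides only the emptyDrops settings.
# The file name qc_params.yaml is a placeholder.
cat > qc_params.yaml <<'EOF'
Params:
  emptyDrops:
    lower: 50
    niters: 5000
EOF

# Quick sanity check: the top-level key must be "Params".
head -n 1 qc_params.yaml
```

The file can then be passed to the pipeline with -y qc_params.yaml alongside the other arguments.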
Quantifying the expression level of gene sets can be a useful quality control metric. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.
Users can pass a GMT file to the pipeline with one row for each gene set. The first column should be the name of the gene set (e.g. mito). The second column for each gene set in the GMT file (i.e. the description) should contain the location of where to look for the matching IDs in the data. If set to 'rownames', then the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, then gene set IDs will be matched to the IDs in that column of the feature table.
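As an illustration of the layout described above, the sketch below writes a one-row, tab-delimited GMT file and checks that every row has a name, a location, and at least one gene ID. The file name and the choice of human mitochondrial gene symbols are examples only, and the check is a hypothetical convenience rather than part of the pipeline.

```shell
# Illustrative GMT file with one gene set named "mito".
# Column 1: gene set name; column 2: where to match IDs ('rownames' here);
# remaining columns: gene IDs. The gene symbols are examples only.
printf 'mito\trownames\tMT-ND1\tMT-ND2\tMT-CO1\tMT-CYB\tMT-ATP6\n' > mito_example.gmt

# Report any row that lacks a name, a location, and at least one gene ID.
awk -F '\t' 'NF < 3 { printf "line %d has only %d field(s)\n", NR, NF; bad = 1 }
             END { exit bad + 0 }' mito_example.gmt
```

The file would be passed with -g mito_example.gmt. Since the delimiter is a tab, the default value of -t applies.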
The output directory is created under the path specified by the -o/--directory argument. Each sample is stored in a subdirectory (named by the -s/--sample argument) within this output directory. Within each sample directory, each output format is separated into its own subdirectory. The output file hierarchy is shown below:
(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
└── sample1
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├── sample1_decontX_UMAP.txt.gz
    |       |   ├── sample1_scrublet_TSNE.txt.gz
    |       |   └── sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz
    └── sample1_QCparameters.yaml
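After a run, the hierarchy above can be checked from the shell. The helper below is a hypothetical sketch, not part of singleCellTK. It assumes that R was included in -F and that both droplet and cell data were processed (the default -d Both), so both RDS files are expected.

```shell
# Hypothetical helper: report which of the expected R outputs exist for a sample.
# $1 = output directory (-o), $2 = sample name (-s)
check_sctk_r_outputs() {
  sctk_missing=0
  for sctk_file in "$1/$2/R/${2}_Droplets.rds" "$1/$2/R/${2}_Cells.rds"; do
    if [ -f "$sctk_file" ]; then
      echo "found:   $sctk_file"
    else
      echo "missing: $sctk_file"
      sctk_missing=1
    fi
  done
  return "$sctk_missing"
}
```

For example, check_sctk_r_outputs Output_Directory sample1 returns a non-zero status if either file is absent, which can be used to stop a downstream script early.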
singleCellTK is available to use with both Docker and Singularity. This container can be used as a portal to the singleCellTK command line interface. The work was done in R. The singleCellTK Docker image is available from Docker Hub.
If you have not used Docker before, follow the Docker installation instructions to install and set up Docker on Windows, Mac or Linux.
The Docker image can be obtained by running:
docker pull campbio/sctk_qc
Note that the transcriptome data and GMT file need to be accessible to the container via a mounted volume. In the example below, a volume is mounted with the -v argument to give the container access to the input and output directories. To learn more about mounted volumes, consult the Docker documentation.
The usage of each argument is the same as when running the command line analysis. Here is an example that performs quality control of CellRangerV3 data with the singleCellTK Docker image:
docker run --rm -v /path/to/data:/SCTK_docker \
-it campbio/sctk_qc:1.7.5 \
-b /SCTK_docker/cellranger \
-P CellRangerV3 \
-s pbmc_100x100 \
-o /SCTK_docker/result/tenx_v3_pbmc \
-g /SCTK_docker/mitochondrial_human_symbol.gmt \
-S TRUE \
-F R,Python,FlatFile,HTAN
The Singularity image can easily be built using Docker Hub as a source:
singularity pull docker://campbio/sctk_qc:1.7.5
The usage of the singleCellTK Singularity image is very similar to that of Docker. In Singularity 3.0+, the mount volume is automatically overlaid. However, you can use the --bind/-B argument to specify your own mount volume. An example is shown below:
singularity run sctk_qc_1.7.5.sif \
-b ./cellranger \
-P CellRangerV3 \
-s pbmc_100x100 \
-o ./result/tenx_v3_pbmc \
-g ./mitochondrial_human_symbol.gmt
The code above assumes that the dataset is in your current directory, which is automatically mounted by Singularity. If you run the Singularity image on BU SCC, it is recommended to reset the home directory to mount. Otherwise, the container will load libraries from the SCC shared libraries, which might cause conflicts. You can point to a "sanitized home" using the -H/--home argument. You might also want to specify the CPU architecture when running the container on BU SCC using #$ -l cpu_arch=broadwell|haswell|skylake|cascadelake, because the Python packages are compiled with SIMD instructions that are only available on these CPU architectures.
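The two recommendations above can be combined in a batch script. The sketch below only writes such a script to disk. It assumes the job is submitted with qsub from the directory that holds the data and the .sif file, and that the sanitized home path (a placeholder here) already exists. The script name run_sctk_qc.sh is hypothetical.

```shell
# Write a batch script; submit it separately with: qsub run_sctk_qc.sh
# The sanitized home path is a placeholder and must exist before the job runs.
cat > run_sctk_qc.sh <<'EOF'
#!/bin/bash
#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake

singularity run --home /path/to/sanitized_home sctk_qc_1.7.5.sif \
  -b ./cellranger \
  -P CellRangerV3 \
  -s pbmc_100x100 \
  -o ./result/tenx_v3_pbmc \
  -g ./mitochondrial_human_symbol.gmt
EOF
```

Keeping the resource request and the --home setting in one file makes the run easier to reproduce than typing both on the command line each time.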