cellCounts: Map and quantify single cell RNA-seq data generated by 10X...
In Rsubread: Mapping, quantification and variant analysis of sequencing data


Process raw 10X scRNA-seq data and generate UMI counts for each gene in each cell.

cellCounts(

    # input data
    index,
    sample.index,
    input.mode = "BCL",
    cell.barcode = NULL,
  
    # specify the aligner used for read mapping
    aligner = "align",
  
    # parameters used by featureCounts for assigning and counting UMIs
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTFAnnotationFile = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    useMetaFeatures = TRUE,
    
    # number of threads
    nthreads = 10,

    # other parameters passed to align, subjunc and featureCounts functions 
    ...)

`index`	A character string providing the base name of index files created for a reference genome by the `buildindex` function.
`sample.index`	A data frame containing the index set name for each sample and other sample-related information. The data frame must contain four columns with the headers `InputDirectory`, `Lane`, `SampleName` and `IndexSetName`. Note that this is not the Sample Sheet generated by the Illumina sequencer. `cellCounts` uses the index set names provided in this data frame to generate a Sample Sheet and then uses that Sample Sheet to demultiplex all the samples. The index set name given for a sample specifies the set of indices that were used to sequence that sample, for example "SI-P01-A2". The `InputDirectory` column lists the one or more directories in which raw sequencing data are saved. See below for more details.
`input.mode`	Specify the input mode. Currently only the BCL-format input is supported (`"BCL"`).
`cell.barcode`	A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If `NULL`, `cellCounts` determines the cell barcode set by matching the cell barcode sequences of the first 100,000 reads in the data against the three cell barcode sets used by 10X Genomics. `NULL` by default.
`aligner`	Specify the name of the aligner used for read mapping. Currently it has only one possible value `"align"`, indicating that the `align` function will be used for mapping.
`annot.inbuilt`	Specify an inbuilt annotation for UMI counting. See `featureCounts` for more details. `"mm10"` by default.
`annot.ext`	Specify an external annotation for UMI counting. See `featureCounts` for more details. `NULL` by default.
`isGTFAnnotationFile`	See `featureCounts` for more details. `FALSE` by default.
`GTF.featureType`	See `featureCounts` for more details. `"exon"` by default.
`GTF.attrType`	See `featureCounts` for more details. `"gene_id"` by default.
`useMetaFeatures`	Specify whether UMI counting should be carried out at the meta-feature level (e.g. gene level). See `featureCounts` for more details. `TRUE` by default.
`nthreads`	A numeric value giving the number of threads used for read mapping and counting. `10` by default.
`...`	Other parameters passed to the `align` and `featureCounts` functions.
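
A call using the arguments above might look like the following. This is a minimal sketch, not a tested pipeline. The wrapper name `run_cellcounts_sketch` and its `threads` default are hypothetical. The argument names are taken from the Usage on this page, so they may differ in other Rsubread versions.

```r
# Hypothetical wrapper around cellCounts(); nothing is run until it is called.
run_cellcounts_sketch <- function(index_base, sample_index, threads = 4) {
  # Rsubread must be installed for the call below to work.
  if (!requireNamespace("Rsubread", quietly = TRUE)) {
    stop("The Rsubread package is required.")
  }
  Rsubread::cellCounts(
    index         = index_base,    # base name of an index built with buildindex()
    sample.index  = sample_index,  # data frame with the four required columns
    input.mode    = "BCL",         # the only input mode documented here
    cell.barcode  = NULL,          # let cellCounts pick the 10X barcode set
    aligner       = "align",
    annot.inbuilt = "mm10",
    nthreads      = threads
  )
}
```

Wrapping the call in a function keeps the example loadable on a machine without the raw BCL data. The mapping itself only starts once the function is called with a real index and sample table.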

The cellCounts function takes as input raw scRNA-seq read data generated from the 10X Genomics platform. It utilizes the read mapping and counting functions included in the Rsubread package to process the scRNA-seq data. It calls the align function to map reads to a reference genome and calls the featureCounts function to assign reads to genes. It performs sample demultiplexing, cell barcode demultiplexing and read deduplication before producing UMI counts for each gene in each cell. The cellCounts function is able to process multiple datasets stored in multiple different directories at the same time.

Sample-related information should be provided to the sample.index parameter. This includes the name of the index set used for each sample, the sample name, the flowcell lane used to sequence each sample and the location where the sample data were saved. All of this information should be stored in a data.frame object, which can then be provided to the sample.index parameter. Below is an example of the data.frame object provided to sample.index:

InputDirectory       Lane    SampleName    IndexSetName
/path/to/dataset1    1       Sample1       SI-GA-E1
/path/to/dataset1    1       Sample2       SI-GA-E2
/path/to/dataset1    2       Sample1       SI-GA-E1
/path/to/dataset1    2       Sample2       SI-GA-E2
/path/to/dataset2    1       Sample3       SI-GA-E3
/path/to/dataset2    1       Sample4       SI-GA-E4
/path/to/dataset2    2       Sample3       SI-GA-E3
/path/to/dataset2    2       Sample4       SI-GA-E4
...
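
The table above could be built in R roughly as follows. This is a sketch under assumptions: the paths are placeholders copied from the example, and storing `Lane` as numeric rather than character is a guess, since the page does not state the expected type.

```r
# Build the sample table shown above; paths are placeholders.
sample.index <- data.frame(
  InputDirectory = rep(c("/path/to/dataset1", "/path/to/dataset2"), each = 4),
  Lane           = rep(c(1, 1, 2, 2), times = 2),  # numeric type is an assumption
  SampleName     = c("Sample1", "Sample2", "Sample1", "Sample2",
                     "Sample3", "Sample4", "Sample3", "Sample4"),
  IndexSetName   = c("SI-GA-E1", "SI-GA-E2", "SI-GA-E1", "SI-GA-E2",
                     "SI-GA-E3", "SI-GA-E4", "SI-GA-E3", "SI-GA-E4"),
  stringsAsFactors = FALSE
)

# Check that the four required column headers are present.
required <- c("InputDirectory", "Lane", "SampleName", "IndexSetName")
stopifnot(all(required %in% colnames(sample.index)))
```

Each sample appears once per lane, so a sample sequenced on two lanes takes two rows with the same `SampleName` and `IndexSetName`.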

The cellCounts function returns a List object to R, and it also outputs three gzipped FASTQ files and one BAM file for each sample. The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively. The BAM file includes location-sorted read mapping results.

The returned List object contains the following components:

`counts`	a `List` object including UMI counts for each sample. Each component in this object is a matrix that contains UMI counts for a sample. Rows in the matrix are genes and columns are cells.
`annotation`	a `data.frame` object containing an annotation (e.g. a gene annotation). UMIs were assigned to features (e.g. genes) in this annotation. Rows in the annotation are features. Columns of the annotation include `GeneID`, `Chr`, `Start`, `End` and `Length`.
`sample.info`	a `data.frame` object containing sample information and also statistics related to the quantification result. It includes the following columns: `SampleName`, `InputDirectory`, `TotalCells`, `HighConfidenceCells`, `RescuedCells`, `TotalUMI`, `TotalReads`, `MappedReads` and `AssignedReads`. The order of samples in this object is the same as that in the components `counts` and `cell.confidence`.
`cell.confidence`	a `List` object indicating if a cell is a high-confidence cell or a rescued cell (low confidence). Each component in the `List` object corresponds to a sample. Each component is a logical vector with a `TRUE` value indicating a high-confidence cell.
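
To illustrate how these components fit together, the sketch below filters one sample's count matrix down to its high-confidence cells. The `res` object is a toy stand-in with made-up values, not real `cellCounts` output. It also assumes that the entries of each `cell.confidence` vector follow the column order of the matching `counts` matrix, which this page does not state explicitly.

```r
# Toy stand-in for the list returned by cellCounts(); all values are made up.
res <- list(
  counts = list(
    matrix(c(5, 0, 2,
             1, 3, 0),
           nrow = 2, byrow = TRUE,
           dimnames = list(c("GeneA", "GeneB"),
                           c("cell1", "cell2", "cell3")))
  ),
  cell.confidence = list(c(TRUE, FALSE, TRUE))
)

# Keep only high-confidence cells (columns) for the first sample.
umi       <- res$counts[[1]]
high_conf <- res$cell.confidence[[1]]   # assumed to align with columns of umi
umi_high  <- umi[, high_conf, drop = FALSE]

# Total UMIs per retained cell.
umi_per_cell <- colSums(umi_high)
```

Using `drop = FALSE` keeps the result a matrix even when only one cell passes the filter.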