cellCounts: Map and quantify single cell RNA-seq data generated by 10X...

Description

Process raw 10X scRNA-seq data and generate UMI counts for each gene in each cell.

Usage

cellCounts(

    # input data
    index,
    sample.index,
    input.mode = "BCL",
    cell.barcode = NULL,
  
    # specify the aligner used for read mapping
    aligner = "align",
  
    # parameters used by featureCounts for assigning and counting UMIs
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTFAnnotationFile = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    useMetaFeatures = TRUE,
    
    # number of threads
    nthreads = 10,

    # other parameters passed to the align and featureCounts functions
    ...)

Arguments

index

A character string providing the base name of index files created for a reference genome by the buildindex function.

sample.index

A data frame containing the index set name for each sample, along with other sample-related information. The data frame must contain four columns named InputDirectory, Lane, SampleName and IndexSetName. Note that this is not the Sample Sheet generated by the Illumina sequencer. cellCounts uses the index set names in this data frame to generate a Sample Sheet, which it then uses to demultiplex all the samples. The index set name given for a sample specifies the set of indices used to sequence that sample; an example is "SI-P01-A2". The InputDirectory column gives the one or more directories in which the raw sequencing data are saved. See Details for more information.

input.mode

Specify the input mode. Currently only the BCL-format input is supported ("BCL").

cell.barcode

A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If NULL, cellCounts determines the cell barcode set by matching the cell barcode sequences of the first 100,000 reads in the data against the three cell barcode sets used by 10X Genomics. NULL by default.

aligner

Specify the name of the aligner used for read mapping. Currently it has only one possible value "align", indicating that the align function will be used for mapping.

annot.inbuilt

Specify an inbuilt annotation for UMI counting. See featureCounts for more details. "mm10" by default.

annot.ext

Specify an external annotation for UMI counting. See featureCounts for more details. NULL by default.

isGTFAnnotationFile

See featureCounts for more details. FALSE by default.

GTF.featureType

See featureCounts for more details. "exon" by default.

GTF.attrType

See featureCounts for more details. "gene_id" by default.

useMetaFeatures

Specify whether UMI counting should be carried out at the meta-feature level (e.g. gene level). See featureCounts for more details. TRUE by default.

nthreads

A numeric value giving the number of threads used for read mapping and counting. 10 by default.

...

Other parameters passed to the align and featureCounts functions.
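
As a rough illustration of how these arguments fit together, the sketch below wraps a call that uses only argument names from the Usage section of this page. The wrapper name run.cellcounts and its parameters index.basename, sample.sheet and threads are hypothetical, and the call will only work if Rsubread is installed and the index and raw data actually exist.

```r
# Hypothetical wrapper around cellCounts; the argument names passed to
# cellCounts follow the Usage section of this page.
run.cellcounts <- function(index.basename, sample.sheet, threads = 10) {
  if (!requireNamespace("Rsubread", quietly = TRUE)) {
    stop("The Rsubread package is required.")
  }
  Rsubread::cellCounts(
    index         = index.basename,  # base name of files made by buildindex
    sample.index  = sample.sheet,    # data frame with the four required columns
    input.mode    = "BCL",           # the only input mode documented here
    annot.inbuilt = "mm10",          # documented default annotation
    nthreads      = threads
  )
}
```

Nothing is run when the wrapper is defined; it would be invoked as run.cellcounts("my_index", my.sample.sheet), where both argument values are placeholders.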

Details

The cellCounts function takes as input raw scRNA-seq read data generated from the 10X Genomics platform. It utilizes the read mapping and counting functions included in the Rsubread package to process the scRNA-seq data. It calls the align function to map reads to a reference genome and calls the featureCounts function to assign reads to genes. It performs sample demultiplexing, cell barcode demultiplexing and read deduplication before producing UMI counts for each gene in each cell. The cellCounts function is able to process multiple datasets stored in multiple different directories at the same time.
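
The idea behind read deduplication and UMI counting can be illustrated with a toy example in base R. This is only a conceptual sketch on made-up data, assuming exact matching of cell barcode, gene and UMI; it is not the procedure implemented inside cellCounts, and the object names reads, molecules and umi.counts are chosen here for illustration.

```r
# Toy reads: each row is one read already assigned to a cell barcode,
# a gene and a UMI (hypothetical values).
reads <- data.frame(
  cell = c("AAAC", "AAAC", "AAAC", "AAAC", "TTTG", "TTTG"),
  gene = c("GeneA", "GeneA", "GeneA", "GeneB", "GeneA", "GeneA"),
  umi  = c("ACGT", "ACGT", "GGCA", "ACGT", "ACGT", "ACGT"),
  stringsAsFactors = FALSE
)

# Deduplicate: reads sharing the same cell, gene and UMI count as one molecule.
molecules <- unique(reads)

# Tabulate molecules into a genes-by-cells UMI count matrix.
umi.counts <- unclass(table(molecules$gene, molecules$cell))
```

Rows of umi.counts are genes and columns are cells, which mirrors the layout of the count matrices described under Value.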

Sample-related information should be provided to the sample.index parameter. This includes the name of the index set used for each sample, the sample name, the flowcell lane used to sequence each sample and the location where the sample data were saved. All of this information should be stored in a data.frame object, which can then be provided to the sample.index parameter. Below is an example of the data.frame object provided to sample.index:

InputDirectory     Lane  SampleName  IndexSetName
/path/to/dataset1  1     Sample1     SI-GA-E1
/path/to/dataset1  1     Sample2     SI-GA-E2
/path/to/dataset1  2     Sample1     SI-GA-E1
/path/to/dataset1  2     Sample2     SI-GA-E2
/path/to/dataset2  1     Sample3     SI-GA-E3
/path/to/dataset2  1     Sample4     SI-GA-E4
/path/to/dataset2  2     Sample3     SI-GA-E3
/path/to/dataset2  2     Sample4     SI-GA-E4
...
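
A data frame with the rows shown above could be constructed in base R roughly as follows. This is a minimal sketch: the directory paths are the placeholders from the example, and the object names sample.sheet and required are assumptions made for illustration.

```r
# Build the sample.index data frame from the example rows.
# The paths are placeholders; replace them with real run directories.
sample.sheet <- data.frame(
  InputDirectory = rep(c("/path/to/dataset1", "/path/to/dataset2"), each = 4),
  Lane           = rep(c(1, 1, 2, 2), times = 2),
  SampleName     = c("Sample1", "Sample2", "Sample1", "Sample2",
                     "Sample3", "Sample4", "Sample3", "Sample4"),
  IndexSetName   = c("SI-GA-E1", "SI-GA-E2", "SI-GA-E1", "SI-GA-E2",
                     "SI-GA-E3", "SI-GA-E4", "SI-GA-E3", "SI-GA-E4"),
  stringsAsFactors = FALSE
)

# Check that the four required column headers are present.
required <- c("InputDirectory", "Lane", "SampleName", "IndexSetName")
stopifnot(all(required %in% colnames(sample.sheet)))
```

Checking the column headers before calling cellCounts is a cheap way to catch a misspelled header early.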

Value

The cellCounts function returns a List object to R, and it also outputs three gzipped FASTQ files and one BAM file for each sample. The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively. The BAM file includes location-sorted read mapping results.

The returned List object contains the following components:

counts

a List object including UMI counts for each sample. Each component in this object is a matrix that contains UMI counts for a sample. Rows in the matrix are genes and columns are cells.

annotation

a data.frame object containing an annotation (e.g. a gene annotation). UMIs were assigned to features (e.g. genes) in this annotation. Rows in the annotation are features. Columns of the annotation include GeneID, Chr, Start, End and Length.

sample.info

a data.frame object containing sample information and also statistics related to the quantification result. It includes the following columns: SampleName, InputDirectory, TotalCells, HighConfidenceCells, RescuedCells, TotalUMI, TotalReads, MappedReads and AssignedReads. The order of samples in this object is the same as that in the components counts and cell.confidence.

cell.confidence

a List object indicating if a cell is a high-confidence cell or a rescued cell (low confidence). Each component in the List object corresponds to a sample. Each component is a logical vector with a TRUE value indicating a high-confidence cell.
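
The components described above might be combined as in the following sketch, which uses a small mock object in place of a real cellCounts result. All values are invented, and the sketch assumes that the entries of each cell.confidence vector are in the same order as the columns of the matching counts matrix, which this page does not state explicitly.

```r
# Mock result with the documented structure (hypothetical values, one sample).
counts.mat <- matrix(c(5, 0, 2, 1, 0, 3), nrow = 2,
                     dimnames = list(c("GeneA", "GeneB"),
                                     c("cell1", "cell2", "cell3")))
res <- list(
  counts          = list(counts.mat),
  cell.confidence = list(c(TRUE, FALSE, TRUE)),
  sample.info     = data.frame(SampleName = "Sample1", TotalReads = 1000,
                               MappedReads = 900, AssignedReads = 600)
)

# Keep only high-confidence cells (TRUE entries) for the first sample.
high.conf <- res$counts[[1]][, res$cell.confidence[[1]], drop = FALSE]

# Total UMIs per retained cell, and the fraction of reads that were mapped.
umis.per.cell   <- colSums(high.conf)
mapped.fraction <- res$sample.info$MappedReads / res$sample.info$TotalReads
```

Using drop = FALSE keeps the result a matrix even when only one cell passes the filter.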

Author(s)

Yang Liao and Wei Shi

See Also

buildindex, align, featureCounts


Rsubread documentation built on March 17, 2021, 6:01 p.m.