Calculate QC metrics
calculateQCMetrics(object, feature_controls = NULL, cell_controls = NULL,
    nmads = 5, pct_feature_controls_threshold = 80)
object
an SCESet object containing expression values and experimental information. Must have been appropriately prepared.
feature_controls
a named list containing one or more vectors (a character vector of feature names, a logical vector, or a numeric vector of indices are all acceptable) used to identify feature controls (for example, ERCC spike-in genes, mitochondrial genes, etc.).
cell_controls
a character vector of cell (sample) names, or a logical vector, or a numeric vector of indices used to identify cell controls (for example, blank wells or bulk controls).
nmads
numeric scalar giving the number of median absolute deviations to be used to flag potentially problematic cells based on total_counts (total number of counts for the cell, or library size) and total_features (number of features with non-zero expression). For total_features, cells are flagged for filtering only if total_features is
pct_feature_controls_threshold
numeric scalar giving a threshold for the percentage of expression values accounted for by feature controls. Used to flag cells that may be filtered out because a high percentage of their expression comes from feature controls.
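The three accepted forms for a feature control set (names, logical vector, numeric indices) can all point at the same features. The base R sketch below illustrates this on made-up feature names; the helper to_index() is hypothetical and is not part of the package.

```r
# Toy feature names (made up); the first two stand in for ERCC spike-ins.
feature_names <- c("ERCC-1", "ERCC-2", "MT-1", "GeneA", "GeneB")

# Three equivalent ways to identify the same control features.
by_name    <- c("ERCC-1", "ERCC-2")
by_logical <- grepl("^ERCC-", feature_names)
by_index   <- 1:2

# Hypothetical helper (not a scater function): convert any of the
# accepted forms to integer row indices.
to_index <- function(x, all_names) {
  if (is.character(x)) {
    match(x, all_names)
  } else if (is.logical(x)) {
    which(x)
  } else {
    as.integer(x)
  }
}

# A named list with one entry per control set, as the argument expects.
feature_controls <- list(ERCC = by_name, MT = grepl("^MT-", feature_names))
control_idx <- lapply(feature_controls, to_index, all_names = feature_names)
```

Naming each list element matters because, as described in the details, the set names are appended to the names of the resulting metrics.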
Calculate useful quality control metrics to help with pre-processing of data and identification of potentially problematic features and cells.
The following QC metrics are computed:
Total number of counts for the cell (aka “library size”)
Total counts on the log10-scale
The number of endogenous features (i.e. not control features) for the cell that have expression above the detection limit (default detection limit is zero)
Would this cell be filtered out based on its log10-depth being (by default) more than 5 median absolute deviations from the median log10-depth for the dataset?
Would this cell be filtered out based on its coverage being (by default) more than 5 median absolute deviations from the median coverage for the dataset?
Should the cell be filtered out on the basis of having a high percentage of counts assigned to control features? Default threshold is 80 percent (i.e. cells with more than 80 percent of counts assigned to feature controls are flagged).
Total number of counts for the cell that come from (one or more sets of user-defined) control features. Defaults to zero if no control features are indicated. If more than one set of feature controls is defined (for example, ERCC and MT genes are defined as controls), then this metric is produced for all sets, plus the union of all sets (so here, we get columns counts_feature_controls_ERCC, counts_feature_controls_MT and counts_feature_controls).
Just as above, the total number of counts from feature controls, but on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid negative infinite values) if no feature controls are indicated.
Just as for the counts described above, but expressed as a percentage of the total counts. Defined for all control sets and their union, just like the raw counts. Defaults to zero if no feature controls are defined.
Would this cell be filtered out on the basis that the percentage of counts from feature controls is higher than a defined threshold (default is 80%)? Just as with counts_feature_controls, this is defined for all control sets and their union.
What percentage of the total counts is accounted for by the 50 highest-count features? Also computed for the top 100 and top 200 features, with the obvious changes to the column names. Note that the top “X” percentage will not be computed if the total number of genes is less than “X”.
Percentage of features that are not “detectably expressed”, i.e. have expression below the lowerDetectionLimit threshold.
Total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to 'depth' if no control features are indicated.
Total number of counts from endogenous features on the log10-scale. Defaults to all counts if no control features are indicated.
Number of defined feature controls that have expression greater than the threshold defined in the object (that is, they are “detectably expressed”; see object@lowerDetectionLimit to check the threshold). As with other metrics for feature controls, defined for all sets of feature controls (set names appended as above) and their union. So we might commonly get columns n_detected_feature_controls_ERCC, n_detected_feature_controls_MT and n_detected_feature_controls (ERCC and MT genes detected).
Has the cell been defined as a cell control? If more than one set of cell controls is defined (for example, blanks and bulk libraries are defined as cell controls), then this metric is produced for all sets, plus the union of all sets (so we could typically get columns is_cell_control_Blank, is_cell_control_Bulk, and is_cell_control, the latter including both blanks and bulks as cell controls).
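Several of the cell-level metrics above can be sketched in base R on a toy count matrix. This is not the package's implementation: the matrix, the helper functions is_outlier() and pct_top(), the use of mad() with its default scaling constant, and the +1 offset for total counts on the log10 scale are assumptions made for illustration.

```r
set.seed(1)
# Toy count matrix: 100 features (rows) x 20 cells (columns).
counts <- matrix(rpois(100 * 20, lambda = 5), nrow = 100,
                 dimnames = list(paste0("Gene", 1:100), paste0("Cell", 1:20)))
# Make the last cell a near-empty library whose few counts are all controls.
counts[, 20] <- 0
counts[1:3, 20] <- 1

feature_controls <- list(ERCC = 1:10, MT = 11:15)  # two illustrative control sets
nmads <- 5
pct_threshold <- 80
detection_limit <- 0

# Library size, raw and on the log10 scale (+1 offset is an assumption).
total_counts <- colSums(counts)
log10_total_counts <- log10(total_counts + 1)

# Flag cells more than `nmads` MADs away from the median (assumed helper).
is_outlier <- function(x, nmads) abs(x - median(x)) > nmads * mad(x)
filter_on_total_counts <- is_outlier(log10_total_counts, nmads)

# One column per control set, plus the union of all sets.
control_sets <- c(feature_controls, list(unique(unlist(feature_controls))))
names(control_sets) <- c(paste0("counts_feature_controls_", names(feature_controls)),
                         "counts_feature_controls")
counts_fc <- sapply(control_sets,
                    function(idx) colSums(counts[idx, , drop = FALSE]))

# Percentage of each cell's counts from controls, and the threshold flag.
pct_fc <- 100 * counts_fc / total_counts
filter_on_pct <- pct_fc > pct_threshold

# Counts from endogenous (non-control) features.
counts_endogenous <- total_counts - counts_fc[, "counts_feature_controls"]

# Percentage of counts taken by the 50 highest-count features in each cell.
pct_top <- function(x, n) 100 * sum(sort(x, decreasing = TRUE)[seq_len(n)]) / sum(x)
pct_counts_top_50 <- apply(counts, 2, pct_top, n = 50)

# Percentage of features not detectably expressed in each cell.
pct_dropout <- 100 * colMeans(counts <= detection_limit)
```

In this toy data the last cell is flagged both by the library-size filter and by the feature-control percentage filter, because its three counts all fall on ERCC features.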
These cell-level QC metrics are added as columns to the “phenotypeData” slot of the SCESet object so that they can be inspected and are readily available for other functions to use. Furthermore, wherever “counts” appear in the above metrics, the same metrics will also be computed for “exprs”, “tpm” and “fpkm” values (if TPM and FPKM values are present in the SCESet object), with the appropriate term replacing “counts” in the name.
The following feature-level QC metrics are also computed:
The mean expression level of the gene/feature.
The rank of the feature's mean expression level in the cell.
The number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).
The total number of counts assigned to that feature across all cells.
Total feature counts on the log10-scale.
The percentage of all counts that are accounted for by the counts assigned to the feature.
The percentage of all cells that have no detectable expression (i.e. is_exprs(object) is FALSE) for the feature.
Is the feature a control feature? Default is 'FALSE' unless control features are defined by the user. If more than one feature control set is defined (as above), then a column of this type is produced for each control set (e.g. here, is_feature_control_ERCC and is_feature_control_MT) as well as the column named is_feature_control, which indicates if the feature belongs to any of the control sets.
These feature-level QC metrics are added as columns to the “featureData” slot of the SCESet object so that they can be inspected and are readily available for other functions to use. As with the cell-level metrics, wherever “counts” appear in the above, the same metrics will also be computed for “exprs”, “tpm” and “fpkm” values (if TPM and FPKM values are present in the SCESet object), with the appropriate term replacing “counts” in the name.
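The feature-level metrics can be sketched the same way, working across rows instead of columns. Again this is only a base R illustration on a made-up matrix: the variable names are chosen for readability and are not guaranteed to match the column names the function writes, and the +1 offset on the log10 scale is an assumption.

```r
# Toy count matrix: 4 features (rows) x 4 cells (columns).
counts <- rbind(
  GeneA = c(0, 0, 0, 0),
  GeneB = c(10, 20, 30, 40),
  GeneC = c(1, 0, 0, 3),
  ERCC1 = c(5, 5, 5, 5)
)
colnames(counts) <- paste0("Cell", 1:4)

detection_limit <- 0
control_names <- "ERCC1"  # illustrative control set

feature_qc <- data.frame(
  # Mean expression level of each feature across cells.
  mean_counts = rowMeans(counts),
  # Rank of that mean among all features (1 = lowest).
  rank_counts = rank(rowMeans(counts)),
  # Number of cells in which the feature is above the detection limit.
  n_cells_counts = rowSums(counts > detection_limit),
  # Total counts for the feature, raw and on the log10 scale.
  total_feature_counts = rowSums(counts),
  log10_total_feature_counts = log10(rowSums(counts) + 1),
  # Share of all counts in the dataset assigned to this feature.
  pct_total_counts = 100 * rowSums(counts) / sum(counts),
  # Percentage of cells with no detectable expression of the feature.
  pct_dropout = 100 * rowMeans(counts <= detection_limit),
  # Membership of the (single) control set.
  is_feature_control = rownames(counts) %in% control_names
)
```

Here GeneA is undetected in every cell (100 percent dropout), while GeneB accounts for most of the counts in the dataset.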
an SCESet object
data("sc_example_counts")
data("sc_example_cell_info")
pd <- new("AnnotatedDataFrame", data=sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData=sc_example_counts, phenoData=pd)
example_sceset <- calculateQCMetrics(example_sceset)
## with a set of feature controls defined
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)
## with a named set of feature controls defined
example_sceset <- calculateQCMetrics(example_sceset,
                                     feature_controls = list(ERCC = 1:40))