This stage build upon the 01_input_qc
stage and performs perhaps the most important steps in scRNA-seq data analysis:
normalization, highly variable gene selection, dimensionality reduction, clustering, cell type annotation, and
visualization.
r emoji::emoji("gear")
Config file: config/single_sample/02_norm_clustering.yaml
r emoji::emoji("clipboard")
HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]
r emoji::emoji("scroll")
Example report
(used config)
r emoji::emoji("scroll")
Example simplified report
(used config)
r emoji::emoji("ladder")
Structure
Seurat::cc.genes.updated.2019
and Seurat::CellCycleScoring()
functionscuttle::computePooledFactors()
, scuttle::logNormCounts()
)sctransform
also returns a specified number of HVGs.sctransform
normalization{scdrake}
is using a method based on the percentage of variance explained by the cell cycle phase
in the expression profile for each gene
(details)scDblFinder::computeDoubletDensity()
).forced
){SingleR}
. More details are given in the accompanying
book and the underlying method is described
here.{celldex}
package.{celldex}
package don't contain every possible cell type and the
annotation may bring inaccurate labels for some datasets. Thus, it is also possible to supply {SingleR}
a custom
reference datasets in the form of {SingleCellExperiment}
or {SummarizedExperiment}
object.Config for this stage is stored in the config/single_sample/02_norm_clustering.yaml
file.
Directory with this file is read from SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR
environment variable upon {scdrake}
load or
attach, and saved as scdrake_single_sample_config_dir
option. This option is used as the default argument value in
several {scdrake}
functions.
SPATIAL: False
Type: logical scalar
If True
, pipeline enables spatial extension.
This option enables spatial extension as selection of spatially variable genes (SVGS), pseudo tissue visualization and other visualization options for marker-based annotation.
NORMALIZATION_TYPE: "scran"
Type: character scalar ("scran"
| "sctransform"
| "none"
)
A normalization type:
"scran"
: normalization by deconvolution."sctransform"
: regularized negative binomial regression. The returned counts in log1p (natural log) scale are
converted to log2. Note: SingleCellExperiment
normalized by this method cannot be used for integration."none"
: skip normalization. This should be used only in case you are importing a SingleCellExperiment
object in the 01_input_qc
stage and this object was normalized before within this pipeline,
i.e. you are importing a modified sce_norm
target or one of its downstream, dependent SingleCellExperiment
targets.sce_final_norm_clustering
, subset cells to e.g. three clusters, save the result to Rds
file, and run the pipeline again while importing this file in the 01_input_qc
stage.
Note that in that case you should disable empty droplets removal, and cell and gene filtering,
as data are already of high quality. Overall, this is an example of reclustering, which is otherwise hard to
achieve within the pipeline.See vignette("pipeline_overview")
for more details about normalization methods.
scran
normalization parametersThese parameters are used when NORMALIZATION_TYPE
is "scran"
.
SCRAN_USE_QUICKCLUSTER: True
Type: logical scalar
Whether to cluster cells prior to normalization. See ?scran::quickCluster
for more details.
SCRAN_QUICKCLUSTER_METHOD: "igraph"
Type: character scalar ("igraph"
| "hclust"
)
A clustering method: "igraph"
uses graph-based clustering, while "hclust"
uses hierarchical clustering.
See ?scran::quickCluster
for more details.
sctransform
normalization parametersThese parameters are used when NORMALIZATION_TYPE
is "sctransform"
.
SCT_VARS_TO_REGRESS: null
Type: list of character scalar or null
Batch variables to regress out during the SCTransform
normalization. You can regress out variables such as
["percent.mt", "nFeature_RNA", "nCount_RNA"]
or cell cycle using ["S.Score", "G2M.Score"]
or ["CC.Difference"]
.
However, for correction of cell cycle effect, a better method is the removal of cell cycle-related genes during
selection of highly variable genes (HVGs).
This can be set by HVG selection parameters, and the approach is explained
in more details in vignette("pipeline_overview")
.
SCT_N_HVG: 3000
Type: integer scalar
A number of top HVGs to return from the SCTransform
normalization.
HVG_METRIC: "gene_var"
Type: character scalar ("gene_var"
| "gene_cv2"
| "sctransform"
)
A metric used to find HVGs. See https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html for more details.
"gene_var"
: variance of the log-counts (scran::modelGeneVar()
)."gene_cv2"
: coefficient of variation (scran::modelGeneCV2()
)."sctransform"
: use HVGs selected by SCTransform
.
Only relevant when the NORMALIZATION_TYPE
parameter is "sctransform"
.
You can adjust number of these HVGs by the SCT_N_HVG
parameter.
When SPATIAL option is enabled, spatially variable genes (SVGs) are automaticaly calculated together with HVGs. That is done using Seurat::SVFInfo, selection method MoransI. A union of unique SVGs and HVGs is taking to further processing, see https://www.biorxiv.org/content/10.1101/2021.08.27.457741v1 for more details. HVG_SELECTION: "top"
Type: character scalar ("top"
| "significance"
| "threshold"
)
A HVG selection strategy. Only relevant when the HVG_METRIC
parameter is "gene_var"
or "gene_cv2"
.
"top"
: take top X genes according to the metric
(details)."significance"
: use a FDR threshold
(details)."threshold"
: use a threshold on minimum value of the metric
(details).For "top"
and "threshold"
, bio
and ratio
columns are used for HVG_METRIC
of "gene_var"
and "gene_cv2"
,
respectively. These columns are present in the DataFrame returned from the underlying method
(scran::modelGeneVar()
or scran::modelGeneCV2()
).
HVG_SELECTION_VALUE: 1000
Type: depends on value of the HVG_SELECTION
parameter:
HVG_SELECTION: "top"
.HVG_SELECTION: "significance"
.HVG_SELECTION: "threshold"
.A threshold value for HVG selection. The following parameter combinations are recommended defaults:
HVG_SELECTION: "top" HVG_SELECTION_VALUE: 1000
0.05
.HVG_SELECTION: "significance" HVG_SELECTION_VALUE: 0.05
0
or 1
.HVG_METRIC: "gene_var" HVG_SELECTION: "threshold" HVG_SELECTION_VALUE: 0
HVG_METRIC: "gene_cv2" HVG_SELECTION: "threshold" HVG_SELECTION_VALUE: 1
HVG_RM_CC_GENES: False
Type: logical scalar
Whether to apply a correction for cell cycle prior to HVG selection. We implement a cell cycle-related genes removal strategy described here. It is based on the percentage of variance explained by the cell cycle phase in the expression profile for each gene.
Note that when SCTransform
is used for normalization including cell cycle variables to regress out, than
the results of this strategy remain unclear.
HVG_CC_GENES_VAR_EXPL_THRESHOLD: 5
Type: positive numeric scalar
A threshold on gene cell cycle variance explained, i.e. genes with var. expl. greater than this threshold will be marked as cell cycle-related and removed prior to selection of HVGs.
Doublets are droplets in which multiple cells were captured.
MAX_DOUBLET_SCORE: 3.5
Type: positive numeric scalar or null
A threshold for doublet score computed by scDblFinder::computeDoubletDensity()
.
Cells with doublet score greater than the threshold will be removed.
If null
, doublets are not removed, but doublet score is still computed.
PCA_SELECTION_METHOD: "forced"
Type: character scalar ("forced"
| "elbow"
| "technical_noise"
)
A method to select first N principal components (PCs):
"forced"
: use the number of PCs specified by the PCA_FORCED_PCS
parameter."elbow"
: use the
elbow point method."technical_noise"
: use the
technical noise.PCA_FORCED_PCS: 15
Type: positive integer scalar
If the PCA_SELECTION_METHOD
parameter is "forced"
, use this number of first principal components.
t-stochastic neighbor embedding
See ?scater::runTSNE
for the underlying method.
TSNE_PERP: 20
Type: numeric scalar
t-SNE perplexity.
TSNE_MAX_ITER: 1000
Type: positive integer scalar
A maximum number of t-SNE iterations.
Parameters for graph-based clustering.
CLUSTER_GRAPH_SNN_K: 10
Type: positive integer scalar
A number of shared nearest neighbors in graph used for community detection algorithms (graph-based clustering).
Passed as the k
parameter to bluster::makeSNNGraph()
. From the function's help page:
The choice of k controls the connectivity of the graph and the resolution of community detection algorithms. Smaller values of k will generally yield smaller, finer clusters, while increasing k will increase the connectivity of the graph and make it more difficult to resolve different communities. The value of k can be roughly interpreted as the anticipated size of the smallest subpopulation.
CLUSTER_GRAPH_SNN_TYPE: "rank"
Type: character scalar ("rank"
| "number"
| "jaccard"
)
A type of weighting scheme to use for shared neighbors. Passed as type
parameter to bluster::makeSNNGraph()
(see the function's help page for more information).
CLUSTER_GRAPH_LEIDEN_ENABLED: True
Type: logical scalar
If True
, enable Leiden community detection algorithm.
CLUSTER_GRAPH_LEIDEN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]
Type: numeric vector
Resolutions for Leiden clustering. Lower values result in more coarse-grained clustering, while higher ones to more
fine-grained structures. Cell-cluster memberships will be named as cluster_graph_leiden_r{r}
(e.g. in colData()
of a SCE object).
CLUSTER_GRAPH_LOUVAIN_ENABLED: False
Type: logical scalar
If True
, enable Louvain community detection algorithm.
CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]
Type: numeric vector
Resolutions for Louvain clustering. Lower values result in more coarse-grained clustering, while higher ones to more
fine-grained structures. Cell-cluster memberships will be named as cluster_graph_louvain_r{r}
(e.g. in colData()
of a SCE object).
CLUSTER_GRAPH_WALKTRAP_ENABLED: False
Type: logical scalar
If True
, enable Walktrap community detection algorithm.
CLUSTER_KMEANS_K_ENABLED: False
Type: logical scalar
If True
, enable k-means clustering with selected numbers of clusters (k
s).
CLUSTER_KMEANS_K: !code 3:6
Type: integer vector
A vector of number of clusters for which to compute k-means.
Note: !code 3:6
will evaluate to c(3, 4, 5, 6)
.
CLUSTER_KMEANS_KBEST_ENABLED: False
Type: logical scalar
If True
, enable k-means clustering with best k
based on gap statistics.
CLUSTER_SC3_ENABLED: False
Type: logical scalar
If True
, enable SC3 with selected numbers of clusters (k
s).
CLUSTER_SC3_K: !code 5:6
Type: integer vector
A vector of number of clusters for which to compute SC3 clustering.
Note: !code 3:6
will evaluate to c(3, 4, 5, 6)
.
CLUSTER_SC3_N_CORES: 1
Type: integer scalar
A number of CPU cores to use for SC3 computation. This may lead to out-of-memory errors for a number of cores greater than 1.
Note that drake
may have problems when the pipeline is run in parallel mode and at the same time a target is using
a parallel code (within target parallelism). For the SC3 computation is recommended to run the pipeline in
sequential mode (DRAKE_PARALLELISM: "loop"
) and set DRAKE_TARGETS: ["cluster_sce_sc3"]
.
Once the cluster_sce_sc3
target is finished, you can switch the parallelism back to clustermq
or future
.
CLUSTER_SC3_ENABLED: False
Type: logical scalar
If False
, skip SC3 clustering.
CELL_ANNOTATION_SOURCES_DEFAULTS: TRAIN_PARAMS: GENES: "de" SD_THRESH: 1 DE_METHOD: "classic" DE_N: null ASSAY_TYPE: "logcounts" CLASSIFY_PARAMS: QUANTILE: 0.8 TUNE_THRESH: 0.05 ASSAY_TYPE: "logcounts" PRUNE_SCORE_PARAMS: N_MADS: 3 MIN_DIFF_MED: -.inf MIN_DIFF_NEXT: 0 DIAGNOSTICS_PARAMS: HEATMAP_N_TOP_MARKERS: 20
Type: named list (dictionary) of named lists
Default parameters for computation and reporting of cell type annotation. These can be overriden for each of the cell
annotation reference source in CELL_ANNOTATION_SOURCES
(see below).
TRAIN_PARAMS
: used for training, passed to SingleR::trainSingleR()
.GENES: "de"
(character scalar: "de"
| "sd"
| "all"
):
feature selection mode, see Details in ?SingleR::trainSingleR()
.SD_THRESH: 1
(numeric scalar): the minimum threshold on the standard deviation per gene.
Only used when GENES: "sd"
.DE_METHOD: "classic"
(character scalar: "classic"
| "wilcox"
| "t"
): how DE genes should be detected
between pairs of labels. Defaults to "classic"
, which sorts genes by the log-fold changes and takes the top DE_N
.
Setting to "wilcox"
or "t"
will use Wilcoxon ranked sum test or Welch t-test between labels, respectively,
and take the top DE_N
upregulated genes per comparison.DE_N: null
(null
or integer scalar): the number of DE genes to use when GENES: "de"
. If DE_METHOD: "classic"
,
defaults to $500 * (2/3) ^ log2(N)$ where $N$ is the number of unique labels. Otherwise, null
defaults to 10
.ASSAY_TYPE: "logcounts"
(character scalar or integer): the assay of the reference dataset containing the relevant
expression matrix.CLASSIFY_PARAMS
: used for classification, passed to SingleR::classifySingleR()
.QUANTILE: 0.8
(numeric scalar): the quantile of the correlation distribution to use to compute the score for each label.TUNE_THRESH: 0.05
(numeric scalar): the maximum difference from the maximum correlation to use in fine-tuning.ASSAY_TYPE: "logcounts"
(character scalar or integer): the assay of the test dataset containing the relevant
expression matrix.PRUNE_SCORE_PARAMS
: used for label pruning after classification, passed to SingleR::pruneScores()
.N_MADS: 3
(numeric scalar): the number of MADs to use for defining low outliers in the per-label distribution
of delta values (i.e., difference from median).MIN_DIFF_MED: -.inf
(numeric scalar): the minimum acceptable delta for each cell.MIN_DIFF_NEXT: 0
(numeric scalar): the minimum difference between the best score and the next best score in fine-tuning.DIAGNOSTICS_PARAMS
: parameters for post-classification diagnostics.HEATMAP_N_TOP_MARKERS: 20
(integer scalar): how many top markers to put into heatmaps after classification.
Markers are computed after classification using the predicted cell labels.CELL_ANNOTATION_SOURCES: - human_primary_cell_atlas_main: reference_type: "celldex" reference: "HumanPrimaryCellAtlasData" description: > Microarray datasets derived from human primary cells (Mabbott et al. 2013). Most of the labels refer to blood subpopulations but cell types from other tissues are also available. label_column: "label.main" label_subsets: [] monaco_immune_fine: reference_type: "celldex" reference: "MonacoImmuneData" description: "This is the human immune reference that best covers all of the bases for a typical PBMC sample." label_column: "label.fine" label_subsets: []
Type: list of named lists or null
Specifies reference dataset(s) to be used for cell type annotation. The method used in {SingleR}
for training and
classification is described here.
The names of named lists are used to name the reference datasets, e.g. as the prefixes in names of colData()
columns
containing per-cell labels.
More info about the structure of related targets is given in the Targets section.
To skip the cell annotation step, define CELL_ANNOTATION_SOURCES
as null
.
Each reference dataset must have the following parameters:
reference_type: "celldex"
(character scalar: "celldex"
| "file"
): type of the reference dataset.reference: "HumanPrimaryCellAtlasData"
(character scalar):reference_type: "celldex"
: a name of function to call in the {celldex}
package. This function must return a
SummarizedExperiment
object. See {celldex}
's
vignette
for an overview of available datasets.reference_type: "file"
: a path to Rds file with saved SingleCellExperiment
or SummarizedExperiment
object.
The assay type specified TRAIN_PARAMS/assay_type
must be present in assayNames()
of the object.
The label_column
must be present in colData()
of the object.description: "..."
(character scalar): a description of the reference dataset, will appear in some plots.label_column: "label.main"
(character scalar): specifies which column from colData()
of the reference dataset
will be used for training. For datasets from the {celldex}
package, those are label.main
,
label.fine
(more granular labels), or label.ont
(cell ontology IDs).label_subsets: []
(character vector): labels to subset the reference dataset prior to training,
e.g. ["T cells", "B cells", "Progenitors"]
.In each entry (e.g. human_primary_cell_atlas_main
above) you can override default parameters in
CELL_ANNOTATION_SOURCES_DEFAULTS
(by using lowercase names), e.g.
CELL_ANNOTATION_SOURCES: - human_primary_cell_atlas_main: ... train_params: de_method: "wilcox" de_n: 30
will override corresponding parameters in TRAIN_PARAMS
in CELL_ANNOTATION_SOURCES_DEFAULTS
.
MANUAL_ANNOTATION: False
Type: logical scalar
Whether marker-based annotation is enabled. Marker-based annotation is based on expression profiles, markers are taken from user-defined file. Resulted annotation is stored in annotation_metadata
target. Marker-based cell annotation implemented from Giotto package enrichment function, which is based on PAGE method from Kim SY et al.
ANNOTATION_MARKERS: Null
Type: logical scalar
Path to user-define file contacting markers for each expected cell type. Each column name is one cell type, in rows are gene markers. Example:
Astrocytes,Endothelial,Neurons GFAP,PECAM1,SLC6A1 SLC1A3,A2M,CHAT GLUL,,GAD2
SCALE_ANNOTATION: False
Type: logical scalar
Whether to scale gene profiles.
OVERLAP: 5
Type: numeric scalar
Minimal overlap of genes in cell/cell in selected clusters, to assign cell type.
ANNOTATION_CLUSTERING: "cluster_kmeans_k4"
Type: character scalar
On which clustering (column of cell metadata) to perform annotation.
SHOW_VALUE: "value" (`"value"` | `"zscores"` | `"zscores_rescaled"`)
Type: character scalar
Which value to show in resulting heatmap.
MAKE_CELL_PLOT: False
Type: character scalar
Only toghether with spatial option. Whether to show pseudo tissue cell plots with enrichment of annotation.
HEATMAP_DIMRED: "umap"
Type: character scalar
Which dimension reduction use to show result of marker-based annotation.
Using the two parameters below, it is possible to add custom cell metadata or rearrange the current one into new groupings
(saved as colData()
columns of the SingleCellExperiment
object). These groupings can be then referenced in other parameters,
e.g. NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
or in the cluster_markers
and contrasts
stages.
ADDITIONAL_CELL_DATA_FILE: null
Type: character scalar or null
A path to file with additional cell data which will be added to colData()
of the final SCE object of this stage.
The additional data will overwrite existing columns (e.g. cluster_graph_louvain
), but are appended before the
CELL_GROUPINGS
below is applied, so you can refer to additional data columns inside CELL_GROUPINGS
.
In the additional data, there can be only cell barcodes that are present after cell filtering in the 01_input_qc
stage -
the missing ones will have NA
set in the resulting data. Rownames are not mandatory.
The supported file formats are CSV (comma separated, .csv
extension) and Rds (dataframe-like objects, .Rds
extension).
The additional data must contain Barcode
column by which they will be joined with the original cell data (colData()
).
CELL_GROUPINGS: null
Type: list of named lists or null
This parameter specifies new cell groupings based on existing columns (primarily the clustering ones). Each key is a
name of new cell grouping (column in colData()
of the sce_dimred
target) and contains a named list in the form:
source_column
(character scalar): a name of existing column in colData()
.description
(character scalar, optional): a description of the cell grouping, will appear in plots.
If not set, the name of the assignment will be used.assignments
(named list/vector of character scalars):<old level>: "<new level>"
: assignment of an old level to a new level. Different old levels with the same new
level will be merged. Order of assignments does not matter, levels are just replaced
(under the hood, dplyr::recode()
is used). Unspecified levels will be kept as they are.
You cannot replace existing groupings (columns).For possible column names for source_column
, see the sce_final_norm_clustering
target in the section below.
Let's look at example cell grouping:
CELL_GROUPINGS: - cluster_graph_leiden_r0.4_annotated: source_column: "cluster_graph_leiden_r0.4" description: "Graph-based clustering (Leiden alg.), annotated clusters" assignments: 1: "memory_CD4+" 6: "B" 7: "memory_CD4+"
cluster_graph_leiden_r0.4_annotated
: a name of new cell grouping added to colData()
.source_column: "cluster_graph_leiden_r0.4"
: a name of column in colData()
whose level will be used for the
new assignments.description: "Graph-based clustering ..."
: a description of the new assignment. If it was not set,
the name of the grouping (cluster_graph_leiden_r0.4_annotated
) would be used instead.assignments
: assignments of the old levels to new ones. That is, level (cluster) 6
will be renamed to "B"
,
and levels 1
and 7
will be merged to a new level "memory_CD4+"
.
All unspecified levels will be kept as they are.NORM_CLUSTERING_REPORT_DIMRED_NAMES: ["umap", "pca", "tsne"]
Type: character vector ("umap"
| "pca"
| "tsne"
)
A vector of names of dimensionality reduction methods for which plots in the report will be made.
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER: - "phase": "Cell cycle phases" "doublet_score": "Doublet score" "total": "Total number of UMI" "detected": "Detected number of genes"
Type: list of named lists with character scalars
Names of other variables in colData()
to plot dimreds and color by, displayed in the report.
The format is "variable_name": "description"
, where description will appear in plot title.
Variables defined in the CELL_GROUPINGS
parameter can have description
set to null
-
their description from CELL_GROUPINGS
will be used.
SELECTED_MARKERS_FILE: null
Type: character scalar or null
A path to CSV file with groups of markers to make expression dimred plots for. Set to null
to skip.
The plots will be created for each of NORM_CLUSTERING_REPORT_DIMRED_NAMES
.
The format is following: each row defines gene symbols for a group. First column is a name of group and second column are gene symbols delimited by ":". Example:
Naive_CD4+_T,IL7R:CCR7
Do not use a header!
An example CSV file for PBMC 1k dataset is automatically copied to a newly initialized project
(as system.file("extdata", "selected_markers.csv", package = "scdrake", mustWork = TRUE)
).
You can also view this file on GitHub here.
NORM_CLUSTERING_REPORT_RMD_FILE: "Rmd/single_sample/02_norm_clustering.Rmd" NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE: "Rmd/single_sample/02_norm_clustering_simple.Rmd"
Type: character scalar
Paths to RMarkdown files used for HTML reports of this pipeline stage.
NORM_CLUSTERING_REPORT_RMD_FILE
is technically more detailed, while NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE
outputs only the dimensionality reduction plots along with cell annotation.
NORM_CLUSTERING_BASE_OUT_DIR: "02_norm_clustering"
Type: character scalar
A path to base output directory for this stage. It will be created under BASE_OUT_DIR
specified in 00_main.yaml
config.
NORM_CLUSTERING_SELECTED_MARKERS_OUT_DIR: "selected_markers" NORM_CLUSTERING_DIMRED_PLOTS_OUT_DIR: "dimred_plots" NORM_CLUSTERING_OTHER_PLOTS_OUT_DIR: "other_plots" NORM_CLUSTERING_CELL_ANNOTATION_OUT_DIR: "cell_annotation" NORM_CLUSTERING_REPORT_HTML_FILE: "02_norm_clustering.html" NORM_CLUSTERING_REPORT_SIMPLE_HTML_FILE: "02_norm_clustering_simple.html"
Type: character scalar
Names of files and directories created under NORM_CLUSTERING_BASE_OUT_DIR
. Subdirectories are not allowed.
NORM_CLUSTERING_KNITR_MESSAGE: False NORM_CLUSTERING_KNITR_WARNING: False NORM_CLUSTERING_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for rendering of stage's HTML report.
Here you can find description of the most important targets for this stage.
However, for a full overview, you have to inspect the
source code of the
get_input_norm_clustering_subplan()
function.
HTML report target names: report_norm_clustering
, report_norm_clustering_simple
(only dimred plots)
SingleCellExperiment
objectssce_cc
: a SCE object (the sce_final_input_qc
target from the input_qc
stage) with added cell cycle information
to colData()
: phase
("G1"
, "G2m"
or "S"
), s_score
, g2m_score
, cc_difference
(s_score
- g2m_score
).
sce_norm
: sce_cc
with normalized counts in SingleCellExperiment::logcounts()
assay and normalization_type
item
added to metadata()
.
sce_norm_hvg
: sce_norm
with added HVG information:
metadata()
: hvg_metric
, hvg_selection
, hvg_selection_value
, hvg_metric_fit
, hvg_rm_cc_genes
, hvg_ids
rowData()
: is_hvg
(logical)If the parameter HVG_RM_CC_GENES
is True
, sce_norm_hvg
will also contain a PCA dimensionality reduction
(named pca_with_cc
) for whose computation all HVGs were used.
sce_rm_doublets
: sce_norm_hvg
with removed doublets and added information:
metadata()
: has_filtered_doublets
(FALSE
if MAX_DOUBLET_SCORE
parameter is null
), max_doublet_score
colData()
: doublet_score
(numeric), is_doublet
(logical)sce_pca
: sce_rm_doublets
with calculated PCA (50 PCs). Reduced dimensions can be retrieved by
SingleCellExperiment::reducedDim(sce_pca, "pca")
sce_pca_selected_pcs
: sce_pca
with selected number of PCs. Reduced dimensions in pca
slot are subsetted to
the selected number of PCs, and pca_all
slot contains the full matrix of 50 PCs. Information about selection of PCs
is added to metadata()
: pca_selection_method
, pca_selected_pcs
.
sce_dimred
: sce_pca_selected_pcs
with calculated t-SNE and UMAP,
both using the selected number of PCs for calculation
sce_final_norm_clustering
: a final SCE object of this stage which will be used in the cluster_markers
and
contrasts
stages. This object is derived from sce_dimred
, but cell clusterings along with CELL_GROUPINGS
are
added to colData()
.
The names of columns containing cell clusterings are:
cluster_graph_walktrap
, cluster_graph_louvain_r<r>
(e.g. cluster_graph_louvain_r0.4
),
cluster_graph_leiden_r<r>
(e.g. cluster_graph_leiden_r0.4
)<r>
is based on the CLUSTER_GRAPH_LEIDEN_RESOLUTIONS
and CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS
parameterscluster_kmeans_kbest
, cluster_kmeans_k<k>
(e.g. cluster_kmeans_k5
)<k>
is based on the CLUSTER_KMEANS_K
parametercluster_sc3_<k>
(e.g. cluster_sc3_k5
)<k>
is based on the CLUSTER_SC3_K
parameterThe names of reduces dimensionality matrices (which can be retrieved by SingleCellExperiment::reducedDim()
):
"pca"
"tsne"
"umap"
Or you can view their names with SingleCellExperiment::reducedDimNames()
pca_elbow_pcs
, pca_gene_var_pcs
: a number of selected PCs for elbow point and technical variance methods, respectively
pca_selected_pcs_plot
: a {ggplot2}
object showing cummulative variance explained and the selected
number of first PCs for each of the methods
clusters_all
: a named list of integer vectors which wraps all clusterings below
graph_snn
: an {igraph}
object of shared nearest-neighbor graph, returned from scran::buildSNNGraph()
cluster_graph_leiden
, cluster_graph_walktrap
, cluster_graph_louvain
: lists with cell clusters returned from igraph::cluster_walktrap()
and igraph::cluster_louvain()
, respectively. Both functions are using graph_snn
as the input
cluster_graph_louvain_df
, cluster_graph_walktrap_df
, cluster_graph_leiden_df
: dataframes with graph-based clustering results
cluster_graph_leiden_clustree
, cluster_graph_louvain_clustree
: {ggplot2}
objects of clustree::clustree()
plots
cluster_kmeans_kbest_gaps
: a clusGap
object used to calculate the best K. Returned from cluster::clusGap()
, to which matrix of
selected PCs is passed
cluster_kmeans_kbest_k
: the best K for k-means, returned from cluster::maxSE()
cluster_kmeans_kbest
, cluster_kmeans_k
: lists with cell clusters returned from stats::kmeans()
.
In the former, cluster_kmeans_kbest_k
is used as the number of clusters. In the latter, Ks from the CLUSTER_KMEANS_K
parameter are used.
cluster_kmeans_k_df
, cluster_kmeans_kbest_df
: dataframes with k-means clustering results
cluster_kmeans_k_clustree
: {ggplot2}
object of clustree::clustree()
plot. If CLUSTER_KMEANS_KBEST_ENABLED
is True
,
it will also contain k-means result with the best k
.
sce_sc3
: a SingleCellExperiment
object returned from SC3::sc3()
cluster_sc3
: a list with cell clusters
cluster_sc3_df
: a dataframe with SC3 clustering results
cluster_sc3_stability_plots
: a list of {ggplot2}
objects, each returned from make_sc3_stability_plots()
(where SC3::sc3_plot_cluster_stability()
is used internally)
cluster_sc3_clustree
: {ggplot2}
object of clustree::clustree()
plot
cell_annotation_params
: a tibble with parameters parsed from CELL_ANNOTATION_SOURCES
cell_annotation
: cell_annotation_params
with added cell_annotation
list column holding outputs (DataFrame
s)
from SingleR::SingleR()
. The structure of those DataFrame
s is described in ?SingleR::classifySingleR()
.
cell_annotation_labels
: a named list of character vectors with per-cell label assignments.
Later added to the cell_data
target and in turn, to colData()
of the sce_final_norm_clustering
and
sce_int_clustering_final
(in integration pipeline) targets/objects.
By default, there are three vectors with labels for each reference dataset named as:
<reference_name>_labels_raw
: "raw" labels. Taken from first.labels
column of SingleR::SingleR()
output<reference_name>_labels
: fine-tuned labels. Taken from labels
column of SingleR::SingleR()
output<reference_name>_labels_pruned
: fine-tuned and pruned labels (contains NA
s for low quality labels).
Taken from pruned.labels
column of SingleR::SingleR()
output.Note that you can use names of cell_annotation_labels
's items in the CELL_GROUPINGS
and
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
parameters, and also use them in cluster_markers
(CLUSTER_MARKERS_SOURCES
) and contrasts
(CONTRASTS_SOURCES
) stages
cell_annotation_diagnostic_plots
: cell_annotation
with added list columns holding diagnostic plots,
i.e. for each reference dataset:
score_heatmaps
: heatmaps of per-cell label scores created for each clustering (as column annotation)
(details).marker_heatmaps
: NULL
if cell_annotation$train_params$de
is not "de"
, otherwise heatmap for each label
containing top upregulated markers from pairwise t-tests
(details).
Number of top markers is specified in cell_annotation$diagnostics_params$heatmap_n_top_markers
.delta_distribution_plot
: violin plots (in one figure / object) of per-cell deltas for each label.
Deltas are differences between the score for the assigned label and the median across all labels for each cell
(details).cell_annotation_diagnostic_plots_files
: paths to PDF files of diagnostic plots in cell_annotation_diagnostic_plots
(score_heatmaps_out_file
, delta_distribution_plot_out_file
, marker_heatmaps_out_file
).
That means when you make this target, those files will be created.
hvg_plot
: a {patchwork}
object with HVG metric statistics (average expression vs. variance)
pca_phase_plots
: a list of one or two {ggplot2
} objects. The first one is a plot of the first two PCs colored by
cell cycle phase, where PCA was computed using HVGs. If the parameter HVG_RM_CC_GENES
is True
,
the second plot is the same as the first one, but with PCA computed on all genes. In that case it also means that
in the first plot, PCA was computed using HVGs with removed cell cycle-related genes.
pca_doublet_plot
, pca_total_plot
: a {ggplot2
} plots of the first two PCs colored by doublet score and total
number of UMIs, respectively
dimred_plots_clustering
: a tibble holding plots of reduced dimensions colored by a clustering
dimred_plots_clustering_files_out
, dimred_plots_clustering_united_files
: make this target to export the plots
(the latter produces a single multipage PDF)
dimred_plots_other_vars
: a tibble holding plots of reduced dimensions colored by a specified variable,
as defined by the NORM_CLUSTERING_REPORT_DIMRED_NAMES
and NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
parameters.
dimred_plots_other_vars_files_out
: make this target to export the plots
selected_markers_plots
: a tibble holding selected marker plots ({patchwork}
objects) for each dimensional
reduction specified in the NORM_CLUSTERING_REPORT_DIMRED_NAMES
parameter. Selected markers are read from CSV file
defined in the SELECTED_MARKERS_FILE
parameter.
selected_markers_plots_files_out
: make this target to export the plots
You can also use the underlying function selected_markers_dimplot()
:
selected_markers_dimplot( sce = sce_final_norm_clustering, dimred = "umap", selected_markers_df = selected_markers_df, assay = "logcounts" )
The selected_markers_df
tibble can be read from the CSV file with
selected_markers_df <- readr::read_csv("markers.csv", col_names = c("group", "markers"), col_types = "cc")
config_norm_clustering
: a list holding parameters for this stage
cc_genes
: a dataframe of cell cycle genes, based on Seurat::cc.genes.updated.2019
data. If the ORGANISM
parameter
in the 00_main.yaml
config is "mouse"
, then gene symbols are first converted to sentence-case ("MKI67"
-> "Mki67"
).
cell_data
: a DataFrame holding colData()
of sce_rm_doublets
, clusterings from the clusters_all
target,
cell annotation labels from the cell_annotation_labels
target, and the new cell groupings defined in
the CELL_GROUPINGS
parameter.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.