all_times <- list() # store the time for each chunk knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { now <<- Sys.time() } else { res <- difftime(Sys.time(), now, units = "secs") all_times[[options$label]] <<- res } } })) knitr::opts_chunk$set( tidy = TRUE, tidy.opts = list(width.cutoff = 95), message = FALSE, warning = FALSE, time_it = TRUE )
scCustomize has several helper functions to simplify/streamline common tasks in scRNA-seq analysis. Let's load packages and raw data object for this tutorial.
# Load Packages library(ggplot2) library(dplyr) library(magrittr) library(patchwork) library(Seurat) library(scCustomize) pbmc <- pbmc3k.SeuratData::pbmc3k
pbmc <- UpdateSeuratObject(pbmc)
We'll add some random meta data variables to pbmc data form use in this vignette
pbmc$sample_id <- sample(c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"), size = ncol(pbmc), replace = TRUE) pbmc$batch <- sample(c("Batch1", "Batch2"), size = ncol(pbmc), replace = TRUE)
One the first steps after creating object if often to calculate and add a number of different QC metrics at the cell level.
To simplify the process of adding cell-level QC metrics scCustomize contains a wrapper function which can be customized to add all or some of the available QC metrics. The functions have built-in support for large number of default species and support for both gene symbols or Ensembl IDs.
The default parameters of the function Add_Cell_QC_Metrics
will add:
pbmc <- Add_Cell_QC_Metrics(seurat_object = pbmc, species = "human")
For more details on cell QC metrics and the functions available in scCustomize please see Object QC Vignette. For details on plotting functions associated with QC metrics see QC Plotting vignette.
scCustomize contains a set of functions to aid in use of meta data both within and outside of objects.
pbmc <- pbmc3k.SeuratData::pbmc3k pbmc <- UpdateSeuratObject(object = pbmc) pbmc$sample_id <- sample(c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"), size = ncol(pbmc), replace = TRUE) pbmc$batch <- sample(c("Batch1", "Batch2"), size = ncol(pbmc), replace = TRUE)
Fetch_Meta()
functions as simple getter function to obtain meta data from object and return data.frame.
meta_data <- Fetch_Meta(object = pbmc) head(meta_data, 10)
meta_data <- Fetch_Meta(object = pbmc) head(meta_data, 10) %>% kableExtra::kbl(row.names = TRUE) %>% kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
While cell-level meta data is helpful in some situations often all that is required is sample-level meta data. This can easily be extracted and filtered using Extract_Sample_Meta()
.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id")
sample_meta %>% kableExtra::kbl(row.names = TRUE) %>% kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
As you can see by default Extract_Sample_Meta
removes a default set of columns (see documentation) which do not provide meaningful sample-level information (e.g., nFeature_RNA). However, you may want to remove other columns too. This can be achieved using either positive or negative selection using variables_include
or variables_exclude
parameters.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id", variables_exclude = c("nFeature_RNA", "nCount_RNA", "seurat_annotations", "orig.ident"))
sample_meta %>% kableExtra::kbl(row.names = TRUE) %>% kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
While some original number columns are not valid at sample-level it can be valuable to get summary information for those variables. This can be achieved by merging outputs with Median_Stats
function.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id", variables_exclude = c("nFeature_RNA", "nCount_RNA", "seurat_annotations", "orig.ident")) sample_median <- Median_Stats(seurat_object = pbmc, group_by_var = "sample_id") sample_merged <- right_join(x = sample_meta, y = sample_median)
sample_merged %>% kableExtra::kbl(row.names = TRUE) %>% kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
scCustomize provides easy function to add sample-level meta data to object without the need to first convert it to cell-level meta data. This makes adding meta data from summary or supplemental tables to cell level object data very easy.
In order to add meta data you will need to specify:
@meta.data
column that matches sample-level meta data (often "orig.ident"). This is example command:
obj <- Add_Sample_Meta(seurat_object = obj, meta_data = sample_meta, join_by_seurat = "orig.ident", join_by_meta = "sample_id")
Starting in Seurat V5 each assay now possess it's own meta.data slot which is feature-level meta data. During course of normal analysis this is where information on variable features is stored. However, we can also use it to store alternate feature names, in most cases this is Ensembl IDs matching the symbols used in object creation/analysis.
scCustomize provides the function Add_Alt_Feature_ID()
to automatically match and add these features using the same files used in object creation. Users only need to supply either path to the features.tsv.gz file or the hdf5 file produced from Cell Ranger output.
# Using features.tsv.gz file obj <- Add_Alt_Feature_ID(seurat_object = obj, features_tsv = "sample01/outs/filtered_feature_bc_matrix/features.tsv.gz", assay = "RNA") # Using hdf5 file obj <- Add_Alt_Feature_ID(seurat_object = obj, hdf5_file = "sample01/outs/outs/filtered_feature_bc_matrix.h5", assay = "RNA")
NOTE: If using features.tsv.gz file the file from either filtered or raw outputs can be used as they are identical.
NOTE: If using hdf5 file the file from either filtered_feature_bc or raw_feature_bc can be used as the features slot is identical. Though it is faster to load filtered_feature_bc file due to droplet filtering.
scCustomize also makes forward-facing a number of utilities that are used internally in functions but may also have utility on their own.
Feature_Present()
to check for features.Feature_Present
is fairly basic function to check if feature exists in data. It can be used with Seurat or LIGER objects as well as generic data formats (Matrix, data.frame, tibble).
In addition to some warning messages Feature_Present
returns a list with 3 entries when run:
bad_features
> 0 then Feature_Present
will convert the gene list
bad_features` to all upper case and to sentence case and check against all possible features to see if wrong case was provided. # Example gene list with all examples (found genes, wrong case (lower) and misspelled (CD8A forgetting to un-shift when typing 8)) gene_input_list <- c("CD14", "CD3E", "Cd4", "CD*A") genes_present <- Feature_Present(data = pbmc, features = gene_input_list)
Now let's look at the output:
genes_present
By default Feature_Present
has 3 sets of warnings/messages it prints to console when it finds issues. If using the function yourself on its own or wrapped inside your own function and you prefer no messages each of these can be toggled using optional parameters.
case_check_msg
prints and list of features if alternate case features are found in data. omit_warn
prints warning and list of all features not found in data. print_msg
prints message if all features in gene_list
are found in data. In order to keep run times down and support offer greater support for offline use Feature_Present
does not include a check for updated gene symbols. If you're dataset is from human cells/donors you can simply supply the not found features from Feature_Present
to Seurat's UpdateSymbolList
function.
gene_input_list <- c("CD14", "CD3E", "Cd4", "CD*A", "SEPT1") genes_present <- Feature_Present(data = pbmc, features = gene_input_list) check_symbols <- UpdateSymbolList(symbols = genes_present[[2]], verbose = TRUE)
It can often be advantageous to merge raw data before creating analysis objects vs creating lots of objects and merging them all later. scCustomize features a modified version of the internal LIGER function MergeSparseDataAll()
.
Merge_Sparse_Data_All()
will combine a list of sparse matrices and return single sparse matrix. Additionally, by specifying the add_cell_ids
parameter you can specify a prefix to be added to the barcodes from each entry in the list (using "_" as delimiter).
This function can be especially useful when combined with any of the scCustomize's Read_
data functions which automatically return named lists of matrices and the ability to specify sample orig.ident
when creating Seurat objects. See Read & Write Vignette for more info on the data import functions.
# Read in data GEO_10X <- Read10X_GEO(data_dir = "assets/GSE152183_RAW_Marsh/") # Merge data and add sample prefix GEO_10X_merged <- Merge_Sparse_Data_All(matrix_list = GEO_10X, add_cell_ids = names(GEO_10X)) # Create Seurat Object and specify orig.ident location GEO_10X_Seurat <- Seurat::CreateSeuratObject(counts = GEO_10X_merged, names.field = 1, names.delim = "_", min.features = 200, min.cells = 5)
Sometimes it can be advantageous to create a list of multiple Seurat Objects in order to run similar pipeline on all objects in loop.
NOTE: This function will likely be deprecated in near future given the updates to Seurat object structure and support for assays containing different sets of features and layers within assays.
To facilitate ease in merging such lists into single object scCustomize contains simple wrapper Merge_Seurat_List
that uses purrr::reduce()
to merge all objects in list into single combined object
list_of_objects <- list(obj1, obj2, obj2, ..., obj10) merged_seurat <- Merge_Seurat_List(list_seurat = list_of_objects) # Can also add sample specific ids to each object during the merge cell_ids <- c("sample1", "sample2", "sample3", ..., "sample10") merged_seurat <- Merge_Seurat_List(list_seurat = list_of_objects, add.cell.ids = cell_ids)
Seurat V5 objects now have the ability to split within the object into layers. However, I find that the syntax to do this is not the most intuitive and can be simplified with a new simple wrapper function: Split_Layers()
pbmc <- Split_Layers(seurat_object = pbmc, split.by = "sample_id")
Split_Layers()
defaults to "RNA" assay but can be used for any assay present in object (users should check whether splitting assay other than "RNA" is valid before proceeding).
Seurat objects contain an extra empty slot that can be used to store any extra information desired.
scCustomize contains two functions Store_Misc_Info_Seurat
and a wrapper around that function Store_Palette_Seurat
to make this process easy.
# Data can be vectors or data.frames misc_info <- "misc_vector_dataframe_list_etc" # Add data to the @misc slot in Seurat Object pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name")
If you are storing a list in the @misc
slot there is additional parameter that dictates whether to store the information as a list or whether to store each entry in the list separately.
# Create list misc_info <- list("misc_item1", "misc_item2", etc) # Store the list directly pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name", list_as_list = TRUE) # Store each entry in list as separate entity in `@misc` slot pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name", list_as_list = FALSE)
One of the most common times I use this function is to store color palettes associated with clustering or subclustering.
To make it easier to remember function call in this situation scCustomize contains a wrapper function Store_Palette_Seurat
.
# Data can be vectors or data.frames annotated_color_palette <- c("color1", "color2", "color3", "etc") # Add data to the @misc slot in Seurat Object pbmc <- Store_Palette_Seurat(seurat_object = pbmc, palette = annotated_color_palette, palette_name = "Round01_Color_Pal") # Then you can easily call that palette (with tab completion) when plotting without ever needing to reload the palette in current environment DimPlot(object = pbmc, cols = pbmc@misc$Round01_Color_Pal)
Sometimes, especially with public data, you may want to modify the cell barcode names before creating analysis object.
scCustomize contains a selection of functions to simplify this process:
Replace_Suffix
can be used on single matrix/data.frame or list of matrices/data.frames to modify to remove suffixes
# For single object data_mod <- Replace_Suffix(data = raw_data, current_suffix = "-1", new_suffix = "-2") # For list of objects containing same suffix raw_data_list <- list(raw_data1, raw_data2, raw_data3, ..., raw_data10) new_suffixes <- c("-1", "-2", "-3", ..., "-10") data_mod <- Replace_Suffix(data = raw_data_list, current_suffix = "-1", new_suffix = new_suffixes) # For list of objects containing different suffixes raw_data_list <- list(raw_data1, raw_data2, raw_data3, ..., raw_data10) old_suffixes <- c("-A", "-B", "-C", ..., "-J") new_suffixes <- c("-1", "-2", "-3", ..., "-10") data_mod <- Replace_Suffix(data = raw_data_list, current_suffix = old_suffixes, new_suffix = new_suffixes)
Replace_Suffix
can also be used to strip suffixes from data
# For single object data_mod <- Replace_Suffix(data = raw_data, current_suffix = "-1", new_suffix = "")
scCustomize has 3 functions to facilitate changing the type of delimiters present in cell barcodes.
Change_Delim_Prefix()
Change just the suffix delimiter.Change_Delim_Suffix()
Change just the prefix delimiter.Change_Delim_All()
Change all delimiters.These functions all take identical inputs and can be applied to either single matrix/data.frames or lists of matrices/data.frames.
data_mod <- Change_Delim_Prefix(data = raw_data, current_delim = ".", new_delim = "_") data_mod <- Change_Delim_Suffix(data = raw_data, current_delim = ".", new_delim = "_") data_mod <- Change_Delim_All(data = raw_data, current_delim = ".", new_delim = "_")
scCustomize also contains number of helper functions that may be generally helpful in other R analyses and are not limited to scRNA-seq.
See vignette on Color Palettes for full details on stored palettes and palette plotting functions.
Sometimes it can be helpful to split a vector into chunks of predetermined size or into X number of equally sized chunks. For this we can use Split_Vector()
and for random example lets use the variable features from pbmc object.
pbmc <- FindVariableFeatures(pbmc)
all_var_features <- VariableFeatures(object = pbmc) # Now we can split into vectors of 100 genes per vector split_100_genes <- Split_Vector(x = all_var_features, chunk_size = 100) lengths(split_100_genes) # Or we can split into 7 equally sized vectors split_7_chunks <- Split_Vector(x = all_var_features, num_chunk = 7) lengths(split_7_chunks)
The base R seq
family of functions has a ton of uses. However, due to the way numbers are ordered in R it can sometimes be helpful to have preceding zeros in your number sequence in order to keep things in numerical order (e.g., 01, 02, 03, instead of 1, 2, 3) especially when naming files.
The scCustomize function seq_zeros()
allows you to do just that.
# provide total sequence length seq_zeros(seq_length = 15) # provide sequence seq_zeros(seq_length = 1:15)
By default the function will add 1 or more zeros depending on the size of seq_length
seq5 <- seq_zeros(seq_length = 5) head(seq5) seq15 <- seq_zeros(seq_length = 15) head(seq15) seq150 <- seq_zeros(seq_length = 150) head(seq150) seq1500 <- seq_zeros(seq_length = 1500) head(seq1500)
Users can also manually set the number of preceding zeros using the num_zeros
parameter.
seq_zeros(seq_length = 5, num_zeros = 3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.