all_times <- list() # store the time for each chunk knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { now <<- Sys.time() } else { res <- difftime(Sys.time(), now, units = "secs") all_times[[options$label]] <<- res } } })) knitr::opts_chunk$set( tidy = TRUE, tidy.opts = list(width.cutoff = 95), message = FALSE, warning = FALSE, time_it = TRUE )
Seurat and other packages provide excellent tools for importing data however when importing large numbers of samples or samples with non-standard names this process can be cumbersome. scCustomize provides number of functions to simplify the process of importing many data sets at the same time and speed up the process using parallelization.
For this tutorial, I will be utilizing several publicly available data sets from NCBI GEO.
library(ggplot2) library(dplyr) library(magrittr) library(Seurat) library(scCustomize)
While most single cell packages have support for 10X and other commercial formats sometimes additional functions are required.
CellBender is tool for the removal of ambient RNA (GitHub, Preprint). Starting with CellBender v3 the output file is styled like Cell Ranger h5 files it also contains some additional information. This causes Seurat::Read10X_h5()
to to fail when trying to import data.
scCustomize has new function to read both new and old CellBender output files.
cell_bender_mat <- Read_CellBender_h5_Mat(file_name = "PATH/SampleA_out_filtered.h5")
Often when downloading files from NCBI GEO or other repos all of the files are contained in single directory and contain non-standard file names.
However, functions like Seurat::Read10X()
expect non-prefixed files (i.e. Cell Ranger outputs).
scCustomize has three functions to deal with these situations without need for renaming files.
The function Read10X_GEO
can be used to iteratively read all sets of 10X style files within single directory.
For this example I will be utilizing data from Marsh et al., 2022 (Nature Neuroscience), which were downloaded from NCBI GEO GSE152183
list.files("assets/GSE152183_RAW_Marsh/") GEO_10X <- Read10X_GEO(data_dir = "assets/GSE152183_RAW_Marsh/")
Read10X_GEO
Additional ParametersRead10X_GEO
also contains several additional optional parameters to streamline the import process.
parallel
and num_cores
parameters enable use of multiple cores to speed up data import.sample_list
By default Read10X_GEO
will import all sets of files found within single directory. However, if only a subset of files is desired a vector of sample prefixes can be supplied to sample_list
.sample_names
By default Read10X_GEO
names each entry in the returned list (see below) using the file name prefix. If different names are desired they can be supplied to sample_names
.Seurat::Read10X()
. See ?Read10X_GEO
for more details.Read10X_GEO
Import FormatRead10X_GEO
will return list of matrices (single modality data) or list of list of matrices (multi-modal data).
There is equivalent function for reading in 10X H5 formatted files Read10X_h5_GEO
.
NOTE: If files have shared aspect to file name specify this using shared_suffix
parameter to avoid that being incorporated into names to list entries in returned list.
GEO_10X <- Read10X_h5_GEO(data_dir = "/path/to/data/", shared_suffix = "filtered_feature_bc_matrix")
Importing CellBender h5 files from single directory can be done using Read_CellBender_h5_Multi_File()
, which functions very similar to Read10X_h5_GEO
.
Here is example directory/file setup:
Parent_Directory ├── Exp_Name │ └── Cell_Bender_Results │ └── SampleA_CB_out_filtered.h5 │ └── SampleB_CB_out_filtered.h5 │ └── SampleC_CB_out_filtered.h5
Multi_CB <- Read_CellBender_h5_Multi_File(data_dir = "Exp_Name/Cell_Bender_Results/", custom_name = "_CB_out_filtered.h5")
Often data is uploaded to NCBI GEO or other repositories with single file (.csv, .tsv, .txt, etc) containing all of the information.
For this example I will be utilizing data from Hammond et al., 2019 (Immunity), which were downloaded from NCBI GEO GSE121654.
Read_GEO_Delim
uses fread function for automatic detection of file delimiter and fast read times and then converts objects to sparse matrices to save memory
# Read in and use file names to name the list (default) GEO_Single <- Read_GEO_Delim(data_dir = "assets/GSE121654_RAW_Hammond/GSE121654_RAW_Hammond/", file_suffix = ".dge.txt.gz") # Read in and use new sample names to name the list GEO_Single <- Read_GEO_Delim(data_dir = "assets/GSE121654_RAW_Hammond/GSE121654_RAW_Hammond/", file_suffix = ".dge.txt.gz", sample_names = c("sample01", "sample02", "sample03", "sample04"))
sample_names
parameter.
Read_GEO_Delim
additional parametersSee manual entry for more info.
In addition to those functions for single directories, scCustomize contains functions for when files are contained in multiple sub-directories within shared parent directory.
NOTE: These functions all assume that each sub-directory contains one sample and that sub-directory structure is identical between all samples.
Take an abbreviated example directory found below styled as output from Cell Ranger count
Parent_Directory ├── sample_01 │ └── outs │ └── filtered_feature_bc_matrix │ └── feature.tsv.gz │ └── barcodes.tsv.gz │ └── matrix.mtx.gz └── sample_02 └── outs └── filtered_feature_bc_matrix └── feature.tsv.gz └── barcodes.tsv.gz └── matrix.mtx.gz
# In this case we can use default_10X = TRUE to tell function where to find the matrix files multi_10x <- Read10X_Multi_Directory(base_path = "Parent_Directory/", default_10X = TRUE)
In order to properly import the data Read10X_Multi_Directory
needs to know how to navigate the sub-directory structure.
default_10X
tells the function that the directory structure matches the standardized output from Cell Ranger (see above).secondary_path
parameter as long as structure is the same for all samples (see below).For instance:
Parent_Directory ├── sample_01 │ └── gex_matrices │ └── feature.tsv.gz │ └── barcodes.tsv.gz │ └── matrix.mtx.gz └── sample_02 └── gex_matrices └── feature.tsv.gz └── barcodes.tsv.gz └── matrix.mtx.gz
# In this case we can use default_10X = FALSE to tell function where to find the matrix files multi_10x <- Read10X_Multi_Directory(base_path = "Parent_Directory", default_10X = FALSE, secondary_path = "gex_matrices")
Read10X_Multi_Directory
also contains several additional parameters.
parallel
and num_cores
to use multiple core processing.sample_list
By default Read10X_Multi_Directory
will read in all sub-directories present in parent directory. However a subset can be specified by passing a vector of sample directory names.sample_names
As with other functions by default Read10X_Multi_Directory
will use the sub-directory names within parent directory to name the output list entries. Alternate names for the list entries can be provided here if desired. These names will also be used to add cell prefixes if merge = TRUE
(see below). merge
logical (default FALSE). Whether to combine all samples into single sparse matrix and using sample_names
to provide sample prefixes.scCustomize contains function: Read10X_h5_Multi_Directory
can be used to read 10X Genomics H5 files similarly to Read10X_Multi_Directory
scCustomize also contains function Read_CellBender_h5_Multi_Directory()
which can be used to read CellBender outputs in multiple sub-directories similar to Read10X_h5_Multi_Directory
using the same type of parameters as Read_CellBender_h5_Multi_File
Rather than creating and merging Seurat objects it can sometimes be advantageous to simply combine the sparse matrices before creating Seurat object.
GEO_Single <- list(mat1, mat2, mat3) GEO_Merged <- Merge_Sparse_Data_All(matrix_list = GEO_Single)
Merge_Sparse_Data_All
function through scCustomize.
If you have multimodal data (each entry in list contains sub-list with matrices) then you can use Merge_Sparse_Multimodal_All()
. This function will return a list with each entry representing a merged matrix for single modality.
GEO_Merged_Multimodal <- Merge_Sparse_Multimodal_All(matrix_list = GEO_Multimodal)
Merge_Sparse_Data_All
contains a number of optional parameters to control modification to the cell barcodes.
NOTE: If any of the barcodes in the input matrix list overlap and no prefixes/suffixes are provided the function will error.
add_cell_ids
to ensure barcodes are unique (and make the import to Seurat smoother with samples already labeled). prefix = FALSE
. cell_id_delimiter
parameter. To easily merge many Seurat objects contained in a list scCustomize contains simple function.
# Merge a list of compatible Seurat objects of any length and add cell prefixes if desired Seurat_Merged <- Merge_Seurat_List(list_seurat = list_of_objects, add.cell.ids = (c("cell", "prefixes", "to", "add")))
Create_10X_H5
provides convenient wrapper around write10xCounts()
from DropletUtils package. Output can then be easily read in using Seurat::Read10X_h5()
or LIGER's createLiger()
(which assumes H5 file is formatted as if from Cell Ranger).
# Provide file path and specify type of files as either cell ranger triplicate files, matrix, or data.frame Create_10X_H5(raw_data_file_path = "/path/matrix.mtx", source_type = "Matrix", save_file_path = "/path/", save_name = "name")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.