```r
library(knitr)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
# set dpi
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 60
)
```
`scfetch` is designed to accelerate downloading and preparing single-cell datasets from public resources. It can be used to:

- Download fastq files from GEO/SRA, and format the fastq files to the standard style that can be identified by 10x software (e.g. CellRanger).
- Download bam files from GEO/SRA, with support for original 10x-generated bam files (with custom tags) as well as normal bam files, and convert bam files to fastq files.
- Download count matrices and annotations from GEO, PanglaoDB and UCSC Cell Browser, and load the downloaded matrices into Seurat.
- Download processed objects from Zenodo and CELLxGENE.
- Convert between widely used single-cell objects and formats (`SeuratObject`, `AnnData`, `SingleCellExperiment`, `CellDataSet`/`cell_data_set` and `loom`).

`scfetch` is an R package distributed on CRAN.

To install the package, start R and enter:
```r
# install via CRAN (v0.5.0) # old version, it's better to install via Github
install.packages("scfetch")
# if you install from CRAN, you should install the following packages
# install.packages("devtools") # in case you have not installed it
devtools::install_github("alexvpickering/GEOfastq") # download fastq
devtools::install_github("cellgeni/sceasy") # format conversion
devtools::install_github("mojaveazure/seurat-disk") # format conversion
devtools::install_github("satijalab/seurat-wrappers") # format conversion

# install via Github (v0.5.0)
devtools::install_github("showteeth/scfetch")
```
For data structure conversion, `scfetch` requires several Python packages, which you can install with:
```{bash, eval=FALSE}
conda install -c bioconda loompy anndata
pip install anndata loompy
```

In general, it is **recommended** to install from the [Github repository](https://github.com/showteeth/scfetch) (updated more timely). Once `scfetch` is installed, it can be loaded with the following command.

```r
library("scfetch")
```
Since the downloading process is time-consuming, we only provide the commands used, to illustrate the usage.
For fastq files stored in SRA, `scfetch` can extract sample information and run numbers from a GEO accession number; alternatively, users can provide a dataframe containing the run numbers of samples of interest.

Extract all samples under GSE130636 on platform GPL20301 (use `platform = NULL` for all platforms):
```r
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
```
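As noted above, `ExtractRun` is optional: users can also supply their own dataframe of runs of interest. A minimal sketch, assuming the same `gsm`/`run` column names that `ExtractRun` returns (the GSM accessions below are hypothetical placeholders):

```r
# minimal sketch: build the gsm/run dataframe by hand instead of calling
# ExtractRun; column names are assumed to match ExtractRun's output
gsm.df <- data.frame(
  gsm = c("GSM0000001", "GSM0000002"), # hypothetical GSM accessions
  run = c("SRR9004346", "SRR9004351")  # runs of interest
)
```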
With a dataframe containing the GSM and run numbers, `scfetch` will download the sra files using `prefetch`. The returned result is a dataframe containing the failed runs. If it is not `NULL`, users can re-run `DownloadSRA` by setting `gsm.df` to the returned result.
```r
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE130636.down <- DownloadSRA(
  gsm.df = GSE130636.runs,
  out.folder = out.folder
)
# GSE130636.down is NULL or a dataframe containing failed runs
```
The `out.folder` structure will be: `gsm_number/run_number`.
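The retry pattern described above (re-running the download with the returned failed runs) can be wrapped in a small helper. A sketch under stated assumptions: `retry_failed` and `max.tries` are our own illustrative names, not part of scfetch; it only assumes that the download function takes a `gsm.df` argument and returns `NULL` (or an empty dataframe) on full success:

```r
# illustrative helper (not part of scfetch): keep re-running a download
# function on its returned failed runs; NULL means everything succeeded
retry_failed <- function(download_fun, gsm.df, ..., max.tries = 3) {
  failed <- gsm.df
  for (i in seq_len(max.tries)) {
    failed <- download_fun(gsm.df = failed, ...)
    if (is.null(failed) || nrow(failed) == 0) return(NULL)
  }
  failed # runs still failing after max.tries attempts
}
# usage (sketch): retry_failed(DownloadSRA, GSE130636.runs, out.folder = out.folder)
```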
After obtaining the sra files, `scfetch` provides the function `SplitSRA` to split sra files into fastq files using `parallel-fastq-dump` (parallel, fastest, gzipped output), `fasterq-dump` (parallel, fast, but unzipped output) or `fastq-dump` (slowest, gzipped output).
For fastqs generated with 10x Genomics, `SplitSRA` can identify read1, read2 and index files and rename read1 and read2 to the 10x-required format (`sample1_S1_L001_R1_001.fastq.gz` and `sample1_S1_L001_R2_001.fastq.gz`). In detail, a file with read length 26 or 28 is considered read1, files with read length 8 or 10 are considered index files, and the remaining file is considered read2. These read length rules are from the Sequencing Requirements for Single Cell 3' and Sequencing Requirements for Single Cell V(D)J.
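The read-length rules above can be sketched as a small classifier. This is an illustration of the stated rules, not `SplitSRA`'s actual implementation; `classify_10x_read` is a hypothetical name, and the comments on read content reflect general 10x conventions:

```r
# sketch of the read-length rules described above (not SplitSRA's code):
# classify a fastq file by the length of its reads
classify_10x_read <- function(read.length) {
  if (read.length %in% c(26, 28)) {
    "read1" # typically cell barcode + UMI
  } else if (read.length %in% c(8, 10)) {
    "index" # sample index
  } else {
    "read2" # cDNA read
  }
}
classify_10x_read(28) # "read1"
classify_10x_read(8)  # "index"
classify_10x_read(90) # "read2"
```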
The returned result is a vector of failed sra files. If it is not `NULL`, users can re-run `SplitSRA` by setting `sra.path` to the returned result.
```r
# parallel-fastq-dump requires sratools.path
# you may need to set split.cmd.path and sratools.path
sra.folder <- tempdir()
GSE130636.split <- SplitSRA(
  sra.folder = sra.folder,
  fastq.type = "10x", split.cmd.threads = 4
)
```
`scfetch` can extract sample information and run numbers from a GEO accession number; alternatively, users can provide a dataframe containing the run numbers of samples of interest.
```r
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
```
With a dataframe containing the GSM and run numbers, `scfetch` provides `DownloadBam` to download bam files using `prefetch`. It supports both 10x-generated bam files and normal bam files.
10x-generated bam files contain custom tags; when downloading them with `prefetch`, `scfetch` adds `--type TenX` to make sure the downloaded bam files contain these tags. For normal bam files, `DownloadBam` will download the sra files first and then convert them to bam files with `sam-dump`. After testing the efficiency of `prefetch` + `sam-dump` against `sam-dump` alone, the former is much faster than the latter (52G sra and 72G bam files).
The returned result is a dataframe containing failed runs (for normal bam files, runs that either failed to download sra files or failed to convert them to bam files; for 10x-generated bam files, runs that failed to download bam files). If it is not `NULL`, users can re-run `DownloadBam` by setting `gsm.df` to the returned result. The following is an example of downloading a 10x-generated bam file:

```r
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE138266.down <- DownloadBam(
  gsm.df = GSE138266.runs,
  out.folder = out.folder
)
# GSE138266.down is NULL or a dataframe containing failed runs
```
The `out.folder` structure will be: `gsm_number/run_number`.
With the downloaded bam files, `scfetch` provides the function `Bam2Fastq` to convert bam files to fastq files. For bam files generated by 10x software, `Bam2Fastq` utilizes the `bamtofastq` tool developed by 10x Genomics.
The returned result is a vector of bam files that failed to convert to fastq files. If it is not `NULL`, users can re-run `Bam2Fastq` by setting `bam.path` to the returned result.
```r
bam.folder <- tempdir()
# you may need to set bamtofastq.path and bamtofastq.paras
GSE138266.convert <- Bam2Fastq(
  bam.folder = bam.folder
)
```
`scfetch` provides functions for users to download count matrices and annotations (e.g. cell type annotations and composition) from GEO and some single-cell databases (e.g. PanglaoDB and UCSC Cell Browser). `scfetch` also supports loading the downloaded data into Seurat.
The public resources supported so far, and the returned results:
| Resources         | URL                               | Download Type | Returned results                              |
|-------------------|-----------------------------------|---------------|-----------------------------------------------|
| GEO               | https://www.ncbi.nlm.nih.gov/geo/ | count matrix  | SeuratObject or count matrix for bulk RNA-seq |
| PanglaoDB         | https://panglaodb.se/index.html   | count matrix  | SeuratObject                                  |
| UCSC Cell Browser | https://cells.ucsc.edu/           | count matrix  | SeuratObject                                  |
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a very convenient way for users to explore and select scRNA-seq datasets of interest.
`scfetch` provides `ExtractGEOMeta` to extract sample metadata, including sample title, source name/tissue, description, cell type, treatment, paper title, paper abstract, organism, protocol, data processing methods, et al.
```r
# extract metadata of specified platform
GSE200257.meta <- ExtractGEOMeta(acce = "GSE200257", platform = "GPL24676")
# set VROOM_CONNECTION_SIZE to avoid error:
#   Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata of all platforms
GSE94820.meta <- ExtractGEOMeta(acce = "GSE94820", platform = NULL)
```
After manually checking the extracted metadata, users can download the count matrix and load it into Seurat with `ParseGEO`.
For the count matrix, `ParseGEO` supports downloading the matrix from supplementary files or extracting it from an `ExpressionSet`; users can control the source by specifying `down.supp`, or let `ParseGEO` detect it automatically (`ParseGEO` will extract the count matrix from the `ExpressionSet` first; if that matrix is NULL or contains non-integer values, `ParseGEO` will download the supplementary files instead). Since the supplementary files have two main types, a single count matrix file containing all cells or CellRanger-style outputs (barcode, matrix, feature/gene), users are required to choose the type of supplementary files with `supp.type`.
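The automatic detection rule described above can be sketched as a small predicate. This is an illustration of the stated rule, not `ParseGEO`'s internal code, and `use_expressionset_matrix` is a hypothetical name:

```r
# illustrative sketch of the automatic source detection: use the
# ExpressionSet matrix only if it exists and contains integer counts;
# otherwise fall back to downloading supplementary files
use_expressionset_matrix <- function(mat) {
  !is.null(mat) && all(mat == round(mat))
}
use_expressionset_matrix(matrix(0:3, nrow = 2))         # TRUE  -> use ExpressionSet
use_expressionset_matrix(matrix(c(0.5, 1.2), nrow = 1)) # FALSE -> download supplementary files
use_expressionset_matrix(NULL)                          # FALSE -> download supplementary files
```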
With the count matrix, `ParseGEO` will load the matrix into Seurat automatically. If multiple samples are available, users can choose whether to merge the SeuratObjects with `merge`.
```r
# for cellranger output
out.folder <- tempdir()
GSE200257.seu <- ParseGEO(
  acce = "GSE200257", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "10x",
  out.folder = out.folder
)
# for count matrix, no need to specify out.folder, download count matrix to tmp folder
GSE94820.seu <- ParseGEO(acce = "GSE94820", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "count")
```
For bulk RNA-seq, set `data.type = "bulk"` in `ParseGEO`; this will return a count matrix.
PanglaoDB is a database containing scRNA-seq datasets from mouse and human. Up to now, it contains 5,586,348 cells from 1368 datasets (1063 from Mus musculus and 305 from Homo sapiens). It has well-organized metadata for every dataset, including tissue, protocol, species, number of cells and cell type annotation (computationally identified). Daniel Osorio has developed rPanglaoDB to access PanglaoDB data; the corresponding functions of `scfetch` are based on rPanglaoDB.
Since PanglaoDB is no longer maintained, `scfetch` has cached all metadata and cell type compositions and uses these cached data by default to accelerate access; users can access the cached data with `PanglaoDBMeta` (all metadata) and `PanglaoDBComposition` (all cell type compositions).
`scfetch` provides `StatDBAttribute` to summarize attributes of PanglaoDB:
```r
# use cached metadata
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")
```
`scfetch` provides `ExtractPanglaoDBMeta` to select datasets of interest by species, protocol, tissue and cell number (the available values of these attributes can be obtained with `StatDBAttribute`). Users can also choose whether to add cell type annotations to every dataset with `show.cell.type`.
`scfetch` uses the cached metadata and cell type compositions by default; users can change this by setting `local.data = FALSE`.
```r
hsa.meta <- ExtractPanglaoDBMeta(
  species = "Homo sapiens",
  protocol = c("Smart-seq2", "10x chromium"),
  show.cell.type = TRUE, cell.num = c(1000, 2000)
)
```
`scfetch` provides `ExtractPanglaoDBComposition` to extract cell type annotations and compositions (cached data are used by default to accelerate access; users can change this by setting `local.data = FALSE`).
```r
hsa.composition <- ExtractPanglaoDBComposition(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"))
```
After manually checking the extracted metadata, `scfetch` provides `ParsePanglaoDB` to download the count matrix and load it into Seurat. With the available cell type annotations, users can filter out datasets lacking a specified cell type with `cell.type`. Users can also include/exclude cells expressing specified genes with `include.gene`/`exclude.gene`.
With the count matrix, `ParsePanglaoDB` will load the matrix into Seurat automatically. If multiple datasets are available, users can choose whether to merge the SeuratObjects with `merge`.
```r
# small test
hsa.seu <- ParsePanglaoDB(hsa.meta[1:3, ], merge = TRUE)
```
The UCSC Cell Browser is a web-based tool that allows scientists to interactively visualize scRNA-seq datasets. It contains 1040 single-cell datasets from 17 different species, organized in a hierarchical structure, which helps users quickly locate the datasets they are interested in.
`scfetch` provides `ShowCBDatasets` to show all available datasets. Due to the large number of datasets, `ShowCBDatasets` enables users to perform a lazy load of the dataset json files instead of fetching the json files online every time (time-consuming!). This lazy load requires users to provide `json.folder` to save the json files and to set `lazy = TRUE` (on the first run, `ShowCBDatasets` downloads the current json files to `json.folder`; on subsequent runs with `lazy = TRUE`, `ShowCBDatasets` loads the downloaded json files from `json.folder`). `ShowCBDatasets` also supports updating the local datasets with `update = TRUE`.
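The lazy-load behaviour described above follows a common caching pattern; a minimal sketch, not `ShowCBDatasets`' actual internals (`load_datasets_lazy` and `fetch_json` are hypothetical names; `fetch_json` stands in for the online download):

```r
# illustrative cache pattern: fetch json files only on the first run
# (or when update = TRUE); afterwards, reuse the local copies
load_datasets_lazy <- function(json.folder, update = FALSE, fetch_json = NULL) {
  cached <- list.files(json.folder, pattern = "\\.json$", full.names = TRUE)
  if (update || length(cached) == 0) {
    if (!is.null(fetch_json)) fetch_json(json.folder) # download current json files
    cached <- list.files(json.folder, pattern = "\\.json$", full.names = TRUE)
  }
  cached # paths of the locally available json files
}
```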
```r
json.folder <- tempdir()
# first time run, the json files are stored under json.folder
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = TRUE)
# second time run, load the downloaded json files
ucsc.cb.samples <- ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = FALSE)
# always read online
# ucsc.cb.samples = ShowCBDatasets(lazy = FALSE)
```
The number of datasets and all available species:
```r
# the number of datasets
nrow(ucsc.cb.samples)
# available species
unique(unlist(sapply(unique(gsub(pattern = "\\|parent", replacement = "", x = ucsc.cb.samples$organisms)), function(x) {
  unlist(strsplit(x = x, split = ", "))
})))
```
`scfetch` provides `StatDBAttribute` to summarize attributes of the UCSC Cell Browser:
```r
StatDBAttribute(df = ucsc.cb.samples, filter = c("organism", "organ"), database = "UCSC")
```
`scfetch` provides `ExtractCBDatasets` to filter the metadata by collection, sub-collection, organ, disease status, organism, project and cell number (the available values of these attributes, except cell number, can be obtained with `StatDBAttribute`). All attributes except cell number support fuzzy matching with `fuzzy.match`, which is useful when selecting datasets.
```r
hbb.sample.df <- ExtractCBDatasets(all.samples.df = ucsc.cb.samples, organ = c("brain", "blood"), organism = "Human (H. sapiens)", cell.num = c(1000, 2000))
```
`scfetch` provides `ExtractCBComposition` to extract cell type annotations and compositions.
```r
hbb.sample.ct <- ExtractCBComposition(json.folder = json.folder, sample.df = hbb.sample.df)
```
After manually checking the extracted metadata, `scfetch` provides `ParseCBDatasets` to load the online count matrix into Seurat. All the attributes available in `ExtractCBDatasets` are also available here. Please note that `ParseCBDatasets` loads the online count matrix instead of downloading it locally. If multiple datasets are available, users can choose whether to merge the SeuratObjects with `merge`.
```r
hbb.sample.seu <- ParseCBDatasets(sample.df = hbb.sample.df)
```
`scfetch` provides functions for users to download processed single-cell RNA-seq data from Zenodo, CELLxGENE and the Human Cell Atlas, including `RDS`, `RData`, `h5ad`, `h5` and `loom` objects.
The public resources supported so far, and the returned results:
| Resources        | URL                               | Download Type                          | Returned results        |
|------------------|-----------------------------------|----------------------------------------|-------------------------|
| Zenodo           | https://zenodo.org/               | count matrix, rds, rdata, h5ad, et al. | NULL or failed datasets |
| CELLxGENE        | https://cellxgene.cziscience.com/ | rds, h5ad                              | NULL or failed datasets |
| Human Cell Atlas | https://www.humancellatlas.org/   | rds, rdata, h5, h5ad, loom             | NULL or failed projects |
Zenodo contains various types of processed objects, such as a `SeuratObject` that has been clustered and annotated, or an `AnnData` object containing processed results generated by `scanpy`.
`scfetch` provides `ExtractZenodoMeta` to extract dataset metadata, including dataset title, description, available files and corresponding md5 checksums. Please note that when a dataset has restricted access, the returned dataframe will be empty.
```r
# single doi
zebrafish.df <- ExtractZenodoMeta(doi = "10.5281/zenodo.7243603")
# vector dois
multi.dois <- ExtractZenodoMeta(doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"))
```
After manually checking the extracted metadata, users can download the specified objects with `ParseZenodo`. The downloaded objects are controlled by `file.ext`, and the provided object formats should be in lower case (e.g. rds/rdata/h5ad).
The returned result is a dataframe containing failed objects. If it is not `NULL`, users can re-run `ParseZenodo` by setting `doi.df` to the returned result.
```r
out.folder <- tempdir()
multi.dois.parse <- ParseZenodo(
  doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
  file.ext = c("rdata", "rds"), out.folder = out.folder
)
```
CELLxGENE is a web server containing 910 single-cell datasets; users can explore, download and upload their own datasets. The downloaded datasets provided by CELLxGENE have two formats: `h5ad (AnnData v0.8)` and `rds (Seurat v4)`.
`scfetch` provides `ShowCELLxGENEDatasets` to extract dataset metadata, including dataset title, description, contact, organism, ethnicity, sex, tissue, disease, assay, suspension type, cell type, et al.
```r
# all available datasets
all.cellxgene.datasets <- ShowCELLxGENEDatasets()
```
`scfetch` provides `StatDBAttribute` to summarize attributes of CELLxGENE:
```r
StatDBAttribute(df = all.cellxgene.datasets, filter = c("organism", "sex"), database = "CELLxGENE")
```
`scfetch` provides `ExtractCELLxGENEMeta` to filter the dataset metadata; the available values of the attributes, except cell number, can be obtained with `StatDBAttribute`:
```r
# human 10x v2 and v3 datasets
human.10x.cellxgene.meta <- ExtractCELLxGENEMeta(
  all.samples.df = all.cellxgene.datasets,
  assay = c("10x 3' v2", "10x 3' v3"), organism = "Homo sapiens"
)
```
After manually checking the extracted metadata, users can download the specified objects with `ParseCELLxGENE`. The downloaded objects are controlled by `file.ext` (choose from `"rds"` and `"h5ad"`).
The returned result is a dataframe containing failed datasets. If it is not `NULL`, users can re-run `ParseCELLxGENE` by setting `meta` to the returned result.
```r
out.folder <- tempdir()
ParseCELLxGENE(
  meta = human.10x.cellxgene.meta[1:5, ], file.ext = "rds",
  out.folder = out.folder
)
```
Many tools have been developed to process scRNA-seq data, such as Scanpy, Seurat, scran and Monocle. These tools have their own object formats, such as `AnnData` for `Scanpy`, `SeuratObject` for `Seurat`, `SingleCellExperiment` for `scran` and `CellDataSet`/`cell_data_set` for `Monocle2`/`Monocle3`. There are also file formats designed for large omics datasets, such as loom. A comprehensive scRNA-seq analysis usually combines multiple tools, which means objects need to be converted frequently. To facilitate analysis of scRNA-seq data, `scfetch` provides multiple functions to perform object conversion between these widely used tools and formats. The object conversion implemented in `scfetch` has two main advantages:
- Where possible, it uses the tools developed by the object maintainers: for example, `SeuratDisk` to convert a SeuratObject to loom, and `zellkonverter` to convert between `SingleCellExperiment` and `AnnData`. When there is no such tool, `sceasy` is used to perform the conversion.

```r
# library
library(Seurat) # pbmc_small
library(scRNAseq) # seger
```
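The backend used for each conversion can be summarized in one place. The named vector below is our own summary compiled from this vignette's subsections, not an object exported by `scfetch`:

```r
# summary (compiled from this vignette, not exported by scfetch) of which
# package performs each object conversion
conversion.backends <- c(
  "SeuratObject <-> SingleCellExperiment" = "Seurat",
  "SeuratObject <-> CellDataSet"          = "Seurat (requires monocle)",
  "SeuratObject -> cell_data_set"         = "SeuratWrappers (requires monocle3)",
  "SeuratObject <-> AnnData"              = "sceasy (via reticulate)",
  "SeuratObject <-> loom"                 = "SeuratDisk",
  "SingleCellExperiment <-> AnnData"      = "zellkonverter",
  "SingleCellExperiment <-> loom"         = "LoomExperiment"
)
conversion.backends[["SingleCellExperiment <-> AnnData"]] # "zellkonverter"
```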
The example `SeuratObject`:

```r
# object
pbmc_small
```
The example `SingleCellExperiment`:

```r
seger <- scRNAseq::SegerstolpePancreasData()
```
Here, we will convert the SeuratObject to `SingleCellExperiment`, `CellDataSet`/`cell_data_set`, `AnnData` and `loom`.
To `SingleCellExperiment` (the conversion is performed with functions implemented in `Seurat`):

```r
sce.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "SCE")
```
To `CellDataSet` (the conversion is performed with functions implemented in `Seurat`):

```r
# BiocManager::install("monocle") # requires monocle
cds.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", reduction = "tsne", to = "CellDataSet")
```
To `cell_data_set` (the conversion is performed with functions implemented in `SeuratWrappers`):

```r
# remotes::install_github('cole-trapnell-lab/monocle3') # requires monocle3
cds3.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "cell_data_set")
```
`AnnData` is a Python object, and `reticulate` is used to communicate between Python and R. Users should create a Python environment containing the `anndata` package and specify the environment path with `conda.path` to ensure that exactly this environment is used.
The conversion is performed with functions implemented in `sceasy`:

```r
# remove pbmc_small.h5ad first
anndata.file <- tempfile(pattern = "pbmc_small_", fileext = ".h5ad")
# you may need to set conda.path
ExportSeurat(
  seu.obj = pbmc_small, assay = "RNA", to = "AnnData",
  anndata.file = anndata.file
)
```
The conversion is performed with functions implemented in `SeuratDisk`:

```r
loom.file <- tempfile(pattern = "pbmc_small_", fileext = ".loom")
ExportSeurat(
  seu.obj = pbmc_small, assay = "RNA", to = "loom",
  loom.file = loom.file
)
```
From `SingleCellExperiment` to `SeuratObject` (the conversion is performed with functions implemented in `Seurat`):

```r
seu.obj.sce <- ImportSeurat(obj = sce.obj, from = "SCE", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
```
From `CellDataSet` to `SeuratObject` (the conversion is performed with functions implemented in `Seurat`):

```r
seu.obj.cds <- ImportSeurat(obj = cds.obj, from = "CellDataSet", count.assay = "counts", assay = "RNA")
```
From `cell_data_set` to `SeuratObject` (the conversion is performed with functions implemented in `Seurat`):

```r
seu.obj.cds3 <- ImportSeurat(obj = cds3.obj, from = "cell_data_set", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
```
`AnnData` is a Python object, and `reticulate` is used to communicate between Python and R. Users should create a Python environment containing the `anndata` package and specify the environment path with `conda.path` to ensure that exactly this environment is used.
The conversion is performed with functions implemented in `sceasy`:

```r
# you may need to set conda.path
seu.obj.h5ad <- ImportSeurat(
  anndata.file = anndata.file,
  from = "AnnData", assay = "RNA"
)
```
The conversion is performed with functions implemented in `SeuratDisk` and `Seurat`:

```r
# loom will lose reduction
seu.obj.loom <- ImportSeurat(loom.file = loom.file, from = "loom")
```
The conversion is performed with functions implemented in `zellkonverter`.

```r
# remove seger.h5ad first
seger.anndata.file <- tempfile(pattern = "seger_", fileext = ".h5ad")
SCEAnnData(
  from = "SingleCellExperiment", to = "AnnData", sce = seger, X_name = "counts",
  anndata.file = seger.anndata.file
)
```

```r
seger.anndata <- SCEAnnData(
  from = "AnnData", to = "SingleCellExperiment",
  anndata.file = seger.anndata.file
)
```
The conversion is performed with functions implemented in `LoomExperiment`.

```r
# remove seger.loom first
seger.loom.file <- tempfile(pattern = "seger_", fileext = ".loom")
SCELoom(
  from = "SingleCellExperiment", to = "loom", sce = seger,
  loom.file = seger.loom.file
)
```

```r
seger.loom <- SCELoom(
  from = "loom", to = "SingleCellExperiment",
  loom.file = seger.loom.file
)
```