knitr::opts_chunk$set(cache = TRUE)

Introduction

TENxIO allows users to import 10X pipeline files into known Bioconductor classes. The package is not comprehensive, there are files that are not supported. It currently does not support Visium datasets. It does replace some functionality in DropletUtils. If you would like a file format to be supported. Please open an issue at https://github.com/waldronlab/TENxIO.

Supported Formats

| Extension | Class | Imported as | |---------------------|---------------|----------------------| | .h5 | TENxH5 | SingleCellExperiment w/ TENxMatrix | | .mtx / .mtx.gz | TENxMTX | SummarizedExperiment w/ dgCMatrix | | .tar.gz | TENxFileList | SingleCellExperiment w/ dgCMatrix | | peak_annotation.tsv | TENxPeaks | GRanges | | fragments.tsv.gz | TENxFragments | RaggedExperiment | | .tsv / .tsv.gz | TENxTSV | tibble | | spatial.tar.gz | TENxSpatialList | inter. DataFrame list |

Tested 10X Products

We have tested these functions with some datasets from 10x Genomics including those from:

Note. That extensive testing has not been performed and the codebase may require some adaptation to ensure compatibility with all pipeline outputs.

Bioconductor implementations

We are aware of existing functionality in both DropletUtils and SpatialExperiment. We are working with the authors of those packages to cover the use cases in both those packages and possibly port I/O functionality into TENxIO. We are using long tests and the DropletTestFiles package to cover example datasets on ExperimentHub, if you would like to know more, see the longtests directory on GitHub.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("waldronlab/TENxIO")

Load the package

library(TENxIO)

Description

TENxIO offers an set of classes that allow users to easily work with files typically obtained from the 10X Genomics website. Generally, these are outputs from the Cell Ranger pipeline.

Procedure

Loading the data into a Bioconductor class is a two step process. First, the file must be identified by either the user or the TENxFile function. The appropriate function will be evoked to provide a TENxIO class representation, e.g., TENxH5 for HDF5 files with an .h5 extension. Secondly, the import method for that particular file class will render a common Bioconductor class representation for the user. The main representations used by the package are SingleCellExperiment, SummarizedExperiment, GRanges, and RaggedExperiment.

Dataset versioning

The versioning schema in the package mostly applies to HDF5 resources and is loosely based on versions of 10X datasets. For the most part, version 3 datasets usually contain ranged information at specific locations in the data file. Version 2 datasets will usually contain a genes.tsv file, rather than features.tsv as in version 3. If the file version is unknown, the software will attempt to derive the version from the data where possible.

File classes

TENxFile

The TENxFile class is the catch-all class superclass that allows transition to subclasses pertinent to specific files. It inherits from the BiocFile class and allows for easy dispatching import methods.

showClass("TENxFile")

ExperimentHub resources

TENxFile can handle resources from ExperimentHub with careful inputs. For example, one can import a TENxBrainData dataset via the appropriate ExperimentHub identifier (EH1039):

hub <- ExperimentHub::ExperimentHub()
hub["EH1039"]

Currently, ExperimentHub resources do not have an extension and it is best to provide that to the TENxFile constructor function.

fname <- hub[["EH1039"]]
TENxFile(fname, extension = "h5", group = "mm10", version = "2")

Note. EH1039 is a large ~ 4GB file and files without extension as those obtained from ExperimentHub will emit a warning so that the user is aware that the import operation may fail, esp. if the internal structure of the file is modified.

TENxH5

TENxIO mainly supports version 3 and 2 type of H5 files. These are files with specific groups and names as seen in h5.version.map, an internal data.frame map that guides the import operations.

TENxIO:::h5.version.map

In the case that, there is a file without genomic coordinate information, the constructor function can take an NA_character_ input for the ranges argument.

The TENxH5 constructor function can be used on either version of these H5 files. In this example, we use a subset of the PBMC granulocyte H5 file obtained from the 10X website.

h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
library(rhdf5)
h5ls(h5f)

Note. The h5ls function gives an overview of the structure of the file. It matches version 3 in our version map.

The show method gives an overview of the data components in the file:

con <- TENxH5(h5f)
con

import TENxH5 method

We can simply use the import method to convert the file representation to a Bioconductor class representation, typically a SingleCellExperiment.

import(con)

Note. Although the main representation in the package is SingleCellExperiment, there could be a need for alternative data class representations of the data. The projection field in the TENxH5 show method is an initial attempt to allow alternative representations.

TENxMTX

Matrix Market formats are also supported (.mtx extension). These are typically imported as SummarizedExperiment as they usually contain count data.

mtxf <- system.file(
    "extdata", "pbmc_3k_ff_bc_ex.mtx",
    package = "TENxIO", mustWork = TRUE
)
con <- TENxMTX(mtxf)
con

import MTX method

The import method yields a SummarizedExperiment without colnames or rownames.

import(con)

TENxFileList

Generally, the 10X website will provide tarballs (with a .tar.gz extension) which can be imported with the TENxFileList class. The tarball can contain components of a gene expression experiment including the matrix data, row data (aka 'features') expressed as Ensembl identifiers, gene symbols, etc. and barcode information for the columns.

The TENxFileList class allows importing multiple files within a tar.gz archive. The untar function with the list = TRUE argument shows all the file names in the tarball.

fl <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ff_bc_ex_matrix.tar.gz",
    package = "TENxIO", mustWork = TRUE
)
untar(fl, list = TRUE)

We then use the import method across all file types to obtain an integrated Bioconductor representation that is ready for analysis. Files in TENxFileList can be represented as a SingleCellExperiment with row names and column names.

con <- TENxFileList(fl)
import(con)

TENxPeaks

Peak files can be handled with the TENxPeaks class. These files are usually named *peak_annotation files with a .tsv extension. Peak files are represented as GRanges.

pfl <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ex_atac_peak_annotation.tsv",
    package = "TENxIO", mustWork = TRUE
)
tenxp <- TENxPeaks(pfl)
peak_anno <- import(tenxp)
peak_anno

TENxFragments

Fragment files are quite large and we make use of the Rsamtools package to import them with the yieldSize parameter. By default, we use a yieldSize of 200.

fr <- system.file(
    "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
    package = "TENxIO", mustWork = TRUE
)

Internally, we use the TabixFile constructor function to work with indexed tsv.gz files.

Note. A warning is emitted whenever a yieldSize parameter is not set.

tfr <- TENxFragments(fr)
tfr

Because there may be a variable number of fragments per barcode, we use a RaggedExperiment representation for this file type.

fra <- import(tfr)
fra

Similar operations to those used with SummarizedExperiment are supported. For example, the genomic ranges can be displayed via rowRanges:

rowRanges(fra)

TENxVisium

The TENxVisium class is used to import 10X Visium data. The TENxVisium constructor function takes the following arguments:

TENxVisium(
    resource = "path/to/10x/visium/file.tar.gz",
    spatialResource = "path/to/10x/visium/spatial/file.spatial.tar.gz",
    sample_id = "sample01",
    images = c("lowres", "hires", "detected", "aligned"),
    jsonFile = "scalefactors_json.json",
    tissuePattern = "tissue_positions.*\\.csv",
    spatialCoordsNames = c("pxl_col_in_fullres", "pxl_row_in_fullres")
)

The resource argument is the path to the 10X Visium file. The spatialResource argument is the path to the 10X Visium spatial file. It usually ends in spatial.tar.gz.

Session Information

sessionInfo()


LiNk-NY/TENxIO documentation built on May 3, 2024, 11:08 p.m.