BPSATACData: E8.25 snATAC-seq data


BPSATACData R Documentation

E8.25 snATAC-seq data

Description

Obtain the processed or raw counts for the Pijuan-Sala et al. (2020) E8.25 single-nucleus ATAC-seq dataset.

Usage

BPSATACData(type = c("processed", "raw"), Csparse.assays = TRUE)

Arguments

type

String specifying the type of data to obtain, see Details. Default behaviour is to return processed data.

Csparse.assays

Logical indicating whether to convert assay matrices into the column-major sparse format, which performs better with contemporary software packages. Default behaviour is to perform the conversion.

Details

This function downloads the E8.25 single-nucleus ATAC-seq dataset from Pijuan-Sala et al. (2020). The dataset is provided as a single sample.

In the processed data, QC-passing libraries have already been identified in each sample. The count matrix contains the count for each identified peak in each cell. Note that you may want to binarise this matrix for downstream analyses. Full details of the methods used in the analyses can be found in the paper (see References, below).
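As a minimal sketch of the binarisation mentioned above, the following uses a small toy sparse matrix rather than the real data. The matrix values and the object names (counts, binarised) are illustrative assumptions, not part of the package.

```r
library(Matrix)

# Toy peak-by-cell count matrix (hypothetical values: 4 peaks x 3 cells).
counts <- sparseMatrix(
    i = c(1, 2, 2, 4),
    j = c(1, 1, 2, 3),
    x = c(3, 1, 2, 5),
    dims = c(4, 3)
)

# Binarise: any non-zero count becomes 1, zeros stay 0.
# Multiplying the logical sparse matrix by 1 returns a numeric sparse matrix.
binarised <- (counts > 0) * 1

binarised
```

The same comparison could be applied to the assay matrix extracted from the returned object (for example with assay()); which assay to use is left to the reader, as this page does not name it.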

The column metadata for cells contains:

sample:

Integer, sample index (for consistency across MGD datasets).

stage:

Character, collection timepoint (for consistency across MGD datasets).

barcode:

Character, unique cell identifier.

nuclei_type:

Character, whether cells were selected using flow gates. Note that these are probably not doublets, but cells in different cell cycle phases.

num_of_reads:

Integer, number of reads.

promoter_coverage:

Numeric, fraction of promoters accessible "in the majority of datasets based on ENCODE DNase Hypersensitive Sites and ATAC-seq data".

read_in_promoter:

Integer, number of reads in promoters.

doublet_scores:

Numeric, doublet scores (calculated with scrublet v0.4).

read_in_peak:

Integer, number of reads in across-cell-calculated peaks.

ratio_peaks:

Numeric, fraction of reads in across-cell peaks.

final_clusters:

Integer, final cluster indices.

celltype:

Character, celltype label.

al_haem_endo_clusters:

Character, clusters from the focused analysis of the blood, allantois and endothelium celltypes (or NA, for other celltypes).

Reduced dimension representations of the data are also available in the reducedDims slot of the SingleCellExperiment object. These are topics and umap. Please see the methods of the manuscript (see References, below) for more details on the topic modelling approach.
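A minimal sketch of how reduced dimension representations are read from a SingleCellExperiment is given below. It builds a small toy object so that it runs without downloading the dataset. Only the names topics and umap come from the text above; the toy values and dimensions are assumptions.

```r
library(SingleCellExperiment)

# Toy object standing in for the real dataset (all values hypothetical).
set.seed(1)
n_cells <- 5
toy <- SingleCellExperiment(
    assays = list(counts = matrix(rpois(4 * n_cells, 1), nrow = 4)),
    reducedDims = list(
        topics = matrix(runif(n_cells * 3), nrow = n_cells),
        umap = matrix(rnorm(n_cells * 2), nrow = n_cells)
    )
)

# List the stored representations, then extract one of them.
reducedDimNames(toy)
umap <- reducedDim(toy, "umap")  # one row per cell
dim(umap)
```

With the real object, the same two accessors would be called on the value returned by BPSATACData().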

For both raw and processed data, the row metadata is relatively complex. It contains:

peakID:

Character, unique peak identifier.

peak_chr:

Character, chromosome ID for each peak.

peak_start:

Integer, start position for each peak. This probably derives from a BED file, in which case it is 0-indexed and the peak includes this position.

peak_end:

Integer, end position for each peak. This probably derives from a BED file, in which case it is 0-indexed and the peak excludes this position.

Annotation.General:

Character, general peak annotation (TSS (-1kb to +100bp), TTS (-100bp to +1kb), intron, exon, intergenic).

distance_from_TSS:

Integer, distance from the TSS that the peak has been annotated to, if the region is intergenic. Note: the authors have annotated peaks to multiple genes; distances for different genes are comma-separated in this column.

geneName:

Character, gene name (MGI). Note: the authors have annotated peaks to multiple genes; names for different genes are comma-separated in this column.

geneID:

Character, gene ID (Ensembl gene ID, v92). Note: the authors have annotated peaks to multiple genes; IDs for different genes are comma-separated in this column.

strand:

Character, strand for linked genes. Note: the authors have annotated peaks to multiple genes; strands for different genes are comma-separated in this column.

celltype_specificity:

Character, celltype specificity of the peak. Where a peak is specific to multiple celltypes, the celltype names are semicolon-separated.

topic:

Character, topic membership of the peak. Where a peak belongs to multiple topics, the topic names are semicolon-separated.

topic_stringent:

Character, topic membership of the peak if it contributes to only a single topic; else "Nonspecific".

accessibility:

Integer, number of nuclei where the peak is accessible.

accessibility_log:

Numeric, log-transformed number of nuclei where the peak is accessible (natural log of the count plus 1).

accessibility_ratio:

Numeric, fraction of nuclei where peak is accessible.

umap_X:

Numeric, umap x-coordinate of peak.

umap_Y:

Numeric, umap y-coordinate of peak.

Pattern_endothelium:

Integer, index for dynamic pattern during endothelial establishment (else NA).
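Several of the row metadata columns above need a little handling before use. The sketch below works on a toy data frame that mimics a few of the documented columns. All values, gene names and topic labels are hypothetical. It assumes the BED-style coordinate convention described for peak_start and peak_end, which the documentation itself marks as uncertain.

```r
# Toy row metadata mimicking the documented columns (all values hypothetical).
peaks <- data.frame(
    peakID = c("peak1", "peak2"),
    peak_chr = c("chr1", "chr2"),
    peak_start = c(100L, 5000L),  # assumed 0-based, inclusive
    peak_end = c(600L, 5400L),    # assumed 0-based, exclusive
    geneName = c("GeneA,GeneB", "GeneC"),
    topic = c("topic_1;topic_7", "topic_3"),
    accessibility = c(10L, 0L),
    stringsAsFactors = FALSE
)

# Convert to 1-based, closed coordinates: only the start shifts by one.
peaks$start_1based <- peaks$peak_start + 1L
peaks$end_1based <- peaks$peak_end
peaks$width <- peaks$peak_end - peaks$peak_start

# Split multi-valued annotations into one character vector per peak.
genes <- strsplit(peaks$geneName, ",", fixed = TRUE)
topics <- strsplit(peaks$topic, ";", fixed = TRUE)

# Recompute the log-transformed accessibility: natural log of the count plus 1.
peaks$accessibility_log <- log1p(peaks$accessibility)
```

The 1-based coordinates are the form expected by tools that use closed intervals; if the source coordinates turn out not to be BED-style, the start shift should be dropped.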

Value

If type="processed", a SingleCellExperiment is returned containing the processed data.

If type="raw", a SingleCellExperiment is returned containing the raw data.

Author(s)

Aaron Lun, with modification by Jonathan Griffiths

References

Pijuan-Sala B et al. (2020). Single-cell chromatin accessibility maps reveal regulatory programs driving early mouse organogenesis. Nature Cell Biology 22, 4:487–97.

Examples

## Not run: 
# dataset large enough to cause bioc build issues 
atac.data <- BPSATACData()
atac.data <- BPSATACData(type="processed")

## End(Not run)


MarioniLab/MouseGastrulationData documentation built on Jan. 31, 2024, 11:01 a.m.