library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the data

We obtain a MARS-seq single-cell RNA sequencing dataset of human bone marrow plasma cells and circulating plasma cells from Ledergor et al. (2018). Counts for endogenous genes are available from the Gene Expression Omnibus using the accession number GSE117156. We download and cache them using the r Biocpkg("BiocFileCache") package.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
mat.path <- bfcrpath(bfc, "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE117156&format=file")

We read this into memory as a sparse matrix.

Each matrix corresponds to a 384-well plate from the MARS-seq sequencing. Rows correspond to the features (identical across all matrices), columns correspond to the wells. To obtain the full count matrix, we cbind the matrices together.

tmp <- tempfile()
untar(mat.path, exdir=tmp)
all.files <- list.files(tmp)

library(BiocParallel)
library(scuttle)
all.counts <- bplapply(file.path(tmp, all.files), readSparseCounts)
names(all.counts) <- sub(".txt.gz$", "", all.files)
do.call(rbind, lapply(all.counts, dim))

counts <- do.call(cbind, all.counts)
dim(counts)

Creating a SingleCellExperiment object:

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts))

Adding the metadata

We pull the metadata file from the GEO and load it in:

meta.path <- bfcrpath(bfc, "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE117156&format=file&file=GSE117156%5Fmetadata%2Etxt%2Egz")
meta <- read.delim(meta.path, row.names=1)
meta <- DataFrame(meta, check.names=FALSE)
meta

Removing this extraneous column:

summary(meta$X)
meta$X <- NULL

We check that the cell names match up with the matrix.

m <- match(colnames(counts), rownames(meta))
stopifnot(all(!is.na(m)))
meta <- meta[m,]

The Experiment_ID column encodes information on the subject, tissue and treatment status. We'll extract this into separate columns for easier access.

splt <- strsplit(meta$Experiment_ID, split = "_")

meta$Subject_ID <- vapply(splt, `[[`, character(1), 2)
meta$Condition <- sub("\\d+$", "", meta$Subject_ID)
meta$Condition[which(meta$Condition == "hip")] <- "Control"

## Treated IDs have 4 components, with `postRx` between subject ID and tissue
meta$Tissue <- vapply(splt, function(x) x[[length(x)]], character(1))
meta$Tissue <- sub("#\\d$", "", meta$Tissue)

meta$Treated <- grepl("postRx", meta$Experiment_ID)

meta
colData(sce) <- meta
sce

Saving to file

We polish the dataset to remove redundant names and convert assay types to more appropriate formats:

library(scRNAseq)
sce <- polishDataset(sce)
sce

We then save all of the relevant components to disk for upload:

meta <- list(
    title="Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma",
    description="Multiple myeloma, a plasma cell malignancy, is the second most common blood cancer. Despite extensive research, disease heterogeneity is poorly characterized, hampering efforts for early diagnosis and improved treatments. Here, we apply single cell RNA sequencing to study the heterogeneity of 40 individuals along the multiple myeloma progression spectrum, including 11 healthy controls, demonstrating high interindividual variability that can be explained by expression of known multiple myeloma drivers and additional putative factors. We identify extensive subclonal structures for 10 of 29 individuals with multiple myeloma. In asymptomatic individuals with early disease and in those with minimal residual disease post-treatment, we detect rare tumor plasma cells with molecular characteristics similar to those of active myeloma, with possible implications for personalized therapies. Single cell analysis of rare circulating tumor cells allows for accurate liquid biopsy and detection of malignant plasma cells, which reflect bone marrow disease. Our work establishes single cell RNA sequencing for dissecting blood malignancies and devising detailed molecular characterization of tumor cells in symptomatic and asymptomatic patients.",
    taxonomy_id="9606",
    genome="GRCh38",
    sources=list(
        list(provider="GEO", id="GSE117156"),
        list(provider="PubMed", id="30523328")
    ),
    maintainer_name="Milan Malfait",
    maintainer_email="milan.malfait94@gmail.com"
)

saveDataset(sce, "2023-12-20_output", meta)

Session information {-}

sessionInfo()


LTLA/scRNAseq documentation built on April 29, 2024, 12:34 p.m.