library(BiocStyle) knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
This RNA-seq dataset was downloaded from GSE107011.
Every sample represents the transcriptome of a specific cell type; this data is therefore well suited to be used as a general training data set for the typical r Biocpkg("SingleR")
analysis.
Expression values were already TPM normalized, so additional processing was only performed to remove 'PBMC' samples, remove genes with no reads across samples, collapse duplicate genes, and log~2~-normalize values.
Main and fine labels were manually assigned to each sample based on cell type as specified in the GEO repository.
First, we'll download the TPM normalized values from GEO.
library(BiocFileCache) url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE107nnn/GSE107011/suppl/GSE107011_Processed_data_TPM.txt.gz" bfc <- BiocFileCache(ask=FALSE) ref <- bfcrpath(bfc, url) mat <- as.matrix(read.table(ref, sep = "\t", check.names=FALSE, header = TRUE, row.names = 1)) dim(mat)
We don't want the PBMC
samples, as they aren't purified cell types and don't make for good reference data.
pbmc <- grep("PBMC", colnames(mat)) mat <- mat[, -pbmc] dim(mat)
Genes with no reads in any samples don't provide any value for our purposes, so we will remove those as well.
mat <- mat[rowSums(mat) != 0, ] dim(mat)
This data has Ensembl gene IDs, but we need gene symbols.
library(AnnotationHub) edb <- AnnotationHub()[["AH73881"]] symbols <- mapIds(edb, keytype="GENEID", column="SYMBOL", keys=sub("\\.[0-9]+$", "", rownames(mat)))
Many Ensembl IDs don't have official gene symbols, so we will remove those as well. Then we can assign the gene symbols in place of the Ensembl IDs.
discard <- is.na(symbols) mat <- mat[!discard,] rownames(mat) <- symbols[!discard] dim(mat)
A small number of genes are duplicated, so those will be collapsed by keeping the instances with the highest median of reads across samples.
library(matrixStats) mat <- mat[order(rownames(mat), -rowMedians(mat)),] mat <- mat[!duplicated(rownames(mat)),] dim(mat)
Then we will log~2~-normalize after adding a pseudocount of 1.
logcounts <- log2(mat+1)
We can now apply human-readable labels to each sample. This requires some translation.
library(GEOquery) meta <- getGEO("GSE107011") fine <- pData(meta[[1]])[["cell type:ch1"]] fine <- fine[-pbmc] fine <- sub("cell$", "cells", fine) dictionary <- c( `Naive CD8 T cells`="CD8+ T cells", `Central memory CD8 T cells`="CD8+ T cells", `Effector memory CD8 T cells`="CD8+ T cells", `Terminal effector CD8 T cells`="CD8+ T cells", `MAIT cells`="T cells", `Vd2 gd T cells`="T cells", `Non-Vd2 gd T cells`="T cells", `Follicular helper T cells`="CD4+ T cells", `T regulatory cells`="CD4+ T cells", `Th1 cells`="CD4+ T cells", `Th1/Th17 cells`="CD4+ T cells", `Th17 cells`="CD4+ T cells", `Th2 cells`="CD4+ T cells", `Naive CD4 T cells`="CD4+ T cells", `Progenitor cells`="Progenitors", `Naive B cells`="B cells", `Non-switched memory B cells`="B cells", `Exhausted B cells`="B cells", `Switched memory B cells`="B cells", `Plasmablasts`="B cells", `Classical monocytes`="Monocytes", `Intermediate monocytes`="Monocytes", `Non classical monocytes`="Monocytes", `Natural killer cells`="NK cells", `Plasmacytoid dendritic cells`="Dendritic cells", `Myeloid dendritic cells`="Dendritic cells", `Low-density neutrophils`="Neutrophils", `Low-density basophils`="Basophils", `Terminal effector CD4 T cells`="CD4+ T cells" ) main <- dictionary[fine] stopifnot(all(!is.na(main))) # Cross-checking with the column names of the matrix. collabs <- sub("^[^_]+_", "", colnames(mat)) tab <- table(collabs, fine) stopifnot(all(rowSums(tab > 0)==1)) stopifnot(all(colSums(tab > 0)==1)) library(S4Vectors) coldata <- DataFrame(row.names = colnames(mat), label.main = main, label.fine = fine)
Now the counts and metadata can be saved for upload to r Biocpkg("ExperimentHub")
.
path <- file.path("celldex", "monaco_immune", "1.0.0") dir.create(path, showWarnings = FALSE, recursive = TRUE) saveRDS(logcounts, file = file.path(path, "logcounts.rds")) saveRDS(coldata, file = file.path(path, "coldata.rds"))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.