title: “Semi-Supervised Elastic Net (ssenet)” author: “Amrit Singh” date: “10 March, 2020”
library(ssenet); ## devtools::install_github("singha53/ssenet")
include_graphics("inst/extdata/suppTable2.png")
selectedGenes <- c("fbll1", "itpr2", "vps13c", "tnfrsf1b", "sox2",
"hdx", "wrn", "sumf2", "vmn1r65", "rhob",
"mrgprb1", "calb1", "pld1", "laptm5", "tbr1",
"slc5a7", "abca9", "ankle1", "olr1",
"cecr2", "cpne5", "blzf1", "mertk",
"nell1", "npy2r", "cdc5l", "slco1c1",
"pax6", "cldn5", "cyp2j5", "mfge8",
"col5a1", "bmpr1b", "rrm2", "gja1",
"dcx", "spag6", "csf2rb2", "gda",
"arhgef26", "slc4a8", "gm805", "omg")
plot(coord, col = mixOmics::color.mixo(as.numeric(seqfishLabels$V3)), pch = 21,
xlab = "x-coordinates", ylab = "y-coordinates")
points(coord, col = mixOmics::color.mixo(as.numeric(seqfishLabels$V3)), pch = 19)
# set constants
M = 5;
iter = 5;
ncores = 5;
alpha = 1;
lambda_nfolds = 3;
family = "multinomial";
filter = "none";
max.iter = 50;
perc.full = 1;
thr.conf = 0.5;
## minimum number of samples required per cell-type class (required for hyperparameter tuning and cross-validation)
round(table(scrnaseqLabels$V1)/M/lambda_nfolds, 0) # remove Oligodendrocyte.2
##
## Astrocyte Endothelial Cell GABA-ergic Neuron
## 3 2 51
## Glutamatergic Neuron Microglia Oligodendrocyte.1
## 54 1 1
## Oligodendrocyte.2 Oligodendrocyte.3
## 0 2
keepIndices <- which(scrnaseqLabels$V1 != "Oligodendrocyte.2")
xscrnaseq <- scrnaseq[, keepIndices]
yscrnaseq <- droplevels(scrnaseqLabels$V1[keepIndices])
fitEnet <- enet(xtrain = t(xscrnaseq), ytrain = yscrnaseq, alpha = alpha, lambda = NULL, lambda_nfolds = lambda_nfolds, family = "multinomial", filter = filter)
cvEnet <- predict(object = fitEnet, M = M, iter = iter, ncores = ncores)
cvEnet$perf
## # A tibble: 9 x 3
## ErrName Mean SD
## <chr> <dbl> <dbl>
## 1 Astrocyte 0.0419 0.0104
## 2 BER 0.160 0.00775
## 3 Endothelial Cell 0.179 0.0154
## 4 ER 0.0578 0.00289
## 5 GABA-ergic Neuron 0.0523 0.00468
## 6 Glutamatergic Neuron 0.0399 0.00333
## 7 Microglia 0.0182 0.0407
## 8 Oligodendrocyte.1 0.411 0.0235
## 9 Oligodendrocyte.3 0.381 0.0353
fitSSEnet <- ssenet(xtrain = t(cbind(xscrnaseq, seqfish)),
ytrain=factor(c(as.character(yscrnaseq), rep(NA, ncol(seqfish)))),
alpha = alpha, lambda = fitEnet$lambda, lambda_nfolds = lambda_nfolds, family = "multinomial",
filter = filter,
max.iter = max.iter, perc.full = perc.full, thr.conf = thr.conf)
cvSSEnet <- predict(object = fitSSEnet, M = M, iter = iter, ncores = ncores)
cvSSEnet$perf
## # A tibble: 9 x 3
## ErrName Mean SD
## <chr> <dbl> <dbl>
## 1 Astrocyte 0.0605 0.0208
## 2 BER 0.254 0.00657
## 3 Endothelial Cell 0.283 0.0378
## 4 ER 0.0743 0.00401
## 5 GABA-ergic Neuron 0.0510 0.00365
## 6 Glutamatergic Neuron 0.0569 0.00488
## 7 Microglia 0.0273 0.0249
## 8 Oligodendrocyte.1 0.758 0.0288
## 9 Oligodendrocyte.3 0.542 0.0478
cvErr <- rbind(cvEnet$perf, cvSSEnet$perf) %>%
mutate(method = rep(c("Enet", "SSEnet"), each = nrow(cvEnet$perf))) %>%
mutate(ErrName = factor(ErrName, c(levels(yscrnaseq), "ER", "BER")))
pd <- position_dodge(0.5)
cvErr %>%
ggplot(aes(x = ErrName, y = Mean, color = method)) +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD), width=.2, position=pd) +
geom_line(position=pd) +
geom_point(position=pd, size = 3) +
theme_bw() +
ylab("Average error rate (5x5 cross-validation)") +
xlab("Cell-type, ER (error rate), BER (balanced error rate)") +
customTheme(sizeStripFont = 15, xAngle = 40, hjust = 1, vjust = 1,
xSize = 10, ySize = 10, xAxisSize = 15, yAxisSize = 15) +
ylab("Error") +
xlab("Cell-type, ER (error rate), BER (balanced error rate)")
| ErrName | Mean | SD | method | | :------------------- | --------: | --------: | :----- | | Astrocyte | 0.0418605 | 0.0104003 | Enet | | BER | 0.1603893 | 0.0077518 | Enet | | Endothelial Cell | 0.1793103 | 0.0154212 | Enet | | ER | 0.0577752 | 0.0028945 | Enet | | GABA-ergic Neuron | 0.0522996 | 0.0046829 | Enet | | Glutamatergic Neuron | 0.0399015 | 0.0033274 | Enet | | Microglia | 0.0181818 | 0.0406558 | Enet | | Oligodendrocyte.1 | 0.4105263 | 0.0235376 | Enet | | Oligodendrocyte.3 | 0.3806452 | 0.0353369 | Enet | | Astrocyte | 0.0604651 | 0.0208006 | SSEnet | | BER | 0.2540298 | 0.0065688 | SSEnet | | Endothelial Cell | 0.2827586 | 0.0377740 | SSEnet | | ER | 0.0743157 | 0.0040055 | SSEnet | | GABA-ergic Neuron | 0.0509855 | 0.0036464 | SSEnet | | Glutamatergic Neuron | 0.0568966 | 0.0048797 | SSEnet | | Microglia | 0.0272727 | 0.0248965 | SSEnet | | Oligodendrocyte.1 | 0.7578947 | 0.0288275 | SSEnet | | Oligodendrocyte.3 | 0.5419355 | 0.0478464 | SSEnet |
panels = list(SVM = selectedGenes, Enet = fitEnet$enet.panel, SSEnet = fitSSEnet$enet.panel)
Input <- fromList(panels)
metadata <- data.frame(Methods=colnames(Input))
upset(Input, sets = colnames(Input))
Amrit Singh
Amrit Singh
trainee (post-doc)
Kim-Anh Le Cao/Bruce McManus
PROOF Centre of Excellence and The University of British Columbia
none
Spatial transcriptomics: seqFISH + scRNA-seq
recommended by supervisor
Can scRNA-seq data be overlaid onto seqFISH for resolution enhancement?
What are the advantages and performance of your approach? > The published approach trained a multiclass SVM on the scRNAseq data and applied it to the seqFISH data to estimate the cell-types labels. My approach uses a penalized regression method (glmnet) with a semi-supervised appraoch in order to build a model using both the scRNAseq+seqFISH data. This strategy uses a recursive approach that invovles multiple rounds of training glmnet models using labeled data (label and imputed) and predicting the cell-type labels of unlabeled data. At each iteration, cell-type labels with high confidence (probability > 0.5) are retained for the next iteration, where a new glmnet model is trained with the scRNAseq data and seqFISH data with imputed cell-type labels with high confidence. This process is repeated until all cell-types in the seqFISH data have been labeled or until 50 iterations have been reached (in order to reduce compute times). The advantage of this approach is that more data in used for model training such that the resulting model may generalize better to new data. The performance of this appraoch was estimated using cross-validation, using only the scRNAseq data as the test set.
Compute times are significantly longer for the semi-supervised approach for model training. Thus, cross-validation takes even longer. The datasets are restricted to 113 genes and therefore the discovery space is very limited for the semi-supervised approach to learn classification rules that are superior to the supervised approach.
Cross-validation was parallelized such that each iteration of cross-validation was run on an independent cpu thread. If additional data is available for this study it may be better than the current results given that the genes are limited to those identified using the scRNAseq data only.
This is my first time looking at single cell data and this opportunity would expose me to knew methods, technologies and research in this field.
1) https://github.com/mabelc/SSC
# scrnaseq <- read.delim("inst/extdata/tasic_training_b2.txt", row.names = 1, header = FALSE)
# scrnaseqLabels <- read.delim("inst/extdata/tasic_labels.tsv", header = FALSE)
# seqfish <- read.delim("inst/extdata/seqfish_cortex_b2_testing.txt", row.names = 1, header = FALSE)
# seqfishLabels <- read.delim("inst/extdata/seqfish_labels.tsv", row.names = 1, header = FALSE)
# dim(scrnaseq); dim(scrnaseqLabels);
# dim(seqfish); dim(seqfishLabels);
#
# coord <- read.delim("inst/extdata/fcortex.coordinates.txt", header = FALSE)
# coord <- lapply(1:nrow(coord), function(i){
# as.numeric(strsplit(as.character(coord[i,]), " ")[[1]][-c(1, 2)])
# }) %>%
# do.call(rbind, .)
# dim(coord)
#
# usethis::use_data(scrnaseq, overwrite = TRUE)
# usethis::use_data(scrnaseqLabels, overwrite = TRUE)
# usethis::use_data(seqfish, overwrite= TRUE)
# usethis::use_data(seqfishLabels, overwrite = TRUE)
# usethis::use_data(coord, overwrite = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.