Recent advances in biochemistry and single-cell RNA sequencing (scRNA-seq) have made it possible to monitor biological systems at single-cell resolution. However, the small amount of mRNA captured from individual cells often leads to inaccurate quantification of genetic material. Consequently, a significant number of expression values are reported as missing; these are often referred to as dropouts. This missing information can heavily impact the accuracy of downstream analyses.
To address this problem, we introduce a novel imputation method named single-cell Imputation via Subspace Regression (scISR). scISR first identifies the true dropout values in an scRNA-seq dataset using a hypergeometric testing approach. Based on the results of this test, the original dataset is split into two parts: training data and imputable data. The training data is then used to construct a generalized linear regression model, which is in turn used to impute the imputable data.
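The split-then-regress idea can be illustrated with a small base R sketch. This is a toy under stated assumptions, not the scISR implementation: the hypergeometric test is replaced here by a plain zero/non-zero split, the data are simulated, and all object names (`train`, `g3`, `truth`) are hypothetical.

```r
# Toy illustration only: NOT the scISR algorithm.
# Assumption: zeros in gene g3 are known dropouts (scISR would decide this
# with a hypergeometric test; here we simply treat every zero as a dropout).
set.seed(1)
n_cells <- 60

# Two "training" genes without zeros, driven by independent latent factors
f <- rnorm(n_cells, mean = 5, sd = 1)
h <- rnorm(n_cells, mean = 5, sd = 1)
train <- rbind(g1 = f + rnorm(n_cells, sd = 0.1),
               g2 = h + rnorm(n_cells, sd = 0.1))

# An "imputable" gene that depends on both factors, with 10 values zeroed out
truth <- 1.5 * f + 0.5 * h + rnorm(n_cells, sd = 0.1)
g3 <- truth
g3[sample(n_cells, 10)] <- 0
is_zero <- g3 == 0

# Fit a linear model on the cells where g3 was observed
fit_df <- data.frame(y = g3[!is_zero], t(train[, !is_zero]))
fit <- lm(y ~ g1 + g2, data = fit_df)

# Predict the zeroed entries from the training genes in the same cells
g3_imputed <- g3
g3_imputed[is_zero] <- predict(fit, newdata = data.frame(t(train[, is_zero])))

# Mean absolute error of the imputed entries against the simulated truth
mae <- mean(abs(g3_imputed[is_zero] - truth[is_zero]))
print(mae)
```

Because the simulated gene is a linear function of the training genes, the regression recovers the zeroed values closely; real data are noisier, and the package's own selection of dropouts and training genes is more involved than this sketch.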
The scISR package can be installed from GitHub using the devtools package:
# Install devtools package
utils::install.packages('devtools')
# Install scISR from GitHub
devtools::install_github('duct317/scISR')
Load the example Goolam dataset. The data is a list consisting of two elements: a data matrix with genes as rows and samples as columns, and a vector with the cell type information.
# Load required library
library(scISR)
# Load example data (Goolam dataset with a reduced number of genes);
# other datasets can be downloaded from our server at http://scisr.tinnguyen-lab.com/
data('Goolam')
# Raw data
raw <- Goolam$data
# Cell type information
label <- Goolam$label
We can use the main function scISR to impute the raw data.
# Perform imputation
set.seed(1)
imputed <- scISR(data = raw, ncores = 4)
The quality of the imputation can be evaluated using clustering analysis. We cluster both the raw and the imputed data and measure the accuracy of each clustering with the Adjusted Rand Index (ARI). A higher ARI means higher agreement between a given clustering and the ground truth. In this example, the clusters obtained from the imputed data have much higher accuracy.
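To make the ARI concrete, the following base R sketch computes it by hand from the contingency table of two partitions. The function name `ari` and the toy label vectors are hypothetical and serve only as illustration; in practice, `mclust::adjustedRandIndex` provides this measure.

```r
# Hand-written Adjusted Rand Index for illustration (hypothetical helper).
ari <- function(x, y) {
  tab <- table(x, y)                          # contingency table of the two partitions
  comb2 <- function(n) n * (n - 1) / 2        # number of pairs, choose(n, 2)
  sum_ij <- sum(comb2(tab))                   # pairs grouped together in both partitions
  sum_a <- sum(comb2(rowSums(tab)))           # pairs grouped together in x
  sum_b <- sum(comb2(colSums(tab)))           # pairs grouped together in y
  expected <- sum_a * sum_b / comb2(length(x))
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

truth   <- rep(c('a', 'b', 'c'), each = 4)
perfect <- rep(c(2, 3, 1), each = 4)           # same grouping, different label names
noisy   <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1)  # one cell per group misassigned

print(ari(truth, perfect))  # identical partitions give 1
print(ari(truth, noisy))    # partial agreement gives a value below 1
```

Note that the ARI does not depend on how the clusters are named, only on which cells are grouped together, which is why k-means cluster numbers can be compared directly with cell type labels.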
library(irlba)
library(mclust)
# Perform PCA and k-means clustering on raw data
set.seed(1)
# Filter out genes that have only zeros in the raw data
raw_filter <- raw[rowSums(raw != 0) > 0, ]
pca_raw <- irlba::prcomp_irlba(t(raw_filter), n = 50)$x
cluster_raw <- kmeans(pca_raw, length(unique(label)),
                      nstart = 2000, iter.max = 2000)$cluster
print(paste('ARI of clusters using raw data:',
            round(adjustedRandIndex(cluster_raw, label), 3)))
# Perform PCA and k-means clustering on imputed data
set.seed(1)
pca_imputed <- irlba::prcomp_irlba(t(imputed), n = 50)$x
cluster_imputed <- kmeans(pca_imputed, length(unique(label)),
                          nstart = 2000, iter.max = 2000)$cluster
print(paste('ARI of clusters using imputed data:',
            round(adjustedRandIndex(cluster_imputed, label), 3)))
sessionInfo()