In AskExplain/gcproc: Encodes datasets for prediction and dimension reduction

knitr::opts_chunk$set(echo = TRUE)

Introduction

Generalised Canonical Procrustes (gcproc) is a dimensional reduction technique which encodes samples and features into a smaller subspace that better represents the samples and features in the data. It combines multiple datasets and looks for an improved representation of the data structure shared between datasets. The parameters learned are weights that transforms the original data and can be used to improve projection and embedding techniques such as UMAP (Uniform Manifold Approximation and Projection), and t-SNE (t-distributed Stochastic Neighbourhood Embedding). Encoding the original data improves signal and information quality, where clusters in the projection and embedded space are easier to separate and identify.

Step 0 - Prepare libraries

library(gcproc)
library(mclust)
library(irlba)

Step 1 - Prepare data

Here, three pancreas datasets are integrated with cells aligned. The Baron, Segerstolpe and Xin datasets are analysed. The top 200 genes are taken, and only acinar, alpha, beta, delta, endothelial, epsilon and gamma cells are selected. These datasets are centered and scaled according to the size of the dataset, ready for analysis via gcproc.

load("../../data/pancreas_cell/top_200_genes_pancreas_log_normalised.Rdata")

main_list <- list(baron = baron,
                  xin = xin,
                  segerstolpe = segerstolpe
                  )

Step 2 - Initialise gcproc parameters

Default parameters are retrieved using the function. The verbose print function for explanation of internal running of the code is turned off. Other default parameters are visible by printing the retrieved config variable. Asking the help function via ?extract_config will provide more information about the parameters.

config <- gcproc::extract_config()
config$i_dim <- 10
config$j_dim <- 10
config$verbose <- F

Step 3 - Run gcproc on all datasets

Given gcproc can only align two datasets at a time, Baron is used as a reference for the alignment of Xin and Segerstolpe. They are combined to enrich the encoding of the gene dimension into an informative and shared subspace for the cell axis to be interpreted.

The covariates for the Xin and Segerstolpe datasets are removed by subtracting the difference at the code level. The code is the latent structure each dataset retains that represents the data generating process. By taking the difference between the scRNA-seq and the covariates codes, the relevant structure will be left. With the residual code, it is decoded to recover the covariate adjusted scRNAseq data.

fixed <- gcproc::extract_fixed_framework(F)
fixed$beta <- c(1,1,1)

gcproc.pancreas <- gcproc::gcproc(data_list = main_list,
                                  config = config,
                                  fixed = fixed
                                 )

Step 4 - Plot dimensionality reduction

The parameters that will transform the [cells] by [gene] space into a [cells] by [k] space, where k is the dimension reduction. Parameters are combined using a summation, similar to how vector spaces are combined.

Notice the clusters representing cell types are visible in each of the first three plots, representing the Baron, Segerstolpe and Xin datasets respectively. The final and fourth plot is the integrated plot of all three datasets - notice that cell types indicated by the coloured points still separate, despite the batch effects of each dataset.

plot(umap::umap(gcproc.pancreas$dimension_reduction[[2]]$sample_x.dim_reduce.code)$layout,
     col=as.factor(row.names(xin)), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = "Xin via gcproc and UMAP",
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")

plot(umap::umap(gcproc.pancreas$dimension_reduction[[3]]$sample_x.dim_reduce.code)$layout,
     col=as.factor(row.names(segerstolpe)), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = "Segerstolpe via gcproc and UMAP",
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")

plot(umap::umap(gcproc.pancreas$dimension_reduction[[1]]$sample_x.dim_reduce.code)$layout,
     col=as.factor(row.names(baron)), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = "Baron via gcproc and UMAP",
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")

plot(umap::umap(do.call('rbind',lapply(c(1:3),function(X){gcproc.pancreas$dimension_reduction[[X]]$sample_x.dim_reduce.code})))$layout,
     col=as.factor(row.names(rbind(baron,xin,segerstolpe))), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = "All 3 datasets via gcproc and UMAP",
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")

Step 5 - Feature importance in dimensionality reduction

Given parameters transform the [cell] by [gene] dataset into a [cell] by [k] subspace, it is possible to use only a few genes in the dimension reduction. This involves a matrix product of two matrices: one as a sub-selected data set of a [cell] by [favorite gene set] and another as a learned parameter of weights of [favorite gene set] by [k].

For example, several pancreatic genes that characterize the cell types are visualized via UMAP below. Correspondingly, a UMAP of all genes that are related to ribosomal function are also included and compared with the previous pancreatic gene set. Notice the distinct difference, with only 15 genes used in the favorite pancreatic gene list, and approximately 70 in the ribosomal gene list.

pancreas_genes_of_interest <- c("INS", 
                                "SST", 
                                "PPY", 
                                "GHRL", 
                                "GCG", 
                                "TTR", 
                                "GAD2", 
                                "IAPP", 
                                "CPE", 
                                "RBP4", 
                                "CD44", 
                                "ANXA2",
                                "SCG2",
                                "PPP1R1A",
                                "KRT8")

id_pancreas_genes <- which(colnames(rbind(baron,xin,segerstolpe)) %in% pancreas_genes_of_interest)

projection_gcproc.pancreas <- gcproc.pancreas$main.parameters[[1]]$beta[id_pancreas_genes,]%*%MASS::ginv(t(gcproc.pancreas$main.parameters[[1]]$beta[id_pancreas_genes,])%*%gcproc.pancreas$main.parameters[[1]]$beta[id_pancreas_genes,])

print(c("Total length of favourite pancreatic gene list: ", length(id_pancreas_genes)))

plot(umap::umap(rbind(baron,xin,segerstolpe)[,id_pancreas_genes]%*%(projection_gcproc.pancreas))$layout,
     col=as.factor(row.names(rbind(baron,xin,segerstolpe))), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = paste("All 3 datasets via gcproc and UMAP \n with ",length(id_pancreas_genes)," pancreatic genes",sep=""),
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")




ribosomal_genes_of_interest <- c("RPS|RPL")
id_ribosomal_genes <- grep(pattern = ribosomal_genes_of_interest, x = colnames(rbind(baron,xin,segerstolpe)))

projection_gcproc.ribosomal <- gcproc.pancreas$main.parameters[[1]]$beta[id_ribosomal_genes,]%*%MASS::ginv(t(gcproc.pancreas$main.parameters[[1]]$beta[id_ribosomal_genes,])%*%gcproc.pancreas$main.parameters[[1]]$beta[id_ribosomal_genes,])


print(c("Total length of ribosomal gene list: ", length(id_ribosomal_genes)))

plot(umap::umap(rbind(baron,xin,segerstolpe)[,id_ribosomal_genes]%*%(projection_gcproc.ribosomal))$layout,
     col=as.factor(row.names(rbind(baron,xin,segerstolpe))), 
     xlab = "UMAP 1", 
     ylab = "UMAP 2", 
     main = paste("All 3 datasets via gcproc and UMAP \n with ",length(id_ribosomal_genes)," RPS or RPL genes",sep=""),
     pch=19,
     cex=0.5,
     xaxt="n",
     yaxt="n")