knitr::opts_chunk$set(fig.pos = 'ht')
#options(warn=-1) # suppress warnings
options(warn=0) # restore warnings

Section 1: Brief introduction on Sparsely-Connected-Autoencoder (SCA) analysis

The scRNAseq analysis tools, which are part of the rCASC package [Alessandri et al 2019], are implemented in docker containers to simplify the installation procedure of the overall workflow and to guarantee functional and computational reproducibility [Kulkarni et al. 2018]. Figure below shows the overall organization of rCASC.

library(knitr)
include_graphics('./img/rCASC.png')

The SCA workflow is shown in the figure below, which describes the basic steps required to extract hidden functional features, i.e. TFs, miRNAs, Kinases, from a scRNAseq experiment using Sparsely-Connected-Autoencoder. To execute a SCA analysis it is required a scRNAseq count matrix. Subpopulation organization is then discovered using any of the clustering tools implemented in rCASC a), and the quality of the subpopulation organization is evaluated using Cell Stability Score metric b), for more info on CSS please see [Alessandri et al 2019 supplementary data. The clusters'data matrix, which is the count matrix including subpopulation organization, is normalized c). Normalized data are used to train a Sparsely-Connected-Autoencoder d). The hidden layer matrix is saved over multiple runs of the Sparsely-Connected-Autoencoder (hidden layer matrix is made of 0/1 for each hidden node, where 1 indicates that the node was used and 0 that the node was not used). The ability of each hidden layer matrix to reconstruct, at least partially, the expected subpopulation organization is evaluated by clustering e) and the quality of the reconstruction is evaluated using two metrics: Quality Control of Clusters and Quality Control of Models f), more information on these two metrics will be given later in this vignette. The hidden layer frequency matrix provides information of the usage of the hidden nodes (TFs, miRNAs, Kinases) over multiple runs of the Sparsely-Connected-Autoencoder. COMET tool [Delaney et al. 2019] is then used to grab the most important molecular features, i.e. TFs, miRNAs, Kinases, associated to each SCA reconstructed cluster.

library(knitr)
include_graphics('./img/scaworkflow.png')

Section 2: Dataset used in this vignette to demonstrate the use of SCA analysis.

A data set, setA [Alessandri et al. 2019], based on FACS purified cell types [Zheng et al. 2017] was used to investigate the SCA behaviour.

setA: 100 cells randomly selected for each cell type

SetA was previously used to estimate the strength of CSS metric [Alessandri et al. 2019]. We clustered setA using all the clustering tools actually implemented in rCASC: tSne+k-mean [Pezzotti et al. 2017], SIMLR [Wang et al. 2017], griph [Serra et al. 2019], Seurat [Butler et al. 2018], scanpy [Wolf et al. 2018] and SHARP [Wan et al. 2020]. All tools but tSne+k-mean and scanpy provided very good and similar partition of the different cell types (see Figure below).

library(knitr)
include_graphics('./img/5clusters.png')

Section 3: Manually curated cancer-immune-signature

SCA analysis based on a manually curated cancer-immune-signature (IS, Figure below) was also used on setA reference clusters, see main manuscript. This SCA analysis shows, for cluster 4 (Natural Killer cells) and cluster 5 (Naïve T-cells), QCM/QCC values greater than 0.5 for the majority of the cells (Figure below).

library(knitr)
include_graphics('./img/is.png')

Section 4: variational SCA (vSCA)

VAEs have one fundamentally unique property that separates them from other autoencoders: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation. We tested vSCA based on TF-targets using setA. Looking at the results at the level of QCC and QCM, vSCA does not provide any advantage with respect to a normal SCA, see Figure below.

library(knitr)
include_graphics('./img/vsca.png')

Section 5: How to run a SCA analysis.

This section provides an example of a SCA analysis, number of permutations is kept very small to reduce computing time.

The overall run described in this vignette is available in the folder vignettes/setA

IMPORTANT: Please note that the results generated in this section are meaningless. They are only used to demonstrate the how SCA functions work. In a real experiment at least 120 permutations MUST be execute in each chunk of code, where permutations are indicated.

Computing time, in each chunk, refers the time required to execute the task on a MacBook Pro (3.5 GHz Dual-Core Intel Core i7, 16 Gb RAM).

Section 5.1: autoencoder function

The input data for the autoencoder function are generated using the following script. As clustering tool we have used Griph [Serra et al. 2019], Seurat [Butler et al. 2018].

start_time <- Sys.time()

library(rCASC)
home <- getwd()
# cat("\n",home,"\n")
dir.create(paste(home, "setA", sep="/"))
dir.create(paste(home, "scratch", sep="/"))
file.copy(from=paste(path.package("rCASC"),"examples/setA.csv.zip", sep="/"), to=paste(home, "setA", sep="/"), overwrite = T)
unzip(paste(home, "setA/setA.csv.zip", sep="/"), exdir=paste(home, "setA", sep="/"))
file.remove(paste(home, "setA/setA.csv.zip", sep="/"))
system(paste("rm -fR ", home,"/setA/__MACOSX", sep=""))  

griphBootstrap(group="docker",scratch.folder=paste(home, "scratch", sep="/"), file=paste(home, "setA/setA.csv", sep="/"),  nPerm=8, permAtTime=8, percent=10, separator=",",logTen=0, seed=111)

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, " mins\n")

The autoencoder function feeds the SCA using the data derived by the script shown above. The chunk below show a single permutation run, which is used to estimate the nEpochs max value. Figure learning rate shows an example of the learning curve. In this example the maximum level of learning is obtained with 250 epochs. Thus, the analysis with multiple permutations can be done setting epoch slightly greater than 250, e.g. 300. Learning pictures are saved in SCAtutorial/vignettes/setA/Results/setATF/5/

start_time <- Sys.time()

library(rCASC)
home <- getwd()
# cat("\n",home,"\n")
autoencoder(group=c("docker"),
            scratch=paste(home, "scratch", sep="/"),
            file=paste(home, "setA/setA.csv", sep="/"), 
            separator=",", nCluster=5, bias="TF", permutation=1, 
            nEpochs=2000, patiencePercentage=5,
            cl=paste(home, "setA/Results/setA/5/setA_clustering.output.csv", sep="/"), 
            seed=1111, projectName="setATF", bN="NULL")

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, " mins\n")
library(knitr)
include_graphics('./img/learning.png')

Section 5.1.1: autoencoder parameters (for the full list of parameters please refer to the function help):

start_time <- Sys.time()

library(rCASC)
home <- getwd()
# cat("\n",home,"\n")
autoencoder(group=c("docker"),
            scratch=paste(home, "scratch", sep="/"),
            file=paste(home, "setA/setA.csv", sep="/"), 
            separator=",", nCluster=5, bias="TF", permutation=8, 
            nEpochs=300, patiencePercentage=5,
            cl=paste(home, "setA/Results/setA/5/setA_clustering.output.csv", sep="/"), 
            seed=1111, projectName="setATF", bN="NULL")

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, " mins\n")

The autoencoder function produces as output many files with the extension Ndensespace.format, in the folder Results/ProjectName/Permutation. The number of such files corresponds to the number of selected permutations. In this specific examples autoencoder outputs are in /SCAtutorial/vignettes/setA/Results/setATF/5/permutation/ folder.

Section 5.2: autoencoderClustering function

The function autoencoderClustering generates the numerical data required to understand if there is coherence between counts table derived clusters and those derived using autoencoders latent space.

start_time <- Sys.time()

home <- getwd()
# cat("\n",home,"\n")
library(rCASC)
autoencoderClustering(group="docker", scratch.folder=paste(home, "scratch", sep="/"),
                      file=paste(home, "setA/Results/setATF/setA.csv", sep="/"),
                      separator=",", nCluster=5, clusterMethod="GRIPH", seed=1111, 
                      projectName="setATF", permAtTime=8)

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, "\n")

Section 5.2.1: autoencoderClustering parameters (for the full list of parameters please refer to the function help):

The autoencoderClustering output is a folder where the name is given by the combination of the projectName and the clusterMethod, e.g in the above example /Results/setATF_GRIPH. In the example above, in the folder /SCAtutorial/vignettes/setA/Results/setATF_GRIPH/5/, there is the file label.csv, where is located the cluster assignment for each cells over each permutation.

Section 5.3: autoencoderAnalysis function

The output of autoencoderClustering is used by the function autoencoderAnalysis to generate the QCC and QCM statistics. QCC is an extension of CSS, for a complete mathematical description of CSS please see [Section 5.1 Cell Stability Score: mathematical description in Alessandri et al. 2019 supplementary data] and it measures the ability of latent space to keep aggregated cells belonging to predefined clusters generated using the gene count table. The metrics has the range between 0 and 1, where 1 indicates a high coexistence of cells within the same cluster in the analysis, and 0 a total lack of coexistence of cells within the same cluster.

start_time <- Sys.time()

home <- getwd()
cat("\n",home,"\n")
library(rCASC)
autoencoderAnalysis(group="docker", scratch.folder=paste(home, "scratch", sep="/"),
                    file=paste(home, "setA/Results/setATF_GRIPH/setA.csv", sep="/"), 
                    separator=",", nCluster=5, seed=1111, 
                    projectName="setATF_GRIPH", Sp=0.8)

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, "\n")

In the figure below, only cluster 7 can be well explained by latent space, Figure below panel B. QCM metric is also an extension of CSS and it measures the ability of the neural network to generate consistent data over the different training. In the Figure below panel C, SCA provides consistent data only for clusters 5 and 7. Informative clusters are those characterized by high QCM and QCC scores. In Figure below panel D, only cluster 7 is characterized by a robust neural network able to keep the cell aggregated using hidden layer knowledge. Dashed red line (Figure below panel D) indicates the defined threshold to consider the latent space information suitable to support cells’ clusters.

library(knitr)
include_graphics('./img/qcm-qcf.png')

Section 5.3.1: autoencoderAnalysis parameters (for the full list of parameters please refer to the function help):

The outputs of autoencoderAnalysis are the pdfs:

The above files are saved in setA/Results/setATF/5/

Section 5.4: autoFeature function

The autoFeature creates the frequency table for COMET analysis [Delaney et al. 2019].

start_time <- Sys.time()

home <- getwd()
cat("\n",home,"\n")
library(rCASC)
autoFeature(group="docker", scratch.folder=paste(home, "scratch", sep="/"), 
            file=paste(home, "setA/Results/setATF_GRIPH/setA.csv", sep="/"),
            separator=",", nCluster=5, projectName="setATF_GRIPH")

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, "\n")

Section 5.4.1: autoFeature parameters (for the full list of parameters please refer to the function help):

Section 5.5: comet2 function

cometsc2 is a modification of cometsc function, used in rCASC to extract cluster specific genes, suitable to handle autoencoder frequency table, i.e. autoencoder frequency table contains the % of permutations in which each hidden node for each cell was characterized by a weight different from 0 (if a hidden node is characterized by a weight equal to 0 it means that it was not used in that specific permutation).

start_time <- Sys.time()

home <- getwd()
cat("\n",home,"\n")
library(rCASC)
cometsc2(group="docker", file=paste(home, "setA/Results/setATF_GRIPH/5/freqMatrix.csv", sep="/"),
         scratch.folder=paste(home, "scratch", sep="/"), threads=2, X=0.15, K=2, counts="False", 
         skipvis="False", nCluster=5, separator=",",
         clustering.output=paste(home,"setA/Results/setA/5/setA_clustering.output.csv",sep="/"))

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, "\n")

cometsc2 outputs are the same of COMET, for more information please refer to COMET documentation.

IMPORTANT: This function produces an output including all clusters, but only results related to clusters supported by QCM and QCC means greater than 0.5 have to be taken in account.

Section 5.5.1: autoFeature parameters (for the full list of parameters please refer to the function help):

Section 5.6: wrapperAutoencoder function

Since the interconnection between the various functions shown above is quite complex, we assembled the wrapperAutoencoder function, which executes allautoencoder analysis steps.

start_time <- Sys.time()

library(rCASC)
home <- getwd()
dir.create(paste(home, "setA", sep="/"))
dir.create(paste(home, "scratch", sep="/"))
file.copy(from=paste(path.package("rCASC"),"examples/setA.csv.zip", sep="/"), to=paste(home, "setA", sep="/"), overwrite = T)
unzip(paste(home, "setA/setA.csv.zip", sep="/"), exdir=paste(home, "setA", sep="/"))
file.remove(paste(home, "setA/setA.csv.zip", sep="/"))
system(paste("rm -fR ", home,"/setA/__MACOSX", sep=""))  

griphBootstrap(group="docker",scratch.folder=paste(home, "scratch", sep="/"), file=paste(home, "setA/setA.csv", sep="/"),  nPerm=8, permAtTime=8, percent=10, separator=",",logTen=0, seed=111)

end_time <- Sys.time()
cat("\n Computing time: ",end_time - start_time, " mins\n")

wrapperAutoencoder(group="docker", scratch.folder=paste(home, "scratch", sep="/"),
                   file=paste(home, "setA/setA.csv", sep="/"), separator=",", 
                   cl=paste(home,"/setA/Results/setA/5/setA_clustering.output.csv",sep=""),
                   projectName="setATF2",clusterMethod="GRIPH", Sp=0.8, nCluster=5, 
                   permutation=3, nEpochs=10, bias="TF", skipvis="False")

Section 5.6.1: wrapperAutoencoder parameters:



kendomaniac/SCAtutorial documentation built on Sept. 11, 2020, 9:18 a.m.