#options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(scRGNet)

Overview

The goal of scRGNet is to explore cell-cell relationship using scRNA-seq data. Different from conventional dimensional reduction methods used by any existing R packages that assume statistical distribution for gene expression data, for scRNA-seq data analysis, scRGNet encodes gene expression data by applying the feature autoencoder of the novel single cell graph neural network (scGNN) framework recently proposed by Wang. et al[@scGNN] that requires no statistical assumptions. It attempts to capture the relationships among cell samples in the scRNA-seq data through topological abstraction based on gene expression and transcriptional regulation information; the package also provides an option for whether to include cell-type specific regulatory signals as a regulariser in trainning the feature autoendoer using a left-truncated mixture Gaussian (LTMG) model.[@LTMG] Once the lower dimensional representations of the encoded expression values for each cell sample are obtained, they will be used to fit a K-nearest neighbours (KNN) model to build the cell-cell network, with each node in the KNN graph representing a call sample, and outlier cell samples in the KNN graph will be pruned using the isolation forest modal.[@isolationForest][@isotree] Although this tool offers a way for statistical hypothesis-free analysis, it comes with a price: the feature autoencoder is a neural network consists of 2 hidden layers, making the task computational-intensive. Hence, this document introduces scRGNet using a small subset of scRNA-seq data from experiment GSE138852[@GSE138852] containing only 48 cells and 1000 most variant genes is included in the package for a quick demo.

Workflow

Step 1. Pre-preocessing scRNA-seq Data from csv File

Before using the scRNA-seq data to train the feature autoencoder, we need to pre-process the raw data matrix first. The reason being is, dropout-events are rather common in scRNA-seq data, leading to a number of zero entries. The purpose of pre-processing is to retain genes and cells that are less "noisy", that is, removing genes and cells with proportion of zero expression values exceeding a certain threshold, which can be defined by the user.

To load the demo data in the package:

inputCountsPath <- system.file("extdata", "GSE138852_small.csv", package = "scRGNet")

Note that a compressed csv.gz file downloaded from the GEO database can also be directly loaded with this function without unzipping.

Then, we can pre-process the raw count matrix by calling

counts <- scRGNet::preprocessCSV(path            = inputCountsPath,
                                 log_transform   = TRUE,
                                 cell_zero_ratio = 0.99,
                                 gene_zero_ratio = 0.99,
                                 geneSelectNum   = 2000,
                                 transpose       = FALSE,
                                 toCSV           = FALSE,
                                 outdir_path     = NULL,
                                 savename        = NULL)

User can define their desirable thresholds on the maximum proportion of genes with zeros in cells using argument cell_zero_ratio, whose default value is 0.99, meaning cells with more than 99% genes that are zeros will be filtered out, and threshold on the maximum proportion of zeros in genes, whose default is also 0.99, implying genes with more than 99% zero values will be filtered out. After the filtering, genes will be ranked based on the variations in their expression values. User can define the number of most variant genes they would like to select. However, it cannot be fewer than 512 genes, which is the minimum requirement for the feature autoencoder to perform dimensional reduction. If the reamining genes after filtering are fewer than the user defined value, all the remaining genes will be selected. The default value for log_transform is TRUE and it is recommended to set to TRUE. If the raw matrix was given with cells as rows and genes as columns, then transpose needs to be set to TRUE to get the correct filtered result. If you would like to save the pre-processed result for other analyses, you can set toCSV to TRUE, and enter the name of the saved file using savename, and the directory where the file will be saved using outdir_path, with no path separator "/" at the end.

The function returns an R6 scDataset object containing the pre-processed scRNA-seq data in a transposed sparse matrix. User can view the number of remaining cell samples in the object using

length(counts)

And index specific cell sample by simply indexing:

counts[5]

Step 2. Infer LTMG Signals (Optional)

Before performing the dimensional reduction using the feature autoencoder, user can choose whether to include the LTMG signals in feature autoencoder training. To infer LTMG tags, use

ltmg <- scRGNet::runLTMG(scDataset   = counts,
                         fromFile    = FALSE,
                         readPath    = NULL,
                         toFile      = FALSE,
                         outdir_path = NULL,
                         savename    = NULL)

runLTMG takes either a scDataset object containing the data, or, if fromFile is set to TRUE, a path to an csv file containing pre-processed scRNA-seq data. If you would like to save the LTMG tags to file, set toFile to TRUE and enter the directory path to save the file using outdir_path and the name of the saved file savename.

Step 3. Encoding Gene Expression Values

The two subsections below are intended for advanced user. If you only want to run a quick demo, you can safely skip to the training section. If you would like to run your own dataset, please make sure that you have enough computational resources; running with a whole scRNA-seq dataset can take up 12G ram.

Hyper-parameters

The feature autoencoder is a model that requires some tuning for its hyperparameters. For user with machine learning background and familiar with the concept of hyperparameters, you can tune the hyperparameter using

hyperParams <- scRGNet::setHyperParams(
    batch_size  = 1L,
    regu_epochs = 5L,
    L1          = 0.5,
    L2          = 0.5,
    regu_alpha  = 0.9,
    reduction   = "sum"
)

Here I'm showing the default hyperparameters to run a fast demo. It does not produce a convergent result.

One can tell whether the result has converged by reading the averaged loss value printed on the console. If a LTMG matrix is applied, then the averaged loss will stop decreasing at certain value (depending on the intensity of regularisation and the batch size) because the LTMG signals are regularising.

Hardware

scRGNet also offers options for setting which device to use for running the model:

hardwareSetup <- scRGNet::setHardware(
    CUDA       = FALSE,
    coresUsage = 1L
)

If you have a powerful Nvidia graphics card and would like to use it to train the model, you might consider setting CUDA to TRUE. However, because feature autoencoder is implemented using torch neural network module, and only CUDA 10 and 11.1 are supported by torch for R[@torch], before proceeding, make sure you have the correct CUDA version installed in your computer. If you would like to run with your CPU, you can set the number of cores to train the model with coresUsage. Unfortunately this option is not available for Mac OS user (sorry).

Training

To run the model using our data, simply run:

z <- scRGNet::runFeatureAE(scDataset = counts)

or, if you would like to run with LTMG regularisation:

z <- scRGNet::runFeatureAE(scDataset = counts,
                           LTMG_mat  = ltmg)

If you would like to with your custimised settings:

z <- scRGNet::runFeatureAE(scDataset     = counts,
                           LTMG_mat      = ltmg,
                           hyperParams   = hyperParams,
                           hardwareSetup = hardwareSetup)

where hyperParams and hardwareSetup are user-defined.

At the end of training, runFeatureAE returns a matrix of the low-dimensional representation of the dataset in scDataset. Again, note that in scDataset, the pre-processed matrix has been transposed, meaning that now rows are cells and columns are genes. Hence, the selected genes are used as our "features" in the feature auto-encoder, and it finds the low-dimensional features from the gene expression values for each cell.

Step 4. Generate a Network

After we have found the encoded feature matrix, scRGNet fits this feature matrix to a KNN model: for a user defined value of $k$ (default uses the best heuristic value from the given feature matrix), for each cell sample, this KNN model selects K cells nearest in Euclidean distance, and then for the $k$ selected nearest cells of the current cell sample, fit them in an isolation forest model[@isolationForest][@isotree] to prune outlier cells from the graph:

net <- scRGNet::generateNetwork(feature_mat = z)

Note that the pruning of outliers by the isolation forest can be done faster with multiple CPU cores (for non MacOS users only, sorry again):

net <- scRGNet::generateNetwork(feature_mat   = z,
                                hardwareSetup = hardwareSetup)

Step 5. Plot and Analysis

scRGNet offers 3 visualisations for analysis

Cell-Cell Network

The most straight-forward visualisation[@visNetwork] is the cell-cell network. It shows the topological structure representing the cell-cell relationship inferred from the encoded gene expression values:

scRGNet::plotCellNet(net = net)

If you would like to view a plot without the highlight, and set your own title for the network, you can set group to FALSE and enter your title:

scRGNet::plotCellNet(net   = net,
                     group = FALSE,
                     title = "My Network")

If you would like to select multiple cell samples at once by the group it belongs to, set show_select_by = "group":

scRGNet::plotCellNet(net            = net,
                     group          = TRUE,
                     show_select_by = "group")

Or, you might be interested in a particular specific cell sample and the other cell samples which are connected to it:

scRGNet::plotCellNet(net            = net,
                     group          = FALSE,
                     show_select_by = "node")

You can also change the node size with the node_size argument, whose default value is 25:

scRGNet::plotCellNet(net            = net,
                     group          = TRUE,
                     node_size      = 50) ## doubling the node size

Note that the quality of your network entirely depends on how well the model was tuned. Running the model with default hyper-parameters will produce a rather crude result that haven't converged like those shown above.

A model trained with the same dataset, running 500 epochs using a full batch, with default k will have the following convergent result:

{width=90%}

Degree Distribution

A network is an intuitive visualisation, but it is difficult to generailise information regarding its topological structure. Another visualisation this package provides is the degree distribution:

scRGNet::plotDegree(net = net)

Again, you can set your own title with argument title. This is available in all plotting functions of this package.

A left-skewed distribution could indicate that most cells have interaction with each other, forming large common communities. Contrarily, a right-skewed distribution implies less interaction among cells.

Log-log Plot: Rank vs. Frequency.

The log-log plot is helpful in further analysing the topological structure of the network. Many biological networks follows a scale-free degree distribution. Data from a model of a scale free network when compared against the power law, that is, whether a linear pattern in the log-log plot is observed[@graph]:

scRGNet::plotLog(net = net)

SessionInfo

utils::sessionInfo()

Citation

To cite this package:

utils::citation("scRGNet")

References



ff98li/scRGNet documentation built on Jan. 14, 2022, 4:58 a.m.