scAlign Tutorial"
In scAlign: An alignment and integration method for single cell genomics

knitr::opts_chunk$set(echo = TRUE)

Introduction

This tutorial provides a guided alignment for two groups of cells from cellbench RNA mixture experiments. In this tutorial we demonstrate the unsupervised alignment strategy of scAlign described in Johansen et al, 2018 along with typical analysis utilizing the aligned dataset, and show how scAlign can identify and match cell types across platforms without using the labels as input.

Alignment goals

The following is a walkthrough of a typical alignment problem for scAlign and has been designed to provide an overview of data preprocessing, alignment and finally analysis in our joint embedding space. Here, our primary goals include:

Learning a low-dimensional cell state space in which cells group by function and type, regardless of condition (platform).
Accurately labeling old cells with cell cycle and cell type information using only the young cell annotations.

Installation

## Install scAlign
install.packages('devtools')
devtools::install_github(repo = 'quon-titative-biology/scAlign')
library(scAlign)

## Install Tensorflow 
library(tensorflow)
install_tensorflow(version = "gpu") ## Removing version will install CPU version of Tensorflow

Guide to installing python and tensorflow on different operating systems.

On Windows:
Download Python 3.6.8. Note, newer versions of Python (e.g. 3.7) cannot use TensorFlow at this time.
Make sure pip is included in the installation.

On Ubuntu:
sudo apt update
sudo apt install python3-dev python3-pip

On MacOS (homebrew):
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
brew update
brew install python # Python 3

Further details at: https://www.tensorflow.org/install

Data setup

The gene count matrices used for this tutorial are hosted on the cellbench github: here.

First, we load in the normalized cellbench data. The data was normalized following the procedures defined on the cellbench github.

library(scAlign)
library(SingleCellExperiment)
library(ggplot2)

## Load in cellbench data
data("cellbench", package = "scAlign", envir = environment())

## Extract RNA mixture cell types
mix.types = unlist(lapply(strsplit(colnames(cellbench), "-"), "[[", 2))

## Extract Platform
batch = c(rep("CEL", length(which(!grepl("sortseq", colnames(cellbench)) == TRUE))),
          rep("SORT", length(which(grepl("sortseq", colnames(cellbench)) == TRUE))))

scAlign setup

The general design of scAlign's makes it agnostic to the input RNA-seq data representation. Thus, the input data can either be gene-level counts, transformations of those gene level counts or a preliminary step of dimensionality reduction such as canonical correlates or principal component scores. Here we create the scAlign object from the previously normalized cellbench data and perform CCA on the unaligned data.

## Create SCE objects to pass into scAlignCreateObject
youngMouseSCE <- SingleCellExperiment(
    assays = list(scale.data = cellbench[,batch=='CEL'])
)

oldMouseSCE <- SingleCellExperiment(
    assays = list(scale.data = cellbench[,batch=='SORT'])
)

## Build the scAlign class object and compute PCs
scAlignCB = scAlignCreateObject(sce.objects = list("CEL"=youngMouseSCE,
                                                   "SORT"=oldMouseSCE),
                                 labels = list(mix.types[batch=='CEL'],
                                               mix.types[batch=='SORT']),
                                 data.use="scale.data",
                                 pca.reduce = FALSE,
                                 cca.reduce = TRUE,
                                 ccs.compute = 5,
                                 project.name = "scAlign_cellbench")

Alignment of cellbench RNAmixture

Now we align the cell populations from both protocols. scAlign returns a low-dimensional joint embedding space where the effect of platform is removed allowing us to use the complete dataset for downstream analyses such as clustering or differential expression.

## Run scAlign with all_genes
scAlignCB = scAlign(scAlignCB,
                    options=scAlignOptions(steps=1000,
                                           log.every=1000,
                                           norm=TRUE,
                                           early.stop=TRUE),
                    encoder.data="scale.data",
                    supervised='none',
                    run.encoder=TRUE,
                    run.decoder=FALSE,
                    log.dir=file.path('~/models_temp','gene_input'),
                    device="CPU")

# ## Additional run of scAlign with CCA
# scAlignCB = scAlign(scAlignCB,
#                     options=scAlignOptions(steps=1000,
#                                            log.every=1000,
#                                            norm=TRUE,
#                                            early.stop=TRUE),
#                     encoder.data="CCA",
#                     supervised='none',
#                     run.encoder=TRUE,
#                     run.decoder=FALSE,
#                     log.dir=file.path('~/models','cca_input'),
#                     device="CPU")

## Plot aligned data in tSNE space, when the data was processed in three different ways:
##   1) either using the original gene inputs,
##   2) after CCA dimensionality reduction for preprocessing.
##   Cells here are colored by input labels

set.seed(5678)
gene_plot = PlotTSNE(scAlignCB,
                     "ALIGNED-GENE",
                     title="scAlign-Gene",
                     perplexity=30)
## Show plot
gene_plot
# cca_plot = PlotTSNE(scAlignCB,
#                     "ALIGNED-CCA",
#                     title="scAlign-CCA",
#                     perplexity=30)
#
# multi_plot_labels = grid.arrange(gene_plot, cca_plot, nrow = 1)