In MatteoBlla/PsiNorm: Scaling normalization based on the Pareto distribution

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Installation

First of all we need to install PsiNorm:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PsiNorm")

library(SingleCellExperiment)
library(splatter)
library(scater)
library(cluster)
library(PsiNorm)

Introduction

PsiNorm is a new scalable between-sample normalization for single cell RNA-seq count data based on the power-law Pareto type I distribution. It can be demonstrated that the Pareto parameter is inversely proportional to the sequencing depth, it is sample specific and its estimate can be obtained for each cell independently. PsiNorm computes the shape parameter for each cellular sample and then uses it as multiplicative size factor to normalize the data. The final goal of the transformation is to align the gene expression distribution especially for those genes characterised by high expression. Note that, similar to other global scaling methods, our method does not remove batch effects, which can be dealt with downstream tools.

To evaluate the ability of PsiNorm to remove technical bias and reveal the true cell similarity structure, we used both an unsupervised and a supervised approach. We first simulate a scRNA-seq experiment with four known clusters using the splatter Bioconductor package. Then in the unsupervised approach, we i) reduce dimentionality using PCA, ii) identify clusters using the clara partitional method and then we iii) computed the Adjusted Rand Index (ARI) to compare the known and the estimated partition.

In the supervised approach, we i) reduce dimentionality using PCA, and we ii) compute the silhouette index of the known partition in the reduced dimensional space.

Data Simulation

We simulate a matrix of counts with 2000 cellular samples and 10000 genes with splatter.

set.seed(1234)
params <- newSplatParams()
N=2000
sce <- splatSimulateGroups(params, batchCells=N, lib.loc=12,
                           group.prob = rep(0.25,4),
                           de.prob = 0.2, de.facLoc = 0.06,
                           verbose = FALSE)

sce is a SingleCellExperiment object with a single batch and four different cellular groups.

To visualize the data we used the first two Principal Components estimated starting from the raw log-count matrix.

set.seed(1234)
assay(sce, "lograwcounts") <- log1p(counts(sce))
sce <- runPCA(sce, exprs_values="lograwcounts", scale=TRUE, ncomponents = 2)
plotPCA(sce, colour_by="Group")

PsiNorm data normalization

Data Normalization with PsiNorm

To normalize the raw counts we used the PsiNorm normalization and we visualized the data using the first two principal components.

sce<-PsiNorm(sce)
sce<-logNormCounts(sce)
head(sizeFactors(sce))

Note that running the PsiNorm function computes a set of size factors that are added to the SingleCellExperiment object.

The logNormCounts function can be then used to normalize the data by multiplying the raw counts and the size factors.

set.seed(1234)
sce<-runPCA(sce, exprs_values="logcounts", scale=TRUE, name = "PsiNorm_PCA",
            ncomponents = 2)
plotReducedDim(sce, dimred = "PsiNorm_PCA", colour_by = "Group")

We can appreciate from the plot that PsiNorm allows a better separation among known cellular groups.

Unsupervised approach: Adusted Rand Index

We calculate ARI of both raw counts and PsiNorm normalized counts after PCA dimension reduction and $clara$ clustering (with $k$ equal to the simulated number of clusters); higher the ARI, better the normalization.

groups<-cluster::clara(reducedDim(sce, "PCA"), k=nlevels(sce$Group))
a<-paste("ARI from raw counts:", 
         round(mclust::adjustedRandIndex(groups$clustering, sce$Group), 
               digits = 3))

groups<-cluster::clara(reducedDim(sce, "PsiNorm_PCA"), k=nlevels(sce$Group))
b<-paste("ARI from PsiNorm normalized data:",
         round(mclust::adjustedRandIndex(groups$clustering, sce$Group), 
               digits = 3))

kableExtra::kable(rbind(a,b), row.names = FALSE)

Pareto normalization considerably increases the ARI index.

Supervised approach: Silhouette index

We calculate the Silhouette index of both raw counts and PsiNorm normalized counts after tSNE dimension reduction exploiting known simulated clusters; higher the Silhouette, better the normalization.

dist<-daisy(reducedDim(sce, "PCA"))
dist<-as.matrix(dist)
a<-paste("Silhouette from raw counts:", round(summary(
    silhouette(x=as.numeric(as.factor(sce$Group)),
               dmatrix = dist))$avg.width, digits = 3))

dist<-daisy(reducedDim(sce, "PsiNorm_PCA"))
dist<-as.matrix(dist)
b<-paste("Silhouette from PsiNorm normalized data:", round(summary(
    silhouette(x=as.numeric(as.factor(sce$Group)),
               dmatrix = dist))$avg.width, digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)

Pareto normalization considerably increases the Silhouette index.

Correlation of PC1 and PC2 with sequencing depth

To check if PsiNorm is able to capture technical noise and remove unwanted variation within a dataset (due for instance to differences in sequencing depth), we check whether the first two PCs are capturing technical variance. We computed the maximum correlation obtained between PC1 and PC2 and cell sequencing depths; a higher correlation indicates that the normalization was not able to properly remove noise.

set.seed(4444)
PCA<-reducedDim(sce, "PCA") 
PCAp<-reducedDim(sce, "PsiNorm_PCA")
depth<-apply(counts(sce), 2, sum)
a<-paste("The Correlation with the raw data is:",
            round(abs(max(cor(PCA[,1], depth), cor(PCA[,2], depth))), digits=3))
b<-paste("The Correlation with the PsiNorm normalized data is:",
            round(abs(max(cor(PCAp[,1], depth), cor(PCAp[,2], depth))), digits = 3))
kableExtra::kable(rbind(a,b), row.names = FALSE)