In CDecamps/penda: A R package that performs personalized differential analysis of comics data

knitr::opts_chunk$set(collapse=TRUE, comment = "#>", fig.width=9, fig.height=6, eval=TRUE, echo=TRUE, results="verbatim", dpi=75)
layout(1, respect=TRUE)
penda::draw_penda()

Introduction

penda (PErsoNalized Differential Analysis ) is an open-access R package that detects gene deregulation in individual samples compared to a set of reference, control samples. This tutorial aims at providing to non-expert users basic informations and illustrations on how to run the package.

How to cite: Richard, M. et al. PenDA, a rank-based method for personalized differential analysis: Application to lung cancer. PLOS Computational Biology 16, e1007869 (2020).

Dataset and data filtering

Dataset

The dataset used to illustrated the method corresponds to the transcriptomes of 3000 genes (RNAseq counts, normalized with DESeq2) for 40 normal, control samples and 40 tumorous samples taken from the TCGA study of lung adenocarcinoma [PMID:25079552].

data_ctrl is a data matrix containing the normalized counts of each control sample. The rownames of the matrix correspond to the gene_symbol, the colnames indicate the sample ID.

data_ctrl = penda::penda_data_ctrl
head(data_ctrl[,1:3])
dim(data_ctrl)

data_case is a data matrix containing the normalized counts of each tumor sample. The rownames of the matrix correspond to the gene_symbol, the colnames indicate the sample ID.

data_case = penda::penda_data_case
data_case = data_case[rownames(data_ctrl),]
head(data_case[,1:3])
dim(data_case)

Note: this vignette is an example that has been designed for a rapid test of the method. So we limit the number of genes and the number of samples for this purpose. For an optimal utilization of the method, users should however upload all their available data (genes, control and case samples).

Method

penda performs a 3-steps analysis:

Data filtering and creation of the dataset
Relative gene ordering
Differential expression testing

Data filtering

threshold_dataset = 0.99
Penda_dataset = penda::make_dataset(data_ctrl, data_case, detectlowvalue = TRUE, detectNA = TRUE, threshold = threshold_dataset)
data_ctrl = Penda_dataset$data_ctrl
data_case = Penda_dataset$data_case

The function make_dataset contains three steps to prepare the data for the analysis.

detect_na_value removes rows and columns (ie, genes and samples) of the data matrices that contain more than r "threshold" % (default value = r threshold_dataset) of NA (Not Available) value.
detect_zero_value removes genes with very low expression in the majority of samples (controls and cases), ie. genes whose expression is lower than val_min in threshold% of all the samples. By default it uses the function normalmixEM to estimate the value of val_min using all the log2-transformed count data but this parameter can also be tuned manually by the user.
rank_genes sorts the genes based on the median value of gene expression in controls. This step is essential for the proper functioning of penda.

head(data_ctrl[,1:3])
dim(data_ctrl)
head(data_case[,1:3])
dim(data_case)

Relative gene ordering

threshold_LH = 0.99
s_max = 30
L_H_list = penda::compute_lower_and_higher_lists(data_ctrl, threshold = threshold_LH, s_max = s_max)
L = L_H_list$L
H = L_H_list$H

The penda method uses the relative gene ordering in normal tissue.

The function compute_lower_and_higher_lists computes two matrices L and H based on the filtered control dataset (data_ctrl).

Each row of the L matrix contains a list of at most s_max (default value = r s_max) genes (characterized by their ids) whose expressions are lower than that of the gene associated to the corresponding row, in at least threshold_LH (default value = r threshold_LH*100 %) of the control samples.

Each row of the H matrix contains a list of at most s_max (default value = r s_max) genes (characterized by their ids) whose expressions are higher than that of the gene associated to the corresponding row, in at least threshold_LH (default value = r threshold_LH*100 %) of the control samples.

Below, we show the number of genes in the L and H lists.

layout(matrix(1:2, 1), respect=TRUE)
hist(rowSums(L_H_list$L != 0), xlab = "nb of L genes", main = "Size of L list")
hist(rowSums(L_H_list$H != 0), xlab = "nb of H genes", main = "Size of H list")

Differential expression testing

threshold = 0.4
iterations =  20 
quant_test =  0.05
factor_test = 1.2

penda_res=penda::penda_test(samples = data_case, 
                  controls = data_ctrl,
                   threshold = threshold, 
                   iterations =  iterations, 
                   L_H_list =  L_H_list, 
                   quant_test =  quant_test,
                   factor_test = factor_test)

The function penda_test infers for each gene and for each sample of the data_case matrix its deregulation status (up-regulation, down-regulation or no deregulation). This function analyses case samples one by one. It is based on the L_H_list and tracks for changes in relative ordering in the sample of interest. If these changes exceed the given threshold, the gene of interest is considered as deregulated.

By default, the threshold parameter is set to r threshold but we strongly advise users to use the vignette vignette simulation to adjust this parameter to the user-specific data.

Results are in the form of two matrices $down_genes and $up_genes. Each row corresponds to a gene and each column to a case sample. A TRUE entry in these matrices means that the corresponding genes are deregulated (down or up-regulated) in the corresponding samples.

generate_data_bypatient = function(D_list, U_list){
  down = colSums(D_list)
  up = colSums(U_list)
  total = down + up
  patient_names = colnames(D_list)
  patients = rep(factor(patient_names , levels = patient_names [order(total)]), 3)
  variable = c(rep("down", length(patient_names)),
               rep("up", length(patient_names)),
               rep("total", length(patient_names)))
  value = c(down, up, total)
  pc = c(down/nrow(D_list)*100, up/nrow(D_list)*100, total/nrow(D_list)*100)
  return(data.frame(patients = patients,
                    variable = variable,
                    value = value,
                    pc = round(pc,2)))
}

data_bypatient = generate_data_bypatient(D_list = penda_res$down_genes,
                                         U_list = penda_res$up_genes)

plot_figure = function(data_patients){
  library(ggplot2)
  mytheme = theme(panel.background = element_blank(),
  panel.grid.major = element_line(colour="black", size = (0.1)),
                 panel.grid.minor = element_blank())

  p1 = ggplot(data_patients, aes(x = patients, y = pc)) +
    geom_line(aes(group = variable), colour = "grey80") +
    mytheme +
    ylab("% of gene deregulation") + xlab("patients") +
    geom_point(aes(colour = variable), size = 0.5) +
    ylim(0, 80) +
    scale_x_discrete(breaks = NULL) +
    scale_colour_manual(
      name = "Gene deregulation per patient",
      values = c("blue", "black", "red"),
      labels = c("DOWN", "UP & DOWN", "UP")
    ) #+
  #  theme(legend.position = "none", axis.text.x = element_blank())

  return(p1)
}

plot_figure(data_patients = data_bypatient)

plot_heatmap_hclust = function (data) {
  sum(apply(is.na(data), 1, any))
  data = data[!apply(is.na(data), 1, any), ]

  # clustering base on correlation for tissues
  tmp_d = data
  tmp_d = t(tmp_d) - apply(tmp_d, 2, mean)
  tmp_d = t(tmp_d)
  tmp_d = cor(tmp_d, method="pe")
  dim(tmp_d)
  hc_col = hclust(dist(1 - tmp_d), method="complete")

  Colv = as.dendrogram(hc_col)
  dendrogram="col"      

  # clustering base on eucl. dist. for genes
  d = dist(data)
  hc_row = hclust(d, method="complete")
  Rowv = as.dendrogram(hc_row)
  dendrogram="both"      

  # col
  colors=c("blue", "gray", "red")
  cols = colorRampPalette(colors)(20)

  foo = gplots::heatmap.2(data, Rowv=Rowv, Colv=Colv, dendrogram="col", trace="none", col=cols,
                          labRow = FALSE,labCol = FALSE,
                          main=paste0("Penda (", nrow(data), " genes x ", ncol(data), " samples)"), mar=c(10,5), useRaster=TRUE)
}


plot_heatmap_hclust(data = penda_res$down_genes - penda_res$up_genes)

Material and methods

This paragraph is automatically generated by the vignette to specify the method and data filtering parameters. It can be directly cut and paste to the "material and methods" section of the user analysis.

The PenDA vignette of the penda package version 1.0 was executed on r floor(Penda_dataset$info["init_nb_genes"]) genes, using r floor(Penda_dataset$info["init_nb_ctrls"]) control samples and r floor(Penda_dataset$info["init_nb_cases"]) case samples.

The data set was pretreated as following: r floor(Penda_dataset$info["nb_genes_NA"]) genes and r floor(Penda_dataset$info["nb_patients_NA"]) samples were removed during the NA values filtering step, and r floor(Penda_dataset$info["nb_genes_0"]) genes were removed because lowly expressed: under the threshold val_min = r Penda_dataset$info["val_min"] in at least r Penda_dataset$info["threshold"]*100 % of cases.

r ncol(data_ctrl) controls were used to generate L and H lists using the following parameters: threshold LH = r threshold_LH and s_max = r s_max.

The PenDA method was then applied on r ncol(data_case) cases, with the following set of parameters: quantile = r quant_test, factor = r factor_test and threshold = r threshold.