knitr::opts_chunk$set(collapse=TRUE, comment = "#>", fig.width=9, fig.height=6, eval=TRUE, echo=TRUE, results="verbatim", dpi=75) layout(1, respect=TRUE) penda::draw_penda()
penda
(PErsoNalized Differential Analysis ) is an open-access R package that detects gene deregulation in individual samples compared to a set of reference, control samples. This tutorial aims at providing to advanced users tools to optimized the method parameters using simulations.
How to cite: Richard, M. et al. PenDA, a rank-based method for personalized differential analysis: Application to lung cancer. PLOS Computational Biology 16, e1007869 (2020).
The dataset used to illustrated the method corresponds to the transcriptomes of 3000 genes (RNAseq counts, normalized with DESeq2) for 40 normal, control samples and 40 tumorous samples taken from the TCGA study of lung adenocarcinoma [PMID:25079552].
data_ctrl
is a data matrix containing the normalized counts of each control sample.
The rownames of the matrix correspond to the gene_symbol, the colnames indicate the sample ID.
data_ctrl = penda::penda_data_ctrl head(data_ctrl[,1:3]) dim(data_ctrl)
data_case
is a data matrix containing the normalized counts of each tumor sample.
The rownames of the matrix correspond to the gene_symbol, the colnames indicate the sample ID.
data_case = penda::penda_data_case data_case = data_case[rownames(data_ctrl),] head(data_case[,1:3]) dim(data_case)
Note: this vignette is an example that has been designed for a rapid test of the method. So we limit the number of genes and the number of samples for this purpose. For an optimal utilization of the method, users should however upload all their available data (genes, control and case samples).
The optimal choice of parameters (Sec. 4.2) is based on simulations that perturb control samples (Sec. 4.1). Here, we extract three patients (data_simu
) from data_ctrl
that will be used later for this purpose. For consistency, we also discard them from the data_ctrl
matrix that will serve as reference.
data_simu = data_ctrl[,1:3] data_ctrl = data_ctrl[,-(1:3)] head(data_simu[,1:3]) dim(data_simu) dim(data_ctrl)
Note: this vignette is an example that has been designed for a rapid test of the method. For a more complete analysis and a better parameter estimation, we recommend users to simulate more cases (10 for example instead of 3). If you don't have much control, you can remove them one by one and apply the vignette several times.
threshold_dataset = 0.99 Penda_dataset = penda::make_dataset(data_ctrl, data_case, detectlowvalue = TRUE, detectNA = TRUE, threshold = threshold_dataset) data_ctrl = Penda_dataset$data_ctrl data_case = Penda_dataset$data_case data_simu = data_simu[rownames(data_ctrl),]
The function make_dataset
contains three steps to prepare the data for the analysis.
detect_na_value
removes rows and columns (ie, genes and samples) of the data matrices that contain more than r "threshold"
% (default value = r threshold_dataset
) of NA (Not Available) value.detect_zero_value
removes genes with very low expression in the majority of samples (controls and cases), ie. genes whose expression is lower than val_min
in threshold
% of all the samples. By default it uses the function normalmixEM
to estimate the value of val_min
using all the log2-transformed count data but this parameter can also be tuned manually by the user.rank_genes
sorts the genes based on the median value of gene expression in controls. This step is essential for the proper functioning of penda
.head(data_ctrl[,1:3]) dim(data_ctrl) head(data_case[,1:3]) dim(data_case) head(data_simu[,1:3]) dim(data_simu)
threshold_LH = 0.99 s_max = 30 L_H_list = penda::compute_lower_and_higher_lists(data_ctrl, threshold=threshold_LH, s_max=s_max) L = L_H_list$L H = L_H_list$H
The penda
method uses the relative gene ordering in normal tissue.
The function compute_lower_and_higher_lists
computes two matrices L
and H
based on the filtered control dataset (data_ctrl
).
Each row of the L matrix contains a list of at most s_max
(default value = r s_max
) genes (characterized by their ids) whose expressions are lower than that of the gene associated to the corresponding row, in at least threshold_LH
(default value = r threshold_LH*100
%) of the control samples.
Each row of the H matrix contains a list of at most s_max
(default value = r s_max
) genes (characterized by their ids) whose expressions are higher than that of the gene associated to the corresponding row, in at least threshold_LH
(default value = r threshold_LH*100
%) of the control samples.
Below, for some genes (FOXH1, KRTAP2-3, etc.), we show the id of 10 genes of the L and H lists.
L[1000:1005,1:10] H[1000:1005,1:10] dim(L) dim(H)
Estimation of optimal parameters adapted to the user data is based on a ROC analysis on simulated datasets. The function complex_simulation
uses the real distribution of difference between the control and the case samples to simulate the proportion and the value of the dysregulation (see the original paper for details on the method of simulation).
It returns the vector of initial data (data_simu
, in simulation$initial_data
), the vector of data with modifications (simulation$simulated_data
) and the index of modified data (simulation$changes_idx
).
size_grp = 100 quant_simu = 0.05 simulation = penda::complex_simulation(data_ctrl, data_case, data_simu, size_grp, quant = quant_simu) head(simulation$initial_data) head(simulation$simulated_data)
simulation$changes_idx[1000:1005] #Before simulation: simulation$initial_data[simulation$changes_idx[1000:1005]] #After simulation: simulation$simulated_data[simulation$changes_idx[1000:1005]]
set.seed(1) tmp_kc = c(rnorm(100, 50000, 10000), rnorm(50, 70000, 15000)) tmp_ctrl = c(rnorm(100, 50000, 10000)) plot(density(tmp_kc), main="Expression of a gene in tumoral lung", yaxt="n", ylab="") quant_fig = quantile(tmp_ctrl, c(0.05, 0.95)) abline(v = quant_fig, lty=2) lim_fig = c(quant_fig[1] / 1.2, quant_fig[2] * 1.2) abline(v = lim_fig, col=2) legend("topright", lty=2:1, col=1:2, c("quantiles", "quantiles * factor"))
In the rare cases where the lists L or H of a gene are empty, penda
uses a simpler, less efficient, method based on quantile to determine the deregulation status (see the original paper). This quantile method depends on two parameters:
quantile
values of the distribution of expression of a gene in the control samples (dotted line)factor
parameter which modulates the quantile values and defined the thresholds which determine the deregulation status in case samples (red line).The function choose_quantile
applies the quantile method to the simulated dataset for various values of the parameters quantile and factor (grid of values defined by the variables factor_values
and quantile_values
, see the R documentation for default parameters). For each set of parameters, it computes the corresponding False Discovery Rate (FDR), True Positive Rate (TPR) and False Positive Rate (FPR).
The function select_quantile_param
then choose the best set that maximize the TPR for a user-specified maximal value of the FDR (defined by the variable FDR_max
, default value = 0.15).
quantile_values = c(0, 0.01, 0.02, 0.03, 0.05, 0.06, 0.08, 0.1, 0.15, 0.2, 0.3, 0.4) factor_values = c(1, 1.2, 1.4) which_quantile = penda::choose_quantile(data_ctrl, simulation, factor_values = factor_values, quantile_values = quantile_values) best_quantile = penda::select_quantile_param(which_quantile, FDR_max = 0.1)
In this example, optimum quantile method parameters are defined as:
r best_quantile$quantile
r best_quantile$factor
df_quant = data.frame( factor = factor(which_quantile[,1]), quantile = factor(which_quantile[,2]), FDR = which_quantile[,3], TPR = which_quantile[,4] ) library(ggplot2) ggplot(df_quant, aes(x = FDR, y = TPR, color = quantile, group = factor, shape = factor)) + geom_point() + geom_line() + theme_minimal() + geom_vline(xintercept = best_quantile$FDR, linetype = "dotted") + geom_hline(yintercept = best_quantile$TPR, linetype = "dotted")
Note: this vignette is an example that has been designed for a rapid test of the method. For a more complete analysis and a better parameter estimation, we recommend users to simulate more cases (10 for example instead of 3) and test more values for the parameters quantile and factor.
The penda
method infers fo each gene in each case sample its deregulation status (up-regulation, down-regulation or no deregulation) based on the L_H_list
. It tracks for changes in relative ordering in the sample of interest. If these changes exceed the given threshold, the gene of interest is considered as deregulated.
The second step of the parameter choice is therefore to determine the optimal value of threshold
. The function choose_threshold
applies PenDA to the simulated dataset for different threshold values (defined by the variable threshold_values
) and computes the corresponding FDR, TPR and FPR.
The function select_threshold_param
then choose the threshold value that maximize the TPR for a user-specified maximal value of the FDR (defined by the variable FDR_max
, default value = 0.05).
threshold_values = c(0.01, 0.05, 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9) best_quant = best_quantile$quantile best_fact = best_quantile$factor which_threshold = penda::choose_threshold(data_ctrl, L_H_list, 30, simulation, threshold_values, quant_test = best_quant, factor_test = best_fact) best_threshold = penda::select_threshold_param(which_threshold, FDR_max = 0.05)
In this example, optimum test threshold parameter is defined as:
r best_threshold$threshold
which_threshold = apply(which_threshold, 2, as.numeric) if(length(unique(which_threshold[,1])) > 1){ results_threshold = c() for(value in unique(which_threshold[,2])){ sum_value = colSums(which_threshold[which_threshold[,"threshold"] == value, ]) results_threshold = rbind(results_threshold, c(value, sum_value[c(6, 7, 8, 9)])) } } else { results_threshold = which_threshold[, c(2, 6, 7, 8, 9)] } df = data.frame(threshold = as.factor(results_threshold[,1]), FDR = results_threshold[,"FP"] / (results_threshold[,"TP"] + results_threshold[,"FP"]), TPR = results_threshold[,"TP"] / (results_threshold[,"TP"] + results_threshold[,"FN"]), group = rep(1)) ggplot(df, aes(x = FDR, y = TPR, color = threshold, group = group)) + geom_point() + geom_line() + theme_minimal() + geom_vline(xintercept = best_threshold$FDR, linetype = "dotted") + geom_hline(yintercept = best_threshold$TPR, linetype = "dotted")
Note: this vignette is an example that has been designed for a rapid test of the method. For a more complete analysis and a better parameter estimation, we recommend users to simulate more cases (10 for example instead of 3) and test more values for the parameter threshold.
As a safety check, PenDA is applied to the control samples used for the simulations and estimates the proportion of false positives.
threshold_values = c(0, 0.05, 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9) best_quant = best_quantile$quantile best_fact = best_quantile$factor results = c() for(thres in threshold_values){ penda_res = penda::penda_test(samples = data_simu, controls = controls, threshold = thres, iterations = 20, L_H_list = L_H_list, quant_test = best_quant, factor_test = best_fact) results = rbind(results, c("U", thres, colSums(penda_res$up_genes))) results = rbind(results, c("D", thres, colSums(penda_res$down_genes))) }
df = as.data.frame(results) colnames(df)[1:2] = c("Deregulation", "Threshold") patients = apply(results[, 3:ncol(results)], 2, as.numeric) df$mean = rowMeans(patients) df$proportion = df$mean/nrow(controls) * 100 ggplot(df, aes(x = Threshold, y = proportion, color = Deregulation, group = Deregulation)) + geom_point() + geom_line() + theme_minimal() + labs(title = "Proportion of genes deregulated detected in controls", subtitle = paste0("Mean between ", ncol(patients), " control samples."))+ ylab("Proportion of genes deregulated (%)")
With these simulations you can now perform analysis on your real data (see vignette_penda
) using the parameters:
r best_quantile$quantile
r best_quantile$factor
r best_threshold$threshold
This paragraph is automatically generated by the vignette to specify the method and data filtering parameters. It can be directly cut and paste to the "material and methods" section of the user analysis.
The simulation vignette of the penda
package version 1.0 was executed on r floor(Penda_dataset$info["init_nb_genes"])
genes, using r floor(Penda_dataset$info["init_nb_ctrls"])
control samples and r floor(Penda_dataset$info["init_nb_cases"])
case samples.
The data set was pretreated as following:
r floor(Penda_dataset$info["nb_genes_NA"])
genes and
r floor(Penda_dataset$info["nb_patients_NA"])
samples were removed during the NA values filtering step,
and r floor(Penda_dataset$info["nb_genes_0"])
genes were removed because lowly expressed:
under the threshold val_min
= r Penda_dataset$info["val_min"]
in at least r Penda_dataset$info["threshold"]*100
% of cases.
r ncol(data_simu)
cases samples were simulated using the complex simulation function with the following parameters: group size = r size_grp
, quantile = r quant_simu
. Theses simulations identified r signif(length(simulation$changes_idx) / length(simulation$simulated_data) * 100,3)
% of genes as typically deregulated in cases samples.
r ncol(controls)
controls were used to generate L and H lists using the following parameters: threshold LH = r signif(threshold_LH,3)
and s_max = r s_max
.
The quantile method was applied on the r ncol(data_simu)
simulated cases. We retained a global FDR value of r signif(best_quantile$FDR,3)
, with the following set of parameters: quantile = r best_quantile$quantile
and factor = r best_quantile$factor
.
The PenDA method was then applied on these r ncol(data_simu)
cases. We retained a global FDR value of r signif(best_threshold$FDR,3)
, with the following set of parameters: quantile = r best_quantile$quantile
, factor = r best_quantile$factor
and threshold = r best_threshold$threshold
.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.