knitr::opts_chunk$set(collapse=TRUE, comment = "#>", fig.width=9, fig.height=6, eval=TRUE, echo=TRUE, results="verbatim", dpi=75)
ECTOR is a R package that detects ectopic gene expression in cancer, according to a threshold determined from normal samples. ECTOR analyses then the impact of abnormal expression of these genes in patient survival, allowing the detection of potential biomarkers of cancer aggressiveness.
The illustration of the method required the import of transcriptomic data (RNAseq) of 25.360 genes for 499 samples and clinical data (experimental grouping) of 547 samples from the TCGA study of colon adenocarcinoma. The list of tissue-restricted genes (testis, placenta, embryonic stem cells and brain) has also been imported from the Epimed database. Data loaded below have been filtered in a previous vignette (00_vignette_data).
e = ector::ector_clinicals head(e) dim(e) idx_n = rownames(e)[e$tissue_status %in% "normal"] head(idx_n) length(idx_n) idx_c = rownames(e)[e$tissue_status %in% "tumoral"] head(idx_c) length(idx_c)
e
is a data matrix containing clinical information for r nrow(e)
samples. The columns represent these clinical characteristics and each row corresponds to one patient, whose ID is given in the first column (id_sample).
idx_n
indicates the ID of normal samples and idx_c
indicates the ID of cancerous samples.
d = ector::ector_transcriptome head(d[,1:4]) dim(d)
d
is a data matrix containing the RNAseq counts of each gene for r length(idx_n)
control samples and r length(idx_c)
tumoral samples, meaning r ncol(d)
samples in total. It has been obtained from the TCGA study of colon adenocarcinoma (s$data
). The rownames and colnames represent respectively the gene symbols and sample ID.
In order to know if genes are abnormally expressed in cancerous samples, it is required to establish a threshold from normal samples for all genes. The formula commonly used is mean(x) + 3*sd(x), x being the expression of a given gene in normal samples.
ector::m3sd
is the function used to determine the threshold for each gene and uses only data from normal samples (idx_n
). The function is then applied for each gene in the data matrix d
.
thresh = apply(d[,idx_n], 1, ector::m3sd) head(thresh) length(thresh)
It is possible to visualize approximately the number of samples where a given gene is abnormally expressed.
plot_gene_expr
is a function to draw a plot in which the samples are represented by a dot. The x axis is the tissue status (normal or tumoral) and the y axis represents the expression.
The plot contains also the lines corresponding to the mean and the threshold (ector::m3sd
) of the given gene and the dots found above the threshold are colored in red.
The last line of the code above illustrates how to obtain this figure for the first gene of d
.
#plot_gene_expr = function(gs, d, e, idx_n, thresh_func=ector::m3sd) { # expr = d[gs,] # plot(jitter(as.numeric(as.factor(e$tissue_status))), expr, # xlab=" ", ylab="Expression", xaxt = "n", # col = (expr>thresh_func(expr[idx_n]))+1, # main=gs # ) # tmp_tab = table(e$tissue_status) # axis(1, at = 1:2, paste0(names(tmp_tab)," (n=", tmp_tab, ")"), las = 1) # abline(h=thresh_func(expr[idx_n]), col="red") # abline(h=mean(expr[idx_n]), col="black") # legend("topleft", c("mean", "threshold"), col=1:2, lty=1) #} layout(1, respect=TRUE) ector::plot_gene_expr(gs=rownames(d)[1], d, e, idx_n, thresh_func=ector::m3sd)
In order to know if a gene is ectopically expressed in cancerous samples, it is required to compare their expression with the corresponding threshold.
prop_ectopic_expr
is a function which aims to calculate the percentage of patients in which the expression of a given gene is aberrant (higher than the threshold).
ectopic_activation
is used to apply the prop_ectopic_expr
function to each gene.
candidates
indicates the gene symbols having an aberrant expression in at least 10% of patients.
#prop_ectopic_expr = function(x, idx_c, idx_n, thresh_func=ector::m3sd) { # if (sum(is.na(x)) > 0) { # warning("NA is expression values.") # } # y = x[idx_c]>thresh_func(x[idx_n]) # res = sum(y)/length(idx_c) # return(res) #} ectopic_activation = apply(d, 1, ector::prop_ectopic_expr, idx_c, idx_n) head(ectopic_activation) length(ectopic_activation) candidates = names(ectopic_activation)[ectopic_activation>0.1] length(candidates)
The influence of ectopic gene expression on patient survival is determined by the calculation of cox p-values by using the function compute_cox_pv
.
#compute_cox_pv = function (x, idx_c, idx_n, e, thresh_func=ector::m3sd) { # ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor) # v = x[idx_c]>thresh_func(x[idx_n]) # f = suppressWarnings(survival::coxph(ss ~ v)) # sf = summary(f) # pvcox = signif(sf$wald[["pvalue"]], digits=3) # hr = signif(sf$coef[2], digits=3) # logrank = signif(sf$logtest[["pvalue"]], digits=3) # res = c(pvcox=pvcox, hr=hr, logrank=logrank) # return(res) #} cox_results = data.frame(t(apply(d[candidates,], 1, ector::compute_cox_pv, e, idx_c, idx_n,))) biomarkers = rownames(cox_results)[cox_results$pvcox<0.05 & cox_results$hr>1]
cox_results
allows to apply this function to every candidate gene.
head(cox_results) dim(cox_results)
biomarkers
gives the gene symbol for each gene whose cox p-value is less than 0.05 and whose hazard ratio is more than 1.
layout(matrix(1:2,1), respect=TRUE) plot(cox_results$hr, -log10(cox_results$pvcox), main="Volcano plot", col="grey") abline(h=-log10(0.05), lty=2, col=2) text(cox_results[biomarkers,]$hr, -log10(cox_results[biomarkers,]$pvcox), biomarkers, cex=.6) plot(-log10(cox_results$pvcox), main="Manhattan plot", col="grey") abline(h=-log10(0.05), lty=2, col=2) text(which(rownames(cox_results) %in% biomarkers), -log10(cox_results[biomarkers,]$pvcox), biomarkers, cex=.6)
As for the calculation of p-values, the function drawing survival curves (plot_survival
) requires time and censor of survival. Patients are divided in two groups, depending on the activation status of a given gene (on or off). plot_survival
is then applied to the previously determined biomarkers
.
#plot_survival = function(gs, d, idx_c, idx_n, e, thresh_func=ector::m3sd) { # expr = d[gs,] # pv = compute_cox_pv(expr, e, idx_c, idx_n, thresh_func=thresh_func) # ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor) # v = expr[idx_c]>thresh_func(expr[idx_n]) # sf = survival::survfit(ss ~ v) # plot( # sf, # xlab="Time in months", # xlim=c(0,80), # ylab="Overall survival", # main=paste0(gs, " pv_cox=", signif(cox_results[gs,]$pvcox, 3), # "\n pv_logrank=", signif(cox_results[gs,]$logrank,3), # "\n HR=", signif(cox_results[gs,]$hr,3)), # col=c(4,2) # ) # legend("topright", # c(paste0("Off (n=", sum(expr[idx_c]<=thresh_func(expr[idx_n])), ")"), # paste0("On (n=", sum(expr[idx_c]>thresh_func(expr[idx_n])), ")")), # lty=1, col=c(4,2)) #} layout(matrix(1:6,2), respect=TRUE) foo = sapply(biomarkers, ector::plot_survival, d, e, idx_c, idx_n)
The next step consists in grouping the patients depending on the number of biomarkers
ectopically expressed.
number_genes
gives for each patient the number of biomarkers that are ectopically expressed. It ranges from 0 to 13, meaning that some patients do not express any of these biomarkers ectopically while some others can express until 13 of them ectopically.
The number of genes ectopically expressed allows to define the P1, P2 and P3 groups, in order to separate the patients. These groups are respectively composed of the patients expressing 0 to 2 genes, 3 to 5 genes and more than 5 genes ectopically.
thresh_biomarkers = apply(d[biomarkers,idx_n], 1, ector::m3sd) number_genes = apply(d[biomarkers,idx_c]>thresh_biomarkers, 2, function(x){sum(x)}) patients_group = data.frame(number_genes=number_genes, group="P1", stringsAsFactors=FALSE) patients_group[patients_group$number_genes>2,]$group = "P2" patients_group[patients_group$number_genes>5,]$group = "P3"
head(patients_group) table(patients_group$number_genes) table(patients_group$group)
#ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor) #v = patients_group$number_genes # v = patients_group$group #f = suppressWarnings(survival::coxph(ss ~ v)) #sf = summary(f) #pvcox = signif(sf$wald[["pvalue"]], digits=3) #hr = signif(sf$coef[2], digits=3) #logrank = signif(sf$logtest[["pvalue"]], digits=3) #res = c(pvcox=pvcox, hr=hr, logrank=logrank) # res cox_results_grp = c(ector::compute_cox_pv_grp(e, idx_c, patients_group$number_genes))
layout(1, respect=TRUE) #ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor) #v = patients_group$group #sf = survival::survfit(ss ~ v) #plot( # sf, # xlab="Time in months", # xlim=c(0,80), # ylab="Overall survival", # main=paste0(" pv_cox=", signif(pvcox, 3), # "\n pv_logrank=", signif(logrank,3), # "\n HR=", signif(hr,3)), # col=c("blue","red","black") #) #legend("topright", # c(paste0("P1 (n=", sum(patients_group[,2]=="P1"), ")"), # paste0("P2 (n=", sum(patients_group[,2]=="P2"), ")"), # paste0("P3 (n=", sum(patients_group[,2]=="P3"), ")")), # lty=1, col=c("blue","red","black")) ector::plot_survival_grp(e, idx_c, patients_group$group)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.