knitr::opts_chunk$set(collapse=TRUE, comment = "#>", fig.width=9, fig.height=6, eval=TRUE, echo=TRUE, results="verbatim", dpi=75)

Introduction

ECTOR is a R package that detects ectopic gene expression in cancer, according to a threshold determined from normal samples. ECTOR analyses then the impact of abnormal expression of these genes in patient survival, allowing the detection of potential biomarkers of cancer aggressiveness.

Dataset

The illustration of the method required the import of transcriptomic data (RNAseq) of 25.360 genes for 499 samples and clinical data (experimental grouping) of 547 samples from the TCGA study of colon adenocarcinoma. The list of tissue-restricted genes (testis, placenta, embryonic stem cells and brain) has also been imported from the Epimed database. Data loaded below have been filtered in a previous vignette (00_vignette_data).

e = ector::ector_clinicals
head(e)
dim(e)

idx_n = rownames(e)[e$tissue_status %in% "normal"]  
head(idx_n)
length(idx_n)
idx_c = rownames(e)[e$tissue_status %in% "tumoral"]
head(idx_c)
length(idx_c)

e is a data matrix containing clinical information for r nrow(e) samples. The columns represent these clinical characteristics and each row corresponds to one patient, whose ID is given in the first column (id_sample).

idx_n indicates the ID of normal samples and idx_c indicates the ID of cancerous samples.

d = ector::ector_transcriptome
head(d[,1:4])
dim(d)

d is a data matrix containing the RNAseq counts of each gene for r length(idx_n) control samples and r length(idx_c) tumoral samples, meaning r ncol(d) samples in total. It has been obtained from the TCGA study of colon adenocarcinoma (s$data). The rownames and colnames represent respectively the gene symbols and sample ID.

Method

Establishment of gene expression threshold

In order to know if genes are abnormally expressed in cancerous samples, it is required to establish a threshold from normal samples for all genes. The formula commonly used is mean(x) + 3*sd(x), x being the expression of a given gene in normal samples.

ector::m3sd is the function used to determine the threshold for each gene and uses only data from normal samples (idx_n). The function is then applied for each gene in the data matrix d.

thresh = apply(d[,idx_n], 1, ector::m3sd)
head(thresh)
length(thresh)

Plotting gene expression

It is possible to visualize approximately the number of samples where a given gene is abnormally expressed.

plot_gene_expr is a function to draw a plot in which the samples are represented by a dot. The x axis is the tissue status (normal or tumoral) and the y axis represents the expression. The plot contains also the lines corresponding to the mean and the threshold (ector::m3sd) of the given gene and the dots found above the threshold are colored in red.

The last line of the code above illustrates how to obtain this figure for the first gene of d.

#plot_gene_expr = function(gs, d, e, idx_n, thresh_func=ector::m3sd) {
#  expr = d[gs,]
#  plot(jitter(as.numeric(as.factor(e$tissue_status))), expr, 
#  xlab=" ", ylab="Expression", xaxt = "n",
#       col = (expr>thresh_func(expr[idx_n]))+1, 
#       main=gs
#  )
#  tmp_tab = table(e$tissue_status)
#  axis(1, at = 1:2, paste0(names(tmp_tab)," (n=", tmp_tab, ")"), las = 1)
#  abline(h=thresh_func(expr[idx_n]), col="red")
#  abline(h=mean(expr[idx_n]), col="black")
#  legend("topleft", c("mean", "threshold"), col=1:2, lty=1)
#}

layout(1, respect=TRUE)
ector::plot_gene_expr(gs=rownames(d)[1], d, e, idx_n, thresh_func=ector::m3sd)

Determination of ectopically expressed genes

In order to know if a gene is ectopically expressed in cancerous samples, it is required to compare their expression with the corresponding threshold.

prop_ectopic_expr is a function which aims to calculate the percentage of patients in which the expression of a given gene is aberrant (higher than the threshold).

ectopic_activation is used to apply the prop_ectopic_expr function to each gene.

candidates indicates the gene symbols having an aberrant expression in at least 10% of patients.

#prop_ectopic_expr = function(x, idx_c, idx_n, thresh_func=ector::m3sd) {
#  if (sum(is.na(x)) > 0) {
#    warning("NA is expression values.")    
#  }
#  y = x[idx_c]>thresh_func(x[idx_n])
#  res = sum(y)/length(idx_c)
#  return(res)
#}

ectopic_activation = apply(d, 1, ector::prop_ectopic_expr, idx_c, idx_n)
head(ectopic_activation)
length(ectopic_activation)
candidates = names(ectopic_activation)[ectopic_activation>0.1]
length(candidates)

Univariate Survival

p-values

The influence of ectopic gene expression on patient survival is determined by the calculation of cox p-values by using the function compute_cox_pv.

#compute_cox_pv = function (x, idx_c, idx_n, e, thresh_func=ector::m3sd) {
#  ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor)
#  v = x[idx_c]>thresh_func(x[idx_n])
#  f = suppressWarnings(survival::coxph(ss ~ v))
#  sf = summary(f)
#  pvcox = signif(sf$wald[["pvalue"]], digits=3)
#  hr = signif(sf$coef[2], digits=3)
#  logrank = signif(sf$logtest[["pvalue"]], digits=3)
#  res = c(pvcox=pvcox, hr=hr, logrank=logrank)
#  return(res)
#}

cox_results = data.frame(t(apply(d[candidates,], 1, ector::compute_cox_pv, e, idx_c, idx_n,)))
biomarkers = rownames(cox_results)[cox_results$pvcox<0.05 & cox_results$hr>1]

cox_results allows to apply this function to every candidate gene.

head(cox_results)
dim(cox_results)

biomarkers gives the gene symbol for each gene whose cox p-value is less than 0.05 and whose hazard ratio is more than 1.

layout(matrix(1:2,1), respect=TRUE)
plot(cox_results$hr, -log10(cox_results$pvcox), main="Volcano plot", col="grey")
abline(h=-log10(0.05), lty=2, col=2)
text(cox_results[biomarkers,]$hr, -log10(cox_results[biomarkers,]$pvcox), biomarkers, cex=.6)
plot(-log10(cox_results$pvcox), main="Manhattan plot", col="grey")
abline(h=-log10(0.05), lty=2, col=2)
text(which(rownames(cox_results) %in% biomarkers), -log10(cox_results[biomarkers,]$pvcox), biomarkers, cex=.6)

Curves

As for the calculation of p-values, the function drawing survival curves (plot_survival) requires time and censor of survival. Patients are divided in two groups, depending on the activation status of a given gene (on or off). plot_survival is then applied to the previously determined biomarkers.

#plot_survival = function(gs, d, idx_c, idx_n, e, thresh_func=ector::m3sd) {
#  expr = d[gs,]
#  pv = compute_cox_pv(expr, e, idx_c, idx_n, thresh_func=thresh_func)
#  ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor)
#  v = expr[idx_c]>thresh_func(expr[idx_n])
#  sf = survival::survfit(ss ~ v)
#  plot(
#    sf, 
#    xlab="Time in months",
#    xlim=c(0,80),
#    ylab="Overall survival", 
#    main=paste0(gs, " pv_cox=", signif(cox_results[gs,]$pvcox, 3),
#                "\n pv_logrank=", signif(cox_results[gs,]$logrank,3),
#                "\n HR=", signif(cox_results[gs,]$hr,3)),
#    col=c(4,2)
#  )
#  legend("topright",
#        c(paste0("Off (n=", sum(expr[idx_c]<=thresh_func(expr[idx_n])), ")"),
#        paste0("On (n=", sum(expr[idx_c]>thresh_func(expr[idx_n])), ")")),
#  lty=1, col=c(4,2))
#}

layout(matrix(1:6,2), respect=TRUE)
foo = sapply(biomarkers, ector::plot_survival, d, e, idx_c, idx_n)

Patients grouping

The next step consists in grouping the patients depending on the number of biomarkers ectopically expressed.

number_genes gives for each patient the number of biomarkers that are ectopically expressed. It ranges from 0 to 13, meaning that some patients do not express any of these biomarkers ectopically while some others can express until 13 of them ectopically.

The number of genes ectopically expressed allows to define the P1, P2 and P3 groups, in order to separate the patients. These groups are respectively composed of the patients expressing 0 to 2 genes, 3 to 5 genes and more than 5 genes ectopically.

thresh_biomarkers = apply(d[biomarkers,idx_n], 1, ector::m3sd)
number_genes = apply(d[biomarkers,idx_c]>thresh_biomarkers, 2, function(x){sum(x)})
patients_group = data.frame(number_genes=number_genes, group="P1", stringsAsFactors=FALSE)
patients_group[patients_group$number_genes>2,]$group = "P2"
patients_group[patients_group$number_genes>5,]$group = "P3"
head(patients_group)
table(patients_group$number_genes)
table(patients_group$group)
#ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor)
#v = patients_group$number_genes
# v =  patients_group$group
#f = suppressWarnings(survival::coxph(ss ~ v))
#sf = summary(f)
#pvcox = signif(sf$wald[["pvalue"]], digits=3)
#hr = signif(sf$coef[2], digits=3)
#logrank = signif(sf$logtest[["pvalue"]], digits=3)
#res = c(pvcox=pvcox, hr=hr, logrank=logrank)
# res

cox_results_grp = c(ector::compute_cox_pv_grp(e, idx_c, patients_group$number_genes))
layout(1, respect=TRUE)
#ss = survival::Surv(e[idx_c,]$os_months, e[idx_c,]$os_censor)
#v =  patients_group$group
#sf = survival::survfit(ss ~ v)
#plot(
#  sf, 
#  xlab="Time in months",
#  xlim=c(0,80),
#  ylab="Overall survival", 
#  main=paste0(" pv_cox=", signif(pvcox, 3),
#              "\n pv_logrank=", signif(logrank,3),
#              "\n HR=", signif(hr,3)),
#  col=c("blue","red","black")
#)
#legend("topright",
#    c(paste0("P1 (n=", sum(patients_group[,2]=="P1"), ")"),
#      paste0("P2 (n=", sum(patients_group[,2]=="P2"), ")"),
#      paste0("P3 (n=", sum(patients_group[,2]=="P3"), ")")),
#    lty=1, col=c("blue","red","black"))

ector::plot_survival_grp(e, idx_c, patients_group$group)

Material and methods

Session Information

sessionInfo()


mpetrier/ector documentation built on April 4, 2020, 3:59 p.m.