suppressPackageStartupMessages(
  suppressWarnings({
  library(impute)
  library(graphite)
  library(clipper)
  library(org.Hs.eg.db)
}))

Introduction

In this tutorial we are going to analyze the dataset of metabolic activities published in Terunuma et al (2017). In this paper the Authors characterized the transcriptomic and metabolomic profile of several types of human breast tumors to uncovered metabolite signatures using an untargeted discovery approach. They found that the oncometabolite 2-hydroxyglutarate (2HG) accumulates at high levels in a subset of tumors and discovered an association between increased 2HG levels and MYC pathway activation in breast cancer. As an example of mixed metabolites and genes topological analyses here we are going to use both the dataset of metabolite abundances and gene expression as reported in the supplementary files selecting only tumour samples and comparing ER+ and ER- breast cancers. We will use pathway information from the KEGG and SMPDB databases to identify a set of features whose activity changes significantly between the two sample classes. This result will hopefully hint at some specific biological activities that are pathologically altered in tumoral samples.

Data Preparation

We start by loading the experimental data into R. The measurements are stored in two simple text files, one for the genes and one for the metabolites. Thanks to the utility functions provided by R, we can combine the download and the actual reading of a table in a single operation. We also specify the details of the file format: the first row of input represents the header of the table; the tabulation character \t separates values; we don't want R to change strings into factors.

gene_url <-
  url("https://romualdi.bio.unipd.it/wp-uploads/2018/04/Terunuma_gene_expr.txt")
gexpr <- read.table(gene_url, header = TRUE, sep = "\t", row.names = 1,
                    stringsAsFactors = FALSE)
metab_url <-
  url("https://romualdi.bio.unipd.it/wp-uploads/2018/04/Terunuma_metabolite_expr.txt")
mexpr <- read.table(metab_url, header = TRUE, sep = "\t", row.names = NULL,
                    stringsAsFactors = FALSE)

The loaded table for the metabolites contains r nrow(mexpr) rows and r ncol(mexpr) columns, but not all of them actually correspond to measurements. We ask R to list the names of the columns:

colnames(mexpr)

As you can see the first four columns actually contain metabolite identifiers. We store their indices in a variable for future reference.

idcols <- 1:4

Missing Values

We display the first few rows of the table to get a feeling for its content. For the moment, we concentrate on the metabolite measurements.

head(mexpr[,5:10])

It looks like there is a significant number of missing values. Let's plot a distribution of the fraction of NAs for each row.

nas_by_row <-
  rowSums(is.na(data.matrix(mexpr[,-idcols]))) /
  (ncol(mexpr) - length(idcols))
hist(nas_by_row, xlab = "Fraction of NA values", ylab = "Number of rows",
     main = NULL)

We cannot make good use of rows with too many missing values. We decide to drop them from our dataset. Specifically, we are going to remove anything with more than 50% of NAs.

dexpr <- mexpr[nas_by_row < 0.5,]

This operation leaves us with r nrow(dexpr) of the original r nrow(dexpr) rows.

We are not yet satisfied, though. The downstream analysis won't be able to cope with any NA. What would happen if we were to apply a more stringent procedure, removing any NA?

sum(rowSums(is.na(dexpr[,-idcols])) == 0)

The above command is a little bit dense, so let's go through it in small steps. First we select all columns, except the identifiers. Then we check each value, testing if it's an NA. We tally how many NAs we have in each row and we consider only those in which the count is zero (meaning, no NA at all). The outer sum tells us how many rows would survive our filter.

The final verdict is quite grim. Only r sum(rowSums(is.na(dexpr[,-idcols])) == 0) out of r nrow(dexpr) measurements would be used. We could do much better using a strategy that has become quite common in cases like this: imputation.

Instead of implementing it ourselves, we are going to use the excellent BioConductor package impute.

library(impute)
iexpr <- cbind(dexpr[,idcols],
               impute.knn(data.matrix(dexpr[,-idcols]))$data,
               stringsAsFactors = FALSE)
head(iexpr[, 1:6])

We can explicitly check there are not NAs left:

sum(is.na(iexpr[,-idcols]))

Missing Identifiers

We now concentrate on the metabolite identifiers. We start again from the first few rows.

head(iexpr[,idcols])

The CAS identifiers have a trailing ; we don't need and there are a number of empty strings, which really represent missing values. We fix both these issues.

iexpr$CAS <- sub("\\s*;.*$", "", iexpr$CAS)
iexpr[,idcols][iexpr[,idcols] == ""] <- NA
head(iexpr[,idcols])

We get a measure of how many identifiers are present or missing:

summary(is.na(iexpr[,idcols]))

We have a Metabolon ID for each metabolite in the matrix, while in the case of CAS identifiers we find only r nrow(iexpr)-sum(is.na(iexpr$CAS)) usable rows.

Unfortunately, at this point graphite does not support the Metabolon IDs. We do rely on CAS even if that means loosing a significant fraction of the rows.

valid_cas <- !is.na(iexpr$CAS)
cas_col <- which(names(iexpr) == "CAS")
cexpr <- iexpr[valid_cas, c(cas_col, seq.int(5, ncol(iexpr)))]
head(cexpr[,1:6])

There is no reason at this point to keep the identifiers as a column inside the dataset. We move such information to the row names and transform the data.frame into a numeric matrix.

mexpr <- data.matrix(cexpr[,-1])
rownames(mexpr) <- paste("CAS", cexpr$CAS, sep = ":")
colnames(mexpr) <- colnames(gexpr)
head(mexpr[,1:6])

Finally, we convert the metabolite levels by taking the logarithm and we merge the two matrices (one for the genes and one for the metabolites) together.

fexpr <- rbind(data.matrix(gexpr), log(mexpr))

Pathway Analysis

To make sense of the changes in metabolic activity recorded in the data we have just loaded, we are going to use pathway information from KEGG (via the graphite package) and a statistical analysis capable of exploiting the topology of such pathways (here we'll rely on clipper). clipper performs a test on means and covariances between two conditions and it is able to deal with variables of heterogeneous nature.

We import the required packages:

library(graphite)
library(clipper)

clipper will need three pieces of information:

  1. a set of pathways
  2. the activity matrix (mexpr in our case)
  3. a vector representing the two sample groups we are going to compare

KEGG Pathways

Getting pathways is quite easy thanks to graphite.

kpaths <- pathways("hsapiens", "kegg")

The above command retrieves all KEGG pathways for Homo sapiens (r length(kpaths) in total). We take a peek at the first entry:

kpaths[[1]]

This summary doesn't tell us which identifiers are used for the nodes in the pathway. We can get that from the list of edges.

head(edges(kpaths[[1]], "mixed"))

As we might expect, KEGG pathways are using KEGG compounds IDs. Since we're relying on CAS in our data, we should convert them. convertIdentifiers to the rescue!

rpaths <- convertIdentifiers(convertIdentifiers(kpaths, "ENSEMBL"), "CAS")

The edges of the first pathway have been converted into:

head(edges(rpaths[[1]], "mixed"))

To make the rest of the analysis more robust, we are going to filter pathways requiring that 1) metabolites represent at least 30% of the nodes and 2) at least 10 edges connect genes or metabolites for which we have experimental measurements.

filter_pathways <- function(pathways, expr, min_edge_num) {
  node_names <- rownames(expr)

  pred <- function(p) {
    ns <- nodes(p, "mixed")
    if (length(ns) == 0) {
      return(FALSE)
    }

    frac <- length(grep("^CAS:", ns)) / length(ns)
    if (frac < 0.3) {
      return(FALSE)
    }

    es <- edges(p, "mixed")
    mask <- (paste(es$src_type, es$src, sep = ":") %in% node_names) &
            (paste(es$dest_type, es$dest, sep = ":") %in% node_names)
    sum(mask) >= min_edge_num
  }

  Filter(pred, pathways)
}

fpaths <- filter_pathways(rpaths, fexpr, 10)

Now about sample classes. We are going to split our dataset into two: POS TUMOR samples in the first group and NEG TUMOUR samples in the other. clipper wants from us a vector with as many entries as there are samples. An entry should be 1 if the corresponding samples belongs to the first class, or 2 in the other case.

pos_indices <- grep("^POS", colnames(fexpr))
neg_indices <- grep("^NEG", colnames(fexpr))

classes <- numeric(length = ncol(fexpr))
classes[neg_indices] <- 1
classes[pos_indices] <- 2
classes

We use the helper function runClipper to start our analysis. Note that we explicitly require metabolites from the pathways.

out <- runClipper(fpaths, fexpr, classes, "mean", "mixed",
                  maxNodes = 150, seed = 42)

We check that there are no errors and the we extract the list of pathways which appears to be significantly altered between the two conditions according to clipper.

stopifnot(length(out$results) == 10)
names(out$results)

We use an helper function to visualize the results.

library(org.Hs.eg.db)

plot_altered_path <- function(result, pathways, node_scale = 2) {
  title <- names(result)
  pathway <- pathways[[title]]

  graph <- pathwayGraph(pathway, which = "mixed")

  labels <- sapply(nodes(graph), function(n) {
    n2 <- sub("^CAS:", "", n)
    if (n != n2) {
      n2
    } else {
      mapIds(org.Hs.eg.db, sub("^ENSEMBL:", "", n), "SYMBOL", "ENSEMBL",
             multiVas = "first")
    }
  })
  names(labels) <- nodes(graph)

  altered <- unlist(strsplit(result[[1]][1, "pathGenes"], "[,;]"))
  selected <- nodes(graph) %in% altered
  node_colors <- ifelse(selected, "red", "black")
  names(node_colors) <- nodes(graph)

  base <- 0.75
  heights <- ifelse(selected, base * node_scale, base)
  names(heights) <- nodes(graph)
  widths <- ifelse(selected, base * node_scale, base)
  names(widths) <- nodes(graph)

  base <- 14
  fontsizes <- ifelse(selected, base * node_scale, base)
  names(fontsizes) <- nodes(graph)

  between_altered <- function(edge_names, altered) {
    sapply(edge_names, function(edge_name) {
      nodes <- unlist(strsplit(edge_name, "~", fixed = TRUE))
      all(nodes %in% altered)
    })
  }

  edge_colors <- ifelse(between_altered(edgeNames(graph), altered),
                        "red", "black")

  plot(graph,
       attrs = list(edge = list(arrowsize = 0.5)),
       nodeAttrs = list(label = labels, color = node_colors, width = widths,
                        height = heights, fontsize = fontsizes),
       edgeAttrs = list(color = edge_colors),
       recipEdges = "combined", main = title)
}
selection <- c(
  "Alanine, aspartate and glutamate metabolism",
  "Arginine biosynthesis",
  "D-Glutamine and D-glutamate metabolism"
)
for (idx in seq_along(out$results)) {
  res <- out$results[idx]
  if (names(res) %in% selection) {
    plot_altered_path(res, fpaths)
  }
}

graphite analyses highlight "Alanine, aspartate and glutamate metabolism" and "Arginine biosynthesis". These pathways enclose most of the genes and metabolites discussed in Terunuma et al. that follow the abundance pattern of 2HG such as N-acetyl amino acids, especially the mitochondrial N-acetyl-aspartate (NAA), the glutaminase (GLS1) protein, and the aspartoacylase (ASPA), which is generally reduced in breast tumors but most significantly within the ER-negative tumor group which suggests a possible cause for increased tumor-associated NAA. Another graphite pathway worthy of note is "D-Glutamine and D-glutamate metabolism" reported also in Terunuma et al. because it contains L-Glutamine (CAS:56-85-9) and L-Glutamic Acid (CAS:56-86-0), two of the most altered metabolites between ER+ and ER- breast tumours.

SMPDB Pathways

spaths <- pathways("hsapiens", "smpdb")
fpaths <- filter_pathways(
  convertIdentifiers(convertIdentifiers(spaths, "ENSEMBL"), "CAS"),
  fexpr,
  10)
out <- runClipper(fpaths, fexpr, classes, "mean", "mixed",
                  maxNodes = 150, seed = 42)
stopifnot(length(out$results) == 190)
names(out$results)

Using the SMPDB database we obtained 190 significant pathways among which we found "The oncogenic action of 2-hydroxyglutarate" that is the pathway describing the role of 2HG in tumour environment.



sales-lab/graphite documentation built on Oct. 15, 2023, 9:23 a.m.