In tinyheero/RHL30: An R package for the RHL30 Predictor

library("knitr") 

opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)

RHL30

An R Package for the RHL30 prognostic predictor. The predictor is a gene expression-based prognostic model for predicting post-autologous stem-cell transplantation outcomes. It designed to be used on RHL30 NanoString expression count data on relapsed Hodgkin lymphoma (RHL) samples.

The predictor was published at:

Chan FC*, Mottok A*, et al. Prognostic Model to Predict Post-Autologous Stem-Cell Transplantation Outcomes in Classical Hodgkin Lymphoma. J Clin Oncol JCO2017727925 (2017) doi:10.1200/JCO.2017.72.7925. *Contributed equally to this work.

How to Install

To install this package, you need to first have the package devtools installed, then you run:

devtools::install_github("tinyheero/RHL30")

How to use

We will be using the BCCA RHL30 training cohort from the paper as an example of how to generate RHL30 predictor score. The following steps will reproduce the RHL30 scores from the paper.

First, let's load the RHL30 package and the RHL30 model:

library("RHL30")
library("dplyr")
rhl30_model_df <- get_rhl30_model_coef_df()
rhl30_model_df

The model contains a total of 30 genes:

18 genes that make the model
12 housekeeper genes that are used to normalize the data

The next step is to load the expression data you want to generate RHL30 scores on. The expression data should be a tab-separated values file. The first line should be a header line with gene_name as the first column followed by the sample identifiers. Each row should then be the name of the gene and then the respectively raw expression values for each sample.

The expression data of the BCCA RHL30 training cohort is provided as an example. Let's load that data:

exprs_file <- 
  system.file("extdata", "bcca_rhl_rhl30_gene_exprs_mat.tsv", package = "RHL30")
exprs_mat <- load_exprs_mat(exprs_file)
dim(exprs_mat)

The expression data contains the 30 genes (rows) and 68 samples (columns). Next we calculate the normalizer values (geometric mean of the 12 housekeepers) for each sample:

hk_genes <- 
  filter(rhl30_model_df, gene_type == "housekeeper") %>%
  pull("gene_name")

sample_normalizer_values <- get_sample_normalizer_value(exprs_mat, hk_genes)

In the paper, a threshold of 35 was set to exclude poor quality samples. This was done because very low normalizer values often lead to very high normalized expression values. We can apply this threshold to eliminate poor quality samples:

high_quality_samples <- 
  names(sample_normalizer_values[sample_normalizer_values > 35])
filtered_exprs_mat <- exprs_mat[, high_quality_samples]
dim(filtered_exprs_mat)

This eliminates 2 poor quality samples leaving us with 66 samples. Note that the sample HL1120 did not receive ASCT and thus was not reported in figure 4 of the paper. As such, the final number in figure 4 is 65 samples.

Let's normalize our expression matrix and generate the RHL30 scores for each sample:

filtered_exprs_mat_norm <- 
  normalize_exprs_mat(filtered_exprs_mat, sample_normalizer_values)
rhl30_df <- get_rhl30_scores_df(filtered_exprs_mat_norm, rhl30_model_df)
head(rhl30_df)