devtools::load_all(".")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background introduction

NITUMID is the R implementation of Non-negative Matrix Factorization-based Immune-TUmor MIcroenvironment Deconvolution, a statistical framework for tumor immune microenvironment deconvolution.

During tumor development process, its interaction with the host immune system shapes a complex tumor microenvironment (TME) consisting of tumor, various infiltrating immune cells, and other non-cancerous cell types. Interests in understanding this microenvironment has been growing since the breakthrough of immunotherapy.

One simple in silico method for profiling tumor immune microenvironment is via bulk tumor gene expression data, which can be seen as a mixture of all compoenent cell types (immune cells, tumor cells, etc.) gene expression profiles. Let's consider a gene expresison matrix $Y \in R^{m\times n}$, where $m$ rows are $m$ genes and $n$ columns are $n$ samples. Under the above framework, we can have the following:

\begin{align} Y &= W\cdot H\ W &\in R^{m\times k}\ R &\in R^{k \times n} \end{align}

Here $k$ is the number of component cell types, i.e., we assume each sample's gene expression profile (each column of $Y$) is a linear mixture of the $k$ component cell types ($k$ columns of $W$) with their corresponding weights ($n$ columns of $H$). Further, since the fractions of all component cell types should sum up to 1, we requires each column of $H$ sum up to 1. Notice that since $Y$, $W$ and $H$ here are all non-negative matrices, it becones essentially a non-nagative matrix factorization problem (NMF) under regularizaiton.

Current version of NITUMID only supports tumor microenvironment deconvolution for bulk melanoma data. And current framework includes 11 component cell types: Dendritic cells, CD4+ T cell, CD8+ T cell, Macrophages, B cells, Natural killer cells (NK/NKT), Monocytes, Plasma, MAST cell, Eosinophils and melanoma. NITUMID's framework is generally applicable, the only issue limiting us from extending to more tumor types is training data, and we expect to support more tumor types in near future.

Common microarray or RNA-Seq give measurements for >10000 genes' expression, most of which are not significantly different across cell types, so they would not help deconvolution but instead adding more computational burden. So it is reasonable to only choose a small subset of genes (a.k.a signature genes) whose expression levels help us differentiate among cell types.

The ideal signature genes are those who are exclusively expressing in one specific cell types. Also, if one gene is expression mostly in a few of our component cell types, it should also help.

In our current version of NITUMID, we curated a list of 53 signature genes for our 11 cell types, their information can be seen in signature_marker_melanoma in this package:

head(signature_marker_melanoma)

Cell tells you whcih specific cell type is this gene a signature gene to. If a gene has no-NA Secondary_cells values, that means despite this gene's significant expression level in one cell type, in the Secondary_cells, it also has reasonable expression level in those secondary cell types.

Back to our NMF framework $Y=W\times H$, now we have $m=53$. Recall that the i-th column of $W$ represents the gene expression profile (of these 53 genes) in the i-th cell type, so for the signature genes for i-th cell type, their values in this column should be significantly larger. However, NMF by itself cannot achieve this kind of structure, so we introduce a guide matrix $A$ that helps us impose our desired strucutre for signature matrix. $A$ is a matrix with same dimension as $W$, and it is a trichotomous matrix consists of $1,0.5,0$. An entry 1 for a specific cell type means the corresponding gene is highly expressed in that cell type, an entry 0 means that the gene is not expressed in the cell type, and an entry 0.5 means that the gene has an intermediate expression level in that cell type.

However, to make it compatible with our optimization setup, the $A$ for argument should be $\mathbb{1}-A$, in our manuscript, we defined the A that used in argument as \tilde{A}. Start now on, we will denote the guide matrix still as $A$ but the argument input A as $\tilde{A}$

You can take a look at our current $\tilde{A}$ whose variable name is A_melanoma_v5

dim(A_melanoma_v5)
A_melanoma_v5[1:3,]

Another issue with NMF is its instability, which comes as a price for flexibility. To get over this issue, for each of the NMF run, we use different parameters and choose the most stable setup. So for each run of NITUMID, on top of the original dataset, we will do a permutation and measure their results' consistency. NITUMID will return a consistency table as well as factorization results under different parameters, and we usually choose the factorization results under the most stable parameters.

For full details about how we set up the optimization proceudure, choose parameter and consistenct measurement, please see our manuscript for more details.

Getting start with NITUMID

The raw data you should prepare for NITUMID should be a data matrix with its rows genes and columns samples, its rowname is required to be gene symbol, for example, the following data matrix is gene expression profiles for 20 CD8+ T cells ```{R,cache=T} dim(ematrix_GSE6740_CD8) rownames(ematrix_GSE6740_CD8)[1:10] class(ematrix_GSE6740_CD8)

The first step should be finding the signature genes in the rows of `ematrix_GSE6740_CD8`, which can be done by `Signature_Match` function:

```{R,cache=T}
signature_found <- Signature_Match(gene_expression = ematrix_GSE6740_CD8,signature_genes = as.character(signature_marker_melanoma$gene_symbol))

#return signature genes' row indexes
signature_found$matched_index
#which siganture genes are not found in the dataset
signature_found$missing_row_index

Signature_Match takes two/three inputs:

  1. Your raw gene expression matrix
  2. A vector of all signature gene symbols, default values are the 53 genes from us
  3. Optional, if some of the signature genes have alias, you can specify them with an extra vector.

Signature_Match returns two values: 1. matched_index is a numeric vector contains all siganture genes' corresponding rows in the dataset 2. For those siganture genes that don't present in the dataset, their indexes are also recorded missing_row_index. Notice that those missing siganture genes still have a default value 1221 in matched_index.

Once you obtain output from Signature_Match, you can proceed on to the main function NITUMID.NITUMID takes sevral arguments, for now we will go through those most important ones:

If `if.bulk=F`, `NITUMID` returns two values:

* A consistency table that tells you the results' consistencies under different parameter setups
* result: a list of results under different parameter setups
```{R,cache=T}
GSE6740_CD8_out$consistency_table

Here since all of the consistencies are 1, we just show the results of the second one: ```{R,cache=T} GSE6740_CD8_out$result[[2]]$H

As we can see here, all the samples have 1s as their CD8 proportions, i.e., all the samples are classified to CD8+ T cell.

Let's look at another example of bulk tumor data, the proceudure are the same except `if.bulk=FALSE`.

Here `Nivolumab_rna_sub` is a 53 by 119 matrix whose rows are the 53 signature genes (we omitted the step for finding signature genes)

```{R,cache=T}
dim(Nivolumab_rna_sub)
head(Nivolumab_rna_sub[,1:4])
NIVO_out <- NITUMID(Y = Nivolumab_rna_sub,A = A_melanoma_v5,row_index = seq(1,53),if.bulk = T,row_mean = row_mean_v5)
NIVO_out$consistency_table
NIVO_out$result[[1]]$H

Notice that when if.bulk=TRUE, we only use one specific parameters setup, so you do not need to choose anymore, the NIVO_out$result will be a list length 1, containing $W$ and $H$

Easy Way Out

In a lot of cases, if you just want to use all the default settings and take a quick look, NITUMID_simple wraps everything up for you. You only need to input the raw $Y$ and specify if the data is bulk, and that's it.

One thing you need to keep in mind is: in most cases NITUMID_simple will pick up the results for the best consistency for you. However, if the consistency tabl has multiple "best" results, it would still return consistency table and a list results for you to manually decide. ```{R,cache=T} NITUMID_out_simple <- NITUMID_simple(Y=ematrix_GSE6740_CD8,if.bulk = F) NITUMID_out_simple$consistency_table NITUMID_out_simple$result[[2]]$H

## Test Mode
This part is intended for advanced use of NITUMID. Test Mode in NITUMID is pramarily for benchmarking NITUMID's performance with data that we know the underlying true cell proportions.

Consider now we have one example, if we do know all the 11 cell types' fractions in it, we would have a vector $h_{T}$ length 11, and each of its entry should be the corresponding cell type's proportion. Let's further denote the 11 cell types' proportions we obtained from NITMID is $\hat{h}$. Then we can measure their consistencies in 3 different metrics: 

* Pearson Correlation
* Spearman Correlation
* K-L Divergence

When `test_mode=TRUE`, two additional arguments are required:

* real_H: the underlying true proporions for each sample, it is supposed to have 11 rows and same columns number as sample size.
* correlation: which of the metric you would like to use for the consistency measurement, "p" for Pearson Correlation, "s" for Spearman correlation and "kl" for K-L divergence.

For example, in our previous case for CD8+ T cells, since we know all the 20 samples are CD8+ T cells, we could create the following `real_H_cd8` matrix:
```{R}
real_H_cd8 <- matrix(rep(0,20*11),nrow=11)
real_H_cd8[3,]<-1

Then we could benchmark NITUMID with this result:

```{R,cache=T} GSE6740_CD8_out_benchmark <- NITUMID(Y = ematrix_GSE6740_CD8[signature_found$matched_index,],A = A_melanoma_v5,row_index = setdiff(seq(1,53),signature_found$missing_row_index),if.bulk = F,test_mode = T,row_mean = row_mean_v5,real_H = real_H_cd8)

GSE6740_CD8_out_benchmark$real_corr_table

Notice that here the outcome has one additional field called `real_corr_table`, which is a list of 3 that gives the mean Pearson correlation between estimated fractions and underlying true fractions under 3 different parameters. Unsurprisingly, we get 1 here!

Another common circumstance is that when your known cell types do not match the 11 cell types in NITUMID, in this case, we offer an argument `cell_structure` to adjust for that. `cell_structure` is a data frame with two variables: 
* `origina_index`: a vector from 1-12, indicating the 11 cell types in NITUMID
* `destin_index`: a vector indicating the matching relationship between your cell types and NITUMID's cell types. For example, if you only have T cells instead of CD4+ T cells and CD8+ T cells, you can put 2 on both the second and third entry in `destin_index`. In this case, when calculating consistency, the fractions of CD4+ T cells and CD8+ T cells from NITUMID will be merged to one cell type and compare with the 2nd cell type in the `real_H` you provide.

Let's look at the following exmaple, `real_mix_gene_expr` is a 53 by 4 matrix containing gene expression profiles for 4 samples. And their component cell fractions are known:
```{R}
dim(real_mix_gene_expr)
real_H_mix

Here, we can see that even though we do know their compositions, we still only have CD4, CD8, B, NK and Melanoma cells, all other cell types are categorized as other (thus real_H matrix is then 6 by 4 instead of 11 by 4). To test NITUMID's performance on this data, we created the following cell structure:

cell_structure_real_mix

Here we denote all other cell types as the 6th cell type, while CD4, CD8, B, NK and melanoma are 1-5, respectively. And we can evaluate NITUMID's performance as follow:

{R,cache=T} real_mix_benchamrk<-NITUMID(Y=real_mix_gene_expr,A = A_melanoma_v5,row_index = seq(1,53),if.bulk = T,test_mode = T,real_H = real_H_mix,cell_structure = cell_structure_real_mix,correlation = "p") real_mix_benchamrk$real_corr_table

So here we see that the mean correlation across all 4 samples are 0.55.



tdw1221/NITUMID documentation built on May 31, 2019, 1:54 a.m.