README.md
In sealx017/MIAMI: Mutual Information-based Analysis of Multiplex Imaging data

MIAMI

Souvik Seal and Debashis Ghosh

This is an R package implementing the proposed method from the paper, “MIAMI: Mutual Information-based Analysis of Multiplex Imaging data”. The package provides a thorough pipeline for performing marker co-expression analysis of multiplex imaging data, such as Vectra (Huang et. al. 2013) and MIBI data (Keren et. al. 2019). The package also provides standalone functions for computing several mutual information (MI) theoretic measures (Principe 2010), such as EQMI*, EQMI and CSQMI for quantifying the dependence between random variables in general datasets (definitions are available below in the appendix).

We install and load the developmental version of MIAMI from GitHub.

suppressMessages(devtools::install_github('sealx017/MIAMI'))
require(MIAMI)

Next, we import the example files named, “Marker_Data.csv” and “Clinical_Data.csv”. The first one has expression data of five markers ( $p$ = 5), HLA-DR, CD45RO, H3K27me3, H3K9ac and HLA_Class_1 for 39 subjects. The second one has data of two clinical outcomes, recurrence and survival and one covariate, Age for the same set of subjects. Both of these files are extracted from the triple-negative breast cancer MIBI data first published in Keren et. al. 2018. The files have a common column named “ID” denoting subject IDs.

data("marker_data")
knitr::kable(head(marker_data), format="markdown")

| ID | HLA.DR | CD45RO | HLA_Class_1 | H3K27me3 | H3K9ac | |----:|----------:|----------:|------------:|----------:|----------:| | 1 | 0.0887767 | 0.0029441 | 0.2295286 | 0.1766168 | 0.0704011 | | 1 | 0.0118106 | 0.1407535 | 0.2150085 | 0.5186082 | 0.2832617 | | 1 | 0.0000000 | 0.0000000 | 0.0178665 | 0.2305889 | 0.0206786 | | 1 | 0.0125603 | 0.0000000 | 0.2317846 | 0.4331141 | 0.2314381 | | 1 | 0.1843811 | 0.0339638 | 0.1492079 | 0.7069783 | 0.4647193 | | 1 | 0.0832260 | 0.0000000 | 0.1986478 | 0.7122134 | 0.4879173 |

data("clinical_data")
knitr::kable(head(clinical_data), format="markdown")

| ID | Recurrence | Recurrence_time | Survival | Survival_time | Age | |----:|-----------:|----------------:|---------:|--------------:|----:| | 1 | 0 | 9 | 0 | 2612 | 77 | | 2 | 1 | 745 | 0 | 745 | 67 | | 3 | 1 | 3130 | 1 | 3130 | 42 | | 4 | 0 | 31 | 1 | 2523 | 41 | | 5 | 1 | 1683 | 1 | 1683 | 64 | | 6 | 1 | 2275 | 1 | 2275 | 53 |

We start by showing how to compute the MI-based measures, EQMI*, EQMI, and CSQMI (Principe 2010) between any arbitrary set of random variables (r.v.’s) from a general data-frame i.e., when the data does not necessarily come from multiplex imaging platforms. We create a matrix named Data_matrix with 2500 samples and 5 columns corresponding to five r.v.’s and use it in the function named QMI to estimate the measures.
A novelty of the estimation algorithm is that the quantities are computed in a step-wise fashion. It means that along with the EQMI* between the full set of markers, one can easily extract the EQMI* between several smaller sets of markers. For example, with $p$ = 5 markers, (1, 2, 3, 4, 5), the EQMI* between all the following sets of markers can be extracted in a single estimation procedure,
- (1, 2), denoted by EQMI*_12
- (1, 2, 3), denoted by EQMI*_123
- (1, 2, 3, 4), denoted by EQMI*_1234
- (1, 2, 3, 4, 5), denoted by EQMI*_12345.
The estimation algorithm requires selecting bandwidth parameters for each of the r.v.’s. the default option, bandwidth = “Hpi” uses the multivariate plug-in bandwidth matrix described in Wand, M. P., & Jones, M. C. (1994). However, in larger datasets (especially, for large $p$ ) for faster computation, bandwidth = “Ind”, which chooses bandwidth by Silverman’s rule (Silverman, B.W. (2018)) for every r.v., can be used.

Data_matrix = marker_data[1:2500, -1]
QMIs = QMI(Data_matrix, bandwidth = "HPI", measure = "All", var_names = T)
print(QMIs)
# $EQMI_star
#               Estimate
# EQMI*_12    0.01650183
# EQMI*_123   0.19007064
# EQMI*_1234  0.36213356
# EQMI*_12345 0.56960589
# 
# $EQMI
#               Estimate
# EQMI_12       6.200363
# EQMI_123    283.226354
# EQMI_1234  1574.540741
# EQMI_12345 8456.344853
# 
# $CSQMI
#               Estimate
# CSQMI_12    0.01822507
# CSQMI_123   0.27834795
# CSQMI_1234  0.55462570
# CSQMI_12345 0.96711185
# 
# $Bandwidth_parameters
#             h1           h2           h3           h4           h5
# 1 0.0001129059 3.739611e-05 0.0003045994 0.0005547427 0.0005718942

We return to the analysis of multiplex imaging data. We compute the EQMI* of all the subjects using the function QMI_all and store the values in a matrix whose every row corresponds to a subject.

EQMI_vector = QMI_all(marker_data, bandwidth = "Ind", measure = "EQMI_star", progress_bar = "False")
knitr::kable(head(EQMI_vector), format="markdown")

| ID | EQMI*_12 | EQMI*_123 | EQMI*_1234 | EQMI*_12345 | |----:|-----------:|------------:|-------------:|--------------:| | 1 | 0.0122462 | 0.1705994 | 0.3427808 | 0.5413480 | | 2 | 0.0183228 | 0.1212016 | 0.3560708 | 0.4309078 | | 3 | 0.0648835 | 0.3260319 | 0.3505626 | 0.5423636 | | 4 | 0.0076886 | 0.1274159 | 0.2140916 | 0.4323397 | | 5 | 0.0735815 | 0.3347920 | 0.3927648 | 0.5451000 | | 6 | 0.0327404 | 0.1719063 | 0.2413455 | 0.5027947 |

Using the vector of estimated EQMI* of all the subjects, we perform association analysis with two clinical outcomes, survival and recurrence. We have the outcomes, the time to death and the time to recurrence and the respective censoring indicators (= 0 for an event) as the columns of the matrix named clinical_data. We have one single covariate, Age in the same matrix. Below, we create a matrix named surv_dat with all the subject IDs and their survival outcomes, and another matrix named covariates with the subject IDs and available covariates which, in this case, is just Age. We then use the function Cox_PH to fit the proportional hazard (PH) model outputting a table of p-values. To add higher order terms in the PH model, change degree to > 1.

surv_dat = clinical_data[,c(1,4:5)]
covariates = clinical_data[,c(1,6)]
SurvCox = Cox_PH(surv_dat, covariates, EQMI_vector, degree = 1)
knitr::kable(SurvCox, format="markdown")

| Variable | p-value | |:--------------|----------:| | EQMI*_12 | 0.0685692 | | EQMI*_123 | 0.0373026 | | EQMI*_1234 | 0.0467573 | | EQMI*_12345 | 0.0150054 |

Below, we create a matrix named recur_dat with all the subject IDs and their recurrence outcomes, and use the same covariates matrix as earlier. We again use the Cox_PH function and get the p-values.

recur_dat = clinical_data[,c(1:3)]
RecurCox = Cox_PH(recur_dat, covariates, EQMI_vector, degree = 1)
knitr::kable(RecurCox, format="markdown")

| Variable | p-value | |:--------------|----------:| | EQMI*_12 | 0.0609279 | | EQMI*_123 | 0.0268636 | | EQMI*_1234 | 0.0119667 | | EQMI*_12345 | 0.0128277 |

Univariate marginal density

We provide a few basic functions to estimate and plot the univariate marginal densities of the random variables. We go back to the matrix named Data_matrix which has 2500 samples and 5 columns corresponding to five r.v.’s. We look at first two columns (call them, X_1 and X_2) and estimate their kernel density estimates as f_1 and f_2 using the function dens_univ. Next, using the function univ_dens_plot, we plot f_1 and f_2. The default number of grids is, ngrids = 1024 and the bandwidth parameter is selected using Silverman’s rule.

X_1 = Data_matrix[,1]
X_2 = Data_matrix[,2]  

f_1 = univ_dens(X_1, ngrids = 1024)
f_2 = univ_dens(X_2, ngrids = 1024)

p1 = univ_dens_plot(f_1)
p2 = univ_dens_plot(f_2)

Bivariate joint density

We estimate the bivariate joint density of X_1 and X_2, as f_12 using the function biv_dens. Next, using the function biv_dens_plot, we plot f_12. The default number of grids is, ngrids = 512 and the bandwidth matrix used is the multivariate plug-in bandwidth matrix described in Wand, M. P., & Jones, M. C. (1994). It may be easy to interpret the estimated density focusing only on the smaller values of $X_1$ and $X_2$ since both the r.v.’s usually have most of the values close to 0. Hence, we use two thresholds (q1, q2) to display the estimated density only between 70% quantiles of both the r.v.’s.


f_12 = biv_dens(cbind(X_1, X_2), ngrids = 512)
q1 = quantile(X_1, 0.7); q2 = quantile(X_2, 0.7)
p = biv_dens_plot(f_12, maxs = c(q1, q2))

a) Huang, W., Hennrick, K., & Drew, S. (2013). A colorful future of quantitative pathology: validation of Vectra technology using chromogenic multiplexed immunohistochemistry and prostate tissue microarrays. Human pathology, 44(1), 29-38.

b) Keren, L., Bosse, M., Thompson, S., Risom, T., Vijayaragavan, K., McCaffrey, E., … & Angelo, M. (2019). MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Science advances, 5(10), eaax5851.

c) Principe, J. C. (2010). Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media.

d) Keren, L., Bosse, M., Marquez, D., Angoshtari, R., Jain, S., Varma, S., … & Angelo, M. (2018). A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell, 174(6), 1373-1387.

e) Wand, M. P., & Jones, M. C. (1994). Multivariate plug-in bandwidth selection. Computational Statistics, 9(2), 97-116.

f) Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.

Here, we show the mathematical definitions of the three different MI-based measures, EQMI*, EQMI and CSQMI, implemented in this package.

alt text here

$f_{12 \ldots p}(x_{1}, x_{2}, \ldots , x_{p})$ is the joint PDF and $f_{1}(x_{1}), f_{2}(x_{2}), \ldots f_{p}(x_{p})$ are the marginal PDFs of the $p$ variables. The details of the efficient algorithm used for estimating $V_J, V_C$ and $V_M$ can be found in the manuscript, “MIAMI: Mutual Information-based Analysis of Multiplex Imaging data”.