In UCLA-BD2K/CV.Signature.TCP: Cardiovascular Signature Temporal Clustering Platform

knitr::opts_chunk$set(warning=FALSE, message=FALSE, cache=TRUE, fig.height = 4, fig.width = 7, fig.align = "center",tidy.opts=list(width.cutoff=80),tidy=TRUE)

During cardiovascular disease progression, multiple systems in the myocardium (e.g., cardiac proteome) undergo diverse molecular changes. The temporal patterns of individual biomolecules depict their unique responses towards pathological drivers and contribute to underlying pathogenesis. Advances in high-throughput omics technology have enabled cost-effective temporal profiling of targeted systems in animal models of human disease. Therefore, in order to understand temporal dynamics of cardiovascular diseases, we are interested in investigating model organisms and cardiac patients over time. CV.Signature.TCP extracts molecular signatures from time-series proteomics and other -omics data. This vignette introduces how to run this analysis. For its introduction and reference, please see Chung et al. (2020).

Oxidative Post-Translational Modifications (O-PTM)

We consider cysteine O-PTM occupancies in mouse over time, that are undergoing cardiac remodeling (Wang et al. 2018). Briefly, the temporal changes (sites, PTM types, and occupancy) in 3 types of cysteine O-PTMs (reversible cysteine O-PTMs; irreversible cysteine sulphinylation, CysSO2H; Cysteine sulphonylation, CysSO3H) at the proteome level were obtained using a mouse model of isoproterenol (ISO) induced cardiac remodeling. Male C57BL6/J mice (9-12 wks old) were treated with ISO or saline for 14 days (4 replicates/treatment, n=3/group); the left ventricle tissue were harvested at 6 time points (0, 1, 3, 5, 7, 14 days) and subjected to proteomics characterization to determine the cysteine O-PTM temporal features. The ratios of O-PTMs in ISO conditions over these in control conditions are calculated. These ratios represent the cysteine O-PTMs undergoing cardiac remodeling, and some of those data are included in this R package.

library(CV.Signature.TCP)
data(cys_optm)
meta <- cys_optm[,1:4]
optm <- log(cys_optm[meta$Select,5:10])
optm <- t(scale(t(optm), scale=TRUE, center=TRUE))
days <- as.numeric(colnames(optm))
ls()

head(optm)
print(days)
head(meta)

dim(optm)
class(optm)

Quick start

The main function is the eponymous function called tms(). It essentially runs the whole analysis pipeline with minimal user inputs. Three steps of denoising, clustering, and evaluation can be specified by arguments denoise, cluster.method, and evaluate, respectively. Each of these steps are described in details below. For a starter, one can run tms() on the loaded dataset cys_optm that denoise the data with fitting cubic splines, applying k-means clustering, and evaluating with the jackstraw tests.

optm.tms <- CV.Signature.TCP(optm,
                timepoints = days,
                center.dat = TRUE, #should each row centered?
                scale.dat = TRUE, #should each row scaled?
                denoise = c("smooth.spline"),
                denoise.parameter=c("cv"),
                cluster.method = c("kmeans"),
                K = 5, #a number of clusters
                evaluate = TRUE, #run a jackstraw or not?
                verbose = FALSE, #output diagnostic messages
                seed = 1
               )

For more details and fine-tuning of each tep, the tms package provide modular functions.

Denoising

Temporal structure provides a unique opportunity to de-noise time-series data, that may lack technical replicates. We provide 2 denosing techniques, using cubic splines and principal component analysis (PCA).

Cubic Splines

First, use cubic splines for denoising by fitting a cubic spline for each molecular variable and obtaining the fitted/predicted values. Temporal dynamics of O-PTM may be assumed to evolve with continuity such that smooth curves approximate their true signals. The degree of freedom (DoF) is a hyperparameter requied for cubic splines. It can be chosen automatically by cross-validation for each molecular variable.

require(splines)
# use cubic splines for denoising
# a degree of freedom for each variable is estimated through cross-validation
denoised_optm <- denoise_spline(dat = optm,
                                timepoints = days,
                                dof="cv")

On the other hand, the global option (dof="cv.global) pools all the cross-validated DoFs and use the mean value for all variables. Also, one may supply a manually chosen DoF.

# a degree of freedom is set for the whole dataset,
# by taking a mean of CV-estimated degrees of freedom 
denoised_optm_global <- denoise_spline(dat = optm,
                                timepoints = days,
                                dof="cv.global")

Principal Component Analysis (PCA)

Second, PCA is applied on the observed data, and the top $r$ PCs are used to reconstruct the data. This parameter $r$ should be given to an argument. Essentially, the $r$ rank data capture the most important patterns. PCA-based method does not utilize the time points, and instead take the major systematic variation captured by a $r$ ranked matrix. Our general recommendation for time-course experiments is to use cubic splines for denoising.

# use PCA/SVD for denoising with the top r=3 PCs, with the EM or NIPALS algorithm
denoised_optm_pca <- denoise_pca(dat = optm,
                                 timepoints = days,
                                 r=3,
                                 maxiter = 1000,
                                 method="em") #method="nipals" available

The elbow plot for PCA, showing the percent variances explained by individual principal components (PCs), may aid in determining the number of PCs to use in the denoising process. Specifically, using the singular values (which are diagonal elements in $D$) from applying SVD on the data, one can calculate the percent variances by $d^2 / \sum(d^2)$. The function pca.elbow automates this process.

pca.elbow(dat=denoised_optm,
          title="Percent variance explained")

Unsupervised Clustering

Identifying the optimal number of clusters to use is one of the key challenges in unsupervised learning which is beyond the scope of this vignette. For an introductory discussion, see \url{https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set} and algorithms in factoextra and NbClust. It is always important to consider biological and molecular contexts to appropriate select $K$, as well as considering multiple visual and analytical approaches. We provide a few visual approaches to help deciding the number of clusters. The common advice is to find the elbow (or the inflection point) when plotting the within sums of squares (wss):

library(factoextra)
cluster.elbow(dat=denoised_optm, FUNcluster=kmeans, method="wss", k.max=10)

The same function can be used to also loook at the silhouette plot (Rousseeuw, 1987). For the silhouette technique, refer to Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics.

cluster.elbow(dat=denoised_optm, FUNcluster=kmeans, method="silhouette", k.max=10)

Then, clustering is conducted as follow. We provide two ways to estimate similarity between molecular variables (e.g., proteins). First, cluster.method = "kmeans" runs a well-known K-means clustering that uses the least squared Euclidean distance. Second, cluster.method = "hclust" runs a hierarchical clustering from an output from DTW or a correlation-based dissimilarity matrix.

## cluster the denoised data using K-means clustering
clustered_optm <- cluster(denoised_optm,
                          timepoints = days,
                          cluster.method = "kmeans",
                          center.dat = TRUE,
                          scale.dat = TRUE,
                          K=5)

Jackstraw Evalution

We have clustered the denoised data from time-course proteomics or other molecular data. In this way, that m variables are clustered into K clusters. We are now interested in evaluating cluster memberships, e.g., whether these variables are correclty assigned to their given clusters. We utilize the jackstraw tests for association between the variables and the cluster centers.

require(jackstraw)
## cluster the denoised data using K-means clustering
## note that seed is provided for reproducibility
dat.evaluated <- jackstraw_kmeans(dat = denoised_optm,
                                  kmeans.dat = clustered_optm$cluster.obj,
                                  B = 1000,
                                  verbose = FALSE,
                                  seed=1)

hist(dat.evaluated$p.F, main="P-value Histogram")
summary(dat.evaluated$p.F)

From the resulting p-values, we are interested in estimating the posterior probability that a given variable should be included in its assigned cluster. This quantity is called posterior inclusion probabilities (PIPs) which help conduct feature selection (Chung, 2020). Let's estimate PIPs from p-values.

dat.evaluated$pip <- pip(dat.evaluated$p.F, lambda=.5)

# p-values are stored in dat.evaluated$p.F
# PIP are stored in dat.evaluated$pip
plot(dat.evaluated$p.F, dat.evaluated$pip, pch=20)
abline(h=.9,col="red")

This method use the lfdr estimation in the qvalue package. It attemp to automatically estimate the proportion of null variables ($\pi_0$). One may try to provide pi0 or lambda. For a certain set of p-values, this may not be feasible (see a quick primer, (\url{http://varianceexplained.org/statistics/interpreting-pvalue-histogram/})). One may set pi0 = 1 for the most conservative approach.

Alternatively, one could use other p-value adjustments or false discovery rate procedures. For example, we use the R function p.adjust to estimate FDRs using the Benjamini-Hochberg procedure.

# estimate the false discovery rates from a Benjamini-Hochberg procedure
dat.evaluated$p.adjust <- p.adjust(dat.evaluated$p.F, method="BH")

Visualization

We can visualize the clusters obtained from applying tms as follow. Particularly, vis_cluster plots individual variables over time, grouped by cluster memberships. Note that all the visualization is done by ggplot2 \url{https://ggplot2.tidyverse.org/}. The output of tms visualization functions can be stored and further manipulated following ggplot2 conventions.

require(ggplot2)
# display the plot
vis_cluster(dat = denoised_optm,
                  group = clustered_optm$membership)

# store the resulting plot for further modifications
p <- vis_cluster(dat = denoised_optm,
                       group = clustered_optm$membership)
# for example, save the plot into a pdf file
# ggsave(filename="optm_clusters.pdf", plot = p, width = 10, height = 6)

We can also use the jackstraw analysis to select O-PTMs that are shown to be strongly related to each cluster. Some O-PTMs may have been assigned to a given cluster by chance and p-values and posterior inclusion probabilities (PIPs) from the jackstraw tests can help us remove them. When using the threshold of 90% (PIP > 0.9), r sum(dat.evaluated$pip > .9) out of r length(dat.evaluated$pip) O-PTMs are deemed significant.

# use the PIPs calculated from the previous section
# only retain O-PTMs with PIPs > .8 
vis_cluster(dat = denoised_optm[dat.evaluated$pip > .8,],
                  group = clustered_optm$membership[dat.evaluated$pip > .8]) + coord_cartesian(ylim=c(-3.5, 3.5))

References

R package

NC Chung, H Choi, B Mirza, D Wang, D Sigdel, W Wang, P Ping (2020). Identifying Temporal Molecular Signatures Underlying Cardiovascular Diseases: A Data Science Platform. Under revision at Journal of Molecular and Cellular Cardiology

O-PTM data

J Wang, H Choi, NC Chung, Q Cao, DCM Ng, B Mirza, SB Scruggs, D Wang, AO Garlid, P Ping (2018). Integrated dissection of the cysteine oxidative post-translational modification proteome during cardiac hypertrophy. Journal of Proteome Research https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.8b00372

UCLA-BD2K/CV.Signature.TCP documentation built on May 15, 2020, 11:27 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com