Explore Heterogeneity

This vignette introduces the hetset package, which supports the analysis of high-dimensional real-valued data. Its aim is the identification of previously unknown subpopulations in supposedly homogeneous samples. In exploratory data analysis, where data sets are screened for unexpected patterns, the evaluation of this kind of heterogeneity is essential. The package provides functions for scanning data, visualizing the returned subpopulations, and testing for association with sample information, all of which are introduced in this vignette.

Wherever quantitative analyses aim at statistically significant differences between two distinct groups of samples, homogeneity (similarity) within a group and heterogeneity (dissimilarity) between the groups form the implicit foundation of the analysis. Despite the ubiquity of this problem, the catalogue of tools for multivariate samples is rather short. For instance, data exploration may be assisted feature-wise by boxplots or pairwise by scatterplots and other visualization tools, but all of these are limited by the missing knowledge about potential subgroup memberships within each group of interest.

The hetset package addresses this problem with the search algorithm scan_hetset, which extracts a subset of variables (called a signature) with the greatest potential to separate sample subgroups. The suggested signature either splits the data convincingly, and thus generates a set of hypotheses to be tested, or it supports the notion of a homogeneous data set. In detail, the algorithm simultaneously constructs the signature feature by feature and assigns every sample to one of the subgroups based on that signature. In statistical terms, this is realized by transferring the forward subset selection approach from the supervised machine learning setting to the unsupervised setting: the objective of maximizing likelihoods (or minimizing residuals) is replaced by the maximization of a dissimilarity measure between the densities of a fitted mixture model. The model fitting in each iteration is realized in an EM-algorithm-like approach.

Heterogeneity is modelled by a two-component normal mixture model. Note that violation of the normality assumption by real-world data, which reliably invalidates p-values in hypothesis-testing scenarios, is far less harmful to the hetset concept. This robustness arises from the restricted use of the model fit: it only ranks the candidate signatures and is not used to evaluate probabilities or infer the number of subpopulations. The model is therefore not used to falsify hypotheses but rather to capture the characteristic spread of the data.

The scan_hetset function maximizes, in an unsupervised forward subset selection approach, the squared Hellinger distance between the densities of the model's two components fitted to the data.
Unlike model-based clustering approaches, the objective function does not assume feature-wise additivity of the dissimilarity measure, which allows the data to be split along highly correlated pairs of features. Furthermore, the feature-selecting nature of this approach allows straightforward interpretation of the results, in contrast to factor-analyzer approaches, where a small number of principal-component-like vectors is calculated that themselves retain full dimensionality. The simple inner EM-like structure of the algorithm also allows alternative entry points into the iterations, for instance to evolve signatures around predefined subsets of samples or features.

Installation

To install hetset, execute

# install_github is provided by the remotes (or devtools) package
# install.packages("remotes")
remotes::install_github("ZytoHMGU/hetset")

Once installed, run

nextSeed <- rpois(n = 1, lambda = 10^6)  # draw a seed to restore at the end
set.seed(1410)                           # fixed seed for a reproducible vignette
library(hetset)

to load the package.

Data

Expression data was downloaded from The Cancer Genome Atlas (TCGA) via the Bioconductor package curatedTCGAData. RNA sequencing data (RNASeq2GeneNorm) of Head and Neck Squamous Cell Carcinoma (HNSC) were selected. In order to reduce computational costs, the set of genes was reduced to "Signature 1" as presented in Buitrago-Pérez et al., Current Genomics, 2009.

data("TCGA_HNSCC_expr")  # loads the SummarizedExperiment object H
dim(H)
summary(H)
H <- hetset(H)           # convert to a hetset container

Preprocessing

In case outliers may be present in the data, it is recommended to censor the data before scanning for heterogeneity. This means that values outside the $k$-fold standard deviation around the mean are shifted to this boundary.

H <- censor_data(H = H,k = 3.5)
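The effect of censoring can be illustrated on a single feature vector. The helper censor_vector below is a simplified stand-in for the per-feature behaviour of censor_data, not package code:

```r
# Simplified stand-in for the censoring step: for one feature, values
# outside mean(x) +/- k * sd(x) are shifted to that boundary.
censor_vector <- function(x, k = 3.5) {
  lo <- mean(x) - k * sd(x)
  hi <- mean(x) + k * sd(x)
  pmin(pmax(x, lo), hi)
}

set.seed(7)
x  <- c(rnorm(100), 50)   # one extreme outlier
xc <- censor_vector(x)
max(xc)                   # the outlier is pulled back to the upper boundary
```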

HetSet Scan

Concept

In a scan of the data, a two-component mixture model [ f(x|\theta) = \pi f(x|\mu_A,\Sigma_A) + (1-\pi) f(x|\mu_B,\Sigma_B) ] is fitted to a (growing) subset $s$ of the features $\mathcal{F}$. As an objective function, the squared Hellinger distance $H^2(A,B)$ between the two components $A$ and $B$ is maximized.
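For a single feature, such a mixture density can be written down directly in R (a sketch with illustrative parameter values, not package code):

```r
# Two-component univariate normal mixture density f(x | theta),
# with mixing weight p and component means/standard deviations.
dmix <- function(x, p, muA, sdA, muB, sdB) {
  p * dnorm(x, muA, sdA) + (1 - p) * dnorm(x, muB, sdB)
}

# A valid density integrates to 1 (illustrative parameters).
v <- integrate(function(x) dmix(x, 0.4, -1, 1, 2, 0.5), -Inf, Inf)$value
v
```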

scan_hetset offers three basic ways to enter the scanning procedure. Besides the raw data, subgroup information for (some) samples or features of interest can be included in the initialization of the scan.

Depending on the initialization, a full partitioning $(\xi_i)_{i=1:N}, \xi_i \in \{A,B\}$ of the samples as well as a feature (vector) $s$ are calculated. Then the remaining features in $\mathcal{F}\setminus s$ are sequentially added to the feature vector, forming candidate signatures. For each candidate signature, a fixed number of EM steps is performed (i.e. in an estimation (E) step, the samples are assigned to the subpopulations, and in a maximization (M) step, the parametrization is adapted to maximize the likelihood of the observed samples). As the last step, the squared Hellinger distance is calculated for each candidate signature, and the one with the highest distance is chosen as the updated signature.

This procedure is repeated until the relative importance (explained in ...) of some feature falls below the threshold or the length of the signature exceeds the given limit.

Thereby the splitting of the samples grows along features that support the existing separation. With deregulated pathways in mind, a split of the samples along one feature should cause a consistent trend in the other features of the pathway.
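The forward-selection loop can be illustrated with a self-contained toy example; here k-means and the between/total sum of squares serve as stand-ins for the package's EM fit and the squared Hellinger distance:

```r
# Toy sketch of unsupervised forward selection. k-means replaces the
# EM fit and betweenss/totss replaces the squared Hellinger distance.
set.seed(1)
X <- cbind(matrix(rnorm(200), nrow = 100, ncol = 2),  # two noise features
           c(rnorm(50, -2), rnorm(50, 2)))            # one separating feature

score <- function(cols) {
  km <- kmeans(X[, cols, drop = FALSE], centers = 2, nstart = 5)
  km$betweenss / km$totss        # how well two clusters separate
}

sig <- integer(0)
for (step in 1:2) {
  cand <- setdiff(seq_len(ncol(X)), sig)
  gain <- sapply(cand, function(j) score(c(sig, j)))
  sig  <- c(sig, cand[which.max(gain)])   # greedily grow the signature
}
sig  # with this seed, the separating column (3) is selected first
```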

Scanning Raw Data

The default scan will run on a data set without preselected features or preassigned samples.

By executing

Hs <- scan_hetset(H = H,min_size = 2,max_size = 3,rel_imp = 0.01,em_steps = 2)
Hs@metadata$slf

the SummarizedExperiment container $H$ is scanned for heterogeneous subsets.

Thereby, the arguments of scan_hetset guide the scan: min_size and max_size bound the length of the signature, rel_imp sets the relative-importance threshold, and em_steps fixes the number of EM steps per candidate signature.

The result of the scan is again a SummarizedExperiment container, with additional fields in the metadata slot that parametrize the fitted mixture model. In addition, the partitioning of the samples into the subpopulations is stored in the prt entry of the sample data.

Visualization of Results

The most important tool in exploratory data analysis is the human eye. Thus, undoubtedly, visualization of the scanning result is the first thing to do once the scan is finished. For signatures of length 1, a histogram is plotted. Signatures of length 2 are visualized in a scatterplot, and longer signatures result in a matrix of pairwise scatterplots.

plot_hetset(Hs)

Evaluate Features of a Signature

Besides the visualization by plot_hetset, the members of the proposed signature can be evaluated with regard to their contribution to the separating potential of the signature.

For this purpose, the contribution of each member is quantified as the loss of separation of the subpopulations after removing the feature from the signature. This loss of separation is defined as the difference of the squared Hellinger distances with and without this feature. Consider a set $S = \{X, Y, Z\}$ of features that yields a squared Hellinger distance of $0.8$. Assume removing $X$ from $S$ would result in a distance of $0.2$; then we say $X$ has a contribution of $0.6$, which equals an explained distance of $0.6/0.8 = 75\%$. Scaling each contribution by the total sum of contributions (which does not equal $1$ in general) is called the relative importance.
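With hypothetical numbers in the spirit of the example above, the bookkeeping looks as follows (the distances for Y and Z are made up for illustration):

```r
# Signature {X, Y, Z} with total squared Hellinger distance 0.8.
H2_full    <- 0.8
H2_without <- c(X = 0.2, Y = 0.5, Z = 0.6)  # distance after removing each

contrib <- H2_full - H2_without   # loss of separation per feature
contrib / H2_full                 # explained distance (X: 0.6/0.8 = 75%)
contrib / sum(contrib)            # relative importance (scaled to sum 1)
```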

eval_out <- hetset::evaluate_set(H = Hs)

Evaluate Dependencies

Further, the association of the partitioning of the samples with other sample information can be tested by Fisher's exact tests. To test for independence from age, sex or HPV status, show_dependencies can be executed. The splitting of the samples according to sample information is achieved by setting a reference level: a logical state or a numeric threshold.

show_dependencies(H = Hs,dep_list = c("years_to_birth","gender",
                                      "patient.hpv_status_by_ish_testing"),
                  S_list = list(50,"male","positive"))

A short summary of the tested list (confidence intervals not adjusted for multiple testing) can be added to the current plot by setting the argument type="plot".
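Conceptually, each entry amounts to a Fisher's exact test of a dichotomized sample variable against the A/B partition. A minimal stand-alone analogue with simulated vectors (not the package's output):

```r
# Fisher's exact test of a binary covariate against a partition.
set.seed(42)
prt   <- rep(c("A", "B"), each = 25)                 # A/B partition
covar <- c(rbinom(25, 1, 0.9), rbinom(25, 1, 0.1))   # strongly associated
pv    <- fisher.test(table(prt, covar))$p.value
pv    # very small p-value, i.e. a clear association
```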

Evaluating Samples

For further analysis of the signature and/or sample subpopulations, it can be beneficial to focus on those samples that are clearly assigned to one of the subpopulations and to exclude those that are ambiguous or differ from both subpopulations.

These two criteria are implemented in the function show_partition by thresholding, for each sample, the maximum distance to the nearest subpopulation center (inner-quantile) and requiring a minimum ratio of the likelihoods given one or the other group membership (cutoff). A list of clearly assigned samples is returned and can be visualized by

showPart_out <- show_partition(H = Hs,cutoff = 0.99,inner = 0.8)
showPart_out[1:10,1:6]
Hs$sample_color <- showPart_out$group
Hs@metadata$color_type = "indiv"
plot_hetset(Hs)

Note that this does not address the uncertainty of the proposed parametrization returned from the purely data-driven scan, but it supports the analysis of the proposed subpopulations.

Further Visualization Options

The samples may also be colored by other sample information (e.g. the HPV status).

Hs@metadata$color_type = "patient.hpv_status_by_ish_testing"
Hs@metadata$slf <- Hs@metadata$slf[1:2]
Hs <- estimate_densities(Hs)
plot_hetset(Hs)
show_dependencies(H = Hs,dep_list = c("years_to_birth","gender",
                                      "patient.hpv_status_by_ish_testing"),
                  S_list = list(50,"male","positive"),type = 'plot')

These are the basic tools to scan for and visualize heterogeneity in data. Before explaining how to subset data for further analysis, a little more detail on the algorithm is given.

Methodology

A data matrix $X \in \mathbb{R}^{n \times p}$ of $n$ samples is considered, each sample represented by a $p$-dimensional vector. It is assumed that the data can be modelled sufficiently well by a two-component normal mixture model
[ P(x|\theta) = \pi \cdot f(x|\mu_A,\Sigma_A) + (1-\pi) \cdot f(x|\mu_B,\Sigma_B).]

Since we are interested in the heterogeneous patterns of $X$, we are looking for subsets $s$ of the features of $X$ on which the samples can be partitioned into distinguishable subpopulations.

[ P(x_i^s|\theta) = \pi \cdot f(x_i^s|\mu_A,\Sigma_A) + (1-\pi) \cdot f(x_i^s|\mu_B,\Sigma_B).]

This model is fitted to the data, and the samples are assigned to the subpopulations according to these densities ($f(x)$). When no information about group memberships is given, an EM algorithm is applied to fit the model. Note that the assignment of samples to the subpopulations differs from conventional estimation steps in EM algorithms: assigning all samples to the subpopulation with the higher likelihood introduces a bias in the estimated distance of the group centers. Therefore, the samples are assigned stochastically to the centers with probabilities proportional to the relative likelihoods (see Concept).

For a given partitioning of the data, a parameter vector [\hat{\theta} = (\hat{\pi}, \hat{\mu}_A, \hat{\Sigma}_A, \hat{\mu}_B, \hat{\Sigma}_B) ] is estimated by maximum likelihood.

The squared Hellinger distance between two multivariate normal distributions $P = \mathcal{N}(\mu_A, \Sigma_A)$ and $Q = \mathcal{N}(\mu_B, \Sigma_B)$ is [\hspace{-1.5cm} H^2(P, Q) = 1 - \frac{\text{det}(\Sigma_A)^{1/4} \text{det}(\Sigma_B)^{1/4}}{\text{det}\left(\frac{\Sigma_A + \Sigma_B}{2}\right)^{1/2}} \exp\left[ -\frac{1}{8}(\mu_A-\mu_B)^T \left(\frac{\Sigma_A + \Sigma_B}{2}\right)^{-1} (\mu_A-\mu_B)\right]]
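This distance can be computed directly in base R (a sketch, assuming the means are numeric vectors and the covariance matrices are positive definite):

```r
# Squared Hellinger distance between two multivariate normals,
# computed from the Bhattacharyya coefficient bc.
hellinger2 <- function(muA, SigmaA, muB, SigmaB) {
  S  <- (SigmaA + SigmaB) / 2
  d  <- muA - muB
  bc <- det(SigmaA)^(1/4) * det(SigmaB)^(1/4) / det(S)^(1/2) *
        exp(-(1/8) * as.numeric(t(d) %*% solve(S) %*% d))
  1 - bc
}

h0 <- hellinger2(c(0, 0), diag(2), c(0, 0), diag(2))  # identical components
h1 <- hellinger2(c(0, 0), diag(2), c(5, 0), diag(2))  # well separated
c(h0, h1)  # 0 and a value close to 1
```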

In each step, ML estimates of the means and covariance matrices of the subpopulations are first calculated given the current partitioning of the samples. Second, the samples are reassigned to the subpopulations given their likelihoods under the two subpopulation densities (i.e. $P(\xi_j = A) = \frac{f(x_j|\hat{\mu}_A,\hat{\Sigma}_A)}{f(x_j|\hat{\mu}_A,\hat{\Sigma}_A) + f(x_j|\hat{\mu}_B,\hat{\Sigma}_B)}$).
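For a single feature, this stochastic reassignment can be sketched in base R (a toy illustration with made-up parameters, not the package internals):

```r
# Probability of assigning sample x to subpopulation A, proportional
# to the relative likelihoods of the two fitted components.
p_A <- function(x, muA, sdA, muB, sdB) {
  fA <- dnorm(x, muA, sdA)
  fB <- dnorm(x, muB, sdB)
  fA / (fA + fB)
}

p <- p_A(c(-2, 0, 2), muA = -2, sdA = 1, muB = 2, sdB = 1)
round(p, 3)   # symmetric case: 1.000 0.500 0.000

set.seed(1)
xi <- ifelse(runif(3) < p, "A", "B")  # stochastic draw, not argmax
```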

Prepare Further Analysis

Subpopulations

Scanning identified subpopulations for further heterogeneity is an obvious step towards a consistent picture of the data. In this section, the preparation of such a scan is shown; the scanning and analysis of the results are the same as presented in the scanning section.

Select a subpopulation "A" (or "B") and reset the current signature by executing

HA <- subset_hetset(H = Hs,prt = "A")
HA@metadata$slf <- NULL

In a scan, you will find that subgroup "A" seems to be quite homogeneous, whereas subgroup "B" shows further heterogeneity.

More Signatures

A set of samples can be separable with respect to different characteristics, each represented by its own signature. Therefore, excluding the most important feature of the first signature can reveal further sources of heterogeneity within the same population.

H2 <- subset_hetset(H = Hs,remove.features = eval_out$Feature[1])
H2@metadata$slf <- NULL 

prepares such a scan.

Informed Search

The scan procedure is also able to construct a signature that best separates the samples around predefined subgroups (e.g. male/female, HPV positive/negative, older/younger than 50 years, ...).

The search algorithm might evolve away from the predefined partitioning, but it is guided along the biggest difference between the initial subgroups. The odds ratio of the initial grouping variable can be seen as an indicator of the degree to which the returned group is associated with the initial characteristic.

Hg <- H
Hg <- hetset::make_partition_by_logical(H = Hg,log_vec = H$gender == "male")

will prepare the data for such a scan.

Targeted Search

If a feature of interest should be addressed in a heterogeneity scan, this can be done by prespecifying it in the slf field of the metadata slot. If the prt slot and the parametrization of the SE container are set to NULL, the scan will start with an mclust fit of the specified feature data.

Final

References

...

Session information.

set.seed(nextSeed)  # restore the seed drawn at the beginning
rm(nextSeed)
print(sessionInfo())


ZytoHMGU/hetset documentation built on June 6, 2019, 2:16 p.m.