scan_hetset: Scan Heterogeneity in 'SummarizedExperiment' Container


Description

Scans SummarizedExperiment container (e.g. as returned by hetset function) for feature sets that form phenotypical subpopulations.

Usage

scan_hetset(H, level = "univariate", min_size = 2, max_size = 10,
            rel_imp = 0, em_steps = 2)

Arguments

H

SummarizedExperiment object

level

univariate (level = "univariate", the default) or bivariate (level = "bivariate") scan

min_size

minimum number of features to be selected

max_size

maximum number of features to be selected

rel_imp

minimum relative importance of each selected feature in terms of its contribution to the squared Hellinger distance

em_steps

number of EM steps at each iteration (default = 2); reduce to increase the impact of the initial state

Details

This function provides an unsupervised version of the forward subset selection (FSS) algorithm. The goal is to extract the subset of features from the first assay in a SummarizedExperiment container that hosts the sample mixture with the highest degree of dissimilarity between the fitted multivariate densities.

FSS is well known in regression analysis and related fields, where the dimension of the data needs to be reduced. In conventional FSS, at each iteration (i.e. when extending the best model with k variables by one more), all remaining variables are added to the current model one at a time, and each resulting fit is evaluated in terms of its residuals. At the end, when the model has grown to the full model, the candidate that performed best with regard to some information criterion is chosen.
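The conventional regression-style FSS loop described above can be sketched as follows (a minimal illustration using base R's lm(); the data and variable names are hypothetical and not part of hetset):

```r
# Conventional forward subset selection for a linear model (illustrative sketch).
# At each iteration, every remaining predictor is tried in turn, and the one
# giving the lowest residual sum of squares is added to the model.
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100),
                x2 = rnorm(100), x3 = rnorm(100))
selected  <- character(0)
remaining <- c("x1", "x2", "x3")
while (length(remaining) > 0) {
  rss <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    sum(resid(lm(f, data = d))^2)
  })
  best      <- names(which.min(rss))
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
}
# 'selected' records the order in which predictors entered the model;
# in practice, the model minimizing an information criterion is kept.
```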

In contrast to subset selection procedures in a supervised learning setting, the true classification of samples into biological subgroups might be unknown, so one faces an unsupervised learning setting. The unsupervised FSS replaces the quantification of the error of a fit by the degree of dissimilarity between the components of the two-component mixture model. The dissimilarity is measured as Hellinger's squared distance for normal distributions. As an adaptation of an information criterion, each selected feature can be evaluated in terms of its contribution to the squared Hellinger distance (see evaluate_set). Once the added value of a feature in the signature falls below the threshold rel_imp, the scan stops.
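For two univariate normal densities, the squared Hellinger distance has a closed form, which illustrates the dissimilarity measure used here (a sketch only; the function name sq_hellinger_norm is hypothetical, and hetset's own implementation works on multivariate normals):

```r
# Squared Hellinger distance between N(m1, s1^2) and N(m2, s2^2):
#   H^2 = 1 - sqrt(2*s1*s2 / (s1^2 + s2^2)) *
#             exp(-(m1 - m2)^2 / (4 * (s1^2 + s2^2)))
sq_hellinger_norm <- function(m1, s1, m2, s2) {
  1 - sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
    exp(-(m1 - m2)^2 / (4 * (s1^2 + s2^2)))
}
sq_hellinger_norm(0, 1, 0, 1)  # identical densities: 0
sq_hellinger_norm(0, 1, 3, 1)  # clearly separated means: large distance
```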

When neither a feature of interest nor a pre-partitioning of the data is given, all features are fitted separately (univariate) or pairwise (bivariate) by a two-component normal mixture model. The maximum-likelihood fit is performed by the mclust toolbox, which uses an EM algorithm to estimate the parameters of the mixture and the associated partitioning.
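A minimal illustration of such a univariate two-component fit (assuming the mclust package is installed; the simulated data are hypothetical):

```r
library(mclust)
set.seed(42)
x   <- c(rnorm(50, 0, 1), rnorm(50, 3, 1))  # mixture of two normal components
fit <- Mclust(x, G = 2)                     # maximum-likelihood fit via EM
fit$parameters$mean                         # estimated component means
table(fit$classification)                   # hard partitioning of the samples
```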

In case a set of features is of special interest, the researcher is free to set it as the initial state; the data of this particular set is then fitted by mclust. A pre-partitioning of (some of) the samples is another option to guide the search for informative heterogeneities. In such a case, the mclust fit is replaced by direct estimation and maximization steps.

In all non-mclust maximization steps, instead of the maximum-likelihood assignment of each sample to the component with the higher density, the scan_hetset function assigns the samples to the components probabilistically, in order to avoid the location bias that hard assignment causes for overlapping components.
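The difference between the two assignment rules can be sketched for a univariate two-component mixture as follows (an illustration only, with hypothetical parameter values; posterior responsibilities are computed from the current parameter estimates, and labels are drawn rather than taken as the maximum):

```r
# Hard vs. probabilistic assignment for a two-component normal mixture.
set.seed(1)
x  <- c(rnorm(50, 0, 1), rnorm(50, 1, 1))  # heavily overlapping components
pA <- 0.5                                  # current mixture coefficient
dA <- pA * dnorm(x, mean = 0, sd = 1)
dB <- (1 - pA) * dnorm(x, mean = 1, sd = 1)
post_A <- dA / (dA + dB)                   # posterior responsibility for A

hard_labels <- ifelse(post_A > 0.5, "A", "B")               # maximum-likelihood rule
prob_labels <- ifelse(runif(length(x)) < post_A, "A", "B")  # probabilistic draw
# With overlapping components, the hard rule systematically pushes the fitted
# means apart (location bias); the random draw keeps borderline samples
# represented in both components.
```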

Value

Returns a SummarizedExperiment object with metadata about the selected features, the parameters of the mixture components (subpopulations) and the labelling of the samples.

metadata$slf

set of selected features

metadata$prm.full

list with mean vector and covariance matrix for the selected features

metadata$prm.A

list with mean vector and covariance matrix for subpopulation A

metadata$prm.B

list with mean vector and covariance matrix for subpopulation B

metadata$prp.A

mixture coefficient indicating the relative number of samples assigned to subpopulation A

metadata$sqHell

Hellinger's squared distance of prm.A and prm.B

prt

labels indicating the subpopulation a sample was assigned to

Author(s)

Daniel Samaga

See Also

hetset

Examples

X <- matrix(data = rnorm(10*20),ncol = 50)
H <- hetset::hetset(D = X)
H <- scan_hetset(H,rel_imp = 0.01,em_steps = 5)

X1 <- c(rnorm(50,0,1),rnorm(50,3,1))
X2 <- c(rnorm(50,0,1),rnorm(50,2,1))
X3 <- c(rnorm(50,0,1),rnorm(50,2,2))
X4 <- c(rnorm(50,0,1),rnorm(50,1,1))
X5 <- c(rnorm(50,0,1),rnorm(50,1,2))
A <- matrix(data = rnorm(n = 1000,mean = 0,sd = 1),ncol = 100)
Hds <- hetset(D = rbind(X1,X2,X3,X4,X5,A))
rm(A,X1,X2,X3,X4,X5)
Hds <- scan_hetset(H = Hds,level = "univariate",min_size = 2,
                   max_size = 5,rel_imp = 0.01,em_steps = 5)
plot_hetset(Hds)

data("TCGA_HNSCC_expr")
H <- subset_hetset(H = H,keep.features = sample(x = H@NAMES,size = 5,FALSE))
H <- censor_data(H)
Hds <- scan_hetset(H = H,level = "univariate",min_size = 2,
    max_size = 3,em_steps = 5)
plot_hetset(Hds)

Hds <- scan_hetset(H = H,level = "bivariate",min_size = 2,
    max_size = 4,em_steps = 4)
plot_hetset(Hds)

rm(Hds)
Hds <- subset_hetset(H = H,keep.samples = c(rep(TRUE,400),
                                            rep(FALSE,ncol(H)-400)))
Hds$prt <- as.factor(rep(c("A","B"),times = 200))
Hds <- scan_hetset(H = Hds,level = "univariate",min_size = 2,
    max_size = 4,em_steps = 4)
plot_hetset(Hds)

ZytoHMGU/hetset documentation built on June 6, 2019, 2:16 p.m.