Scans a SummarizedExperiment container (e.g. as returned by the hetset function) for feature sets that form phenotypical subpopulations.
scan_hetset(H, level = "univariate", min_size = 2, max_size = 10,
            rel_imp = 0, em_steps = 2)
H: SummarizedExperiment container holding the data (e.g. as returned by the hetset function)

level: type of scan; "univariate" (default) or "bivariate"

min_size: minimum number of features to be selected

max_size: maximum number of features to be selected

rel_imp: minimum relative importance of each selected feature in terms of its contribution to the squared Hellinger distance

em_steps: number of EM steps at each iteration (default = 2); reduce to increase the impact of the initial state
This function provides an unsupervised version of the forward subset selection (FSS) algorithm. The goal is to extract the subset of features from the first assay in a SummarizedExperiment container that hosts the sample mixture with the highest degree of dissimilarity in the fitted multivariate densities.
FSS is well known in regression analysis and related fields, where the dimension of the data needs to be reduced. In conventional FSS, at each iteration (i.e. when extending the best model with k variables by one more), each remaining variable is added in turn to the current model and the resulting fit is evaluated in terms of its residuals. At the end, when the model has grown to the full model, the candidate that performed best with regard to some information criterion is chosen.
In contrast to subset selection procedures in a supervised learning setting, the true classification of samples into biological subgroups may be unknown, so one faces an unsupervised learning setting.
The unsupervised FSS replaces the quantification of the error of a fit by the degree of dissimilarity between the components of the two-component mixture model.
The dissimilarity is measured as Hellinger's squared distance for normal distributions.
As an adaptation of an information criterion, each selected feature can be evaluated in terms of its contribution to the squared Hellinger distance (see evaluate_set). Once the added value of a feature in the signature falls below the threshold rel_imp, the scan is stopped.
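For two univariate normal densities, the squared Hellinger distance has a closed form. The sketch below (the helper name sq_hellinger is illustrative, not part of the package; hetset itself works with the multivariate generalisation) shows how it quantifies dissimilarity on a 0-to-1 scale:

```r
## Squared Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2).
## Closed form for univariate normals; 0 for identical densities,
## approaching 1 for well-separated ones.
sq_hellinger <- function(mu1, s1, mu2, s2) {
  1 - sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
    exp(-(mu1 - mu2)^2 / (4 * (s1^2 + s2^2)))
}

sq_hellinger(0, 1, 0, 1)  # identical components -> 0
sq_hellinger(0, 1, 3, 1)  # well-separated components -> about 0.68
```

A feature whose inclusion barely increases this quantity adds little to the separation of the two subpopulations, which is the rationale behind the rel_imp stopping rule.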
When neither a feature of interest nor a pre-partitioning of the data is given, all features are fitted separately (univariate) or pairwise (bivariate) by a two-component normal mixture model.
The maximum likelihood fit is performed by the mclust toolbox, which uses an EM algorithm to estimate the parameters of the mixture and the associated partitioning.
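What mclust does for a single feature can be sketched with a hand-rolled two-component EM in base R (em_fit2 is a hypothetical helper for illustration only; the package relies on mclust's model selection and multivariate models):

```r
## Minimal base-R sketch of a two-component univariate normal mixture
## EM fit, illustrating the kind of estimation mclust performs.
em_fit2 <- function(x, steps = 50) {
  ## crude initial split at the median
  z <- as.numeric(x > median(x))
  p <- 0.5
  mu <- c(mean(x[z == 0]), mean(x[z == 1]))
  s <- c(sd(x), sd(x))
  r <- z
  for (i in seq_len(steps)) {
    ## E-step: posterior responsibility of component 2 for each sample
    d1 <- (1 - p) * dnorm(x, mu[1], s[1])
    d2 <- p * dnorm(x, mu[2], s[2])
    r <- d2 / (d1 + d2)
    ## M-step: update mixture weight, means and standard deviations
    p <- mean(r)
    mu <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
    s <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
           sqrt(weighted.mean((x - mu[2])^2, r)))
  }
  list(prop = p, mean = mu, sd = s, resp = r)
}

set.seed(1)
x <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))
fit <- em_fit2(x)
```

For this well-separated toy sample, the fitted means land near 0 and 4 and the mixture weight near 0.5.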
In case a particular set of features is of special interest, the researcher is free to set it as an initial state; the data of this particular set is then fitted by mclust.
A pre-partitioning of (some of) the samples is also an option to guide the search for informative heterogeneities. In such a case, the mclust fit is replaced by direct estimation and maximization steps.
In all non-mclust maximization steps, instead of the maximum-likelihood assignment of each sample to the component with the higher density, the scan_hetset function assigns the samples probabilistically to the components, in order to avoid the location bias that would otherwise result for overlapping components.
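The probabilistic assignment step can be sketched as follows (assign_prob is a hypothetical helper, not a package function): given each sample's posterior responsibility for one component, the label is drawn at random from that posterior rather than taken as the argmax, so samples in the overlap region end up in both components in proportion to their posterior probabilities:

```r
## Draw subpopulation labels from posterior responsibilities instead of
## hard argmax assignment; this avoids the location bias that a hard
## split introduces when the two components overlap.
assign_prob <- function(resp) {
  ## resp: posterior probability of belonging to subpopulation "B"
  ifelse(runif(length(resp)) < resp, "B", "A")
}

set.seed(1)
assign_prob(c(0.05, 0.5, 0.95))  # e.g. "A" "B" "B"
```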
Returns a SummarizedExperiment object with metadata about the selected features, the parameters of the mixture components (subpopulations) and the labelling of the samples.
metadata$slf: set of selected features

metadata$prm.full: list with mean vector and covariance matrix for the selected features

metadata$prm.A: list with mean vector and covariance matrix for subpopulation A

metadata$prm.B: list with mean vector and covariance matrix for subpopulation B

metadata$prp.A: mixture coefficient indicating the relative number of samples assigned to subpopulation A

metadata$sqHell: Hellinger's squared distance between prm.A and prm.B

prt: labels indicating the subpopulation each sample was assigned to
Daniel Samaga
X <- matrix(data = rnorm(10*20),ncol = 50)
H <- hetset::hetset(D = X)
H <- scan_hetset(H,rel_imp = 0.01,em_steps = 5)
X1 <- c(rnorm(50,0,1),rnorm(50,3,1))
X2 <- c(rnorm(50,0,1),rnorm(50,2,1))
X3 <- c(rnorm(50,0,1),rnorm(50,2,2))
X4 <- c(rnorm(50,0,1),rnorm(50,1,1))
X5 <- c(rnorm(50,0,1),rnorm(50,1,2))
A <- matrix(data = rnorm(n = 1000,mean = 0,sd = 1),ncol = 100)
Hds <- hetset(D = rbind(X1,X2,X3,X4,X5,A))
rm(A,X1,X2,X3,X4,X5)
Hds <- scan_hetset(H = Hds,level = "univariate",min_size = 2,
max_size = 5,rel_imp = 0.01,em_steps = 5)
plot_hetset(Hds)
data("TCGA_HNSCC_expr")
H <- subset_hetset(H = H,keep.features = sample(x = H@NAMES,size = 5,FALSE))
H <- censor_data(H)
Hds <- scan_hetset(H = H,level = "univariate",min_size = 2,
max_size = 3,em_steps = 5)
plot_hetset(Hds)
Hds <- scan_hetset(H = H,level = "bivariate",min_size = 2,
max_size = 4,em_steps = 4)
plot_hetset(Hds)
rm(Hds)
Hds <- subset_hetset(H = H,keep.samples = c(rep(TRUE,400),
rep(FALSE,ncol(H)-400)))
Hds$prt <- as.factor(rep(c("A","B"),times = 200))
Hds <- scan_hetset(H = Hds,level = "univariate",min_size = 2,
max_size = 4,em_steps = 4)
plot_hetset(Hds)