EMVC: Entropy Minimization over Variable Clusters (EMVC) algorithm


View source: R/EMVC.R

Implementation of the EMVC algorithm. Takes an n-by-p data matrix and a c-by-p binary annotation matrix and generates an optimized (i.e., filtered) version of the annotation matrix. The optimization minimizes the entropy between each variable group and the categorical random variable representing the membership of each variable in clusters. The clusters are produced either by k-means clustering or by horizontal cuts of a dendrogram generated via agglomerative hierarchical clustering with correlation distance. Annotations are only removed during optimization, never added.
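To make the entropy idea concrete, here is a minimal sketch of the Shannon entropy of the cluster labels held by the members of one variable group. It illustrates the kind of quantity the description refers to and is not the package's internal code: the helper `label_entropy` and the toy vectors `clusters`, `group`, and `pruned` are hypothetical names introduced only for this example.

```r
# Shannon entropy (in bits) of a vector of cluster labels.
# Hypothetical helper, not part of the EMVC package.
label_entropy <- function(labels) {
  p <- table(labels) / length(labels)  # empirical cluster proportions
  p <- p[p > 0]                        # guard against unused factor levels
  -sum(p * log2(p))
}

# Toy cluster assignment for 8 variables (assumed, for illustration only)
clusters <- c(1, 1, 1, 2, 2, 3, 3, 3)

# One row of a binary annotation matrix: the group annotates variables 1, 2, 3, 8
group <- c(1, 1, 1, 0, 0, 0, 0, 1)
label_entropy(clusters[group == 1])   # members fall in clusters 1, 1, 1, 3: about 0.81 bits

# Removing the annotation on variable 8 leaves members in a single cluster
pruned <- c(1, 1, 1, 0, 0, 0, 0, 0)
label_entropy(clusters[pruned == 1])  # 0 bits
```

In this toy case, dropping one annotation lowers the entropy to zero, which matches the statement that optimization only removes annotations.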


    EMVC(data, annotations, bootstrap.iter=20, k.range=NA, clust.method="kmeans", 
        kmeans.nstart=1, kmeans.iter.max=10, hclust.method="average", 
        hclust.cor.method="spearman")

`data`	Input data matrix, observations-by-variables. Must be specified. Cannot contain missing values.
`annotations`	Binary annotation matrix, variable groups-by-variables. Must be specified.
`bootstrap.iter`	Number of bootstrap iterations. Defaults to 20. If set to 1, will return the results from a single optimization run on the input data matrix (i.e., no bootstrapping will be performed).
`clust.method`	Method used to generate variable clusters. Either "kmeans" or "hclust". Defaults to "kmeans".
`k.range`	Range of k-means k values or dendrogram cut sizes. Must be specified.
`kmeans.nstart`	Only relevant if clust.method is "kmeans". K-means nstart value. Defaults to 1.
`kmeans.iter.max`	Only relevant if clust.method is "kmeans". Max number of iterations for k-means. Defaults to 10.
`hclust.method`	Only relevant if clust.method is "hclust". Will be supplied as the "method" argument to the R function `hclust`. Defaults to "average".
`hclust.cor.method`	Only relevant if clust.method is "hclust". Will be supplied as the "method" argument to the R `cor` function. Defaults to "spearman". Represents the correlation method used to compute the dissimilarity matrix for `hclust`. Entries in the dissimilarity matrix will take the form (1-correlation)/2.
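The clustering arguments above map onto base R functions. The following sketch shows how variable clusters could be produced for each value in `k.range` with either method, using only `cor`, `hclust`, `cutree`, and `kmeans` from the stats package. It is an assumption that the package clusters variables by transposing the data for k-means; the object names (`dissim`, `tree`, `hclust.labels`, `kmeans.labels`) are hypothetical.

```r
set.seed(42)
data <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)  # observations-by-variables
k.range <- 2:10

## "hclust" route: correlation dissimilarity of the form (1 - correlation) / 2
dissim <- as.dist((1 - cor(data, method = "spearman")) / 2)
tree <- hclust(dissim, method = "average")
# One column of cluster labels per dendrogram cut size
hclust.labels <- cutree(tree, k = k.range)

## "kmeans" route: variables are columns, so cluster the rows of t(data)
## (assumed here; a short iter.max may trigger a non-convergence warning)
kmeans.labels <- sapply(k.range, function(k) {
  kmeans(t(data), centers = k, nstart = 1, iter.max = 10)$cluster
})

dim(hclust.labels)  # variables x length(k.range)
dim(kmeans.labels)  # variables x length(k.range)
```

Both routes yield one cluster label per variable for each k, which is the categorical membership variable that the description refers to.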

Optimized version of the annotation matrix. Contains the average proportion of cluster sizes in which a given annotation was kept during optimization. If bootstrapping is enabled, the optimized matrix will contain the average proportions over all bootstrap resampled datasets.
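As a rough illustration of how such a matrix of proportions can be read, the sketch below averages a few hypothetical 0/1 "kept" matrices and then applies a threshold. The three toy matrices, the helper `threshold_annotations`, and the choice of `>=` for the comparison are all assumptions made for this example; the package's own `filterAnnotations` function may behave differently.

```r
# Hypothetical indicator matrices (2 variable groups x 3 variables):
# 1 means the annotation was kept in that run, 0 means it was removed.
runs <- list(
  matrix(c(1, 1, 0,
           0, 0, 1), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0,
           0, 0, 1), nrow = 2, byrow = TRUE),
  matrix(c(1, 1, 0,
           0, 1, 1), nrow = 2, byrow = TRUE)
)

# Element-wise average: the proportion of runs in which each annotation was kept
avg.kept <- Reduce(`+`, runs) / length(runs)

# Hypothetical helper: keep annotations whose proportion reaches the threshold
threshold_annotations <- function(proportions, threshold) {
  (proportions >= threshold) * 1
}

sum(threshold_annotations(avg.kept, 0.9))  # strict: only always-kept annotations survive
sum(threshold_annotations(avg.kept, 0.1))  # lenient: anything kept at least once survives
```

A high threshold retains only annotations that survive nearly every run, while a low threshold retains almost everything that was ever kept.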

See also: `filterAnnotations`.

   ## Load the package
   library(EMVC)

   ## Create random sparse annotation matrix for 50 variable groups 
   ## and 100 variables
   annotations = matrix(rbinom(5000,1,.1), nrow=50, ncol=100)

   ## Number of initial annotations
   sum(annotations)

   ## Create random gene expression matrix for 50 observations and 100 variables 
   data = matrix(rnorm(5000), nrow=50, ncol=100)
 
   ## Execute EMVC using k-means
   EMVC.results = EMVC(data=data, annotations=annotations, 
                       bootstrap.iter=30, k.range=2:10, clust.method="kmeans", 
                       kmeans.nstart=3, kmeans.iter.max=10)

   ## Filter the results at .9 threshold
   filtered.opt.annotations = filterAnnotations(EMVC.results, .9)
   
   ## Number of optimized annotations at .9 threshold, should be close to 0 since the
   ## variable groups and data are random (i.e., no random annotations avoid 
   ## optimization-based filtering most of the time)
   sum(filtered.opt.annotations)   
   
   ## Filter the results at .1 threshold
   filtered.opt.annotations = filterAnnotations(EMVC.results, .1)
   
   ## Number of optimized annotations at .1 threshold, should be close to 
   ## the initial number of annotations since the variable groups and data are random 
   ## (i.e., no random annotations are consistently filtered by the EMVC algorithm)
   sum(filtered.opt.annotations)

[1] 464
Bootstrap iteration 10: Sampling 50 values with replacement. Optimizing 464 true annotations out of 5000
Finished optimization: 180.444444444444 annotations out of 5000
Bootstrap iteration 20: Sampling 50 values with replacement. Optimizing 464 true annotations out of 5000
Finished optimization: 182 annotations out of 5000
Bootstrap iteration 30: Sampling 50 values with replacement. Optimizing 464 true annotations out of 5000
Finished optimization: 175.666666666667 annotations out of 5000
[1] 0
[1] 464