supcluster: Clustering of Features Supervised by an Outcome

supcluster R Documentation

Description

We assume that each individual has a set of features and an outcome. We further assume that the features are organized in clusters with a random effect for each cluster, and that the outcome is related to the random effects by a linear regression. The function supcluster runs an MCMC sampler to estimate the parameters of this model, including the cluster membership of each feature. The program can also perform the estimation without considering the outcome. The outcome can be any data object, as long as it is related to the individual through a frailty.
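
To make the model concrete, the following is a minimal sketch that simulates data with the structure described above. It is not code from the package: the sample sizes, standard deviations, coefficients, and every variable name (u, membership, beta, sim) are assumptions chosen only for illustration.

```r
# Hypothetical simulation of the model described above; all names and
# parameter values are illustrative assumptions, not part of supcluster.
set.seed(1)
n_ind      <- 50                                   # individuals
n_feat     <- 12                                   # features
n_clust    <- 3                                    # clusters
membership <- rep(1:n_clust, length.out = n_feat)  # cluster of each feature

# One random effect per individual and cluster
u <- matrix(rnorm(n_ind * n_clust, sd = 1), nrow = n_ind, ncol = n_clust)

# Each feature is its cluster's random effect plus independent noise
x <- u[, membership] + matrix(rnorm(n_ind * n_feat, sd = 0.5), nrow = n_ind)
colnames(x) <- paste0("f", seq_len(n_feat))

# The outcome is a linear regression on the cluster random effects
beta <- c(1, 0, -1)
y    <- as.vector(u %*% beta + rnorm(n_ind, sd = 0.5))

sim <- data.frame(x, outcome = y)
```

Features simulated this way contain negative values, so a data frame like sim would presumably be analysed with log.transform=FALSE.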

Usage

supcluster(data,outcome,features,log.transform=TRUE,maxclusters=10,
nstart=100,n=500,shape=1,scale=1,alpha=1,betaP=1,fixj="random",
fbeta=FALSE,starting.value=NULL,nchains=1,linkLikelihood = NULL)

Arguments

data

A data frame of the input data

outcome

Either the variable number or the variable name of the outcome variable. If fbeta=TRUE, no outcome variable is used. If NULL, the outcome is assumed to be a data object with a likelihood relating it to a per-patient frailty variable; in that case linkLikelihood cannot be NULL.

features

A list of features, given either as variable names or as column numbers; the two cannot be mixed.

log.transform

If TRUE, log-transform the feature data. Generally used when the features are gene expressions.

maxclusters

The maximum number of clusters used

nstart

The first nstart-1 values of each MCMC chain are not reported; they are used as a “burn-in”.

n

The number of MCMC iterations for each chain

shape

The shape parameter for the prior on the variance components

scale

The starting scale parameter for the prior on the variance components

alpha

The value to use for the Dirichlet prior parameter

betaP

The prior precision of the regression parameters.

fixj

If "random", the starting value for cluster membership is set at random. If "kmeans", k-means is used to set the starting value. Otherwise it is a matrix of features versus clusters, where a 1 indicates that feature i is in cluster j, and the cluster membership is assumed to be known. fixj should be set to "random" when multiple chains are run.

fbeta

If TRUE then the outcome is not used in the clustering algorithm

starting.value

Starting value for the MCMC. It should be left as NULL when multiple chains are run, in which case the starting cluster membership is determined by fixj. Otherwise it is a parameter vector similar to the one described under “Value” below.

nchains

Number of chains to run

linkLikelihood

Likelihood function for the model linking the actual outcome data to the per-patient frailty. The input of the function is a vector of length dim(data)[1]+nparms, where nparms is the number of parameters in the outcome model. The first part of the vector contains the frailties and the second part contains the parameters of the model. If NULL, then outcome is used.
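
As an illustration of the input layout described for linkLikelihood, here is a minimal sketch of such a function. Everything in it is an assumption made for the example: the helper name make_link_loglik, the binary outcome, the logistic model with an intercept and a slope on the frailty, and in particular the choice to return a log-likelihood, since this page does not state which scale supcluster expects.

```r
# Hypothetical helper: builds a function of the form described for
# linkLikelihood. Assumptions (not confirmed by the documentation):
# a binary outcome y, a logistic model with intercept and slope on the
# frailty, and a log-likelihood (rather than likelihood) return value.
make_link_loglik <- function(y) {
  n <- length(y)
  function(v) {
    frailty <- v[seq_len(n)]     # first part: one frailty per individual
    parms   <- v[-seq_len(n)]    # second part: outcome-model parameters
    eta     <- parms[1] + parms[2] * frailty
    sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  }
}

y  <- c(0, 1, 1, 0, 1)
ll <- make_link_loglik(y)
# Frailties all 0, intercept 0, slope 1: every probability is 0.5
ll(c(rep(0, length(y)), 0, 1))
```

The outcome data are captured in a closure because the documented input vector carries only the frailties and the model parameters.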

Value

A compound list is returned. At the first level is the chain number. At the second level there are two elements

inp

This has two values: maxclusters, giving the maximum number of clusters, and ngenes, giving the number of features.

parms

This is an n by 3+maxclusters+ngenes matrix. Each row is one MCMC iteration. The first three columns are the values of the variance components σ^2, τ^2, and γ^2; the next maxclusters columns are the regression coefficients for each cluster; and the final ngenes columns are the cluster membership of each feature.
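
The column layout of parms can be unpacked as in the sketch below. It works on a mock object built to match the description above, filled with random placeholder values rather than real supcluster output; the names fit, variance, betas, member, and modal_cluster are hypothetical, and storing inp as a list is an assumption.

```r
# Mock object with the documented layout; the values are random
# placeholders, not output from supcluster.
maxclusters <- 5
ngenes      <- 8
n_iter      <- 10
set.seed(2)
parms <- cbind(
  matrix(rexp(n_iter * 3), n_iter, 3),                       # variance components
  matrix(rnorm(n_iter * maxclusters), n_iter, maxclusters),  # regression coefficients
  matrix(sample(maxclusters, n_iter * ngenes, replace = TRUE),
         n_iter, ngenes)                                     # cluster memberships
)
fit <- list(list(inp = list(maxclusters = maxclusters, ngenes = ngenes),
                 parms = parms))

# Split chain 1 into its documented pieces
chain1   <- fit[[1]]
k        <- chain1$inp[["maxclusters"]]
g        <- chain1$inp[["ngenes"]]
variance <- chain1$parms[, 1:3, drop = FALSE]
betas    <- chain1$parms[, 3 + seq_len(k), drop = FALSE]
member   <- chain1$parms[, 3 + k + seq_len(g), drop = FALSE]

# Most frequently sampled cluster for each feature across iterations
modal_cluster <- apply(member, 2, function(z) {
  as.integer(names(which.max(table(z))))
})
```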

Note

When the feature space is large this program runs slowly. The example uses only a very short burn-in and very few iterations (nstart=5, n=20). In general this would not be adequate to fully explore the feature space.

Author(s)

David A. Schoenfeld, Jessie Hsu

References

Hsu, Jessie J., Dianne M. Finkelstein, and David A. Schoenfeld. "Outcome-driven cluster analysis with application to microarray data." PloS one 10.11 (2015): e0141874.

See Also

concordmap, compare.chains, beta.by.gene

Examples

# Run supcluster on the trauma data.
# Note: to get enough iterations, nstart and n must be increased to, say, 2000 and 3000,
# and maxclusters increased to 20.
data("trauma_data")
us=supcluster(trauma_data,outcome="outcome",features=1:87,
              maxclusters=5,nstart=5,n=20,fbeta=FALSE)
# Creates the plot in the paper
usm=concordmap(us,chains=1,sort.genes=TRUE)
image(1:87,1:87,usm$map,xlab='Genes',ylab='Genes',
      main="Trauma Data Example",
      col=gray(16:1 / 16))
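# Associate genes with clusters
data("gene_names")
# Per the Value section, columns 4 to 3+maxclusters hold the regression
# coefficients; maxclusters=5 here, so these are columns 4:8
betas=colSums(us[[1]]$parms[,4:8])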
outpt=data.frame(cluster.number=usm$clusters,beta=betas[usm$clusters],gene_names[usm$order,])

supcluster documentation built on May 20, 2022, 1:07 a.m.