supcluster: Clustering of Features Supervised by an Outcome

Description

We assume that each individual has a set of features and an outcome. We further assume that the features are organized in clusters with a random effect for each cluster, and that the outcome is related to the random effects by a linear regression. The function `supcluster` runs an MCMC sampler to estimate the parameters of this model, including the cluster membership of each feature. The program can also perform the estimation without considering the outcome. The outcome can be any data object, as long as it is related to the individual through a frailty.
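
The model described above can be illustrated by simulating data from it. The following is a minimal sketch in base R; the sample sizes, noise levels, coefficients, and all variable names are assumptions chosen for illustration and are not defaults or internals of the package.

```r
# Hypothetical simulation of the model described above; all names and
# parameter values are illustrative assumptions, not package defaults.
set.seed(1)
n_ind      <- 50                      # individuals
n_feat     <- 12                      # features
n_clust    <- 3                       # clusters
membership <- rep(1:n_clust, each = n_feat / n_clust)  # cluster of each feature

# One random effect per individual and cluster
u <- matrix(rnorm(n_ind * n_clust), nrow = n_ind, ncol = n_clust)

# Each feature is its cluster's random effect plus independent noise
features <- u[, membership] +
  matrix(rnorm(n_ind * n_feat, sd = 0.5), nrow = n_ind)

# The outcome is a linear regression on the cluster random effects
beta    <- c(1, 0, -1)
outcome <- as.vector(u %*% beta) + rnorm(n_ind, sd = 0.5)

sim <- data.frame(features, outcome = outcome)
dim(sim)
```

Because these simulated features can be negative, a data frame like `sim` would presumably need `log.transform=FALSE` if it were passed to `supcluster`.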

Usage

supcluster(data,outcome,features,log.transform=TRUE,maxclusters=10,
nstart=100,n=500,shape=1,scale=1,alpha=1,betaP=1,fixj="random",
fbeta=FALSE,starting.value=NULL,nchains=1,linkLikelihood = NULL)

Arguments

`data`	A data frame of the input data
`outcome`	Either the variable number or the variable name of the outcome variable. If `fbeta=TRUE`, no outcome variable is used. If NULL, the outcome is assumed to be a data object with a likelihood relating it to a per-patient frailty variable; in that case `linkLikelihood` cannot be NULL
`features`	A list of features, given either as variable names or as column numbers; the two cannot be mixed
`log.transform`	Whether to log-transform the feature data. Generally used when the features are gene expressions
`maxclusters`	The maximum number of clusters used
`nstart`	The first `nstart-1` values of each MCMC chain are discarded as burn-in and not reported
`n`	The number of MCMC iterations for each chain
`shape`	The shape parameter for the prior on the variance components
`scale`	The starting scale parameter for the prior on the variance components
`alpha`	The value to use for the Dirichlet prior parameter
`betaP`	The prior precision of the regression parameters
`fixj`	If `"random"`, the starting value for cluster membership is set at random. If `"kmeans"`, k-means is used to set the starting value. Otherwise it is a matrix of features versus clusters, where a 1 indicates that feature i is in cluster j, and the cluster membership is assumed to be known. `fixj` should be set to `"random"` when multiple chains are run.
`fbeta`	If TRUE then the outcome is not used in the clustering algorithm
`starting.value`	Starting value for the MCMC. It should be left as NULL when multiple chains are run, in which case the starting cluster membership is determined by `fixj`. Otherwise it is a parameter vector similar to the one described under “Value” below.
`nchains`	Number of chains to run
`linkLikelihood`	Likelihood function for the model linking the actual outcome data to the per-patient frailty. The input of the function is a vector of length `dim(data)[1]+nparms`, where `nparms` is the number of parameters in the outcome model. The first part of the vector holds the frailties and the second part holds the parameters of the model. If NULL then `outcome` is used.
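
The input layout described for `linkLikelihood` (frailties first, then model parameters) can be sketched as follows. This is only an illustration: the logistic outcome model, the choice of two parameters, the helper name `make_link_likelihood`, and the decision to return a log-likelihood rather than a likelihood are all assumptions, since the documentation above specifies only the shape of the input vector.

```r
# Hypothetical factory for a `linkLikelihood`-style function. The logistic
# model, nparms = 2, and returning a log-likelihood are assumptions made
# for illustration only.
make_link_likelihood <- function(y) {
  n_ind <- length(y)                  # should equal dim(data)[1]
  function(v) {
    frailty   <- v[seq_len(n_ind)]    # first part: per-patient frailties
    intercept <- v[n_ind + 1]         # second part: outcome-model parameters
    slope     <- v[n_ind + 2]
    p <- plogis(intercept + slope * frailty)
    sum(dbinom(y, size = 1, prob = p, log = TRUE))
  }
}

y  <- c(0, 1, 1, 0, 1)                # toy binary outcomes
ll <- make_link_likelihood(y)
ll(c(rep(0, 5), 0, 1))                # frailties all 0, intercept 0, slope 1
```

Wrapping the outcome data in a closure keeps the function's only argument equal to the single vector described above.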

Value

A compound list is returned. At the first level is the chain number. At the second level there are two elements

`inp`	This has two values: `maxclusters`, giving the maximum number of clusters, and `ngenes`, giving the maximum number of features
`parms`	This is an `n` by `3+maxclusters+ngenes` matrix. Each row is one MCMC iteration. The first three columns are the values of the variance components σ^2, τ^2, and γ^2; the next `maxclusters` values are the regression coefficients for each cluster; and the final `ngenes` values are the cluster membership of each feature
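
The column layout of `parms` can be unpacked with simple index arithmetic. The sketch below builds a mock matrix with the documented shape rather than calling `supcluster`; the values are random placeholders, and coding cluster membership as integers from 1 to `maxclusters` is an assumption.

```r
# Mock `parms` matrix with the documented layout; values are random
# placeholders, not real MCMC output.
set.seed(2)
maxclusters <- 4
ngenes      <- 6
n_iter      <- 10
parms <- cbind(
  matrix(rgamma(n_iter * 3, shape = 1), n_iter, 3),          # variance components
  matrix(rnorm(n_iter * maxclusters), n_iter, maxclusters),  # regression coefficients
  matrix(sample(maxclusters, n_iter * ngenes, replace = TRUE),
         n_iter, ngenes)                                     # cluster membership (assumed 1..maxclusters)
)

# Column indices implied by the layout 3 + maxclusters + ngenes
variance_cols <- 1:3
beta_cols     <- 3 + seq_len(maxclusters)
member_cols   <- 3 + maxclusters + seq_len(ngenes)

variances  <- parms[, variance_cols]
betas      <- parms[, beta_cols]
membership <- parms[, member_cols]

colMeans(betas)   # average coefficient per cluster across iterations
```

For a real fit, the same indices would apply to `us[[1]]$parms`, with `maxclusters` and `ngenes` taken from `us[[1]]$inp`.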

Note

When the feature space is large this program runs slowly. The example uses only a very short burn-in and very few iterations (`nstart=5`, `n=20`). In general this would not be adequate to fully explore the feature space.

Author(s)

David A. Schoenfeld, Jessie Hsu

References

Hsu, Jessie J., Dianne M. Finkelstein, and David A. Schoenfeld. "Outcome-driven cluster analysis with application to microarray data." PloS one 10.11 (2015): e0141874.

Examples

# Run supcluster on the trauma data.
# Note: for real use, nstart and n must be increased to, say, 2000 and 3000,
# and maxclusters increased to 20.
data("trauma_data")
us=supcluster(trauma_data,outcome="outcome",features=1:87,
              maxclusters=5,nstart=5,n=20,fbeta=FALSE)
# Create the plot shown in the paper
usm=concordmap(us,chains=1,sort.genes=TRUE)
image(1:87,1:87,usm$map,xlab='Genes',ylab='Genes',
      main="Trauma Data Example",
      col=gray(16:1 / 16))
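# Associate genes with clusters
data("gene_names")
# Per the Value section, the regression coefficients occupy columns
# 4 to 3+maxclusters of parms (here 4:8, since maxclusters=5)
betas=colSums(us[[1]]$parms[,4:8])
outpt=data.frame(cluster.number=usm$clusters,beta=betas[usm$clusters],
                 gene_names[usm$order,])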

supcluster documentation built on May 20, 2022, 1:07 a.m.