r_cluster_data: Cluster data using environmental exposure
In eclust: Environment Based Clustering for Interpretable Predictive Models in High Dimensional Data


This is one of the functions for real data analysis. It clusters the data twice: once accounting for the environment (the exposure) and once ignoring it.
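To make the distinction concrete, the following base R sketch builds a correlation matrix that ignores the exposure and one that is sensitive to it. The toy data, the object names, and the absolute-difference form of `diffcorr` are assumptions made for illustration; they are not taken from the package source.

```r
# Toy data: 200 samples, 6 hypothetical genes; E is a binary exposure.
set.seed(1)
n <- 200
E <- rep(c(0, 1), each = n / 2)
X <- matrix(rnorm(n * 6), nrow = n,
            dimnames = list(NULL, paste0("gene", 1:6)))

# Make gene1 and gene2 strongly correlated among the exposed samples only.
X[E == 1, "gene2"] <- X[E == 1, "gene1"] + rnorm(n / 2, sd = 0.1)

corr_all <- cor(X)            # ignores the exposure (the role of "corr")
corr0    <- cor(X[E == 0, ])  # unexposed samples only ("corr0")
corr1    <- cor(X[E == 1, ])  # exposed samples only ("corr1")

# Assumed exposure-sensitive measure: absolute difference of the two
# exposure-specific correlation matrices.
diffcorr <- abs(corr1 - corr0)

# The gene1/gene2 entry stands out because their correlation depends on E.
round(diffcorr["gene1", "gene2"], 2)
```

Clustering on `corr_all` groups genes that move together across all samples, while clustering on `diffcorr` groups genes whose co-expression changes with the exposure.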

r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)
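
The character-vector defaults in the usage above follow the common R idiom in which a function resolves the choice with `match.arg()`. Whether `r_cluster_data` does exactly this internally is an assumption; `pick_distance` below is a hypothetical stand-in that only illustrates the idiom.

```r
# Hypothetical stand-in with the same style of default as r_cluster_data.
pick_distance <- function(cluster_distance = c("corr", "corr0", "corr1", "tom")) {
  # match.arg() returns the first choice when the argument is missing
  # and raises an error for a value outside the listed choices.
  match.arg(cluster_distance)
}

pick_distance()       # "corr"
pick_distance("tom")  # "tom"
```

Under this idiom, passing a single value explicitly (as the Examples do) makes the chosen distance obvious to the reader.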

`data`	n x p matrix of data. Rows are samples; columns are genes or CpG sites. Should not contain the environment variable
`response`	numeric vector of length n
`exposure`	binary (0,1) numeric vector of length n for the exposure status of the n samples
`train_index`	numeric vector indicating the indices of `response` and the rows of `data` that are in the training set
`test_index`	numeric vector indicating the indices of `response` and the rows of `data` that are in the test set
`cluster_distance`	character string naming the matrix from the training set used to cluster the genes. Must be one of the following: corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore
`eclust_distance`	character string naming the matrix from the training set used to cluster the genes based on the environment. See `cluster_distance` for available options. Should be different from `cluster_distance`, for example `cluster_distance = "corr"` and `eclust_distance = "fisherScore"`. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. This function will always return this add-on
`measure_distance`	one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", passed to the `dist` function to calculate the distance used for clustering based on the corr, corr1, corr0, tom, tom0 and tom1 matrices
`minimum_cluster_size`	The minimum cluster size. Only applicable if `cutMethod='dynamic'`. This argument is passed to the `cutreeDynamic` function through the `u_cluster_similarity` function. Default is 50.
`...`	arguments passed to the `u_cluster_similarity` function
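
The way `measure_distance` and the clustering arguments fit together can be sketched with base R alone. This is a simplified illustration, not the package's implementation: the toy data are hypothetical, and `cutree()` stands in for `cutreeDynamic()` so that the example needs no extra packages.

```r
set.seed(2)
n <- 100

# Two blocks of 5 hypothetical genes; genes in a block share a latent factor.
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(sapply(1:5, function(i) f1 + rnorm(n, sd = 0.3)),
           sapply(1:5, function(i) f2 + rnorm(n, sd = 0.3)))
colnames(X) <- paste0("gene", 1:10)

similarity <- cor(X)                          # plays the role of "corr"
d <- dist(similarity, method = "euclidean")   # measure_distance = "euclidean"
tree <- hclust(d, method = "average")         # method = "average"

# cutree() is a simple stand-in; with cutMethod = "dynamic" the package
# cuts the tree with cutreeDynamic(), which uses minimum_cluster_size.
membership <- cutree(tree, k = 2)
table(membership)
```

With a dynamic cut the number of clusters is not fixed in advance, which is why `minimum_cluster_size` matters there and has no counterpart in this fixed `k = 2` stand-in.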

This function clusters the data. Its results should then be passed to the r_prepare_data function, which outputs the appropriate X and Y matrices in the format expected by regression packages such as mgcv, caret and glmnet

a list of length 8:

clustersAddon: combined clustering results, i.e. those based on the environment together with those ignoring the environment. See u_cluster_similarity for details
clustersAll: clustering results ignoring the environment. See u_cluster_similarity for details
etrain: vector of the exposure variable for the training set
cluster_distance_similarity: the similarity matrix based on the argument specified in cluster_distance
eclust_distance_similarity: the similarity matrix based on the argument specified in eclust_distance
clustersAddonMembership: a data.frame and data.table of the clustering membership for the combined clustering results, i.e. those based on the environment together with those ignoring the environment. As a result, each gene will show up twice in this table
clustersAllMembership: a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table
clustersEclustMembership: a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
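
The relationship between the three membership tables can be pictured with a small base R sketch. The column names (`gene`, `cluster`), the toy memberships, and the label offset are all hypothetical choices for illustration; the tables returned by the package may be laid out differently.

```r
genes <- paste0("gene", 1:6)

# Hypothetical memberships; column names are illustrative only.
all_membership    <- data.frame(gene = genes, cluster = c(1, 1, 1, 2, 2, 2))
eclust_membership <- data.frame(gene = genes, cluster = c(1, 1, 2, 2, 3, 3))

# Offset the second set of labels so cluster ids stay unique after stacking
# (an assumption about how the two sets could be kept apart).
eclust_membership$cluster <- eclust_membership$cluster +
  max(all_membership$cluster)

# The "add-on" table stacks both clusterings.
addon_membership <- rbind(all_membership, eclust_membership)

table(addon_membership$gene)              # every gene appears twice
length(unique(addon_membership$cluster))  # 2 + 3 = 5 clusters
```

Because the add-on table contains every cluster from both clusterings, its cluster count is the sum of the two, which is consistent with the remark in the Examples that `clustersAddon$nclusters` exceeds `clustersAll$nclusters`.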

u_cluster_similarity

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_along(Y), trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)

eclust documentation built on May 1, 2019, 8:46 p.m.
