r_cluster_data: Cluster data using environmental exposure


Description

This is one of the functions for real data analysis. It clusters the data twice: once accounting for the environment (exposure) and once ignoring it.

Usage

r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)

Arguments

data

n x p matrix of data. rows are samples, columns are genes or cpg sites. Should not contain the environment variable

response

numeric vector of length n

exposure

binary (0,1) numeric vector of length n for the exposure status of the n samples

train_index

numeric vector indicating the indices of response and the rows of data that are in the training set

test_index

numeric vector indicating the indices of response and the rows of data that are in the test set

cluster_distance

character string naming the matrix, computed from the training set, that is used to cluster the genes. Must be one of the following:

  • corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore

eclust_distance

character string naming the matrix, computed from the training set, that is used to cluster the genes based on the environment. See cluster_distance for available options. Should be different from cluster_distance, for example cluster_distance = "corr" and eclust_distance = "fisherScore". That is, one should be based on correlations ignoring the environment, and the other on correlations accounting for the environment. This function always returns this add-on

measure_distance

one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski". Passed to the dist function to calculate the distance used for clustering based on the corr, corr0, corr1, tom, tom0, tom1 matrices

minimum_cluster_size

The minimum cluster size. Only applicable if cutMethod='dynamic'. This argument is passed to the cutreeDynamic function through the u_cluster_similarity function. Default is 50.

...

arguments passed to the u_cluster_similarity function
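
To make the distinction between cluster_distance and eclust_distance concrete, here is a minimal base R sketch on simulated data. It assumes that "corr" is the correlation matrix over all training samples, that "corr0" and "corr1" are the correlations within the unexposed and exposed training samples, and that "diffcorr" is the absolute difference between the latter two. These definitions are assumptions for illustration only; the package's own computation may differ. All object names are hypothetical.

```r
set.seed(42)
n <- 40  # number of samples
p <- 5   # number of genes

# Simulated data: rows are samples, columns are genes (no exposure column)
genes <- matrix(rnorm(n * p), nrow = n,
                dimnames = list(NULL, paste0("gene", seq_len(p))))
exposure <- rep(c(0, 1), each = n / 2)  # binary exposure status

# Deterministic split for illustration: odd rows train, even rows test
train_index <- seq(1, n, by = 2)
test_index <- setdiff(seq_len(n), train_index)

x_train <- genes[train_index, , drop = FALSE]
e_train <- exposure[train_index]

# Similarity ignoring the environment (assumed meaning of "corr")
corr_all <- cor(x_train)

# Similarities within each exposure group (assumed "corr0" and "corr1")
corr0 <- cor(x_train[e_train == 0, , drop = FALSE])
corr1 <- cor(x_train[e_train == 1, , drop = FALSE])

# Similarity accounting for the environment (assumed meaning of "diffcorr")
diff_corr <- abs(corr1 - corr0)
```

In this sketch corr_all plays the role of the matrix selected by cluster_distance, and diff_corr the one selected by eclust_distance. Only training rows are used to build either matrix.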

Details

This function clusters the data. The results of this function should then be passed to the r_prepare_data function, which outputs the X and Y matrices in the format expected by regression packages such as mgcv, caret and glmnet
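
As a rough illustration of the clustering step, the sketch below applies stats::dist with two of the measure_distance options to a small hand-written similarity matrix, then runs average-linkage hierarchical clustering. It assumes the rows of the similarity matrix are treated as observations. The fixed cut with cutree stands in for the dynamic tree cut used when cutMethod = "dynamic", so this is not the package's actual procedure.

```r
# Toy similarity matrix for three genes (hypothetical values)
sim <- matrix(c(1.0, 0.8, 0.1,
                0.8, 1.0, 0.2,
                0.1, 0.2, 1.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("g1", "g2", "g3")))

# Distances between rows of the similarity matrix
d_euc <- dist(sim, method = "euclidean")
d_man <- dist(sim, method = "manhattan")

# Average-linkage hierarchical clustering on the Euclidean distances
hc <- hclust(d_euc, method = "average")

# Fixed cut into two clusters (stand-in for a dynamic tree cut)
membership <- cutree(hc, k = 2)
membership
```

Here g1 and g2 have similar rows and end up in the same cluster, while g3 forms its own.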

Value

a list of length 8:

clustersAddon

clustering results that combine the clusters accounting for the environment with the clusters ignoring it. See u_cluster_similarity for details

clustersAll

clustering results ignoring the environment. See u_cluster_similarity for details

etrain

vector of the exposure variable for the training set

cluster_distance_similarity

the similarity matrix based on the argument specified in cluster_distance

eclust_distance_similarity

the similarity matrix based on the argument specified in eclust_distance

clustersAddonMembership

a data.frame and data.table of the clustering membership for the combined clustering results, i.e. both accounting for and ignoring the environment. As a result, each gene will show up twice in this table

clustersAllMembership

a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table

clustersEclustMembership

a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
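
The relationship between the three membership tables can be pictured with a small base R sketch. The column names (gene, cluster, source) and the cluster assignments are hypothetical and chosen for illustration only. The actual tables returned by r_cluster_data may be laid out differently.

```r
gene_names <- paste0("gene", 1:4)

# Membership ignoring the environment: one row per gene
all_membership <- data.frame(gene = gene_names,
                             cluster = c(1, 1, 2, 2),
                             source = "all",
                             stringsAsFactors = FALSE)

# Membership accounting for the environment: one row per gene
eclust_membership <- data.frame(gene = gene_names,
                                cluster = c(1, 2, 2, 1),
                                source = "eclust",
                                stringsAsFactors = FALSE)

# Combined ("add-on") membership: stacking the two tables
addon_membership <- rbind(all_membership, eclust_membership)

# Each gene appears once per clustering, so twice in the stacked table
table(addon_membership$gene)
```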

See Also

u_cluster_similarity

Examples

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)

eclust documentation built on May 1, 2019, 8:46 p.m.