optimal_kmeans_d: Obtain optimal D solution based on k-means clustering of...

View source: R/optimal_kmeans_d.R

optimal_kmeans_d    R Documentation

Obtain optimal D solution based on k-means clustering of disease marker data in a case-control study

Description

optimal_kmeans_d applies k-means clustering to the disease markers using the kmeans function with many random starts. For the cluster solution obtained at each random start, the D value is calculated using the d function. The cluster solution that maximizes D is returned, along with the corresponding value of D. In this way, the optimally etiologically heterogeneous subtype solution can be identified from possibly high-dimensional disease marker data.
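
The search described above can be sketched as a simple loop over random starts. This is a minimal illustration, not the package's implementation, which may differ in detail. The names best_of_random_starts, placeholder_score, and toy_markers are hypothetical, and placeholder_score is only a stand-in criterion: it is not the D statistic, which in the package is computed by the d function from the risk factor data.

```r
# Stand-in criterion (NOT the D statistic): share of variance between clusters.
# It only gives the loop something to maximize.
placeholder_score <- function(fit) fit$betweenss / fit$totss

# Run k-means from `nstart` single random starts and keep the best-scoring one.
best_of_random_starts <- function(x, M, score_fun, nstart = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, cluster = NULL)
  for (i in seq_len(nstart)) {
    fit <- kmeans(x, centers = M, nstart = 1)  # one random start per iteration
    s <- score_fun(fit)
    if (s > best$score) best <- list(score = s, cluster = fit$cluster)
  }
  best
}

# Toy marker matrix for 60 hypothetical cases in three shifted groups
set.seed(1)
toy_markers <- matrix(rnorm(60 * 4) + rep(c(0, 3, 6), each = 20), ncol = 4)

toy_fit <- best_of_random_starts(toy_markers, M = 3,
                                 score_fun = placeholder_score,
                                 nstart = 20, seed = 42)
table(toy_fit$cluster)
```

Because the seed is set once before the loop, calling the sketch again with the same seed retraces the same sequence of random starts and returns the same solution.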

Usage

optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)

Arguments

markers

a vector of the names of the disease markers. These markers should be of a type that is suitable for use with kmeans clustering. All markers will be missing for control subjects. e.g. markers = c("marker1", "marker2")

M

the number of clusters to identify using kmeans clustering. Must be at least 2 (M >= 2).

factors

a list of the names of the binary or continuous risk factors. For binary risk factors the lowest level will be used as the reference level. e.g. factors = list("age", "sex", "race")

case

denotes the variable that contains each subject's status as a case or control. This value should be 1 for cases and 0 for controls. Argument must be supplied in quotes, e.g. case = "status".

data

the name of the data frame that contains the relevant variables.

nstart

the number of random starts to use with kmeans clustering. Defaults to 100.

seed

an integer argument passed to set.seed. Default is NULL. Recommended to set in order to obtain reproducible results.
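
As a rough illustration of the input layout the arguments describe, the following sketch builds a small data frame by hand. All names and values here (toy_data, status, age, sex, marker1, marker2) are made up for illustration and are not part of the package. The points to note are the 0/1 case indicator, the markers set to missing for controls, and the risk factor names supplied as a list.

```r
set.seed(2022)
n <- 8
toy_data <- data.frame(
  status  = rep(c(1, 0), each = n / 2),  # 1 = case, 0 = control
  age     = round(rnorm(n, mean = 50, sd = 10)),
  sex     = rbinom(n, size = 1, prob = 0.5),
  marker1 = rnorm(n),
  marker2 = rnorm(n)
)

# Disease markers are only observed for cases, so set them to NA for controls
toy_data[toy_data$status == 0, c("marker1", "marker2")] <- NA

# Argument values that would go with this layout
markers <- c("marker1", "marker2")  # character vector of marker names
factors <- list("age", "sex")       # list of risk factor names

# Quick sanity checks on the layout
controls_missing <- all(is.na(toy_data[toy_data$status == 0, markers]))
cases_complete   <- !any(is.na(toy_data[toy_data$status == 1, markers]))
c(controls_missing = controls_missing, cases_complete = cases_complete)
```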

Value

Returns a list with the following elements:

optimal_d The D value for the optimal D solution

optimal_d_data The original data frame supplied through the data argument, with a column called optimal_d_label added for the optimal D subtype label. This has the subtype assignment for cases, and is 0 for all controls.
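
To show how the returned pieces fit together, here is a small sketch using a hand-built mock with the same shape as the documented return value. The object mock_res and every number in it are invented for illustration; they are not output from the package.

```r
# Hand-built mock mirroring the documented structure (values are made up)
mock_res <- list(
  optimal_d = 0.25,
  optimal_d_data = data.frame(
    case            = c(1, 1, 1, 1, 0, 0),
    optimal_d_label = c(1, 2, 3, 1, 0, 0)
  )
)

out <- mock_res[["optimal_d_data"]]

# Controls should all carry label 0
controls_zero <- all(out$optimal_d_label[out$case == 0] == 0)

# Subtype counts among cases only
case_counts <- table(out$optimal_d_label[out$case == 1])
case_counts
```

Restricting the table to cases keeps the control rows, which are all labelled 0, from appearing as if they were an extra subtype.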

References

Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.

Examples


# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)

# Look at the value of D for the optimal D solution
res[["optimal_d"]]

# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)



riskclustr documentation built on March 23, 2022, 1:07 a.m.