In riskclustr: Functions to Study Etiologic Heterogeneity

optimal_kmeans_d

R Documentation

Obtain optimal D solution based on k-means clustering of disease marker data in a case-control study

Description

optimal_kmeans_d runs k-means clustering, via the kmeans function, from many random starts. For the cluster solution obtained at each random start, the D value is calculated using the d function. The cluster solution that maximizes D is returned, along with the corresponding value of D. In this way the most etiologically heterogeneous subtype solution can be identified from possibly high-dimensional disease marker data.
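The search strategy described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `best_of_random_starts`, `score_fn`, and `toy_score` are hypothetical names, and the toy score is only a stand-in for D (which the package computes with the d function).

```r
# Sketch: repeat k-means with a single random start per run and keep the
# labelling that maximises a user-supplied score.
# `best_of_random_starts` and `score_fn` are hypothetical names.
best_of_random_starts <- function(x, M, score_fn, nstart = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  best <- list(score = -Inf, cluster = NULL)
  for (i in seq_len(nstart)) {
    fit <- kmeans(x, centers = M)        # one random start per iteration
    score <- score_fn(fit$cluster)       # score this cluster solution
    if (is.finite(score) && score > best$score) {
      best <- list(score = score, cluster = fit$cluster)
    }
  }
  best
}

# Toy data: two well-separated groups and a binary covariate g
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
g <- rep(c(0, 1), each = 20)

# Stand-in score (NOT the D statistic): variance of the covariate mean
# across clusters, so larger values mean the clusters differ more in g.
toy_score <- function(cl) var(as.numeric(tapply(g, cl, mean)))

res <- best_of_random_starts(x, M = 2, score_fn = toy_score,
                             nstart = 10, seed = 2)
res$score
table(res$cluster)
```

The point of the sketch is that the selection criterion is the score rather than the within-cluster sum of squares that kmeans itself minimizes, which is why each random start is scored separately.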

Usage

optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)

Arguments

`markers`	a vector of the names of the disease markers. These markers should be of a type that is suitable for use with `kmeans` clustering. All markers will be missing for control subjects. e.g. `markers = c("marker1", "marker2")`
`M`	the number of clusters to identify using `kmeans` clustering. Must be at least 2 (M >= 2).
`factors`	a list of the names of the binary or continuous risk factors. For binary risk factors the lowest level will be used as the reference level. e.g. `factors = list("age", "sex", "race")`
`case`	denotes the variable that contains each subject's status as a case or control. This value should be 1 for cases and 0 for controls. Argument must be supplied in quotes, e.g. `case = "status"`.
`data`	the name of the data frame that contains the relevant variables, e.g. `data = subtype_data`.
`nstart`	the number of random starts to use with `kmeans` clustering. Defaults to 100.
`seed`	an integer passed to `set.seed`. Defaults to NULL. Setting a seed is recommended so that results are reproducible.
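As a rough illustration of the input layout these arguments describe, the following base R sketch builds a small toy data frame. All variable names (`toy`, `status`, `age`, `sex`, `marker1`, `marker2`) are hypothetical, and the sketch assumes that "missing" for controls means `NA`.

```r
# Toy input illustrating the documented layout; all names are hypothetical.
set.seed(42)
n_case <- 6
n_ctrl <- 4
n <- n_case + n_ctrl

toy <- data.frame(
  status  = c(rep(1, n_case), rep(0, n_ctrl)),   # 1 = case, 0 = control
  age     = round(rnorm(n, mean = 60, sd = 8)),  # continuous risk factor
  sex     = rbinom(n, size = 1, prob = 0.5),     # binary risk factor
  marker1 = c(rnorm(n_case), rep(NA, n_ctrl)),   # markers observed for cases only
  marker2 = c(rnorm(n_case), rep(NA, n_ctrl))
)

markers <- c("marker1", "marker2")   # character vector of marker names
factors <- list("age", "sex")        # list of risk factor names
case    <- "status"                  # quoted name of the case/control variable

# Basic sanity checks on the layout before clustering
stopifnot(
  all(markers %in% names(toy)),
  all(unlist(factors) %in% names(toy)),
  all(toy[[case]] %in% c(0, 1)),
  all(is.na(toy[toy[[case]] == 0, markers]))
)
```

Checks like these are cheap to run and catch the most common layout mistakes, such as a misspelled column name or a case indicator coded as something other than 0/1.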

Value

Returns a list with the following elements:

`optimal_d`	the D value for the optimal D solution.
`optimal_d_data`	the original data frame supplied through the `data` argument, with an added column called `optimal_d_label` containing the optimal D subtype label. This column holds the subtype assignment for cases and is 0 for all controls.
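To show how the returned list might be inspected, the sketch below uses a mock object with the documented structure. The values in `mock_res` are invented purely for illustration and are not output from the function; the column name `case` is also an assumption.

```r
# Mock object mimicking the documented return structure (values are made up).
mock_res <- list(
  optimal_d = 0.3,
  optimal_d_data = data.frame(
    case            = c(1, 1, 1, 1, 0, 0),
    optimal_d_label = c(1, 2, 2, 3, 0, 0)   # 0 marks controls
  )
)

out <- mock_res[["optimal_d_data"]]

# Cross-tabulate subtype label against case status
tab <- table(subtype = out$optimal_d_label, case = out$case)
tab

# Keep only cases, i.e. rows with a non-zero subtype label
cases_only <- out[out$optimal_d_label != 0, ]
nrow(cases_only)
```

Because controls are coded 0 in `optimal_d_label`, the column can be passed on as a subtype variable without recoding, or filtered as above when only cases are of interest.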

References

Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.

Examples


library(riskclustr)

# Cluster 30 disease markers (y1, ..., y30) to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
  markers = paste0("y", 1:30),
  M = 3,
  factors = list("x1", "x2", "x3"),
  case = "case",
  data = subtype_data,
  nstart = 100,
  seed = 81110224
)

# Look at the value of D for the optimal D solution
res[["optimal_d"]]

# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)

riskclustr documentation built on May 29, 2024, 6:23 a.m.