predkmeans: Predictive K-means Clustering


View source: R/functions_predkmeans.R

Description

Uses a mixture-of-experts algorithm to find cluster centers that are influenced by prediction covariates.

Usage

predkmeans(
  X,
  R,
  K,
  mu = NULL,
  muStart = c("kmeans", "random"),
  sigma2 = 0,
  sigma2fixed = FALSE,
  maxitEM = 100,
  tol = 1e-05,
  convEM = c("both", "mu", "gamma"),
  nStarts = 1,
  maxitMlogit = 500,
  verbose = 0,
  muRestart = 1000,
  returnAll = FALSE,
  ...
)

Arguments

X

An n by p matrix or data frame of data to be clustered.

R

Covariates used for clustering. Required unless doing k-means clustering (i.e. sigma2=0 and sigma2fixed=TRUE).

K

Number of clusters

mu

starting values for cluster centers. If NULL (default), then value is chosen according to muStart.

muStart

Character string indicating how initial value of mu should be selected. Only used if mu=NULL. Possible values are "random" or "kmeans" (default).

sigma2

starting value of sigma2. If set to 0 and sigma2fixed=TRUE, the standard k-means is done instead of predictive k-means.

sigma2fixed

Logical indicating whether sigma2 should be held fixed. If FALSE, then sigma2 is estimated using Maximum Likelihood.

maxitEM

Maximum number of EM iterations for finding the Mixture of Experts solution. If doing regular k-means, this is passed as iter.max.

tol

Convergence criterion: the EM algorithm stops when the change selected by convEM falls below this value.

convEM

Controls the measure of convergence for the EM algorithm. Should be one of "mu", "gamma", or "both" (default). The EM algorithm stops when the Frobenius norm of the change in mu ("mu"), of the change in gamma ("gamma"), or of each of the two changes ("both") is less than tol.
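The stopping rule described for convEM can be sketched in base R. This is only an illustration: the helper name em_converged and its argument names are hypothetical, and treating "both" as requiring each change to fall below tol separately is an assumption about the wording above, not a statement about the package's source.

```r
# Hypothetical helper: decide whether EM has converged between two iterations.
# mu_old/mu_new are matrices of cluster centers; gamma_old/gamma_new are
# matrices of mixture coefficients. All names are illustrative only.
em_converged <- function(mu_old, mu_new, gamma_old, gamma_new,
                         tol = 1e-05,
                         convEM = c("both", "mu", "gamma")) {
  convEM <- match.arg(convEM)
  d_mu <- norm(mu_new - mu_old, type = "F")           # Frobenius norm of change in mu
  d_gamma <- norm(gamma_new - gamma_old, type = "F")  # Frobenius norm of change in gamma
  switch(convEM,
         mu = d_mu < tol,
         gamma = d_gamma < tol,
         both = d_mu < tol && d_gamma < tol)          # assumption: both must be small
}

mu0 <- matrix(0, nrow = 2, ncol = 2)
g0 <- matrix(0, nrow = 3, ncol = 2)
res_mu <- em_converged(mu0, mu0 + 1e-07, g0, g0 + 1, convEM = "mu")
res_both <- em_converged(mu0, mu0 + 1e-07, g0, g0 + 1, convEM = "both")
```

Here the centers barely move while gamma changes substantially, so the "mu" criterion is satisfied but the "both" criterion is not.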

nStarts

number of times to perform EM algorithm

maxitMlogit

Maximum number of iterations in the mlogit optimization (nested within EM algorithm)

verbose

numeric vector indicating how much output to produce

muRestart

Gives max number of attempts at picking starting values. Only used when muStart='random'. If selected starting values for mu are constant within each cluster, then the starting values are re-selected up to muRestart times.

returnAll

Logical indicating whether a list containing all nStarts solutions should be included in the output.

...

Additional arguments passed to mlogit

Details

A thorough description of this method is provided in Keller et al. (2017). The algorithm for solving the Mixture of Experts model is based upon the approach presented by Jordan and Jacobs (1994).
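To give a feel for how covariates can influence cluster membership in a mixture-of-experts model, the following base R sketch computes one set of membership weights. It assumes a standard formulation (multinomial-logit gating on R, spherical normal components with common variance sigma2); the variable names and the choice of the first cluster as reference are illustrative assumptions, and this is not the package's implementation.

```r
# Sketch of one E-step under assumed notation (not the package's code).
set.seed(1)
n <- 6; p <- 2; K <- 3
X <- matrix(rnorm(n * p), n, p)        # data to cluster (n x p)
R <- cbind(1, rnorm(n))                # covariates with an intercept column (n x 2)
mu <- matrix(rnorm(K * p), K, p)       # cluster centers (K x p)
gamma <- cbind(0, matrix(rnorm(2 * (K - 1)), 2, K - 1))  # first cluster as reference
sigma2 <- 1

# Gating probabilities: softmax of the linear predictor R %*% gamma
eta <- R %*% gamma
prior <- exp(eta - apply(eta, 1, max)) # subtract row maxima for numerical stability
prior <- prior / rowSums(prior)

# Squared Euclidean distance from each observation to each center (n x K)
d2 <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, mu[k, ])^2))

# Membership weights: gating probability times normal kernel, renormalised by row
h <- prior * exp(-d2 / (2 * sigma2))
h <- h / rowSums(h)
```

Each row of h sums to one. In this formulation, observations are weighted toward clusters that are both close in X and likely given R.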

If sigma2 is 0 and sigma2fixed is TRUE, then standard k-means clustering (using kmeans) is done instead.
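A minimal sketch of this special case follows. It runs base R kmeans directly and then, only if the package is installed, makes the corresponding predkmeans call. Omitting R relies on the argument description stating that R is not required in this case, and the simulated data are purely illustrative. The two fits need not give identical labels, since starting values are random.

```r
set.seed(1)
X <- cbind(x1 = rnorm(50), x2 = rnorm(50))  # illustrative data

# Base R k-means; maxitEM plays the role of iter.max
km <- kmeans(X, centers = 3, iter.max = 100)

# Corresponding predkmeans call, run only when the package is available.
# R is omitted on the assumption that it is not needed when
# sigma2 = 0 and sigma2fixed = TRUE.
if (requireNamespace("predkmeans", quietly = TRUE)) {
  pkm0 <- predkmeans::predkmeans(X = X, K = 3, sigma2 = 0,
                                 sigma2fixed = TRUE, maxitEM = 100)
}
```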

Value

An object of class predkmeans, containing the following elements:

res.best

A list containing the results from the best-fitting solution to the Mixture of Experts problem:

mu

Maximum-likelihood estimate of intercepts from normal mixture model. These are the cluster centers.

gamma

Maximum-likelihood estimates of the mixture coefficients.

sigma2

If sigma2fixed=FALSE, the maximum likelihood estimate of sigma2

conv

Indicator of convergence.

objective

Value of the log-likelihood.

iter

Number of iterations.

mfit

A subset of output from mlogit.

center

Matrix of cluster centers

cluster

Vector of cluster labels assigned to observations

K

Number of clusters

sigma2

Final value of sigma^2.

wSS

Mean within-cluster sum-of-squares

sigma2fixed

Logical indicator of whether sigma2 was held fixed

Author(s)

Joshua Keller

References

Keller, J.P., Drton, M., Larson, T., Kaufman, J.D., Sandler, D.P., and Szpiro, A.A. (2017). Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts. Annals of Applied Statistics, 11(1):93–113.

Jordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214.

See Also

predictML.predkmeans, predkmeansCVest

Examples

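library(predkmeans)
# Simulate two covariates that determine cluster membership
n <- 200
r1 <- rnorm(n)
r2 <- rnorm(n)
# With prob = 0, u1 is always 0, so label "A" is never assigned
u1 <- rbinom(n, size = 1, prob = 0)
cluster <- ifelse(r1 < 0, ifelse(u1, "A", "B"), ifelse(r2 < 0, "C", "D"))
# Cluster-specific means for the two outcome variables
mu1 <- c(A = 2, B = 2, C = -2, D = -2)
mu2 <- c(A = 1, B = -1, C = -1, D = -1)
x1 <- rnorm(n, mu1[cluster], 4)
x2 <- rnorm(n, mu2[cluster], 4)
# Covariate (design) matrix and data matrix
R <- model.matrix(~r1 + r2)
X <- cbind(x1, x2)
# Fit predictive k-means with four clusters
pkm <- predkmeans(X = X, R = R, K = 4)
summary(pkm)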

Example output

Predictive k-means object with
     4 Clusters
     2 Variables
Convergence status:  9 
Sigma^2 = 13.18 (Fixed = FALSE)
Within-cluster Sum-of-Squares (wSS) =  3155.3 
Cluster centers are:
          x1         x2
1  3.8235365 -0.7720254
2 -4.9685580 -2.1946215
3  0.1237923 -1.5749716
4 -0.9317957  0.5545536

predkmeans documentation built on Jan. 11, 2020, 9:29 a.m.