Clustering longitudinal data from different starting conditions

Description

'kmlCov' re-launch the algorithm implemented in glmClust, for clustering longitudinal data (trajectories), several times with different starting conditions and various number of clusters.

Usage

1
2
3
4
  kmlCov(formula, data, ident, timeVar, nClust = 2:6,
    nRedraw = 20, family = 'gaussian', effectVar = '',
    weights = rep(1,nrow(data)) , timeParametric = TRUE,
    separateSampling = TRUE, max_itr = 100, verbose = TRUE)

Arguments

formula

A symbolic description of the model. In the parametric case we write for example 'y ~ clust(time+time2) + pop(sex)', here 'time' and 'time2' will have a different effect according to the cluster, the 'sex' effect is the same for all the clusters. In the non-parametric case only one covariate is allowed.

data

A [data.frame] in long format (no missing values) which means that each line corresponds to one measure of the observed phenomenon, and one individual may have multiple measures (lines) identified by an identity column. In the non-parametric case the totality of patients must have all the measurements at all fixed times.

nClust

The number of clusters, at leas 2 an at most 26.

nRedraw

The number of time the algorithm is re-run with different starting conditions.

ident

The name of the column identity.

timeVar

Specify the column name of the time variable.

family

A description of the error distribution and link function to be used in the model, by default 'gaussian'. This can be a character string naming a family function, a family function or the result of a call to a family function. (See 'family' for details of family functions).

effectVar

An effect, can be a level cluster effect or not.

weights

Vector of 'prior weights' to be used in the fitting process, by default the weights are equal to one.

timeParametric

By default [TRUE] thus parametric on the time. If [FALSE] then only one covariate is allowed in the formula and the algorithm used is the k-means.

separateSampling

By default [TRUE] it means that the proportions of the clusters are supposed equal in the classification step, the log-likelihood maximised at each step of the algorithm is ∑_{k=1}^{K}∑_{y_i \in P_k} \log(f(y_i, θ_k)), otherwise the proportions of clusters are taken into account and the log-likelihood is ∑_{k=1}^{K}∑_{y_i \in P_k} \log(λ_{k}f(y_i, θ_k)).

max_itr

The maximum number of iterations fixed at 100.

verbose

Print the output in the console.

Details

The purpose of kmlCov is clustering longitudinal data, as well as glmClust, and automate the procedure of re-launching the algorithm from different starting conditions by specifying nRedraw.

The algorithm depends greatly of the starting conditions (initial affectation on the trajectories/individuals), so it is recommanded to run the algorithm multiple times in order to explore the space of the solutions.

'kmlCov' return a list of list of GlmCLuster, the partitions are compared using as criterion the classification log-likelihood, the higher are the best partitions.

Value

A an object of class KmlCovList.

See Also

glmClust
which_best

Examples

1
2
3
data(artifdata)
res <- kmlCov(formula = Y ~ clust(time + time2), data = artifdata, ident = 'id',
timeVar = 'time', effectVar = 'treatment', nClust = 2:3, nRedraw = 2) #run 2 times for each cluster