Clustering longitudinal data

Description

'glmClust' clusters longitudinal data (trajectories) using the likelihood as a measure of distance. It also handles multiple covariates with different effects by relying on the generalised linear model 'glm'.

Usage

  glmClust(formula, data, ident, timeVar, nClust, family =
    'gaussian', effectVar = '', weights =
    rep(1,nrow(data)), affUser, timeParametric = TRUE,
    separateSampling = TRUE, max_itr = 100, verbose = TRUE)

Arguments

formula

A symbolic description of the model. In the parametric case one writes, for example, 'y ~ clust(time+time2) + pop(sex)': here 'time' and 'time2' have a different effect in each cluster, while the 'sex' effect is the same for all clusters. In the non-parametric case only one covariate is allowed.

data

A [data.frame] in long format with no missing values. Each line corresponds to one measurement of the observed phenomenon, and one individual may have several measurements (lines), identified by an identity column. In the non-parametric case every individual must be measured at the same fixed times.

nClust

The number of clusters, between 2 and 26.

ident

Name of the identity column in the data.

timeVar

Name of the 'time' column in the data.

family

A description of the error distribution and link function to be used in the model, by default 'gaussian'. This can be a character string naming a family function, a family function or the result of a call to a family function. (See family for more details of family functions).

effectVar

Name of an effect variable (optional). It may or may not appear in the formula, and may or may not have a cluster-level effect. Note that this parameter is used by the function plot.

weights

Vector of 'prior weights' to be used in the fitting process, by default the weights are equal to one.

affUser

Initial assignment of the individuals to the clusters, in a [data.frame] format (optional). If missing, the individuals are randomly assigned to the clusters.

timeParametric

By default [TRUE], i.e. the model is parametric in time. If [FALSE], only one covariate is allowed in the formula and the algorithm used is k-means.

separateSampling

By default [TRUE], meaning that the cluster proportions are assumed equal in the classification step; the log-likelihood maximised at each step of the algorithm is then ∑_{k=1}^{K}∑_{y_i \in P_k} \log(f(y_i, θ_k)). Otherwise the cluster proportions are taken into account and the log-likelihood is ∑_{k=1}^{K}∑_{y_i \in P_k} \log(λ_{k}f(y_i, θ_k)), where P_k is the set of trajectories assigned to cluster k, θ_k the parameters of cluster k and λ_k its proportion.

max_itr

The maximum number of iterations, 100 by default.

verbose

If [TRUE] (the default), print the output to the console.
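
To make the two criteria described under separateSampling concrete, the following base-R sketch evaluates both sums for one fixed partition of simulated Gaussian data. The data, the partition and the helper classif_loglik are assumptions made for illustration, and observations within a trajectory are treated as independent; this is not the package's internal code. Setting use_proportions = FALSE corresponds to the equal-proportions criterion.

```r
# Simulated long-format data (illustrative only): 6 individuals, 4 time points each
set.seed(1)
d <- data.frame(id = rep(1:6, each = 4), time = rep(0:3, times = 6))
d$Y <- ifelse(d$id <= 3, 1, -1) * d$time + rnorm(nrow(d), sd = 0.3)

# A fixed, hypothetical partition: part[i] is the cluster of individual i
part <- c(1, 1, 1, 1, 2, 2)

# Classification log-likelihood of a partition (hypothetical helper)
classif_loglik <- function(d, part, use_proportions) {
  lambda <- tabulate(part) / length(part)      # cluster proportions lambda_k
  total <- 0
  for (k in seq_along(lambda)) {
    ids_k <- which(part == k)                  # individuals in P_k
    fit <- glm(Y ~ time, family = gaussian, data = d[d$id %in% ids_k, ])
    total <- total + as.numeric(logLik(fit))   # sum of log f(y_i, theta_k) over P_k
    if (use_proportions) {
      # add log(lambda_k) once per trajectory in P_k
      total <- total + length(ids_k) * log(lambda[k])
    }
  }
  total
}

ll_equal <- classif_loglik(d, part, use_proportions = FALSE)
ll_prop  <- classif_loglik(d, part, use_proportions = TRUE)
ll_prop - ll_equal   # equals 4 * log(4/6) + 2 * log(2/6)
```

For a fixed partition the two criteria differ only by the term ∑_{k} n_k \log(λ_k), with n_k the number of trajectories in cluster k, so the proportions matter only when trajectories are reassigned.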

Details

'glmClust' implements an ECM (expectation classification maximisation) type algorithm, which assigns each trajectory to the cluster that maximises its likelihood. The procedure is repeated until the partition no longer changes or the likelihood no longer increases sufficiently.
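
The alternation between fitting and reassignment can be sketched in base R as follows. This is a minimal illustration under stated assumptions (simulated Gaussian data, one linear model in time per cluster, equal cluster proportions, a random balanced starting partition), not the implementation used by 'glmClust'; all object names are hypothetical.

```r
# Simulated trajectories (illustrative only): ids 1-10 increase, ids 11-20 decrease
set.seed(42)
n_id <- 20
times <- 0:4
d <- data.frame(id = rep(seq_len(n_id), each = length(times)),
                time = rep(times, times = n_id))
d$Y <- ifelse(d$id <= 10, 2, -2) * d$time + rnorm(nrow(d), sd = 0.5)

K <- 2
part <- sample(rep(seq_len(K), length.out = n_id))  # random start; part[i] = cluster of id i

for (itr in 1:100) {
  # Maximisation: fit one Gaussian linear model per cluster
  fits <- lapply(seq_len(K), function(k) lm(Y ~ time, data = d[part[d$id] == k, ]))

  # Classification: log-likelihood of every trajectory under every cluster model
  ll <- sapply(fits, function(fit) {
    sigma <- sqrt(mean(residuals(fit)^2))   # maximum-likelihood residual sd
    dens <- dnorm(d$Y, mean = predict(fit, newdata = d), sd = sigma, log = TRUE)
    tapply(dens, d$id, sum)                 # one value per individual
  })
  new_part <- unname(apply(ll, 1, which.max))

  if (length(unique(new_part)) < K) break   # guard: stop if a cluster would empty
  if (all(new_part == part)) break          # partition unchanged: converged
  part <- new_part
}

table(part, truth = rep(c("up", "down"), each = 10))
```

The sketch stops only when the partition is stable; a likelihood-based stopping rule, as mentioned above, would compare the summed log-likelihood between iterations instead.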

'glmClust' also handles multiple covariates whose effects act at different levels: an effect may differ from cluster to cluster or be identical for all clusters.

The introduction of covariates is possible thanks to 'glm', which fits a generalised linear model and takes into account the type of the response (normal, binomial, Poisson, etc.) and the link function.

Several parameters of 'glmClust' are shared with 'glm'. The formula requires particular attention: covariates with a cluster-specific effect are wrapped in clust, e.g. clust(T1+T2+..+Tn), and covariates with an identical effect in every cluster are wrapped in the keyword pop, e.g. pop(X1+X2+..+Xn). The latter covariates are optional.
The data must be in long format and no missing values are allowed.

In the parametric case (timeParametric = TRUE) multiple covariates are allowed; in the non-parametric case only one covariate is allowed.
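
For the non-parametric case, the requirement that every individual be measured at the same fixed times can be pictured as reshaping the long data into one row per individual and running k-means on the rows. The sketch below uses only base R and simulated data; it assumes this reshaping view for illustration and does not show how 'glmClust' calls k-means internally.

```r
# Simulated long-format data (illustrative only): 12 individuals, 5 common time points
set.seed(7)
times <- 0:4
d <- data.frame(id = rep(1:12, each = length(times)),
                time = rep(times, times = 12))
d$Y <- ifelse(d$id <= 6, 1, -1) * d$time + rnorm(nrow(d), sd = 0.3)

# Reshape: one row per individual, one column per time point
wide <- tapply(d$Y, list(id = d$id, time = d$time), mean)

# Every individual must be observed at every time point
stopifnot(!anyNA(wide))

# k-means on the trajectories, restarted from several random centres
km <- kmeans(wide, centers = 2, nstart = 5)
table(km$cluster)
```

If an individual lacked a measurement at one of the time points, the corresponding cell of wide would be NA and the check would fail, which is why complete measurements at fixed times are required in this setting.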

The algorithm depends greatly on the starting condition, which is obtained by randomly assigning the trajectories to the clusters unless the user supplies a partition. To obtain better results it is desirable to run the algorithm several times from different starting points; it is therefore preferable to use kmlCov, which runs the algorithm several times with different numbers of clusters.

At the end of the algorithm, an object of class GlmCluster is returned. It contains information about the assignment of the trajectories, the cluster proportions, the convergence, etc. The main trajectories can be visualised with plot(my_GlmCluster_Object).

Value

An object of class GlmCluster.

See Also

kmlCov

Examples

library(kmlcov)
data(artifdata)
# cluster-specific polynomial time effect, common treatTime effect, 4 clusters
res <- glmClust(formula = Y ~ clust(time + time2 + time3) + pop(treatTime),
                data = artifdata, ident = 'id', timeVar = 'time',
                effectVar = 'treatment', nClust = 4)
# trajectories labelled 0 received the normal treatment, those labelled 1 a high dose
# the colour indicates the cluster
# the proportions are in the table above the diagram
plot(res)