regressboot | R Documentation |
The regressboot function corresponds to the first part of the Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure, which evaluates the impact of sampling uncertainty on a standard Sequence Analysis and thus assesses the reliability of its findings. See Roth et al. (2024) or the R tutorial provided as a WeightedCluster vignette for full details on this procedure and its utility. regressboot should be used together with the unirarcat function.
regressboot(diss, covar, df, B = 500, count = FALSE,
            algo = "pam", method = "ward.D",
            fixed = FALSE, ncluster = 10, eval = "CH",
            parallel = "no", ncpus = 1, cl = NULL)
diss |
The numerical dissimilarity matrix used for clustering. Only a pre-computed matrix (i.e., where pairwise dissimilarities do not depend on the resample) is currently supported. |
covar |
A character vector containing the names of the covariates whose association with the clustering is studied. A formula object is then created inside the function based on this. |
df |
The dataset (data frame) with the covariates of interest. Column names should match the information in |
B |
Integer. The number of bootstrap replications. Set to 500 by default to attain satisfactory precision around the estimates, as the procedure involves multiple steps. |
count |
Logical. Whether the bootstrap runs are counted on the screen or not. |
algo |
The clustering algorithm as a character string. Currently only "pam" (calling the function |
method |
A character string with the method argument of |
fixed |
Logical. TRUE implies that the number of clusters is the same in every bootstrap. FALSE (default) implies that an optimal number of clusters is evaluated each time. |
ncluster |
Integer. Either the number of clusters in every bootstrap if |
eval |
A character string with the cluster quality index to be evaluated for each new partition. Any column of |
parallel |
A character string with the type of parallel operation to be used (if any) by the function |
ncpus |
Integer. Number of processes to be used in case of parallel operation. Typically, one would choose this to be the number of available CPUs. |
cl |
A parallel cluster for use if |
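Two points from the argument descriptions can be illustrated with a minimal base R sketch: how a character vector of covariate names can be turned into a formula, and why a pre-computed dissimilarity matrix can simply be subset by the resampled indices. The names `covar_example`, `membership`, `d` and `d_boot` are hypothetical, and this is not necessarily how regressboot proceeds internally.

```r
# Hypothetical covariate names; `membership` is an assumed response name
covar_example <- c("funemp", "gcse5eq")
f <- reformulate(covar_example, response = "membership")
print(f)

# Toy pre-computed dissimilarity matrix for 5 observations
set.seed(1)
x <- matrix(rnorm(10), nrow = 5)
d <- as.matrix(dist(x))

# Bootstrap resample of the observation indices (with replacement)
idx <- sample(nrow(d), replace = TRUE)

# Because pairwise dissimilarities do not depend on the resample,
# the bootstrap dissimilarity matrix is obtained by subsetting rows and columns
d_boot <- d[idx, idx]
```

Observations drawn more than once appear as duplicated rows and columns with a dissimilarity of zero between the copies.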
The regressboot function implements the following steps: (1) A random sample with replacement (i.e., a bootstrap sample) is drawn from the data. (2) The bootstrap sample is clustered by applying exactly the same clustering procedure as in the original analysis, which implies using the same dissimilarity measure, clustering algorithm, and method to determine the number of clusters. (3) A separate logistic regression predicting membership probability in each group is estimated. (4) The Average Marginal Effect (AME) of each covariate on the probability of being assigned to a given type is retrieved for all sequences belonging to this type. (5) These steps are repeated B times, with B typically large.
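The steps above can be sketched in base R on simulated data. This is an illustrative sketch under simplifying assumptions, not the implementation used by regressboot: the number of clusters is fixed at 2 rather than selected by a quality index, the single covariate `x` is binary, and its AME is computed by hand as an average discrete change in predicted probability. All object names (`feat`, `dat`, `ame_binary`, `ame`) are hypothetical.

```r
set.seed(42)
n <- 120
B <- 20   # far too small in practice; for illustration only

# Toy data: two latent groups and a binary covariate related to the group
group <- rep(c(0, 1), each = n / 2)
feat  <- cbind(rnorm(n, mean = 3 * group), rnorm(n))
dat   <- data.frame(x = rbinom(n, 1, prob = ifelse(group == 1, 0.7, 0.3)))
d     <- as.matrix(dist(feat))   # pre-computed dissimilarities

# AME of a binary covariate in a logistic model: average discrete change
ame_binary <- function(mod, data) {
  p1 <- predict(mod, newdata = transform(data, x = 1), type = "response")
  p0 <- predict(mod, newdata = transform(data, x = 0), type = "response")
  mean(p1 - p0)
}

# One row per bootstrap, one column per original individual
ame <- matrix(NA_real_, nrow = B, ncol = n)

for (b in seq_len(B)) {
  # (1) Bootstrap sample of individuals
  idx <- sample(n, replace = TRUE)
  # (2) Same clustering procedure on the resampled dissimilarities
  hc <- hclust(as.dist(d[idx, idx]), method = "ward.D")
  cl <- cutree(hc, k = 2)
  boot_dat <- dat[idx, , drop = FALSE]
  for (k in unique(cl)) {
    # (3) Logistic regression for membership in cluster k
    boot_dat$member <- as.integer(cl == k)
    mod <- glm(member ~ x, data = boot_dat, family = binomial)
    # (4) Store the AME for the individuals assigned to cluster k
    ame[b, idx[cl == k]] <- ame_binary(mod, boot_dat)
  }
}
# (5) After B repetitions, each individual has one AME per bootstrap
#     in which it was sampled (NA otherwise)
```

Storing the AME per individual rather than per cluster label sidesteps the fact that cluster labels are arbitrary and may switch from one bootstrap to the next.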
The output of regressboot is a list with the following components:
B |
The number of bootstrap replications (input parameter). |
optimal.number |
An integer vector with the numbers of clusters for each bootstrap partition. If input parameter |
cluster.solution |
A numerical matrix with the number of individuals ( |
assoc.char |
A character vector with the different associations evaluated in the logistic regression model (based on input parameter |
original.cluster |
A vector of the same size as the dataset with the original clustering, i.e., the one constructed on the original sample with the given method. |
original.assoc |
A list with the estimated AMEs corresponding to each association between covariates of interest (as in |
coefficients |
A list with the estimated AMEs for all individuals and all bootstraps, corresponding to the associations between covariates of interest (as in |
errors |
A list with the estimated standard errors of the AMEs for all individuals and all bootstraps, corresponding to the associations between covariates of interest (as in |
Uses the following packages: fastcluster, dplyr, margins, boot
Leonard Roth
Roth, L., Studer, M., Zuercher, E., & Peytremann-Bridevaux, I. (2024). Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with application in state sequence analysis. BMC Medical Research Methodology, 24(1), 303. https://doi.org/10.1186/s12874-024-02435-8
Studer, M. (2013). WeightedCluster library manual: A practical guide to creating typologies of trajectories in the social sciences with R. University of Geneva.
Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
unirarcat, rarcat
## Set the seed for reproducible results
set.seed(1)
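## Load TraMineR (mvad data, seqdef, seqdist) and WeightedCluster
## (as.clustrange, regressboot, unirarcat)
library(TraMineR)
library(WeightedCluster)
## Load the margins library for marginal effect estimation
library(margins)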
## Loading the data (TraMineR package)
data(mvad)
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- fastcluster::hclust(as.dist(diss), method="ward.D")
## Computing cluster quality measures
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
# Create cluster membership variable based on cluster quality above
mvad$clustering <- clustqual$clustering$cluster2
mvad$membership <- mvad$clustering == 2
# Formula for the association between the clustering and a covariate of interest
formula <- membership ~ funemp
# Run logistic regression model
mod <- glm(formula, mvad, family = "binomial")
# Model results
summary(margins(mod))
# A character vector with the name of the covariate of interest (to be related to the typology)
covar <- c("funemp")
## As in the original analysis, hierarchical clustering with Ward method is implemented
## An optimal clustering solution with n between 2 and 10 is evaluated each time by
## maximizing the CH index
## For illustration purposes, the number of bootstrap is smaller than what it ought to be
bootout <- regressboot(diss, covar, mvad, B = 50,
                       algo = "hierarchical", method = "ward.D",
                       ncluster = 10)
table(bootout$optimal.number)
bootout$assoc.char
# Robustness assessment for the association between father unemployment status
# and membership to the higher education trajectory group
result <- unirarcat(bootout, clustqual$clustering$cluster2, 2, "funempyes")
round(result$pooled.ame, 4)
round(result$standard.error, 4)
round(result$bootstrap.deviation, 4)
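The example above evaluates, in each bootstrap, an optimal solution with between 2 and 10 clusters by maximizing the CH index. The idea can be sketched in base R on simulated data. The function `pseudo_f` below is a hypothetical stand-in: it computes a pseudo-F statistic directly from the dissimilarities, and the exact definition of the CH index used by the package may differ.

```r
# Pseudo-F statistic from a dissimilarity matrix `d` and a partition `cl`
# (assumption: used here as a stand-in for the CH index)
pseudo_f <- function(d, cl) {
  n <- nrow(d)
  k <- length(unique(cl))
  # sum(d) counts each pair twice, hence the division by 2
  ss_total <- sum(d) / (2 * n)
  ss_within <- sum(sapply(unique(cl), function(g) {
    in_g <- cl == g
    sum(d[in_g, in_g]) / (2 * sum(in_g))
  }))
  ((ss_total - ss_within) / (k - 1)) / (ss_within / (n - k))
}

# Toy data with two separated groups
set.seed(7)
feat <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
              matrix(rnorm(60, mean = 6), ncol = 2))
d  <- as.matrix(dist(feat))
hc <- hclust(as.dist(d), method = "ward.D")

# Evaluate every candidate number of clusters and keep the best one
ks <- 2:10
quality <- sapply(ks, function(k) pseudo_f(d, cutree(hc, k = k)))
best_k <- ks[which.max(quality)]
best_k
```

Repeating this selection within each bootstrap corresponds to fixed = FALSE, whereas fixed = TRUE would skip it and always use the same number of clusters.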