In seqimpute: Imputation of Missing Data in Sequence Analysis

seqimpute

R Documentation

seqimpute: Imputation of missing data in longitudinal categorical data

Description

The seqimpute package implements the MICT and MICT-timing methods, which are multiple imputation methods for longitudinal categorical data. The core idea of both algorithms is to fill gaps of missing data, the typical form of missingness in a longitudinal setting, recursively from their edges. Predictions are based on either a multinomial or a random forest regression model. Both fixed and time-dependent covariates can be included in the model.

The MICT-timing algorithm is an extension of the MICT algorithm designed to address a key limitation of the latter: its assumption that position in the trajectory is irrelevant.
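
To make "recursively from their edges" concrete, here is a minimal base R sketch of one possible visiting order for the positions of a single gap. The helper name `edge_fill_order` is hypothetical, and the exact alternation (left edge first, then right edge, moving inward) is an assumption made for illustration; the package's internal ordering may differ.

```r
# Hypothetical helper: order in which the positions of one gap could be
# visited when filling from the edges inward.
# Assumption: start at the left edge and alternate sides.
edge_fill_order <- function(gap_start, gap_end) {
  left <- gap_start
  right <- gap_end
  visited <- integer(0)
  take_left <- TRUE
  while (left <= right) {
    if (take_left) {
      visited <- c(visited, left)
      left <- left + 1
    } else {
      visited <- c(visited, right)
      right <- right - 1
    }
    take_left <- !take_left
  }
  visited
}

edge_fill_order(3, 7)  # 3 7 4 6 5
```

Under this ordering, each imputed position has at least one neighbouring value (observed or already imputed) available as a predictor.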

Usage

seqimpute(
  data,
  var = NULL,
  np = 1,
  nf = 1,
  m = 5,
  timing = FALSE,
  frame.radius = 0,
  covariates = NULL,
  time.covariates = NULL,
  regr = "multinom",
  npt = 1,
  nfi = 1,
  ParExec = FALSE,
  ncores = NULL,
  SetRNGSeed = FALSE,
  end.impute = TRUE,
  verbose = TRUE,
  available = TRUE,
  pastDistrib = FALSE,
  futureDistrib = FALSE,
  ...
)

Arguments

`data`	Either a data frame containing sequences of a categorical variable, where missing data are coded as `NA`, or a state sequence object created using the seqdef function. If using a state sequence object, any "void" elements will also be treated as missing. See the `end.impute` argument if you wish to skip imputing values at the end of the sequences.
`var`	A vector specifying the columns of the dataset that contain the trajectories. Default is `NULL`, meaning all columns are used.
`np`	Number of prior states to include in the imputation model for internal gaps.
`nf`	Number of subsequent states to include in the imputation model for internal gaps.
`m`	Number of multiple imputations to perform (default: `5`).
`timing`	Logical, specifies the imputation algorithm to use. If `FALSE`, the MICT algorithm is applied; if `TRUE`, the MICT-timing algorithm is used.
`frame.radius`	Integer, relevant only for the MICT-timing algorithm, specifying the radius of the timeframe.
`covariates`	List of the columns of the dataset containing covariates to be included in the imputation model.
`time.covariates`	List of the columns of the dataset with time-varying covariates to include in the imputation model.
`regr`	Character specifying the imputation method. Options include `"multinom"` for multinomial models and `"rf"` for random forest models.
`npt`	Number of prior observations in the imputation model for terminal gaps (i.e., gaps at the end of sequences).
`nfi`	Number of future observations in the imputation model for initial gaps (i.e., gaps at the beginning of sequences).
`ParExec`	Logical, indicating whether to run multiple imputations in parallel. Setting to `TRUE` can improve computation time depending on available cores.
`ncores`	Integer, specifying the number of cores to use for parallel computation. If unset, defaults to the maximum number of CPU cores minus one.
`SetRNGSeed`	Integer, to set the random seed for reproducibility in parallel computations. Note that setting `set.seed()` alone does not ensure reproducibility in parallel mode.
`end.impute`	Logical. If `FALSE`, missing data at the end of sequences will not be imputed.
`verbose`	Logical, if `TRUE`, displays progress and warnings in the console. Use `FALSE` for silent computation.
`available`	Logical, specifies whether to consider already imputed data in the predictive model. If `TRUE`, previous imputations are used; if `FALSE`, only original data are considered.
`pastDistrib`	Logical, if `TRUE`, includes the past distribution as a predictor in the imputation model.
`futureDistrib`	Logical, if `TRUE`, includes the future distribution as a predictor in the imputation model.
`...`	Named arguments that are passed down to the imputation functions.

Details

The imputation process is divided into several steps, depending on the type of gap. Gaps are imputed in the following order:

Internal gap: gaps that have at least np observations before them and at least nf observations after them
Initial gap: gaps situated at the very beginning of a trajectory
Terminal gap: gaps situated at the very end of a trajectory
Left-hand side specifically located gap (SLG): gaps that have at least nf observations after the gap, but fewer than np observations before it
Right-hand side SLG: gaps that have at least np observations before the gap, but fewer than nf observations after it
Both-hand side SLG: gaps that have fewer than np observations before the gap, and fewer than nf observations after it
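
The gap types listed above can be illustrated with a small base R sketch that labels each run of `NA` in a single sequence. The function `classify_gaps` is hypothetical and not part of the package. It assumes that "observations before/after the gap" means the number of time points before or after the gap, whether or not they are observed; the package may count differently.

```r
# Hypothetical helper: label every run of NA in one sequence.
# Assumption: "observations before/after" counts time points.
classify_gaps <- function(x, np = 1, nf = 1) {
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  gap <- which(runs$values)
  out <- data.frame(
    start = starts[gap],
    end = ends[gap],
    type = rep(NA_character_, length(gap)),
    stringsAsFactors = FALSE
  )
  for (i in seq_len(nrow(out))) {
    before <- out$start[i] - 1        # time points before the gap
    after <- length(x) - out$end[i]   # time points after the gap
    out$type[i] <- if (before == 0) {
      "initial"
    } else if (after == 0) {
      "terminal"
    } else if (before >= np && after >= nf) {
      "internal"
    } else if (after >= nf) {
      "left-hand side SLG"
    } else if (before >= np) {
      "right-hand side SLG"
    } else {
      "both-hand side SLG"
    }
  }
  out
}

x <- c("a", NA, "b", "b", NA, NA, "a", "b", NA, "a")
classify_gaps(x, np = 2, nf = 2)
# gap at position 2    -> left-hand side SLG
# gap at positions 5-6 -> internal
# gap at position 9    -> right-hand side SLG
```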

The primary difference between the MICT and MICT-timing algorithms lies in how they select patterns from other sequences for fitting the imputation model. While the MICT algorithm considers all similar patterns regardless of their temporal placement, MICT-timing restricts pattern selection to those that are temporally closest to the missing value. This restriction lets the imputation account for temporal dynamics, which is intended to yield more accurate imputed values.
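
The difference in pattern selection can be sketched in base R as follows. The function `candidate_positions` is hypothetical, and the window rule (keep only time points within `frame.radius` of the target position `t`) is an assumption for illustration rather than the package's internal code.

```r
# Hypothetical helper: time points whose patterns could feed the
# imputation model for a missing value at position t.
# Assumption: MICT-timing keeps positions within frame.radius of t.
candidate_positions <- function(t, n_time, np = 1, nf = 1,
                                timing = FALSE, frame.radius = 0) {
  stopifnot(n_time - nf >= np + 1)
  # Positions with np predecessors and nf successors available
  usable <- seq(np + 1, n_time - nf)
  if (!timing) {
    return(usable)  # MICT: pool patterns from all usable positions
  }
  usable[abs(usable - t) <= frame.radius]
}

candidate_positions(t = 5, n_time = 10)
# 2 3 4 5 6 7 8 9
candidate_positions(t = 5, n_time = 10, timing = TRUE, frame.radius = 1)
# 4 5 6
```

A larger `frame.radius` trades temporal specificity for more training patterns; with `frame.radius = 0` only patterns at the same position as the missing value would be used under this assumption.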

Value

An object of class seqimp, which is a list with the following elements:

data: A data.frame containing the original (incomplete) data.
imp: A list of m data.frame corresponding to the imputed datasets.
m: The number of imputations.
method: A character vector specifying whether MICT or MICT-timing was used.
np: Number of prior states included in the imputation model.
nf: Number of subsequent states included in the imputation model.
regr: A character vector specifying whether multinomial or random forest imputation models were applied.
call: The call that created the object.
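
As a brief illustration of the elements listed above, the following sketch runs a small imputation and inspects the result. It assumes the seqimpute package is installed and uses the `gameadd` data from the examples; `m = 2` is chosen only to keep the run short.

```r
library(seqimpute)

set.seed(1)
res <- seqimpute(data = gameadd, var = 1:4, m = 2, verbose = FALSE)

class(res)          # should include "seqimp"
res$m               # number of imputations
length(res$imp)     # one data.frame per imputation
head(res$imp[[1]])  # first imputed dataset
res$method          # algorithm that was used
```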

Author(s)

Kevin Emery <kevin.emery@unige.ch>, Andre Berchtold, Anthony Guinchard, and Kamyar Taher

References

Halpin, B. (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639.

Halpin, B. (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620

Emery, K., Studer, M., & Berchtold, A. (2024). Comparison of imputation methods for univariate categorical longitudinal data. Quality & Quantity, 1-25. https://link.springer.com/article/10.1007/s11135-024-02028-z

Examples


# Default multiple imputation of the trajectories of game addiction with the
# MICT algorithm

## Not run: 
set.seed(5)
imp1 <- seqimpute(data = gameadd, var = 1:4)


# Default multiple imputation with the MICT-timing algorithm
set.seed(3)
imp2 <- seqimpute(data = gameadd, var = 1:4, timing = TRUE)


# Inclusion in the MICT imputation process of the three background
# characteristics (Gender, Age and Track), and the time-varying covariate
# about gambling (add timing = TRUE to use MICT-timing instead)


set.seed(4)
imp3 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11
)


# Parallel computation


imp4 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11, ParExec = TRUE, ncores = 5, SetRNGSeed = 2
)

## End(Not run)

seqimpute documentation built on April 12, 2025, 1:54 a.m.