seqaddNA: Generation of missing data in longitudinal categorical data.
In seqimpute: Imputation of Missing Data in Sequence Analysis

Description

Generates missing data in sequences using a Markovian approach.

Usage

seqaddNA(
  data,
  var = NULL,
  states.high = NULL,
  propdata = 1,
  pstart.high = 0.1,
  pstart.low = 0.005,
  pcont = 0.66,
  maxgap = 3,
  maxprop = 0.75,
  only.traj = FALSE
)

Arguments

`data`	A data frame containing sequences of a categorical (multinomial) variable, where missing data are coded as `NA`.
`var`	A vector specifying the columns of the dataset that contain the trajectories. Default is `NULL`, meaning all columns are used.
`states.high`	A list of states with a higher probability of initiating a subsequent missing data gap.
`propdata`	Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1.
`pstart.high`	Probability of starting a missing data gap for the states specified in the `states.high` argument.
`pstart.low`	Probability of starting a missing data gap for all other states.
`pcont`	Probability that an ongoing missing data gap continues at the next time point.
`maxgap`	Maximum length of a missing data gap.
`maxprop`	Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1.
`only.traj`	Logical. If `TRUE`, only the trajectories (specified in `var`) are returned; if `FALSE`, the entire data frame is returned.
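To get a feel for how `pcont` and `maxgap` interact, the short calculation below estimates the expected length of a gap once it has started, following the rules given under Details. It is only a back-of-the-envelope sketch: it assumes the gap is not cut short by the end of the sequence and ignores the redraw triggered by `maxprop`. The variable `expected_gap` is introduced here for illustration and is not part of the package.

```r
pcont <- 0.66   # default probability that a gap continues
maxgap <- 3     # default maximum gap length

# A gap always has length >= 1; it reaches length k with probability
# pcont^(k - 1), for k = 1, ..., maxgap. Summing these tail probabilities
# gives the expected gap length.
expected_gap <- sum(pcont^(seq_len(maxgap) - 1))
expected_gap
```

With the default values, gaps are therefore a little over two time points long on average.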

Details

The first time point of a trajectory is missing with probability `pstart.low`. For each subsequent time point, the probability of being missing depends on the previous time point. There are four cases:

1. If the previous time point is missing and the maximum gap length, specified by the argument `maxgap`, has been reached, the time point is set as observed.

2. If the previous time point is missing but the maximum gap length has not been reached, the time point is missing with probability `pcont`.

3. If the previous time point is observed and its state belongs to the list of states specified by `states.high`, the time point is missing with probability `pstart.high`.

4. If the previous time point is observed and its state does not belong to the list of states specified by `states.high`, the time point is missing with probability `pstart.low`.

If the proportion of missing data in a given trajectory exceeds the proportion specified by `maxprop`, the missing data simulation is repeated for that sequence.
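The four cases and the redraw rule can be sketched in base R for a single trajectory. This is an illustrative reading of the description above, not the package's actual implementation: the helper name `simulate_mask` is hypothetical, and it returns a logical mask (`TRUE` = missing) rather than a modified data frame.

```r
# Hypothetical helper, not part of seqimpute: simulate a missingness mask
# for one trajectory (a vector of states) following the four cases.
simulate_mask <- function(states, states.high = NULL,
                          pstart.high = 0.1, pstart.low = 0.005,
                          pcont = 0.66, maxgap = 3, maxprop = 0.75) {
  n <- length(states)
  repeat {
    miss <- logical(n)
    gap <- 0L  # length of the current run of missing time points
    for (t in seq_len(n)) {
      if (t == 1L) {
        p <- pstart.low                      # first time point
      } else if (miss[t - 1L]) {
        p <- if (gap >= maxgap) 0 else pcont # cases 1 and 2
      } else if (states[t - 1L] %in% states.high) {
        p <- pstart.high                     # case 3
      } else {
        p <- pstart.low                      # case 4
      }
      miss[t] <- runif(1) < p
      gap <- if (miss[t]) gap + 1L else 0L
    }
    # Redraw the whole trajectory if too much of it is missing
    if (mean(miss) <= maxprop) return(miss)
  }
}

# Usage on a made-up trajectory
set.seed(1)
traj <- c("school", "school", "joblessness", "joblessness",
          "employment", "employment")
mask <- simulate_mask(traj, states.high = "joblessness")
traj[mask] <- NA
```

Note that the probability at time `t` is driven by the state at `t - 1`, so a state listed in `states.high` raises the chance that the following time point is lost, not its own.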

Value

A data frame with simulated missing data.
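One way to sanity-check a returned data frame is to compute, per trajectory, the proportion of missing values and the longest run of consecutive `NA`s, and compare them with `maxprop` and `maxgap`. The sketch below uses a small hand-made data frame (`toy`, invented for illustration) in place of real `seqaddNA()` output, and assumes the frame contains only trajectory columns.

```r
# Toy data frame standing in for the output of seqaddNA() (made up)
toy <- data.frame(
  t1 = c("a", NA, "b"),
  t2 = c(NA, NA, "b"),
  t3 = c(NA, "a", "b"),
  t4 = c("a", "a", NA),
  stringsAsFactors = FALSE
)

# Proportion of missing values in each trajectory (row)
prop_missing <- rowMeans(is.na(toy))

# Longest run of consecutive NAs in each trajectory
longest_gap <- apply(is.na(toy), 1, function(m) {
  r <- rle(m)
  if (any(r$values)) max(r$lengths[r$values]) else 0L
})

# Compare against the default limits maxprop = 0.75 and maxgap = 3
stopifnot(all(prop_missing <= 0.75), all(longest_gap <= 3))
```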

Author(s)

Kevin Emery

Examples

# Generate MCAR missing data on the mvad dataset
# from the TraMineR package

## Not run: 
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86)


# Generate missing data on mvad where joblessness is more likely to trigger
# a missing data gap
mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")

## End(Not run)

seqimpute documentation built on April 12, 2025, 1:54 a.m.