seqaddNA: Generation of missing data in longitudinal categorical data.

View source: R/seqaddNA.R

seqaddNA    R Documentation

Generation of missing data in longitudinal categorical data.

Description

Generates missing data in sequences using a Markovian approach: the probability that a time point is missing depends on the previous time point.

Usage

seqaddNA(
  data,
  var = NULL,
  states.high = NULL,
  propdata = 1,
  pstart.high = 0.1,
  pstart.low = 0.005,
  pcont = 0.66,
  maxgap = 3,
  maxprop = 0.75,
  only.traj = FALSE
)

Arguments

data

A data frame containing sequences of a categorical (multinomial) variable, where missing data are coded as NA.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

states.high

A list of states with a higher probability of initiating a subsequent missing data gap.

propdata

Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1.

pstart.high

Probability of starting a missing data gap for the states specified in the states.high argument.

pstart.low

Probability of starting a missing data gap for all other states.

pcont

Probability that a missing data gap continues at the next time point.

maxgap

Maximum length of a missing data gap.

maxprop

Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1.

only.traj

Logical. If TRUE, only the trajectories (specified in var) are returned. If FALSE, the entire data frame is returned.

Details

The first time point of a trajectory is missing with probability pstart.low. For each subsequent time point, the probability of being missing depends on the previous time point. There are four cases:

1. If the previous time point is missing and the maximum length of a missing gap, which is specified by the argument maxgap, is reached, the time point is set as observed.

2. If the previous time point is missing but the maximum gap length has not been reached, the time point is missing with probability pcont.

3. If the previous time point is observed and its state belongs to the list of states specified by states.high, the probability of being missing is pstart.high.

4. If the previous time point is observed but its state does not belong to the list of states specified by states.high, the probability of being missing is pstart.low.

If the proportion of missing data in a given trajectory exceeds the proportion specified by maxprop, the missing data simulation is repeated for the sequence.
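
The rules above can be sketched for a single sequence in base R. This is an illustrative sketch only, not the package's source code: simulate_gaps() is a hypothetical helper, and it assumes that "repeated" means the whole sequence is redrawn from scratch when maxprop is exceeded.

```r
# Illustrative sketch only: NOT the seqimpute implementation.
# simulate_gaps() is a hypothetical helper that returns a logical vector
# marking which time points of one sequence would be set to NA.
simulate_gaps <- function(states, states.high = NULL,
                          pstart.high = 0.1, pstart.low = 0.005,
                          pcont = 0.66, maxgap = 3, maxprop = 0.75) {
  n <- length(states)
  repeat {
    missing <- logical(n)
    gaplen <- 0  # length of the gap currently in progress
    for (t in seq_len(n)) {
      if (t == 1) {
        p <- pstart.low                  # first time point
      } else if (missing[t - 1] && gaplen >= maxgap) {
        p <- 0                           # case 1: gap reached maxgap
      } else if (missing[t - 1]) {
        p <- pcont                       # case 2: gap may continue
      } else if (states[t - 1] %in% states.high) {
        p <- pstart.high                 # case 3: previous state in states.high
      } else {
        p <- pstart.low                  # case 4: any other previous state
      }
      missing[t] <- runif(1) < p
      gaplen <- if (missing[t]) gaplen + 1 else 0
    }
    # Assumption: redraw the whole sequence if too much of it is missing
    if (mean(missing) <= maxprop) break
  }
  missing
}

# Usage on a made-up trajectory
set.seed(1)
traj <- c("school", "school", "joblessness", "joblessness",
          "employment", "employment")
miss <- simulate_gaps(traj, states.high = "joblessness")
traj[miss] <- NA
```

With the default states.high = NULL, cases 3 and 4 collapse into a single case, so the start of a gap no longer depends on the observed state.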

Value

A data frame with simulated missing data.
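
One way to inspect the returned data frame is to compute, for each sequence, the proportion of missing time points and the longest gap, and compare them with maxprop and maxgap. The sketch below uses base R only; the toy data frame out is a hypothetical stand-in for the output of seqaddNA(), with rows as sequences and columns as time points. If the data frame contains other columns, restrict it to the trajectory columns first.

```r
# Hypothetical stand-in for a data frame returned by seqaddNA()
out <- data.frame(
  t1 = c("a", NA, "b"),
  t2 = c(NA, NA, "b"),
  t3 = c(NA, "a", "b"),
  t4 = c("a", "a", NA),
  stringsAsFactors = FALSE
)

# Proportion of missing time points in each sequence (row);
# 0.5, 0.5 and 0.25 for the toy data above
prop_missing <- rowMeans(is.na(out))

# Longest run of consecutive NAs in each sequence;
# 2, 2 and 1 for the toy data above
longest_gap <- apply(is.na(out), 1, function(m) {
  r <- rle(m)
  if (any(r$values)) max(r$lengths[r$values]) else 0
})
```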

Author(s)

Kevin Emery

Examples

# Generate MCAR missing data on the mvad dataset
# from the TraMineR package

## Not run: 
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86)


# Generate missing data on mvad where joblessness is more likely to trigger
# a missing data gap
mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")

## End(Not run)


seqimpute documentation built on April 12, 2025, 1:54 a.m.