Source: http://stats.stackexchange.com/questions/184741/how-to-simulate-the-different-types-of-missing-data

Rubin defined three types of missing data:

Missing Completely at Random (MCAR)

MCAR occurs when there is a simple probability that data will be missing, and that probability is unrelated to anything else in your study. For example, a patient can miss a follow up visit because there is an accident on the highway and they simply can't get to the visit.

Missing at Random (MAR)

MAR happens when the missingness is related to information in your study, but all the relevant information to predict missingness is in the existing dataset. An example might be a weight loss study in which people drop out if their trajectory is that they are gaining weight. If you can estimate that trajectory for each person before anyone drops out, and see that those whose slope is positive subsequently drop out, you could take that as MAR.

Not Missing at Random (NMAR)

NMAR is like MAR in that the missingness is related to what is happening in your study, but differs in that the data that are related to the missingness is included in the data that are missing. For instance, if you are studying a treatment for vertigo / 'woozy-ness', but anytime a patient is really woozy, they don't show up for the follow-up visit. Thus, all the high values are missing, and they are missing because they are high! In other words, the types of missingness specify the mechanism that generates the missingness itself, so if you understand how the mechanism works, you simply write code to replicate it. For example, if you want 7% of your data missing completely at random, draw a number from a uniform distribution for every value in your dataset, and if it is <.07, replace the value with NA. For missing at random, simulate a logistic regression data generating process that outputs a probability of each value being missing (i.e., being replaced with NA) using information that will continue to be non-missing in your dataset. (For an example of simulating a logistic regression data generating process, see my answer here: Logistic regression simulation in order to show that intercept is biased when Y=1 is rare.) You can generate missingness not at random using a similar logistic regression data generating process, where the probability of missingness is a function of the y-value itself (i.e., the value that will potentially be replaced by NA).

Here is an example:

##### generic data setup:
set.seed(977) # this makes the simulation exactly reproducible
ni     = 100  # 100 people
nj     =  10  # 10 week study
id     = rep(1:ni, each=nj)
cond   = rep(c("control", "diet"), each=nj*(ni/2))
base   = round(rep(rnorm(ni, mean=250, sd=10), each=nj))
week   = rep(1:nj, times=ni)
y      = round(base + rnorm(ni*nj, mean=0, sd=1))
# MCAR
prop.m = .07  # 7% missingness
mcar   = runif(ni*nj, min=0, max=1)
y.mcar = ifelse(mcar<prop.m, NA, y)  # unrelated to anything
mcar.df <- data.frame(cbind(id, week, cond, base, y, y.mcar))
mcar.df
# MAR
y.mar = matrix(y, ncol=nj, nrow=ni, byrow=TRUE)
for(i in 1:ni){
  for(j in 4:nj){
    dif1 = y.mar[i,j-2]-y.mar[i,j-3]
    dif2 = y.mar[i,j-1]-y.mar[i,j-2]
    if(dif1>0 & dif2>0){  # if weight goes up twice, drops out
      y.mar[i,j:nj] = NA;  break
    }
  }
}
y.mar = as.vector(t(y.mar))
mar.df <- cbind(id, week, cond, base, y, y.mar)
mar.df
# NMAR
sort.y = sort(y, decreasing=TRUE)
nmar   = sort.y[ceiling(prop.m*length(y))]
y.nmar = ifelse(y>nmar, NA, y)  # doesn't show up when heavier
nmar.df <- cbind(id, week, cond, base, y, y.nmar)
nmar.df


AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.