Nothing
#' Imputation of missing data in sequence analysis
#'
#' The seqimpute package implements the MICT and MICT-timing methods. These are multiple imputation methods for
#' longitudinal data. The core idea of the algorithms is to fills gaps of missing data, which is the typcial for of
#' missing data in a longitudinal setting, recursively from their edges. The prediction is based
#' on either a multinomial or a random forest regression model.
#' Covariates and time-dependant covariates can be included in the model.
#' The prediction of the missing values is based on the theory of Prof. Brendan
#' Halpin. It considers a various amount of surrounding available information to
#' perform the prediction process.
#' In fact, we can among others specify \code{np} (the number of past variables
#' taken into account) and \code{nf} (the number of future information taken
#' into account).
#'
#' @details The imputation process is divided into several steps. According to the location of the gaps of NA among the original dataset, we have defined 5 types of gaps:
#'
#' - Internal Gaps (simple usual gaps)
#'
#' - Initial Gaps (gaps situated at the very beginning of a sequence)
#'
#' - Terminal Gaps (gaps situated at the very end of a sequence)
#'
#' - Left-hand side SLG (Specially Located Gaps) (gaps of which the beginning location is included in the interval \code{[0,np]}
#' but the ending location is not included in the interval \code{[ncol(OD)-nf,ncol(OD)]})
#'
#' - Right-hand side SLG (Specially Located Gaps) (gaps of which the ending location is included in the interval \code{[ncol(OD)-nf,ncol(OD)]}
#' but the beginning location is not included in the interval \code{[0,np]})
#'
#' - Both-hand side SLG (Specially Located Gaps) (gaps of which the beginning location is included in the interval \code{[0,np]}
#' and the ending location is included in the interval \code{[ncol(OD)-nf,ncol(OD)]} )
#'
#' Order of imputation of the gaps types:
#' 1. Internal Gaps
#' 2. Initial Gaps
#' 3. Terminal Gaps
#' 4. Left-hand side SLG
#' 5. Right-hand side SLG
#' 6. Both-hand side SLG
#'
#' @param OD either a data frame containing sequences of a multinomial variable with missing data (coded as \code{NA}) or
#' a state sequence object built with the TraMineR package
#' @param np number of previous observations in the imputation model of the internal gaps.
#' @param nf number of future observations in the imputation model of the internal gaps.
#' @param m number of multiple imputations (default: \code{1}).
#' @param timing a logical value that specifies if the standard MICT algorithm (timing=FALSE)
#' or the MICT-timing algorithm (timing=TRUE) should be used.
#' @param timeFrame parameter relative to the MICT-timing algorithm specifying the radius of the timeFrame.
#' @param covariates a data frame containing the covariates intended for use in the imputation process, with each column representing a distinct covariate.
#' @param time.covariates a data frame object containing some time-dependent covariates that help specifying the predictive model more accurately.
#' @param regr a character specifying the imputation method. If \code{regr="multinom"}, multinomial models are used,
#' while if \code{regr="rf"}, random forest models are used.
#' @param nfi number of future observations in the imputation model of the initial gaps.
#' @param npt number of previous observations in the imputation model of the terminal gaps.
#' @param available a logical value allowing the user to choose whether to consider the already imputed data in the predictive model (\code{available = TRUE}) or not (\code{available = FALSE}).
#' @param pastDistrib a logical indicating if the past distribution should be used as predictor in the imputation model.
#' @param futureDistrib a logical indicating if the futur distribution should be used as predictor in the imputation model.
#' @param mice.return a logical indicating whether an object of class \code{mids}, that can be directly used by the \code{mice} package, should be returned
#' by the algorithm. By default, a data frame with the imputed datasets stacked vertically is returned.
#' @param include logical. If a dataframe is returned (\code{mice.return = FALSE}), indicates if the original
#' dataset should be included or not. This parameter does not apply if \code{mice.return=TRUE}.
#' @param noise \code{numeric} object adding a noise on the predicted variable \code{pred} determined by the multinomial model
#' (by introducing a variance \code{noise} for each components of the vector \code{pred}) (the user can choose any value for \code{noise}, but we recommend to choose a rather relatively small value situated in the interval \code{[0.005-0.03]}).
#' @param ParExec logical. If \code{TRUE}, the multiple imputations are run in parallell. This allows faster run time depending of how many core the processor has.
#' @param ncores integer. Number of cores to be used for the parallel computation. If no value is set for this parameter, the number of cores will be set
#' to the maximum number of CPU cores minus 1.
#' @param SetRNGSeed an integer that is used to set the seed in the case of parallel computation. Note that setting \code{set.seed()} alone before the seqimpute function won't work in case
#' of parallel computation.
#' @param verbose logical. If \code{TRUE}, seqimpute will print history and warnings on console. Use \code{verbose=FALSE} for silent computation.
#' @param ... Named arguments that are passed down to the imputation functions.
#'
#' @author Andre Berchtold <andre.berchtold@@unil.ch> Kevin Emery Anthony Guinchard Kamyar Taher
#'
#' @return Returns either an S3 object of class \code{mids} if \code{mice.return = TRUE}
#' or a dataframe, where the imputed dataset are stacked vertically. In the second case,
#' two columns are added: \code{.imp} integer that refers to the imputation number
#' (0 corresponding to the original dataset if \code{include=TRUE}) and \code{.id} character corresponding to
#' the rownames of the dataset to impute.
#'
#' @examples
#'
#' # Default single imputation
#' RESULT <- seqimpute(OD = OD, np = 1, nf = 1, nfi = 1, npt = 1, m = 1)
#'
#' # Seqimpute used with parallelisation
#' \dontrun{
#' RESULT <- seqimpute(OD = OD, np = 1, nf = 1, nfi = 1, npt = 1, m = 2, ParExec = TRUE, SetRNGSeed = 17, ncores = 2)
#' }
#'
#' @references HALPIN, Brendan (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639.
#' @references HALPIN, Brendan (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620
#'
#'
#' @export
seqimpute <- function(OD, np = 1, nf = 1, m = 1, timing = FALSE, timeFrame = 0, covariates = matrix(NA, nrow = 1, ncol = 1), time.covariates = matrix(NA, nrow = 1, ncol = 1), regr = "multinom", nfi = 1, npt = 1,
available = TRUE, pastDistrib = FALSE,
futureDistrib = FALSE, noise = 0, ParExec = FALSE, ncores = NULL,
SetRNGSeed = FALSE, verbose = TRUE, ...) {
check.deprecated(...)
if (timing == FALSE) {
return(seqimpute_standard(OD,
np = np, nf = nf, m = m, covariates = covariates,
time.covariates = time.covariates, regr = regr, nfi = nfi, npt = npt,
available = available, pastDistrib = pastDistrib,
futureDistrib = futureDistrib, noise = noise, ParExec = ParExec, ncores = ncores,
SetRNGSeed = SetRNGSeed, verbose = verbose, ...
))
} else {
return(seqimpute_timing(OD,
np = np, nf = nf, m = m, covariates = covariates,
time.covariates = time.covariates, regr = regr, nfi = nfi, npt = npt,
available = available, pastDistrib = pastDistrib,
futureDistrib = futureDistrib, noise = noise, ParExec = ParExec, ncores = ncores,
SetRNGSeed = SetRNGSeed, verbose = verbose, ...
))
}
}
seqimpute_standard <- function(OD, covariates = matrix(NA, nrow = 1, ncol = 1), time.covariates = matrix(NA, nrow = 1, ncol = 1), np = 1, nf = 1, m = 1, regr = "multinom", nfi = 1, npt = 1,
available = TRUE, pastDistrib = FALSE,
futureDistrib = FALSE, noise = 0, ParExec = FALSE, ncores = NULL,
SetRNGSeed = FALSE, verbose = TRUE, ...) {
#***************************************************************************
# Sequence object: Recode to NA the missing code
if (inherits(OD, "stslist")) {
valuesNA <- c(attr(OD, "nr"), attr(OD, "void"))
OD <- data.frame(OD)
OD[OD == valuesNA[1] | OD == valuesNA[2]] <- NA
}
if (sum(is.na(OD)) == 0) {
if (verbose == T) {
message("This dataset has no missing values!")
}
return(OD)
}
rownamesDataset <- rownames(OD)
nrowsDataset <- nrow(OD)
# if (mice.return == TRUE) {
# include <- TRUE
# }
# 0. Initial tests and manipulations on parameters ------------------------------------------------------------------------------------------------------------
dataOD <- preliminaryChecks(OD, covariates, time.covariates, np, nf, nfi, npt, pastDistrib, futureDistrib)
dataOD[c("pastDistrib", "futureDistrib", "totV", "totVi", "totVt", "noise")] <- InitCorectControl(regr, dataOD$ODClass, dataOD$OD, dataOD$nr, dataOD$nc, dataOD$k, np, nf, dataOD$nco, dataOD$ncot, nfi, npt, pastDistrib, futureDistrib, dataOD$totV, dataOD$totVi, dataOD$totVt, noise)
# 1. Analysis of OD and creation of matrices ORDER, ORDER2 and ORDER3 -----------------------------------------------------------------------------------------
dataOD[c("MaxInitGapSize", "InitGapSize", "MaxTermGapSize", "TermGapSize", "MaxGap", "ORDER", "ORDER2", "ORDER3")] <- OrderCreation(dataOD$OD, dataOD$nr, dataOD$nc)
# 2. Computation of the order of imputation of each MD (i.e. updating of matrix ORDER) --------------------------------------------------------------------
if (max(dataOD$ORDER) != 0) {
dataOD[c("ORDERSLGLeft", "ORDERSLGRight", "ORDERSLGBoth", "LongGap", "MaxGap", "REFORD_L", "ORDER")] <- ImputeOrderComputation(dataOD$ORDER, dataOD$ORDER3, dataOD$MaxGap, np, nf, dataOD$nr, dataOD$nc)
} else {
dataOD$ORDERSLGLeft <- matrix(nrow = dataOD$nr, ncol = dataOD$nc, 0)
dataOD$ORDERSLGRight <- matrix(nrow = dataOD$nr, ncol = dataOD$nc, 0)
dataOD$ORDERSLGBoth <- matrix(nrow = dataOD$nr, ncol = dataOD$nc, 0)
dataOD$LongGap <- FALSE
}
# Setting parallel or sequential backend and random seed
if (ParExec & (parallel::detectCores() > 2 & m > 1)) {
if (is.null(ncores)) {
Ncpus <- parallel::detectCores() - 1
} else {
Ncpus <- min(ncores, parallel::detectCores() - 1)
}
cl <- parallel::makeCluster(Ncpus)
doSNOW::registerDoSNOW(cl) # registerDoParallel doesn't have compatibility with ProgressBar
if (SetRNGSeed) {
doRNG::registerDoRNG(SetRNGSeed)
}
# set progress bar for parallel processing
pb <- txtProgressBar(max = m, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# condition used to run code part needed for parallel processing
ParParams <- TRUE
} else {
if (ParExec & m == 1) {
if (verbose == T) {
message(paste("/!\\ The number of multiple imputation is 1, parallel processing is only available for m > 1."))
}
} else if (ParExec) {
if (verbose == T) {
message(paste("/!\\ The number of cores of your processor does not allow paralell processing, at least 3 cores are needed."))
}
}
if (SetRNGSeed) {
set.seed(SetRNGSeed)
}
foreach::registerDoSEQ()
opts <- NULL
# condition used to run code part needed for sequential processing
ParParams <- FALSE
}
# Beginning of the multiple imputation (imputing "mi" times)
o <- NULL
RESULT <- foreach(o = 1:m, .inorder = TRUE, .combine = "rbind", .options.snow = opts) %dopar% {
if (!ParParams) {
# Parallel and sequential execution of foreach don't use the same casting mechanism, this one is used for sequential execution.
if (verbose == T) {
cat("iteration :", o, "/", m, "\n")
}
}
# Trying to catch the potential singularity error (part 1/2)
# (comment this part of code (as well as the one at the very end of
# seqimpute.R) to debug more easily and see the full message of any
# occuring error)
#********************************************************************************************************************************
#************************************************************************************************************************
# 3. Imputation using a specific model --------------------------------------------------------------------------------------------------------
if (max(dataOD$ORDER) != 0) {
# Otherwise if there is only 0 in ORDER,
# there is no need to impute internal gaps
# and we directly jump to the imputation of
# external gaps (i.e. points 4. and 5.)
if (verbose == T) {
print("Imputation of the internal gaps...")
}
dataOD[["ODi"]] <- ModelImputation(
OD = dataOD$OD, covariates = dataOD$CO, time.covariates = dataOD$COt, ODi = dataOD$ODi, MaxGap = dataOD$MaxGap,
totV = dataOD$totV, totVi = dataOD$totVi, regr = regr, nc = dataOD$nc, np = np, nf = nf, nr = dataOD$nr, ncot = dataOD$ncot, COtsample = dataOD$COtsample,
pastDistrib = dataOD$pastDistrib, futureDistrib = dataOD$futureDistrib, k = dataOD$k, available = available, REFORD_L = dataOD$REFORD_L, noise = dataOD$noise, ...
)
}
# 4. Imputing initial NAs -------------------------------------------------------------------------------------------------------------------------
if ((nfi != 0) & (dataOD$MaxInitGapSize != 0)) {
if (verbose == T) {
print("Imputation of the initial gaps...")
}
# # we only impute the initial gaps if nfi > 0
dataOD[["ODi"]] <- ImputingInitialNAs(
OD = dataOD$OD, covariates = dataOD$CO, time.covariates = dataOD$COt, ODi = dataOD$ODi, totVi = dataOD$totVi, COtsample = dataOD$COtsample,
futureDistrib = dataOD$futureDistrib, InitGapSize = dataOD$InitGapSize, MaxInitGapSize = dataOD$MaxInitGapSize, nr = dataOD$nr, nc = dataOD$nc,
ud = dataOD$ud, nco = dataOD$nco, ncot = dataOD$ncot, nfi = nfi, regr = regr, k = dataOD$k, available = available, noise = dataOD$noise, ...
)
}
# 5. Imputing terminal NAs ------------------------------------------------------------------------------------------------------------------------
if ((npt != 0) & (dataOD$MaxTermGapSize != 0)) {
# we only impute the terminal
# gaps if npt > 0
if (verbose == T) {
print("Imputation of the terminal gaps...")
}
dataOD[["ODi"]] <- ImputingTerminalNAs(
OD = dataOD$OD, covariates = dataOD$CO, time.covariates = dataOD$COt, ODi = dataOD$ODi, COtsample = dataOD$COtsample, MaxTermGapSize = dataOD$MaxTermGapSize,
TermGapSize = dataOD$TermGapSize, pastDistrib = dataOD$pastDistrib, regr = regr, npt = npt, nco = dataOD$nco, ncot = dataOD$ncot, totVt = dataOD$totVt,
nr = dataOD$nr, nc = dataOD$nc, ud = dataOD$ud, available = available, k = dataOD$k, noise = dataOD$noise, ...
)
}
# if (max(dataOD$ORDER)!=0) {
# 6. Imputing SLG NAs -----------------------------------------------------------------------------------------------------------------------------
# left-hand side SLG
if (max(dataOD$ORDERSLGLeft) != 0) {
# Checking if we have to impute
# left-hand side SLG
if (verbose == T) {
print("Imputation of the left-hand side SLG...")
}
dataOD[["ODi"]] <- LSLGNAsImpute(
OD = dataOD$OD, ODi = dataOD$ODi, covariates = dataOD$CO, time.covariates = dataOD$COt, COtsample = dataOD$COtsample, ORDERSLG = dataOD$ORDERSLGLeft,
pastDistrib = dataOD$pastDistrib, futureDistrib = dataOD$futureDistrib, regr = regr, np = np, nr = dataOD$nr, nf = nf, nc = dataOD$nc, ud = dataOD$ud, ncot = dataOD$ncot,
nco = dataOD$nco, k = dataOD$k, noise = dataOD$noise, available = available, ...
)
}
# right-hand side SLG
if (max(dataOD$ORDERSLGRight) != 0) {
# Checking if we have to impute right-hand
# side SLG
if (verbose == T) {
print("Imputation of the right-hand side SLG...")
}
dataOD[["ODi"]] <- RSLGNAsImpute(
OD = dataOD$OD, ODi = dataOD$ODi, covariates = dataOD$CO, time.covariates = dataOD$COt, COtsample = dataOD$COtsample, ORDERSLGRight = dataOD$ORDERSLGRight,
pastDistrib = dataOD$pastDistrib, futureDistrib = dataOD$futureDistrib, regr = regr, np = np, nr = dataOD$nr, nf = nf, nc = dataOD$nc, ud = dataOD$ud, ncot = dataOD$ncot,
nco = dataOD$nco, k = dataOD$k, noise = dataOD$noise, available = available, ...
)
}
# Checking if we have to impute
# Both-hand side SLG
if (dataOD$LongGap) {
if (verbose == T) {
print("Imputation of the both-hand side SLG...")
}
for (h in 2:np) {
if (sum(dataOD$ORDERSLGBoth[, h - 1] == 0 & dataOD$ORDERSLGBoth[, h] != 0) > 0) {
tt <- which(dataOD$ORDERSLGBoth[, h - 1] == 0 & dataOD$ORDERSLGBoth[, h] != 0)
tmpORDER <- matrix(0, nrow(dataOD$ORDERSLGBoth), ncol(dataOD$ORDERSLGBoth))
tmpORDER[tt, h:ncol(dataOD$ORDERSLGBoth)] <- dataOD$ORDERSLGBoth[tt, h:ncol(dataOD$ORDERSLGBoth)]
dataOD[["ODi"]] <- RSLGNAsImpute(
OD = dataOD$OD, ODi = dataOD$ODi, covariates = dataOD$CO, time.covariates = dataOD$COt, COtsample = dataOD$COtsample, ORDERSLGRight = tmpORDER,
pastDistrib = dataOD$pastDistrib, futureDistrib = dataOD$futureDistrib, regr = regr, np = h - 1, nr = dataOD$nr, nf = nf, nc = dataOD$nc, ud = dataOD$ud,
ncot = dataOD$ncot, nco = dataOD$nco, k = dataOD$k, noise = dataOD$noise, available = available, ...
)
}
}
}
# }
#****************************************************************************************************************************************
# Updating the matrix RESULT used to store the multiple imputations
return(cbind(replicate(dataOD$nr, o), dataOD$ODi))
}
if (ParParams) {
parallel::stopCluster(cl)
}
RESULT <- rbind(cbind(replicate(dataOD$nr, 0), dataOD$OD), RESULT)
# X. Final conversions ----------------------------------------------------------------------------------------------------------------------------------------
RESULT <- FinalResultConvert(RESULT, dataOD$ODClass, dataOD$ODlevels, rownamesDataset, nrowsDataset, dataOD$nr, dataOD$nc, dataOD$rowsNA, m)
# Rearrange dataframe by order of the mi imputation as parallel computation may not return the values in sequential order.
# if (ParParams){
# RESULT <- RESULT[order(RESULT$.imp),]
# }
structure(RESULT,class=c("seqimp","data.frame"))
#return(RESULT)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.