In romainkp/LtAtStructuR: Structuring of Complex Longitudinal Data into Long Format

library(data.table)

case_data0 <- data.frame(
  id      = 1:3,
  content = c("intnum=t", 
              "2021-04-21",
              "2021-05-20"),
  start   = c("2021-04-21", 
              "2021-04-21",
              "2021-05-20"),
  end     = c("2021-05-20",
              NA,
              NA),
  group   = c(1,3,4),
  type = c("background","box","box"),
  style = rep("background-color: #C0C0C0; font-size: 8pt",3)
)

1. Introduction

The R package LtAtStructuR automates the process of transforming longitudinal data (e.g., electronic health records data) into a structured analytic data set usable for marginal structural modeling (MSM). This data set, which is generated as output from the package, is suitable for evaluating the effects of exposure regimens (e.g., treatment plans) on a survival outcome based on MSM with inverse probability weighting or targeted minimum loss-based estimation. To create the output data set, an input cohort data set, an input exposure data set, and, optionally, one or more input covariate data sets must be defined using setCohort(), setExposure(), and setCovariate(), respectively, and gathered into a single LtAtData object. This LtAtData object is then passed to construct() to create the final output data set.

The functionality described below is designed to evaluate the effects of exposures defined by a single categorical variable at each time point.

2. Input data sets to be created by user

The creation of the following input data sets will often require substantial cleaning and manipulation of 'raw' data in studies based on electronic health records (EHR). For instance, when using drug dispensing data from the EHR, the preparation of an input data set that encodes a particular drug exposure will require the user to map fill dates and quantities dispensed into exposure periods that implement rules regarding, for example, the definition of gaps in medication possession and the use of potential medication stockpiles. Additionally, the input data sets must be of class data.table.
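As a rough illustration of the kind of preprocessing described above, the following base R sketch collapses dispensing records for one subject into exposure episodes. The data, the column names fill_date and days_supp, the helper build_episodes(), and the 30-day gap rule are all assumptions made for this example. Stockpiling is not modeled, and this step is not performed by the package.

```r
# Hypothetical dispensing records for one subject (not from the package).
fills <- data.frame(
  ID        = rep("0001", 4),
  fill_date = as.Date(c("2009-01-01", "2009-01-28", "2009-03-05", "2009-08-01")),
  days_supp = c(30, 30, 30, 30)
)

# Assumed rule: consecutive fills belong to the same episode unless more than
# `max_gap` days elapse between the end of the supply on hand and the next fill.
build_episodes <- function(d, max_gap = 30) {
  d <- d[order(d$fill_date), ]
  start <- as.numeric(d$fill_date)
  end   <- start + d$days_supp - 1
  # Latest supply end among all earlier fills (-Inf for the first fill).
  prev_end <- c(-Inf, cummax(end)[-nrow(d)])
  # A new episode starts whenever the gap exceeds max_gap.
  episode  <- cumsum(start - prev_end - 1 > max_gap)
  data.frame(
    ID             = d$ID[1],
    Exposure_start = as.Date(as.vector(tapply(start, episode, min)), origin = "1970-01-01"),
    Exposure_end   = as.Date(as.vector(tapply(end, episode, max)), origin = "1970-01-01")
  )
}

episodes <- build_episodes(fills)
print(episodes)
```

A data frame built this way could then be converted with data.table::as.data.table() so that it has the class required of the input data sets.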

2.1 Cohort definition

The input cohort data set encodes, for all subjects in the cohort:

1) baseline measurements,
2) dates of cohort entry and end of follow-up (eof), and
3) the reason for eof (i.e., occurrence of the outcome of interest or of a censoring event).

This data set should contain one row per subject and must include the following columns/variables (all other variables will be ignored):

Unique subject identifier ('ID') encoded as a numeric or character variable (e.g. medical record number).
Index date (i.e., date of cohort entry) encoded as a date variable.
End of follow-up date encoded as a date variable.
Reason for end of follow-up encoded as a numeric or character variable. One value denotes occurrence of the outcome of interest (e.g., acute myocardial infarction) while all other values denote the various types of censoring events (e.g., death, health plan disenrollment, or administrative end of study).
Baseline measurement of time-dependent (e.g., HbA1c lab test results) covariate(s) or measurement of time-independent (e.g., sex, race) covariate(s) encoded as either a numeric or character variable.

Missing values must be coded with NA. Note that rows with missing values in any of the first four columns above (ID, index date, eof date, or reason for eof) will be discarded by the LtAtStructuR package. Rows with missing values in any of the covariate measurements, however, will not be discarded.

library(LtAtStructuR)
`%+%` <- function(a, b) paste0(a, b)
input.cohort <- data.table::data.table(ID=c("000"%+%1:5),
                                       Index_date=lubridate::mdy(c("10/06/2008","05/18/2005","03/21/2006","06/17/2007","01/28/2008")),
                                       eof_dt=lubridate::mdy(c("09/30/2009","03/16/2007","11/30/2010","12/31/2010","12/31/2010")),
                                       EOF_reason=c("Lost_followup","Outcome","Lost_followup","Study_end","Study_end"),
                                       Race=c("BA","WH",NA,"AS","BA"),
                                       Hypertension=c(0,0,0,1,1),
                                       eGFR=as.numeric(c(NA,"48.2","61.0","59.7","71.3")),
                                       Stroke=c(1,1,0,1,0),
                                       Hosp_stay=c(0,1,1,0,1))
knitr::kable(input.cohort, caption = "Input cohort data")
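The requirements listed above can be checked before calling setCohort(). The following base R sketch uses a small hypothetical cohort (the object cohort_df and its third row's missing date are invented for illustration) to flag duplicated identifiers, non-date columns, and rows that would be discarded because a required column is missing.

```r
# Hypothetical cohort with one incomplete row (missing eof date for "0002").
cohort_df <- data.frame(
  ID         = c("0001", "0002", "0003"),
  Index_date = as.Date(c("2008-10-06", "2005-05-18", "2006-03-21")),
  eof_dt     = as.Date(c("2009-09-30", NA, "2010-11-30")),
  EOF_reason = c("Lost_followup", "Outcome", "Lost_followup"),
  stringsAsFactors = FALSE
)

key_cols <- c("ID", "Index_date", "eof_dt", "EOF_reason")

# One row per subject: no identifier may appear twice.
one_row_per_subject <- !anyDuplicated(cohort_df$ID)

# Both date columns must be stored as Date objects.
dates_ok <- inherits(cohort_df$Index_date, "Date") &&
  inherits(cohort_df$eof_dt, "Date")

# Rows with NA in any required column would be discarded.
incomplete <- !complete.cases(cohort_df[, key_cols])
discarded_ids <- cohort_df$ID[incomplete]
print(discarded_ids)
```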

Note that the output data set generated by the LtAtStructuR package will contain data from all subjects in the cohort data set, i.e., no systematic exclusion criteria are applied by the package. Thus, if the inclusion of some subjects is not warranted to address the research question (e.g., subjects who experienced the exposure before the index date), then data from these subjects should be excluded from the cohort data set and all subsequent input data sets.

Creating the cohort LtAtData object

When creating the cohort LtAtData object, the user must specify the following arguments:

data must be populated with the name of the input cohort data set object
IDvar must be populated with the name of the variable from the input cohort data set that contains the unique subject identifier
index_date must be populated with the name of the variable from the input cohort data set that contains the index date
EOF_date must be populated with the name of the variable from the input cohort data set that contains the end of follow-up dates
EOF_type must be populated with the name of the variable from the input cohort data set that contains the reason for end of follow-up
Y_name must be populated with the value of the reason for end of follow-up variable that denotes occurrence of the outcome of interest
L0 must be populated with the names of the variables from the input cohort data set that contain the baseline covariate measurements
L0_timeIndep must be populated with a named list that has one element for each variable in L0 that is time-independent; each element is itself a list with only the following three named elements:
categorical: specifies whether the covariate is continuous ('FALSE') or categorical ('TRUE'). Cannot be missing.
impute: specifies the imputation method for missing measurements: 'default', 'mean', 'mode', or 'median'. If missing, imputation with the 'mean' and 'mode' is used for continuous and categorical covariates, respectively. Imputation with 'mean', 'mode', or 'median' is based on measurements from subjects with observed covariate values in data. 'mean' and 'median' can only be used for continuous covariates. 'mode' can only be used for categorical covariates. Imputation with 'default' replaces missing values with 0 if the covariate is numeric and with 'Unknown' otherwise.
impute_default_level: imputation value to be used when the imputation method is 'default'. The value must be a length 1 character (resp. numeric) for a covariate encoded by a character (resp. numeric) vector. If missing, the default values 0 and 'Unknown' are used for continuous and categorical covariates, respectively.
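The imputation rules described above can be sketched in base R as follows. This is only an illustration of the stated rules, not the package's implementation: the helper impute_baseline() is hypothetical, and its handling of ties for the mode (the first most frequent value wins) is an assumption.

```r
# Hypothetical helper mirroring the imputation rules described in the text.
impute_baseline <- function(x, method = c("default", "mean", "median", "mode"),
                            default_level = NULL) {
  method <- match.arg(method)
  obs <- x[!is.na(x)]  # only observed values inform the imputed value
  if (is.null(default_level)) {
    default_level <- if (is.numeric(x)) 0 else "Unknown"
  }
  fill <- switch(method,
    default = default_level,
    mean    = mean(obs),
    median  = stats::median(obs),
    mode    = {
      ux <- unique(obs)
      ux[which.max(tabulate(match(obs, ux)))]  # first most frequent value
    }
  )
  x[is.na(x)] <- fill
  x
}

# Baseline values taken from the example cohort above.
race <- c("BA", "WH", NA, "AS", "BA")
egfr <- c(NA, 48.2, 61.0, 59.7, 71.3)

race_mode    <- impute_baseline(race, "mode")
race_default <- impute_baseline(race, "default")
egfr_mean    <- impute_baseline(egfr, "mean")
egfr_median  <- impute_baseline(egfr, "median")
```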

Define cohort object:

cohort <- setCohort(data = input.cohort, 
                    IDvar = "ID", 
                    index_date = "Index_date", 
                    EOF_date = "eof_dt", 
                    EOF_type = "EOF_reason", 
                    Y_name = "Outcome", 
                    L0 = c("Race","Hypertension","eGFR","Stroke","Hosp_stay"), 
                    L0_timeIndep = list("Race"=list("categorical"=TRUE,
                                                    "impute"=NA,
                                                    "impute_default_level"=NA)) )

2.2 Exposure definition

Two types of exposure data can be handled by the package: interval and instantaneous exposures. Interval exposures correspond to exposures that are experienced over intervals of time with a start and end date. Instantaneous exposures correspond to exposures that occur on a single day.

2.2.1 Interval exposures

The input exposure data set encodes the exposure regimens for all subjects in the cohort by describing intervals of time during which subjects are exposed to an exposure level other than the reference exposure level chosen by the user (i.e., the output data set created by the package will be based on the assumption that each patient is exposed to the reference level except if encoded otherwise by the exposure data set). Thus, if a subject only experiences the reference exposure level during follow-up, there should be no record for this subject in this exposure data set. Otherwise, this data set can contain multiple rows for the subject and must include the following four (and sometimes only the first three) variables (all others are ignored):

Unique subject identifier: the name of this column should be the same as the 'ID' column in the cohort data set. The values in this column must be a subset of the values in the 'ID' column in the cohort data set (because not all subjects will necessarily have experienced a non-reference exposure level).
Start date of a non-reference exposure episode encoded as a date variable.
End date of a non-reference exposure episode encoded as a date variable.
Value of the non-reference exposure level encoded as a numeric or character variable. This column is not required if the exposure variable is binary. The value for the non-reference exposure level used in the output data set created by the package will then be 1 by default.

The exposure episodes described by rows with the same ID must be non-overlapping. Missing values are not allowed in the exposure data set. All subject identifiers in the exposure data set must also be present in the cohort data set. In addition, while the exposure data set may contain measurements collected strictly before a subject's index date or strictly after a subject's end of follow-up date (both dates are specified in the cohort data set), all exposure measurements collected strictly before the index date or strictly after the eof date will be ignored/discarded by the package, i.e., the output data set created by the package will not incorporate these observations. The value that encodes the reference exposure level in the output data set will be set to 0 if the fourth column of the input exposure data set described above is missing; otherwise, it will be set to the reference exposure level specified by the user (see the exp_ref argument below).

input.exposure <- data.table::data.table(ID=c("0001","0001","0001","0003","0003"),
                                         Exposure_start=lubridate::mdy(c("07/21/2006","01/30/2009","08/14/2009","04/06/2006","03/30/2008")),
                                         Exposure_end=lubridate::mdy(c("11/30/2008","05/16/2009","10/13/2009","02/17/2008","06/18/2010")))

input.exposure.cat <- data.table::data.table(ID=c("0001","0001","0001","0003","0003"),
                                             Exposure_start=lubridate::mdy(c("07/21/2006","01/30/2009","08/14/2009","04/06/2006","03/30/2008")),
                                             Exposure_end=lubridate::mdy(c("11/30/2008","05/16/2009","10/13/2009","02/17/2008","06/18/2010")),
                                             Exposure_level=c("metformin","insulin","insulin","sulfonylurea","met+sul"))

knitr::kables(list(knitr::kable(input.exposure, caption = "Input exposure data for a binary exposure"),
                   knitr::kable(input.exposure.cat, caption = "Input exposure data for a categorical exposure")
                   ))
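The constraints stated above (no missing values, identifiers drawn from the cohort, and non-overlapping episodes within a subject) can be verified before calling setExposure(). The base R sketch below reuses the example episode dates. The helper has_overlap() is hypothetical, and treating an episode that starts on the day the previous one ends as an overlap is an assumption of this example.

```r
# Episodes copied from the binary exposure example above.
exposure_df <- data.frame(
  ID             = c("0001", "0001", "0001", "0003", "0003"),
  Exposure_start = as.Date(c("2006-07-21", "2009-01-30", "2009-08-14",
                             "2006-04-06", "2008-03-30")),
  Exposure_end   = as.Date(c("2008-11-30", "2009-05-16", "2009-10-13",
                             "2008-02-17", "2010-06-18")),
  stringsAsFactors = FALSE
)
cohort_ids <- sprintf("%04d", 1:5)  # "0001" ... "0005"

# Hypothetical check: does any episode start on or before the end of the
# previous episode of the same subject?
has_overlap <- function(d) {
  d <- d[order(d$ID, d$Exposure_start), ]
  n <- nrow(d)
  if (n < 2) return(FALSE)
  same_id <- d$ID[-1] == d$ID[-n]
  any(same_id & d$Exposure_start[-1] <= d$Exposure_end[-n])
}

no_missing   <- !anyNA(exposure_df)
ids_in_cohort <- all(exposure_df$ID %in% cohort_ids)
overlapping  <- has_overlap(exposure_df)
```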

Creating the exposure LtAtData object

When creating the exposure LtAtData object, the user must specify the following arguments:

data must be populated with the name of the input exposure data set object
IDvar must be populated with the name of the variable from the input exposure data set that contains the unique subject identifier
start_date and end_date must be populated with the names of the variables from the input exposure data set that contain the exposure start and end dates, respectively
If the exposure is categorical, exp_level must be populated with the name of the variable from the input exposure data set that contains the exposure categorical values; exp_ref must then be populated with the reference exposure level

Define exposure object:

## Binary exposure
exposure <- setExposure(data = input.exposure,
                        IDvar = "ID",
                        start_date = "Exposure_start",
                        end_date = "Exposure_end")

## Categorical exposure
exposure.cat <- setExposure(data = input.exposure.cat,
                        IDvar = "ID",
                        start_date = "Exposure_start",
                        end_date = "Exposure_end",
                        exp_level = "Exposure_level",
                        exp_ref = "None")

2.2.2 Instantaneous exposures

The input exposure data set encodes the exposure regimens for all subjects in the cohort by describing the single day during which the subjects are exposed to an exposure level other than the reference exposure level chosen by the user (i.e., the output data set created by the package will be based on the assumption that each patient is exposed to the reference level except if encoded otherwise by the exposure data set). Thus, if a subject only experiences the reference exposure level during follow-up, there should be no record for this subject in this exposure data set. Otherwise, this data set can contain multiple rows for the subject and must include the following three (and sometimes only the first two) variables (all others are ignored):

Unique subject identifier: the name of this column should be the same as the 'ID' column in the cohort data set. The values in this column must be a subset of the values in the 'ID' column in the cohort data set (because not all subjects will necessarily have experienced a non-reference exposure level).
Exposure date of a non-reference exposure episode encoded as a date variable.
Value of the non-reference exposure level encoded as a numeric or character variable. This column is not required if the exposure variable is binary. The value for the non-reference exposure level used in the output data set created by the package will then be '1' by default.

The exposure episodes described by rows with the same ID must be non-overlapping. Missing values are not allowed in the exposure data set. All subject identifiers in the exposure data set must also be present in the cohort data set. In addition, while the exposure data set may contain measurements collected strictly before a subject's index date or strictly after a subject's end of follow-up date (both dates are specified in the cohort data set), all exposure measurements collected strictly before the index date or strictly after the eof date will be ignored/discarded by the package, i.e., the output data set created by the package will not incorporate these observations. The value that encodes the reference exposure level in the output data set will be set to '0' if the third column of the input exposure data set described above is missing; otherwise, it will be set to '0' if that column is a numeric variable and to 'not exposed' if that column is a character variable.

indexDate_001 <- input.cohort[ID=="0001",lubridate::as_date(Index_date)]
indexDate_003 <- input.cohort[ID=="0003",lubridate::as_date(Index_date)]

expDT <- setInstantExposure(
    rbind(data.table::data.table("ID"="0001","fill.date"=indexDate_001,"D.t"="analog insulin","Q.t"=15),
          data.table::data.table("ID"="0001","fill.date"=indexDate_001+10,"D.t"="analog insulin","Q.t"=90),
          data.table::data.table("ID"="0001","fill.date"=indexDate_001+10+80,"D.t"="analog insulin","Q.t"=90),
          data.table::data.table("ID"="0001","fill.date"=indexDate_001+10+90,"D.t"="human insulin","Q.t"=15),
          data.table::data.table("ID"="0001","fill.date"=indexDate_001+10+90+30,"D.t"="human insulin","Q.t"=180),
          data.table::data.table("ID"="0003","fill.date"=indexDate_003,"D.t"="analog insulin","Q.t"=15),
          data.table::data.table("ID"="0003","fill.date"=indexDate_003+1,"D.t"="analog insulin","Q.t"=90),
          data.table::data.table("ID"="0003","fill.date"=indexDate_003+2,"D.t"="human insulin","Q.t"=90)          
          ),
    "ID", "fill.date", c("D.t","Q.t"))

input_InstExp_bin <- expDT$data[,.(ID,fill.date)]
input_InstExp_cat <- expDT$data

knitr::kables(list(knitr::kable(input_InstExp_bin, caption = "Input exposure data for a binary exposure"),
                   knitr::kable(input_InstExp_cat, caption = "Input exposure data for a categorical exposure")
                   ))
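To make the follow-up window rule above concrete, the base R sketch below marks which exposure dates fall between the index date and the eof date, both inclusive, since only records strictly before the index date or strictly after the eof date are ignored. The fill dates are hypothetical and the column kept is introduced only for this example; the filtering itself is done by the package.

```r
# Follow-up window for subject "0001", as in the example cohort.
cohort_df <- data.frame(
  ID         = "0001",
  Index_date = as.Date("2008-10-06"),
  eof_dt     = as.Date("2009-09-30"),
  stringsAsFactors = FALSE
)

# Hypothetical exposure dates: one before, one on, one within, one after.
fills <- data.frame(
  ID        = "0001",
  fill.date = as.Date(c("2008-09-01", "2008-10-06", "2009-01-14", "2009-10-15")),
  stringsAsFactors = FALSE
)

m <- merge(fills, cohort_df, by = "ID")
m <- m[order(m$ID, m$fill.date), ]

# Records on the index date or on the eof date are retained.
m$kept <- m$fill.date >= m$Index_date & m$fill.date <= m$eof_dt
print(m[, c("ID", "fill.date", "kept")])
```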

Creating the exposure LtAtData object

When creating the exposure LtAtData object, the user must specify the following arguments:

data must be populated with the name of the input exposure data set object
IDvar must be populated with the name of the variable from the input exposure data set that contains the unique subject identifier
exp_date must be populated with the name of the variable from the input exposure data set that contains the exposure dates
If the exposure is categorical, exp_level must be populated with the name of the variable from the input exposure data set that contains the exposure categorical values; it can be missing if there is only one non-reference exposure level. If missing, the exposure is assumed to be binary and its reference level is encoded by 0

Define exposure object:

## Binary exposure
exposure_instant_binary <- setInstantExposure(data = input_InstExp_bin,
                                              IDvar = "ID",
                                              exp_date = "fill.date")

## Categorical exposure
exposure_instant_categorical <- setInstantExposure(data = input_InstExp_cat,
                                                   IDvar = "ID",
                                                   exp_date = "fill.date",
                                                   exp_level = c("D.t","Q.t"))

2.3 Covariate definition

The input covariate data set(s) encode follow-up measurements strictly after baseline (i.e., index date) for all time-dependent variables (e.g., laboratory measurements, diagnoses, procedures, and drug prescriptions) other than the exposure, outcome, and censoring variables. For each time-dependent covariate, a separate data set is used to store all follow-up measurements. This data set must include the following three (and sometimes only the first two) variables (all others are ignored):

Unique subject identifier: the name of this column should be the same as the 'ID' column in the cohort data set. The values in this column must be a subset of the values in the 'ID' column in the cohort data set (because not all subjects will necessarily have follow-up measurements).
Date of measurement encoded as a date variable.
Value of measurement encoded as a numeric or character variable. The name of this column and its variable type should match that of the column in the cohort data set that contains the baseline measurements for the same time-dependent covariate. This column is not required for covariates of type "binary monotone increasing" (e.g., records of diagnoses or procedures).

Typically, each covariate data set will contain more than one row with the same 'ID', i.e., multiple measurements per subject, although some subjects may only have one follow-up measurement or none. However, each covariate data set must not contain more than one measurement per day for any given subject. In addition, while each covariate data set may contain measurements collected before a subject's index date or after a subject's eof date (both dates are specified in the cohort data set), all covariate measurements collected on or before the index date or after the eof date will be ignored/discarded by the package, i.e., the output data set created by the package will not incorporate these observations. Missing covariate information during follow-up (i.e., after the index date and on or before the eof date) must be encoded by the absence of a record in the covariate data set. In other words, there should not be any missing values for the required three (sometimes two) columns outlined above in the covariate data sets. Finally, all subject identifiers in each covariate data set must also be present in the cohort data set.
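Two of the rules above lend themselves to a quick check in base R: at most one measurement per subject per day, and a follow-up window that excludes the index date itself but includes the eof date. In the sketch below the measurement values and dates are hypothetical, and the columns dup_day and in_followup are introduced only for illustration.

```r
cohort_df <- data.frame(
  ID         = c("0001", "0002"),
  Index_date = as.Date(c("2008-10-06", "2005-05-18")),
  eof_dt     = as.Date(c("2009-09-30", "2007-03-16")),
  stringsAsFactors = FALSE
)

# Hypothetical eGFR records: a same-day duplicate for "0001", one record on
# the index date, and one record after the eof date for "0002".
egfr_df <- data.frame(
  ID      = c("0001", "0001", "0001", "0002", "0002"),
  Datevar = as.Date(c("2008-10-06", "2009-01-05", "2009-01-05",
                      "2005-07-12", "2007-06-01")),
  eGFR    = c(41.0, 42.8, 43.1, 55.4, 50.2),
  stringsAsFactors = FALSE
)

# Flag every row that shares its subject and day with another row.
key <- egfr_df[, c("ID", "Datevar")]
dup_day <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Covariate records count only strictly after the index date and on or
# before the eof date.
m <- merge(egfr_df, cohort_df, by = "ID")
m <- m[order(m$ID, m$Datevar), ]
m$in_followup <- m$Datevar > m$Index_date & m$Datevar <= m$eof_dt
```

Rows flagged by dup_day would need to be resolved by the user (for example by keeping a single value per day) before the data set is passed to setCovariate().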

The argument type must be populated with the value "binary monotone increasing", "interval", "sporadic", or "indicator" as described below:

2.3.1 Binary monotone increasing

input.covariate.behav.1 <- data.table::data.table(ID=c("0001","0002","0003"),
                                                    Datevar=lubridate::mdy(c("04/01/2009","12/05/2006","05/01/2008")),
                                                    Hypertension=c(1,1,1))
knitr::kable(input.covariate.behav.1, caption = "Input covariate data set of type binary monotone increasing")

The value of "binary monotone increasing" indicates that the covariate data set specified with setCovariate() is used by the package as a supplement to the cohort data set to create a single output time-dependent variable that is binary and monotone increasing with no missing values, i.e., its value either remains equal to its baseline value (0 or 1) during follow-up or changes only once during follow-up from its baseline value of 0 to 1 and remains 1 thereafter. Examples of such covariates include binary indicators of a subject’s history of undergoing a given procedure or of receiving a given diagnosis. Note that values for the baseline measurements of a covariate of type "binary monotone increasing" must be provided in the cohort data set and missing values are not allowed. Only the numeric baseline values 0 and 1 may be used for such covariates in the cohort data set. For any given subject with a baseline measurement of the covariate in the cohort data set equal to 0, the earliest date of the covariate measurement for that subject in the covariate data set will be interpreted by the package as the time when the output variable changes value from 0 to 1. All other measurements in the covariate data set for that subject, if any, will be ignored. Because there should be no missing values in the output variable, the impute argument can be left blank.
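The rule that only the earliest record matters, and only for subjects whose baseline value is 0, can be sketched in base R as follows. The record dates are partly hypothetical, the column switch_date is introduced for this example, and the sketch describes the stated rule rather than the package's internal code.

```r
# Baseline values as in the example cohort.
baseline <- data.frame(
  ID           = sprintf("%04d", 1:5),
  Hypertension = c(0, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical diagnosis records; subject "0001" has two, "0005" has none.
records <- data.frame(
  ID      = c("0001", "0001", "0002", "0003", "0004"),
  Datevar = as.Date(c("2009-06-15", "2009-04-01", "2006-12-05",
                      "2008-05-01", "2008-01-10")),
  stringsAsFactors = FALSE
)

# Earliest record per subject (later records are ignored).
first_num <- tapply(as.numeric(records$Datevar), records$ID, min)
first_rec <- data.frame(
  ID         = names(first_num),
  first_date = as.Date(as.vector(first_num), origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

out <- merge(baseline, first_rec, by = "ID", all.x = TRUE)

# A 0 -> 1 switch is only possible when the baseline value is 0.
out$switch_date <- out$first_date
out$switch_date[out$Hypertension == 1] <- NA
print(out[, c("ID", "Hypertension", "switch_date")])
```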

Example output data set of type binary monotone increasing:

beh.1.eg1 <- data.table::data.table(ID=rep("EG.ID1",3),intnum=0:2,censor=c(0,0,1),Hypertension=c(1,1,1))
knitr::kable(beh.1.eg1, caption = "On at baseline")

Example output data set of type binary monotone increasing:

beh.1.eg2 <- data.table::data.table(ID=rep("EG.ID2",3),intnum=0:2,censor=c(0,0,1),Hypertension=c(0,0,0))
knitr::kable(beh.1.eg2, caption = "Never on")

Example output data set of type binary monotone increasing:

beh.1.eg3 <- data.table::data.table(ID=rep("EG.ID3",3),intnum=0:2,censor=c(0,0,1),Hypertension=c(0,1,1))
knitr::kable(beh.1.eg3, caption = "On during follow-up")

2.3.2 Interval

input.covariate.behav.2 <- data.table::data.table(ID=c(sort(rep("000"%+%1:5,2))),
                                                  Datevar=lubridate::mdy(c("12/12/2008","12/17/2008","01/01/2006","01/03/2006","07/15/2008","07/30/2008","05/04/2009","05/10/2009","02/01/2008","02/06/2008")),
                                                  Hosp_stay=c(1,0,1,0,1,0,1,0,1,0))
knitr::kable(input.covariate.behav.2, caption = "Input covariate data set of type interval")

The value of "interval" indicates that the covariate data set specified with setCovariate() is used by the package as a supplement to the cohort data set to create a single output time-dependent variable that is categorical (possibly binary) or continuous with only observed/known values over time (i.e., missing values are not allowed). Examples of such covariates include variables that represent the temporal coverage of prescriptions or of hospitalization stays. Note that values for the baseline measurements of a covariate of type "interval" must be provided in the cohort data set and missing values are not allowed. For any given subject, the date and value of all measurements for that subject in the covariate data set will be processed by the package to determine when, if at all, the output variable changes values and the corresponding updated values for that subject. Because there should be no missing values in the output variable, the impute argument can be left blank.
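One way to picture how the records of an "interval" covariate translate into values over time is as a step function: the baseline value holds until the first record, and each record's value holds until the next one. The base R sketch below uses the baseline value and the two records of subject "0001" from the examples above. The helper value_on() is hypothetical and is not the package's algorithm.

```r
# Hypothetical helper: value of the covariate on a given day, assuming
# change_dates is sorted in increasing order.
value_on <- function(day, change_dates, change_values, baseline) {
  # Number of recorded changes that occurred on or before each day.
  idx <- findInterval(as.numeric(day), as.numeric(change_dates))
  c(baseline, change_values)[idx + 1]
}

# Subject "0001": Hosp_stay is 0 at baseline, 1 from 2008-12-12, 0 from 2008-12-17.
change_dates  <- as.Date(c("2008-12-12", "2008-12-17"))
change_values <- c(1, 0)

days <- as.Date(c("2008-12-01", "2008-12-12", "2008-12-14", "2008-12-20"))
hosp <- value_on(days, change_dates, change_values, baseline = 0)
print(data.frame(day = days, Hosp_stay = hosp))
```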

Example output data set of type interval:

beh.2.eg1 <- data.table::data.table(ID=rep("EG.ID1",3),intnum=0:2,censor=c(0,0,1),Hosp_stay=c(1,1,1))
knitr::kable(beh.2.eg1, caption = "Always on")

Example output data set of type interval:

beh.2.eg2 <- data.table::data.table(ID=rep("EG.ID2",3),intnum=0:2,censor=c(0,0,1),Hosp_stay=c(0,0,0))
knitr::kable(beh.2.eg2, caption = "Never on")

Example output data set of type interval:

beh.2.eg3 <- data.table::data.table(ID=rep("EG.ID3",3),intnum=0:2,censor=c(0,0,1),Hosp_stay=c(1,0,1))
knitr::kable(beh.2.eg3, caption = "On and off")

2.3.3 Sporadic

input.covariate.behav.4 <- data.table::data.table(ID=c("0001","0001","0002","0002","0003","0003","0004","0004","0005","0005"),
                                                  Datevar=lubridate::mdy(c("01/05/2009","05/08/2009","05/25/2005","07/12/2005","05/18/2007","04/08/2007","01/02/2008","05/09/2010","06/12/2008","03/14/2010")),
                                                  eGFR=c(42.8,43.6,64.7,55.4,60.1,52.3,70.2,64.3,45.7,53.8))
knitr::kable(input.covariate.behav.4, caption = "Input covariate data set of type sporadic")

The value "sporadic" indicates that the covariate data set specified with setCovariate() is used by the package as a supplement to the cohort data set to create an output time-dependent variable that is categorical (possibly binary) or continuous with observed and possibly unobserved values over time (i.e., missing values are allowed). Examples of such covariates include variables that represent laboratory measurements. Note that values for the baseline measurements of a covariate of type "sporadic" must be provided in the cohort data set; missing values are allowed and must be coded with NA. The impute argument must be populated with the value default, mean, mode, or median to indicate whether missing baseline values should be imputed with, respectively, the default value 0 ("Unknown" if the covariate is a character variable), the mean, the mode, or the median of the baseline values from subjects with non-missing baseline values. The default value 0/"Unknown" can be changed by populating the impute_default_level argument with another value. For any given subject, the date and value of all follow-up measurements in the covariate data set will be processed by the package to determine when, if at all, the output variable is measured during follow-up and the corresponding observed values for that subject.

Example output data set of type sporadic:

beh.4.eg1 <- data.table::data.table(ID=rep("EG.ID1",6),intnum=0:5,censor=c(0,0,0,0,0,1),eGFR=c(30.8,40.2,44.4,30.4,NA,39.1),I.eGFR=c(0,0,0,0,1,0))
knitr::kable(beh.4.eg1, caption = "Frequent monitoring")

Example output data set of type sporadic:

beh.4.eg2 <- data.table::data.table(ID=rep("EG.ID2",6),intnum=0:5,censor=c(0,0,0,0,0,1),eGFR=c(40.6,43.2,NA,NA,NA,40.4),I.eGFR=c(0,0,1,1,1,0))
knitr::kable(beh.4.eg2, caption = "Less frequent monitoring")

Example output data set of type sporadic:

beh.4.eg3 <- data.table::data.table(ID=rep("EG.ID3",6),intnum=0:5,censor=c(0,0,0,0,0,1),eGFR=c(56.7,NA,NA,NA,NA,NA),I.eGFR=c(0,1,1,1,1,1))
knitr::kable(beh.4.eg3, caption = "No monitoring after baseline")
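In the three example outputs above, the column I.eGFR equals 1 exactly where eGFR is NA and 0 elsewhere. Assuming this reading of the examples is correct, such a missingness indicator can be reproduced in base R as sketched below, here with the values from the "Less frequent monitoring" example. How the package derives the column internally is not shown in this section.

```r
# eGFR values over six time intervals, copied from the second example above.
eGFR <- c(40.6, 43.2, NA, NA, NA, 40.4)

# Indicator of an unobserved value: 1 where eGFR is NA, 0 otherwise.
I.eGFR <- as.numeric(is.na(eGFR))

print(data.frame(intnum = 0:5, eGFR = eGFR, I.eGFR = I.eGFR))
```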

2.3.4 Indicator

input.covariate.behav.5 <- data.table::data.table(ID=c("0001","0002","0003","0003","0003","0004","0004","0004","0005"),
                                                  Datevar=lubridate::mdy(c("12/01/2008","02/15/2006","01/01/2007","02/02/2008","03/03/2009","11/10/2007","3/25/2009","10/31/2010","8/10/2010")),
                                                  Stroke=c(1,1,1,1,1,1,1,1,1))
knitr::kable(input.covariate.behav.5, caption = "Input covariate data set of type indicator")

The value "indicator" indicates that the covariate data set specified with setCovariate() is used by the package as a supplement to the cohort data set to create an output time-dependent variable that is categorical (possibly binary) or continuous with only observed/known values over time (i.e., missing values are not allowed). Examples of such covariates include variables that indicate occurrence of a given event (e.g., a stroke or a clinic visit). Non-occurrence of the event of interest is encoded by the default value “None” and 0 in the output data set created by the package for character and numeric values, respectively. Note that values for the baseline measurements of a covariate of type "indicator" must be provided in the cohort data set and missing values are not allowed. For any given subject, the date and value of all measurements for that subject in the covariate data set will be processed by the package to determine when, if at all, the output variable changes values and the corresponding updated values for that subject. Because there should be no missing values in the output variable, the impute argument can be left blank.

Example output data set of type indicator:

beh.5.eg1 <- data.table::data.table(ID=rep("EG.ID1",6),intnum=0:5,censor=c(0,0,0,0,0,1),Stroke=c(0,0,0,0,0,0))
knitr::kable(beh.5.eg1, caption = "No events during follow up")

Example output data set of type indicator:

beh.5.eg2 <- data.table::data.table(ID=rep("EG.ID2",6),intnum=0:5,censor=c(0,0,0,0,0,1),Stroke=c(1,0,0,0,0,0))
knitr::kable(beh.5.eg2, caption = "One event during follow up")

Example output data set of type indicator:

beh.5.eg3 <- data.table::data.table(ID=rep("EG.ID3",6),intnum=0:5,censor=c(0,0,0,0,0,1),Stroke=c(1,0,1,0,1,1))
knitr::kable(beh.5.eg3, caption = "Multiple events during follow up")

Creating the covariate LtAtData objects

When creating the covariate LtAtData object(s), the user must specify the following arguments:

data must be populated with the name of the time-dependent input covariate data set object
type must be populated with "binary monotone increasing", "interval", "sporadic", or "indicator"
IDvar must be populated with the name of the variable from the input covariate data set that contains the unique subject identifier
L_date must be populated with the name of the variable from the covariate data set that contains the dates of follow-up measurements
L_name must be populated with the name of the variable from the covariate and/or cohort data set that contains the values of the covariate measurements
categorical indicates whether the covariate is continuous (categorical = FALSE) or categorical (categorical = TRUE)
impute must be populated with 'default', 'mean', 'mode', or 'median'; a value of NA will default to using the mean and mode for continuous and categorical covariates, respectively
impute_default_level must be populated with a character or numeric value if and only if impute is 'default'; otherwise, it can be left empty; a value of NA will default to values of 0 and 'Unknown' for continuous and categorical covariates, respectively
acute_change must be set to TRUE/FALSE; a value of TRUE is used to indicate that a measurement of the covariate on a day when the exposure level changes may be the result of the exposure change, and a value of FALSE is used otherwise to indicate that the covariate measurement may be assumed to be unaffected by the change in exposure and thus assumed to have preceded (and possibly triggered) the change in exposure.

Define covariate objects:

hypertension.cov <- setCovariate(data = input.covariate.behav.1,
                                 type = "binary monotone increasing",
                                 IDvar = "ID",
                                 L_date = "Datevar",
                                 L_name = "Hypertension",
                                 categorical = TRUE,
                                 impute = NA,
                                 impute_default_level = NA,
                                 acute_change = FALSE)

hosp_stay.cov <- setCovariate(data = input.covariate.behav.2,
                              type = "interval",
                              IDvar = "ID",
                              L_date = "Datevar",
                              L_name = "Hosp_stay",
                              categorical = TRUE,
                              impute = NA,
                              impute_default_level = NA,
                              acute_change = FALSE)

egfr.cov <- setCovariate(data = input.covariate.behav.4,
                         type = "sporadic",
                         IDvar = "ID",
                         L_date = "Datevar",
                         L_name = "eGFR",
                         categorical = FALSE,
                         impute = NA,
                         impute_default_level = NA,
                         acute_change = FALSE)

stroke.cov <- setCovariate(data = input.covariate.behav.5,
                           type = "indicator",
                           IDvar = "ID",
                           L_date = "Datevar",
                           L_name = "Stroke",
                           categorical = TRUE,
                           impute = NA,
                           impute_default_level = NA,
                           acute_change = FALSE)

3. Construct definition

The final step of the package, construct(), maps the input cohort, exposure, and covariate data sets into a structured analytic data set that encodes complex, discrete-time, longitudinal data. First, each input data set must be gathered into a single LtAtData object:

Interval exposure

## Final LtAtData object using binary exposure
LtAt.data.binary.At <- cohort + exposure + hypertension.cov + hosp_stay.cov + egfr.cov + stroke.cov

## Final LtAtData object using categorical exposure
LtAt.data.categorical.At <- cohort + exposure.cat + hypertension.cov + hosp_stay.cov + egfr.cov + stroke.cov

Instantaneous exposure

## Final LtAtData object using binary exposure
LtAt.data.binary.InstExp <- cohort + exposure_instant_binary + hypertension.cov + hosp_stay.cov + egfr.cov + stroke.cov

## Final LtAtData object using categorical exposure
LtAt.data.categorical.InstExp <- cohort + exposure_instant_categorical + hypertension.cov + hosp_stay.cov + egfr.cov + stroke.cov

A unit of time, time_unit, has to be specified before running the construct function, and must be populated with the number of days that will serve as the analytic unit of time in the output data set. This unit of time is used to create discrete consecutive time intervals between the index date and the end of follow-up.
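
As a rough illustration of how a time_unit of 30 days partitions follow-up into consecutive intervals, the base R sketch below builds interval numbers and dates for one hypothetical patient. The dates, the variable names index_date and eof_date, and the choice to truncate the last interval at the end of follow-up are assumptions made for this example; they are not taken from the package internals.

```r
# Hypothetical index date and end-of-follow-up date for one patient
index_date <- as.Date("2021-01-01")
eof_date   <- as.Date("2021-03-15")
time_unit  <- 30  # analytic unit of time, in days

# Number of intervals needed to cover follow-up, numbered from 0
n_int  <- as.integer(eof_date - index_date) %/% time_unit + 1
intnum <- seq_len(n_int) - 1

# Start and end date of each interval
intstart <- index_date + intnum * time_unit
intend   <- intstart + time_unit - 1
# Assumption for this sketch: the last interval stops at end of follow-up
intend[intend > eof_date] <- eof_date

intervals <- data.frame(intnum = intnum, intstart = intstart, intend = intend)
print(intervals)
# Three intervals, starting 2021-01-01, 2021-01-31, and 2021-03-02
```

A smaller time_unit yields more rows per patient in the output data set; a larger one coarsens the time scale.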

The format argument must be populated with the value standard or MSM SAS macro. A value of MSM SAS macro indicates that the output data set created by LtAtStructuR should be formatted for direct use with the %MSM macro developed by the Harvard Causal Inference group. The %MSM macro automates MSM fitting with Inverse Probability Weighting estimation in studies with survival outcomes. The %MSM macro code and its documentation can be downloaded at https://www.hsph.harvard.edu/causal/software/. A value of standard indicates that the output data set created by LtAtStructuR will not be directly compatible with the %MSM macro; instead, it will be compatible with either the ltmle R package developed at the University of California, Berkeley or the stremr R package developed at the Kaiser Permanente Northern California, Division of Research. The ltmle and stremr packages automate the fitting of MSM and dynamic MSM with both Inverse Probability Weighting estimation and Targeted Minimum Loss-based Estimation in studies with survival outcomes. ltmle can be downloaded at http://cran.r-project.org/web/packages/ltmle. stremr can be downloaded at http://cran.r-project.org/web/packages/stremr.

The first_exp_rule argument must be populated with the value 0 or 1. With this value, the user indicates whether a subject should be deemed first exposed to a non-reference exposure level in the output data set when the subject experiences a non-reference exposure level for at least 1 day of a time interval or for at least a proportion exp_threshold of the days of a time interval (recall that each follow-up interval is defined by the number of days specified by time_unit). The value 1 in first_exp_rule indicates that a subject is deemed first exposed to a non-reference exposure level during a time interval in the output data set if the exposure data set indicates exposure to a non-reference exposure level for at least one day of the interval. The value 0 indicates that a subject is deemed first exposed to a non-reference exposure level during a time interval in the output data set if the exposure data set indicates exposure to a non-reference exposure level for at least exp_threshold of the days of the interval. By default (i.e., if the exp_threshold argument is left unpopulated), exp_threshold is set to 0.5 (i.e., 50%), but an alternate value can be specified by populating the exp_threshold argument with any other value strictly greater than 0 and lower than or equal to 1.

The max_exp_var argument sets the limit for the maximum number of exposure variables that is expected when the exposure is defined using setInstantExposure. The max_cov_var argument sets the limit for the maximum number of variables that is expected to be created by the routine to encode the levels of each time-dependent covariate when the exposure is defined using setInstantExposure. The summary_cov_var argument indicates the coarsening method applied in each interval to summarize multiple measurements of a time-dependent covariate into a single summary measure when the exposure is defined using setInstantExposure.
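
The first-exposure rule described above can be paraphrased as a small decision function. The sketch below is only an illustration of the rule as stated in the text: is_first_exposed() and its days_exposed argument are hypothetical names introduced here and are not part of LtAtStructuR.

```r
# Hypothetical helper (not a package function): decides whether a subject
# counts as first exposed to a non-reference level within one interval.
#   days_exposed   number of days of the interval spent at a non-reference level
#   time_unit      length of the interval in days
#   first_exp_rule 1 = at least one exposed day; 0 = at least exp_threshold of the days
#   exp_threshold  proportion in (0, 1], used only when first_exp_rule is 0
is_first_exposed <- function(days_exposed, time_unit,
                             first_exp_rule, exp_threshold = 0.5) {
  stopifnot(first_exp_rule %in% c(0, 1),
            exp_threshold > 0, exp_threshold <= 1)
  if (first_exp_rule == 1) {
    days_exposed >= 1
  } else {
    days_exposed / time_unit >= exp_threshold
  }
}

is_first_exposed(10, time_unit = 30, first_exp_rule = 1)  # TRUE: at least one day
is_first_exposed(10, time_unit = 30, first_exp_rule = 0)  # FALSE: 10/30 is below 0.5
is_first_exposed(15, time_unit = 30, first_exp_rule = 0)  # TRUE: 15/30 reaches 0.5
```

With first_exp_rule = 1 the threshold is not consulted, which is why the same 10 exposed days lead to different answers under the two rules.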

Interval exposures

LtAt.data.bin.At <- construct(LtAtspec = LtAt.data.binary.At,
                              time_unit = 30,
                              first_exp_rule = 1,
                              exp_threshold = 0.5,
                              format = "standard",
                              dates = FALSE)

LtAt.data.cat.At <- construct(LtAtspec = LtAt.data.categorical.At,
                              time_unit = 30,
                              first_exp_rule = 1,
                              exp_threshold = 0.5,
                              format = "standard",
                              dates = FALSE)

Instantaneous exposures

LtAt.data.bin.InstExp <- construct(LtAtspec = LtAt.data.binary.InstExp,
                                   time_unit = 30,
                                   first_exp_rule = 1,
                                   exp_threshold = 0.03,
                                   format = "standard",
                                   dates = FALSE)

LtAt.data.cat.InstExp <- construct(LtAtspec = LtAt.data.categorical.InstExp,
                                   time_unit = 30,
                                   first_exp_rule = 1,
                                   exp_threshold = 0.03,
                                   format = "standard",
                                   dates = FALSE)

LtAt.data.bin.At.harvard <- construct(LtAtspec = LtAt.data.binary.At,
                                      time_unit = 30,
                                      first_exp_rule = 1,
                                      exp_threshold = 0.5,
                                      format = "MSM SAS macro",
                                      dates = FALSE)

LtAt.data.bin.At.harvard.dates <- construct(LtAtspec = LtAt.data.binary.At,
                                            time_unit = 30,
                                            first_exp_rule = 1,
                                            exp_threshold = 0.5,
                                            format = "MSM SAS macro",
                                            dates = TRUE)

4. Output data set

The output data set produced by the LtAtStructuR package organizes the processed longitudinal data for each patient in the cohort into a structured format suitable for analyses by MSM. As described in detail in Section 5, each patient's follow-up time is first divided into intervals of constant length (i.e., time_unit). The various measurements in the input data sets are then mapped to these intervals. Each row of the resulting output data set encodes the measurements that characterize a given patient at one such interval. The output data set includes the following columns (the last two are only included when the exposure is categorical with more than two levels):

Unique subject identifier $ID$ encoded as a numeric or character variable (e.g. medical record number).
Interval number $t$ encoded as a numeric variable, beginning at 0 and incrementing by 1 until the last interval.
Interval start and end dates $t_{min}$ and $t_{max}$, respectively, encoded as date variables.
Reason for end-of-follow-up (EOF) $\Gamma$ encoded as a character variable.
Exposure status $A_1(t)$ encoded as a categorical (numeric or character) variable.
Outcome status $Y(t)$, encoded as a binary numeric variable.
Censoring status $A_2(t)$ encoded as a binary numeric variable.
Covariate statuses (one column per covariate: $L_1(t)$, $L_2(t)$, etc.) encoded as numeric or character variables.
Covariate imputation flags (one column per covariate: $I.L_1(t)$, $I.L_2(t)$, etc.) encoded as binary numeric variables. Each imputation variable indicates at each time interval whether or not the value of a given covariate was observed (value of '0') or either imputed or defined as the last observed value carried forward (value of '1').
Indicator of an exposure tie encoded as a binary numeric variable. A value of '1' for this variable indicates that a subject is exposed equally in duration to at least two distinct non-reference exposure levels (see section 5 for details).
Warning indicator of a possibly misleading exposure level assignment encoded as a binary numeric variable. A value of '1' for this variable indicates that a subject is deemed exposed to the reference level despite significant cumulative exposure to at least two non-reference exposure levels (see section 5 for details).
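
To make the meaning of the imputation flags concrete, the sketch below fills a toy covariate by carrying the last observed value forward and records which values were not observed. It is an illustration only: the toy data, the column name I.eGFR, and the use of the observed mean for intervals that precede the first measurement are assumptions for this example and may differ from what the package does internally.

```r
library(data.table)

# Toy covariate for one patient: measured only in intervals 1 and 4
dt <- data.table(ID = "A", intnum = 0:4, eGFR = c(NA, 62, NA, NA, 55))

# Flag: 0 = value observed in the interval, 1 = imputed or carried forward
dt[, I.eGFR := as.integer(is.na(eGFR))]

# Mean of the observed values, used here for intervals with no prior measurement
obs_mean <- dt[, mean(eGFR, na.rm = TRUE)]

# Last observation carried forward within patient
dt[, eGFR := nafill(eGFR, type = "locf"), by = ID]

# Remaining gaps (before the first measurement) get the assumed imputed value
dt[is.na(eGFR), eGFR := obs_mean]

print(dt)
```

In the result, rows with I.eGFR equal to 1 hold values that were filled in rather than measured during that interval.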

When construct(..., format = "standard") is used, the following two tables illustrate the encoding in the output data set of the longitudinal data from two patients who, respectively, experienced and did not experience the event during follow-up (exposure is binary):

cols <- names(LtAt.data.bin.At)
LtAt.data.bin.At[, (cols) := lapply(.SD, factor), .SDcols = cols]
Yt1 <- LtAt.data.bin.At[ID=="0002",][c(1:3)]
Yt1[3,eval(names(Yt1)):="..."]
Yt1 <- rbind(Yt1,LtAt.data.bin.At[ID=="0002",][c(.N-1,.N)])
Yt1.kable <- knitr::kable(Yt1, "html", caption = "EOF reason is failure")
kableExtra::add_header_above(Yt1.kable, c("$ID$", "$t$", "$\\Gamma$", "$Y(t)$", "$A_2(t)$", "$L_1(t)$", "$I.L_1(t)$", "$L_2(t)$", "$I.L_2(t)$", "$L_3(t)$", "$L_4(t)$", "$L_5(t)$", "$A_1(t)$"))

Yt0 <- LtAt.data.bin.At[ID=="0001",][c(1:3)]
Yt0[3,eval(names(Yt0)):="..."]
Yt0 <- rbind(Yt0,LtAt.data.bin.At[ID=="0001",][c(.N-1,.N)])
Yt0.kable <- knitr::kable(Yt0, "html", caption = "EOF reason is censoring")
kableExtra::add_header_above(Yt0.kable, c("$ID$", "$t$", "$\\Gamma$", "$Y(t)$", "$A_2(t)$", "$L_1(t)$", "$I.L_1(t)$", "$L_2(t)$", "$I.L_2(t)$", "$L_3(t)$", "$L_4(t)$", "$L_5(t)$", "$A_1(t)$"))

Note that each row of these tables contains the measurements of covariates $L_j(t)$ for $j=1,2,\ldots$, exposure $A_1(t)$, outcome $Y(t)$ and censoring variable $A_2(t)$ for a given follow-up interval $t$. The columns $t_{min}$ and $t_{max}$ (i.e., intstart and intend, respectively) contain the dates of each follow-up interval defined by the unit of time specified by the user of the package (i.e., time_unit). In particular, the value for intstart in the first row of each table contains the index date for the patient. Because measurements of covariates may not be collected at each time point in non-experimental studies, the tables contain separate columns, $I.L_1(t)$ and $I.L_2(t)$, that indicate whether the corresponding covariate is observed (i.e., the value '0' means that the covariate is observed). Tables such as the ones above are constructed for all patients in the cohort and stacked into a single data set that forms the output data set from the LtAtStructuR package.

The output data set just described cannot be used directly with the %MSM macro developed by the Harvard Causal Inference group to fit Marginal Structural Models by Inverse Probability Weighting estimation. For the output data set from the package LtAtStructuR to be directly usable by the %MSM macro, data from all patients who are censored during the first follow-up interval (i.e., at $t=0$) can be removed from the output data set, and, for all other patients, the values of the outcome ($Y(t)$) and censoring ($A_2(t)$) columns of their tables can be shifted up by one row, the resulting last row can be deleted, and the outcome value in the new last row can be set to missing when $A_2(t)$ is 1 in the new last row. These steps are automated by the LtAtStructuR package when construct(...,format = "MSM SAS macro"). The resulting encoding in the output data set of the longitudinal data from the same two patients described above is illustrated in the following two tables:

cols <- names(LtAt.data.bin.At.harvard)
LtAt.data.bin.At.harvard[, (cols) := lapply(.SD, factor), .SDcols = cols]
Yt1 <- LtAt.data.bin.At.harvard[ID=="0002",][c(1:3)]
Yt1[3,eval(names(Yt1)):="..."]
Yt1 <- rbind(Yt1,LtAt.data.bin.At.harvard[ID=="0002",][c(.N-1,.N)])
Yt1.kable <- knitr::kable(Yt1, "html", caption = "EOF reason is failure")
kableExtra::add_header_above(Yt1.kable, c("$ID$", "$t$", "$\\Gamma$", "$Y(t)$", "$A_2(t)$", "$L_1(t)$", "$I.L_1(t)$", "$L_2(t)$", "$I.L_2(t)$", "$L_3(t)$", "$L_4(t)$", "$L_5(t)$", "$A_1(t)$"))

Yt0 <- LtAt.data.bin.At.harvard[ID=="0001",][c(1:3)]
Yt0[3,eval(names(Yt0)):="..."]
Yt0 <- rbind(Yt0,LtAt.data.bin.At.harvard[ID=="0001",][c(.N-1,.N)])
Yt0.kable <- knitr::kable(Yt0, "html", caption = "EOF reason is censoring")
kableExtra::add_header_above(Yt0.kable, c("$ID$", "$t$", "$\\Gamma$", "$Y(t)$", "$A_2(t)$", "$L_1(t)$", "$I.L_1(t)$", "$L_2(t)$", "$I.L_2(t)$", "$L_3(t)$", "$L_4(t)$", "$L_5(t)$", "$A_1(t)$"))
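
The reformatting steps described above for format = "MSM SAS macro" can be sketched on a toy data set as follows. This is a minimal illustration of the stated steps, not the package's implementation: the column names Y and censor stand in for $Y(t)$ and $A_2(t)$, and the toy values (including the outcome value recorded in censored rows) are assumptions for this example.

```r
library(data.table)

# Toy "standard" output: patient a fails at t = 2, patient b is censored
# at t = 2, patient c is censored during the first interval (t = 0)
std <- data.table(
  ID     = c("a", "a", "a", "b", "b", "b", "c"),
  intnum = c(0, 1, 2, 0, 1, 2, 0),
  Y      = c(0, 0, 1, 0, 0, 0, 0),
  censor = c(0, 0, 0, 0, 0, 1, 1)
)

# Step 1: remove patients censored at t = 0
drop_ids <- std[intnum == 0 & censor == 1, ID]
msm <- std[!(ID %in% drop_ids)]

# Step 2: shift outcome and censoring up by one row within each patient
msm[, `:=`(Y      = shift(Y, n = 1, type = "lead"),
           censor = shift(censor, n = 1, type = "lead")), by = ID]

# Step 3: delete the resulting last row of each patient
msm <- msm[, .SD[-.N], by = ID]

# Step 4: set the outcome to missing when the new last row is censored
last_rows <- msm[, .I[.N], by = ID]$V1
msm[last_rows, Y := fifelse(censor == 1, NA_real_, Y)]

print(msm)
```

After these steps, patient a's event appears in the row for $t=1$, patient b's last row carries a censoring value of 1 with a missing outcome, and patient c no longer appears.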

In addition, when construct(..., dates = TRUE) is used, the measurement dates for the covariates are included in the output:

cols <- names(LtAt.data.bin.At.harvard.dates)
LtAt.data.bin.At.harvard.dates[, (cols) := lapply(.SD, factor), .SDcols = cols]
Yt1 <- LtAt.data.bin.At.harvard.dates[ID=="0002",][c(1:3)]
Yt1[3,eval(names(Yt1)):="..."]
Yt1 <- rbind(Yt1,LtAt.data.bin.At.harvard.dates[ID=="0002",][c(.N-1,.N)])
Yt1.kable <- knitr::kable(Yt1[,.(ID,intnum,intstart,intend,eGFR,dteGFR,Hosp_stay,dtHosp_stay,Hypertension,dtHypertension,Stroke,dtStroke)], "html", caption = "Output with dates")
# Yt1.kable <- knitr::kable(LtAt.data.bin.At.harvard.dates[,.(ID,intnum,intstart,intend,eGFR,dteGFR,Hosp_stay,dtHosp_stay,Hypertension,dtHypertension,Stroke,dtStroke)],"html")
kableExtra::add_header_above(Yt1.kable, c("$ID$", "$t$", "$t_{min}$", "$t_{max}$", "$L_2(t)$", "$date.L_2(t)$", "$L_3(t)$", "$date.L_3(t)$", "$L_4(t)$", "$date.L_4(t)$", "$L_5(t)$","$date.L_5(t)$"))

romainkp/LtAtStructuR documentation built on Aug. 24, 2024, 3:38 p.m.
