dataLongSubDist: Data Matrix and Weights for Discrete Subdistribution Hazard...

View source: R/DiscSurvDataTransform.R

dataLongSubDistR Documentation

Data Matrix and Weights for Discrete Subdistribution Hazard Models

Description

Generates the augmented data matrix and the weights required for discrete subdistribution hazard modeling with right censoring.

Usage

dataLongSubDist(
  dataShort,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  eventFocus,
  timeAsFactor = FALSE,
  aggTimeFormat = FALSE,
  lastTheoInt = NULL
)

Arguments

dataShort

Original data in short format ("class data.frame").

timeColumn

Character specifying the column name of the observed event times ("logical vector"). It is required that the observed times are discrete ("integer vector").

eventColumns

Character vector specifying the column names of the event indicators (excluding censoring events) ("logical vector"). It is required that a 0-1 coding is used for all events. The algorithm treats row sums of zero of all event columns as censored.

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable ("logical vector")? Default is FALSE.

eventFocus

Column name of the event of interest (type 1 event) ("character vector").

timeAsFactor

Logical indicating whether time should be coded as a factor in the augmented data matrix("logical vector"). If FALSE, a numeric coding will be used.

aggTimeFormat

Instead of the usual long format, should every observation have all time intervals? ("logical vector") Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index.

lastTheoInt

Gives the number of the last theoretic interval ("integer vector"). Only used, if aggTimeFormat==TRUE.

Details

This function sets up the augmented data matrix and the weights that are needed for weighted maximum likelihood (ML) estimation of the discrete subdistribution model proposed by Berger et al. (2018). The model is a discrete-time extension of the original subdistribution model proposed by Fine and Gray (1999).

Value

Data frame with additional column "subDistWeights". The latter column contains the weights that are needed for fitting a weighted binary regression model, as described in Berger et al. (2018). The weights are calculated by a life table estimator for the censoring event.

Author(s)

Thomas Welchowski welchow@imbie.meb.uni-bonn.de

References

\insertRef

bergerSubdistdiscSurv

\insertReffinePropHazdiscSurv

See Also

dataLong

Examples


################################
# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Generate subsample, reduce number of intervals to k = 5
SubUnempDur <- UnempDur [1:500, ]
SubUnempDur$time <- as.numeric(cut(SubUnempDur$spell, c(0,4,8,16,28)))

# Convert competing risks data to long format
# The event of interest is re-employment at full job
SubUnempDurLong <- dataLongSubDist (dataShort=SubUnempDur, timeColumn = "time", 
eventColumns=c("censor1", "censor2", "censor3"), eventFocus="censor1")
head(SubUnempDurLong)

# Fit discrete subdistribution hazard model with logistic link function
logisticSubDistr <- glm(y ~ timeInt + ui + age + logwage,
                    family=binomial(), data = SubUnempDurLong, 
                    weights = SubUnempDurLong$subDistWeights)
summary(logisticSubDistr)

########################################
# Simulation 
# Discrete subdistribution hazards model

# Simulate covariates as multivariate normal distribution
library(mvnfast)
set.seed(1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4))

# Specification of two discrete cause specific hazards with four intervals
# Event 1
theoInterval <- 4
betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3]
timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1)
linPred_event1 <- c(X %*% betaCoef_event1)
# Event 2
betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3]
timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1)
linPred_event2 <- c(X %*% betaCoef_event2)
# Discrete cause specific hazards in last theoretical interval
theoHaz_event1 <- 0.5
theoHaz_event2 <- 0.5

# Derive discrete all cause hazard
haz_event1_X <- cbind(sapply(1:length(timeInt_event1), 
                            function(x) exp(linPred_event1 + timeInt_event1[x]) / 
                              (1 + exp(linPred_event1 + timeInt_event1[x]) + 
                              exp(linPred_event2 + timeInt_event2[x])) ), 
                     theoHaz_event1)

haz_event2_X <- cbind(sapply(1:length(timeInt_event2), 
                            function(x) exp(linPred_event2 + timeInt_event2[x]) / 
                              (1 + exp(linPred_event1 + timeInt_event1[x]) + 
                              exp(linPred_event2 + timeInt_event2[x]) ) ),
                     theoHaz_event2)
allCauseHaz_X <- haz_event1_X + haz_event2_X

# Derive discrete cumulative incidence function of event 1 given covariates
p_T_event1_X <- haz_event1_X * cbind(1, (1-allCauseHaz_X)[, -dim(allCauseHaz_X)[2]])
cumInc_event1_X <-  t(sapply(1:dim(p_T_event1_X)[1], function(x) cumsum(p_T_event1_X[x, ])))

# Calculate all cause probability P(T=t | X)
pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) ))

# Calculate event probability given time interval P(R=r | T=t, X)
pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X)

# Simulate discrete survival times
survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1)+1), 
                                                   size = 1, prob = pT_X[i, ]) )
censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], 
               prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), 
               replace = TRUE)

# Calculate observed times
obsT <- ifelse(survT <= censT, survT, censT)
obsEvent <- rep(0, length(obsT))
obsEvent <- sapply(1:length(obsT), 
                  function(i) if(survT[i] <= censT[i]){
                    return(sample(x = c(1, 2), size = 1, 
                    prob = c(pR_T_X_event1[i, obsT[i]  ], 
                    1 - pR_T_X_event1[i, obsT[i]  ]) ))
                  } else{
                    
                    return(0)
                  }
)

# Recode last interval to censored
lastInterval <- obsT == theoInterval
obsT[lastInterval] <- theoInterval-1
obsEvent[lastInterval] <- 0
obsT <- factor(obsT)
obsEvent <- factor(obsEvent)

# Data preparation
datShort <- data.frame(event = factor(obsEvent), time=obsT, X)

# Conversion to long data format
datLongSub <- dataLongSubDist(dataShort = datShort, timeColumn = "time",
                             eventColumns = "event", eventFocus = 1, eventColumnsAsFactor = TRUE)

# Estimate discrete subdistribution hazard model
estSubModel <- glm(formula = y ~ timeInt + X1 + X2 + X3 + X4, data = datLongSub,
                  family = binomial(link = "logit"), weights = datLongSub$subDistWeights)

# Predict cumulative incidence function of first event
predSubHaz1 <- predict(estSubModel, newdata = datLongSub[datLongSub$obj == 2, ], type = "response")
mean(((1 - estSurv(predSubHaz1)) - cumInc_event1_X[2, 1:3])^2)


discSurv documentation built on March 18, 2022, 7:12 p.m.