The heaven package includes a series of functions that are useful when working with very large epidemiological datasets. The functions are specifically constructed for the Statistics Denmark environment, which delivers huge datasets in SAS format. Some functions are generally useful; others are only helpful for datasets as delivered by Statistics Denmark.

Overview

Reading data

Standardizing

Simulation

Matching

Calculation

Splitting

importDREAM

The DREAM dataset is collected from a range of Danish authorities. Since 1991 it has contained all residents who have received public support. Because this includes student funding from the state, prolonged sick leave, maternity leave and pensions, most of the Danish population is included. The dataset is constructed with one record per person and one variable per week since 1991; for each week there is a code when that person has received support. In addition, since 2008 the occupation of each individual in the dataset is recorded monthly. For a detailed description of DREAM refer to www.dreammodel.dk

The importDREAM function reads an R data.table imported from SAS and returns a "long" list of support codes or profession codes along with the person identification and the start and end dates of each period.

Usage

importDREAM(dreamData,explData=NULL,type="support",pnr="PNR")

Author: Christian Torp-Pedersen

Return

The function returns a data.table with the following variables:

Example

 # The following is a minute version of DREAM
 library(data.table)
 library(heaven)
 microDREAM <-data.table(PNR=c(1,2),branche_2008_01=c("","3"),
   branche_2008_02=c("1","3"), branche_2008_03=c("1",""),
   y_9201=c("","1"),y_9202=c("","2"),y_9203=c("","2"))
 microDREAM[]
 # Explanation of codes for "branche" 
 branche <- data.table(branche=c("1","2","3"),
             text=c("school","gardening","suage"))   
 support <- data.table(support=c("1","2"),text=c("education","sick")) 
 temp <- importDREAM(microDREAM,branche,type="branche",pnr="PNR")   
 temp[]
 temp2 <- importDREAM(microDREAM,support,type="support",pnr="PNR")
 temp2[]    

riskSetMatch

There are multiple good programs for simple matching, but few for nested case-control studies where cases and controls are selected from the study population with the additional requirement that controls have a particular status (no event yet / alive) at the time where the case has an event or exposure. There are two main uses: incidence density sampling, where controls are selected for each case at the time of the outcome and should not (yet) have had an event, and exposure density sampling, where cases are selected at the time of exposure and controls should not (yet) be exposed at that time.

The riskSetMatch function performs exact matching on fixed variables such as sex and age (time of birth). Variables such as time of birth need to be rounded appropriately to ensure matches. The function can also include a list of time dependent covariates, where the requirement in matching is that the date of the covariate is prior to the matching date for both case and control, or later/missing for both.

Through its arguments the function can be set to reuse controls or not, and to use cases as controls prior to their case date or not. In general it is recommended to allow reuse of controls and to allow cases to serve as controls prior to becoming cases. A review of biases is found in Robins: Biometrics, Vol. 42, No. 2 (Jun. 1986), 293-299.

The riskSetMatch function is designed for very large data and uses a simplified search for controls for each case. Lists of cases and potential controls are constructed, the controls are sorted by a random variable, and then, for each case, controls are selected consecutively as long as they fit the matching conditions.
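To make the search concrete, the following is a schematic sketch of the selection logic (not the package implementation), shown for a single stratum of the exact matching variables and with reuse of controls allowed; all names and values are made up for illustration:

 library(data.table)
 set.seed(1)
 cases    <- data.table(ptid=c("C1","C2"), case.Index=c(10L,20L))
 controls <- data.table(ptid=paste0("K",1:6),
                        control.Index=c(5L,15L,25L,30L,8L,40L))
 controls <- controls[sample(.N)]   # sort potential controls in random order once
 matched <- rbindlist(lapply(seq_len(nrow(cases)), function(i){
   # controls are eligible if still at risk (no event yet) at the case index
   eligible <- controls[control.Index >= cases$case.Index[i]]
   selected <- head(eligible, 2L)   # take the first ones found; Ncontrols = 2
   if (nrow(selected)==0L) return(NULL)
   selected[, caseid := cases$ptid[i]][]
 }))
 matched[]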

Usage

riskSetMatch(ptid,event,terms,dat,Ncontrols,oldevent="oldevent",
  caseid="caseid",reuseCases=FALSE,reuseControls=FALSE,caseIndex=NULL,
  controlIndex=NULL,NoIndex=FALSE,cores=1,dateterms=NULL)

Author: Christian Torp-Pedersen

Details

The function performs exact matching and keeps two dates (indices) apart such that the date for controls is larger than that for cases. Because the matching is exact, all matching variables must be integer or character. Make sure that sufficient rounding is done on continuous and semi-continuous variables to ensure a decent number of controls for each case. For example, it may be difficult to find controls for cases of very high age; age should therefore often be rounded to 2, 3 or 5 year bands, and extreme ages aggregated further.
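As a minimal sketch of such rounding (the variable names are made up for illustration and this is not part of the package), year of birth can be collapsed into 5-year bands with data.table before matching, and the most extreme band aggregated further:

 library(data.table)
 dat <- data.table(ptid=1:6, byear=c(1921,1924,1958,1959,1990,1993))
 dat[, byear5 := 5*(byear %/% 5)]     # 5-year bands: 1920, 1955, 1990, ...
 dat[byear5 < 1925, byear5 := 1925]   # aggregate the oldest, sparse band into one group
 dat[]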

For case-control studies age may be a relevant matching parameter; for most cohort studies year of birth is more relevant, since the age of a control varies with time.

Many datasets have comorbidities as time dependent variables. Matching on these requires that, when the case does not yet have the comorbidity at the index date, the corresponding comorbidity date has not (yet) been reached for the control either, and similarly that the date has been reached for the control when the case does have that comorbidity.

For many purposes controls should be reused and cases allowed to be controls prior to becoming cases. By default there is no reuse; this can be adjusted with "reuseCases" and "reuseControls".

The function can be used for standard matching without caseIndex/controlIndex (set "NoIndex"), but other packages such as MatchIt are likely to be better suited for those cases.

It may appear tempting always to use multiple cores, but this comes with a costly overhead because the function machinery has to be distributed to each defined "worker". With very large numbers of cases and controls, multiple cores can save substantial amounts of time. When a single core is used a progress bar shows the progress of matching; there is no progress bar with multiple cores.

The matchReport function may afterwards be used to provide simple summaries of the use of cases and controls.

Return

A data.table with cases and controls. After matching, a new variable "caseid" links controls to cases. Further, a variable "oldevent" holds the original value of "event", which can be used to identify cases functioning as controls prior to becoming cases.

Variables in the original dataset are preserved. The final dataset includes all original cases but only the controls that were selected.

If cases without controls should be removed, this is done by setting removeNoControls to TRUE.

Example
 library(data.table)
 library(heaven)
 case <- c(rep(0,40),rep(1,15)) 
 ptid <- paste0("P",1:55)
 sex <- c(rep("fem",20),rep("mal",20),rep("fem",8),rep("mal",7))
 byear <- c(rep(c(2020,2030),20),rep(2020,7),rep(2030,8))
 case.Index <- c(seq(1,40,1),seq(5,47,3))
 control.Index <- case.Index
 diabetes <- rep(1,55)    # date of diabetes for each person
 heartdis <- rep(100,55)  # date of heart disease for each person
 dat <- data.table(case,ptid,sex,byear,diabetes,heartdis,case.Index,control.Index)
 # Very simple match without reuse - no dates to control for
 out <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,NoIndex=TRUE)
 head(out,10)
 # Risk set matching without reusing cases/controls - 
 # Some cases have no controls
 out2 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex="case.Index",
   controlIndex="control.Index")
 head(out2,10) 
 # Risk set matching with reuse of cases (control prior to case) and reuse of 
 # controls - more cases get controls
 out3 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE)
 head(out3,10)  
 # Same with 2 cores
 out4 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE,cores=2)  
 head(out4,10)   
 #Time dependent matching. In addition to the fixed matching parameters there
 #are two further sets of dates where it is required that if a case has the
 #condition prior to its index date, then a control must also have the
 #condition prior to the case index date to be eligible - and if the case does
 #not have the condition prior to index, the same is required for the control.
 out5 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE,cores=1,
   dateterms=c("diabetes","heartdis"))  
 head(out5,10)
 #POSTPROCESSING
 #It may be convenient to add the number of controls found for each case in
 #order to remove cases without controls or cases where very few controls
 #have been found. This is easily obtained using data.table - with the
 #example above:
 out5[,numControls:=.N,by=caseid] # adds a column with the number of controls
                                  # for each case-ID

matchReport

This function is applied to the output from riskSetMatch and provides a tabulation of the uses and reuses of controls and cases.

Usage

matchReport(dat, id, case, caseid, oldcase="oldevent")

dat - data.table with the result from riskSetMatch
id - variable holding the participant id
case - 0=control, 1=case
caseid - variable defining the groups of matched cases/controls
oldcase - variable holding case/control status (0/1) prior to matching; distinguishes cases reused as controls

Author: Christian Torp-Pedersen

Details

This function can be helpful when choosing matching options. If there is excessive reuse of controls, or many cases do not find controls, it may be desirable to round the matching variables further.

Return: Three small tables - number of controls per case, use/reuse of controls, and use/reuse of cases.

Example

 library(data.table)
 library(heaven)
 case <- c(rep(0,40),rep(1,15)) 
 ptid <- paste0("P",1:55)
 sex <- c(rep("fem",20),rep("mal",20),rep("fem",8),rep("mal",7))
 byear <- c(rep(c(2020,2030),20),rep(2020,7),rep(2030,8))
 caseIndex <- c(seq(1,40,1),seq(5,47,3))
 controlIndex <- caseIndex
 dat <- data.table(ptid,case,sex,byear,caseIndex,controlIndex)
 # Very simple match without reuse - no dates to control for
 dataout <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex="caseIndex",
   controlIndex="controlIndex",reuseCases=TRUE,reuseControls=TRUE)
 matchReport(dataout,"ptid","case","caseid")   

Lexis functions

The lexis functions are three functions that aid the splitting of observations in order to perform time varying analyses. To understand the functionality, the following graph shows a Lexis diagram.

The diagram shows time passing for a number of individuals by age and calendar time. Some have an event (red x) and some do not. For time dependent analysis individuals are split by calendar time and age. The rightmost individual is split into 5 records, a split occurring each time a line is crossed. This allows regression to be time dependent, with individuals contributing to different age/time periods and only being compared with other individuals in the same age/time period.

In practice a range of splits are performed:

For each of the lexis functions two inputs are important:

The use of the 3 lexis functions can be shown graphically.

Note that the content of the Splitting guide should not be part of the Base data.

A general warning: be careful when working with dates, not least if the dates are provided from SAS. By default SAS has 1960-01-01 as the origin of dates and R has 1970-01-01, so confusion can easily occur. Output of dates from the lexis functions is always "integer"; these dates use 1970-01-01 as origin.
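As a minimal illustration (the numeric value is arbitrary and this is not heaven code), the same number refers to dates ten years apart depending on the assumed origin:

 sasdate <- 20000                        # a date exported from SAS as a plain number
 as.Date(sasdate, origin="1960-01-01")   # the date SAS means: "2014-10-04"
 as.Date(sasdate, origin="1970-01-01")   # the date under the R convention: "2024-10-04"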

lexisTwo

This function can split observations on a number of conditions, such as comorbidities, each defined by a date that is missing when there is no comorbidity.

The Base data needs to include columns defining participant, start of period, end of period and event (numeric 0/1).

The splitting guide needs to be a matrix-like data.table where the first column is the patient id and the further columns each define dates where splitting should occur. Missing dates result in no splitting. For comorbidities it is useful to name the columns after the comorbidity and let the content of the columns be the date of the comorbidity when present and missing when not.

Usage

lexisTwo(indat,splitdat,invars,splitvars)

Author: Christian Torp-Pedersen

Base data

A data.table or data.frame whose first 4 columns are, in that particular order:

Splitting guide

A data.table which contains person specific information about the onset dates of comorbidities and other events.

Return

The function returns a new data.table where records have been split according to the splitting guide dataset. Variables unrelated to the splitting are left unchanged. The names of the columns from "splitvars" also appear in the output data, but they now have the value 0 before the corresponding dates and 1 after.

Details

The program checks that intervals are not negative. Violation results in an error. Overlap may occur in real data, but the user needs to make decisions regarding this prior to using this function.

It is required that the splittingguide contains at least one record.
Missing data in the person id variables are not allowed and will cause errors.

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day. When there is an event in a period of zero length it is important to keep that period in order not to lose events in calculations. Whether other zero-length records should be kept in calculations depends on the context.

Examples

 library(data.table)
 library(heaven)

 dat <- data.table(pnr=c("123456","123456","234567","234567","345678","345678"
 ,"456789","456789"),
                 start=as.integer(c(0,100,0,100,0,100,0,100)),
                 end=as.integer(c(100,200,100,200,100,200,100,200)),
                 event=as.integer(c(0,1,0,0,0,1,0,1)))
 split <- data.table (pnr=c("123456","234567","345678","456789"),
 como1.onset=as.integer(c(0,NA,49,50)), como2.onset=as.integer(c(25,75,49,49)),
 como3.onset=as.integer(c(30,NA,49,48)), como4.onset=as.integer(c(50,49,49,47))) 
 #Show the datasets:
 dat[]
 split[]
 lexisTwo(dat # input data with id/in/out/event
    ,split # Data with id and dates
    ,c("pnr","start","end","event") #names of id/in/out/event - in that order
    ,c("como1.onset","como2.onset","como3.onset","como4.onset")) 
    #Names of date-vars to split by
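As a hedged sketch of one way the split records could be analysed (not part of the package example; it assumes the survival package and that the output keeps the start/end/event column names from the input), the 0/1 indicators created by lexisTwo can act as time dependent covariates in a counting-process Cox model:

 library(survival)
 out <- lexisTwo(dat,split,
    c("pnr","start","end","event"),
    c("como1.onset","como2.onset","como3.onset","como4.onset"))
 # drop zero-length records unless they carry an event (see Details above)
 out <- out[end > start | event == 1]
 # toy data, so the estimates are not meaningful - shown only for the syntax
 fit <- coxph(Surv(start, end, event) ~ como1.onset + como2.onset, data=out)
 summary(fit)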

lexisFromTo

This function handles situations where individuals have conditions that are limited in time - pregnancy, drug doses etc. Each limited condition has a starting date, an end date, the condition as a character string and a name defining the condition. Multiple limited conditions can be in a single data.table.

Base data

A data.table or data.frame whose first 4 columns are, in that particular order:

Splitting guide

Usage

lexisFromTo(indat,splitdat,invars,splitvars)

Author: Christian Torp-Pedersen

Return

The function returns a new data table where records have been split according to the splitting guide dataset. Variables unrelated to the splitting are left unchanged.

Details

The input to this function is two data.tables and two lists of the critical variables. The BASE data is the data to be split. This data must have a variable to identify participants, start/end times, and a variable to indicate an event after the last interval. The other table (SPLITTINGGUIDE) contains possibly multiple records for each participant with id/from/to/value/name.

The program checks that intervals are not negative and that intervals for each "name" and each individual do not overlap. Violation stops the program with an error. Overlaps may occur in real situations, but the user needs to make decisions regarding them prior to using this function.

It is required that the splittingguide contains at least one record. Missing data for key variables are not allowed and will cause errors.

Examples

library(data.table)
library(heaven)
dat <- data.table(id=c("A","A","B","B","C","D"),
                 start=as.Date(c(0,100,0,100,200,400),origin='1970-01-01'),
                 end=as.Date(c(100,200,100,200,300,500),origin='1970-01-01'),
                 event=c(0,1,0,0,1,1))
split <- data.table (id=c("A","A","A","A","B","B","B","D","D"),
                    start=as.Date(c(0,50,25,150,110,150,400,300,500),origin='1970-01-01'),
                    end= as.Date(c(25,75,150,151,120,250,500,400,500),origin='1970-01-01'),
                    value=c(1,2,3,4,1,2,3,6,7),
                    name=c("d1","d1","d2","d2","d1","d1","d2","d3","d4"))
#Show the dataset:
dat[]
split[]                   
temp <- lexisFromTo(dat # input data with id/in/out/event
                   ,split # Data with id and dates
                   ,c("id","start","end","event") #names of id/in/out/event - in that order
                  ,c("id","start","end","value","name")) #Nmes var date-vars to split by
temp[]

lexisSeq

The base data for this function is identical to that of the other lexis functions: a data.table including c("id","start","end","event") and other variables to be preserved during splitting. In contrast to the other functions, the splitting guide is not a data.table but a vector of dates. It can be specified as a fixed vector of selected dates or in a "seq" format (from, to, by). This simple vector is useful for splitting by calendar time, where the splitting is identical for all subjects. Further, "varname" may hold a variable from the base data to be added to the vector. This is typically the birth date when splitting on age, and it may be the time of exposure when splitting into periods after exposure.

Usage

lexisSeq(indat,invars,varname=NULL,splitvector,format,value="value")

Author: Christian Torp-Pedersen

Return

The function returns a new data table where records have been split according to the provided vector. Variables unrelated to the splitting are left unchanged.

Details

The input must be a data.table. This data.table may already have been split by the other functions, so multiple records can have identical participant id. The function extracts the variables necessary for splitting, splits by the provided vector, and finally merges the other variables onto the final result.

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day. When there is an event in a period of zero length it is important to keep that period in order not to lose events in calculations. Whether other zero-length records should be kept in calculations depends on the context.

Examples

library(data.table)
library(heaven)
dat <- data.table(ptid=c("A","A","B","B","C","C","D","D"),
                start=c(0,100,0,100,0,100,0,100),
                end=c(100,200,100,200,100,200,100,200),
                dead=c(0,1,0,0,0,1,0,1),
                Bdate=c(-5000,-5000,-2000,-2000,0,0,100,100))
#Example 1 - Splitting on a vector with 3 values to be added to "Bdate"                 
out <- lexisSeq(indat=dat,invars=c("ptid","start","end","dead"),
               varname="Bdate",c(0,150,5000),format="vector")
out[]
#Example 2 - splitting on a from-to-by vector with nothing added (calendar time)
out2 <- lexisSeq(indat=dat,invars=c("ptid","start","end","dead"),
                 varname=NULL,c(0,200,50),format="seq",value="myvalue")
out2[]

