The heaven package includes a series of functions that are useful when working with very large epidemiological datasets. The functions are specifically constructed for the Statistics Denmark environment, which delivers huge datasets in SAS format. Some functions are generally useful; others are only helpful for datasets as delivered by Statistics Denmark.

Overview

Reading data

Standardizing

Simulation

Matching

Calculation

Splitting

importDREAM

The DREAM dataset is collected from a range of Danish authorities. Since 1991 it has contained all residents who have received public support. Because this includes student funding from the state, prolonged sick leave, maternity leave and pensions, most of the Danish population is included. The dataset is constructed with one record per person and one variable per week since 1991; for each week there is a code when that person has received support. In addition, since 2008 the occupation of each individual in the dataset is recorded monthly. For a detailed description of DREAM refer to www.dreammodel.dk

The importDREAM function reads an R data.table imported from SAS and returns a "long" list of support codes or profession codes along with the person identification and the start and end dates of each period.

Usage

importDREAM(dreamData,explData=NULL,type="support",pnr="PNR")

Author: Christian Torp-Pedersen

Return

The function returns a data.table with the following variables:

Example

 # The following is a minute version of DREAM
 library(data.table)
 library(heaven)
 microDREAM <-data.table(PNR=c(1,2),branche_2008_01=c("","3"),
   branche_2008_02=c("1","3"), branche_2008_03=c("1",""),
   y_9201=c("","1"),y_9202=c("","2"),y_9203=c("","2"))
 microDREAM[]
 # Explanation of codes for "branche" 
 branche <- data.table(branche=c("1","2","3"),
             text=c("school","gardening","suage"))   
 support <- data.table(support=c("1","2"),text=c("education","sick")) 
 temp <- importDREAM(microDREAM,branche,type="branche",pnr="PNR")   
 temp[]
 temp2 <- importDREAM(microDREAM,support,type="support",pnr="PNR")
 temp2[]    

riskSetMatch

There are multiple good programs for simple matching, but few for nested case-control studies where cases and controls are selected from the study population with the additional requirement that controls have a particular status (no event yet / alive) at the time where the case has an event or exposure. There are two main uses: incidence density sampling, where controls are selected for each case at the time of the outcome and should not (yet) have had an event, and exposure density sampling, where cases are selected at the time of exposure and controls should not (yet) be exposed at that time.

The riskSetMatch function performs exact matching on fixed variables such as sex and age (time of birth). Variables such as time of birth need to be rounded appropriately to ensure matches. The function can also include a list of time dependent covariates, where the requirement in matching is that the date of the covariate is prior to the matching date for both case and control, or later/missing for both.

Through its arguments the function can be set to reuse controls or not, and to use cases as controls prior to their case date or not. In general it is recommended to allow reuse of controls and to allow cases to serve as controls prior to becoming cases. A review of biases is found in Robins: Biometrics, Vol. 42, No. 2 (Jun. 1986), 293-299.

The riskSetMatch function is designed for very large data and uses a simplified search for controls for each case. Lists of cases and potential controls are constructed, the controls are sorted by a random variable, and then, for each case, controls are selected consecutively as long as they fit the matching conditions.
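To make the search concrete, the following is a schematic sketch of the selection logic (not the package implementation), shown for a single stratum of the exact matching variables and with reuse of controls allowed; all names and values are made up for illustration:

 library(data.table)
 set.seed(1)
 cases    <- data.table(ptid=c("C1","C2"), case.Index=c(10L,20L))
 controls <- data.table(ptid=paste0("K",1:6),
                        control.Index=c(5L,15L,25L,30L,8L,40L))
 controls <- controls[sample(.N)]   # sort potential controls in random order once
 matched <- rbindlist(lapply(seq_len(nrow(cases)), function(i){
   # controls are eligible if still at risk (no event yet) at the case index
   eligible <- controls[control.Index >= cases$case.Index[i]]
   selected <- head(eligible, 2L)   # take the first ones found; Ncontrols = 2
   if (nrow(selected)==0L) return(NULL)
   selected[, caseid := cases$ptid[i]][]
 }))
 matched[]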

Usage

riskSetMatch(ptid,event,terms,dat,Ncontrols,oldevent="oldevent",
  caseid="caseid",reuseCases=FALSE,reuseControls=FALSE,caseIndex=NULL,
  controlIndex=NULL,NoIndex=FALSE,cores=1,dateterms=NULL)

Author: Christian Torp-Pedersen

Details

The function performs exact matching and keeps two dates (indices) apart such that the date for controls is larger than that for cases. Because the matching is exact, all matching variables must be integer or character. Make sure that sufficient rounding is done on continuous and semi-continuous variables to ensure a decent number of controls for each case. For example, it may be difficult to find controls for cases of very high age; age should therefore often be rounded to 2, 3 or 5 year bands, and extreme ages aggregated further.
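As a minimal sketch of such rounding (the variable names are made up for illustration and this is not part of the package), year of birth can be collapsed into 5-year bands with data.table before matching, and the most extreme band aggregated further:

 library(data.table)
 dat <- data.table(ptid=1:6, byear=c(1921,1924,1958,1959,1990,1993))
 dat[, byear5 := 5*(byear %/% 5)]     # 5-year bands: 1920, 1955, 1990, ...
 dat[byear5 < 1925, byear5 := 1925]   # aggregate the oldest, sparse band into one group
 dat[]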

For case-control studies age may be a relevant matching parameter; for most cohort studies year of birth is more relevant, since the age of a control varies with time.

Many datasets have comorbidities as time dependent variables. Matching on these requires that, when the case does not yet have the comorbidity at the index date, the corresponding comorbidity date has not (yet) been reached for the control either, and similarly that the date has been reached for the control when the case does have that comorbidity.

For many purposes controls should be reused and cases allowed to be controls prior to becoming cases. By default there is no reuse; this can be adjusted with "reuseCases" and "reuseControls".

The function can be used for standard matching without caseIndex/controlIndex (set "NoIndex"), but other packages such as MatchIt are likely to be better suited for those cases.

It may appear tempting always to use multiple cores, but this comes with a costly overhead because the function machinery has to be distributed to each defined "worker". With very large numbers of cases and controls, multiple cores can save substantial amounts of time. When a single core is used a progress bar shows the progress of matching; there is no progress bar with multiple cores.

The matchReport function may afterwards be used to provide simple summaries of the use of cases and controls.

Return

A data.table with cases and controls. After matching, a new variable "caseid" links controls to cases. Further, a variable "oldevent" holds the original value of "event", which can be used to identify cases functioning as controls prior to becoming cases.

Variables in the original dataset are preserved. The final dataset includes all original cases but only the controls that were selected.

If cases without controls should be removed, this is done by setting removeNoControls to TRUE.

Example
 library(data.table)
 library(heaven)
 case <- c(rep(0,40),rep(1,15)) 
 ptid <- paste0("P",1:55)
 sex <- c(rep("fem",20),rep("mal",20),rep("fem",8),rep("mal",7))
 byear <- c(rep(c(2020,2030),20),rep(2020,7),rep(2030,8))
 case.Index <- c(seq(1,40,1),seq(5,47,3))
 control.Index <- case.Index
 diabetes <- rep(1,55)    # date of diabetes for each person
 heartdis <- rep(100,55)  # date of heart disease for each person
 dat <- data.table(case,ptid,sex,byear,diabetes,heartdis,case.Index,control.Index)
 # Very simple match without reuse - no dates to control for
 out <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,NoIndex=TRUE)
 head(out,10)
 # Risk set matching without reusing cases/controls - 
 # Some cases have no controls
 out2 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex="case.Index",
   controlIndex="control.Index")
 head(out2,10) 
 # Risk set matching with reuse of cases (control prior to case) and reuse of 
 # controls - more cases get controls
 out3 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE)
 head(out3,10)  
 # Same with 2 cores
 out4 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE,cores=2)  
 head(out4,10)   
 #Time dependent matching. In addition to the fixed matching parameters there
 #are two further sets of dates where it is required that if a case has the
 #condition prior to its index date, then a control must also have the
 #condition prior to the case index date to be eligible - and if the case does
 #not have the condition prior to index, the same is required for the control.
 out5 <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex=
   "case.Index",controlIndex="control.Index"
   ,reuseCases=TRUE,reuseControls=TRUE,cores=1,
   dateterms=c("diabetes","heartdis"))  
 head(out5,10)
 #POSTPROCESSING
 #It may be convenient to add the number of controls found for each case in
 #order to remove cases without controls or cases where very few controls
 #have been found. This is easily obtained using data.table - with the
 #example above:
 out5[,numControls:=.N,by=caseid] # adds a column with the number of controls
                                  # for each case-ID

matchReport

This function is applied to the output from riskSetMatch and provides a tabulation of the uses and reuses of controls and cases.

Usage

matchReport(dat, id, case, caseid, oldcase="oldevent")

dat - data.table with the result from riskSetMatch
id - variable holding the participant id
case - 0=control, 1=case
caseid - variable defining the groups of matched cases/controls
oldcase - variable holding case/control status (0/1) prior to matching; distinguishes cases reused as controls

Author: Christian Torp-Pedersen

Details

This function can be helpful when choosing matching options. If there is excessive reuse of controls, or many cases do not find controls, it may be desirable to round the matching variables further.

Return: Three small tables - number of controls per case, use/reuse of controls, and use/reuse of cases.

Example

 library(data.table)
 library(heaven)
 case <- c(rep(0,40),rep(1,15)) 
 ptid <- paste0("P",1:55)
 sex <- c(rep("fem",20),rep("mal",20),rep("fem",8),rep("mal",7))
 byear <- c(rep(c(2020,2030),20),rep(2020,7),rep(2030,8))
 caseIndex <- c(seq(1,40,1),seq(5,47,3))
 controlIndex <- caseIndex
 dat <- data.table(ptid,case,sex,byear,caseIndex,controlIndex)
 # Very simple match without reuse - no dates to control for
 dataout <- riskSetMatch("ptid","case",c("byear","sex"),dat,2,caseIndex="caseIndex",
   controlIndex="controlIndex",reuseCases=TRUE,reuseControls=TRUE)
 matchReport(dataout,"ptid","case","caseid")   

Lexis functions

The lexis functions are three functions that aid the splitting of observations in order to perform time varying analyses. To understand the functionality, the following graph shows a Lexis diagram.

The diagram shows time passing for a number of individuals by age and calendar time. Some have an event (red x) and some do not. For time dependent analysis individuals are split by calendar time and age. The rightmost individual is split into 5 records, a split occurring each time a line is crossed. This allows regression to be time dependent, with individuals contributing to different age/time periods and only being compared with other individuals in the same age/time period.

In practice a range of splits are performed:

For each of the lexis functions two inputs are important:

The use of the 3 lexis functions can be shown graphically.

Note that the content of the Splitting guide should not be part of the Base data.

A general warning: be careful when working with dates, not least if the dates are provided from SAS. By default SAS has 1960-01-01 as the origin of dates and R has 1970-01-01, so confusion can easily occur. Output of dates from the lexis functions is always "integer"; these dates use 1970-01-01 as origin.
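As a minimal illustration (the numeric value is arbitrary and this is not heaven code), the same number refers to dates ten years apart depending on the assumed origin:

 sasdate <- 20000                        # a date exported from SAS as a plain number
 as.Date(sasdate, origin="1960-01-01")   # the date SAS means: "2014-10-04"
 as.Date(sasdate, origin="1970-01-01")   # the date under the R convention: "2024-10-04"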

lexisTwo

This function can split observations on a number of conditions, such as comorbidities, each defined by a date that is missing when there is no comorbidity.

The Base data needs to include columns defining participant, start of period, end of period and event (numeric 0/1).

The splitting guide needs to be a matrix-like data.table where the first column is the patient id and the further columns each define dates where splitting should occur. Missing dates result in no splitting. For comorbidities it is useful to name the columns after the comorbidity and let the content of the columns be the date of the comorbidity when present and missing when not.

Usage

lexisTwo(indat,splitdat,invars,splitvars)

Author: Christian Torp-Pedersen

Base data

A data.table or data.frame whose first 4 columns are, in that particular order:

Splitting guide

A data.table which contains person specific information about the onset dates of comorbidities and other events.

Return

The function returns a new data.table where records have been split according to the splitting guide dataset. Variables unrelated to the splitting are left unchanged. The names of the columns from "splitvars" also appear in the output data, but they now have the value 0 before the corresponding dates and 1 after.

Details

The program checks that intervals are not negative. Violation results in an error. Overlap may occur in real data, but the user needs to make decisions regarding this prior to using this function.

It is required that the splittingguide contains at least one record.
Missing data in the person id variables are not allowed and will cause errors.

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day. When there is an event in a period of zero length it is important to keep that period in order not to lose events in calculations. Whether other zero-length records should be kept in calculations depends on the context.

Examples

 library(data.table)
 library(heaven)

 dat <- data.table(pnr=c("123456","123456","234567","234567","345678","345678"
 ,"456789","456789"),
                 start=as.integer(c(0,100,0,100,0,100,0,100)),
                 end=as.integer(c(100,200,100,200,100,200,100,200)),
                 event=as.integer(c(0,1,0,0,0,1,0,1)))
 split <- data.table (pnr=c("123456","234567","345678","456789"),
 como1.onset=as.integer(c(0,NA,49,50)), como2.onset=as.integer(c(25,75,49,49)),
 como3.onset=as.integer(c(30,NA,49,48)), como4.onset=as.integer(c(50,49,49,47))) 
 #Show the datasets:
 dat[]
 split[]
 lexisTwo(dat # input data with id/in/out/event
    ,split # Data with id and dates
    ,c("pnr","start","end","event") #names of id/in/out/event - in that order
    ,c("como1.onset","como2.onset","como3.onset","como4.onset")) 
    #Names of date-vars to split by
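As a hedged sketch of one way the split records could be analysed (not part of the package example; it assumes the survival package and that the output keeps the start/end/event column names from the input), the 0/1 indicators created by lexisTwo can act as time dependent covariates in a counting-process Cox model:

 library(survival)
 out <- lexisTwo(dat,split,
    c("pnr","start","end","event"),
    c("como1.onset","como2.onset","como3.onset","como4.onset"))
 # drop zero-length records unless they carry an event (see Details above)
 out <- out[end > start | event == 1]
 # toy data, so the estimates are not meaningful - shown only for the syntax
 fit <- coxph(Surv(start, end, event) ~ como1.onset + como2.onset, data=out)
 summary(fit)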

lexisFromTo

This function handles situations where individuals have conditions that are limited in time - pregnancy, drug doses etc. Each limited condition has a starting date, an end date, the condition as a character string and a name defining the condition. Multiple limited conditions can be in a single data.table.

Base data

A data.table or data.frame whose first 4 columns are, in that particular order:

Splitting guide

Usage

lexisFromTo(indat,splitdat,invars,splitvars)

Author: Christian Torp-Pedersen

Return

The function returns a new data table where records have been split according to the splitting guide dataset. Variables unrelated to the splitting are left unchanged.

Details

The input to this function is two data.tables and two lists of the critical variables. The BASE data is the data to be split. This data must have a variable to identify participants, start/end times, and a variable to indicate an event after the last interval. The other table (SPLITTINGGUIDE) contains possibly multiple records for each participant with id/from/to/value/name.

The program checks that intervals are not negative and that intervals for each "name" and each individual do not overlap. Violation stops the program with an error. Overlaps may occur in real situations, but the user needs to make decisions regarding them prior to using this function.

It is required that the splittingguide contains at least one record. Missing data for key variables are not allowed and will cause errors.

Examples

library(data.table)
library(heaven)
dat <- data.table(id=c("A","A","B","B","C","D"),
                 start=as.Date(c(0,100,0,100,200,400),origin='1970-01-01'),
                 end=as.Date(c(100,200,100,200,300,500),origin='1970-01-01'),
                 event=c(0,1,0,0,1,1))
split <- data.table (id=c("A","A","A","A","B","B","B","D","D"),
                    start=as.Date(c(0,50,25,150,110,150,400,300,500),origin='1970-01-01'),
                    end= as.Date(c(25,75,150,151,120,250,500,400,500),origin='1970-01-01'),
                    value=c(1,2,3,4,1,2,3,6,7),
                    name=c("d1","d1","d2","d2","d1","d1","d2","d3","d4"))
#Show the dataset:
dat[]
split[]                   
temp <- lexisFromTo(dat # input data with id/in/out/event
                   ,split # Data with id and dates
                   ,c("id","start","end","event") #names of id/in/out/event - in that order
                  ,c("id","start","end","value","name")) #Nmes var date-vars to split by
temp[]

lexisSeq

The base data for this function is identical to that of the other lexis functions: a data.table including c("id","start","end","event") and other variables to be preserved during splitting. In contrast to the other functions, the splitting guide is not a data.table but a vector of dates. It can be specified as a fixed vector of selected dates or in a "seq" format (from, to, by). This simple vector is useful for splitting by calendar time, where the splitting is identical for all subjects. Further, "varname" may hold a variable from the base data to be added to the vector. This is typically the birth date when splitting on age, and it may be the time of exposure when splitting into periods after exposure.

Usage

lexisSeq(indat,invars,varname=NULL,splitvector,format,value="value")

Author: Christian Torp-Pedersen

Return

The function returns a new data table where records have been split according to the provided vector. Variables unrelated to the splitting are left unchanged.

Details

The input must be a data.table. This data.table may already have been split by the other functions, so multiple records can have identical participant id. The function extracts the variables necessary for splitting, splits by the provided vector, and finally merges the other variables onto the final result.

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day. When there is an event in a period of zero length it is important to keep that period in order not to lose events in calculations. Whether other zero-length records should be kept in calculations depends on the context.

Examples

library(data.table)
library(heaven)
dat <- data.table(ptid=c("A","A","B","B","C","C","D","D"),
                start=c(0,100,0,100,0,100,0,100),
                end=c(100,200,100,200,100,200,100,200),
                dead=c(0,1,0,0,0,1,0,1),
                Bdate=c(-5000,-5000,-2000,-2000,0,0,100,100))
#Example 1 - Splitting on a vector with 3 values to be added to "Bdate"                 
out <- lexisSeq(indat=dat,invars=c("ptid","start","end","dead"),
               varname="Bdate",c(0,150,5000),format="vector")
out[]
#Example 2 - splitting on a from-to-by vector with nothing added (calendar time)
out2 <- lexisSeq(indat=dat,invars=c("ptid","start","end","dead"),
                 varname=NULL,c(0,200,50),format="seq",value="myvalue")
out2[]

