splitSeq: lexisSeq
In tagteam/heaven: Data Preparation Routines for Medical Registry Data

splitSeq

R Documentation

lexisSeq

Description

splitSeq is a function that splits records according to a vector of selected times. At the outset each record has two variables representing start and end on a time scale. A vector of time points is supplied, and each record is split at every time point from the vector that falls within its interval. After splitting, the variable representing end of time is replaced by the splitting time, and the next record has this splitting time as its start of time variable.
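The splitting of a single record can be sketched in base R as follows. This is only an illustration of the idea, not the package's implementation: `split_interval` is a hypothetical helper, and it assumes that only time points strictly inside the interval create a split.

```r
# Hypothetical helper, not part of the heaven package:
# split one interval [start, end] at the supplied time points.
split_interval <- function(start, end, splitpoints) {
  # Assumption: only time points strictly inside the interval cause a split
  inside <- sort(unique(splitpoints[splitpoints > start & splitpoints < end]))
  bounds <- c(start, inside, end)
  # Each piece ends where the next one starts
  data.frame(start = bounds[-length(bounds)], end = bounds[-1])
}

# One record covering days 0 to 100, split at 30 and 60 (150 lies outside)
pieces <- split_interval(0, 100, c(30, 60, 150))
print(pieces)
```

The record is replaced by three records (0 to 30, 30 to 60, 60 to 100), each starting where the previous one ended.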

This function is particularly useful for splitting records according to variables that change continuously. Typical situations are age (e.g. 5 year periods), calendar time (e.g. 2 year periods) and selected times after a situation of interest (e.g. fixed sized time periods after a starting date). The input is a data.table and a splitting guide. The "base" data are the data to be split. They may contain much information, but the key variables are "id", "start" and "end". These describe the participant's id, start of time period and end of time period.

The other input is data to define splitvector and name. The splitvector may be a fixed vector (format="vector", e.g. a series of fixed calendar dates) or a vector of 3 values defining start, end and interval to split by (format="seq"; for a split on age between 20 and 80 by 5 years a splitvector could be defined as: splitvector <- c(20,80,5)*365.25 and provided to the function as a variable). "varname" is the name of a variable in the data.table that defines a value to be added to the splitvector. For the age split just used as an example it would be a variable containing the birth date. For a split after onset of a condition it should be the date of the condition, and NA when the condition does not occur. When no value should be added to the vector (e.g. split by calendar time) "varname" should keep its default value of NULL.
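As a rough illustration of the age example, the three values of a "seq" specification can be expanded into explicit split ages and shifted by a birth date. The expansion with `seq()` is an assumption about how such a specification is interpreted, and the birth date is made up for the example.

```r
# A "seq"-style specification: from, to, by
# (ages 20 to 80 in 5-year steps, expressed in days)
splitvector <- c(20, 80, 5) * 365.25

# Assumed expansion into explicit split ages (illustration only)
ages <- seq(from = splitvector[1], to = splitvector[2], by = splitvector[3])

# Hypothetical birth date, stored as days since 1970-01-01
birth <- as.integer(as.Date("1950-06-15"))

# Split dates for this person: birth date plus each split age
split_dates <- as.Date(birth + ages, origin = "1970-01-01")

length(ages)
head(split_dates, 3)
```

With these inputs there are 13 split ages (20, 25, ..., 80 years), so each person gets an individual set of split dates that depends on the birth date.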

On output a new variable with default name "value" defines the result of splitting. The variable can be renamed to a user defined name (e.g. value="myvalue"). This variable will contain zero when time is before the first value of the splitting vector (after the "varname" value has been added) and then increases by one as each value of the splitting vector is reached.
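The counting rule for the output variable can be mimicked with base R's `findInterval`. This is a sketch of the rule as described above, not the package's code. The numbers are hypothetical, and it assumes a record that starts exactly on a split point counts that point as reached.

```r
# Split points after adding the "varname" value (hypothetical numbers)
splitpoints <- c(0, 150, 5000)

# Start times of records that have already been split
starts <- c(-20, 0, 100, 150, 200)

# Number of split points reached at each start time:
# 0 before the first split point, then +1 for each point reached
value <- findInterval(starts, splitpoints)
print(value)
```

Here the record starting at -20 gets 0, those starting at 0 and 100 get 1, and those starting at 150 and 200 get 2.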

Usage

splitSeq(indat,invars,varname=NULL,splitvector,format,value="value",
datacheck=TRUE)

Arguments

`indat`	base data with id, start, end and other data - possibly already split
`invars`	column names for id,entry,exit - in that order, example: c("id","start","end")
`varname`	name of variable to be added to vector
`splitvector`	A vector of calendar times (integer). Splitvector is a sequence of fixed dates (or values on another time scale).
`format`	String with two possible values: `"vector"`, a series of fixed calendar dates, or `"seq"`, see description
`value`	Name of the output variable (default "value"). It is 0 to the left of the vector and increases by 1 as each element of the vector is passed
`datacheck`	Checks that data are in the appropriate format and that intervals are neither negative nor overlapping. Can be set to FALSE if checked elsewhere.

Details

The input must be a data.table. This data.table is assumed to have already been split by other functions, with multiple records having identical participant id. The function extracts the variables necessary for splitting, splits by the provided vector and finally merges the other variables onto the final result.

A note of caution: This function works with dates as integers. R has a default origin of dates of 1 January 1970, but other programs have different default origins - and this includes SAS and Excel. It is therefore important for correct results that care is taken that all dates are defined similarly.
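The effect of differing origins can be checked with base R. The sketch below assumes the usual conventions: SAS date values count days from 1 January 1960, and Excel serial numbers (Windows 1900 date system) convert with origin 1899-12-30, as suggested in the `as.Date` help page. The variable names are made up for the example.

```r
# R counts days from 1970-01-01, so this date is stored as 0
r_zero <- as.integer(as.Date("1970-01-01"))

# A SAS date value counts days from 1960-01-01
sas_value <- 3653
r_date_from_sas <- as.Date(sas_value, origin = "1960-01-01")

# An Excel serial number (Windows 1900 date system)
excel_serial <- 25569
r_date_from_excel <- as.Date(excel_serial, origin = "1899-12-30")

# Both represent 1 January 1970 once converted with the right origin
print(r_date_from_sas)
print(r_date_from_excel)
```

Passing the raw numbers 3653 or 25569 as if they were R dates would shift every date by years, which is why all date variables should be converted to a common origin before splitting.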

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day. When there is an event in a period with zero length it is important to keep that period so as not to lose events for calculations. Whether other zero length records should be kept in calculations depends on context.

This function is identical to the lexisSeq function with the change that "event" is not considered.

Value

The function returns a new data table where records have been split according to the values in splitvector. Variables unrelated to the splitting are left unchanged.

Author(s)

Christian Torp-Pedersen

Examples

library(data.table)

dat <- data.table(ptid=c("A","A","B","B","C","C","D","D"),
                start=as.Date(c(0,100,0,100,0,100,0,100),origin="1970-01-01"),
                end=as.Date(c(100,200,100,200,100,200,100,200),origin="1970-01-01"),
                Bdate=as.Date(c(-5000,-5000,-2000,-2000,0,0,100,100),origin="1970-01-01"))
#Example 1 - Splitting on a vector with 3 values to be added to "Bdate"                 
out <- splitSeq(indat=dat,invars=c("ptid","start","end"),
               varname="Bdate",
               splitvector=as.Date(c(0,150,5000),origin="1970-01-01"),
               format="vector")
out[]
#Example 2 - splitting on a from-to-by sequence with nothing added (e.g. calendar time)
out2 <- splitSeq(indat=dat,invars=c("ptid","start","end"),
                 varname=NULL,splitvector=c(0,200,50),
                 format="seq",value="myvalue")
out2[]

tagteam/heaven documentation built on April 13, 2025, 6:24 a.m.