splitTwo: splitTwo

View source: R/splitTwo.R

splitTwo R Documentation



This function splits records from a dataset into two records using information from a "splitting guide". Before the split, each record has variables representing the start and end of a time interval. After the split, the first record ends at the split time and the next record starts at that same time.

The dataset to be split needs to have variables that define start and end of a time interval (start/end). Also necessary is a variable (id) that defines the relation between individuals and records in both the dataset to be split and the splitting guide.

The dataset to be split can contain any variables in addition to those necessary for the split. These are carried over unchanged to each of the split records.

The splitting guide specifies the times at which records are split. The guide can be provided in a wide or a long format:

Wide format: This requires one record per individual who has dates to split on. One column holds the same id as in the "base" table. The other columns contain the date for each condition at which the split should occur. These column names also appear in the output data, where the values are 0 before the date and 1 after. When a date is NA, the output value is 0.

Long format: This requires one record per date where a possible split should occur. The columns should contain the id, the name of the condition and the date to split on. The output is identical to that produced with the wide format.
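To make the two layouts concrete, the sketch below builds a small wide-format guide in base R and reshapes it to long format. The column names (pnr, como1.onset, como2.onset, name, date) and all values are illustrative assumptions; the function itself is not called here.

```r
# Wide format: one row per individual, one column per condition (illustrative values)
wide <- data.frame(
  pnr = c("A", "B"),
  como1.onset = c(10L, NA),
  como2.onset = c(25L, 40L),
  stringsAsFactors = FALSE
)

# Long format: one row per individual/condition/date combination
date_cols <- c("como1.onset", "como2.onset")
long <- data.frame(
  pnr  = rep(wide$pnr, times = length(date_cols)),
  name = rep(date_cols, each = nrow(wide)),
  date = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Rows with a missing date carry no split information and can be dropped
long <- long[!is.na(long$date), ]
print(long)
```

Both layouts carry the same information: a missing date in the wide table simply has no corresponding row in the long table.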





A data.table or data.frame whose first 3 columns are:

  • id Person identification variable such as PNR. The data may contain multiple lines per subject.

  • start Start of time interval. Either a date or an integer/numeric.

  • end End of time interval. Either in date format or given as numeric/integer.


The splitting guide. A data.table which contains person specific information about the onset dates of comorbidities and other events. Wide format:

  • id Person identification variable such as PNR. The data may contain multiple lines per subject.

  • Date 1 Either a date or an integer/numeric. The format must match that of the start and end variables of the argument indat. The onset date of comorbidity 1 or other event. If integer/numeric it can be time since a baseline date on a project specific scale (e.g., days or months).

  • Date 2 Either a date or an integer/numeric. The onset date of comorbidity 2 or other event. If integer/numeric it can be time since a baseline date on project specific scale (e.g., days or months).

  • Date 3 ....

Long format:

  • id Person identification variable such as PNR. The data may contain multiple lines per subject.

  • Condition name Character providing the variable name of the condition.

  • Date Either a date or an integer/numeric. The onset date of comorbidity or other event. If integer/numeric it can be time since a baseline date on project specific scale (e.g., days or months).


Vector of column names for id/entry/exit, in that order. Example: c("id","start","end")


For wide format: vector of the names of the columns containing dates to split by. Example: c("date1","date2","date3","date4")

For long format: vector of the 3 columns in the data.table: id/name/date. Example: c("id","name","date")

The name of the id column must be the same in both datasets.


Format of the splitting guide: "wide" or "long".


The program may crash if intervals are overlapping or negative. With data checking enabled, such cases produce an error instead. The check can be omitted if the data have been checked by other means. For the splitting guide, this option checks that there is only one entry per person identifier for each variable to split by.


The program checks that intervals are not negative and that intervals from one individual do not overlap.
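The kind of check described here can be approximated in base R before calling the function. This is a minimal sketch under assumed column names (id, start, end) and made-up data, not the package's own checking code. Intervals that merely touch (one starts where the previous ends) are not treated as overlapping.

```r
# Illustrative data: id "B" has overlapping intervals, id "C" a negative one
dat <- data.frame(
  id    = c("A", "A", "B", "B", "C"),
  start = c(0, 100, 0, 50, 30),
  end   = c(100, 200, 100, 150, 10),
  stringsAsFactors = FALSE
)

# Negative interval: end lies before start
negative <- dat$end < dat$start

# Overlap: after sorting by id and start, an interval begins before
# the previous interval of the same id has ended
dat_sorted <- dat[order(dat$id, dat$start), ]
n <- nrow(dat_sorted)
same_id  <- c(FALSE, dat_sorted$id[-1] == dat_sorted$id[-n])
prev_end <- c(NA, dat_sorted$end[-n])
overlap  <- same_id & dat_sorted$start < prev_end

bad_ids <- unique(c(dat$id[negative], dat_sorted$id[overlap]))
print(bad_ids)
```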

It is required that the splitting guide contains at least one record. Missing data in the person id variables are not allowed and will cause errors.

A note of caution: This function works with dates as numeric. R uses 1 January 1970 as the default origin of dates, but other programs, including SAS and Excel, have different default origins. It is therefore important for correct results that all dates are defined with the same origin.

The output will always have the "next" period starting on the day where the last period ended. This is to ensure that period lengths are calculated properly. The program will also allow periods of zero length, which is a consequence of multiple splits being made on the same day.
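The effect of the origin can be seen by expressing one calendar day as a day count under different conventions. The sketch uses base R only; the SAS origin (1 January 1960) and the origin commonly used for Windows Excel serial dates (30 December 1899) are stated here as background facts, not taken from the package.

```r
# The same calendar day expressed as a day count under three origins
d <- as.Date("2020-01-01")

r_days     <- as.numeric(d)                          # days since 1970-01-01 (R)
sas_days   <- as.numeric(d - as.Date("1960-01-01"))  # days since 1960-01-01 (SAS)
excel_days <- as.numeric(d - as.Date("1899-12-30"))  # Windows Excel serial-date convention

print(c(R = r_days, SAS = sas_days, Excel = excel_days))

# Converting a day count back to a date requires the matching origin
back <- as.Date(sas_days, origin = "1960-01-01")
print(back)
```

Mixing day counts from different origins in the data and the splitting guide would shift every split by a fixed number of days, so converting everything to R Date values (or to one agreed origin) first is the safer route.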

This function is identical to the lexisTwo function with the change that "events" are not considered.


The function returns a new data table where records have been split according to the splitting guide dataset. Variables unrelated to the splitting are left unchanged. The names of columns from "splitvars" are also in output data, but now they have the value zero before the dates and 1 after.
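As a rough illustration of the output shape described here, the sketch below splits a single interval at one date in base R. The helper split_one and the column name cond are hypothetical and are not part of the package; the handling of a date at or outside the interval boundaries is an assumption of this sketch.

```r
# Hypothetical helper: split one interval [start, end] at a single date
split_one <- function(start, end, split_date) {
  # No split when the date is missing or not strictly inside the interval
  if (is.na(split_date) || split_date <= start || split_date >= end) {
    # Assumption: the indicator is 1 if the date is at or before start, else 0
    flag <- if (!is.na(split_date) && split_date <= start) 1L else 0L
    return(data.frame(start = start, end = end, cond = flag))
  }
  # Otherwise: the first record ends at the split date, the second starts there
  data.frame(
    start = c(start, split_date),
    end   = c(split_date, end),
    cond  = c(0L, 1L)
  )
}

res <- split_one(0L, 100L, 25L)
print(res)
```

The second record starts exactly where the first ends, so the two period lengths add up to the length of the original interval.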


Christian Torp-Pedersen

See Also




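Examples

library(data.table)
library(heaven)

# Data to be split. The source listing is truncated after the pnr vector;
# the start/end values below are illustrative assumptions.
dat <- data.table(pnr=c("123456","123456","234567","234567","345678","345678"),
  start=as.integer(c(0,100,0,100,0,100)),
  end=as.integer(c(100,200,100,200,100,200)))
# Splitting guide in wide format: one row per person, one column per condition
split <- data.table(pnr=c("123456","234567","345678","456789"),
  como1.onset=as.integer(c(0,NA,49,50)), como2.onset=as.integer(c(25,75,49,49)),
  como3.onset=as.integer(c(30,NA,49,48)), como4.onset=as.integer(c(50,49,49,47)))
# Show the datasets:
dat[]
split[]
# Names of date variables to split by
splitvars <- c("como1.onset","como2.onset","como3.onset","como4.onset")
splitTwo(dat # in-data with id/in/out
   ,split # data with id and dates
   ,c("pnr","start","end") # names of id/in/out, in that order
   ,splitvars) # names of date variables to split by
# And with the splitting guide in long format
# (variable.name and value.name are assumed; the source line is truncated)
split <- data.table::melt(data=split,id.vars="pnr",measure.vars=splitvars,
  variable.name="name",value.name="value")
split <- split[!is.na(value)] # remove missing values
splitTwo(dat # in-data with id/in/out
   ,split # data with id/name/date
   ,c("pnr","start","end") # names of id/in/out, in that order
   ,c("pnr","name","value") # names of id/name/date, in that order
   ,"long") # format of the splitting guide, passed positionally (assumed position)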

tagteam/heaven documentation built on Oct. 24, 2024, 7:40 p.m.