aggre: Aggregation of split 'Lexis' data

View source: R/aggregating.R

aggre    R Documentation

Aggregation of split Lexis data

Description

Aggregates a split Lexis object by given variables and/or expressions into a long-format table of person-years and transitions / end-points. If a time scale by which the data has been split is mentioned in the by argument, the data is automatically aggregated to the intervals of that time scale, e.g. intervals of calendar time, follow-up time and/or age.

Usage

aggre(
  lex,
  by = NULL,
  type = c("unique", "full"),
  sum.values = NULL,
  subset = NULL,
  verbose = FALSE
)

Arguments

lex

a Lexis object split with e.g. splitLexis or splitMulti

by

variables to tabulate (aggregate) by. Flexible input, typically e.g. by = c("V1", "V2"). See Details and Examples.

type

determines the output levels to which data is aggregated, varying from returning only rows with pyrs > 0 ("unique") to returning all possible combinations of the variables given in by, even if those combinations are not represented in the data ("full"); see Details

sum.values

optional: additional variables to sum by argument by. Flexible input, typically e.g. sum.values = c("V1", "V2")

subset

a logical condition to subset by before computations; e.g. subset = area %in% c("A", "B")

verbose

logical; if TRUE, the function returns timings and some information useful for debugging along the aggregation process

Details

Basics

aggre is intended for aggregation of split Lexis data only. See Lexis for forming Lexis objects by hand and e.g. splitLexis, splitLexisDT, and splitMulti for splitting the data. lexpand may be used for simple data sets to do both steps as well as aggregation in the same function call.

Here aggregation refers to computing person-years and the appropriate events (state transitions and end points in status) for the subjects in the data. Hence, it computes e.g. deaths (end-point and state transition) and censorings (end-point) as well as events in a multi-state setting (state transitions).

The result is a long-format data.frame or data.table (depending on options("popEpi.datatable"); see ?popEpi) with the columns pyrs and the appropriate transitions named as fromXtoY, e.g. from0to0 and from0to1 depending on the values of lex.Cst and lex.Xst.
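To make the shape of the result concrete, the following base R sketch builds a small table with the same kind of columns and derives a crude rate from it. The table is hypothetical: the by variables (sex, agegroup) and all numbers are made up for illustration and are not output of aggre.

```r
## hypothetical aggre()-style output; all values are invented for illustration
agg <- data.frame(
  sex      = c(0, 0, 1, 1),
  agegroup = c(0, 50, 0, 50),
  pyrs     = c(10.0, 20.0, 12.5, 25.0),
  from0to0 = c(3, 2, 4, 1),  # follow-up ended while still in state 0
  from0to1 = c(1, 4, 0, 5)   # transitions from state 0 to state 1
)

## crude event rate per person-year in each stratum
agg$rate <- agg$from0to1 / agg$pyrs
agg
```

Each row is one combination of the by variables, so rates and similar quantities can be computed row-wise from pyrs and the fromXtoY columns.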

The by argument

The by argument determines the length of the table, i.e. the combinations of variables to which data is aggregated. by is relatively flexible, as it can be supplied as

  • a character string vector, e.g. c("sex", "area"), naming variables existing in lex

  • an expression, e.g. factor(sex, 0:1, c("m", "f")) using any variable found in lex

  • a list (fully or partially named) of expressions, e.g. list(gender = factor(sex, 0:1, c("m", "f")), area)

Note that expressions effectively allow a variable to be supplied simply as e.g. by = sex (as a symbol/name in R lingo).

The data is then aggregated to the levels of the given variables or expression(s). Variables defined to be time scales in the supplied Lexis are processed in a special way: If any are mentioned in the by argument, intervals of them are formed based on the breaks used to split the data: e.g. if age was split using the breaks c(0, 50, Inf), mentioning age in by leads to creating the age intervals [0, 50) and [50, Inf) and aggregating to them. The intervals are identified in the output as the lower bounds of the appropriate intervals.
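The interval labelling described above can be sketched in base R. This is only an illustration of the idea, not the code popEpi uses internally; the ages and person-years below are hypothetical.

```r
## breaks used to split the (hypothetical) data along the age time scale
breaks <- c(0, 50, Inf)

## hypothetical age at the start of each split row, and its person-years
age  <- c(3.2, 49.9, 50, 77.5)
pyrs <- c(1.0, 0.5, 2.0, 1.5)

## findInterval() uses left-closed intervals: [0, 50) and [50, Inf)
idx   <- findInterval(age, breaks)
lower <- breaks[idx]  # each row is labelled by the lower bound of its interval

## person-years summed within each age interval
pyrs_by_age <- tapply(pyrs, lower, sum)
pyrs_by_age
```

Note that an age of exactly 50 falls into [50, Inf), matching the left-closed intervals in the text.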

The order of multiple time scales mentioned in by matters, as the last mentioned time scale is assumed to be the survival time scale when computing event counts. E.g. when the data is split by the breaks list(FUT = 0:5, CAL = c(2008,2010)), time lines cut short at CAL = 2010 are considered to be censored, but time lines cut short at FUT = 5 are not. See Value.

Aggregation types (styles)

It is almost always enough to aggregate the data to variable levels that are actually represented in the data (default type = "unique"; alias "non-empty"). For certain uses it may be useful to also have "empty" levels represented (resulting in some rows in the output with zero person-years and events); in these cases supplying type = "full" (alias "cartesian") causes aggre to determine the Cartesian product of all the levels of the supplied by variables or expressions and aggregate to them. As an example of a Cartesian product, try

merge(1:2, 1:5).
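The difference between aggregating to observed levels only and to the Cartesian product of all levels can be sketched in base R as follows. The table obs is hypothetical and stands in for a result containing only non-empty levels; this is not how aggre is implemented.

```r
## hypothetical aggregated rows: only three sex / agegroup combinations occur
obs <- data.frame(
  sex      = c(0, 0, 1),
  agegroup = c(0, 50, 50),
  pyrs     = c(1.5, 3.5, 2.0)
)

## Cartesian product of all levels of the two variables (2 x 2 = 4 rows)
grid <- expand.grid(sex = c(0, 1), agegroup = c(0, 50))

## keep every combination; unobserved ones get zero person-years
full <- merge(grid, obs, by = c("sex", "agegroup"), all.x = TRUE)
full$pyrs[is.na(full$pyrs)] <- 0
full
```

Here obs has three rows while full has four, the extra row being the unobserved combination with zero person-years.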

Value

A long data.frame or data.table of aggregated person-years (pyrs), numbers of subjects at risk (at.risk), and events formatted fromXtoY, where X and Y are the states transitioned from and to, or the state at the end of each lex.id's follow-up (implying X = Y). Subjects at risk are computed at the beginning of an interval defined by any Lexis time scales mentioned in by, but events may occur at any point within an interval.

When the data has been split along multiple time scales, the last time scale mentioned in by is considered to be the survival time scale with regard to computing events. Time lines cut short by the extrema of non-survival-time-scales are considered to be censored ("transitions" from the current state to the current state).

Author(s)

Joonas Miettinen

See Also

aggregate for a similar base R solution, and ltable for a data.table based aggregator. Neither is directly applicable to split Lexis data.

Other aggregation functions: as.aggre(), lexpand(), setaggre(), summary.aggre()

Examples


## form a Lexis object
library(Epi)
library(popEpi) ## provides sibr, get.yrs, splitMulti, aggre and lexpand
data(sibr)
x <- sibr[1:10,]
x[1:5,]$sex <- 0 ## pretend some are male
x <- Lexis(data = x,
           entry = list(AGE = dg_age, CAL = get.yrs(dg_date)),
           exit = list(CAL = get.yrs(ex_date)),
           entry.status=0, exit.status = status)
x <- splitMulti(x, breaks = list(CAL = seq(1993, 2013, 5), 
                                 AGE = seq(0, 100, 50)))

## these produce the same results (with differing ways of specifying by)
a1 <- aggre(x, by = list(gender = factor(sex, 0:1, c("m", "f")), 
             agegroup = AGE, period = CAL))

a2 <- aggre(x, by = c("sex", "AGE", "CAL"))

a3 <- aggre(x, by = list(sex, agegroup = AGE, CAL))

## returning also empty levels
a4 <- aggre(x, by = c("sex", "AGE", "CAL"), type = "full")

## computing also expected numbers of cases
x <- lexpand(sibr[1:10,], birth = bi_date, entry = dg_date,
             exit = ex_date, status = status %in% 1:2, 
             pophaz = popmort, fot = 0:5, age = c(0, 50, 100))
x$d.exp <- with(x, lex.dur*pop.haz)
## these produce the same result
a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = list(d.exp))
a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = "d.exp")
a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = d.exp)
## same result here with custom name
a5 <- aggre(x, by = c("sex", "age", "fot"), 
             sum.values = list(expCases = d.exp))
             
## computing pohar-perme weighted figures
x$d.exp.pp <- with(x, lex.dur*pop.haz*pp)
a6 <- aggre(x, by = c("sex", "age", "fot"), 
             sum.values = c("d.exp", "d.exp.pp"))
## or equivalently e.g. sum.values = list(expCases = d.exp, expCases.p = d.exp.pp).

popEpi documentation built on Aug. 23, 2023, 5:08 p.m.