An introduction to linelist
In linelist: Tagging and Validating Epidemiological Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(linelist)

Motivations

Outbreak analytics pipelines often start with case line lists, which are data tables in which every line is a different case/patient, and columns record different variables of potential epidemiological interest such as date of events (e.g. onset of symptom, case notification), disease outcome, or patient data (e.g. age, sex, occupation). Such data is typically held in a data.frame (or a tibble) and used in various downstream analysis. While this approach is functional, it often means that each analysis step will:

need to check the required inputs are present in the data, and for the user to specify where (e.g. 'This is the column where dates of onset are stored.')
need to validate the required data (e.g. 'Check that the field storing dates of onset are indeed dates, and not a character.')

The aim of linelist is to take care of these pre-requisites once and for all before downstream analyses, thus helping to make data pipelines more robust and straightforward.

linelist in a nutshell

Outline

linelist is an R package which implements basic data representation for case line lists, alongside accessors and basic methods. It essentially provides three types of functionalities:

tagging: a tags system permits to pre-identify key epidemiological variables needed in downstream analyses (e.g. dates of case notification, symptom onset, age, gender, disease outcome)
validation: functions checking that tagged variables are indeed present in the data.frame/tibble, and that they have the expected type (e.g. checking that dates are Date, integer or numeric)
secured methods: generic functions which could lead to the loss of tagged variables have dedicated methods for linelist objects with adapted behaviours, either updating tags as needed (e.g. rename(), names() <- ...) or issuing warnings/errors when tagged variables are lost (e.g. select(), x[], x[[]])

Should I use linelist?

linelist is designed to add a robust, foundational layer to your data pipelines, but it might add unnecessary complexity to your analysis scripts. Here are a few hints to gauge if you should consider using the package.

You may have use for linelist if ...:

your data changes/updates over time (e.g. new entries, new variables, renamed variables)
you build data pipelines entailing multiple layers of data processing and analysis
you are looking to build re-useable analysis scripts, i.e. which will work on other datasets with minimal added changes

Conversely, you probably do not need it if ...:

you work on historical data, which has likely already been curated/validated and will no longer change
you perform some quick, simple analysis of your data, which you will not need to expand on later
your analysis scripts are very specific and will not be re-used elsewhere

Getting started

Installation

Our stable versions are released periodically on CRAN, and can be installed using:

install.packages("linelist", build_vignettes = TRUE)

If you prefer using the latest features and bug fixes, you can alternatively install the development version of linelist from GitHub using the following commands:

if (!require(remotes)) {
  install.packages("remotes")
}
remotes::install_github("epiverse-trace/linelist", build_vignettes = TRUE)

Once installed, you can load the package in your R session using:

library(linelist)

Key functionalities

A linelist object is an instance of a data.frame or a tibble in which key epidemiological variables have been tagged. The main features of the packages are broken down into the 3 categories outlined above.

Tagging system

Tags are paired keys pointing a reference epidemiological variables to the name of a column in a data.frame or tibble. The tagging system permits to construct linelist objects, modify tags in existing objects, check and access existing tags and the corresponding variables.

make_linelist(): to create a linelist object by tagging key epi variables in a data.frame or a tibble
set_tags(): to add, remove, or modify tags in a linelist
tags(): to list variables which have been tagged in a linelist
tags_names(): to list all recognized tag names; details on what the tags represent can be found at ?make_linelist
tags_df(): to obtain a data.frame of all the tagged variables in a linelist

Validation

Basic routines are provided to validate linelist objects. More advanced validation e.g. looking at compatibility of dated events will be implemented in a separate package.

validate_tags(): check that tagged variables are present in the dataset, that tags match the pre-defined list of tagged variables
validate_types(): check that tagged variables have an acceptable class, as defined in tags_types()
validate_linelist(): general validation of linelist objects, equivalent to running both validate_tags() and validate_types(), and checking the class of the object

Secured methods

These are dedicated S3 methods for existing generics which can be used to prevent the loss of tagged variables.

lost_tags_action(): to set the behaviour to adopt when tagged variables would be lost by an operation: issue a warning (default), an error, or ignore
get_lost_tags_action(): to check the current behaviour for lost tagged variables
names<-(): the 'base R' approach to renaming columns of a linelist; will rename tags as needed to match the new column names
x[] and x[[]]: for subsetting columns using 'base R' syntax; will behave according to get_lost_tags_actions() if tagged variables are lost

Worked example

Example dataset

In this example, we use the case line list of the Hagelloch 1861 measles outbreak, distributed by the outbreaks package as measles_hagelloch_1861 .

data(measles_hagelloch_1861, package = "outbreaks")

# overview of the data
head(measles_hagelloch_1861)

Creating a linelist object

Let us assume we want to tag the following variables to facilitate downstream analyses, after having checked their tag name in ?make_linelist:

the date of symptom onset, here called prodrome (tag: date_onset)
the date of death (tag: date_death)
the age of the patient (tag: age)
the gender of the patient (tag: gender)

We first load a few useful packages, and create a linelist with the above information:

library(tibble) # data.frame but with nice printing
library(dplyr) # for data handling
library(linelist) # this package!

x <- measles_hagelloch_1861 |>
  tibble() |>
  make_linelist(date_onset = "date_of_prodrome",
                date_death = "date_of_death",
                age = "age",
                gender = "gender")
head(x)

The printing of the object confirms that the tags have been added. If we want to double-check which variables have been tagged:

tags(x)

Changing tags

Tags can be added / removed / changed using set_tags().

Let us assume we also want to record the outcome: it is currently missing, but can be built from dates of deaths (missing date = survived). This can be done by using mutate() on x to create the new variable (remember x is not only a linelist, but also a regular tibble and this is compatible with dplyr verbs), and setting up a new tag using set_tags():

x <- x |>
  mutate(
    inferred_outcome = if_else(is.na(date_of_death), "survived", "died")
  ) |>
  set_tags(outcome = "inferred_outcome")
x

If we wanted to undo the above, i.e. remove the outcome tag, we only need to set it to NULL:

x <- x |>
  set_tags(outcome = NULL)
tags(x)

Accessing tagged variables

Now that key variables have been tagged in x, we can used these pre-defined fields in downstream analyses, without having to worry about variable names and types. We could access tagged variables using any of the following means:

# select tagged variables only
x |>
  select(has_tag(c("date_onset", "date_death")))

# select tagged variables only with renaming on the fly
x |>
  select(onset = has_tag("date_onset"))

# get all tagged variables in a data.frame
x |>
  tags_df()

Using safeguards

Because x remains a valid tibble, we can use any data handling operations implemented in dplyr. However, some of these operations may cause accidental removal of key tagged variables. linelist provides a safeguard mechanism against this. For instance, let's assume we want to select only some columns of x:

x |>
  select(1:2)

Here, the above command gave a meaningful warning, in which select() removes some of the variables that were tagged.

We can also use the has_tag() select helper to select columns via their tag. For example, to retain the first 2 variables, and the gender tag:

# hybrid selection
x |>
  select(1:2, has_tag("gender"))

Again, we observe a warning as before due to the loss of tagged variables in the operation. This behaviour can be silenced if needed, or could be changed to issue an error (for stronger pipelines for instance):

# hybrid selection
x |>
  select(1:2, has_tag("gender"))

# hybrid selection - no warning
lost_tags_action("none")

x |>
  select(1:2, has_tag("gender"))

# hybrid selection - error due to lost tags
lost_tags_action("error")

x |>
  select(1:2, has_tag("gender"))

# note that `lost_tags_action` sets the behavior for any later operation, so we
# need to reset the default
get_lost_tags_action() # check current behaviour
lost_tags_action() # reset default

Changing tag loss action permanently

If you wish to change the lost_tags_action in a way that persists across R sessions, you can do so by setting the LINELIST_LOST_ACTION environment variable. For example, your .Renviron file could contain the following line:

LINELIST_LOST_ACTION="error"

Any scripts or data that you put into this service are public.

linelist documentation built on June 8, 2025, 10:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

linelist
Tagging and Validating Epidemiological Data

An introduction to linelist
In linelist: Tagging and Validating Epidemiological Data

Motivations

linelist in a nutshell

Outline

Should I use linelist?

Getting started

Installation

Key functionalities

Tagging system

Validation

Secured methods

Worked example

Example dataset

Creating a linelist object

Changing tags

Accessing tagged variables

Using safeguards

Changing tag loss action permanently

Try the linelist package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

linelist Tagging and Validating Epidemiological Data

An introduction to linelist In linelist: Tagging and Validating Epidemiological Data

Motivations

linelist in a nutshell

Outline

Should I use linelist?

Getting started

Installation

Key functionalities

Tagging system

Validation

Secured methods

Worked example

Example dataset

Creating a linelist object

Changing tags

Accessing tagged variables

Using safeguards

Changing tag loss action permanently

Try the linelist package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

linelist
Tagging and Validating Epidemiological Data

An introduction to linelist
In linelist: Tagging and Validating Epidemiological Data