In reconhub/linelist: Tools to Import and Tidy Case Linelist Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  fig.path = "figs/",
  echo = TRUE
)

Welcome to the linelist package!

This package is dedicated to simplifying the cleaning and standardisation of linelist data. Considering a case linelist data.frame, it aims to:

standardise the variables names, replacing all non-ascii characters with their closest latin equivalent, removing blank spaces and other separators, enforcing lower case capitalisation, and using a single separator between words
standardise the labels used in all variables of type character and factor, as above
set POSIXct and POSIXlt to Date objects
extract dates from a messy variable, automatically detecting formats, allowing inconsistent formats, and dates flanked by other text

Installing the package

To install the current stable, CRAN version of the package, type:

install.packages("linelist")

To benefit from the latest features and bug fixes, install the development, github version of the package using:

devtools::install_github("reconhub/linelist")

Note that this requires the package devtools installed.

What does it do?

Data cleaning

Procedures to clean data, first and foremost aimed at data.frame formats, include:

clean_data(): the main function, taking a data.frame as input, and doing all the variable names, internal labels, and date processing described above
clean_variable_names(): like clean_data, but only the variable names
clean_variable_labels(): like clean_data, but only the variable labels
clean_variable_spelling(): provided with a dictionary, will correct the spelling of values in a variable and can globally correct commonly mis-spelled words.
clean_dates(): like clean_data, but only the dates
guess_dates(): find dates in various, unspecified formats in a messy character vector

Worked example

Let us consider some messy data.frame as a toy example:

## make toy data
onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected  ", "Not.a.Case")
messy_dates <- sample(
                 c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
                   "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
                 20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "SeX_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)
## show data
toy_data

We start by cleaning these data:

## load library
library(linelist)

## clean data with defaults
x <- clean_data(toy_data)
x

Getting help online

Bug reports and feature requests should be posted on github using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/

Contributions are welcome via pull requests.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

reconhub/linelist documentation built on Jan. 1, 2023, 9:39 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com