This package is dedicated to simplifying the cleaning and standardisation of linelist data. Given a case linelist stored as a data.frame, it aims to:
standardise the variable names, replacing all non-ASCII characters with their closest Latin equivalent, removing blank spaces and other separators, enforcing lower case, and using a single separator between words
standardise the labels used in all variables of type character and factor, as above
convert POSIXct and POSIXlt variables to Date objects
extract dates from a messy variable, automatically detecting formats, allowing inconsistent formats, and dates flanked by other text
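The name standardisation described in the first item can be approximated in base R. The sketch below is only an illustration of the idea, not the package's implementation: standardise_names() is a hypothetical helper, and the result of iconv() transliteration for accented characters varies between platforms.

```r
# Hypothetical helper (not part of linelist): a base-R approximation of the
# variable-name standardisation described above.
standardise_names <- function(x, sep = "_") {
  # Transliterate accented characters where the platform supports it;
  # the exact output of "ASCII//TRANSLIT" differs between systems.
  out <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
  # Enforce lower case.
  out <- tolower(out)
  # Turn every run of non-alphanumeric characters into a single space.
  out <- gsub("[^a-z0-9]+", " ", out)
  # Drop leading and trailing blanks, then use one separator between words.
  out <- trimws(out)
  gsub(" ", sep, out, fixed = TRUE)
}

standardise_names(c("Date of Onset.", "DisCharge..", "SeX_ "))
# "date_of_onset" "discharge" "sex"
```

The same helper could be applied to the values of a character variable, which is the spirit of the second item in the list.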
To install the current stable, CRAN version of the package, type:
install.packages("linelist")
To benefit from the latest features and bug fixes, install the development version of the package from GitHub using:
devtools::install_github("reconhub/linelist")
Note that this requires the devtools package to be installed.
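If devtools may not be present, the check and the installation can be wrapped together. This is a minimal sketch: install_linelist_dev() is a hypothetical convenience function, not something provided by the package, and it assumes a working internet connection.

```r
# Hypothetical convenience wrapper (not part of linelist or devtools).
install_linelist_dev <- function() {
  # Install devtools from CRAN only if it is not already available.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # Install the development version of linelist from GitHub.
  devtools::install_github("reconhub/linelist")
}
```

Calling install_linelist_dev() once from an interactive session performs both steps.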
Procedures to clean data, first and foremost aimed at data.frame formats, include:
clean_data(): the main function, taking a data.frame as input, and doing all the variable names, internal labels, and date processing described above
clean_variable_names(): like clean_data(), but only the variable names
clean_variable_labels(): like clean_data(), but only the variable labels
clean_variable_spelling(): provided with a dictionary, will correct the spelling of values in a variable and can globally correct commonly mis-spelled words
clean_dates(): like clean_data(), but only the dates
guess_dates(): find dates in various, unspecified formats in a messy character vector
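To give a feel for what finding dates in a messy character vector involves, here is a base-R sketch. It is not the algorithm used by guess_dates(): extract_date() is a hypothetical helper that only knows two layouts (year first, or day first with a four-digit year) and assumes ambiguous entries are day/month/year.

```r
# Hypothetical helper (not the method used by guess_dates()).
extract_date <- function(x) {
  x <- as.character(x)
  out <- rep(as.Date(NA), length(x))

  # Separators may be any run of "-", "/", "_", "." or spaces.
  sep <- "[-/_. ]+"
  ymd <- paste0("([0-9]{4})", sep, "([0-9]{1,2})", sep, "([0-9]{1,2})")
  dmy <- paste0("([0-9]{1,2})", sep, "([0-9]{1,2})", sep, "([0-9]{4})")

  for (i in seq_along(x)) {
    if (is.na(x[i])) next
    # Try the year-first layout, then fall back to day-first.
    m <- regmatches(x[i], regexec(ymd, x[i]))[[1]]
    if (length(m) == 4) {
      iso <- sprintf("%s-%s-%s", m[2], m[3], m[4])
    } else {
      m <- regmatches(x[i], regexec(dmy, x[i]))[[1]]
      if (length(m) != 4) next
      iso <- sprintf("%s-%s-%s", m[4], m[3], m[2])
    }
    # Impossible dates (e.g. month 13) become NA here.
    out[i] <- as.Date(iso, format = "%Y-%m-%d")
  }
  out
}

extract_date(c("01-12-2001", "male", "2018_10_17", "that's 24/12/1989!"))
# "2001-12-01" NA "2018-10-17" "1989-12-24"
```

Entries with no recognisable date, such as "male", are returned as NA rather than raising an error.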
Let us consider a messy data.frame as a toy example:
## make toy data
onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected ", "Not.a.Case")
messy_dates <- sample(
  c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
    "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
  20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "SeX_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)

## show data
toy_data
We start by cleaning these data:
## load library
library(linelist)

## clean data with defaults
x <- clean_data(toy_data)
x
Bug reports and feature requests should be posted on GitHub using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/
Contributions are welcome via pull requests.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.