This package offers a tidy solution for epidemiological data. It houses a range of data-management and cleaning functions for epidemiologists and public health data wizards. These include:

- `import()`: imports files in a wide range of formats.
- `export()`: exports files to a wide range of formats.
- `get_admin_names()`: downloads admin names using various country codes and naming conventions.
- `clean_admin_names()`: cleans admin names using both user-provided and downloaded admin data.
- `create_test()`: creates a unit-testing function to perform various data validations.
- `consistency_check()`: plots to check whether certain variables exceed others (e.g., tests vs. cases).
- `handle_outliers()`: detects outliers using various approaches and offers functionality to manage them.
- `missing_plot()`: plots missing data or reporting rates for given variable(s) by different factors.

The package is not yet available on CRAN, but it can be installed using devtools in R.
```r
# To install via CRAN once it becomes available
# install.packages("epiCleanr")
```
You can install the latest development version from GitHub by using:
```r
# If you haven't installed the 'devtools' package, run:
# install.packages("devtools")
devtools::install_github("truenomad/epiCleanr")
```
Inspired by rio, the `import()` function allows you to read data from a wide range of file formats. Additional reading options specific to each format can be passed through the ellipsis (`...`) argument. Similarly, the `export()` function provides a simple way to export data into various formats.
```r
# Load the epiCleanr package
library(epiCleanr)

# Read a CSV file with a specific separator
data_csv <- import("path/to/your/file.csv", sep = "\n")

# Import the first sheet from an Excel file
data_excel <- import("path/to/your/file.xlsx", sheet = 1)

# Export a Stata DTA file
export(my_data, "path/to/your/file.dta")

# Export an RDS file
export(my_data, "path/to/your/file.rds")

# Export an Excel file with multiple sheets
export(
  list(my_data1 = my_data1, my_data2 = my_data2),
  "path/to/your/file.xlsx")
```
The `clean_names_strings()` function offers additional flexibility by allowing you to clean not just strings within columns but also the column names themselves.
```r
# For a data frame, with snake_case (the default)
data("iris")
colnames(iris)
#> "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

cleaned_iris <- clean_names_strings(iris)
colnames(cleaned_iris)
#> "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
```
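The same cleaning also applies to string values inside columns. Below is a minimal sketch; the data frame, and the exact cleaned output, are illustrative assumptions based on the default snake_case style:

```r
# Illustrative data frame with a messy column name and messy values
messy_df <- data.frame(
  "Region Name" = c("Grand  Lomé", " PLATEAUX "),
  check.names = FALSE
)

cleaned_df <- clean_names_strings(messy_df)
# Assuming the defaults clean values as well as names, we would expect
# the column to become "region_name" and the values to be standardised
# along the lines of "grand_lome" and "plateaux"
```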
You can download administrative names (such as districts, provinces, etc.) for a given country from the GeoNames website, using either the country name or a country code to pull the data.
```r
# Get admin names for Togo
admin_names <- get_admin_names(country_name_or_code = "Togo")

# Different admin levels are saved as a list
str(admin_names$adm2)
#> 'data.frame': 30 obs. of 7 variables:
#>  $ country_code  : chr "TG" "TG" "TG" "TG" ...
#>  $ asciiname     : chr "Vo Prefecture" "Zio Prefecture" "Tchaoudjo" "Tchamba" ...
#>  $ alternatenames: chr "Circonscription de Vogan, Prefecture de Vo, Préfecture de Vo, ...
#>  $ adm2          : chr "Vo Prefecture" "Zio Prefecture" "Tchaoudjo" "Tchamba" ...
#>  $ latitude      : num 6.42 6.58 9 8.83 6.67 ...
#>  $ longitude     : num 1.5 1.17 1.17 1.42 1.5 ...
#>  $ last_updated  : IDate, format: "2019-01-08" "2019-01-08" "2022-02-17" "2022-02-17" ...
```
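Because the function accepts codes as well as names, the same request can be made more tersely. A quick sketch, assuming ISO-style codes such as "TG" are accepted (as the `country_code` column above suggests):

```r
# Pull the same admin names using a country code instead of the name
# ("TG" is an assumed input, based on the country_code column above)
admin_names_tg <- get_admin_names(country_name_or_code = "TG")
```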
You can clean administrative names using the `clean_admin_names()` function, which uses various matching and string-distance algorithms to match your admin data against your own list of admin names as well as admin names from the GeoNames website.
```r
# Get simulated epi data for Togo
path <- system.file(
  "extdata", "fake_epi_df_togo.rds", package = "epiCleanr")
fake_epi_df_togo <- import(path)

# Get the reference dataset with clean admin names
# (the file name below is assumed for illustration)
path <- system.file(
  "extdata", "togo_admin_df.rds", package = "epiCleanr")
togo_admin_df <- import(path)

# Let's check how the matching looks
clean_admin_names(
  country_code = "Togo",
  admin_names_to_clean = fake_epi_df_togo$district,
  user_base_admin_names = togo_admin_df$district,
  report_mode = TRUE)
#> There are 15 out of 15 (100%) admins that have been perfectly matched!
#> # A tibble: 15 × 5
#>    names_to_clean final_names    source_of_cleaned_name         prop_matched matching_algorithm
#>    <chr>          <chr>          <chr>                                 <dbl> <chr>
#>  1 Bas-Mono       Bas-Mono       User base admin names                   100 Levenshtein Distance
#>  2 Bliita         Blitta         Main admin name from geonames           100 Soundex
#>  3 Centrale       Centrale       User base admin names                   100 Levenshtein Distance
#>  4 Cinkaasi       Cinkassé       User base admin names                   100 Soundex
#>  5 Dankben        Dankpen        Main admin name from geonames           100 Soundex
#>  6 East-Mono      Est-Mono       Main admin name from geonames           100 Soundex
#>  7 Kaloto         Kloto          Main admin name from geonames           100 Soundex
#>  8 Keéran         Keran          Alternative name from geonames          100 Soundex
#>  9 Lomé           Lomé           User base admin names                   100 Levenshtein Distance
#> 10 Ogou           Ogou           Main admin name from geonames           100 Levenshtein Distance
#> 11 Sotouboua      Sotouboua      Main admin name from geonames           100 Levenshtein Distance
#> 12 Tchamaba       Tchamba        Main admin name from geonames           100 Soundex
#> 13 Vo             Vo Prefecture  Main admin name from geonames           100 Levenshtein Distance
#> 14 Yotto          Yoto           Main admin name from geonames           100 Soundex
#> 15 Zioo           Zio Prefecture Main admin name from geonames           100 Soundex

# If we are happy with the names, we can update the old names with the
# new ones; otherwise we can further clean the names manually
fake_epi_df_togo$district <- clean_admin_names(
  country_code = "Togo",
  admin_names_to_clean = fake_epi_df_togo$district,
  user_base_admin_names = togo_admin_df$district,
  report_mode = FALSE)
```
The `create_test()` function lets users create their own unit-testing functions to use when working with datasets that require a lot of manipulation and wrangling. This function (plus the tidylog package) will save users the headache of troubleshooting issues related to data joins and transformations.
```r
# Set up a unit-testing function
my_tests <- create_test(
  # For checking the dimensions of the data
  dimension_test = c(900, 9),
  # For the expected number of combinations in the data
  combinations_test = list(
    variables = c("month", "year", "district"),
    expectation = 12 * 5 * 15),
  # Check repeated rows and columns, plus max and min thresholds
  row_duplicates = TRUE, col_duplicates = TRUE,
  max_threshold_test = list(malaria_tests = 1000, cholera_tests = 1000),
  min_threshold_test = list(malaria_cases = 0, cholera_cases = 0)
)

# Apply your unit test to your data
my_tests(fake_epi_df_togo)
#> Test passed! You have the correct number of dimensions!
#> Test passed! No duplicate rows found!
#> Test passed! No repeated columns found!
#> Test passed! You have the correct number of combinations for month, year, district!
#> Test passed! Values in column malaria_cases are above the threshold.
#> Test passed! Values in column cholera_cases are above the threshold.
#> Test passed! Values in column malaria_tests are below the threshold.
#> Test passed! Values in column cholera_tests are below the threshold.
#> Congratulations! All tests passed: 8/8 (100%) 😀
```
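The same test function can be re-run after each wrangling step, which is where the pairing with tidylog pays off. The pipeline below is an illustrative sketch, not package output:

```r
# Illustrative workflow: with tidylog attached (after dplyr), each
# dplyr verb reports what it changed (rows removed, values modified,
# etc.); re-running my_tests() then catches anything the logs miss
library(dplyr)
library(tidylog)

fake_epi_df_togo |>
  mutate(malaria_cases = pmax(malaria_cases, 0)) |>  # dims unchanged
  my_tests()
```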
The `consistency_check()` function validates logical relationships between variables. For instance, if you know that the number of disease cases cannot exceed the number of tests for that disease, this function helps you inspect such expected behaviour in your data.
```r
# Run checks using the Togo data
consistency_check(fake_epi_df_togo,
                  tests = c("malaria_tests", "cholera_tests"),
                  cases = c("malaria_cases", "cholera_cases"))
#> Consistency test passed for malaria_tests vs malaria_cases: There are more tests than there are cases!
#> Consistency test failed for cholera_tests vs cholera_cases: There are 3 (0.33%) rows where cases are greater than tests.

# You can even save the plot if you want
ggplot2::ggsave("../man/figures/consistency_plot.png",
                width = 10, height = 5.75, scale = 0.95,
                dpi = 400, bg = "white")
```
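When a check fails, a natural follow-up is to inspect the offending rows. A quick dplyr sketch (not part of the package) to pull out the three flagged rows:

```r
# Locate the rows where cholera cases exceed cholera tests
fake_epi_df_togo |>
  dplyr::filter(cholera_cases > cholera_tests) |>
  dplyr::select(district, year, month, cholera_tests, cholera_cases)
```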
The `handle_outliers()` function is designed both to identify and to address outliers in your dataset. It supports multiple statistical methods for outlier detection, such as the Z-score, modified Z-score, and interquartile range (IQR). Beyond detection, the function offers various options for handling outliers, including removal or replacement with the mean, median, grouped means, or quantiles.
```r
# Select the variables
variables <- c("malaria_tests", "malaria_cases",
               "cholera_tests", "cholera_cases")

# Get the outliers report
outliers <- handle_outliers(
  fake_epi_df_togo, vars = variables,
  method = "zscore", report_mode = TRUE)

# Get the report
outliers$report
#>   variable      test    outliers prop_outliers
#>   <chr>         <chr>   <glue>   <chr>
#> 1 malaria_tests Z-Score 5/900    <1%
#> 2 malaria_cases Z-Score 5/900    <1%
#> 3 cholera_tests Z-Score 2/900    <1%
#> 4 cholera_cases Z-Score 1/900    <1%

# Get the plot
outliers$plot

# You can even save the plot if you want
# ggplot2::ggsave("../man/figures/outliers_plot.png", width = 10,
#                 height = 7, scale = 0.95, dpi = 400)

# If happy, create a data frame with the outliers handled
fake_epi_df_togo_no_outliers <- handle_outliers(
  fake_epi_df_togo, vars = variables, method = "zscore",
  report_mode = FALSE, treat_method = "mean")
```
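The other detection and treatment methods listed above follow the same pattern. In the variant sketch below, the argument values "iqr" and "median" are assumptions inferred from that list, so check `?handle_outliers` for the exact spellings:

```r
# Assumed argument values ("iqr", "median") inferred from the methods
# described above; see ?handle_outliers for the exact names
fake_epi_df_togo_iqr <- handle_outliers(
  fake_epi_df_togo, vars = variables,
  method = "iqr", report_mode = FALSE, treat_method = "median")
```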
The `missing_plot()` function allows you to visualize missing data across different variables, either by a single factor like time or district, or by two factors simultaneously like date and district. This is particularly useful when you're dealing with epidemiological data that might have varying degrees of completeness across different times or locations.
```r
# Make a date column
fake_epi_df_togo2 <- fake_epi_df_togo |>
  dplyr::mutate(
    date = lubridate::as_date(paste(year, month, "01", sep = "-")))

# Missing rate for the variables across dates
missing_plot(fake_epi_df_togo2,
             miss_vars = variables,
             x_var = "date", use_rep_rate = FALSE)
```
```r
# Reporting rate for malaria_cases across dates and districts
missing_plot(fake_epi_df_togo2,
             miss_vars = "malaria_cases",
             x_var = "date", y_var = "district",
             use_rep_rate = TRUE)
```