epimatch: find matching patient records across tabular datasets

Project status: WIP. Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

This package was produced at Hackout3 in conjunction with rOpenSci. It displays and records suggested patient row matches across datasets for epidemiology workers in the field. It was designed specifically for field workers trying to find duplicated patient records within one or more tabular datasets, such as CSV files. Several fields, such as location, name and age, can be ambiguous because of misspellings or differing data formats between datasets. The package finds the closest matches but does not alter the datasets to reflect them; instead, it returns the suggested matches to the field worker, who can decide whether the suggested rows really pertain to the same patient. The field worker can then manually update the dataset rows as they see fit, depending on the context.

Another group at Hackout3 focused on higher-level data cleaning for the modeler or data scientist who receives the datasets from all field workers in different locations, since those steps concern aggregate analyses rather than data verification. Because they work on the ground, field workers are best placed to determine whether a patient is represented multiple times in the datasets.

Contributors

In alphabetical order by first name:

Try it out

You can either install the package on your own computer and run it yourself (instructions below), or you can use the app hosted online.

Installation

You can install this package with devtools. Open an R session and copy and paste the following into the console:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools", repos = "https://cran.r-project.org")
devtools::install_github("Hackout3/epimatch")

This should successfully install the epimatch package.

Running

Once you have epimatch installed, load the package and launch the interface in your R console with:

library('epimatch')
launch()

Datasets

The original fake datasets (i.e. before errors were introduced) contain exact patient matches on name, age, date of onset, etc. There are three such datasets, which you can feed pairwise into epimatch to find suggested matches:

The global record id will be different for the same person in the case, laboratory and contact forms, because the id is created separately for each form type. Datasets with these prefixes plus an additional "_messy" suffix contain induced errors (such as misspellings, slightly different recorded ages, etc.) for the same patient across different records, so you can explore how the application finds patient matches in a more realistic context.
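
To illustrate the kind of matching the messy datasets are meant to exercise, here is a minimal sketch in base R. It is not epimatch's actual algorithm or API: the two toy data frames, their column names and the scoring rule (name edit distance from utils::adist plus absolute age difference) are all assumptions made for illustration. Like the package, it only suggests matches and leaves the data untouched.

```r
# Hypothetical toy data; names and columns are illustrative only and are
# not taken from the datasets shipped with epimatch.
clean <- data.frame(
  id   = c("C1", "C2", "C3"),
  name = c("Amara Conteh", "Joseph Kamara", "Fatmata Sesay"),
  age  = c(34, 51, 8),
  stringsAsFactors = FALSE
)
messy <- data.frame(
  id   = c("L7", "L8", "L9"),
  name = c("Josef Kamara", "Fatmata Sesey", "Amara Konteh"),
  age  = c(50, 8, 34),
  stringsAsFactors = FALSE
)

# Score every pair of rows: case-insensitive edit distance between names
# plus absolute age difference. Lower scores mean closer records.
name_dist <- utils::adist(clean$name, messy$name, ignore.case = TRUE)
age_dist  <- abs(outer(clean$age, messy$age, "-"))
score     <- name_dist + age_dist

# For each row of `clean`, suggest the closest row of `messy`.
# Neither data frame is modified; the user decides what to do next.
best <- apply(score, 1, which.min)
suggestions <- data.frame(
  clean_id = clean$id,
  messy_id = messy$id[best],
  score    = score[cbind(seq_len(nrow(clean)), best)],
  stringsAsFactors = FALSE
)
print(suggestions)
```

A real workflow would likely weight the fields differently and apply a score threshold so that records with no plausible counterpart are not paired with anything.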

Future work

Suggestions? Open a GitHub issue on this repo.



Hackout3/epimatch documentation built on May 6, 2019, 9:48 p.m.