rrefine

library(rrefine)

Introduction

OpenRefine (formerly Google Refine) is a popular open source data cleaning tool^1. rrefine lets users programmatically transfer data between R and OpenRefine. Using the functions in this package, you can import and export projects, apply data cleaning operations to them, or delete them in OpenRefine directly from R. Client libraries for automating OpenRefine tasks already exist for Python, Node.js and Ruby^2; rrefine extends this functionality to R users.

Installation

rrefine is available on CRAN:

install.packages("rrefine")

The latest version of the package is also available on GitHub and can be installed via devtools by using the following:

# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
library(rrefine)
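
After installing, it can be useful to confirm which version is available, since some behavior depends on it (for instance, the CSRF token handling described later applies as of v1.1.0). The helper below is a minimal sketch in base R; has_min_version() is a hypothetical name used for illustration and is not part of rrefine.

```r
# Hypothetical helper (not part of rrefine): TRUE if `pkg` is installed
# and its version is at least `min_version`.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Check whether the installed rrefine is recent enough to use CSRF tokens.
has_min_version("rrefine", "1.1.0")
```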

lateformeeting

rrefine includes a sample "dirty" data set to illustrate its features. This object (lateformeeting) is a simulated data frame that holds observations of dates, days of the week, number of hours slept, and indicators of whether or not the subject was on time for work. The data are recorded in inconsistent formats and must be cleaned before R can parse them correctly. You can see how messy the data are below:

knitr::kable(lateformeeting)

refine_upload()

While the data cleaning could be performed using R, the operations here describe a typical scenario for OpenRefine users. The first step in creating a new project is to make sure OpenRefine is installed and running^3. By default, the application runs locally at http://127.0.0.1:3333/. All of the functions in rrefine assume the default local host name and port; however, both can be overridden[^4]. Additionally, as of v1.1.0 the package connects to the OpenRefine instance using a CSRF token in API requests^5. The refine_upload() function takes a delimited text file (csv or tsv), an optional project name, and an argument that controls whether to automatically open the browser in which OpenRefine is running. The example below demonstrates this workflow using the lateformeeting sample data:

write.csv(lateformeeting, file = "lateformeeting.csv", row.names = FALSE)
refine_upload(file = "lateformeeting.csv", project.name = "lfm_cleanup", open.browser = TRUE)
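
Because the upload only works when OpenRefine is actually running, it may help to check the address first. The following is a rough sketch in base R, assuming the default host and port mentioned above; openrefine_url() and openrefine_is_running() are hypothetical helper names, not rrefine functions.

```r
# Hypothetical helpers (not part of rrefine).

# Build the base URL for an OpenRefine instance.
openrefine_url <- function(host = "127.0.0.1", port = "3333") {
  sprintf("http://%s:%s/", host, port)
}

# Return TRUE if something responds at that URL, FALSE otherwise.
openrefine_is_running <- function(host = "127.0.0.1", port = "3333") {
  con <- url(openrefine_url(host, port))
  # Make sure the connection object is released whatever happens.
  on.exit(try(close(con), silent = TRUE), add = TRUE)
  tryCatch({
    suppressWarnings(readLines(con, n = 1L, warn = FALSE))
    TRUE
  }, error = function(e) FALSE)
}

if (!openrefine_is_running()) {
  message("OpenRefine does not appear to be running at ", openrefine_url())
}
```

This only confirms that a server answers at the address; it does not verify that the server is OpenRefine.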

With the project uploaded, you can perform any of the desired clean-up procedures in OpenRefine.

refine_operations()

Whether the data in OpenRefine were uploaded via refine_upload() or by another method, users can programmatically apply operations to projects using refine_operations(). This function passes an arbitrary list of data cleaning operations to the specified project. Operations must be defined in valid JSON format^6. In addition to the generic refine_operations(), which flexibly accepts any valid JSON operation, the rrefine package includes a series of wrapper functions that perform common data cleaning procedures, such as refine_add_column(), refine_to_upper() and refine_remove_column().

The example below demonstrates several operations using the lateformeeting sample data:

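# Create a new column as a copy of the day-of-week column
refine_add_column(new_column = "dotw_allcaps", 
                  base_column = "what.day.whas.it", 
                  value = "grel:value",
                  project.name = "lfm_cleanup")
# Convert the new column to upper case
refine_to_upper(column_name = "dotw_allcaps", project.name = "lfm_cleanup")
# Export the project and inspect the transformed column
refine_export(project.name = "lfm_cleanup")$dotw_allcaps
# Remove the column again so the project returns to its previous state
refine_remove_column(column = "dotw_allcaps", project.name = "lfm_cleanup")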
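
For the generic refine_operations() route, an operation has to be written out as JSON. The sketch below builds one such string in base R. The field names follow OpenRefine's text-transform operation as I understand it, and text_transform_json() is a hypothetical helper, not an rrefine function. The final call is left commented out because its exact argument form is an assumption; check ?refine_operations before relying on it.

```r
# Hypothetical helper (not part of rrefine): build the JSON for a single
# OpenRefine text-transform operation. No escaping is performed, so the
# inputs must not contain double quotes or backslashes.
text_transform_json <- function(column_name, expression) {
  sprintf(
    paste0(
      '{"op":"core/text-transform",',
      '"engineConfig":{"facets":[],"mode":"row-based"},',
      '"columnName":"%s","expression":"%s",',
      '"onError":"keep-original","repeat":false,"repeatCount":10}'
    ),
    column_name, expression
  )
}

op <- text_transform_json("dotw_allcaps", "grel:value.toUppercase()")
cat(op, "\n")

# Assumed call shape; requires a running OpenRefine instance:
# rrefine::refine_operations(project.name = "lfm_cleanup",
#                            operations = list(op))
```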

refine_export()

Once you've cleaned up the data in OpenRefine, you can pull it back into R for plotting, modeling, etc. by using refine_export(). This function accepts either the project name or the numerical unique identifier. It is only necessary to supply both if multiple projects in your OpenRefine application share the same name. Note that the data are exported directly into R as a data frame, which you can assign to a new object.

lfm_clean <- refine_export(project.name = "lfm_cleanup")
lfm_clean
knitr::kable(lfm_clean)

From there the clean data is available for analyses that couldn't have been performed in its original format.
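
To make the point about inconsistent formats concrete, here is a small base R illustration. The values are invented for this sketch and are not taken from lateformeeting; they only mimic the kind of inconsistency a day-of-week column might contain.

```r
# Hypothetical day-of-week entries, invented for illustration only.
raw_days <- c("Monday", "monday ", " MONDAY", "Tuesday", "tuesday")

# Before cleaning, every spelling counts as its own category.
table(raw_days)

# After trimming whitespace and normalizing case, the entries collapse
# into two categories.
clean_days <- toupper(trimws(raw_days))
table(clean_days)
```

The same normalization is what an upper-case transform plus a whitespace trim would achieve inside OpenRefine.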

refine_delete()

To clean up your OpenRefine workspace, you can delete projects using refine_delete(). Just as with refine_export(), you can pass either a project name or a unique identifier to this function. It is only necessary to supply both if multiple projects share the same name.

refine_delete(project.name = "lfm_cleanup")
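
The upload, clean, export and delete steps can be combined into one round trip. The wrapper below is a sketch only: round_trip() and its clean argument are hypothetical, and the rrefine calls use just the arguments shown in this vignette. The rrefine functions are passed in as defaults so they can be swapped for stand-ins when OpenRefine is not available.

```r
# Hypothetical wrapper (not part of rrefine): upload a data frame, run a
# user-supplied cleaning step, export the result and delete the project.
round_trip <- function(data, project_name,
                       clean = function(project_name) invisible(NULL),
                       upload = rrefine::refine_upload,
                       export = rrefine::refine_export,
                       delete = rrefine::refine_delete) {
  # Write the data to a temporary csv that is removed on exit.
  csv_path <- tempfile(fileext = ".csv")
  write.csv(data, file = csv_path, row.names = FALSE)
  on.exit(unlink(csv_path), add = TRUE)

  upload(file = csv_path, project.name = project_name, open.browser = FALSE)
  # Once the project exists, make sure it is deleted even if a later step fails.
  on.exit(delete(project.name = project_name), add = TRUE)

  clean(project_name)
  export(project.name = project_name)
}

# Example usage (requires a running OpenRefine instance):
# cleaned <- round_trip(
#   lateformeeting, "lfm_cleanup",
#   clean = function(p) refine_to_upper(column_name = "what.day.whas.it",
#                                       project.name = p)
# )
```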

References

[^4]: For documentation on how to specify a different host or port number see ?refine_path().




rrefine documentation built on Nov. 16, 2022, 1:09 a.m.