knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
devtools::load_all(".")

Introduction

Checking data is part of every analysis, but checking data at the end of an analysis isn't always required.

Checking at the end becomes more important as we build more data products that rely on a "stream", or constant flow, of new data. These streams represent new data that is being generated and must be merged with existing data. Depending on the complexity of the downstream application, certain column names may be required, or only certain values may be allowed for fields such as gene IDs.

To do this data wrangling, we rely on a concept called a white-list.

white_list <- list(
    "media" = c("LGLI+", "NASH+TNF"),
    "treatment" = c("Advil", "Ibuprofen")
)

A white-list is just a named list in R. This example tells us we have two required fields ('media' and 'treatment') and two permitted values for each. The idea of permitted values lets projects restrict data for consistency while allowing those values to evolve along with project goals.
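As a small illustration of how permitted values can evolve, the sketch below extends one field of the list using base R only. The value "Naproxen" is made up for the example and is not part of the original white-list.

```r
white_list <- list(
    "media" = c("LGLI+", "NASH+TNF"),
    "treatment" = c("Advil", "Ibuprofen")
)

# The required fields are simply the names of the list
required_fields <- names(white_list)

# Add a new permitted value; union() avoids duplicates if it already exists
# ("Naproxen" is a hypothetical value used only for illustration)
white_list$treatment <- union(white_list$treatment, "Naproxen")
```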

Let's assume we just finished an analysis and we have a data frame, data_to_check.

data_to_check <- data.frame(
    device_number = c(81, 82),
    media = c("LGLI", "NASH+TNF"),
    tx = c("Adv", "Ibu")
)

Since this example is small, we can see the problem (non-permitted) values at a glance. Abbreviations and transcription errors are common in the experimental data we analyze, and the goal of post-analysis cleaning is to make these rogue values conform.

Note: Sometimes a problem value is perfectly valid; it has just never been encountered before. In these cases we want to be alerted to the rogue value, and we will see how to handle them later.

Checking names

The first step is checking that required fields are present. The function check_names() does this.

check_names(data_to_check, white_list)

We see that 'treatment' is missing, and it looks like this assay layout renamed it as 'tx'.

# Anchor the pattern so only a column named exactly "tx" is renamed
names(data_to_check) <- gsub("^tx$", "treatment", names(data_to_check))

check_names(data_to_check, white_list)

This time it was incorrect naming, but check_names() is also valuable for detecting when data is missing.
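The idea behind a required-field check can be sketched in base R with setdiff(). This is only an illustration of the concept, using the column names from before the rename; it is not the implementation of check_names(), and the variable name missing_fields is an assumption made for the example.

```r
white_list <- list(
    "media" = c("LGLI+", "NASH+TNF"),
    "treatment" = c("Advil", "Ibuprofen")
)

# The data frame as it looked before 'tx' was renamed
data_to_check <- data.frame(
    device_number = c(81, 82),
    media = c("LGLI", "NASH+TNF"),
    tx = c("Adv", "Ibu"),
    stringsAsFactors = FALSE
)

# Required fields that do not appear among the column names
missing_fields <- setdiff(names(white_list), names(data_to_check))
missing_fields
```

Columns that are not in the white-list, such as device_number, are ignored by this comparison; only the absence of a required field is reported.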

Checking values

The second step is checking that individual values are permitted. This is where restricting the data for consistency comes into play.

The function check_values() is the tool for this.

check_values(data_to_check, white_list)

We are met with more of the red print, alerting us that values in our data are outside of the allowed white-list.

Since we are dealing with only a few substitutions here, it would be easy enough to replace them by hand, as we did for names(data_to_check) earlier. But in most cases beyond this vignette, you will be dealing with larger datasets and larger white-lists, where checking and substituting by eye becomes painful.
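For intuition, a per-field comparison of observed values against permitted values might look like the base-R sketch below. It is not how check_values() is implemented; bad_values is a name invented for the example, and the data frame is rebuilt here with the 'treatment' column already renamed.

```r
white_list <- list(
    "media" = c("LGLI+", "NASH+TNF"),
    "treatment" = c("Advil", "Ibuprofen")
)

data_to_check <- data.frame(
    device_number = c(81, 82),
    media = c("LGLI", "NASH+TNF"),
    treatment = c("Adv", "Ibu"),
    stringsAsFactors = FALSE
)

# For each white-listed field, keep the observed values that are not permitted
bad_values <- sapply(
    names(white_list),
    function(field) {
        observed <- unique(as.character(data_to_check[[field]]))
        setdiff(observed, white_list[[field]])
    },
    simplify = FALSE
)
bad_values
```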

To handle this, the function check_and_match() wraps check_values(): it sends the user into a series of interactive console prompts and returns a substituted version of data_to_check. It relies on Levenshtein distance to calculate the similarity between strings.

check_and_match(data_to_check, white_list)
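Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. Base R exposes it through utils::adist(), which the sketch below uses to compare the observed treatment values with the permitted ones. Whether check_and_match() calls adist() or scores similarity in some other way is not stated here, so treat this as an illustration of the metric only.

```r
observed  <- c("Adv", "Ibu")
permitted <- c("Advil", "Ibuprofen")

# Matrix of edit distances: rows are observed values, columns are permitted values
d <- utils::adist(observed, permitted)
dimnames(d) <- list(observed, permitted)
d
```

With raw edit distance, "Ibu" is 5 edits from "Advil" but 6 from "Ibuprofen", because a long completion costs one insertion per missing character. Cases like this are one reason to review suggested matches interactively rather than substitute the nearest value blindly.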

The interactivity is difficult to show with knitr, but two additional arguments to check_values() are worth mentioning. append = TRUE causes the version of white_list in the global environment to be overwritten, which is useful when you are seeing a new but valid value for the first time. margin is a numeric value that widens the similarity score threshold. By default only the most similar values (including ties) in the white-list are returned; increasing margin includes any values with similarity scores less than or equal to min(similarity_score) + margin.
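The effect of margin can be sketched with a small helper built on utils::adist(). The function candidates() is hypothetical and written only for this illustration; it is not part of the package, and the package's own scoring may differ from raw edit distance.

```r
# Hypothetical helper: return permitted values whose edit distance to `value`
# is within `margin` of the best (smallest) distance
candidates <- function(value, permitted, margin = 0) {
    d <- drop(utils::adist(value, permitted))
    permitted[d <= min(d) + margin]
}

permitted <- c("Advil", "Ibuprofen")

strict  <- candidates("Ibu", permitted)              # best match only
relaxed <- candidates("Ibu", permitted, margin = 1)  # also values one edit further away
```

Here strict contains only "Advil", while relaxed contains both "Advil" and "Ibuprofen", so a margin of 1 is enough to bring the intended value into the list of suggestions.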



hemoshear/assayr2 documentation built on Nov. 8, 2019, 6:13 p.m.