bioset: Convert a Matrix of Raw Values into Nice and Tidy Data


bioset is intended to help you work with sets of raw data.

When you work in a lab, it is not uncommon to end up with a data set of raw values (because your measuring device spat it out) that you now need to transform and organise before you can work with it.

Installation

A stable version of bioset is available on CRAN: https://cran.r-project.org/package=bioset

So all you need to do is:

install.packages("bioset")

You can find the latest additions and changes on GitHub. To spare the CRAN maintainers' time, package authors are asked not to submit updates too frequently.

Consequently, I will make new features available on GitHub first. Versions I have not yet submitted to CRAN will be labelled vX.Y.Z-pre.N and appear under: https://github.com/randomchars42/bioset/releases.

To install one of those versions you can use githubinstall:

# install.packages("githubinstall")
githubinstall::gh_install_packages("bioset", ref = "vX.Y.Z-pre.N")

You can install the very latest changes from the master branch on GitHub with:

# install.packages("devtools")
devtools::install_github("randomchars42/bioset")

Why? What bioset can do for you

bioset lets you:

import raw data organised in matrices, e.g. measured values of an 8 x 12 (96-well) bio-assay plate
calculate concentrations using samples with known concentrations (calibrators) in your dataset
calculate means and variability for duplicates / triplicates / ...
convert your concentrations to (more or less) arbitrary units of concentration

Data import

Suppose you have an ods / xls(x) file with raw values obtained from a measurement like this:

data <-
  utils::read.csv(
    system.file("extdata", "values.csv", package = "bioset"),
    header = FALSE)
rownames(data) <- LETTERS[1:4]

knitr::kable(
  data,
  row.names = TRUE,
  col.names = as.character(1:6))

Save them as set_1.csv. A csv file is like an ods / xls(x) file, but it's basically a text file with the values separated by commas. In current versions of LibreOffice / OpenOffice / Microsoft Office there's an option "Save as" > "csv".

Load the package.

library("bioset")

Then you can use set_read() to get all values with their position as name in a nice tibble:

set_read()

data <- bioset::set_read(
  file_name = "values.csv",
  path = system.file("extdata", package = "bioset")
)
knitr::kable(data)

set_read() automagically reads set_1.csv in your current directory. If you have more than one set, use set_read(num = 2) to read set 2, etc.

If your files are called plate_1.csv, plate_2.csv, ... (or run_1.csv, run_2.csv, ...), you can set file_name = "plate_#NUM#.csv" (or "run_#NUM#.csv", ...).

If your files are stored in ./files/ tell set_read() where to look via path = "./files/".
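To make the naming scheme concrete, here is a minimal sketch of how a "#NUM#" placeholder could expand into per-set file names. It only illustrates the convention in base R and is not bioset's internal code; the variable names are made up for this example.

```r
# Illustration only: expand a "#NUM#" placeholder into one file name per set.
file_name <- "plate_#NUM#.csv"
path <- "./files/"

set_files <- vapply(
  1:3,
  function(num) {
    # Replace the literal placeholder with the set number, then prepend the path.
    paste0(path, sub("#NUM#", as.character(num), file_name, fixed = TRUE))
  },
  character(1)
)

print(set_files)
```

With files laid out like this, reading the second set would presumably look like set_read(num = 2, file_name = "plate_#NUM#.csv", path = "./files/").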

Naming the values

Before feeding your samples into your measuring device you most likely drafted some sort of plan showing which position corresponds to which sample (didn't you?).

data <-
  utils::read.csv(
    system.file("extdata", "names.csv", package = "bioset"),
    header = FALSE)
rownames(data) <- LETTERS[1:4]

knitr::kable(
  data,
  row.names = TRUE,
  col.names = as.character(1:6))

So you had some calibrators (1-4) and samples A, B, C, D, E, F, G, H, each in duplicate.

To easily set the names for your samples just copy the names into your set_1.csv:

data <-
  utils::read.csv(
    system.file("extdata", "values_names.csv", package = "bioset"),
    header = FALSE)
rownames(data) <- LETTERS[1:8]

knitr::kable(
  data,
  row.names = TRUE,
  col.names = as.character(1:6))

Tell set_read() your data contains the names and which column should hold those names by setting additional_vars = c("name").

set_read(
  additional_vars = c("name")
)

This will get you:

data <- bioset::set_read(
  file_name = "values_names.csv",
  path = system.file("extdata", package = "bioset"),
  additional_vars = c("name")
)
knitr::kable(data)
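If you are curious what this kind of reshaping involves, the following base R sketch turns a small stacked matrix into one row per position. It assumes the names sit directly below the values, as in the example file above. The 2 x 3 plate, its numbers and the column names are invented for illustration, and this is not necessarily how set_read() works internally.

```r
# Hypothetical 2 x 3 plate: raw values in the upper half, sample names in the
# lower half (assumed layout; all numbers and names are invented).
raw <- data.frame(
  V1 = c("0.11", "0.52", "CAL1", "A"),
  V2 = c("0.12", "0.50", "CAL1", "A"),
  V3 = c("0.95", "0.71", "CAL2", "B"),
  stringsAsFactors = FALSE
)

n_rows <- nrow(raw) / 2
values <- as.matrix(raw[seq_len(n_rows), ])
sample_names <- as.matrix(raw[n_rows + seq_len(n_rows), ])

# Label each position by row letter and column number, e.g. "A1", "B3".
positions <- outer(LETTERS[seq_len(n_rows)], seq_len(ncol(raw)), paste0)

# as.vector() walks all three matrices column by column, so the entries line up.
tidy <- data.frame(
  position = as.vector(positions),
  name = as.vector(sample_names),
  value = as.numeric(as.vector(values)),
  stringsAsFactors = FALSE
)

print(tidy)
```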

Encoding additional properties

Suppose samples A, B, C, D were taken at day 1 and E, F, G, H were taken from the same rats / individuals / patients on day 2.

It would be more elegant to encode that into the data:

data <-
  utils::read.csv(
    system.file("extdata", "values_names_properties.csv", package = "bioset"),
    header = FALSE)
rownames(data) <- LETTERS[1:8]

knitr::kable(
  data,
  row.names = TRUE,
  col.names = as.character(1:6))

Now, tell set_read() your data contains the names and day by setting additional_vars = c("name", "day"). This will get you:

set_read(
  additional_vars = c("name", "day")
)

data <- bioset::set_read(
  file_name = "values_names_properties.csv",
  path = system.file("extdata", package = "bioset"),
  additional_vars = c("name", "day")
)

knitr::kable(data)

Calculating concentrations

Probably, your measuring device only gave you raw values (extinction rates / relative light units / ...). You know the concentrations of CAL1, CAL2, CAL3 and CAL4. Conveniently, the concentrations follow a linear relationship. To get the concentrations for the rest of the samples you need to interpolate between those calibrators.

set_calc_concentrations() does exactly this for you:

set_calc_concentrations(
  data,
  cal_names = c("CAL1", "CAL2", "CAL3", "CAL4"),
  cal_values = c(1, 2, 3, 4) # ng / ml
)

data <- bioset::set_calc_concentrations(
  data,
  cal_names = c("CAL1", "CAL2", "CAL3", "CAL4"),
  cal_values = c(1, 2, 3, 4) # ng / ml
)

knitr::kable(data)

Your calibrators are not so linear? Perhaps they are after a ln-ln transformation? Then you can use model_func = fit_lnln and interpolate_func = interpolate_lnln. Basically, you can use any function as model_func as long as it returns a model that is understood by your interpolate_func.
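The idea behind calibration can be sketched in a few lines of base R. This is a rough illustration under my own assumptions (raw value modelled as a function of concentration, then inverted), not a copy of what bioset's fitting and interpolation functions do. The calibrator readings and the helper names estimate_conc() and estimate_conc_ln() are hypothetical.

```r
# Hypothetical calibrators: known concentrations (ng / ml) and invented raw values.
cal <- data.frame(
  conc = c(1, 2, 3, 4),
  value = c(0.21, 0.39, 0.61, 0.79)
)

# Linear case: fit raw value as a straight line in concentration ...
model <- stats::lm(value ~ conc, data = cal)
intercept <- unname(stats::coef(model)[1])
slope <- unname(stats::coef(model)[2])

# ... and invert that line to estimate a concentration from a raw value.
estimate_conc <- function(value) (value - intercept) / slope

# ln-ln case: the same idea after log-transforming both axes.
model_ln <- stats::lm(log(value) ~ log(conc), data = cal)
intercept_ln <- unname(stats::coef(model_ln)[1])
slope_ln <- unname(stats::coef(model_ln)[2])
estimate_conc_ln <- function(value) exp((log(value) - intercept_ln) / slope_ln)

sample_values <- c(0.30, 0.50, 0.70)
print(estimate_conc(sample_values))
print(estimate_conc_ln(sample_values))
```

Which of the two fits is appropriate depends on how your calibrators behave, so it is worth looking at the calibration plot before trusting either.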

Duplicates / Triplicates / ...

So the samples were measured in duplicate. For your further research you might want to use the mean and perhaps exclude samples with too much spread in their values.

set_calc_variability() to the rescue.

data <- set_calc_variability(
  data = data,
  ids = sample_id,
  value,
  conc
)

This will give you the mean and coefficient of variation (as well as n of the samples and the standard deviation) for the columns value and conc. It will use sample_id to determine which rows belong to the same sample.

data <- bioset::set_calc_variability(
  data = data,
  ids = sample_id,
  value,
  conc
)

knitr::kable(data)
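For orientation, these statistics can be reproduced by hand in base R. The sketch below uses invented duplicate measurements and illustrative column names (n, mean, sd, cv) that are not necessarily the names bioset uses. It also assumes the usual sample standard deviation as returned by stats::sd().

```r
# Hypothetical duplicate measurements (values invented for illustration).
measurements <- data.frame(
  sample_id = c("A", "A", "B", "B"),
  conc = c(1.0, 1.2, 2.0, 2.0)
)

# Collapse the rows of each sample into n, mean, standard deviation and
# coefficient of variation (sd divided by mean).
per_sample <- do.call(rbind, lapply(
  split(measurements, measurements$sample_id),
  function(rows) {
    data.frame(
      sample_id = rows$sample_id[1],
      n = nrow(rows),
      mean = mean(rows$conc),
      sd = stats::sd(rows$conc),
      cv = stats::sd(rows$conc) / mean(rows$conc)
    )
  }
))

print(per_sample)
```

A cut-off on cv is then a simple way to flag samples whose replicates disagree too much.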

The short way

If you need to read and transform multiple sets, sets_read() can do that for you.

It takes basically the same arguments as set_read(), set_calc_concentrations() and set_calc_variability() and combines their functionality. The principal difference is that sets_read() takes sets, the number of sets to process.

It returns a list and, if write_data = TRUE, creates two files in your current directory with the processed data: data_all.csv and data_samples.csv.

sets_read()'s list holds the following items:

$all: here you will find all the data, including calibrators, duplicates, ... (saved in data_all.csv if write_data = TRUE)
$samples: only one row per distinct sample here - no calibrators, no duplicates -> most often you will work with this data (saved in data_samples.csv if write_data = TRUE)
$set1: a list
- $plot: a plot showing you the function used to calculate the concentrations for this set. The points represent the calibrators.
- $model: the model as returned by model_func
($set2 - $setN): the same information for every set you have

Take a look at the data

# now you may run it :)
result_list <- sets_read(
  sets = 1,
  sep = ",",
  additional_vars = c("name", "day"),
  cal_names = c("CAL1", "CAL2", "CAL3", "CAL4"),
  cal_values = c(1, 2, 3, 4) # ng / ml
)

result_list <- bioset::sets_read(
  sets = 1,
  sep = ",",
  path = system.file("extdata", package = "bioset"),
  additional_vars = c("name", "day"),
  cal_names = c("CAL1", "CAL2", "CAL3", "CAL4"),
  cal_values = c(1, 2, 3, 4), # ng / ml
  write_data = FALSE
)