In lrutter/bigPint: BIG multivariate data Plotted INTeractively

knitr::opts_chunk$set(echo=TRUE)

About data object

All functions in the bigPint package require an input parameter called data, which should be a data frame that contains the full dataset of interest. If a researcher is using the package to visualize RNA-seq data, then this data object should be a count table that contains the read counts for all genes of interest.

Example: two groups

The data object requires the same particular data frame format for all bigPint functions. There should be $n$ rows in the data frame, where $n$ is the number of genes. There should be $p + 1$ columns in the data frame, where $p$ is the number of samples. The first column contains the genes names and the rest of the columns should contain the read counts for all samples of interest. An example of this format is shown below:

library(bigPint)
data("soybean_ir_sub")
head(soybean_ir_sub)

We can also examine the structure of an example data object as follows:

str(soybean_ir_sub, strict.width = "wrap")

This example dataset contains 5,604 genes and six samples [@soybeanIR]. There are two treatment groups, N and P. Each treatment group contains three replicates.

Data object rules

As demonstrated above, the data object must meet the following conditions:

Be of type data.frame
Contain at least two treatment groups and at least two replicates per treatment group
Its first column must
- Be called "ID"
- Be of class character
- Contain the names of the genes (or a unique set of names in general)
Each of the rest of its columns must
- Contain the read counts for a given sample (or quantitative values in general)
- Be of class integer or numeric
- Be called in a three-part format (such as "A.3" or "S4.1") that matches the Perl expression ^[a-zA-Z0-9]+\\.[0-9]+, where
  - The first part indicates the treatment group name and must contain alphanumeric characters. Examples include "A", "AR", and "A9"
  - The second part consists of a dot "." to serve as a delimeter
  - The third part indicates the replicate number and must consist of numbers

It is important that the names of all columns except the first follow the three-part format delineated above. All functions in the bigPint package require this format to successfully produce plots. If your data object does not fit this format, bigPint will likely throw an informative error about why your format was not recognized.

Example: three groups

Note that the data object can contain more than two treatment groups. In this case, the bigPint software will automatically create plots for all pairs of treatment groups. An example of this type of dataset is provided in the bigPint package and can accessed as follows:

data(soybean_cn_sub)

data(soybean_cn_sub)
str(soybean_cn_sub, strict.width = "wrap")

This example dataset contains 7,332 genes and nine samples [@brown2015developmental]. There are three treatment groups, S1, S2, and S3. Each treatment group contains three replicates. In such cases where the data object contains more than two treatment groups, all functions in the bigPint package (except plotSMApp()) will automatically produce a plot for each pairwise combination of treatment groups.

For example, bigPint functions will produce plots for S1 versus S2, S1 versus S3, and S2 versus S3 in this case. The same could be accomplished (although less efficiently) by separating the dataset into three separate datasets and running a bigPint function of interest on each of them individually.

library(dplyr)
soybean_cn_sub_S1S2 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S2"))
soybean_cn_sub_S1S3 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S3"))
soybean_cn_sub_S2S3 <- soybean_cn_sub %>% select("ID", contains("S2"), contains("S3"))

head(soybean_cn_sub_S1S2, 3)

head(soybean_cn_sub_S1S3, 3)

head(soybean_cn_sub_S2S3, 3)

Preprocessing of data object

Some popular RNA-seq analysis packages (such as edgeR [@robinson2010edger], DESeq2 [@love2014moderated], and limma [@ritchie2015limma]) advise researchers to perform certain preprocessing steps to their data, such as filtering the genes, normalizing their read counts, and standardizing their read counts before visualization. Researchers can use datasets whether or not they have been filtered, normalized, and standardized for setting the data object in the bigPint package. If they wish, they can use bigPint plots to investigate how their dataset changes after filters, normalizations, and standardizations.