knitr::opts_chunk$set(echo = FALSE)
library(fakeDataWithError)
library(visualizationQualityControl)
library(cowplot)
library(dplyr)
library(errorTransformation)

poster_figure_loc <- "/home/rmflight/Documents/posters/dataTransformationsError/"

Outstanding Issues

Abstract

Background

In many omics analyses, a primary step in the data analysis is the application of a transformation to the data. Transformations are generally employed to convert proportional error (variance) to additive error, which most statistical methods appropriately handle. However, omics data frequently contain error sources that result in both additive and proportional errors. To our knowledge, there has not been a systematic study on detecting the presence of proportional error in omics data, or the effect of transformations on the error structure.

Results

In this work, we demonstrate a set of three simple graphs which facilitate the detection of proportional and mixed error in omics data when multiple replicates are available. The three graphs illustrate proportional and mixed error in a visually compelling manner that is both straight-forward to recognize and to communicate. The graphs plot the 1) absolute difference in the range, 2) standard deviation and 3) relative standard deviation against the mean signal across replicates. In addition to showing the presence of different types of error, these graphs readily demonstrate the effect of various transformations on the error structure as well.

Conclusions

We have developed a straight-forward set of graphs that effectively summarize major types of error present in high replicate datasets. Using these graphical summaries we find that the logand hyperbolic inverse sin transforms are the most effective method of the common methods employed for removing proportional error.

Keywords

error analysis, error visualization, omics error

Introduction

Various scalings and transformations are often applied to -omics data to modify specific properties of the data, making the data more amenable to various statistical procedures. One of the most common applications is to modify the error structure, going from multiplicative error that is commonly present in any data generated by counts at a detector (essentially all high-throughput -omics data) to additive error, which most statistical methods expect.

Previous work has examined the influence of scalings and transformations on both the structure of the data and the errors through the use of replicated nuclear magnetic resonance (NMR) [REF] and coupled gas chromatography mass spectrometry (GC-MS) experiments [REF]. Although insightful, these works did not provide direct control over the exact type and amount of error present in the data. Most of the discussion in previous work also concentrated on the shape of the spectra after scaling and transformation. However, in many other -omics data types outside of metabolomics where NMR and GC-MS are applied, there is no real spectra, but simply an aggregation of values measured on independent features, and therefore changes in shape of the overall spectra are largely irrelevant.

In the following, we demonstrate that the use of a simple summary of replicate experiments consisting of the mean, maximum - minimum, standard deviation (sd), and relative standard deviation (rsd) is useful as a visual indicator of the type and presence of errors in various -omics datasets. A simple plot of both the mean vs sd and mean vs rsd is able to determine whether the errors in an experiment are primarily additive (random, constant, baseline error), proportional (dependent on the signal value), or a mixture. From this visual summary, it is also possible to evaluate the impact of different scalings and transformations on the error structure in the data.

We evaluate the impact of the scalings and transformations through the use of synthetic data where specific amounts of additive and proportional error have been added.

Methods

All of the code necessary to regenerate this document can be found on Github at https://github.com/rmflight/error_transformation.

Data

set.seed(1234)
n_point <- 1000
n_rep <- 100
offset <- 4
simulated_data <- c(rlnorm(n_point / 2, meanlog = 1, sdlog = 1),
                    runif(n_point / 2, 5, 100))
simulated_data <- simulated_data + offset

Simulated data was generated by drawing points from two different distributions: 1) a log-normal distribution with a mean in log-space of 1 and a standard deviation of 1; and 2) a uniform distribution over the range of 5 - 100, with r n_point/2 points in each distribution. A value of r offset was added to the data purely to avoid having values less than 1 following the addition of error, making visualization easier. These data points are the pure initial data to which different types of error are added. From this set, r n_rep replicates are generated with either additive (add, 0.5), proportional (prop, 0.1) or mixed (both additive and proportional, mixed) error (equations 1-3).

$$ x = \mu + \sigma_{add} $$

$$ x = \mu + \mu \cdot \rho_{prop} $$

$$ x = \mu + \sigma_{add} + \mu \cdot \rho_{prop} $$

Additive Error

add_sd <- 0.5
add_error <- add_uniform_noise(n_rep, simulated_data, add_sd)

Additive error was added where the standard deviation was r add_sd. A plot of two replicates is shown in Figure X.

Proportional Error

prop_sd <- 0.1
prop_error <- add_prop_noise(n_rep, simulated_data, prop_sd)

Proportional error was added with a relative standard deviation of r prop_sd. A plot of two replicates is shown in Figure X.

Mixed Error

mix_error <- add_prop_uniform(n_rep, simulated_data, add_sd, prop_sd)

A mixture of additive and proportional errors was added using standard deviations and relative standard deviations of r add_sd and r prop_sd respectively.

Transformations

The transformations applied include autoscaling, pareto scaling, root (2nd and 5th), log, and hyperbolic inverse sin (arcsinh).

Results

Very often, high-throughput -omics replicates are visualized in a pairwise manner, by plotting the replicates directly against each other, or rotating them by 45 degrees as a Bland-Altman or MA plot (see Figure Xa and b). Overviews of many replicate / samples can be provided by summarizing each pair by the correlation or root mean squared error (RMSE), however neither of these provide a way to view how or if errors are related to the value of a feature across all replicates. The pairwise MA plot allows this for pairs, but examining a large number of pairwise plots is tedious.

One simple extension is for each feature, to plot the difference between the minimum and maximum value across replicates, vs the mean value, as shown in Figure Xc. A more robust metric with essentially the same information is to plot the standard deviation against the mean. This actually shows if there is any relationship between the error (standard deviation) and the mean, which can provide an indication of whether there is additive, or proportional errors present. Adding a plot of the relative standard deviation (RSD, SD / mean) vs the mean as well as sd vs mean makes it possible to discern the presence of additive, proportional, or mixed error ( see Figures X-Y for examples).

Question for myself and Hunter: What is the best way to organize these plots? In the poster we could easily divide the original and the error summary plots while keeping them on the same line to make the link explicit with the other plots shown later, but we don't really have that option in the paper. So, should the first two figures be figure of original spaces (for all three error models, so 3 rows and 2 columns), and then a figure of the error summaries (3 rows and 3 columns), or ONE ginormous figure with 3 rows and 5 columns of plots?

pair_frame <- data.frame(rep1 = add_error[,1], rep2 = add_error[,2])
add_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = add_error[,1] - add_error[,2],
                      A = rowMeans(add_error[, 1:2]))
add_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

diff_error <- apply(add_error, 1, function(x){
  max(x) - min(x)
})
range_frame <- data.frame(difference = diff_error,
                          mean = rowMeans(add_error))
add_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

add_error_summ <- summarize_data(t(add_error))
add_sd_plot <- filter(add_error_summ, type == "sd") %>% 
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd") + geom_hline(yintercept = add_sd, color = "red")
add_rsd_plot <- filter(add_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd")

Additive

add_rep_ma <- plot_grid(add_pair_plot, add_ma_plot, nrow = 1, labels = LETTERS[1:2])
add_summary_plot <- plot_grid(add_diff_plot, add_sd_plot, add_rsd_plot, nrow = 1,
          labels = LETTERS[3:5])
add_rep_ma
add_summary_plot

Proportional Error

pair_frame <- data.frame(rep1 = prop_error[,1], rep2 = prop_error[,2])
prop_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = prop_error[,1] - prop_error[,2],
                      A = rowMeans(prop_error[, 1:2]))
prop_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

diff_error <- apply(prop_error, 1, function(x){
  max(x) - min(x)
})
range_frame <- data.frame(difference = diff_error,
                          mean = rowMeans(prop_error))
prop_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

prop_error_summ <- summarize_data(t(prop_error))
prop_sd_plot <- filter(prop_error_summ, type == "sd") %>% 
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd")
prop_rsd_plot <- filter(prop_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd") + geom_hline(yintercept = prop_sd, color = "red")
prop_ma_plot <- plot_grid(prop_pair_plot, prop_ma_plot, nrow = 1,
          labels = LETTERS[1:2])
prop_summary_plot <- plot_grid(prop_diff_plot, prop_sd_plot, prop_rsd_plot, nrow = 1,
                               labels = LETTERS[3:5])
prop_ma_plot
prop_summary_plot

Mixed Error

pair_frame <- data.frame(rep1 = mix_error[,1], rep2 = mix_error[,2])
mix_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = mix_error[,1] - mix_error[,2],
                      A = rowMeans(mix_error[, 1:2]))
mix_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

diff_error <- apply(mix_error, 1, function(x){
  abs(max(x) - min(x))
})
range_frame <- data.frame(difference = diff_error,
                          mean = rowMeans(mix_error))
mix_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

mix_error_summ <- summarize_data(t(mix_error))
mix_sd_plot <- filter(mix_error_summ, type == "sd") %>% 
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd")
mix_rsd_plot <- filter(mix_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd")
mix_ma_plot <- plot_grid(mix_pair_plot, mix_ma_plot, nrow = 1,
          labels = LETTERS[1:2])
mix_summary_plot <- plot_grid(mix_diff_plot, mix_sd_plot, mix_rsd_plot, nrow = 1,
                              labels = LETTERS[3:5])
mix_ma_plot
mix_summary_plot

Auto-Scaling

Mixed

me_as <- auto_scale(mix_error)
me_as_diff <- plot_diff(me_as, log_mean = NULL, use_title = NULL)
me_as_sd <- plot_sd(me_as, log_mean = NULL, use_title = NULL)
me_as_rsd <- plot_sd(me_as, sd_type = "rsd", log_mean = NULL, use_title = NULL)
me_nt_summ <- summarize_data(t(mix_error))
max_me_nt <- filter(me_nt_summ, type == "rsd") %>% select(., var) %>% max()

me_as_summ <- summarize_data(t(me_as))
max_me_as <- filter(me_as_summ, type == "rsd") %>% select(., var) %>% max()
identical(max_me_nt, max_me_as)

auto_compare <- summarize_two(mix_error, me_as, sd_type = "diff", name1 = "mix", name2 = "autoscale(mix)")
#auto_compare$mean <- log(auto_compare$mean)

# set up the factors properly
auto_compare$data <- factor(auto_compare$data, levels = c("mix", "autoscale(mix)"), ordered = TRUE)
me_as_plot_compare <- ggplot(auto_compare, aes(x = mean, y = var, color = data)) + 
  geom_point(alpha = 0.5) + xlab("mean") + ylab("diff") + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

mix_auto_summary <- plot_grid(me_as_diff, me_as_sd, me_as_rsd, nrow = 1,
                              labels = LETTERS[1:3])
mix_auto_compare <- plot_grid(me_as_plot_compare, nrow = 1, labels = LETTERS[4])
mix_auto_summary
mix_auto_compare

Applying a variance or autoscaling to the data results in the removal of the additive error to a completely identical standard deviation of 1 across all samples at all abundances (Figure X - Y). This applies across all of the different error types. Although the standard deviation becomes identical and constant, the maximum relative variance is the same, and the distribution of the rsd values is identical.

Pareto Scaling

Mixed

me_ps <- pareto_scale(mix_error)
me_ps_diff <- plot_diff(me_ps, log_mean = NULL, use_title = NULL)
me_ps_sd <- plot_sd(me_ps, log_mean = NULL, use_title = NULL)
me_ps_rsd <- plot_sd(me_ps, sd_type = "rsd", log_mean = NULL, use_title = NULL)
me_ps_summ <- summarize_data(t(me_ps))
max_me_ps <- filter(me_ps_summ, type == "rsd") %>% select(., var) %>% max()
all.equal(max_me_nt, max_me_ps)

pareto_compare <- summarize_two(mix_error, me_ps, sd_type = "rsd", name1 = "mix", name2 = "pareto(mix)")

pareto_compare$data <- factor(pareto_compare$data, levels = c("mix", "pareto(mix)"), ordered = TRUE)
#pareto_compare$mean <- log(pareto_compare$mean)

In contrast to autoscaling, pareto scaling modifies the values by the square root of the variance. As shown in Figure 5, this changes the absolute values of the standard deviations, but does not change the distribution, or the relationship of the variance to the mean. In fact, the maximum of the RSD is virtually unchanged, it is only the range of the mean that changes.

me_ps_rsd_oomp <- ggplot(pareto_compare, aes(x = mean, y = var, color = data)) + geom_point(alpha = 0.5) + 
  xlab("mean") + ylab("rsd") + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))
pareto_summary_graph <- plot_grid(me_ps_diff, me_ps_sd, me_ps_rsd, 
                                  nrow = 1, labels = LETTERS[1:3])
pareto_compare_graph <- plot_grid(me_ps_rsd_oomp, nrow = 1, labels = LETTERS[4])

pareto_summary_graph
pareto_compare_graph

Root Transform

Mixed Error

root_mixed <- root_transform(mix_error)
root_mixed2 <- root_transform(mix_error, 5)

root_class <- rep(c("2", "5"), each = n_rep)

me_rt_comb <- cbind(root_mixed, root_mixed2)

me_rt_summary <- summarize_data(t(me_rt_comb), root_class)
me_rt_diff <- filter(me_rt_summary, type == "diff") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
me_rt_sd <- filter(me_rt_summary, type == "sd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
me_rt_rsd <- filter(me_rt_summary, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
compare_rt_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  me_rt_summary %>% filter(., type == "sd") %>%
    mutate(., data = "root(mix)")
)

hist_comp_rt_me_rs <- ggplot(compare_rt_me_rs, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_rt_me_rs <- ggplot(compare_rt_me_rs, aes(x = mean, y = var)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_rt_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  me_rt_summary %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(root(mix))")
)

hist_comp_rt_me_rr <- ggplot(compare_rt_me_rr, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5)
point_comp_rt_me_rr <- ggplot(compare_rt_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
mix_root_summary <- plot_grid(me_rt_diff, me_rt_sd, me_rt_rsd, nrow = 1,
                              labels = LETTERS[1:3])
mix_root_compare <- plot_grid(point_comp_rt_me_rr, 
                         nrow = 1, labels = LETTERS[4])
mix_root_summary
mix_root_compare

Although the root transform changes the overall amount of variance compared to the mean, as with pareto scaling, the overall relationship of variance to the mean is maintained, as all the plots show. The variance is still dependent on the mean value, and the proportional error is still present in the data.

Log Transformation

Mixed Error

log_mix <- log(mix_error)

me_lt_diff <- plot_diff(log_mix, log_mean = FALSE, use_title = NULL)
me_lt_sd <- plot_sd(log_mix, log_mean = FALSE, use_title = NULL)
me_lt_rsd <- plot_sd(log_mix, log_mean = FALSE, sd_type = "rsd", use_title = NULL)
compare_lt_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  summarize_data(t(log_mix)) %>% filter(., type == "sd") %>%
    mutate(., data = "sd(log(mix))")
)

compare_lt_me_rs$data <- factor(compare_lt_me_rs$data, levels = c("rsd(mix)", "sd(log(mix))"),
                                ordered = TRUE)

hist_comp_lt_me_rs <- ggplot(compare_lt_me_rs, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_lt_me_rs <- ggplot(compare_lt_me_rs, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_lt_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  summarize_data(t(log_mix)) %>% filter(., type == "rsd") %>%
    mutate(., data = "log(mix)")
)

compare_lt_me_rr$data <- factor(compare_lt_me_rr$data, levels = c("mix", "log(mix)"),
                                ordered = TRUE)

hist_comp_lt_me_rr <- ggplot(compare_lt_me_rr, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5)
point_comp_lt_me_rr <- ggplot(compare_lt_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
log_mix_summary <- plot_grid(me_lt_diff, me_lt_sd, me_lt_rsd, nrow = 1,
                             labels = LETTERS[1:3])
log_mix_compare <- plot_grid(point_comp_lt_me_rs, point_comp_lt_me_rr, 
                             nrow = 1, labels = LETTERS[4:5])
log_mix_summary
log_mix_compare

The log transform completely changes how the standard deviation depends on the mean. Most notably, the variance becomes constant at higher mean values, much like the standard deviation in the additive error case. As shown in Figure X, the standard deviation of the log transform actually looks almost exactly like the relative standard deviation of the un-transformed data. Although it also appears that in the log-transformed data the low values have more error than the high values, Figure X shows that overall the relative error has decreased for the data.

Arcsinh Transformation

asinh_mix <- arcsinh_transform(mix_error)

me_ah_diff <- plot_diff(asinh_mix, log_mean = FALSE, use_title = NULL)
me_ah_sd <- plot_sd(asinh_mix, log_mean = FALSE, use_title = NULL)
me_ah_rsd <- plot_sd(asinh_mix, log_mean = FALSE, sd_type = "rsd", use_title = NULL)
compare_ah_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  summarize_data(t(asinh_mix)) %>% filter(., type == "sd") %>%
    mutate(., data = "sd(asinh(mix))")
)

compare_ah_me_rs$data <- factor(compare_ah_me_rs$data, levels = c("rsd(mix)", "sd(asinh(mix))"),
                                ordered = TRUE)

hist_comp_ah_me_rs <- ggplot(compare_ah_me_rs, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_ah_me_rs <- ggplot(compare_ah_me_rs, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_ah_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  summarize_data(t(asinh_mix)) %>% filter(., type == "rsd") %>%
    mutate(., data = "asinh(mix)")
)

compare_ah_me_rr$data <- factor(compare_ah_me_rr$data, levels = c("mix", "asinh(mix)"),
                                ordered = TRUE)

hist_comp_ah_me_rr <- ggplot(compare_ah_me_rr, aes(x = var, fill = data)) + 
  geom_density(alpha = 0.5)
point_comp_ah_me_rr <- ggplot(compare_ah_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
mix_asinh_summary <- plot_grid(me_ah_diff, me_ah_sd, me_ah_rsd, nrow = 1, labels = LETTERS[1:3])
mix_asinh_compare <- plot_grid(point_comp_ah_me_rs, point_comp_ah_me_rr, nrow = 1, labels = LETTERS[4:5])

mix_asinh_summary
mix_asinh_compare

Like the log transform, the arcsinh transform completely changes how the standard deviation depends on the mean. Most notably, the variance becomes constant at higher mean values, much like the standard deviation in the additive error case. As shown in Figure 8D, the standard deviation of the arcsinh transform actually looks very much like the relative standard deviation of the un-transformed data. In contrast to the relative variance in the mixed error data, it is much decreased. In addition, the arcsinh transform is able to handle a wider range of values than the log transform.

Discussion

Conclusion

Session Information

ses_info <- devtools::session_info()
ses_info$platform
knitr::kable(ses_info$packages, format = "markdown")

Create Markdown

rmarkdown::render("vignettes/main_manuscript.Rmd", clean = FALSE,
                  output_dir = "outputs")
file.copy("vignettes/main_manuscript.utf8.md", "outputs/main_manuscript.utf8.md")


rmflight/error_transformation documentation built on May 27, 2019, 9:31 a.m.