```r
knitr::opts_chunk$set(echo = FALSE)
library(fakeDataWithError)
library(visualizationQualityControl)
library(cowplot)
library(ggplot2) # used for all plots below
library(dplyr)
library(errorTransformation)
poster_figure_loc <- "/home/rmflight/Documents/posters/dataTransformationsError/"
```
In many omics analyses, a primary step is the application of a transformation to the data. Transformations are generally employed to convert proportional error (variance) to additive error, which most statistical methods appropriately handle. However, omics data frequently contain error sources that result in both additive and proportional errors. To our knowledge, there has not been a systematic study on detecting the presence of proportional error in omics data, or on the effect of transformations on the error structure.
In this work, we demonstrate a set of three simple graphs that facilitate the detection of proportional and mixed error in omics data when multiple replicates are available. The three graphs illustrate proportional and mixed error in a visually compelling manner that is straightforward both to recognize and to communicate. The graphs plot 1) the absolute range (maximum - minimum), 2) the standard deviation, and 3) the relative standard deviation, each against the mean signal across replicates. In addition to showing the presence of different types of error, these graphs readily demonstrate the effect of various transformations on the error structure.
We have developed a straightforward set of graphs that effectively summarize the major types of error present in high-replicate datasets. Using these graphical summaries, we find that the log and hyperbolic inverse sine transforms are the most effective of the common methods employed for removing proportional error.
Keywords
error analysis, error visualization, omics error
Various scalings and transformations are often applied to -omics data to modify specific properties of the data, making it more amenable to various statistical procedures. One of the most common applications is to modify the error structure: converting the multiplicative error commonly present in data generated by counts at a detector (essentially all high-throughput -omics data) to the additive error that most statistical methods expect.
Previous work has examined the influence of scalings and transformations on both the structure of the data and the errors through the use of replicated nuclear magnetic resonance (NMR) [REF] and coupled gas chromatography mass spectrometry (GC-MS) experiments [REF]. Although insightful, these works did not provide direct control over the exact type and amount of error present in the data. Most of the discussion in previous work also concentrated on the shape of the spectra after scaling and transformation. However, in many -omics data types outside of metabolomics, where NMR and GC-MS are applied, there are no real spectra, but simply an aggregation of values measured on independent features, and therefore changes in the shape of the overall spectra are largely irrelevant.
In the following, we demonstrate that a simple summary of replicate experiments, consisting of the mean, maximum - minimum, standard deviation (sd), and relative standard deviation (rsd), is useful as a visual indicator of the type and presence of errors in various -omics datasets. Simple plots of mean vs sd and mean vs rsd can show whether the errors in an experiment are primarily additive (random, constant, baseline error), proportional (dependent on the signal value), or a mixture of the two. From this visual summary, it is also possible to evaluate the impact of different scalings and transformations on the error structure in the data.
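To make this summary concrete, a minimal sketch of the per-feature calculation is shown below; it assumes a numeric matrix with features in rows and replicates in columns, and the `replicate_summary` name is ours (the analyses that follow use the `summarize_data` helper from the packages loaded above).

```r
# Minimal sketch of the per-feature replicate summary (hypothetical helper;
# the analyses below use summarize_data() instead). Assumes `x` is a numeric
# matrix with features in rows and replicates in columns.
replicate_summary <- function(x) {
  feature_mean <- rowMeans(x)
  feature_sd <- apply(x, 1, sd)
  data.frame(
    mean = feature_mean,
    diff = apply(x, 1, function(f) max(f) - min(f)), # maximum - minimum
    sd   = feature_sd,
    rsd  = feature_sd / feature_mean                 # relative standard deviation
  )
}
# Plotting mean vs sd and mean vs rsd from this summary indicates the error
# type: flat sd = additive, flat rsd = proportional, neither flat = mixed.
```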
We evaluate the impact of the scalings and transformations through the use of synthetic data where specific amounts of additive and proportional error have been added.
All of the code necessary to regenerate this document can be found on Github at https://github.com/rmflight/error_transformation.
```r
set.seed(1234)
n_point <- 1000
n_rep <- 100
offset <- 4

simulated_data <- c(rlnorm(n_point / 2, meanlog = 1, sdlog = 1),
                    runif(n_point / 2, 5, 100))
simulated_data <- simulated_data + offset
```
Simulated data were generated by drawing points from two different distributions: 1) a log-normal distribution with a mean in log-space of 1 and a standard deviation in log-space of 1; and 2) a uniform distribution over the range of 5 - 100, with `r n_point / 2` points drawn from each distribution. A value of `r offset` was added to the data purely to avoid having values less than 1 following the addition of error, making visualization easier. These data points are the pure initial data to which different types of error are added. From this set, `r n_rep` replicates are generated with either additive (`add`, sd of 0.5), proportional (`prop`, rsd of 0.1), or mixed (both additive and proportional, `mixed`) error (Equations 1-3).
$$ x = \mu + \epsilon_{add} \qquad (1) $$

$$ x = \mu + \mu \cdot \epsilon_{prop} \qquad (2) $$

$$ x = \mu + \epsilon_{add} + \mu \cdot \epsilon_{prop} \qquad (3) $$

where $\epsilon_{add}$ is random noise with standard deviation $\sigma_{add}$ and $\epsilon_{prop}$ is random noise with standard deviation $\rho_{prop}$.
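As an illustration only, Equations 1-3 correspond to draws like the following sketch, which assumes normally distributed noise; the replicates analyzed below are actually generated with `add_uniform_noise`, `add_prop_noise`, and `add_prop_uniform` from the fakeDataWithError package, whose internals may differ.

```r
# Illustrative sketch of Equations 1-3 (assumes normal noise; the
# fakeDataWithError functions used below may draw noise differently).
one_replicate <- function(mu, sd_add = 0, rsd_prop = 0) {
  n <- length(mu)
  mu + rnorm(n, sd = sd_add) + mu * rnorm(n, sd = rsd_prop)
}
add_rep  <- one_replicate(simulated_data, sd_add = 0.5)                  # Eq. 1
prop_rep <- one_replicate(simulated_data, rsd_prop = 0.1)                # Eq. 2
mix_rep  <- one_replicate(simulated_data, sd_add = 0.5, rsd_prop = 0.1)  # Eq. 3
```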
```r
add_sd <- 0.5
add_error <- add_uniform_noise(n_rep, simulated_data, add_sd)
```
Additive error was added with a standard deviation of `r add_sd`. A plot of two replicates is shown in Figure X.
```r
prop_sd <- 0.1
prop_error <- add_prop_noise(n_rep, simulated_data, prop_sd)
```
Proportional error was added with a relative standard deviation of `r prop_sd`. A plot of two replicates is shown in Figure X.
```r
mix_error <- add_prop_uniform(n_rep, simulated_data, add_sd, prop_sd)
```
A mixture of additive and proportional errors was added, using a standard deviation of `r add_sd` and a relative standard deviation of `r prop_sd`.
The transformations applied include autoscaling, pareto scaling, root transforms (2nd and 5th), the log transform, and the hyperbolic inverse sine (arcsinh).
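For reference, the expected behavior of each transformation can be sketched as below; these definitions are assumptions for illustration, not the errorTransformation implementations (`auto_scale`, `pareto_scale`, `root_transform`, `arcsinh_transform`) used in the following sections.

```r
# Assumed behavior of the compared transformations (sketches, not the
# errorTransformation package code). `x` is a features x replicates matrix;
# dividing by a vector of length nrow(x) scales each feature (row).
auto_scale_sketch   <- function(x) x / apply(x, 1, sd)        # per-feature sd -> 1
pareto_scale_sketch <- function(x) x / sqrt(apply(x, 1, sd))  # divide by sqrt(sd)
root_sketch         <- function(x, n = 2) x^(1 / n)           # 2nd or 5th root
log_sketch          <- function(x) log(x)                     # natural log
arcsinh_sketch      <- function(x) asinh(x)                   # log-like, defined at <= 0
```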
Very often, high-throughput -omics replicates are visualized in a pairwise manner, either by plotting the replicates directly against each other or by rotating them 45 degrees as a Bland-Altman or MA plot (see Figure Xa and b). Overviews of many replicates / samples can be provided by summarizing each pair with the correlation or root mean squared error (RMSE); however, neither of these provides a way to see how, or whether, errors are related to the value of a feature across all replicates. The pairwise MA plot allows this for pairs, but examining a large number of pairwise plots is tedious.
One simple extension is, for each feature, to plot the difference between the minimum and maximum values across replicates vs the mean value, as shown in Figure Xc. A more robust metric with essentially the same information is the standard deviation plotted against the mean. This shows whether there is any relationship between the error (standard deviation) and the mean, which provides an indication of whether additive or proportional errors are present. Adding a plot of the relative standard deviation (rsd, sd / mean) vs the mean alongside the sd vs mean plot makes it possible to discern the presence of additive, proportional, or mixed error (see Figures X-Y for examples).
Question for myself and Hunter: What is the best way to organize these plots? In the poster we could easily divide the original and the error summary plots while keeping them on the same line to make the link explicit with the other plots shown later, but we don't really have that option in the paper. So, should the first two figures be figure of original spaces (for all three error models, so 3 rows and 2 columns), and then a figure of the error summaries (3 rows and 3 columns), or ONE ginormous figure with 3 rows and 5 columns of plots?
```r
# Pairwise replicate plot (A) and MA plot (B) for the additive error data.
pair_frame <- data.frame(rep1 = add_error[, 1], rep2 = add_error[, 2])
add_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = add_error[, 1] - add_error[, 2],
                       A = rowMeans(add_error[, 1:2]))
add_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

# Range (max - min) vs mean across all replicates (C).
diff_error <- apply(add_error, 1, function(x) { max(x) - min(x) })
range_frame <- data.frame(difference = diff_error, mean = rowMeans(add_error))
add_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

# sd (D) and rsd (E) vs mean; the red line marks the added sd.
add_error_summ <- summarize_data(t(add_error))
add_sd_plot <- filter(add_error_summ, type == "sd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd") +
  geom_hline(yintercept = add_sd, color = "red")
add_rsd_plot <- filter(add_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd")
```
```r
add_rep_ma <- plot_grid(add_pair_plot, add_ma_plot, nrow = 1, labels = LETTERS[1:2])
add_summary_plot <- plot_grid(add_diff_plot, add_sd_plot, add_rsd_plot, nrow = 1,
                              labels = LETTERS[3:5])
add_rep_ma
add_summary_plot
```
```r
# Pairwise replicate plot (A) and MA plot (B) for the proportional error data.
pair_frame <- data.frame(rep1 = prop_error[, 1], rep2 = prop_error[, 2])
prop_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = prop_error[, 1] - prop_error[, 2],
                       A = rowMeans(prop_error[, 1:2]))
prop_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

# Range (max - min) vs mean across all replicates (C).
diff_error <- apply(prop_error, 1, function(x) { max(x) - min(x) })
range_frame <- data.frame(difference = diff_error, mean = rowMeans(prop_error))
prop_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

# sd (D) and rsd (E) vs mean; the red line marks the added rsd.
prop_error_summ <- summarize_data(t(prop_error))
prop_sd_plot <- filter(prop_error_summ, type == "sd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd")
prop_rsd_plot <- filter(prop_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd") +
  geom_hline(yintercept = prop_sd, color = "red")
```
```r
prop_rep_ma <- plot_grid(prop_pair_plot, prop_ma_plot, nrow = 1, labels = LETTERS[1:2])
prop_summary_plot <- plot_grid(prop_diff_plot, prop_sd_plot, prop_rsd_plot, nrow = 1,
                               labels = LETTERS[3:5])
prop_rep_ma
prop_summary_plot
```
```r
# Pairwise replicate plot (A) and MA plot (B) for the mixed error data.
pair_frame <- data.frame(rep1 = mix_error[, 1], rep2 = mix_error[, 2])
mix_pair_plot <- ggplot(pair_frame, aes(x = rep1, y = rep2)) + geom_point()

ma_frame <- data.frame(M = mix_error[, 1] - mix_error[, 2],
                       A = rowMeans(mix_error[, 1:2]))
mix_ma_plot <- ggplot(ma_frame, aes(x = A, y = M)) + geom_point()

# Range (max - min) vs mean across all replicates (C).
diff_error <- apply(mix_error, 1, function(x) { max(x) - min(x) })
range_frame <- data.frame(difference = diff_error, mean = rowMeans(mix_error))
mix_diff_plot <- ggplot(range_frame, aes(x = mean, y = difference)) + geom_point()

# sd (D) and rsd (E) vs mean.
mix_error_summ <- summarize_data(t(mix_error))
mix_sd_plot <- filter(mix_error_summ, type == "sd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("sd")
mix_rsd_plot <- filter(mix_error_summ, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point() + ylab("rsd")
```
```r
mix_rep_ma <- plot_grid(mix_pair_plot, mix_ma_plot, nrow = 1, labels = LETTERS[1:2])
mix_summary_plot <- plot_grid(mix_diff_plot, mix_sd_plot, mix_rsd_plot, nrow = 1,
                              labels = LETTERS[3:5])
mix_rep_ma
mix_summary_plot
```
```r
me_as <- auto_scale(mix_error)
me_as_diff <- plot_diff(me_as, log_mean = NULL, use_title = NULL)
me_as_sd <- plot_sd(me_as, log_mean = NULL, use_title = NULL)
me_as_rsd <- plot_sd(me_as, sd_type = "rsd", log_mean = NULL, use_title = NULL)
```
```r
# Compare the maximum rsd before and after autoscaling.
me_nt_summ <- summarize_data(t(mix_error))
max_me_nt <- filter(me_nt_summ, type == "rsd") %>% select(., var) %>% max()

me_as_summ <- summarize_data(t(me_as))
max_me_as <- filter(me_as_summ, type == "rsd") %>% select(., var) %>% max()
all.equal(max_me_nt, max_me_as) # floating-point safe comparison

auto_compare <- summarize_two(mix_error, me_as, sd_type = "diff",
                              name1 = "mix", name2 = "autoscale(mix)")
#auto_compare$mean <- log(auto_compare$mean)

# set up the factors properly
auto_compare$data <- factor(auto_compare$data,
                            levels = c("mix", "autoscale(mix)"), ordered = TRUE)
```
```r
me_as_plot_compare <- ggplot(auto_compare, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + xlab("mean") + ylab("diff") +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

mix_auto_summary <- plot_grid(me_as_diff, me_as_sd, me_as_rsd, nrow = 1,
                              labels = LETTERS[1:3])
mix_auto_compare <- plot_grid(me_as_plot_compare, nrow = 1, labels = LETTERS[4])
mix_auto_summary
mix_auto_compare
```
Applying variance (auto) scaling to the data forces the standard deviation to an identical value of 1 for every feature, at all abundances (Figures X - Y), and this holds for all of the different error types. Although the standard deviation becomes identical and constant, the maximum relative standard deviation is unchanged, and the distribution of the rsd values is identical.
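The invariance of the rsd has a one-line explanation, under the assumption that the scaling divides each feature by its standard deviation without mean-centering (an assumption consistent with the behavior seen here): for $x' = x / s_x$,

$$ s_{x'} = 1, \qquad \mathrm{rsd}(x') = \frac{s_{x'}}{\bar{x}'} = \frac{1}{\bar{x} / s_x} = \frac{s_x}{\bar{x}} = \mathrm{rsd}(x) $$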
```r
me_ps <- pareto_scale(mix_error)
me_ps_diff <- plot_diff(me_ps, log_mean = NULL, use_title = NULL)
me_ps_sd <- plot_sd(me_ps, log_mean = NULL, use_title = NULL)
me_ps_rsd <- plot_sd(me_ps, sd_type = "rsd", log_mean = NULL, use_title = NULL)
```
```r
me_ps_summ <- summarize_data(t(me_ps))
max_me_ps <- filter(me_ps_summ, type == "rsd") %>% select(., var) %>% max()
all.equal(max_me_nt, max_me_ps)

pareto_compare <- summarize_two(mix_error, me_ps, sd_type = "rsd",
                                name1 = "mix", name2 = "pareto(mix)")
pareto_compare$data <- factor(pareto_compare$data,
                              levels = c("mix", "pareto(mix)"), ordered = TRUE)
#pareto_compare$mean <- log(pareto_compare$mean)
```
In contrast to autoscaling, pareto scaling divides each feature by the square root of its standard deviation. As shown in Figure 5, this changes the absolute values of the standard deviations, but it does not change their distribution or the relationship of the variance to the mean. In fact, the maximum of the rsd is virtually unchanged; only the range of the mean changes.
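Under the same no-centering assumption as above, the same algebra shows why the rsd survives pareto scaling: for $x' = x / \sqrt{s_x}$,

$$ s_{x'} = \sqrt{s_x}, \qquad \mathrm{rsd}(x') = \frac{\sqrt{s_x}}{\bar{x} / \sqrt{s_x}} = \frac{s_x}{\bar{x}} = \mathrm{rsd}(x) $$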
```r
me_ps_rsd_comp <- ggplot(pareto_compare, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) + xlab("mean") + ylab("rsd") +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

pareto_summary_graph <- plot_grid(me_ps_diff, me_ps_sd, me_ps_rsd, nrow = 1,
                                  labels = LETTERS[1:3])
pareto_compare_graph <- plot_grid(me_ps_rsd_comp, nrow = 1, labels = LETTERS[4])
pareto_summary_graph
pareto_compare_graph
```
```r
root_mixed <- root_transform(mix_error)      # 2nd (square) root
root_mixed2 <- root_transform(mix_error, 5)  # 5th root
root_class <- rep(c("2", "5"), each = n_rep)

me_rt_comb <- cbind(root_mixed, root_mixed2)
me_rt_summary <- summarize_data(t(me_rt_comb), root_class)

me_rt_diff <- filter(me_rt_summary, type == "diff") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
me_rt_sd <- filter(me_rt_summary, type == "sd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
me_rt_rsd <- filter(me_rt_summary, type == "rsd") %>%
  ggplot(., aes(x = mean, y = var)) + geom_point()
```
```r
compare_rt_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  me_rt_summary %>% filter(., type == "sd") %>% mutate(., data = "root(mix)")
)
hist_comp_rt_me_rs <- ggplot(compare_rt_me_rs, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_rt_me_rs <- ggplot(compare_rt_me_rs, aes(x = mean, y = var)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_rt_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  me_rt_summary %>% filter(., type == "rsd") %>% mutate(., data = "rsd(root(mix))")
)
hist_comp_rt_me_rr <- ggplot(compare_rt_me_rr, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5)
point_comp_rt_me_rr <- ggplot(compare_rt_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
```
```r
mix_root_summary <- plot_grid(me_rt_diff, me_rt_sd, me_rt_rsd, nrow = 1,
                              labels = LETTERS[1:3])
mix_root_compare <- plot_grid(point_comp_rt_me_rr, nrow = 1, labels = LETTERS[4])
mix_root_summary
mix_root_compare
```
Although the root transform changes the overall amount of variance relative to the mean, as with pareto scaling the overall relationship of the variance to the mean is maintained, as all of the plots show. The variance still depends on the mean value, and the proportional error is still present in the data.
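A first-order expansion makes this explicit: for a purely proportional perturbation $x = \mu (1 + \epsilon_{prop})$ with small $\epsilon_{prop}$,

$$ x^{1/n} = \mu^{1/n} \left(1 + \epsilon_{prop}\right)^{1/n} \approx \mu^{1/n} \left(1 + \frac{\epsilon_{prop}}{n}\right) $$

so the standard deviation after an $n$-th root is approximately $\mu^{1/n} \rho_{prop} / n$: reduced, but still increasing with the mean.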
```r
log_mix <- log(mix_error)
me_lt_diff <- plot_diff(log_mix, log_mean = FALSE, use_title = NULL)
me_lt_sd <- plot_sd(log_mix, log_mean = FALSE, use_title = NULL)
me_lt_rsd <- plot_sd(log_mix, log_mean = FALSE, sd_type = "rsd", use_title = NULL)
```
```r
compare_lt_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  summarize_data(t(log_mix)) %>% filter(., type == "sd") %>%
    mutate(., data = "sd(log(mix))")
)
compare_lt_me_rs$data <- factor(compare_lt_me_rs$data,
                                levels = c("rsd(mix)", "sd(log(mix))"), ordered = TRUE)
hist_comp_lt_me_rs <- ggplot(compare_lt_me_rs, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_lt_me_rs <- ggplot(compare_lt_me_rs, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_lt_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  summarize_data(t(log_mix)) %>% filter(., type == "rsd") %>%
    mutate(., data = "log(mix)")
)
compare_lt_me_rr$data <- factor(compare_lt_me_rr$data,
                                levels = c("mix", "log(mix)"), ordered = TRUE)
hist_comp_lt_me_rr <- ggplot(compare_lt_me_rr, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5)
point_comp_lt_me_rr <- ggplot(compare_lt_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
```
```r
log_mix_summary <- plot_grid(me_lt_diff, me_lt_sd, me_lt_rsd, nrow = 1,
                             labels = LETTERS[1:3])
log_mix_compare <- plot_grid(point_comp_lt_me_rs, point_comp_lt_me_rr, nrow = 1,
                             labels = LETTERS[4:5])
log_mix_summary
log_mix_compare
```
The log transform completely changes how the standard deviation depends on the mean. Most notably, the standard deviation becomes constant at higher mean values, much like the standard deviation in the additive error case. As shown in Figure X, the standard deviation of the log-transformed data looks almost exactly like the relative standard deviation of the untransformed data. Although the low values in the log-transformed data appear to have more error than the high values, Figure X shows that the overall relative error of the data has decreased.
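This resemblance is expected from a first-order expansion of the log of proportionally perturbed data:

$$ \log\left(\mu (1 + \epsilon_{prop})\right) = \log(\mu) + \log(1 + \epsilon_{prop}) \approx \log(\mu) + \epsilon_{prop} $$

so the standard deviation on the log scale is approximately $\rho_{prop}$, the rsd of the untransformed data, independent of $\mu$. With mixed error, the additive component contributes roughly $\sigma_{add} / \mu$ on the log scale, which is why the error is inflated at low mean values.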
```r
asinh_mix <- arcsinh_transform(mix_error)
me_ah_diff <- plot_diff(asinh_mix, log_mean = FALSE, use_title = NULL)
me_ah_sd <- plot_sd(asinh_mix, log_mean = FALSE, use_title = NULL)
me_ah_rsd <- plot_sd(asinh_mix, log_mean = FALSE, sd_type = "rsd", use_title = NULL)
```
```r
compare_ah_me_rs <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "rsd(mix)") %>% transform(., mean = log(mean)),
  summarize_data(t(asinh_mix)) %>% filter(., type == "sd") %>%
    mutate(., data = "sd(asinh(mix))")
)
compare_ah_me_rs$data <- factor(compare_ah_me_rs$data,
                                levels = c("rsd(mix)", "sd(asinh(mix))"), ordered = TRUE)
hist_comp_ah_me_rs <- ggplot(compare_ah_me_rs, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5) + theme(legend.position = "none")
point_comp_ah_me_rs <- ggplot(compare_ah_me_rs, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0))

compare_ah_me_rr <- rbind(
  summarize_data(t(mix_error)) %>% filter(., type == "rsd") %>%
    mutate(., data = "mix") %>% transform(., mean = log(mean)),
  summarize_data(t(asinh_mix)) %>% filter(., type == "rsd") %>%
    mutate(., data = "asinh(mix)")
)
compare_ah_me_rr$data <- factor(compare_ah_me_rr$data,
                                levels = c("mix", "asinh(mix)"), ordered = TRUE)
hist_comp_ah_me_rr <- ggplot(compare_ah_me_rr, aes(x = var, fill = data)) +
  geom_density(alpha = 0.5)
point_comp_ah_me_rr <- ggplot(compare_ah_me_rr, aes(x = mean, y = var, color = data)) +
  geom_point(alpha = 0.5) +
  theme(legend.position = c(0.5, 0.8), legend.title = element_text(size = 0)) +
  ylab("rsd")
```
```r
mix_asinh_summary <- plot_grid(me_ah_diff, me_ah_sd, me_ah_rsd, nrow = 1,
                               labels = LETTERS[1:3])
mix_asinh_compare <- plot_grid(point_comp_ah_me_rs, point_comp_ah_me_rr, nrow = 1,
                               labels = LETTERS[4:5])
mix_asinh_summary
mix_asinh_compare
```
Like the log transform, the arcsinh transform completely changes how the standard deviation depends on the mean. Most notably, the standard deviation becomes constant at higher mean values, much like the standard deviation in the additive error case. As shown in Figure 8D, the standard deviation of the arcsinh-transformed data looks very much like the relative standard deviation of the untransformed data, and compared to the relative error of the untransformed mixed-error data it is much decreased. In addition, the arcsinh transform can handle a wider range of values than the log transform, as it is defined for zero and negative values.
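The similarity to the log transform follows from the closed form of the arcsinh,

$$ \operatorname{asinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right) \approx \log(2x) = \log(2) + \log(x) \quad \text{for } x \gg 1 $$

so for large values it behaves like a shifted log transform, while remaining defined at zero and for negative values.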
```r
ses_info <- devtools::session_info()
ses_info$platform
knitr::kable(ses_info$packages, format = "markdown")
```
```r
rmarkdown::render("vignettes/main_manuscript.Rmd", clean = FALSE, output_dir = "outputs")
file.copy("vignettes/main_manuscript.utf8.md", "outputs/main_manuscript.utf8.md")
```