knitr::opts_chunk$set( # collapse = FALSE, # comment = "#>" results = "hold" )
The mosaic.find package provides a function (mosaic_find()) designed to find rhythmic and non-rhythmic trends in multi-omics time course data using model selection and joint modeling, a method called MOSAIC (Multi-Omics Selection with Amplitude Independent Criteria). To read more about our work on this project and cite us, see MOSAIC: A Joint Modeling Methodology for Combined Circadian and Non-Circadian Analysis of Multi-Omics Data by H. De los Santos et al. (2020)
Further, for users who prefer an interface more than coding, as well as built-in visualizations, our GitHub repository can be found here. There, you can find a Shiny application for finding trends and automatically visualizing results, with features such as summary visualizations, heat maps, gene expression plots (with or without replicates visualized), and parameter density graphs. It will also have the most up-to-date features. A FAQ for possible user errors can also be found there.
In this vignette, we'll walk through an example of how to use mosaic.find, and how to choose from the several different built-in methods of preprocessing.
We'll start by loading our library, which contains the mosaic_find() function. It also has two example datsets: expressions_rna, which contains synthetic RNA expression data; and expressions_pro, which contains synthetic protein expression data. We'll be using both throughout this vignette. Here we'll look at the first few rows and columns of our dataset.
library(mosaic.find) head(expressions_rna[,1:5]) head(expressions_pro[,1:5])
Note the data format: its first column first column has gene labels/names, and all other columns have numerical expression data for RNA and protein, respectively. This expression data is ordered by time point then by replicate, and has evenly spaced time points. Any missing data has cells left blank (NA). Also note that the sample names/labels are the same between our RNA and protein datasets, to indicate corresponding expression data. Any expressions with a name not found in the other dataset will not be reflected in the results. In order to use the mosaic_find() function, data must be in this format.
Now, let's look at one the expressions, Sample 2. Here we plot RNA in blue and protein in red. each of the replicates in a different color, then plot the difference between them in gray.
library(ggplot2) # making a function to plot expression, since we're going to use it again # samp: sample name that we want to visualize # expressions_rna, expressions_pro: our original RNA and protein expression data frames # results: data frame of mosaic results plot_exp <- function(samp, expressions_rna, expressions_pro, results = data.frame()){ # color information rna_blue_high <- "#268beb" pro_red_high <- "#d61e1e" rna_blue_low <- "#9ebfde" pro_red_low <- "#d49d9d" # making a function to create automatic palettes for rna and protein vizualizations rna_col_func <- colorRampPalette(c(rna_blue_low, rna_blue_high)) pro_col_func <- colorRampPalette(c(pro_red_low, pro_red_high)) tp <- seq(2,48,by=2) # our time points num_reps <- 3 # number of replicates # set to plot either the mosaic results or the original data if (nrow(results) == 0){ # we're plotting the original data dat_rna <- expressions_rna[,-1] dat_pro <- expressions_pro[,-1] # logicals for the position of our desired sample in each dataset gene_log_rna <- expressions_rna$Sample_Name == samp gene_log_pro <- expressions_pro$Sample_Name == samp } else { # last number that is a parameter in the results dataset end_num <- 33 # we're plotting the processed original data and the fitted data dat_rna <- results[,(end_num+1):((end_num+1)+((length(tp)*num_reps)-1))] fit_rna <- results[,(end_num+(length(tp)*num_reps)+1):((end_num+(length(tp)*num_reps)+1)+(length(tp)-1))] dat_pro <- results[,(end_num+(length(tp)*num_reps)+1+length(tp)):(end_num+(length(tp)*num_reps*2)+1+length(tp)-1)] fit_pro <- results[,(end_num+(length(tp)*num_reps*2)+1+length(tp)):ncol(results)] # logicals for the position of our desired sample gene_log_rna <- gene_log_pro <- results$Gene_Name == samp } # our visualization data frame gg.df <- data.frame( # maximum over time points "rna_max" = sapply(seq(1,ncol(dat_rna), by = num_reps), function(x) max(dat_rna[gene_log_rna,x:(x+num_reps-1)], na.rm = T)), "pro_max" = sapply(seq(1,ncol(dat_pro), by = num_reps), function(x) max(dat_pro[gene_log_pro,x:(x+num_reps-1)], na.rm = T)), # minimum over time points "rna_min" = sapply(seq(1,ncol(dat_rna), by = num_reps), function(x) min(dat_rna[gene_log_rna,x:(x+num_reps-1)], na.rm = T)), "pro_min" = sapply(seq(1,ncol(dat_pro), by = num_reps), function(x) min(dat_pro[gene_log_pro,x:(x+num_reps-1)], na.rm = T)), "timen" = tp # time points ) # colors for lines and shading col_vect <- c( "Original RNA" = rna_blue_low, "Original Protein" = pro_red_low ) # only plot individual replicates if we're only plotting the original data if (nrow(results) == 0){ # add replicate colors # add two to provide more variability rna_rep_col <- rna_col_func(num_reps+2) pro_rep_col <- pro_col_func(num_reps+2) # add replicates to df and color scale for (i in 1:num_reps){ gg.df[,paste0("rna_rep",i)] <- as.numeric(dat_rna[gene_log_rna,seq(i, ncol(dat_rna), by=num_reps)]) gg.df[,paste0("pro_rep",i)] <- as.numeric(dat_pro[gene_log_pro,seq(i, ncol(dat_pro), by=num_reps)]) col_vect[paste0("RNA, Replicate ",i)] <- rna_rep_col[i+1] col_vect[paste0("Protein, Replicate ",i)] <- pro_rep_col[i+1] } } else { # otherwise, we plot the model fits gg.df$rna_fit <- as.numeric(fit_rna[gene_log_rna,]) gg.df$pro_fit <- as.numeric(fit_pro[gene_log_pro,]) # add the colors col_vect["Fit RNA"] <- rna_blue_high col_vect["Fit Protein"] <- pro_red_high } # name breaks for the colors in the legend col_name_breaks <- c(names(col_vect)[grepl("RNA", names(col_vect), fixed = T)], names(col_vect)[grepl("Protein", names(col_vect), fixed = T)]) # visualize, with shading for each omics type p <- ggplot(gg.df, aes(x = timen))+ # setting up scales and labels scale_fill_manual("", values = col_vect, breaks = col_name_breaks)+ scale_color_manual("", values = col_vect, breaks = col_name_breaks)+ ylab("Expression")+ xlab("Time (Hours)")+ theme_bw()+ theme( plot.title = element_text(hjust = .5), legend.position = "bottom", legend.direction = "horizontal" )+ scale_x_continuous(expand = c(0, 0))+ # add shading for original rna and protein data geom_ribbon(aes(ymax = rna_max, ymin = rna_min, fill = "Original RNA"), alpha = .5)+ geom_ribbon(aes(ymax = pro_max, ymin = pro_min, fill = "Original Protein"), alpha = .5)+ ggtitle(samp)+ # title guides( # colors color = guide_legend(nrow = 2, byrow = T), fill = guide_legend(nrow = 2) )+ NULL # only plot individual replicates if we're only plotting the original data if (nrow(results) == 0){ # add each of the replicates for (i in 1:num_reps){ p <- p + geom_line(aes_string(y = paste0("rna_rep",i), color = shQuote(paste0("RNA, Replicate ",i))))+ geom_line(aes_string(y = paste0("pro_rep",i), color = shQuote(paste0("Protein, Replicate ",i))))+ NULL } } else { # otherwise, plot the fits p <- p + geom_line(aes(y = rna_fit, color = "Fit RNA"), size = 1)+ geom_line(aes(y = pro_fit, color = "Fit Protein"), size = 1)+ NULL } return(p) } suppressWarnings(plot_exp("Sample_2", expressions_rna, expressions_pro)) # to ignore warnings for missing values
We can see that this sample manifests very differently depending on the omics type! In RNA, Sample 2 seems to be a rhythmic gene with damping, while in protein, Sample 2 is clearly linear. This difference in expression makes mosaic_find() the perfect function for our dataset.
So we begin by assigning our parameters and running the mosaic_find() function. In this first run, in addition to looking for trends, we want to look for rhythms between 20 and 28 hours, with no preprocessing, assigning these results to a new dataframe.
# run mosaic_find() results <- mosaic_find( rna = expressions_rna, # rna data pro = expressions_pro, # protein data begin = 2, # first time point (hours) end = 48, # last time point (hours) resol = 2, # resolution (hours) num_reps = 3, # number of replicates paired = F, # unpaired, or unrelated, replicates low = 20, # lower limit when looking for rhythms (hours) high = 28 # lower limit when looking for rhythms (hours) ) # look at the results! head(results[,1:16], 5)
Looks like we've sucessfully run MOSAIC! Now we can see that the results data frame has information about which models fit each of the expressions in our omics types, the parameters for those models, and p-values. You'll also notice that it reordered our sample names. For more information on the models and parameters available, please see MOSAIC: A Joint Modeling Methodology for Combined Circadian and Non-Circadian Analysis of Multi-Omics Data.
So, first, let's take a look at the distributions of our models in each of our omics types!
# making a function, since we'll be using this again later plot_dist <- function(results){ # first, we only want to consider significant expressions sig_rna <- results$BH_Adj_P_Value_RNA < .05 sig_pro <- results$BH_Adj_P_Value_Protein < .05 # then get the significant models for each omics type mod_rna <- results$Best_Model_RNA[sig_rna] mod_pro <- results$Best_Model_Protein[sig_pro] # form into the data frame for plotting gg.df <- data.frame( "Model" = c(mod_rna, mod_pro), "Omics_Type" = c(rep("RNA", length(mod_rna)), rep("Protein",length(mod_pro))) ) # colors col_proper <- c("Linear" = "#b32515", # red "Exponential" = "#e39f22", # orange "ECHO" = "#3461eb", # blue "ECHO Joint" = "#5c7de0", # light blue "ECHO Linear" = "#7b2bcc", # purple "ECHO Linear Joint" = "#945fc9" # light purple ) # make a bar chart of distributions for each model type ggplot(gg.df, aes(Omics_Type, fill = Model))+ geom_bar()+ theme_bw()+ ylab("Total Significant Expressions")+ xlab("Omics Type")+ theme_bw()+ theme( plot.title = element_text(hjust = .5), legend.position = "bottom", legend.direction = "horizontal" )+ ggtitle("MOSAIC Model Distributions")+ # title guides( # colors fill = guide_legend(nrow = 2) )+ scale_y_continuous(expand = expand_scale(mult = c(0, .1)))+ scale_fill_manual("Model", values = col_proper)+ NULL } # plot our model distributions! plot_dist(results)
So we can see that each of our omics types has roughly similar distributions, with mostly non-oscillatory models, Linear and Protein, making up most of the distribution, but with roughly 200 of the genes remaining oscillatory.
Let's next look at how the fit turned out for our initial sample. Here, we add the fitted values to our plot and print the parameters to the console.
# plot our fitted data! suppressWarnings(plot_exp("Sample_2", expressions_rna, expressions_pro, results)) # to ignore warnings for missing values # print our parameters for our sample results[results$Gene_Name == "Sample_2",]
We can see that MOSAIC did a great job with these fits, with the fits matching closely to the trend. Looking at the subset of results corresponding to Sample 2, we can see that MOSAIC correctly designated our RNA data as an ECHO model with damping, and our protein data as a linear model with a positive slope. These fits match pretty closely to the trend, which are emphasized by their very low BH adjusted p-values.
Now let's see how preprocessing affects the results. Here, we're going to search for all possible periods, as well as allowing for all our preprocessing options: removing unexpressed genes, normalizing, and smoothing. These preprocessing methods are described in our paper.
# run mosaic_find() with preprocessing results <- mosaic_find( rna = expressions_rna, # rna data pro = expressions_pro, # protein data begin = 2, # first time point (hours) end = 48, # last time point (hours) resol = 2, # resolution (hours) num_reps = 3, # number of replicates paired = F, # unpaired, or unrelated, replicates run_all_per = T, # now we're looking for any period within our time course rem_unexpr = T, # remove any unexpressed genes (we don't have any, but some time courses might) rem_unexpr_amt = 70, # if we have less than 70% expression, we want to remove the gene rem_unexpr_amt_below = 0, # unexpressed genes are those with expression equal to 0 is_normal = T, # normalize data is_smooth = T # smooth the data ) # look at the results! head(results[,1:16], 5)
Let's look at how this preprocessing affected our distribution and original sample!
# plot our model distributions! plot_dist(results) # plot our fitted data! suppressWarnings(plot_exp("Sample_2", expressions_rna, expressions_pro, results)) # to ignore warnings for missing values # print our parameters for our sample results[results$Gene_Name == "Sample_2",]
We can see that the distributions are a little more even now between our omics types, with much of the same effects between them. We can also see the effects of our preprocessing -- normalization put our expressions on the same scale, and smoothing has reduced the space between our replicates. Though the fit hasn't changed very much, it's important to note that smoothing has also reduced the amount of damping in the system: the amplitude change coefficient has decreased by about .02. More discussions on the effects of preprocessing can be found in the supplemental information of ECHO: an application for detection and analysis of oscillators identifies metabolic regulation on genome-wide circadian output.
Other selections are also available, including adjustments for the harmonic and overexpressed cutoffs for ECHO and ECHO linear modelscan be adjusted through the harm_cut and over_cut parameters, respectively, though the defaults are recommended.
Now that you understand the basics of using the mosaic_find() function, feel free to experiment and play around with this vignette and the example data. Good luck with your multi-omics trend searches!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.