rDolphin is an R package that performs automatic profiling of 1D 1H NMR spectral datasets and outputs several indicators of quantification and identification quality for each signal area quantification. To perform a reliable automatic profiling, resilient to the multiple factors that can introduce variability into a dataset, rDolphin splits the spectrum into Regions of Interest (ROIs). Each ROI has a specific quantification method (integration or deconvolution; with or without baseline fitting), a list of signals to quantify and a list of signal parameters to monitor (e.g. the chemical shift variability).

In addition, rDolphin comes with tools to load individual quantifications so they can be evaluated and updated, and to perform different kinds of exploratory analyses. For researchers with limited programming experience, the package incorporates a Shiny GUI.

Package Installation

Please run these commands in the R console (preferably through RStudio) to install the package:

  1. source("https://bioconductor.org/biocLite.R"); biocLite("impute"); biocLite("MassSpecWavelet") #installation of packages that cannot be installed from Github
  2. install.packages("devtools") #installation of package necessary to install rDolphin from Github
  3. devtools::install_github("danielcanueto/rDolphin") #installs rDolphin
  4. library(rDolphin) #loads rDolphin
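If the biocLite() installer is not available in your R version (Bioconductor replaced it with BiocManager from R 3.5 onwards), the two Bioconductor dependencies can be installed with this equivalent sketch:

install.packages("BiocManager") # Bioconductor installer for R >= 3.5
BiocManager::install(c("impute", "MassSpecWavelet")) # same Bioconductor dependencies as above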

At some point, you may be asked to install Rtools.

The installation may also stop at some point because of the multiple dependencies, from different repositories, required to install rDolphin.

These issues tend to be solved by running the previous commands again. If not, please find the package whose installation failed (in this example, MassSpecWavelet) and install it before running the commands again.

If you have problems during installation, please contact me at daniel.canueto@urv.cat

Tutorial Structure

This introduction shows how to perform, in the R console, the profiling of a subset of 30 spectra from MTBLS242, a MetaboLights dataset of blood spectra. The 30 spectra belong to 15 patients, with blood samples taken at two different times.

A link to an online document showing how to perform similar steps through the Shiny GUI is available in the README file of the GitHub repository: https://github.com/danielcanueto/rDolphin.

  1. Download and unzip the contents of this Dropbox link.
  2. Take a look at the three folders inside the unzipped folder. They correspond to three different MetaboLights experiments:
     - MTBLS1: human urine.
     - MTBLS237: faecal extract.
     - MTBLS374: blood.

Each one has an input folder with the following components.

The package itself also bundles example input data in its extdata folder, whose location can be found with:

file.path(system.file(package = "rDolphin"),"extdata")

You will find the following files in that folder:
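If you prefer to inspect these files programmatically, a quick sketch using base R:

list.files(file.path(system.file(package = "rDolphin"), "extdata")) # list the example files bundled with the package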

A thorough description of the structure of every input file required by the tool is available here: https://docs.google.com/document/d/1t--gF5mCBNhbGvn53vKth2nlTzucLx55EfUWIgrMh_o/edit?usp=sharing

Run these commands to set the extdata folder as the working directory and to import the data contained there:

setwd(file.path(system.file(package = "rDolphin"),"extdata"))
imported_data=import_data("Parameters_MTBLS242_15spectra_5groups.csv")

An imported_data list is generated with several variables. Feel free to investigate the information present in each one of them. Some important ones are dataset, ppm and ROI_data.
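For a quick overview of the list, a minimal sketch using base R (assuming dataset stores one spectrum per row, with one column per ppm data point):

str(imported_data, max.level = 1) # names and types of every element of the list
dim(imported_data$dataset) # number of spectra (rows) and of data points (columns)
range(imported_data$ppm) # ppm range covered by the spectra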

Exploratory analysis of the spectra dataset

rDolphin eases the analysis of dataset complexity through two kinds of interactive Plotly figures:

exemplars_plot(imported_data)

median_plot(imported_data)

In each kind of figure, a red trace appears below the spectra. This red trace shows the results of a univariate analysis for every bin and helps the user detect interesting regions to profile. Feel free to play with the interactivity of the Plotly figures.
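If you want to keep one of these figures outside the R session, a sketch assuming the returned object is a standard htmlwidget (as Plotly figures are):

median_figure = median_plot(imported_data)
htmlwidgets::saveWidget(median_figure, "median_plot.html") # save the interactive figure as a standalone HTML file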

If you do not know how to annotate a signal in the dataset, you can evaluate possible options in the studied biofluid (blood), ranked by probability, thanks to the signal repository included with the package. For example, these commands:

ind=intersect(which(imported_data$repository$Shift..ppm.<3.37),which(imported_data$repository$Shift..ppm.>3.35))
View(imported_data$repository[ind,])

help to see which signals in the blood matrix appear in the 3.35-3.37 ppm region. A methanol singlet stands out in the results; accordingly, a high-intensity singlet appears in our dataset. Filtering by biofluid avoids the wrong annotations that are typical when using general databases.
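The same lookup can be wrapped in a small helper for any ppm window (lookup_signals is a hypothetical name, not a function of the package):

# hypothetical helper: repository signals whose chemical shift falls inside a given ppm window
lookup_signals = function(repository, ppm_min, ppm_max) {
  repository[repository$Shift..ppm. > ppm_min & repository$Shift..ppm. < ppm_max, ]
}
View(lookup_signals(imported_data$repository, 3.35, 3.37))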

In addition, the user can visualize the results of STOCSY or RANSY in a region of the dataset. These are the results obtained for a glutamine signal at 2.14-2.12 ppm:

identification_tool(imported_data$dataset,imported_data$ppm,c(2.14,2.12),method='spearman')

The other important glutamine signal at 2.45-2.4 ppm stands out.
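The same tool can be run on any other region; for example, on the methanol region explored above (same call, only the ppm window changes):

identification_tool(imported_data$dataset, imported_data$ppm, c(3.37, 3.35), method = 'spearman')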

Evaluation of ROI information

Looking at the performance of profiling on a model spectrum can help to improve the parameters in some ROIs and to check additional regions of the spectrum with the potential to give interesting insights into a study.

Run the following command:

profiling_model=profile_model_spectrum(imported_data,imported_data$ROI_data)

profiling_model has three elements:

Feel free to modify cells of the ROI information contained in imported_data$ROI_data, or to add or remove ROIs, and then repeat the model spectrum profiling process; a sketch of such an edit is shown below.
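As an illustration only (the row index and ppm values below are arbitrary), the first two columns of ROI_data hold the ROI ppm limits, as used later in this tutorial for the TSP ROI:

imported_data$ROI_data[2, 1:2] = c(2.16, 2.10) # illustrative edit of the ppm limits of the second ROI
profiling_model = profile_model_spectrum(imported_data, imported_data$ROI_data) # repeat the model spectrum profiling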

Automatic profiling of spectra

When you are satisfied with the ROI data contained in imported_data$ROI_data, you can perform an automatic profiling of the spectra through this command:

profiling_data=automatic_profiling(imported_data,imported_data$ROI_data)

If you do not want to wait for the profiling of the whole dataset to finish, you can stop it and load this .RData file, where the data from an already performed profiling session is saved:

load(file.path(system.file(package = "rDolphin"),"extdata","MTBLS242_subset_profiling_data.RData"))
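Conversely, once your own profiling run finishes, you can keep the session for later reuse with base R's save() (the file name is only an example):

save(profiling_data, file = "my_profiling_data.RData") # stored in the current working directory (the extdata folder set above)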

profiling_data has two variables: final_output and reproducibility_data.

To write CSV files with the final_output data to your computer, run this command:

write_info(file.path(system.file(package = "rDolphin"),"extdata"),profiling_data$final_output,imported_data$ROI_data)

Go to the extdata folder mentioned in 'Import of necessary data' and you will find the output information.

To write PDFs with plots of every quantification to your computer, run this command:

write_plots(file.path(system.file(package = "rDolphin"),"extdata"),profiling_data$final_output,profiling_data$reproducibility_data)

A plots folder is created in the extdata folder mentioned in 'Import of necessary data', with a PDF file for every signal containing all of its quantifications.

Validation of quantifications

The validation function allows the analysis of possibly wrong identifications or quantifications. For example, this command:

View(validation(profiling_data$final_output,1)$shown_matrix)

shows the fitting error in every signal quantification. And this command:

View(validation(profiling_data$final_output,3)$shown_matrix)

shows the difference between the expected chemical shift and the calculated chemical shift in every signal quantification.
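These matrices can also be screened programmatically. A sketch, assuming shown_matrix is numeric with spectra in rows and signals in columns, and using 0.1 as a purely illustrative fitting error threshold:

fitting_error = validation(profiling_data$final_output, 1)$shown_matrix
suspicious = which(fitting_error > 0.1, arr.ind = TRUE) # row/column indices of quantifications above the illustrative threshold
head(suspicious)

Each row/column pair matches the list(row = ..., col = ...) argument used by load_quantification below.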

If you want to load the information and the plot of one of these suspicious quantifications, you can do it through the load_quantification function. Run these commands to load the lineshape fitting of the TSP signal in the first spectrum:

loaded_quantification=load_quantification(profiling_data$reproducibility_data,imported_data,profiling_data$final_output,list(row=1,col=1),imported_data$ROI_data)
loaded_quantification$plot

One can see that the signal should have an even larger half bandwidth than the already large one assigned. This effect is caused by the binding of TSP to protein. If one is still interested in quantifying the TSP signal area, it seems better to perform an integration of this signal, for example from 0.1 to -0.15 ppm. Change the TSP ROI profile in 'ROI_data' to adapt it to this new fitting type:

imported_data$ROI_data[1,1:3]=c(0.1,-0.15,"Clean Sum")

And now perform the individual TSP quantification in this first spectrum through the individual_profiling function:

updated_profiling_data=individual_profiling(imported_data,profiling_data$final_output,1,imported_data$ROI_data[1,,drop=F],profiling_data$reproducibility_data)

And compare the results:

loaded_updated_quantification=load_quantification(updated_profiling_data$reproducibility_data,imported_data,updated_profiling_data$final_output,list(row=1,col=1),imported_data$ROI_data)
loaded_updated_quantification$plot

If you are satisfied with the repeated lineshape fitting, updated_profiling_data already contains the updated quantification.
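To check the effect of the update on the stored values, a sketch assuming quantification stores spectra in rows and signals in columns, in line with the row/col indices used above:

profiling_data$final_output$quantification[1, 1] # TSP quantification in the first spectrum, before the individual re-profiling
updated_profiling_data$final_output$quantification[1, 1] # the same quantification after the update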

Univariate and multivariate analysis of fingerprinting and profiling data

Univariate analyses of every bin could already be seen with the profile_model_spectrum function. The calculated p values can be output with the p_values function:

pval=p_values(imported_data$dataset,imported_data$Metadata)

However, the big advantage of profiling data over fingerprinting data is its resilience to overlap and to chemical shift variability (typical in urine) or baseline distortion (typical in blood). Basic univariate analyses of profiling data can be evaluated by changing the data passed to p_values:

pval=p_values(profiling_data$final_output$quantification,imported_data$Metadata)
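A common follow-up is to correct these p values for multiple testing. A sketch, assuming pval is, or can be coerced to, a numeric vector with one p value per signal:

pval_adjusted = p.adjust(as.numeric(unlist(pval)), method = "BH") # Benjamini-Hochberg correction
sum(pval_adjusted < 0.05, na.rm = TRUE) # number of signals still significant after correction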

Finally, dendrogram heatmaps can show us interesting subsets of samples or signals in the information provided. For example, an unidentified signal can be highly correlated in quantification with another signal because both come from the same metabolite:

type_analysis_plot(profiling_data$final_output$quantification,profiling_data$final_output,imported_data,'dendrogram_heatmap')

For example, the heatmap shows that the 'U3_61' signal probably corresponds to another signal of L-valine.
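Such a relationship can be checked directly on the quantification matrix. A sketch, assuming signals are stored as named columns (the 'U3_61' and L-valine column names are only illustrative):

quantifications = profiling_data$final_output$quantification
valine_columns = grep("valine", colnames(quantifications), ignore.case = TRUE) # columns of the known L-valine signals (illustrative matching)
cor(quantifications[, "U3_61"], quantifications[, valine_columns]) # correlation with the unidentified signal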

And now, what do I have to do?

After this introduction to the wide range of options for performing the automatic profiling of spectra datasets and for maximizing the quality of this profiling, feel free to investigate the inputs and outputs of every function.

You should investigate how to adapt the ROI profiles to the matrix studied. We share on the rDolphin website the ROI information we have found optimal for several matrices. This ROI information will have to be adapted to the changes caused by the lab protocol during sample preparation or during spectrum acquisition and pre-processing.
