rDolphin is an R package that performs the automatic profiling of 1D 1H NMR spectra datasets and outputs several indicators of quality of quantification and identification for each signal area quantification. To perform a reliable automatic profiling, resilient to the multiple factors that can incorporate variability to a dataset, rDolphin splits the spectrum into Regions of Interest (ROIs). Each ROI has a specific method of quantification (integration or deconvolution; with or without baseline fitting), a list of signals to quantify and a list of signal parameters to monitor (e.g. the chemical $chemical_shift variability).
In addition, rDolphin comes with tools to load individual quantifications to be evaluated and updated or to perform different kinds of exploratory analyses. For researchers with non-expert programming skills, the package incorporates a Shiny GUI.
Please run these commands in the R console (preferably through RStudio) to install the package:
source("https://bioconductor.org/biocLite.R"); biocLite("impute"); biocLite("MassSpecWavelet")
#installation of packages that cannot be installed from Githubinstall.packages("devtools")
#installation of package necessary to install rDolphin from Githubdevtools::install_github("danielcanueto/rDolphin")
#installs rDolphinlibrary(rDolphin)
#loads rDolphinAt some moment, you may be asked to install Rtools:
The installation may stop at some point because of the multiple dependencies from different repositories required to install rDolphin:
These issues tend to be solved when running the previous commands again. If not, please find the package with failed installation (in this example, MassSpecWavelet) and install it before running again the commands.
If you have problems during installation, please contact me at daniel.canueto@urv.cat
The introduction will show how to perform in the R console the profiling of a subset of 30 spectra of MTBLS242, a Metabolights dataset of blood spectra. The 30 spectra belong to 15 patients with blood samples taken at two different times.
A link to an online document showing how to perform similar steps through the Shiny GUI is available in the Readme file of the Github website: https://github.com/danielcanueto/rDolphin.
Each one has an input
folder with the next components
file.path(system.file(package = "rDolphin"),"extdata")
You will find in the folder the next files:
A Parameters
CSV file with the necessary information to import with the desired preprocessing parameters. A brief description is given inside the CSV file for each parameter.
A Dataset
CSV file with a matrix with each row a spectrum and each observation a bin. The header has the information to which ppm belongs every bin. Spectra can also be inputted as Bruker experiments if specified in the Parameters
file.
A Metadata
CSV file with three columns:
Sample
is the name of the sample. If reading Bruker processed spectra, it needs to be the same than the one in the Bruker folder.Individual
is a number specifying each individual. For example, in the case of this MTBLS242 dataset, the 1-15 sequence is repeated 2 times because there are two samples for each individual.Type
is a number specifying the kind of sample.Accurate information about Individual
and Type
is not necessary for the program to be run, but it will be useful to maximize the quality of exploratory analyses.
An ROI_profiles
CSV file with information of every signal to be quantified. Every signal is located in an ROI that can contain one or more signals to quantify. Several parameters of the signal and the kind of quantification are also shown.
A thorough description of the structure of every file necessary to input to the tool is available here: https://docs.google.com/document/d/1t--gF5mCBNhbGvn53vKth2nlTzucLx55EfUWIgrMh_o/edit?usp=sharing
Run these commands to set as directory the extdata
folder and to import the data contained there:
setwd(file.path(system.file(package = "rDolphin"),"extdata")) imported_data=import_data("Parameters_MTBLS242_15spectra_5groups.csv")
An imported_data
list is generated with several variables. Feel free to investigate the information present in each one of them. Some important ones are dataset
, ppm
or ROI_data
.
rDolphin eases the analysis of the dataset complexity through two kinds of interactive Plotly figures:
exemplars_plot(imported_data)
median_plot(imported_data)
For each kind of figure, a red trace appears below the spectra. This red trace gives the results of a univariate analysis for every bin and helps the user detect interesting regions to profile. Feel free to play with the interactivity of the Plotly figure.
If you do not know how to annotate a signal in the dataset, you can evaluate possible options in the biofluid studied (blood) ranked by probability thanks to signal repository included with the package. For example, these commands:
ind=intersect(which(imported_data$repository$Shift..ppm.<3.37),which(imported_data$repository$Shift..ppm.>3.35)) View(imported_data$repository[ind,])
help see which signals in blood matrix are seen in the 3.37-3.35 ppm region. A methanol singlet stands out in the results. Accordingly, a high-intensity singlet appears in our dataset. The filtering by biofluid avoids wrong annotations typical when using general databases.
In addition, the user can also visualize the results of STOCSY or RANSY in a region of the dataset. These are the results achieved for a glutamine signal at 2.14-2.12 ppm:
identification_tool(imported_data$dataset,imported_data$ppm,c(2.14,2.12),method='spearman')
The other important glutamine signal at 2.45-2.4 ppm stands out.
Looking to the performance of profiling in a model spectrum can help to improve the parameters in some ROIs and to check additional regions of the spectrum with the potential to give interesting insights into a study.
Run the next command:
profiling_model=profile_model_spectrum(imported_data,imported_data$ROI_data)
profiling_model
has three elements:
p
, a Plotly figure showing the fitting applied to every ROI in the model spectrum as well as a red trace below showing the univariate analysis on each bin.
ROI_data
, a data frame of the ROI data used during the process.
total_signals_parameters
, a data frame with parameters of fitting and of quality of fitting (e.g. the fitting error) for every quantified signal.
Feel free to modify cells of the ROI information contained in imported_data$ROI_data
or to add or remove ROIs and to repeat the model spectrum profiling process.
When you are satisfied with the ROI data contained in imported_data$ROI_data
, you can perform an automatic profiling of the spectra through this command:
profiling_data=automatic_profiling(imported_data,imported_data$ROI_data)
If you do not want to wait for the completion of the profiling of the dataset, you can stop the profiling and load this .RData file, where the data of an already performed profiling session is saved:
load(file.path(system.file(package = "rDolphin"),"extdata","MTBLS242_subset_profiling_data.RData"))
profiling_data
has two variables:
final_output
, a list of lists that contains the performed quantifications (in quantification
) as well as several indicators of quality and signal parameter information.
reproducibility_data
, a list of lists where useful information of each quantification is stored, for reproducibility and profiling update purposes.
To output in your computer CSV files with the final_output
data, run this command:
write_info(file.path(system.file(package = "rDolphin"),"extdata"),profiling_data$final_output,imported_data$ROI_data)
Go to the extdata
folder mentioned in Import of necessary data
and you will find the outputted information.
To output in your computer PDFs with plots of every quantification, run this command:
write_plots(file.path(system.file(package = "rDolphin"),"extdata"),profiling_data$final_output,profiling_data$reproducibility_data)
A plots
folder is stored in the extdata
folder mentioned in Import of necessary data
. You can see a PDF file for every signal with all quantifications.
The validation
function allows to analyse possible wrong identifications or quantifications. For example, this command:
View(validation(profiling_data$final_output,1)$shown_matrix)
shows the fitting error in every signal quantification. And this command:
View(validation(profiling_data$final_output,3)$shown_matrix)
shows the difference between the expected chemical shift and the calculated chemical shift in every signal quantification.
If the user wants to load the information and the plot of these suspicious quantifications, he can do it through the load_quantification
function.
Run these commands to load the lineshape fitting of the TSP signal in the first spectrum:
loaded_quantification=load_quantification(profiling_data$reproducibility_data,imported_data,profiling_data$final_output,list(row=1,col=1),imported_data$ROI_data) loaded_quantification$plot
One can see that the signal should have even more half bandwidth than the large one already assigned. This effect is caused by the binding of TSP to protein. If one is still interested in quantifying the TSP signal area, it seems better to perform an integration of thi signal for example from 0.1 to -0.015 ppm. Change the TSP ROI profile in 'ROI_data' to adapt it to this new fitting type.
imported_data$ROI_data[1,1:3]=c(0.1,-0.15,"Clean Sum")
And now perform the individual TSP quantification in this first spectrum through the individual_profiling
function:
updated_profiling_data=individual_profiling(imported_data,imported_data$final_output,1,imported_data$ROI_data[1,,drop=F],imported_data$reproducibility_data)
And compare the results:
loaded_updated_quantification=load_quantification(updated_profiling_data$reproducibility_data,imported_data,updated_profiling_data$final_output,list(row=1,col=1),imported_data$ROI_data) loaded_updated_quantification$plot
If satisfied with the repeated lineshape fitting, the updated_profiling_data
already contains the updated quantification.
Univariate analyses of every bin could be already seen with the profile_model_spectrum
function. The p values calculated can be outputted with the p_values
function:
pval=p_values(imported_data$dataset,imported_data$Metadata)
However, the big advantage of profiling data to fingerprinting data is its resilience to overlapping and to chemical shift variability (typical in urine) or baseline (typical in blood). Basic univariate analyses of profiling data can be evaluated changing the data called in p_values
:
pval=p_values(profiling_data$final_output$quantification,imported_data$Metadata)
Finally, dendrogram heatmaps can show us interesting subsets of samples or signals with the information provided. For example, a not identified signal can be highly correlated in quantification with another signal because they are signals from the same metabolite:
type_analysis_plot(profiling_data$final_output$quantification,profiling_data$final_output,imported_data,'dendrogram_heatmap')
For example, the heat map shows that the 'U3_61' signal correspond probably to another signal of L-valine.
After this introduction to the wide range of options to perform the automatic profiling of spectra datasets and to maximize the quality of this profiling, feel free to investigate the inputs and outputs of every function.
You should investigate how to adapt the ROI profiles to the matrix studied. We share on the rDolphin website the ROI information we have found optimal in several matrixes. This ROI information will have to be adapted to the changes caused by the lab protocol during sample preparation or during spectrum acquisition and pre-processing.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.