Getting started

The AlpsNMR package has most of its functions prefixed with nmr_. The main reason for this is to avoid conflicts with other packages. Besides, it helps for autocompletion: Most coding environments such as RStudio will let you see most of the function names by typing nmr_ followed by pressing the tab key.

This vignette assumes some basic knowledge of NMR and data analysis, and some basic R programming.

We will start by loading AlpsNMR along some convenience packages:

library(dplyr)
library(ggplot2)
library(readxl)
library(BiocParallel)
library(AlpsNMR)

Enable parallellization

This package is able to parallellize several functions through the use of the BiocParallel package. Whether to parallelize or not is left to the user that can control the parallellization registering backends. Please check vignette("Introduction_To_BiocParallel", package = "BiocParallel").

#register(SerialParam(), default = TRUE)  # disable parallellization
register(SnowParam(workers = 2, exportglobals = FALSE), default = TRUE)  # enable parallellization with 2 workers

Data: The MeOH_plasma_extraction dataset

To explore the basics of the AlpsNMR package, we have included three NMR samples acquired in a 600 MHz Bruker instrument bundled with the package. The samples are pooled quality control plasma samples, that were extracted with methanol. They only contain small molecules.

If you have installed this package, you can obtain the directory where the samples are with the command:

MeOH_plasma_extraction_dir <- system.file("dataset-demo", package = "AlpsNMR")
MeOH_plasma_extraction_dir

The demo directory includes three zipped Bruker samples and a dummy Excel metadata file:

list.files(MeOH_plasma_extraction_dir)

Since these are quality control samples, the metadata is a dummy table:

MeOH_plasma_extraction_xlsx <- file.path(MeOH_plasma_extraction_dir, "dummy_metadata.xlsx")
annotations <- readxl::read_excel(MeOH_plasma_extraction_xlsx)
annotations

Loading samples

The function to read samples is called nmr_read_samples(). It expects a character vector with the samples to load that can be paths to directories of Bruker format samples or paths to JDX files.

Additionally, this function can filter by pulse sequences (e.g. load only NOESY samples) or loading only metadata.

zip_files <- fs::dir_ls(MeOH_plasma_extraction_dir, glob = "*.zip")
zip_files
dataset <- nmr_read_samples(sample_names = zip_files)
dataset

If your samples happen to be in different folders per class, AlpsNMR provides convenience functions to read them as well. With this example:

- your_dataset/
  + control/
     * 10/
     * 20/
     * 30/
  + mutated/
     * 10/
     * 20/
     * 30/

You could use:

dataset <- nmr_read_samples_dir(c("your_dataset/control", "your_dataset/mutated"))
dataset

If after reading the ?nmr_read_samples page you still have issues, feel free to open an issue at https://github.com/sipss/AlpsNMR/issues and ask for clarification.

Adding annotations

We can embed the external annotations we loaded above into the dataset:

dataset <- nmr_meta_add(dataset, metadata = annotations, by = "NMRExperiment")

And retrieve them from the dataset:

nmr_meta_get(dataset, groups = "external")

If you want to learn more about sample metadata (including acquisition and FID processing parameters), as well as more complex ways of adding annotations, check out the vignette("Vig02-handling-metadata-and-annotations", package = "AlpsNMR").

Phasing

It might be the case that automatically reconstructed metabolite NMR spectra have a first-order phase error. AlpsNMR provides a convenient wrapper function the NMRphasing package, offering a variety of algorithms to estimate and correct for phase errors, which arise on physical grounds.

#dataset <- nmr_autophase(dataset, method="MPC_DANM")

Interpolation

1D NMR samples can be interpolated together, in order to arrange all the spectra into a matrix, with one row per sample. Here we choose the range of ppm values that we want to include in further analyses.

dataset <- nmr_interpolate_1D(dataset, axis = c(min = -0.5, max = 10))

If the axis = NULL then the ppm axis is autodetected from the samples.

See nmr_interpolate_1D() for further reference on the axis options.

Plotting samples

Plotting many spectra with so many points is quite expensive so it is possible to include only some regions of the spectra or plot only some samples.

plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))

Exclude regions

Some regions can easily be excluded from the spectra with nmr_exclude_region():

regions_to_exclude <- list(water = c(4.6, 5), methanol = c(3.33, 3.39))
dataset <- nmr_exclude_region(dataset, exclude = regions_to_exclude)
plot(dataset, chemshift_range = c(4.2, 5.5))

Filter samples

Maybe we just want to analyze a subset of the data, e.g., only a class group or a particular gender. We can filter some samples according to their metadata as follows:

samples_10_20 <- filter(dataset, SubjectID == "Ana")
nmr_meta_get(samples_10_20, groups = "external")

Robust PCA for outlier detection

The AlpsNMR package includes robust PCA analysis for outlier detection.

pca_outliers_rob <- nmr_pca_outliers_robust(dataset, ncomp = 3)
nmr_pca_outliers_plot(dataset, pca_outliers_rob)

Samples with greater QResiduals and Tscores than the threshold defined by the red line are candidates for further exploration and exclusion. With this small dataset, there is not much to see.

Baseline estimation

Spectra may display an unstable baseline, specially when processing blood/fecal samples.

The peak detection and integration algorithms benefit from having an estimation of the baseline, so it is advisable to compute it first and check it fits as expected.

See before:

plot(dataset, chemshift_range = c(1.37, 2.5))
plot(dataset, chemshift_range = c(3.5,3.8))

Estimate the baseline:

dataset <- nmr_baseline_estimation(dataset, lambda = 9, p = 0.01)

And after:

# TODO: Simplify this plot
spectra_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5))
baseline_to_plot <- tidy(dataset, chemshift_range = c(1.37, 2.5), matrix_name = "data_1r_baseline")

ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) +
    geom_line(data = spectra_to_plot) +
    geom_line(data = baseline_to_plot, linetype = "dashed") + 
    facet_wrap(~NMRExperiment, ncol = 1)
# TODO: Simplify this plot
spectra_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8))
baseline_to_plot <- tidy(dataset, chemshift_range = c(3.5, 3.8), matrix_name = "data_1r_baseline")

ggplot(mapping = aes(x = chemshift, y = intensity, color = NMRExperiment)) +
    geom_line(data = spectra_to_plot) +
    geom_line(data = baseline_to_plot, linetype = "dashed") + 
    facet_wrap(~NMRExperiment, ncol = 1)

Peak detection

The peak detection is performed on short spectra segments using a continuous wavelet transform. Peaks below a threshold intensity are automatically discarded.

Our current approach relies on the use of the baseline threshold (baselineThresh) automatically calculated (see ?nmr_baseline_threshold) and the Signal to Noise Threshold (SNR.Th) to discriminate valid peaks from noise.

See ?nmr_detect_peaks for more information.

baselineThresh <- nmr_baseline_threshold(dataset, range_without_peaks = c(9.5, 10), method = "median3mad")
nmr_baseline_threshold_plot(dataset, baselineThresh)
peak_list_initial <- nmr_detect_peaks(
    dataset,
    nDivRange_ppm = 0.1,
    scales = seq(1, 16, 2),
    baselineThresh = baselineThresh,
    SNR.Th = 3,
    fit_lorentzians = TRUE
)

We can get an overview of the number of peaks we detect on each sample and each chemical shift region:

nmr_detect_peaks_plot_overview(peak_list_initial)

We can explore in a more detailed way the detected peaks:

nmr_detect_peaks_plot(dataset, peak_list_initial, NMRExperiment = "10", chemshift_range = c(3, 3.3))

Let's the detected peaks in a smaller region across samples:

peak_list_in_range <- filter(peak_list_initial, ppm > 3.22, ppm < 3.24)
peak_list_in_range
plot(dataset, chemshift_range = c(3.22, 3.25))
nmr_detect_peaks_plot_peaks(
    dataset,
    peak_list_initial,
    peak_ids = peak_list_in_range$peak_id,
    caption = paste("{peak_id}",
                    "(NMRExp.\u00A0{NMRExperiment},",
                    "gamma(ppb)\u00a0=\u00a0{gamma_ppb},",
                    "\narea\u00a0=\u00a0{area},",
                    "nrmse\u00a0=\u00a0{norm_rmse})")
)
peak_list_initial_accepted <- peaklist_accept_peaks(
    peak_list_initial,
    dataset,
    area_min = 50, 
    keep_rejected = FALSE,
    verbose = TRUE
)

Spectra alignment

Once we have a preliminary peak list, we can align the spectra using the nmr_align() function. We expect shifts between the spectra, this becomes necessary so we can cluster the peaks correctly afterwards and build a peak table.

The alignment process takes several parameters, including:

NMRExp_ref <- nmr_align_find_ref(dataset, peak_list_initial_accepted)
message("Your reference is NMRExperiment ", NMRExp_ref)
dataset_align <- nmr_align(
    nmr_dataset = dataset, 
    peak_data = peak_list_initial_accepted, 
    NMRExp_ref = NMRExp_ref, 
    maxShift_ppm = 0.0015, 
    acceptLostPeak = TRUE
)

Compare the dataset before and after alignment, to verify the quality of the alignment:

plot(dataset, chemshift_range = c(3.025, 3.063))
plot(dataset_align, chemshift_range = c(3.025, 3.063))
cowplot::plot_grid(
    plot(dataset, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none"),
    plot(dataset_align, chemshift_range = c(3.22, 3.25)) + theme(legend.position = "none")
)

Normalization

With the spectra correctly aligned, you can use spectra normalization techniques. We normalize after alignment because some of the normalization techniques are sensitive to misalignments.

There are multiple normalization techniques available. The most strongly recommended is the Probabilistic Quantile Normalization (pqn), but it requires more samples for its internal estimations to be reliable, as it needs a computation of the median spectra. Nevertheless, it is possible to compute it:

dataset_norm <- nmr_normalize(dataset_align, method = "pqn")

The normalization essentially computes a normalization factor for each sample.

The plot shows the dispersion with respect to the median of the normalization factors, and can highlight samples with abnormally large or small normalization factors.

normalization_info <- nmr_normalize_extra_info(dataset_norm)
normalization_info$norm_factor
normalization_info$plot

We can confirm sample 20 is now slightly more diluted:

to_plot <- dplyr::bind_rows(
    tidy(dataset_align, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>%
        mutate(Normalized = "No"),
    tidy(dataset_norm, NMRExperiment = "20", chemshift_range = c(2,2.5)) %>%
        mutate(Normalized = "Yes"),
)
ggplot(data = to_plot, mapping = aes(x = chemshift, y = intensity, color = Normalized)) + 
    geom_line() +
    scale_x_reverse() +
    labs(y = "Intensity", x = "Chemical shift (ppm)",
         caption = "The normalization slightly diluted experiment 20")

And all samples are more homogeneous now:

cowplot::plot_grid(
    plot(dataset_align, chemshift_range = c(2, 2.5)) + labs(title="Before Normalization"),
    plot(dataset_norm, chemshift_range = c(2, 2.5)) + labs(title="After Normalization"),
    ncol = 1
)

Peak grouping

If you align or normalize your samples, you should rerun the peak detection to ensure the peak positions and estimations are well calculated:

baselineThresh <- nmr_baseline_threshold(dataset_norm, range_without_peaks = c(9.5, 10), method = "median3mad")
nmr_baseline_threshold_plot(dataset_norm, baselineThresh)
peak_list_for_clustering_unfiltered <- nmr_detect_peaks(
    dataset_norm,
    nDivRange_ppm = 0.1,
    scales = seq(1, 16, 2),
    baselineThresh = baselineThresh,
    SNR.Th = 3,
    fit_lorentzians = TRUE,
    verbose = TRUE
)

peak_list_for_clustering <- peaklist_accept_peaks(
    peak_list_for_clustering_unfiltered,
    dataset_norm,
    area_min = 50, 
    keep_rejected = FALSE,
    verbose = TRUE
)

Feel free to plot, explore and further curate your peak list. Or proceed with the current one:

Once we have a peak list for each sample peak_list, we need to turn it into a table, merging peaks from different samples together.

clustering <- nmr_peak_clustering(peak_list_for_clustering, verbose = TRUE)
cowplot::plot_grid(
    clustering$num_cluster_estimation$plot + labs(title = "Full"),
    clustering$num_cluster_estimation$plot +
        xlim(clustering$num_cluster_estimation$num_clusters-50, clustering$num_cluster_estimation$num_clusters+50) +
        ylim(0, 10*clustering$num_cluster_estimation$max_dist_thresh_ppb) +
        labs(title = "Fine region")
)
peak_list_clustered <- clustering$peak_data

We can plot the samples, with the detected peaks and how they have been connected. This allows us to compare the peak detection across samples, and check how good the peak matching is.

If peaks are matched they are connected with a black segment. If peaks are detected but not matched, they appear as a dot. If you see a peak, without a point on top then it means the peak was not detected or it was filtered out.

nmr_peak_clustering_plot(
    dataset = dataset_norm,
    peak_list_clustered = peak_list_clustered,
    NMRExperiments = c("10", "20"),
    chemshift_range = c(2.4, 3.0)
)

Sometimes we see peaks and we wonder why are they not detected. We can include the baselineThresh to plot it in the sample as well. This can help to diagnose if the baselineThresh argument is the cause of a peak not being detected.

nmr_peak_clustering_plot(
    dataset_norm,
    peak_list_clustered, 
    NMRExperiments = c("10", "20"),
    chemshift_range = c(4.2, 4.6),
    baselineThresh = baselineThresh
)
peak_table <- nmr_build_peak_table(peak_list_clustered, dataset_norm)
peak_table
peak_matrix <- nmr_data(peak_table)
peak_matrix[1:3, 1:8]

Or you can get a data frame with the corresponding annotations:

peak_table_df <- as.data.frame(peak_table)
peak_table_df
saveRDS(peak_table, "demo_peak_table.rds")

From this peak table you can proceed to use statistical testing, machine learning, and any downstream analysis you may be interested in.

Session Info:

sessionInfo()


sipss/AlpsNMR documentation built on Aug. 13, 2024, 5:11 p.m.