knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=10, fig.height=5 )
In this example we will process the example dataset fa_nmr to build a simple PC-LDA classifier. As this dataset includes only three spectra per category, don't expect wonders: the aim of this vignette is simply to guide you through a typical workflow.
This package comes with an example dataset, already loaded as a collection. This dataset contains 1H-NMR spectra of egg yolk fatty acids. The spectra were pre-processed using TopSpin:

* Phasing
* Baseline removal
* Spectral alignment
* Export as .txt files
Each exported file looked like this:
# File created = <DATE>
# Data set = <NAME> 10 1 <"Path/to/folder">
# Spectral Region:
# LEFT = 14.931500434875488 ppm. RIGHT = -5.284929038283497 ppm.
#
# SIZE = 262144 ( = number of points)
#
# In the following ordering is from the 'left' to the 'right' limits!
# Lines beginning with '#' must be considered as comment lines.
#
83549.0
83274.0
83057.0
...
Here is how this data was loaded as a collection:
require(purrr)
require(readr)
require(magrittr)     # for the %>% pipe
require(tidySpectR)

# First we get a list of the files in the spectra folder
files <- file.path(INPUT_FOLDER, list.files(INPUT_FOLDER))

# Then we create a simple parser to read the data from the files
parse_file <- function(path){
    meta_raw <- readLines(path, n = 10, ok = FALSE)
    dataset <- unlist(strsplit(meta_raw[2], " "))[5]
    left <- as.numeric(unlist(strsplit(meta_raw[4], " "))[4])
    right <- as.numeric(unlist(strsplit(meta_raw[4], " "))[8])
    vals <- read_tsv(path, col_names = c("values"), comment = "#", col_types = "d")
    return(list(id = dataset, left = left, right = right, values = vals$values))
}

# Finally we load each spectrum and add it to a collection
fa_nmr <- collection()

walk(files, ~ {
    parsed <- parse_file(.x)
    fa_nmr <<- fa_nmr %>%
        add_spectrum(parsed$values, parsed$left, parsed$right, parsed$id)
})

# We can also load labels from a file and add them to the collection
labs <- read_tsv(LABELS)
fa_nmr <- add_labels(fa_nmr, labs, ids_from = "sample", labels_from = "label")
Let's set things up here. We will need the tidyverse and tidymodels packages for this example.
library(tidyverse)
library(tidymodels)
library(discrim)
library(tidySpectR)

data(fa_nmr)
The fa_nmr toy dataset is provided with the package. Let's have a look:
fa_nmr

autoplot(fa_nmr, offset_y = 1e+09, offset_x = 1) +
    theme(legend.position = "none") +
    scale_x_reverse()
First, remove the water and solvent (MeOD in this case) peaks and trim the edges of the spectra to keep only the region of interest.
spectra <- fa_nmr

tidySpectR::extract(spectra, 4.6, 4.66) %>%
    autoplot(type = "average") +
    ggtitle("Water peak")

tidySpectR::extract(spectra, 3.33, 3.39) %>%
    autoplot(type = "average") +
    ggtitle("MeOD peak")
This looks good, so we can proceed to masking the peaks; these regions will be effectively removed from the dataset (not just set to zero).
spectra <- spectra %>%
    extract(0.5, 7.2) %>%
    mask(4.6, 4.66) %>%
    mask(3.33, 3.39)
We can now normalize the spectra, here using normalization to a given peak (an internal standard). You could also normalize with external values (osmolarity, concentrations, ...); check the normalize_factor method for that.
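For illustration only, here is a hedged sketch of that external approach. The per-sample values below are made up, and the interface assumed here (normalize_factor taking the collection and a plain numeric vector of factors, one per spectrum) should be verified against ?normalize_factor:

```r
# Hypothetical per-sample dilution factors, one value per spectrum in the
# collection (six assumed here). Both the values and the assumed signature
# normalize_factor(collection, factors) are illustrative only; see
# ?normalize_factor for the real interface.
dilution_factors <- c(1.02, 0.98, 1.10, 0.95, 1.01, 1.05)
spectra_ext <- normalize_factor(spectra, dilution_factors)
```

In this vignette we stick with the internal standard approach.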
First we will check the normalization peak to be sure that we include all of it.
spectra %>%
    tidySpectR::extract(3.57, 3.65) %>%
    autoplot(type = "average") +
    ggtitle("Normalization peak (raw)")
Looking good, let's proceed and check the result.
spectra <- normalize_internalStandard(spectra, 3.57, 3.65)

spectra %>%
    tidySpectR::extract(3.57, 3.65) %>%
    autoplot(type = "average") +
    ggtitle("Normalization peak (after normalization)")
We can now bucket (or "bin") the spectra. In this example we will use simple uniform binning, but more sophisticated methods are implemented as well.
spectra <- bucket_uniform(spectra, width = 0.04)

spectra %>%
    autoplot(type = "label_average", offset_x = 0.5, offset_y = 5) +
    scale_x_reverse() +
    ggtitle("Processed spectra")
And here are the processed spectra!
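If you are curious about the more sophisticated bucketing methods mentioned above, a quick way to discover them (not part of the original vignette) is to list the bucket_* functions exported by the package and read their help pages:

```r
# List all exported bucketing functions of tidySpectR; see the corresponding
# help pages for their arguments.
ls("package:tidySpectR", pattern = "^bucket_")
```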
Now that we have processed our spectra, we can move on to the modelling part. For this we will rely on the tidymodels workflow. First we will scale the spectra using Pareto scaling and then perform PCA on the predictors.
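For reference, Pareto scaling is commonly defined as mean-centering each variable and dividing it by the square root of its standard deviation, which shrinks dominant peaks less aggressively than unit-variance scaling. Here is a plain-R illustration of that definition (not the package's step_pareto code itself; check its help page for the exact centering behaviour):

```r
# Pareto scaling of a single variable (one bucket across samples):
# mean-center, then divide by the square root of the standard deviation.
pareto_scale <- function(x) (x - mean(x)) / sqrt(sd(x))

# Toy example on a dummy intensity vector:
pareto_scale(c(1, 2, 4, 8, 16))
```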
Note that this package implements only a couple of chemometrics-oriented steps to use with recipes. Check out the tidymodels website to learn more about recipes.
# Export the spectra in a tidy format
df <- tidy(spectra)
print(df)

rec <- recipe(df, label ~ .) %>%
    update_role(id, new_role = "ID") %>%
    step_pareto(all_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_pca(all_predictors(), threshold = 0.95)

mod <- discrim_linear(mode = "classification") %>%
    set_engine("mda")

wflow <- workflow() %>%
    add_recipe(rec) %>%
    add_model(mod)

fitted <- fit(wflow, df)
print(fitted)
It looks like we misclassified one sample; this would probably improve by increasing the number of samples and optimizing the preprocessing.
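As a quick sanity check (not shown in the original vignette), you can predict back on the training tibble to see which sample ends up misclassified; predict() on a fitted workflow and bind_cols() are standard tidymodels/dplyr calls:

```r
# Predict the training data with the fitted workflow and put the predictions
# next to the true labels. With three spectra per class this is purely
# illustrative, not a validation.
predict(fitted, df) %>%
    bind_cols(df %>% select(id, label))
```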
When you go to production, you'll want to apply the same pre-processing steps to new samples. While this is straightforward for operations like extraction and masking, it can become difficult when using more sophisticated bucketing or normalization algorithms.
This package provides a processing_template
class to deal with reproducible processing.
Each collection
object contains a processing_template
object that automatically saves
processing steps and their arguments.
# See what happened to spectra already
print(spectra)

# Exporting the processing template from a processed collection
processor <- export_processor(spectra)
print(processor)
processing_template objects also have a tidy method summarizing the relevant information, and they allow access to each single step:
tidy(processor)

processor$steps[[1]] %>% tidy()
processor$steps[[4]] %>% tidy()

# One can access function arguments:
processor$steps[[4]] %>% tidy() %>% pull(factors)
We can now apply the processing to a new collection object (or reuse the fa_nmr dataset) and compare it to spectra:
new_data <- process(processor, fa_nmr)
print(new_data)
And done! Let's check that we got the same thing out of both processes:
autoplot(spectra, type = "average") +
    scale_x_reverse() +
    ggtitle("Original data")

autoplot(new_data, type = "average") +
    scale_x_reverse() +
    ggtitle("New data")
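Beyond the visual comparison, a stricter check (not part of the original vignette) is to compare the tidy exports of both collections directly:

```r
# all.equal() returns TRUE if the exported tibbles match, or a description
# of the differences otherwise.
all.equal(tidy(spectra), tidy(new_data))
```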