#'
#' Import a File Written in the JCAMP-DX Format
#'
#' This function supervises the entire import process.
#' Not all official formats are supported; see the vignettes.
#' Prior to release, this package is checked against a very large number of files in the
#' author's collection. However, the JCAMP-DX standard allows many variations and it is
#' difficult to anticipate all permutations. Error messages will
#' generally let you know what's going on. If you have a file that you feel should be
#' supported but gives an error, please file an issue at GitHub and share the file.
#'
#' @param file Character. The file name to import.
#'
#' @param SOFC Logical. "Stop on Failed Check".
#' The default is \code{TRUE}, i.e. stop when something is not right.
#' This ensures that correct data is returned. Change to \code{FALSE} at your own risk.
#' NOTE: Only certain checks can be skipped via this option, as there are some
#' parameters that must be available and correct in order to return \emph{any} answer.
#' For instance, one must end up with the same number of X and Y values.
#' This option is provided for those \emph{advanced
#' users} who have carefully checked their original files and want to skip the
#' required checks. It may also be useful for troubleshooting.
#' The JCAMP-DX standard typically requires
#' several checks of the data as it is decompressed. These checks are essential
#' to obtaining the correct results. However, some JCAMP-DX writing programs
#' do not follow the standard to the letter. For instance, we have observed that
#' not all JCAMP-DX writers put FIRSTY into the metadata, even though it is required by
#' the standard. In other cases values in the file have low precision (see the Precision section).
#' Another example we have observed is NMR files where the X values are the count/index of data points,
#' and FIRSTX is given in Hz. Since the field strength and center of the sweep frequency are needed
#' to convert to ppm, and these are parameters not required in the standard, one cannot return an
#' answer in either ppm or Hz automatically.
#' In cases like this, one can set \code{SOFC = FALSE} and then manually convert the X axis.
#'
#' @param debug Integer. The level of debug reporting desired. For those options giving
#' a lot of output, you may wish to consider directing the output via \code{\link{sinkall}}
#' and then searching the results for the problematic lines.
#' \itemize{
#' \item 1 or higher = import progress is reported.
#' \item 2 = details about the variable lists, compression formats and
#' parameters that were found.
#' \item 3 = print the extracted X values (huge output!).
#' \item 4 = detailed info on the Y value processing (huge output!).
#' \item 5 = detailed info about processing the Y values when DUP is in use (huge output!).
#' \item 6 = detailed info about processing the Y values when DIF is in use (huge output!).
#' }
#' In cases when an error is about to stop execution, you get additional information regardless of
#' the \code{debug} value.
#'
#' @return A list, as follows:
#'
#' \itemize{
#' \item The first element is a data frame summarizing the pieces of the imported file.
#' \item The second element is the file metadata.
#' \item The third element is an integer vector giving the comment lines found
#' (exclusive of the metadata, which typically contains many comments).
#' }
#'
#' Additional elements contain the extracted data as follows:
#'
#' \itemize{
#' \item If the file contains one non-NMR spectrum, or a processed NMR spectrum (i.e. only
#' the final real data), a single data frame.
#' \item If the file contains the real and imaginary
#' parts of a 1D NMR spectrum, there will be two data frames, one containing the real portion
#' and the other the imaginary portion.
#' \item In all cases above, the data frame has elements \code{x} and \code{y}.
#' \item In the case of 2D NMR data, additional named list elements are returned including
#' the F2 frequency values, the F1 frequency values, and a matrix containing the 2D data.
#' \item In the case of LC-MS or GC-MS data, a data frame is returned for each time point.
#' The data frame has elements \code{mz} and \code{int} (intensity). Each time point
#' is named with the time from the file.
#' }
#'
#' @seealso Do \code{browseVignettes("readJDX")} for background information,
#' references, supported formats, and details about the roles of each function.
#' If you have a multiblock file (which contains multiple spectra, but not 2D NMR,
#' LC-MS or GC-MS data sets), please see
#' \code{\link{splitMultiblockDX}} for a function to break such files into
#' individual files which can then be processed in the normal way.
#'
#' @section Included Data Files:
#' The examples make use of data files included with the package:
#' \itemize{
#' \item File \code{SBO.jdx} is an IR spectrum of Smart Balance Original spread (a butter
#' substitute). The spectrum is presented in transmission format, and was recorded on a
#' ThermoFisher instrument. The file uses AFFN compression, and was written
#' with the JCAMP-DX 5.01 standard. Note that even though the Y-axis was recorded in
#' percent transmission, in the JDX file it is stored on [0\ldots1].
#' \item File \code{PCRF.jdx} is a 1H NMR spectrum of a hexane extract of a reduced fat potato chip.
#' The spectrum was recorded on a JEOL instrument. The file uses SQZ DIF DUP compression,
#' and was written with the JCAMP-DX 6.00 standard.
#' \item File \code{PCRF_line265.jdx} has a deliberate error in it.
#' \item File \code{isasspc1.dx} is a 2D NMR file recorded on a JEOL GX 400 instrument.
#' The file is freely available at \url{http://www.jcamp-dx.org/}.
#' \item File \code{MiniDIFDUP.JDX} is a small demonstration file, used in the vignettes to
#' illustrate the decompression process. It is derived from a freely available file.
#' }
#'
#' @section Precision:
#' Internally, this package uses a tolerance factor when comparing values during certain checks.
#' This is desirable because the original values in the files
#' are text strings of varying lengths which get converted to numerical values by \code{R}.
#' Occasionally values in the file, such as FIRSTY, are stored with low precision, while the
#' value they are compared against is computed with much greater precision. The check can then
#' fail even when the tolerance is fairly loose. If this happens, one might consider setting
#' \code{SOFC = FALSE} to allow the calculation to proceed. If you do this, be certain to check
#' the results carefully as described under \code{SOFC}.
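#'
#' As a rough illustration of why a tolerance is needed, consider the sketch below. The
#' values and tolerances are invented for the illustration; they are not the ones used
#' internally by this package.
#' \preformatted{
#' stored <- as.numeric("1.23") # low-precision value read from the file as text
#' computed <- 1.2345678 # the same quantity computed at full precision
#' stored == computed # FALSE
#' isTRUE(all.equal(stored, computed, tolerance = 0.01)) # TRUE
#' isTRUE(all.equal(stored, computed, tolerance = 1e-4)) # FALSE
#' }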
#'
#' @section Y Value Check:
#' The standard requires a "Y Value Check" when in DIF mode. Extra Y values have been appended to each
#' line to use in the check, and the last Y value on a line must equal the first Y value on the next line
#' \emph{IFF} one is in DIF mode. After a successful check, the extra Y value must be removed. In actual practice,
#' some vendors, at least some of the time, seem to differ as to the meaning of
#' "being in DIF mode". In turn, this determines how the Y value check should proceed.
#' \itemize{
#' \item The standard says "When, and only when, the last ordinate of a line is in DIF form ...
#' The first ordinate of the next line ... is always an actual value, equal to the last
#' calculated ordinate of the previous line". See section 5.8.3 of the 1988 publication.
#' \item Taking this definition literally, the Y value check (and removal of the extra value)
#' should occur when one sees e.g. ... DIF DIF DIF (end of line). Let's call
#' this "literal DIF". A literal DIF is easy to detect and act on.
#' \item In other cases, something like ... DIF DUP DUP (end of line) is considered to be in DIF mode
#' for Y value check purposes. In these cases we have to look backwards to see if we are in DIF mode.
#' Let's call this "relayed DIF".
#' \item However, some vendors may treat ... DIF DUP DUP (end of line) as not in DIF mode, and hence
#' one should not do the Y value check and not remove any values, as this vendor would not have
#' added an extra Y value.
#' \item In addition to these three possibilities, \code{readJDX} through versions 0.3.xx used a different
#' definition, namely if there were any DIF entries anywhere on the line, then DIF mode was
#' assumed and the Y value check carried out. This worked for many files, but not all.
#' \item In the 0.4.xx series, \code{readJDX} detects both the literal and relayed definitions and
#' tries to keep moving forward as much as possible.
#' }
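#'
#' A toy numerical illustration of the literal DIF case follows. It is a sketch with
#' invented, already decompressed values, not the internal code of this package: the last
#' value on a line is the extra check value, and it is dropped once it has been matched
#' against the first value on the next line.
#' \preformatted{
#' line1 <- c(10, 12, 15, 19) # Y values from one line; 19 is the extra check value
#' line2 <- c(19, 22, 24) # the next line starts with an actual value
#' stopifnot(isTRUE(all.equal(line1[length(line1)], line2[1]))) # the Y value check
#' y <- c(line1[-length(line1)], line2) # drop the extra value, then join the lines
#' y # 10 12 15 19 22 24
#' }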
#'
#' @section Performance:
#' \code{readJDX} is not particularly fast. Priority has been given to assuring correct answers,
#' helpful debugging messages and understandable code.
#'
#' @export
#'
#' @importFrom stringr str_trim
#'
#' @examples
#' # IR spectrum
#' sbo <- system.file("extdata", "SBO.jdx", package = "readJDX")
#' chk <- readJDX(sbo)
#' plot(chk[[4]]$x, chk[[4]]$y / 100,
#'   type = "l", main = "Original Smart Balance Spread",
#'   xlab = "wavenumber", ylab = "Percent Transmission"
#' )
#'
#' # 1H NMR spectrum
#' pcrf <- system.file("extdata", "PCRF.jdx", package = "readJDX")
#' chk <- readJDX(pcrf)
#' plot(chk[[4]]$x, chk[[4]]$y,
#'   type = "l", main = "Reduced Fat Potato Chip Extract",
#'   xlab = "ppm", ylab = "Intensity"
#' )
#'
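#' # The first three elements of the returned list are always present and named;
#' # the extracted data starts at element 4.
#' names(chk)
#' chk$dataGuide # data frame summarizing the pieces of the file
#' head(chk$metadata) # first few lines of the file metadata
#' chk$commentLines # comment lines found outside the metadata
#'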
#' # Capturing processing for troubleshooting
#' mdd <- system.file("extdata", "MiniDIFDUP.JDX", package = "readJDX")
#' tf <- tempfile(pattern = "Troubleshooting", fileext = ".txt")
#' sinkall(tf)
#' chk <- readJDX(mdd, debug = 6)
#' sinkall() # close the file connection
#' file.show(tf)
#'
#' # 2D HETCORR spectrum
#' \dontrun{
#' nmr2d <- system.file("extdata", "isasspc1.dx", package = "readJDX")
#' chk <- readJDX(nmr2d)
#' contour(chk$Matrix, drawlabels = FALSE) # default contours not optimal
#' }
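#'
#' # A sketch of the same plot with the frequency axes attached. This assumes the rows
#' # of Matrix run along F1 and the columns along F2; verify the orientation for your data.
#' \dontrun{
#' nmr2d <- system.file("extdata", "isasspc1.dx", package = "readJDX")
#' chk <- readJDX(nmr2d)
#' contour(
#'   x = chk$F2, y = chk$F1, z = t(chk$Matrix),
#'   drawlabels = FALSE, xlab = "F2", ylab = "F1"
#' )
#' }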
#'
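#' # LC-MS or GC-MS data: a sketch using a hypothetical file name (no such file ships
#' # with the package). Element 4 is the data frame for the first time point.
#' \dontrun{
#' chk <- readJDX("my_LCMS_file.jdx") # hypothetical file
#' names(chk)[-(1:3)] # the time points, named with the times from the file
#' plot(chk[[4]]$mz, chk[[4]]$int,
#'   type = "h", main = names(chk)[4],
#'   xlab = "m/z", ylab = "Intensity"
#' )
#' }
#'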
#' \dontrun{
#' # Line 265 has an N -> G error. Try with various levels of debug.
#' # Even with debug = 0 you get useful diagnostic info.
#' problem <- system.file("extdata", "PCRF_line265.jdx", package = "readJDX")
#' chk <- readJDX(problem)
#' }
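#'
#' # A sketch of skipping the optional checks and then converting the X axis by hand,
#' # for an NMR file whose X values are point indices. The file name and all of the
#' # acquisition values are hypothetical; take the real ones from your own metadata
#' # and check the result carefully.
#' \dontrun{
#' chk <- readJDX("my_NMR_file.jdx", SOFC = FALSE) # hypothetical file
#' spec <- chk[[4]]
#' first_hz <- 6000 # assumed offset of the first point from the reference, in Hz
#' last_hz <- -1000 # assumed offset of the last point from the reference, in Hz
#' obs_mhz <- 500 # assumed observe frequency, in MHz
#' hz <- seq(first_hz, last_hz, length.out = nrow(spec)) # index -> Hz
#' spec$x <- hz / obs_mhz # Hz -> ppm
#' plot(spec$x, spec$y,
#'   type = "l", xlim = rev(range(spec$x)),
#'   xlab = "ppm", ylab = "Intensity"
#' )
#' }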
#'
readJDX <- function(file = "", SOFC = TRUE, debug = 0) {
  if (!requireNamespace("stringr", quietly = TRUE)) {
    stop("You need to install package stringr to use this function")
  }
  if (file == "") stop("No file specified")
  jdx <- readLines(file)

  ##### Step 1. Check the overall file structure.
  # A data block consists of ##TITLE= up to ##END=
  # However, link blocks can be used to contain data blocks, in which
  # case one has a compound file. Link blocks and compound files are not supported,
  # but a function is available to split them into individual files.
  # NMR data sets, including 2D NMR data sets, and LC-MS/GC-MS data sets use a different
  # scheme (NTUPLES) to hold multiple data sets.
  blocks <- grep("^\\s*##TITLE\\s*=.*", jdx)
  nb <- length(blocks)
  if (nb == 0) stop("This does not appear to be a JCAMP-DX file")
  if (nb > 1) stop("Compound (multi-block / multi-spectra) data sets can not be parsed.\nSee splitMultiblockDX which is a function to split such files into separate files which can be parsed.")

  ##### Step 2. Locate the parameters and the variable list(s)
  VL <- findVariableLists(jdx, debug)
  # "fmt" is a character vector extracted by findVariableLists, and reflects how the data is formatted
  # in the variable list. "mode" is a length one string derived from fmt and reflects the processing
  # needed, in particular which parameters need to be extracted in order to check the data
  fmt <- VL[["DataGuide"]][, "Format"][-1]
  mode <- NA_character_
  if ("XYY" %in% fmt) mode <- "XYY"
  if ("XRR" %in% fmt) mode <- "NMR_1D" # these files also contain XII
  if ("NMR_2D" %in% fmt) mode <- "NMR_2D"
  if ("LC_MS" %in% fmt) mode <- "LC_MS"
  if ("PEAK_TABLE" %in% fmt) mode <- "XYXY" # handled the same as the next one
  if ("XYXY" %in% fmt) mode <- "XYXY"
  if (is.na(mode)) stop("Could not determine the type of data in the file")
  if (debug >= 1) cat("\n\nProcessing file", file, "which appears to contain", mode, "data\n")

  ##### Step 3. Extract the needed parameters
  params <- extractParams(VL[[2]], mode, SOFC, debug)

  ##### Step 4. Process the variable list(s) into the final list that is returned
  if ((mode == "XYY") || (mode == "NMR_1D")) {
    # Return value is a list: dataGuide, metadata, comment lines + data frames of x, y
    # dataGuide, metadata & comments already in place; process each variable list
    for (i in 4:length(VL)) {
      VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
    }
    # Fix up names
    if (mode == "XYY") {
      specnames <- jdx[blocks] # each line with ##TITLE= (there is only one however)
      specnames <- str_trim(substring(specnames, 9, nchar(specnames)))
    }
    if (mode == "NMR_1D") specnames <- c("real", "imaginary")
    names(VL) <- c("dataGuide", "metadata", "commentLines", specnames)
  }

  if (mode == "NMR_2D") {
    # Return value is a list: dataGuide, metadata, comment lines, F2, F1, + a matrix w/2D data
    # dataGuide, metadata & comments already in place; add F2, F1, M and drop extra stuff
    M <- matrix(NA_real_, ncol = params[2], nrow = params[1]) # matrix to store result
    for (i in 4:length(VL)) {
      tmp <- processVariableList(VL[[i]], params, mode, SOFC, debug)
      M[i - 3, ] <- tmp$y
    }
    # TODO: check for na in M, if any present we did not find enough pages to fill it and something is wrong
    # Update VL
    VL[[4]] <- sort(seq(params[4], params[6], length.out = params[2])) # replace element 4 with F2
    VL[[5]] <- sort(seq(params[3], params[5], length.out = params[1])) # replace element 5 with F1
    M <- M[nrow(M):1, ] # reverse order of rows, works for Bruker files (all vendors?)
    VL[[6]] <- M # replace element 6 with M
    VL <- VL[1:6] # toss the remaining pieces of raw VL
    names(VL) <- c("dataGuide", "metadata", "commentLines", "F2", "F1", "Matrix")
  }

  if (mode == "LC_MS") {
    # Return value is a list: dataGuide, metadata, comment lines, a data frame for each time point
    # dataGuide, metadata & comments already in place; add data frames for each time point
    for (i in 4:length(VL)) {
      VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
    }
    # Get the retention times & use to label list elements
    rti <- grep("##PAGE= T=", jdx)
    rt <- jdx[rti]
    rt <- sub("##PAGE= ", "", rt)
    names(VL) <- c("dataGuide", "metadata", "commentLines", rt)
  }

  if (mode == "XYXY") {
    for (i in 4:length(VL)) {
      VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
    }
    specnames <- jdx[blocks] # each line with ##TITLE= (there is only one however)
    specnames <- str_trim(substring(specnames, 9, nchar(specnames)))
    names(VL) <- c("dataGuide", "metadata", "commentLines", specnames)
  }

  ##### And we're done!
  if (debug >= 1) cat("\nDone processing ", file, "\n")
  return(VL)
} # end of readJDX