Nothing
#'
#' Import a File Written in the JCAMP-DX Format
#'
#' This function supervises the entire import process.
#' Not all official formats are supported, see the vignettes.
#' Prior to release, this package is checked against a very large number of files in the
#' author's collection. However, the JCAMP-DX standard allows many variations and it is
#' difficult to anticipate all permutations. Error messages will
#' generally let you know what's going on. If you have a file that you feel should be
#' supported but gives an error, please file an issue at Github and share the file.
#'
#' @param file Character. The file name to import.
#'
#' @param SOFC Logical. "Stop on Failed Check"
#' The default is \code{TRUE} i.e. stop when something is not right.
#' This ensures that correct data is returned. Change to \code{FALSE} at your own risk.
#' NOTE: Only certain checks can be skipped via this option, as there are some
#' parameters that must be available and correct in order to return \emph{any} answer.
#' For instance, one must end up with the same number of X and Y values.
#' This option is provided for those \pkg{advanced
#' users} who have carefully checked their original files and want to skip the
#' required checks. It may also be useful for troubleshooting.
#' The JCAMP-DX standard typically requires
#' several checks of the data as it is decompressed. These checks are essential
#' to obtaining the correct results. However, some JCAMP-DX writing programs
#' do not follow the standard to the letter. For instance we have observed that
#' not all JCAMP-DX writers put FIRSTY into the metadata, even though it is required by
#' the standard. In other cases values in the file have low precision (see section on precision).
#' Another example we have observed is NMR files where the X values are the count/index of data points,
#' and FIRSTY is given in Hz. Since the field strength and center of the sweep frequency are needed
#' to convert to ppm, and these are parameters not required in the standard, one cannot return an
#' answer in either ppm or Hz automatically.
#' In cases like this, one can set \code{SOFC = FALSE} and then manually convert the X axis.
#'
#' @param debug Integer. The level of debug reporting desired. For those options giving
#' a lot of output, you may wish to consider directing the output via \code{\link{sinkall}}
#' and then search the results for the problematic lines.
#' \itemize{
#' \item 1 or higher = import progress is reported.
#' \item 2 = details about the variable lists, compression formats and
#' parameters that were found.
#' \item 3 = print the extracted X values (huge output!).
#' \item 4 = detailed info on the Y value processing (huge output!).
#' \item 5 = detailed info about processing the Y values when DUP is in use (huge output!).
#' \item 6 = detailed info about processing the Y values when DIF is in use (huge output!).
#' }
#' In cases when an error is about to stop execution, you get additional information regardless of
#' the \code{debug} value.
#'
#' @return A list, as follows:
#'
#' \itemize{
#' \item The first element is a data frame summarizing the pieces of the imported file.
#' \item The second element is the file metadata.
#' \item The third element is a integer vector giving the comment lines found
#' (exclusive of the metdata, which typically contains many comments).
#' }
#'
#' Additional elements contain the extracted data as follows:
#'
#' \itemize{
#' \item If the file contains one non-NMR spectrum, or a processed NMR spectrum (i.e. only
#' the final real data), a single data frame.
#' \item If the file contains the real and imaginary
#' parts of a 1D NMR spectrum, there will be two data frames, one containing the real portion
#' and the other the imaginary portion.
#' \item In all cases above, the data frame has elements \code{x} and \code{y}.
#' \item In the case of 2D NMR data, additional named list elements are returned including
#' the F2 frequency values, the F1 frequency values, and a matrix containing the 2D data.
#' \item In the case of LC-MS or GC-MS data, a data frame is returned for each time point.
#' The data frame has elements \code{mz} and \code{int} (intensity). Each time point
#' is named with the time from the file.
#' }
#'
#' @seealso Do \code{browseVignettes("readJDX")} for background information,
#' references, supported formats, and details about the roles of each function.
#' If you have a multiblock file (which contains multiple spectra, but not 2D NMR,
#' LC-MS or GC-MS data sets), please see
#' \code{\link{splitMultiblockDX}} for a function to break such files into
#' individual files which can then be processed in the normal way.
#'
#' @section Included Data Files:
#' The examples make use of data files included with the package:
#' \itemize{
#' \item File \code{SBO.jdx} is an IR spectrum of Smart Balance Original spread (a butter
#' substitute). The spectrum is presented in transmission format, and was recorded on a
#' ThermoFisher instrument. The file uses AFFN compression, and was written
#' with the JCAMP-DX 5.01 standard. Note that even though the Y-axis was recorded in
#' percent transmission, in the JDX file it is stored on [0\ldots1].
#' \item File \code{PCRF.jdx} is a 1H NMR spectrum of a hexane extract of a reduced fat potato chip.
#' The spectrum was recorded on a JEOL instrument. The file uses SQZ DIF DUP compression,
#' and was written with the JCAMP-DX 6.00 standard.
#' \item File \code{PCRF_line265.jdx} has a deliberate error in it.
#' \item File \code{isasspc1.jdx} is a 2D NMR file recorded on a JEOL GX 400 instrument.
#' The file is freely available at \url{http://www.jcamp-dx.org/}.
#' \item File \code{MiniDIFDUP.JDX} is a small demonstration file, used in the vignettes to
#' illustrate the decompression process. It is derived from a freely available file.
#' }
#'
#' @section Precision:
#' Internally, this package uses a tolerance factor when comparing values during certain checks.
#' This is desirable because the original values in the files
#' are text strings of varying lengths which get converted to numerical values by \code{R}. Occasionally
#' values in the file, such as FIRSTY, are stored with low precision, and the computation of the
#' value to be compared occurs with much greater precision. In these cases the check can fail
#' even when the tolerance is pretty loose. In these cases one might consider setting
#' \code{SOFC = FALSE} to allow the calculation to proceed. If you do this, be certain to check
#' the results carefully as described under \code{SOFC}.
#'
#' @section Y Value Check:
#' The standard requires a "Y Value Check" when in DIF mode. Extra Y values have been appended to each
#' line to use in the check, and the last Y value on a line must equal the first Y value on the next line
#' \emph{IFF} one is in DIF mode. After a successful check, the extra Y value must be removed. In actual practice,
#' some vendors, at least some of the time, seem to differ as to the meaning of
#' "being in DIF mode". In turn, this determines how the Y value check should proceed.
#' \itemize{
#' \item The standard says "When, and only when, the last ordinate of a line is in DIF form ...
#' The first ordinate of the next line ... is always an actual value, equal to the last
#' calculated ordinate of the previous line". See section 5.8.3 of the 1988 publication.
#' \item Taking this definition literally, the Y value check (and removal of the extra value),
#' should occur when one sees e.g. ... DIF DIF DIF (end of line). Let's call
#' this "literal DIF". A literal DIF is easy to detect and act on.
#' \item In other cases, something like ... DIF DUP DUP (end of line) is considered to be in DIF mode
#' for Y value check purposes. In these cases we have look backwards to see if we are in DIF mode.
#' Let's call this "relayed DIF".
#' \item However, some vendors may treat ... DIF DUP DUP (end of line) as not in DIF mode, and hence
#' one should not do the Y value check and not remove any values, as this vendor would not have
#' added an extra Y value.
#' \item In addition to these three possibilities, \code{readJDX} through versions 0.3.xx used a different
#' definition, namely if there were any DIF entries anywhere on the line, then DIF mode was
#' assumed and the Y value check carried out. This worked for many files, but not all.
#' \item In the 0.4.xx series, \code{readJDX} detects both the literal and relayed definitions and
#' tries to keep moving forward as much as possible.
#' }
#'
#' @section Performance:
#' \code{readJDX} is not particularly fast. Priority has been given to assuring correct answers,
#' helpful debugging messages and understandable code.
#'
#' @export
#'
#' @importFrom stringr str_trim
#'
#' @examples
#' # IR spectrum
#' sbo <- system.file("extdata", "SBO.jdx", package = "readJDX")
#' chk <- readJDX(sbo)
#' plot(chk[[4]]$x, chk[[4]]$y / 100,
#' type = "l", main = "Original Smart Balance Spread",
#' xlab = "wavenumber", ylab = "Percent Transmission"
#' )
#'
#' # 1H NMR spectrum
#' pcrf <- system.file("extdata", "PCRF.jdx", package = "readJDX")
#' chk <- readJDX(pcrf)
#' plot(chk[[4]]$x, chk[[4]]$y,
#' type = "l", main = "Reduced Fat Potato Chip Extract",
#' xlab = "ppm", ylab = "Intensity"
#' )
#'
#' # Capturing processing for troubleshooting
#' mdd <- system.file("extdata", "MiniDIFDUP.JDX", package = "readJDX")
#' tf <- tempfile(pattern = "Troubleshooting", fileext = "txt")
#' sinkall(tf)
#' chk <- readJDX(mdd, debug = 6)
#' sinkall() # close the file connection
#' file.show(tf)
#'
#' # 2D HETCORR spectrum
#' \dontrun{
#' nmr2d <- system.file("extdata", "isasspc1.dx", package = "readJDX")
#' chk <- readJDX(nmr2d)
#' contour(chk$Matrix, drawlabels = FALSE) # default contours not optimal
#' }
#'
#' \dontrun{
#' # Line 265 has an N -> G error. Try with various levels of debug.
#' # Even with debug = 0 you get useful diagnostic info.
#' problem <- system.file("extdata", "PCRF_line265.jdx", package = "readJDX")
#' chk <- readJDX(problem)
#' }
#'
readJDX <- function(file = "", SOFC = TRUE, debug = 0) {
if (!requireNamespace("stringr", quietly = TRUE)) {
stop("You need to install package stringr to use this function")
}
if (file == "") stop("No file specified")
jdx <- readLines(file)
##### Step 1. Check the overall file structure.
# A data block consists of ##TITLE= up to ##END=
# However, link blocks can be used to contain data blocks, in which
# case one has a compound file. Link blocks and compound files are not supported,
# but a function is available to split them into individual files.
# NMR data sets, including 2D NMR data sets, and LC-MS/GC-MS data sets use a different
# scheme (NTUPLES) to hold multiple data sets.
blocks <- grep("^\\s*##TITLE\\s*=.*", jdx)
nb <- length(blocks)
if (nb == 0) stop("This does not appear to be a JCAMP-DX file")
if (nb > 1) stop("Compound (multi-block / multi-spectra) data sets can not be parsed.\nSee splitMultiblockDX which is a function to split such files into separate files which can be parsed.")
##### Step 2. Locate the parameters and the variable list(s)
VL <- findVariableLists(jdx, debug)
# "fmt" is a character vector extracted by findVariableLists, and reflects how the data is formatted
# in the variable list. "mode" is a length one string derived from fmt and reflects the processing
# needed, in particular which parameters need to be extracted in order to check the data
fmt <- VL[["DataGuide"]][, "Format"][-1]
mode <- NA_character_
if ("XYY" %in% fmt) mode <- "XYY"
if ("XRR" %in% fmt) mode <- "NMR_1D" # these files also contain XII
if ("NMR_2D" %in% fmt) mode <- "NMR_2D"
if ("LC_MS" %in% fmt) mode <- "LC_MS"
if ("PEAK_TABLE" %in% fmt) mode <- "XYXY" # handled the same as the next one
if ("XYXY" %in% fmt) mode <- "XYXY"
if (is.na(mode)) stop("Could not determine the type of data in the file")
if (debug >= 1) cat("\n\nProcessing file", file, "which appears to contain", mode, "data\n")
##### Step 3. Extract the needed parameters
params <- extractParams(VL[[2]], mode, SOFC, debug)
##### Step 4. Process the variable list(s) into the final list that is returned
if ((mode == "XYY") | (mode == "NMR_1D")) {
# Return value is a list: dataGuide, metadata, comment lines + data frames of x, y
# dataGuide, metadata & comments already in place; process each variable list
for (i in 4:length(VL)) {
VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
}
# Fix up names
if (mode == "XYY") {
specnames <- jdx[blocks] # each line with ##TITLE= (there is only one however)
specnames <- str_trim(substring(specnames, 9, nchar(specnames)))
}
if (mode == "NMR_1D") specnames <- c("real", "imaginary")
names(VL) <- c("dataGuide", "metadata", "commentLines", specnames)
}
if (mode == "NMR_2D") {
# Return value is a list: dataGuide, metadata, comment lines, F2, F1, + a matrix w/2D data
# dataGuide, metadata & comments already in place; add F2, F1, M and drop extra stuff
M <- matrix(NA_real_, ncol = params[2], nrow = params[1]) # matrix to store result
for (i in 4:length(VL)) {
tmp <- processVariableList(VL[[i]], params, mode, SOFC, debug)
M[i - 3, ] <- tmp$y
}
# TODO: check for na in M, if any present we did not find enough pages to fill it and something is wrong
# Update VL
VL[[4]] <- sort(seq(params[4], params[6], length.out = params[2])) # replace element 4 with F2
VL[[5]] <- sort(seq(params[3], params[5], length.out = params[1])) # replace element 5 with F1
M <- M[nrow(M):1, ] # reverse order of rows, works for Bruker files (all vendors?)
VL[[6]] <- M # replace element 6 with M
VL <- VL[1:6] # toss the remaining pieces of raw VL
names(VL) <- c("dataGuide", "metadata", "commentLines", "F2", "F1", "Matrix")
}
if (mode == "LC_MS") {
# Return value is a list: dataGuide, metadata, comment lines, a data frame for each time point
# dataGuide, metadata & comments already in place; add data frames for each time point
for (i in 4:length(VL)) {
VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
}
# Get the retention times & use to label list elements
rti <- grep("##PAGE= T=", jdx)
rt <- jdx[rti]
rt <- sub("##PAGE= ", "", rt)
names(VL) <- c("dataGuide", "metadata", "commentLines", rt)
}
if (mode == "XYXY") {
for (i in 4:length(VL)) {
VL[[i]] <- processVariableList(VL[[i]], params, mode, SOFC, debug)
}
specnames <- jdx[blocks] # each line with ##TITLE= (there is only one however)
specnames <- str_trim(substring(specnames, 9, nchar(specnames)))
names(VL) <- c("dataGuide", "metadata", "commentLines", specnames)
}
##### And we're done!
if (debug >= 1) cat("\nDone processing ", file, "\n")
return(VL)
} # end of readJDX
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.