msPurity: Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

Documented in spectralMatching

# msPurity R package for processing MS/MS data - Copyright (C)
#
# This file is part of msPurity.
#
# msPurity is a free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# msPurity is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with msPurity.  If not, see <https://www.gnu.org/licenses/>.

# Note that jsonlite and uuid could potentially be removed from
# imports now - but keeping just incase and to not change dependencies 
# of previous msPurity versions

#' @title Spectral matching for LC-MS/MS datasets
#' @aliases spectralMatching
#' @description
#' **General**
#'
#' Perform spectral matching to spectral libraries for an LC-MS/MS dataset.
#'
#' The spectral matching is performed from a **Query** SQLite spectral-database against a **Library** SQLite spectral-database.
#'
#' The SQLite schema of the spectral database can be detailed Schema details can be found
#' [here](https://bioconductor.org/packages/release/bioc/vignettes/msPurity/inst/doc/msPurity-spectral-datatabase-schema.html).
#'
#' The query spectral-database in most cases should contain be the "unknown" spectra database generated the msPurity
#' function createDatabase as part of a msPurity-XCMS data processing workflow.
#'
#' The library spectral-database in most cases should contain the "known" spectra from either public or user generated resources.
#' The library SQLite database by default contains data from MoNA including Massbank, HMDB, LipidBlast and GNPS.
#' A larger database can be downloaded from [here](https://github.com/computational-metabolomics/msp2db/releases).
#' To create a user generated library SQLite database the following tool can be used to generate a SQLite database
#' from a collection of MSP files: [msp2db](https://github.com/computational-metabolomics/msp2db/releases).
#' It should be noted though, that as long as the schema of the spectral-database is as described [here ](https://bioconductor.org/packages/release/bioc/vignettes/msPurity/inst/doc/msPurity-spectral-database-vignette.html), then any database can be used
#' for either the library or query -  even allowing for the same database to be used.
#'
#' The spectral matching functionality has four main components, spectral filtering, spectral alignment, spectral matching,
#' and summarising the results.
#'
#' Spectral filtering is simply filtering both the library and query spectra to be search against (e.g. choosing
#' the library source, instrument, retention time, precursor PPM tolerance etc).
#'
#' The spectral alignment stage involves aligning the query peaks to the library peaks. The approach used is similar
#' to modified pMatch algorithm described in Zhou et al 2015.
#'
#' The spectral matching of the aligned spectra is performed against a combined intensity and m/z weighted vector - created for both
#' the query and library spectra (wq and wl). See below:
#'
#' \deqn{w=intensity^x * mz^y}
#'
#' Where x and y represent weight factors, defaults to *x*=0.5 and *y*=2 as per MassBank. These can be adjusted by
#' the user though.
#'
#' The aligned weighted vectors are then matched using dot product cosine, reverse dot product cosine and the composite dot product.
#' See below for dot product cosine equation.
#'
#' \deqn{dpc =  wq * wl / \sqrt{\sum wq^2} * \sqrt{\sum wl^2}}
#'
#' See the vigenttes for more details regarding matching algorithms used.
#'
#' **Example LC-MS/MS processing workflow**
#'
#'  * Purity assessments
#'    +  (mzML files) -> purityA -> (pa)
#'  * XCMS processing
#'    +  (mzML files) -> xcms.findChromPeaks -> (optionally) xcms.adjustRtime -> xcms.groupChromPeaks -> (xcmsObj)
#'    +  --- *Older versions of XCMS* --- (mzML files) -> xcms.xcmsSet -> xcms.group -> xcms.retcor -> xcms.group -> (xcmsObj)
#'  * Fragmentation processing
#'    + (xcmsObj, pa) -> frag4feature -> filterFragSpectra -> averageAllFragSpectra -> createDatabase -> **spectralMatching** -> (sqlite spectral database)
#'
#'
#'
#'
#' @param q_dbPth character; Path of the database of queries that will be searched against the library spectra. Generated from createDatabase
#' @param l_dbPth character; path to library spectral SQLite database. Defaults to msPurityData package data.
#'
#' @param q_purity character; Precursor ion purity threshold for the query spectra
#' @param q_ppmProd numeric; ppm tolerance for query product
#' @param q_ppmPrec numeric; ppm tolerance for query precursor
#' @param q_raThres numeric; Relative abundance threshold for query spectra
#' @param q_pol character; Polarity of query spectra ('positive', 'negative', NA).
#' @param q_instrumentTypes vector; Instrument types for query spectra.
#' @param q_instruments vector; Instruments for query spectra (note that this is used in combination with q_instrumentTypes - any
#'                              spectra matching either q_instrumentTypes or q_instruments will be used).
#' @param q_sources vector; Sources of query spectra (e.g. massbank, hmdb).
#' @param q_spectraTypes character; Spectra types of query spectra to perfrom spectral matching e.g. ('scan', 'av_all', 'intra', 'inter')
#' @param q_pids vector; pids for query spectra (correspond to column 'pid; in s_peak_meta)
#' @param q_rtrange vector; retention time range (in secs) of query spectra, first value mininum time and second value max e.g. c(0, 10) is between 0 and 10 seconds
#' @param q_spectraFilter boolean; For query spectra, if prior filtering performed with msPurity, flag peaks will be removed from spectral matching
#' @param q_xcmsGroups vector; XCMS group ids for query spectra
#' @param q_accessions vector; accession ids to filter query spectra
#'
#' @param l_purity character; Precursor ion purity threshold for the library spectra (uses interpolated purity - inPurity)
#' @param l_ppmProd numeric; ppm tolerance for library product
#' @param l_ppmPrec numeric; ppm tolerance for library precursor
#' @param l_raThres numeric; Relative abundance threshold for library spectra
#' @param l_pol character; Polarity of library spectra ('positive', 'negative', NA)
#' @param l_instrumentTypes vector; Instrument types for library spectra.
#' @param l_instruments vector; Instruments for library spectra (note that this is used in combination with q_instrumentTypes - any
#'                              spectra matching either q_instrumentTypes or q_instruments will be used).
#' @param l_sources vector; Sources of library spectra (e.g. massbank, hmdb).
#' @param l_spectraTypes vector; Spectra type of library spectra to perfrom spectral matching with e.g. ('scan', 'av_all', 'intra', 'inter')
#' @param l_pids vector; pids for library spectra (correspond to column 'pid; in s_peak_meta)
#' @param l_rtrange vector; retention time range (in secs) of library spectra, first value mininum time and second value max e.g. c(0, 10) is between 0 and 10 seconds
#' @param l_spectraFilter boolean; For library spectra, if prior filtering performed with msPurity, flag peaks will be removed from spectral matching
#' @param l_xcmsGroups vector; XCMS group ids for library spectra
#' @param l_accessions vector; accession ids to filter library spectra
#'
#' @param usePrecursors boolean; If TRUE spectra will be filtered by similarity of precursors based on ppm range defined by l_ppmPrec and q_ppmPrec
#' @param raW numeric; Relative abundance weight for spectra (default to 0.5 as determined by massbank for ESI data)
#' @param mzW numeric; mz weight for spectra (default to 2 as determined by massbank for ESI data)
#' @param rttol numeric ; Tolerance in time range between the library and query spectra retention time
#'
#' @param cores numeric; Number of cores to use
#' @param updateDb boolean; Update the Query SQLite database with the results
#' @param copyDb boolean; If updating the database - perform on a copy rather thatn the original query database
#' @param outPth character; If copying the database - the path of the new database file
#'
#' @param q_dbType character; Query database type for compound database can be either (sqlite, postgres or mysql)
#' @param q_dbName character; Query database name (only applicable for postgres and mysql)
#' @param q_dbHost character; Query database host (only applicable for postgres and mysql)
#' @param q_dbPort character; Query database port (only applicable for postgres and mysql)
#' @param q_dbUser character; Query database user (only applicable for postgres and mysql)
#' @param q_dbPass character; Query database pass - Note this is not secure! use with caution (only applicable for postgres and mysql)
#'
#' @param l_dbType character; Library database type for compound database can be either (sqlite, postgres or mysql)
#' @param l_dbName character; Library database name (only applicable for postgres and mysql)
#' @param l_dbHost character; Library database host (only applicable for postgres and mysql)
#' @param l_dbPort character; Library database port (only applicable for postgres and mysql)
#' @param l_dbUser character; Library database user (only applicable for postgres and mysql)
#' @param l_dbPass character; Library database pass - Note this is not secure! use with caution (only applicable for postgres and mysql)
#'
#' @return Returns a list containing the following elements
#'
#' **q_dbPth**
#'
#' Path of the query database (this will have been updated with the annotation results if updateDb argument used)
#'
#'**xcmsMatchedResults**
#'
#'  If the qeury spectra had XCMS based chromotographic peaks tables (e.g c_peak_groups, c_peaks) in the sqlite database - it will
#' be possible to summarise the matches for each XCMS grouped feature. The dataframe contains the following columns
#'
#' * lpid - id in database of library spectra
#' * qpid - id in database of query spectra
#' * dpc - dot product cosine of the match
#' * rdpc - reverse dot product cosine of the match
#' * cdpc - composite dot product cosine of the match
#' * mcount - number of matching peaks
#' * allcount - total number of peaks across both query and library spectra
#' * mpercent - percentage of matching peaks across both query and library spectra
#' * library_rt - retention time of library spectra
#' * query_rt - retention time of query spectra
#' * rtdiff - difference between library and query retention time
#' * library_precursor_mz - library precursor mz
#' * query_precursor_mz - query precursor mz
#' * library_precursor_ion_purity - library precursor ion purity
#' * query_precursor_ion_purity - query precursor ion purity
#' * library_accession -  library accession value (unique string or number given to eith MoNA or Massbank data entires)
#' * library_precursor_type - library precursor type (i.e. adduct)
#' * library_entry_name - Name given to the library spectra
#' * inchikey - inchikey of the matched library spectra
#' * library_source_name - source of the spectra (e.g. massbank, gnps)
#' * library_compound_name - name of compound spectra was obtained from
#'
#' **matchedResults**
#'
#' All matched results from the query spectra to the library spectra. Contains the same columns as above
#' but without the XCMS details. This table is useful to observe spectral matching results
#' for all MS/MS spectra irrespective of if they are linked to XCMS MS1 features.
#'
#' @return list of database details and dataframe summarising the results for the xcms features
#'
#' @examples
#' #====== XCMS =================================
#' ## Read in MS data
#' #msmsPths <- list.files(system.file("extdata", "lcms", "mzML",
#' #           package="msPurityData"), full.names = TRUE, pattern = "MSMS")
#' #ms_data = readMSData(msmsPths, mode = 'onDisk', msLevel. = 1)
#'
#' ## Find peaks in each file
#' #cwp <- CentWaveParam(snthresh = 5, noise = 100, ppm = 10, peakwidth = c(3, 30))
#' #xcmsObj  <- xcms::findChromPeaks(ms_data, param = cwp)
#'
#' ## Optionally adjust retention time
#' #xcmsObj  <- adjustRtime(xcmsObj , param = ObiwarpParam(binSize = 0.6))
#'
#' ## Group features across samples
#' #pdp <- PeakDensityParam(sampleGroups = c(1, 1), minFraction = 0, bw = 30)
#' #xcmsObj <- groupChromPeaks(xcmsObj , param = pdp)
#'
#' #====== msPurity ============================
#' #pa  <- purityA(msmsPths)
#' #pa <- frag4feature(pa = pa, xcmsObj = xcmsObj)
#' #pa <- filterFragSpectra(pa, allfrag=TRUE)
#' #pa <- averageAllFragSpectra(pa)
#' #q_dbPth <- createDatabase(pa, xcmsObj, metadata=list('polarity'='positive','instrument'='Q-Exactive'))
#' #sm_result <- spectralMatching(q_dbPth, cores=4, l_pol='positive')
#'
#' td <- tempdir()
#' q_dbPth <- system.file("extdata", "tests", "db", "createDatabase_example.sqlite", package="msPurity")
#'
#' rid <- paste0(paste0(sample(LETTERS, 5, TRUE), collapse=""),  paste0(sample(9999, 1, TRUE), collapse=""), ".sqlite")
#' sm_out_pth <- file.path(td, rid)
#'
#' result <- spectralMatching(q_dbPth, q_xcmsGroups = c(53, 89, 410), cores=1, l_accessions = c('PR100407', 'ML005101', 'CCMSLIB00003740024'),
#'                           q_spectraTypes = 'av_all',
#'                           updateDb = TRUE,
#'                           copyDb = TRUE,
#'                           outPth = sm_out_pth)
#'
#'
#' @md
#' @export
#' @import dbplyr
spectralMatching <- function(
                              q_dbPth,
                              l_dbPth=NA,

                              q_purity=NA,
                              q_ppmProd=10,
                              q_ppmPrec=5,
                              q_raThres=NA,
                              q_pol=NA,
                              q_instrumentTypes=NA,
                              q_instruments=NA,
                              q_sources=NA,
                              q_spectraTypes=c('av_all', 'inter'),
                              q_pids=NA,
                              q_rtrange=c(NA, NA),
                              q_spectraFilter=TRUE,
                              q_xcmsGroups=NA,
                              q_accessions=NA,

                              l_purity=NA,
                              l_ppmProd=10,
                              l_ppmPrec=5,
                              l_raThres=NA,
                              l_pol='positive',
                              l_instrumentTypes=NA,
                              l_instruments=NA,
                              l_sources=NA,
                              l_spectraTypes=NA,
                              l_pids=NA,
                              l_rtrange=c(NA, NA),
                              l_spectraFilter=FALSE,
                              l_xcmsGroups=NA,
                              l_accessions=NA,
                              usePrecursors=TRUE,
                              raW=0.5,
                              mzW=2,
                              rttol=NA,

                              q_dbType='sqlite',
                              q_dbName=NA,
                              q_dbHost=NA,
                              q_dbUser=NA,
                              q_dbPass=NA,
                              q_dbPort=NA,

                              l_dbType='sqlite',
                              l_dbName=NA,
                              l_dbHost=NA,
                              l_dbUser=NA,
                              l_dbPass=NA,
                              l_dbPort=NA,

                              cores=1,
                              updateDb=FALSE,
                              copyDb=FALSE,
                              outPth='sm_result.sqlite'){
  message("Running msPurity spectral matching function for LC-MS(/MS) data")
  # Call dbplyr explicitly here (so that the bioconductor checks know that we use it)
  dbplyre <- dbplyr::dbplyr_edition()
  
  if(updateDb && copyDb){
    file.copy(from = q_dbPth, to=outPth)
    q_dbPth <- outPth
  }


  if (is.na(l_dbPth)){
    l_dbPth <- system.file("extdata", "library_spectra", "library_spectra.db", package="msPurityData")
  }

  ########################################################
  # Filter the query dataset
  ########################################################
  message("Filter query dataset")
  q_con <- connect2db(pth=q_dbPth,
                      type=q_dbType,
                      user=q_dbUser,
                      pass=q_dbPass,
                      dbname=q_dbName,
                      host=q_dbHost,
                      port=q_dbPort)

  q_speakmeta <- filterSMeta(purity =q_purity,
                             pol = q_pol,
                             instrumentTypes = q_instrumentTypes,
                             instruments = q_instruments,
                             sources = q_sources,
                             pids = q_pids,
                             rtrange = q_rtrange,
                             con = q_con,
                             xcmsGroups = q_xcmsGroups,
                             spectraTypes = q_spectraTypes,
                             accessions = q_accessions)

  ########################################################
  # Filter the library dataset
  ########################################################
  message("Filter library dataset")
  l_con <- connect2db(pth=l_dbPth,
                      type=l_dbType,
                      user=l_dbUser,
                      pass=l_dbPass,
                      dbname=l_dbName,
                      host=l_dbHost,
                      port=l_dbPort)


  l_speakmeta <- filterSMeta(purity = l_purity,
                             raThres = l_raThres,
                             pol = l_pol,
                             instrumentTypes = l_instrumentTypes,
                             instruments = l_instruments,
                             sources = l_sources,
                             pids = l_pids,
                             rtrange = l_rtrange,
                             con = l_con,
                             xcmsGroups = l_xcmsGroups,
                             spectraTypes = l_spectraTypes,
                             accessions = l_accessions)

  ########################################################
  # Loop through the query dataset and spectra match
  # against the library spectra
  ########################################################
  # can't parallize dpylr without non cran package
  # Go back to using good old plyr (but we need to connect to the database() each time
  # we as we can't use the same connection for multiple cores..
  dbDetails <- list(
    'q'=list(pth=q_dbPth,
             type=q_dbType,
             user=q_dbUser,
             pass=q_dbPass,
             dbname=q_dbName,
             host=q_dbHost,
             port=q_dbPort

    ),
    'l'=list(pth=l_dbPth,
             type=l_dbType,
             user=l_dbUser,
             pass=l_dbPass,
             dbname=l_dbName,
             host=l_dbHost,
             port=l_dbPort

    ))


  q_fpids <- pullPid(q_speakmeta)

  l_fpids <- pullPid(l_speakmeta)


  message('aligning and matching')


  # Check cores and choose if parallel or not (do or dopar)
  if(cores>1){
    cl<-parallel::makeCluster(cores, type = "SOCK")
    doSNOW::registerDoSNOW(cl)
    parallel = TRUE
  } else{
    parallel = FALSE
  }

  # run parallel (or not) using foreach
  matched <- plyr::adply(q_fpids, 1, queryVlibrary, l_pids=l_fpids,
                         q_ppmPrec=q_ppmPrec,
                         q_ppmProd=q_ppmProd,
                         l_ppmPrec=l_ppmPrec,
                         l_ppmProd=l_ppmProd,
                         l_spectraFilter=l_spectraFilter,
                         q_spectraFilter=q_spectraFilter,
                         l_raThres=l_raThres,
                         q_raThres=q_raThres,
                         usePrecursors=usePrecursors,
                         mzW=mzW,
                         raW=raW,
                         rttol=rttol,
                         dbDetails=dbDetails,
                         .parallel=parallel)


  if(cores>1){
    parallel::stopCluster(cl)
  }

  if (nrow(matched)==0){
    message('No matches found')
    return(NULL)
  }

  # remove the plyr id column
  matched <- matched[,!names(matched)=='X1']
  matched$mid <- 1:nrow(matched)

  # ensure numeric
  nmCols <- c("mid", "dpc","rdpc","cdpc","mcount", "allcount", "mpercent", "lpid", "qpid")
  matched[,nmCols] <- as.numeric(as.character(unlist(matched[,nmCols])))

  # make sure all NA values are fully NA values

  # Add rtdiff
  matched$rtdiff <- as.numeric(matched$library_rt)-as.numeric(matched$query_rt)

  # Sort out order
  matched <- matched[c('qpid', 'mid', 'dpc', 'rdpc', 'cdpc', 'mcount',
                       'allcount', 'mpercent','lpid', 'library_rt', 'query_rt', 'rtdiff',
                       'library_precursor_mz', 'query_precursor_mz',
                       'library_precursor_ion_purity', 'query_precursor_ion_purity',
                       'library_accession', 'library_precursor_type',
                       'library_entry_name',  'inchikey')]


  # Add information from other tables
  if (DBI::dbExistsTable(l_con, "library_spectra_source")){
    additional_details <- DBI::dbGetQuery(l_con, sprintf('SELECT lsm.id AS lpid,
                                                         s.name AS library_source_name,
                                                         mc.name AS library_compound_name
                                                         FROM library_spectra_source AS s
                                                         LEFT JOIN
                                                         library_spectra_meta AS lsm ON lsm.library_spectra_source_id = s.id
                                                         LEFT JOIN
                                                         metab_compound AS mc ON mc.inchikey_id = lsm.inchikey_id
                                                         WHERE lsm.id IN (%s)', paste(unique(matched$lpid), collapse=",")))
  }else if (DBI::dbExistsTable(l_con, "source")) {

    additional_details <- DBI::dbGetQuery(l_con, sprintf('SELECT pid AS lpid,
                                                         s.name AS library_source_name,
                                                         mc.name AS library_compound_name
                                                         FROM source AS s
                                                         LEFT JOIN
                                                         s_peak_meta AS lsm ON lsm.sourceid = s.id
                                                         LEFT JOIN
                                                         metab_compound AS mc ON mc.inchikey_id = lsm.inchikey_id
                                                         WHERE lsm.pid IN (%s)', paste(unique(matched$lpid), collapse=",")))
  }else{
    additional_details <- NULL
  }

  if (nrow(additional_details)==0){
    additional_details <- NULL
  }

  if(!is.null(additional_details)){
    matched <- merge(matched, additional_details, by='lpid')
  }


  if (updateDb){

    custom_dbWriteTable(name_pk = 'mid', df=matched, fks=NA, table_name = 'sm_matches', con = q_con)

    if (DBI::dbExistsTable(l_con, "metab_compound") ){
      # Schema needs to be updated to be more generic

      if (DBI::dbExistsTable(l_con, "library_spectra_meta") ){
        l_compounds <- DBI::dbGetQuery(l_con, sprintf('SELECT  DISTINCT c.* FROM library_spectra_meta AS m
                                                      LEFT JOIN metab_compound AS
                                                      c on c.inchikey_id=m.inchikey_id
                                                      WHERE m.id IN (%s)', paste(unique(matched$lpid), collapse=",")) )
      }else{
        l_compounds <- DBI::dbGetQuery(l_con, sprintf('SELECT  DISTINCT c.* FROM s_peak_meta AS m
                                                      LEFT JOIN metab_compound AS
                                                      c on c.inchikey_id=m.inchikey_id
                                                      WHERE m.pid IN (%s)', paste(unique(matched$lpid), collapse=",")) )

      }

      if(nrow(l_compounds)>0){
        l_compounds <- data.frame(lapply(l_compounds, as.character), stringsAsFactors=FALSE)

        if(DBI::dbExistsTable(q_con, "metab_compound")){
          q_compounds <- q_con %>% dplyr::tbl("metab_compound")
          q_compounds <- q_compounds %>% dplyr::collect()
          q_inchi <- q_compounds$inchikey_id
          if(length(q_inchi)>0){
            l_compounds <- l_compounds[!l_compounds$inchikey_id %in%  q_inchi,]
          }
          if(length(l_compounds$inchikey_id[!is.na(l_compounds$inchikey_id)])>0){
            DBI::dbWriteTable(q_con, name="metab_compound", value=l_compounds, row.names=FALSE, append=TRUE)
          }
        }else{
          custom_dbWriteTable(name_pk = 'inchikey_id', fks = NA,
                              df=l_compounds, table_name = 'metab_compound', con = q_con, pk_type='TEXT')
        }
      }

      if (DBI::dbExistsTable(l_con, "library_spectra_meta") ){
        library_spectra_meta <- DBI::dbGetQuery(l_con, sprintf('SELECT  * FROM library_spectra_meta AS m
                                                               WHERE m.id IN (%s)', paste(unique(matched$lpid), collapse=",")) )
        pk = 'id'
      }else{
        library_spectra_meta <- DBI::dbGetQuery(l_con, sprintf('SELECT  * FROM s_peak_meta AS m
                                                               WHERE m.pid IN (%s)', paste(unique(matched$lpid), collapse=",")) )
        pk = 'pid'
      }

      #fk_l = list('inchikey_id'=list('new_name'='inchikey_id', 'ref_name'='inchikey_id', 'ref_table'='metab_compound'))

      custom_dbWriteTable(name_pk = pk, fks = NA,
                          df=library_spectra_meta, table_name = 'l_s_peak_meta', con = q_con)

    }

      }

  ########################################################
  # Create a summary table for xcms grouped objects
  ########################################################
  if (DBI::dbExistsTable(q_con, "c_peak_groups")){
    # check if the query is from an XCMS object
    message("Summarising LC feature annotations")

    xcmsMatchedResults <- getXcmsSmSummary(q_con, matched,spectraTypes=q_spectraTypes)
    if (updateDb && !is.null(xcmsMatchedResults)){
      DBI::dbWriteTable(q_con, name='xcms_match', value=xcmsMatchedResults, row.names=FALSE, append=TRUE)
    }

  }else{
    xcmsMatchedResults <- NA
  }

  DBI::dbDisconnect(q_con)
  DBI::dbDisconnect(l_con)

  return(list('q_dbPth' = q_dbPth, 'matchedResults' = matched, 'xcmsMatchedResults' = xcmsMatchedResults))
  }

# filterPid <- function(sp, pids){
#   nms <- names(sp %>% dplyr::collect())
#   if ("library_spectra_meta_id" %in% nms){
#     sp <- sp %>% dplyr::filter(library_spectra_meta_id %in% pids)
#   }else if ("pid" %in% nms){
#     sp <- sp %>% dplyr::filter(pid %in% pids)
#   }else{
#     sp <- sp %>% dplyr::filter(id %in% pids)
#   }
#   return(sp)
# }

pullPid <- function(sp, pids){
  tble <- sp %>% dplyr::collect()
  nms <- colnames(tble)

  if ("pid" %in% nms){
    pids <- tble$pid
  }else{
    pids <- tble$id
  }

  return(pids)
}

getScanPeaksSqlite <- function(con, spectraFilter=TRUE, spectraTypes=NA, raThres=NA, pids=NA){


  if (DBI::dbExistsTable(con, "s_peaks")){
    speaks <- con %>% dplyr::tbl("s_peaks")
  }else if (DBI::dbExistsTable(con, "library_spectra")) {
    # old sqlite format
    speaks <- con %>% dplyr::tbl("library_spectra")
  }else{
    stop('No spectra available')
  }

  cn <- getPeakCols(con)

  if ('pid' %in% cn$name  && !anyNA(pids)){
    speaks <- speaks %>% dplyr::filter(pid %in% pids)
  }else if("library_spectra_meta_id" %in% cn$name  && !anyNA(pids)){
    speaks <- speaks %>% dplyr::filter(library_spectra_meta_id %in% pids)
  }

  if ('pass_flag' %in% cn$name && spectraFilter ){
    speaks <- speaks %>% dplyr::filter(pass_flag==TRUE)
  }

  if ('type' %in% cn$name && !anyNA(spectraTypes)){
    speaks <- speaks %>% dplyr::filter(type %in% spectraType)
  }

  if ('ra' %in% cn$name && !is.na(raThres)){
    speaks <- speaks %>% dplyr::filter(ra>raThres)
  }

  return(speaks)



}

getXcmsSmSummary <- function(con, matched, scoreF=0, fragNmF=1, spectraTypes='scan'){

  if ('scan' %in% spectraTypes){
    sqlStmt <- sprintf("SELECT c_peak_groups.*,
                               cp.cid as c_peak_cid,
                               cp.mz as c_peak_mz,
                               cp.rt as c_peak_rt,
                               s_peak_meta.pid,
                               s_peak_meta.precursorScanNum
                               FROM c_peak_groups
                       LEFT JOIN c_peak_X_c_peak_group AS cXg ON cXg.grpid=c_peak_groups.grpid
                       LEFT JOIN c_peaks AS cp on cp.cid=cXg.cid
                       LEFT JOIN c_peak_X_s_peak_meta AS cXs ON cXs.cid=cp.cid
                       LEFT JOIN s_peak_meta ON cXs.pid=s_peak_meta.pid
                       WHERE s_peak_meta.pid in (%s)", paste(unique(matched$qpid), collapse=','))



  }else{
    sqlStmt <- sprintf("SELECT cpg.*, spm.pid FROM c_peak_groups AS cpg
                       LEFT JOIN s_peak_meta AS spm ON cpg.grpid=spm.grpid
                       WHERE spm.pid in (%s)", paste(unique(matched$qpid), collapse=','))
  }

  xcmsGroupedPeaks <- DBI::dbGetQuery(con, sqlStmt)

  xcmsMatchedResults <- merge(xcmsGroupedPeaks, matched, by.x='pid', by.y='qpid')


  if(nrow(xcmsMatchedResults)==0){
    message('NO MATCHES FOR XCMS')
    return(NULL)
  }
  # Remove pid (is duplicate)
  #xcmsMatchedResults <- xcmsMatchedResults[ , !(names(xcmsMatchedResults) %in% c('pid'))]


  xcmsMatchedResults <- xcmsMatchedResults[order(xcmsMatchedResults$grpid, -as.numeric(xcmsMatchedResults$dpc)),]


  return(xcmsMatchedResults)


}

filterSMeta <- function(purity=NA,
                        raThres=0,
                        pol='positive',
                        instrumentTypes=NA,
                        instruments=NA,
                        sources=NA,
                        xcmsGroups=NA,
                        pids=NA,
                        rtrange=c(NA, NA),
                        spectraTypes=NA,
                        accessions=NA,
                        con){

  # get column names
  # PRAGMA table_info();
  meta_cn <- getMetaCols(con)

  speakmeta <- getSmeta(con, pids)

  if('accession' %in% meta_cn$name && !anyNA(accessions)){
    speakmeta <- speakmeta %>% dplyr::filter(accession %in% accessions)
  }
  #print('accession')
  #print(speakmeta)

  if('inPurity' %in% meta_cn$name && !is.na(purity)){
    speakmeta <- speakmeta %>% dplyr::filter(inPurity > purity)
  }
  #print('purity')
  #print(speakmeta)

  if ('polarity' %in% meta_cn$name && !is.na(pol)){
    speakmeta <- speakmeta %>% dplyr::filter(lower(polarity) == lower(pol))
  }

  #print('polarity')
  #print(speakmeta)


  if ('instrument_type' %in% meta_cn$name &&  'instrument' %in% meta_cn$name &&  !anyNA(instrumentTypes)  && !anyNA(instruments)){
    speakmeta <- speakmeta %>% dplyr::filter(instrument_type %in% instrumentTypes || instrument %in% instruments)
  }else if ('instrument_type' %in% meta_cn$name && !anyNA(instrumentTypes)){
    speakmeta <- speakmeta %>% dplyr::filter(instrument_type %in% instrumentTypes)
  }else if ('instrument' %in% meta_cn$name && !anyNA(instruments)){
    speakmeta <- speakmeta %>% dplyr::filter(instrument %in% instruments)
  }

  #print('instruments')
  #print(speakmeta)


  if(!anyNA(sources)){
    if (DBI::dbExistsTable(con, "library_spectra_source")){
      sourcetbl <- con %>% dplyr::tbl("library_spectra_source")
      speakmeta <- speakmeta %>% dplyr::left_join(sourcetbl,  by=c('library_spectra_source_id'='id'),suffix = c("", ".y")) %>%
        dplyr::filter(name.y %in% sources)

    }else if (DBI::dbExistsTable(con, "source")){
      sourcetbl <- con %>% dplyr::tbl("source")
      speakmeta <- speakmeta %>% dplyr::left_join(sourcetbl,  by=c('sourceid'='id'),suffix = c("", ".y")) %>%
        dplyr::filter(name.y %in% sources)
    }
  }

  #print('sources')
  #print(speakmeta)



  if('retention_time' %in% meta_cn$name && !anyNA(rtrange)){
    speakmeta <- speakmeta %>% dplyr::filter(retention_time >= rtrange[1] & retention_time <= rtrange[2])
  }

  #print('rtrange')
  #print(speakmeta)


  if ( !anyNA(xcmsGroups) && DBI::dbExistsTable(con, "c_peak_groups")){

    XLI <- DBI::dbGetQuery(con, paste0('SELECT cpg.grpid, spm.pid FROM s_peak_meta AS spm
                                       LEFT JOIN c_peak_X_s_peak_meta AS cXs ON cXs.pid=spm.pid
                                       LEFT JOIN c_peaks AS cp on cp.cid=cXs.cid
                                       LEFT JOIN c_peak_X_c_peak_group AS cXg ON cXg.cid=cp.cid
                                       LEFT JOIN c_peak_groups AS cpg ON cXg.grpid=cpg.grpid
                                       WHERE cpg.grpid in (', paste(xcmsGroups, collapse = ','), ')'))

    xcmsGroups <- as.character(xcmsGroups)

    # doesn't work with database calls on travis (have to split into to filters)
    # speakmeta <- speakmeta %>% dplyr::filter(grpid %in% xcmsGroups | pid %in% XLI$pid)
    metaGrpPids <- speakmeta %>% dplyr::filter(grpid %in% xcmsGroups) %>% dplyr::pull(pid)
    allGrpPids <- unique(c(XLI$pid,metaGrpPids))
    speakmeta <- speakmeta %>% dplyr::filter(pid %in% allGrpPids)



  }

  #print('xcms groups')
  #print(speakmeta)


  if ('spectrum_type' %in% meta_cn$name && !anyNA(spectraTypes)){
    if ('av_all' %in% spectraTypes){
      spectraTypes[spectraTypes=='av_all'] = 'all'
    }

    speakmeta <- speakmeta %>% dplyr::filter(spectrum_type %in% spectraTypes)

  }


  return(speakmeta)


}

filterPrecursors <- function(q_pid, q_speakmeta, l_speakmeta, q_ppmPrec, l_ppmPrec){

  return(l_speakmetaFiltered)

}

getSmeta <- function(con, pids=NA){
  if (DBI::dbExistsTable(con, "s_peak_meta")){
    speakmeta <- con %>% dplyr::tbl("s_peak_meta")
    if (!anyNA(pids)){
      speakmeta <- speakmeta  %>%  dplyr::filter(pid %in% pids)
    }

  }else if (DBI::dbExistsTable(con, "library_spectra_meta")) {
    # old sqlite format
    speakmeta <- con %>% dplyr::tbl("library_spectra_meta")
    if (!anyNA(pids)){

      speakmeta <- speakmeta  %>% dplyr::filter(id %in% pids)
    }

  }else{
    stop('No meta data for spectra available')
  }
  return(speakmeta)
}

getMetaCols <- function(con){
  if(DBI::dbExistsTable(con, "s_peak_meta")){
    meta_cn <- DBI::dbGetQuery(con, 'PRAGMA table_info(s_peak_meta)')
  }else{
    meta_cn <- DBI::dbGetQuery(con, 'PRAGMA table_info(library_spectra_meta)')
  }
  return(meta_cn)
}

getPeakCols <- function(con){
  if(DBI::dbExistsTable(con, "s_peak_meta")){
    cn <- DBI::dbGetQuery(con, 'PRAGMA table_info(s_peaks)')
  }else{
    cn <- DBI::dbGetQuery(con, 'PRAGMA table_info(library_spectra)')
  }
  return(cn)
}

queryVlibrary <- function(q_pid, l_pids, q_ppmPrec, q_ppmProd, l_ppmPrec, l_ppmProd,
                          l_spectraFilter, q_spectraFilter, l_raThres, q_raThres, usePrecursors, mzW, raW, rttol,
                          dbDetails){

  q_con <- connect2db(pth=dbDetails$q$pth,
                      type=dbDetails$q$type,
                      user=dbDetails$q$user,
                      pass=dbDetails$q$pass,
                      dbname=dbDetails$q$dbname,
                      host=dbDetails$q$host,
                      port=dbDetails$q$port)

  l_con <- connect2db(pth=dbDetails$l$pth,
                      type=dbDetails$l$type,
                      user=dbDetails$l$user,
                      pass=dbDetails$l$pass,
                      dbname=dbDetails$l$dbname,
                      host=dbDetails$l$host,
                      port=dbDetails$l$port)


  q_speakmetai <- getSmeta(q_con, q_pid) %>% dplyr::collect()



  q_speaksi <- getScanPeaksSqlite(q_con, spectraFilter=q_spectraFilter, pids=q_pid) %>% dplyr::collect()

  # if no peaks, then skip
  if (nrow(q_speaksi)==0){
    return(NULL)
  }



  l_speakmeta <- getSmeta(l_con, l_pids) %>% dplyr::collect()
  l_speaks <- getScanPeaksSqlite(l_con, spectraFilter=l_spectraFilter, pids=l_pids) %>% dplyr::collect()

  if(usePrecursors){

    q_precMZ <- q_speakmetai$precursor_mz
    # Check if precursors are within tolerance
    # We have ppm tolerances for both the library and the query
    q_precMZlo = q_precMZ - ((q_precMZ*0.000001)*q_ppmPrec)
    q_precMZhi = q_precMZ + ((q_precMZ*0.000001)*q_ppmPrec)

    # Search against the range for the library
    l_fspeakmeta <- l_speakmeta %>%
      dplyr::filter( (q_precMZhi >= precursor_mz - ((precursor_mz*0.000001)*l_ppmPrec))
                     &
                       (precursor_mz + ((precursor_mz*0.000001)*l_ppmPrec) >= q_precMZlo)) %>%
      #summarise(id)  %>% # need to change pid
      dplyr::collect()
  }else{
    l_fspeakmeta <- l_speakmeta %>% dplyr::collect()
  }


  if(!is.na(rttol)){

    l_fspeakmeta <- l_fspeakmeta %>% dplyr::filter(abs(retention_time-q_speakmetai$retention_time)<rttol)
  }

  if(nrow(l_fspeakmeta)==0){
    return(NULL)
  }

  if ('pid' %in% colnames(l_fspeakmeta )){
    l_fpids <- l_fspeakmeta$pid

  }else{
    l_fpids <- l_fspeakmeta$id
  }

  searched <- plyr::adply(l_fpids , 1, queryVlibrarySingle,
                          q_speaksi=q_speaksi,
                          l_speakmeta=l_speakmeta,
                          l_speaks=l_speaks,
                          q_ppmProd=q_ppmProd,
                          l_ppmProd=l_ppmProd,
                          raW=raW,
                          mzW=mzW

  )

  searched$qpid <- q_pid
  searched$query_rt <- q_speakmetai$retention_time
  searched$query_precursor_mz <- q_speakmetai$precursor_mz

  if ("inPurity" %in% colnames(q_speakmetai)){
    searched$query_precursor_ion_purity <- q_speakmetai$inPurity
  }else{
    searched$query_precursor_ion_purity <- NA
  }

  DBI::dbDisconnect(q_con)
  DBI::dbDisconnect(l_con)



  return(searched)



}


queryVlibrarySingle <- function(l_pid, q_speaksi, l_speakmeta, l_speaks, q_ppmProd, l_ppmProd, raW, mzW){




  if ('pid' %in% colnames(l_speaks)){
    l_speaksi <- l_speaks %>% dplyr::filter(pid==l_pid) %>% dplyr::collect()
    l_speakmetai <- data.frame(l_speakmeta %>% dplyr::filter(pid==l_pid) %>% dplyr::collect())
  }else{
    l_speaksi <- l_speaks %>% dplyr::filter(library_spectra_meta_id==l_pid) %>% dplyr::collect()
    l_speakmetai <- data.frame(l_speakmeta %>% dplyr::filter(id==l_pid) %>% dplyr::collect())
  }

  # ensure we have the relative abundance
  l_speaksi$ra <- (l_speaksi$i/max(l_speaksi$i))*100
  q_speaksi$ra <- (q_speaksi$i/max(q_speaksi$i))*100

  am <- alignAndMatch(q_speaksi, l_speaksi, q_ppmProd, l_ppmProd, raW, mzW)

  if ('pid' %in% colnames(l_speakmetai)){
    lpids <- l_speakmetai$pid
  }else{
    lpids <- l_speakmetai$id
  }

  if ('retention_time' %in% colnames(l_speakmetai)){
    library_rt <- as.numeric(l_speakmetai$retention_time)
  }else{
    library_rt <- NA
  }

  if ("inPurity" %in% colnames(l_speakmetai)){
    library_precursor_ion_purity<- l_speakmetai$inPurity
  }else{
    library_precursor_ion_purity <- NA
  }


  return(c(am,
           'library_rt'=library_rt,
           'library_accession'=l_speakmetai$accession,
           'library_precursor_mz'=l_speakmetai$precursor_mz,
           'library_precursor_ion_purity'=library_precursor_ion_purity,
           'library_precursor_type'=l_speakmetai$precursor_type,
           'library_entry_name'=l_speakmetai$name,
           'inchikey'=l_speakmetai$inchikey_id,
           'lpid'=lpids
  ))
}


alignAndMatch <- function(q_speaksi, l_speaksi, q_ppmProd, l_ppmProd, raW, mzW){
  ###################
  # Align
  ###################

  q_speaksi$w <- (q_speaksi$mz^2) * (q_speaksi$ra^0.5)
  l_speaksi$w <- (l_speaksi$mz^2) * (l_speaksi$ra^0.5)
  q_speaksi <- data.frame(q_speaksi)
  l_speaksi <- data.frame(l_speaksi)



  aligned <-align(q_speaksi, l_speaksi, l_ppmProd, q_ppmProd, raDiffThres=10)

  ##################
  # Match
  ##################
  dpcOut <- dpc(aligned$q, aligned$l)

  rl <- aligned$l[!aligned$l==0]
  rq <- aligned$q[!aligned$l==0]
  rdpcOut <- dpc(rq, rl)

  cdpcOut <- cdpc(aligned$q, aligned$l)

  return(c('dpc'=dpcOut, 'rdpc'=rdpcOut, 'cdpc'=cdpcOut,
           'mcount'=aligned$mcount, 'allcount'=aligned$total,
           'mpercent'=aligned$percMatchAll))

}





mzComparePPMrange <- function(mz1, mz2, ppm1, ppm2){
  mz1Lo = round(mz1 - ((mz1*0.000001)*ppm1), 10)
  mz1Up = round(mz1 + ((mz1*0.000001)*ppm1), 10)

  # get ranges of the "blank" to remove
  mz2Lo = round(mz2 - ((mz2*0.000001)*ppm2), 10)
  mz2Up = round(mz2 + ((mz2*0.000001)*ppm2), 10)

  return(overlap(mz1Lo, mz1Up, mz2Lo, mz2Up))
}


align <-function(q_speaksi,l_speaksi, l_ppmProd=10, q_ppmProd=100, raDiffThres=10){

  # Following the pMatch-hammer method for peak matching but with slight variation that we also check the percentage difference for
  # the relative intensity as well
  # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3928522/

  # get query and library peaks in the correct format
  q_speaksi$ql <-'query'
  l_speaksi$ql <-'library'

  q_speaksi <- q_speaksi[,c('mz','ra','ql','w')]
  l_speaksi <- l_speaksi[,c('mz','ra','ql','w')]

  # Get matrix of ppm bool, ppm differences and relative abundance differences
  idiff <- ppmdiff <- ppmbool <- matrix(0L, nrow = nrow(q_speaksi), ncol = nrow(l_speaksi))

  for(x in 1:nrow(q_speaksi)){
    for(y in 1:nrow(l_speaksi)){
      # Get ranges of the "sample"
      idiff[x, y] = abs(q_speaksi$ra[x] - l_speaksi$ra[y])   ## need to update to RA!!!
      ppmbool[x, y] = mzComparePPMrange(q_speaksi$mz[x], l_speaksi$mz[y], q_ppmProd, l_ppmProd)
      ppmdiff[x, y] = abs(1e6*(q_speaksi$mz[x]-abs(l_speaksi$mz[y]))/l_speaksi$mz[y])
    }
  }

  # Loop through the differences for the target to library
  allpeaks <-  list()
  mcount = NULL

  # we keep a vector detailing which library peaks we have already matched
  lp_remain <- 1:ncol(ppmbool)

  for (i in 1:nrow(ppmbool)){
    # get ppm diff and intensity diff
    ppmB <- ppmbool[i,]
    ppmD <- ppmdiff[i,]
    iD <- idiff[i,]

    if(sum(ppmB, na.rm = TRUE)==0){
      allpeaks[[i]] <- rbind(q_speaksi[i,],c(0, 0, 'library',0))
      mcount <- append(mcount, 'miss')
      next
    }

    # First check to see if there is a matching intensity value within ra_diff (default 10%)
    intenc <- iD[ppmB==1 & iD<raDiffThres & !is.na(ppmD) & !is.na(raDiffThres)]

    if (!identical(unname(intenc), numeric(0))){
      matchi <- match(min(intenc, na.rm = TRUE), iD)
    }else{
      # if there are identical matches for the ppm range searched. Then pick the match
      # with the smallest ppm difference. This doesn't happen that often, but to
      # keep consistent when it does
      matchi <- match(min(ppmD[ppmB==1], na.rm=TRUE), ppmD)
    }

    allpeaks[[i]] <- rbind(q_speaksi[i,],l_speaksi[matchi,])
    mcount <- append(mcount, 'match')
    lp_remain[matchi] <- NA
    idiff[,matchi] <- NA
    ppmdiff[,matchi] <- NA
    ppmbool[,matchi] <- NA
  }

  cp = length(allpeaks)+1

  # get remaining query empty values
  for (i in lp_remain){
    if(is.na(i)){
      next
    }
    allpeaks[[cp]] <- rbind(c(0, 0, 'query',0), l_speaksi[i,])
    mcount <- append(mcount, 'miss')
    cp = cp+1
  }

  allpeaksm <- data.frame(do.call(rbind, allpeaks))

  wqm <- allpeaksm[allpeaksm[,'ql']=='query',]
  if (is.vector(wqm)){
    wq <- as.numeric(wqm['w'])
  }else{
    wq <- as.numeric(wqm[,'w'])
  }

  wlm <- allpeaksm[allpeaksm[,'ql']=='library',]

  if (is.vector(wlm)){
    wl <- as.numeric(wlm['w'])
  }else{
    wl <- as.numeric(wlm[,'w'])
  }


  pmatch <- sum(mcount=='match')/length(mcount)
  return(list(q=wq, l=wl, mcount=sum(mcount=='match'), total=length(mcount), percMatchAll=pmatch))
}
Viant-Metabolomics/msPurity documentation built on May 13, 2024, 8:23 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
Viant-Metabolomics/msPurity
Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

R/spectralMatching.R
In Viant-Metabolomics/msPurity: Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

Defines functions align mzComparePPMrange alignAndMatch queryVlibrarySingle queryVlibrary getPeakCols getMetaCols getSmeta filterPrecursors filterSMeta getXcmsSmSummary getScanPeaksSqlite pullPid spectralMatching

Documented in spectralMatching

R Package Documentation

Browse R Packages

We want your feedback!

Viant-Metabolomics/msPurity Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

R/spectralMatching.R In Viant-Metabolomics/msPurity: Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

Defines functions align mzComparePPMrange alignAndMatch queryVlibrarySingle queryVlibrary getPeakCols getMetaCols getSmeta filterPrecursors filterSMeta getXcmsSmSummary getScanPeaksSqlite pullPid spectralMatching

Documented in spectralMatching

R Package Documentation

Browse R Packages

We want your feedback!

Viant-Metabolomics/msPurity
Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics

R/spectralMatching.R
In Viant-Metabolomics/msPurity: Automated Evaluation of Precursor Ion Purity for Mass Spectrometry Based Fragmentation in Metabolomics