R/readMaxQuantFile.R

Defines functions readMaxQuantFile

Documented in readMaxQuantFile

#' Read Quantitation Data-Files (proteinGroups.txt) Produced From MaxQuant At Protein Level
#'
#' Protein quantification results from \href{https://www.maxquant.org}{MaxQuant} can be read using this function and relevant information extracted.
#' Input files compressed as .gz can be read as well.
#' The protein abundance values (XIC), peptide counting information like number of unique razor-peptides or PSM values and sample-annotation (if available) can be extracted, too.
#' The protein abundance values may be normalized using multiple methods (median normalization as default), the determination of normalization factors can be restricted to specific proteins
#' (normalization to bait protein(s), or to invariable matrix of spike-in experiments).
#' The protein annotation data gets parsed to extract specific fields (ID, name, description, species ...).
#' Besides, a graphical display of the distribution of protein abundance values may be generated before and after normalization.
#'
#' @details
#' \href{https://www.maxquant.org}{MaxQuant} is proteomics quantification software provided by the MaxPlanck institute.
#' By default MaxQuant writes the results of each run to the path \code{combined/txt}, from there (only) the files
#'  'proteinGroups.txt' (main quantitation at protein level), 'summary.txt' and 'parameters.txt' will be used.
#'
#' Meta-data describing the samples and experimental setup may be available from two sources :
#' a) The file \code{summary.txt} which gets produced by MaxQuant in the same folder as the main quantification data.
#' b) Furthermore, meta-data deposited as \code{sdrf} at Pride can be retreived (via the respective github page) when giving the accession number in argument \code{sdrf}.
#' Then, the meta-data will be examined for determining groups of replicates and
#' the results thereof can be found in $sampleSetup$levels.
#' Alternatively, a dataframe formatted like sdrf-files (ie for each sample a separate line, see also function \code{readSdrf}) may be given.
#' In tricky cases it is also possible to precise the column-name to use for defining the groups of replicates or the method for automatically choosing
#'  the most suited column via the 2nd value of the argument \code{sdrf}.
#' Please note, that sdrf is still experimental and only a small fraction of proteomics-data on Pride have been annotated accordingly.
#' If a valid sdrf is furnished, it's information has priority over the information extracted from the MaxQuant produced file summary.txt.
#'
#' This import-function has been developed using MaxQuant versions 1.6.10.x to 2.0.x, the format of the resulting file 'proteinGroups.txt' is typically well conserved between versions.
#' The final output is a list containing these elements: \code{$raw}, \code{$quant}, \code{$annot}, \code{$counts}, \code{$sampleSetup}, \code{$quantNotes}, \code{$notes}, or (if \code{separateAnnot=FALSE}) data.frame
#'   with annotation- and main quantification-content. If \code{sdrf} information has been found, an add-tional list-element \code{setup}
#' will be added containg the entire meta-data as \code{setup$meta} and the suggested organization as \code{setup$lev}.
#'
#'
#' @param path (character) path of file to be read
#' @param fileName (character) name of file to be read (default 'proteinGroups.txt' as typically generated by MaxQuant in txt folder). Gz-compressed files can be read, too.
#' @param normalizeMeth (character) normalization method, defaults to \code{median}, for more details see \code{\link[wrMisc]{normalizeThis}})
#' @param quantCol (character or integer) exact col-names, or if length=1 content of \code{quantCol} will be used as pattern to search among column-names for $quant using \code{grep}
#' @param contamCol (character or integer, length=1) which columns should be used for contaminants
#' @param pepCountCol (character) pattern to search among column-names for count data (1st entry for 'Razor + unique peptides', 2nd fro 'Unique peptides', 3rd for 'MS.MS.count' (PSM))
#' @param read0asNA (logical) decide if initial quntifications at 0 should be transformed to NA (thus avoid -Inf in log2 results)
#' @param sampleNames (character) custom column-names for quantification data; this argument has priority over \code{suplAnnotFile}
#' @param extrColNames (character) column names to be read (1st position: prefix for LFQ quantitation, default 'LFQ.intensity'; 2nd: column name for protein-IDs, default 'Majority.protein.IDs'; 3rd: column names of fasta-headers, default 'Fasta.headers', 4th: column name for number of protein IDs matching, default 'Number.of.proteins')
#' @param specPref (character) prefix to identifiers allowing to separate i) recognize contamination database, ii) species of main identifications and iii) spike-in species
#' @param refLi (character or integer) custom specify which line of data should be used for normalization, ie which line is main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given
#' @param remRev (logical) option to remove all protein-identifications based on reverse-peptides
#' @param remConta (logical) option to remove all proteins identified as contaminants
#' @param separateAnnot (logical) if \code{TRUE} output will be organized as list with \code{$annot}, \code{$abund} for initial/raw abundance values and \code{$quant} with final normalized quantitations
#' @param gr (character or factor) custom defined pattern of replicate association, will override final grouping of replicates from \code{sdrf} and/or \code{suplAnnotFile} (if provided)   \code{}
#' @param sdrf (character, list or data.frame) optional extraction and adding of experimenal meta-data: if character, this may be the ID at ProteomeExchange,
#'   the second & third elements may give futher indicatations for automatic organization of groups of replicates.
#'   Besides, the output from \code{readSdrf} or a list from \code{defineSamples} may be provided;
#'   if \code{gr} is provided, \code{gr} gets priority for grouping of replicates;
#'   if \code{sdrfOrder=TRUE} the output will be put in order of sdrf
#' @param suplAnnotFile (logical or character) optional reading of supplemental files produced by MaxQuant; if \code{gr} is provided, it gets priority for grouping of replicates
#'  if \code{TRUE} default to files 'summary.txt' (needed to match information of \code{sdrf}) and 'parameters.txt' which can be found in the same folder as the main quantitation results;
#'  if \code{character} the respective file-names (relative ro absolute path), 1st is expected to correspond to 'summary.txt' (tabulated text, the samples as given to MaxQuant) and 2nd to 'parameters.txt' (tabulated text, all parameters given to MaxQuant)
#' @param groupPref (list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to \code{readSampleMetaData}.
#'   May contain \code{(gr= )} vector or factor describing a custom-grouping (will get priority over other sdrf etc) like 
#'     \code{(gr="sdrf")} (global mining of sdrf), 
#'     \code{(gr="sdrf$thisColumn")} (specific column of sdrf, if notg present will fall back to global mining of sdrf),
#'     or \code{(gr="colnames")} to force grouping based on colnames from main data (after stripping terminal nominators) 
#' 
#'   May contain \code{lowNumberOfGroups=FALSE} for automatically choosing a rather elevated number of groups if possible (defaults to low number of groups, ie higher number of samples per group)
#'   May contain \code{chUnit} (logical or character) to be passed to \code{readSampleMetaData()} for (optional) adjustig of unit-prefixes in meta-data group labels, in case multiple different unit-prefixes 
#'   are used (eg '100pMol' and '1nMol').
#' @param plotGraph (logical) optional plot vioplot of initial and normalized data (using \code{normalizeMeth}); alternatively the argument may contain numeric details that will be passed to \code{layout} when plotting
#' @param titGraph (character) custom title to plot of distribution of quantitation values
#' @param wex (numeric)  relative expansion factor of the violin in plot
#' @param silent (logical) suppress messages
#' @param debug (logical) additional messages for debugging
#' @param callFrom (character) allow easier tracking of messages produced
#' @return This function returns a list with  \code{$raw} (initial/raw abundance values), \code{$quant} with final normalized quantitations, \code{$annot} (columns ), \code{$counts} an array with 'PSM' and 'NoOfRazorPeptides',
#'   \code{$quantNotes}, \code{$notes} and optional \code{setup} for meta-data from \code{sdrf}; or a data.frame with quantitation and annotation if \code{separateAnnot=FALSE}
#' @seealso \code{\link[utils]{read.table}}, \code{\link[wrMisc]{normalizeThis}}) , \code{\link{readProteomeDiscovererFile}}; \code{\link{readProlineFile}} (and other imprtfunctions), \code{\link{matrixNAinspect}}
#' @examples
#' path1 <- system.file("extdata", package="wrProteo")
#' # Here we'll load a short/trimmed example file (thus not the MaxQuant default name)
#' fiNa <- "proteinGroupsMaxQuant1.txt.gz"
#' specPr <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spike="HUMAN_UPS")
#' dataMQ <- readMaxQuantFile(path1, file=fiNa, specPref=specPr, tit="tiny MaxQuant")
#' summary(dataMQ$quant)
#' matrixNAinspect(dataMQ$quant, gr=gl(3,3))
#' @export
readMaxQuantFile <- function(path, fileName="proteinGroups.txt", normalizeMeth="median", quantCol="LFQ.intensity", contamCol="Potential.contaminant",
  pepCountCol=c("Razor + unique peptides","Unique peptides","MS.MS.count"), read0asNA=TRUE, refLi=NULL, sampleNames=NULL,
  extrColNames=c("Majority.protein.IDs","Fasta.headers","Number.of.proteins"), specPref=c(conta="conta|CON_|LYSC_CHICK", mainSpecies="OS=Homo sapiens"),
  remRev=TRUE, remConta=FALSE, separateAnnot=TRUE, gr=NULL, sdrf=NULL, suplAnnotFile=NULL, groupPref=list(lowNumberOfGroups=TRUE, chUnit=TRUE),
  titGraph=NULL, wex=1.6, plotGraph=TRUE, silent=FALSE, debug=FALSE, callFrom=NULL) {
  ## prepare
  fxNa <- wrMisc::.composeCallName(callFrom, newNa="readMaxQuantFile")
  oparMar <- graphics::par("mar")                     # old margins, for rest after figure
  oparLayout <- graphics::par("mfcol")                # old layout, for rest after figure
  on.exit(graphics::par(mar=oparMar, mfcol=oparLayout))            # restore old mar settings
  remStrainNo <- TRUE                   # if TRUE extract Species in very stringent pattern
  cleanDescription <- TRUE              # clean 'Description' for artifacts of truncated text (tailing ';' etc)
  
  ## functions

  ## init check
  reqPa <- c("utils","wrMisc")
  chPa <- sapply(reqPa, requireNamespace, quietly=TRUE)
  if(any(!chPa)) stop("Package(s) '",paste(reqPa[which(!chPa)], collapse="','"),"' not found ! Please install first from CRAN")
  if(isTRUE(debug)) silent <- FALSE else { debug <- FALSE
    if(!isTRUE(silent)) silent <- FALSE }
         excluCol <- "^Abundances.Count"   # exclude this from quantifications columns
  cleanDescription <- TRUE                 # clean 'Description' for artifacts of truncated text (tailing ';' etc)
  infoDat <- infoFi <- setupSd <- parametersD <- NULL        # initialize

  ## check if path & file exist
  if(!grepl("\\.txt$|\\.txt\\.gz$", fileName)) message(fxNa,"Trouble ahead, expecting tabulated text file (the file'",fileName,"' might not be right format) !!")
  paFi <- wrMisc::checkFilePath(fileName, path, expectExt="txt", mode=c("compressedOption","stopIfNothing"), callFrom=fxNa, silent=silent,debug=debug)
  if(debug){ message(fxNa,"rMQ0b .. Ready to read", if(length(path) >0) c(" from path ",path[1])," the file  ",fileName[1])
    rMQ0b <- list(fileName=fileName,path=path,paFi=paFi,normalizeMeth=normalizeMeth,read0asNA=read0asNA,quantCol=quantCol,refLi=refLi,separateAnnot=separateAnnot)}  # annotCol=annotCol,FDRCol=FDRCol

  ## read (main) file
  ## future: look for fast reading of files
  tmp <- try(utils::read.delim(file.path(paFi), stringsAsFactors=FALSE), silent=TRUE)

  if(length(tmp) <1 || inherits(tmp, "try-error") || length(dim(tmp)) <2) {
    if(inherits(tmp, "try-error")) warning("Unable to read input file ('",paFi,"')!  (check format or if rights to read)") else {
      if(!silent) message(fxNa,"Content of  file '",paFi,"' seeps empty or non-conform !  Returning NULL; check if this is really a MaxQuant-file") }
    tmp <- NULL
    return(NULL)
  } else {
    ## start checking format
    if(debug) { message(fxNa,"rMQ1 .. dims of initial data : ", nrow(tmp)," li and ",ncol(tmp)," col "); rMQ1 <- list(fileName=fileName,path=path,paFi=paFi,tmp=tmp,normalizeMeth=normalizeMeth,read0asNA=read0asNA,quantCol=quantCol,
      refLi=refLi,separateAnnot=separateAnnot   )}  # annotCol=annotCol,FDRCol=FDRCol
    ## check which columns can be extracted (for annotation)
    if(is.integer(contamCol) && length(contamCol) >0) contamCol <- colnames(tmp)[contamCol]
    extrColNames <- union(extrColNames, contamCol)                     # add contamCol if not included in extrColNames
    chCol <- extrColNames %in% colnames(tmp)
    if(!any(chCol, na.rm=TRUE)) { extrColNames <- gsub("\\."," ",extrColNames)
      chCol <- extrColNames %in% colnames(tmp) }
    if(all(!chCol, na.rm=TRUE)) stop("Problem locating annotation columns (",wrMisc::pasteC(extrColNames, quoteC="''"),")")
    if(any(!chCol, na.rm=TRUE) ) {
      if(!silent) message(fxNa,"Note: Can't find columns ",wrMisc::pasteC(extrColNames[!chCol], quoteC="'")," !")
    }
    ## 'REVERSE' peptides
    chMajProCol <- extrColNames[1] %in% colnames(tmp)
    if(chMajProCol) {
      chRev <- grep("REV__", tmp[,extrColNames[1]])
      if(length(chRev) >0) {
        if(!silent) message(fxNa,"Note: Found ",length(chRev)," out of ",nrow(tmp)," proteins marked as 'REV_' (reverse peptide identification)", if(isTRUE(remRev)) " - Removing")
        if(isTRUE(remRev)) tmp <- if(length(chRev) < nrow(tmp) -1)  tmp[-1*chRev,] else matrix(tmp[-1*chRev,], nrow=nrow(tmp)-length(remRev), dimnames=list(rownames(tmp)[-1*chRev], colnames(tmp)))
      }
      ## remove MaxQuant internal contaminants CON__
      if(isTRUE(remConta) && nrow(tmp) >0) { isConta <- grep("CON__{0,1}[[:alpha:]]+", tmp[,extrColNames[1]])
        if(length(isConta) >0) {
          if(!silent) message(fxNa,"Note: Found ",length(isConta)," out of ",nrow(tmp)," proteins marked as 'CON_' (contaminants) - Removing")
          tmp <- if(length(isConta) < nrow(tmp) -1) tmp[-1*isConta,] else matrix(tmp[-1*isConta,], nrow=nrow(tmp)-length(isConta), dimnames=list(rownames(tmp)[-1*isConta], colnames(tmp)))
      } }
    } else if(!silent) message(fxNa,"BIZZARE, trouble ahead : Unable to find column '",extrColNames[1],"' (from argument 'extrColNames')")
    if(debug) {message(fxNa,"rMQ2"); rMQ2 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,remConta=remConta)}
  }
  if(length(tmp) >0) {
    ## further extracting : quantitation
    grepX <- function(x) grep(paste0(x,"\\."), colnames(tmp))
    useDCol <- if(length(quantCol)==1) grepX(quantCol) else unique(as.integer(sapply(quantCol, grepX)))
    if(length(useDCol) <1) stop("NO columns matching terms ",wrMisc::pasteC(quantCol, quoteC="'")," from argument 'quantCol' found !")
    #not needed#MQdat <- as.matrix(tmp[,useDCol])
    quantColP <- NULL                           # initialize
    if(length(quantCol) <1) stop(" 'quantCol' must be provided !")
    if(length(quantCol) >1) { abund <- as.matrix(wrMisc::extrColsDeX(tmp, extrCol=quantCol, doExtractCols=TRUE, silent=silent, callFrom=fxNa))
    } else { chP <- substr(quantCol, nchar(quantCol), nchar(quantCol)) != "."
      quantColP <- quantCol
      quantCol <- if(chP) grep(paste0(quantCol,"\\."), colnames(tmp)) else grep(quantCol, colnames(tmp))
      chNa <- is.na(quantCol)
      if(all(chNa, na.rm=TRUE)) stop("Could not find any of the columns specified in argument 'quantCol' !")
      if(any(chNa, na.rm=TRUE)) {
        if(!silent) message(fxNa,"Could not find columns ",wrMisc::pasteC(quantCol[which(chNa)],quote="'")," .. omit")
        quantCol <- wrMisc::naOmit(quantCol)}
      abund <- as.matrix(tmp[,quantCol]) }           # abundance val
    chNum <- is.numeric(abund)
    if(!chNum) {abund <- apply(tmp[,quantCol], 2, wrMisc::convToNum, convert="allChar", silent=silent, callFrom=fxNa)}
    if(length(dim(abund)) <2 && !is.numeric(abund)) abund <- matrix(as.numeric(abund), ncol=ncol(abund), dimnames=dimnames(abund))
    colnames(abund) <- if(length(quantColP)==1) sub(paste0(quantColP,"\\."),"", colnames(abund)) else wrMisc::.trimLeft(wrMisc::.trimRight(colnames(abund)))
    if(debug) {message(fxNa,"rMQ3"); rMQ3 <- list(abund=abund,path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,remConta=remConta)}

    ## convert 0 to NA
    if(!isFALSE(read0asNA)) {ch1 <- abund <= 0
      if(any(ch1, na.rm=TRUE)) { abund[which(ch1)] <- NA
        if(!silent) message(fxNa,"Transform ",sum(ch1),"(",100*round(sum(ch1)/length(ch1),3),"%) initial '0' values to 'NA'")}}

    ## further extracting : prepare for countig data
    ch1 <- grep(" $",pepCountCol)
    if(length(ch1) < length(pepCountCol)) {pepCountCol <- if(length(ch1) >0) paste0(pepCountCol[-1*which(ch1)]," ") else paste0(pepCountCol," ")}  # add tailing ' ' (if not yet present)
    if(length(grep("\\\\",pepCountCol)) <1) pepCountCol <- gsub("\\.","\\\\.",pepCountCol)       # protect '.'  "
    ## prepare for column-name style with '.' or '...'
    tm2 <- lapply(as.list(pepCountCol), function(x) c(x, gsub(" ",".", sub(" \\+ ","...",x))) )
    names(tm2) <- pepCountCol
    usePCol <- lapply(tm2, function(x) {ch1 <- lapply(x, grep, colnames(tmp)); if(length(ch1) >1) ch1[[which.max(sapply(ch1,length))]] else ch1[[1]]})
    usePCol <- lapply(usePCol, wrMisc::naOmit)
    ch2 <- sapply(usePCol, length) - ncol(abund)
    if(any(ch2 >0, na.rm=TRUE)) usePCol[which(ch2 >0)] <- lapply(usePCol[which(ch2 >0)], function(x) x[-1])
    ch2 <- sapply(usePCol, length) ==ncol(abund)
    if(!silent && any(!ch2, na.rm=TRUE)) message(fxNa,"Could not find peptide counts columns (argument 'pepCountCol') matching to '",pepCountCol[which(!ch2)],"'")
    if(debug) {message(fxNa,"rMQ4"); rMQ4 <- list(abund=abund,usePCol=usePCol,ch2=ch2,tm2=tm2,path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,remConta=remConta)}

    ## make array of PSM counts etc
    if(any(ch2, na.rm=TRUE)) {
      counts <- array(dim=c(nrow(tmp), ncol(abund), sum(ch2)), dimnames=list(NULL, colnames(abund), pepCountCol[which(ch2)]))
      for(i in 1:sum(ch2)) counts[,,i] <- as.numeric(as.matrix(tmp[,usePCol[[which(ch2)[i]]] ]))
    } else counts <- NULL

    ## Annotation
    useACol <- list(annC=match(extrColNames, colnames(tmp)) )
    annot <- as.matrix(tmp[,useACol$annC])
    if(debug) {message(fxNa,"rMQ4b"); rMQ4b <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,counts=counts,
      chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,annot=annot,remConta=remConta,specPref=specPref)}
      

    ## look for tags from  specPref
#    if(length(specPref) >0) {
#      ## set anno#t[,"specPref"] according to specPref
#      annot <- .ex#trSpecPref(specPref, annot, useColumn=c("Majority.protein.IDs","Fasta.headers"), silent=silent, debug=debug, callFrom=fxNa)
#    } else if(debug#) message(fxNa,"Note: Argument 'specPref' not specifed (empty)")

    if(debug) {message(fxNa,"rMQ5"); rMQ5 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,counts=counts,
      chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,annot=annot,remConta=remConta,specPref=specPref)}

    ## remove MQ-internal contaminants
    if(remConta) {
      conLi <- grep("CON__[[:alnum:]]", annot[,"Majority.protein.IDs"])
      if(length(conLi) >0) {
        iniLi <- nrow(annot)
        annot <- annot[-conLi,]
        abund <- abund[-conLi,]
        #specMQ <- specMQ[-conLi]
        counts <- if(length(dim(counts))==3) counts[-conLi,,] else counts[-conLi,,]
        if(debug) message(fxNa,"Removing ",length(conLi)," instances of MaxQuant-contaminants to final ",nrow(annot)," lines/IDs")} }

    ## split Annotation
    remHeader <- c("^conta\\|","^sp\\|")
    MQan2 <- strsplit(sub(remHeader[1], "", sub(remHeader[2], "", annot[,"Majority.protein.IDs"])), "\\|")
    MQanLe <- sapply(MQan2, length)
    MQan3 <- matrix(NA, nrow=nrow(annot), ncol=2, dimnames=list(NULL, c("Accession","EntryName")))
    chLe <- MQanLe==1
    if(any(chLe, na.rm=TRUE)) MQan3[which(chLe),1] <- unlist(MQan2[which(chLe)])
    chLe <- MQanLe==2
    if(any(chLe, na.rm=TRUE)) MQan3[which(chLe),] <- matrix(unlist(MQan2[which(chLe)]), ncol=2, byrow=TRUE)
    chLe <- MQanLe >2
    locAccNo <- function(x) {      # function to select AccessionNumner (eg P02768) and EntryName (eg ALBU_HUMAN) after strsplit() of concatenated annotation
      accIn <- grep("^[[:upper:]]+[[:digit:]]+$|^[[:upper:]]+[[:digit:]]+\\-[[:digit:]]+$", x)
      namId <- grep("[[:upper:]]_[[:upper:]]", x)
      useInd <- c(acc=if(length(accIn) >0) accIn[1] else NA, name=if(length(namId) >0) namId[1] else NA)
      chNA <- is.na(useInd)
      if(any(chNA)) useInd[which(chNA)] <- (1:length(x))[-1*wrMisc::naOmit(unique(c(namId,useInd)))][1:sum(chNA)]
      x[useInd] }
    if(any(chLe, na.rm=TRUE)) MQan3[which(chLe),] <- t(sapply(MQan2[which(chLe)], locAccNo ))
    chSemc <- grep(";", MQan3[,2])                                     # look for semicolon separator  (eg "CATA_HUMAN_UPS;conta")
    if(length(chSemc) >0) MQan3[chSemc,2] <- sub(";[[:print:]]+","",MQan3[chSemc,2])       # remove all after semicolon (eg "CATA_HUMAN_UPS;conta")
    if(debug) {message(fxNa,"rMQ5c"); rMQ5c <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,counts=counts,
      chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,annot=annot,remConta=remConta,specPref=specPref)}

    ## contaminants (fuse from column 'Potential.contaminant' and those found via specPref[1])
    contam <- rep(FALSE, nrow(annot))
    if("Potential.contaminant" %in% colnames(annot)) { chCo <- grepl("\\+",annot[,"Potential.contaminant"])
      if(all(chCo)) { chCo <- FALSE ; warning(fxNa,"ALL proteins were marked as 'Potential.contaminant' !   Nothing would remain, thus ignoring..")}
      if(any(chCo, na.rm=TRUE)) contam[which(chCo)] <- TRUE }
    ## extract/add GN
    MQan3 <- cbind(MQan3,GN=NA)
    GNLi <- grep("\\ GN=[[:upper:]]{2,}", annot[,"Fasta.headers"])
      if(length(GNLi) >0) { zz <- sub("[[:print:]]+\\ GN=", "",annot[GNLi,"Fasta.headers"])   # remove surplus to left
        MQan3[GNLi,"GN"] <- sub("[[:punct:]]$","", sub("\\ +$","", sub("\\ [[:print:]]+","",zz))) }    # remove surplus to right (and right trailing space) and trailing ';'
    if(debug) {message(fxNa,"rMQ6")}

    ## finalize annotation
    annot <- cbind(MQan3, Species=NA, Contam=contam, annot)
    ## extract species according to custom search parameters 'specPref'
    .annSpecies <- function(spe=c("_HUMAN","Homo sapiens"), anno=annot, exCoNa=extrColNames) {
      ## extract species tags out of annot[,"Majority.protein.IDs"], place as convert to regular name in anno, return matrix anno
      ch1 <- grep(spe[1], anno[,exCoNa[2]])
      if(length(ch1) >0) anno[ch1,"Species"] <- spe[2]  #"Homo sapiens"
      anno }
    if(remStrainNo) {
      commonSpec <- .commonSpecies()
      for(i in 1:nrow(commonSpec)) annot <- .annSpecies(commonSpec[i,], annot, exCoNa=extrColNames)}
    if(debug) {message(fxNa,"rMQ7"); rMQ7 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,quantCol=quantCol,remStrainNo=remStrainNo,
      abund=abund,chNum=chNum,ch2=ch2, annot=annot,chLe=chLe,MQan2=MQan2,MQan3=MQan3,contam=contam,GNLi=GNLi,remConta=remConta,counts=counts)}

    ## now complete (overwrite) by info extracted from fasta : ' OS='
        #overWriteSpecies <- TRUE
        #chSpe <- if(overWriteSpecies) 1:nrow(annot) else which(is.na(annot[,"Species"]) | nchar(annot[,"Species"]) <2)    # missing species-info
    chSpe <- grep("[[:print:]]+\\ OS=[[:upper:]][[:lower:]]+", annot[,"Fasta.headers"])   # limit to those not found by species extension on protein name

    if(length(chSpe) >0) {    # if column Species is NA and OS= in fasta-header
      OS <- sub("[[:print:]]+\\ OS=","", annot[chSpe,"Fasta.headers"])
      if(!remStrainNo) {
        ## keep strain information : need to first separate entries with (strain and treat separately from rest (below)
        ## not finished
        ## remove remaining tailing semicolon to comma (in Species)
        ch1 <- grep(";$|,$", OS)
        if(length(ch1) >0) OS[ch1] <- sub(";+$|,*$","", OS[ch1])
        ## remove any other tailing tags (like OX=)
        ch1 <- grep("\\ [[:upper:]]{2}=",annot[,"Species"])
        if(length(ch1) >0) annot[ch1,"Species"] <- sub("\\ [[:upper:]]{2}=[[:print:]]*","",annot[ch1,"Species"])
        OS <- annot[chSpe,"Species"]
      } else {    ## strict 2 word after OS=  (strain names will be cut)
        ch1 <- grep("^[[:upper:]][[:lower:]]*\\ [[:lower:]]+\\ [[:print:]]", OS)
        if(length(ch1) > 0) { nch <- nchar(sub("^[[:upper:]][[:lower:]]*\\ [[:lower:]]+\\ ", "", OS[ch1]))     # loose strain information
          OS[ch1] <- substr(OS[ch1], 1, nchar(OS[ch1]) -nch -1) }
        #annot[chSpe,"Species"] <- OS
      }
      ## check/complete for truncated species names (ie names found) inside other ones
      OS <- gsub(";{1,5}$", "", OS)  # remove tailing separators
      OSna <- unique(OS)
      ch1 <- nchar(OSna) <1
      if(debug) {message(fxNa,"rMQ7b")}
      if(any(ch1, na.rm=TRUE)) OSna <- OSna[which(nchar(OSna) >0)]     # (just in case) remove empty tags
      ch2 <- lapply(OSna, grep, OSna)
      chTr <- sapply(ch2, length) >1
      if(any(chTr, na.rm=TRUE)) { if(!silent) message(fxNa,"Found ",sum(chTr)," species name(s) appearing inside other ones, assume as truncated (eg  ",OSna[which(chTr)[1]],")")
        for(i in which(chTr)) OS[which(OS==OSna[i])] <- OSna[ch2[[i]][1]]
      }
      annot[chSpe,"Species"] <- OS

      if(debug) {message(fxNa,"rMQ7c")}
    }
    chSp <- sum(is.na(annot[,"Species"]))
    if(chSp >0 && !silent) { 
       message(fxNa,"Note: ",chSp," proteins with unknown species")}

    tab <- table(annot[,"Species"])
    if(length(tab) >0) {
      tab <- rbind(names(tab), paste0(": ",tab,",  "))
      if(!silent) message("     data by species : ", apply(tab, 2, paste)) }               # all lines assigned
    if(debug) {message(fxNa,"rMQ8")}

    ## MaxQuant internal contaminants specific : remove non-protein DB parts - if possible, eg "CON__ENSEMBL:ENSBTAP00000007350;CON__P01030" -> "CON__P01030"
    conID <- paste0("CON__",c("ENSEMBL","REFSEQ","H-INV"),":")
    conID <- paste0(conID, c("[[:upper:]]+[[:digit:]]*;{0,1}", "[[:upper:]]+_[[:digit:]]+;{0,1}", "[[:upper:]]+[[:digit:]]+;{0,1}"))
    acc1 <- annot[,1]
    for(i in 1:length(conID)) {
      acc2 <- acc1                     # need previous 'status' to compare if all text disappears
      acc1 <- sub(conID[i], "", acc1)
      chLe <- nchar(acc1)  <2
      if(any(chLe, na.rm=TRUE)) acc1[which(chLe)] <- sub("CON__","", acc2[which(chLe)]) }  # remove entire entry only if something (else) remains
    ## remove first of CON_ entries (only if min 3 characters 3 remain)
    ch2 <- grep("CON__{0,1}[A-Z0-9]+;", acc1)
    if(length(ch2) >0) { acc2 <- acc1
      acc1 <- sub("CON__{0,1}[A-Z0-9]+;", "", acc1)
      chLe <- nchar(acc1) <2
      if(any(chLe, na.rm=TRUE)) acc1[which(chLe)] <- sub("CON__","", acc2[which(chLe)]) }  # remove entire entry only if something (else) remains
    ## remove first of "CON_" marks
    ch2 <- grep("CON_", acc1)
    if(length(ch2) >0) acc1[ch2] <- sub("CON__{0,1}","", acc1[ch2])
    annot[,1] <- acc1

    if(debug) {message(fxNa,"rMQ8b")}

    ## check for composite Accession names, keep only part
    #ch1 <- grep(",|;|_|\\(|\\|", annot[,1])    # note: need to not exclude/mark '-'
    ch1 <- grep(";", annot[,1])    # note: need to not exclude/mark '-'
    if(length(ch1) >0) {
      ## remove 1st part of CON__
      if(!silent) message(fxNa,"Found ",length(ch1)," composite accession-numbers (eg ",annot[ch1[1],1],"), truncating ")
      ch2 <- sort(union(grep("^CON_",annot[ch1,1]), grep(";CON_",annot[ch1,1])))   # if composite Acc number contains CON_, remove this part (and 1st of keep rest)
      if(length(ch2) >0) {
        ch3 <- gsub("CON__{0,1}[0-9A-Z]+[-0-9;]*","", annot[ch1[ch2],1])         # remove CON__A0B1C2 or CON__A0B1C2-1  etc
        ch3b <- nchar(ch3) >3
        if(any(ch3b, na.rm=TRUE)) annot[ch1[ch2[which(ch3b)]],1] <- ch3[which(ch3b)]
      }
      annot[ch1,1] <- sub(";$","",sub("^;","",annot[ch1,1]))   # remove heading or tailing ';'
      ch1 <- grep(",|;|\\(|\\|", annot[,1])
      ## keep 1st part after separator characters
      if(length(ch1) >0) annot[ch1,1] <- sub(paste(paste0(c(",", ";", "\\(", "\\|"), "[[:print:]]*"), collapse="|"), "", annot[ch1,1]) # keep 1st ID: remove all after separator ..
    }
    if(debug) {message(fxNa,"rMQ8c"); rMQ8c <- list(abind=abund,annot=annot,specPref=specPref) }
      

    ## extract UniProtID as EntryName
    tmp <- sub("\\ [[:alnum:]][[:print:]]+", "", annot[,"Fasta.headers"])
    annot[,"EntryName"] <- sub("^[[:print:]]+\\|[[:alnum:]]+\\|","",tmp)  # also remove _UPS?
    colnames(annot)[which(colnames(annot)=="GN")] <- "GeneName"           # correct colname
    annot <- cbind(annot[,1:6], Description=sub("\\ $","",sub("[[:upper:]]{2}=[[:print:]]+","",substring(annot[,"Fasta.headers"], nchar(tmp) +2))), annot[,7:ncol(annot)])
    if(debug) {message(fxNa,"rMQ8d"); rMQ8d <- list(abind=abund,annot=annot,specPref=specPref)}

    ## look for tags from  specPref
    if(length(specPref) >0) {
      ## set annot[,"specPref"] according to specPref
      annot <- .extrSpecPref(specPref, annot, useColumn=c("Majority.protein.IDs","Fasta.headers"), silent=silent, debug=debug, callFrom=fxNa)
    } else if(debug) message(fxNa,"Note: Argument 'specPref' not specifed (empty)")

    ## still needed ??
    chSpPref <- names(specPref) %in% "conta"
    if(any(chSpPref)) { specPrCont <- annot[,"SpecType"] %in% "conta"
      if(any(specPrCont))  annot[which(specPrCont),"Potential.contaminant"] <- TRUE }

    if(debug) {message(fxNa,"rMQ8e"); rMQ8e <- list(abind=abund,annot=annot,specPref=specPref)}


    ## look for unique col from $annot to use as rownames
    chAn <- colSums(apply(annot[,c(1:min(ncol(annot),7))], 2, duplicated), na.rm=TRUE)          # look at first 6 cols : how many elements per column duplicated
    if(!silent) message(fxNa,"Use column '",colnames(annot)[which.min(chAn)],"' as identifyer (has fewest, ie ",chAn[which.min(chAn)]," duplicated entries) as rownames")
    rownames(abund) <- rownames(annot) <- if(any(chAn==0)) annot[,which(chAn==0)[1]] else wrMisc::correctToUnique(annot[,which.min(chAn)], callFrom=fxNa)
    if(length(counts) >0) rownames(counts) <- rownames(annot)
    if(debug) {message(fxNa,"rMQ9"); rMQ9 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,
      annot=annot,chLe=chLe,MQan2=MQan2,MQan3=MQan3,refLi=refLi,contam=contam,GNLi=GNLi,remConta=remConta)}

    ## check for reference for normalization
    refLiIni <- refLi
    if(is.character(refLi) && length(refLi)==1) {
       refLi <- which(annot[,"SpecType"]==refLi)
      if(length(refLi) <1 ) { refLi <- 1:nrow(abund)
        if(!silent) message(fxNa,"Could not find any proteins matching argument 'refLi=",refLiIni,"', ignoring ...")
      } else {
        if(!silent) message(fxNa,"Normalize using (custom) subset of ",length(refLi)," lines specified as '",refLiIni,"'")}}    # may be "mainSpe"

    ## take log2 & normalize
    quant <- try(wrMisc::normalizeThis(log2(abund), method=normalizeMeth, mode="additive", refLines=refLi, silent=silent, debug=debug, callFrom=fxNa), silent=TRUE)
    if(inherits(quant, "try-error")) { warning(fxNa,"PROBLEMS ahead : Unable to normalize as log2-data !!") }

    if(debug) {message(fxNa,"rMQ10"); rMQ10 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,chMajProCol=chMajProCol,chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,
      quant=quant,annot=annot,sampleNames=sampleNames,sdrf=sdrf,gr=gr,specPref=specPref,groupPref=groupPref,suplAnnotFile=suplAnnotFile,chLe=chLe,MQan2=MQan2,MQan3=MQan3,contam=contam,GNLi=GNLi,remConta=remConta)}


    ### GROUPING OF REPLICATES AND SAMPLE META-DATA  (modif 25sep24)
    if(identical("sdrf",sampleNames)) {specPref$sampleNames <- sampleNames; sampleNames <- NULL}  # transfer info to specPref
    if(length(suplAnnotFile) >0 || length(sdrf) >0) {
      if(!is.list(groupPref)) groupPref <- as.list(groupPref)
      if(length(sampleNames) %in% c(1, ncol(abund))) groupPref$sampleNames <- sampleNames     # custom sample names
      if(length(gr)==0 && "gr" %in% names(specPref)) gr <- specPref$gr                    # transfer from  specPref to gr (if gr empty)
      if(length(gr) %in% c(1, ncol(abund))) { 
        if(!"gr" %in% names(groupPref)) groupPref$gr <- gr  else {                            # pipe content of gr to groupPref (if not already precified)
          if(!silent) message(fxNa,"Note: Content of argument 'gr' NOT used since already/also set in argument 'groupPref' (which has priority)")}
      }
      if("sampleNames" %in% names(specPref) && length(specPref$sampleNames)==ncol(abund)) groupPref$sampleNames <- specPref$sampleNames    # pipe content of specPref$sampleNames  to groupPref
      setupSd <- readSampleMetaData(sdrf=sdrf, suplAnnotFile=suplAnnotFile, quantMeth="MQ", path=path, abund=utils::head(quant), chUnit=isTRUE(groupPref$chUnit), groupPref=groupPref, silent=silent, debug=debug, callFrom=fxNa)
    }
    if(debug) {message(fxNa,"rMQ13 .."); rMQ13 <- list(sdrf=sdrf,gr=gr,suplAnnotFile=suplAnnotFile,abund=abund, quant=quant,specPref=specPref,refLi=refLi,annot=annot,setupSd=setupSd,sampleNames=sampleNames)}

    ## finish groups of replicates & annotation setupSd
    setupSd <- .checkSetupGroups(abund=abund, setupSd=setupSd, gr=gr, sampleNames=sampleNames, quantMeth="MQ", silent=silent, debug=debug, callFrom=fxNa)
    if(debug) {message(fxNa,"rMQ13a .."); rMQ13a <- list(sdrf=sdrf,gr=gr,suplAnnotFile=suplAnnotFile,abund=abund, quant=quant,specPref=specPref,refLi=refLi,annot=annot,setupSd=setupSd,sampleNames=sampleNames)}
    

    ## harmonize sample-names/1
    colNa <- if(length(setupSd$sampleNames)==ncol(abund)) setupSd$sampleNames else setupSd$groups

    ### begin NEW
    ## option : choose (re-)naming of levels & columns on most redundance
    if(length(setupSd$sampleNames)==ncol(quant) && length(setupSd$sampleNaSdrf)==ncol(quant)) {
      ## check if setupSd$sampleNaSdrf  or   setupSd$sampleNames contain better info (some info on replicates)
      chRed1 <- sum(duplicated(sub("(_|\\.| )[[:digit:]]+.*","", setupSd$sampleNaSdrf)), na.rm=TRUE)
      chRed2 <- sum(duplicated(sub("(_|\\.| )[[:digit:]]+.*","", setupSd$sampleNames)), na.rm=TRUE)
      if(chRed2 < chRed1) {                                          # use info for levels depending on where more 
        colNa <- colnames(abund) <- setupSd$sampleNames <- setupSd$sampleNaSdrf                     ## take sample names from sdrf via  setupSd$sampleNaSdrf
      } else {
        colNa <- colnames(abund) <- setupSd$sampleNames }          ## take sample names from sdrf via  setupSd$sampleNaSdrf
      #setupSd$level  
    }    
    ## option : set order of samples as sdrf
    if("sdrfOrder" %in% names(sdrf) && isTRUE(as.logical(sdrf["sdrfOrder"])) && length(setupSd$iniSdrfOrder)==ncol(abund) && ncol(abund) >1) {  # set order according to sdrf (only if >1 samples)
      nOrd <- order(setupSd$iniSdrfOrder)
      abund <- abund[,nOrd]
      if(length(quant) >0) quant <- quant[,nOrd]
      #setupSd$level <- setupSd$level[nOrd]
      ## rename columns according to sdrf and set order of quant and abund ..
      ## now adapt order of setupSd, incl init Sdrf
      if(length(setupSd) >0) { 
        is2dim <- sapply(setupSd, function(x,le) length(dim(x))==2 && nrow(x)==le, le=length(nOrd))    # look for matr or df to change order of lines
        if(any(is2dim) >0) for(i in which(is2dim)) setupSd[[i]] <- setupSd[[i]][nOrd,]
        isVe <- sapply(setupSd, function(x,le) length(x)==le && length(dim(x)) <1, le=length(nOrd))    # look for vector to change order in setupSd
        if(any(isVe) >0) for(i in which(isVe)) setupSd[[i]] <- setupSd[[i]][nOrd] }
      gr <- gr[nOrd]

      if(length(counts) >0 && length(dim(counts))==3) counts <- array(counts[,nOrd,], dim=c(nrow(counts), length(nOrd), dim(counts)[3]), 
        dimnames=list(rownames(counts), colnames(counts)[nOrd], dimnames(counts)[[3]]))
      if(debug) {message(fxNa,"rMQ13c .."); rMQ13c <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,groupPref=groupPref,chCol=chCol,chRev=chRev,quantCol=quantCol,abund=abund,chNum=chNum,ch2=ch2,
        sdrf=sdrf,quant=quant,annot=annot,contam=contam,remConta=remConta,setupSd=setupSd)}
        
      ## try re-adjusting levels
      tm1 <- sub("^[[:alpha:]]+( |_|-|\\.)+[[:alpha:]]+","", colnames(abund))  # remove heading text
      if(all(grepl("^[[:digit:]]", tm1))) {
        tm1 <- try(as.numeric(sub("( |_|-|\\.)*[[:alpha:]].*","", tm1)), silent=TRUE)   # remove tailing text and try converting to numeric
        if(!inherits(tm1, "try-error")) {
          setupSd$level <- match(tm1, sort(unique(tm1)))
          names(setupSd$level) <- tm1
          if(!silent) message(fxNa,"Sucessfully re-adjusted levels after bringing in order of Sdrf")}
      }     
    } else {     # no sdrf-info


      ## begin old
      ## harmonize sample-names/2
      colNa <- colnames(abund)
      chGr <- grepl("^X[[:digit:]]", colNa)                                                # check & remove heading 'X' from initial column-names starting with digits
      if(any(chGr)) colNa[which(chGr)] <- sub("^X","", colNa[which(chGr)])                 #
      colnames(quant) <- colNa
      if(length(abund) >0) colnames(abund) <- colNa  
    }  
    if(length(setupSd$sampleNames)==ncol(abund)) setupSd$sampleNames <- colNa #no#else setupSd$groups <- colNa
    if(length(dim(counts)) >1 && length(counts) >0) colnames(counts) <- colNa

    if(debug) {message(fxNa,"Read sample-meta data, rMQ14"); rMQ14 <- list(sdrf=sdrf,suplAnnotFile=suplAnnotFile,abund=abund, quant=quant,refLi=refLi,annot=annot,setupSd=setupSd,paFi=paFi,infoDat=infoDat,normalizeMeth=normalizeMeth)}

    ## main plotting of distribution of intensities
    custLay <- NULL
    if(is.numeric(plotGraph) && length(plotGraph) >0) {custLay <- as.integer(plotGraph); plotGraph <- TRUE} else {
        if(!isTRUE(plotGraph)) plotGraph <- FALSE}
    if(plotGraph) .plotQuantDistr(abund=abund, quant=quant, custLay=custLay, normalizeMeth=normalizeMeth, softNa="MaxQuant",
      refLi=refLi, refLiIni=refLiIni, tit=titGraph, silent=debug, callFrom=fxNa, debug=debug)
## meta-data
    notes <- c(inpFile=paFi, qmethod="MaxQuant", qMethVersion=if(length(infoDat) >0) unique(infoDat$Software.Revision) else NA,
    	identType="protein", rawFilePath= if(length(infoDat) >0) infoDat$File.Name[1] else NA, normalizeMeth=normalizeMeth, call=deparse(match.call()),
      created=as.character(Sys.time()), wrProteo.version=paste(utils::packageVersion("wrProteo"), collapse="."), machine=Sys.info()["nodename"])
    ## final output
    if(isTRUE(separateAnnot)) list(raw=abund, quant=quant, annot=annot, counts=counts, sampleSetup=setupSd, quantNotes=parametersD, notes=notes) else data.frame(quant,annot) }
}
    

Try the wrProteo package in your browser

Any scripts or data that you put into this service are public.

wrProteo documentation built on April 12, 2025, 1:16 a.m.