Nothing
#' Read (Normalized) Quantitation Data Files Produced By Wombat At Protein Level
#'
#' Protein quantification results from \href{https://github.com/wombat-p}{Wombat-P} using the Bioconductor package Normalizer can be read using this function and relevant information extracted.
#' Input files compressed as .gz can be read as well.
#' The protein abundance values (XIC), peptide counting get extracted. Since protein annotation is not very extensive with this format of data, the function allows reading the
#' initial fasta files (from the directory above the quantitation-results) allowing to extract more protein-annotation (like species).
#' Sample-annotation (if available) can be extracted from sdrf files, which are typically part of the Wombat output, too.
#' The protein abundance values may be normalized using multiple methods (median normalization as default), the determination of normalization factors can be restricted to specific proteins
#' (normalization to bait protein(s), or to invariable matrix of spike-in experiments).
#' The protein annotation data gets parsed to extract specific fields (ID, name, description, species ...).
#' Besides, a graphical display of the distribution of protein abundance values may be generated before and after normalization.
#'
#' @details
#' By standard workflow of Wombat-P writes the results of each analysis-method/quantification-algorithm as .csv files
#' Meta-data describing the proteins may be available from two sources :
#' a) The 1st column of the Wombat/normalizer output.
#' b) Form the .fasta file in the directory above the analysis/quantiication results of the Wombar-workflow
#'
#'
#' Meta-data describing the samples and experimental setup may be available from a sdrf-file (from the directory above the analysis/quantiication results)
#' If available, the meta-data will be examined for determining groups of replicates and
#' the results thereof can be found in $sampleSetup$levels.
#' Alternatively, a dataframe formatted like sdrf-files (ie for each sample a separate line, see also function \code{readSdrf}) may be given, too.
#'
#' This import-function has been developed using Wombat-P version 1.x.
#' The final output is a list containing these elements: \code{$raw}, \code{$quant}, \code{$annot}, \code{$counts}, \code{$sampleSetup}, \code{$quantNotes}, \code{$notes}, or (if \code{separateAnnot=FALSE}) data.frame
#' with annotation- and main quantification-content. If \code{sdrf} information has been found, an add-tional list-element \code{setup}
#' will be added containg the entire meta-data as \code{setup$meta} and the suggested organization as \code{setup$lev}.
#'
#'
#' @param fileName (character) name of file to be read (default 'proteinGroups.txt' as typically generated by Compomics in txt folder). Gz-compressed files can be read, too.
#' @param path (character) path of file to be read
#' @param quantSoft (character) qunatification-software used inside Wombat-P
#' @param fasta (logical or character) if \code{TRUE} the (first) fasta from one direcory higher than \code{fileName} will be read as fasta-file to extract further protein annotation;
#' if \code{character} a fasta-file at this location will be read/used/
#' @param isLog2 (logical) typically data read from Wombat are expected to be \code{isLog2=TRUE}
#' @param normalizeMeth (character) normalization method, defaults to \code{median}, for more details see \code{\link[wrMisc]{normalizeThis}})
#' @param quantCol (character or integer) exact col-names, or if length=1 content of \code{quantCol} will be used as pattern to search among column-names for $quant using \code{grep}
#' @param contamCol (character or integer, length=1) which columns should be used for contaminants
#' @param pepCountCol (character) pattern to search among column-names for count data (1st entry for 'Razor + unique peptides', 2nd fro 'Unique peptides', 3rd for 'MS.MS.count' (PSM))
#' @param read0asNA (logical) decide if initial quntifications at 0 should be transformed to NA (thus avoid -Inf in log2 results)
#' @param sampleNames (character) custom column-names for quantification data; this argument has priority over \code{suplAnnotFile}
#' @param extrColNames (character) column names to be read (1st position: prefix for LFQ quantitation, default 'LFQ.intensity'; 2nd: column name for protein-IDs, default 'Majority.protein.IDs'; 3rd: column names of fasta-headers, default 'Fasta.headers', 4th: column name for number of protein IDs matching, default 'Number.of.proteins')
#' @param specPref (character) prefix to identifiers allowing to separate i) recognize contamination database, ii) species of main identifications and iii) spike-in species
#' @param refLi (character or integer) custom specify which line of data should be used for normalization, ie which line is main species; if character (eg 'mainSpe'), the column 'SpecType' in $annot will be searched for exact match of the (single) term given
#' @param remRev (logical) option to remove all protein-identifications based on reverse-peptides
#' @param remConta (logical) option to remove all proteins identified as contaminants
#' @param separateAnnot (logical) if \code{TRUE} output will be organized as list with \code{$annot}, \code{$abund} for initial/raw abundance values and \code{$quant} with final normalized quantitations
#' @param gr (character or factor) custom defined pattern of replicate association, will override final grouping of replicates from \code{sdrf} and/or \code{suplAnnotFile} (if provided) \code{}
#' @param sdrf (logical, character, list or data.frame) optional extraction and adding of experimenal meta-data:
#' if \code{sdrf=TRUE} the 1st sdrf in the directory above \code{fileName} will be used
#' if character, this may be the ID at ProteomeExchange,
#' the second element may give futher indicatations for automatic organization of groups of replicates.
#' Besides, the output from \code{readSdrf} or a list from \code{defineSamples} may be provided; if \code{gr} is provided, \code{gr} gets priority for grouping of replicates
#' @param suplAnnotFile (logical or character) optional reading of supplemental files produced by Compomics; if \code{gr} is provided, it gets priority for grouping of replicates
#' if \code{TRUE} default to files 'summary.txt' (needed to match information of \code{sdrf}) and 'parameters.txt' which can be found in the same folder as the main quantitation results;
#' if \code{character} the respective file-names (relative ro absolute path), 1st is expected to correspond to 'summary.txt' (tabulated text, the samples as given to Compomics) and 2nd to 'parameters.txt' (tabulated text, all parameters given to Compomics)
#' @param groupPref (list) additional parameters for interpreting meta-data to identify structure of groups (replicates), will be passed to \code{readSampleMetaData}.
#' May contain \code{lowNumberOfGroups=FALSE} for automatically choosing a rather elevated number of groups if possible (defaults to low number of groups, ie higher number of samples per group)
#' May contain \code{chUnit} (logical or character) to be passed to \code{readSampleMetaData()} for (optional) adjustig of unit-prefixes in meta-data group labels, in case multiple different unit-prefixes
#' are used (eg '100pMol' and '1nMol').
#' @param plotGraph (logical) optional plot vioplot of initial and normalized data (using \code{normalizeMeth}); alternatively the argument may contain numeric details that will be passed to \code{layout} when plotting
#' @param titGraph (character) custom title to plot of distribution of quantitation values
#' @param wex (numeric) relative expansion factor of the violin in plot
#' @param silent (logical) suppress messages
#' @param debug (logical) additional messages for debugging
#' @param callFrom (character) allow easier tracking of messages produced
#' @return This function returns a list with \code{$raw} (initial/raw abundance values), \code{$quant} with final normalized quantitations, \code{$annot} (columns ), \code{$counts} an array with 'PSM' and 'NoOfRazorPeptides',
#' \code{$quantNotes}, \code{$notes} and optional \code{setup} for meta-data from \code{sdrf}; or a data.frame with quantitation and annotation if \code{separateAnnot=FALSE}
#' @seealso \code{\link[utils]{read.table}}, \code{\link[wrMisc]{normalizeThis}}) , \code{\link{readProteomeDiscovererFile}}; \code{\link{readProlineFile}} (and other import-functions), \code{\link{matrixNAinspect}}
#' @examples
#' path1 <- system.file("extdata", package="wrProteo")
#' # Here we'll load a short/trimmed example file (originating from Compomics)
#' fiNa <- "tinyWombCompo1.csv.gz"
#' dataWB <- readWombatNormFile(file=fiNa, path=path1, tit="tiny Wombat/Compomics, Normalized ")
#' summary(dataWB$quant)
#' @export
readWombatNormFile <- function(fileName, path=NULL, quantSoft="(quant software not specified)", fasta=NULL, isLog2=TRUE, normalizeMeth="none", quantCol="abundance_", contamCol=NULL,
pepCountCol=c("number_of_peptides"), read0asNA=TRUE, refLi=NULL, sampleNames=NULL,
extrColNames=c("protein_group"), specPref=NULL,
remRev=TRUE, remConta=FALSE, separateAnnot=TRUE, gr=NULL, sdrf=NULL, suplAnnotFile=NULL, groupPref=list(lowNumberOfGroups=TRUE, chUnit=TRUE),
titGraph=NULL, wex=1.6, plotGraph=TRUE, silent=FALSE, debug=FALSE, callFrom=NULL) {
## prepare
fxNa <- wrMisc::.composeCallName(callFrom, newNa="readWombatNormFile")
oparMar <- if(plotGraph) graphics::par("mar") else NULL # only if figure might be drawn
remStrainNo <- TRUE # if TRUE extract Species in very stringent pattern
cleanDescription <- TRUE # clean 'Description' for artifacts of truncated text (tailing ';' etc)
fixSpeciesNames <- TRUE
oparMar <- graphics::par("mar")
## functions
.cleanMQann <- function(x, sep="\\|", silent=FALSE, debug=FALSE, callFrom=NULL) {
## split multiple protein entries as with 1st column of MaxQuant data
## return matrix with
## example ann1 <- read.delim(file.path(system.file("extdata", package="wrProteo"), "tinyWombCompo1.csv.gz"), sep=",", stringsAsFactors=FALSE)[,1]
## .cleanMQann(ann1)
# x=rWB4a$tmp[c(5,31:32,81:82,111:114),1]
xIni <- x # keep backup for recuperating bizzare nonparsed
isCont <- grepl("CON__", x)
mult <- nchar(x) - nchar(gsub(";", "", x))
chMult <- mult >0
if(any(chMult)) {
spl1 <- strsplit(x[which(chMult)], ";")
## use entry with most separators (when multiple entries, eg 'sp|P00761|CON__TRYP_PIG;CON__P00761')
spl1 <- sapply(spl1, function(y) { nSep <- nchar(y) - nchar(gsub("|","",y)); y[which.max(nSep)] })
x[which(chMult)] <- spl1 }
## split separators
chSpl <- function(y) {chID <- grepl("^[[:upper:]]{1,3}[[:digit:]]{2,}|^[[:upper:]]{1,3}[[:digit:]]+[[:upper:]]+[[:digit:]]*", y); chName <- grepl("[A-Z0-9]_[[:upper:]]",y); # extract db, ID & prot-name
c(dbIni= if((length(y) >1 && grepl("^[[:lower:]]{1,8}$", y[1])) || length(y) >2 && grepl("^[[:lower:]]{2}|[[:lower:]]{2}$",
y[1])) y[1] else NA, IDini=if(any(chID)) y[which(chID)[1]] else NA, nameIni=if(any(chName)) y[which(chName)[1]] else NA) }
x <- t(sapply(strsplit(x, sep), chSpl))
nColIni <- ncol(x)
cleanID <- function(y, useCol=c(db=1, ID=2, name=3)) {
ext <- grepl("[[:lower:]]+$", y[,useCol[2]]) # look for extension like 'P08758ups'
extNoDb <- which(ext & is.na(y[,useCol[1]]))
if(any(ext)) { cleanID <- sub("[[:lower:]]+$","", y[which(ext), useCol[2]])
if(length(extNoDb) >0) y[which(ext), useCol[1]] <- substring(y[which(ext), useCol[2]], nchar(cleanID) +1 )
y[which(ext), useCol[2]] <- cleanID }
prefi <- grepl("^[[:upper:]]+__[[:upper:]]", y[,useCol[3]]) # look for prefix like 'CON__FA5_BOVIN'
if(any(prefi)) { ch2 <- grepl("[A-Z0-9]_[[:upper:]]", y[which(prefi), useCol[3]]); if(any(ch2)) {
y[which(prefi)[which(ch2)], useCol[1]] <- tolower(sub("__[[:upper:]].+","", y[which(prefi)[which(ch2)], useCol[3]]))
y[which(prefi)[which(ch2)], useCol[3]] <- sub("^[[:upper:]]+__","", y[which(prefi)[which(ch2)], useCol[3]])}}
colnames(y) <- c("db","ID","name")
y }
x <- cbind(x, cleanID(x, useCol=c(db=1, ID=2, name=3)))
x <- cbind(x, conta=grepl("^con|^REV_", x[,"db"]) | grepl("__CON__",xIni))
## recuperate all (bizarre) non-parsed into ID
isNa <- rowSums(is.na(x)) > nColIni -2
if(any(isNa)) x[which(isNa),c(2+nColIni)] <- xIni[which(isNa)]
cbind(x[,c((nColIni+1):ncol(x), 1:nColIni)], iniSoftAnn=xIni) }
.cleanPLann <- function(anno, inclLowerCaseExt=TRUE) {
## parse annotation from Proline quantification sheet (2nd col 'protein_group')
## entries like c("sp|P06169|PDC1_YEAST","sp|P02994|EF1A_YEAST","P00549|KPYK1_YEAST","sp|P00925|ENO2_YEAST")
ann2 <- matrix(ncol=3, nrow=length(anno), dimnames=list(NULL,c("database","uniqueIdentifier","proteinName")))
## use longest of multiple ID-entries (eg "Q14CN4-2|CON__K2C72_HUMAN; sp|Q14CN4-3|CON__K2C72_HUMAN; sp|Q14CN4|CON__K2C72_HUMAN")
## split concatenated multiple IDs, use longest
chMult <- nchar(sub(":","", anno)) < nchar(anno)
if(any(chMult)) { spl1 <- strsplit(as.character(anno[chMult]),";")
anno[which(chMult)] <- sapply(spl1, function(x) x[which.max(nchar(x))]) }
## database identifier
chDB <- grep("^[[:lower:]]+\\|.", anno)
if(length(chDB) >0) { ann2[chDB,1] <- sub("\\|.+", "", anno[chDB])
anno[chDB] <- sub("^[[:lower:]]+\\|", "", anno[chDB])}
## protein/uniProtID
chPat0 <- "^[[:upper:]]+[[:digit:]]+[0-9A-Z]*\\|."
chPat <- "^[[:upper:]]+[[:digit:]]+[0-9A-Z]*(\\-[[:digit:]]{1,3}){0,1}\\|." # like 'P10636|', 'P10636-8|'
chPat2 <- "^[[:upper:]]+[[:digit:]]+[0-9A-Z]*(\\-[[:digit:]]{1,3}){0,1}[[:lower:]]{0,12}\\|." # like 'P10636|', 'P10636-8|', 'P10636-8ups|'
chID <- grep(if(isTRUE(inclLowerCaseExt)) chPat2 else chPat, anno)
if(length(chID) >0) { ann2[chID,2] <- sub("\\|.+", "", anno[chID]) # retreive entry (until next separator)
anno[chID] <- sub(".+\\|", "", anno[chID])}
## last entry
nch <- which(nchar(anno) >0)
if(length(nch) >0) ann2[nch,3] <- anno[nch]
ann2 <- cbind(database=ann2[1], protein_group=sub("[[:lower:]]+$","", ann2[,2]), ann2[,-1], iniSoftAnn=anno)
ann2 }
.cleanSingleAnn <- function(anno, inclLowerCaseExt=TRUE) {
## parse annotation from single column (ie character vector) , eg Compomics quatification sheet (2nd col 'protein_group')
## return 3 columns ; ID, IDext and iniSoftAnn
if(length(dim(anno)) >1) {warning(".cleanSingleann : expecting text vector and NOT matrix"); anno <- try(as.character(as.matrix(anno)))}
iniSoftAnn <- WOext <- anno
chPat3 <- "[[:upper:]][0-9A-Z]+(\\-[[:digit:]]{1,3}){0,1}[[:lower:]]*$" # recognize IDs eg 'P00915ups', 'P00915-3ups',
## still need to split multiple and use longest (eg from 'P02768,P02768ups')
chMult <- grep(",", anno)
if(length(chMult) >0) {
spl <- strsplit(as.character(anno[chMult]), ",")
anno[chMult] <- sapply(spl, function(x) x[which.max(nchar(x))])
}
## split lower-case extenstions from IDs
if(isTRUE(inclLowerCaseExt)) { chExt <- grep(chPat3, anno)
if(length(chExt) >0) {
WOext <- anno
WOext[chExt] <- sub("[[:lower:]]+$", "", anno[chExt])
ext <- rep(NA, length(anno))
ext[chExt] <- substring(anno[chExt], nchar(WOext[chExt]) +1)
anno <- cbind(iniSoftAnn=iniSoftAnn, ID=WOext, IDext=ext)
} }
anno }
## end functions
## init check
reqPa <- c("utils","wrMisc")
chPa <- sapply(reqPa, requireNamespace, quietly=TRUE)
if(any(!chPa)) stop("Package(s) '",paste(reqPa[which(!chPa)], collapse="','"),"' not found ! Please install first from CRAN")
if(!isTRUE(silent)) silent <- FALSE
if(isTRUE(debug)) silent <- FALSE else debug <- FALSE
excluCol <- "^Abundances.Count" # exclude this from quantifications columns
cleanDescription <- TRUE # clean 'Description' for artifacts of truncated text (tailing ';' etc)
infoDat <- infoFi <- setupSd <- parametersD <- annot <- annotMQ <- annotPL <- NULL # initialize
## check if path & file exist
paFi <- wrMisc::checkFilePath(fileName, path, expectExt="csv", compressedOption=TRUE, stopIfNothing=TRUE, callFrom=fxNa, silent=silent,debug=debug)
## read (main) file
## future: look for fast reading of files
# read.delim("C:\\E\\projects\\MassSpec\\smallProj\\ElixirBenchmark\\deWombat\\deGit24may23\\PXD009815dev\\dev\\stand_prot_quant_mergedcompomics.csv", sep=",")
tmp <- try(utils::read.delim(paFi, sep=",", stringsAsFactors=FALSE), silent=TRUE)
if(length(tmp) <1 || inherits(tmp, "try-error") || length(dim(tmp)) <2) {
if(inherits(tmp, "try-error")) warning("Unable to read input file ('",paFi,"')! (check format or if rights to read)") else {
if(!silent) message(fxNa,"Content of file '",paFi,"' seeps empty or non-conform ! Returning NULL; check if this is really a Compomics-file") }
tmp <- NULL
return(NULL)
} else {
## start checking format of initial read of Normalizer-output
if(debug) { message(fxNa,"rWB1 .. dims of initial data : ", nrow(tmp)," li and ",ncol(tmp)," col "); rWB1 <- list(fileName=fileName,path=path,paFi=paFi,tmp=tmp,normalizeMeth=normalizeMeth,read0asNA=read0asNA,quantCol=quantCol,
refLi=refLi,separateAnnot=separateAnnot )} # annotCol=annotCol,FDRCol=FDRCol
## check which columns can be extracted (for annotation)
if(length(extrColNames) <1) { extrColNames <- colnames(tmp)[1] # default : use only 1st col for annotation
} else {
if(is.integer(contamCol) && length(contamCol) >0) contamCol <- colnames(tmp)[contamCol]
extrColNames <- union(extrColNames, contamCol) # add contamCol if not included in extrColNames
chCol <- extrColNames %in% colnames(tmp)
if(!any(chCol, na.rm=TRUE)) { extrColNames <- gsub("\\."," ",extrColNames)
chCol <- extrColNames %in% colnames(tmp) }
if(all(!chCol, na.rm=TRUE)) stop("Problem locating annotation columns (",wrMisc::pasteC(extrColNames, quoteC="''"),")")
if(any(!chCol, na.rm=TRUE)) {
if(!silent) message(fxNa,"Note: Can't find columns ",wrMisc::pasteC(extrColNames[!chCol], quoteC="'")," !")}
}
if(debug) {message(fxNa,"rWB1c"); rWB1c <- list(tmp=tmp,paFi=paFi,fasta=fasta,quantCol=quantCol,extrColNames==extrColNames)}
}
if(length(tmp) >0) {
## check for lines with absent IDs => eliminate
chNa <- which(if("protein_group" %in% colnames(tmp)) is.na(tmp[,"protein_group"]) else rowSums(is.na(tmp))==ncol(tmp))
if(length(chNa) >0) {
if(!silent) message(fxNa,"Removing ",length(chNa)," lines since absent ID or all NA (won't be able to do anything lateron withour ID ..)")
tmp <- tmp[-chNa,]
}
if(debug) {message(fxNa,"rWB1d"); rWB1d <- list(tmp=tmp,paFi=paFi,fasta=fasta,quantCol=quantCol,extrColNames==extrColNames)}
## further extracting : quantitation
useDCol <- grep(paste0("^",quantCol), colnames(tmp))
if(length(useDCol) <1) stop("NO columns matching term ",wrMisc::pasteC(quantCol, quoteC="'")," from argument 'quantCol' found !")
abund <- as.matrix(tmp[,useDCol]) # normalized log2 abundances
if(debug) {message(fxNa,"rWB1e"); rWB1e <- list(tmp=tmp,paFi=paFi,fasta=fasta,abund=abund)}
#iniAbundColNa <- colnames(abund)
chNum <- try(is.numeric(abund), silent=TRUE)
if(!chNum) {abund <- try(apply(tmp[,quantCol], 2, wrMisc::convToNum, convert="allChar", silent=silent, callFrom=fxNa), silent=TRUE)
if(inherits(abund, "try-error")) {datOK <- FALSE; warning(fxNa,"CANNOT transform 'abund' to numeric data !'")} }
if(length(dim(abund)) <2 && !is.numeric(abund)) abund <- matrix(as.numeric(abund), ncol=ncol(abund), dimnames=dimnames(abund))
ch1 <- grepl("^abundance_CT\\.mixture\\.QY\\.", colnames(abund))
if(all(ch1)) colnames(abund) <- sub("abundance_CT\\.mixture\\.QY\\.", "", colnames(abund))
ch1 <- grepl("^abundance_", colnames(abund))
if(all(ch1)) colnames(abund) <- sub("abundance_", "", colnames(abund))
ch1 <- grepl("\\.CV\\.Standards\\.Research\\.Group", colnames(abund))
if(all(ch1)) colnames(abund) <- sub("\\.CV\\.Standards\\.Research\\.Group", "", colnames(abund))
trimColNames <- FALSE ## further trim quantitation colnames
if(trimColNames) { ## further trim
colnames(abund) <- wrMisc::.trimLeft(wrMisc::.trimRight( sub(paste0("^",quantCol),"", colnames(abund))))
## no trim needed for Wombat ?
}
if(debug) {message(fxNa,"rWB3"); rWB3 <- list(abund=abund,paFi=paFi,path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,remConta=remConta,pepCountCol=pepCountCol,fasta=fasta)}
## convert 0 to NA
## note tpp abundance values are all neg !!
chNa <- !all(is.na(abund))
if(chNa) {
chRa <- range(abund, na.rm=TRUE)
if(all(chRa <0) & isTRUE(read0asNA)) {read0asNA <- FALSE
if(debug) message(fxNa,"All abundance values are <0 ! (omit transforming neg values to NA)") } }
if(!isFALSE(read0asNA)) {
ch1 <- abund <= 0
if(any(ch1, na.rm=TRUE)) { abund[which(ch1)] <- NA
if(!silent) message(fxNa,"Transform ",sum(ch1),"(",100*round(sum(ch1)/length(ch1),3),"%) initial '0' values to 'NA'")}}
## further extracting : prepare for countig data
ch1 <- grep(pepCountCol[1], colnames(tmp))
if(length(ch1) ==ncol(abund)) {
counts <- array(dim=c(nrow(tmp), ncol(abund), 1), dimnames=list(NULL, colnames(abund), pepCountCol))
counts[,,1] <- suppressWarnings(as.numeric(as.matrix(tmp[,ch1])))
} else {
counts <- NULL
if(!silent) message(fxNa,"Could not find column(s) with peptide per protein counts (argument 'pepCountCol') matching to '",pepCountCol,"'") }
if(debug) {message(fxNa,"rWB4"); rWB4 <- list(abund=abund,counts=counts,annot=annot,paFi=paFi,path=path,chPa=chPa,tmp=tmp,pepCountCol=pepCountCol,extrColNames=extrColNames,chCol=chCol,remConta=remConta,quantSoft=quantSoft,fasta=fasta,annotMQ=annotMQ)}
## add more measure-types if avail
if(length(pepCountCol) >1) {
for(i in 2:length(pepCountCol)) {
ch1 <- grep(pepCountCol[i], colnames(tmp))
if(length(ch1) ==ncol(abund)) {
counts[,,i] <- suppressWarnings(as.numeric(as.matrix(tmp[,ch1]))) }
}
}
if(debug) {message(fxNa,"rWB4a"); rWB4a <- list(abund=abund,counts=counts,annot=annot,paFi=paFi,path=path,chPa=chPa,tmp=tmp,pepCountCol=pepCountCol,extrColNames=extrColNames,chCol=chCol,remConta=remConta,quantSoft=quantSoft,fasta=fasta,annotMQ=annotMQ)}
## Annotation
if(any(c("CP","Compomics","compomics") %in% quantSoft) && extrColNames[1] %in% colnames(tmp)) { # special case PL: parse annot from column 'protein_group'
annotPL <- .cleanSingleAnn(tmp[,extrColNames[1]]) #"protein_group"
colnames(annotPL)[2] <- "protein_group"
}
if(any(c("MQ","MaxQuant","maxquant") %in% quantSoft) && extrColNames[1] %in% colnames(tmp)) { # special case MQ: parse annot from column 'protein_group'
annotMQ <- .cleanMQann(tmp[,extrColNames[1]]) #"protein_group"
}
if(any(c("PL","Proline","proline") %in% quantSoft) && extrColNames[1] %in% colnames(tmp)) { # special case PL: parse annot from column 'protein_group'
annotPL <- .cleanPLann(tmp[,extrColNames[1]]) #"protein_group"
}
if(any(c("TPP","tpp") %in% quantSoft) && extrColNames[1] %in% colnames(tmp)) { # special case CP: remove lower-case extensions from 'uniqueIdentifier'
annotPL <- .cleanPLann(tmp[,extrColNames[1]]) #"protein_group"
}
if(debug) {message(fxNa,"rWB4aa"); rWB4aa <- list(annotMQ=annotMQ,annotPL=annotPL,abund=abund,counts=counts,annot=annot,paFi=paFi,path=path,chPa=chPa,tmp=tmp,pepCountCol=pepCountCol,extrColNames=extrColNames,chCol=chCol,remConta=remConta,quantSoft=quantSoft,fasta=fasta,annotMQ=annotMQ)}
## read fasta from higher dir (specific to Wombat)
if(length(fasta) >0) {fasta <- fasta[1]; if(isFALSE(fasta) || is.na(fasta)) fasta <- NULL}
if(isTRUE(fasta)) {
hiDir <- dir(file.path(dirname(paFi),".."))
chFa <- grep("\\.fasta$", hiDir)
faFi <- if(length(chFa) >0) file.path(dirname(paFi),"..",hiDir[chFa[1]]) else NULL
} else faFi <- fasta
if(length(faFi) >0) { # has fasta for recuperating annotation
fasta <- try(readFasta2(filename=faFi, tableOut=TRUE, silent=silent,debug=debug,callFrom=fxNa), silent=TRUE)
## Potential problem with inconsistent format of fasta
if(inherits(fasta, "try-error")) { fasta <- NULL
if(!silent) message(fxNa,"Unable to read/open fasta file '",faFi,"' (check rights to read ?)")
} else {
tmpAnn <- if(length(annotMQ) >0) annotMQ[,2] else {if(length(annotPL) >0) annotPL[,2] else tmp[,extrColNames[1]]} # 'P02768' still missing
tm2 <- wrMisc::concatMatch(tmpAnn, fasta[,2], sepPattern=NULL, globalPat="digitExtension", silent=silent, debug=debug, callFrom=fxNa) # clean protein-names (eg digit extensions, concateneated IDs) & match to data
iniAnn <- if(length(annotMQ) >0) annotMQ else {if(length(annotPL) >0) annotPL else cbind(iniSoftAnn=tmp[,extrColNames[1]])}
colnames(iniAnn) <- c("iniSoftAnn", if(ncol(iniAnn) >1) paste0(colnames(iniAnn)[-1],".",quantSoft))
useFaCol <- match(c("uniqueIdentifier","entryName","proteinName","OS","OX","GN","database"), colnames(fasta)) # do not export full 'sequence'
annot <- cbind(trimIdentifier=names(tm2), fasta[tm2, useFaCol], iniAnn=tmpAnn)
if(debug) {message(fxNa,"rWB4ab"); rWB4ab <- list(abund=abund,counts=counts,annot=annot,fasta=fasta,tmp=tmp,chNa=chNa,annotPL=annotPL,annotMQ=annotMQ,extrColNames=extrColNames,extrColNames=extrColNames) }
foundFastaCol <- !is.na(useFaCol)
if(any(foundFastaCol)) colnames(annot)[1:sum(foundFastaCol)] <- c("Accession","AccessionFull","Description","EntryName","Species","OX","GeneName","Database")[which(foundFastaCol)]
## strip species details
if("Species" %in% annot) annot[,"Species"] <- sub(" \\(.+", "", annot[,"Species"])
}
} else {
if(debug) message(fxNa,"NO fasta available !")
annot <- if(length(annotMQ) >0) annotMQ else {if(length(annotPL) >0) annotPL else tmp[,extrColNames[1]]}
if(length(annotMQ) <1 && length(annotPL) <1) {
sep <- ","
if(debug) message(fxNa,"rWB4ab")
## if no other annot : pick 1st of multiple, clean from lowerCase extension, clean from -digit extension
## need to test further ?
chMult <- nchar(tmp[,extrColNames[1]]) - nchar(sub(sep, "", annot))
if(any(chMult) >0) annot <- sapply(strsplit(annot, sep), function(x) {nCha <- nchar(x); if(length(x) >1) x[which.max(nCha)] else x })
annot <- cbind(Accession=sub("\\-[[:digit:]]+$","", sub("[[:lower:]]+$","", annot)), IniAccession=tmp[,extrColNames[1]])
} else {
## need to arrange columns to proper order & names
warning(fxNa,"NOTE : Supplemental arranging of columns to proper order & names not yet implemented")
}
if(debug) message(fxNa,"NOTE : No fasta-file found in main directory ...")
}
if(debug) {message(fxNa,"dim annot ",nrow(annot)," ",ncol(annot)," rWB4b"); rWB4b <- list(annot=annot,faFi=faFi,abund=abund,tmp=tmp,fasta=fasta,annotMQ=annotMQ)}
## check ID col of annot
chID <- match(c("Accession","protein_group","uniqueIdentifier"), colnames(annot))
if(all(is.na(chID))) { chID <- wrMisc::naOmit(match(c("protein_group","ID"), colnames(annot)))
if(length(chID) < 1) warning("PROBLEM : UNEXPECTED colnames in annot") #else colnames(annot)[chID][1] <- "Accession"
}
if(!identical(wrMisc::naOmit(chID), 1) || length(wrMisc::naOmit(chID)) >0) {
annot <- annot[,c(wrMisc::naOmit(chID)[1], (1:ncol(annot))[-wrMisc::naOmit(chID)[1]] )] # adjust order to have ID in 1st column
colnames(annot)[1] <- "Accession"
if(!silent) message(fxNa,"Adjusting order of annot to have ID in 1st column") }
## remove lines wo IDs
chNa <- is.na(annot[,1])
if(any(chNa)) {
if(!silent) message(fxNa,"Removing ",sum(chNa)," out of ",nrow(abund)," lines wo ID")
rmLi <- which(chNa)
tmp <- tmp[-rmLi,]
annot <- annot[-rmLi,]
if(length(dim(annot)) <2) annot <- matrix(annot, ncol=1, dimnames=list(NULL,colnames(tmp)[1]))
abund <- abund[-rmLi,]
if(length(counts) >0) counts <- if(length(dim(counts))==3) counts[-rmLi,,] else counts[-rmLi,]
}
if(debug) {message(fxNa,"dim annot",nrow(annot)," ",ncol(annot)," rWB4d"); rWB4d <- list(annot=annot,faFi=faFi,abund=abund,tmp=tmp)}
## unique ID
chD <- duplicated(annot[,1])
uniqueID <- if(any(chD, na.rm=TRUE)) wrMisc::correctToUnique(annot[,1], silent=silent, callFrom=fxNa) else annot[,1] # extrColNames[1]
rownames(annot) <- rownames(abund) <- uniqueID
if(length(counts) >0) rownames(counts) <- uniqueID
if(debug) {message(fxNa,"rWB4e"); rWB4e <- list(paFi=paFi,path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,counts=counts,
quantCol=quantCol,abund=abund,chNum=chNum,annot=annot,remConta=remConta,specPref=specPref)}
## remove Wombat contaminants
#useColumn <- wrMisc::naOmit(match(c("Accession","protein_group"), colnames(annot)))
conLi <- grep("CON__[[:alnum:]]", annot[, if(ncol(annot) >1) wrMisc::naOmit(match(c("Accession","protein_group"), colnames(annot)))[1] else 1])
if(remConta) {
if(length(conLi) >0) {
iniLi <- nrow(annot)
annot <- annot[-conLi,]
abund <- abund[-conLi,]
counts <- if(length(dim(counts))==3) counts[-conLi,,] else counts[-conLi,,]
if(debug) message(fxNa,"Removing ",length(conLi)," instances of contaminants to final ",nrow(annot)," lines/IDs")}
}
## split Annotation
if(debug) {message(fxNa,"rWB4f"); rWB4f <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,counts=counts,
quantCol=quantCol,abund=abund,chNum=chNum,annot=annot,remConta=remConta,specPref=specPref)}
## finalize annotation
chCols <- c("EntryName","GeneName","Species","Contam","Description")
chCol2 <- chCols %in% colnames(annot)
if(any(!chCol2)) annot <- cbind(annot, matrix(NA, nrow=nrow(annot), ncol=sum(!chCol2), dimnames=list(NULL, chCols[which(!chCol2)]))) # add columns so far not present
if(!remConta && length(conLi) >0) annot[conLi, "Contam"] <- "TRUE"
if(debug) {message(fxNa,"rWB5"); rWB5 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,counts=counts,
quantCol=quantCol,abund=abund,chNum=chNum,annot=annot,remConta=remConta,remStrainNo=remStrainNo, specPref=specPref)}
## extract species according to custom search parameters 'specPref'
if(remStrainNo && any(!is.na(annot[,"Species"]))) {
annot[,"Species"] <- sub(" \\(strain [[:alnum:]].+","", annot[,"Species"])
}
## complete species annot by info extracted from fasta : ' OS='
.completeSpeciesAnnot <- function(spe=c("Homo sapiens", "_HUMAN"), anno=annot, exCoNa=c("Species", "EntryName","name","proteinName")) { # re-written 12jun23
## complete species if missing in anno[,exCoNa[2]] but found in anno[,exCoNa[1]]; return corrected anno
chNa <- is.na(anno[,exCoNa[1]]) | nchar(anno[,exCoNa[1]]) <1 # missing (species) annotation
if(any(chNa, na.rm=TRUE)) { # suppose that all 'exCoNa' are present as colnames in 'annot'
useColumn <- if(all(is.na(anno[,exCoNa[2]]))) wrMisc::naOmit(match(exCoNa[3:length(exCoNa)], colnames(anno))) else exCoNa[2]
if(length(useColumn) >1) useColumn <- useColumn[1]
chS <- grep(spe[1], anno[,useColumn])
if(length(chS) >0) anno[chS, exCoNa[1]] <- spe[2]
}
anno }
if(isTRUE(fixSpeciesNames)) { # try to recuperate/fix non-given/bad formatted species
chNa <- is.na(annot[,"Species"])
if(any(chNa)) {
commonSpec <- .commonSpecies()
for(i in 1:nrow(commonSpec)) annot[which(chNa),] <- .completeSpeciesAnnot(commonSpec[i,], annot[which(chNa),], exCoNa=c("Species","EntryName","name","proteinName")) }
if(debug) {message(fxNa,"rWB6"); rWB6 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,counts=counts,
quantCol=quantCol,abund=abund,chNum=chNum,annot=annot,remConta=remConta,remStrainNo=remStrainNo, specPref=specPref)}
## check/complete for truncated species names (ie names found inside other ones)
chSpe <- which(!is.na(annot[,"Species"]) & nchar(annot[,"Species"]) >0)
if(length(chSpe) >0) {
OS <- gsub(";{1,5}$", "", annot[chSpe,"Species"]) # remove tailing separators
OSna <- unique(OS)
ch1 <- nchar(OSna) <1
if(debug) {message(fxNa,"rWB6b")}
if(any(ch1, na.rm=TRUE)) OSna <- OSna[which(nchar(OSna) >0)] # (just in case) remove empty tags
ch2 <- lapply(OSna, grep, OSna)
chTr <- sapply(ch2, length) >1
if(any(chTr, na.rm=TRUE)) { if(!silent) message(fxNa,"Found ",sum(chTr)," species name(s) appearing inside other ones, assume as truncated (eg ",OSna[which(chTr)[1]],")")
for(i in which(chTr)) OS[which(OS==OSna[i])] <- OSna[ch2[[i]][1]]
}
annot[chSpe,"Species"] <- OS }
}
## in case "Accession" is avail not "EntryName" is not
if(debug) {message(fxNa,"rWB7"); rWB7 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,quantCol=quantCol,remStrainNo=remStrainNo,
abund=abund,chNum=chNum,specPref=specPref, annot=annot,remConta=remConta,counts=counts)}
## look for tags from specPref
if(length(specPref) >0) {
## set annot[,"specPref"] according to specPref
annot <- .extrSpecPref(specPref, annot, useColumn=c("Description","Species","EntryName","GeneName"), silent=silent, debug=debug, callFrom=fxNa)
} else if(debug) message(fxNa,"Note: Argument 'specPref' not specifed (empty)")
if(debug) {message(fxNa,"rWB7b") }
if(!silent) { chSp <- sum(is.na(annot[,"Species"]))
if(chSp >0) message(fxNa,"Note: ",chSp," proteins with unknown species")
tab <- table(annot[,"Species"])
if(length(tab) >0) {
tab <- rbind(names(tab), paste0(": ",tab,", "))
if(!silent) message(" data by species : ", apply(tab, 2, paste)) } } # all lines assigned
if(debug) {message(fxNa,"rWB8"); rWB8 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,quantCol=quantCol,remStrainNo=remStrainNo,
abund=abund,chNum=chNum, annot=annot,remConta=remConta,counts=counts) }
## look for unique col from $annot to use as rownames
if(nrow(annot) <1) warning("annot is empty (NO lines)")
## maybe annot is empty ?
chAn <- colSums(apply(annot[,c(1:min(ncol(annot),7))], 2, duplicated), na.rm=TRUE) # look at first 6 cols : how many elements per column duplicated
if(!silent) message(fxNa,"Use column '",colnames(annot)[which.min(chAn)],"' as identifyer (has fewest, ie ",chAn[which.min(chAn)]," duplicated entries) as rownames")
rownames(abund) <- rownames(annot) <- if(any(chAn==0)) annot[,which(chAn==0)[1]] else wrMisc::correctToUnique(annot[,which.min(chAn)], callFrom=fxNa)
if(length(counts) >0) rownames(counts) <- rownames(annot)
if(debug) {message(fxNa,"rWB9"); rWB9 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,quantCol=quantCol,abund=abund,chNum=chNum,
annot=annot,refLi=refLi,remConta=remConta)}
## check for reference for normalization
refLiIni <- refLi
if(is.character(refLi) && length(refLi)==1) {
refLi <- which(annot[,"SpecType"]==refLi)
if(length(refLi) <1 ) { refLi <- 1:nrow(abund)
if(!silent) message(fxNa,"Could not find any proteins matching argument 'refLi=",refLiIni,"', ignoring ...")
} else {
if(!silent) message(fxNa,"Normalize using (custom) subset of ",length(refLi)," lines specified as '",refLiIni,"'")}} # may be "mainSpe"
## take log2 & normalize
quant <- try(wrMisc::normalizeThis(if(isLog2) abund else log2(abund), method=normalizeMeth, mode="additive", refLines=refLi, silent=silent, debug=debug, callFrom=fxNa), silent=TRUE)
if(inherits(quant, "try-error")) { warning(fxNa,"PROBLEMS ahead : Unable to normalize as log2-data !!") }
if(debug) {message(fxNa,"rWB10"); rWB10 <- list(path=path,chPa=chPa,tmp=tmp,extrColNames=extrColNames,chCol=chCol,quantCol=quantCol,abund=abund,chNum=chNum,
quant=quant,annot=annot,remConta=remConta,groupPref=groupPref,quantSoft=quantSoft,suplAnnotFile=suplAnnotFile, sdrf=sdrf,paFi=paFi )}
### GROUPING OF REPLICATES AND SAMPLE META-DATA
## prepare for sdrf (search in directory above)
if(isTRUE(sdrf)) {
hiDir <- dir(file.path(dirname(paFi),".."))
chFa <- grep("^sdrf.+\\.tsv$", hiDir)
if(length(chFa) >0) sdrf <- file.path(dirname(paFi),"..",hiDir[chFa[1]]) else {sdrf <- NULL
if(!silent) message(fxNa,"NO sdrf file found in directory above main data !")}
}
if(length(suplAnnotFile) >0 || length(sdrf) >0) {
headAbund <- utils::head(quant)
chX <- grepl("^X[[:digit:]]",colnames(quant)) #check for heading X in all colnames
if(any(chX)) colnames(headAbund)[which(chX)] <- sub("^X", "", colnames(headAbund)[which(chX)])
## check for matching : (as done within readSampleMetaData) - can't , sdrf not read yet ...
if(length(sampleNames) %in% c(1, ncol(abund))) groupPref$sampleNames <- sampleNames
if(length(gr) %in% c(1, ncol(abund))) groupPref$gr <- gr
setupSd <- readSampleMetaData(sdrf=sdrf, suplAnnotFile=suplAnnotFile, quantMeth="WB", path=path, abund=utils::head(quant), chUnit=isTRUE(groupPref$chUnit), groupPref=groupPref, silent=silent, debug=debug, callFrom=fxNa)
}
if(debug) {message(fxNa,"rWB13 .."); rWB13 <- list(sdrf=sdrf,gr=gr,suplAnnotFile=suplAnnotFile,abund=abund, quant=quant,refLi=refLi,annot=annot,setupSd=setupSd,sampleNames=sampleNames)}
## finish groups of replicates & annotation setupSd
setupSd <- .checkSetupGroups(abund=abund, setupSd=setupSd, gr=gr, sampleNames=sampleNames, quantMeth="WB", silent=silent, debug=debug, callFrom=fxNa)
## harmonize sample-names/1
colNa <- if(length(setupSd$sampleNames)==ncol(abund)) setupSd$sampleNames else setupSd$groups
## option : set order of samples as sdrf
if("sdrfOrder" %in% names(sdrf) && isTRUE(as.logical(sdrf["sdrfOrder"])) && length(setupSd$iniSdrfOrder)==ncol(abund) && ncol(abund) >1) { # set order according to sdrf (only if >1 samples)
nOrd <- order(setupSd$iniSdrfOrder)
## rename columns according to sdrf and set order of quant and abund ..
abund <- abund[,nOrd]
if(length(quant) >0) quant <- quant[,nOrd]
if(length(setupSd$sampleNames)==ncol(quant)) {
colNa <- colnames(abund) <- setupSd$sampleNames <- setupSd$sampleNaSdrf[nOrd] #old# setupSd$sampleNames[nOrd] ## take sample names from sdrf via setupSd$sampleNaSdrf
if(length(quant) >0) colnames(quant) <- setupSd$sampleNaSdrf[nOrd] #old# setupSd$sampleNames[nOrd]
} else colNa <- colnames(abund)
## now adapt order of setupSd, incl init Sdrf
if(length(setupSd) >0) {
is2dim <- sapply(setupSd, function(x,le) length(dim(x))==2 && nrow(x)==le, le=length(nOrd)) # look for matr or df to change order of lines
if(any(is2dim) >0) for(i in which(is2dim)) setupSd[[i]] <- setupSd[[i]][nOrd,]
isVe <- sapply(setupSd, function(x,le) length(x)==le && length(dim(x)) <1, le=length(nOrd)) # look for vector to change order in setupSd
if(any(isVe) >0) for(i in which(isVe)) setupSd[[i]] <- setupSd[[i]][nOrd] }
gr <- gr[nOrd]
if(length(counts) >0 && length(dim(counts))==3) counts <- array(counts[,nOrd,], dim=c(nrow(counts), length(nOrd), dim(counts)[3]),
dimnames=list(rownames(counts), colnames(counts)[nOrd], dimnames(counts)[[3]]))
## try re-adjusting levels
tm1 <- sub("^[[:alpha:]]+( |_|-|\\.)+[[:alpha:]]+","", colnames(abund)) # remove heading text
if(all(grepl("^[[:digit:]]", tm1))) {
tm1 <- try(as.numeric(sub("( |_|-|\\.)*[[:alpha:]].*","", tm1)), silent=TRUE) # remove tailing text and try converting to numeric
if(!inherits(tm1, "try-error")) {
setupSd$level <- match(tm1, sort(unique(tm1)))
names(setupSd$level) <- tm1
if(!silent) message(fxNa,"Sucessfully re-adjusted levels after bringing in order of Sdrf")}
}
} else {
## harmonize sample-names/2
colNa <- colnames(abund)
chGr <- grepl("^X[[:digit:]]", colNa) # check & remove heading 'X' from initial column-names starting with digits
if(any(chGr)) colNa[which(chGr)] <- sub("^X","", colNa[which(chGr)]) #
colnames(quant) <- colNa
if(length(abund) >0) colnames(abund) <- colNa
}
if(length(setupSd$sampleNames)==ncol(abund)) setupSd$sampleNames <- colNa #no#else setupSd$groups <- colNa
if(length(dim(counts)) >1 && length(counts) >0) colnames(counts) <- colNa
if(debug) {message(fxNa,"Read sample-meta data, rWB14"); rWB14 <- list(sdrf=sdrf,suplAnnotFile=suplAnnotFile,abund=abund, quant=quant,refLi=refLi,annot=annot,setupSd=setupSd,plotGraph=plotGraph,normalizeMeth=normalizeMeth,isLog2=isLog2)}
## main plotting of distribution of intensities
custLay <- NULL
if(is.numeric(plotGraph) && length(plotGraph) >0) {custLay <- as.integer(plotGraph); plotGraph <- TRUE} else {
if(!isTRUE(plotGraph)) plotGraph <- FALSE}
if(plotGraph) .plotQuantDistr(abund=if(isFALSE(isLog2) || "none" %in% normalizeMeth) NULL else abund, quant=quant, custLay=custLay, normalizeMeth=normalizeMeth, softNa=paste("Wombat-P",quantSoft),
refLi=refLi, refLiIni=refLiIni, tit=titGraph, silent=silent, callFrom=fxNa, debug=debug)
## meta-data
notes <- c(inpFile=paFi, qmethod=paste("Wombat-P",quantSoft), qMethVersion=if(length(infoDat) >0) unique(infoDat$Software.Revision) else NA,
rawFilePath= if(length(infoDat) >0) infoDat$File.Name[1] else NA, normalizeMeth=normalizeMeth, call=deparse(match.call()),
created=as.character(Sys.time()), wrProteo.version=utils::packageVersion("wrProteo"), machine=Sys.info()["nodename"])
## final output
if(isTRUE(separateAnnot)) list(raw=abund, quant=quant, annot=annot, counts=counts, sampleSetup=setupSd, quantNotes=parametersD, notes=notes) else data.frame(quant,annot) }
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.