R/pepmatch.R

#' Identify peak pairs.
#' 
#' Identifies pairs of labelled peptides in an \code{lpm_input} object.  
#' @author Rik Verdonck & Gerben Menschaert
#' @seealso \code{\link{lpm_mockdata}}
#' @param lpm_input      Input object of class "LPM_input", generated by the read functions of labelpepmatch, or example data object present in package. 
#' @param elutionthresh  Numeric. Threshold for elution time difference between two peaks to be considered a peakpair. Default is 0.2 (minutes)     	
#' @param elutionunit    Character. Default is "minutes", alternative is "seconds". This is sketchy, only just for FDR estimation. 		
#' @param labelthresh    Numeric. Threshold for molecular weight difference (in Dalton) between to peaks to differ from theoretical mass difference between two labelled peptides. Regardless of number of labels. Default is 0.1. 	
#' @param labelcountmax	Integer. Maximal number of labels allowed. Default is 6.
#' @param labellightmass Numeric. Mass of light label. Default is 128.1177, the mass of light TMAB
#' @param labelheavymass Numeric. Mass of heavy label. Default is 137.1728, the mass of heavy TMAB
#' @param label          Character. Optional argument. If it is "TMAB", automatically the right values for light and heavy TMAB are used. Can be extended in the future with newer labels.  
#' @param minmolweight   Numeric. Minimal molecular weight for a peptide to be retained. Default is 132, the smallest possible peptide.                  
#' @param quantmin       Numeric vector of one or two elements. Minimal untransformed quantity for a feature to be retained. The highest value of the vector is the cutoff for the most abundant peak in a pair, the lowest for the least abundant. If you want to make sure you pick up extreme up-or downregulations, the vector should contain one zero or very small number. If you want to discart a peak pair once one of the peaks is below a quantity threshold, you can set the threshold with one minimal value (a vector with one element). This is equivalent to a vector with two equal elements.                           
#' @param FDR	          Logical. Generates mock data according to a restricted randomization procedure that produces mock data with a very comparable structure to the original data. Structure elements that are retained are the dispersion of features in the retention time - m/z space and the structure of decimals for m/z. This is important because decimals are non random in m/z data. For more information see \code{\link{lpm_mockdata}}.
#' @param iterations     Integer: the number of iterations to estimate the FDR. Careful: for large datasets this can take a long time! A modest number (3 to 5) of iterations will already give a good indication of FDR. 
#' @param cores          Interger. Number of cores that can be used on the computer for calculation. When >1, the packages foreach and doParallel will be used for multithreading. Optimally, the number of cores is a multiple of the number of runs. Since every run makes up a single thread, it makes no sense to use more cores than the number of runs. 
#' @param verbose        Logical. Gives verbose output.
#' @details This is the central function of the labelpepmatch package. Candidate matching features are those features that have the same charge, that co-elute within a given time interval and that have a mass difference of an integer multiple (the number of labels) of the mass difference between two labels. However, only m/z values are given, and masses have to be calculated first (deconvolution). Note that in the case one or more labels are present, the deconvoluted mass will not be correct because an unknown number of charges originate from labels rather than from protons. However, this error is systematic and would by definition be the same for two features within a peak pair. In other words, the mass difference within a peak pair is still calculated correctly. The function follows a simple algorithm where features are first sorted based on their preliminarily deconvoluted masses and retention time. Every feature is then matched to a set of equally charged features that fall within the same retention time window. Once a peak pair is detected, the number of labels is inferred from the mass difference and the correct mass of the peptide is recalculated.
#' @import bitops
#' @import brew
#' @import foreach
#' @import doParallel
#' @export
#' @exportClass pepmatched
#' @return An object of class \code{pepmatched} without mass matchings. This object can serve as an imput to the \code{pep.id} function, or in case of no mass matching, can go directly in \code{make.statlist}. If FDR is TRUE, it also contains a summary of the false discovery rate: the mean, median, minimal and maximal number of false positive hits per run over all iterations, along with their proportion to the presumed real positives. Finally, if FDR is TRUE, the pepmatched object contains all the mass match precisions of the false positive hits along with the actual mockdata used for FDR estimation. 

#### TO DO: samplename_L en samplename_H zijn volledig nutteloze kolommen. Die moeten er nog uit. ALSOOK m_L en m_H kolommen...
#### TO DO: check wat er gebeurt als FDR nullen oplevert


pepmatch <-
function  (lpm_input, 
                    elutionthresh  		= 	0.2,
                    elutionunit			=	"minutes",	#alternative is seconds, this is sketchy, just for FDR estimation
                    labelthresh 		= 	0.1,
                    labelcountmax		= 	6,
                    labellightmass,
                    labelheavymass,
                    label                       =   "unspecified",   # can be "TMAB", then TMAB D0 and D9 masses will be used
                    minmolweight				=	132,                        
                    quantmin                    =   0,                                  
                    FDR							= 	T,
                    iterations					=	10,				# careful, this can take time!
                    cores                       =   1,  
                    verbose                     =   FALSE
                    )
                                        
{
### First some controls before we get started!
if(class(lpm_input)!="lpm_input"){stop("ERROR: lpm_input is not of class lpm_input")}
if(label=="TMAB"){labellightmass=128.1177;labelheavymass=137.17280}
labeldiff   <-labelheavymass-labellightmass
design		  <-lpm_input$design
runcount    <-nrow(design)
	quantminlowest <-min(quantmin)
	quantminhighest<-max(quantmin)



### HERE THE FUNCTION THAT ACTUALLY MATCHES PEAK PAIRS IS DEFINED, no parameters other than the lpm_input object and the runnumber are given because it only runs from within this environment anyway. 
MATCHER <- function	(x,labeldiff,k,FDR)
{
X<-x$frame

 	run		<- cbind	(
					X[,1:2],		        # properties (id,z)
					X[,2+k],				# mass/charge
					X[,2+k+runcount],       # quantity
					X[,2+k+(2*runcount)]    # retention time
					)
							                    #	Give column names

	colnames(run) <- c("id","z","mz","quant","ret")

	### Features with too low quantities are kicked out here. 
	### We do this before peak pair detection, so we don't waste time on features we will not be analyzing anyway. 
	###
	### If you want to retain all features for first analysis, set the threshold values to 0. 
	### You will however encounter a lot of nonsense. 
	
	if(FDR==F)		
	{
		run<-run[order(run$quant),]
		run<-run[run$quant>=quantminlowest, ]
	}


							###	Order matrix for performance
 	run		<- run	[
					order	(
						as.numeric(run[,2]),
						as.numeric(run[,5]),
						as.numeric(run[,3])
						)
				,]
   

								#####################################
								#	Let the looping begin...   		#
 								#####################################
 
							###	Create empty matchlist
	matchlist		<-as.data.frame(matrix(nrow = 0, ncol=17))


								### LOOPING OVER i FEATURES	
	for (i in 1:nrow(run))																	#Loop runs from here...
	{    
							###	Create new matrix, with 4 conditions:
							###	1) Mass				IMPORTANT!!! 
							###		- prevents getting all peak pairs in 2 directions
							###		- puts lightweight always first in matchlist
							###	2) Within elution threshold
							###	3) Different row as row(i)
							###	4) Same charge

     		match		<-	as.matrix	(
							rbind	(
								run	[i,],	
								run	[
									run[,3] >= run[i,3]						& #mass
									run[,2] == run[i,2]						& #charge
									run[,1] != run[i,1]						& #different row
									as.numeric(run[,5]) <= as.numeric(run[i,5]) + elutionthresh 	& #elution
									as.numeric(run[,5]) >= as.numeric(run[i,5]) - elutionthresh 	,	
									]
								)
							)



		if(nrow(match)>1)
		{
         						###	massTemporary = (m/z  * z) - z*H+ 
         	
         	z <-run[i,2]				
			mass1temp <- ((as.numeric(match[1,3])*as.numeric(match[1,2]))-(as.numeric(match[1,2])*1.007276))
			for (j in 2:nrow(match))	###	For j in 2:nrow because first row corresponds to run[i,]
			{     
							###	Check if mass difference corresponds to one or more labels

				mass2temp <- ((as.numeric(match[j,3])*as.numeric(match[j,2]))-(as.numeric(match[j,2])*1.007276)) 

         						###	labelCountInt = abs(m1 - m2)/deltaLabel
				labelcountInt <- round(abs(mass1temp-mass2temp)/(labeldiff))

        				 		###	Mass difference minus number of label differences
				massDifferenceMinusLabels <- abs(abs(mass1temp-mass2temp) - (labelcountInt * labeldiff))

         						###	labelCountRest
         						###	labelcountRest <- abs(mass1temp-mass2temp)/(labeldiff) - labelcountInt
								###	Write result for those rows where labelCount <= labelCountMax and within labeltresh     
				if 	(	
					labelcountInt != 0 			&& 
					labelcountInt < (labelcountmax + 1)	&& 						
					massDifferenceMinusLabels <= labelthresh
					) 
				{ 
      	   				calcmass <- mean	(c	(
									min(mass1temp,mass2temp)-(labelcountInt*labellightmass)+(labelcountInt*1.007276),
									max(mass1temp,mass2temp)-(labelcountInt*labelheavymass)+(labelcountInt*1.007276)
									)
								)
					
					newrow			<- c	(
									run[i,],
									mass1temp,
									match[j,],
									mass2temp,
									calcmass,
									labelcountInt
									) 
					newrow 			<- as.data.frame(newrow)
					IDcomb			<- paste(run[i,1],match[j,1],sep="_")
					newrow			<- cbind(IDcomb,newrow)
					colnames(newrow)<- c	(
									"ID",
									"id1", "z1", "mz1", "quant1","ret1", "m1temp", 
									"id2", "z2", "mz2", "quant2","ret2", "m2temp", 
									"calcmass", 
									"labelcount"
									)
					matchlist		<-rbind(matchlist,newrow)
				} 
			}
			} 
 		}
 		
 																							#To here!
 		### What if no hits?				
    if(nrow(matchlist)==0)
    {
      matchlist=cbind.data.frame(	
        "ID"=NA,
        "samplename_L"=NA,
        "ID_L"=NA,
        "z_L"=NA,
        "mz_L"=NA,
        "quant_L"=NA,
        "ret_L"=NA,
        "m_L"=NA,
        "samplename_H"=NA,
        "ID_H"=NA,
        "z_H"=NA,
        "mz_H"=NA,
        "quant_H"=NA,
        "ret_H"=NA,
        "m_H"=NA,	
        "MW"=NA,
        "labelcount"=NA,
        "precision"=NA,
        "isID"=NA,
        "pepID"=NA,
        "pepseq"=NA,
        "pepmass"=NA,
        "delta_m"=NA,
        "N_identifications"=NA
      )
      matchlist=matchlist[-1,]
    }else{
																				
								### Add two very useful and informative columns
		precision	<- (abs(matchlist$mz1-matchlist$mz2)*matchlist$z1)/matchlist$labelcount-labeldiff
		matchlist	<- cbind(matchlist$ID,NA,matchlist[2:7],NA,matchlist[8:15],precision,NA,NA,NA,NA,NA,0)

								###	Give column names and remove the NA rows 
 				colnames(matchlist) 	<- c	(	
												"ID",
												"samplename_L","ID_L","z_L","mz_L","quant_L","ret_L","m_L",
												"samplename_H","ID_H","z_H","mz_H","quant_H","ret_H","m_H",	
												"MW"		,
												"labelcount",
												"precision"	,
												"isID",
                        "pepID",
                        "pepseq",
                        "pepmass",
                        "delta_m",
                        "N_identifications"
												)



	### Kick out features with a too low molecular weight or too low minimal quantity
	matchlist<-matchlist[order(matchlist$MW),]
	matchlist<-matchlist[matchlist$MW>minmolweight,]
	
	if(FDR==F)
	{
		matchlist<-matchlist[matchlist$quant_L>=quantminhighest | matchlist$quant_H>=quantminhighest,]
	}
    }
		
if(verbose ==TRUE){cat(paste("run",k,"is ready",sep=" "));cat("\n")}
return("matchlist"=matchlist)		
} # END OF MATCHER FUNCTION
####################################################################
##########################################################
#################################################
#########################################
##################################
##############################
############################
###########################
##########################
##########################




# Now we either loop it over the runs, or do i in parallel
###############################################################     
if(verbose==T){cat("Matching peptide peaks\n") }			###
if(cores==1)                                                ###        
{                                                           ###
                                                            ###
    matchlistlist<-foreach (k = 1:runcount) %do%            ###     
    {                                                       ###    
        matchlist<-MATCHER(lpm_input,labeldiff,k,FDR=F)           ###
        return(matchlist)                                   ###                         
    }                                                       ###
}else # parallel !                                          ###            
{                                                           ###
    cl<- parallel::makeCluster(cores)                       ###                 
    doParallel::registerDoParallel(cl)                      ###
    matchlistlist<-foreach (k = 1:runcount) %dopar%         ###     
    {                                                       ###    
        matchlist<-MATCHER(lpm_input,labeldiff,k,FDR=F)           ###
        return(matchlist)                                   ###             
    }                                                       ###
    parallel::stopCluster(cl)                               ###
}                                                           ###
                                                            ###              
###############################################################
#closeAllConnections()
matchlistnames<-NULL
for(i in 1:runcount){matchlistnames<-c(matchlistnames,paste("matchlist",design[i,1],sep="_"))}
names(matchlistlist)<-matchlistnames





# And FDR estimation:
if(FDR==T)
{
mockdatalist<-list()
FDRlistlist<-list()

FDR_summary_frame<-data.frame(matrix(nrow=iterations,ncol=runcount))
FDR_summary_frame_prop<-data.frame(matrix(nrow=iterations,ncol=runcount))
FDR_precisionvectorvector<-NULL

for (iteration in 1:iterations)
{
      FDR_precisionvector<-NULL
      ### Now we use the lpm_mockdata function to generate mock data with properties very similar to the real data.
      mockdata<-lpm_mockdata(lpm_input,masslevel=100,retlevel=5,elutionunit=elutionunit,graphics=F)
      mockdataname<-paste("mockdata",iteration,sep="_")
      assign(mockdataname,mockdata)
      mockdatalist[[iteration]]<-get(mockdataname)
      
      #matchlist<<-MATCHER(mockdata,labeldiff,1)
      
      # Now we either loop it over the runs, or do it in parallel
      # According to documentation, I can rest assured that the order in which 
      # the elements of the final FDRlist appear, is the actual order of the runs. 
#####################################################################  
                                                                  ###
      if(verbose==T){cat("Testing FDR\n") }						  ###
      if(cores==1)                                                ###        
      {                                                           ###
          FDRlist<-foreach (k = 1:runcount) %do%                  ###     
          {                                                       ###    
              matchlist<-MATCHER(mockdata,labeldiff,k,FDR=T)      ###
              return(matchlist)                                   ###                         
          }                                                       ###
      }else # parallel !                                          ###            
      {                                                           ###
        cl<- parallel::makeCluster(cores)                         ###                 
        doParallel::registerDoParallel(cl)                        ###
          FDRlist<-foreach (k = 1:runcount) %dopar%               ###     
          {                                                       ###    
              matchlist<-MATCHER(mockdata,labeldiff,k,FDR=T)      ###
              return(matchlist)                                   ###             
          }                                                       ###
        parallel::stopCluster(cl)                                 ###
      }                                                           ###
                                                                  ###              
#####################################################################
      #closeAllConnections()
      FDRlistnames<-NULL
      for(k in 1:runcount){FDRlistnames<-c(FDRlistnames,paste("FDRlist",k,sep="_"))}
      names(FDRlist)<-FDRlistnames
      #FDRlist$ID<-iteration
      #colnames(FDRlist)[1]<-"iteration"
      
      FDRlistlist[[iteration]]<-FDRlist
      
      FDR_summary<-NULL
      FDR_summary_prop<-NULL
      realhits<-NULL
      precisionnames<-NULL
      for (k in 1:runcount)
      	{
      		FDR_summary         <-c(FDR_summary,(nrow(FDRlist[[k]])))
      		FDR_summary_prop	<-c(FDR_summary_prop,(nrow(FDRlist[[k]])/nrow(matchlistlist[[k]])))
      		FDR_precisionvector <-c(FDR_precisionvector,FDRlist[[k]]$precision) 
      		precisionnames		<-c(precisionnames,rep(design$RunName[k],nrow(FDRlist[[k]])))
      		realhits			<-c(realhits,nrow(matchlistlist[[k]]))
      	}
      names(FDR_precisionvector)<-precisionnames	
      names(FDR_summary)<-as.character(c(1:runcount))
      names(FDR_summary_prop)<-as.character(c(1:runcount))
      
      FDR_summary_frame[iteration,]<-FDR_summary
      FDR_summary_frame_prop[iteration,]<-FDR_summary_prop
      #print(FDR_summary)
      #print(FDR_summary_frame)
      FDR_precisionvectorvector<-c(FDR_precisionvectorvector,FDR_precisionvector)
}# end of iterations




### Now collapse that FDRlistlist thing in pieces. 
reshape_fdrlist<-function(FDRlistlist,iterations,runcount){
  fdr_frame<-as.data.frame(FDRlistlist[[1]][[1]][0,])
  fdr_frame<-cbind(fdr_frame,NULL)
  colnames(fdr_frame)[c(1,2)]<-c("run","iteration") 
  
  for (k in 1:runcount)
  {
    for(i in 1:iterations)
    {
      rowstoadd<-FDRlistlist[[i]][[k]]
      if(nrow(rowstoadd)>0){
        rowstoadd<-cbind(rep(k,nrow(rowstoadd)),rowstoadd)
        colnames(rowstoadd)[c(1,2)]<-c("run","iteration")
        rowstoadd$run<-k
        rowstoadd$iteration<-i
        fdr_frame<-rbind(fdr_frame,rowstoadd)
      }
    }
  }
  return(fdr_frame)
}
FDR_frame<<-reshape_fdrlist(FDRlistlist,iterations,runcount)


#FDR_summary_frame<<-FDR_summary_frame
FDR_summary<-rbind(
				round(apply(as.matrix(FDR_summary_frame),2,mean),2), 					# mean
				round(apply(as.matrix(FDR_summary_frame),2,median),2),					# median
				round(apply(as.matrix(FDR_summary_frame),2,min),2),						# minimal value
				round(apply(as.matrix(FDR_summary_frame),2,max),2),						# maximal value
				round(apply(as.matrix(FDR_summary_frame),2,sd),2), 						# standard deviation
				round((apply(as.matrix(FDR_summary_frame),2,sd)/sqrt(iterations)),2) 	# standard error of the mean
				)
				
FDR_summary_prop<-rbind(
				round(100*apply(as.matrix(FDR_summary_frame_prop),2,mean),2), 					# mean
				round(100*apply(as.matrix(FDR_summary_frame_prop),2,median),2),					# median
				round(100*apply(as.matrix(FDR_summary_frame_prop),2,min),2),						# minimal value
				round(100*apply(as.matrix(FDR_summary_frame_prop),2,max),2),						# maximal value
				round(100*apply(as.matrix(FDR_summary_frame_prop),2,sd),2), 						# standard deviation
				round((100*apply(as.matrix(FDR_summary_frame_prop),2,sd)/sqrt(iterations)),2) 	# standard error of the mean
				)
				
colnames(FDR_summary)<-design$RunName
colnames(FDR_summary_prop)<-design$RunName
rownames(FDR_summary)<-c("mean","median","min","max","sd","sem")
rownames(FDR_summary_prop)<-c("meanprop","medianprop","minprop","maxprop","sdprop","semprop")
FDR_summary<-cbind.data.frame("N_real_hits"=realhits,t(FDR_summary),t(FDR_summary_prop))

}else{FDR_summary="No FDR executed";FDR_precisionvector="No FDR executed";mockdata="No FDR executed"}


metadata<-list  (
				"design"		=design,
				"pepmatch_parameters" =data.frame( 
					"elutionthresh"   =as.numeric(elutionthresh),
					"labelthresh"     =as.numeric(labelthresh),
					"labelcountmax"   =as.numeric(labelcountmax),
					"labellightmass"  =as.numeric(labellightmass),
					"labelheavymass"  =labelheavymass,
					"label"           =label,
					"minmolweight"    =as.numeric(minmolweight),
					"quantminlowest"  =as.numeric(quantminlowest),
					"quantminhighest" =as.numeric(quantminhighest)
					)
                )
if(FDR==T)
{
metadata[[length(metadata)+1]]<-FDR_frame
names(metadata)[[length(metadata)]]<-"pepmatch_FDR_matchlist"
metadata[[length(metadata)+1]]<-FDR_summary
names(metadata)[[length(metadata)]]<-"pepmatch_FDR_summary"
FDRdata<-list	(
				"precision"		=FDR_precisionvectorvector,
				"mockdatalist"	=mockdatalist,
				"iterations"	=iterations
				)
metadata[[length(metadata)+1]]<-FDRdata
names(metadata)[[length(metadata)]]<-"pepmatch_FDR_details"				
}

out<-append(matchlistlist,metadata)
class(out)<-"pepmatched"
return(out)
#new("pepmatched",out) ### STILL DON'T UNDERSTAND WHY THIS DOESN'T WORK
}
goat-anti-rabbit/labelpepmatch.R documentation built on May 17, 2019, 7:29 a.m.